Author manuscript; available in PMC: 2023 Jan 1.
Published in final edited form as: Physica A. 2021 Sep 24;585:126442. doi: 10.1016/j.physa.2021.126442

Automatic Quasi-Clique Merger Algorithm – a Hierarchical Clustering Based on Subgraph-Density

Scott Payne 1, Edgar Fuller 2, George Spirou 3, Cun-Quan Zhang 1
PMCID: PMC8562650  NIHMSID: NIHMS1743332  PMID: 34737487

Abstract

The Automatic Quasi-clique Merger (AQCM) algorithm is a new algorithm adapted from early work published under the name QCM (introduced by Ou and Zhang in 2007). The AQCM algorithm performs hierarchical clustering in any data set for which there is an associated similarity measure quantifying the similarity of any two data points i and j. Importantly, the method exhibits two valuable performance properties: 1) the ability to automatically return either a larger or smaller number of clusters depending on the inherent properties of the data rather than on a parameter, and 2) the ability to return a very large number of relatively small clusters automatically when such clusters are reasonably well defined in a data set. In this work we present the general idea of a quasi-clique agglomerative approach, provide the full details of the mathematical steps of the AQCM algorithm, and explain some of the motivation behind the new methodology. The main achievement of the new methodology is that the agglomerative process now unfolds adaptively according to the inherent structure unique to a given data set, and this happens without the time-costly parameter adjustment that drove the previous QCM algorithm. For this reason we call the new algorithm automatic. We provide a demonstration of the algorithm’s performance at the task of community detection in a social media network of 22,900 nodes.

Keywords: Clustering, Agglomerative clustering, Community detection, Graph density, QCM, Facebook, Social network

1. Introduction

The task of finding a “natural” or automatically determined number of clusters in an arbitrary data set has long been known to be impossible in terms of provable optimality [8]. That is, it is known that there cannot be one mathematical definition of optimality and one algorithm that finds the true optimal solution for all clustering tasks and data sets. Nevertheless, in both science and industrial data mining applications it is often necessary to discover, within a data set, clusters that have some measurable degree of quality in terms of their distinctness and the way they characterize the data according to some aspect prescribed by the application [25]. Many methods have been studied over decades, and our work does not aim to present a survey. In our treatment of the background of this field we therefore mention only a few examples of the main approaches to automatic clustering.

1.1. AQCM

The automatic quasi-clique merger (AQCM) algorithm is a rework of the earlier quasi-clique merger (QCM) algorithm [13, 12, 26, 20]. An explanation of AQCM subroutines and illustration of their implementation is given in Section 3. Here we begin with a basic description. AQCM is an agglomerative hierarchical clustering approach that takes as input an n × n similarity matrix S (or equivalently, a list of $\binom{n}{2}$ values representing the similarities for each pair of data points 1). In this work the definition of a similarity matrix S (or the $\binom{n}{2}$ values) is understood to be any function $S : X \times X \to \mathbb{R}^{+}$, where X is the data set and $\mathbb{R}^{+}$ is the set of non-negative real numbers, with S[i, j] = S[j, i] for all i, j. Additional assumptions on the nature of similarity data input to AQCM are discussed in Subsection 1.3. For a detailed study of similarity measures in clustering see [22]. The objective of AQCM’s algorithm design is to allow clusters to grow locally from well chosen pairs of data points (called seeds) known to be members of the same cluster. Seed selection and cluster growth adapt to local properties in the similarity data so that data points may be in more than one seed pair when appropriate and clusters may grow to a size suitable for preserving relative local similarity density. Hence, the algorithm supports multi-membership (sometimes referred to as fuzzy clustering) and clustering where cluster sizes vary widely (some small clusters and some large clusters) 2. The process repeats iteratively, each time treating the set of grown clusters as a smaller data set representing a higher level in the hierarchical scheme. The algorithm is finished running when there is only one cluster encompassing the data. While agglomerative methods are common in hierarchical clustering, the unique design of the subroutines in AQCM drives its adaptive properties and relatively efficient running time. These characteristics and the utility we have observed in our experimentation lend evidence that the algorithm can prove an important tool for obtaining automatic clustering in large and complex scientific data sets.
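The input definition above can be made concrete with a small check. The following is a minimal sketch, not part of the published implementation: the function names `is_valid_similarity` and `to_pair_list` and the toy matrix are illustrative assumptions. It verifies that a matrix is square, symmetric, and non-negative, and shows the equivalent list of one value per unordered pair.

```python
from itertools import combinations

def is_valid_similarity(S, tol=1e-12):
    """Check that S is square, symmetric, and non-negative."""
    n = len(S)
    if any(len(row) != n for row in S):
        return False
    for i in range(n):
        for j in range(n):
            if S[i][j] < 0 or abs(S[i][j] - S[j][i]) > tol:
                return False
    return True

def to_pair_list(S):
    """Flatten the upper triangle into (i, j, similarity) triples,
    one per unordered pair of data points."""
    return [(i, j, S[i][j]) for i, j in combinations(range(len(S)), 2)]

# Toy 3-point example (values are made up for illustration).
S = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
pairs = to_pair_list(S)
```

For n = 3 the pair list has three entries, matching the binomial count of unordered pairs. The diagonal is not used by the pair list, so its values are left to the caller.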

The AQCM method not only produces a generalized hierarchy tree (see Subsection 1.4.3 for definition), but also a clustering output automatically determined by an optimization method drawing on the properties of the hierarchy tree as a graph (network). The optimization strategy was originally presented in [20], where density drop was defined as the drop in local cluster density moving up successive levels along the branches of the hierarchy tree. In that work [20], Qi et al. proposed that an optimal clustering can be defined as the “cut” 3 across the tree featuring the most dramatic drop in cluster density. The advantage of the strategy is that tools from network science, such as max flow/min cut, provide optimization solutions for such a cut, and the resulting cut is not restricted to one level of the tree, thus allowing it to adapt to differences in local topological properties of the data along different branches of the tree. New in the work presented here is that we employ a recently proven mathematical optimization [16, 17] for the selection of such a cut. The new method was designed to increase the flexibility to discover a potentially large number of clusters in a data set. We have found the method to perform as intended, and it appears to be robust in its ability to return either a small or large number of clusters depending on the inherent structure of the input data.

To speak to the intended use of the AQCM approach it is helpful to mention common uses of hierarchical clustering. In some hierarchical methods, the generated hierarchy tree is intended simply as a source for several “resolution levels” of clustering from which the user should choose one clustering output by some target optimization function. The term resolution refers to the granularity of clusters at a given hierarchical level: many small clusters are fine grain (i.e. high resolution), while fewer, larger clusters at an upper hierarchy are coarse grain (i.e. lower resolution). Basic divisive (top-down), binary hierarchical methods are often used as such, while the Louvain method of community detection [1] is a more complex example of that use. In contrast, other methods aim to yield a hierarchy tree which characterizes complex systems of topological hierarchies inherent in a data set. Such hierarchical use is often described as “multi-resolution”, where the purpose is to study the relationships between levels and the way they might relate to other metadata in a scientifically relevant way. The method of Jeub et al. [10] is an effective example of this type of use in network data analysis.

In the context of the strategies mentioned above we can suggest that the role for the AQCM algorithm is similar to the latter example while having the benefit of an optimal clustering as in the former. Although one could potentially use the automatically chosen clustering within a larger industrial data mining pipeline, AQCM might be best used in exploratory scientific data mining where not much is known about a given data set but complex hidden patterns of clustering are thought to exist. By visually and/or computationally interpreting the hierarchy tree output by AQCM a user could discover various clustering patterns that may correlate with other known metadata. Such an investigative approach often plays an important role in biology, for example, as described in [25]. We have found AQCM to perform very well in this role in a number of data types and scientific investigations. In Section 5 we provide a demonstration of the algorithm’s performance at the task of community detection in a social media network of 22,900 nodes. The algorithm’s ability to respond automatically to inherent local properties in a data set while at the same time maintaining a relatively short runtime makes it a powerful tool in applications where data size and complexity both present large challenges.

1.1.1. Contributions of AQCM

In this subsection, some novel contributions and special features are briefly summarized.

  1. Free of parameter tuning: there is no threshold in the algorithm for selection (this avoids time-consuming testing) (Section 3).

  2. A fine grain output (a large number of clusters) is produced (when naturally occurring in a data set) by each loop of iteration of AQCM (Section 3) and Automatic Community Selection (Section 4; a selection based on edge-cuts with the minimum average weight).

  3. A multi-level output with hierarchical structure is produced by repeating loops of iterations (see Figure 3 in Section 5), or simply by interpreting the hierarchy tree output by running AQCM once.

  4. The dendrogram constructed from AQCM is density monotonic (Section 3) in general (unless the data set has some extreme structure, see discussion in Subsection 1.3).

  5. Because of the density monotonicity in the hierarchical dendrogram, the automatic community selection based on density drop becomes feasible (Sections 3 and 4).

1.1.2. Software

Implementations of the main algorithms and post processing subroutines featured in this work are provided here [14]. We chose MATLAB as the implementation platform due to its optimized performance for matrix operations. Many of the mathematical procedures in the subroutines of AQCM can be efficiently expressed as matrix operations. We found that MATLAB’s combination of runtime speed and ease of dynamic memory implementation made it an ideal choice for the scalable scientific work for which we envision that AQCM is best suited.

1.2. Automatic clustering, a short background

The naive approach for automatic determination of clusters would be to run the k-means algorithm over many different choices of cluster number (k) and then choose the clustering output with the highest k-means quality function score. In many applications, however, this strategy is infeasible for a myriad of reasons, the most obvious being time constraints and the fact that we must presume that the data is best represented as a set of n-dimensional points (many data sets are not represented as such). Moreover, the situation becomes intractable in cases where a data set is known to present some large but unknown number of clusters, rendering futile the job of choosing the range of k values.

A pivotal work in the development of automatic clustering techniques was the affinity propagation algorithm [6], or AP for short. The algorithm provides an excellent example of modern strategies that harness the power of Bayesian-like networks and allow a dynamic process to gravitate to an equilibrium. Specifically, AP relies on an adapted version of belief propagation [18] that Frey and Dueck refer to as “loopy belief”. AP is in a class of strategies that rely on the notion that a data cluster has a center, often referred to as a medoid. Also key to AP is that the input to the algorithm is not the data itself but rather a similarity matrix, which must be an n × n real valued matrix where the ij entry scores how similar data points i and j are.

A different approach for the determination of clusters by medoids is to define cluster centers as data points with an unusually high number of “nearby points” as determined by some measure of local scale versus global scale distance. For data sets from metric spaces, the algorithm proposed in [21] (also see [3]) is able to recognize non-spherical clusters and is therefore a significant improvement over k-means clustering. Additionally, the method in [21] allows for the number of clusters to emerge naturally, although user interaction and threshold choice are required.

Another class of clustering strategies relies only indirectly on the notion of a medoid and treats clustering as the problem of fitting a finite mixture model to the input data set. Many algorithms exist but a particularly effective method found here [5] builds upon various strategies that optimize log-likelihood and can be adapted to various statistical models (Gaussian being the default).

A final example category of clustering strategies is found in the graph theoretic task of community detection. Any similarity matrix representing a data set can be thought of as a weighted graph (network) where each vertex (node) represents a data point. Communities are groups of vertices within which the edge weights (similarities) are relatively high, while edge weights between communities tend to be lower. So community detection is equivalent to clustering when an appropriate similarity matrix exists. There is a large literature on community detection. The main early example is found in the work of Girvan and Newman [7]. The methods proposed by Girvan and Newman are driven by the optimization of a target function called modularity. A key advancement in modularity optimization techniques was the Louvain method [1], which allowed for very large data sets to be processed in feasible time on an appropriate system. A hierarchical technique that generalizes the Louvain method is found in [10] and a very recent modularity generalization algorithm is found in [23]. Importantly, the latter work provides theorems proving that the quality of the clusters (according to a user chosen target function and resolution parameter) is optimized by further iteration.

As mentioned above, the review of methods provided here is only intended to illustrate some of the existing strategies. We have focused on strategies that aim to produce clusters where the number suits the unique properties of the input data set in some “natural” way without the requirement that the user have foreknowledge of the “correct” number of clusters to find. It is important to take a moment to note, however, that in many cases a data set may have multiple equally valid solutions to the clustering task or none (for discussions see [9] and [19]). Furthermore some clustering tasks, such as those designed to divide data for distributed computing, are such that the user knows the desired number of clusters ahead of time. Hence, in the next section we clarify the assumptions we make in this work regarding the nature of clustering tasks we would aim to perform and for which the algorithms discussed above are intended.

1.3. The clustering task

In this work we assume that a given clustering task on a data set has the following characteristics:

  1. The data set has distinct structures in that there are subsets of the data within which the points tend to be more similar to each other and less similar to points in other subsets. The difficulty of such a clustering task would be determined by the range of similarities inside subsets and the range of similarities between subsets. The more that the between range overlaps the inside range, the more difficult it is to determine whether a pair of points are in the same cluster.

  2. If the data are points in n-dimensional space we assume that the points are distributed such that there are some number of regions of higher point density separated by regions of lower density. The difficulty of such a clustering task would be determined by the distinctness of the dense regions against the space between regions and also the shape of the regions (which need not be symmetrical or spherical). For example the moon and sun problem and the target problem are construction methods that create data sets which will confound most modern clustering algorithms.
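The notion of difficulty in item 1 can be illustrated with a short calculation. The sketch below is an assumption-laden illustration, not a procedure from this work: the helper names `similarity_ranges` and `ranges_overlap`, the toy matrix, and the planted labels are all hypothetical. It computes the range of similarities inside planted subsets and the range between subsets, then reports whether the two ranges overlap.

```python
def similarity_ranges(S, labels):
    """Return ((min, max) of within-subset similarities,
               (min, max) of between-subset similarities)."""
    within, between = [], []
    n = len(S)
    for i in range(n):
        for j in range(i + 1, n):
            (within if labels[i] == labels[j] else between).append(S[i][j])
    return (min(within), max(within)), (min(between), max(between))

def ranges_overlap(within, between):
    """True if the largest between-subset similarity reaches the within range."""
    return between[1] >= within[0]

# Toy data: points 0,1 form one planted subset, points 2,3 another.
S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
labels = [0, 0, 1, 1]
within, between = similarity_ranges(S, labels)
easy_task = not ranges_overlap(within, between)
```

In this toy case the within range is 0.8 to 0.9 and the between range is 0.1 to 0.3, so the ranges are disjoint and the task is easy in the sense described above. Raising any between-subset value above 0.8 would create overlap.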

1.4. Notation and definitions

1.4.1. AQCM and graph theory

Many of the subroutines in the AQCM algorithm are adaptations or redesigns of subroutines from the QCM algorithm, and the general structure of each iteration of AQCM is the same as in QCM. Above we have described AQCM as a clustering algorithm that takes as input an n × n similarity matrix S, a common type of input structure in data mining [22] (see Subsection 1.1 for precise definition). It is important to note that the original publications of the QCM algorithm [13, 12, 26, 20] presented the quasi-clique merger method as an algorithm for performing hierarchical “community detection”, a common data mining task in graph (network) data analysis. In that work, the input to QCM was a weighted graph (weighted network) and a community was defined as an “edge dense” set of vertices (nodes) where edge weights tend to be higher, edges between communities were expected to be more sparsely distributed and/or lower weight. The objective of the QCM algorithm and the type of input graph for which it was designed are equivalent to the idea of clustering in similarity data as commonly discussed in data mining today [22]. Thus far we have introduced the idea of AQCM using the language of similarity-based clustering due to its useful and accurate level of generality in framing the tasks for which AQCM may be used. But relevant to note is that the QCM algorithm itself is not able to operate on an unweighted graph, that is, the input must have weights associated with the edges which serve the same role as that of a similarity function. Hence, the description of the quasi-clique merger process in general might be most accurately described as a process that operates on similarity data.

On the other hand, while the mathematical steps in AQCM (and QCM) can be accurately described using the abstract notions of data points and similarity, many of the steps can actually be more easily discussed using their equivalent formulations in the language of graph theory. The graph theoretic concept of an edge, sometimes also referred to as a link, is a linguistically simple way to refer to the relationship between two data points, which in graph theoretic terms would be referred to as vertices, or sometimes nodes. Therefore, in the work presented here we may at times describe processes in terms of vertices and edges instead of data points and similarities, and input to AQCM may at times be described as a weighted graph instead of similarity values or a similarity matrix.

1.4.2. Graph theoretic terminology

For graph theoretic terminology and notation we follow [2, 24, 4] which are standard. A graph G(V, E), or G for shorthand, is defined by a vertex set V (sometimes notated V(G)) and an edge set E (sometimes notated E(G)) where an edge $e \in E$ is a connection between some vertex $u \in V$ and some other vertex $v \in V$ so that e might also be notated uv. It is common to notate the size of the vertex set |V| = n and the size of the edge set |E| = m. If G is a weighted graph then $w : E \to \mathbb{R}^{+}$ is the function assigning weights to edges and the weight of edge e is notated w(e). Some authors assume that for a weighted graph all weights are non-zero or equivalently that w(uv) = 0 implies $uv \notin E$. In this work we will address such cases as necessary. If G is a directed graph (digraph) then each edge uv has the direction u to v where v is referred to as the head of the edge and u is the tail, and between two vertices u and v both the directed edges uv and vu might exist. Some authors use the term arc for a directed edge; however, here we will simply use the term edge and when the edge has a direction we will make that clear in context. Some additional technical graph-theoretic terminology is provided in Subsection 6.1.

As discussed in Subsection 1.4.1, parts of AQCM are sometimes more easily described using a graph theoretic language. Here we discuss notation. Let X be a data set and S be a similarity function $S : X \times X \to \mathbb{R}^{+}$ as defined in Subsection 1.1. Then S is equivalent to a weighted graph G(V, E) with weight function $w : E \to \mathbb{R}^{+}$, where vertex $v_i \in V$ is associated with data point $x_i \in X$ and for a pair of vertices $v_i, v_j$ the edge weight $w(v_i v_j)$ is equal to the similarity value $S(x_i, x_j)$. Clearly the discrimination between $w(v_i v_j)$ and $S(x_i, x_j)$ is purely semantic, so henceforth, when we adopt a graph theoretic language of algorithm description we will use the notation S and refer to it as a weighted graph and refer to $S(x_i, x_j)$ as the edge weight for the edge $x_i x_j$.
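The equivalence between a similarity function and a weighted graph can be sketched in a few lines. This is an illustrative sketch only: the function name `similarity_to_graph`, the adjacency-dictionary representation, and the `drop_zero` switch are assumptions introduced here. The switch reflects the convention mentioned in this subsection that some authors treat a zero weight as the absence of an edge.

```python
def similarity_to_graph(S, drop_zero=True):
    """Build an undirected weighted graph as {vertex: {neighbor: weight}}
    from a symmetric similarity matrix S."""
    n = len(S)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            w = S[i][j]
            if drop_zero and w == 0:
                continue  # convention: zero similarity means no edge
            graph[i][j] = w
            graph[j][i] = w
    return graph

# Toy similarity matrix with one zero entry (values are illustrative).
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.3],
    [0.0, 0.3, 1.0],
]
graph = similarity_to_graph(S)
graph_with_zeros = similarity_to_graph(S, drop_zero=False)
```

With `drop_zero=True` vertices 0 and 2 are not adjacent; with `drop_zero=False` the edge is kept with weight 0. Either choice carries the same information as S.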

1.4.3. Input and output structures of AQCM

The input to AQCM is an n × n similarity matrix S (or equivalently, a list of $\binom{n}{2}$ values representing the similarities for each pair of data points) where n is the number of data points in some data set X. The definition we adopt for the terminology similarity was given in Subsection 1.1 and further assumptions on the nature of similarity data input to AQCM are discussed in Subsection 1.3.

The output of a hierarchical clustering algorithm is typically represented by a dendrogram which is simply a diagram of a hierarchy tree. A precise graph theoretic definition of the term tree is given in appendix 6.1. A graph that is a tree is typically given the notation T instead of G. The rooted tree (see appendix 6.1 for precise definition) is the proper graph theoretic object that characterizes the concept of a hierarchy tree. The root represents the data as one set, the leaves represent the data as individuals, and nodes along a path from the root to a leaf represent cluster assignment for that leaf at various hierarchical levels. In a dendrogram, the depiction is typically oriented with the root at the top of the diagram and the leaves (representing individual data points) at the bottom. In data science, hierarchical relationships are often described using the terminology parent and child. In a hierarchy tree T, a parent node $v \in V(T)$ represents a cluster at some level (v is not a leaf node). For a node $u \in V(T)$ such that there is an edge $vu \in E(T)$ we say u is a child node of the parent node v. The node u represents a sub-cluster (of the parent cluster represented by v) at the next level down in the hierarchical scheme (or an individual data point if the parent is a level 1 cluster).

As mentioned in Subsection 1.1, the hierarchical information output by AQCM supports multimembership. That is, when appropriate, a data point (or cluster at some level in the hierarchy) could be a member of more than one cluster. In parent/child terminology we say a multimember child node has more than one parent node. Hence the hierarchy “tree” output by AQCM may not be a true tree according to the graph theoretic definition tree (see appendix 6.1). For simplicity of language in this work we will nonetheless refer to the hierarchical output produced by AQCM as a tree since we have clarified here the context and the idea of a generalized hierarchy tree (mentioned in Subsection 1.1) that allows for multimembership.
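A generalized hierarchy tree with multimembership can be represented as a parent-to-children map in which a child may appear under more than one parent. The sketch below is a hypothetical illustration: the node names, the dictionary layout, and the helpers `parents_of` and `leaves_under` are assumptions and do not describe the data structures of the published software.

```python
# children[node] lists the child nodes of a parent; leaves do not appear as keys.
children = {
    "root": ["A", "B"],
    "A": ["x1", "x2", "x3"],
    "B": ["x3", "x4"],  # x3 is a multimember child: its parents are A and B
}

def parents_of(children):
    """Invert the parent -> children map into child -> list of parents."""
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return parents

def leaves_under(node, children):
    """Return the set of data points (leaves) reachable from node."""
    if node not in children:
        return {node}
    found = set()
    for kid in children[node]:
        found |= leaves_under(kid, children)
    return found

parents = parents_of(children)
multimembers = sorted(k for k, p in parents.items() if len(p) > 1)
cluster_A = leaves_under("A", children)
```

Here `multimembers` contains only `x3`, which is why the structure is not a tree in the strict graph theoretic sense. Using sets for the leaves means a multimember point is counted once per cluster even if several paths reach it.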

Based on the above we should mention that, in our method, the automatically chosen clustering selected from the hierarchy tree is not necessarily a partition of the data set (although it could be) but is more generally a family of subsets of the data notated $\mathcal{C} = \{C_1, \ldots, C_k\}$ where $C_i \subseteq X$ is the ith cluster. Also, the clustering output $\mathcal{C}$ may or may not form a covering of the data set as some outlier points might not be assigned to any cluster. We will refer to such unassigned points as singletons or simply unclustered data. Examples are seen in Section 2.

A final aspect of our method that we will clarify here is the relationship between automatic cluster selection and the graph theoretic notion of an optimal edge cut. The terminology edge cut is defined precisely in appendix 6.1, but here we can describe its relevance in a hierarchy tree as a rooted tree 4. Consider that every cluster C defined within the structure of hierarchy tree T is represented by a child node of some parent node (possibly the root). Hence, for any clustering output $\mathcal{C} = \{C_1, \ldots, C_k\}$ selected from the tree T, there is an associated set of directed edges $H = \{e_1, \ldots, e_k\} \subset E(T)$ where $e_i = v_i u_i$ is associated with cluster $C_i$ through its representation by child node $u_i$. The set of edges H is called an edge cut because its removal separates the tree T into components. The selection of optimal edge cuts is a common topic in the theory of graphs and networks. Hence the selection of an optimal clustering $\mathcal{C} = \{C_1, \ldots, C_k\}$ from hierarchy tree T may be approached by the selection of an optimal edge cut $H = \{e_1, \ldots, e_k\} \subset E(T)$ provided that a suitable method exists for encoding, into the edges of T, information about the various clustering outputs available within T. We present our implementation of this approach in Section 4.
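The correspondence between an edge cut and a clustering can be sketched as follows. This is a minimal illustration under stated assumptions: the toy hierarchy, the node names, and the helper `clustering_from_cut` are hypothetical, and the cut H is chosen by hand rather than by any optimization. Each directed edge (v, u) in H contributes the cluster formed by the leaves beneath the child node u, and any data point not covered by a cluster is reported as a singleton.

```python
# Toy generalized hierarchy: parent -> children; leaves do not appear as keys.
children = {
    "root": ["P", "Q", "x6"],
    "P": ["A", "B"],
    "A": ["x1", "x2"],
    "B": ["x2", "x3"],  # x2 belongs to both A and B (multimembership)
    "Q": ["x4", "x5"],
}

def leaves_under(node, children):
    """Return the set of data points (leaves) reachable from node."""
    if node not in children:
        return {node}
    found = set()
    for kid in children[node]:
        found |= leaves_under(kid, children)
    return found

def clustering_from_cut(H, children):
    """Map each cut edge (parent, child) to the cluster represented by the child."""
    return [leaves_under(child, children) for (_parent, child) in H]

# A hand-picked cut that crosses the tree at two different levels.
H = [("P", "A"), ("P", "B"), ("root", "Q")]
clusters = clustering_from_cut(H, children)
all_points = leaves_under("root", children)
singletons = all_points - set().union(*clusters)
is_partition = (sum(len(c) for c in clusters) == len(set().union(*clusters))
                and not singletons)
```

In this example the clustering is neither a partition (point `x2` lies in two clusters) nor a covering (point `x6` is left unclustered), which matches the general form of output described above.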

2. An Illustrated Introduction to AQCM

In this section we demonstrate the use of the AQCM algorithm to perform unsupervised cluster analysis on a basic abstract data set. The set was designed to exemplify the type of clustering tasks described in Subsection 1.3. 3D renderings of the data points are seen in Figures 1a,b,c. By mapping the data into the cube whose axes each span the interval [0,1], it can be interpreted as RGB color data (as illustrated in Figure 1a). The colors allow us to intuitively understand the inherent hierarchies and relationships between clusters.

Figure 1:

a. 3-dimensional (RGB) data seen in three views. RGB axes are indicated by their color, each data point is colored by its RGB value, and each axis range is [0,1]. b. The three views match Fig. 1a. The data are colored to distinguish the clusters chosen by AQCM and match the indicator colors in the dendrogram seen in Fig. 1e; data colored gray are “unclustered/outliers”. c. As in Fig. 1b, but here the colors distinguish the clusters obtained using the Gaussian finite mixture model fitting algorithm described in [5]; we may compare and contrast this clustering with that obtained by AQCM. d. The dendrogram output by the AQCM algorithm. The layout of the dendrogram was computed using MATLAB’s “layered layout” implementation for directed graphs. The colors of the lines indicate the similarity density drop between hierarchical levels according to the function described in Section 4, with light blue being more extreme, dark blue moderate, and purple least extreme. The colored dots indicate the automatically chosen clustering, and their colors are the mean color per cluster. e. The dendrogram as described in Fig. 1d, except that here the cluster indicators are colored to match the illustration in Fig. 1b. Below the dendrogram the similarity data is displayed as organized by the dendrogram (this type of display is often referred to as a “heat map”), with the scale given at the right. The yellow and green open circles show two alternate clusters represented in the AQCM output; these are essentially the same as the yellow and green clusters seen in Fig. 1c (differing only by a few data points).

2.1. The data set

The construction of the data set was as follows: we chose nine 3-dimensional points randomly and centered Gaussian distributions at those points. Covariances were chosen so that samples from the mixture model would generate point clouds with reasonably distinct boundaries as discussed in Subsection 1.3. Samples of random sizes were drawn from each Gaussian in the mixture model so as to generate planted clusters of different sizes (in terms of number of data points). There are 595 data points in total. The construction method is standard for synthetically generating point clouds with clustering properties similar to those often seen in experimentally collected scientific data.
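A construction of this kind can be sketched as follows. This is not the generator used for the data set in this section: the function name `make_planted_clusters`, the uniform choice of centers in the unit cube, the single isotropic standard deviation `sigma`, and the way the random sizes are drawn are all assumptions made for illustration. The actual covariances are not specified here.

```python
import random

def make_planted_clusters(k=9, total=595, sigma=0.04, seed=0):
    """Sample `total` 3-D points from k isotropic Gaussians with random sizes."""
    rng = random.Random(seed)
    # Assumption: centers drawn uniformly from the unit cube.
    centers = [[rng.random() for _ in range(3)] for _ in range(k)]
    # Random sizes: k-1 distinct cut points split `total` into k positive parts.
    cuts = sorted(rng.sample(range(1, total), k - 1))
    sizes = [b - a for a, b in zip([0] + cuts, cuts + [total])]
    points, labels = [], []
    for label, (center, size) in enumerate(zip(centers, sizes)):
        for _ in range(size):
            # Assumption: isotropic covariance sigma^2 * I for every component.
            points.append([rng.gauss(c, sigma) for c in center])
            labels.append(label)
    return points, labels, sizes

points, labels, sizes = make_planted_clusters()
```

Samples near the faces of the cube may fall slightly outside [0,1]; this sketch does not rescale or clip them, so a mapping into the cube would be a separate step.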

2.2. Benchmark algorithm

For a benchmark/comparison we used the finite mixture model fitting algorithm of Figueiredo et al. described in [5] to automatically determine the cluster assignments illustrated by color in Figure 1c. The algorithm discovers 9 clusters whose centers and geometry match the underlying model used to generate the data, so we may think of this clustering as “correct” in that sense. We have tested this algorithm extensively in similar data sets and have found that it appears to return an accurate description of the planted clusters as long as the number of clusters is not too large and the differences in their relative geometries are not too extreme. The algorithm is essentially parameter free for simple data sets with relatively few clusters since the required input range of cluster numbers to try can be fixed to an interval such as [1,25] (that is what we used here). We have found that when the properties of the planted clusters become more extreme (in terms of number, size and geometry) the algorithm tends to produce less accurate results even if a good parameter range is used. We should also note that we have not benchmarked this algorithm’s performance in high dimensional data, so we cannot say whether it is feasible for such data in terms of accuracy or run time.

2.3. AQCM algorithm output

Illustrations of AQCM output are seen in Figures 1b,d,e. Figures 1b,c are color-coded diagrams of the clustering outputs of AQCM and the benchmark algorithm respectively 5.

As AQCM requires a similarity matrix, we defined similarity as the negative of Euclidean distance shifted into the range 0 to 1. The similarity data is seen in Figure 1e as indexed by the hierarchy tree output by AQCM (seen above the matrix).
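One way to realize a “shifted negative of distance” similarity is sketched below. The exact shift and scaling used for the data in this section are not stated, so the formula here is an assumption: each distance is divided by the largest pairwise distance and subtracted from 1, which places the most distant pair at similarity 0 and identical points at similarity 1. The function name `distance_similarity` is hypothetical.

```python
import math

def distance_similarity(points):
    """Return S with S[i][j] = 1 - d(i, j) / d_max, where d is Euclidean distance.

    Assumption: the shift into [0, 1] is done by scaling with the largest
    pairwise distance d_max; other shifts are possible.
    """
    n = len(points)
    D = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in D)
    if d_max == 0:
        # All points coincide, so every pair is maximally similar.
        return [[1.0] * n for _ in range(n)]
    return [[1.0 - D[i][j] / d_max for j in range(n)] for i in range(n)]

# Three collinear points with pairwise distances 1, 3 and 4.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
S = distance_similarity(points)
```

The result is symmetric and non-negative, so it satisfies the definition of a similarity matrix given in Subsection 1.1.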

We begin by observing that the hierarchy tree output by AQCM has, on a high level, three main clusters (Figure 1d shows these): the purple dot indicates the average color of the cluster on the left branch while the right branch splits into two subclusters. Colored dots of further subclusters indicate that the two containing clusters represent a green group and a yellow/orange group with brown/tan/bluegrey clusters having multimembership of both green and yellow/orange. In the visual display of the color data seen in Figure 1a it is certainly clear that splitting the data in two between purple and non-purple data is a valid high level distinction and that the non-purple data can be described as yellow/orange and green with brown/tan/bluegrey in the middle. Hence the general structure of the hierarchy tree captures the inherent nature of the data well.

Using Figures 1b,c,e we can examine the clustering output selected by edge-cut cluster/community detection. The colors in Figures 1b,e illustrate that clustering. We have synchronized our labelling colors for Figures 1b,c so that the AQCM clustering may be directly compared with the benchmark algorithm clustering described above. The yellow/orange group and the green group each contain clusters that match the benchmark algorithm almost perfectly. The differences here arise because AQCM leaves some data unclustered (see the red labelled and blue labelled clusters). This aspect of AQCM is a natural occurrence when a set of densely clustered points has outliers: AQCM may prefer the denser subset since it optimizes for cluster similarity density.

The main difference between the chosen clusters of edge-cut detection and the benchmark algorithm is in the treatment of the green labelled and yellow labelled clusters seen in Figure 1c. As illustrated by the open circles in Figure 1e, AQCM captures these same clusters at a higher hierarchy. Examination of the color data itself in Figure 1a would seem to indicate that the distinctions made by AQCM to split these two clusters further could also be considered valid since they each respectively contain locally dense similarity subsets as seen in Figure 1e. We should remark here that this difference between AQCM output and the benchmark algorithm demonstrates the different interpretations of data clusters given when one algorithm determines structure based on distance and another attempts to fit a Gaussian distribution. Hence the differences between the two outputs are inherent in the different definitions of the feature space of the data. Overall, the color data example demonstrates AQCM’s ability to function well even within a feature space that presents challenges in terms of fuzzy clusters and regions of varying data density. One might expect that if the similarity function on the data were able to reflect the planted Gaussian clusters even more clearly then the clustering detected by the edge-cut algorithm might even more precisely match that of the benchmark algorithm. In general, the description of the data set provided by the hierarchy tree output does provide a robust multi-level interpretation of the planted structures as we might naturally see them with our human ability to perceive groups and dense point clouds.

3. AQCM Algorithm and Subroutines

In Subsection 3.1 we provide a basic description of each of the subroutines of AQCM so as to facilitate an easier understanding of their objectives and mathematical strategies. An illustration of the process is seen in Figure 2. Full mathematical details are given in technical notation in Subsection 3.2. As discussed in Subsection 1.4.1, parts of AQCM are sometimes more easily described using a graph theoretic language. Here we adopt that style to describe the subroutines and their unfolding. See Subsection 1.4.2 for clarification of the relationship between our notation and graph theoretic language.

Figure 2:


An illustration of each subroutine’s output as AQCM runs on a simple data set with planted clusters of 2D points (seen in the top left box). The similarity function we used was the “shifted negative of distance” as described in Subsection 2.3. Each row of the diagram shows an iteration (or hierarchical level) running through the processes of seed selection, growth, and adjustment. For each subroutine, its column in the diagram shows the relationship between the ways the algorithm views the data at different hierarchical levels. In column 1, iterations 2 and 3 show the contracted graph that will be input to seed selection; dot size reflects cluster size, dot color is associated with steps in the previous iteration, and edge weights are shown to two decimal places. In illustrations of the growth step, the seeds from which the illustrated clusters grew are seen as colored edges. For this example data set, the 3rd iteration growth step resulted in one cluster covering the data, so the algorithm was finished at that step. The bottom right shows the dendrogram of the hierarchical clustering structure determined by the states of clustering at the end of each iteration; colored node indicators show the relations to the illustrated subroutines.

3.1. AQCM algorithm structure

AQCM has four subroutines: seed selection, growth step, adjustment step, contraction step. The AQCM algorithm iterates by running the subroutines in that order, each outputting objects for input to the next subroutine. The contraction step creates the input graph for the seed selection of the next iteration.

Input: $S: X \times X \to \mathbb{R}^{+}$, a similarity function on a data set $X$, equivalently, a weighted graph on a vertex set $X$ where $S(x_i, x_j)$ is the edge weight between vertices $x_i$ and $x_j$. In the first iteration of AQCM this is the current graph, though in the notation below we use the letter $V$ instead of $X$ in order to generalize the language for all iterations.
Output: $T$, a generalized hierarchy tree (see Subsection 1.4.3 for terminology).
Seed Selection: Select a subset $E_{seeds} \subseteq E$ ($E$ is the edge set of the current graph on a vertex set $V$) such that for $e = v_i v_j \in E_{seeds}$, $v_i$ and $v_j$ are likely to be in the same cluster. This likelihood is ensured by estimations of locally close vertices made at each vertex: a seed edge $e = v_i v_j \in E_{seeds}$ is an edge such that in the estimations at vertex $v_i$ and vertex $v_j$, each designated the other as a locally close vertex.
Growth Step: Grow a set of clusters $\mathcal{C} = \{C_1, \dots, C_k\}$ by optimally adding vertices of $V$ to individual seed edges of $E_{seeds}$. Growth is controlled locally in that for a given cluster $C$ growing from a seed $e \in E_{seeds}$, $C$ stops growing when its edge weight density falls below a threshold developed dynamically from the cluster’s growth properties. The process unfolds linearly starting with the highest weight seed edge in $E_{seeds}$. As a cluster $C$ grows covering other seed edges, those edges are removed from $E_{seeds}$, eliminating unnecessary processing.
Adjustment Step: Adjust the clustering $\mathcal{C} = \{C_1, \dots, C_k\}$ when there are subsets of $\mathcal{C}$ with relatively large vertex overlap. Such subsets are possible since the seed/growth process may approximate a potential cluster several slightly different ways. In order to ensure that the most locally dense possible cluster is not overtaken by a less optimally dense approximation, the first iteration uses a different adjustment method than subsequent iterations.
First Iteration: working through $\{C_1, \dots, C_k\}$ in order prioritizing whichever cluster $C_i$ is most edge weight dense, remove any other cluster that overlaps $C_i$ too much. Processing continues until degenerate overlaps are resolved.
Subsequent Iterations: working through {C1, …, Ck} in order prioritizing whichever cluster Ci is most edgeweight dense, merge into Ci any other cluster that overlaps Ci too much. When more than one merge is possible, choose the merge resulting in the highest edgeweight density. Processing continues until degenerate overlaps are resolved. Note that for iterations two and higher, merge decisions are made by interpreting clusters as subsets of the data set X as opposed to subsets of contracted graph vertices (see Figure 2 second row last column).
Contraction Step: Create a weighted graph representing the similarities between each pair of clusters Ci and Cj. As in the adjustment step, clusters are interpreted as subsets of the data set X as opposed to subsets of contracted graph vertices. Each cluster Ci is represented as a vertex vi in the new graph, the edge weight between vertices vi and vj is the mean of all edgeweights of edges (of the graph on X) between Ci and Cj. The weighted graph created by the contraction step becomes the current graph which is the input to the next iteration of AQCM beginning with seed selection.

3.2. Subroutine details

In the following descriptions of the AQCM subroutines we use a more proper graph theoretic notation style. Let $S$ be a weight function $S: E(K_n) \to \mathbb{R}^{+}$ where $E(K_n)$ is the edge set of a complete graph $K_n$ on $n$ vertices with vertex set $V$. We define the following.

Definition 3.1. For a subgraph $Q$ of $K_n$ induced by a vertex subset $C \subseteq V$, we define the density of $Q$ by

$$\mathrm{den}(Q) = \frac{2 \sum_{e \in E(Q)} S(e)}{|C|(|C|-1)}$$

As $Q$ is an induced subgraph of a complete graph, we have $|E(Q)| = \binom{|C|}{2}$. So it is easy to see that the function $\mathrm{den}(Q)$ is simply the mean edgeweight over the subgraph $Q$, or equivalently, the average similarity between pairs of data in the cluster $C$.

Definition 3.2. For a vertex $v \notin V(Q)$, we define the contribution of $v$ to $Q$ by

$$\mathrm{cont}(v, Q) = \frac{\sum_{u \in V(Q)} S(uv)}{|V(Q)|}$$

The contribution cont(v, Q) is the mean of edgeweights between v and the vertices of Q, and as such it is a simple and computationally efficient way to quantify a vertex’s candidacy for joining a growing cluster.
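As a minimal sketch of Definitions 3.1 and 3.2, the two quantities can be computed directly from a symmetric similarity matrix. The function names `den` and `cont`, the use of NumPy, and the small example matrix are assumptions made for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def den(S, C):
    """Mean edge weight of the subgraph induced by the vertex indices C."""
    C = list(C)
    m = len(C)
    if m < 2:
        raise ValueError("density needs at least two vertices")
    block = S[np.ix_(C, C)]
    # Each unordered pair appears exactly once in the strict upper triangle.
    return 2.0 * np.triu(block, k=1).sum() / (m * (m - 1))

def cont(S, v, C):
    """Mean edge weight between an outside vertex v and the members of C."""
    C = list(C)
    if v in C:
        raise ValueError("v is expected to lie outside the cluster")
    return S[v, C].sum() / len(C)

# Hypothetical 4-vertex similarity matrix: vertices 0-2 are mutually similar,
# vertex 3 is weakly similar to all of them.
S = np.array([
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
])
d = den(S, [0, 1, 2])      # mean of 0.9, 0.8, 0.7
c = cont(S, 3, [0, 1, 2])  # mean of 0.1, 0.2, 0.3
```

In this example the density of the triple is 0.8 while the contribution of vertex 3 is 0.2, which is the kind of gap the growth step relies on when deciding whether a vertex should join.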

3.2.1. Seed selection

Seed selection is a new subroutine in AQCM not found in published versions of QCM. Its purpose is to automatically select seeds from which to grow clusters in a way that adapts to the inherent properties of the input data. The intuition driving the seed selection algorithm comes from the assumptions we often make about what distinguishes a cluster within a data set (see Subsection 1.3). We expect that for a data point x which is well within a cluster C (not an outlier), the similarity values of x to other data in C would be in a numerical range that should be somewhat distinct from the numerical range of similarity values between x and the data not in C. In our seed selection algorithm we harness that property by considering, for a data point x, the list of similarity values to other data sorted in decreasing order. Thus we should expect some position in that ordered list where the similarity value abruptly drops as we move from similarities measured within C to those with other data outside C. While this property might not hold for every data point x in C, it is easy to show that our algorithm will produce at least one seed edge inside each cluster, and that it will not choose edges representing a pair of data each in a different cluster, assuming the similarity data has reasonable clusters as described in Subsection 1.3.

Seed selection algorithm:
Input: $K_n$, a complete graph on $n$ vertices, and an edge weight function $S: E(K_n) \to \mathbb{R}^{+}$. When AQCM is in the first iteration, $K_n$ and $S$ are the inputs to AQCM; in subsequent iterations, $K_n$ and $S$ are objects created by the previous iteration's contraction step.
Output: the seed edge set $E_{seeds} \subseteq E(K_n) = \{e_1, \dots, e_m\}$. For use in the growth step, the set $E_{seeds}$ should be output as an ordered set $\{e_{r_1}, \dots, e_{r_k}\}$ with $\{r_1, \dots, r_k\} \subset \{1, \dots, m\}$ and $S(e_{r_i}) \ge S(e_{r_{i+1}})$ for all $i$.
Step 1.
for each vertex $v \in V(K_n)$ do:
  order vertices decreasing by edgeweight with $v$:
  $\{u_1, \dots, u_{n-1}\}$ is the set of vertices not including $v$
  $\{u_{q_1}, \dots, u_{q_{n-1}}\}$ is an ordering with $S(v u_{q_i}) \ge S(v u_{q_{i+1}})$ for all $i$
  compute a sequence of differences:
  $\{a_i\}_{i=1}^{n-2} \leftarrow \{S(v u_{q_i}) - S(v u_{q_{i+1}})\}_{i=1}^{n-2}$
  compute the median of that set (treated as a statistical sample):
  $M_a \leftarrow \mathrm{median}(\{a_i \mid i \in \{1, \dots, n-2\}\})$
  find the first significant edgeweight drop:
  $t \leftarrow \min(\{i \mid a_i \ge M_a\})$
  store a list of locally close vertices for v:
  $L_v \leftarrow \{u_{q_1}, \dots, u_{q_t}\}$
Step 2.
 select seed edge set:
  $E_{seeds} \leftarrow \{uv \in E(K_n) \mid u \in L_v \text{ and } v \in L_u\}$
Step 3.
 sort the seed edge set Eseeds decreasing order by edgeweight
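The three steps above can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' code: the function names `locally_close` and `select_seeds` are hypothetical, the first significant drop is taken to be the first difference that is at least as large as the median (an assumed reading of Step 1), and the toy data are seven points on a line with a shifted negative distance as similarity.

```python
import numpy as np

def locally_close(S, v):
    """Return L_v: the vertices ranked before the first significant weight drop."""
    n = S.shape[0]
    # Vertices other than v, ordered by decreasing edge weight to v
    # (Python's sort is stable, so ties keep index order).
    order = sorted((u for u in range(n) if u != v), key=lambda u: -S[v, u])
    weights = np.array([S[v, u] for u in order])
    drops = weights[:-1] - weights[1:]          # a_1, ..., a_{n-2}
    if drops.size == 0:                         # fewer than three vertices
        return set(order)
    threshold = np.median(drops)                # M_a
    # Assumed reading: first position whose drop is at least the median.
    t = int(np.argmax(drops >= threshold))      # zero-based position of that drop
    return set(order[: t + 1])

def select_seeds(S):
    """Mutually locally close pairs, sorted by decreasing edge weight."""
    n = S.shape[0]
    close = [locally_close(S, v) for v in range(n)]
    seeds = [(u, v) for u in range(n) for v in range(u + 1, n)
             if u in close[v] and v in close[u]]
    seeds.sort(key=lambda e: -S[e[0], e[1]])
    return seeds

# Toy data (hypothetical): two planted groups of points on a line.
positions = np.array([0, 1, 2, 3, 100, 120, 140], dtype=float)
S = 200.0 - np.abs(positions[:, None] - positions[None, :])
seeds = select_seeds(S)
```

On this toy input every seed edge joins two points of the same planted group, and no seed crosses between the groups, which is the behaviour the paragraph above describes for data with reasonable clusters.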

3.2.2. Growth step

The growth subroutine in this work follows the growth principles introduced in previous publications of QCM with the exception of a new parameterized method we introduce via the parameter τ. The method resolves the issue that occurs when more than one vertex is a numerically valid candidate to join a growing cluster C. Specifically, the parameter τ is intended as a sort of numerical padding for the case when similarity calculations may have slight differences between optional joins, but those differences may be a “digital artifact” of the similarity matrix calculation. That is, the optimal vertex to join may have average similarity to C greater by an insignificant margin than other possible vertices. τ allows any vertices within an acceptable threshold of optimal to join at the same time. This new method allows for correct processing of data where symmetries exist such that some groups of data should be treated equally as opposed to being arranged in a linear order of priority. Finally, we should also mention that τ is not intended as a tuning parameter, but rather as a user determined value. We have found that τ = 0.008 has performed well in all data sets tested for similarity functions where the range is in the interval [0, 1].

Growth step algorithm:
Input: $E_{seeds} = \{e_{r_1}, \dots, e_{r_k}\}$ with $\{r_1, \dots, r_k\} \subset \{1, \dots, m\}$, and $S(e_{r_i}) \ge S(e_{r_{i+1}})$ for all $i$. $K_n$, a complete graph on $n$ vertices, and an edgeweight function $S: E(K_n) \to \mathbb{R}^{+}$. When AQCM is in the first iteration, $K_n$ and $S$ are the inputs to AQCM; in subsequent iterations, $K_n$ and $S$ are objects created by the previous iteration's contraction step. For notation below we abbreviate $V(K_n)$ as just $V$.
Output: $\mathcal{C} = \{C_1, \dots, C_k\}$, a family of subsets $C_i \subseteq V(K_n)$ where each $C_i$ is an approximation of some cluster (in the current graph) of relatively high edgeweight density as in definition 3.1
Initialize: $\mathcal{C} \leftarrow \{\}$
τ ← 0.008
Step 1. If Eseeds = {} exit growth step algo, Else do:
 start a new cluster from the maximum weight seed:
C ← {u, v} where uv = e is the first seed in the ordered list Eseeds
Step 2. try to grow cluster C:
  compute α value:
  $\alpha \leftarrow 1 - \frac{1}{2(|C|+1)}$
  compute den(Q) where Q is the subgraph of Kn induced by C:
  see definition 3.1
  compute maximum contribution:
  see definition 3.2
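  $q \leftarrow \max(\{\mathrm{cont}(v, Q) \mid v \in V \setminus C\})$
  If $q \ge \alpha \cdot \mathrm{den}(Q)$ do:
   find eligible vertices to join C:
   $J \leftarrow \{v \in V \setminus C \mid \mathrm{cont}(v, Q) > q - \tau\}$
   expand cluster C and continue:
   $C \leftarrow C \cup J$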
   return to Step 2.
  Else do:
   store cluster C:
   $\mathcal{C} \leftarrow \mathcal{C} \cup \{C\}$
   eliminate unneeded seed edges and continue:
   $E_{seeds} \leftarrow E_{seeds} \setminus E(Q)$ where $Q$ is the subgraph of $K_n$ induced by $C$
   return to Step 1.
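The growth loop above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `grow_clusters` is hypothetical, the growth test is read as q ≥ α · den(Q), and growth is stopped when a cluster already covers every vertex (a case the pseudocode leaves implicit). The toy similarity matrix and the ordered seed list are made up for the example.

```python
import numpy as np

def grow_clusters(S, seeds, tau=0.008):
    """Grow one cluster per remaining seed edge, highest-weight seed first."""
    n = S.shape[0]
    seeds = list(seeds)                  # assumed sorted by decreasing weight
    clusters = []
    while seeds:
        u, v = seeds[0]
        C = [u, v]
        while len(C) < n:                # assumption: stop once C covers all of V
            alpha = 1.0 - 1.0 / (2.0 * (len(C) + 1))
            block = S[np.ix_(C, C)]
            density = 2.0 * np.triu(block, k=1).sum() / (len(C) * (len(C) - 1))
            outside = [x for x in range(n) if x not in C]
            contrib = {x: S[x, C].sum() / len(C) for x in outside}
            q = max(contrib.values())
            if q < alpha * density:      # best candidate is too weak: stop growing
                break
            # Every vertex within tau of the best contribution joins together.
            C.extend(x for x in outside if contrib[x] > q - tau)
        members = set(C)
        clusters.append(members)
        # Drop seeds whose two endpoints are both covered by the new cluster.
        seeds = [(a, b) for (a, b) in seeds
                 if not (a in members and b in members)]
    return clusters

# Toy data (hypothetical): similarities in [0, 1] from points on a line.
positions = np.array([0, 1, 2, 3, 100, 120, 140], dtype=float)
S = 1.0 - np.abs(positions[:, None] - positions[None, :]) / 200.0
np.fill_diagonal(S, 0.0)
ordered_seeds = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3), (4, 5), (5, 6)]
clusters = grow_clusters(S, ordered_seeds)
```

In the first growth round of this example vertices 2 and 3 have contributions within τ of each other, so they join the seed {0, 1} at the same time, which is the situation the parameter τ is meant to handle.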
The lower bound of the sequence α.

In Step 2 of the “growth step algorithm” there is a function α × den(Q) that controls the growth of an existing “cluster” Q. Some natural questions are: Is the function α × den(Q) decreasing? Does it have a lower bound? If it approaches zero in some case, then this method would fail since the growth substep would extend a cluster Q to the entire graph Kn. These questions are answered in [13]. The function α × den(Q) is decreasing in general. However, the following lemma from [13] addresses the concerns mentioned above.

Lemma. [13] In Step 2 of the “growth step algorithm”, the density of any new cluster Q′ is at least $\frac{2}{3}\mathrm{den}(Q_o)$, where $\mathrm{den}(Q_o)$ is the density of the seed edge $Q_o$ in the initial stage of this step.

3.2.3. Adjustment step

First iteration adjustment algorithm:
Input: $\mathcal{C} = \{C_1, \dots, C_k\}$, a family of subsets $C_i \subseteq V(K_n)$ where each $C_i$ is an approximation of some cluster (in the current graph) of relatively high edgeweight density as in definition 3.1. Observe that because it is the first iteration $C_i \subseteq X$, where $X$ is the original data input.
Output: $\{C_1, \dots, C_s\} \subseteq \{C_1, \dots, C_k\}$, a clustering such that for any pair $C_i$ and $C_j$ their intersection has $|C_i \cap C_j| \le 0.5 \min(|C_i|, |C_j|)$
Initialize: t ← 1, this index tracks the current cluster.
Step 1.
 sort the clusters in decreasing order by edgeweight density:
$\mathcal{C} = \{C_1, \dots, C_k\}$ and induced subgraphs $\mathcal{Q} = \{Q_1, \dots, Q_k\}$ have $\mathrm{den}(Q_i) \ge \mathrm{den}(Q_{i+1})$ for all $i$.
 initialize current cluster:
 set $C_t$ (t was initialized above)
Step 2. If all clusters have been checked, exit adjustment step algo, Else do:
 find clusters overlapping current cluster too much:
$H \leftarrow \{C_z \mid z > t \text{ and } |C_z \cap C_t| > 0.5 \min(|C_z|, |C_t|)\}$
Step 3. If $H = \{\}$ do:
   advance position and continue:
   $t \leftarrow t + 1$
   set current cluster as $C_t$
   goto Step 2.
  Else
   resolve degenerate overlap by removal:
   $\mathcal{C} \leftarrow \mathcal{C} \setminus H$
   advance position and continue:
   $t \leftarrow t + 1$
   set current cluster as $C_t$
   goto Step 2.
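A minimal sketch of the first iteration adjustment follows. The helper names `density` and `adjust_first_iteration` are hypothetical, clusters are represented as Python sets of vertex indices, and the toy similarity matrix is invented for the example; the overlap rule (more than half of the smaller cluster) is the one stated in the algorithm above.

```python
import numpy as np

def density(S, C):
    """Mean edge weight of the subgraph induced by the vertex set C."""
    C = sorted(C)
    block = S[np.ix_(C, C)]
    return 2.0 * np.triu(block, k=1).sum() / (len(C) * (len(C) - 1))

def adjust_first_iteration(S, clusters):
    """Keep the densest clusters and drop later ones that overlap them too much."""
    kept = sorted((set(C) for C in clusters), key=lambda C: -density(S, C))
    t = 0
    while t < len(kept):
        current = kept[t]
        # Remove every lower-ranked cluster sharing more than half of the
        # smaller cluster's vertices with the current cluster.
        kept = kept[: t + 1] + [
            C for C in kept[t + 1:]
            if len(C & current) <= 0.5 * min(len(C), len(current))
        ]
        t += 1
    return kept

# Toy data (hypothetical): similarities in [0, 1] from points on a line.
positions = np.array([0, 1, 2, 3, 100, 120, 140], dtype=float)
S = 1.0 - np.abs(positions[:, None] - positions[None, :]) / 200.0
np.fill_diagonal(S, 0.0)
result = adjust_first_iteration(S, [{0, 1, 2, 3}, {0, 1, 2}, {4, 5, 6}])
```

Here the denser triple {0, 1, 2} is ranked first, so the larger but slightly less dense approximation {0, 1, 2, 3} is the one removed.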
Subsequent iterations adjustment algorithm:
Input: $\mathcal{B} = \{B_1, \dots, B_k\}$, a family of subsets $B_i \subseteq X$ where each $B_i$ is an approximation of some cluster (in the original input data) of relatively high edgeweight density as in definition 3.1. Each $B_i$ is obtained from a $C_i \in \mathcal{C}$, the output of the growth step, by associating members of $C_i \subseteq V(K_n)$ with the objects they represent in lower hierarchical levels.
Output: $\{B_1, \dots, B_s\}$, a clustering obtained from $\{B_1, \dots, B_k\}$ by merging, such that for any pair $B_i$ and $B_j$ their intersection has $|B_i \cap B_j| \le 0.5 \min(|B_i|, |B_j|)$
Initialize: t ← 1, this index tracks the current cluster.
Step 1.
 sort the clusters in decreasing order by edgeweight density:
$\mathcal{B} = \{B_1, \dots, B_k\}$ and induced subgraphs $\mathcal{Q} = \{Q_1, \dots, Q_k\}$ have $\mathrm{den}(Q_i) \ge \mathrm{den}(Q_{i+1})$ for all $i$. NOTE that the subgraphs $Q_i$ are subgraphs of the original graph input to AQCM with vertices $X$ and edgeweights $S$.
 initialize current cluster:
 set $B_t$ (t was initialized above)
Step 2. If all clusters have been checked, exit adjustment step algo, Else do:
 find clusters overlapping current cluster too much:
$H \leftarrow \{B_z \mid z > t \text{ and } |B_z \cap B_t| > 0.5 \min(|B_z|, |B_t|)\}$
Step 3. If $H = \{\}$ do:
   advance position and continue:
   $t \leftarrow t + 1$
   set current cluster as $B_t$
   goto Step 2.
  Else
   resolve degenerate overlap by merging:
    find optimal cluster for merging with current cluster:
    $B^{\star} \leftarrow B_t \cup B_x$, where $B_x \in H$ is the cluster such that $Q^{\star}$, the subgraph induced by $B_t \cup B_x$, has $\mathrm{den}(Q^{\star})$ highest among all possible choices of $B_x \in H$
    update the clustering:
    $\mathcal{B} \leftarrow \mathcal{B} \setminus \{B_t, B_x\}$
    $\mathcal{B} \leftarrow \mathcal{B} \cup \{B^{\star}\}$
    re-sort the clustering and continue:
    NOTE that we only need to find the correct index for placement of $B^{\star}$ to preserve $\mathrm{den}(Q_i) \ge \mathrm{den}(Q_{i+1})$ for all $i$
    goto Step 2.
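The merging variant can be sketched in the same style. The names `density` and `adjust_by_merging` are hypothetical, clusters are Python sets of indices into the original data, and for brevity the sketch re-sorts the whole list after each merge instead of inserting the merged cluster at its correct index as the note above suggests.

```python
import numpy as np

def density(S, B):
    """Mean edge weight of the subgraph of the original graph induced by B."""
    B = sorted(B)
    block = S[np.ix_(B, B)]
    return 2.0 * np.triu(block, k=1).sum() / (len(B) * (len(B) - 1))

def adjust_by_merging(S, clusters):
    """Merge lower-ranked clusters that overlap the current cluster too much."""
    B = sorted((set(b) for b in clusters), key=lambda b: -density(S, b))
    t = 0
    while t < len(B):
        current = B[t]
        H = [b for b in B[t + 1:]
             if len(b & current) > 0.5 * min(len(b), len(current))]
        if not H:
            t += 1                       # nothing to merge: advance
            continue
        # Choose the partner whose union with the current cluster is densest.
        best = max(H, key=lambda b: density(S, current | b))
        merged = current | best
        B = [b for b in B if b is not current and b is not best]
        B.append(merged)
        B.sort(key=lambda b: -density(S, b))
        # t is left unchanged, so the cluster now at position t is examined next.
    return B

# Toy data (hypothetical): similarities in [0, 1] from points on a line.
positions = np.array([0, 1, 2, 3, 100, 120, 140], dtype=float)
S = 1.0 - np.abs(positions[:, None] - positions[None, :]) / 200.0
np.fill_diagonal(S, 0.0)
result = adjust_by_merging(S, [{0, 1, 2}, {1, 2, 3}, {4, 5, 6}])
```

The two overlapping triples share two of three vertices, so they are merged into {0, 1, 2, 3}, while {4, 5, 6} is left untouched.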

3.2.4. Contraction step

Graph contraction algorithm:

Input: $\{C_1, \dots, C_s\}$, the output of the adjustment step (equivalently $\{B_1, \dots, B_s\}$ if the iteration level of AQCM is 2 or higher; in this section we will simplify the notation by using the symbol $C$ only). Also the graph contraction requires the original similarity data $S$ that was input to AQCM in the first iteration.

Output: $K_s$, a complete graph on $s$ vertices where each vertex represents a cluster in $\{C_1, \dots, C_s\}$, and an edgeweight function $S': E(K_s) \to \mathbb{R}^{+}$ where $S'(C_i, C_j)$ is the average similarity (according to $S$) between the data in $C_i$ and the data in $C_j$.

Method:

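The definitions of $K_s$ and $S'$ are clear as described above, so here we only need to define the case when $C_i \cap C_j$ is non-empty. The definition we state here is general in that for the case where $C_i \cap C_j$ is empty, the definition is equal to the statement of the output described above.

edge set definitions: We need the following definitions of edge sets of the original graph on the data $X$ with edge set $E$. For $C_i$ and $C_j$ with $i, j \in \{1, \dots, s\}$ we have:

$E_\beta := \{uv \in E \mid u \in C_i \text{ and } v \in C_j \text{ and } u, v \notin C_i \cap C_j\}$
$E_\gamma := \{uv \in E \mid u \in C_i \text{ and } u \notin C_j \text{ and } v \in C_i \cap C_j\}$
$E_\delta := \{uv \in E \mid u \notin C_i \text{ and } u \in C_j \text{ and } v \in C_i \cap C_j\}$
$E_\sigma := \{uv \in E \mid u, v \in C_i \cap C_j\}$

edgeweight function: For $C_i$ and $C_j$ with $i, j \in \{1, \dots, s\}$ we define $S'$:

$E_{i,j} := E_\beta \cup E_\gamma \cup E_\delta \cup E_\sigma$
$$S'(C_i, C_j) := \frac{\sum_{e \in E_{i,j}} S(e)}{|E_{i,j}|}$$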
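The contracted edge weight can be sketched as below, assuming the four edge sets partition the relevant edges into those between the exclusive parts of the two clusters, those from each exclusive part into the overlap, and those inside the overlap. The names `contracted_weight` and `contract` and the example matrix are hypothetical.

```python
from itertools import combinations, product

import numpy as np

def contracted_weight(S, Ci, Cj):
    """Average original similarity over the edge set E_{i,j} of two clusters."""
    Ci, Cj = set(Ci), set(Cj)
    only_i, only_j, both = Ci - Cj, Cj - Ci, Ci & Cj
    edges = (
        list(product(only_i, only_j))    # exclusive part of Ci to exclusive part of Cj
        + list(product(only_i, both))    # exclusive part of Ci to the overlap
        + list(product(only_j, both))    # exclusive part of Cj to the overlap
        + list(combinations(both, 2))    # pairs inside the overlap
    )
    if not edges:
        raise ValueError("no edges between the two clusters")
    return sum(S[u, v] for u, v in edges) / len(edges)

def contract(S, clusters):
    """Build the weight matrix of the contracted graph, one vertex per cluster."""
    s = len(clusters)
    S_new = np.zeros((s, s))
    for i, j in combinations(range(s), 2):
        S_new[i, j] = S_new[j, i] = contracted_weight(S, clusters[i], clusters[j])
    return S_new

# Hypothetical 4-vertex similarity matrix.
S = np.array([
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
])
disjoint = contracted_weight(S, {0, 1}, {2, 3})        # four cross edges
overlapping = contracted_weight(S, {0, 1, 2}, {2, 3})  # vertex 2 is shared
K = contract(S, [{0, 1}, {2, 3}])
```

For disjoint clusters only the first edge set is non-empty, so the value reduces to the plain average similarity between the two clusters, as stated in the output description.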

4. Automatic Cluster/Community Selection

The following algorithm for automatic cluster/community selection is based on the principle that a “most significant” clustering within the many possibilities encoded by a hierarchy tree should have clusters in which the internal similarity is significantly higher than that of their parent clusters [20]. Within the context of a hierarchy tree T with parent node x and child node y (representing candidate clusters Cx and Cy respectively) this change in cluster density between levels can be expressed as “density drop” den(Cx) − den(Cy) (see Subsection 3.2 for definition of density). Furthermore, among possible clusterings exhibiting a significant density drop, we prefer smaller cluster size as we are interested in a larger number of clusters when such a clustering is an option.

The algorithm featured here assigns an edgeweight to the edge $xy \in E(T)$ that captures the properties of density drop and cluster size and then selects a clustering as an edge cut across which the average of the edgeweights is optimal among all edge cuts in the tree T. The algorithm for solving the optimal average weight cut problem is seen in Substeps 3.1–3.4 and proof of optimality is provided in [16, 17]. The graph theoretic operation “edge contraction” plays an important role (see Substep 3.4 (i)) so we clarify the notation and definition here. In the directed tree T, the tree “T contracted by e” is notated as T/e and is constructed by: 1) delete the edge e = xy; 2) the vertices x and y are merged to one vertex, that is, all in edges of x and out edges of y are now in and out edges respectively of the new vertex.

Input. The generalized hierarchy tree T (on the data set X) with the root vertex v0, the size of every cluster C (corresponding to each node of T), the similarity density den(C) of every cluster C (see definition 3.1).

Output. $\mathcal{C} = \{C_1, \dots, C_s\}$, a clustering of the data set $X$.

Denote the set of leafs of T by L.

Step 1. For each directed edge $e = xy \in E(T - L)$, define the edge weight

$$w(xy) = \frac{|C_y|}{\mathrm{den}(C_y)^2 - \mathrm{den}(C_x)^2} \qquad (1)$$

where Cx and Cy are clusters in X corresponding to the nodes x and y in T, respectively. That is, x is the parent node of the child node y.

Step 2. Find a maximum rooted spanning tree Tmax of T with respect to the weight w where v0 is the root of Tmax.

Step 3. In this step, we are to find an edge cut $Q$ of $T_{max}$ separating the root $v_0$ and the set $L$ of leaves such that $\frac{\sum_{e \in Q} w(e)}{|Q|}$ is minimum among all such edge cuts.

Substep 3.1. Calculate α0

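$$\alpha_0 = \frac{w(E^{+}(v_0))}{|E^{+}(v_0)|}$$

Substep 3.2. For each $e_i \in E(T_{max} - L)$ calculate $\lambda(e_i)$, the contractibility of the edge $e_i$, defined as

$$\lambda(e_i) = \frac{w(E^{+}(e_i)) - w(e_i)}{|E^{+}(e_i)| - 1}$$

and $E^{+}(e_i)$ is the set of all out-edges of the head of the edge $e_i$ in the directed tree $T_{max}$.

Substep 3.3. Sort the edges $e_i$ of $E(T_{max} - L)$ so that

$$\lambda(e_1) \le \lambda(e_2) \le \cdots \le \lambda(e_m)$$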

Substep 3.4. If λ(e1) < α0 then

  1. Denote the in edge to $e_1$ by $e^{*}$. Contract $T_{max} \leftarrow T_{max}/e_1$, and

  2. update λ value for e*, or update α0 if e1 had no in edge (it was in E+(v0)), and

  3. repeat Substep 3.3.

If λ(e1) ≥ α0 then go to the final step.

FINAL STEP. In the resulting (smaller) tree Tmax, let Q = E+(v0). The output is

$$\{C_x \mid v_0 x \in Q,\ C_x \text{ is the cluster corresponding to } x \text{ in } T_{max}\}.$$
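To see why the contractibility is compared against α0, consider the special case of an edge e in the current cut Q = E+(v0) whose head has at least two out-edges. The short calculation below is a sketch for that case only. The symbols W, m, a, and b are shorthand introduced here, and w(·) of an edge set is read as the sum of its edge weights, which is an assumption about the notation. The general optimality proof remains the one in [16, 17].

```latex
% Shorthand (not from the paper): W = w(Q), m = |Q|, so \alpha_0 = W/m.
% Replacing e by the out-edges of its head gives the cut
% Q' = (Q \setminus \{e\}) \cup E^{+}(e).
\begin{align*}
a &= w(E^{+}(e)) - w(e), \qquad b = |E^{+}(e)| - 1 \ge 1, \qquad
     \lambda(e) = \frac{a}{b},\\
\frac{w(Q')}{|Q'|} &= \frac{W + a}{m + b},\\
\frac{W + a}{m + b} < \frac{W}{m}
  &\iff m(W + a) < W(m + b)
  \iff m\,a < W\,b
  \iff \lambda(e) < \alpha_0 .
\end{align*}
```

So, in this special case, contracting the edge lowers the average weight of the cut exactly when λ(e) < α0, which matches the test in Substep 3.4.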

We make an important note here regarding implementation of the above cluster selection algorithm. Recall that at the first iteration of AQCM it is not guaranteed that all data have joined a cluster. Indeed, at a given level in the hierarchy tree there may be some branch that represents data still unclustered as of that level. We find that it is useful to remove such branches from the tree Tmax before solving the optimization problem described in Step 3 above. In so doing we ensure that the resulting edge cut is related to significant drops in density of clusters between hierarchical levels without the need to define such a drop in the case of unclustered data. Furthermore this heuristic would seem relevant since it is the change in density of clusters that underpins the density drop cluster selection as a strategy.

5. Hierarchical Clustering in Facebook Network Data

5.1. The data set

In order to test the ability of the AQCM method to perform real world unsupervised learning at a scale that is large relative to the expected size of clusters, we chose social network data for the test data set. The data are the Facebook friendships recorded as a “snapshot” of the state of the Facebook network at UC Berkeley in 2005. This choice of data satisfies our target of “scale relative to expected cluster size” because of the circumstances under which the network was formed. In 2005 Facebook was emerging as a prominent new social media platform at a time when other platforms such as Friendster and Myspace had already paved the way for social media to become a common aspect of young Americans’ internet use. At the time, Facebook required users to sign up with an academic (university, college, or high school) email address, as the platform was originally designed as an internet service to facilitate social interaction on campuses, and it was within these campuses that Facebook networks first formed. These networks of student friendships represented relatively new social relationships forming around the types of social interactions fostered by campus life such as social events, fraternity/sorority life, and classroom/degree related relationships. Moreover, college students tend to develop relatively small “close friend social circles” which are embedded within larger social circles related to campus activity. Hence in Facebook network data sets taken from colleges we might expect there to exist hierarchies of network community/cluster structures where, on the most local level, edge dense community/cluster structures are quite small relative to the network of the entire body of students on a given campus. The UC Berkeley data set represents 22,900 individuals, so we might expect there to exist thousands of small clusters of close friends. Thus the data set presents the type of challenge which AQCM was designed to address in automatic cluster detection.

5.2. Method of applying AQCM

5.2.1. Similarity

The UC Berkeley Facebook network is not a weighted network data set. That is, for individuals i and j who are friends the adjacency matrix has A[i, j] = 1 and if they are not friends A[i, j] = 0. Thus in order to apply AQCM it is necessary to define similarity between two network nodes/vertices. For this definition we adopt the method presented in [15]. Mathematical details of the method are given in the appendix (Subsection 6.2). The method defines for each vertex a diffusion pattern vector representing the way the vertex impacts the rest of the network in a diffusion process. Similarity between two vertices is then defined as the similarity of their diffusion vectors using “cosine of the angle between the two vectors”, a method common in data mining [22].
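The final step of this similarity definition, the cosine of the angle between two vectors, can be sketched as follows. The construction of the diffusion pattern vectors themselves follows [15] and the appendix and is not reproduced here; the matrix `D` below is a made-up stand-in with one vector per vertex, and the function name is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(D):
    """Pairwise cosine similarity between the rows of D (one vector per vertex)."""
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("cosine similarity is undefined for a zero vector")
    U = D / norms            # scale every row to unit length
    return U @ U.T           # dot products of unit vectors are cosines

# Hypothetical vectors for three vertices.
D = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
sim = cosine_similarity_matrix(D)
```

When the vectors have non-negative entries, as in this example, the resulting similarities lie in [0, 1], which is the range for which the growth step's τ value was reported.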

5.2.2. Initial results and further iterative approach

We applied AQCM to build a hierarchy tree from the similarity described above and then used edge-cut clustering to obtain 4,445 clusters covering approximately 15,000 of the 22,900 vertices of the Facebook network. The hierarchy tree output by AQCM has 86,430 vertices and has 15 hierarchical levels. Clearly such a complex hierarchical clustering output presents challenges for visualization and interpretation of the clusterings it describes. These challenges motivated the development of a further iterative process within which AQCM and edge-cut clustering serve as subroutines. The iterative process is detailed in Figure 3. It creates a simplified hierarchy tree based on the output of AQCM and edge-cut clustering run subsequently in a manner similar to AQCM itself. The method also applies some simple post-processing algorithms to clustering C (details provided in Appendix 6.3) to remove multimembership and force unclustered data into clusters in a refined clustering C. The purpose of post processing is so that diagrams may more easily describe the data set and validate the existence of hierarchical structures. For the contraction step in this iterative process we use the same contraction described in Subsection 3.2.4. The resulting similarity data S′ is input to AQCM and the process repeats. We were able to obtain a simplified hierarchy tree suitable for diagrams in three iterations of this process. Importantly, the run time for the iterative method was not significantly longer than simply running AQCM with edge cut clustering. This is because the large share of time is spent reducing the job from 22,900 data down to roughly 5,000 data (the size of the first clustering) at which point the job size is no longer large so the following iterations finish relatively quickly.

Figure 3:


Iterative application of AQCM and edge-cut cluster/community detection

5.3. Assessment of discovered hierarchical structures

5.3.1. Similarity and connectivity

The hierarchy tree output by the process described in Subsection 5.2.2 has five significant hierarchical levels and is pictured in Figure 5a (level numbers illustrated in red) with the similarity data described in Subsection 5.2.1 seen underneath as indexed by the tree. For the purpose of demonstrating the validity of hierarchical structures discovered in the UC Berkeley network we focus on levels 1, 3, and 4 as these provide a sufficient view of the scale of the clusters at higher and lower levels. These three levels feature clusterings of 4800, 372, and 38 clusters respectively. Note that here we refer to non-trivial clusters; at level 1 there are 134 un-clustered points. Some statistics on cluster size are provided for each of these three levels in Table 1 and histograms of cluster size for each of these levels are provided in Figure 4. Level 1 has a large number of small clusters with sizes 2 and 3 accounting for 2599 of the clusters and sizes 4, 5, and 6 accounting for 1552 of the clusters. There are 315 level 1 clusters in size range 10 through 50 and only 14 with size greater than 50 with only five of those having size greater than 100.

Figure 5:


Similarity and Connectivity: a. The hierarchy tree output by the iterative process discussed in Subsection 5.2.2 with five hierarchical levels labeled in red, the similarity data is shown as a matrix indexed by the tree. b., c. Successive zoom-ins of chosen subsets as discussed in Subsection 5.3.1. d. Same zoom-in as in c above but the data displayed is the adjacency (connectivity) data. e. Graphical layout of the UC Berkeley Facebook network with level 4 clusters shown darker and with cluster numeric labels. f. Zoom-in of the inset blue box in e, here dark edges illustrate clusters at level 3, the red arrow indicates level 3 cluster number 23 discussed in Subsection 5.3.1

Table 1:

cluster size statistics

level   μ        σ        med      min   max
4       602.63   657.07   337.50   46    2490
3       61.56    55.99    44       8     501
1       4.74     7.2      3        2     259

Figure 4:


(Left) Histograms of cluster sizes at each of the indicated levels. (Right) Zoom-in of the non-outlier cluster sizes at level 1

In order to provide deeper perspective on the complex structures found in the similarity data and their relationship to the social structure of the Facebook network we follow an exploratory approach along a particular branch of the hierarchy tree. The branch represents level 4 cluster number 32 with 2490 members and is indicated with a black dot in Figure 5a while the internal similarity data of the cluster is highlighted by the blue box in Figure 5a. The internal similarity is relatively high compared with the cluster’s relatively low similarity to the rest of the data and visual inspection of Figure 5a indicates the cluster is distinguished in that way within the set of level 4 clusters. Figure 5e shows a depiction of the connectivity of the Facebook network with level 4 clusters illustrated by darker edges and cluster numeric labels. We computed the modularity [11] of each of the level 4 clusters and found cluster 32 has the highest modularity. Finally, accessing the metadata available, we note that 80% of cluster 32 are members of the freshman class and of the freshman class 57% are in cluster 32. Thus we can summarize that cluster 32 is suitably interesting as it has distinctness in its similarity and modularity properties and also correlates significantly with the social substructure induced by the freshman class. The existence of a distinct social subnetwork correlating with some active subset of the freshman class fits with our theory that the Facebook network should correspond with aspects of campus life. Indeed, we expect that freshmen seeking to establish a social life in their new environment would use Facebook friendships to facilitate social activity and thus a freshman oriented subnetwork should emerge.

Figure 5b shows a zoom in of the similarity data inside cluster 32 with the corresponding branch of the hierarchy tree. At this zoom level the lower level small clusters of high similarity begin to become visible as blocks along the diagonal of the matrix. Figure 5f shows a zoom in of the area in the inset blue box in Figure 5e. In Figure 5f the dark edges illustrate the clusters at level 3. The red arrow indicates level 3 cluster number 23. This cluster stands out as being significantly larger. The cluster’s internal similarity is highlighted by the inset blue box in Figure 5b. Visual inspection of Figure 5b reveals the relatively higher amount of similarity inside cluster 23. Interestingly cluster 23 also has relatively high similarity to many of the other clusters in Figure 5b,f. This property indicates that cluster 23 might be a high centrality cluster representing a hub within the Facebook network of cluster 32. In order to investigate this conjecture we compute the node degree (number of edges at a node) within cluster 32 for each member. For members of cluster 23 the average degree inside cluster 32 is 40 with a maximum of 184 while members of cluster 32 not in cluster 23 had average degree 21 with a maximum of 134. Furthermore, while cluster 23 members make up only 11% of cluster 32, more than half of “higher degree” (degree more than 50) cluster 32 members are in cluster 23. It is noteworthy though that cluster 23 is not entirely characterized as a group of high degree nodes. We observe that the average degree within cluster 32 is 24 and cluster 23 has 58% with degree greater than this, hence the degree distribution skew for cluster 23 is not very large. However, cluster 23 has 23 members with degree at least 100 while members of cluster 32 not in cluster 23 account for only 4 with degree at least 100. Hence we may characterize that cluster 23 contains almost all of the high centrality members within cluster 32 and the high centrality of a select few members of cluster 23 appears to be a key feature of the subnetwork defined by cluster 23.

Figure 5c,d shows the zoom-ins of the internal similarity data and adjacency data respectively of cluster 23. The adjacency data shows a complex system of smaller edge-dense submodules connected together by hub-like connectivity emanating from several clusters seen at top left. Hence the characterization of cluster 23 described above appears accurate on inspection. In the analysis below we further confirm the hub-like structures in the hierarchical subnetworks of clusters 32 and 23.

5.3.2. Modular properties and hub-like structures

For the purpose of understanding the complex relationships within and between clusters we quantify the modular properties of clusters using a stochastic block model approach. For a graph $G$ with a clustering $\mathcal{C} = \{C_1, \dots, C_s\}$ of the vertex set $V$, the probability of an edge between clusters $C_\ell$ and $C_k$ is defined as the total number of such edges divided by the possible number of such edges. If $\ell = k$ then we count the number of edges inside $C_\ell$ and divide by the possible number of such edges. For notation we use $P_{\ell k}$. Also note that $P_{\ell k}$ can equivalently be referred to as edge density. Community structures in networks are understood to be relatively edge-dense subsets of vertices separated from the rest of the network by a relatively sparse edge cut. Under this model we would say a cluster (community) $C_\ell$ should have $P_{\ell\ell} > P_{\ell x}$ where $C_x$ is defined as the vertex set $V \setminus C_\ell$. In this work we are interested in all the values $P_{\ell k}$ over the clustering $\mathcal{C}$. For visual analysis we define a matrix $M_{\mathcal{C}}$ based on these values. For vertices $v_i \in C_\ell$ and $v_j \in C_k$ we assign $M_{\mathcal{C}}[i,j] = P_{\ell k}$. We computed the matrix $M_{\mathcal{C}}$ for each of the three clustering levels 1, 3, and 4. $M_{\mathcal{C}}$ for level 4 is seen in Figure 6a as indexed by the hierarchy tree above it. The red arrow indicates level 4. Figures 6b,c show $M_{\mathcal{C}}$ for levels 3 and 1 respectively and zoomed in the same way as in Figures 5b,c.
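The block edge probabilities and the vertex-indexed matrix built from them can be sketched as follows. The sketch assumes an undirected 0/1 adjacency matrix and disjoint clusters, and it assigns probability 0 to a singleton cluster's diagonal entry, which is a convention chosen here and not stated in the text. The function name and the five-vertex example graph are hypothetical.

```python
import numpy as np

def block_probabilities(A, clusters):
    """P[l, k]: observed edges between clusters l and k over possible edges."""
    s = len(clusters)
    P = np.zeros((s, s))
    for l, Cl in enumerate(clusters):
        for k, Ck in enumerate(clusters):
            block = A[np.ix_(Cl, Ck)]
            if l == k:
                m = len(Cl)
                possible = m * (m - 1) / 2
                # The symmetric diagonal block counts every inside edge twice.
                P[l, k] = (block.sum() / 2) / possible if possible else 0.0
            else:
                P[l, k] = block.sum() / (len(Cl) * len(Ck))
    return P

# Hypothetical graph: a triangle {0, 1, 2}, an edge {3, 4}, and one bridge 2-3.
A = np.zeros((5, 5), dtype=int)
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (2, 3)]:
    A[u, v] = A[v, u] = 1
clusters = [[0, 1, 2], [3, 4]]
P = block_probabilities(A, clusters)

# Vertex-indexed matrix: entry [i, j] holds P for the clusters of v_i and v_j.
labels = np.array([0, 0, 0, 1, 1])
M = P[np.ix_(labels, labels)]
```

In this example both clusters are complete inside (probability 1), while only one of the six possible edges between them is present.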

Figure 6:


Modular properties: a. b. c. Shown are edge probabilities (see Subsection 5.3.2) within and between clusters at each of levels 4, 3, and 1 respectively. Red arrows show the hierarchical level at which edge probabilities were calculated. The zoom-ins match those in Figure 5a,b,c. d. e. f. Shown are graphical layouts of the connectivity of the UC Berkeley Facebook network with vertex placement matching Figure 5e. Darker edges illustrate the in-cluster edges at levels 4, 3, and 1 respectively.

In order to assess the distinctness of the clusters in terms of their modular properties as described above we may consider, for a cluster $C_\ell$, the ratios of the form $(P_{\ell\ell} - P_{\ell k})/P_{\ell\ell}$ over all choices of $k \neq \ell$. For a cluster $C_\ell$ we will notate this ratio as $\Delta_\ell(k)$. Furthermore it is useful to consider, for a cluster $C_\ell$, the minimum over $k$ of $\Delta_\ell(k)$. For this minimum we use the notation $\delta_\ell(k)$.
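The separation ratio and its minimum can be computed directly from a matrix of cluster-to-cluster edge probabilities. The sketch below is illustrative only: the function names are hypothetical and the 3×3 probability matrix is made-up data, not values from the Facebook network.

```python
def separation_ratios(P, l):
    """Ratio (P[l][l] - P[l][k]) / P[l][l] for every other cluster k."""
    if P[l][l] == 0:
        raise ValueError("ratio undefined when the internal edge probability is zero")
    return {k: (P[l][l] - P[l][k]) / P[l][l] for k in range(len(P)) if k != l}

def min_separation(P, l):
    """Smallest ratio over the other clusters, i.e. the least separated neighbour."""
    return min(separation_ratios(P, l).values())

# Made-up edge probabilities for three clusters (symmetric matrix).
P = [
    [0.80, 0.10, 0.02],
    [0.10, 0.50, 0.25],
    [0.02, 0.25, 0.60],
]
ratios_0 = separation_ratios(P, 0)
print(ratios_0, min_separation(P, 0), min_separation(P, 1))
```

A ratio close to 1 means the cluster is far denser internally than toward the other cluster; a negative ratio means the cluster is more densely connected to the other cluster than to itself.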

Using the framework described above we first observe that for the level 4 clustering $\Delta_\ell(k)$ is positive for each $C_\ell$ with an average of 94%. The average of $\delta_\ell(k)$ over all clusters is 83.8% with standard deviation 11.8%. Also, the minimum value of $\delta_\ell(k)$ observed is 51.9%, indicating that even in the case where two clusters are not very distinctly separated there is still a significantly lower probability of between-cluster edges. Hence the level 4 clustering exhibits significant community structure in that the difference in internal and between-cluster edge probability is statistically significant relative to cluster edge density. In particular we have for cluster 32 the average over $k$ of $\Delta_{32}(k)$ is 88.9%. For 86.5% of the $k$ values $\Delta_{32}(k)$ is at least 75%, validating that cluster 32 is a strongly separated community, a property that agrees with its high modularity as mentioned in Subsection 5.3.1. We may also observe that visual inspection of Figure 6a reveals that the edge probabilities between cluster 32 and most other clusters in level 4 are significantly lower than between-cluster probabilities among other clusters, further demonstrating that cluster 32 stands out as a particularly well separated subnetwork.

Repeating the analytic inquiry above for the clustering at level 3 we find the following. $\Delta_\ell(k)$ is positive for each $C_\ell$ with an average of 98.9%. The average of $\delta_\ell(k)$ over all clusters is 85.4% with standard deviation 7.9%. Also, the minimum value of $\delta_\ell(k)$ observed is 38%, indicating that in some cases cluster separation is not as distinct and may be on account of a cluster group forming a higher level cluster. Hence the level 3 clustering exhibits community structure distinctness similar to that at level 4. Indeed, if we explore the internal connectivity of level 4 cluster 32 as an example, the level 3 cluster to cluster edge probabilities seen in Figure 6b clearly show the relatively edge dense level 3 clusters interconnected by relatively sparse connectivity.

Another feature that stands out in visual inspection of Figure 6b is the apparently higher edge probability of edges to cluster 23 (blue inset box). While the edge probability from any subcluster of 32 to subcluster 23 averages 0.9%, the expected value of such averages for other subclusters seen in Figure 6b is just 0.47% and for 90% of subclusters of 32 that average is less than 0.7%. Cluster 23 of course has the highest edge probability average hence, in the sense of the statistics mentioned here, we may confirm our conjecture made in Subsection 5.3.1 that subcluster 23 is a hub within cluster 32.

Repeating the analytic inquiry above for the level 1 clustering is more challenging because 1039 of the 4800 level 1 clusters have the property $P_{\ell\ell} < P_{\ell k}$ for at least one value of $k$. Clusters such as this can arise when common neighborhood properties are more relevant for a set of vertices than direct connections. Consider for example ten Facebook members who are all friends with each other. Suppose there are three additional members who are not friends with each other but who are each friends with all of the ten. In this case the three would form a natural cluster of similarity but with no internal edge density in their group. In order to describe the edge dense properties of level 1 clusters we focus the statistics on those with $P_{\ell\ell} > P_{\ell k}$. There are 3087 such clusters. Among those the average of $\Delta_\ell(k)$ is 99.7% and the average of $\delta_\ell(k)$ over those clusters is 40.6%. Hence 64% of level 1 clusters are simply described by the edge-dense community model.
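The ten-plus-three example can be checked numerically. The sketch below builds that small graph (the variable and function names are hypothetical) and confirms that the group of three has zero internal edge probability, a between-cluster edge probability of 1 toward the ten, and a completely shared neighborhood.

```python
from itertools import combinations

core = list(range(10))   # ten members who are all friends with each other
fans = [10, 11, 12]      # three members, each a friend of all ten, not of each other

edges = {frozenset(p) for p in combinations(core, 2)}
edges |= {frozenset((f, c)) for f in fans for c in core}

def edge_probability(a, b):
    """Observed edges between vertex lists a and b over the possible number."""
    pairs = list(combinations(a, 2)) if a == b else [(u, v) for u in a for v in b]
    return sum(frozenset(p) in edges for p in pairs) / len(pairs)

def neighbours(v):
    """All vertices sharing an edge with v."""
    return {u for e in edges if v in e for u in e if u != v}

p_internal = edge_probability(fans, fans)
p_between = edge_probability(fans, core)
shared = set.intersection(*(neighbours(f) for f in fans))
print(p_internal, p_between, len(shared))  # 0.0 1.0 10
```

The three vertices have identical neighborhoods, which is why a similarity-based method can group them even though the edge-dense community model does not describe them.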

Inspecting Figure 6c it is clear that one of the most noticeable features is the apparent hub structure formed by the prominent cluster at top left. This cluster is number 1123 of the level 1 clustering. The average of $P_{1123,k}$ over clusters $k$ featured in Figure 6c is 17.34%, which puts cluster 1123 in the top 8.9% of subclusters of cluster 23 in terms of connectivity to other clusters. At 14 members, cluster 1123 is significantly larger than other cluster 23 subclusters with hub-like connectivity, hence we may think of it as a prominent feature of the subnetwork of level 3 cluster 23 as conjectured in Subsection 5.3.1.

Acknowledgments

This work was partially supported by an NIH Grant R01 DC015901 and an NSF Grant DMS-1700218

6. Appendix

6.1. Additional graph theory terminology

For a vertex v and an edge e (in an undirected graph), we say e is incident to v if e = uv for some vertex u. The degree of a vertex v is the number of edges incident to v and is notated d(v). If G is a weighted graph then degree is generalized as $d(v)=\sum_{u \in V} w(uv)$, where if $uv \notin E$ then $w(uv) = 0$. If v is a vertex of a directed graph we separately count the out-degree (the number of edges with v as a tail) and the in-degree (the number of edges with v as a head), and they are notated $d^+(v)$ and $d^-(v)$ respectively. Generalization of weighted in and out degree follows the same intuition as for the undirected case.

A path between two vertices $u,v \in V$ is a sequence $u = u_1 u_2 \cdots u_k = v$ such that for each consecutive pair $i, i+1$ we have $u_i u_{i+1} \in E$ and each vertex appears only once in the sequence. For two specific vertices u and v a path between them is often called a u,v−path. A closed path is a path where the first and last vertex are the same and is often called a cycle. We say a graph is connected when for any pair of vertices $u,v \in V$ there is a u,v−path. A set of edges $H \subseteq E$ is called an edge cut when the removal of the edges H causes the graph to become disconnected. In many graph theoretic works it is important to restrict study to edge cuts that split a graph into a set of non-trivial components $\{Q_1,\dots,Q_k\}$. A non-trivial component Q of a graph G is a subgraph of G with $V(Q) \subseteq V(G)$ and $E(Q) \subseteq E(G)$ and $|V(Q)| > 1$ and Q is connected and, importantly, for $u,v \in V(Q)$ with $uv \in E(G)$ then $uv \in E(Q)$. Edge cuts that split a graph G into non-trivial components are called non-trivial edge cuts, but in context it is often understood, in a given work, that the term edge cut refers to those that are non-trivial.

A minimal connected graph on a vertex set V with |V| = n is any graph such that no edge can be removed without disconnecting the graph. It is easy to see that any such graph has no cycle and easy to show that any such graph has n − 1 edges. A minimal connected graph is called a tree and often such a graph is notated T instead of G. An obvious property of any tree T is that for any pair of vertices u and v there is exactly one u,v–path.

In the study of directed graphs, an important type of tree is the rooted tree. A rooted tree is a directed tree with exactly one vertex v with $d^-(v) = 0$ (this vertex is called the root) and a set of vertices $\{u_1,\dots,u_k\}$ with $d^+(u_i) = 0$ for each $i$ (these vertices are called leaves). Stated another way, for a rooted tree T, there is one vertex (root) that is the head of no edge and some set of vertices (leaves) that are each the tail of no edge. It is easy to show that for a root vertex v and any non-root vertex u there is a v,u–dipath (directed path), hence a rooted tree can be thought of as having a “layered” structure in that nodes can be organized into groups by their distance from the root along root-to-leaf paths.

6.2. The Similarity Matrix, S

We give here a brief summary of the construction of the matrix S that we used in our analysis of the Facebook network data. A detailed study of this construction of the similarity matrix can be found in [15].

Input. Given an adjacency matrix A of an input graph G where the (i, j)-entry A[i, j] is the weight of the directed edge $v_i v_j$. Let n = |V(G)|. (If G is undirected, A is symmetric; if G is unweighted, A[i, j] = 0 or 1.)

Output. A diffusion similarity matrix S where the (i, j)-entry S[i, j] is the similarity between the vertices vi and vj.

NOTE: Performing Step 1. for any directed graph input will save computation time needed to otherwise confirm the strong-connected property for the graph.

Step 1. If G is strongly connected, then go to Step 2. Otherwise, let $v_{n+1}$ be a new vertex, and

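$$V(G) \leftarrow V(G) \cup \{v_{n+1}\},$$
$$E(G) \leftarrow E(G) \cup \{v_i v_{n+1} : i = 1,\dots,n\} \cup \{v_{n+1} v_i : i = 1,\dots,n\}$$

and for every $i = 1,\dots,n$,

$$A[i,n+1] = A[n+1,i] = 10\% \cdot \min\{A[i,j] : A[i,j] > 0,\ i,j = 1,\dots,n\}$$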

then go to Step 2.

Step 2. Construct W where the (i, j)-entry

$$W[i,j] := \frac{A[i,j]}{\sum_{\mu} A[i,\mu]} \qquad (2)$$

Step 3. Construct

$$K := \sum_{k=1}^{\infty} \frac{(1.63)^k}{k!} W^k \qquad (3)$$

Step 4. If |V(G)| = n + 1 (that is, Step 1 was taken), then delete the (n + 1)-st row and (n + 1)-st column from K, and go to Step 5.

If |V (G)| = n, then go to Step 5.

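Step 5. Let $K_{\mathrm{row}(i)}$ be the i-th row of K and $K_{\mathrm{col}(i)}$ be the i-th column of K. Construct the output S where the (i, j)-entry

$$S[i,j] := \frac{\cos(K_{\mathrm{col}(i)}, K_{\mathrm{col}(j)}) + \cos(K_{\mathrm{row}(i)}, K_{\mathrm{row}(j)})}{2} \qquad (4)$$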

where cos(α,β) is the cosine of the angle between vectors α and β. END.
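Steps 1 to 5 can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the series in Step 3 is truncated after a fixed number of terms (the hypothetical `terms` parameter), the auxiliary vertex of Step 1 is always added when `augment=True` rather than testing strong connectivity (as the NOTE above permits), the auxiliary vertex is given no self-loop, and the function and parameter names are invented for the example.

```python
import math

def matmul(X, Y):
    """Plain-Python product of two square matrices given as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def diffusion_similarity(A, t=1.63, terms=30, augment=True):
    n = len(A)
    A = [[float(a) for a in row] for row in A]
    if augment:
        # Step 1: auxiliary vertex joined to every vertex in both directions,
        # weighted at 10% of the smallest positive entry of A.
        eps = 0.1 * min(a for row in A for a in row if a > 0)
        A = [row + [eps] for row in A] + [[eps] * n + [0.0]]
    m = len(A)
    # Step 2: row-normalise (every row must have a positive sum).
    W = [[a / sum(row) for a in row] for row in A]
    # Step 3: truncated series sum_{k=1..terms} t^k / k! * W^k.
    K = [[0.0] * m for _ in range(m)]
    power = [[float(i == j) for j in range(m)] for i in range(m)]
    coeff = 1.0
    for k in range(1, terms + 1):
        power = matmul(power, W)
        coeff *= t / k
        for i in range(m):
            for j in range(m):
                K[i][j] += coeff * power[i][j]
    # Step 4: drop the auxiliary row and column (a no-op when augment=False).
    K = [row[:n] for row in K[:n]]
    # Step 5: average of column-wise and row-wise cosine similarity.
    cols = list(zip(*K))
    return [[(cosine(cols[i], cols[j]) + cosine(K[i], K[j])) / 2
             for j in range(n)] for i in range(n)]

# Toy input: two triangles joined by the edge 2-3 (undirected, unweighted).
n = 6
A = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u][v] = A[v][u] = 1
S = diffusion_similarity(A)
print(round(S[0][1], 3), round(S[0][5], 3))
```

Because W is row-stochastic, the terms of the series shrink like $1.63^k/k!$, so a modest truncation is numerically indistinguishable from the full sum on small inputs. On the toy graph, vertices in the same triangle come out more similar than vertices in opposite triangles.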

6.3. Post processing

Let X be a data set and $\mathcal{C}=\{C_1,\dots,C_s\}$ be a family of subsets of X, possibly with $|C_i| = 1$ for some $C_i$ and possibly with some pair of subsets having non-empty intersection. Let S(x, y) be the similarity value of data x and y and let the similarity density of $C_i$, notated $\mathrm{den}(C_i)$, and the contribution of x to $C_i$, notated $\mathrm{cont}(x, C_i)$, be as defined in Subsection 3.2. If $|C_i| = 1$ then we define $\mathrm{den}(C_i) = 1$. The following post processing algorithms augment the clustering $\mathcal{C}$ so as to accommodate certain properties of clustering output that may be desirable for some applications.

6.3.1. Expanding the cluster set by including some un-clustered vertices

The AQCM algorithm builds a hierarchical description of a data set based on preserving local similarity density. It is possible for a given data set that the output clustering of AQCM and cluster detection will leave some data points unclustered as “outliers” of a significantly more locally dense selected cluster. For various data analytic purposes it may be desirable to place these unclustered data into clusters. These unclustered data also may naturally form small clusters themselves if their local cluster was detected at a position in the hierarchy tree higher than a cluster detection cut for some nearby denser cluster. For these reasons we have developed the following post processing algorithm which allows unclustered data to naturally “self-assign” to another unclustered data or a nearby cluster based on a factor described below.

The following definitions quantify the similarity of data x to $C_i$ relative to other clusters and also relative to the similarity density of $C_i$. In this way the clustering factor $\phi_c(x)$ is a value that represents a best fit of data x to a cluster, in that data x should be most similar to its ideal cluster without reducing the cluster’s similarity density too much.

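Definition 6.1. For a given $x \in X$ and $C_i \in \mathcal{C}$ with $C_i \neq \{x\}$,

  1. The individual preference of x to $C_i$.
    $$\phi_p(x,C_i)=\frac{\mathrm{cont}(x,C_i)}{\max\{\mathrm{cont}(x,C_j) \mid C_j \in \mathcal{C} \text{ with } C_j \neq \{x\}\}}$$
  2. The community acceptance of x by $C_i$.
    $$\phi_a(x,C_i)=\frac{\mathrm{cont}(x,C_i)}{\mathrm{den}(C_i)}$$
  3. The mutual preference between x and $C_i$.
    $$\phi_m(x,C_i)=\phi_p(x,C_i)\times\phi_a(x,C_i)$$
  4. The clustering factor of x,
    $$\phi_c(x)=\max\{\phi_m(x,C_i) \mid C_i \in \mathcal{C}\}.$$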
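The four quantities of Definition 6.1 can be evaluated on a small example as follows. Since Subsection 3.2 is not reproduced here, the sketch assumes stand-in definitions: `den` is taken to be the mean pairwise similarity inside a cluster (with the value 1 for singletons, as stated above) and `cont` the mean similarity of x to the other members of a cluster. Both are assumptions made for illustration, as are all function names and the similarity values.

```python
from itertools import combinations

def den(C, S):
    """ASSUMED density: mean pairwise similarity; 1 for a singleton."""
    if len(C) == 1:
        return 1.0
    pairs = list(combinations(sorted(C), 2))
    return sum(S[a][b] for a, b in pairs) / len(pairs)

def cont(x, C, S):
    """ASSUMED contribution: mean similarity of x to members of C other than x."""
    others = [y for y in C if y != x]
    return sum(S[x][y] for y in others) / len(others)

def mutual_preference(x, Ci, clusters, S):
    """phi_m = phi_p * phi_a for point x and cluster Ci (Ci must not be {x})."""
    best = max(cont(x, C, S) for C in clusters if C != {x})
    d = den(Ci, S)
    if best == 0 or d == 0:
        return 0.0
    c = cont(x, Ci, S)
    return (c / best) * (c / d)

def clustering_factor(x, clusters, S):
    """phi_c: the largest mutual preference over all clusters other than {x}."""
    return max(mutual_preference(x, C, clusters, S) for C in clusters if C != {x})

# Made-up similarities: point 4 is unclustered and leans toward cluster {0, 1}.
n = 5
S = [[1.0 if i == j else 0.1 for j in range(n)] for i in range(n)]
for (a, b), s in {(0, 1): 0.9, (2, 3): 0.8, (4, 0): 0.7, (4, 1): 0.5, (4, 2): 0.2}.items():
    S[a][b] = S[b][a] = s
clusters = [{0, 1}, {2, 3}, {4}]
phi_A = mutual_preference(4, clusters[0], clusters, S)
phi_B = mutual_preference(4, clusters[1], clusters, S)
phi_c = clustering_factor(4, clusters, S)
print(round(phi_A, 4), round(phi_B, 4), round(phi_c, 4))
```

Here point 4 contributes 0.6 to cluster {0, 1}, whose density is 0.9, so its mutual preference for that cluster is about 0.667, far above its value for {2, 3}.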

6.3.2. Expansion algorithm

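Input: $\mathcal{C}=\{C_1,\dots,C_s\}$ a clustering of a data set X with $|C_i| \geq 2$ for all $i$, and a threshold parameter ρ.

Output: $\mathcal{C}'=\{C_1,\dots,C_t\}$, with $t \geq s$ and $\mathcal{C}'$ covers more of the data X than $\mathcal{C}$.

Step 1: List the unclustered data: $X' \leftarrow \{x \in X \mid x \notin C_i \text{ for any } C_i \in \mathcal{C}\}$

Step 2: Update $\mathcal{C}$ as follows: for each $x \in X'$, create a cluster $C = \{x\}$ and store the cluster $\mathcal{C} \leftarrow \mathcal{C} \cup \{C\}$.

Step 3: For each $x \in X'$ and every $C_i \in \mathcal{C}$, calculate $\phi_c(x)$ and $\phi_m(x, C_i)$.

Step 4: For each $x \in X'$, if there is a unique cluster $C_i$ for which $\phi_c(x) = \phi_m(x,C_i)$, and $\phi_c(x) \geq \rho$, then add x to $C_i$.

Step 5: From $\mathcal{C}$, remove $C_i$ if $|C_i| \leq 1$, and delete any duplicate cluster. $\mathcal{C}' \leftarrow \mathcal{C}$ and output $\mathcal{C}'$.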

END
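The five steps of the expansion procedure can be sketched as below. This is a hedged illustration, not the authors' code: `den` and `cont` are assumed stand-ins for the definitions of Subsection 3.2 (mean pairwise similarity and mean similarity of x to a cluster), the factors in Step 3 are computed once on a snapshot of the clustering before any point is added (one possible reading of the step order), and all names and data are hypothetical.

```python
from itertools import combinations

def den(C, S):
    """ASSUMED density: mean pairwise similarity; 1 for a singleton."""
    if len(C) == 1:
        return 1.0
    pairs = list(combinations(sorted(C), 2))
    return sum(S[a][b] for a, b in pairs) / len(pairs)

def cont(x, C, S):
    """ASSUMED contribution: mean similarity of x to members of C other than x."""
    others = [y for y in C if y != x]
    return sum(S[x][y] for y in others) / len(others)

def expand(clusters, n, S, rho):
    # Step 1: list the unclustered points.
    clustered = set().union(*clusters)
    unclustered = [x for x in range(n) if x not in clustered]
    # Step 2: add each unclustered point as a singleton cluster.
    snapshot = [frozenset(C) for C in clusters] + [frozenset({x}) for x in unclustered]
    result = [set(C) for C in snapshot]
    for x in unclustered:
        # Step 3: mutual preference of x for every cluster other than {x}.
        idx = [i for i, C in enumerate(snapshot) if C != {x}]
        if not idx:
            continue
        conts = {i: cont(x, snapshot[i], S) for i in idx}
        best = max(conts.values())
        if best <= 0:
            continue
        phi_m = {}
        for i in idx:
            d = den(snapshot[i], S)
            phi_m[i] = (conts[i] / best) * (conts[i] / d) if d > 0 else 0.0
        phi_c = max(phi_m.values())
        # Step 4: join the unique best cluster if the factor reaches rho.
        winners = [i for i in idx if phi_m[i] == phi_c]
        if len(winners) == 1 and phi_c >= rho:
            result[winners[0]].add(x)
    # Step 5: drop leftover singletons and duplicate clusters.
    out = []
    for C in result:
        if len(C) > 1 and C not in out:
            out.append(C)
    return out

# Made-up data: 4 leans toward {0, 1}; 5 and 6 are similar only to each other.
n = 7
S = [[1.0 if i == j else 0.1 for j in range(n)] for i in range(n)]
for (a, b), s in {(0, 1): 0.9, (2, 3): 0.8, (4, 0): 0.7, (4, 1): 0.5,
                  (4, 2): 0.2, (5, 6): 0.9}.items():
    S[a][b] = S[b][a] = s
expanded = expand([{0, 1}, {2, 3}], n, S, rho=0.5)
print(expanded)
```

In this toy run point 4 joins the cluster {0, 1}, while points 5 and 6 each join the other's singleton; the two resulting copies of {5, 6} are merged by the duplicate removal in Step 5, which is how unclustered data can form a new small cluster.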

6.3.3. Elimination of multi-membership property

The growth subroutine of AQCM allows clusters to include optimally similar points regardless of whether those points have already joined another cluster to which they are also optimally similar. Thus points representing the boundary of clusters may be “multi-members”. For some analytic purposes however it may be desirable to enforce a strict partitioning of a data set into clusters. For this reason we provide the following simple method to require multi-member data points to “choose” one cluster in which to remain a member.

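We may notate the multi-member set precisely. Let $X'' \subseteq X$ such that for each $x \in X''$ there exists $H_x \subseteq \mathcal{C}$ where $x \in C_i \iff C_i \in H_x$ and $|H_x| \geq 2$.

Definition 6.2. For a given $x \in X''$ and $C_i \in \mathcal{C}$

  1. The contribution of x to the core of $C_i$.
    $$\varphi(x,C_i)=\mathrm{cont}(x,C_i \setminus X'')$$
  2. The core factor of x.
    $$\varphi_c(x)=\max\{\varphi(x,C_i) \mid C_i \in \mathcal{C}\}$$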

6.3.4. Multi-membership elimination algorithm

Input: $\mathcal{C}=\{C_1,\dots,C_s\}$ a clustering of a data set X where some pairs $C_i$, $C_j$ have $C_i \cap C_j \neq \emptyset$.

Output: $\mathcal{C}'=\{C_1,\dots,C_t\}$, with the number of multi-member data reduced.

Step 1: List the multi-member data: $X'' \leftarrow$ multi-member data points as described above

Step 2: For each $x \in X''$ and every $C_i \in \mathcal{C}$, calculate $\varphi_c(x)$ and $\varphi(x, C_i)$.

Step 3: For $x \in X''$, if there is a unique cluster $C_i$ for which $\varphi_c(x) = \varphi(x, C_i)$ then remove x from all clusters except $C_i$.

Step 4: From $\mathcal{C}$, remove $C_i$ if $|C_i| \leq 1$, and delete any duplicate cluster. $\mathcal{C}' \leftarrow \mathcal{C}$ and output $\mathcal{C}'$.

END
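A minimal sketch of the multi-membership elimination steps follows. It rests on assumptions that should be read as such: `cont` is an assumed stand-in for the contribution defined in Subsection 3.2 (mean similarity of x to a set, taken as 0 for an empty core), the comparison in Steps 2 and 3 is restricted to the clusters that currently contain x, and the function names and similarity values are invented for the example.

```python
def cont(x, C, S):
    """ASSUMED contribution: mean similarity of x to members of C other than x."""
    others = [y for y in C if y != x]
    return sum(S[x][y] for y in others) / len(others) if others else 0.0

def eliminate_multi_membership(clusters, S):
    snapshot = [frozenset(C) for C in clusters]
    result = [set(C) for C in snapshot]
    # Step 1: points belonging to two or more clusters.
    count = {}
    for C in snapshot:
        for x in C:
            count[x] = count.get(x, 0) + 1
    multi = {x for x, c in count.items() if c >= 2}
    # Core of each cluster: its members that are not multi-members.
    cores = [C - multi for C in snapshot]
    for x in sorted(multi):
        # Step 2: contribution of x to the core of each cluster containing x
        # (restricting to these clusters is an assumption of this sketch).
        home = [i for i, C in enumerate(snapshot) if x in C]
        phi = {i: cont(x, cores[i], S) for i in home}
        phi_c = max(phi.values())
        # Step 3: keep x only in the unique best cluster, if there is one.
        winners = [i for i in home if phi[i] == phi_c]
        if len(winners) == 1:
            for i in home:
                if i != winners[0]:
                    result[i].discard(x)
    # Step 4: drop clusters of size at most 1 and duplicate clusters.
    out = []
    for C in result:
        if len(C) > 1 and C not in out:
            out.append(C)
    return out

# Made-up data: point 2 sits in both clusters but is closer to the core {0, 1}.
n = 5
S = [[1.0 if i == j else 0.1 for j in range(n)] for i in range(n)]
for (a, b), s in {(2, 0): 0.8, (2, 1): 0.7, (2, 3): 0.4, (2, 4): 0.3,
                  (0, 1): 0.9, (3, 4): 0.9}.items():
    S[a][b] = S[b][a] = s
partition = eliminate_multi_membership([{0, 1, 2}, {2, 3, 4}], S)
print(partition)
```

When two clusters tie for the core factor, the point is left as a multi-member, which matches the stated output of a reduced (not necessarily zero) number of multi-member data.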

Footnotes

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CQ Zhang

1

The software we have developed requires the full n×n similarity matrix for the purpose of faster matrix computations in various sections of the algorithm; however, the mathematical procedures of the algorithm require simply the $\binom{n}{2}$ similarity values.

2

the algorithm discovers clusters with such properties when those properties are inherent in the data, the algorithm does not force such properties onto the clustering output

3

The graph theoretic term “cut” is defined in Subsection 1.4.

4

For simplicity of explanation we describe the case of a tree with no multimembership, however the relationship exists similarly in trees featuring multimembership.

5

we have “aligned” the colors in the illustration for ease of comparison between the two clustering outputs


References

  • [1] Blondel Vincent D, Guillaume Jean-Loup, Lambiotte Renaud, and Lefebvre Etienne. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, October 2008.
  • [2] Bondy Adrian and Murty M. Ram. Graph Theory. Springer, 2008.
  • [3] Cheng D, Zhu Q, Huang J, Wu Q, and Yang L. Clustering with local density peaks-based minimum spanning tree. IEEE Transactions on Knowledge and Data Engineering, 33(2):374–387, 2021.
  • [4] Diestel R. Graph theory. Springer, 2017.
  • [5] Figueiredo MAT and Jain AK. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381–396, 2002.
  • [6] Frey Brendan J. and Dueck Delbert. Clustering by passing messages between data points. Science, 315(5814):16, February 2007.
  • [7] Girvan M and Newman MEJ. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
  • [8] Jain Anil K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010. Award winning papers from the 19th International Conference on Pattern Recognition (ICPR).
  • [9] Jain Anil K. and Law Martin H. C. Data clustering: A user’s dilemma. In Pal Sankar K., Bandyopadhyay Sanghamitra, and Biswas Sambhunath, editors, Pattern Recognition and Machine Intelligence, pages 1–10, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.
  • [10] Jeub Lucas G. S., Sporns Olaf, and Fortunato Santo. Multiresolution consensus clustering in networks. Scientific Reports, 8(1):3259, February 2018.
  • [11] Newman Mark EJ. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.
  • [12] Ou Yongbin, Guo Lan, and Zhang Cun-Quan. A new clustering method and its application to proteomic profiling for colon cancer. IASTED International Conference on Computational and Systems Biology: November 13–14, 2006, Dallas, Texas, USA (CASB 2006), 2006:68–72, Nov 2006.
  • [13] Ou Yongbin and Zhang Cun-Quan. A new multimembership clustering method. Journal of Industrial and Management Optimization, 3, 11 2007.
  • [14] Payne S. AQCM algorithm MATLAB. https://github.com/scottpayne282/AQCM_algorithm_MATLAB, 2021.
  • [15] Payne Scott, Fuller Edgar, Spirou George, and Zhang Cun-Quan. Diffusion profile embedding as a basis for graph vertex similarity, 2021.
  • [16] Payne Scott, Fuller Edgar, and Zhang Cun-Quan. Edge-cuts of optimal average weights. Asia-Pacific Journal of Operational Research, 36(02):1940006, 2019.
  • [17] Payne Scott, Fuller Edgar, and Zhang Cun-Quan. Edge-cuts optimized for average weight: a new alternative to Ford and Fulkerson. https://arxiv.org/abs/2002.00263, 2020.
  • [18] Pearl J. Reverend Bayes on inference engines: A distributed hierarchical approach. In Proceedings of the Second National Conference on Artificial Intelligence, pages 133–136, Menlo Park, California, 1982. AAAI Press.
  • [19] Peel Leto, Larremore Daniel B., and Clauset Aaron. The ground truth about metadata and community detection in networks. Science Advances, 3(5), 2017.
  • [20] Qi Xingqin, Tang Wenliang, Wu Yezhou, Guo Guodong, Fuller Eddie, and Zhang Cun-Quan. Optimal local community detection in social networks based on density drop of subgraphs. Pattern Recognition Letters, 36:46–53, 2014.
  • [21] Rodriguez A and Laio A. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.
  • [22] Shirkhorshidi Ali Seyed, Aghabozorgi Saeed, and Wah Teh Ying. A comparison study on similarity and dissimilarity measures in clustering continuous data. PLOS ONE, 10(12):1–20, 12 2015.
  • [23] Traag VA, Waltman L, and van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1):5233, March 2019.
  • [24] West D. Introduction to Graph Theory. Prentice Hall, 2001.
  • [25] Xu R and Wunsch DC. Clustering algorithms in biomedical research: A review. IEEE Reviews in Biomedical Engineering, 3:120–154, 2010.
  • [26] Zhao Peixin and Zhang Cun-Quan. A new clustering method and its application in social networks. Pattern Recognition Letters, 32(15):2019–2018, 2011.
