Author manuscript; available in PMC: 2013 Nov 26.
Published in final edited form as: J Am Stat Assoc. 2012 Oct 8;107(499). doi: 10.1080/01621459.2012.682530

Multiscale Community Blockmodel for Network Exploration

Qirong Ho 1, Ankur P Parikh 2, Eric P Xing 3
PMCID: PMC3840468  NIHMSID: NIHMS519455  PMID: 24288419

Abstract

Real world networks exhibit a complex set of phenomena such as underlying hierarchical organization, multiscale interaction, and varying topologies of communities. Most existing methods do not adequately capture the intrinsic interplay among such phenomena. We propose a nonparametric Multiscale Community Blockmodel (MSCB) to model the generation of hierarchies in social communities, selective membership of actors to subsets of these communities, and the resultant networks due to within- and cross-community interactions. By using the nested Chinese Restaurant Process, our model automatically infers the hierarchy structure from the data. We develop a collapsed Gibbs sampling algorithm for posterior inference, conduct extensive validation using synthetic networks, and demonstrate the utility of our model in real-world datasets such as predator-prey networks and citation networks.

Keywords: Hierarchical network analysis, Latent space model, Bayesian nonparametrics, Gibbs sampler

1. INTRODUCTION

How do complex networks and their self-organization arise from coordinated interactions and information sharing among the actors? One way to tap into this question is to understand the latent structures over actors which lead to the formation and organization of these networks. In particular, we are interested in uncovering the functional/sociological communities of network actors, and their influence on network connections. We consider a community to be a group of actors that share a common theme, like a clique of football fans in a social network, or an ecosystem of dependent organisms in a biological food web. Our objective is to gain a deeper understanding of the relationships within and among these communities, so as to shed insight into the network topology.

More specifically, we seek to address three critical aspects of network modeling and community discovery:

  1. Hierarchy — not all communities are equal: a community can contain sub-communities, or be contained by super-communities. This is a natural way to structure the latent space of actors.

  2. Multiscale Granularity — we must distinguish between coarse or generic associations that may occur in a large super-community, as opposed to fine grained interactions that occur within or among small, closely-interconnected sub-communities.

  3. Assortativity/Disassortativity — some communities have strong within-community interactions and weak cross-community interactions (assortativity), yet others may exhibit the reverse (disassortativity).

These aspects are not independent, but are strongly interrelated. As an example, consider an oceanic food web (Figure 1), a directed network with species as actors and predator-prey relationships as edges. This network exhibits hierarchy: cold-blooded animals and mammals are large super-communities that can be sub-divided into smaller sub-communities, such as sharks and squid, or toothed whales and pinnipeds. These sub-communities can in turn be divided into even smaller communities (not shown). The ideas of hierarchy and network should not be confused with each other. The hierarchy is an organization of actors in some latent space learned from the observed network.

Figure 1.


Illustration of an oceanic food web as a set of nested communities (Left), and the corresponding hierarchy of communities (Right). Vertices in the network represent individual species, and directed edges represent predator-prey relationships (not all shown). Solid edges are fine-grained, specific interactions, while dashed edges are coarse-grained and generic to a large community.

Next, the predator-prey relationships in the ocean are multiscale. Consider a sperm whale: it occasionally eats fish, which are common prey for many oceanic animals. Hence, this “sperm whale, fish” interaction is generic. Moreover, sperm whales usually eat giant squid, which are prey specific to them (making this interaction fine-grained). It is important to differentiate between such interactions of different scale.

Finally, the toothed whale sub-community demonstrates both assortative and disassortative behavior. Many toothed whales feed on small fish and seals, which are cross-community interactions. However, whales such as orcas feed on other whales, which are within-community interactions.

We propose a nonparametric Multiscale Community Blockmodel (MSCB) that presents a unified approach to address these three concerns. Using the nested Chinese Restaurant Process (Blei, Griffiths and Jordan 2010) as a nonparametric structural prior, our model infers the appropriate hierarchy from the data, without requiring the user to pre-specify the branching factor at each node. Moreover, by exploiting latent space ideas from Blei et al. (2003) and Airoldi et al. (2008), we uncover the coarse/fine-grained interactions that underlie the network. Finally, our model builds upon the blockmodel concept (Wang and Wong 1987; Airoldi, Blei, Fienberg and Xing 2008) to integrate assortativity and disassortativity into our hierarchy. In order to use our model, we develop an MCMC algorithm for posterior inference and hyperparameter estimation, and study its performance on simulated and real datasets.

In particular, our qualitative studies are centered on two networks: a 75-species food web of grass-feeding wasps and their parasites (Dawah, Hawkins and Claridge 1995), and a subset of the arXiv High-Energy Physics (Theory) citation network (KDD 2010). The latter network originally contains 27,770 papers, but due to algorithmic limitations, we focus on a 1,000-paper subset published from Jan 2002 through May 2003. Using our MCMC algorithm, we uncover hierarchies from both networks, and analyze these in the context of nodal attributes such as species’ trophic levels (i.e. parasite, herbivore or plant) and journal paper titles. On the food web, we recover a hierarchy that, at a high level, follows intuitive trophic divisions: parasites are grouped together, and similarly for herbivores or plants. This is in contrast to the hierarchy-finding model of Clauset et al. (2008), whose recovered hierarchy is centered around highly-connected species rather than trophic levels. On the arXiv citation network, our recovered hierarchy splits the papers into specific research topics, which corresponds to the perception that most scientific research is conducted by highly-specialized communities of researchers. Finally, we support our qualitative studies with simulations that reveal conditions under which our algorithm outperforms other statistical and non-statistical network models.

1.1 Comparison to Existing Work

Existing methods for graph clustering and inferring community structure may address one or two of the aspects we have described, yet none capture all three aspects simultaneously. To begin, methods such as Girvan and Newman (2002), Hoff et al. (2002), Handcock et al. (2007), Krause et al. (2003) and Guimera and Amaral (2005) cannot discover disassortative communities characterized by weak within-community and strong cross-community interactions. Furthermore, they do not explicitly model organizational structure — and by extension, multiscale granularity of interactions. These methods do not meet any of our criteria, and are therefore unsuited for our purposes.

A common strategy for learning hierarchies is divisive or top-down: begin by learning first level clusters using a non-hierarchical network clustering algorithm, and then recursively apply the same algorithm to the subgraphs corresponding to each cluster. Spectral methods (Chung 1997) are a popular choice for top-down network clustering, not least because they approximate a graph cut objective function, making them a natural fit for networks. However, the biggest issue with divisive strategies is that deeper divisions cannot retroactively influence shallower ones, leading to situations in which poor choices made early in the algorithm doom the final outcome. In contrast, our hierarchical model specifies a probability distribution over the space of all possible K-level hierarchies, making it immune to such issues in principle — though its effectiveness in practice will admittedly depend on the particular inference algorithm used.

The counterpart to divisive clustering is agglomerative or bottom-up clustering, in which network entities are repeatedly merged to form larger and larger clusters. One popular version of this strategy, as adopted in the software program Pajek (Batagelj and Mrvar 1998), is to generate a dissimilarity matrix from the network, such as the corrected Euclidean-like dissimilarity of Batagelj et al. (1992), and then perform agglomerative clustering using Ward’s criterion (Ward 1963). As with top-down clustering, bottom-up clustering suffers from an inability to retroactively apply information gleaned from later merges, highlighting the need for probabilistic models that consider all possible hierarchies at once. In our experiments, we shall compare our method to the top-down spectral clustering strategy, as well as the bottom-up Pajek strategy.

The Mixed Membership Stochastic Blockmodel (MMSB) (Airoldi et al. 2008) is a “mixture of features” model, in that it aims to discover the multiple latent “roles” played by each actor in the network; additionally, it employs a blockmodel to accommodate both disassortative and assortative types of interactions. While the multi-role memberships discovered by MMSB are similar to our notion of coarse/fine-grained interactions, they are not identical; furthermore, MMSB does not induce a hierarchical structure over the actors. These considerations prevent MMSB from modeling the organized network phenomena that our model is designed to explore. Another “mixture of features” latent space model is that of Miller et al. (2009), which allows each actor to take on multiple binary features in an infinite-dimensional space. Like MMSB, this model does not learn a structure over its latent space, and therefore cannot replicate our model’s ability to discover community hierarchies.

Finally, methods such as Clauset et al. (2004), Radicchi et al. (2004) and Kemp et al. (2008) explicitly model some form of organizational structure, but do not permit actors to have multiple kinds of interactions, which precludes them from learning the kind of multiscale interactions we have described. Our MSCB model is perhaps most closely related to the Infinite Relational Model (IRM) (Kemp, Tenenbaum, Griffiths, Yamada and Ueda 2006), which is a special case of MSCB. More specifically, the IRM is equivalent to MSCB with a hierarchy depth of one (i.e. a flat hierarchy with no multiscale membership), an unsurprising fact given that the IRM is a nonparametric generalization of the stochastic blockmodel (Wang and Wong 1987), from which MSCB inherits some of its key attributes. Roy et al. (2007) have generalized the IRM in a different way for hierarchical group discovery, and further extended their work to the nonparametric setting with Mondrian Processes (Teh and Roy 2009). However, these two models are limited to binary hierarchies, and furthermore, they adopt a notion of multi-community membership that is not directly comparable to ours. In contrast, our model assumes no limit on the hierarchy’s branching factor, which is more realistic for certain networks.

2. MULTISCALE COMMUNITY BLOCKMODEL (MSCB)

We now describe aspects of our model, beginning with the hierarchy and ending with network generation. Throughout, we adopt an oceanic food web as a running example (Figure 1).

2.1 The Community Hierarchy

Our model organizes network actors into a depth-K tree-shaped hierarchy, where each node in the tree represents a community. The tree contains a single root node at level 0, followed by K levels of nodes; parent nodes can have any number of child nodes. Nodes closer to the root represent large super-communities, (e.g. the “cold-blooded animals” and “mammals” in Figure 1), while nodes closer to the terminal leaves represent finer-grained sub-communities (e.g. “toothed whales” or “sharks”).

The most important aspect of our hierarchy is that each actor is associated with not just one community, but with an entire path of super- through sub-communities, starting from a level-1 node and ending with a level-K terminal leaf. In our food web example, the sperm whale follows the path [mammal, toothed whale]. More formally, we represent actor i’s path by a random-valued, K-length path vector ci, where the k-th element of ci represents the branch choice taken at level k. Referring to our oceanic food web in Figure 2, we see that “mammals” are the 2nd branch from the root “sea creatures”, while “toothed whales” are the 1st branch from “mammals” — hence the sperm whale has a path vector of c3 = [2, 1] in this hierarchy. Henceforth, we shall use the term path to refer to path vectors ci.

Figure 2.


Oceanic food web from Figure 1, annotated with three species’ path vectors ci.

It is important to note that we never represent our hierarchy explicitly; it is implicit in the set of all actor paths {c1, …, cN}. For example, given c1 = [1, 1], c2 = [1, 2], and c3 = [2, 1], we can recover a hierarchy with two level-1 nodes, carrying 2 and 1 child nodes respectively at level 2. This implicit representation is convenient for probabilistic inference, as we shall demonstrate later.

2.2 A Measure over Hierarchies

Because the number of actor paths N is finite, the space of meaningful hierarchies is also finite. Unfortunately, this space is still very large, which makes model selection (i.e. hierarchy selection) an inherently difficult task. In order to solve this problem, we make use of a measure over sets of actor paths {c1, …, cN} (recall that these paths implicitly define a depth-K hierarchy), the Nested Chinese Restaurant Process (nCRP) (Blei et al. 2010). As its name suggests, the nCRP is a recursive version of the Chinese Restaurant Process (CRP) (Aldous 1985; Teh, Jordan, Beal and Blei 2006), the latter being an exchangeable distribution over partitions of a finite set of objects (i.e. ways to divide up the objects).

To clearly present the nCRP, a brief overview of the CRP and the Dirichlet Process (DP) (Ferguson 1973) is in order. In Bayesian nonparametric mixture modeling, the DP is employed as a prior over a countably infinite number of mixture components (Escobar and West 1995; MacEachern and Müller 1998). However, the DP can only model flat (i.e. without hierarchy) nonparametric mixture models, which represent an extreme case of our model when the hierarchy has only K = 1 level. One particularly useful view of the DP is given by the Pólya urn scheme (Blackwell and MacQueen 1973), which provides a closed form for the i-th data point’s mixture component, given the mixture components of the previous i − 1 data points. If we disregard the locations of the mixture components and simply focus on their identities, the Pólya urn scheme becomes the CRP. Recall that the CRP is an exchangeable distribution over partitions of objects: in the context of the DP, the objects are data points, which are placed into parts (i.e. divisions of the partition) that represent mixture components.

In essence, the CRP allows us to separate the identities of the DP mixture components from their locations in the data domain (as drawn from a base measure). By exploiting this separation, Blei et al. (2010) generalize the CRP to the nCRP, transforming the flatly-organized mixture components of the former into the tree-shaped hierarchy of the latter.

Nested Chinese Restaurant Process

The nCRP, as defined by Blei et al. (2010), is a measure over sets of tree paths {c1, …, cN} of infinite length, though for our purposes we shall restrict the nCRP to paths of length K. The nCRP is most intuitive when described generatively: beginning with actor 1, each actor chooses its tree path in turn, conditioned on the existing paths chosen by previous actors. Consider the i-th actor: he begins at the root, and needs to choose which level-1 branch to take. With probability $n^{(1)}_{x,i-1}/(i-1+\gamma)$ he selects branch x already in the tree, or with probability $\gamma/(i-1+\gamma)$ he starts a completely new branch. Here, $n^{(1)}_{x,i-1}$ is the number of actors before i that chose branch x at level 1, and γ is a hyperparameter dictating the probability that an actor will start a new branch.

Actor i continues this process as he descends the tree. When picking a branch at level k, with probability $n^{(k)}_{y,i-1}/(n^{(k-1)}_{i-1}+\gamma)$ he selects existing branch y, and with probability $\gamma/(n^{(k-1)}_{i-1}+\gamma)$ he starts a new branch. Here, $n^{(k-1)}_{i-1}$ counts the number of actors 1, …, i − 1 having the same path as actor i up to (and including) level k − 1. Out of these actors, $n^{(k)}_{y,i-1}$ is the number that picked branch y (at level k). This sequence of branch choices defines the path ci of actor i, and as we have explained previously, the set of all actor paths implicitly defines the hierarchy. Note that our model limits the hierarchy to a maximum depth of K.

We now provide a more formal definition of the nCRP. In order to do so, we must first introduce the (original) Chinese Restaurant Process (CRP), an exchangeable distribution over partitions (of a set of objects) (Aldous 1985; Teh et al. 2006). For concreteness, we shall represent an object’s assigned part by a positive integer — as an example, suppose there are three random variables X1, X2, X3 corresponding to a partition of three objects, then X1 = 1, X2 = 1, X3 = 2 represents a partition where objects 1 and 2 are in one part, and object 3 is in another part. Note that this scheme allows every partition to be represented by infinitely many assignments, e.g. X1 = 3, X2 = 3, X3 = 2 represents the same partition as the earlier assignment. Despite this non-identifiability, the positive-integer representation turns out to be convenient for describing and implementing our Gibbs sampling inference procedure.

Let cik denote the k-th element of the i-th actor path, i.e. the branch taken at level k by actor i. In our model, the collection of all actor path first-level branch choices, which we denote as {ci1}, forms a partition distributed according to a CRP prior:

$$\mathbb{P}(c_{i1}=x \mid c_{1:(i-1),1}) = \begin{cases} \dfrac{\left|\{j<i : c_{j1}=x\}\right|}{i-1+\gamma} & x \in \{c_{1:(i-1),1}\} \\[6pt] \dfrac{\gamma}{i-1+\gamma} & x \text{ is the smallest positive integer not in } \{c_{1:(i-1),1}\} \end{cases} \qquad (1)$$

where γ > 0 is a “concentration” parameter that controls the probability of drawing new integers, and c1:(i−1),1 = (c11, …, c(i−1)1) denotes the set of 1st-level elements from paths c1 through ci−1. Again, we stress that different assignments to {ci1} can correspond to the same partition. High values of γ imply that partitions with more parts are more likely.
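As a concrete illustration (not part of the model specification), the sequential draw in Equation (1) can be sketched in a few lines of Python; the function name `crp_draw` and the list-of-integers representation are our own:

```python
import random

def crp_draw(prev, gamma, rng=random):
    """Draw the part assignment of the next object under a CRP with
    concentration gamma, given the previous objects' assignments.
    An existing part x is chosen with probability proportional to its
    size; a new part, labeled by the smallest unused positive integer,
    is chosen with probability proportional to gamma."""
    new_label = 1
    while new_label in prev:
        new_label += 1
    existing = sorted(set(prev))
    labels = existing + [new_label]
    # Unnormalized weights; the common denominator is (i - 1 + gamma).
    weights = [prev.count(x) for x in existing] + [gamma]
    return rng.choices(labels, weights=weights, k=1)[0]

# The first object always starts a new part labeled 1:
# crp_draw([], 1.0) == 1
```

Drawing assignments sequentially with `crp_draw` builds up one partition; exchangeability guarantees the resulting distribution does not depend on the order in which objects arrive.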

The nCRP (Blei et al. 2010) extends the CRP to recursively nested partitions. The nCRP can be thought of as a hierarchy of CRPs, beginning with a single CRP at the top level. To each unique integer x drawn from the top-level CRP, we associate a child CRP with |{i | ci1 = x}| observations, resulting in a two-level tree of CRPs. We can repeat this process ad infinitum on the newly-created child CRPs, resulting in an infinite-level tree of CRPs; however, we only use a K-level nCRP (denoted as nCRPK) as we have limited the maximum hierarchy depth to K. In our model, all child CRPs of the nCRP share the same concentration parameter γ, and high values of γ make “branchier” trees more likely. We note that one could easily have different γ’s for different tree levels, but do not discuss this modification.

Our model treats the collection of all actor paths {ci} as a recursively nested partition (of depth K), distributed according to an nCRPK(γ) prior:

$$\mathbb{P}(c_{ik}=x \mid c_{1:(i-1)}, c_{i,1:(k-1)}) = \begin{cases} \dfrac{\left|\{j<i : c_{j,1:(k-1)}=c_{i,1:(k-1)},\; c_{jk}=x\}\right|}{\left|\{j<i : c_{j,1:(k-1)}=c_{i,1:(k-1)}\}\right| + \gamma} & x \in \{c_{jk} : j<i,\; c_{j,1:(k-1)}=c_{i,1:(k-1)}\} \\[6pt] \dfrac{\gamma}{\left|\{j<i : c_{j,1:(k-1)}=c_{i,1:(k-1)}\}\right| + \gamma} & x \text{ is the smallest positive integer not in the above set,} \end{cases} \qquad (2)$$

where in the first line, c1:(i−1) = {c1, …, ci−1} is the set of complete paths c1 through ci−1, and ci,1:(k−1) = {ci,1, …, ci,k−1} contains the 1st through (k − 1)-th elements of path ci. Each actor path ci is represented by a length-K vector of positive integers, and the distribution above enables us to draw their elements one at a time: c11, …, c1K, c21, …, c2K, and so on. As with the original CRP, different assignments to {ci} can correspond to the same set of (nested) partitions.
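The nested draw of Equation (2) amounts to running a separate CRP at each level, restricted to the earlier actors that share the current path prefix. A minimal sketch, under our own naming and representation (paths as lists of positive integers):

```python
import random

def ncrp_draw_path(prev_paths, K, gamma, rng=random):
    """Draw one K-length actor path under nCRP_K(gamma), conditioned on
    the previously drawn paths: at each level k, run a CRP among the
    earlier actors whose paths share the prefix chosen so far."""
    path = []
    for k in range(K):
        # Branch choices at level k among actors sharing our prefix.
        siblings = [p[k] for p in prev_paths if p[:k] == path]
        counts = {}
        for b in siblings:
            counts[b] = counts.get(b, 0) + 1
        new_label = 1
        while new_label in counts:
            new_label += 1
        labels = list(counts) + [new_label]
        weights = list(counts.values()) + [gamma]
        path.append(rng.choices(labels, weights=weights, k=1)[0])
    return path

# The very first actor deterministically takes branch 1 at every level:
# ncrp_draw_path([], 3, 1.0) == [1, 1, 1]
```

Repeated calls, each appending the returned path to `prev_paths`, generate the full set {c1, …, cN} and hence an implicit hierarchy.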

2.3 Multiscale Membership

In conjunction with the actor paths, MSCB’s notion of Multiscale Membership (MM) distinguishes it from other hierarchical clustering methods. Briefly, each actor i is endowed with a probability distribution over the hierarchy nodes in its path. We denote this distribution by a Multiscale Membership (MM) vector θi, a K-dimensional multinomial that encodes an actor’s tendencies to interact as a member of the different super- and sub-communities along its K-length path. Specifically, θik, the k-th element of θi, is the probability that actor i will interact as a member of the k-th community along its path from the root. Through MM, our model is able to accommodate multiscale granularity of interactions, i.e. the notion that some interactions are finer or coarser than others.

To use our food web example (Figure 2), consider two species of toothed whales: dolphins and sperm whales. Although both share the same hierarchy path, [mammal, toothed whale], they behave quite dissimilarly. A dolphin’s diet consists mainly of fish, which are common prey for many mammals. Thus, we would say it typically interacts as a member of the mammal super-community, though it occasionally chooses prey that are more specific to its species. On the other hand, a sperm whale barely eats fish, so it rarely interacts as a member of its super-community; instead, it prefers giant squid, a more specific prey uncommon to most mammals. Thus, the sperm whale has a higher probability of participating in fine-grained interactions (say, θ = [0.1, 0.9]) unlike the dolphin, which is more likely to pursue coarse-grained interactions (for example, θ = [0.8, 0.2]).

At this juncture, it is worth comparing our Multiscale Membership to the Mixed-Membership Stochastic Blockmodel (MMSB) (Airoldi et al. 2008). The latter model endows each actor with a distribution over latent roles, just as our Multiscale Membership vector provides each actor with a distribution over communities. There is a key difference, however: MMSB’s Mixed Membership vectors permit each actor to have a distribution over all latent roles, whereas our model’s Multiscale Membership vectors constrain each actor to distributions only over super- and sub-communities on its path. This restriction is crucial: if actors were allowed to have distributions over all hierarchy communities, the hierarchy could be rendered virtually meaningless — for instance, we could say that dolphins are simultaneously members of the shark and toothed whale communities, which is plainly unrealistic.

Just as with the paths ci, we place a suitable prior on the MM vectors θi: a truncated two-parameter stick breaking process (Blei et al. 2010), denoted by StickK(m, π) where 0 < m < 1 and π > 0. This prior is conjugate to the K-dimensional multinomial distribution, which will be important when deriving our Gibbs sampler inference algorithm. Furthermore, unlike the Dirichlet distribution (which is also conjugate to the multinomial), the stick breaking process makes it easy to bias the posterior of θi towards coarser interactions (i.e. elements θik have more probability mass for k closer to 1) or finer interactions (more mass for k closer to K). In contrast, a single-parameter Dirichlet prior cannot accommodate coarse/fine biases, while a full K-parameter Dirichlet prior may be too expressive — for instance, we do not anticipate needing a bimodal prior on θik for our applications.

Truncated Two-parameter Stick Breaking Process

As its name suggests, the StickK(m, π) distribution generates multinomial parameters θi via the analogy of breaking sticks into pieces. Beginning with a stick of length 1, draw the first stick fraction Vi1 ~ Beta(mπ, (1 − m)π) and let the first stick length be θi1 = Vi1, leaving 1 − θi1 as the remainder of the stick. To get the second stick length θi2, draw Vi2 ~ Beta(mπ, (1 − m)π) and break this fraction off from the remainder, giving θi2 = Vi2(1 − Vi1). We repeat this process until we have K sticks, after which we discard the remainder and renormalize the sticks to get θi.

Formally, let Vik be the k-th fraction to break off from the stick’s remainder, and let θik be the length of the k-th stick. To draw θi ~ StickK(m, π), we first draw Vik ~ Beta(mπ, (1 − m)π) for k ∈ {1, …, K} and define θik to be

$$\theta_{ik} \propto V_{ik} \prod_{u=1}^{k-1} (1 - V_{iu}) \qquad (3)$$

with normalization factor $\sum_{k=1}^{K} \theta_{ik}$. We note that Blei et al. (2010) called this distribution a “two-parameter GEM distribution”, and let K → ∞ to obtain an infinite-dimensional prior.

Intuitively, the parameter 0 < m < 1 influences the posterior mean of θi; m → 1 results in elements θik having more mass for k closer to 1, while m → 0 results in more mass for k closer to K. The other parameter π > 0 indicates our confidence in the stick breaking prior; π → ∞ indicates more confidence, causing the posterior mean of θi to approach the prior mean, and the posterior variance of θi to approach zero.
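The truncated stick-breaking construction transcribes directly into code. The following sketch uses our own naming; it draws the K Beta fractions, forms the stick lengths of Equation (3), and renormalizes:

```python
import random

def stick_k(K, m, pi, rng=random):
    """Sample theta ~ Stick_K(m, pi): draw K fractions
    V_k ~ Beta(m*pi, (1-m)*pi), break each off the remaining stick,
    discard the final remainder, and renormalize the K stick lengths."""
    theta, remainder = [], 1.0
    for _ in range(K):
        v = rng.betavariate(m * pi, (1.0 - m) * pi)
        theta.append(v * remainder)   # length of this stick
        remainder *= (1.0 - v)        # what is left to break from
    total = sum(theta)
    return [t / total for t in theta]
```

With m close to 1 the early entries of θ tend to dominate (coarse interactions), while m close to 0 shifts mass toward θK (fine interactions), matching the intuition above.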

2.4 Network Edge Generation

We now explain how the paths ci and MM vectors θi generate edges in the network. At this point, we must introduce additional notation: let E be the N × N adjacency matrix of observed network edges, where element Eij corresponds to the directed edge or interaction/relationship from actor i to j. In the context of our food web, the actors are sea creatures such as dolphins and sperm whales, and the edges represent predator-prey interactions. A value of Eij = 1 indicates that the interaction is present, while Eij = 0 indicates absence, and we ignore self-edges Eii. Because we are modeling a directed network, E is not necessarily symmetric. If we need to model symmetric relationships such as friendship, we can let Eij = Eji for all i, j.

To begin, we shall introduce MSCB’s generative process for network edges:

  • For each actor i ∈ {1, …, N}:
    • Sample i’s hierarchy path ci ~ nCRPK(γ).
    • Sample i’s Multiscale Membership (MM) vector θi ~ StickK(m, π). Note that θi and ci are drawn independently.
  • To generate the network, for each possible directed edge Eij:
    • Sample donor level $\overrightarrow{z}_{ij}$ ~ Multinomial(θi), and let h = $c_i[\overrightarrow{z}_{ij}]$. Formally, h represents the community at level $\overrightarrow{z}_{ij}$ on path ci, i.e. the $\overrightarrow{z}_{ij}$-th element of ci.
    • Sample receiver level $\overleftarrow{z}_{ij}$ ~ Multinomial(θj), and let h′ = $c_j[\overleftarrow{z}_{ij}]$.
    • Sample the edge Eij ~ Bernoulli(SB(h, h′)). We shall define SB() later.

The basic idea is as follows: for every directed edge Eij, both actor i (the donor) and actor j (the receiver) pick communities h and h′ from their respective paths ci, cj, according to the levels drawn from their MM vectors θi, θj. The communities h, h′ are then used to select a community compatibility parameter via the function SB(h, h′), which in turn generates Eij ~ Bernoulli(SB(h, h′)). Note that the arrow in $\overrightarrow{z}_{ij}$ or $\overleftarrow{z}_{ij}$ indicates whether the variable belongs to the donor actor (i for $\overrightarrow{z}_{ij}$) or receiver actor (j for $\overleftarrow{z}_{ij}$), with respect to the edge Eij from i to j. The arrow does not indicate the edge direction between i and j.
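The per-edge generative steps can be sketched as follows. This is an illustration only: levels are 0-indexed for convenience, the compatibility lookup `sb` (playing the role of the SB() function of Section 2.5) is passed in by the caller, and all names are our own:

```python
import random

def generate_edges(paths, thetas, sb, rng=random):
    """Sketch of MSCB edge generation. For each ordered pair (i, j),
    draw a donor level from theta_i and a receiver level from theta_j,
    then draw E_ij ~ Bernoulli(p) where p comes from the supplied
    compatibility lookup sb(c_i, donor_level, c_j, receiver_level)."""
    N = len(paths)
    E = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i == j:
                continue  # self-edges E_ii are ignored
            z_don = rng.choices(range(len(thetas[i])), weights=thetas[i], k=1)[0]
            z_rec = rng.choices(range(len(thetas[j])), weights=thetas[j], k=1)[0]
            p = sb(paths[i], z_don, paths[j], z_rec)
            E[i][j] = 1 if rng.random() < p else 0
    return E
```

Because donor and receiver levels are drawn independently for every directed pair, the resulting adjacency matrix is not symmetric in general, as required for directed networks.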

2.5 Community Compatibility Matrices B

We turn our discussion to the community compatibility parameters that define edge probabilities between communities, as well as how the SB() function selects them during edge generation. Intuitively, the compatibility from community h to h′ is high if actors from h often interact with actors from h′. Conversely, a low compatibility indicates that actors from h rarely interact with actors from h′. Thus, it is natural that we define compatibility to be a Bernoulli parameter in [0, 1], where 1 indicates perfect compatibility. This notion of compatibility is what allows our model to account for both assortative and disassortative behavior, in similar fashion to stochastic blockmodels (Wang and Wong 1987) — for example, strongly assortative networks correspond to high compatibility parameters when h = h′, i.e. when the source and destination communities are the same.

There are many ways to associate compatibility parameters with pairs of communities h, h′, and the challenge is to find a scheme that tightly integrates compatibility parameters with the hierarchy and multiscale interactions over communities. A first attempt might be to ignore the hierarchy, and place a full H × H compatibility matrix over all community pairs h, h′ (where H is the total number of nodes in the hierarchy); this is analogous to the role-compatibility matrix used in MMSB (Airoldi et al. 2008). However, this formulation fails to capture the multiscale nature of interactions, because there is simply no connection between the compatibility parameter for h, h′ and those communities’ levels in the hierarchy.

In order to connect the community parameters with the hierarchy levels, we must restrict them in some meaningful way. First, we need to define the notion of a sibling group. Formally, a sibling group is a largest possible set of communities such that (1) all communities are at the same level, and (2) have the same immediate parent community. To put it another way, every parent (including the level-0 root) contains a sibling group comprised of its immediate children (but not grandchildren, or great-grandchildren, etc.). Hence, if there are P parent nodes, there are P sibling groups, and all sibling groups are disjoint from one another. To give a concrete example, if we have three paths, c1 = [1, 1], c2 = [1, 2], and c3 = [2, 1], then the hierarchy contains 3 sibling groups: one at level 1 with communities {[1], [2]}, and two at level 2 with communities {[1, 1], [1, 2]} and {[2, 1]} respectively.
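The sibling groups can be enumerated directly from the actor paths. The following sketch, using our own representation (each node identified by its path prefix, with the empty tuple as the root), reproduces the three-group example above:

```python
from collections import defaultdict

def sibling_groups(paths):
    """Enumerate sibling groups implied by a set of actor paths.
    Returns a dict mapping each parent node (a path prefix, including
    the empty tuple for the level-0 root) to the set of its immediate
    children; each value is one sibling group."""
    groups = defaultdict(set)
    for path in paths:
        for k in range(len(path)):
            parent = tuple(path[:k])
            groups[parent].add(tuple(path[:k + 1]))
    return dict(groups)

# Example from the text: c1=[1,1], c2=[1,2], c3=[2,1]
# yields three sibling groups, one per parent node.
```

Each returned group corresponds to one compatibility matrix B, whose side length is the size of that group.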

Each sibling group is associated with its own compatibility matrix B, which contains the compatibility parameters for every pair of communities within that sibling group; refer to Figure 4 for an example illustration. This scheme restricts the community parameters — notice that communities from different sibling groups do not have explicit community parameters between them; we shall discuss how this affects edge generation shortly. Also, since MSCB infers the number of hierarchy nodes from the data by way of the nCRP prior, we cannot know a priori the number and sizes of the sibling group compatibility matrices. We will return to this issue when we discuss our inference procedure.

Figure 4.


Four sibling groups in an example hierarchy, and the sizes of their compatibility matrices B.

Now that we have defined the sibling group compatibility matrices, we can describe how network edges are generated from them. Recall that the edge generative process picks two interacting communities h, h′ according to paths ci, cj and MM vectors θi, θj. When h, h′ are at the same level and share the same immediate parent, we simply pick their community compatibility parameter $B_{h,h'}$ from their associated sibling group matrix, and draw the edge Eij ~ Bernoulli($B_{h,h'}$). However, if h, h′ do not share the same parent, then we invoke the following coarsening procedure:

  1. Recall that h = $c_i[\overrightarrow{z}_{ij}]$ and h′ = $c_j[\overleftarrow{z}_{ij}]$.

  2. Set $z_{\min}$ ≔ $\min(\overrightarrow{z}_{ij}, \overleftarrow{z}_{ij})$, i.e. the smaller of the two interaction levels.

  3. If the coarsened communities $h_{\mathrm{coarse}} = c_i[z_{\min}]$ and $h'_{\mathrm{coarse}} = c_j[z_{\min}]$ are in the same sibling group, then we look up their compatibility matrix entry $B_{h_{\mathrm{coarse}},h'_{\mathrm{coarse}}}$. We then generate Eij ~ Bernoulli($B_{h_{\mathrm{coarse}},h'_{\mathrm{coarse}}}$).

  4. Otherwise, set zminzmin − 1 and repeat step 3. This effectively sets hcoarse to its immediate parent (and likewise for hcoarse).

To put it another way, the coarsening procedure finds the deepest common hierarchy node (which could be the root node) shared by the paths ci and cj whose level is strictly shallower than min(zi→j, zi←j). Supposing that this deepest common node has level zmin − 1, we then draw the compatibility parameter Bci[zmin],cj[zmin] and generate Eij according to it. Notice that the hierarchy nodes ci[zmin], cj[zmin] are part of the same sibling group by construction, so the compatibility parameter Bci[zmin],cj[zmin] always exists. Now, we can formally define our SB() function from the previous section:

\[
SB(h, h') = B_{h_{\mathrm{coarse}},\, h'_{\mathrm{coarse}}}, \quad \text{where} \quad
h = c_i[z_{i\to j}], \quad h' = c_j[z_{i\leftarrow j}], \quad
h_{\mathrm{coarse}} = c_i[z_{\min}], \quad h'_{\mathrm{coarse}} = c_j[z_{\min}], \quad
z_{\min} = 1 + \max\{\, z : 0 \le z < \min(z_{i\to j}, z_{i\leftarrow j}),\; c_i[z] = c_j[z] \,\}, \tag{4}
\]

and where we adopt the convention ci[0] = cj[0] for all i, j, representing the fact that all paths implicitly share the root node. Hereafter, we shall employ the shorthand SBij ≡ SB(h, h′) for brevity.
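A minimal sketch of the coarsening lookup behind Eq. (4), under the assumption that paths are represented as tuples of branch indices; the function name and representation are ours, not from the paper:

```python
def sb_lookup(c_i, c_j, z_ij, z_ji):
    """Sketch of the coarsening lookup behind Eq. (4). A hierarchy
    node at level z is identified by the length-z prefix of a path
    tuple; two nodes are siblings iff their length-(z-1) prefixes
    (their parents) coincide. Returns the coarsened level and the
    pair of sibling communities indexing B. Names are illustrative."""
    z_min = min(z_ij, z_ji)
    # walk upward until the coarsened communities share a parent
    while c_i[:z_min - 1] != c_j[:z_min - 1]:
        z_min -= 1
    return z_min, c_i[:z_min], c_j[:z_min]

# Paths [1, 1] and [2, 1], both interacting at level 2: their deepest
# common node is the root (level 0), so coarsening stops at level 1
# and the edge is drawn from the root's sibling group matrix.
z_min, h_coarse, h_coarse2 = sb_lookup((1, 1), (2, 1), 2, 2)
# z_min == 1, h_coarse == (1,), h_coarse2 == (2,)
```

When the two communities already share a parent, as with paths [1, 1] and [1, 2], the loop exits immediately and no coarsening occurs.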

Finally, in keeping with our use of Bayesian inference, we place a Beta(λ1, λ2) prior over every element Bx,y of every sibling group compatibility matrix. This adds the following step to our generative process:

  • For each element x, y of each sibling group compatibility matrix:
    • Sample Bx,y ~ Beta(λ1, λ2).

This step comes after generating paths ci, but before generating edges Eij. A graphical model representation of our full generative process can be found in Figure 3. In summary, our use of sibling group compatibility matrices enforces the role of the hierarchy during edge generation, thus distinguishing MSCB from “flat” blockmodels such as Wang and Wong (1987) and Airoldi et al. (2008).

Figure 3.

Graphical model representation of our MSCB model. Diamonds represent model parameters, hollow circles represent hidden random variables, and filled circles represent observed random variables. Variables inside a numbered rectangle are duplicated by that number, e.g. there are N variables θi, numbered from θ1 through θN. Arrows denote probabilistic dependencies, and these are annotated with the probability distribution they represent: for instance, θi ~ StickK(m, π), zi→j ~ Multinomial(θi), etc.

3. A DISCUSSION ON MODEL IDENTIFIABILITY

In general, MSCB is not identifiable for hierarchy depths K > 1; the lack of identifiability implies that, for a given observed network E, there will be multiple hierarchies that produce the same model likelihood. However, the degree to which MSCB is non-identifiable can be compared to two other models, the Infinite Relational Model (IRM) (Kemp et al. 2006) and the Mixed Membership Stochastic Blockmodel (MMSB) (Airoldi et al. 2008).

There are two ways in which MSCB is non-identifiable, and it is important to distinguish between them. First, observe that the communities can be permuted without changing the model likelihood. This issue is common to practically all clustering models and algorithms; even the regular stochastic blockmodel (Wang and Wong 1987) suffers from it. Nevertheless, this type of non-identifiability is rarely a problem: in most clustering applications, the clusters are meant to be inspected by humans, or else further criteria can be applied post-hoc to select clusters of interest. The second, more critical form of non-identifiability arises because the MM vectors θi are mixtures. We shall discuss this further when we compare MSCB to the MMSB.

Observe that when K = 1 or m → 1, the MM vectors θi are reduced to point masses at the first element θi1, which eliminates the second non-identifiability (mixture non-identifiability). In this form, MSCB reduces to a nonparametric stochastic blockmodel (by nonparametric, we mean that the number of roles/clusters is automatically selected), identical to the IRM, and with only permutation non-identifiability to worry about. Although these limiting cases do not produce meaningful hierarchies and are thus uninteresting from an application perspective, they remain useful as a kind of lower bound on MSCB’s non-identifiability.

For the general case in which the MM vectors θi are nonzero in more than one dimension, we can gain insight by comparing MSCB to the MMSB. The latter is essentially a stochastic blockmodel with L communities/roles, but unlike the stochastic blockmodel, network entities are not restricted to just one role. Instead, each network entity i has a distribution ϕi over all L roles, from which every edge touching entity i draws its own (possibly different) role assignment; this process is similar to how our MSCB allows edges to draw different level assignments. Because of the entity role mixtures ϕi, MMSB suffers from mixture non-identifiability, though it has been successfully applied to various datasets (Airoldi et al. 2008). Formally, the MMSB marginal likelihood of Eij = 1 conditioned on the role mixtures (but not the edge role assignments) is ϕi⊤Bϕj, where B is the blockmatrix. Observe that for any orthonormal matrix U, we have ϕi⊤(U⊤U)B(U⊤U)ϕj = ϕi⊤Bϕj. Assuming a permutation-unbiased prior on the ϕi, this implies that the MMSB model likelihood remains the same if we transform B to UBU⊤ and all ϕi to Uϕi. In short, MMSB is non-identifiable up to orthonormal transformations.
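This invariance can be checked numerically. The following toy example (our own, with arbitrary numbers) verifies that rotating both the blockmatrix and the membership vectors by an orthonormal U leaves ϕi⊤Bϕj unchanged:

```python
import math

def matmul(A, B):
    """Multiply two small matrices represented as lists of rows."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def quad_form(u, M, v):
    """Compute u^T M v."""
    return sum(u[r] * M[r][c] * v[c]
               for r in range(len(u)) for c in range(len(v)))

# A 2-role toy check (all numbers illustrative): an orthonormal
# rotation U applied to both B and the mixtures preserves the
# MMSB marginal likelihood phi_i^T B phi_j.
t = 0.7
U = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]  # orthonormal
Ut = [[U[c][r] for c in range(2)] for r in range(2)]           # U transpose
B = [[0.8, 0.1], [0.3, 0.6]]
phi_i, phi_j = [0.2, 0.8], [0.5, 0.5]

B_rot = matmul(matmul(U, B), Ut)   # B -> U B U^T
phi_i_rot = matvec(U, phi_i)       # phi -> U phi
phi_j_rot = matvec(U, phi_j)

before = quad_form(phi_i, B, phi_j)
after = quad_form(phi_i_rot, B_rot, phi_j_rot)
# before and after agree up to floating-point error
```

The cancellation is exactly the U⊤U = I identity from the text: (Uϕi)⊤(UBU⊤)(Uϕj) = ϕi⊤Bϕj.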

To see how MSCB relates to MMSB, imagine that the MSCB sibling group matrices are sub-matrices along the diagonal of some large H × H blockmatrix B, where H is the total number of hierarchy nodes. Elements of this blockmatrix that do not correspond to some sibling group matrix represent community-community interaction parameters eliminated by coarsening, and thus are absent from the model. In this new representation, the MM vectors θi can be represented as H-dimensional vectors ψi, but with support on at most K elements, where K is the hierarchy depth. If we disregard coarsening, the marginal likelihood of Eij = 1 is simply ψi⊤Bψj, just like MMSB. The difference is that the vectors ψi have support on at most K ≪ H elements, making MSCB significantly more constrained, and thus more identifiable, than an H-role MMSB. Specifically, we can only transform B and ψi using orthonormal matrices U that preserve the at-most-K support of the ψi; otherwise the resulting model no longer corresponds to a K-level MSCB. As for coarsening, its effect is to force parameter sharing: think of it as remapping the “eliminated” elements of B onto elements corresponding to some sibling group matrix. This further constrains MSCB, making it in fact more identifiable than our comparison to MMSB implies.

In summary, the fixed-depth hierarchy and coarsening rules make MSCB significantly more identifiable than the closely-related MMSB. However, as we increase the maximum hierarchy depth K, the degree of model non-identifiability also increases. We recommend using a maximum hierarchy depth of K = 2 or 3, noting that for a fixed depth K, MSCB can instead learn wider, highly-branched hierarchies without increasing the model non-identifiability.

4. COLLAPSED GIBBS SAMPLER INFERENCE

Given a directed network adjacency matrix E, our model’s primary purpose is to estimate (1) a hierarchy over the network actors (implicitly given by the paths c), and (2) the actors’ Multiscale Membership (MM) vectors θ indicating their propensities to interact at different levels of the hierarchy. As secondary goals, we might also estimate (3) the sibling group compatibility matrices B so as to learn the communities’ propensities to interact with one another, or (4) the donor/receiver link interaction levels z in order to discover the granularity of each interaction. To be precise, since our model is Bayesian, we seek the posterior distributions of c, θ, B and z given E.

Unfortunately, finding the exact posteriors is infeasible — the number of possible assignments to the discrete random variables c and z is exponentially large, while analytic solutions to the posteriors of the continuous random variables θ and B do not even exist. We must therefore resort to approximating the posteriors, via some approximate inference algorithm. Two general strategies exist for approximate inference: Markov Chain Monte Carlo (MCMC) methods (Robert and Casella 2004) that take random samples from the posterior, and variational approximations (Jordan, Ghahramani, Jaakkola and Saul 1999; Wainwright and Jordan 2008) that find the closest approximation to the posterior in some space of “simpler”, analytically tractable distributions.

The literature contains many examples of variational approximations being applied to variants of the stochastic blockmodel (Airoldi et al. 2008; Xing, Fu and Song 2010; Nallapati, Ahmed, Xing and Cohen 2008). Variational approximations tend to be computationally faster than comparable MCMC methods, but lack guarantees on the quality of their approximation. In addition, nonparametric priors like the Nested Chinese Restaurant Process can have an unbounded number of parameters, hence they pose special difficulties for variational inference. This unbounded parameter issue can be solved by truncating the variational distribution (Blei and Jordan 2004), at the cost of introducing a new parameter that specifies the degree of truncation. Wang and Blei (2009) have developed a variational approximation specifically for the nCRP, in which special techniques are introduced to handle truncation adaptively. In general however, variational inference in the nonparametric case remains more difficult to derive and implement than in the parametric case.

We have chosen an MCMC method for posterior inference in our model: specifically, a collapsed Gibbs sampling scheme (Liu 1994), in which the continuous random variables B and θ are integrated out with the goal of simplifying the sampling equations. Gibbs sampling on discrete random variables is guaranteed to converge to the true posterior, though there are no general guarantees on how many samples are required for convergence. Unlike nonparametric variational inference, the unbounded number of parameters is not an issue for collapsed Gibbs samplers, hence we need not worry about truncation issues. In addition, the MCMC literature for nonparametric priors is mature; Gibbs sampling schemes have been derived for a variety of priors (Teh et al. 2006), including the nCRP itself (Blei et al. 2010).

4.1 Gibbs Sampling Equations

Approximate posterior inference in our model is conducted via a collapsed Gibbs sampling scheme (Liu 1994), in which some of the random variables are integrated out. Specifically, we integrate out two types of random variables: the first type are the community-compatibility matrices B, which we integrate by exploiting conjugacy between the Beta and Bernoulli distributions. This adds conditional dependencies among interactions Eij — specifically, all Eij that use the same compatibility parameter Bhcoarse,h′coarse (by way of the levels z and paths c) become dependent on each other when conditioned on z, c. However, Eij that use different compatibility parameters remain conditionally independent given all other variables. The second type of random variable that we integrate are the MM vectors θi, by exploiting conjugacy between the multinomial distribution and the truncated stick breaking process; note that this adds dependence between levels z that share the same θi.

The point of integrating B, θ is that it may lead to faster convergence of the Gibbs sampler, and this technique is widely used in the latent space modelling community (Blei et al. 2010; Mimno and McCallum 2007; Newman, Chemudugunta and Smyth 2006). Moreover, the resulting sampler is easier to implement because the remaining latent and observed variables z, c, E are all discrete. The reader might ask why we do not integrate z, c; our answer is that there are no known techniques for integrating these variables, and to the best of our knowledge, there is no evidence suggesting this will have any benefit over integrating B, θ. Note that, given a sample of the remaining hidden variables c, z, the posterior distributions of B, θ are easily recovered using Bayes’ Rule.

With B, θ integrated out, the only variables that need to be explicitly sampled are the levels z and the paths c. We shall derive their Gibbs sampling equations below.

Sampling Levels z. To get the collapsed level sampling equations, we begin with the joint distribution of zi→j, B, θ conditioned on all other variables, and then integrate out B, θ:

\begin{align}
&\int\!\!\int \mathbb{P}(z_{i\to j}, B, \theta \mid c, z_{-(i\to j)}, E;\, \gamma, m, \pi, \lambda_1, \lambda_2)\, dB\, d\theta \nonumber\\
&\quad= \mathbb{P}(z_{i\to j} \mid c, z_{-(i\to j)}, E;\, \gamma, m, \pi, \lambda_1, \lambda_2) &&\text{(integration)} \nonumber\\
&\quad= \frac{\mathbb{P}(E_{ij}, z_{i\to j} \mid c, z_{-(i\to j)}, E_{-(ij)};\, \gamma, m, \pi, \lambda_1, \lambda_2)}{\mathbb{P}(E_{ij} \mid c, z_{-(i\to j)}, E_{-(ij)};\, \gamma, m, \pi, \lambda_1, \lambda_2)} &&\text{(conditional probability definition)} \nonumber\\
&\quad\propto \mathbb{P}(E_{ij}, z_{i\to j} \mid c, z_{-(i\to j)}, E_{-(ij)};\, \gamma, m, \pi, \lambda_1, \lambda_2) &&\text{(denominator does not depend on } z_{i\to j}\text{)} \nonumber\\
&\quad= \mathbb{P}(E_{ij} \mid c, z, E_{-(ij)};\, \gamma, m, \pi, \lambda_1, \lambda_2)\; \mathbb{P}(z_{i\to j} \mid c, z_{-(i\to j)}, E_{-(ij)};\, \gamma, m, \pi, \lambda_1, \lambda_2) &&\text{(chain rule)} \nonumber\\
&\quad= \mathbb{P}(E_{ij} \mid c, z, E_{-(ij)};\, \lambda_1, \lambda_2)\; \mathbb{P}(z_{i\to j} \mid z_{i,(-j)};\, m, \pi) &&\text{(conditional independence)} \tag{5}
\end{align}

where E−(ij) is the set of all edges E except Eij, z−(i→j) is the set of all level indicators z except zi→j, and finally zi,(−j) = {zi→·, z·←i} \ {zi→j} is the set of all z’s that are drawn from θi except zi→j.

Two aspects of Eq. (5) are worth explaining. First, our goal is to Gibbs sample from the conditional distribution of zi→j (with B, θ integrated out), i.e. line 2. Consequently, we can discard proportionality factors that do not depend on zi→j, such as the denominator ℙ(Eij | …) from line 3 to 4. Second, from line 5 to 6, we can change the second ℙ(·) term’s conditioning variables from z−(i→j) to zi,(−j) because (1) we are not conditioning on Eij, and (2) we are conditioning on m, π. Hence, zi→j is d-separated from, and thus conditionally independent of, all z’s not drawn from θi.

Moving on, let us now expand the first ℙ() term:

\[
\mathbb{P}(E_{ij} \mid c, z, E_{-(ij)};\, \lambda_1, \lambda_2)
= \frac{\Gamma(a + b + \lambda_1 + \lambda_2)}{\Gamma(a + \lambda_1)\,\Gamma(b + \lambda_2)}
\cdot \frac{\Gamma(a + E_{ij} + \lambda_1)\,\Gamma(b + (1 - E_{ij}) + \lambda_2)}{\Gamma(a + b + 1 + \lambda_1 + \lambda_2)}, \tag{6}
\]
\[
a = |\{(x,y) \mid (x,y) \neq (i,j),\ SB_{xy} = SB_{ij},\ E_{xy} = 1\}|, \qquad
b = |\{(x,y) \mid (x,y) \neq (i,j),\ SB_{xy} = SB_{ij},\ E_{xy} = 0\}|.
\]

Eq. (6) results from integrating the compatibility matrices B via Beta-Bernoulli conjugacy; notice that it is a function of zi→j through the condition SBxy = SBij within the sub-expressions a, b. Because we have integrated B, the random variable zi→j is now dependent on all interactions Exy that, for the current sample value of z, c, use the same compatibility parameter SBij as Eij (by way of z, c and coarsening).
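As a sketch of Eq. (6), the collapsed edge probability reduces, via Beta-Bernoulli conjugacy, to a smoothed frequency; the helper and the count values below are illustrative:

```python
from math import lgamma, exp

def collapsed_edge_prob(e_ij, a, b, lam1, lam2):
    """Illustrative evaluation of Eq. (6): the predictive probability
    of edge value e_ij, given a edges equal to 1 and b edges equal to
    0 among the other interactions sharing the same compatibility
    parameter, with its Beta(lam1, lam2) prior integrated out."""
    log_p = (lgamma(a + b + lam1 + lam2) - lgamma(a + lam1) - lgamma(b + lam2)
             + lgamma(a + e_ij + lam1) + lgamma(b + (1 - e_ij) + lam2)
             - lgamma(a + b + 1 + lam1 + lam2))
    return exp(log_p)

# The Gamma functions telescope to the usual smoothed frequency:
# P(e_ij = 1) = (a + lam1) / (a + b + lam1 + lam2).
p1 = collapsed_edge_prob(1, a=7, b=3, lam1=1.0, lam2=1.0)
# p1 == (7 + 1) / (7 + 3 + 1 + 1) == 2/3, up to floating point
```

Working in log space via `lgamma` avoids overflow when the edge counts a, b are large.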

The second ℙ(·) term can be computed using the law of total expectation, conditioned on the stick-breaking lengths Vi1, …, ViK associated with zi→j:

\begin{align}
\mathbb{P}(z_{i\to j} = k \mid z_{i,(-j)}, m, \pi)
&= \mathbb{E}\!\left[\mathbb{I}(z_{i\to j} = k) \mid z_{i,(-j)}, m, \pi\right] \nonumber\\
&= \mathbb{E}\!\left[\,\mathbb{E}\!\left[\mathbb{I}(z_{i\to j} = k) \mid V_{i1}, \ldots, V_{ik}\right] \,\middle|\, z_{i,(-j)}, m, \pi\right] \nonumber\\
&= \mathbb{E}\!\left[V_{ik} \prod_{u=1}^{k-1} (1 - V_{iu}) \,\middle|\, z_{i,(-j)}, m, \pi\right] \nonumber\\
&= \mathbb{E}\!\left[V_{ik} \mid z_{i,(-j)}, m, \pi\right] \prod_{u=1}^{k-1} \mathbb{E}\!\left[(1 - V_{iu}) \mid z_{i,(-j)}, m, \pi\right] \nonumber\\
&\propto \frac{m\pi + \#[z_{i,(-j)} = k]}{\pi + \#[z_{i,(-j)} \ge k]} \prod_{u=1}^{k-1} \frac{(1-m)\pi + \#[z_{i,(-j)} > u]}{\pi + \#[z_{i,(-j)} \ge u]}, \tag{7}
\end{align}

where #[A = x] is the number of elements in set A equal to x (and analogously for ≥ and >). The proportionality constant arises from truncating the stick-breaking process at level K, and is equal to the sum of the right-hand side of Eq. (7) over k = 1, …, K. Overall, Eq. (7) is a consequence of integrating out θi using multinomial-stick-breaking conjugacy, which makes zi→j dependent on zi,(−j).
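Eq. (7) can be sketched as follows, with the truncation at level K handled by explicit renormalization; the function name and toy input are our own:

```python
def level_probs(other_levels, K, m, pi):
    """Sketch of Eq. (7): the collapsed probability that a level
    indicator drawn from theta_i equals k, given the other
    indicators `other_levels` drawn from the same theta_i, with the
    truncated stick-breaking prior integrated out. Names are ours."""
    count_eq = lambda k: sum(1 for z in other_levels if z == k)
    count_ge = lambda k: sum(1 for z in other_levels if z >= k)
    count_gt = lambda k: sum(1 for z in other_levels if z > k)

    unnorm = []
    for k in range(1, K + 1):
        # posterior mean of V_ik times posterior means of (1 - V_iu)
        p = (m * pi + count_eq(k)) / (pi + count_ge(k))
        for u in range(1, k):
            p *= ((1 - m) * pi + count_gt(u)) / (pi + count_ge(u))
        unnorm.append(p)
    total = sum(unnorm)  # truncation at level K requires renormalizing
    return [p / total for p in unnorm]

# Level 2 appears three times among i's other indicators, level 1
# once, so level 2 receives the larger posterior probability.
probs = level_probs([2, 2, 1, 2], K=2, m=0.5, pi=1.0)
```

The two fractions are the posterior means of the Beta-distributed stick lengths, so levels used more often by actor i's other interactions receive proportionally more mass.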

For brevity, we omit the sampling equations for zi←j, as they are derived in similar fashion. The computational complexity of sampling a single zi→j is 𝒪(K), where K is the (fixed) depth of our hierarchy. Hence the total runtime required to sample all z is 𝒪(N²K).

Sampling Paths c. As with the levels, we start with the joint distribution of ci, B, θ conditioned on all other variables:

\begin{align}
&\int\!\!\int \mathbb{P}(c_i, B, \theta \mid c_{-i}, z, E;\, \gamma, m, \pi, \lambda_1, \lambda_2)\, dB\, d\theta
= \mathbb{P}(c_i \mid c_{-i}, z, E;\, \gamma, m, \pi, \lambda_1, \lambda_2) &&\text{(integration)} \nonumber\\
&\quad\propto \mathbb{P}(c_i, E_{(i\cdot),(\cdot i)} \mid c_{-i}, z, E_{-(i\cdot),-(\cdot i)};\, \gamma, m, \pi, \lambda_1, \lambda_2) &&\text{(cond. probability; discarding denominator)} \nonumber\\
&\quad= \mathbb{P}(E_{(i\cdot),(\cdot i)} \mid c, z, E_{-(i\cdot),-(\cdot i)};\, \gamma, m, \pi, \lambda_1, \lambda_2)\; \mathbb{P}(c_i \mid c_{-i}, z, E_{-(i\cdot),-(\cdot i)};\, \gamma, m, \pi, \lambda_1, \lambda_2) &&\text{(chain rule)} \nonumber\\
&\quad= \mathbb{P}(E_{(i\cdot),(\cdot i)} \mid c, z, E_{-(i\cdot),-(\cdot i)};\, \lambda_1, \lambda_2)\; \mathbb{P}(c_i \mid c_{-i};\, \gamma) &&\text{(conditional independence)} \tag{8}
\end{align}

where E(i·),(·i) = {Exy | x = i or y = i} is the set of edges incoming to and outgoing from entity i, and E−(i·),−(·i) is its complement. In line 2, we have implicitly discarded the denominator ℙ(E(i·),(·i) | …), as it does not depend on ci.

The first term ℙ(E(i·),(·i) | c, z, E−(i·),−(·i); λ1, λ2), as a function of ci, is

\[
\text{1st term} = \prod_{B \in \mathbb{B}_{(i\cdot),(\cdot i)}}
\frac{\Gamma(g_B + h_B + \lambda_1 + \lambda_2)}{\Gamma(g_B + \lambda_1)\,\Gamma(h_B + \lambda_2)}
\cdot \frac{\Gamma(g_B + r_B + \lambda_1)\,\Gamma(h_B + s_B + \lambda_2)}{\Gamma(g_B + h_B + r_B + s_B + \lambda_1 + \lambda_2)}, \tag{9}
\]
\[
\mathbb{B}_{(i\cdot),(\cdot i)} = \{\, B_{h,h'} \mid \exists\, (i,j) : E_{ij} \in E_{(i\cdot),(\cdot i)},\ SB_{ij} = B_{h,h'} \,\},
\]
\[
g_B = |\{(x,y) \mid E_{xy} \notin E_{(i\cdot),(\cdot i)},\ SB_{xy} = B,\ E_{xy} = 1\}|, \qquad
h_B = |\{(x,y) \mid E_{xy} \notin E_{(i\cdot),(\cdot i)},\ SB_{xy} = B,\ E_{xy} = 0\}|,
\]
\[
r_B = |\{(x,y) \mid E_{xy} \in E_{(i\cdot),(\cdot i)},\ SB_{xy} = B,\ E_{xy} = 1\}|, \qquad
s_B = |\{(x,y) \mid E_{xy} \in E_{(i\cdot),(\cdot i)},\ SB_{xy} = B,\ E_{xy} = 0\}|.
\]

Similar to Eq. (6), Eq. (9) is a consequence of integrating out B for all interactions E associated with actor i. In brief, the set 𝔹(i·),(·i) contains all sibling group compatibility matrix elements Bh,h′ that are associated with some edge in E(i·),(·i); note that these elements need not come from the same sibling group matrix. More precisely, 𝔹(i·),(·i) is constructed as follows: (1) for each edge Exy ∈ E(i·),(·i), find Exy’s corresponding sibling group matrix element Bh,h′ by applying the coarsening procedure to cx, cy, zx→y, zx←y, and then (2) take the union over all matrix elements found this way. Because the sibling group matrices B have been integrated out, the set 𝔹(i·),(·i) is only used to reference their sufficient statistics, through the SB() function defined in Eq. (4). In particular, the four terms gB, hB, rB, sB are counts of 0/1 edges associated with each B ∈ 𝔹(i·),(·i): gB and hB count edges not incident to actor i, while rB and sB count edges incident to i.

As for the second term ℙ(ci | c−i; γ), it can be directly computed using the nCRP definition Eq. (2). The computational complexity for a single ci is 𝒪(NH), where H is the number of hierarchy nodes; hence the time to sample all c is 𝒪(N²H). Note that H = 𝒪(NK), so the complexity of sampling all c is 𝒪(N³K).

4.2 Hyperparameter Selection Using Metropolis-Hastings

The MSCB model contains 6 hyperparameters that need to be set: the hierarchy depth K, as well as the 5 prior hyperparameters γ, m, π, λ1, λ2. We will not discuss selection of K, expecting that the user knows how deep a hierarchy he or she needs, bearing in mind that model non-identifiability increases with hierarchy depth. On the other hand, selecting all 5 prior hyperparameters is not a trivial affair and requires some attention. One could perform a grid search using the marginal likelihood of the network E as the objective function, but the search would span 5 dimensions and thus be impractical. Moreover, we would have to approximate the marginal likelihood, as no analytic formula exists for computing it.

Clearly, we need a different strategy for selecting prior hyperparameters γ, m, π, λ1, λ2, and we choose to place hyper-priors on the hyperparameters, a common Bayesian technique for data-driven hyperparameter selection:

γ~Exponential(η1),m~Beta(η2,η3),π~Exponential(η4),λ1,λ2~Exponential(η5).

Although this introduces 5 new hyper-hyperparameters η1, …, η5, models with hyper-priors are typically less sensitive to the choice of hyper-hyperparameters than the original models were to the choice of hyperparameters (Bernardo, Smith and Berliner 2000). Thus, by setting the hyper-hyperparameters to reasonable values — all our experiments use η1, …, η5 = 1 — we allow the model to decide the best values for the hyperparameters.

All that remains is to derive an inference algorithm for the model with hyper-priors. Because the hyper-priors are not conjugate to the other distributions in the model, we cannot derive Gibbs sampling equations for the hyperparameters γ, m, π, λ1, λ2. We overcome this via a general MCMC strategy, in which we alternate between sampling sweeps over all model latent variables using our collapsed Gibbs sampling equations from Section 4.1, and sampling each hyperparameter using Independence Chain Metropolis-Hastings with the hyper-prior distributions as the proposals. This new inference algorithm is still a valid Markov chain, although it may take longer to converge than our original model’s Gibbs sampling equations (Robert and Casella 2004). Notably, using the hyper-priors as proposal distributions greatly simplifies the Metropolis-Hastings acceptance probabilities: the proposal and hyper-prior densities cancel, leaving only likelihood ratios, which reduces runtime complexity and makes the algorithm easier to implement. The simplified acceptance probabilities are:

\[
a(\gamma_{\text{new}}; \gamma_{\text{old}}) = \frac{\mathbb{P}(c;\, \gamma_{\text{new}})}{\mathbb{P}(c;\, \gamma_{\text{old}})}, \qquad
a(m_{\text{new}}, \pi_{\text{new}}; m_{\text{old}}, \pi_{\text{old}}) = \frac{\mathbb{P}(z;\, m_{\text{new}}, \pi_{\text{new}})}{\mathbb{P}(z;\, m_{\text{old}}, \pi_{\text{old}})}, \qquad
a(\lambda_{1,\text{new}}, \lambda_{2,\text{new}}; \lambda_{1,\text{old}}, \lambda_{2,\text{old}}) = \frac{\mathbb{P}(E \mid c, z;\, \lambda_{1,\text{new}}, \lambda_{2,\text{new}})}{\mathbb{P}(E \mid c, z;\, \lambda_{1,\text{old}}, \lambda_{2,\text{old}})}, \tag{10}
\]

noting that we sample m jointly with π and sample λ1 jointly with λ2, and where

\[
\mathbb{P}(c;\, \gamma) = \prod_{i=1}^{N} \prod_{k=1}^{K} \mathbb{P}(c_{ik} \mid c_{1:(i-1)}, c_{i,1:(k-1)};\, \gamma), \tag{11}
\]
\[
\mathbb{P}(z;\, m, \pi) = \prod_{i=1}^{N} \prod_{j \neq i} \mathbb{P}(z_{i\to j} \mid z_{i\to 1:(j-1)};\, m, \pi)\; \mathbb{P}(z_{j\leftarrow i} \mid z_{i\to 1:N}, z_{1:(j-1)\leftarrow i};\, m, \pi), \tag{12}
\]
\[
\mathbb{P}(E \mid c, z;\, \lambda_1, \lambda_2) = \prod_{B \in \mathbb{B}} \frac{\Gamma(\lambda_1 + \lambda_2)}{\Gamma(\lambda_1)\,\Gamma(\lambda_2)} \cdot \frac{\Gamma(a_B + \lambda_1)\,\Gamma(b_B + \lambda_2)}{\Gamma(a_B + b_B + \lambda_1 + \lambda_2)}, \qquad
a_B = |\{(x,y) \mid SB_{xy} = B,\ E_{xy} = 1\}|, \quad
b_B = |\{(x,y) \mid SB_{xy} = B,\ E_{xy} = 0\}|, \tag{13}
\]

with 𝔹 being the set of all sibling group compatibility matrix elements. Eq. (13) is similar to Eq. (9), except that we now consider all network edges instead of just those incident to some network entity i. As for Eqs. (11) and (12), they result from applying the chain rule to exchangeable distributions; in the case of Eq. (11), we have applied it to the nCRP distribution over all paths c, whereas in Eq. (12), we have applied it to N compound stick-breaking-multinomial distributions, each corresponding to the set of level indicators zi associated with some network entity i. The product terms in Eq. (11) are computed using the nCRP definition Eq. (2), while the terms in Eq. (12) are:

\[
\mathbb{P}(z_{i,x} = k \mid z_{i,\text{subset}};\, m, \pi) \;\propto\;
\frac{m\pi + \#[z_{i,\text{subset}} = k]}{\pi + \#[z_{i,\text{subset}} \ge k]}
\prod_{u=1}^{k-1} \frac{(1-m)\pi + \#[z_{i,\text{subset}} > u]}{\pi + \#[z_{i,\text{subset}} \ge u]},
\]

where zi,x and zi,subset denote an element and a subset, respectively, of zi = {zi→·, z·←i}, and where the normalization factor is the sum of the right-hand side over k = 1, …, K.

By adopting this MCMC algorithm for simultaneous hyperparameter selection and posterior inference, we have reduced user input to just the hierarchy depth K. This not only makes our model easier to use, but also provides a data-driven means of choosing the hyper-parameters γ, m, π, λ1, λ2. Finally, the computational complexity of drawing all 5 hyperparameters and testing their acceptance probabilities is 𝒪(N2K), which is asymptotically less than a single 𝒪(N3K) Gibbs sampler sweep over all latent variables. In other words, using our Metropolis-Hastings strategy for hyperparameter selection does not incur a significant computational cost over Gibbs sampling the latent variables.

To initialize our MCMC algorithm, we first set the hyperparameters m, π, γ, λ1, λ2 to user-provided initial values, and then randomly draw the latent variables c, z according to the MSCB generative process (Section 2.4). Once done, we alternate between sampling the latent variables and the hyperparameters, as described earlier.
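The Metropolis-Hastings component of this alternating scheme can be sketched as below. The likelihood and hyper-prior here are toy stand-ins (not the MSCB quantities of Eq. (10)), chosen only to show the independence-chain mechanics:

```python
import math
import random

def mh_hyperparameter_step(current, log_lik, sample_prior):
    """One Independence Chain Metropolis-Hastings step: propose from
    the hyper-prior and accept with probability min(1, likelihood
    ratio). Because the proposal density equals the prior density,
    both cancel from the acceptance ratio, leaving only the
    likelihood ratio (cf. Eq. (10))."""
    proposal = sample_prior()
    log_accept = log_lik(proposal) - log_lik(current)
    if random.random() < math.exp(min(0.0, log_accept)):
        return proposal  # accept
    return current       # reject

# Toy stand-ins: an Exponential(1) hyper-prior and a dummy
# log-likelihood peaked at 2; neither is an MSCB quantity.
random.seed(0)
gamma, samples = 1.0, []
for _ in range(2000):
    gamma = mh_hyperparameter_step(
        gamma,
        log_lik=lambda g: -(g - 2.0) ** 2,
        sample_prior=lambda: random.expovariate(1.0),
    )
    samples.append(gamma)
avg = sum(samples[500:]) / len(samples[500:])
# avg settles between the prior mean (1) and the likelihood peak (2)
```

In the full algorithm, each such step would be interleaved with a collapsed Gibbs sweep over z and c, with log P(c; γ), log P(z; m, π) and log P(E | c, z; λ1, λ2) playing the role of `log_lik`.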

5. SIMULATION

We now evaluate our inference algorithm’s ability to recover hierarchies from data simulated from our model. Our goal is to examine how MSCB’s ability to model both assortative (within-community) interactions and disassortative (cross-community) interactions differentiates it from standard hierarchical clustering algorithms.

For all simulations, the number of actors N was 300. For K = 2, we set θ = (0.25, 0.75) for all actors, meaning that actors interact at level 1 25% of the time and at level 2 75% of the time. For K = 3, we set θ = (0.1, 0.3, 0.6) for all actors.

Our experiments explore the effect of different compatibility matrices B. We first explore networks generated from “on-diagonal” Bs, where the diagonal elements are much larger than the off-diagonal elements (strong assortative interactions). We also investigate “off-diagonal” Bs, where the off-diagonal elements are larger (strong disassortative interactions). “Low noise” means the on/off diagonal element values are far apart, while “high noise” means they are closer together. Specifically, the types of Bs explored are:

  1. K=2, on-diagonal, low noise - Bon–diagonal = (0.4, 0.8), Boff–diagonal = (0.02, 0.02);

  2. K=2, off-diagonal, low noise - Bon–diagonal = (0.02, 0.02), Boff–diagonal = (0.4, 0.8);

  3. K=2, on-diagonal, high noise - Bon–diagonal = (0.3, 0.6), Boff–diagonal = (0.1, 0.2);

  4. K=2, off-diagonal, high noise - Bon–diagonal = (0.1, 0.2), Boff–diagonal = (0.3, 0.6);

  5. K=3, on-diagonal, low noise - Bon–diagonal = (0.5, 0.7, 0.9), Boff–diagonal = (0.02, 0.02, 0.02);

  6. K=3, off-diagonal, low noise - Bon–diagonal = (0.02, 0.02, 0.02), Boff–diagonal = (0.5, 0.7, 0.9).

  7. K=3, on-diagonal, high noise - Bon–diagonal = (0.4, 0.6, 0.8), Boff–diagonal = (0.1, 0.1, 0.2);

  8. K=3, off-diagonal, high noise - Bon–diagonal = (0.1, 0.1, 0.2), Boff–diagonal = (0.4, 0.6, 0.8).

Bon–diagonal = (a, b) means that actors interacting in the same level-1 community do so with probability a, while actors interacting in the same level 2 community do so with probability b (and analogously for Boff–diagonal).

We compare our approach to two baselines. The first is hierarchical spectral clustering, an adaptation of spectral clustering (Chung 1997) to top-down hierarchical clustering. Because spectral clustering does not specify how to select the number of clusters at each hierarchy node, we explore two variants that represent worst- and best-case scenarios respectively. The first variant, Spectral-Binary, does binary splits at every level. The second variant, Spectral-Oracle, is given the true number of 1st level branches as an advantage, and then performs binary splits at deeper levels. The second baseline is agglomerative clustering with Ward’s criterion (Ward 1963), using the dissimilarity measure implemented in Pajek (Batagelj and Mrvar 1998; Batagelj and Ferligoj 1992). As with spectral clustering, we use two variants: Ward-Binary and Ward-Oracle. Ward-Binary does binary splits at all levels, while Ward-Oracle is given the true number of first-level clusters as an advantage, but does binary splits at deeper levels.

For our approach, we initialize m = π = λ1 = λ2 = 0.5 and γ = 0.1. We run our collapsed Gibbs sampler for 10,000 burn-in iterations, and then draw 10 samples with a lag of 100 iterations in between. We calculate the F1 score at each level k, F1_k = (2 · Precision · Recall) / (Precision + Recall), where Recall = TP / (TP + FN) and Precision = TP / (TP + FP). Here TP is the true positive count (pairs of actors correctly placed in the same cluster, up to depth k), FP is the false positive count, TN is the true negative count, and FN is the false negative count. The total F1 score is computed by averaging the F1_k scores over all levels; for our approach, we additionally average the F1 score over all samples.
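A sketch of the pairwise F1 computation at a given level k; representing cluster assignments as path tuples is our own choice, and the example paths below are illustrative:

```python
def pairwise_f1(true_paths, pred_paths, k):
    """F1 at level k: a pair of actors counts as positive if their
    paths agree on the first k levels. TP/FP/FN are counted over all
    unordered actor pairs. Names and representation are illustrative."""
    n = len(true_paths)
    tp = fp = fn = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = true_paths[i][:k] == true_paths[j][:k]
            same_pred = pred_paths[i][:k] == pred_paths[j][:k]
            if same_pred and same_true:
                tp += 1
            elif same_pred and not same_true:
                fp += 1
            elif same_true and not same_pred:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

true_c = [(1, 1), (1, 1), (1, 2), (2, 1)]
pred_c = [(1, 1), (1, 1), (1, 1), (2, 1)]
f1_level1 = pairwise_f1(true_c, pred_c, k=1)  # level-1 split recovered
f1_level2 = pairwise_f1(true_c, pred_c, k=2)  # one cluster over-merged
```

In this toy example the level-1 partition is recovered exactly (F1 = 1), while the over-merged level-2 clustering is penalized through its false positive pairs.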

Figure 5 shows the F1 scores for all algorithms, as a function of the number of size ≥ 10 branches at the true hierarchy’s 1st level. To ensure a fair comparison, this was also the number of 1st level branches given to the Oracle variants of both baselines. From Figures 5(a), 5(c), 5(e), and 5(g), we see that when B is strongly on-diagonal, our method generally performs better than, or comparably to, all other methods except Spectral-Oracle once there are more than 3 first-level branches, demonstrating its ability to determine the correct number of clusters from the data. In these on-diagonal experiments, Ward’s method tends to perform the worst, while MSCB and Spectral perform better.

Figure 5.

Simulation Results for on/off diagonal compatibility matrices in low/high noise settings for depth of 2 and 3. We compare our method, MSCB (blue) against both hierarchical spectral clustering (red/pink, see text for details), and Ward’s method as implemented in Pajek (green/black) (Batagelj and Mrvar 1998).

However, once B is strongly off-diagonal (implying strong cross-community interactions), Spectral performs poorly. This is to be expected: by formulation, spectral clustering cannot recover disassortative communities. On the other hand, our method continues to yield good results (Figures 5(b), 5(d), 5(f), and 5(h)), comparable to the on-diagonal B case. Ward’s method can also recover disassortative communities, but for K = 2 we outperform its binary variant and perform comparably to its oracle variant. For the K = 3 off-diagonal results (Figures 5(f) and 5(h)), Ward’s method tends to perform slightly better until the number of branches becomes very large. However, Ward’s method does not perform as well on the assortative simulations.

Overall, our method performs consistently well in all settings, and for larger numbers of branches it outperforms both the binary variants of Ward and Spectral in most scenarios. The oracle variants sometimes perform better, but they require a priori knowledge of the number of first-level branches, which our method does not.

6. HELD-OUT EVALUATION

Previously, we evaluated our MCMC algorithm’s performance against non-probabilistic hierarchical clustering schemes. One might then ask how our algorithm compares to probabilistic network models that do not model hierarchical structure. In particular, does MSCB’s hierarchy assumption permit a better statistical fit to real-world networks, in terms of the network marginal log-likelihood ℙ(E)?

In this section, we compare our MSCB inference algorithm to the inference algorithms of two other probabilistic models: the Infinite Relational Model (IRM) (Kemp et al. 2006), essentially a nonparametric version of the stochastic blockmodel (Wang and Wong 1987), and the Mixed Membership Stochastic Blockmodel (Airoldi et al. 2008), a variant of the stochastic blockmodel that permits network entities to come from multiple communities (which Airoldi et al. call “roles”). Since the IRM is a special case of MSCB with hierarchy depth K = 1, we reused our full MCMC algorithm (Gibbs sampler plus Metropolis-Hastings) for posterior inference and hyperparameter selection. As for MMSB, we used the variational EM algorithm in Airoldi et al. (2008), which also performs both posterior inference and hyperparameter selection. Recall from Section 3 that MSCB is a highly-constrained version of MMSB, and is therefore much less non-identifiable; on this point, we expect our MSCB inference/selection algorithm to perform better than MMSB’s.

Our experiments use two real-world datasets: a 75-species food web of grass-feeding wasps (Dawah et al. 1995; Clauset, Moore and Newman 2008), and the 62-actor September 11th, 2001 hijacker terrorist network (Krebs 2002; Clauset et al. 2008). These networks reflect common real-world modes of interaction: edges in the food web denote predator-prey relationships, while edges in the terrorist network reflect social cohesion. The food web could be represented as a hierarchy where different branches reflect different trophic levels (e.g. parasite, predator or prey), while the terrorist network could be interpreted as an organization chart.

At a high level, we conducted our held-out evaluation as follows: for each model, we (1) used the corresponding inference/selection algorithm to estimate model hyperparameters Θ̂ for a training network Etrain (ignoring the latent variable posteriors), and then (2) estimated the test network marginal log-likelihood ℙ(Etest; Θ̂), so as to evaluate how well each inference/selection algorithm estimates parameters for its model. More specifically, for both datasets, we generated 5 pairs of training and test subgraphs; each pair was obtained by randomly partitioning the actors into two equal sets, and keeping only the edges within each partition. Then, for each of the 3 models on each of the 5 training graphs, we selected the model hyperparameters Θ̂ using the appropriate inference/selection algorithm4. Finally, we estimated the corresponding test network’s marginal log-likelihood ℙ(Etest; Θ̂) using 10,000 importance samples, and averaged the results over all 5 train-test network pairs. The initial hyperparameter values for this experiment were m, π = 0.5 and γ, λ1, λ2 = 1.

Figure 6 displays the results for this experiment. On both datasets, observe that our MSCB algorithm achieves greater held-out marginal log-likelihoods ℙ(Etest; Θ̂) than MMSB, regardless of the MSCB hierarchy depth K or MMSB number of roles R. We believe this is related to MSCB being significantly more constrained than MMSB, and thus more identifiable. Moreover, MMSB’s likelihood peaks at R = 2 on both datasets, suggesting that we should choose just 2 roles, too few to provide anything but an extremely coarse network analysis. In contrast, our MSCB inference algorithm uses model hyperparameters to automatically select a suitable size and shape for its hierarchy — for example, the grass dataset training network posteriors for K = 2 had an average of 10.0 hierarchy nodes, a reasonable number considering that the training networks have 38 actors each. This illustrates one of the advantages of nonparametric models like MSCB and IRM over parametric models such as MMSB. Finally, we note that the differences of heldout likelihoods between K = 1 (IRM), 2 and 3 for MSCB are negligible and within error, suggesting that the increased non-identifiability from larger K has minimal negative impact on model fit.

Figure 6.

Average held-out marginal likelihoods ℙ(Etest; Θ̂) and standard deviations for our MSCB inference algorithm with hierarchy depth K = 1 (equivalent to IRM), 2 and 3, versus the MMSB variational inference algorithm for R ∈ {2, …, 20} roles.

7. EFFECTS OF HYPERPARAMETER INITIALIZATION

Apart from latent variable inference for c, z, our MCMC algorithm also performs hyperparameter selection for γ, m, π, λ1, λ2. Given that the final hyperparameter estimates may depend on their initial settings, it is natural to study how the former change with the latter. In this section, we conducted experiments on the grass-feeding wasps and terrorist networks, repeating our MCMC algorithm over different initial hyperparameter values. We studied one of the five hyperparameters at a time, while initializing the four remaining hyperparameters to the following default values: m = 0.5 and π, γ, λ1, λ2 = 1. The hierarchy depth was fixed to K = 2 throughout. For the hyperparameter under study, we ran our algorithm with five different initializations of that hyperparameter: m ∈ {0.1, 0.3, 0.5, 0.7, 0.9} or π, γ, λ1, λ2 ∈ {0.01, 0.1, 1, 10, 100}. Thus, we studied 25 hyperparameter settings in total: 5 for each of the five hyperparameters. For each of these 25 settings, we ran 5 repeat trials of 1,000 samples each, with 10,000 burn-in iterations prior to sampling. In each trial, we estimated the hyperparameter under study by its mean value over the 1,000 samples, and we present each estimate's average and standard deviation over the 5 trials in Tables 1 and 2.

Table 1.

Grass-feeding wasps network: Final hyperparameter estimates under different initializations.

Initial m            | 0.1             | 0.3             | 0.5             | 0.7             | 0.9
Final m              | 0.878 ± 0.126   | 0.833 ± 0.111   | 0.842 ± 0.130   | 0.832 ± 0.171   | 0.849 ± 0.0454

Initial π, γ, λ1, λ2 | 0.01            | 0.1             | 1               | 10              | 100
Final π              | 2.22 ± 2.08     | 2.11 ± 0.953    | 1.25 ± 0.927    | 9.75 ± 2.09     | 100 ± 0
Final γ              | 1.64 ± 0.395    | 1.66 ± 0.587    | 2.67 ± 0.950    | 3.08 ± 2.29     | 21.8 ± 43.8
Final λ1             | 0.0869 ± 0.0338 | 0.0641 ± 0.0172 | 0.0516 ± 0.0149 | 0.0645 ± 0.0366 | 0.0656 ± 0.0218
Final λ2             | 1.08 ± 0.394    | 1.26 ± 0.453    | 1.19 ± 0.465    | 0.840 ± 0.265   | 1.24 ± 0.532

Table 2.

Terrorist network: Final hyperparameter estimates under different initializations.

Initial m            | 0.1             | 0.3             | 0.5             | 0.7             | 0.9
Final m              | 0.0801 ± 0.0692 | 0.0982 ± 0.0518 | 0.144 ± 0.0708  | 0.0695 ± 0.0227 | 0.116 ± 0.126

Initial π, γ, λ1, λ2 | 0.01            | 0.1             | 1               | 10              | 100
Final π              | 0.406 ± 0.296   | 0.937 ± 0.545   | 1.64 ± 1.01     | 6.31 ± 5.06     | 100 ± 0
Final γ              | 2.15 ± 0.683    | 2.71 ± 1.97     | 2.63 ± 1.83     | 2.75 ± 0.678    | 3.49 ± 0.785
Final λ1             | 0.0656 ± 0.0140 | 0.0805 ± 0.0239 | 0.0776 ± 0.0381 | 0.0877 ± 0.0188 | 0.0921 ± 0.0143
Final λ2             | 0.252 ± 0.00766 | 0.221 ± 0.0931  | 0.292 ± 0.0602  | 0.287 ± 0.0843  | 0.248 ± 0.0348

For either dataset, our MCMC algorithm's final estimates of λ1, λ2 have low variance, and are essentially invariant to the choice of initial values. Since λ1, λ2 influence the posterior of the community-compatibility matrices B, this suggests that our MCMC algorithm reliably estimates the posterior of B. The low values of λ1 relative to λ2 reflect the fact that both networks have more 0-edges than 1-edges.

The situation for m and π is more nuanced. Recall that these hyperparameters respectively control the posterior mean and variance of the Multiscale Membership vectors θ. In Section 3, we argued that the MM vectors lead to some non-identifiability in the MSCB model, and it is therefore not surprising that m, π are more difficult to estimate reliably. On either dataset, the estimates of m ∈ (0, 1) generally have low variance and are reasonably constant across initial values, implying that the posterior mean estimates of θ are fairly reliable. We can even interpret these estimates: for the grass dataset, the estimates of m ≈ 0.85 imply that the MM vectors θ place almost all mass at the first level of the hierarchy, whereas the terrorist dataset estimates of m ≈ 0.1 imply that the MM vector mass is evenly distributed over the first and second levels. On the other hand, the estimates of π tend to vary with initialization (particularly when π = 100), and exhibit high variance over trials with the same initial value. Despite this, we note that π only controls the posterior variance of the MM vectors θ. Hence, if we only desire good mean estimates of θ, we may not necessarily need accurate estimates of π.

Lastly, we observe that γ's final estimate tends to increase with its initial setting. This is a consequence of our MCMC algorithm, which initializes the actor paths c using the MSCB generative model. A high initial γ creates a starting tree with many branches, and the MCMC algorithm is unable to merge5 all the superfluous branches within the 10,000 burn-in iteration limit, causing the final estimate of γ to be higher than it would otherwise be. Thus, care should be taken when choosing an initial value for γ, since high values lead to more highly-branched hierarchies.

8. REAL-DATA QUALITATIVE ANALYSIS

In this section, we apply the MSCB inference/selection algorithm to interpret two real-world networks: (1) a 75-species food web of grass-feeding wasps and their parasitoids, and (2) a citation network containing 1,000 papers from the High-Energy Physics (Theory) portion of arXiv, an open-access archive of scientific paper preprints. Our objective is to study how the three network aspects we seek — hierarchy, multiscale granularity, and assortativity/disassortativity — manifest in real-world networks.

For both networks, we ran our algorithm with K = 2 levels, and initialized the hyperparameters to m, π = 0.5 and γ, λ1, λ2 = 1. We discarded 10,000 iterations as burn-in, and then took 1,000 samples to estimate the posterior distribution over actor levels z and paths c. One issue we faced was the lack of an obvious way to visualize the posterior hierarchy; neither the posterior mean nor the posterior mode of c represents an intelligible hierarchy, because of permutation non-identifiability6. In order to represent the posterior hierarchy meaningfully, we constructed a "consensus" hierarchy from the samples: for each level 1 ≤ k ≤ K and each pair of actors i, j, we computed the fraction of samples Sijk in which i, j had the same level-k hierarchy position. If i, j shared the same level-k position in at least a fraction τ of all samples, i.e. Sijk ≥ τ, then we assigned them identical level-k positions in the consensus. Higher threshold values τ produce wider, more detailed consensus hierarchies, whereas lower values give rise to simpler hierarchies. We use τ = 0.35 in our analyses, as it provides a good middle ground between detail and interpretability.7
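A minimal sketch of the pairwise agreement computation underlying the consensus hierarchy, assuming (our own representation, not the paper's) that each posterior sample stores every actor's hierarchy-node id at each level:

```python
import numpy as np

def consensus_links(path_samples, k, tau=0.35):
    """path_samples: list of (N, K) integer arrays; entry [i, k-1] is actor
    i's hierarchy-node id at level k in one posterior sample. Returns an
    N x N boolean matrix marking pairs i, j whose level-k positions agree
    in at least a fraction tau of the samples (i.e. S_ijk >= tau)."""
    S = np.stack([p[:, k - 1] for p in path_samples])     # (num_samples, N)
    frac = (S[:, :, None] == S[:, None, :]).mean(axis=0)  # pairwise S_ijk
    return frac >= tau
```

One natural way to turn these pairwise links into consensus positions is to take connected components of the boolean matrix, though the paper does not spell out that step.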

As for the interaction levels z, we represent each zi→j's or zi←j's posterior by taking its K-bin histogram over all its samples (recall that the z are discrete with K possibilities). Note that our ultimate goal is really the MM vectors θ; we obtain the posterior mean of actor i's MM vector θi by averaging the histograms of all z's that, according to the generative model, are drawn from θi (that is to say, {zi→·, z·←i}).

8.1 Grass-Feeding Wasp Parasitoids Food Web

We first consider the N = 75 species grass-feeding wasps food web. In this network, the actors are species in a food web, and the 113 directed edges represent predator-prey relationships. Each species in the network is annotated with its position or “trophic level” in the food web: grass, herbivore, parasitoid, hyper-parasitoid (parasites that prey on other parasites), or hyper-hyper parasitoid. We stress that these trophic levels are not the hierarchy levels in our model, but are nodal attributes of the species.

The inference/selection algorithm took 9 minutes on a 2.83GHz Intel Core 2 processor, and the average hyperparameters were γ = 1.49, m = 0.915, π = 2.30, λ1 = 0.0774, λ2 = 1.50. The high value of m suggests that most interactions were shallow (occurring at level 1), while the large ratio λ2/λ1 is expected since the number of edges is ≪ N². We report the posterior "consensus" hierarchy and mean Multiscale Membership (MM) vectors in Figure 7. In the same Figure, we also show the original network, where each interaction Eij = 1 has been colored according to the 2nd-level communities involved (missing links Eij = 0 are not shown). The trophic level annotations are shown in the network by node shapes, and summarized as counts in the hierarchy.

Figure 7.

Grass-feeding wasps network: Top Left: Inferred hierarchy of communities, with community trophic level counts at the bottom. Top Right: Multiscale Membership vectors. Columns correspond to actor MM vectors θi, while rows correspond to hierarchy positions. Note that each MM vector θi can be nonzero only at the K = 2 hierarchy positions along its corresponding actor path ci. Cell intensities reflect the value of θi at a particular hierarchy position: black and white correspond to 1 and 0 respectively, while shades of blue correspond to values between 0 and 1. Bottom: Original network. Each edge is drawn with two colors, representing its 2nd-level donor/receiver communities as inferred by our algorithm. Node shapes represent annotated trophic levels (see legend in bottom right).

Generally speaking, the first-level super-communities separate the trophic levels. For instance, all grass species are found in community 3, while community 2 contains all but one of the herbivores, and community 1 contains most of the parasitoids. Notice that the trophic levels form a set of disassortative communities, e.g. we see that herbivores feed on grasses, but not on other herbivores. We contrast our results with those of Clauset et al. (2008), who did not recover this structure because their method assumes all communities are assortative. On the other hand, our model is able to learn disassortative network structures by virtue of its stochastic blockmodel assumption.

Let us analyze the sub-communities in detail. We begin with super-community 4, which is separated from the rest of the network by just one edge, from species 20 (Tetramesa petiolata) to 6 (Deschampsia cespitosa), and therefore represents an isolated sub-web. Observe that species 20, a herbivore, is unexpectedly found in sub-community 4.2 rather than 2.1. We explain this by noting that species 20 is the only herbivore that the parasitoids in sub-communities 4.1, 4.3 and 4.4 prey on. Additionally, the sub-communities within super-community 4 are topologically ordered, which reflects their food web trophic levels. To be precise, the species in 4.1 prey on 4.2–4.4, while those in 4.4 prey only on 4.2–4.3, and those in 4.3 prey only on 4.2. Next, consider super-communities 1 and 5. While the bulk of the parasitoids are in super-community 1, the apex parasitoids that prey on them are all in super-community 5. The distinction between apex parasitoids in sub-communities 5.1 and 5.2 appears to be driven by the number of parasitoids they prey upon: species 67 (Macroneura vesicularis) from 5.2 preys on more parasitoids (specifically, 8), whereas species 65 (Eupelmus atropurpureus) and 75 (Mesopolopus graminum) from 5.1 prey on fewer (4 and 5 parasitoids respectively).

Finally, we inspect the posterior mean of the MM vectors θ, shown in Figure 7. Intuitively, the MM vector θi shows how often species i interacts at the generic super-community level, versus the more specific sub-community level. In this network, all species interact primarily at the super-community level, with occasional interaction at the sub-community level. This suggests that the network structure is mostly explained by the hierarchy’s first level, while the second level is responsible for local structural details. Most second level interactions are found in super-communities 1, 2 and 3, corresponding to the majority of parasitoids, herbivores and grass species respectively. Because second-level interactions are used by the MSCB model to account for atypical behavior within super-communities, species with many such interactions are likely to have specialized roles in the food web, and thus make good targets for further investigation.

8.2 High-Energy Physics Citation Network

For our final experiment, we consider an N = 1,000-paper subgraph of the arXiv high-energy physics citation network, taken from the 2003 KDD Cup (2010). This subgraph was constructed by subsampling papers involved in citations from Jan 2002 through May 2003. The average hyperparameters from the inference/selection algorithm were γ = 7.76, m = 0.851, π = 0.492, λ1 = 0.0169, λ2 = 0.976, and the algorithm took 12.4 days to run on a 2.83GHz Intel Core 2 processor. This high runtime is a consequence of two factors: the 𝒪(N³K) runtime complexity per Gibbs sampler sweep, and the large value of γ, which reflects a high branching factor in the posterior hierarchy.

Figure 8 shows part of the posterior consensus hierarchy, where each displayed community has been annotated with the number of papers associated with it, as well as the most frequent title words from those papers8. We stress that the hierarchy is learnt only from the citation network adjacency matrix, without knowledge of the paper texts. In addition, we reveal the network community structure in Figure 9, by permuting the original adjacency matrix to match the order of inferred communities. The same figure also shows the posterior mean of the MM vectors, as a histogram over 2nd-level elements θi,2.
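The adjacency-matrix permutation used for this kind of visualization amounts to a stable sort of the actors by their inferred community labels; a sketch (array and function names are ours):

```python
import numpy as np

def permute_by_community(E, labels):
    """Reorder rows and columns of the adjacency matrix so that actors in
    the same inferred community become contiguous, making assortative
    communities visible as diagonal blocks."""
    order = np.argsort(labels, kind="stable")  # stable: preserves actor order within a community
    return E[np.ix_(order, order)], order
```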

Figure 8.

HEP network: Inferred community hierarchy. Each hierarchy node is annotated with its number of associated papers, and the most common title words from those papers. Due to space constraints, we only show the 8 largest 1st-level communities, as well as the 8 largest 2nd-level sub-communities under 1st-level community 1. In the remaining 1st-level communities, every sub-community contains only one document; these are uninteresting and have been hidden to save space.

Figure 9.

HEP network: Left: Adjacency matrix, permuted to match the order of communities in Figure 8. The blue dots represent citations from row i to column j. Right: Histogram of Multiscale-Membership vectors. The x-axis represents the value of θi,2, i.e. the 2nd-level component of the MM vectors.

The consensus hierarchy reflects strong assortative organization, and we expect most communities to correspond to specific areas of study. The giant community 1 contains 725 papers, and is clearly visible as the sparse diagonal block covering indices 1 to 725 in the adjacency matrix of Figure 9. Its top title words are general physics terms such as "theory", "quantum" and "field", implying that, as a whole, this community does not represent any specific subfield of theoretical physics. This observation is further supported by the fact that papers in community 1 have few citations among themselves or to the rest of the network.

The remaining 1st-level communities exhibit denser within-community citations than community 1. We hypothesize that they are associated with small groups of researchers that work on specific subfields, and who are highly likely to cite each other's work. For instance, community 2 (26 papers) corresponds to the dense diagonal block between indices 726 and 751 in Figure 9, and its top title keywords suggest it is predominantly about research in supergravity and string theory. In similar fashion, observe that community 3 (23 papers) corresponds to indices 752 through 774, and is focused on pp-waves and Penrose limits. The remaining 45 1st-level communities contain < 20 papers each, but still feature the dense within-community citations and specific jargon characteristic of specialized research.

At the 2nd level, the consensus hierarchy contains mostly singleton (one-member) sub-communities. This indicates that, past the 1st hierarchy level, our MCMC algorithm found little evidence to further group the papers. There are only 10 sub-communities with size > 1, all of which are found in super-community 1. The largest 8 are shown in Figure 8, and correspond to the diagonal block from index 1 through 39 in the adjacency matrix of Figure 9. These sub-communities contain either fewer or more within/between-community citations than the super-community 1 average, hence they are justified under a blockmodel assumption. For example, sub-community 1.3 contains 7 citations between 5 papers, while 4 of those papers contain the word "radion" (a hypothetical particle) in their titles. This suggests a tightly-knit community of researchers focused on a specific object.

Finally, the Multiscale Membership distribution (Figure 9) tells a similar story to the grass-feeding wasps network: most interactions occur at the 1st level; over 75% of MM vectors θi have less than 0.05 of their mass at the 2nd level. This observation, coupled with the small number of meaningful 2nd-level sub-communities, suggests that the 1st hierarchy level suffices to explain most of the network. Because of this, the few non-singleton 2nd-level sub-communities are in fact significant, and merit further investigation.

8.3 Posterior Sample Analysis for Both Datasets

To complete our analysis, we need to inspect the quality of the posterior samples returned by our MSCB algorithm. It is well-known that adjacent samples in an MCMC sequence tend to be highly correlated; this correlation decreases the effective sample size and thereby inflates the variance of any estimator computed from the samples. Consequently, an estimator will require more samples from a correlated chain to achieve the same level of precision as from an uncorrelated sequence of samples.

We quantify the degree of sample correlation in MSCB via the autocorrelation function RX(k). For a particular random variable X, this is defined as

RX(k) = [ ∑_{t=1}^{n−k} (X_t − X̄)(X_{t+k} − X̄) ] / [ ∑_{t=1}^{n−k} (X_t − X̄)² ],

where n is the total number of samples, X_t is the t-th sample of X, and X̄ is the sample mean of X. The question, then, is which random variable to inspect. Recall that the MSCB algorithm samples the discrete random variables c (paths) and z (level indicators); however, the autocorrelation function is not well-defined for discrete domains. As a proxy, we shall instead consider ℓ = log ℙ(E, c, z), the complete log-likelihood of a particular sample of c, z (after integrating out B, θ). We expect the autocorrelation R(k) of ℓ to provide a good picture of sample correlation in c, z.

Figure 10 shows the log-likelihood autocorrelation R(k) for the grass-feeding wasps and High Energy Physics networks, as computed on the 1,000 posterior samples. Using first-order autoregressive process theory, we can compute the posterior's sample size inflation factor or SSIF s = (1 + ρ)/(1 − ρ), where ρ = R(1) is the first-order autocorrelation. Intuitively, if we have n samples, then the effective sample size is given by n/s, because the autocorrelation ρ inflates the variance of the sample mean of ℓ by a factor of s. For the grass-feeding wasps network, ρ = 0.985 with SSIF s = 135, while for the High Energy Physics network, ρ = 0.991 with SSIF s = 210. This implies effective posterior sample sizes of 7.4 and 4.8 for the grass-feeding wasps and High Energy Physics networks respectively.
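These diagnostics are straightforward to compute from the log-likelihood trace. The sketch below follows the R(k) formula in this section, including its (n − k)-term denominator; note that other conventions normalize over all n terms instead.

```python
import numpy as np

def autocorr(x, k):
    """Sample autocorrelation R(k), with the denominator summed over the
    first n - k centered terms, matching the formula above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.sum(xc[: n - k] * xc[k:]) / np.sum(xc[: n - k] ** 2)

def ssif_and_ess(loglik_samples):
    """First-order autocorrelation rho = R(1), sample size inflation
    factor s = (1 + rho) / (1 - rho), and effective sample size n / s."""
    rho = autocorr(loglik_samples, 1)
    s = (1 + rho) / (1 - rho)
    return rho, s, len(loglik_samples) / s
```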

Figure 10.

Autocorrelation function of the complete log-likelihood from the 1,000 posterior samples, for both the grass-feeding wasp parasitoids food web (Left) and the High-Energy Physics citation network (Right).

Regarding the grass-feeding wasps network, we note that the MSCB algorithm took only 9 minutes to perform burn-in (10,000 iterations) and to take 1,000 samples from the posterior, for a total of 11,000 samples. Hence, it is computationally feasible to increase the effective sample size by taking 10,000 or even 100,000 samples from the posterior, and keeping only every 10th or 100th sample to save memory. However, the same cannot be said of the High-Energy Physics network, on which the MSCB algorithm took nearly 2 weeks to obtain the same number of samples (11,000 including burn-in). This dramatic increase in runtime directly results from MSCB's 𝒪(N³K) computational complexity, which makes it impractical to take many samples from larger networks.
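Thinning a longer run, as suggested above, is just a strided slice. Note that thinning by itself discards information; the effective-sample-size gain comes from running the chain longer, with thinning only reducing storage.

```python
def thin(samples, stride=10):
    """Keep every stride-th sample from a longer MCMC run, trading
    storage for lower autocorrelation between retained samples."""
    return samples[::stride]

kept = thin(list(range(100_000)), stride=100)  # 1,000 retained samples
```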

An alternative to taking more samples is to supplement the MSCB Gibbs sampling algorithm with Metropolis-Hastings moves, so as to decrease the autocorrelation between adjacent samples. Our inspection reveals that the MSCB Gibbs sampler has difficulty changing many actor paths c quickly. To illustrate this problem, suppose the sampler is initialized with 100 actors in the same path, but the true posterior requires these actors to be split 50-50 between two different paths. The Gibbs sampler must first split off a single actor to form a new path, and then move the remaining 49 actors to the new path. However, because of the nCRP prior's "rich-get-richer" property, the latter events tend to have low Gibbs sampler probabilities, and it may take many samples before the true 50-50 split is established. One solution to this problem is a split-merge strategy, similar to that used by Jain and Neal (2004) for the vanilla Chinese Restaurant Process. Briefly, a split-merge strategy for the nCRP would involve Metropolis-Hastings proposal moves that (1) split the actors in a single path into two paths, and (2) merge the actors from two paths into a single path. By interleaving these moves with the regular Gibbs sampler, we can make large jumps in the state space of c, and thus reduce autocorrelation. We expect this split-merge strategy to be a fairly simple extension of Jain and Neal (2004), though its implementation and derivation are out of the scope of this paper.

9. CONCLUSION

We have developed a nonparametric Multiscale Community Blockmodel (MSCB) that models networks in terms of hierarchical community memberships, which actors selectively undertake during interactions. To apply our model, we derived an MCMC algorithm that combines collapsed Gibbs sampling for latent variable posterior inference, and Metropolis-Hastings proposals for hyperparameter learning. Our algorithm automatically infers the structure of the hierarchy while simultaneously recovering the Multiscale Memberships of every actor, setting it apart from hierarchy-discovering methods that are restricted to binary hierarchies and/or single-community-memberships for actors. Moreover, because MSCB integrates aspects of stochastic blockmodels, it is expressive enough to account for both assortative (within-community) and disassortative (cross-community) interactions, as our simulation and real data experiments have demonstrated. These aspects of MSCB allow us to explore hierarchical network phenomena in a principled, statistical manner.

Acknowledgements

This paper is based on work supported by NSF IIS-0713379, NSF DBI-0546594 (Career), ONR N000140910758, AFOSR FA9550010247, NIH 1R01GM093156, and an Alfred P. Sloan Research Fellowship to Eric P. Xing. Qirong Ho is supported by a graduate fellowship from the Agency for Science, Technology And Research, Singapore. We thank Dr. Le Song for his help and discussion during this project.

Footnotes

1

Due to the short window of time, this 1,000-paper subnetwork has a lower citation density than the original network. We acknowledge that this subnetwork is not ideal for hierarchy learning, since papers that share only older citations will have no network paths between them. Nevertheless, this subnetwork retains enough structure for our algorithm to recover a 2-level hierarchy, which we report in our experiments.

2

Note that our use of the stick-breaking process is unrelated to the stick-breaking construction for the Dirichlet Process. We use the stick-breaking process to produce a mixture over the mixture components induced by the nCRP, not to define the mixture components themselves.

3

A word on notation: Bh,h′ is used as shorthand to (1) first select the community-compatibility matrix B associated with the parent of communities h, h′, and then (2) select the element of that B corresponding to h interacting with h′. Because our inference procedure integrates out the B matrices, a precise, long-form notation for them is unnecessary.

4

Algorithm details: for the IRM/MSCB MCMC algorithm, we took 10,000 samples as burn-in, and then estimated each hyperparameter γ, m, π, λ1, λ2 by its average over the next 1,000 samples. This was repeated for hierarchy depths K = 1 (i.e. IRM), 2 and 3. For the MMSB variational EM algorithm, we ran 100 random restarts until convergence, and then took hyperparameter estimates from the restart with the highest variational lower bound (with respect to the true likelihood). Because MMSB requires the number of latent roles R to be specified, we repeated this experiment for each 2 ≤ R ≤ 20.

5

One could address this by a Metropolis-Hastings scheme that splits and merges hierarchy branches, but the development of such a scheme is out of the scope of this work.

6

This issue is not unique to MSCB, but occurs when naively applying MCMC methods to models with permutation non-identifiability, such as stochastic blockmodels or models meant for clustering. In many such models, permuting the communities/clusters has no effect on the model likelihood, so the posterior mean must be an average over all permutations — which is usually uninterpretable.

7

A more thorough analysis would include consensus hierarchies over multiple values of τ ∈ [0, 1], so as to present a more complete picture of posterior variation in the hierarchy. Alternatively, one could analyze the "closeness" of pairs of network actors by their number of shared path samples.

8

We note that this output is reminiscent of text models from Natural Language Processing, particularly Latent Dirichlet Allocation (Blei, Ng and Jordan 2003). However, we stress that MSCB is not a text model; the title words are determined post hoc, after hierarchy inference.

Contributor Information

Qirong Ho, Email: qho@cs.cmu.edu, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15217.

Ankur P. Parikh, Email: apparikh@cs.cmu.edu, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15217.

Eric P. Xing, Email: epxing@cs.cmu.edu, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15217.

REFERENCES

  1. Airoldi E, Blei D, Fienberg S, Xing E. Mixed membership stochastic blockmodels. Journal of Machine Learning Research. 2008;9:1981–2014.
  2. Aldous D. Exchangeability and related topics. École d'Été de Probabilités de Saint-Flour XIII, 1983. 1985:1–198.
  3. Batagelj V, Ferligoj A. Direct and indirect methods for structural equivalence. Social Networks. 1992;14(1–2):63–90.
  4. Batagelj V, Mrvar A. Pajek: program for large network analysis. Connections. 1998;21(2):47–57.
  5. Bernardo J, Smith A. Bayesian Theory. New York: Wiley; 2000.
  6. Blackwell D, MacQueen J. Ferguson distributions via Pólya urn schemes. The Annals of Statistics. 1973:353–355.
  7. Blei D, Griffiths T, Jordan M. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM. 2010;57(2):1–30.
  8. Blei D, Jordan M. Variational methods for the Dirichlet process. In: Proceedings of the 21st International Conference on Machine Learning. ACM; 2004. p. 12.
  9. Blei D, Ng A, Jordan M. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3:993–1022.
  10. Chung F. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics. American Mathematical Society; 1997.
  11. Clauset A, Moore C, Newman M. Hierarchical structure and the prediction of missing links in networks. Nature. 2008;453(7191):98–101. doi:10.1038/nature06830.
  12. Clauset A, Newman M, Moore C. Finding community structure in very large networks. Physical Review E. 2004;70(6):066111. doi:10.1103/PhysRevE.70.066111.
  13. Dawah H, Hawkins B, Claridge M. Structure of the parasitoid communities of grass-feeding chalcid wasps. Journal of Animal Ecology. 1995;64(6):708–720.
  14. Escobar M, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995:577–588.
  15. Ferguson T. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973:209–230.
  16. Girvan M, Newman M. Community structure in social and biological networks. Proceedings of the National Academy of Sciences. 2002;99(12):7821. doi:10.1073/pnas.122653799.
  17. Guimerà R, Amaral L. Functional cartography of complex metabolic networks. Nature. 2005;433:895–900. doi:10.1038/nature03288.
  18. Handcock M, Raftery A, Tantrum J. Model-based clustering for social networks. Journal of the Royal Statistical Society, Series A. 2007;170(2):301–354.
  19. Hoff P, Raftery A, Handcock M. Latent space approaches to social network analysis. Journal of the American Statistical Association. 2002;97:1090–1098.
  20. Jain S, Neal R. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics. 2004;13(1):158–182.
  21. Jordan M, Ghahramani Z, Jaakkola T, Saul L. An introduction to variational methods for graphical models. Machine Learning. 1999;37(2):183–233.
  22. KDD Cup 2003 Datasets. 2010. http://www.cs.cornell.edu/projects/kddcup/datasets.html
  23. Kemp C, Tenenbaum J. The discovery of structural form. Proceedings of the National Academy of Sciences. 2008;105(31):10687. doi:10.1073/pnas.0802631105.
  24. Kemp C, Tenenbaum J, Griffiths T, Yamada T, Ueda N. Learning systems of concepts with an infinite relational model. In: Proceedings of the National Conference on Artificial Intelligence. Vol. 21. AAAI Press; 2006. p. 381.
  25. Krause A, Frank K, Mason D, Ulanowicz R, Taylor W. Compartments revealed in food-web structure. Nature. 2003;426(6964):282–285. doi:10.1038/nature02115.
  26. Krebs V. Mapping networks of terrorist cells. Connections. 2002;24(3):43–52.
  27. Liu J. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association. 1994;89(427).
  28. MacEachern S, Müller P. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics. 1998:223–238.
  29. Miller K, Griffiths T, Jordan M. Nonparametric latent feature models for link prediction. Advances in Neural Information Processing Systems (NIPS). 2009.
  30. Mimno D, McCallum A. Organizing the OCA: learning faceted subjects from a library of digital books. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM; 2007. pp. 376–385.
  31. Nallapati R, Ahmed A, Xing E, Cohen W. Joint latent topic models for text and citations. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2008. pp. 542–550.
  32. Newman D, Chemudugunta C, Smyth P. Statistical entity-topic models. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2006. pp. 680–686.
  33. Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences. 2004;101(9):2658. doi:10.1073/pnas.0400054101.
  34. Robert C, Casella G. Monte Carlo Statistical Methods. Springer; 2004.
  35. Roy D, Kemp C, Mansinghka V, Tenenbaum J. Learning annotated hierarchies from relational data. Advances in Neural Information Processing Systems. 2007;19:1185.
  36. Teh Y, Jordan M, Beal M, Blei D. Hierarchical Dirichlet processes. Journal of the American Statistical Association. 2006;101(476):1566–1581.
  37. Teh Y, Roy D. The Mondrian process. Advances in Neural Information Processing Systems. 2009.
  38. Wainwright M, Jordan M. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning. 2008;1(1–2):1–305.
  39. Wang C, Blei D. Variational inference for the nested Chinese restaurant process. Advances in Neural Information Processing Systems. 2009;22:1990–1998.
  40. Wang Y, Wong G. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association. 1987;82(397):8–19.
  41. Ward J. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association. 1963;58(301):236–244.
  42. Xing EP, Fu W, Song L. A state-space mixed membership blockmodel for dynamic network tomography. Annals of Applied Statistics. 2010;4(2):535–566.
