Published in final edited form as: Ann Appl Stat. 2013 Dec 10;7(2):1010–1039. doi: 10.1214/12-AOAS617

MODEL-BASED CLUSTERING OF LARGE NETWORKS

Duy Q. Vu, David R. Hunter and Michael Schweinberger

Abstract

We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger data sets than those seen elsewhere in the literature. The more flexible framework is achieved through introducing novel parameterizations of the model, giving varying degrees of parsimony, using exponential family models whose structure may be exploited in various theoretical and algorithmic ways. The algorithms are based on variational generalized EM algorithms, where the E-steps are augmented by a minorization-maximization (MM) idea. The bootstrapped standard error estimates are based on an efficient Monte Carlo network simulation idea. Last, we demonstrate the usefulness of the model-based clustering framework by applying it to a discrete-valued network with more than 131,000 nodes and 17 billion edge variables.

Keywords: Social networks, stochastic block models, finite mixture models, EM algorithms, generalized EM algorithms, variational EM algorithms, MM algorithms

1. Introduction

According to Fisher [(1922), page 311], “the object of statistical methods is the reduction of data.” The reduction of data is imperative in the case of discrete-valued networks that may have hundreds of thousands of nodes and billions of edge variables. The collection of such large networks is becoming more and more common, thanks to electronic devices such as cameras and computers. Of special interest is the identification of influential subsets of nodes and high-density regions of the network, with an eye toward breaking down the large network into smaller, more manageable components. These smaller, more manageable components may be studied by more advanced statistical models, such as advanced exponential family models [e.g., Frank and Strauss (1986), Strauss and Ikeda (1990), Wasserman and Pattison (1996), Snijders et al. (2006), Hunter and Handcock (2006)].

An example is given by signed networks, such as trust networks, which arise in World Wide Web applications. Users of internet-based exchange networks are invited to classify other users as either −1 (untrustworthy) or +1 (trustworthy). Trust networks can be used to protect users and enhance collaboration among users [Kunegis, Lommatzsch and Bauckhage (2009), Massa and Avesani (2007)]. A second example is the spread of infectious disease through populations by way of contacts among individuals [Britton and O'Neill (2002), Groendyke, Welch and Hunter (2011)]. In such applications, it may be of interest to identify potential super-spreaders—that is, individuals who are in contact with many other individuals and who could therefore spread the disease to many others—and dense regions of the network through which disease could spread rapidly.

The current article advances the model-based clustering of large networks in at least four ways. First, we introduce a simple and flexible statistical framework for parameterizing models based on statistical exponential families [e.g., Barndorff-Nielsen (1978)] that advances existing model-based clustering techniques. Model-based clustering of networks was pioneered by Snijders and Nowicki (1997). The simple, unconstrained parameterizations employed by Snijders and Nowicki (1997) and others [e.g., Nowicki and Snijders (2001), Airoldi et al. (2008), Daudin, Picard and Robin (2008), Zanghi et al. (2010), Mariadassou, Robin and Vacher (2010)] make sense when networks are small, undirected and binary, and when there are no covariates. In general, though, such parameterizations may be unappealing from both a scientific point of view and a statistical point of view, as they may result in nonparsimonious models with hundreds or thousands of parameters. An important advantage of the statistical framework we introduce here is that it gives researchers a choice: they can choose interesting features of the data, specify a model capturing those features, and cluster nodes based on the specified model. The resulting models are therefore both parsimonious and scientifically interesting.

Second, we introduce approximate maximum likelihood estimates of parameters based on novel variational generalized EM (GEM) algorithms, which take advantage of minorization-maximization (MM) algorithms [Hunter and Lange (2004)] and have computational advantages. For unconstrained models, tests suggest that the variational GEM algorithms we propose can converge more quickly and avoid local maxima more reliably than alternative algorithms; see Sections 6 and 7. In the presence of parameter constraints, we facilitate computations by exploiting the properties of exponential families [e.g., Barndorff-Nielsen (1978)]. In addition, we sketch how the variational GEM algorithm can be extended to obtain approximate Bayesian estimates.

Third, we introduce bootstrap standard errors to quantify the uncertainty about the approximate maximum likelihood estimates of the parameters, whereas other work has ignored the uncertainty about the approximate maximum likelihood estimates. To facilitate these bootstrap procedures, we introduce Monte Carlo simulation algorithms that generate sparse networks in much less time than conventional Monte Carlo simulation algorithms. In fact, without the more efficient Monte Carlo simulation algorithms, obtaining bootstrap standard errors would be infeasible.

Finally, while model-based clustering has been limited to networks with fewer than 13,000 nodes and 85 million edge variables [see the largest data set handled to date, Zanghi et al. (2010)], we demonstrate that we can handle much larger, nonbinary networks by considering an internet-based data set with more than 131,000 nodes and 17 billion edge variables, where “edge variables” comprise all observations, including node pairs between which no edge exists. Many internet-based companies and websites, such as http://amazon.com, http://netflix.com and http://epinions.com, allow users to review products and services. Because most users of the World Wide Web do not know each other and thus cannot be sure whether to trust each other, readers of reviews may be interested in an indication of the trustworthiness of the reviewers themselves. A convenient and inexpensive approach is based on evaluations of reviewers by readers. The data set we analyze in Section 7 comes from the website http://epinions.com, which collects such data by allowing any user i to evaluate any other user j as either untrustworthy, coded as yij = −1, or trustworthy, coded as yij = +1, where yij = 0 means that user i did not evaluate user j [Massa and Avesani (2007)]. The resulting network consists of n = 131,827 users and N = n(n − 1) = 17,378,226,102 observations. Since each user can only review a relatively small number of other users, the network is sparse: the vast majority of the observations yij are zero, with only 840,798 negative and positive evaluations. Our modeling goal, broadly speaking, is both to cluster the users based on the patterns of trust and distrust in this network and to understand the features of the various clusters by examining model parameters.
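For concreteness, a network of this size can be held in memory by storing only the nonzero evaluations, so the 17 billion edge variables reduce to fewer than a million stored entries. The following minimal sketch (in Python, with a made-up toy edge list, shown for illustration only) illustrates such a representation.

```python
# Minimal sketch: store a signed network as a dictionary keyed by ordered node
# pairs; absent pairs implicitly carry the baseline value y_ij = 0 ("no evaluation").
# The toy edges below are made up purely for illustration.
signed_edges = {
    (0, 1): +1,   # user 0 rates user 1 trustworthy
    (1, 0): +1,   # user 1 reciprocates
    (2, 0): -1,   # user 2 rates user 0 untrustworthy
}

def y(i, j, edges=signed_edges):
    """Return y_ij in {-1, 0, +1}; unstored pairs are 0."""
    return edges.get((i, j), 0)

n = 3
N = n * (n - 1)                                  # number of directed edge variables
print(N, len(signed_edges), y(0, 1), y(1, 2))    # -> 6 3 1 0
```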

The rest of the article is structured as follows: A scalable model-based clustering framework based on finite mixture models is introduced in Section 2. Approximate maximum likelihood and Bayesian estimation are discussed in Sections 3 and 4, respectively, and an algorithm for Monte Carlo simulation of large networks is described in Section 5. Section 6 compares the variational GEM algorithm to the variational EM algorithm of Daudin, Picard and Robin (2008). Section 7 applies our methods to the trust network discussed above.

2. Models for large, discrete-valued networks

We consider n nodes, indexed by integers 1, …, n, and edges yij between pairs of nodes i and j, where yij can take values in a finite set of M elements. By convention, yii = 0 for all i, where 0 signifies “no relationship.” We call the set of all edges yij a discrete-valued network, which we denote by y, and we let Y denote the set of possible values of y. Special cases of interest are (a) undirected binary networks y, where yij ∈ {0, 1} is subject to the linear constraint yij = yji for all i < j; (b) directed binary networks y, where yij ∈ {0, 1} for all i, j; and (c) directed signed networks y, where yij ∈ {−1, 0, 1} for all i, j.

A general approach to modeling discrete-valued networks is based on exponential families of distributions [Besag (1974), Frank and Strauss (1986)]:

\[
P_\theta(Y = y \mid x) = \exp[\theta^\top g(x, y) - \psi(\theta)], \qquad y \in \mathcal{Y}, \tag{2.1}
\]

where θ is the vector of canonical parameters and g(x, y) is the vector of canonical statistics depending on a matrix x of covariates, measured on the nodes or the pairs of nodes, and the network y, and ψ(θ) is given by

\[
\psi(\theta) = \log \sum_{y \in \mathcal{Y}} \exp[\theta^\top g(x, y)], \qquad \theta \in \mathbb{R}^p, \tag{2.2}
\]

and ensures that Pθ(Y = y | x) sums to 1.

A number of exponential family models have been proposed [e.g., Holland and Leinhardt (1981), Frank and Strauss (1986), Wasserman and Pattison (1996), Snijders et al. (2006), Hunter and Handcock (2006)]. In general, though, exponential family models are not scalable: the computing time to evaluate the likelihood function is exp(N log M), where N = n(n − 1)/2 in the case of undirected edges and N = n(n − 1) in the case of directed edges, which necessitates time-consuming estimation algorithms [e.g., Snijders (2002), Hunter and Handcock (2006), Møller et al. (2006), Koskinen, Robins and Pattison (2010), Caimo and Friel (2011)].

We therefore restrict attention to scalable exponential family models, which are characterized by dyadic independence:

\[
P_\theta(Y = y \mid x) = \prod_{i<j}^{n} P_\theta(D_{ij} = d_{ij} \mid x), \tag{2.3}
\]

where Dij ≡ Dij(Y) corresponds to Yij in the case of undirected edges and (Yij, Yji) in the case of directed edges. The subscripted i < j and superscripted n mean that the product in (2.3) should be taken over all pairs (i, j) with 1 ≤ i < j ≤ n; the same is true for sums as in (3.5).

Dyadic independence has at least three advantages: (a) it facilitates estimation, because the computing time to evaluate the likelihood function scales linearly with N; (b) it facilitates simulation, because dyads are independent; and (c) by design it bypasses the so-called model degeneracy problem: if N is large, some exponential family models without dyadic independence tend to be ill-defined and impractical for modeling networks [Strauss (1986), Handcock (2003), Schweinberger (2011)].
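To illustrate the linear scaling in (a), the following sketch (Python; dyad_logprob is a generic placeholder for whichever parameterization of the dyad probabilities is adopted) evaluates the log-likelihood in a single pass over the dyads.

```python
import math
from itertools import combinations

def log_likelihood(y, n, dyad_logprob):
    """Sum of log P_theta(D_ij = d_ij) over all dyads i < j, where in the
    directed case d_ij = (y_ij, y_ji); y is a dict of nonzero edges."""
    total = 0.0
    for i, j in combinations(range(n), 2):
        d_ij = (y.get((i, j), 0), y.get((j, i), 0))
        total += dyad_logprob(d_ij)
    return total

# toy example: directed binary edges, i.i.d. Bernoulli(0.1), for illustration only
p = 0.1
toy = {(0, 1): 1, (2, 1): 1}
print(log_likelihood(toy, 3, lambda d: sum(math.log(p if v else 1 - p) for v in d)))
```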

A disadvantage is that most exponential families with dyadic independence are either simplistic [e.g., models with identically distributed edges, Erdős and Rényi (1959), Gilbert (1959)] or nonparsimonious [e.g., the p1 model with O(n) parameters, Holland and Leinhardt (1981)].

We therefore assume that the probability mass function has a K-component mixture form as follows:

\[
P_{\gamma,\theta}(Y = y \mid x) = \sum_{z \in \mathcal{Z}} P_\theta(Y = y \mid x, Z = z)\, P_\gamma(Z = z) = \sum_{z \in \mathcal{Z}} \prod_{i<j}^{n} P_\theta(D_{ij} = d_{ij} \mid x, Z = z)\, P_\gamma(Z = z), \tag{2.4}
\]

where Z denotes the membership indicators Z1, …, Zn with distributions

\[
Z_i \mid \gamma_1, \ldots, \gamma_K \overset{\text{i.i.d.}}{\sim} \operatorname{Multinomial}(1; \gamma_1, \ldots, \gamma_K) \tag{2.5}
\]

and Z denotes the support of Z. In some applications, it may be desired to model the membership indicators Zi as functions of x by using multinomial logit or probit models with Zi as the outcome variables and x as predictors [e.g., Tallberg (2005)]. We do not elaborate on such models here, but the variational GEM algorithms discussed in Sections 3 and 4 could be adapted to such models.

Mixture models represent a reasonable compromise between model parsimony and complexity. In particular, the assumption of conditional dyadic independence does not imply marginal dyadic independence, which means that the mixture model of (2.4) captures some degree of dependence among the dyads. We give two specific examples of mixture models below.

Example 1. The p1 model of Holland and Leinhardt (1981) for directed, binary-valued networks may be modified using a mixture model. The original p1 model describes the sequence of in-degrees (number of incoming edges of nodes) and out-degrees (number of outgoing edges of nodes) as well as reciprocated edges, postulating that the dyads are independent and that the dyadic probabilities are of the form

\[
P_\theta(D_{ij} = d_{ij}) = \exp[(\alpha_i + \beta_j)\, y_{ij} + (\alpha_j + \beta_i)\, y_{ji} + \rho\, y_{ij} y_{ji} - \psi_{ij}(\theta)], \tag{2.6}
\]

where θ = (α1, …, αn, β1, …, βn, ρ) and exp{−ψij(θ)} is a normalizing constant. Following Holland and Leinhardt (1981), the parameters αi may be interpreted as activity or productivity parameters, representing the tendencies of nodes i to “send” edges to other nodes; the parameters βj may be interpreted as attractiveness or popularity parameters, representing the tendencies of nodes j to “receive” edges from other nodes; and the parameter ρ may be interpreted as a mutuality or reciprocity parameter, representing the tendency of nodes i and j to reciprocate edges.

A drawback of this model is that it requires 2n + 1 parameters. Here, we show how to extend it to a mixture model that is applicable to both directed and undirected networks as well as discrete-valued networks, that is much more parsimonious, and that allows identification of influential nodes.

Observe that the dyadic probabilities of (2.6) are of the form

\[
P_\theta(D_{ij} = d_{ij}) \propto \exp[\theta_1 g_1(d_{ij}) + \theta_{2i}^\top g_{2i}(d_{ij}) + \theta_{2j}^\top g_{2j}(d_{ij})], \tag{2.7}
\]

where θ1 = ρ is the reciprocity parameter and θ2i = (αi, βi) and θ2j = (αj, βj) are the sending and receiving propensities of nodes i and j, respectively. The corresponding statistics are the reciprocity indicator g1(dij) = yijyji and the sending and receiving indicators g2i(dij) = (yij, yji) and g2j(dij) = (yji, yij) of nodes i and j, respectively. A mixture model modification of the p1 model postulates that, conditional on Z, the dyadic probabilities are independent and of the form

\[
P_\theta(D_{ij} = d_{ij} \mid Z_{ik} = Z_{jl} = 1) \propto \exp[\theta_1 g_1(d_{ij}) + \theta_{2k}^\top g_{2k}(d_{ij}) + \theta_{2l}^\top g_{2l}(d_{ij})], \tag{2.8}
\]

where the parameter vectors θ2k and θ2l depend on the components k and l to which the nodes i and j belong, respectively. The mixture model version of the p1 model is therefore much more parsimonious provided K ≪ n and was proposed by Schweinberger, Petrescu-Prahova and Vu (2012) in the case of undirected, binary-valued networks. Here, the probabilities of (2.7) and (2.8) are applicable to both undirected and directed networks as well as discrete-valued networks, because the functions g1, g2k and g2l may be customized to fit the situation and may even depend on covariates x, though we have suppressed this possibility in the notation. Finally, the mixture model version of the p1 model admits model-based clustering of nodes based on indegrees or outdegrees or both. A small number of nodes with high indegree or outdegree or both is considered to be influential: if the corresponding nodes were to be removed, the network structure would be impacted.
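As a concrete illustration of (2.8), the following sketch (Python; the parameter values are arbitrary and serve only as an example) computes the normalized dyad probabilities for a directed, binary-valued dyad d = (yij, yji), using the reciprocity and sending and receiving statistics described above.

```python
import math
from itertools import product

def p1_mixture_dyad_probs(rho, theta_k, theta_l):
    """Dyad probabilities under the mixture version (2.8) of the p1 model for a
    directed binary dyad d = (y_ij, y_ji), with node i in component k and node j
    in component l; theta_k = (alpha_k, beta_k) are sending/receiving parameters."""
    alpha_k, beta_k = theta_k
    alpha_l, beta_l = theta_l
    weights = {}
    for y_ij, y_ji in product((0, 1), repeat=2):
        linpred = (rho * y_ij * y_ji                  # reciprocated edge
                   + alpha_k * y_ij + beta_k * y_ji   # i sends / i receives
                   + alpha_l * y_ji + beta_l * y_ij)  # j sends / j receives
        weights[(y_ij, y_ji)] = math.exp(linpred)
    norm = sum(weights.values())                      # dyad-level normalizing constant
    return {d: w / norm for d, w in weights.items()}

# arbitrary illustrative values
print(p1_mixture_dyad_probs(rho=1.0, theta_k=(-2.0, -1.0), theta_l=(-1.5, -0.5)))
```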

Example 2. The mixture model of Nowicki and Snijders (2001) assumes that, conditional on Z, the dyads are independent and the conditional dyadic probabilities are of the form

\[
P_\pi(D_{ij} = d \mid Z_{ik} = Z_{jl} = 1) = \pi_{d;kl}. \tag{2.9}
\]

In other words, conditional on Z, the dyad probabilities are constant across dyads and do not depend on covariates. It is straightforward to add covariates by writing the conditional dyad probabilities in canonical form:

\[
P_\theta(D_{ij} = d_{ij} \mid x, Z_{ik} = Z_{jl} = 1) \propto \exp[\theta_1^\top g_1(x, d_{ij}) + \theta_{kl}^\top g_2(x, d_{ij})], \tag{2.10}
\]

where the canonical statistic vectors g1(x, dij) and g2(x, dij) may depend on the covariates x. If the canonical parameter vectors θkl are constrained by the linear constraints θkl = θk + θl, where θk and θl are parameter vectors of the same dimension as θkl, then the mixture model version of the p1 model arises. In other words, the mixture model version of the p1 model can be viewed as a constrained version of the Nowicki and Snijders (2001) model. While the constrained version can be used to cluster nodes based on degree, the unconstrained version can be used to identify, for instance, high-density regions of the network, corresponding to subsets of nodes with large numbers of within-subset edges. These regions may then be studied individually in more detail by using more advanced statistical models such as exponential family models without dyadic independence as proposed by, for example, Holland and Leinhardt (1981), Frank and Strauss (1986), Strauss and Ikeda (1990), Wasserman and Pattison (1996), Snijders et al. (2006) or Hunter and Handcock (2006).

Other examples. Other mixture models for networks have been proposed by Tallberg (2005), Handcock, Raftery and Tantrum (2007) and Airoldi et al. (2008). However, these models scale less well to large networks, so we confine attention here to examples 1 and 2.

3. Approximate maximum likelihood estimation

A standard approach to maximum likelihood estimation of finite mixture models is based on the classical EM algorithm, taking the complete data to be (Y, Z), where Z is unobserved [Dempster, Laird and Rubin (1977)]. However, the E-step of an EM algorithm requires the computation of the conditional expectation of the complete data log-likelihood function under the distribution of Z | Y, which is intractable here even in the simplest cases [Daudin, Picard and Robin (2008)].

As an alternative, we consider so-called variational EM algorithms, which can be considered as generalizations of EM algorithms. The basic idea of variational EM algorithms is to construct a tractable lower bound on the intractable log-likelihood function and maximize the lower bound, yielding approximate maximum likelihood estimates. Celisse, Daudin and Pierre (2011) have shown that approximate maximum likelihood estimators along these lines are—at least in the absence of parameter constraints—consistent estimators.

We assume that all modeling of Y can be conditional on covariates x and define

\[
\pi_{d;ij,kl,x}(\theta) = P_\theta(D_{ij} = d \mid Z_{ik} = Z_{jl} = 1, x).
\]

However, for ease of presentation, we drop the notational dependence of πd;ij,kl,x on i, j, x and make the homogeneity assumption

\[
\pi_{d;ij,kl,x}(\theta) = \pi_{d;kl}(\theta) \qquad \text{for all } i, j, x, \tag{3.1}
\]

which is satisfied by the models in examples 1 and 2. Exponential parameterizations of πd;kl(θ), as in (2.6) and (2.10), may or may not be convenient. An attractive property of the variational EM algorithm proposed here is that it can handle all possible parameterizations of πd;kl(θ). In some cases (e.g., example 1), exponential parameterizations are more advantageous than others, while in other cases (e.g., example 2), the reverse holds.

3.1. Variational EM algorithm

Let A(z) ≡ P(Z = z) be an auxiliary distribution with support Z. Using Jensen's inequality, the log-likelihood function can be bounded below as follows:

\[
\begin{aligned}
\log P_{\gamma,\theta}(Y = y) &= \log \sum_{z \in \mathcal{Z}} P_{\gamma,\theta}(Y = y, Z = z)\, \frac{A(z)}{A(z)} \\
&\ge \sum_{z \in \mathcal{Z}} \left[ \log \frac{P_{\gamma,\theta}(Y = y, Z = z)}{A(z)} \right] A(z) \\
&= \mathrm{E}_A[\log P_{\gamma,\theta}(Y = y, Z = z)] - \mathrm{E}_A[\log A(Z)].
\end{aligned}
\tag{3.2}
\]

Some choices of A(z) give rise to better lower bounds than others. To see which choice gives rise to the best lower bound, observe that the difference between the log-likelihood function and the lower bound is equal to the Kullback–Leibler divergence from A(z) to Pγ,θ(Z = z | Y = y):

\[
\begin{aligned}
\log P_{\gamma,\theta}(Y = y) &- \sum_{z \in \mathcal{Z}} \left[ \log \frac{P_{\gamma,\theta}(Y = y, Z = z)}{A(z)} \right] A(z) \\
&= \sum_{z \in \mathcal{Z}} [\log P_{\gamma,\theta}(Y = y)]\, A(z) - \sum_{z \in \mathcal{Z}} \left[ \log \frac{P_{\gamma,\theta}(Y = y, Z = z)}{A(z)} \right] A(z) \\
&= \sum_{z \in \mathcal{Z}} \left[ \log \frac{A(z)}{P_{\gamma,\theta}(Z = z \mid Y = y)} \right] A(z).
\end{aligned}
\tag{3.3}
\]

If the choice of A(z) were unconstrained in the sense that we could choose from the set of all distributions with support Z, then the best lower bound is obtained by the choice A(z) = Pγ,θ(Z = z | Y = y), which reduces the Kullback–Leibler divergence to 0 and makes the lower bound tight. If the optimal choice is intractable, as is the case here, then it is convenient to constrain the choice to a subset of tractable choices and substitute a choice which, within the subset of tractable choices, is as close as possible to the optimal choice in terms of Kullback–Leibler divergence. A natural subset of tractable choices is given by introducing the auxiliary parameters α = (α1,…,αn) and setting

\[
A(z) = P_\alpha(Z = z) = \prod_{i=1}^{n} P_{\alpha_i}(Z_i = z_i), \tag{3.4}
\]

where the marginal auxiliary distributions Pαi(Zi = zi) are Multinomial(1;αi1,…,αiK). In this case, the lower bound may be written

\[
\begin{aligned}
\mathrm{LB}_{\mathrm{ML}}(\gamma, \theta; \alpha) &= \mathrm{E}_\alpha[\log P_{\gamma,\theta}(Y = y, Z = z)] - \mathrm{E}_\alpha[\log P_\alpha(Z)] \\
&= \sum_{i<j}^{n} \sum_{k=1}^{K} \sum_{l=1}^{K} \alpha_{ik} \alpha_{jl} \log \pi_{d_{ij};kl}(\theta) + \sum_{i=1}^{n} \sum_{k=1}^{K} \alpha_{ik} (\log \gamma_k - \log \alpha_{ik}).
\end{aligned}
\tag{3.5}
\]

Because equation (3.4) assumes independence, the Kullback–Leibler divergence between Pα(Z = z) and Pγ,θ(Z = z | Y = y), and thus the tightness of the lower bound, is determined by the dependence of the random variables Z1,…,Zn conditional on Y. If the random variables Z1,…,Zn are independent conditional on Y, then, for each i, there exists αi such that Pαi(Zi = zi) = Pγ,θ(Zi = zi | Y = y), which reduces the Kullback–Leibler divergence to 0 and makes the lower bound tight. In general, the random variables Z1,…,Zn are not independent conditional on Y and the Kullback–Leibler divergence (3.3) is thus positive.
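For concreteness, the following sketch (Python/NumPy) evaluates the lower bound (3.5) under the homogeneity assumption (3.1) and the unconstrained parameterization (2.9); it loops over all dyads for clarity, whereas a practical implementation exploits the sparsity of the network.

```python
import numpy as np

def lower_bound_LB_ML(dyads, alpha, gamma, log_pi):
    """LB_ML(gamma, theta; alpha) of (3.5) for the unconstrained model (2.9).

    dyads  : dict mapping every pair (i, j), i < j, to its dyad value index d_ij
    alpha  : (n, K) array of variational probabilities alpha_ik
    gamma  : (K,) array of mixing proportions gamma_k
    log_pi : (D, K, K) array with log_pi[d, k, l] = log pi_{d;kl}
    """
    lb = 0.0
    for (i, j), d in dyads.items():                        # sum over dyads i < j
        lb += alpha[i] @ log_pi[d] @ alpha[j]              # sum_k sum_l a_ik a_jl log pi_{d;kl}
    lb += np.sum(alpha * (np.log(gamma) - np.log(alpha)))  # sum_i sum_k a_ik (log gamma_k - log a_ik)
    return lb
```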

Approximate maximum likelihood estimates of γ and θ can be obtained by maximizing the lower bound in (3.5) using variational EM algorithms of the following form, where t is the iteration number:

  • E-step: Letting γ(t) and θ(t) denote the current values of γ and θ, maximize LBML(γ(t), θ(t);α) with respect to α. Let α(t+1) denote the optimal value of α and compute Eα(t+1) [log Pγ,θ(Y = y, Z = z)].

  • M-step: Maximize Eα(t+1) [log Pγ,θ(Y = y, Z = z)] with respect to γ and θ, which is equivalent to maximizing LBML(γ,θ;α(t+1)) with respect to γ and θ.

The method ensures that the lower bound is nondecreasing in the iteration number:

\[
\mathrm{LB}_{\mathrm{ML}}(\gamma^{(t)}, \theta^{(t)}; \alpha^{(t)}) \le \mathrm{LB}_{\mathrm{ML}}(\gamma^{(t)}, \theta^{(t)}; \alpha^{(t+1)}) \tag{3.6}
\]
\[
\le \mathrm{LB}_{\mathrm{ML}}(\gamma^{(t+1)}, \theta^{(t+1)}; \alpha^{(t+1)}), \tag{3.7}
\]

where inequalities (3.6) and (3.7) follow from the E-step and M-step, respectively.

It is instructive to compare the variational EM algorithm to the classical EM algorithm as applied to finite mixture models. The E-step of the variational EM algorithm minimizes the Kullback–Leibler divergence between A(z) and Pγ(t),θ(t)(Z = z | Y = y). If the choice of A(z) were unconstrained, then the optimal choice would be A(z) = Pγ(t),θ(t)(Z = z | Y = y). Therefore, in the unconstrained case, the E-step of the variational EM algorithm reduces to the E-step of the classical EM algorithm, so the classical EM algorithm can be considered to be the optimal variational EM algorithm.

3.1.1. Generalized E-step: An MM algorithm

To implement the E-step, we exploit the fact that the lower bound is nondecreasing as long as the E-step and M-step increase the lower bound. In other words, we do not need to maximize the lower bound in the E-step and M-step. Indeed, increasing rather than maximizing the lower bound in the E-step and M-step may have computational advantages when n is large. In the literature on EM algorithms, the advantages of incremental E-steps and incremental M-steps are discussed by Neal and Hinton (1993) and Dempster, Laird and Rubin (1977), respectively. We refer to the variational EM algorithm with either an incremental E-step or an incremental M-step or both as a variational generalized EM, or variational GEM, algorithm.

Direct maximization of LBML(γ(t), θ(t);α) is unattractive: equation (3.5) shows that the lower bound depends on the products αikαjl and, therefore, fixed-point updates of αik along the lines of [Daudin, Picard and Robin (2008)] depend on all other αjl. We demonstrate in Section 6 that the variational EM algorithm with the fixed-point implementation of the E-step can be inferior to the variational GEM algorithm when K is large.

To separate the parameters of the maximization problem, we increase LBML(γ(t), θ(t);α) via an MM algorithm [Hunter and Lange (2004)]. MM algorithms can be viewed as generalizations of EM algorithms [Hunter and Lange (2004)] and are based on iteratively constructing and then optimizing surrogate (minorizing) functions to facilitate the maximization problem in certain situations. We consider here the surrogate function

\[
\begin{aligned}
Q_{\mathrm{ML}}(\gamma^{(t)}, \theta^{(t)}, \alpha^{(t)}; \alpha) &= \sum_{i<j}^{n} \sum_{k=1}^{K} \sum_{l=1}^{K} \left( \frac{\alpha_{ik}^{2}\, \alpha_{jl}^{(t)}}{2\alpha_{ik}^{(t)}} + \frac{\alpha_{jl}^{2}\, \alpha_{ik}^{(t)}}{2\alpha_{jl}^{(t)}} \right) \log \pi_{d_{ij};kl}(\theta^{(t)}) \\
&\quad + \sum_{i=1}^{n} \sum_{k=1}^{K} \alpha_{ik} \left( \log \gamma_k^{(t)} - \log \alpha_{ik}^{(t)} - \frac{\alpha_{ik}}{\alpha_{ik}^{(t)}} + 1 \right),
\end{aligned}
\tag{3.8}
\]

which we show in Appendix A to have the following two properties:

\[
Q_{\mathrm{ML}}(\gamma^{(t)}, \theta^{(t)}, \alpha^{(t)}; \alpha) \le \mathrm{LB}_{\mathrm{ML}}(\gamma^{(t)}, \theta^{(t)}; \alpha) \qquad \text{for all } \alpha, \tag{3.9}
\]
\[
Q_{\mathrm{ML}}(\gamma^{(t)}, \theta^{(t)}, \alpha^{(t)}; \alpha^{(t)}) = \mathrm{LB}_{\mathrm{ML}}(\gamma^{(t)}, \theta^{(t)}; \alpha^{(t)}). \tag{3.10}
\]

In the language of MM algorithms, conditions (3.9) and (3.10) establish that QML(γ(t), θ(t), α(t); α) is a minorizer of LBML(γ(t), θ(t); α) at α(t). The theory of MM algorithms implies that maximizing the minorizer with respect to α forces LBML(γ(t), θ(t); α) uphill [Hunter and Lange (2004)]. This maximization, involving n separate quadratic programming problems in the K variables αi1, …, αiK under the constraints αik ≥ 0 for all k and Σk αik = 1, may be accomplished quickly using the method described by Stefanov (2004). When n is large, it is much easier to update α by maximizing the QML function, which is the sum of functions of the individual αi, than by maximizing the LBML function, in which the α parameters are not separated in this way. We therefore arrive at the following replacement for the E-step:

  • Generalized E-step: For i = 1, …, n, increase QML(γ(t), θ(t), α(t); α) as a function of αi subject to αik ≥ 0 for all k and Σk αik = 1. Let α(t+1) denote the new value of α.
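Each per-node subproblem amounts to maximizing a separable concave quadratic, Σk (ck αik² + bk αik) with ck < 0, over the probability simplex. Stefanov (2004) describes an exact method; the sketch below (Python/NumPy; not the method used in our implementation) solves the same problem by bisection on the Karush–Kuhn–Tucker multiplier, with the coefficients b and c assumed to have been assembled from (3.8).

```python
import numpy as np

def maximize_quadratic_on_simplex(b, c, tol=1e-12):
    """Maximize sum_k (c[k]*x[k]**2 + b[k]*x[k]) subject to x >= 0 and sum(x) = 1,
    assuming every c[k] < 0 (strict concavity).  The KKT conditions give
    x_k(lam) = max(0, (lam - b[k]) / (2 c[k])); we choose the multiplier lam by
    bisection so that the coordinates sum to one."""
    b, c = np.asarray(b, float), np.asarray(c, float)

    def x_of(lam):
        return np.maximum(0.0, (lam - b) / (2.0 * c))

    lo = hi = 0.0
    step = 1.0
    while x_of(lo).sum() < 1.0:        # decrease lam until the sum is >= 1
        lo -= step
        step *= 2.0
    step = 1.0
    while x_of(hi).sum() > 1.0:        # increase lam until the sum is <= 1
        hi += step
        step *= 2.0
    while hi - lo > tol:               # sum(x_of(lam)) is nonincreasing in lam
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if x_of(mid).sum() > 1.0 else (lo, mid)
    x = x_of(0.5 * (lo + hi))
    return x / x.sum()                 # remove residual rounding error

# toy example with arbitrary coefficients (K = 3)
print(maximize_quadratic_on_simplex(b=[0.2, -0.1, 0.05], c=[-1.0, -0.5, -2.0]))
```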

3.1.2. More on the M-step

To maximize LBML(γ, θ;α(t+1)) in the M-step, examination of (3.5) shows that maximization with respect to γ and θ may be accomplished separately. In fact, for γ, there is a simple, closed-form solution:

\[
\gamma_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \alpha_{ik}^{(t+1)}, \qquad k = 1, \ldots, K. \tag{3.11}
\]

Concerning θ, if there are no constraints on π(θ) other than Σd∈𝒟 πd;kl(θ) = 1, it is preferable to maximize with respect to π = π(θ) rather than θ, because there are closed-form expressions for π(t+1) but not for θ(t+1). Maximization with respect to π is accomplished by setting

\[
\pi_{d;kl}^{(t+1)} = \frac{\sum_{i<j}^{n} \alpha_{ik}^{(t+1)} \alpha_{jl}^{(t+1)} I(D_{ij} = d)}{\sum_{i<j}^{n} \alpha_{ik}^{(t+1)} \alpha_{jl}^{(t+1)}}, \qquad d \in \mathcal{D},\ k, l = 1, \ldots, K. \tag{3.12}
\]

If the homogeneity assumption (3.1) does not hold, then closed-form expressions for π may not be available. In some cases, as in the presence of categorical covariates, closed form expressions for π are available, but the dimension of π, and thus computing time, increases with the number of categories.
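When the closed-form updates (3.11) and (3.12) apply, they are straightforward to implement. The sketch below (Python/NumPy, using the same data layout as the lower-bound sketch in Section 3.1 and again looping over all dyads for clarity) carries out one M-step for the unconstrained model.

```python
import numpy as np

def m_step_unconstrained(dyads, alpha, D):
    """Closed-form M-step updates (3.11) and (3.12) for the unconstrained model (2.9).

    dyads : dict mapping every pair (i, j), i < j, to its dyad value index
    alpha : (n, K) array of variational probabilities from the E-step
    D     : number of possible dyad values
    """
    gamma = alpha.mean(axis=0)                   # (3.11)
    K = alpha.shape[1]
    num = np.zeros((D, K, K))
    den = np.zeros((K, K))
    for (i, j), d in dyads.items():
        outer = np.outer(alpha[i], alpha[j])     # alpha_ik * alpha_jl for all k, l
        num[d] += outer
        den += outer
    pi = num / den                               # (3.12); den broadcasts over d
    return gamma, pi
```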

If equations (2.1) and (2.3) hold, then the exponential parametrization π(θ) may be inverted to obtain an approximate maximum likelihood estimate of θ after the approximate MLE of π is found using the variational GEM algorithm. One method for accomplishing this inversion exploits the convex duality of exponential families [Barndorff-Nielsen (1978), Wainwright and Jordan (2008)] and is explained in Appendix B.

If, in addition to the constraint Σd∈𝒟 πd;kl(θ) = 1, additional constraints on π are present, the maximization with respect to π may either decrease or increase computing time. Linear constraints on π can be enforced by Lagrange multipliers and reduce the dimension of π and thus computing time. Nonlinear constraints on π, as in example 1, may not admit closed-form updates of π and thus may require iterative methods. If so, and if the nonlinear constraints stem from exponential family parameterizations of π(θ) with natural parameter vector θ as in example 1, then it is convenient to translate the constrained maximization problem into an unconstrained problem by maximizing LBML(γ, θ; α(t+1)) with respect to θ, exploiting the fact that LBML(γ, θ; α(t+1)) is a concave function of θ owing to the exponential family membership of πd;kl(θ) [Barndorff-Nielsen (1978), page 150]. We show in Appendix C how the exponential family parameterization can be used to derive the gradient and Hessian of the lower bound LBML(γ, θ; α(t+1)) with respect to θ, which we exploit in Section 7 using a Newton–Raphson algorithm.

3.2. Standard errors

Although we maximize the lower bound LBML(γ, θ; α) of the log-likelihood function to obtain approximate maximum likelihood estimates, standard errors of the approximate maximum likelihood estimates γ̂ and θ̂ based on the curvature of the lower bound LBML(γ, θ; α) may be too small. The reason is that even when the lower bound is close to the log-likelihood function, the lower bound may be more curved than the log-likelihood function [Wang and Titterington (2005)]; indeed, the higher curvature helps ensure that LBML(γ, θ; α) is a lower bound of the log-likelihood function log Pγ,θ(Y = y) in the first place. As an alternative, we approximate the standard errors of the approximate maximum likelihood estimates of γ and θ by a parametric bootstrap method [Efron (1979)] that can be described as follows:

  • (1) Given the approximate maximum likelihood estimates of γ and θ, sample B data sets.

  • (2) For each data set, compute the approximate maximum likelihood estimates of γ and θ.

In addition to fast maximum likelihood algorithms, the parametric bootstrap method requires fast simulation algorithms. We propose such an algorithm in Section 5.
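In outline, the bootstrap computation looks as follows; in this sketch (Python/NumPy), simulate_network and fit_variational_gem are placeholders standing in for the simulation algorithm of Section 5 and the estimation algorithm of this section.

```python
import numpy as np

def parametric_bootstrap_ci(gamma_hat, theta_hat, simulate_network,
                            fit_variational_gem, B=500, level=0.95):
    """Percentile bootstrap confidence intervals for the entries of theta.
    simulate_network and fit_variational_gem are user-supplied placeholders."""
    draws = []
    for _ in range(B):
        y_star = simulate_network(gamma_hat, theta_hat)   # step (1): simulate a network
        _, theta_star = fit_variational_gem(y_star)       # step (2): re-estimate
        draws.append(theta_star)
    draws = np.asarray(draws)
    lo, hi = (1.0 - level) / 2.0, 1.0 - (1.0 - level) / 2.0
    return np.quantile(draws, [lo, hi], axis=0)           # e.g., 2.5% and 97.5% quantiles
```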

3.3. Starting and stopping

As usual with EM-like algorithms, it is a good idea to use multiple starting values with the variational EM due to the existence of distinct local maxima. We find it easiest to use random starts in which we assign the values of α(0) and then commence with an M-step. This results in values γ(0) and θ(0); the algorithm then continues with the first E-step, and so on. The initial αik(0) are chosen independently and uniformly at random on (0, 1), and each αi(0) is then normalized so that its elements sum to one.

The numerical experiments of Section 7 use 100 random restarts each. Ideally, more restarts would be used, yet the size of the data sets with which we work makes every run somewhat expensive. We chose the number 100 because we were able to parallelize on a fairly large scale, essentially running 100 separate copies of the algorithm. Larger numbers of runs, such as 1000, would have forced longer run times since we would have had to run some of the trials in series rather than in parallel.

As a convergence criterion, we stop the algorithm as soon as

\[
\frac{\mathrm{LB}_{\mathrm{ML}}(\gamma^{(t+1)}, \theta^{(t+1)}; \alpha^{(t+1)}) - \mathrm{LB}_{\mathrm{ML}}(\gamma^{(t)}, \theta^{(t)}; \alpha^{(t)})}{\bigl|\mathrm{LB}_{\mathrm{ML}}(\gamma^{(t+1)}, \theta^{(t+1)}; \alpha^{(t+1)})\bigr|} < 10^{-10}.
\]

We consider the relative change in the objective function rather than the absolute change or the changes in the parameters themselves because (1) even small changes in the parameter values can result in large changes of the objective function, and (2) the objective function is a lower bound of the log-likelihood, so small absolute changes of the objective function may not be worth the computational effort.

4. Approximate Bayesian estimation

The key to Bayesian model estimation and model selection is the marginal likelihood, defined as

\[
P(Y = y) = \int_\Gamma \int_\Theta \sum_{z \in \mathcal{Z}} P_{\gamma,\theta}(Y = y, Z = z)\, p(\gamma, \theta)\, d\gamma\, d\theta, \tag{4.1}
\]

where p(γ, θ) is the prior distribution of γ and θ. To ensure that the marginal likelihood is well-defined, we assume that the prior distribution is proper, which is common practice in mixture modeling [McLachlan and Peel (2000), Chapter 4]. A lower bound on the log marginal likelihood can be derived by introducing an auxiliary distribution with support Z × Γ × Θ, where Γ is the parameter space of γ and Θ is the parameter space of θ. A natural choice of auxiliary distributions is given by

\[
A_\alpha(z, \gamma, \theta) \equiv \left[ \prod_{i=1}^{n} P_{\alpha_{Z,i}}(Z_i = z_i) \right] p_{\alpha_\gamma}(\gamma) \left[ \prod_{i=1}^{L} p_{\alpha_\theta}(\theta_i) \right], \tag{4.2}
\]

where α denotes the set of auxiliary parameters αZ = (αZ,1, …, αZ,n), αγ and αθ.

A lower bound on the log marginal likelihood can be derived by Jensen's inequality:

\[
\begin{aligned}
\log P(Y = y) &= \log \int_\Gamma \int_\Theta \sum_{z \in \mathcal{Z}} P_{\gamma,\theta}(Y = y, Z = z)\, p(\gamma, \theta)\, \frac{A_\alpha(z, \gamma, \theta)}{A_\alpha(z, \gamma, \theta)}\, d\gamma\, d\theta \\
&\ge \mathrm{E}_\alpha[\log P_{\gamma,\theta}(Y = y, Z = z)\, p(\gamma, \theta)] - \mathrm{E}_\alpha[\log A_\alpha(Z, \gamma, \theta)],
\end{aligned}
\tag{4.3}
\]

where the expectations are taken with respect to the auxiliary distribution Aα (z, γ, θ).

We denote the right-hand side of (4.3) by LBB(αγ, αθ; αZ). By an argument along the lines of (3.3), one can show that the difference between the log marginal likelihood and LBB(αγ, αθ; αZ) is equal to the Kullback–Leibler divergence from the auxiliary distribution Aα(z, γ, θ) to the posterior distribution P(Z = z, γ, θ |Y = y):

\[
\begin{aligned}
\log P(Y = y) &- \int_\Gamma \int_\Theta \sum_{z \in \mathcal{Z}} \left[ \log \frac{P_{\gamma,\theta}(Y = y, Z = z)\, p(\gamma, \theta)}{A_\alpha(z, \gamma, \theta)} \right] A_\alpha(z, \gamma, \theta)\, d\gamma\, d\theta \\
&= \int_\Gamma \int_\Theta \sum_{z \in \mathcal{Z}} \left[ \log \frac{A_\alpha(z, \gamma, \theta)}{P(Z = z, \gamma, \theta \mid Y = y)} \right] A_\alpha(z, \gamma, \theta)\, d\gamma\, d\theta.
\end{aligned}
\tag{4.4}
\]

The Kullback–Leibler divergence between the auxiliary distribution and the posterior distribution can be minimized by a variational GEM algorithm as follows, where t is the iteration number:

  • Generalized E-step: Letting αγ(t) and αθ(t) denote the current values of αγ and αθ, increase LBB(αγ(t), αθ(t); αZ) with respect to αZ. Let αZ(t+1) denote the new value of αZ.

  • Generalized M-step: Choose new values αγ(t+1) and αθ(t+1) that increase LBB(αγ, αθ; αZ(t+1)) with respect to αγ and αθ.

By construction, iteration t of a variational GEM algorithm increases the lower bound LBB(αγ, αθ; αZ):

\[
\mathrm{LB}_{\mathrm{B}}(\alpha_\gamma^{(t)}, \alpha_\theta^{(t)}; \alpha_Z^{(t)}) \le \mathrm{LB}_{\mathrm{B}}(\alpha_\gamma^{(t)}, \alpha_\theta^{(t)}; \alpha_Z^{(t+1)}) \tag{4.5}
\]
\[
\le \mathrm{LB}_{\mathrm{B}}(\alpha_\gamma^{(t+1)}, \alpha_\theta^{(t+1)}; \alpha_Z^{(t+1)}). \tag{4.6}
\]

A variational GEM algorithm approximates the marginal likelihood as well as the posterior distribution. Therefore, it tackles Bayesian model estimation and model selection at the same time.

Variational GEM algorithms for approximate Bayesian inference are only slightly more complicated to implement than the variational GEM algorithms for approximate maximum likelihood estimation presented in Section 3. To understand the difference, we examine the analogue of (3.5):

\[
\begin{aligned}
\mathrm{LB}_{\mathrm{B}}(\alpha_\gamma, \alpha_\theta; \alpha_Z) &= \sum_{i<j}^{n} \sum_{k=1}^{K} \sum_{l=1}^{K} \alpha_{Z,ik}\, \alpha_{Z,jl}\, \mathrm{E}_\alpha[\log \pi_{d_{ij};kl}(\theta)] + \mathrm{E}_\alpha[\log P_\gamma(Z = z)] \\
&\quad + \mathrm{E}_\alpha[\log p(\gamma, \theta)] - \mathrm{E}_\alpha[\log A_\alpha(Z = z, \gamma, \theta)].
\end{aligned}
\tag{4.7}
\]

If the prior distributions of γ and θ are given by independent Dirichlet and Gaussian distributions and the auxiliary distributions of Z1, …, Zn, γ and θ are given by independent Multinomial, Dirichlet and Gaussian distributions, respectively, then the expectations on the right-hand side of (4.7) are tractable, with the possible exception of the expectations Eα [log πd;kl(θ)]. Under the exponential parameterization

\[
\pi_{d;kl}(\theta) = \exp\left\{ \theta^\top g(d) - \log \sum_{d^* \in \mathcal{D}} \exp[\theta^\top g(d^*)] \right\}, \tag{4.8}
\]

the expectations can be written as

\[
\mathrm{E}_\alpha[\log \pi_{d;kl}(\theta)] = \mathrm{E}_\alpha[\theta]^\top g(d) - \mathrm{E}_\alpha\left\{ \log \sum_{d^* \in \mathcal{D}} \exp[\theta^\top g(d^*)] \right\} \tag{4.9}
\]

and are intractable. We are not aware of parameterizations under which the expectations are tractable. We therefore use exponential parameterizations and deal with the intractable nature of the resulting expectations by invoking Jensen's inequality:

\[
\mathrm{E}_\alpha[\log \pi_{d;kl}(\theta)] \ge \mathrm{E}_\alpha[\theta]^\top g(d) - \log \sum_{d^* \in \mathcal{D}} \mathrm{E}_\alpha\{\exp[\theta^\top g(d^*)]\}. \tag{4.10}
\]

The right-hand side of (4.10) involves expectations of independent log-normal random variables, which are tractable. We thus obtain a looser, yet tractable, lower bound by replacing Eα[log πd;kl(θ)] in (4.7) by the right-hand side of inequality (4.10).
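Concretely, with the Gaussian auxiliary distribution for θ described above, say with mean μθ and covariance matrix Σθ, each expectation on the right-hand side of (4.10) is a log-normal moment:

\[
\mathrm{E}_\alpha\{\exp[\theta^\top g(d^*)]\} = \exp\!\Bigl(\mu_\theta^\top g(d^*) + \tfrac{1}{2}\, g(d^*)^\top \Sigma_\theta\, g(d^*)\Bigr), \qquad \mathrm{E}_\alpha[\theta] = \mu_\theta,
\]

so the looser bound can be evaluated in closed form.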

To save space, we do not address the specific numerical techniques that may be used to implement the variational GEM algorithm here. In short, the generalized E-step is based on an MM algorithm along the lines of Section 3.1.1. In the generalized M-step, numerical gradient-based methods may be used. A detailed treatment of this Bayesian estimation method and its implementation, using a more complicated prior distribution, may be found in Schweinberger, Petrescu-Prahova and Vu (2012); code related to this article is available at http://sites.stat.psu.edu/~dhunter/code/.

5. Monte Carlo simulation

Monte Carlo simulation of large, discrete-valued networks serves at least three purposes:

  • (a) to generate simulated data to be used in simulation studies;

  • (b) to approximate standard errors of the approximate maximum likelihood estimates by parametric bootstrap;

  • (c) to assess model goodness of fit by simulation.

A crude Monte Carlo approach is based on sampling Z by cycling through all n nodes and sampling Dij | Z by cycling through all n(n − 1)/2 dyads. However, the running time of such an algorithm is O(n²), which is too slow to be useful in practice, because each of the goals listed above tends to require numerous simulated data sets.

We propose Monte Carlo simulation algorithms that exploit the fact that discrete-valued networks tend to be sparse in the sense that one element of 𝒟 is much more common than all other elements of 𝒟. An example is given by directed, binary-valued networks, where 𝒟 = {(0, 0), (0, 1), (1, 0), (1, 1)} is the sample space of dyads and (0, 0) ∈ 𝒟 tends to dominate all other elements of 𝒟.

Assume there exists an element b of 𝒟, called the baseline, that dominates the other elements of 𝒟 in the sense that πb;kl ≫ 1 − πb;kl for all k and l. The Monte Carlo simulation algorithm exploiting the sparsity of large, discrete-valued networks can be described as follows:

  • (1) Sample Z by sampling M ~ Multinomial(n; γ1, …, γK) and assigning nodes 1, …, M1 to component 1, nodes M1 + 1, …, M1 + M2 to component 2, etc.

  • (2) Sample Y | Z as follows: for each 1 ≤ k ≤ l ≤ K,
    • (a) sample the number of dyads Skl with nonbaseline values, Skl ~ Binomial(Nkl, 1 − πb;kl), where Nkl is the number of pairs of nodes belonging to components k and l;
    • (b) sample Skl out of the Nkl pairs of nodes i < j without replacement;
    • (c) for each of the Skl sampled pairs of nodes i < j, sample the nonbaseline value Dij according to the probabilities πd;kl/(1 − πb;kl), d ∈ 𝒟, d ≠ b.

In general, if the degree of any node (i.e., the number of nonbaseline values for all dyad variables incident on that node) has a bounded expectation, then the expected number of nonbaseline values S = Σk≤l Skl in the network scales with n, and the expected running time of the Monte Carlo simulation algorithm scales with n, K² and |𝒟|. If K is small and n is large, then the Monte Carlo approach that exploits the sparsity of large, discrete-valued networks is superior to the crude Monte Carlo approach.
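A minimal sketch of the simulation algorithm is given below (Python/NumPy) for the unconstrained model (2.9); the rejection step used to sample pairs without replacement is one simple option among several and is efficient when the network is sparse.

```python
import numpy as np
from numpy.random import default_rng

def simulate_sparse_network(n, gamma, pi, baseline, rng=None):
    """Sparse simulation of a directed network under the unconstrained model (2.9).

    gamma    : (K,) mixing proportions
    pi       : (D, K, K) array with pi[d, k, l] = pi_{d;kl} and sum_d pi[d, k, l] = 1
    baseline : index b of the dominant ("no edge") dyad value
    Returns component labels z and a dict of nonbaseline dyads keyed by pairs i < j.
    """
    rng = default_rng() if rng is None else rng
    D, K, _ = pi.shape
    counts = rng.multinomial(n, gamma)               # step (1): M ~ Multinomial(n; gamma)
    z = np.repeat(np.arange(K), counts)              # contiguous blocks of labels
    blocks = [np.flatnonzero(z == k) for k in range(K)]
    nonbase = [d for d in range(D) if d != baseline]

    dyads = {}
    for k in range(K):
        for l in range(k, K):                        # step (2): each 1 <= k <= l <= K
            m_k, m_l = len(blocks[k]), len(blocks[l])
            N_kl = m_k * (m_k - 1) // 2 if k == l else m_k * m_l
            if N_kl == 0:
                continue
            S_kl = rng.binomial(N_kl, 1.0 - pi[baseline, k, l])      # step (2a)
            if S_kl == 0:
                continue
            probs = pi[nonbase, k, l] / (1.0 - pi[baseline, k, l])
            chosen = set()
            while len(chosen) < S_kl:                # step (2b): pairs without replacement
                i, j = rng.choice(blocks[k]), rng.choice(blocks[l])
                if i != j:
                    chosen.add((min(i, j), max(i, j)))
            for pair in chosen:                      # step (2c): draw nonbaseline values
                dyads[pair] = rng.choice(nonbase, p=probs)
    return z, dyads
```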

6. Comparison of algorithms

We compare the variational EM algorithm based on the fixed-point (FP) implementation of the E-step along the lines of Daudin, Picard and Robin (2008) to the variational GEM algorithm based on the MM implementation of the E-step by applying them to two data sets. The first data set comes from the study on political blogs by Adamic and Glance (2005). We convert the binary network of political blogs with two labels, liberal (+1) and conservative (−1), into a signed network by assigning labels of receivers to the corresponding directed edges. The resulting network has 1490 nodes and 2,218,610 edge variables. The second data set is the Epinions data set described in Section 1 with more than 131,000 nodes and more than 17 billion edge variables.

We compare the two algorithms using the unconstrained network mixture model of (2.9) with K = 5 and K = 20 components. For the first data set, we allow up to 1 hour for K = 5 components and up to 6 hours for K = 20 components. For the second data set, we allow up to 12 hours for K = 5 components and up to 24 hours for K = 20 components. For each data set, for each number of components and for each algorithm, we carried out 100 runs using random starting values as described in Section 3.3.

Figure 1 shows trace plots of the lower bound LBML(γ(t), θ(t); α(t)) of the log-likelihood function, where red lines refer to the lower bound of the variational EM algorithm with FP implementation and blue lines refer to the lower bound of the variational GEM algorithm with MM implementation. The variational EM algorithm seems to outperform the variational GEM algorithm in terms of computing time when K and n are small. However, when K or n are large, the variational GEM algorithm appears far superior to the variational EM algorithm in terms of the lower bounds. The contrast is most striking when K is large, though the variational GEM seems to outperform the variational EM algorithm even when K is small and n is large. We believe that the superior performance of the variational GEM algorithm stems from the fact that it separates the parameters of the maximization problem and reduces the dependence of the updates of the variational parameters αik, as discussed in Section 3.1.1, while the variational EM algorithm tends to be trapped in local maxima.

Fig. 1. Trace plots of the lower bound LBML(γ(t), θ(t); α(t)) of the log-likelihood function for 100 runs each of the variational EM algorithm with FP implementation (red) and variational GEM algorithm with MM implementation (blue), applied to the unconstrained network mixture model of (2.9) for two different data sets.

Thus, if K and n are small and a computing cluster is available, it seems preferable to carry out a large number of runs using the variational EM algorithm in parallel, using random starting values as described in Section 3.3. However, if either K or n is large, it is preferable to use the variational GEM algorithm. Since the variational GEM algorithm is not prone to be trapped in local maxima, a small number of long runs may be all that is needed.

7. Application

Here, we address the problem of clustering the n = 131,827 users of the data set introduced in Section 1 according to their levels of trustworthiness, as indicated by the network of +1 and −1 ratings given by fellow users. To this end, we first introduce the individual “excess trust” statistics

\[
e_i(y) = \sum_{1 \le j \le n,\ j \ne i} y_{ji}.
\]

Since ei(y) is the number of positive ratings received by user i in excess of the number of negative ratings, it is a natural measure of a user's individual trustworthiness. Our contention is that consideration of the overall pattern of network connections results in a more revealing clustering pattern than a mere consideration of the ei(y) statistics, and we support this claim by considering three different clustering methods: A parsimonious network model using the ei(y) statistics, the fully unconstrained network model of (2.9), and a mixture model that considers only the ei(y) statistics while ignoring the other network structure.

For each method, we assume that the number of categories, K, is five. Partly, this choice is motivated by the fact that formal model selection methods, such as the ICL criterion suggested by Daudin, Picard and Robin (2008) that we discuss in Section 8, suggest dozens if not hundreds of categories, which complicates summary and interpretation. Since the reduction of data is the primary task of statistics [Fisher (1922)], we want to keep the number of categories small and follow the standard practice of internet-based companies and websites, such as http://amazon.com and http://netflix.com, which use five categories to classify the trustworthiness of reviewers, sellers and service providers.

Our parsimonious model, which enjoys benefits over the other two alternatives as we shall see, is based on

\[
P_\theta(D_{ij} = d_{ij} \mid Z_{ik} = Z_{jl} = 1) \propto \exp[\theta_-(y_{ij}^- + y_{ji}^-) + \theta_+(y_{ij}^+ + y_{ji}^+) + \theta_k^\Delta y_{ji} + \theta_l^\Delta y_{ij} + \theta_{--}\, y_{ij}^- y_{ji}^- + \theta_{++}\, y_{ij}^+ y_{ji}^+], \tag{7.1}
\]

where yij⁻ = I(yij = −1) and yij⁺ = I(yij = +1) are indicators of negative and positive edges, respectively. The parameters in model (7.1) are not identifiable, because yij = yij⁺ − yij⁻ and yji = yji⁺ − yji⁻. We therefore constrain the positive edge parameter θ+ to be 0. Model (7.1) assumes in the interest of model parsimony that the propensities to form negative and positive edges and to reciprocate negative and positive edges do not vary across clusters; however, the flexibility afforded by this modeling framework enables us to define cluster-specific parameters for any of these propensities if we wish. The conditional probability mass function of the whole network is given by

\[
P_\theta(Y = y \mid Z = z) \propto \exp\left[ \theta_- \sum_{i<j}^{n} (y_{ij}^- + y_{ji}^-) + \sum_{k=1}^{K} \theta_k^\Delta\, t_k(y, z) + \theta_{--} \sum_{i<j}^{n} y_{ij}^- y_{ji}^- + \theta_{++} \sum_{i<j}^{n} y_{ij}^+ y_{ji}^+ \right], \tag{7.2}
\]

where tk(y, z) = Σi zik ei(y) is the total excess trust for all nodes in the kth category. The θkΔ parameters are therefore measures of the trustworthiness of each of the categories. Furthermore, these parameters are estimated in the presence of—that is, after correcting for—the reciprocity effects as measured by the parameters θ−− and θ++, which summarize the overall tendencies of users to reciprocate negative and positive ratings, respectively. Thus, θ−− and θ++ may be considered to measure overall tendencies toward lex talionis and quid pro quo behaviors.
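The sufficient statistics in (7.2), which also appear in Table 1 below, are simple counts over the nonzero edges. The sketch below (Python, edge-list representation as in Section 1, with the cluster labels z treated as known purely for illustration) computes them.

```python
from collections import defaultdict

def sufficient_statistics(edges, z, K):
    """Sufficient statistics of model (7.2).

    edges : dict mapping directed pairs (i, j) to y_ij in {-1, +1}
    z     : mapping from node i to its cluster label in {0, ..., K-1}
    """
    neg = pos = neg_recip = pos_recip = 0
    excess = defaultdict(int)                    # e_i(y): ratings received by node i
    for (i, j), y_ij in edges.items():
        neg += (y_ij == -1)
        pos += (y_ij == +1)
        excess[j] += y_ij                        # y_ij is received by node j
        if i < j:                                # count each unordered pair once
            y_ji = edges.get((j, i), 0)
            neg_recip += (y_ij == -1) and (y_ji == -1)
            pos_recip += (y_ij == +1) and (y_ji == +1)
    t = [0] * K                                  # t_k(y, z): total excess trust per cluster
    for i, e_i in excess.items():
        t[z[i]] += e_i
    return {"neg": neg, "pos": pos, "neg_recip": neg_recip,
            "pos_recip": pos_recip, "excess_by_cluster": t}
```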

One alternative model we consider is the unconstrained network model obtained from (2.9). With five components, this model comprises four mixing parameters λ1, …, λ4 in addition to the πd;kl parameters, of which there are 105: there are nine types of dyads d whenever k ≠ l, contributing 8 free parameters for each of the 10 unordered pairs {k, l} (80 parameters in total), and six types of dyads d whenever k = l, contributing 5 free parameters for each of the 5 components (an additional 25 parameters). Despite the increased flexibility afforded by model (2.9), we view the loss of interpretability due to the large number of parameters as a detriment. Furthermore, the larger number of parameters opens up the possibility of overfitting and, as we discuss below, appears to make the lower bound of the log-likelihood function highly multimodal.

Our other alternative model is a univariate mixture model applied to the ei(y) statistics directly, which assumes that the individual excesses ei(y) are independent random variables sampled from a distribution with density

\[
f(x) = \sum_{j=1}^{5} \lambda_j\, \frac{1}{\sigma_j}\, \phi\!\left( \frac{x - \mu_j}{\sigma_j} \right), \tag{7.3}
\]

where λj, μj and σj are component-specific mixing proportions, means and standard deviations, respectively, and ϕ(·) is the standard normal density. Traditional univariate approaches like this are less suitable than network-based clustering approaches for two reasons. First, by design they neither consider nor inform us about the topology of the network, which may be relevant. Second, the individual excesses are not independent: the ei(y) are functions of edges, and edges may be dependent owing to reciprocity (and other forms of dependence not modeled here), which decades of research [e.g., Davis (1968), Holland and Leinhardt (1981)] have shown to be important in shaping social networks. Unlike the univariate mixture model of (7.3), the mixture model we employ for networks allows for such dependence.

We use a variational GEM algorithm to estimate the network model (7.2), where the M-step is executed by a Newton–Raphson algorithm using the gradient and Hessian derived in Appendix C with a maximum of 100 iterations. It stops earlier if the largest absolute value in the gradient vector is less than 10⁻¹⁰. By contrast, the unconstrained network model following from (2.9) employs a variational GEM algorithm using the exact M-step update (3.12). The variational GEM algorithm stops when either the relative change in the objective function is less than 10⁻¹⁰ or 6000 iterations are performed. Most runs require the full 6000 iterations. To estimate the normal mixture model (7.3), we use the R package mixtools [Benaglia et al. (2009)].

To diagnose convergence of the algorithm for fitting the model (7.2), we present the trace plot of the lower bound of the log-likelihood function LBML(γ(t), θ(t); α(t)) in Figure 2(a) and the trace plot of the cluster-specific excess parameters θkΔ in Figure 2(b). Both figures are based on 100 runs, where the starting values are obtained by the procedure described in Section 3.3. The results suggest that all 100 runs converge to roughly the same solution. This fact is somewhat remarkable, since many variational algorithms appear very sensitive to their starting values, converging to multiple distinct local optima [e.g., Daudin, Pierre and Vacher (2010), Salter-Townshend and Murphy (2013)]. For instance, the 100 runs for the unconstrained network model (2.9) produced essentially a distinct set of estimates for each set of random starting values. Similarly, the normal mixture model algorithm produces many different local maxima, even after we try to correct for label-switching by choosing random starting values fairly tightly clustered by their mean values.

Fig. 2. (a) Trace plot of the lower bound LBML(γ(t), θ(t); α(t)) of the log-likelihood function and (b) cluster-specific excess parameters θkΔ, using 100 runs with random starting values.

Figure 3 shows the observed excesses e1(y),…,en(y) grouped by clusters for the best solutions, as measured by likelihood or approximate likelihood, found for each of the three clustering methods. It appears that the clustering based on the parsimonious network model does a better job of separating the ei(y) statistics into distinct subgroups—though this is not the sole criterion used—than the clusterings for the other two models, which are similar to each other. In addition, if we use a normal mixture model in which the variances are restricted to be constant across components, the results are even worse, with one large cluster and multiple clusters with few nodes.

Fig. 3. Observed values of excess trust ei(y), grouped by highest-probability component of i, for (a) the parsimonious network mixture model (7.2) with 12 parameters, (b) the normal mixture model (7.3) with 14 parameters, and (c) the unconstrained network mixture model (2.9) with 109 parameters.

In Figure 4, we “ground truth” the clustering solutions using external information: the average ratings of 659,290 articles, grouped according to the highest-probability category of the article's author. While in Figure 3 the size of each cluster is the number of users in that cluster, in Figure 4 the size of each cluster is the number of articles written by users in that cluster. The widths of the boxes in Figures 3 and 4 are proportional to the square roots of the cluster sizes.

Fig. 4. Average ratings of 659,290 articles, grouped according to the highest-probability category of the article's author, for (a) the parsimonious network mixture model (7.2) with 12 parameters, (b) the normal mixture model (7.3) with 14 parameters, and (c) the unconstrained network mixture model (2.9) with 109 parameters. The ordering of the five categories, which is the same as in Figure 3, indicates that the unconstrained network mixture model does not even preserve the correct ordering of the median average ratings.

As an objective criterion to compare the three models, we fit one-way ANOVA models in which the responses are article ratings and the fixed effects are the group indicators of the articles' authors. The adjusted R² values are 0.262, 0.165 and 0.172 for the network mixture model, the normal mixture model and the unconstrained network mixture model, respectively. In other words, the latent structure detected by the 12-parameter network mixture model of (7.2) explains the variation in article ratings better than the 14-parameter univariate mixture model or the 109-parameter unconstrained network model.

Table 1 reports estimates of the θ parameters from model (7.2) along with 95% confidence intervals obtained by simulating 500 networks, using the method of Section 5 and the parameter estimates obtained via our algorithm. For each network, we run our algorithm for 1000 iterations starting at the M-step, where the α parameters are initialized to reflect the “true” component to which each node is assigned by the simulation algorithm: we set αik = 10⁻¹⁰ for k not equal to the true component and αik = 1 − 4 × 10⁻¹⁰ otherwise. This is done to eliminate the so-called label-switching problem, which is rooted in the invariance of the likelihood function to switching the labels of the 5 components and which can affect bootstrap samples in the same way it can affect Markov chain Monte Carlo samples from the posterior of finite mixture models [Stephens (2000)]. The sample 2.5% and 97.5% quantiles form the confidence intervals shown. In addition, we give density estimates of the five trustworthiness bootstrap samples in Figure 5. Table 1 shows that some clusters of users are much more trustworthy than others. In addition, there is statistically significant evidence that users rate others in accordance with both lex talionis and quid pro quo, since both θ−− and θ++ are positive. These findings suggest that the ratings of pairs of users i and j are, perhaps unsurprisingly, dependent and not free of self-interest.

Table 1.

95% Confidence intervals based on parametric bootstrap using 500 simulated networks, with 1000 iterations for each network. The statistic Σi ei(y)Zik equals Σi Σj≠i yjiZik, where Zik = 1 if user i is a member of cluster k and Zik = 0 otherwise

Parameter                          Statistic               Parameter estimate    Confidence interval
Negative edges (θ−)                Σi≠j yij⁻                −24.020               (−24.029, −24.012)
Positive edges (θ+)                Σi≠j yij⁺                0                     constrained to 0
Negative reciprocity (θ−−)         Σi<j yij⁻ yji⁻           8.660                 (8.614, 8.699)
Positive reciprocity (θ++)         Σi<j yij⁺ yji⁺           9.899                 (9.891, 9.907)
Cluster 1 trustworthiness (θ1Δ)    Σi ei(y)Zi1              −6.256                (−6.260, −6.251)
Cluster 2 trustworthiness (θ2Δ)    Σi ei(y)Zi2              −7.658                (−7.662, −7.653)
Cluster 3 trustworthiness (θ3Δ)    Σi ei(y)Zi3              −9.343                (−9.348, −9.337)
Cluster 4 trustworthiness (θ4Δ)    Σi ei(y)Zi4              −11.914               (−11.919, −11.908)
Cluster 5 trustworthiness (θ5Δ)    Σi ei(y)Zi5              −15.212               (−15.225, −15.200)

Fig. 5. Kernel density estimates of the five bootstrap samples of the trustworthiness parameters, shifted so that each component's estimated parameter value (shown in the legend) equals zero.

Finally, a few remarks concerning the parametric bootstrap are appropriate. While we are encouraged by the fact that bootstrapping is even feasible for problems of this size, there are aspects of our investigation that will need to be addressed with further research. First, the bootstrapping is so time-consuming that we were forced to rely on computing clusters with multiple computing nodes to generate a bootstrap sample in reasonable time. Future work could focus on more efficient bootstrapping. Some work on efficient bootstrapping was done by Kleiner et al. (2011), but it is restricted to simple models and not applicable here.

Second, when the variational GEM algorithm is initialized at random locations, it may converge to local maxima whose LBML(γ, θ; α) values are inferior to the solutions attained when the algorithm is initialized at the “true” values used to simulate the networks. While it is not surprising that variational GEM algorithms converge to local maxima, it is surprising that the issue shows up in some of the simulated data sets but not in the observed data set. One possible explanation is that the structure of the observed data set is clear cut, but that the components of the estimated model are not sufficiently separated. Therefore, the estimated model may place nonnegligible probability mass on networks where two or more subsets of nodes are hard to distinguish and the variational GEM algorithm may be attracted to local maxima.

Third, some groups of confidence intervals, such as the first four trustworthiness parameter intervals, have more or less the same width. We do not have a fully satisfying explanation for this result; it may be a coincidence or it may have some deeper cause related to the difficulty of the computational problem.

In summary, we find that the clustering framework we introduce here provides useful results for a very large network. Most importantly, the sensible application of statistical modeling ideas, which reduces the unconstrained 109-parameter model to a constrained 12-parameter model, produces vastly superior results in terms of interpretability, numerical stability and predictive performance.

8. Discussion

The model-based clustering framework outlined here represents several advances. An attention to standard statistical modeling ideas relevant in the network context improves model parsimony and interpretability relative to fully unconstrained clustering models, while also suggesting a viable method for assessing precision of estimates obtained. Algorithmically, our advances allow us to apply a variational EM idea, recently applied to network clustering models in numerous publications [e.g., Nowicki and Snijders (2001), Airoldi et al. (2008), Daudin, Picard and Robin (2008), Zanghi et al. (2010), Mariadassou, Robin and Vacher (2010)], to networks far larger than any that have been considered to date. We have applied our methods to networks with over a hundred thousand nodes and signed edges, indicating how they extend to categorical-valued edges generally or models that incorporate other covariate information. In practice, these methods could have myriad uses, from identifying high-density regions of large networks to selecting among competing models for a single network to testing specific network effects of scientific interest when clustering is present.

To achieve these advances, we have focused exclusively on models exhibiting dyadic independence conditional on the cluster memberships of nodes. It is important to remember that these models are not dyadic independence models overall, since the clustering itself introduces dependence. However, to more fully capture network effects such as transitivity, more complicated models may be needed, such as the latent space models of Hoff, Raftery and Handcock (2002), Schweinberger and Snijders (2003) or Handcock, Raftery and Tantrum (2007). A major drawback of latent space models is that they tend to be less scalable than the models considered here. An example is given by the variational Bayesian algorithm developed by Salter-Townshend and Murphy (2013) to estimate the latent space model of Handcock, Raftery and Tantrum (2007). The running time of the algorithm is O(n2) and it has therefore not been applied to networks with more than n = 300 nodes and N = 89,700 edge variables. An alternative to the variational Bayesian algorithm of Salter-Townshend and Murphy (2013) based on case-control sampling was proposed by Raftery et al. (2012). However, while the computing time of this alternative algorithm is O(n), the suggested preprocessing step, which requires determining the shortest path length between pairs of nodes, is O(n2). As a result, the largest network Raftery et al. (2012) analyze is an undirected network with n = 2716 nodes and N = 3,686,970 edge variables.

In contrast, the running time of the variational GEM algorithm proposed here is O(n) in the constrained version and O(f(n)) in the unconstrained version of the Nowicki and Snijders (2001) model, where f(n) is the number of edge variables whose value differs from the baseline value. Since f(n) is O(n) for sparse graphs, the running time of the variational GEM algorithm is O(n) in the sparse case. Even in the presence of categorical covariates, the running time of the variational GEM algorithm is $O(n \sum_{i=1}^{I} C_i)$, where $I$ is the number of covariates and $C_i$ is the number of categories of the $i$th covariate. We have demonstrated that the variational GEM algorithm can be applied to networks with more than n = 131,000 nodes and N = 17 billion edge variables.
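To make the role of sparsity concrete, the following self-contained numpy sketch (an illustration only, not the authors' C++ implementation) computes the dyad part of a variational lower bound for a small simulated undirected binary network in two ways: a brute-force sum over all O(n^2) dyads, and a decomposition into a baseline term computed from column sums of the membership probabilities plus a correction for each of the f(n) non-baseline dyads. All sizes, names and simulated inputs below are placeholders of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 400, 3                                          # toy sizes, not the data analyzed in the paper

alpha = rng.dirichlet(np.ones(K), size=n)              # variational membership probabilities
pi = rng.dirichlet(np.ones(2), size=(K, K))            # pi[k, l, d] for dyad values d in {0, 1}
pi = (pi + pi.transpose(1, 0, 2)) / 2                  # symmetrize in (k, l) for undirected dyads
logpi = np.log(pi)

# sparse undirected graph: f(n) dyads take the value 1, all remaining dyads sit at the baseline 0
edges = set()
while len(edges) < 5 * n:
    i, j = rng.integers(n, size=2)
    if i < j:
        edges.add((int(i), int(j)))

def brute_force():
    """Sum alpha_ik * alpha_jl * log pi_{d_ij;kl} over all O(n^2) dyads."""
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = 1 if (i, j) in edges else 0
            total += alpha[i] @ logpi[:, :, d] @ alpha[j]
    return total

def sparse_decomposition():
    """Baseline term from column sums, plus one correction per non-baseline dyad."""
    s = alpha.sum(axis=0)                              # s_k = sum_i alpha_ik
    t = alpha.T @ alpha                                # t_kl = sum_i alpha_ik * alpha_il
    baseline = 0.5 * np.sum(logpi[:, :, 0] * (np.outer(s, s) - t))
    correction = sum(alpha[i] @ (logpi[:, :, 1] - logpi[:, :, 0]) @ alpha[j]
                     for i, j in edges)
    return baseline + correction

print(brute_force(), sparse_decomposition())           # agree up to floating-point error
```

The two computations agree, but the second touches each observed edge only once after an O(nK^2) aggregation step, which is the source of the O(n + f(n)) scaling discussed above.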

While the running time of O(n) shows that the variational GEM algorithm scales well with n, in practice, the “G” in “GEM” is an important contributor to the speed of the variational GEM algorithm: merely increasing the lower bound using an MM algorithm rather than actually maximizing it using a fixed-point algorithm along the lines of Daudin, Picard and Robin (2008) appears to save much computing time for large networks, though an exhaustive comparison of these two methods is a topic for further investigation.

An additional increase in speed might be gained by exploiting acceleration methods such as quasi-Newton methods [Press et al. (2002), Section 10.7], which have shown promise in the case of MM algorithms [Hunter and Lange (2004)] and which might accelerate the MM algorithm in the E-step of the variational GEM algorithm. However, application of these methods is complicated in the current modeling framework because of the exceptionally large number of auxiliary parameters introduced by the variational augmentation.

We have neglected here the problem of selecting the number of clusters. Daudin, Picard and Robin (2008) propose making this selection based on the so-called ICL criterion, but it is not known how the ICL criterion behaves when the intractable incomplete-data log-likelihood function in the ICL criterion is replaced by a variational-method lower bound. In our experience, the magnitude of the changes in the maximum lower bound value achieved with multiple random starting parameters is at least as large as the magnitude of the penalization imposed on the log-likelihood by the ICL criterion. Thus, we have been unsuccessful in obtaining reliable ICL-based results for very large networks. More investigation of this question, and of the selection of the number of clusters in general, seems warranted.

We have demonstrated that scientifically interesting clustering models can be applied to very large networks by extending the variational-method ideas recently developed for network data in the statistical literature; we hope this encourages further investigation of these and related clustering methods.

The source code, written in C++, and data files used in Sections 6 and 7 are publicly available at http://sites.stat.psu.edu/~dhunter/code.

Acknowledgments

We are grateful to Paolo Massa and Kasper Souren of trustlet.org for sharing the epinions.com data.

APPENDIX A: OBTAINING A MINORIZER OF THE LOWER BOUND

The lower bound $\mathrm{LB}_{\mathrm{ML}}(\gamma, \theta; \alpha)$ of the log-likelihood function can be written as

$$\mathrm{LB}_{\mathrm{ML}}(\gamma,\theta;\alpha)=\sum_{i<j}^{n}\sum_{k=1}^{K}\sum_{l=1}^{K}\alpha_{ik}\alpha_{jl}\log\pi_{d_{ij};kl}(\theta)+\sum_{i=1}^{n}\sum_{k=1}^{K}\alpha_{ik}\left(\log\gamma_{k}-\log\alpha_{ik}\right). \tag{A.1}$$

Since $\log\pi_{d_{ij};kl}(\theta)<0$ for all $\theta$, the arithmetic-geometric mean inequality implies that

$$\alpha_{ik}\alpha_{jl}\log\pi_{d_{ij};kl}(\theta)\;\geq\;\left(\frac{\alpha_{ik}^{2}\hat{\alpha}_{jl}}{2\hat{\alpha}_{ik}}+\frac{\alpha_{jl}^{2}\hat{\alpha}_{ik}}{2\hat{\alpha}_{jl}}\right)\log\pi_{d_{ij};kl}(\theta) \tag{A.2}$$

[Hunter and Lange (2004)], with equality if $\alpha_{ik}=\hat{\alpha}_{ik}$ and $\alpha_{jl}=\hat{\alpha}_{jl}$. In addition, the concavity of the logarithm function gives

$$-\log\alpha_{ik}\;\geq\;-\log\hat{\alpha}_{ik}-\frac{\alpha_{ik}}{\hat{\alpha}_{ik}}+1 \tag{A.3}$$

with equality if $\alpha_{ik}=\hat{\alpha}_{ik}$. Therefore, the function $Q_{\mathrm{ML}}(\gamma, \theta, \alpha; \hat{\alpha})$ defined in (3.8) possesses properties (3.9) and (3.10).
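As a quick sanity check, the short Python sketch below (an illustration using scalar placeholder values, not part of the paper's software) verifies numerically that the right-hand sides of (A.2) and (A.3) minorize the corresponding terms and that equality holds at the current estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

log_pi = np.log(rng.uniform(0.05, 0.95))             # a log-probability, hence negative
a_ik_hat, a_jl_hat = rng.uniform(0.1, 0.9, size=2)   # current estimates (the "hat" values)
a_ik, a_jl = rng.uniform(0.1, 0.9, size=2)           # arbitrary trial values

# (A.2): the quadratic expression minorizes alpha_ik * alpha_jl * log pi because log pi < 0
lhs = a_ik * a_jl * log_pi
rhs = (a_ik**2 * a_jl_hat / (2 * a_ik_hat) + a_jl**2 * a_ik_hat / (2 * a_jl_hat)) * log_pi
assert lhs >= rhs - 1e-12

# equality of (A.2) at the current estimates
lhs_at_hat = a_ik_hat * a_jl_hat * log_pi
rhs_at_hat = (a_ik_hat * a_jl_hat / 2 + a_jl_hat * a_ik_hat / 2) * log_pi
assert abs(lhs_at_hat - rhs_at_hat) < 1e-12

# (A.3): -log(alpha) >= -log(alpha_hat) - alpha / alpha_hat + 1, with equality at alpha_hat
assert -np.log(a_ik) >= -np.log(a_ik_hat) - a_ik / a_ik_hat + 1 - 1e-12
assert abs((-np.log(a_ik_hat)) - (-np.log(a_ik_hat) - a_ik_hat / a_ik_hat + 1)) < 1e-12

print("minorization checks passed")
```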

APPENDIX B: CONVEX DUALITY OF EXPONENTIAL FAMILIES

We show how closed-form expressions of θ in terms of π can be obtained by exploiting the convex duality of exponential families. Let

$$\psi^{*}(\mu)=\sup_{\theta}\left\{\theta^{\top}\mu-\psi(\theta)\right\} \tag{B.1}$$

be the Legendre–Fenchel transform of $\psi(\theta)$, where $\mu \equiv \mu(\theta) = \mathbb{E}_{\theta}[g(Y)]$ is the mean-value parameter vector and the subscripts $k$ and $l$ have been dropped. By Barndorff-Nielsen [(1978), page 140] and Wainwright and Jordan [(2008), pages 67 and 68], the Legendre–Fenchel transform of $\psi(\theta)$ is self-inverse and, thus, $\psi(\theta)$ can be written as

$$\psi(\theta)=\sup_{\mu}\left\{\theta^{\top}\mu-\psi^{*}(\mu)\right\}=\sup_{\pi}\left\{\theta^{\top}\mu(\pi)-\psi^{*}(\mu(\pi))\right\}, \tag{B.2}$$

where $\mu(\pi)=\sum_{d\in D}g(d)\,\pi_{d}$ and $\psi^{*}(\mu(\pi))=\sum_{d\in D}\pi_{d}\log\pi_{d}$. Therefore, closed-form expressions of $\theta$ in terms of $\pi$ may be found by maximizing $\theta^{\top}\mu(\pi)-\psi^{*}(\mu(\pi))$ with respect to $\pi$.
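For a concrete illustration, the following sketch (toy statistics and parameter values of our own choosing, not taken from the paper) checks the duality numerically on a three-point dyad space: the objective $\theta^{\top}\mu(\pi)-\psi^{*}(\mu(\pi))$ never exceeds $\psi(\theta)$ on the probability simplex and attains it exactly at the exponential-family probabilities $\pi_{d}(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy dyad space D = {0, 1, 2} with a two-dimensional statistic g(d); rows indexed by d
g = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0]])
theta = rng.normal(size=2)

psi = np.log(np.exp(g @ theta).sum())        # psi(theta) = log sum_d exp(theta' g(d))
pi_theta = np.exp(g @ theta - psi)           # exponential-family probabilities pi_d(theta)

def objective(pi):
    """theta' mu(pi) - psi*(mu(pi)), where mu(pi) = sum_d g(d) pi_d and psi* is the negative entropy."""
    mu = pi @ g
    return theta @ mu - np.sum(pi * np.log(pi))

assert abs(objective(pi_theta) - psi) < 1e-10    # supremum attained at pi(theta), with value psi(theta)
for _ in range(1000):                            # no other point of the simplex does better
    pi = rng.dirichlet(np.ones(len(g)))
    pi = np.clip(pi, 1e-12, None)
    pi = pi / pi.sum()
    assert objective(pi) <= psi + 1e-10
print("convex-duality check passed")
```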

APPENDIX C: GRADIENT AND HESSIAN OF LOWER BOUND

We are interested in the gradient and Hessian with respect to the parameter vector $\theta$ of the lower bound in (A.1). The two examples of models considered in Section 2 assume that the conditional dyad probabilities $\pi_{d_{ij};kl}(\theta)$ take the form

$$\pi_{d_{ij};kl}(\theta)=\exp\!\left[\eta_{kl}(\theta)^{\top}g(d_{ij})-\psi_{kl}(\theta)\right], \tag{C.1}$$

where $\eta_{kl}(\theta)=A_{kl}\theta$ is a linear function of the parameter vector $\theta$ and $A_{kl}$ is a matrix of suitable order depending on components $k$ and $l$. It is convenient to absorb the matrix $A_{kl}$ into the statistic vector $g(d_{ij})$ and write

$$\pi_{d_{ij};kl}(\theta)=\exp\!\left[\theta^{\top}g_{kl}(d_{ij})-\psi_{kl}(\theta)\right], \tag{C.2}$$

where $g_{kl}(d_{ij})=A_{kl}^{\top}g(d_{ij})$. Thus, we may write

$$\mathrm{LB}_{\mathrm{ML}}(\gamma,\theta;\alpha)=\sum_{i<j}^{n}\sum_{k=1}^{K}\sum_{l=1}^{K}\alpha_{ik}\alpha_{jl}\left[\theta^{\top}g_{kl}(d_{ij})-\psi_{kl}(\theta)\right]+\mathrm{const}, \tag{C.3}$$

where “const” denotes terms which do not depend on θ and

$$\psi_{kl}(\theta)=\log\sum_{d\in D}\exp\!\left[\theta^{\top}g_{kl}(d)\right]. \tag{C.4}$$

Since the lower bound $\mathrm{LB}_{\mathrm{ML}}(\gamma,\theta;\alpha)$ is a weighted sum of exponential family log-probabilities, it is straightforward to obtain the gradient and Hessian of $\mathrm{LB}_{\mathrm{ML}}(\gamma,\theta;\alpha)$ with respect to $\theta$, which are given by

$$\nabla_{\theta}\,\mathrm{LB}_{\mathrm{ML}}(\gamma,\theta;\alpha)=\sum_{i<j}^{n}\sum_{k=1}^{K}\sum_{l=1}^{K}\alpha_{ik}\alpha_{jl}\left\{g_{kl}(d_{ij})-\mathbb{E}_{\theta}\!\left[g_{kl}(D_{ij})\right]\right\} \tag{C.5}$$

and

$$\nabla_{\theta}^{2}\,\mathrm{LB}_{\mathrm{ML}}(\gamma,\theta;\alpha)=-\sum_{i<j}^{n}\sum_{k=1}^{K}\sum_{l=1}^{K}\alpha_{ik}\alpha_{jl}\left\{\mathbb{E}_{\theta}\!\left[g_{kl}(D_{ij})\,g_{kl}(D_{ij})^{\top}\right]-\mathbb{E}_{\theta}\!\left[g_{kl}(D_{ij})\right]\mathbb{E}_{\theta}\!\left[g_{kl}(D_{ij})\right]^{\top}\right\}, \tag{C.6}$$

respectively.

In other words, the gradient and Hessian of $\mathrm{LB}_{\mathrm{ML}}(\gamma,\theta;\alpha)$ with respect to $\theta$ are weighted sums of expectations: the means, variances and covariances of statistics. Since the sample space of dyads $D$ is finite and, more often than not, small, these expectations may be computed by complete enumeration of all possible values of $d\in D$ and their probabilities.
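The enumeration is simple enough to spell out. The sketch below (a standalone Python illustration with invented statistics, not the paper's C++ code) computes, for a single $(k, l)$ block and a single dyad, the unweighted contributions to (C.5) and (C.6) by enumerating the dyad space, and checks the gradient against finite differences.

```python
import numpy as np

rng = np.random.default_rng(3)

# finite dyad space D with statistics g_kl(d); rows indexed by dyad value d (toy values)
g = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0]])
theta = rng.normal(size=2)
d_obs = 2                                    # the observed dyad value for this pair

def log_pi(theta):
    psi = np.log(np.exp(g @ theta).sum())    # psi_kl(theta), as in (C.4)
    return g @ theta - psi                   # log pi_{d;kl}(theta), as in (C.2)

pi = np.exp(log_pi(theta))
mean_g = pi @ g                                              # E_theta[g_kl(D_ij)]
cov_g = (g * pi[:, None]).T @ g - np.outer(mean_g, mean_g)   # Cov_theta[g_kl(D_ij)]

grad = g[d_obs] - mean_g   # per-dyad term of (C.5), before weighting by alpha_ik * alpha_jl
hess = -cov_g              # per-dyad term of (C.6), before weighting by alpha_ik * alpha_jl

# finite-difference check of the gradient
eps = 1e-6
num_grad = np.array([(log_pi(theta + eps * e)[d_obs] - log_pi(theta - eps * e)[d_obs]) / (2 * eps)
                     for e in np.eye(2)])
assert np.allclose(grad, num_grad, atol=1e-6)
print("per-dyad gradient:", grad)
print("per-dyad Hessian:\n", hess)
```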

Footnotes

1. Supported in part by the Office of Naval Research (ONR Grant N00014-08-1-1015) and the National Institutes of Health (NIH Grant 1 R01 GM083603). Experiments in this work were also supported in part through instrumentation funded by the National Science Foundation (NSF Grant OCI-0821527).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2013, Vol. 7, No. 2, 1010-1039. This reprint differs from the original in pagination and typographic detail.

REFERENCES

1. Adamic LA, Glance N. The political blogosphere and the 2004 U.S. election: Divided they blog. In: Proceedings of the 3rd International Workshop on Link Discovery (LinkKDD'05). New York: ACM; 2005. pp. 36–43.
2. Airoldi E, Blei D, Fienberg S, Xing E. Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 2008;9:1981–2014.
3. Barndorff-Nielsen O. Information and Exponential Families in Statistical Theory. Wiley; Chichester: 1978. MR0489333.
4. Benaglia T, Chauveau D, Hunter DR, Young D. mixtools: An R package for analyzing finite mixture models. Journal of Statistical Software. 2009;32:1–29.
5. Besag J. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B Stat. Methodol. 1974;36:192–225.
6. Britton T, O'Neill PD. Bayesian inference for stochastic epidemics in populations with random social structure. Scand. J. Stat. 2002;29:375–390. MR1925565.
7. Caimo A, Friel N. Bayesian inference for exponential random graph models. Social Networks. 2011;33:41–55.
8. Celisse A, Daudin J-J, Pierre L. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Preprint; 2011. Available at http://arxiv.org/pdf/1105.3288.pdf.
9. Daudin JJ, Picard F, Robin S. A mixture model for random graphs. Stat. Comput. 2008;18:173–183. MR2390817.
10. Daudin J-J, Pierre L, Vacher C. Model for heterogeneous random networks using continuous latent variables and an application to a tree-fungus network. Biometrics. 2010;66:1043–1051. doi:10.1111/j.1541-0420.2009.01378.x. MR2758491.
11. Davis JA. Statistical analysis of pair relationships: Symmetry, subjective consistency, and reciprocity. Sociometry. 1968;31:102–119.
12. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 1977;39:1–38. MR0501537.
13. Efron B. Bootstrap methods: Another look at the jackknife. Ann. Statist. 1979;7:1–26. MR0515681.
14. Erdős P, Rényi A. On random graphs. I. Publ. Math. Debrecen. 1959;6:290–297. MR0120167.
15. Fisher RA. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. Ser. A. 1922;222:309–368.
16. Frank O, Strauss D. Markov graphs. J. Amer. Statist. Assoc. 1986;81:832–842. MR0860518.
17. Gilbert EN. Random graphs. Ann. Math. Statist. 1959;30:1141–1144. MR0108839.
18. Groendyke C, Welch D, Hunter DR. Bayesian inference for contact networks given epidemic data. Scand. J. Stat. 2011;38:600–616. MR2833849.
19. Handcock M. Assessing degeneracy in statistical models of social networks. Technical report, Center for Statistics and the Social Sciences, Univ. Washington, Seattle; 2003. Available at http://www.csss.washington.edu/Papers.
20. Handcock MS, Raftery AE, Tantrum JM. Model-based clustering for social networks. J. Roy. Statist. Soc. Ser. A. 2007;170:301–354. MR2364300.
21. Hoff PD, Raftery AE, Handcock MS. Latent space approaches to social network analysis. J. Amer. Statist. Assoc. 2002;97:1090–1098. MR1951262.
22. Holland PW, Leinhardt S. An exponential family of probability distributions for directed graphs. J. Amer. Statist. Assoc. 1981;76:33–65. MR0608176.
23. Hunter DR, Handcock MS. Inference in curved exponential family models for networks. J. Comput. Graph. Statist. 2006;15:565–583. MR2291264.
24. Hunter DR, Lange K. A tutorial on MM algorithms. Amer. Statist. 2004;58:30–37. MR2055509.
25. Kleiner A, Talwalkar A, Sarkar P, Jordan MI. A scalable bootstrap for massive data. Preprint; 2011. Available at arXiv:1112.5016.
26. Koskinen JH, Robins GL, Pattison PE. Analysing exponential random graph (p-star) models with missing data using Bayesian data augmentation. Stat. Methodol. 2010;7:366–384. MR2643608.
27. Kunegis J, Lommatzsch A, Bauckhage C. The slashdot zoo: Mining a social network with negative edges. In: Proceedings of the 18th International Conference on World Wide Web (WWW'09). New York: ACM; 2009. pp. 741–750.
28. Mariadassou M, Robin S, Vacher C. Uncovering latent structure in valued graphs: A variational approach. Ann. Appl. Stat. 2010;4:715–742. MR2758646.
29. Massa P, Avesani P. Trust metrics on controversial users: Balancing between tyranny of the majority and echo chambers. International Journal on Semantic Web and Information Systems. 2007;3:39–64.
30. McLachlan G, Peel D. Finite Mixture Models. Wiley; New York: 2000. MR1789474.
31. Møller J, Pettitt AN, Reeves R, Berthelsen KK. An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika. 2006;93:451–458. MR2278096.
32. Neal RM, Hinton GE. A new view of the EM algorithm that justifies incremental and other variants. In: Learning in Graphical Models. Kluwer Academic; Dordrecht: 1993. pp. 355–368.
33. Nowicki K, Snijders TAB. Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc. 2001;96:1077–1087. MR1947255.
34. Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical Recipes in C++: The Art of Scientific Computing. 2nd ed. Cambridge Univ. Press; Cambridge: 2002. MR1880993.
35. Raftery AE, Niu X, Hoff PD, Yeung KY. Fast inference for the latent space network model using a case-control approximate likelihood. J. Comput. Graph. Statist. 2012;21:901–919. doi:10.1080/10618600.2012.679240.
36. Salter-Townshend M, Murphy TB. Variational Bayesian inference for the latent position cluster model for network data. Comput. Statist. Data Anal. 2013;57:661–671. MR2981116.
37. Schweinberger M. Instability, sensitivity, and degeneracy of discrete exponential families. J. Amer. Statist. Assoc. 2011;106:1361–1370. doi:10.1198/jasa.2011.tm10747. MR2896841.
38. Schweinberger M, Petrescu-Prahova M, Vu DQ. Disaster response on September 11, 2001 through the lens of statistical network analysis. Technical Report, Center for Statistics and the Social Sciences, Univ. Washington, Seattle; 2012. p. 116.
39. Schweinberger M, Snijders TAB. Settings in social networks: A measurement model. In: Stolzenberg RM, editor. Sociological Methodology. Vol. 33. Blackwell; Boston: 2003. pp. 307–341.
40. Snijders TAB. Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure. 2002;3:1–40.
41. Snijders TAB, Nowicki K. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classification. 1997;14:75–100. MR1449742.
42. Snijders TAB, Pattison PE, Robins GL, Handcock MS. New specifications for exponential random graph models. Sociological Methodology. 2006;36:99–153.
43. Stefanov SM. Convex quadratic minimization subject to a linear constraint and box constraints. Appl. Math. Res. Express. AMRX. 2004;1:17–42. MR2064084.
44. Stephens M. Dealing with label switching in mixture models. J. R. Stat. Soc. Ser. B Stat. Methodol. 2000;62:795–809. MR1796293.
45. Strauss D. On a general class of models for interaction. SIAM Rev. 1986;28:513–527. MR0867682.
46. Strauss D, Ikeda M. Pseudolikelihood estimation for social networks. J. Amer. Statist. Assoc. 1990;85:204–212. MR1137368.
47. Tallberg C. A Bayesian approach to modeling stochastic blockstructures with covariates. J. Math. Sociol. 2005;29:1–23.
48. Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning. 2008;1:1–305.
49. Wang B, Titterington DM. Inadequacy of interval estimates corresponding to variational Bayesian approximations. In: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Savannah Hotel, Barbados, Jan. 6–8, 2005. Society for Artificial Intelligence and Statistics; 2005. pp. 373–380.
50. Wasserman S, Pattison P. Logit models and logistic regressions for social networks. I. An introduction to Markov graphs and p*. Psychometrika. 1996;61:401–425. MR1424909.
51. Zanghi H, Picard F, Miele V, Ambroise C. Strategies for online inference of model-based clustering in large and growing networks. Ann. Appl. Stat. 2010;4:687–714. MR2758645.
