Large-scale estimation of random graph models with local dependence

Sergii Babkin; Jonathan R Stewart; Xiaochen Long; Michael Schweinberger

doi:10.1016/j.csda.2020.107029

2020 Jun 9;152:107029. doi: 10.1016/j.csda.2020.107029

PMCID: PMC7282802 PMID: 32834264

Abstract

A class of random graph models is considered, combining features of exponential-family models and latent structure models, with the goal of retaining the strengths of both of them while reducing the weaknesses of each of them. An open problem is how to estimate such models from large networks. A novel approach to large-scale estimation is proposed, taking advantage of the local structure of such models for the purpose of local computing. The main idea is that random graphs with local dependence can be decomposed into subgraphs, which enables parallel computing on subgraphs and suggests a two-step estimation approach. The first step estimates the local structure underlying random graphs. The second step estimates parameters given the estimated local structure of random graphs. Both steps can be implemented in parallel, which enables large-scale estimation. The advantages of the two-step estimation approach are demonstrated by simulation studies with up to 10,000 nodes and an application to a large Amazon product recommendation network with more than 10,000 products.

Keywords: Exponential random graph models, Latent structure models, Stochastic block models, Variational methods, EM algorithms, MM algorithms

1. Introduction

The statistical analysis of network data is an emerging area in statistics (Kolaczyk, 2009). Network data arise in the study of insurgent and terrorist networks, contact networks facilitating the spread of infectious diseases (e.g., coronaviruses such as SARS, MERS, and COVID-19; Ebola; HIV), social networks, and the World Wide Web.

Many models of network data have been proposed, as described in recent review papers (Fienberg, 2012, Hunter et al., 2012, Salter-Townshend et al., 2012, Smith et al., 2019, Schweinberger et al., 2020, Hoff, 2020). Among the plethora of models, two broad classes of models can be distinguished: models with latent structure, including stochastic block models (e.g., Nowicki and Snijders, 2001, Bickel and Chen, 2009, Rohe et al., 2011) and latent space models (e.g., Hoff et al., 2002, Handcock et al., 2007, Sewell and Chen, 2015, Smith et al., 2019); and exponential-family models of random graphs (Lusher et al., 2013, Harris, 2013), including Erdős and Rényi random graphs, logistic regression models, and models resembling Markov random fields in spatial statistics (Besag, 1974). Both have advantages and disadvantages, as one of the pioneers of statistical network analysis observed:

“I expect that, especially for modeling larger networks (with, say, a few hundred or more nodes), the latent space models will not be able to represent network structures as expressed by subgraph counts sufficiently well and the exponential random-graph models will not be able to represent the cohesive structure sufficiently well. Models that combine important features of these two approaches may be the next generation of social network models” (Snijders, 2007 p. 324).

In other words, latent structure models capture who is close to whom, but are not flexible models of many network phenomena of interest (despite the fact that latent space models induce a weak tendency towards transitivity, see Hoff et al., 2002). Well-posed exponential-family models of random graphs can capture a vast range of network phenomena of interest (including, but not limited to, transitivity), but the underlying assumptions of many of those models make more sense in small networks than large networks. A case in point is the Markov random graphs of Frank and Strauss (1986). These models assume that possible edges $X_{i, j} \in \{0, 1\}$ and $X_{k, l} \in \{0, 1\}$ of pairs of nodes $\{i, j\}$ and $\{k, l\}$ are independent conditional on the rest of the graph provided $\{i, j\}$ and $\{k, l\}$ do not overlap, but are otherwise dependent. As a consequence, each possible edge $X_{i, j}$ depends on $2 (n - 2)$ other possible edges, where $n$ is the number of nodes. If Markov random graphs were applied to online social networks such as Facebook, then each possible friendship would depend on billions of other possible friendships. Such dependence assumptions are implausible, and when coupled with homogeneity assumptions regarding parameters can give rise to the well-studied issue of model near-degeneracy (Handcock, 2003, Schweinberger, 2011, Chatterjee and Diaconis, 2013, Mele, 2017).

One class of next-generation random graph models was introduced in Schweinberger and Handcock (2015), which combines important features of stochastic block models and exponential-family models, with the goal of retaining the strengths of both of them while reducing the weaknesses of each of them. The basic idea is that a set of nodes is partitioned into blocks, and edges among nodes within and between blocks are governed by exponential-family models of random graphs with local dependence within blocks. A simple example is a model where edges between blocks are independent Bernoulli random variables, whereas edges within blocks are generated by an exponential-family model which encourages triangles within blocks but ensures that, for each pair of nodes within a block, the added value of additional edges and triangles decays. Such models induce local dependence within blocks and the overall dependence induced by the model is weak provided the blocks are not too large. We have shown elsewhere that such models are well-behaved—in contrast to the infamous triangle model first studied by Strauss (1986), Jonasson (1999), Häggström and Jonasson (1999), and others (e.g., Chatterjee and Diaconis, 2013)—and that statistical inference is possible and supported by statistical theory: e.g., we have established concentration and consistency results for canonical and curved exponential-family models of random graphs with local dependence under the assumption that the blocks are known and the sizes of the blocks are similar in a well-defined sense (Schweinberger and Stewart, 2020). In addition, when the blocks are unknown, the block memberships of most nodes can be recovered with high probability under weak dependence and smoothness conditions (Schweinberger, 2020).

While some progress has been made in terms of statistical theory, computing remains challenging. When the block structure is known (which is the case in multilevel networks, e.g., networks consisting of units of armed forces, Lazega and Snijders, 2016), statistical inference for parameters can rely on existing methods for estimating parameters of exponential-family models of random graphs (e.g., Strauss and Ikeda, 1990, Snijders, 2002, Hunter and Handcock, 2006, Caimo and Friel, 2011, Hummel et al., 2012, Okabayashi and Geyer, 2012, Atchade et al., 2013, Jin and Liang, 2013, Thiemichen and Kauermann, 2017, Krivitsky, 2017, Byshkin et al., 2018, Tan and Friel, 2020). If the block structure is unknown, however, it needs to be estimated based on the observed network. The recovery of unknown block structure resembles the recovery of unknown block structure in stochastic block models (e.g., Nowicki and Snijders, 2001, Bickel and Chen, 2009, Bickel et al., 2011, Rohe et al., 2011, Choi et al., 2012, Celisse et al., 2012, Priebe et al., 2012, Amini et al., 2013, Rohe et al., 2014, Zhang and Zhou, 2016, Binkiewicz et al., 2017, Gao et al., 2017). However, the recovery of unknown block structure is more challenging for exponential-family models with local dependence than stochastic block models. The main challenge is the intractability of the complete-data likelihood function, i.e., the likelihood function given an observation of the network as well as the block structure. The intractability of the complete-data likelihood function is rooted in the intractability of the normalizing constants of within-block probability mass functions, which stems from the local dependence within blocks.

We present here a tractable approximation of the likelihood function, leveraging concentration results for random graphs with local dependence. Based on the approximation of the likelihood function, we propose a two-step estimation approach that exploits the local structure of random graphs with local dependence for the purpose of local computing. The first step estimates the block structure and decomposes the random graph into subgraphs with local dependence. The decomposition of the random graph relies on approximations of the likelihood function supported by theoretical results. The second step estimates parameters given the estimated block structure by using Monte Carlo maximum likelihood methods (Hunter and Handcock, 2006) or maximum pseudolikelihood methods (Strauss and Ikeda, 1990). Both steps can be implemented in parallel, which enables large-scale estimation on multi-core computers or computing clusters. We demonstrate the advantages of the two-step estimation approach by simulation studies with up to 10,000 nodes and an application to a large Amazon product recommendation network with more than 10,000 products.

The remainder of our paper is organized as follows. Section 2 introduces models. Section 3 discusses likelihood-based inference based on approximations of the likelihood function motivated by theoretical results, and Section 4 takes advantage of such approximations to estimate models. Section 5 presents simulation results and Section 6 presents an application.

2. Models

We consider random graphs with a set of nodes $A = \{1, \dots, n\}$ and a set of edges $E \subset A \times A$. Here, edges are regarded as random variables $X_{i, j} \in \{0, 1\}$, where $X_{i, j} = 1$ if nodes $i$ and $j$ are connected by an edge and $X_{i, j} = 0$ otherwise. We focus on undirected random graphs without self-edges, although the methods introduced here can be extended to directed random graphs. Throughout, we denote by $X = {(X_{i, j})}_{i < j}^{n}$ the set of possible edges of a random graph, and by $X = \{0, 1\}^{\binom{n}{2}}$ the sample space of $X$.

Since the pioneering research of Holland and Leinhardt (1970, 1972, 1976), it has been known that network data are dependent data. In small networks, each possible edge might depend on many other possible edges – in fact, it might depend on all other possible edges. In large networks, however, it is not credible that each possible edge depends on all other possible edges: e.g., in networks with $n ≫ 10{,}000$ nodes – such as the network used in Section 6 – it is implausible that each possible edge depends on all $\binom{n}{2} - 1 ≫ 50{,}000{,}000$ other possible edges. Instead, it is tempting to believe that the dependence among edges in large networks is local, in the sense that each edge depends on a subset of other possible edges. We consider here a simple form of local dependence, following Schweinberger and Handcock (2015). Assume that $A$ is partitioned into $K \geq 2$ subsets of nodes $A_{1}, \dots, A_{K}$, called blocks, and let $z = (z_{1}, \dots, z_{n})$ be the block memberships of nodes, where $z_{i, k} = 1$ if node $i$ belongs to block $A_{k}$ and $z_{i, k} = 0$ otherwise. We henceforth denote by $X_{k, k} = {(X_{i, j})}_{i < j : z_{i, k} = z_{j, k} = 1}^{n}$ the within-block edges among nodes in block $A_{k}$ ($k = 1, \dots, K$) and by $X_{k, l} = {(X_{i, j})}_{i, j : z_{i, k} = z_{j, l} = 1}^{n}$ the between-block edges among nodes in block $A_{k}$ and nodes in block $A_{l}$ ($l < k = 1, \dots, K$).

Definition Local Dependence —

A model of a random graph $X$ satisfies local dependence if the probability mass function of $X$ can be factorized as follows:

$p_{η (θ, z)} (x) = \prod_{k = 1}^{K} p_{η_{W, k} (θ, z)} (x_{k, k}) \prod_{l = 1}^{k - 1} \prod_{i, j : z_{i, k} = z_{j, l} = 1}^{n} p_{η_{B, k, l} (θ, z)} (x_{i, j}), x \in X,$ (1)

where

• $p_{η (θ, z)} (x)$ is the probability mass of graph $x$, parameterized by $η (θ, z) \in R^{d}$;
• $p_{η_{W, k} (θ, z)} (x_{k, k})$ is the probability mass of within-block subgraph $x_{k, k}$, parameterized by $η_{W, k} (θ, z) \in R^{d_{W, k}}$ ($k = 1, \dots, K$);
• $p_{η_{B, k, l} (θ, z)} (x_{i, j})$ is the probability mass of between-block edge $x_{i, j}$, parameterized by $η_{B, k, l} (θ, z) \in R^{d_{B, k, l}}$ ($i, j : z_{i, k} = z_{j, l} = 1$);
• the parameter vector $η (θ, z) \in R^{d}$ consists of the subvectors $η_{W, k} (θ, z) \in R^{d_{W, k}}$ ($k = 1, \dots, K$) and $η_{B, k, l} (θ, z) \in R^{d_{B, k, l}}$ ($l < k = 1, \dots, K$).

The map $η : Θ \times Z \mapsto R^{d}$ depends on the model. In general, $η : Θ \times Z \mapsto R^{d}$ may be a linear or non-linear function of a parameter vector $θ \in Θ \subseteq R^{q}$ of dimension $q \leq d$ . An example is given by the curved exponential-family parameterizations used in Section 6, where $η : Θ \times Z \mapsto R^{d}$ is a non-linear function of a parameter vector $θ \in Θ \subseteq R^{q}$ of dimension $q < d$ . Well-chosen curved exponential-family parameterizations help ensure that the added value of additional edges, triangles, and other subgraph configurations decays (Stewart et al., 2019), which is plausible in many applications and can improve the in-sample performance of models (Hunter et al., 2008) as well as the out-of-sample performance of models (Stewart et al., 2019). The factorization of the probability mass function of models with local dependence has at least three implications. First, edges among nodes in block $A_{k}$ can depend on other edges in block $A_{k}$ ( $k = 1, \dots, K$ ). Second, the factorization implies that edges among nodes in block $A_{k}$ do not depend on edges that involve nodes in other blocks ( $k = 1, \dots, K$ ). Third, edges between blocks are independent. In other words, the dependence is local in the sense that it is confined to within-block subgraphs.
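The factorization in (1) can be illustrated with a small sketch that splits an observed edge list into within-block subgraphs and between-block edge sets. The function name `decompose_by_blocks`, the 0-based node labels, and the toy graph are illustrative assumptions, not part of the paper.

```python
def decompose_by_blocks(edges, z):
    """Split an undirected edge list into within-block and between-block parts.

    edges : iterable of node pairs (i, j), nodes labelled 0..n-1 (assumed labelling)
    z     : list of block labels, z[i] is the block of node i
    """
    within = {}   # block k -> edges with both endpoints in block k
    between = {}  # (k, l) with l < k -> edges joining block k and block l
    for i, j in edges:
        i, j = min(i, j), max(i, j)
        k, l = max(z[i], z[j]), min(z[i], z[j])
        if k == l:
            within.setdefault(k, []).append((i, j))
        else:
            between.setdefault((k, l), []).append((i, j))
    return within, between


# Toy graph: a triangle in block 0, one edge in block 1, one edge between blocks.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (2, 3)]
z = [0, 0, 0, 1, 1]
within, between = decompose_by_blocks(edges, z)
```

Under local dependence, each entry of `within` can be modeled by its own within-block probability mass function, while the edges in `between` are treated as independent.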

We introduce here two examples of models with local dependence, to be used as motivating examples throughout Sections 3, 4, and 5.

Example 1 Stochastic Block Model —

An important special case of models with local dependence is given by stochastic block models (e.g., Nowicki and Snijders, 2001, Bickel and Chen, 2009, Rohe et al., 2011). Stochastic block models assume that possible edges $X_{i, j}$ between nodes $i$ and $j$ in blocks $k$ and $l$ are independent Bernoulli $(μ_{k, l})$ random variables, where the probability $μ_{k, l} \in (0, 1)$ of an edge depends on the blocks $k$ and $l$ of nodes $i$ and $j$ , respectively. Stochastic block models are special cases of models with local dependence, having within-block probability mass functions of the form

$p_{η_{W, k} (θ, z)} (x_{k, k}) = \prod_{i < j : z_{i, k} = z_{j, k} = 1}^{n} μ_{k, k}^{x_{i, j}} {(1 - μ_{k, k})}^{1 - x_{i, j}} = \prod_{i < j : z_{i, k} = z_{j, k} = 1}^{n} exp (θ_{W, k, k} x_{i, j} - log (1 + exp (θ_{W, k, k}))) \propto exp (θ_{W, k, k} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, k}),$

where

$θ_{W, k, k} = logit (μ_{k, k}) \in R, k = 1, \dots, K .$

The between-block probability mass functions are of the form

$p_{η_{B, k, l} (θ, z)} (x_{k, l}) = \prod_{i < j : z_{i, k} = z_{j, l} = 1}^{n} μ_{k, l}^{x_{i, j}} {(1 - μ_{k, l})}^{1 - x_{i, j}} = \prod_{i < j : z_{i, k} = z_{j, l} = 1}^{n} exp (θ_{B, k, l} x_{i, j} - log (1 + exp (θ_{B, k, l}))) \propto exp (θ_{B, k, l} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, l}),$

where

$θ_{B, k, l} = logit (μ_{k, l}) \in R, l < k = 1, \dots, K .$

As a consequence, the joint probability mass function of random graph $X$ is proportional to

$p_{η (θ, z)} (x) \propto \prod_{k = 1}^{K} exp (θ_{W, k, k} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, k}) \prod_{l = 1}^{k - 1} exp (θ_{B, k, l} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, l}) .$

However, while stochastic block models are popular, the assumption that edges within and between blocks are independent is restrictive, because edges among nodes that are close – in the sense of being members of the same block – may very well be dependent.
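As a minimal sketch of Example 1, the loglikelihood of a stochastic block model can be evaluated edge by edge in its exponential-family form, with natural parameter $logit (μ_{k, l})$. The function name `sbm_loglik`, the dictionary layout of `mu`, and the toy values are assumptions made for illustration.

```python
import math
from itertools import combinations


def sbm_loglik(n, edges, z, mu):
    """Loglikelihood of an undirected graph under a stochastic block model.

    mu maps a block pair (k, l) with l <= k to the edge probability mu_{k,l}.
    """
    edge_set = {(min(i, j), max(i, j)) for i, j in edges}
    total = 0.0
    for i, j in combinations(range(n), 2):
        k, l = max(z[i], z[j]), min(z[i], z[j])
        p = mu[(k, l)]
        theta = math.log(p / (1.0 - p))          # natural parameter logit(mu)
        x = 1 if (i, j) in edge_set else 0
        total += theta * x - math.log1p(math.exp(theta))
    return total


# Two blocks of two nodes each; both within-block edges present, no between-block edges.
z = [0, 0, 1, 1]
mu = {(0, 0): 0.8, (1, 1): 0.8, (1, 0): 0.1}
ll = sbm_loglik(4, [(0, 1), (2, 3)], z, mu)
```

Because edges are independent, the loglikelihood is a plain sum over pairs of nodes, which is what makes the stochastic block model tractable.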

Example 2 Model with Local Dependence —

To allow edges within blocks to be dependent, we build on the stochastic block model described above, but tilt the within-block probability mass function of the stochastic block model as follows:

$p_{η_{W, k} (θ, z)} (x_{k, k}) \propto exp (θ_{W, k, k, 1} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, k}) exp (θ_{W, k, k, 2} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, k} 1_{i, j} (x_{k, k})),$

where $θ_{W, k, k, 1} \in R$ and $θ_{W, k, k, 2} \in R$, and $1_{i, j} (x_{k, k})$ is an indicator function, which is $1$ if nodes $i$ and $j$ are both connected to at least one common third node in block $A_{k}$, and is $0$ otherwise. If $x_{i, j} = 1$ and $1_{i, j} (x_{k, k}) = 1$, the edge between nodes $i$ and $j$ is called transitive, because $i$ and $j$ form transitive triples with other nodes, which are known as triangles in the random graph literature. The first term of the within-block probability mass function shown above corresponds to the within-block probability mass function of the stochastic block model. The second term tilts the within-block probability mass function of the stochastic block model: If $θ_{W, k, k, 2} > 0$, the model rewards within-block subgraphs with transitive edges, whereas $θ_{W, k, k, 2} < 0$ penalizes them, and $θ_{W, k, k, 2} = 0$ neither rewards nor penalizes them, in which case the model reduces to the stochastic block model.

Using the same between-block probability mass functions as the stochastic block model, the joint probability mass function of random graph $X$ is proportional to

$p_{η (θ, z)} (x) \propto \prod_{k = 1}^{K} exp (θ_{W, k, k, 1} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, k}) exp (θ_{W, k, k, 2} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, k} 1_{i, j} (x_{k, k})) \times \prod_{l = 1}^{k - 1} exp (θ_{B, k, l} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, l}) .$

The resulting model can capture an excess in the expected number of transitive edges within blocks, relative to the stochastic block model. To demonstrate, note that the resulting model is an exponential-family model and – by exponential-family theory (Brown, 1986, Corollary 2.5, p. 37) – the expected number of transitive edges in block $A_{k}$ satisfies

$\underbrace{E_{θ_{W, k, k, 1}, θ_{W, k, k, 2} > 0} s_{k, k, 2} (X)}_{\text{model with local dependence}} > \underbrace{E_{θ_{W, k, k, 1}, θ_{W, k, k, 2} = 0} s_{k, k, 2} (X)}_{\text{stochastic block model}}, k = 1, \dots, K,$

where $s_{k, k, 2} (X)$ is the number of transitive edges in block $A_{k}$ and $E_{θ_{W, k, k, 1}, θ_{W, k, k, 2}} s_{k, k, 2} (X)$ is the expectation of $s_{k, k, 2} (X)$ . In other words, the expected number of transitive edges in block $A_{k}$ is greater under the model with $θ_{W, k, k, 2} > 0$ than under the stochastic block model with $θ_{W, k, k, 2} = 0$ , assuming that both have the same edge parameters $θ_{W, k, k, 1}$ ( $k = 1, \dots, K$ ). Therefore, the model with local dependence can capture an excess in the expected number of transitive edges within blocks, relative to the stochastic block model. In addition, models with local dependence can capture excesses in the expected number of other subgraph statistics within blocks by adding suitable model terms.
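The within-block sufficient statistics of Example 2 can be computed directly from an edge list. The sketch below assumes that the indicator $1_{i, j} (x_{k, k})$ equals $1$ when $i$ and $j$ share at least one common neighbour inside the block; the function names and the toy block are hypothetical.

```python
def within_block_statistics(block_nodes, edges):
    """Return (number of edges, number of transitive edges) within one block."""
    nodes = set(block_nodes)
    nbrs = {v: set() for v in nodes}
    for i, j in edges:
        if i in nodes and j in nodes and i != j:
            nbrs[i].add(j)
            nbrs[j].add(i)
    n_edges = sum(len(s) for s in nbrs.values()) // 2
    # An edge {i, j} is transitive if i and j have a common neighbour in the block.
    n_transitive = sum(
        1 for i in nodes for j in nbrs[i] if i < j and nbrs[i] & nbrs[j]
    )
    return n_edges, n_transitive


def unnormalized_log_mass(block_nodes, edges, theta_edge, theta_trans):
    """Exponent of the tilted within-block mass, up to the normalizing constant."""
    n_edges, n_transitive = within_block_statistics(block_nodes, edges)
    return theta_edge * n_edges + theta_trans * n_transitive


# Block {0, 1, 2, 3}: a triangle on 0, 1, 2 plus a pendant edge (2, 3).
block_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
stats = within_block_statistics([0, 1, 2, 3], block_edges)
log_mass = unnormalized_log_mass([0, 1, 2, 3], block_edges, -1.0, 0.5)
```

The three triangle edges are transitive, whereas the pendant edge is not, since nodes 2 and 3 have no common neighbour.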

How models with within-block edge and transitive edge terms differ from the “triangle model”.

It is worth noting that the model with local dependence induced by within-block edge and transitive edge terms differs in two important ways from the infamous triangle model with edge and triangle terms, which has been known to be ill-behaved since the 1980s (Strauss, 1986, Jonasson, 1999, Schweinberger, 2011, Chatterjee and Diaconis, 2013):

• The model with local dependence restricts dependence among edges to subsets of nodes, i.e., blocks. As long as the blocks are not too large, the overall dependence induced by the model is weak. By contrast, the triangle model does not restrict dependence to subsets of nodes, leading to undesirable behavior in large graphs.
• Within blocks, the model with local dependence and positive within-block transitive edge parameters assumes that, for each pair of nodes, the value added by the first triangle to the log odds of the conditional probability of an edge is positive, but additional triangles add nothing. By contrast, the triangle model assumes that, for each pair of nodes, each additional triangle has the same added value, which is unreasonable and results in undesirable behavior in large graphs.

These sensible assumptions ensure that models with local dependence have more desirable properties than the triangle model (Schweinberger and Stewart, 2020).

Exponential-family representations of models.

It is worth noting that Models 1 and 2 can be represented in exponential-family form:

$p_{η (θ, z)} (x) = exp (〈 η (θ, z), s (x) 〉 - ψ (η (θ, z))), x \in X,$

where $〈 η (θ, z), s (x) 〉$ denotes the inner product of a vector of natural parameters $η : Θ \times Z \mapsto R^{d}$ and a vector of sufficient statistics $s : X \mapsto R^{d}$ , and $ψ (η (θ, z))$ ensures that $p_{η (θ, z)} (x)$ sums to $1$ . While exponential-family representations are not needed to introduce Models 1 and 2, the properties of exponential families facilitate the theoretical results in Section 3.2, concerned with approximations of likelihood functions.

3. Likelihood-based inference

While the block structure $z$ is observed in some applications (e.g., in multilevel networks), it is unobserved in other applications. We focus here on unobserved block structure $z$ .

It is tempting to base statistical inference for the unknown block structure $z$ and the unknown parameter vector $θ$ on the likelihood function, but likelihood-based inference for models with local dependence is challenging. The main reason is that the probability mass function $p_{η (θ, z)} (x)$ is intractable, because the within-block probability mass functions $p_{η_{W, k} (θ, z)} (x_{k, k})$ are intractable ($k = 1, \dots, K$). The intractability of $p_{η_{W, k} (θ, z)} (x_{k, k})$ is rooted in the fact that its normalizing constant is a sum over all $exp (\binom{| A_{k} |}{2} log 2) = 2^{\binom{| A_{k} |}{2}}$ possible within-block subgraphs of block $A_{k}$, which cannot be computed unless $A_{k}$ is small, i.e., unless $| A_{k} | ≪ 10$ ($k = 1, \dots, K$).
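The growth of the normalizing constant can be made concrete with a brute-force sketch for very small blocks. It assumes the within-block model with edge and transitive edge terms from Section 2, with the indicator read as "the two nodes share a common neighbour in the block"; the function names and parameter values are illustrative.

```python
import math
from itertools import combinations, product


def num_within_block_subgraphs(m):
    """Number of possible subgraphs on a block of m nodes: 2^(m choose 2)."""
    return 2 ** math.comb(m, 2)


def log_normalizing_constant(m, theta_edge, theta_trans):
    """Brute-force log normalizing constant; feasible only for tiny m."""
    pairs = list(combinations(range(m), 2))
    total = 0.0
    for x in product((0, 1), repeat=len(pairs)):
        present = [pair for pair, x_ij in zip(pairs, x) if x_ij == 1]
        nbrs = {v: set() for v in range(m)}
        for i, j in present:
            nbrs[i].add(j)
            nbrs[j].add(i)
        n_trans = sum(1 for i, j in present if nbrs[i] & nbrs[j])
        total += math.exp(theta_edge * len(present) + theta_trans * n_trans)
    return math.log(total)


counts = {m: num_within_block_subgraphs(m) for m in (4, 10, 20)}
log_c_sbm = log_normalizing_constant(4, -1.0, 0.0)     # no tilt: closed form exists
log_c_tilted = log_normalizing_constant(4, -1.0, 0.5)  # tilt: enumeration required
```

With the transitive edge parameter set to $0$, the result agrees with the closed form $\binom{m}{2} log (1 + exp (θ))$ of the stochastic block model; with a non-zero parameter no such closed form is available, and a block of only 10 nodes already has $2^{45}$ possible subgraphs.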

To facilitate likelihood-based inference, we introduce tractable approximations of the intractable probability mass function $p_{η (θ, z)} (x)$ in Section 3.1 and support them by theoretical results in Section 3.2. A statistical algorithm that takes advantage of such approximations is introduced in Section 4.

3.1. Approximate likelihood functions: motivation

Suppose that we want to estimate both $z$ and $θ$ . It is natural to estimate them by using an iterative algorithm that cycles through updates of $z$ and $θ$ as follows:

Step 1: Update $z$ given $θ$ .
Step 2: Update $θ$ given $z$ .

The algorithm sketched above is generic and cannot be used in practice, but regardless of which specific algorithm is used – whether EM, Monte Carlo EM, variational EM, Bayesian Markov chain Monte Carlo, or other algorithms – most of them have in common that Step 1 is either infeasible or time-consuming, whereas Step 2 is less problematic than Step 1.

Step 1.

Step 1 is either infeasible or time-consuming, because the probability mass function $p_{η (θ, z)} (x)$ is intractable. To demonstrate, consider a Bayesian Markov chain Monte Carlo algorithm that updates $z = (z_{1}, \dots, z_{n})$ given $θ$ by Gibbs sampling. Gibbs sampling of $z_{1}, \dots, z_{n}$ turns out to be infeasible, because the full conditional distributions of $z_{1}, \dots, z_{n}$ depend on the intractable within-block probability mass functions $p_{η_{W, k} (θ, z)} (x_{k, k})$ ( $k = 1, \dots, K$ ). One could approximate them by Monte Carlo samples of within-block subgraphs, but such approximations may not generate Markov chain Monte Carlo samples from the target distribution (Liang et al., 2016) and are problematic on computational grounds:

• Using Monte Carlo approximations of within-block probability mass functions is infeasible when the number of nodes $n$ is large, because such approximations are needed for each update of each of the $n$ block memberships $z_{1}, \dots, z_{n}$.
• Worse, the $n$ block memberships $z_{1}, \dots, z_{n}$ cannot be updated in parallel, because the block membership of one node depends on the block memberships of other nodes.

As a consequence, Step 1 is infeasible when $n$ is large.

Step 2.

Step 2 is less problematic than Step 1. While the probability mass function $p_{η (θ, z)} (x)$ is intractable and may have to be approximated by Monte Carlo methods (Hunter and Handcock, 2006), such Monte Carlo approximations are needed once to update $θ$ given $z_{1}, \dots, z_{n}$ , whereas Monte Carlo approximations are needed $n$ times to update $z_{1}, \dots, z_{n}$ given $θ$ one by one. In addition, the probability mass function $p_{η (θ, z)} (x)$ decomposes into between- and within-block probability mass functions and hence within-block probability mass functions can be approximated in parallel, i.e., on multi-core computers or computing clusters.

Approximations.

To enable feasible updates of $z$ given $θ$ when $n$ is large, we are interested in approximating the intractable probability mass function $p_{η (θ, z)} (x)$ by a tractable probability mass function. To do so, we focus on models with between-block edge terms $θ_{B, k, l} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, l}$ and within-block edge terms $θ_{W, k, k, 1} \sum_{i < j}^{n} x_{i, j} z_{i, k} z_{j, k}$ along with additional within-block terms that induce local dependence within blocks. We denote the vector of between- and within-block edge parameters by $θ_{1}$ and the vector of all other within-block parameters by $θ_{2}$ . We assume that $θ_{2} = 0$ reduces the model to the stochastic block model. An example is given by the model with between-block edge terms and within-block edge and transitive edge terms described in Section 2: the parameter vector $θ_{1}$ consists of the between-block edge parameters $θ_{B, k, l}$ ( $l < k = 1, \dots, K$ ) and the within-block edge parameters $θ_{W, k, k, 1}$ ( $k = 1, \dots, K$ ), whereas the parameter vector $θ_{2}$ consists of the within-block transitive edge parameters $θ_{W, k, k, 2}$ ( $k = 1, \dots, K$ ). If the parameter vector $θ_{2}$ is set to $0$ , the model reduces to the stochastic block model, under which edges within and between blocks are independent.

Such models have two useful properties:

• The probability mass functions $p_{η (θ, z)} (x)$ and $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ impose the same probability law on between-block subgraphs.
• The probability mass function $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ is tractable, because edges between and within blocks are independent under all $z$.

We henceforth approximate $p_{η (θ, z)} (x)$ by $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ , which corresponds to the probability mass function of the stochastic block model.

The idea underlying the approximation is that when the blocks are not too large, most pairs of nodes are not members of the same block. Since $p_{η (θ, z)} (x)$ and $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ impose the same probability law on possible edges between pairs of nodes that are not members of the same block, $p_{η (θ, z)} (x)$ and $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ agree on most of the random graph. Therefore, $p_{η (θ, z)} (x)$ can be approximated by $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ for the purpose of updating $z$ given $θ$. Suppose, e.g., that we consider updating $z$ given $θ$ by replacing $z$ by $z^{'} \neq z$. We may decide to do so if the loglikelihood ratio

$log \frac{p_{η (θ, z^{'})} (x)}{p_{η (θ, z)} (x)} = log p_{η (θ, z^{'})} (x) - log p_{η (θ, z)} (x)$

is large: e.g., the acceptance probability of Metropolis–Hastings algorithms depends on the loglikelihood ratio above. If $p_{η (θ, z)} (x)$ can be approximated by $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ , we can base the decision on $log p_{η (θ_{1}, θ_{2} = 0, z^{'})} (x) - log p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ rather than $log p_{η (θ, z^{'})} (x) - log p_{η (θ, z)} (x)$ , because

$log p_{η (θ, z^{'})} (x) - log p_{η (θ, z)} (x) = [log p_{η (θ_{1}, θ_{2} = 0, z^{'})} (x) - log p_{η (θ_{1}, θ_{2} = 0, z)} (x)] + [log p_{η (θ, z^{'})} (x) - log p_{η (θ_{1}, θ_{2} = 0, z^{'})} (x)] - [log p_{η (θ, z)} (x) - log p_{η (θ_{1}, θ_{2} = 0, z)} (x)] .$

Therefore, as long as

$max_{z} | log p_{η (θ, z)} (x) - log p_{η (θ_{1}, θ_{2} = 0, z)} (x) |$

is small, we have

$log p_{η (θ, z^{'})} (x) - log p_{η (θ, z)} (x) \approx log p_{η (θ_{1}, θ_{2} = 0, z^{'})} (x) - log p_{η (θ_{1}, θ_{2} = 0, z)} (x) .$

The advantage of approximating $p_{η (θ, z)} (x)$ by $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ is that there exist methods for stochastic block models to estimate the block structure from large random graphs (e.g., Daudin et al., 2008, Rohe et al., 2011, Amini et al., 2013, Vu et al., 2013). We take advantage of such methods in Section 4, but we first shed light on the conditions under which ${max}_{z} | log p_{η (θ, z)} (x) - log p_{η (θ_{1}, θ_{2} = 0, z)} (x) |$ is small.
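The approximate loglikelihood ratio can be sketched for a single proposed change of $z$. To keep the example short, it assumes one common within-block edge probability and one common between-block edge probability, which is a simplification of the restricted model with $θ_{2} = 0$; all names and values are hypothetical.

```python
import math
from itertools import combinations


def restricted_loglik(n, edge_set, z, mu_within, mu_between):
    """Loglikelihood under the restricted (stochastic block) model."""
    total = 0.0
    for i, j in combinations(range(n), 2):
        p = mu_within if z[i] == z[j] else mu_between
        total += math.log(p) if (i, j) in edge_set else math.log(1.0 - p)
    return total


def approx_loglik_ratio(n, edge_set, z, z_new, mu_within, mu_between):
    """Tractable stand-in for log p(x; theta, z_new) - log p(x; theta, z)."""
    return (restricted_loglik(n, edge_set, z_new, mu_within, mu_between)
            - restricted_loglik(n, edge_set, z, mu_within, mu_between))


# Two triangles {0, 1, 2} and {3, 4, 5}; node 5 is initially placed in the wrong block.
edge_set = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)}
z_old = [0, 0, 0, 1, 1, 0]
z_new = [0, 0, 0, 1, 1, 1]
ratio = approx_loglik_ratio(6, edge_set, z_old, z_new, 0.8, 0.1)
```

A positive ratio favours the proposed block structure; in a Metropolis–Hastings update with a symmetric proposal, the ratio would enter the acceptance probability in place of the intractable loglikelihood ratio.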

3.2. Approximate likelihood functions: theoretical results

We show that updates of $z$ given $θ$ can be based on $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ rather than $p_{η (θ, z)} (x)$ by showing that

$max_{z} | log p_{η (θ, z)} (X) - log p_{η (θ_{1}, θ_{2} = 0, z)} (X) |$

is small with high probability when the blocks are not too large.

We start with a special case in Theorem 1 and then present more general results in Theorem 2. To prepare the ground for Theorems 1 and 2, let $Z = \{1, \dots, K\}^{n}$ be the set of all block structures, and denote by $m (z)$ the size of the largest block under $z \in Z$. Let $S \subseteq Z$ be any subset of block structures that includes the data-generating block structure $z^{⋆} \in Z$, and denote by $m (S)$ the size of the largest block among all block structures $z \in S$, so that $m (z) \leq m (S)$ for all $z \in S$. We allow the size of the largest block $m (S)$ to increase as a function of the number of nodes $n$: e.g., $m (S)$ may be a constant multiple of $log n$ or $n^{α}$ ($α < 1$). The size of the largest data-generating block is denoted by $‖ A ‖_{\infty} = {max}_{1 \leq k \leq K} | A_{k} |$.

Theorem 1

Consider the model with between-block edge terms and within-block edge and transitive edge terms described in Section 2. Let $Θ = \prod_{l \leq k}^{K} Θ_{k, l}$ be the parameter space, $Θ_{k, l}$ be a compact subset of $R$ ($l < k = 1, \dots, K$), and $Θ_{k, k}$ be a compact subset of $R^{2}$ ($k = 1, \dots, K$). Choose $ϵ \in (0, 1)$ as small as desired. Then there exists a universal constant $c > 0$ such that, for all $θ \in Θ$, with probability at least $1 - ϵ$,

$max_{z \in S} | log p_{η (θ, z)} (X) - log p_{η (θ_{1}, θ_{2} = 0, z)} (X) | < 2 c \sqrt{- log \frac{ϵ}{2} + n log K} \sqrt{K} m {(S)}^{2} ‖ A ‖_{\infty}^{2} log n .$

The proof of Theorem 1 can be found in the supplement. The basic idea underlying Theorem 1 is that the deviation ${max}_{z} | log p_{η (θ, z)} (x) - log p_{η (θ_{1}, θ_{2} = 0, z)} (x) |$ cannot be too large when the blocks are not too large, because most pairs of nodes are not members of the same block and $p_{η (θ, z)} (x)$ and $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ impose the same probability law on possible edges between pairs of nodes that are not members of the same block.

Theorem 1 implies that the largest deviation of the loglikelihood function under the unrestricted model and the restricted model is smaller than $n^{2}$ with high probability, provided the blocks are not too large. To see that, observe that $‖ A ‖_{\infty} \leq m (S)$ and $K \leq n$ imply

2 c \sqrt{- log \frac{ϵ}{2} + n log K} \sqrt{K} m {(S)}^{2} {‖ A ‖}_{\infty}^{2} log n \leq δ (ϵ) n m {(S)}^{4} {(log n)}^{3 ∕ 2},

where $δ (ϵ) > 0$ is a function of $ϵ$ but is not a function of the number of nodes $n$ . In other words, as long as the size $m (S)$ of the largest block in $S$ satisfies $m (S) ≪ n^{1 ∕ 4} ∕ {(log n)}^{3 ∕ 8}$ , the maximum deviation is much smaller than $n^{2}$ with high probability. As a consequence, the largest deviation of the loglikelihood function under the unrestricted model and the restricted model is small with high probability – note that in dense random graphs many quantities are of order $n^{2}$ , so quantities of order less than $n^{2}$ may be considered small. For example, consider Bernoulli $(μ)$ random graphs, under which possible edges $X_{i, j}$ are independent Bernoulli $(μ)$ random variables (Erdős and Rényi, 1959, Erdős and Rényi, 1960). If a Bernoulli $(μ)$ random graph is dense in the sense that $μ = E X_{i, j} \in (0, 1)$ does not decrease as a function of the number of nodes $n$ , then the expected number of edges is of order $n^{2}$ ,

E \sum_{i < j}^{n} X_{i, j} = \binom{n}{2} μ,

and so is the expected loglikelihood function of the natural parameter $θ = logit (μ) \in R$ ,

E log p_{θ} (X) = \binom{n}{2} (θ μ - log (1 + exp (θ))) .

Other quantities are likewise of order $n^{2}$ in dense random graphs, so quantities of order less than $n^{2}$ may be considered small. Thus, the largest deviation of the loglikelihood function under the unrestricted model and the restricted model is small with high probability.
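The order-$n^{2}$ claim for dense Bernoulli $(μ)$ random graphs can be checked numerically. The following is a minimal sketch using only the two formulas above; the function names and the value $μ = .3$ are illustrative choices, not part of the paper.

```python
import math

def expected_edges(n, mu):
    """Expected number of edges, C(n, 2) * mu."""
    return math.comb(n, 2) * mu

def expected_loglik(n, mu):
    """Expected loglikelihood C(n, 2) * (theta * mu - log(1 + exp(theta))),
    evaluated at the natural parameter theta = logit(mu)."""
    theta = math.log(mu / (1.0 - mu))
    return math.comb(n, 2) * (theta * mu - math.log1p(math.exp(theta)))

# Both ratios settle near constants as n grows, i.e., both quantities
# are of order n^2 when mu does not decrease with n.
for n in (100, 1000, 10000):
    print(n, expected_edges(n, 0.3) / n**2, expected_loglik(n, 0.3) / n**2)
```

With $μ$ held fixed, both ratios stabilize as $n$ grows, which is the sense in which quantities of order less than $n^{2}$ are small by comparison.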

Last, but not least, note that $K$ blocks of size at most $m (S)$ must cover all $n$ nodes, so that $K \geq n ∕ m (S)$ . Hence $m (S) ≪ n^{1 ∕ 4} ∕ {(log n)}^{3 ∕ 8}$ implies that the number of blocks $K$ must satisfy $K ≫ n^{3 ∕ 4} {(log n)}^{3 ∕ 8}$ . In other words, the number of blocks $K$ needs to grow as a function of the number of nodes $n$ . It is worth noting that in the special case of stochastic block models, it is known that $K$ is allowed to grow as fast as $n ∕ log n$ in dense-graph settings (Zhang and Zhou, 2016).
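To get a feeling for these rates, the sketch below evaluates the two scales $n^{1 ∕ 4} ∕ {(log n)}^{3 ∕ 8}$ and $n^{3 ∕ 4} {(log n)}^{3 ∕ 8}$ for a few values of $n$ . The function names are hypothetical, and the unknown constants $c$ and $δ (ϵ)$ are omitted, so the numbers indicate orders of magnitude only, not admissible block sizes.

```python
import math

def max_block_size_scale(n):
    """Scale n^(1/4) / (log n)^(3/8) that m(S) has to stay well below."""
    return n ** 0.25 / math.log(n) ** 0.375

def min_block_count_scale(n):
    """Scale n^(3/4) * (log n)^(3/8) that K has to exceed, using K >= n / m(S)."""
    return n ** 0.75 * math.log(n) ** 0.375

def bound_growth(n, m):
    """n * m^4 * (log n)^(3/2): the n-dependent part of the upper bound,
    with the factor delta(epsilon) left out."""
    return n * m ** 4 * math.log(n) ** 1.5

for n in (10**3, 10**4, 10**6):
    m = max_block_size_scale(n)
    print(n, round(m, 2), round(min_block_count_scale(n)), bound_growth(n, m) / n**2)
```

At the boundary $m (S) = n^{1 ∕ 4} ∕ {(log n)}^{3 ∕ 8}$ the $n$-dependent part of the bound equals $n^{2}$ , which is why $m (S)$ has to be of smaller order.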

It turns out that the result in Theorem 1 is not limited to the model with between-block edge terms and within-block edge and transitive edge terms, but is a special case of more general results. To introduce these more general results, we make the following assumptions. We assume that the probability mass function $p_{η (θ, z)} (x)$ can be represented in exponential-family form,

p_{η (θ, z)} (x) = exp (〈 η (θ, z), s (x) 〉 - ψ (η (θ, z))), x \in X,

where $〈 η (θ, z), s (x) 〉$ denotes the inner product of a vector of natural parameters $η : Θ \times Z \mapsto R^{d}$ and a vector of sufficient statistics $s : X \mapsto R^{d}$ , and $ψ (η (θ, z))$ ensures that $p_{η (θ, z)} (x)$ sums to $1$ ; note that all models used in our paper can be represented in exponential-family form. We assume that $η : Θ \times Z \mapsto Ξ$ and that $Ξ \subseteq int (N)$ is a subset of the interior $int (N)$ of the natural parameter space $N$ of the exponential family (Brown, 1986). Let $E \equiv E_{η^{⋆}}$ be the expectation under the data-generating parameter vector $η^{⋆} \equiv η (θ^{⋆}, z^{⋆})$ , where $(θ^{⋆}, z^{⋆}) \in Θ \times Z$ denotes the data-generating value of $(θ, z) \in Θ \times Z$ . We denote by $d : X \times X \mapsto \{0, 1, 2, \dots\}$ the Hamming metric, which is defined by

d (x_{1}, x_{2}) = \sum_{i < j}^{n} 1_{x_{1, i, j} \neq x_{2, i, j}}, (x_{1}, x_{2}) \in X \times X,

where $1_{x_{1, i, j} \neq x_{2, i, j}}$ is $1$ if $x_{1, i, j} \neq x_{2, i, j}$ and is $0$ otherwise. The main assumptions can then be stated as follows.

[C.1]
There exists a constant $c > 0$ such that, for all $(θ, z) \in Θ \times Z$ and all $(x_{1}, x_{2}) \in X \times X$ ,
$| 〈η (θ, z), s (x_{1}) - s (x_{2})〉 | \leq c d (x_{1}, x_{2}) m (z) log n .$
[C.2]
There exists a constant $c > 0$ such that, for all $(θ_{k, l, 1}, θ_{k, l, 2}) \in Θ_{k, l} \times Θ_{k, l}$ and all $z \in Z$ ,
$| 〈η_{k, l} (θ_{k, l, 1}, z) - η_{k, l} (θ_{k, l, 2}, z), E_{η (θ, z)} s_{k, l} (X)〉 | \leq c ‖ θ_{k, l, 1} - θ_{k, l, 2} ‖ m {(z)}^{2} log n,$
where $η_{k, l} (θ_{k, l}, z)$ , $θ_{k, l}$ , and $s_{k, l} (x)$ denote the subvectors of $η (θ, z)$ , $θ$ , and $s (x)$ corresponding to the subgraph between blocks $k$ and $l$ ( $l < k$ ) or the subgraph of block $k$ ( $k = l$ ) and $Θ_{k, l}$ is a compact subset of $R^{q_{k, l}}$ ( $l \leq k = 1, \dots, K$ ).

Conditions [C.1] and [C.2] are smoothness conditions: [C.1] states that the inner product of natural parameters and sufficient statistics must not be too sensitive to changes of edges, whereas [C.2] states that the inner product of between- and within-block natural parameters and expected sufficient statistics must not be too sensitive to changes of parameters.

An example of a model violating condition [C.1] is a model containing a sufficient statistic $s (x)$ of the form

s (x) = \begin{cases} 0 & \text{if } \sum_{i < j}^{n} x_{i, j} = 0 \\ \binom{n}{2} & \text{if } \sum_{i < j}^{n} x_{i, j} \in \{1, \dots, \binom{n}{2}\} . \end{cases}

Under such models, adding a single edge can change the inner product of natural parameters and sufficient statistics by a multiple of $n^{2}$ . That would violate [C.1], because [C.1] assumes that changing a single edge changes the inner product by at most $c m (z) log n \leq c n log n$ .
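The violation can be made concrete with a small sketch. Graphs are represented as sets of edges $(i, j)$ with $i < j$ , which is a representation chosen here for illustration; the function names are hypothetical.

```python
from math import comb

def hamming(x1, x2):
    """Hamming distance between two graphs given as sets of edges (i, j), i < j."""
    return len(x1 ^ x2)  # symmetric difference = pairs on which the graphs differ

def s_edges(x):
    """Well-behaved statistic: the number of edges."""
    return len(x)

def s_all_or_nothing(x, n):
    """Statistic of the example: 0 for the empty graph, C(n, 2) otherwise."""
    return 0 if len(x) == 0 else comb(n, 2)

n = 100
empty, one_edge = set(), {(0, 1)}
print(hamming(empty, one_edge))                               # graphs differ in one pair
print(s_edges(one_edge) - s_edges(empty))                     # edge count changes by 1
print(s_all_or_nothing(one_edge, n) - s_all_or_nothing(empty, n))  # changes by C(n, 2)
```

A change of Hamming distance $1$ moves the edge count by $1$ but moves the all-or-nothing statistic by $\binom{n}{2}$ , which grows faster than the $c m (z) log n$ allowed by [C.1] for a fixed nonzero natural parameter.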

An example of a model violating condition [C.2] is a model with within-block sufficient statistics that count the number of triangles within blocks,

s_{k, k} (x) = \sum_{h < i < j : h, i, j \in A_{k}}^{n} x_{i, h} x_{j, h} x_{i, j}, k = 1, \dots, K,

with within-block parameters of the form $η_{k, k} (θ, z) = θ_{k, k} \in R$ ( $k = 1, \dots, K$ ). If all nodes belong to block $A_{1}$ , the inner product of the block’s within-block natural parameter vector and expected sufficient statistic vector is a multiple of $| θ_{1, 1} - θ_{1, 1}^{'} | n^{3}$ , where $θ_{1, 1}$ and $θ_{1, 1}^{'}$ are two possible values of the within-block triangle parameter of block $A_{1}$ . As a result, [C.2] would be violated, because [C.2] requires the inner product to be at most $c | θ_{1, 1} - θ_{1, 1}^{'} | m {(z)}^{2} log n \leq c | θ_{1, 1} - θ_{1, 1}^{'} | n^{2} log n$ .

However, the fact that some model specifications violate conditions [C.1] and [C.2] is not necessarily a major concern, for two reasons. First, the specifications that violate conditions [C.1] and [C.2] are, more often than not, of limited interest in applications: e.g., we are not aware of any good reason for using the sufficient statistic in the first example, and the triangle term in the second example is known to induce model near-degeneracy in large networks and may therefore not be useful in practice (Strauss, 1986, Jonasson, 1999, Schweinberger, 2011, Chatterjee and Diaconis, 2013). Second, while some ill-posed specifications do not satisfy conditions [C.1] and [C.2], there are many well-posed specifications that do satisfy them: e.g., conditions [C.1] and [C.2] are satisfied by the model with between-block edge and within-block edge and transitive edge terms, which we verify in the proof of Theorem 1. In addition, conditions [C.1] and [C.2] cover the models with size-dependent parameterizations used in Sections 5 and 6.

The following result, Theorem 2, is a generalization of Theorem 1.

Theorem 2

Consider a model with local dependence satisfying conditions [C.1] and [C.2]. Choose $ϵ \in (0, 1)$ as small as desired. Then there exists a universal constant $c > 0$ such that, for all $θ \in Θ$ , with probability at least $1 - ϵ$ ,

$max_{z \in S} | log p_{η (θ, z)} (X) - log p_{η (θ_{1}, θ_{2} = 0, z)} (X) | < 2 c \sqrt{- log \frac{ϵ}{2} + n log K} \sqrt{K} m {(S)}^{2} ‖ A ‖_{\infty}^{2} log n .$

The proof of Theorem 2 can be found in the supplement. An application of Theorem 2 to the model with between-block edge terms and within-block edge and transitive edge terms can be found in Theorem 1.

Trade-off between $m (S)$ and the recovery of block structure.

Last, but not least, it is worth noting that the upper bound $m (S)$ on the sizes of blocks cannot be too small, because it is not possible to recover the block structure with high probability when the blocks are too small. However, Zhang and Zhou (2016) showed that under stochastic block models the number of blocks $K$ is allowed to grow as fast as $n ∕ log n$ in dense-graph settings, which implies that the sizes of the blocks can be as small as $log n$ . These important issues are studied in more depth in Zhang and Zhou (2016) and Gao et al. (2017).

4. Two-step estimation approach

We propose a two-step estimation approach that takes advantage of the theoretical results of Section 3 and enables large-scale estimation of models with local dependence.

To describe the two-step estimation approach, assume that $z = (z_{1}, \dots, z_{n})$ is the observed value of a random variable $Z = (Z_{1}, \dots, Z_{n})$ with distribution

Z_{i} \overset{iid}{\sim} Multinomial (1, π = (π_{1}, \dots, π_{K})), i = 1, \dots, n .

It is natural to base statistical inference on the observed-data likelihood function

ℒ (θ, π) = \sum_{z \in Z} p_{η (θ, z)} (x) p_{π} (z) .

The problem is that $ℒ (θ, π)$ is intractable, because $p_{η (θ, z)} (x)$ is intractable and the set $Z$ contains $exp (n log K)$ elements.

The first problem can be solved by taking advantage of the theoretical results of Section 3, which suggest that $p_{η (θ, z)} (x)$ can be approximated by $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ provided that the blocks are not too large. A complication is that $p_{η (θ, z)} (x)$ and $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ may not be close when the block structure $z \in Z ∖ S$ is not contained in $S$ , in which case some of the blocks can be large and the within-block models of large blocks can induce strong dependence. In the worst case, all nodes belong to a single block, in which case $p_{η (θ, z)} (x)$ may induce strong dependence throughout the random graph while $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ induces no dependence at all, so $p_{η (θ, z)} (x)$ and $p_{η (θ_{1}, θ_{2} = 0, z)} (x)$ may be very different. However, the basic inequality

\sum_{z \in S} p_{η (θ, z)} (x) p_{π} (z) \leq ℒ (θ, π) \leq \sum_{z \in S} p_{η (θ, z)} (x) p_{π} (z) + P_{π} (Z \in Z ∖ S)

suggests that as long as the event $Z \in Z ∖ S$ is a rare event in the sense that $P_{π} (Z \in Z ∖ S)$ is close to $0$ , $ℒ (θ, π)$ can be approximated by $ℒ (θ_{1}, θ_{2} = 0, π)$ :

ℒ (θ, π) = \sum_{z \in Z} p_{η (θ, z)} (x) p_{π} (z) \approx \sum_{z \in S} p_{η (θ, z)} (x) p_{π} (z) \approx \sum_{z \in S} p_{η (θ_{1}, θ_{2} = 0, z)} (x) p_{π} (z) \approx \sum_{z \in Z} p_{η (θ_{1}, θ_{2} = 0, z)} (x) p_{π} (z) = ℒ (θ_{1}, θ_{2} = 0, π) .

The assumption that $Z \in Z ∖ S$ is a rare event – i.e., the probabilities $π_{1}, \dots, π_{K}$ are small – makes sense in a wide range of applications, because communities in real-world networks tend to be small (see, e.g., the discussion of Rohe et al., 2011). Therefore, as long as $Z \in Z ∖ S$ is a rare event, we can base statistical inference concerning the block structure on $ℒ (θ_{1}, θ_{2} = 0, π)$ rather than $ℒ (θ, π)$ . To simplify the notation, we write henceforth $ℒ (θ_{1}, π)$ instead of $ℒ (θ_{1}, θ_{2} = 0, π)$ .

The second problem can be solved by methods developed for stochastic block models, because $ℒ (θ_{1}, π)$ is the observed-data likelihood function of a stochastic block model. There are many stochastic block model methods that could be used, such as profile likelihood (Bickel and Chen, 2009), pseudo-likelihood (Amini et al., 2013), spectral clustering (Rohe et al., 2011), and variational methods (Daudin et al., 2008, Vu et al., 2013). Among these methods, we found that the variational methods of Vu et al. (2013) work best in practice: the simulation results in Section 5 demonstrate this. In addition, the variational methods of Vu et al. (2013) have the advantage of being able to estimate stochastic block models from networks with hundreds of thousands of nodes due to a running time of $O (n)$ for sparse random graphs and $O (n^{2})$ for dense random graphs (Vu et al., 2013). Some consistency and asymptotic normality results for variational methods for stochastic block models were established by Celisse et al. (2012) and Bickel et al. (2013).

Variational methods approximate $ℓ (θ_{1}, π) = log ℒ (θ_{1}, π)$ by introducing an auxiliary distribution $a (z)$ with support $Z$ and bounding $ℓ (θ_{1}, π)$ from below by using Jensen’s inequality:

ℓ (θ_{1}, π) = log \sum_{z \in Z} a (z) \frac{p_{η (θ_{1}, θ_{2} = 0, z)} (x) p_{π} (z)}{a (z)} \geq \sum_{z \in Z} a (z) log \frac{p_{η (θ_{1}, θ_{2} = 0, z)} (x) p_{π} (z)}{a (z)} \overset{def}{=} \hat{ℓ} (θ_{1}, π) .

Each auxiliary distribution with support $Z$ gives rise to a lower bound on $ℓ (θ_{1}, π)$ . To choose the best auxiliary distribution – i.e., the auxiliary distribution that gives rise to the tightest lower bound on $ℓ (θ_{1}, π)$ – we choose a family of auxiliary distributions and select the best member of the family. In practice, an important consideration is that the resulting lower bound is tractable. Therefore, we confine attention to a family of auxiliary distributions under which the resulting lower bounds are tractable. A natural choice is given by a family of auxiliary distributions under which the block memberships are independent:

Z_{i} \overset{ind}{\sim} Multinomial (1, α_{i} = (α_{i, 1}, \dots, α_{i, K})), i = 1, \dots, n .

By the independence of block memberships under the auxiliary distribution, we obtain the following tractable lower bound on $ℓ (θ_{1}, π)$ :

\hat{ℓ} (α; θ_{1}, π) \overset{def}{=} \sum_{z \in Z} a_{α} (z) log \frac{p_{η (θ_{1}, θ_{2} = 0, z)} (x) p_{π} (z)}{a_{α} (z)} = \sum_{i < j}^{n} \sum_{k = 1}^{K} \sum_{l = 1}^{K} α_{i, k} α_{j, l} log p_{η (θ_{1}, θ_{2} = 0, z_{i, k} = 1, z_{j, l} = 1, z_{- i, j})} (x_{i, j}) + \sum_{i = 1}^{n} \sum_{k = 1}^{K} α_{i, k} (log π_{k} - log α_{i, k}),

where $p_{η (θ_{1}, θ_{2} = 0, z_{i, k} = 1, z_{j, l} = 1, z_{- i, j})} (x_{i, j})$ denotes the marginal probability mass function of $X_{i, j}$ and $z_{- i, j}$ denotes the block memberships of all nodes excluding nodes $i$ and $j$ .
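The lower bound can be compared with the exact loglikelihood on a toy example where the sum over all $K^{n}$ block structures is still feasible. The sketch below assumes a stochastic block model parameterized directly by edge probabilities `probs[k][l]` (the logistic transform of the natural parameters), a simplification made here for readability; all names are illustrative.

```python
import math
from itertools import product

def log_bern(x, p):
    """log P(X_ij = x) for a Bernoulli(p) edge variable."""
    return math.log(p) if x == 1 else math.log(1.0 - p)

def exact_loglik(x, probs, pi):
    """log L(theta_1, pi): brute-force sum over all K^n block structures."""
    n, K = len(x), len(pi)
    total = 0.0
    for z in product(range(K), repeat=n):
        logp = sum(math.log(pi[k]) for k in z)
        for i in range(n):
            for j in range(i + 1, n):
                logp += log_bern(x[i][j], probs[z[i]][z[j]])
        total += math.exp(logp)
    return math.log(total)

def lower_bound(x, probs, pi, alpha):
    """Variational lower bound with independent block memberships;
    alpha[i][k] must be strictly positive."""
    n, K = len(x), len(pi)
    val = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(K):
                for l in range(K):
                    val += alpha[i][k] * alpha[j][l] * log_bern(x[i][j], probs[k][l])
    for i in range(n):
        for k in range(K):
            val += alpha[i][k] * (math.log(pi[k]) - math.log(alpha[i][k]))
    return val

# Toy graph with two obvious blocks {0, 1} and {2, 3}.
x = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
probs = [[0.8, 0.1], [0.1, 0.8]]
pi = [0.5, 0.5]
alpha = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
print(lower_bound(x, probs, pi, alpha), exact_loglik(x, probs, pi))
```

The brute-force sum is only feasible for very small $n$ , which is the motivation for working with the tractable lower bound instead.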

In practice, we obtain the best lower bound on $ℓ (θ_{1}, π)$ by maximizing $\hat{ℓ} (α; θ_{1}, π)$ with respect to $α$ . Direct maximization of $\hat{ℓ} (α; θ_{1}, π)$ with respect to $α$ is possible but inconvenient, because $\hat{ℓ} (α; θ_{1}, π)$ contains products of $α_{i, k}$ and $α_{j, l}$ . As a consequence, a fixed-point update of $α_{i, k}$ depends on $(n - 1) K$ other terms $α_{j, l}$ and hence fixed-point updates tend to be time-consuming and get stuck in local maxima, as demonstrated by Vu et al. (2013). Vu et al. (2013) proposed an elegant approach to alleviating the problem by using minorization–maximization (MM) methods (Hunter and Lange, 2004). Such methods construct a minorizing function that approximates $\hat{ℓ} (α; θ_{1}, π)$ but is easier to maximize than $\hat{ℓ} (α; θ_{1}, π)$ . A function $M (α; θ_{1}, π, α^{(t)})$ of $α$ minorizes $\hat{ℓ} (α; θ_{1}, π)$ at point $α^{(t)}$ at iteration $t$ of an iterative algorithm for maximizing $\hat{ℓ} (α; θ_{1}, π)$ if

M (α; θ_{1}, π, α^{(t)}) \leq \hat{ℓ} (α; θ_{1}, π) for all α,

M (α^{(t)}; θ_{1}, π, α^{(t)}) = \hat{ℓ} (α^{(t)}; θ_{1}, π),

where $θ_{1}, π, α^{(t)}$ are fixed. In other words, $M (α; θ_{1}, π, α^{(t)})$ is bounded above by $\hat{ℓ} (α; θ_{1}, π)$ for all $α$ and touches $\hat{ℓ} (α; θ_{1}, π)$ at $α = α^{(t)}$ . As a result, increasing $M (α; θ_{1}, π, α^{(t)})$ with respect to $α$ increases $\hat{ℓ} (α; θ_{1}, π)$ . Vu et al. (2013) showed that the following function minorizes $\hat{ℓ} (α; θ_{1}, π)$ at point $α^{(t)}$ :

M (α; θ_{1}, π, α^{(t)}) = \sum_{i < j}^{n} \sum_{k = 1}^{K} \sum_{l = 1}^{K} (α_{i, k}^{2} \frac{α_{j, l}^{(t)}}{2 α_{i, k}^{(t)}} + α_{j, l}^{2} \frac{α_{i, k}^{(t)}}{2 α_{j, l}^{(t)}}) log p_{η (θ_{1}, θ_{2} = 0, z_{i, k} = 1, z_{j, l} = 1, z_{- i, j})} (x_{i, j}) + \sum_{i = 1}^{n} \sum_{k = 1}^{K} α_{i, k} [log π_{k}^{(t)} - log α_{i, k}^{(t)} + (1 - \frac{α_{i, k}}{α_{i, k}^{(t)}})] .

The minorizing function $M (α; θ_{1}, π, α^{(t)})$ is easier to maximize than $\hat{ℓ} (α; θ_{1}, π)$ , because it replaces the products of $α_{i, k}$ and $α_{j, l}$ by sums of $α_{i, k}^{2}$ and $α_{j, l}^{2}$ . An additional advantage is that the maximization of $M (α; θ_{1}, π, α^{(t)})$ amounts to $n$ quadratic programming problems, which can be solved in parallel.
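The two defining properties of the minorizing function can be checked numerically. The sketch below again assumes a stochastic block model parameterized by edge probabilities `probs[k][l]` and keeps $π$ fixed in both functions; these are simplifications for illustration, and the function names are hypothetical.

```python
import math
import random

def log_bern(x, p):
    return math.log(p) if x == 1 else math.log(1.0 - p)

def lower_bound(x, probs, pi, alpha):
    """Variational lower bound on the loglikelihood."""
    n, K = len(x), len(pi)
    val = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(K):
                for l in range(K):
                    val += alpha[i][k] * alpha[j][l] * log_bern(x[i][j], probs[k][l])
        for k in range(K):
            val += alpha[i][k] * (math.log(pi[k]) - math.log(alpha[i][k]))
    return val

def minorizer(x, probs, pi, alpha, alpha_t):
    """Minorizing function: products alpha_ik * alpha_jl are replaced by
    weighted sums of squares, and -log(alpha_ik) is linearized at alpha_t."""
    n, K = len(x), len(pi)
    val = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(K):
                for l in range(K):
                    w = (alpha[i][k] ** 2 * alpha_t[j][l] / (2.0 * alpha_t[i][k])
                         + alpha[j][l] ** 2 * alpha_t[i][k] / (2.0 * alpha_t[j][l]))
                    val += w * log_bern(x[i][j], probs[k][l])
        for k in range(K):
            val += alpha[i][k] * (math.log(pi[k]) - math.log(alpha_t[i][k])
                                  + 1.0 - alpha[i][k] / alpha_t[i][k])
    return val

def random_alpha(n, K, rng):
    """Random rows of strictly positive probabilities summing to 1."""
    rows = []
    for _ in range(n):
        w = [rng.uniform(0.05, 1.0) for _ in range(K)]
        rows.append([v / sum(w) for v in w])
    return rows

rng = random.Random(0)
x = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
probs, pi = [[0.8, 0.1], [0.1, 0.8]], [0.5, 0.5]
alpha_t = random_alpha(4, 2, rng)
alpha = random_alpha(4, 2, rng)
print(minorizer(x, probs, pi, alpha, alpha_t) <= lower_bound(x, probs, pi, alpha))
```

Because each $α_{i}$ enters the minorizing function only through its own squares and linear terms, the function separates across nodes, which is what makes the node-wise updates independent of each other.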

We therefore propose a two-step estimation approach as described in Table 1. We discuss the two steps below and conclude with some comments on parallel computing.

Table 1.

Two-step estimation approach.


Step 1.

The first step estimates $z$ based on $α$ . We do so by increasing $M (α; θ_{1}, π, α^{(t)})$ with respect to $α_{i}$ subject to the constraints $0 < α_{i, k} < 1$ ( $k = 1, \dots, K$ ) and $\sum_{k = 1}^{K} α_{i, k} = 1$ ( $i = 1, \dots, n$ ). We increase rather than maximize $M (α; θ_{1}, π, α^{(t)})$ , because maximizing $M (α; θ_{1}, π, α^{(t)})$ is more time-consuming and algorithms maximizing $M (α; θ_{1}, π, α^{(t)})$ are more prone to end up in local maxima than algorithms increasing $M (α; θ_{1}, π, α^{(t)})$ . Since $\hat{ℓ} (α; θ_{1}, π)$ and $M (α; θ_{1}, π, α^{(t)})$ depend on $θ_{1}$ and $π$ and both are unknown, we iterate between updates of $α$ and updates of $θ_{1}$ and $π$ . The updates of $θ_{1}$ and $π$ are based on maximizing $\hat{ℓ} (α; θ_{1}, π)$ with respect to $θ_{1}$ and $π$ and are identical to the updates of Vu et al. (2013), because $θ_{2} = 0$ reduces the model to a stochastic block model. As a convergence criterion, we use

\frac{| \hat{ℓ} (α^{(t + 1)}; θ_{1}^{(t + 1)}, π^{(t + 1)}) - \hat{ℓ} (α^{(t)}; θ_{1}^{(t)}, π^{(t)}) |}{| \hat{ℓ} (α^{(t + 1)}; θ_{1}^{(t + 1)}, π^{(t + 1)}) |} < γ,

where $γ > 0$ is a small constant. Upon convergence, we estimate the block membership indicators $\hat{z}$ by computing $k = {arg max}_{1 \leq l \leq K} {\hat{α}}_{i, l}$ and setting ${\hat{z}}_{i, k} = 1$ and ${\hat{z}}_{i, l} = 0$ for all $l \neq k$ ( $i = 1, \dots, n$ ), where $\hat{α}$ denotes the final value of $α$ .
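The stopping rule and the final hard assignment can be sketched in a few lines. One assumption is made explicit here: the relative change is divided by the absolute value of the lower bound, because the lower bound is bounded above by a log-probability and is therefore non-positive. The function names and the default $γ$ are illustrative.

```python
def has_converged(bound_new, bound_old, gamma=1e-6):
    """Relative change of the lower bound between iterations t and t + 1.
    The denominator is taken in absolute value (assumption, see lead-in)."""
    return abs(bound_new - bound_old) / abs(bound_new) < gamma

def hard_assignments(alpha_hat):
    """Assign each node i to the block k maximizing alpha_hat[i][k];
    ties are resolved in favor of the smallest block index."""
    return [max(range(len(row)), key=row.__getitem__) for row in alpha_hat]

alpha_hat = [[0.2, 0.8], [0.6, 0.4], [0.05, 0.95]]
print(hard_assignments(alpha_hat))
```

The result is a vector of block labels, which is equivalent to the indicator representation ${\hat{z}}_{i, k}$ used in the text.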

Step 2.

We estimate $θ$ given $\hat{z}$ by Monte Carlo maximum likelihood methods (Hunter and Handcock, 2006) or maximum pseudolikelihood methods (Strauss and Ikeda, 1990). Monte Carlo maximum likelihood methods exploit the fact that the loglikelihood function of $θ$ given $\hat{z}$ , defined by

ℓ_{\hat{z}} (θ) = log p_{η (θ, \hat{z})} (x) - log p_{η (θ_{0}, \hat{z})} (x),

can be written as

ℓ_{\hat{z}} (θ) = 〈 η (θ, \hat{z}) - η (θ_{0}, \hat{z}), s (x) 〉 - log E_{η (θ_{0}, \hat{z})} exp (〈 η (θ, \hat{z}) - η (θ_{0}, \hat{z}), s (X) 〉),

where $θ_{0}$ is a fixed parameter vector (e.g., $θ_{0}$ may be an educated guess of the data-generating parameter vector). In general, the expectation $E_{η (θ_{0}, \hat{z})}$ is intractable, but it can be estimated by a Monte Carlo sample average based on a Monte Carlo sample of graphs generated under $η (θ_{0}, \hat{z})$ . Therefore, we can approximate $ℓ_{\hat{z}} (θ)$ by

{\hat{ℓ}}_{\hat{z}} (θ) = 〈 η (θ, \hat{z}) - η (θ_{0}, \hat{z}), s (x) 〉 - log {\hat{E}}_{η (θ_{0}, \hat{z})} exp (〈 η (θ, \hat{z}) - η (θ_{0}, \hat{z}), s (X) 〉),

where ${\hat{E}}_{η (θ_{0}, \hat{z})}$ is a Monte Carlo approximation of $E_{η (θ_{0}, \hat{z})}$ based on a Monte Carlo sample of graphs generated by using $η (θ_{0}, \hat{z})$ . Hence $θ$ given $\hat{z}$ can be estimated by

\hat{θ} = \underset{θ \in Θ}{arg max} {\hat{ℓ}}_{\hat{z}} (θ) .

Additional details on Monte Carlo maximum likelihood methods can be found in Hunter and Handcock (2006). We note that the local dependence of the model facilitates parallel computing, which is discussed at the end of this section. Standard errors of $\hat{θ}$ can be based on the estimated Fisher information matrix, although such standard errors are conditional on the estimated block structure $\hat{z}$ and therefore do not reflect the uncertainty about $\hat{z}$ . A parametric bootstrap approach would be an interesting alternative for capturing the additional uncertainty due to $\hat{z}$ , but would be time-consuming.
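The Monte Carlo approximation can be illustrated on a model where the exact answer is known. The sketch below uses a Bernoulli random graph with a single edge parameter, chosen here only because its normalizing constant is available in closed form; it is not the model of the paper, and all names are hypothetical. Under this toy model the sufficient statistic is the number of edges, so graphs generated under $θ_{0}$ can be summarized by binomial edge counts.

```python
import math
import random

def log_mean_exp(values):
    """Numerically stable log of the sample average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def sample_edge_counts(n_pairs, theta0, size, rng):
    """Edge counts s(X) of `size` Bernoulli graphs generated under theta0."""
    p0 = 1.0 / (1.0 + math.exp(-theta0))
    return [sum(rng.random() < p0 for _ in range(n_pairs)) for _ in range(size)]

def mc_loglik_ratio(theta, theta0, s_obs, s_samples):
    """Monte Carlo approximation of log p_theta(x) - log p_theta0(x)."""
    diff = theta - theta0
    return diff * s_obs - log_mean_exp([diff * s for s in s_samples])

def exact_loglik_ratio(theta, theta0, s_obs, n_pairs):
    """Closed form, available only because edges are independent here."""
    return (theta - theta0) * s_obs - n_pairs * (
        math.log1p(math.exp(theta)) - math.log1p(math.exp(theta0)))

rng = random.Random(1)
n_pairs = math.comb(10, 2)          # 10 nodes, 45 possible edges
samples = sample_edge_counts(n_pairs, -1.0, 5000, rng)
print(mc_loglik_ratio(-0.8, -1.0, 12, samples),
      exact_loglik_ratio(-0.8, -1.0, 12, n_pairs))
```

The approximation deteriorates when $θ$ is far from $θ_{0}$ , because the sample average is then dominated by a few graphs; this is one reason why an educated guess for $θ_{0}$ is helpful.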

An alternative is to estimate $θ$ given $\hat{z}$ by maximum pseudolikelihood estimators (Strauss and Ikeda, 1990). Maximum pseudolikelihood estimators were invented by Besag (1974) to sidestep intractable likelihood functions of Markov random fields in spatial statistics, and are known to be consistent estimators of Markov random fields with exponential-family parameterizations (e.g., Comets, 1992). Strauss and Ikeda (1990) adapted them to exponential-family random graph models. While believed to be inferior to maximum likelihood estimators when the dependence among edges is strong and propagates throughout the random graph (e.g., van Duijn et al., 2009), maximum pseudolikelihood estimators seem to perform well when the dependence is weak, e.g., when the dependence among edges is confined to non-overlapping or overlapping blocks (Stewart and Schweinberger, 2020). The maximum pseudolikelihood estimator is defined as

\tilde{θ} = \underset{θ \in Θ}{arg max} {\tilde{ℓ}}_{\hat{z}} (θ),

where

{\tilde{ℓ}}_{\hat{z}} (θ) = \sum_{i < j}^{n} log p_{η (θ, \hat{z})} (x_{i, j} ∣ x_{- i, j}) .

Here, $x_{- i, j}$ denotes $x$ excluding $x_{i, j}$ , and $p_{η (θ, \hat{z})} (x_{i, j} ∣ x_{- i, j})$ denotes the conditional probability of $X_{i, j} = x_{i, j}$ given $X_{- i, j} = x_{- i, j}$ . Since the conditional distributions of $X_{i, j}$ given the rest of the random graph are Bernoulli distributions, computing maximum pseudolikelihood estimators consumes less time than computing Monte Carlo maximum likelihood estimators.
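For models with canonical exponential-family parameterizations, the Bernoulli conditionals have log odds $〈 θ, Δ_{i, j} s (x) 〉$ , where $Δ_{i, j} s (x)$ is the change in the sufficient statistics when $x_{i, j}$ is switched from $0$ to $1$ . The sketch below evaluates the pseudo-loglikelihood for a single block by brute-force change statistics. The second statistic, counting edges whose endpoints share at least one neighbor, is used here as a stand-in for a transitive edge term; the definition used in the paper is the one given in Section 2. All function names are illustrative.

```python
import math
from itertools import combinations

def edge_count(x):
    return sum(x[i][j] for i, j in combinations(range(len(x)), 2))

def transitive_edge_count(x):
    """Edges (i, j) whose endpoints share at least one neighbor (stand-in statistic)."""
    n = len(x)
    return sum(
        1 for i, j in combinations(range(n), 2)
        if x[i][j] and any(x[i][h] and x[j][h] for h in range(n) if h not in (i, j))
    )

def change_statistics(x, i, j, stats):
    """s(x with x_ij = 1) - s(x with x_ij = 0); x is restored before returning."""
    old = x[i][j]
    x[i][j] = x[j][i] = 1
    plus = [s(x) for s in stats]
    x[i][j] = x[j][i] = 0
    minus = [s(x) for s in stats]
    x[i][j] = x[j][i] = old
    return [a - b for a, b in zip(plus, minus)]

def pseudo_loglik(theta, x, stats):
    """Sum over pairs of log p(x_ij | rest), each a Bernoulli log-probability."""
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        delta = change_statistics(x, i, j, stats)
        eta = sum(t * d for t, d in zip(theta, delta))
        total += x[i][j] * eta - math.log1p(math.exp(eta))
    return total

# Triangle on nodes 0, 1, 2 plus an isolated node 3.
x = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
print(pseudo_loglik([-1.5, 0.5], x, [edge_count, transitive_edge_count]))
```

Maximizing this function amounts to a logistic regression of the edge indicators on the change statistics, so standard logistic regression software can be used in place of the brute-force evaluation shown here.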

Parallel computing.

In Step 1, the maximization of the minorizing function amounts to $n$ quadratic programming problems, which can be solved in parallel. In Step 2, the local dependence induced by the model implies that the contributions of the between- and within-block subgraphs to the loglikelihood function and its gradient and Hessian can be computed in parallel. Hence both steps can be implemented in parallel, which suggests that the two-step estimation approach can be applied to large networks as long as the blocks are not too large and multi-core computers or computing clusters are available.
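The block-wise decomposition in Step 2 can be sketched as follows: each block contributes its own within-block quantity, and the contributions are computed by independent workers and then collected. The example computes within-block edge counts as a simple placeholder for the block-wise loglikelihood, gradient, and Hessian contributions; the function names are hypothetical. A thread pool is used to keep the sketch short. For CPU-bound work in CPython, a process pool or a computing cluster would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def block_members(z_hat):
    """Map each block label to the list of its nodes."""
    blocks = {}
    for node, k in enumerate(z_hat):
        blocks.setdefault(k, []).append(node)
    return blocks

def within_block_edges(task):
    """Contribution of one block: the number of edges among its nodes."""
    x, nodes = task
    return sum(x[i][j] for i, j in combinations(nodes, 2))

def parallel_within_block_edges(x, z_hat, workers=4):
    """Compute the block-wise contributions in parallel, one task per block."""
    blocks = block_members(z_hat)
    tasks = [(x, nodes) for nodes in blocks.values()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(within_block_edges, tasks))
    return dict(zip(blocks.keys(), counts))

x = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
print(parallel_within_block_edges(x, [0, 0, 0, 1]))
```

The cost of each task depends only on the size of its block, which is why the approach scales as long as the blocks are not too large.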

5. Simulation results

We assess the performance of the two-step estimation approach by conducting multiple simulation studies, with the number of nodes $n$ ranging from 30 to 10,000, the number of blocks $K$ ranging from 3 to 100, and the sizes of blocks ranging from 5 to 100:

I.
A small-scale simulation study to compare the block recovery of the gold standard, the Bayesian approach of Schweinberger and Handcock (2015), to the two-step estimation approach. Since the Bayesian approach is too time-consuming to be applied to large networks, we use small networks with $n = 30$ nodes and $K = 3$ blocks. The $K = 3$ blocks consist of 10 nodes (balanced case) or 5, 10, 15 nodes (unbalanced case).
II.
A large-scale simulation study to assess the block recovery in Step 1 of the two-step estimation approach, using $K = 25, 50, 75, 100$ blocks of size $25, 50, 75, 100$ .
III.
A large-scale simulation study to assess how the block recovery in Step 1 of the two-step estimation approach is affected by the sparsity of between-block subgraphs, using $K = 25$ blocks of size $25$ .
IV.
A large-scale simulation study to assess the parameter recovery in Step 2 of the two-step estimation approach, using $K = 25, 50, 75, 100$ blocks of size $25, 50, 75, 100$ .

In Step 1 of the two-step estimation approach, we use the variational approach described in Section 4. We compare the variational approach to the spectral clustering method described in Lei and Rinaldo (2015). Spectral clustering is an alternative to the variational approach in Step 1 of the two-step estimation approach. In Step 2 of the two-step estimation approach, we use maximum pseudolikelihood estimators, which are more scalable than Monte Carlo maximum likelihood estimators and facilitate simulation studies with up to 10,000 nodes. In each scenario, we generate 500 graphs from the model having between-block edge terms and within-block edge and transitive edge terms, as described in Section 2. To select sensible values of the parameter vector $θ$ , note that the probabilistic behavior of random graphs with local dependence is sensitive to the choice of parameter values, and so is the block recovery. The same applies to stochastic block models: e.g., when the probabilities of edges within and between blocks are too low, we may not be able to recover the block structure with high probability (Zhang and Zhou, 2016, Gao et al., 2017). We have therefore selected the parameter vector $θ$ based on the following considerations: We would like to ensure that, with high probability, the model generates graphs that resemble real-world networks in terms of sufficient statistics $s (X)$ . In principle, we could select $θ$ by inspecting the expectation of $s (X)$ . The problem is that the expectation is not available in closed form. We have therefore selected $θ$ based on simulating graphs, and checking whether the simulated graphs resemble real-world networks.

In simulation study I, to allow blocks of different sizes to have different parameters, we use size-dependent between-block edge parameters $θ_{B, k, l} = - . 882 log n$ ( $l < k = 1, \dots, K$ ) and within-block edge and transitive edge parameters $θ_{W, k, k, 1} = - . 434 log n_{k} (z)$ and $θ_{W, k, k, 2} = . 217 log n_{k} (z)$ ( $k = 1, \dots, K$ ), where $n_{k} (z)$ is the size of block $k$ under $z \in Z$ . The size-dependent parameterization is motivated by the sparsity of random graphs: e.g., if edges $X_{i, j}$ are independent Bernoulli $(μ)$ random variables, it is tempting to believe that there exist constants $c > 0$ and $0 < α < 1$ such that the expected number of edges of each node $(n - 1) μ$ is bounded above by $c n^{α}$ , because real-world networks are sparse. As a consequence, $μ$ should be of order $n^{θ}$ and $η = logit (μ)$ should be of order $log n^{θ} = θ log n$ , where $θ = α - 1 < 0$ . In more general models with edge terms as well as other model terms, all model terms should scale as the edge term, so that no model term can dominate any other model term. These considerations suggest that parameters should scale with the log number of nodes. In simulation studies II and IV, the sizes of blocks are identical, so we use between-block edge parameters $θ_{B, k, l} = . 5 - log n$ ( $l < k = 1, \dots, K$ ) and within-block edge and transitive edge parameters $θ_{W, k, k, 1} = - 1.5$ and $θ_{W, k, k, 2} = . 5$ ( $k = 1, \dots, K$ ). In simulation study III, we use the same within-block parameters, but between-block edge parameters $θ_{B, k, l} = . 5 - α log n$ ( $l < k = 1, \dots, K$ ), with $α$ varying from $. 5$ to $1$ .
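The effect of the size-dependent parameterization can be illustrated for independent Bernoulli edges with natural parameter $θ log n$ . The sketch below, with hypothetical function names, shows that the expected degree stays below $n^{α}$ when $θ = α - 1$ , and evaluates the parameters of simulation study I at $n = 30$ and $n_{k} (z) = 10$ .

```python
import math

def edge_prob(theta, n):
    """mu = logistic(theta * log n), which equals n^theta / (1 + n^theta)."""
    return 1.0 / (1.0 + math.exp(-theta * math.log(n)))

def expected_degree(theta, n):
    """Expected number of edges of each node, (n - 1) * mu."""
    return (n - 1) * edge_prob(theta, n)

alpha = 0.5
for n in (100, 10_000, 1_000_000):
    # Expected degree under theta = alpha - 1, compared with n^alpha.
    print(n, expected_degree(alpha - 1.0, n), n ** alpha)

# Size-dependent parameters of simulation study I on the natural scale.
print(-0.882 * math.log(30), -0.434 * math.log(10), 0.217 * math.log(10))
```

By this arithmetic, the parameters of simulation study I are approximately $- 3$ , $- 1$ , and $. 5$ on the natural scale when $n = 30$ and the block size is $10$ .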

In all scenarios, we assess block recovery by using Yule’s $ϕ$ -coefficient:

ϕ (z^{⋆}, z) = \frac{n_{0, 0} n_{1, 1} - n_{0, 1} n_{1, 0}}{\sqrt{(n_{0, 0} + n_{0, 1}) (n_{1, 0} + n_{1, 1}) (n_{0, 0} + n_{1, 0}) (n_{0, 1} + n_{1, 1})}},

where

n_{a, b} \equiv n_{a, b} (z^{⋆}, z) = \sum_{i < j}^{n} 1 (1 (z_{i}^{⋆} = z_{j}^{⋆}) = a) 1 (1 (z_{i} = z_{j}) = b), a, b \in \{0, 1\} .

Here, $1 (.)$ is an indicator function, which is $1$ if its argument is true and is $0$ otherwise. The quantity $n_{0, 0} (z^{⋆}, z)$ is the number of pairs of nodes that are assigned to distinct blocks under both $z^{⋆}$ and $z$ , while the quantity $n_{1, 1} (z^{⋆}, z)$ is the number of pairs of nodes that are assigned to the same block under both $z^{⋆}$ and $z$ . The sum of the other two quantities, $n_{0, 1} (z^{⋆}, z) + n_{1, 0} (z^{⋆}, z)$ , is the number of pairs of nodes on which $z^{⋆}$ and $z$ disagree. If $z^{⋆}$ and $z$ agree on all pairs of nodes, Yule’s $ϕ$ -coefficient is $1$ . In fact, Yule’s $ϕ$ -coefficient is bounded above by $1$ , and is invariant to the labeling of the blocks.
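Yule's $ϕ$ -coefficient can be computed directly from the definition above. The following sketch takes block structures as vectors of labels; the function name is hypothetical, and returning `nan` when the denominator vanishes (e.g., when one of the block structures places all nodes in a single block) is a choice made here, not part of the paper.

```python
import math
from itertools import combinations

def yule_phi(z_true, z_est):
    """Yule's phi-coefficient between two block structures given as label vectors."""
    n = [[0, 0], [0, 0]]  # n[a][b] as defined in the text
    for i, j in combinations(range(len(z_true)), 2):
        a = int(z_true[i] == z_true[j])
        b = int(z_est[i] == z_est[j])
        n[a][b] += 1
    numerator = n[0][0] * n[1][1] - n[0][1] * n[1][0]
    denominator = math.sqrt((n[0][0] + n[0][1]) * (n[1][0] + n[1][1])
                            * (n[0][0] + n[1][0]) * (n[0][1] + n[1][1]))
    return numerator / denominator if denominator > 0 else float("nan")

print(yule_phi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels swapped
print(yule_phi([0, 0, 1, 1], [0, 1, 0, 1]))  # partitions disagree on most pairs
```

Because the coefficient depends on the labels only through the pairwise comparisons $z_{i} = z_{j}$ , swapping the labels of the blocks leaves it unchanged. The double loop over pairs takes of order $n^{2}$ operations, which is acceptable for the network sizes considered here but could be replaced by a contingency-table computation for larger networks.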

The results of the small-scale simulation study I are shown in Fig. 1. The results suggest that the two-step estimation approach is almost as good as the Bayesian approach in terms of block recovery in the balanced case ( $K = 3$ blocks of size $10$ ), but is worse in the unbalanced case ( $K = 3$ blocks of size $5, 10, 15$ ). The worse performance in the unbalanced case may be due to the fact that there are smaller blocks in the unbalanced case than in the balanced case, and recovering small blocks is more challenging than recovering large blocks. That said, the advantage of the Bayesian approach over the two-step estimation approach in terms of block recovery in the unbalanced case is limited, and comes at excessive costs: Table 2 reveals that the computing time of the Bayesian approach is more than a thousand times higher than the computing time of the two-step estimation approach with the variational approach in Step 1 (42,579.6 versus 23.83 seconds in the balanced case and 32,480.9 versus 24.00 seconds in the unbalanced case).

Table 2.

Average computing time in seconds of the Bayesian approach, the variational approach, and spectral clustering. 500 graphs with $n = 30$ nodes and $K = 3$ blocks with 10 nodes (balanced case) or 5, 10, 15 nodes (unbalanced case) were generated. The Bayesian approach is the gold standard. The variational approach is the default in Step 1 of the two-step estimation approach. Spectral clustering is an alternative to the variational approach in Step 1. In Step 2 of the two-step estimation approach, maximum pseudolikelihood estimates are computed. The computing times of the variational approach and spectral clustering mentioned above are the total computing times, including both Step 1 and Step 2.

Scenario	Bayesian approach	Variational approach	Spectral clustering
$n = 30$ , $K = 3$ , balanced	42,579.6	23.83	.18
$n = 30$ , $K = 3$ , unbalanced	32,480.9	24.00	.18


The large-scale simulation studies II, III, and IV shed light on the performance of Steps 1 and 2 of the two-step estimation approach in large networks. The results of simulation study II, shown in Fig. 2, reveal that the variational approach used in Step 1 of the two-step estimation approach outperforms spectral clustering in terms of block recovery in most scenarios. Simulation study III shows how the block recovery in Step 1 of the two-step estimation approach is affected by between-block subgraph sparsity. According to Fig. 3, the more sparse between-block subgraphs are, the more the dense within-block subgraphs “stand out”, improving block recovery. Last, but not least, simulation study IV helps assess parameter recovery in Step 2 of the two-step estimation approach. Fig. 4 shows that maximum pseudolikelihood estimates are close to the data-generating parameters, and more so when the blocks are small and the number of blocks is large.

Fig. 2 — Block recovery in terms of Yule’s $ϕ$ -coefficient. $500$ graphs with $K = 25, 50, 75, 100$ blocks of size $25, 50, 75, 100$ were generated from the model with between-block edge parameters $θ_{B, k, l} = . 5 - log n$ ( $l < k = 1, \dots, K$ ) and within-block edge and transitive edge parameters $θ_{W, k, k, 1} = - 1.5$ and $θ_{W, k, k, 2} = . 5$ ( $k = 1, \dots, K$ ). The figure compares two alternative approaches to recovering the block structure in Step 1 of the two-step estimation approach, the variational approach and spectral clustering.

Fig. 3 — Block recovery in terms of Yule’s $ϕ$ -coefficient as a function of between-block subgraph sparsity. $500$ graphs with $K = 25$ blocks of size $25$ were generated from the model with between-block edge parameters $θ_{B, k, l} = . 5 - α log n$ ( $l < k = 1, \dots, K$ ) and within-block edge and transitive edge parameters $θ_{W, k, k, 1} = - 1.5$ and $θ_{W, k, k, 2} = . 5$ ( $k = 1, \dots, K$ ), with $α$ varying from $. 5$ to $1$ . The figure demonstrates that the more sparse between-block subgraphs are, the more the dense within-block subgraphs “stand out”, improving block recovery.

Fig. 4 — Maximum pseudolikelihood estimates of within-block parameters. $500$ graphs with $K = 25, 50, 75, 100$ blocks of size $25, 50, 75, 100$ were generated from the model with between-block edge parameters $θ_{B, k, l} = . 5 - log n$ ( $l < k = 1, \dots, K$ ) and within-block edge and transitive edge parameters $θ_{W, k, k, 1} = - 1.5$ and $θ_{W, k, k, 2} = . 5$ ( $k = 1, \dots, K$ ). The horizontal axis corresponds to $θ_{W, k, k, 1} = - 1.5$ , while the vertical axis corresponds to $θ_{W, k, k, 2} = . 5$ ( $k = 1, \dots, K$ ). The red circle in each plot indicates the data-generating within-block parameter vector $(θ_{W, k, k, 1}, θ_{W, k, k, 2}) = (- 1.5, . 5)$ .

6. Amazon product recommendation network

We use the two-step estimation approach to shed light on the structure of a large Amazon product recommendation network that is not captured by stochastic block models. The data on the Amazon product recommendation network were collected by Yang and Leskovec (2015) and can be downloaded from the website

http://snap.stanford.edu/data/com-Amazon.html

The network consists of products listed at www.amazon.com. Two products $i$ and $j$ are connected by an edge if $i$ and $j$ are frequently purchased together according to the “Customers Who Bought This Item Also Bought” feature at www.amazon.com. Amazon assigns all products to categories, which we consider to be ground-truth blocks. We use a subset of the network consisting of the top 500 non-overlapping categories with 10 to 80 products, where the ranking of categories is based on Yang and Leskovec (2015). The resulting network consists of 10,448 products and 33,537 edges.
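The selection of blocks described above can be sketched as follows. This is one plausible reading of the selection rule, not the authors' code: it assumes the categories are supplied in ranked order as lists of product identifiers, and it interprets "non-overlapping" greedily, skipping any category that shares a product with a category already selected. The function name `select_blocks` is hypothetical.

```python
def select_blocks(ranked_categories, min_size=10, max_size=80, top=500):
    """Pick the first `top` categories, in ranked order, that have between
    min_size and max_size products and share no product with a category
    selected earlier (greedy reading of "non-overlapping"; an assumption)."""
    selected = []
    used = set()
    for category in ranked_categories:
        members = set(category)
        if not (min_size <= len(members) <= max_size):
            continue  # category too small or too large
        if members & used:
            continue  # overlaps a previously selected category
        selected.append(members)
        used |= members
        if len(selected) == top:
            break
    return selected


if __name__ == "__main__":
    # Toy example with small size bounds; the identifiers are made up.
    toy = [[1, 2, 3], [3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
    print(select_blocks(toy, min_size=2, max_size=3, top=2))
```

The network analyzed in the text would then be the subgraph induced by the union of the selected categories.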

To model the Amazon product recommendation network, we take advantage of curved exponential-family random graph models with within-block edge and geometrically weighted degree and edgewise shared partner terms (Snijders et al., 2006; Hunter and Handcock, 2006; Hunter et al., 2008). The resulting models are more general than the motivating example used in Sections 2 and 3 (the model with between-block edge terms and within-block edge and transitive edge terms) and ensure that, for each pair of products, the added value of additional edges and triangles within blocks decays. In fact, transitive edge terms are special cases of geometrically weighted edgewise shared partner terms, and both of them are well-behaved alternatives to the ill-behaved triangle terms mentioned in Section 2. A full-fledged discussion of those models is beyond the scope of our paper. We refer the interested reader to the seminal papers of Snijders et al. (2006), Hunter and Handcock (2006), and Hunter et al. (2008).

The natural parameters of the within-block edge terms are given by

$η_{W,k,1}(θ, z) = θ_{1} \log n_{k}(z),$

where the logarithmic term $\log n_{k}(z)$ arises from sparsity considerations and is a simple form of a size-dependent parameterization, as explained in Section 5. The within-block geometrically weighted degree terms are based on the number of products with $t$ edges in block $A_{k}$. The natural parameters of the within-block geometrically weighted degree terms are given by

$η_{W,k,2,t}(θ, z) = θ_{2} \log n_{k}(z) \exp(θ_{3}) [1 - (1 - \exp(-θ_{3}))^{t}], \quad t = 1, \dots, n_{k}(z) - 1.$

The within-block geometrically weighted edgewise shared partner terms are based on the number of connected pairs of products $i$ and $j$ in block $A_{k}$ such that $i$ and $j$ have $t$ shared partners in block $A_{k}$. The natural parameters of the within-block geometrically weighted edgewise shared partner terms are given by

$η_{W,k,3,t}(θ, z) = θ_{4} \log n_{k}(z) \exp(θ_{5}) [1 - (1 - \exp(-θ_{5}))^{t}], \quad t = 1, \dots, n_{k}(z) - 2.$

To reduce computing time, it is convenient to truncate the two geometrically weighted model terms by setting $η_{W,k,2,t}(θ, z) = 0$ ($t = 21, \dots, n_{k}(z) - 1$) and $η_{W,k,3,t}(θ, z) = 0$ ($t = 13, \dots, n_{k}(z) - 2$). The two thresholds $21$ and $13$ are motivated by the fact that no product has $21$ or more edges and less than 1% of all pairs of products have $13$ or more edgewise shared partners. Last, but not least, the natural parameters of the between-block edge terms are given by

$η_{B,k,l}(θ, z) = θ_{6} \log n, \quad l < k.$

The resulting exponential family is a curved exponential family (Hunter and Handcock, 2006), because the natural parameter vector $η (θ, z)$ of the exponential family is a nonlinear function of $θ$ given $z \in Z$ . The natural parameter vector $η (θ, z)$ depends on the sizes of blocks, because we do not want to force small and large blocks to have the same natural parameters, as explained in Section 5.
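The size-dependent natural parameters defined above can be evaluated with a few lines of Python. The sketch below is only an illustration of the formulas, including the truncation at $t = 21$ and $t = 13$. It plugs in the estimates reported for all six parameters in Table 3, and the block size of 25 products is a made-up example. None of the function names come from hergm.

```python
import math

# Estimates of theta_1, ..., theta_6 copied from Table 3 (model with all six terms).
THETA = {1: -1.410, 2: 0.742, 3: 0.910, 4: 0.303, 5: 1.106, 6: -1.199}


def eta_within_edge(theta1, n_k):
    """Natural parameter of the within-block edge term: theta_1 * log(n_k)."""
    return theta1 * math.log(n_k)


def eta_geometric(base, decay, n_k, t, t_max):
    """Natural parameter of a geometrically weighted term at level t.

    Equals base * log(n_k) * exp(decay) * (1 - (1 - exp(-decay))**t)
    for 1 <= t <= t_max and 0 for t > t_max (truncation).
    """
    if t < 1:
        raise ValueError("t must be at least 1")
    if t > t_max:
        return 0.0
    weight = 1.0 - (1.0 - math.exp(-decay)) ** t
    return base * math.log(n_k) * math.exp(decay) * weight


def eta_between_edge(theta6, n):
    """Natural parameter of the between-block edge term: theta_6 * log(n)."""
    return theta6 * math.log(n)


if __name__ == "__main__":
    n = 10448  # number of products, from the text
    n_k = 25   # hypothetical block size, chosen for illustration only
    print("within-block edge:", eta_within_edge(THETA[1], n_k))
    print("between-block edge:", eta_between_edge(THETA[6], n))
    for t in (1, 2, 5, 12, 13):
        # Edgewise shared partner term, truncated so that levels t >= 13 are zero.
        print("shared partners, t =", t, ":",
              eta_geometric(THETA[4], THETA[5], n_k, t, t_max=12))
```

At $t = 1$ the geometrically weighted parameters reduce to $θ_{2} \log n_{k}(z)$ and $θ_{4} \log n_{k}(z)$, respectively, which gives a quick sanity check of any implementation.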

The curved exponential-family model specified above can capture an excess in the expected number of triangles within blocks relative to stochastic block models, while ensuring that, for each pair of products, the added value of additional edges and triangles within blocks decays (Stewart et al., 2019). An excess in the expected number of triangles within blocks relative to stochastic block models can arise when, e.g., (a) three products are similar (e.g., three books on the same topic); (b) three products are dissimilar but complement each other (e.g., a bicycle helmet, head light, and tail light); and (c) three products, either similar or dissimilar, were produced by the same source (e.g., three books written by the same author).
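The decay of the added value of additional shared partners can be read off the natural parameters directly. The following short derivation is a worked illustration rather than a result quoted from the references. It considers the change in the natural parameter when a connected pair of products moves from $t$ to $t + 1$ shared partners, for levels $t$ below the truncation threshold:

```latex
\begin{align*}
\eta_{W,k,3,t+1}(\theta, z) - \eta_{W,k,3,t}(\theta, z)
  &= \theta_{4} \log n_{k}(z) \exp(\theta_{5})
     \left[ (1 - \exp(-\theta_{5}))^{t} - (1 - \exp(-\theta_{5}))^{t+1} \right] \\
  &= \theta_{4} \log n_{k}(z) \exp(\theta_{5})
     (1 - \exp(-\theta_{5}))^{t} \exp(-\theta_{5}) \\
  &= \theta_{4} \log n_{k}(z) (1 - \exp(-\theta_{5}))^{t}.
\end{align*}
```

For $θ_{5} > 0$ the factor $1 - \exp(-θ_{5})$ lies strictly between 0 and 1, so the increments shrink geometrically in $t$. With the estimate $θ_{5} = 1.106$ from Table 3, each additional shared partner contributes roughly $.67$ times as much as the previous one. The same calculation applies to the geometrically weighted degree terms with $(θ_{2}, θ_{3})$ in place of $(θ_{4}, θ_{5})$.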

Since we know the number of ground-truth blocks, we set $K = 500$ and apply the two-step estimation approach, using the variational approach in Step 1 and Monte Carlo maximum likelihood estimates in Step 2. To assess the performance of the two-step estimation approach in terms of block recovery, we use Yule’s $ϕ$-coefficient. Yule’s $ϕ$-coefficient turns out to be $.964$, which indicates near-perfect recovery of the ground-truth block structure. An inspection of the estimated block structure reveals that, out of the 500 categories of products in the Amazon product recommendation network, products in 15 categories are misclassified. Some of the small categories are merged with large categories, while some unusual products of large blocks are misclassified as well. Some of the products are unusual in the sense of having few edges to other products in the same category, while others are unusual in the sense of having many edges to other products in the same category.
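For readers who wish to compute a comparable measure of block recovery, the sketch below evaluates Yule’s $ϕ$-coefficient between two block assignments. It assumes that the coefficient is computed from the $2 \times 2$ table that cross-classifies all pairs of nodes by whether they share a block under the ground truth and under the estimate. This is a common convention that is invariant to the labeling of blocks, but it is stated here as an assumption. The function name is illustrative.

```python
import math
from collections import Counter


def yule_phi(truth, estimate):
    """Yule's phi-coefficient of the 2x2 table of pairwise co-memberships.

    truth and estimate are equal-length sequences of block labels.
    Assumption: pairs of nodes are the units being cross-classified.
    """
    if len(truth) != len(estimate):
        raise ValueError("label sequences must have the same length")

    def pairs(m):
        return m * (m - 1) // 2

    total = pairs(len(truth))
    same_truth = sum(pairs(c) for c in Counter(truth).values())
    same_est = sum(pairs(c) for c in Counter(estimate).values())
    same_both = sum(pairs(c) for c in Counter(zip(truth, estimate)).values())

    n11 = same_both                                   # together in both
    n10 = same_truth - same_both                      # together in truth only
    n01 = same_est - same_both                        # together in estimate only
    n00 = total - same_truth - same_est + same_both   # apart in both

    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        raise ValueError("phi is undefined when a margin of the table is zero")
    return (n11 * n00 - n10 * n01) / math.sqrt(denom)


if __name__ == "__main__":
    # Identical partitions up to relabeling of the blocks.
    print(yule_phi([0, 0, 1, 1], [1, 1, 0, 0]))
```

Working with the counts of the cross-classified labels avoids an explicit loop over all pairs of nodes, which matters for networks with thousands of nodes.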

The Monte Carlo maximum likelihood estimates and standard errors of $θ_{1}, \dots, θ_{6}$ are shown in Table 3. The parameters $θ_{1}, \dots, θ_{6}$ of the geometrically weighted terms can be interpreted in terms of log odds of conditional probabilities of edges, given all other edges (Hunter, 2007). A worked-out example can be found in Stewart et al. (2019). Table 3 suggests that there is evidence for transitivity. The observed tendency towards transitivity has advantages in practice: It enables Amazon to recommend co-purchases of brand-new products and existing products, even when there are no data on past co-purchases of these products. For example, when a brand-new product $i$ is introduced (e.g., a novel) and product $i$ is known to be related to existing product $j$ (e.g., a novel by the same author), and product $j$ tends to be co-purchased with product $k$ (e.g., a classic novel), then Amazon could recommend co-purchases of products $i$ and $j$ as well as products $i$ and $k$, despite the fact that there are no data on past co-purchases of products $i$ and $k$ and there is no direct connection between products $i$ and $k$ (although there is an indirect connection via product $j$).

Table 3.

Monte Carlo maximum likelihood estimates and standard errors (S.E.) of $θ_{1}, \dots, θ_{6}$ estimated from the Amazon product recommendation network with 10,448 products; note that $θ = (θ_{1}, \dots, θ_{6})$ should not be confused with the size-dependent natural parameter vector $η (θ, z)$ . The parameters $θ_{1}$ and $θ_{6}$ are the base weights of the within- and between-block edge terms, respectively. The parameters $θ_{2}$ and $θ_{4}$ are the base weights of the within-block degree and edgewise shared partner terms, whereas $θ_{3}$ and $θ_{5}$ control the rate of decay of the added value of additional edges and edgewise shared partners, respectively.

Term	Estimate	S.E.	Estimate	S.E.
Within-block edges $θ_{1}$	−.369	.002	−1.410	.008
Within-block degrees $θ_{2}$ (base parameter)			.742	.020
Within-block degrees $θ_{3}$ (decay parameter)			.910	.023
Within-block shared partners $θ_{4}$ (base parameter)			.303	.005
Within-block shared partners $θ_{5}$ (decay parameter)			1.106	.012
Between-block edges $θ_{6}$	−1.199	.004	−1.199	.004


To demonstrate that the curved exponential-family random graph model considered here can capture structural features of networks that simpler models – such as stochastic block models – cannot capture, we compare the goodness-of-fit of the curved exponential-family random graph model to the goodness-of-fit of stochastic block models. Since the two models impose the same probability law on between-block subgraphs, it is natural to compare the two models in terms of goodness-of-fit with respect to within-block subgraphs. We assess the goodness-of-fit of the two models in terms of the within-block geodesic distances of pairs of products, i.e., the length of the shortest path between pairs of products in the same block; the numbers of within-block dyadwise shared partners, i.e., the number of unconnected or connected pairs of products with $i$ shared partners in the same block; the numbers of within-block edgewise shared partners, i.e., the number of connected pairs of products with $i$ shared partners in the same block; and the number of within-block transitive edges, i.e., the number of pairs of products with at least one shared partner in the same block. Figs. 5 and 6 compare the goodness-of-fit of the two models based on 1,000 graphs simulated from the estimated models. The figures suggest that the curved exponential-family random graph model considered here is superior to the stochastic block model in terms of both connectivity and transitivity.
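Two of the goodness-of-fit statistics described above, the within-block edgewise shared partner distribution and the number of within-block transitive edges, can be computed from an edge list and a block assignment as sketched below. The sketch is illustrative and does not reproduce the code used for Figs. 5 and 6. It assumes that node identifiers are comparable (e.g., integers) and that transitive edges are connected pairs of products with at least one shared partner in the same block.

```python
from collections import Counter, defaultdict


def within_block_esp(edges, block):
    """Map t -> number of within-block edges whose endpoints have exactly
    t shared partners in the same block.

    edges: iterable of (i, j) pairs; block: dict mapping node -> block label.
    """
    neighbors = defaultdict(set)
    for i, j in edges:
        # Keep only edges within blocks; between-block edges are ignored.
        if i != j and block[i] == block[j]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    counts = Counter()
    for i, nbrs in neighbors.items():
        for j in nbrs:
            if i < j:  # visit each undirected edge once
                counts[len(nbrs & neighbors[j])] += 1
    return counts


def within_block_transitive_edges(esp_counts):
    """Number of within-block edges with at least one shared partner."""
    return sum(count for t, count in esp_counts.items() if t >= 1)


if __name__ == "__main__":
    # Toy network: a triangle 1-2-3 with a pendant node 4 in block 0,
    # and node 5 in block 1 connected to node 4 by a between-block edge.
    edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
    block = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1}
    esp = within_block_esp(edges, block)
    print(dict(esp), within_block_transitive_edges(esp))
```

Because only within-block neighbors are stored, every shared partner counted by the sketch automatically belongs to the same block as the pair of products in question. Comparing these statistics for the observed network with their distribution over simulated graphs yields plots of the kind shown in Figs. 5 and 6.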

Fig. 5 — Amazon product recommendation network with 10,448 products: goodness-of-fit of curved exponential-family random graph model. The red lines indicate observed values of statistics. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 6 — Amazon product recommendation network with 10,448 products: goodness-of-fit of stochastic block models. The red lines indicate observed values of statistics. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

7. Discussion

The two-step estimation approach proposed here enables large-scale estimation of models with local dependence and unknown block structure provided that the number of blocks $K$ is known. An important direction of future research is the development of methods for selecting $K$ when $K$ is unknown. We note that even in the special case of stochastic block models, the issue of selecting $K$ has not received much attention, with the notable exception of recent work by Saldana et al. (2017), Wang and Bickel (2017), and others. Whether – and how – the developed methods can be extended to models with local dependence is an open question, but having scalable methods for selecting $K$ would doubtless be useful in practice.

The proposed methods are implemented in the R package hergm (Schweinberger and Luna, 2018). A stable and user-friendly version will be released in the near future.

Acknowledgments

We acknowledge support from the National Science Foundation (NSF), United States of America in the form of NSF awards DMS-1513644 (SB, JS, MS) and DMS-1812119 (JS, MS).

Footnotes

Appendix A. Supplementary material related to this article can be found online at https://doi.org/10.1016/j.csda.2020.107029.

Contributor Information

Sergii Babkin, Email: sergii.babkin@gmail.com.

Jonathan R. Stewart, Email: jrstewart@fsu.edu.

Xiaochen Long, Email: xl81@rice.edu.

Michael Schweinberger, Email: michael.schweinberger@rice.edu.

Appendix A. Supplementary data

The following is the Supplementary material related to this article.

MMC S1: mmc1.pdf (240.5 KB, pdf)

References

Amini A.A., Chen A., Bickel P.J., Levina E. Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 2013;41:2097–2122.
Atchade Y.F., Lartillot N., Robert C. Bayesian computation for statistical models with intractable normalizing constants. Braz. J. Probab. Stat. 2013;27:416–436.
Besag J. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B Stat. Methodol. 1974;36:192–225.
Bickel P.J., Chen A. A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Nat. Acad. Sci. 2009;106:21068–21073. doi: 10.1073/pnas.0907096106.
Bickel P.J., Chen A., Levina E. The method of moments and degree distributions for network models. Ann. Statist. 2011;39:2280–2301.
Bickel P.J., Choi D., Chang X., Zhang H. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist. 2013;41:1922–1943.
Binkiewicz N., Vogelstein J.T., Rohe K. Covariate-assisted spectral clustering. Biometrika. 2017;104:361–377. doi: 10.1093/biomet/asx008.
Brown L. Institute of Mathematical Statistics; Hayward, CA, USA: 1986. Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory.
Byshkin M., Stivala A., Mira A., Robins G., Lomi A. Fast maximum likelihood estimation via equilibrium expectation for large network data. Sci. Rep. 2018;8:2045–2322. doi: 10.1038/s41598-018-29725-8.
Caimo A., Friel N. Bayesian inference for exponential random graph models. Social Networks. 2011;33:41–55.
Celisse A., Daudin J.J., Pierre L. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 2012;6:1847–1899.
Chatterjee S., Diaconis P. Estimating and understanding exponential random graph models. Ann. Statist. 2013;41:2428–2461.
Choi D.S., Wolfe P.J., Airoldi E.M. Stochastic blockmodels with growing number of classes. Biometrika. 2012;99:273–284. doi: 10.1093/biomet/asr053.
Comets F. On consistency of a class of estimators for exponential families of Markov random fields on the lattice. Ann. Statist. 1992;20:455–468.
Daudin J.J., Picard F., Robin S. A mixture model for random graphs. Stat. Comput. 2008;18:173–183.
van Duijn M.A.J., Gile K., Handcock M.S. A framework for the comparison of maximum pseudo-likelihood and maximum likelihood estimation of exponential family random graph models. Social Networks. 2009;31:52–62. doi: 10.1016/j.socnet.2008.10.003.
Erdős P., Rényi A. On random graphs. Publ. Math. 1959;6:290–297.
Erdős P., Rényi A. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 1960;5:17–61.
Fienberg S. A brief history of statistical models for network analysis and open challenges. J. Comput. Graph. Statist. 2012;21:825–839.
Frank O., Strauss D. Markov graphs. J. Amer. Statist. Assoc. 1986;81:832–842.
Gao C., Ma Z., Zhang A.Y., Zhou H.H. Achieving optimal misclassification proportion in stochastic block models. J. Mach. Learn. Res. 2017;18:1980–2024.
Häggström O., Jonasson J. Phase transition in the random triangle model. J. Appl. Probab. 1999;36:1101–1115.
Handcock M.S. Center for Statistics and the Social Sciences, University of Washington; 2003. Assessing Degeneracy in Statistical Models of Social Networks: Tech. Rep. http://www.csss.washington.edu/Papers
Handcock M.S., Raftery A.E., Tantrum J.M. Model-based clustering for social networks. J. R. Stat. Soc. Ser. A. 2007;170:301–354.
Harris J.K. Sage; Thousand Oaks, California: 2013. An Introduction to Exponential Random Graph Modeling.
Hoff P.D. Additive and multiplicative effects network models. Stat. Sci. 2020 in press.
Hoff P.D., Raftery A.E., Handcock M.S. Latent space approaches to social network analysis. J. Amer. Statist. Assoc. 2002;97:1090–1098.
Holland P.W., Leinhardt S. A method for detecting structure in sociometric data. Am. J. Sociol. 1970;76:492–513.
Holland P.W., Leinhardt S. Some evidence on the transitivity of positive interpersonal sentiment. Am. J. Sociol. 1972;77:1205–1209.
Holland P.W., Leinhardt S. Local structure in social networks. Sociol. Methodol. 1976:1–45.
Hummel R.M., Hunter D.R., Handcock M.S. Improving simulation-based algorithms for fitting ERGMs. J. Comput. Graph. Statist. 2012;21:920–939. doi: 10.1080/10618600.2012.679224.
Hunter D.R. Curved exponential family models for social networks. Social Networks. 2007;29:216–230. doi: 10.1016/j.socnet.2006.08.005.
Hunter D.R., Goodreau S.M., Handcock M.S. Goodness of fit of social network models. J. Amer. Statist. Assoc. 2008;103:248–258.
Hunter D.R., Handcock M.S. Inference in curved exponential family models for networks. J. Comput. Graph. Statist. 2006;15:565–583.
Hunter D.R., Krivitsky P.N., Schweinberger M. Computational statistical methods for social network models. J. Comput. Graph. Statist. 2012;21:856–882. doi: 10.1080/10618600.2012.732921.
Hunter D.R., Lange K. A tutorial on MM algorithms. Amer. Statist. 2004;58:30–38.
Jin I.H., Liang F. Fitting social network models using varying truncation stochastic approximation MCMC algorithm. J. Comput. Graph. Statist. 2013;22:927–952.
Jonasson J. The random triangle model. J. Appl. Probab. 1999;36:852–876.
Kolaczyk E.D. Springer-Verlag; New York: 2009. Statistical Analysis of Network Data: Methods and Models.
Krivitsky P.N. Using contrastive divergence to seed Monte Carlo MLE for exponential-family random graph models. Comput. Statist. Data Anal. 2017;107:149–161.
Lazega E., Snijders T.A.B., editors. Multilevel Network Analysis for the Social Sciences. Springer-Verlag; Switzerland: 2016.
Lei J., Rinaldo A. Consistency of spectral clustering in stochastic block models. Ann. Statist. 2015;43:215–237.
Liang F., Jin I.H., Song Q., Liu J.S. An adaptive exchange algorithm for sampling from distributions with intractable normalizing constants. J. Amer. Statist. Assoc. 2016;111:377–393.
Lusher D., Koskinen J., Robins G. Cambridge University Press; Cambridge, UK: 2013. Exponential Random Graph Models for Social Networks.
Mele A. A structural model of dense network formation. Econometrica. 2017;85:825–850.
Nowicki K., Snijders T.A.B. Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc. 2001;96:1077–1087.
Okabayashi S., Geyer C.J. Long range search for maximum likelihood in exponential families. Electron. J. Stat. 2012;6:123–147.
Priebe C.E., Sussman D.L., Tang M., Vogelstein J.T. Statistical inference on errorfully observed graphs. J. Amer. Statist. Assoc. 2012;107:1119–1128.
Rohe K., Chatterjee S., Yu B. Spectral clustering and the high-dimensional stochastic block model. Ann. Statist. 2011;39:1878–1915.
Rohe K., Qin T., Fan H. The highest-dimensional stochastic block model with a regularized estimator. Statist. Sinica. 2014;24:1771–1786.
Saldana D.F., Yu Y., Feng Y. How many communities are there? J. Comput. Graph. Statist. 2017;26:171–181.
Salter-Townshend M., White A., Gollini I., Murphy T.B. Review of statistical network analysis: models, algorithms, and software. Stat. Anal. Data Min. 2012;5:243–264.
Schweinberger M. Instability, sensitivity, and degeneracy of discrete exponential families. J. Amer. Statist. Assoc. 2011;106:1361–1370. doi: 10.1198/jasa.2011.tm10747.
Schweinberger M. Consistent structure estimation of exponential-family random graph models with block structure. Bernoulli. 2020;26:1205–1233.
Schweinberger M., Handcock M.S. Local dependence in random graph models: characterization, properties and statistical inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 2015;77:647–676. doi: 10.1111/rssb.12081.
Schweinberger M., Krivitsky P.N., Butts C.T., Stewart J. Exponential-family models of random graphs: Inference in finite, super, and infinite population scenarios. Statist. Sci. 2020 in press.
Schweinberger M., Luna P. HERGM: Hierarchical exponential-family random graph models. J. Stat. Softw. 2018;85:1–39.
Schweinberger M., Stewart J. Concentration and consistency results for canonical and curved exponential-family models of random graphs. Ann. Statist. 2020;48:374–396.
Sewell D.K., Chen Y. Latent space models for dynamic networks. J. Amer. Statist. Assoc. 2015;110:1646–1657.
Smith A.L., Asta D.M., Calder C.A. The geometry of continuous latent space models for network data. Statist. Sci. 2019;34:428–453. doi: 10.1214/19-sts702.
Snijders T.A.B. Markov chain Monte Carlo estimation of exponential random graph models. J. Soc. Struct. 2002;3:1–40.
Snijders T.A.B. Contribution to the discussion of Handcock, M.S., Raftery, A.E., and J.M. Tantrum, Model-based clustering for social networks. J. R. Stat. Soc. Ser. A. 2007;170:322–324.
Snijders T.A.B., Pattison P.E., Robins G.L., Handcock M.S. New specifications for exponential random graph models. Sociol. Methodol. 2006;36:99–153.
Stewart J., Schweinberger M. Department of Statistics, Rice University; 2020. Scalable estimation of random graph models with dependent edges and parameter vectors of increasing dimension: Tech. Rep.
Stewart J., Schweinberger M., Bojanowski M., Morris M. Multilevel network data facilitate statistical inference for curved ERGMs with geometrically weighted terms. Social Networks. 2019;59:98–119. doi: 10.1016/j.socnet.2018.11.003.
Strauss D. On a general class of models for interaction. SIAM Rev. 1986;28:513–527.
Strauss D., Ikeda M. Pseudolikelihood estimation for social networks. J. Amer. Statist. Assoc. 1990;85:204–212.
Tan L.S.L., Friel N. Bayesian variational inference for exponential random graph models. J. Comput. Graph. Statist. 2020 in press.
Thiemichen S., Kauermann G. Stable exponential random graph models with non-parametric components for large dense networks. Social Networks. 2017;49:67–80.
Vu D.Q., Hunter D.R., Schweinberger M. Model-based clustering of large networks. Ann. Appl. Stat. 2013;7:1010–1039. doi: 10.1214/12-AOAS617.
Wang Y.X., Bickel P.J. Likelihood-based model selection for stochastic block models. Ann. Statist. 2017;45:500–528.
Yang J., Leskovec J. Defining and evaluating network communities based on ground-truth. Knowl. Inf. Syst. 2015;42:181–213.
Zhang A.Y., Zhou H.H. Minimax rates of community detection in stochastic block models. Ann. Statist. 2016;44:2252–2280.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

MMC S1

mmc1.pdf^{(240.5KB, pdf)}

[b1] Amini A.A., Chen A., Bickel P.J., Levina E. Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 2013;41:2097–2122. [Google Scholar]

[b2] Atchade Y.F., Lartillot N., Robert C. Bayesian computation for statistical models with intractable normalizing constants. Braz. J. Probab. Stat. 2013;27:416–436. [Google Scholar]

[b3] Besag J. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B Stat. Methodol. 1974;36:192–225. [Google Scholar]

[b4] Bickel P.J., Chen A. A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Nat. Acad. Sci. 2009;106:21068–21073. doi: 10.1073/pnas.0907096106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b5] Bickel P.J., Chen A., Levina E. The method of moments and degree distributions for network models. Ann. Statist. 2011;39:2280–2301. [Google Scholar]

[b6] Bickel P.J., Choi D., Chang X., Zhang H. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist. 2013;41:1922–1943. [Google Scholar]

[b7] Binkiewicz N., Vogelstein J.T., Rohe K. Covariate-assisted spectral clustering. Biometrika. 2017;104:361–377. doi: 10.1093/biomet/asx008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b8] Brown L. Institute of Mathematical Statistics; Hayworth, CA, USA: 1986. Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory. [Google Scholar]

[b9] Byshkin M., Stivala A., Mira A., Robins G., Lomi A. Fast maximum likelihood estimation via equilibrium expectation for large network data. Sci. Rep. 2018;8:2045–2322. doi: 10.1038/s41598-018-29725-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b10] Caimo A., Friel N. Bayesian inference for exponential random graph models. Social Networks. 2011;33:41–55. [Google Scholar]

[b11] Celisse A., Daudin J.J., Pierre L. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 2012;6:1847–1899. [Google Scholar]

[b12] Chatterjee S., Diaconis P. Estimating and understanding exponential random graph models. Ann. Statist. 2013;41:2428–2461. [Google Scholar]

[b13] Choi D.S., Wolfe P.J., Airoldi E.M. Stochastic blockmodels with growing number of classes. Biometrika. 2012;99:273–284. doi: 10.1093/biomet/asr053. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b14] Comets F. On consistency of a class of estimators for exponential families of Markov random fields on the lattice. Ann. Statist. 1992;20:455–468. [Google Scholar]

[b15] Daudin J.J., Picard F., Robin S. A mixture model for random graphs. Stat. Comput. 2008;18:173–183. [Google Scholar]

[b16] van Duijn M.A.J., Gile K., Handcock M.S. A framework for the comparison of maximum pseudo-likelihood and maximum likelihood estimation of exponential family random graph models. Social Networks. 2009;31:52–62. doi: 10.1016/j.socnet.2008.10.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b17] Erdős P., Rényi A. On random graphs. Publ. Math. 1959;6:290–297. [Google Scholar]

[b18] Erdős P., Rényi A. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 1960;5:17–61. [Google Scholar]

[b19] Fienberg S. A brief history of statistical models for network analysis and open challenges. J. Comput. Graph. Statist. 2012;21:825–839. [Google Scholar]

[b20] Frank O., Strauss D. Markov graphs. J. Amer. Statist. Assoc. 1986;81:832–842. [Google Scholar]

[b21] Gao C., Ma Z., Zhang A.Y., Zhou H.H. Achieving optimal misclassification proportion in stochastic block models. J. Mach. Learn. Res. 2017;18:1980–2024. [Google Scholar]

[b22] Häggström O., Jonasson J. Phase transition in the random triangle model. J. Appl. Probab. 1999;36:1101–1115. [Google Scholar]

[b23] Handcock M.S. Center for Statistics and the Social Sciences, University of Washington; 2003. Assessing Degeneracy in Statistical Models of Social Networks: Tech. Rep.http://www.csss.washington.edu/Papers [Google Scholar]

[b24] Handcock M.S., Raftery A.E., Tantrum J.M. Model-based clustering for social networks. J. R. Stat. Soc. A. 2007;170:301–354. [Google Scholar]

[b25] Harris J.K. Sage; Thousand Oaks, California: 2013. An Introduction to Exponential Random Graph Modeling. [Google Scholar]

[b26] Hoff P.D. Additive and multiplicative effects network models. Stat. Sci. 2020 in press. [Google Scholar]

[b27] Hoff P.D., Raftery A.E., Handcock M.S. Latent space approaches to social network analysis. J. Amer. Statist. Assoc. 2002;97:1090–1098. [Google Scholar]

[b28] Holland P.W., Leinhardt S. A method for detecting structure in sociometric data. Am. J. Sociol. 1970;76:492–513. [Google Scholar]

[b29] Holland P.W., Leinhardt S. Some evidence on the transitivity of positive interpersonal sentiment. Am. J. Sociol. 1972;77:1205–1209. [Google Scholar]

[b30] Holland P.W., Leinhardt S. Local structure in social networks. Sociol. Methodol. 1976:1–45. [Google Scholar]

[b31] Hummel R.M., Hunter D.R., Handcock M.S. Improving simulation-based algorithms for fitting ERGMs. J. Comput. Graph. Statist. 2012;21:920–939. doi: 10.1080/10618600.2012.679224. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b32] Hunter D.R. Curved exponential family models for social networks. Social Networks. 2007;29:216–230. doi: 10.1016/j.socnet.2006.08.005. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b33] Hunter D.R., Goodreau S.M., Handcock M.S. Goodness of fit of social network models. J. Amer. Statist. Assoc. 2008;103:248–258. [Google Scholar]

[b34] Hunter D.R., Handcock M.S. Inference in curved exponential family models for networks. J. Comput. Graph. Statist. 2006;15:565–583. [Google Scholar]

[b35] Hunter D.R., Krivitsky P.N., Schweinberger M. Computational statistical methods for social network models. J. Comput. Graph. Statist. 2012;21:856–882. doi: 10.1080/10618600.2012.732921. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b36] Hunter D.R., Lange K. A tutorial on MM algorithms. Amer. Statist. 2004;58:30–38. [Google Scholar]

[b37] Jin I.H., Liang F. Fitting social network models using varying truncation stochastic approximation MCMC algorithm. J. Comput. Graph. Statist. 2013;22:927–952. [Google Scholar]

[b38] Jonasson J. The random triangle model. J. Appl. Probab. 1999;36:852–876. [Google Scholar]

[b39] Kolaczyk E.D. Springer-Verlag; New York: 2009. Statistical Analysis of Network Data: Methods and Models. [Google Scholar]

[b40] Krivitsky P.N. Using contrastive divergence to seed Monte Carlo MLE for exponential-family random graph models. Comput. Statist. Data Anal. 2017;107:149–161. [Google Scholar]

[b41] Lazega E., Snijders T.A.B., editors. Multilevel Network Analysis for the Social Sciences. Springer-Verlag; Switzerland: 2016. [Google Scholar]

[b42] Lei J., Rinaldo A. Consistency of spectral clustering in stochastic block models. Ann. Statist. 2015;43:215–237. [Google Scholar]

[b43] Liang F., Jin I.H., Song Q., Liu J.S. An adaptive exchange algorithm for sampling from distributions with intractable normalizing constants. J. Amer. Statist. Assoc. 2016;111:377–393. [Google Scholar]

[b44] Lusher D., Koskinen J., Robins G. Cambridge University Press; Cambridge, UK: 2013. Exponential Random Graph Models for Social Networks. [Google Scholar]

[b45] Mele A. A structural model of dense network formation. Econometrica. 2017;85:825–850. [Google Scholar]

[b46] Nowicki K., Snijders T.A.B. Estimation and prediction for stochastic blockstructures. J. Amer. Statist. Assoc. 2001;96:1077–1087.

[b47] Okabayashi S., Geyer C.J. Long range search for maximum likelihood in exponential families. Electron. J. Stat. 2012;6:123–147.

[b48] Priebe C.E., Sussman D.L., Tang M., Vogelstein J.T. Statistical inference on errorfully observed graphs. J. Amer. Statist. Assoc. 2012;107:1119–1128.

[b49] Rohe K., Chatterjee S., Yu B. Spectral clustering and the high-dimensional stochastic block model. Ann. Statist. 2011;39:1878–1915.

[b50] Rohe K., Qin T., Fan H. The highest-dimensional stochastic block model with a regularized estimator. Statist. Sinica. 2014;24:1771–1786.

[b51] Saldana D.F., Yu Y., Feng Y. How many communities are there? J. Comput. Graph. Statist. 2017;26:171–181.

[b52] Salter-Townshend M., White A., Gollini I., Murphy T.B. Review of statistical network analysis: models, algorithms, and software. Stat. Anal. Data Min. 2012;5:243–264.

[b53] Schweinberger M. Instability, sensitivity, and degeneracy of discrete exponential families. J. Amer. Statist. Assoc. 2011;106:1361–1370. doi: 10.1198/jasa.2011.tm10747.

[b54] Schweinberger M. Consistent structure estimation of exponential-family random graph models with block structure. Bernoulli. 2020;26:1205–1233.

[b55] Schweinberger M., Handcock M.S. Local dependence in random graph models: characterization, properties and statistical inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 2015;77:647–676. doi: 10.1111/rssb.12081.

[b56] Schweinberger M., Krivitsky P.N., Butts C.T., Stewart J. Exponential-family models of random graphs: Inference in finite, super, and infinite population scenarios. Statist. Sci. 2020 in press.

[b57] Schweinberger M., Luna P. HERGM: Hierarchical exponential-family random graph models. J. Stat. Softw. 2018;85:1–39.

[b58] Schweinberger M., Stewart J. Concentration and consistency results for canonical and curved exponential-family models of random graphs. Ann. Statist. 2020;48:374–396.

[b59] Sewell D.K., Chen Y. Latent space models for dynamic networks. J. Amer. Statist. Assoc. 2015;110:1646–1657.

[b60] Smith A.L., Asta D.M., Calder C.A. The geometry of continuous latent space models for network data. Statist. Sci. 2019;34:428–453. doi: 10.1214/19-sts702.

[b61] Snijders T.A.B. Markov chain Monte Carlo estimation of exponential random graph models. J. Soc. Struct. 2002;3:1–40.

[b62] Snijders T.A.B. Contribution to the discussion of Handcock, M.S., Raftery, A.E., and J.M. Tantrum, Model-based clustering for social networks. J. R. Stat. Soc. Ser. A. 2007;170:322–324.

[b63] Snijders T.A.B., Pattison P.E., Robins G.L., Handcock M.S. New specifications for exponential random graph models. Sociol. Methodol. 2006;36:99–153.

[b64] Stewart J., Schweinberger M. Scalable estimation of random graph models with dependent edges and parameter vectors of increasing dimension. Tech. Rep., Department of Statistics, Rice University; 2020.

[b65] Stewart J., Schweinberger M., Bojanowski M., Morris M. Multilevel network data facilitate statistical inference for curved ERGMs with geometrically weighted terms. Social Networks. 2019;59:98–119. doi: 10.1016/j.socnet.2018.11.003.

[b66] Strauss D. On a general class of models for interaction. SIAM Rev. 1986;28:513–527.

[b67] Strauss D., Ikeda M. Pseudolikelihood estimation for social networks. J. Amer. Statist. Assoc. 1990;85:204–212.

[b68] Tan L.S.L., Friel N. Bayesian variational inference for exponential random graph models. J. Comput. Graph. Statist. 2020 in press.

[b69] Thiemichen S., Kauermann G. Stable exponential random graph models with non-parametric components for large dense networks. Social Networks. 2017;49:67–80.

[b70] Vu D.Q., Hunter D.R., Schweinberger M. Model-based clustering of large networks. Ann. Appl. Stat. 2013;7:1010–1039. doi: 10.1214/12-AOAS617.

[b71] Wang Y.X., Bickel P.J. Likelihood-based model selection for stochastic block models. Ann. Statist. 2017;45:500–528.

[b72] Yang J., Leskovec J. Defining and evaluating network communities based on ground-truth. Knowl. Inf. Syst. 2015;42:181–213.

[b73] Zhang A.Y., Zhou H.H. Minimax rates of community detection in stochastic block models. Ann. Statist. 2016;44:2252–2280.
