Stochastic blockmodels with a growing number of classes

D S Choi; P J Wolfe; E M Airoldi

doi:10.1093/biomet/asr053

Biometrika. 2012 Apr 17;99(2):273–284. doi: 10.1093/biomet/asr053

PMCID: PMC3635708 PMID: 23843660

Abstract

We present asymptotic and finite-sample results on the use of stochastic blockmodels for the analysis of network data. We show that the fraction of misclassified network nodes converges in probability to zero under maximum likelihood fitting when the number of classes is allowed to grow as the root of the network size and the average network degree grows at least poly-logarithmically in this size. We also establish finite-sample confidence bounds on maximum-likelihood blockmodel parameter estimates from data comprising independent Bernoulli random variates; these results hold uniformly over class assignment. We provide simulations verifying the conditions sufficient for our results, and conclude by fitting a logit parameterization of a stochastic blockmodel with covariates to a network data example comprising self-reported school friendships, resulting in block estimates that reveal residual structure.

Keywords: Likelihood-based inference, Social network analysis, Sparse random graph, Stochastic blockmodel

1. Introduction

The global structure of social, biological, and information networks is sometimes envisioned as the aggregate of many local interactions whose effects propagate in ways that are not yet well understood. There is increasing opportunity to collect data on an appropriate scale for such systems, but their analysis remains challenging (Goldenberg et al., 2009). Here we analyse a statistical model for network data known as the single-membership stochastic blockmodel. Its salient feature is that it partitions the N nodes of a network into K distinct classes whose members all interact similarly with the network. Blockmodels were first associated with the deterministic concept of structural equivalence in social network analysis (Lorrain & White, 1971), where two nodes were considered interchangeable if their connections were equivalent in a formal sense. This concept was adapted to stochastic settings and gave rise to the stochastic blockmodel in the work by Holland et al. (1983) and Fienberg et al. (1985). The model and extensions thereof have since been applied in a variety of disciplines (Airoldi et al., 2008; Hoff, 2008; Nowicki & Snijders, 2001; Girvan & Newman, 2002; Handcock et al., 2007; Copic et al., 2009; Mariadassou et al., 2010; Karrer & Newman, 2011).

In this work we provide a finite-sample confidence bound that can be used when estimating network structure from data modelled by independent Bernoulli random variates, and also show that under maximum likelihood fitting of a correctly specified K-class blockmodel, the fraction of misclassified network nodes converges in probability to zero even when the number of classes K grows with N. As noted by Rohe et al. (2011) this is advantageous if we expect class sizes to remain relatively constant even as N increases. Related results for fixed K have been shown by Snijders & Nowicki (1997) for networks with a linearly increasing degree, and in a stronger sense for sparse graphs with poly-logarithmically increasing degrees by Bickel & Chen (2009).

Our results can be related to those of Rohe et al. (2011), who use spectral methods to bound the number of misclassified nodes in the stochastic blockmodel with increasing K, although with the more restrictive requirement of nearly linearly increasing degree. As noted by those authors, this assumption may not hold in many practical settings. Our manner of proof requires only poly-logarithmically increasing degree, and is more closely related to the fixed-K proof of Bickel & Chen (2009), although we note that spectral clustering as suggested by Rohe et al. (2011) provides a computationally appealing alternative to maximum likelihood fitting in practice.

As discussed by Bickel & Chen (2009), one may assume exchangeability in lieu of a generative K-class blockmodel: an analogue to de Finetti’s theorem for exchangeable sequences states that the probability distribution of an infinite exchangeable random graph is expressible as a mixture of distributions whose components can be approximated by blockmodels (Kallenberg, 2005; Bickel & Chen, 2009). An observed network can then be viewed as a sample drawn from this infinite conceptual population, and so in this case the fitted blockmodel describes one mixture component thereof.

2. Statement of results

2.1. Problem formulation and definitions

We consider likelihood-based inference for independent Bernoulli data {A_ij} (i = 1, …, N; j = i + 1, …, N), both when no structure linking the success probabilities {P_ij} is assumed and in the special case when a stochastic blockmodel of known order K is assumed to apply. To this end, let A ∈ {0, 1}^{N×N} denote the symmetric adjacency matrix of a simple, undirected graph on N nodes whose entries {A_ij} for i < j are assumed independent Ber(P_ij) random variates, and whose main diagonal {A_ii} (i = 1, …, N) is fixed to zero. The average degree of this graph is 2M/N, where M = ∑_i<j P_ij is its expected number of edges. Under a K-class stochastic blockmodel, these edge probabilities are further restricted to satisfy

P_{i j} = θ_{z_{i} z_{j}} (i = 1, \dots, N; j = i + 1, \dots, N)

(1)

for some symmetric matrix θ ∈ [0, 1]^K×K and membership vector z ∈ {1, …, K}^N. Thus the probability of an edge between two nodes is assumed to depend only on the class of each node.
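As a concrete illustration of (1), the following minimal Python sketch draws an adjacency matrix from a blockmodel. The function name `sample_blockmodel`, the 0-based class labels and the example values of θ are assumptions made for illustration and do not come from the original study.

```python
import random

def sample_blockmodel(z, theta, seed=0):
    """Draw a symmetric 0/1 adjacency matrix with P_ij = theta[z_i][z_j].

    z     : list of class labels in {0, ..., K-1} (0-based, unlike the text)
    theta : symmetric K x K list of lists with entries in [0, 1]
    """
    rng = random.Random(seed)
    n = len(z)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):          # only pairs i < j are drawn
            edge = 1 if rng.random() < theta[z[i]][z[j]] else 0
            A[i][j] = A[j][i] = edge       # symmetry; diagonal stays zero
    return A

# Hypothetical two-class example: dense within classes, sparse between them.
z = [0, 0, 0, 1, 1, 1]
theta = [[0.9, 0.1], [0.1, 0.8]]
A = sample_blockmodel(z, theta)
```

Only the upper triangle is sampled, so the independence assumption on {A_ij}_i<j is respected and the diagonal remains fixed at zero.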

Let L(A; z, θ) denote the loglikelihood of observing data matrix A under a K-class blockmodel with parameters (z, θ), and L̄_P (z, θ) its expectation:

\begin{matrix} L (A; z, θ) = \sum_{i < j} {A_{i j} log θ_{z_{i} z_{j}} + (1 - A_{i j}) log (1 - θ_{z_{i} z_{j}})}, \\ {\bar{L}}_{P} (z, θ) = \sum_{i < j} {P_{i j} log θ_{z_{i} z_{j}} + (1 - P_{i j}) log (1 - θ_{z_{i} z_{j}})} . \end{matrix}

For fixed class assignment z, let N_a denote the number of nodes assigned to class a, and let n_ab denote the maximum number of possible edges between classes a and b; i.e., n_ab = N_a N_b if a ≠ b and n_aa = N_a!/{(N_a − 2)!2!}. Further, let θ̂^(z) and θ̄^(z) be symmetric matrices in [0, 1]^K×K, with

\begin{array}{l} {\hat{θ}}_{a b}^{(z)} = \frac{1}{n_{a b}} \sum_{i < j} A_{i j} 1 (z_{i} = a, z_{j} = b), \\ {\bar{θ}}_{a b}^{(z)} = \frac{1}{n_{a b}} \sum_{i < j} P_{i j} 1 (z_{i} = a, z_{j} = b) (a = 1, \dots, K; b = a, \dots, K) \end{array}

defined whenever n_ab ≠ 0. Observe that θ̂^(z) comprises sample proportion estimators as a function of z, whereas θ̄^(z) is its expectation under the independent {Ber(P_ij)} model. Taken over all class assignments z ∈ {1, …, K}^N, the sets {θ̂^(z)} comprise a sufficient statistic for the family of K -class stochastic blockmodels, and for each z, θ̂^(z) maximizes L(A; z, ·). Analogously, the sets {θ̄^(z)} are functions of the model parameters {P_ij}_i<j, and maximize L̄_P(z, ·). We write θ̂ and θ̄ when the choice of z is understood, and L(A; z) and L̄_P(z) to abbreviate sup_θ L(A; z, θ) and sup_θ L̄_P(z, θ), respectively.
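The quantities n_ab, θ̂^(z) and L(A; z) can be computed directly from an adjacency matrix and a class assignment. The sketch below is one possible implementation; the function names are hypothetical, class labels are 0-based, and each node pair is assigned to its unordered class pair (a, b) with a ⩽ b, which is our reading of the indicator in the definition of θ̂_ab.

```python
import math
from collections import defaultdict

def block_estimates(A, z):
    """Return {(a, b): (n_ab, theta_hat_ab)} for all a <= b with n_ab > 0."""
    edges = defaultdict(int)
    pairs = defaultdict(int)
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            key = (min(z[i], z[j]), max(z[i], z[j]))   # unordered class pair
            pairs[key] += 1
            edges[key] += A[i][j]
    return {k: (pairs[k], edges[k] / pairs[k]) for k in pairs}

def profile_loglik(A, z):
    """L(A; z) = sup over theta of L(A; z, theta), attained at theta_hat."""
    total = 0.0
    for n_ab, t in block_estimates(A, z).values():
        for p in (t, 1.0 - t):
            if p > 0:                      # convention: 0 * log 0 = 0
                total += n_ab * p * math.log(p)
    return total

# Two disjoint edges: the split [0, 0, 1, 1] explains the data perfectly.
A = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
better = profile_loglik(A, [0, 0, 1, 1]) > profile_loglik(A, [0, 0, 0, 0])
```

Grouping the loglikelihood by block uses the fact that, for fixed z, each block contributes n_ab{θ̂_ab log θ̂_ab + (1 − θ̂_ab) log(1 − θ̂_ab)} at the maximizing θ̂.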

Finally, observe that when a blockmodel with parameters (z̄, θ̄) is in force, then P_ij = θ̄_{z̄_iz̄_j} in accordance with (1), and consequently L̄_P is maximized by the true parameter values (z̄, θ̄):

{\bar{L}}_{P} (\bar{z}, \bar{θ}) - {\bar{L}}_{P} (z, θ) = \sum_{i < j} D (P_{i j} ‖ θ_{z_{i} z_{j}}) ⩾ \sum_{i < j} 2 {(P_{i j} - θ_{z_{i} z_{j}})}^{2} ⩾ 0,

where D(p ‖ p′) denotes the Kullback–Leibler divergence of a Ber(p′) distribution from a Ber(p) one.
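The second inequality in the display above is a Pinsker-type bound, D(p ‖ p′) ⩾ 2(p − p′)², with the divergence measured in nats. A small numerical check, with a hypothetical helper `bernoulli_kl` and an arbitrary grid of probabilities, might look as follows.

```python
import math

def bernoulli_kl(p, q):
    """D(p || q) between Ber(p) and Ber(q), in nats; requires 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0:                          # convention: 0 * log 0 = 0
            total += a * math.log(a / b)
    return total

# Gap between the divergence and its quadratic lower bound on a grid.
grid = [k / 20 for k in range(1, 20)]
gaps = [bernoulli_kl(p, q) - 2 * (p - q) ** 2 for p in grid for q in grid]
smallest_gap = min(gaps)
```

On this grid the smallest gap is attained when p = q, where both sides vanish.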

2.2. Fitting a K-class stochastic blockmodel to independent Bernoulli trials

Fitting a K-class stochastic blockmodel to independent Ber(P_ij) trials yields estimates θ̂^(z) of averages θ̄^(z) of subsets of the parameter set {P_ij}, with each class assignment z inducing a partition of that set. We begin with a basic lemma that expresses the difference L(A; z) − L̄_P (z) in terms of θ̂^(z) and θ̄^(z), and follows directly from their respective maximizing properties.

Lemma 1. Let {A_ij}_i<j comprise independent Ber(P_ij) trials. Then the difference sup_θ L(A; z, θ) − sup_θ L̄_P(z, θ) can be expressed for X = ∑_i<j A_ij log{θ̄_{z_iz_j}/(1 − θ̄_{z_iz_j})} as

L (A; z) - {\bar{L}}_{P} (z) = \sum_{a ⩽ b} n_{a b} D ({\hat{θ}}_{a b} ‖ {\bar{θ}}_{a b}) + X - E (X) .

We first bound the former quantity in this expression, which provides a measure of the distance between θ̂ and its estimand θ̄ under the setting of Lemma 1. The bound is used in subsequent asymptotic results, and also yields a kind of confidence measure on θ̂ in the finite-sample regime.

Theorem 1. Suppose that a K-class stochastic blockmodel is fitted to data {A_ij}_i<j comprising N!/{(N − 2)!2!} independent Ber(P_ij) trials, where, for any class assignment z, estimate θ̂ maximizes the blockmodel loglikelihood L(A; z, ·). Then with probability at least 1 − δ,

max_{z} {\sum_{a ⩽ b} n_{a b} D ({\hat{θ}}_{a b} ‖ {\bar{θ}}_{a b})} < N log K + (K^{2} + K) log (\frac{N}{K} + 1) - log δ .

(2)

Theorem 1 is proved in the Appendix via the method of types: for fixed z, the probability of any realization of θ̂ is first bounded by exp{− ∑_a⩽b n_ab D(θ̂_ab ‖ θ̄_ab)}. A counting argument then yields a deviation result in terms of (N/K + 1)^{K²+K}, and finally a union bound is applied so that the result holds uniformly over all K^N possible choices of assignment vector z.

Our second result is asymptotic, and combines Theorem 1 with a Bernstein inequality for bounded random variables, applied to the latter terms X − E(X) in Lemma 1. To ensure boundedness we assume minimal restrictions on each P_ij; this Bernstein inequality, coupled with a union bound to ensure that the result holds uniformly over all z, dictates growth restrictions on K and M.

Theorem 2. Assume the setting of Theorem 1, whereby a K-class blockmodel is fitted to N!/{(N − 2)!2!} independent Ber(P_ij) random variates {A_ij}_i<j, and further assume that 1/N² ⩽ P_ij ⩽ 1 − 1/N² for all N and i < j. Then if K = 𝒪(N^{1/2}) and M = ω(N(log N)^{3+δ}) for some δ > 0,

max_{z} | L (A; z) - {\bar{L}}_{P} (z) | = o_{P} (M) .

(3)

Thus whenever each P_ij is bounded away from 0 and 1 in the manner above, the maximized loglikelihood function L(A; z) = sup_θ L(A; z, θ) is asymptotically well behaved in network size N as long as the network’s average degree 2M/N grows faster than (log N)^{3+δ} and the number K of classes fitted to it grows no faster than N^{1/2}.

2.3. Fitting a correctly specified K-class stochastic blockmodel

The above results apply to the general case of independent Bernoulli data {A_ij}, with no additional structure assumed amongst the set of success probabilities {P_ij}. If we further assume the data to be generated by a K-class stochastic blockmodel whose parameters (z̄, θ̄) are subject to suitable identifiability conditions, it is possible to characterize the behaviour of the class assignment estimator ẑ under maximum likelihood fitting of a correctly specified K-class blockmodel.

Theorem 3. If (3) holds, and data are generated according to a K-class blockmodel with membership vector z̄, then

{\bar{L}}_{P} (\bar{z}) - {\bar{L}}_{P} (\hat{z}) = o_{P} (M),

(4)

with respect to the maximum-likelihood K-class blockmodel class assignment estimator ẑ.

Let N_e(ẑ) be the number of incorrect class assignments under ẑ, counted for every node whose true class under z̄ is not in the majority within its estimated class under ẑ. If furthermore the following identifiability conditions hold with respect to the model sequence:

(i) for all blockmodel classes a = 1, …, K, class size N_a grows as min_a(N_a) = Ω(N/K);
(ii) the following holds over all distinct class pairs (a, b) and all classes c:
$min_{(a, b)} max_{c} {D ({\bar{θ}}_{a c} ‖ \frac{{\bar{θ}}_{a c} + {\bar{θ}}_{b c}}{2}) + D ({\bar{θ}}_{b c} ‖ \frac{{\bar{θ}}_{a c} + {\bar{θ}}_{b c}}{2})} = Ω (\frac{M K}{N^{2}}),$
then it follows from (4) that N_e(ẑ) = o_P (N).
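The error count N_e defined above can be computed from a true and a fitted membership vector. The following sketch assumes both are given as equal-length lists of labels; the function name `misclassified` is hypothetical.

```python
from collections import Counter, defaultdict

def misclassified(z_true, z_est):
    """Count nodes whose true class is not the majority in their fitted class."""
    members = defaultdict(list)
    for t, e in zip(z_true, z_est):
        members[e].append(t)               # true labels within each fitted class
    errors = 0
    for true_labels in members.values():
        majority_size = Counter(true_labels).most_common(1)[0][1]
        errors += len(true_labels) - majority_size
    return errors

# Fitted class 1 holds three nodes of true class 0 and one of true class 1.
n_e = misclassified([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 1])
```

Because only majority sizes enter the count, the result is invariant to relabelling of the fitted classes, and ties for the majority do not affect it.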

Thus the conclusion of Theorem 3 is that under suitable conditions the fraction N_e/N of misclassified nodes converges in probability to zero as N grows, yielding a convergence result for stochastic blockmodels with a growing number of classes. Condition (i) stipulates that all class sizes grow at a rate that is eventually bounded below by a single constant times N/K, while condition (ii) ensures that any two rows of θ differ in at least one entry by an amount that is eventually bounded below by a single constant times MK/N². Observe that if eventually K = N^{1/2} and M = N(log N)⁴ so that conditions on K and M sufficient for Theorem 2 are met, then since (log N)⁴ = o(N^{1/2}), it follows that MK/N² goes to zero in N.
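To see how slowly the separation scale MK/N² shrinks under the rates K = N^{1/2} and M = N(log N)⁴, one can evaluate it numerically. This is only an illustrative calculation; the function name and the chosen values of N are assumptions.

```python
import math

def identifiability_scale(n):
    """M*K/N^2 when K = N^(1/2) and M = N*(log N)^4."""
    k = math.sqrt(n)
    m = n * math.log(n) ** 4
    return m * k / n ** 2                  # equals (log N)^4 / N^(1/2)

sizes = (10**4, 10**8, 10**12, 10**16)
scales = [identifiability_scale(n) for n in sizes]
for n, s in zip(sizes, scales):
    print(n, s)
```

The scale decreases along this sequence but is still above 1 at N = 10⁸, so the limit in condition (ii) is approached only for very large networks at these rates.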

3. Numerical results

We now present results of a small simulation study undertaken to investigate the assumptions and conditions of Theorems 1–3, in which K-class blockmodels were fitted to various networks generated at random from models corresponding to each of the three theorems. Because exact maximization in z of the blockmodel loglikelihood L(A; z, θ) is computationally intractable even for moderate N, we instead employed Gibbs sampling to explore the function max_θ L(A; z, θ) and recorded the best value of z visited by the sampler. As the results of Theorems 1 and 2 hold uniformly in z, however, we expect θ̄ and L̄_P(z) to be close to their empirical estimates whenever N is sufficiently large, regardless of the approach employed to select z. This fact also suggests that a single-class blockmodel may come closest to achieving equality in Theorems 1 and 2, as many class assignments are equally likely a priori to have high likelihood. By similar reasoning, a weakly identifiable model should come closest to achieving the error bound in Theorem 3, such as one with nearly identical within- and between-class edge probabilities. We describe each of these cases empirically in the remainder of this section.

First, the tightness of the confidence bound of (2) from Theorem 1 was investigated by fitting K-class blockmodels to Erdős–Rényi networks comprising N!/{(N − 2)!2!} independent Ber(p) trials, with N = 500 nodes, p = 0.075 chosen to match the data analysis example in the sequel, and K ∈ {5, 10, 20, 30, 40, 50}. For each K, the error terms ∑_a⩽b n_ab D(θ̂_ab ‖ θ̄_ab) and {∑_a⩽b n_ab(θ̂_ab − θ̄_ab)²}^{1/2} were recorded for each of 100 trials and compared with the respective 95% confidence bounds, δ = 0.05, derived from Theorem 1. The bounds overestimated the respective errors by a factor of 3–7 on average, with small standard deviation. In this worst-case scenario, the bound is loose, but not unusable; the errors never exceeded the 95% confidence bounds in any of the trials.

To test whether the assumptions of Theorem 2 are necessary as well as sufficient to obtain convergence of L(A; z)/M to L̄_P(z)/M, blockmodels were next fitted to Erdős–Rényi networks of increasing size, for N in the range 50–1050. The corresponding normalized loglikelihood error |L(A; z) − L̄_P(z)|/M for different rates of growth in the expected number of edges M and the number of fitted classes K is shown in Fig. 1. Observe from Fig. 1(a) that when M = N(log N)⁴ and K = N^{1/2}, as prescribed by the theorem, this error decreases in N. If the edge density is reduced to M/N = (log N)², we observe in Fig. 1(b) convergence when K = N^{1/2} and divergence when K = N^{3/5}. This suggests that the error as a function of K follows Theorem 2 closely, but that the network can be somewhat more sparse than it requires.

Fig. 1. Simulation study results illustrating Theorems 1–3. (a) Likelihood error |L(A; z) − L̄_P(z)|/M as a function of network size N, shown for M = N(log N)⁴ with K = N^{1/2}. (b) Same quantity for M = N(log N)² with K = N^{3/5} (dotted) and K = N^{1/2} (solid). (c) Error rate N_e(ẑ)/N for M = N(log N)² with K = N^{1/2} and γ = 4/5 (dotted), γ = 9/10 (dashed), γ = 1 (solid).

To test the conditions of Theorem 3, blockmodels with parameters (z̄, θ̄) and an increasing number of classes K were used to generate data, and the corresponding node misclassification error rates N_e(ẑ)/N were recorded under correctly specified K-class blockmodel fitting. Model parameter z̄ was chosen to yield equally sized blocks, so as to meet identifiability condition (i) of Theorem 3. Parameter θ̄ = αI + β11^T was chosen to yield within- and between-class success probabilities with the property that for any class pair (a, b), the condition D(θ_aa ‖ (θ_aa + θ_ab)/2) = MK^γ/(20N²) was satisfied, with γ ∈ {4/5, 9/10, 1}; identifiability condition (ii) was thus met only in the γ = 1 case. Figure 1(c) shows the fraction N_e(ẑ)/N of misclassified nodes when M = N(log N)² and K = N^{1/2}, corresponding to the setting in which convergence of L(A; z)/M to L̄_P (z)/M was observed above; this fraction is seen to decay when γ = 1 or 9/10, but to increase when γ = 4/5. This behaviour conforms with Theorem 3 and suggests that its identifiability conditions are close to being necessary as well as sufficient.

4. Network data example

4.1. Adolescent health social network dataset

To illustrate the use of our results in the fitting of K-class stochastic blockmodels to network data, we employed the Comm18 friendship network from the National Longitudinal Survey of Adolescent Health, in which N = 284 students at a school in the United States were asked to list up to five friends of each gender, yielding a network with 1189 edges signifying that one or both of the students had listed the other as a friend. The students also supplied additional information including their gender, school year and race. Further details of the study can be found in, e.g., Goodreau et al. (2009).

Of the three covariates, shared school year is reported by Goodreau et al. (2009) to be the best predictor of community structure. This finding is borne out in Fig. 2(a), which shows the adjacency structure under an ordering of students by school year and reveals strong community divisions between years.

Fig. 2 — Social network dataset and its fitting statistics for a varying number of blockmodel classes K. (a) Adjacency data matrix with students ordered by school year. (b) Model order statistic for fitted logit blockmodels as a function of K. (c) Out-of-sample prediction error using five-fold crossvalidation, as a function of K. Error bars indicate standard deviation.

4.2. Logit blockmodel parameterization and fitting procedure

Here we build on the observation of school year clustering by taking covariate information explicitly into account when fitting the dataset described above. Specifically, by assuming only that links are independent Bernoulli variates and then employing confidence bounds to assess fitted blocks by way of parameter θ̄^(z), we examine these data for residual community structure beyond that well explained by the covariates themselves.

Since the results of Theorems 1 and 2 hold uniformly over all choices of blockmodel membership vector z, we may select z in any manner, including those that depend on covariates. For this example, we determined an approximate maximum likelihood estimate ẑ under a logit blockmodel that allows the direct incorporation of covariates. The model is parameterized such that the log-odds of an edge occurrence between nodes i and j is given by

log \frac{P_{i j}}{1 - P_{i j}} = {\tilde{θ}}_{z_{i} z_{j}} + x {(i, j)}^{T} β (i = 1, \dots, N; j = i + 1, \dots, N),

(5)

where x(i, j) is a vector of covariates indicating shared group membership, and model parameters (θ̃, β, z) are estimated from the data. Three covariates were used, indicating shared gender, difference in school years, and a six-category covariate indicating the range of the observed degree of each node; see Karrer & Newman (2011) for related discussion on this point. The matrix θ̃ is analogous to blockmodel parameter θ, the vector z specifies the blockmodel class assignment and the vector β was implemented here with sum-to-zero identifiability constraints.
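Inverting the log-odds in (5) gives the edge probability as a logistic function of the block effect plus the covariate effect. A minimal sketch follows; the function name, argument layout and example values are hypothetical and are not the authors' implementation.

```python
import math

def edge_probability(theta_tilde, z, x_ij, beta, i, j):
    """P_ij implied by (5): inverse logit of block effect plus covariate effect."""
    eta = theta_tilde[z[i]][z[j]] + sum(x * b for x, b in zip(x_ij, beta))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical values: two classes, two covariates for the pair (0, 1).
theta_tilde = [[-1.0, -3.0], [-3.0, -1.5]]
z = [0, 1]
p01 = edge_probability(theta_tilde, z, x_ij=[1.0, 0.0], beta=[0.5, -0.5], i=0, j=1)
```

Unlike θ in (1), the entries of θ̃ are unconstrained real numbers, since the logistic map returns a value in (0, 1) for any finite input.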

Because exact maximization of the loglikelihood function L(A; θ̃, β, z) corresponding to (5) is computationally intractable, we instead employed an approach that alternated between Markov chain Monte Carlo exploration of z while holding (θ̃, β) constant, and optimization of θ̃ and β while holding z constant. We tested different initialization methods and observed that highest likelihoods were consistently produced by first fitting the class assignment vector z. This fitting procedure provides a means of estimating averages θ̄^(z) over subsets of the set {P_ij}_i<j, under the assumption that the network data comprise independent Ber(P_ij) trials.

4.3. Data analysis

We fitted the logit blockmodel of (5) for values of K ranging from 1 to 25 using the stochastic maximization procedure described in Section 4.2, and gauged model order by the Bayesian information criterion and out-of-sample prediction shown, respectively, in Figs. 2(b) and (c). The minimum of the Bayesian information criterion corresponds in location with the knee of the out-of-sample prediction curve, suggesting a model order between 4 and 7. The corresponding 95% confidence bounds on the divergence of θ̂^(z) from θ̄^(z) provided by Theorem 1 also yield small values for K in this range: for example, when K = 4, the normalized sum of Kullback–Leibler divergences [N!/{(N − 2)!2!}]^{−1} ∑_a⩽b n_ab D(θ̂_ab ‖ θ̄_ab) is bounded by 0.0120.

The top two rows of Fig. 3 depict approximate maximum likelihood estimates of z for K in the range 4–7. Larger values of K also reveal block structure, but exhibit correspondingly larger confidence bound evaluations; for example, when K = 10, the Kullback–Leibler divergence bound of 0.026 no longer excludes an Erdős–Rényi random graph whose density matches the observed network. Adjacency structures permuted to show block divisions under ẑ within each school year are shown in the top row, with the corresponding values of μ̂ shown in the bottom row. We note that the total number of visible communities shown in the top row appears to exceed K, due to the interaction of school year and latent class effects.
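The confidence bound values quoted for this dataset can be reproduced from the right-hand side of (2), divided by the number of node pairs N(N − 1)/2, with N = 284 and δ = 0.05. The sketch below uses hypothetical function names and requires Python 3.8 or later for `math.comb`.

```python
import math

def theorem1_bound(n, k, delta):
    """Right-hand side of (2): N log K + (K^2 + K) log(N/K + 1) - log(delta)."""
    return n * math.log(k) + (k * k + k) * math.log(n / k + 1) - math.log(delta)

def normalized_bound(n, k, delta=0.05):
    """Bound (2) divided by the number of node pairs N(N - 1)/2."""
    return theorem1_bound(n, k, delta) / math.comb(n, 2)

bound_k4 = normalized_bound(284, 4)        # about 0.0120, as quoted for K = 4
bound_k10 = normalized_bound(284, 10)      # about 0.026, as quoted for K = 10
```

The N log K term dominates at this network size, which is why the bound roughly doubles between K = 4 and K = 10.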

As K is increased, the groups do not become isolated but rather continue to exhibit cross-group friendships, suggesting fewer than four tightly demarcated communities per school year. Within each school year, the K groups can be separated into two meta-groups whose membership remained roughly constant, with 234 students whose meta-group membership did not change at all as K ranged from 4 to 7. The two meta-groups have similar school year and nodal degree distributions, with a two-sample Kolmogorov–Smirnov test returning p-values of 0.63 and 0.08 for school year and degree, respectively. The bottom row of Fig. 3 shows differing racial compositions for the meta-groups, with race 2 concentrated almost exclusively in meta-group 2. However, membership was not determined solely by race; we note that race 1 students in the second meta-group had a higher density of friendships with race 2 than did the race 1 students in the first meta-group by a factor of ten, justifying their inclusion in the second meta-group.

Acknowledgments

This work was supported in part by the National Science Foundation, National Institutes of Health, Army Research Office and the Office of Naval Research, U.S.A. Additional funding was provided by the Harvard Medical School’s Milton Fund.

Appendix.

Proof of Theorem 1. To begin, observe that for any fixed class assignment z, every θ̂_ab is an average of n_ab independent Bernoulli random variables, with corresponding mean θ̄_ab. A Chernoff bound (Dubhashi & Panconesi, 2009) shows

\begin{array}{ll} pr ({\hat{θ}}_{a b} ⩾ {\bar{θ}}_{a b} + t) ⩽ e^{- n_{a b} D ({\bar{θ}}_{a b} + t ‖ {\bar{θ}}_{a b})}, & 0 < t ⩽ 1 - {\bar{θ}}_{a b}, \\ pr ({\hat{θ}}_{a b} ⩽ {\bar{θ}}_{a b} - t) ⩽ e^{- n_{a b} D ({\bar{θ}}_{a b} - t ‖ {\bar{θ}}_{a b})}, & 0 < t ⩽ {\bar{θ}}_{a b} . \end{array}

Since these bounds also hold, respectively, for pr(θ̂_ab = θ̄_ab ± t), we may bound the probability of any given realization ϑ ∈ {0, 1/n_ab, …, 1} of θ̂_ab in terms of the Kullback–Leibler divergence of θ̄_ab from ϑ:

pr ({\hat{θ}}_{a b} = ϑ) ⩽ e^{- n_{a b} D (ϑ ‖ {\bar{θ}}_{a b})} .

By independence of the {A_ij}_i<j, this implies a corresponding bound on the probability of any θ̂:

pr (\hat{θ}) ⩽ exp {- \sum_{a ⩽ b} n_{a b} D ({\hat{θ}}_{a b} ‖ {\bar{θ}}_{a b})} .

(A1)

Now, let Θ̂ denote the range of θ̂ for fixed z, and observe that since each of the (K + 1)!/{(K − 1)!2!} lower-diagonal entries {θ̂_ab}_a⩽b of θ̂ can independently take on n_ab + 1 distinct values, we have that |Θ̂| = ∏_a⩽b(n_ab + 1). Subject to the constraint that ∑_a⩽b n_ab = N!/{(N − 2)!2!}, we see that this quantity is maximized when n_ab = N!(K − 1)!/{(N − 2)!(K + 1)!} for all a ⩽ b, and hence

| \hat{Θ} | ⩽ \left\{ \binom{N}{2} / \binom{K + 1}{2} + 1 \right\}^{\binom{K + 1}{2}} < {(N^{2} / K^{2} + 1)}^{(K^{2} + K) / 2} < {(N / K + 1)}^{K^{2} + K} .

(A2)

Now consider the event that ∑_a⩽b n_ab D(θ̂_ab ‖ θ̄_ab) is at least as large as some ∊ > 0; the probability of this event is given by pr(Θ̂_∊) for

{\hat{Θ}}_{∊} = {\hat{θ} \in \hat{Θ} : \sum_{a ⩽ b} n_{a b} D ({\hat{θ}}_{a b} ‖ {\bar{θ}}_{a b}) ⩾ ∊} .

(A3)

Since ∑_a⩽b n_ab D(θ̂_ab ‖ θ̄_ab) ⩾ ∊ for all θ̂ ∈ Θ̂_∊, we have from (A1) and (A3) that

pr ({\hat{Θ}}_{∊}) = \sum_{\hat{θ} \in {\hat{Θ}}_{∊}} pr (\hat{θ}) ⩽ \sum_{\hat{θ} \in {\hat{Θ}}_{∊}} e^{- \sum_{a ⩽ b} n_{a b} D ({\hat{θ}}_{a b} ‖ {\bar{θ}}_{a b})} ⩽ \sum_{\hat{θ} \in {\hat{Θ}}_{∊}} e^{- ∊} = | {\hat{Θ}}_{∊} | e^{- ∊},

and since |Θ̂_∊| ⩽ |Θ̂|, we may use (A2) to obtain, for fixed class assignment z,

pr {\sum_{a ⩽ b} n_{a b} D ({\hat{θ}}_{a b} ‖ {\bar{θ}}_{a b}) ⩾ ∊} < {(N / K + 1)}^{K^{2} + K} e^{- ∊} .

(A4)

Appealing to a union bound over all K^N possible class assignments and setting ∊ = log{K^N (N/K + 1)^{K²+K} /δ} then yields the claimed result.
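For completeness, the final step can be written out as follows: the union bound multiplies the right-hand side of (A4) by K^N, and equating the result to δ and solving for ∊ recovers the right-hand side of (2).

```latex
\begin{aligned}
\mathrm{pr}\Big\{\max_{z} \sum_{a \le b} n_{ab}\, D(\hat{\theta}_{ab} \,\|\, \bar{\theta}_{ab}) \ge \epsilon\Big\}
  &< K^{N} (N/K + 1)^{K^{2} + K} e^{-\epsilon}, \\
K^{N} (N/K + 1)^{K^{2} + K} e^{-\epsilon} = \delta
  \;&\Longleftrightarrow\;
  \epsilon = N \log K + (K^{2} + K) \log (N/K + 1) - \log \delta .
\end{aligned}
```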

Proof of Theorem 2. By Lemma 1, the difference L(A; z) − L̄_P(z) can be expressed for any fixed class assignment z as ∑_a⩽b n_ab D(θ̂_ab ‖ θ̄_ab) + X − E(X), where the first term satisfies the deviation bound of (A4), and X = ∑_i<j A_ij log{θ̄_{z_iz_j} / (1 − θ̄_{z_iz_j})} comprises a weighted sum of independent Ber(P_ij) random variables.

To bound the quantity |X − E(X)|, observe that since by assumption N⁻² ⩽ P_ij ⩽ 1 − N⁻², the same is true for each corresponding average θ̄_{z_iz_j}. As a result, the random variables X_ij = A_ij log{θ̄_{z_iz_j} / (1 − θ̄_{z_iz_j})} comprising X are each bounded in magnitude by C = 2 log N. This allows us to apply a Bernstein inequality for sums of bounded independent random variables due to Chung & Lu (2006, Theorems 2.8 and 2.9, p. 27), which states that for any ∊ > 0,

pr {| X - E (X) | ⩾ ∊} ⩽ 2 exp {- \frac{∊^{2}}{2 \sum_{i < j} E (X_{i j}^{2}) + (2 / 3) ∊ C}} .

(A5)

Finally, observe that since the event |L(A; z) − L̄_P (z)| ⩾ 2∊M implies either the event ∑_a⩽b n_ab D(θ̂_ab ‖ θ̄_ab) ⩾ ∊M or the event |X − E(X)| ⩾ ∊M, we have for fixed assignment z that

pr {| L (A; z) - {\bar{L}}_{P} (z) | ⩾ 2 ∊ M} ⩽ pr [{\sum_{a ⩽ b} n_{a b} D ({\hat{θ}}_{a b} ‖ {\bar{θ}}_{a b}) ⩾ ∊ M} \cup {| X - E (X) | ⩾ ∊ M}] .

Summing the right-hand sides of (A4) and (A5), and then over all K^N possible assignments, yields

\begin{array}{r} pr {max_{z} | L (A; z) - {\bar{L}}_{P} (z) | ⩾ 2 ∊ M} ⩽ exp {N log K + (K^{2} + K) log (N / K + 1) - ∊ M} \\ + 2 exp {N log K - \frac{∊^{2} M}{8 {log}^{2} N + (4 / 3) ∊ log N}}, \end{array}

where we have used the fact that $\sum_{i < j} E (X_{i j}^{2}) ⩽ 4 M {log}^{2} (N)$ in (A5). It follows directly that if K = 𝒪(N^{1/2}) and M = ω(N(log N)^{3+δ}), then lim_{N→∞} pr{max_z |L(A; z) − L̄_P (z)|/M ⩾ ∊} = 0 for every fixed ∊ > 0 as claimed.
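The two ingredients used in this step can be spelled out. Since A_ij² = A_ij and each log-odds term is bounded in magnitude by C = 2 log N, the second-moment sum is at most 4M log² N; substituting this, C = 2 log N and deviation ∊M into the exponent of (A5) gives the second exponential term after simplification.

```latex
\begin{aligned}
\sum_{i<j} E(X_{ij}^{2})
  &= \sum_{i<j} P_{ij} \log^{2}\!\Big\{\frac{\bar{\theta}_{z_i z_j}}{1 - \bar{\theta}_{z_i z_j}}\Big\}
  \le (2 \log N)^{2} \sum_{i<j} P_{ij} = 4 M \log^{2} N, \\
\frac{(\epsilon M)^{2}}{2 \cdot 4 M \log^{2} N + (2/3)(\epsilon M)(2 \log N)}
  &= \frac{\epsilon^{2} M}{8 \log^{2} N + (4/3)\, \epsilon \log N} .
\end{aligned}
```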

Proof of Theorem 3. To begin, note that Theorem 2 holds uniformly in z, and thus implies that

| {\bar{L}}_{P} (\bar{z}) - L (A; \bar{z}) | + | {\bar{L}}_{P} (\hat{z}) - L (A; \hat{z}) | = o_{P} (M) .

Since ẑ is the maximum-likelihood estimate of class assignment z̄, we know that L(A; ẑ) ⩾ L(A; z̄), implying that L(A; ẑ) = L(A; z̄) + δ for some δ ⩾ 0. Thus, by the triangle inequality,

| {\bar{L}}_{P} (\bar{z}) - {\bar{L}}_{P} (\hat{z}) + δ | ⩽ | {\bar{L}}_{P} (\bar{z}) - L (A; \bar{z}) | + | {\bar{L}}_{P} (\hat{z}) - (L (A; \bar{z}) + δ) | = o_{P} (M),

and since L̄_P (z̄) ⩾ L̄_P (ẑ) under any blockmodel with parameter z̄, we have L̄_P (z̄) − L̄_P (ẑ) = o_P (M).

Under conditions (i) and (ii) of Theorem 3, we will now show that also

{\bar{L}}_{P} (\bar{z}) - {\bar{L}}_{P} (\hat{z}) = \frac{N_{e} (\hat{z})}{N} Ω (M),

(A6)

holds for every realization of ẑ, thus implying that N_e(ẑ) = o_P (N) and proving the theorem.

To show (A6), first observe that any blockmodel class assignment vector z induces a corresponding partition of the set {P_ij}_i<j according to (i, j) ↦ (z_i, z_j). Formally, z partitions {P_ij}_i<j into L subsets (S₁, …, S_L) via the mapping

ζ_{i j} : (i = 1, \dots, N; j = i + 1, \dots, N) \to (l = 1, \dots, L) .

This partition is separable in the sense that there exists a bijection between {1, …, L} and the upper triangular portion of blockmodel parameter θ, such that we write θ_{ζ_ij} = θ_{z_iz_j} for membership vector z. More generally, for any partition Π of {P_ij}_i<j, we may define θ̄_l = |S_l|⁻¹ ∑_i<j P_ij 1{P_ij ∈ S_l} as the arithmetic average over all P_ij in the subset S_l indexed by ζ_ij = l. Thus we may also define

{\bar{L}}_{P}^{*} (Π) = \sum_{i < j} {P_{i j} log {\bar{θ}}_{ζ_{i j}} + (1 - P_{i j}) log (1 - {\bar{θ}}_{ζ_{i j}})},

so that ${\bar{L}}_{P}^{*}$ and L̄_P coincide on partitions corresponding to admissible blockmodel assignments z.

The establishment of (A6) proceeds in three steps: first, we construct and analyse a refinement of the partition Π^z induced by any blockmodel assignment vector z in terms of its error N_e(z); then, we show that refinements increase ${\bar{L}}_{P}^{*} (\cdot)$ ; finally, we apply these results to the maximum-likelihood estimate ẑ.

Lemma A1. Consider a K-class stochastic blockmodel with membership vector z̄, and let Π^z denote the partition of its associated {P_ij}_1⩽i<j⩽N induced by any z ∈ {1, …, K}^N. For every Π^z, there exists a partition Π^* that refines Π^z with the property that, if conditions (i) and (ii) of Theorem 3 hold,

{\bar{L}}_{P} (\bar{z}) - {\bar{L}}_{P}^{*} (Π^{*}) = \frac{N_{e} (z)}{N} Ω (M),

(A7)

where N_e(z) counts the number of nodes whose true class assignments under z̄ are not in the majority within their respective class assignments under z.

Lemma A2. Let Π′ be a refinement of any partition Π of the set {P_ij}_i<j; then ${\bar{L}}_{P}^{*} (Π^{'}) ⩾ {\bar{L}}_{P}^{*} (Π)$ .
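Lemma A2 can be checked numerically on a toy example. In the sketch below a partition is represented simply as a list of lists of probabilities P_ij; this representation, the function name and the numerical values are assumptions made for illustration.

```python
import math

def expected_loglik(partition):
    """L*_P for a partition given as a list of lists of probabilities P_ij."""
    total = 0.0
    for subset in partition:
        mean = sum(subset) / len(subset)   # theta_bar_l for this subset
        for p in subset:
            for a, b in ((p, mean), (1.0 - p, 1.0 - mean)):
                if a > 0:                  # convention: 0 * log 0 = 0
                    total += a * math.log(b)
    return total

coarse = [[0.1, 0.1, 0.6, 0.6], [0.3, 0.3]]
fine = [[0.1, 0.1], [0.6, 0.6], [0.3, 0.3]]    # splits the first subset
gain = expected_loglik(fine) - expected_loglik(coarse)
```

Here the refinement separates two distinct probability values that the coarse partition averaged together, so the gain is strictly positive; splitting a subset whose elements are all equal would leave the value unchanged.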

Since Lemma A1 applies to any admissible blockmodel assignment vector z, it also applies to the maximum-likelihood estimate ẑ for any realization of the data; each ẑ in turn induces a partition Π̂ of blockmodel edge probabilities {P_ij}_i<j, and (A7) holds with respect to its refinement Π^*. By Lemma A2, ${\bar{L}}_{P}^{*} (\hat{Π}) ⩽ {\bar{L}}_{P}^{*} (Π^{*})$ . Finally, observe that ${\bar{L}}_{P} (\hat{z}) = {\bar{L}}_{P}^{*} (\hat{Π})$ by the definition of ${\bar{L}}_{P}^{*}$ , and so ${\bar{L}}_{P} (\bar{z}) - {\bar{L}}_{P} (\hat{z}) ⩾ {\bar{L}}_{P} (\bar{z}) - {\bar{L}}_{P}^{*} (Π^{*})$ , thereby establishing (A6).

Proof of Lemma A1. The construction of Π^* will take several steps. For a given membership class under z, partition the corresponding set of nodes into subclasses according to the true class assignment z̄ of each node. Then remove one node from each of the two largest subclasses so obtained, and group them together as a pair; continue this pairing process until no more than one nonempty subclass remains, then terminate. Observe that if we denote pairs by their node indices as (i, j), then by construction z_i = z_j but z̄_i ≠ z̄_j.
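The pairing step for a single fitted class can be sketched as follows. The function name and data layout are hypothetical, true labels are assumed to be integers so that ties in the heap can be broken by label, and a heap is just one convenient way to find the two largest subclasses at each step.

```python
import heapq
from collections import defaultdict

def pair_within_class(nodes, z_true):
    """Pair nodes of one fitted class, always drawing from the two largest
    true-class subclasses, until at most one subclass is nonempty."""
    subclasses = defaultdict(list)
    for v in nodes:
        subclasses[z_true[v]].append(v)
    # heapq is a min-heap, so sizes are negated to pop the largest first
    heap = [(-len(members), label) for label, members in subclasses.items()]
    heapq.heapify(heap)
    pairs = []
    while len(heap) >= 2:
        size_a, a = heapq.heappop(heap)
        size_b, b = heapq.heappop(heap)
        pairs.append((subclasses[a].pop(), subclasses[b].pop()))
        for size, label in ((size_a + 1, a), (size_b + 1, b)):
            if size < 0:                   # subclass still has members
                heapq.heappush(heap, (size, label))
    return pairs

# One fitted class containing true classes of sizes 3, 1 and 1.
z_true = [0, 0, 0, 1, 2]
pairs = pair_within_class(range(5), z_true)
```

In this example the two nodes outside the majority true class are each paired with a majority-class node, so every pair joins nodes with the same fitted label but different true labels.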

Repeat the above procedure for each class under z, and let C₁ denote the total number of pairs thus formed. For each of the C₁ pairs (i, j), find all other distinct indices k for which the following holds:

D (P_{i k} ‖ \frac{P_{i k} + P_{j k}}{2}) + D (P_{j k} ‖ \frac{P_{i k} + P_{j k}}{2}) ⩾ C \frac{M K}{N^{2}},

(A8)

where C is the constant from condition (ii) of Theorem 3, and indices ik and jk in (A8) are to be interpreted, respectively, as ki whenever k < i, and kj whenever k < j. Let C₂ denote the total number of distinct triples that can be formed in this manner.
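To make the test in (A8) concrete, the sketch below evaluates its left-hand side for Bernoulli edge probabilities and compares it with the threshold CMK/N². The function names are hypothetical, D is taken to be the Kullback–Leibler divergence between Bernoulli distributions with the convention 0 log 0 = 0, and the numerical values in the example are arbitrary.

```python
import math


def bernoulli_kl(p, q):
    """Kullback-Leibler divergence D(p || q) between Bernoulli(p) and
    Bernoulli(q), using the convention 0 log 0 = 0.

    Assumes q lies strictly between 0 and 1 unless p equals q.
    """
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def pair_separation(p_ik, p_jk):
    """Left-hand side of (A8): divergence of each probability from
    the midpoint of the two."""
    mid = 0.5 * (p_ik + p_jk)
    return bernoulli_kl(p_ik, mid) + bernoulli_kl(p_jk, mid)


def satisfies_a8(p_ik, p_jk, C, M, K, N):
    """Check whether node k separates the pair (i, j) in the sense of (A8)."""
    return pair_separation(p_ik, p_jk) >= C * M * K / N**2


if __name__ == "__main__":
    # Arbitrary illustrative values, not taken from the paper.
    print(pair_separation(0.1, 0.5))
    print(satisfies_a8(0.1, 0.5, C=1.0, M=100, K=2, N=100))
```

The left-hand side is zero exactly when P_ik = P_jk, so a node k contributes a triple only if it distinguishes i from j by a margin of at least the stated order.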

We are now ready to construct the partition Π^* of the probabilities {P_ij}_1⩽i<j⩽N as follows: For each of the C₂ triples (i, j, k), remove P_ik (or P_ki if k < i) and P_jk (or P_kj) from their previous subset assignment under Π^z, and place them both in a new, distinct two-element subset. We observe the following:

the partition Π^* is a refinement of the partition Π^z induced by z: Since nodes i and j have the same class label under z in that z_i = z_j, it follows that for any k, P_ik and P_jk are in the same subset under Π^z;
since for each class at most one nonempty subclass remains after the pairing process, the number of pairs is at least half the number of misclassifications in that class. Therefore, we conclude C₁ ⩾ N_e(z)/2;
condition (ii) of Theorem 3 implies that for every pair of classes (a, b), there exists at least one class c for which (A8) holds eventually. Thus eventually, for any of the C₁ pairs (i, j), we obtain a number of triples at least as large as the cardinality of class c. Condition (i) in turn implies that the cardinality of the smallest class grows as Ω(N/K), and thus we may write C₂ = C₁Ω(N/K).

We can now express the difference ${\bar{L}}_{P} (\bar{z}) - {\bar{L}}_{P}^{*} (Π^{*})$ as a sum of nonnegative divergences $D (P_{i j} ‖ {\bar{θ}}_{ζ_{i j}^{*}})$, where $ζ_{i j}^{*}$ is the assignment mapping associated to Π^*, and use (A8) to bound this difference below:

{\bar{L}}_{P} (\bar{z}) - {\bar{L}}_{P}^{*} (Π^{*}) = \sum_{i < j} D (P_{i j} ‖ {\bar{θ}}_{ζ_{i j}^{*}}) = C_{2} Ω (\frac{M K}{N^{2}}) = \frac{N_{e} (z)}{2} Ω (\frac{M}{N}) .

Proof of Lemma A2. Let Π′ be a refinement of any partition Π of the set {P_ij}_i<j, and given a ∈ {1, …, L′} indexing S′_a, let F(a) denote its index under Π. We show that ${\bar{L}}_{P}^{*} (Π^{'}) ⩾ {\bar{L}}_{P}^{*} (Π)$ as follows:

\begin{array}{rl} {\bar{L}}_{P}^{*} (Π^{'}) & = \sum_{a = 1}^{L^{'}} | S_{a}^{'} | \{ {\bar{θ}}_{a}^{'} \log {\bar{θ}}_{a}^{'} + (1 - {\bar{θ}}_{a}^{'}) \log (1 - {\bar{θ}}_{a}^{'}) \} \\ & ⩾ \sum_{a = 1}^{L^{'}} | S_{a}^{'} | \{ {\bar{θ}}_{a}^{'} \log {\bar{θ}}_{F (a)} + (1 - {\bar{θ}}_{a}^{'}) \log (1 - {\bar{θ}}_{F (a)}) \} \\ & = \sum_{b = 1}^{L} | S_{b} | \{ {\bar{θ}}_{b} \log {\bar{θ}}_{b} + (1 - {\bar{θ}}_{b}) \log (1 - {\bar{θ}}_{b}) \} = {\bar{L}}_{P}^{*} (Π), \end{array}

where the inequality holds by nonnegativity of Kullback–Leibler divergence, and the second equality follows from the fact that Π′ is a refinement of Π.
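Lemma A2 can be checked numerically on a small example. The sketch below assumes that each subset's parameter θ̄ is the arithmetic mean of the probabilities it contains, which is consistent with the second equality in the proof but is defined outside this appendix; the function names and the example partitions are hypothetical.

```python
import math


def entropy_term(theta):
    """theta log theta + (1 - theta) log(1 - theta), with 0 log 0 = 0."""
    return sum(t * math.log(t) for t in (theta, 1.0 - theta) if t > 0.0)


def l_star(partition):
    """Sum over subsets S of |S| times entropy_term(mean of S).

    partition : list of nonempty lists of edge probabilities.
    Taking the subset mean as theta-bar is an assumption here.
    """
    total = 0.0
    for subset in partition:
        theta_bar = sum(subset) / len(subset)
        total += len(subset) * entropy_term(theta_bar)
    return total


if __name__ == "__main__":
    # A coarse partition and a refinement of it (arbitrary values).
    coarse = [[0.1, 0.1, 0.5, 0.5], [0.2, 0.8]]
    fine = [[0.1, 0.1], [0.5, 0.5], [0.2], [0.8]]
    print(l_star(coarse), l_star(fine))
    print(l_star(fine) >= l_star(coarse))
```

Splitting a subset whose probabilities are all equal leaves the value unchanged, so the inequality is strict only when a refinement separates unequal probabilities.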

Supplementary material

Supplementary material available at Biometrika online includes the dataset from § 4.1.

References

Airoldi EM, Blei DM, Fienberg SE, Xing EP. Mixed membership stochastic blockmodels. J Mach Learn Res. 2008;9:1981–2014.
Bickel PJ, Chen A. A nonparametric view of network models and Newman–Girvan and other modularities. Proc Nat Acad Sci USA. 2009;106:21068–73. doi: 10.1073/pnas.0907096106.
Chung FRK, Lu L. Complex Graphs and Networks. Providence, Rhode Island: American Mathematical Society; 2006.
Copic J, Jackson MO, Kirman A. Identifying community structures from network data via maximum likelihood methods. Berk Electron J Theor Economet. 2009;9 RePEc:bpj:bejtec:v:9:y:2009:i:1:n:30.
Dubhashi DP, Panconesi A. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge, U.K.: Cambridge University Press; 2009.
Fienberg SE, Meyer MM, Wasserman SS. Statistical analysis of multiple sociometric relations. J Am Statist Assoc. 1985;80:51–67.
Girvan M, Newman MEJ. Community structure in social and biological networks. Proc Nat Acad Sci USA. 2002;99:7821–6. doi: 10.1073/pnas.122653799.
Goldenberg A, Zheng AX, Fienberg SE, Airoldi EM. A survey of statistical network models. Foundat Trend Mach Learn. 2009;2:129–233.
Goodreau S, Kitts J, Morris M. Birds of a feather, or friend of a friend? Using exponential random graph models to investigate adolescent social networks. Demography. 2009;46:103–25. doi: 10.1353/dem.0.0045.
Handcock MS, Raftery AE, Tantrum JM. Model-based clustering for social networks. J R Statist Soc A. 2007;170:301–54.
Hoff PD. Modeling homophily and stochastic equivalence in symmetric relational data. In: Platt JC, Koller D, Singer Y, Roweis S, editors. Advances in Neural Information Processing Systems. Vol. 20. Cambridge, Massachusetts: MIT Press; 2008. pp. 657–64.
Holland P, Laskey KB, Leinhardt S. Stochastic blockmodels: some first steps. Social Networks. 1983;5:109–37.
Kallenberg O. Probabilistic Symmetries and Invariance Principles. New York: Springer; 2005.
Karrer B, Newman MEJ. Stochastic blockmodels and community structure in networks. Phys Rev E. 2011;83:016107–1–10. doi: 10.1103/PhysRevE.83.016107.
Lorrain F, White HC. Structural equivalence of individuals in social networks. J Math Sociol. 1971;1:49–80.
Mariadassou M, Robin S, Vacher C. Uncovering latent structure in valued graphs: a variational approach. Ann Appl Statist. 2010;4:715–42.
Nowicki K, Snijders TAB. Estimation and prediction for stochastic blockstructures. J Am Statist Assoc. 2001;96:1077–87.
Rohe K, Chatterjee S, Yu B. Spectral clustering and the high-dimensional stochastic blockmodel. Ann Statist. 2011;39:1878–915.
Snijders TAB, Nowicki K. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J Classif. 1997;14:75–100.
