Weighted stochastic block model

Tin Lok James Ng; Thomas Brendan Murphy

doi:10.1007/s10260-021-00590-6

. 2021 Sep 13;30(5):1365–1398. doi: 10.1007/s10260-021-00590-6

Weighted stochastic block model

Tin Lok James Ng ^1,^✉, Thomas Brendan Murphy ²

PMCID: PMC8608781 PMID: 34840548

Abstract

We propose a weighted stochastic block model (WSBM) which extends the stochastic block model to the important case in which edges are weighted. We address the parameter estimation of the WSBM by use of maximum likelihood and variational approaches, and establish the consistency of these estimators. The problem of choosing the number of classes in a WSBM is addressed. The proposed model is applied to simulated data and an illustrative data set.

Keywords: Weighted stochastic block model, Variational estimators, Maximum likelihood estimators, Consistency, Model selection

Introduction

Networks are used in many scientific disciplines to represent interactions among objects of interest. For example, in the social sciences, a network typically represents social ties between actors. In biological sciences, a network can represent interactions between proteins.

The stochastic block model (SBM) (Holland et al. 1983; Snijders and Nowicki 1997) is a popular generative model which partitions vertices into latent classes. Conditional on the latent class allocations, the connection probability between two vertices depends only on the latent classes to which the two vertices belong. Many extensions of the class SBM have been proposed which include the degree correlated SBM (Karrer and Newman 2011; Peng and Carvalho 2016), mixed membership SBM (Airoldi et al. 2008) and overlapping SBM (Latouche et al. 2011).

The SBM and many of its variants are usually restricted to Bernoulli networks. However, many binary networks are produced after applying a threshold to a weighted relationship (Ghasemian et al. 2016) which results in the loss of potentially valuable information. Although most of the literature has focused on binary networks, there is a growing interest in weighted graphs (Barrat et al. 2004; Newman 2004; Peixoto 2018).

In particular, a number of clustering methods have been proposed for weighted graphs including algorithm based and model based methods. Algorithm based methods for clustering of weighted graphs can be further divided into two classes: algorithms which do not explicitly optimize any criteria (Pons and Latapy 2005; von Luxburg 2007) and those directly optimize a criterion (Clauset et al. 2004; Stouffer and Bascompte 2011). Model based methods (Mariadassou et al. 2010; Aicher et al. 2013, 2015; Ludkin 2020) attempt to take into account the random variability in the data. A recent review of graph clustering methods is given by (Leger et al. 2014).

Mariadassou et al. (2010) presents a Poisson mixture random graph model for integer valued networks and proposes a variational inference approach for parameter estimation. The model can account for covariates via a regression model. In Zanghi et al. (2010), a mixture modelling framework is considered for random graphs with discrete or continuous edges. In particular, the edge distribution is assumed to follow an exponential family distribution. Aicher et al. (2013) proposed a general class of weighted stochastic block model for dense graphs where edge weights are assumed to be generated according to an exponential family distribution. In particular, their construction produces complete graphs, in which every pair of vertices is connected by some real-valued weight. Since most real-world networks are sparse, the constructed model cannot be applied directly. To address this shortcoming, Aicher et al. (2015) extends the work of Aicher et al. (2013) and models the edge existence using a Bernoulli distribution and the edge weights using an exponential family distribution. The contributions of edge-existence distribution and edge-weight distribution in the likelihood function are then combined via a simple tuning parameter. However, their construction does not result in a generative model and it is not obvious how to simulate network observations from the proposed model. More recently, Ludkin (2020) presents a generalization of the SBM which allows artbitrary edge weight distributions and proposes a reversible jump Markov chain Monte Carlo sampler for estimating the parameters and the number of blocks. However, the use of continuous probability distribution to model the edge weights implies that the resulting graph is complete whereby every edge is present. This assumption is unrealistic for many applications whereby a certain proportion of the real-valued edges is 0. Haj et al. (2020) presents a binomial SBM for weighted graphs and proposes a variational expectation maximization algorithm for parameter estimation.

In this paper, we propose a weighted Stochastic Block model (WSBM) with gamma weights which aims to capture the information of weights directly using a generative model. Both maximum likelihood estimation and variational methods are considered for parameter estimation where consistency results are derived. We also address the problem of choosing the number of classes using the Integrated Completed Likelihood (ICL) criteria (Biernacki et al. 2000). The proposed models and inference methodology are applied to an illustrative data set.

Model specification

In this section, we present the weighted stochastic block model in detail and introduce the main notations and assumptions.

We let $Ω = (V, X, Y)$ denote the set of directed weighted random graphs where $V = N$ is the set of countable vertices, $X = {0, 1}^{N \times N}$ is the set of edge-existence adjacency matrix, and $Y = R_{+}^{N \times N}$ is the set of weighted adjacency matrix. Given a random adjacency matrix $X = {X_{ij}}_{i, j \in N}$ , $X_{ij} = 1$ if an edge exists from vertex i to vertex j and $X_{ij} = 0$ otherwise. The associated weighted random adjacency matrix is given by: for $i \neq j$ , if $X_{ij} = 1$ , $Y_{ij} > 0$ , and $Y_{ij} = 0$ otherwise. Let $P$ be a probability measure on $Ω$ .

Generative model

We now describe the procedure of generating a sample of random graph (V, X, Y) with n vertices from $Ω$ .

Let $Z_{[n]} = (Z_{1}, \dots, Z_{n})$ be the vector of latent block allocations for the vertices, and set $θ = (θ_{1}, \dots, θ_{Q})$ with $\sum_{q} θ_{q} = 1$ . For each vertex $v_{i}$ , draw its block label $Z_{i} \in {1, \dots, Q}$ from a multinomial distribution
$\begin{matrix} Z_{i} \sim M (1 ; θ_{1}, \dots, θ_{Q}) . \end{matrix}$
Let $π = {(π_{ql})}_{q, l = 1}^{Q}$ be a $Q \times Q$ matrix with entries in [0, 1]. Conditional on the block allocations $Z_{[n]}$ , the entries $X_{ij}$ for $i \neq j$ of the edge-existence adjacency matrix $X_{[n]}$ is generated from independently a Bernoulli distribution
$\begin{matrix} X_{ij} | Z_{i} = q, Z_{j} = l \sim B (π_{ql}) . \end{matrix}$
Let $α = {(α_{ql})}_{q, l = 1}^{Q}$ and $β = {(β_{ql})}_{q, l = 1}^{Q}$ be $Q \times Q$ matrices with entries taking values in the positive reals. Conditional on the latent block allocations $Z_{[n]}$ and edge-existence adjacency matrix X, the weighted adjacency matrix $Y_{[n]}$ is generated independently from
$\begin{matrix} Y_{ij} | X_{ij} = 1, Z_{i} = q, Z_{j} = l \sim & Ga (α_{ql}, β_{ql}), \\ Y_{ij} | X_{ij} = 0, Z_{i} = q, Z_{j} = l \sim & δ_{{0}}, \end{matrix}$
where $Ga (\cdot, \cdot)$ denotes the gamma distribution and $δ_{{\cdot}}$ is the Dirac delta function.

The generative framework described above is a straightforward extension of the binary stochastic block model whereby a positive weight is generated according to a gamma distribution for each edge. In particular, (X, Z) is a realization of the binary directed SBM. The gamma distribution is chosen due to its flexibility in the sense that, depending on the value of its shape parameter, it can represent distributions of different shapes.

The log-likelihood of the observations $X_{[n]}$ and $Y_{[n]}$ is given by

\begin{matrix} L_{2} (Y_{[n]}, X_{[n]} ; θ, π, α, β) = log (\sum_{z_{[n]}} e^{L_{1} (Y_{[n]}, X_{[n]} ; z_{[n]}, π, α, β)} P {Z_{[n]} = z_{[n]}}), \end{matrix}

where the sum is over all possible latent block allocations, $P {Z_{[n]} = z_{[n]}} = \prod_{i = 1}^{n} θ_{z_{i}}$ is the probability of latent block allocation $z_{[n]}$ , and

\begin{matrix} L_{1} (Y_{[n]}, X_{[n]} ; z_{[n]}, π, α, β) \\ = \sum_{i \neq j} [X_{ij} {log π_{z_{i}, z_{j}} + log f (Y_{ij} ; α_{z_{i}, z_{j}}, β_{z_{i}, z_{j}})} + (1 - X_{ij}) log (1 - π_{z_{i}, z_{j}})] \\ = \sum_{i \neq j} {X_{ij} log π_{z_{i}, z_{j}} + (1 - X_{ij}) log (1 - π_{z_{i}, z_{j}})} \\ + X_{ij} {α_{z_{i}, z_{j}} log β_{z_{i}, z_{j}} + (α_{z_{i}, z_{j}} - 1) log Y_{ij} - β_{z_{i}, z_{j}} Y_{ij} - log Γ (α_{z_{i}, z_{j}})} \end{matrix}

is the complete data log-likelihood, where $f (\cdot ; a, b)$ is the gamma probability density function with shape parameter a and rate parameter b,

Assumptions

We present several assumptions needed for identifiability and consistency of maximum likelihood estimates. The following four assumptions were presented in Celisse et al. (2012) and are needed in this paper.

Assumption 1

For every $q \neq q^{^{'}}$ , there exists $l \in {1, \dots, Q}$ such that

\begin{matrix} π_{q, l} \neq π_{q^{^{'}}, l} or π_{l, q} \neq π_{l, q^{^{'}}} . \end{matrix}

Assumption 2

There exists $ζ \in (0, 1)$ such that for all $(q, l) \in {1, \dots, Q}^{2}$

\begin{matrix} π_{ql} \in [ζ, 1 - ζ] . \end{matrix}

Assumption 3

There exists $0 < γ < 1 / Q$ such that for all $q \in {1, \dots, Q}$ ,

\begin{matrix} θ_{q} \in [γ, 1 - γ] . \end{matrix}

Assumption 4

There exists $0 < γ < 1 / Q$ and $n_{0} \in N^{*}$ such that for all $q \in {1, \dots, Q}$ , for all $n \geq n_{0}$ ,

\begin{matrix} \frac{N_{q} (z_{[n]}^{*})}{n} \geq γ, \end{matrix}

where $N_{q} (z_{[n]}^{*}) = | {1 \leq i \leq n : z_{i}^{*} = q |$ and $z_{[n]}^{*}$ is any realized block allocation under the WSBM.

(A1) requires that no two classes have the same connectivity probabilities. If this assumption is violated, the resulting model has too many classes and is non-identifiable. (A2) requires that the connectivity probability between any two classes strictly lies within a closed subset of the unit interval. Note that this assumption is slightly more restrictive compared to assumption 2 of Celisse et al. (2012) in that we do not consider the boundary cases where $π_{q, l} \in {0, 1}$ . The boundary cases require special treatment and are not pursued in this paper. (A3) ensures that no class is empty with high probability while (A4) is the empirical version of (A3). We note that (A4) is satisfied asymptotically under the generative framework in Sect. 2.1 since the block allocations are generated according to a multinomial distribution.

In addition to the four assumptions above, we also have the following constraints on the gamma parameters.

Assumption 5

For every $q \neq q^{^{'}}$ , there exists $l \in {1, \dots, Q}$ such that

\begin{matrix} (α_{q, l}, β_{q, l}) \neq (α_{q^{^{'}}, l}, β_{q^{^{'}}, l}) or (α_{l, q}, β_{l, q}) \neq (α_{l, q^{^{'}}}, β_{l, q^{^{'}}}) \end{matrix}

(A5) requires that no two classes have the same weight distribution. This assumption is the exact counterpart of (A1).

The log-likelihood function (1) contains degeneracies that prevent the direct estimation of parameters $θ, π, α, β$ . To see this, we note that the probability density function of a gamma distribution $Ga (a, b)$ is given by

\begin{matrix} f (y ; a, b) = \frac{y^{a - 1} exp (- b y) b^{a}}{Γ (a)} . \end{matrix}

By Stirling’s formula, we have

\begin{matrix} Γ (a) = \sqrt{2 π} a^{a - 1 / 2} exp (- a) (1 + O (a^{- 1})) . \end{matrix}

Setting $y = a b$ ,

\begin{matrix} f (y ; a, b) = \frac{y^{a - 1} exp (- a) {(a / y)}^{a}}{Γ (a)} = \frac{1}{\sqrt{2 π} y} \frac{\sqrt{a}}{1 + O (a^{- 1})} . \end{matrix}

Therefore, letting $a \to \infty$ while keeping $a b = y$ , we have $f (y ; a, b) \to \infty$ . One can therefore show that the log-likelihood function is unbounded above. To avoid likelihood degeneracy, we compactify the parameter space. That is, we restrict the parameter space to a compact subset which contains the true paraemters. Therefore, we have the following assumption.

Assumption 6

There exists $0 < α_{c} < α_{C} < \infty$ and $0 < β_{c} < β_{C} < \infty$ such that for all $(q, l) \in {1, \dots, Q}$ ,

\begin{matrix} α_{c} \leq α_{ql} \leq α_{C}, β_{c} \leq β_{ql} \leq β_{C} . \end{matrix}

With this assumption, it is easy to see that the log-likelihood function is bounded for any sample size.

Identifiability

Sufficient conditions for identifiability of binary SBM with two classes have been first obtained by Allman et al. (2009). Celisse et al. (2012) show that the SBM parameters are identifiable up to a permutation of class labels under the conditions that $π θ$ has distinct coordinates and $n \geq 2 Q$ . The condition on $π θ$ is mild since the set of vectors violating this assumption has Lebesgue measure 0. The identifiability of weighted SBM is more challenging where the only known result (Section 4 of Allman et al. (2011)) requires all entries of $(π, α, β)$ to be distinct. We note that the assumptions in the previous section are not necessarily sufficient but are necessary to ensure that the identifiability of the parameters.

Asymptotic recovery of class labels

We study the posterior probability distribution of the class labels $Z_{[n]}$ given the random adjacency matrix $X_{[n]}$ and weight matrix $Y_{[n]}$ , which is denoted by $P (Z_{[n]} | X_{[n]}, Y_{[n]})$ . Since $X_{[n]}$ and $Y_{[n]}$ are random, $P (Z_{[n]} | X_{[n]}, Y_{[n]})$ is also random.

Let $P^{*} (X_{[n]}, Y_{[n]}) : = P (X_{[n]}, Y_{[n]} | Z_{[n]} = z_{[n]}^{*})$ be the true conditional distribution of $(X_{[n]}, Y_{[n]})$ which depends on the true parameters $(θ^{*}, π^{*}, α^{*}, β^{*})$ . We study the convergence rate of $P (Z_{[n]} | X_{[n]}, Y_{[n]})$ towards 1 with respect to $P^{*}$ .

The matrices $π, α, β$ are permutation-invariant if one permutes both its rows and columns according to some permutation $σ : {1, \dots, Q} \to {1, \dots, Q}$ . Let $π^{σ}, α^{σ}, β^{σ}$ be the matrices defined by

\begin{matrix} π_{ql}^{σ} = π_{σ (q), σ (l)}, α_{ql}^{σ} = α_{σ (q), σ (l)}, β_{ql}^{σ} = β_{σ (q), σ (l)} \end{matrix}

and define the set

\begin{matrix} Σ = {σ : {1, \dots, Q} \to {1, \dots, Q} | π^{σ} = π, α^{σ} = α, β^{σ} = β} . \end{matrix}

Two vectors of class labels z and $z^{^{'}}$ are equivalent if there exists $σ \in Σ$ such that $z_{i}^{^{'}} = σ (z_{i}),$ for all i. We let [z] denote the equivalence class of z and will omit the square-brackets in the equivalence class notation as long as no confusion arises.

The following result extends Theorem 3.1 of Celisse et al. (2012) to the case of WSBM.

Theorem 1

Under assumptions (A1)–(A6), for every $t > 0$ ,

\begin{matrix} P^{*} [\sum_{[z_{[n]}] \neq [z_{[n]}^{*}]} \frac{P ([Z_{[n]}] = [z_{[n]}] | X_{[n]}, Y_{[n]})}{P ([Z_{[n]}] = [z_{[n]}^{*}] | X_{[n]}, Y_{[n]})} > t] = O (n e^{- κ n}) \end{matrix}

uniformly with respect to $z^{*}$ , and for some $κ > 0$ depending only on $π^{*}, α^{*}, β^{*}$ but not on $z^{*}$ . Here $z^{*} = {(z_{i}^{*})}_{i = 1}^{\infty}$ with $z_{i}^{*} \in {1, \dots, Q}$ . Furthermore, $P^{*}$ can be replaced by $P$ under assumptions (A1)–(A3) and (A5)–(A6).

Maximum likelihood estimation of WSBM parameters

For the binary SBM, consistency of parameter estimation have been shown for profile likelihood maximization (Bickel and Chen 2009), spectral clustering method (Rohe et al. 2011), method of moments approach (Bickel et al. 2011), method based on empirical degrees (Channarond et al. 2012), and others (Choi et al. 2012). Consistency of both maximum likelihood estimation and variational approximation method are established in Celisse et al. (2012) and Bickel et al. (2013) where asymptotic normality is also established in Bickel et al. (2013). Abbe (2018) reviews recent development in the stochastic block model and community detections.

Ambroise and Matias (2012) proposes a general class of sparse and weighted SBM where the edge distribution may exhibit any parametric form and studies the consistency and convergence rates of various estimators considered in their paper. However, their model requires the edge existence parameter to be constant across the graph, that is, $π_{ql} = π,$ for all q, l, or $π_{ql}$ can be modelled as $π_{ql} = a_{1} I_{q = l} + a_{2} I_{q \neq l}$ where $I_{\cdot}$ is the indicator function and $a_{1} \neq a_{2}$ . Furthermore, they also assume that conditional on the block assignments, the edge weight $Y_{ij} | X_{i} = q, X_{j} = l$ is modelled using a parametric distribution with a single parameter $θ_{ql}$ . They further impose the restriction that

\begin{matrix} θ_{ql} = \{\begin{matrix} θ_{in} if q = l, \\ θ_{out} if q \neq l . \end{matrix}) \end{matrix}

These assumptions are more restrictive than those imposed in this paper. Jog and Loh (2015) studies the problem of characterizing the boundary between success and failure of MLE when edge weights are drawn from discrete distributions. More recently, Brault et al. (2020) studies the consistency and asymptotic normality of the MLE and variational estimators for the latent block model which is a generalization of the SBM. However, the model considered in Brault et al. (2020) is restricted to the dense setting and requires the observations in the data matrix to be modelled by univariate exponential family distributions.

This section addresses the consistency of the MLE of WSBM. In particular, we extend the results obtained in the pioneering paper of Celisse et al. (2012) to the case of weighted graphs. Our proof closely follows the proof of consistency of the MLE in Celisse et al. (2012). The MLE consistency proof of $(π, α, β)$ and $θ$ require different treatments since there are $n (n - 1)$ edges but only n vertices. The following result established the MLE consistency of $(π, α, β)$ .

Theorem 2

Assume that assumptions (A1), (A2), (A3), (A5), (A6) hold. Let us define the MLE of $(θ^{*}, π^{*}, α^{*}, β^{*})$ by

\begin{matrix} (\hat{θ}, \hat{π}, \hat{α}, \hat{β}) : = arg max_{θ, π, α, β} L_{2} (Y_{[n]}, X_{[n]} ; θ, π, α, β) . \end{matrix}

Then for any metric $d (\cdot, \cdot)$ on $(π, α, β)$ ,

\begin{matrix} d ((\hat{π}, \hat{α}, \hat{β}), (π^{*}, α^{*}, β^{*})) \to_{n \to \infty}^{P} 0 . \end{matrix}

Under additional assumption on the rate of convergence of the estimators $(\hat{π}, \hat{α}, \hat{β})$ of $(π, α, β)$ , consistency of $\hat{θ}$ can be established.

Theorem 3

Let $(\hat{θ}, \hat{π}, \hat{α}, \hat{β})$ denote the MLE of $(θ^{*}, π^{*}, α^{*}, β^{*})$ and assume that $| | \hat{π} - π^{*} {| |}_{\infty} = o_{P} (\sqrt{log n} / n)$ , $| | \hat{α} - α^{*} {| |}_{\infty} = o_{P} (\sqrt{log n} / n)$ , and $| | \hat{β} - β^{*} {| |}_{\infty} = o_{P} (\sqrt{log n} / n)$ , then

\begin{matrix} d (\hat{θ}, θ^{*}) \to_{n \to \infty}^{P} 0 \end{matrix}

for any metric d in $R^{Q}$ .

Variational estimators

Direct maximization of the log-likelihood function is intractable except for very small graphs since it involves a sum over $Q^{n}$ terms. In practice, approximate algorithms such as Markov Chain Monte Carlo (MCMC) and variational inference algorithms are often used for parameter inference. For the SBM, both MCMC and variational inference approaches have been proposed (Snijders and Nowicki 1997; Daudin et al. 2008). Variational inference algorithms have also been developed for mixed membership SBM (Airoldi et al. 2008), overlapping SBM (Latouche et al. 2011), and the weighted SBM proposed in Aicher et al. (2013). This section develops a variational inference algorithm for the WSBM which can be considered a natural extension of the algorithm proposed in Daudin et al. (2008) for the SBM.

The variational method consists in approximating $P (Z_{[n]} = \cdot | X_{[n]}, Y_{[n]})$ by a product of n multinomial distributions. Let $D_{n}$ denote a set of product multinomial distributions

\begin{matrix} D_{n} = {D_{τ_{[n]}} = \prod_{i = 1}^{n} M (1 ; τ_{i, 1}, \dots, τ_{i, Q}) | τ_{[n]} \in S_{n}} \end{matrix}

where

\begin{matrix} S_{n} = {τ_{[n]} = (τ_{1}, \dots, τ_{n}) \in {({[0, 1]}^{Q})}^{n} | for all i, τ_{i} = (τ_{i, 1}, \dots, τ_{i, Q}), \sum_{q = 1}^{Q} τ_{i, q} = 1} . \end{matrix}

For any $D_{τ_{[n]}} \in D_{n}$ , the variational log-likelihood is defined by

\begin{matrix} J (Y_{[n]}, X_{[n]} ; τ_{[n]}, θ, π, α, β) = L_{2} (Y_{[n]}, X_{[n]} ; θ, π, α, β) - KL (D_{τ_{[n]}}, P (\cdot | X_{[n]}, Y_{[n]})) . \end{matrix}

Here $KL (\cdot, \cdot)$ denotes the Kullback-Leibler divergence between two probability distributions, which is nonnegative. Therefore, $J$ provides a lower bound on the log-likelihood function. We have that

\begin{matrix} J (Y_{[n]}, X_{[n]} ; τ_{[n]}, θ, π, α, β) = \sum_{i \neq j} \sum_{q, l} τ_{i, q} τ_{jl} (log b (X_{ij} ; π_{ql}) + X_{ij} log f (Y_{ij} ; α_{ql}, β_{ql})) \\ - \sum_{i} \sum_{q} τ_{iq} (log τ_{iq} - log θ_{q}) \end{matrix}

where $b (\cdot ; p)$ denotes the probability mass function of a Bernoulli distribution with parameter p, and recall that $f (\cdot ; α_{ql}, β_{ql})$ denotes the density function of a gamma distribution $Ga (α_{ql}, β_{ql})$ .

The variational algorithm works by iteratively maximizing the lower bound $J$ with respect to the approximating distribution $D_{τ_{[n]}}$ , and estimating the model parameters. Maximization of $J$ with respect to $D_{τ_{[n]}}$ consists of solving

\begin{matrix} {\hat{τ}}_{[n]} : = {argmax}_{τ_{[n]}} J (Y_{[n]}, X_{[n]} ; τ_{[n]}, θ, π, α, β), \end{matrix}

where $θ, π, α, β$ can be replaced by plug-in estimates. This has a closed-form solution given by

\begin{matrix} {\hat{τ}}_{iq} \propto θ_{q} \prod_{j \neq i} \prod_{l} b {(X_{ij} ; π_{ql})}^{{\hat{τ}}_{jl}} f {(Y_{ij} ; α_{ql}, β_{ql})}^{{\hat{τ}}_{jl} X_{ij}} . \end{matrix}

Conditional on ${\hat{τ}}_{[n]}$ , the variational estimators of $(θ, π, α, β)$ are found by solving,

\begin{matrix} (\tilde{θ}, \tilde{π}, \tilde{α}, \tilde{β}) = {argmax}_{θ, π, α, β} J (Y_{[n]}, X_{[n]} ; τ_{[n]}, θ, π, α, β) . \end{matrix}

Closed-form updates for $\tilde{θ}$ and $\tilde{π}$ exist and are given by

\begin{matrix} {\tilde{θ}}_{q} & = \frac{1}{n} \sum_{i} {\hat{τ}}_{iq} \end{matrix}

\begin{matrix} {\tilde{π}}_{ql} & = \frac{\sum_{i \neq j} {\hat{τ}}_{iq} {\hat{τ}}_{jl} X_{ij}}{\sum_{i \neq j} {\hat{τ}}_{iq} {\hat{τ}}_{jl}} . \end{matrix}

On the other hand, updates for $\tilde{α}$ and $\tilde{β}$ do not have a closed form since the maximum likelihood estimators of the two parameters of a gamma distribution do not have closed forms. However, using the fact that a gamma distribution is a special case of a generalized gamma distribution, Ye and Chen (2017) derived simple closed-form estimators for the two parameters of gamma distribution. The estimators were shown to be strongly consistent and asymptotically normal. For $q, l = 1, \dots, Q$ , let us define the quantities

\begin{matrix} {\tilde{W}}_{ql} & = \sum_{i \neq j, X_{ij} = 1} {\hat{τ}}_{iq} {\hat{τ}}_{jl} \\ {\tilde{U}}_{ql} & = \sum_{i \neq j, X_{ij} = 1} {\hat{τ}}_{iq} {\hat{τ}}_{jl} Y_{ij} \\ {\tilde{V}}_{ql} & = \sum_{i \neq j, X_{ij} = 1} {\hat{τ}}_{iq} {\hat{τ}}_{jl} log Y_{ij} \\ {\tilde{S}}_{ql} & = \sum_{i \neq j, X_{ij} = 1} {\hat{τ}}_{iq} {\hat{τ}}_{jl} Y_{ij} log Y_{ij}, \end{matrix}

then the updates for $α_{ql}, β_{ql}$ are given by

\begin{matrix} {\tilde{α}}_{ql} & = \frac{{\tilde{W}}_{ql} {\tilde{U}}_{ql}}{{\tilde{W}}_{ql} {\tilde{S}}_{ql} - {\tilde{V}}_{ql} {\tilde{U}}_{ql}} \end{matrix}

\begin{matrix} {\tilde{β}}_{ql} & = \frac{{\tilde{W}}_{ql}^{2}}{{\tilde{W}}_{ql} {\tilde{S}}_{ql} - {\tilde{V}}_{ql} {\tilde{U}}_{ql}} . \end{matrix}

We obtain the variational estimators $(\tilde{θ}, \tilde{π}, \tilde{α}, \tilde{β})$ by computing (3), (4), (5), (6), (7) until convergence.

We now address the consistency of the variational estimators derived above. The following two propositions are the counterpart of Theorem 2 and 3 for variational estimators. We omit the proof since they follow similar arguments as the proof of Corollary 4.3 and Theorem 4.4 of Celisse et al. (2012).

Proposition 1

Assume that assumptions (A1), (A2), (A3), (A5) and (A6), and let $(\tilde{θ}, \tilde{π}, \tilde{α}, \tilde{β})$ be the variational estimators defined above. Then for any distance $d (\cdot, \cdot)$ on $(π, α, β)$ ,

\begin{matrix} d ((\tilde{π}, \tilde{α}, \tilde{β}), (π^{*}, α^{*}, β^{*})) \to_{n \to \infty}^{P} 0 . \end{matrix}

Proposition 2

Assume that the variational estimators $(\tilde{π}, \tilde{α}, \tilde{β})$ converge at rate 1/n to $(π^{*}, α^{*}, β^{*})$ , respectively, and assumptions (A1), (A2), (A3), (A5), (A6) hold. We have

\begin{matrix} d (\tilde{θ}, θ^{*}) \to_{n \to \infty}^{P} 0, \end{matrix}

where d denotes any distance between vectors in $R^{Q}$ .

Note a stronger assumption on the convergence rate 1/n of $\tilde{π}, \tilde{α}, \tilde{β}$ is assumed for Proposition 2 compared to $\sqrt{log n} / n$ in Theorem 3. The same assumption is also used in Theorem 4.4. of Celisse et al. (2012).

Choosing the number of classes

In real world applications, the number of classes is typically unknown and needs to be estimated from the data. For the SBM, a number of methods have been developed to determine the number of classes, including log-likelihood ratio statistic (Wang and Bickel 2017), composite likelihood (Saldaña et al. 2017), exact integrated complete data likelihood (Côme and Latouche 2015) and Bayesian framework (Yan 2016) based methods. Model selection for variants of SBM have also been investigated (Latouche et al. 2014).

We apply the integrated classification likelihood (ICL) criterion developed by Biernacki et al. (2000) to choose the number of classes for the WSBM. The ICL is an approximation of the complete-date integrated likelihood. The ICL criterion for SBM have been derived by Daudin et al. (2008) under the assumptions that the prior distribution of $(θ, π)$ factorizes and a non-informative Dirichlet prior on $θ$ . Here we follow the approach of Daudin et al. (2008) to derive an approximate ICL for the WSBM.

Let $m_{Q}$ denote the model with Q blocks, the ICL criterion is an approximation of the complete-data integrated likelihood:

\begin{matrix} L (Y_{[n]}, X_{[n]}, Z_{[n]} | m_{Q}) \\ = \int_{θ, π, α, β} L (Y_{[n]}, X_{[n]}, Z_{[n]} ; θ, π, α, β, m_{Q}) g (θ, π, α, β) d θ d π d α d β & . \end{matrix}

where $g (θ, π, α, β)$ is the prior distribution of the parameters. Assuming a non-informative Jeffreys prior on $θ$ , a Stirling approximation to the gamma function, and finally a BIC approximation to the conditional log-likelihood function, the approximate ICL can be derived. For a model $m_{Q}$ with Q blocks, and assuming a non-informative Jeffreys prior $D i r (0.5, \dots, 0.5)$ on $θ$ , the approximate ICL criterion is:

\begin{matrix} I C L (m_{Q}) & = max_{θ, π, α, β} log L (Y_{[n]}, X_{[n]}, {\tilde{Z}}_{[n]} ; θ, π, α, β, m_{Q}) \\ - \frac{3}{2} (Q (Q + 1)) log n (n - 1) - \frac{Q - 1}{2} log n \end{matrix}

where ${\tilde{Z}}_{[n]}$ is the estimate of $Z_{[n]}$ . The derivation of above follows exactly the same lines as the proof of Proposition 8 of Daudin et al. (2008).

Simulation

We validate the theoretical results developed in previous sections by conducting simulation studies. In particular, we investigate how fast parameter estimates of WSBM converge to their true values and the accuracy of posterior block allocations. Additionally, we investigate the performance of ICL in choosing the number of blocks.

Experiment 1 (two-class model)

For each fixed number of nodes n, 50 realizations of WSBM are generated based on the fixed parameter setting given in (8 and 9). The variational inference algorithm derived in Sect. 5 is then applied to estimate the model parameters and class allocations.

We can see from Table 1 that the estimated model parameters converge to their true values as the number of nodes increases while the posterior class assignment is accurate across any number of nodes. Table 2 shows the ICL criterion tends to select the correct number of classes, especially when the number of nodes is large.

\begin{matrix} θ & = (0.7, 0.3), \end{matrix}

\begin{matrix} π & = (\begin{matrix} 0.8 & 0.2 \\ 0.3 & 0.9 \end{matrix}), α = (\begin{matrix} 10.0 & 0.3 \\ 3.0 & 0.5 \end{matrix}), β = (\begin{matrix} 2.0 & 1.0 \\ 0.2 & 1.0 \end{matrix}) . \end{matrix}

Table 1.

Convergence analysis of posterior class allocations and parameter estimates under the two-class model

n	$\sum_{i = 1}^{n} I_{z_{i} = z_{i}^{*}} / n$	$\| \| θ - θ^{*} {\| \|}_{2}$	$\| \| π - π^{*} {\| \|}_{F}$	$\| \| α - α^{*} {\| \|}_{F}$	$\| \| β - β^{*} {\| \|}_{F}$
25	1	0.031	0.057	0.855	0.434
50	1	0.029	0.035	0.654	0.322
100	1	0.027	0.012	0.256	0.070
200	1	0.025	0.006	0.116	0.046
500	1	0.021	0.002	0.053	0.024

Open in a new tab

Table 2.

Frequency of choosing Q blocks by ICL under different number of nodes n under the two-class model

n \| Q	2	3	4
25	44	3	3
50	48	2	0
100	50	0	0
200	50	0	0
500	50	0	0

Open in a new tab

Experiment 2 (three-class model)

50 network realizations are obtained under the three-class model with parameter values given in (10 and 11) for a range of values n.

\begin{matrix} θ & = (0.5, 0.3, 0.2) \end{matrix}

\begin{matrix} π & = (\begin{matrix} 0.60 & 0.20 & 0.30 \\ 0.30 & 0.90 & 0.10 \\ 0.60 & 0.50 & 0.20 \end{matrix}), α = (\begin{matrix} 0.50 & 2.00 & 1.00 \\ 0.30 & 0.02 & 6.00 \\ 2.00 & 0.05 & 3.00 \end{matrix}), β = (\begin{matrix} 5.00 & 0.40 & 5.00 \\ 3.00 & 12.00 & 0.70 \\ 6.00 & 0.20 & 0.60 \end{matrix}) . \end{matrix}

We can see from Table 3 that the estimated parameters converge to their true values quickly as the number of nodes increases. The ICL criterion tends to overestimate the number of classes when the number of nodes is small, but consistently selects the correct model when the number of nodes is large (Table 4).

Table 3.

Convergence analysis of posterior class allocations and parameter estimates under the three-class model

n	$\sum_{i = 1}^{n} I_{z_{i} = z_{i}^{*}} / n$	$\| \| θ - θ^{*} {\| \|}_{2}$	$\| \| π - π^{*} {\| \|}_{F}$	$\| \| α - α^{*} {\| \|}_{F}$	$\| \| β - β^{*} {\| \|}_{F}$
25	0.961	0.116	0.178	7.86	136.72
50	1	0.039	0.039	1.256	7.866
100	1	0.033	0.026	0.481	1.426
200	1	0.014	0.024	0.228	1.011
500	1	0.003	0.006	0.197	0.222

Open in a new tab

Table 4.

Frequency of choosing Q blocks by ICL under different number of nodes n under the three-block model

n\|Q	2	3	4	5
25	3	37	8	2
50	0	43	7	0
100	0	50	0	0
200	0	49	1	0
500	0	50	0	0

Open in a new tab

Computational complexity

The computational complexity of the variational estimators derived in Sect. 5 scales as $O (n^{2})$ , which has the same complexity as the variational algorithm developed by Daudin et al. (2008). Therefore, the algorithm may be prohibitively expensive for networks with more than 1000 nodes. The estimated computational time for the two-class model in Sect. 7.1 and for the three-class model in Sect. 7.2 for various values of n are shown in Fig. 1. The estimated computing time at each n is the average running time of the variational algorithm over 20 replications.

Fig. 1 — Estimated computing time for the two-class and three-class models

Application: Washington bike data set

We apply the WSBM to analyse the Washington bike sharing scheme data set.1 Information with respect to start stations and end stations of trips as well as length (travel time) of trips are available in the data set. We select a time window of one week staring from January 10th, 2016 and construct the adjacency matrix X and weight matrix Y as follows:

$X_{ij} = 1$ if there is trip starting from station i and finishing at station j.
$Y_{ij}$ is the total length of trips (in minutes) from station i to station j.

The resulting network consists of 370 nodes with an average out-degree of 36.14. The average total length of trips between any pair of stations is 42.15 minutes. We apply the ICL criterion to select the number of classes for the WSBM. For each number of classes Q, the variational inference algorithm is fitted to the network 20 times with 20 random initializations and the highest value of ICL is recorded, and the six-class model is chosen (Table 5). Each bike station is plotted on the map in Fig. 2 where its colour represents the estimated class assignment. We observe that bike stations in class 6 (colored in brown) tend to be concentrated in the central area of Washington wheraas stations in class 3 (colored in red) tend to be located further from the center. Figure 2 shows some spatial effect in the class assignment of bike stations whereby stations that are close in distance tend to be in the same cluster, with the exception of class 3 (colored in red). One potential extension of the model is to take into account the spatial locations of the bike stations as covariates.

Table 5.

Model selection for the Washington Bike dataset using ICL criterion

Q	ICL
1	− 107157.24
2	− 93505.61
3	− 91688.33
4	− 90813.17
5	− 90300.90
6	− 89061.38
7	− 89326.35

Open in a new tab

Fig. 2 — Bike stations. Class 1: blue. Class 2: green. Class 3: red. Class 4: cyan. Class 5: black. Class 6: brown (color figure online)

The estimated class proportions $\hat{θ}$ shown in 12 indicate that class 3 has the largest number of stations whereas class 4 has the smallest number of stations. The estimated $\hat{π}$ shows that within class connectivity is generally higher compared to between class connectivity. We further observe that the connection probabilities between bike stations in class 1, 4, and 6 are substantially higher. Interestingly, we observe a near symmetry in the matrix $\hat{π}$ indicating that the probability of having a trip from a station in class k to another station in class l is similar with the probability of having a trip from class l to class k.

The estimated densities of travel time between each pair of classes are shown in Fig. 3. We observe that the majority of the estimated densities have mode and mean close to 0, particularly for the estimated densities in the diagonal of Fig. 3. This implies that the total travel times between stations in the same class are quite short. In comparison, the total travel time between stations in different classes tend to be longer. This is reasonable as the distance between bike stations in different classes tend to be longer which in turn requires longer travel time.

\begin{matrix} \hat{θ} & = (0.1985, 0.0975, 0.3256, 0.0270, 0.1377, 0.2136) \end{matrix}

\begin{matrix} \hat{π} & = (\begin{matrix} 0.2847 & 0.0539 & 0.0036 & 0.4268 & 0.0125 & 0.2811 \\ 0.0594 & 0.0630 & 0.0172 & 0.1254 & 0.0156 & 0.0791 \\ 0.0049 & 0.0156 & 0.0185 & 0.0079 & 0.0092 & 0.0029 \\ 0.4113 & 0.1140 & 0.0136 & 0.8438 & 0.0532 & 0.4494 \\ 0.0187 & 0.0318 & 0.0120 & 0.0910 & 0.2159 & 0.0241 \\ 0.2715 & 0.0514 & 0.0017 & 0.4965 & 0.0159 & 0.7268 \end{matrix}), \end{matrix}

Discussion

This paper proposes a weighted stochastic block model (WSBM) for networks. The proposed model is an extension of the stochastic block model. A variational inference strategy is developed for parameter estimation. Asymptotic properties of maximum likelihood estimators and variational estimators are derived, and the problem of choosing the number of classes is addressed by using an ICL criteria. Simulation studies are conducted to evaluate the performance of variational estimators and the use of ICL to determine the number of classes. The proposed model and inference methods are an illustrative data set.

It is straightforward to extend the WSBM to allow node covariates. Let $w_{i} \in R^{d}$ be the covariates for each node $i = 1, \dots, n$ , and let $w_{ij}$ be the covariates for each pair of nodes, $i, j = 1, \dots, n, i \neq j$ . The edge probability $p_{ij}$ between a pair of nodes i, j with node i in block q and node j in block l can be modelled as

\begin{matrix} log \frac{p_{ij}}{1 - p_{ij}} = ξ_{q l, 0} + ξ_{q l, 1}^{T} w_{ij} + ξ_{q l, 2}^{T} w_{i} + ξ_{q l, 3}^{T} w_{j}, \end{matrix}

where $ξ_{q l, 0} \in R$ and $ξ_{q l, 1}, ξ_{q l, 2}, ξ_{q l, 3} \in R^{d}$ . Conditional on block assignments and existence of an edge between a pair of nodes i, j, $Y_{ij}$ can be modelled as a Gamma random variable with mean $μ_{ij}$ and variance $σ_{ij}$ where

\begin{matrix} μ_{ij} & = E (Y_{ij} | Z_{i} = q, Z_{j} = l, X_{ij} = 1) = exp (ϕ_{q l, 0} + ϕ_{q l, 1}^{T} w_{ij} + ϕ_{q l, 2}^{T} w_{i} + ϕ_{q l, 3}^{T} w_{j}), \\ σ_{ij} & = V a r (Z_{i} = q, Z_{j} = l, X_{ij} = 1) = ν_{ql} \end{matrix}

where $ϕ_{q l, 0} \in R$ and $ϕ_{q l, 1}, ϕ_{q l, 2}, ϕ_{q l, 3} \in R^{d}$ .

Many possible future extensions are possible. First, it is desirable to investigate further theoretical properties of maximum likelihood and variational estimators of WSBM parameters such as asymptotic normality of the estimators. Furthermore, some of the assumptions imposed in this work in order to ensure consistency of the estimators maybe relaxed. Moreover, the number of blocks is assumed to be fixed in the asymptotic analysis of the estimators. It would be interesting to allow the number of blocks to grows as the number of nodes grows.

Auxiliary results

Definition 1

A random variable X with mean $μ = E (X)$ is sub-exponential if there are non-negative parameters $(ν, b)$ such that

\begin{matrix} E (e^{t (X - μ)}) \leq e^{\frac{ν^{2} t^{2}}{2}} for all | t | < \frac{1}{b} . \end{matrix}

The following lemma is a straight forward consequence of the definition.

Lemma 1

If the independent random variables ${X_{i}}_{i = 1}^{n}$ are sub-exponential with parameters $(ν_{i}, b_{i})$ , $i = 1, \dots, n$ , then $\sum_{i = 1}^{n} X_{i}$ is sub-exponential with parameters $(\sqrt{\sum_{i = 1}^{n} ν_{i}^{2}}, {max}_{i = 1}^{n} b_{i})$ .

The following results show that for a Gamma random variable Y and a Bernoulli random variable X, both XY and $X log Y$ are sub-exponential random variables. They are useful for the proof of the main theorems.

Proposition 3

If $Y \sim G a (a, b)$ and $X \sim B e r (π)$ , YX is a sub-exponential random variable.

Proof

The expectation of YX is given by

\begin{matrix} μ : = E (Y X) = π \frac{a}{b}, \end{matrix}

and thus

\begin{matrix} E (e^{t μ}) = exp (t π \frac{a}{b}) . \end{matrix}

The moment generating function of YX is given by

\begin{matrix} E (e^{tYX}) = (1 - π) + π (1 - \frac{t}{b})^{- a} \end{matrix}

which is defined for $t < b$ . Taylor series expansion of ${(1 - \frac{t}{b})}^{- a}$ around $t = 0$ gives that

\begin{matrix} {(1 - \frac{t}{b})}^{- a} = 1 + \frac{a}{b} t + \frac{a (a + 1)}{b^{2}} \frac{t^{2}}{2} + O (t^{3}) . \end{matrix}

Similary, Taylor series expansion of $exp (- t \frac{a}{b} π)$ around 0 gives that

\begin{matrix} exp (- t \frac{a}{b} π) = 1 - \frac{a}{b} π t + \frac{a (a + 1)}{b^{2}} π \frac{t^{2}}{2} + O (t^{3}) . \end{matrix}

Thus, we have

\begin{matrix} E (e^{tYX}) & = 1 - π + π + \frac{a}{b} π t + \frac{a (a + 1)}{b^{2}} π \frac{t^{2}}{2} + O (t^{3}) \\ = 1 + \frac{a}{b} π t + \frac{a (a + 1)}{b^{2}} π \frac{t^{2}}{2} + O (t^{3}) . \end{matrix}

This leads to

\begin{matrix} E (e^{t (Y X - μ)}) & = (1 + \frac{a}{b} π t + \frac{a (a + 1)}{b^{2}} π \frac{t^{2}}{2} + O (t^{3})) (1 - \frac{a}{b} π t + \frac{a (a + 1)}{b^{2}} π \frac{t^{2}}{2} + O (t^{3})) \\ = 1 + \frac{(a^{2} + 2 a) π}{b^{2}} t^{2} + O (t^{3}) . \end{matrix}

Hence, we can choose suitable $ν, b$ such that

\begin{matrix} E (e^{t (Y X - μ)}) \leq exp (\frac{ν^{2} t^{2}}{2}) for all | t | < \frac{1}{b} . \end{matrix}

$□$

Proposition 4

If $Y \sim G a (a, b)$ and $X \sim B e r (π)$ , $X log Y$ is a sub-exponential random variable.

Proof

The proof is analogous to the proof of Proposition 3. It is straightforward to show that

\begin{matrix} E (e^{t X log Y}) = (1 - π) + π \frac{Γ (α + t)}{Γ (α) β^{t}} \end{matrix}

and

\begin{matrix} μ : = E (X log Y) = π (Ψ (α) - log (β)) \end{matrix}

where $Ψ (\cdot)$ is the digamma function. The rest of the proof follows the same line as the proof of Proposition 3 by considering the Taylor series expansion of $e^{- μ t}$ and $E (e^{t X log Y})$ around $t = 0$ . $□$

The following lemma provides an upper bound for the tail probability of a sub-exponential random variable.

Lemma 2

(Sub-exponential tail bound)Suppose that X with mean $μ$ is sub-exponential with parameters $(ν, b)$ . Then

\begin{matrix} P (X \geq μ + t) \leq exp [- min (\frac{t^{2}}{2 ν^{2}}, \frac{t}{2 b})] . \end{matrix}

The following inequality for suprema of random processes is needed for the proof of Theorem 2.

Proposition 5

(Baraud 2010, Theorem 2.1) Let ${(S (g))}_{g \in G}$ be a family of real valued and centered random variables. Fix some $g_{0}$ in $G$ .

Suppose the following two conditions hold:

1.

There exist two arbitrary norms

| | \cdot {| |}_{2}

and

| | \cdot {| |}_{\infty}

G

and a nonnegative constant c such that for all

g_{1}, g_{2} \in G (g_{1} \neq g_{2})

\begin{matrix} E [e^{t (S (g_{1}) - S (g_{2}))}] \leq exp (\frac{t^{2} | | g_{1} - g_{2} {| |}_{2}^{2}}{2 (1 - t c | | g_{1} - g_{2} | |_{\infty})}) \end{matrix}

for all

t \in [0, \frac{1}{c | | g_{1} - g_{2} {| |}_{\infty}})

2.

Let

S

be a linear space with finite dimension D endowed with the two norms

| | \cdot {| |}_{\infty}

| | \cdot {| |}_{2}

defined above, we assume that for constants

u > 0

and

b \geq 0

\begin{matrix} G \subset {g \in S : | | g - g_{0} | |_{\infty} \leq u_{1}, c | | g - g_{0} | |_{2} \leq u_{2}} . \end{matrix}

we have for all $x > 0$ ,

\begin{matrix} P [sup_{g \in G} | S (g) - S (g_{0}) | \geq κ (\sqrt{u^{2} (D + x)} + b (D + x))] \leq 2 e^{- x} . \end{matrix}

where $κ = 18$ .

Proof

In this section, we prove Theorems 1, 2, 3 for the special case $α_{q, l} = α_{11}$ for all $(q, l) \in {1, \dots, Q}$ . The more general case can be proved similary by using the fact that $X_{ij} log Y_{ij}$ is a sub-exponential random variable.

Proof of Theorem 1

Proof

Our proof is adapted from Celisse et al. (2012). Using the fact that for every $z_{[n]}^{^{'}} \in [z_{[n]}]$ , $P (z_{[n]}^{^{'}} | X_{[n]}, Y_{[n]}) = P (z_{[n]} | X_{[n]}, Y_{[n]})$ , we have that

\begin{matrix} P (Z_{[n]} & = [z_{[n]}] | X_{[n]}, Y_{[n]}) = \sum_{z_{[n]}^{^{'}} \in [z_{[n]}]} P (z_{[n]}^{^{'}} | X_{[n]}, Y_{[n]}) \\ = | [z_{[n]}] | P (z_{[n]} | X_{[n]}, Y_{[n]}) \end{matrix}

where $| [z_{[n]}] |$ is the cardinality of the equivalence class $[z_{[n]}]$ .

By applying a similar approach as in ( Celisse et al. (2012); “Appendix B2”), one can derive an upper bound for $P^{*} [\sum_{[z_{[n]}] \neq [z_{[n]}^{^{'}}]} \frac{P ([Z_{[n]}] = [z_{[n]}] | X_{[n]}, Y_{[n]})}{P ([Z_{[n]}] = [z_{[n]}^{*}] | X_{[n]}, Y_{[n]})} > t]$ :

\begin{matrix} P^{*} [\sum_{[z_{[n]}] \neq [z_{[n]}^{*}]} \frac{P ([z_{[n]}] | X_{[n]}, Y_{[n]})}{P ([z_{[n]}^{*}] | X_{[n]}, Y_{[n]})} > t] \\ \leq \sum_{r = 1}^{n} \sum_{z_{[n]} \notin [z_{[n]}^{*}], | | z_{[n]} - z_{[n]}^{*} {| |}_{0} = r} P^{*} [\frac{P (z_{[n]} | X_{[n]}, Y_{[n]})}{P (z_{[n]}^{*} | X_{[n]}, Y_{[n]})} > \frac{t}{n^{r + 1} {(Q - 1)}^{r}}] \end{matrix}

where $| | z_{[n]} - z_{[n]}^{*} {| |}_{0}$ is the number of difference between $z_{[n]}$ and $z_{[n]}^{*}$ .

To simplify notation, we write $z : = z_{[n]}$ , $X : = X_{[n]}$ , and $Y = Y_{[n]}$ . We have that

\begin{matrix} P^{*} [\frac{P (z | X, Y)}{P (z^{*} | X, Y)} > \frac{t}{n^{r + 1} {(Q - 1)}^{r}}] \\ = P^{*} [\frac{P (Y | X, z) P (X | z) P (z)}{P (Y | X, z^{*}) P (X | z^{*}) P (z^{*})} > \frac{t}{n^{r + 1} {(Q - 1)}^{r}}] \\ = P^{*} [log \frac{P (Y | X, z)}{P (Y | X, z^{*})} + log \frac{P (X | Z) P (z)}{P (X | z^{*}) P (z^{*})} > log \frac{t}{n^{r + 1} {(Q - 1)}^{r}}] \\ \leq P^{*} [| log \frac{P (Y | X, z)}{P (Y | X, z^{*})} - E^{Z = z^{*}} (log \frac{P (Y | X, z)}{P (Y | X, z^{*})}) | \\ > \frac{1}{2} (log \frac{t}{n^{r + 1} {(Q - 1)}^{r}} - E^{Z = z^{*}} (log \frac{P (Y | X, z)}{P (Y | X, z^{*})}) - E^{Z = z^{*}} (log \frac{P (X | z) P (z)}{P (X | z^{*}) P (z^{*})}))] \\ + P^{*} [| log \frac{P (X | z) P (z)}{P (X | z^{*}) P (z^{*})} - E^{Z = z^{*}} (log \frac{P (X | z) P (z)}{P (X | z^{*}) P (z^{*})}) |] \\ > \frac{1}{2} (log \frac{t}{n^{r + 1} {(Q - 1)}^{r}} - E^{Z = z^{*}} (log \frac{P (Y | X, z)}{P (Y | X, z^{*})}) - E^{Z = z^{*}} (log \frac{P (X | z) P (z)}{P (X | z^{*}) P (z^{*})}))] \\ = : I_{1} + I_{2} \end{matrix}

Consider the first term on the RHS of the last inequality,

\begin{matrix} log \frac{P (Y | X, z)}{P (Y | X, z^{*})} - E^{Z = z^{*}} (log \frac{P (Y | X, z)}{P (Y | X, z^{*})}) \\ = \sum_{i \neq j} (α_{11}^{*} log \frac{β_{z_{i}, z_{j}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}}) (X_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*}) + (β_{z_{i}^{*}, z_{j}^{*}}^{*} - β_{z_{i}, z_{j}}^{*}) (Y_{ij} X_{ij} - \frac{α_{11}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} π_{z_{i}^{*}, z_{j}^{*}}^{*}) & . \end{matrix}

The summand in the summation above vanishes when $β_{z_{i}, z_{j}}^{*} = β_{z_{i}^{*}, z_{j}^{*}}^{*}$ . For two vectors z and $z^{^{'}}$ , we define

\begin{matrix} D (z, z^{^{'}}) = {(i, j) | i \neq j, β_{z_{i}, z_{j}}^{*} \neq β_{z_{i}^{^{'}}, z_{j}^{^{'}}}^{*}} \end{matrix}

and let $N_{r} (z) = | D (z, z^{*}) |$ denote the number of terms in the summation. Set

\begin{matrix} s & = \frac{1}{N_{r} (z)} \frac{1}{2} {log \frac{t}{n^{r + 1} {(Q - 1)}^{r}} - E^{Z = z^{*}} (log \frac{P (Y | X, z)}{P (Y | X, z^{*})}) \\ - E^{Z = z^{*}} (log \frac{P (X | z) P (z)}{P (X | z^{*}) P (z^{*})})}, \end{matrix}

one can show by Lemma B.3 of Celisse et al. (2012) that there exists some positive constant $c^{*}$ such that $s \geq c^{*} > 0$ . We have that

\begin{matrix} P^{*} [\frac{1}{N_{r} (z)} | log \frac{P (Y | X, z)}{P (Y | X, z^{*})} - E^{Z = z^{*}} (log \frac{P (Y | X, z)}{P (Y | X, z^{*})}) | > 2 s] \\ \leq P^{*} [\frac{1}{N_{r} (z)} | \sum_{i \neq j} α_{11}^{*} log \frac{β_{z_{i}, z_{j}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} (X_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*}) | > s] \\ + P^{*} [\frac{1}{N_{r} (z)} | \sum_{i \neq j} (β_{z_{i}^{*}, z_{j}^{*}}^{*} - β_{z_{i}, z_{j}}^{*}) (Y_{ij} X_{ij} - \frac{α_{11}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} π_{z_{i}^{*}, z_{j}^{*}}^{*}) | > s] \end{matrix}

Assumption (A6) implies that the random variable $α_{11}^{*} (log β_{z_{i}, z_{j}}^{*} / β_{z_{i}^{*}, z_{j}^{*}}^{*}) X_{ij}$ is bounded for all i, j. An application of Hoeffding’s inequality yields that

\begin{matrix} P^{*} [\frac{1}{N_{r} (z)} | \sum_{i \neq j} α_{11}^{*} log \frac{β_{z_{i}, z_{j}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} (X_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*}) | > s] \leq exp (- \frac{N_{r} (z) s^{2}}{L_{1}}) \end{matrix}

for some constant $L_{1} > 0$ .

By Proposition 3, the random variable $Y_{ij} X_{ij}$ is sub-exponential and the tail bound for sub-exponential random variables (Lemma 2) implies that

\begin{matrix} P^{*} [\frac{1}{N_{r} (z)} | \sum_{i \neq j} (β_{z_{i}^{*}, z_{j}^{*}}^{*} - β_{z_{i}, z_{j}}^{*}) (Y_{ij} X_{ij} - \frac{α_{11}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} π_{z_{i}^{*}, z_{j}^{*}}^{*}) | > s] \\ \leq max {exp (- \frac{N_{r} (z) s^{2}}{L_{2}}), exp (- \frac{N_{r} (z) s}{L_{3}})} \end{matrix}

for some constants $L_{2}, L_{3} > 0$ . Proposition B.4 of Celisse et al. (2012) shows that $N_{r} (z)$ is bounded below by

\begin{matrix} N_{r} (z) \geq \frac{γ^{2}}{2} n | | z_{[n]} - z_{[n]}^{*} {| |}_{0} . \end{matrix}

Combining inequalities (14), (15), (16), it is straight forward to show that

\begin{matrix} I_{1} = & O (exp (- A_{1} n)) . \end{matrix}

By Theorem 3.1 of Celisse et al. (2012), we have

\begin{matrix} I_{2} = & O (exp (- A_{2} n)) \end{matrix}

for some constant $A_{2} > 0$ . Therefore, with $A : = min {A_{1}, A_{2}}$

\begin{matrix} P^{*} [\sum_{[z_{[n]}] \neq [z_{[n]}^{*}]} \frac{P ([z_{[n]}] | X_{[n]}, Y_{[n]})}{P ([z_{[n]}^{*}] | X_{[n]}, Y_{[n]})} > t] \\ \leq \sum_{r = 1}^{n} (\binom{n}{r}) {(Q - 1)}^{r} O (exp (- A n)) \\ \to 0 \end{matrix}

as $n \to \infty$ . Since the upper bound does not depend on $z^{*}$ , $P^{*}$ can be replaced by $P$ . $□$

Proof of Theorem 2

Proof

As the first step of the proof, we define the normalized complete data log-likelihood function

\begin{matrix} ϕ_{n} (z_{[n]}, π, α, β) = \frac{1}{n (n - 1)} L_{1} (X_{[n]}, Y_{[n]} ; z_{[n]}, π, α, β) \\ = \frac{1}{n (n - 1)} [\sum_{i \neq j} {X_{ij} log π_{z_{i}, z_{j}} + (1 - X_{ij}) log (1 - π_{z_{i}, z_{j}})} \\ + X_{ij} {α_{z_{i}, z_{j}} log β_{z_{i}, z_{j}} + (α_{z_{i}, z_{j}} - 1) log Y_{ij} - β_{z_{i}, z_{j}} Y_{ij} - log Γ (α_{z_{i}, z_{j}})}], \end{matrix}

and its expectation

\begin{matrix} Φ_{n} (z_{[n]}, π, α, β) = E [ϕ_{n} (z_{[n]}, π, α, β) | Z_{[n]} = z_{[n]}^{*}] \\ = \frac{1}{n (n - 1)} [\sum_{i \neq j} {π_{z_{i}^{*}, z_{j}^{*}}^{*} log π_{z_{i}, z_{j}} + (1 - π_{z_{i}^{*}, z_{j}^{*}}^{*}) log (1 - π_{z_{i}, z_{j}})} \\ + π_{z_{i}^{*}, z_{j}^{*}}^{*} {α_{z_{i}, z_{j}} log β_{z_{i}, z_{j}} + (α_{z_{i}, z_{j}} - 1) (log β_{z_{i}^{*}, z_{j}^{*}}^{*} + ψ (α_{z_{i}^{*}, z_{j}^{*}}^{*})) \\ - β_{z_{i}, z_{j}} \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} - log Γ (α_{z_{i}, z_{j}})}] . \end{matrix}

The following proposition shows that $ϕ_{n} (z_{[n]}, π, α, β)$ uniformly converges to $Φ_{n} (z_{[n]}, π, α, β)$ . This result is an extension of Proposition 3.5 of Celisse et al. (2012).

Proposition 6

Under assumptions (A1), (A2), (A5), (A6), we have

\begin{matrix} sup_{P} | ϕ_{n} (z_{[n]}, π, α, β) - Φ_{n} (z_{[n]}, π, α, β) | \to_{n \to \infty}^{P} 0 \end{matrix}

where $P : = {(z_{[n]}, π, α, β) : (A 1), (A 2), (A 5), (A 6)}$ .

Proof

We have that

\begin{matrix} | ϕ_{n} (z_{[n]}, π, α, β) - Φ_{n} (z_{[n]}, π, α, β) | \\ \leq ρ_{n} | \sum_{i \neq j} (X_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*}) (log \frac{π_{z_{i}, z_{j}}}{1 - π_{z_{i}, z_{j}}} + α_{z_{i}, z_{j}} log β_{z_{i}, z_{j}} - log Γ (α_{11})) | \\ + ρ_{n} | \sum_{i \neq j} (X_{ij} Y_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*} \frac{α_{z_{i}, z_{j}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}}) β_{z_{i}, z_{j}} | \\ + ρ_{n} | \sum_{i \neq j} (X_{ij} log Y_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*} log β_{z_{i}^{*}, z_{j}^{*}}^{*} - π_{z_{i}, z_{j}}^{*} ψ (α_{z_{i}^{*}, z_{j}^{*}}^{*})) (α_{z_{i}, z_{j}} - 1) | \end{matrix}

where $ρ_{n} = 1 / n (n - 1)$ . Notice that under Assumption (A2) and (A6),

\begin{matrix} Φ_{n} (z_{[n]}, π, α, β) < + \infty . \end{matrix}

Therefore, by Proposition 3.5 of Celisse et al. (2012),

\begin{matrix} sup_{P} ρ_{n} | \sum_{i \neq j} (X_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*}) (log \frac{π_{z_{i}, z_{j}}}{1 - π_{z_{i}, z_{j}}} + α_{z_{i}, z_{j}} log β_{z_{i}, z_{j}} - log Γ (α_{z_{i}, z_{j}})) | \to_{n \to \infty}^{P} 0 \end{matrix}

To bound the second term on the RHS of the inequality 17, we apply Proposition 5 (Theorem 2.1 of Baraud (2010)). We first define

\begin{matrix} S_{n} (β) = \sum_{i \neq j} (X_{ij} Y_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*} \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}}) β_{z_{i}, z_{j}} . \end{matrix}

Now, for two parameters $β^{(1)}$ and $β^{(2)}$ ,

\begin{matrix} S_{n} (β^{(1)}) - S_{n} (β^{(2)}) & = \sum_{i \neq j} (X_{ij} Y_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*} \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}}) (β_{z_{i}, z_{j}}^{(1)} - β_{z_{i}, z_{j}}^{(2)}) \\ = \sum_{q, l} \sum_{i \neq j, z_{i} = q, z_{j} = l} (X_{ij} Y_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*} \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}}) (β_{ql}^{(1)} - β_{ql}^{(2)}) . \end{matrix}

Since $X_{ij} Y_{ij}$ is sub-exponential with parameters $(ν, b)$ , $\sum_{i \neq j, z_{i} = q, z_{j} = l} X_{i} X_{j}$ is sub-exponential with parameters $(\sqrt{n_{ql}} ν, b)$ by Lemma 3, where $n_{ql} = | {(i, j) : z_{i} = q, z_{j} = l} |$ . Therefore, define the norms

\begin{matrix} | | β^{(1)} - β^{(2)} {| |}_{2}^{2} & = n^{2} ν^{2} \sum_{q, l} {(β_{ql}^{(1)} - β_{ql}^{(2)})}^{2} \\ | | β^{(1)} - β^{(2)} {| |}_{\infty}^{2} & = \sum_{q, l} {(β_{ql}^{(1)} - β_{ql}^{(2)})}^{2}, \end{matrix}

we have

\begin{matrix} E [exp (t (S_{n} (β^{(1)}) - S_{n} (β^{(2)}))] \\ = \prod_{q, l} E [exp (t (β_{ql}^{(1)} - β_{ql}^{(2)}) \sum_{i \neq j, z_{i} = q, z_{j} = l} (X_{ij} Y_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*} \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}}))] \\ \leq \prod_{q, l} exp {\frac{n_{ql} ν^{2} t^{2} {(β_{ql}^{(1)} - β_{ql}^{(2)})}^{2}}{2}} \\ \leq exp {\frac{n^{2} ν^{2} t^{2} {(β_{ql}^{(1)} - β_{ql}^{(2)})}^{2}}{2}} \\ = exp {\frac{t^{2} | | β^{(1)} - β^{(2)} {| |}_{2}^{2}}{2}} \\ \leq exp {\frac{t^{2} | | β^{(1)} - β^{(2)} {| |}_{2}^{2}}{2 (1 - t c | | β^{(1)} - β^{(2)} | |_{\infty})}} \end{matrix}

for all

\begin{matrix} t \in [0, \frac{1}{c | | β^{(1)} - β^{(2)} | |}), \end{matrix}

where c is some negative constant. Therefore, the first condition of Proposition 5 is satisfied.

Fix some $β^{(0)}$ , (A6) implies that there exist $u_{1} = O (n)$ and $u_{2} = O (1)$ such that $| | β - β^{(0)} {| |}_{2} \leq u_{1}$ , and $| | β - β^{(0)} {| |}_{2} \leq u_{2}$ . Therefore the second condition of Proposition 5 is also satisfied with $D = Q^{2}$ . Now, we have

\begin{matrix} P^{*} (ρ_{n} sup_{β} | S_{n} (β) | > & η) \leq P^{*} (ρ_{n} sup_{β} | S_{n} (β) - S_{n} (β^{(0)}) | > \frac{η}{2}) \\ + P^{*} (ρ_{n} | S_{n} (β^{(0)}) | > \frac{η}{2}) . \end{matrix}

Since $\sum_{i \neq j} X_{ij} Y_{ij} β_{z_{i}, z_{j}}^{(0)}$ is a sub-exponential random variable with parameters $(n ν, b)$ , the second term on the RHS of inequality 18 can be bounded by

\begin{matrix} P^{*} (ρ_{n} | S_{n} (β^{(0)}) | > \frac{η}{2}) = O (exp (- n^{2})) . \end{matrix}

To bound the first term on the RHS of inequality 18, we introduce the set $P z_{[n]} = {(π, α, β) : (z_{[n]}, π, α, β) \in P}$ for every $z_{[n]}$ , and define the event

\begin{matrix} Ω_{n} (z_{[n]}) = {sup_{P (z_{[n]})} ρ_{n} | S_{n} (β) - S_{n} (β^{(0)}) | \leq ρ_{n} κ \sqrt{u^{2} (Q^{2} + x_{n})} + ρ_{n} b (Q^{2} + x_{n})} . \end{matrix}

We have

\begin{matrix} P^{*} (Ω_{n} {(z_{[n]})}^{c}) = 2 e^{- x_{n}} \end{matrix}

by Proposition 5. Combining 18, 19, and 20,

\begin{matrix} P^{*} [ρ_{n} | \sum_{i \neq j} (X_{ij} Y_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*} \frac{α_{z_{i}^{*}, z_{j}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}}) β_{z_{i}, z_{j}} | > η] \\ \leq \sum_{z_{[n]}} P^{*} [{sup_{P (z_{[n]})} ρ_{n} | S_{n} (β) - S_{n} (β^{(0)}) | > \frac{η}{2}}] + P^{*} [ρ_{n} | S_{n} (β_{0}) | > \frac{η}{2}] \\ \leq \sum_{z_{[n]}} P^{*} [{sup_{P (z_{[n]})} ρ_{n} | S_{n} (β) - S_{n} (β^{(0)}) | > \frac{η}{2}} \cap Ω_{n} (z_{[n]})] + \sum_{z_{[n]}} 2 e^{- x_{n}} + \sum_{z_{[n]}} O (e^{- n^{2}}) \\ \leq \sum_{z_{[n]}} P^{*} [ρ_{n} κ \sqrt{u^{2} (D + x_{n})} + ρ_{n} b (D + x_{n}) > \frac{η}{2}] + \sum_{z_{[n]}} 2 e^{- x_{n}} + \sum_{z_{[n]}} O (e^{- n^{2}}) \end{matrix}

Since $z_{[n]}$ belongs to a set of cardinality at most $Q^{n}$ , by choosing $x_{n} = n log (n)$ , the three sums converge to 0.

By using the fact that $X_{ij} log Y_{ij}$ is a sub-exponential random variable, we can similarly show that

\begin{matrix} P^{*} [sup_{P} ρ_{n} | \sum_{i \neq j} (X_{ij} log Y_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*} log β_{z_{i}^{*}, z_{j}^{*}}^{*} - π_{z_{i}^{*}, z_{j}^{*}}^{*} ψ (α_{z_{i}^{*}, z_{j}^{*}}^{*})) (α_{z_{i}, z_{j}} - 1) | > η] \to_{n \to \infty}^{P} 0 . \end{matrix}

Since the convergence is uniform with respect to $z_{[n]}^{*}$ , $P^{*}$ can be replaced by $P$ and the proof is completed. $□$

Proposition 6 allows us to establish the following result concerning the convergence of the normalized log-likelihood function. Proposition 7 is an extension of Theorem 3.6 of Celisse et al. (2012) and allows us to establish the consistency of MLE of $(π, α, β)$ .

Proposition 7

We assume that assumptions (A1), (A2), (A3), (A5), (A6) hold. For every $(θ, π, α, β)$ , set

\begin{matrix} M_{n} (θ, π, α, β) = {(n (n - 1))}^{- 1} L_{2} (Y_{[n]}, X_{[n]} ; θ, π, α, β), \end{matrix}

and

\begin{matrix} M (π, α, β, A) & = \sum_{q, l} θ_{q}^{*} θ_{l}^{*} \sum_{q^{^{'}}, l^{^{'}}} a_{q, q^{^{'}}} a_{l, l^{^{'}}} [π_{ql}^{*} log π_{q^{^{'}}, l^{^{'}}} + (1 - π_{ql}^{*}) log (1 - π_{q^{^{'}}, l^{^{'}}}) \\ + π_{ql}^{*} E_{α_{ql}^{*}, β_{ql}^{*}} (log f (. ; α_{q^{^{'}}, l^{^{'}}}, β_{q^{^{'}}, l^{^{'}}}))] \end{matrix}

where

\begin{matrix} A = {A = {(a_{q, l})}_{1 \leq q, l \leq Q} | a_{q, l} \geq 0, \sum_{l = 1}^{Q} a_{kl} = 1} . \end{matrix}

Then for any $η > 0$ ,

\begin{matrix} sup_{d ((π, α, β), (π^{*}, α^{*}, β^{*})) \geq η} M (π, α, β) < M (π^{*}, α^{*}, β^{*}), \\ sup_{θ, π, α, β} | M_{n} (θ, π, α, β) - M (π, α, β) | \to_{n \to \infty}^{P} 0 . \end{matrix}

Proof

We define the following:

\begin{matrix} {\hat{z}}_{[n]} (π, α, β) = {argmax}_{z} ϕ_{n} (z_{[n]}, π, α, β), \\ {\tilde{z}}_{[n]} (π, α, β) = {argmax}_{z} Φ_{n} (z_{[n]}, π, α, β), \\ {\bar{A}}_{π, α, β} = {argmax}_{A \in A} M (π, α, β, A), \\ M (π, α, β) = M (π, α, β, {\bar{A}}_{π, α, β}) . \end{matrix}

By a similar reasoning as in the proof of Theorem 3.6 of Celisse et al. (2012), we can show that ${\bar{A}}_{π^{*}, α^{*}, β^{*}} = I_{Q}$ and is unique. To show that for all $η > 0$ , ${sup}_{d ((π, α, β), (π^{*}, α^{*}, β^{*})) \geq η} M (π, α, β) < M (π^{*}, α^{*}, β^{*})$ , we let ${({\bar{a}}_{ql})}_{q, l}$ denote coefficients of ${\bar{A}}_{π, α, β}$ . We have that

\begin{matrix} M (π, α, β) - M (π^{*}, α^{*}, β^{*}) \\ = - \sum_{q, l} θ_{q}^{*} θ_{l}^{*} \sum_{q^{^{'}}, l^{^{'}}} {\bar{a}}_{q, q^{^{'}}} {\bar{a}}_{l, l^{^{'}}} {{KL}_{B} (π_{ql}^{*}, π_{q^{^{'}}, l^{^{'}}}) + π_{ql}^{*} {KL}_{G} ((α_{ql}^{*}, β_{ql}^{*}), (α_{q^{^{'}}, l^{^{'}}} β_{q^{^{'}}, l^{^{'}}}))}, \end{matrix}

where ${KL}_{B} (p, q)$ denotes the Kullback-Leibler divergence between two bernoulli distributions $Ber (p)$ and $Ber (q)$ , and ${KL}_{G} ((a_{,} b_{1}), (a_{2}, b_{2}))$ denotes the Kullback-Leibler divergence between two gamma distributions $Ga (a_{1}, b_{1})$ and $Ga (a_{2}, b_{2})$ .

Since the set ${(π, α, β) | d ((π, α, β), (π^{*}, α^{*}, β^{*})) \geq η)}$ is compact by assumptions, there exists $(π^{0}, α^{0}, β^{0}) \neq (π^{*}, α^{*}, β^{*})$ such that

\begin{matrix} sup_{d ((π, α, β), (π^{*}, α^{*}, β^{*})) \geq η} M (π, α, β) - M (π^{*}, α^{*}, β^{*}) \\ = M (π^{0}, α^{0}, β^{0}) - M (π^{*}, α^{*}, β^{*}) < 0 . \end{matrix}

Next we show that

\begin{matrix} sup_{θ, π, α, β} | M_{n} (θ, π, α, β) - M (π, α, β) | \to_{n \to \infty}^{P} 0 . \end{matrix}

We first have the following bound:

\begin{matrix} | M_{n} (θ, π, α, β) - M (π, α, β) | & \leq | M_{n} (θ, π, α, β) - ϕ_{n} (\hat{z}, π, α, β) | \\ + | ϕ_{n} (\hat{z}, π, α, β) - Φ_{n} (\tilde{z}, π, α, β) | \\ + | Φ_{n} (\tilde{z}, π, α, β) - M (π, α, β) | . \end{matrix}

We first consider the first term on the RHS of inequality (21):

\begin{matrix} sup_{θ, π, α, β} | M_{n} (θ, π, α, β) - ϕ_{n} (\hat{z}, π, α, β) | \\ = sup_{θ, π, α, β} \frac{| L_{2} (Y_{[n]}, X_{[n]} ; θ, π, α, β) - L_{1} (Y_{[n]}, X_{[n]} ; {\hat{z}}_{[n]}, π, α, β) |}{n (n - 1)} \\ \leq \frac{log (1 / γ)}{n - 1} \underset{n \to \infty}{\to} 0 . \end{matrix}

Consider the second term on the RHS of inequality (21), if $ϕ_{n} (\hat{z}, π, α, β) < Φ_{n} (\tilde{z}, π, α, β)$ , we have

\begin{matrix} | ϕ_{n} (\hat{z}, π, α, β) - Φ_{n} (\tilde{z}, π, α, β) | & = Φ_{n} (\tilde{z}, π, α, β) - ϕ_{n} (\hat{z}, π, α, β) \\ \leq Φ_{n} (\tilde{z}, π, α, β) - ϕ_{n} (\tilde{z}, π, α, β) . \end{matrix}

On the other hand, if $ϕ_{n} (\hat{z}, π, α, β) \geq Φ_{n} (\tilde{z}, π, α, β)$ ,

\begin{matrix} | ϕ_{n} (\hat{z}, π, α, β) - Φ_{n} (\tilde{z}, π, α, β) | & = ϕ_{n} (\tilde{z}, π, α, β) - Φ_{n} (\hat{z}, π, α, β) \\ \leq ϕ_{n} (\hat{z}, π, α, β) - Φ_{n} (\hat{z}, π, α, β) . \end{matrix}

Therefore, Proposition 6 implies that

\begin{matrix} sup_{π, α, β} | ϕ_{n} (\hat{z}, π, α, β) - Φ_{n} (\tilde{z}, π, α, β) | \to_{n \to \infty}^{P} 0 . \end{matrix}

Last, we notice that under our assumptions (A2) and (A6), and using the strong law of large number,

\begin{matrix} sup_{π, α, β} | Φ_{n} (\tilde{z}, π, α, β) - M (π, α, β) | \to_{n \to \infty}^{P} 0 . \end{matrix}

Combining (22), (23) and (24), we have the desired result. $□$

The consistency of $(\hat{π}, \hat{α}, \hat{β})$ follows from Proposition 7 and Theorem 3.4 of Celisse et al. (2012). $□$

Proof of Thereom 3

Proof

We denote $\hat{P} (Z_{[n]} = z_{[n]} | X_{[n]}, Y_{[n]})$ as the conditional distribution of $Z_{[n]}$ under the parameters $(\hat{θ}, \hat{π}, \hat{α}, \hat{β})$ . The following result is an extension of Proposition 3.8 of Celisse et al. (2012) and is needed to establish the consistency of $\hat{θ}$ .

Proposition 8

Assume that assumptions (A1)-(A6) hold, and there exists estimators $\hat{π}, \hat{α}, \hat{β}$ such that $| | \hat{π} - π {| |}_{\infty} = o_{P} (v_{n})$ , $| | \hat{α} - α {| |}_{\infty} = o_{P} (v_{n})$ , $| | \hat{β} - β {| |}_{\infty} = o_{P} (v_{n})$ , with $v_{n} = o (\sqrt{log n} / n)$ . Let also $\hat{θ}$ denote any estimator of $θ^{*}$ . Then for every $ϵ > 0$ ,

\begin{matrix} P^{*} [\sum_{z_{[n]} \neq z_{[n]}^{*}} \frac{\hat{P} (Z_{[n]} = z_{[n]} | X_{[n]}, Y_{[n]})}{\hat{P} (Z_{[n]} = z_{[n]}^{*} | X_{[n]}, Y_{[n]})} > ϵ] & \leq κ_{1} n e^{- κ_{2} \frac{{(log n)}^{2}}{n v_{n}^{2}}} + P [| | \hat{π} - π^{*} {| |}_{\infty} > v_{n}] \\ + P [| | \hat{α} - α^{*} | |_{\infty} > v_{n}] \\ + P [| | \hat{β} - β^{*} | |_{\infty} > v_{n}] \end{matrix}

for n large enough, and for some constants $κ_{1}, κ_{2} > 0$ and

\begin{matrix} log (\frac{\hat{P} (Z_{[n]} = z_{[n]} | X_{[n]}, Y_{[n]})}{\hat{P} (Z_{[n]} = z_{[n]}^{*} | X_{[n]}, Y_{[n]})}) \\ = log (\frac{\hat{P} (X_{[n]} | Z_{[n]} = z_{[n]}) P (Z_{[n]} = z_{[n]})}{\hat{P} (X_{[n]} | Z_{[n]} = z_{[n]}^{*}) P (Z_{[n]} = z_{[n]}^{*})}) + log (\frac{\hat{P} (Y_{[n]} | X_{[n]}, Z_{[n]} = z_{[n]})}{\hat{P} (Y_{[n]} | X_{[n]}, Z_{[n]} = z_{[n]}^{*})}) \\ = \sum_{i \neq j} {X_{ij} log (\frac{{\hat{π}}_{z_{i}, z_{j}}}{{\hat{π}}_{z_{i}^{*}, z_{j}^{*}}}) + (1 - X_{ij}) log (\frac{1 - {\hat{π}}_{z_{i}, z_{j}}}{1 - {\hat{π}}_{z_{i}^{*}, z_{j}^{*}}})} + \sum_{i} log \frac{{\hat{θ}}_{z_{i}}}{{\hat{θ}}_{z_{i}^{*}}} \\ + \sum_{i \neq j} {X_{ij} ({\hat{α}}_{z_{i}, z_{j}} log {\hat{β}}_{z_{i}, z_{j}} - {\hat{α}}_{z_{i}^{*}, z_{j}^{*}} log {\hat{β}}_{z_{i}^{*}, z_{j}^{*}}) + ({\hat{α}}_{z_{i}, z_{j}} - {\hat{α}}_{z_{i}^{*}, z_{j}^{*}}) X_{ij} log Y_{ij} \\ + X_{ij} Y_{ij} ({\hat{β}}_{z_{i}^{*}, z_{j}^{*}} - {\hat{β}}_{z_{i}, z_{j}}) + X_{ij} (log Γ ({\hat{α}}_{z_{i}^{*}, z_{j}^{*}}) - log Γ ({\hat{α}}_{z_{i}, z_{j}}))} \\ : = T_{0} + T_{1} & . \end{matrix}

Proof

We can write

\begin{matrix} T_{1} & = \sum_{D^{*}} {π_{z_{i}^{*}, z_{j}^{*}}^{*} \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} (β_{z_{i}^{*}, z_{j}^{*}}^{*} - β_{z_{i}, z_{j}}^{*}) + π_{z_{i}^{*}, z_{j}^{*}}^{*} α_{z_{i}, z_{j}}^{*} log \frac{β_{z_{i}, z_{j}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}}} \\ + \sum_{D^{*}} {(X_{ij} Y_{ij} - \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} π_{z_{i}^{*}, z_{j}^{*}}^{*}) (β_{z_{i}^{*}, z_{j}^{*}}^{*} - {\hat{β}}_{z_{i}, z_{j}}) + (X_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*}) α_{z_{i}, z_{j}}^{*} log \frac{β_{z_{i}, z_{j}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}}} \\ + \sum_{\hat{D} \cup D^{*}} {\frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} π_{z_{i}^{*}, z_{j}^{*}}^{*} ({\hat{β}}_{z_{i}^{*}, z_{j}^{*}} - β_{z_{i}^{*}, z_{j}^{*}}^{*}) + \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} π_{z_{i}^{*}, z_{j}^{*}}^{*} ({\hat{β}}_{z_{i}, z_{j}} - β_{z_{i}, z_{j}}^{*}) \\ + π_{z_{i}^{*}, z_{j}^{*}}^{*} ({\hat{α}}_{z_{i}, z_{j}} log \frac{{\hat{β}}_{z_{i}, z_{j}}}{{\hat{β}}_{z_{i}^{*}, z_{j}^{*}}} - α_{z_{i}, z_{j}}^{*} log \frac{β_{z_{i}, z_{j}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}})} \\ + \sum_{\hat{D} \cup D^{*}} {(X_{ij} - π_{z_{i}^{*}, z_{j}^{*}}^{*}) ({\hat{α}}_{z_{i}, z_{j}} log \frac{{\hat{β}}_{z_{i}, z_{j}}}{{\hat{β}}_{z_{i}^{*}, z_{j}^{*}}} - α_{z_{i}, z_{j}}^{*} log \frac{β_{z_{i}, z_{j}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}})} \\ + \sum_{\hat{D} \cup D^{*}} {(X_{ij} Y_{ij} - \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} π_{z_{i}^{*}, z_{j}^{*}}^{*}) ({\hat{β}}_{z_{i}^{*}, z_{j}^{*}} - β_{z_{i}^{*}, z_{j}^{*}}^{*}) \\ + (X_{ij} Y_{ij} - \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} π_{z_{i}^{*}, z_{j}^{*}}^{*}) (β_{z_{i}, z_{j}}^{*} - {\hat{β}}_{z_{i}, z_{j}})} \\ = : T_{1, 1} + T_{1, 2} + T_{1, 3} + T_{1, 4} + T_{1, 5} \end{matrix}

where

\begin{matrix} D^{*} : & = {(i, j) : i \neq j, π_{z_{i}, z_{j}}^{*} \neq π_{z_{i}^{*}, z_{j}^{*}}^{*}}, \\ \hat{D} : & = {(i, j) : i \neq j, {\hat{π}}_{z_{i}, z_{j}} \neq {\hat{π}}_{z_{i}^{*}, z_{j}^{*}}} . \end{matrix}

By the proof of Proposition 3.8 of Celisse et al. (2012), we have

\begin{matrix} P^{*} [{\sum_{[z_{[n]}] \neq [z_{[n]}^{*}]} \frac{\hat{P} (Z_{[n]} = z_{[n]} | X_{[n]}, Y_{[n]})}{\hat{P} (Z_{[n]} = z_{[n]}^{*} | X_{[n]}, Y_{[n]})} > ϵ} \cap Ω_{n}] \\ \leq \sum_{r = 1}^{n} \sum_{z_{[n]} \notin [z_{[n]}^{*}], | | z_{[n]} - z_{[n]}^{*} {| |}_{0} = r} P^{*} [{log \frac{\hat{P} (Z_{[n]} = z_{[n]} | X_{[n]}, Y_{[n]})}{\hat{P} (Z_{[n]} = z_{[n]}^{*} | X_{[n]}, Y_{[n]})} > - 5 r log n} \cap Ω_{n}] \\ = \sum_{r = 1}^{n} \sum_{z_{[n]} \notin [z_{[n]}^{*}], | | z_{[n]} - z_{[n]}^{*} {| |}_{0} = r} P^{*} [{T_{0} + T_{1} > - 5 r log n} \cap Ω_{n}] . \end{matrix}

We have that

\begin{matrix} P^{*} [{T_{0} + T_{1} > - 5 r log n} \cap Ω_{n}] \leq P^{*} [{T_{0} > - 10 r log n} \cap Ω_{n}} + P^{*} [{T_{1} > 5 r log n} \cap Ω_{n}] . \end{matrix}

The proof of Proposition 3.8 of Celisse et al. (2012) shows that

\begin{matrix} P^{*} [{T_{0} > - 10 r log n} \cap Ω_{n}} \leq C_{1} {exp (8 n log n - C_{2} \frac{{(log n)}^{2}}{n v_{n}^{2}})}^{r} . \end{matrix}

We have that

\begin{matrix} T_{1, 1} & = \sum_{D^{*}} π_{z_{i}^{*}, z_{j}^{*}}^{*} \frac{α_{z_{i}^{*}, z_{j}^{*}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} (β_{z_{i}^{*}, z_{j}^{*}}^{*} - β_{z_{i}, z_{j}}^{*}) + π_{z_{i}^{*}, z_{j}^{*}}^{*} α_{11}^{*} log \frac{β_{z_{i}, z_{j}}^{*}}{β_{z_{i}^{*}, z_{j}^{*}}^{*}} \\ \leq | D^{*} | max_{(q, l) \neq (q^{^{'}}, l^{^{'}})} - {KL}_{G} ((α_{ql}^{*}, β_{ql}^{*}) | | (α_{q^{^{'}}, l^{^{'}}}^{*} β_{q^{^{'}}, l^{^{'}}}^{*})) \\ = - | D^{*} | K^{*} \end{matrix}

Since $X_{ij} Y_{ij}$ is a sub-exponential random variable and $X_{ij}$ is bounded for all i, j, one can show that

\begin{matrix} P^{*} [T_{1, 2} + \frac{1}{2} T_{1, 1} > t] \leq max {exp {- C_{1} \frac{(t + | D^{*} {| K^{*})}^{2}}{| D^{*} |}}, exp {- C_{2} (t + | D^{*} | K^{*}}} . \end{matrix}

Using similar techniques as in the proof of Proposition 3.8 of Celisse et al. (2012), one can show that

\begin{matrix} P^{*} [{| T_{1, 3} | > t} \cap Ω_{n}] \leq P^{*} (v_{n} > \frac{C_{3} t}{nr}), \end{matrix}

\begin{matrix} P^{*} [{T_{1, 4} > t} \cap Ω_{n}] \leq \sum_{k} \sum_{D, | D | = k} Q^{2} exp (- \frac{C_{4} t^{2}}{v_{n}^{2} (k + | D^{*} |)}), \end{matrix}

\begin{matrix} P^{*} [T_{1, 5} + \frac{1}{2} T_{1, 1} > t] \\ \leq \sum_{k} \sum_{D, | D | = k} Q^{2} max {exp [- \frac{C_{5} (t + (| D^{*} | + k) K^{*})}{v_{n}}], \\ exp [- \frac{C_{6} (t + (| D^{*} | + k) K^{*})^{2}}{(| D^{*} | + k) v_{n}^{2}}]} & . \end{matrix}

Combining inequalities 25, 26, 27, 28, 29, and letting $v_{n} = o (\sqrt{log n} / n)$ , we have for some $B_{1}, B_{2}, B_{3} > 0$ ,

\begin{matrix} P^{*} [{\sum_{[z_{[n]}] \neq [z_{[n]}^{*}]} \frac{\hat{P} (z_{[n]} | X_{[n]}, Y_{[n]})}{\hat{P} (z_{[n]}^{*} | X_{[n]}, Y_{[n]})} > ϵ} \cap Ω_{n}] \\ \leq \sum_{r = 1}^{n} (\binom{n}{r}) {(Q - 1)}^{r} B_{1} (exp [B_{2} n log n - B_{3} ω (n log n)])^{r} \\ = B_{1} [{(1 + (Q - 1) u_{n}^{^{'}})}^{n} - 1] \end{matrix}

where $u_{n}^{^{'}} = exp [B_{2} n log n - B_{3} \frac{{(log n)}^{2}}{n v_{n}^{2}}]$ . Since ${(1 + (Q - 1) u_{n}^{^{'}})}^{n} \to 1$ as $n \to \infty$ , and the proof is completed. $□$

The proof of the consistency of $\hat{θ}$ is a consequence of the proposition above and follows the same lines as the proof of Theorem 3.9 of Celisse et al. (2012). $□$

Funding

Open Access funding provided by the IReL Consortium. The funding was provided by Science Foundation Ireland (Grant No. SFI/12/RC/2289-2).

Footnotes

Historical data available at https://www.capitalbikeshare.com/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Abbe E. Community detection and stochastic block models: recent developments. J Mach Learn Res. 2018;18:1–86. [Google Scholar]
Airoldi EM, Blei DM, Fienberg SE, Xing EP. Mixed membership stochastic blockmodels. J Mach Learn Res. 2008;9:1981–2014. [PMC free article] [PubMed] [Google Scholar]
Aicher C, Jacobs AZ, Clauset A (2013) Adapting the stochastic block model to edge-weighted networks. ICML workshop on structured learning
Aicher C, Jacobs AZ, Clauset A. Learning latent block structure in weighted networks. J Compl Netw. 2015;3:221–248. doi: 10.1093/comnet/cnu026. [DOI] [Google Scholar]
Allman ES, Matias C, Rhodes JA. Identifiability of parameters in latent structure models with many observed variables. Ann Stat. 2009;37:3099–3132. doi: 10.1214/09-AOS689. [DOI] [Google Scholar]
Allman ES, Matias C, Rhodes JA. Parameter identifiability in a class of random graph mixture models. J Stat Plan Inference. 2011;141:1719–1736. doi: 10.1016/j.jspi.2010.11.022. [DOI] [Google Scholar]
Ambroise C, Matias C. New consistent and asymptotically normal parameter estimates for random-graph mixture models. J R Stat Soc Ser B Stat Methodol. 2012;74:3–35. doi: 10.1111/j.1467-9868.2011.01009.x. [DOI] [Google Scholar]
Baraud Y. A Bernstein-type inequality for suprema of random processes with applications to model selection in non-Gaussian regression. Bernoulli. 2010;16:1064–1085. doi: 10.3150/09-BEJ245. [DOI] [Google Scholar]
Barrat A, Barthélemy M, Pastor-Satorras R, Vespignani A. The architecture of complex weighted networks. Proc Natl Acad Sci. 2004;101:3747–3752. doi: 10.1073/pnas.0400087101. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bickel PJ, Chen A. A nonparametric view of network models and Newman–Girvan and other modularities. Proc Natl Acad Sci USA. 2009;106:21068–21073. doi: 10.1073/pnas.0907096106. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bickel PJ, Chen A, Levina E. The method of moments and degree distributions for network models. Ann Stat. 2011;39:2280–2301. doi: 10.1214/11-AOS904. [DOI] [Google Scholar]
Bickel P, Choi D, Chang X, Zhang H. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann Stat. 2013;41:1922–1943. doi: 10.1214/13-AOS1124. [DOI] [Google Scholar]
Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell. 2000;22:719–725. doi: 10.1109/34.865189. [DOI] [Google Scholar]
Brault V, Keribin C, Mariadassou M. Consistency and asymptotic normality of Latent Block Model estimators. Electron J Stat. 2020;14:1234–1268. doi: 10.1214/20-EJS1695. [DOI] [Google Scholar]
Celisse A, Daudin J-J, Pierre L. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron J Stat. 2012;6:1847–1899. doi: 10.1214/12-EJS729. [DOI] [Google Scholar]
Channarond A, Daudin J-J, Robin S. Classification and estimation in the stochastic blockmodel based on the empirical degrees. Electron J Stat. 2012;6:2574–2601. doi: 10.1214/12-EJS753. [DOI] [Google Scholar]
Choi DS, Wolfe PJ, Airoldi EM. Stochastic blockmodels with a growing number of classes. Biometrika. 2012;99:273–284. doi: 10.1093/biomet/asr053. [DOI] [PMC free article] [PubMed] [Google Scholar]
Clauset A, Newman MEJ, Moore C. Finding community structure in very large networks. Phys Rev E. 2004;70:066111. doi: 10.1103/PhysRevE.70.066111. [DOI] [PubMed] [Google Scholar]
Côme E, Latouche P. Model selection and clustering in stochastic block models based on the exact integrated complete data likelihood. Stat Model. 2015;15:564–589. doi: 10.1177/1471082X15577017. [DOI] [Google Scholar]
Daudin J-J, Picard F, Robin S. A mixture model for random graphs. Stat Comput. 2008;18:173–183. doi: 10.1007/s11222-007-9046-7. [DOI] [Google Scholar]
Ghasemian A, Zhang P, Clauset A, Moore C, Peel L. Detectability thresholds and optimal algorithms for community structure in dynamic networks. Phys Rev X. 2016;6:031005. [Google Scholar]
Haj AE, Slaoui Y, Louis P-Y, Khraibani Z (2020) Estimation in a binomial stochastic blockmodel for a weighted graph by a variational expectation maximization algorithm. Commun Stat Simul Comput:1–20
Holland PW, Laskey KB, Leinhardt S. Stochastic blockmodels: first steps. Soc Netw. 1983;5:109–137. doi: 10.1016/0378-8733(83)90021-7. [DOI] [Google Scholar]
Jog V, Loh P-L (2015) Information-theoretic bounds for exact recovery in weighted stochastic block models using the Rényi divergence. CoRR, arXiv:abs/1509.06418
Karrer B, Newman MEJ. Stochastic blockmodels and community structure in networks. Phys Rev E. 2011;83:016107. doi: 10.1103/PhysRevE.83.016107. [DOI] [PubMed] [Google Scholar]
Latouche P, Birmelé E, Ambroise C. Overlapping stochastic block models with application to the French political blogosphere. Ann Appl Stat. 2011;5:309–336. doi: 10.1214/10-AOAS382. [DOI] [Google Scholar]
Latouche P, Birmelé E, Ambroise C. Model selection in overlapping stochastic block models. Electron J Stat. 2014;8:762–794. doi: 10.1214/14-EJS903. [DOI] [Google Scholar]
Leger J-B, Vacher C, Daudin J-J. Detection of structurally homogeneous subsets in graphs. Stat Comput. 2014;24:675–692. doi: 10.1007/s11222-013-9395-3. [DOI] [Google Scholar]
Ludkin M. Inference for a generalised stochastic block model with unknown number of blocks and non-conjugate edge models. Comput Stat Data Anal. 2020;152:107051. doi: 10.1016/j.csda.2020.107051. [DOI] [Google Scholar]
Mariadassou M, Robin S, Vacher C. Uncovering latent structure in valued graphs: a variational approach. Ann Appl Stat. 2010;4:715–742. doi: 10.1214/10-AOAS361. [DOI] [Google Scholar]
Newman MEJ. Analysis of weighted networks. Phys Rev E. 2004;70:056131. doi: 10.1103/PhysRevE.70.056131. [DOI] [PubMed] [Google Scholar]
Peixoto TP. Nonparametric weighted stochastic block models. Phys Rev E. 2018;97:012306. doi: 10.1103/PhysRevE.97.012306. [DOI] [PubMed] [Google Scholar]
Peng L, Carvalho L. Bayesian degree-corrected stochastic blockmodels for community detection. Electron J Stat. 2016;10:2746–2779. doi: 10.1214/16-EJS1163. [DOI] [Google Scholar]
Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: Proceedings of the 20th international conference on computer and information sciences. Springer, Berlin, ISCIS’05, pp 284–293
Rohe K, Chatterjee S, Yu B. Spectral clustering and the high-dimensional stochastic blockmodel. Ann Stat. 2011;39:1878–1915. doi: 10.1214/11-AOS887. [DOI] [Google Scholar]
Saldaña DF, Yu Y, Feng Y. How many communities are there? J Comput Graph Stat. 2017;26:171–181. doi: 10.1080/10618600.2015.1096790. [DOI] [Google Scholar]
Snijders TAB, Nowicki K. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J Classif. 1997;14:75–100. doi: 10.1007/s003579900004. [DOI] [Google Scholar]
Stouffer DB, Bascompte J. Compartmentalization increases food-web persistence. Proc Natl Acad Sci USA. 2011;108:3648–3652. doi: 10.1073/pnas.1014353108. [DOI] [PMC free article] [PubMed] [Google Scholar]
von Luxburg U. A tutorial on spectral clustering. Stat Comput. 2007;17:395–416. doi: 10.1007/s11222-007-9033-z. [DOI] [Google Scholar]
Wang YXR, Bickel PJ. Likelihood-based model selection for stochastic block models. Ann Stat. 2017;45:500–528. [Google Scholar]
Yan X (2016) Bayesian model selection of stochastic block models. In: 2016 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), pp 323–328
Ye Z-S, Chen N. Closed-form estimators for the gamma distribution derived from likelihood equations. Am Stat. 2017;71:177–181. doi: 10.1080/00031305.2016.1209129. [DOI] [Google Scholar]
Zanghi H, Picard F, Miele V, Ambroise C. Strategies for online inference of model-based clustering in large and growing networks. Ann Appl Stat. 2010;4:687–714. doi: 10.1214/10-AOAS359. [DOI] [Google Scholar]

[CR1] Abbe E. Community detection and stochastic block models: recent developments. J Mach Learn Res. 2018;18:1–86. [Google Scholar]

[CR2] Airoldi EM, Blei DM, Fienberg SE, Xing EP. Mixed membership stochastic blockmodels. J Mach Learn Res. 2008;9:1981–2014. [PMC free article] [PubMed] [Google Scholar]

[CR3] Aicher C, Jacobs AZ, Clauset A (2013) Adapting the stochastic block model to edge-weighted networks. ICML workshop on structured learning

[CR4] Aicher C, Jacobs AZ, Clauset A. Learning latent block structure in weighted networks. J Compl Netw. 2015;3:221–248. doi: 10.1093/comnet/cnu026. [DOI] [Google Scholar]

[CR5] Allman ES, Matias C, Rhodes JA. Identifiability of parameters in latent structure models with many observed variables. Ann Stat. 2009;37:3099–3132. doi: 10.1214/09-AOS689. [DOI] [Google Scholar]

[CR6] Allman ES, Matias C, Rhodes JA. Parameter identifiability in a class of random graph mixture models. J Stat Plan Inference. 2011;141:1719–1736. doi: 10.1016/j.jspi.2010.11.022. [DOI] [Google Scholar]

[CR7] Ambroise C, Matias C. New consistent and asymptotically normal parameter estimates for random-graph mixture models. J R Stat Soc Ser B Stat Methodol. 2012;74:3–35. doi: 10.1111/j.1467-9868.2011.01009.x. [DOI] [Google Scholar]

[CR8] Baraud Y. A Bernstein-type inequality for suprema of random processes with applications to model selection in non-Gaussian regression. Bernoulli. 2010;16:1064–1085. doi: 10.3150/09-BEJ245. [DOI] [Google Scholar]

[CR9] Barrat A, Barthélemy M, Pastor-Satorras R, Vespignani A. The architecture of complex weighted networks. Proc Natl Acad Sci. 2004;101:3747–3752. doi: 10.1073/pnas.0400087101. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] Bickel PJ, Chen A. A nonparametric view of network models and Newman–Girvan and other modularities. Proc Natl Acad Sci USA. 2009;106:21068–21073. doi: 10.1073/pnas.0907096106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] Bickel PJ, Chen A, Levina E. The method of moments and degree distributions for network models. Ann Stat. 2011;39:2280–2301. doi: 10.1214/11-AOS904. [DOI] [Google Scholar]

[CR12] Bickel P, Choi D, Chang X, Zhang H. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann Stat. 2013;41:1922–1943. doi: 10.1214/13-AOS1124. [DOI] [Google Scholar]

[CR13] Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell. 2000;22:719–725. doi: 10.1109/34.865189. [DOI] [Google Scholar]

[CR14] Brault V, Keribin C, Mariadassou M. Consistency and asymptotic normality of Latent Block Model estimators. Electron J Stat. 2020;14:1234–1268. doi: 10.1214/20-EJS1695. [DOI] [Google Scholar]

[CR15] Celisse A, Daudin J-J, Pierre L. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron J Stat. 2012;6:1847–1899. doi: 10.1214/12-EJS729. [DOI] [Google Scholar]

[CR16] Channarond A, Daudin J-J, Robin S. Classification and estimation in the stochastic blockmodel based on the empirical degrees. Electron J Stat. 2012;6:2574–2601. doi: 10.1214/12-EJS753. [DOI] [Google Scholar]

[CR17] Choi DS, Wolfe PJ, Airoldi EM. Stochastic blockmodels with a growing number of classes. Biometrika. 2012;99:273–284. doi: 10.1093/biomet/asr053. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] Clauset A, Newman MEJ, Moore C. Finding community structure in very large networks. Phys Rev E. 2004;70:066111. doi: 10.1103/PhysRevE.70.066111. [DOI] [PubMed] [Google Scholar]

[CR19] Côme E, Latouche P. Model selection and clustering in stochastic block models based on the exact integrated complete data likelihood. Stat Model. 2015;15:564–589. doi: 10.1177/1471082X15577017. [DOI] [Google Scholar]

[CR20] Daudin J-J, Picard F, Robin S. A mixture model for random graphs. Stat Comput. 2008;18:173–183. doi: 10.1007/s11222-007-9046-7. [DOI] [Google Scholar]

[CR21] Ghasemian A, Zhang P, Clauset A, Moore C, Peel L. Detectability thresholds and optimal algorithms for community structure in dynamic networks. Phys Rev X. 2016;6:031005. [Google Scholar]

[CR22] Haj AE, Slaoui Y, Louis P-Y, Khraibani Z (2020) Estimation in a binomial stochastic blockmodel for a weighted graph by a variational expectation maximization algorithm. Commun Stat Simul Comput:1–20

[CR23] Holland PW, Laskey KB, Leinhardt S. Stochastic blockmodels: first steps. Soc Netw. 1983;5:109–137. doi: 10.1016/0378-8733(83)90021-7. [DOI] [Google Scholar]

[CR24] Jog V, Loh P-L (2015) Information-theoretic bounds for exact recovery in weighted stochastic block models using the Rényi divergence. CoRR, arXiv:abs/1509.06418

[CR25] Karrer B, Newman MEJ. Stochastic blockmodels and community structure in networks. Phys Rev E. 2011;83:016107. doi: 10.1103/PhysRevE.83.016107. [DOI] [PubMed] [Google Scholar]

[CR26] Latouche P, Birmelé E, Ambroise C. Overlapping stochastic block models with application to the French political blogosphere. Ann Appl Stat. 2011;5:309–336. doi: 10.1214/10-AOAS382. [DOI] [Google Scholar]

[CR27] Latouche P, Birmelé E, Ambroise C. Model selection in overlapping stochastic block models. Electron J Stat. 2014;8:762–794. doi: 10.1214/14-EJS903. [DOI] [Google Scholar]

[CR28] Leger J-B, Vacher C, Daudin J-J. Detection of structurally homogeneous subsets in graphs. Stat Comput. 2014;24:675–692. doi: 10.1007/s11222-013-9395-3. [DOI] [Google Scholar]

[CR29] Ludkin M. Inference for a generalised stochastic block model with unknown number of blocks and non-conjugate edge models. Comput Stat Data Anal. 2020;152:107051. doi: 10.1016/j.csda.2020.107051. [DOI] [Google Scholar]

[CR30] Mariadassou M, Robin S, Vacher C. Uncovering latent structure in valued graphs: a variational approach. Ann Appl Stat. 2010;4:715–742. doi: 10.1214/10-AOAS361. [DOI] [Google Scholar]

[CR31] Newman MEJ. Analysis of weighted networks. Phys Rev E. 2004;70:056131. doi: 10.1103/PhysRevE.70.056131. [DOI] [PubMed] [Google Scholar]

[CR32] Peixoto TP. Nonparametric weighted stochastic block models. Phys Rev E. 2018;97:012306. doi: 10.1103/PhysRevE.97.012306. [DOI] [PubMed] [Google Scholar]

[CR33] Peng L, Carvalho L. Bayesian degree-corrected stochastic blockmodels for community detection. Electron J Stat. 2016;10:2746–2779. doi: 10.1214/16-EJS1163. [DOI] [Google Scholar]

[CR34] Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: Proceedings of the 20th international conference on computer and information sciences. Springer, Berlin, ISCIS’05, pp 284–293

[CR35] Rohe K, Chatterjee S, Yu B. Spectral clustering and the high-dimensional stochastic blockmodel. Ann Stat. 2011;39:1878–1915. doi: 10.1214/11-AOS887. [DOI] [Google Scholar]

[CR36] Saldaña DF, Yu Y, Feng Y. How many communities are there? J Comput Graph Stat. 2017;26:171–181. doi: 10.1080/10618600.2015.1096790. [DOI] [Google Scholar]

[CR37] Snijders TAB, Nowicki K. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J Classif. 1997;14:75–100. doi: 10.1007/s003579900004. [DOI] [Google Scholar]

[CR38] Stouffer DB, Bascompte J. Compartmentalization increases food-web persistence. Proc Natl Acad Sci USA. 2011;108:3648–3652. doi: 10.1073/pnas.1014353108. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR39] von Luxburg U. A tutorial on spectral clustering. Stat Comput. 2007;17:395–416. doi: 10.1007/s11222-007-9033-z. [DOI] [Google Scholar]

[CR40] Wang YXR, Bickel PJ. Likelihood-based model selection for stochastic block models. Ann Stat. 2017;45:500–528. [Google Scholar]

[CR41] Yan X (2016) Bayesian model selection of stochastic block models. In: 2016 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), pp 323–328

[CR42] Ye Z-S, Chen N. Closed-form estimators for the gamma distribution derived from likelihood equations. Am Stat. 2017;71:177–181. doi: 10.1080/00031305.2016.1209129. [DOI] [Google Scholar]

[CR43] Zanghi H, Picard F, Miele V, Ambroise C. Strategies for online inference of model-based clustering in large and growing networks. Ann Appl Stat. 2010;4:687–714. doi: 10.1214/10-AOAS359. [DOI] [Google Scholar]

PERMALINK

Weighted stochastic block model

Tin Lok James Ng

Thomas Brendan Murphy

Abstract

Introduction

Model specification

Generative model

Assumptions

Assumption 1

Assumption 2

Assumption 3

Assumption 4

Assumption 5

Assumption 6

Identifiability

Asymptotic recovery of class labels

Theorem 1

Maximum likelihood estimation of WSBM parameters

Theorem 2

Theorem 3

Variational estimators

Proposition 1

Proposition 2

Choosing the number of classes

Simulation

Experiment 1 (two-class model)

Table 1.

Table 2.

Experiment 2 (three-class model)

Table 3.

Table 4.

Computational complexity

Fig. 1.

Application: Washington bike data set

Table 5.

Fig. 2.

Fig. 3.

Discussion

Auxiliary results

Definition 1

Lemma 1

Proposition 3

Proof

Proposition 4

Proof

Lemma 2

Proposition 5

Proof

Proof of Theorem 1

Proof

Proof of Theorem 2

Proof

Proposition 6

Proof

Proposition 7

Proof

Proof of Thereom 3

Proof

Proposition 8

Proof

Funding

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases