Author manuscript; available in PMC: 2022 Oct 1.
Published in final edited form as: Biometrics. 2021 Apr 12;78(3):1018–1030. doi: 10.1111/biom.13464

Testing for association in multiview network data

Lucy L Gao 1, Daniela Witten 2, Jacob Bien 3
PMCID: PMC8484362  NIHMSID: NIHMS1729336  PMID: 33792914

Abstract

In this paper, we consider data consisting of multiple networks, each composed of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multiview network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two-view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein–protein interaction data from the HINT database. We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to cocomplex association data. We also extend this proposal to the setting of a network with node covariates. The proposed methods extend readily to three or more network/multivariate data views.

Keywords: community detection, data integration, multiview data, node covariates, stochastic block model

1 |. INTRODUCTION

A network consists of the pairwise relationships (edges) between objects of interest (nodes). For example, nodes could correspond to proteins, with edges representing physical interactions, or nodes could correspond to people, with edges representing social interactions. Of the many models for network data (Erdős and Rényi, 1960; Holland and Leinhardt, 1981; Hoff et al., 2002), one of the best known is the stochastic block model (SBM) (Holland et al., 1983), which assumes that nodes belong to latent communities, and that the probability of an edge between a pair of nodes is a function of their community memberships only.

Multiple sets of edges are often available on a common set of nodes, as shown in Figure 1A. Consider a pair of protein–protein interaction networks in which the nodes correspond to proteins. In one network, the edges represent physical interactions, and in the other, they represent comembership in a protein complex. Another often-encountered scenario involves a single network, with a set of covariates corresponding to each node, as shown in Figure 1B. For instance, we might have a social network along with p demographic covariates for each member of the network. Both Figures 1A and 1B are examples of the multiview data setting (Sun, 2013). We will refer to the two networks in Figure 1A, or the network and the covariates corresponding to the nodes in Figure 1B as two data views.

FIGURE 1.

Two examples of multiview data involving a network: (A) two network views on n = 10 nodes and (B) a network view and an n × p multivariate view on n = 10 nodes

Extensions of network models to the multiview data setting (Fosdick and Hoff, 2015; Han et al., 2015; Gollini and Murphy, 2016; Binkiewicz et al., 2017; Salter-Townshend and McCormick, 2017; D’Angelo et al., 2019) often assume that the data views are closely related. For example, extensions of the SBM typically assume that the latent communities within each network view are closely related (Han et al., 2015; Peixoto, 2015; Stanley et al., 2016; Binkiewicz et al., 2017; Stanley et al., 2019).

In this paper, we propose a test of the assumption that the latent communities are related. Why is this important? First of all, we should check whether two data views are, in fact, associated before we fit a model that relies on this assumption. Second, the relationship between the views may itself be of interest, and the test that we propose will allow us to assess this relationship. For example, such a tool can help shed light on whether the two distinct definitions of protein interactions capture similar versus complementary latent structures. Likewise, it can provide insight about whether people's social interactions and demographics are related. Gao et al. (2020) investigated a similar problem for two multivariate data views, but did not consider the case where one or both views are networks.

To this end, we extend the SBM to the multiview network setting (Figure 1A) without assuming that the network views are closely related. We then ask: are the latent communities within each network view associated? Similarly, for the case of a network view and a multivariate view (Figure 1B), we model the network view with an SBM and model the multivariate view with a finite mixture model (FMM), without assuming that the views are closely related. We then ask: are the latent communities within the network data view and the latent clusters within the multivariate data view associated?

The rest of the paper is organized as follows. We review the SBM in Section 2. We extend the SBM to two network data views in Section 3, and develop a test for association between the latent communities within each view in Section 4. We develop a related test for the case of a network view and a multivariate view in Section 5. We review related literature in Section 6, and explore the performance of our tests via simulation in Section 7. In Section 8, we apply the test from Section 4 to protein networks from the HINT database (Das and Yu, 2012b). Section 9 provides a discussion.

2 |. THE STOCHASTIC BLOCK MODEL (Holland et al., 1983)

In this section, we briefly review the SBM proposed by Holland et al. (1983) for a single network; see Matias and Robin (2014) for a detailed review.

2.1 |. Model and notation

Let $X \in \{0, 1\}^{n \times n}$ be the adjacency matrix of an undirected, unweighted network with $n$ nodes and no self-loops, so that $X$ is symmetric and $X_{ii} = 0$ for $i = 1, 2, \ldots, n$. We assume that the nodes are partitioned into $K$ communities, with unobserved memberships given by a latent random vector $Z = (Z_1, \ldots, Z_n)$ with independent and identically distributed (i.i.d.) elements and $\mathbb{P}(Z_i = k) = \pi_k$ for $\pi \in \Delta_+^K \equiv \{\pi \in \mathbb{R}^K : 1_K^T \pi = 1, \pi_k > 0\}$. Conditional on $Z$, the edges are independently drawn from a Bernoulli distribution, with $\mathbb{P}[X_{ij} = 1 \mid Z] = \theta_{Z_i Z_j}$ for a symmetric matrix $\theta \in [0, 1]^{K \times K}$. It follows that

$$f(X \mid Z) = \prod_{i=1}^{n} \prod_{j=1}^{i-1} (\theta_{Z_i Z_j})^{X_{ij}} (1 - \theta_{Z_i Z_j})^{1 - X_{ij}}, \qquad \mathbb{P}(Z = z) = \prod_{i=1}^{n} \pi_{z_i}. \tag{1}$$
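As a concrete illustration of model (1), the following minimal Python sketch draws community memberships and edges from an SBM. The function name `sample_sbm`, the 0-based community labels, and the parameter values are illustrative assumptions and do not come from the paper; only the standard library is used.

```python
import random

def sample_sbm(n, pi, theta, rng):
    """Draw (Z, X) from the SBM in (1).

    Z_i are i.i.d. with P(Z_i = k) = pi[k] (labels are 0-based here), and
    X_ij | Z ~ Bernoulli(theta[Z_i][Z_j]) independently for each pair j < i.
    """
    K = len(pi)
    Z = rng.choices(range(K), weights=pi, k=n)    # latent memberships
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):                        # visit each unordered pair once
            if rng.random() < theta[Z[i]][Z[j]]:
                X[i][j] = X[j][i] = 1             # symmetric, zero diagonal
    return Z, X

# Hypothetical parameters: two balanced, assortative communities.
rng = random.Random(0)
pi = [0.5, 0.5]
theta = [[0.30, 0.05],
         [0.05, 0.30]]
Z, X = sample_sbm(200, pi, theta, rng)
```

Looping over $j < i$ mirrors the double product in (1): each unordered pair of nodes contributes exactly one Bernoulli draw.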

2.2 |. Approximate pseudolikelihood function

As a result of (1), the log-likelihood function for the SBM is given by

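$$\ell(\theta, \pi; X) \equiv \log \left( \sum_{z_1=1}^{K} \cdots \sum_{z_n=1}^{K} \left( \prod_{i=1}^{n} \prod_{j=1}^{i-1} (\theta_{z_i z_j})^{X_{ij}} (1 - \theta_{z_i z_j})^{1 - X_{ij}} \right) \left( \prod_{i=1}^{n} \pi_{z_i} \right) \right). \tag{2}$$

Equation (2) sums over $K^n$ terms, and is thus computationally intractable. Therefore, Amini et al. (2013) developed an approximate pseudolikelihood function, in the sense of Besag (1975). We briefly review this approach; see Web Appendix A for a detailed review.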
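To make the $K^n$ cost of (2) concrete, the sketch below evaluates the log-likelihood by brute-force enumeration of all membership vectors. It is only feasible for toy networks and is included purely for illustration; the function name `sbm_loglik_bruteforce` and the example inputs are assumptions, not part of the paper.

```python
import itertools
import math

def sbm_loglik_bruteforce(X, pi, theta):
    """Evaluate (2) by summing over all K**n membership vectors z."""
    n, K = len(X), len(pi)
    total = 0.0
    for z in itertools.product(range(K), repeat=n):   # K**n terms
        p = 1.0
        for i in range(n):
            p *= pi[z[i]]                              # P(Z = z) factor
            for j in range(i):                         # f(X | Z = z) factors
                t = theta[z[i]][z[j]]
                p *= t if X[i][j] == 1 else 1.0 - t
        total += p
    return math.log(total)

# Hypothetical 4-node path network with K = 2, so 2**4 = 16 terms.
X_toy = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
loglik = sbm_loglik_bruteforce(X_toy, [0.5, 0.5], [[0.6, 0.1], [0.1, 0.6]])
n_terms_for_100_nodes = 2 ** 100   # already far beyond what can be enumerated
```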

Let $\hat{Z} \in \{1, \ldots, K\}^n$ be the result of applying spectral clustering with perturbations (Amini et al., 2013) to $X$. Define $\hat{b} \in \mathbb{R}^{n \times K}$ with rows $\hat{b}_i$ and entries $\hat{b}_{im} \equiv \sum_{j=1}^{n} X_{ij} 1\{\hat{Z}_j = m\}$, and let $d = X 1_n$. Here, $\hat{b}_{im}$ is the number of edges connecting the $i$th node to the $m$th estimated community in $\hat{Z}$, and $d$ contains the degrees of the $n$ nodes. Let $\hat{R}$ be the confusion matrix between $\hat{Z}$ and $Z$, and define the $K \times K$ matrix $\eta = (\mathrm{diag}(\theta \hat{R} 1_K))^{-1} \theta \hat{R}$, with rows $\eta_1, \ldots, \eta_K \in \Delta_+^K$. Let $g(\cdot; N, q)$ denote the probability mass function of a Multinomial$(N, q_1, \ldots, q_K)$ random variable. Amini et al. (2013) treated $\hat{Z}$ and $\eta$ as fixed and showed that

$$\hat{b} \mid d, Z \;\dot{\sim}\; \prod_{i=1}^{n} g(\hat{b}_i; d_i, \eta_{Z_i}), \tag{3}$$

where $\dot{\sim}$ denotes “approximately distributed as.” Ignoring any dependence between $Z$ and $d$, and marginalizing over $Z$ in (3) to approximate the conditional distribution of $\hat{b}$ given $d$, yields the following log-pseudolikelihood function:

$$\ell_{PL}(\eta, \pi; \hat{b} \mid d) \equiv \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \, g(\hat{b}_i; d_i, \eta_k) \right). \tag{4}$$

This can be viewed as the log-likelihood function of an FMM (McLachlan and Peel, 2000) with $K$ components, of which the $k$th component has prior probability $\pi_k$ and density function $g(\hat{b}_i; d_i, \eta_k)$.
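The quantities entering (4) can be sketched in a few lines of Python. The sketch assumes that the estimated labels `Z_hat` are already available (the spectral clustering step is not shown) and that labels are 0-based; the function names and the toy inputs are hypothetical.

```python
import math

def block_counts(X, Z_hat, K):
    """Return b_hat (edges from each node to each estimated community) and degrees d."""
    n = len(X)
    b_hat = [[0] * K for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if X[i][j]:
                b_hat[i][Z_hat[j]] += 1
    d = [sum(row) for row in b_hat]          # equals the row sums of X
    return b_hat, d

def log_multinomial_pmf(counts, probs):
    """log g(counts; N, probs) with N = sum(counts); probs must be positive."""
    out = math.lgamma(sum(counts) + 1)
    for c, q in zip(counts, probs):
        out -= math.lgamma(c + 1)
        if c > 0:
            out += c * math.log(q)
    return out

def log_pseudolik(b_hat, eta, pi):
    """Log-pseudolikelihood (4), using log-sum-exp over the K components."""
    total = 0.0
    for b_i in b_hat:
        terms = [math.log(pi[k]) + log_multinomial_pmf(b_i, eta[k])
                 for k in range(len(pi))]
        mx = max(terms)
        total += mx + math.log(sum(math.exp(t - mx) for t in terms))
    return total

# Toy 4-node path network with assumed estimated labels.
X = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
Z_hat = [0, 0, 1, 1]
b_hat, d = block_counts(X, Z_hat, K=2)
value = log_pseudolik(b_hat, eta=[[0.8, 0.2], [0.2, 0.8]], pi=[0.5, 0.5])
```

Because each row $\hat{b}_i$ sums to $d_i$, the multinomial count $N$ in $g$ never needs to be passed separately.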

3 |. A STOCHASTIC BLOCK MODEL FOR TWO NETWORK DATA VIEWS

In this section, we extend the SBM to the setting of two network data views, and derive approximate pseudolikelihood functions for the proposed multiview SBM.

3.1 |. Model and notation

Suppose that we have two network views on a common set of $n$ nodes, as in Figure 1A; for example, a binary network and a cocomplex network on $n$ proteins. We assume that the networks are undirected, unweighted, and have no self-loops. Let $X^{(1)}, X^{(2)} \in \{0, 1\}^{n \times n}$ be the symmetric adjacency matrices of the two networks, where $X_{ii}^{(l)} = 0$ for $i = 1, 2, \ldots, n$ and $l = 1, 2$.

We model $X^{(1)}$ with an SBM (Section 2.1) with $K^{(1)}$ communities, and $X^{(2)}$ with an SBM with $K^{(2)}$ communities. It follows from (1) that for $l = 1, 2$,

$$f(X^{(l)} \mid Z^{(l)}) = \prod_{j=1}^{n} \prod_{i=1}^{j-1} \left(\theta^{(l)}_{Z_i^{(l)} Z_j^{(l)}}\right)^{X_{ij}^{(l)}} \left(1 - \theta^{(l)}_{Z_i^{(l)} Z_j^{(l)}}\right)^{1 - X_{ij}^{(l)}}, \qquad \mathbb{P}(Z^{(l)} = z^{(l)}) = \prod_{i=1}^{n} \pi^{(l)}_{z_i^{(l)}}, \tag{5}$$

for a symmetric matrix $\theta^{(l)} \in [0, 1]^{K^{(l)} \times K^{(l)}}$ and $\pi^{(l)} \in \Delta_+^{K^{(l)}}$. Here, for $l = 1, 2$, $Z^{(l)}$ represents the latent community memberships for the $n$ nodes within the $l$th network data view. We assume that the $n$ pairs $\{(Z_i^{(1)}, Z_i^{(2)})\}_{i=1}^{n}$ are i.i.d. and that $X^{(1)} \perp X^{(2)} \mid Z^{(1)}, Z^{(2)}$.

The following result allows us to parameterize the joint distribution of $Z^{(1)}$ and $Z^{(2)}$.

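Proposition 1 (Gao et al., 2020). Consider two categorical random variables $A$ and $B$ with $K$ and $K'$ levels, respectively, and with $\mathbb{P}(A = k) = \pi_k$ and $\mathbb{P}(B = k') = \pi'_{k'}$, for $\pi \in \Delta_+^K$ and $\pi' \in \Delta_+^{K'}$. Then, there exists a unique matrix $C \in \mathcal{C}_{\pi, \pi'}$ such that $\mathbb{P}(A = k, B = k') = \pi_k \pi'_{k'} C_{kk'}$, where $\mathcal{C}_{\pi, \pi'} \equiv \{C \in \mathbb{R}^{K \times K'} : C_{kk'} \geq 0, \; C \pi' = 1_K, \; C^T \pi = 1_{K'}\}$.

It follows from applying Proposition 1 to each of the $n$ pairs of categorical variables $\{(Z_i^{(1)}, Z_i^{(2)})\}_{i=1}^{n}$ that there exists a unique $K^{(1)} \times K^{(2)}$ matrix $C \in \mathcal{C}_{\pi^{(1)}, \pi^{(2)}}$ such that

$$\mathbb{P}(Z^{(1)} = z^{(1)}, Z^{(2)} = z^{(2)}) = \prod_{i=1}^{n} \mathbb{P}(Z_i^{(1)} = z_i^{(1)}, Z_i^{(2)} = z_i^{(2)}) = \prod_{i=1}^{n} \pi^{(1)}_{z_i^{(1)}} \pi^{(2)}_{z_i^{(2)}} C_{z_i^{(1)} z_i^{(2)}}, \tag{6}$$

where the first equality follows from the independence of the $n$ pairs $\{(Z_i^{(1)}, Z_i^{(2)})\}_{i=1}^{n}$. Here, $C_{kk'} = \dfrac{\mathbb{P}(Z_i^{(1)} = k, Z_i^{(2)} = k')}{\mathbb{P}(Z_i^{(1)} = k) \, \mathbb{P}(Z_i^{(2)} = k')}$ describes the dependence between the $k$th community in the first view and the $k'$th community in the second view, with $C_{kk'} = 1$ indicating independence, $C_{kk'} < 1$ indicating negative dependence, and $C_{kk'} > 1$ indicating positive dependence.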
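The parameterization in Proposition 1 can be checked numerically. The short Python sketch below recovers $\pi$, $\pi'$, and $C$ from a joint probability table; the function name `dependence_matrix` and the example table are made up for illustration.

```python
def dependence_matrix(joint):
    """Given joint[k][k'] = P(A = k, B = k'), return (pi, pi_prime, C)
    with C[k][k'] = joint[k][k'] / (pi[k] * pi_prime[k'])."""
    pi = [sum(row) for row in joint]                 # marginal of A
    pi_prime = [sum(col) for col in zip(*joint)]     # marginal of B
    C = [[joint[k][kp] / (pi[k] * pi_prime[kp])
          for kp in range(len(pi_prime))]
         for k in range(len(pi))]
    return pi, pi_prime, C

# Hypothetical joint distribution with positive dependence on the diagonal.
joint = [[0.30, 0.10],
         [0.10, 0.50]]
pi, pi_prime, C = dependence_matrix(joint)

# The constraints defining the set of admissible C hold by construction:
# C pi' = 1_K and C^T pi = 1_{K'}.
row_check = [sum(C[k][kp] * pi_prime[kp] for kp in range(2)) for k in range(2)]
col_check = [sum(C[k][kp] * pi[k] for k in range(2)) for kp in range(2)]
```

In this example the diagonal entries of `C` exceed 1 and the off-diagonal entries fall below 1, matching the interpretation of $C_{kk'}$ as positive or negative dependence.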

3.2 |. Approximate pseudolikelihood function

The log-likelihood function of model (5)–(6) is given by

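$$\ell(\theta^{(1)}, \theta^{(2)}, \pi^{(1)}, \pi^{(2)}, C; X^{(1)}, X^{(2)}) \equiv \log \left( \sum_{z_1^{(1)}=1}^{K^{(1)}} \cdots \sum_{z_n^{(1)}=1}^{K^{(1)}} \sum_{z_1^{(2)}=1}^{K^{(2)}} \cdots \sum_{z_n^{(2)}=1}^{K^{(2)}} \left( \prod_{l=1}^{2} \prod_{i=1}^{n} \prod_{j=1}^{i-1} \left(\theta^{(l)}_{z_i^{(l)} z_j^{(l)}}\right)^{X_{ij}^{(l)}} \left(1 - \theta^{(l)}_{z_i^{(l)} z_j^{(l)}}\right)^{1 - X_{ij}^{(l)}} \right) \left( \prod_{i=1}^{n} \pi^{(1)}_{z_i^{(1)}} \pi^{(2)}_{z_i^{(2)}} C_{z_i^{(1)} z_i^{(2)}} \right) \right). \tag{7}$$

Equation (7) is computationally intractable, because it involves summing over $(K^{(1)} K^{(2)})^n$ terms. Thus, we will derive an approximate pseudolikelihood function for model (5)–(6). For $l = 1, 2$, let $\hat{Z}^{(l)} \in \{1, \ldots, K^{(l)}\}^n$ be the result of applying spectral clustering with perturbations (Amini et al., 2013) to $X^{(l)}$, let $\hat{b}^{(l)}$ be the $n \times K^{(l)}$ matrix defined by $\hat{b}_{im}^{(l)} = \sum_{j=1}^{n} X_{ij}^{(l)} 1\{\hat{Z}_j^{(l)} = m\}$, and let $d^{(l)} = X^{(l)} 1_n$. Here, for the $l$th network, $\hat{b}_{im}^{(l)}$ is the number of edges connecting the $i$th node to the $m$th estimated community, and $d_i^{(l)}$ is the degree of the $i$th node. We write

$$f(\hat{b}^{(1)}, \hat{b}^{(2)} \mid d^{(1)}, d^{(2)}, Z^{(1)}, Z^{(2)}) = \frac{f(\hat{b}^{(1)}, \hat{b}^{(2)}, d^{(1)}, d^{(2)} \mid Z^{(1)}, Z^{(2)})}{f(d^{(1)}, d^{(2)} \mid Z^{(1)}, Z^{(2)})} = \prod_{l=1}^{2} \frac{f(\hat{b}^{(l)}, d^{(l)} \mid Z^{(l)})}{f(d^{(l)} \mid Z^{(l)})} = \prod_{l=1}^{2} f(\hat{b}^{(l)} \mid d^{(l)}, Z^{(l)}), \tag{8}$$

where the first and third equalities follow from the definition of a conditional density, and the second equality follows from the fact that $X^{(1)} \perp X^{(2)} \mid Z^{(1)}, Z^{(2)}$ and $X^{(1)} \perp Z^{(2)} \mid Z^{(1)}$ and $X^{(2)} \perp Z^{(1)} \mid Z^{(2)}$ (Section 3.1). Let $\hat{R}^{(l)}$ be the confusion matrix between $\hat{Z}^{(l)}$ and $Z^{(l)}$ and let $\eta^{(l)} = (\mathrm{diag}(\theta^{(l)} \hat{R}^{(l)} 1_{K^{(l)}}))^{-1} \theta^{(l)} \hat{R}^{(l)}$. As in Amini et al. (2013), we treat $\hat{Z}^{(l)}$ and $\eta^{(l)}$ as fixed, and apply (3) in Section 2.2 to approximate $f(\hat{b}^{(l)} \mid d^{(l)}, Z^{(l)})$ in (8), which yields

$$f(\hat{b}^{(1)}, \hat{b}^{(2)} \mid d^{(1)}, d^{(2)}, Z^{(1)}, Z^{(2)}) \approx \prod_{l=1}^{2} \prod_{i=1}^{n} g\left(\hat{b}_i^{(l)}; d_i^{(l)}, \eta^{(l)}_{Z_i^{(l)}}\right). \tag{9}$$

Ignoring any dependence between $(d^{(1)}, d^{(2)})$ and $(Z^{(1)}, Z^{(2)})$ and marginalizing over the latent community memberships $Z^{(1)}$ and $Z^{(2)}$ in (9) to approximate the conditional distribution of $\hat{b}^{(1)}$ and $\hat{b}^{(2)}$ given $d^{(1)}$ and $d^{(2)}$ yields the following log-pseudolikelihood function:

$$\ell_{PL}(\eta^{(1)}, \eta^{(2)}, \pi^{(1)}, \pi^{(2)}, C; \hat{b}^{(1)}, \hat{b}^{(2)} \mid d^{(1)}, d^{(2)}) \equiv \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K^{(1)}} \sum_{k'=1}^{K^{(2)}} \pi_k^{(1)} \pi_{k'}^{(2)} C_{kk'} \, g(\hat{b}_i^{(1)}; d_i^{(1)}, \eta_k^{(1)}) \, g(\hat{b}_i^{(2)}; d_i^{(2)}, \eta_{k'}^{(2)}) \right). \tag{10}$$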

This closely resembles the log-likelihood function of the FMM for two multivariate data views from Gao et al. (2020).

4 |. ARE TWO NETWORK VIEWS’ COMMUNITY MEMBERSHIPS ASSOCIATED?

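Recall from (6) that $\mathbb{P}(Z^{(1)} = z^{(1)}, Z^{(2)} = z^{(2)}) = \prod_{i=1}^{n} \pi^{(1)}_{z_i^{(1)}} \pi^{(2)}_{z_i^{(2)}} C_{z_i^{(1)} z_i^{(2)}}$, where $C \in \mathcal{C}_{\pi^{(1)}, \pi^{(2)}}$, defined in Proposition 1. It follows from the definition of $\mathbb{P}(Z^{(l)} = z^{(l)})$ in (5) that

$$\mathbb{P}(Z^{(1)} = z^{(1)}, Z^{(2)} = z^{(2)}) = \mathbb{P}(Z^{(1)} = z^{(1)}) \, \mathbb{P}(Z^{(2)} = z^{(2)})$$

if and only if $C = 1_{K^{(1)}} 1_{K^{(2)}}^T$. Thus, testing the null hypothesis of independence between the latent community memberships $Z^{(1)}$ and $Z^{(2)}$ amounts to testing $H_0 : C = 1_{K^{(1)}} 1_{K^{(2)}}^T$.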

4.1 |. The P2LRT statistic

To test $H_0 : C = 1_{K^{(1)}} 1_{K^{(2)}}^T$, one might consider using a likelihood ratio test. The likelihood ratio test statistic is of the form

$$\max_{\theta^{(1)}, \theta^{(2)}, \pi^{(1)}, \pi^{(2)}, C} \ell(\theta^{(1)}, \theta^{(2)}, \pi^{(1)}, \pi^{(2)}, C; X^{(1)}, X^{(2)}) - \max_{\theta^{(1)}, \theta^{(2)}, \pi^{(1)}, \pi^{(2)}} \ell(\theta^{(1)}, \theta^{(2)}, \pi^{(1)}, \pi^{(2)}, 1_{K^{(1)}} 1_{K^{(2)}}^T; X^{(1)}, X^{(2)}),$$

where the log-likelihood function $\ell$ is defined in (7). Unfortunately, recall from Section 3.2 that (7) is computationally intractable because it involves summing over $(K^{(1)} K^{(2)})^n$ terms. We could replace the log-likelihood functions $\ell$ with log-pseudolikelihood functions $\ell_{PL}$, defined in (10). This leads to a test statistic of the form

$$\log \Lambda \equiv \max_{\eta^{(1)}, \eta^{(2)}, \pi^{(1)}, \pi^{(2)}, C} \ell_{PL}(\eta^{(1)}, \eta^{(2)}, \pi^{(1)}, \pi^{(2)}, C; \hat{b}^{(1)}, \hat{b}^{(2)} \mid d^{(1)}, d^{(2)}) - \max_{\eta^{(1)}, \eta^{(2)}, \pi^{(1)}, \pi^{(2)}} \ell_{PL}(\eta^{(1)}, \eta^{(2)}, \pi^{(1)}, \pi^{(2)}, 1_{K^{(1)}} 1_{K^{(2)}}^T; \hat{b}^{(1)}, \hat{b}^{(2)} \mid d^{(1)}, d^{(2)}). \tag{11}$$

However, $\ell_{PL}$ is a nonconcave function of its arguments, and therefore, no algorithms are available to exactly compute the two terms in (11); they can at best be approximated via local maxima. Taking the difference between two local maxima can lead to undesirable behavior; for example, $\log \Lambda$ can be negative.

To overcome this problem, we take a different approach, motivated by the fact that each data view $X^{(l)}$ marginally follows an SBM (Section 3.1). Rather than estimating the parameters $\eta^{(1)}$, $\eta^{(2)}$, $\pi^{(1)}$, $\pi^{(2)}$, and $C$ by maximizing the log-pseudolikelihood function for the multiview SBM (10), we first estimate $\eta^{(1)}, \pi^{(1)}$ and $\eta^{(2)}, \pi^{(2)}$ by maximizing the log-pseudolikelihood function for the SBM (4) for each view separately. As (4) can be viewed as the log-likelihood function of an FMM (Section 2.2), it can be maximized using the expectation-maximization (EM; Dempster et al., 1977) algorithm for fitting FMMs (McLachlan and Krishnan, 2007). We then plug these estimates into (11), yielding the test statistic

$$\log \tilde{\Lambda} \equiv \max_{C \in \mathcal{C}_{\hat{\pi}^{(1)}, \hat{\pi}^{(2)}}} \ell_{PL}(\hat{\eta}^{(1)}, \hat{\eta}^{(2)}, \hat{\pi}^{(1)}, \hat{\pi}^{(2)}, C; \hat{b}^{(1)}, \hat{b}^{(2)} \mid d^{(1)}, d^{(2)}) - \ell_{PL}(\hat{\eta}^{(1)}, \hat{\eta}^{(2)}, \hat{\pi}^{(1)}, \hat{\pi}^{(2)}, 1_{K^{(1)}} 1_{K^{(2)}}^T; \hat{b}^{(1)}, \hat{b}^{(2)} \mid d^{(1)}, d^{(2)}). \tag{12}$$

Computing (12) requires maximizing the first term with respect to $C$, that is, computing

$$\hat{C} \equiv \arg\max_{C \in \mathcal{C}_{\hat{\pi}^{(1)}, \hat{\pi}^{(2)}}} \ell_{PL}(\hat{\eta}^{(1)}, \hat{\eta}^{(2)}, \hat{\pi}^{(1)}, \hat{\pi}^{(2)}, C; \hat{b}^{(1)}, \hat{b}^{(2)} \mid d^{(1)}, d^{(2)}), \tag{13}$$

where $\mathcal{C}_{\cdot,\cdot}$ is defined in Proposition 1. Because the objective of (13) is a concave function of $C$, $\hat{C}$ can be obtained using techniques from convex optimization. (In particular, we use an exponentiated gradient descent algorithm (Kivinen and Warmuth, 1997) developed in Gao et al. (2020) for maximizing concave functions of $C$ under the constraint that $C \in \mathcal{C}_{\hat{\pi}^{(1)}, \hat{\pi}^{(2)}}$; the complexity of each iteration is $O(n K^{(1)} K^{(2)})$.) This means that (12) completely overcomes the challenges associated with the test statistic (11); for example, (12) cannot be negative, because $C = 1_{K^{(1)}} 1_{K^{(2)}}^T$ is itself a feasible point of (13). Furthermore, results from Liang and Self (1996) and Chen and Liang (2010) suggest that performing a partial maximization over the parameters (as in (12)) rather than a full maximization (as in (11)) does not lead to an appreciable loss in power when $n$ is large.

We refer to $\log \tilde{\Lambda}$ in (12) as a pseudo-pseudo-likelihood ratio test (P2LRT) statistic. In the name P2LRT, the term “pseudo” is used in two different senses: the first is because we use the pseudolikelihood function $\ell_{PL}$ in place of the likelihood function, and the second is because we do not perform a full joint maximization over $(\eta^{(1)}, \eta^{(2)}, \pi^{(1)}, \pi^{(2)}, C)$.

We summarize the procedure for computing the P2LRT statistic in Algorithm 1.
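The computation of (12) and (13) described in Section 4.1 can be sketched in Python as follows. This is a rough illustration under several assumptions, not the authors' implementation: the per-node component likelihoods `A[i][k]` $= g(\hat{b}_i^{(1)}; d_i^{(1)}, \hat{\eta}_k^{(1)})$ and `B[i][k']` $= g(\hat{b}_i^{(2)}; d_i^{(2)}, \hat{\eta}_{k'}^{(2)})$ are taken as precomputed from the per-view EM fits; the optimization is carried out over the joint matrix $P_{kk'} = \hat{\pi}_k^{(1)} \hat{\pi}_{k'}^{(2)} C_{kk'}$; and the optimizer is a generic exponentiated-gradient (entropic mirror ascent) update followed by Sinkhorn rescaling to restore the marginals, which may differ in its details from the algorithm of Gao et al. (2020). All function names are hypothetical.

```python
import math

def pl_value(P, A, B):
    """sum_i log( sum_{k,k'} P[k][k'] * A[i][k] * B[i][k'] ), i.e. (10) at fixed eta, pi."""
    return sum(math.log(sum(P[k][kp] * a[k] * b[kp]
                            for k in range(len(a)) for kp in range(len(b))))
               for a, b in zip(A, B))

def sinkhorn(P, pi1, pi2, sweeps=50):
    """Alternately rescale rows and columns so that P has marginals pi1 and pi2."""
    K1, K2 = len(pi1), len(pi2)
    for _ in range(sweeps):
        for k in range(K1):
            s = sum(P[k])
            P[k] = [v * pi1[k] / s for v in P[k]]
        for kp in range(K2):
            s = sum(P[k][kp] for k in range(K1))
            for k in range(K1):
                P[k][kp] *= pi2[kp] / s
    return P

def p2lrt(A, B, pi1, pi2, step=1.0, iters=200):
    """Return (statistic, C_hat); the best iterate is kept, so statistic >= 0."""
    n, K1, K2 = len(A), len(pi1), len(pi2)
    P = [[pi1[k] * pi2[kp] for kp in range(K2)] for k in range(K1)]  # C = 1 1^T
    null_value = pl_value(P, A, B)
    best_value, best_P = null_value, [row[:] for row in P]
    for _ in range(iters):
        grad = [[0.0] * K2 for _ in range(K1)]          # gradient of pl_value / n
        for a, b in zip(A, B):
            m = sum(P[k][kp] * a[k] * b[kp] for k in range(K1) for kp in range(K2))
            for k in range(K1):
                for kp in range(K2):
                    grad[k][kp] += a[k] * b[kp] / (m * n)
        P = [[P[k][kp] * math.exp(step * grad[k][kp]) for kp in range(K2)]
             for k in range(K1)]                        # multiplicative update
        P = sinkhorn(P, pi1, pi2)                       # back onto the constraint set
        value = pl_value(P, A, B)
        if value > best_value:
            best_value, best_P = value, [row[:] for row in P]
    C_hat = [[best_P[k][kp] / (pi1[k] * pi2[kp]) for kp in range(K2)]
             for k in range(K1)]
    return best_value - null_value, C_hat

# Toy inputs: four nodes whose component likelihoods agree across the two views.
A = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
stat, C_hat = p2lrt(A, A, [0.5, 0.5], [0.5, 0.5])
```

Starting from $C = 1_{K^{(1)}} 1_{K^{(2)}}^T$ and retaining the best iterate guarantees a nonnegative statistic in this sketch. With realistic node degrees the multinomial probabilities are very small, so a practical implementation would work with log-likelihoods rather than the raw values used here.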

4.2 |. Approximating the null distribution

Under the null hypothesis that the community memberships $Z^{(1)}$ and $Z^{(2)}$ are independent, that is, under $H_0 : C = 1_{K^{(1)}} 1_{K^{(2)}}^T$, we can write the joint density of $X^{(1)}$ and $X^{(2)}$ as

$$f(X^{(1)}, X^{(2)}) = E_{Z^{(1)}, Z^{(2)}}\left[f(X^{(1)}, X^{(2)} \mid Z^{(1)}, Z^{(2)})\right] = E_{Z^{(1)}, Z^{(2)}}\left[f(X^{(1)} \mid Z^{(1)}) f(X^{(2)} \mid Z^{(2)})\right] = E_{Z^{(1)}}\left[f(X^{(1)} \mid Z^{(1)})\right] E_{Z^{(2)}}\left[f(X^{(2)} \mid Z^{(2)})\right] = f(X^{(1)}) f(X^{(2)}),$$

where the second equality follows from the fact that $X^{(1)} \perp X^{(2)} \mid Z^{(1)}, Z^{(2)}$ and $X^{(1)} \perp Z^{(2)} \mid Z^{(1)}$ and $X^{(2)} \perp Z^{(1)} \mid Z^{(2)}$ (Section 3.1), and the third equality follows from the independence of $Z^{(1)}$ and $Z^{(2)}$ under $H_0$. Thus, under $H_0 : C = 1_{K^{(1)}} 1_{K^{(2)}}^T$, the joint distribution of $X^{(1)}$ and $X^{(2)}$ is invariant under permutation of the node labels $\{1, 2, \ldots, n\}$ in either network. It follows that we can approximate the null distribution of the P2LRT statistic $\log \tilde{\Lambda}$ defined in (12) by taking $M$ random permutations of the node labels in the second network, and comparing the observed value of $\log \tilde{\Lambda}$ to its empirical distribution in the permuted data. As $\hat{\eta}^{(1)}$, $\hat{\eta}^{(2)}$, $\hat{\pi}^{(1)}$, and $\hat{\pi}^{(2)}$ are invariant to permutation, we only need to compute $\hat{C}$ for each permutation. This is another advantage of the P2LRT statistic $\log \tilde{\Lambda}$ in (12) over $\log \Lambda$ in (11): if we had used $\log \Lambda$, then we would need to estimate $\eta^{(1)}$, $\eta^{(2)}$, $\pi^{(1)}$, $\pi^{(2)}$, and $C$ for each permutation. Details of the testing procedure are in Algorithm 2. In Step 3 of Algorithm 2, we add 1 to the numerator and the denominator of the permutation p-value to ensure that the p-value is never exactly zero (Phipson and Smyth, 2010).
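The permutation scheme described above can be sketched generically in Python. The statistic is passed in as a function `stat_fn`, and each view is represented as a list of per-node quantities aligned by node index (for the P2LRT, these would be the rows of the per-node likelihood matrices, since only $\hat{C}$ has to be recomputed). The function name, the interface, and the toy statistic are assumptions made for illustration.

```python
import random

def permutation_pvalue(stat_fn, view1, view2, M, rng):
    """Permutation p-value with 1 added to the numerator and the denominator."""
    observed = stat_fn(view1, view2)
    exceed = 0
    idx = list(range(len(view2)))
    for _ in range(M):
        rng.shuffle(idx)                              # relabel nodes in view 2 only
        permuted = [view2[i] for i in idx]
        if stat_fn(view1, permuted) >= observed:
            exceed += 1
    return (exceed + 1) / (M + 1)

def abs_covariance(x, y):
    """Toy statistic standing in for the P2LRT statistic."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)))

rng = random.Random(1)
scores = list(range(20))
p_value = permutation_pvalue(abs_covariance, scores, scores, M=199, rng=rng)
```

With $M$ permutations the smallest attainable p-value is $1/(M+1)$, which is why the correction prevents a p-value of exactly zero.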

When we reject $H_0 : C = 1_{K^{(1)}} 1_{K^{(2)}}^T$, it is often of interest to investigate the strength and location of the dependence between views. Recall from Section 3.1 that $C_{kk'}$ measures the dependence between the $k$th community in the first view and the $k'$th community in the second view. Thus, we can gain insight into the strength and location of the dependence between the communities in the two data views by examining $\hat{C}_{kk'}$ defined in (13).

5 |. EXTENSION TO A NETWORK VIEW AND A MULTIVARIATE VIEW

In this section, we develop a test of association between latent communities in a network view and latent clusters in a multivariate view.

5.1 |. Model and notation

We now propose an extension of the SBM to an undirected network view, $X \in \{0, 1\}^{n \times n}$, and a multivariate view, $Y \in \mathbb{R}^{n \times p}$. We assume that the network is undirected with no self-loops, so that $X$ is symmetric and $X_{ii} = 0$ for $i = 1, 2, \ldots, n$. We model $X$ with an SBM (Section 2.1) with $K^{(1)}$ communities and we model the rows of $Y$ with an FMM (McLachlan and Peel, 2000) with $K^{(2)}$ clusters, so that

$$f(X \mid Z^{(1)}) = \prod_{j=1}^{n} \prod_{i=1}^{j-1} \left(\theta_{Z_i^{(1)} Z_j^{(1)}}\right)^{X_{ij}} \left(1 - \theta_{Z_i^{(1)} Z_j^{(1)}}\right)^{1 - X_{ij}}, \qquad f(Y \mid Z^{(2)}) = \prod_{i=1}^{n} \phi\left(Y_i; \gamma_{Z_i^{(2)}}\right), \tag{14}$$

where $\phi(\cdot; \gamma)$ is a density parameterized by $\gamma$, and for $l = 1, 2$, the latent random vector $Z^{(l)} = (Z_1^{(l)}, \ldots, Z_n^{(l)})$ has i.i.d. elements with $\mathbb{P}(Z_i^{(l)} = k) = \pi_k^{(l)}$ for $\pi^{(l)} \in \Delta_+^{K^{(l)}}$. Here, $Z^{(1)}$ represents the latent community memberships in the network view, and $Z^{(2)}$ represents the latent cluster memberships in the multivariate view. We assume that the $n$ pairs $\{(Z_i^{(1)}, Z_i^{(2)})\}_{i=1}^{n}$ are i.i.d., and that $X \perp Y \mid Z^{(1)}, Z^{(2)}$. Thus, as in Section 3.1, it follows from Proposition 1 that there exists $C \in \mathcal{C}_{\pi^{(1)}, \pi^{(2)}}$ such that

$$\mathbb{P}(Z^{(1)} = z^{(1)}, Z^{(2)} = z^{(2)}) = \prod_{i=1}^{n} \pi^{(1)}_{z_i^{(1)}} \pi^{(2)}_{z_i^{(2)}} C_{z_i^{(1)} z_i^{(2)}}, \tag{15}$$

where $C_{kk'}$ describes the dependence between the $k$th community in the network view and the $k'$th cluster in the multivariate view.

5.2 |. Approximate pseudolikelihood function

The multiview log-likelihood function of model (14)–(15) is computationally intractable. Thus, we will derive a multiview log-pseudolikelihood function for model (14)–(15). We begin by approximating the conditional density of $\hat{b}$ and $Y$ given $d$, where $\hat{b}$ contains the number of edges connecting each of the $n$ nodes in the network to each of the $K^{(1)}$ estimated communities in the network, and $d$ contains the node degrees:

$$\hat{b}, Y \mid Z^{(1)}, Z^{(2)}, d \;\dot{\sim}\; \prod_{i=1}^{n} g\left(\hat{b}_i; d_i, \eta_{Z_i^{(1)}}\right) \phi\left(Y_i; \gamma_{Z_i^{(2)}}\right). \tag{16}$$

The derivation of (16) is very similar to the derivation of (9) in Section 3.2. Ignoring any dependence between $d$ and $(Z^{(1)}, Z^{(2)})$, and marginalizing over $Z^{(1)}$ and $Z^{(2)}$ in (16) to approximate the conditional distribution of $\hat{b}$ and $Y$ given $d$, yields

$$\ell_{PL}(\eta, \gamma, \pi^{(1)}, \pi^{(2)}, C; \hat{b}, Y \mid d) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K^{(1)}} \sum_{k'=1}^{K^{(2)}} \pi_k^{(1)} \pi_{k'}^{(2)} C_{kk'} \, g(\hat{b}_i; d_i, \eta_k) \, \phi(Y_i; \gamma_{k'}) \right). \tag{17}$$

We observe that the log-pseudolikelihood function in (17) closely resembles (10).
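To illustrate how the two views enter a single summand of (17), the Python sketch below takes $\phi$ to be a spherical Gaussian density (the choice used in the simulations of Section 7.3) and evaluates one node's contribution in log-space. The function names and the numerical inputs are illustrative assumptions.

```python
import math

def log_spherical_gaussian(y, mu, sigma):
    """log phi(y; mu, sigma^2 I) for a p-dimensional observation y."""
    p = len(y)
    sq_dist = sum((a - b) ** 2 for a, b in zip(y, mu))
    return -0.5 * p * math.log(2.0 * math.pi * sigma ** 2) - sq_dist / (2.0 * sigma ** 2)

def log_pl_term(log_g_i, log_phi_i, pi1, pi2, C):
    """One summand of (17):
    log sum_{k,k'} pi1[k] pi2[k'] C[k][k'] g_ik phi_ik', via log-sum-exp."""
    terms = []
    for k, lg in enumerate(log_g_i):
        for kp, lp in enumerate(log_phi_i):
            w = pi1[k] * pi2[kp] * C[k][kp]
            if w > 0.0:                      # entries with C[k][k'] = 0 contribute nothing
                terms.append(math.log(w) + lg + lp)
    mx = max(terms)
    return mx + math.log(sum(math.exp(t - mx) for t in terms))

# Hypothetical node: network-view log-likelihoods for K1 = 2 communities and
# a 2-dimensional covariate vector evaluated under K2 = 2 cluster means.
log_g_i = [math.log(0.30), math.log(0.05)]
y_i = [0.1, -0.2]
log_phi_i = [log_spherical_gaussian(y_i, mu, 1.0) for mu in ([0.0, 0.0], [2.0, 2.0])]
ones = [[1.0, 1.0], [1.0, 1.0]]
term_null = log_pl_term(log_g_i, log_phi_i, [0.5, 0.5], [0.5, 0.5], ones)
```

When $C$ is the all-ones matrix, the summand factorizes into a network-view part and a multivariate-view part, which is the form the pseudolikelihood takes under independence.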

5.3 |. Testing independence between $Z^{(1)}$ and $Z^{(2)}$

We now propose a test for the null hypothesis that the latent community memberships $Z^{(1)}$ and the latent cluster memberships $Z^{(2)}$ in model (14)–(15) are independent. As in Section 4, this amounts to testing $H_0 : C = 1_{K^{(1)}} 1_{K^{(2)}}^T$.

Recall that the network $X$ marginally follows an SBM, and let $\hat{\eta}$ and $\hat{\pi}^{(1)}$ be the maximizers of $\ell_{PL}(\eta, \pi^{(1)}; \hat{b} \mid d)$, where $\ell_{PL}$ is the log-pseudolikelihood function for the SBM given by (4). As in Section 4.1, we can compute $\hat{\eta}$ and $\hat{\pi}^{(1)}$ by using the EM algorithm for fitting FMMs (McLachlan and Krishnan, 2007). Recall that the rows of the multivariate view $Y$ marginally follow an FMM, and let $\hat{\gamma}$ and $\hat{\pi}^{(2)}$ be the maximizers of the log-likelihood function for the multivariate view, obtained via EM. We consider the P2LRT statistic given by

$$\log \tilde{\Lambda} \equiv \max_{C \in \mathcal{C}_{\hat{\pi}^{(1)}, \hat{\pi}^{(2)}}} \ell_{PL}(\hat{\eta}, \hat{\gamma}, \hat{\pi}^{(1)}, \hat{\pi}^{(2)}, C; \hat{b}, Y \mid d) - \ell_{PL}(\hat{\eta}, \hat{\gamma}, \hat{\pi}^{(1)}, \hat{\pi}^{(2)}, 1_{K^{(1)}} 1_{K^{(2)}}^T; \hat{b}, Y \mid d),$$

where $\ell_{PL}$ is the log-pseudolikelihood function in (17), and $\mathcal{C}_{\cdot,\cdot}$ is defined in Proposition 1. Once again, we can perform the maximization over $C$ using techniques from convex optimization; details of the exponentiated gradient descent algorithm that we use are similar to Step 2 of Algorithm 1. As in Section 4.2, we approximate the null distribution of $\log \tilde{\Lambda}$ by taking $M$ random permutations of the rows of $Y$, and comparing the observed value of $\log \tilde{\Lambda}$ to its empirical distribution in the permuted data. Details are similar to Algorithm 2.

6 |. RELATED LITERATURE

Many papers have extended the SBM to the multiple network data view setting, under the assumption that a single set of communities is shared across all networks (Han et al., 2015; Peixoto, 2015; Paul and Chen, 2016) or a subset of networks (Stanley et al., 2016). The model proposed in Section 3.1 does not rely on this assumption. Most of the previous work that avoids the assumption of shared communities has focused on estimation of the community structure; Section 4 of Kim et al. (2018) reviews these papers in detail. By contrast, the primary goal of our paper is not estimation, but rather to develop a test of association between the communities underlying each network view (Section 4).

A related problem in functional neuroimaging is to test whether the communities underlying brain networks of two groups of healthy and diagnosed patients are the same; see Paul et al. (2020), and the references contained therein. However, the test statistics and/or p-values for these tests cannot be computed in the two network data view setting.

We proposed a test of the null hypothesis that the communities underlying two network views are independent. By contrast, Xiong et al. (2019) proposed a test of the null hypothesis that the networks are conditionally independent given their underlying communities.

In the case of a network view and a multivariate view, several papers have assumed that the communities underlying the network view and the clusters underlying the multivariate view are the same, and exploit this assumption to improve parameter estimation (Binkiewicz et al., 2017; Stanley et al., 2019; Yan and Sarkar, 2020). Our proposed model in Section 5.1 does not rely on this assumption. Another body of work estimates the relationship between community memberships and node covariates, but does not consider inference on this relationship (Yang et al., 2013; Newman and Clauset, 2016; Zhang et al., 2016).

In Section 5, we proposed testing for a specific type of relationship between the network view and the multivariate view: we test for association between the communities underlying the network view and the clusters underlying the multivariate view. Several papers have considered testing for other types of relationships between the network view and the multivariate view (Traud et al., 2011; Fosdick and Hoff, 2015; Peel et al., 2017). For example, Peel et al. (2017) test for association between the network view and a categorical node covariate.

7 |. SIMULATION RESULTS

In this section, we evaluate the power and Type I error of the tests proposed in Sections 4 and 5. Simulations in this paper were conducted using the simulator package (Bien, 2016).

7.1 |. SBM for two network data views

We will evaluate the performance of four tests of $H_0 : C = 1_{K^{(1)}} 1_{K^{(2)}}^T$:

  1. the P2LRT proposed in Section 4, using the true values of $K^{(1)}$ and $K^{(2)}$,

  2. the P2LRT proposed in Section 4, using estimated values of $K^{(1)}$ and $K^{(2)}$,

  3. the G-test for testing dependence between two categorical variables (Agresti, 2003, Chapter 3.2) applied to the estimated community memberships for each view, using the true values of $K^{(1)}$ and $K^{(2)}$, and

  4. the G-test, using estimated values of $K^{(1)}$ and $K^{(2)}$.

We estimate $K^{(1)}$ and $K^{(2)}$ by applying the method of Le and Levina (2015) to $X^{(1)}$ and $X^{(2)}$, respectively. In all four tests, we approximate the null distribution with a permutation approach, as in Algorithm 2, using $M = 200$ permutation samples.
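For reference, the G-test statistic used as a comparison can be computed from two vectors of hard community assignments as in the following Python sketch. The function name `g_statistic` and the 0-based labels are assumptions; the formula is the standard $G = 2 \sum O \log(O/E)$ over the cells of the contingency table.

```python
import math

def g_statistic(z1, z2, K1, K2):
    """G = 2 * sum over cells of O * log(O / E), where O are the observed counts
    of the K1 x K2 contingency table and E = row total * column total / n."""
    n = len(z1)
    counts = [[0] * K2 for _ in range(K1)]
    for a, b in zip(z1, z2):
        counts[a][b] += 1
    row = [sum(r) for r in counts]
    col = [sum(c) for c in zip(*counts)]
    g = 0.0
    for k in range(K1):
        for kp in range(K2):
            o = counts[k][kp]
            if o > 0:                        # empty cells contribute zero
                g += 2.0 * o * math.log(o * n / (row[k] * col[kp]))
    return g

# Toy hard assignments (0-based labels): identical versus perfectly balanced.
g_identical = g_statistic([0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
g_balanced = g_statistic([0, 0, 1, 1], [0, 1, 0, 1], 2, 2)
```

As in the simulations, the null distribution of this statistic would be approximated by permutation rather than by a chi-squared reference distribution.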

We generate data from model (5)–(6), with $n = 1000$, $K^{(1)} = K^{(2)} = K = 6$, and

$$C = (1 - \Delta) 1_K 1_K^T + \Delta \, \mathrm{diag}(K 1_K), \tag{18}$$

for $\Delta \in [0, 1]$. Here, $\Delta = 0$ corresponds to independent communities and $\Delta = 1$ corresponds to identical communities. We let $\pi^{(1)} = \pi^{(2)} = 1_K / K$, and $\theta^{(1)} = \theta^{(2)} = \theta$, with

$$\theta_{kk'} = \omega \left( 1\{k \neq k'\} + 2r \, 1\{k = k'\} \right), \tag{19}$$

for $r > 0$ describing the strength of the communities, and $\omega$ chosen so that the expected edge density of the network equals $s$, to be specified. We simulate 2000 data sets for a range of values of $s$, $\Delta$, and $r$, and evaluate the power of the four tests described above. Results are shown in Figure 2.
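The simulation parameters in (18) and (19) can be constructed as in the Python sketch below. The closed form used for $\omega$ is an assumption: it takes "expected edge density" to mean the marginal probability of an edge between two distinct nodes under $\pi = 1_K / K$, which gives $\omega = sK / (K - 1 + 2r)$. The function name and the example values of $\Delta$ and $r$ are also illustrative.

```python
def simulation_parameters(K, delta, r, s):
    """Build C from (18) and theta from (19) for balanced communities."""
    C = [[(1.0 - delta) + (delta * K if k == kp else 0.0) for kp in range(K)]
         for k in range(K)]
    # Assumed reading of "expected edge density": with uniform pi, the marginal
    # edge probability is omega * (K - 1 + 2r) / K, which is set equal to s.
    omega = s * K / (K - 1 + 2.0 * r)
    theta = [[omega * (2.0 * r if k == kp else 1.0) for kp in range(K)]
             for k in range(K)]
    return C, theta

K = 6
C, theta = simulation_parameters(K, delta=0.5, r=2.0, s=0.015)

# Sanity checks: C satisfies C pi = 1_K for pi = 1_K / K, and the implied
# marginal edge probability matches the target density.
row_sums = [sum(C[k][kp] / K for kp in range(K)) for k in range(K)]
density = sum(theta[k][kp] for k in range(K) for kp in range(K)) / K ** 2
```

The values of $r$ and $s$ must be small enough that every entry of $\theta$ is at most 1.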

FIGURE 2.

Power of the P2LRT and the G-test with both views drawn from an SBM, as we vary the dependence between views (Δ), the strength of the communities (r), the expected edge density (s), and how the number of communities is selected. Details are in Section 7.1.

For all tests, power tends to increase as $\Delta$, which controls the dependence between views, increases. Power also tends to increase as the strength of the communities ($r$) increases, and as the expected edge density ($s$) increases. Estimating $K^{(1)}$ and $K^{(2)}$ tends to yield lower power than using the true values of $K^{(1)}$ and $K^{(2)}$. All tests control the Type I error, but the P2LRTs uniformly yield higher power than the G-tests. This is because the P2LRT can be interpreted as a version of the G-test that replaces the “hard” community assignments with “soft” community assignments (Gao et al., 2020, Section 5). Thus, the P2LRT outperforms the G-test when the communities are more difficult to detect.

We generate data with unbalanced community sizes in Web Appendix D, and investigate how the true values of $K^{(1)}$ and $K^{(2)}$ relate to power in Web Appendix E.

7.2 |. Degree-corrected SBM for two network data views

Under the SBM, nodes within the same community have the same expected degree. To investigate the performance of the test proposed in Section 4 in a setting where nodes can have different expected degrees, we generate each network view from the degree-corrected SBM (DCSBM, Karrer and Newman, 2011). We generate $n$ vectors $(Z_i^{(1)}, Z_i^{(2)}, \delta_i^{(1)}, \delta_i^{(2)})$ i.i.d. for $i = 1, 2, \ldots, n$, with $Z_i^{(1)}$ and $Z_i^{(2)}$ categorical with $K^{(1)}$ and $K^{(2)}$ levels, respectively, and $(Z_i^{(1)}, Z_i^{(2)}) \perp (\delta_i^{(1)}, \delta_i^{(2)})$. Here, $\delta^{(1)}$ and $\delta^{(2)}$ represent popularities for the nodes in the two views; more popular nodes have higher expected degrees. We generate each view with

$$X^{(l)} \mid Z^{(l)}, \delta^{(l)} \sim \prod_{j=1}^{n} \prod_{i=1}^{j-1} \left(\delta_i^{(l)} \delta_j^{(l)} \theta^{(l)}_{Z_i^{(l)} Z_j^{(l)}}\right)^{X_{ij}^{(l)}} \left(1 - \delta_i^{(l)} \delta_j^{(l)} \theta^{(l)}_{Z_i^{(l)} Z_j^{(l)}}\right)^{1 - X_{ij}^{(l)}}, \qquad l = 1, 2. \tag{20}$$

We set $n$, $K^{(1)}$, $K^{(2)}$, $\pi^{(1)}$, $\pi^{(2)}$, $C$, $\theta^{(1)}$, and $\theta^{(2)}$ as in Section 7.1 and take $\mathbb{P}(\delta_i^{(l)} = 2.5) = 0.2$, $\mathbb{P}(\delta_i^{(l)} = 0.625) = 0.8$, and $\delta_i^{(1)} \perp \delta_i^{(2)}$. We simulate 2000 data sets, varying the dependence between views ($\Delta$), the expected edge density ($s$), and the strength of the communities ($r$); these parameters are defined in Section 7.1. Once again, we evaluate the power and Type I error of the four tests described in Section 7.1. Results are shown in Figure 3, and are similar to Section 7.1. The P2LRT performs well because it is based on an approximation to the conditional likelihood of the multiview SBM given the node degrees (Section 3.2); thus, it can handle the highly heterogeneous node degrees that characterize the multiview DCSBM.
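A single DCSBM view as in (20) can be simulated with the Python sketch below, using the two-point popularity distribution stated above. The function names, the toy memberships, and the toy $\theta$ are assumptions; the sketch raises an error rather than truncating if an edge probability exceeds 1.

```python
import random

POPULARITY_VALUES = (2.5, 0.625)
POPULARITY_PROBS = (0.2, 0.8)

def sample_popularities(n, rng):
    """Draw delta_i i.i.d. with P(delta = 2.5) = 0.2 and P(delta = 0.625) = 0.8."""
    return rng.choices(POPULARITY_VALUES, weights=POPULARITY_PROBS, k=n)

def sample_dcsbm(Z, delta, theta, rng):
    """Draw one view from (20): P(X_ij = 1) = delta_i * delta_j * theta[Z_i][Z_j]."""
    n = len(Z)
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            p = delta[i] * delta[j] * theta[Z[i]][Z[j]]
            if p > 1.0:
                raise ValueError("edge probability exceeds 1; theta is too large")
            if rng.random() < p:
                X[i][j] = X[j][i] = 1
    return X

rng = random.Random(0)
Z = [i % 2 for i in range(100)]                  # toy memberships, 0-based
delta = sample_popularities(100, rng)
X = sample_dcsbm(Z, delta, [[0.10, 0.02], [0.02, 0.10]], rng)

# The popularity distribution has mean 1: 0.2 * 2.5 + 0.8 * 0.625 = 1.
mean_popularity = sum(v * w for v, w in zip(POPULARITY_VALUES, POPULARITY_PROBS))
```

Because the popularities have unit mean and are drawn independently across nodes and of the memberships, the marginal edge probability between two distinct nodes is the same as under the corresponding SBM, while individual expected degrees vary by a factor of four.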

FIGURE 3.

Power of the P2LRT and the G-test with both views drawn from a DCSBM, as we vary the dependence between views (Δ), the strength of the communities (r), the expected edge density (s), and how the number of communities is selected. Details are in Section 7.2.

In this subsection, we assumed that the node popularities ($\delta^{(1)}$ and $\delta^{(2)}$) are independent. This can sometimes be an unrealistic assumption in practice. If $\delta^{(1)}$ and $\delta^{(2)}$ are dependent, then $X^{(1)}$ and $X^{(2)}$ could be dependent even when the communities are independent, which could inflate the Type I error rate. To investigate this effect, in Web Appendix B.1, we generate data from a multiview DCSBM with $\delta^{(1)}$ and $\delta^{(2)}$ dependent, and apply the P2LRT using a range of values of $K^{(1)}$ and $K^{(2)}$. We find that the Type I error rate is controlled, both when we estimate the number of communities and when we choose a fixed number of communities (as long as the number of communities is not grossly overspecified); Web Appendix B.2 gives intuition for why this is the case.

7.3 |. SBM for a network view and a multivariate view

We will evaluate the performance of six tests of $H_0 : C = 1_{K^{(1)}} 1_{K^{(2)}}^T$:

  1. the P2LRT proposed in Section 5, using the true values of $K^{(1)}$ and $K^{(2)}$,

  2. the P2LRT, using estimated values of $K^{(1)}$ and $K^{(2)}$,

  3. the G-test applied to the estimated community/cluster memberships in the network/multivariate view, using the true values of $K^{(1)}$ and $K^{(2)}$,

  4. the G-test, using estimated values of $K^{(1)}$ and $K^{(2)}$,

  5. the BESTest (Peel et al., 2017) applied to the network view and the estimated cluster memberships in the multivariate view, using the true values of $K^{(1)}$ and $K^{(2)}$, and

  6. the BESTest, using estimated values of $K^{(1)}$ and $K^{(2)}$.

We estimate $K^{(1)}$ by applying the method of Le and Levina (2015), we estimate $K^{(2)}$ using BIC, and we approximate the null distributions using $M = 200$ permutation samples.

We generate data from model (14)–(15); data from a degree-corrected version of model (14)–(15) are considered in Web Appendix C. We set $n = 500$, and $K^{(1)} = K^{(2)} = K = 3$. Let $\pi^{(1)} = \pi^{(2)} = 1_K / K$, and let $C$ be given by (18). Let $\theta$ be given by (19), with $\omega$ chosen so that the expected edge density is $s = 0.015$. We draw the multivariate data view from a Gaussian mixture model, for which the $k$th mixture component is an $N_{10}(\mu_k, \sigma^2 I_{10})$ distribution. The $p \times K$ mean matrix for the multivariate data view is given by $\mu = \begin{bmatrix} 0 \cdot 1_5 & 0 \cdot 1_5 & \sqrt{12} \, 1_5 \\ 2 \cdot 1_5 & -2 \cdot 1_5 & 0 \cdot 1_5 \end{bmatrix}$. We simulate 2000 data sets for a range of values of $\Delta$, $r$, and $\sigma$. Results are shown in Figure 4.

FIGURE 4.

Power of the P2LRT, the G-test, and the BESTest (Peel et al., 2017) with the multivariate view drawn from a Gaussian mixture model and the network view drawn from an SBM, as we vary the dependence between views (Δ), the strength of the communities (r), the variance of the clusters (σ), and how the number of communities and the number of clusters are selected. The expected edge density (s) is fixed at 0.015. Details are in Section 7.3.

All tests control the Type I error rate. Power tends to increase as the dependence between views (Δ) increases. Power also tends to increase as the strength of the communities (r) increases and the variance of the clusters (σ) decreases. The P2LRTs uniformly yield higher power than the G-tests and the BESTests.

8 |. APPLICATION TO PROTEIN–PROTEIN INTERACTION DATA

In this section, we focus on two types of protein–protein interaction data. A binary interaction is a physical interaction between proteins, and a cocomplex association is a pair of proteins that are part of the same complex. These two data views represent distinct biological concepts; physical interactions can occur between a pair of proteins that are not in the same complex, and not all proteins in complexes physically interact.

To investigate whether the latent communities of proteins defined with respect to binary interactions and cocomplex associations are related, we consider Homo sapiens protein–protein interaction data from the HINT (High-quality INTeractomes; Das and Yu, 2012b) database, and ask: are the communities within the binary network and the communities within the cocomplex network associated?

We remove self-interactions from both networks, and consider only those proteins that appear in both networks. This yields 43,874 binary interactions and 88,960 cocomplex associations among a common set of n = 9037 proteins. We apply the P2LRT of $H_0: C = 1_{K^{(1)}} 1_{K^{(2)}}^T$ developed in Section 4, using $M = 10^4$ in Step 3 of Algorithm 2. As in Section 7, we estimate the number of communities in each view by applying the method of Le and Levina (2015) to each view separately, which (coincidentally) estimates 14 communities in both data views. Figure 5 displays $\hat{\pi}^{(1)}$ and $\hat{\pi}^{(2)}$ (defined in Section 4.1), and $\hat{C}$ (defined in Equation (13)). Our test yields a p-value of 0.013, and thus provides some evidence against the null hypothesis that communities of proteins defined with respect to binary interactions and communities of proteins defined with respect to cocomplex associations are independent.
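
The preprocessing step described above (removing self-interactions and restricting to proteins present in both networks) might be sketched as follows. The function name and the toy protein identifiers are hypothetical, and the sketch reflects one reading of the text: the common protein set is computed after self-interactions are removed, and an edge is kept only if both of its endpoints are in that set.

```python
def preprocess(edges_a, edges_b):
    """Drop self-interactions, treat edges as undirected, and keep only
    proteins (and edges among them) that appear in both networks."""
    def clean(edges):
        # Sorting each pair makes (u, v) and (v, u) the same undirected edge.
        return {tuple(sorted(e)) for e in edges if e[0] != e[1]}

    a, b = clean(edges_a), clean(edges_b)
    nodes_a = {p for e in a for p in e}
    nodes_b = {p for e in b for p in e}
    common = nodes_a & nodes_b
    a = {e for e in a if e[0] in common and e[1] in common}
    b = {e for e in b if e[0] in common and e[1] in common}
    return sorted(common), a, b

# Toy example with hypothetical protein identifiers.
binary = [("A", "B"), ("B", "A"), ("A", "A"), ("B", "C"), ("C", "D")]
cocomplex = [("A", "B"), ("B", "C"), ("E", "E"), ("C", "F")]
proteins, binary_edges, cocomplex_edges = preprocess(binary, cocomplex)
```

Under this reading, a protein in the common set may end up with no remaining edges in one of the two networks; it is still retained as a node.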

FIGURE 5

Heatmaps of $\hat{\pi}^{(1)}$ and $\hat{\pi}^{(2)}$, defined in Section 4.1, and of $\hat{C}$, defined in (13), for the HINT data described in Section 8.

Our test of $H_0: C = 1_{K^{(1)}} 1_{K^{(2)}}^T$ allows us to provide an answer to the high-level scientific question of whether there is a relationship between communities defined with respect to different types of protein interactions. However, it may also be of scientific interest to determine whether there is a relationship between the kth community in the binary view and the k′th community in the cocomplex view. Recall from Section 3.1 that $C_{kk'} = 1$ indicates that the kth community in the binary view and the k′th community in the cocomplex view are independent. In Figure 5, most values of $\hat{C}_{kk'}$ are close to 1. Thus, it may be of future interest to develop tests of $H_0: C_{kk'} = 1$.
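
To make the interpretation of an entry of C concrete, the sketch below computes, for a table of joint membership counts, the ratio of each joint proportion to the product of its marginal proportions; a ratio of 1 corresponds to independence of that pair of communities. This is only a plug-in illustration that assumes the memberships are known: the function name and the counts are hypothetical, and it is not the estimator $\hat{C}$ of Equation (13), which is obtained without observing the latent memberships.

```python
def dependence_ratios(counts):
    """counts[k][l] = number of nodes in community k of view 1 and community l of view 2.
    Returns the joint proportion divided by the product of the marginal proportions."""
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    return [[n * counts[k][l] / (row_totals[k] * col_totals[l])
             for l in range(len(col_totals))]
            for k in range(len(row_totals))]

independent = dependence_ratios([[10, 20], [30, 60]])  # every ratio equals 1
associated = dependence_ratios([[40, 10], [10, 40]])   # diagonal above 1, off-diagonal below 1
```

In the second table, the diagonal ratios are 1.6 and the off-diagonal ratios are 0.4, so nodes in the kth community of one view are over-represented in the kth community of the other.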

9 |. DISCUSSION

In this paper, we considered testing whether communities defined with respect to two networks on a common set of nodes are related. We extended this test to the setting of one network and one multivariate data set on a common set of nodes. The proposed tests control the Type I error rate, and yield higher power than applying the G-test to the estimated community/cluster memberships in each data view.

We focused on testing the association between communities/clusters in two data views. If three or more data views are available, we may be interested in testing mutual independence between all data views. The models proposed in Sections 3.1 and 5.1 extend readily to L > 2 data views, and we can test for mutual independence by testing the null hypothesis that all entries of an Lth-order tensor C are equal to 1. We can construct a P2LRT statistic along the lines of (12), and we can approximate the null distribution by permuting the node labels in the second through Lth views. If we are instead interested in pairwise independence between the data views, we could simply apply the tests developed in this paper to each pair of views.
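
A minimal sketch of the permutation scheme for L > 2 views is given below. It assumes that each view is stored as a list whose ith entry belongs to node i, so that permuting node labels amounts to shuffling the list. The function names and the `full_agreement` statistic are hypothetical placeholders; a test along the lines described above would use a P2LRT statistic in their place.

```python
import random

def permute_views(views, rng):
    """Keep the first view fixed and independently shuffle the node labels
    of the second through Lth views."""
    permuted = [list(views[0])]
    for view in views[1:]:
        shuffled = list(view)
        rng.shuffle(shuffled)
        permuted.append(shuffled)
    return permuted

def mutual_independence_p_value(views, statistic, n_perm=200, seed=0):
    """Permutation p-value (b + 1) / (M + 1) for a statistic computed on all views."""
    rng = random.Random(seed)
    observed = statistic(views)
    exceed = sum(statistic(permute_views(views, rng)) >= observed
                 for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

def full_agreement(views):
    """Placeholder statistic: number of nodes with the same label in every view."""
    return sum(len(set(labels)) == 1 for labels in zip(*views))
```

Shuffling each of views 2 through L with its own independent permutation breaks the dependence among all views at once, which is what a test of mutual independence requires; reusing a single permutation for all of them would preserve any dependence among views 2 through L.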

In this paper, we considered only undirected, unweighted network views. There is a body of work that extends the single-view SBM to directed and/or weighted networks; see, for example, Wang and Wong (1987) and Aicher et al. (2014). It may be of future interest to extend the methodology developed in this paper to allow for directed and/or weighted networks.

Supplementary Material

Supplement (Web Appendix)

ACKNOWLEDGMENTS

Lucy L. Gao received funding from the Natural Sciences and Engineering Research Council of Canada. Daniela Witten and Jacob Bien were supported by NIH Grant R01GM123993. Jacob Bien was supported by NSF CAREER Award DMS-1653017. Daniela Witten was supported by NIH Grant DP5OD009145, NSF CAREER Award DMS-1252624, and Simons Investigator Award No. 560585. We thank Haiyuan Yu for useful input on protein interaction data.

Funding information

Simons Foundation, Grant/Award Number: Simons Investigator Award No. 560585; National Institutes of Health, Grant/Award Numbers: DP5OD009145, R01GM123993; National Science Foundation, Grant/Award Numbers: CAREER Award DMS-1252624, CAREER Award DMS-1653017

Footnotes

Conflict of Interest: None declared.

SUPPORTING INFORMATION

Web Appendices referenced in Sections 3 and 7 are available with this paper at the Biometrics website on Wiley Online Library. The tests developed in this paper are implemented in the R package multiviewtest, which is available on CRAN. Code to reproduce the results in this paper is available at https://github.com/lucylgao/mv-network-test-code, and also available with this paper at the Biometrics website on Wiley Online Library.

DATA AVAILABILITY STATEMENT

The data that support the findings of this paper are openly available in the HINT (High-quality INTeractomes) database at http://hint.yulab.org (Das and Yu, 2012a).

REFERENCES

  1. Agresti A (2003) Categorical Data Analysis, Vol. 482. Hoboken: John Wiley & Sons.
  2. Aicher C, Jacobs AZ and Clauset A (2014) Learning latent block structure in weighted networks. Journal of Complex Networks, 3, 221–248.
  3. Amini AA, Chen A, Bickel PJ and Levina E (2013) Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 3, 2097–2122.
  4. Phipson B and Smyth GK (2010) Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology, 3, 1–16.
  5. Besag J (1975) Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician), 3, 179–195.
  6. Bien J (2016) The simulator: an engine to streamline simulations. arXiv preprint arXiv:1607.00021.
  7. Binkiewicz N, Vogelstein JT and Rohe K (2017) Covariate-assisted spectral clustering. Biometrika, 3, 361–377.
  8. Chen Y and Liang K-Y (2010) On the asymptotic behaviour of the pseudolikelihood ratio test statistic with boundary problems. Biometrika, 3, 603–620.
  9. D’Angelo S, Murphy TB and Alfò M (2019) Latent space modelling of multidimensional networks with application to the exchange of votes in Eurovision song contest. The Annals of Applied Statistics, 3, 900–930.
  10. Das J and Yu H (2012a) High-quality interactomes (HINT). http://hint.yulab.org. Accessed 22 January 2019.
  11. Das J and Yu H (2012b) HINT: high-quality interactomes and their applications in understanding human disease. BMC Systems Biology, 3, 92.
  12. Dempster AP, Laird NM and Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 3, 1–22.
  13. Erdős P and Rényi A (1960) On the evolution of random graphs. Proceedings of the Hungarian Academy of Sciences, 5, 17–61.
  14. Fosdick BK and Hoff PD (2015) Testing and modeling dependencies between a network and nodal attributes. Journal of the American Statistical Association, 3, 1047–1056.
  15. Gao LL, Bien J and Witten D (2020) Are clusterings of multiple data views independent? Biostatistics, 3, 692–708.
  16. Gollini I and Murphy TB (2016) Joint modeling of multiple network views. Journal of Computational and Graphical Statistics, 3, 246–265.
  17. Han Q, Xu K and Airoldi E (2015) Consistent estimation of dynamic and multi-layer block models. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pp. 1511–1520.
  18. Hoff PD, Raftery AE and Handcock MS (2002) Latent space approaches to social network analysis. Journal of the American Statistical Association, 3, 1090–1098.
  19. Holland PW, Laskey KB and Leinhardt S (1983) Stochastic block-models: first steps. Social Networks, 3, 109–137.
  20. Holland PW and Leinhardt S (1981) An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, 3, 33–50.
  21. Karrer B and Newman ME (2011) Stochastic blockmodels and community structure in networks. Physical Review E, 3, 016107.
  22. Kim B, Lee KH, Xue L and Niu X (2018) A review of dynamic network models with latent variables. Statistics Surveys, 3, 105–135.
  23. Kivinen J and Warmuth MK (1997) Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 3, 1–63.
  24. Le CM and Levina E (2015) Estimating the number of communities in networks by spectral methods. arXiv preprint arXiv:1507.00827.
  25. Liang K-Y and Self SG (1996) On the asymptotic behaviour of the pseudolikelihood ratio test statistic. Journal of the Royal Statistical Society. Series B (Methodological), 58, 785–796.
  26. Matias C and Robin S (2014) Modeling heterogeneity in random graphs through latent space models: a selective review. ESAIM: Proceedings and Surveys, 3, 55–74.
  27. McLachlan G and Krishnan T (2007) The EM Algorithm and Extensions, Vol. 382. New York, NY: John Wiley & Sons.
  28. McLachlan G and Peel D (2000) Finite Mixture Models. New York, NY: John Wiley & Sons.
  29. Newman ME and Clauset A (2016) Structure and inference in annotated networks. Nature Communications, 3, 11863.
  30. Paul S and Chen Y (2016) Consistent community detection in multi-relational data through restricted multi-layer stochastic blockmodel. Electronic Journal of Statistics, 3, 3807–3870.
  31. Paul S and Chen Y (2020) A random effects stochastic block model for joint community detection in multiple networks with applications to neuroimaging. Annals of Applied Statistics, 3, 993–1029.
  32. Peel L, Larremore DB and Clauset A (2017) The ground truth about metadata and community detection in networks. Science Advances, 3, e1602548.
  33. Peixoto TP (2015) Inferring the mesoscale structure of layered, edge-valued, and time-varying networks. Physical Review E, 3, 042807.
  34. Salter-Townshend M and McCormick TH (2017) Latent space models for multiview network data. The Annals of Applied Statistics, 11, 1217–1244.
  35. Stanley N, Bonacci T, Kwitt R, Niethammer M and Mucha PJ (2019) Stochastic block models with multiple continuous attributes. Applied Network Science, 3, 1–22.
  36. Stanley N, Shai S, Taylor D and Mucha PJ (2016) Clustering network layers with the strata multilayer stochastic block model. IEEE Transactions on Network Science and Engineering, 3, 95–105.
  37. Sun S (2013) A survey of multi-view machine learning. Neural Computing and Applications, 3, 2031–2038.
  38. Traud AL, Kelsic ED, Mucha PJ and Porter MA (2011) Comparing community structure to characteristics in online collegiate social networks. SIAM Review, 3, 526–543.
  39. Wang YJ and Wong GY (1987) Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 3, 8–19.
  40. Xiong J, Shen C, Arroyo J and Vogelstein JT (2019) Graph independence testing. arXiv preprint arXiv:1906.03661.
  41. Yan B and Sarkar P (2020) Covariate regularized community detection in sparse graphs. Journal of the American Statistical Association. Advance online publication.
  42. Yang J, McAuley J and Leskovec J (2013) Community detection in networks with node attributes. In: Proceedings of the IEEE 13th International Conference on Data Mining, pp. 1151–1156.
  43. Zhang Y, Levina E and Zhu J (2016) Community detection in networks with node features. Electronic Journal of Statistics, 3, 3153–3178.
