Author manuscript; available in PMC 2016 May 25. Published in final edited form as: Proc IEEE Int Conf Data Min. 2015 Nov;2015:291–300. doi: 10.1109/ICDM.2015.13

Robust Multi-Network Clustering via Joint Cross-Domain Cluster Alignment

Rui Liu*, Wei Cheng†, Hanghang Tong, Wei Wang§, Xiang Zhang*
PMCID: PMC4880426  NIHMSID: NIHMS785953  PMID: 27239167

Abstract

Network clustering is an important problem that has recently drawn a lot of attention. Most existing work focuses on clustering nodes within a single network. In many applications, however, there exist multiple related networks, in which each network may be constructed from a different domain and instances in one domain may be related to instances in other domains. In this paper, we propose a robust algorithm, MCA, for multi-network clustering that takes into account cross-domain relationships between instances. MCA has several advantages over existing single-network clustering methods. First, it is able to detect associations between clusters from different domains, which is not addressed by any existing method. Second, it achieves more consistent clustering results on multiple networks by leveraging the duality between clustering individual networks and inferring cross-network cluster alignment. Finally, it provides a multi-network clustering solution that is more robust to noise and errors. We perform extensive experiments on a variety of real and synthetic networks to demonstrate the effectiveness and efficiency of MCA.

I. Introduction

Networks (or graphs) are widely used to represent relationships between instances: each node corresponds to an instance and each edge depicts the relationship between a pair of instances. Network clustering (or graph clustering) [1]–[3] has become an effective means of discovering modules formed by closely related instances in such networks, which may in turn reveal the functional structure of the networks. Recently, attention has shifted from clustering a single homogeneous network (built on instances from one domain) to joint clustering of multiple heterogeneous networks (from different but related domains), for a clear reason: integrating information from different but related domains not only may help resolve ambiguity and inconsistency in the clustering outcome, but also may discover and leverage strong associations between clusters from different domains. Consequently, these multi-view network clustering methods [3], [4] can substantially improve clustering accuracy. For example, millions of genetic variants on the human genome have been reported to be disease-related, most of them in the form of single nucleotide polymorphisms (SNPs). These SNPs do not function independently; instead, a set of SNPs may play joint roles in a disease. Such interactions between SNPs can be modeled by a SNP interaction network. Fig. 1 shows an exemplar SNP interaction network G1 of 17 SNPs on the left, in which nodes are SNPs and weighted edges represent interactions between SNPs. Even though the underlying biological processes are complex and only partially understood, it is well established that SNPs may alter the expression levels of related genes, which may in turn have a cascading effect on other genes, e.g., in the same biological pathways [5]. The interactions between genes can be measured by correlations of gene expressions and represented by a gene interaction network. Fig. 1 shows an exemplar gene interaction network of 20 genes on the right, in which nodes are genes and weighted edges represent interactions between genes. These two networks are closely related because of the (complicated) relationships between SNPs and genes, as demonstrated in many expression quantitative trait loci (eQTL) studies. These cross-domain relationships are represented by dotted edges between SNPs and genes in Fig. 1, with the strength of each relationship encoded by the edge weight. It is evident that a joint analysis is essential in these related domains.

Fig. 1.
An exemplar SNP interaction network and gene interaction network in an eQTL study.

Despite the success of previous approaches to network clustering, they still suffer from two common limitations. First, existing methods usually assume that information collected in different domains is for the same set of instances, so that the cross-domain instance relationships are strictly one-to-one correspondences. This assumption may not hold in many applications: more often than not, data instances (e.g., SNPs) in one domain may be related to multiple instances (e.g., genes) in another domain. Methods that can account for many-to-many cross-domain relationships are needed [6]. Second, existing approaches tend to focus on network clustering and ignore any associations that may be exhibited between clusters from different domains. However, “alignment” between clusters from multiple domains may provide a more comprehensive depiction of the whole system. For example, a cluster of SNPs may jointly regulate the expressions of a cluster of genes, which may be revealed by cluster-level associations. Fig. 1 shows two SNP clusters, A (SNPs {1, 2, 3, 4}) and B (SNPs {12, 13, 14, 16}), and three gene clusters, C (genes {a, b, c, d}), D (genes {p, q, r, s}) and E (genes {i, j, k, m}). As summarized in Table I, SNP cluster A is strongly associated with gene cluster C, SNP cluster B is strongly associated with gene cluster D, and gene cluster E is not strongly associated with any SNP cluster. Although we are given cross-domain associations at the instance level, it is nontrivial to discover cross-domain associations at the cluster level, especially in the presence of noise. Our goal is to discover such strong associations between pairs of clusters from different domains while we perform network clustering.

TABLE I.

Cross-domain associations between SNP clusters and gene clusters in Fig. 1

association                    1st cluster pair    2nd cluster pair
SNP interaction network G1     {12, 13, 14, 16}    {1, 2, 3, 4}
gene interaction network G2    {p, q, r, s}        {a, b, c, d}

In this paper, we propose a robust approach, MCA (Multi-network Clustering via cluster Alignment), to detect network clusters in multiple domains and their cross-domain associations. In addition to the advantages discussed above, the duality between clustering in individual networks and inferring cross-network cluster alignment enables mutual reinforcement when both tasks are performed simultaneously. As a result, MCA can effectively filter noise and resolve ambiguities in individual networks, and achieve much higher accuracy in detecting network clusters and their cross-domain associations. It also employs a sparsity regularizer on the cluster alignment to provide additional robustness to noise in the prior cross-domain (instance-level) relationships.

Our contributions are summarized as follows.

  • To the best of our knowledge, little prior work has studied the problem of cross-domain cluster association detection. In this paper, we propose and investigate this novel problem under the multi-domain setting. The problem is essential to a wide range of applications.

  • We develop a framework, MCA, based on nonnegative matrix tri-factorization to simultaneously cluster instances within each domain and reveal the associations between clusters from different domains. Clustering and cluster association discovery could mutually enhance each other. We provide rigorous theoretical analysis of MCA in terms of its correctness, convergence and complexity.

  • We evaluate MCA by extensive experiments on both synthetic and real datasets. The experimental results demonstrate that MCA is superior to existing approaches in both clustering accuracy and cluster association accuracy.

II. Related Work

To the best of our knowledge, this is the first work to “align” clusters across multiple domains. Existing work on network clustering has primarily focused on clustering in a single network [7], [9]. In [9], the authors pioneered a graph partitioning algorithm using normalized cut. Spectral clustering has gained popularity as an efficient clustering algorithm in recent years [11]; it is simple to implement since its computational burden lies primarily in computing eigenvectors, a problem studied in depth in numerical analysis. In [7], the authors proposed a framework to detect communities in a network based on modularity. Other multi-domain graph (network) clustering methods [3] focus on improving the clustering accuracy within each domain by utilizing information from other domains. The cross-domain instance relationships are only used to enhance the clustering result within each domain; they do not capture associations between clusters across multiple domains. Multi-domain data are inherently heterogeneous. Networks constructed from multiple related domains can be transformed into a heterogeneous information network, on which clustering may be performed [12]. However, this pioneering work focused on ranking-based clustering, which ranks clusters on a pre-specified target type (domain); this differs from our goal of performing clustering both within and across domains. In addition, some co-clustering methods also make use of nonnegative matrix tri-factorization (NMTF) and graph regularizers. Co-clustering [13], DNMTF [14] and RCC [15] were originally designed to improve clustering accuracy on documents by clustering the rows and columns of a term-document matrix simultaneously [16], paying special attention to the duality between clustering terms and clustering documents. Cross-domain cluster associations are not explicitly considered by co-clustering methods, even though some of them may be adapted to derive information about such associations. They are either incapable of handling the multi-network setting or sensitive to noise, since they were not designed for network clustering.

III. Multi-Network Clustering via Cross-Domain Cluster Alignment

In this section, we discuss the problem definition and our proposed algorithm MCA.

A. Problem Definition

Suppose that we have $N$ domains $\{D_1, D_2, \ldots, D_N\}$. Instances and their relationships within each domain are represented by a network $G_p$ $(1 \leq p \leq N)$. Let $A_p$ be the affinity/adjacency matrix of $G_p$. Some instances in domain $D_p$ $(1 \leq p \leq N)$ may be related to instances in domain $D_q$ $(1 \leq q \leq N, p \neq q)$. These cross-domain relationships between instances can be represented by a matrix $W_{pq}$. Important notations are listed in Table II. More often than not, the matrix $W_{pq}$ is derived from prior knowledge and may be incomplete and noisy. Our goal is to integrate these cross-network instance relationships into the task of multi-network clustering and to infer cross-network associations between clusters. We formulate this as an optimization problem that generates the clustering and the cluster associations simultaneously. We now discuss the formulation in detail.

TABLE II.

Summary of symbols and their meanings

Symbol Description
N The number of domains
Dp
The p-th domain
np The number of instances in Dp
kp The number of clusters in Dp
Gp
The network representing relationship among instances in Dp
Ap The affinity/adjacency matrix of Gp
Wpq The cross-domain relationship between instances from Dp and Dq
Hp The cluster assignment matrix in Dp
Spq The cross-domain alignment matrix between Dp and Dq

For simplicity, we begin with 2 domains. For clustering in domain $D_1$, we want to minimize the following objective function

$$\sum_{ij} A_{1ij} \|H_{1i\cdot} - H_{1j\cdot}\|_2^2, \quad \text{s.t. } H_1^T H_1 = I,\; H_1 \geq 0$$

where $H_1$ is the cluster assignment matrix in domain $D_1$ and $H_{1i\cdot}$ denotes the $i$-th row of matrix $H_1$. $H_{1ik}$ can be viewed as the probability that the $i$-th instance in domain $D_1$ belongs to the $k$-th cluster of this domain. A similar objective function can be applied to clustering in domain $D_2$.

In order to capture the cross-network cluster associations, we adopt the co-clustering strategy that minimizes the following objective function

$$\|W_{12} - H_1 S_{12} H_2^T\|_F^2 + \eta_1 \|S_{12}\|_1, \quad \text{s.t. } H_1^T H_1 = I,\; H_2^T H_2 = I,\; H_1 \geq 0,\; H_2 \geq 0$$

This objective function is a sparse nonnegative matrix tri-factorization. With orthogonality constraints on $H_1$ and $H_2$, it is equivalent to running K-means co-clustering on $W_{12}$ [13]. $S_{12}$ is the cross-domain alignment matrix, depicting the alignment of the clusters from the two domains. Because $W_{12}$ may contain noise, we impose the $\ell_1$-norm on $S_{12}$ to suppress the influence of any inconsistencies in $W_{12}$.

Combining together the above two parts, we have the following optimization problem

$$\min \; \|W_{12} - H_1 S_{12} H_2^T\|_F^2 + \alpha_1 \sum_{ij} A_{1ij} \|H_{1i\cdot} - H_{1j\cdot}\|_2^2 + \alpha_2 \sum_{ij} A_{2ij} \|H_{2i\cdot} - H_{2j\cdot}\|_2^2 + \eta_1 \|S_{12}\|_1 \quad \text{s.t. } H_1^T H_1 = I,\; H_2^T H_2 = I,\; H_1 \geq 0,\; H_2 \geq 0 \qquad (1)$$

where α1, α2 and η1 are parameters that balance the different terms, whose values can be determined via cross validation.

By simplifying it, we obtain

$$\min_{H_1 \geq 0,\, H_2 \geq 0,\, S_{12} \geq 0} \|W_{12} - H_1 S_{12} H_2^T\|_F^2 + \alpha_1 \mathrm{Tr}(H_1^T \Theta_1 H_1) + \alpha_2 \mathrm{Tr}(H_2^T \Theta_2 H_2) + \eta_1 \|S_{12}\|_1 \quad \text{s.t. } H_1^T H_1 = I,\; H_2^T H_2 = I \qquad (2)$$

where $\Theta_1$ and $\Theta_2$ are the graph Laplacian matrices of $G_1$ and $G_2$, respectively.

We now return to the multi-domain case. Let $W_{pq}$ be the matrix defining instance-level relationships between domain $D_p$ and domain $D_q$; we assume $W_{pq} = W_{qp}^T$. Similarly, we use $S_{pq}$ to represent the cross-domain cluster alignment matrix between domain $D_p$ and domain $D_q$, with $S_{pq} = S_{qp}^T$. The optimization problem in Eq. (2) extends naturally to the following multi-domain formulation.

$$\min_{H_p \geq 0,\, S_{pq} \geq 0,\, p \neq q} \sum_{p \neq q} \|W_{pq} - H_p S_{pq} H_q^T\|_F^2 + \sum_p \alpha_p \mathrm{Tr}(H_p^T \Theta_p H_p) + \sum_{p \neq q} \eta_{pq} \|S_{pq}\|_1, \quad \text{s.t. } \forall p,\; H_p^T H_p = I \qquad (3)$$

B. Learning Algorithm

In this section, we present the learning algorithm, MCA, which solves the optimization problem in Eq. (3). Since the objective function is not jointly convex in all variables, we adopt an alternating optimization scheme: each time, we optimize the objective with respect to one variable while fixing the others. The following two theorems set the foundation for our algorithm; their correctness and convergence are proven later.

Theorem 1

While fixing other variables, the objective function in Eq. (3) will monotonically decrease every time we update Spq according to Eq. (4) until convergence.

$$S_{pq} \leftarrow S_{pq} \circ \sqrt{\frac{H_p^T W_{pq} H_q}{H_p^T H_p S_{pq} H_q^T H_q + \frac{1}{2}\eta_{pq}}} \qquad (4)$$

Theorem 2

While fixing other variables, the objective function in Eq. (3) will monotonically decrease every time we update Hp according to Eq. (5) until convergence.

$$H_p \leftarrow H_p \circ \sqrt{\frac{\sum_{q \neq p}(W_{pq} H_q S_{pq}^T) + \alpha_p (\Theta_p^- H_p)}{H_p H_p^T \left(\sum_{q \neq p}(W_{pq} H_q S_{pq}^T) + \alpha_p (\Theta_p^- H_p)\right)}} \qquad (5)$$

where $\Theta_p^-$ is the negative part of $\Theta_p$, i.e., $\Theta_{pij}^- = (|\Theta_{pij}| - \Theta_{pij})/2$.

Note that $\circ$, $\frac{(\cdot)}{(\cdot)}$ and $\sqrt{(\cdot)}$ denote element-wise multiplication, division and square root, respectively. Based on Theorems 1 and 2, we develop an iterative updating algorithm summarized in Algorithm 1.
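
To make the alternating scheme concrete, the following minimal NumPy sketch instantiates the multiplicative updates of Eqs. (4) and (5) for the two-domain case. The function name, random initialization, and fixed iteration count are our assumptions for illustration, not the authors' exact Algorithm 1.

```python
import numpy as np

def mca_two_domain(W12, A1, A2, k1, k2, alpha=(1.0, 1.0), eta=0.1,
                   n_iter=500, eps=1e-12, seed=0):
    """Sketch of the MCA alternating updates (Eqs. (4) and (5)), two domains.

    W12: (n1, n2) nonnegative cross-domain instance relationship matrix.
    A1, A2: affinity matrices of the two networks.
    Returns cluster assignment matrices H1, H2 and the alignment matrix S12.
    """
    rng = np.random.default_rng(seed)
    n1, n2 = W12.shape
    H = [rng.random((n1, k1)), rng.random((n2, k2))]
    S = rng.random((k1, k2))

    # Graph Laplacians Theta_p = D_p - A_p and their negative parts.
    Theta = [np.diag(A.sum(axis=1)) - A for A in (A1, A2)]
    Theta_neg = [(np.abs(T) - T) / 2 for T in Theta]   # Theta_p^-

    W = {(0, 1): W12, (1, 0): W12.T}
    S_of = lambda p, q: S if (p, q) == (0, 1) else S.T  # S_pq = S_qp^T

    for _ in range(n_iter):
        # Eq. (4): multiplicative update of the alignment matrix S12.
        num = H[0].T @ W12 @ H[1]
        den = H[0].T @ H[0] @ S @ H[1].T @ H[1] + 0.5 * eta
        S *= np.sqrt(num / (den + eps))

        # Eq. (5): multiplicative update of each cluster assignment H_p.
        for p, q in ((0, 1), (1, 0)):
            top = W[(p, q)] @ H[q] @ S_of(p, q).T + alpha[p] * (Theta_neg[p] @ H[p])
            bot = H[p] @ H[p].T @ top
            H[p] *= np.sqrt(top / (bot + eps))
    return H[0], H[1], S
```

In practice, one would also monitor the objective of Eq. (3) and stop once it decreases by less than a small tolerance, rather than running a fixed number of iterations.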

C. Correctness Analysis

In this section, we give the correctness analysis of the updating rules in Theorem 1, according to the Karush-Kuhn-Tucker (KKT) condition. The proof of updating rules for Theorem 2 is similar and hence omitted here.

Define the Lagrangian function with respect to Spq as

$$L(S_{pq}) = -2\,\mathrm{Tr}(H_p^T W_{pq} H_q S_{pq}^T) + \mathrm{Tr}(H_p S_{pq} H_q^T H_q S_{pq}^T H_p^T) + \eta_{pq} \sum_{ij} S_{pqij} - \mathrm{Tr}(\Lambda_{pq} S_{pq}^T) \qquad (6)$$

where $\Lambda_{pq}$ is a symmetric matrix whose entries are the Lagrangian multipliers. Note that $\|S_{pq}\|_1$ becomes $\sum_{ij} S_{pqij}$ in Eq. (6) because $S_{pq} \geq 0$. We also omit all terms that are constant with respect to $S_{pq}$ in Eq. (6). The same simplifications are used in the following analysis.

[Algorithm 1 (shown as an image in the original manuscript): the MCA iterative updating algorithm, which alternates the updates of Eqs. (4) and (5) until convergence.]

The partial derivative with respect to $S_{pq}$ is

$$\frac{\partial L(S_{pq})}{\partial S_{pq}} = -2(H_p^T W_{pq} H_q) + 2(H_p^T H_p S_{pq} H_q^T H_q) + \eta_{pq} - \Lambda_{pq} \qquad (7)$$

From the optimality condition $\frac{\partial L(S_{pq})}{\partial S_{pq}} = 0$,

$$\Lambda_{pq} = -2(H_p^T W_{pq} H_q) + 2(H_p^T H_p S_{pq} H_q^T H_q) + \eta_{pq} \qquad (8)$$

The KKT complementarity condition for the nonnegativity of $S_{pq}$ is

$$\Lambda_{pqij} S_{pqij} = 0 \qquad (9)$$

Combining with Eq. (8), the KKT complementarity condition becomes

$$\left(-2(H_p^T W_{pq} H_q)_{ij} + 2(H_p^T H_p S_{pq} H_q^T H_q)_{ij} + \eta_{pq}\right) S_{pqij} = 0 \qquad (10)$$

According to Eq. (4), the update rule for S is

$$S_{pq} \leftarrow S_{pq} \circ \sqrt{\frac{H_p^T W_{pq} H_q}{H_p^T H_p S_{pq} H_q^T H_q + \frac{1}{2}\eta_{pq}}}$$

At convergence, Spq at the left-hand side and right-hand side should be equal. Then, via simple derivation, we can verify that the update rule for Spq in Eq. (4) satisfies the KKT complementarity condition in Eq. (10).

D. Convergence Analysis

In this subsection, we prove convergence using an auxiliary function [17].

Definition 1

[17] Z(h, h′) is an auxiliary function for f(h) if the conditions

$$Z(h, h') \geq f(h), \qquad Z(h, h) = f(h) \qquad (11)$$

are satisfied for any given h, h′.

Lemma 1

If Z is an auxiliary function for f, then f is non-increasing under the update [17]

$$h^{(t+1)} = \arg\min_h Z(h, h^{(t)}) \qquad (12)$$

Proof

$f(h^{(t+1)}) \leq Z(h^{(t+1)}, h^{(t)}) \leq Z(h^{(t)}, h^{(t)}) = f(h^{(t)})$.   ■

The auxiliary function with respect to S is

$$Z(S_{pq}, S'_{pq}) = -2\sum_{i,j}(H_p^T W_{pq} H_q)_{ij} S'_{pqij}\left(1 + \log\frac{S_{pqij}}{S'_{pqij}}\right) + \sum_{i,j}\frac{(H_p^T H_p S'_{pq} H_q^T H_q)_{ij} S_{pqij}^2}{S'_{pqij}} + \eta_{pq}\sum_{i,j}\frac{S_{pqij}^2 + S_{pqij}'^2}{2 S'_{pqij}} \qquad (13)$$

$$\frac{\partial Z}{\partial S_{pqij}} = -2(H_p^T W_{pq} H_q)_{ij}\frac{S'_{pqij}}{S_{pqij}} + 2(H_p^T H_p S'_{pq} H_q^T H_q)_{ij}\frac{S_{pqij}}{S'_{pqij}} + \eta_{pq}\frac{S_{pqij}}{S'_{pqij}} \qquad (14)$$

Letting $\frac{\partial Z}{\partial S_{pqij}} = 0$, we obtain

$$S_{pqij} = S'_{pqij}\sqrt{\frac{(H_p^T W_{pq} H_q)_{ij}}{(H_p^T H_p S'_{pq} H_q^T H_q)_{ij} + \frac{1}{2}\eta_{pq}}} \qquad (15)$$

Similarly, the auxiliary function with respect to Hp is

$$\begin{aligned} Z(H_p, H'_p) = &-2\sum_{q\neq p}\sum_{i,j}(W_{pq} H_q S_{pq}^T)_{ij} H'_{pij}\left(1 + \log\frac{H_{pij}}{H'_{pij}}\right) + \sum_{q\neq p}\sum_{i,j}\frac{(H'_p S_{pq} H_q^T H_q S_{pq}^T)_{ij} H_{pij}^2}{H'_{pij}} \\ &+ \alpha_p \sum_{i,j}\frac{(\Theta_p^+ H'_p)_{ij} H_{pij}^2}{H'_{pij}} - \alpha_p \sum_{i,j,k}\Theta^-_{pki} H'_{pkj} H'_{pij}\left(1 + \log\frac{H_{pkj} H_{pij}}{H'_{pkj} H'_{pij}}\right) \\ &+ \sum_{i,j}\frac{(H'_p \Lambda_p)_{ij} H_{pij}^2}{H'_{pij}} - \mathrm{Tr}(\Lambda_p) \end{aligned} \qquad (16)$$

$$\frac{\partial Z}{\partial H_{pij}} = -2\sum_{q\neq p}(W_{pq} H_q S_{pq}^T)_{ij}\frac{H'_{pij}}{H_{pij}} + 2\sum_{q\neq p}(H'_p S_{pq} H_q^T H_q S_{pq}^T)_{ij}\frac{H_{pij}}{H'_{pij}} + 2\alpha_p(\Theta_p^+ H'_p)_{ij}\frac{H_{pij}}{H'_{pij}} - 2\alpha_p(\Theta_p^- H'_p)_{ij}\frac{H'_{pij}}{H_{pij}} + 2(H'_p \Lambda_p)_{ij}\frac{H_{pij}}{H'_{pij}} \qquad (17)$$

Letting $\frac{\partial Z}{\partial H_{pij}} = 0$, we obtain

$$H_{pij} = H'_{pij}\sqrt{\frac{\sum_{q\neq p}(W_{pq} H_q S_{pq}^T)_{ij} + \alpha_p(\Theta_p^- H'_p)_{ij}}{\sum_{q\neq p}(H'_p S_{pq} H_q^T H_q S_{pq}^T)_{ij} + \alpha_p(\Theta_p^+ H'_p)_{ij} + (H'_p \Lambda_p)_{ij}}} \qquad (18)$$

In order to determine Λp, we have

$$\frac{\partial L}{\partial H_p} = -2\sum_{q\neq p} W_{pq} H_q S_{pq}^T + 2\sum_{q\neq p} H_p S_{pq} H_q^T H_q S_{pq}^T + 2\alpha_p \Theta_p H_p + 2 H_p \Lambda_p \qquad (19)$$

Letting $\frac{\partial L}{\partial H_p} = 0$, we can solve for $\Lambda_p$.

After substituting $\Lambda_p$, we obtain

$$H_{pij} = H'_{pij}\sqrt{\frac{\sum_{q\neq p}(W_{pq} H_q S_{pq}^T)_{ij} + \alpha_p(\Theta_p^- H'_p)_{ij}}{\left(H'_p H_p'^T\left(\sum_{q\neq p} W_{pq} H_q S_{pq}^T + \alpha_p \Theta_p^- H'_p\right)\right)_{ij}}} \qquad (20)$$

E. Complexity Analysis

With a proper order of multiplication, updating $S$ and $H$ once requires $O(\tilde{n}^2\tilde{k} + \tilde{n}\tilde{k}^2 + \tilde{k}^3)$ and $O(N\tilde{n}\tilde{k}^2 + \tilde{n}^2\tilde{k})$ time, respectively, where $\tilde{n} = \max_p\{n_p\}$ is the largest number of instances among all domains and $\tilde{k} = \max_p\{k_p\}$ is the largest number of clusters among all domains. If the number of iterations is $\mathrm{Iter}$, the overall time complexity is $O(\mathrm{Iter} \cdot (\tilde{n}\tilde{k}^2 + \tilde{n}^2\tilde{k} + \tilde{k}^3 + N\tilde{n}\tilde{k}^2))$. In practice, the number of instances is much larger than the number of clusters in a domain, i.e., $\tilde{n} \gg \tilde{k}$. In this scenario, the overall time complexity of MCA simplifies to $O(\mathrm{Iter} \cdot (\tilde{n}^2\tilde{k} + N\tilde{n}\tilde{k}^2))$.

IV. Experimental Results

In this section, we evaluate the performance of MCA on both synthetic and real datasets. To the best of our knowledge, no previous method was specifically designed to discover cluster associations, although some co-clustering methods might be adapted to infer them. We compare MCA with three well-known co-clustering methods: Nonnegative Matrix Tri-Factorization proposed by Chris Ding (denoted NMTF_Chris in this paper) [13], Graph Dual Regularization Non-negative Matrix Tri-Factorization (DNMTF) [14], and Robust Co-Clustering (RCC) [15]. The parameters of each algorithm are tuned using 5-fold cross validation. Other co-clustering methods are not compared because they are either unsuitable for network clustering or do not impose nonnegativity on the association matrix $S$; the nonnegativity constraint on $S$ is essential to ensure the interpretability of the results in this problem setting.

A. Evaluation Metrics

We evaluate our results in two perspectives: clustering accuracy within each domain and cluster association accuracy across domains.

1) Clustering Accuracy

We use the widely used normalized mutual information (MI) metric to evaluate the clustering accuracy in each domain. For any domain $D$, let $C = \{c_i, i = 1, 2, \ldots, \hat{k}\}$ be the clustering result, where $c_i$ is the $i$-th cluster, and let $T = \{t_i, i = 1, 2, \ldots, k\}$ be the ground truth, where $t_i$ is the $i$-th cluster. The normalized MI is defined as

$$\widehat{MI}(C, T) = \frac{MI(C, T)}{\max(H(C), H(T))} \qquad (21)$$

where $H(C)$ and $H(T)$ are the entropies of clusterings $C$ and $T$, and $MI(C, T)$ is the mutual information between $C$ and $T$.

$$MI(C, T) = \sum_{c_i \in C,\, t_j \in T} p(c_i, t_j) \ln\frac{p(c_i, t_j)}{p(c_i)\, p(t_j)} \qquad (22)$$

where p(ci) is the percentage of instances contained in ci and p(ci, tj) is the percentage of instances contained in the intersection of ci and tj.
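
As a sketch of how Eqs. (21) and (22) can be computed from per-instance cluster labels (the function name and label encoding are ours, and we assume both partitions contain more than one cluster so the entropies are positive):

```python
import numpy as np

def normalized_mi(pred_labels, true_labels):
    """Minimal sketch of the normalized MI in Eqs. (21)-(22)."""
    pred, true = np.asarray(pred_labels), np.asarray(true_labels)
    # Mutual information MI(C, T) over the joint label distribution, Eq. (22).
    mi = 0.0
    for c in np.unique(pred):
        for t in np.unique(true):
            p_ct = np.mean((pred == c) & (true == t))       # p(c_i, t_j)
            if p_ct > 0:
                mi += p_ct * np.log(p_ct / (np.mean(pred == c) * np.mean(true == t)))
    # Entropies H(C), H(T) used for the normalization in Eq. (21).
    ent = lambda x: -sum(p * np.log(p)
                         for p in (np.mean(x == v) for v in np.unique(x)))
    return mi / max(ent(pred), ent(true))
```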

2) Cluster Association Accuracy

To evaluate the cross-network cluster association accuracy, we propose a new metric, the Clustering Association Metric (CAM). For simplicity, we consider the case of only two domains. Assume that we discover $\hat{h}$ pairs of cluster associations $\{c_j, c'_j\}$, $1 \leq j \leq \hat{h}$, where $c_j$ is a cluster in domain $D_1$ and $c'_j$ is its corresponding cluster in domain $D_2$. Also assume that the ground truth contains $h$ pairs of cluster associations $\{t_i, t'_i\}$, $1 \leq i \leq h$, where $t_i$ is a cluster in domain $D_1$ and $t'_i$ is its corresponding cluster in domain $D_2$. The CAM is then defined by the following equation.

$$CAM = \frac{1}{h}\sum_{i=1}^{h}\max_{1 \leq j \leq \hat{h}}\frac{|(t_i \cup t'_i) \cap (c_j \cup c'_j)|}{|(t_i \cup t'_i) \cup (c_j \cup c'_j)|} \qquad (23)$$

where $\cup$ is the set union and $\cap$ is the set intersection. To measure the similarity between an inferred cluster association $\{c_j, c'_j\}$ and a ground-truth association $\{t_i, t'_i\}$, we use the Jaccard score $\frac{|(t_i \cup t'_i) \cap (c_j \cup c'_j)|}{|(t_i \cup t'_i) \cup (c_j \cup c'_j)|}$ to evaluate the degree of overlap between $t_i \cup t'_i$ and $c_j \cup c'_j$. For each ground-truth association pair $\{t_i, t'_i\}$, we take the maximal value of this score over $j = 1, \ldots, \hat{h}$. The CAM is the average of these maximal values over all ground-truth pairs.
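
A direct translation of Eq. (23) into code might look as follows. Representing each association as a pair of identifier sets is our choice, and we assume instance identifiers are distinct across the two domains so that set unions are meaningful.

```python
def cam(truth_pairs, inferred_pairs):
    """Sketch of the CAM metric in Eq. (23).

    Each pair is (cluster_in_D1, cluster_in_D2), given as sets of instance
    identifiers; identifiers must not collide across domains.
    """
    total = 0.0
    for t, t2 in truth_pairs:
        tu = t | t2                                   # t_i union t'_i
        total += max(len(tu & (c | c2)) / len(tu | (c | c2))  # Jaccard score
                     for c, c2 in inferred_pairs)
    return total / len(truth_pairs)
```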

B. Simulation Study

We conducted a simulation study on the two synthetic networks G1 and G2 in Fig. 1. We compare MCA with DNMTF, NMTF_Chris and RCC with respect to robustness to varying levels of noise in the cross-domain instance-level relationship matrix W12. Fig. 2 shows that MCA achieves much higher clustering accuracy than all existing methods at all noise levels. Fig. 3 demonstrates the clear advantage of MCA over existing methods in capturing cross-domain cluster associations.

Fig. 2.
Clustering accuracy as a function of increasing percentage of noise in W12 on simulated data.

Fig. 3.
Cluster association accuracy as a function of increasing percentage of noise in W12 on simulated data.

C. DBLP Dataset

We also evaluate MCA using a labeled DBLP dataset [18], [19]. The dataset consists of papers and authors from 4 research areas: Database (DB), Artificial Intelligence (AI), Data Mining (DM) and Information Retrieval (IR). It contains 20 conferences and 4057 authors. The conferences are listed by area in Table III and the author distribution by area is shown in Table IV. We use $D_1$ to denote the author domain and $D_2$ to denote the conference domain. In $D_1$, the network $G_1$ represents co-authorship: each entry $A_{1ij}$ in the affinity matrix $A_1$ of $G_1$ is the number of papers coauthored by the $i$-th and $j$-th authors. The affinity matrix $A_2$ of $G_2$ represents the similarity between the topics covered by two conferences. To compute it, we first construct the term-conference matrix $F$, in which each entry $F_{ij}$ is the number of occurrences of the $i$-th term in the titles of papers published at the $j$-th conference. Each column $F_{\cdot j}$ of the matrix can thus be viewed as a feature vector describing the $j$-th conference, and the similarity score of two conferences $j$ and $j'$ is computed as the cosine similarity $\frac{F_{\cdot j}^T F_{\cdot j'}}{\|F_{\cdot j}\|\,\|F_{\cdot j'}\|}$. $W_{12}$ represents the relationships between authors and conferences, in which each entry is the number of papers that an author published at a given conference. A snapshot of the DBLP network used in our experiment is shown in Fig. 6.
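
As an illustration of the conference-similarity construction, the sketch below computes A2 from a term-conference matrix F via the cosine score above; zeroing the diagonal and guarding against empty columns are our assumptions.

```python
import numpy as np

def conference_affinity(F):
    """Sketch: build A2 from a (terms x conferences) count matrix F using
    the cosine similarity described in the text."""
    norms = np.linalg.norm(F, axis=0, keepdims=True)
    Fn = F / np.where(norms == 0, 1.0, norms)   # guard conferences with no terms
    A2 = Fn.T @ Fn                              # pairwise cosine similarities
    np.fill_diagonal(A2, 0.0)                   # no self-edges (our assumption)
    return A2
```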

TABLE III.

List of conferences from each research area in the DBLP dataset

DB AI DM IR

PODS AAAI KDD SIGIR
SIGMOD ICML ICDM WWW
VLDB IJCAI SDM WSDM
EDBT CVPR PKDD ECIR
ICDE ECML PAKDD CIKM

TABLE IV.

Number of authors from each research area in the DBLP dataset

DB AI DM IR

number 1197 1109 745 1006
percentage 29.5% 27.3% 18.4% 24.8%

Fig. 6.
A snapshot of the real DBLP network. We only display edges whose weights are above a threshold.

To compare the robustness of the different methods, we introduce noise by randomly shuffling a certain percentage of the entries in W12. Fig. 4 shows that noise in the prior knowledge of cross-domain relationships does not affect the clustering accuracy of MCA in the conference domain at all, and only modestly lowers its accuracy in the author domain when W12 is dominated by noise. Fig. 5 shows that the accuracy of the inferred cross-domain cluster associations likewise drops only modestly for MCA when the noise level is very high. In contrast, all other methods are far more sensitive to noise, among which NMTF_Chris performs noticeably better than DNMTF and RCC.
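
One plausible reading of this noise model, permuting a chosen fraction of the entries of W12 among themselves, is sketched below; the exact shuffling scheme used in the experiments is not specified, so treat this as illustrative.

```python
import numpy as np

def shuffle_noise(W12, fraction, seed=0):
    """Sketch: randomly shuffle `fraction` of the entries of W12 among
    themselves (our reading of the noise model; details are assumptions)."""
    rng = np.random.default_rng(seed)
    W = W12.copy()
    flat = W.ravel()                    # view into the copy
    idx = rng.choice(flat.size, size=int(fraction * flat.size), replace=False)
    flat[idx] = rng.permutation(flat[idx])   # permute the selected entries
    return W
```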

Fig. 4.
Normalized mutual information with respect to different noise levels on the DBLP dataset.

Fig. 5.
Cluster association accuracy with respect to different noise levels on the DBLP dataset.

To better understand how these methods perform, we list in Table V the top 4 associations between conference clusters and author clusters returned by MCA and NMTF_Chris when the noise level is set to 30%. We do not list the results of DNMTF and RCC because they return only a single cluster containing all conferences in the conference domain, which is clearly not what we desire. From Table V, we observe that MCA produces the correct clustering result in the conference domain: the conference cluster in each of the top 4 pairs corresponds to a distinct research area. NMTF_Chris, by contrast, makes many mistakes. It splits the conferences from the Database area into two clusters, and its third and fourth conference clusters are mixtures of conferences from different areas. In the author domain, for each author cluster, the percentage of authors from each of the 4 research areas is also shown in Table V. Each author cluster returned by MCA is dominated by authors from one research area, as indicated by the largest percentage highlighted in bold in each column, and that area matches the one suggested by the associated conference cluster. For example, consider the 1st pair of conference cluster and author cluster returned by MCA: the conference cluster includes PODS, SIGMOD, VLDB, ICDE and EDBT, all of which are Database conferences, and 94.2% of the authors in the associated author cluster also come from the Database area. MCA correctly infers this association between the author cluster and the conference cluster. The same observation holds for the remaining cluster pairs by MCA in Table V, demonstrating that MCA can discover meaningful associations between clusters from different domains. We also observe that some conference clusters and author clusters discovered by NMTF_Chris represent a mixture of multiple research areas. To quantify this observation, for the 4 author clusters, we compute the KL-divergence between the research area distributions of each pair of author clusters. A KL-divergence of 1 indicates that the authors in the two clusters are from two distinct areas; a KL-divergence of 0 indicates that the two author clusters have identical research area distributions and are thus indistinguishable. In Fig. 7, we use dark colors for small KL-divergences and light colors for large ones. A diagonal entry depicts the KL-divergence of an author cluster to itself, which is always 0. Off-diagonal entries correspond to the KL-divergence between two author clusters: the larger the KL-divergence, the better the clustering result. We observe that the 4 clusters by MCA all have large KL-divergences to each other, whereas the first two clusters by NMTF_Chris have a small KL-divergence. This demonstrates that MCA is more robust to random noise.

TABLE V.

The results on DBLP by MCA and NMTF_Chris at 30% noise level

MCA
conference clusters:
  pair No. 1: PODS, SIGMOD, VLDB, ICDE, EDBT
  pair No. 2: KDD, ICDM, SDM, PKDD, PAKDD
  pair No. 3: AAAI, ICML, IJCAI, CVPR, ECML
  pair No. 4: SIGIR, WWW, WSDM, ECIR, CIKM
author area distribution:
        pair No. 1   pair No. 2   pair No. 3   pair No. 4
  DB    0.943        0.134        0.038        0.293
  DM    0.036        0.835        0.151        0.088
  AI    0.007        0.000        0.698        0.000
  IR    0.014        0.031        0.113        0.619

NMTF_Chris
conference clusters:
  pair No. 1: VLDB, ICDE, EDBT
  pair No. 2: PODS, SIGMOD
  pair No. 3: AAAI, ICML, IJCAI, ECML, PKDD, PAKDD
  pair No. 4: SIGIR, WWW, WSDM, ECIR, CIKM, KDD, ICDM, SDM, CVPR
author area distribution:
        pair No. 1   pair No. 2   pair No. 3   pair No. 4
  DB    0.736        0.868        0.035        0.084
  DM    0.236        0.088        0.292        0.361
  AI    0.000        0.022        0.602        0.042
  IR    0.028        0.022        0.071        0.513

Fig. 7.
The pairwise KL-divergence between research area distributions of author clusters in Table V.

Duplicate Names

It is possible for several authors in the DBLP database to share the same name, so publications by these authors might be mistakenly attributed to other authors with the same name. To evaluate the robustness of our method in this setting, we first randomly pick a certain percentage of authors and then randomly pair them up. We “pretend” that the two authors in each pair use the same name, so that their publications are indistinguishable, by replacing the two corresponding row vectors in W12 with their average vector. Since only a small percentage of authors have this issue, we test robustness for up to 30% duplicate names. Fig. 8 shows that MCA successfully resolves the ambiguity: its clustering accuracies and cluster association accuracy are not impacted. NMTF_Chris is second best, with comparable accuracies in cluster association and author clustering but falling short on conference clustering accuracy.
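
The perturbation just described can be sketched as follows; pairing the picked authors uniformly at random is our assumption.

```python
import numpy as np

def merge_duplicate_names(W12, fraction, seed=0):
    """Sketch of the duplicate-name experiment: pick a fraction of authors,
    pair them up at random, and replace each pair's rows by their average."""
    rng = np.random.default_rng(seed)
    W = W12.copy()
    n = W.shape[0]                               # number of authors (rows)
    m = int(fraction * n) // 2 * 2               # even number of picked authors
    picked = rng.choice(n, size=m, replace=False)
    for i, j in picked.reshape(-1, 2):           # random pairing
        avg = (W[i] + W[j]) / 2.0
        W[i] = W[j] = avg                        # publications indistinguishable
    return W
```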

Fig. 8.
Results on the DBLP dataset with duplicate names.

D. Yeast eQTL Dataset

Expression quantitative trait loci (eQTL) mapping is the process of identifying single nucleotide polymorphisms (SNPs) that play important roles in the expression of genes. It has been widely used to dissect the genetic basis of complex traits [5]. Traditionally, associations between individual expression traits and SNPs are assessed separately [20]. Since genes in the same biological pathway are often co-regulated and may share a common genetic basis [21], it is crucial to understand how multiple modestly associated SNPs interact to influence phenotypes [22]. To answer this question, several approaches have been proposed to study the joint effect of multiple SNPs by testing the association between a set of SNPs and a gene expression trait [23]–[26]. Despite their success, these methods have two common limitations. First, only the association between a set of SNPs and a single expression trait is studied; they therefore overlook the joint effect of a set of SNPs on the activities of a set of genes, which may act and interact with each other to achieve a certain biological function. Second, the SNP sets used in these methods are usually taken from known biological pathways, which are far from complete, so these methods cannot identify unknown associations between SNP sets and gene sets. To better elucidate the genetic basis of gene expression and understand the underlying biological pathways, it is highly desirable to develop methods that can automatically infer associations between a group of SNPs and a group of genes. The process of identifying such associations is referred to as group-wise eQTL mapping, to distinguish it from individual eQTL mapping [6], which identifies associations between individual SNPs and genes. The MCA method proposed in this paper is well suited to group-wise eQTL mapping.

We compare MCA with NMTF_Chris, DNMTF and RCC on a yeast eQTL dataset [27]. This dataset originally includes expression profiles of 6229 genes and genotype profiles of 2956 SNPs. After preprocessing (e.g., removing missing values), the dataset is reduced to 1017 SNPs and expression profiles of 4474 genes.

We denote the SNP domain as D1 and the gene domain as D2, respectively. The SNP interaction network G1 is generated as in [28]. The gene interaction network G2 is constructed by computing the Pearson’s correlation of the expression levels of each pair of genes.

$$r_{XY} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\sqrt{\sum_i (Y_i - \bar{Y})^2}} \qquad (24)$$

where $X$ and $Y$ are vectors representing the expression profiles of the two genes, and $X_i$ and $Y_i$ are the $i$-th components of $X$ and $Y$, respectively. From Eq. (24), the Pearson correlation $r$ ranges from $-1$ to $1$, where $1$ means that the two genes are completely positively correlated and $-1$ means that they are completely negatively correlated. The edge between genes $X$ and $Y$ in the gene interaction network is weighted by $|r_{XY}|$. The association matrix $W_{12}$ is given by association tests between individual SNPs and individual genes using PLINK [29].
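
Since np.corrcoef implements exactly the Pearson formula of Eq. (24), the gene network construction can be sketched in a few lines; dropping self-correlations on the diagonal is our assumption, and genes with constant expression would need special handling.

```python
import numpy as np

def gene_network(expr):
    """Sketch: build G2 from an (n_genes x n_samples) expression matrix by
    weighting each edge with |Pearson correlation|, as in Eq. (24)."""
    A2 = np.abs(np.corrcoef(expr))   # rows are genes; assumes no constant rows
    np.fill_diagonal(A2, 0.0)        # drop self-correlations (our assumption)
    return A2
```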

Gene Ontology Enrichment Analysis

Since there is no ground truth for the yeast eQTL dataset, we cannot measure the clustering accuracy and cluster association accuracy directly. Instead, we evaluate the quality of our result by Gene Ontology Enrichment Analysis (GOEA) [30]. For each inferred gene cluster ci, we identify the most significantly enriched Gene Ontology categories [31]. The significance (p-value) is determined by Fisher's exact test, and the raw p-values are further calibrated to correct for the multiple testing problem [32]. To compute calibrated p-values for ci, we perform a randomization test in which we apply the same test to 1000 randomly created gene sets that have the same number of genes as ci. To evaluate the clusters in the SNP domain, we first map the SNPs in a cluster to their nearest genes on the genome and then apply the standard GOEA procedure to the resulting gene set to compute a p-value. In Fig. 9, clusters are arranged in ascending order of their p-values. We consider clusters with p-values less than 0.05 to be significant. The numbers of significant gene and SNP clusters are listed in Table VI. Not surprisingly, MCA identifies more significant clusters in both the gene and SNP domains than its competitors.
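
A sketch of this randomization test is given below. The `calibrated_pvalue` helper, the one-sided Fisher alternative, and the way the calibrated p-value is computed from the random sets are our assumptions about details the text leaves open.

```python
import numpy as np
from scipy.stats import fisher_exact

def calibrated_pvalue(cluster, category, universe, n_rand=1000, seed=0):
    """Sketch: compare a cluster's Fisher's-exact enrichment p-value against
    1000 equally sized random gene sets drawn from the background `universe`.
    All arguments are sets of gene identifiers."""
    rng = np.random.default_rng(seed)

    def fisher_p(genes):
        a = len(genes & category)          # in set and in GO category
        b = len(genes) - a                 # in set, not in category
        c = len(category) - a              # in category, not in set
        d = len(universe) - a - b - c      # in neither
        return fisher_exact([[a, b], [c, d]], alternative="greater")[1]

    p_obs = fisher_p(cluster)
    rand_ps = [fisher_p(set(rng.choice(sorted(universe), size=len(cluster),
                                       replace=False)))
               for _ in range(n_rand)]
    # Calibrated p-value: fraction of random sets at least as enriched.
    return np.mean([p <= p_obs for p in rand_ps])
```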

Fig. 9.
Gene ontology enrichment analysis on the yeast eQTL data.

TABLE VI.

The number of significantly enriched clusters measured by GOEA

MCA NMTF_Chris DNMTF RCC

gene 36 31 23 26
SNP 99 98 82 62

E. Gene Disease Dataset

We further evaluate MCA on a gene-disease network dataset [33]. The dataset contains 590 disease phenotypes in 20 disease classes and 7997 genes in 200 gene pathways. There are two domains: the gene domain D1 and the disease phenotype domain D2. G1 represents the “functional” relationships between genes, which are measured by interactions between the proteins transcribed from the genes, because most genes “perform” their functions through their transcribed proteins. This protein-protein interaction network is obtained from HPRD [34]. The relationships among phenotypes are represented by a phenotype similarity network G2, obtained from [35]. It is an undirected network whose vertices represent OMIM [36] disease phenotypes and whose edges (with weights between 0 and 1) represent similarities between phenotypes measured by their co-occurrences in clinical synopsis records. The associations between disease phenotypes and genes are also available in OMIM. We evaluate the clustering accuracy in each domain using the normalized MI discussed in Section IV-A. The first row in Table VII reports the normalized MI in the phenotype domain and the second row the normalized MI in the gene domain. As we can see, MCA is again the winner.

TABLE VII.

Results on the gene disease network

method MCA NMTF_Chris DNMTF RCC

Normalized MI_pheno 0.19 0.13 0.14 0.15
Normalized MI_gene 0.05 0.02 0.05 0.04

F. Performance Evaluation

In this section, we study the run-time performance of MCA, measured by the number of iterations before converging to a local optimum. Table VIII summarizes the network sizes and the number of iterations upon convergence on different datasets. We observe that MCA converges within a reasonable number of iterations even for large networks. As expected, the number of iterations increases with the network size. Usually, several hundred iterations are needed before convergence, but the actual running time is short. Table IX shows the time used by the different methods to converge on the DBLP dataset. All methods except RCC run very fast. We conclude that MCA achieves much better accuracy without requiring more computation time.

TABLE VIII.

Number of iterations to converge

dataset size of network G1 size of network G2 number of iterations

synthetic 17 20 24
DBLP 441 20 80
eQTL 1017 4474 741
Gene-disease 3619 366 144

TABLE IX.

Amount of time to converge on DBLP

method MCA NMTF_Chris DNMTF RCC

time cost (second) 0.8 0.4 0.4 11.5

V. CONCLUSION

In this paper, we propose a novel algorithm, MCA, for network clustering across multiple related domains. By leveraging the duality between clustering individual networks and inferring cross-network cluster alignment, MCA effectively incorporates prior knowledge of cross-network instance relationships into multi-network clustering. The algorithm is robust to noise and is capable of detecting cross-domain associations between clusters, which was not addressed by previous studies. Extensive experiments on synthetic and several real datasets demonstrate the effectiveness and efficiency of MCA and its advantages over existing methods.

Acknowledgments

This work was partially supported by the National Science Foundation grants IIS-1162374, IIS-1218036, IIS-1313606 and IIS-1017415, by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053, by the National Institutes of Health under grant numbers R01LM011986 and U54GM114833-01, and by the Region II University Transportation Center under project number 49997-33 25.

Contributor Information

Rui Liu, Email: rui.liu4@case.edu.

Wei Cheng, Email: weicheng@cs.unc.edu.

Hanghang Tong, Email: htong6@asu.edu.

Wei Wang, Email: weiwang@cs.ucla.edu.

Xiang Zhang, Email: xiang.zhang@case.edu.

References

  • 1. MacQueen J, et al. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. 1967;1:281–297.
  • 2. Kuang D, Park H, Ding CH. Symmetric nonnegative matrix factorization for graph clustering. SDM. 2012;12:106–117.
  • 3. Cheng W, Zhang X, Guo Z, Wu Y, Sullivan PF, Wang W. Flexible and robust co-regularized multi-domain graph clustering. KDD. 2013:320–328. doi: 10.1145/2903147.
  • 4. Chaudhuri K, Kakade SM, Livescu K, Sridharan K. Multi-view clustering via canonical correlation analysis. ICML. 2009:129–136.
  • 5. Michaelson JJ, Loguercio S, Beyer A. Detection and interpretation of expression quantitative trait loci (eQTL). Methods. 2009;48(3):265–276. doi: 10.1016/j.ymeth.2009.03.004.
  • 6. Cheng W, Yu S, Zhang X, Wang W. Fast and robust group-wise eQTL mapping using sparse graphical models. BMC Bioinformatics. 2015;16:2. doi: 10.1186/s12859-014-0421-z.
  • 7. Newman ME. Modularity and community structure in networks. PNAS. 2006;103(23):8577–8582. doi: 10.1073/pnas.0601602103.
  • 8. Shi J, Malik J. Normalized cuts and image segmentation. TPAMI. 2000;22(8):888–905.
  • 9. Von Luxburg U. A tutorial on spectral clustering. Statistics and Computing. 2007;17(4):395–416.
  • 10. Sun Y, Han J, Zhao P, Yin Z, Cheng H, Wu T. RankClus: integrating clustering with ranking for heterogeneous information network analysis. EDBT. 2009:565–576.
  • 11. Ding C, Li T, Peng W, Park H. Orthogonal nonnegative matrix t-factorizations for clustering. KDD. 2006:126–135.
  • 12. Shang F, Jiao L, Wang F. Graph dual regularization non-negative matrix factorization for co-clustering. Pattern Recognition. 2012;45(6):2237–2250.
  • 13. Du L, Shen Y-D. Towards robust co-clustering. IJCAI. 2013:1317–1322.
  • 14. Dhillon IS, Mallela S, Modha DS. Information-theoretic co-clustering. KDD. 2003:89–98.
  • 15. Lee DD, Seung HS. Algorithms for non-negative matrix factorization. NIPS. 2000:556–562.
  • 16. Ji M, Sun Y, Danilevsky M, Han J, Gao J. Graph regularized transductive classification on heterogeneous information networks. Machine Learning and Knowledge Discovery in Databases. Springer; 2010:570–586.
  • 17. Gao J, Liang F, Fan W, Sun Y, Han J. Graph-based consensus maximization among multiple supervised and unsupervised models. NIPS. 2009:585–593.
  • 18. Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT. Mapping determinants of human gene expression by regional and genome-wide association. Nature. 2005;437(7063):1365–1369. doi: 10.1038/nature04244.
  • 19. Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB. Detection of gene x gene interactions in genome-wide association studies of human population data. Human Heredity. 2007;63(2):67–84. doi: 10.1159/000099179.
  • 20. Lander ES. Initial impact of the sequencing of the human genome. Nature. 2011;470(7333):187–197. doi: 10.1038/nature09792.
  • 21. Holden M, Deng S, Wojnowski L, Kulle B. GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics. 2008;24(23):2784–2785. doi: 10.1093/bioinformatics/btn516.
  • 22. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.
  • 23. Braun R, Buetow K. Pathways of distinction analysis: a new technique for multi-SNP analysis of GWAS data. PLoS Genetics. 2011;7(6):e1002101. doi: 10.1371/journal.pgen.1002101.
  • 24. Listgarten J, Lippert C, Kang EY, Xiang J, Kadie CM, Heckerman D. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics. 2013;29(12):1526–1533. doi: 10.1093/bioinformatics/btt177.
  • 25. Brem RB, Storey JD, Whittle J, Kruglyak L. Genetic interactions between polymorphisms that affect gene expression in yeast. Nature. 2005;436(7051):701–703. doi: 10.1038/nature03865.
  • 26. Lee S, Xing EP. Leveraging input and output structures for joint mapping of epistatic and marginal eQTLs. Bioinformatics. 2012;28(12):i137–i146. doi: 10.1093/bioinformatics/bts227.
  • 27. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81(3):559–575. doi: 10.1086/519795.
  • 28. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 2005;102(43):15545–15550. doi: 10.1073/pnas.0506580102.
  • 29. Cheng W, Zhang X, Wu Y, Yin X, Li J, Heckerman D, Wang W. Inferring novel associations between SNP sets and gene sets in eQTL study using sparse graphical model. Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. 2012:466–473.
  • 30. Westfall PH. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Vol. 279. John Wiley & Sons; 1993.
  • 31. Hwang T, Atluri G, Xie M, Dey S, Hong C, Kumar V, Kuang R. Co-clustering phenome-genome for phenotype classification and disease gene discovery. Nucleic Acids Research. 2012;40(19):e146. doi: 10.1093/nar/gks615.
  • 32. Peri S, Navarro M, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research. 2003;13(10):2363–2371. doi: 10.1101/gr.1680803.
  • 33. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA. A text-mining analysis of the human phenome. European Journal of Human Genetics. 2006;14(5):535–542. doi: 10.1038/sj.ejhg.5201585.
  • 34. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research. 2005;33(suppl 1):D514–D517. doi: 10.1093/nar/gki033.
