Published in final edited form as: IEEE Trans Artif Intell. 2021 Apr 5;2(2):146–168. doi: 10.1109/tai.2021.3065894

A Survey on Multi-View Clustering

Guoqing Chao 1, Shiliang Sun 2, Jinbo Bi 3
PMCID: PMC8925043  NIHMSID: NIHMS1736496  PMID: 35308425

Abstract

Clustering is a machine learning paradigm that divides sample subjects into a number of groups such that subjects in the same group are more similar to each other than to those in other groups. With advances in information acquisition technologies, samples can frequently be viewed from different angles or in different modalities, generating multi-view data. Multi-view clustering (MVC), which clusters subjects into subgroups using multi-view data, has attracted more and more attention. Although MVC methods have developed rapidly, there have been few surveys that summarize and analyze the current progress. Therefore, we propose a novel taxonomy of MVC approaches. Similar to the treatment of other machine learning methods, we categorize them into generative and discriminative classes. Based on the way views are integrated, we further split the discriminative class into five groups: Common Eigenvector Matrix, Common Coefficient Matrix, Common Indicator Matrix, Direct Combination, and Combination After Projection. Furthermore, we relate MVC to other topics: multi-view representation, ensemble clustering, multi-task clustering, and multi-view supervised and semi-supervised learning. Several representative real-world applications are elaborated for practitioners. Some benchmark multi-view datasets are introduced, and representative MVC algorithms from each group are empirically evaluated to analyze how they perform on benchmark datasets. To promote the future development of MVC approaches, we point out several open problems that may require further investigation and thorough examination.

Keywords: Multi-view learning, clustering, survey, nonnegative matrix factorization, k-means, spectral clustering, subspace clustering, canonical correlation analysis, machine learning, data mining

I. INTRODUCTION

CLUSTERING [1] is a paradigm that divides subjects into a number of groups such that subjects in the same group are more similar to each other and dissimilar to subjects in other groups. It is a fundamental task in the machine learning, pattern recognition, and data mining fields and has widespread applications. Once subgroups are obtained by clustering methods, many subsequent analytic tasks can be conducted to achieve different ultimate goals. Traditional methods cluster subjects on the basis of only a single set of features or a single information window of the subjects. When multiple sets of features are available for each individual subject, how these views can be integrated to help identify the essential grouping structure is the problem of concern in this paper, often referred to as multi-view clustering. A good illustration of the importance of multi-view clustering, or multi-view learning in general, is the story of "the blind men and the elephant", where each blind man (a single view of the subject) cannot acquire the true picture of the subject [2]; only by collecting multi-view data can the whole picture be recovered.

Multi-view data are very common in real-world applications in the big data era. For instance, a web page can be described by the words appearing on the page itself and by the words in the hyperlinks pointing to it from other pages. In multimedia content understanding, multimedia segments can be simultaneously described by the video signals from visual cameras and the audio signals from voice recorders. The existence of such multi-view data has raised interest in multi-view learning [3], [4], [5], which has been extensively studied in the semi-supervised learning setting. For unsupervised learning, particularly multi-view clustering, single-view clustering methods cannot make effective use of the multi-view information in various problems. For instance, a multi-view clustering problem may require identifying clusters of subjects that differ in each of the data views. In this case, concatenating the features from the different views into a single union and then applying a single-view clustering method may not serve the purpose: there is no mechanism to guarantee that the resulting clusters differ in all of the views, because the grouping may be biased towards a view (or views) that contributes a dominantly large number of features to the feature union. Multi-view clustering has thus attracted more and more attention in the past two decades, which makes it necessary and beneficial to summarize the state of the art and delineate open problems to guide future advancement.

First, we give the definition of multi-view clustering (MVC). MVC is a machine learning paradigm that classifies similar subjects into the same group and dissimilar subjects into different groups by combining the available multi-view feature information, searching for clusterings that are consistent across the different views. Similar to the categorization of clustering algorithms in [1], we divide the existing MVC methods into two categories: generative (or model-based) approaches and discriminative (or similarity-based) approaches. Generative approaches try to learn the underlying distribution of the data and use generative models to represent it, with each model representing one cluster. Discriminative approaches directly optimize an objective function that involves pairwise similarities, so as to maximize the average similarity within clusters and minimize the average similarity between clusters. In the discriminative family, there are mainly three strategies to combine multiple views: assuming that all views share a similar structure, combining the views directly, and combining the views after projection. According to the structure shared, we further split the MVC methods based on the first strategy into three groups: (1) common eigenvector matrix based (mainly multi-view spectral clustering), (2) common coefficient matrix based (mainly multi-view subspace clustering), and (3) common indicator matrix based (mainly multi-view nonnegative matrix factorization clustering). The complete taxonomy is shown in Fig. 1.

Fig. 1: The taxonomy of multi-view clustering methods.

Motivated by the same real-world multi-view applications as MVC, multi-view representation, multi-view supervised, and multi-view semi-supervised learning methods have an inherently close relation with MVC; the similarities and differences among these learning paradigms are therefore also worth discussing. An obvious commonality is that they all learn with multi-view information. However, their learning targets differ. Multi-view representation methods aim to learn a joint compact representation of the subjects from all views, whereas MVC aims to partition the samples, and it does so without any label information. In contrast, multi-view supervised and semi-supervised learning methods have access to all or part of the label information. Some of the view combination strategies in these related paradigms can be borrowed and adapted by MVC. In addition, the relationships among MVC, ensemble clustering, and multi-task clustering are also elaborated in this review.

MVC has been applied to many scientific domains such as computer vision, natural language processing, social multimedia, bioinformatics, and health informatics. Although MVC has permeated many fields and achieved great success in practice, there remain open problems that limit its further advancement. We point out several of them in the hope of promoting the development of MVC. With this survey, we hope that readers can gain a more comprehensive view of MVC's development and of what lies beyond the current progress.

There has been an earlier MVC survey [6]; we describe the differences between that survey and ours, which necessitate the present one. First, that work summarized methods corresponding to a subset of our discriminative category, while the generative category is a non-negligible direction. Generative methods assume that each cluster comes from a specific distribution in each view and combine the views to conduct MVC. Since most of them are based on the EM algorithm or convex mixture models, they have some inherent advantages over discriminative methods, such as being able to handle missing values or to obtain globally optimal solutions. Second, we discuss the relationship between MVC and several related topics: multi-view representation learning, ensemble clustering, multi-task clustering, and multi-view supervised and semi-supervised learning. This discussion helps researchers position MVC in a scientific context and potentially gain deeper insights into all of these topics. Third, we summarize representative applications of the various MVC methods for reference by interested users. Fourth, in Sections II and III we examine the pros and cons of each class of MVC methods and the circumstances for which they are suitable, and we conduct a comprehensive comparison of representative MVC algorithms from each group to further analyze and verify the advantages and disadvantages of each group. Last but not least, we draw attention to certain open problems in the hope that these directions help further advance MVC.

The remainder of this paper is organized as follows. In Section II, we review the existing generative methods for MVC. Section III introduces several classes of discriminative MVC methods. In Section IV, we analyze the relationships between MVC and several related topics. Section V presents the applications of MVC in different areas. In Section VI, we introduce several commonly used MVC datasets and conduct experiments on them to investigate how representative methods perform. In Section VII, we list several open problems with the aim of advancing the development of MVC. Finally, we conclude the paper.

II. GENERATIVE APPROACHES

Generative approaches aim to learn generative models, each of which generates the data of one cluster. In the multi-view case, multiple generative models need to be learned and then combined to obtain the final clustering result. In most cases, generative clustering approaches are based on mixture models or are constructed via expectation maximization (EM) [7]. Therefore, we first introduce mixture models, the EM algorithm, and another popular single-view clustering model named the convex mixture model (CMM) [8], and then introduce the multi-view variants of these methods.

A. Mixture Models and CMMs

A generative approach assumes that data are sampled independently from a mixture model of multiple probability distributions. The mixture distribution can be written as

$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k),$ (1)

where $\pi_k$ is the prior probability of the $k$th component and satisfies $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$, $\theta_k$ is the parameter of the $k$th probability density model, and $\theta = \{(\pi_k, \theta_k), k = 1, 2, \ldots, K\}$ is the parameter set of the mixture model. For instance, $\theta_k = \{\mu_k, \Sigma_k\}$ for a Gaussian mixture model.

EM is a widely used algorithm for parameter estimation of mixture models. Suppose that the observed data and unobserved data are denoted by $X$ and $Z$, respectively; $\{X, Z\}$ and $X$ are called the complete data and the incomplete data. In the E (expectation) step, the posterior distribution $p(Z \mid X, \theta^{old})$ of the unobserved data is evaluated with the current parameter values $\theta^{old}$, and the expectation of the complete-data log likelihood is computed for a general parameter value $\theta$. This expectation, denoted by $Q(\theta, \theta^{old})$, is given by

$Q(\theta, \theta^{old}) = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta).$ (2)

The first factor is the posterior distribution of the latent variables $Z$ and the second is the complete-data log likelihood. Following maximum likelihood estimation, the M (maximization) step updates the parameters by maximizing function (2):

$\theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old}).$ (3)

Note that for clustering, $X$ can be considered the observed data while $Z$ is the latent variable whose entry $z_{nk}$ indicates whether the $n$th data point comes from the $k$th component. Also note that the posterior distribution evaluated in the E step and the expected complete-data log likelihood maximized in the M step differ under different distribution assumptions; a Gaussian or any other probability distribution can be adopted, depending on the specific application.
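To make Eqs. (1)-(3) concrete, the following is a minimal numpy sketch of EM for a Gaussian mixture; the function name, the random initialization scheme, and the small covariance regularizer are illustrative choices of ours rather than prescriptions of [7].

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture (Eqs. (1)-(3)); X is N x d."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                       # priors pi_k
    mu = X[rng.choice(N, K, replace=False)]        # means mu_k
    cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iter):
        # E step: responsibilities r[n, k] = p(z_nk = 1 | x_n, theta_old)
        r = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            quad = np.sum(diff @ np.linalg.inv(cov[k]) * diff, axis=1)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov[k]))
            r[:, k] = pi[k] * np.exp(-0.5 * quad) / norm
        r /= r.sum(axis=1, keepdims=True)
        # M step: closed-form maximizer of Q(theta, theta_old)
        nk = r.sum(axis=0)
        pi = nk / N
        mu = (r.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (r[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return r.argmax(axis=1)                        # hard assignments from posteriors
```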

CMMs [8] are simplified mixture models that can probabilistically assign data points to clusters after extracting representative exemplars from the data set. By maximizing the log-likelihood, all instances compete to become the "centers" (representative exemplars) of the clusters. The instances corresponding to the components that receive the highest priors are selected as exemplars, and the remaining instances are then assigned to the "closest" exemplar. The priors of the components are the only adjustable parameters of a CMM.

Given a data set $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{d \times N}$, the CMM distribution is $Q(x) = \sum_{j=1}^{N} q_j f_j(x)$, $x \in \mathbb{R}^d$, where $q_j \ge 0$ denotes the prior probability of the $j$th component and satisfies the constraint $\sum_{j=1}^{N} q_j = 1$, and $f_j(x)$ is an exponential family distribution whose expectation parameter equals the $j$th data point. Due to the bijection between exponential families and Bregman divergences [9], the exponential family is $f_j(x) = C_\phi(x) \exp(-\beta \, d_\phi(x, x_j))$, where $d_\phi$ denotes the Bregman divergence that determines the component distribution, $C_\phi(x)$ is independent of $x_j$, and $\beta$ is a constant controlling the sharpness of the components.

The log-likelihood to be maximized is $L(X; \{q_j\}_{j=1}^{N}) = \frac{1}{N}\sum_{i=1}^{N} \log\big(\sum_{j=1}^{N} q_j f_j(x_i)\big) = \frac{1}{N}\sum_{i=1}^{N} \log\big(\sum_{j=1}^{N} q_j e^{-\beta d_\phi(x_i, x_j)}\big) + \mathrm{const}$. If the empirical samples are equally weighted, i.e., the prior of drawing each example is $\hat{P} = 1/N$, the log-likelihood can be equivalently expressed in terms of the Kullback-Leibler (KL) divergence between $\hat{P}$ and $Q(x)$ as

$D(\hat{P} \parallel Q) = -\sum_{i=1}^{N} \hat{P}(x_i) \log Q(x_i) - H(\hat{P}) = -L(X; \{q_j\}_{j=1}^{N}) + c,$ (4)

where $H(\hat{P})$ is the entropy of the empirical distribution $\hat{P}(x)$, which does not depend on the parameters $q_j$, and $c$ is a constant. The problem now becomes minimizing (4), which is convex and can be solved by an iterative algorithm whose update rule for the prior probabilities is

$q_j^{(t+1)} = q_j^{(t)} \sum_{i=1}^{N} \frac{\hat{P}(x_i) f_j(x_i)}{\sum_{j'=1}^{N} q_{j'}^{(t)} f_{j'}(x_i)}.$ (5)

The data points are grouped into $K$ disjoint clusters by requiring the instances with the $K$ highest $q_j$ values to serve as exemplars and then assigning each remaining instance to the exemplar with which it has the highest posterior probability. Note that the clustering performance is affected by the value of $\beta$. In [8], a reference value $\beta_0 = N^2 \log N / \sum_{i,j=1}^{N} d_\phi(x_i, x_j)$ is determined by an empirical rule to identify a reasonable range of $\beta$ around $\beta_0$. More details can be found in [8].
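For concreteness, here is a hedged numpy sketch of this procedure, with the squared Euclidean distance standing in for the Bregman divergence $d_\phi$ and $\beta$ set to the empirical $\beta_0$ above; the function name and implementation details are ours.

```python
import numpy as np

def cmm_cluster(X, K, n_iter=200):
    """CMM clustering sketch: iterate the prior update (5), then pick the
    K highest-prior points as exemplars. X is N x d; squared Euclidean
    distance stands in for the Bregman divergence d_phi."""
    N = X.shape[0]
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    beta = N ** 2 * np.log(N) / D.sum()        # empirical beta_0 from [8]
    F = np.exp(-beta * D)                      # F[i, j] = f_j(x_i) up to C_phi
    q = np.full(N, 1.0 / N)                    # priors q_j
    P = np.full(N, 1.0 / N)                    # empirical distribution \hat{P}
    for _ in range(n_iter):
        post = q * F / (F @ q)[:, None]        # posterior of component j given x_i
        q = (P[:, None] * post).sum(axis=0)    # Eq. (5)
    exemplars = np.argsort(q)[-K:]             # K highest-prior instances
    labels = exemplars[np.argmax(post[:, exemplars], axis=1)]
    return labels, exemplars
```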

B. MVC Based on Mixture Models or EM Algorithm

The method in [10] assumes that the two views are independent and adopts a multinomial distribution for the document clustering problem. Using the two-view case as an example, it executes the E and M steps in each view and then interchanges the posteriors between the two views in each iteration. The optimization is terminated if the log-likelihood of observing the data does not reach a new maximum for a fixed number of iterations in each view. Based on different criteria and assumptions, two multi-view EM versions for finite mixture models are proposed in [11].

Based on the CMMs for single-view clustering, the multi-view version proposed in [12] is particularly attractive because it can locate the global optimum and thus avoids the initialization and local-optima problems of standard mixture models, which require multiple executions of the EM algorithm.

For multi-view CMMs, each $x_i$ with $m$ views is denoted by $\{x_i^1, x_i^2, \ldots, x_i^m\}$, $x_i^v \in \mathbb{R}^{d_v}$, and the mixture distribution for each view is $Q^v(x^v) = \sum_{j=1}^{N} q_j f_j^v(x^v) = C_\phi(x^v) \sum_{j=1}^{N} q_j e^{-\beta_v d_{\phi_v}(x^v, x_j^v)}$. To pursue a common clustering across all views, all $Q^v(x^v)$ share the same priors. In addition, an empirical data distribution $\hat{P}^v(x^v) = 1/N$, $x^v \in \{x_1^v, x_2^v, \ldots, x_N^v\}$, is associated with each view, and the multi-view algorithm minimizes the sum of KL divergences between $\hat{P}^v$ and $Q^v$ across all views under the constraint $\sum_{j=1}^{N} q_j = 1$. The resulting optimization problem is

$\min_{q_1, \ldots, q_N} \sum_{v=1}^{m} D(\hat{P}^v \parallel Q^v) = \min_{q_1, \ldots, q_N} -\sum_{v=1}^{m} \sum_{i=1}^{N} \hat{P}^v(x_i^v) \log Q^v(x_i^v) - \sum_{v=1}^{m} H(\hat{P}^v).$ (6)

It is straightforward to see that the optimized objective is convex, hence the global minimum can be found. The prior update rule is given as follows:

$q_j^{(t+1)} = \frac{q_j^{(t)}}{m} \sum_{v=1}^{m} \sum_{i=1}^{N} \frac{\hat{P}^v(x_i^v) f_j^v(x_i^v)}{\sum_{j'=1}^{N} q_{j'}^{(t)} f_{j'}^v(x_i^v)}.$ (7)

The prior $q_j$ associated with the $j$th instance measures how likely this instance is to be an exemplar, taking all views into account. Appropriate $\beta_v$ values are identified in the range of an empirically defined $\beta_0^v = N^2 \log N / \sum_{i,j=1}^{N} d_{\phi_v}(x_i^v, x_j^v)$. From Eq. (6), it can be seen that all views contribute equally to the sum, without regard to their different importance. To overcome this limitation, a weighted version of multi-view CMMs was proposed in [13].
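Extending the single-view sketch above, the shared-prior update of Eq. (7) can be sketched as follows; the per-view matrices $F^{(v)}$ are built exactly as in the single-view sketch, and the interface is an illustrative assumption of ours.

```python
import numpy as np

def multiview_cmm(Fs, n_iter=200):
    """Shared-prior update of Eq. (7). Fs is a list of m arrays, each N x N,
    with Fs[v][i, j] = f_j^v(x_i^v), built per view as in the sketch above."""
    m, N = len(Fs), Fs[0].shape[0]
    q = np.full(N, 1.0 / N)
    P = 1.0 / N                                # \hat{P}^v(x_i^v) = 1/N
    for _ in range(n_iter):
        acc = np.zeros(N)
        for F in Fs:                           # accumulate posteriors over views
            acc += (P * q * F / (F @ q)[:, None]).sum(axis=0)
        q = acc / m                            # Eq. (7)
    return q                                   # exemplars: the K largest q_j
```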

1). Summary:

For the aforementioned generative MVC methods, we can see that a linear combination with different weights for different views is a common way to fuse information. Multi-view generative clustering has not attracted enough attention, perhaps because the technique is more difficult than its discriminative counterpart: it is not easy for generative methods to combine views by sharing a common variable or distribution, whereas sharing common variables is the most popular way to combine views in the discriminative paradigm. This may limit the development of multi-view generative clustering to some extent, but researchers are actively seeking ways to combine views in generative methods; for example, it is quite reasonable to share some commonality across the distributions of the data views corresponding to the same cluster. Moreover, generative methods have their advantages. First, they are based on the data distribution, and if the data do follow the assumed distribution, the methods should perform well. Second, some of them, such as [12], can attain the global optimum, which is quite appealing. Third, there is no need to pre-specify the number of clusters. We believe multi-view generative clustering, and even single-view generative clustering, is an underestimated direction, and more effort can be devoted to it in the future.

III. DISCRIMINATIVE APPROACHES

Compared with generative approaches, discriminative approaches directly optimize an objective to seek the best clustering solution, rather than first modeling the sample distribution and then solving the model to determine the clustering result. Directly focusing on the clustering objective has allowed discriminative approaches to gain more attention and develop more comprehensively; up to now, most existing MVC methods are discriminative. Based on how multiple views are combined, we categorize these MVC methods into five main classes and introduce representative works in each group.

Given data with $N$ points and $m$ views, each data point $x_i$ is denoted by $\{x_i^1, x_i^2, \ldots, x_i^m\}$, $x_i^v \in \mathbb{R}^{d_v}$. The aim of MVC is to cluster the $N$ data points into $K$ classes. That is, we finally obtain a membership matrix $H \in \mathbb{R}^{N \times K}$ indicating which data points belong to which group; the entries of each row of $H$ sum to 1 so that each row is a probability distribution. If exactly one entry of each row is 1 and all others are 0, the result is so-called hard clustering; otherwise it is soft clustering. In the following five subsections, we introduce each class of multi-view discriminative clustering methods.

A. Common Eigenvector Matrix (Mainly Multi-View Spectral Clustering)

This class of MVC methods is based on a commonly used clustering technique, spectral clustering. Since spectral clustering hinges crucially on the construction of the graph Laplacian [14], [15] and the resulting eigenvectors reflect the grouping structure of the data, these MVC methods guarantee a common clustering result by assuming that all views share the same or similar eigenvector matrices. There are two representative methods: co-training spectral clustering [16] and co-regularized spectral clustering [17]. Before discussing them, we first introduce spectral clustering [18].

1). Spectral Clustering:

Spectral clustering is a technique that utilizes the properties of the graph Laplacian, where the graph edges denote the similarities between data points, and solves a relaxation of the normalized min-cut problem on the graph [19]. Compared with widely used methods such as the k-means algorithm, which only fits spherically shaped clusters, spectral clustering applies to arbitrarily shaped clusters and demonstrates good performance.

Let $G = (V, E)$ be a weighted undirected graph with vertex set $V = \{v_1, \ldots, v_N\}$. The adjacency matrix of the graph is $W$, whose entry $w_{ij}$ represents the similarity of the two vertices $v_i$ and $v_j$; if $w_{ij} = 0$, the vertices $v_i$ and $v_j$ are not connected. Apparently $W$ is symmetric since $G$ is undirected. The degree matrix $D$ is the diagonal matrix with the degrees $d_1, \ldots, d_N$ of the vertices on the diagonal, where $d_i = \sum_{j=1}^{N} w_{ij}$. Generally, the graph Laplacian is $D - W$ and the normalized graph Laplacian is $\tilde{L} = D^{-1/2}(D - W)D^{-1/2}$. In many spectral clustering works, e.g., [18], [16], [17], [20], $L = D^{-1/2} W D^{-1/2}$ is also used, which changes the minimization problem (9) into the maximization problem (8) since $L = I - \tilde{L}$, where $I$ is the identity matrix. Following the terminology adopted in [18], [16], [17], [20], we refer to both $L$ and $\tilde{L}$ as normalized graph Laplacians hereafter. The single-view spectral clustering approach can be formulated as follows:

$\max_{U \in \mathbb{R}^{N \times K}} \operatorname{tr}(U^T L U) \quad \text{s.t.} \quad U^T U = I,$ (8)

which is also equivalent to the following problem:

$\min_{U \in \mathbb{R}^{N \times K}} \operatorname{tr}(U^T \tilde{L} U) \quad \text{s.t.} \quad U^T U = I,$ (9)

where $\operatorname{tr}$ denotes the trace of a matrix. The rows of matrix $U$ are the embeddings of the data points, which can be fed into k-means to obtain the final clustering result. A version of the Rayleigh-Ritz theorem [21] shows that the solution of the above optimization problem is given by choosing $U$ as the matrix whose columns are, respectively, the largest $K$ eigenvectors of $L$ or the smallest $K$ eigenvectors of $\tilde{L}$. To understand the spectral clustering method better, we outline a commonly used algorithm [18] as follows:

  • Construct the adjacency matrix W.

  • Compute the normalized Laplacian matrix $L = D^{-1/2} W D^{-1/2}$.

  • Calculate the eigenvectors of $L$ and stack the top $K$ eigenvectors as columns to construct an $N \times K$ matrix $U$.

  • Normalize each row of U to obtain Usym.

  • Run the k-means algorithm to cluster the row vectors of Usym.

  • Assign subject i to cluster k if the ith row of Usym is assigned to cluster k by the k-means algorithm.

Apart from the symmetric normalization that yields $U_{sym}$, the random-walk normalization based on $D^{-1} W$ is also commonly used. Refer to [22] for more details about spectral clustering.
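As an illustration, the listed steps can be written in a few lines of numpy and scikit-learn; the Gaussian similarity and all names here are illustrative choices of ours rather than prescriptions of [18].

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, K, sigma=1.0):
    """Normalized spectral clustering following the steps listed above."""
    # 1. adjacency matrix W from a Gaussian similarity (an illustrative choice)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # 2. normalized Laplacian L = D^{-1/2} W D^{-1/2}
    d = 1.0 / np.sqrt(W.sum(axis=1))
    L = d[:, None] * W * d[None, :]
    # 3. stack the top-K eigenvectors of L as columns of U
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    U = vecs[:, -K:]
    # 4-6. row-normalize and run k-means on the rows
    U_sym = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=K, n_init=10).fit_predict(U_sym)
```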

2). Co-Training Multi-View Spectral Clustering:

For semi-supervised learning, co-training with two views has been a widely recognized idea when both labeled and unlabeled data are available. It assumes that the predictive models constructed in each of the two views lead to the same labels for the same sample with high probability. Two main assumptions guarantee the success of co-training: (1) sufficiency: each view is sufficient for sample classification on its own; (2) conditional independence: the views are conditionally independent given the class labels. In the original co-training algorithm [23], two initial predictive functions $f_1$ and $f_2$ are trained on the labeled data, one in each view; then the following steps are repeated: the most confident examples predicted by $f_1$ are added to the labeled set used to train $f_2$ and vice versa, and $f_1$ and $f_2$ are re-trained on the enlarged labeled sets. After a number of iterations, $f_1$ and $f_2$ will agree with each other on labels.

For co-training multi-view spectral clustering, the motivation is similar: the clustering results in all views should agree. In spectral clustering, the eigenvectors of the graph Laplacian encode the discriminative information for clustering. Therefore, co-training multi-view spectral clustering [16] uses the eigenvectors of the graph Laplacian in one view to cluster samples and then uses the clustering result to modify the graph Laplacian in the other view.

Each column of the similarity matrix (also called the adjacency matrix) $W \in \mathbb{R}^{N \times N}$ can be considered an $N$-dimensional vector that indicates the similarities of the $i$th point to all points in the graph. Since the largest $K$ eigenvectors carry the discriminative information for clustering, the similarity vectors can be projected along those directions to retain this information and discard the within-cluster details that might confuse the clustering. The projected information is then projected back to the original $N$-dimensional space to obtain the modified graph. Finally, the k-means algorithm is run on the most informative eigenvector matrix to obtain the final clustering result.

To make the co-training spectral clustering algorithm clear, we reproduce Algorithm 1 from [16]. Note that the symmetrization operator on a matrix $S$ is defined as $\operatorname{sym}(S) = (S + S^T)/2$ in Algorithm 1.

Algorithm 1.

Co-training Multi-View Spectral Clustering

Input: Similarity matrices for two views: $W^{(1)}$ and $W^{(2)}$.
Output: Assignments to $K$ clusters.
Initialize: $L^{(v)} = D^{(v)-1/2} W^{(v)} D^{(v)-1/2}$ for $v = 1, 2$; $U^{(v)}_0 = \arg\max_{U \in \mathbb{R}^{N \times K}} \operatorname{tr}(U^T L^{(v)} U)$ s.t. $U^T U = I$, for $v = 1, 2$.
for $i = 1$ to $t$ do
1. $S^{(1)} = \operatorname{sym}(U^{(2)}_{i-1} U^{(2)T}_{i-1} W^{(1)})$
2. $S^{(2)} = \operatorname{sym}(U^{(1)}_{i-1} U^{(1)T}_{i-1} W^{(2)})$
3. Use $S^{(1)}$ and $S^{(2)}$ as the new graph similarities and compute the graph Laplacians. Solve for the largest $K$ eigenvectors to obtain $U^{(1)}_i$ and $U^{(2)}_i$.
end for
4. Normalize each row of $U^{(1)}_i$ and $U^{(2)}_i$.
5. Form matrix $V = U^{(v)}_i$, where $v$ is the most informative view known a priori. If there is no prior knowledge of view informativeness, $V$ can also be set to the column-wise concatenation of the two $U^{(v)}_i$'s.
6. Assign example $j$ to cluster $k$ if the $j$th row of $V$ is assigned to cluster $k$ by the k-means algorithm.
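A hedged sketch of Algorithm 1 follows, reusing the spectral steps above; the helper names are ours, and in practice the modified similarities $S^{(v)}$ may need thresholding to remain valid similarity matrices (we take absolute degrees as a simple safeguard).

```python
import numpy as np
from sklearn.cluster import KMeans

def top_k_eigvecs(L, K):
    """Largest-K eigenvectors of a symmetric matrix, as columns."""
    _, vecs = np.linalg.eigh(L)
    return vecs[:, -K:]

def normalized_laplacian(W):
    d = 1.0 / np.sqrt(np.abs(W).sum(axis=1) + 1e-12)
    return d[:, None] * W * d[None, :]

def co_train_spectral(W1, W2, K, t=10):
    """One possible rendering of Algorithm 1 for two views."""
    sym = lambda S: (S + S.T) / 2
    U1 = top_k_eigvecs(normalized_laplacian(W1), K)
    U2 = top_k_eigvecs(normalized_laplacian(W2), K)
    for _ in range(t):
        S1 = sym(U2 @ U2.T @ W1)   # view 2 guides the graph of view 1
        S2 = sym(U1 @ U1.T @ W2)   # view 1 guides the graph of view 2
        U1 = top_k_eigvecs(normalized_laplacian(S1), K)
        U2 = top_k_eigvecs(normalized_laplacian(S2), K)
    V = np.hstack([U1, U2])        # step 5: no prior view knowledge, concatenate
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=K, n_init=10).fit_predict(V)
```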

3). Co-Regularized Multi-View Spectral Clustering:

Co-regularization is an effective technique in semi-supervised multi-view learning. Its core idea is to make the disagreement between the predictor functions of the two views part of the objective function to be minimized. However, there are no predictor functions in unsupervised learning such as clustering, so how can the co-regularization idea be implemented in a clustering problem? Co-regularized multi-view spectral clustering [17] lets the eigenvectors of the graph Laplacian play a role similar to that of the predictor functions in the semi-supervised scenario and proposes two co-regularized clustering approaches.

Let $U^{(s)}$ and $U^{(t)}$ be the eigenvector matrices corresponding to any pair of view graph Laplacians $L^{(s)}$ and $L^{(t)}$ ($1 \le s, t \le m$, $s \ne t$). The first version uses a pairwise co-regularization criterion that enforces $U^{(s)}$ and $U^{(t)}$ to be as close as possible. The clustering disagreement between views $s$ and $t$ is measured by $D(U^{(s)}, U^{(t)}) = \left\| \frac{K^{(s)}}{\|K^{(s)}\|_F^2} - \frac{K^{(t)}}{\|K^{(t)}\|_F^2} \right\|_F^2$, where $K^{(s)} = U^{(s)} U^{(s)T}$ is the similarity matrix of $U^{(s)}$ under the linear kernel. Since $\|K^{(s)}\|_F^2 = K$, where $K$ is the number of clusters, the disagreement between the clustering solutions of the two views reduces, up to constants, to $D(U^{(s)}, U^{(t)}) = -\operatorname{tr}(U^{(s)} U^{(s)T} U^{(t)} U^{(t)T})$. Integrating this pairwise disagreement measure into the spectral clustering objective, pairwise co-regularized multi-view spectral clustering can be formulated as the following optimization problem:

$\max_{U^{(1)}, U^{(2)}, \ldots, U^{(m)} \in \mathbb{R}^{N \times K}} \sum_{s=1}^{m} \operatorname{tr}(U^{(s)T} L^{(s)} U^{(s)}) + \lambda \sum_{1 \le s, t \le m, \, s \ne t} \operatorname{tr}(U^{(s)} U^{(s)T} U^{(t)} U^{(t)T}) \quad \text{s.t.} \quad U^{(s)T} U^{(s)} = I, \; 1 \le s \le m.$ (10)

The hyperparameter $\lambda$ trades off the spectral clustering objectives against the spectral embedding disagreement terms. After the embeddings are obtained, any $U^{(s)}$ can be fed to the k-means clustering method; the final results across views are only marginally different.
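In [17], problem (10) is solved by alternating maximization: with the other views fixed, the subproblem for $U^{(s)}$ is a standard spectral problem with the modified Laplacian $L^{(s)} + \lambda \sum_{t \ne s} U^{(t)} U^{(t)T}$. A minimal sketch (names ours):

```python
import numpy as np

def pairwise_coreg_spectral(Ls, K, lam=0.01, n_iter=10):
    """Alternating maximization of Eq. (10): with the other views fixed,
    U^(s) solves a spectral problem with a modified Laplacian."""
    Us = []
    for L in Ls:                                # initialize from each view alone
        _, vecs = np.linalg.eigh(L)
        Us.append(vecs[:, -K:])
    for _ in range(n_iter):
        for s in range(len(Ls)):
            M = Ls[s] + lam * sum(Us[t] @ Us[t].T
                                  for t in range(len(Ls)) if t != s)
            _, vecs = np.linalg.eigh(M)
            Us[s] = vecs[:, -K:]
    return Us    # row-normalize any U^(s) and feed it to k-means
```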

The second version, named centroid-based co-regularization, enforces the eigenvector matrices of all views to be similar by regularizing them towards a common consensus eigenvector matrix $U^{(*)}$. The corresponding optimization problem is formulated as

$\max_{U^{(1)}, U^{(2)}, \ldots, U^{(m)}, U^{(*)} \in \mathbb{R}^{N \times K}} \sum_{s=1}^{m} \operatorname{tr}(U^{(s)T} L^{(s)} U^{(s)}) + \sum_{s=1}^{m} \lambda_s \operatorname{tr}(U^{(s)} U^{(s)T} U^{(*)} U^{(*)T}) \quad \text{s.t.} \quad U^{(s)T} U^{(s)} = I, \; 1 \le s \le m, \; U^{(*)T} U^{(*)} = I.$ (11)

Compared with the pairwise co-regularized version, centroid-based MVC does not need to combine the eigenvector matrices of all views before running k-means. However, the centroid-based version has a potential drawback: noisy views could adversely affect the optimal consensus eigenvectors, since they depend on all views.

Cai et al. [24] used a common indicator matrix across the views to perform multi-view spectral clustering and derived a formulation similar to the centroid-based co-regularization method. The main difference is that [24] used $\operatorname{tr}((U^{(*)} - U^{(s)})^T (U^{(*)} - U^{(s)}))$ as the disagreement measure between each view's eigenvector matrix and the common one, while co-regularized multi-view spectral clustering [17] adopted the agreement measure $\operatorname{tr}(U^{(s)} U^{(s)T} U^{(*)} U^{(*)T})$. The optimization problem of [24] is formulated as

$\max_{U^{(s)}, s = 1, 2, \ldots, m, \; U^{(*)}} \sum_{s=1}^{m} \operatorname{tr}(U^{(s)T} L^{(s)} U^{(s)}) - \lambda \sum_{s=1}^{m} \operatorname{tr}((U^{(*)} - U^{(s)})^T (U^{(*)} - U^{(s)})) \quad \text{s.t.} \quad U^{(*)} \ge 0, \; U^{(*)T} U^{(*)} = I,$ (12)

where $U^{(*)} \ge 0$ makes $U^{(*)}$ the final cluster indicator matrix. Different from general spectral clustering, which obtains the eigenvector matrix first and then runs a clustering algorithm (such as k-means, which is sensitive to initialization) to assign clusters, the method of Cai et al. [24] solves directly for the final cluster indicator matrix and is thus more robust to the initial conditions.

4). Others:

Besides the two representative multi-view spectral clustering methods discussed above, Wang et al. [25] enforce a common eigenvector matrix across the views and formulate a multi-objective problem, which is then solved using Pareto optimization.

Since a relaxed kernel k-means can be shown to be equivalent to spectral clustering (see Subsection III-D2), Ye et al. [26] propose a co-regularized kernel k-means for MVC. With a multi-layer Grassmann manifold interpretation, Dong et al. [27] obtain the same formulation as pairwise co-regularized multi-view spectral clustering.

Because the MVC methods based on a shared eigenvector matrix are rooted in spectral clustering, they can be applied to clusters of any shape and any positioning of cluster centers. This merit is inherited from spectral clustering, which makes no assumption about the statistics of the clusters. However, since spectral clustering requires an eigendecomposition, this type of MVC method can be time consuming.

B. Common Coefficient Matrix (Mainly Multi-View Subspace Clustering)

In many practical applications, even though the given data are high dimensional, the intrinsic dimension of the problem is often low. For example, the number of pixels in a given image can be large, yet only a few parameters are needed to describe the appearance, geometry, and dynamics of the scene. This motivates finding the underlying low-dimensional subspace. In practice, the data may be sampled from multiple subspaces. Subspace clustering [28] is the technique that finds the underlying subspaces and then clusters the data points according to the identified subspaces.

1). Subspace clustering:

Subspace clustering uses the self-expressiveness property [29] of data samples, i.e., each sample can be represented by a linear combination of a few other data samples. The classic subspace clustering formulation is given as follows:

$X = XZ + E,$ (13)

where $Z = [z_1, z_2, \ldots, z_N] \in \mathbb{R}^{N \times N}$ is the subspace coefficient matrix (representation matrix), each $z_i$ being the representation of the original data point $x_i$ in terms of the subspace, and $E \in \mathbb{R}^{d \times N}$ is the noise matrix.

The subspace clustering can be formulated as the following optimization problem:

$\min_{Z} \|X - XZ\|_F^2 \quad \text{s.t.} \quad \operatorname{diag}(Z) = 0, \; Z^T \mathbf{1} = \mathbf{1}.$ (14)

The constraint $\operatorname{diag}(Z) = 0$ avoids the case in which a data point is represented by itself, while $Z^T \mathbf{1} = \mathbf{1}$ states that the data points lie in a union of affine subspaces. The nonzero elements of $z_i$ correspond to data points from the same subspace.

After obtaining the subspace representation $Z$, the similarity matrix $W = \frac{|Z| + |Z^T|}{2}$ can be computed to construct the graph Laplacian, and spectral clustering is then run on that graph Laplacian to obtain the final clustering result.
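The two-step pipeline can be sketched as follows; as a simplification of ours, a ridge-regularized least-squares problem stands in for problem (14), with the affine constraint omitted and the diagonal zeroed post hoc, so this is an illustration rather than the exact solver.

```python
import numpy as np
from sklearn.cluster import KMeans

def subspace_cluster(X, K, lam=0.1):
    """Self-expressive subspace clustering sketch. X is d x N."""
    G = X.T @ X
    N = G.shape[0]
    # z_i = argmin ||x_i - X z||^2 + lam ||z||^2, in closed form for all i
    Z = np.linalg.solve(G + lam * np.eye(N), G)
    np.fill_diagonal(Z, 0.0)                    # simplification: zero diagonal post hoc
    W = (np.abs(Z) + np.abs(Z.T)) / 2           # similarity from the representation
    # spectral clustering on the graph built from W (cf. Sec. III-A)
    d = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    L = d[:, None] * W * d[None, :]
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, -K:]
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)
```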

2). Multi-View Subspace Clustering:

With multi-view information, a subspace representation $Z^{(v)}$ can be obtained from each view. To get a consistent clustering result from multiple views, Yin et al. [30] encourage a common coefficient matrix by enforcing the coefficient matrices from each pair of views to be as similar as possible. The optimization problem is formulated as

$\min_{Z^{(s)}, s=1,2,\ldots,m} \sum_{s=1}^{m} \|X^{(s)} - X^{(s)} Z^{(s)}\|_F^2 + \alpha \sum_{s=1}^{m} \|Z^{(s)}\|_1 + \beta \sum_{1 \le s < t \le m} \|Z^{(s)} - Z^{(t)}\|_1 \quad \text{s.t.} \quad \operatorname{diag}(Z^{(s)}) = 0, \; s \in \{1, 2, \ldots, m\},$ (15)

where $\|Z^{(s)} - Z^{(t)}\|_1$ is the $\ell_1$-norm based pairwise co-regularization term, which can alleviate the noise problem, and $\|Z\|_1$ enforces a sparse solution. $\operatorname{diag}(Z)$ denotes the diagonal elements of matrix $Z$, and the zero constraint avoids the trivial solution in which each data point represents itself.

The method in [31] also considered the low-rank and sparse representation to conduct multi-view subspace clustering. Wang et al. [32] enforced a similar idea to combine multi-view information; in addition, they adopted a multi-graph regularization in which each graph Laplacian regularizer characterizes the view-dependent nonlinear local data similarity, assumed the view-dependent representations to be low rank and sparse, and accounted for sparse noise in the data. Wang et al. [33] proposed an angle-based similarity to measure the correlation consensus across multiple views and obtained a robust subspace clustering method for multi-view data. Zhang et al. [34] adopted linear correlations and neural networks to integrate the representations of each view and proposed two latent-subspace MVC methods. To deal with the scenario where each single view is insufficient to discover the latent cluster structure, Huang et al. [35] proposed multi-view intact subspace clustering, assuming a latent space and defining a mapping from the latent space to each view representation. Different from the above approaches, the works [36], [37], [38] adopted a general nonnegative matrix factorization formulation that shares a common sample representation matrix across views while keeping the other factor matrices view-specific. Zhao et al. [39] adopted a deep semi-nonnegative matrix factorization to perform MVC and enforced a common coefficient matrix in the last layer to exploit the multi-view information. By introducing a label-constraint matrix and enforcing the representation matrix of each view to be close to a common one, Cai et al. [40] solved MVC in the semi-supervised setting.

The MVC methods based on a shared coefficient matrix perform multi-view subspace clustering, which assumes that the cluster structure can be found by identifying low-dimensional subspaces. This kind of method has great utility in the computer vision field. Typically, after the final low-dimensional representation is obtained, spectral clustering is conducted on the graph Laplacian constructed from that representation, so this group of methods shares the advantages and disadvantages of spectral clustering discussed in Sec. III-A.

C. Common Indicator Matrix (Mainly Multi-View Nonnegative Matrix Factorization Clustering)

Nonnegative matrix factorization is commonly used for clustering. It constrains one of the factor matrices to act as an indicator matrix whose entries indicate which data point belongs to which cluster. Therefore, enforcing the indicator matrices of multiple views to be the same or similar is a natural way to conduct MVC.

1). Nonnegative Matrix Factorization (NMF):

For a nonnegative data matrix $X \in \mathbb{R}_+^{d \times N}$, nonnegative matrix factorization [41] seeks two nonnegative matrix factors $U \in \mathbb{R}_+^{d \times K}$ and $V \in \mathbb{R}_+^{N \times K}$ such that their product is a good approximation of $X$:

$X \approx U V^T,$ (16)

where $K$ denotes the desired reduced dimension (for clustering, the number of clusters), $U$ is the basis matrix, and $V$ is the indicator matrix.

Due to the nonnegativity constraints, a widely known property of NMF is that it can learn a parts-based representation, which is intuitive and meaningful in many applications such as face recognition [41]. The samples in many of these applications, e.g., information retrieval [41] and pattern recognition [42], [43], can be explained as additive combinations of nonnegative basis vectors. NMF has been applied successfully to cluster analysis and has shown state-of-the-art performance [41], [44].
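For concreteness, here is a minimal sketch of NMF-based clustering using the classical multiplicative updates for $\|X - UV^T\|_F^2$ in the style of [41]; the iteration count, seed, and small stabilizer are illustrative choices of ours.

```python
import numpy as np

def nmf_cluster(X, K, n_iter=300, eps=1e-9, seed=0):
    """NMF clustering sketch: multiplicative updates for ||X - U V^T||_F^2;
    the rows of V act as soft cluster indicators. X is d x N with X >= 0."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    U = rng.random((d, K))
    V = rng.random((N, K))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)    # update basis matrix
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)  # update indicator matrix
    return V.argmax(axis=1)                     # hard assignment per sample
```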

2). MVC based on NMF:

To combine multi-view information in the NMF framework, Akata et al. [45] enforce a common indicator matrix in the NMF across different views to perform MVC. However, the view-wise indicator matrices $V^{(v)}$ might not be comparable at the same scale. In order to keep the clustering solutions across different views meaningful and comparable, Liu et al. [46] add a constraint that pushes each view-dependent indicator matrix towards a common indicator matrix, together with a normalization constraint inspired by the connection between NMF and probabilistic latent semantic analysis. The final optimization problem is formulated as:

$\min_{U^{(v)}, V^{(v)}, v=1,2,\ldots,m, \; V^{(*)}} \sum_{v=1}^{m} \|X^{(v)} - U^{(v)} V^{(v)T}\|_F^2 + \sum_{v=1}^{m} \lambda_v \|V^{(v)} - V^{(*)}\|_F^2 \quad \text{s.t.} \quad \|U^{(v)}_{\cdot,k}\|_1 = 1, \; 1 \le k \le K, \; U^{(v)}, V^{(v)}, V^{(*)} \ge 0.$ (17)

The constraint $\|U^{(v)}_{\cdot,k}\|_1 = 1$ guarantees that the $V^{(v)}$ lie within the same range for different $v$, so that the comparison between each view-dependent indicator matrix $V^{(v)}$ and the consensus indicator matrix $V^{(*)}$ is meaningful. After obtaining the consensus matrix $V^{(*)}$, the cluster label of data point $i$ is computed as $\arg\max_k V^{(*)}_{i,k}$.

3). Multi-View K-Means:

The k-means clustering method can be formulated as an NMF problem by introducing an indicator matrix $U$. The NMF formulation of k-means clustering is

$\min_{U, V} \|X^T - U V^T\|_F^2 \quad \text{s.t.} \quad U_{i,k} \in \{0, 1\}, \; \sum_{k=1}^{K} U_{i,k} = 1, \; i = 1, 2, \ldots, N,$ (18)

where the columns of $V \in \mathbb{R}^{d \times K}$ give the cluster centroids.

Because the k-means algorithm has a lower computational cost than methods requiring an eigendecomposition, it is a good choice for large-scale data clustering. To deal with large-scale multi-view data, Cai et al. [47] proposed a multi-view k-means clustering method that adopts a common indicator matrix across different views. The $\ell_{2,1}$ norm has been applied in traditional NMF-based clustering methods with proven benefits such as model sparsity and robustness; here the Frobenius norm is replaced by the $\ell_{2,1}$ norm, and the views are weighted according to their importance. The new optimization problem obtained from (18) is formulated as follows:

$\min_{V^{(v)}, \alpha^{(v)}, U} \sum_{v=1}^{m} (\alpha^{(v)})^{\gamma} \|X^{(v)T} - U V^{(v)T}\|_{2,1} \quad \text{s.t.} \quad U_{i,k} \in \{0, 1\}, \; \sum_{k=1}^{K} U_{i,k} = 1, \; \sum_{v=1}^{m} \alpha^{(v)} = 1,$ (19)

where $\alpha^{(v)}$ is the weight of the $v$th view and $\gamma$ is a parameter controlling the weight distribution. By learning the weights $\alpha^{(v)}$ for the different views, the important views are emphasized.

Still based on the multi-view k-means formulation (18), to deal with high-dimensional problems in multiple views, Xu et al. [48] introduced a projection matrix for each view's data and then conducted MVC by enforcing a common indicator matrix. Their optimization problem is formulated as

$\min_{V^{(v)}, W^{(v)}, U} \sum_{v=1}^{m} \|X^{(v)T} W^{(v)} - U V^{(v)T}\|_F \quad \text{s.t.} \quad W^{(v)T} W^{(v)} = I, \; U_{i,k} \in \{0, 1\}, \; \sum_{k=1}^{K} U_{i,k} = 1,$ (20)

where $W^{(v)} \in \mathbb{R}^{D_v \times m_v}$ is the projection matrix that embeds the data matrix $X^{(v)}$ from dimension $D_v$ to $m_v$, $m_v < D_v$, $\forall v$. Note that the (unsquared) Frobenius norm is adopted to deal with outliers. By replacing the Frobenius norm with an $\ell_2$ norm and considering the different importance of each view, a re-weighted discriminative embedding k-means method is formulated as

$\min_{V^{(v)}, W^{(v)}, \alpha^{(v)}, U} \sum_{v=1}^{m} \alpha^{(v)} \|X^{(v)T} W^{(v)} - U V^{(v)T}\|_2 \quad \text{s.t.} \quad W^{(v)T} W^{(v)} = I, \; U_{i,k} \in \{0, 1\}, \; \sum_{k=1}^{K} U_{i,k} = 1,$ (21)

where $\alpha^{(v)} = (2\|X^{(v)T} W^{(v)} - U V^{(v)T}\|_F)^{-1}$ is the weight of the $v$th view, computed from the current $V^{(v)}$, $W^{(v)}$, and $U$.

Besides the above multi-view nonnegative matrix factorization clustering methods, Liu and Fu [49] introduced a categorical utility function to measure the similarity between the common indicator matrix and each view's indicator matrix and proposed a consensus-based MVC method.

According to [50], when $W = H H^T$, where $W$ indicates the similarity between data points or is a kernel matrix, the above method is equivalent to spectral clustering or kernel k-means clustering. Although the single-view methods (NMF, kernel k-means, and spectral clustering) are connected to one another, their multi-view versions are less connected, because the views need to share some common factor, yet there is only one factor $H$, which cannot serve multiple roles across the views. The multi-view k-means clustering method, however, can be expressed as a multi-view NMF-based clustering problem with $U$ as the indicator matrix, according to formulation (18).

4). Others:

As mentioned earlier, subspace clustering generally has two steps: finding a subspace representation and then running spectral clustering on the graph Laplacian computed from that representation. To identify consistent clusters across different views, Gao et al. [51] merged these two steps and enforced a common indicator matrix across the views. The formulation is given as follows:

$\min_{Z^{(v)}, E^{(v)}, U} \sum_{v=1}^{m} \|X^{(v)} - X^{(v)} Z^{(v)} - E^{(v)}\|_F^2 + \lambda_1 \sum_{v=1}^{m} \operatorname{tr}(U^T (D^{(v)} - W^{(v)}) U) + \lambda_2 \sum_{v=1}^{m} \|E^{(v)}\|_1 \quad \text{s.t.} \quad Z^{(v)T} \mathbf{1} = \mathbf{1}, \; Z^{(v)}(i,i) = 0, \; U^T U = I,$ (22)

where $Z^{(v)}$ is the subspace representation matrix of the $v$th view, $W^{(v)} = \frac{|Z^{(v)}| + |Z^{(v)T}|}{2}$, $D^{(v)}$ is a diagonal matrix with diagonal elements $d^{(v)}_{i,i} = \sum_j w^{(v)}_{i,j}$, and $U$ is the common indicator matrix, which encodes a unique cluster assignment for all views. Although this multi-view subspace clustering method is based on subspace clustering, it does not enforce a common coefficient matrix $Z$ but rather a common indicator matrix across the views; we therefore categorize it into this group.

Wang et al. [52] integrate multi-view information via a common indicator matrix and simultaneously select features for different data clusters by formulating the following problem:

$\min_{U^T U = I, \, W, \, b} \|X^T W + \mathbf{1}_N b^T - U\|_F + \gamma_1 \|W\|_{G_1} + \gamma_2 \|W\|_{2,1},$ (23)

where $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{d \times N}$, but here each $x_i$ includes the features from all $m$ views, each view having $d_j$ features such that $d = \sum_{j=1}^{m} d_j$. The coefficient matrix $W = [w_1^1, \ldots, w_K^1; \ldots; w_1^m, \ldots, w_K^m] \in \mathbb{R}^{d \times K}$ contains the weights of each feature for the $K$ clusters, $b \in \mathbb{R}^{K \times 1}$ is the intercept vector, $\mathbf{1}_N$ is the $N$-element constant vector of ones, and $U = [u_1, \ldots, u_N]^T \in \mathbb{R}^{N \times K}$ is the cluster (assignment) indicator matrix. The regularizer $\|W\|_{G_1} = \sum_{i=1}^{K} \sum_{j=1}^{m} \|w_i^j\|_2$ is the group $\ell_1$ regularization that evaluates the importance of an entire view's features as a whole for a cluster, whereas $\|W\|_{2,1} = \sum_{i=1}^{d} \|w^i\|_2$ is the $\ell_{2,1}$ norm that selects individual features from all views that are important for all clusters.

In [53], a matrix factorization approach was adopted to reconcile the clusters arising from the individual views. Specifically, a matrix containing the partition indicators of the individual views is created and then decomposed into two matrices: one showing the contribution of the individual groupings to the final MVC, called meta-clusters, and the other showing the assignment of instances to the meta-clusters. Tang et al. [54] treated MVC as clustering with multiple graphs, each of which is approximated by a matrix factorization with two factors: a graph-specific factor and a factor common to all graphs. Qian et al. [55] and Zong et al. [56] required each view's indicator matrix to be as close as possible to a common indicator matrix and employed Laplacian regularization to preserve the latent geometric structure of each view. After learning the indicator matrices of the different views, Kang et al. [57] learned a common indicator matrix by measuring the distances between indicator matrices while accounting for the different impact of each view. Also by learning an indicator matrix and maximizing the worst-case performance gain over the single-view case, Tao et al. proposed a reliable MVC method [58]. Zhang et al. [59] proposed a robust manifold matrix factorization to cluster hyperspectral images; taking the discriminative information in low-dimensional spaces into account, Ma et al. [60] extended [59] to multi-view clustering by enforcing a common indicator matrix.

Besides using a common indicator matrix, the works [61], [62], [63] introduced a weight matrix that flags missing entries, so as to tackle the missing-value problem. The multi-view self-paced clustering method [64] takes the complexities of the samples and views into consideration to alleviate the local-minima problem. Tao et al. [65] enforce a common indicator matrix and seek the consensus clustering among all views in an ensemble manner. Another method that utilizes a common indicator matrix to combine multiple views [66] employs the linear discriminant analysis idea and automatically weighs the views. For graph-based clustering, a similarity matrix is obtained for each view, and by minimizing the differences between a common indicator matrix and each similarity matrix, Nie et al. [67] provided an MVC method with multiple graphs.

The MVC methods that use a shared indicator matrix across views include the k-means and nonnegative matrix factorization based methods. On one hand, they scale to large datasets better than spectral clustering based MVC approaches. On the other hand, they can only be applied to clusters that are spherical about their centers, because k-means makes the strong assumption that the data points assigned to a cluster are spherically distributed about the cluster center.

D. Direct Combination (Mainly Multi-Kernel Based MVC)

Besides the methods that share some structure among different views, direct view combination via kernels is another common approach to MVC. A natural way is to define a kernel for each view and then combine these kernels in a convex combination [68], [69], [70].

1). Kernel Functions and Kernel Combination Methods:

The kernel trick makes it possible to learn a nonlinear problem with a linear learning algorithm, since a kernel function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ directly gives the inner products in a feature space without explicitly defining the nonlinear transformation $\varphi$. Some common kernel functions are listed below (a short sketch follows the list):

  • Linear kernel: $K(x_i, x_j) = x_i^T x_j$,

  • Polynomial kernel: $K(x_i, x_j) = (x_i^T x_j + 1)^d$,

  • Gaussian kernel (radial basis kernel): $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$,

  • Sigmoid kernel: $K(x_i, x_j) = \tanh(\eta \, x_i^T x_j + \nu)$.
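These four kernels can be written in a few lines of numpy (rows of X and Y are samples); the function names are ours.

```python
import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def polynomial_kernel(X, Y, degree=3):
    return (X @ Y.T + 1.0) ** degree

def gaussian_kernel(X, Y, sigma=1.0):
    D2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-D2 / (2 * sigma ** 2))

def sigmoid_kernel(X, Y, eta=1.0, nu=0.0):
    return np.tanh(eta * (X @ Y.T) + nu)
```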

Kernel functions in a reproducing kernel Hilbert space (RKHS) can be viewed as similarity functions [71], [72] in a vector space, so a kernel can serve as a non-Euclidean similarity measure in the spectral clustering and kernel k-means methods. There have been works on multi-kernel learning for clustering [73], [74], [75], [76]; however, they are all for single-view clustering. If a kernel is derived from each view and the different kernels are combined elaborately for the clustering problem, the result is a multi-kernel learning method for MVC. Multi-kernel learning [77], [78], [79], [80] can thus be considered the most important ingredient of this kind of MVC method. There are three main categories of methods for combining multiple kernels [81]:

  • Linear combination: it includes two basic subcategories, the unweighted sum $K(x_i, x_j) = \sum_{v=1}^{m} k_v(x_i^v, x_j^v)$ and the weighted sum $K(x_i, x_j) = \sum_{v=1}^{m} w_v^q k_v(x_i^v, x_j^v)$, where $w_v \in \mathbb{R}_+$ denotes the kernel weight of the $v$th view with $\sum_{v=1}^{m} w_v = 1$, and $q$ is a hyperparameter controlling the distribution of the weights,

  • Nonlinear combination: It uses a nonlinear function in terms of kernels, namely, multiplication, power, and exponentiation,

  • Data-dependent combination: It assigns specific kernel weights for each data instance, which can identify the local distributions in the data and learn proper kernel combination rules for different regions.

2). Kernel K-Means and Spectral Clustering:

Kernel k-means [82] and spectral clustering [83] are two kernel-based clustering methods that optimize the intra-cluster variance. Let $\phi(\cdot): x \in \mathcal{X} \to \mathcal{H}$ be a feature map from $x$ onto an RKHS $\mathcal{H}$. The kernel k-means method is formulated as the following optimization problem:

$\min_{H} \sum_{i=1}^{N} \sum_{k=1}^{K} H_{ik} \|\phi(x_i) - \mu_k\|_2^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} H_{ik} = 1,$ (24)

where $H \in \{0, 1\}^{N \times K}$ is the cluster indicator matrix (also known as the cluster assignment matrix), and $n_k = \sum_{i=1}^{N} H_{ik}$ and $\mu_k = \frac{1}{n_k} \sum_{i=1}^{N} H_{ik} \phi(x_i)$ are the number of points in the $k$th cluster and its centroid, respectively. With a kernel matrix $K$ whose $(i, j)$th entry is $K_{ij} = \phi(x_i)^T \phi(x_j)$, $L = \operatorname{diag}([n_1^{-1}, n_2^{-1}, \ldots, n_K^{-1}])$, and $\mathbf{1}_l \in \mathbb{R}^l$ a column vector of all ones, Eq. (24) can be equivalently rewritten in the following matrix-vector form:

$\min_{H} \operatorname{tr}(K) - \operatorname{tr}(L^{\frac{1}{2}} H^T K H L^{\frac{1}{2}}) \quad \text{s.t.} \quad H \mathbf{1}_K = \mathbf{1}_N.$ (25)

In the above matrix-vector form of kernel k-means, the matrix $H$ is binary, which makes the optimization problem difficult to solve; it can be approximated by relaxing $H$ to take arbitrary real values. Specifically, by defining $U = H L^{\frac{1}{2}}$, letting $U$ take real values, and noting that $\operatorname{tr}(K)$ is constant, Eq. (25) relaxes to

$\max_{U} \operatorname{tr}(U^T K U) \quad \text{s.t.} \quad U^T U = I_K.$ (26)

The fact that $H^T H = L^{-1}$ leads to the orthogonality constraint on $U$, which tells us that the optimal $U$ is obtained from the top $K$ eigenvectors of the kernel matrix $K$. Therefore, Eq. (26) can be considered the generalized optimization formulation of spectral clustering. Note that Eq. (26) is equivalent to Eq. (8) if the kernel matrix $K$ takes the form of the normalized Gram matrix.
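A minimal sketch of this relaxation: take the top-$K$ eigenvectors of the kernel matrix and discretize the rows with k-means (names ours).

```python
import numpy as np
from sklearn.cluster import KMeans

def relaxed_kernel_kmeans(Kmat, K):
    """Relaxed kernel k-means (Eq. (26)): U = top-K eigenvectors of the
    kernel matrix, then discretize the rows of U with k-means."""
    _, vecs = np.linalg.eigh(Kmat)      # eigenvalues in ascending order
    U = vecs[:, -K:]
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)
```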

3). Multi-Kernel Based MVC:

Assume that $m$ kernel matrices are available, each corresponding to one view. To make full use of all views, the weighted combination $K = \sum_{v=1}^{m} w_v^p K^{(v)}$, $w_v \ge 0$, $\sum_{v=1}^{m} w_v = 1$, $p \ge 1$, can be used in kernel k-means (26) and spectral clustering (8) to obtain the corresponding multi-view kernel k-means and multi-view spectral clustering [84]. Using the same combination but specifically setting $p = 1$, Guo et al. [85] extended spectral clustering to MVC with kernel alignment. Because the selected kernels can be redundant, Liu et al. [86] introduced a matrix-induced regularization to reduce the redundancy and enhance the diversity of the selected kernels, with the goal of boosting the clustering performance. By replacing the Euclidean metric in fuzzy c-means with a kernel-induced metric in the data space and adopting a weighted kernel combination, Zhang et al. [87] successfully extended fuzzy c-means to an MVC method that is robust to noise and outliers. For incomplete multi-view data sets, by optimizing the alignment of the shared data instances, Shao et al. [88] collectively complete the kernel matrices of the incomplete data sets. Liu et al. [89] integrated imputation and clustering into a unified learning procedure, but the computational and storage complexities of the method are quite high; to overcome these drawbacks, they proposed a late-fusion method that conducts MVC effectively and efficiently with a three-step iterative procedure [90]. To overcome the cluster initialization problem of kernel k-means, Tzortzis et al. [91] proposed a global kernel k-means algorithm, a deterministic and incremental approach that adds one cluster at each stage through a global search consisting of several executions of kernel k-means from suitable initializations.
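The weighted combination itself is a one-liner that can be fed to the relaxed kernel k-means sketch above; fixed weights are assumed here, whereas the cited methods also optimize the $w_v$.

```python
import numpy as np

def multiview_kernel(Ks, w, p=2):
    """K = sum_v w_v^p K^(v) with w_v >= 0 and sum_v w_v = 1; the result can
    be passed to relaxed_kernel_kmeans above. Weights are held fixed here."""
    return sum((wv ** p) * Kv for wv, Kv in zip(w, Ks))
```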

4). Others:

Besides multi-kernel based MVC, some other methods perform MVC by directly combining features, such as [66], [67], [92]. In [93], two levels of weights, view weights and variable weights, are introduced into the clustering algorithm for multi-view data to identify the importance of the corresponding views and variables. Zhou et al. [94] learn an optimal neighborhood Laplacian matrix by simultaneously searching the neighborhood of the linear combination of the first-order and high-order base Laplacian matrices, and then conduct multi-view spectral clustering. To extend fuzzy clustering to MVC, each view is weighted, and multi-view versions of fuzzy c-means and fuzzy k-means are derived in [95] and [96], respectively.

Direct combination based MVC can adaptively tune the weight of each view, which is necessary and important when some views are of low quality. However, the consensus information among the views is not made explicit in these methods, because no commonality is shared between the different views.

E. Combination After Projection (Mainly CCA-Based MVC)

For multi-view data in which all views have the same data type, such as categorical or continuous, it is reasonable to combine them directly. In real-world applications, however, the multiple representations may have different data types that are difficult to compare directly. For instance, in bioinformatics, genetic information can constitute one view while clinical symptoms constitute another in the cluster analysis of patients [97]; obviously, such information cannot be combined directly. Moreover, high dimensionality and noise are also difficult to handle. To address these problems, the last yet important combination strategy is introduced: combination after projection. The most commonly used techniques are canonical correlation analysis (CCA) and its kernel version (KCCA).

1). CCA and KCCA:

To better understand this style of view combination, CCA and KCCA are briefly introduced (refer to [98] for more details). Given two data sets $S_x = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{d_x \times N}$ and $S_y = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{d_y \times N}$, where each $x$ and $y$ has zero mean, CCA aims to find a projection $w_x \in \mathbb{R}^{d_x}$ for $x$ and another projection $w_y \in \mathbb{R}^{d_y}$ for $y$ such that the correlation between the projections of $S_x$ onto $w_x$ and of $S_y$ onto $w_y$ is maximized:

$\rho = \max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{(w_x^T C_{xx} w_x)(w_y^T C_{yy} w_y)}},$ (27)

where $\rho$ is the correlation and $C_{xy} = E[x y^T]$ denotes the covariance matrix of the zero-mean $x$ and $y$. Observing that $\rho$ is not affected by scaling $w_x$ or $w_y$, either together or independently, CCA can be reformulated as

$\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{s.t.} \quad w_x^T C_{xx} w_x = 1, \; w_y^T C_{yy} w_y = 1,$ (28)

which can be solved using the method of Lagrange multipliers. The two Lagrange multipliers $\lambda_x$ and $\lambda_y$ turn out to be equal, $\lambda_x = \lambda_y = \lambda$. If $C_{yy}$ is invertible, then $w_y = \frac{1}{\lambda} C_{yy}^{-1} C_{yx} w_x$ and $C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x$. Hence, $w_x$ can be obtained by solving an eigenproblem, and the eigenvectors corresponding to the eigenvalues sorted from largest to smallest are obtained successively.
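A hedged numpy/scipy sketch of this eigenproblem route follows; the ridge regularizer and all names are our additions for numerical stability, not part of the classical derivation.

```python
import numpy as np
from scipy.linalg import eigh

def cca(Sx, Sy, p, reg=1e-6):
    """CCA sketch: solve C_xy C_yy^{-1} C_yx w_x = lambda^2 C_xx w_x for the
    top-p directions. Sx is d_x x N, Sy is d_y x N, both column-centered."""
    N = Sx.shape[1]
    Cxx = Sx @ Sx.T / N + reg * np.eye(Sx.shape[0])
    Cyy = Sy @ Sy.T / N + reg * np.eye(Sy.shape[0])
    Cxy = Sx @ Sy.T / N
    A = Cxy @ np.linalg.solve(Cyy, Cxy.T)          # C_xy C_yy^{-1} C_yx
    vals, vecs = eigh(A, Cxx)                      # generalized symmetric problem
    Wx = vecs[:, -p:][:, ::-1]                     # top-p directions, descending
    lam = np.sqrt(np.clip(vals[-p:][::-1], 1e-12, None))
    Wy = np.linalg.solve(Cyy, Cxy.T @ Wx) / lam    # w_y = C_yy^{-1} C_yx w_x / lambda
    return Wx, Wy
```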

The above canonical correlation problem can be transformed into a distance-minimization problem. For ease of derivation, the successive formulation of canonical correlation is replaced by a simultaneous formulation. Assume that the number of projections is $p$, and let the matrices $W_x$ and $W_y$ denote $(w_{x1}, w_{x2}, \ldots, w_{xp})$ and $(w_{y1}, w_{y2}, \ldots, w_{yp})$, respectively. The formulation that identifies all the projections simultaneously can be written as the following optimization problem:

$\max_{(w_{x1}, \ldots, w_{xp}), (w_{y1}, \ldots, w_{yp})} \sum_{i=1}^{p} w_{xi}^T C_{xy} w_{yi} \quad \text{s.t.} \quad w_{xi}^T C_{xx} w_{xj} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \quad w_{yi}^T C_{yy} w_{yj} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise,} \end{cases} \quad i, j = 1, 2, \ldots, p, \quad w_{xi}^T C_{xy} w_{yj} = 0, \; i, j = 1, 2, \ldots, p, \; j \ne i.$ (29)

The matrix formulation of the optimization problem (29) is

$\max_{W_x, W_y} \operatorname{tr}(W_x^T C_{xy} W_y) \quad \text{s.t.} \quad W_x^T C_{xx} W_x = I, \; W_y^T C_{yy} W_y = I, \; w_{xi}^T C_{xy} w_{yj} = 0, \; w_{yi}^T C_{yx} w_{xj} = 0, \; i, j = 1, \ldots, p, \; j \ne i,$ (30)

where $I$ is the $p \times p$ identity matrix. Maximizing the objective function of Eq. (30) can be transformed into the equivalent form

$\min_{W_x, W_y} \|W_x^T S_x - W_y^T S_y\|_F,$ (31)

which is used widely in many works [36], [38], [99].

KCCA uses the "kernel trick" to maximize the correlation between two nonlinearly projected variables. Analogous to Eq. (28), the optimization problem for KCCA is formulated as follows:

$\max_{w_x, w_y} w_x^T K_x K_y w_y \quad \text{s.t.} \quad w_x^T K_x^2 w_x = 1, \; w_y^T K_y^2 w_y = 1.$ (32)

In contrast to linear CCA, which works by solving an eigendecomposition of the covariance matrix, KCCA solves the following eigenproblem:

$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \lambda \begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix}.$ (33)

2). CCA Based MVC:

Since cluster analysis in a high-dimensional space is difficult, Chaudhuri et al. [100] first project the data into a lower-dimensional space via CCA and then cluster samples in the projected low-dimensional space. Under the assumption that the multiple views are uncorrelated given the cluster labels, they show that a weaker separation condition is required to guarantee the success of the algorithm. Blaschko et al. [101] project the data onto the top directions obtained by KCCA across the views and apply k-means to cluster the projected samples.
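A minimal sketch of this CCA-then-cluster pipeline, reusing the cca sketch above; concatenating both projected views and the parameter defaults are illustrative choices of ours rather than the exact procedure of [100].

```python
import numpy as np
from sklearn.cluster import KMeans

def cca_mvc(Sx, Sy, K, p=None):
    """Project both views with CCA, concatenate, and run k-means."""
    p = p or K
    Wx, Wy = cca(Sx, Sy, p)                    # the CCA sketch above
    Z = np.vstack([Wx.T @ Sx, Wy.T @ Sy]).T    # N x 2p projected representation
    return KMeans(n_clusters=K, n_init=10).fit_predict(Z)
```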

For the case of paired views with some class labels, CCA can still be applied by ignoring the class labels, but the performance may be poor. To take advantage of the class label information, Rasiwasia et al. [102] proposed two solutions based on CCA: mean-CCA and cluster-CCA. Consider two data sets, each of which is divided into $K$ different but corresponding classes or clusters: $S_x = \{x^1, x^2, \ldots, x^K\}$ and $S_y = \{y^1, y^2, \ldots, y^K\}$, where $x^k = \{x_1^k, x_2^k, \ldots, x_{|x^k|}^k\}$ and $y^k = \{y_1^k, y_2^k, \ldots, y_{|y^k|}^k\}$ are the data points in the $k$th cluster for the first and second views, respectively. The first solution establishes correspondences between the cluster mean vectors in the two views. Given the cluster means $m_x^k = \frac{1}{|x^k|} \sum_{i=1}^{|x^k|} x_i^k$ and $m_y^k = \frac{1}{|y^k|} \sum_{i=1}^{|y^k|} y_i^k$, mean-CCA is formulated as

$$\rho = \max_{w_x, w_y} \frac{w_x^T V_{xy} w_y}{\sqrt{(w_x^T V_{xx} w_x)(w_y^T V_{yy} w_y)}}, \tag{34}$$

where $V_{xy} = \frac{1}{K} \sum_{k=1}^{K} m_x^k (m_y^k)^T$, $V_{xx} = \frac{1}{K} \sum_{k=1}^{K} m_x^k (m_x^k)^T$ and $V_{yy} = \frac{1}{K} \sum_{k=1}^{K} m_y^k (m_y^k)^T$. The second solution establishes one-to-one correspondences between all pairs of data points within a given cluster across the two views, and then standard CCA is used to learn the projections.
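
The mean-CCA statistics are straightforward to assemble; the small sketch below (the helper name and list-of-arrays input format are our own conventions) computes $V_{xx}$, $V_{yy}$ and $V_{xy}$ from per-cluster sample matrices, after which any standard CCA solver, such as the one sketched earlier, can be applied.

```python
import numpy as np

def mean_cca_statistics(Sx, Sy):
    """Sx, Sy: lists of per-cluster sample matrices of shape (d, n_k);
    returns (Vxx, Vyy, Vxy) for Eq. (34)."""
    Mx = np.stack([xk.mean(axis=1) for xk in Sx], axis=1)  # (d_x, K) means
    My = np.stack([yk.mean(axis=1) for yk in Sy], axis=1)  # (d_y, K) means
    K = Mx.shape[1]
    return Mx @ Mx.T / K, My @ My.T / K, Mx @ My.T / K
```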

For multi-view data with at least one complete view (i.e., the features of that view are available for all data points), Trivedi et al. [103] borrowed the idea of Laplacian regularization to complete the incomplete kernel matrix and then applied KCCA to perform MVC. In another line of MVC methods, multiple data matrices $A^{(v)} \in \mathbb{R}^{N \times K_v}$, $v = 1, 2, \ldots, K$, each corresponding to one view, are obtained in an intermediate step, and a consensus data matrix is then learned to approximate each view's data matrix as closely as possible. Due to the unsupervised nature of the problem, however, these data matrices are often not directly comparable. Using the CCA formulation in Eq. (31), Long et al. [104] project one view's data matrix before comparing it with another view's.

The same idea can be used to tackle the incomplete-view problem (i.e., when no view is complete). For instance, with two views, the methods in [36], [38] split the data into the portion with both views and the portions with only one view, and then project each view's data matrix so that it is close to the final indicator matrix; the multi-view information is connected through the common indicator matrix corresponding to the projected data from both views. Wang et al. [105] provide an MVC method using an extreme learning machine that maps the normalized feature space onto a higher-dimensional feature space.

Combination-after-projection MVC methods fit scenarios where different views cannot be compared directly in the original input space. Although the consensus information is used well in this group of methods, the complementary information is not taken into account, which is the opposite of direct combination based MVC approaches. Thus it is intriguing to explore whether these two groups of methods can be fused to make full use of both the consensus and complementary information.

F. Other MVC methods

In subsections III-A, III-B and III-C, we introduced three classes of MVC methods based on shared similarity structures. In addition, some methods share other structures to perform MVC. By sharing an indicator vector across views in a singular value decomposition of multiple data matrices, Sun et al. [97], [106], [107] extend the biclustering method [108] to multi-view settings. Wang et al. [109] choose the Jaccard similarity to measure cross-view clustering consistency and simultaneously consider within-view clustering quality. By enforcing bidirectional sparsity on a shared subspace, Fan et al. [110] proposed an MVC approach that can find an effective subspace dimension and deal with outliers simultaneously.

Apart from the above categorized methods, there are some other MVC methods. Different from exploiting the consensus information of multi-view data, Cao et al. [111] utilize the Hilbert-Schmidt Independence Criterion as a diversity term to explore the complementarity of multi-view information, reducing the redundancy of the multi-view information to improve clustering performance. Based on the idea of minimizing disagreement between the clusterings from each view, De Sa [112] proposes a two-view spectral clustering that creates a bipartite graph of the views. Zhou et al. [113] define a mixture of Markov chains on the similarity graph of each view and generalize spectral clustering to multiple views. In [114], a transition probability matrix is constructed from each single view, and all these transition probability matrices are used to recover a shared low-rank transition probability matrix as a crucial input to the standard Markov chain method for clustering. By fusing the similarity data from different views, Lange et al. [115] formulate a nonnegative matrix factorization problem and adopt an entropy-based mechanism to control the weights of the multi-view data. Zhu et al. [116] enforce a common affinity matrix to conduct MVC in one step. Liu et al. [117] represent multi-view data as tensors and then perform cluster analysis via tensor methods. Based on the assumption that an exemplar of a cluster in one view is always an exemplar of that cluster in the other views, Zhang et al. [118] proposed a multi-view multi-exemplar fuzzy clustering method with a theoretical guarantee of performance improvement over its single-view counterpart. In [119], a unified graph for multi-view data is learned via cross-view graph diffusion to conduct the final clustering.

IV. RELATIONSHIPS TO RELATED TOPICS

As mentioned previously, MVC is a learning paradigm for cluster analysis with multi-view feature information. It is a basic task in machine learning and can thus be useful for various subsequent analyses. In machine learning and data mining, there are several closely related learning topics, such as multi-view representation learning, ensemble clustering, multi-task clustering, and multi-view supervised and semi-supervised learning. In the following, we elaborate on the relationships between MVC and these topics.

A. Relationship to Multi-View Representation

Multi-view representation learning [120] is the problem of learning a more comprehensive or meaningful representation from multi-view data. According to [121], representation learning (also known as embedding learning or metric learning) is a way to take advantage of human ingenuity and prior knowledge to extract useful but far-removed feature representations for the ultimate objective. Thus representation learning need not be unsupervised in nature; for instance, metric learning has mainly been studied from the supervised perspective, where class labels are present and are usually used to form constraints, for example, pairwise or triplet-based constraints. Multi-view representation can be considered a more basic task than MVC, since it can serve broader purposes such as classification and clustering. However, cluster analysis based on a multi-view representation may not be ideal because the representation is created unaware of the final goal of clustering [122], [123].

In an archived survey article [120], multi-view representation methods are categorized into two main classes: shallow methods and deep methods. The shallow methods are mainly based on CCA, corresponding roughly to our subsection III-E. For the deep methods, there exists a large number of works [124], [125], [126], [127], [128], [129], [130] on multi-view representation, and for multi-view deep clustering there are also many recent works, including [131], [132], [133], [134], [135], [136]. As mentioned above, first learning a multi-view representation and then clustering is a natural way to perform MVC, but the ultimate performance is usually limited by the gap between the two steps. Therefore, how to integrate clustering and multi-view representation learning into a simultaneous process remains an intriguing direction, especially for deep multi-view representation. In addition, although many MVC methods have sprung up in recent years, there is still much room for development, especially compared with the progress in multi-view deep representation learning.

B. Relationship to Ensemble Clustering

Ensemble clustering [137] (also called consensus clustering or aggregation of clusterings) consists of two steps: a generation step and a consensus step. The generation step produces several clusterings of the data set, while the consensus step combines those clusterings into a consensus clustering. The major difference from MVC is that MVC does not need to derive the final result from a set of base clusterings; it can be obtained directly from the original data. Of course, MVC can also proceed through generation and consensus steps when the original data are multi-view and the clusterings in the generation step are obtained from each view; if ensemble clustering is applied to multiple views of the data in this way, it becomes a type of MVC method, so the two paradigms overlap. Therefore, some ensemble clustering techniques, e.g., [138], [139], [140], [141], [142], [143], can be applied to MVC, and the works [65], [144], [145] are representative multi-view ensemble clustering methods. Although the idea of ensemble clustering is simple, it has achieved good performance in real-world applications; in many recent Kaggle competitions, ensemble mechanisms have been popular and performed well, so more exploration in this direction is warranted. However, it should be noted that MVC does not need clearly separated generation and consensus steps, and more work connecting MVC and ensemble clustering deserves further investigation.
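
To make the consensus step concrete, here is a toy co-association sketch assuming one base clustering per view; the average-linkage cut and the helper name are illustrative choices of ours, not a specific method from [138], [139], [140], [141], [142], [143].

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_from_views(labelings, n_clusters):
    """labelings: one label vector per view (each of length N);
    returns a consensus labeling of the N samples."""
    N = len(labelings[0])
    co = np.zeros((N, N))
    for lab in labelings:
        lab = np.asarray(lab)
        co += (lab[:, None] == lab[None, :]).astype(float)
    co /= len(labelings)                  # co-association (agreement) matrix
    dist = 1.0 - co                       # turn agreement into dissimilarity
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```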

C. Relationship to Multi-Task Clustering

Multi-task clustering improves the clustering performance of each task by transferring knowledge among related tasks, as in [146], [147], [148], [149], [150], [151]. There are two major differences between MVC and multi-task clustering. First, multi-task clustering cares about the performance of each task, while MVC only cares about the final consensus clustering performance rather than that of each view. Second, one works on multiple tasks while the other works on multiple views: multiple tasks can be based on multiple datasets, while multiple views must come from the same dataset (just different views of it). If each task corresponds to clustering a specific view of the same dataset, multiple clustering results will be obtained, and ensemble clustering methods may then be employed to fuse them; therefore, multi-task clustering, potentially combined with ensemble clustering, can implement MVC in the scenario where each task corresponds to one view of the same data. In addition, multi-task clustering and MVC can be conducted simultaneously to improve the clustering performance [152], [153], [154]. However, we should still distinguish between them, since multi-task clustering cares about the clustering performance of each task; even if each task corresponds to one view of the dataset, multi-task clustering is not equivalent to MVC, and only when it is further combined with ensemble clustering does it achieve MVC. Thus some techniques and ideas from multi-task clustering and ensemble clustering can be helpful for MVC.

D. Relationship to Multi-View Supervised and Semi-Supervised Learning

The difference between MVC and multi-view supervised or semi-supervised learning lies in whether the data labels are used. MVC does not use any labels, while multi-view supervised learning [4], [155] uses labeled data to learn classifiers (or other inference models), and multi-view semi-supervised learning [3], [4] learns classifiers with both labeled and unlabeled data.

The commonality between them lies in the way multiple views are combined. Many widely recognized techniques for combining views in supervised or semi-supervised settings, e.g., co-training [23], [156], co-regularization [157], [158], and margin consistency [159], [160], can lend a hand to MVC if there is a mechanism to estimate initial labels. Thus, the key to conducting MVC with techniques from multi-view supervised or semi-supervised learning is how to estimate initial labels, or to obtain pseudo-labels that play the role of labels.

V. APPLICATIONS

MVC has been successfully applied in various domains, including computer vision, natural language processing, social multimedia, and bioinformatics and health informatics.

A. Computer Vision

MVC has been widely used in image categorization [30], [32], [51], [111], [161], [139], [162] and motion segmentation tasks [45], [163]. Typically, several feature types, e.g., CENTRIST [164], Color Moment [165], HOG [166], LBP [167] and SIFT [168], can be extracted from the images (see Fig. 2 [51]) prior to cluster analysis. Yin et al. [30] proposed a pairwise sparse subspace representation for multi-view image clustering, which harnesses prior information and maximizes the correlation between the representations of different views. Wang et al. [32] enforced between-view agreement in an iterative way to perform multi-view spectral clustering on images. Gao et al. [51] assumed a common low-dimensional subspace representation for different views to reach the goal of MVC in computer vision applications. Cao et al. [111] adopted the Hilbert-Schmidt Independence Criterion as a diversity term to exploit the complementary information of different views and performed well on both image and video face clustering tasks. Jin et al. [161] utilized CCA to perform multi-view image clustering for large-scale annotated image collections.

Fig. 2: The five views (CENTRIST, ColorMoment, LBP, HOG and SIFT) on three sample images from Caltech101.

Ozay et al. [139] used consensus clustering to fuse image segmentations. Chi et al. [169] conducted MVC for web image retrieval ranking. Méndez et al. [162] adopted an ensemble approach to MVC for MRI image segmentation. Nonnegative matrix factorization was adopted in [45] to perform MVC for motion segmentation. Djelouah et al. [163] addressed the motion segmentation problem by propagating segmentation coherence information in both space and time. Xin et al. [89] successfully applied MVC to person re-identification. Tao et al. [170] applied their multi-view subspace clustering methods to background subtraction from multi-view videos.

B. Natural Language Processing

In natural language processing, text documents can be available in multiple languages. It is natural to use MVC for document categorization [16], [17], [46], [51], [171], [172], [173] with each language as one view. Employing the co-training and co-regularization ideas, Kumar et al. [16], [17] proposed co-training MVC and co-regularized MVC, respectively; performance comparisons on multilingual data demonstrate the superiority of these two methods over single-view clustering. Liu et al. [46] extended nonnegative matrix factorization to multi-view settings for clustering multilingual documents. Kim et al. [171] obtained clustering results from each view and then constructed a consistent data grouping by voting. Jiang et al. [172] proposed a collaborative PLSA method that combines individual PLSA models in different views and imports a regularizer to force the clustering results into agreement across views. Hussain [174] utilized an ensemble approach to perform MVC on documents. Zhang et al. [43] adopted an MVC method with graph regularization to improve object recognition.

C. Social Multimedia

Currently, with the fast development of social multimedia, how to make full use of large quantities of social multimedia data is a challenging problem, especially when matching them to real-world concepts, as in social event detection. Fig. 3 shows two such events: a concert and an NBA game. The pictures shown there form one view, and textual features such as tags and titles form the other. Such social event detection is a typical MVC problem. Petkos et al. [175] adopted a multi-view spectral clustering method to detect social events and additionally utilized known supervisory signals (known cluster labels). Samangooei et al. [176] performed feature selection before constructing the similarity matrix and applied density-based clustering to the fused similarity matrix. Petkos et al. [177] proposed a graph-based MVC to cluster social multimedia data. MVC has also been applied to grouping multimedia collections [178], news stories [179] and social web videos [180].

Fig. 3: Some pictures from two social events: concerts (top row) and an NBA game (bottom row).

D. Bioinformatics and Health Informatics

In order to identify genetic variants underlying the risk for substance dependence, Sun et al. [97], [106], [107] designed three multi-view co-clustering methods to refine diagnostic classification and better inform genetic association analyses. Chao et al. [181] extended the method in [97] to handle missing values that might appear in each view of the data and used it to analyze heroin treatment outcomes; the three views of data for heroin dependence patients are shown in Fig. 4. Yu et al. [182], [183] designed a multi-kernel combination to fuse different views of information and showed superior performance on disease data sets. In [184], an MVC method based on the Grassmann manifold was proposed for gene detection in complex diseases. MVC has also been applied to analyzing athletes' physical fitness tests [185]. Recently, Rappoport and Shamir [186] provided a review of MVC on biomedical omics data sets.

Fig. 4: Three views from health informatics: vital signs (left), urine drug screens (middle) and craving measures (right).

VI. DATASETS AND EXPERIMENTS

To further analyze the advantages and disadvantages of each group of MVC algorithms, we introduce several commonly used multi-view datasets and conduct an empirical evaluation to measure how representative algorithms from each group perform.

A. Datasets

Six benchmark multi-view datasets are adopted, and the statistics of these datasets are summarized in Table I.

TABLE I:

Statistics of the multi-view datasets.

Dataset # Samples # Views # Clusters # Features in each view Is entry non-negative
3 Sources 169 3 6 3560, 3631, 3068 No
Reuters 600 5 6 21526, 24892, 34121, 15487, 11539 Yes
Handwritten Digits 2000 3 10 76, 216, 64 Yes
COIL20 1440 3 20 30, 19, 30 Yes
YALE 165 3 15 4096, 3304, 6750 No
Movies 617 2 17 1878, 1398 No

3 Sources1 is a news article dataset. The articles are collected from three news sources: BBC, Reuters, and The Guardian. The original dataset contains 948 articles reported by at least one of the three sources; herein, 169 of these articles are included, represented with bag-of-words features. Each of these 169 articles is dominated by one of six topical classes: business, entertainment, health, politics, sport, and technology.

Reuters [187] includes documents in five languages: English, French, German, Spanish, and Italian. The five language versions constitute five views of the documents, and bag-of-words features are used in each view. The documents belong to one of six categories; 100 documents are randomly sampled from each category to construct a dataset of 600 documents.

Handwritten Digits is available from the UCI repository.2 It has 2000 examples of handwritten digits (0–9) extracted from Dutch utility maps, with 200 examples per class, each represented by six feature sets. Following the experiments in [188], three feature sets are adopted: 76 Fourier coefficients of the character shapes, 216 profile correlations, and 64 Karhunen-Loève coefficients.

COIL20 3 consists of 1440 images belonging to 20 classes. The three views are 30 isometric projection (ISO) features, 19 linear discriminant analysis (LDA) features, and 30 neighborhood preserving embedding (NPE) features, respectively.

YALE [189] consists of 165 images of 15 subjects, with 11 images per subject corresponding to different facial expressions or configurations. Each image is represented by three heterogeneous feature sets with dimensions 4096, 3304 and 6750.

Movies 4 includes 617 movies belonging to 17 genres. Each movie is described by two views: 1878 keywords and 1398 actors.

B. Compared Methods and Parameter Settings

In the experiments, six representative MVC algorithms, one from each group of MVC approaches, are compared; to explore how deep MVC algorithms perform, one deep algorithm is also included. These algorithms are multi-view mixture-of-multinomials EM (MVMMEM) [10], co-regularized multi-view spectral clustering (Co-Reg) [17], multi-view low-rank sparse subspace clustering (MVLRSSC) [31], multi-view clustering via joint nonnegative matrix factorization (MultiNMF) [46], kernel-based weighted multi-view clustering (MVKKM) [84], multi-view clustering via canonical correlation analysis (MVCCA) [100], and multi-view clustering via deep matrix factorization (DeepNMF) [39].

The parameters are set according to the original papers whenever possible. For MVMMEM, the number of rounds is selected from {5, 10, · · ·, 100}. For Co-Reg, the parameter α is selected from 0.01 to 0.05 with step 0.01. For MVLRSSC, the parameters β1, β2 and λ(v) are tuned according to [31]. For MultiNMF, λv is set to 0.01 for all views. According to MVKKM [84], good performance can be obtained with p = 1.5, so we adopt this setting. For MVCCA, we keep the projection vectors with canonical correlations larger than 0.01. For DeepNMF, two layers with sizes [100, 50] are used for all datasets except COIL20 ([18, 9]), with parameters β = 0.1 and γ = 0.5.

To conduct a comprehensive evaluation, all approaches are compared with six evaluation metrics: normalized mutual information (NMI), accuracy (ACC), adjusted Rand index (ARI), F-score, Precision, and Recall. For all these metrics, higher values indicate better clustering performance. Each algorithm is run 20 times, and the mean and standard deviation of each metric are reported.
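
NMI, ARI, F-score, Precision and Recall follow their standard definitions; ACC additionally requires matching predicted clusters to ground-truth classes, which is commonly done with the Hungarian algorithm, as in the sketch below (a standard recipe, not code from the compared papers).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching of clusters to classes (Hungarian)."""
    C = contingency_matrix(y_true, y_pred)   # classes x clusters count table
    row, col = linear_sum_assignment(-C)     # maximize matched counts
    return C[row, col].sum() / C.sum()

# NMI and ARI come directly from scikit-learn:
#   normalized_mutual_info_score(y_true, y_pred)
#   adjusted_rand_score(y_true, y_pred)
```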

C. Experiment Results

The results are shown in Table II. On the 3 Sources, Reuters and Movies datasets, MVLRSSC performs best. On Handwritten Digits and COIL20, MVKKM outperforms all the other algorithms. On YALE, DeepNMF obtains the best performance. These results are almost consistent across the six metrics, except for NMI on Handwritten Digits and Recall on Movies. On Handwritten Digits, the NMI values of MVKKM and MVLRSSC are very close. On Movies, although the Recall of MVKKM is significantly better than that of MVLRSSC, its Precision and the comprehensive F-score are worse.

TABLE II:

Performance of seven algorithms on six multi-view datasets. The mean and standard deviation over 20 runs are reported. "—" indicates that the dataset has negative entries, so MultiNMF cannot be applied; "/" indicates that the algorithm directly applies only to the two-view case but the dataset has more than two views. The best results among the seven MVC algorithms on each dataset are shown in bold font.

Dataset Method ACC F-score Precision Recall NMI ARI
3 Sources MVMMEM / / / / / /
Co-Reg 0.5434 (0.0097) 0.4648 (0.0100) 0.4985 (0.0124) 0.4373 (0.0100) 0.4894 (0.0086) 0.3161 (0.0131)
MVLRSSC 0.6730 (0.0089) 0.6350 (0.0101) 0.6770 (0.0086) 0.6005 (0.0145) 0.5855 (0.0054) 0.5355 (0.0018)
MultiNMF 0.4107 (0.0079) 0.3244 (0.0045) 0.2631 (0.0051) 0.4227 (0.0030) 0.3426 (0.0061) 0.0531 (0.0084)
MVKKM 0.3550 (0) 0.3621 (0) 0.2298 (0) 0.8430 (0) 0.1131 (0) −0.0064 (0)
MVCCA / / / / / /
DeepNMF 0.6509 (0.0076) 0.5079 (0.0047) 0.5645 (0.0060) 0.4614 (0.0065) 0.4892 (0.0107) 0.3780 (0.0056)
Reuters MVMMEM / / / / / /
Co-Reg 0.4800 (0.0068) 0.3699 (0.0030) 0.3386 (0.0041) 0.4091 (0.0047) 0.3000 (0.0038) 0.2308 (0.0041)
MVLRSSC 0.5280 (0.0069) 0.4174 (0.0030) 0.3641 (0.0062) 0.4931 (0.0052) 0.3782 (0.0032) 0.2802 (0.0055)
MultiNMF —– —– —– —– —– —–
MVKKM 0.2283 (0) 0.2870 (0) 0.1737 (0) 0.8261 (0) 0.1909 (0) 0.0191 (0)
MVCCA / / / / / /
DeepNMF 0.2977 (0.0020) 0.2217 (0.0018) 0.2138 (0.0034) 0.2303 (0.0049) 0.1053 (0.0014) 0.0607 (0.0032)
Handwritten Digits MVMMEM / / / / / /
Co-Reg 0.7571 (0.0064) 0.6842 (0.0087) 0.6656 (0.0087) 0.7042 (0.0087) 0.7298 (0.0076) 0.6481 (0.0097)
MVLRSSC 0.7699 (0.390) 0.7288 (0.0534) 0.6970 (0.0615) 0.7642 (0.0462) 0.7799 (0.0333) 0.6971 (0.0601)
MultiNMF —– —– —– —– —– —–
MVKKM 0.8650 (0) 0.7530 (0) 0.7411 (0) 0.7653 (0) 0.7740 (0) 0.7252 (0)
MVCCA / / / / / /
DeepNMF 0.7738 (0.0009) 0.7456 (0.0019) 0.7042 (0.0016) 0.7921 (0.0022) 0.7961 (0.0022) 0.7156 (0.0021)
COIL20 MVMMEM / / / / / /
Co-Reg 0.9591 (0.0146) 0.9643 (0.0129) 0.9436 (0.0199) 0.9871 (0.0050) 0.9899 (0.0037) 0.9623 (0.0136)
MVLRSSC 0.9767 (0.0078) 0.9799 (0.0065) 0.9686 (0.0099) 0.9922 (0.0028) 0.9943 (0.0019) 0.9788 (0.0068)
MultiNMF —– —– —– —– —– —–
MVKKM 1 (0) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0)
MVCCA / / / / / /
DeepNMF 0.3857 (0.0050) 0.2688 (0.0121) 0.2133 (0.0163) 0.3651 (0.0156) 0.5144 (0.0029) 0.2202 (0.0142)
YALE MVMMEM / / / / / /
Co-Reg 0.5913 (0.0140) 0.4599 (0.0150) 0.4376 (0.0149) 0.4851 (0.0155) 0.6418 (0.0113) 0.4229 (0.0161)
MVLRSSC 0.5677 (0.0103) 0.4141 (0.0090) 0.3939 (0.0090) 0.4368 (0.0093) 0.6088 (0.0082) 0.3739 (0.0097)
MultiNMF 0.5188 (0.0054) 0.3652 (0.0149) 0.3470 (0.0111) 0.3855 (0.0201) 0.5602 (0.0203) 0.3217 (0.0154)
MVKKM 0.6364 (0) 0.4732 (0) 0.4064 (0) 0.5661 (0) 0.6855 (0) 0.4329 (0)
MVCCA / / / / / /
DeepNMF 0.7446(0.0191) 0.5664 (0.0138) 0.5522 (0.0168) 0.5815 (0.0107) 0.7312 (0.0103) 0.5375 (0.0149)
Movies MVMMEM 0.2592 (0.0163) 0.1538 (0.0134) 0.1460 (0.0135) 0.1627 (0.0150) 0.2529 (0.0144) 0.0955 (0.0144)
Co-Reg 0.2615 (0.0033) 0.1517 (0.0023) 0.1402 (0.0022) 0.1657 (0.0031) 0.2657 (0.0031) 0.0916 (0.0024)
MVLRSSC 0.3180 (0.0053) 0.1933 (0.0034) 0.1913 (0.0032) 0.1955 (0.0035) 0.3184 (0.0029) 0.1403 (0.0036)
MultiNMF 0.1900 (0.0130) 0.1220 (0.0055) 0.0722 (0.0038) 0.3966 (0.0412) 0.2073 (0.0115) 0.0208 (0.0068)
MVKKM 0.1005 (0) 0.1146 (0) 0.0611 (0) 0.9241 (0) 0.0670 (0) (0)
MVCCA 0.1295 (0.0036) 0.0607 (0.0013) 0.06175 (0.0014) 0.0596 (0.0014) 0.0854 (0.0058) 0.0007 (0.0015)
DeepNMF 0.1847 (0.0033) 0.0945 (0.0028) 0.0911 (0.0035) 0.0981 (0.0026) 0.1626 (0.0027) 0.0332 (0.0035)

Datasets 3 Sources, Reuters and Movies consist of text information. Owing to the properties of topical text, low rank and sparsity are important in multi-view clustering, so MVLRSSC, which takes both into account, performs well. Handwritten Digits, COIL20 and YALE are image datasets, where nonlinear or deep structures may be needed to learn abstract, meaningful clusters, so MVKKM and DeepNMF perform better on them. In addition, different views contribute differently to the final clustering, so assigning them different weights can promote performance, which again favors MVKKM. From the results, we find that MultiNMF is applied only to the 3 Sources, YALE and Movies datasets, because MultiNMF requires all entries of the data to be non-negative; this is one limitation of MultiNMF. Table II also shows that MVMMEM and MVCCA apply only to the Movies dataset, because they are suitable only for datasets with two views, which is a limitation of these algorithms. MVMMEM performs worse than MVLRSSC and Co-Reg but better than MultiNMF, MVKKM, MVCCA and DeepNMF; it can be seen that generative algorithms have the potential to be comparable with state-of-the-art discriminative algorithms. Although MultiNMF exploits non-negativity, it does not perform well compared with the other groups of algorithms on these datasets, perhaps because, compared with non-negativity, properties such as low rank, sparsity and view weighting matter more for performance improvement.

Based on the results in Table II, the multi-view subspace clustering, multi-kernel MVC and deep MVC algorithms perform well, while spectral clustering based MVC, NMF based MVC and MVCCA perform worse on the six commonly used datasets. Generative MVC performs better than many discriminative methods and thus deserves more attention in the future. In this experiment we focused on clustering performance; a more comprehensive study, including time cost and more advanced MVC algorithms, is worth further exploration.

VII. OPEN PROBLEMS

We have identified several problems that are still underexplored in the current body of MVC literature. We discuss these problems in this section.

A. Large Scale Problem (size and dimension)

In modern life, large quantities of data are generated every day. For instance, several million posts are shared per minute on Facebook, comprising multiple data forms (views): videos, images and texts. At the same time, a large amount of news is reported in different languages, which can also be considered multi-view data with each language as one view. However, most existing MVC methods can only deal with small datasets, and it is important to extend them to large-scale applications. For instance, it is difficult for existing multi-view spectral clustering methods to work on datasets with massive numbers of samples due to the expensive graph construction and eigendecomposition. Although some previous works such as [190], [191], [192], [193] attempted to accelerate spectral clustering to scale to big data, it remains intriguing to extend them effectively to multi-view settings. Recently, Zhang et al. [194] proposed an interesting idea for the large-scale problem: encoding multi-view image data into a compact common binary code space and then conducting binary clustering.

Another type of big data has high dimensionality. There is a large body of single-view clustering methods [195] for this kind of problem; however, one special class of such problems remains tough. For instance, in bioinformatics, each person has millions of genetic variants as genetic features, so the number of samples is low compared with the problem dimension. Using genetic features in a clinical analysis together with another view of clinical phenotypes often forms a multi-view analytics problem. Such clustering problems are hard because of over-fitting. Although feature selection [196], [197] and feature dimension reduction [198] such as PCA are commonly used to alleviate this problem in single-view settings, there are no convincing methods so far, especially because deep learning cannot cope with the combination of small sample size and high feature dimension. New theory may be needed to handle this problem.

B. Incomplete Views or Missing Value

MVC has been successfully applied to many applications, as shown in Section V. However, there is an underlying problem hidden behind them: what if one or more views are incomplete? This is very common in real-world applications. For example, in multilingual document collections, many documents may have only one or two language versions; in social multimedia, some samples may lack visual or audio information due to sensor failures; in health informatics, some patients may not take certain lab tests, causing missing views or missing values. Some data entries may be missing at random while others are missing non-randomly [181]. Simply replacing the missing entries with zeros or mean values [199] is a common way to deal with the missing value problem, and multiple imputation [200] is also popular in statistics; the missing entries can also be generated by the recently popular generative adversarial networks [201]. However, without considering the differences between random and non-random effects in missing data, the clustering performance is not ideal [181].
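
As a baseline for the zero/mean filling mentioned above, a per-view mean-imputation sketch is given below (coding missing entries as NaN is an assumption of the sketch); note that such simple filling ignores the random versus non-random distinction discussed here.

```python
import numpy as np

def mean_impute(X):
    """X: (N, d) single-view matrix with np.nan marking missing entries;
    returns a copy with each NaN replaced by its column (feature) mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X
```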

Up to now, several multi-view works [202], [36], [37], [88], [38], [61], [63], [103] have attempted to solve the incomplete-view problem. The methods in [61], [63] introduced a weight matrix $M_{i,j}$ to indicate whether the $i$th instance is present in the $j$th view. For the two-view case, the method in [36] reorganized the multi-view data into three parts: samples with both views, samples with only view 1, and samples with only view 2, and then analyzed them to handle missing entries. Assuming at least one complete view, Trivedi et al. [103] used the graph Laplacian to complete the kernel matrix with missing values based on the kernel matrix computed from the complete view, and Shao [88] borrowed the same idea for the multi-view setting. Instead of imputing the kernel matrix, Liu [203] imputed each base matrix generated by the incomplete views with a learned consensus clustering matrix. Note that all these methods handle incomplete views or missing values only under certain constraints; none of them addresses arbitrarily missing values in any of the views, i.e., the situation where all views have missing values and a sample misses only a few features within a view. The above methods thus have significant limitations and cannot make full use of the available incomplete multi-view information. In addition, none of the existing methods takes into consideration the difference between random and non-random missing patterns. Therefore, it is worth exploring how to handle such mixed missing patterns in multi-view analysis.

C. Initialization and Local Minima

For MVC methods based on k-means, the initial clusters are very important, and different initializations may lead to different clustering results. It remains challenging to select the initial clusters effectively in MVC, and even in single-view clustering settings.

Most NMF-based methods rely on non-convex optimization formulations and are thus prone to local optima, especially when missing values and outliers exist. By enforcing a consistent clustering result across different views and additionally using some side information, Zhao et al. [173] formulated a jointly convex optimization problem. Self-paced learning [204] is another possible remedy, and Xu et al. [64] applied it to MVC to alleviate the local minimum problem.

The generative convex clustering method [8] is an interesting approach to avoiding the local minimum problem. In [12], a multi-view version of the method in [8] was proposed and shows good performance. This kind of generative method may be another direction worth exploring further.

D. Deep Learning

Recently, deep learning has demonstrated outstanding performance in many applications such as speech recognition, image segmentation and object detection. However, compared with the fast growth of supervised deep learning and unsupervised deep representation learning, deep clustering still has much room to develop, especially multi-view deep clustering. A natural way to conduct (multi-view) deep clustering is to cluster the representation obtained from single-view or multi-view representation learning; in fact, there should be more advanced ways to conduct multi-view deep clustering.

Recently, a number of deep clustering works have indeed appeared. For example, the works in [205], [206], [207] borrowed supervised deep learning ideas to perform supervised clustering; in fact, they can be considered as performing semi-supervised learning. There are already several truly deep clustering works [131], [132], [208]. Tian et al. [131] proposed a deep clustering algorithm based on spectral clustering, but replaced the eigenvalue decomposition with a deep auto-encoder. Xie et al. [132] proposed a deep neural network clustering approach that learns representations and performs clustering simultaneously. It is interesting to explore how to extend these methods to multi-view scenarios.

Besides deep clustering works, there also exist some deep MVC methods. Huang et al. [208] proposed multi-layer matrix factorization that shares the same representation matrix across different views to conduct MVC; experimental results demonstrate the superiority of this deep method over multi-view shallow clustering methods such as co-training clustering, co-regularized clustering and multi-view k-means clustering. Using an auto-encoder architecture, Zhu et al. [135] designed a diverse net and a universal net to make full use of the complementary and consensus information among multiple views. To let the clustering labels guide the representation learning, Sun et al. [136] proposed another deep subspace MVC method in a semi-supervised way. Li et al. [133] presented a deep MVC approach borrowing the idea and architecture of Generative Adversarial Networks (GANs). Inspired by the great success of attention mechanisms in deep learning, Zhou et al. [134] explored an MVC method combining GANs and attention mechanisms, and experiments support its effectiveness.

Compared with traditional multi-view shallow clustering methods, the aforementioned multi-view deep clustering methods demonstrate better performance for several reasons. First, the deep networks they adopt have stronger expressive ability and may discover more of the real structure of the multi-view data. Second, some of them adopt an end-to-end multi-view deep clustering approach, so the intermediate representation both reflects the multi-view data comprehensively and serves the final clustering goal well. However, there is still much to explore in this direction. First, novel deep learning architectures keep emerging, and how to extend them to multi-view scenarios needs more investigation. Second, although some end-to-end multi-view deep clustering methods have appeared, more are expected: multi-view deep representation learning is more developed than multi-view deep clustering, and while it is simple and natural to run a clustering algorithm on a learned multi-view deep representation, such a two-step process has its limitations, such as representation learning being unaware of the clustering goal. Third, deep learning techniques have their own special properties, and more ways to combine multiple views can be designed for multi-view deep clustering. Fourth, theoretical investigation should be conducted to explain how and why multi-view deep clustering outperforms traditional shallow methods.

E. Mixed Data Types

Multi-view data may not contain only numerical or categorical features; they can also have other types such as symbolic or ordinal. These different types can appear simultaneously in the same view or in different views. How to integrate different types of data to perform MVC deserves careful investigation. Converting all of them to the categorical type is a straightforward solution; however, much information is lost in such processing: for example, differences among continuous values mapped into the same category are ignored. Paper [209] proposed a solution to the mixed-data-type problem with vine copulas. It is worth exploring further how to make full use of the information within mixed data types in MVC settings.

F. Multiple Solutions

Most existing MVC and even single-view clustering algorithms output only a single clustering solution. However, in real-world applications, data can often be grouped in many different ways, and all these solutions may be reasonable and interesting from different perspectives. For example, it is equally reasonable to group the fruits apple, banana, and grape by fruit type or by color. Until now, to the best of our knowledge, there have been very few works in this direction [210], [211], [212]. Cui et al. [210] proposed to partition multi-view data by projecting the data into a space orthogonal to the current solution, so that multiple non-redundant solutions are obtained. In another work [211], the Hilbert-Schmidt Independence Criterion was adopted to measure the dependence across different views, and one clustering solution was found in each view. Chang et al. [212] automatically learned multiple expert views and the clustering structure corresponding to each view in a Bayesian probabilistic model. MVC algorithms that can produce multiple solutions deserve more attention in the future.
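
The orthogonal-projection idea of [210] can be sketched in a few lines; deflating the subspace spanned by the k-means centroids, as done below, is our simplified stand-in for the projection step of that paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def alternative_clusterings(X, n_clusters, n_solutions=2):
    """X: (N, d); returns a list of label vectors, one per solution."""
    X_cur, solutions = X.astype(float).copy(), []
    for _ in range(n_solutions):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_cur)
        solutions.append(km.labels_)
        # Deflate: project the data onto the orthogonal complement of the
        # subspace spanned by the current cluster centroids.
        Q, _ = np.linalg.qr(km.cluster_centers_.T)   # (d, n_clusters)
        X_cur = X_cur - X_cur @ Q @ Q.T
    return solutions
```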

VIII. CONCLUSION

To sort out existing MVC methods, we proposed a novel taxonomy to introduce them. Similar to common machine learning categorizations, we split MVC methods into two classes: generative methods and discriminative methods. Based on the way multiple views are combined, discriminative methods are further split into five main classes, the first three of which share a commonality: sharing certain structures across the views. The fourth uses direct combinations of the views, while the fifth employs view combinations after projection. Compared with discriminative methods, generative methods are far less developed; although they have inherent limitations, they can deal with missing data and reach global optima more easily, and thus call for more attention. To better understand MVC, we elaborated on the relationships between MVC and several closely related learning topics. We also introduced several real-world applications of MVC and, most importantly, conducted a comprehensive experimental study of representative MVC algorithms from each group to analyze their advantages and disadvantages, and finally pointed out some interesting and challenging directions to guide future research.

Impact Statement—

Multi-view clustering (MVC) is a substantial extension of clustering with the ability to make full use of the multi-view information of objects. While it is true that one can cluster objects by simply concatenating the multiple views, the correlations between those views are not explicitly modeled and could bias the resultant clusters.

To improve the performance of clustering by leveraging the multi-view information, this paper provides a systematic survey of the existing MVC approaches. We further present an experimental comparison of representative algorithms from each category. Our most significant finding is that multi-view subspace clustering, multi-kernel MVC, deep MVC, and generative MVC methods perform better than the other discriminative methods; we therefore suggest that more research is needed on generative MVC. Moreover, some important future directions, such as incomplete MVC and deep MVC, are highlighted.

ACKNOWLEDGEMENTS

This work was supported by National Institutes of Health (NIH) grants R01DA037349 and K02DA043063, and National Science Foundation (NSF) grants DBI-1356655, CCF-1514357, and IIS-1718738. Jinbo Bi was also supported by NSF grants IIS-1320586, IIS-1407205, and IIS-1447711. The authors acknowledge Songtao Wang for facilitating the empirical studies.

Biography

Guoqing Chao received the B.S. degree from Xinyang Normal University, Xinyang, China, in 2009, and the Ph.D. degree from the Department of Computer Science and Technology, East China Normal University, Shanghai, China, in 2015. After that, he was a postdoctoral researcher at the University of Connecticut and Northwestern University, USA, and at Singapore Management University, Singapore. He currently works at the Harbin Institute of Technology at Weihai, China. His research interests include machine learning, data mining and bioinformatics.

Shiliang Sun is a professor at the School of Computer Science and Technology and the head of the Pattern Recognition and Machine Learning Research Group, East China Normal University. He received the B.E. degree from Beijing University of Aeronautics and Astronautics in 2002, and the Ph.D. degree from Tsinghua University in 2007. He is a member of the PASCAL network of excellence and serves on the editorial boards of multiple international journals. His research interests include approximate inference, learning theory, sequential modeling, kernel methods, and their applications.

Jinbo Bi received the Ph.D. degree in mathematics from Rensselaer Polytechnic Institute, USA, and a master's degree in electrical engineering and automatic control from Beijing Institute of Technology, China. She is a professor of Computer Science and Engineering at the University of Connecticut. Prior to her current appointment, she worked with Siemens Medical Solutions on computer-aided diagnosis research and with Partners HealthCare on clinical decision support systems. Her research interests include machine learning, data mining, bioinformatics and biomedical informatics.

Contributor Information

Guoqing Chao, School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, PR China.

Shiliang Sun, School of Computer Science and Technology, East China Normal University, Shanghai, Shanghai 200062 China.

Jinbo Bi, Department of Computer Science, University of Connecticut, Storrs, CT 06269 USA.

REFERENCES

  • [1].Berkhin Pavel, “Survey of clustering data mining techniques,” Tech. Rep., Yahoo, 2002. [Google Scholar]
  • [2].Saxe John G, The blind men and the elephant, Enrich Spot Limited, 2016. [Google Scholar]
  • [3].Xu Chang, Tao Dacheng, and Xu Chao, “A survey on multi-view learning,” arXiv prepint arXiv:1304.5634, 2013. [Google Scholar]
  • [4].Sun Shiliang, “A survey of multi-view machine learning,” Neural Computing and Applications, vol. 23, no. 7–8, pp. 2031–2038, 2014. [Google Scholar]
  • [5].Zhao Jing, Xie Xijiong, Xu Xin, and Sun Shiliang, “Multi-view learning overview: Recent progress and new challenges,” Information Fusion, vol. 38, pp. 43–54, 2017. [Google Scholar]
  • [6].Yang Yan and Wang Hao, “Multi-view clustering: a survey,” Big Data Mining and Analytics, vol. 1, no. 2, pp. 83–107, 2018. [Google Scholar]
  • [7].Dempster Arthur P., Laird Nan M., and Rubin Donald B., “Maximum likelihood from incomplete data via the em algorithm,” Journal of the royal statistical society. Series B, vol. 39, pp. 1–38, 1977. [Google Scholar]
  • [8].Lashkari Danial and Golland Polina, “Convex clustering with exemplar-based models,” in Advances in Neural Information Processing Systems, December 2008, pp. 825–832. [PMC free article] [PubMed] [Google Scholar]
  • [9].Banerjee Arindam, Merugu Srujana, Dhillin Inderjit S., and Ghosh Joydeep, “Clustering with bregman divergences,” Journal of Machine Learning Research, vol. 6, no. 12, pp. 1705–1749, 2005. [Google Scholar]
  • [10].Bickel Steffen and Scheffer Tobias, “Multi-view clustering,” in Proceedings of the IEEE Internamtional Conference on Data Mining, 2004, pp. 19–26. [Google Scholar]
  • [11].Yi Xing, Xu Yunpeng, and Zhang Changshui, “Multi-view em algorithm for finite mixture models,” in Proceedings of the International Conference on Pattern Recognition and Image Analysis, August 2005, pp. 420–425. [Google Scholar]
  • [12].Tzortzis Grigorios and Likas Aristidis, “Convex mixture models for multi-view clustering,” in Proceedings of the International Conference on Artificial Neural Networks, December 2009, pp. 205–214. [Google Scholar]
  • [13].Tzortzis Grigorios and Kikas Aristidis, “Multiple view clustering using a weighted combination of exemplar-based mixture models,” IEEE Transactions on Neural Networks, vol. 21, no. 12, pp. 1925–1938, 2010. [DOI] [PubMed] [Google Scholar]
  • [14].Belkin Mikhail and Niyogi Partha, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Advances in neural information processing systems, 2002, pp. 585–591. [Google Scholar]
  • [15].Kang Z, Pan H, Hoi SCH, and Xu Z, “Robust graph learning from noisy data,” IEEE Transactions on Cybernetics, pp. 1–11, 2019. [DOI] [PubMed] [Google Scholar]
  • [16].Kumar Abhishek and Daume Hal III, “A co-training approach for multi-view spectral clustering,” in Proceedings of the 28th International Conference on Machine Learning, New York, NY, USA, June 2011, pp. 393–400. [Google Scholar]
  • [17].Kumar Abhishek, Rai Piyush, and Daume Hal III, “Co-regularized multi-view spectral clustering,” in Advances in Neural Information Processing Systems, December 2011, pp. 1413–1421. [Google Scholar]
  • [18].Ng Andrew Y., Jordan Michael I., and Weiss Yair, “On spectral clustering: analysis and an algorithm,” in Advances in Neural Information Processing Systems 14, December 2001, pp. 849–856. [Google Scholar]
  • [19].Shi Jianbo and Malik Jitendra, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Learning, vol. 22, pp. 888–905, 2000. [Google Scholar]
  • [20].Chao Guoqing, “Discriminative k-means laplacian clustering,” Neural Processing Letters, vol. 49, pp. 393–405, 2019. [Google Scholar]
  • [21].Lütkepohl Helmut, Handbook of Mactrices, pp. 67–69, Chichester: Wiley, 1997. [Google Scholar]
  • [22].Luxburg Ulrike, “A tutorial on spectral clustering,” Statistical and Computing, vol. 17, no. 4, pp. 395–416, 2007. [Google Scholar]
  • [23].Blum A. and Mitchell T, “Combining labeled and unlabeled data with co-training,” in Proceedings of the 11th Annual Conference on Computational Learning Theory, Jul 1998, pp. 92–100. [Google Scholar]
  • [24].Cai Xiao, Nie Feiping, Huang Heng, and Kamangar Farhad, “Heterogeneous image feature integration via multi-modal spectral clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2011, pp. 1977–1984. [Google Scholar]
  • [25].Wang Xiang, Qian Buyue, Ye Jieping, and Davidson Ian, “Multiobjective multi-view spectral clustering via pareto optimization,” in Proceedings of the 2013 SIAM International Conference on Data Mining, May 2013, pp. 234–242. [Google Scholar]
  • [26].Ye Yongkai, Liu Xinwang, Yin Jianping, and Zhu En, “Co-regularized kernel k-means for multi-view clustering,” in Proceedings of the 23rd International Conference on Pattern Recognition, August 2016, pp. 1583–1588. [Google Scholar]
  • [27].Dong Xiaowen, Frossard Pascal, Vandergheynst Pierre, and Nefedov Nikolai, “Clustering on multi-layer graphs via subspace analysis on grassmann manifolds,” IEEE Transactions on Signal Processing, vol. 62, no. 4, pp. 905–918, 2014. [Google Scholar]
  • [28].Vidal Rene, “A tutorial on subspace clustering,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 52–68, 2011. [Google Scholar]
  • [29].Elhamifar Ehsan and Vidal Rene, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013. [DOI] [PubMed] [Google Scholar]
  • [30].Yin Qiyue, Wu Shu, He Ran, and Wang Liang, “Multi-view clustering via pairwise sparse subspace representation,” Neurocomputing, vol. 156, no. 5, pp. 12–21, 2015. [Google Scholar]
  • [31].Brbić Maria and Kopriva Ivica, “Multi-view low-rank sparse subspace clustering,” Pattern Recognition, vol. 73, pp. 247–258, 2018. [Google Scholar]
  • [32].Wang Yang, Zhang Wenjie, Wu Lin, Lin Xuemin, Fang Meng, and Pan Shirui, “Iterative views agreement: an iterative low-rank based structured optimization method to multi-view spectral clustering,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, July 2016, pp. 2153–2159. [Google Scholar]
  • [33].Wang Yang, Lin Xuemin, Wu Lin, Zhang Wenjie, Zhang Qing, and Huang Xiaodi, “Robust subspace clustering for multi-view data by exploiting correlation consensus,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3939–3949, 2015. [DOI] [PubMed] [Google Scholar]
  • [34].Zhang Changqing, Fu Huazhu, Hu Qinghua, Cao Xiaochun, Xie Yuan, Tao Dacheng, and Xu Dong, “Generalized latent multi-view subspace clustering,” IEEE transactions on pattern analysis and machine intelligence, 2018. [DOI] [PubMed] [Google Scholar]
  • [35].Huang Ling, Chao Hong-Yang, and Wang Chang-Dong, “Multi-view intact space clustering,” Pattern Recognition, vol. 86, pp. 344–353, 2019. [Google Scholar]
  • [36].Li Shao-Yuan, Jiang Yuan, and Zhou Zhi-Hua, “Partial multi-view clustering,” in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 2014, pp. 1968–1974. [Google Scholar]
  • [37].Zhao Handong, Liu Hongfu, and Fu Yun, “Incomplete multi-modal visual data grouping,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, July 2016, pp. 2392–2398. [Google Scholar]
  • [38].Yin Qiyue, Wu Shu, and Wang Liang, “Incomplete multi-view clustering via subspace learning,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, October 2015, pp. 383–392. [Google Scholar]
  • [39].Zhao Handong, Ding Zhengming, and Fu Yun, “Multi-view clustering via deep matrix factorization,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, June 2017. [Google Scholar]
  • [40].Cai Hao, Liu Bo, Xiao Yanshan, and Lin LuYue, “Semi-supervised multi-view clustering based on constrained nonnegative matrix factorization,” Knowledge-Based Systems, 2019. [Google Scholar]
  • [41].Lee Daniel D. and Seung H. Sebastian, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999. [DOI] [PubMed] [Google Scholar]
  • [42].Xu Wei, Liu Xin, and Gong Yihong, “Document clustering based on non-negative matrix factorization,” in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, July 2003, pp. 267–273. [Google Scholar]
  • [43].Zhang Xinyu, Gao Hongbo, Li Guopeng, Zhao Jianhui, Huo Jianghao, Yin Jialun, Liu Yuchao, and Zheng Li, “Multi-view clustering based on graph-regularized nonnegative matrix factorization for object recognition,” Information Sciences, vol. 432, pp. 463–478, 2018. [Google Scholar]
  • [44].Jean-Philippe Pablo Tamayo, Golub Todd R., and Mesirov Jill P., “Metagenes and molecular pattern discovery using matrix factorization,” Proceedings of the National Academy of Sciences, vol. 101, no. 12, pp. 4164–4169, 2004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [45].Akata Zeynep, Bauckhage Christian, and Thurau Christian, “Non-negative matrix factorization in multimodality data for segmentation and label prediction,” in 16th Computer Vision Winter Workshop, February, 2011, Wendel Andreas, Sternig Sabine, and Godec Martin, Eds., Mitterberg, Autriche, 2011, pp. 1–8. [Google Scholar]
[46] Liu Jialu, Wang Chi, Gao Jing, and Han Jiawei, “Multi-view clustering via joint nonnegative matrix factorization,” in Proceedings of the 2013 SIAM International Conference on Data Mining, May 2013, pp. 252–260.
[47] Cai Xiao, Nie Feiping, and Huang Heng, “Multi-view k-means clustering on big data,” in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, August 2013, pp. 2598–2604.
[48] Xu Jinglin, Han Junwei, Nie Feiping, and Li Xuelong, “Re-weighted discriminatively embedded k-means for multi-view clustering,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 3016–3027, 2017.
[49] Liu Hongfu and Fu Yun, “Consensus guided multi-view clustering,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 12, no. 4, Article 42, 2018.
[50] Ding Chris, He Xiaofeng, and Simon Horst D., “On the equivalence of nonnegative matrix factorization and spectral clustering,” in Proceedings of the 2005 SIAM International Conference on Data Mining, April 2005, pp. 1–5.
[51] Gao Hongchang, Nie Feiping, Li Xuelong, and Huang Heng, “Multi-view subspace clustering,” in Proceedings of the IEEE International Conference on Computer Vision, December 2015, pp. 4238–4246.
[52] Wang Hua, Nie Feiping, and Huang Heng, “Multi-view clustering and feature learning via structured sparsity,” in Proceedings of the 30th International Conference on Machine Learning, June 2013, pp. 352–360.
[53] Greene Derek and Cunningham Pádraig, “A matrix factorization approach for integrating multiple data views,” in Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I, September 2009, pp. 423–438.
[54] Tang Wei, Lu Zhengdong, and Dhillon Inderjit S., “Clustering with multiple graphs,” in Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, December 2009, pp. 1016–1021.
[55] Qian Bin, Shen Xiaobo, Gu Yanyang, Tang Zhenmin, and Ding Yuhua, “Double constrained NMF for partial multi-view clustering,” in 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2016, pp. 1–7.
[56] Zong Linlin, Zhang Xianchao, Zhao Long, Yu Hong, and Zhao Qianli, “Multi-view clustering via multi-manifold regularized non-negative matrix factorization,” Neural Networks, vol. 88, pp. 74–89, 2017.
[57] Kang Zhao, Zhao Xinjia, Peng Chong, Zhu Hongyuan, Zhou Joey Tianyi, Peng Xi, Chen Wenyu, and Xu Zenglin, “Partition level multiview subspace clustering,” Neural Networks, vol. 122, pp. 279–288, 2020.
[58] Tao Hong, Hou Chenping, Liu Xinwang, Yi Dongyun, and Zhu Jubo, “Reliable multi-view clustering,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, February 2018, pp. 4123–4130.
[59] Zhang Lefei, Zhang Liangpei, Du Bo, You Jane, and Tao Dacheng, “Hyperspectral image unsupervised classification by robust manifold matrix factorization,” Information Sciences, vol. 485, pp. 154–169, 2019.
[60] Ma Jiaqi, Zhang Lipeng, and Zhang Lefei, “Discriminative subspace matrix factorization for multiview data clustering,” Pattern Recognition, vol. 111, 2021.
[61] Shao Weixiang, He Lifang, and Yu Philip S., “Multiple incomplete views clustering via weighted nonnegative matrix factorization with ℓ2,1 regularization,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, September 2015, pp. 318–334.
[62] Xu Yu-Meng, Wang Chang-Dong, and Lai Jian-Huang, “Weighted multi-view clustering with feature selection,” Pattern Recognition, vol. 53, pp. 25–35, 2016.
[63] Shao Weixiang, He Lifang, Lu Chun-Ta, and Yu Philip S., “Online multi-view clustering with incomplete views,” in Proceedings of the IEEE International Conference on Big Data, December 2016.
[64] Xu Chang, Tao Dacheng, and Xu Chao, “Multi-view self-paced learning for clustering,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, July 2015, pp. 3974–3980.
[65] Tao Zhiqiang, Liu Hongfu, Li Sheng, Ding Zhengming, and Fu Yun, “From ensemble clustering to multi-view clustering,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, August 2017, pp. 2843–2849.
[66] Xu Jinglin, Han Junwei, and Nie Feiping, “Discriminatively embedded k-means for multi-view clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 5356–5364.
[67] Nie Feiping, Li Jing, and Li Xuelong, “Self-weighted multiview clustering with multiple graphs,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, August 2017, pp. 2564–2570.
[68] Joachims Thorsten, Cristianini Nello, and Shawe-Taylor John, “Composite kernels for hypertext categorisation,” in Proceedings of the Eighteenth International Conference on Machine Learning, July 2001, pp. 250–257.
[69] Zhang Tong, Popescul Alexandrin, and Dom Byron, “Linear prediction models with graph regularization for web-page categorization,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2006, pp. 821–826.
[70] Chao Guoqing and Sun Shiliang, “Multi-kernel maximum entropy discrimination for multi-view learning,” Intelligent Data Analysis, vol. 20, no. 3, pp. 481–493, 2016.
[71] Vert Jean-Philippe, Tsuda Koji, and Schölkopf Bernhard, A Primer on Kernel Methods, pp. 35–70, MIT Press, Cambridge, MA, USA, 2004.
[72] Kang Zhao, Peng Chong, and Cheng Qiang, “Kernel-driven similarity learning,” Neurocomputing, vol. 267, pp. 210–219, 2017.
[73] Zhao Bin, Kwok James T., and Zhang Changshui, “Multiple kernel clustering,” in Proceedings of the SIAM International Conference on Data Mining, May 2009, pp. 638–649.
[74] Zeng Hong and Cheung Yiu-ming, “Kernel learning for local learning based clustering,” in Proceedings of the International Conference on Artificial Neural Networks, September 2009, pp. 10–19.
[75] Valizadegan Hamed and Jin Rong, “Generalized maximum margin clustering and unsupervised kernel learning,” in Advances in Neural Information Processing Systems, December 2006, pp. 1417–1424.
[76] Kang Zhao, Wen Liangjian, Chen Wenyu, and Xu Zenglin, “Low-rank kernel learning for graph-based clustering,” Knowledge-Based Systems, vol. 163, pp. 510–517, 2019.
[77] Lanckriet Gert R. G., Cristianini Nello, Bartlett Peter, El Ghaoui Laurent, and Jordan Michael I., “Learning the kernel matrix with semidefinite programming,” Journal of Machine Learning Research, vol. 5, pp. 27–72, 2004.
[78] Bach Francis R., Lanckriet Gert R. G., and Jordan Michael I., “Multiple kernel learning, conic duality, and the SMO algorithm,” in Proceedings of the Twenty-First International Conference on Machine Learning, July 2004, pp. 41–48.
[79] Sonnenburg Sören, Rätsch Gunnar, and Schäfer Christin, “A general and efficient multiple kernel learning algorithm,” in Proceedings of the 18th International Conference on Neural Information Processing Systems, December 2005, pp. 1273–1280.
[80] Gönen Mehmet and Alpaydin Ethem, “Localized multiple kernel learning,” in Proceedings of the Twenty-Fifth International Conference on Machine Learning, July 2008, pp. 352–359.
[81] Gönen Mehmet and Alpaydin Ethem, “Multiple kernel learning algorithms,” Journal of Machine Learning Research, vol. 12, no. 3, pp. 2211–2268, 2011.
[82] Schölkopf Bernhard, Smola Alexander, and Müller Klaus-Robert, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[83] Dhillon Inderjit S., Guan Yuqiang, and Kulis Brian, “Weighted graph cuts without eigenvectors: a multilevel approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 11, pp. 1944–1957, 2007.
[84] Tzortzis Grigorios and Likas Aristidis, “Kernel-based weighted multi-view clustering,” in Proceedings of the IEEE 12th International Conference on Data Mining, December 2012, pp. 675–684.
[85] Guo Dongyan, Zhang Jian, Liu Xinwang, Cui Ying, and Zhao Chunxia, “Multiple kernel learning based multi-view spectral clustering,” in Proceedings of the 22nd International Conference on Pattern Recognition, August 2014, pp. 3774–3779.
[86] Liu Xinwang, Dou Yong, Yin Jianping, Wang Lei, and Zhu En, “Multiple kernel k-means clustering with matrix-induced regularization,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 2016, pp. 1888–1894.
[87] Zhang Daoqiang and Chen Songcan, “Clustering incomplete data using kernel-based fuzzy c-means algorithm,” Neural Processing Letters, vol. 18, pp. 155–162, 2003.
[88] Shao Weixiang, Shi Xiaoxiao, and Yu Philip S., “Clustering on multiple incomplete datasets via collective kernel learning,” in Proceedings of the IEEE 13th International Conference on Data Mining, December 2013, pp. 1181–1186.
[89] Xin Xiaomeng, Wang Jinjun, Xie Ruji, Zhou Sanping, Huang Wenli, and Zheng Nanning, “Semi-supervised person re-identification using multi-view clustering,” Pattern Recognition, vol. 88, pp. 285–297, 2019.
[90] Liu Xinwang, Zhu Xinzhong, Li Miaomiao, Wang Lei, Tang Chang, Yin Jianping, Shen Dinggang, Wang Huaimin, and Gao Wen, “Late fusion incomplete multi-view clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[91] Tzortzis Grigorios F. and Likas Aristidis C., “The global kernel k-means algorithm for clustering in feature space,” IEEE Transactions on Neural Networks, vol. 20, no. 7, pp. 1181–1194, 2009.
[92] Wang Qiang, Dou Yong, Liu Xinwang, Xia Fei, Lv Qi, and Yang Ke, “Local kernel alignment based multi-view clustering using extreme learning machine,” Neurocomputing, vol. 275, pp. 1099–1111, 2018.
[93] Chen Xiaojun, Xu Xiaofei, Huang Joshua Zhexue, and Ye Yunming, “TW-k-means: Automated two-level variable weighting clustering algorithm for multiview data,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, pp. 932–944, 2013.
[94] Zhou Sihang, Liu Xinwang, Liu Jiyuan, Guo Xifeng, Zhao Yawei, Zhu Wen, Zhai Yongping, Yin Jianping, and Gao Wen, “Multi-view spectral clustering with optimal neighborhood Laplacian matrix,” in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, February 2020, pp. 6965–6972.
[95] Cleuziou Guillaume, Exbrayat Mathieu, Martin Lionel, and Sublemontier Jacques-Henri, “CoFKM: a centralized method for multiple-view clustering,” in Proceedings of the IEEE International Conference on Data Mining, December 2009, pp. 752–757.
[96] Jiang Yizhang, Chung Fu-Lai, Wang Shitong, Deng Zhaohong, Wang Jun, and Qian Pengjiang, “Collaborative fuzzy clustering from multiple weighted views,” IEEE Transactions on Cybernetics, vol. 45, no. 4, pp. 688–701, 2015.
[97] Sun Jiangwen, Lu Jin, Xu Tingyang, and Bi Jinbo, “Multi-view sparse co-clustering via proximal alternating linearized minimization,” in Proceedings of the 32nd International Conference on Machine Learning, July 2015, pp. 757–766.
[98] Hardoon David R., Szedmak Sandor, and Shawe-Taylor John, “Canonical correlation analysis: an overview with application to learning methods,” Neural Computation, vol. 16, no. 12, pp. 2639–2664, 2004.
[99] Chao Guoqing and Sun Shiliang, “Consensus and complementarity based maximum entropy discrimination for multi-view classification,” Information Sciences, vol. 367, no. 11, pp. 296–310, 2016.
[100] Chaudhuri Kamalika, Kakade Sham M., Livescu Karen, and Sridharan Karthik, “Multi-view clustering via canonical correlation analysis,” in Proceedings of the 26th Annual International Conference on Machine Learning, June 2009, pp. 129–136.
[101] Blaschko Matthew B. and Lampert Christoph H., “Correlational spectral clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2008, pp. 1–8.
[102] Rasiwasia Nikhil, Mahajan Dhruv, Mahadevan Vijay, and Aggarwal Gaurav, “Cluster canonical correlation analysis,” in Proceedings of the 31st Annual International Conference on Machine Learning, June 2014, pp. 823–831.
[103] Trivedi Anusua, Rai Piyush, Daumé Hal III, and DuVall Scott L., “Multiview clustering with incomplete views,” in NIPS 2010 Workshop on Machine Learning for Social Computing, Whistler, Canada, 2010.
[104] Long Bo, Yu Philip S., and Zhang Zhongfei, “A general model for multiple view unsupervised learning,” in Proceedings of the 8th SIAM International Conference on Data Mining, April 2008, pp. 822–833.
[105] Wang Qiang, Dou Yong, Liu Xinwang, Lv Qi, and Li Shijie, “Multi-view clustering with extreme learning machine,” Neurocomputing, vol. 214, pp. 483–494, 2016.
[106] Sun Jiangwen, Bi Jinbo, and Kranzler Henry R., “Multi-view biclustering for genotype-phenotype association studies of complex diseases,” in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2013, pp. 316–321.
[107] Sun Jiangwen, Bi Jinbo, and Kranzler Henry R., “Multi-view singular value decomposition for disease subtyping and genetic associations,” BMC Genetics, vol. 15, no. 73, pp. 1–12, 2014.
[108] Lee Mihee, Shen Haipeng, Huang Jianhua Z., and Marron J. S., “Biclustering via sparse singular value decomposition,” Biometrics, vol. 66, pp. 1087–1095, 2010.
[109] Wang Chang-Dong, Lai Jian-Huang, and Yu Philip S., “Multi-view clustering based on belief propagation,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 4, pp. 1007–1021, 2016.
[110] Fan Ruidong, Luo Tingjin, Zhuge Wenzhang, Qiang Sheng, and Hou Chenping, “Multi-view subspace learning via bidirectional sparsity,” Pattern Recognition, vol. 108, 2020.
[111] Cao Xiaochun, Zhang Changqing, Fu Huazhu, and Zhang Hua, “Diversity-induced multi-view subspace clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp. 586–594.
[112] De Sa Virginia R., “Spectral clustering with two views,” in Proceedings of the 22nd International Conference on Machine Learning Workshop on Learning with Multiple Views, June 2005, pp. 20–27.
[113] Zhou Dengyong and Burges Christopher J. C., “Spectral clustering and transductive learning with multiple views,” in Proceedings of the 24th International Conference on Machine Learning, July 2007, pp. 1159–1166.
[114] Xia Rongkai, Pan Yan, Du Lei, and Yin Jian, “Robust multi-view spectral clustering via low-rank and sparse decomposition,” in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 2014, pp. 2149–2155.
[115] Lange Tilman and Buhmann Joachim M., “Fusion of similarity data in clustering,” in Proceedings of the 18th International Conference on Neural Information Processing Systems, December 2005, pp. 723–730.
[116] Zhu Xiaofeng, Zhang Shichao, He Wei, Hu Rongyao, Lei Cong, and Zhu Pengfei, “One-step multi-view spectral clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 10, pp. 2022–2034, 2018.
[117] Liu Xinhai, Ji Shuiwang, Glänzel Wolfgang, and De Moor Bart, “Multiview partitioning via tensor methods,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 5, pp. 1056–1069, 2012.
[118] Zhang Yuanpeng, Chung Fu-Lai, and Wang Shitong, “A multi-view & multi-exemplar fuzzy clustering approach: Theoretical analysis and experimental studies,” IEEE Transactions on Fuzzy Systems, 2018.
[119] Tang Chang, Liu Xinwang, Zhu Xinzhong, Zhu En, Luo Zhigang, Wang Lizhe, and Gao Wen, “CGD: Multi-view clustering via cross-view graph diffusion,” in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, February 2020, pp. 5924–5931.
[120] Li Yingming, Yang Ming, and Zhang Zhongfei, “Multi-view representation learning: A survey from shallow methods to deep methods,” arXiv preprint arXiv:1610.01206v4, 2016.
[121] Bengio Yoshua, Courville Aaron, and Vincent Pascal, “Representation learning: a review and new perspectives,” arXiv preprint arXiv:1206.5538v3, 2012.
[122] Zhuge Wenzhang, Hou Chenping, Liu Xinwang, Tao Hong, and Yi Dongyun, “Simultaneous representation learning and clustering for incomplete multi-view data,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, July 2019, pp. 4482–4488.
[123] Zhuge Wenzhang, Tao Hong, Luo Tingjin, Hou Chenping, and Yi Dongyun, “Joint representation learning and clustering: A framework for grouping partial multiview data,” IEEE Transactions on Knowledge and Data Engineering, 2020.
[124] Srivastava Nitish and Salakhutdinov Ruslan, “Multimodal learning with deep Boltzmann machines,” Journal of Machine Learning Research, vol. 15, pp. 2949–2980, 2014.
[125] Ngiam Jiquan, Khosla Aditya, Kim Mingyu, Nam Juhan, Lee Honglak, and Ng Andrew Y., “Multimodal deep learning,” in Proceedings of the 28th International Conference on Machine Learning, July 2011, pp. 689–696.
[126] Mao Junhua, Xu Wei, Yang Yi, Wang Jiang, Huang Zhiheng, and Yuille Alan, “Deep captioning with multimodal recurrent neural networks (m-RNN),” arXiv preprint arXiv:1412.6632v5, 2014.
[127] Feng Fangxiang, Wang Xiaojie, Li Ruifan, and Ahmad Ibrar, “Correspondence autoencoders for cross-modal retrieval,” in ACM Multimedia, October 2015, pp. 7–16.
[128] Wang Weiran, Arora Raman, Livescu Karen, and Bilmes Jeff, “On deep multi-view representation learning,” in Proceedings of the 32nd International Conference on Machine Learning, July 2015, pp. 1083–1092.
[129] Karpathy Andrej and Li Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664–676, 2017.
[130] Donahue Jeff, Hendricks Lisa Anne, Rohrbach Marcus, Venugopalan Subhashini, Guadarrama Sergio, Saenko Kate, and Darrell Trevor, “Long-term recurrent convolutional networks for visual recognition and description,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 677–691, 2017.
[131] Tian Fei, Gao Bin, Cui Qing, Chen Enhong, and Liu Tie-Yan, “Learning deep representations for graph clustering,” in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 2014, pp. 1293–1299.
[132] Xie Junyuan, Girshick Ross, and Farhadi Ali, “Unsupervised deep embedding for clustering analysis,” in Proceedings of the 33rd International Conference on Machine Learning, June 2016, pp. 478–487.
[133] Li Zhaoyang, Wang Qianqian, Tao Zhiqiang, Gao Quanxue, and Yang Zhaohua, “Deep adversarial multi-view clustering network,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, July 2019, pp. 2952–2958.
[134] Zhou Runwu and Shen Yi-Dong, “End-to-end adversarial-attention network for multi-modal clustering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14619–14628.
[135] Zhu Pengfei, Hui Binyuan, Zhang Changqing, Du Dawei, Wen Longyin, and Hu Qinghua, “Multi-view deep subspace clustering networks,” arXiv preprint arXiv:1908.01978, 2019.
[136] Sun Xiukun, Cheng Miaomiao, Min Chen, and Jing Liping, “Self-supervised deep multi-view subspace clustering,” in Asian Conference on Machine Learning, 2019, pp. 1001–1016.
[137] Vega-Pons Sandro and Ruiz-Shulcloper José, “A survey of clustering ensemble algorithms,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 25, no. 3, pp. 337–372, 2011.
[138] Fern Xiaoli Zhang and Brodley Carla E., “Solving cluster ensemble problems by bipartite graph partitioning,” in Proceedings of the Twenty-First International Conference on Machine Learning, July 2004.
[139] Ozay Mete, Yarman Vural Fatos T., Kulkarni Sanjeev R., and Poor H. Vincent, “Fusion of image segmentation algorithms using consensus clustering,” in Proceedings of the 20th IEEE International Conference on Image Processing, September 2013, pp. 4049–4053.
[140] Lock Eric F. and Dunson David B., “Bayesian consensus clustering,” Bioinformatics, vol. 29, no. 20, pp. 2610–2616, 2013.
[141] Şenbabaoğlu Yasin, Michailidis George, and Li Jun Z., “Critical limitations of consensus clustering in class discovery,” Scientific Reports, vol. 4, Article 6207, 2014.
[142] Liu Hongfu, Wu Junjie, Liu Tongliang, Tao Dacheng, and Fu Yun, “Spectral ensemble clustering via weighted k-means: Theoretical and practical evidence,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 5, pp. 1129–1143, 2017.
[143] Saha Sriparna, Mitra Sayantan, and Kramer Stefan, “Exploring multiobjective optimization for multiview clustering,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 12, no. 4, Article 44, 2018.
[144] Xie Xijiong and Sun Shiliang, “Multi-view clustering ensembles,” in Proceedings of the 2013 International Conference on Machine Learning and Cybernetics, September 2013, pp. 51–56.
[145] Xue Zhe, Du Junping, Du Dawei, and Lyu Siwei, “Deep low-rank subspace ensemble for multi-view clustering,” Information Sciences, vol. 482, pp. 210–227, 2019.
[146] Zhang Jianwen and Zhang Changshui, “Multitask Bregman clustering,” Neurocomputing, vol. 74, no. 10, pp. 1720–1734, 2011.
[147] Zhang Xianchao, Zhang Xiaotong, and Liu Han, “Smart multitask Bregman clustering and multitask kernel clustering,” ACM Transactions on Knowledge Discovery from Data, vol. 10, no. 1, pp. 8:1–8:29, 2015.
[148] Gu Quanquan, Li Zhenhui, and Han Jiawei, “Learning a kernel for multi-task clustering,” in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, August 2011, pp. 368–373.
[149] Zhang Xiao-Lei, “Convex discriminative multitask clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 1, pp. 28–40, 2015.
[150] Zhang Xianchao, Zhang Xiaotong, and Liu Han, “Self-adapted multitask clustering,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, July 2016, pp. 2357–2363.
[151] Ren Yazhou, Que Xiaofan, Yao Dezhong, and Xu Zenglin, “Self-paced multi-task clustering,” Neurocomputing, vol. 350, pp. 212–220, 2019.
[152] Zhang Xianchao, Zhang Xiaotong, and Liu Han, “Multi-task multi-view clustering for non-negative data,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, July 2015, pp. 4055–4061.
[153] Zhang Xianchao, Zhang Xiaotong, Liu Han, and Liu Xinyue, “Multitask multi-view clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3324–3338, 2016.
[154] Ren Yazhou, Yan Xin, Hu Zechuan, and Xu Zenglin, “Self-paced multitask multi-view capped-norm clustering,” in International Conference on Neural Information Processing. Springer, 2018, pp. 205–217.
[155] Sun Shiliang, Shawe-Taylor John, and Mao Liang, “PAC-Bayes analysis of multi-view learning,” Information Fusion, vol. 35, pp. 117–131, 2017.
[156] Yu Shipeng, Krishnapuram Balaji, Rosales Rómer, and Rao R. Bharat, “Bayesian co-training,” Journal of Machine Learning Research, vol. 12, pp. 2649–2680, 2011.
[157] Sindhwani Vikas and Niyogi Partha, “A co-regularized approach to semi-supervised learning with multiple views,” in Proceedings of the ICML Workshop on Learning with Multiple Views, 2005.
[158] Sindhwani Vikas and Rosenberg David S., “An RKHS for multi-view learning and manifold co-regularization,” in Proceedings of the 25th International Conference on Machine Learning, July 2008, pp. 976–983.
[159] Sun Shiliang and Chao Guoqing, “Multi-view maximum entropy discrimination,” in Proceedings of the 23rd International Joint Conference on Artificial Intelligence, August 2013, pp. 1706–1712.
[160] Sun Shiliang and Chao Guoqing, “Alternative multi-view maximum entropy discrimination,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 7, pp. 1445–1456, 2016.
[161] Jin Cheng, Mao Wenhui, Zhang Ruiqi, Zhang Yuejie, and Xue Xiangyang, “Cross-modal image clustering via canonical correlation analysis,” in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 2015, pp. 151–159.
[162] Méndez Carlos Andrés, Summers Paul, and Menegaz Gloria, “Multiview cluster ensembles for multimodal MRI segmentation,” International Journal of Imaging Systems and Technology, vol. 25, no. 1, pp. 56–67, 2015.
[163] Djelouah Abdelaziz, Franco Jean-Sébastien, and Boyer Edmond, “Multi-view object segmentation in space and time,” in Proceedings of the 2013 IEEE International Conference on Computer Vision, December 2013, pp. 2640–2647.
[164] Wu Jianxin and Rehg James M., “CENTRIST: A visual descriptor for scene categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[165] Yu Hui, Li Mingjing, Zhang Hong-Jiang, and Feng Jufu, “Color texture moments for content-based image retrieval,” in Proceedings of the International Conference on Image Processing, September 2002, pp. 929–932.
[166] Dalal Navneet and Triggs Bill, “Histograms of oriented gradients for human detection,” in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2005, pp. 886–893.
[167] Ojala Timo, Pietikäinen Matti, and Mäenpää Topi, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[168] Lowe David G., “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[169] Chi Mingmin, Zhang Peiwu, Zhao Yingbin, Feng Rui, and Xue Xiangyang, “Web image retrieval reranking with multi-view clustering,” in Proceedings of the 18th International Conference on World Wide Web, April 2009, pp. 1189–1190.
[170] Tao Hong, Hou Chenping, Qian Yuhua, Zhu Jubo, and Yi Dongyun, “Latent complete row space recovery for multi-view subspace clustering,” IEEE Transactions on Image Processing, vol. 29, pp. 8083–8096, 2020.
[171] Kim Young-Min, Amini Massih-Reza, Goutte Cyril, and Gallinari Patrick, “Multi-view clustering of multilingual documents,” in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2010, pp. 821–822.
[172] Jiang Yu, Liu Jing, Li Zechao, and Lu Hanqing, “Collaborative PLSA for multi-view clustering,” in Proceedings of the 21st International Conference on Pattern Recognition, November 2012, pp. 2997–3000.
[173] Zhao Peng, Jiang Yuan, and Zhou Zhi-Hua, “Multi-view matrix completion for clustering with side information,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2017, pp. 403–415.
[174] Hussain Syed Fawad, Mushtaq Muhammad, and Halim Zahid, “Multi-view document clustering via ensemble method,” Journal of Intelligent Information Systems, vol. 43, no. 1, pp. 81–99, 2014.
[175] Petkos Georgios, Papadopoulos Symeon, and Kompatsiaris Yiannis, “Social event detection using multimodal clustering and integrating supervisory signals,” in Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, June 2012.
[176] Samangooei Sina, Hare Jonathon S., Dupplaw David, Niranjan Mahesan, Gibbins Nicholas, Lewis Paul H., Davies Jamie, Jain Neha, and Preston John, “Social event detection via sparse multi-modal feature selection and incremental density based clustering,” in MediaEval, 2013.
[177] Petkos Georgios, Papadopoulos Symeon, Schinas Emmanouil, and Kompatsiaris Yiannis, “Graph-based multimodal clustering for social event detection in large collections of images,” in Proceedings of the 20th Anniversary International Conference on MultiMedia Modeling, January 2014, pp. 146–158.
[178] Bekkerman Ron and Jeon Jiwoon, “Multi-modal clustering for multimedia collections,” in Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, June 2007.
[179] Wu Xiao, Ngo Chong-Wah, and Hauptmann Alexander G., “Multimodal news story clustering with pairwise visual near-duplicate constraint,” IEEE Transactions on Multimedia, vol. 10, no. 2, pp. 188–199, 2008.
[180] Mekthanavanh Vinath, Li Tianrui, Meng Hua, Yang Yan, and Hu Jie, “Social web video clustering based on multi-view clustering via nonnegative matrix factorization,” International Journal of Machine Learning and Cybernetics, pp. 1–12, 2019.
[181] Chao Guoqing, Sun Jiangwen, Lu Jin, Wang An-Li, Langleben Daniel D., Li Chiang-Shan, and Bi Jinbo, “Multi-view cluster analysis with incomplete data to understand treatment effects,” Information Sciences, vol. 494, pp. 278–293, 2019.
[182] Yu Shi, Tranchevent Léon-Charles, Liu Xinhai, Glänzel Wolfgang, Suykens Johan A. K., De Moor Bart, and Moreau Yves, “Optimized data fusion for kernel k-means clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 5, pp. 1031–1039, 2012.
[183] Yu Shi, Liu Xinhai, Tranchevent Léon-Charles, Glänzel Wolfgang, Suykens Johan A. K., De Moor Bart, and Moreau Yves, “Optimized data fusion for k-means Laplacian clustering,” Bioinformatics, vol. 27, no. 1, pp. 118–126, 2011.
[184] Li Danping, Wang Lei, Xue Zhong, and Wong Stephen T. C., “When discriminative k-means meets Grassmann manifold: Disease gene identification via a general multi-view clustering method,” in Proceedings of the 2016 IEEE International Conference on Biomedical and Health Informatics, February 2016, pp. 364–367.
[185] Jiang Bin, Sun Hua, Bai Wanjian, Li Hongmei, Wang Yong, Xiong Hongwei, and Wang Ning, “Data analysis of soccer athletes physical fitness test based on multi-view clustering,” in Journal of Physics: Conference Series. IOP Publishing, 2018, vol. 1060, p. 012024.
[186] Rappoport Nimrod and Shamir Ron, “Multi-omic and multi-view clustering algorithms: review and cancer benchmark,” Nucleic Acids Research, vol. 46, no. 20, pp. 10546–10562, 2018.
[187] Lewis David D., Yang Yiming, Rose Tony G., and Li Fan, “RCV1: A new benchmark collection for text categorization research,” Journal of Machine Learning Research, vol. 5, no. 4, pp. 361–397, 2004.
[188] Lu Canyi, Yan Shuicheng, and Lin Zhouchen, “Convex sparse spectral clustering: Single-view to multi-view,” IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2833–2843, 2016.
[189] Cai Deng, He Xiaofei, and Han Jiawei, “Using graph model for face analysis,” Technical Report, 2005.
[190] Cai Deng and Chen Xinlei, “Large scale spectral clustering via landmark-based sparse representation,” IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
[191] Fowlkes Charless, Belongie Serge, Chung Fan, and Malik Jitendra, “Spectral grouping using the Nyström method,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 214–225, 2004.
[192] Yan Donghui, Huang Ling, and Jordan Michael I., “Fast spectral clustering of data using sequential matrix compression,” in Proceedings of the 17th European Conference on Machine Learning, September 2006, pp. 590–597.
[193] Yan Donghui, Huang Ling, and Jordan Michael I., “Fast approximate spectral clustering,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, June 2009, pp. 907–916.
[194] Zhang Zheng, Liu Li, Shen Fumin, Shen Heng Tao, and Shao Ling, “Binary multi-view clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1774–1782, 2018.
[195] Kriegel Hans-Peter, Kröger Peer, and Zimek Arthur, “Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 3, no. 1, Article 1, 2009.
[196] Dy Jennifer G. and Brodley Carla E., “Feature selection for unsupervised learning,” Journal of Machine Learning Research, vol. 5, pp. 845–889, 2004.
[197] Witten Daniela M. and Tibshirani Robert, “A framework for feature selection in clustering,” Journal of the American Statistical Association, vol. 105, no. 490, pp. 713–726, 2010.
[198] Chao Guoqing, Luo Yuan, and Ding Weiping, “Recent advances in supervised dimension reduction: A survey,” Machine Learning and Knowledge Extraction, vol. 1, no. 1, pp. 341–358, 2019.
[199] Fujikawa Yoshikazu and Ho Tu Bao, “Cluster-based algorithms for dealing with missing values,” in Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, May 2002, pp. 549–554.
[200] Su Yu-Sung, Gelman Andrew, Hill Jennifer, and Yajima Masanao, “Multiple imputation with diagnostics (mi) in R: opening windows into the black box,” Journal of Statistical Software, vol. 45, 2011.
[201] Shang Chao, Palmer Aaron, Sun Jiangwen, Chen Ko-Shin, Lu Jin, and Bi Jinbo, “VIGAN: missing view imputation with generative adversarial networks,” in Proceedings of the 2017 IEEE International Conference on Big Data, December 2017, pp. 766–775.
[202] Rai Nishant, Negi Sumit, Chaudhury Santanu, and Deshmukh Om, “Partial multi-view clustering using graph regularized NMF,” in Proceedings of the 23rd International Conference on Pattern Recognition, December 2016, pp. 2192–2197.
[203] Liu Xinwang, Li Miaomiao, Tang Chang, Xia Jingyuan, Xiong Jian, Liu Li, Kloft Marius, and Zhu En, “Efficient and effective incomplete multi-view clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[204] Zhao Qian, Meng Deyu, Jiang Lu, Xie Qi, Xu Zongben, and Hauptmann Alexander G., “Self-paced learning for matrix factorization,” in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 2015, pp. 3196–3202.
[205] Law Marc T., Urtasun Raquel, and Zemel Richard S., “Deep spectral clustering learning,” in Proceedings of the 34th International Conference on Machine Learning, August 2017, pp. 1985–1994.
[206] Hershey John R., Chen Zhuo, Le Roux Jonathan, and Watanabe Shinji, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, March 2016, pp. 31–35.
[207] Song Hyun Oh, Jegelka Stefanie, Rathod Vivek, and Murphy Kevin, “Deep metric learning via facility location,” in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[208] Huang Shudong, Kang Zhao, and Xu Zenglin, “Auto-weighted multi-view clustering via deep matrix decomposition,” Pattern Recognition, Article 107015, 2019.
[209] Tekumalla Lavanya Sita, Rajan Vaibhav, and Bhattacharyya Chiranjib, “Vine copulas for mixed data: multi-view clustering for mixed data beyond meta-Gaussian dependencies,” Machine Learning, vol. 106, no. 9–10, pp. 1331–1357, 2017.
[210] Cui Ying, Fern Xiaoli Z., and Dy Jennifer G., “Non-redundant multi-view clustering via orthogonalization,” in Proceedings of the Seventh IEEE International Conference on Data Mining, October 2007, pp. 133–142.
[211] Niu Donglin, Dy Jennifer G., and Jordan Michael I., “Multiple non-redundant spectral clustering views,” in Proceedings of the 27th International Conference on Machine Learning, June 2010, pp. 831–838.
[212] Chang Yale, Chen Junxiang, Cho Michael H., Castaldi Peter J., Silverman Edwin K., and Dy Jennifer G., “Multiple clustering views from multiple uncertain experts,” in International Conference on Machine Learning, 2017, pp. 674–683.
