Published in final edited form as: J Mach Learn Res. 2020;21:214.

Provable Convex Co-clustering of Tensors

Eric C Chi 1, Brian R Gaines 2, Will Wei Sun 3, Hua Zhou 4, Jian Yang 5

Abstract

Cluster analysis is a fundamental tool for pattern discovery of complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising “blessing of dimensionality” phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulation studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.

Keywords: Clustering, Fused lasso, High-dimensional Statistical Learning, Multiway Data, Non-asymptotic Error

1. Introduction

In this work, we study the problem of finding structure in multiway data, or tensors, via clustering. Tensors appear frequently in modern scientific and business applications involving complex heterogeneous data. For example, data in a neurogenomics study of brain development consists of a 3-way array of expression level measurements indexed by gene, space, and time (Liu et al., 2017). Other examples of 3-way data arrays consisting of matrices collected over time include email communications (sender, recipient, time) (Papalexakis et al., 2013), online chatroom communications (user, keyword, time) (Acar et al., 2006), bike rentals (source station, destination station, time) (Guigourès et al., 2015), and internet network traffic (source IP, destination IP, time) (Sun et al., 2006). The rise in tensor data has created new challenges in making predictions, such as in recommender systems (Zheng et al., 2016; Symeonidis, 2016; Symeonidis and Zioupos, 2016; Frolov and Oseledets, 2017; Bi et al., 2018), as well as in inferring latent structure in multiway data (Acar and Yener, 2009; Anandkumar et al., 2014; Cichocki et al., 2015; Sidiropoulos et al., 2017).

As tensors become increasingly more common, the need for a reliable co-clustering method grows increasingly more urgent. Prevalent clustering methods, however, mainly focus on vector or matrix-variate data. The goal of vector clustering is to identify subgroups within the vector-variate observations (Ma and Zhong, 2008; Shen and Huang, 2010; Shen et al., 2012; Wang et al., 2013). Biclustering is the extension of clustering to two-way data where both the observations (rows) and the features (columns) of a data matrix are simultaneously grouped together (Hartigan, 1972; Madeira and Oliveira, 2004; Busygin et al., 2008). In spite of their prevalence, these approaches are not directly applicable to the cluster analysis of general-order (general-way) tensors. On the other hand, existing methods for co-clustering general D-way arrays, for D ≥ 3, employ one of three strategies: (i) extensions of spectral clustering to tensors (Wu et al., 2016b), (ii) directly clustering the subarrays along each dimension, or way, of the tensor using either k-means or variants of it (Jegelka et al., 2009), and (iii) low rank tensor decompositions (Sun et al., 2009; Papalexakis et al., 2013; Zhao et al., 2016). While all these existing approaches may demonstrate good empirical performance, they have limitations. For instance, the spectral co-clustering method proposed by Wu et al. (2016b) is limited to nonnegative tensors, and the CoTeC method proposed by Jegelka et al. (2009), like k-means, requires specifying the number of clusters along each dimension as a tuning parameter. Most importantly, none of the existing methods provide statistical guarantees for recovering an underlying co-clustering structure. There is a conspicuous gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the non-convex formulations of the previously mentioned works.

In this paper, we propose a Convex Co-clustering (CoCo) procedure that solves a convex formulation of the problem of co-clustering a D-way array for D ≥ 3. Our proposed CoCo estimator affords the following advantages over existing tensor co-clustering methods.

  1. Under modest assumptions on the data generating process, the CoCo estimator is guaranteed to recover an underlying co-clustering structure with high probability. In particular, we establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising “blessing of dimensionality” phenomenon: As the dimensions of the array increase, the CoCo estimator is still consistent even if the number of underlying co-clusters grows as a function of the number of elements in the tensor sample. More importantly, an underlying co-clustering structure can be consistently recovered with even a single tensor sample, which is a typical case in real applications. This phenomenon does not exist in vector or matrix-variate cluster analysis.

  2. The CoCo estimator possesses stability guarantees. In particular, the CoCo estimator is Lipschitz continuous in the data and jointly continuous in the data and its tuning parameter. We emphasize that Lipschitz continuity in the data guarantees that perturbations in the data lead to graceful and commensurate variations in the cluster assignments, and the continuity in the tuning parameter can be leveraged to expedite computation through warm starts.

  3. The CoCo estimator can be iteratively computed with convergence guarantees via an accelerated first order method with storage and per-iteration cost that is linear in the size of the data.

In short, the CoCo estimator comes with (i) statistical guarantees, (ii) practically relevant stability guarantees at all sample sizes, and (iii) an algorithm with polynomial complexity. The theoretical properties of our CoCo estimator are supported by extensive simulation studies. To demonstrate its business impact, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to help advertising planning.

Our work is related to, but also clearly distinct from, a number of recent developments in cluster analysis. The first related line of research tackles convex clustering (Hocking et al., 2011; Zhu et al., 2014; Chi and Lange, 2015; Chen et al., 2015; Tan and Witten, 2015; Wang et al., 2018; Radchenko and Mukherjee, 2017) and convex biclustering (Chi et al., 2017). These existing methods are not directly applicable to general-order tensors, however. Importantly, our CoCo estimator enjoys a unique “blessing of dimensionality” phenomenon that has not been established in the aforementioned approaches. Moreover, the CoCo estimator is similar in spirit to a recent series of works that approximate a noisy observed array with an array that is smooth with respect to some latent organization associated with each dimension of the array (Gavish and Coifman, 2012; Ankenman, 2014; Mishne et al., 2016; Yair et al., 2017). Our proposed CoCo procedure seeks an approximating array that is smooth with respect to a latent clustering along each dimension of the array. While CoCo shares features with these array approximation techniques, namely the use of data-driven similarity graphs along tensor modes, a key distinction between our CoCo estimator and these methods is that CoCo produces an approximating array that explicitly recovers hard co-clustering assignments. As we will see shortly, focusing our attention in this work on the co-clustering model paves the way to the discovery and explicit characterization of new and interesting fundamental behavior in finding intrinsic organization within tensors.

The rest of the paper is organized as follows. In Section 2, we review standard facts and results about tensors that we will use. In Section 3, we introduce our convex formulation of the co-clustering problem. In Section 4, we establish the stability properties and prediction error bounds of the CoCo estimator. In Section 5, we describe the algorithm used to compute the CoCo estimator. In Section 6, we discuss how to specify weights used in our CoCo estimator, and in Section 7 we give guidance on how to set and select tuning parameters used in the CoCo estimator in practice. In Section 8, we present simulation results. In Section 9, we discuss the results of applying the CoCo estimator to co-cluster a real data tensor from online advertising. In Section 10, we close with a discussion. The Appendix contains a brief review of the two main tensor decompositions that are discussed in this paper, all technical proofs, as well as additional experiments.

2. Preliminaries

2.1. Notation

We adopt the terminology and notation used by Kolda and Bader (2009). We call the number of ways or modes of a tensor its order. Vectors are tensors of order one and denoted by boldface lowercase letters, e.g. a. Matrices are tensors of order two and denoted by boldface capital letters, e.g. A. Tensors of higher-order, namely order three and greater, we denote by boldface Euler script letters, e.g. A. Thus, if A represents a D-way data array of size n1 × n2 × ··· × nD, we say A is a tensor of order D. We denote scalars by lowercase letters, e.g. a. We denote the ith element of a vector a by ai, the ijth element of a matrix A by aij, the ijkth element of a third-order tensor A by aijk, and so on.

We can extract a subarray of a tensor by fixing a subset of its indices. For example, by fixing the first index of a matrix to be i, we extract the ith row of the matrix, and by fixing the second index of a matrix to be j, we extract the jth column of the matrix. We use a colon to indicate all elements of a mode. Consequently, we denote the ith row of a matrix A by Ai: and the jth column of a matrix A by A:j. Fibers are the subarrays of a tensor obtained by fixing all but one of its indices. In the case of a matrix, a mode-1 fiber is a matrix column and a mode-2 fiber is a matrix row. Slices are the two-dimensional subarrays of a tensor obtained by fixing all but two indices. For example, a third-order tensor A has three sets of slices denoted by Ai::, A:j:, and A::k.

2.2. Basic Tensor Operations

It is often convenient to reorder the elements of a D-way array into a matrix or a vector. Reordering a tensor's elements into a matrix is referred to as matricization, while reordering its elements into a vector is referred to as vectorization. There are many ways to reorder a tensor into a matrix or vector. In this paper, we use a canonical mode-d matricization, where the mode-d fibers of a D-way tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_D}$ become the columns of a matrix $\mathbf{A}_{(d)} \in \mathbb{R}^{n_d \times n_{-d}}$, where $n_{-d} = \prod_{j \neq d} n_j$. Recall that the column-major vectorization of a matrix maps a matrix $\mathbf{A} \in \mathbb{R}^{p \times q}$ to the vector $\mathbf{a} \in \mathbb{R}^{pq}$ by stacking the columns of $\mathbf{A}$ on top of each other, namely $\mathbf{a} = (\mathbf{A}_{:1}^{\mathsf{T}}, \mathbf{A}_{:2}^{\mathsf{T}}, \ldots, \mathbf{A}_{:q}^{\mathsf{T}})^{\mathsf{T}} \in \mathbb{R}^{pq}$. In this paper, we take the vectorization of a D-way tensor $\mathcal{A}$, denoted $\mathrm{vec}(\mathcal{A})$, to be the column-major vectorization of the mode-1 matricization of $\mathcal{A}$, namely $\mathrm{vec}(\mathcal{A}) = \mathrm{vec}(\mathbf{A}_{(1)}) \in \mathbb{R}^{n}$, where $n = \prod_d n_d$ is the total number of elements in $\mathcal{A}$. As a shorthand, when the context leaves no ambiguity, we denote this vectorization of a tensor $\mathcal{A}$ by its boldface lowercase version $\mathbf{a}$.

The Frobenius norm of a D-way tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_D}$ is the natural generalization of the Frobenius norm of a matrix, namely it is the square root of the sum of the squares of all its elements,

$$\|\mathcal{A}\|_{\mathrm{F}} = \sqrt{\sum_{i_1=1}^{n_1}\sum_{i_2=1}^{n_2}\cdots\sum_{i_D=1}^{n_D} a_{i_1 i_2 \cdots i_D}^2}.$$

The Frobenius norm of a tensor is equivalent to the 2-norm of the vectorization of the tensor, namely $\|\mathcal{A}\|_{\mathrm{F}} = \|\mathbf{a}\|_2$.

Let $\mathcal{A}$ be a tensor in $\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_D}$ and $\mathbf{B}$ be a matrix in $\mathbb{R}^{m \times n_d}$. The d-mode (matrix) product of the tensor $\mathcal{A}$ with the matrix $\mathbf{B}$, denoted by $\mathcal{A} \times_d \mathbf{B}$, is the tensor of size $n_1 \times \cdots \times n_{d-1} \times m \times n_{d+1} \times \cdots \times n_D$ whose $(i_1, \ldots, i_{d-1}, j, i_{d+1}, \ldots, i_D)$th element is given by

$$(\mathcal{A} \times_d \mathbf{B})_{i_1 \cdots i_{d-1}\, j\, i_{d+1} \cdots i_D} = \sum_{i_d=1}^{n_d} a_{i_1 i_2 \cdots i_D}\, b_{j i_d},$$

for $j \in \{1, \ldots, m\}$. The vectorization of the d-mode product $\mathcal{A} \times_d \mathbf{B}$ can be expressed as

$$\mathrm{vec}(\mathcal{A} \times_d \mathbf{B}) = (\mathbf{I}_{n_D} \otimes \cdots \otimes \mathbf{I}_{n_{d+1}} \otimes \mathbf{B} \otimes \mathbf{I}_{n_{d-1}} \otimes \cdots \otimes \mathbf{I}_{n_1})\,\mathbf{a}, \tag{1}$$

where $\mathbf{I}_p$ is the p-by-p identity matrix and $\otimes$ denotes the Kronecker product between two matrices. The identity given in (1) generalizes the well-known formula for the column-major vectorization of a product of two matrices, namely $\mathrm{vec}(\mathbf{B}\mathbf{A}) = (\mathbf{I} \otimes \mathbf{B})\mathbf{a}$.
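To make these conventions concrete, the following minimal NumPy sketch (our own helper functions `unfold`, `fold`, `vec`, and `mode_product`, not code from the paper) implements the mode-d matricization, the vectorization, and the d-mode product, and numerically verifies the Kronecker identity (1) on a small random tensor.

```python
import numpy as np

def unfold(T, d):
    """Mode-d matricization A_(d): the mode-d fibers become the columns."""
    return np.reshape(np.moveaxis(T, d, 0), (T.shape[d], -1), order="F")

def fold(M, d, shape):
    """Inverse of unfold for a tensor of the given shape."""
    moved = [shape[d]] + [s for k, s in enumerate(shape) if k != d]
    return np.moveaxis(np.reshape(M, moved, order="F"), 0, d)

def vec(T):
    """Column-major vectorization of the mode-1 matricization."""
    return np.reshape(T, -1, order="F")

def mode_product(T, B, d):
    """d-mode product T x_d B, computed via the mode-d unfolding."""
    shape = list(T.shape)
    shape[d] = B.shape[0]
    return fold(B @ unfold(T, d), d, tuple(shape))

# Numerical check of identity (1): apply B to mode 2 (d = 1 in 0-based indexing).
rng = np.random.default_rng(0)
n1, n2, n3, m = 3, 4, 2, 5
A = rng.normal(size=(n1, n2, n3))
B = rng.normal(size=(m, n2))

lhs = vec(mode_product(A, B, 1))

# Kronecker factors, leftmost first: I_{n3} (x) B (x) I_{n1}.
K = np.kron(np.kron(np.eye(n3), B), np.eye(n1))
rhs = K @ vec(A)

assert np.allclose(lhs, rhs)
```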

3. A Convex Formulation of Co-clustering

We first consider a convex formulation of the co-clustering problem when the data is a 3-way tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ before discussing the natural generalization to D-way tensors. Our basic assumption is that the observed data tensor is a noisy realization of an underlying tensor that exhibits a checkerbox structure modulo some unknown reordering along each of its modes. Specifically, suppose that there are $k_1$, $k_2$, and $k_3$ clusters along modes 1, 2, and 3, respectively. If the $(i_1, i_2, i_3)$th entry of $\mathcal{X}$ belongs to the cluster defined by the $r_1$th mode-1 group, $r_2$th mode-2 group, and $r_3$th mode-3 group, then we assume that the observed tensor element $x_{i_1 i_2 i_3}$ is given by

$$x_{i_1 i_2 i_3} = c^*_{r_1 r_2 r_3} + \epsilon_{i_1 i_2 i_3}, \tag{2}$$

where $c^*_{r_1 r_2 r_3}$ is the mean of the co-cluster defined by the $r_1$th mode-1 partition, $r_2$th mode-2 partition, and $r_3$th mode-3 partition, and $\epsilon_{i_1 i_2 i_3}$ are noise terms. We will specify a joint distribution on the noise terms later in Section 4.2 in order to derive prediction bounds. Thus, we model the observed tensor $\mathcal{X}$ as the sum of a mean tensor $\mathcal{U}^* \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, whose elements are expanded from the co-cluster means tensor $\mathcal{C}^* \in \mathbb{R}^{k_1 \times k_2 \times k_3}$, and a noise tensor $\mathcal{E} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. We can write this expansion explicitly by introducing a membership matrix $\mathbf{M}_d \in \{0,1\}^{n_d \times k_d}$ for the dth mode, where the $ik$th element of $\mathbf{M}_d$ is one if and only if the ith mode-d slice belongs to the kth mode-d cluster for $k \in \{1, \ldots, k_d\}$. We require that each row of the membership matrix sum to one, namely $\mathbf{M}_d\mathbf{1} = \mathbf{1}$, to ensure that each of the mode-d slices belongs to exactly one of the $k_d$ mode-d clusters. Then,

$$\mathcal{U}^* = \mathcal{C}^* \times_1 \mathbf{M}_1 \times_2 \mathbf{M}_2 \times_3 \mathbf{M}_3.$$

Figure 1 illustrates an underlying mean tensor U* after permuting the slices along each of the modes to reveal a checkerbox structure.

Figure 1: A 3-way tensor with a checkerbox structure.
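The expansion of the co-cluster means by the membership matrices amounts to indexing $\mathcal{C}^*$ by each slice's cluster label. The following short NumPy sketch generates such a checkerbox mean tensor and a noisy observation as in model (2); the cluster counts, dimensions, and noise level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

k = (2, 3, 2)          # hypothetical numbers of clusters k1, k2, k3
n = (6, 9, 4)          # tensor dimensions n1, n2, n3

C = rng.normal(size=k)                                            # co-cluster means C*
labels = [rng.integers(k_d, size=n_d) for n_d, k_d in zip(n, k)]  # mode-d cluster labels

# Expanding C* by the membership matrices M_d is equivalent to indexing C* by
# the cluster label of each slice: U*[i1, i2, i3] = C*[r1, r2, r3].
U_star = C[np.ix_(labels[0], labels[1], labels[2])]

X = U_star + 0.5 * rng.normal(size=n)   # noisy observation, as in model (2)
```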

The co-clustering model in (2) is the 3-way analogue of the checkerboard mean model often employed in biclustering data matrices (Madeira and Oliveira, 2004; Tan and Witten, 2014; Chi et al., 2017). Moreover, the tensor C* of co-cluster means corresponds to the tensor of cluster “centers” in the tensor clustering work by Jegelka et al. (2009). The model is complete and exclusive in that each tensor element is assigned to exactly one co-cluster. This is in contrast to models that allow potentially overlapping co-clusters (Lazzeroni and Owen, 2002; Bergmann et al., 2003; Turner et al., 2005; Huang et al., 2008; Witten et al., 2009; Lee et al., 2010; Sill et al., 2011; Bhar et al., 2015).

Estimating the model in (2) consists of finding (i) the partitions along each mode and (ii) the mean values of each of the $k_1 k_2 k_3$ co-clusters. Estimating $c^*_{r_1 r_2 r_3}$ given the mode clustering assignments is trivial. Let $G_1$, $G_2$, and $G_3$ denote the indices of the $r_1$th mode-1, $r_2$th mode-2, and $r_3$th mode-3 groups, respectively. If the noise terms $\epsilon_{i_1 i_2 i_3}$ are iid $N(0, \sigma^2)$ for some positive $\sigma^2$, then the maximum likelihood estimate of $c^*_{r_1 r_2 r_3}$ is simply the sample mean of the entries of $\mathcal{X}$ over the indices defined by $G_1$, $G_2$, and $G_3$, namely

$$\hat{c}^*_{r_1 r_2 r_3} = \frac{1}{|G_1|\,|G_2|\,|G_3|}\sum_{i_1\in G_1}\sum_{i_2\in G_2}\sum_{i_3\in G_3} x_{i_1 i_2 i_3}.$$
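A direct translation of this sample-mean formula into NumPy for a 3-way tensor, assuming the mode partitions are known, might look as follows (the helper name `cocluster_means` is ours).

```python
import numpy as np

def cocluster_means(X, groups):
    """Sample mean of each co-cluster of a 3-way tensor.

    groups[d][r] is the list of mode-d indices belonging to the r-th mode-d
    group. Returns a k1 x k2 x k3 tensor of estimated co-cluster means.
    """
    k = tuple(len(g) for g in groups)
    C_hat = np.empty(k)
    for r1, G1 in enumerate(groups[0]):
        for r2, G2 in enumerate(groups[1]):
            for r3, G3 in enumerate(groups[2]):
                C_hat[r1, r2, r3] = X[np.ix_(G1, G2, G3)].mean()
    return C_hat
```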

Finding the partitions $G_1$, $G_2$, and $G_3$, on the other hand, is a combinatorially hard problem. In recent years, however, many combinatorially hard problems that initially appear computationally intractable have been successfully attacked by solving a convex relaxation of the original combinatorial optimization problem. Perhaps the most celebrated convex relaxation is the lasso (Tibshirani, 1996), which simultaneously performs variable selection and parameter estimation for fitting sparse regression models by minimizing a non-smooth convex criterion.

In light of the lasso’s success, we propose to simultaneously identify partitions along the modes of X and estimate the co-cluster means by minimizing the following convex objective function

$$F_\gamma(\mathcal{U}) = \frac{1}{2}\|\mathcal{X} - \mathcal{U}\|_{\mathrm{F}}^2 + \gamma\underbrace{\left[R_1(\mathcal{U}) + R_2(\mathcal{U}) + R_3(\mathcal{U})\right]}_{R(\mathcal{U})}, \tag{3}$$

where

$$\begin{aligned}
R_1(\mathcal{U}) &= \sum_{i<j} w_{1,ij}\,\|\mathcal{U}_{i::} - \mathcal{U}_{j::}\|_{\mathrm{F}}, \\
R_2(\mathcal{U}) &= \sum_{i<j} w_{2,ij}\,\|\mathcal{U}_{:i:} - \mathcal{U}_{:j:}\|_{\mathrm{F}}, \\
R_3(\mathcal{U}) &= \sum_{i<j} w_{3,ij}\,\|\mathcal{U}_{::i} - \mathcal{U}_{::j}\|_{\mathrm{F}}.
\end{aligned}$$

By seeking the minimizer $\hat{\mathcal{U}}_\gamma \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ of (3), we have cast co-clustering as a signal approximation problem, modeled as a penalized regression, to estimate the true co-cluster means tensor $\mathcal{U}^*$. In the following discussion, we drop the dependence on $\gamma$ in $\hat{\mathcal{U}}_\gamma$ and denote our estimator by $\hat{\mathcal{U}}$ when there is no confusion. The quadratic term in (3) quantifies how well $\mathcal{U}$ approximates $\mathcal{X}$, while the regularization term $R(\mathcal{U})$ in (3) penalizes deviations away from a checkerbox pattern. The nonnegative parameter $\gamma$ tunes the relative emphasis on these two terms. The parameters $w_{d,ij}$ are nonnegative weights whose purpose will be discussed shortly.
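For concreteness, here is a small Python sketch (our own helper, not the paper's software) that evaluates the objective (3) for a candidate $\mathcal{U}$ when the weights are stored as sparse dictionaries over slice pairs.

```python
import numpy as np

def coco_objective_3way(X, U, gamma, weights):
    """Evaluate the convex co-clustering objective (3) for a 3-way tensor.

    weights[d] maps a pair (i, j) with i < j to the nonnegative weight
    w_{d,ij}; pairs that are absent are treated as zero-weight edges.
    """
    fit = 0.5 * np.sum((X - U) ** 2)              # 0.5 * ||X - U||_F^2
    penalty = 0.0
    for d in range(3):
        Ud = np.moveaxis(U, d, 0)                 # Ud[i] is the i-th mode-d slice
        for (i, j), w in weights[d].items():
            penalty += w * np.linalg.norm(Ud[i] - Ud[j])   # Frobenius norm of the difference
    return fit + gamma * penalty
```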

To appreciate how the regularization term $R(\mathcal{U})$ steers the minimizer of (3) towards a checkerbox pattern, consider the effect of one of the terms $R_d(\mathcal{U})$ in isolation. Specifically, suppose that $R(\mathcal{U}) = R_1(\mathcal{U})$. When $\gamma$ is zero, the minimum of (3) is attained when $\mathcal{U} = \mathcal{X}$, or stated another way, $\mathcal{U}_{i::} = \mathcal{X}_{i::}$ for $i \in \{1, \ldots, n_1\}$. As $\gamma$ increases, the mode-1 slices $\mathcal{U}_{i::}$ will shrink towards each other and in fact coalesce due to the non-differentiability of the Frobenius norm at zero. In other words, as $\gamma$ gets larger, the pairwise differences of the mode-1 slices of $\hat{\mathcal{U}}$ will become increasingly sparse. Sparsity in these pairwise differences leads to a natural partitioning assignment: two mode-1 slices $\mathcal{X}_{i::}$ and $\mathcal{X}_{j::}$ are assigned to the same mode-1 partition if $\hat{\mathcal{U}}_{i::} = \hat{\mathcal{U}}_{j::}$. Under mild regularity conditions, which we will spell out in Section 4, for sufficiently large $\gamma$ all mode-1 slices of $\hat{\mathcal{U}}$ will be identical and therefore belong to a single cluster. Similar behavior holds if $R(\mathcal{U}) = R_2(\mathcal{U})$ or $R(\mathcal{U}) = R_3(\mathcal{U})$.

When $R(\mathcal{U})$ includes all three terms $R_d(\mathcal{U})$ for $d = 1, 2, 3$, pairs of mode-1, mode-2, and mode-3 slices are simultaneously shrunk towards each other and coalesce as the parameter $\gamma$ increases. By coupling clustering along each of the modes simultaneously, our formulation explicitly seeks out a solution with a checkerbox mean structure. Moreover, we will show in Section 4 that the estimator $\hat{\mathcal{U}}$ traces out an entire solution path of checkerbox co-clustering estimates that varies continuously in $\gamma$. The solution path spans a range of models from the least smoothed model, where $\hat{\mathcal{U}}$ is $\mathcal{X}$ and each tensor element occupies its own co-cluster, to the most smoothed model, where all the elements of $\hat{\mathcal{U}}$ are identical and all tensor elements belong to a single co-cluster.

The nonnegative weights $w_{d,ij}$ fine-tune the shrinkage of the slices along the dth mode. For example, if $w_{1,ij} > w_{1,i'j'}$, then there will be more pressure for $\mathcal{U}_{i::}$ and $\mathcal{U}_{j::}$ to fuse than for $\mathcal{U}_{i'::}$ and $\mathcal{U}_{j'::}$ to fuse as $\gamma$ increases. Thus, the weight $w_{d,ij}$ quantifies the similarity between the ith and jth mode-d slices. A very large $w_{d,ij}$ indicates that the two slices are very similar, while a very small $w_{d,ij}$ indicates that they are very dissimilar. These pairwise similarities motivate a graphical view of clustering. For the dth mode, define the set $\mathcal{E}_d$ as the edge set of a similarity graph. Each slice is a node in the graph, and the set $\mathcal{E}_d$ contains an edge $(i, j)$ if and only if $w_{d,ij} > 0$. Figure 2 shows an example of a mode-1 similarity graph, which corresponds to a tensor with seven mode-1 slices and positive weights that define the edge set

$$\mathcal{E}_1 = \{(1,2), (2,3), (4,5), (4,6), (6,7)\}.$$

Given the connectivity of the graph, as $\gamma$ increases, the slices $\mathcal{U}_{1::}$, $\mathcal{U}_{2::}$, and $\mathcal{U}_{3::}$ will be shrunk towards each other, while the slices $\mathcal{U}_{4::}$, $\mathcal{U}_{5::}$, $\mathcal{U}_{6::}$, and $\mathcal{U}_{7::}$ will be shrunk towards each other. Since $w_{d,ij} = 0$ for any $(i,j) \notin \mathcal{E}_d$, we can express the penalty term for the dth mode as

$$R_d(\mathcal{U}) = \sum_{(i,j)\in\mathcal{E}_d} w_{d,ij}\,\|\mathcal{U}_{i::} - \mathcal{U}_{j::}\|_{\mathrm{F}}.$$

Figure 2: A graph that summarizes the similarities between pairs of the mode-1 subarrays. Only edges with positive weight are drawn.

The graph in Figure 2 makes it readily apparent that the convex objective in (3) separates over the connected components of the similarity graph for the mode-d slices. Consequently, one can solve for the optimal $\mathcal{U}$ component by component. Without loss of generality, we assume that the weights are such that all the similarity graphs are connected. Before leaving this preliminary description of the weights, however, we want to emphasize that in practice the weights are set once in a data-adaptive manner and should be considered empirically chosen hyper-parameters rather than tuning parameters. Further discussion of the weights and practical recommendations for specifying them are given in Section 6.

Having familiarized ourselves with the convex co-clustering of a 3-way array, we now present the natural extension of (3) for clustering the fibers of a general higher-order tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_D}$ along all its D modes. Let $\Delta_{d,ij} = \mathbf{e}_i^{\mathsf{T}} - \mathbf{e}_j^{\mathsf{T}}$, where $\mathbf{e}_i$ is the ith standard basis vector in $\mathbb{R}^{n_d}$. The objective function of our convex co-clustering for a general higher-order tensor is as follows.

$$F_\gamma(\mathcal{U}) = \frac{1}{2}\|\mathcal{X} - \mathcal{U}\|_{\mathrm{F}}^2 + \gamma\sum_{d=1}^{D}\sum_{(i,j)\in\mathcal{E}_d} w_{d,ij}\,\|\mathcal{U} \times_d \Delta_{d,ij}\|_{\mathrm{F}}. \tag{4}$$

The difference between the convex triclustering objective (3) and the general convex co-clustering objective (4) is in the penalty terms. Previously, in (3), we penalized the differences between pairs of slices, whereas in (4) we penalize the differences between pairs of mode-d subarrays.

Note that the function Fγ(U) defined in (4) has a unique global minimizer. This follows immediately from the fact that Fγ(U) is strongly convex. The unique global minimizer of Fγ(U) is our proposed CoCo estimator, which is denoted by U^ for the remainder of the paper.

At times it will be more convenient to work with vectors rather than tensors. By applying the identity in (1), we can rewrite the objective function in (4) in terms of the vectorizations of U and X as follows

$$F_\gamma(\mathbf{u}) = \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 + \gamma\sum_{d=1}^{D}\sum_{(i,j)\in\mathcal{E}_d} w_{d,ij}\,\|\mathbf{A}_{d,ij}\mathbf{u}\|_2, \tag{5}$$

where $\mathbf{A}_{d,ij}$ is the $n_{-d}$-by-$n$ matrix

$$\mathbf{A}_{d,ij} = \mathbf{I}_{n_D} \otimes \cdots \otimes \mathbf{I}_{n_{d+1}} \otimes \Delta_{d,ij} \otimes \mathbf{I}_{n_{d-1}} \otimes \cdots \otimes \mathbf{I}_{n_1}, \tag{6}$$

and $\mathbf{I}_{n_d}$ is the $n_d$-by-$n_d$ identity matrix. We will refer to the unique global minimizer of (5), $\hat{\mathbf{u}} = \arg\min_{\mathbf{u}} F_\gamma(\mathbf{u})$, as the vectorized version of our CoCo estimator.
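The matrices $\mathbf{A}_{d,ij}$ are never needed explicitly by an efficient implementation, but building them directly from (6) is a useful sanity check. The sketch below (our own helper, using scipy.sparse) assembles $\mathbf{A}_{d,ij}$ as a sparse Kronecker product.

```python
import scipy.sparse as sp

def A_dij(shape, d, i, j):
    """Sparse A_{d,ij} from (6) for a tensor with dimensions shape = (n1, ..., nD).

    Delta_{d,ij} = e_i^T - e_j^T is 1 x n_d, so A_{d,ij} has n / n_d rows and
    n columns; applied to the column-major vectorization of U it returns the
    vectorized difference of the i-th and j-th mode-d subarrays.
    """
    D = len(shape)
    delta = sp.lil_matrix((1, shape[d]))
    delta[0, i], delta[0, j] = 1.0, -1.0
    # Assemble the Kronecker product from the leftmost factor (mode D) downward.
    factors = [delta.tocsr() if m == d else sp.identity(shape[m], format="csr")
               for m in reversed(range(D))]
    A = factors[0]
    for F in factors[1:]:
        A = sp.kron(A, F, format="csr")
    return A
```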

Remark 1 The fusion penalties $R_d(\mathcal{U})$ are a composition of the group lasso (Yuan and Lin, 2006) and the fused lasso (Tibshirani et al., 2005), a special case of the generalized lasso (Tibshirani and Taylor, 2011). When only a single mode is being clustered and only one of the terms $R_d(\mathcal{U})$ is employed, we recover the objective function of the convex clustering problem (Pelckmans et al., 2005; She, 2010; Lindsten et al., 2011; Hocking et al., 2011; Sharpnack et al., 2012; Zhu et al., 2014; Chi and Lange, 2015; Radchenko and Mukherjee, 2017). Most prior work on convex clustering employs an element-wise ℓ1-norm penalty on pairwise differences, as in the original fused lasso; however, the ℓ2-norm and ℓ∞-norm have also been considered (Hocking et al., 2011; Chi and Lange, 2015). In this paper, we restrict ourselves to the ℓ2-norm for two reasons. First, the ℓ2-norm is rotationally invariant. In general, we are reluctant to adopt a procedure whose co-clustering output may change non-trivially when the coordinate representation of the data along one of its modes is trivially changed. Second, the ℓ2-norm promotes group-wise shrinkage of the pairwise differences of subarrays along each mode, leading to more straightforward partitioning along each mode: pairwise differences are either exactly zero or not. When the tensor is a matrix and the rows and columns are being simultaneously clustered, we recover the objective function of the convex biclustering problem (Chi et al., 2017). In general, the fusion penalties $R_d(\mathcal{U})$ shrink solutions toward vector-valued functions that are piece-wise constant over the mode-d similarity graph defined by the weights $w_{d,ij}$. Viewed this way, we can see our approach as simultaneously performing the network lasso (Hallac et al., 2015) on D similarity graphs.

Remark 2 The CoCo estimator is invariant to permutations of the data tensor $\mathcal{X}$ in the following sense. Suppose $\hat{\mathcal{U}}$ and $\hat{\mathcal{U}}'$ are the CoCo estimators when the data tensors are respectively $\mathcal{X}$ and $\mathcal{X}' = \mathcal{X}\times_1\boldsymbol{\Pi}_1\times_2\cdots\times_D\boldsymbol{\Pi}_D$, where $\boldsymbol{\Pi}_1\in\{0,1\}^{n_1\times n_1},\ldots,\boldsymbol{\Pi}_D\in\{0,1\}^{n_D\times n_D}$ are permutation matrices, namely $\boldsymbol{\Pi}_d^{\mathsf{T}}\boldsymbol{\Pi}_d = \mathbf{I}$. In words, $\mathcal{X}'$ can be obtained from $\mathcal{X}$ by permuting the subarrays of $\mathcal{X}$ along the dth mode according to $\boldsymbol{\Pi}_d$ for $d = 1, \ldots, D$, and $\mathcal{X}$ can be recovered from $\mathcal{X}'$ by permuting along the dth mode according to $\boldsymbol{\Pi}_d^{\mathsf{T}}$ for $d = 1, \ldots, D$. Since $\|\mathcal{U}\times_1\boldsymbol{\Pi}_1\times_2\cdots\times_D\boldsymbol{\Pi}_D\|_{\mathrm{F}} = \|\mathcal{U}\|_{\mathrm{F}}$, it follows that

$$\hat{\mathcal{U}}' = \hat{\mathcal{U}}\times_1\boldsymbol{\Pi}_1\times_2\cdots\times_D\boldsymbol{\Pi}_D \quad\text{and}\quad \hat{\mathcal{U}} = \hat{\mathcal{U}}'\times_1\boldsymbol{\Pi}_1^{\mathsf{T}}\times_2\cdots\times_D\boldsymbol{\Pi}_D^{\mathsf{T}}.$$

Permutation invariance is important because it means that the CoCo estimator is essentially unaltered by any reshuffling along the modes of the data tensor.

Remark 3 Given the co-clustering structure assumed in (2), one may wonder how much is added by explicitly seeking a co-clustering over clustering along each mode independently. In other words, why not solve D independent convex clustering problems with $R(\mathcal{U}) = R_d(\mathcal{U})$? To provide some intuition on why co-clustering should be preferred over independently clustering each mode, consider the following problem. Imagine trying to cluster row vectors $\mathbf{x}_i \in \mathbb{R}^{10{,}000}$ for $i = 1, \ldots, 100$ drawn from a two-component mixture of Gaussians, namely

$$\mathbf{x}_i \overset{\text{iid}}{\sim} \tfrac{1}{2}N(\boldsymbol{\mu}, \sigma^2\mathbf{I}) + \tfrac{1}{2}N(\boldsymbol{\nu}, \sigma^2\mathbf{I}).$$

This is a challenging clustering problem due to the disproportionately small number of observations compared to the number of features. If, however, we were told that μj = μ1 and νj = ν1 for j = 1, … , 5,000 and μj = μ2 and νj = ν2 for j = 5,001, … , 10,000, in other words that the features were clustered into two groups, our fortunes have reversed and we now have an abundance of observations compared to the number of effective features. Even if we lack a clear-cut clustering structure in the features, this example suggests that leveraging similarity structure along the columns can expedite identifying similarity structure along the rows, and vice versa. Indeed, if there is an underlying checkerbox mean tensor, we may expect that simultaneously clustering along each mode should make the task of clustering along any one given mode easier. Our prediction error result presented in Section 4.2 in fact supports this suspicion (see Remark 10).

4. Properties

We first discuss how the CoCo estimator U^ behaves as a function of the data tensor X, the tuning parameter γ, and the weights wd,ij. We will then present its statistical properties under mild conditions on the data generating process. We highlight that these properties hold regardless of the algorithm used to minimize (4), as they are intrinsic to its convex formulation. All proofs are given in Appendix B and Appendix C.

4.1. Stability Properties

The CoCo estimator varies smoothly with respect to X, γ, and {wd,ij}. Let Wd = {wd,ij} denote the weights matrix for mode d.

Proposition 4 The minimizer U^ of (4) is jointly continuous in (X, γ, W1, W2, … , WD).

As noted earlier, in practice we will typically fix the weights $w_{d,ij}$ and compute the CoCo estimator over a grid of values of the penalization parameter $\gamma$ in order to select a final CoCo estimator from among the computed candidate estimators of varying levels of smoothness. Since (4) does not admit a closed-form minimizer, we resort to iterative algorithms for computing the CoCo estimator. Continuity of $\hat{\mathcal{U}}$ in $\gamma$ can be leveraged to expedite computation through warm starts, namely using the solution $\hat{\mathcal{U}}_\gamma$ as the initial guess for iteratively computing $\hat{\mathcal{U}}_{\gamma'}$, where $\gamma'$ is slightly larger or smaller than $\gamma$. Due to the continuity of $\hat{\mathcal{U}}$ in $\gamma$, small changes in $\gamma$ will result in small changes in $\hat{\mathcal{U}}$. Empirically, the use of warm starts can lead to a non-trivial reduction in computation time (Chi and Lange, 2015). From the continuity in $\gamma$, we also see that convex co-clustering performs continuous co-clustering just as the lasso (Tibshirani, 1996) performs continuous variable selection.

The penalization parameter γ tunes the complexity of the CoCo estimator. Clearly when γ = 0, the CoCo estimator coincides with the data tensor, namely U^=X. The key to understanding the CoCo estimator’s behavior as γ increases is to recognize that the penalty functions Rd(U) are semi-norms. Under suitable conditions on the weights given in Assumption 4.1 below, Rd(U) vanishes if and only if the mode-d subarrays of U are identical.

Assumption 4.1 For any pair of mode-d subarrays, indexed by i and j with i < j, there exists a sequence of indices $i \to k \to \cdots \to l \to j$ along which the weights $w_{d,ik}, \ldots, w_{d,lj}$ are positive.

Proposition 5 Under Assumption 4.1, $R_d(\mathcal{U}) = 0$ if and only if $\mathbf{U}_{(d)} = \mathbf{1}\mathbf{c}^{\mathsf{T}}$ for some $\mathbf{c} \in \mathbb{R}^{n_{-d}}$.

To give some intuition for Proposition 5, note that the term $R_d(\mathcal{U})$ separates over the connected components of the mode-d similarity graph. Therefore, the term $R_d(\mathcal{U})$ penalizes variation in the mode-d subarrays over the connected components of the mode-d similarity graph. Assumption 4.1 states that the mode-d similarity graph is connected. Thus, the only way for $R_d(\mathcal{U})$ to attain its minimum value and vanish under Assumption 4.1 is if there is no variation in $\mathcal{U}$ along its mode-d subarrays.

Proposition 5 suggests that if Assumption 4.1 holds for all d = 1, … ,D then as γ increases the CoCo estimator converges to the solution of the following constrained optimization problem:

$$\min_{\mathbf{u}}\ \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 \quad \text{subject to} \quad \mathbf{u} = c\mathbf{1} \text{ for some } c \in \mathbb{R},$$

the solution to which is just the global mean $\bar{\mathbf{x}}$, whose entries are all identically equal to the average value of $\mathbf{x}$ over all its entries. The next result formalizes our intuition that as $\gamma$ increases, the CoCo estimator eventually coincides with $\bar{\mathbf{x}}$.

Proposition 6 Suppose Assumption 4.1 holds for $d = 1, \ldots, D$. Then $F_\gamma(\mathcal{U})$ is minimized by the grand mean $\bar{\mathcal{X}}$ for $\gamma$ sufficiently large.

Thus, as $\gamma$ increases from 0, the CoCo estimator $\hat{\mathcal{U}}$ traces a continuous solution path that starts from $n$ co-clusters, with $u_{i_1\cdots i_D} = x_{i_1\cdots i_D}$, and ends at a single co-cluster, where $u_{i_1\cdots i_D} = \mathbf{x}^{\mathsf{T}}\mathbf{1}/n$ for all $i_1, \ldots, i_D$.

For a fixed $\gamma$, we can derive an explicit bound on the sensitivity of the CoCo estimator to perturbations in the data.

Proposition 7 The minimizer U^ of (4) is a nonexpansive or 1-Lipschitz function of the data tensor X, namely

$$\|\hat{\mathcal{U}}(\mathcal{X}) - \hat{\mathcal{U}}(\tilde{\mathcal{X}})\|_{\mathrm{F}} \leq \|\mathcal{X} - \tilde{\mathcal{X}}\|_{\mathrm{F}}.$$

Nonexpansivity of $\hat{\mathcal{U}}$ in $\mathcal{X}$ provides an attractive stability result. Since $\hat{\mathcal{U}}$ varies smoothly with the data, small perturbations in the data are guaranteed not to lead to large variability in $\hat{\mathcal{U}}$, or consequently large variability in the cluster assignments. In a special case of our method, Chi et al. (2017) showed empirically that the co-clustering assignments made by the 2-way version of the CoCo estimator were noticeably less sensitive to perturbations in the data than those made by several existing biclustering algorithms.

4.2. Statistical Properties

We next provide a finite sample bound for the prediction error of the CoCo estimator. For simplicity, we consider the case where we take uniform weights within a mode in (5), namely $w_{d,ij} = w_{d,i'j'} = 1/n_d$ for all $i, j, i', j' \in \{1, \ldots, n_d\}$. Such a uniform weight assumption has also been imposed in the analysis of the vector-version of convex clustering (Tan and Witten, 2015).

In order to derive the estimation error of $\hat{\mathbf{u}}$, we first introduce an important definition for the noise and two regularity conditions.

Definition 8 (Vu and Wang (2015)) We say a random vector $\mathbf{y} \in \mathbb{R}^n$ is M-concentrated if there are constants $C_1, C_2 > 0$ such that for any convex, 1-Lipschitz function $\phi: \mathbb{R}^n \to \mathbb{R}$ and any $t > 0$,

$$\mathbb{P}\left(\left|\phi(\mathbf{y}) - \mathbb{E}[\phi(\mathbf{y})]\right| \geq t\right) \leq C_1\exp\left(-\frac{C_2 t^2}{M^2}\right).$$

The M-concentrated random variable is more general than the Gaussian or sub-Gaussian random variables, and it allows dependence in its coordinates. Vu and Wang (2015) provided a few examples of M-concentrated random variables. For instance, if the coordinates of y are iid standard Gaussian, then y is 1-concentrated. If the coordinates of y are independent and M-bounded, then y is M-concentrated. If the coordinates of y come from a random walk with certain mixing properties, then y is M-concentrated for some M.

Assumption 4.2 (Model) We assume the true cluster center $\mathcal{C}^* \in \mathbb{R}^{k_1 \times \cdots \times k_D}$ has a checkerbox structure such that the mode-d subarrays have $k_d$ different values (the number of clusters along the dth mode), and each entry of $\mathcal{C}^*$ is bounded above by a constant $C_0 > 0$. Define $\mathcal{U}^* \in \mathbb{R}^{n_1 \times \cdots \times n_D}$ as the true parameter expanded based on $\mathcal{C}^*$, namely

$$\mathcal{U}^* = \mathcal{C}^* \times_1 \mathbf{M}_1 \times_2 \mathbf{M}_2 \times_3 \cdots \times_D \mathbf{M}_D,$$

where $\mathbf{M}_d \in \{0,1\}^{n_d \times k_d}$ are binary mode-d cluster membership matrices such that $\mathbf{M}_d\mathbf{1} = \mathbf{1}$. Denote $\mathbf{u}^* = \mathrm{vec}(\mathcal{U}^*) \in \mathbb{R}^n$ with $n = \prod_{d=1}^{D} n_d$. We assume the samples belonging to the $(r_1, \ldots, r_D)$th cluster satisfy

$$x_{i_1,\ldots,i_D} = c^*_{r_1,\ldots,r_D} + \epsilon_{i_1,\ldots,i_D},$$

with $i_d \in \{1, \ldots, n_d\}$ and $r_d \in \{1, \ldots, k_d\}$. Furthermore, we assume $\boldsymbol{\epsilon} = \mathrm{vec}(\mathcal{E})$ is an M-concentrated random vector, in the sense of Definition 8, with mean zero.

The checkerbox means model in Assumption 4.2 provides the underlying cluster structure of the tensor data. As a special case, Assumption 4.2 with D = 2 reduces to the model assumption underlying convex biclustering (Chi et al., 2017). In contrast to the independent sub-Gaussian condition assumed for the vector-version of convex clustering (Tan and Witten, 2015), our error condition is much weaker since we allow for non-sub-Gaussian distributions as well as for dependence among the coordinates.

Assumption 4.3 (Tuning) The tuning parameter γ satisfies

$$\frac{2\sqrt{\log(n)}}{nD} \leq \gamma \leq \frac{2c_0\sqrt{\log(n)}}{nD},$$

for some constant c0 > 1.

Theorem 9 Suppose that Assumption 4.2 and Assumption 4.3 hold. The estimation error of û in (5) with uniform weights satisfies,

$$\frac{1}{n}\|\hat{\mathbf{u}} - \mathbf{u}^*\|_2^2 \leq \frac{1}{D}\sum_{d=1}^{D}\left(\frac{1}{n_d} + \sqrt{\frac{\log(n)}{n\,n_d}}\right) + \frac{C\sqrt{\log(n)}}{Dn}\sum_{d=1}^{D} n_d^2\sqrt{\prod_{j\neq d}k_j}, \tag{7}$$

with high probability, where $C = 12c_0C_0^2$ is a positive constant and $k_d$ is the true number of clusters in the dth mode.

Theorem 9 provides a finite sample error bound for the proposed CoCo tensor estimator. Our theoretical bound allows the number of clusters in each mode to diverge, which reflects a typical large-scale clustering scenario in big tensor data. A notable consequence of Theorem 9 is that, when D ≥ 3, namely a higher-order tensor with at least 3 modes, the CoCo estimator can achieve estimation consistency along all the D modes even when we only have one tensor sample. Here the sample size refers to the number of available tensor samples. In our tensor clustering problem, we only have access to one tensor sample.

This property is uniquely enjoyed by co-clustering of tensor data with D ≥ 3 and has not been previously established in the existing literature on vector clustering or biclustering. To see this, note that when the $n_d$ are of the same order as $n_0$ and the $k_d$ are of the same order as $k_0$, a sufficient condition for consistency is that $n_0 \to \infty$ and $k_0 = o\big(n_0^{2(D-2)/(D-1)}\big)$ up to a log term. When D = 3, the CoCo estimator is consistent so long as the number of clusters $k_0$ in each mode diverges slightly slower than $n_0$. Remarkably, as we have more modes in the tensor data, this constraint on the rate of divergence of $k_0$ gets weaker. In short, we reap a unique and surprisingly welcome “blessing of dimensionality” phenomenon in the tensor co-clustering problem.
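To see where this sufficient condition comes from, here is a back-of-the-envelope calculation (a sketch only, with constants and the exact log factors suppressed) in the balanced setting $n_d \asymp n_0$ and $k_d \asymp k_0$, applied to the second term of the bound (7) as displayed above:

$$\frac{C\sqrt{\log(n)}}{Dn}\sum_{d=1}^{D} n_d^2\sqrt{\prod_{j\neq d}k_j} \;\asymp\; \sqrt{\log(n)}\,\frac{k_0^{(D-1)/2}}{n_0^{D-2}} \;\longrightarrow\; 0 \quad\Longleftrightarrow\quad k_0 = o\!\left(n_0^{2(D-2)/(D-1)}\right)\ \text{up to a log factor}.$$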

Remark 10 Next we discuss the connections of our bound (7) with prior results in the literature. An intermediate step in the proof of Theorem 9 indicates that the estimation error in the dth mode is on the order of $1/n_d + \sqrt{\log(n)/(n\,n_d)} + \sqrt{\log(n)}\,n_d\sqrt{\prod_{j\neq d}k_j}\big/n_{-d}$. In clustering along the rows of a data matrix, our rate matches the one established for the vector-version of convex clustering (Tan and Witten, 2015), up to a log term $\sqrt{\log(n)}$. Such a log term is due to the fact that Tan and Witten (2015) consider the error to be iid sub-Gaussian while we consider a general M-concentrated error. In practice, the iid assumption on the noise $\boldsymbol{\epsilon} = \mathrm{vec}(\mathcal{E})$ could be restrictive. Consequently, our theoretical analysis is built upon a new concentration inequality for quadratic forms recently developed in Vu and Wang (2015). In addition, our rate reveals an interesting theoretical property of the convex biclustering method proposed by Chi et al. (2017). When D = 2, our rate indicates that the estimation errors along the rows and columns of the data matrix are $\sqrt{\log(n_1 n_2)}\,n_1\sqrt{k_2}/n_2$ and $\sqrt{\log(n_1 n_2)}\,n_2\sqrt{k_1}/n_1$, respectively. Clearly, both errors cannot converge to zero simultaneously. This indicates a disadvantage of matricizing a data tensor for co-clustering.

5. Estimation Algorithm

We next discuss a simple first-order method for computing the solution to the convex co-clustering problem. The proposed algorithm generalizes the variable splitting approach introduced for the convex clustering problem in Chi and Lange (2015) to the CoCo problem. The key observation is that the Lagrangian dual of an equivalent formulation of the convex co-clustering problem is a constrained least squares problem that can be iteratively solved using the classic projected gradient algorithm.

5.1. A Lagrangian Dual of the CoCo Problem

Recall that we seek to minimize the objective function in (5)

$$F_\gamma(\mathbf{u}) = \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 + \gamma\sum_{d=1}^{D}\sum_{l\in\mathcal{E}_d} w_{d,l}\,\|\mathbf{A}_{d,l}\mathbf{u}\|_2.$$

Note that we have enumerated the edge indices in Ed to simplify the notation for the following derivation.

We perform variable splitting and introduce the dummy variables $\mathbf{v}_{d,l} = \mathbf{A}_{d,l}\mathbf{u}$. Let $\mathbf{V}_d$ denote the $n_{-d} \times |\mathcal{E}_d|$ matrix whose lth column is $\mathbf{v}_{d,l}$. Further denote the vectorization of $\mathbf{V}_d$ by $\mathbf{v}_d = \mathrm{vec}(\mathbf{V}_d)$, and let $\mathbf{v} = [\mathbf{v}_1^{\mathsf{T}}\ \mathbf{v}_2^{\mathsf{T}}\ \cdots\ \mathbf{v}_D^{\mathsf{T}}]^{\mathsf{T}}$ denote the vector obtained by stacking the vectors $\mathbf{v}_d$ on top of each other. We now solve the equivalent equality-constrained minimization

$$\min_{\mathbf{v},\mathbf{u}}\ \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 + \gamma\sum_{d=1}^{D}\sum_{l\in\mathcal{E}_d} w_{d,l}\,\|\mathbf{v}_{d,l}\|_2 \quad \text{subject to} \quad \mathbf{v}_d = \mathbf{A}_d\mathbf{u},$$

where $\mathbf{A}_d = \mathbf{I}_{n_D} \otimes \cdots \otimes \mathbf{I}_{n_{d+1}} \otimes \boldsymbol{\Phi}_d \otimes \mathbf{I}_{n_{d-1}} \otimes \cdots \otimes \mathbf{I}_{n_1}$ and $\boldsymbol{\Phi}_d$ is the oriented edge-vertex incidence matrix for the dth mode graph, namely

$$\Phi_{d,lv} = \begin{cases} 1 & \text{if node } v \text{ is the head of edge } l, \\ -1 & \text{if node } v \text{ is the tail of edge } l, \\ 0 & \text{otherwise.} \end{cases}$$

We introduce dual variables $\boldsymbol{\lambda}_d$ corresponding to the equality constraint $\mathbf{v}_d = \mathbf{A}_d\mathbf{u}$. Let $\boldsymbol{\Lambda}_d$ denote the $n_{-d} \times |\mathcal{E}_d|$ matrix whose lth column is $\boldsymbol{\lambda}_{d,l}$. Further denote the vectorization of $\boldsymbol{\Lambda}_d$ by $\boldsymbol{\lambda}_d = \mathrm{vec}(\boldsymbol{\Lambda}_d)$ and $\boldsymbol{\lambda} = [\boldsymbol{\lambda}_1^{\mathsf{T}}\ \boldsymbol{\lambda}_2^{\mathsf{T}}\ \cdots\ \boldsymbol{\lambda}_D^{\mathsf{T}}]^{\mathsf{T}}$. The Lagrangian dual objective is given by

$$G(\boldsymbol{\lambda}) = \frac{1}{2}\|\mathbf{x}\|_2^2 - \frac{1}{2}\|\mathbf{x} - \mathbf{A}^{\mathsf{T}}\boldsymbol{\lambda}\|_2^2 - \sum_{d=1}^{D}\sum_{l\in\mathcal{E}_d} \iota_{C_{d,l}}(\boldsymbol{\lambda}_{d,l}),$$

where $\mathbf{A} = [\mathbf{A}_1^{\mathsf{T}}\ \mathbf{A}_2^{\mathsf{T}}\ \cdots\ \mathbf{A}_D^{\mathsf{T}}]^{\mathsf{T}}$ and $\iota_{C_{d,l}}$ is the indicator function of the closed convex set $C_{d,l} = \{\mathbf{z} : \|\mathbf{z}\|_2 \leq \gamma w_{d,l}\}$, namely $\iota_{C_{d,l}}$ is the function that vanishes on the set $C_{d,l}$ and is infinity on the complement of $C_{d,l}$. Details on the derivation of the dual objective $G(\boldsymbol{\lambda})$ are provided in Appendix D.

Maximizing the dual objective G(λ) is equivalent to solving the following constrained least squares problem:

$$\min_{\boldsymbol{\lambda}\in C}\ \frac{1}{2}\|\mathbf{x} - \mathbf{A}^{\mathsf{T}}\boldsymbol{\lambda}\|_2^2, \tag{8}$$

where $C = \{\boldsymbol{\lambda} : \boldsymbol{\lambda}_{d,l} \in C_{d,l},\ l \in \mathcal{E}_d,\ d = 1, \ldots, D\}$. We can recover the primal solution via the relationship

$$\hat{\mathbf{u}} = \mathbf{x} - \mathbf{A}^{\mathsf{T}}\hat{\boldsymbol{\lambda}},$$

where $\hat{\boldsymbol{\lambda}}$ is a solution to the dual problem (8). The dual problem (8) has at least one solution by the Weierstrass extreme value theorem, but the solution may not be unique since $\mathbf{A}^{\mathsf{T}}$ has a non-trivial kernel. Nonetheless, our CoCo estimator $\hat{\mathbf{u}}$ is still unique since $\mathbf{A}^{\mathsf{T}}\hat{\boldsymbol{\lambda}}_1 = \mathbf{A}^{\mathsf{T}}\hat{\boldsymbol{\lambda}}_2$ for any two solutions $\hat{\boldsymbol{\lambda}}_1, \hat{\boldsymbol{\lambda}}_2$ of problem (8).

We numerically solve the constrained least squares problem in (8) with the projected gradient algorithm, which alternates between taking a gradient step and projecting onto the set C. Algorithm 1 provides pseudocode for the projected gradient algorithm, which has several good features. The projected gradient algorithm is guaranteed to converge to a global minimizer of (8). Its per-iteration and storage costs, using the weight choices described in Section 6, are both O(Dn), namely linear in the number of modes D and in the number of elements n. For a modest additional computational and storage cost, we can accelerate the projected gradient method, for example with FISTA (Beck and Teboulle, 2009) or SpaRSA (Wright et al., 2009). In our experiments, we use a version of the latter, namely FASTA (Goldstein et al., 2014, 2015). Additional details on the derivation of the algorithmic updates, convergence guarantees, computational and storage costs, as well as stopping rules can be found in Appendix E.

6. Specifying Non-Uniform Weights

In Section 4.2, we assumed uniform weights wd,ij in the penalty terms Rd(U) to establish a prediction error bound, which revealed a surprising and beneficial “blessing of dimensionality” phenomenon. Although this simplifying assumption gives clarity and insight into how the co-clustering problem gets easier as the number of modes increases, in practice choosing non-uniform weights can substantially improve the quality of the clustering results. In the context of convex clustering, Chen et al. (2015) and Chi and Lange (2015) provided empirical evidence that convex clustering with uniform weights struggled to produce exact sparsity in the pairwise differences of smooth estimates when there was not a strong separation between groups. Indeed, similar phenomena were observed in earlier work on the related clustered lasso (She, 2010). Several related works (She, 2010; Hocking et al., 2011; Chen et al., 2015; Chi and Lange, 2015) recommend a weight assignment strategy described below. In addition, the use of sparse weights can also lead to non-trivial improvements in both computational time and clustering performance (Chi and Lange, 2015; Chi et al., 2017).

Algorithm 1 Convex Co-Clustering (CoCo) Estimation Algorithm

Initialize λ(0); m ← 0
repeat
  u(m+1) = x − ATλ(m) ▷ Gradient Step
  for d = 1,…, D do
   for l ∈ Ed do
     λd,l(m+1) = PCd,l(λd,l(m) + η Ad,l u(m+1)) ▷ Projection Step
   end for
  end for
  m ← m + 1
until convergence
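The following is a direct, unoptimized NumPy transcription of Algorithm 1 (a sketch; in practice the $\mathbf{A}_{d,l}$ blocks would be kept sparse, the step size $\eta$ must be small enough for convergence, roughly $\eta \le 1/\|\mathbf{A}\|_2^2$, and an accelerated solver such as FASTA would be preferred). The projection $P_{C_{d,l}}$ is simply projection onto an $\ell_2$ ball of radius $\gamma w_{d,l}$.

```python
import numpy as np

def coco_projected_gradient(x, A_blocks, radii, eta, max_iter=500, tol=1e-8):
    """Projected gradient method for the dual problem (8), following Algorithm 1.

    x        : vectorized data tensor of length n.
    A_blocks : list of matrices A_{d,l} (dense or scipy.sparse), one per edge.
    radii    : list of radii gamma * w_{d,l} defining the balls C_{d,l}.
    eta      : step size.
    Returns the primal CoCo estimate u_hat = x - A^T lambda_hat.
    """
    lambdas = [np.zeros(A.shape[0]) for A in A_blocks]
    for _ in range(max_iter):
        # Gradient step: current primal iterate u = x - A^T lambda.
        u = x - sum(A.T @ lam for A, lam in zip(A_blocks, lambdas))
        max_change = 0.0
        for idx, (A, r) in enumerate(zip(A_blocks, radii)):
            z = lambdas[idx] + eta * (A @ u)
            nz = np.linalg.norm(z)
            z = z if nz <= r else (r / nz) * z      # projection onto ||z||_2 <= r
            max_change = max(max_change, np.linalg.norm(z - lambdas[idx]))
            lambdas[idx] = z
        if max_change < tol:
            break
    return x - sum(A.T @ lam for A, lam in zip(A_blocks, lambdas))
```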

To illustrate the practical value of non-uniform weights, we compare CoCo’s ability to recover co-clusters, using both uniform and non-uniform weights, as the size of a 3-way tensor increases when there are two clusters per mode with balanced cluster sizes along each mode. We assess the quality of the recovered clustering performance using the Adjusted Rand Index (ARI). The ARI (Hubert and Arabie, 1985) varies between −1 and 1, where 1 indicates a perfect match between two clustering assignments whereas a value close to zero indicates the two clustering assignments match about as might be expected if they were both randomly generated. Negative values indicate that there is less agreement between clusterings than expected from random partitions.

Figure 3 shows a comparison between using non-uniform weights that are described in Section 6.2 and uniform weights. Each plotted point in Figure 3 is the average ARI over 100 replicates. For CoCo using non-uniform weights, the smoothing parameter γ is chosen with the data-driven extended BIC method that is detailed in Section 7.1. In contrast, for CoCo using uniform weights, γ is chosen as the value that produces the estimator that minimizes the true but unknown MSE.

Figure 3: Uniform versus non-uniform weights: Average Adjusted Rand Index for an increasing tensor size. Here $n = n_0^3$ refers to a tensor of size $n_0 \times n_0 \times n_0$.

We see that while using uniform weights in CoCo leads to recovering co-clusters exactly once a sufficient number of samples have been acquired, using non-uniform weights enables CoCo to recover the co-clusters exactly with notably fewer samples. The results of this experiment are especially remarkable because CoCo using non-uniform weights and a data-adaptive choice of γ outperformed CoCo using uniform weights and an ideally chosen oracle value of γ.

As in the case of convex clustering, using non-uniform weights can lead to significantly better performance than using uniform weights in practice. We give some explanation for why this is expected in Section 6.3 but leave it to future work to develop theory proving this performance improvement. Nonetheless, based on this observation, we employ non-uniform weights in CoCo for the empirical studies presented later in the paper.

6.1. Basic Procedure for Specifying Weights

We first describe our basic two step procedure for constructing weights before elaborating on the final refinements used in our numerical experiments.

Step 1: We first calculate pre-weights w˜d,ij between the ith and jth mode-d subarrays as

$$\tilde{w}_{d,ij} = \iota^k_{\{i,j\}}\exp\left(-\tau_d\|\mathbf{X}_{(d),i:} - \mathbf{X}_{(d),j:}\|_{\mathrm{F}}^2\right). \tag{9}$$

The first factor on the right-hand side of equation (9), $\iota^k_{\{i,j\}}$, is an indicator function that equals 1 if the jth slice is among the ith slice's k-nearest neighbors (or vice versa) and 0 otherwise. The purpose of this term is to control the sparsity of the weights. The corresponding tuning parameter k influences the connectivity of the mode-d similarity graph. One can explore different levels of granularity in the clustering by varying k (Chen et al., 2015). As a default, one can use the smallest k such that the similarity graph is still connected. Note it is not necessary to calculate the exact k-nearest neighbors, which scales quadratically in the number of fibers in the mode. A fast approximation to the k-nearest neighbors is sufficient for the sake of inducing sparsity into the weights. Chi and Lange (2015) provided two reasons for using k-nearest-neighbor weights. First, we wish to prioritize fusions between pairs of subarrays that are most similar; the subarrays that are most dissimilar should be the last pair of subarrays to fuse as the smoothing parameter γ increases. Second, we wish to use a sparse similarity graph since the computational and storage complexity of the estimation algorithm is proportional to the number of non-zero edges in the similarity graphs (Appendix E). Using k-nearest-neighbor weights accomplishes both goals.

The second factor on the right-hand side of equation (9) is the Gaussian kernel, which takes on larger values for pairs of mode-d subarrays that are more similar to each other. Chi and Steinerberger (2019) give a detailed theoretical justification for using weights like the Gaussian kernel weights in the context of convex clustering. For space considerations, we refer readers interested in these technical details to their work and give a brief intuitive rationale for employing the Gaussian kernel here. Intuitively, the weights should be inversely proportional to the distance between the ith and jth mode-d subarrays (Chen et al., 2015; Chi et al., 2017). The inverse of the nonnegative parameter $\tau_d$ is a measure of scale. In practice, we can set it to be the median Euclidean distance between pairs of mode-d subarrays that are k-nearest neighbors of each other. A value of $\tau_d = 0$ corresponds to uniform weights. Note that with minor modification, we can make the inverse scale parameter pair-dependent as described in Zelnik-Manor and Perona (2005).

Step 2: To obtain the mode-d weights wd,ij, we normalize the mode-d pre-weights w˜d,ij to sum to nd/n. The normalization step puts the penalty terms Rd(U) on the same scale and ensures that clustering along any given single mode will not dominate the entire co-clustering as γ increases.
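Putting the two steps together, a minimal NumPy implementation of the weight construction might look like the following (the helper name `mode_weights` is ours; an exact k-nearest-neighbor search is used purely for brevity, whereas the text notes that a fast approximation suffices).

```python
import numpy as np

def mode_weights(X, d, k=5, tau=None):
    """Sparse k-nearest-neighbor Gaussian kernel weights for mode d (Section 6.1).

    Returns a dict mapping (i, j), i < j, to the normalized weight w_{d,ij}.
    """
    n_d, n = X.shape[d], X.size
    Xd = np.moveaxis(X, d, 0).reshape(n_d, -1)     # mode-d subarrays as rows
    D2 = np.sum((Xd[:, None, :] - Xd[None, :, :]) ** 2, axis=-1)   # squared distances

    # Step 1a: symmetrized k-nearest-neighbor indicator.
    order = np.argsort(D2, axis=1)
    knn = np.zeros_like(D2, dtype=bool)
    for i in range(n_d):
        knn[i, order[i, 1:k + 1]] = True
    knn = knn | knn.T

    # Step 1b: Gaussian kernel; 1/tau set to the median distance among kNN pairs.
    if tau is None:
        tau = 1.0 / np.median(np.sqrt(D2[knn]))
    pre = {(i, j): np.exp(-tau * D2[i, j])
           for i in range(n_d) for j in range(i + 1, n_d) if knn[i, j]}

    # Step 2: normalize the mode-d pre-weights to sum to n_d / n.
    total = sum(pre.values())
    return {edge: w * (n_d / n) / total for edge, w in pre.items()}
```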

6.2. Improving Weights via the Tucker Decomposition

In our preliminary experiments, we found that substituting a low-rank approximation of $\mathcal{X}$, namely a Tucker decomposition $\tilde{\mathcal{X}}$, in place of $\mathcal{X}$ in (9) led to a marked improvement in co-clustering performance. To understand the boost in performance, suppose that $\mathcal{X} = \mathcal{U}^* + \mathcal{E}$ with $\mathcal{U}^*$ having a checkerbox structure and the entries of $\mathcal{E}$ iid $N(0, \sigma^2)$ for simplicity. Further suppose that the ith and jth mode-d subarrays of $\mathcal{U}^*$ belong to the same partition and $\iota^k_{\{i,j\}} = 1$. Then

$$\tilde{w}_{d,ij} = \exp\left(-\tau_d\|\mathcal{E}\times_d\Delta_{ij}\|_{\mathrm{F}}^2\right) = \exp\left(-2\tau_d\sigma^2 Z_{d,ij}\right),$$

where $Z_{d,ij} = \|\mathcal{E}\times_d\Delta_{ij}\|_{\mathrm{F}}^2/(2\sigma^2)$ is distributed as a $\chi^2$ random variable with $n_{-d}$ degrees of freedom. If we were able to perfectly denoise the tensor $\mathcal{X}$ so that $\sigma = 0$, then the pre-weight $\tilde{w}_{d,ij}$ would be set to its maximal value of 1, the ideal value for $\tilde{w}_{d,ij}$ since we have assumed the ith and jth mode-d subarrays belong to the same partition. Thus, if we can reduce $\sigma^2$, namely denoise the observed tensor $\mathcal{X}$, we can approach the ideal values of the pre-weights. Note that we are more focused on approaching the ideal pre-weight values for pairs of subarrays that belong to the same partition and less concerned with pairs of subarrays in different partitions, as the Gaussian kernel weights decay very rapidly. The Tucker decomposition is effective at reducing $\sigma^2$ when $\mathcal{U}^*$ has a checkerbox pattern, as the checkerbox pattern is a low-rank tensor that can be effectively approximated with the Tucker decomposition.

Employing the Tucker decomposition introduces another tuning parameter, namely the rank of the decomposition. In our simulation studies described in Section 8, we use two different methods for choosing the rank as a robustness check to ensure our CoCo estimator’s performance does not crucially depend on the rank selection method. Details on these two methods can be found in Appendix F. While we found the Tucker decomposition to work well in practice, we suspect that other methods of denoising the tensor may work just as well or could possibly be more effective. We leave it to future work to explore alternatives to the Tucker decomposition.

6.3. Weights and Folded-Concave Penalties

We conclude our discussion on weights by highlighting how they provide a connection between convex clustering and other penalized regression-based clustering methods that use folded-concave penalties (Pan et al., 2013; Xiang et al., 2013; Zhu et al., 2013; Marchetti and Zhou, 2014; Wu et al., 2016a). Suppose we seek to minimize the objective

$$\tilde{f}_\gamma(\mathbf{u}) = \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 + \gamma\sum_{d=1}^{D}\sum_{(i,j)\in\mathcal{E}_d} \varphi_d\left(\|\mathbf{A}_{d,ij}\mathbf{u}\|_2\right), \tag{10}$$

where each φd : [0, ∞) ↦ [0, ∞) has the following properties: (i) φd is concave and differentiable on (0, ∞), (ii) φd vanishes at the origin, and (iii) the directional derivative of φd exists and is positive at the origin. Such functions are collectively referred to as folded-concave penalties; prominent examples include the smoothly clipped absolute deviation (Fan and Li, 2001) and the minimax concave penalty (Zhang, 2010).

Since φd is concave and differentiable, for all positive z and z˜

$$\varphi_d(z) \leq \varphi_d(\tilde{z}) + \varphi'_d(\tilde{z})(z - \tilde{z}). \tag{11}$$

The inequality (11) indicates that the first order Taylor expansion of a differentiable concave function φd provides a tight global upper bound at the expansion point z˜. Thus, we can construct a function that is a tight upper bound of the function f˜γ(u)

$$g_\gamma(\mathbf{u}\mid\tilde{\mathbf{u}}) = \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 + \gamma\sum_{d=1}^{D}\sum_{(i,j)\in\mathcal{E}_d} w_{d,ij}\,\|\mathbf{A}_{d,ij}\mathbf{u}\|_2 + c, \tag{12}$$

where the constant c does not depend on u and wd,ij are weights that depend on ũ, namely

$$w_{d,ij} = \varphi'_d\left(\|\mathbf{A}_{d,ij}\tilde{\mathbf{u}}\|_2\right). \tag{13}$$

Note that if we take ũ to be the vectorization of the Tucker approximation of the data, vec(X˜), and φd(z) to be the following variation on the error function

$$\varphi_d(z) = \frac{1}{n_d}\sum_{(i,j)\in\mathcal{E}_d} w_{d,ij}\int_0^z e^{-\tau_d\omega^2}\,d\omega,$$

then the function given in (10) coincides with the CoCo objective using the prescribed Tucker derived Gaussian kernel weights.

The function $g_\gamma(\mathbf{u}\mid\tilde{\mathbf{u}})$ is said to majorize the function $\tilde{f}_\gamma(\mathbf{u})$ at the point $\tilde{\mathbf{u}}$ (Lange et al., 2000), and minimizing it corresponds to performing one step of the local linear approximation algorithm (Zou and Li, 2008; Schifano et al., 2010), which is a special case of the majorization-minimization (MM) algorithm (Lange et al., 2000). The corresponding MM algorithm would consist of repeating the following two steps: (i) using a previous CoCo estimate $\tilde{\mathcal{U}}$ to compute weights $w_{d,ij}$ according to (13), and (ii) computing a new CoCo estimate using the new weights. In practice, however, we have found one step to be adequate. Indeed, Zou and Li (2008) showed that the solution of the one-step algorithm is often sufficient in terms of its statistical estimation accuracy.

7. Other Practical Issues

In this section, we address other considerations for using the method in practice, namely how to choose the tuning parameter γ and how to recover the partitions along each mode from the CoCo estimator U^.

7.1. Choosing γ

The first major practical consideration is how to choose $\gamma$ to produce a final co-clustering result. Since co-clustering is an exploratory method, it may be suitable for a user to manually inspect a sequence of CoCo estimators $\hat{\mathcal{U}}_\gamma$ over a range of $\gamma$ and use domain knowledge tied to a specific application to select a $\gamma$ that recovers a co-clustering assignment of a desired complexity. Since this approach is time consuming and requires expert knowledge, an automated, data-driven procedure for selecting $\gamma$ is desirable. Cross-validation (Stone, 1974; Geisser, 1975) and stability selection (Meinshausen and Bühlmann, 2010) are popular techniques for tuning parameter selection, but since both methods are based on resampling, they are unattractive in the tensor setting due to the computational burden. We turn to the extended Bayesian Information Criterion (eBIC) proposed by Chen and Chen (2008, 2012), as it does not rely on resampling and thus is not as computationally costly as cross-validation or stability selection. For a given $\gamma$, the eBIC is

$$\mathrm{eBIC}(\gamma) = n\log\left(\frac{\mathrm{RSS}_\gamma}{n}\right) + 2\,\mathrm{df}_\gamma\log(n),$$

where $\mathrm{RSS}_\gamma$ is the residual sum of squares $\|\mathcal{X} - \hat{\mathcal{U}}_\gamma\|_{\mathrm{F}}^2$ and $\mathrm{df}_\gamma$ is the degrees of freedom for a particular value of $\gamma$. We use the number of co-clusters in the CoCo estimator $\hat{\mathcal{U}}_\gamma$ as an estimate of $\mathrm{df}_\gamma$, which is consistent with the spirit of degrees of freedom since each co-cluster mean is an estimated parameter. This criterion balances model fit against model complexity, and similar versions have been commonly employed for tuning parameter selection in tensor data analysis (Zhou et al., 2013; Sun et al., 2017).

The eBIC is calculated on a grid of values $S = \{\gamma_1, \gamma_2, \ldots, \gamma_s\}$, and we select the optimal $\gamma$, denoted $\gamma^*$, as the value that minimizes the eBIC over $S$, namely

$$\gamma^* = \underset{\gamma\in S}{\arg\min}\ \mathrm{eBIC}(\gamma).$$
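In code, the selection rule is a simple loop over the grid. The sketch below assumes a user-supplied solver `fit_coco(X, gamma)` returning the CoCo estimate and a routine `count_coclusters(U_hat)` returning the number of recovered co-clusters; both are placeholders, not functions defined in the paper.

```python
import numpy as np

def select_gamma(X, gammas, fit_coco, count_coclusters):
    """Select gamma by minimizing the eBIC of Section 7.1 over a grid."""
    n = X.size
    best = (None, np.inf, None)                 # (gamma*, eBIC, U_hat)
    for gamma in gammas:
        U_hat = fit_coco(X, gamma)
        rss = np.sum((X - U_hat) ** 2)          # residual sum of squares
        df = count_coclusters(U_hat)            # degrees of freedom estimate
        ebic = n * np.log(rss / n) + 2.0 * df * np.log(n)
        if ebic < best[1]:
            best = (gamma, ebic, U_hat)
    return best[0], best[2]
```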

7.2. Recovering the Partitions along Each Mode

The second major practical consideration is how to extract the partitions from the CoCo estimator $\hat{\mathcal{U}}$. Recall that the ith and jth mode-d subarrays belong to the same partition if $\mathbf{v}_{d,ij} = \hat{\mathcal{U}}\times_d\Delta_{ij} = 0$. Conversely, the ith and jth mode-d subarrays do not belong to the same partition if $\mathbf{v}_{d,ij} \neq 0$. Thus, a mode-d partition consists of a maximal set of mode-d subarrays such that for any pair i and j in this collection $\mathbf{v}_{d,ij} = 0$. We can automatically identify these maximal sets by extending a simple procedure employed by Chi and Lange (2015) for extracting clusters in the convex clustering problem. Identifying partitions along the dth mode is equivalent to finding the connected components of a graph in which each node corresponds to a subarray along the dth mode and there is an edge between nodes i and j if and only if $\mathbf{v}_{d,ij} = 0$.

We would like to read off which centroids have fused as the amount of regularization increases, namely to determine partition assignments as a function of $\gamma$. Such assignments can be computed in $O(n_d)$ operations using the difference variables $\mathbf{V}_d$. We simply apply breadth-first search to identify the connected components of the following graph induced by $\mathbf{V}_d$: the graph identifies a node with every mode-d subarray and places an edge between the lth pair of subarrays if and only if $\mathbf{v}_{d,l} = 0$. Each connected component corresponds to a partition. Note that the graph constructed to determine partitions is not the same as the similarity graph described in Section 3 and illustrated in Figure 2.

We emphasize that the recovered partition along each mode does not depend on the ordering of the input data $\mathcal{X}$, since it is based on the pairwise differences along each mode, namely $\mathbf{V}_d$ for $d = 1, \ldots, D$. Finally, we note that due to finite precision limitations, the difference variables $\mathbf{v}_{d,ij}$ will likely not be exactly 0. In Appendix E.4, we detail a simple and principled procedure for ensuring sparsity in these difference variables.
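A minimal sketch of the partition-recovery step, assuming the (thresholded) zero pattern of the difference variables is already available as a list of fused index pairs, is given below; `mode_partitions` is our own helper name.

```python
import numpy as np
from collections import defaultdict, deque

def mode_partitions(n_d, zero_edges):
    """Recover the mode-d partition from the fused difference variables.

    zero_edges lists the pairs (i, j) whose difference variable v_{d,ij} is
    (numerically) zero; the connected components of the induced graph on the
    n_d mode-d subarrays are the recovered clusters (breadth-first search).
    """
    adj = defaultdict(list)
    for i, j in zero_edges:
        adj[i].append(j)
        adj[j].append(i)
    labels = -np.ones(n_d, dtype=int)
    cluster = 0
    for start in range(n_d):
        if labels[start] >= 0:
            continue
        labels[start] = cluster
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if labels[w] < 0:
                    labels[w] = cluster
                    queue.append(w)
        cluster += 1
    return labels   # labels[i] is the cluster of the i-th mode-d subarray
```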

8. Simulation Studies

To investigate the performance of the CoCo estimator in identifying co-clusters in tensor data, we first explore some simulated examples. We compare our CoCo estimator to a k-means based approach that is representative of various tensor generalizations of the spectral clustering method common in the tensor clustering literature (Kutty et al., 2011; Liu et al., 2013b; Zhang et al., 2013; Wu et al., 2016b). We refer to this method as CPD+k-means. The CPD+k-means method (Papalexakis et al., 2013; Sun and Li, 2019) first performs a rank-R CP decomposition of the D-way tensor $\mathcal{X}$ to reduce the dimensionality of the problem, and then independently applies k-means clustering to the rows of each of the D factor matrices from the resulting CP decomposition. The k-means algorithm has also been used to cluster the factor matrices resulting from a Tucker decomposition (Acar et al., 2006; Sun et al., 2006; Kolda and Sun, 2008; Sun et al., 2009; Kutty et al., 2011; Liu et al., 2013b; Zhang et al., 2013; Cao et al., 2015; Oh et al., 2017). We also considered this Tucker+k-means method in initial experiments, but its co-clustering performance was inferior to that of CPD+k-means, so we only report co-clustering performance results for CPD+k-means in the comparison experiments that follow. Note, however, that we still use the Tucker decomposition to compute the CoCo weights $w_{d,ij}$ as described in Section 6. Both CoCo and CPD+k-means account for the multiway structure of the data. To assess the importance of accounting for this structure, we also include comparisons with the CoTeC method (Jegelka et al., 2009), which applies k-means clustering along each mode and does not account for the multiway structure of the data.

All methods being compared have tuning parameters that need to be set. For the rank of the CP decomposition needed in CPD+k-means, we consider R ∈ {2, 3, 4, 5} and use the tuning procedure in Sun et al. (2017) to automatically select the rank. A CP decomposition is then performed using the chosen rank, and those factor matrices are the input into the k-means algorithm. A well known drawback of k-means is that the number of clusters k needs to be specified a priori. Several methods for selecting k have been proposed in the literature, and we use the “gap statistic” developed by Tibshirani et al. (2001) to select an optimal k* from the specified possible values. Since CoCo estimates an entire solution path of mode-clustering results, ranging from nd clusters to a single cluster along mode d, we consider a rather large set of possible k values to make the methods more comparable. Appendix G gives a more detailed description of the CPD+k-means procedure and the selection of its tuning parameters. CoTeC, which applies k-means clustering along each mode independently, also requires specifying the number of clusters along each mode. As in CPD+k-means, we also select this parameter along each mode using the “gap statistic.”
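For completeness, the following Python sketch illustrates the gap statistic selection rule of Tibshirani et al. (2001) as it would be applied to the rows of a factor matrix. It relies on scikit-learn's KMeans; the candidate grid, the number of reference draws B, and the uniform reference distribution over the bounding box are illustrative choices, not the exact settings used in our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_candidates, B=20, random_state=0):
    """Select k via the gap statistic: Gap(k) = E*[log W_k] - log W_k."""
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_wk(data, k):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
        return np.log(km.inertia_)           # within-cluster sum of squares

    gaps, sks = [], []
    for k in k_candidates:
        # Reference data drawn uniformly over the bounding box of X.
        ref = np.array([log_wk(rng.uniform(lo, hi, size=X.shape), k)
                        for _ in range(B)])
        gaps.append(ref.mean() - log_wk(X, k))
        sks.append(ref.std(ddof=1) * np.sqrt(1.0 + 1.0 / B))

    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; fall back to the largest gap.
    for i in range(len(k_candidates) - 1):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return k_candidates[i]
    return k_candidates[int(np.argmax(gaps))]

# Example usage on the rows of a factor matrix A (n_d x R).
A = np.vstack([np.random.randn(30, 2) + 4, np.random.randn(30, 2)])
print(gap_statistic(A, k_candidates=[1, 2, 3, 4, 5]))
```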

As described in Section 6, we employ a Tucker approximation to the data tensor in constructing weights wd,ij. In computing the Tucker decomposition we used one of two methods for selecting the rank. In the plots within this section, TD1 denotes the results where the Tucker rank was chosen using the SCORE algorithm (Yokota et al., 2017), while TD2 denotes results where the rank was chosen using a heuristic. A detailed discussion of these two methods is given in Appendix F.

The results presented in this section report the average CoCo estimator performance quantified by the ARI across 200 simulated replicates. All simulations were performed in Matlab using the Tensor Toolbox (Bader et al., 2015). All the following plots, except the heatmaps in Figure 13, were made using the open source R package ggplot2 (Wickham, 2009).

Figure 13: Advertisement and Publisher Click-Through Rate Biclusters for a Randomly Selected User. The rows correspond to different advertisements and the columns correspond to different publishers. Darker blue corresponds to higher click-through rates for a given device.

8.1. Cubical Tensors, Checkerbox Pattern

For the first and main simulation setting, we study clustering data in a cubical tensor generated by a basic checkerbox mean model according to Assumption 4.2. Each entry in the observed data tensor is generated according to the underlying model (2) with independent errors $\epsilon_{i_1 i_2 i_3} \sim N(0, \sigma_{r_1 r_2 r_3}^2)$. Unless specified otherwise, there are two true clusters along each mode for a total of eight underlying co-clusters.
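The following Python sketch illustrates one way to generate data from such a checkerbox mean model. The choices of drawing co-cluster means from a normal distribution and of balanced, contiguous cluster labels are illustrative assumptions, not necessarily the exact simulation design used here.

```python
import numpy as np

def checkerbox_tensor(n=(60, 60, 60), k=(2, 2, 2), sigma=3.0, seed=0):
    """Simulate X = U* + noise, where U* is constant on each co-cluster."""
    rng = np.random.default_rng(seed)
    # Balanced mode-d cluster assignments (assumes n[d] divisible by k[d]).
    labels = [np.repeat(np.arange(k[d]), n[d] // k[d]) for d in range(3)]
    # One mean per co-cluster (illustrative scale for the means).
    mu = rng.normal(0.0, 3.0, size=k)
    U = mu[np.ix_(labels[0], labels[1], labels[2])]       # checkerbox mean tensor
    X = U + rng.normal(0.0, sigma, size=n)                # homoskedastic noise
    return X, U, labels

X, U, labels = checkerbox_tensor()
print(X.shape, len(np.unique(labels[0])))
```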

8.1.1. Balanced Cluster Sizes and Homoskedastic Noise

To get an initial feel for how the different co-clustering methods perform at recovering the true underlying checkerbox structure, we first consider a situation where the clusters corresponding to the two classes along each mode are all equally-sized, or balanced, and share the same error variance, namely $\sigma_{r_1 r_2 r_3} = \sigma$ for all r1, r2, and r3. The average co-clustering performance for this setting in a tensor with dimensions n1 = n2 = n3 = 60 is given in Figure 4 for different noise levels. Figure 4 shows that all three methods perform well when the noise level is low (σ = 1). As the noise level increases, however, CPD+k-means experiences an immediate and noticeable drop off in performance. CoTeC's performance decays even more rapidly, highlighting the importance of accounting for multiway structure. The CoCo estimator, on the other hand, is able to maintain near-perfect performance until the noise level becomes rather high (σ = 8).

Figure 4: Checkerbox Simulation Results: Impact of Noise Level. Two balanced clusters per mode across different levels of homoskedastic noise for n1 = n2 = n3 = 60. For each method, the confidence interval is calculated as the mean value plus/minus one standard error.

Figure 5 shows how the run times of CoCo and CPD+k-means vary as the size of a cubical tensor, n = n1n2n3 with n1 = n2 = n3, takes on the values $20^3$, $30^3$, $60^3$, and $100^3$. These run times include all computations needed to fit and select a final model. For CoCo, a sequence of models was fit over a grid of γ parameters, and a final γ parameter was chosen using the eBIC. For CPD+k-means, a sequence of models was fit over a grid of possible (k1, k2, k3) parameters corresponding to the 3 factor matrices, and a final triple of (k1, k2, k3) parameters was chosen using the “gap statistic.” Timing comparisons were performed on a 3.2 GHz quad-core Intel Core i5 processor with 8 GB of RAM. The run time for CoCo scales linearly in the size of the data tensor as expected, namely proportionately with $n_1^3$. Nonetheless, as also might be expected, the clustering performance enjoyed by CoCo does not come for free, and the simpler but less reliable CPD+k-means algorithm enjoys better scaling as the tensor size grows. Timing results were similar for the following experiments and are omitted for space considerations.

Figure 5: Timing Results: Balanced Cluster Size and Homoskedastic Noise. Two balanced clusters per mode with a fixed level of homoskedastic noise for n1 = n2 = n3 = 20, 30, 60, and 100. Vertical and horizontal axes are on a log scale.

8.1.2. Imbalanced Cluster Sizes

When comparing clustering methods, one factor of interest is the extent to which the relative sizes of the clusters impact clustering performance. To investigate this, we again use a cubical tensor of size n1 = n2 = n3 = 60 but introduce different levels of cluster size imbalance along each mode, which we quantify via the ratio of the number of samples in cluster 2 of mode d and the total number of samples along mode d, for d = 1, 2, 3. Figure 6a shows that when the noise level is low, CPD+k-means is unaffected by the imbalance until the size of cluster 2 is less than 30% of the mode’s length. At this point, the performance of CPD+k-means drops off significantly and it performs as well as a random clustering assignment when the sizes are highly skewed (nd2/nd = 0.1). The CoCo estimator is more or less invariant to the imbalance, and its performance is almost perfect across all levels of cluster size imbalance. Figure 6b shows that the CoCo estimator exhibits a slight deterioration in performance only when the cluster size ratio is 0.1 in the high noise case. In both low and high noise scenarios, CoTeC performs poorly.

Figure 6: Checkerbox Simulation Results: Impact of Cluster Size Imbalance. Two imbalanced clusters per mode with either low or high homoskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ = 3 while high noise refers to σ = 6.

8.1.3. Heteroskedastic Noise

Another factor of interest is how the clustering methods perform when there is heteroskedasticity in the variability of the two classes. Figure 7 displays the co-clustering performance for different degrees of heteroskedasticity, as measured by the standard deviation for class 2 relative to class 1's standard deviation, σ2/σ1. In the low noise setting, the CoCo estimator is immune to the heteroskedasticity until the noise levels differ by a factor of 4. CPD+k-means, in contrast, is very sensitive to a deviation from homoskedasticity, experiencing a decline even when the noise ratio increases from 1 to only 1.5. The CoCo estimator fares worse in the high noise setting and also has a drop in performance with a small deviation from homoskedasticity. Once class 2's standard deviation is more than double the standard deviation for class 1, all three methods are essentially the same as random clustering. This result is not terribly surprising since, in the high noise setting, this would result in one class having a very high standard deviation of σ2 = 12. In both low and high noise scenarios, CoTeC performs poorly.

Figure 7: Checkerbox Simulation Results: Impact of Heteroskedasticity. Two balanced clusters per mode with either low or high heteroskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ1 = 3 while high noise refers to σ1 = 6.

8.1.4. Different Clustering Structures

So far, we have considered only a simple situation where there are exactly two true clusters along each mode, for a total of eight triclusters. Another factor of practical importance is how the clustering methods perform when there are more than two clusters per mode, and also when the number of clusters along each mode differs. We investigate both of these settings in this section. As before, the tensor is a perfect cube with n1 = n2 = n3 = 60 observations along each mode and an underlying checkerbox pattern. To gauge the performance, we again focus the attention on how the methods perform in the presence of both low and high noise.

The first situation studied is one in which there are three true clusters along each mode, resulting in a total of 27 triclusters. The left hand side of the graphs in Figure 8 shows the results from this simulation setting. The graphs show that the CoCo estimator consistently outperforms CPD+k-means and CoTeC in this setting across both noise levels. The CoCo estimator is able to recover the true co-clusters almost perfectly, while CPD+k-means struggles to handle the increased number of clusters per mode.

Figure 8: Checkerbox Simulation Results: Impact of Clustering Structure. Different balanced clusters per mode with either low or high homoskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ = 3 while high noise refers to σ = 6.

We also investigated the clustering performance when the number of clusters per mode varies. In this setting, there are two, three, and four clusters along modes one, two, and three, respectively. From the right hand side of the graphs in Figure 8, we can see that the results are similar to the situation with three clusters per mode. CPD+k-means again performs very poorly across both noise levels, while convex co-clustering is again able to essentially recover the true co-clustering structure. Compared to the setting with three clusters per mode, CPD+k-means performs slightly worse in the face of a more complex clustering structure, while convex co-clustering is able to handle it in stride. These results bode well for convex co-clustering as the basic clustering structure of only two clusters per mode is unlikely to be observed in practice.

8.2. Rectangular Tensors

Up to this point, to get an initial feel for CoCo’s performance, we restricted our attention to cubical tensors with the same number of observations per mode so as to avoid changing too many factors at once. It is unlikely that the data tensor at hand will be a perfect cube, however, so it is important to understand the clustering performance when the methods are applied to rectangular tensors.

Now we turn to clustering a rectangular tensor with one short mode and two longer modes. Two additional simulations involving rectangular tensors can be found in Appendix H. Figure 9 shows that CoCo performs very well and better than CPD+k-means and CoTeC at the lower noise level (σ = 3) but has a sharp decrease in ARI at the higher noise level (σ = 4). The decline is more pronounced for the longer modes (Figure 9b and Figure 9c), as the short mode (Figure 9a) is still able to maintain perfect performance despite the increase in noise. This is not surprising, since the shorter mode has effectively more samples. Moreover, we see the “blessing of dimensionality” at work: when the number of samples along the short mode is doubled (n1 = 20, n2 = n3 = 50), the performance along the two longer modes improves drastically in the high noise setting.

Figure 9: Checkerbox Simulation Results: Impact of Tensor Shape. Two balanced clusters per mode with two levels of homoskedastic noise for a tensor with one short mode and two longer modes. Average adjusted rand index plus/minus one standard error for different noise levels and mode lengths.

We finally note that, along the shorter mode, the use of the heuristic in determining the rank of the Tucker decomposition for calculating the weights performs better than the SCORE algorithm method along modes 1 and 2, though ultimately the co-clustering performance is comparable. This may indicate that the SCORE algorithm struggles to correctly identify the optimal Tucker rank for short modes in the presence of relatively higher noise, while the heuristic is more immune to the noise level as it is based simply on the dimensions of the tensor.

8.3. CANDECOMP/PARAFAC Model

In Section 8.1, we saw that the CoCo estimator performs well and typically better than CPD+k-means when clustering tensors whose co-clusters have an underlying checkerbox pattern. To evaluate the performance of our CoCo estimator under model misspecification, we consider a generative model based on the following CP decomposition. We first construct the factor matrix $A \in \mathbb{R}^{80 \times 2}$ and then form the rank-2 CP means tensor

\mathcal{U}^* = \sum_{i=1}^{2} a_i \circ a_i \circ a_i,

where ◦ denotes the outer product and $a_i$ is the ith column of A. We then added varying levels of Gaussian noise to $\mathcal{U}^*$ to generate the observed data tensor. We consider two different types of factor matrices. As shown in Figure 10, one shape consists of two half-moon clusters (Hocking et al., 2011; Chi and Lange, 2015; Tan and Witten, 2015) while the other shape contains a bullseye, similar to the two-circles shape studied by Ng et al. (2002) and Tan and Witten (2015). In either case, the triangles in Figure 10 correspond to the first 40 rows of A, whereas the circles correspond to the second 40 rows of A. Note that this data generating mechanism should favor the CPD+k-means method.
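A minimal Python sketch of this generative mechanism is given below, using scikit-learn's make_moons and make_circles to build the factor matrix and a plain outer-product sum for the rank-2 mean tensor; the noise level and random seed are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons, make_circles

def cp_model_tensor(shape="half_moons", sigma=0.5, seed=0):
    """Build U* = sum_i a_i o a_i o a_i from an 80 x 2 factor matrix A, then add noise."""
    rng = np.random.default_rng(seed)
    if shape == "half_moons":
        A, y = make_moons(n_samples=80, noise=0.05, random_state=seed)
    else:                                   # "bullseye": two concentric circles
        A, y = make_circles(n_samples=80, noise=0.05, factor=0.3, random_state=seed)
    U = np.zeros((80, 80, 80))
    for i in range(A.shape[1]):             # rank 2: one rank-one term per column of A
        a = A[:, i]
        U += np.einsum("i,j,k->ijk", a, a, a)
    X = U + rng.normal(0.0, sigma, size=U.shape)
    return X, U, y                          # y gives the two true clusters along each mode

X, U, y = cp_model_tensor("bullseye")
print(X.shape, np.bincount(y))
```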

Figure 10: Factor Matrices for the CP Models.

Figure 11 shows the simulation results for using the CP model with these two non-convex shapes generating the data. The discrepancy in performance between the CoCo estimator and the other two methods is quite large. The CoCo estimator almost perfectly identifies the true co-clusters. In contrast, both CPD+k-means and CoTeC perform very poorly, even when the noise variance is small. The poor performance of CPD+k-means and CoTeC is not completely surprising, as others have noted the difficulty that k-means methods have in recovering non-convex clusters (Ng et al., 2002; Hocking et al., 2011; Tan and Witten, 2015). These results give us some assurance that the CoCo estimator is still able to perform well even under some model misspecification, since the true co-clusters do not have a checkerbox pattern.

Figure 11: CP Model Simulation Results. Two balanced clusters per mode with low homoskedastic noise for n1 = n2 = n3 = 40. “Bullseye” and “Half Moons” refer to the shape embedded in the factor matrices used to generate the true tensor.

8.4. Comparison with Convex Biclustering

It is natural to ask how much additional gain there is in using CoCo over convex biclustering (Chi et al., 2017) on the matricizations of a data tensor. To answer this question, we compare CoCo to the following strategy for applying convex biclustering to estimate co-clusters. We explain the strategy for a 3-way tensor; the generalization to D-way tensors is straightforward. We first matricize the tensor X along mode-1 to obtain the matrix X(1), apply convex biclustering on X(1), and retain the mode-1 clustering results. Note that the mode-2 and mode-3 fibers have been mixed together through the matricization process. We then repeat the two-step procedure for mode-2 and mode-3. The final co-cluster estimates are obtained by taking the cross-products of the mode-1, mode-2, and mode-3 cluster assignments.
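The following Python sketch illustrates the bookkeeping in this strategy, namely the mode-d matricization that feeds the biclustering step and the crossing of per-mode labels into co-cluster identifiers. The biclustering step itself is not shown (the experiments use convex biclustering of Chi et al., 2017), and the labels in the example are made up; the particular column ordering used by this matricization is immaterial here since only the row (mode-d) assignments are retained.

```python
import numpy as np

def unfold(X, mode):
    """Mode-d matricization: X_(d) has n_d rows; the other modes are mixed into the columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cocluster_ids(labels_per_mode, shape):
    """Cross the per-mode assignments into a co-cluster id for every tensor entry."""
    grids = np.meshgrid(*labels_per_mode, indexing="ij")
    k = [l.max() + 1 for l in labels_per_mode]
    ids = np.zeros(shape, dtype=int)
    for d, g in enumerate(grids):
        ids = ids * k[d] + g
    return ids

# labels_per_mode would come from biclustering each unfold(X, d) and keeping
# only the row (mode-d) assignments; here we use made-up labels.
X = np.random.randn(4, 6, 8)
labels = [np.array([0, 0, 1, 1]), np.repeat([0, 1, 2], 2), np.repeat([0, 1], 4)]
print(unfold(X, 1).shape)                                # (6, 32)
print(np.unique(cocluster_ids(labels, X.shape)).size)    # 2 * 3 * 2 = 12 co-clusters
```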

We consider two illustrative scenarios to understand the value of preserving the full multiway structure with CoCo: a balanced case and an imbalanced case. In the balanced case, we have a 3-way data tensor $\mathcal{X} \in \mathbb{R}^{60 \times 60 \times 60}$ with two clusters along each mode, where clusters are of equal size and homoskedastic iid Gaussian noise has been added to all elements of the tensor. This scenario is similar to the one shown in Figure 4. In the imbalanced case, we have a 3-way data tensor $\mathcal{X} \in \mathbb{R}^{30 \times 40 \times 80}$. There are two clusters along mode-1 of sizes 10 and 20, three clusters along mode-2 of sizes 8, 12, and 20, and four clusters along mode-3 of sizes 5, 10, 20, and 45. Homoskedastic iid Gaussian noise has been added to all elements of the tensor. Finally, we note that the empirical performance of convex biclustering, like that of CoCo's, depends on choosing good weights for the rows and columns of the input data matrix (Chi et al., 2017). To create a fair comparison, we construct convex biclustering weights based on the same TD1 and TD2 denoising procedures used for CoCo, putting the preprocessing for both methods on equal footing.

Figure 12a and Figure 12b show the co-clustering performance of CoCo and the convex biclustering method in the balanced and imbalanced cases, respectively. We see that in the balanced case, CoCo's performance is marginally better than that of the convex biclustering method. On the other hand, we see that in the imbalanced case, CoCo's performance degrades more gracefully than that of the convex biclustering method as the noise level increases. The example illustrates that CoCo has better co-cluster recovery when there is more imbalance in the data tensor: the aspect ratios of the tensor dimensions are more skewed, and the number of clusters and the cluster sizes are more heterogeneous.

Figure 12: A Comparison between CoCo and Convex Biclustering. Average Adjusted Rand Index plus/minus one standard error for different noise levels.

The key formulation difference between CoCo and the convex biclustering method that provides some insight into these two results is that CoCo imposes a finer level of smoothness that respects the multiway structure in the data tensor. Imposing such a finer level of smoothness imparts greater robustness, in the presence of increasing noise, to recovering the smaller co-clusters in the imbalanced scenario. An added incentive for using CoCo and preserving the multiway structure in the data is that the gains in co-cluster recovery over the convex biclustering method do not come at a greater computational cost. Note that the computational complexity of convex biclustering is O(n), using sparse weights for the row and column similarity graphs. For a D-way tensor, the computational complexity then becomes O(Dn), which is the same as the computational complexity of CoCo applied directly on the D-way tensor.

To summarize, in comparison to the convex biclustering method, CoCo (i) does not come at additional computational costs, (ii) can recover underlying co-clustering structure in imbalanced scenarios which are more likely to be encountered in practice, and (iii) has the ability to consistently recover an underlying co-clustering structure according to Theorem 9, with even a single tensor sample, which is a typical case in real applications. Since this phenomenon does not exist in vector or matrix variate cluster analysis, the convex biclustering method lacks this theoretical guarantee.

9. Real Data Application

Having studied the performance of the CoCo estimator in a variety of simulated settings, we now turn to using the CoCo estimator on a real data set. The proprietary data set comes from a major online company and contains the click-through rates for advertisements displayed on the company's webpages from May 19, 2016 through June 15, 2016. The click-through rate is the number of times a user clicks on a specific advertisement divided by the number of times the advertisement was displayed. The data set contains information on 1000 users, 189 advertisements, 19 publishers, and 2 different devices, aggregated across time. Thus, the data forms a fourth-order tensor where each entry in the tensor corresponds to the click-through rate for the given combination of user, advertisement, publisher, and device. Here a publisher refers to a different webpage within the online company's website, such as the main home page versus a page devoted to either breaking news or sports scores. The two device types correspond to how the user accessed the page, using either a personal computer or a mobile device such as a cell phone or tablet computer. The goal in this real application is to simultaneously cluster users, advertisements, and publishers to improve user behavior targeting and advertising planning.

In the click-through rate tensor data, over 99% of the values are missing since one user has likely seen only a handful of the possible advertisements. If a specific advertisement is never seen by a user, it is considered a missing value. Since the proposed CoCo estimator can only handle complete data, we first preprocess the data by imputing the missing values before any clustering can be done. To impute the missing entries, we use the CP-based tensor completion method of Jain and Oh (2014) and tune its rank via the information criterion proposed by Sun et al. (2017). This tuning method chooses the optimal rank as R = 20 from the rank list {1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 20, 22}. Finally, the imputed values are truncated to ensure all the values of the tensor are between 0 and 1 since click-through rates are proportions.

One mode of the fourth-order tensor has only two observations and those observations already have a natural grouping (device type). Therefore, for the sake of clustering we analyze the devices separately. We compare our method with CPD+k-means. Furthermore, the tuning parameter for convex co-clustering is automatically selected using the eBIC (Section 7.1) while the number of clusters in CPD+k-means is chosen via the gap statistic (Tibshirani et al., 2001). We do not include comparisons with CoTeC given its poor performance in the simulation experiments.

We first look at the clustering results from clustering the click-through rates for users accessing the advertisements through a personal computer (PC). Table 1 contains the number of clusters identified as well as the sizes of the clusters, while Figure 13a visualizes the advertisement-by-publisher biclusters for a randomly selected user. As to be expected, the advertisement-by-publisher slices display a checkerboard pattern, which turns into a checkerbox pattern when the slices are meshed together. The clustering results for the users are omitted in this paper to ensure user privacy. However, co-clustering the tensor does not result in the loss of information that would occur if the tensor was converted into a matrix by averaging across users or flattening along one of the modes. Table 1 and Figure 13a show that the CoCo estimator identifies four advertisement clusters, with one cluster being much bigger than the others. The advertisements in this large cluster have click-through rates that are close to the grand average in the data set. One of the small clusters has very low click-through rates, while the other two clusters tend to have much higher click-through rates than the rest of the advertisements. On the other hand, CPD+k-means clusters the advertisements into 57 groups, which is less useful from a practical standpoint. Many of the clusters are similarly sized and contain only a few advertisements, likely due to the inability of CPD+k-means to handle imbalanced cluster sizes, as was observed in the simulation experiments (Section 8.1.2). In terms of the publishers, the CoCo estimator identifies 3 clusters while CPD+k-means does not find any underlying grouping and simply identifies one big cluster, which again is not terribly useful (Table 1). We next provide some interpretations of the obtained clustering results of the publishers. One way online advertisers can reach more users is by entering agreements with other companies to route traffic to the advertiser's website. For example, Google and Apple have a revenue-sharing agreement in which Google pays Apple a percentage of the revenue generated by searches on iPhones (McGarry, 2016). Similarly, the online company being studied partners with several internet service providers (ISPs) to host the default home pages for the ISP's customers. It would make sense that these slightly different variants of the online company's main home page would have similar click-through rates, and the CoCo estimator in fact assigned these variants to the same cluster.

Table 1: Advertising Data Clustering Results

                     CoCo Estimator                                             CPD+k-means
         Advertisements                   Publisher                    Advertisements   Publisher
Device   # of clusters   Cluster sizes    # of clusters  Cluster sizes # of clusters    # of clusters
PC       4               (156, 22, 8, 3)  3              (4, 3, 12)    57               1
Mobile   3               (145, 22, 22)    2              (7, 12)       49               13

For users accessing the advertisements through a mobile device, such as a mobile phone or tablet computer, the CoCo estimator results for the advertisements are largely similar to the results for PCs (Table 1 and Figure 13b). There is one large cluster that contains click-through rates similar to the overall average, while the two other equally-sized clusters have relatively very low or very high click-through rates, respectively. The underlying click-through rates for the PC data have more variability than the mobile data, which is consistent with the identification of an additional cluster for the PC data. As before, CPD+k-means finds a large number of advertisement clusters, most of which are roughly the same size, again likely impacted by the imbalance in the cluster sizes. When compared to the personal computer device, one difference is that the cluster with the higher click-through rates for mobile devices is larger and has a higher average click-through rate than the similar clusters for the personal computer device. This finding is consistent with research by the Pew Research Center that found that click-through rates for mobile devices are higher than for advertisements viewed on a personal computer or laptop (Mitchell et al., 2012).

It is also enlightening to take a closer look at the underlying advertisements clustered across the two devices. All of the advertisements clustered in the high click-through rate cluster for the mobile devices are in the average click-through rate cluster for personal computers. In taking a closer look at the ads in these clusters, there are several ads related to online shopping for personal goods, such as jeans, workout clothes, or neck ties. It makes sense to shop for these types of goods using a mobile device, such as while at work when it is not appropriate to do so on a work computer. Conversely, all of the advertisements in either of the two higher PC click-through rate clusters are in the large, average click-through rate cluster for the mobile devices. There are several financial-related ads in these two PC clusters, such as for mortgages or general investment advice. On the other hand, there are not many online shopping ads in those clusters, with the exception of more expensive technology-related goods that one may want to invest more time in researching before making a purchase.

In terms of the publisher clusters on mobile devices, Table 1 shows that the CoCo estimator identifies two clusters of publishers while CPD+k-means identifies 13 small clusters. Contrary to the advertisement clusters, the publisher clusters across both devices are very similar. In fact, the only difference is that the smaller cluster for the mobile device, which contains seven publishers, is split into two clusters for personal computers. This can be seen in the click-through rate heatmaps given in Figure 13 by looking at the right part of each heatmap. The publishers in these smaller clusters have higher click-through rates on average than those in the larger cluster. Additionally, five of the seven (71%) publishers in the high click-through rate clusters have stand-alone apps that display ads, while only three of the twelve (25%) publishers in the larger cluster do. For mobile devices, it has been observed that in-app advertisements have higher click-through rates than browser-based ads (Hof, 2014). We conjecture that this is also true for personal computer apps, which is consistent with the clustering results. Thus it again appears that the clusters identified by CoCo also make sense practically.

10. Discussion

In this paper, we formulated and studied the problem of co-clustering of tensors as a convex optimization problem. The resulting CoCo estimator enjoys features in theory and practice that are arguably lacking in existing alternatives, namely statistical consistency, stability guarantees, and an algorithm with polynomial computational complexity. Through a battery of simulations, we observed that the CoCo estimator can identify co-clustering structures under realistic scenarios such as imbalanced co-cluster sizes, imbalanced number of clusters along each mode, heteroskedasticity in the noise distribution associated with each co-cluster, and even some violation of the checkerbox mean tensor assumption.

We have leveraged the power of the convex relaxation to engineer a computationally tractable co-clustering method that comes with statistical guarantees. These benefits, however, do not come for free. The CoCo estimator incurs costs similar to those incurred when the lasso is used as a surrogate for a cardinality constraint or penalty. It is well known that the lasso leads to parameter estimates that are shrunk towards zero. This shrinkage toward zero is the price for simultaneously estimating the support, or locations of the nonzero entries, in a sparse vector as well as the values of the nonzero entries. In the context of convex co-clustering, the CoCo estimator $\hat{\mathcal{U}}$ is shrunk towards the tensor $\bar{\mathcal{X}}$, namely the tensor whose entries are all equal to the average over all entries of $\mathcal{X}$. The weights, however, play a critical role in reducing this bias. In fact, the weights can be seen as serving the same role as the weights used in the adaptive lasso (Zou, 2006).

There are several possible extensions and open problems that have been left for future work. First, we note that there is a gap between what our theory predicts and what seems possible from our experiments. Specifically, Theorem 9 assumes uniform weights for each mode, yet simulation experiments indicate that the CoCo estimator using Tucker derived Gaussian kernel weights (9) can significantly outperform the CoCo estimator using uniform weights. One open problem is to derive prediction error bounds that relax the uniform weights assumption.

Second, although we have developed automatic methods for constructing the weights that work well empirically, other approaches to constructing the weights are a direction of future research. For example, other tensor approximation methods, such as the use of the ℓ1-norm to make the decomposition more robust to heavy tail noise as done by Cao et al. (2015), could possibly improve the quality of the weights.

Third, in this paper we have focused on additive noise that is a zero-mean M-concentrated random variable. Real data, however, may not follow such a distribution motivating co-clustering procedures that can handle outliers. To address potential robustness issues, the CoCo framework could be extended to handle outliers by swapping the sum of squared residuals term in (5) with an analogous Huber loss or Tukey’s Biweight function.

Finally, while our first order algorithm for co-clustering tensors scales linearly in the size of the data, data tensors inevitably will only increase in size, motivating the need for more scalable algorithms for computing the CoCo estimator. A natural approach would be to adopt an existing distributed version of the proximal methods, such as one of the methods proposed by Combettes and Pesquet (2011), Chen and Ozdaglar (2012), Li et al. (2013), or Eckstein (2017). Another natural approach would be to investigate if stochastic versions of the recently proposed generalized dual gradient ascent (Ho et al., 2019) could be adapted to compute the CoCo estimator. Additionally, in practice many data tensors that we would like to co-cluster may be very sparse. The first order algorithm presented here assumes the data tensor is dense. Consequently, an important direction of future work is to investigate alternative optimization algorithms that could leverage the sparsity structure within a data tensor.

Acknowledgments

The authors thank Xu Han for his help with the simulation experiments during the revision of this work. The authors also thank the action editor and three reviewers for their helpful comments and suggestions which led to a much improved presentation. Eric Chi acknowledges support from the National Science Foundation (DMS-1752692) and National Institutes of Health (R01GM135928). Will Wei Sun acknowledges support from the Office of Naval Research (ONR N00014–18-1–2759). Hua Zhou acknowledges support from the National Institutes of Health (R01GM053275 and R01HG006139). Finally, this research collaboration was partially funded by the National Science Foundation under grant DMS-1127914 (the Statistical and Applied Mathematical Sciences Institute). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, the National Institutes of Health, or the Office of Naval Research.

Appendix A. Tensor Decompositions

We review two basic tensor decompositions that generalize the singular value decomposition (SVD) of a matrix: (i) the CANDECOMP/PARAFAC (CP) decomposition (Carroll and Chang, 1970; Harshman, 1970) and (ii) the Tucker decomposition (Tucker, 1966). Just as the SVD can be used to construct a lower-dimensional approximation to a data matrix, these two decompositions can be used to construct a lower dimensional approximation to a D-way tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_D}$. The CP decomposition aims to approximate $\mathcal{X}$ by a sum of rank-one tensors, namely

\mathcal{X} \approx \sum_{i=1}^{R} a_i^{(1)} \circ a_i^{(2)} \circ \cdots \circ a_i^{(D)},

where ◦ represents the outer product and $a_i^{(d)}$ is the ith column of the dth factor matrix $A^{(d)} \in \mathbb{R}^{n_d \times R}$. The positive integer R denotes the rank of the approximation. For sufficiently large R, we can exactly represent $\mathcal{X}$ with a CP decomposition.
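In practice a CP approximation can be computed with standard software. The following Python sketch uses the open-source tensorly package (the experiments in this paper instead use the Matlab Tensor Toolbox); the rank and tensor sizes are arbitrary.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Build a small synthetic 3-way tensor with an exact rank-3 CP structure plus noise.
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(20, 3)), rng.normal(size=(30, 3)), rng.normal(size=(25, 3))
X = sum(np.einsum("i,j,k->ijk", A[:, r], B[:, r], C[:, r]) for r in range(3))
X += 0.01 * rng.normal(size=X.shape)

# Rank-3 CP decomposition: X is approximated by a sum of 3 rank-one tensors.
cp = parafac(tl.tensor(X), rank=3)
X_hat = tl.cp_to_tensor(cp)
print([f.shape for f in cp.factors])                    # [(20, 3), (30, 3), (25, 3)]
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))    # small relative error
```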

The Tucker decomposition aims to approximate $\mathcal{X}$ by a core tensor $\mathcal{H} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_D}$ multiplied by factor matrices along each of its modes, namely

\mathcal{X} \approx \mathcal{H} \times_1 A^{(1)} \times_2 A^{(2)} \times_3 \cdots \times_D A^{(D)} = \sum_{i_1=1}^{R_1}\sum_{i_2=1}^{R_2}\cdots\sum_{i_D=1}^{R_D} h_{i_1 i_2 \cdots i_D}\, a_{i_1}^{(1)} \circ a_{i_2}^{(2)} \circ \cdots \circ a_{i_D}^{(D)},

where $a_{i_d}^{(d)}$ is the $i_d$th column of the dth factor matrix $A^{(d)} \in \mathbb{R}^{n_d \times R_d}$. Typically the columns of $A^{(d)}$ are computed to be orthonormal and can be interpreted as principal components or basis vectors for the dth mode. For sufficiently large $R_1, \ldots, R_D$, we can exactly represent $\mathcal{X}$ with a Tucker decomposition.

Appendix B. Proofs of Smoothness Properties

B.1. Proof of Proposition 4

Without loss of generality, we can absorb γ into the weight matrices. Thus, we seek to show the continuity of $\hat{\mathcal{U}}$ with respect to $(\mathcal{X}, W_1, \ldots, W_D)$. We use the following compact representation of the weights

w = \big(\operatorname{vec}(W_1)^T, \operatorname{vec}(W_2)^T, \ldots, \operatorname{vec}(W_D)^T\big)^T \in \mathbb{R}^{\sum_{d=1}^{D}\binom{n_d}{2}}.

We check to see if the solution $\hat{\mathcal{U}}$ is continuous in the variable $\zeta = (x^T, w^T)^T$. It is easy to verify that the following function is jointly continuous in $\mathcal{U}$ and $\zeta$

f(\mathcal{U}, \zeta) = \frac{1}{2}\|\mathcal{X} - \mathcal{U}\|_F^2 + R(\mathcal{U}, w),

where

R(\mathcal{U}, w) = \sum_{d=1}^{D}\sum_{i<j} w_{d,ij}\, \|\mathcal{U} \times_d \Delta_{d,ij}\|_F

is a convex function of U that is continuous in (U, w). Let

\mathcal{U}(\zeta) = \operatorname*{arg\,min}_{\mathcal{U}} f(\mathcal{U}, \zeta).

Since f(U,ζ) is strongly convex in U, the minimizer U(ζ) exists and is unique.

We proceed with a proof by contradiction. Suppose $\mathcal{U}(\zeta)$ is not continuous at a point $\zeta$. Then there exists an $\epsilon > 0$ and a sequence $\{\zeta^{(m)}\}$ converging to a limit $\zeta$ such that $\|\mathcal{U}^{(m)} - \mathcal{U}(\zeta)\|_F \geq \epsilon$ for all m, where

\mathcal{U}^{(m)} = \operatorname*{arg\,min}_{\mathcal{U}} f(\mathcal{U}, \zeta^{(m)}).

Since $f(\mathcal{U}, \zeta)$ is strongly convex in $\mathcal{U}$, the minimizer $\mathcal{U}^{(m)}$ exists and is unique. Without loss of generality, we can assume $\|\zeta^{(m)} - \zeta\|_F \leq 1$. This fact will be used later in proving the boundedness of the sequence $\mathcal{U}^{(m)}$.

If $\mathcal{U}^{(m)}$ is a bounded sequence, then we can pass to a convergent subsequence with limit $\bar{\mathcal{U}}$. Fix an arbitrary point $\tilde{\mathcal{U}}$. Note that $f(\mathcal{U}^{(m)}, \zeta^{(m)}) \leq f(\tilde{\mathcal{U}}, \zeta^{(m)})$ for all m. Since f is continuous in $(\mathcal{U}, \zeta)$, taking limits gives us the inequality

f(\bar{\mathcal{U}}, \zeta) \leq f(\tilde{\mathcal{U}}, \zeta).

Since $\tilde{\mathcal{U}}$ was selected arbitrarily, it follows that $\bar{\mathcal{U}} = \mathcal{U}(\zeta)$, which is a contradiction. It only remains for us to show that the sequence $\mathcal{U}^{(m)}$ is bounded.

Consider the function

g(\mathcal{U}) = \sup_{\tilde{\zeta} : \|\tilde{\zeta} - \zeta\|_F \leq 1} \frac{1}{2}\|\tilde{\mathcal{X}} - \mathcal{U}\|_F^2 + R(\mathcal{U}, \tilde{w}).

Note that g is convex, since it is the point-wise supremum of a collection of convex functions. Since $f(\mathcal{U}, \zeta^{(m)}) \leq g(\mathcal{U})$ and f is strongly convex in $\mathcal{U}$, it follows that $g(\mathcal{U})$ is also strongly convex and therefore has a unique global minimizer $\mathcal{U}^*$ with finite optimal value $g(\mathcal{U}^*)$. It also follows that

f(\mathcal{U}^{(m)}, \zeta^{(m)}) \leq f(\mathcal{U}^*, \zeta^{(m)}) \leq g(\mathcal{U}^*) \quad (14)

for all m. By the reverse triangle inequality it follows that

\frac{1}{2}\big(\|\mathcal{U}^{(m)}\|_F - \|\mathcal{X}^{(m)}\|_F\big)^2 \leq \frac{1}{2}\|\mathcal{U}^{(m)} - \mathcal{X}^{(m)}\|_F^2 \leq f(\mathcal{U}^{(m)}, \zeta^{(m)}). \quad (15)

Combining the inequalities in (14) and (15), we arrive at the conclusion that

\frac{1}{2}\big(\|\mathcal{U}^{(m)}\|_F - \|\mathcal{X}^{(m)}\|_F\big)^2 \leq g(\mathcal{U}^*),

for all m. Suppose the sequence $\mathcal{U}^{(m)}$ is unbounded, namely $\|\mathcal{U}^{(m)}\|_F \to \infty$. But since $\mathcal{X}^{(m)}$ converges to $\mathcal{X}$, the left hand side must diverge. Thus, we arrive at a contradiction if $\mathcal{U}^{(m)}$ is unbounded.

B.2. Proof of Proposition 5

First suppose that $U_{(d)} = \mathbf{1}c^T$, namely all the mode-d subarrays of $\mathcal{U}$ are identical. Recall that $\mathcal{Z} = \mathcal{U} \times_d A$ if and only if $Z_{(d)} = AU_{(d)}$. Therefore, $R_d(\mathcal{U}) = 0$ since $\Delta_{d,ij}\mathbf{1}c^T = 0$ for all $(i,j) \in E_d$.

Now suppose that $R_d(\mathcal{U})$ is zero. Take an arbitrary pair (i, j) with i < j. By Assumption 4.1, there exists a path $i \rightarrow k \rightarrow \cdots \rightarrow l \rightarrow j$ along which the weights are positive. Let w denote the smallest weight along this path, namely $w = \min\{w_{d,ik}, \ldots, w_{d,lj}\}$. By the triangle inequality

\|\mathcal{U} \times_d \Delta_{d,ij}\|_F \leq \|\mathcal{U} \times_d \Delta_{d,ik}\|_F + \cdots + \|\mathcal{U} \times_d \Delta_{d,lj}\|_F.

We can then conclude that

w\|\mathcal{U} \times_d \Delta_{d,ij}\|_F \leq R_d(\mathcal{U}) = 0.

It follows that $e_i^T U_{(d)} = e_j^T U_{(d)}$, since w is positive. Since the pair (i, j) is arbitrary, it follows that all the rows of $U_{(d)}$ are identical or, in other words, $U_{(d)} = \mathbf{1}c^T$ for some $c \in \mathbb{R}^{n_{-d}}$. ■

B.3. Proof of Proposition 6

We will show that there is a $\gamma_{\max}$ such that for all $\gamma \geq \gamma_{\max}$, the grand mean tensor $\bar{\mathcal{X}}$ is the unique global minimizer to the primal objective (4). We will certify that $\bar{\mathcal{X}}$ is the solution to the primal problem by showing that the optimal value of a dual problem, which lower bounds the primal, equals $F_\gamma(\bar{\mathcal{X}})$.

Note that the Lagrangian dual given in (28) is a tight lower bound on $F_\gamma(\mathcal{U})$.

\max_{\lambda \in C_\gamma} \; -\frac{1}{2}\|A^T\lambda\|_2^2 + \langle\lambda, Ax\rangle.

For sufficiently large γ, the solution to the dual maximization problem coincides with the solution to the unconstrained maximization problem

\max_{\lambda} \; -\frac{1}{2}\|A^T\lambda\|_2^2 + \langle\lambda, Ax\rangle,

whose solution is $\lambda^* = (AA^T)^\dagger Ax$. Plugging $\lambda^*$ into the dual objective gives an optimal value of

\frac{1}{2}\|A^T(AA^T)^\dagger Ax\|_2^2 = \frac{1}{2}\big\|x - [I - A^T(AA^T)^\dagger A]x\big\|_2^2.

Note that $[I - A^T(AA^T)^\dagger A]$ is the projection onto the orthogonal complement of the column space of $A^T$, which is equivalent to the null space or kernel of A, denoted Ker(A). We will show below that Ker(A) is the span of the all ones vector. Consequently,

[I - A^T(AA^T)^\dagger A]x = \frac{1}{n}\langle x, \mathbf{1}\rangle \mathbf{1}.

Note that the smallest γ such that λ* ∈ Cγ is an upper bound on γmax.

We now argue that Ker(A) is the span of $\mathbf{1}_n$. We rely on the following fact: If $\Phi_d$ is an incidence matrix of a connected graph with $n_d$ vertices, then the rank of $\Phi_d$ is $n_d - 1$ (Deo, 1974, Theorem 7.2). According to Assumption 4.1, the mode-d graphs are connected; it follows that $\Phi_d \in \{-1, 0, 1\}^{|E_d| \times n_d}$ has rank $n_d - 1$. It follows then that $\operatorname{Ker}(\Phi_d)$ has dimension one. Furthermore, since each row of $\Phi_d$ has one 1 and one −1, it follows that $\mathbf{1} \in \operatorname{Ker}(\Phi_d) \subset \mathbb{R}^{n_d}$. A vector $z \in \operatorname{Ker}(A)$ if and only if $z \in \operatorname{Ker}(A_d)$ for all d.

Recall that the rank of the Kronecker product $A \otimes B$ is the product of the ranks of the matrices A and B. This rank property of Kronecker products of matrices implies that the dimension of $\operatorname{Ker}(A_{-d}) = \cap_{j \neq d}\operatorname{Ker}(A_j)$ equals $n_d$. Let $b_i = \mathbf{1}_{n_D} \otimes \cdots \otimes \mathbf{1}_{n_{d+1}} \otimes e_i \otimes \mathbf{1}_{n_{d-1}} \otimes \cdots \otimes \mathbf{1}_{n_1}$, where $\mathbf{1}_p \in \mathbb{R}^p$ is the vector of all ones and $e_i \in \mathbb{R}^{n_d}$ is the ith standard basis vector. Then the set of vectors $B = \{b_1, b_2, \ldots, b_{n_d}\}$ forms a basis for $\operatorname{Ker}(A_{-d})$.

Take an arbitrary element from $\operatorname{Ker}(A_{-d})$, namely a vector of the form $\mathbf{1}_{n'} \otimes a \otimes \mathbf{1}_{n''}$, where $n' = \prod_{j=d+1}^{D} n_j$ and $n'' = \prod_{j=1}^{d-1} n_j$. We will show that in order for $\mathbf{1}_{n'} \otimes a \otimes \mathbf{1}_{n''} \in \operatorname{Ker}(A_d)$, a must be a multiple of $\mathbf{1}_{n_d}$. Consider the relevant matrix-vector product

A_d(\mathbf{1}_{n_D} \otimes \cdots \otimes a \otimes \cdots \otimes \mathbf{1}_{n_1}) = \big(\mathbf{1}_{n_D} \otimes \cdots \otimes \mathbf{1}_{n_{d+1}} \otimes \Phi_d a \otimes \mathbf{1}_{n_{d-1}} \otimes \cdots \otimes \mathbf{1}_{n_1}\big).

Therefore, $A_d(\mathbf{1}_{n'} \otimes a \otimes \mathbf{1}_{n''}) = 0$ if and only if $\Phi_d a = 0$. But the only way for $\Phi_d a$ to be zero is for $a = c\mathbf{1}_{n_d}$ for some scalar c. Thus, Ker(A) is the span of $\mathbf{1}_n$.

B.4. Proof of Proposition 7

Note that U^ is the proximal mapping of the closed, convex function

\sum_{d=1}^{D} R_d(\mathcal{U}).

Then U^ is firmly nonexpansive in X (Combettes and Wajs, 2005, Lemma 2.4). Finally, firmly nonexpansive mappings are nonexpansive, which completes the proof. ■

Appendix C. Proof of Theorem 9

We first prove some auxiliary lemmas before proving our prediction error result.

C.1. Auxiliary Lemmas

The following lemma considers the concentration of a random quadratic form yTBy for a M-concentrated random vector y and a deterministic matrix B (Vu and Wang, 2015). It can be viewed as a generalization of the standard Hanson and Wright inequality for the quadratic forms of independent sub-Gaussian random variables (Hanson and Wright, 1971).

Lemma 11 Let $y \in \mathbb{R}^n$ be a M-concentrated random vector, see Definition 8. Then there are constants $C, C' > 0$ such that for any matrix $B \in \mathbb{R}^{n \times n}$

\mathbb{P}\Big(|y^T B y - \operatorname{tr}(B)| \geq t\Big) \leq C\log(n)\exp\Big\{-C'M^{-2}\min\Big[\frac{t^2}{\|B\|_F^2\log(n)}, \frac{t}{\|B\|_2}\Big]\Big\}.

The next lemma studies the properties of the matrix Ad,ij, defined in (6), in the penalty function. Denote Sd as the matrix constructed by concatenating Ad,ij, i < j vertically. That is,

S_d = \big(A_{d,12}^T, A_{d,13}^T, \ldots, A_{d,n_d-1\,n_d}^T\big)^T \in \mathbb{R}^{\big[\binom{n_d}{2} n_{-d}\big] \times n}. \quad (16)

Lemma 12 For each d = 1, … ,D, the rank of the matrix $S_d$ is $(n_d - 1)n_{-d}$. Denote $\sigma_{\min}(S_d)$ and $\sigma_{\max}(S_d)$ as the minimum non-zero singular value and maximum singular value of $S_d$, respectively. We have $\sigma_{\min}(S_d) = \sigma_{\max}(S_d) = \sqrt{n_d}$.

The proof of Lemma 12 follows from Lemma 1 in Tan and Witten (2015) and is omitted. According to Lemma 12, we can construct a singular value decomposition of $S_d = U_d \Lambda_d V_d^T$, where $U_d \in \mathbb{R}^{[\binom{n_d}{2} n_{-d}] \times (n_d-1)n_{-d}}$, $\Lambda_d \in \mathbb{R}^{(n_d-1)n_{-d} \times (n_d-1)n_{-d}}$, and $V_d \in \mathbb{R}^{n \times (n_d-1)n_{-d}}$. Denote

G_d = U_d \Lambda_d \in \mathbb{R}^{\big[\binom{n_d}{2} n_{-d}\big] \times (n_d-1)n_{-d}}, \quad (17)

and its pseudo-inverse as $G_d^\dagger \in \mathbb{R}^{(n_d-1)n_{-d} \times [\binom{n_d}{2} n_{-d}]}$. The following lemma studies the properties of $G_d$ and $G_d^\dagger$, for each d = 1, … ,D.

Lemma 13 For each d = 1, … ,D, the rank of the matrix $G_d$ is $(n_d - 1)n_{-d}$. The minimal non-zero singular value and maximal singular value of $G_d$ are $\sigma_{\min}(G_d) = \sigma_{\max}(G_d) = \sqrt{n_d}$. Moreover, $\sigma_{\min}(G_d^\dagger) = \sigma_{\max}(G_d^\dagger) = 1/\sqrt{n_d}$.

Lemma 13 follows directly from the conclusions in Lemma 12.

C.2. Proof of Main Theorem

We first reformulate our optimization problem via a decomposition approach to simplify the theoretical analysis. Such a strategy was developed in Liu et al. (2013a) and has been successfully applied in Tan and Witten (2015) and Wang et al. (2018).

Denote γd = γ/nd. Our convex tensor co-clustering method is equivalent to solving

\hat{u} = \operatorname*{arg\,min}_{u}\Big\{\frac{1}{2}\|x - u\|_2^2 + \sum_{d=1}^{D}\gamma_d \sum_{(i,j) \in E_d}\|A_{d,ij} u\|_2\Big\}. \quad (18)

According to the definition of Sd in (16), we define the penalty function R(·) such that

R(S_d u) = \sum_{(i,j) \in E_d}\|A_{d,ij} u\|_2.

According to the singular value decomposition $S_d = U_d\Lambda_d V_d^T$, there exists a matrix $W_d \in \mathbb{R}^{n \times n_{-d}}$ such that $\tilde{V}_d = [W_d, V_d] \in \mathbb{R}^{n \times n}$ is an orthogonal matrix and $W_d^T V_d = 0$. Let $\alpha_d = W_d^T u \in \mathbb{R}^{n_{-d}}$ and $\beta_d = V_d^T u \in \mathbb{R}^{(n_d-1)n_{-d}}$. Clearly, we have

W_d\alpha_d + V_d\beta_d = W_d W_d^T u + V_d V_d^T u = \tilde{V}_d\tilde{V}_d^T u = u, \quad (19)

for any d = 1, … ,D. This fact together with the definition of Gd = UdΛd in (17) imply that solving our convex tensor clustering in (18) is equivalent to solving

\min_{\alpha_d, \beta_d,\, d=1,\ldots,D}\; \sum_{d=1}^{D}\Big\{\frac{1}{2D}\|x - W_d\alpha_d - V_d\beta_d\|_2^2 + \gamma_d R(G_d\beta_d)\Big\} \quad (20)

Denote the solution of (20) as $\hat{\alpha}_d, \hat{\beta}_d$, d = 1, … ,D, which corresponds to the estimator $\hat{u}$ in (18) according to (19). Similarly, we denote the true parameters as $\alpha_d^*, \beta_d^*$, which correspond to $u^*$ defined in Assumption 4.2. Our goal is to derive the upper bound of $\|\hat{u} - u^*\|_2^2$ via the above reparametrization. Since $\hat{\alpha}_d, \hat{\beta}_d$, d = 1, … ,D minimize the objective function in (20), we have

\sum_{d=1}^{D}\Big\{\frac{1}{2D}\|x - W_d\hat{\alpha}_d - V_d\hat{\beta}_d\|_2^2 + \gamma_d R(G_d\hat{\beta}_d)\Big\} \leq \sum_{d=1}^{D}\Big\{\frac{1}{2D}\|x - W_d\alpha_d^* - V_d\beta_d^*\|_2^2 + \gamma_d R(G_d\beta_d^*)\Big\}.

Note that $\|x - \hat{u}\|_2^2 - \|x - u^*\|_2^2 = \|\hat{u}\|_2^2 - \|u^*\|_2^2 - 2x^T(\hat{u} - u^*) = \|\hat{u} - u^*\|_2^2 + 2\epsilon^T(u^* - \hat{u})$, where the last equality is due to the model assumption $x = u^* + \epsilon$. Therefore, we have

\frac{1}{2}\|\hat{u} - u^*\|_2^2 + \sum_{d=1}^{D}\gamma_d R(G_d\hat{\beta}_d) \leq \frac{1}{2D}\sum_{d=1}^{D}\epsilon^T(u^* - \hat{u}) + \sum_{d=1}^{D}\gamma_d R(G_d\beta_d^*) \leq \frac{1}{2D}\sum_{d=1}^{D}\underbrace{\big|\epsilon^T[W_d(\alpha_d^* - \hat{\alpha}_d) + V_d(\beta_d^* - \hat{\beta}_d)]\big|}_{f(\hat{\alpha}_d,\,\hat{\beta}_d)} + \sum_{d=1}^{D}\gamma_d R(G_d\beta_d^*). \quad (21)

Next we derive the bound for $f(\hat{\alpha}_d, \hat{\beta}_d)$. Note that the optimization over $\alpha_d$ in (20) has a closed form since the penalty term is independent of $\alpha_d$. In particular, by setting the derivative of $\|x - W_d\alpha_d - V_d\beta_d\|_2^2$ with respect to $\alpha_d$ to be zero, we obtain that $\alpha_d = W_d^T(x - V_d\beta_d)$. This implies that

\hat{\alpha}_d = W_d^T(x - V_d\hat{\beta}_d) = W_d^T(W_d\alpha_d^* + V_d\beta_d^* + \epsilon - V_d\hat{\beta}_d) = \alpha_d^* + W_d^T\epsilon, \quad (22)

where the second equality is due to $x = u^* + \epsilon$ and the last equality is due to the fact that $W_d^T V_d = 0$ and $W_d^T W_d = I$. According to (22), we have

f(\hat{\alpha}_d, \hat{\beta}_d) = \big|\epsilon^T W_d W_d^T\epsilon + \epsilon^T V_d(\beta_d^* - \hat{\beta}_d)\big| \leq \underbrace{|\epsilon^T W_d W_d^T\epsilon|}_{(\mathrm{I})} + \underbrace{|\epsilon^T V_d(\beta_d^* - \hat{\beta}_d)|}_{(\mathrm{II})}. \quad (23)

Bound (I): We apply the concentration inequality in Lemma 11 to bound (I). It remains to compute $\|W_d W_d^T\|_F^2$ and $\|W_d W_d^T\|_2$. By construction, $W_d W_d^T \in \mathbb{R}^{n \times n}$ is a projection matrix since $\tilde{V}_d\tilde{V}_d^T = W_d W_d^T + V_d V_d^T = I$. Therefore, the rank of $W_d W_d^T$ is $\prod_{j \neq d} n_j$, $\|W_d W_d^T\|_F^2 = \prod_{j \neq d} n_j$, $\|W_d W_d^T\|_2 = 1$, and $\operatorname{tr}(W_d W_d^T) = \prod_{j \neq d} n_j$.

Denote $n = \prod_{d=1}^{D} n_d$. By Lemma 11 and Assumption 4.2, we have

\mathbb{P}\Big(\epsilon^T W_d W_d^T\epsilon \geq t + n_{-d}\Big) \leq C\log(n)\exp\Big\{-C'M^{-2}\min\Big[\frac{t^2}{\log(n)\, n_{-d}}, t\Big]\Big\}.

Setting $t = \sqrt{n_{-d}\log(n)^2}$, we have

\mathbb{P}\Big(\epsilon^T W_d W_d^T\epsilon \geq \sqrt{\log(n)^2\, n_{-d}} + n_{-d}\Big) \leq C\exp\{\log\log(n) - C'M^{-2}\log(n)\}, \quad (24)

where the right hand side converges to zero as the dimension $n = \prod_{d=1}^{D} n_d \to \infty$. Note that our error $\epsilon$ in Assumption 4.2 is assumed to be an M-concentrated random variable. If we assume a stronger condition such that $\epsilon$ is a vector with iid sub-Gaussian entries, we can obtain an upper bound $\sqrt{\log(n)\, n_{-d}} + n_{-d}$ according to the Hanson and Wright inequality (Hanson and Wright, 1971). Therefore, in spite of the relaxation in the error assumption, our bound in (24) is only up to a log-term larger.

Bound (II): By the definitions of $G_d$ in (17) and $G_d^\dagger$, we have $G_d^\dagger G_d = I$. Furthermore, let $G_{d,ij}^\dagger$ refer to the columns of $G_d^\dagger$ that correspond to the index (i, j), and let $G_{d,ij}$ refer to the rows of $G_d$ that correspond to the index (i, j). We have

(\mathrm{II}) = |\epsilon^T V_d(\beta_d^* - \hat{\beta}_d)| = |\epsilon^T V_d G_d^\dagger G_d(\beta_d^* - \hat{\beta}_d)| = \Big|\sum_{i<j}\epsilon^T V_d G_{d,ij}^\dagger G_{d,ij}(\beta_d^* - \hat{\beta}_d)\Big| \leq \sum_{i<j}\|\epsilon^T V_d G_{d,ij}^\dagger\|_2\,\|G_{d,ij}(\beta_d^* - \hat{\beta}_d)\|_2 \leq \underbrace{\max_{i<j}\|\epsilon^T V_d G_{d,ij}^\dagger\|_2}_{\mathrm{II}_1}\;\sum_{i<j}\|G_{d,ij}(\beta_d^* - \hat{\beta}_d)\|_2.

Bound II$_1$: By construction, $\epsilon^T V_d G_{d,ij}^\dagger \in \mathbb{R}^{n_{-d}}$. We have

\|\epsilon^T V_d G_{d,ij}^\dagger\|_2 \leq \sqrt{n_{-d}}\,\|\epsilon^T V_d G_{d,ij}^\dagger\|_\infty,

and hence

\max_{i<j}\|\epsilon^T V_d G_{d,ij}^\dagger\|_2 \leq \sqrt{n_{-d}}\,\max_{i<j}\|\epsilon^T V_d G_{d,ij}^\dagger\|_\infty = \sqrt{n_{-d}}\,\|\epsilon^T V_d G_d^\dagger\|_\infty.

Let $\eta_j = e_j^T (G_d^\dagger)^T V_d^T\epsilon$, where $e_j \in \mathbb{R}^{\binom{n_d}{2} n_{-d}}$ is the basis vector with the jth entry one and the rest zeros. According to Lemma 13 and the property of $V_d$, which consists of singular vectors, we have $\sigma_{\max}(V_d) = 1$ and $\sigma_{\max}(G_d^\dagger) = 1/\sqrt{n_d}$. Therefore, $\eta_j$ is an $M/\sqrt{n_d}$-concentrated random variable with mean zero. According to the definition of a concentrated random variable in Definition 8, we have

\mathbb{P}(|\eta_j| \geq t_1) \leq C_1\exp\Big(-\frac{C_2\, n_d\, t_1^2}{M^2}\Big).

Therefore, by the union bound, we have

\mathbb{P}\Big(\max_j|\eta_j| \geq t_1\Big) \leq C_1\binom{n_d}{2} n_{-d}\exp\Big(-\frac{C_2\, n_d\, t_1^2}{M^2}\Big).

By setting $t_1 = \sqrt{\log(n)\log\big[\binom{n_d}{2} n_{-d}\big]/n_d}$, we have

\mathbb{P}\Big(\|\epsilon^T V_d G_d^\dagger\|_\infty \geq \sqrt{\log(n)\log\big[\binom{n_d}{2} n_{-d}\big]/n_d}\Big) \leq \frac{C_3}{n},

for some constant $C_3 > 0$. Hence with probability at least $1 - C_3/n$, we have

\mathrm{II}_1 \leq \sqrt{n_{-d}}\sqrt{\log(n)\log\big[\binom{n_d}{2} n_{-d}\big]/n_d}. \quad (25)

Plugging the results in (24) and (25) into (23), we obtain that, for each d = 1, … ,D

f(\hat{\alpha}_d, \hat{\beta}_d) \leq \sqrt{\log(n)^2\, n_{-d}} + n_{-d} + \sqrt{n_{-d}}\sqrt{\log(n)\log\big[\tbinom{n_d}{2} n_{-d}\big]/n_d}\;\sum_{i<j}\|G_{d,ij}(\beta_d^* - \hat{\beta}_d)\|_2.

Therefore, Assumption 4.3 on the tuning parameter γd implies that

f(\hat{\alpha}_d, \hat{\beta}_d) \leq \sqrt{\log(n)^2\, n_{-d}} + n_{-d} + D\gamma_d\sum_{i<j}\|G_{d,ij}(\beta_d^* - \hat{\beta}_d)\|_2,

by noting that $\log\big(\tbinom{n_d}{2} n_{-d}\big) \leq \log(n_d^2\, n_{-d}) \leq 2\log(n)$. Combining this with the inequality in (21) leads to

\frac{1}{2}\|\hat{u} - u^*\|_2^2 \leq \frac{1}{2D}\sum_{d=1}^{D}\big[\sqrt{\log(n)^2\, n_{-d}} + n_{-d}\big] + \frac{3}{2}\sum_{d=1}^{D}\gamma_d R(S_d u^*). \quad (26)

According to the cluster structure assumption in Assumption 4.2, there are kd clusters along the dth mode of the tensor. Therefore, along each mode the true parameter U* only has a few different slices. Denote Ui* as the i-th mode-d subarray. Formally, we have

R(S_d u^*) = \sum_{(i,j),\, i<j,\, i,j=1,\ldots,n_d}\|A_{d,ij} u^*\|_2 = \sum_{(i,j),\, i<j,\, i,j=1,\ldots,n_d}\|\mathcal{U}_i^* - \mathcal{U}_j^*\|_F \leq 4C_0^2\binom{n_d}{2}\prod_{j \neq d}k_j, \quad (27)

where C0 is a constant upper bound for the entries of U*. Combining the inequalities in (26) and (27) with the condition on γd given in Assumption 4.3 implies that

\frac{1}{2}\|\hat{u} - u^*\|_2^2 \leq \frac{1}{2D}\sum_{d=1}^{D}\Big(\sqrt{\log(n)^2\, n_{-d}} + n_{-d}\Big) + \frac{3}{2}\sum_{d=1}^{D}\frac{2c_0\log(n)\sqrt{n}}{D\, n_d}\cdot 4C_0^2\binom{n_d}{2}\prod_{j \neq d}k_j.

Dividing both sides by n gives the prediction error bound in (7). This ends the proof of Theorem 9. ■

Appendix D. Derivation of Lagrangian Dual

Let $\mathcal{U} \times_d A$ denote the multiplication of $\mathcal{U}$ along mode d by the matrix A. Recall that for a tensor $\mathcal{U} \in \mathbb{R}^{n_1 \times \cdots \times n_D}$ and a matrix $A \in \mathbb{R}^{L \times n_d}$

\operatorname{vec}(\mathcal{U} \times_d A) = (I_{n_D} \otimes \cdots \otimes I_{n_{d+1}} \otimes A \otimes I_{n_{d-1}} \otimes \cdots \otimes I_{n_1})\, u,

where $u = \operatorname{vec}(\mathcal{U}) = \operatorname{vec}(U_{(1)})$, namely the column-major vectorization of the mode-1 matricization of the tensor $\mathcal{U}$. Note that $\mathcal{Y} = \mathcal{U} \times_d A$ is equivalent to $Y_{(d)} = AU_{(d)}$. We rewrite the penalty function $R_d$ as follows:

R_d(\mathcal{U}) = \sum_{l \in E_d} w_{d,l}\|\mathcal{U} \times_d \Delta_{d,l}\|_F = \sum_{l \in E_d} w_{d,l}\|\operatorname{vec}(\mathcal{U} \times_d \Delta_{d,l})\|_2 = \sum_{l \in E_d} w_{d,l}\|A_{d,l} u\|_2,

where $A_{d,l} = (I_{n_D} \otimes \cdots \otimes I_{n_{d+1}} \otimes \Delta_{d,l} \otimes I_{n_{d-1}} \otimes \cdots \otimes I_{n_1})$.

We now write down the Lagrangian:

L(u, v, \lambda) = \frac{1}{2}\|x - u\|_2^2 + \sum_{d=1}^{D}\sum_{l \in E_d}\big\{\gamma w_{d,l}\|v_{d,l}\|_2 + \langle\lambda_{d,l}, A_{d,l}u - v_{d,l}\rangle\big\} = \Big\{\frac{1}{2}\|x - u\|_2^2 + \sum_{d=1}^{D}\langle A_d^T\lambda_d, u\rangle\Big\} - \sum_{d=1}^{D}\sum_{l \in E_d}\big\{\langle\lambda_{d,l}, v_{d,l}\rangle - \gamma w_{d,l}\|v_{d,l}\|_2\big\} = \Big\{\frac{1}{2}\|x - u\|_2^2 + \langle A^T\lambda, u\rangle\Big\} - \sum_{d=1}^{D}\sum_{l \in E_d}\big\{\langle\lambda_{d,l}, v_{d,l}\rangle - \gamma w_{d,l}\|v_{d,l}\|_2\big\}.

The Lagrangian dual objective is given by G(λ) by minimizing the Lagrangian L(u,v,λ) over the primal variables u and v, namely

G(\lambda) = \min_{u,v} L(u, v, \lambda) = \min_{u}\Big\{\frac{1}{2}\|x - u\|_2^2 + \langle A^T\lambda, u\rangle\Big\} - \sum_{d=1}^{D}\sum_{l \in E_d}\max_{v_{d,l}}\big\{\langle\lambda_{d,l}, v_{d,l}\rangle - \gamma w_{d,l}\|v_{d,l}\|_2\big\} = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|x - A^T\lambda\|_2^2 - \sum_{d=1}^{D}\sum_{l \in E_d}\iota_{C_{d,l}}(\lambda_{d,l}), \quad (28)

where $\iota_{C_{d,l}}$ is the indicator function of the closed convex set $C_{d,l} = \{z : \|z\|_2 \leq \gamma w_{d,l}\}$.

The last equality in (28) follows from the fact that the Fenchel conjugate of a norm is the indicator function of the unit dual norm ball. Recall that the Fenchel conjugate f* of a function f is given by

f^*(\lambda) = \sup_{v}\{\langle\lambda, v\rangle - f(v)\}.

Let $B = \{\lambda : \|\lambda\|_2 \leq 1\}$ denote the unit 2-norm ball. Since the 2-norm is self dual, we arrive at the identity

\iota_B(\lambda) = \sup_{v}\{\langle\lambda, v\rangle - \|v\|_2\}.

Appendix E. Projected Gradient Applied to the Lagrangian Dual

Note that the dual problem (8) has the form

\text{minimize } g(\lambda) \quad \text{subject to } \lambda \in C, \quad (29)

where $g(\lambda)$ is a convex and Lipschitz-differentiable function and the constraint set C is a closed convex set, which implies that every point $\lambda$ possesses a unique orthogonal projection, $P_C(\lambda) = \operatorname*{arg\,min}_{\theta \in C}\|\theta - \lambda\|_2$, onto C. When $P_C(\lambda)$ can be computed analytically, a simple and effective iterative algorithm for solving problems like (29) is the projected gradient descent algorithm, a special case of the proximal gradient descent algorithm (Combettes and Wajs, 2005; Combettes and Pesquet, 2011). Recall that projected gradient descent alternates between taking a gradient step and projecting onto the set C. Thus, at the mth iteration, we perform the following update

\lambda^{(m)} = P_C\big(\lambda^{(m-1)} - \eta\nabla g(\lambda^{(m-1)})\big), \quad (30)

where η is a step-length parameter.

Applying the update rule in (30) to the dual problem (8), we obtain the following rule for computing the mth iteration

u^{(m)} = x - A^T\lambda^{(m-1)}, \qquad \lambda^{(m)} = P_C\big(\lambda^{(m-1)} + \eta A u^{(m)}\big).

Note that, at the mth iteration, the gradient of the least squares objective in (8) is given by −Au(m). Thus, we automatically update our CoCo estimator u(m) as part of our gradient calculation. Finally, we note that the projection onto the set C consists of independent projections onto the sets Cd,l that can be carried out in parallel.
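A compact Python sketch of this iteration is given below for a generic penalty operator A with blockwise dual constraints. The one-dimensional fused example, the block layout, and the step-size choice are illustrative stand-ins for the tensor operator described above, not the authors' implementation.

```python
import numpy as np

def project_blocks(lam, blocks, radii):
    """Project each dual block lambda_{d,l} onto the ball {z : ||z||_2 <= gamma * w_{d,l}}."""
    out = lam.copy()
    for (start, stop), r in zip(blocks, radii):
        nrm = np.linalg.norm(out[start:stop])
        if nrm > r:
            out[start:stop] *= r / nrm
    return out

def dual_projected_gradient(x, A, blocks, radii, eta, n_iter=500):
    """Iterate u = x - A^T lam, then lam = P_C(lam + eta * A u)."""
    lam = np.zeros(A.shape[0])
    for _ in range(n_iter):
        u = x - A.T @ lam                    # the primal iterate comes for free
        lam = project_blocks(lam + eta * (A @ u), blocks, radii)
    return x - A.T @ lam, lam

# Toy example: fused differences of a 1-D signal (a stand-in for the tensor operator A).
n = 8
x = np.concatenate([np.zeros(4), np.ones(4)]) + 0.05 * np.random.randn(n)
A = np.diff(np.eye(n), axis=0)               # rows are e_{i+1} - e_i
blocks = [(i, i + 1) for i in range(n - 1)]  # each "block" is a single difference here
radii = [0.2] * (n - 1)                      # gamma * w_l for each edge
eta = 1.0 / np.linalg.norm(A, 2) ** 2        # below 2 / spectral_radius(A^T A)
u_hat, lam_hat = dual_projected_gradient(x, A, blocks, radii, eta)
print(np.round(u_hat, 2))
```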

E.1. Per-Iteration and Storage Costs

The gradient update is dominated by the matrix-vector multiplications $A^T\lambda$ and $Au$. Although A is a $\sum_{d=1}^{D}|E_d| n_{-d}$-by-$n$ matrix it has only $2\sum_{d=1}^{D}|E_d| n_{-d}$ non-zero elements. Thus, computing the gradient step requires $O(\sum_{d=1}^{D}|E_d| n_{-d})$ flops. Projecting onto the set C also requires $O(\sum_{d=1}^{D}|E_d| n_{-d})$ flops since projecting onto the set $C_{d,l}$ requires $O(n_{-d})$ flops. Thus, the per-iteration cost is $O(\sum_{d=1}^{D}|E_d| n_{-d})$ flops. The storage cost is dominated by storing the dual variable $\lambda$, which has $\sum_{d=1}^{D}|E_d| n_{-d}$ elements. At first glance these storage and per-iteration costs may seem prohibitive, as $|E_d|$ can be as large as $O(n_d^2)$ for a fully connected mode-d graph. Shrinking together all combinations of pairs of mode-d subarrays, however, typically produces poor clustering results in comparison to shrinking together mode-d subarrays that are nearest-neighbors as observed in prior work in convex clustering (Chen et al., 2015; Chi and Lange, 2015) and convex biclustering (Chi et al., 2017). Consequently, we employ sparse weights. Specifically, we keep positive weights between approximately nearest-neighbor mode-d subarrays so that $|E_d|$ is $O(n_d)$. By using these sparse weights, the per-iteration and storage costs scale more reasonably as O(Dn), namely linearly in either the number of dimensions D or in the number of elements n. Details on our weights choices are elaborated in Section 6.

E.2. Convergence

The sequence of dual iterates λ(m) is guaranteed to converge to a solution λ^ of (8) provided that the step-size parameter η is less than twice the reciprocal of the spectral radius of the matrix ATA (Combettes and Wajs, 2005, Theorem 3.4). Consequently, the sequence of primal iterates u(m) is guaranteed to converge to the CoCo estimator û. We note that under the same step-size conditions, convergence of the sequence u(m) can also be guaranteed by observing that the projected gradient algorithm applied to the dual problem (8) is an example of the alternating minimization algorithm (Tseng, 1991, Proposition 2).

E.3. Monitoring Convergence via the Duality Gap

Recall that we can bound the suboptimality of the mth iterate, Fγ(u(m)) − Fγ(û), by the duality gap Fγ(u(m)) − G(λ(m)), which can be expressed solely in terms of the mth iterate of the primal variable u(m), namely

F_\gamma(u^{(m)}) - G(\lambda^{(m)}) = \|u^{(m)}\|_2^2 - \langle x, u^{(m)}\rangle + \gamma\sum_{d=1}^{D}\sum_{l \in E_d} w_{d,l}\|A_{d,l} u^{(m)}\|_2.

For any optimal dual solution λ^, the gap vanishes, namely Fγ(u^)=G(λ^). Note that computing the duality gap incurs minimal additional cost as u(m) and Ad,lu(m) are already computed as part of the gradient step. In short, including a duality gap computation will not change the O(Dn) per-iteration cost of the projected gradient algorithm. In practice, we can terminate the algorithm once the duality gap falls below some small tolerance.
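As an illustration, the duality gap above can be evaluated with a few lines of Python, reusing the quantities already formed during the gradient step; the block and weight arguments mirror the projected gradient sketch above and are not part of the authors' implementation.

```python
import numpy as np

def duality_gap(x, u, A, blocks, weights, gamma):
    """F_gamma(u) - G(lambda) = ||u||^2 - <x, u> + gamma * sum_l w_l ||A_l u||_2."""
    Au = A @ u
    penalty = sum(w * np.linalg.norm(Au[start:stop])
                  for (start, stop), w in zip(blocks, weights))
    return float(u @ u - x @ u + gamma * penalty)

# e.g., with the toy problem above: duality_gap(x, u_hat, A, blocks,
#       weights=[1.0] * (len(x) - 1), gamma=0.2)
```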

E.4. Computing Mode-d Difference Variables

In Section 7.2, we explained how clustering assignments along the dth mode are made using the mode-d difference variables $v_{d,l} = \mathcal{U} \times_d \Delta_{d,l}$. In practice we must deal with the fact that the $\hat{u}$ recovered by computing $x - A^T\hat{\lambda}$ may exhibit a nearly but not exactly checkerbox structure due to limitations in numerical precision. This creates a practical issue as a small but non-zero difference variable will lead to an incorrect clustering assignment. Addressing this issue, however, is simple. The projected gradient algorithm used to compute CoCo is a natural generalization of the projected gradient algorithm used in Chi and Lange (2015) for convex clustering. Consequently, we can use the obvious adaptation of the procedure for computing the difference variables in convex clustering. The following brief technical discussion is expanded in more detail in Chi and Lange (2015).

The key fact that we use is that the projected gradient algorithm is equivalent to the alternating minimization algorithm (AMA) applied to the following augmented Lagrangian function

L_\eta(u, v, \lambda) = \frac{1}{2}\|x - u\|_2^2 + \sum_{d=1}^{D}\sum_{l \in E_d}\Big[\gamma w_{d,l}\|v_{d,l}\|_2 + \langle\lambda_{d,l}, v_{d,l} - A_{d,l}u\rangle + \frac{\eta}{2}\|v_{d,l} - A_{d,l}u\|_2^2\Big].

The mode-d difference vector vd,l is determined by the proximal map

v_{d,l} = \operatorname*{arg\,min}_{v_{d,l}}\Big\{\frac{1}{2}\|v_{d,l} - A_{d,l}u + \eta^{-1}\lambda_{d,l}\|_2^2 + \frac{\gamma w_{d,l}}{\eta}\|v_{d,l}\|_2\Big\} = \operatorname{prox}_{(\sigma_{d,l}/\eta)\|\cdot\|_2}\big(A_{d,l}u - \eta^{-1}\lambda_{d,l}\big), \quad (31)

where $\sigma_{d,l} = \gamma w_{d,l}$. Because the proximal mapping can produce mode-d difference variables that are exactly zero, the procedure for computing $v_{d,l}$ in (31) is immune to the numerical precision issues that hinder the direct computation of $\hat{\mathcal{U}} \times_d \Delta_{d,l}$.
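The proximal map in (31) is group (block) soft-thresholding, which is what produces the exact zeros; a minimal Python sketch, with an illustrative threshold, is:

```python
import numpy as np

def prox_group_l2(z, threshold):
    """prox of threshold * ||.||_2: shrink z toward 0, returning exactly 0 when ||z|| <= threshold."""
    nrm = np.linalg.norm(z)
    if nrm <= threshold:
        return np.zeros_like(z)
    return (1.0 - threshold / nrm) * z

# v_{d,l} = prox_{(sigma_{d,l}/eta) ||.||_2}(A_{d,l} u - lambda_{d,l} / eta)
z = np.array([0.3, -0.4])                 # a toy difference vector
print(prox_group_l2(z, threshold=1.0))    # -> [0. 0.]  (fused: exact zero)
print(prox_group_l2(z, threshold=0.25))   # partially shrunk, not zero
```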

Appendix F. Details on Denoising with the Tucker Decomposition for Setting Weights

Employing the Tucker decomposition introduces another tuning parameter, namely the rank of the decomposition. When applicable, a user can leverage problem-specific knowledge to select the rank for the decomposition. Nonetheless, the availability of an automatic approach is desirable to handle cases when such knowledge is unavailable. Selecting the rank in a tensor decomposition, however, is an open question (Kolda and Bader, 2009; Yokota et al., 2017). During initial experiments, a few different methods for selecting the Tucker decomposition rank from the literature were compared: an L-curve approach that attempts to strike a balance between the decomposition's relative error and compression ratio, as implemented by the mlrankest function in the Tensorlab Matlab toolbox (Vervliet et al., 2016), minimum description length (Rissanen, 1978; Yokota et al., 2017), and the recently-proposed SCORE algorithm (Yokota et al., 2017). Out of these, the SCORE algorithm produced the best average CoCo estimator performance. The SCORE algorithm itself includes a tuning parameter, $\hat{\rho}$, and Yokota et al. (2017) suggest setting $\hat{\rho} \in [10^{-4}, 10^{-2}]$. We considered $\hat{\rho} \in \{10^{-4}, 10^{-3}, 10^{-2}\}$ and found $10^{-3}$ to perform the best, which also matches the value used in the experiments by Yokota et al. (2017).

We also developed a simple yet effective heuristic for choosing the rank where we set the Tucker rank for the dth mode to be the floor of nd/2. Two principles motivating the heuristic are that the rank of the decomposition should be both small relative to and also in proportion to the length of the modes. Both the SCORE algorithm and our heuristic were employed in our simulations described in Section 8 as a robustness check to ensure our CoCo estimator’s performance does not crucially depend on the choice of the rank.

The basic Tucker decomposition computation is accomplished by the higher order SVD (HOSVD) method (De Lathauwer et al., 2000), which computes for each mode k the $r_k$ leading left singular vectors of the mode-k matricization and stores them as a factor matrix $U_k$. The HOSVD then computes the core tensor by contracting the data tensor along each mode with the factor matrices, $\mathcal{H} = \mathcal{X} \times_1 U_1^T \times_2 \cdots \times_D U_D^T$. Thus, the main cost is computing D SVDs. This is an illustrative calculation, however, and more efficient alternatives exist (Vannieuwenhoven et al., 2012; Minster et al., 2020).
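A compact Python/NumPy sketch of this HOSVD calculation is given below. The default rank uses the floor(n_d/2) heuristic from the previous paragraph purely for illustration, and the implementation is a plain truncated HOSVD rather than the more efficient alternatives cited above.

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_multiply(X, M, mode):
    """Tensor-times-matrix along `mode`: contracts the mode-`mode` fibers with M."""
    Xm = M @ unfold(X, mode)
    new_shape = (M.shape[0],) + X.shape[:mode] + X.shape[mode + 1:]
    return np.moveaxis(Xm.reshape(new_shape), 0, mode)

def hosvd(X, ranks=None):
    """Truncated HOSVD: factor U_k from the leading left singular vectors of X_(k)."""
    D = X.ndim
    if ranks is None:
        ranks = [max(1, n // 2) for n in X.shape]    # the floor(n_d / 2) heuristic
    factors = []
    for k in range(D):
        Uk, _, _ = np.linalg.svd(unfold(X, k), full_matrices=False)
        factors.append(Uk[:, :ranks[k]])
    core = X
    for k in range(D):
        core = mode_multiply(core, factors[k].T, k)  # H = X x_1 U_1^T ... x_D U_D^T
    return core, factors

X = np.random.randn(10, 12, 8)
core, factors = hosvd(X)
print(core.shape, [U.shape for U in factors])        # (5, 6, 4) and [(10,5), (12,6), (8,4)]
```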

Appendix G. CPD+k-means

We describe in greater detail the CPD+k-means method for co-clustering a D-way tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_D}$. The method consists of two steps:

Step 1. Compute a rank-R CP decomposition

$$\mathcal{X} \approx \sum_{i=1}^{R} a_i^{(1)} \circ a_i^{(2)} \circ \cdots \circ a_i^{(D)},$$

where $\circ$ represents the outer product and $a_i^{(d)}$ is the ith column of the dth factor matrix $A^{(d)} \in \mathbb{R}^{n_d \times R}$.

Step 2. For each factor matrix $A^{(d)}$, apply k-means clustering to the $n_d$ rows of $A^{(d)}$. Note that the D applications of k-means are performed independently, one for each mode-d factor matrix $A^{(d)}$.

Tuning parameters: There are two sets of tuning parameters: (i) the rank parameter R used in Step 1, and (ii) the D cluster numbers, one for each factor matrix, used in Step 2. To choose the rank parameter, we create a candidate set of ranks $\mathcal{R}_{\text{candidate}} \subset \{1, 2, 3, \ldots\}$ and select $R^* \in \mathcal{R}_{\text{candidate}}$ using the tuning procedure in Sun et al. (2017). We then compute a CP decomposition using the selected rank $R^*$ and obtain the factor matrices $A^{(d)}$ for d = 1, …, D. To choose the D cluster numbers, we create D candidate sets $\mathcal{K}^{d}_{\text{candidate}} \subset \{1, 2, 3, \ldots, n_d\}$ and select $k_d^* \in \mathcal{K}^{d}_{\text{candidate}}$ for d = 1, …, D using the gap statistic procedure in Tibshirani et al. (2001). The final output consists of the D clustering results from running k-means on the rows of each $A^{(d)}$ with $k_d^*$ clusters.
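The following is a minimal sketch of these two steps, assuming TensorLy's parafac routine for the CP fit and scikit-learn's KMeans for the per-mode clustering. The wrapper cpd_kmeans is ours, the rank and cluster numbers are taken as given, and the selection procedures of Sun et al. (2017) and the gap statistic are omitted; depending on the TensorLy version, parafac may return the factor list directly rather than a CPTensor, which the sketch handles with getattr.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from sklearn.cluster import KMeans

def cpd_kmeans(X, rank, n_clusters):
    """CPD+k-means sketch.

    Step 1 fits a rank-R CP decomposition; Step 2 runs k-means independently
    on the rows of each factor matrix. `n_clusters` lists the cluster number
    k_d for each of the D modes.
    """
    cp = parafac(tl.tensor(X), rank=rank)          # Step 1: CP decomposition
    factors = getattr(cp, "factors", cp)           # n_d x R factor matrices
    labels = []
    for A_d, k_d in zip(factors, n_clusters):      # Step 2: per-mode k-means
        km = KMeans(n_clusters=k_d, n_init=10).fit(np.asarray(A_d))
        labels.append(km.labels_)                  # cluster label per mode-d slice
    return labels

# Example with a given rank and cluster numbers (tuning omitted).
X = np.random.randn(20, 20, 50)
mode_labels = cpd_kmeans(X, rank=3, n_clusters=[2, 2, 2])
```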

Appendix H. Additional Simulations on Rectangular Tensors

The first rectangular tensor is one in which there are two short modes ($n_1 = n_2 = 10$) and one relatively longer mode ($n_3 = 50$). Figure 14 presents the clustering results for this tensor shape.

Figure 14: Checkerbox Simulation Results: Impact of Tensor Shape. Two balanced clusters per mode with two levels of homoskedastic noise for a tensor with two short modes and one longer mode. Average adjusted Rand index plus/minus one standard error for different noise levels and mode lengths.

At the lower noise level (σ = 2), CoCo performs very well and outperforms both CPD+k-means and CoTeC in terms of both single-mode clustering and co-clustering. When the noise level is increased (σ = 3), both CoCo and CPD+k-means experience a noticeable drop-off in performance and now perform more similarly. Interestingly, CoCo's single-mode clustering results are better along the two shorter modes (modes 1 and 2), which is not what we expected. This provides some evidence that the performance along a mode depends on both the length of that mode and the lengths of the other modes. When the lengths of the shorter modes are increased slightly (from $n_d = 10$ to $n_d = 20$ for d = 1, 2), CoCo has near-perfect performance while CPD+k-means performs roughly the same as before. Thus, CoCo struggles with this tensor shape only when the short modes are very short (only 10 observations).

To further investigate the mode-by-mode performance with rectangular tensors, we also apply the clustering methods to a "Goldilocks" tensor with mode lengths that are short, medium, and long. This setting was again motivated by the results from the previous two tensor shapes, this time to see how performance is affected when the size of a longer mode is increased. The ARI results for this tensor shape are given in Figure 15d, and they are consistent with what was observed previously. When the short mode has only 10 observations, CoCo initially performs very well until the noise reaches a certain level. At that point, its performance along the longer modes declines sharply, falling below that of CPD+k-means, and this pattern is more pronounced for the longest mode ($n_3 = 100$). The overall co-clustering performance of the two methods remains similar, however. As before, CoCo does not experience as sharp a decline when the shortest mode is made slightly longer ($n_1 = 20$), and for the most part it does noticeably better than CPD+k-means.

Figure 15: Checkerbox Simulation Results: Impact of Tensor Shape. Two balanced clusters per mode with two levels of homoskedastic noise for a tensor with short, medium, and long mode lengths. Average adjusted Rand index plus/minus one standard error for different noise levels and mode lengths.

Overall, from clustering these different tensor shapes, we see that CoCo still generally performs very well and better than CPD+k-means. The main issue arises when at least one mode is very short ($n_d = 10$). CoCo performs very well at lower noise levels but suffers a sharp decline in performance once the noise reaches a certain level. Unexpectedly, the decline in single-mode performance is worse for the longer modes. However, even when this happens, CoCo's overall co-clustering performance remains comparable to CPD+k-means. Additionally, this pattern is much less pronounced when the length of the shortest mode is increased slightly.

Contributor Information

Eric C. Chi, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.

Brian R. Gaines, Advanced Analytics R&D, SAS Institute Inc., Cary, NC 27513, USA.

Will Wei Sun, Krannert School of Management, Purdue University, West Lafayette, IN 47907, USA.

Hua Zhou, Department of Biostatistics, University of California, Los Angeles, CA 90095, USA.

Jian Yang, Advertising Sciences, Yahoo Research, Sunnyvale, CA 94089, USA.

References

1. Acar Evrim and Yener Bülent. Unsupervised multiway data analysis: A literature survey. IEEE Transactions on Knowledge and Data Engineering, 21(1):6–20, 2009.
2. Acar Evrim, Çamtepe Seyit A., and Yener Bülent. Collective sampling and analysis of high order tensors for chatroom communications. In International Conference on Intelligence and Security Informatics, pages 213–224. Springer, 2006.
3. Anandkumar Animashree, Ge Rong, Hsu Daniel, Kakade Sham M., and Telgarsky Matus. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
4. Ankenman Jerrod I. Geometry and analysis of dual networks on questionnaires, 2014.
5. Bader Brett W., Kolda Tamara G., et al. Matlab tensor toolbox version 2.6. Available online, February 2015. URL http://www.sandia.gov/~tgkolda/TensorToolbox/.
6. Beck Amir and Teboulle Marc. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
7. Bergmann Sven, Ihmels Jan, and Barkai Naama. Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E, 67(3):031902, 2003.
8. Bhar Anirban, Haubrock Martin, Mukhopadhyay Anirban, and Wingender Edgar. Multiobjective triclustering of time-series transcriptome data reveals key genes of biological processes. BMC Bioinformatics, 16(1):200, 2015.
9. Bi Xuan, Qu Annie, and Shen Xiaotong. Multilayer tensor factorization with applications to recommender systems. The Annals of Statistics, 46(6B):3308–3333, 2018.
10. Busygin Stanislav, Prokopyev Oleg, and Pardalos Panos M. Biclustering in data mining. Computers and Operations Research, 35(9):2964–2987, 2008.
11. Cao Xiaochun, Wei Xingxing, Han Yahong, and Lin Dongdai. Robust face clustering via tensor decomposition. IEEE Transactions on Cybernetics, 45(11):2546–2557, 2015.
12. Carroll J. Douglas and Chang Jih-Jie. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319, 1970.
13. Chen Annie I. and Ozdaglar Asuman. A fast distributed proximal-gradient method. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 601–608. IEEE, 2012.
14. Chen Gary K., Chi Eric C., Ranola John Michael O., and Lange Kenneth. Convex clustering: An attractive alternative to hierarchical clustering. PLOS Computational Biology, 11(5):e1004228, 2015.
15. Chen Jiahua and Chen Zehua. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.
16. Chen Jiahua and Chen Zehua. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pages 555–574, 2012.
17. Chi Eric C. and Lange Kenneth. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4):994–1013, 2015.
18. Chi Eric C. and Steinerberger Stefan. Recovering trees with convex clustering. SIAM Journal on Mathematics of Data Science, 1(3):383–407, 2019.
19. Chi Eric C., Allen Genevera I., and Baraniuk Richard G. Convex biclustering. Biometrics, 73(1):10–19, 2017.
20. Cichocki Andrzej, Mandic Danilo, De Lathauwer Lieven, Zhou Guoxu, Zhao Qibin, Caiafa Cesar, and Phan Anh Huy. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015.
21. Combettes Patrick L. and Pesquet Jean-Christophe. Proximal splitting methods in signal processing. In Fixed-point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
22. Combettes Patrick L. and Wajs Valérie R. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
23. De Lathauwer Lieven, De Moor Bart, and Vandewalle Joos. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
24. Deo Narsingh. Graph Theory with Applications to Engineering and Computer Science. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1974.
25. Eckstein Jonathan. A simplified form of block-iterative operator splitting and an asynchronous algorithm resembling the multi-block alternating direction method of multipliers. Journal of Optimization Theory and Applications, 173(1):155–182, Apr 2017.
26. Fan Jianqing and Li Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
27. Frolov Evgeny and Oseledets Ivan. Tensor methods and recommender systems. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(3):e1201, 2017.
28. Gavish Matan and Coifman Ronald R. Sampling, denoising and compression of matrices by coherent matrix organization. Applied and Computational Harmonic Analysis, 33(3):354–369, 2012.
29. Geisser Seymour. The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350):320–328, 1975.
30. Goldstein Tom, Studer Christoph, and Baraniuk Richard. A field guide to forward-backward splitting with a FASTA implementation. arXiv eprint, abs/1411.3406, 2014. URL http://arxiv.org/abs/1411.3406.
31. Goldstein Tom, Studer Christoph, and Baraniuk Richard. FASTA: A generalized implementation of forward-backward splitting, January 2015. http://arxiv.org/abs/1501.04979.
32. Guigourès Romain, Boullé Marc, and Rossi Fabrice. Discovering patterns in time-varying graphs: A triclustering approach. Advances in Data Analysis and Classification, pages 1–28, 2015.
33. Hallac David, Leskovec Jure, and Boyd Stephen. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 387–396, New York, NY, USA, 2015. ACM.
34. Hanson DL and Wright FT. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42:1079–1083, 1971.
35. Harshman Richard A. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
36. Hartigan John A. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.
37. Ho Nhat, Lin Tianyi, and Jordan Michael I. On structured filtering-clustering: Global error bound and optimal first-order algorithms. arXiv:1904.07462 [stat.ML], 2019. URL https://arxiv.org/abs/1904.07462.
38. Hocking Toby D., Joulin Armand, Bach Francis, and Vert Jean-Philippe. Clusterpath an algorithm for clustering using convex fusion penalties. In Getoor Lise and Scheffer Tobias, editors, 28th International Conference on Machine Learning, page 1. ACM, 2011.
39. Hof Robert. Study: Mobile Ads Actually Do Work - Especially In Apps. Forbes, August 27, 2014. Last Accessed July 9, 2017 from https://www.forbes.com/sites/roberthof/2014/08/27/study-mobile-ads-actually-do-work-especially-in-apps/#27ce654057aa.
40. Huang Heng, Ding Chris, Luo Dijun, and Li Tao. Simultaneous tensor subspace selection and clustering: The equivalence of high order SVD and k-means clustering. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 327–335. ACM, 2008.
41. Hubert Lawrence and Arabie Phipps. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
42. Jain Prateek and Oh Sewoong. Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems, pages 1431–1439, 2014.
43. Jegelka Stefanie, Sra Suvrit, and Banerjee Arindam. Approximation algorithms for tensor clustering. In International Conference on Algorithmic Learning Theory, pages 368–383. Springer, 2009.
44. Kolda Tamara G. and Bader Brett W. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
45. Kolda Tamara G. and Sun Jimeng. Scalable tensor decompositions for multi-aspect data mining. In 2008 Eighth IEEE International Conference on Data Mining, pages 363–372, 2008.
46. Kutty Sangeetha, Nayak Richi, and Li Yuefeng. XML documents clustering using a tensor space model. Advances in Knowledge Discovery and Data Mining, pages 488–499, 2011.
47. Lange Kenneth, Hunter David R., and Yang Ilsoon. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1–20, 2000.
48. Lazzeroni Laura and Owen Art. Plaid models for gene expression data. Statistica Sinica, 12:61–86, 2002.
49. Lee Mihee, Shen Haipeng, Huang Jianhua Z., and Marron JS. Biclustering via sparse singular value decomposition. Biometrics, 66(4):1087–1095, 2010.
50. Li Mu, Andersen David G., and Smola Alexander. Distributed delayed proximal gradient methods. In NIPS Workshop on Optimization for Machine Learning, volume 3, page 3, 2013.
51. Lindsten Fredrik, Ohlsson Henrik, and Ljung Lennart. Just relax and come clustering! A convexification of k-means clustering. Technical report, Linköpings Universitet, 2011. URL http://www.control.isy.liu.se/research/reports/2011/2992.pdf.
52. Liu Ji, Yuan Lei, and Ye Jieping. Guaranteed sparse recovery under linear transformation. In Dasgupta Sanjoy and McAllester David, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 91–99. PMLR, 2013a.
53. Liu Tianqi, Yuan Ming, and Zhao Hongyu. Characterizing spatiotemporal transcriptome of human brain via low rank tensor decomposition. arXiv:1702.07449 [stat.ME], 2017. URL https://arxiv.org/abs/1702.07449.
54. Liu Xinhai, Ji Shuiwang, Glänzel Wolfgang, and De Moor Bart. Multiview partitioning via tensor methods. IEEE Transactions on Knowledge and Data Engineering, 25(5):1056–1069, 2013b.
55. Ma Ping and Zhong Wenxuan. Penalized clustering of large scale functional data with multiple covariates. Journal of the American Statistical Association, 103:625–636, 2008.
56. Madeira Sara C. and Oliveira Arlindo L. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):24–45, 2004.
57. Marchetti Yuliya and Zhou Qing. Solution path clustering with adaptive concave penalty. Electronic Journal of Statistics, 8(1):1569–1603, 2014.
58. McGarry Caitlin. Report: Google is the default iPhone search engine because it paid Apple $1 billion. Macworld, January 22, 2016. Last Accessed July 9, 2017 from http://www.macworld.com/article/3025783/iphone-ipad/report-google-is-the-default-iphone-search-engine-because-it-paid-apple-1-billion.html.
59. Meinshausen Nicolai and Bühlmann Peter. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
60. Minster Rachel, Saibaba Arvind K., and Kilmer Misha E. Randomized algorithms for low-rank tensor decompositions in the Tucker format. SIAM Journal on Mathematics of Data Science, 2(1):189–215, 2020.
61. Mishne Gal, Talmon Ronen, Meir Ron, Schiller Jackie, Lavzin Maria, Dubin Uri, and Coifman Ronald R. Hierarchical coupled-geometry analysis for neuronal structure and activity pattern discovery. IEEE Journal of Selected Topics in Signal Processing, 10(7):1238–1253, 2016.
62. Mitchell Amy, Rosenstiel Tom, Santhanam Laura Houston, and Christian Leah. Future of mobile news. Project for Excellence in Journalism (PEJ): Understanding News in the Information Age, 2012.
63. Ng Andrew Y., Jordan Michael I., and Weiss Yair. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
64. Oh Jinoh, Shin Kijung, Papalexakis Evangelos E., Faloutsos Christos, and Yu Hwanjo. S-HOT: Scalable High-Order Tucker Decomposition. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 761–770. ACM, 2017.
65. Pan Wei, Shen Xiaotong, and Liu Binghui. Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty. Journal of Machine Learning Research, 14:1865–1889, 2013.
66. Papalexakis Evangelos E., Sidiropoulos Nicholas D., and Bro Rasmus. From K-Means to Higher-Way Co-Clustering: Multilinear Decomposition With Sparse Latent Factors. IEEE Transactions on Signal Processing, 61(2):493–506, 2013.
67. Pelckmans Kristiaan, De Brabanter Jos, Suykens Johan A.K., and De Moor Bart L.R. Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering Workshop, 2005.
68. Radchenko Peter and Mukherjee Gourab. Convex clustering via l1 fusion penalization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1527–1546, 2017.
69. Rissanen Jorma. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
70. Schifano Elizabeth D., Strawderman Robert L., and Wells Martin T. Majorization-minimization algorithms for nonsmoothly penalized objective functions. Electronic Journal of Statistics, 4:1258–1299, 2010.
71. Sharpnack James, Singh Aarti, and Rinaldo Alessandro. Sparsistency of the edge lasso over graphs. In Lawrence Neil D. and Girolami Mark, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 1028–1036, 2012.
72. She Yiyuan. Sparse regression with exact clustering. Electronic Journal of Statistics, 4:1055–1096, 2010.
73. Shen Xiaotong and Huang Hsin-Cheng. Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association, 105(490):727–739, 2010.
74. Shen Xiaotong, Huang Hsin-Cheng, and Pan Wei. Simultaneous supervised clustering and feature selection over a graph. Biometrika, 99:899–914, 2012.
75. Sidiropoulos Nicholas D., De Lathauwer Lieven, Fu Xiao, Huang Kejun, Papalexakis Evangelos E., and Faloutsos Christos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, 2017.
76. Sill Martin, Kaiser Sebastian, Benner Axel, and Kopp-Schneider Annette. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics, 27(15):2089–2097, 2011.
77. Stone Mervyn. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 111–147, 1974.
78. Sun Jimeng, Tao Dacheng, and Faloutsos Christos. Beyond streams and graphs: Dynamic tensor analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 374–383. ACM, 2006.
79. Sun Jimeng, Papadimitriou Spiros, Lin Ching-Yung, Cao Nan, Liu Shixia, and Qian Weihong. Multivis: Content-based social network exploration through multiway visual analysis. In Proceedings of the 2009 SIAM International Conference on Data Mining, pages 1064–1075. SIAM, 2009.
80. Sun Will Wei and Li Lexin. Dynamic tensor clustering. Journal of the American Statistical Association, 114(528):1894–1907, 2019.
81. Sun Will Wei, Lu Junwei, Liu Han, and Cheng Guang. Provable sparse tensor decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):899–916, 2017.
82. Symeonidis Panagiotis. Matrix and tensor decomposition in recommender systems. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, pages 429–430, New York, NY, USA, 2016. ACM.
83. Symeonidis Panagiotis and Zioupos Andreas. Matrix and Tensor Factorization Techniques for Recommender Systems. Springer International Publishing, 1 edition, 2016.
84. Tan Kean Ming and Witten Daniela. Statistical properties of convex clustering. Electronic Journal of Statistics, 9:2324–2347, 2015.
85. Tan Kean Ming and Witten Daniela M. Sparse biclustering of transposable data. Journal of Computational and Graphical Statistics, 23(4):985–1008, 2014.
86. Tibshirani Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267–288, 1996.
87. Tibshirani Robert, Walther Guenther, and Hastie Trevor. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.
88. Tibshirani Robert, Saunders Michael, Rosset Saharon, Zhu Ji, and Knight Keith. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
89. Tibshirani Ryan J. and Taylor Jonathan. The solution path of the generalized lasso. The Annals of Statistics, 39(3):1335–1371, 2011.
90. Tseng Paul. Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM Journal on Control and Optimization, 29(1):119–138, 1991.
91. Tucker Ledyard R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
92. Turner Heather, Bailey Trevor, and Krzanowski Wojtek. Improved biclustering of microarray data demonstrated through systematic performance tests. Computational Statistics and Data Analysis, 48(2):235–254, 2005.
93. Vannieuwenhoven Nick, Vandebril Raf, and Meerbergen Karl. A new truncation strategy for the higher-order singular value decomposition. SIAM Journal on Scientific Computing, 34(2):A1027–A1052, 2012.
94. Vervliet Nico, Debals Otto, Sorber Laurent, Van Barel Marc, and De Lathauwer Lieven. Tensorlab 3.0, Mar. 2016. Available online. URL http://www.tensorlab.net.
95. Vu Van and Wang Ke. Random weighted projections, random quadratic forms and random eigenvectors. Random Structures and Algorithms, 47(4):792–821, 2015.
96. Wang Binhuan, Zhang Yilong, Sun Will Wei, and Fang Yixin. Sparse convex clustering. Journal of Computational and Graphical Statistics, 27(2):393–403, 2018.
97. Wang Yuxiang, Xu Huan, and Leng Chenlei. Provable subspace clustering: When LRR meets SSC. In Burges CJC, Bottou L, Welling M, Ghahramani Z, and Weinberger KQ, editors, Advances in Neural Information Processing Systems 26, pages 64–72. Curran Associates, Inc., 2013.
98. Wickham Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York, 2009. ISBN 978-0-387-98140-6. URL http://ggplot2.org.
99. Witten Daniela M., Tibshirani Robert, and Hastie Trevor. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
100. Wright Stephen J., Nowak Robert D., and Figueiredo Mário A.T. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, July 2009.
101. Wu Chong, Kwon Sunghoon, Shen Xiaotong, and Pan Wei. A new algorithm and theory for penalized regression-based clustering. Journal of Machine Learning Research, 17(188):1–25, 2016a.
102. Wu Tao, Benson Austin R., and Gleich David F. General tensor spectral co-clustering for higher-order data. In Lee DD, Sugiyama M, Luxburg UV, Guyon I, and Garnett R, editors, Advances in Neural Information Processing Systems 29, pages 2559–2567. Curran Associates, Inc., 2016b.
103. Xiang Shuo, Tong Xiaoshen, and Ye Jieping. Efficient sparse group feature selection via nonconvex optimization. In Dasgupta Sanjoy and McAllester David, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 284–292, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
104. Yair Or, Talmon Ronen, Coifman Ronald R., and Kevrekidis Ioannis G. Reconstruction of normal forms by learning informed observation geometries from data. Proceedings of the National Academy of Sciences, 114(38):E7865–E7874, 2017.
105. Yokota Tatsuya, Lee Namgil, and Cichocki Andrzej. Robust multilinear tensor rank estimation using higher order singular value decomposition and information criteria. IEEE Transactions on Signal Processing, 65(5):1196–1206, 2017.
106. Yuan Ming and Lin Yi. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
107. Zelnik-Manor Lihi and Perona Pietro. Self-tuning spectral clustering. In Saul LK, Weiss Y, and Bottou L, editors, Advances in Neural Information Processing Systems 17, pages 1601–1608. MIT Press, 2005.
108. Zhang Cun-Hui. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
109. Zhang Zhong-Yuan, Li Tao, and Ding Chris. Non-negative tri-factor tensor decomposition with applications. Knowledge and Information Systems, 34(2):243–265, 2013.
110. Zhao Hongya, Wang Debby D., Chen Long, Liu Xinyu, and Yan Hong. Identifying multidimensional co-clusters in tensors based on hyperplane detection in singular vector spaces. PLOS ONE, 11(9):1–27, September 2016.
111. Zheng Xiaolin, Ding Weifeng, Lin Zhen, and Chen Chaochao. Topic tensor factorization for recommender system. Information Sciences, 372(Supplement C):276–293, 2016.
112. Zhou Hua, Li Lexin, and Zhu Hongtu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108:540–552, 2013.
113. Zhu Changbo, Xu Huan, Leng Chenlei, and Yan Shuicheng. Convex optimization procedure for clustering: Theoretical revisit. In Ghahramani Z, Welling M, Cortes C, Lawrence ND, and Weinberger KQ, editors, Advances in Neural Information Processing Systems 27, pages 1619–1627. Curran Associates, Inc., 2014.
114. Zhu Yunzhang, Shen Xiaotong, and Pan Wei. Simultaneous grouping pursuit and feature selection over an undirected graph. Journal of the American Statistical Association, 108(502):713–725, 2013.
115. Zou Hui. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
116. Zou Hui and Li Runze. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4):1509–1533, 2008.
