Published in final edited form as: J Mach Learn Res. 2020;21:214.

Provable Convex Co-clustering of Tensors

Eric C Chi 1, Brian R Gaines 2, Will Wei Sun 3, Hua Zhou 4, Jian Yang 5

Abstract

Cluster analysis is a fundamental tool for pattern discovery of complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising “blessing of dimensionality” phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulation studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.

Keywords: Clustering, Fused lasso, High-dimensional Statistical Learning, Multiway Data, Non-asymptotic Error

1. Introduction

In this work, we study the problem of finding structure in multiway data, or tensors, via clustering. Tensors appear frequently in modern scientific and business applications involving complex heterogeneous data. For example, data in a neurogenomics study of brain development consists of a 3-way array of expression level measurements indexed by gene, space, and time (Liu et al., 2017). Other examples of 3-way data arrays consisting of matrices collected over time include email communications (sender, recipient, time) (Papalexakis et al., 2013), online chatroom communications (user, keyword, time) (Acar et al., 2006), bike rentals (source station, destination station, time) (Guigourès et al., 2015), and internet network traffic (source IP, destination IP, time) (Sun et al., 2006). The rise in tensor data has created new challenges in making predictions, such as in recommender systems (Zheng et al., 2016; Symeonidis, 2016; Symeonidis and Zioupos, 2016; Frolov and Oseledets, 2017; Bi et al., 2018), as well as in inferring latent structure in multiway data (Acar and Yener, 2009; Anandkumar et al., 2014; Cichocki et al., 2015; Sidiropoulos et al., 2017).

As tensors become increasingly more common, the need for a reliable co-clustering method grows increasingly more urgent. Prevalent clustering methods, however, mainly focus on vector or matrix-variate data. The goal of vector clustering is to identify subgroups within the vector-variate observations (Ma and Zhong, 2008; Shen and Huang, 2010; Shen et al., 2012; Wang et al., 2013). Biclustering is the extension of clustering to two-way data where both the observations (rows) and the features (columns) of a data matrix are simultaneously grouped together (Hartigan, 1972; Madeira and Oliveira, 2004; Busygin et al., 2008). In spite of their prevalence, these approaches are not directly applicable to the cluster analysis of general-order (general-way) tensors. On the other hand, existing methods for co-clustering general D-way arrays, for D ≥ 3, employ one of three strategies: (i) extensions of spectral clustering to tensors (Wu et al., 2016b), (ii) directly clustering the subarrays along each dimension, or way, of the tensor using either k-means or variants of it (Jegelka et al., 2009), and (iii) low rank tensor decompositions (Sun et al., 2009; Papalexakis et al., 2013; Zhao et al., 2016). While all these existing approaches may demonstrate good empirical performance, they have limitations. For instance, the spectral co-clustering method proposed by Wu et al. (2016b) is limited to nonnegative tensors, and the CoTeC method proposed by Jegelka et al. (2009), like k-means, requires specifying the number of clusters along each dimension as a tuning parameter. Most importantly, none of the existing methods provide statistical guarantees for recovering an underlying co-clustering structure. There is a conspicuous gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the non-convex formulations of the previously mentioned works.

In this paper, we propose a Convex Co-clustering (CoCo) procedure that solves a convex formulation of the problem of co-clustering a D-way array for D ≥ 3. Our proposed CoCo estimator affords the following advantages over existing tensor co-clustering methods.

  1. Under modest assumptions on the data generating process, the CoCo estimator is guaranteed to recover an underlying co-clustering structure with high probability. In particular, we establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising “blessing of dimensionality” phenomenon: As the dimensions of the array increase, the CoCo estimator is still consistent even if the number of underlying co-clusters grows as a function of the number of elements in the tensor sample. More importantly, an underlying co-clustering structure can be consistently recovered with even a single tensor sample, which is a typical case in real applications. This phenomenon does not exist in vector or matrix-variate cluster analysis.

  2. The CoCo estimator possesses stability guarantees. In particular, the CoCo estimator is Lipschitz continuous in the data and jointly continuous in the data and its tuning parameter. We emphasize that Lipschitz continuity in the data guarantees that perturbations in the data lead to graceful and commensurate variations in the cluster assignments, and the continuity in the tuning parameter can be leveraged to expedite computation through warm starts.

  3. The CoCo estimator can be iteratively computed with convergence guarantees via an accelerated first order method with storage and per-iteration cost that is linear in the size of the data.

In short, the CoCo estimator comes with (i) statistical guarantees, (ii) practically relevant stability guarantees at all sample sizes, and (iii) an algorithm with polynomial complexity. The theoretical properties of our CoCo estimator are supported by extensive simulation studies. To demonstrate its business impact, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to help advertising planning.

Our work is related to, but also clearly distinct from, a number of recent developments in cluster analysis. The first related line of research tackles convex clustering (Hocking et al., 2011; Zhu et al., 2014; Chi and Lange, 2015; Chen et al., 2015; Tan and Witten, 2015; Wang et al., 2018; Radchenko and Mukherjee, 2017) and convex biclustering (Chi et al., 2017). These existing methods are not directly applicable to general-order tensors, however. Importantly, our CoCo estimator enjoys a unique “blessing of dimensionality” phenomenon that has not been established in the aforementioned approaches. Moreover, the CoCo estimator is similar in spirit to a recent series of works that approximate a noisy observed array with an array that is smooth with respect to some latent organization associated with each dimension of the array (Gavish and Coifman, 2012; Ankenman, 2014; Mishne et al., 2016; Yair et al., 2017). Our proposed CoCo procedure seeks an approximating array that is smooth with respect to a latent clustering along each dimension of the array. While CoCo shares features with these array approximation techniques, namely the use of data-driven similarity graphs along tensor modes, a key distinction between our CoCo estimator and these methods is that CoCo produces an approximating array that explicitly recovers hard co-clustering assignments. As we will see shortly, focusing our attention in this work on the co-clustering model paves the way to the discovery and explicit characterization of new and interesting fundamental behavior in finding intrinsic organization within tensors.

The rest of the paper is organized as follows. In Section 2, we review standard facts and results about tensors that we will use. In Section 3, we introduce our convex formulation of the co-clustering problem. In Section 4, we establish the stability properties and prediction error bounds of the CoCo estimator. In Section 5, we describe the algorithm used to compute the CoCo estimator. In Section 6, we discuss how to specify weights used in our CoCo estimator, and in Section 7 we give guidance on how to set and select tuning parameters used in the CoCo estimator in practice. In Section 8, we present simulation results. In Section 9, we discuss the results of applying the CoCo estimator to co-cluster a real data tensor from online advertising. In Section 10, we close with a discussion. The Appendix contains a brief review of the two main tensor decompositions that are discussed in this paper, all technical proofs, as well as additional experiments.

2. Preliminaries

2.1. Notation

We adopt the terminology and notation used by Kolda and Bader (2009). We call the number of ways or modes of a tensor its order. Vectors are tensors of order one and denoted by boldface lowercase letters, e.g. a. Matrices are tensors of order two and denoted by boldface capital letters, e.g. A. Tensors of higher-order, namely order three and greater, we denote by boldface Euler script letters, e.g. A. Thus, if A represents a D-way data array of size n1 × n2 × ··· × nD, we say A is a tensor of order D. We denote scalars by lowercase letters, e.g. a. We denote the ith element of a vector a by ai, the ijth element of a matrix A by aij, the ijkth element of a third-order tensor A by aijk, and so on.

We can extract a subarray of a tensor by fixing a subset of its indices. For example, by fixing the first index of a matrix to be i, we extract the ith row of the matrix, and by fixing the second index of a matrix to be j, we extract the jth column of the matrix. We use a colon to indicate all elements of a mode. Consequently, we denote the ith row of a matrix A by Ai: and the jth column of a matrix A by A:j. Fibers are the subarrays of a tensor obtained by fixing all but one of its indices. In the case of a matrix, a mode-1 fiber is a matrix column and a mode-2 fiber is a matrix row. Slices are the two-dimensional subarrays of a tensor obtained by fixing all but two indices. For example, a third-order tensor A has three sets of slices denoted by Ai::, A:j:, and A::k.

2.2. Basic Tensor Operations

It is often convenient to reorder the elements of a D-way array into a matrix or a vector. Reordering a tensor's elements into a matrix is referred to as matricization, while reordering its elements into a vector is referred to as vectorization. There are many ways to reorder a tensor into a matrix or vector. In this paper, we use a canonical mode-d matricization, where the mode-d fibers of a D-way tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_D}$ become the columns of a matrix $\mathbf{A}_{(d)} \in \mathbb{R}^{n_d \times n_{-d}}$, where $n_{-d} = \prod_{j \neq d} n_j$. Recall that the column-major vectorization of a matrix maps a matrix $\mathbf{A} \in \mathbb{R}^{p \times q}$ to the vector $\mathbf{a} \in \mathbb{R}^{pq}$ by stacking the columns of $\mathbf{A}$ on top of each other, namely $\mathbf{a} = (\mathbf{A}_{:1}^{\mathsf{T}}, \mathbf{A}_{:2}^{\mathsf{T}}, \ldots, \mathbf{A}_{:q}^{\mathsf{T}})^{\mathsf{T}} \in \mathbb{R}^{pq}$. In this paper, we take the vectorization of a D-way tensor $\mathcal{A}$, denoted $\mathrm{vec}(\mathcal{A})$, to be the column-major vectorization of the mode-1 matricization of $\mathcal{A}$, namely $\mathrm{vec}(\mathcal{A}) = \mathrm{vec}(\mathbf{A}_{(1)}) \in \mathbb{R}^{n}$, where $n = \prod_d n_d$ is the total number of elements in $\mathcal{A}$. As a shorthand, when the context leaves no ambiguity, we denote this vectorization of a tensor $\mathcal{A}$ by its boldface lowercase version $\mathbf{a}$.

The Frobenius norm of a D-way tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_D}$ is the natural generalization of the Frobenius norm of a matrix, namely it is the square root of the sum of the squares of all its elements,

$$\|\mathcal{A}\|_{\mathrm{F}} = \sqrt{\sum_{i_1=1}^{n_1}\sum_{i_2=1}^{n_2}\cdots\sum_{i_D=1}^{n_D} a_{i_1 i_2 \cdots i_D}^2}.$$

The Frobenius norm of a tensor is equivalent to the 2-norm of the vectorization of the tensor, namely $\|\mathcal{A}\|_{\mathrm{F}} = \|\mathbf{a}\|_2$.

Let $\mathcal{A}$ be a tensor in $\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_D}$ and $\mathbf{B}$ be a matrix in $\mathbb{R}^{m \times n_d}$. The d-mode (matrix) product of the tensor $\mathcal{A}$ with the matrix $\mathbf{B}$, denoted by $\mathcal{A} \times_d \mathbf{B}$, is the tensor of size $n_1 \times \cdots \times n_{d-1} \times m \times n_{d+1} \times \cdots \times n_D$ whose $(i_1, \ldots, i_{d-1}, j, i_{d+1}, \ldots, i_D)$th element is given by

$$(\mathcal{A} \times_d \mathbf{B})_{i_1 \cdots i_{d-1}\, j\, i_{d+1} \cdots i_D} = \sum_{i_d=1}^{n_d} a_{i_1 i_2 \cdots i_D}\, b_{j i_d},$$

for $j \in \{1, \ldots, m\}$. The vectorization of the d-mode product $\mathcal{A} \times_d \mathbf{B}$ can be expressed as

$$\mathrm{vec}(\mathcal{A} \times_d \mathbf{B}) = (\mathbf{I}_{n_D} \otimes \cdots \otimes \mathbf{I}_{n_{d+1}} \otimes \mathbf{B} \otimes \mathbf{I}_{n_{d-1}} \otimes \cdots \otimes \mathbf{I}_{n_1})\,\mathbf{a}, \tag{1}$$

where $\mathbf{I}_p$ is the p-by-p identity matrix and $\otimes$ denotes the Kronecker product between two matrices. The identity given in (1) generalizes the well-known formula for the column-major vectorization of a product of two matrices, namely $\mathrm{vec}(\mathbf{B}\mathbf{A}) = (\mathbf{I} \otimes \mathbf{B})\mathbf{a}$.
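To make these conventions concrete, the following minimal NumPy sketch (our own helper functions `unfold`, `fold`, `vec`, and `mode_product`, not code from the paper) implements the mode-d matricization, the vectorization, and the d-mode product, and numerically verifies the Kronecker identity (1) on a small random tensor.

```python
import numpy as np

def unfold(T, d):
    """Mode-d matricization A_(d): the mode-d fibers become the columns."""
    return np.reshape(np.moveaxis(T, d, 0), (T.shape[d], -1), order="F")

def fold(M, d, shape):
    """Inverse of unfold for a tensor of the given shape."""
    moved = [shape[d]] + [s for k, s in enumerate(shape) if k != d]
    return np.moveaxis(np.reshape(M, moved, order="F"), 0, d)

def vec(T):
    """Column-major vectorization of the mode-1 matricization."""
    return np.reshape(T, -1, order="F")

def mode_product(T, B, d):
    """d-mode product T x_d B, computed via the mode-d unfolding."""
    shape = list(T.shape)
    shape[d] = B.shape[0]
    return fold(B @ unfold(T, d), d, tuple(shape))

# Numerical check of identity (1): apply B to mode 2 (d = 1 in 0-based indexing).
rng = np.random.default_rng(0)
n1, n2, n3, m = 3, 4, 2, 5
A = rng.normal(size=(n1, n2, n3))
B = rng.normal(size=(m, n2))

lhs = vec(mode_product(A, B, 1))

# Kronecker factors, leftmost first: I_{n3} (x) B (x) I_{n1}.
K = np.kron(np.kron(np.eye(n3), B), np.eye(n1))
rhs = K @ vec(A)

assert np.allclose(lhs, rhs)
```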

3. A Convex Formulation of Co-clustering

We first consider a convex formulation of the co-clustering problem when the data is a 3-way tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ before discussing the natural generalization to D-way tensors. Our basic assumption is that the observed data tensor is a noisy realization of an underlying tensor that exhibits a checkerbox structure modulo some unknown reordering along each of its modes. Specifically, suppose that there are $k_1$, $k_2$, and $k_3$ clusters along modes 1, 2, and 3, respectively. If the $(i_1, i_2, i_3)$th entry of $\mathcal{X}$ belongs to the cluster defined by the $r_1$th mode-1 group, $r_2$th mode-2 group, and $r_3$th mode-3 group, then we assume that the observed tensor element $x_{i_1 i_2 i_3}$ is given by

$$x_{i_1 i_2 i_3} = c^*_{r_1 r_2 r_3} + \epsilon_{i_1 i_2 i_3}, \tag{2}$$

where $c^*_{r_1 r_2 r_3}$ is the mean of the co-cluster defined by the $r_1$th mode-1 partition, $r_2$th mode-2 partition, and $r_3$th mode-3 partition, and $\epsilon_{i_1 i_2 i_3}$ are noise terms. We will specify a joint distribution on the noise terms later in Section 4.2 in order to derive prediction bounds. Thus, we model the observed tensor $\mathcal{X}$ as the sum of a mean tensor $\mathcal{U}^* \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, whose elements are expanded from the co-cluster means tensor $\mathcal{C}^* \in \mathbb{R}^{k_1 \times k_2 \times k_3}$, and a noise tensor $\mathcal{E} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. We can write this expansion explicitly by introducing a membership matrix $\mathbf{M}_d \in \{0,1\}^{n_d \times k_d}$ for the dth mode, where the $ik$th element of $\mathbf{M}_d$ is one if and only if the ith mode-d slice belongs to the kth mode-d cluster for $k \in \{1, \ldots, k_d\}$. We require that each row of the membership matrix sum to one, namely $\mathbf{M}_d\mathbf{1} = \mathbf{1}$, to ensure that each of the mode-d slices belongs to exactly one of the $k_d$ mode-d clusters. Then,

$$\mathcal{U}^* = \mathcal{C}^* \times_1 \mathbf{M}_1 \times_2 \mathbf{M}_2 \times_3 \mathbf{M}_3.$$

Figure 1 illustrates an underlying mean tensor U* after permuting the slices along each of the modes to reveal a checkerbox structure.

Figure 1: A 3-way tensor with a checkerbox structure.
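The expansion of the co-cluster means by the membership matrices amounts to indexing $\mathcal{C}^*$ by each slice's cluster label. The following short NumPy sketch generates such a checkerbox mean tensor and a noisy observation as in model (2); the cluster counts, dimensions, and noise level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

k = (2, 3, 2)          # hypothetical numbers of clusters k1, k2, k3
n = (6, 9, 4)          # tensor dimensions n1, n2, n3

C = rng.normal(size=k)                                            # co-cluster means C*
labels = [rng.integers(k_d, size=n_d) for n_d, k_d in zip(n, k)]  # mode-d cluster labels

# Expanding C* by the membership matrices M_d is equivalent to indexing C* by
# the cluster label of each slice: U*[i1, i2, i3] = C*[r1, r2, r3].
U_star = C[np.ix_(labels[0], labels[1], labels[2])]

X = U_star + 0.5 * rng.normal(size=n)   # noisy observation, as in model (2)
```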

The co-clustering model in (2) is the 3-way analogue of the checkerboard mean model often employed in biclustering data matrices (Madeira and Oliveira, 2004; Tan and Witten, 2014; Chi et al., 2017). Moreover, the tensor C* of co-cluster means corresponds to the tensor of cluster “centers” in the tensor clustering work by Jegelka et al. (2009). The model is complete and exclusive in that each tensor element is assigned to exactly one co-cluster. This is in contrast to models that allow potentially overlapping co-clusters (Lazzeroni and Owen, 2002; Bergmann et al., 2003; Turner et al., 2005; Huang et al., 2008; Witten et al., 2009; Lee et al., 2010; Sill et al., 2011; Bhar et al., 2015).

Estimating the model in (2) consists of finding (i) the partitions along each mode and (ii) the mean values of each of the $k_1 k_2 k_3$ co-clusters. Estimating $c^*_{r_1 r_2 r_3}$ given the mode clustering assignments is trivial. Let $G_1$, $G_2$, and $G_3$ denote the indices of the $r_1$th mode-1, $r_2$th mode-2, and $r_3$th mode-3 groups, respectively. If the noise terms $\epsilon_{i_1 i_2 i_3}$ are iid $N(0, \sigma^2)$ for some positive $\sigma^2$, then the maximum likelihood estimate of $c^*_{r_1 r_2 r_3}$ is simply the sample mean of the entries of $\mathcal{X}$ over the indices defined by $G_1$, $G_2$, and $G_3$, namely

$$\hat{c}^*_{r_1 r_2 r_3} = \frac{1}{|G_1|\,|G_2|\,|G_3|}\sum_{i_1\in G_1}\sum_{i_2\in G_2}\sum_{i_3\in G_3} x_{i_1 i_2 i_3}.$$
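A direct translation of this sample-mean formula into NumPy for a 3-way tensor, assuming the mode partitions are known, might look as follows (the helper name `cocluster_means` is ours).

```python
import numpy as np

def cocluster_means(X, groups):
    """Sample mean of each co-cluster of a 3-way tensor.

    groups[d][r] is the list of mode-d indices belonging to the r-th mode-d
    group. Returns a k1 x k2 x k3 tensor of estimated co-cluster means.
    """
    k = tuple(len(g) for g in groups)
    C_hat = np.empty(k)
    for r1, G1 in enumerate(groups[0]):
        for r2, G2 in enumerate(groups[1]):
            for r3, G3 in enumerate(groups[2]):
                C_hat[r1, r2, r3] = X[np.ix_(G1, G2, G3)].mean()
    return C_hat
```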

Finding the partitions $G_1$, $G_2$, and $G_3$, on the other hand, is a combinatorially hard problem. In recent years, however, many combinatorially hard problems that initially appear computationally intractable have been successfully attacked by solving a convex relaxation of the original combinatorial optimization problem. Perhaps the most celebrated convex relaxation is the lasso (Tibshirani, 1996), which simultaneously performs variable selection and parameter estimation for fitting sparse regression models by minimizing a non-smooth convex criterion.

In light of the lasso’s success, we propose to simultaneously identify partitions along the modes of X and estimate the co-cluster means by minimizing the following convex objective function

$$F_\gamma(\mathcal{U}) = \frac{1}{2}\|\mathcal{X} - \mathcal{U}\|_{\mathrm{F}}^2 + \gamma\underbrace{\left[R_1(\mathcal{U}) + R_2(\mathcal{U}) + R_3(\mathcal{U})\right]}_{R(\mathcal{U})}, \tag{3}$$

where

$$\begin{aligned}
R_1(\mathcal{U}) &= \sum_{i<j} w_{1,ij}\,\|\mathcal{U}_{i::} - \mathcal{U}_{j::}\|_{\mathrm{F}}, \\
R_2(\mathcal{U}) &= \sum_{i<j} w_{2,ij}\,\|\mathcal{U}_{:i:} - \mathcal{U}_{:j:}\|_{\mathrm{F}}, \\
R_3(\mathcal{U}) &= \sum_{i<j} w_{3,ij}\,\|\mathcal{U}_{::i} - \mathcal{U}_{::j}\|_{\mathrm{F}}.
\end{aligned}$$

By seeking the minimizer $\hat{\mathcal{U}}_\gamma \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ of (3), we have cast co-clustering as a signal approximation problem, modeled as a penalized regression, to estimate the true co-cluster means tensor $\mathcal{U}^*$. In the following discussion, we drop the dependence on $\gamma$ in $\hat{\mathcal{U}}_\gamma$ and denote our estimator by $\hat{\mathcal{U}}$ when there is no confusion. The quadratic term in (3) quantifies how well $\mathcal{U}$ approximates $\mathcal{X}$, while the regularization term $R(\mathcal{U})$ in (3) penalizes deviations away from a checkerbox pattern. The nonnegative parameter $\gamma$ tunes the relative emphasis on these two terms. The parameters $w_{d,ij}$ are nonnegative weights whose purpose will be discussed shortly.
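For concreteness, here is a small Python sketch (our own helper, not the paper's software) that evaluates the objective (3) for a candidate $\mathcal{U}$ when the weights are stored as sparse dictionaries over slice pairs.

```python
import numpy as np

def coco_objective_3way(X, U, gamma, weights):
    """Evaluate the convex co-clustering objective (3) for a 3-way tensor.

    weights[d] maps a pair (i, j) with i < j to the nonnegative weight
    w_{d,ij}; pairs that are absent are treated as zero-weight edges.
    """
    fit = 0.5 * np.sum((X - U) ** 2)              # 0.5 * ||X - U||_F^2
    penalty = 0.0
    for d in range(3):
        Ud = np.moveaxis(U, d, 0)                 # Ud[i] is the i-th mode-d slice
        for (i, j), w in weights[d].items():
            penalty += w * np.linalg.norm(Ud[i] - Ud[j])   # Frobenius norm of the difference
    return fit + gamma * penalty
```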

To appreciate how the regularization term $R(\mathcal{U})$ steers the minimizer of (3) towards a checkerbox pattern, consider the effect of one of the terms $R_d(\mathcal{U})$ in isolation. Specifically, suppose that $R(\mathcal{U}) = R_1(\mathcal{U})$. When $\gamma$ is zero, the minimum of (3) is attained when $\mathcal{U} = \mathcal{X}$, or stated another way, $\mathcal{U}_{i::} = \mathcal{X}_{i::}$ for $i \in \{1, \ldots, n_1\}$. As $\gamma$ increases, the mode-1 slices $\mathcal{U}_{i::}$ will shrink towards each other and in fact coalesce due to the non-differentiability of the Frobenius norm at zero. In other words, as $\gamma$ gets larger, the pairwise differences of the mode-1 slices of $\hat{\mathcal{U}}$ will become increasingly sparse. Sparsity in these pairwise differences leads to a natural partitioning assignment: two mode-1 slices $\mathcal{X}_{i::}$ and $\mathcal{X}_{j::}$ are assigned to the same mode-1 partition if $\hat{\mathcal{U}}_{i::} = \hat{\mathcal{U}}_{j::}$. Under mild regularity conditions, which we will spell out in Section 4, for sufficiently large $\gamma$ all mode-1 slices of $\hat{\mathcal{U}}$ will be identical and therefore belong to a single cluster. Similar behavior holds if $R(\mathcal{U}) = R_2(\mathcal{U})$ or $R(\mathcal{U}) = R_3(\mathcal{U})$.

When $R(\mathcal{U})$ includes all three terms $R_d(\mathcal{U})$ for $d = 1, 2, 3$, pairs of mode-1, mode-2, and mode-3 slices are simultaneously shrunk towards each other and coalesce as the parameter $\gamma$ increases. By coupling clustering along each of the modes simultaneously, our formulation explicitly seeks out a solution with a checkerbox mean structure. Moreover, we will show in Section 4 that the estimator $\hat{\mathcal{U}}$ traces out an entire solution path of checkerbox co-clustering estimates that varies continuously in $\gamma$. The solution path spans a range of models from the least smoothed model, where $\hat{\mathcal{U}}$ is $\mathcal{X}$ and each tensor element occupies its own co-cluster, to the most smoothed model, where all the elements of $\hat{\mathcal{U}}$ are identical and all tensor elements belong to a single co-cluster.

The nonnegative weights $w_{d,ij}$ fine-tune the shrinkage of the slices along the dth mode. For example, if $w_{1,ij} > w_{1,i'j'}$, then there will be more pressure for $\mathcal{U}_{i::}$ and $\mathcal{U}_{j::}$ to fuse than for $\mathcal{U}_{i'::}$ and $\mathcal{U}_{j'::}$ to fuse as $\gamma$ increases. Thus, the weight $w_{d,ij}$ quantifies the similarity between the ith and jth mode-d slices. A very large $w_{d,ij}$ indicates that the two slices are very similar, while a very small $w_{d,ij}$ indicates that they are very dissimilar. These pairwise similarities motivate a graphical view of clustering. For the dth mode, define the set $\mathcal{E}_d$ as the edge set of a similarity graph. Each slice is a node in the graph, and the set $\mathcal{E}_d$ contains an edge $(i, j)$ if and only if $w_{d,ij} > 0$. Figure 2 shows an example of a mode-1 similarity graph, which corresponds to a tensor with seven mode-1 slices and positive weights that define the edge set

$$\mathcal{E}_1 = \{(1,2), (2,3), (4,5), (4,6), (6,7)\}.$$

Given the connectivity of the graph, as $\gamma$ increases, the slices $\mathcal{U}_{1::}$, $\mathcal{U}_{2::}$, and $\mathcal{U}_{3::}$ will be shrunk towards each other, while the slices $\mathcal{U}_{4::}$, $\mathcal{U}_{5::}$, $\mathcal{U}_{6::}$, and $\mathcal{U}_{7::}$ will be shrunk towards each other. Since $w_{d,ij} = 0$ for any $(i,j) \notin \mathcal{E}_d$, we can express the penalty term for the dth mode as

$$R_d(\mathcal{U}) = \sum_{(i,j)\in\mathcal{E}_d} w_{d,ij}\,\|\mathcal{U}_{i::} - \mathcal{U}_{j::}\|_{\mathrm{F}}.$$

Figure 2: A graph that summarizes the similarities between pairs of the mode-1 subarrays. Only edges with positive weight are drawn.

The graph in Figure 2 makes it readily apparent that the convex objective in (3) separates over the connected components of the similarity graph for the mode-d slices. Consequently, one can solve for the optimal $\mathcal{U}$ component by component. Without loss of generality, we assume that the weights are such that all the similarity graphs are connected. Before leaving this preliminary description of the weights, however, we want to emphasize that in practice the weights are set once in a data-adaptive manner and should be considered empirically chosen hyper-parameters rather than tuning parameters. Further discussion of the weights and practical recommendations for specifying them are given in Section 6.

Having familiarized ourselves with the convex co-clustering of a 3-way array, we now present the natural extension of (3) for clustering the fibers of a general higher-order tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_D}$ along all its D modes. Let $\Delta_{d,ij} = \mathbf{e}_i^{\mathsf{T}} - \mathbf{e}_j^{\mathsf{T}}$, where $\mathbf{e}_i$ is the ith standard basis vector in $\mathbb{R}^{n_d}$. The objective function of our convex co-clustering for a general higher-order tensor is as follows.

$$F_\gamma(\mathcal{U}) = \frac{1}{2}\|\mathcal{X} - \mathcal{U}\|_{\mathrm{F}}^2 + \gamma\sum_{d=1}^{D}\sum_{(i,j)\in\mathcal{E}_d} w_{d,ij}\,\|\mathcal{U} \times_d \Delta_{d,ij}\|_{\mathrm{F}}. \tag{4}$$

The difference between the convex triclustering objective (3) and the general convex co-clustering objective (4) is in the penalty terms. Previously, in (3), we penalized the differences between pairs of slices, whereas in (4) we penalize the differences between pairs of mode-d subarrays.

Note that the function Fγ(U) defined in (4) has a unique global minimizer. This follows immediately from the fact that Fγ(U) is strongly convex. The unique global minimizer of Fγ(U) is our proposed CoCo estimator, which is denoted by U^ for the remainder of the paper.

At times it will be more convenient to work with vectors rather than tensors. By applying the identity in (1), we can rewrite the objective function in (4) in terms of the vectorizations of U and X as follows

$$F_\gamma(\mathbf{u}) = \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 + \gamma\sum_{d=1}^{D}\sum_{(i,j)\in\mathcal{E}_d} w_{d,ij}\,\|\mathbf{A}_{d,ij}\mathbf{u}\|_2, \tag{5}$$

where $\mathbf{A}_{d,ij}$ is the $n_{-d}$-by-$n$ matrix

$$\mathbf{A}_{d,ij} = \mathbf{I}_{n_D} \otimes \cdots \otimes \mathbf{I}_{n_{d+1}} \otimes \Delta_{d,ij} \otimes \mathbf{I}_{n_{d-1}} \otimes \cdots \otimes \mathbf{I}_{n_1}, \tag{6}$$

and $\mathbf{I}_{n_d}$ is the $n_d$-by-$n_d$ identity matrix. We will refer to the unique global minimizer of (5), $\hat{\mathbf{u}} = \arg\min_{\mathbf{u}} F_\gamma(\mathbf{u})$, as the vectorized version of our CoCo estimator.
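The matrices $\mathbf{A}_{d,ij}$ are never needed explicitly by an efficient implementation, but building them directly from (6) is a useful sanity check. The sketch below (our own helper, using scipy.sparse) assembles $\mathbf{A}_{d,ij}$ as a sparse Kronecker product.

```python
import scipy.sparse as sp

def A_dij(shape, d, i, j):
    """Sparse A_{d,ij} from (6) for a tensor with dimensions shape = (n1, ..., nD).

    Delta_{d,ij} = e_i^T - e_j^T is 1 x n_d, so A_{d,ij} has n / n_d rows and
    n columns; applied to the column-major vectorization of U it returns the
    vectorized difference of the i-th and j-th mode-d subarrays.
    """
    D = len(shape)
    delta = sp.lil_matrix((1, shape[d]))
    delta[0, i], delta[0, j] = 1.0, -1.0
    # Assemble the Kronecker product from the leftmost factor (mode D) downward.
    factors = [delta.tocsr() if m == d else sp.identity(shape[m], format="csr")
               for m in reversed(range(D))]
    A = factors[0]
    for F in factors[1:]:
        A = sp.kron(A, F, format="csr")
    return A
```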

Remark 1 The fusion penalties $R_d(\mathcal{U})$ are a composition of the group lasso (Yuan and Lin, 2006) and the fused lasso (Tibshirani et al., 2005), a special case of the generalized lasso (Tibshirani and Taylor, 2011). When only a single mode is being clustered and only one of the terms $R_d(\mathcal{U})$ is employed, we recover the objective function of the convex clustering problem (Pelckmans et al., 2005; She, 2010; Lindsten et al., 2011; Hocking et al., 2011; Sharpnack et al., 2012; Zhu et al., 2014; Chi and Lange, 2015; Radchenko and Mukherjee, 2017). Most prior work on convex clustering employs an element-wise ℓ1-norm penalty on pairwise differences, as in the original fused lasso; however, the ℓ2-norm and ℓ∞-norm have also been considered (Hocking et al., 2011; Chi and Lange, 2015). In this paper, we restrict ourselves to the ℓ2-norm for two reasons. First, the ℓ2-norm is rotationally invariant. In general, we are reluctant to adopt a procedure whose co-clustering output may change non-trivially when the coordinate representation of the data along one of its modes is trivially changed. Second, the ℓ2-norm promotes group-wise shrinkage of the pairwise differences of subarrays along each mode, leading to more straightforward partitioning along each mode: pairwise differences are either exactly zero or not. When the tensor is a matrix and the rows and columns are being simultaneously clustered, we recover the objective function of the convex biclustering problem (Chi et al., 2017). In general, the fusion penalties $R_d(\mathcal{U})$ shrink solutions toward vector-valued functions that are piece-wise constant over the mode-d similarity graph defined by the weights $w_{d,ij}$. Viewed this way, we can see our approach as simultaneously performing the network lasso (Hallac et al., 2015) on D similarity graphs.

Remark 2 The CoCo estimator is invariant to permutations of the data tensor $\mathcal{X}$ in the following sense. Suppose $\hat{\mathcal{U}}$ and $\hat{\mathcal{U}}'$ are the CoCo estimators when the data tensors are respectively $\mathcal{X}$ and $\mathcal{X}' = \mathcal{X}\times_1\boldsymbol{\Pi}_1\times_2\cdots\times_D\boldsymbol{\Pi}_D$, where $\boldsymbol{\Pi}_1\in\{0,1\}^{n_1\times n_1},\ldots,\boldsymbol{\Pi}_D\in\{0,1\}^{n_D\times n_D}$ are permutation matrices, namely $\boldsymbol{\Pi}_d^{\mathsf{T}}\boldsymbol{\Pi}_d = \mathbf{I}$. In words, $\mathcal{X}'$ can be obtained from $\mathcal{X}$ by permuting the subarrays of $\mathcal{X}$ along the dth mode according to $\boldsymbol{\Pi}_d$ for $d = 1, \ldots, D$, and $\mathcal{X}$ can be recovered from $\mathcal{X}'$ by permuting along the dth mode according to $\boldsymbol{\Pi}_d^{\mathsf{T}}$ for $d = 1, \ldots, D$. Since $\|\mathcal{U}\times_1\boldsymbol{\Pi}_1\times_2\cdots\times_D\boldsymbol{\Pi}_D\|_{\mathrm{F}} = \|\mathcal{U}\|_{\mathrm{F}}$, it follows that

$$\hat{\mathcal{U}}' = \hat{\mathcal{U}}\times_1\boldsymbol{\Pi}_1\times_2\cdots\times_D\boldsymbol{\Pi}_D \quad\text{and}\quad \hat{\mathcal{U}} = \hat{\mathcal{U}}'\times_1\boldsymbol{\Pi}_1^{\mathsf{T}}\times_2\cdots\times_D\boldsymbol{\Pi}_D^{\mathsf{T}}.$$

Permutation invariance is important because it means that the CoCo estimator is essentially unaltered by any reshuffling along the modes of the data tensor.

Remark 3 Given the co-clustering structure assumed in (2), one may wonder how much is added by explicitly seeking a co-clustering over clustering along each mode independently. In other words, why not solve D independent convex clustering problems with $R(\mathcal{U}) = R_d(\mathcal{U})$? To provide some intuition on why co-clustering should be preferred over independently clustering each mode, consider the following problem. Imagine trying to cluster row vectors $\mathbf{x}_i \in \mathbb{R}^{10{,}000}$ for $i = 1, \ldots, 100$ drawn from a two-component mixture of Gaussians, namely

$$\mathbf{x}_i \overset{\text{iid}}{\sim} \tfrac{1}{2}N(\boldsymbol{\mu}, \sigma^2\mathbf{I}) + \tfrac{1}{2}N(\boldsymbol{\nu}, \sigma^2\mathbf{I}).$$

This is a challenging clustering problem due to the disproportionately small number of observations compared to the number of features. If, however, we were told that μj = μ1 and νj = ν1 for j = 1, … , 5,000 and μj = μ2 and νj = ν2 for j = 5,001, … , 10,000, in other words that the features were clustered into two groups, our fortunes have reversed and we now have an abundance of observations compared to the number of effective features. Even if we lack a clear-cut clustering structure in the features, this example suggests that leveraging similarity structure along the columns can expedite identifying similarity structure along the rows, and vice versa. Indeed, if there is an underlying checkerbox mean tensor, we may expect that simultaneously clustering along each mode should make the task of clustering along any one given mode easier. Our prediction error result presented in Section 4.2 in fact supports this suspicion (see Remark 10).

4. Properties

We first discuss how the CoCo estimator U^ behaves as a function of the data tensor X, the tuning parameter γ, and the weights wd,ij. We will then present its statistical properties under mild conditions on the data generating process. We highlight that these properties hold regardless of the algorithm used to minimize (4), as they are intrinsic to its convex formulation. All proofs are given in Appendix B and Appendix C.

4.1. Stability Properties

The CoCo estimator varies smoothly with respect to X, γ, and {wd,ij}. Let Wd = {wd,ij} denote the weights matrix for mode d.

Proposition 4 The minimizer U^ of (4) is jointly continuous in (X, γ, W1, W2, … , WD).

As noted earlier, in practice we will typically fix the weights $w_{d,ij}$ and compute the CoCo estimator over a grid of values of the penalization parameter $\gamma$ in order to select a final CoCo estimator from among the computed candidate estimators of varying levels of smoothness. Since (4) does not admit a closed-form minimizer, we resort to iterative algorithms for computing the CoCo estimator. Continuity of $\hat{\mathcal{U}}$ in $\gamma$ can be leveraged to expedite computation through warm starts, namely using the solution $\hat{\mathcal{U}}_\gamma$ as the initial guess for iteratively computing $\hat{\mathcal{U}}_{\gamma'}$, where $\gamma'$ is slightly larger or smaller than $\gamma$. Due to the continuity of $\hat{\mathcal{U}}$ in $\gamma$, small changes in $\gamma$ will result in small changes in $\hat{\mathcal{U}}$. Empirically, the use of warm starts can lead to a non-trivial reduction in computation time (Chi and Lange, 2015). From the continuity in $\gamma$, we also see that convex co-clustering performs continuous co-clustering just as the lasso (Tibshirani, 1996) performs continuous variable selection.

The penalization parameter γ tunes the complexity of the CoCo estimator. Clearly when γ = 0, the CoCo estimator coincides with the data tensor, namely U^=X. The key to understanding the CoCo estimator’s behavior as γ increases is to recognize that the penalty functions Rd(U) are semi-norms. Under suitable conditions on the weights given in Assumption 4.1 below, Rd(U) vanishes if and only if the mode-d subarrays of U are identical.

Assumption 4.1 For any pair of mode-d subarrays, indexed by i and j with i < j, there exists a sequence of indices $i \to k \to \cdots \to l \to j$ along which the weights $w_{d,ik}, \ldots, w_{d,lj}$ are positive.

Proposition 5 Under Assumption 4.1, $R_d(\mathcal{U}) = 0$ if and only if $\mathbf{U}_{(d)} = \mathbf{1}\mathbf{c}^{\mathsf{T}}$ for some $\mathbf{c} \in \mathbb{R}^{n_{-d}}$.

To give some intuition for Proposition 5, note that the term $R_d(\mathcal{U})$ separates over the connected components of the mode-d similarity graph. Therefore, the term $R_d(\mathcal{U})$ penalizes variation in the mode-d subarrays over the connected components of the mode-d similarity graph. Assumption 4.1 states that the mode-d similarity graph is connected. Thus, the only way for $R_d(\mathcal{U})$ to attain its minimum value and vanish under Assumption 4.1 is if there is no variation in $\mathcal{U}$ along its mode-d subarrays.

Proposition 5 suggests that if Assumption 4.1 holds for all d = 1, … ,D then as γ increases the CoCo estimator converges to the solution of the following constrained optimization problem:

$$\min_{\mathbf{u}}\ \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 \quad \text{subject to} \quad \mathbf{u} = c\mathbf{1} \text{ for some } c \in \mathbb{R},$$

the solution to which is just the global mean $\bar{\mathbf{x}}$, whose entries are all identically equal to the average value of $\mathbf{x}$ over all its entries. The next result formalizes our intuition that as $\gamma$ increases, the CoCo estimator eventually coincides with $\bar{\mathbf{x}}$.

Proposition 6 Suppose Assumption 4.1 holds for $d = 1, \ldots, D$. Then $F_\gamma(\mathcal{U})$ is minimized by the grand mean $\bar{\mathcal{X}}$ for $\gamma$ sufficiently large.

Thus, as $\gamma$ increases from 0, the CoCo estimator $\hat{\mathcal{U}}$ traces a continuous solution path that starts from $n$ co-clusters, with $u_{i_1\cdots i_D} = x_{i_1\cdots i_D}$, and ends at a single co-cluster, where $u_{i_1\cdots i_D} = \mathbf{x}^{\mathsf{T}}\mathbf{1}/n$ for all $i_1, \ldots, i_D$.

For a fixed $\gamma$, we can derive an explicit bound on the sensitivity of the CoCo estimator to perturbations in the data.

Proposition 7 The minimizer U^ of (4) is a nonexpansive or 1-Lipschitz function of the data tensor X, namely

$$\|\hat{\mathcal{U}}(\mathcal{X}) - \hat{\mathcal{U}}(\tilde{\mathcal{X}})\|_{\mathrm{F}} \leq \|\mathcal{X} - \tilde{\mathcal{X}}\|_{\mathrm{F}}.$$

Nonexpansivity of $\hat{\mathcal{U}}$ in $\mathcal{X}$ provides an attractive stability result. Since $\hat{\mathcal{U}}$ varies smoothly with the data, small perturbations in the data are guaranteed not to lead to large variability in $\hat{\mathcal{U}}$, or consequently large variability in the cluster assignments. In a special case of our method, Chi et al. (2017) showed empirically that the co-clustering assignments made by the 2-way version of the CoCo estimator were noticeably less sensitive to perturbations in the data than those made by several existing biclustering algorithms.

4.2. Statistical Properties

We next provide a finite sample bound for the prediction error of the CoCo estimator. For simplicity, we consider the case where we take uniform weights within a mode in (5), namely $w_{d,ij} = w_{d,i'j'} = 1/n_d$ for all $i, j, i', j' \in \{1, \ldots, n_d\}$. Such a uniform weight assumption has also been imposed in the analysis of the vector-version of convex clustering (Tan and Witten, 2015).

In order to derive the estimation error of $\hat{\mathbf{u}}$, we first introduce an important definition for the noise and two regularity conditions.

Definition 8 (Vu and Wang (2015)) We say a random vector $\mathbf{y} \in \mathbb{R}^n$ is M-concentrated if there are constants $C_1, C_2 > 0$ such that for any convex, 1-Lipschitz function $\phi: \mathbb{R}^n \to \mathbb{R}$ and any $t > 0$,

$$\mathbb{P}\left(\left|\phi(\mathbf{y}) - \mathbb{E}[\phi(\mathbf{y})]\right| \geq t\right) \leq C_1\exp\left(-\frac{C_2 t^2}{M^2}\right).$$

The M-concentrated random variable is more general than the Gaussian or sub-Gaussian random variables, and it allows dependence in its coordinates. Vu and Wang (2015) provided a few examples of M-concentrated random variables. For instance, if the coordinates of y are iid standard Gaussian, then y is 1-concentrated. If the coordinates of y are independent and M-bounded, then y is M-concentrated. If the coordinates of y come from a random walk with certain mixing properties, then y is M-concentrated for some M.

Assumption 4.2 (Model) We assume the true cluster center $\mathcal{C}^* \in \mathbb{R}^{k_1 \times \cdots \times k_D}$ has a checkerbox structure such that the mode-d subarrays have $k_d$ different values (the number of clusters along the dth mode), and each entry of $\mathcal{C}^*$ is bounded above by a constant $C_0 > 0$. Define $\mathcal{U}^* \in \mathbb{R}^{n_1 \times \cdots \times n_D}$ as the true parameter expanded based on $\mathcal{C}^*$, namely

$$\mathcal{U}^* = \mathcal{C}^* \times_1 \mathbf{M}_1 \times_2 \mathbf{M}_2 \times_3 \cdots \times_D \mathbf{M}_D,$$

where $\mathbf{M}_d \in \{0,1\}^{n_d \times k_d}$ are binary mode-d cluster membership matrices such that $\mathbf{M}_d\mathbf{1} = \mathbf{1}$. Denote $\mathbf{u}^* = \mathrm{vec}(\mathcal{U}^*) \in \mathbb{R}^n$ with $n = \prod_{d=1}^{D} n_d$. We assume the samples belonging to the $(r_1, \ldots, r_D)$th cluster satisfy

$$x_{i_1,\ldots,i_D} = c^*_{r_1,\ldots,r_D} + \epsilon_{i_1,\ldots,i_D},$$

with $i_d \in \{1, \ldots, n_d\}$ and $r_d \in \{1, \ldots, k_d\}$. Furthermore, we assume $\boldsymbol{\epsilon} = \mathrm{vec}(\mathcal{E})$ is an M-concentrated random vector, in the sense of Definition 8, with mean zero.

The checkerbox means model in Assumption 4.2 provides the underlying cluster structure of the tensor data. As a special case, Assumption 4.2 with D = 2 reduces to the model assumption underlying convex biclustering (Chi et al., 2017). In contrast to the independent sub-Gaussian condition assumed for the vector-version of convex clustering (Tan and Witten, 2015), our error condition is much weaker since we allow for non-sub-Gaussian distributions as well as for dependence among the coordinates.

Assumption 4.3 (Tuning) The tuning parameter γ satisfies

$$\frac{2\sqrt{\log(n)}}{nD} \leq \gamma \leq \frac{2c_0\sqrt{\log(n)}}{nD},$$

for some constant c0 > 1.

Theorem 9 Suppose that Assumption 4.2 and Assumption 4.3 hold. The estimation error of û in (5) with uniform weights satisfies,

$$\frac{1}{n}\|\hat{\mathbf{u}} - \mathbf{u}^*\|_2^2 \leq \frac{1}{D}\sum_{d=1}^{D}\left(\frac{1}{n_d} + \sqrt{\frac{\log(n)}{n\,n_d}}\right) + \frac{C\sqrt{\log(n)}}{Dn}\sum_{d=1}^{D} n_d^2\sqrt{\prod_{j\neq d}k_j}, \tag{7}$$

with high probability, where $C = 12c_0C_0^2$ is a positive constant and $k_d$ is the true number of clusters in the dth mode.

Theorem 9 provides a finite sample error bound for the proposed CoCo tensor estimator. Our theoretical bound allows the number of clusters in each mode to diverge, which reflects a typical large-scale clustering scenario in big tensor data. A notable consequence of Theorem 9 is that, when D ≥ 3, namely a higher-order tensor with at least 3 modes, the CoCo estimator can achieve estimation consistency along all the D modes even when we only have one tensor sample. Here the sample size refers to the number of available tensor samples. In our tensor clustering problem, we only have access to one tensor sample.

This property is uniquely enjoyed by co-clustering of tensor data with D ≥ 3 and has not been previously established in the existing literature on vector clustering or biclustering. To see this, note that when the $n_d$ are of the same order as $n_0$ and the $k_d$ are of the same order as $k_0$, a sufficient condition for consistency is that $n_0 \to \infty$ and $k_0 = o\big(n_0^{2(D-2)/(D-1)}\big)$ up to a log term. When D = 3, the CoCo estimator is consistent so long as the number of clusters $k_0$ in each mode diverges slightly slower than $n_0$. Remarkably, as we have more modes in the tensor data, this constraint on the rate of divergence of $k_0$ gets weaker. In short, we reap a unique and surprisingly welcome “blessing of dimensionality” phenomenon in the tensor co-clustering problem.
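To see where this sufficient condition comes from, here is a back-of-the-envelope calculation (a sketch only, with constants and the exact log factors suppressed) in the balanced setting $n_d \asymp n_0$ and $k_d \asymp k_0$, applied to the second term of the bound (7) as displayed above:

$$\frac{C\sqrt{\log(n)}}{Dn}\sum_{d=1}^{D} n_d^2\sqrt{\prod_{j\neq d}k_j} \;\asymp\; \sqrt{\log(n)}\,\frac{k_0^{(D-1)/2}}{n_0^{D-2}} \;\longrightarrow\; 0 \quad\Longleftrightarrow\quad k_0 = o\!\left(n_0^{2(D-2)/(D-1)}\right)\ \text{up to a log factor}.$$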

Remark 10 Next we discuss the connections of our bound (7) with prior results in the literature. An intermediate step in the proof of Theorem 9 indicates that the estimation error in the dth mode is on the order of $1/n_d + \sqrt{\log(n)/(n\,n_d)} + \sqrt{\log(n)}\,n_d\sqrt{\prod_{j\neq d}k_j}\big/n_{-d}$. In clustering along the rows of a data matrix, our rate matches the one established for the vector-version of convex clustering (Tan and Witten, 2015), up to a log term $\sqrt{\log(n)}$. Such a log term is due to the fact that Tan and Witten (2015) consider the error to be iid sub-Gaussian while we consider a general M-concentrated error. In practice, the iid assumption on the noise $\boldsymbol{\epsilon} = \mathrm{vec}(\mathcal{E})$ could be restrictive. Consequently, our theoretical analysis is built upon a new concentration inequality for quadratic forms recently developed in Vu and Wang (2015). In addition, our rate reveals an interesting theoretical property of the convex biclustering method proposed by Chi et al. (2017). When D = 2, our rate indicates that the estimation errors along the rows and columns of the data matrix are $\sqrt{\log(n_1 n_2)}\,n_1\sqrt{k_2}/n_2$ and $\sqrt{\log(n_1 n_2)}\,n_2\sqrt{k_1}/n_1$, respectively. Clearly, both errors cannot converge to zero simultaneously. This indicates a disadvantage of matricizing a data tensor for co-clustering.

5. Estimation Algorithm

We next discuss a simple first-order method for computing the solution to the convex co-clustering problem. The proposed algorithm generalizes the variable splitting approach introduced for the convex clustering problem in Chi and Lange (2015) to the CoCo problem. The key observation is that the Lagrangian dual of an equivalent formulation of the convex co-clustering problem is a constrained least squares problem that can be iteratively solved using the classic projected gradient algorithm.

5.1. A Lagrangian Dual of the CoCo Problem

Recall that we seek to minimize the objective function in (5)

$$F_\gamma(\mathbf{u}) = \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 + \gamma\sum_{d=1}^{D}\sum_{l\in\mathcal{E}_d} w_{d,l}\,\|\mathbf{A}_{d,l}\mathbf{u}\|_2.$$

Note that we have enumerated the edge indices in Ed to simplify the notation for the following derivation.

We perform variable splitting and introduce the dummy variables $\mathbf{v}_{d,l} = \mathbf{A}_{d,l}\mathbf{u}$. Let $\mathbf{V}_d$ denote the $n_{-d} \times |\mathcal{E}_d|$ matrix whose lth column is $\mathbf{v}_{d,l}$. Further denote the vectorization of $\mathbf{V}_d$ by $\mathbf{v}_d = \mathrm{vec}(\mathbf{V}_d)$, and let $\mathbf{v} = [\mathbf{v}_1^{\mathsf{T}}\ \mathbf{v}_2^{\mathsf{T}}\ \cdots\ \mathbf{v}_D^{\mathsf{T}}]^{\mathsf{T}}$ denote the vector obtained by stacking the vectors $\mathbf{v}_d$ on top of each other. We now solve the equivalent equality-constrained minimization

$$\min_{\mathbf{v},\mathbf{u}}\ \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 + \gamma\sum_{d=1}^{D}\sum_{l\in\mathcal{E}_d} w_{d,l}\,\|\mathbf{v}_{d,l}\|_2 \quad \text{subject to} \quad \mathbf{v}_d = \mathbf{A}_d\mathbf{u},$$

where $\mathbf{A}_d = \mathbf{I}_{n_D} \otimes \cdots \otimes \mathbf{I}_{n_{d+1}} \otimes \boldsymbol{\Phi}_d \otimes \mathbf{I}_{n_{d-1}} \otimes \cdots \otimes \mathbf{I}_{n_1}$ and $\boldsymbol{\Phi}_d$ is the oriented edge-vertex incidence matrix for the dth mode graph, namely

$$\Phi_{d,lv} = \begin{cases} 1 & \text{if node } v \text{ is the head of edge } l, \\ -1 & \text{if node } v \text{ is the tail of edge } l, \\ 0 & \text{otherwise.} \end{cases}$$

We introduce dual variables $\boldsymbol{\lambda}_d$ corresponding to the equality constraint $\mathbf{v}_d = \mathbf{A}_d\mathbf{u}$. Let $\boldsymbol{\Lambda}_d$ denote the $n_{-d} \times |\mathcal{E}_d|$ matrix whose lth column is $\boldsymbol{\lambda}_{d,l}$. Further denote the vectorization of $\boldsymbol{\Lambda}_d$ by $\boldsymbol{\lambda}_d = \mathrm{vec}(\boldsymbol{\Lambda}_d)$ and $\boldsymbol{\lambda} = [\boldsymbol{\lambda}_1^{\mathsf{T}}\ \boldsymbol{\lambda}_2^{\mathsf{T}}\ \cdots\ \boldsymbol{\lambda}_D^{\mathsf{T}}]^{\mathsf{T}}$. The Lagrangian dual objective is given by

$$G(\boldsymbol{\lambda}) = \frac{1}{2}\|\mathbf{x}\|_2^2 - \frac{1}{2}\|\mathbf{x} - \mathbf{A}^{\mathsf{T}}\boldsymbol{\lambda}\|_2^2 - \sum_{d=1}^{D}\sum_{l\in\mathcal{E}_d} \iota_{C_{d,l}}(\boldsymbol{\lambda}_{d,l}),$$

where $\mathbf{A} = [\mathbf{A}_1^{\mathsf{T}}\ \mathbf{A}_2^{\mathsf{T}}\ \cdots\ \mathbf{A}_D^{\mathsf{T}}]^{\mathsf{T}}$ and $\iota_{C_{d,l}}$ is the indicator function of the closed convex set $C_{d,l} = \{\mathbf{z} : \|\mathbf{z}\|_2 \leq \gamma w_{d,l}\}$, namely $\iota_{C_{d,l}}$ is the function that vanishes on the set $C_{d,l}$ and is infinity on the complement of $C_{d,l}$. Details on the derivation of the dual objective $G(\boldsymbol{\lambda})$ are provided in Appendix D.

Maximizing the dual objective G(λ) is equivalent to solving the following constrained least squares problem:

$$\min_{\boldsymbol{\lambda}\in C}\ \frac{1}{2}\|\mathbf{x} - \mathbf{A}^{\mathsf{T}}\boldsymbol{\lambda}\|_2^2, \tag{8}$$

where $C = \{\boldsymbol{\lambda} : \boldsymbol{\lambda}_{d,l} \in C_{d,l},\ l \in \mathcal{E}_d,\ d = 1, \ldots, D\}$. We can recover the primal solution via the relationship

$$\hat{\mathbf{u}} = \mathbf{x} - \mathbf{A}^{\mathsf{T}}\hat{\boldsymbol{\lambda}},$$

where $\hat{\boldsymbol{\lambda}}$ is a solution to the dual problem (8). The dual problem (8) has at least one solution by the Weierstrass extreme value theorem, but the solution may not be unique since $\mathbf{A}^{\mathsf{T}}$ has a non-trivial kernel. Nonetheless, our CoCo estimator $\hat{\mathbf{u}}$ is still unique since $\mathbf{A}^{\mathsf{T}}\hat{\boldsymbol{\lambda}}_1 = \mathbf{A}^{\mathsf{T}}\hat{\boldsymbol{\lambda}}_2$ for any two solutions $\hat{\boldsymbol{\lambda}}_1, \hat{\boldsymbol{\lambda}}_2$ of problem (8).

We numerically solve the constrained least squares problem in (8) with the projected gradient algorithm, which alternates between taking a gradient step and projecting onto the set C. Algorithm 1 provides pseudocode for the projected gradient algorithm, which has several good features. The projected gradient algorithm is guaranteed to converge to a global minimizer of (8). Its per-iteration and storage costs, using the weight choices described in Section 6, are both O(Dn), namely linear in the number of modes D and in the number of elements n. For a modest additional computational and storage cost, we can accelerate the projected gradient method, for example with FISTA (Beck and Teboulle, 2009) or SpaRSA (Wright et al., 2009). In our experiments, we use a version of the latter, namely FASTA (Goldstein et al., 2014, 2015). Additional details on the derivation of the algorithmic updates, convergence guarantees, computational and storage costs, as well as stopping rules can be found in Appendix E.

6. Specifying Non-Uniform Weights

In Section 4.2, we assumed uniform weights wd,ij in the penalty terms Rd(U) to establish a prediction error bound, which revealed a surprising and beneficial “blessing of dimensionality” phenomenon. Although this simplifying assumption gives clarity and insight into how the co-clustering problem gets easier as the number of modes increases, in practice choosing non-uniform weights can substantially improve the quality of the clustering results. In the context of convex clustering, Chen et al. (2015) and Chi and Lange (2015) provided empirical evidence that convex clustering with uniform weights struggled to produce exact sparsity in the pairwise differences of smooth estimates when there was not a strong separation between groups. Indeed, similar phenomena were observed in earlier work on the related clustered lasso (She, 2010). Several related works (She, 2010; Hocking et al., 2011; Chen et al., 2015; Chi and Lange, 2015) recommend a weight assignment strategy described below. In addition, the use of sparse weights can also lead to non-trivial improvements in both computational time and clustering performance (Chi and Lange, 2015; Chi et al., 2017).

Algorithm 1 Convex Co-Clustering (CoCo) Estimation Algorithm

Initialize λ(0); m ← 0
repeat
  u(m+1) = x − ATλ(m) ▷ Gradient Step
  for d = 1,…, D do
   for l ∈ Ed do
     λd,l(m+1) = PCd,l(λd,l(m) + η Ad,l u(m+1)) ▷ Projection Step
   end for
  end for
  m ← m + 1
until convergence
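The following is a direct, unoptimized NumPy transcription of Algorithm 1 (a sketch; in practice the $\mathbf{A}_{d,l}$ blocks would be kept sparse, the step size $\eta$ must be small enough for convergence, roughly $\eta \le 1/\|\mathbf{A}\|_2^2$, and an accelerated solver such as FASTA would be preferred). The projection $P_{C_{d,l}}$ is simply projection onto an $\ell_2$ ball of radius $\gamma w_{d,l}$.

```python
import numpy as np

def coco_projected_gradient(x, A_blocks, radii, eta, max_iter=500, tol=1e-8):
    """Projected gradient method for the dual problem (8), following Algorithm 1.

    x        : vectorized data tensor of length n.
    A_blocks : list of matrices A_{d,l} (dense or scipy.sparse), one per edge.
    radii    : list of radii gamma * w_{d,l} defining the balls C_{d,l}.
    eta      : step size.
    Returns the primal CoCo estimate u_hat = x - A^T lambda_hat.
    """
    lambdas = [np.zeros(A.shape[0]) for A in A_blocks]
    for _ in range(max_iter):
        # Gradient step: current primal iterate u = x - A^T lambda.
        u = x - sum(A.T @ lam for A, lam in zip(A_blocks, lambdas))
        max_change = 0.0
        for idx, (A, r) in enumerate(zip(A_blocks, radii)):
            z = lambdas[idx] + eta * (A @ u)
            nz = np.linalg.norm(z)
            z = z if nz <= r else (r / nz) * z      # projection onto ||z||_2 <= r
            max_change = max(max_change, np.linalg.norm(z - lambdas[idx]))
            lambdas[idx] = z
        if max_change < tol:
            break
    return x - sum(A.T @ lam for A, lam in zip(A_blocks, lambdas))
```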

To illustrate the practical value of non-uniform weights, we compare CoCo’s ability to recover co-clusters, using both uniform and non-uniform weights, as the size of a 3-way tensor increases when there are two clusters per mode with balanced cluster sizes along each mode. We assess the quality of the recovered clustering performance using the Adjusted Rand Index (ARI). The ARI (Hubert and Arabie, 1985) varies between −1 and 1, where 1 indicates a perfect match between two clustering assignments whereas a value close to zero indicates the two clustering assignments match about as might be expected if they were both randomly generated. Negative values indicate that there is less agreement between clusterings than expected from random partitions.

Figure 3 shows a comparison between using non-uniform weights that are described in Section 6.2 and uniform weights. Each plotted point in Figure 3 is the average ARI over 100 replicates. For CoCo using non-uniform weights, the smoothing parameter γ is chosen with the data-driven extended BIC method that is detailed in Section 7.1. In contrast, for CoCo using uniform weights, γ is chosen as the value that produces the estimator that minimizes the true but unknown MSE.

Figure 3: Uniform versus non-uniform weights: Average Adjusted Rand Index for an increasing tensor size. Here $n = n_0^3$ refers to a tensor of size $n_0 \times n_0 \times n_0$.

We see that while using uniform weights in CoCo leads to recovering co-clusters exactly once a sufficient number of samples have been acquired, using non-uniform weights enables CoCo to recover the co-clusters exactly with notably fewer samples. The results of this experiment are especially remarkable because CoCo using non-uniform weights and a data-adaptive choice of γ outperformed CoCo using uniform weights and an ideally chosen oracle value of γ.

As in the case of convex clustering, using non-uniform weights can lead to significantly better performance than using uniform weights in practice. We give some explanation for why this is expected in Section 6.3 but leave it to future work to develop theory proving this performance improvement. Nonetheless, based on this observation, we employ non-uniform weights in CoCo for the empirical studies presented later in the paper.

6.1. Basic Procedure for Specifying Weights

We first describe our basic two step procedure for constructing weights before elaborating on the final refinements used in our numerical experiments.

Step 1: We first calculate pre-weights w˜d,ij between the ith and jth mode-d subarrays as

$$\tilde{w}_{d,ij} = \iota^k_{\{i,j\}}\exp\left(-\tau_d\|\mathbf{X}_{(d),i:} - \mathbf{X}_{(d),j:}\|_{\mathrm{F}}^2\right). \tag{9}$$

The first factor on the right-hand side of equation (9), $\iota^k_{\{i,j\}}$, is an indicator function that equals 1 if the jth slice is among the ith slice's k-nearest neighbors (or vice versa) and 0 otherwise. The purpose of this term is to control the sparsity of the weights. The corresponding tuning parameter k influences the connectivity of the mode-d similarity graph. One can explore different levels of granularity in the clustering by varying k (Chen et al., 2015). As a default, one can use the smallest k such that the similarity graph is still connected. Note it is not necessary to calculate the exact k-nearest neighbors, which scales quadratically in the number of fibers in the mode. A fast approximation to the k-nearest neighbors is sufficient for the sake of inducing sparsity into the weights. Chi and Lange (2015) provided two reasons for using k-nearest-neighbor weights. First, we wish to prioritize fusions between pairs of subarrays that are most similar; the subarrays that are most dissimilar should be the last pair of subarrays to fuse as the smoothing parameter γ increases. Second, we wish to use a sparse similarity graph since the computational and storage complexity of the estimation algorithm is proportional to the number of non-zero edges in the similarity graphs (Appendix E). Using k-nearest-neighbor weights accomplishes both goals.

The second factor on the right-hand side of equation (9) is the Gaussian kernel, which takes on larger values for pairs of mode-d subarrays that are more similar to each other. Chi and Steinerberger (2019) give a detailed theoretical justification for using weights like the Gaussian kernel weights in the context of convex clustering. For space considerations, we refer readers interested in these technical details to their work and give a brief intuitive rationale for employing the Gaussian kernel here. Intuitively, the weights should be inversely proportional to the distance between the ith and jth mode-d subarrays (Chen et al., 2015; Chi et al., 2017). The inverse of the nonnegative parameter $\tau_d$ is a measure of scale. In practice, we can set it to be the median Euclidean distance between pairs of mode-d subarrays that are k-nearest neighbors of each other. A value of $\tau_d = 0$ corresponds to uniform weights. Note that with minor modification, we can make the inverse scale parameter pair-dependent as described in Zelnik-Manor and Perona (2005).

Step 2: To obtain the mode-d weights wd,ij, we normalize the mode-d pre-weights w˜d,ij to sum to nd/n. The normalization step puts the penalty terms Rd(U) on the same scale and ensures that clustering along any given single mode will not dominate the entire co-clustering as γ increases.
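Putting the two steps together, a minimal NumPy implementation of the weight construction might look like the following (the helper name `mode_weights` is ours; an exact k-nearest-neighbor search is used purely for brevity, whereas the text notes that a fast approximation suffices).

```python
import numpy as np

def mode_weights(X, d, k=5, tau=None):
    """Sparse k-nearest-neighbor Gaussian kernel weights for mode d (Section 6.1).

    Returns a dict mapping (i, j), i < j, to the normalized weight w_{d,ij}.
    """
    n_d, n = X.shape[d], X.size
    Xd = np.moveaxis(X, d, 0).reshape(n_d, -1)     # mode-d subarrays as rows
    D2 = np.sum((Xd[:, None, :] - Xd[None, :, :]) ** 2, axis=-1)   # squared distances

    # Step 1a: symmetrized k-nearest-neighbor indicator.
    order = np.argsort(D2, axis=1)
    knn = np.zeros_like(D2, dtype=bool)
    for i in range(n_d):
        knn[i, order[i, 1:k + 1]] = True
    knn = knn | knn.T

    # Step 1b: Gaussian kernel; 1/tau set to the median distance among kNN pairs.
    if tau is None:
        tau = 1.0 / np.median(np.sqrt(D2[knn]))
    pre = {(i, j): np.exp(-tau * D2[i, j])
           for i in range(n_d) for j in range(i + 1, n_d) if knn[i, j]}

    # Step 2: normalize the mode-d pre-weights to sum to n_d / n.
    total = sum(pre.values())
    return {edge: w * (n_d / n) / total for edge, w in pre.items()}
```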

6.2. Improving Weights via the Tucker Decomposition

In our preliminary experiments, we found that substituting a low-rank approximation of $\mathcal{X}$, namely a Tucker decomposition $\tilde{\mathcal{X}}$, in place of $\mathcal{X}$ in (9) led to a marked improvement in co-clustering performance. To understand the boost in performance, suppose that $\mathcal{X} = \mathcal{U}^* + \mathcal{E}$ with $\mathcal{U}^*$ having a checkerbox structure and the entries of $\mathcal{E}$ iid $N(0, \sigma^2)$ for simplicity. Further suppose that the ith and jth mode-d subarrays of $\mathcal{U}^*$ belong to the same partition and $\iota^k_{\{i,j\}} = 1$. Then

$$\tilde{w}_{d,ij} = \exp\left(-\tau_d\|\mathcal{E}\times_d\Delta_{ij}\|_{\mathrm{F}}^2\right) = \exp\left(-2\tau_d\sigma^2 Z_{d,ij}\right),$$

where $Z_{d,ij} = \|\mathcal{E}\times_d\Delta_{ij}\|_{\mathrm{F}}^2/(2\sigma^2)$ is distributed as a $\chi^2$ random variable with $n_{-d}$ degrees of freedom. If we were able to perfectly denoise the tensor $\mathcal{X}$ so that $\sigma = 0$, then the pre-weight $\tilde{w}_{d,ij}$ would be set to its maximal value of 1, the ideal value for $\tilde{w}_{d,ij}$ since we have assumed the ith and jth mode-d subarrays belong to the same partition. Thus, if we can reduce $\sigma^2$, namely denoise the observed tensor $\mathcal{X}$, we can approach the ideal values of the pre-weights. Note that we are more focused on approaching the ideal pre-weight values for pairs of subarrays that belong to the same partition and less concerned with pairs of subarrays in different partitions, as the Gaussian kernel weights decay very rapidly. The Tucker decomposition is effective at reducing $\sigma^2$ when $\mathcal{U}^*$ has a checkerbox pattern, as the checkerbox pattern is a low-rank tensor that can be effectively approximated with the Tucker decomposition.

Employing the Tucker decomposition introduces another tuning parameter, namely the rank of the decomposition. In our simulation studies described in Section 8, we use two different methods for choosing the rank as a robustness check to ensure our CoCo estimator’s performance does not crucially depend on the rank selection method. Details on these two methods can be found in Appendix F. While we found the Tucker decomposition to work well in practice, we suspect that other methods of denoising the tensor may work just as well or could possibly be more effective. We leave it to future work to explore alternatives to the Tucker decomposition.

6.3. Weights and Folded-Concave Penalties

We conclude our discussion on weights by highlighting how they provide a connection between convex clustering and other penalized regression-based clustering methods that use folded-concave penalties (Pan et al., 2013; Xiang et al., 2013; Zhu et al., 2013; Marchetti and Zhou, 2014; Wu et al., 2016a). Suppose we seek to minimize the objective

$$\tilde{f}_\gamma(\mathbf{u}) = \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 + \gamma\sum_{d=1}^{D}\sum_{(i,j)\in\mathcal{E}_d} \varphi_d\left(\|\mathbf{A}_{d,ij}\mathbf{u}\|_2\right), \tag{10}$$

where each φd : [0, ∞) ↦ [0, ∞) has the following properties: (i) φd is concave and differentiable on (0, ∞), (ii) φd vanishes at the origin, and (iii) the directional derivative of φd exists and is positive at the origin. Such functions are collectively referred to as folded-concave penalties; prominent examples include the smoothly clipped absolute deviation (Fan and Li, 2001) and the minimax concave penalty (Zhang, 2010).

Since φd is concave and differentiable, for all positive z and z˜

$$\varphi_d(z) \leq \varphi_d(\tilde{z}) + \varphi'_d(\tilde{z})(z - \tilde{z}). \tag{11}$$

The inequality (11) indicates that the first order Taylor expansion of a differentiable concave function φd provides a tight global upper bound at the expansion point z˜. Thus, we can construct a function that is a tight upper bound of the function f˜γ(u)

$$g_\gamma(\mathbf{u}\mid\tilde{\mathbf{u}}) = \frac{1}{2}\|\mathbf{x} - \mathbf{u}\|_2^2 + \gamma\sum_{d=1}^{D}\sum_{(i,j)\in\mathcal{E}_d} w_{d,ij}\,\|\mathbf{A}_{d,ij}\mathbf{u}\|_2 + c, \tag{12}$$

where the constant c does not depend on u and wd,ij are weights that depend on ũ, namely

$$w_{d,ij} = \varphi'_d\left(\|\mathbf{A}_{d,ij}\tilde{\mathbf{u}}\|_2\right). \tag{13}$$

Note that if we take ũ to be the vectorization of the Tucker approximation of the data, vec(X˜), and φd(z) to be the following variation on the error function

$$\varphi_d(z) = \frac{1}{n_d}\sum_{(i,j)\in\mathcal{E}_d} w_{d,ij}\int_0^z e^{-\tau_d\omega^2}\,d\omega,$$

then the function given in (10) coincides with the CoCo objective using the prescribed Tucker derived Gaussian kernel weights.

The function $g_\gamma(\mathbf{u}\mid\tilde{\mathbf{u}})$ is said to majorize the function $\tilde{f}_\gamma(\mathbf{u})$ at the point $\tilde{\mathbf{u}}$ (Lange et al., 2000), and minimizing it corresponds to performing one step of the local linear approximation algorithm (Zou and Li, 2008; Schifano et al., 2010), which is a special case of the majorization-minimization (MM) algorithm (Lange et al., 2000). The corresponding MM algorithm would consist of repeating the following two steps: (i) using a previous CoCo estimate $\tilde{\mathcal{U}}$ to compute weights $w_{d,ij}$ according to (13), and (ii) computing a new CoCo estimate using the new weights. In practice, however, we have found one step to be adequate. Indeed, Zou and Li (2008) showed that the solution of the one-step algorithm is often sufficient in terms of its statistical estimation accuracy.

7. Other Practical Issues

In this section, we address other considerations for using the method in practice, namely how to choose the tuning parameter γ and how to recover the partitions along each mode from the CoCo estimator U^.

7.1. Choosing γ

The first major practical consideration is how to choose $\gamma$ to produce a final co-clustering result. Since co-clustering is an exploratory method, it may be suitable for a user to manually inspect a sequence of CoCo estimators $\hat{\mathcal{U}}_\gamma$ over a range of $\gamma$ and use domain knowledge tied to a specific application to select a $\gamma$ that recovers a co-clustering assignment of a desired complexity. Since this approach is time consuming and requires expert knowledge, an automated, data-driven procedure for selecting $\gamma$ is desirable. Cross-validation (Stone, 1974; Geisser, 1975) and stability selection (Meinshausen and Bühlmann, 2010) are popular techniques for tuning parameter selection, but since both methods are based on resampling, they are unattractive in the tensor setting due to the computational burden. We turn to the extended Bayesian Information Criterion (eBIC) proposed by Chen and Chen (2008, 2012), as it does not rely on resampling and thus is not as computationally costly as cross-validation or stability selection. For a given $\gamma$, the eBIC is

$$\mathrm{eBIC}(\gamma) = n\log\left(\frac{\mathrm{RSS}_\gamma}{n}\right) + 2\,\mathrm{df}_\gamma\log(n),$$

where $\mathrm{RSS}_\gamma$ is the residual sum of squares $\|\mathcal{X} - \hat{\mathcal{U}}_\gamma\|_{\mathrm{F}}^2$ and $\mathrm{df}_\gamma$ is the degrees of freedom for a particular value of $\gamma$. We use the number of co-clusters in the CoCo estimator $\hat{\mathcal{U}}_\gamma$ as an estimate of $\mathrm{df}_\gamma$, which is consistent with the spirit of degrees of freedom since each co-cluster mean is an estimated parameter. This criterion balances model fit against model complexity, and similar versions have been commonly employed for tuning parameter selection in tensor data analysis (Zhou et al., 2013; Sun et al., 2017).

The eBIC is calculated on a grid of values $S = \{\gamma_1, \gamma_2, \ldots, \gamma_s\}$, and we select the optimal $\gamma$, denoted $\gamma^*$, as the value that minimizes the eBIC over $S$, namely

$$\gamma^* = \underset{\gamma\in S}{\arg\min}\ \mathrm{eBIC}(\gamma).$$
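In code, the selection rule is a simple loop over the grid. The sketch below assumes a user-supplied solver `fit_coco(X, gamma)` returning the CoCo estimate and a routine `count_coclusters(U_hat)` returning the number of recovered co-clusters; both are placeholders, not functions defined in the paper.

```python
import numpy as np

def select_gamma(X, gammas, fit_coco, count_coclusters):
    """Select gamma by minimizing the eBIC of Section 7.1 over a grid."""
    n = X.size
    best = (None, np.inf, None)                 # (gamma*, eBIC, U_hat)
    for gamma in gammas:
        U_hat = fit_coco(X, gamma)
        rss = np.sum((X - U_hat) ** 2)          # residual sum of squares
        df = count_coclusters(U_hat)            # degrees of freedom estimate
        ebic = n * np.log(rss / n) + 2.0 * df * np.log(n)
        if ebic < best[1]:
            best = (gamma, ebic, U_hat)
    return best[0], best[2]
```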

7.2. Recovering the Partitions along Each Mode

The second major practical consideration is how to extract the partitions from the CoCo estimator $\hat{\mathcal{U}}$. Recall that the ith and jth mode-d subarrays belong to the same partition if $\mathbf{v}_{d,ij} = \hat{\mathcal{U}}\times_d\Delta_{ij} = 0$. Conversely, the ith and jth mode-d subarrays do not belong to the same partition if $\mathbf{v}_{d,ij} \neq 0$. Thus, a mode-d partition consists of a maximal set of mode-d subarrays such that for any pair i and j in this collection $\mathbf{v}_{d,ij} = 0$. We can automatically identify these maximal sets by extending a simple procedure employed by Chi and Lange (2015) for extracting clusters in the convex clustering problem. Identifying partitions along the dth mode is equivalent to finding the connected components of a graph in which each node corresponds to a subarray along the dth mode and there is an edge between nodes i and j if and only if $\mathbf{v}_{d,ij} = 0$.

We would like to read off which centroids have fused as the amount of regularization increases, namely to determine partition assignments as a function of $\gamma$. Such assignments can be computed in $O(n_d)$ operations using the difference variables $\mathbf{V}_d$. We simply apply breadth-first search to identify the connected components of the following graph induced by $\mathbf{V}_d$: the graph identifies a node with every mode-d subarray and places an edge between the lth pair of subarrays if and only if $\mathbf{v}_{d,l} = 0$. Each connected component corresponds to a partition. Note that the graph constructed to determine partitions is not the same as the similarity graph described in Section 3 and illustrated in Figure 2.

We emphasize that the recovered partition along each mode does not depend on the ordering of the input data $\mathcal{X}$, since it is based on the pairwise differences along each mode, namely $\mathbf{V}_d$ for $d = 1, \ldots, D$. Finally, we note that due to finite precision limitations, the difference variables $\mathbf{v}_{d,ij}$ will likely not be exactly 0. In Appendix E.4, we detail a simple and principled procedure for ensuring sparsity in these difference variables.
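A minimal sketch of the partition-recovery step, assuming the (thresholded) zero pattern of the difference variables is already available as a list of fused index pairs, is given below; `mode_partitions` is our own helper name.

```python
import numpy as np
from collections import defaultdict, deque

def mode_partitions(n_d, zero_edges):
    """Recover the mode-d partition from the fused difference variables.

    zero_edges lists the pairs (i, j) whose difference variable v_{d,ij} is
    (numerically) zero; the connected components of the induced graph on the
    n_d mode-d subarrays are the recovered clusters (breadth-first search).
    """
    adj = defaultdict(list)
    for i, j in zero_edges:
        adj[i].append(j)
        adj[j].append(i)
    labels = -np.ones(n_d, dtype=int)
    cluster = 0
    for start in range(n_d):
        if labels[start] >= 0:
            continue
        labels[start] = cluster
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if labels[w] < 0:
                    labels[w] = cluster
                    queue.append(w)
        cluster += 1
    return labels   # labels[i] is the cluster of the i-th mode-d subarray
```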

8. Simulation Studies

To investigate the performance of the CoCo estimator in identifying co-clusters in tensor data, we first explore some simulated examples. We compare our CoCo estimator to a k-means based approach that is representative of various tensor generalizations of the spectral clustering method common in the tensor clustering literature (Kutty et al., 2011; Liu et al., 2013b; Zhang et al., 2013; Wu et al., 2016b). We refer to this method as CPD+k-means. The CPD+k-means method (Papalexakis et al., 2013; Sun and Li, 2019) first performs a rank-R CP decomposition of the D-way tensor $\mathcal{X}$ to reduce the dimensionality of the problem, and then independently applies k-means clustering to the rows of each of the D factor matrices from the resulting CP decomposition. The k-means algorithm has also been used to cluster the factor matrices resulting from a Tucker decomposition (Acar et al., 2006; Sun et al., 2006; Kolda and Sun, 2008; Sun et al., 2009; Kutty et al., 2011; Liu et al., 2013b; Zhang et al., 2013; Cao et al., 2015; Oh et al., 2017). We also considered this Tucker+k-means method in initial experiments, but its co-clustering performance was inferior to that of CPD+k-means, so we only report co-clustering performance results for CPD+k-means in the comparison experiments that follow. Note, however, that we still use the Tucker decomposition to compute the CoCo weights $w_{d,ij}$ as described in Section 6. Both CoCo and CPD+k-means account for the multiway structure of the data. To assess the importance of accounting for this structure, we also include comparisons with the CoTeC method (Jegelka et al., 2009), which applies k-means clustering along each mode and does not account for the multiway structure of the data.

All methods being compared have tuning parameters that need to be set. For the rank of the CP decomposition needed in CPD+k-means, we consider R ∈ {2, 3, 4, 5} and use the tuning procedure in Sun et al. (2017) to automatically select the rank. A CP decomposition is then performed using the chosen rank, and those factor matrices are the input into the k-means algorithm. A well known drawback of k-means is that the number of clusters k needs to be specified a priori. Several methods for selecting k have been proposed in the literature, and we use the “gap statistic” developed by Tibshirani et al. (2001) to select an optimal k* from the specified possible values. Since CoCo estimates an entire solution path of mode-clustering results, ranging from nd clusters to a single cluster along mode d, we consider a rather large set of possible k values to make the methods more comparable. Appendix G gives a more detailed description of the CPD+k-means procedure and the selection of its tuning parameters. CoTeC, which applies k-means clustering along each mode independently, also requires specifying the number of clusters along each mode. As in CPD+k-means, we also select this parameter along each mode using the “gap statistic.”
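For completeness, the following Python sketch illustrates the gap statistic selection rule of Tibshirani et al. (2001) as it would be applied to the rows of a factor matrix. It relies on scikit-learn's KMeans; the candidate grid, the number of reference draws B, and the uniform reference distribution over the bounding box are illustrative choices, not the exact settings used in our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_candidates, B=20, random_state=0):
    """Select k via the gap statistic: Gap(k) = E*[log W_k] - log W_k."""
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_wk(data, k):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
        return np.log(km.inertia_)           # within-cluster sum of squares

    gaps, sks = [], []
    for k in k_candidates:
        # Reference data drawn uniformly over the bounding box of X.
        ref = np.array([log_wk(rng.uniform(lo, hi, size=X.shape), k)
                        for _ in range(B)])
        gaps.append(ref.mean() - log_wk(X, k))
        sks.append(ref.std(ddof=1) * np.sqrt(1.0 + 1.0 / B))

    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; fall back to the largest gap.
    for i in range(len(k_candidates) - 1):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return k_candidates[i]
    return k_candidates[int(np.argmax(gaps))]

# Example usage on the rows of a factor matrix A (n_d x R).
A = np.vstack([np.random.randn(30, 2) + 4, np.random.randn(30, 2)])
print(gap_statistic(A, k_candidates=[1, 2, 3, 4, 5]))
```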

As described in Section 6, we employ a Tucker approximation to the data tensor in constructing weights wd,ij. In computing the Tucker decomposition we used one of two methods for selecting the rank. In the plots within this section, TD1 denotes the results where the Tucker rank was chosen using the SCORE algorithm (Yokota et al., 2017), while TD2 denotes results where the rank was chosen using a heuristic. A detailed discussion of these two methods is given in Appendix F.

The results presented in this section report the average CoCo estimator performance quantified by the ARI across 200 simulated replicates. All simulations were performed in Matlab using the Tensor Toolbox (Bader et al., 2015). All the following plots, except the heatmaps in Figure 13, were made using the open source R package ggplot2 (Wickham, 2009).

Figure 13: Advertisement and Publisher Click-Through Rate Biclusters for a Randomly Selected User. The rows correspond to different advertisements and the columns correspond to different publishers. Darker blue corresponds to higher click-through rates for a given device.

8.1. Cubical Tensors, Checkerbox Pattern

For the first and main simulation setting, we study clustering data in a cubical tensor generated by a basic checkerbox mean model according to Assumption 4.2. Each entry in the observed data tensor is generated according to the underlying model (2) with independent errors $\epsilon_{i_1 i_2 i_3} \sim N(0, \sigma_{r_1 r_2 r_3}^2)$. Unless specified otherwise, there are two true clusters along each mode for a total of eight underlying co-clusters.
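The following Python sketch illustrates one way to generate data from such a checkerbox mean model. The choices of drawing co-cluster means from a normal distribution and of balanced, contiguous cluster labels are illustrative assumptions, not necessarily the exact simulation design used here.

```python
import numpy as np

def checkerbox_tensor(n=(60, 60, 60), k=(2, 2, 2), sigma=3.0, seed=0):
    """Simulate X = U* + noise, where U* is constant on each co-cluster."""
    rng = np.random.default_rng(seed)
    # Balanced mode-d cluster assignments (assumes n[d] divisible by k[d]).
    labels = [np.repeat(np.arange(k[d]), n[d] // k[d]) for d in range(3)]
    # One mean per co-cluster (illustrative scale for the means).
    mu = rng.normal(0.0, 3.0, size=k)
    U = mu[np.ix_(labels[0], labels[1], labels[2])]       # checkerbox mean tensor
    X = U + rng.normal(0.0, sigma, size=n)                # homoskedastic noise
    return X, U, labels

X, U, labels = checkerbox_tensor()
print(X.shape, len(np.unique(labels[0])))
```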

8.1.1. Balanced Cluster Sizes and Homoskedastic Noise

To get an initial feel for how the different co-clustering methods perform at recovering the true underlying checkerbox structure, we first consider a situation where the clusters corresponding to the two classes along each mode are all equally-sized, or balanced, and share the same error variance, namely $\sigma_{r_1 r_2 r_3} = \sigma$ for all r1, r2, and r3. The average co-clustering performance for this setting in a tensor with dimensions n1 = n2 = n3 = 60 is given in Figure 4 for different noise levels. Figure 4 shows that all three methods perform well when the noise level is low (σ = 1). As the noise level increases, however, CPD+k-means experiences an immediate and noticeable drop off in performance. CoTeC's performance decays even more rapidly, highlighting the importance of accounting for multiway structure. The CoCo estimator, on the other hand, is able to maintain near-perfect performance until the noise level becomes rather high (σ = 8).

Figure 4: Checkerbox Simulation Results: Impact of Noise Level. Two balanced clusters per mode across different levels of homoskedastic noise for n1 = n2 = n3 = 60. For each method, the confidence interval is calculated as the mean value plus/minus one standard error.

Figure 5 shows how the run times of CoCo and CPD+k-means vary as the size of a cubical tensor, n = n1n2n3 with n1 = n2 = n3, takes on the values $20^3$, $30^3$, $60^3$, and $100^3$. These run times include all computations needed to fit and select a final model. For CoCo, a sequence of models was fit over a grid of γ parameters, and a final γ parameter was chosen using the eBIC. For CPD+k-means, a sequence of models was fit over a grid of possible (k1, k2, k3) parameters corresponding to the 3 factor matrices, and a final triple of (k1, k2, k3) parameters was chosen using the “gap statistic.” Timing comparisons were performed on a 3.2 GHz quad-core Intel Core i5 processor with 8 GB of RAM. The run time for CoCo scales linearly in the size of the data tensor as expected, namely proportionately with $n_1^3$. Nonetheless, as also might be expected, the clustering performance enjoyed by CoCo does not come for free, and the simpler but less reliable CPD+k-means algorithm enjoys better scaling as the tensor size grows. Timing results were similar for the following experiments and are omitted for space considerations.

Figure 5: Timing Results: Balanced Cluster Size and Homoskedastic Noise. Two balanced clusters per mode with a fixed level of homoskedastic noise for n1 = n2 = n3 = 20, 30, 60, and 100. Vertical and horizontal axes are on a log scale.

8.1.2. Imbalanced Cluster Sizes

When comparing clustering methods, one factor of interest is the extent to which the relative sizes of the clusters impact clustering performance. To investigate this, we again use a cubical tensor of size n1 = n2 = n3 = 60 but introduce different levels of cluster size imbalance along each mode, which we quantify via the ratio of the number of samples in cluster 2 of mode d and the total number of samples along mode d, for d = 1, 2, 3. Figure 6a shows that when the noise level is low, CPD+k-means is unaffected by the imbalance until the size of cluster 2 is less than 30% of the mode’s length. At this point, the performance of CPD+k-means drops off significantly and it performs as well as a random clustering assignment when the sizes are highly skewed (nd2/nd = 0.1). The CoCo estimator is more or less invariant to the imbalance, and its performance is almost perfect across all levels of cluster size imbalance. Figure 6b shows that the CoCo estimator exhibits a slight deterioration in performance only when the cluster size ratio is 0.1 in the high noise case. In both low and high noise scenarios, CoTeC performs poorly.

Figure 6: Checkerbox Simulation Results: Impact of Cluster Size Imbalance. Two imbalanced clusters per mode with either low or high homoskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ = 3 while high noise refers to σ = 6.

8.1.3. Heteroskedastic Noise

Another factor of interest is how the clustering methods perform when there is heteroskedasticity in the variability of the two classes. Figure 7 displays the co-clustering performance for different degrees of heteroskedasticity, as measured by the standard deviation for class 2 relative to class 1's standard deviation, σ2/σ1. In the low noise setting, the CoCo estimator is immune to the heteroskedasticity until the noise levels differ by a factor of 4. CPD+k-means, in contrast, is very sensitive to a deviation from homoskedasticity, experiencing a decline even when the noise ratio increases from 1 to only 1.5. The CoCo estimator fares worse in the high noise setting and also has a drop in performance with a small deviation from homoskedasticity. Once class 2's standard deviation is more than double the standard deviation for class 1, all three methods are essentially the same as random clustering. This result is not terribly surprising since, in the high noise setting, this would result in one class having a very high standard deviation of σ2 = 12. In both low and high noise scenarios, CoTeC performs poorly.

Figure 7: Checkerbox Simulation Results: Impact of Heteroskedasticity. Two balanced clusters per mode with either low or high heteroskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ1 = 3 while high noise refers to σ1 = 6.

8.1.4. Different Clustering Structures

So far, we have considered only a simple situation where there are exactly two true clusters along each mode, for a total of eight triclusters. Another factor of practical importance is how the clustering methods perform when there are more than two clusters per mode, and also when the number of clusters along each mode differs. We investigate both of these settings in this section. As before, the tensor is a perfect cube with n1 = n2 = n3 = 60 observations along each mode and an underlying checkerbox pattern. To gauge the performance, we again focus the attention on how the methods perform in the presence of both low and high noise.

The first situation studied is one in which there are three true clusters along each mode, resulting in a total of 27 triclusters. The left hand side of the graphs in Figure 8 shows the results from this simulation setting. The graphs show that the CoCo estimator consistently outperforms CPD+k-means and CoTeC in this setting across both noise levels. The CoCo estimator is able to recover the true co-clusters almost perfectly, while CPD+k-means struggles to handle the increased number of clusters per mode.

Figure 8: Checkerbox Simulation Results: Impact of Clustering Structure. Different balanced clusters per mode with either low or high homoskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ = 3 while high noise refers to σ = 6.

We also investigated the clustering performance when the number of clusters per mode varies. In this setting, there are two, three, and four clusters along modes one, two, and three, respectively. From the right hand side of the graphs in Figure 8, we can see that the results are similar to the situation with three clusters per mode. CPD+k-means again performs very poorly across both noise levels, while convex co-clustering is again able to essentially recover the true co-clustering structure. Compared to the setting with three clusters per mode, CPD+k-means performs slightly worse in the face of a more complex clustering structure, while convex co-clustering is able to handle it in stride. These results bode well for convex co-clustering as the basic clustering structure of only two clusters per mode is unlikely to be observed in practice.

8.2. Rectangular Tensors

Up to this point, to get an initial feel for CoCo’s performance, we restricted our attention to cubical tensors with the same number of observations per mode so as to avoid changing too many factors at once. It is unlikely that the data tensor at hand will be a perfect cube, however, so it is important to understand the clustering performance when the methods are applied to rectangular tensors.

Now we turn to clustering a rectangular tensor with one short mode and two longer modes. Two additional simulations involving rectangular tensors can be found in Appendix H. Figure 9 shows that CoCo performs very well and better than CPD+k-means and CoTeC at the lower noise level (σ = 3) but has a sharp decrease in ARI at the higher noise level (σ = 4). The decline is more pronounced for the longer modes (Figure 9b and Figure 9c), as the short mode (Figure 9a) is still able to maintain perfect performance despite the increase in noise. This is not surprising, since the shorter mode has effectively more samples. Moreover, we see the “blessing of dimensionality” at work: when the number of samples along the short mode is doubled (n1 = 20, n2 = n3 = 50), the performance along the two longer modes improves drastically in the high noise setting.

Figure 9: Checkerbox Simulation Results: Impact of Tensor Shape. Two balanced clusters per mode with two levels of homoskedastic noise for a tensor with one short mode and two longer modes. Average adjusted rand index plus/minus one standard error for different noise levels and mode lengths.

We finally note that, along the shorter mode, the use of the heuristic in determining the rank of the Tucker decomposition for calculating the weights performs better than the SCORE algorithm method along modes 1 and 2, though ultimately the co-clustering performance is comparable. This may indicate that the SCORE algorithm struggles to correctly identify the optimal Tucker rank for short modes in the presence of relatively higher noise, while the heuristic is more immune to the noise level as it is based simply on the dimensions of the tensor.

8.3. CANDECOMP/PARAFAC Model

In Section 8.1, we saw that the CoCo estimator performs well and typically better than CPD+k-means when clustering tensors whose co-clusters have an underlying checkerbox pattern. To evaluate the performance of our CoCo estimator under model misspecification, we consider a generative model based on the following CP decomposition. We first construct the factor matrix $A \in \mathbb{R}^{80 \times 2}$ and then form the rank-2 CP means tensor

\mathcal{U}^* = \sum_{i=1}^{2} a_i \circ a_i \circ a_i,

where ◦ denotes the outer product and $a_i$ is the ith column of A. We then added varying levels of Gaussian noise to $\mathcal{U}^*$ to generate the observed data tensor. We consider two different types of factor matrices. As shown in Figure 10, one shape consists of two half-moon clusters (Hocking et al., 2011; Chi and Lange, 2015; Tan and Witten, 2015) while the other shape contains a bullseye, similar to the two-circles shape studied by Ng et al. (2002) and Tan and Witten (2015). In either case, the triangles in Figure 10 correspond to the first 40 rows of A, whereas the circles correspond to the second 40 rows of A. Note that this data generating mechanism should favor the CPD+k-means method.
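A minimal Python sketch of this generative mechanism is given below, using scikit-learn's make_moons and make_circles to build the factor matrix and a plain outer-product sum for the rank-2 mean tensor; the noise level and random seed are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons, make_circles

def cp_model_tensor(shape="half_moons", sigma=0.5, seed=0):
    """Build U* = sum_i a_i o a_i o a_i from an 80 x 2 factor matrix A, then add noise."""
    rng = np.random.default_rng(seed)
    if shape == "half_moons":
        A, y = make_moons(n_samples=80, noise=0.05, random_state=seed)
    else:                                   # "bullseye": two concentric circles
        A, y = make_circles(n_samples=80, noise=0.05, factor=0.3, random_state=seed)
    U = np.zeros((80, 80, 80))
    for i in range(A.shape[1]):             # rank 2: one rank-one term per column of A
        a = A[:, i]
        U += np.einsum("i,j,k->ijk", a, a, a)
    X = U + rng.normal(0.0, sigma, size=U.shape)
    return X, U, y                          # y gives the two true clusters along each mode

X, U, y = cp_model_tensor("bullseye")
print(X.shape, np.bincount(y))
```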

Figure 10: Factor Matrices for the CP Models.

Figure 11 shows the simulation results for using the CP model with these two non-convex shapes generating the data. The discrepancy in performance between the CoCo estimator and the other two methods is quite large. The CoCo estimator almost perfectly identifies the true co-clusters. In contrast, both CPD+k-means and CoTeC perform very poorly, even when the noise variance is small. The poor performance of CPD+k-means and CoTeC is not completely surprising, as others have noted the difficulty that k-means methods have in recovering non-convex clusters (Ng et al., 2002; Hocking et al., 2011; Tan and Witten, 2015). These results give us some assurance that the CoCo estimator is still able to perform well even under some model misspecification, since the true co-clusters do not have a checkerbox pattern.

Figure 11: CP Model Simulation Results. Two balanced clusters per mode with low homoskedastic noise for n1 = n2 = n3 = 40. “Bullseye” and “Half Moons” refer to the shape embedded in the factor matrices used to generate the true tensor.

8.4. Comparison with Convex Biclustering

It is natural to ask how much additional gain there is in using CoCo over convex biclustering (Chi et al., 2017) on the matricizations of a data tensor. To answer this question, we compare CoCo to the following strategy for applying convex biclustering to estimate co-clusters. We explain the strategy for a 3-way tensor; the generalization to D-way tensors is straightforward. We first matricize the tensor X along mode-1 to obtain the matrix X(1), apply convex biclustering on X(1), and retain the mode-1 clustering results. Note that the mode-2 and mode-3 fibers have been mixed together through the matricization process. We then repeat the two-step procedure for mode-2 and mode-3. The final co-cluster estimates are obtained by taking the cross-products of the mode-1, mode-2, and mode-3 cluster assignments.
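The following Python sketch illustrates the bookkeeping in this strategy, namely the mode-d matricization that feeds the biclustering step and the crossing of per-mode labels into co-cluster identifiers. The biclustering step itself is not shown (the experiments use convex biclustering of Chi et al., 2017), and the labels in the example are made up; the particular column ordering used by this matricization is immaterial here since only the row (mode-d) assignments are retained.

```python
import numpy as np

def unfold(X, mode):
    """Mode-d matricization: X_(d) has n_d rows; the other modes are mixed into the columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cocluster_ids(labels_per_mode, shape):
    """Cross the per-mode assignments into a co-cluster id for every tensor entry."""
    grids = np.meshgrid(*labels_per_mode, indexing="ij")
    k = [l.max() + 1 for l in labels_per_mode]
    ids = np.zeros(shape, dtype=int)
    for d, g in enumerate(grids):
        ids = ids * k[d] + g
    return ids

# labels_per_mode would come from biclustering each unfold(X, d) and keeping
# only the row (mode-d) assignments; here we use made-up labels.
X = np.random.randn(4, 6, 8)
labels = [np.array([0, 0, 1, 1]), np.repeat([0, 1, 2], 2), np.repeat([0, 1], 4)]
print(unfold(X, 1).shape)                                # (6, 32)
print(np.unique(cocluster_ids(labels, X.shape)).size)    # 2 * 3 * 2 = 12 co-clusters
```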

We consider two illustrative scenarios to understand the value of preserving the full multiway structure with CoCo: a balanced case and an imbalanced case. In the balanced case, we have a 3-way data tensor $\mathcal{X} \in \mathbb{R}^{60 \times 60 \times 60}$ with two clusters along each mode, where clusters are of equal size and homoskedastic iid Gaussian noise has been added to all elements of the tensor. This scenario is similar to the one shown in Figure 4. In the imbalanced case, we have a 3-way data tensor $\mathcal{X} \in \mathbb{R}^{30 \times 40 \times 80}$. There are two clusters along mode-1 of sizes 10 and 20, three clusters along mode-2 of sizes 8, 12, and 20, and four clusters along mode-3 of sizes 5, 10, 20, and 45. Homoskedastic iid Gaussian noise has been added to all elements of the tensor. Finally, we note that the empirical performance of convex biclustering, like that of CoCo's, depends on choosing good weights for the rows and columns of the input data matrix (Chi et al., 2017). To create a fair comparison, we construct convex biclustering weights based on the same TD1 and TD2 denoising procedures used for CoCo, putting the preprocessing for both methods on equal footing.

Figure 12a and Figure 12b show the co-clustering performance of CoCo and the convex biclustering method in the balanced and imbalanced cases, respectively. We see that in the balanced case, CoCo's performance is marginally better than that of the convex biclustering method. On the other hand, we see that in the imbalanced case, CoCo's performance degrades more gracefully than that of the convex biclustering method as the noise level increases. The example illustrates that CoCo has better co-cluster recovery when there is more imbalance in the data tensor: the aspect ratios of the tensor dimensions are more skewed, and the number of clusters and the cluster sizes are more heterogeneous.

Figure 12: A Comparison between CoCo and Convex Biclustering. Average Adjusted Rand Index plus/minus one standard error for different noise levels.

The key formulation difference between CoCo and the convex biclustering method that provides some insight into these two results is that CoCo imposes a finer level of smoothness that respects the multiway structure in the data tensor. Imposing such a finer level of smoothness imparts greater robustness, in the presence of increasing noise, to recovering the smaller co-clusters in the imbalanced scenario. An added incentive for using CoCo and preserving the multiway structure in the data is that the gains in co-cluster recovery over the convex biclustering method do not come at a greater computational cost. Note that the computational complexity of convex biclustering is O(n), using sparse weights for the row and column similarity graphs. For a D-way tensor, the computational complexity then becomes O(Dn), which is the same as the computational complexity of CoCo applied directly on the D-way tensor.

To summarize, in comparison to the convex biclustering method, CoCo (i) does not come at additional computational costs, (ii) can recover underlying co-clustering structure in imbalanced scenarios which are more likely to be encountered in practice, and (iii) has the ability to consistently recover an underlying co-clustering structure according to Theorem 9, with even a single tensor sample, which is a typical case in real applications. Since this phenomenon does not exist in vector or matrix variate cluster analysis, the convex biclustering method lacks this theoretical guarantee.

9. Real Data Application

Having studied the performance of the CoCo estimator in a variety of simulated settings, we now turn to using the CoCo estimator on a real data set. The proprietary data set comes from a major online company and contains the click-through rates for advertisements displayed on the company's webpages from May 19, 2016 through June 15, 2016. The click-through rate is the number of times a user clicks on a specific advertisement divided by the number of times the advertisement was displayed. The data set contains information on 1000 users, 189 advertisements, 19 publishers, and 2 different devices, aggregated across time. Thus, the data forms a fourth-order tensor where each entry in the tensor corresponds to the click-through rate for the given combination of user, advertisement, publisher, and device. Here a publisher refers to a different webpage within the online company's website, such as the main home page versus a page devoted to either breaking news or sports scores. The two device types correspond to how the user accessed the page, using either a personal computer or a mobile device such as a cell phone or tablet computer. The goal in this real application is to simultaneously cluster users, advertisements, and publishers to improve user behavior targeting and advertising planning.

In the click-through rate tensor data, over 99% of the values are missing since one user has likely seen only a handful of the possible advertisements. If a specific advertisement is never seen by a user, it is considered a missing value. Since the proposed CoCo estimator can only handle complete data, we first preprocess the data by imputing the missing values before any clustering can be done. To impute the missing entries, we use the CP-based tensor completion method of Jain and Oh (2014) and tune its rank via the information criterion proposed by Sun et al. (2017). This tuning method chooses the optimal rank as R = 20 from the rank list {1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 20, 22}. Finally, the imputed values are truncated to ensure all the values of the tensor are between 0 and 1 since click-through rates are proportions.

One mode of the fourth-order tensor has only two observations and those observations already have a natural grouping (device type). Therefore, for the sake of clustering we analyze the devices separately. We compare our method with CPD+k-means. Furthermore, the tuning parameter for convex co-clustering is automatically selected using the eBIC (Section 7.1) while the number of clusters in CPD+k-means is chosen via the gap statistic (Tibshirani et al., 2001). We do not include comparisons with CoTeC given its poor performance in the simulation experiments.

We first look at the clustering results from clustering the click-through rates for users accessing the advertisements through a personal computer (PC). Table 1 contains the number of clusters identified as well as the sizes of the clusters, while Figure 13a visualizes the advertisement-by-publisher biclusters for a randomly selected user. As to be expected, the advertisement-by-publisher slices display a checkerboard pattern, which turns into a checkerbox pattern when the slices are meshed together. The clustering results for the users are omitted in this paper to ensure user privacy. However, co-clustering the tensor does not result in the loss of information that would occur if the tensor was converted into a matrix by averaging across users or flattening along one of the modes. Table 1 and Figure 13a show that the CoCo estimator identifies four advertisement clusters, with one cluster being much bigger than the others. The advertisements in this large cluster have click-through rates that are close to the grand average in the data set. One of the small clusters has very low click-through rates, while the other two clusters tend to have much higher click-through rates than the rest of the advertisements. On the other hand, CPD+k-means clusters the advertisements into 57 groups, which is less useful from a practical standpoint. Many of the clusters are similarly sized and contain only a few advertisements, likely due to the inability of CPD+k-means to handle imbalanced cluster sizes, as was observed in the simulation experiments (Section 8.1.2). In terms of the publishers, the CoCo estimator identifies 3 clusters while CPD+k-means does not find any underlying grouping and simply identifies one big cluster, which again is not terribly useful (Table 1). We next provide some interpretations of the obtained clustering results of the publishers. One way online advertisers can reach more users is by entering agreements with other companies to route traffic to the advertiser's website. For example, Google and Apple have a revenue-sharing agreement in which Google pays Apple a percentage of the revenue generated by searches on iPhones (McGarry, 2016). Similarly, the online company being studied partners with several internet service providers (ISPs) to host the default home pages for the ISP's customers. It would make sense that these slightly different variants of the online company's main home page would have similar click-through rates, and the CoCo estimator in fact assigned these variants to the same cluster.

Table 1: Advertising Data Clustering Results

                     CoCo Estimator                                             CPD+k-means
         Advertisements                   Publisher                    Advertisements   Publisher
Device   # of clusters   Cluster sizes    # of clusters  Cluster sizes # of clusters    # of clusters
PC       4               (156, 22, 8, 3)  3              (4, 3, 12)    57               1
Mobile   3               (145, 22, 22)    2              (7, 12)       49               13

For users accessing the advertisements through a mobile device, such as a mobile phone or tablet computer, the CoCo estimator results for the advertisements are largely similar to the results for PCs (Table 1 and Figure 13b). There is one large cluster that contains click-through rates similar to the overall average, while the two other equally-sized clusters have relatively very low or very high click-through rates, respectively. The underlying click-through rates for the PC data have more variability than the mobile data, which is consistent with the identification of an additional cluster for the PC data. As before, CPD+k-means finds a large number of advertisement clusters, most of which are roughly the same size, again likely impacted by the imbalance in the cluster sizes. When compared to the personal computer device, one difference is that the cluster with the higher click-through rates for mobile devices is larger and has a higher average click-through rate than the similar clusters for the personal computer device. This finding is consistent with research by the Pew Research Center that found that click-through rates for mobile devices are higher than for advertisements viewed on a personal computer or laptop (Mitchell et al., 2012).

It is also enlightening to take a closer look at the underlying advertisements clustered across the two devices. All of the advertisements clustered in the high click-through rate cluster for the mobile devices are in the average click-through rate cluster for personal computers. In taking a closer look at the ads in these clusters, there are several ads related to online shopping for personal goods, such as jeans, workout clothes, or neck ties. It makes sense to shop for these types of goods using a mobile device, such as while at work when it is not appropriate to do so on a work computer. Conversely, all of the advertisements in either of the two higher PC click-through rate clusters are in the large, average click-through rate cluster for the mobile devices. There are several financial-related ads in these two PC clusters, such as for mortgages or general investment advice. On the other hand, there are not many online shopping ads in those clusters, with the exception of more expensive technology-related goods that one may want to invest more time in researching before making a purchase.

In terms of the publisher clusters on mobile devices, Table 1 shows that the CoCo estimator identifies two clusters of publishers while CPD+k-means identifies 13 small clusters. Contrary to the advertisement clusters, the publisher clusters across both devices are very similar. In fact, the only difference is that the smaller cluster for the mobile device, which contains seven publishers, is split into two clusters for personal computers. This can be seen in the click-through rate heatmaps given in Figure 13 by looking at the right part of each heatmap. The publishers in these smaller clusters have higher click-through rates on average than those in the larger cluster. Additionally, five of the seven (71%) publishers in the high click-through rate clusters have stand-alone apps that display ads, while only three of the twelve (25%) publishers in the larger cluster do. For mobile devices, it has been observed that in-app advertisements have higher click-through rates than browser-based ads (Hof, 2014). We conjecture that this is also true for personal computer apps, which is consistent with the clustering results. Thus it again appears that the clusters identified by CoCo also make sense practically.

10. Discussion

In this paper, we formulated and studied the problem of co-clustering of tensors as a convex optimization problem. The resulting CoCo estimator enjoys features in theory and practice that are arguably lacking in existing alternatives, namely statistical consistency, stability guarantees, and an algorithm with polynomial computational complexity. Through a battery of simulations, we observed that the CoCo estimator can identify co-clustering structures under realistic scenarios such as imbalanced co-cluster sizes, imbalanced number of clusters along each mode, heteroskedasticity in the noise distribution associated with each co-cluster, and even some violation of the checkerbox mean tensor assumption.

We have leveraged the power of the convex relaxation to engineer a computationally tractable co-clustering method that comes with statistical guarantees. These benefits, however, do not come for free. The CoCo estimator incurs costs similar to those incurred when the lasso is used as a surrogate for a cardinality constraint or penalty. It is well known that the lasso leads to parameter estimates that are shrunk towards zero. This shrinkage toward zero is the price for simultaneously estimating the support, or locations of the nonzero entries, in a sparse vector as well as the values of the nonzero entries. In the context of convex co-clustering, the CoCo estimator $\hat{\mathcal{U}}$ is shrunk towards the tensor $\bar{\mathcal{X}}$, namely the tensor whose entries are all equal to the average over all entries of $\mathcal{X}$. The weights, however, play a critical role in reducing this bias. In fact, the weights can be seen as serving the same role as the weights used in the adaptive lasso (Zou, 2006).

There are several possible extensions and open problems that have been left for future work. First, we note that there is a gap between what our theory predicts and what seems possible from our experiments. Specifically, Theorem 9 assumes uniform weights for each mode, yet simulation experiments indicate that the CoCo estimator using Tucker derived Gaussian kernel weights (9) can significantly outperform the CoCo estimator using uniform weights. One open problem is to derive prediction error bounds that relax the uniform weights assumption.

Second, although we have developed automatic methods for constructing the weights that work well empirically, other approaches to constructing the weights are a direction of future research. For example, other tensor approximation methods, such as the use of the ℓ1-norm to make the decomposition more robust to heavy tail noise as done by Cao et al. (2015), could possibly improve the quality of the weights.

Third, in this paper we have focused on additive noise that is a zero-mean M-concentrated random variable. Real data, however, may not follow such a distribution motivating co-clustering procedures that can handle outliers. To address potential robustness issues, the CoCo framework could be extended to handle outliers by swapping the sum of squared residuals term in (5) with an analogous Huber loss or Tukey’s Biweight function.

Finally, while our first order algorithm for co-clustering tensors scales linearly in the size of the data, data tensors inevitably will only increase in size, motivating the need for more scalable algorithms for computing the CoCo estimator. A natural approach would be to adopt an existing distributed version of the proximal methods, such as one of the methods proposed by Combettes and Pesquet (2011), Chen and Ozdaglar (2012), Li et al. (2013), or Eckstein (2017). Another natural approach would be to investigate if stochastic versions of the recently proposed generalized dual gradient ascent (Ho et al., 2019) could be adapted to compute the CoCo estimator. Additionally, in practice many data tensors that we would like to co-cluster may be very sparse. The first order algorithm presented here assumes the data tensor is dense. Consequently, an important direction of future work is to investigate alternative optimization algorithms that could leverage the sparsity structure within a data tensor.

Acknowledgments

The authors thank Xu Han for his help with the simulation experiments during the revision of this work. The authors also thank the action editor and three reviewers for their helpful comments and suggestions which led to a much improved presentation. Eric Chi acknowledges support from the National Science Foundation (DMS-1752692) and National Institutes of Health (R01GM135928). Will Wei Sun acknowledges support from the Office of Naval Research (ONR N00014–18-1–2759). Hua Zhou acknowledges support from the National Institutes of Health (R01GM053275 and R01HG006139). Finally, this research collaboration was partially funded by the National Science Foundation under grant DMS-1127914 (the Statistical and Applied Mathematical Sciences Institute). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, the National Institutes of Health, or the Office of Naval Research.

Appendix A. Tensor Decompositions

We review two basic tensor decompositions that generalize the singular value decomposition (SVD) of a matrix: (i) the CANDECOMP/PARAFAC (CP) decomposition (Carroll and Chang, 1970; Harshman, 1970) and (ii) the Tucker decomposition (Tucker, 1966). Just as the SVD can be used to construct a lower-dimensional approximation to a data matrix, these two decompositions can be used to construct a lower dimensional approximation to a D-way tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_D}$. The CP decomposition aims to approximate $\mathcal{X}$ by a sum of rank-one tensors, namely

\mathcal{X} \approx \sum_{i=1}^{R} a_i^{(1)} \circ a_i^{(2)} \circ \cdots \circ a_i^{(D)},

where ◦ represents the outer product and $a_i^{(d)}$ is the ith column of the dth factor matrix $A^{(d)} \in \mathbb{R}^{n_d \times R}$. The positive integer R denotes the rank of the approximation. For sufficiently large R, we can exactly represent $\mathcal{X}$ with a CP decomposition.
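In practice a CP approximation can be computed with standard software. The following Python sketch uses the open-source tensorly package (the experiments in this paper instead use the Matlab Tensor Toolbox); the rank and tensor sizes are arbitrary.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Build a small synthetic 3-way tensor with an exact rank-3 CP structure plus noise.
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(20, 3)), rng.normal(size=(30, 3)), rng.normal(size=(25, 3))
X = sum(np.einsum("i,j,k->ijk", A[:, r], B[:, r], C[:, r]) for r in range(3))
X += 0.01 * rng.normal(size=X.shape)

# Rank-3 CP decomposition: X is approximated by a sum of 3 rank-one tensors.
cp = parafac(tl.tensor(X), rank=3)
X_hat = tl.cp_to_tensor(cp)
print([f.shape for f in cp.factors])                    # [(20, 3), (30, 3), (25, 3)]
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))    # small relative error
```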

The Tucker decomposition aims to approximate $\mathcal{X}$ by a core tensor $\mathcal{H} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_D}$ multiplied by factor matrices along each of its modes, namely

\mathcal{X} \approx \mathcal{H} \times_1 A^{(1)} \times_2 A^{(2)} \times_3 \cdots \times_D A^{(D)} = \sum_{i_1=1}^{R_1}\sum_{i_2=1}^{R_2}\cdots\sum_{i_D=1}^{R_D} h_{i_1 i_2 \cdots i_D}\, a_{i_1}^{(1)} \circ a_{i_2}^{(2)} \circ \cdots \circ a_{i_D}^{(D)},

where $a_{i_d}^{(d)}$ is the $i_d$th column of the dth factor matrix $A^{(d)} \in \mathbb{R}^{n_d \times R_d}$. Typically the columns of $A^{(d)}$ are computed to be orthonormal and can be interpreted as principal components or basis vectors for the dth mode. For sufficiently large $R_1, \ldots, R_D$, we can exactly represent $\mathcal{X}$ with a Tucker decomposition.

Appendix B. Proofs of Smoothness Properties

B.1. Proof of Proposition 4

Without loss of generality, we can absorb γ into the weight matrices. Thus, we seek to show the continuity of $\hat{\mathcal{U}}$ with respect to $(\mathcal{X}, W_1, \ldots, W_D)$. We use the following compact representation of the weights

w = \big(\operatorname{vec}(W_1)^T, \operatorname{vec}(W_2)^T, \ldots, \operatorname{vec}(W_D)^T\big)^T \in \mathbb{R}^{\sum_{d=1}^{D}\binom{n_d}{2}}.

We check to see if the solution $\hat{\mathcal{U}}$ is continuous in the variable $\zeta = (x^T, w^T)^T$. It is easy to verify that the following function is jointly continuous in $\mathcal{U}$ and $\zeta$

f(\mathcal{U}, \zeta) = \frac{1}{2}\|\mathcal{X} - \mathcal{U}\|_F^2 + R(\mathcal{U}, w),

where

R(\mathcal{U}, w) = \sum_{d=1}^{D}\sum_{i<j} w_{d,ij}\, \|\mathcal{U} \times_d \Delta_{d,ij}\|_F

is a convex function of U that is continuous in (U, w). Let

\mathcal{U}(\zeta) = \operatorname*{arg\,min}_{\mathcal{U}} f(\mathcal{U}, \zeta).

Since f(U,ζ) is strongly convex in U, the minimizer U(ζ) exists and is unique.

We proceed with a proof by contradiction. Suppose $\mathcal{U}(\zeta)$ is not continuous at a point $\zeta$. Then there exists an $\epsilon > 0$ and a sequence $\{\zeta^{(m)}\}$ converging to a limit $\zeta$ such that $\|\mathcal{U}^{(m)} - \mathcal{U}(\zeta)\|_F \geq \epsilon$ for all m, where

\mathcal{U}^{(m)} = \operatorname*{arg\,min}_{\mathcal{U}} f(\mathcal{U}, \zeta^{(m)}).

Since $f(\mathcal{U}, \zeta)$ is strongly convex in $\mathcal{U}$, the minimizer $\mathcal{U}^{(m)}$ exists and is unique. Without loss of generality, we can assume $\|\zeta^{(m)} - \zeta\|_F \leq 1$. This fact will be used later in proving the boundedness of the sequence $\mathcal{U}^{(m)}$.

If $\mathcal{U}^{(m)}$ is a bounded sequence, then we can pass to a convergent subsequence with limit $\bar{\mathcal{U}}$. Fix an arbitrary point $\tilde{\mathcal{U}}$. Note that $f(\mathcal{U}^{(m)}, \zeta^{(m)}) \leq f(\tilde{\mathcal{U}}, \zeta^{(m)})$ for all m. Since f is continuous in $(\mathcal{U}, \zeta)$, taking limits gives us the inequality

f(\bar{\mathcal{U}}, \zeta) \leq f(\tilde{\mathcal{U}}, \zeta).

Since $\tilde{\mathcal{U}}$ was selected arbitrarily, it follows that $\bar{\mathcal{U}} = \mathcal{U}(\zeta)$, which is a contradiction. It only remains for us to show that the sequence $\mathcal{U}^{(m)}$ is bounded.

Consider the function

g(\mathcal{U}) = \sup_{\tilde{\zeta} : \|\tilde{\zeta} - \zeta\|_F \leq 1} \frac{1}{2}\|\tilde{\mathcal{X}} - \mathcal{U}\|_F^2 + R(\mathcal{U}, \tilde{w}).

Note that g is convex, since it is the point-wise supremum of a collection of convex functions. Since $f(\mathcal{U}, \zeta^{(m)}) \leq g(\mathcal{U})$ and f is strongly convex in $\mathcal{U}$, it follows that $g(\mathcal{U})$ is also strongly convex and therefore has a unique global minimizer $\mathcal{U}^*$ with finite optimal value $g(\mathcal{U}^*)$. It also follows that

f(\mathcal{U}^{(m)}, \zeta^{(m)}) \leq f(\mathcal{U}^*, \zeta^{(m)}) \leq g(\mathcal{U}^*) \quad (14)

for all m. By the reverse triangle inequality it follows that

\frac{1}{2}\big(\|\mathcal{U}^{(m)}\|_F - \|\mathcal{X}^{(m)}\|_F\big)^2 \leq \frac{1}{2}\|\mathcal{U}^{(m)} - \mathcal{X}^{(m)}\|_F^2 \leq f(\mathcal{U}^{(m)}, \zeta^{(m)}). \quad (15)

Combining the inequalities in (14) and (15), we arrive at the conclusion that

\frac{1}{2}\big(\|\mathcal{U}^{(m)}\|_F - \|\mathcal{X}^{(m)}\|_F\big)^2 \leq g(\mathcal{U}^*),

for all m. Suppose the sequence $\mathcal{U}^{(m)}$ is unbounded, namely $\|\mathcal{U}^{(m)}\|_F \to \infty$. But since $\mathcal{X}^{(m)}$ converges to $\mathcal{X}$, the left hand side must diverge. Thus, we arrive at a contradiction if $\mathcal{U}^{(m)}$ is unbounded.

B.2. Proof of Proposition 5

First suppose that $U_{(d)} = \mathbf{1}c^T$, namely all the mode-d subarrays of $\mathcal{U}$ are identical. Recall that $\mathcal{Z} = \mathcal{U} \times_d A$ if and only if $Z_{(d)} = AU_{(d)}$. Therefore, $R_d(\mathcal{U}) = 0$ since $\Delta_{d,ij}\mathbf{1}c^T = 0$ for all $(i,j) \in E_d$.

Now suppose that $R_d(\mathcal{U})$ is zero. Take an arbitrary pair (i, j) with i < j. By Assumption 4.1, there exists a path $i \rightarrow k \rightarrow \cdots \rightarrow l \rightarrow j$ along which the weights are positive. Let w denote the smallest weight along this path, namely $w = \min\{w_{d,ik}, \ldots, w_{d,lj}\}$. By the triangle inequality

\|\mathcal{U} \times_d \Delta_{d,ij}\|_F \leq \|\mathcal{U} \times_d \Delta_{d,ik}\|_F + \cdots + \|\mathcal{U} \times_d \Delta_{d,lj}\|_F.

We can then conclude that

w\|\mathcal{U} \times_d \Delta_{d,ij}\|_F \leq R_d(\mathcal{U}) = 0.

It follows that $e_i^T U_{(d)} = e_j^T U_{(d)}$, since w is positive. Since the pair (i, j) is arbitrary, it follows that all the rows of $U_{(d)}$ are identical or, in other words, $U_{(d)} = \mathbf{1}c^T$ for some $c \in \mathbb{R}^{n_{-d}}$. ■

B.3. Proof of Proposition 6

We will show that there is a $\gamma_{\max}$ such that for all $\gamma \geq \gamma_{\max}$, the grand mean tensor $\bar{\mathcal{X}}$ is the unique global minimizer to the primal objective (4). We will certify that $\bar{\mathcal{X}}$ is the solution to the primal problem by showing that the optimal value of a dual problem, which lower bounds the primal, equals $F_\gamma(\bar{\mathcal{X}})$.

Note that the Lagrangian dual given in (28) is a tight lower bound on $F_\gamma(\mathcal{U})$.

\max_{\lambda \in C_\gamma} \; -\frac{1}{2}\|A^T\lambda\|_2^2 + \langle\lambda, Ax\rangle.

For sufficiently large γ, the solution to the dual maximization problem coincides with the solution to the unconstrained maximization problem

\max_{\lambda} \; -\frac{1}{2}\|A^T\lambda\|_2^2 + \langle\lambda, Ax\rangle,

whose solution is $\lambda^* = (AA^T)^\dagger Ax$. Plugging $\lambda^*$ into the dual objective gives an optimal value of

\frac{1}{2}\|A^T(AA^T)^\dagger Ax\|_2^2 = \frac{1}{2}\big\|x - [I - A^T(AA^T)^\dagger A]x\big\|_2^2.

Note that $[I - A^T(AA^T)^\dagger A]$ is the projection onto the orthogonal complement of the column space of $A^T$, which is equivalent to the null space or kernel of A, denoted Ker(A). We will show below that Ker(A) is the span of the all ones vector. Consequently,

[I - A^T(AA^T)^\dagger A]x = \frac{1}{n}\langle x, \mathbf{1}\rangle \mathbf{1}.

Note that the smallest γ such that λ* ∈ Cγ is an upper bound on γmax.

We now argue that Ker(A) is the span of $\mathbf{1}_n$. We rely on the following fact: If $\Phi_d$ is an incidence matrix of a connected graph with $n_d$ vertices, then the rank of $\Phi_d$ is $n_d - 1$ (Deo, 1974, Theorem 7.2). According to Assumption 4.1, the mode-d graphs are connected; it follows that $\Phi_d \in \{-1, 0, 1\}^{|E_d| \times n_d}$ has rank $n_d - 1$. It follows then that $\operatorname{Ker}(\Phi_d)$ has dimension one. Furthermore, since each row of $\Phi_d$ has one 1 and one −1, it follows that $\mathbf{1} \in \operatorname{Ker}(\Phi_d) \subset \mathbb{R}^{n_d}$. A vector $z \in \operatorname{Ker}(A)$ if and only if $z \in \operatorname{Ker}(A_d)$ for all d.

Recall that the rank of the Kronecker product $A \otimes B$ is the product of the ranks of the matrices A and B. This rank property of Kronecker products of matrices implies that the dimension of $\operatorname{Ker}(A_{-d}) = \cap_{j \neq d}\operatorname{Ker}(A_j)$ equals $n_d$. Let $b_i = \mathbf{1}_{n_D} \otimes \cdots \otimes \mathbf{1}_{n_{d+1}} \otimes e_i \otimes \mathbf{1}_{n_{d-1}} \otimes \cdots \otimes \mathbf{1}_{n_1}$, where $\mathbf{1}_p \in \mathbb{R}^p$ is the vector of all ones and $e_i \in \mathbb{R}^{n_d}$ is the ith standard basis vector. Then the set of vectors $B = \{b_1, b_2, \ldots, b_{n_d}\}$ forms a basis for $\operatorname{Ker}(A_{-d})$.

Take an arbitrary element from $\operatorname{Ker}(A_{-d})$, namely a vector of the form $\mathbf{1}_{n'} \otimes a \otimes \mathbf{1}_{n''}$, where $n' = \prod_{j=d+1}^{D} n_j$ and $n'' = \prod_{j=1}^{d-1} n_j$. We will show that in order for $\mathbf{1}_{n'} \otimes a \otimes \mathbf{1}_{n''} \in \operatorname{Ker}(A_d)$, a must be a multiple of $\mathbf{1}_{n_d}$. Consider the relevant matrix-vector product

A_d(\mathbf{1}_{n_D} \otimes \cdots \otimes a \otimes \cdots \otimes \mathbf{1}_{n_1}) = \big(\mathbf{1}_{n_D} \otimes \cdots \otimes \mathbf{1}_{n_{d+1}} \otimes \Phi_d a \otimes \mathbf{1}_{n_{d-1}} \otimes \cdots \otimes \mathbf{1}_{n_1}\big).

Therefore, $A_d(\mathbf{1}_{n'} \otimes a \otimes \mathbf{1}_{n''}) = 0$ if and only if $\Phi_d a = 0$. But the only way for $\Phi_d a$ to be zero is for $a = c\mathbf{1}_{n_d}$ for some scalar c. Thus, Ker(A) is the span of $\mathbf{1}_n$.

B.4. Proof of Proposition 7

Note that U^ is the proximal mapping of the closed, convex function

\sum_{d=1}^{D} R_d(\mathcal{U}).

Then U^ is firmly nonexpansive in X (Combettes and Wajs, 2005, Lemma 2.4). Finally, firmly nonexpansive mappings are nonexpansive, which completes the proof. ■

Appendix C. Proof of Theorem 9

We first prove some auxiliary lemmas before proving our prediction error result.

C.1. Auxiliary Lemmas

The following lemma considers the concentration of a random quadratic form yTBy for a M-concentrated random vector y and a deterministic matrix B (Vu and Wang, 2015). It can be viewed as a generalization of the standard Hanson and Wright inequality for the quadratic forms of independent sub-Gaussian random variables (Hanson and Wright, 1971).

Lemma 11 Let $y \in \mathbb{R}^n$ be a M-concentrated random vector, see Definition 8. Then there are constants $C, C' > 0$ such that for any matrix $B \in \mathbb{R}^{n \times n}$

\mathbb{P}\Big(|y^T B y - \operatorname{tr}(B)| \geq t\Big) \leq C\log(n)\exp\Big\{-C'M^{-2}\min\Big[\frac{t^2}{\|B\|_F^2\log(n)}, \frac{t}{\|B\|_2}\Big]\Big\}.

The next lemma studies the properties of the matrix Ad,ij, defined in (6), in the penalty function. Denote Sd as the matrix constructed by concatenating Ad,ij, i < j vertically. That is,

S_d = \big(A_{d,12}^T, A_{d,13}^T, \ldots, A_{d,n_d-1\,n_d}^T\big)^T \in \mathbb{R}^{\big[\binom{n_d}{2} n_{-d}\big] \times n}. \quad (16)

Lemma 12 For each d = 1, … ,D, the rank of the matrix $S_d$ is $(n_d - 1)n_{-d}$. Denote $\sigma_{\min}(S_d)$ and $\sigma_{\max}(S_d)$ as the minimum non-zero singular value and maximum singular value of $S_d$, respectively. We have $\sigma_{\min}(S_d) = \sigma_{\max}(S_d) = \sqrt{n_d}$.

The proof of Lemma 12 follows from Lemma 1 in Tan and Witten (2015) and is omitted. According to Lemma 12, we can construct a singular value decomposition of $S_d = U_d \Lambda_d V_d^T$, where $U_d \in \mathbb{R}^{[\binom{n_d}{2} n_{-d}] \times (n_d-1)n_{-d}}$, $\Lambda_d \in \mathbb{R}^{(n_d-1)n_{-d} \times (n_d-1)n_{-d}}$, and $V_d \in \mathbb{R}^{n \times (n_d-1)n_{-d}}$. Denote

G_d = U_d \Lambda_d \in \mathbb{R}^{\big[\binom{n_d}{2} n_{-d}\big] \times (n_d-1)n_{-d}}, \quad (17)

and its pseudo-inverse as $G_d^\dagger \in \mathbb{R}^{(n_d-1)n_{-d} \times [\binom{n_d}{2} n_{-d}]}$. The following lemma studies the properties of $G_d$ and $G_d^\dagger$, for each d = 1, … ,D.

Lemma 13 For each d = 1, … ,D, the rank of the matrix $G_d$ is $(n_d - 1)n_{-d}$. The minimal non-zero singular value and maximal singular value of $G_d$ are $\sigma_{\min}(G_d) = \sigma_{\max}(G_d) = \sqrt{n_d}$. Moreover, $\sigma_{\min}(G_d^\dagger) = \sigma_{\max}(G_d^\dagger) = 1/\sqrt{n_d}$.

Lemma 13 follows directly from the conclusions in Lemma 12.

C.2. Proof of Main Theorem

We first reformulate our optimization problem via a decomposition approach to simplify the theoretical analysis. Such a strategy was developed in Liu et al. (2013a) and has been successfully applied in Tan and Witten (2015) and Wang et al. (2018).

Denote γd = γ/nd. Our convex tensor co-clustering method is equivalent to solving

\hat{u} = \operatorname*{arg\,min}_{u}\Big\{\frac{1}{2}\|x - u\|_2^2 + \sum_{d=1}^{D}\gamma_d \sum_{(i,j) \in E_d}\|A_{d,ij} u\|_2\Big\}. \quad (18)

According to the definition of Sd in (16), we define the penalty function R(·) such that

R(S_d u) = \sum_{(i,j) \in E_d}\|A_{d,ij} u\|_2.

According to the singular value decomposition $S_d = U_d\Lambda_d V_d^T$, there exists a matrix $W_d \in \mathbb{R}^{n \times n_{-d}}$ such that $\tilde{V}_d = [W_d, V_d] \in \mathbb{R}^{n \times n}$ is an orthogonal matrix and $W_d^T V_d = 0$. Let $\alpha_d = W_d^T u \in \mathbb{R}^{n_{-d}}$ and $\beta_d = V_d^T u \in \mathbb{R}^{(n_d-1)n_{-d}}$. Clearly, we have

W_d\alpha_d + V_d\beta_d = W_d W_d^T u + V_d V_d^T u = \tilde{V}_d\tilde{V}_d^T u = u, \quad (19)

for any d = 1, … ,D. This fact together with the definition of Gd = UdΛd in (17) imply that solving our convex tensor clustering in (18) is equivalent to solving

\min_{\alpha_d, \beta_d,\, d=1,\ldots,D}\; \sum_{d=1}^{D}\Big\{\frac{1}{2D}\|x - W_d\alpha_d - V_d\beta_d\|_2^2 + \gamma_d R(G_d\beta_d)\Big\} \quad (20)

Denote the solution of (20) as $\hat{\alpha}_d, \hat{\beta}_d$, d = 1, … ,D, which corresponds to the estimator $\hat{u}$ in (18) according to (19). Similarly, we denote the true parameters as $\alpha_d^*, \beta_d^*$, which correspond to $u^*$ defined in Assumption 4.2. Our goal is to derive the upper bound of $\|\hat{u} - u^*\|_2^2$ via the above reparametrization. Since $\hat{\alpha}_d, \hat{\beta}_d$, d = 1, … ,D minimize the objective function in (20), we have

\sum_{d=1}^{D}\Big\{\frac{1}{2D}\|x - W_d\hat{\alpha}_d - V_d\hat{\beta}_d\|_2^2 + \gamma_d R(G_d\hat{\beta}_d)\Big\} \leq \sum_{d=1}^{D}\Big\{\frac{1}{2D}\|x - W_d\alpha_d^* - V_d\beta_d^*\|_2^2 + \gamma_d R(G_d\beta_d^*)\Big\}.

Note that $\|x - \hat{u}\|_2^2 - \|x - u^*\|_2^2 = \|\hat{u}\|_2^2 - \|u^*\|_2^2 - 2x^T(\hat{u} - u^*) = \|\hat{u} - u^*\|_2^2 + 2\epsilon^T(u^* - \hat{u})$, where the last equality is due to the model assumption $x = u^* + \epsilon$. Therefore, we have

\frac{1}{2}\|\hat{u} - u^*\|_2^2 + \sum_{d=1}^{D}\gamma_d R(G_d\hat{\beta}_d) \leq \frac{1}{2D}\sum_{d=1}^{D}\epsilon^T(u^* - \hat{u}) + \sum_{d=1}^{D}\gamma_d R(G_d\beta_d^*) \leq \frac{1}{2D}\sum_{d=1}^{D}\underbrace{\big|\epsilon^T[W_d(\alpha_d^* - \hat{\alpha}_d) + V_d(\beta_d^* - \hat{\beta}_d)]\big|}_{f(\hat{\alpha}_d,\,\hat{\beta}_d)} + \sum_{d=1}^{D}\gamma_d R(G_d\beta_d^*). \quad (21)

Next we derive the bound for $f(\hat{\alpha}_d, \hat{\beta}_d)$. Note that the optimization over $\alpha_d$ in (20) has a closed form since the penalty term is independent of $\alpha_d$. In particular, by setting the derivative of $\|x - W_d\alpha_d - V_d\beta_d\|_2^2$ with respect to $\alpha_d$ to be zero, we obtain that $\alpha_d = W_d^T(x - V_d\beta_d)$. This implies that

\hat{\alpha}_d = W_d^T(x - V_d\hat{\beta}_d) = W_d^T(W_d\alpha_d^* + V_d\beta_d^* + \epsilon - V_d\hat{\beta}_d) = \alpha_d^* + W_d^T\epsilon, \quad (22)

where the second equality is due to $x = u^* + \epsilon$ and the last equality is due to the fact that $W_d^T V_d = 0$ and $W_d^T W_d = I$. According to (22), we have

f(\hat{\alpha}_d, \hat{\beta}_d) = \big|\epsilon^T W_d W_d^T\epsilon + \epsilon^T V_d(\beta_d^* - \hat{\beta}_d)\big| \leq \underbrace{|\epsilon^T W_d W_d^T\epsilon|}_{(\mathrm{I})} + \underbrace{|\epsilon^T V_d(\beta_d^* - \hat{\beta}_d)|}_{(\mathrm{II})}. \quad (23)

Bound (I): We apply the concentration inequality in Lemma 11 to bound (I). It remains to compute $\|W_d W_d^T\|_F^2$ and $\|W_d W_d^T\|_2$. By construction, $W_d W_d^T \in \mathbb{R}^{n \times n}$ is a projection matrix since $\tilde{V}_d\tilde{V}_d^T = W_d W_d^T + V_d V_d^T = I$. Therefore, the rank of $W_d W_d^T$ is $\prod_{j \neq d} n_j$, $\|W_d W_d^T\|_F^2 = \prod_{j \neq d} n_j$, $\|W_d W_d^T\|_2 = 1$, and $\operatorname{tr}(W_d W_d^T) = \prod_{j \neq d} n_j$.

Denote $n = \prod_{d=1}^{D} n_d$. By Lemma 11 and Assumption 4.2, we have

\mathbb{P}\Big(\epsilon^T W_d W_d^T\epsilon \geq t + n_{-d}\Big) \leq C\log(n)\exp\Big\{-C'M^{-2}\min\Big[\frac{t^2}{\log(n)\, n_{-d}}, t\Big]\Big\}.

Setting $t = \sqrt{n_{-d}\log(n)^2}$, we have

\mathbb{P}\Big(\epsilon^T W_d W_d^T\epsilon \geq \sqrt{\log(n)^2\, n_{-d}} + n_{-d}\Big) \leq C\exp\{\log\log(n) - C'M^{-2}\log(n)\}, \quad (24)

where the right hand side converges to zero as the dimension $n = \prod_{d=1}^{D} n_d \to \infty$. Note that our error $\epsilon$ in Assumption 4.2 is assumed to be an M-concentrated random variable. If we assume a stronger condition such that $\epsilon$ is a vector with iid sub-Gaussian entries, we can obtain an upper bound $\sqrt{\log(n)\, n_{-d}} + n_{-d}$ according to the Hanson and Wright inequality (Hanson and Wright, 1971). Therefore, in spite of the relaxation in the error assumption, our bound in (24) is only up to a log-term larger.

Bound (II): By the definitions of $G_d$ in (17) and $G_d^\dagger$, we have $G_d^\dagger G_d = I$. Furthermore, let $G_{d,ij}^\dagger$ refer to the columns of $G_d^\dagger$ that correspond to the index (i, j), and let $G_{d,ij}$ refer to the rows of $G_d$ that correspond to the index (i, j). We have

(\mathrm{II}) = |\epsilon^T V_d(\beta_d^* - \hat{\beta}_d)| = |\epsilon^T V_d G_d^\dagger G_d(\beta_d^* - \hat{\beta}_d)| = \Big|\sum_{i<j}\epsilon^T V_d G_{d,ij}^\dagger G_{d,ij}(\beta_d^* - \hat{\beta}_d)\Big| \leq \sum_{i<j}\|\epsilon^T V_d G_{d,ij}^\dagger\|_2\,\|G_{d,ij}(\beta_d^* - \hat{\beta}_d)\|_2 \leq \underbrace{\max_{i<j}\|\epsilon^T V_d G_{d,ij}^\dagger\|_2}_{\mathrm{II}_1}\;\sum_{i<j}\|G_{d,ij}(\beta_d^* - \hat{\beta}_d)\|_2.

Bound II$_1$: By construction, $\epsilon^T V_d G_{d,ij}^\dagger \in \mathbb{R}^{n_{-d}}$. We have

\|\epsilon^T V_d G_{d,ij}^\dagger\|_2 \leq \sqrt{n_{-d}}\,\|\epsilon^T V_d G_{d,ij}^\dagger\|_\infty,

and hence

\max_{i<j}\|\epsilon^T V_d G_{d,ij}^\dagger\|_2 \leq \sqrt{n_{-d}}\,\max_{i<j}\|\epsilon^T V_d G_{d,ij}^\dagger\|_\infty = \sqrt{n_{-d}}\,\|\epsilon^T V_d G_d^\dagger\|_\infty.

Let $\eta_j = e_j^T (G_d^\dagger)^T V_d^T\epsilon$, where $e_j \in \mathbb{R}^{\binom{n_d}{2} n_{-d}}$ is the basis vector with the jth entry one and the rest zeros. According to Lemma 13 and the property of $V_d$, which consists of singular vectors, we have $\sigma_{\max}(V_d) = 1$ and $\sigma_{\max}(G_d^\dagger) = 1/\sqrt{n_d}$. Therefore, $\eta_j$ is an $M/\sqrt{n_d}$-concentrated random variable with mean zero. According to the definition of a concentrated random variable in Definition 8, we have

\mathbb{P}(|\eta_j| \geq t_1) \leq C_1\exp\Big(-\frac{C_2\, n_d\, t_1^2}{M^2}\Big).

Therefore, by the union bound, we have

\mathbb{P}\Big(\max_j|\eta_j| \geq t_1\Big) \leq C_1\binom{n_d}{2} n_{-d}\exp\Big(-\frac{C_2\, n_d\, t_1^2}{M^2}\Big).

By setting $t_1 = \sqrt{\log(n)\log\big[\binom{n_d}{2} n_{-d}\big]/n_d}$, we have

\mathbb{P}\Big(\|\epsilon^T V_d G_d^\dagger\|_\infty \geq \sqrt{\log(n)\log\big[\binom{n_d}{2} n_{-d}\big]/n_d}\Big) \leq \frac{C_3}{n},

for some constant $C_3 > 0$. Hence with probability at least $1 - C_3/n$, we have

\mathrm{II}_1 \leq \sqrt{n_{-d}}\sqrt{\log(n)\log\big[\binom{n_d}{2} n_{-d}\big]/n_d}. \quad (25)

Plugging the results in (24) and (25) into (23), we obtain that, for each d = 1, … ,D

f(\hat{\alpha}_d, \hat{\beta}_d) \leq \sqrt{\log(n)^2\, n_{-d}} + n_{-d} + \sqrt{n_{-d}}\sqrt{\log(n)\log\big[\tbinom{n_d}{2} n_{-d}\big]/n_d}\;\sum_{i<j}\|G_{d,ij}(\beta_d^* - \hat{\beta}_d)\|_2.

Therefore, Assumption 4.3 on the tuning parameter γd implies that

f(\hat{\alpha}_d, \hat{\beta}_d) \leq \sqrt{\log(n)^2\, n_{-d}} + n_{-d} + D\gamma_d\sum_{i<j}\|G_{d,ij}(\beta_d^* - \hat{\beta}_d)\|_2,

by noting that $\log\big(\tbinom{n_d}{2} n_{-d}\big) \leq \log(n_d^2\, n_{-d}) \leq 2\log(n)$. Combining this with the inequality in (21) leads to

\frac{1}{2}\|\hat{u} - u^*\|_2^2 \leq \frac{1}{2D}\sum_{d=1}^{D}\big[\sqrt{\log(n)^2\, n_{-d}} + n_{-d}\big] + \frac{3}{2}\sum_{d=1}^{D}\gamma_d R(S_d u^*). \quad (26)

According to the cluster structure assumption in Assumption 4.2, there are kd clusters along the dth mode of the tensor. Therefore, along each mode the true parameter U* only has a few different slices. Denote Ui* as the i-th mode-d subarray. Formally, we have

R(S_d u^*) = \sum_{(i,j),\, i<j,\, i,j=1,\ldots,n_d}\|A_{d,ij} u^*\|_2 = \sum_{(i,j),\, i<j,\, i,j=1,\ldots,n_d}\|\mathcal{U}_i^* - \mathcal{U}_j^*\|_F \leq 4C_0^2\binom{n_d}{2}\prod_{j \neq d}k_j, \quad (27)

where C0 is a constant upper bound for the entries of U*. Combining the inequalities in (26) and (27) with the condition on γd given in Assumption 4.3 implies that

\frac{1}{2}\|\hat{u} - u^*\|_2^2 \leq \frac{1}{2D}\sum_{d=1}^{D}\Big(\sqrt{\log(n)^2\, n_{-d}} + n_{-d}\Big) + \frac{3}{2}\sum_{d=1}^{D}\frac{2c_0\log(n)\sqrt{n}}{D\, n_d}\cdot 4C_0^2\binom{n_d}{2}\prod_{j \neq d}k_j.

Dividing both sides by n gives the prediction error bound in (7). This ends the proof of Theorem 9. ■

Appendix D. Derivation of Lagrangian Dual

Let $\mathcal{U} \times_d A$ denote the multiplication of $\mathcal{U}$ along mode d by the matrix A. Recall that for a tensor $\mathcal{U} \in \mathbb{R}^{n_1 \times \cdots \times n_D}$ and a matrix $A \in \mathbb{R}^{L \times n_d}$

\operatorname{vec}(\mathcal{U} \times_d A) = (I_{n_D} \otimes \cdots \otimes I_{n_{d+1}} \otimes A \otimes I_{n_{d-1}} \otimes \cdots \otimes I_{n_1})\, u,

where $u = \operatorname{vec}(\mathcal{U}) = \operatorname{vec}(U_{(1)})$, namely the column-major vectorization of the mode-1 matricization of the tensor $\mathcal{U}$. Note that $\mathcal{Y} = \mathcal{U} \times_d A$ is equivalent to $Y_{(d)} = AU_{(d)}$. We rewrite the penalty function $R_d$ as follows:

R_d(\mathcal{U}) = \sum_{l \in E_d} w_{d,l}\|\mathcal{U} \times_d \Delta_{d,l}\|_F = \sum_{l \in E_d} w_{d,l}\|\operatorname{vec}(\mathcal{U} \times_d \Delta_{d,l})\|_2 = \sum_{l \in E_d} w_{d,l}\|A_{d,l} u\|_2,

where $A_{d,l} = (I_{n_D} \otimes \cdots \otimes I_{n_{d+1}} \otimes \Delta_{d,l} \otimes I_{n_{d-1}} \otimes \cdots \otimes I_{n_1})$.

We now write down the Lagrangian:

L(u, v, \lambda) = \frac{1}{2}\|x - u\|_2^2 + \sum_{d=1}^{D}\sum_{l \in E_d}\big\{\gamma w_{d,l}\|v_{d,l}\|_2 + \langle\lambda_{d,l}, A_{d,l}u - v_{d,l}\rangle\big\} = \Big\{\frac{1}{2}\|x - u\|_2^2 + \sum_{d=1}^{D}\langle A_d^T\lambda_d, u\rangle\Big\} - \sum_{d=1}^{D}\sum_{l \in E_d}\big\{\langle\lambda_{d,l}, v_{d,l}\rangle - \gamma w_{d,l}\|v_{d,l}\|_2\big\} = \Big\{\frac{1}{2}\|x - u\|_2^2 + \langle A^T\lambda, u\rangle\Big\} - \sum_{d=1}^{D}\sum_{l \in E_d}\big\{\langle\lambda_{d,l}, v_{d,l}\rangle - \gamma w_{d,l}\|v_{d,l}\|_2\big\}.

The Lagrangian dual objective is given by G(λ) by minimizing the Lagrangian L(u,v,λ) over the primal variables u and v, namely

G(\lambda) = \min_{u,v} L(u, v, \lambda) = \min_{u}\Big\{\frac{1}{2}\|x - u\|_2^2 + \langle A^T\lambda, u\rangle\Big\} - \sum_{d=1}^{D}\sum_{l \in E_d}\max_{v_{d,l}}\big\{\langle\lambda_{d,l}, v_{d,l}\rangle - \gamma w_{d,l}\|v_{d,l}\|_2\big\} = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|x - A^T\lambda\|_2^2 - \sum_{d=1}^{D}\sum_{l \in E_d}\iota_{C_{d,l}}(\lambda_{d,l}), \quad (28)

where $\iota_{C_{d,l}}$ is the indicator function of the closed convex set $C_{d,l} = \{z : \|z\|_2 \leq \gamma w_{d,l}\}$.

The last equality in (28) follows from the fact that the Fenchel conjugate of a norm is the indicator function of the unit dual norm ball. Recall that the Fenchel conjugate f* of a function f is given by

f^*(\lambda) = \sup_{v}\{\langle\lambda, v\rangle - f(v)\}.

Let $B = \{\lambda : \|\lambda\|_2 \leq 1\}$ denote the unit 2-norm ball. Since the 2-norm is self dual, we arrive at the identity

\iota_B(\lambda) = \sup_{v}\{\langle\lambda, v\rangle - \|v\|_2\}.

Appendix E. Projected Gradient Applied to the Lagrangian Dual

Note that the dual problem (8) has the form

\text{minimize } g(\lambda) \quad \text{subject to } \lambda \in C, \quad (29)

where $g(\lambda)$ is a convex and Lipschitz-differentiable function and the constraint set C is a closed convex set, which implies that every point $\lambda$ possesses a unique orthogonal projection, $P_C(\lambda) = \operatorname*{arg\,min}_{\theta \in C}\|\theta - \lambda\|_2$, onto C. When $P_C(\lambda)$ can be computed analytically, a simple and effective iterative algorithm for solving problems like (29) is the projected gradient descent algorithm, a special case of the proximal gradient descent algorithm (Combettes and Wajs, 2005; Combettes and Pesquet, 2011). Recall that projected gradient descent alternates between taking a gradient step and projecting onto the set C. Thus, at the mth iteration, we perform the following update

\lambda^{(m)} = P_C\big(\lambda^{(m-1)} - \eta\nabla g(\lambda^{(m-1)})\big), \quad (30)

where η is a step-length parameter.

Applying the update rule in (30) to the dual problem (8), we obtain the following rule for computing the mth iteration

u^{(m)} = x - A^T\lambda^{(m-1)}, \qquad \lambda^{(m)} = P_C\big(\lambda^{(m-1)} + \eta A u^{(m)}\big).

Note that, at the mth iteration, the gradient of the least squares objective in (8) is given by −Au(m). Thus, we automatically update our CoCo estimator u(m) as part of our gradient calculation. Finally, we note that the projection onto the set C consists of independent projections onto the sets Cd,l that can be carried out in parallel.
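A compact Python sketch of this iteration is given below for a generic penalty operator A with blockwise dual constraints. The one-dimensional fused example, the block layout, and the step-size choice are illustrative stand-ins for the tensor operator described above, not the authors' implementation.

```python
import numpy as np

def project_blocks(lam, blocks, radii):
    """Project each dual block lambda_{d,l} onto the ball {z : ||z||_2 <= gamma * w_{d,l}}."""
    out = lam.copy()
    for (start, stop), r in zip(blocks, radii):
        nrm = np.linalg.norm(out[start:stop])
        if nrm > r:
            out[start:stop] *= r / nrm
    return out

def dual_projected_gradient(x, A, blocks, radii, eta, n_iter=500):
    """Iterate u = x - A^T lam, then lam = P_C(lam + eta * A u)."""
    lam = np.zeros(A.shape[0])
    for _ in range(n_iter):
        u = x - A.T @ lam                    # the primal iterate comes for free
        lam = project_blocks(lam + eta * (A @ u), blocks, radii)
    return x - A.T @ lam, lam

# Toy example: fused differences of a 1-D signal (a stand-in for the tensor operator A).
n = 8
x = np.concatenate([np.zeros(4), np.ones(4)]) + 0.05 * np.random.randn(n)
A = np.diff(np.eye(n), axis=0)               # rows are e_{i+1} - e_i
blocks = [(i, i + 1) for i in range(n - 1)]  # each "block" is a single difference here
radii = [0.2] * (n - 1)                      # gamma * w_l for each edge
eta = 1.0 / np.linalg.norm(A, 2) ** 2        # below 2 / spectral_radius(A^T A)
u_hat, lam_hat = dual_projected_gradient(x, A, blocks, radii, eta)
print(np.round(u_hat, 2))
```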

E.1. Per-Iteration and Storage Costs

The gradient update is dominated by the matrix-vector multiplications $A^T\lambda$ and $Au$. Although A is a $\sum_{d=1}^{D}|E_d| n_{-d}$-by-$n$ matrix it has only $2\sum_{d=1}^{D}|E_d| n_{-d}$ non-zero elements. Thus, computing the gradient step requires $O(\sum_{d=1}^{D}|E_d| n_{-d})$ flops. Projecting onto the set C also requires $O(\sum_{d=1}^{D}|E_d| n_{-d})$ flops since projecting onto the set $C_{d,l}$ requires $O(n_{-d})$ flops. Thus, the per-iteration cost is $O(\sum_{d=1}^{D}|E_d| n_{-d})$ flops. The storage cost is dominated by storing the dual variable $\lambda$, which has $\sum_{d=1}^{D}|E_d| n_{-d}$ elements. At first glance these storage and per-iteration costs may seem prohibitive, as $|E_d|$ can be as large as $O(n_d^2)$ for a fully connected mode-d graph. Shrinking together all combinations of pairs of mode-d subarrays, however, typically produces poor clustering results in comparison to shrinking together mode-d subarrays that are nearest-neighbors as observed in prior work in convex clustering (Chen et al., 2015; Chi and Lange, 2015) and convex biclustering (Chi et al., 2017). Consequently, we employ sparse weights. Specifically, we keep positive weights between approximately nearest-neighbor mode-d subarrays so that $|E_d|$ is $O(n_d)$. By using these sparse weights, the per-iteration and storage costs scale more reasonably as O(Dn), namely linearly in either the number of dimensions D or in the number of elements n. Details on our weights choices are elaborated in Section 6.

E.2. Convergence

The sequence of dual iterates λ(m) is guaranteed to converge to a solution λ^ of (8) provided that the step-size parameter η is less than twice the reciprocal of the spectral radius of the matrix ATA (Combettes and Wajs, 2005, Theorem 3.4). Consequently, the sequence of primal iterates u(m) is guaranteed to converge to the CoCo estimator û. We note that under the same step-size conditions, convergence of the sequence u(m) can also be guaranteed by observing that the projected gradient algorithm applied to the dual problem (8) is an example of the alternating minimization algorithm (Tseng, 1991, Proposition 2).

E.3. Monitoring Convergence via the Duality Gap

Recall that we can bound the suboptimality of the mth iterate, Fγ(u(m)) − Fγ(û), by the duality gap Fγ(u(m)) − G(λ(m)), which can be expressed solely in terms of the mth iterate of the primal variable u(m), namely

F_\gamma(u^{(m)}) - G(\lambda^{(m)}) = \|u^{(m)}\|_2^2 - \langle x, u^{(m)}\rangle + \gamma\sum_{d=1}^{D}\sum_{l \in E_d} w_{d,l}\|A_{d,l} u^{(m)}\|_2.

For any optimal dual solution λ^, the gap vanishes, namely Fγ(u^)=G(λ^). Note that computing the duality gap incurs minimal additional cost as u(m) and Ad,lu(m) are already computed as part of the gradient step. In short, including a duality gap computation will not change the O(Dn) per-iteration cost of the projected gradient algorithm. In practice, we can terminate the algorithm once the duality gap falls below some small tolerance.
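As an illustration, the duality gap above can be evaluated with a few lines of Python, reusing the quantities already formed during the gradient step; the block and weight arguments mirror the projected gradient sketch above and are not part of the authors' implementation.

```python
import numpy as np

def duality_gap(x, u, A, blocks, weights, gamma):
    """F_gamma(u) - G(lambda) = ||u||^2 - <x, u> + gamma * sum_l w_l ||A_l u||_2."""
    Au = A @ u
    penalty = sum(w * np.linalg.norm(Au[start:stop])
                  for (start, stop), w in zip(blocks, weights))
    return float(u @ u - x @ u + gamma * penalty)

# e.g., with the toy problem above: duality_gap(x, u_hat, A, blocks,
#       weights=[1.0] * (len(x) - 1), gamma=0.2)
```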

E.4. Computing Mode-d Difference Variables

In Section 7.2, we explained how clustering assignments along the dth mode are made using the mode-d difference variables $v_{d,l} = \mathcal{U} \times_d \Delta_{d,l}$. In practice we must deal with the fact that the $\hat{u}$ recovered by computing $x - A^T\hat{\lambda}$ may exhibit a nearly but not exactly checkerbox structure due to limitations in numerical precision. This creates a practical issue as a small but non-zero difference variable will lead to an incorrect clustering assignment. Addressing this issue, however, is simple. The projected gradient algorithm used to compute CoCo is a natural generalization of the projected gradient algorithm used in Chi and Lange (2015) for convex clustering. Consequently, we can use the obvious adaptation of the procedure for computing the difference variables in convex clustering. The following brief technical discussion is expanded in more detail in Chi and Lange (2015).

The key fact that we use is that the projected gradient algorithm is equivalent to the alternating minimization algorithm (AMA) applied to the following augmented Lagrangian function

L_\eta(u, v, \lambda) = \frac{1}{2}\|x - u\|_2^2 + \sum_{d=1}^{D}\sum_{l \in E_d}\Big[\gamma w_{d,l}\|v_{d,l}\|_2 + \langle\lambda_{d,l}, v_{d,l} - A_{d,l}u\rangle + \frac{\eta}{2}\|v_{d,l} - A_{d,l}u\|_2^2\Big].

The mode-d difference vector vd,l is determined by the proximal map

v_{d,l} = \operatorname*{arg\,min}_{v_{d,l}}\Big\{\frac{1}{2}\|v_{d,l} - A_{d,l}u + \eta^{-1}\lambda_{d,l}\|_2^2 + \frac{\gamma w_{d,l}}{\eta}\|v_{d,l}\|_2\Big\} = \operatorname{prox}_{(\sigma_{d,l}/\eta)\|\cdot\|_2}\big(A_{d,l}u - \eta^{-1}\lambda_{d,l}\big), \quad (31)

where $\sigma_{d,l} = \gamma w_{d,l}$. Because the proximal mapping can produce mode-d difference variables that are exactly zero, the procedure for computing $v_{d,l}$ in (31) is immune to the numerical precision issues that hinder the direct computation of $\hat{\mathcal{U}} \times_d \Delta_{d,l}$.
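The proximal map in (31) is group (block) soft-thresholding, which is what produces the exact zeros; a minimal Python sketch, with an illustrative threshold, is:

```python
import numpy as np

def prox_group_l2(z, threshold):
    """prox of threshold * ||.||_2: shrink z toward 0, returning exactly 0 when ||z|| <= threshold."""
    nrm = np.linalg.norm(z)
    if nrm <= threshold:
        return np.zeros_like(z)
    return (1.0 - threshold / nrm) * z

# v_{d,l} = prox_{(sigma_{d,l}/eta) ||.||_2}(A_{d,l} u - lambda_{d,l} / eta)
z = np.array([0.3, -0.4])                 # a toy difference vector
print(prox_group_l2(z, threshold=1.0))    # -> [0. 0.]  (fused: exact zero)
print(prox_group_l2(z, threshold=0.25))   # partially shrunk, not zero
```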

Appendix F. Details on Denoising with the Tucker Decomposition for Setting Weights

Employing the Tucker decomposition introduces another tuning parameter, namely the rank of the decomposition. When applicable, a user can leverage problem-specific knowledge to select the rank for the decomposition. Nonetheless, the availability of an automatic approach is desirable to handle cases when such knowledge is unavailable. Selecting the rank in a tensor decomposition, however, is an open question (Kolda and Bader, 2009; Yokota et al., 2017). During initial experiments, a few different methods for selecting the Tucker decomposition rank from the literature were compared: an L-curve approach that attempts to strike a balance between the decomposition's relative error and compression ratio, as implemented by the mlrankest function in the Tensorlab Matlab toolbox (Vervliet et al., 2016), minimum description length (Rissanen, 1978; Yokota et al., 2017), and the recently-proposed SCORE algorithm (Yokota et al., 2017). Out of these, the SCORE algorithm produced the best average CoCo estimator performance. The SCORE algorithm itself includes a tuning parameter, $\hat{\rho}$, and Yokota et al. (2017) suggest setting $\hat{\rho} \in [10^{-4}, 10^{-2}]$. We considered $\hat{\rho} \in \{10^{-4}, 10^{-3}, 10^{-2}\}$ and found $10^{-3}$ to perform the best, which also matches the value used in the experiments by Yokota et al. (2017).

We also developed a simple yet effective heuristic for choosing the rank where we set the Tucker rank for the dth mode to be the floor of nd/2. Two principles motivating the heuristic are that the rank of the decomposition should be both small relative to and also in proportion to the length of the modes. Both the SCORE algorithm and our heuristic were employed in our simulations described in Section 8 as a robustness check to ensure our CoCo estimator’s performance does not crucially depend on the choice of the rank.

The basic Tucker decomposition computation is accomplished by the higher order SVD (HOSVD) method (De Lathauwer et al., 2000), which computes for each mode k the $r_k$ leading left singular vectors of the mode-k matricization and stores them as a factor matrix $U_k$. The HOSVD then computes the core tensor by contracting the data tensor along each mode with the factor matrices, $\mathcal{H} = \mathcal{X} \times_1 U_1^T \times_2 \cdots \times_D U_D^T$. Thus, the main cost is computing D SVDs. This is an illustrative calculation, however, and more efficient alternatives exist (Vannieuwenhoven et al., 2012; Minster et al., 2020).
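A compact Python/NumPy sketch of this HOSVD calculation is given below. The default rank uses the floor(n_d/2) heuristic from the previous paragraph purely for illustration, and the implementation is a plain truncated HOSVD rather than the more efficient alternatives cited above.

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_multiply(X, M, mode):
    """Tensor-times-matrix along `mode`: contracts the mode-`mode` fibers with M."""
    Xm = M @ unfold(X, mode)
    new_shape = (M.shape[0],) + X.shape[:mode] + X.shape[mode + 1:]
    return np.moveaxis(Xm.reshape(new_shape), 0, mode)

def hosvd(X, ranks=None):
    """Truncated HOSVD: factor U_k from the leading left singular vectors of X_(k)."""
    D = X.ndim
    if ranks is None:
        ranks = [max(1, n // 2) for n in X.shape]    # the floor(n_d / 2) heuristic
    factors = []
    for k in range(D):
        Uk, _, _ = np.linalg.svd(unfold(X, k), full_matrices=False)
        factors.append(Uk[:, :ranks[k]])
    core = X
    for k in range(D):
        core = mode_multiply(core, factors[k].T, k)  # H = X x_1 U_1^T ... x_D U_D^T
    return core, factors

X = np.random.randn(10, 12, 8)
core, factors = hosvd(X)
print(core.shape, [U.shape for U in factors])        # (5, 6, 4) and [(10,5), (12,6), (8,4)]
```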

Appendix G. CPD+k-means

We describe in greater detail the CPD+k-means method for co-clustering a D-way tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_D}$. The method consists of two steps:

Step 1. Compute a rank-R CP decomposition

$$\mathcal{X} \approx \sum_{i=1}^{R} a_i^{(1)} \circ a_i^{(2)} \circ \cdots \circ a_i^{(D)},$$

where $\circ$ represents the outer product and $a_i^{(d)}$ is the ith column of the dth factor matrix $A^{(d)} \in \mathbb{R}^{n_d \times R}$.

Step 2. For each factor matrix $A^{(d)}$, apply k-means clustering to the $n_d$ rows of $A^{(d)}$. Note that the D applications of k-means are performed independently, one for each mode-d factor matrix $A^{(d)}$.

Tuning parameters: There are two sets of tuning parameters: (i) the rank parameter R used in Step 1, and (ii) the D cluster numbers, one for each factor matrix, used in Step 2. To choose the rank parameter, we create a candidate set of ranks $\mathcal{R}_{\text{candidate}} \subset \{1, 2, 3, \ldots\}$ and select $R^* \in \mathcal{R}_{\text{candidate}}$ using the tuning procedure in Sun et al. (2017). We then compute a CP decomposition using the selected rank $R^*$ and obtain the factor matrices $A^{(d)}$ for d = 1, …, D. To choose the D cluster numbers, we create D candidate sets $\mathcal{K}^{d}_{\text{candidate}} \subset \{1, 2, 3, \ldots, n_d\}$ and select $k_d^* \in \mathcal{K}^{d}_{\text{candidate}}$ for d = 1, …, D using the gap statistic procedure in Tibshirani et al. (2001). The final output consists of the D clustering results from running k-means on the rows of each $A^{(d)}$ with $k_d^*$ clusters.
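The following is a minimal sketch of these two steps, assuming TensorLy's parafac routine for the CP fit and scikit-learn's KMeans for the per-mode clustering. The wrapper cpd_kmeans is ours, the rank and cluster numbers are taken as given, and the selection procedures of Sun et al. (2017) and the gap statistic are omitted; depending on the TensorLy version, parafac may return the factor list directly rather than a CPTensor, which the sketch handles with getattr.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from sklearn.cluster import KMeans

def cpd_kmeans(X, rank, n_clusters):
    """CPD+k-means sketch.

    Step 1 fits a rank-R CP decomposition; Step 2 runs k-means independently
    on the rows of each factor matrix. `n_clusters` lists the cluster number
    k_d for each of the D modes.
    """
    cp = parafac(tl.tensor(X), rank=rank)          # Step 1: CP decomposition
    factors = getattr(cp, "factors", cp)           # n_d x R factor matrices
    labels = []
    for A_d, k_d in zip(factors, n_clusters):      # Step 2: per-mode k-means
        km = KMeans(n_clusters=k_d, n_init=10).fit(np.asarray(A_d))
        labels.append(km.labels_)                  # cluster label per mode-d slice
    return labels

# Example with a given rank and cluster numbers (tuning omitted).
X = np.random.randn(20, 20, 50)
mode_labels = cpd_kmeans(X, rank=3, n_clusters=[2, 2, 2])
```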

Appendix H. Additional Simulations on Rectangular Tensors

The first rectangular tensor is one in which there are two short modes ($n_1 = n_2 = 10$) and one relatively longer mode ($n_3 = 50$). Figure 14 presents the clustering results for this tensor shape.

Figure 14: Checkerbox Simulation Results: Impact of Tensor Shape. Two balanced clusters per mode with two levels of homoskedastic noise for a tensor with two short modes and one longer mode. Average adjusted Rand index plus/minus one standard error for different noise levels and mode lengths.

At the lower noise level (σ = 2), CoCo performs very well and outperforms both CPD+k-means and CoTeC in terms of both single-mode clustering and co-clustering. When the noise level is increased (σ = 3), both CoCo and CPD+k-means experience a noticeable drop-off in performance and now perform more similarly. Interestingly, CoCo's single-mode clustering results are better along the two shorter modes (modes 1 and 2), which is not what we expected. This provides some evidence that the performance along a mode depends on both the length of that mode and the lengths of the other modes. When the lengths of the shorter modes are increased slightly (from $n_d = 10$ to $n_d = 20$ for d = 1, 2), CoCo has near-perfect performance while CPD+k-means performs roughly the same as before. Thus, CoCo struggles with this tensor shape only when the short modes are very short (only 10 observations).

To further investigate the mode-by-mode performance with rectangular tensors, we also apply the clustering methods to a "Goldilocks" tensor with mode lengths that are short, medium, and long. This setting was again motivated by the results from the previous two tensor shapes, this time to see how performance is affected when the size of a longer mode is increased. The ARI results for this tensor shape are given in Figure 15d, and they are consistent with what was observed previously. When the short mode has only 10 observations, CoCo initially performs very well until the noise reaches a certain level. At that point, its performance along the longer modes declines sharply, falling below that of CPD+k-means, and this pattern is more pronounced for the longest mode ($n_3 = 100$). The overall co-clustering performance of the two methods remains similar, however. As before, CoCo does not experience as sharp a decline when the shortest mode is made slightly longer ($n_1 = 20$), and for the most part it does noticeably better than CPD+k-means.

Figure 15: Checkerbox Simulation Results: Impact of Tensor Shape. Two balanced clusters per mode with two levels of homoskedastic noise for a tensor with short, medium, and long mode lengths. Average adjusted Rand index plus/minus one standard error for different noise levels and mode lengths.

Overall, from clustering these different tensor shapes, we see that CoCo still generally performs very well and better than CPD+k-means. The main issue arises when at least one mode is very short ($n_d = 10$). CoCo performs very well at lower noise levels but suffers a sharp decline in performance once the noise reaches a certain level. Unexpectedly, the decline in single-mode performance is worse for the longer modes. However, even when this happens, CoCo's overall co-clustering performance remains comparable to CPD+k-means. Additionally, this pattern is much less pronounced when the length of the shortest mode is increased slightly.

Contributor Information

Eric C. Chi, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.

Brian R. Gaines, Advanced Analytics R&D, SAS Institute Inc., Cary, NC 27513, USA.

Will Wei Sun, Krannert School of Management, Purdue University, West Lafayette, IN 47907, USA.

Hua Zhou, Department of Biostatistics, University of California, Los Angeles, CA 90095, USA.

Jian Yang, Advertising Sciences, Yahoo Research, Sunnyvale, CA 94089, USA.

References

1. Acar Evrim and Yener Bülent. Unsupervised multiway data analysis: A literature survey. IEEE Transactions on Knowledge and Data Engineering, 21(1):6–20, 2009.
2. Acar Evrim, Çamtepe Seyit A., and Yener Bülent. Collective sampling and analysis of high order tensors for chatroom communications. In International Conference on Intelligence and Security Informatics, pages 213–224. Springer, 2006.
3. Anandkumar Animashree, Ge Rong, Hsu Daniel, Kakade Sham M., and Telgarsky Matus. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
4. Ankenman Jerrod I. Geometry and analysis of dual networks on questionnaires, 2014.
5. Bader Brett W., Kolda Tamara G., et al. Matlab tensor toolbox version 2.6. Available online, February 2015. URL http://www.sandia.gov/~tgkolda/TensorToolbox/.
6. Beck Amir and Teboulle Marc. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
7. Bergmann Sven, Ihmels Jan, and Barkai Naama. Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E, 67(3):031902, 2003.
8. Bhar Anirban, Haubrock Martin, Mukhopadhyay Anirban, and Wingender Edgar. Multiobjective triclustering of time-series transcriptome data reveals key genes of biological processes. BMC Bioinformatics, 16(1):200, 2015.
9. Bi Xuan, Qu Annie, and Shen Xiaotong. Multilayer tensor factorization with applications to recommender systems. The Annals of Statistics, 46(6B):3308–3333, 2018.
10. Busygin Stanislav, Prokopyev Oleg, and Pardalos Panos M. Biclustering in data mining. Computers and Operations Research, 35(9):2964–2987, 2008.
11. Cao Xiaochun, Wei Xingxing, Han Yahong, and Lin Dongdai. Robust face clustering via tensor decomposition. IEEE Transactions on Cybernetics, 45(11):2546–2557, 2015.
12. Carroll J. Douglas and Chang Jih-Jie. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319, 1970.
13. Chen Annie I. and Ozdaglar Asuman. A fast distributed proximal-gradient method. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 601–608. IEEE, 2012.
14. Chen Gary K., Chi Eric C., Ranola John Michael O., and Lange Kenneth. Convex clustering: An attractive alternative to hierarchical clustering. PLOS Computational Biology, 11(5):e1004228, 2015.
15. Chen Jiahua and Chen Zehua. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.
16. Chen Jiahua and Chen Zehua. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pages 555–574, 2012.
17. Chi Eric C. and Lange Kenneth. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4):994–1013, 2015.
18. Chi Eric C. and Steinerberger Stefan. Recovering trees with convex clustering. SIAM Journal on Mathematics of Data Science, 1(3):383–407, 2019.
19. Chi Eric C., Allen Genevera I., and Baraniuk Richard G. Convex biclustering. Biometrics, 73(1):10–19, 2017.
20. Cichocki Andrzej, Mandic Danilo, De Lathauwer Lieven, Zhou Guoxu, Zhao Qibin, Caiafa Cesar, and Phan Anh Huy. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015.
21. Combettes Patrick L. and Pesquet Jean-Christophe. Proximal splitting methods in signal processing. In Fixed-point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
22. Combettes Patrick L. and Wajs Valérie R. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
23. De Lathauwer Lieven, De Moor Bart, and Vandewalle Joos. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
24. Deo Narsingh. Graph Theory with Applications to Engineering and Computer Science. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1974.
25. Eckstein Jonathan. A simplified form of block-iterative operator splitting and an asynchronous algorithm resembling the multi-block alternating direction method of multipliers. Journal of Optimization Theory and Applications, 173(1):155–182, Apr 2017.
26. Fan Jianqing and Li Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
27. Frolov Evgeny and Oseledets Ivan. Tensor methods and recommender systems. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(3):e1201, 2017.
28. Gavish Matan and Coifman Ronald R. Sampling, denoising and compression of matrices by coherent matrix organization. Applied and Computational Harmonic Analysis, 33(3):354–369, 2012.
29. Geisser Seymour. The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350):320–328, 1975.
30. Goldstein Tom, Studer Christoph, and Baraniuk Richard. A field guide to forward-backward splitting with a FASTA implementation. arXiv eprint, abs/1411.3406, 2014. URL http://arxiv.org/abs/1411.3406.
31. Goldstein Tom, Studer Christoph, and Baraniuk Richard. FASTA: A generalized implementation of forward-backward splitting, January 2015. http://arxiv.org/abs/1501.04979.
32. Guigourès Romain, Boullé Marc, and Rossi Fabrice. Discovering patterns in time-varying graphs: A triclustering approach. Advances in Data Analysis and Classification, pages 1–28, 2015.
33. Hallac David, Leskovec Jure, and Boyd Stephen. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 387–396, New York, NY, USA, 2015. ACM.
34. Hanson DL and Wright FT. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42:1079–1083, 1971.
35. Harshman Richard A. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
36. Hartigan John A. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.
37. Ho Nhat, Lin Tianyi, and Jordan Michael I. On structured filtering-clustering: Global error bound and optimal first-order algorithms. arXiv:1904.07462 [stat.ML], 2019. URL https://arxiv.org/abs/1904.07462.
38. Hocking Toby D., Joulin Armand, Bach Francis, and Vert Jean-Philippe. Clusterpath an algorithm for clustering using convex fusion penalties. In Getoor Lise and Scheffer Tobias, editors, 28th International Conference on Machine Learning, page 1. ACM, 2011.
39. Hof Robert. Study: Mobile Ads Actually Do Work - Especially In Apps. Forbes, August 27, 2014. Last Accessed July 9, 2017 from https://www.forbes.com/sites/roberthof/2014/08/27/study-mobile-ads-actually-do-work-especially-in-apps/#27ce654057aa.
40. Huang Heng, Ding Chris, Luo Dijun, and Li Tao. Simultaneous tensor subspace selection and clustering: The equivalence of high order SVD and k-means clustering. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 327–335. ACM, 2008.
41. Hubert Lawrence and Arabie Phipps. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
42. Jain Prateek and Oh Sewoong. Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems, pages 1431–1439, 2014.
43. Jegelka Stefanie, Sra Suvrit, and Banerjee Arindam. Approximation algorithms for tensor clustering. In International Conference on Algorithmic Learning Theory, pages 368–383. Springer, 2009.
44. Kolda Tamara G. and Bader Brett W. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
45. Kolda Tamara G. and Sun Jimeng. Scalable tensor decompositions for multi-aspect data mining. In 2008 Eighth IEEE International Conference on Data Mining, pages 363–372, 2008.
46. Kutty Sangeetha, Nayak Richi, and Li Yuefeng. XML documents clustering using a tensor space model. Advances in Knowledge Discovery and Data Mining, pages 488–499, 2011.
47. Lange Kenneth, Hunter David R., and Yang Ilsoon. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1–20, 2000.
48. Lazzeroni Laura and Owen Art. Plaid models for gene expression data. Statistica Sinica, 12:61–86, 2002.
49. Lee Mihee, Shen Haipeng, Huang Jianhua Z., and Marron JS. Biclustering via sparse singular value decomposition. Biometrics, 66(4):1087–1095, 2010.
50. Li Mu, Andersen David G., and Smola Alexander. Distributed delayed proximal gradient methods. In NIPS Workshop on Optimization for Machine Learning, volume 3, page 3, 2013.
51. Lindsten Fredrik, Ohlsson Henrik, and Ljung Lennart. Just relax and come clustering! A convexification of k-means clustering. Technical report, Linköpings Universitet, 2011. URL http://www.control.isy.liu.se/research/reports/2011/2992.pdf.
52. Liu Ji, Yuan Lei, and Ye Jieping. Guaranteed sparse recovery under linear transformation. In Dasgupta Sanjoy and McAllester David, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 91–99. PMLR, 2013a.
53. Liu Tianqi, Yuan Ming, and Zhao Hongyu. Characterizing spatiotemporal transcriptome of human brain via low rank tensor decomposition. arXiv:1702.07449 [stat.ME], 2017. URL https://arxiv.org/abs/1702.07449.
54. Liu Xinhai, Ji Shuiwang, Glänzel Wolfgang, and De Moor Bart. Multiview partitioning via tensor methods. IEEE Transactions on Knowledge and Data Engineering, 25(5):1056–1069, 2013b.
55. Ma Ping and Zhong Wenxuan. Penalized clustering of large scale functional data with multiple covariates. Journal of the American Statistical Association, 103:625–636, 2008.
56. Madeira Sara C. and Oliveira Arlindo L. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):24–45, 2004.
57. Marchetti Yuliya and Zhou Qing. Solution path clustering with adaptive concave penalty. Electronic Journal of Statistics, 8(1):1569–1603, 2014.
58. McGarry Caitlin. Report: Google is the default iPhone search engine because it paid Apple $1 billion. Macworld, January 22, 2016. Last Accessed July 9, 2017 from http://www.macworld.com/article/3025783/iphone-ipad/report-google-is-the-default-iphone-search-engine-because-it-paid-apple-1-billion.html.
59. Meinshausen Nicolai and Bühlmann Peter. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
60. Minster Rachel, Saibaba Arvind K., and Kilmer Misha E. Randomized algorithms for low-rank tensor decompositions in the Tucker format. SIAM Journal on Mathematics of Data Science, 2(1):189–215, 2020.
61. Mishne Gal, Talmon Ronen, Meir Ron, Schiller Jackie, Lavzin Maria, Dubin Uri, and Coifman Ronald R. Hierarchical coupled-geometry analysis for neuronal structure and activity pattern discovery. IEEE Journal of Selected Topics in Signal Processing, 10(7):1238–1253, 2016.
62. Mitchell Amy, Rosenstiel Tom, Santhanam Laura Houston, and Christian Leah. Future of mobile news. Project for Excellence in Journalism (PEJ): Understanding News in the Information Age, 2012.
63. Ng Andrew Y., Jordan Michael I., and Weiss Yair. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
64. Oh Jinoh, Shin Kijung, Papalexakis Evangelos E., Faloutsos Christos, and Yu Hwanjo. S-HOT: Scalable High-Order Tucker Decomposition. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 761–770. ACM, 2017.
65. Pan Wei, Shen Xiaotong, and Liu Binghui. Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty. Journal of Machine Learning Research, 14:1865–1889, 2013.
66. Papalexakis Evangelos E., Sidiropoulos Nicholas D., and Bro Rasmus. From K-Means to Higher-Way Co-Clustering: Multilinear Decomposition With Sparse Latent Factors. IEEE Transactions on Signal Processing, 61(2):493–506, 2013.
67. Pelckmans Kristiaan, De Brabanter Jos, Suykens Johan A.K., and De Moor Bart L.R. Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering Workshop, 2005.
68. Radchenko Peter and Mukherjee Gourab. Convex clustering via l1 fusion penalization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1527–1546, 2017.
69. Rissanen Jorma. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
70. Schifano Elizabeth D., Strawderman Robert L., and Wells Martin T. Majorization-minimization algorithms for nonsmoothly penalized objective functions. Electronic Journal of Statistics, 4:1258–1299, 2010.
71. Sharpnack James, Singh Aarti, and Rinaldo Alessandro. Sparsistency of the edge lasso over graphs. In Lawrence Neil D. and Girolami Mark, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 1028–1036, 2012.
72. She Yiyuan. Sparse regression with exact clustering. Electronic Journal of Statistics, 4:1055–1096, 2010.
73. Shen Xiaotong and Huang Hsin-Cheng. Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association, 105(490):727–739, 2010.
74. Shen Xiaotong, Huang Hsin-Cheng, and Pan Wei. Simultaneous supervised clustering and feature selection over a graph. Biometrika, 99:899–914, 2012.
75. Sidiropoulos Nicholas D., De Lathauwer Lieven, Fu Xiao, Huang Kejun, Papalexakis Evangelos E., and Faloutsos Christos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, 2017.
76. Sill Martin, Kaiser Sebastian, Benner Axel, and Kopp-Schneider Annette. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics, 27(15):2089–2097, 2011.
77. Stone Mervyn. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 111–147, 1974.
78. Sun Jimeng, Tao Dacheng, and Faloutsos Christos. Beyond streams and graphs: Dynamic tensor analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 374–383. ACM, 2006.
79. Sun Jimeng, Papadimitriou Spiros, Lin Ching-Yung, Cao Nan, Liu Shixia, and Qian Weihong. Multivis: Content-based social network exploration through multiway visual analysis. In Proceedings of the 2009 SIAM International Conference on Data Mining, pages 1064–1075. SIAM, 2009.
80. Sun Will Wei and Li Lexin. Dynamic tensor clustering. Journal of the American Statistical Association, 114(528):1894–1907, 2019.
81. Sun Will Wei, Lu Junwei, Liu Han, and Cheng Guang. Provable sparse tensor decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):899–916, 2017.
82. Symeonidis Panagiotis. Matrix and tensor decomposition in recommender systems. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, pages 429–430, New York, NY, USA, 2016. ACM.
83. Symeonidis Panagiotis and Zioupos Andreas. Matrix and Tensor Factorization Techniques for Recommender Systems. Springer International Publishing, 1 edition, 2016.
84. Tan Kean Ming and Witten Daniela. Statistical properties of convex clustering. Electronic Journal of Statistics, 9:2324–2347, 2015.
85. Tan Kean Ming and Witten Daniela M. Sparse biclustering of transposable data. Journal of Computational and Graphical Statistics, 23(4):985–1008, 2014.
86. Tibshirani Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267–288, 1996.
87. Tibshirani Robert, Walther Guenther, and Hastie Trevor. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, 2001.
88. Tibshirani Robert, Saunders Michael, Rosset Saharon, Zhu Ji, and Knight Keith. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
89. Tibshirani Ryan J. and Taylor Jonathan. The solution path of the generalized lasso. The Annals of Statistics, 39(3):1335–1371, 2011.
90. Tseng Paul. Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM Journal on Control and Optimization, 29(1):119–138, 1991.
91. Tucker Ledyard R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
92. Turner Heather, Bailey Trevor, and Krzanowski Wojtek. Improved biclustering of microarray data demonstrated through systematic performance tests. Computational Statistics and Data Analysis, 48(2):235–254, 2005.
93. Vannieuwenhoven Nick, Vandebril Raf, and Meerbergen Karl. A new truncation strategy for the higher-order singular value decomposition. SIAM Journal on Scientific Computing, 34(2):A1027–A1052, 2012.
94. Vervliet Nico, Debals Otto, Sorber Laurent, Van Barel Marc, and De Lathauwer Lieven. Tensorlab 3.0, Mar. 2016. Available online. URL http://www.tensorlab.net.
95. Vu Van and Wang Ke. Random weighted projections, random quadratic forms and random eigenvectors. Random Structures and Algorithms, 47(4):792–821, 2015.
96. Wang Binhuan, Zhang Yilong, Sun Will Wei, and Fang Yixin. Sparse convex clustering. Journal of Computational and Graphical Statistics, 27(2):393–403, 2018.
97. Wang Yuxiang, Xu Huan, and Leng Chenlei. Provable subspace clustering: When LRR meets SSC. In Burges CJC, Bottou L, Welling M, Ghahramani Z, and Weinberger KQ, editors, Advances in Neural Information Processing Systems 26, pages 64–72. Curran Associates, Inc., 2013.
98. Wickham Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York, 2009. ISBN 978-0-387-98140-6. URL http://ggplot2.org.
99. Witten Daniela M., Tibshirani Robert, and Hastie Trevor. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
100. Wright Stephen J., Nowak Robert D., and Figueiredo Mário A.T. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, July 2009.
101. Wu Chong, Kwon Sunghoon, Shen Xiaotong, and Pan Wei. A new algorithm and theory for penalized regression-based clustering. Journal of Machine Learning Research, 17(188):1–25, 2016a.
102. Wu Tao, Benson Austin R., and Gleich David F. General tensor spectral co-clustering for higher-order data. In Lee DD, Sugiyama M, Luxburg UV, Guyon I, and Garnett R, editors, Advances in Neural Information Processing Systems 29, pages 2559–2567. Curran Associates, Inc., 2016b.
103. Xiang Shuo, Tong Xiaoshen, and Ye Jieping. Efficient sparse group feature selection via nonconvex optimization. In Dasgupta Sanjoy and McAllester David, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 284–292, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
104. Yair Or, Talmon Ronen, Coifman Ronald R., and Kevrekidis Ioannis G. Reconstruction of normal forms by learning informed observation geometries from data. Proceedings of the National Academy of Sciences, 114(38):E7865–E7874, 2017.
105. Yokota Tatsuya, Lee Namgil, and Cichocki Andrzej. Robust multilinear tensor rank estimation using higher order singular value decomposition and information criteria. IEEE Transactions on Signal Processing, 65(5):1196–1206, 2017.
106. Yuan Ming and Lin Yi. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
107. Zelnik-Manor Lihi and Perona Pietro. Self-tuning spectral clustering. In Saul LK, Weiss Y, and Bottou L, editors, Advances in Neural Information Processing Systems 17, pages 1601–1608. MIT Press, 2005.
108. Zhang Cun-Hui. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
109. Zhang Zhong-Yuan, Li Tao, and Ding Chris. Non-negative tri-factor tensor decomposition with applications. Knowledge and Information Systems, 34(2):243–265, 2013.
110. Zhao Hongya, Wang Debby D., Chen Long, Liu Xinyu, and Yan Hong. Identifying multidimensional co-clusters in tensors based on hyperplane detection in singular vector spaces. PLOS ONE, 11(9):1–27, September 2016.
111. Zheng Xiaolin, Ding Weifeng, Lin Zhen, and Chen Chaochao. Topic tensor factorization for recommender system. Information Sciences, 372(Supplement C):276–293, 2016.
112. Zhou Hua, Li Lexin, and Zhu Hongtu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108:540–552, 2013.
113. Zhu Changbo, Xu Huan, Leng Chenlei, and Yan Shuicheng. Convex optimization procedure for clustering: Theoretical revisit. In Ghahramani Z, Welling M, Cortes C, Lawrence ND, and Weinberger KQ, editors, Advances in Neural Information Processing Systems 27, pages 1619–1627. Curran Associates, Inc., 2014.
114. Zhu Yunzhang, Shen Xiaotong, and Pan Wei. Simultaneous grouping pursuit and feature selection over an undirected graph. Journal of the American Statistical Association, 108(502):713–725, 2013.
115. Zou Hui. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
116. Zou Hui and Li Runze. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4):1509–1533, 2008.
