Published in final edited form as: Stat. 2020 Jan 2;8(1):e253. doi: 10.1002/sta4.253

Tensor canonical correlation analysis

Eun Jeong Min 1, Eric C Chi 2, Hua Zhou 3

Abstract

Canonical correlation analysis (CCA) is a multivariate analysis technique for estimating a linear relationship between two sets of measurements. Modern acquisition technologies, for example, those arising in neuroimaging and remote sensing, produce data in the form of multidimensional arrays or tensors. Classic CCA is not appropriate for dealing with tensor data due to the multidimensional structure and ultrahigh dimensionality of such modern data. In this paper, we present tensor CCA (TCCA) to discover relationships between two tensors while simultaneously preserving multidimensional structure of the tensors and utilizing substantially fewer parameters. Furthermore, we show how to employ a parsimonious covariance structure to gain additional stability and efficiency. We delineate population and sample problems for each model and propose efficient estimation algorithms with global convergence guarantees. Also we describe a probabilistic model for TCCA that enables the generation of synthetic data with desired canonical variates and correlations. Simulation studies illustrate the performance of our methods.

Keywords: block coordinate ascent, CP decomposition, multidimensional array data

1 |. INTRODUCTION

Canonical correlation analysis (CCA) is a classic statistical method for identifying associations between two sets of measurements (Hotelling, 1936). Specifically, CCA identifies a pair of coefficient vectors, one for each set of measurements, such that the correlation between the corresponding linear combinations of variables from each set is maximized. By default, CCA applies when each observation consists of a pair of vector covariates. In many modern data analysis problems, however, each observation may consist more generally of a pair of multidimensional arrays or tensors. For example, in imaging genetics, to identify genetic variants that can best capture and explain phenotypic variations in brain function and structure, Stein et al. (2010) studied p = 448,293 single-nucleotide polymorphisms and q = 31,622 voxels in brain images, on n = 740 individuals. A naive approach to dealing with tensor-valued data would be to reshape tensor covariates into vectors and then apply standard CCA. There are, however, two serious drawbacks to doing so. First, structural information in tensors is discarded through vectorization. Second, the resulting vectors consist of a prohibitively large number of parameters. In the imaging genetics problem (Stein et al., 2010), vectorizing the voxel intensity measurements disregards the spatial correlation among neighbouring voxels. Moreover, applying standard CCA to vectors of single-nucleotide polymorphism covariates and vectorized brain images would require estimating nearly half a million parameters using fewer than a thousand observations.

In light of these issues, there have been extensions of CCA to handle special cases of tensor-valued data (Lee & Choi, 2007; Wang, 2010; Yan, Zheng, Zhou, & Zhao, 2012; Gang, Yong, Yan-Lei, & Jing, 2011; Lu, 2013; Wang, Yan, Sun, Zhao, & Fu, 2016). Although these methods have exhibited good empirical performance in some applications, there remains no clear population model underlying these sample-based heuristics. To address this gap in the literature, we introduce a novel statistical model for tensor CCA (TCCA). We summarize at a high level our formulation and its contributions:

  • We propose a TCCA population model that imposes the CANDECOMP/PARAFAC (CP) decomposition (Carroll & Chang, 1970; Harshman, 1970) structure on canonical tensors. This population model enforces model parsimony and enables efficient estimation.

  • We propose a refinement of TCCA, which assumes a separable covariance structure (scTCCA). This refinement enables efficient estimation of large covariance matrices of tensor-valued data.

  • We derive convenient representations of the covariance between linear combinations of two random tensors under both the unstructured and the separable covariance assumptions.

  • We develop efficient estimation algorithms for TCCA and scTCCA, both based on block coordinate ascent, which leverage these efficient representations. Each step of both algorithms solves a substantially lower dimensional CCA problem; thus, both algorithms can be easily implemented using any standard solvers for the CCA problem. Moreover, we prove global convergence guarantees of both estimation algorithms under modest regularity conditions.

  • We develop simple modifications to the TCCA and scTCCA estimation algorithms to incorporate recovery of sparse canonical correlation tensors to improve interpretability of the estimated models.

  • Finally, we extend the probabilistic interpretation of CCA by Bach and Jordan (2006) to TCCA. This extension leads to a probabilistic model for generating datasets with specified canonical correlation and variates.

The remainder of the paper is organized as follows. In Section 2, we review tensor notation and basic operations used in this paper. In Sections 3 and 4, we propose our two TCCA methods: TCCA and scTCCA. In Section 5, we describe a modification of the estimation algorithms for TCCA and scTCCA for sparse models. In Section 6, we introduce the probabilistic TCCA model. In Section 7, we describe the numerical experiment results. In Section 8, we conclude and highlight directions for future work.

2 |. NOTATION AND PRELIMINARIES

We review basic operations on matrices and tensors invoked throughout this paper, adopting the terminology and notation in Kolda and Bader (2009). Throughout the paper, we use lowercase letters to indicate scalars, bold lowercase letters to indicate vectors, bold capital characters to indicate matrices, and bold calligraphic capital characters to indicate tensors. We will also use the shorthand [n] to denote an index set {1, …, n}.

Tensors can be considered generalizations of scalars, vectors, and matrices. Let $\mathcal{X}$ represent a $D$-dimensional tensor in $\mathbb{R}^{p_1 \times \cdots \times p_D}$. The tensor $\mathcal{X}$ has order $D$, its number of dimensions or modes. For example, vectors are tensors of order one and have one mode. Matrices are tensors of order two and have two modes. We denote an element of $\mathcal{X}$ by $x_{\iota_1 \iota_2 \cdots \iota_D}$, where $\iota_i \in [p_i]$ and $i \in [D]$. Fibres are the generalization of matrix rows and columns to higher order tensors. A fibre is defined by fixing the index of every dimension except one. Mode $i$ fibres are $p_i$-dimensional vectors extracted from $\mathcal{X}$ by fixing all the indices $(\iota_1, \ldots, \iota_{i-1}, \iota_{i+1}, \ldots, \iota_D)$ except the $i$th one $\iota_i$. For example, columns of a matrix are Mode 1 fibres, and rows of a matrix are Mode 2 fibres.

It is often useful to reshape a tensor into a matrix. Reordering a tensor into a matrix is referred to as matricization. The mode $i$ matricization of a tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_D}$, denoted $\mathbf{X}_{(i)} \in \mathbb{R}^{p_i \times p_{-i}}$ with $p_{-i} = \prod_{k=1, k \neq i}^{D} p_k$, arranges the mode $i$ fibres as the columns of the matrix $\mathbf{X}_{(i)}$. In a mode $i$ matricization, the tensor element $x_{\iota_1, \ldots, \iota_D}$ is mapped to the matrix element of $\mathbf{X}_{(i)}$ with index $(\iota_i, j)$, where $j = 1 + \sum_{k=1, k \neq i}^{D} (\iota_k - 1) J_k$ with

$$J_k = \begin{cases} 1 & \text{if } k = 1, \text{ or if } k = 2 \text{ and } i = 1, \\ \prod_{k'=1, k' \neq i}^{k-1} p_{k'} & \text{otherwise.} \end{cases}$$

Reordering a tensor into a vector is referred to as vectorization. We first describe vectorization of a matrix before describing vectorization of a general tensor. The vectorization of a matrix $\mathbf{X}$ is denoted by $\mathrm{vec}(\mathbf{X})$ and is the vector obtained by stacking the columns of $\mathbf{X}$ on top of each other. The vectorization of the mode $i$ matricization of a tensor $\mathcal{X}$ in turn is denoted as $\mathbf{x}_{(i)} = \mathrm{vec}(\mathbf{X}_{(i)})$. We then define the vectorization of a tensor $\mathcal{X}$, denoted by $\mathrm{vec}(\mathcal{X})$, as the vectorization of its Mode 1 matricization, namely, $\mathrm{vec}(\mathbf{X}_{(1)})$. When unambiguous from context, we will often denote the vectorization of a tensor $\mathcal{X}$ by the corresponding bold lowercase $\mathbf{x}$.
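As a concrete illustration (our own sketch, not part of the original article), the mode $i$ matricization and the tensor vectorization above can be reproduced in a few lines of NumPy; the helper names `unfold` and `tvec` are ours, and the column-major reshaping is assumed to follow the Kolda and Bader (2009) ordering used in this paper.

```python
import numpy as np

def unfold(X, i):
    """Mode-i matricization: the mode-i fibres of X become the columns of the result."""
    # Move mode i to the front, then reshape in column-major (Fortran) order,
    # which reproduces the column ordering described in the text.
    return np.reshape(np.moveaxis(X, i, 0), (X.shape[i], -1), order="F")

def tvec(X):
    """vec(X): the vectorization of the Mode 1 matricization."""
    return unfold(X, 0).flatten(order="F")

# Small usage example on a 3 x 4 x 2 tensor.
X = np.arange(24).reshape(3, 4, 2)
print(unfold(X, 1).shape)  # (4, 6): mode i = 2 fibres (1-based indexing) as columns
print(tvec(X).shape)       # (24,)
```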

The inner product of two tensors of compatible dimensions $\mathcal{X}, \tilde{\mathcal{X}} \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ is the sum of the products of their entries, namely,

$$\langle \mathcal{X}, \tilde{\mathcal{X}} \rangle = \sum_{\iota_1=1}^{p_1} \cdots \sum_{\iota_D=1}^{p_D} x_{\iota_1, \ldots, \iota_D} \tilde{x}_{\iota_1, \ldots, \iota_D}.$$

The mode $i$ product of a tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ with a matrix $\mathbf{U} \in \mathbb{R}^{J \times p_i}$ is denoted by $\mathcal{X} \times_i \mathbf{U}$ and is the tensor of size $p_1 \times \cdots \times p_{i-1} \times J \times p_{i+1} \times \cdots \times p_D$ with elements

$$\left( \mathcal{X} \times_i \mathbf{U} \right)_{\iota_1, \ldots, \iota_{i-1}, j, \iota_{i+1}, \ldots, \iota_D} = \sum_{\iota_i=1}^{p_i} x_{\iota_1 \iota_2 \cdots \iota_D} \, u_{j \iota_i}.$$

Finally, we review three kinds of matrix products, as well as one definition of matrix division, that will be used throughout the paper.

  • For two matrices $\mathbf{A} \in \mathbb{R}^{p_1 \times p_2}$ and $\mathbf{B} \in \mathbb{R}^{q_1 \times q_2}$, the Kronecker product is the $p_1 q_1$-by-$p_2 q_2$ matrix,
    $$\mathbf{A} \otimes \mathbf{B} = \begin{pmatrix} a_{11}\mathbf{B} & \cdots & a_{1 p_2}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{p_1 1}\mathbf{B} & \cdots & a_{p_1 p_2}\mathbf{B} \end{pmatrix}.$$
  • For two matrices $\mathbf{A} = (\mathbf{a}_1 \cdots \mathbf{a}_{p_2}) \in \mathbb{R}^{p_1 \times p_2}$ and $\mathbf{B} = (\mathbf{b}_1 \cdots \mathbf{b}_{p_2}) \in \mathbb{R}^{q_1 \times p_2}$ that have the same number of columns $p_2$, the Khatri-Rao product is the $p_1 q_1$-by-$p_2$ matrix,
    $$\mathbf{A} \odot \mathbf{B} = \left( \mathbf{a}_1 \otimes \mathbf{b}_1 \;\; \mathbf{a}_2 \otimes \mathbf{b}_2 \;\; \cdots \;\; \mathbf{a}_{p_2} \otimes \mathbf{b}_{p_2} \right),$$
    which is a column-wise Kronecker product of $\mathbf{A}$ and $\mathbf{B}$.
  • For two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, the Hadamard product is the element-wise product $\mathbf{A} \ast \mathbf{B} = \{a_{ij} b_{ij}\}$. Because the Hadamard product commutes, we use $\mathop{\ast}_i \mathbf{A}_i$ to denote $\mathbf{A}_1 \ast \cdots \ast \mathbf{A}_m = \mathbf{A}_{\pi(1)} \ast \cdots \ast \mathbf{A}_{\pi(m)}$ for any permutation $\pi$.

  • Finally, for two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, the Hadamard quotient is the element-wise quotient $\mathbf{A} \oslash \mathbf{B} = \{a_{ij} / b_{ij}\}$.
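As a quick illustration (not from the article), the products above map directly onto NumPy primitives; here the Khatri-Rao product is assembled column by column from Kronecker products of the corresponding columns.

```python
import numpy as np

A = np.random.randn(3, 2)   # p1 x p2
B = np.random.randn(4, 2)   # q1 x p2, same number of columns as A

kron = np.kron(A, B)        # Kronecker product, (p1*q1) x (p2*q2)
khatri_rao = np.column_stack(
    [np.kron(A[:, r], B[:, r]) for r in range(A.shape[1])]
)                           # column-wise Kronecker product, (p1*q1) x p2
hadamard = A * (A + 1.0)    # element-wise (Hadamard) product of same-size matrices
quotient = A / (A + 1.0)    # element-wise (Hadamard) quotient

print(kron.shape, khatri_rao.shape, hadamard.shape, quotient.shape)
```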

3 |. TENSOR CANONICAL CORRELATION ANALYSIS

3.1 |. Population TCCA

Let $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ and $\mathcal{Y} \in \mathbb{R}^{q_1 \times \cdots \times q_{D_y}}$ be two random tensors of order $D_x$ and $D_y$, respectively. We denote the vectorizations of $\mathcal{X}$ and $\mathcal{Y}$ by $\mathbf{x}$ and $\mathbf{y}$, respectively. Denote by $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ the covariances of $\mathbf{x}$ and $\mathbf{y}$, respectively. Denote by $\boldsymbol{\Sigma}_{x,y}$ the covariance between $\mathbf{x}$ and $\mathbf{y}$. Let $\mathcal{V} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ and $\mathcal{W} \in \mathbb{R}^{q_1 \times \cdots \times q_{D_y}}$ be constant tensors, and let $\rho(\mathcal{V}, \mathcal{W})$ denote the correlation between the two linear combinations $\langle \mathcal{X}, \mathcal{V} \rangle$ and $\langle \mathcal{Y}, \mathcal{W} \rangle$, namely,

$$\rho(\mathcal{V}, \mathcal{W}) = \frac{\mathrm{Cov}\left( \langle \mathcal{X}, \mathcal{V} \rangle, \langle \mathcal{Y}, \mathcal{W} \rangle \right)}{\sqrt{\mathrm{Var}\left( \langle \mathcal{X}, \mathcal{V} \rangle \right)} \sqrt{\mathrm{Var}\left( \langle \mathcal{Y}, \mathcal{W} \rangle \right)}} = \frac{\mathbf{v}^\top \boldsymbol{\Sigma}_{x,y} \mathbf{w}}{\sqrt{\mathbf{v}^\top \boldsymbol{\Sigma}_x \mathbf{v}} \sqrt{\mathbf{w}^\top \boldsymbol{\Sigma}_y \mathbf{w}}}. \quad (1)$$

The pair $(\mathcal{V}, \mathcal{W})$ that maximizes $\rho$ gives the canonical tensors, and the optimal value of $\rho$ is the canonical correlation coefficient. Maximizing the objective in Equation (1) presents two challenges: (a) the high dimensionality of the optimization variables $\mathcal{V}$ and $\mathcal{W}$ and (b) the estimation of the huge covariance matrices $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ and the cross-covariance matrix $\boldsymbol{\Sigma}_{x,y}$. We will address challenge (b) by imposing a separable covariance structure in Section 4.

To address challenge (a), we impose the parsimonious CANDECOMP/PARAFAC (CP), or Kruskal, representation on the canonical tensors. The CP representation generalizes the idea of representing a matrix as the sum of Rank 1 matrices to representing a tensor as the sum of Rank 1 tensors. An order-$D$ tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ is Rank 1 if it can be expressed as the outer product of $D$ vectors $\mathbf{a}^{(1)}, \mathbf{a}^{(2)}, \ldots, \mathbf{a}^{(D)}$, namely, $\mathcal{X} = \mathbf{a}^{(1)} \circ \mathbf{a}^{(2)} \circ \cdots \circ \mathbf{a}^{(D)}$, where the binary operator $\circ$ denotes the vector outer product. Thus, the $(\iota_1, \iota_2, \ldots, \iota_D)$th element of $\mathcal{X}$ is $x_{\iota_1 \iota_2 \cdots \iota_D} = a^{(1)}_{\iota_1} a^{(2)}_{\iota_2} \cdots a^{(D)}_{\iota_D}$. A rank-$R$ tensor can be written as the sum of $R$ Rank 1 tensors, namely,

$$\mathcal{X} = [\![ \mathbf{A}_1, \ldots, \mathbf{A}_D ]\!] = \sum_{r=1}^{R} \mathbf{a}^{(1)}_r \circ \cdots \circ \mathbf{a}^{(D)}_r,$$

where $\mathbf{A}_i = \left( \mathbf{a}^{(i)}_1 \; \mathbf{a}^{(i)}_2 \; \cdots \; \mathbf{a}^{(i)}_R \right) \in \mathbb{R}^{p_i \times R}$ denotes the mode $i$ factor matrix. We use the Kruskal notation $[\![ \cdot ]\!]$ to concisely summarize the sum.

Thus, instead of searching over the space of all order-Dx and order-Dy tensor pairs (V,W), we limit our search to tensors of rank-Rx and rank-Ry,

$$\mathcal{V} = [\![ \mathbf{V}_1, \ldots, \mathbf{V}_{D_x} ]\!], \quad \mathbf{V}_i \in \mathbb{R}^{p_i \times R_x}, \; i \in [D_x], \qquad \mathcal{W} = [\![ \mathbf{W}_1, \ldots, \mathbf{W}_{D_y} ]\!], \quad \mathbf{W}_j \in \mathbb{R}^{q_j \times R_y}, \; j \in [D_y]. \quad (2)$$

As we will see later, this parameterization makes progress towards alleviating the burden of estimating a huge covariance matrix. Note that

$$\langle \mathcal{X}, \mathcal{V} \rangle = \left\langle \mathcal{X}, \sum_{r=1}^{R_x} \mathbf{v}^{(1)}_r \circ \cdots \circ \mathbf{v}^{(D_x)}_r \right\rangle = \sum_{r=1}^{R_x} \mathcal{X} \times_1 \mathbf{v}^{(1)\top}_r \times_2 \cdots \times_{D_x} \mathbf{v}^{(D_x)\top}_r$$
$$\langle \mathcal{Y}, \mathcal{W} \rangle = \left\langle \mathcal{Y}, \sum_{r=1}^{R_y} \mathbf{w}^{(1)}_r \circ \cdots \circ \mathbf{w}^{(D_y)}_r \right\rangle = \sum_{r=1}^{R_y} \mathcal{Y} \times_1 \mathbf{w}^{(1)\top}_r \times_2 \cdots \times_{D_y} \mathbf{w}^{(D_y)\top}_r.$$

Thus, we seek to maximize the correlation between a rank-$R_x$ multilinear form in $\mathcal{X}$ and a rank-$R_y$ multilinear form in $\mathcal{Y}$. Multiway information is preserved, and the dimensionality is reduced from an exponential number of parameters, $\prod_{i=1}^{D_x} p_i + \prod_{j=1}^{D_y} q_j$, to a linear number of parameters, $R_x \sum_{i=1}^{D_x} p_i + R_y \sum_{j=1}^{D_y} q_j$. Note that the ranks $(R_x, R_y)$ here are not the number of canonical tensor pairs being sought. In this paper, we focus on obtaining only the top canonical tensor pair $(\mathcal{V}, \mathcal{W})$, which has ranks $R_x$ and $R_y$.
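To make the rank-$R_x$ multilinear form concrete, the following minimal sketch (ours, not the authors' code) evaluates $\langle \mathcal{X}, \mathcal{V} \rangle$ directly from the CP factor matrices and illustrates the reduced parameter count.

```python
import numpy as np

def cp_inner_product(X, factors):
    """<X, V> where V = [[V_1, ..., V_D]] is given by its CP factor matrices.

    Each rank-1 term contracts X against one column from every factor matrix.
    """
    R = factors[0].shape[1]
    total = 0.0
    for r in range(R):
        term = X
        for Vi in factors:                       # contract mode after mode with v_r^{(i)}
            term = np.tensordot(term, Vi[:, r], axes=([0], [0]))
        total += term
    return float(total)

# Usage: a 5 x 6 x 7 tensor with a rank-2 CP-structured V.
p = (5, 6, 7)
X = np.random.randn(*p)
factors = [np.random.randn(pi, 2) for pi in p]
print(cp_inner_product(X, factors))
# Free parameters in V: 2 * (5 + 6 + 7) = 36 instead of 5 * 6 * 7 = 210.
```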

The following representations of $\mathrm{Var}(\langle \mathcal{X}, \mathcal{V} \rangle)$, $\mathrm{Var}(\langle \mathcal{Y}, \mathcal{W} \rangle)$, and $\mathrm{Cov}(\langle \mathcal{X}, \mathcal{V} \rangle, \langle \mathcal{Y}, \mathcal{W} \rangle)$ in terms of a CP decomposition are key to our estimation algorithms. The proof is in the Supporting Information.

Proposition 1

Let $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ and $\mathcal{Y} \in \mathbb{R}^{q_1 \times \cdots \times q_{D_y}}$ be two random tensors, and let $\mathcal{V} = [\![ \mathbf{V}_1, \ldots, \mathbf{V}_{D_x} ]\!]$ and $\mathcal{W} = [\![ \mathbf{W}_1, \ldots, \mathbf{W}_{D_y} ]\!]$ be two constant tensors of the same size as $\mathcal{X}$ and $\mathcal{Y}$, respectively. Define

$$\mathbf{V}_{(-i)} = \left[ \mathbf{V}_{D_x} \odot \cdots \odot \mathbf{V}_{i+1} \odot \mathbf{V}_{i-1} \odot \cdots \odot \mathbf{V}_1 \right] \otimes \mathbf{I}_{p_i}, \quad (3)$$
$$\mathbf{W}_{(-j)} = \left[ \mathbf{W}_{D_y} \odot \cdots \odot \mathbf{W}_{j+1} \odot \mathbf{W}_{j-1} \odot \cdots \odot \mathbf{W}_1 \right] \otimes \mathbf{I}_{q_j}, \quad (4)$$

and let $\boldsymbol{\Sigma}_{x(i)}$ denote the covariance of $\mathbf{x}_{(i)} = \mathrm{vec}(\mathbf{X}_{(i)})$. Define $\boldsymbol{\Sigma}_{y(j)}$ and $\boldsymbol{\Sigma}_{x(i),y(j)}$ analogously. Then

$$\mathrm{Var}\left( \langle \mathcal{X}, \mathcal{V} \rangle \right) = \mathbf{v}_i^\top \mathbf{V}_{(-i)}^\top \boldsymbol{\Sigma}_{x(i)} \mathbf{V}_{(-i)} \mathbf{v}_i$$
$$\mathrm{Var}\left( \langle \mathcal{Y}, \mathcal{W} \rangle \right) = \mathbf{w}_j^\top \mathbf{W}_{(-j)}^\top \boldsymbol{\Sigma}_{y(j)} \mathbf{W}_{(-j)} \mathbf{w}_j$$
$$\mathrm{Cov}\left( \langle \mathcal{X}, \mathcal{V} \rangle, \langle \mathcal{Y}, \mathcal{W} \rangle \right) = \mathbf{v}_i^\top \mathbf{V}_{(-i)}^\top \boldsymbol{\Sigma}_{x(i),y(j)} \mathbf{W}_{(-j)} \mathbf{w}_j$$

for any $i \in [D_x]$ and $j \in [D_y]$, where $\mathbf{v}_i = \mathrm{vec}(\mathbf{V}_i)$ and $\mathbf{w}_j = \mathrm{vec}(\mathbf{W}_j)$.
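Proposition 1 rests on the identity $\langle \mathcal{X}, \mathcal{V} \rangle = \mathbf{x}_{(i)}^\top \mathbf{V}_{(-i)} \mathbf{v}_i$, from which the variance and covariance representations follow. The short sketch below (our own, assuming the Kolda-Bader matricization ordering) checks this identity numerically for a small three-mode example.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of two matrices with the same number of columns."""
    return np.column_stack([np.kron(A[:, r], B[:, r]) for r in range(A.shape[1])])

p, R, i = (3, 4, 5), 2, 1                  # a 3 x 4 x 5 tensor, rank 2, single out mode i = 2
X = np.random.randn(*p)
V1, V2, V3 = (np.random.randn(pi, R) for pi in p)

# Left-hand side: <X, V> with V = sum_r v_r^(1) o v_r^(2) o v_r^(3).
V_full = sum(np.einsum("a,b,c->abc", V1[:, r], V2[:, r], V3[:, r]) for r in range(R))
lhs = np.sum(X * V_full)

# Right-hand side: x_(i)^T V_(-i) v_i with V_(-i) = (V3 ⊙ V1) ⊗ I_{p_i}.
x_i = np.reshape(np.moveaxis(X, i, 0), (p[i], -1), order="F").flatten(order="F")
V_minus_i = np.kron(khatri_rao(V3, V1), np.eye(p[i]))
v_i = V2.flatten(order="F")
print(np.allclose(lhs, x_i @ V_minus_i @ v_i))   # expected: True
```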

3.2 |. Sample TCCA

Suppose we observe N pairs of i.i.d. tensor data (Xn,Yn), and we estimate V and W by solving the optimization problem

$$\text{maximize} \quad \hat{\rho}(\mathcal{V}, \mathcal{W}) = \frac{\mathbf{v}^\top \hat{\boldsymbol{\Sigma}}_{x,y} \mathbf{w}}{\sqrt{\mathbf{v}^\top \hat{\boldsymbol{\Sigma}}_x \mathbf{v}} \sqrt{\mathbf{w}^\top \hat{\boldsymbol{\Sigma}}_y \mathbf{w}}}, \quad (5)$$

where Σ^x, Σ^y and Σ^x,y are sample estimates of the corresponding covariances. Recall that CCA models can be estimated numerically by computing the solution to the following generalized eigenvalue problem:

$$\begin{pmatrix} \mathbf{0} & \hat{\boldsymbol{\Sigma}}_{x,y} \\ \hat{\boldsymbol{\Sigma}}_{y,x} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{v} \\ \mathbf{w} \end{pmatrix} = \rho \begin{pmatrix} \hat{\boldsymbol{\Sigma}}_x & \mathbf{0} \\ \mathbf{0} & \hat{\boldsymbol{\Sigma}}_y \end{pmatrix} \begin{pmatrix} \mathbf{v} \\ \mathbf{w} \end{pmatrix}.$$

This problem is guaranteed to have a solution if and only if the covariance matrices $\hat{\boldsymbol{\Sigma}}_x$ and $\hat{\boldsymbol{\Sigma}}_y$ are nonsingular. In practice, the sample size $N$ is smaller than the sizes of $\hat{\boldsymbol{\Sigma}}_x$ and $\hat{\boldsymbol{\Sigma}}_y$, which are $(\prod_{i=1}^{D_x} p_i) \times (\prod_{i=1}^{D_x} p_i)$ and $(\prod_{j=1}^{D_y} q_j) \times (\prod_{j=1}^{D_y} q_j)$, respectively. Therefore, the sample covariance matrices $\hat{\boldsymbol{\Sigma}}_x$ and $\hat{\boldsymbol{\Sigma}}_y$ are singular, and a solution cannot be obtained. As a remedy, several regularized estimation methods for obtaining nonsingular sample covariance matrices (Vinod, 1976; Ledoit & Wolf, 2004; González, Déjean, Martin, & Baccini, 2008; Ledoit & Wolf, 2012; Cai & Yuan, 2012; Bickel & Levina, 2008a, 2008b; Kubokawa et al., 2013; Srivastava & Reid, 2012) can be used.

If we take $\mathrm{Var}(\langle \mathcal{X}, \mathcal{V} \rangle)$, $\mathrm{Var}(\langle \mathcal{Y}, \mathcal{W} \rangle)$, and $\mathrm{Cov}(\langle \mathcal{X}, \mathcal{V} \rangle, \langle \mathcal{Y}, \mathcal{W} \rangle)$ to be $\mathbf{v}^\top \hat{\boldsymbol{\Sigma}}_x \mathbf{v}$, $\mathbf{w}^\top \hat{\boldsymbol{\Sigma}}_y \mathbf{w}$, and $\mathbf{v}^\top \hat{\boldsymbol{\Sigma}}_{x,y} \mathbf{w}$, respectively, then Proposition 1 suggests a block coordinate ascent algorithm where we update the factor matrices in pairs $(\mathbf{V}_i, \mathbf{W}_j)$, for different combinations of $i \in [D_x]$ and $j \in [D_y]$. To update the pair $(\mathbf{V}_i, \mathbf{W}_j)$, we solve the following problem:

$$\begin{aligned} \text{maximize} \quad & \mathbf{v}_i^\top \mathbf{V}_{(-i)}^\top \hat{\boldsymbol{\Sigma}}_{x(i),y(j)} \mathbf{W}_{(-j)} \mathbf{w}_j \\ \text{subject to} \quad & \mathbf{v}_i^\top \mathbf{V}_{(-i)}^\top \hat{\boldsymbol{\Sigma}}_{x(i)} \mathbf{V}_{(-i)} \mathbf{v}_i = 1, \quad \mathbf{w}_j^\top \mathbf{W}_{(-j)}^\top \hat{\boldsymbol{\Sigma}}_{y(j)} \mathbf{W}_{(-j)} \mathbf{w}_j = 1, \end{aligned} \quad (6)$$

which is a substantially smaller optimization problem over $p_i R_x + q_j R_y$ variables compared with any alternative "all-at-once" strategy that optimizes over all of the factor-matrix parameters simultaneously. Problem (6) includes $\hat{\boldsymbol{\Sigma}}_{x(i)}$, a permuted version of $\hat{\boldsymbol{\Sigma}}_x$ of size $(\prod_{i=1}^{D_x} p_i) \times (\prod_{i=1}^{D_x} p_i)$. However, we work with the "compressed" covariance matrix

$$\mathbf{V}_{(-i)}^\top \hat{\boldsymbol{\Sigma}}_{x(i)} \mathbf{V}_{(-i)} \in \mathbb{R}^{p_i R_x \times p_i R_x},$$

which is likely to be full rank when $N \geq p_i R_x$, instead of the singular matrix $\hat{\boldsymbol{\Sigma}}_{x(i)}$. This approach enables us to solve the generalized eigenvalue problem with matrices of size $(p_i R_x + q_j R_y)$-by-$(p_i R_x + q_j R_y)$. Algorithm 1 summarizes the estimation procedure, which comes with the following convergence guarantee. The proof is in the Supporting Information.

Proposition 2

If the matrices $\hat{\boldsymbol{\Sigma}}_x$ and $\hat{\boldsymbol{\Sigma}}_y$ are nonsingular, then the limit points of the iterate sequence generated by Algorithm 1 are canonical tensors of the sample TCCA problem.

Algorithm 1.

TCCA, assuming (i) CP structure on the canonical tensors $(\mathcal{V}, \mathcal{W})$ and (ii) no additional structure on the covariances $\mathrm{Var}(\mathrm{vec}(\mathcal{X}))$ and $\mathrm{Var}(\mathrm{vec}(\mathcal{Y}))$

Initialize $\mathbf{v}_i^{(0)}$ and $\mathbf{w}_j^{(0)}$, for $i \in [D_x]$ and $j \in [D_y]$
$t \leftarrow 0$
repeat
 Select $(i, j) \in [D_x] \times [D_y]$
 $\mathbf{V}_{(-i)}^{(t)} \leftarrow \left[ \mathbf{V}_{D_x}^{(t)} \odot \cdots \odot \mathbf{V}_{i+1}^{(t)} \odot \mathbf{V}_{i-1}^{(t)} \odot \cdots \odot \mathbf{V}_1^{(t)} \right] \otimes \mathbf{I}_{p_i}$
 $\mathbf{W}_{(-j)}^{(t)} \leftarrow \left[ \mathbf{W}_{D_y}^{(t)} \odot \cdots \odot \mathbf{W}_{j+1}^{(t)} \odot \mathbf{W}_{j-1}^{(t)} \odot \cdots \odot \mathbf{W}_1^{(t)} \right] \otimes \mathbf{I}_{q_j}$
 $\mathbf{C}_x^{(t)} \leftarrow \mathbf{V}_{(-i)}^{(t)\top} \hat{\boldsymbol{\Sigma}}_{x(i)} \mathbf{V}_{(-i)}^{(t)}$
 $\mathbf{C}_y^{(t)} \leftarrow \mathbf{W}_{(-j)}^{(t)\top} \hat{\boldsymbol{\Sigma}}_{y(j)} \mathbf{W}_{(-j)}^{(t)}$
 $\mathbf{C}_{x,y}^{(t)} \leftarrow \mathbf{V}_{(-i)}^{(t)\top} \hat{\boldsymbol{\Sigma}}_{x(i),y(j)} \mathbf{W}_{(-j)}^{(t)}$
 Generalized eigen-decomposition: $\begin{pmatrix} \mathbf{0} & \mathbf{C}_{x,y}^{(t)} \\ \mathbf{C}_{x,y}^{(t)\top} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{v}_i^{(t+1)} \\ \mathbf{w}_j^{(t+1)} \end{pmatrix} = \rho^{(t+1)} \begin{pmatrix} \mathbf{C}_x^{(t)} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_y^{(t)} \end{pmatrix} \begin{pmatrix} \mathbf{v}_i^{(t+1)} \\ \mathbf{w}_j^{(t+1)} \end{pmatrix}$
 $t \leftarrow t + 1$
until $\rho^{(t)}$ converges
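The following is a rough Python sketch of Algorithm 1 (our own code, not the authors' implementation). It uses a random initialization, a fixed number of sweeps in place of monitoring the convergence of $\rho^{(t)}$, and a small ridge term added to the compressed covariance matrices for numerical stability; these are all simplifying assumptions on our part.

```python
import numpy as np
from scipy.linalg import eigh

def unfold_vec(T, i):
    """vec of the mode-i matricization of one tensor sample (Kolda-Bader ordering)."""
    return np.reshape(np.moveaxis(T, i, 0), (T.shape[i], -1), order="F").flatten(order="F")

def khatri_rao(mats):
    """Column-wise Kronecker product of matrices sharing the same column count."""
    out = mats[0]
    for M in mats[1:]:
        out = np.column_stack([np.kron(out[:, r], M[:, r]) for r in range(out.shape[1])])
    return out

def minus_mat(factors, i):
    """V_{(-i)} = (V_D ⊙ ... ⊙ V_{i+1} ⊙ V_{i-1} ⊙ ... ⊙ V_1) ⊗ I_{p_i}."""
    others = [factors[k] for k in reversed(range(len(factors))) if k != i]
    return np.kron(khatri_rao(others), np.eye(factors[i].shape[0]))

def cov(A, B):
    """Sample cross-covariance of row-wise observations in A (N x a) and B (N x b)."""
    return (A - A.mean(0)).T @ (B - B.mean(0)) / A.shape[0]

def tcca(Xs, Ys, Rx, Ry, n_sweeps=20, ridge=1e-3, seed=0):
    """Block coordinate ascent sketch of Algorithm 1 (fixed sweep count, random start)."""
    rng = np.random.default_rng(seed)
    px, qy = Xs.shape[1:], Ys.shape[1:]
    V = [rng.standard_normal((p, Rx)) for p in px]
    W = [rng.standard_normal((q, Ry)) for q in qy]
    for _ in range(n_sweeps):
        for i in range(len(px)):
            for j in range(len(qy)):
                Xi = np.stack([unfold_vec(x, i) for x in Xs])      # rows are x_(i)
                Yj = np.stack([unfold_vec(y, j) for y in Ys])      # rows are y_(j)
                Vm, Wm = minus_mat(V, i), minus_mat(W, j)
                Cx = Vm.T @ cov(Xi, Xi) @ Vm + ridge * np.eye(Vm.shape[1])
                Cy = Wm.T @ cov(Yj, Yj) @ Wm + ridge * np.eye(Wm.shape[1])
                Cxy = Vm.T @ cov(Xi, Yj) @ Wm
                a = Cx.shape[0]
                A = np.block([[np.zeros_like(Cx), Cxy], [Cxy.T, np.zeros_like(Cy)]])
                B = np.block([[Cx, np.zeros_like(Cxy)], [np.zeros_like(Cxy.T), Cy]])
                rho, vecs = eigh(A, B)                             # generalized eigenproblem
                vi, wj = vecs[:a, -1], vecs[a:, -1]                # top eigenpair
                vi /= np.sqrt(vi @ Cx @ vi)                        # unit-variance scaling
                wj /= np.sqrt(wj @ Cy @ wj)
                V[i] = vi.reshape(px[i], Rx, order="F")            # v_i = vec(V_i)
                W[j] = wj.reshape(qy[j], Ry, order="F")
    return V, W, rho[-1]
```

For example, `tcca(Xs, Ys, Rx=1, Ry=2)` with `Xs` of shape `(N, p1, p2)` and `Ys` of shape `(N, q1, q2)` would return the estimated factor matrices and the final fitted correlation.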

4 |. TCCA WITH SEPARABLE COVARIANCE STRUCTURE

4.1 |. Population TCCA with separable covariances

Hoff (2011) proposed the array normal distribution with separable covariance structure. Separable marginal covariances are defined as

$$\boldsymbol{\Sigma}_x = \mathrm{Var}(\mathbf{x}) = \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} \quad \text{and} \quad \boldsymbol{\Sigma}_y = \mathrm{Var}(\mathbf{y}) = \boldsymbol{\Sigma}_{y,D_y} \otimes \cdots \otimes \boldsymbol{\Sigma}_{y,1}, \quad (7)$$

where $\boldsymbol{\Sigma}_{x,i} \in \mathbb{R}^{p_i \times p_i}$ for $i \in [D_x]$ and $\boldsymbol{\Sigma}_{y,j} \in \mathbb{R}^{q_j \times q_j}$ for $j \in [D_y]$. Then the overall covariance of the population model is

$$\mathrm{Var}\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} & \boldsymbol{\Sigma}_{x,y} \\ \boldsymbol{\Sigma}_{y,x} & \boldsymbol{\Sigma}_{y,D_y} \otimes \cdots \otimes \boldsymbol{\Sigma}_{y,1} \end{pmatrix}. \quad (8)$$

Intuitively, $\boldsymbol{\Sigma}_{x,i}$ summarizes the covariance along the mode $i$ fibres of the tensor $\mathcal{X}$, and $\boldsymbol{\Sigma}_{y,j}$ summarizes the covariance along the mode $j$ fibres of the tensor $\mathcal{Y}$. The following result gives a representation of $\mathrm{Var}(\langle \mathcal{X}, \mathcal{V} \rangle)$ and $\mathrm{Var}(\langle \mathcal{Y}, \mathcal{W} \rangle)$ in the presence of the separable covariance structure (7) that we will leverage in our estimation algorithm.

Proposition 3

Let $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ and $\mathcal{Y} \in \mathbb{R}^{q_1 \times \cdots \times q_{D_y}}$ be two random tensors admitting the separable covariance structure (7), and let $\mathcal{V} = [\![ \mathbf{V}_1, \ldots, \mathbf{V}_{D_x} ]\!]$ and $\mathcal{W} = [\![ \mathbf{W}_1, \ldots, \mathbf{W}_{D_y} ]\!]$ be two constant tensors. Define

$$\mathbf{H}_{x,-i} = \mathop{\ast}_{i' \neq i} \left( \mathbf{V}_{i'}^\top \boldsymbol{\Sigma}_{x,i'} \mathbf{V}_{i'} \right) \quad \text{and} \quad \mathbf{H}_{y,-j} = \mathop{\ast}_{j' \neq j} \left( \mathbf{W}_{j'}^\top \boldsymbol{\Sigma}_{y,j'} \mathbf{W}_{j'} \right).$$

Then

$$\mathrm{Var}\left( \langle \mathcal{X}, \mathcal{V} \rangle \right) = \mathbf{v}_i^\top \left( \mathbf{H}_{x,-i} \otimes \boldsymbol{\Sigma}_{x,i} \right) \mathbf{v}_i \quad \text{and} \quad \mathrm{Var}\left( \langle \mathcal{Y}, \mathcal{W} \rangle \right) = \mathbf{w}_j^\top \left( \mathbf{H}_{y,-j} \otimes \boldsymbol{\Sigma}_{y,j} \right) \mathbf{w}_j,$$

for any i ∈ [Dx] and j ∈ [Dy], where vi = vec(Vi) and wj = vec(Wj).

With the separable covariance structure and the CP structure on the canonical tensors $(\mathcal{V}, \mathcal{W})$, the objective function of the TCCA population model (1) greatly simplifies. Note that the separable covariance structure may not hold for real data, in which case the covariance estimates are biased. Despite this drawback, this parsimonious structure is worth considering due to the stability that it can impart by reducing estimation variance.
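As a numerical sanity check of Proposition 3 (our own sketch, not from the paper), the identity can be verified on a small three-mode example with randomly generated separable covariance factors:

```python
import numpy as np

def khatri_rao(mats):
    """Khatri-Rao product of a list of matrices sharing the same column count."""
    out = mats[0]
    for M in mats[1:]:
        out = np.column_stack([np.kron(out[:, r], M[:, r]) for r in range(out.shape[1])])
    return out

rng = np.random.default_rng(0)
p, R, i = (3, 4, 5), 2, 1                       # three modes, rank 2, single out mode i = 2
V = [rng.standard_normal((pi, R)) for pi in p]
Sig = []
for pi in p:
    A = rng.standard_normal((pi, pi))
    Sig.append(A @ A.T + np.eye(pi))            # a random SPD covariance factor per mode

# Left-hand side: vec(V)^T Sigma_x vec(V) with Sigma_x = Sig_3 ⊗ Sig_2 ⊗ Sig_1.
vecV = khatri_rao(V[::-1]) @ np.ones(R)         # vec of the CP tensor V
Sigma_x = np.kron(np.kron(Sig[2], Sig[1]), Sig[0])
lhs = vecV @ Sigma_x @ vecV

# Right-hand side (Proposition 3): v_i^T (H_{x,-i} ⊗ Sig_i) v_i.
H = np.ones((R, R))
for k in range(len(p)):
    if k != i:
        H *= V[k].T @ Sig[k] @ V[k]             # Hadamard product over the other modes
v_i = V[i].flatten(order="F")
print(np.allclose(lhs, v_i @ np.kron(H, Sig[i]) @ v_i))   # expected: True
```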

4.2 |. Sample TCCA with separable covariances

Given data (Xn,Yn), n ∈ [N], the goal is to maximize the sample canonical correlation (5) under assumptions that (a) V and W have the CP decomposition structure and (b) X and Y admit the separable covariance structure. We follow the same strategy as sample TCCA in Section 3.2, updating parameters in pairs of factor matrices (Vi, Wj). To update the pair (Vi, Wj), we solve the subproblem

$$\begin{aligned} \text{maximize} \quad & \mathbf{v}_i^\top \mathbf{V}_{(-i)}^\top \hat{\boldsymbol{\Sigma}}_{x(i),y(j)} \mathbf{W}_{(-j)} \mathbf{w}_j, \\ \text{subject to} \quad & \mathbf{v}_i^\top \left( \mathbf{H}_{x,-i} \otimes \hat{\boldsymbol{\Sigma}}_{x,i} \right) \mathbf{v}_i = 1, \quad \mathbf{w}_j^\top \left( \mathbf{H}_{y,-j} \otimes \hat{\boldsymbol{\Sigma}}_{y,j} \right) \mathbf{w}_j = 1, \end{aligned}$$

where Hx,−i and Hy,−j, defined in Proposition 3, are evaluated at current iterates Vi′ where i′ ≠ i, and Wj′, where j′ ≠ j.

By assuming the separable covariance structure, Proposition 3 enables us to greatly simplify the variance calculations for $\mathrm{Var}(\langle \mathcal{X}, \mathcal{V} \rangle)$ and $\mathrm{Var}(\langle \mathcal{Y}, \mathcal{W} \rangle)$. Note that the calculation of $\mathbf{V}_{(-i)}^\top \hat{\boldsymbol{\Sigma}}_{x(i)} \mathbf{V}_{(-i)}$ and $\mathbf{W}_{(-j)}^\top \hat{\boldsymbol{\Sigma}}_{y(j)} \mathbf{W}_{(-j)}$ in the subproblem of the sample TCCA costs $(R_x \prod_{i=1}^{D_x} p_i)^2 + (R_y \prod_{j=1}^{D_y} q_j)^2$ flops. In contrast, the calculation of the matrices $\mathbf{H}_{x,-i} \otimes \hat{\boldsymbol{\Sigma}}_{x,i}$ and $\mathbf{H}_{y,-j} \otimes \hat{\boldsymbol{\Sigma}}_{y,j}$ in the subproblem of the sample TCCA with the separable covariance structure costs only $(R_x p_i)^2 + (R_y q_j)^2$ flops. Algorithm 2 summarizes the estimation procedure under the separable covariance assumption (7). Like Algorithm 1, Algorithm 2 also comes with convergence guarantees. The proof is in the Supporting Information.

Proposition 4

If the matrices Σ^x and Σ^y are nonsingular, then the limit points of the iterate sequence generated by Algorithm 2 are canonical tensors of the sample TCCA problem with separable covariances.

Algorithm 2.

TCCA for two tensors of modes Dx and Dy, respectively, assuming (i) CP structure on canonical correlation tensors (V,W) and (ii) separable covariances Var(vec(X)) and Var(vec(Y))

Initialize $\mathbf{V}_i^{(0)}$, $i \in [D_x]$, and $\mathbf{W}_j^{(0)}$, $j \in [D_y]$
$\mathbf{H}_x^{(0)} \leftarrow \mathop{\ast}_i \left( \mathbf{V}_i^{(0)\top} \hat{\boldsymbol{\Sigma}}_{x,i} \mathbf{V}_i^{(0)} \right)$
$\mathbf{H}_y^{(0)} \leftarrow \mathop{\ast}_j \left( \mathbf{W}_j^{(0)\top} \hat{\boldsymbol{\Sigma}}_{y,j} \mathbf{W}_j^{(0)} \right)$
$t \leftarrow 0$
repeat
 Select $(i, j) \in [D_x] \times [D_y]$
 $\mathbf{H}_{x,-i}^{(t)} \leftarrow \mathbf{H}_x^{(t)} \oslash \left( \mathbf{V}_i^{(t)\top} \hat{\boldsymbol{\Sigma}}_{x,i} \mathbf{V}_i^{(t)} \right)$
 $\mathbf{C}_x^{(t)} \leftarrow \mathbf{H}_{x,-i}^{(t)} \otimes \hat{\boldsymbol{\Sigma}}_{x,i}$
 $\mathbf{H}_{y,-j}^{(t)} \leftarrow \mathbf{H}_y^{(t)} \oslash \left( \mathbf{W}_j^{(t)\top} \hat{\boldsymbol{\Sigma}}_{y,j} \mathbf{W}_j^{(t)} \right)$
 $\mathbf{C}_y^{(t)} \leftarrow \mathbf{H}_{y,-j}^{(t)} \otimes \hat{\boldsymbol{\Sigma}}_{y,j}$
 $\mathbf{V}_{(-i)}^{(t)} \leftarrow \left[ \mathbf{V}_{D_x}^{(t)} \odot \cdots \odot \mathbf{V}_{i+1}^{(t)} \odot \mathbf{V}_{i-1}^{(t)} \odot \cdots \odot \mathbf{V}_1^{(t)} \right] \otimes \mathbf{I}_{p_i}$
 $\mathbf{W}_{(-j)}^{(t)} \leftarrow \left[ \mathbf{W}_{D_y}^{(t)} \odot \cdots \odot \mathbf{W}_{j+1}^{(t)} \odot \mathbf{W}_{j-1}^{(t)} \odot \cdots \odot \mathbf{W}_1^{(t)} \right] \otimes \mathbf{I}_{q_j}$
 $\mathbf{C}_{x,y}^{(t)} \leftarrow \mathbf{V}_{(-i)}^{(t)\top} \hat{\boldsymbol{\Sigma}}_{x(i),y(j)} \mathbf{W}_{(-j)}^{(t)}$
 Solve the following generalized eigenvalue problem: $\begin{pmatrix} \mathbf{0} & \mathbf{C}_{x,y}^{(t)} \\ \mathbf{C}_{x,y}^{(t)\top} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{v}_i^{(t+1)} \\ \mathbf{w}_j^{(t+1)} \end{pmatrix} = \rho^{(t+1)} \begin{pmatrix} \mathbf{C}_x^{(t)} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_y^{(t)} \end{pmatrix} \begin{pmatrix} \mathbf{v}_i^{(t+1)} \\ \mathbf{w}_j^{(t+1)} \end{pmatrix}$
 $\mathbf{H}_x^{(t+1)} \leftarrow \mathbf{H}_{x,-i}^{(t)} \ast \left( \mathbf{V}_i^{(t+1)\top} \hat{\boldsymbol{\Sigma}}_{x,i} \mathbf{V}_i^{(t+1)} \right)$
 $\mathbf{H}_y^{(t+1)} \leftarrow \mathbf{H}_{y,-j}^{(t)} \ast \left( \mathbf{W}_j^{(t+1)\top} \hat{\boldsymbol{\Sigma}}_{y,j} \mathbf{W}_j^{(t+1)} \right)$
 $t \leftarrow t + 1$
until $\rho^{(t)}$ converges

4.2.1 |. Estimation of separable covariance matrices

Algorithm 2 relies on the sample estimate of separable covariances Σ^x,i and Σ^y,j, and the unstructured cross-covariance Σ^x,y. The following lemma will be useful in computing a consistent estimator for covariance matrices with the separable structure.

Lemma 1

If a random tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ has mean zero and separable covariance $\mathrm{Var}(\mathbf{x}) = \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1}$, then

$$\mathrm{E}\left( \mathbf{X}_{(i)} \mathbf{X}_{(i)}^\top \right) = \left( \prod_{i' \neq i} \mathrm{tr}(\boldsymbol{\Sigma}_{x,i'}) \right) \boldsymbol{\Sigma}_{x,i} \quad \text{and} \quad \mathrm{E}\left( \|\mathbf{x}\|_2^2 \right) = \prod_{i=1}^{D_x} \mathrm{tr}(\boldsymbol{\Sigma}_{x,i}).$$

Proof. For the first identity, see Hoff (2011, Proposition 2.1). For the second identity, $\mathrm{E}(\|\mathbf{x}\|_2^2) = \mathrm{tr}(\mathrm{Var}(\mathbf{x})) = \mathrm{tr}(\boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1}) = \prod_i \mathrm{tr}(\boldsymbol{\Sigma}_{x,i})$. □

Given $N$ i.i.d. observations $(\mathbf{x}_n, \mathbf{y}_n)$, the estimators $\hat{r}_x = \frac{1}{N}\sum_{n=1}^{N} \|\mathbf{x}_n - \bar{\mathbf{x}}\|_2^2$ and $\hat{r}_y = \frac{1}{N}\sum_{n=1}^{N} \|\mathbf{y}_n - \bar{\mathbf{y}}\|_2^2$ consistently estimate $\prod_{i=1}^{D_x} \mathrm{tr}(\boldsymbol{\Sigma}_{x,i})$ and $\prod_{j=1}^{D_y} \mathrm{tr}(\boldsymbol{\Sigma}_{y,j})$, respectively, where $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are the sample means of the vectorized tensors $\mathbf{x}_n$ and $\mathbf{y}_n$. We propose the following covariance estimators:

$$\hat{\boldsymbol{\Sigma}}_{x,i} = \frac{1}{N \hat{r}_x^{(D_x-1)/D_x}} \sum_{n=1}^{N} \left( \mathbf{X}_{n(i)} - \bar{\mathbf{X}}_{(i)} \right) \left( \mathbf{X}_{n(i)} - \bar{\mathbf{X}}_{(i)} \right)^\top, \quad \hat{\boldsymbol{\Sigma}}_{y,j} = \frac{1}{N \hat{r}_y^{(D_y-1)/D_y}} \sum_{n=1}^{N} \left( \mathbf{Y}_{n(j)} - \bar{\mathbf{Y}}_{(j)} \right) \left( \mathbf{Y}_{n(j)} - \bar{\mathbf{Y}}_{(j)} \right)^\top,$$
$$\hat{\boldsymbol{\Sigma}}_{x,y} = \frac{1}{N} \sum_{n=1}^{N} \left( \mathbf{x}_n - \bar{\mathbf{x}} \right) \left( \mathbf{y}_n - \bar{\mathbf{y}} \right)^\top,$$

where $i \in [D_x]$ and $j \in [D_y]$. The matrices $\mathbf{X}_{n(i)}$ and $\bar{\mathbf{X}}_{(i)}$ denote the mode $i$ matricization of the $n$th observation and the sample mean of the mode $i$ matricized tensors $\mathcal{X}_n$, respectively. The matrices $\mathbf{Y}_{n(j)}$ and $\bar{\mathbf{Y}}_{(j)}$ denote the analogous matricizations.
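A minimal sketch of these mode-wise covariance estimators (our own code; the function names are ours) is given below for a single tensor dataset:

```python
import numpy as np

def unfold(T, i):
    """Mode-i matricization (mode-i fibres as columns)."""
    return np.reshape(np.moveaxis(T, i, 0), (T.shape[i], -1), order="F")

def separable_cov_estimates(Xs):
    """Mode-wise covariance estimates for samples Xs of shape (N, p_1, ..., p_D)."""
    N, dims = Xs.shape[0], Xs.shape[1:]
    D = len(dims)
    xc = Xs.reshape(N, -1) - Xs.reshape(N, -1).mean(0)           # centred vectorizations
    r_hat = np.mean(np.sum(xc ** 2, axis=1))                     # estimates prod_i tr(Sigma_i)
    Xbar = Xs.mean(axis=0)
    Sigmas = []
    for i in range(D):
        S = sum(unfold(Xn - Xbar, i) @ unfold(Xn - Xbar, i).T for Xn in Xs)
        Sigmas.append(S / (N * r_hat ** ((D - 1) / D)))          # p_i x p_i factor estimate
    return Sigmas, r_hat

# Usage on synthetic samples of 4 x 5 x 3 tensors.
Xs = np.random.randn(200, 4, 5, 3)
Sigmas, r_hat = separable_cov_estimates(Xs)
print([S.shape for S in Sigmas])   # [(4, 4), (5, 5), (3, 3)]
```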

Unfortunately, the separable covariance structure (8) is not identifiable in the individual Σx,i due to scaling indeterminacy. Therefore, Σx,i cannot be consistently estimated. Note, however, that we do not need to consistently estimate the individual Σx,i in order to consistently estimate their Kronecker product. To see this, note that by Slutsky’s theorem,

$$\hat{\boldsymbol{\Sigma}}_{x,D_x} \otimes \cdots \otimes \hat{\boldsymbol{\Sigma}}_{x,1} \;\overset{P}{\longrightarrow}\; \frac{\left( \prod_i \mathrm{tr}(\boldsymbol{\Sigma}_{x,i}) \right)^{D_x - 1}}{\left( \prod_i \mathrm{tr}(\boldsymbol{\Sigma}_{x,i}) \right)^{D_x - 1}} \, \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} = \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1}$$

consistently estimates Var(x).

Note that Hoff (2011) proposes an iterative algorithm for finding the maximum likelihood estimate (MLE) on the basis of the array normal assumption. The MLE may improve upon the above estimates when the data actually come from an array normal distribution.

5 |. SPARSE TCCA

We may also recover sparse canonical tensors for both TCCA and scTCCA to enhance the interpretability of the estimated canonical tensors. Following the iterative thresholding strategy introduced by Ma (2013) for sparse principal component analysis, by Wang, Gu, Ning, and Liu (2015) for sparse expectation-maximization algorithms, and by Tan et al. (2018) for generalized eigenvalue problems, we incorporate a hard-thresholding step in Algorithms 1 and 2, as follows:

$$\mathbf{v}_i^{(t+1)} \leftarrow \Theta_\lambda\left( \mathbf{v}_i^{(t+1)} \right) \quad \text{and} \quad \mathbf{w}_j^{(t+1)} \leftarrow \Theta_\lambda\left( \mathbf{w}_j^{(t+1)} \right),$$

where Θλ(v) performs element-wise hard-thresholding on v, namely, the ith element of Θλ(v) is vi if |vi| > λ and 0 otherwise. In the simulation studies of Section 7, we employ fivefold cross-validation with a grid point search to choose the tuning parameter λ, following Tan, Wang, Liu, and Zhang (2018). Instead of a grid point search, a random search would be another choice (Bergstra & Bengio, 2012).
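For concreteness, the element-wise hard-thresholding operator $\Theta_\lambda$ takes only a couple of lines (our own sketch):

```python
import numpy as np

def hard_threshold(v, lam):
    """Theta_lambda(v): keep entries with |v_k| > lambda, zero out the rest."""
    return np.where(np.abs(v) > lam, v, 0.0)

v = np.array([0.05, -0.8, 0.3, -0.02, 1.2])
print(hard_threshold(v, 0.1))   # [ 0.  -0.8  0.3  0.   1.2]
```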

6 |. PROBABILISTIC MODEL FOR TCCA

Bach and Jordan (2006) give a probabilistic interpretation of classic CCA, which enables us to simulate data with desired canonical vectors and canonical correlations. For TCCA without any assumption on the covariance structure, the construction is the same as for regular CCA, treating the vectorized canonical tensors as canonical vectors. In this section, we first discuss how to generate data from $d$ given canonical correlations $\boldsymbol{\rho}_d = (\rho_1, \ldots, \rho_d)^\top$ and their corresponding canonical vectors in the columns of matrices $(\mathbf{V}_d, \mathbf{W}_d)$, and then we extend the construction to TCCA with the separable covariance assumption. Let $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ be two covariance matrices such that $\mathbf{V}_d^\top \boldsymbol{\Sigma}_x \mathbf{V}_d = \mathbf{W}_d^\top \boldsymbol{\Sigma}_y \mathbf{W}_d = \mathbf{I}_d$. Define two linear transformations $\mathbf{A}_x = \boldsymbol{\Sigma}_x \mathbf{V}_d \mathbf{M}_x$ and $\mathbf{A}_y = \boldsymbol{\Sigma}_y \mathbf{W}_d \mathbf{M}_y$, where $\mathbf{M}_x, \mathbf{M}_y \in \mathbb{R}^{d \times d}$ are arbitrary matrices such that $\mathbf{M}_x \mathbf{M}_y^\top = \mathrm{diag}(\boldsymbol{\rho}_d)$ and their spectral norms are less than 1. We consider the latent factor model

$$\mathbf{z} \sim N(\mathbf{0}, \mathbf{I}_d)$$
$$\mathbf{x} \mid \mathbf{z} \sim N\left( \mathbf{A}_x \mathbf{z} + \boldsymbol{\mu}_x, \; \boldsymbol{\Sigma}_x - \mathbf{A}_x \mathbf{A}_x^\top \right)$$
$$\mathbf{y} \mid \mathbf{z} \sim N\left( \mathbf{A}_y \mathbf{z} + \boldsymbol{\mu}_y, \; \boldsymbol{\Sigma}_y - \mathbf{A}_y \mathbf{A}_y^\top \right).$$

The joint distribution of (x, y) is

$$\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} \sim N\left( \begin{pmatrix} \boldsymbol{\mu}_x \\ \boldsymbol{\mu}_y \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{xy}^\top & \boldsymbol{\Sigma}_y \end{pmatrix} \right),$$

where $\boldsymbol{\Sigma}_{xy} = \boldsymbol{\Sigma}_x \mathbf{V}_d \, \mathrm{diag}(\boldsymbol{\rho}_d) \, \mathbf{W}_d^\top \boldsymbol{\Sigma}_y$.

Now, we discuss how to construct Σx and Σy from (Vd, Wd), which is not described in Bach and Jordan (2006). Let Vd = QxRx be the thin QR decomposition of Vd. Then

$$\boldsymbol{\Sigma}_x = \mathbf{Q}_x \mathbf{R}_x^{-\top} \mathbf{R}_x^{-1} \mathbf{Q}_x^\top + \mathbf{T}_x \left( \mathbf{I}_p - \mathbf{Q}_x \mathbf{Q}_x^\top \right) \mathbf{T}_x^\top$$

satisfies $\mathbf{V}_d^\top \boldsymbol{\Sigma}_x \mathbf{V}_d = \mathbf{I}_d$ for arbitrary $\mathbf{T}_x \in \mathbb{R}^{p \times p}$. Similarly, let $\mathbf{W}_d = \mathbf{Q}_y \mathbf{R}_y$ be the thin QR decomposition of $\mathbf{W}_d$. Then

$$\boldsymbol{\Sigma}_y = \mathbf{Q}_y \mathbf{R}_y^{-\top} \mathbf{R}_y^{-1} \mathbf{Q}_y^\top + \mathbf{T}_y \left( \mathbf{I}_q - \mathbf{Q}_y \mathbf{Q}_y^\top \right) \mathbf{T}_y^\top$$

satisfies $\mathbf{W}_d^\top \boldsymbol{\Sigma}_y \mathbf{W}_d = \mathbf{I}_d$ for arbitrary $\mathbf{T}_y \in \mathbb{R}^{q \times q}$. In this notation, the joint covariance is

$$\mathrm{Var}\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} = \begin{pmatrix} \mathbf{Q}_x \mathbf{R}_x^{-\top} \mathbf{R}_x^{-1} \mathbf{Q}_x^\top & \mathbf{Q}_x \mathbf{R}_x^{-\top} \mathrm{diag}(\boldsymbol{\rho}_d) \mathbf{R}_y^{-1} \mathbf{Q}_y^\top \\ \mathbf{Q}_y \mathbf{R}_y^{-\top} \mathrm{diag}(\boldsymbol{\rho}_d) \mathbf{R}_x^{-1} \mathbf{Q}_x^\top & \mathbf{Q}_y \mathbf{R}_y^{-\top} \mathbf{R}_y^{-1} \mathbf{Q}_y^\top \end{pmatrix} + \begin{pmatrix} \mathbf{T}_x \left( \mathbf{I}_p - \mathbf{Q}_x \mathbf{Q}_x^\top \right) \mathbf{T}_x^\top & \mathbf{0} \\ \mathbf{0} & \mathbf{T}_y \left( \mathbf{I}_q - \mathbf{Q}_y \mathbf{Q}_y^\top \right) \mathbf{T}_y^\top \end{pmatrix}. \quad (9)$$

Tx and Ty are free parameters that adjust the noise level in x and y, respectively. The normal generative model is

$$\mathbf{z} \sim N(\mathbf{0}, \mathbf{I}_d)$$
$$\mathbf{x} \mid \mathbf{z} \sim N\left( \mathbf{Q}_x \mathbf{R}_x^{-\top} \mathbf{M}_x \mathbf{z} + \boldsymbol{\mu}_x, \; \boldsymbol{\Sigma}_x \right)$$
$$\mathbf{y} \mid \mathbf{z} \sim N\left( \mathbf{Q}_y \mathbf{R}_y^{-\top} \mathbf{M}_y \mathbf{z} + \boldsymbol{\mu}_y, \; \boldsymbol{\Sigma}_y \right),$$

where Σx and Σy are as in Equation (9). The normality is not essential for construction of this covariance structure.
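To make the generative recipe concrete, here is a small sketch (ours, not the authors' code) for the vector case. It uses the simple, admissible choices $\mathbf{M}_x = \mathbf{M}_y = \mathrm{diag}(\boldsymbol{\rho}_d)^{1/2}$ and $\mathbf{T}_x = \tau_x \mathbf{I}$, $\mathbf{T}_y = \tau_y \mathbf{I}$, takes the means to be zero, and samples from the latent factor model above with conditional covariances $\boldsymbol{\Sigma}_x - \mathbf{A}_x \mathbf{A}_x^\top$ and $\boldsymbol{\Sigma}_y - \mathbf{A}_y \mathbf{A}_y^\top$ so that the marginal covariances equal $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$.

```python
import numpy as np

def make_cov(U, tau):
    """A covariance Sigma with U^T Sigma U = I_d, built from the thin QR of U."""
    Q, R = np.linalg.qr(U)                      # thin QR: U = Q R
    Rinv = np.linalg.inv(R)
    P_perp = np.eye(U.shape[0]) - Q @ Q.T       # projector onto the complement of span(U)
    return Q @ Rinv.T @ Rinv @ Q.T + tau**2 * P_perp

def simulate_cca_data(Vd, Wd, rho, N, tau_x=1.0, tau_y=1.0, seed=None):
    """Draw N pairs (x, y) whose canonical vectors are Vd, Wd with correlations rho."""
    rng = np.random.default_rng(seed)
    p, d = Vd.shape
    q = Wd.shape[0]
    Sx, Sy = make_cov(Vd, tau_x), make_cov(Wd, tau_y)
    Mx = My = np.diag(np.sqrt(rho))             # Mx My^T = diag(rho), spectral norm <= 1
    Ax, Ay = Sx @ Vd @ Mx, Sy @ Wd @ My
    Z = rng.standard_normal((N, d))
    X = Z @ Ax.T + rng.multivariate_normal(np.zeros(p), Sx - Ax @ Ax.T, size=N)
    Y = Z @ Ay.T + rng.multivariate_normal(np.zeros(q), Sy - Ay @ Ay.T, size=N)
    return X, Y

# Usage: a single canonical pair (d = 1) with canonical correlation 0.9.
Vd, Wd = np.random.randn(20, 1), np.random.randn(30, 1)
X, Y = simulate_cca_data(Vd, Wd, rho=np.array([0.9]), N=5000, seed=1)
```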

The separable covariance structure brings a complication. To account for this parsimonious structure in the generative model, we construct marginal factor matrices $\boldsymbol{\Sigma}_{x,i}$ and $\boldsymbol{\Sigma}_{y,j}$ that satisfy $\mathbf{V}_i^\top \boldsymbol{\Sigma}_{x,i} \mathbf{V}_i = R_x^{-1/D_x} \mathbf{I}_{R_x}$ and $\mathbf{W}_j^\top \boldsymbol{\Sigma}_{y,j} \mathbf{W}_j = R_y^{-1/D_y} \mathbf{I}_{R_y}$. Let $\mathbf{V}_i = \mathbf{Q}_{x,i} \mathbf{R}_{x,i}$ and $\mathbf{W}_j = \mathbf{Q}_{y,j} \mathbf{R}_{y,j}$ be the thin QR decompositions. Then

$$\boldsymbol{\Sigma}_{x,i} = R_x^{-1/D_x} \mathbf{Q}_{x,i} \mathbf{R}_{x,i}^{-\top} \mathbf{R}_{x,i}^{-1} \mathbf{Q}_{x,i}^\top + \mathbf{T}_{x,i} \left( \mathbf{I}_{p_i} - \mathbf{Q}_{x,i} \mathbf{Q}_{x,i}^\top \right) \mathbf{T}_{x,i}^\top$$
$$\boldsymbol{\Sigma}_{y,j} = R_y^{-1/D_y} \mathbf{Q}_{y,j} \mathbf{R}_{y,j}^{-\top} \mathbf{R}_{y,j}^{-1} \mathbf{Q}_{y,j}^\top + \mathbf{T}_{y,j} \left( \mathbf{I}_{q_j} - \mathbf{Q}_{y,j} \mathbf{Q}_{y,j}^\top \right) \mathbf{T}_{y,j}^\top$$

satisfy the conditions for arbitrary $\mathbf{T}_{x,i} \in \mathbb{R}^{p_i \times p_i}$ and $\mathbf{T}_{y,j} \in \mathbb{R}^{q_j \times q_j}$. We then have the desired property

$$\mathrm{vec}(\mathcal{V})^\top \left( \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} \right) \mathrm{vec}(\mathcal{V}) = \mathbf{1}_{R_x}^\top \left( \mathbf{V}_{D_x} \odot \cdots \odot \mathbf{V}_1 \right)^\top \left( \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} \right) \left( \mathbf{V}_{D_x} \odot \cdots \odot \mathbf{V}_1 \right) \mathbf{1}_{R_x} = \mathbf{1}_{R_x}^\top \left( \mathop{\ast}_i \mathbf{V}_i^\top \boldsymbol{\Sigma}_{x,i} \mathbf{V}_i \right) \mathbf{1}_{R_x} = R_x^{-1} \mathbf{1}_{R_x}^\top \mathbf{I}_{R_x} \mathbf{1}_{R_x} = 1.$$

Similarly, $\mathrm{vec}(\mathcal{W})^\top \left( \boldsymbol{\Sigma}_{y,D_y} \otimes \cdots \otimes \boldsymbol{\Sigma}_{y,1} \right) \mathrm{vec}(\mathcal{W}) = 1$.

7 |. NUMERICAL EXPERIMENTS

We use the generative model described in Section 6 to evaluate the methods discussed in this paper: classic CCA, TCCA, scTCCA, sparse TCCA, and sparse TCCA with the separable covariance. We assess these methods on their ability to recover the true latent population parameters $(\mathcal{V}, \mathcal{W})$ used to generate i.i.d. samples of tensor data pairs $(\mathcal{X}_n, \mathcal{Y}_n)$ for $n \in [1000]$. In all examples, the true $\mathcal{V}$ is a vector of length 100 with six entries set to 1 and the rest to 0. We use three different latent $\mathcal{W} \in \mathbb{R}^{64 \times 64}$ shown in Figure 1: $\mathcal{W}_{\text{rectangle}}$, $\mathcal{W}_{\text{cross}}$, and $\mathcal{W}_{\text{butterfly}}$. White pixels indicate values of 1, and black pixels indicate values of 0. The rectangle and cross population canonical tensors $\mathcal{W}_{\text{rectangle}}$ and $\mathcal{W}_{\text{cross}}$ are low rank; specifically, they are Rank 1 and Rank 2, respectively. The butterfly population canonical tensor $\mathcal{W}_{\text{butterfly}}$ has high rank. At first glance, these illustrative examples do not come across as challenging estimation problems in the high dimension-low sample size regime, but if we were to vectorize the data and perform CCA, the number of parameters to fit is $100 + 64^2 = 4{,}196$, whereas the sample size is 1,000. Because the number of parameters exceeds the sample size, the sample covariance matrix is singular. Consequently, we add a small ridge term $10^{-3}\mathbf{I}$ to the sample covariance matrices so that the generalized eigenvalue problem has a unique solution. The code for generating the simulation results is provided in the Supporting Information.

FIGURE 1. True latent population $\mathcal{W}$ canonical tensors for the numerical experiments.

7.1 |. Evaluation criteria and selection of tuning parameter

Canonical tensors can only be estimated up to a scaling factor. Thus, we measure estimation accuracy by the angle between population canonical tensors used to generate the data (V,W) and estimated canonical tensors (V^,W^), as follows:

$$\angle(\mathcal{V}, \hat{\mathcal{V}}) = \frac{\langle \mathbf{v}, \hat{\mathbf{v}} \rangle}{\|\mathbf{v}\|_2 \|\hat{\mathbf{v}}\|_2}, \quad \text{and} \quad \angle(\mathcal{W}, \hat{\mathcal{W}}) = \frac{\langle \mathbf{w}, \hat{\mathbf{w}} \rangle}{\|\mathbf{w}\|_2 \|\hat{\mathbf{w}}\|_2}.$$

The angles can take on values from −1 to 1, where a value closer to 1 indicates better recovery of the true canonical tensors.
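In code, this accuracy measure is simply the cosine similarity between the vectorized tensors (a small sketch of our own):

```python
import numpy as np

def angle(T_true, T_hat):
    """Cosine-similarity accuracy measure between a true and an estimated canonical tensor."""
    v, v_hat = T_true.ravel(), T_hat.ravel()
    return float(v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat)))

print(angle(np.eye(3), np.eye(3) + 0.05 * np.random.randn(3, 3)))   # close to 1
```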

We use $K$-fold cross-validation to perform model selection, for example, selection of the tensor rank or the sparsity level. We split our data into $K$ equally sized groups; for $k \in [K]$, we estimate a pair $(\mathcal{V}_k, \mathcal{W}_k)$ on all but the $k$th fold of data for a sequence of models of varying complexity, $M_1, M_2, \ldots$, where each model $M_l$ has a fixed pair of tensor ranks $R_x$ and $R_y$ and sparsity-inducing parameter $\lambda$. We denote the fitted canonical correlation under model $M_l$ by $\hat{\rho}_{-k}(\mathcal{V}_k, \mathcal{W}_k; M_l)$. Using $(\mathcal{V}_k, \mathcal{W}_k)$, we compute the canonical correlation on the held-out $k$th fold, which we denote $\hat{\rho}_k(\mathcal{V}_k, \mathcal{W}_k)$. We choose the model that minimizes the average discrepancy between the empirical canonical correlations on the training sets and testing sets (Waaijenborg, Verselewel de Witt Hamer, & Zwinderman, 2008).

$$\hat{M} = \underset{M_l}{\arg\min} \; \frac{1}{K} \sum_{k=1}^{K} \left| \hat{\rho}_k(\mathcal{V}_k, \mathcal{W}_k) - \hat{\rho}_{-k}(\mathcal{V}_k, \mathcal{W}_k; M_l) \right|.$$

We then use all the data to estimate (V,W) using model M^.

7.2 |. Results

Figure 2 compares the estimation accuracy of the various methods over 100 replicates. First, we observe that the various versions of TCCA outperform CCA on the vectorized data for all three choices of the latent canonical tensor $\mathcal{W}$. Second, we notice some overfitting in the rectangle and cross problems. In the rectangle problem, where the population canonical tensor $\mathcal{W}_{\text{rectangle}}$ is a Rank 1 tensor, all four TCCA methods show the best performance when we fix the rank $R_y$ for estimating $\mathcal{W}$ to be 1, and the performance deteriorates as higher ranks are used. The same trend can be seen in the cross problem, where the population canonical tensor $\mathcal{W}_{\text{cross}}$ is a Rank 2 tensor: all TCCA methods give smaller angles when $R_y = 3$ is used than when $R_y = 2$ is used. In contrast, in the case of the butterfly image, the rank of the population canonical tensor $\mathcal{W}_{\text{butterfly}}$ is much greater than 3. Thus, we expect neither an overfitting problem nor estimation performance as good as in the low-rank cases. We confirm in Figure 2 that the calculated angles for the butterfly problem are lower than those for the rectangle and cross problems.

FIGURE 2. Angles between the recovered canonical vector/tensors and the true canonical vector/tensors. CCA, canonical correlation analysis; scTCCA, tensor canonical correlation analysis with separable covariance; spscTCCA, sparse tensor canonical correlation analysis with separable covariance; spTCCA, sparse tensor canonical correlation analysis; TCCA, tensor canonical correlation analysis.

Another interesting observation is that, in the cross and butterfly problems, the TCCA methods with a separable covariance structure perform better than the other two models when the rank parameter $R_y$ is set higher than the true value. This result implies that assuming a separable covariance structure improves the estimation accuracy in these two problems. One explanation is that the true image has a symmetric structure; due to this special structure, we might expect improved performance from the more parsimonious model. However, the structured model does not have much effect on the rectangle problem. This may be because the true canonical tensor $\mathcal{W}$ already possesses relatively few parameters: the rectangle image is also symmetric but is Rank 1, and hence has very few parameters.

Table 1 shows the computation time taken by each method. In most cases, the computation times of the TCCA methods are better than or comparable with those of the CCA method. In particular, there is a large improvement when the underlying true canonical tensor has low rank. Also, sparse models generally take less time than nonsparse models, and separable covariance structure models generally take less time than the models without that assumption, which is expected.

TABLE 1.

Computation time: The mean run times are reported in seconds with the standard deviation in parentheses

Problem Rank CCA TCCA spTCCA scTCCA spscTCCA
Rectangle 1 22.90 (3.62) 7.52 (1.10) 7.36 (0.95) 3.14 (0.55) 3.09 (0.44)
2 22.80 (3.18) 15.32 (2.51) 15.12 (2.17) 11.03 (2.03) 30.11 (8.78)
3 21.85 (3.36) 24.07 (3.91) 18.61 (3.14) 15.89 (2.95) 37.09 (12.09)
Cross 1 18.49 (2.32) 8.05 (1.34) 7.78 (1.40) 3.39 (0.53) 13.19 (12.14)
2 17.33 (2.43) 17.31 (3.32) 15.13 (5.02) 6.96 (1.11) 33.55 (12.78)
3 17.10 (2.55) 24.44 (3.79) 54.86 (25.73) 10.04 (1.84) 9.63 (2.05)
Butterfly 1 18.92 (3.48) 7.08 (1.60) 61.18 (28.58) 2.90 (0.64) 31.66 (6.36)
2 18.23 (2.92) 20.01 (4.53) 30.10 (14.87) 9.89 (2.15) 48.57 (13.14)
3 19.99 (2.55) 32.27 (4.68) 26.09 (3.89) 11.84 (2.06) 13.58 (5.89)

Note. Experiments were performed on a computer cluster consisting of machines with Intel Xeon-based processors (I5 or I7 processors) with RAM ranging from 1,600 to 2,100 MHz. Abbreviations: CCA, canonical correlation analysis; scTCCA, tensor canonical correlation analysis with separable covariance; spscTCCA, sparse tensor canonical correlation analysis with separable covariance; spTCCA, sparse tensor canonical correlation analysis; TCCA, tensor canonical correlation analysis

8 |. DISCUSSION

We have proposed a TCCA approach for finding the relationship between linear combinations of two tensor datasets. Our method combines the classic CCA approach and the low-rank tensor decomposition to reduce the vast dimensionality of tensor parameters. The proposed estimation algorithm scales well with the tensor data size and is easy to implement using existing statistical software. Our algorithms also come with convergence guarantees, properties arguably lacking in the alternatives in the literature. We briefly highlight some open problems for further investigation.

In our illustrative examples, we did not incorporate a procedure for selecting the rank parameter. A main motivation for the low-rank formulation of the canonical tensors is reduced computation. Thus, one may wish to choose as high a rank parameter as one’s computational budget allows. Nonetheless, we leave for future work a more principled approach to rank selection. One promising angle of pursuit would be to leverage the equivalence between CCA and a least squares problem (Sun, Ji, & Ye, 2008) and then derive an information criterion, a strategy commonly used to select the rank parameter in several recently proposed tensor estimation procedures (Zhou et al., 2013; Sun et al., 2017; Sun & Li, 2019).

An additional direction for future work is to generalize our TCCA framework to handle the analysis of multiple tensor datasets. There are numerous proposed methods to extend CCA for multiple vector-measurement datasets (Carroll, 1968; Kettenring, 1971; Hanafi, 2007; Witten & Tibshirani, 2009; Luo et al., 2015) but not for multiple tensor datasets to the best of our knowledge.

Another problem not tackled in this paper is verifying the consistency of the proposed estimation procedures. Our clear specification of the population models provides a framework for studying the consistency property in both the large-$n$, fixed-$p$ (Zhou et al., 2013) and the large-$n$, diverging-$p$ settings (Zhang, Li, Zhou, Zhou, & Shen, 2019). Finally, we are currently investigating our methods on a real dataset.


Footnotes

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of the article.

REFERENCES

  1. Bach FR, & Jordan MI (2006). A probabilistic interpretation of canonical correlation analysis. Technical report.
  2. Bergstra J, & Bengio Y (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb), 281–305.
  3. Bickel PJ, & Levina E (2008a). Covariance regularization by thresholding. The Annals of Statistics, 36(6), 2577–2604.
  4. Bickel PJ, & Levina E (2008b). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1), 199–227.
  5. Cai TT, & Yuan M (2012). Adaptive covariance matrix estimation through block thresholding. The Annals of Statistics, 40(4), 2014–2042.
  6. Carroll JD (1968). Generalization of canonical correlation analysis to three or more sets of variables. Proceedings of the 76th Annual Convention of the American Psychological Association, Washington, DC, Vol. 3, pp. 227–228.
  7. Carroll JD, & Chang J-J (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35, 283–319.
  8. Gang L, Yong Z, Yan-Lei L, & Jing D (2011). Three dimensional canonical correlation analysis and its application to facial expression recognition. International Conference on Intelligent Computing and Information Science, Berlin, Heidelberg: Springer, pp. 56–61.
  9. González I, Déjean S, Martin P, & Baccini A (2008). CCA: An R package to extend canonical correlation analysis. Journal of Statistical Software, 23(1), 1–14.
  10. Hanafi M (2007). PLS path modelling: Computation of latent variables with the estimation mode B. Computational Statistics, 22(2), 275–292.
  11. Harshman RA (1970). Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16, 1–84.
  12. Hoff PD (2011). Separable covariance arrays via the Tucker product, with applications to multivariate relational data. Bayesian Analysis, 6(2), 179–196.
  13. Hotelling H (1936). Relations between two sets of variates. Biometrika, 28(3–4), 321–377.
  14. Kettenring JR (1971). Canonical analysis of several sets of variables. Biometrika, 58(3), 433–451.
  15. Kolda TG, & Bader BW (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500.
  16. Kubokawa T, Srivastava MS, et al. (2013). Optimal ridge-type estimators of covariance matrix in high dimension. CIRJE, Faculty of Economics, University of Tokyo.
  17. Ledoit O, & Wolf M (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.
  18. Ledoit O, & Wolf M (2012). Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics, 40(2), 1024–1060.
  19. Lee SH, & Choi S (2007). Two-dimensional canonical correlation analysis. IEEE Signal Processing Letters, 14(10), 735–738.
  20. Lu H (2013). Learning canonical correlations of paired tensor sets via tensor-to-vector projection. Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI '13), Beijing, China: AAAI Press, pp. 1516–1522.
  21. Luo Y, Tao D, Ramamohanarao K, Xu C, & Wen Y (2015). Tensor canonical correlation analysis for multi-view dimension reduction. IEEE Transactions on Knowledge and Data Engineering, 27(11), 3111–3124.
  22. Ma Z (2013). Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2), 772–801.
  23. Srivastava MS, & Reid N (2012). Testing the structure of the covariance matrix with fewer observations than the dimension. Journal of Multivariate Analysis, 112, 156–171.
  24. Stein JL, Hua X, Lee S, Ho AJ, Leow AD, Toga AW, & Thompson PM (2010). Voxelwise genome-wide association study (vGWAS). NeuroImage, 53(3), 1160–1174.
  25. Sun L, Ji S, & Ye J (2008). A least squares formulation for canonical correlation analysis. Proceedings of the 25th International Conference on Machine Learning (ICML '08), New York, NY, USA: ACM, pp. 1024–1031.
  26. Sun WW, & Li L (2019). Dynamic tensor clustering. Journal of the American Statistical Association, in press.
  27. Sun WW, Lu J, Liu H, & Cheng G (2017). Provable sparse tensor decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3), 899–916.
  28. Tan KM, Wang Z, Liu H, & Zhang T (2018). Sparse generalized eigenvalue problem: Optimal statistical rates via truncated Rayleigh flow. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1057–1086.
  29. Vinod HD (1976). Canonical ridge and econometrics of joint production. Journal of Econometrics, 4(2), 147–166.
  30. Waaijenborg S, Verselewel de Witt Hamer PC, & Zwinderman AH (2008). Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis. Statistical Applications in Genetics and Molecular Biology, 7(1), Article 3.
  31. Wang H (2010). Local two-dimensional canonical correlation analysis. IEEE Signal Processing Letters, 17(11), 921–924.
  32. Wang Z, Gu Q, Ning Y, & Liu H (2015). High dimensional EM algorithm: Statistical optimization and asymptotic normality. In Cortes C, Lawrence ND, Lee DD, Sugiyama M, & Garnett R (Eds.), Advances in Neural Information Processing Systems 28 (pp. 2521–2529). Curran Associates, Inc.
  33. Wang S-J, Yan W-J, Sun T, Zhao G, & Fu X (2016). Sparse tensor canonical correlation analysis for micro-expression recognition. Neurocomputing, 214, 218–232.
  34. Witten DM, & Tibshirani RJ (2009). Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, 8(1), 1–27.
  35. Yan J, Zheng W, Zhou X, & Zhao Z (2012). Sparse 2-D canonical correlation analysis via low rank matrix approximation for feature extraction. IEEE Signal Processing Letters, 19(1), 51–54.
  36. Zhang X, Li L, Zhou H, Zhou Y, & Shen D (2019). Tensor generalized estimating equations for longitudinal imaging analysis. Statistica Sinica, 29, 1977–2005.
  37. Zhou H, Li L, & Zhu H (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502), 540–552.
