Published in final edited form as: Stat. 2020 Jan 2;8(1):e253. doi: 10.1002/sta4.253

Tensor canonical correlation analysis

Eun Jeong Min 1, Eric C Chi 2, Hua Zhou 3

Abstract

Canonical correlation analysis (CCA) is a multivariate analysis technique for estimating a linear relationship between two sets of measurements. Modern acquisition technologies, for example, those arising in neuroimaging and remote sensing, produce data in the form of multidimensional arrays or tensors. Classic CCA is not appropriate for dealing with tensor data due to the multidimensional structure and ultrahigh dimensionality of such modern data. In this paper, we present tensor CCA (TCCA) to discover relationships between two tensors while simultaneously preserving multidimensional structure of the tensors and utilizing substantially fewer parameters. Furthermore, we show how to employ a parsimonious covariance structure to gain additional stability and efficiency. We delineate population and sample problems for each model and propose efficient estimation algorithms with global convergence guarantees. Also we describe a probabilistic model for TCCA that enables the generation of synthetic data with desired canonical variates and correlations. Simulation studies illustrate the performance of our methods.

Keywords: block coordinate ascent, CP decomposition, multidimensional array data

1 |. INTRODUCTION

Canonical correlation analysis (CCA) is a classic statistical method for identifying associations between two sets of measurements (Hotelling, 1936). Specifically, CCA identifies a pair of coefficient vectors, one for each set of measurements, such that the correlation between the corresponding linear combinations of variables from each set is maximized. By default, CCA applies when each observation consists of a pair of vector covariates. In many modern data analysis problems, however, each observation may consist more generally of a pair of multidimensional arrays or tensors. For example, in imaging genetics, to identify genetic variants that can best capture and explain phenotypic variations in brain function and structure, Stein et al. (2010) studied p = 448,293 single-nucleotide polymorphisms and q = 31,622 voxels in brain images, on n = 740 individuals. A naive approach to dealing with tensor-valued data would be to reshape tensor covariates into vectors and then apply standard CCA. There are, however, two serious drawbacks to doing so. First, structural information in tensors is discarded through vectorization. Second, the resulting vectors consist of a prohibitively large number of parameters. In the imaging genetics problem (Stein et al., 2010), vectorizing the voxel intensity measurements disregards the spatial correlation among neighbouring voxels. Moreover, applying standard CCA to vectors of single-nucleotide polymorphism covariates and vectorized brain images would require estimating nearly half a million parameters using fewer than a thousand observations.

In light of these issues, there have been extensions of CCA to handle special cases of tensor-valued data (Lee & Choi, 2007; Wang, 2010; Yan, Zheng, Zhou, & Zhao, 2012; Gang, Yong, Yan-Lei, & Jing, 2011; Lu, 2013; Wang, Yan, Sun, Zhao, & Fu, 2016). Although these methods have exhibited good empirical performance in some applications, there remains no clear population model underlying these sample-based heuristics. To address this gap in the literature, we introduce a novel statistical model for tensor CCA (TCCA). We summarize at a high level our formulation and its contributions:

  • We propose a TCCA population model that imposes the CANDECOMP/PARAFAC (CP) decomposition (Carroll & Chang, 1970; Harshman, 1970) structure on canonical tensors. This population model enforces model parsimony and enables efficient estimation.

  • We propose a refinement of TCCA, which assumes a separable covariance structure (scTCCA). This refinement enables efficient estimation of large covariance matrices of tensor-valued data.

  • We derive convenient representations of the covariance between linear combinations of two random tensors under both the unstructured and the separable covariance assumptions.

  • We develop efficient estimation algorithms for TCCA and scTCCA, both based on block coordinate ascent, which leverage these efficient representations. Each step of both algorithms solves a substantially lower dimensional CCA problem; thus, both algorithms can be easily implemented using any standard solvers for the CCA problem. Moreover, we prove global convergence guarantees of both estimation algorithms under modest regularity conditions.

  • We develop simple modifications to the TCCA and scTCCA estimation algorithms to incorporate recovery of sparse canonical correlation tensors to improve interpretability of the estimated models.

  • Finally, we extend the probabilistic interpretation of CCA by Bach and Jordan (2006) to TCCA. This extension leads to a probabilistic model for generating datasets with specified canonical correlation and variates.

The remainder of the paper is organized as follows. In Section 2, we review tensor notation and basic operations used in this paper. In Sections 3 and 4, we propose our two TCCA methods: TCCA and scTCCA. In Section 5, we describe a modification of the estimation algorithms for TCCA and scTCCA for sparse models. In Section 6, we introduce the probabilistic TCCA model. In Section 7, we describe the numerical experiment results. In Section 8, we conclude and highlight directions for future work.

2 |. NOTATION AND PRELIMINARIES

We review basic operations on matrices and tensors invoked throughout this paper, adopting the terminology and notation in Kolda and Bader (2009). Throughout the paper, we use lowercase letters to indicate scalars, bold lowercase letters to indicate vectors, bold capital characters to indicate matrices, and bold calligraphic capital characters to indicate tensors. We will also use the shorthand [n] to denote an index set {1, …, n}.

Tensors can be considered generalizations of scalars, vectors, and matrices. Let $\mathcal{X}$ represent a $D$-dimensional tensor in $\mathbb{R}^{p_1 \times \cdots \times p_D}$. The tensor $\mathcal{X}$ has order $D$, its number of dimensions or modes. For example, vectors are tensors of order one and have one mode. Matrices are tensors of order two and have two modes. We denote an element of $\mathcal{X}$ by $x_{\iota_1 \iota_2 \cdots \iota_D}$, where $\iota_i \in [p_i]$ and $i \in [D]$. Fibres are the generalization of matrix rows and columns to higher order tensors. A fibre is defined by fixing the index of every dimension except one. Mode $i$ fibres are $p_i$-dimensional vectors extracted from $\mathcal{X}$ by fixing all the indices $(\iota_1, \ldots, \iota_{i-1}, \iota_{i+1}, \ldots, \iota_D)$ except the $i$th one $\iota_i$. For example, columns of a matrix are Mode 1 fibres, and rows of a matrix are Mode 2 fibres.

It is often useful to reshape a tensor into a matrix. Reordering a tensor into a matrix is referred to as matricization. The mode $i$ matricization of a tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_D}$, denoted $\mathbf{X}_{(i)} \in \mathbb{R}^{p_i \times p_{-i}}$ with $p_{-i} = \prod_{k=1, k \neq i}^{D} p_k$, arranges the mode $i$ fibres as the columns of the matrix $\mathbf{X}_{(i)}$. In a mode $i$ matricization, the tensor element $x_{\iota_1, \ldots, \iota_D}$ is mapped to the matrix element of $\mathbf{X}_{(i)}$ with index $(\iota_i, j)$, where $j = 1 + \sum_{k=1, k \neq i}^{D} (\iota_k - 1) J_k$ with

$$J_k = \begin{cases} 1 & \text{if } k = 1, \text{ or if } k = 2 \text{ and } i = 1, \\ \prod_{k'=1, k' \neq i}^{k-1} p_{k'} & \text{otherwise.} \end{cases}$$

Reordering a tensor into a vector is referred to as vectorization. We first describe vectorization of a matrix before describing vectorization of a general tensor. The vectorization of a matrix $\mathbf{X}$ is denoted by $\mathrm{vec}(\mathbf{X})$ and is the vector obtained by stacking the columns of $\mathbf{X}$ on top of each other. The vectorization of the mode $i$ matricization of a tensor $\mathcal{X}$ in turn is denoted as $\mathbf{x}_{(i)} = \mathrm{vec}(\mathbf{X}_{(i)})$. We then define the vectorization of a tensor $\mathcal{X}$, denoted by $\mathrm{vec}(\mathcal{X})$, as the vectorization of its Mode 1 matricization, namely, $\mathrm{vec}(\mathbf{X}_{(1)})$. When unambiguous from context, we will often denote the vectorization of a tensor $\mathcal{X}$ by the corresponding bold lowercase $\mathbf{x}$.
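As a concrete illustration (our own sketch, not part of the original article), the mode $i$ matricization and the tensor vectorization above can be reproduced in a few lines of NumPy; the helper names `unfold` and `tvec` are ours, and the column-major reshaping is assumed to follow the Kolda and Bader (2009) ordering used in this paper.

```python
import numpy as np

def unfold(X, i):
    """Mode-i matricization: the mode-i fibres of X become the columns of the result."""
    # Move mode i to the front, then reshape in column-major (Fortran) order,
    # which reproduces the column ordering described in the text.
    return np.reshape(np.moveaxis(X, i, 0), (X.shape[i], -1), order="F")

def tvec(X):
    """vec(X): the vectorization of the Mode 1 matricization."""
    return unfold(X, 0).flatten(order="F")

# Small usage example on a 3 x 4 x 2 tensor.
X = np.arange(24).reshape(3, 4, 2)
print(unfold(X, 1).shape)  # (4, 6): mode i = 2 fibres (1-based indexing) as columns
print(tvec(X).shape)       # (24,)
```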

The inner product of two tensors of compatible dimensions $\mathcal{X}, \tilde{\mathcal{X}} \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ is the sum of the products of their entries, namely,

$$\langle \mathcal{X}, \tilde{\mathcal{X}} \rangle = \sum_{\iota_1=1}^{p_1} \cdots \sum_{\iota_D=1}^{p_D} x_{\iota_1, \ldots, \iota_D} \tilde{x}_{\iota_1, \ldots, \iota_D}.$$

The mode $i$ product of a tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ with a matrix $\mathbf{U} \in \mathbb{R}^{J \times p_i}$ is denoted by $\mathcal{X} \times_i \mathbf{U}$ and is the tensor of size $p_1 \times \cdots \times p_{i-1} \times J \times p_{i+1} \times \cdots \times p_D$ with elements

$$\left( \mathcal{X} \times_i \mathbf{U} \right)_{\iota_1, \ldots, \iota_{i-1}, j, \iota_{i+1}, \ldots, \iota_D} = \sum_{\iota_i=1}^{p_i} x_{\iota_1 \iota_2 \cdots \iota_D} \, u_{j \iota_i}.$$

Finally, we review three kinds of matrix products, as well as one definition of matrix division, that will be used throughout the paper.

  • For two matrices $\mathbf{A} \in \mathbb{R}^{p_1 \times p_2}$ and $\mathbf{B} \in \mathbb{R}^{q_1 \times q_2}$, the Kronecker product is the $p_1 q_1$-by-$p_2 q_2$ matrix,
    $$\mathbf{A} \otimes \mathbf{B} = \begin{pmatrix} a_{11}\mathbf{B} & \cdots & a_{1 p_2}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{p_1 1}\mathbf{B} & \cdots & a_{p_1 p_2}\mathbf{B} \end{pmatrix}.$$
  • For two matrices $\mathbf{A} = (\mathbf{a}_1 \cdots \mathbf{a}_{p_2}) \in \mathbb{R}^{p_1 \times p_2}$ and $\mathbf{B} = (\mathbf{b}_1 \cdots \mathbf{b}_{p_2}) \in \mathbb{R}^{q_1 \times p_2}$ that have the same number of columns $p_2$, the Khatri-Rao product is the $p_1 q_1$-by-$p_2$ matrix,
    $$\mathbf{A} \odot \mathbf{B} = \left( \mathbf{a}_1 \otimes \mathbf{b}_1 \;\; \mathbf{a}_2 \otimes \mathbf{b}_2 \;\; \cdots \;\; \mathbf{a}_{p_2} \otimes \mathbf{b}_{p_2} \right),$$
    which is a column-wise Kronecker product of $\mathbf{A}$ and $\mathbf{B}$.
  • For two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, the Hadamard product is the element-wise product $\mathbf{A} \ast \mathbf{B} = \{a_{ij} b_{ij}\}$. Because the Hadamard product commutes, we use $\mathop{\ast}_i \mathbf{A}_i$ to denote $\mathbf{A}_1 \ast \cdots \ast \mathbf{A}_m = \mathbf{A}_{\pi(1)} \ast \cdots \ast \mathbf{A}_{\pi(m)}$ for any permutation $\pi$.

  • Finally, for two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, the Hadamard quotient is the element-wise quotient $\mathbf{A} \oslash \mathbf{B} = \{a_{ij} / b_{ij}\}$.
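As a quick illustration (not from the article), the products above map directly onto NumPy primitives; here the Khatri-Rao product is assembled column by column from Kronecker products of the corresponding columns.

```python
import numpy as np

A = np.random.randn(3, 2)   # p1 x p2
B = np.random.randn(4, 2)   # q1 x p2, same number of columns as A

kron = np.kron(A, B)        # Kronecker product, (p1*q1) x (p2*q2)
khatri_rao = np.column_stack(
    [np.kron(A[:, r], B[:, r]) for r in range(A.shape[1])]
)                           # column-wise Kronecker product, (p1*q1) x p2
hadamard = A * (A + 1.0)    # element-wise (Hadamard) product of same-size matrices
quotient = A / (A + 1.0)    # element-wise (Hadamard) quotient

print(kron.shape, khatri_rao.shape, hadamard.shape, quotient.shape)
```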

3 |. TENSOR CANONICAL CORRELATION ANALYSIS

3.1 |. Population TCCA

Let $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ and $\mathcal{Y} \in \mathbb{R}^{q_1 \times \cdots \times q_{D_y}}$ be two random tensors of order $D_x$ and $D_y$, respectively. We denote the vectorizations of $\mathcal{X}$ and $\mathcal{Y}$ by $\mathbf{x}$ and $\mathbf{y}$, respectively. Denote by $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ the covariances of $\mathbf{x}$ and $\mathbf{y}$, respectively. Denote by $\boldsymbol{\Sigma}_{x,y}$ the covariance between $\mathbf{x}$ and $\mathbf{y}$. Let $\mathcal{V} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ and $\mathcal{W} \in \mathbb{R}^{q_1 \times \cdots \times q_{D_y}}$ be constant tensors, and let $\rho(\mathcal{V}, \mathcal{W})$ denote the correlation between the two linear combinations $\langle \mathcal{X}, \mathcal{V} \rangle$ and $\langle \mathcal{Y}, \mathcal{W} \rangle$, namely,

$$\rho(\mathcal{V}, \mathcal{W}) = \frac{\mathrm{Cov}\left( \langle \mathcal{X}, \mathcal{V} \rangle, \langle \mathcal{Y}, \mathcal{W} \rangle \right)}{\sqrt{\mathrm{Var}\left( \langle \mathcal{X}, \mathcal{V} \rangle \right)} \sqrt{\mathrm{Var}\left( \langle \mathcal{Y}, \mathcal{W} \rangle \right)}} = \frac{\mathbf{v}^\top \boldsymbol{\Sigma}_{x,y} \mathbf{w}}{\sqrt{\mathbf{v}^\top \boldsymbol{\Sigma}_x \mathbf{v}} \sqrt{\mathbf{w}^\top \boldsymbol{\Sigma}_y \mathbf{w}}}. \quad (1)$$

The pair $(\mathcal{V}, \mathcal{W})$ that maximizes $\rho$ gives the canonical tensors, and the optimal value of $\rho$ is the canonical correlation coefficient. Maximizing the objective in Equation (1) presents two challenges: (a) the high dimensionality of the optimization variables $\mathcal{V}$ and $\mathcal{W}$ and (b) the estimation of the huge covariance matrices $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ and the cross-covariance matrix $\boldsymbol{\Sigma}_{x,y}$. We will address challenge (b) by imposing a separable covariance structure in Section 4.

To address challenge (a), we impose the parsimonious CANDECOMP/PARAFAC (CP), or Kruskal, representation on the canonical tensors. The CP representation generalizes the idea of representing a matrix as the sum of Rank 1 matrices to representing a tensor as the sum of Rank 1 tensors. An order-$D$ tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ is Rank 1 if it can be expressed as the outer product of $D$ vectors $\mathbf{a}^{(1)}, \mathbf{a}^{(2)}, \ldots, \mathbf{a}^{(D)}$, namely, $\mathcal{X} = \mathbf{a}^{(1)} \circ \mathbf{a}^{(2)} \circ \cdots \circ \mathbf{a}^{(D)}$, where the binary operator $\circ$ denotes the vector outer product. Thus, the $(\iota_1, \iota_2, \ldots, \iota_D)$th element of $\mathcal{X}$ is $x_{\iota_1 \iota_2 \cdots \iota_D} = a^{(1)}_{\iota_1} a^{(2)}_{\iota_2} \cdots a^{(D)}_{\iota_D}$. A rank-$R$ tensor can be written as the sum of $R$ Rank 1 tensors, namely,

$$\mathcal{X} = [\![ \mathbf{A}_1, \ldots, \mathbf{A}_D ]\!] = \sum_{r=1}^{R} \mathbf{a}^{(1)}_r \circ \cdots \circ \mathbf{a}^{(D)}_r,$$

where $\mathbf{A}_i = \left( \mathbf{a}^{(i)}_1 \; \mathbf{a}^{(i)}_2 \; \cdots \; \mathbf{a}^{(i)}_R \right) \in \mathbb{R}^{p_i \times R}$ denotes the mode $i$ factor matrix. We use the Kruskal notation $[\![ \cdot ]\!]$ to concisely summarize the sum.

Thus, instead of searching over the space of all order-Dx and order-Dy tensor pairs (V,W), we limit our search to tensors of rank-Rx and rank-Ry,

$$\mathcal{V} = [\![ \mathbf{V}_1, \ldots, \mathbf{V}_{D_x} ]\!], \quad \mathbf{V}_i \in \mathbb{R}^{p_i \times R_x}, \; i \in [D_x], \qquad \mathcal{W} = [\![ \mathbf{W}_1, \ldots, \mathbf{W}_{D_y} ]\!], \quad \mathbf{W}_j \in \mathbb{R}^{q_j \times R_y}, \; j \in [D_y]. \quad (2)$$

As we will see later, this parameterization makes progress towards alleviating the burden of estimating a huge covariance matrix. Note that

$$\langle \mathcal{X}, \mathcal{V} \rangle = \left\langle \mathcal{X}, \sum_{r=1}^{R_x} \mathbf{v}^{(1)}_r \circ \cdots \circ \mathbf{v}^{(D_x)}_r \right\rangle = \sum_{r=1}^{R_x} \mathcal{X} \times_1 \mathbf{v}^{(1)\top}_r \times_2 \cdots \times_{D_x} \mathbf{v}^{(D_x)\top}_r$$
$$\langle \mathcal{Y}, \mathcal{W} \rangle = \left\langle \mathcal{Y}, \sum_{r=1}^{R_y} \mathbf{w}^{(1)}_r \circ \cdots \circ \mathbf{w}^{(D_y)}_r \right\rangle = \sum_{r=1}^{R_y} \mathcal{Y} \times_1 \mathbf{w}^{(1)\top}_r \times_2 \cdots \times_{D_y} \mathbf{w}^{(D_y)\top}_r.$$

Thus, we seek to maximize the correlation between a rank-$R_x$ multilinear form in $\mathcal{X}$ and a rank-$R_y$ multilinear form in $\mathcal{Y}$. Multiway information is preserved, and the dimensionality is reduced from an exponential number of parameters, $\prod_{i=1}^{D_x} p_i + \prod_{j=1}^{D_y} q_j$, to a linear number of parameters, $R_x \sum_{i=1}^{D_x} p_i + R_y \sum_{j=1}^{D_y} q_j$. Note that the ranks $(R_x, R_y)$ here are not the number of canonical tensor pairs being sought. In this paper, we focus on obtaining only the top canonical tensor pair $(\mathcal{V}, \mathcal{W})$, which has ranks $R_x$ and $R_y$.
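To make the rank-$R_x$ multilinear form concrete, the following minimal sketch (ours, not the authors' code) evaluates $\langle \mathcal{X}, \mathcal{V} \rangle$ directly from the CP factor matrices and illustrates the reduced parameter count.

```python
import numpy as np

def cp_inner_product(X, factors):
    """<X, V> where V = [[V_1, ..., V_D]] is given by its CP factor matrices.

    Each rank-1 term contracts X against one column from every factor matrix.
    """
    R = factors[0].shape[1]
    total = 0.0
    for r in range(R):
        term = X
        for Vi in factors:                       # contract mode after mode with v_r^{(i)}
            term = np.tensordot(term, Vi[:, r], axes=([0], [0]))
        total += term
    return float(total)

# Usage: a 5 x 6 x 7 tensor with a rank-2 CP-structured V.
p = (5, 6, 7)
X = np.random.randn(*p)
factors = [np.random.randn(pi, 2) for pi in p]
print(cp_inner_product(X, factors))
# Free parameters in V: 2 * (5 + 6 + 7) = 36 instead of 5 * 6 * 7 = 210.
```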

The following representations of $\mathrm{Var}(\langle \mathcal{X}, \mathcal{V} \rangle)$, $\mathrm{Var}(\langle \mathcal{Y}, \mathcal{W} \rangle)$, and $\mathrm{Cov}(\langle \mathcal{X}, \mathcal{V} \rangle, \langle \mathcal{Y}, \mathcal{W} \rangle)$ in terms of a CP decomposition are key to our estimation algorithms. The proof is in the Supporting Information.

Proposition 1

Let $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ and $\mathcal{Y} \in \mathbb{R}^{q_1 \times \cdots \times q_{D_y}}$ be two random tensors, and let $\mathcal{V} = [\![ \mathbf{V}_1, \ldots, \mathbf{V}_{D_x} ]\!]$ and $\mathcal{W} = [\![ \mathbf{W}_1, \ldots, \mathbf{W}_{D_y} ]\!]$ be two constant tensors of the same size as $\mathcal{X}$ and $\mathcal{Y}$, respectively. Define

$$\mathbf{V}_{(-i)} = \left[ \mathbf{V}_{D_x} \odot \cdots \odot \mathbf{V}_{i+1} \odot \mathbf{V}_{i-1} \odot \cdots \odot \mathbf{V}_1 \right] \otimes \mathbf{I}_{p_i}, \quad (3)$$
$$\mathbf{W}_{(-j)} = \left[ \mathbf{W}_{D_y} \odot \cdots \odot \mathbf{W}_{j+1} \odot \mathbf{W}_{j-1} \odot \cdots \odot \mathbf{W}_1 \right] \otimes \mathbf{I}_{q_j}, \quad (4)$$

and let $\boldsymbol{\Sigma}_{x(i)}$ denote the covariance of $\mathbf{x}_{(i)} = \mathrm{vec}(\mathbf{X}_{(i)})$. Define $\boldsymbol{\Sigma}_{y(j)}$ and $\boldsymbol{\Sigma}_{x(i),y(j)}$ analogously. Then

$$\mathrm{Var}\left( \langle \mathcal{X}, \mathcal{V} \rangle \right) = \mathbf{v}_i^\top \mathbf{V}_{(-i)}^\top \boldsymbol{\Sigma}_{x(i)} \mathbf{V}_{(-i)} \mathbf{v}_i$$
$$\mathrm{Var}\left( \langle \mathcal{Y}, \mathcal{W} \rangle \right) = \mathbf{w}_j^\top \mathbf{W}_{(-j)}^\top \boldsymbol{\Sigma}_{y(j)} \mathbf{W}_{(-j)} \mathbf{w}_j$$
$$\mathrm{Cov}\left( \langle \mathcal{X}, \mathcal{V} \rangle, \langle \mathcal{Y}, \mathcal{W} \rangle \right) = \mathbf{v}_i^\top \mathbf{V}_{(-i)}^\top \boldsymbol{\Sigma}_{x(i),y(j)} \mathbf{W}_{(-j)} \mathbf{w}_j$$

for any $i \in [D_x]$ and $j \in [D_y]$, where $\mathbf{v}_i = \mathrm{vec}(\mathbf{V}_i)$ and $\mathbf{w}_j = \mathrm{vec}(\mathbf{W}_j)$.
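Proposition 1 rests on the identity $\langle \mathcal{X}, \mathcal{V} \rangle = \mathbf{x}_{(i)}^\top \mathbf{V}_{(-i)} \mathbf{v}_i$, from which the variance and covariance representations follow. The short sketch below (our own, assuming the Kolda-Bader matricization ordering) checks this identity numerically for a small three-mode example.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of two matrices with the same number of columns."""
    return np.column_stack([np.kron(A[:, r], B[:, r]) for r in range(A.shape[1])])

p, R, i = (3, 4, 5), 2, 1                  # a 3 x 4 x 5 tensor, rank 2, single out mode i = 2
X = np.random.randn(*p)
V1, V2, V3 = (np.random.randn(pi, R) for pi in p)

# Left-hand side: <X, V> with V = sum_r v_r^(1) o v_r^(2) o v_r^(3).
V_full = sum(np.einsum("a,b,c->abc", V1[:, r], V2[:, r], V3[:, r]) for r in range(R))
lhs = np.sum(X * V_full)

# Right-hand side: x_(i)^T V_(-i) v_i with V_(-i) = (V3 ⊙ V1) ⊗ I_{p_i}.
x_i = np.reshape(np.moveaxis(X, i, 0), (p[i], -1), order="F").flatten(order="F")
V_minus_i = np.kron(khatri_rao(V3, V1), np.eye(p[i]))
v_i = V2.flatten(order="F")
print(np.allclose(lhs, x_i @ V_minus_i @ v_i))   # expected: True
```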

3.2 |. Sample TCCA

Suppose we observe N pairs of i.i.d. tensor data (Xn,Yn), and we estimate V and W by solving the optimization problem

$$\text{maximize} \quad \hat{\rho}(\mathcal{V}, \mathcal{W}) = \frac{\mathbf{v}^\top \hat{\boldsymbol{\Sigma}}_{x,y} \mathbf{w}}{\sqrt{\mathbf{v}^\top \hat{\boldsymbol{\Sigma}}_x \mathbf{v}} \sqrt{\mathbf{w}^\top \hat{\boldsymbol{\Sigma}}_y \mathbf{w}}}, \quad (5)$$

where Σ^x, Σ^y and Σ^x,y are sample estimates of the corresponding covariances. Recall that CCA models can be estimated numerically by computing the solution to the following generalized eigenvalue problem:

$$\begin{pmatrix} \mathbf{0} & \hat{\boldsymbol{\Sigma}}_{x,y} \\ \hat{\boldsymbol{\Sigma}}_{y,x} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{v} \\ \mathbf{w} \end{pmatrix} = \rho \begin{pmatrix} \hat{\boldsymbol{\Sigma}}_x & \mathbf{0} \\ \mathbf{0} & \hat{\boldsymbol{\Sigma}}_y \end{pmatrix} \begin{pmatrix} \mathbf{v} \\ \mathbf{w} \end{pmatrix}.$$

This problem is guaranteed to have a solution if and only if the covariance matrices $\hat{\boldsymbol{\Sigma}}_x$ and $\hat{\boldsymbol{\Sigma}}_y$ are nonsingular. In practice, the sample size $N$ is smaller than the sizes of $\hat{\boldsymbol{\Sigma}}_x$ and $\hat{\boldsymbol{\Sigma}}_y$, which are $(\prod_{i=1}^{D_x} p_i) \times (\prod_{i=1}^{D_x} p_i)$ and $(\prod_{j=1}^{D_y} q_j) \times (\prod_{j=1}^{D_y} q_j)$, respectively. Therefore, the sample covariance matrices $\hat{\boldsymbol{\Sigma}}_x$ and $\hat{\boldsymbol{\Sigma}}_y$ are singular, and a solution cannot be obtained. As a remedy, several regularized estimation methods for obtaining nonsingular sample covariance matrices (Vinod, 1976; Ledoit & Wolf, 2004; González, Déjean, Martin, & Baccini, 2008; Ledoit & Wolf, 2012; Cai & Yuan, 2012; Bickel & Levina, 2008a, 2008b; Kubokawa et al., 2013; Srivastava & Reid, 2012) can be used.

If we take $\mathrm{Var}(\langle \mathcal{X}, \mathcal{V} \rangle)$, $\mathrm{Var}(\langle \mathcal{Y}, \mathcal{W} \rangle)$, and $\mathrm{Cov}(\langle \mathcal{X}, \mathcal{V} \rangle, \langle \mathcal{Y}, \mathcal{W} \rangle)$ to be $\mathbf{v}^\top \hat{\boldsymbol{\Sigma}}_x \mathbf{v}$, $\mathbf{w}^\top \hat{\boldsymbol{\Sigma}}_y \mathbf{w}$, and $\mathbf{v}^\top \hat{\boldsymbol{\Sigma}}_{x,y} \mathbf{w}$, respectively, then Proposition 1 suggests a block coordinate ascent algorithm where we update the factor matrices in pairs $(\mathbf{V}_i, \mathbf{W}_j)$, for different combinations of $i \in [D_x]$ and $j \in [D_y]$. To update the pair $(\mathbf{V}_i, \mathbf{W}_j)$, we solve the following problem:

$$\begin{aligned} \text{maximize} \quad & \mathbf{v}_i^\top \mathbf{V}_{(-i)}^\top \hat{\boldsymbol{\Sigma}}_{x(i),y(j)} \mathbf{W}_{(-j)} \mathbf{w}_j \\ \text{subject to} \quad & \mathbf{v}_i^\top \mathbf{V}_{(-i)}^\top \hat{\boldsymbol{\Sigma}}_{x(i)} \mathbf{V}_{(-i)} \mathbf{v}_i = 1, \quad \mathbf{w}_j^\top \mathbf{W}_{(-j)}^\top \hat{\boldsymbol{\Sigma}}_{y(j)} \mathbf{W}_{(-j)} \mathbf{w}_j = 1, \end{aligned} \quad (6)$$

which is a substantially smaller optimization problem over $p_i R_x + q_j R_y$ variables compared with any alternative "all-at-once" strategy that optimizes over all of the factor-matrix parameters simultaneously. Problem (6) includes $\hat{\boldsymbol{\Sigma}}_{x(i)}$, a permuted version of $\hat{\boldsymbol{\Sigma}}_x$ of size $(\prod_{i=1}^{D_x} p_i) \times (\prod_{i=1}^{D_x} p_i)$. However, we work with the "compressed" covariance matrix

$$\mathbf{V}_{(-i)}^\top \hat{\boldsymbol{\Sigma}}_{x(i)} \mathbf{V}_{(-i)} \in \mathbb{R}^{p_i R_x \times p_i R_x},$$

which is likely to be full rank when $N \geq p_i R_x$, instead of the singular matrix $\hat{\boldsymbol{\Sigma}}_{x(i)}$. This approach enables us to solve the generalized eigenvalue problem with matrices of size $(p_i R_x + q_j R_y)$-by-$(p_i R_x + q_j R_y)$. Algorithm 1 summarizes the estimation procedure, which comes with the following convergence guarantee. The proof is in the Supporting Information.

Proposition 2

If the matrices $\hat{\boldsymbol{\Sigma}}_x$ and $\hat{\boldsymbol{\Sigma}}_y$ are nonsingular, then the limit points of the iterate sequence generated by Algorithm 1 are canonical tensors of the sample TCCA problem.

Algorithm 1.

TCCA, assuming (i) CP structure on the canonical tensors $(\mathcal{V}, \mathcal{W})$ and (ii) no additional structure on the covariances $\mathrm{Var}(\mathrm{vec}(\mathcal{X}))$ and $\mathrm{Var}(\mathrm{vec}(\mathcal{Y}))$

Initialize $\mathbf{v}_i^{(0)}$ and $\mathbf{w}_j^{(0)}$, for $i \in [D_x]$ and $j \in [D_y]$
$t \leftarrow 0$
repeat
 Select $(i, j) \in [D_x] \times [D_y]$
 $\mathbf{V}_{(-i)}^{(t)} \leftarrow \left[ \mathbf{V}_{D_x}^{(t)} \odot \cdots \odot \mathbf{V}_{i+1}^{(t)} \odot \mathbf{V}_{i-1}^{(t)} \odot \cdots \odot \mathbf{V}_1^{(t)} \right] \otimes \mathbf{I}_{p_i}$
 $\mathbf{W}_{(-j)}^{(t)} \leftarrow \left[ \mathbf{W}_{D_y}^{(t)} \odot \cdots \odot \mathbf{W}_{j+1}^{(t)} \odot \mathbf{W}_{j-1}^{(t)} \odot \cdots \odot \mathbf{W}_1^{(t)} \right] \otimes \mathbf{I}_{q_j}$
 $\mathbf{C}_x^{(t)} \leftarrow \mathbf{V}_{(-i)}^{(t)\top} \hat{\boldsymbol{\Sigma}}_{x(i)} \mathbf{V}_{(-i)}^{(t)}$
 $\mathbf{C}_y^{(t)} \leftarrow \mathbf{W}_{(-j)}^{(t)\top} \hat{\boldsymbol{\Sigma}}_{y(j)} \mathbf{W}_{(-j)}^{(t)}$
 $\mathbf{C}_{x,y}^{(t)} \leftarrow \mathbf{V}_{(-i)}^{(t)\top} \hat{\boldsymbol{\Sigma}}_{x(i),y(j)} \mathbf{W}_{(-j)}^{(t)}$
 Generalized eigen-decomposition: $\begin{pmatrix} \mathbf{0} & \mathbf{C}_{x,y}^{(t)} \\ \mathbf{C}_{x,y}^{(t)\top} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{v}_i^{(t+1)} \\ \mathbf{w}_j^{(t+1)} \end{pmatrix} = \rho^{(t+1)} \begin{pmatrix} \mathbf{C}_x^{(t)} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_y^{(t)} \end{pmatrix} \begin{pmatrix} \mathbf{v}_i^{(t+1)} \\ \mathbf{w}_j^{(t+1)} \end{pmatrix}$
 $t \leftarrow t + 1$
until $\rho^{(t)}$ converges
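The following is a rough Python sketch of Algorithm 1 (our own code, not the authors' implementation). It uses a random initialization, a fixed number of sweeps in place of monitoring the convergence of $\rho^{(t)}$, and a small ridge term added to the compressed covariance matrices for numerical stability; these are all simplifying assumptions on our part.

```python
import numpy as np
from scipy.linalg import eigh

def unfold_vec(T, i):
    """vec of the mode-i matricization of one tensor sample (Kolda-Bader ordering)."""
    return np.reshape(np.moveaxis(T, i, 0), (T.shape[i], -1), order="F").flatten(order="F")

def khatri_rao(mats):
    """Column-wise Kronecker product of matrices sharing the same column count."""
    out = mats[0]
    for M in mats[1:]:
        out = np.column_stack([np.kron(out[:, r], M[:, r]) for r in range(out.shape[1])])
    return out

def minus_mat(factors, i):
    """V_{(-i)} = (V_D ⊙ ... ⊙ V_{i+1} ⊙ V_{i-1} ⊙ ... ⊙ V_1) ⊗ I_{p_i}."""
    others = [factors[k] for k in reversed(range(len(factors))) if k != i]
    return np.kron(khatri_rao(others), np.eye(factors[i].shape[0]))

def cov(A, B):
    """Sample cross-covariance of row-wise observations in A (N x a) and B (N x b)."""
    return (A - A.mean(0)).T @ (B - B.mean(0)) / A.shape[0]

def tcca(Xs, Ys, Rx, Ry, n_sweeps=20, ridge=1e-3, seed=0):
    """Block coordinate ascent sketch of Algorithm 1 (fixed sweep count, random start)."""
    rng = np.random.default_rng(seed)
    px, qy = Xs.shape[1:], Ys.shape[1:]
    V = [rng.standard_normal((p, Rx)) for p in px]
    W = [rng.standard_normal((q, Ry)) for q in qy]
    for _ in range(n_sweeps):
        for i in range(len(px)):
            for j in range(len(qy)):
                Xi = np.stack([unfold_vec(x, i) for x in Xs])      # rows are x_(i)
                Yj = np.stack([unfold_vec(y, j) for y in Ys])      # rows are y_(j)
                Vm, Wm = minus_mat(V, i), minus_mat(W, j)
                Cx = Vm.T @ cov(Xi, Xi) @ Vm + ridge * np.eye(Vm.shape[1])
                Cy = Wm.T @ cov(Yj, Yj) @ Wm + ridge * np.eye(Wm.shape[1])
                Cxy = Vm.T @ cov(Xi, Yj) @ Wm
                a = Cx.shape[0]
                A = np.block([[np.zeros_like(Cx), Cxy], [Cxy.T, np.zeros_like(Cy)]])
                B = np.block([[Cx, np.zeros_like(Cxy)], [np.zeros_like(Cxy.T), Cy]])
                rho, vecs = eigh(A, B)                             # generalized eigenproblem
                vi, wj = vecs[:a, -1], vecs[a:, -1]                # top eigenpair
                vi /= np.sqrt(vi @ Cx @ vi)                        # unit-variance scaling
                wj /= np.sqrt(wj @ Cy @ wj)
                V[i] = vi.reshape(px[i], Rx, order="F")            # v_i = vec(V_i)
                W[j] = wj.reshape(qy[j], Ry, order="F")
    return V, W, rho[-1]
```

For example, `tcca(Xs, Ys, Rx=1, Ry=2)` with `Xs` of shape `(N, p1, p2)` and `Ys` of shape `(N, q1, q2)` would return the estimated factor matrices and the final fitted correlation.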

4 |. TCCA WITH SEPARABLE COVARIANCE STRUCTURE

4.1 |. Population TCCA with separable covariances

Hoff (2011) proposed the array normal distribution with separable covariance structure. Separable marginal covariances are defined as

$$\boldsymbol{\Sigma}_x = \mathrm{Var}(\mathbf{x}) = \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} \quad \text{and} \quad \boldsymbol{\Sigma}_y = \mathrm{Var}(\mathbf{y}) = \boldsymbol{\Sigma}_{y,D_y} \otimes \cdots \otimes \boldsymbol{\Sigma}_{y,1}, \quad (7)$$

where $\boldsymbol{\Sigma}_{x,i} \in \mathbb{R}^{p_i \times p_i}$ for $i \in [D_x]$ and $\boldsymbol{\Sigma}_{y,j} \in \mathbb{R}^{q_j \times q_j}$ for $j \in [D_y]$. Then the overall covariance of the population model is

$$\mathrm{Var}\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} & \boldsymbol{\Sigma}_{x,y} \\ \boldsymbol{\Sigma}_{y,x} & \boldsymbol{\Sigma}_{y,D_y} \otimes \cdots \otimes \boldsymbol{\Sigma}_{y,1} \end{pmatrix}. \quad (8)$$

Intuitively, $\boldsymbol{\Sigma}_{x,i}$ summarizes the covariance along the mode $i$ fibres of the tensor $\mathcal{X}$, and $\boldsymbol{\Sigma}_{y,j}$ summarizes the covariance along the mode $j$ fibres of the tensor $\mathcal{Y}$. The following result gives a representation of $\mathrm{Var}(\langle \mathcal{X}, \mathcal{V} \rangle)$ and $\mathrm{Var}(\langle \mathcal{Y}, \mathcal{W} \rangle)$ in the presence of the separable covariance structure (7) that we will leverage in our estimation algorithm.

Proposition 3

Let $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ and $\mathcal{Y} \in \mathbb{R}^{q_1 \times \cdots \times q_{D_y}}$ be two random tensors admitting the separable covariance structure (7), and let $\mathcal{V} = [\![ \mathbf{V}_1, \ldots, \mathbf{V}_{D_x} ]\!]$ and $\mathcal{W} = [\![ \mathbf{W}_1, \ldots, \mathbf{W}_{D_y} ]\!]$ be two constant tensors. Define

$$\mathbf{H}_{x,-i} = \mathop{\ast}_{i' \neq i} \left( \mathbf{V}_{i'}^\top \boldsymbol{\Sigma}_{x,i'} \mathbf{V}_{i'} \right) \quad \text{and} \quad \mathbf{H}_{y,-j} = \mathop{\ast}_{j' \neq j} \left( \mathbf{W}_{j'}^\top \boldsymbol{\Sigma}_{y,j'} \mathbf{W}_{j'} \right).$$

Then

$$\mathrm{Var}\left( \langle \mathcal{X}, \mathcal{V} \rangle \right) = \mathbf{v}_i^\top \left( \mathbf{H}_{x,-i} \otimes \boldsymbol{\Sigma}_{x,i} \right) \mathbf{v}_i \quad \text{and} \quad \mathrm{Var}\left( \langle \mathcal{Y}, \mathcal{W} \rangle \right) = \mathbf{w}_j^\top \left( \mathbf{H}_{y,-j} \otimes \boldsymbol{\Sigma}_{y,j} \right) \mathbf{w}_j,$$

for any i ∈ [Dx] and j ∈ [Dy], where vi = vec(Vi) and wj = vec(Wj).

With the separable covariance structure and the CP structure on the canonical tensors $(\mathcal{V}, \mathcal{W})$, the objective function of the TCCA population model (1) greatly simplifies. Note that the separable covariance structure may not hold for real data, in which case the covariance estimates are biased. Despite this drawback, this parsimonious structure is worth considering due to the stability that it can impart by reducing estimation variance.
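As a numerical sanity check of Proposition 3 (our own sketch, not from the paper), the identity can be verified on a small three-mode example with randomly generated separable covariance factors:

```python
import numpy as np

def khatri_rao(mats):
    """Khatri-Rao product of a list of matrices sharing the same column count."""
    out = mats[0]
    for M in mats[1:]:
        out = np.column_stack([np.kron(out[:, r], M[:, r]) for r in range(out.shape[1])])
    return out

rng = np.random.default_rng(0)
p, R, i = (3, 4, 5), 2, 1                       # three modes, rank 2, single out mode i = 2
V = [rng.standard_normal((pi, R)) for pi in p]
Sig = []
for pi in p:
    A = rng.standard_normal((pi, pi))
    Sig.append(A @ A.T + np.eye(pi))            # a random SPD covariance factor per mode

# Left-hand side: vec(V)^T Sigma_x vec(V) with Sigma_x = Sig_3 ⊗ Sig_2 ⊗ Sig_1.
vecV = khatri_rao(V[::-1]) @ np.ones(R)         # vec of the CP tensor V
Sigma_x = np.kron(np.kron(Sig[2], Sig[1]), Sig[0])
lhs = vecV @ Sigma_x @ vecV

# Right-hand side (Proposition 3): v_i^T (H_{x,-i} ⊗ Sig_i) v_i.
H = np.ones((R, R))
for k in range(len(p)):
    if k != i:
        H *= V[k].T @ Sig[k] @ V[k]             # Hadamard product over the other modes
v_i = V[i].flatten(order="F")
print(np.allclose(lhs, v_i @ np.kron(H, Sig[i]) @ v_i))   # expected: True
```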

4.2 |. Sample TCCA with separable covariances

Given data (Xn,Yn), n ∈ [N], the goal is to maximize the sample canonical correlation (5) under assumptions that (a) V and W have the CP decomposition structure and (b) X and Y admit the separable covariance structure. We follow the same strategy as sample TCCA in Section 3.2, updating parameters in pairs of factor matrices (Vi, Wj). To update the pair (Vi, Wj), we solve the subproblem

$$\begin{aligned} \text{maximize} \quad & \mathbf{v}_i^\top \mathbf{V}_{(-i)}^\top \hat{\boldsymbol{\Sigma}}_{x(i),y(j)} \mathbf{W}_{(-j)} \mathbf{w}_j, \\ \text{subject to} \quad & \mathbf{v}_i^\top \left( \mathbf{H}_{x,-i} \otimes \hat{\boldsymbol{\Sigma}}_{x,i} \right) \mathbf{v}_i = 1, \quad \mathbf{w}_j^\top \left( \mathbf{H}_{y,-j} \otimes \hat{\boldsymbol{\Sigma}}_{y,j} \right) \mathbf{w}_j = 1, \end{aligned}$$

where Hx,−i and Hy,−j, defined in Proposition 3, are evaluated at current iterates Vi′ where i′ ≠ i, and Wj′, where j′ ≠ j.

By assuming the separable covariance structure, Proposition 3 enables us to greatly simplify the variance calculations for $\mathrm{Var}(\langle \mathcal{X}, \mathcal{V} \rangle)$ and $\mathrm{Var}(\langle \mathcal{Y}, \mathcal{W} \rangle)$. Note that the calculation of $\mathbf{V}_{(-i)}^\top \hat{\boldsymbol{\Sigma}}_{x(i)} \mathbf{V}_{(-i)}$ and $\mathbf{W}_{(-j)}^\top \hat{\boldsymbol{\Sigma}}_{y(j)} \mathbf{W}_{(-j)}$ in the subproblem of the sample TCCA costs $(R_x \prod_{i=1}^{D_x} p_i)^2 + (R_y \prod_{j=1}^{D_y} q_j)^2$ flops. In contrast, the calculation of the matrices $\mathbf{H}_{x,-i} \otimes \hat{\boldsymbol{\Sigma}}_{x,i}$ and $\mathbf{H}_{y,-j} \otimes \hat{\boldsymbol{\Sigma}}_{y,j}$ in the subproblem of the sample TCCA with the separable covariance structure costs only $(R_x p_i)^2 + (R_y q_j)^2$ flops. Algorithm 2 summarizes the estimation procedure under the separable covariance assumption (7). Like Algorithm 1, Algorithm 2 also comes with convergence guarantees. The proof is in the Supporting Information.

Proposition 4

If the matrices Σ^x and Σ^y are nonsingular, then the limit points of the iterate sequence generated by Algorithm 2 are canonical tensors of the sample TCCA problem with separable covariances.

Algorithm 2.

TCCA for two tensors of modes Dx and Dy, respectively, assuming (i) CP structure on canonical correlation tensors (V,W) and (ii) separable covariances Var(vec(X)) and Var(vec(Y))

Initialize $\mathbf{V}_i^{(0)}$, $i \in [D_x]$, and $\mathbf{W}_j^{(0)}$, $j \in [D_y]$
$\mathbf{H}_x^{(0)} \leftarrow \mathop{\ast}_i \left( \mathbf{V}_i^{(0)\top} \hat{\boldsymbol{\Sigma}}_{x,i} \mathbf{V}_i^{(0)} \right)$
$\mathbf{H}_y^{(0)} \leftarrow \mathop{\ast}_j \left( \mathbf{W}_j^{(0)\top} \hat{\boldsymbol{\Sigma}}_{y,j} \mathbf{W}_j^{(0)} \right)$
$t \leftarrow 0$
repeat
 Select $(i, j) \in [D_x] \times [D_y]$
 $\mathbf{H}_{x,-i}^{(t)} \leftarrow \mathbf{H}_x^{(t)} \oslash \left( \mathbf{V}_i^{(t)\top} \hat{\boldsymbol{\Sigma}}_{x,i} \mathbf{V}_i^{(t)} \right)$
 $\mathbf{C}_x^{(t)} \leftarrow \mathbf{H}_{x,-i}^{(t)} \otimes \hat{\boldsymbol{\Sigma}}_{x,i}$
 $\mathbf{H}_{y,-j}^{(t)} \leftarrow \mathbf{H}_y^{(t)} \oslash \left( \mathbf{W}_j^{(t)\top} \hat{\boldsymbol{\Sigma}}_{y,j} \mathbf{W}_j^{(t)} \right)$
 $\mathbf{C}_y^{(t)} \leftarrow \mathbf{H}_{y,-j}^{(t)} \otimes \hat{\boldsymbol{\Sigma}}_{y,j}$
 $\mathbf{V}_{(-i)}^{(t)} \leftarrow \left[ \mathbf{V}_{D_x}^{(t)} \odot \cdots \odot \mathbf{V}_{i+1}^{(t)} \odot \mathbf{V}_{i-1}^{(t)} \odot \cdots \odot \mathbf{V}_1^{(t)} \right] \otimes \mathbf{I}_{p_i}$
 $\mathbf{W}_{(-j)}^{(t)} \leftarrow \left[ \mathbf{W}_{D_y}^{(t)} \odot \cdots \odot \mathbf{W}_{j+1}^{(t)} \odot \mathbf{W}_{j-1}^{(t)} \odot \cdots \odot \mathbf{W}_1^{(t)} \right] \otimes \mathbf{I}_{q_j}$
 $\mathbf{C}_{x,y}^{(t)} \leftarrow \mathbf{V}_{(-i)}^{(t)\top} \hat{\boldsymbol{\Sigma}}_{x(i),y(j)} \mathbf{W}_{(-j)}^{(t)}$
 Solve the following generalized eigenvalue problem: $\begin{pmatrix} \mathbf{0} & \mathbf{C}_{x,y}^{(t)} \\ \mathbf{C}_{x,y}^{(t)\top} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{v}_i^{(t+1)} \\ \mathbf{w}_j^{(t+1)} \end{pmatrix} = \rho^{(t+1)} \begin{pmatrix} \mathbf{C}_x^{(t)} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_y^{(t)} \end{pmatrix} \begin{pmatrix} \mathbf{v}_i^{(t+1)} \\ \mathbf{w}_j^{(t+1)} \end{pmatrix}$
 $\mathbf{H}_x^{(t+1)} \leftarrow \mathbf{H}_{x,-i}^{(t)} \ast \left( \mathbf{V}_i^{(t+1)\top} \hat{\boldsymbol{\Sigma}}_{x,i} \mathbf{V}_i^{(t+1)} \right)$
 $\mathbf{H}_y^{(t+1)} \leftarrow \mathbf{H}_{y,-j}^{(t)} \ast \left( \mathbf{W}_j^{(t+1)\top} \hat{\boldsymbol{\Sigma}}_{y,j} \mathbf{W}_j^{(t+1)} \right)$
 $t \leftarrow t + 1$
until $\rho^{(t)}$ converges

4.2.1 |. Estimation of separable covariance matrices

Algorithm 2 relies on the sample estimate of separable covariances Σ^x,i and Σ^y,j, and the unstructured cross-covariance Σ^x,y. The following lemma will be useful in computing a consistent estimator for covariance matrices with the separable structure.

Lemma 1

If a random tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ has mean zero and separable covariance $\mathrm{Var}(\mathbf{x}) = \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1}$, then

$$\mathrm{E}\left( \mathbf{X}_{(i)} \mathbf{X}_{(i)}^\top \right) = \left( \prod_{i' \neq i} \mathrm{tr}(\boldsymbol{\Sigma}_{x,i'}) \right) \boldsymbol{\Sigma}_{x,i} \quad \text{and} \quad \mathrm{E}\left( \|\mathbf{x}\|_2^2 \right) = \prod_{i=1}^{D_x} \mathrm{tr}(\boldsymbol{\Sigma}_{x,i}).$$

Proof. For the first identity, see Hoff (2011, Proposition 2.1). For the second identity, $\mathrm{E}(\|\mathbf{x}\|_2^2) = \mathrm{tr}(\mathrm{Var}(\mathbf{x})) = \mathrm{tr}(\boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1}) = \prod_i \mathrm{tr}(\boldsymbol{\Sigma}_{x,i})$. □

Given $N$ i.i.d. observations $(\mathbf{x}_n, \mathbf{y}_n)$, the estimators $\hat{r}_x = \frac{1}{N}\sum_{n=1}^{N} \|\mathbf{x}_n - \bar{\mathbf{x}}\|_2^2$ and $\hat{r}_y = \frac{1}{N}\sum_{n=1}^{N} \|\mathbf{y}_n - \bar{\mathbf{y}}\|_2^2$ consistently estimate $\prod_{i=1}^{D_x} \mathrm{tr}(\boldsymbol{\Sigma}_{x,i})$ and $\prod_{j=1}^{D_y} \mathrm{tr}(\boldsymbol{\Sigma}_{y,j})$, respectively, where $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are the sample means of the vectorized tensors $\mathbf{x}_n$ and $\mathbf{y}_n$. We propose the following covariance estimators:

$$\hat{\boldsymbol{\Sigma}}_{x,i} = \frac{1}{N \hat{r}_x^{(D_x-1)/D_x}} \sum_{n=1}^{N} \left( \mathbf{X}_{n(i)} - \bar{\mathbf{X}}_{(i)} \right) \left( \mathbf{X}_{n(i)} - \bar{\mathbf{X}}_{(i)} \right)^\top, \quad \hat{\boldsymbol{\Sigma}}_{y,j} = \frac{1}{N \hat{r}_y^{(D_y-1)/D_y}} \sum_{n=1}^{N} \left( \mathbf{Y}_{n(j)} - \bar{\mathbf{Y}}_{(j)} \right) \left( \mathbf{Y}_{n(j)} - \bar{\mathbf{Y}}_{(j)} \right)^\top,$$
$$\hat{\boldsymbol{\Sigma}}_{x,y} = \frac{1}{N} \sum_{n=1}^{N} \left( \mathbf{x}_n - \bar{\mathbf{x}} \right) \left( \mathbf{y}_n - \bar{\mathbf{y}} \right)^\top,$$

where $i \in [D_x]$ and $j \in [D_y]$. The matrices $\mathbf{X}_{n(i)}$ and $\bar{\mathbf{X}}_{(i)}$ denote the mode $i$ matricization of the $n$th observation and the sample mean of the mode $i$ matricized tensors $\mathcal{X}_n$, respectively. The matrices $\mathbf{Y}_{n(j)}$ and $\bar{\mathbf{Y}}_{(j)}$ denote the analogous matricizations.
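A minimal sketch of these mode-wise covariance estimators (our own code; the function names are ours) is given below for a single tensor dataset:

```python
import numpy as np

def unfold(T, i):
    """Mode-i matricization (mode-i fibres as columns)."""
    return np.reshape(np.moveaxis(T, i, 0), (T.shape[i], -1), order="F")

def separable_cov_estimates(Xs):
    """Mode-wise covariance estimates for samples Xs of shape (N, p_1, ..., p_D)."""
    N, dims = Xs.shape[0], Xs.shape[1:]
    D = len(dims)
    xc = Xs.reshape(N, -1) - Xs.reshape(N, -1).mean(0)           # centred vectorizations
    r_hat = np.mean(np.sum(xc ** 2, axis=1))                     # estimates prod_i tr(Sigma_i)
    Xbar = Xs.mean(axis=0)
    Sigmas = []
    for i in range(D):
        S = sum(unfold(Xn - Xbar, i) @ unfold(Xn - Xbar, i).T for Xn in Xs)
        Sigmas.append(S / (N * r_hat ** ((D - 1) / D)))          # p_i x p_i factor estimate
    return Sigmas, r_hat

# Usage on synthetic samples of 4 x 5 x 3 tensors.
Xs = np.random.randn(200, 4, 5, 3)
Sigmas, r_hat = separable_cov_estimates(Xs)
print([S.shape for S in Sigmas])   # [(4, 4), (5, 5), (3, 3)]
```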

Unfortunately, the separable covariance structure (8) is not identifiable in the individual Σx,i due to scaling indeterminacy. Therefore, Σx,i cannot be consistently estimated. Note, however, that we do not need to consistently estimate the individual Σx,i in order to consistently estimate their Kronecker product. To see this, note that by Slutsky’s theorem,

$$\hat{\boldsymbol{\Sigma}}_{x,D_x} \otimes \cdots \otimes \hat{\boldsymbol{\Sigma}}_{x,1} \;\overset{P}{\longrightarrow}\; \frac{\left( \prod_i \mathrm{tr}(\boldsymbol{\Sigma}_{x,i}) \right)^{D_x - 1}}{\left( \prod_i \mathrm{tr}(\boldsymbol{\Sigma}_{x,i}) \right)^{D_x - 1}} \, \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} = \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1}$$

consistently estimates Var(x).

Note that Hoff (2011) proposes an iterative algorithm for finding the maximum likelihood estimate (MLE) on the basis of the array normal assumption. The MLE may improve upon the above estimates when the data actually come from an array normal distribution.

5 |. SPARSE TCCA

We may also recover sparse canonical tensors for both TCCA and scTCCA to enhance the interpretability of the estimated canonical tensors. Following the iterative thresholding strategy introduced by Ma (2013) for sparse principal component analysis, by Wang, Gu, Ning, and Liu (2015) for sparse expectation-maximization algorithms, and by Tan et al. (2018) for generalized eigenvalue problems, we incorporate a hard-thresholding step in Algorithms 1 and 2, as follows:

$$\mathbf{v}_i^{(t+1)} \leftarrow \Theta_\lambda\left( \mathbf{v}_i^{(t+1)} \right) \quad \text{and} \quad \mathbf{w}_j^{(t+1)} \leftarrow \Theta_\lambda\left( \mathbf{w}_j^{(t+1)} \right),$$

where Θλ(v) performs element-wise hard-thresholding on v, namely, the ith element of Θλ(v) is vi if |vi| > λ and 0 otherwise. In the simulation studies of Section 7, we employ fivefold cross-validation with a grid point search to choose the tuning parameter λ, following Tan, Wang, Liu, and Zhang (2018). Instead of a grid point search, a random search would be another choice (Bergstra & Bengio, 2012).
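For concreteness, the element-wise hard-thresholding operator $\Theta_\lambda$ takes only a couple of lines (our own sketch):

```python
import numpy as np

def hard_threshold(v, lam):
    """Theta_lambda(v): keep entries with |v_k| > lambda, zero out the rest."""
    return np.where(np.abs(v) > lam, v, 0.0)

v = np.array([0.05, -0.8, 0.3, -0.02, 1.2])
print(hard_threshold(v, 0.1))   # [ 0.  -0.8  0.3  0.   1.2]
```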

6 |. PROBABILISTIC MODEL FOR TCCA

Bach and Jordan (2006) give a probabilistic interpretation of classic CCA, which enables us to simulate data with desired canonical vectors and canonical correlations. For TCCA without any assumption on the covariance structure, the construction is the same as for regular CCA, treating the vectorized canonical tensors as canonical vectors. In this section, we first discuss how to generate data from $d$ given canonical correlations $\boldsymbol{\rho}_d = (\rho_1, \ldots, \rho_d)^\top$ and their corresponding canonical vectors in the columns of matrices $(\mathbf{V}_d, \mathbf{W}_d)$, and then we extend the construction to TCCA with the separable covariance assumption. Let $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ be two covariance matrices such that $\mathbf{V}_d^\top \boldsymbol{\Sigma}_x \mathbf{V}_d = \mathbf{W}_d^\top \boldsymbol{\Sigma}_y \mathbf{W}_d = \mathbf{I}_d$. Define two linear transformations $\mathbf{A}_x = \boldsymbol{\Sigma}_x \mathbf{V}_d \mathbf{M}_x$ and $\mathbf{A}_y = \boldsymbol{\Sigma}_y \mathbf{W}_d \mathbf{M}_y$, where $\mathbf{M}_x, \mathbf{M}_y \in \mathbb{R}^{d \times d}$ are arbitrary matrices such that $\mathbf{M}_x \mathbf{M}_y^\top = \mathrm{diag}(\boldsymbol{\rho}_d)$ and their spectral norms are less than 1. We consider the latent factor model

$$\mathbf{z} \sim N(\mathbf{0}, \mathbf{I}_d)$$
$$\mathbf{x} \mid \mathbf{z} \sim N\left( \mathbf{A}_x \mathbf{z} + \boldsymbol{\mu}_x, \; \boldsymbol{\Sigma}_x - \mathbf{A}_x \mathbf{A}_x^\top \right)$$
$$\mathbf{y} \mid \mathbf{z} \sim N\left( \mathbf{A}_y \mathbf{z} + \boldsymbol{\mu}_y, \; \boldsymbol{\Sigma}_y - \mathbf{A}_y \mathbf{A}_y^\top \right).$$

The joint distribution of (x, y) is

$$\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} \sim N\left( \begin{pmatrix} \boldsymbol{\mu}_x \\ \boldsymbol{\mu}_y \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{xy}^\top & \boldsymbol{\Sigma}_y \end{pmatrix} \right),$$

where $\boldsymbol{\Sigma}_{xy} = \boldsymbol{\Sigma}_x \mathbf{V}_d \, \mathrm{diag}(\boldsymbol{\rho}_d) \, \mathbf{W}_d^\top \boldsymbol{\Sigma}_y$.

Now, we discuss how to construct Σx and Σy from (Vd, Wd), which is not described in Bach and Jordan (2006). Let Vd = QxRx be the thin QR decomposition of Vd. Then

$$\boldsymbol{\Sigma}_x = \mathbf{Q}_x \mathbf{R}_x^{-\top} \mathbf{R}_x^{-1} \mathbf{Q}_x^\top + \mathbf{T}_x \left( \mathbf{I}_p - \mathbf{Q}_x \mathbf{Q}_x^\top \right) \mathbf{T}_x^\top$$

satisfies $\mathbf{V}_d^\top \boldsymbol{\Sigma}_x \mathbf{V}_d = \mathbf{I}_d$ for arbitrary $\mathbf{T}_x \in \mathbb{R}^{p \times p}$. Similarly, let $\mathbf{W}_d = \mathbf{Q}_y \mathbf{R}_y$ be the thin QR decomposition of $\mathbf{W}_d$. Then

$$\boldsymbol{\Sigma}_y = \mathbf{Q}_y \mathbf{R}_y^{-\top} \mathbf{R}_y^{-1} \mathbf{Q}_y^\top + \mathbf{T}_y \left( \mathbf{I}_q - \mathbf{Q}_y \mathbf{Q}_y^\top \right) \mathbf{T}_y^\top$$

satisfies $\mathbf{W}_d^\top \boldsymbol{\Sigma}_y \mathbf{W}_d = \mathbf{I}_d$ for arbitrary $\mathbf{T}_y \in \mathbb{R}^{q \times q}$. In this notation, the joint covariance is

$$\mathrm{Var}\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} = \begin{pmatrix} \mathbf{Q}_x \mathbf{R}_x^{-\top} \mathbf{R}_x^{-1} \mathbf{Q}_x^\top & \mathbf{Q}_x \mathbf{R}_x^{-\top} \mathrm{diag}(\boldsymbol{\rho}_d) \mathbf{R}_y^{-1} \mathbf{Q}_y^\top \\ \mathbf{Q}_y \mathbf{R}_y^{-\top} \mathrm{diag}(\boldsymbol{\rho}_d) \mathbf{R}_x^{-1} \mathbf{Q}_x^\top & \mathbf{Q}_y \mathbf{R}_y^{-\top} \mathbf{R}_y^{-1} \mathbf{Q}_y^\top \end{pmatrix} + \begin{pmatrix} \mathbf{T}_x \left( \mathbf{I}_p - \mathbf{Q}_x \mathbf{Q}_x^\top \right) \mathbf{T}_x^\top & \mathbf{0} \\ \mathbf{0} & \mathbf{T}_y \left( \mathbf{I}_q - \mathbf{Q}_y \mathbf{Q}_y^\top \right) \mathbf{T}_y^\top \end{pmatrix}. \quad (9)$$

Tx and Ty are free parameters that adjust the noise level in x and y, respectively. The normal generative model is

$$\mathbf{z} \sim N(\mathbf{0}, \mathbf{I}_d)$$
$$\mathbf{x} \mid \mathbf{z} \sim N\left( \mathbf{Q}_x \mathbf{R}_x^{-\top} \mathbf{M}_x \mathbf{z} + \boldsymbol{\mu}_x, \; \boldsymbol{\Sigma}_x \right)$$
$$\mathbf{y} \mid \mathbf{z} \sim N\left( \mathbf{Q}_y \mathbf{R}_y^{-\top} \mathbf{M}_y \mathbf{z} + \boldsymbol{\mu}_y, \; \boldsymbol{\Sigma}_y \right),$$

where Σx and Σy are as in Equation (9). The normality is not essential for construction of this covariance structure.
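To make the generative recipe concrete, here is a small sketch (ours, not the authors' code) for the vector case. It uses the simple, admissible choices $\mathbf{M}_x = \mathbf{M}_y = \mathrm{diag}(\boldsymbol{\rho}_d)^{1/2}$ and $\mathbf{T}_x = \tau_x \mathbf{I}$, $\mathbf{T}_y = \tau_y \mathbf{I}$, takes the means to be zero, and samples from the latent factor model above with conditional covariances $\boldsymbol{\Sigma}_x - \mathbf{A}_x \mathbf{A}_x^\top$ and $\boldsymbol{\Sigma}_y - \mathbf{A}_y \mathbf{A}_y^\top$ so that the marginal covariances equal $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$.

```python
import numpy as np

def make_cov(U, tau):
    """A covariance Sigma with U^T Sigma U = I_d, built from the thin QR of U."""
    Q, R = np.linalg.qr(U)                      # thin QR: U = Q R
    Rinv = np.linalg.inv(R)
    P_perp = np.eye(U.shape[0]) - Q @ Q.T       # projector onto the complement of span(U)
    return Q @ Rinv.T @ Rinv @ Q.T + tau**2 * P_perp

def simulate_cca_data(Vd, Wd, rho, N, tau_x=1.0, tau_y=1.0, seed=None):
    """Draw N pairs (x, y) whose canonical vectors are Vd, Wd with correlations rho."""
    rng = np.random.default_rng(seed)
    p, d = Vd.shape
    q = Wd.shape[0]
    Sx, Sy = make_cov(Vd, tau_x), make_cov(Wd, tau_y)
    Mx = My = np.diag(np.sqrt(rho))             # Mx My^T = diag(rho), spectral norm <= 1
    Ax, Ay = Sx @ Vd @ Mx, Sy @ Wd @ My
    Z = rng.standard_normal((N, d))
    X = Z @ Ax.T + rng.multivariate_normal(np.zeros(p), Sx - Ax @ Ax.T, size=N)
    Y = Z @ Ay.T + rng.multivariate_normal(np.zeros(q), Sy - Ay @ Ay.T, size=N)
    return X, Y

# Usage: a single canonical pair (d = 1) with canonical correlation 0.9.
Vd, Wd = np.random.randn(20, 1), np.random.randn(30, 1)
X, Y = simulate_cca_data(Vd, Wd, rho=np.array([0.9]), N=5000, seed=1)
```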

The separable covariance structure brings a complication. To account for this parsimonious structure in the generative model, we construct marginal factor matrices $\boldsymbol{\Sigma}_{x,i}$ and $\boldsymbol{\Sigma}_{y,j}$ that satisfy $\mathbf{V}_i^\top \boldsymbol{\Sigma}_{x,i} \mathbf{V}_i = R_x^{-1/D_x} \mathbf{I}_{R_x}$ and $\mathbf{W}_j^\top \boldsymbol{\Sigma}_{y,j} \mathbf{W}_j = R_y^{-1/D_y} \mathbf{I}_{R_y}$. Let $\mathbf{V}_i = \mathbf{Q}_{x,i} \mathbf{R}_{x,i}$ and $\mathbf{W}_j = \mathbf{Q}_{y,j} \mathbf{R}_{y,j}$ be the thin QR decompositions. Then

$$\boldsymbol{\Sigma}_{x,i} = R_x^{-1/D_x} \mathbf{Q}_{x,i} \mathbf{R}_{x,i}^{-\top} \mathbf{R}_{x,i}^{-1} \mathbf{Q}_{x,i}^\top + \mathbf{T}_{x,i} \left( \mathbf{I}_{p_i} - \mathbf{Q}_{x,i} \mathbf{Q}_{x,i}^\top \right) \mathbf{T}_{x,i}^\top$$
$$\boldsymbol{\Sigma}_{y,j} = R_y^{-1/D_y} \mathbf{Q}_{y,j} \mathbf{R}_{y,j}^{-\top} \mathbf{R}_{y,j}^{-1} \mathbf{Q}_{y,j}^\top + \mathbf{T}_{y,j} \left( \mathbf{I}_{q_j} - \mathbf{Q}_{y,j} \mathbf{Q}_{y,j}^\top \right) \mathbf{T}_{y,j}^\top$$

satisfy the conditions for arbitrary $\mathbf{T}_{x,i} \in \mathbb{R}^{p_i \times p_i}$ and $\mathbf{T}_{y,j} \in \mathbb{R}^{q_j \times q_j}$. We then have the desired property

$$\mathrm{vec}(\mathcal{V})^\top \left( \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} \right) \mathrm{vec}(\mathcal{V}) = \mathbf{1}_{R_x}^\top \left( \mathbf{V}_{D_x} \odot \cdots \odot \mathbf{V}_1 \right)^\top \left( \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} \right) \left( \mathbf{V}_{D_x} \odot \cdots \odot \mathbf{V}_1 \right) \mathbf{1}_{R_x} = \mathbf{1}_{R_x}^\top \left( \mathop{\ast}_i \mathbf{V}_i^\top \boldsymbol{\Sigma}_{x,i} \mathbf{V}_i \right) \mathbf{1}_{R_x} = R_x^{-1} \mathbf{1}_{R_x}^\top \mathbf{I}_{R_x} \mathbf{1}_{R_x} = 1.$$

Similarly, $\mathrm{vec}(\mathcal{W})^\top \left( \boldsymbol{\Sigma}_{y,D_y} \otimes \cdots \otimes \boldsymbol{\Sigma}_{y,1} \right) \mathrm{vec}(\mathcal{W}) = 1$.

7 |. NUMERICAL EXPERIMENTS

We use the generative model described in Section 6 to evaluate the methods discussed in this paper: classic CCA, TCCA, scTCCA, sparse TCCA, and sparse TCCA with the separable covariance. We assess these methods on their ability to recover the true latent population parameters $(\mathcal{V}, \mathcal{W})$ used to generate i.i.d. samples of tensor data pairs $(\mathcal{X}_n, \mathcal{Y}_n)$ for $n \in [1000]$. In all examples, the true $\mathcal{V}$ is a vector of length 100 with six entries set to 1 and the rest to 0. We use three different latent $\mathcal{W} \in \mathbb{R}^{64 \times 64}$ shown in Figure 1: $\mathcal{W}_{\text{rectangle}}$, $\mathcal{W}_{\text{cross}}$, and $\mathcal{W}_{\text{butterfly}}$. White pixels indicate values of 1, and black pixels indicate values of 0. The rectangle and cross population canonical tensors $\mathcal{W}_{\text{rectangle}}$ and $\mathcal{W}_{\text{cross}}$ are low rank; specifically, they are Rank 1 and Rank 2, respectively. The butterfly population canonical tensor $\mathcal{W}_{\text{butterfly}}$ has high rank. At first glance, these illustrative examples do not come across as challenging estimation problems in the high dimension-low sample size regime, but if we were to vectorize the data and perform CCA, the number of parameters to fit is $100 + 64^2 = 4{,}196$, whereas the sample size is 1,000. Because the number of parameters exceeds the sample size, the sample covariance matrix is singular. Consequently, we add a small ridge term $10^{-3}\mathbf{I}$ to the sample covariance matrices so that the generalized eigenvalue problem has a unique solution. The code for generating the simulation results is provided in the Supporting Information.

FIGURE 1. True latent population $\mathcal{W}$ canonical tensors for the numerical experiments.

7.1 |. Evaluation criteria and selection of tuning parameter

Canonical tensors can only be estimated up to a scaling factor. Thus, we measure estimation accuracy by the angle between population canonical tensors used to generate the data (V,W) and estimated canonical tensors (V^,W^), as follows:

$$\angle(\mathcal{V}, \hat{\mathcal{V}}) = \frac{\langle \mathbf{v}, \hat{\mathbf{v}} \rangle}{\|\mathbf{v}\|_2 \|\hat{\mathbf{v}}\|_2}, \quad \text{and} \quad \angle(\mathcal{W}, \hat{\mathcal{W}}) = \frac{\langle \mathbf{w}, \hat{\mathbf{w}} \rangle}{\|\mathbf{w}\|_2 \|\hat{\mathbf{w}}\|_2}.$$

The angles can take on values from −1 to 1, where a value closer to 1 indicates better recovery of the true canonical tensors.
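In code, this accuracy measure is simply the cosine similarity between the vectorized tensors (a small sketch of our own):

```python
import numpy as np

def angle(T_true, T_hat):
    """Cosine-similarity accuracy measure between a true and an estimated canonical tensor."""
    v, v_hat = T_true.ravel(), T_hat.ravel()
    return float(v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat)))

print(angle(np.eye(3), np.eye(3) + 0.05 * np.random.randn(3, 3)))   # close to 1
```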

We use $K$-fold cross-validation to perform model selection, for example, selection of the tensor rank or the sparsity level. We split our data into $K$ equally sized groups; for $k \in [K]$, we estimate a pair $(\mathcal{V}_k, \mathcal{W}_k)$ on all but the $k$th fold of data for a sequence of models of varying complexity, $M_1, M_2, \ldots$, where each model $M_l$ has a fixed pair of tensor ranks $R_x$ and $R_y$ and sparsity-inducing parameter $\lambda$. We denote the fitted canonical correlation under model $M_l$ by $\hat{\rho}_{-k}(\mathcal{V}_k, \mathcal{W}_k; M_l)$. Using $(\mathcal{V}_k, \mathcal{W}_k)$, we compute the canonical correlation on the held-out $k$th fold, which we denote $\hat{\rho}_k(\mathcal{V}_k, \mathcal{W}_k)$. We choose the model that minimizes the average discrepancy between the empirical canonical correlations on the training sets and testing sets (Waaijenborg, Verselewel de Witt Hamer, & Zwinderman, 2008).

$$\hat{M} = \underset{M_l}{\arg\min} \; \frac{1}{K} \sum_{k=1}^{K} \left| \hat{\rho}_k(\mathcal{V}_k, \mathcal{W}_k) - \hat{\rho}_{-k}(\mathcal{V}_k, \mathcal{W}_k; M_l) \right|.$$

We then use all the data to estimate (V,W) using model M^.

7.2 |. Results

Figure 2 compares the estimation accuracy of the various methods over 100 replicates. First, we observe that the various versions of TCCA outperform CCA on the vectorized data for all three choices of the latent canonical tensor $\mathcal{W}$. Second, we notice some overfitting in the rectangle and cross problems. In the rectangle problem, where the population canonical tensor $\mathcal{W}_{\text{rectangle}}$ is a Rank 1 tensor, all four TCCA methods show the best performance when we fix the rank $R_y$ for estimating $\mathcal{W}$ to be 1, and the performance deteriorates as higher ranks are used. The same trend can be seen in the cross problem, where the population canonical tensor $\mathcal{W}_{\text{cross}}$ is a Rank 2 tensor: all TCCA methods give smaller angles when $R_y = 3$ is used than when $R_y = 2$ is used. In contrast, in the case of the butterfly image, the rank of the population canonical tensor $\mathcal{W}_{\text{butterfly}}$ is much greater than 3. Thus, we expect neither an overfitting problem nor estimation performance as good as in the low-rank cases. We confirm in Figure 2 that the calculated angles for the butterfly problem are lower than those for the rectangle and cross problems.

FIGURE 2. Angles between the recovered canonical vector/tensors and the true canonical vector/tensors. CCA, canonical correlation analysis; scTCCA, tensor canonical correlation analysis with separable covariance; spscTCCA, sparse tensor canonical correlation analysis with separable covariance; spTCCA, sparse tensor canonical correlation analysis; TCCA, tensor canonical correlation analysis.

Another interesting observation is that, in the cross and butterfly problems, the TCCA methods with a separable covariance structure perform better than the other two models when the rank parameter $R_y$ is set higher than the true value. This result implies that assuming a separable covariance structure improves the estimation accuracy in these two problems. One explanation is that the true image has a symmetric structure; due to this special structure, we might expect improved performance from the more parsimonious model. However, the structured model does not have much effect on the rectangle problem. This may be because the true canonical tensor $\mathcal{W}$ already possesses relatively few parameters: the rectangle image is also symmetric but is Rank 1, and hence has very few parameters.

Table 1 shows the computation time taken by each method. In most cases, the computation times of the TCCA methods are better than or comparable with those of the CCA method. In particular, there is a large improvement when the underlying true canonical tensor has low rank. Also, sparse models generally take less time than nonsparse models, and separable covariance structure models generally take less time than the models without that assumption, which is expected.

TABLE 1.

Computation time: The mean run times are reported in seconds with the standard deviation in parentheses

Problem Rank CCA TCCA spTCCA scTCCA spscTCCA
Rectangle 1 22.90 (3.62) 7.52 (1.10) 7.36 (0.95) 3.14 (0.55) 3.09 (0.44)
2 22.80 (3.18) 15.32 (2.51) 15.12 (2.17) 11.03 (2.03) 30.11 (8.78)
3 21.85 (3.36) 24.07 (3.91) 18.61 (3.14) 15.89 (2.95) 37.09 (12.09)
Cross 1 18.49 (2.32) 8.05 (1.34) 7.78 (1.40) 3.39 (0.53) 13.19 (12.14)
2 17.33 (2.43) 17.31 (3.32) 15.13 (5.02) 6.96 (1.11) 33.55 (12.78)
3 17.10 (2.55) 24.44 (3.79) 54.86 (25.73) 10.04 (1.84) 9.63 (2.05)
Butterfly 1 18.92 (3.48) 7.08 (1.60) 61.18 (28.58) 2.90 (0.64) 31.66 (6.36)
2 18.23 (2.92) 20.01 (4.53) 30.10 (14.87) 9.89 (2.15) 48.57 (13.14)
3 19.99 (2.55) 32.27 (4.68) 26.09 (3.89) 11.84 (2.06) 13.58 (5.89)

Note. Experiments were performed on a computer cluster consisting of machines with Intel Xeon-based processors (I5 or I7 processors) with RAM ranging from 1,600 to 2,100 MHz. Abbreviations: CCA, canonical correlation analysis; scTCCA, tensor canonical correlation analysis with separable covariance; spscTCCA, sparse tensor canonical correlation analysis with separable covariance; spTCCA, sparse tensor canonical correlation analysis; TCCA, tensor canonical correlation analysis

8 |. DISCUSSION

We have proposed a TCCA approach for finding the relationship between linear combinations of two tensor datasets. Our method combines the classic CCA approach and the low-rank tensor decomposition to reduce the vast dimensionality of tensor parameters. The proposed estimation algorithm scales well with the tensor data size and is easy to implement using existing statistical software. Our algorithms also come with convergence guarantees, properties arguably lacking in the alternatives in the literature. We briefly highlight some open problems for further investigation.

In our illustrative examples, we did not incorporate a procedure for selecting the rank parameter. A main motivation for the low-rank formulation of the canonical tensors is reduced computation. Thus, one may wish to choose as high a rank parameter as one’s computational budget allows. Nonetheless, we leave for future work a more principled approach to rank selection. One promising angle of pursuit would be to leverage the equivalence between CCA and a least squares problem (Sun, Ji, & Ye, 2008) and then derive an information criterion, a strategy commonly used to select the rank parameter in several recently proposed tensor estimation procedures (Zhou et al., 2013; Sun et al., 2017; Sun & Li, 2019).

An additional direction for future work is to generalize our TCCA framework to handle the analysis of multiple tensor datasets. There are numerous proposed methods to extend CCA for multiple vector-measurement datasets (Carroll, 1968; Kettenring, 1971; Hanafi, 2007; Witten & Tibshirani, 2009; Luo et al., 2015) but not for multiple tensor datasets to the best of our knowledge.

Another problem not tackled in this paper is verifying the consistency of the proposed estimation procedures. Our clear specification of the population models provides a framework for studying the consistency property in both the large-$n$, fixed-$p$ (Zhou et al., 2013) and the large-$n$, diverging-$p$ settings (Zhang, Li, Zhou, Zhou, & Shen, 2019). Finally, we are currently investigating our methods on a real dataset.


Footnotes

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of the article.

REFERENCES

  1. Bach FR, & Jordan MI (2006). A probabilistic interpretation of canonical correlation analysis. Technical report.
  2. Bergstra J, & Bengio Y (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb), 281–305.
  3. Bickel PJ, & Levina E (2008a). Covariance regularization by thresholding. The Annals of Statistics, 36(6), 2577–2604.
  4. Bickel PJ, & Levina E (2008b). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1), 199–227.
  5. Cai TT, & Yuan M (2012). Adaptive covariance matrix estimation through block thresholding. The Annals of Statistics, 40(4), 2014–2042.
  6. Carroll JD (1968). Generalization of canonical correlation analysis to three or more sets of variables. Proceedings of the 76th Annual Convention of the American Psychological Association, Washington, DC, Vol. 3, pp. 227–228.
  7. Carroll JD, & Chang J-J (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35, 283–319.
  8. Gang L, Yong Z, Yan-Lei L, & Jing D (2011). Three dimensional canonical correlation analysis and its application to facial expression recognition. International Conference on Intelligent Computing and Information Science, Berlin, Heidelberg: Springer, pp. 56–61.
  9. González I, Déjean S, Martin P, & Baccini A (2008). CCA: An R package to extend canonical correlation analysis. Journal of Statistical Software, 23(1), 1–14.
  10. Hanafi M (2007). PLS path modelling: Computation of latent variables with the estimation mode B. Computational Statistics, 22(2), 275–292.
  11. Harshman RA (1970). Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16, 1–84.
  12. Hoff PD (2011). Separable covariance arrays via the Tucker product, with applications to multivariate relational data. Bayesian Analysis, 6(2), 179–196.
  13. Hotelling H (1936). Relations between two sets of variates. Biometrika, 28(3–4), 321–377.
  14. Kettenring JR (1971). Canonical analysis of several sets of variables. Biometrika, 58(3), 433–451.
  15. Kolda TG, & Bader BW (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500.
  16. Kubokawa T, Srivastava MS, et al. (2013). Optimal ridge-type estimators of covariance matrix in high dimension. CIRJE, Faculty of Economics, University of Tokyo.
  17. Ledoit O, & Wolf M (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.
  18. Ledoit O, & Wolf M (2012). Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics, 40(2), 1024–1060.
  19. Lee SH, & Choi S (2007). Two-dimensional canonical correlation analysis. IEEE Signal Processing Letters, 14(10), 735–738.
  20. Lu H (2013). Learning canonical correlations of paired tensor sets via tensor-to-vector projection. Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI '13), Beijing, China: AAAI Press, pp. 1516–1522.
  21. Luo Y, Tao D, Ramamohanarao K, Xu C, & Wen Y (2015). Tensor canonical correlation analysis for multi-view dimension reduction. IEEE Transactions on Knowledge and Data Engineering, 27(11), 3111–3124.
  22. Ma Z (2013). Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2), 772–801.
  23. Srivastava MS, & Reid N (2012). Testing the structure of the covariance matrix with fewer observations than the dimension. Journal of Multivariate Analysis, 112, 156–171.
  24. Stein JL, Hua X, Lee S, Ho AJ, Leow AD, Toga AW, & Thompson PM (2010). Voxelwise genome-wide association study (vGWAS). NeuroImage, 53(3), 1160–1174.
  25. Sun L, Ji S, & Ye J (2008). A least squares formulation for canonical correlation analysis. Proceedings of the 25th International Conference on Machine Learning (ICML '08), New York, NY, USA: ACM, pp. 1024–1031.
  26. Sun WW, & Li L (2019). Dynamic tensor clustering. Journal of the American Statistical Association, in press.
  27. Sun WW, Lu J, Liu H, & Cheng G (2017). Provable sparse tensor decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3), 899–916.
  28. Tan KM, Wang Z, Liu H, & Zhang T (2018). Sparse generalized eigenvalue problem: Optimal statistical rates via truncated Rayleigh flow. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1057–1086.
  29. Vinod HD (1976). Canonical ridge and econometrics of joint production. Journal of Econometrics, 4(2), 147–166.
  30. Waaijenborg S, Verselewel de Witt Hamer PC, & Zwinderman AH (2008). Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis. Statistical Applications in Genetics and Molecular Biology, 7(1), Article 3.
  31. Wang H (2010). Local two-dimensional canonical correlation analysis. IEEE Signal Processing Letters, 17(11), 921–924.
  32. Wang Z, Gu Q, Ning Y, & Liu H (2015). High dimensional EM algorithm: Statistical optimization and asymptotic normality. In Cortes C, Lawrence ND, Lee DD, Sugiyama M, & Garnett R (Eds.), Advances in Neural Information Processing Systems 28 (pp. 2521–2529). Curran Associates, Inc.
  33. Wang S-J, Yan W-J, Sun T, Zhao G, & Fu X (2016). Sparse tensor canonical correlation analysis for micro-expression recognition. Neurocomputing, 214, 218–232.
  34. Witten DM, & Tibshirani RJ (2009). Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, 8(1), 1–27.
  35. Yan J, Zheng W, Zhou X, & Zhao Z (2012). Sparse 2-D canonical correlation analysis via low rank matrix approximation for feature extraction. IEEE Signal Processing Letters, 19(1), 51–54.
  36. Zhang X, Li L, Zhou H, Zhou Y, & Shen D (2019). Tensor generalized estimating equations for longitudinal imaging analysis. Statistica Sinica, 29, 1977–2005.
  37. Zhou H, Li L, & Zhu H (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502), 540–552.
