Abstract
Canonical correlation analysis (CCA) is a multivariate analysis technique for estimating a linear relationship between two sets of measurements. Modern acquisition technologies, for example, those arising in neuroimaging and remote sensing, produce data in the form of multidimensional arrays or tensors. Classic CCA is not appropriate for dealing with tensor data due to the multidimensional structure and ultrahigh dimensionality of such modern data. In this paper, we present tensor CCA (TCCA) to discover relationships between two tensors while simultaneously preserving multidimensional structure of the tensors and utilizing substantially fewer parameters. Furthermore, we show how to employ a parsimonious covariance structure to gain additional stability and efficiency. We delineate population and sample problems for each model and propose efficient estimation algorithms with global convergence guarantees. We also describe a probabilistic model for TCCA that enables the generation of synthetic data with desired canonical variates and correlations. Simulation studies illustrate the performance of our methods.
Keywords: block coordinate ascent, CP decomposition, multidimensional array data
1 |. INTRODUCTION
Canonical correlation analysis (CCA) is a classic statistical method for identifying associations between two sets of measurements (Hotelling, 1936). Specifically, CCA identifies a pair of coefficient vectors, one for each set of measurements, such that the correlation between the corresponding linear combinations of variables from each set is maximized. By default, CCA applies when each observation consists of a pair of vector covariates. In many modern data analysis problems, however, each observation may consist more generally of a pair of multidimensional arrays or tensors. For example, in imaging genetics, to identify genetic variants that can best capture and explain phenotypic variations in brain function and structure, Stein et al. (2010) studied p = 448,293 single-nucleotide polymorphisms and q = 31,622 voxels in brain images, on n = 740 individuals. A naive approach to dealing with tensor-valued data would be to reshape tensor covariates into vectors and then apply standard CCA. There are, however, two serious drawbacks to doing so. First, structural information in tensors is discarded through vectorization. Second, the resulting vectors consist of a prohibitively large number of parameters. In the imaging genetics problem (Stein et al., 2010), vectorizing the voxel intensity measurements disregards the spatial correlation among neighbouring voxels. Moreover, applying standard CCA to vectors of single-nucleotide polymorphism covariates and vectorized brain images would require estimating nearly half a million parameters using fewer than a thousand observations.
In light of these issues, there have been extensions of CCA to handle special cases of tensor-valued data (Lee & Choi, 2007; Wang, 2010; Yan, Zheng, Zhou, & Zhao, 2012; Gang, Yong, Yan-Lei, & Jing, 2011; Lu, 2013; Wang, Yan, Sun, Zhao, & Fu, 2016). Although they have exhibited good empirical performance in some applications, there remain no clear population models underlying these sample-based heuristics. To address this gap in the literature, we introduce a novel statistical model for tensor CCA (TCCA). We summarize at a high level our formulation and its contributions:
We propose a TCCA population model that imposes the CANDECOMP/PARAFAC (CP) decomposition (Carroll & Chang, 1970; Harshman, 1970) structure on canonical tensors. This population model enforces model parsimony and enables efficient estimation.
We propose a refinement of TCCA, which assumes a separable covariance structure (scTCCA). This refinement enables efficient estimation of large covariance matrices of tensor-valued data.
We derive convenient representations of the covariance between linear combinations of two random tensors under the unstructured and separable covariance structured assumptions.
We develop efficient estimation algorithms for TCCA and scTCCA, both based on block coordinate ascent, which leverage these efficient representations. Each step of both algorithms solves a substantially lower dimensional CCA problem; thus, both algorithms can be easily implemented using any standard solvers for the CCA problem. Moreover, we prove global convergence guarantees of both estimation algorithms under modest regularity conditions.
We develop simple modifications to the TCCA and scTCCA estimation algorithms to incorporate recovery of sparse canonical correlation tensors to improve interpretability of the estimated models.
Finally, we extend the probabilistic interpretation of CCA by Bach and Jordan (2006) to TCCA. This extension leads to a probabilistic model for generating datasets with specified canonical correlation and variates.
The remainder of the paper is organized as follows. In Section 2, we review tensor notation and basic operations used in this paper. In Sections 3 and 4, we propose our two TCCA methods: TCCA and scTCCA. In Section 5, we describe a modification of the estimation algorithms for TCCA and scTCCA for sparse models. In Section 6, we introduce the probabilistic TCCA model. In Section 7, we describe the numerical experiment results. In Section 8, we conclude and highlight directions for future work.
2 |. NOTATION AND PRELIMINARIES
We review basic operations on matrices and tensors invoked throughout this paper, adopting the terminology and notation in Kolda and Bader (2009). Throughout the paper, we use lowercase letters to indicate scalars, bold lowercase letters to indicate vectors, bold capital characters to indicate matrices, and bold calligraphic capital characters to indicate tensors. We will also use the shorthand [n] to denote an index set {1, …, n}.
Tensors can be considered generalizations of scalars, vectors, and matrices. Let $\mathcal{X}$ represent a $D$-dimensional tensor in $\mathbb{R}^{p_1 \times p_2 \times \cdots \times p_D}$. The tensor has order $D$, its number of dimensions or modes. For example, vectors are tensors of order one and have one mode. Matrices are tensors of order two and have two modes. We denote an element of $\mathcal{X}$ by $x_{\iota_1 \iota_2 \cdots \iota_D}$, where $\iota_i \in [p_i]$ and $i \in [D]$. Fibres are the generalization of matrix rows and columns to higher order tensors. A fibre is defined by fixing the index of every dimension except one. Mode $i$ fibres are $p_i$-dimensional vectors extracted from $\mathcal{X}$ by fixing all the indices except the $i$th one, $\iota_i$. For example, columns of a matrix are Mode 1 fibres, and rows of a matrix are Mode 2 fibres.
It is often useful to reshape a tensor into a matrix. Reordering a tensor into a matrix is referred to as matricization. The mode $i$ matricization of a tensor $\mathcal{X}$, denoted by $\mathbf{X}_{(i)} \in \mathbb{R}^{p_i \times \prod_{k \neq i} p_k}$, arranges the mode $i$ fibres as the columns of the matrix $\mathbf{X}_{(i)}$. In a mode $i$ matricization, the tensor element $x_{\iota_1 \iota_2 \cdots \iota_D}$ is mapped to the matrix element of $\mathbf{X}_{(i)}$ with index $(\iota_i, j)$, where
$$j = 1 + \sum_{k=1, k \neq i}^{D} (\iota_k - 1) J_k \quad \text{with} \quad J_k = \prod_{k'=1, k' \neq i}^{k-1} p_{k'}.$$
Reordering a tensor into a vector is referred to as vectorization. We first describe vectorization of a matrix before describing vectorization of a general tensor. The vectorization of a matrix $\mathbf{X}$ is denoted by $\mathrm{vec}(\mathbf{X})$ and is the vector obtained by stacking the columns of $\mathbf{X}$ on top of each other. The vectorization of the mode $i$ matricization of a tensor in turn is denoted by $\mathbf{x}_{(i)} = \mathrm{vec}(\mathbf{X}_{(i)})$. We then define the vectorization of a tensor $\mathcal{X}$, denoted by $\mathrm{vec}(\mathcal{X})$, as the vectorization of its Mode 1 matricization, namely, $\mathrm{vec}(\mathbf{X}_{(1)})$. When unambiguous from context, we will often denote the vectorization of a tensor $\mathcal{X}$ by its corresponding bold lowercase $\mathbf{x}$.
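To make these reshaping conventions concrete, here is a minimal numpy sketch, assuming the column-ordering convention of Kolda and Bader (2009); `matricize` is our illustrative helper name, not a function from the paper.

```python
import numpy as np

# A small order-three tensor with modes of sizes p1 = 3, p2 = 4, p3 = 2.
X = np.arange(24.0).reshape(3, 4, 2)

def matricize(X, i):
    """Mode-i matricization X_(i): mode i indexes the rows, and the mode-i
    fibres form the columns, ordered as in Kolda and Bader (2009)."""
    return np.moveaxis(X, i, 0).reshape(X.shape[i], -1, order="F")

X1 = matricize(X, 0)                  # 3 x 8 matrix of mode-1 fibres
x = X1.flatten(order="F")             # vec(X) = vec of the mode-1 matricization
print(X1.shape, x.shape)              # (3, 8) (24,)
```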
The inner product of two tensors of compatible dimensions $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ is the sum of the products of their entries, namely,
$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{\iota_1=1}^{p_1} \cdots \sum_{\iota_D=1}^{p_D} x_{\iota_1 \cdots \iota_D}\, y_{\iota_1 \cdots \iota_D}.$$
The mode $i$ product of a tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ with a matrix $\mathbf{U} \in \mathbb{R}^{J \times p_i}$ is denoted by $\mathcal{X} \times_i \mathbf{U}$ and is the tensor of size $p_1 \times \cdots \times p_{i-1} \times J \times p_{i+1} \times \cdots \times p_D$ with elements
$$(\mathcal{X} \times_i \mathbf{U})_{\iota_1 \cdots \iota_{i-1}\, j\, \iota_{i+1} \cdots \iota_D} = \sum_{\iota_i=1}^{p_i} x_{\iota_1 \iota_2 \cdots \iota_D}\, u_{j \iota_i}.$$
Finally, we review three kinds of matrix products, as well as one definition of matrix division, that will be used throughout the paper.
- For two matrices $\mathbf{A} \in \mathbb{R}^{p_1 \times p_2}$ and $\mathbf{B} \in \mathbb{R}^{q_1 \times q_2}$, the Kronecker product $\mathbf{A} \otimes \mathbf{B}$ is the $p_1q_1$-by-$p_2q_2$ matrix
$$\mathbf{A} \otimes \mathbf{B} = \begin{pmatrix} a_{11}\mathbf{B} & \cdots & a_{1p_2}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{p_11}\mathbf{B} & \cdots & a_{p_1p_2}\mathbf{B} \end{pmatrix}.$$
- For two matrices $\mathbf{A} \in \mathbb{R}^{p_1 \times p_2}$ and $\mathbf{B} \in \mathbb{R}^{q_1 \times p_2}$ that have the same number of columns $p_2$, the Khatri-Rao product $\mathbf{A} \odot \mathbf{B}$ is the $p_1q_1$-by-$p_2$ matrix
$$\mathbf{A} \odot \mathbf{B} = (\mathbf{a}_1 \otimes \mathbf{b}_1, \ldots, \mathbf{a}_{p_2} \otimes \mathbf{b}_{p_2}),$$
which is a column-wise Kronecker product of $\mathbf{A}$ and $\mathbf{B}$.
- For two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, the Hadamard product is the element-wise product $\mathbf{A} * \mathbf{B} = \{a_{ij}b_{ij}\}$. Because the Hadamard product commutes, we use $*_i\, \mathbf{A}_i$ to denote $\mathbf{A}_1 * \cdots * \mathbf{A}_m = \mathbf{A}_{\pi(1)} * \cdots * \mathbf{A}_{\pi(m)}$ for any permutation $\pi$.
- Finally, for two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, the Hadamard quotient is the element-wise quotient $\mathbf{A} \oslash \mathbf{B} = \{a_{ij}/b_{ij}\}$.
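The following short numpy sketch illustrates all four operations; the variable names are ours, and `khatri_rao` is the SciPy implementation of the column-wise Kronecker product.

```python
import numpy as np
from scipy.linalg import khatri_rao

A = np.arange(6.0).reshape(3, 2)    # 3 x 2
B = np.arange(8.0).reshape(4, 2)    # 4 x 2, same number of columns as A
C = np.ones((3, 2))                 # same size as A

K = np.kron(A, B)                   # Kronecker product: 12 x 4
KR = khatri_rao(A, B)               # Khatri-Rao product: 12 x 2 (column-wise Kronecker)
H = A * C                           # Hadamard (element-wise) product: 3 x 2
Q = A / C                           # Hadamard quotient: 3 x 2
```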
3 |. TENSOR CANONICAL CORRELATION ANALYSIS
3.1 |. Population TCCA
Let $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ and $\mathcal{Y} \in \mathbb{R}^{q_1 \times \cdots \times q_{D_y}}$ be two random tensors of order $D_x$ and $D_y$, respectively. We denote the vectorizations of $\mathcal{X}$ and $\mathcal{Y}$ by $\mathbf{x}$ and $\mathbf{y}$, respectively. Denote by $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ the covariances of $\mathbf{x}$ and $\mathbf{y}$, respectively. Denote by $\boldsymbol{\Sigma}_{xy}$ the covariance between $\mathbf{x}$ and $\mathbf{y}$. Let $\mathcal{B}_x$ and $\mathcal{B}_y$ be constant tensors of the same sizes as $\mathcal{X}$ and $\mathcal{Y}$, and let $\rho(\mathcal{B}_x, \mathcal{B}_y)$ denote the correlation between the two linear combinations $\langle \mathcal{X}, \mathcal{B}_x \rangle$ and $\langle \mathcal{Y}, \mathcal{B}_y \rangle$, namely,

$$\rho(\mathcal{B}_x, \mathcal{B}_y) = \frac{\mathrm{Cov}(\langle \mathcal{X}, \mathcal{B}_x \rangle, \langle \mathcal{Y}, \mathcal{B}_y \rangle)}{\sqrt{\mathrm{Var}\langle \mathcal{X}, \mathcal{B}_x \rangle}\sqrt{\mathrm{Var}\langle \mathcal{Y}, \mathcal{B}_y \rangle}} = \frac{\mathbf{b}_x^\top \boldsymbol{\Sigma}_{xy} \mathbf{b}_y}{\sqrt{\mathbf{b}_x^\top \boldsymbol{\Sigma}_x \mathbf{b}_x}\sqrt{\mathbf{b}_y^\top \boldsymbol{\Sigma}_y \mathbf{b}_y}}, \tag{1}$$

where $\mathbf{b}_x = \mathrm{vec}(\mathcal{B}_x)$ and $\mathbf{b}_y = \mathrm{vec}(\mathcal{B}_y)$.
The pair $(\mathcal{B}_x, \mathcal{B}_y)$ that maximizes $\rho$ are the canonical tensors, and the optimal $\rho$ is the canonical correlation coefficient. Maximizing the objective in Equation (1) presents two challenges: (a) the high dimensionality of the optimization variables $\mathcal{B}_x$ and $\mathcal{B}_y$ and (b) the estimation of the huge covariance matrices $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ and the cross-covariance matrix $\boldsymbol{\Sigma}_{xy}$. We will address challenge (b) by imposing a separable covariance structure in Section 4.
To address challenge (a), we impose the parsimonious CANDECOMP/PARAFAC (CP), or Kruskal, representation on the canonical tensors. The CP representation generalizes the idea of representing a matrix as the sum of Rank 1 matrices to representing a tensor as the sum of Rank 1 tensors. An order-$D$ tensor $\mathcal{B}$ is Rank 1 if it can be expressed as the outer product of $D$ vectors $\mathbf{a}^{(1)}, \mathbf{a}^{(2)}, \ldots, \mathbf{a}^{(D)}$, namely, $\mathcal{B} = \mathbf{a}^{(1)} \circ \mathbf{a}^{(2)} \circ \cdots \circ \mathbf{a}^{(D)}$, where the binary operator $\circ$ denotes the vector outer product. Thus, the $(\iota_1, \iota_2, \ldots, \iota_D)$th element of $\mathcal{B}$ is $b_{\iota_1 \iota_2 \cdots \iota_D} = a^{(1)}_{\iota_1} a^{(2)}_{\iota_2} \cdots a^{(D)}_{\iota_D}$. A rank-$R$ tensor can be written as the sum of $R$ Rank 1 tensors, namely,
$$\mathcal{B} = \sum_{r=1}^{R} \mathbf{a}_r^{(1)} \circ \mathbf{a}_r^{(2)} \circ \cdots \circ \mathbf{a}_r^{(D)},$$
where $\mathbf{A}^{(i)} = (\mathbf{a}_1^{(i)}, \ldots, \mathbf{a}_R^{(i)}) \in \mathbb{R}^{p_i \times R}$ denotes the mode $i$ factor matrix. We use the Kruskal notation $\mathcal{B} = [\![\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(D)}]\!]$ to concisely summarize the sum.
Thus, instead of searching over the space of all order-$D_x$ and order-$D_y$ tensor pairs $(\mathcal{B}_x, \mathcal{B}_y)$, we limit our search to tensors of rank $R_x$ and rank $R_y$,

$$\mathcal{B}_x = [\![\mathbf{V}_1, \ldots, \mathbf{V}_{D_x}]\!] \quad \text{and} \quad \mathcal{B}_y = [\![\mathbf{W}_1, \ldots, \mathbf{W}_{D_y}]\!], \tag{2}$$

where $\mathbf{V}_i \in \mathbb{R}^{p_i \times R_x}$ for $i \in [D_x]$ and $\mathbf{W}_j \in \mathbb{R}^{q_j \times R_y}$ for $j \in [D_y]$.
As we will see later, this parameterization makes progress towards alleviating the burden of estimating a huge covariance matrix. Note that
$$\langle \mathcal{X}, \mathcal{B}_x \rangle = \sum_{r=1}^{R_x} \mathcal{X} \times_1 \mathbf{v}_{1r}^\top \times_2 \cdots \times_{D_x} \mathbf{v}_{D_x r}^\top,$$
where $\mathbf{v}_{ir}$ denotes the $r$th column of $\mathbf{V}_i$. Thus, we seek to maximize the correlation between a rank-$R_x$ multilinear form in $\mathcal{X}$ and a rank-$R_y$ multilinear form in $\mathcal{Y}$. Multiway information is preserved, and the dimensionality is reduced from an exponential number of parameters, $\prod_{i=1}^{D_x} p_i + \prod_{j=1}^{D_y} q_j$, to a linear number of parameters, $R_x \sum_{i=1}^{D_x} p_i + R_y \sum_{j=1}^{D_y} q_j$. Note that the ranks $(R_x, R_y)$ here are not the number of canonical tensor pairs being sought. In this paper, we focus on obtaining only the top canonical tensor pair $(\mathcal{B}_x, \mathcal{B}_y)$, which have ranks $R_x$ and $R_y$.
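To illustrate the parsimony of the Kruskal representation (2), the following sketch assembles a rank-$R$ tensor from its factor matrices and compares the number of free parameters with the number of tensor entries; `cp_to_tensor` and the chosen sizes are illustrative.

```python
import numpy as np

def cp_to_tensor(factors):
    """Assemble the Kruskal tensor [[A1, ..., AD]] from factor matrices of
    shape (p_i, R): the sum of R rank-one outer products."""
    R = factors[0].shape[1]
    X = np.zeros(tuple(A.shape[0] for A in factors))
    for r in range(R):
        outer = factors[0][:, r]
        for A in factors[1:]:
            outer = np.multiply.outer(outer, A[:, r])  # chain of outer products
        X += outer
    return X

rng = np.random.default_rng(0)
p, R = (10, 12, 8), 2
factors = [rng.standard_normal((pi, R)) for pi in p]
B = cp_to_tensor(factors)     # full tensor with 10 * 12 * 8 = 960 entries
n_params = R * sum(p)         # CP parameters: 2 * (10 + 12 + 8) = 60
```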
The following representations of $\mathrm{Cov}(\langle \mathcal{X}, \mathcal{B}_x \rangle, \langle \mathcal{Y}, \mathcal{B}_y \rangle)$, $\mathrm{Var}\langle \mathcal{X}, \mathcal{B}_x \rangle$, and $\mathrm{Var}\langle \mathcal{Y}, \mathcal{B}_y \rangle$ in terms of a CP decomposition are key to our estimation algorithms. The proof is in the Supporting Information.
Proposition 1
Let $\mathcal{X}$ and $\mathcal{Y}$ be two random tensors and $\mathcal{B}_x = [\![\mathbf{V}_1, \ldots, \mathbf{V}_{D_x}]\!]$ and $\mathcal{B}_y = [\![\mathbf{W}_1, \ldots, \mathbf{W}_{D_y}]\!]$ be two constant tensors of the same size as $\mathcal{X}$ and $\mathcal{Y}$, respectively. Define

$$\mathbf{H}_{x,-i} = \mathbf{V}_{D_x} \odot \cdots \odot \mathbf{V}_{i+1} \odot \mathbf{V}_{i-1} \odot \cdots \odot \mathbf{V}_1 \in \mathbb{R}^{\prod_{k \neq i} p_k \times R_x}, \tag{3}$$

$$\mathbf{H}_{y,-j} = \mathbf{W}_{D_y} \odot \cdots \odot \mathbf{W}_{j+1} \odot \mathbf{W}_{j-1} \odot \cdots \odot \mathbf{W}_1 \in \mathbb{R}^{\prod_{k \neq j} q_k \times R_y}, \tag{4}$$

and let $\boldsymbol{\Sigma}_{x(i)}$ denote the covariance of $\mathbf{x}_{(i)} = \mathrm{vec}(\mathbf{X}_{(i)})$. Define $\boldsymbol{\Sigma}_{y(j)}$ and the cross-covariance $\boldsymbol{\Sigma}_{x(i)y(j)}$ analogously. Then

$$\mathrm{Cov}(\langle \mathcal{X}, \mathcal{B}_x \rangle, \langle \mathcal{Y}, \mathcal{B}_y \rangle) = \mathbf{v}_i^\top (\mathbf{H}_{x,-i} \otimes \mathbf{I}_{p_i})^\top \boldsymbol{\Sigma}_{x(i)y(j)} (\mathbf{H}_{y,-j} \otimes \mathbf{I}_{q_j}) \mathbf{w}_j,$$
$$\mathrm{Var}\langle \mathcal{X}, \mathcal{B}_x \rangle = \mathbf{v}_i^\top (\mathbf{H}_{x,-i} \otimes \mathbf{I}_{p_i})^\top \boldsymbol{\Sigma}_{x(i)} (\mathbf{H}_{x,-i} \otimes \mathbf{I}_{p_i}) \mathbf{v}_i,$$
$$\mathrm{Var}\langle \mathcal{Y}, \mathcal{B}_y \rangle = \mathbf{w}_j^\top (\mathbf{H}_{y,-j} \otimes \mathbf{I}_{q_j})^\top \boldsymbol{\Sigma}_{y(j)} (\mathbf{H}_{y,-j} \otimes \mathbf{I}_{q_j}) \mathbf{w}_j,$$

for any $i \in [D_x]$ and $j \in [D_y]$, where $\mathbf{v}_i = \mathrm{vec}(\mathbf{V}_i)$ and $\mathbf{w}_j = \mathrm{vec}(\mathbf{W}_j)$.
3.2 |. Sample TCCA
Suppose we observe $N$ pairs of i.i.d. tensor data $(\mathcal{X}_n, \mathcal{Y}_n)$, $n \in [N]$, and we estimate $\mathcal{B}_x = [\![\mathbf{V}_1, \ldots, \mathbf{V}_{D_x}]\!]$ and $\mathcal{B}_y = [\![\mathbf{W}_1, \ldots, \mathbf{W}_{D_y}]\!]$ by solving the optimization problem

$$\max_{\mathbf{V}_1, \ldots, \mathbf{V}_{D_x}, \mathbf{W}_1, \ldots, \mathbf{W}_{D_y}} \frac{\mathbf{b}_x^\top \widehat{\boldsymbol{\Sigma}}_{xy} \mathbf{b}_y}{\sqrt{\mathbf{b}_x^\top \widehat{\boldsymbol{\Sigma}}_x \mathbf{b}_x}\sqrt{\mathbf{b}_y^\top \widehat{\boldsymbol{\Sigma}}_y \mathbf{b}_y}}, \tag{5}$$

where $\widehat{\boldsymbol{\Sigma}}_x$, $\widehat{\boldsymbol{\Sigma}}_y$, and $\widehat{\boldsymbol{\Sigma}}_{xy}$ are sample estimates of the corresponding covariances. Recall that CCA models can be estimated numerically by computing the solution to the following generalized eigenvalue problem:

$$\begin{pmatrix} \mathbf{0} & \widehat{\boldsymbol{\Sigma}}_{xy} \\ \widehat{\boldsymbol{\Sigma}}_{yx} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{b}_x \\ \mathbf{b}_y \end{pmatrix} = \rho \begin{pmatrix} \widehat{\boldsymbol{\Sigma}}_x & \mathbf{0} \\ \mathbf{0} & \widehat{\boldsymbol{\Sigma}}_y \end{pmatrix} \begin{pmatrix} \mathbf{b}_x \\ \mathbf{b}_y \end{pmatrix}.$$
This problem is guaranteed to have a solution if and only if the covariance matrices $\widehat{\boldsymbol{\Sigma}}_x$ and $\widehat{\boldsymbol{\Sigma}}_y$ are nonsingular. In practice, the sample size $N$ is often smaller than the dimensions of $\mathbf{x}$ and $\mathbf{y}$, $\prod_{i=1}^{D_x} p_i$ and $\prod_{j=1}^{D_y} q_j$, respectively. In that case, the sample covariance matrices $\widehat{\boldsymbol{\Sigma}}_x$ and $\widehat{\boldsymbol{\Sigma}}_y$ are singular, and a solution cannot be obtained. As a remedy, several regularized estimation methods for obtaining nonsingular sample covariance matrices (Vinod, 1976; Ledoit & Wolf, 2004; González, Déjean, Martin, & Baccini, 2008; Ledoit & Wolf, 2012; Cai & Yuan, 2012; Bickel & Levina, 2008a, 2008b; Kubokawa & Srivastava, 2013; Srivastava & Reid, 2012) can be used.
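The following sketch solves this generalized eigenvalue problem with a small ridge term added to the diagonal, in the spirit of the regularized estimators just cited; `cca_geneig` and its ridge default are our illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import eigh

def cca_geneig(X, Y, ridge=1e-3):
    """Top canonical pair via the generalized eigenvalue problem
    [[0, Sxy], [Syx, 0]] b = rho [[Sx, 0], [0, Sy]] b,
    with a ridge added so the right-hand matrix is nonsingular."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n, p = Xc.shape
    q = Yc.shape[1]
    Sx = Xc.T @ Xc / n + ridge * np.eye(p)
    Sy = Yc.T @ Yc / n + ridge * np.eye(q)
    Sxy = Xc.T @ Yc / n
    A = np.block([[np.zeros((p, p)), Sxy], [Sxy.T, np.zeros((q, q))]])
    B = np.block([[Sx, np.zeros((p, q))], [np.zeros((q, p)), Sy]])
    vals, vecs = eigh(A, B)      # symmetric-definite generalized eigenproblem
    b = vecs[:, -1]              # eigenvector of the largest eigenvalue rho
    return vals[-1], b[:p], b[p:]
```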
If we take $\boldsymbol{\Sigma}_{x(i)y(j)}$, $\boldsymbol{\Sigma}_{x(i)}$, and $\boldsymbol{\Sigma}_{y(j)}$ to be their sample estimates $\widehat{\boldsymbol{\Sigma}}_{x(i)y(j)}$, $\widehat{\boldsymbol{\Sigma}}_{x(i)}$, and $\widehat{\boldsymbol{\Sigma}}_{y(j)}$, respectively, then Proposition 1 suggests a block coordinate ascent algorithm where we update the factor matrices in pairs $(\mathbf{V}_i, \mathbf{W}_j)$, for different combinations of $i \in [D_x]$ and $j \in [D_y]$. To update the pair $(\mathbf{V}_i, \mathbf{W}_j)$, we solve the following problem:

$$\max_{\mathbf{v}_i, \mathbf{w}_j} \frac{\mathbf{v}_i^\top (\mathbf{H}_{x,-i} \otimes \mathbf{I}_{p_i})^\top \widehat{\boldsymbol{\Sigma}}_{x(i)y(j)} (\mathbf{H}_{y,-j} \otimes \mathbf{I}_{q_j}) \mathbf{w}_j}{\sqrt{\mathbf{v}_i^\top (\mathbf{H}_{x,-i} \otimes \mathbf{I}_{p_i})^\top \widehat{\boldsymbol{\Sigma}}_{x(i)} (\mathbf{H}_{x,-i} \otimes \mathbf{I}_{p_i}) \mathbf{v}_i}\sqrt{\mathbf{w}_j^\top (\mathbf{H}_{y,-j} \otimes \mathbf{I}_{q_j})^\top \widehat{\boldsymbol{\Sigma}}_{y(j)} (\mathbf{H}_{y,-j} \otimes \mathbf{I}_{q_j}) \mathbf{w}_j}}, \tag{6}$$

which is a substantially smaller optimization problem over $p_iR_x + q_jR_y$ variables compared with any alternative "all-at-once" strategy that iteratively optimizes over all parameters simultaneously. Problem (6) involves $\widehat{\boldsymbol{\Sigma}}_{x(i)}$, a permuted version of $\widehat{\boldsymbol{\Sigma}}_x$ of size $\prod_i p_i \times \prod_i p_i$. However, we work with the "compressed" covariance matrix

$$(\mathbf{H}_{x,-i} \otimes \mathbf{I}_{p_i})^\top \widehat{\boldsymbol{\Sigma}}_{x(i)} (\mathbf{H}_{x,-i} \otimes \mathbf{I}_{p_i}) \in \mathbb{R}^{p_iR_x \times p_iR_x},$$

which is likely to be full rank when $N \geq p_iR_x$, instead of the singular matrix $\widehat{\boldsymbol{\Sigma}}_{x(i)}$. This approach enables us to solve the generalized eigenvalue problem with matrices that are $(p_iR_x + q_jR_y)$-by-$(p_iR_x + q_jR_y)$. Algorithm 1 summarizes the estimation procedure, which comes with the following convergence guarantees; the proof is in the Supporting Information. (A small numerical sketch of these block updates follows Algorithm 1 below.)
Proposition 2
If the compressed covariance matrices in subproblem (6) are nonsingular at every iterate, then the limit points of the iterate sequence generated by Algorithm 1 are canonical tensors of the sample TCCA problem.
Algorithm 1.
TCCA, with (i) the assumption of CP structure on the canonical tensors and (ii) no additional assumptions on the structures of the covariances $\boldsymbol{\Sigma}_x$, $\boldsymbol{\Sigma}_y$, and $\boldsymbol{\Sigma}_{xy}$
| Initialize $\mathbf{V}_i^{(0)}$ and $\mathbf{W}_j^{(0)}$, for i ∈ [Dx] and j ∈ [Dy] |
| t ← 0 |
| repeat |
| Select (i, j) ∈ [Dx] × [Dy] |
| Generalized eigen-decomposition: update $(\mathbf{v}_i^{(t+1)}, \mathbf{w}_j^{(t+1)})$ by solving subproblem (6), with $\mathbf{H}_{x,-i}$ and $\mathbf{H}_{y,-j}$ evaluated at the current iterates |
| t ← t + 1 |
| until ρ(t) converges |
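To make the block coordinate ascent concrete, here is a minimal numerical sketch of Algorithm 1 for the special case $D_x = 1$ (vector $\mathbf{x}_n$) and $D_y = 2$ (matrix $\mathbf{Y}_n$), reusing `cca_geneig` from the sketch in Section 3.2 above. The function name, the random initialization, and the fixed iteration count are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def tcca_matrix_case(x, Ys, R=1, n_iter=20, ridge=1e-3, seed=0):
    """Block coordinate ascent for D_x = 1, D_y = 2 with B_y = [[W1, W2]].
    Since <Y, [[W1, W2]]> = tr(W1' Y W2), fixing one factor reduces each
    update to a small standard CCA on a compressed covariate."""
    rng = np.random.default_rng(seed)
    N, q1, q2 = Ys.shape
    W1 = rng.standard_normal((q1, R))
    W2 = rng.standard_normal((q2, R))
    for _ in range(n_iter):
        # Update (v, W1) with W2 fixed: CCA between x_n and vec(Y_n W2).
        Z = (Ys @ W2).reshape(N, q1 * R)
        rho, v, w = cca_geneig(x, Z, ridge)
        W1 = w.reshape(q1, R)
        # Update (v, W2) with W1 fixed: CCA between x_n and vec(Y_n' W1).
        Z = (np.transpose(Ys, (0, 2, 1)) @ W1).reshape(N, q2 * R)
        rho, v, w = cca_geneig(x, Z, ridge)
        W2 = w.reshape(q2, R)
    return rho, v, W1 @ W2.T   # canonical vector and canonical tensor B_y
```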
4 |. TCCA WITH SEPARABLE COVARIANCE STRUCTURE
4.1 |. Population TCCA with separable covariances
Hoff (2011) proposed the array normal distribution with separable covariance structure. Separable marginal covariances are defined as

$$\{\boldsymbol{\Sigma}_{x,1}, \ldots, \boldsymbol{\Sigma}_{x,D_x}\} \quad \text{and} \quad \{\boldsymbol{\Sigma}_{y,1}, \ldots, \boldsymbol{\Sigma}_{y,D_y}\}, \tag{7}$$

where $\boldsymbol{\Sigma}_{x,i} \in \mathbb{R}^{p_i \times p_i}$ for $i \in [D_x]$ and $\boldsymbol{\Sigma}_{y,j} \in \mathbb{R}^{q_j \times q_j}$ for $j \in [D_y]$. Then the overall covariance of the population model is

$$\boldsymbol{\Sigma}_x = \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1} \quad \text{and} \quad \boldsymbol{\Sigma}_y = \boldsymbol{\Sigma}_{y,D_y} \otimes \cdots \otimes \boldsymbol{\Sigma}_{y,1}. \tag{8}$$
Intuitively, $\boldsymbol{\Sigma}_{x,i}$ summarizes the covariance along the mode $i$ fibres of the tensor $\mathcal{X}$, and $\boldsymbol{\Sigma}_{y,j}$ summarizes the covariance along the mode $j$ fibres of the tensor $\mathcal{Y}$. The following result shows a representation of $\mathrm{Var}\langle \mathcal{X}, \mathcal{B}_x \rangle$ and $\mathrm{Var}\langle \mathcal{Y}, \mathcal{B}_y \rangle$ in the presence of the separable covariance structure (7) that we will leverage in our estimation algorithm.
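As a quick numerical check of the separable structure (8) in the matrix case ($D_x = 2$), the following sketch draws samples with $\mathrm{Var}(\mathbf{x}) = \boldsymbol{\Sigma}_{x,2} \otimes \boldsymbol{\Sigma}_{x,1}$ and compares the empirical covariance of $\mathrm{vec}(\mathcal{X})$ with the Kronecker product; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p1, p2, N = 4, 3, 100_000

# Draw matrix samples X = A Z B' so that Cov(vec(X)) = (B B') kron (A A').
A = rng.standard_normal((p1, p1))   # Sigma_1 = A A' (mode-1 covariance)
B = rng.standard_normal((p2, p2))   # Sigma_2 = B B' (mode-2 covariance)
Z = rng.standard_normal((N, p1, p2))
X = A @ Z @ B.T

# Empirical covariance of vec(X) (column-major vec) vs the Kronecker product.
vecX = X.transpose(0, 2, 1).reshape(N, p1 * p2)   # each row is vec(X_n)
emp = vecX.T @ vecX / N
kron = np.kron(B @ B.T, A @ A.T)                  # Sigma_2 kron Sigma_1
print(np.max(np.abs(emp - kron)))                 # small; shrinks with N
```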
Proposition 3
Let $\mathcal{X}$ and $\mathcal{Y}$ be two random tensors admitting the separable covariance structure (7) and $\mathcal{B}_x = [\![\mathbf{V}_1, \ldots, \mathbf{V}_{D_x}]\!]$ and $\mathcal{B}_y = [\![\mathbf{W}_1, \ldots, \mathbf{W}_{D_y}]\!]$ be two constant tensors. Define

$$\mathbf{H}_{x,-i} = \mathop{*}_{k \neq i} (\mathbf{V}_k^\top \boldsymbol{\Sigma}_{x,k} \mathbf{V}_k) \in \mathbb{R}^{R_x \times R_x} \quad \text{and} \quad \mathbf{H}_{y,-j} = \mathop{*}_{k \neq j} (\mathbf{W}_k^\top \boldsymbol{\Sigma}_{y,k} \mathbf{W}_k) \in \mathbb{R}^{R_y \times R_y}.$$

Then

$$\mathrm{Var}\langle \mathcal{X}, \mathcal{B}_x \rangle = \mathbf{v}_i^\top (\mathbf{H}_{x,-i} \otimes \boldsymbol{\Sigma}_{x,i}) \mathbf{v}_i \quad \text{and} \quad \mathrm{Var}\langle \mathcal{Y}, \mathcal{B}_y \rangle = \mathbf{w}_j^\top (\mathbf{H}_{y,-j} \otimes \boldsymbol{\Sigma}_{y,j}) \mathbf{w}_j,$$

for any $i \in [D_x]$ and $j \in [D_y]$, where $\mathbf{v}_i = \mathrm{vec}(\mathbf{V}_i)$ and $\mathbf{w}_j = \mathrm{vec}(\mathbf{W}_j)$.
With the separable covariance structure (7) and the CP structure (2) on the canonical tensors, the objective function of the TCCA population model (1) greatly simplifies. Note that the separable covariance structure may not hold for real data, in which case the covariance estimates are biased. Despite this drawback, this parsimonious structure is worth considering due to the stability it can impart by reducing estimation variance.
4.2 |. Sample TCCA with separable covariances
Given data $(\mathcal{X}_n, \mathcal{Y}_n)$, $n \in [N]$, the goal is to maximize the sample canonical correlation (5) under the assumptions that (a) $\mathcal{B}_x$ and $\mathcal{B}_y$ have the CP decomposition structure and (b) $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ admit the separable covariance structure. We follow the same strategy as sample TCCA in Section 3.2, updating parameters in pairs of factor matrices $(\mathbf{V}_i, \mathbf{W}_j)$. To update the pair $(\mathbf{V}_i, \mathbf{W}_j)$, we solve the subproblem

$$\max_{\mathbf{v}_i, \mathbf{w}_j} \frac{\widehat{\mathrm{Cov}}(\langle \mathcal{X}, \mathcal{B}_x \rangle, \langle \mathcal{Y}, \mathcal{B}_y \rangle)}{\sqrt{\mathbf{v}_i^\top (\widehat{\mathbf{H}}_{x,-i} \otimes \widehat{\boldsymbol{\Sigma}}_{x,i}) \mathbf{v}_i}\sqrt{\mathbf{w}_j^\top (\widehat{\mathbf{H}}_{y,-j} \otimes \widehat{\boldsymbol{\Sigma}}_{y,j}) \mathbf{w}_j}},$$

where $\mathbf{H}_{x,-i}$ and $\mathbf{H}_{y,-j}$, defined in Proposition 3, are evaluated at the current iterates $\mathbf{V}_{i'}$, $i' \neq i$, and $\mathbf{W}_{j'}$, $j' \neq j$.
By assuming the separable covariance structure, Proposition 3 enables us to greatly simplify the calculations of $\mathrm{Var}\langle \mathcal{X}, \mathcal{B}_x \rangle$ and $\mathrm{Var}\langle \mathcal{Y}, \mathcal{B}_y \rangle$. Note that the calculation of the compressed covariance matrices in the subproblem of the sample TCCA costs a number of flops that grows with the full dimensions $\prod_i p_i$ and $\prod_j q_j$. In contrast, the calculation of the matrices $\widehat{\mathbf{H}}_{x,-i} \otimes \widehat{\boldsymbol{\Sigma}}_{x,i}$ and $\widehat{\mathbf{H}}_{y,-j} \otimes \widehat{\boldsymbol{\Sigma}}_{y,j}$ in the subproblem of the sample TCCA with the separable covariance structure costs only $(R_xp_i)^2 + (R_yq_j)^2$ flops. Algorithm 2 summarizes the estimation procedure under the separable covariance assumption (7). Like Algorithm 1, Algorithm 2 also comes with convergence guarantees. The proof is in the Supporting Information.
Proposition 4
If the matrices $\widehat{\boldsymbol{\Sigma}}_{x,i}$, $i \in [D_x]$, and $\widehat{\boldsymbol{\Sigma}}_{y,j}$, $j \in [D_y]$, are nonsingular, then the limit points of the iterate sequence generated by Algorithm 2 are canonical tensors of the sample TCCA problem with separable covariances.
Algorithm 2.
TCCA for two tensors of modes Dx and Dy, respectively, assuming (i) CP structure on the canonical correlation tensors and (ii) separable covariances $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$
| Initialize $\mathbf{V}_i^{(0)}$ and $\mathbf{W}_j^{(0)}$, for i ∈ [Dx] and j ∈ [Dy] |
| t ← 0 |
| repeat |
| Select (i, j) ∈ [Dx] × [Dy] |
| Solve the generalized eigenvalue decomposition of the subproblem above to obtain $(\mathbf{v}_i^{(t+1)}, \mathbf{w}_j^{(t+1)})$, with $\widehat{\mathbf{H}}_{x,-i}$ and $\widehat{\mathbf{H}}_{y,-j}$ evaluated at the current iterates |
| t ← t + 1 |
| until ρ(t) converges |
4.2.1 |. Estimation of separable covariance matrices
Algorithm 2 relies on the sample estimates of the separable covariances $\boldsymbol{\Sigma}_{x,i}$ and $\boldsymbol{\Sigma}_{y,j}$ and the unstructured cross-covariance $\boldsymbol{\Sigma}_{xy}$. The following lemma is useful in computing a consistent estimator for covariance matrices with the separable structure.
Lemma 1
If a random tensor $\mathcal{X} \in \mathbb{R}^{p_1 \times \cdots \times p_{D_x}}$ has mean zero and separable covariance $\boldsymbol{\Sigma}_x = \boldsymbol{\Sigma}_{x,D_x} \otimes \cdots \otimes \boldsymbol{\Sigma}_{x,1}$, then

$$\mathrm{E}\left[\mathbf{X}_{(i)}\mathbf{X}_{(i)}^\top\right] = \boldsymbol{\Sigma}_{x,i} \prod_{k \neq i} \mathrm{tr}(\boldsymbol{\Sigma}_{x,k}) \quad \text{and} \quad \mathrm{E}\langle \mathcal{X}, \mathcal{X} \rangle = \prod_{k=1}^{D_x} \mathrm{tr}(\boldsymbol{\Sigma}_{x,k}).$$

Proof. For the first identity, see Hoff (2011, Proposition 2.1). For the second identity, $\mathrm{E}\langle \mathcal{X}, \mathcal{X} \rangle = \mathrm{tr}(\mathrm{E}[\mathbf{X}_{(i)}\mathbf{X}_{(i)}^\top]) = \mathrm{tr}(\boldsymbol{\Sigma}_{x,i}) \prod_{k \neq i} \mathrm{tr}(\boldsymbol{\Sigma}_{x,k}) = \prod_{k} \mathrm{tr}(\boldsymbol{\Sigma}_{x,k})$. □
Given $N$ i.i.d. observations $(\mathbf{x}_n, \mathbf{y}_n)$, the estimators $\widehat{\sigma}_x = N^{-1}\sum_{n=1}^N \|\mathbf{x}_n - \bar{\mathbf{x}}\|^2$ and $\widehat{\sigma}_y = N^{-1}\sum_{n=1}^N \|\mathbf{y}_n - \bar{\mathbf{y}}\|^2$ consistently estimate $\prod_k \mathrm{tr}(\boldsymbol{\Sigma}_{x,k})$ and $\prod_k \mathrm{tr}(\boldsymbol{\Sigma}_{y,k})$, respectively, where $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are the sample means of the vectorized tensors $\mathbf{x}_n$ and $\mathbf{y}_n$. We propose the following covariance estimators:

$$\widehat{\boldsymbol{\Sigma}}_{x,i} = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{X}_{n(i)} - \bar{\mathbf{X}}_{(i)})(\mathbf{X}_{n(i)} - \bar{\mathbf{X}}_{(i)})^\top \quad \text{and} \quad \widehat{\boldsymbol{\Sigma}}_{y,j} = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{Y}_{n(j)} - \bar{\mathbf{Y}}_{(j)})(\mathbf{Y}_{n(j)} - \bar{\mathbf{Y}}_{(j)})^\top,$$

where $i \in [D_x]$ and $j \in [D_y]$. The matrices $\mathbf{X}_{n(i)}$ and $\bar{\mathbf{X}}_{(i)}$ denote the mode $i$ matricization of the $n$th observation and the sample mean of the mode $i$ matricized tensors $\mathcal{X}_n$, respectively. The matrices $\mathbf{Y}_{n(j)}$ and $\bar{\mathbf{Y}}_{(j)}$ denote the analogous matricizations.

Unfortunately, the separable covariance structure (8) is not identifiable in the individual $\boldsymbol{\Sigma}_{x,i}$ due to scaling indeterminacy. Therefore, $\boldsymbol{\Sigma}_{x,i}$ cannot be consistently estimated. Note, however, that we do not need to consistently estimate the individual $\boldsymbol{\Sigma}_{x,i}$ in order to consistently estimate their Kronecker product. To see this, note that, by Slutsky's theorem,

$$\frac{\widehat{\boldsymbol{\Sigma}}_{x,D_x} \otimes \cdots \otimes \widehat{\boldsymbol{\Sigma}}_{x,1}}{\widehat{\sigma}_x^{\,D_x - 1}}$$

consistently estimates $\mathrm{Var}(\mathbf{x})$.
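Under the assumption that the estimators take the moment form above (our reconstruction, cf. Lemma 1), the following sketch implements the rescaled Kronecker estimator; `separable_cov_estimate` is an illustrative name.

```python
import numpy as np

def separable_cov_estimate(Xs):
    """Moment-based sketch of a separable covariance estimator: per-mode
    scatter matrices of the centred mode-i matricizations, with the Kronecker
    product rescaled so that it consistently estimates Var(vec(X)).
    (The individual factors are identifiable only up to scale.)"""
    N = Xs.shape[0]
    D = Xs.ndim - 1
    Xc = Xs - Xs.mean(axis=0)
    S = []
    for i in range(D):
        # Mode-i matricizations of the centred samples: (N, p_i, prod_others).
        M = np.moveaxis(Xc, i + 1, 1).reshape(N, Xs.shape[i + 1], -1, order="F")
        S.append(np.einsum("nab,ncb->ac", M, M) / N)   # (1/N) sum X_(i) X_(i)'
    sigma = np.mean(np.sum(Xc.reshape(N, -1) ** 2, axis=1))  # estimates E<X, X>
    kron = S[-1]
    for Si in S[-2::-1]:                # S_D kron ... kron S_1
        kron = np.kron(kron, Si)
    return kron / sigma ** (D - 1)      # consistent for Var(vec(X))
```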
Note that Hoff (2011) proposes an iterative algorithm for finding the maximum likelihood estimate (MLE) on the basis of the array normal assumption. The MLE may improve upon the above estimates when the data actually come from an array normal distribution.
5 |. SPARSE TCCA
We may also recover sparse canonical tensors for both TCCA and scTCCA to enhance the interpretability of the estimated canonical tensors. Following the iterative thresholding strategy introduced by Ma (2013) for sparse principal component analysis, by Wang, Gu, Ning, and Liu (2015) for sparse expectation-maximization algorithms, and by Tan, Wang, Liu, and Zhang (2018) for generalized eigenvalue problems, we incorporate a hard-thresholding step in Algorithms 1 and 2, applied to each updated factor:

$$\mathbf{v}_i^{(t+1)} \leftarrow \Theta_\lambda(\mathbf{v}_i^{(t+1)}) \quad \text{and} \quad \mathbf{w}_j^{(t+1)} \leftarrow \Theta_\lambda(\mathbf{w}_j^{(t+1)}),$$

where $\Theta_\lambda(\mathbf{v})$ performs element-wise hard-thresholding on $\mathbf{v}$; namely, the $i$th element of $\Theta_\lambda(\mathbf{v})$ is $v_i$ if $|v_i| > \lambda$ and 0 otherwise. In the simulation studies of Section 7, we employ fivefold cross-validation with a grid search to choose the tuning parameter $\lambda$, following Tan et al. (2018). Instead of a grid search, a random search would be another choice (Bergstra & Bengio, 2012).
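The thresholding operator itself is a one-liner; here is a minimal sketch with our illustrative function name.

```python
import numpy as np

def hard_threshold(v, lam):
    """Element-wise hard-thresholding: keep entries with |v_i| > lam, zero the rest."""
    return np.where(np.abs(v) > lam, v, 0.0)
```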
6 |. PROBABILISTIC MODEL FOR TCCA
Bach and Jordan (2006) give a probabilistic interpretation of classic CCA, which enables us to simulate data with desired canonical vectors and canonical correlations. For TCCA without any assumption on the covariance structure, the probabilistic model is the same as that of regular CCA, with the vectorized canonical tensors playing the role of the canonical vectors. In this section, we first discuss how to generate data from given $d$ canonical correlations $\rho_1 \geq \cdots \geq \rho_d$ and their corresponding canonical vectors in the columns of the matrices $(\mathbf{V}_d, \mathbf{W}_d)$, and then we extend the construction to TCCA with the separable covariance assumption. Let $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ be two covariance matrices such that $\mathbf{V}_d^\top \boldsymbol{\Sigma}_x \mathbf{V}_d = \mathbf{I}_d$ and $\mathbf{W}_d^\top \boldsymbol{\Sigma}_y \mathbf{W}_d = \mathbf{I}_d$. Define two linear transformations $\mathbf{U}_x = \boldsymbol{\Sigma}_x \mathbf{V}_d \mathbf{T}_x$ and $\mathbf{U}_y = \boldsymbol{\Sigma}_y \mathbf{W}_d \mathbf{T}_y$, where $\mathbf{T}_x, \mathbf{T}_y \in \mathbb{R}^{d \times d}$ are arbitrary matrices such that $\mathbf{T}_x \mathbf{T}_y^\top = \mathbf{P}_d = \mathrm{diag}(\rho_1, \ldots, \rho_d)$ and their spectral norms are less than 1. We consider the latent factor model

$$\mathbf{z} \sim N(\mathbf{0}, \mathbf{I}_d), \quad \mathbf{x} \mid \mathbf{z} \sim N(\mathbf{U}_x \mathbf{z}, \boldsymbol{\Sigma}_x - \mathbf{U}_x\mathbf{U}_x^\top), \quad \mathbf{y} \mid \mathbf{z} \sim N(\mathbf{U}_y \mathbf{z}, \boldsymbol{\Sigma}_y - \mathbf{U}_y\mathbf{U}_y^\top).$$

The joint distribution of $(\mathbf{x}, \mathbf{y})$ is

$$\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} \sim N(\mathbf{0}, \boldsymbol{\Sigma}), \quad \text{where } \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_x & \mathbf{U}_x\mathbf{U}_y^\top \\ \mathbf{U}_y\mathbf{U}_x^\top & \boldsymbol{\Sigma}_y \end{pmatrix}.$$
Now, we discuss how to construct $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$ from $(\mathbf{V}_d, \mathbf{W}_d)$, which is not described in Bach and Jordan (2006). Let $\mathbf{V}_d = \mathbf{Q}_x\mathbf{R}_x$ be the thin QR decomposition of $\mathbf{V}_d$. Then

$$\boldsymbol{\Sigma}_x = \mathbf{Q}_x(\mathbf{R}_x\mathbf{R}_x^\top)^{-1}\mathbf{Q}_x^\top + (\mathbf{I} - \mathbf{Q}_x\mathbf{Q}_x^\top)\mathbf{S}_x(\mathbf{I} - \mathbf{Q}_x\mathbf{Q}_x^\top)$$

satisfies $\mathbf{V}_d^\top \boldsymbol{\Sigma}_x \mathbf{V}_d = \mathbf{I}_d$ for an arbitrary positive semidefinite matrix $\mathbf{S}_x$. Similarly, let $\mathbf{W}_d = \mathbf{Q}_y\mathbf{R}_y$ be the thin QR decomposition of $\mathbf{W}_d$. Then

$$\boldsymbol{\Sigma}_y = \mathbf{Q}_y(\mathbf{R}_y\mathbf{R}_y^\top)^{-1}\mathbf{Q}_y^\top + (\mathbf{I} - \mathbf{Q}_y\mathbf{Q}_y^\top)\mathbf{S}_y(\mathbf{I} - \mathbf{Q}_y\mathbf{Q}_y^\top)$$

satisfies $\mathbf{W}_d^\top \boldsymbol{\Sigma}_y \mathbf{W}_d = \mathbf{I}_d$ for an arbitrary positive semidefinite matrix $\mathbf{S}_y$. In this notation, the joint covariance is

$$\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_x\mathbf{V}_d\mathbf{P}_d\mathbf{W}_d^\top\boldsymbol{\Sigma}_y \\ \boldsymbol{\Sigma}_y\mathbf{W}_d\mathbf{P}_d\mathbf{V}_d^\top\boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_y \end{pmatrix}. \tag{9}$$
$\mathbf{T}_x$ and $\mathbf{T}_y$ are free parameters that adjust the noise levels in $\mathbf{x}$ and $\mathbf{y}$, respectively. The normal generative model is

$$\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} \sim N(\mathbf{0}, \boldsymbol{\Sigma}),$$

where $\boldsymbol{\Sigma}$, $\boldsymbol{\Sigma}_x$, and $\boldsymbol{\Sigma}_y$ are as in Equation (9). The normality is not essential for the construction of this covariance structure.
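A minimal sketch of this construction for $d = 1$ follows; `make_cov` and `sample_cca` are our illustrative names, the scalar `tx` plays the role of $\mathbf{T}_x$, and we require $\rho \leq t_x < 1$ so that $t_y = \rho/t_x$ is also admissible.

```python
import numpy as np

def make_cov(v):
    """Full-rank Sigma with v' Sigma v = 1 via the QR device (d = 1):
    v = q r, Sigma = q q' / r^2 + (I - q q')."""
    r = np.linalg.norm(v)
    q = (v / r)[:, None]
    return q @ q.T / r**2 + np.eye(len(v)) - q @ q.T

def sample_cca(v, w, rho, N, tx=0.9, rng=None):
    """Latent-factor sampler: a shared z ~ N(0,1) drives both views so that
    (v, w) is the top canonical pair with canonical correlation rho."""
    rng = rng or np.random.default_rng()
    Sx, Sy = make_cov(v), make_cov(w)
    ty = rho / tx                              # tx * ty = rho, both < 1
    ux, uy = tx * Sx @ v, ty * Sy @ w          # loadings on the shared factor
    Px = np.linalg.cholesky(Sx - np.outer(ux, ux) + 1e-10 * np.eye(len(v)))
    Py = np.linalg.cholesky(Sy - np.outer(uy, uy) + 1e-10 * np.eye(len(w)))
    z = rng.standard_normal(N)
    X = np.outer(z, ux) + rng.standard_normal((N, len(v))) @ Px.T
    Y = np.outer(z, uy) + rng.standard_normal((N, len(w))) @ Py.T
    return X, Y

# Example: a sparse canonical vector and a stand-in vectorized canonical tensor.
v = np.zeros(100); v[:6] = 1.0
X, Y = sample_cca(v, np.ones(16), rho=0.8, N=1000, rng=np.random.default_rng(0))
```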
The separable covariance structure brings a complication. To account for this parsimonious structure in the generative model, we construct marginal factor matrices $\mathbf{V}_i$ and $\mathbf{W}_j$ that satisfy $\mathbf{V}_i^\top \boldsymbol{\Sigma}_{x,i} \mathbf{V}_i = \mathbf{I}_{R_x}$ and $\mathbf{W}_j^\top \boldsymbol{\Sigma}_{y,j} \mathbf{W}_j = \mathbf{I}_{R_y}$. Let $\mathbf{V}_i = \mathbf{Q}_{x,i}\mathbf{R}_{x,i}$ and $\mathbf{W}_j = \mathbf{Q}_{y,j}\mathbf{R}_{y,j}$ be the thin QR decompositions. Then

$$\boldsymbol{\Sigma}_{x,i} = \mathbf{Q}_{x,i}(\mathbf{R}_{x,i}\mathbf{R}_{x,i}^\top)^{-1}\mathbf{Q}_{x,i}^\top + (\mathbf{I} - \mathbf{Q}_{x,i}\mathbf{Q}_{x,i}^\top)\mathbf{S}_{x,i}(\mathbf{I} - \mathbf{Q}_{x,i}\mathbf{Q}_{x,i}^\top),$$
$$\boldsymbol{\Sigma}_{y,j} = \mathbf{Q}_{y,j}(\mathbf{R}_{y,j}\mathbf{R}_{y,j}^\top)^{-1}\mathbf{Q}_{y,j}^\top + (\mathbf{I} - \mathbf{Q}_{y,j}\mathbf{Q}_{y,j}^\top)\mathbf{S}_{y,j}(\mathbf{I} - \mathbf{Q}_{y,j}\mathbf{Q}_{y,j}^\top)$$

satisfy the conditions for arbitrary positive semidefinite matrices $\mathbf{S}_{x,i}$ and $\mathbf{S}_{y,j}$. We then have the desired property

$$\mathbf{b}_x^\top \boldsymbol{\Sigma}_x \mathbf{b}_x = \mathbf{1}_{R_x}^\top \left[\mathop{*}_{i=1}^{D_x} (\mathbf{V}_i^\top \boldsymbol{\Sigma}_{x,i} \mathbf{V}_i)\right] \mathbf{1}_{R_x}.$$

Similarly, $\mathbf{b}_y^\top \boldsymbol{\Sigma}_y \mathbf{b}_y = \mathbf{1}_{R_y}^\top \left[\mathop{*}_{j=1}^{D_y} (\mathbf{W}_j^\top \boldsymbol{\Sigma}_{y,j} \mathbf{W}_j)\right] \mathbf{1}_{R_y}$.
7 |. NUMERICAL EXPERIMENTS
We use the generative model described in Section 6 to evaluate the methods discussed in this paper: classic CCA, TCCA, scTCCA, sparse TCCA, and sparse TCCA with the separable covariance. We assess these methods on their ability to recover the true latent population parameters used to generate i.i.d. samples of tensor data pairs $(\mathcal{X}_n, \mathcal{Y}_n)$ for $n \in [1000]$. In all examples, the true $\mathcal{B}_x$ is a vector of length 100 with six entries set to 1 and the rest to 0. We use three different latent $\mathcal{B}_y \in \mathbb{R}^{64 \times 64}$ shown in Figure 1: a rectangle, a cross, and a butterfly. White pixels indicate values of 1, and black pixels indicate values of 0. The rectangle and cross population canonical tensors are low rank; specifically, they are Rank 1 and Rank 2, respectively. The butterfly population canonical tensor is high rank. At first glance, these illustrative examples do not come across as challenging estimation problems in the high dimension-low sample size regime, but if we were to vectorize the data and perform CCA, the number of parameters to fit is 100 + 64² = 4,196, whereas the sample size is 1,000. Because the number of parameters exceeds the sample size, the sample covariance matrix is singular. Consequently, we add a small ridge term $10^{-3}\mathbf{I}$ to the sample covariance matrices so that the generalized eigenvalue problem has a unique solution. The code for generating the simulation results is provided in the Supporting Information.
FIGURE 1.
True latent population canonical tensors for the numerical experiments
7.1 |. Evaluation criteria and selection of tuning parameter
Canonical tensors can only be estimated up to a scaling factor. Thus, we measure estimation accuracy by the cosine of the angle between the population canonical tensors $\mathcal{B}$ used to generate the data and the estimated canonical tensors $\widehat{\mathcal{B}}$, namely,

$$\cos \angle(\mathcal{B}, \widehat{\mathcal{B}}) = \frac{\langle \mathcal{B}, \widehat{\mathcal{B}} \rangle}{\sqrt{\langle \mathcal{B}, \mathcal{B} \rangle}\sqrt{\langle \widehat{\mathcal{B}}, \widehat{\mathcal{B}} \rangle}}.$$

This quantity can take values from −1 to 1, where a value closer to 1 indicates better recovery of the true canonical tensors.
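The criterion reduces to a cosine similarity between vectorized tensors; a minimal sketch with an illustrative function name:

```python
import numpy as np

def cosine_angle(B_true, B_hat):
    """Cosine of the angle between vectorized tensors; 1 means perfect
    recovery up to positive scaling."""
    b, bh = B_true.ravel(), B_hat.ravel()
    return float(b @ bh / (np.linalg.norm(b) * np.linalg.norm(bh)))
```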
We use $K$-fold cross-validation to perform model selection, for example, selection of the tensor rank or the sparsity level. We split our data into $K$ equally sized groups; for $k \in [K]$, we estimate a pair $(\widehat{\mathcal{B}}_x^{(m,-k)}, \widehat{\mathcal{B}}_y^{(m,-k)})$ on all but the $k$th fold of data for a sequence of models $m \in [M]$ of varying complexity, where model $m$ has a fixed pair of tensor ranks $R_x$ and $R_y$ and sparsity-inducing parameter $\lambda$. We denote the fitted canonical correlation using model $m$ by $\widehat{\rho}^{(m,-k)}$. Using $(\widehat{\mathcal{B}}_x^{(m,-k)}, \widehat{\mathcal{B}}_y^{(m,-k)})$, we compute the canonical correlation on the held-out $k$th fold, which we denote $\widehat{\rho}^{(m,k)}$. We choose the model $\widehat{m}$ that minimizes the average discrepancy between the empirical canonical correlations on the training sets and testing sets (Waaijenborg, Verselewel de Witt Hamer, & Zwinderman, 2008). We then use all the data to estimate $(\widehat{\mathcal{B}}_x, \widehat{\mathcal{B}}_y)$ using model $\widehat{m}$.
7.2 |. Results
Figure 2 compares the estimation accuracy of the various methods over 100 replicates. First of all, we observe that the various versions of TCCA outperform CCA on the vectorized data for all three choices of the latent canonical tensor $\mathcal{B}_y$. Second, we notice that there appears to be some overfitting in the rectangle and cross problems. In the rectangle problem, where the population canonical tensor is a Rank 1 tensor, all four TCCA methods show the best performance when we fix the rank $R_y$ for estimating $\mathcal{B}_y$ to be 1, and the performance deteriorates as higher ranks are used. The same trend can be seen in the results for the cross problem, where the population canonical tensor is a Rank 2 tensor: all TCCA methods give smaller values of the angle when $R_y = 3$ is used than when $R_y = 2$ is used. In contrast, the rank of the butterfly population canonical tensor is much greater than 3. Thus, for the butterfly we expect neither overfitting nor estimation performance as good as in the low-rank cases. Figure 2 confirms that the angles calculated for the butterfly problem are lower than those for the rectangle and cross problems.
FIGURE 2.
Angles between the recovered canonical vector/tensors and the true canonical vector/tensors. CCA, canonical correlation analysis; scTCCA, tensor canonical correlation analysis with separable covariance; spscTCCA, sparse tensor canonical correlation analysis with separable covariance; spTCCA, sparse tensor canonical correlation analysis; TCCA, tensor canonical correlation analysis
Another interesting observation is that, in the cross and butterfly problems, the TCCA methods with a separable covariance structure perform better than the other two models when the rank parameter $R_y$ is set above its true value. This result implies that assuming a separable covariance structure improves the estimation accuracy in these two problems. This can be explained by the true image $\mathcal{B}_y$, which has a symmetric structure. Due to this special structure, we might expect improved performance from the more parsimonious model. However, the structured model does not have much effect on the rectangle problem. This may be because the true canonical tensor already possesses relatively few parameters: the rectangle image is also symmetric but Rank 1, which has very few parameters.
Table 1 shows the computation time taken by each method. In most cases, the computation times of the TCCA methods are better than or comparable with those of the CCA method. In particular, there is a huge improvement when the underlying true canonical tensor has low rank. Also, as expected, sparse models generally take less time than nonsparse models, and models with the separable covariance structure generally take less time than models without that assumption.
TABLE 1.
Computation time: The mean run times are reported in seconds with the standard deviation in parentheses
| Problem | Rank | CCA | TCCA | spTCCA | scTCCA | spscTCCA |
|---|---|---|---|---|---|---|
| Rectangle | 1 | 22.90 (3.62) | 7.52 (1.10) | 7.36 (0.95) | 3.14 (0.55) | 3.09 (0.44) |
| | 2 | 22.80 (3.18) | 15.32 (2.51) | 15.12 (2.17) | 11.03 (2.03) | 30.11 (8.78) |
| | 3 | 21.85 (3.36) | 24.07 (3.91) | 18.61 (3.14) | 15.89 (2.95) | 37.09 (12.09) |
| Cross | 1 | 18.49 (2.32) | 8.05 (1.34) | 7.78 (1.40) | 3.39 (0.53) | 13.19 (12.14) |
| | 2 | 17.33 (2.43) | 17.31 (3.32) | 15.13 (5.02) | 6.96 (1.11) | 33.55 (12.78) |
| | 3 | 17.10 (2.55) | 24.44 (3.79) | 54.86 (25.73) | 10.04 (1.84) | 9.63 (2.05) |
| Butterfly | 1 | 18.92 (3.48) | 7.08 (1.60) | 61.18 (28.58) | 2.90 (0.64) | 31.66 (6.36) |
| | 2 | 18.23 (2.92) | 20.01 (4.53) | 30.10 (14.87) | 9.89 (2.15) | 48.57 (13.14) |
| | 3 | 19.99 (2.55) | 32.27 (4.68) | 26.09 (3.89) | 11.84 (2.06) | 13.58 (5.89) |
Note. Experiments were performed on a computer cluster consisting of machines with Intel Xeon-based processors (I5 or I7 processors) with RAM ranging from 1,600 to 2,100 MHz. Abbreviations: CCA, canonical correlation analysis; scTCCA, tensor canonical correlation analysis with separable covariance; spscTCCA, sparse tensor canonical correlation analysis with separable covariance; spTCCA, sparse tensor canonical correlation analysis; TCCA, tensor canonical correlation analysis
8 |. DISCUSSION
We have proposed a TCCA approach for finding the relationship between linear combinations of two tensor datasets. Our method combines the classic CCA approach with low-rank tensor decomposition to reduce the vast dimensionality of tensor parameters. The proposed estimation algorithms scale well with the tensor data size and are easy to implement using existing statistical software. Our algorithms also come with convergence guarantees, a property arguably lacking in the alternatives in the literature. We briefly highlight some open problems for further investigation.
In our illustrative examples, we did not incorporate a procedure for selecting the rank parameter. A main motivation for the low-rank formulation of the canonical tensors is reduced computation. Thus, one may wish to choose as high a rank parameter as one’s computational budget allows. Nonetheless, we leave for future work a more principled approach to rank selection. One promising angle of pursuit would be to leverage the equivalence between CCA and a least squares problem (Sun, Ji, & Ye, 2008) and then derive an information criterion, a strategy commonly used to select the rank parameter in several recently proposed tensor estimation procedures (Zhou et al., 2013; Sun et al., 2017; Sun & Li, 2019).
An additional direction for future work is to generalize our TCCA framework to handle the analysis of multiple tensor datasets. There are numerous proposed methods that extend CCA to multiple vector-measurement datasets (Carroll, 1968; Kettenring, 1971; Hanafi, 2007; Witten & Tibshirani, 2009; Luo et al., 2015) but, to the best of our knowledge, none for multiple tensor datasets.
Another problem not tackled in the paper is verifying the consistency of the proposed estimation procedures. Our clear specification of the population models provides a framework for studying the consistency property in both the large n fixed p (Zhou et al., 2013) and the large n diverging p settings (Zhang, Li, Zhou, Zhou, & Shen, 2019). Finally, we are currently investigating our methods on a real dataset.
Supplementary Material
Footnotes
SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of the article.
REFERENCES
- Bach FR, & Jordan MI (2006). A probabilistic interpretation of canonical correlation analysis (Technical report). Berkeley, CA: Department of Statistics, University of California, Berkeley.
- Bergstra J, & Bengio Y (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305.
- Bickel PJ, & Levina E (2008a). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1), 199–227.
- Bickel PJ, & Levina E (2008b). Covariance regularization by thresholding. The Annals of Statistics, 36(6), 2577–2604.
- Cai TT, & Yuan M (2012). Adaptive covariance matrix estimation through block thresholding. The Annals of Statistics, 40(4), 2014–2042.
- Carroll JD (1968). Generalization of canonical correlation analysis to three or more sets of variables. In Proceedings of the 76th Annual Convention of the American Psychological Association, Washington, DC, Vol. 3, pp. 227–228.
- Carroll JD, & Chang J-J (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35, 283–319.
- Gang L, Yong Z, Yan-Lei L, & Jing D (2011). Three dimensional canonical correlation analysis and its application to facial expression recognition. In International Conference on Intelligent Computing and Information Science (pp. 56–61). Berlin, Heidelberg: Springer.
- González I, Déjean S, Martin P, & Baccini A (2008). CCA: An R package to extend canonical correlation analysis. Journal of Statistical Software, 23(1), 1–14.
- Hanafi M (2007). PLS path modelling: Computation of latent variables with the estimation mode B. Computational Statistics, 22(2), 275–292.
- Harshman RA (1970). Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16, 1–84.
- Hoff PD (2011). Separable covariance arrays via the Tucker product, with applications to multivariate relational data. Bayesian Analysis, 6(2), 179–196.
- Hotelling H (1936). Relations between two sets of variates. Biometrika, 28, 321–377.
- Kettenring JR (1971). Canonical analysis of several sets of variables. Biometrika, 58(3), 433–451.
- Kolda TG, & Bader BW (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500.
- Kubokawa T, & Srivastava MS (2013). Optimal ridge-type estimators of covariance matrix in high dimension (CIRJE Discussion Paper). Tokyo, Japan: Faculty of Economics, University of Tokyo.
- Ledoit O, & Wolf M (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.
- Ledoit O, & Wolf M (2012). Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics, 40(2), 1024–1060.
- Lee SH, & Choi S (2007). Two-dimensional canonical correlation analysis. IEEE Signal Processing Letters, 14(10), 735–738.
- Lu H (2013). Learning canonical correlations of paired tensor sets via tensor-to-vector projection. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13 (pp. 1516–1522). Beijing, China: AAAI Press.
- Luo Y, Tao D, Ramamohanarao K, Xu C, & Wen Y (2015). Tensor canonical correlation analysis for multi-view dimension reduction. IEEE Transactions on Knowledge and Data Engineering, 27(11), 3111–3124.
- Ma Z (2013). Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2), 772–801.
- Srivastava MS, & Reid N (2012). Testing the structure of the covariance matrix with fewer observations than the dimension. Journal of Multivariate Analysis, 112, 156–171.
- Stein JL, Hua X, Lee S, Ho AJ, Leow AD, Toga AW, & Thompson PM (2010). Voxelwise genome-wide association study (vGWAS). NeuroImage, 53(3), 1160–1174.
- Sun L, Ji S, & Ye J (2008). A least squares formulation for canonical correlation analysis. In Proceedings of the 25th International Conference on Machine Learning, ICML '08 (pp. 1024–1031). New York, NY: ACM.
- Sun WW, & Li L (2019). Dynamic tensor clustering. Journal of the American Statistical Association, in press.
- Sun WW, Lu J, Liu H, & Cheng G (2017). Provable sparse tensor decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3), 899–916.
- Tan KM, Wang Z, Liu H, & Zhang T (2018). Sparse generalized eigenvalue problem: Optimal statistical rates via truncated Rayleigh flow. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5), 1057–1086.
- Vinod HD (1976). Canonical ridge and econometrics of joint production. Journal of Econometrics, 4(2), 147–166.
- Waaijenborg S, Verselewel de Witt Hamer PC, & Zwinderman AH (2008). Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis. Statistical Applications in Genetics and Molecular Biology, 7(1), 3.
- Wang H (2010). Local two-dimensional canonical correlation analysis. IEEE Signal Processing Letters, 17(11), 921–924.
- Wang Z, Gu Q, Ning Y, & Liu H (2015). High dimensional EM algorithm: Statistical optimization and asymptotic normality. In Cortes C, Lawrence ND, Lee DD, Sugiyama M, & Garnett R (Eds.), Advances in Neural Information Processing Systems 28 (pp. 2521–2529). Curran Associates, Inc.
- Wang S-J, Yan W-J, Sun T, Zhao G, & Fu X (2016). Sparse tensor canonical correlation analysis for micro-expression recognition. Neurocomputing, 214, 218–232.
- Witten DM, & Tibshirani RJ (2009). Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, 8(1), 1–27.
- Yan J, Zheng W, Zhou X, & Zhao Z (2012). Sparse 2-D canonical correlation analysis via low rank matrix approximation for feature extraction. IEEE Signal Processing Letters, 19(1), 51–54.
- Zhang X, Li L, Zhou H, Zhou Y, & Shen D (2019). Tensor generalized estimating equations for longitudinal imaging analysis. Statistica Sinica, 29, 1977–2005.
- Zhou H, Li L, & Zhu H (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502), 540–552.