Abstract
We present an efficient stochastic algorithm (RSG+) for canonical correlation analysis (CCA) using a reparametrization of the projection matrices. We show how this reparametrization (into structured matrices), simple in hindsight, directly presents an opportunity to repurpose/adjust mature techniques for numerical optimization on Riemannian manifolds. Our developments nicely complement existing methods for this problem, which either require $O(d^3)$ time complexity per iteration (where d is the dimensionality) or only extract the top-1 component. In contrast, our algorithm offers an improvement: it achieves $O(d^2k)$ runtime complexity per iteration for extracting the top-k canonical components, with a convergence rate of $O(1/\sqrt{t})$. While our paper focuses more on the formulation and the algorithm, our experiments show that the empirical behavior on common datasets is quite promising. We also explore a potential application in training fair models with missing sensitive attributes.
1. Introduction
Canonical correlation analysis (CCA) is a classical method for evaluating correlations between two sets of variables. It is commonly used in unsupervised multi-view learning, where the multiple views of the data may correspond to image, text, audio and so on Rupnik and Shawe-Taylor [2010], Chaudhuri et al. [2009], Luo et al. [2015], and it has also been applied to manifold-valued data Kim et al. [2014]. Classical formulations have also been extended to leverage advances in representation learning; for example, Andrew et al. [2013] showed how CCA can be interfaced with deep neural networks, enabling modern use cases. Many results over the last few years have used CCA or its variants for problems including measuring representational similarity in deep neural networks Morcos et al. [2018] and speech recognition Couture et al. [2019].
The goal in CCA is to find linear combinations of two random variables X and Y which have maximum correlation with each other. Formally, the CCA problem is defined as follows. Let $\{x_i\}_{i=1}^N$ and $\{y_i\}_{i=1}^N$ be N samples drawn from a pair of random variables X (a $d_x$-variate random variable) and Y (a $d_y$-variate random variable) with unknown joint probability distribution. The goal is to find projection matrices $U \in \mathbb{R}^{d_x \times k}$ and $V \in \mathbb{R}^{d_y \times k}$, with $k \le \min\{d_x, d_y\}$, such that the correlation is maximized:
$$F := \max_{U \in \mathbb{R}^{d_x \times k},\, V \in \mathbb{R}^{d_y \times k}} \operatorname{trace}\left(U^\top C_{XY} V\right) \quad \text{s.t.} \quad U^\top C_X U = I_k,\ \ V^\top C_Y V = I_k. \tag{1}$$
Here, $C_X$ and $C_Y$ are the sample covariance matrices of X and Y, and $C_{XY}$ denotes the sample cross-covariance matrix.
The objective in (1) is the expected cross-correlation in the projected space, and the constraints specify that different canonical components should be decorrelated. Let us define the whitened cross-covariance $T := C_X^{-1/2} C_{XY} C_Y^{-1/2}$, and let $\Phi_k$ (and $\Psi_k$) contain the top-k left (and right) singular vectors of T. It is known Golub and Zha [1992] that the optimum of (1) is achieved at $U^* = C_X^{-1/2}\Phi_k$ and $V^* = C_Y^{-1/2}\Psi_k$. We can compute $U^*, V^*$ by applying a k-truncated SVD to T.
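For concreteness, the classical batch procedure can be sketched in a few lines of NumPy (our illustration; the small ridge term `reg` and the function name are our own additions for numerical stability):

```python
import numpy as np

def cca_topk_svd(X, Y, k, reg=1e-8):
    """Classical batch top-k CCA via the whitened cross-covariance T (Golub & Zha).

    X: (N, dx) and Y: (N, dy) are centered data matrices; `reg` is a small ridge
    term we add for numerical stability (an assumption, not part of the text)."""
    N = X.shape[0]
    Cx = X.T @ X / N + reg * np.eye(X.shape[1])
    Cy = Y.T @ Y / N + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / N

    def inv_sqrt(C):
        # inverse square root via eigendecomposition: the O(d^3) step the paper avoids
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    Cx_isqrt, Cy_isqrt = inv_sqrt(Cx), inv_sqrt(Cy)
    T = Cx_isqrt @ Cxy @ Cy_isqrt                 # whitened cross-covariance
    Phi, sigma, PsiT = np.linalg.svd(T)           # full SVD, then keep the top-k factors
    U = Cx_isqrt @ Phi[:, :k]                     # U* = Cx^{-1/2} Phi_k
    V = Cy_isqrt @ PsiT[:k, :].T                  # V* = Cy^{-1/2} Psi_k
    return U, V, sigma[:k]                        # sigma[:k] are the canonical correlations
```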
Runtime and memory considerations.
The above procedure is simple but is only feasible when the data matrices are small. In modern applications, not only are the datasets large, but the dimension d (let $d = \max\{d_x, d_y\}$) of each sample can also be large, especially if representations are learned using deep models. As a result, the resource needs of the algorithm can be high. This has motivated the study of stochastic optimization routines for solving CCA, and many efficient strategies have been proposed. For example, Ge et al. [2016] and Wang et al. [2016] present Empirical Risk Minimization (ERM) models which optimize the empirical objective. More recently, Gao et al. [2019], Bhatia et al. [2018], and Arora et al. [2017] describe proposals that optimize the population objective. To summarize the approaches: if we are satisfied with the top-1 component of CCA, effective schemes with provable convergence rates are available, by utilizing either extensions of Oja's rule Oja [1982] to the generalized eigenvalue problem Bhatia et al. [2018] or the alternating SVRG algorithm Gao et al. [2019]. Otherwise, a stochastic approach will use an explicit whitening operation, which can cost $O(d^3)$ operations per iteration Arora et al. [2017], and the convergence rate of the stochastic scheme depends on its specific steps and calculations, e.g., in Arora et al. [2017] (Thm 2.3, pp. 5).
Observation.
Most approaches either optimize (1) directly or work with a reparameterized or regularized form Ge et al. [2016], Allen-Zhu and Li [2016], Arora et al. [2017]. Often, the search space for U and V corresponds to the entire $\mathbb{R}^{d\times k}$ (ignoring the constraints for the moment). But if the formulation could be cast in a form which involves approximately writing U and V as products of structured matrices, we may be able to obtain specialized routines tailored to exploit those properties. Such a reformulation is not difficult to derive: the matrices used to express U and V can be identified as objects that live in well-studied geometric spaces. Then, utilizing the geometry of the space and borrowing relevant tools from differential geometry could lead to an efficient approximate scheme for top-k CCA which optimizes the population objective in a streaming fashion.
Contributions.
(a) First, we re-parameterize the top-k CCA problem as an optimization problem on specific matrix manifolds, and show that it is equivalent to the original formulation in (1). (b) Informed by the geometry of the manifolds, we derive stochastic gradient descent (SGD) algorithms for solving the re-parameterized problem with $O(d^2k)$ cost per iteration, and provide convergence rate guarantees. (c) This analysis gives a direct mechanism to obtain an upper bound on the number of iterations needed to guarantee an ϵ error w.r.t. the population objective for the CCA problem. (d) The algorithm works in a streaming manner, so it easily scales to large datasets and we do not need to assume access to the full dataset at the outset. (e) We present empirical evidence for the standard CCA model and the DeepCCA setting Andrew et al. [2013], describing advantages and limitations.
2. Stochastic CCA: Reformulation, Algorithm and Analysis
Let us review the objective for CCA as given in (1). We denote by $X \in \mathbb{R}^{N\times d_x}$ the matrix consisting of the samples $\{x_i\}$ drawn from a zero-mean random variable, and by $Y \in \mathbb{R}^{N\times d_y}$ the matrix consisting of the samples $\{y_i\}$ drawn from a zero-mean random variable. For simplicity, we assume that $d_x = d_y = d$, although the results hold for general $d_x$ and $d_y$. Also recall that $C_X$ (resp. $C_Y$) is the covariance matrix of X (resp. Y) and $C_{XY}$ is the cross-covariance matrix between X and Y. Let $U \in \mathbb{R}^{d\times k}$ ($V \in \mathbb{R}^{d\times k}$) be the matrix consisting of $\{u_j\}$ ($\{v_j\}$), where $(\{u_j\}, \{v_j\})$ are the canonical directions. The constraints in (1) are called the whitening constraints.
Reformulation:
In the CCA formulation, the matrices consisting of canonical correlation directions, i.e., U and V, are unconstrained, hence the search space is the entire Rd×k. Now, we reformulate the CCA objective by reparameterizing U and V. In order to do that, let us take a brief detour and recall the objective function of principal component analysis (PCA):
$$\max_{\tilde U \in \mathbb{R}^{d\times k},\ \tilde U^\top \tilde U = I_k} \operatorname{trace}\left(\tilde U^\top C_X \tilde U\right) \tag{2}$$
Observe that by performing PCA and assigning $U = \tilde U_x \Lambda_x^{-1/2}$ in (1), where $\tilde U_x$ is the maximizer of (2) for $C_X$ and $\Lambda_x$ is the diagonal matrix of the corresponding top-k eigenvalues (analogously for V using $C_Y$), we can satisfy the whitening constraint. Of course, writing $U = \tilde U_x \Lambda_x^{-1/2}$ does satisfy the whitening constraint, but such a U (and V) will not maximize $\operatorname{trace}(U^\top C_{XY} V)$, the objective of (1). Hence, additional work beyond the PCA solution is needed. Let us start from $U = \tilde U_x \Lambda_x^{-1/2}$ but relax the PCA solution by using an arbitrary $k\times k$ matrix instead of the diagonal $\Lambda_x^{-1/2}$ (the whitening constraint can still be satisfied).
Write $U = \tilde U_x R_u$ and $V = \tilde U_y R_v$ with $\tilde U_x, \tilde U_y \in \mathrm{St}(k, d)$ and $R_u, R_v \in \mathbb{R}^{k\times k}$. Thus we can approximate the CCA objective (we will later check how good this approximation is) as
$$\max_{\substack{\tilde U_x, \tilde U_y \in \mathrm{St}(k,d) \\ R_u, R_v \in \mathbb{R}^{k\times k}}} \operatorname{trace}\left(R_u^\top \tilde U_x^\top C_{XY}\, \tilde U_y R_v\right) \quad \text{s.t.}\quad (\tilde U_x R_u)^\top C_X (\tilde U_x R_u) = I_k,\ \ (\tilde U_y R_v)^\top C_Y (\tilde U_y R_v) = I_k. \tag{3}$$
Here, St(k, d) denotes the manifold consisting of d × k (with k ≤ d) column-orthonormal matrices, i.e., $\mathrm{St}(k, d) := \{A \in \mathbb{R}^{d\times k} : A^\top A = I_k\}$. Observe that in (3), we approximate the optimal U and V as linear combinations of the columns of $\tilde U_x$ and $\tilde U_y$ respectively. Thus, the aforementioned PCA solution can act as a feasible initial solution for (3).
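As a quick numerical sanity check (our illustration, not from the paper), the scaled PCA solution indeed satisfies the whitening constraint and is therefore a feasible starting point:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 20, 4, 10000
X = rng.standard_normal((N, d)) @ rng.standard_normal((d, d))   # correlated zero-mean samples
Cx = X.T @ X / N                                                 # sample covariance
lam, Q = np.linalg.eigh(Cx)                                      # eigenvalues in ascending order
U_tilde = Q[:, ::-1][:, :k]                                      # top-k principal directions, in St(k, d)
U = U_tilde @ np.diag(1.0 / np.sqrt(lam[::-1][:k]))              # scale by Lambda^{-1/2}
print(np.allclose(U.T @ Cx @ U, np.eye(k)))                      # whitening constraint holds
```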
As the choice of $R_u$ and $R_v$ is arbitrary, we can further reparameterize these matrices by constraining them to be full rank (of rank k) and applying the RQ decomposition Golub and Reinsch [1971], i.e., $R_u = S_u Q_u$ and $R_v = S_v Q_v$ with $S_u$, $S_v$ upper triangular and $Q_u$, $Q_v$ orthogonal, which gives us the following reformulation.
A Reformulation for CCA
$$\hat F := \max_{\substack{\tilde U_x, \tilde U_y \in \mathrm{St}(k,d),\ Q_u, Q_v \in SO(k) \\ S_u, S_v \in \mathbb{R}^{k\times k}\ \text{upper triangular}}} \operatorname{trace}\left(Q_u^\top S_u^\top \tilde U_x^\top C_{XY}\, \tilde U_y S_v Q_v\right) + \operatorname{trace}\left(\tilde U_x^\top C_X \tilde U_x\right) + \operatorname{trace}\left(\tilde U_y^\top C_Y \tilde U_y\right) \tag{4a}$$

$$\text{s.t.}\quad \left(\tilde U_x S_u Q_u\right)^\top C_X \left(\tilde U_x S_u Q_u\right) = I_k, \qquad \left(\tilde U_y S_v Q_v\right)^\top C_Y \left(\tilde U_y S_v Q_v\right) = I_k. \tag{4b}$$
Here, SO(k) is the space of k × k special orthogonal matrices, i.e., $SO(k) := \{Q \in \mathbb{R}^{k\times k} : Q^\top Q = I_k,\ \det(Q) = 1\}$. Before evaluating how good the aforementioned approximation is, we first point out some useful properties of the reformulation (4): (a) in the reparametrization of U and V, all components are structured; once $\tilde U_x$ and $\tilde U_y$ are obtained (essentially from PCA), the remaining free factors $S_u Q_u$ and $S_v Q_v$ live in a subset of $\mathbb{R}^{k\times k}$; (b) we can essentially initialize with a PCA solution and then try to optimize (4) via some scheme.
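For illustration, a minimal sketch (ours) of the RQ-based reparameterization $R = SQ$ with $Q \in SO(k)$, using `scipy.linalg.rq`; the sign-flip convention that forces $\det(Q) = +1$ is our own choice:

```python
import numpy as np
from scipy.linalg import rq

def rq_so(R):
    """RQ decomposition R = S @ Q with S upper triangular and Q in SO(k);
    the sign flip pushing det(Q) to +1 is our own convention."""
    S, Q = rq(R)
    if np.linalg.det(Q) < 0:      # flip one matched column/row pair; the product S @ Q is unchanged
        S[:, -1] *= -1
        Q[-1, :] *= -1
    return S, Q

R = np.random.default_rng(1).standard_normal((4, 4))   # an arbitrary full-rank k x k matrix
S, Q = rq_so(R)
assert np.allclose(S @ Q, R) and np.linalg.det(Q) > 0
```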
Why does (4) help?
First, we note that CCA seeks to maximize the total correlation under the constraint that different components are decorrelated. One difficulty in the optimization is ensuring decorrelation, which leads to higher complexity in existing streaming CCA algorithms. In contrast, in (4) we separate (1) into finding the principal directions $\tilde U_x$, $\tilde U_y$ (by adding the variance maximization terms) and finding the linear combinations ($S_u Q_u$ and $S_v Q_v$) of the principal directions. After optimizing for these variables, the whitening constraints are, up to a rescaling, automatically satisfied. Here, we can (almost) utilize an efficient off-the-shelf streaming PCA algorithm. We defer the specific details of the individual steps to the next subsection. First, we show why substituting (1) with (4) is sensible under some assumptions.
Why does the solution of the reformulation make sense?
We start by stating some mild assumptions needed for the analysis. Assumptions: (a) The random variables X and Y are sub-Gaussian, with covariances $C_X$ and $C_Y$ bounded in terms of a constant c > 0. (b) The samples X and Y drawn from the two random variables have zero mean. (c) For a given k ≤ d, $C_X$ and $C_Y$ have non-zero top-k eigenvalues.
We show how the presented solution, assuming access to an effective numerical procedure, approximates the CCA problem presented in (1). We formally state the result in Theorem 1 below with a proof sketch (the appendix includes the full proof), after first stating the following definition and proposition.
Definition 1.
A random variable X is called sub-Gaussian if the norm $\|X\|_\star := \inf\{t \ge 0 \,:\, \mathbb{E}_X[\exp(\operatorname{trace}(X^\top X)/t^2)] \le 2\}$ is finite. If X is sub-Gaussian and $U \in \mathbb{R}^{d\times k}$, then XU is sub-Gaussian Vershynin [2017].
Proposition 1 (Reiß et al. [2020]).
Let X be a random variable which follows a sub-Gaussian distribution. Let $\hat X_k$ be the approximation of the data matrix $X \in \mathbb{R}^{N\times d}$ (samples drawn from this distribution) using the top-k principal vectors, and let $\hat C_X$ be the covariance of $\hat X_k$. Also, assume that $\lambda_i$ is the i-th eigenvalue of $C_X$, with $\lambda_i \ge \lambda_{i+1}$ for $i = 1, \cdots, d-1$. Then the PCA reconstruction error $\|X - \hat X_k\|_F$ (in the Frobenius norm sense) can be upper bounded in terms of the eigenvalues $\{\lambda_i\}$; the explicit bound is given in Reiß et al. [2020].
The aforementioned proposition suggests that the error between the data matrix X and the reconstructed data matrix using the top-k principal vectors is bounded.
Recall from (1) and (4) that the optimal values of the true and approximated CCA objectives are denoted by $F$ and $\hat F$, respectively. The following theorem states that we can bound the error $|F - \hat F|$ (proof in the appendix). In other words, if we start from the PCA solution and can successfully optimize (4) without leaving the feasible set, we will obtain a good solution.
Theorem 1.
Under the hypotheses and assumptions above, the approximation error $|F - \hat F|$, as a function of N, is bounded and goes to zero as $N \to \infty$, while the whitening constraints in (4b) remain satisfied.
Sketch of the Proof.
Let $U^*$ and $V^*$ be the true solutions of CCA, i.e., of (1). Let $\hat U := \tilde U_x S_u Q_u$ and $\hat V := \tilde U_y S_v Q_v$ be the solutions of (4), where $\tilde U_x$ and $\tilde U_y$ are the PCA solutions for X and Y respectively. Let $\hat X$ and $\hat Y$ be the reconstructions of X and Y using the top-k principal vectors, with corresponding (cross-)covariances $\hat C_X$, $\hat C_Y$ and $\hat C_{XY}$. Then F can be written in terms of $C_{XY}$ evaluated at $(U^*, V^*)$, and similarly $\hat F$ can be written in terms of $\hat C_{XY}$ evaluated at $(\hat U, \hat V)$. As $\hat X$ and $\hat Y$ are the approximations of X and Y using the principal vectors, we use Prop. 1 to bound the errors $\|X - \hat X\|_F$ and $\|Y - \hat Y\|_F$. Now observe that $U^*$ can be approximated by a linear combination of the top-k principal directions of X (similarly for $V^*$). Thus, as long as the solutions $S_u Q_u$ and $S_v Q_v$ respectively well-approximate these combination coefficients, $\hat F$ is a good approximation of F. □
Now, the only unresolved issue is an optimization scheme for equation 4a that keeps the constraints in equation 4b satisfied by leveraging the geometry of the structured solution space.
2.1. How to numerically optimize (4a) satisfying constraints in (4b)?
Overview.
We now describe how to maximize the formulation in (4a)–(4b) with respect to $\tilde U_x$, $\tilde U_y$, $Q_u$, $Q_v$, $S_u$ and $S_v$. We first compute the top-k principal vectors to obtain $\tilde U_x$ and $\tilde U_y$. Then, we use a gradient update rule to solve for $Q_u$, $Q_v$, $S_u$ and $S_v$ to improve the objective. Since all these matrices are "structured", care must be taken to ensure that the matrices remain on their respective manifolds, which is where the geometry of the manifolds offers desirable properties. We re-purpose Riemannian stochastic gradient descent (RSGD) to achieve this task, so we call our algorithm RSG+. Of course, more sophisticated Riemannian optimization techniques can be substituted in. For instance, different Riemannian optimization methods are available in Absil et al. [2007], and optimization schemes for many manifolds are offered in Manopt Boumal et al. [2014].
The algorithm block is in Algorithm 1. Recall that $F_{\text{pca}} := \operatorname{trace}(\tilde U_x^\top C_X \tilde U_x) + \operatorname{trace}(\tilde U_y^\top C_Y \tilde U_y)$ is the contribution from the principal directions, which we use to ensure the "whitening constraint". Moreover, $F_{\text{cca}} := \operatorname{trace}(Q_u^\top S_u^\top \tilde U_x^\top C_{XY}\, \tilde U_y S_v Q_v)$ is the contribution from the canonical correlation directions (note that we use the subscript 'cca' to make the CCA objective explicit), so that $\hat F = F_{\text{pca}} + F_{\text{cca}}$. The algorithm consists of four main blocks denoted by different colors, namely (a) the Red block deals with the gradient calculation for the top-k principal vectors, i.e., the gradient of $F_{\text{pca}}$ with respect to $\tilde U_x$, $\tilde U_y$; (b) the Green block describes the calculation of the gradient corresponding to the canonical directions, i.e., the gradient of $F_{\text{cca}}$ with respect to $\tilde U_x$, $\tilde U_y$, $S_u$, $S_v$, $Q_u$ and $Q_v$; (c) the Gray block combines the gradient computations from both $F_{\text{pca}}$ and $F_{\text{cca}}$ with respect to the unknowns $\tilde U_x$, $\tilde U_y$, $S_u$, $S_v$, $Q_u$ and $Q_v$; and finally (d) the Blue block performs a batch update of the canonical directions using Riemannian gradient updates.
Gradient calculations.
The gradient update for $\tilde U_x$, $\tilde U_y$ is divided into two parts. (a) The (Red block) gradient updates the "principal" directions (denoted by $\nabla_{\tilde U_x} F_{\text{pca}}$ and $\nabla_{\tilde U_y} F_{\text{pca}}$), which is specifically designed to satisfy the whitening constraint. This requires updating the principal subspaces, so the gradient descent needs to proceed on the manifold of k-dimensional subspaces of $\mathbb{R}^d$, i.e., on the Grassmannian Gr(k, d). (b) The (Green block) gradient from the objective function in (4) is denoted by $\nabla_{\tilde U_x} F_{\text{cca}}$ and $\nabla_{\tilde U_y} F_{\text{cca}}$. In order to ensure that the Riemannian gradient update for $\tilde U_x$ and $\tilde U_y$ stays on the manifold St(k, d), we need to make sure that these gradients lie in the tangent space of St(k, d). To do so, we first calculate the Euclidean gradient and then project it onto the tangent space of St(k, d).
The gradient updates for $Q_u$, $Q_v$, $S_u$, $S_v$ are given in the Green block, denoted by $\nabla_{Q_u} F_{\text{cca}}$, $\nabla_{Q_v} F_{\text{cca}}$, $\nabla_{S_u} F_{\text{cca}}$, and $\nabla_{S_v} F_{\text{cca}}$. Note that unlike the previous step, these gradients only have components from the canonical correlation term. As before, this step requires first computing the Euclidean gradient and then projecting it onto the tangent space of the underlying Riemannian manifolds involved, i.e., SO(k) and the space of upper triangular matrices.
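As an illustration of these projection steps, here is a minimal sketch (ours, under the standard embedded metrics; the paper may use different metrics or retractions) of the tangent-space projections for St(k, d), SO(k) and the upper triangular matrices:

```python
import numpy as np

def proj_stiefel_tangent(X, G):
    """Project a Euclidean gradient G onto the tangent space of St(k, d) at X
    (X is d x k with X^T X = I): G - X * sym(X^T G), under the embedded metric."""
    XtG = X.T @ G
    return G - X @ (XtG + XtG.T) / 2.0

def proj_so_tangent(Q, G):
    """Project a Euclidean gradient G onto the tangent space of SO(k) at Q:
    Q * skew(Q^T G)."""
    QtG = Q.T @ G
    return Q @ (QtG - QtG.T) / 2.0

def proj_upper_triangular(G):
    """The upper triangular matrices form a linear space, so the projection
    simply zeroes out the strictly lower-triangular part."""
    return np.triu(G)
```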
Finally, we obtain the gradient to update the canonical directions by combining the gradients, as shown in the Gray block. With these gradients we can perform a batch update, as shown in the Blue block. A schematic diagram is given in Fig. 1.
Figure 1:
Schematic diagram of the proposed CCA algorithm; here $\hat F = F_{\text{pca}} + F_{\text{cca}}$, where $\hat F$ is the approximated objective value for CCA (as in (4)).
Using results presented next in Propositions 2–3, this scheme can be shown (under some assumptions) to approximately optimize the CCA objective in (1).
We can now move to the convergence properties of the algorithm. We present two results establishing asymptotic convergence of the top-k principal vectors and the canonical directions in the algorithm.
Proposition 2 (Chakraborty et al. [2020]).
(Asymptotically) If the samples, X, are drawn from a Gaussian distribution, then the gradient update rule presented in Step 5 in Algorithm 1 returns an orthonormal basis – the top-k principal vectors of the covariance matrix CX.
Proposition 3.
(Bonnabel [2013]) Consider a connected Riemannian manifold $\mathcal{M}$ with injectivity radius bounded from below by I > 0. Assume that the sequence of step sizes $(\gamma_l)$ satisfies the conditions (a) $\sum_l \gamma_l^2 < \infty$ and (b) $\sum_l \gamma_l = \infty$. Suppose the iterates $\{A_l\}$ lie in a compact set $K \subset \mathcal{M}$, and that $\exists D > 0$ such that the (stochastic) Riemannian gradient of the objective is bounded by D on K. Then the objective values converge and the Riemannian gradient $\nabla f(A_l) \to 0$ almost surely as $l \to \infty$.
Notice that in our problem, the injectivity radius bound in Proposition 3 is satisfied, since the injectivity radii of Gr(p, n), St(p, n) and SO(p) are all bounded from below by positive constants (e.g., π/2). So, in order to apply Proposition 3, we need to guarantee that the step sizes satisfy the aforementioned conditions. One example of step sizes that satisfies the conditions is $\gamma_l \propto 1/l$.
2.2. Convergence rate and complexity of the RSG+ algorithm
In this section, we describe the convergence rate and complexity of the algorithm proposed in Algorithm 1. Observe that the key component of Algorithm 1 is a Riemannian gradient update. Let $A_t$ be the generic entity to be updated in the algorithm using the Riemannian gradient update $A_{t+1} = \mathrm{Exp}_{A_t}\!\left(-\gamma_t\, \nabla^R f(A_t)\right)$, where $\gamma_t$ is the step size at time step t, and assume $A_t \in \mathcal{M}$ for a Riemannian manifold $\mathcal{M}$. The following proposition states that, under certain assumptions, the Riemannian gradient update has a convergence rate of $O(1/\sqrt{t})$.
Proposition 4.
(Nemirovski et al. [2009], Bécigneul and Ganea [2018]) Let $\{A_t\}$ lie inside a geodesic ball of radius less than the minimum of the injectivity radius and the strong convexity radius of $\mathcal{M}$. Assume $\mathcal{M}$ to be a geodesically complete Riemannian manifold with sectional curvature lower bounded by κ ≤ 0. Moreover, assume that the sum of the step sizes $\{\gamma_t\}$ diverges and the sum of squared step sizes converges. Then, the Riemannian gradient descent update $A_{t+1} = \mathrm{Exp}_{A_t}(-\gamma_t \nabla^R f(A_t))$ with a bounded Riemannian gradient, i.e., $\|\nabla^R f\| \le C$ for some C ≥ 0, converges at a rate of $O(1/\sqrt{t})$, with the number of iterates needed to reach a tolerance ϵ > 0 bounded by $O(1/\epsilon^2)$ (with constants depending on C and on the Lipschitz bound D of the objective function f).
Algorithm 1:
Riemannian SGD based algorithm (RSG+) to compute canonical directions
For this result to be applicable, we need the CCA objective function to be geodesically convex as a function of U and V (proof in the appendix). All Riemannian manifolds we need, i.e., Gr(k, d), St(k, d) and SO(k), are geodesically complete, and these manifolds have non-negative sectional curvatures, i.e., lower bounded by κ = 0. Moreover, the minimum of the convexity and injectivity radii for Gr(k, d), St(k, d) and SO(k) is bounded away from zero. Now, as long as the Riemannian updates lie inside a geodesic ball of radius less than this minimum, the convergence rate for Riemannian gradient descent applies in our setting.
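To make the generic update concrete, here is a minimal sketch (ours, not Algorithm 1 itself) of one Riemannian SGD step, where `proj` and `expmap` are manifold-specific hooks such as the tangent projections sketched in Section 2.1:

```python
def rsgd_step(A, egrad, gamma, proj, expmap):
    """One generic Riemannian SGD step, A_{t+1} = Exp_{A_t}(-gamma_t * grad f(A_t)).

    `proj` maps the Euclidean (stochastic) gradient to the tangent space at A and
    `expmap` (or a retraction) maps the resulting tangent vector back to the manifold;
    both are manifold-specific hooks assumed here for illustration."""
    rgrad = proj(A, egrad)            # Riemannian gradient at A
    return expmap(A, -gamma * rgrad)  # step along the manifold
```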
Running time.
To evaluate the time complexity, we must look at the main compute-heavy steps needed. The basic modules are the Exp and Exp$^{-1}$ maps for the St(k, d), Gr(k, d) and SO(k) manifolds (see Table 1 in the appendix for a detailed specification of these maps). Observe that the complexity of these modules is driven by the complexity of the SVD needed for the Exp map on the St and Gr manifolds. Our algorithm involves structured matrices of size d × k and k × k, so any matrix operation should not exceed a cost of $O(\max(d^2k, k^3))$, since in general d ≫ k. Specifically, the most expensive calculation is the SVD of matrices of size d × k, which is $O(d^2k)$; see Golub and Reinsch [1971]. All other calculations are dominated by this term.
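As an example of such a module, a sketch (ours) of the exponential map on the Grassmannian following Edelman et al. [1998]; the thin SVD of the d × k tangent vector is the step the complexity argument above refers to:

```python
import numpy as np

def grassmann_exp(X, H):
    """Exponential map on Gr(k, d) at the subspace spanned by X (d x k, orthonormal),
    for a horizontal tangent H (X^T H = 0), following Edelman et al. [1998]:
    Exp_X(H) = X V cos(S) V^T + U sin(S) V^T, where H = U diag(S) V^T is the thin SVD."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)       # thin SVD of the d x k tangent
    Y = X @ Vt.T @ np.diag(np.cos(s)) @ Vt + U @ np.diag(np.sin(s)) @ Vt
    Q, _ = np.linalg.qr(Y)                                 # re-orthonormalize (numerical drift)
    return Q
```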
3. Experiments
We first evaluate RSG+ for extracting top-k canonical components on three benchmark datasets and show that it performs favorably compared with Arora et al. [2017]. Then, we show that RSG+ also fits into feature learning in DeepCCA Andrew et al. [2013], and can scale to large feature dimensions where the non-stochastic method fails. Finally, we show that RSG+ can be used to improve fairness of deep neural networks without full access to labels of protected attributes during training.
3.1. CCA on Fixed Datasets
Datasets and baseline.
We conduct experiments on three benchmark datasets (MNIST LeCun et al. [2010], Mediamill Snoek et al. [2006] and CIFAR-10 Krizhevsky [2009]) to evaluate the performance of RSG+ to extract top-k canonical components. To our knowledge, Arora et al. [2017] is the only previous work which stochastically optimizes the population objective in a streaming fashion and can extract top-k components, so we compare our RSG+ with the matrix stochastic gradient (MSG) method proposed in Arora et al. [2017] (note: there are two methods proposed in Arora et al. [2017] and we choose MSG because it performs better in the experiments in Arora et al. [2017]). The details regarding the three datasets and how we process them are as follows:
MNIST LeCun et al. [2010]: MNIST contains grey-scale images of size 28 × 28. We use its full training set containing 60K images. Every image is split into left/right halves, which are used as the two views.
Mediamill Snoek et al. [2006]: Mediamill contains around 25.8K paired features of videos and corresponding commentary, of dimension 120 and 101, respectively.
CIFAR-10 Krizhevsky [2009]: CIFAR-10 contains 60K 32 × 32 color images. As with MNIST, we split the images into left/right halves and use them as two views.
Evaluation metric.
We choose to use the Proportion of Correlations Captured (PCC), which is widely used Ma et al. [2015], Ge et al. [2016], partly due to its efficiency, especially for relatively large datasets. Let $\hat U$, $\hat V$ denote the estimated subspaces returned by RSG+, and $U^*$, $V^*$ denote the true canonical subspaces (all for top-k). The PCC is defined as $\mathrm{PCC} = \mathrm{TCC}(X\hat U, Y\hat V)\,/\,\mathrm{TCC}(XU^*, YV^*)$, where TCC is the sum of canonical correlations between two matrices.
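A minimal sketch (ours) of this metric, computing canonical correlations between the projected views with the standard QR/SVD recipe; `U_hat`, `V_hat`, `U_star`, `V_star` are the estimated and true top-k projections:

```python
import numpy as np

def tcc(A, B):
    """Total canonical correlation between the columns of A and B:
    sum of singular values of Qa^T Qb, where Qa, Qb are orthonormal bases
    of the (centered) column spaces (Bjorck-Golub)."""
    Qa, _ = np.linalg.qr(A - A.mean(axis=0))
    Qb, _ = np.linalg.qr(B - B.mean(axis=0))
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False).sum()

def pcc(X, Y, U_hat, V_hat, U_star, V_star):
    """Proportion of correlations captured: TCC of the estimated top-k
    projections relative to TCC of the true canonical projections."""
    return tcc(X @ U_hat, Y @ V_hat) / tcc(X @ U_star, Y @ V_star)
```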
Performance.
We run our algorithm with step sizes chosen from {1, 0.1, 0.01, 0.001, 0.0001, 0.00001}. The performance in terms of PCC, as a function of the number of seen samples (presented in a streaming manner), is shown in Fig. 2, and our RSG+ achieves around a 10× runtime improvement over MSG (see Table 1). Our RSG+ captures more correlation than MSG Arora et al. [2017] while being 5–10 times faster. One case where our RSG+ underperforms Arora et al. [2017] is when the top-k eigenvalues are dominated by the top-l eigenvalues with l < k (Fig. 2b): on the Mediamill dataset, the top-4 eigenvalues of the covariance matrix in view 1 are 8.61, 2.99, 1.15, 0.37. The first eigenvalue is dominantly large compared to the rest, and our RSG+ performs better for k = 1 but worse than Arora et al. [2017] for k = 2, 4. Runtime of RSG+ for different data dimensions (with dx = dy = d) and numbers of total samples (drawn from a joint Gaussian distribution) is reported in the appendix.
Figure 2:
Performance on three datasets in terms of PCC as a function of # of seen samples.
Table 1:
Wall-clock runtime of one pass through the data of our RSG+ and MSG on MNIST, Mediamill and CIFAR (average of 5 runs).
| Time (s) | MNIST |  |  | Mediamill |  |  | CIFAR |  |  |
|---|---|---|---|---|---|---|---|---|---|
|  | k = 1 | k = 2 | k = 4 | k = 1 | k = 2 | k = 4 | k = 1 | k = 2 | k = 4 |
| RSG+ (Ours) | 4.16 | 4.24 | 4.71 | 1.89 | 1.60 | 1.44 | 14.80 | 17.22 | 22.10 |
| MSG | 35.32 | 42.09 | 49.17 | 11.59 | 14.21 | 17.34 | 80.21 | 100.80 | 106.55 |
3.2. CCA for Deep Feature Learning
Background and motivation.
A deep neural network (DNN) extension of CCA was proposed by Andrew et al. [2013] and has become popular in multi-view representation learning tasks. The idea is to learn a deep neural network as the mapping from original data space to a latent space where the canonical correlations are maximized. We refer the reader to Andrew et al. [2013] for details of the task. Since deep neural networks are usually trained using SGD on mini-batches, this requires obtaining an estimate of the CCA objective at every iteration in a streaming fashion, thus our RSG+ can be a natural fit. We conduct experiments on a noisy version of MNIST dataset to evaluate RSG+.
Dataset.
We follow Wang et al. [2015a] to construct a noisy version of MNIST: view 1 is a randomly sampled image which is first rescaled to [0, 1] and then rotated by a random angle. View 2 is randomly sampled from the same class as view 1; we then add independent uniform noise from [0, 1] to each pixel, and finally the image is truncated back into [0, 1] to form view 2.
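A sketch (ours, not the authors' code) of this two-view construction; the rotation range of ±45 degrees is an assumption on our part, and `scipy.ndimage.rotate` is used for the rotation:

```python
import numpy as np
from scipy.ndimage import rotate

def make_noisy_mnist_views(images, labels, rng=None):
    """Two-view construction in the spirit of Wang et al. [2015a].

    `images` has shape (N, 28, 28) with uint8 pixels; the +/- 45 degree rotation
    range is an assumption on our part (not stated in the text)."""
    rng = np.random.default_rng(0) if rng is None else rng
    view1, view2 = [], []
    for i in range(len(images)):
        img = images[i].astype(np.float32) / 255.0              # rescale to [0, 1]
        angle = rng.uniform(-45.0, 45.0)                        # random rotation (assumed range)
        view1.append(rotate(img, angle, reshape=False))
        j = rng.choice(np.flatnonzero(labels == labels[i]))     # same-class partner image
        noisy = images[j].astype(np.float32) / 255.0 + rng.uniform(0.0, 1.0, img.shape)
        view2.append(np.clip(noisy, 0.0, 1.0))                  # truncate back into [0, 1]
    return np.stack(view1), np.stack(view2)
```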
Implementation details.
We use a simple 2-layer MLP with ReLU nonlinearity, where the hidden dimension in the middle is 512 and the output feature dimension is d ∈ {100, 500, 1000}. After the network is trained on the CCA objective, we use a linear Support Vector Machine (SVM) to measure classification accuracy on the output latent features. Andrew et al. [2013] uses the closed-form CCA objective on the current batch directly, which costs $O(d^3)$ memory and time for every iteration.
Performance.
Table 2 shows that we get similar performance when d = 100 and can scale to large latent dimensions (d = 1000), while the batch method Andrew et al. [2013] encounters numerical difficulty on our GPU resources and the PyTorch Paszke et al. [2019] platform in performing an eigendecomposition of a d × d matrix when d = 500, and becomes difficult if d is larger than 1000.
Table 2:
Results of feature learning on MNIST.
| Accuracy (%) | d = 100 | d = 500 | d = 1000 |
|---|---|---|---|
| DeepCCA | 80.57 | N/A | N/A |
| Ours | 79.79 | 84.09 | 86.39 |
N/A means the method fails to yield a result on our hardware.
3.3. CCA for Fairness applications
Background and motivation.
Fairness is becoming an important issue to consider in the design of learning algorithms. A common strategy to make an algorithm fair is to remove the influence of one/more protected attributes when training the models, see Lokhande et al. [2020]. Most methods assume that the labels of protected attributes are known during training but this may not always be possible. CCA enables considering a slightly different setting, where we may not have per-sample protected attributes which may be sensitive or hard to obtain for third-parties Price and Cohen [2019]. On the other hand, we assume that a model pre-trained to predict the protected attribute labels is provided. For example, if the protected attribute is gender, we only assume that a good classifier which is trained to predict gender from the samples is available rather than sample-wise gender values themselves. We next demonstrate that fairness of the model, using standard measures, can be improved via constraints on correlation values from CCA.
Dataset.
CelebA Wang et al. [2015b] consists of 200K celebrity face images from the internet. There are up to 40 labels, each of which is binary-valued. Here, we follow Lokhande et al. [2020] and focus on the attractiveness attribute (which we want to train a classifier to predict), while gender is treated as "protected" since it may lead to an unfair classifier, according to Lokhande et al. [2020].
Method.
Our strategy is inspired by Morcos et al. [2018], which showed that canonical correlations can reveal similarity between neural networks: when two networks (with the same architecture) are trained using different labels/schemes, for example, canonical correlations can indicate how similar their features are. Our observation is the following. Consider a classifier trained on gender (the protected attribute) and another classifier trained on attractiveness. If the features extracted by the latter model share high similarity with those of the model trained to predict gender, then the latter model is likely influenced by features in the image pertinent to gender, which will lead to an unfairly biased model. We show that by imposing a loss on the canonical correlation between the network being trained (for which we lack per-sample protected attribute information) and a well-trained classifier pre-trained on the protected attributes, we can obtain a fairer model. This may enable training fairer models in settings which would otherwise be difficult. The training architecture is shown in Fig. 3.
Figure 3:
Training architecture for fairness experiment. The model above is the pretrained model and the model below is being trained. Use of CCA allows the two network architectures to be different.
Implementation details.
To simulate the case where we only have a pretrained network on protected attributes, we train a ResNet-18 He et al. [2016] on the gender attribute, and when we train the classifier to predict attractiveness, we add a loss using the canonical correlations between the two networks on intermediate layers: $L_{\text{total}} = L_{\text{cross-entropy}} + L_{\text{CCA}}$, where the first term is the standard cross entropy term and the second term is the canonical correlation. See the appendix for more details of training/evaluation.
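A minimal differentiable sketch (ours, in PyTorch) of such a batch canonical-correlation penalty; in the paper, $L_{\text{CCA}}$ is computed with RSG+ in a streaming fashion, so this batch form is only illustrative:

```python
import torch

def cca_loss(feat_a, feat_b, eps=1e-4):
    """Sum of canonical correlations between two feature batches, usable as L_CCA.

    Batch whitening (Cholesky) followed by the singular values of the whitened
    cross-covariance; `eps` is a ridge term we add for numerical stability."""
    def whiten(F):
        F = F - F.mean(dim=0, keepdim=True)
        n = F.shape[0]
        C = F.t() @ F / (n - 1) + eps * torch.eye(F.shape[1], device=F.device, dtype=F.dtype)
        L = torch.linalg.cholesky(C)
        return torch.linalg.solve_triangular(L, F.t(), upper=False).t()
    Wa, Wb = whiten(feat_a), whiten(feat_b)
    T = Wa.t() @ Wb / (feat_a.shape[0] - 1)       # whitened cross-covariance
    return torch.linalg.svdvals(T).sum()

# Combined objective as stated in the text, L_total = L_cross-entropy + L_CCA:
# loss = torch.nn.functional.cross_entropy(logits, targets) + cca_loss(feats_train, feats_pretrained)
```

Detaching the pretrained branch's features and weighting the penalty are natural design choices, though the text only specifies the unweighted sum.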
Results.
We choose two commonly used error metrics for fairness: difference in Equality of Opportunity Hardt et al. [2016] (DEO) and difference in Demographic Parity Yao and Huang [2017] (DDP). We conduct experiments by applying the canonical correlation loss on three different layers in ResNet-18. In Table 3, we can see that applying the canonical correlation loss generally improves the DEO and DDP metrics (lower is better) over the standard model (trained using the cross entropy loss only). Specifically, applying the loss on early layers like conv0 and conv1 yields better performance than applying it at a relatively late layer like conv2. Another promising aspect of our approach is that it can easily handle the case where the protected attribute is a continuous variable (as long as a well-trained regression network for the protected attribute is given), while other methods like Lokhande et al. [2020], Zhang et al. [2018] need to first discretize the variable and then enforce constraints, which can be much more involved.
Table 3:
Fairness results on CelebA. We applied CCA on three different layers in ResNet-18, respectively. See the appendix for the positions of conv 0, 1, 2. "Ours-conv[0,1]-conv[1,2]" means stacking features from different layers to form hypercolumn features Hariharan et al. [2015], which shows that our approach allows the two networks to have different shapes/sizes.
|  | Accuracy (%) | DEO (%) | DDP (%) |
|---|---|---|---|
| Unconstrained | 76.3 | 22.3 | 4.8 |
| Ours-conv0 | 76.5 | 17.4 | 1.4 |
| Ours-conv1 | 77.7 | 15.3 | 3.2 |
| Ours-conv2 | 75.9 | 22.0 | 2.8 |
| Ours-conv[0,1]-conv[1,2] | 76.0 | 22.1 | 3.9 |
Limitations.
Our current implementation has difficulty scaling beyond a data dimension of $d = 10^5$, which may be desirable for large-scale DNNs. Exploring sparsity may be one way to address this, and will be enabled by additional developments in modern toolboxes.
4. Related Work
Stochastic CCA:
There has been much interest in designing scalable and provable algorithms for CCA. Ma et al. [2015] proposed the first stochastic algorithm for CCA, where local convergence is proven for the non-stochastic version. Wang et al. [2016] designed an algorithm which uses alternating SVRG combined with shift-and-invert pre-conditioning, with global convergence properties. These stochastic methods, along with Ge et al. [2016] and Allen-Zhu and Li [2016], which reduce the CCA problem to a generalized eigenvalue problem and solve it via an efficient power method, all belong to the class of methods that seek to solve the empirical CCA problem. This can be seen as an ERM approximation of the original population objective, which requires numerically solving the empirical CCA objective on a fixed dataset. These methods usually assume access to the full dataset at the outset, which may not be suitable for practical applications where data arrive in a streaming manner. Recently, there has been interest in the population CCA problem Arora et al. [2017], Gao et al. [2019]. The main difficulty in the population setting is that we have limited knowledge about the objective unless we know the distribution of X and Y. Arora et al. [2017] handle this problem by deriving an estimate of the gradient of the population objective whose error can be properly bounded, so that applying proximal gradient steps to a convex relaxed objective provably converges. Gao et al. [2019] provide a tightened analysis of the time complexity of the algorithm in Wang et al. [2016], and provide sample complexity bounds for certain distributions. The problem we study is similar to the one in Arora et al. [2017], Gao et al. [2019]: to optimize the population objective of CCA in a streaming fashion.
Riemannian Optimization:
Riemannian optimization is a generalization of standard Euclidean optimization methods to smooth manifolds: given $f : \mathcal{M} \to \mathbb{R}$, solve $\min_{A \in \mathcal{M}} f(A)$, where $\mathcal{M}$ is a Riemannian manifold. Advantages often include efficient numerical procedures for certain classes of constrained optimization problems. Applications include matrix and tensor factorization Ishteva et al. [2011], Tan et al. [2014], PCA Edelman et al. [1998], CCA Yger et al. [2012], and so on. We remark that Yger et al. [2012] also describe a CCA formulation rewritten as a Riemannian optimization on the Stiefel manifold. In our work, we further explore the benefits of the Riemannian optimization toolkit, decomposing the linear space spanned by the canonical vectors into a product of several matrices which lie on different Riemannian manifolds.
5. Conclusions
In this work, we presented a stochastic approach (RSG+) for the CCA model based on the observation that the solution of CCA can be decomposed into a product of matrices which lie on certain structured spaces. This affords specialized numerical schemes and makes the optimization more efficient. The optimization is based on Riemannian stochastic gradient descent; we provide a proof of its convergence rate with an upper bound on the number of iterations, and the method has $O(d^2k)$ time complexity per iteration. In experimental evaluations, we find that our RSG+ behaves favorably relative to the baseline stochastic CCA method in capturing the correlation in the datasets. We also demonstrate the use of RSG+ in the DeepCCA setting, showing feasibility when scaling to large dimensions, as well as an interesting use case in training fair models.
Our full codebase is available for use at https://github.com/zihangm/riemannian-streaming-cca.
Potential negative societal impacts.
Since CCA is a fundamental problem in statistical machine learning and not tied to specific applications, we do not see a negative societal impact of our proposed method. However, it is possible that CCA can be used to uncover relationships between measurements which can be used for undesirable purposes.
6. Acknowledgments
This work was supported by NIH RF1AG059312, RF1AG062336 and R01EB022883, and NSF CCF #1918211. We thank Vishnu Lokhande for help with setting up the fairness experiments.
References
- Absil P-A, Mahony RE, and Sepulchre R. Optimization algorithms on matrix manifolds. 2007.
- Allen-Zhu Z and Li Y. Doubly accelerated methods for faster CCA and generalized eigendecomposition. In ICML, 2016.
- Andrew G, Arora R, Bilmes J, and Livescu K. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255, 2013.
- Arora R, Marinov TV, Mianjy P, and Srebro N. Stochastic approximation for canonical correlation analysis. In Advances in Neural Information Processing Systems, pages 4775–4784, 2017.
- Bécigneul G and Ganea O-E. Riemannian adaptive optimization methods. arXiv preprint arXiv:1810.00760, 2018.
- Bhatia K, Pacchiano A, Flammarion N, Bartlett PL, and Jordan MI. Gen-Oja: Simple & efficient algorithm for streaming generalized eigenvector computation. In Advances in Neural Information Processing Systems, pages 7016–7025, 2018.
- Bonnabel S. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
- Boumal N, Mishra B, Absil P-A, and Sepulchre R. Manopt, a MATLAB toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1):1455–1459, 2014.
- Chakraborty R, Yang L, Hauberg S, and Vemuri B. Intrinsic Grassmann averages for online linear, robust and nonlinear subspace learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- Chaudhuri K, Kakade SM, Livescu K, and Sridharan K. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 129–136, 2009.
- Couture HD, Kwitt R, Marron JS, Troester MA, Perou CM, and Niethammer M. Deep multi-view learning via task-optimal CCA. ArXiv, abs/1907.07739, 2019.
- Edelman A, Arias TA, and Smith ST. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20:303–353, 1998.
- Gao C, Garber D, Srebro N, Wang J, and Wang W. Stochastic canonical correlation analysis. Journal of Machine Learning Research, 20(167):1–46, 2019.
- Ge R, Jin C, Netrapalli P, Sidford A, et al. Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis. In International Conference on Machine Learning, pages 2741–2750, 2016.
- Golub GH and Reinsch C. Singular value decomposition and least squares solutions. In Linear Algebra, pages 134–151. Springer, 1971.
- Golub GH and Zha H. The canonical correlations of matrix pairs and their numerical computation. 1992.
- Hardt M, Price E, and Srebro N. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.
- Hariharan B, Arbeláez P, Girshick R, and Malik J. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447–456, 2015.
- He K, Zhang X, Ren S, and Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Ishteva M, Absil P-A, Huffel SV, and Lathauwer LD. Best low multilinear rank approximation of higher-order tensors, based on the Riemannian trust-region scheme. SIAM J. Matrix Anal. Appl., 32:115–135, 2011.
- Kim HJ, Adluru N, Bendlin BB, Johnson SC, Vemuri BC, and Singh V. Canonical correlation analysis on Riemannian manifolds and its applications. In European Conference on Computer Vision, pages 251–267. Springer, 2014.
- Krizhevsky A. Learning multiple layers of features from tiny images. Technical report, 2009.
- LeCun Y, Cortes C, and Burges C. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
- Lokhande VS, Akash AK, Ravi SN, and Singh V. FairALM: Augmented Lagrangian method for training fair models with little regret. In European Conference on Computer Vision, pages 365–381. Springer, 2020.
- Luo Y, Tao D, Ramamohanarao K, Xu C, and Wen Y. Tensor canonical correlation analysis for multi-view dimension reduction. IEEE Transactions on Knowledge and Data Engineering, 27(11):3111–3124, 2015.
- Ma Z, Lu Y, and Foster DP. Finding linear structure in large datasets with scalable canonical correlation analysis. In ICML, 2015.
- Morcos A, Raghu M, and Bengio S. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems, pages 5727–5736, 2018.
- Nemirovski A, Juditsky AB, Lan G, and Shapiro A. Robust stochastic approximation approach to stochastic programming. SIAM J. Optimization, 19:1574–1609, 2009.
- Oja E. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15:267–273, 1982.
- Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
- Price WN and Cohen IG. Privacy in the age of medical big data. Nature Medicine, 25(1):37–43, 2019.
- Reiß M, Wahl M, et al. Nonasymptotic upper bounds for the reconstruction error of PCA. Annals of Statistics, 48(2):1098–1123, 2020.
- Rupnik J and Shawe-Taylor J. Multi-view canonical correlation analysis. In Conference on Data Mining and Data Warehouses (SiKDD 2010), pages 1–4, 2010.
- Snoek CG, Worring M, Van Gemert JC, Geusebroek J-M, and Smeulders AW. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th ACM International Conference on Multimedia, pages 421–430, 2006.
- Tan M, Tsang IW-H, Wang L, Vandereycken B, and Pan SJ. Riemannian pursuit for big matrix recovery. In ICML, 2014.
- Vershynin R. Four lectures on probabilistic methods for data science. The Mathematics of Data, IAS/Park City Mathematics Series, pages 231–271, 2017.
- Wang W, Arora R, Livescu K, and Bilmes J. On deep multi-view representation learning. In International Conference on Machine Learning, pages 1083–1092, 2015a.
- Wang W, Arora R, Livescu K, and Srebro N. Stochastic optimization for deep CCA via nonlinear orthogonal iterations. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 688–695. IEEE, 2015b.
- Wang W, Wang J, Garber D, and Srebro N. Efficient globally convergent stochastic optimization for canonical correlation analysis. In Advances in Neural Information Processing Systems, pages 766–774, 2016.
- Yao S and Huang B. Beyond parity: Fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems, pages 2921–2930, 2017.
- Yger F, Berar M, Gasso G, and Rakotomamonjy A. Adaptive canonical correlation analysis based on matrix manifolds. In ICML, 2012.
- Zhang BH, Lemoine B, and Mitchell M. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340, 2018.