Abstract
In this paper, we consider asymptotically exact support recovery in the context of high dimensional and sparse Canonical Correlation Analysis (CCA). Our main results describe four regimes of interest based on information theoretic and computational considerations. In regimes of “low” sparsity we describe a simple, general, and computationally easy method for support recovery, whereas in a regime of “high” sparsity, it turns out that support recovery is information theoretically impossible. For the sake of information theoretic lower bounds, our results also demonstrate a non-trivial requirement on the “minimal” size of the nonzero elements of the canonical vectors that is required for asymptotically consistent support recovery. Subsequently, the regime of “moderate” sparsity is further divided into two subregimes. In the lower of the two sparsity regimes, we show that polynomial time support recovery is possible by using a sharp analysis of a co-ordinate thresholding [1] type method. In contrast, in the higher end of the moderate sparsity regime, appealing to the “Low Degree Polynomial” Conjecture [2], we provide evidence that polynomial time support recovery methods are inconsistent. Finally, we carry out numerical experiments to compare the efficacy of various methods discussed.
Keywords: Canonical Correlation Analysis, Support Recovery, Low Degree Polynomials, Variable Selection, High Dimension
I. Introduction
Canonical Correlation Analysis (CCA) is a highly popular technique to perform initial dimension reduction while exploring relationships between two multivariate objects. Due to its natural interpretability and success in finding latent information, CCA has found enthusiasm across vast canvas of disciplines, which include, but are not limited to psychology and agriculture, information retrieving [3]-[5], brain-computer interface [6], neuroimaging [7], genomics [8], organizational research [9], natural language processing [10], [11], fMRI data analysis [12], computer vision [13], and speech recognition [14], [15].
Early developments in the theory and applications of CCA have now been well documented in the statistical literature, and we refer the interested reader to [16] and references therein for further details. However, the modern surge in interest for CCA, often being motivated by data from high throughput biological experiments [17]-[19], requires re-thinking several aspects of the traditional theory and methods. A natural structural constraint that has gained popularity in this regard, is that of sparsity, i.e., the phenomenon of an (unknown) collection of variables being related to each other. In order to formally introduce the framework of sparse CCA, we present our statistical setup next. We shall consider n-i.i.d. samples with and being multivariate mean zero random variables with joint variance covariance matrix
| (1) |
The first canonical correlation is then defined as the maximum possible correlation between two linear combinations of X and Y. This definition interprets as the optimal value of the following maximization problem:
| (2) |
The solutions to (2) are the vectors that maximize the correlation of the projections of X and Y in those respective directions. Higher order canonical correlations can thereafter be defined in a recursive fashion (cf. [20]). In particular, for , we define the jth canonical correlation and the corresponding directions uj and vj by maximizing (2) with the additional constraint
| (3) |
As mentioned earlier, in many modern data examples, the sample size n is typically at most comparable to or much smaller than p or q – rendering the classical CCA inconsistent and inadequate without further structural assumptions [21]-[23]. The framework of Sparse Canonical Correlation Analysis (SCCA) (cf. [8], [24]), where the ui’s and the vi’s are sparse vectors, was subsequently developed to target low dimensional structures (that allows consistent estimation) when p, q are potentially larger than n. The corresponding sparse estimates of the leading canonical directions naturally perform variable selection, thereby leading to the recovery of their support (cf. [8], [19], [24], [25]). It is unknown, however, under what settings, this naïve method of support recovery, or any other method for the matter, is consistent. The support recovery of the leading canonical directions serves an important purpose of identifying groups of variables that explain the most linear dependence among high dimensional random objects (X and Y) under study – and thereby renders crucial interpretability. Asymptotically optimal support recovery is yet to be explored systematically in the context of SCCA – both theoretically, and from the computational viewpoint. In fact, despite the renewed enthusiasm for CCA, both the theoretical and applied communities have mainly focused on the estimation of the leading canonical directions, and relevant scalable algorithms – see, e.g., [22], [24], [26]-[28]. This paper explores the crucial question of support recovery in the context of SCCA. 1
The problem of support recovery for SCCA naturally connects to a vast class of variable selection problems (cf. [29]-[33]). The problem closest in terms of complexity turns out to be the sparse PCA (SPCA) problem [34]. Support recovery in the latter problem is known to present interesting information theoretic and computational bottlenecks (cf. [30], [35]-[37]). Moreover, information theoretic and computational issues also arise in context of SCCA estimation problem (cf. [24], [26]-[28]). In view of the above, it is natural to expect that such information theoretic and computational issues exist in context of SCCA support recovery problem as well. However, the techniques used in SPCA support recovery analysis are not directly applicable to the SCCA problem, which poses additional challenges due to the presence of high dimensional nuisance parameters and . The main focus of our work is therefore retrieving the complete picture of the information theoretic and computational limitations of SCCA support recovery. Before going into further details, we present a brief summary of our contributions, and defer the discussions on the main subtleties to Section III. Our methods can be implemented using the R package Support.CCA [38].
A. Summary of Main Results
We say a method successfully recovers the support if it achieves exact recovery with probability tending to one uniformly over the sparse parameter spaces defined in Section II. In the sequel, we denote the cardinality of the combined support of the ui’s and the vi’s by sx and sy, respectively. Thus sx and sy will be our respective sparsity parameters. Our main contributions are listed below.
1). General methodology:
In Section III-A, we construct a general algorithm called RecoverSupp, which leads to successful support recovery whenever the latter is information theoretically tractable. This also serves as the first step in creating a polynomial time procedure for recovering support in one of the difficult regimes of the problem – see e,g. Corollary 2, which shows that RecoverSupp accompanied by a co-ordinate thresholding type method recovers the support in polynomial time in a regime that requires subtle analysis. Moreover, Theorem 1 shows that the minimal signal strength required by RecoverSupp matches the information theoretic limit whenever the nuisance precision matrices and are sufficiently sparse.
2). Information theoretic and computational hardness as a function of sparsity:
As the sparsity level increases, we show that the CCA support recovery problem transitions from being efficiently solvable, to NP hard (conjectured), and to information theoretically impossible. According to this hardness pattern, the sparsity domain can be partitioned into the following three regimes: (i) , , , , and (iii) , . We describe below the distinguishing behaviours of these three regimes, which is consistent with the sparse PCA scenario.
We show that when , (“easy regime”), polynomial time support recovery is possible, and well-known consistent estimators of the canonical correlates (cf. [24], [28]) can be utilized to that end. When , (“difficult regime”), we show that a co-ordinate thresholding type algorithm (inspired by [1]) succeeds provided . We call the last regime “difficult” because it is unknown whether existing estimation methods like COLAR [28] or SCCA [24] have valid statistical guarantees in this regime – see Section III-A and Section III-D for more details.
In Section III-C, we show that when , (“hard regime”), support recovery is computationally hard subject to the so called “low degree polynomial conjecture” recently popularized by [39], [40], and [2]. Of course, this phenomenon is observable only when p, , because otherwise, the problem would be solvable by the ordinary CCA analysis (cf. [23], [41]). Our findings are consistent with the conjectured computational barrier in context of SCCA estimation problem [28].
When , , we show that support recovery is information theoretically impossible (see Section III-B).
3). Information theoretic hardness as a function of minimal signal strength:
In context of support recovery, the signal strength is quantified by
Generally, support recovery algorithms require the signal strength to lie above some threshold. As a concrete example, the detailed analyses provided in [1], [30], and [35] are all based on the nonzero principal component elements being of the order . To the best of our knowledge, prior to our work, there was no result in the PCA/CCA literature on the information theoretic limit of the minimal signal strength. Generally, PCA studies assume that the top eigenvectors are de-localized, i.e., the principal components have elements of the order and thereby mostly considered the cases of de-localized eigenvectors. We do not make any such assumption on the canonical covariates, and thereby we believe that our study paints a more complete picture of the support recovery.
In Section III-B, we show that (or ) is a necessary requirement for successful support recovery by U (or V).
B. Notation
For a vector , we denote its support by . We will overload notation, and for a matrix , we will denote by D(A) the indexes of the nonzero rows of A. By an abuse of notation, sometimes we will refer to D(A) as the support of A as well. When and are unknown parameters, generally, the estimator of their supports will be denoted by and , respectively. We let denote the set of all positive numbers, and write for the set of all natural numbers {0, 1, 2, … ,}. For any , We let [n] denote the set {1, … , n}. We define the projection of A onto by
| (4) |
For any finite set , we denote its cardinality by . Also, for any event , we let be the indicator of the event . For any , we let denote the unit sphere in .
We let be the usual lk norm in for . In particular, we let denote the number of nonzero elements of a vector . For any probability measure on the Borel sigma field of , we let to be the set of all measurable functions such that . The corresponding inner product will be denoted by . We denote the operator norm and the Frobenius norm of a matrix by and , respectively. We let Ai* and Aj denote the i-th row and j-th column of A, respectively. For , we define the norms and . The maximum and minimum eigenvalues of a square matrix A will be denoted by and , respectively. Also, we let s(A) denote the maximum number of nonzero entries in any column of A, i.e., .
The results in this paper are mostly asymptotic (in n) in nature and thus require some standard asymptotic notations. If an and bn are two sequences of real numbers then (and ) implies that (and ) as , respectively. Similarly (and ) implies that for some (and for some ). Alternatively, will also imply and will imply that for some . We write if there are positive constants C1 and C2 such that for all . We will write to indicate an and bn are asymptotically of the same order up to a poly-log term. Finally, in our mathematical statements, C and c will be two different generic constants which can vary from line to line.
II. Mathematical Formalism
We denote the rank of by r. It can be shown that exactly r canonical correlations are positive and the rest are zero in the model (2). We will consider the matrices and . From (2) and (3), it is not hard to see that and . The indexes of the nonzero rows of U and V, respectively, are the combined support of the ui’s and the vi’s. Since we are interested in the recovery of the latter, it will be useful for us to study of the properties of U and V. To that end, we often make use of the following representation connecting to U and V [16]:
| (5) |
To keep our results straightforward, we restrict our attention to a particular model throughout, defined as follows.
Definition 1. Suppose . Let be a constant. We say if
-
A1
(Sub-Gaussian) X and Y are sub-Gaussian random vectors (cf. [42]), with joint covariance matrix Σ as defined in (1). Also rank.
-
A2
Recall the definition of the canonical correlation from (3). Note that by definition, . For , additionally satisfies .
-
A3
(Sparsity) The number of nonzero rows of U and V are sx and sy, respectively, that is and . Here U and V are as defined in (5).
-
A4(Bounded eigenvalue)
-
A5
(Positive eigen-gap) for i = 2, … , r.
Sometimes we will consider a sub-model of where each is Gaussian. This model will be denoted by , where “G” stands for the Gaussian assumption. Some remarks on the modeling assumptions A1—A5 are in order, which we provide next.
-
A1.
We begin by noting that we do not require X and Y to be jointly sub-Gaussian. Moreover, the individual sub-Gaussian assumption itself is common in the , regime in the SCCA literature (cf. [24], [28], [43]). Our proof techniques depend crucially on the sub-Gaussian assumption. We also anticipate that the results derived in this paper will change under the violation of this assumption. For the sharper analysis in the difficult regime (, ), our proof techniques require the Gaussian model – which is in parallel with [1]’s treatment of the sparse PCA in the corresponding difficult regime. In general, the Gaussian spiked model assumption in sparse PCA goes back to [44], and is common in the PCA literature (cf. [30], [35]).
-
A2-A4.
These assumptions are standard in the analysis of canonical correlations (cf. [24], [28]).
-
A5.
This assumption concerns the gap between consecutive canonical correlation strengths. However, we refer to this gap as “Eigengap” because of its similarity with the Eigengap in the sparse PCA literature (cf. [1], [45]). This assumption is necessary for the estimation of the i-th canonical covariates. Indeed, if then there is no hope of estimating the i-th canonical covariates because they are not identifiable, and so support recovery also becomes infeasible. This assumption can be relaxed to requiring only k many λi’s to be strictly larger than where k ≤ r. In this case, we can recover the support of only the first k canonical covariates.
In the following sections, we will denote the preliminary estimators of U and V by and , respectively. The columns of and will be denoted by and , respectively. Therefore and will stand for the corresponding preliminary estimators of ui and vi. In case of CCA, the ui’s and vi’s are identifiable only up to a sign flip. Hence, they are also estimable only up to a sign flip. Finally, we denote the empirical estimates of , , and , by , , and , respectively – which will often be appended with superscripts to denote their estimation through suitable subsamples of the data 2. Finally, we let denote a positive constant which depends on only through , but can vary from line to line.
III. Main Results
We divide our main results into the following parts based on both statistical and computational difficulties of different regimes. First, in Section III-A we present a general method and associated sufficient conditions for support recovery. This allows us to elicit a sequence of questions regarding necessity of the conditions and remaining gaps both from statistical and computational perspectives. Our subsequent sections are devoted to answering these very questions. In particular, in Section III-B we discuss information theoretic lower bounds followed by evidence for statistical-computational gaps in Section III-C. Finally, we close a final computational gap in asymptotic regime through sharp analysis of a special co-ordinate-thresholding type method in Section III-D.
A. A Simple and General Method:
We begin with a simple method for estimating the support, which readily establishes the result for the easy regime, and sets the directions for the investigation into other more subtle regimes. Since the estimation of D(U) and D(V) are similar, we focus only on the estimation of D(V) for the time being.
Suppose is a row sparse estimator of V. The nonzero indexes of is the most intuitive estimator of D(V). Such an is also easily attainable because most estimators of the canonical directions in the high dimension are sparse (cf. [24], [26], [28] among others). Although we have not yet been able to show the validity of this apparently “naïve” method, we provide numerical results in Section IV to explore its finite sample performance. However, a simple method can refine these initial estimators, to often optimally recover the support D(V). We now provide the details of this method and derive its asymptotic properties.
To that end, suppose we have at our disposal an estimating procedure for , which we generically denote by and an estimator of U. We split the sample in two equal parts, and compute and from the first part of the sample, and the estimator from the second part of the sample. Define . Our estimator of D(V) is then given by
| (6) |
where cut is a pre-specified cut-off or threshold. We will discuss more on cut later. The resulting algorithm will be referred as RecoverSupp from now on. Algorithm 1 gives the algorithm for the support recovery of V, but the full version of RecoverSupp, which estimates D(U) and D(V) simultaneously, can be found in Appendix A; see Algorithm 3 there. RecoverSupp is similar in spirit to the “cleaning” step in the sparse PCA support recovery literature (cf. [1]). One thing to remember here is that is not an estimator V. In fact, the (i, j)-th element of is an estimator of .
Remark 1. In many applications, the rank r may be unknown. [46] (see Section 4.6.5 therein) suggests to use the screeplot of the canonical correlations to estimate r. Screeplot is also a popular tool to estimate the number of nonzero principal components in PCA analysis [1]. For CCA, the screeplot is the plot of the estimated canonical correlations versus their orders. If there is a clear gap between two successive correlations, [46] suggests taking the larger correlation as the estimator of . One can use [24]’s SCCA method to estimate the canonical correlations to obtain the screeplot. There can be other ways of estimating r. For example, in their trans-eQTL study, [47] uses a resampling technique on a holdout dataset to generate observations from the null distribution of the i-th canonical correlation estimate under the hypothesis , where . The largest i, for which the test is rejected, is taken as the estimated rank. A similar technique has been used by [48] to select the ranks for a related method JIVE.
| Algorithm 1 RecoverSupp ): sup-port recovery of V | |
|---|---|
|
|
It turns out that, albeit being so simple, RecoverSupp has desirable statistical guarantees provided and are reasonable estimators of U and , respectively. These theoretical properties of RecoverSupp, and the hypotheses and queries generated thereof, lay out the roadmap for the rest of our paper. However, before getting into the detailed theoretical analysis of RecoverSupp, we state a l2-consistency condition on and , where we remind the readers that we let and denote the i-th columns of and , respectively. Recall also that the i-th columns of U and V are denoted by ui and vi, respectively.
Condition 1 (l2 consistency ). There exists a function so that and the estimators and of ui and vi satisfy
with probability 1 − o(1) uniformly over .
We will discuss the estimators which satisfy Condition 1 later. Theorem 1 also requires the signal strength Sigy to be at least of the order , where the parameter depends on the type of as follows:
is of type A if there exists Cpre > 0 so that satisfies with probability 1 − o(1) uniformly over . Here we remind the readers that . In this case, .
is of type B if with probability 1 − o(1) uniformly over for some Cpre > 0. In this case, .
is of type C if . In this case, .
The estimation error of clearly decays from type A to C, with the error being zero at type C. Because is generally much smaller than , shrinks from Case A to Case C monotonously as well. Thus it is fair to say that reflects the precision of the estimator in that is smaller if is a sharper estimator. We are now ready to state Theorem 1. This theorem is proved in Appendix C.
Theorem 1. Suppose and the estimators satisfy Condition 1. Further suppose is of type A, B, or C, which are stated above. Let where depends on the type of as outlined above. Then there exists a constant , depending only on , so that if
| (7) |
and , with , then the algorithm RecoverSupp fully recovers D(V) with probability 1 − o(1) uniformly over (for of type A and C), or uniformly over (for of type B).
The assumption that log p and log q are o(n) appears in all theoretical works of CCA (cf. [24], [28]). A requirement of this type is generally unavoidable. Note that Theorem 1 implies a more precise estimator requires smaller signal strength for full support recovery.
Main idea behind the proof of Theorem 1:
Because , is an estimator of for and . If , then for all . Therefore, in this case, we expect to be small for all . We will show that whenever , is uniformly bounded by for and with high probability. Here is a constant. Second, when , we will show that can not be too small. In fact, we will show that
| (8) |
for some with high probability in this case. The lower bound in the above inequality is bounded below by . Thus, if the minimal signal strength Sigy is bounded below by a large enough multiple of ϵn, then the lower bound will be larger than the upper bound in the case. Therefore, in this scenario, we can choose C > 0 so that
If we set , then the above inequality leads to.
These C1 and C2 are behind the constant in (7) and our choice of θn.
Thus the key step in the proof of Theorem 1 is analyzing the bias of , which hinges on the following bias decomposition:
| (9) |
Note that the term corresponds to the bias in estimating . Similarly, the error terms and incur due to the bias in estimating and ui, respectively. The main contributing term in the upper bound in (9) is . One can use the consistency property of to show that is of the order . Since has different rates and modes of convergence in cases A, B, and C, has different orders in cases A, B, and C, which explains why ϵn is of different order in these cases.
The term is much smaller – it is of the order . The proof bases on the fact that the l∞ error of estimating by is of the order for subgaussian X and Y. The error term is exactly zero for , and hence does not contribute. Thus only and contribute to the bias of for , which is therefore bounded by for some C1 > 0 with high probability in this case. The term does contribute to the bias of for , however, and it is of the order in this case. Because Err is small by Condition 1, we can show that is smaller than , which eventually leads to the relation in (8), thus completing the proof. We have already mentioned that RecoverSupp is analogous to the cleaning step in sparse PCA. Therefore the proof of Theorem 1 has similarities with some analogous results in sparse PCA. See for example Theorem 3 of [1], which proves the consistency of a “cleaned” estimator of the joint support of the spiked principal components. However, the proof in the CCA case is a bit more involved because of the presence of , which needs to be estimated for the cleaning step. Different estimators of can have different rates of convergence, which leads to the different types of the estimators. This ultimately leads to different requirements on the order of the threshold cut and the minimal signal strength Sigy.
Next we will discuss the implications of Theorem 1, but before getting into that detail, we will make two important remarks.
Remark 2. Although the estimation of the high dimensional precision matrix is potentially complicated, it is often unavoidable owing to the inherent subtlety of the CCA framework due to the presence of high dimensional nuisance parameters and . [26] also used precision matrix estimator for partial recovery of the support. In case of sparse CCA, to the best of our knowledge, there does not exist an algorithm that can recover the support, partially or completely, without estimating the precision matrix. However, our requirements on are not strict in that many common precision matrix estimators, e.g., the nodewise Lasso [49, Theorem 2.4], the thresholding estimator [50, Theorem 1 and Section 2.3], and the CLIME estimator [51, Theorem 6] exhibit the decay rate of type A and B under standard sparsity assumptions on . We will not get into the detail of the sparsity requirements on because they are unrelated to the sparsity of U or V, and hence are irrelevant to the primary goal of the current paper.
Remark 3. In the easy regime , polynomial time estimators satisfying Condition 1 are already available, e.g., COLAR [28, Theorem 4.2] or SCCA [24, Condition C4]. Thus it is easily seen that polynomial time support recovery is possible in the easy regime provided (7) is satisfied.
The implications of Theorem 1 in context of the sparsity requirements on D(U) and D(V) for full support recovery are somewhat implicit through the assumptions and conditions. However, the restriction on the sparsity is indirectly imposed by two different sources – which we elaborate on now. To keep the interpretations simple, throughout the following discussion, we assume that (a) , (b) p and q are of the same order, and (c) sx and sy are also of the same order. Note that (a) implies for a type B estimator of . Since we separate the task of estimating the nuisance parameter from the support recovery of V, we also assume that , which implies for a type A estimator of . The assumption , combined with (a), reduces the minimal signal strength condition (7) in Theorem 1 to .
In lieu of the discussion above, the first source of sparsity restriction is the minimal signal strength condition (7) on Sigy. To see this, first note that
where . Since ,
implying . Therefore, implicit in Theorem 1 lies the condition
| (10) |
which is enforced by the minimal signal strength requirement (7). Thus Theorem 1 does not hold for even when and r are small. This regime requires some attention because in case of sparse PCA [30] and linear regression [29], support recovery at 3 is proven to be information theoretically impossible. However, although a parallel result can be intuited to hold for CCA, the details of the nuances of SCCA support recovery in this regime is yet to be explored. Therefore, the sparsity requirement in (10) raises the question whether support recovery for CCA is at all possible when , even if and is known.
Question 1. Does there exist any decoder such that when ?
A related question is whether the minimal signal strength requirement (7) is necessary. To the best of our knowledge, there is no formal study on the information theoretic limit of the minimal signal strength even in context of the sparse PCA support recovery. Indeed, as we noted before, the detailed analyses of support recovery for SPCA provided in [1], [30], and [35] are all based on the nonzero principal component elements being of the order . Finally, although this question is not directly related to the sparsity conditions, it indeed probes the sharpness of the results in Theorem 1.
Question 2. What is the minimum signal strength required for the recovery of D(V)?
We will discuss Question 1 and Question 2 at greater length in Section III-B. In particular, Theorem 2(A) shows that there exists C > 0 so that support recovery at is indeed information theoretically intractable. On the other hand, in Theorem 2(B), we show that the minimal signal strength has to be of the order for full recovery of D(V). Thus when , (7) is indeed necessary from information theoretic perspectives.
The second source of restriction on the sparsity lies in Condition 1. Condition 1 is a l2-consistency condition, which has sparsity requirement itself owing the inherent hardness in the estimation of U. Indeed, Theorem 3.3 of [28] entails that it is impossible to estimate the canonical directions ui’s consistently if for some large C > 0. Hence, Condition 1 indirectly imposes the restriction . However, when , , and , the above restriction is already absorbed into the condition elicited in the last paragraph. In fact, there exist consistent estimators of U whenever and (see [27] or Section 3 of [28]). Therefore, in the latter regime, RecoverSupp coupled with the above-mentioned estimators succeeds. In view of the above, it might be tempting to think that Condition 1 does not impose significant additional restrictions. The restriction due to Condition 1, however, is rather subtle and manifests itself through computational challenges. Note that when support recovery is information theoretically possible, the computational hardness of recovery by RecoverSupp will be at least as much as that of the estimation of U. Indeed, the estimators of U which work in the regime , are not adaptive of the sparsity, and they require a search over exponentially many sets of size sx and sy. Furthermore, under , all polynomial time consistent estimators of U in the literature, e.g., COLAR [28, Theorem 4.2] or SCCA [24, Condition C4], require sx, sy to be of the order . In fact, [28] indicates that estimation of U or V for sparsity of larger order is NP hard.
The above raises the question whether RecoverSupp (or any method as such) can succeed at polynomial time when , . We turn to the landscape of sparse PCA for intuition. Indeed, in case of sparse PCA, different scenarios are observed in the regime , depending on whether , or (we recall that for SPCA we denote the sparsity of the leading principal component direction generically through s). We focus on the sub-regime first. In this case, both estimation and support recovery for sparse PCA are conjectured to be NP hard, which means no polynomial time method succeeds; see Section III-C for more details. The above hints that the regime , is NP hard for sparse CCA as well.
Question 3. Is there any polynomial time method that can recover the support D(V) when , ?
We dedicate Section III-C to answering this question. Subject to the recent advances in the low degree polynomial conjecture, we establish computational hardness of the regime , (up to a logarithmic factor gap) subject to , q. Our results are consistent with [28]’s findings in the estimation case and cover a broader regime; see Remark 5 for a comparison.
When the sparsity is of the order and , however, polynomial time support recovery and estimation are possible for the sparse PCA case. [1] showed that a co-ordinate thresholding type spectral algorithm works in this regime. Thus the following question is immediate.
Question 4. Is there any polynomial time method that can recover the support D(V) when , ?
We give an affirmative answer to Question 4 in Section III-D, which is in parallel with the observations for the sparse PCA. In fact, Corollary 2 shows that when and are known, , and , , estimation is possible in polynomial time. Since estimation is possible, RecoverSupp suffices for polynomial time support recovery in this regime, where is well below the information theoretic limit of . The main tool used in Section III-D is co-ordinate thresholding, which is originally a method for high dimensional matrix estimation [50], and apparently has nothing to do with estimation of canonical directions. However, under our setup, if the covariance matrix is consistently estimated in operator norm, by Wedin’s Sin θ Theorem [52], an SVD is enough to get a consistent estimator of U and V suitable for further precise analysis.
Remark 4. RecoverSupp uses sample splitting, which can reduce the efficiency. One can swap between the samples and compute two estimators of the supports. One can easily show that both the intersection and the union of the resulting supports enjoy the asymptotic guarantees of Theorem 1.
This section can be best summarized by Figure 1, which gives the information theoretic and computational landscape of sparse CCA analysis in terms of the sparsity. In other words, Figure 1 gives the phase transition plot for SCCA support recovery with respect to sparsity. It can be seen that our contributions (colored in red) complete the picture, which was initiated by [28].
Fig. 1:
Phase transition plots for SCCA estimation and support recovery problems with respect to sparsity. We have taken here. COLAR corresponds to the estimation method of [28]. Our contributions are colored in red. See [28] for more details on the regions colored in blue.
B. Information Theoretic Lower Bounds: Answers to Question 1 and 2
Theorem 2 establishes the information theoretic limits on the sparsity levels sx, sy, and the signal strengths Sigx and Sigy. The proof of Theorem 2 is deferred to Appendix D.
Theorem 2. Suppose and are estimators of D(U) and D(V), respectively. Let sx, sy > 1, and , . Then the following assertions hold:
-
If , thenOn the other hand, if , then
-
Let be the class of distributions satisfying . ThenOn the other hand, if , Then
In both cases, the infimum is over all possible decoders and .
First, we discuss the implications of part A of Theorem 2. This part entails that for full support recovery of V, the minimum sample size requirement is of the order . This requirement is consistent with the traditional lower bound on n in context of support recovery for sparse PCA [30, Theorem 3] and L1 regression [29, Corollary 1]. However, when , the sample size requirement for estimation of V is slightly relaxed, that is, [28, Theorem 3.2]. Therefore, from information theoretic point of view, the task of full support recovery appears to be slightly harder than the task of estimation. The scenario for partial support recovery might be different and we do not pursue it here. Moreover, as mentioned earlier, in the regime , RecoverSupp works with [28]’s (see Section 3 therein) estimator of U. Thus part A of Theorem 2 implies that is the information theoretic upper bound on the sparsity for the full support recovery of sparse CCA.
Part B of Theorem 2 implies that it is not possible to push the minimum signal strength below the level . Thus the minimal signal strength requirement (7) by Theorem 1 is indeed minimal up to a factor of . The last statement can be refined further. To that end, we remind the readers that for a good estimator of , i.e., a type B estimator, if . However, the latter always holds if support recovery is at all possible, because in that case , and elementary linear algebra gives . Thus, it is fair to say that, provided a good estimator of , the requirement (7) is minimal up to a factor of . Indeed, this implies that for banded inverses with finite band-width our results are rate optimal.
It is further worth comparing this part of the result to the SPCA literature. In the SPCA support recovery literature, generally, the lower bound on the signal strength is depicted in terms of the sparsity s, and usually a signal strength of order is postulated (cf. [1], [30], [35]). Using our proof strategies, it can be easily shown that for SPCA, the analogous lower bound on the signal strength would be . The latter is generally much smaller than and only when , the requirement of close to the lower bound. Thus, in the regime , the lower bound should rather be of the order . Therefore the minimum signal strength requirement of typically assumed in SPCA literature seems larger than necessary.
Main idea behind the proof of Theorem 2:
The main device used in this proof is Fano’s inequality [53]. Note that for any ,
| (11) |
Therefore it suffices to show that the left hand side in the above inequality is bounded away from 1/2 for some carefully chosen . If is finite, we can lower bound the left hand side of (11) using Fano’s inequality [53], which yields
| (12) |
Thus the main task is to choose in a way so that the right hand side (RHS) of (12) is large. We will choose so that X and Y are jointly Gaussian. In particular, , , and where and are fixed, and α is allowed to vary in a set . In this model, r = 1, ρ is the canonical correlation, and α and β0 are the left and right canonical covariates, respectively. Also, varies across as α varies across . Moreover, . Our main task boils down to choosing carefully.
The idea behind choosing is as follows. For any decoder, i.e., an estimator of the support, the chance of making error increases when is large. This can also be seen noting that the right hand side of (12) increases as increases. However, even if we prefer a larger , we need to ensure that the KL divergence between the distributions in the resulting is small. The reason is that, for a large , the right hand side of (12) can be small unless the KL divergence between the corresponding distributions in is small. In other words, any decoder will face a challenge detecting the true support of α when there are many distributions to choose from, and these distributions are also close to each other in KL distance.
For part A of Theorem 2, we choose in the following way. Letting
we let be the class of α’s which are obtained by replacing one of the in α0 by 0, and one of the zero’s in α0 by . A typical α obtained this way looks like
In this case, it turns out that . Under the conditions of part A of 2, we can show that the RHS of (12) is bounded below by 1/2 for this . The proof of part A is similar to its PCA analogue, which is Theorem 3 of [30]. The latter theorem is also based on Fano’s lemma and uses a similar construction for . However, there is no PCA analogue of part B. For part B of Theorem 2, we let be the class of all α’s so that
where
can take any position out of the positions. Clearly, . It can be shown that the RHS of (12) is bounded below by 1/2 in this case as well.
C. Computational Limits and Low Degree Polynomials: Answer to Question 3
We have so far explored the information theoretic upper and lower bounds for recovering the true support of leading canonical correlation directions. However, as indicated in the discussion preceding Question 3, the statistically optimal procedures in the regime where , are computationally intensive and is of exponential complexity (as a function of p, q). In particular, [28] have already showed that when sx and sy belong to parts of this regime, estimation of the canonical correlates is computationally hard, subject to a computational complexity based Planted Clique Conjecture. For the case of support recovery, the SPCA has been explored in detail and the corresponding computational hardness has been established in analogous regimes – see, e.g., [1], [30], and [35] for details. A similar phenomenon of computational hardness is observed in case of SPCA spike detection problem [54]. In light of the above, it is natural to believe that the SCCA support recovery is also computationally hard in the regime , , and, as a result, yields a statistical-computational gap. Although several paths exist to provide evidence towards such gaps 4, the recent developments using “Predictions from Low Degree Polynomials” [2], [39], [40] is particularly appealing due its simplicity in exposition. In order to show computationally hardness of the SCCA support recovery problem in the , regime, we shall resort to this very style of ideas, which has so far been applied successfully to explore statistical-computational gaps under sparse PCA [36], Stochastic Block Models, and tensor PCA [40], among others. This will allow us to explore the computational hardness of the problem in the entire regime where
| (13) |
compared to the somewhat partial results (see Remark 5 for detailed comparison) in earlier literature.
We divide our discussions to argue the existence of a statistical-computational gap in this regime as follows. Starting with a brief background on the statistical literature on such gaps, we first present a natural reduction of our problem to a suitable hypothesis testing problem in Section III-C1. Subsequently, in Section III-C2 we present the main idea of the “low degree polynomial conjecture” by appealing to the recent developments in [39], [40], and [2]. Finally, we present our main result for this regime in Section III-C3, thereby providing evidence of the aforementioned gap modulo the Low Degree Polynomial Conjecture presented in Conjecture 1.
1). Reduction to Testing Problem::
Denote by the distribution of a random vector. Therefore corresponds to the case when X and Y are uncorrelated. We first show that there is any scope of support recovery in only if is distinguishable from , i.e., the test vs. has asymptotic zero error.
To formalize the ideas, suppose we observe i.i.d random vectors which are distributed either as or . We denote the n-fold product measures corresponding to and by and , respectively. Note that if , then . We overload notation, and denote the combined sample and by X and Y respectively. In this section, X and Y should be viewed as unordered sets. The test for testing the null vs. the alternative is said to strongly distinguish and if
The above implies that both the type I error and the type II error of Φn converges to zero as . In case of composite alternative , the test strongly distinguishes from if
Now we explain how support recovery and the testing framework are connected. Suppose there exist decoders which exactly recover D(U) and D(V) under for . Then the trivial test, which rejects the null if either of the estimated supports is non-empty, strongly distinguishes from . The above can be coined as the following lemma.
Lemma 1. Suppose there exist polynomial time decoders and of D(U) and D(V) so that
| (14) |
Further assume, , and . Then there exists a polynomial time test which strongly distinguishes and .
Thus, if a regime does not allow any polynomial time test for distinguishing from , there can be no polynomial time computable consistent decoder for D(U) and D(V). Therefore, it suffices to show that there is no polynomial time test which distinguishes from in the regime , . To be more explicit, we want to show that if , , then
| (15) |
for any Φn that is computable in polynomial time.
The testing problem under concern is commonly known as the CCA detection problem, owing to its alternative formulation as vs. . In other words, the test tries to detect if there is any signal in the data. Note that, Lemma 1 also implies that detection is an easier problem than support recovery in that the former is always possible whenever the latter is feasible. The opposite direction may not be true, however, since detection does not reveal much information on the support.
2). Background on the Low-degree Framework:
We shall provide a brief introduction to the low-degree polynomial conjecture which forms the basis of our analyses here, and refer the interested reader to [39], [40], and [2] for in-depth discussions on the topic. We will apply this method in context of the test vs. . The low-degree method centers around the likelihood ratio , which takes the form in the above framework. Our key tool here will be the Hermite polynomials, which form a basis system of [62]. Central to the low-degree approach lies the projection of onto the subspace (of ) formed by the Hermite polynomials of degree at most . The latter projection, to be denoted by from now on, is important because it measures how well polynomials of degree ≤ Dn can distinguish from . In particular,
| (16) |
where the maximization is over polynomials of degree at most Dn [36].
The norm of the untruncated likelihood ratio has long held an important place in the theory hypothesis testing since implies and are asymptotically indistinguishable. While the untruncated likelihood ratio is connected to the existence of any distinguishing test, degree Dn projections of are connected to the existence of polynomial time distinguishing tests. The implications of the above heuristics are made precise by the following conjecture [40, Hypothesis 2.1.5].
Conjecture 1 (Informal). Suppose . For “nice” sequences of distributions and , if as n → ∞ whenever Dn ≤ t(n)polylog(n), then there is no time-nt(n) test that strongly distinguishes and .
Thus Conjecture 1 implies that the degree-Dn polynomial is a proxy for time-nt(n) algorithms [2]. If we can show that for a Dn of the order for some ϵ > 0, then the low degree Conjecture says that no polynomial time test can strongly distinguish and [2, Conjecture 1.16].
Conjecture 1 is informal in the sense that we do not specify the “nice” distributions, which are defined in Section 4.2.4 of [2] (see also Conjecture 2.2.4 of [40]). Niceness requires to be sufficiently symmetric, which is generally guaranteed by naturally occurring high dimensional problems like ours. The condition of “niceness” is attributed to eliminate pathological cases where the testing can be made easier by methods like Gaussian elimination. See [40] for more details.
3). Main Result:
Similar to [36], we will consider a Bayesian framework. It might not be immediately clear how a Bayesian formulation will fit into the low-degree framework, and lead to (15). However, the connection will be clear soon. We put independent Rademacher priors and on α and β. We say if are i.i.d., and for each ,
| (17) |
The Rademacher prior can be defined similarly. We will denote the product measure by . Let us define
| (18) |
When , is the covariance matrix corresponding to and with covariance . Hence, for to be positive definite, is a sufficient condition. The priors and put positive weight on α and β that do not lead to a positive definite , and hence calls for extra care during the low-degree analysis. This subtlety is absent in the sparse PCA analogue [36].
Let us define
| (19) |
We denote the n-fold product measure corresponding to by . If , then the marginal density of (X, Y) is . The following lemma, which is proved in Appendix H-C, explains how the Bayesian framework is connected to (15).
Lemma 2. Suppose and . Then
where is the shorthand for .
Note that a similar result holds for as well because . Lemma 2 implies that to show (15), it suffices to show that a polynomial time computable fails to strongly distinguish the marginal distribution of X and Y from . However, the latter falls within the realms of the low degree framework because the corresponding likelihood ratio takes the form
| (20) |
Using priors on the alternative space is a common trick to convert a composite alternative to a simple alternative, which generally yields more easily to various mathematical tools.
If we can show that for some , then Conjecture 1 would indicate that a time computable fails to distinguish the distribution of from . Theorem 3 accomplishes the above under some additional conditions on p, q, and n, which we will discuss shortly. Theorem 3 is proved in Appendix E.
Theorem 3. Suppose ,
| (21) |
Then is O(1) where is as defined in (20).
The following Corollary results from combining Lemma 2 with Theorem 3.
Corollary 1.Suppose| (22) |
If Conjecture 1 is true, then for , there is no time test that strongly distinguishes and .
Corollary 1 conjectures that polynomial time algorithms can not strongly distinguish and provided sx, sy, p, and q satisfy (22). Therefore under (22), Lemma 1 conjectures support recovery to be NP hard.
Now we discuss a bit on condition (22). The first constraint in (22) is expected because it ensures , , which indicates that the sparsity is in the hard regime. We need to explain a bit on why the other constraint p, is needed. If , the sample canonical correlations are consistent, and therefore strong separation is possible in polynomial time without any restriction on the sparsity [23], [41]. Even if and , then also strong separation is possible in model 18 provided the canonical correlation ρ is larger than some threshold depending on c1 and c2 [23]. The restriction p, ensures that the problem is hard enough so that the vanilla CCA does not lead to successful detection. The constant 3e is not sharp and possibly can be improved. The necessity of the condition p, is unknown for support recovery, however. Since support recovery is a harder problem than detection, in the hard regime, polynomial time support recovery algorithms may fail at a weaker condition on n, p, and q.
Remark 5. [Comparison with previous work:] As mentioned earlier, [28] was the first to discover the existence of computational gap in context of sparse CCA. In their seminal work, [28] established the computational hardness of CCA estimation problem at a particular subregime of , provided is allowed. In view of the above, it was hinted that sparse CCA becomes computationally hard when , . However, when is bounded, the entire regime , is probably not computationally hard. In Section III-D, we show that if , then both polynomial time estimation and support recovery are possible if , at least in the known and case. The latter sparsity regime can be considerably larger than , . Together, Section III-D and the current section indicate that in the bounded case, the transition of computational hardness for sparse CCA probably happens at the sparsity level , not , which is consistent with sparse PCA. Also, the low-degree polynomial conjecture allowed us to explore almost the entire targeted regime , , where [28], who used the planted clique conjecture, considers only a subregime of , .
We will end the current section with a brief outline of the proof of Theorem 3.
The main idea behind the proof of Theorem 3:
Let us denote by the linear span of all -variate Hermite polynomials of degree at most Dn. For each and , we let , where is the univariate normalized Hermite polynomial of degree zi. We will discuss the Hermite polynomials in greater detail in Appendix E. Any normalized m-variate Hermite polynomial is of the form , where . Then is the linear span of all with
Since is the projection of on , it then holds that
The first step of the proof is to find out the expression of . Since , we can partition w into w = (w1, … , wn), where for each i ∈ [n]. Using some algebra, we can show that
Exploiting the properties of Hermite polynomials, it can be shown that
where for , , and any function , the notation stands for the z-th order partial derivative of f with respect to t evaluated at the origin. The rest of the proof is similar to the PCA analogue in [36], but there is an extra indicator term for the CCA case. Following [36], we use the common trick of using replicas of α and β to simplify the algebra. Suppose , and , are independent. Let W be the indicator function of the event . Denote by the p-th order truncation of the Taylor series expansion of at x = 0. Following some algebra, it can be shown that
Comparing the above with the analogous result for PCA, namely Lemma 4.2 of [36], we note that the indicator term W does not appear in the PCA analogue. The indicator term W appears in the CCA case because we had set to be for to tackle the extra restrictions on α and β in this case.
D. A Polynomial Time Algorithm for Regime : Answer to Question 4
In this subsection, we show that in the difficult regime , using a soft co-ordinate thresholding (CT) type algorithm, we can estimate the canonical directions consistently when . CT was introduced by the seminal work of [50] for the purpose of estimating high dimensional covariance matrices. For SPCA, [1]’s CT is the only algorithm that provably recovers the full support in the difficult regime (see also [35]). In context of CCA, [26] uses CT for partial support recovery in the rank one model under what we referred to as the easy regime. However, [26]’s main goal was the estimation of the leading canonical vectors, not support recovery. As a result, [26] detects the support of the relatively large elements of the leading canonical directions, which are subsequently used to obtain consistent preliminary estimators of the leading canonical directions. Our thresholding level and theoretical analysis are different from that of [26] because the analytical tools used in the easy regime do not work in the difficult regime.
1). Methodology: Estimation via CT:
By “thresholding a matrix A co-ordinate-wise”, we will roughly mean the process of assigning the value zero to any element of A which is below a certain threshold in absolute value. Similar to [1], we will consider the soft thresholding operator, which, at threshold level t, takes the form
It will be worth noting that the soft thresholding operator is continuous.
| Algorithm 2 Coordinate Thresholding (CT) for estimating D(V) | |
|---|---|
|
|
We will also assume that the covariance matrices and are known. To understand the difficulty of unknown and , we remind the readers that . Because the matrices U and V are sandwiched between the matrices and , their sparsity pattern does not get reflected in the sparsity pattern of . Therefore, if one blindly applies CT to , they can at best hope to recover the sparsity pattern of the outer matrices and . If the supports of the matrices U and V are of main concern, CT should rather be applied on the matrix . If and are unknown, one needs to efficiently estimate before the application of CT. Although under certain structural conditions, it is possible to find rate optimal estimators and of and at least in theory, the errors and may still blow up due to the presence of the high dimensional matrix , which can be as big as in operator norm. One may be tempted to replace with a sparse estimator of to facilitate faster estimation, but that does not work because we explicitly require the formulation of as the sum of Wishart matrices (see equation 37 in the proof). The latter representation, which is critical for the sharp analysis, may not be preserved by a CLIME [51] or nodewise Lasso estimator [49] of Σxy.
We remark in passing that it is possible to obtain an estimator so that . Although the latter does not provide much control over the operator norm of , it is sufficient for partial support recovery, e.g., the recovery of the rows of U or V with strongest signals. (See Appendix B of [26] for example, for some results in this direction under the easy regime when r = 1.)
As indicated by the previous paragraph, we apply coordinate thresholding to the matrix , which directly targets the matrix . We call this step the peeling step because it extracts the matrix from the sandwiched matrix . We then perform the entry-wise co-ordinate thresholding algorithm on the peeled form with threshold Thr so as to obtain . We postpone the discussion on Thr to Section III-D2. The thresholded matrix is an estimator of , but we need an estimator of . Therefore, we again sandwich between and . The motivation behind this sandwiching is that if , then is a good estimator of in that
However, is an SVD of . Using Davis-Kahan sin theta theorem [52], one can show that the SVD of produces estimators and of and , where the columns of and are ϵn-consistent in l2 norm for the columns of and , respectively, up to a sign flip (cf. Theorem 2 of [52]). Pre-multiplying the resulting U′ by yields an estimator of U up to a sign flip of the columns. We do not worry about the sign flip because Condition 1 allows for the sign flips of the columns. Therefore, we feed this into RecoverSupp as our final step. See Algorithm 2 for more details.
Remark 6. In case of electronic health records data, it is possible to obtain large surrogate data on X and Y separately and thus might allow relaxing the known precision matrices assumption above. We do not pursue such semi-supervised setups here.
2). Analysis of the CT Algorithm:
For the asymptotic analysis of the CT algorithm, we will assume the underlying distribution to be Gaussian, i.e., . This Gaussian assumption will be used to perform a crucial decomposition of sample covariance matrix, which typically holds for Gaussian random vectors. [1], who used similar devices for obtaining the sharp rate results in SPCA, also required a similar Gaussian assumption. We do not yet know how to extend these results to sub-Gaussian random vectors.
Let us consider the threshold , where Thr is explicitly given in Theorem 4. Unfortunately, tuning Thr requires knowledge of the underlying sparsities sx and sy. Similar to [1], our thresholding level differs from the traditional choice of order in the easy regime analyzed in [50], [63], and [26]. The latter level is too large to successfully recover all the nonzero elements in the difficult regime. We threshold at a lower level, which, in turn, complicates the analysis considerably. Our main result in this direction, stated in Theorem 4, is proved in Appendix F.
Theorem 4. Suppose . Further suppose , , and . Let K and C1 be constants so that and , where C > 0 is an absolute constant. Suppose the threshold level Thr is defined by
Suppose is a constant that takes the value K, C1, or one in case (i), (ii), and (iii), respectively. Then there exists an absolute constant C > 0 so that the following holds with probability 1 − o(1) for :
To unpack the statement of Theorem 4, let us assume for the time being. Then case (ii) in the theorem corresponds to . Thus, CT works in the difficult regime provided . It should be noted that the threshold for this case is almost of the order , which is much smaller than , the traditional threshold for the easy regime. Next, observe that case (i) is an easy case because is much smaller than . Therefore, in this case, the traditional threshold of the easy regime works. Case (iii) includes the hard regime, where polynomial time support recovery is probably impossible. Because it is unlikely that CT can improve over the vanilla estimator in this regime, the threshold is set to zero.
Remark 7. Theorem 4 requires because one of our concentration inequalities in the analysis of case (ii) needs this technical condition (see Lemma 8). The omitted regime is indeed an easier one, where specialized methods like CT are not even required. In fact, it is well known that subgaussian X and Y satisfy (cf. Theorem 4.7.1 of [42])
which is in the regime under consideration. Including this result in the statement of Theorem 4 would unnecessarily lengthen the exposition. Therefore, we exclude this regime from Theorem 4 and focus on the regime.
Remark 8. The statement of Theorem 4 is not explicit about the lower bound on the constant C1. However, our simulations show that the algorithm works for . Both threshold parameters C1 and K in Theorem 4 depend on the unknown . The proof actually shows that can be replaced by , , , .
Finally, Theorem 4 leads to the following corollary, which establishes that in the difficult regime, there exist estimators which satisfy Condition 1, so that Algorithm 2 succeeds with probability tending to one provided . This answers Question 4 in the affirmative for Gaussian distributions.
Corollary 2. Instate the conditions of Theorem 4. Then there exists so that if
| (23) |
then the defined in Algorithm 2 satisfies Condition 1, and Algorithm 2 correctly recovers .
We defer the proof of Corollary 2 to Appendix G. We will now present a brief outline of the proof of Theorem 4.
Main idea behind the proof of Theorem 4:
The proof hinges on the hidden variable representation of X and Y due to [64]. We discuss this representation in detail in Appendix C-2; it basically says that the data matrices X and Y can be represented as
where , , and are independent standard Gaussian data matrices, and , , , and . We will later show in Section C-2 that and are well defined positive definite matrices. It follows that has the representation
Next, we define some sets. Let , , , and . Thus E1 and E2 correspond to the supports, whereas F1 and F2 correspond to their complements. Now we partition [p] × [q] into the following three sets:
| (24) |
and
| (25) |
Therefore E is the set that contains the joint support. We can decompose as
| (26) |
where is the projection operator defined in (4).
The usefulness of the decomposition in (26) is that S1, S2, and S3 have different supports, which enables us to write
We can therefore analyze the three terms , , and separately. In general, the thresholding operator η is not additive: for matrices A and B, η(A + B) = η(A) + η(B) does not hold in general unless A and B have disjoint supports.
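As a quick illustration of why disjoint supports matter here, the toy example below (threshold 1.0, chosen arbitrarily) shows that entry-wise hard thresholding is not additive in general, while it trivially distributes over matrices with disjoint supports.

```python
import numpy as np

eta = lambda A, thr: A * (np.abs(A) > thr)      # entry-wise hard thresholding

A = np.array([[0.6, 0.0], [0.0, 0.6]])
B = np.array([[0.6, 0.0], [0.0, -0.2]])
print(eta(A + B, 1.0))                # the (1, 1) entry 1.2 survives the threshold
print(eta(A, 1.0) + eta(B, 1.0))      # both summands are killed, so the sum is zero
# If A and B had disjoint supports, eta(A + B) and eta(A) + eta(B) would agree entry by entry.
```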
As indicated above, we analyze the operator norms of , , and separately. Among S1, S2, and S3, S1 is the only matrix that is supported on E, the true support. The basic idea of the proof is to show that co-ordinate thresholding preserves the matrix S1 and kills off the other matrices S2 and S3, which contain the noise terms. S1 includes the matrix . Because concentrates around Ir by the Bai-Yin law (cf. Theorem 4.7.1 of [42]), concentrates around . Therefore the analysis of η(S1) is relatively straightforward.
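The concentration driving the analysis of η(S1) is easy to check numerically. The sketch below (with arbitrary n and r) verifies that the operator-norm deviation of the r × r Wishart factor ZᵀZ/n from Ir is of order (r/n)^{1/2}, where Z stands for the n × r latent Gaussian data matrix of the hidden variable representation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5000, 3
Z = rng.standard_normal((n, r))            # latent standard Gaussian data matrix
W = Z.T @ Z / n                            # Wishart factor appearing in S1
dev = np.linalg.norm(W - np.eye(r), 2)     # operator-norm deviation from the identity
print(dev, np.sqrt(r / n))                 # the two numbers are of the same order
```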
Most of the proof is devoted to showing that and are small, i.e., that co-ordinate thresholding kills off the noise terms. The difficulty arises because the threshold is kept smaller than the traditional threshold of order to adjust for the hard regime. Therefore the approaches of [50] or [28] do not work in this regime. The noise matrices S2 and S3 are sums of matrices of the form , , or , or their transposes, where for the rest of this section, M and N should be understood as deterministic matrices of appropriate dimensions, whose definitions can change from line to line. Analyzing and essentially hinges on Lemma 8, which upper bounds the operator norm of matrices of the form . The proof of Lemma 8 uses, among other tools, a sharp Gaussian concentration result from [1] (see Corollary 10 therein) and a generalized Chernoff inequality for dependent Bernoulli random variables [65]. Using Lemma 8, we can also upper bound the operator norms of matrices of the form because can be represented as for some matrix M3 of appropriate dimension. Therefore, Lemma 8 suffices to show that and are small, which also completes the proof.
The proof of Theorem 4 has similarities with the proof of the analogous result for PCA in [1] (see Theorem 1 therein). However, one main difference is that for PCA, the key instrument is the representation of X as the spiked model [44], which yields the representation
| (27) |
where and are standard Gaussian data matrices, and is a deterministic matrix. The analysis in PCA revolves around the sample covariance matrix , which, following (27), writes as
From the above representation, it can be shown that the analogues of S2 and S3 in the PCA case are sums of matrices of the form or their transposes. [1] uses an upper bound on to bound the PCA analogues of and (see Proposition 13 therein). In contrast, we encounter terms of the form since CCA is concerned with X^T Y/n. To deal with these terms, we need an upper bound on instead, which requires a separate, elaborate proof. Although the basic ideas behind bounding and are similar, the proof of the bound on is more involved. For example, some independence structures are destroyed by the pre- and post-multiplication by the matrices M1 and N1, respectively. We require concentration inequalities for dependent Bernoulli random variables to tackle the latter.
IV. Numerical Experiments
This section illustrates the performance of different polynomial time CCA support recovery methods as the sparsity transitions from the easy to the difficult regime. We base our demonstration on a Gaussian rank-one model, i.e., (X, Y) are jointly Gaussian with covariance matrix . For simplicity, we take p = q and sx = sy = s. In all our simulations, ρ is set to 0.5, and , where
are unit norm vectors. Note that most elements of β are of order O(s^{-2/3}), whereas a typical element of α is of order O(s^{-1/2}). Therefore, we will refer to α and β as the moderate and the small signal case, respectively. For the population covariance matrices Σx and Σy of X and Y, we consider the following two scenarios:
- A (Identity): Σx = Ip and Σy = Iq. Since p = q, they are essentially the same.
- B (Sparse inverse): This example is taken from [28]. In this case, are banded matrices, whose entries are given by
Now we explain our common simulation scheme. We take the sample size n to be 1000, and consider three values of p: 100, 200, and 300. The largest value of p + q is thus 600, which is smaller than, but comparable to, n. Our simulations indicate that all of the methods considered here require n to be considerably larger than p + q for the asymptotics to kick in at ρ = 0.5. We will discuss this point in detail later. We further let vary in the set [0.01, 2]. To be more specific, we consider 16 equidistant points in [0.01, 2] for the ratio .
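As a concrete illustration of this setup, the Python sketch below draws (X, Y) from a rank-one Gaussian model with identity marginal covariances (case A) and canonical correlation ρ = 0.5. The uniform-on-support α and β used here are stand-ins: the exact moderate- and small-signal vectors of our simulations, as well as the map from the plotted sparsity ratio to s, are not reproduced.

```python
import numpy as np

def generate_rank_one(n, rho, alpha, beta, rng):
    """Draw n samples of (X, Y) with identity marginal covariances and
    cross-covariance rho * alpha beta^T (so the canonical correlation is rho)."""
    p, q = alpha.size, beta.size
    z = rng.standard_normal(n)                  # shared one-dimensional latent variable
    E1, E2 = rng.standard_normal((n, p)), rng.standard_normal((n, q))
    c = 1.0 - np.sqrt(1.0 - rho)                # (I - rho * a a^T)^{1/2} = I - c * a a^T
    X = np.sqrt(rho) * np.outer(z, alpha) + E1 - c * np.outer(E1 @ alpha, alpha)
    Y = np.sqrt(rho) * np.outer(z, beta) + E2 - c * np.outer(E2 @ beta, beta)
    return X, Y

rng = np.random.default_rng(1)
n, p, s, rho = 1000, 100, 10, 0.5
alpha = np.zeros(p); alpha[:s] = 1.0 / np.sqrt(s)   # stand-in unit-norm sparse directions
beta = np.zeros(p); beta[:s] = 1.0 / np.sqrt(s)
X, Y = generate_rank_one(n, rho, alpha, beta, rng)
print(np.corrcoef(X @ alpha, Y @ beta)[0, 1])        # close to rho = 0.5
```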
Now we discuss the error metrics used to compare the performance of the different support recovery methods. Type I and type II errors are commonly used to measure the performance of support recovery [1]. In the case of support recovery of α, we define the type I error to be the proportion of zero elements of α that appear in the estimated support . Thus, we quantify the type I error of α by . On the other hand, the type II error for α is the proportion of elements of D(α) that are absent from , i.e., the type II error is quantified by . The type I and type II errors corresponding to β are defined similarly. Our simulations demonstrate that methods with a low type I error often exhibit a high type II error, and vice versa. In such situations, comparing the corresponding methods becomes difficult if one uses the type I and type II errors separately. Therefore, we consider a scaled Hamming loss type metric, which suitably combines the type I and type II errors. The symmetric Hamming error of estimating D(α) by is [66, Section 2.1]
Note that the above quantity is always bounded above by one. We can similarly define the symmetric Hamming distance between D(β) and . Finally, the estimates of these three errors (Type I, Type II, and scaled Hamming Loss) are obtained based on 1000 Monte Carlo replications.
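The three error metrics can be computed from the true and estimated supports as in the sketch below. The normalization of the symmetric Hamming error used here (size of the symmetric difference divided by the sum of the two support sizes) is one convention that stays bounded by one; the exact scaling in [66, Section 2.1] may differ.

```python
import numpy as np

def support_errors(true_supp, est_supp, p):
    """Type I, type II, and a symmetric Hamming-type error for support recovery."""
    true_supp, est_supp = set(true_supp), set(est_supp)
    n_zeros = p - len(true_supp)
    type1 = len(est_supp - true_supp) / n_zeros if n_zeros else 0.0   # false positives / zeros
    type2 = len(true_supp - est_supp) / len(true_supp)                # misses / signals
    denom = len(true_supp) + len(est_supp)
    hamming = len(true_supp ^ est_supp) / denom if denom else 0.0     # scaled symmetric difference
    return type1, type2, hamming

print(support_errors({0, 1, 2}, {1, 2, 3, 4}, p=10))   # (2/7, 1/3, 3/7)
```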
Now we discuss the support recovery methods we compare here.
Naïve SCCA. We estimate α and β using the SCCA method of [24], and set and , where and are the corresponding SCCA estimators. To implement the SCCA method of [24], we use the R code referred to therein with default tuning parameters.
Cleaned SCCA. This method implements RecoverSupp with the above mentioned SCCA estimators of α and β as the preliminary estimators.
CT. This is the method outlined in Algorithm 2, which is RecoverSupp coupled with the CT estimators of α and β.
Our CT method requires knowledge of the population covariance matrices Σx and Σy. Therefore, to keep the comparison fair, we implement RecoverSupp with the population covariance matrices in the case of the cleaned SCCA method as well. Because of their reliance on RecoverSupp, both cleaned SCCA and CT depend on the threshold cut, tuning which seems to be a non-trivial task. We set , where C is the thresholding constant. Our simulations show that a large C results in a high type II error, whereas insufficient thresholding inflates the type I error. Taking the Hamming loss into account, we observe that C ≈ 1 leads to better overall performance in case A. Case B, on the other hand, requires a smaller value of the thresholding parameter. In particular, we let C be one in case A, and set C = 0.05 and 0.2, respectively, for the support recovery of α and β in case B. The CT algorithm requires an extra threshold parameter, namely the parameter Thr in Algorithm 2, which corresponds to the co-ordinate thresholding step. We set Thr in accordance with Theorem 4 and Remark 8, with K being and C1 being . We set as in Remark 8, that is
The errors incurred by our methods in case A are displayed in Figure 2 (for α) and Figure 3 (for β). Figures 4 and 5, on the other hand, display the errors in the recovery of α and β, respectively, in case B.
Fig. 2:
Support recovery for α when and . Here threshold refers to cut in Theorem 1.
Fig. 3:
Support recovery for β when and . Here threshold refers to cut in Theorem 1.
Fig. 4:
Support recovery for α when and are the sparse covariance matrices. Here threshold refers to cut in Theorem 1.
Fig. 5:
Support recovery for β when and are the sparse covariance matrices. Here threshold refers to cut in Theorem 1.
Now we discuss the main observations from the above plots. When the sparsity parameter s is quite low (less than ten in the current settings), the naïve SCCA method is sufficient in the sense that the specialized methods do not perform any better. Moreover, the naïve method is the most conservative of the three methods. As a consequence, its type I error is always small, although its type II error grows faster than that of any other method. The specialized methods are able to improve the type II error at the cost of a higher type I error. At higher sparsity levels, however, the specialized methods can outperform the naïve method in terms of the Hamming error. This is most evident when the setting is also complex, i.e., the signal is small or the underlying covariance matrices are not the identity. In particular, Figures 2 and 4 show that when the signal strength is moderate and the sparsity is high, the cleaned SCCA has the lowest Hamming error. In the small signal case, however, CT exhibits the best Hamming error as increases; cf. Figures 3 and 5.
The type I error of CT can be slightly improved if the sparsity information is incorporated into the thresholding step. We simply replace cut by the maximum of cut and the s-th largest element of , where the latter is as in Algorithm RecoverSupp. See, e.g., Figure 6, which shows that this modification reduces the Hamming error of the CT algorithm in case A; a small sketch of the modification appears after Figure 6. Our empirical analysis hints that the CT algorithm has potential for improvement from the implementation perspective. In particular, it may be desirable to obtain a more efficient procedure for choosing cut in a systematic way. However, such a detailed numerical analysis is beyond the scope of the current paper and will require further modifications of the initial methods for estimating α and β, both for scalability and for finite sample performance. We leave these explorations as important future directions.
Fig. 6:
Support recovery by the CT algorithm when we use the information on sparsity to improve the type I error. Here and are and , respectively, and threshold refers to cut in Theorem 1. To see the decrease in type I error, compare the errors with those of Figures 2 and 3.
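For completeness, here is a minimal sketch of the sparsity-informed modification described above: the cut-off is replaced by the maximum of cut and the s-th largest cleaning score. The vector `scores` stands in for the quantity thresholded inside RecoverSupp, whose exact definition is given in Algorithm 3.

```python
import numpy as np

def sparsity_informed_support(scores, cut, s):
    """Select coordinates whose score exceeds max(cut, s-th largest score);
    for distinct scores this caps the number of selected coordinates at s."""
    s_th_largest = np.sort(np.abs(scores))[-s]
    new_cut = max(cut, s_th_largest)
    return np.where(np.abs(scores) >= new_cut)[0]
```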
It is natural to wonder about the effect of cleaning via RecoverSupp on SCCA. As mentioned earlier, during our simulations we observed that a cleaning step generally improves the type II error of the naïve SCCA, but it also increases the type I error. In terms of the combined measure, i.e., the Hamming error, it turns out that cleaning does have an edge at higher sparsity levels in case B; cf. Figures 4 and 5. However, the scenario is different in case A. Although Figures 2 and 3 indicate that almost no cleaning occurs at the chosen threshold level of one, we did observe cleaning at lower threshold levels. However, the latter does not improve the overall Hamming error of naïve SCCA. The consequence of cleaning may be different for other SCCA methods.
To summarize, when the sparsity is low, support recovery using the naïve SCCA is probably as good as the specialized methods. However, at higher sparsity level, specialized support recovery methods may be preferable. Consequently, the precise analysis of the apparently naïve SCCA will indeed be an interesting future direction.
V. Discussion
In this paper, we have discussed the rate optimal behavior of the information theoretic and computational limits of joint support recovery for the sparse canonical correlation analysis problem. Inspired by recent results in the estimation theory of sparse CCA, a flurry of results in sparse PCA, and related developments based on the low-degree polynomial conjecture, we are able to paint a complete picture of the landscape of support recovery for SCCA. As for future directions, it is worth noting that our results are so far not designed to recover D(vi) for individual i ∈ [r] separately (hence the term joint recovery). Although this is also the case for most of the state of the art in the sparse PCA problem (results often exist only for the combined support [1] or for the single spike model where r = 1 [29]), we believe that it is an interesting question for deeper exploration in the future. Moreover, moving beyond asymptotically exact recovery of the support to more nuanced metrics (e.g., Hamming loss) will also require new ideas worth studying. Finally, it remains an interesting question whether polynomial time support recovery is possible in the , regime using a CT type idea, but with unknown yet structured high dimensional nuisance parameters Σx, Σy.
Acknowledgments
This work was supported by National Institutes of Health grant P42ES030990.
Biographies
Nilanjana Laha received a Bachelor of Statistics in 2012 and a Master of Statistics in 2014 from the Indian Statistical Institute, Kolkata. She then received a Ph.D. in Statistics in 2019 from the University of Washington, Seattle. She was a postdoctoral research fellow in the Department of Biostatistics at Harvard University from 2019 to 2022. She is currently an Assistant Professor of Statistics at Texas A&M University. Her research interests include dynamic treatment regimes, high dimensional association, and shape constrained inference.
Rajarshi Mukherjee received a Bachelor of Statistics in 2007 and a Master of Statistics in 2009 from the Indian Statistical Institute, Kolkata. He received his Ph.D. in Biostatistics from Harvard University in 2014. He was a Stein Fellow in the Department of Statistics at Stanford University from 2014 to 2017, and an Assistant Professor in the Division of Biostatistics at the University of California, Berkeley, from 2017 to 2018. Since 2018, he has been an Assistant Professor in the Department of Biostatistics at Harvard University. His research interests primarily lie in structured signal detection problems in high dimensional and network models, and functional estimation and adaptation theory in nonparametric statistics.
Appendix A. Full version of RecoverSupp
Algorithm 3: RecoverSupp, simultaneous support recovery of U and V
In Algorithm 3, we used different cut-offs for estimating and , which are cutx and cuty, respectively. In practice, one can choose the same threshold cut for both of them.
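Since the body of Algorithm 3 is not reproduced above, the sketch below is only a schematic reading of the cleaning step: row-wise scores built from preliminary direction estimates and a cross-covariance computed on an independent sample are thresholded at cutx and cuty separately. The particular scores used here are stand-ins for the exact statistics of Algorithm 3.

```python
import numpy as np

def recover_supp_sketch(Sxy_hat, U_prelim, V_prelim, cutx, cuty):
    """Schematic cleaning: threshold row-wise scores at separate cut-offs.
    The scores below are illustrative placeholders, not the exact RecoverSupp statistics."""
    score_x = np.abs(Sxy_hat @ V_prelim).max(axis=1)     # one score per X-coordinate
    score_y = np.abs(Sxy_hat.T @ U_prelim).max(axis=1)   # one score per Y-coordinate
    D_U = np.where(score_x > cutx)[0]
    D_V = np.where(score_y > cuty)[0]
    return D_U, D_V
```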
Appendix B. Proof preliminaries
The Appendix collects the proofs of all our theorems and lemmas. This section introduces some new notation and collects some facts that are used repeatedly in our proofs.
A. New Notations
Since the columns of , i.e., [], are orthogonal, we can extend them to an orthogonal basis of , which can also be expressed in the form [] since Σx is non-singular. Let us denote the matrix [u1,…,up] by , whose first r columns form the matrix U. Along the same lines, we can define , whose first q columns constitute the matrix V.
Suppose is a matrix. Recall the projection operator defined in (4). For any S ⊂ [p], we let denote the matrix . Similarly, for , we let AF be the matrix . For , we define the norms and . We will use the notation ∣A∣∞ to denote the quantity .
The Kullback Leibler (KL) divergence between two probability distributions P1 and P2 will be denoted by KL(P1 ∣ P2). For , we let denote greatest integer less than or equal to .
B. Facts on
First, note that since by (2) for all i ∈ [q], we have . Similarly, we can also show that . Second, we note that , and
| (28) |
because the largest element of Λ is not larger than one. Since the Xi's and Yi's are subgaussian, for any random vector v independent of X and Y, it follows that [45, Lemma 7]
| (29) |
with probability 1 – o(1) uniformly over . Also, we can show that satisfies
where the Cauchy-Schwarz inequality was used in the first step.
C. General Technical Facts
Fact 1. For two matrices and , we have
Fact 2 (Lemma 11 of [1]). Let be a matrix with i.i.d. standard normal entries, i.e., Zi,j ~ N(0, 1). Then for every t > 0,
As a consequence, there exists an absolute constant C > 0 such that
Recall that for , in Appendix B-A, we defined ∥A∥1,∞ and ∥A∥∞1 to be the matrix norms and , respectively.
The following fact is a Corollary to (29).
Fact 3. Suppose X and Y are jointly subgaussian. Then .
Fact 4 (Chi-square tail bound). Suppose . Then for any y > 5, we have
Proof of Fact 4. Since the Zl's are independent standard Gaussian random variables, tail bounds on chi-squared random variables yield (the form below is from Lemma 12 of [1])
Plugging in x = yk, we obtain that
which implies for y > 1,
which can be rewritten as
as long as y > 5. □
Appendix C. Proof of Theorem 1
For the sake of simplicity, we denote , , and by , , and , respectively. The reader should keep in mind that and are independent of and because they are constructed from a different sample. Next, using Condition 1, we can show that there exists so that
as n → ∞. Without loss of generality, we assume wi = 1 for all i ∈ [r]. The proof will be similar for general wi’s. Thus
| (30) |
Therefore for all i ∈ [r] with probability tending to one.
Now we will collect some facts which will be used during the proof. Because and are independent, (29) implies that
Using (30), we obtain that . Because , we have
| (31) |
Noting (28) implies , and that , using (31), we obtain that
| (32) |
with probability 1 – o(1).
Now we are ready to prove Theorem 1. We will denote the columns of by for i ∈ [r]. Because , it holds that
leading to
Handling the term T2 is the easiest because
with probability 1 – o(1) uniformly over , where we used (31) and the fact that . The difference in cases (A), (B), (C) arises only due to different bounds on T1(i, k) in these cases. We demonstrate the whole proof only for case (A). For the other two cases, we only discuss the analysis of T1(i, k) because the rest of the proof remains identical in these cases.
1). Case (A):
Since we have shown in (32) that , we calculate
with probability tending to one, uniformly over , where to get the last inequality, we also used the bound on in case (A).
Finally, for T3, we notice that
since Λ1 ≤ 1. Since (vj)k = Vkj, it is clear that T3(i, k) is identically zero if . Otherwise, the Cauchy-Schwarz inequality implies,
because are orthogonal. Thus
Now we will combine the above pieces together. Note that
| (33) |
For , denoting the i-th column of by we observe that,
| (34) |
with probability 1 – o(1) uniformly over . On the other hand, if k ∈ D(vi), then we have for all i ∈ [r],
which implies
Since and , we have
Thus, noting Vki = (vi)k, we obtain that
with probability 1 – o(1) uniformly over . Suppose . Note that
where θn > 2. Then with probability 1 – o(1) uniformly over ,
This, combined with (34), implies that setting leads to full support recovery with probability 1 – o(1). The proof of the first part follows.
2). Case (B):
In the Gaussian case, we resort to the hidden variable representation of X and Y due to [64], which enables a sharper bound on the term T1(i, k). Suppose Z ~ Nr(0, Ir), where r is the rank of Σxy. Consider Z1 ~ Np(0, Ip) and Z2 ~ Nq(0, Iq) independent of Z. Then X and Y can be represented as
| (35) |
where
and
Here is well defined because
where Λx is a p × p diagonal matrix whose first r elements are Λ1,…,Λr, and the rest are zero. Because Λ1 ≤ 1, we have
Similarly, we can show that
where Λy is the diagonal matrix whose first r elements are Λ1,…,Λr, and the rest are zero. It can be easily verified that
and
which ensures that the joint variance of (X, Y) is still the covariance matrix in (1). Also, some linear algebra leads to
| (36) |
Suppose we have n independent realizations of the pseudo-observations Z1, Z2, and Z. Denote by Z1, Z2, and Z the stacked data matrices with the i-th row given by (Z1)i, (Z2)i, and Zi, respectively, for i ∈ [n]. Here we use the term data matrix although we do not observe Z, Z1, and Z2 directly. Due to the representation in (35), the data matrices X and Y have the form
We can write the covariance matrix as
| (37) |
Therefore, for any vector and , we have
| (38) |
By the Bai-Yin law on the eigenvalues of Wishart matrices [67], there exists an absolute constant C > 0 so that for any t > 1,
which, combined with (36), implies
Now we will state a lemma which will be required to control the other terms on the right hand side of (38).
Lemma 3. Suppose and are independent Gaussian data matrices. Further suppose and are either deterministic or independent of both Z1 and Z2. Then there exists a constant C > 0 so that for any t > 1,
The proof of Lemma 3 follows directly by setting b = 1 in the following lemma, which is proved in Appendix H-D.
Lemma 4. Suppose and are independent standard Gaussian data matrices, and and are deterministic matrices with rank a and b, respectively. Let a ≤ b ≤ n. Then there exists an absolute constant C > 0 so that for any t ≥ 0, the following holds with probability at least :
Lemma 3, in conjunction with (36), implies that there exists an absolute constant C > 0 so that
with probability at least for all . Therefore, there exists C > 0 so that
| (39) |
for all . Note that
Now suppose and . By our assumption, with probability 1 – o(1) uniformly across . We also showed that . It is not hard to see that
| (40) |
with probability 1 – o(1) uniformly across . For T11, observe that (39) applies because and are independent of . Thus we can write that for any t > 1, there exists such that
Applying union bound, we obtain that for any ,
Since r < q and log q = o(n), setting , we obtain that
is o(1). Using (33) and (40), one can show that
in this case.
3). Case (C):
Note that when . Therefore, (33) implies in this case.
Appendix D. Proof of Theorem 2
Since the proof for U and V follows in a similar way, we will only consider the support recovery of U. The proof for both cases follows a common structure. Therefore, we will elaborate the common structure first. Since the model is fairly large, we will work with a smaller submodel. Specifically, we will consider a subclass of the single spike models, i.e., r = 1. Because we are concerned with only the support recovery of the left singular vectors, we fix in so that . We also fix ρ ∈ (0, 1) and consider the subset . Both ρ and will be chosen later. We restrict our attention to the submodel given by
where (41) is as follows:
| (41) |
That Σ is positive definite for ρ ∈ (0, 1) can be shown either using elementary linear algebra or using the hidden variable representation (35). During the proof of part (B), we will choose so that , which will ensure that as well.
Note that for , U corresponds to α, and hence D(U) = D(α). Therefore for the proof of both parts, it suffices to show that for any decoder of D(α),
| (42) |
In both of the proofs, our will be a finite set. Our goal is to choose so that is structurally rich enough to guarantee (42), yet lends itself to easy computations. The guidance for choosing comes from our main technical tool for this proof, which is Fano's inequality. We use the version of Fano's inequality in [53] (Fano's Lemma). Applied to our problem, this inequality yields
| (43) |
where denotes the product measure corresponding to n i.i.d. observations from . We also have the following result for product measures, . Moreover, when , with left singular vectors and , respectively,
where by Lemma 13, and
by Lemma 14. Noting , , and are unit vectors, we derive . Therefore, in our case, (43) reduces to
| (44) |
Thus, to ensure the right hand side of (44) is non-negligible, the key is to choose so that the α’s in are close in l2 norm, but is sufficiently large. Note that the above ensures that distinguishing the α’s in is difficult.
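For the reader's convenience, one commonly used form of Fano's inequality reads as follows; the precise version from [53] applied in (43) may differ in its constants.
\[
\inf_{\widehat{T}} \ \max_{1 \le j \le M} \mathbb{P}_j\big(\widehat{T} \neq j\big) \;\ge\; 1 - \frac{\max_{j \neq k} \mathrm{KL}\big(\mathbb{P}_j \,\|\, \mathbb{P}_k\big) + \log 2}{\log M},
\]
where M is the number of hypotheses, the infimum runs over all estimators (decoders) of the index j, and in our application the hypotheses index the elements of the finite class of α's while the \(\mathbb{P}_j\)'s are the corresponding n-fold product measures.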
A. Proof of part (A)
Note that our main job is to choose and ρ suitably. Let us denote
We generate a class of α's by replacing one of the in by 0, and one of the zeros in by . A typical α obtained this way looks like
Let be the class, which consists of , and all such resulting α’s. Note that , and , satisfy
Because p > sx > 1, we have
Therefore, (44) leads to
which is bounded below by 1/2 whenever
which follows if
because . To get the best bound on sx, we choose the value of ρ that minimizes ρ²/(1 − ρ²) for , that is, . Plugging in , the proof follows.
B. Proof of part (B)
Suppose each is of the following form
We fix z ∈ (0, 1), and hence is also fixed. We will choose the values of ρ and z later so that . Since z is fixed, such an α can be chosen in p − sx + 1 ways. Therefore . Also note that for α, , . Therefore (44) implies
| (45) |
| (46) |
which is greater than 1/2 whenever
which holds if
because 16 < p − sx. To get the best bound on z, we choose the value of ρ that maximizes (1 − ρ²)/ρ², that is, . Thus (42) is satisfied when , and corresponds to
Since the minimal signal strength Sigx for any equals min(z, b) ≤ z, we have , which completes the proof.
Appendix E. Proof of Theorem 3
We first introduce some notation and terminology required for the proof. For , and , we denote and . In the low-degree polynomial literature, when , the notation ∣w∣ is commonly used to denote the sum for the sake of simplicity. We follow this convention as well. Here the notation ∣·∣ should not be confused with the absolute value of real numbers. Also, for any function , , and t = (t1,…, tm), we denote
We will sometimes also use the shorthand notation to denote .
Our analysis relies on Hermite polynomials, which we discuss very briefly here. For a detailed account of Hermite polynomials, see Chapter V of [62]. The univariate Hermite polynomial of degree k will be denoted by hk. For k ≥ 0, the univariate Hermite polynomials are defined recursively as follows:
The normalized univariate Hermite polynomials are given by . The univariate Hermite polynomials form an orthogonal basis of L2(N(0, 1)). For , the m-variate Hermite polynomials are given by , where . The normalized version of Hw equals . The polynomials form an orthogonal basis of L2(Nm(0, Im)). We denote by the linear span of all n(p + q)-variate Hermite polynomials of degree at most Dn. Since is the projection of on , it then follows that
| (47) |
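The recursion and the normalization can be sanity-checked numerically. The sketch below uses the standard probabilists' three-term recursion h_{k+1}(x) = x h_k(x) − k h_{k−1}(x) with h_0 = 1 and h_1(x) = x, which we assume matches the recursion displayed above, and verifies the orthonormality of the normalized polynomials under N(0, 1) by Monte Carlo.

```python
import numpy as np
from math import factorial, sqrt

def hermite(k, x):
    """Probabilists' Hermite polynomial h_k(x) via the three-term recursion."""
    x = np.asarray(x, dtype=float)
    h_prev, h = np.ones_like(x), x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev
    return h

def normalized_hermite(k, x):
    """Normalized Hermite polynomial h_k / sqrt(k!)."""
    return hermite(k, x) / sqrt(factorial(k))

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)
print(np.mean(normalized_hermite(2, z) * normalized_hermite(2, z)))   # approximately 1
print(np.mean(normalized_hermite(2, z) * normalized_hermite(3, z)))   # approximately 0
```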
From now on, the degree-index vector w of or Hw will be assumed to lie in . We will partition w into n components, which gives w = (w1,…,wn), where for each i ∈ [n]. Clearly, i here corresponds to the i-th observation. We also separate each wi into two parts and so that . We will also denote , and . Note that and , but w ≠ (wx, wy) in general, although ∣w∣= ∣wx∣+∣wy∣.
Now we state the main lemmas which yield the value of . The first lemma, proved in Appendix H-C, gives the form of the inner products .
Lemma 5. Suppose w is as defined above and is as in (20). Then it holds that
Here the priors πx and πy are the Rademacher priors defined in (17).
Our next lemma uses Lemma 5 to give the form of . This lemma uses replicas of α and β. Suppose , and , are all independent Rademacher priors, where πx and πy are defined as in (17). We overload notation, and use to denote the expectation under , , , and .
Lemma 6. Suppose W is the indicator function of the event , . Then for any , equals
The proof of Lemma 6 is also deferred to Appendix H-C. We remark in passing that the negative binomial series expansion yields
| (48) |
whose Dn-th order truncation equals
Note that W is nonzero if and only if and , which, by Cauchy Schwarz inequality, implies
Thus when W = 1. Hence Lemma 6 can also be written as
Now we are ready to prove Theorem 3.
Proof of Theorem 3. Our first task is to get rid of W from the expression for in Lemma 6. However, we cannot directly bound W by one since the term may be negative for odd . We claim that if is odd. To see this, first we write
| (49) |
Note that , has the same distribution as
Notice from (17) that marginally, , and is independent of , and . Therefore,
Hence, conditional on and , is a symmetric random variable, and for any odd positive integer d. Since W is a binary random variable, Wd = W. Thus as well for an odd number , and the claim follows from (49). Therefore, from Lemma 6, it follows that
Observe that . Hence, . Also the summands in the last expression are non-negative. Therefore, using the fact that W ≤ 1, we obtain
| (50) |
Our next step is to simplify the above bound on . To that end, define the random variables for , and for . Denoting
we note that
Also, since and are symmetric, and vanishes for any . Then for any ,
by Fact 6. Since the odd moments of and vanish, the above equals
where we remind the readers that ∣D(z)∣ denotes the cardinality of the support of z for any vector z. The above implies
Plugging the above into (50) yields
where (a) follows since for a, . Let us denote and . By (21), , and
Therefore we have
Hence Lemma 4.5 of [36] implies that for any 11 ≤ d ≤ Dn,
For d ≤ 1, Theorem 5 of [68] gives
Also since , we have
leading to
Therefore is bounded by a constant multiple of
Since , it follows that . Note that the above sum converges if
or equivalently , which is satisfied for all since . Thus the proof follows. □
Appendix F. Proof of Theorem 4
We invoke the decomposition of in (37). But first, we will derive a simplified form for the matrices and in (37). Note that we can write as
Let us denote
| (51) |
Because is an orthogonal matrix, is a spectral decomposition, which leads to
Similarly, we can show that the matrix in (37) equals , where
Finally, the fact that and , in conjunction with (37), produces the following representation for :
Now recall the sets E, F, and G defined in (24) and (25) in Section III-D, and the decomposition of in (26). From (26) it follows that
Recall that for any matrix , and , we denote by the matrix . Then it is not hard to see that and , which leads to
| (52) |
Next, note that and . Therefore,
| (53) |
Finally, we note that S3 = (H1 + H2), where
| (54) |
and
Here the term S1 holds the information about . Its elements are not killed off by co-ordinate thresholding because it contains the Wishart matrix , which concentrates around Ir by the Bai-Yin law (cf. Theorem 4.7.1 of [42]). The only term that contributes to is S1. Lemma 7 entails that η(S1) concentrates around in operator norm. The proof of Lemma 7 is deferred to Appendix H-B.
Lemma 7. Suppose , . Then with probability 1 – o(1),
The entries of S2 and S3 are linear combinations of the entries of , and . Since Z, Z1, Z2 are independent, the entries of the latter matrices are of order O_p(n^{-1/2}), and, as we will see, they are killed off by the thresholding operator η. Our main work boils down to showing that thresholding kills off most terms of the noise matrices S2 and S3, making and small. To that end, we state some general lemmas, which are proved in Appendix H-B. That and are small follows as corollaries to these lemmas. Our next lemma provides a sharp concentration bound, which is our main tool in analyzing the difficult regime, i.e., the case.
Lemma 8. Suppose and are independent standard Gaussian data matrices. Let us also denote where and are fixed matrices so that and . Further suppose and . Let . Suppose K ≥ K0 is such that threshold level τ satisfies . Then there exists a constant C > 0 so that with probability 1 – o(1),
Our next lemma, which is also proved in Appendix H-B, handles the easier case when the threshold is exactly of the order . This thresholding, as we will see, is required in the easier sparsity regime, i.e., . Although Lemma 9 follows as a corollary to Lemma A.3 of [50], we include it here for the sake of completeness.
Lemma 9. Suppose Z1, Z2, M, N, and QM,N are as in Lemma 8, and . Further suppose , where C > 0 is an absolute constant. Let . Here the tuning parameter where C > 0 is a sufficiently large constant. Then with probability tending to one.
We will need another technical lemma for handling the terms S2 and S3.
Lemma 10. Suppose and . Then the following hold:
.
Note that satisfies
However, . Also because , it follows that . Moreover, since is orthogonal, . Part B of Lemma 10 then yields . Therefore
| (55) |
Similarly we can show that the matrix satisfies . Because by (53), that is small follows immediately from Lemma 8. Under the conditions of Lemma 8, we have
| (56) |
with high probability provided and . On the other hand, under the setup of Lemma 9, . Lemma 11, which we prove in Appendix H-B, entails that the same holds for S3.
Lemma 11. Consider the setup of Lemma 8. Suppose is such that . Then there exists a constant C > 0 so that with probability tending to one,
Under the setup of Lemma 9, on the other hand, with probability tending to one.
We will now combine all the above lemmas and finish the proof. First, we consider the regime when , so that there is thresholding, i.e., . We split this regime into two subregimes: and .
1). Regime
First, we explain why we needed to split the regime into two parts. Since sx, , Lemma 7 applies. Note that if , with , then Lemma 11 and (56) also apply. Therefore it follows that in this case
| (57) |
We will shortly show that under , setting ensures that the bound in (57) is small. However, for (57) to hold, needs to satisfy
which holds with if and only if
Since the above holds when
Therefore, setting is useful when we are in the regime . We will analyze the regime using a separate procedure.
In the case,
because , and similarly,
since we also assume . The above bounds entail that, in this regime, the first term on the bound in (57) is the leading term provided , i.e.,
with probability . Plugging in the value of Thr leads to
in the regime . In our case, . Also since by definition of , we also have , indicating
2). Regime
When , of course, the above line of argument may not work, although this is indeed an easier regime because is less than . In this regime, we set , where is a constant depending on as in Lemma 9. For this τ, we have shown that with probability tending to one. Lemma 11 implies the same holds for as well. Thus, from the decomposition of in (26), it follows that the asymptotic error occurs only due to the estimation of by . Using Lemma 7, we thus obtain
On the other hand, since , rearranging terms, we have
Thus, in the regime , we have
| (58) |
3). Regime :
It remains to analyze the case when either . In that case, there is no thresholding, i.e., Thr = 0. We will show that the assertions of Theorem 4 hold in this case as well. To that end, note that (26) implies
From the proof of Lemma 7 it follows that . For S2, we have shown that it is of the form where . On the other hand, we showed that , where the proof of Lemma 11 shows H1 and H2 are of the form M AN where and A is either (for H1) or (for H2). Therefore, it is not hard to see that is bounded by
For standard Gaussian matrices and , it holds that with probability (cf. Theorem 4.7.1 of [42]). Since , it follows that with probability . The above discussion leads to
because . If , the above bound is of the order . Thus Theorem 4 follows.
Appendix G. Proof of Corollary 2
Proof of Corollary 2. We will first show that there exist so that
| (59) |
For the sake of simplicity, we denote the matrix in Algorithm 2 by . Denoting , we note that
Also the matrix defined in Algorithm 2 and are the matrices corresponding to the leading r singular vectors of and , respectively. By Wedin’s sin-theta theorem (we use Theorem 4 of [52]), for any ,
where Λ0 is taken to be ∞, and
Since , and . Therefore, for , we have
We have to show . Theorem 4 gives a bound on , which can be made smaller than one if the in (23) is chosen to be sufficiently large. Hence, the above inequality holds. Because , using the fact , the last display implies
which, combined with Theorem 4, proves (59). Now note that the constant in (23) can be chosen large enough that the right hand side of (59) is smaller than . Since , it follows that Condition 1 is satisfied, and the rest of the proof then follows from Theorem 1.
Appendix H. Proof of Auxiliary Lemmas
A. Proof of Technical Lemmas for Theorem 2
The following lemma can be verified using elementary linear algebra, and hence its proof is omitted.
Lemma 12. Suppose Σ is of the form (41). Then the spectral decomposition of Σ is as follows:
where the eigenvectors are of the following form:
For , where forms an orthonormal basis system of the orthogonal space of α.
For , where forms an orthonormal basis system of the orthogonal space of β.
and .
Here, for , 0k denotes the k-dimensional vector all of whose entries are zero.
Lemma 13. Suppose Σ is as in (41). Then and
Proof of Lemma 13. Follows directly from Lemma 12.
Lemma 14. Suppose Σ1 and Σ2 are of the form (41) with singular vectors α1, β1, α2, and β2, respectively. Then
Proof. Lemma 13 can be used to obtain the form of , which implies equals
where . Since equals the sum of the two p × p and q × q diagonal submatrices, we obtain that
where we used the linearity of the trace operator, as well as the fact that . Noticing , the result follows.
B. Proof of Key Lemmas for Theorem 4
1). Proof of Lemma 7:
Proof of Lemma 7. Note that
We deal with the term T1 first. Recall from (26) that is a sparse matrix. In particular, each row and column of S1 can have at most sy and sx many nonzero elements, respectively. Now we make use of two elementary facts. First, for , , and second, for any matrix ,
The above results, combined with the row and column sparsity of S1, lead to
which is the first term in the bound of .
Now for T2, noting , observe that
It is easy to see that
Since , and are bounded in operator norm by . Also, and are orthonormal matrices. Therefore the operator norms of the matrices , , and Λ are bounded by one. On the other hand, by the Bai-Yin law on the eigenvalues of Wishart matrices (cf. Theorem 4.7.1 of [42]), with high probability. Since , clearly . Thus with high probability. Hence it suffices to show that the terms S12, S13, and S14 are small in operator norm, for which we will make use of Lemma 4. First let us consider the case of S12. Clearly,
We already mentioned that , and and are bounded by one. Therefore, it follows that
Now we apply Lemma 4 on the term with , and . Note that , , and are full rank matrices, i.e., they have rank q. Therefore, the rank of B equals rank of . Note that the rows of the matrix are linearly independent because the square matrix has full rank. Therefore, the rank of is , which is . Hence, the rank of is also . Also note that . Therefore Lemma 4 can be applied with and . Also, trivially follows. Using the same arguments which led to (55), on the other hand, we can show that by (26). Therefore Lemma 4 implies that for any t > 0, the following holds with probability at least :
which implies with high probability. Exchanging the role of X and Y in the above arguments, we can show that with high probability. For S14, we note that
We intend to apply Lemma 4 with and . Arguing along the lines of the proof for the term S12, we can show that A and B have rank and , respectively. Without loss of generality we assume , which yields , as required by Lemma 4. Otherwise, we can just take the transpose of C14, which leads to and , implying . Using (55), as before, we can show that the operator norms of A and B are bounded by . Therefore, Lemma 4 implies that for all ,
with probability at least . Hence, it follows that with probability ,
2). Proof of Lemma 8:
Without loss of generality, we will assume that . We will also assume, without loss of generality, that and . If that is not the case, we can add some zero rows to M and zero columns to N, which does not change their operator norms but ensures and . For any , let denote the unit sphere in . We denote an ϵ-net (with respect to the Euclidean norm) of any set by . When , there exists an ϵ-net of so that
By , we denote such an ϵ-net. Although may not be unique, uniqueness is not necessary for our purpose. For a subset , will denote an ϵ-net of the set . Note that each element of the latter set has at most many degrees of freedom, from which one can show that . The following fact on ϵ-nets will be very useful for us. The proof is standard and can be found, for example, in [42].
Fact 5. Let for p, . Then there exist and such that .
Letting , and using Fact 5, we obtain that
for any . Proceeding as in Proposition 15 of [1], we fix , and introduce the sets
| (60) |
and their complements and . The precise value of Jp,q will be chosen later. For any subset , , and vector , we denote by the projection of x onto A, which means and if , and zero otherwise. Let us denote the projections of x and y on , , , and , by , , , and , respectively. Note that this implies
as well as
The sets and have fewer elements than their complements. Therefore, we will treat these sets separately. To that end, we consider the splitting
| (61) |
The term T1 can be bounded by Lemma 15.
Lemma 15. Suppose M and N are as in Lemma 8 and where . Then for any , there exist absolute constants C, such that
We state another lemma which helps in controlling the terms T2 and T3.
Lemma 16. Suppose M, N, Z1, Z2, and An are as in Lemma 8. Let . Suppose is such that and moreover, . Let be either the set or the set . Then there exist absolute constants C, such that the following holds for any ;
Note that when , Lemma 16 yields a bound on T3. On the other hand, the case yields a bound on the term
| (62) |
While is not exactly equal to T2, interchanging the role of x and y in gives T2. Since the upper bound on given by Lemma 16 is symmetric in p and q, it is not hard to see that the same bound works for T2.
If we let , then . Combining the bounds on T1, T2, and T3, we conclude that the right hand side of (61) is if is larger than some constant multiple of
where . We will show that the first term dominates the second term. By our assumption on τ, , which implies , which combined with the fact , yields . On the other hand, under p > q, our assumption on n implies . Also because , it follows that is small, in particular
Therefore, for to be small,
suffices. In particular, we choose . Note that because , this choice of Jp,q ensures that , as required. The proof follows noting this choice of Jp,q also implies
3). Proof of Lemma 9:
Proof of Lemma 9. For any and ,
and are independent. In this case, there exist absolute constants δ, c and , so that (cf. Lemma A.3 of [50])
for all . Since , and , using union bound we obtain
Letting and , we observe that for our choice of τ, for all sufficiently large n since . Therefore, the above inequality leads to
Because and by our assumption on M and N, suffices. Hence the proof follows.
4). Proof of Lemma 11:
Proof of Lemma 11. From the definition of S3 in (26), and (54), it is not hard to see that . We will show that H1 is of the form where and . Then the first part would follow from Lemma 8, which, when applied to this case, would imply
provided , and . Since , the upper bound on Thr becomes . The proof for follows in a similar way and is hence skipped.
Letting
we note that (54) implies , which can be written as
We will now invoke Lemma 8 because is a Gaussian data matrix with n rows and columns, and the matrices A4 and A3 are also bounded in operator norm. To see the latter, first, noting , and we observe that
Therefore it suffices to bound the operator norms of A1, A2, and A3 only. Using (55), we can show that the operator norm of the matrices of the form A2, or A3 is bounded by for . Since has orthogonal columns, it can be easily seen that . Therefore
because as per the definition of . The proof of the first part now follows by Lemma 8. Because and , the proof of the second part follows directly from Lemma 9 and is hence skipped.
C. Proof of Additional Lemmas for Section III-C and Theorem 3
Proof of Lemma 2. To prove the current lemma, we will require a result on the concentration of and under and . To that end, for s, satisfying , let us define the set
Suppose and are the Rademacher priors on and as defined in Section III-C. The following lemma then says that and concentrate on and with probability tending to one.
Lemma 17. Suppose . Then
| (63) |
Here the probability depends on n through and q. Similarly depends on n through and q.
Recall the definition of from (19). Let us consider the class
If and , then because . Therefore (19) implies that has canonical correlation . Thus , implying
Suppose and are the Borel σ-field associated with and , respectively. Define the probability measures and on and , respectively, by
and
Note also that if and , then . Therefore
whose denominator is one by Lemma 17. Denoting , we note that
by Lemma 17. Similarly, denoting , we can show that
Therefore, it holds that
Thus the proof follows.
Proof of Lemma 17:
Proof of Lemma 17. We are going to show (63) only for because the proof for follows in an identical manner. Throughout, we will denote by and the expectation and variance under . Note that when , , where the 's are i.i.d. Bernoulli random variables with success probability . Therefore, Chebyshev's inequality yields that for any ,
which goes to zero if . Therefore, for , we have
Also, since , Chebyshev’s inequality implies that
which goes to zero if for any fixed . Here (a) uses the fact that ’s are i.i.d. The proof now follows setting .
1). Proof of Lemma 5:
The proof of Lemma 5 depends on two auxiliary lemmas. We state and prove these lemmas first.
Lemma 18. Suppose , and is a matrix. Let be the measure induced by the m-dimensional standard Gaussian random vector and denote by the corresponding expectation. Then for any we have
Proof of Lemma 18. The generating function of Hw has the convergent expansion [69, Proposition 6]
for any . Therefore,
Multiplying both sides by the density of and then integrating over gives us
Lemma 19. Let be as defined in (18). Suppose where and . Then for any , we have
Proof of Lemma 19. Let us partition t as where and . We then calculate
which implies
which equals
In step (a), we stacked the variables and to form . Note that, following the terminology set at the beginning of Appendix E, and . Note that if , then the term has zero coefficient in the above expansion. Thus the lemma follows.
Proof of Lemma 5.
where (a) follows because ’s are independent observations. Now note that if , then (19) implies
where the last step follows because for any . If , then defined in (18) is positive definite, and (19) implies
by Lemma 18. Here is as in (18), and is positive definite because , as discussed in Section III-C. Therefore, we can write
Lemma 19 gives the form of the partial derivative in the above expression, and implies that the partial derivative is zero unless . Therefore, only if for all . In this case, is even, and by Lemma 19,
Therefore,
2). Proof of Lemma 6:
Proof. Lemma 5 implies that belongs to the subspace generated by those ’s whose degree-index ω has for all . The degree of the polynomial is , which is even in the above case. Therefore, if is odd, equals . Hence, it suffices to compute the norm of , where . Suppose is such that for all . Lemma 5 gives
Consider the pair of replicas and . Letting W denote the indicator function of the event , we can then write
| (64) |
Denote by . Using (64), we obtain the following expression:
where
Therefore equals
In the last step, we used the variables , and . Suppose for each . For any and , it holds that
where (a) follows from Fact 6.
Fact 6. [Multinomial Theorem] Suppose . Then for ,
Therefore it follows that
which implies
where (a) follows since the number of such that equals . Noting , the proof follows.
D. Proof of Technical Lemmas for Theorem 4
First, we introduce some additional notations and state some useful results that will be used repeatedly throughout the proof. Suppose . We can write A as
We define the vectorization operator as
We will use two well-known identities for the vectorization operator, which follow from Section 10.2.2 of [70].
Fact 7. A. .
B. where denotes the Kronecker product.
Often times we will also use the fact that [71, Theorem 13.12]
| (65) |
Define the Hadamard product between vectors and by
Note that Cauchy-Schwarz inequality implies that
| (66) |
We will also often use Fact 1, which states .
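The vectorization identity in Fact 7B is easy to verify numerically. The vec convention used above is not reproduced, so the check below assumes column-stacking, under which vec(AXB) = (Bᵀ ⊗ A) vec(X).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.flatten(order="F")       # column-stacking vectorization
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)             # (B^T kron A) vec(X)
print(np.allclose(lhs, rhs))               # True
```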
1). Proof of Lemma 10:
Proof. The first result is immediate. For the second result, denote by the projection of y on . Note that for any and .
Thus the maximum singular value of D(A) is smaller than that of A, indicating that
2). Proof of Lemma 4:
First, we state and prove two facts, which are used in the proof of Lemma 4.
Fact 8. Suppose , are potentially random matrices satisfying and . Let be such that , and is distributed as a standard Gaussian data matrix. Then the matrix is distributed as a standard Gaussian data matrix.
Proof of Fact 8. is a Gaussian data matrix with covariance if and only if
| (67) |
Now
where (a) follows from Fact 7B. However, since , (67) implies
but
Therefore,
Then the result follows from (67).
In the above fact, it may appear that is independent of matrices A and B since its conditional distribution is standard Gaussian. However, still depends on A and B through r and s, which may be random quantities.
Fact 9. Suppose , , are such that conditional on A and B, X is distributed as a standard Gaussian data matrix. Further suppose that the rank of A and B are a and b, respectively. Then the following assertion holds:
where is distributed as a standard Gaussian data matrix in .
Proof of Fact 9. Suppose and are the projection matrices onto the column spaces of A and B, respectively. Then we can write and , where and are matrices with full column rank so that and . Writing and , we obtain that
which is bounded by
That and are one follows from the definitions of and . Fact 8 implies conditional on and , is distributed as a standard Gaussian data matrix. Hence, the proof follows.
Proof of Lemma 4. Let us denote the rank of by . Note that . Letting , and applying Fact 9, we have the bound
where is distributed as a standard Gaussian data matrix in . Next we apply Fact 9 again, but now on the term , which leads to
where is a standard Gaussian data matrix. Therefore,
We use the Gaussian matrix concentration inequality in Fact 2 to show that with probability at least . Also, for , the first part of Fact 2 implies
for any . Since , and is deterministic, the above implies
Hence, for any , we have the following with probability at least :
Since , it follows that
Therefore, the proof follows.
3). Proof of Lemma 15:
Proof of Lemma 15. Denoting
we note that
Therefore it suffices to show that there exist absolute constants C, such that
Let us denote , , and . Thus
Recalling , we define
| (68) |
To obtain a tight concentration inequality for , we want to use the following Gaussian concentration lemma due to [1]
Lemma 20 (Corollary 10 of [1]). Let be a vector of n i.i.d. standard Gaussian variables. Suppose is a finite set and we have functions for every . Assume is a Borel set such that for Lebesgue-almost every :
Then, there exists an absolute constant so that for any ,
Here is an independent copy of .
In our case, the index b corresponds to (x, y), the set corresponds to , and the function corresponds to . To find the centering and the Lipschitz constant , we need to compute and , respectively.
First, note that since Z1 and Z2 are independent standard Gaussian data matrices, . Noting for any symmetric random variable X, we deduce
Using Lemma 21 we obtain that
and
where
Because for each ,
since . Also, because equals , we have
| (69) |
and similarly,
| (70) |
Therefore,
Letting denote , we note that the above two inequalities imply
Because , , we have
| (71) |
We choose a good set where the above bound is small. To that end, we take to be
| (72) |
Let us denote and . To apply Lemma 21, now we define the process
Equation 71 implies that on ,
We are now in a position to apply Lemma 21, which yields that
| (73) |
From equation (79) of [1] it follows that C can be chosen large enough that
Thus, after plugging in the value of , the first term on the right hand side of (73) can be bounded above by
To bound the second term in (73), notice that Lemma 22 yields the bound
whereas Fact 2 leads to the bound
| (74) |
Therefore the proof follows.
Lemma 21. Suppose is as defined in (68) and . Then
where , , and .
Proof. Using , and the fact that , we calculate that
Fact 7 implies
| (75) |
which yields . Noting , we can hence write as
Let us denote by the derivative of evaluated at . For , we denote by the matrix whose (i, j)-th entry equals . Then we obtain that for ,
indicating that
where denotes the Hadamard product. It follows that
Then the first part of the proof follows from (75). The proof of the second part follows similarly and is hence skipped.
Writing , we have
which equals
Fact 7 implies that the above equals
where . Thus, similarly we can show that
Therefore, the proof follows.
Lemma 22. There exists an absolute constant C so that the function defined in (68) satisfies
Proof. As usual, we let . Since , we have
Here (a) follows because the operator norm is smaller than the Frobenius norm, (b) follows because , and (c) follows from Fact 1. Since Z1 and Z2 are independent,
Now note that since Z1 and Z2 are standard Gaussian data matrices,
for some absolute constants and . We can choose C large enough that . Similarly, we can show that
implying
for sufficiently large C.
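Step (a) in the proof of Lemma 22 uses the elementary comparison between the operator and Frobenius norms: for any real matrix $A$ of rank $r$,
\[
\|A\|_{\mathrm{op}} \;\le\; \|A\|_{F} \;\le\; \sqrt{r}\, \|A\|_{\mathrm{op}}.
\]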
4). Proof of Lemma 16:
Proof. The framework is the same as in the proof of Lemma 15. Define where
Let , , , and be as in Lemma 15. In this case, the main difference from Lemma 15 is that is much larger. Eventually we will arrive at (73) using the concentration inequality in Lemma 20, but a large makes the right hand side of the inequality in (73) much larger. Therefore, we require a tighter bound on , which is the bound on the Lipschitz constant of on the good set, so that the concentration inequality in (73) remains useful. To bound the Lipschitz constant, as before, we bound using Lemma 21, which implies that
where and . From (69) it follows that
| (76) |
In Lemma 15, we bounded by , which was later bounded by 1. We require a tighter bound on this time. Note that at all for any directional derivative of η. Noting for , we deduce that any and satisfy
which is not greater than
because for .
Thus, it follows that
Similarly, we can show that
Thus,
We want to define the good set of such that
satisfies both and
We claim that the above holds if defined in (72), and for all ,
| (77) |
The above claim follows from (89) and (90) of [1]. Therefore we define the good set to be the subset of where (77) is satisfied. Defining and , we obtain that for some absolute constant , it holds that
provided . Similar to the proof of Lemma 15, using Lemma 21, we obtain that there exists an absolute constant C so that
| (78) |
Now since , and for any , the ϵ-net is chosen so as to satisfy , we have . Therefore, we conclude that the first term of the bound in (78) is not larger than
The rest of the proof is devoted to bounding the second term of the bound in (78). The expectation term can be bounded easily using Lemma 22, which yields
We will now show that is small. Note that by definition, , where is the set of that satisfies the system of equations in (77). Notice that by (74), we already have for some . Thus it suffices to show that is small. To this end, note that since , , , are independent, (77) implies
Defining the set , we bound the above probability as follows:
| (79) |
Now note that , or . Therefore, there exists a universal constant so that
| (80) |
where the last bound is due to the Chi-square tail bound in Fact 4 (see also Lemma 1 of [72] and Lemma 12 of [1]). Therefore, it only remains to bound the first term in (79). We begin with an expansion of as follows
Since and are independent, conditioned on is still a standard Gaussian data matrix. Hence, for , conditional on , ’s are independent random variables. As a result, for each and , can be written as , where , and . Noting for every , we derive the following bound provided :
Defining
| (81) |
we notice that the above calculations imply that, conditional on ,
Therefore,
| (82) |
which is bounded by by Lemma 23. Therefore, (79), (80), and (82) jointly imply that
Therefore is bounded by
which completes the proof.
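The chi-square tail bound invoked in (80) via Fact 4 is of the type given by Lemma 1 of [72]; in its standard form (the constants used in Fact 4 may differ), it states that for $X \sim \chi^2_k$ and any $x > 0$,
\[
\mathbb{P}\big( X \ge k + 2\sqrt{kx} + 2x \big) \;\le\; e^{-x}
\qquad \text{and} \qquad
\mathbb{P}\big( X \le k - 2\sqrt{kx} \big) \;\le\; e^{-x}.
\]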
Lemma 23. Suppose and . Further suppose are independent standard Gaussian random variables. Then the function f defined in (81) satisfies
Proof of Lemma 23. Note that is a sum of dependent Bernoulli random variables. Therefore, the traditional Chernoff or Hoeffding bounds for independent Bernoulli random variables do not apply. We use a generalized version of Chernoff's inequality, originally due to [65] (also discussed by [73], [74], among others), which applies to weakly dependent Bernoulli random variables.
Lemma 24 ([65]). Let be Bernoulli random variables and . Suppose there exists such that for any , the following assertion holds:
| (83) |
For , we denote
Then we have
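For orientation, Lemma 24 generalizes, to weakly dependent indicators, the classical Chernoff–Hoeffding bound for independent Bernoulli random variables (see, e.g., [74]), which in its relative-entropy form states that for i.i.d. $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$ and any $a \in (p, 1)$,
\[
\mathbb{P}\Big( \tfrac{1}{n} \sum_{i=1}^{n} X_i \ge a \Big) \;\le\; \exp\big( -n\, \mathrm{KL}(a \,\|\, p) \big),
\qquad
\mathrm{KL}(a \,\|\, p) = a \log\tfrac{a}{p} + (1 - a) \log\tfrac{1 - a}{1 - p}.
\]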
Note that if we take and , then Lemma 24 can be applied to bound provided (83) holds, which will be referred to as the weak dependence condition from now on. Suppose . For the sake of simplicity, we take . The arguments that follow hold for any other choice of as well, as long as . Denote by the submatrix of M containing only the first k rows of M. Let us denote . Letting , we observe that for our choice of ’s, equals
The operator norm equals , which is bounded by by Lemma 10B. Therefore, the right hand side of the last display is bounded by . By Chi-square tail bounds (see for instance Fact 4), the latter probability is bounded above by for all . Since , note that suffices. For such τ, we have thus shown that
Thus our , which is less than because . Hence our (δ, ϵ) pair satisfies the weak dependence condition, and by Lemma 24 it follows that
We now use the lower bound on . Because , it follows that , which is greater than , or equivalently . Therefore, the lemma follows.
Footnotes
In this paper, by support recovery, we refer to the exact recovery of the combined support of the ’s (or the ’s) corresponding to nonzero ’s.
e.g., , , and will stand for the empirical estimators created from the jth equal split of the data.
here and later, we will use s to generically denote the sparsity of relevant parameter vectors in parallel problems like sparse PCA or sparse linear regression.
Contributor Information
Nilanjana Laha, Department of Statistics, Texas A&M University, College Station, TX 77843.
Rajarshi Mukherjee, Department of Biostatistics, Harvard T. H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115.
References
- [1].Deshpande Y and Montanari A, “Sparse pca via covariance thresholding,” Journal of Machine Learning Research, vol. 17, no. 141, pp. 1–14, 2016. [Google Scholar]
- [2].Kunisky D, Wein AS, and Bandeira AS, “Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio,” in Mathematical Analysis, its Applications and Computation, 2022, pp. 1–50, part of the Springer Proceedings in Mathematics and Statistics book series. [Google Scholar]
- [3].Hardoon DR, Szedmak S, and Shawe-Taylor J, “Canonical correlation analysis: An overview with application to learning methods,” Neural computation, vol. 16, no. 12, pp. 2639–2664, 2004. [DOI] [PubMed] [Google Scholar]
- [4].Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet GR, Levy R, and Vasconcelos N, “A new approach to cross-modal multimedia retrieval,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 251–260. [Google Scholar]
- [5].Gong Y, Ke Q, Isard M, and Lazebnik S, “A multi-view embedding space for modeling internet images, tags, and their semantics,” International journal of computer vision, vol. 106, no. 2, pp. 210–233, 2014. [Google Scholar]
- [6].Bin G, Gao X, Yan Z, Hong B, and Gao S, “An online multi-channel ssvep-based brain–computer interface using a canonical correlation analysis method,” Journal of neural engineering, vol. 6, no. 4, p. 046002, 2009. [DOI] [PubMed] [Google Scholar]
- [7].Avants BB, Cook PA, Ungar L, Gee JC, and Grossman M, “Dementia induces correlated reductions in white matter integrity and cortical thickness: a multivariate neuroimaging study with sparse canonical correlation analysis,” Neuroimage, vol. 50, no. 3, pp. 1004–1016, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Witten DM, Tibshirani R, and Hastie T, “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis,” Biostatistics, vol. 10, no. 3, pp. 515–534, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Bagozzi RP, “Measurement and meaning in information systems and organizational research: Methodological and philosophical foundations,” Management Information Systems quarterly, vol. 35, no. 2, pp. 261–292, 2011. [Google Scholar]
- [10].Dhillon P, Foster DP, and Ungar L, “Multi-view learning of word embeddings via CCA,” in Proceedings of the 24th International Conference on Neural Information Processing Systems, vol. 24, 2011, pp. 199–207. [Google Scholar]
- [11].Faruqui M and Dyer C, “Improving vector space word representations using multilingual correlation,” in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014, pp. 462–471. [Google Scholar]
- [12].Friman O, Borga M, Lundberg P, and Knutsson H, “Adaptive analysis of fmri data,” NeuroImage, vol. 19, no. 3, pp. 837–845, 2003. [DOI] [PubMed] [Google Scholar]
- [13].Kim T-K, Wong S-F, and Cipolla R, “Tensor canonical correlation analysis for action classification,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8. [Google Scholar]
- [14].Arora R and Livescu K, “Multi-view cca-based acoustic features for phonetic recognition across speakers and domains,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7135–7139. [Google Scholar]
- [15].Wang W, Arora R, Livescu K, and Bilmes J, “On deep multi-view representation learning,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning, vol. 37, 2015, pp. 1083–1092. [Google Scholar]
- [16].Anderson TW, An Introduction to Multivariate Statistical Analysis, ser. Wiley Series in Probability and Statistics. Wiley, 2003. [Google Scholar]
- [17].Lê Cao K-A, Martin PG, Robert-Granié C, and Besse P, “Sparse canonical methods for biological data integration: application to a cross-platform study,” BMC bioinformatics, vol. 10, no. 1, pp. 1–17, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [18].Lee W, Lee D, Lee Y, and Pawitan Y, “Sparse canonical covariance analysis for high-throughput data,” Statistical Applications in Genetics and Molecular Biology, vol. 10, no. 1, pp. 1–24, 2011. [Google Scholar]
- [19].Waaijenborg S, de Witt Hamer PCV, and Zwinderman AH, “Quantifying the association between gene expressions and dna-markers by penalized canonical correlation analysis,” Statistical applications in genetics and molecular biology, vol. 7, no. 1, 2008. [DOI] [PubMed] [Google Scholar]
- [20].Anderson TW, “Asymptotic theory for canonical correlation analysis,” Journal of Multivariate Analysis, vol. 70, no. 1, pp. 1–29, 1999. [Google Scholar]
- [21].Cai TT and Zhang A, “Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics,” The Annals of Statistics, vol. 46, no. 1, pp. 60 – 89, 2018. [Google Scholar]
- [22].Ma Z, Li X et al. , “Subspace perspective on canonical correlation analysis: Dimension reduction and minimax rates,” Bernoulli, vol. 26, no. 1, pp. 432–470, 2020. [Google Scholar]
- [23].Bao Z, Hu J, Pan G, and Zhou W, “Canonical correlation coefficients of high-dimensional gaussian vectors: Finite rank case,” The Annals of Statistics, vol. 47, no. 1, pp. 612–640, February 2019. [Google Scholar]
- [24].Mai Q and Zhang X, “An iterative penalized least squares approach to sparse canonical correlation analysis,” Biometrics, vol. 75, no. 3, pp. 734–744, 2019. [DOI] [PubMed] [Google Scholar]
- [25].Solari OS, Brown JB, and Bickel PJ, “Sparse canonical correlation analysis via concave minimization,” arXiv preprint arXiv:1909.07947, 2019. [Google Scholar]
- [26].Chen M, Gao C, Ren Z, and Zhou HH, “Sparse cca via precision adjusted iterative thresholding,” arXiv preprint arXiv:1311.6186, 2013. [Google Scholar]
- [27].Gao C, Ma Z, Ren Z, Zhou HH et al. , “Minimax estimation in sparse canonical correlation analysis,” The Annals of Statistics, vol. 43, no. 5, pp. 2168–2197, 2015. [Google Scholar]
- [28].Gao C, Ma Z, Zhou HH et al. , “Sparse cca: Adaptive estimation and computational barriers,” The Annals of Statistics, vol. 45, no. 5, pp. 2074–2101, 2017. [Google Scholar]
- [29].Wainwright MJ, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting,” IEEE Transactions on Information Theory, vol. 55, no. 12, pp. 5728–5741, 2009. [Google Scholar]
- [30].Amini AA and Wainwright MJ, “High-dimensional analysis of semidefinite relaxations for sparse principal components,” The Annals of Statistics, vol. 37, no. 5B, pp. 2877 – 2921, 2009. [Google Scholar]
- [31].Butucea C, Ingster YI, and Suslina IA, “Sharp variable selection of a sparse submatrix in a high-dimensional noisy matrix,” ESAIM: Probability and Statistics, vol. 19, pp. 115–134, 2015. [Google Scholar]
- [32].Butucea C and Stepanova N, “Adaptive variable selection in nonparametric sparse additive models,” Electronic Journal of Statistics, vol. 11, no. 1, pp. 2321–2357, 2017. [Google Scholar]
- [33].Meinshausen N and Bühlmann P, “Stability selection,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 4, pp. 417–473, 2010. [Google Scholar]
- [34].Johnstone IM and Lu AY, “On consistency and sparsity for principal components analysis in high dimensions,” Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [35].Krauthgamer R, Nadler B, Vilenchik D et al. , “Do semidefinite relaxations solve sparse pca up to the information limit?” The Annals of Statistics, vol. 43, no. 3, pp. 1300–1322, 2015. [Google Scholar]
- [36].Ding Y, Kunisky D, Wein AS, and Bandeira AS, “Subexponential-time algorithms for sparse pca,” arXiv preprint arXiv:1907.11635, 2019. [Google Scholar]
- [37].Arous GB, Wein AS, and Zadik I, “Free energy wells and overlap gap property in sparse pca,” in Proceedings of the 33rd Annual Conference on Learning Theory, vol. 125, 2020, pp. 479–482. [Google Scholar]
- [38].Laha N and Mukherjee R, “Support.CCA,” https://github.com/nilanjanalaha/Support.CCA, 2021. [Google Scholar]
- [39].Hopkins SB and Steurer D, “Efficient bayesian estimation from few samples: community detection and related problems,” in IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 2017, pp. 379–390. [Google Scholar]
- [40].Hopkins S, “Statistical inference and the sum of squares method,” Ph.D. dissertation, Cornell University, 2018. [Google Scholar]
- [41].Ma Z and Yang F, “Sample canonical correlation coefficients of high-dimensional random vectors with finite rank correlations,” arXiv preprint arXiv:2102.03297, 2021. [Google Scholar]
- [42].Vershynin R, High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018. [Google Scholar]
- [43].Laha N, Huey N, Coull B, and Mukherjee R, “On statistical inference with high dimensional sparse cca,” arXiv preprint arXiv:2109.11997, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [44].Johnstone IM, “On the distribution of the largest eigenvalue in principal components analysis,” The Annals of statistics, vol. 29, no. 2, pp. 295–327, 2001. [Google Scholar]
- [45].Janková J and van de Geer S, “De-biased sparse pca: Inference and testing for eigenstructure of large covariance matrices,” IEEE Transactions on Information Theory, vol. 67, no. 4, pp. 2507–2527, 2021. [Google Scholar]
- [46].Meloun M and Militky J, Statistical data analysis: A practical guide. Woodhead Publishing Limited, 2011. [Google Scholar]
- [47].Dutta D, Sen A, and Satagopan J, “Sparse canonical correlation to identify breast cancer related genes regulated by copy number aberrations,” medRxiv, 2022. [Online]. Available: https://www.medrxiv.org/content/early/2022/05/09/2021.08.29.21262811 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [48].Lock EF, Hoadley KA, Marron JS, and Nobel AB, “Joint and individual variation explained (jive) for integrated analysis of multiple data types,” The annals of applied statistics, vol. 7, no. 1, p. 523, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [49].van de Geer S, Bühlmann P, Ritov Y, and Dezeure R, “On asymptotically optimal confidence regions and tests for high-dimensional models,” The Annals of Statistics, vol. 42, no. 3, pp. 1166–1202, 2014. [Google Scholar]
- [50].Bickel PJ and Levina E, “Covariance regularization by thresholding,” The Annals of Statistics, vol. 36, no. 6, pp. 2577–2604, 2008. [Google Scholar]
- [51].Cai TT, Liu W, and Luo X, “A constrained minimization approach to sparse precision matrix estimation,” Journal of the American Statistical Association, vol. 106, no. 494, pp. 594–607, 2011. [Google Scholar]
- [52].Yu Y, Wang T, and Samworth RJ, “A useful variant of the davis–kahan theorem for statisticians,” Biometrika, vol. 102, no. 2, pp. 315–323, 2015. [Google Scholar]
- [53].Yatracos YG, “A lower bound on the error in nonparametric regression type problems,” The Annals of Statistics, vol. 16, no. 3, pp. 1180–1187, 1988. [Google Scholar]
- [54].Berthet Q and Rigollet P, “Complexity theoretic lower bounds for sparse principal component detection,” in Proceedings of the 26th Annual Conference on Learning Theory, vol. 30, 2013, pp. 1046–1066. [Google Scholar]
- [55].Brennan M, Bresler G, and Huleihel W, “Reducibility and computational lower bounds for problems with planted sparse structure,” in Proceedings of the 31st Conference On Learning Theory, vol. 75, 2018, pp. 48–166. [Google Scholar]
- [56].Kearns M, “Efficient noise-tolerant learning from statistical queries,” Journal of the ACM (JACM), vol. 45, no. 6, pp. 983–1006, 1998. [Google Scholar]
- [57].Feldman V and Kanade V, “Computational bounds on statistical query learning,” in Proceedings of the 25th Annual Conference on Learning Theory, vol. 23, 2012, pp. 16.1–16.22. [Google Scholar]
- [58].Brennan MS, Bresler G, Hopkins S, Li J, and Schramm T, “Statistical query algorithms and low degree tests are almost equivalent,” in Proceedings of Thirty Fourth Conference on Learning Theory, vol. 134, 2021, pp. 774–774. [Google Scholar]
- [59].Dudeja R and Hsu D, “Statistical query lower bounds for tensor pca,” Journal of Machine Learning Research, vol. 22, no. 83, pp. 1–51, 2021. [Google Scholar]
- [60].Gamarnik D and Zadik I, “High dimensional regression with binary coefficients. Estimating squared error and a phase transition,” in Proceedings of the Conference on Learning Theory, vol. 65, 2017, pp. 948–953. [Google Scholar]
- [61].Gamarnik D, Jagannath A, and Sen S, “The overlap gap property in principal submatrix recovery,” Probability Theory and Related Fields, vol. 181, pp. 757–814, 2021. [Google Scholar]
- [62].Szegö G, Orthogonal polynomials, ser. American Mathematical Society colloquium publications. American Mathematical Society, 1939. [Google Scholar]
- [63].Cai TT, Zhou HH et al. , “Optimal rates of convergence for sparse covariance matrix estimation,” The Annals of Statistics, vol. 40, no. 5, pp. 2389–2420, 2012. [Google Scholar]
- [64].Bach FR and Jordan MI, “A probabilistic interpretation of canonical correlation analysis,” Tech. Rep, 2005. [Online]. Available: https://www.di.ens.fr/~fbach/probacca.pdf [Google Scholar]
- [65].Panconesi A and Srinivasan A, “Randomized distributed edge coloring via an extension of the chernoff–hoeffding bounds,” SIAM Journal on Computing, vol. 26, no. 2, pp. 350–368, 1997. [Google Scholar]
- [66].Wang S, Fan J, Pocock G, Arena ET, Eliceiri KW, and Yuan M, “Structured correlation detection with application to colocalization analysis in dual-channel fluorescence microscopic imaging,” Statistica Sinica, vol. 31, no. 1, pp. 333–360, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [67].Bai Z-D and Yin Y-Q, “Limit of the Smallest Eigenvalue of a Large Dimensional Sample Covariance Matrix,” The Annals of Probability, vol. 21, no. 3, pp. 1275 – 1294, 1993. [Google Scholar]
- [68].Sandór J and Debnath L, “On certain inequalities involving the constant e and their applications,” Journal of mathematical analysis and applications, vol. 249, no. 2, pp. 569–582, 2000. [Google Scholar]
- [69].Rahman S, “Wiener–hermite polynomial expansion for multivariate gaussian probability measures,” Journal of Mathematical Analysis and Applications, vol. 454, no. 1, pp. 303–334, 2017. [Google Scholar]
- [70].Petersen KB and Pedersen MS, “The matrix cookbook,” 2015, version: Nov 12, 2015. [Online]. Available: http://www2.compute.dtu.dk/pubdb/pubs/3274-full.html [Google Scholar]
- [71].Laub A, Matrix Analysis for Scientists and Engineers. Society for Industrial and Applied Mathematics, 2005. [Google Scholar]
- [72].Laurent B and Massart P, “Adaptive estimation of a quadratic functional by model selection,” The Annals of Statistics, vol. 28, no. 5, pp. 1302–1338, 2000. [Google Scholar]
- [73].Pelekis C and Ramon J, “Hoeffding’s inequality for sums of weakly dependent random variables,” Mediterranean Journal of Mathematics, vol. 14, no. 243, 2017. [Google Scholar]
- [74].Linial N and Luria Z, “Chernoff’s inequality-a very elementary proof,” arXiv preprint arXiv:1403.7739, 2014. [Google Scholar]