Published in final edited form as: IEEE Trans Inf Theory. 2022 Oct 17;69(3):1695–1738. doi: 10.1109/tit.2022.3214201

On Support Recovery with Sparse CCA: Information Theoretic and Computational Limits

Nilanjana Laha 1, Rajarshi Mukherjee 2

Abstract

In this paper, we consider asymptotically exact support recovery in the context of high dimensional and sparse Canonical Correlation Analysis (CCA). Our main results describe four regimes of interest based on information theoretic and computational considerations. In regimes of "low" sparsity we describe a simple, general, and computationally easy method for support recovery, whereas in a regime of "high" sparsity, support recovery turns out to be information theoretically impossible. In terms of information theoretic lower bounds, our results also demonstrate a non-trivial requirement on the "minimal" size of the nonzero elements of the canonical vectors for asymptotically consistent support recovery. Subsequently, the regime of "moderate" sparsity is further divided into two subregimes. In the lower of the two sparsity regimes, we show that polynomial time support recovery is possible through a sharp analysis of a co-ordinate thresholding [1] type method. In contrast, in the higher end of the moderate sparsity regime, appealing to the "Low Degree Polynomial" Conjecture [2], we provide evidence that polynomial time support recovery methods are inconsistent. Finally, we carry out numerical experiments to compare the efficacy of the various methods discussed.

Keywords: Canonical Correlation Analysis, Support Recovery, Low Degree Polynomials, Variable Selection, High Dimension

I. Introduction

Canonical Correlation Analysis (CCA) is a highly popular technique for performing initial dimension reduction while exploring relationships between two multivariate objects. Due to its natural interpretability and success in finding latent information, CCA has found enthusiasm across a vast canvas of disciplines, which include, but are not limited to, psychology and agriculture, information retrieval [3]-[5], brain-computer interfaces [6], neuroimaging [7], genomics [8], organizational research [9], natural language processing [10], [11], fMRI data analysis [12], computer vision [13], and speech recognition [14], [15].

Early developments in the theory and applications of CCA have now been well documented in the statistical literature, and we refer the interested reader to [16] and references therein for further details. However, the modern surge in interest for CCA, often being motivated by data from high throughput biological experiments [17]-[19], requires re-thinking several aspects of the traditional theory and methods. A natural structural constraint that has gained popularity in this regard is that of sparsity, i.e., the phenomenon of only a small (unknown) collection of variables being related to each other. In order to formally introduce the framework of sparse CCA, we present our statistical setup next. We shall consider $n$ i.i.d. samples $(X_i, Y_i) \sim P$ with $X_i \in \mathbb{R}^p$ and $Y_i \in \mathbb{R}^q$ being multivariate mean zero random variables with joint variance covariance matrix

$$\Sigma = \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix}. \tag{1}$$

The first canonical correlation Λ1 is then defined as the maximum possible correlation between two linear combinations of X and Y. This definition interprets Λ1 as the optimal value of the following maximization problem:

$$\underset{u \in \mathbb{R}^p,\; v \in \mathbb{R}^q}{\text{maximize}} \quad u^T \Sigma_{xy} v \qquad \text{subject to} \quad u^T \Sigma_x u = v^T \Sigma_y v = 1. \tag{2}$$

The solutions to (2) are the vectors that maximize the correlation of the projections of X and Y in those respective directions. Higher order canonical correlations can thereafter be defined in a recursive fashion (cf. [20]). In particular, for $j > 1$, we define the $j$th canonical correlation $\Lambda_j$ and the corresponding directions $u_j$ and $v_j$ by maximizing (2) with the additional constraint

$$u^T \Sigma_x u_l = v^T \Sigma_y v_l = 0, \qquad 0 < l \le j - 1. \tag{3}$$

As mentioned earlier, in many modern data examples, the sample size n is typically at most comparable to or much smaller than p or q – rendering the classical CCA inconsistent and inadequate without further structural assumptions [21]-[23]. The framework of Sparse Canonical Correlation Analysis (SCCA) (cf. [8], [24]), where the $u_i$'s and the $v_i$'s are sparse vectors, was subsequently developed to target low dimensional structures (that allow consistent estimation) when p, q are potentially larger than n. The corresponding sparse estimates of the leading canonical directions naturally perform variable selection, thereby leading to the recovery of their support (cf. [8], [19], [24], [25]). It is unknown, however, under what settings this naïve method of support recovery, or any other method for that matter, is consistent. The support recovery of the leading canonical directions serves the important purpose of identifying groups of variables that explain the most linear dependence among the high dimensional random objects (X and Y) under study – and thereby renders crucial interpretability. Asymptotically optimal support recovery is yet to be explored systematically in the context of SCCA – both theoretically and from the computational viewpoint. In fact, despite the renewed enthusiasm for CCA, both the theoretical and applied communities have mainly focused on the estimation of the leading canonical directions, and relevant scalable algorithms – see, e.g., [22], [24], [26]-[28]. This paper explores the crucial question of support recovery in the context of SCCA. 1

The problem of support recovery for SCCA naturally connects to a vast class of variable selection problems (cf. [29]-[33]). The problem closest in terms of complexity turns out to be the sparse PCA (SPCA) problem [34]. Support recovery in the latter problem is known to present interesting information theoretic and computational bottlenecks (cf. [30], [35]-[37]). Moreover, information theoretic and computational issues also arise in the context of the SCCA estimation problem (cf. [24], [26]-[28]). In view of the above, it is natural to expect that such information theoretic and computational issues exist for the SCCA support recovery problem as well. However, the techniques used in SPCA support recovery analysis are not directly applicable to the SCCA problem, which poses additional challenges due to the presence of the high dimensional nuisance parameters $\Sigma_x$ and $\Sigma_y$. The main focus of our work is therefore retrieving the complete picture of the information theoretic and computational limitations of SCCA support recovery. Before going into further details, we present a brief summary of our contributions, and defer the discussion of the main subtleties to Section III. Our methods can be implemented using the R package Support.CCA [38].

A. Summary of Main Results

We say a method successfully recovers the support if it achieves exact recovery with probability tending to one uniformly over the sparse parameter spaces defined in Section II. In the sequel, we denote the cardinality of the combined support of the ui’s and the vi’s by sx and sy, respectively. Thus sx and sy will be our respective sparsity parameters. Our main contributions are listed below.

1). General methodology:

In Section III-A, we construct a general algorithm called RecoverSupp, which leads to successful support recovery whenever the latter is information theoretically tractable. This also serves as the first step in creating a polynomial time procedure for recovering the support in one of the difficult regimes of the problem – see, e.g., Corollary 2, which shows that RecoverSupp accompanied by a co-ordinate thresholding type method recovers the support in polynomial time in a regime that requires subtle analysis. Moreover, Theorem 1 shows that the minimal signal strength required by RecoverSupp matches the information theoretic limit whenever the nuisance precision matrices $\Sigma_x^{-1}$ and $\Sigma_y^{-1}$ are sufficiently sparse.

2). Information theoretic and computational hardness as a function of sparsity:

As the sparsity level increases, we show that the CCA support recovery problem transitions from being efficiently solvable, to NP hard (conjectured), and finally to information theoretically impossible. According to this hardness pattern, the sparsity domain can be partitioned into the following three regimes: (i) $s_x, s_y \lesssim \sqrt{n}$, (ii) $\sqrt{n} \lesssim s_x, s_y \lesssim n/\log(p+q)$, and (iii) $s_x, s_y \gtrsim n/\log(p+q)$. We describe below the distinguishing behaviours of these three regimes, which are consistent with the sparse PCA scenario.

  • We show that when $s_x, s_y \lesssim \sqrt{n/\log(p+q)}$ ("easy regime"), polynomial time support recovery is possible, and well-known consistent estimators of the canonical correlates (cf. [24], [28]) can be utilized to that end. When $\sqrt{n/\log(p+q)} \lesssim s_x, s_y \lesssim \sqrt{n}$ ("difficult regime"), we show that a co-ordinate thresholding type algorithm (inspired by [1]) succeeds provided $p + q \lesssim n$. We call the last regime "difficult" because it is unknown whether existing estimation methods like COLAR [28] or SCCA [24] have valid statistical guarantees in this regime – see Section III-A and Section III-D for more details.

  • In Section III-C, we show that when $\sqrt{n} \lesssim s_x, s_y \lesssim n/\log(p+q)$ ("hard regime"), support recovery is computationally hard subject to the so-called "low degree polynomial conjecture" recently popularized by [39], [40], and [2]. Of course, this phenomenon is observable only when $p, q \gg n$, because otherwise the problem would be solvable by ordinary CCA analysis (cf. [23], [41]). Our findings are consistent with the conjectured computational barrier in the context of the SCCA estimation problem [28].

  • When $s_x, s_y \gtrsim n/\log(p+q)$, we show that support recovery is information theoretically impossible (see Section III-B).

3). Information theoretic hardness as a function of minimal signal strength:

In the context of support recovery, the signal strength is quantified by

$$\mathrm{Sig}_x = \min_{k \in D(U)} \max_{i \in [r]} |(u_i)_k| \quad\text{and}\quad \mathrm{Sig}_y = \min_{k \in D(V)} \max_{i \in [r]} |(v_i)_k|.$$

Generally, support recovery algorithms require the signal strength to lie above some threshold. As a concrete example, the detailed analyses provided in [1], [30], and [35] are all based on the nonzero principal component elements being of the order $\pm 1/\sqrt{\mathrm{sparsity}}$. To the best of our knowledge, prior to our work, there was no result in the PCA/CCA literature on the information theoretic limit of the minimal signal strength. PCA studies generally assume that the top eigenvectors are de-localized, i.e., that the principal components have elements of the order $O(1/\sqrt{s})$. We do not make any such assumption on the canonical covariates, and thereby we believe that our study paints a more complete picture of support recovery.

  • In Section III-B, we show that $\mathrm{Sig}_x \gtrsim \sqrt{\log(p - s_x)/n}$ (or $\mathrm{Sig}_y \gtrsim \sqrt{\log(q - s_y)/n}$) is a necessary requirement for successful recovery of $D(U)$ (or $D(V)$).
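To make these definitions concrete, the following minimal R sketch computes $\mathrm{Sig}_x$ and $\mathrm{Sig}_y$ from direction matrices; the toy matrices U and V below are hypothetical inputs (rows indexed by variables, columns by the $r$ directions), not quantities from the paper.

```r
# Minimal sketch: compute the minimal signal strengths of assumed direction
# matrices U (p x r) and V (q x r); rows outside the support are ignored.
signal_strength <- function(M) {
  row_max <- apply(abs(M), 1, max)   # max_i |(u_i)_k| for each row k
  min(row_max[row_max > 0])          # minimize over the support D(U)
}

set.seed(1)
U <- matrix(0, nrow = 50, ncol = 2); U[1:5, ] <- rnorm(10)   # s_x = 5
V <- matrix(0, nrow = 40, ncol = 2); V[1:4, ] <- rnorm(8)    # s_y = 4
c(Sig_x = signal_strength(U), Sig_y = signal_strength(V))
```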

B. Notation

For a vector $x \in \mathbb{R}^p$, we denote its support by $D(x) = \{i : x_i \neq 0\}$. We will overload notation, and for a matrix $A \in \mathbb{R}^{p \times q}$, we will denote by $D(A)$ the indexes of the nonzero rows of $A$. By an abuse of notation, sometimes we will refer to $D(A)$ as the support of $A$ as well. When $A \in \mathbb{R}^{p \times q}$ and $\alpha \in \mathbb{R}^p$ are unknown parameters, the estimators of their supports will generally be denoted by $\hat{D}(A)$ and $\hat{D}(\alpha)$, respectively. We let $\mathbb{N}$ denote the set of all positive integers, and write $\mathbb{Z}$ for the set of all nonnegative integers $\{0, 1, 2, \ldots\}$. For any $n \in \mathbb{N}$, we let $[n]$ denote the set $\{1, \ldots, n\}$. We define the projection of $A$ onto $D \subseteq [p] \times [q]$ by

$$(\mathcal{P}_D\{A\})_{i,j} = \begin{cases} A_{i,j} & \text{if } (i,j) \in D, \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

For any finite set $\mathcal{A}$, we denote its cardinality by $|\mathcal{A}|$. Also, for any event $\mathcal{B}$, we let $1\{\mathcal{B}\}$ be the indicator of the event $\mathcal{B}$. For any $p \in \mathbb{N}$, we let $S^{p-1}$ denote the unit sphere in $\mathbb{R}^p$.

We let $\|\cdot\|_k$ denote the usual $l_k$ norm for $k \in \mathbb{Z}$. In particular, we let $\|x\|_0$ denote the number of nonzero elements of a vector $x \in \mathbb{R}^p$. For any probability measure $P$ on the Borel sigma field of $\mathbb{R}^p$, we let $L_2(P)$ be the set of all measurable functions $f : \mathbb{R}^p \to \mathbb{R}$ such that $\|f\|_{L_2(P)} = (\int f^2\, dP)^{1/2} < \infty$. The corresponding $L_2(P)$ inner product will be denoted by $\langle \cdot, \cdot \rangle_{L_2(P)}$. We denote the operator norm and the Frobenius norm of a matrix $A \in \mathbb{R}^{p \times q}$ by $\|A\|_{op}$ and $\|A\|_F$, respectively. We let $A_{i*}$ and $A_{*j}$ denote the $i$-th row and $j$-th column of $A$, respectively. For $k \in \mathbb{N}$, we define the norms $\|A\|_{k,\infty} = \max_{j \in [q]} \|A_{*j}\|_k$ and $\|A\|_{\infty,k} = \max_{i \in [p]} \|A_{i*}\|_k$. The maximum and minimum eigenvalues of a square matrix $A$ will be denoted by $\Lambda_{\max}(A)$ and $\Lambda_{\min}(A)$, respectively. Also, we let $s(A)$ denote the maximum number of nonzero entries in any column of $A$, i.e., $s(A) = \max_{j \in [q]} \|A_{*j}\|_0$.

The results in this paper are mostly asymptotic (in $n$) in nature and thus require some standard asymptotic notation. If $a_n$ and $b_n$ are two sequences of real numbers, then $a_n \gg b_n$ (and $a_n \ll b_n$) implies that $a_n/b_n \to \infty$ (and $a_n/b_n \to 0$) as $n \to \infty$, respectively. Similarly, $a_n \gtrsim b_n$ (and $a_n \lesssim b_n$) implies that $\liminf_{n} a_n/b_n = C$ for some $C \in (0, \infty]$ (and $\limsup_{n} a_n/b_n = C$ for some $C \in [0, \infty)$). Alternatively, $a_n = o(b_n)$ will also imply $a_n \ll b_n$, and $a_n = O(b_n)$ will imply that $\limsup_{n} a_n/b_n = C$ for some $C \in [0, \infty)$. We write $a_n \asymp b_n$ if there are positive constants $C_1$ and $C_2$ such that $C_1 b_n \le a_n \le C_2 b_n$ for all $n \in \mathbb{N}$. We will write $a_n = \tilde{\Theta}(b_n)$ to indicate that $a_n$ and $b_n$ are asymptotically of the same order up to a poly-log term. Finally, in our mathematical statements, $C$ and $c$ will be two generic constants which can vary from line to line.

II. Mathematical Formalism

We denote the rank of $\Sigma_{xy}$ by $r$. It can be shown that exactly $r$ canonical correlations are positive and the rest are zero in the model (2). We will consider the matrices $U = [u_1, \ldots, u_r]$ and $V = [v_1, \ldots, v_r]$. From (2) and (3), it is not hard to see that $U^T \Sigma_x U = I_r$ and $V^T \Sigma_y V = I_r$. The indexes of the nonzero rows of $U$ and $V$, respectively, are the combined supports of the $u_i$'s and the $v_i$'s. Since we are interested in the recovery of the latter, it will be useful for us to study the properties of $U$ and $V$. To that end, we often make use of the following representation connecting $\Sigma_{xy}$ to $U$ and $V$ [16]:

$$\Sigma_{xy} = \Sigma_x U \Lambda V^T \Sigma_y = \Sigma_x \Big( \sum_{i=1}^{r} \Lambda_i u_i v_i^T \Big) \Sigma_y. \tag{5}$$
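The representation (5) suggests how $U$, $V$, and the $\Lambda_i$ can be read off from $\Sigma$ at the population level: whiten $\Sigma_{xy}$ by $\Sigma_x^{-1/2}$ and $\Sigma_y^{-1/2}$ and take an SVD. The R sketch below illustrates this on a small synthetic covariance; the helper mat_inv_sqrt and the toy dimensions are ours and only serve as an illustration under the stated model.

```r
# Population-level illustration of (5): recover the canonical correlation and
# directions from Sigma via whitening and SVD (toy example, r = 1).
mat_inv_sqrt <- function(S) {                     # S^{-1/2} for symmetric PD S
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}

p <- 4; q <- 3
Sigma_x <- diag(p); Sigma_y <- diag(q)
u <- c(1, 1, 0, 0); u <- u / sqrt(drop(t(u) %*% Sigma_x %*% u))  # u' Sigma_x u = 1
v <- c(1, 0, 0);    v <- v / sqrt(drop(t(v) %*% Sigma_y %*% v))  # v' Sigma_y v = 1
Lambda1 <- 0.6
Sigma_xy <- Sigma_x %*% (Lambda1 * tcrossprod(u, v)) %*% Sigma_y  # display (5)

W <- mat_inv_sqrt(Sigma_x) %*% Sigma_xy %*% mat_inv_sqrt(Sigma_y)
s <- svd(W)
s$d[1]                                      # equals Lambda1
U_hat <- mat_inv_sqrt(Sigma_x) %*% s$u[, 1] # equals u up to a sign flip
V_hat <- mat_inv_sqrt(Sigma_y) %*% s$v[, 1]
```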

To keep our results straightforward, we restrict our attention to a particular model $\mathcal{P}(r, s_x, s_y, \mathcal{L})$ throughout, defined as follows.

Definition 1. Suppose $(X, Y) \sim P$. Let $\mathcal{L} > 1$ be a constant. We say $P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})$ if

  • A1

    (Sub-Gaussian) X and Y are sub-Gaussian random vectors (cf. [42]), with joint covariance matrix Σ as defined in (1). Also rank(Σxy)=r.

  • A2

    Recall the definition of the canonical correlations $\Lambda_i$ from (3). Note that by definition, $0 \le \Lambda_r \le \cdots \le \Lambda_1$. For $P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})$, $\Lambda_r$ additionally satisfies $\Lambda_r \ge 1/\mathcal{L}$.

  • A3

    (Sparsity) The numbers of nonzero rows of $U$ and $V$ are $s_x$ and $s_y$, respectively, that is, $s_x = |\cup_{i=1}^{r} D(u_i)|$ and $s_y = |\cup_{i=1}^{r} D(v_i)|$. Here $U$ and $V$ are as defined in (5).

  • A4
    (Bounded eigenvalue)
    $\mathcal{L}^{-1} < \Lambda_{\min}(\Sigma_x), \Lambda_{\min}(\Sigma_y) \le \Lambda_{\max}(\Sigma_x), \Lambda_{\max}(\Sigma_y) < \mathcal{L}$.
  • A5

    (Positive eigen-gap) $\Lambda_i \le \Lambda_{i-1} - 1/\mathcal{L}$ for $i = 2, \ldots, r$.

Sometimes we will consider the sub-model of $\mathcal{P}(r, s_x, s_y, \mathcal{L})$ in which each $P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})$ is Gaussian. This model will be denoted by $\mathcal{P}_G(r, s_x, s_y, \mathcal{L})$, where "G" stands for the Gaussian assumption. Some remarks on the modeling assumptions A1–A5 are in order, which we provide next.

  • A1.

    We begin by noting that we do not require X and Y to be jointly sub-Gaussian. Moreover, the individual sub-Gaussian assumption itself is common in the $s_x, s_y \lesssim \sqrt{n/\log(p+q)}$ regime in the SCCA literature (cf. [24], [28], [43]). Our proof techniques depend crucially on the sub-Gaussian assumption. We also anticipate that the results derived in this paper will change under the violation of this assumption. For the sharper analysis in the difficult regime ($\sqrt{n/\log(p+q)} \lesssim s_x, s_y \lesssim \sqrt{n}$), our proof techniques require the Gaussian model $\mathcal{P}_G$ – which is in parallel with [1]'s treatment of sparse PCA in the corresponding difficult regime. In general, the Gaussian spiked model assumption in sparse PCA goes back to [44], and is common in the PCA literature (cf. [30], [35]).

  • A2-A4.

    These assumptions are standard in the analysis of canonical correlations (cf. [24], [28]).

  • A5.

    This assumption concerns the gap between consecutive canonical correlation strengths. However, we refer to this gap as the "eigengap" because of its similarity with the eigengap in the sparse PCA literature (cf. [1], [45]). This assumption is necessary for the estimation of the $i$-th canonical covariates. Indeed, if $\Lambda_i = \Lambda_{i+1}$ then there is no hope of estimating the $i$-th canonical covariates because they are not identifiable, and so support recovery also becomes infeasible. This assumption can be relaxed to requiring a positive gap only for the first $k$ canonical correlations, i.e., $\Lambda_i > \Lambda_{i+1}$ for $i \le k$, where $k \le r$. In this case, we can recover the support of only the first $k$ canonical covariates.

In the following sections, we will denote the preliminary estimators of $U$ and $V$ by $\hat{U}$ and $\hat{V}$, respectively. The columns of $\hat{U}$ and $\hat{V}$ will be denoted by $\hat{u}_{n,i}$ and $\hat{v}_{n,i}$ ($i \in [r]$), respectively. Therefore $\hat{u}_{n,i}$ and $\hat{v}_{n,i}$ will stand for the corresponding preliminary estimators of $u_i$ and $v_i$. In the case of CCA, the $u_i$'s and $v_i$'s are identifiable only up to a sign flip. Hence, they are also estimable only up to a sign flip. We denote the empirical estimates of $\Sigma_x$, $\Sigma_y$, and $\Sigma_{xy}$ by $\hat{\Sigma}_{n,x}$, $\hat{\Sigma}_{n,y}$, and $\hat{\Sigma}_{n,xy}$, respectively – which will often be appended with superscripts to denote their estimation through suitable subsamples of the data 2. Finally, we let $C$ denote a positive constant which depends on $P$ only through $\mathcal{L}$, but can vary from line to line.

III. Main Results

We divide our main results into the following parts based on both statistical and computational difficulties of different regimes. First, in Section III-A we present a general method and associated sufficient conditions for support recovery. This allows us to elicit a sequence of questions regarding necessity of the conditions and remaining gaps both from statistical and computational perspectives. Our subsequent sections are devoted to answering these very questions. In particular, in Section III-B we discuss information theoretic lower bounds followed by evidence for statistical-computational gaps in Section III-C. Finally, we close a final computational gap in asymptotic regime through sharp analysis of a special co-ordinate-thresholding type method in Section III-D.

A. A Simple and General Method:

We begin with a simple method for estimating the support, which readily establishes the result for the easy regime, and sets the directions for the investigation into other more subtle regimes. Since the estimation of D(U) and D(V) are similar, we focus only on the estimation of D(V) for the time being.

Suppose $\hat{V}$ is a row sparse estimator of $V$. The set of nonzero rows of $\hat{V}$ is the most intuitive estimator of $D(V)$. Such a $\hat{V}$ is also easily attainable because most estimators of the canonical directions in high dimensions are sparse (cf. [24], [26], [28] among others). Although we have not yet been able to show the validity of this apparently "naïve" method, we provide numerical results in Section IV to explore its finite sample performance. However, a simple method can refine these initial estimators to recover the support $D(V)$, often optimally. We now provide the details of this method and derive its asymptotic properties.

To that end, suppose we have at our disposal an estimating procedure for $\Sigma_y^{-1}$, which we generically denote by $\hat{\Omega}_n$, and an estimator $\hat{U} \in \mathbb{R}^{p \times r}$ of $U$. We split the sample into two equal parts, compute $\hat{U}^{(1)}$ and $\hat{\Omega}_n^{(1)}$ from the first part of the sample, and compute the estimator $\hat{\Sigma}_{n,xy}^{(2)}$ from the second part of the sample. Define $\hat{V}^{\mathrm{clean}} = \hat{\Omega}_n^{(1)} \hat{\Sigma}_{n,yx}^{(2)} \hat{U}^{(1)}$. Our estimator of $D(V)$ is then given by

$$\hat{D}(V) \equiv \big\{ i \in [q] : |\hat{V}^{\mathrm{clean}}_{ik}| > \mathrm{cut} \text{ for some } k \in [r] \big\}, \tag{6}$$

where cut is a pre-specified cut-off or threshold. We will discuss the choice of cut later. The resulting algorithm will be referred to as RecoverSupp from now on. Algorithm 1 gives the algorithm for the support recovery of $V$, but the full version of RecoverSupp, which estimates $D(U)$ and $D(V)$ simultaneously, can be found in Appendix A; see Algorithm 3 there. RecoverSupp is similar in spirit to the "cleaning" step in the sparse PCA support recovery literature (cf. [1]). One thing to remember here is that $\hat{V}^{\mathrm{clean}}$ is not an estimator of $V$. In fact, the $(i, k)$-th element of $\hat{V}^{\mathrm{clean}}$ is an estimator of $\Lambda_k (v_k)_i$.

Remark 1. In many applications, the rank $r$ may be unknown. [46] (see Section 4.6.5 therein) suggests using the screeplot of the canonical correlations to estimate $r$. The screeplot is also a popular tool for estimating the number of nonzero principal components in PCA analysis [1]. For CCA, the screeplot is the plot of the estimated canonical correlations versus their orders. If there is a clear gap between two successive correlations, [46] suggests taking the larger correlation as the estimator of $\Lambda_r$. One can use [24]'s SCCA method to estimate the canonical correlations and obtain the screeplot. There can be other ways of estimating $r$. For example, in their trans-eQTL study, [47] uses a resampling technique on a holdout dataset to generate observations from the null distribution of the $i$-th canonical correlation estimate under the hypothesis $H_0 : \Lambda_i = 0$, where $i \in [\min(p, q)]$. The largest $i$ for which the test is rejected is taken as the estimated rank. A similar technique has been used by [48] to select the ranks for a related method, JIVE.
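As a small illustration of the screeplot heuristic, the R sketch below plots hypothetical estimated canonical correlations against their order; rho_hat is a made-up vector standing in for estimates that would be obtained from data (e.g., via [24]'s SCCA method).

```r
# Screeplot sketch: estimated canonical correlations versus their order.
rho_hat <- c(0.72, 0.65, 0.12, 0.08, 0.05)   # hypothetical estimates
plot(seq_along(rho_hat), rho_hat, type = "b",
     xlab = "order", ylab = "estimated canonical correlation")
# The visible gap after the second correlation would suggest r = 2 here.
```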

Algorithm 1 RecoverSupp($\hat{U}^{(1)}, \hat{\Omega}_n^{(1)}, \hat{\Sigma}_{n,xy}^{(2)}, \mathrm{cut}, r$): support recovery of $V$
Input: 1) Preliminary estimators $\hat{U}^{(1)}$ and $\hat{\Omega}_n^{(1)}$ of $U$ and $\Sigma_y^{-1}$, respectively, based on the sample $O_1 = (x_i, y_i)_{i=1}^{\lfloor n/2 \rfloor}$. 2) Estimator $\hat{\Sigma}_{n,xy}^{(2)}$ of $\Sigma_{xy}$ based on the sample $O_2 = (x_i, y_i)_{i=\lfloor n/2 \rfloor + 1}^{n}$. 3) Threshold level $\mathrm{cut} > 0$ and rank $r \in \mathbb{N}$.
Output: $\hat{D}(V)$, an estimator of $D(V)$.
1) Cleaning: $\hat{V}^{\mathrm{clean}} \leftarrow \hat{\Omega}_n^{(1)} \hat{\Sigma}_{n,yx}^{(2)} \hat{U}^{(1)}$.
2) Threshold: Compute $\hat{D}(V)$ as in (6).
return: $\hat{D}(V)$.
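A minimal R rendering of Algorithm 1 might look as follows. It takes as given the preliminary estimators U_hat1 (of $U$) and Omega_hat1 (of $\Sigma_y^{-1}$) computed from the first half of the sample, and the cross-covariance estimate Sigma_yx2 from the second half; how to obtain such inputs is exactly what the rest of the section discusses.

```r
# Sketch of RecoverSupp (Algorithm 1) for D(V). Inputs are assumed to come
# from a split sample: U_hat1 (p x r), Omega_hat1 (q x q), Sigma_yx2 (q x p),
# and a threshold `cut`.
recover_supp_V <- function(U_hat1, Omega_hat1, Sigma_yx2, cut) {
  V_clean <- Omega_hat1 %*% Sigma_yx2 %*% U_hat1   # cleaning step
  row_max <- apply(abs(V_clean), 1, max)           # max over k in [r]
  which(row_max > cut)                             # estimate of D(V), as in (6)
}
```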

It turns out that, albeit being simple, RecoverSupp has desirable statistical guarantees provided $\hat{U}^{(1)}$ and $\hat{\Omega}_n^{(1)}$ are reasonable estimators of $U$ and $\Sigma_y^{-1}$, respectively. These theoretical properties of RecoverSupp, and the hypotheses and queries generated thereof, lay out the roadmap for the rest of our paper. However, before getting into the detailed theoretical analysis of RecoverSupp, we state an $l_2$-consistency condition on the $\hat{u}_{n,i}$'s and $\hat{v}_{n,i}$'s, where we remind the readers that $\hat{u}_{n,i}$ and $\hat{v}_{n,i}$ denote the $i$-th columns of $\hat{U}$ and $\hat{V}$, respectively. Recall also that the $i$-th columns of $U$ and $V$ are denoted by $u_i$ and $v_i$, respectively.

Condition 1 ($l_2$ consistency). There exists a function $\mathrm{Err} \equiv \mathrm{Err}(n, p, q, s_x, s_y, \mathcal{L}) \in \mathbb{R}$ such that $\mathrm{Err} < 1/(2r)$ and the estimators $\hat{u}_{n,i}$ and $\hat{v}_{n,i}$ of $u_i$ and $v_i$ satisfy

$$\max_{i \in [r]} \min_{w \in \{\pm 1\}} (w\hat{u}_{n,i} - u_i)^T \Sigma_x (w\hat{u}_{n,i} - u_i) < \mathrm{Err}^2,$$
$$\max_{i \in [r]} \min_{w \in \{\pm 1\}} (w\hat{v}_{n,i} - v_i)^T \Sigma_y (w\hat{v}_{n,i} - v_i) < \mathrm{Err}^2$$

with $P$ probability $1 - o(1)$ uniformly over $P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})$.

We will discuss estimators which satisfy Condition 1 later. Theorem 1 also requires the signal strength $\mathrm{Sig}_y$ to be at least of the order $\epsilon_n = \xi_n \sqrt{\log(p+q)\, s(\Sigma_y^{-1})/n}$, where the parameter $\xi_n$ depends on the type of $\hat{\Omega}_n$ as follows:

  1. $\hat{\Omega}_n$ is of type A if there exists $C_{\mathrm{pre}} > 0$ so that $\hat{\Omega}_n$ satisfies $\|\hat{\Omega}_n - \Sigma_y^{-1}\|_{\infty,1} \le C_{\mathrm{pre}}\, s(\Sigma_y^{-1}) \sqrt{(\log q)/n}$ with $P$ probability $1 - o(1)$ uniformly over $P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})$. Here we remind the readers that $s(\Sigma_y^{-1}) = \max_{j \in [q]} \|(\Sigma_y^{-1})_{*j}\|_0$. In this case, $\xi_n = C_{\mathrm{pre}}\, s(\Sigma_y^{-1})$.

  2. $\hat{\Omega}_n$ is of type B if $\|\hat{\Omega}_n - \Sigma_y^{-1}\|_{\infty,2} \le C_{\mathrm{pre}} \sqrt{s(\Sigma_y^{-1}) \log(q)/n}$ with $P$ probability $1 - o(1)$ uniformly over $P \in \mathcal{P}_G(r, s_x, s_y, \mathcal{L})$ for some $C_{\mathrm{pre}} > 0$. In this case, $\xi_n = C_{\mathrm{pre}} \max\{\sqrt{r(\log q)/n}, 1\}$.

  3. $\hat{\Omega}_n$ is of type C if $\hat{\Omega}_n = \Sigma_y^{-1}$. In this case, $\xi_n = 1$.

The estimation error of $\hat{\Omega}_n$ clearly decays from type A to type C, with the error being zero for type C. Because $\sqrt{r(\log q)/n}$ is generally much smaller than $s(\Sigma_y^{-1})$, $\xi_n$ shrinks from case A to case C monotonically as well. Thus it is fair to say that $\xi_n$ reflects the precision of the estimator $\hat{\Omega}_n$, in that $\xi_n$ is smaller if $\hat{\Omega}_n$ is a sharper estimator. We are now ready to state Theorem 1, which is proved in Appendix C.

Theorem 1. Suppose $\log(pq) = o(n)$ and the estimators $\hat{u}_{n,i}$ satisfy Condition 1. Further suppose $\hat{\Omega}_n$ is of type A, B, or C, as stated above. Let $\epsilon_n = \xi_n \sqrt{\log(p+q)\, s(\Sigma_y^{-1})/n}$, where $\xi_n$ depends on the type of $\hat{\Omega}_n$ as outlined above. Then there exists a constant $C > 0$, depending only on $\mathcal{L}$, so that if

$$\mathrm{Sig}_y > 2C\epsilon_n, \tag{7}$$

and $\mathrm{cut} \in [C\epsilon_n/2,\ (\theta_n - 1)C\epsilon_n/2]$ with $\theta_n = \mathrm{Sig}_y/(C\epsilon_n)$, then the algorithm RecoverSupp fully recovers $D(V)$ with $P$ probability $1 - o(1)$ uniformly over $P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})$ (for $\hat{\Omega}_n$ of type A and C), or uniformly over $P \in \mathcal{P}_G(r, s_x, s_y, \mathcal{L})$ (for $\hat{\Omega}_n$ of type B).

The assumption that log p and log q are o(n) appears in all theoretical works of CCA (cf. [24], [28]). A requirement of this type is generally unavoidable. Note that Theorem 1 implies a more precise estimator Ω^n requires smaller signal strength for full support recovery.

Main idea behind the proof of Theorem 1:

Because $\Lambda_k (v_k)_i = e_i^T \Sigma_y^{-1} \Sigma_{yx} u_k$, $\hat{V}^{\mathrm{clean}}_{ik}$ is an estimator of $\Lambda_k (v_k)_i$ for $i \in [q]$ and $k \in [r]$. If $i \notin D(V)$, then $(v_k)_i = 0$ for all $k \in [r]$. Therefore, in this case, we expect $\hat{V}^{\mathrm{clean}}_{ik}$ to be small for all $k \in [r]$. We will show that whenever $i \notin D(V)$, $|\hat{V}^{\mathrm{clean}}_{ik}|$ is uniformly bounded by $C_1 \epsilon_n$ over $k \in [r]$ with high probability. Here $C_1 > 0$ is a constant. Second, when $i \in D(V)$, i.e., $(v_k)_i \neq 0$ for some $k$, we will show that $\max_{k \in [r]} |\hat{V}^{\mathrm{clean}}_{ik}|$ cannot be too small. In fact, we will show that

$$\max_{k \in [r]} |\hat{V}^{\mathrm{clean}}_{ik}| > C_2 \max_{k \in [r]} \Lambda_k |(v_k)_i| - C_1 \epsilon_n, \qquad i \in D(V), \tag{8}$$

for some $C_2 > 0$ with high probability in this case. The lower bound in the above inequality is bounded below by $C_2\, \mathrm{Sig}_y - C_1 \epsilon_n$. Thus, if the minimal signal strength $\mathrm{Sig}_y$ is bounded below by a large enough multiple of $\epsilon_n$, then the lower bound $C_2\, \mathrm{Sig}_y - C_1 \epsilon_n$ will be larger than the upper bound $C_1 \epsilon_n$ from the $i \notin D(V)$ case. Therefore, in this scenario, we can choose $C > 0$ so that

$$C_1 \epsilon_n < C \epsilon_n < C_2\, \mathrm{Sig}_y - C_1 \epsilon_n.$$

If we set $\mathrm{cut} = C\epsilon_n$, then the above inequality leads to

$$\sup_{i \notin D(V)} \max_{k \in [r]} |\hat{V}^{\mathrm{clean}}_{ik}| < \mathrm{cut} < \inf_{i \in D(V)} \max_{k \in [r]} |\hat{V}^{\mathrm{clean}}_{ik}|.$$

These C1 and C2 are behind the constant C in (7) and our choice of θn.

Thus the key step in the proof of Theorem 1 is analyzing the bias of V^ikclean, which hinges on the following bias decomposition:

$$\big|\hat{V}^{\mathrm{clean}}_{ik} - \Lambda_k (v_k)_i\big| \le \underbrace{\big|e_i^T(\hat{\Omega}_n - \Phi_0)\hat{\Sigma}_{n,yx}\hat{u}_{n,k}\big|}_{T_1(i,k)} + \underbrace{\big|e_i^T\Phi_0(\hat{\Sigma}_{n,yx} - \Sigma_{yx})\hat{u}_{n,k}\big|}_{T_2(i,k)} + \underbrace{\big|e_i^T\Phi_0\Sigma_{yx}(\hat{u}_{n,k} - u_k)\big|}_{T_3(i,k)}, \tag{9}$$
where $\Phi_0$ denotes $\Sigma_y^{-1}$.

Note that the term $T_1(i,k)$ corresponds to the bias in estimating $\Phi_0$. Similarly, the error terms $T_2(i,k)$ and $T_3(i,k)$ arise due to the bias in estimating $\Sigma_{yx}$ and $u_k$, respectively. The main contributing term in the upper bound in (9) is $T_1(i,k)$. One can use the consistency property of $\hat{\Omega}_n$ to show that $T_1(i,k)$ is of the order $O_p(\epsilon_n)$. Since $\hat{\Omega}_n$ has different rates and modes of convergence in cases A, B, and C, $T_1(i,k)$ has different orders in cases A, B, and C, which explains why $\epsilon_n$ is of a different order in these cases.

The term $T_2(i,k)$ is much smaller – it is of the order $(s(\Sigma_x^{-1})\log(p+q)/n)^{1/2}$. The proof is based on the fact that the $l_\infty$ error of estimating $\Sigma_{xy}$ by $\hat{\Sigma}_{n,xy}$ is of the order $(\log(p+q)/n)^{1/2}$ for sub-Gaussian $X$ and $Y$. The error term $T_3(i,k)$ is exactly zero for $i \notin D(V)$, and hence does not contribute. Thus only $T_1(i,k)$ and $T_2(i,k)$ contribute to the bias of $\hat{V}^{\mathrm{clean}}_{ik}$ for $i \notin D(V)$, which is therefore bounded by $C_1\epsilon_n$ for some $C_1 > 0$ with high probability in this case. The term $T_3(i,k)$ does contribute to the bias of $\hat{V}^{\mathrm{clean}}_{ik}$ for $i \in D(V)$, however, and it is of the order $r \max_{j \in [r]} |(v_j)_i|\, \mathrm{Err}$ in this case. Because Err is small by Condition 1, we can show that $T_3(i,k)$ is smaller than $\max_{k \in [r]} \Lambda_k |(v_k)_i|$, which eventually leads to the relation in (8), thus completing the proof. We have already mentioned that RecoverSupp is analogous to the cleaning step in sparse PCA. Therefore the proof of Theorem 1 has similarities with some analogous results in sparse PCA; see for example Theorem 3 of [1], which proves the consistency of a "cleaned" estimator of the joint support of the spiked principal components. However, the proof in the CCA case is a bit more involved because of the presence of $\Sigma_y^{-1}$, which needs to be estimated for the cleaning step. Different estimators of $\Sigma_y^{-1}$ can have different rates of convergence, which leads to the different types of estimators. This ultimately leads to different requirements on the order of the threshold cut and the minimal signal strength $\mathrm{Sig}_y$.

Next we will discuss the implications of Theorem 1, but before getting into that detail, we will make two important remarks.

Remark 2. Although the estimation of the high dimensional precision matrix $\Sigma_y^{-1}$ is potentially complicated, it is often unavoidable owing to the inherent subtlety of the CCA framework, namely the presence of the high dimensional nuisance parameters $\Sigma_x$ and $\Sigma_y$. [26] also used a precision matrix estimator for partial recovery of the support. In the case of sparse CCA, to the best of our knowledge, there does not exist an algorithm that can recover the support, partially or completely, without estimating the precision matrix. However, our requirements on $\hat{\Omega}_n$ are not strict, in that many common precision matrix estimators, e.g., the nodewise Lasso [49, Theorem 2.4], the thresholding estimator [50, Theorem 1 and Section 2.3], and the CLIME estimator [51, Theorem 6], exhibit decay rates of type A or B under standard sparsity assumptions on $\Sigma_y^{-1}$. We will not get into the details of the sparsity requirements on $\Sigma_y^{-1}$ because they are unrelated to the sparsity of $U$ or $V$, and hence are irrelevant to the primary goal of the current paper.

Remark 3. In the easy regime $s_y \lesssim \sqrt{n/\log(p+q)}$, polynomial time estimators satisfying Condition 1 are already available, e.g., COLAR [28, Theorem 4.2] or SCCA [24, Condition C4]. Thus it is easily seen that polynomial time support recovery is possible in the easy regime provided (7) is satisfied.

The implications of Theorem 1 for the sparsity requirements on $D(U)$ and $D(V)$ for full support recovery are somewhat implicit through the assumptions and conditions. However, the restriction on the sparsity is indirectly imposed by two different sources – which we elaborate on now. To keep the interpretations simple, throughout the following discussion, we assume that (a) $r = O(n/\log q)$, (b) $p$ and $q$ are of the same order, and (c) $s_x$ and $s_y$ are also of the same order. Note that (a) implies $\xi_n = O(1)$ for a type B estimator $\hat{\Omega}_n$. Since we separate the task of estimating the nuisance parameter $\Sigma_y^{-1}$ from the support recovery of $V$, we also assume that $s(\Sigma_y^{-1}) = O(1)$, which implies $\xi_n = O(1)$ for a type A estimator $\hat{\Omega}_n$. The assumption $s(\Sigma_y^{-1}) = O(1)$, combined with (a), reduces the minimal signal strength condition (7) in Theorem 1 to $\mathrm{Sig}_y \gtrsim \sqrt{\log(p+q)/n}$.

In light of the above discussion, the first source of the sparsity restriction is the minimal signal strength condition (7) on $\mathrm{Sig}_y$. To see this, first note that

$$1 = v_i^T \Sigma_y v_i \ge \Lambda_{\min}(\Sigma_y) \|v_i\|_2^2,$$

where $i \in [r]$. Since $\Lambda_{\min}(\Sigma_y) \ge \mathcal{L}^{-1}$,

$$\Lambda_{\min}(\Sigma_y) \|v_i\|_2^2 \ge \mathcal{L}^{-1} \|v_i\|_2^2 \ge \mathcal{L}^{-1}\, \mathrm{Sig}_y^2\, s_y,$$

implying $\mathrm{Sig}_y \lesssim s_y^{-1/2}$. Therefore, implicit in Theorem 1 lies the condition

$$s_y \lesssim \frac{n}{C^2 \log(p+q)}, \tag{10}$$

which is enforced by the minimal signal strength requirement (7). Thus Theorem 1 does not hold for $s_y \gg n/\log(p+q)$ even when $s(\Sigma_y^{-1})$ and $r$ are small. This regime requires some attention because in the cases of sparse PCA [30] and linear regression [29], support recovery at $s \gtrsim n/\log(p - s)$ 3 is proven to be information theoretically impossible. Although a parallel result can be intuited to hold for CCA, the nuances of SCCA support recovery in this regime are yet to be explored. Therefore, the sparsity requirement in (10) raises the question whether support recovery for CCA is at all possible when $s_y \gg n/\log(p+q)$, even if $\Sigma_x$ and $\Sigma_y$ are known.

Question 1. Does there exist any decoder $\hat{D}$ such that $\sup_{P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})} P(\hat{D}(V) \ne D(V)) \to 0$ when $s_y \gtrsim n/\log(q - s_y)$?

A related question is whether the minimal signal strength requirement (7) is necessary. To the best of our knowledge, there is no formal study on the information theoretic limit of the minimal signal strength even in the context of sparse PCA support recovery. Indeed, as we noted before, the detailed analyses of support recovery for SPCA provided in [1], [30], and [35] are all based on the nonzero principal component elements being of the order $O(1/\sqrt{s})$. Finally, although this question is not directly related to the sparsity conditions, it indeed probes the sharpness of the results in Theorem 1.

Question 2. What is the minimum signal strength required for the recovery of D(V)?

We will discuss Question 1 and Question 2 at greater length in Section III-B. In particular, Theorem 2(A) shows that there exists $C > 0$ so that support recovery at $s_y \ge Cn/\log(q - s_y)$ is indeed information theoretically intractable. On the other hand, in Theorem 2(B), we show that the minimal signal strength has to be of the order $\sqrt{\log(q - s_y)/n}$ for full recovery of $D(V)$. Thus when $p \asymp q$, (7) is indeed necessary from an information theoretic perspective.

The second source of restriction on the sparsity lies in Condition 1. Condition 1 is an $l_2$-consistency condition, which has a sparsity requirement of its own owing to the inherent hardness of estimating $U$. Indeed, Theorem 3.3 of [28] entails that it is impossible to estimate the canonical directions $u_i$ consistently if $s_x > Cn/(r + \log(ep/s_x))$ for some large $C > 0$. Hence, Condition 1 indirectly imposes the restriction $s_x \lesssim n/\max\{\log(p/s_x), r\}$. However, when $s_x \asymp s_y$, $p \asymp q$, and $r = O(1)$, the above restriction is already absorbed into the condition $s_y \lesssim n/\log(q - s_y)$ elicited in the last paragraph. In fact, there exist consistent estimators of $U$ whenever $s_x \lesssim n/\max\{\log(p/s_x), r\}$ and $s_y \lesssim n/\max\{\log(q/s_y), r\}$ (see [27] or Section 3 of [28]). Therefore, in the latter regime, RecoverSupp coupled with the above-mentioned estimators succeeds. In view of the above, it might be tempting to think that Condition 1 does not impose significant additional restrictions. The restriction due to Condition 1, however, is rather subtle and manifests itself through computational challenges. Note that when support recovery is information theoretically possible, the computational hardness of recovery by RecoverSupp will be at least as much as that of the estimation of $U$. Indeed, the estimators of $U$ which work in the regime $s_x \lesssim n/\log(p/s_x)$, $s_y \lesssim n/\log(q/s_y)$ are not adaptive to the sparsity, and they require a search over exponentially many sets of size $s_x$ and $s_y$. Furthermore, under $\mathcal{P}(r, s_x, s_y, \mathcal{L})$, all polynomial time consistent estimators of $U$ in the literature, e.g., COLAR [28, Theorem 4.2] or SCCA [24, Condition C4], require $s_x$, $s_y$ to be of the order $\sqrt{n/\log(p+q)}$. In fact, [28] indicates that estimation of $U$ or $V$ for sparsity of larger order is NP hard.

The above raises the question whether RecoverSupp (or any method as such) can succeed in polynomial time when $\sqrt{n/\log(p+q)} \lesssim s_x, s_y \lesssim n/\log(p+q)$. We turn to the landscape of sparse PCA for intuition. Indeed, in the case of sparse PCA, different scenarios are observed in the regime $s \gtrsim \sqrt{n/\log p}$, depending on whether $\sqrt{n} \lesssim s \lesssim n/\log p$, or $s \lesssim \sqrt{n}$ (we recall that for SPCA we denote the sparsity of the leading principal component direction generically through $s$). We focus on the sub-regime $\sqrt{n} \lesssim s \lesssim n/\log p$ first. In this case, both estimation and support recovery for sparse PCA are conjectured to be NP hard, which means no polynomial time method succeeds; see Section III-C for more details. The above hints that the regime $s_x, s_y \gtrsim \sqrt{n}$ is NP hard for sparse CCA as well.

Question 3. Is there any polynomial time method that can recover the support $D(V)$ when $s_x, s_y \gtrsim \sqrt{n}$?

We dedicate Section III-C to answering this question. Subject to the recent advances in the low degree polynomial conjecture, we establish computational hardness of the regime $s_x, s_y \gtrsim \sqrt{n}$ (up to a logarithmic factor gap) provided $n \ll p, q$. Our results are consistent with [28]'s findings in the estimation case and cover a broader regime; see Remark 5 for a comparison.

When the sparsity is of the order $\sqrt{n}$ and $p \lesssim n$, however, polynomial time support recovery and estimation are possible in the sparse PCA case. [1] showed that a co-ordinate thresholding type spectral algorithm works in this regime. Thus the following question is immediate.

Question 4. Is there any polynomial time method that can recover the support $D(V)$ when $s_x, s_y \in [\sqrt{n/\log(p+q)}, \sqrt{n}]$?

We give an affirmative answer to Question 4 in Section III-D, which is in parallel with the observations for sparse PCA. In fact, Corollary 2 shows that when $\Sigma_x$ and $\Sigma_y$ are known, $p + q \lesssim n$, and $s_x, s_y \lesssim \sqrt{n}$, estimation is possible in polynomial time. Since estimation is possible, RecoverSupp suffices for polynomial time support recovery in this regime, where $\sqrt{n}$ is well below the information theoretic limit of $n/\log(p+q)$. The main tool used in Section III-D is co-ordinate thresholding, which is originally a method for high dimensional matrix estimation [50], and apparently has nothing to do with the estimation of canonical directions. However, under our setup, if the covariance matrix is consistently estimated in operator norm, then by Wedin's $\sin\theta$ theorem [52], an SVD is enough to get consistent estimators of $U$ and $V$ suitable for further precise analysis.

Remark 4. RecoverSupp uses sample splitting, which can reduce the efficiency. One can swap between the samples and compute two estimators of the supports. One can easily show that both the intersection and the union of the resulting supports enjoy the asymptotic guarantees of Theorem 1.

This section can be best summarized by Figure 1, which gives the information theoretic and computational landscape of sparse CCA analysis in terms of the sparsity. In other words, Figure 1 gives the phase transition plot for SCCA support recovery with respect to sparsity. It can be seen that our contributions (colored in red) complete the picture, which was initiated by [28].

Fig. 1: Phase transition plots for SCCA estimation and support recovery problems with respect to sparsity. We have taken $s_x = s_y$ here. COLAR corresponds to the estimation method of [28]. Our contributions are colored in red. See [28] for more details on the regions colored in blue.

B. Information Theoretic Lower Bounds: Answers to Question 1 and 2

Theorem 2 establishes the information theoretic limits on the sparsity levels sx, sy, and the signal strengths Sigx and Sigy. The proof of Theorem 2 is deferred to Appendix D.

Theorem 2. Suppose $\hat{D}(U)$ and $\hat{D}(V)$ are estimators of $D(U)$ and $D(V)$, respectively. Let $s_x, s_y > 1$, and $p - s_x, q - s_y > 16$. Then the following assertions hold:

  A. If $s_x > \frac{16n}{(\mathcal{L}^2 - 1)\log(p - s_x)}$, then
    $$\inf_{\hat{D}} \sup_{P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})} P\big(\hat{D}(U) \ne D(U)\big) > 1/2.$$
    On the other hand, if $s_y > \frac{16n}{(\mathcal{L}^2 - 1)\log(q - s_y)}$, then
    $$\inf_{\hat{D}} \sup_{P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})} P\big(\hat{D}(V) \ne D(V)\big) > 1/2.$$
  B. Let $\mathcal{P}_{\mathrm{Sig}}(r, s_x, s_y, \mathcal{L})$ be the class of distributions $P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})$ satisfying $\mathrm{Sig}_x^2 \le (\mathcal{L}^2 - 1)\log(p - s_x)/(8n)$. Then
    $$\inf_{\hat{D}} \sup_{P \in \mathcal{P}_{\mathrm{Sig}}(r, s_x, s_y, \mathcal{L})} P\big(\hat{D}(U) \ne D(U)\big) > 1/2.$$
    On the other hand, if $\mathrm{Sig}_y^2 \le (\mathcal{L}^2 - 1)\log(q - s_y)/(8n)$, then
    $$\inf_{\hat{D}} \sup_{P \in \mathcal{P}_{\mathrm{Sig}}(r, s_x, s_y, \mathcal{L})} P\big(\hat{D}(V) \ne D(V)\big) > 1/2.$$

In both cases, the infimum is over all possible decoders D^(U) and D^(V).

First, we discuss the implications of part A of Theorem 2. This part entails that for full support recovery of $V$, the minimum sample size requirement is of the order $s_y \log(q - s_y)$. This requirement is consistent with the traditional lower bounds on $n$ in the context of support recovery for sparse PCA [30, Theorem 3] and $L_1$ regression [29, Corollary 1]. However, when $r = O(1)$, the sample size requirement for estimation of $V$ is slightly relaxed, namely $n \gtrsim s_y \log(q/s_y)$ [28, Theorem 3.2]. Therefore, from an information theoretic point of view, the task of full support recovery appears to be slightly harder than the task of estimation. The scenario for partial support recovery might be different and we do not pursue it here. Moreover, as mentioned earlier, in the regime $s_y \le Cn/\log(p+q)$, RecoverSupp works with [28]'s (see Section 3 therein) estimator of $U$. Thus part A of Theorem 2 implies that $n/\log(p+q)$ is the information theoretic upper bound on the sparsity for full support recovery in sparse CCA.

Part B of Theorem 2 implies that it is not possible to push the minimum signal strength below the level $O(\sqrt{\log(q - s_y)/n})$. Thus the minimal signal strength requirement (7) of Theorem 1 is indeed minimal up to a factor of $\xi_n \sqrt{s(\Sigma_y^{-1})}$. The last statement can be refined further. To that end, we remind the readers that for a good estimator of $\Sigma_y^{-1}$, i.e., a type B estimator, $\xi_n = O(1)$ if $r = O(n/\log q)$. However, the latter always holds if support recovery is at all possible, because in that case $s_y \lesssim n/\log(p+q)$, and elementary linear algebra gives $s_y \ge r$. Thus, it is fair to say that, provided a good estimator of $\Sigma_y^{-1}$ is available, the requirement (7) is minimal up to a factor of $\sqrt{s(\Sigma_y^{-1})}$. Indeed, this implies that for banded inverses with finite band-width our results are rate optimal.

It is further worth comparing this part of the result to the SPCA literature. In the SPCA support recovery literature, the lower bound on the signal strength is generally depicted in terms of the sparsity $s$, and usually a signal strength of order $O(1/\sqrt{s})$ is postulated (cf. [1], [30], [35]). Using our proof strategies, it can be easily shown that for SPCA, the analogous lower bound on the signal strength would be $\sqrt{\log(p - s)/n}$. The latter is generally much smaller than $1/\sqrt{s}$, and only when $s \asymp n/\log p$ is the requirement of $1/\sqrt{s}$ close to the lower bound. Thus, only in the regime $s \asymp n/\log p$ should the lower bound be of the order $O(1/\sqrt{s})$. Therefore the minimum signal strength requirement of $O(1/\sqrt{s})$ typically assumed in the SPCA literature seems larger than necessary.

Main idea behind the proof of Theorem 2:

The main device used in this proof is Fano's inequality [53]. Note that for any $\mathcal{C} \subseteq \mathcal{P}(r, s_x, s_y, \mathcal{L})$,

$$\inf_{\hat{D}_\alpha} \sup_{P \in \mathcal{C}} P\big(\hat{D}_\alpha \ne D(\alpha)\big) \le \inf_{\hat{D}_\alpha} \sup_{P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})} P\big(\hat{D}_\alpha \ne D(\alpha)\big). \tag{11}$$

Therefore it suffices to show that the left hand side of the above inequality is bounded below by 1/2 for some carefully chosen $\mathcal{C}$. If $\mathcal{C}$ is finite, we can lower bound the left hand side of (11) using Fano's inequality [53], which yields

$$\inf_{\hat{D}_\alpha} \sup_{P \in \mathcal{C}} P\big(\hat{D}_\alpha \ne D(\alpha)\big) \ge 1 - \frac{|\mathcal{C}|^{-2}\sum_{P_1, P_2 \in \mathcal{C}} \mathrm{KL}(P_1^n \,\|\, P_2^n) + \log 2}{\log(|\mathcal{C}| - 1)}. \tag{12}$$

Thus the main task is to choose $\mathcal{C}$ in a way so that the right hand side (RHS) of (12) is large. We will choose $\mathcal{C}$ so that $X$ and $Y$ are jointly Gaussian. In particular, $X \sim N_p(0, I_p)$, $Y \sim N_q(0, I_q)$, and $\Sigma_{xy} = \rho\, \alpha \beta_0^T$, where $\beta_0 \in S^{q-1}$ and $\rho \in (0, 1)$ are fixed, and $\alpha$ is allowed to vary in a set $\mathcal{A} \subseteq S^{p-1}$. In this model, $r = 1$, $\rho$ is the canonical correlation, and $\alpha$ and $\beta_0$ are the left and right canonical covariates, respectively. Also, $P$ varies across $\mathcal{C}$ as $\alpha$ varies across $\mathcal{A}$. Moreover, $|\mathcal{C}| = |\mathcal{A}|$. Our main task boils down to choosing $\mathcal{A}$ carefully.

The idea behind choosing $\mathcal{A}$ is as follows. For any decoder, i.e., an estimator of the support, the chance of making an error increases when $\mathcal{A}$ is large. This can also be seen by noting that the right hand side of (12) increases as $|\mathcal{A}| = |\mathcal{C}|$ increases. However, even if we prefer a larger $\mathcal{A}$, we need to ensure that the KL divergence between the distributions in the resulting $\mathcal{C}$ is small. The reason is that, for a large $|\mathcal{A}|$, the right hand side of (12) can be small unless the KL divergence between the corresponding distributions in $\mathcal{C}$ is small. In other words, any decoder will face a challenge detecting the true support of $\alpha$ when there are many distributions to choose from, and these distributions are also close to each other in KL distance.

For part A of Theorem 2, we choose $\mathcal{A}$ in the following way. Letting

$$\alpha_0 = \Big(\underbrace{\tfrac{1}{\sqrt{s_x}}, \ldots, \tfrac{1}{\sqrt{s_x}}}_{s_x\ \text{many}},\ \underbrace{0, \ldots, 0}_{p - s_x\ \text{many}}\Big),$$

we let $\mathcal{A}$ be the class of $\alpha$'s which are obtained by replacing one of the $1/\sqrt{s_x}$'s in $\alpha_0$ by $0$, and one of the zeros in $\alpha_0$ by $1/\sqrt{s_x}$. A typical $\alpha$ obtained this way looks like

$$\alpha = \Big(\underbrace{\tfrac{1}{\sqrt{s_x}}, \ldots, 0, \tfrac{1}{\sqrt{s_x}}}_{s_x\ \text{many}},\ \underbrace{0, \ldots, \tfrac{1}{\sqrt{s_x}}, \ldots, 0}_{p - s_x\ \text{many}}\Big).$$

In this case, it turns out that $|\mathcal{A}| = s_x(p - s_x)$. Under the conditions of part A of Theorem 2, we can show that the RHS of (12) is bounded below by 1/2 for this $\mathcal{A}$. The proof of part A is similar to its PCA analogue, which is Theorem 3 of [30]. The latter theorem is also based on Fano's lemma and uses a similar construction for $\mathcal{A}$. However, there is no PCA analogue of part B. For part B of Theorem 2, we let $\mathcal{A}$ be the class of all $\alpha$'s of the form

$$\alpha = \Big(\underbrace{b, \ldots, b}_{s_x - 1\ \text{many}},\ \underbrace{0, \ldots, 0, z, 0, \ldots, 0}_{p - s_x + 1\ \text{many}}\Big),$$

where

$$z = \sqrt{\frac{1 - \rho^2}{4n\rho^2}\, \log\Big(\frac{p - s_x}{4}\Big)}$$

can take any position out of the $p - s_x + 1$ positions. Clearly, $|\mathcal{A}| = p - s_x + 1$. It can be shown that the RHS of (12) is bounded below by 1/2 in this case as well.
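To make the part A construction concrete, the R sketch below enumerates the packing set (denoted $\mathcal{A}$ above) for toy values of p and s_x; the names and dimensions are ours.

```r
# Sketch of the part-A packing set: alpha_0 has s_x entries equal to 1/sqrt(s_x);
# each member of the set zeroes out one of them and moves the mass to a
# coordinate outside the original support.
p <- 8; s_x <- 3
alpha0 <- c(rep(1 / sqrt(s_x), s_x), rep(0, p - s_x))
A_set <- list()
for (i in 1:s_x) {                    # coordinate of alpha0 set to zero
  for (j in (s_x + 1):p) {            # zero coordinate set to 1/sqrt(s_x)
    a <- alpha0; a[i] <- 0; a[j] <- 1 / sqrt(s_x)
    A_set[[length(A_set) + 1]] <- a
  }
}
length(A_set)                         # equals s_x * (p - s_x)
```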

C. Computational Limits and Low Degree Polynomials: Answer to Question 3

We have so far explored the information theoretic upper and lower bounds for recovering the true support of the leading canonical correlation directions. However, as indicated in the discussion preceding Question 3, the statistically optimal procedures in the regime where $\sqrt{n} \lesssim s_x, s_y \lesssim n/\log(p+q)$ are computationally intensive, with exponential complexity (as a function of $p$, $q$). In particular, [28] has already showed that when $s_x$ and $s_y$ belong to parts of this regime, estimation of the canonical correlates is computationally hard, subject to a computational complexity based Planted Clique Conjecture. For the case of support recovery, SPCA has been explored in detail and the corresponding computational hardness has been established in analogous regimes – see, e.g., [1], [30], and [35] for details. A similar phenomenon of computational hardness is observed in the case of the SPCA spike detection problem [54]. In light of the above, it is natural to believe that SCCA support recovery is also computationally hard in the regime $\sqrt{n} \lesssim s_x, s_y \lesssim n/\log(p+q)$, and, as a result, yields a statistical-computational gap. Although several paths exist to provide evidence towards such gaps 4, the recent developments using "Predictions from Low Degree Polynomials" [2], [39], [40] are particularly appealing due to their simplicity of exposition. In order to show computational hardness of the SCCA support recovery problem in the $s_x, s_y \in (\sqrt{n}, n/\log(p+q))$ regime, we shall resort to this very style of ideas, which has so far been applied successfully to explore statistical-computational gaps under sparse PCA [36], Stochastic Block Models, and tensor PCA [40], among others. This will allow us to explore the computational hardness of the problem in the entire regime where

$$s_x + s_y \gtrsim \sqrt{n}\,(\log n)^{c}, \tag{13}$$

compared to the somewhat partial results (see Remark 5 for detailed comparison) in earlier literature.

We divide our discussions to argue the existence of a statistical-computational gap in this regime as follows. Starting with a brief background on the statistical literature on such gaps, we first present a natural reduction of our problem to a suitable hypothesis testing problem in Section III-C1. Subsequently, in Section III-C2 we present the main idea of the “low degree polynomial conjecture” by appealing to the recent developments in [39], [40], and [2]. Finally, we present our main result for this regime in Section III-C3, thereby providing evidence of the aforementioned gap modulo the Low Degree Polynomial Conjecture presented in Conjecture 1.

1). Reduction to Testing Problem::

Denote by $Q$ the distribution of a $N_{p+q}(0, I_{p+q})$ random vector. Therefore $(X, Y) \sim Q$ corresponds to the case where $X$ and $Y$ are uncorrelated. We first show that support recovery in $\mathcal{P}(r, s_x, s_y, \mathcal{L})$ is possible only if $\mathcal{P}(r, s_x, s_y, \mathcal{L})$ is distinguishable from $Q$, i.e., only if the test of $H_0 : (X, Y) \sim Q$ vs. $H_1 : (X, Y) \sim P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})$ has asymptotically zero error.

To formalize the ideas, suppose we observe i.i.d. random vectors $\{X_i, Y_i\}_{i=1}^{n}$ which are distributed either as $P$ or $Q$. We denote the $n$-fold product measures corresponding to $P$ and $Q$ by $P_n$ and $Q_n$, respectively. Note that if $P \in \mathcal{P}(r, s_x, s_y, \mathcal{L})$, then $P_n \in \mathcal{P}(r, s_x, s_y, \mathcal{L})^{n}$. We overload notation, and denote the combined samples $\{X_i\}_{i=1}^{n}$ and $\{Y_i\}_{i=1}^{n}$ by $X$ and $Y$, respectively. In this section, $X$ and $Y$ should be viewed as unordered sets. The test $\Phi_n : \mathbb{R}^{n(p+q)} \to \{0, 1\}$ for testing the null $H_0 : (X, Y) \sim Q_n$ vs. the alternative $H_1 : (X, Y) \sim P_n$ is said to strongly distinguish $P_n$ and $Q_n$ if

$$\lim_{n \to \infty} Q_n\big(\Phi_n(X, Y) = 1\big) + \lim_{n \to \infty} P_n\big(\Phi_n(X, Y) = 0\big) = 0.$$

The above implies that both the type I error and the type II error of $\Phi_n$ converge to zero as $n \to \infty$. In the case of the composite alternative $H_1 : (X, Y) \sim P_n \in \mathcal{P}(r, s_x, s_y, \mathcal{L})^{n}$, the test strongly distinguishes $Q_n$ from $\mathcal{P}(r, s_x, s_y, \mathcal{L})^{n}$ if

$$\liminf_{n \to \infty} \Big\{ Q_n\big(\Phi_n(X, Y) = 1\big) + \sup_{P_n \in \mathcal{P}(r, s_x, s_y, \mathcal{L})^{n}} P_n\big(\Phi_n(X, Y) = 0\big) \Big\} = 0.$$

Now we explain how support recovery and the testing framework are connected. Suppose there exist decoders which exactly recover $D(U)$ and $D(V)$ uniformly over $\mathcal{P}(r, s_x, s_y, \mathcal{L})$ with probability tending to one. Then the trivial test, which rejects the null if either of the estimated supports is non-empty, strongly distinguishes $Q_n$ from $\mathcal{P}(r, s_x, s_y, \mathcal{L})^{n}$. The above can be formalized as the following lemma.

Lemma 1. Suppose there exist polynomial time decoders $\hat{D}_x$ and $\hat{D}_y$ of $D(U)$ and $D(V)$ so that

$$\liminf_{n \to \infty}\ \inf_{P_n \in \mathcal{P}(r, s_x, s_y, \mathcal{L})^{n}} P_n\Big(\hat{D}_x(X, Y) = D(U) \ \text{and}\ \hat{D}_y(X, Y) = D(V)\Big) = 1. \tag{14}$$

Further assume $Q_n\big(\hat{D}_x(X, Y) = \emptyset\big) \to 1$ and $Q_n\big(\hat{D}_y(X, Y) = \emptyset\big) \to 1$. Then there exists a polynomial time test which strongly distinguishes $\mathcal{P}(r, s_x, s_y, \mathcal{L})^{n}$ and $Q_n$.

Thus, if a regime does not allow any polynomial time test for distinguishing $Q_n$ from $\mathcal{P}(r, s_x, s_y, \mathcal{L})^{n}$, there can be no polynomial time computable consistent decoder for $D(U)$ and $D(V)$. Therefore, it suffices to show that there is no polynomial time test which distinguishes $Q_n$ from $\mathcal{P}(r, s_x, s_y, \mathcal{L})^{n}$ in the regime $s_x, s_y \gtrsim \sqrt{n}$. To be more explicit, we want to show that if $s_x, s_y \gtrsim \sqrt{n}$, then

$$\liminf_{n \to \infty} \Big\{ Q_n\big(\Phi_n(X, Y) = 1\big) + \sup_{P_n \in \mathcal{P}(r, s_x, s_y, \mathcal{L})^{n}} P_n\big(\Phi_n(X, Y) = 0\big) \Big\} > 0 \tag{15}$$

for any Φn that is computable in polynomial time.

The testing problem under concern is commonly known as the CCA detection problem, owing to its alternative formulation as H0:Λ1=0 vs. H1:Λ1>0. In other words, the test tries to detect if there is any signal in the data. Note that, Lemma 1 also implies that detection is an easier problem than support recovery in that the former is always possible whenever the latter is feasible. The opposite direction may not be true, however, since detection does not reveal much information on the support.

2). Background on the Low-degree Framework:

We shall provide a brief introduction to the low-degree polynomial conjecture which forms the basis of our analyses here, and refer the interested reader to [39], [40], and [2] for in-depth discussions on the topic. We will apply this method in the context of the test $H_0 : (X, Y) \sim Q_n$ vs. $H_1 : (X, Y) \sim P_n$. The low-degree method centers around the likelihood ratio $L_n$, which takes the form $dP_n/dQ_n$ in the above framework. Our key tool here will be the Hermite polynomials, which form a basis system of $L_2(Q_n)$ [62]. Central to the low-degree approach lies the projection of $L_n$ onto the subspace (of $L_2(Q_n)$) formed by the Hermite polynomials of degree at most $D_n \in \mathbb{N}$. The latter projection, to be denoted by $L_n^{\le D_n}$ from now on, is important because it measures how well polynomials of degree at most $D_n$ can distinguish $P_n$ from $Q_n$. In particular,

$$\big\|L_n^{\le D_n}\big\|_{L_2(Q_n)} = \max_{f :\, \deg f \le D_n} \frac{E_{P_n}[f(X, Y)]}{\sqrt{E_{Q_n}[f(X, Y)^2]}}, \tag{16}$$

where the maximization is over polynomials f:Rn(p+q)R of degree at most Dn [36].
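As background, the short R sketch below constructs the univariate normalized (probabilists') Hermite polynomials through the standard three-term recursion and checks their orthonormality under N(0,1) by Monte Carlo; it is purely illustrative and not part of the paper's argument.

```r
# Normalized Hermite polynomials via He_{k+1}(x) = x He_k(x) - k He_{k-1}(x),
# with hhat_k = He_k / sqrt(k!), orthonormal in L2 of the standard Gaussian.
hermite_norm <- function(x, k) {
  He_prev <- rep(1, length(x))                  # He_0
  if (k == 0) return(He_prev)
  He_curr <- x                                  # He_1
  if (k >= 2) for (j in 1:(k - 1)) {
    He_next <- x * He_curr - j * He_prev
    He_prev <- He_curr; He_curr <- He_next
  }
  He_curr / sqrt(factorial(k))
}

set.seed(2)
z <- rnorm(1e5)
round(c(mean(hermite_norm(z, 2)^2),                      # approx 1
        mean(hermite_norm(z, 2) * hermite_norm(z, 3))),  # approx 0
      2)
```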

The $L_2(Q_n)$ norm of the untruncated likelihood ratio $L_n$ has long held an important place in the theory of hypothesis testing, since $\|L_n\|_{L_2(Q_n)} = O(1)$ implies $P_n$ and $Q_n$ are asymptotically indistinguishable. While the untruncated likelihood ratio $L_n$ is connected to the existence of any distinguishing test, degree-$D_n$ projections of $L_n$ are connected to the existence of polynomial time distinguishing tests. The implications of the above heuristics are made precise by the following conjecture [40, Hypothesis 2.1.5].

Conjecture 1 (Informal). Suppose $t : \mathbb{N} \to \mathbb{N}$. For "nice" sequences of distributions $P_n$ and $Q_n$, if $\|L_n^{\le D_n}\|_{L_2(Q_n)} = O(1)$ as $n \to \infty$ whenever $D_n \le t(n)\,\mathrm{polylog}(n)$, then there is no time-$n^{t(n)}$ test $\Phi_n : \mathbb{R}^{n(p+q)} \to \{0, 1\}$ that strongly distinguishes $P_n$ and $Q_n$.

Thus Conjecture 1 implies that the degree-$D_n$ polynomial $L_n^{\le D_n}$ is a proxy for time-$n^{\tilde{\Theta}(D_n)}$ algorithms [2]. If we can show that $\|L_n^{\le D_n}\|_{L_2(Q_n)} = O(1)$ for a $D_n$ of the order $(\log n)^{1+\epsilon}$ for some $\epsilon > 0$, then the low degree conjecture says that no polynomial time test can strongly distinguish $P_n$ and $Q_n$ [2, Conjecture 1.16].

Conjecture 1 is informal in the sense that we do not specify the "nice" distributions, which are defined in Section 4.2.4 of [2] (see also Conjecture 2.2.4 of [40]). Niceness requires $P_n$ to be sufficiently symmetric, which is generally guaranteed in naturally occurring high dimensional problems like ours. The condition of "niceness" is intended to eliminate pathological cases where the testing can be made easier by methods like Gaussian elimination. See [40] for more details.

3). Main Result:

Similar to [36], we will consider a Bayesian framework. It might not be immediately clear how a Bayesian formulation will fit into the low-degree framework, and lead to (15). However, the connection will be clear soon. We put independent Rademacher priors πx and πy on α and β. We say απx if α1,,αp are i.i.d., and for each i[p],

$$\alpha_i = \begin{cases} \dfrac{1}{\sqrt{s_x}} & \text{w.p. } \dfrac{s_x}{2p}, \\[4pt] -\dfrac{1}{\sqrt{s_x}} & \text{w.p. } \dfrac{s_x}{2p}, \\[4pt] 0 & \text{w.p. } 1 - \dfrac{s_x}{p}. \end{cases} \tag{17}$$

The Rademacher prior πy can be defined similarly. We will denote the product measure πx×πy by π. Let us define

$$\Sigma(\alpha, \beta, \rho) = \begin{bmatrix} I_p & \rho\, \alpha\beta^T \\ \rho\, \beta\alpha^T & I_q \end{bmatrix}, \qquad \alpha \in \mathbb{R}^p,\ \beta \in \mathbb{R}^q,\ \rho > 0. \tag{18}$$

When $\rho\|\alpha\|_2\|\beta\|_2 < 1$, $\Sigma(\alpha, \beta, \rho)$ is the covariance matrix corresponding to $X \sim N_p(0, I_p)$ and $Y \sim N_q(0, I_q)$ with covariance $\mathrm{cov}(X, Y) = \rho\, \alpha\beta^T$. Hence, for $\Sigma(\alpha, \beta, \rho)$ to be positive definite, $\|\alpha\|_2\|\beta\|_2 < 1/\rho$ is a sufficient condition. The priors $\pi_x$ and $\pi_y$ put positive weight on $\alpha$ and $\beta$ that do not lead to a positive definite $\Sigma(\alpha, \beta, \rho)$, which calls for extra care during the low-degree analysis. This subtlety is absent in the sparse PCA analogue [36].
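The R sketch below draws α and β from the Rademacher priors in (17), assembles Σ(α, β, ρ) as in (18), and checks positive definiteness through the smallest eigenvalue; the dimensions and the value of ρ are arbitrary choices of ours.

```r
# Draw alpha ~ pi_x and beta ~ pi_y as in (17), build Sigma(alpha, beta, rho)
# as in (18), and check whether it is positive definite.
rprior <- function(p, s) sample(c(-1, 1, 0) / sqrt(s), p, replace = TRUE,
                                prob = c(s / (2 * p), s / (2 * p), 1 - s / p))
set.seed(3)
p <- 30; q <- 20; s_x <- 5; s_y <- 4; rho <- 0.3
alpha <- rprior(p, s_x); beta <- rprior(q, s_y)
Sigma <- rbind(cbind(diag(p), rho * tcrossprod(alpha, beta)),
               cbind(rho * tcrossprod(beta, alpha), diag(q)))
c(norm_product = sqrt(sum(alpha^2)) * sqrt(sum(beta^2)),   # ||alpha||_2 ||beta||_2
  min_eigenvalue = min(eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values))
# Positive definiteness fails exactly when rho * ||alpha||_2 * ||beta||_2 >= 1.
```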

Let us define

$$P_{\alpha,\beta} = \begin{cases} N\big(0, \Sigma(\alpha, \beta, 1/\mathcal{L})\big) & \text{when } \|\alpha\|_2\|\beta\|_2 < \mathcal{L}, \\ Q & \text{otherwise.} \end{cases} \tag{19}$$

We denote the $n$-fold product measure corresponding to $P_{\alpha,\beta}$ by $P_{n,\alpha,\beta}$. If $(X, Y) \mid \alpha, \beta \sim P_{n,\alpha,\beta}$, then the marginal density of $(X, Y)$ is $E_{\alpha \sim \pi_x, \beta \sim \pi_y}\, dP_{n,\alpha,\beta}$. The following lemma, which is proved in Appendix H-C, explains how the Bayesian framework is connected to (15).

Lemma 2. Suppose $\mathcal{L} > 2$ and $s_x, s_y \to \infty$. Then

$$\liminf_{n \to \infty}\ \sup_{P_n \in \mathcal{P}_G(r, 2s_x, 2s_y, \mathcal{L})^{n}} P_n\big(\Phi_n(X, Y) = 0\big) \ \ge\ \liminf_{n \to \infty}\ E_\pi\, P_{n,\alpha,\beta}\big(\Phi_n(X, Y) = 0\big),$$

where Eπ is the shorthand for Eαπx,βπy.

Note that a similar result holds for $\mathcal{P}(r, s_x, s_y, \mathcal{L})$ as well because $\mathcal{P}_G(r, s_x, s_y, \mathcal{L}) \subseteq \mathcal{P}(r, s_x, s_y, \mathcal{L})$. Lemma 2 implies that to show (15), it suffices to show that a polynomial time computable $\Phi_n$ fails to strongly distinguish the marginal distribution of $X$ and $Y$ from $Q_n$. However, the latter falls within the realm of the low degree framework because the corresponding likelihood ratio takes the form

$$L_n = E_{\alpha \sim \pi_x, \beta \sim \pi_y}\, \frac{dP_{n,\alpha,\beta}}{dQ_n}(X, Y). \tag{20}$$

Using priors on the alternative space is a common trick to convert a composite alternative to a simple alternative, which generally yields more easily to various mathematical tools.

If we can show that $\|L_n^{\le D_n}\|^2_{L_2(Q_n)} = O(1)$ for some $D_n = \omega(\log n)$, then Conjecture 1 would indicate that an $n^{\tilde{\Theta}(D_n)}$-time computable $\Phi_n$ fails to distinguish the mixture $E_{\alpha \sim \pi_x, \beta \sim \pi_y}\, P_{n,\alpha,\beta}$ from $Q_n$. Theorem 3 accomplishes the above under some additional conditions on $p$, $q$, and $n$, which we will discuss shortly. Theorem 3 is proved in Appendix E.

Theorem 3. Suppose $D_n \le \min(p, q, n)$,

$$s_x, s_y \ge e\sqrt{nD_n} \quad \text{and} \quad p, q \ge 3en^2. \tag{21}$$

Then $\|L_n^{\le D_n}\|^2_{L_2(Q_n)}$ is $O(1)$, where $L_n$ is as defined in (20).

The following Corollary results from combining Lemma 2 with Theorem 3.

Corollary 1. Suppose

$$s_x, s_y \ge 2e\sqrt{nD_n} \quad \text{and} \quad p, q \ge 3en^2. \tag{22}$$

If Conjecture 1 is true, then for $D_n \le \min(p, q, n)$, there is no time-$n^{\tilde{\Theta}(D_n)}$ test that strongly distinguishes $\mathcal{P}_G(r, s_x, s_y, \mathcal{L})^{n}$ and $Q_n$.

Corollary 1 conjectures that polynomial time algorithms cannot strongly distinguish $\mathcal{P}_G(r, s_x, s_y, \mathcal{L})^{n}$ and $Q_n$ provided $s_x$, $s_y$, $p$, and $q$ satisfy (22). Therefore, under (22), Lemma 1 implies that support recovery is conjecturally NP hard.

Now we discuss condition (22). The first constraint in (22) is expected because it ensures $s_x, s_y \gtrsim \sqrt{n}$, which indicates that the sparsity is in the hard regime. We need to explain why the other constraint $p, q > 3en^2$ is needed. If $n \gg p, q$, the sample canonical correlations are consistent, and therefore strong separation is possible in polynomial time without any restriction on the sparsity [23], [41]. Even if $p \asymp n^{c_1}$ and $q \asymp n^{c_2}$ for some $c_1, c_2 \in (0, 1)$, strong separation is still possible in model (18) provided the canonical correlation $\rho$ is larger than some threshold depending on $c_1$ and $c_2$ [23]. The restriction $p, q > 3en^2$ ensures that the problem is hard enough so that vanilla CCA does not lead to successful detection. The constant $3e$ is not sharp and can possibly be improved. The necessity of the condition $p, q \gtrsim n^2$ is unknown for support recovery, however. Since support recovery is a harder problem than detection, in the hard regime, polynomial time support recovery algorithms may fail under a weaker condition on $n$, $p$, and $q$.

Remark 5. [Comparison with previous work:] As mentioned earlier, [28] was the first to discover the existence of a computational gap in the context of sparse CCA. In their seminal work, [28] established the computational hardness of the CCA estimation problem in a particular subregime of $s_x, s_y \gtrsim \sqrt{n/\log(p+q)}$, provided $\mathcal{L}$ is allowed to diverge. In view of the above, it was hinted that sparse CCA becomes computationally hard when $s_x, s_y \gtrsim \sqrt{n/\log(p+q)}$. However, when $\mathcal{L}$ is bounded, the entire regime $s_x, s_y \gtrsim \sqrt{n/\log(p+q)}$ is probably not computationally hard. In Section III-D, we show that if $p + q \lesssim n$, then both polynomial time estimation and support recovery are possible if $s_x + s_y \lesssim \sqrt{n}$, at least in the known $\Sigma_x$ and $\Sigma_y$ case. The latter sparsity regime can be considerably larger than $s_x, s_y \lesssim \sqrt{n/\log(p+q)}$. Together, Section III-D and the current section indicate that in the bounded $\mathcal{L}$ case, the transition of computational hardness for sparse CCA probably happens at the sparsity level $\sqrt{n}$, not $\sqrt{n/\log(p+q)}$, which is consistent with sparse PCA. Also, the low-degree polynomial conjecture allowed us to explore almost the entire targeted regime $s_x, s_y \gtrsim \sqrt{n}$, where [28], who used the planted clique conjecture, considered only a subregime of $s_x, s_y \gtrsim \sqrt{n/\log(p+q)}$.

We will end the current section with a brief outline of the proof of Theorem 3.

The main idea behind the proof of Theorem 3:

Let us denote by $\Pi_n^{\le D_n}$ the linear span of all $n(p+q)$-variate Hermite polynomials of degree at most $D_n$. For each $z \in \mathbb{Z}^m$ and $y \in \mathbb{R}^m$, we let $\hat{H}_z(y) = \prod_{i=1}^{m} \hat{h}_{z_i}(y_i)$, where $\hat{h}_{z_i}$ is the univariate normalized Hermite polynomial of degree $z_i$. We will discuss the Hermite polynomials in greater detail in Appendix E. Any normalized $m$-variate Hermite polynomial is of the form $\hat{H}_z$ with $z \in \mathbb{Z}^m$. Then $\Pi_n^{\le D_n}$ is the linear span of all $\hat{H}_w$'s with

$$w \in \mathcal{C}_l \equiv \Big\{ z \in \mathbb{Z}^{n(p+q)} : \sum_{i=1}^{n(p+q)} z_i \le D_n \Big\}.$$

Since LnDn is the projection of Ln on ΠnDn, it then holds that

LnDnL2(Qn)2=w𝒞lLn,H^wL2(Qn)2.

The first step of the proof is to find out the expression of Ln,H^wL2(Qn)n. Since wZn(p+q), we can partition w into w = (w1, … , wn), where wiZp+q for each i ∈ [n]. Using some algebra, we can show that

Ln,H^wL2(Qn)=Eπ[i[n]E(Xi,Yi)Pα,β[H^wi(Xi,Yi)]].

Exploiting the properties of Hermite polynomials, it can be shown that

E(Xi,Yi)Pα,β[H^wi(Xi,Yi)]=1{α2β2<}j=1p+q(wi)j!×iwi(exp{12tT(Σ(α,β,1)Ip+q)t})t=(0,,0),

where for zZp+q, tRp+q, and any function f:Rp+qR, the notation tz(f(t))t=(0,,0) stands for the z-th order partial derivative of f with respect to t evaluated at the origin. The rest of the proof is similar to the PCA analogue in [36], but there is an extra indicator term for the CCA case. Following [36], we use the common trick of using replicas of α and β to simplify the algebra. Suppose α1, α2πx and β1, β2πy are independent. Let W be the indicator function of the event (α1Tα2)(β1Tβ2)<2. Denote by ((1x)n)p the p-th order truncation of the Taylor series expansion of (1x)n at x = 0. Following some algebra, it can be shown that

LnDnL2(Qn)2=Eπ[W{(12(α1Tα2)(β1Tβ2))n}Dn2].

Comparing the above with the analogous result for PCA, namely Lemma 4.2 of [36], we note that the indicator term W does not appear in the PCA analogue. The indicator term W appears in the CCA case because we had set Pα,β to be Q for α2β2> to tackle the extra restrictions on α and β in this case.

D. A Polynomial Time Algorithm for the √(n∕log(p+q)) ≲ sx, sy ≲ √n Regime: Answer to Question 4

In this subsection, we show that in the difficult regime sx+sy ∈ [√(n∕log(p+q)), √n], using a soft co-ordinate thresholding (CT) type algorithm, we can estimate the canonical directions consistently when p+q ≲ n. CT was introduced in the seminal work of [50] for the purpose of estimating high dimensional covariance matrices. For SPCA, the CT algorithm of [1] is the only algorithm that provably recovers the full support in the difficult regime (see also [35]). In the context of CCA, [26] uses CT for partial support recovery in the rank one model under what we referred to as the easy regime. However, [26]'s main goal was the estimation of the leading canonical vectors, not support recovery. As a result, [26] detects the support of the relatively large elements of the leading canonical directions, which are subsequently used to obtain consistent preliminary estimators of the leading canonical directions. Our thresholding level and theoretical analysis differ from those of [26] because the analytical tools used in the easy regime do not work in the difficult regime.

1). Methodology: Estimation via CT:

By “thresholding a matrix A co-ordinate-wise”, we will roughly mean the process of assigning the value zero to any element of A which is below a certain threshold in absolute value. Similar to [1], we will consider the soft thresholding operator, which, at threshold level t, takes the form

η(x, t) = x − t if x > t;  η(x, t) = 0 if ∣x∣ ≤ t;  η(x, t) = x + t if x < −t.

It is worth noting that the soft thresholding operator x ↦ η(x, t) is continuous.
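As a concrete illustration, the following minimal NumPy sketch implements the entrywise soft thresholding operator just defined; the function name soft_threshold is ours and not from the paper.

import numpy as np

def soft_threshold(A, t):
    """Entrywise soft thresholding: eta(x, t) = x - t if x > t, 0 if |x| <= t, x + t if x < -t."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

# Small example: entries with magnitude below t = 0.5 are set to zero,
# and the remaining entries are shrunk towards zero by 0.5.
A = np.array([[0.2, -0.7, 1.3],
              [0.4,  0.0, -0.3]])
print(soft_threshold(A, 0.5))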

Algorithm 2 Coordinate Thresholding (CT) for estimating D(V)
Input: 1) Sample covariance matrices Σ̂n,xy(1) and Σ̂n,xy(2) based on the samples O1 = (xi, yi), i = 1, …, ⌊n∕2⌋, and O2 = (xi, yi), i = ⌊n∕2⌋+1, …, n, respectively. 2) Variances Σx and Σy. 3) Parameters Thr and cut. 4) r, i.e., the rank of Σxy.
Output: D̂(V).
1) Peeling: calculate Σ̃xy = Σx⁻¹ Σ̂n,xy(1) Σy⁻¹.
2) Threshold: perform soft thresholding x ↦ η(x; Thr∕√n) entrywise on Σ̃xy to obtain the thresholded matrix η(Σ̃xy).
3) Sandwich: η(Σ̃xy) ↦ Σx^{1∕2} η(Σ̃xy) Σy^{1∕2}.
4) SVD: find Ûpre, the matrix of the leading r left singular vectors of Σx^{1∕2} η(Σ̃xy) Σy^{1∕2}.
5) Premultiply: set Û(1) = Σx^{−1∕2} Ûpre.
return: RecoverSupp(Û(1), cut, Σy⁻¹, Σ̂n,xy(2), r), where RecoverSupp is given by Algorithm 1.

We will also assume that the covariance matrices Σx and Σy are known. To understand the difficulty of unknown Σx and Σy, we remind the readers that Σxy=ΣxUΛVTΣy. Because the matrices U and V are sandwiched between the matrices Σx and Σy, their sparsity pattern does not get reflected in the sparsity pattern of Σxy. Therefore, if one blindly applies CT to Σ^n,xy, they can at best hope to recover the sparsity pattern of the outer matrices Σx and Σy. If the supports of the matrices U and V are of main concern, CT should rather be applied on the matrix Σ~xy=Σx1Σ^n,xyΣy1. If Σx and Σy are unknown, one needs to efficiently estimate Σ~xy before the application of CT. Although under certain structural conditions, it is possible to find rate optimal estimators Σ^n,x1 and Σ^n,y1 of Σx1 and Σy1 at least in theory, the errors (Σ^n,x1Σx1)Σ^n,xyΣy1op and Σx1Σ^n,xy(Σ^n,y1Σy1)op may still blow up due to the presence of the high dimensional matrix Σ^n,xy, which can be as big as O((p+q)n) in operator norm. One may be tempted to replace Σ^n,xy with a sparse estimator of Σxy to facilitate faster estimation, but that does not work because we explicitly require the formulation of Σ^n,xy as the sum of Wishart matrices (see equation 37 in the proof). The latter representation, which is critical for the sharp analysis, may not be preserved by a CLIME [51] or nodewise Lasso estimator [49] of Σxy.

We remark in passing that it is possible to obtain an estimator A^ so that A^Σ~xy=op(1). Although the latter does not provide much control over the operator norm of A^Σ~xy, it is sufficient for partial support recovery, e.g., the recovery of the rows of U or V with strongest signals. (See Appendix B of [26] for example, for some results in this direction under the easy regime when r = 1.)

As indicated by the previous paragraph, we apply coordinate thresholding to the matrix Σ~xy=Σx1Σ^n,xyΣy1, which directly targets the matrix Σx1ΣxyΣy1=UΛVT. We call this step the peeling step because it extracts the matrix Σ~xy from the sandwiched matrix Σ^n,xy=ΣxΣ~xyΣy. We then perform the entry-wise co-ordinate thresholding algorithm on the peeled form Σ~xy with threshold Thr so as to obtain η(Σ~xy;Thrn). We postpone the discussion on Thr to Section III-D2. The thresholded matrix is an estimator of Σx1ΣxyΣy1, but we need an estimator of Σx12ΣxyΣy12. Therefore, we again sandwich Σ~xy between Σx12 and Σy12. The motivation behind this sandwiching is that if Σ~xyΣx1ΣxyΣy1op=ϵn, then Σx12Σ~xyΣy12 is a good estimator of Σx12ΣxyΣy12 in that

Σx12Σ~xyΣy12Σx12ΣxyΣy12opΣxopΣyopϵnϵn.

Note that Σx12UΛVTΣy12 is an SVD of Σx12ΣxyΣy12. Using the Davis-Kahan sin theta theorem [52], one can show that the SVD of Σx12Σ~xyΣy12 produces estimators U^ and V^ of Σx12U and Σy12V, where the columns of U^ and V^ are ϵn-consistent in l2 norm for the columns of Σx12U and Σy12V, respectively, up to a sign flip (cf. Theorem 2 of [52]). Pre-multiplying the resulting Ûpre by Σx^{−1∕2} yields an estimator U^ of U up to a sign flip of the columns. We do not worry about the sign flip because Condition 1 allows for sign flips of the columns. Therefore, we feed this U^ into RecoverSupp as our final step. See Algorithm 2 for more details.
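The NumPy sketch below mirrors the peel, threshold, sandwich, SVD, and premultiply steps described above for known Σx and Σy; the function name ct_estimate and the argument names are ours, the threshold constant Thr is taken as given, and the final RecoverSupp cleaning step is omitted.

import numpy as np
from numpy.linalg import inv, svd
from scipy.linalg import sqrtm

def soft_threshold(A, t):
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def ct_estimate(S_xy_hat, Sigma_x, Sigma_y, r, Thr, n):
    """Coordinate-thresholding estimate of the left canonical directions (a sketch of Algorithm 2).

    S_xy_hat : sample cross-covariance based on the first half of the sample (p x q)
    Sigma_x, Sigma_y : known marginal covariance matrices
    r : assumed rank of Sigma_xy; Thr : threshold constant; n : sample size
    """
    # 1) Peel: Sigma_x^{-1} S_xy_hat Sigma_y^{-1}, whose support reflects that of U Lambda V^T.
    S_tilde = inv(Sigma_x) @ S_xy_hat @ inv(Sigma_y)
    # 2) Threshold entrywise at level Thr / sqrt(n).
    S_thr = soft_threshold(S_tilde, Thr / np.sqrt(n))
    # 3) Sandwich between Sigma_x^{1/2} and Sigma_y^{1/2}.
    Sx_half = np.real(sqrtm(Sigma_x))
    Sy_half = np.real(sqrtm(Sigma_y))
    M = Sx_half @ S_thr @ Sy_half
    # 4) SVD: keep the leading r left singular vectors.
    U_pre, _, _ = svd(M)
    U_pre = U_pre[:, :r]
    # 5) Premultiply by Sigma_x^{-1/2} to obtain the preliminary estimator of U.
    return inv(Sx_half) @ U_pre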

Remark 6. In case of electronic health records data, it is possible to obtain large surrogate data on X and Y separately and thus might allow relaxing the known precision matrices assumption above. We do not pursue such semi-supervised setups here.

2). Analysis of the CT Algorithm:

For the asymptotic analysis of the CT algorithm, we will assume the underlying distribution to be Gaussian, i.e., P𝒫G(r,sx,sy,). This Gaussian assumption will be used to perform a crucial decomposition of sample covariance matrix, which typically holds for Gaussian random vectors. [1], who used similar devices for obtaining the sharp rate results in SPCA, also required a similar Gaussian assumption. We do not yet know how to extend these results to sub-Gaussian random vectors.

Let us consider the threshold Thr∕√n, where Thr is explicitly given in Theorem 4. Unfortunately, tuning Thr requires knowledge of the underlying sparsities sx and sy. Similar to [1], our thresholding level is different from the traditional choice of order O(√(log(p+q)∕n)) in the easy regime analyzed in [50], [63], and [26]. The latter level is too large to successfully recover all the nonzero elements in the difficult regime. We threshold Σ~xy at a lower level, which in turn complicates the analysis to a greater degree. Our main result in this direction, stated in Theorem 4, is proved in Appendix F.

Theorem 4. Suppose (Xi,Yi)P𝒫G(r,sx,sy,). Further suppose sx+sy<n, log(pq)=o(n), and logn=o(pq). Let K and C1 be constants so that K12884 and C1C4, where C > 0 is an absolute constant. Suppose the threshold level Thr is defined by

Thr = C1 √(log(p+q)),  if (sx+sy)² < 2^{1∕4}(p+q)^{3∕4}  (case i);
Thr = (K log((p+q)∕(sx+sy)²))^{1∕2},  if 2^{1∕4}(p+q)^{3∕4} ≤ (sx+sy)² ≤ (p+q)∕e  (case ii);
Thr = 0,  otherwise  (case iii). (21)

Suppose c is a constant that takes the value K, C1, or one in case (i), (ii), and (iii), respectively. Then there exists an absolute constant C > 0 so that the following holds with probability 1 − o(1) for Σ~xy=Σx1Σ^n,xyΣy1 :

‖η(Σ~xy; Thr∕√n) − Σx⁻¹ΣxyΣy⁻¹‖op ≤ C2(sx+sy)∕√n × max{(c log((p+q)∕(sx+sy)²))^{1∕2}, 1}.

To disentangle the statement of Theorem 4, let us assume p+q ≍ n for the time being. Then case (ii) in the theorem corresponds to n^{3∕4} ≲ (sx+sy)² ≲ n. Thus, CT works in the difficult regime provided p+q ≲ n. It should be noted that the thresholding level for this case is almost of the order O(1∕√n), which is much smaller than O(√(log(p+q)∕n)), the traditional threshold for the easy regime. Next, observe that case (i) is an easy case because sx+sy is much smaller than √n. Therefore, in this case, the traditional threshold of the easy regime works. Case (iii) includes the hard regime, where polynomial time support recovery is probably impossible. Because it is unlikely that CT can improve over the vanilla estimator Σ~xy in this regime, a threshold of zero is set.
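For concreteness, a sketch of the regime-dependent choice of Thr in Theorem 4 follows; we read case (i) as C1·√(log(p+q)), consistent with the thresholded level Thr∕√n matching the traditional √(log(p+q)∕n) rate, and the constants K and C1 passed in are assumed to satisfy the lower bounds of the theorem.

import numpy as np

def choose_Thr(sx, sy, p, q, K, C1):
    """Threshold constant Thr of Theorem 4, chosen according to the sparsity regime."""
    s2 = (sx + sy) ** 2
    if s2 < 2 ** 0.25 * (p + q) ** 0.75:        # case (i): low sparsity, traditional threshold
        return C1 * np.sqrt(np.log(p + q))
    elif s2 <= (p + q) / np.e:                  # case (ii): difficult regime, smaller threshold
        return np.sqrt(K * np.log((p + q) / s2))
    else:                                       # case (iii): hard regime, no thresholding
        return 0.0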

Remark 7. Theorem 4 requires logn=o(pq) because one of our concentration inequalities in the analysis of case (ii) needs this technical condition (see Lemma 8). The omitted regime log n > Cpq is in fact an easier one, where special methods like CT are not even required. In fact, it is well known that sub-Gaussian X and Y satisfy (cf. Theorem 4.7.1 of [42])

‖Σ̂n,xy − Σxy‖op ≤ C(((p+q)∕n)^{1∕2} + (p+q)∕n),

which is O(√(log n∕n)) in the regime under concern. Including this result in the statement of Theorem 4 would unnecessarily lengthen the exposition. Therefore, we decided to exclude this regime from Theorem 4 and focus on the sx+sy ≲ √(p+q) regime.

Remark 8. The statement of Theorem 4 is not explicit about the lower bound on the constant C1. However, our simulations show that the algorithm works for C1504. Both threshold parameters C1 and K in Theorem 4 depend on the unknown model bound. The proof actually shows that this bound can be replaced by max{Λmax(Σx), Λmax(Σy), Λmax(Σx1), Λmax(Σy1)}.

Finally, Theorem 4 leads to the following corollary, which establishes that in the difficult regime there exist estimators satisfying Condition 1, and that Algorithm 2 succeeds with probability tending to one provided p+q ≲ n. This answers Question 4 in the affirmative for Gaussian distributions.

Corollary 2. Instate the conditions of Theorem 4. Then there exists C>0 so that if

nCr(sx+sy)2max{log(p+q(sx+sy)2),1}, (23)

then the Û(1) defined in Algorithm 2 satisfies Condition 1, and inf over P ∈ 𝒫G(r,sx,sy,) of P(Algorithm 2 correctly recovers D(V)) → 1 as n → ∞.

We defer the proof of Corollary 2 to Appendix G. We will now present a brief outline of the proof of Theorem 4.

Main idea behind the proof of Theorem 4:

The proof hinges on the hidden variable representation of X and Y due to [64]. We discuss this representation in detail in Appendix C-2, which basically says the data matrices X and Y can be represented as

X=Z𝒲1T+Z11TandY=Z𝒲2T+Z22T,

where ZRn×r, Z1Rn×p, and Z2Rn×q are independent standard Gaussian data matrices, and 𝒲1=ΣxUΛ12, 𝒲2=ΣyVΛ12, 1=(Σx𝒲1𝒲1T)12, and 2=(Σy𝒲2𝒲2T)12. We will later show in Section C-2 that 1 and 2 are well defined positive definite matrices. It follows that Σ^n,xy=XTYn has the representation

Σ^n,xy=1n{𝒲1ZTZ𝒲2T+𝒲1ZTZ22+1TZ1TZ𝒲2T}{+1TZ1TZ22}.
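The following simulation sketch illustrates this hidden-variable representation; the function name and the requirement that (U, V, Λ) come from a valid CCA structure (UᵀΣxU = I, VᵀΣyV = I, Λ ≤ 1, so that the matrix square roots below exist) are assumptions on our part.

import numpy as np
from scipy.linalg import sqrtm

def simulate_hidden_variable(n, Sigma_x, Sigma_y, U, V, Lam, rng):
    """Draw data matrices X, Y via X = Z W1^T + Z1 T1 and Y = Z W2^T + Z2 T2."""
    p, r = U.shape
    q = V.shape[0]
    W1 = Sigma_x @ U @ np.diag(np.sqrt(Lam))       # W1 = Sigma_x U Lambda^{1/2}
    W2 = Sigma_y @ V @ np.diag(np.sqrt(Lam))       # W2 = Sigma_y V Lambda^{1/2}
    T1 = np.real(sqrtm(Sigma_x - W1 @ W1.T))       # (Sigma_x - W1 W1^T)^{1/2}
    T2 = np.real(sqrtm(Sigma_y - W2 @ W2.T))       # (Sigma_y - W2 W2^T)^{1/2}
    Z = rng.standard_normal((n, r))
    Z1 = rng.standard_normal((n, p))
    Z2 = rng.standard_normal((n, q))
    X = Z @ W1.T + Z1 @ T1
    Y = Z @ W2.T + Z2 @ T2
    return X, Y   # X.T @ Y / n then plays the role of the sample cross-covariance above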

Next, we define some sets. Let E1=i=1rD(ui), F1=[p]E1, E2=i=1rD(vi), and F2=[q]E2. Therefore E1 and E2 correspond to the supports, while F1 and F2 correspond to their complements. Now we partition [p] × [q] into the following three sets:

E=E1×E2,F=F1×F2, (24)

and

G=(F1×E2)(E1×F2). (25)

Therefore E is the set that contains the joint support. We can decompose Σ~xy as

Σ~xy=𝒫E{Σ~xy}S1+𝒫F{Σ~xy}S2+𝒫G{Σ~xy}S3. (26)

where 𝒫 is the projection operator defined in (4).

The usefulness of the decomposition in (26) is that S1, S2, and S3 have different supports, which enables us to write

η(Σ~xy)=η(S1)+η(S2)+η(S3).

We can therefore analyze the three terms η(S1), η(S2), and η(S3) separately. In general, the thresholding operator η is not linear: for matrices A and B, η(A+B) = η(A) + η(B) need not hold. It is the disjoint supports of S1, S2, and S3 that make the above identity valid here.

As indicated above, we analyze the operator norms of η(S1), η(S2), and η(S3) separately. Among S1, S2, and S3, S1 is the only matrix supported on E, the true support. The basic idea of the proof is to show that co-ordinate thresholding preserves the matrix S1 and kills off the other matrices S2 and S3, which contain the noise terms. S1 includes the matrix UΛ^{1∕2}ZᵀZΛ^{1∕2}Vᵀ∕n. Because ZᵀZ∕n concentrates around Ir by the Bai-Yin law (cf. Lemma 4.7.1 of [42]), UΛ^{1∕2}(ZᵀZ∕n)Λ^{1∕2}Vᵀ concentrates around UΛVᵀ. Therefore the analysis of η(S1) is relatively straightforward.

Most of the proof is devoted to showing that η(S2)op and η(S3)op are small, i.e., that co-ordinate thresholding kills off the noise terms. The difficulty arises because the threshold was kept smaller than the traditional threshold of order √(log(p+q)∕n) to adjust for the hard regime. Therefore the approaches of [50] or [28] do not work in this regime. The noise matrices S2 and S3 are sums of matrices of the form MZTZ1N, MZ1TZ2N, or MZTZ2N, or their transposes, where for the rest of this section, M and N should be understood as deterministic matrices of appropriate dimension, whose definition can change from line to line. Analyzing η(S2)op and η(S3)op essentially hinges on Lemma 8, which upper bounds the operator norm of matrices of the form η(MZ1TZ2N). The proof of Lemma 8 uses, among other tools, a sharp Gaussian concentration result from [1] (see Corollary 10 therein), and a generalized Chernoff inequality for dependent Bernoulli random variables [65]. Using Lemma 8, we can also upper bound operator norms of matrices of the form η(M1Z1TZN1+M2Z1TZ2N2) because M1Z1TZN1+M2Z1TZ2N2 can be represented as M3[ZZ1]TZ2N2 for some matrix M3 of appropriate dimension. Therefore, to show that η(S2)op and η(S3)op are small, Lemma 8 suffices, which completes the proof sketch.

The proof of Theorem 4 has similarities with the proof of the analogous result for PCA in [1] (see Theorem 1 therein). However, one main difference is that for PCA, the key instrument is the representation of X as the spiked model [44], which yields the representation

X=ZM+σZ1, (27)

where ZRn×r and Z1Rn×p are standard Gaussian data matrices, and MRr×p is a deterministic matrix. The analysis in PCA revolves around the sample covariance matrix Σ^n,x=XTXn, which, following (27), writes as

Σ^n,x=1n{MTZTZM+σZ1TZM+σMTZTZ1+σ2Z1TZ1}.

From the above representation, it can be shown that the analogues of S2 and S3 in the PCA case are sum of matrices of the form M1Z1TZ2 or their transposes. [1] uses an upper bound on η(Z1TZ2)op to bound the PCA analogue of η(S2)op and η(S3)op (see Proposition 13 therein). In contrast, we encounter terms of the form M1Z1TZ2N1 since CCA is concerned with XTY/n. To deal with these terms, we needed the upper bound result on η(M1Z1TZ2N1)op instead, which requires a separate elaborate proof. Although the basic idea behind bounding η(M1Z1TZ2N1)op and bounding η(Z1TZ2)op is similar, the proof of bounding η(M1Z1TZ2N1)op is more involved. For example, some independence structures are destroyed due to the pre and post multiplication by the matrices M1 and N1, respectively. We required concentration inequalities on dependent Bernoulli random variables to tackle the latter.

IV. Numerical Experiments

This section illustrates the performance of different polynomial time CCA support recovery methods as the sparsity transitions from the easy to the difficult regime. We base our demonstration on a Gaussian rank one model, i.e., (X, Y) are jointly Gaussian with cross-covariance matrix Σxy = ρ Σx α βᵀ Σy. For simplicity, we take p = q and sx = sy = s. In all our simulations, ρ is set to be 0.5, and α = α′∕√((α′)ᵀΣxα′), β = β′∕√((β′)ᵀΣyβ′), where

α′ = (1∕√s, …, 1∕√s, 0, …, 0),  β′ = (√(1 − (s−1)s^{−4∕3}), s^{−2∕3}, …, s^{−2∕3}, 0, …, 0)

are unit norm vectors; α′ has s nonzero entries, and β′ has one large entry followed by s−1 entries equal to s^{−2∕3}. Note that most elements of β are of order O(s^{−2∕3}), whereas a typical element of α is of order O(s^{−1∕2}). Therefore, we will refer to α and β as the moderate and the small signal case, respectively; a code sketch of this construction is given after the list of covariance scenarios below. For the population covariance matrices Σx and Σy of X and Y, we consider the following two scenarios:

  • A (Identity): Σx = Ip and Σy = Iq. Since p = q, they are essentially the same.

  • B (Sparse inverse): This example is taken from [28]. In this case, Σx⁻¹ = Σy⁻¹ are banded matrices whose entries are given by
    (Σx⁻¹)i,j = 1{i=j} + 0.65 × 1{∣i−j∣=1} + 0.4 × 1{∣i−j∣=2}.
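A sketch of this simulation design follows; the helper names are ours, the forms of α′ and β′ are our reconstruction described above, and the banded precision matrix simply codes the printed entries of scenario B.

import numpy as np

def make_directions(p, s):
    """Unnormalized directions: alpha' (moderate signal) and beta' (small signal)."""
    alpha_p = np.zeros(p)
    alpha_p[:s] = 1.0 / np.sqrt(s)                          # s entries of size s^{-1/2}
    beta_p = np.zeros(p)
    beta_p[0] = np.sqrt(1.0 - (s - 1) * s ** (-4.0 / 3.0))  # one large entry
    beta_p[1:s] = s ** (-2.0 / 3.0)                         # s - 1 entries of size s^{-2/3}
    return alpha_p, beta_p

def banded_precision(p):
    """Scenario B: (Sigma_x^{-1})_{ij} = 1{i=j} + 0.65 * 1{|i-j|=1} + 0.4 * 1{|i-j|=2}."""
    i, j = np.indices((p, p))
    return 1.0 * (i == j) + 0.65 * (np.abs(i - j) == 1) + 0.4 * (np.abs(i - j) == 2)

def normalize(v, Sigma):
    """Rescale v so that v^T Sigma v = 1."""
    return v / np.sqrt(v @ Sigma @ v)

p, s, rho = 100, 10, 0.5
Sigma_x = np.eye(p)                                   # scenario A; scenario B uses banded_precision(p) as the inverse
alpha_p, beta_p = make_directions(p, s)
alpha, beta = normalize(alpha_p, Sigma_x), normalize(beta_p, Sigma_x)
Sigma_xy = rho * Sigma_x @ np.outer(alpha, beta) @ Sigma_x   # rank-one cross-covariance with p = q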

Now we explain our common simulation scheme. We take the sample size n to be 1000, and consider three values of p: 100, 200, and 300. The highest value of p + q is thus 600, which is smaller than n but of comparable order. Our simulations indicate that all of the methods considered here require n to be considerably larger than p + q for the asymptotics to kick in at ρ = 0.5. We will later discuss this point in detail. We further let s∕√n vary in the set [0.01, 2]. To be more specific, we consider 16 equidistant points in [0.01, 2] for the ratio s∕√n.

Now we discuss the error metrics used to compare the performance of different support recovery methods. Type I and type II errors are commonly used to measure the performance of support recovery [1]. In case of the support recovery of α, we define the type I error to be the proportion of zero elements of α that appear in the estimated support D̂(α). Thus, we quantify the type I error for α by ∣D̂(α)∖D(α)∣∕(p−s). On the other hand, the type II error for α is the proportion of elements of D(α) that are absent from D̂(α), i.e., the type II error is quantified by ∣D(α)∖D̂(α)∣∕s. One can define the type I and type II errors corresponding to β similarly. Our simulations demonstrate that often the methods with low type I error exhibit a high type II error, and vice versa. In such situations, comparing the corresponding methods becomes difficult if one uses the type I and type II errors separately. Therefore, we consider a scaled Hamming loss type metric, which suitably combines the type I and type II errors. The symmetric Hamming error of estimating D(α) by D̂(α) is [66, Section 2.1]

1 − ∣D(α) ∩ D̂(α)∣ ∕ ∣D(α) ∪ D̂(α)∣.

Note that the above quantity is always bounded above by one. We can similarly define the symmetric Hamming distance between D(β) and D̂(β). Finally, the estimates of these three errors (type I, type II, and scaled Hamming loss) are obtained from 1000 Monte Carlo replications.
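A small helper computing the three error metrics just described; the symmetric Hamming error is coded as 1 − |D ∩ D̂| ∕ |D ∪ D̂|, which is our reading of the flattened formula above.

import numpy as np

def support_errors(true_support, est_support, p):
    """Type I error, type II error, and symmetric Hamming error for support recovery."""
    D, D_hat = set(true_support), set(est_support)
    s = len(D)
    type1 = len(D_hat - D) / (p - s)     # fraction of the p - s zero coordinates selected
    type2 = len(D - D_hat) / s           # fraction of the s signal coordinates missed
    hamming = 1.0 - len(D & D_hat) / len(D | D_hat)
    return type1, type2, hamming

# Example: p = 8, true support {0, 1, 2}, estimated support {1, 2, 3, 4}.
print(support_errors([0, 1, 2], [1, 2, 3, 4], p=8))   # (0.4, 0.333..., 0.6)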

Now we discuss the support recovery methods we compare here.

  • Naïve SCCA. We estimate α and β using the SCCA method of [24], and set D^(α)={i[p]:α^i0} and D^(β)={i[q]:β^i0}, where α^ and β^ are the corresponding SCCA estimators. To implement the SCCA method of [24], we use the R code referred therein with default tuning parameters.

  • Cleaned SCCA. This method implements RecoverSupp with the above mentioned SCCA estimators of α and β as the preliminary estimators.

  • CT. This is the method outlined in Algorithm 2, which is RecoverSupp coupled with the CT estimators of α and β.

Our CT method requires knowledge of the population covariance matrices Σx and Σy. Therefore, to keep the comparison fair, in case of the cleaned SCCA method as well, we implement RecoverSupp with the population covariance matrices. Because of their reliance on RecoverSupp, both cleaned SCCA and CT depend on the threshold cut, tuning which appears to be a non-trivial task. We set cut=Clog(p+q)s(Σx1)n, where C is the thresholding constant. Our simulations show that a large C results in a high type II error, whereas insufficient thresholding inflates the type I error. Taking the Hamming loss into account, we observe that C ≈ 1 leads to a better overall performance in case A. On the other hand, case B requires a smaller value of the thresholding parameter. In particular, we let C be one in case A, and set C = 0.05 and 0.2, respectively, for the support recovery of α and β in case B. The CT algorithm requires an extra threshold parameter, namely the parameter Thr in Algorithm 2, which corresponds to the co-ordinate thresholding step. We set Thr in accordance with Theorem 4 and Remark 8, with K being 12884 and C1 being 504. We set the bound parameter as in Remark 8, that is

=max{Λmax(Σx),Λmax(Σy),Λmax(Σx1),Λmax(Σy1)}.

The errors incurred by our methods in case A are displayed in Figure 2 (for α) and Figure 3 (for β). Figures 4 and 5, on the other hand, display the errors in the recovery of α and β, respectively, in case B.

Fig. 2: Support recovery for α when Σx=Ip and Σy=Iq. Here threshold refers to cut in Theorem 1.

Fig. 3: Support recovery for β when Σx=Ip and Σy=Iq. Here threshold refers to cut in Theorem 1.

Fig. 4: Support recovery for α when Σx and Σy are the sparse covariance matrices. Here threshold refers to cut in Theorem 1.

Fig. 5: Support recovery for β when Σx and Σy are the sparse covariance matrices. Here threshold refers to cut in Theorem 1.

Now we discuss the main observations from the above plots. When the sparsity parameter s is considerably low (less than ten in the current settings), the naïve SCCA method is sufficient in the sense that the specialized methods do not perform any better. Moreover, the naïve method is the most conservative among the three methods. As a consequence, the associated type I error is always small, although the type II error of the naïve method grows faster than that of any other method. The specialized methods are able to improve the type II error at the cost of a higher type I error. At a higher sparsity level, however, the specialized methods can outperform the naïve method in terms of the Hamming error. This is most evident when the setting is also complex, i.e., the signal is small or the underlying covariance matrices are not identity. In particular, Figures 2 and 4 show that when the signal strength is moderate and the sparsity is high, cleaned SCCA has the lowest Hamming error. In the small signal case, however, CT exhibits the best Hamming error as s∕√n increases; cf. Figures 3 and 5.

The type I error of CT can be slightly improved if the sparsity information is incorporated during the thresholding step. We simply replace cut by the maximum of cut and the s-th largest element of V^clean, where the latter is as in Algorithm RecoverSupp. See, e.g., Figure 6, which shows that this modification reduces the Hamming error of the CT algorithm in case A. Our empirical analysis hints that the CT algorithm has potential for improvement from the implementation perspective. In particular, it may be desirable to obtain a more efficient procedure for choosing cut in a systematic way. However, such a detailed numerical analysis is beyond the scope of the current paper and will require further modifications of the initial methods for estimating α and β, both for scalability and finite sample performance reasons. We leave these explorations as important future directions.
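A minimal sketch of the sparsity-aware modification just described: cut is replaced by the larger of cut and the s-th largest row maximum of |V^clean|; interpreting "the s-th largest element" through row maxima is our reading (for r = 1 it is simply the s-th largest absolute entry of the cleaned vector).

import numpy as np

def sparsity_aware_cut(V_clean, cut, s):
    """Return max(cut, s-th largest row maximum of |V_clean|)."""
    V = np.abs(np.asarray(V_clean, dtype=float))
    row_max = V if V.ndim == 1 else V.max(axis=1)     # one value per coordinate of beta
    s_th_largest = np.sort(row_max)[::-1][s - 1]
    return max(cut, s_th_largest)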

Fig. 6: Support recovery by the CT algorithm when we use the information on sparsity to improve the type I error. Here Σx and Σy are Ip and Iq, respectively, and threshold refers to cut in Theorem 1. To see the decrease in type I error, compare the errors with those of Figure 2 and Figure 3.

It is natural to wonder about the effect of cleaning via RecoverSupp on SCCA. As mentioned earlier, during our simulations we observed that a cleaning step generally improves the type II error of the naïve SCCA, but it also increases the type I error. In terms of the combined measure, i.e., the Hamming error, it turns out that cleaning does have an edge at higher sparsity levels in case B; cf. Figure 4 and Figure 5. However, the scenario is different in case A. Although Figures 2 and 3 indicate that almost no cleaning occurs at the set threshold level of one, we saw that cleaning happens at lower threshold levels. However, the latter does not improve the overall Hamming error of naïve SCCA. The consequence of cleaning may be different for other SCCA methods.

To summarize, when the sparsity is low, support recovery using the naïve SCCA is probably as good as the specialized methods. However, at higher sparsity level, specialized support recovery methods may be preferable. Consequently, the precise analysis of the apparently naïve SCCA will indeed be an interesting future direction.

V. Discussion

In this paper, we have discussed the rate optimal behavior of the information theoretic and computational limits of joint support recovery for the sparse canonical correlation analysis problem. Inspired by recent results in the estimation theory of sparse CCA, a flurry of results in sparse PCA, and related developments based on the low-degree polynomial conjecture, we are able to paint a complete picture of the landscape of support recovery for SCCA. For future directions, it is worth noting that our results are so far not designed to recover D(vi) for individual i ∈ [r] separately (hence the term joint recovery). Although this is also the case for most state of the art results on the sparse PCA problem (results often exist only for the combined support [1] or the single spike model where r = 1 [29]), we believe that it is an interesting question for deeper exploration in the future. Moreover, moving beyond asymptotically exact recovery of the support to more nuanced metrics (e.g., Hamming loss) will also require new ideas worth studying. Finally, it remains an interesting question whether polynomial time support recovery is possible in the √(n∕log(p+q)) ≲ sx, sy ≲ √n regime using a CT type idea, but with unknown yet structured high dimensional nuisance parameters Σx and Σy.

Acknowledgments

This work was supported by National Institutes of Health grant P42ES030990.

Biographies

Nilanjana Laha received a Bachelor of Statistics in 2012 and a Master of Statistics in 2014 from the Indian Statistical Institute, Kolkata. She then received a Ph.D. in Statistics in 2019 from the University of Washington, Seattle. She was a postdoctoral research fellow at the Department of Biostatistics at Harvard University from 2019 to 2022. She is currently an assistant professor in Statistics at Texas A&M University. Her research interests include dynamic treatment regimes, high dimensional association, and shape constrained inference.

Rajarshi Mukherjee received a Bachelor of Statistics in 2007 and a Master of Statistics in 2009 from the Indian Statistical Institute, Kolkata. He received his Ph.D. degree in Biostatistics from Harvard University in 2014. He was a Stein fellow in the Department of Statistics at Stanford University from 2014 to 2017. He was an assistant professor in the Division of Biostatistics at the University of California, Berkeley, from 2017 to 2018. Since 2018, he has been an assistant professor in the Department of Biostatistics at Harvard University. His research interests primarily lie in structured signal detection problems in high dimensional and network models, and functional estimation and adaptation theory in nonparametric statistics.

Appendix A. Full version of RecoverSupp

Algorithm 3 RecoverSupp: simultaneous support recovery of U and V
Input: 1) Preliminary estimators Û(1) and V̂(1) of U and V, and estimators Γ̂n(1) and Ω̂n(1) of Σx⁻¹ and Σy⁻¹, respectively, all based on the sample O1 = (xi, yi), i = 1, …, ⌊n∕2⌋. 2) Estimator Σ̂n,xy(2) of Σxy based on the sample O2 = (xi, yi), i = ⌊n∕2⌋+1, …, n. 3) Threshold levels cutx, cuty > 0 and rank r ∈ ℕ.
Output: D̂(U) and D̂(V), estimators of D(U) and D(V), respectively.
1) Cleaning: V̂clean ← Ω̂n(1) Σ̂n,yx(2) Û(1);  Ûclean ← Γ̂n(1) Σ̂n,xy(2) V̂(1).
2) Threshold: compute D̂(U) ← {i ∈ [p] : ∣Ûikclean∣ > cutx for some k ∈ [r]} and D̂(V) ← {i ∈ [q] : ∣V̂ikclean∣ > cuty for some k ∈ [r]}.
return: D̂(U) and D̂(V).

In Algorithm 3, we used different cut-offs, cutx and cuty, for estimating D̂(U) and D̂(V), respectively. In practice, one can choose the same threshold cut for both.
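A NumPy sketch of the cleaning and thresholding steps of Algorithm 3 for the support of V is given below; the symmetric computation for U and the sample splitting are left to the caller, and the function name is ours.

import numpy as np

def recover_supp_V(U_hat, Omega_hat, S_yx_hat, cut):
    """Estimate D(V) by cleaning and thresholding (the V-side of Algorithm 3).

    U_hat     : preliminary estimator of U (p x r), built from the first half of the sample
    Omega_hat : estimator of Sigma_y^{-1} (q x q)
    S_yx_hat  : cross-covariance estimator from the second half of the sample (q x p)
    cut       : threshold level
    """
    V_clean = Omega_hat @ S_yx_hat @ U_hat                        # cleaning step
    keep = np.max(np.abs(V_clean), axis=1) > cut                  # a large entry in some column k
    return np.where(keep)[0]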

Appendix B. Proof preliminaries

The Appendix collects the proofs of all our theorems and lemmas. This section introduces some new notation and collects some facts that are used repeatedly in our proofs.

A. New Notations

Since the columns of Σx^{1∕2}U, i.e., [Σx^{1∕2}u1, …, Σx^{1∕2}ur], are orthogonal, we can extend them to an orthogonal basis of ℝp, which can also be expressed in the form [Σx^{1∕2}u1, …, Σx^{1∕2}up] since Σx is non-singular. Let us denote the matrix [u1, …, up] by Ũ, whose first r columns form the matrix U. Along the same lines, we can define Ṽ, whose first r columns constitute the matrix V.

Suppose A ∈ ℝp×q is a matrix. Recall the projection operator defined in (4). For any S ⊂ [p], we let AS denote the matrix 𝒫S×[q]{A}. Similarly, for F ⊂ [q], we let AF denote the matrix 𝒫[p]×F{A}. For k ∈ ℕ, we define the norms ‖A‖k,∞ = maxj∈[q] ‖Aj‖k and ‖A‖∞,k = maxi∈[p] ‖Ai‖k, where Aj and Ai denote the j-th column and the i-th row of A, respectively. We will use the notation ∣A∣∞ to denote the quantity sup over i∈[p], j∈[q] of ∣Ai,j∣.

The Kullback-Leibler (KL) divergence between two probability distributions P1 and P2 will be denoted by KL(P1∥P2). For x ∈ ℝ, we let ⌊x⌋ denote the greatest integer less than or equal to x.

B. Facts on 𝒫(r,sx,sy,)

First, note that since viTΣyvi=1 by (2) for all i ∈ [q], we have vi2. Similarly, we can also show that ui2. Second, we note that Σx12Uop=Σy12Vop=1, and

ΣyxΣyxop=ΣyVΛUTΣxopΣy12opΣy12VopΛopΣx12UopΣx12op (28)

because the largest element of Λ is not larger than one. Since Xi’s and Yi’s are Subgaussian, for any random vector v independent of X and Y, it follows that [45, Lemma 7]

(Σ^n,yxΣyx)vCv2log(p+q)n (29)

with P probability 1 – o(1) uniformly over P𝒫(r,sx,sy,). Also, we can show that Φ0=Σy1 satisfies

(Φ0)k1,s(Σx)(Φ0)k2,s(Σx)Φ0ops(Σx),

where Cauchy-Schwarz inequality was used in the first step.

C. General Technical Facts

Fact 1. For two matrices A ∈ ℝm×n and B ∈ ℝn×q, we have
‖AB‖F² ≤ ‖A‖op² ‖B‖F²  and  ‖AB‖F² ≤ ‖A‖F² ‖B‖op².

Fact 2 (Lemma 11 of [1]). Let ZRn×p be a matrix with i.i.d. standard normal entries, i.e., Zi,j ~ N(0, 1). Then for every t > 0,

P(‖Z‖op ≥ √p + √n + t) ≤ exp(−t²∕2).

As a consequence, there exists an absolute constant C > 0 such that

P(‖Z‖op ≥ 2(√p + √n)) ≤ exp(−C(p+n)).

Recall that for ARp×q, in Appendix B-A, we defined ∥A1,∞ and ∥A∞1 to be the matrix norms maxj[q]Aj1 and maxi[p]Ai1, respectively.

The following fact is a Corollary to (29).

Fact 3. Suppose X and Y are jointly sub-Gaussian. Then ∣Σ̂n,xy − Σxy∣∞ = OP(√(log(p+q)∕n)).

Fact 4 (Chi-square tail bound). Suppose Z1,,ZkiidN(0,1). Then for any y > 5, we have

P(∑_{l=1}^k Zl² ≥ yk) ≤ exp(−yk∕5).

Proof of Fact 4. Since Zl’s are independent standard Gaussian random variables, by tail bounds on Chi-squared random variables (The form below is from Lemma 12 of [1]),

P(∑_{l=1}^k Zl² ≥ k + 2√(kx) + 2x) ≤ exp(−x).

Plugging in x = yk, we obtain that

P(∑_{l=1}^k Zl² ≥ (1 + 2√y + 2y)k) ≤ exp(−yk),

which implies for y > 1,

P(∑_{l=1}^k Zl² ≥ 5yk) ≤ exp(−yk),

which can be rewritten as

P(∑_{l=1}^k Zl² ≥ yk) ≤ exp(−yk∕5)

as long as y > 5. □

Appendix C. Proof of Theorem 1

For the sake of simplicity, we denote U^(1), Σ^n,xy(2), and Ω^n(1) by Û, Σ̂n,xy, and Ω̂n, respectively. The reader should keep in mind that Û and Ω̂n are independent of Σ̂n,xy because they are constructed from a different sample. Next, using Condition 1, we can show that there exists (w1, …, wp) ∈ {±1}^p so that

inf over P ∈ 𝒫(r,sx,sy,) of P( max_{i∈[r]} (wi ûn,i − ui)ᵀ Σx (wi ûn,i − ui) < Err² ) → 1

as n → ∞. Without loss of generality, we assume wi = 1 for all i ∈ [r]. The proof will be similar for general wi’s. Thus

infP𝒫(r,sx,sy,)P(maxi[r](u^n,iui)TΣx(u^n,iui)<Err2)1 (30)

Therefore u^n,iui2Err for all i ∈ [r] with P probability tending to one.

Now we will collect some facts which will be used during the proof. Because u^n,i and Σ^n,yx are independent, (29) implies that

(Σ^n,yxΣyx)u^n,iCu^n,i2log(p+q)n.

Using (30), we obtain that u^n,i2u^n,iui2+ui2B(Err+1). Because Err<11, we have

infP𝒫(r,sx,sy,)P(maxi[r](Σ^n,yxΣyx)u^n,iClog(p+q)n)=1o(1). (31)

Noting (28) implies Σyxu^n,iΣyxopu^n,i2232, and that log(p+q)=o(n), using (31), we obtain that

maxi[r]Σ^n,yxu^n,i(Σ^n,yxΣyx)u^n,i+Σyxu^n,i332 (32)

with P probability 1 – o(1).

Now we are ready to prove Theorem 1. We will denote the columns of V^nclean by v^n,iclean for i ∈ [r]. Because Λi(vi)k=ekTΣy1Σyxui, it holds that

(v^n,iclean)kΛi(vi)k=ekT(Ω^nΦ0)Σ^n,yxu^n,i+ekTΦ0(Σ^n,yxΣyx)u^n,i+ekTΦ0Σyx(u^n,iui)

leading to

(v^n,iclean)kΛi(vi)kekT(Ω^nΦ0)Σ^n,yxu^n,iT1(i,k)+ekTΦ0(Σ^n,yxΣyx)u^n,iT2(i,k)+ekTΦ0Σyx(u^n,iui)T3(i,k).

Handling the term T2 is the easiest because

maxi[r],k[q]T2(i,k)Φ01,(Σ^n,yxΣyx)u^n,iCs(Σy1)log(p+q)n

with P probability 1 – o(1) uniformly over 𝒫(r,sx,sy,), where we used (31) and the fact that Φ01,s(Σy1). The difference in cases (A), (B), (C) arises only due to different bounds on T1(i, k) in these cases. We demonstrate the whole proof only for case (A). For the other two cases, we only discuss the analysis of T1(i, k) because the rest of the proof remains identical in these cases.

1). Case (A):

Since we have shown in (32) that Σ^n,yxu^n,i332, we calculate

maxi[r],k[q]T1(i,k)Ω^nΦ01,maxi[r]Σ^n,yxu^n,i332Cpres(Σy1)logqn

with P probability tending to one, uniformly over 𝒫(r,sx,sy,), where to get the last inequality, we also used the bound on Ω^nΦ0,1 in case (A).

Finally, for T3, we notice that

T3(i,k)=ekTΦ0Σyx(u^n,iui)=ekTj=1rΛjvjujTΣx(u^n,iui)maxj[r](vj)kj=1rujTΣx(u^n,iui)

since Λ1 ≤ 1. Since (vj)k = Vkj, it is clear that T3(i, k) is identically zero if k ∉ D(V). Otherwise, the Cauchy-Schwarz inequality implies,

j=1rujTΣx(u^n,iui)r(j=1r(ujTΣx(u^n,iui))2)12rΣx12(u^n,iui)2

because Σx12ujs are orthogonal. Thus

maxi[r],kD(V)T3(i,k)rmaxj[r](vj)kErr.

Now we will combine the above pieces together. Note that

maxi[q]maxk[r](T1(i,k)+T2(i,k)CCpres(Σy1)log(p+q)nϵn. (33)

For kD(V), denoting the i-th column of V^clean by v^n,iclean we observe that,

maxkD(V)maxi[r]V^kiclean=maxkD(V)maxi[r](v^n,iclean)kmaxi[q]maxk[r](T1(i,k)+T2(i,k))Cϵn (34)

with P probability 1 – o(1) uniformly over P𝒫(r,sx,sy,). On the other hand, if kD(vi), then we have for all i ∈ [r],

(v^n,iclean)k>Λi(vi)krmaxj[r](vj)kErrmaxi[q]maxk[r](T1(i,k)+T2(i,k)),

which implies

maxi[r]V^kiclean>maxi[r]Λi(vi)krmaxi[r](vi)kErrCϵn.

Since Err<1(2r) and 1<mini[r]Λi, we have

maxi[r]Λi(vi)krmaxi[r](vi)kErr>(1rErr)maxi[r](vi)k>1maxi[r](vi)k2.

Thus, noting Vki = (vi)k, we obtain that

minkD(V)maxi[r](v^n,iclean)k=minkD(V)maxi[r]V^kiclean>minkD(V)maxi[r]Vki(2)Cϵn

with P probability 1 – o(1) uniformly over P𝒫(r,sx,sy,). Suppose C=2C. Note that

mink[p]maxi[r](vi)k=θnCϵn

where θn > 2. Then with P probability 1 – o(1) uniformly over P𝒫(r,sx,sy,),

minkD(V)maxi[r]V^kiclean>(θn1)Cϵn(2).

This, combined with (34) implies setting cut[Cϵn(2),(θn1)Cϵn(2)] leads to full support recovery with P probability 1 – o(1). The proof of the first part follows.

2). Case (B):

In the Gaussian case, we resort to the hidden variable representation of X and Y due to [64], which enables sharper bound on the term T1(i, k). Suppose Z ~ Nr(0, Ir) where r is the rank of Σxy. Consider Z1 ~ Np(0, Ip) and Z2 ~ Nq(0, Iq) independent of Z. Then X and Y can be represented as

X=𝒲1Z+1Z1andY=𝒲2Z+2Z2, (35)

where

𝒲1=ΣxUΛ12,𝒲2=ΣyVΛ12,1=(Σx𝒲1𝒲1T)12,

and

2=(Σy𝒲2𝒲2T)12.

Here (Σx𝒲1𝒲1T)12 is well defined because

Σx𝒲1𝒲1T=ΣxU~(IpΛx)U~TΣx,

where Λx is a p × p diagonal matrix whose first r diagonal elements are Λ1, …, Λr, and the rest are zero. Because Λ1 ≤ 1, we have

(Σx𝒲1𝒲1T)12=ΣxU~(IpΛx)12U~TΣx.

Similarly, we can show that

(Σy𝒲2𝒲2T)12=ΣyV~(IqΛy)12V~TΣy,

where Λy is the diagonal matrix whose first r elements are Λ1,…,Λr, and the rest are zero. It can be easily verified that

Var(X)=𝒲1𝒲1T+1=Σx,Var(Y)=𝒲2𝒲2T+2=Σy,

and

Σxy=𝒲1𝒲2T=ΣxUΛVTΣy,

which ensures that the joint covariance matrix of (X, Y) is still Σ. Also, some linear algebra leads to

max{1op2,2op2,𝒲1op,𝒲2op}<. (36)

Suppose we have n independent realizations of the pseudoobservations Z1, Z2, and Z. Denote by Z1, Z2, and Z, the stacked data matrices with the i-th row as (Z1)i, (Z2)i, and Zi, respectively, where i ∈ [n]. Here we used the term data-matrix although we do not observe Z, Z1 and Z2 directly. Due to the representation in (35), the data matrices X and Y have the form

X=Z𝒲1T+Z11,Y=Z𝒲2T+Z22.

We can write the covariance matrix Σ^n,xy=XTYn as

Σ^n,xy=1n{𝒲1ZTZ𝒲2T+𝒲1ZTZ22+1TZ1TZ𝒲2T}{+1TZ1TZ22}. (37)

Therefore, for any vector θ1Rp and θ2Rq, we have

θ1T(Σ^n,xyΣxy)θ2=θ1T𝒲1T(ZTZnIr)𝒲2θ2+1nθ1T(𝒲1ZTZ22+1TZ1TZ𝒲2T+1TZ1TZ22)θ2. (38)

By the Bai-Yin law on eigenvalues of Wishart matrices [67], there exists an absolute constant C > 0 so that for any t > 1,

P(ZTZnIrop<trn)12exp(Ct2r),

which, combined with (36), implies

infP𝒫G(r,sx,sy,)P(θ1T𝒲1T(ZTZnIr)𝒲2θ2)(t2θ12θ22rn)12exp(Ct2r).

Now we will state a lemma which will be required to control the other terms on the right hand side of (38).

Lemma 3. Suppose Z1Rn×p and Z2Rn×q are independent Gaussian data matrices. Further suppose xRp and yRq are either deterministic or independent of both Z1 and Z2. Then there exists a constant C > 0 so that for any t > 1,

P(xTZ1TZ2y>tx2y2n)exp(Cn)exp(t22).

The proof of Lemma 3 follows directly setting b = 1 in the following Lemma, which is proved in Appendix H-D.

Lemma 4. Suppose Z1Rn×p and Z2Rn×q are independent standard Gaussian data matrices, and DRn×k1 and BRn×k2 are deterministic matrices with rank a and b, respectively. Let abn. Then there exists an absolute constant C > 0 so that for any t ≥ 0, the following holds with probability at least 1exp(Cn)exp(t22):

DTZ1TZ2BopCDopBopnmax{b,t}.

Lemma 3, in conjunction with (36), implies that there exists an absolute constant C > 0 so that

1nθ1T(𝒲1ZTZ22+1TZ1TZ𝒲2T+1TZ1TZ22)θ2t2θ12θ22n12

with P probability at least 1exp(Cn)exp(t22) for all P𝒫G(r,sx,sy,). Therefore, there exists C > 0 so that

P(θ1T(Σ^n,xyΣxy)θ2tr2θ12θ22n12)1exp(Cn)exp(Ct2). (39)

for all P𝒫G(r,sx,sy,). Note that

T1(i,k)((Ω^n)k(Σy1)k)T(Σ^n,yxΣyx)u^n,iT11(i,k)+((Ω^n)k(Σy1)k)TΣyxu^n,iT12(i,k).

Now suppose θ1=(Ω^n)k(Σy1)k and θ2=u^n,i. By our assumption, θ12Cpres(Σy1)(logq)n with P probability 1 – o(1) uniformly across P𝒫G(r,sx,sy,). We also showed that u^n,i22. It is not hard to see that

supi[q],k[r]T12(i,k)232Cpres(Σy1)(logq)n (40)

with P probability 1 – o(1) uniformly across P𝒫G(r,sx,sy,). For T11, observe that (39) applies because θ1=(Ω^n)k(Σy1)k and θ2=u^n,i are independent of Σ^n,xy. Thus we can write that for any t > 1, there exists C>1 such that

supP𝒫G(r,sx,sy,)P(T11(i,k)>tCCprers(Σy1)logqn)exp(Cn)+exp(Ct2).

Applying union bound, we obtain that for any P𝒫G(r,sx,sy,),

P(maxi[q]maxk[r]T11(i,k)>tCCprers(Σy1)logqn)exp(Cn+log(qr))+exp(Ct2+log(qr)).

Since r < q and log q = o(n), setting t=2logqC, we obtain that

supP𝒫G(r,sx,sy,)P(maxi[q]maxk[r]T11(i,k)>CCprers(Σy1)logqn)

is o(1). Using (33) and (40), one can show that

ϵn=Cpres(Σy1)(log(p+q))nmax{r(logq)n,1}

in this case.

3). Case (C):

Note that when Ω^n=Σy1,T1(i,k)=0. Therefore, (33) implies ϵn=s(Σy1)log(p+q)n in this case.

Appendix D. Proof of Theorem 2

Since the proofs for U and V follow in a similar way, we will only consider the support recovery of U. The proofs of both parts follow a common structure, which we elaborate first. Since the model 𝒫(r,sx,sy,) is fairly large, we will work with a smaller submodel. Specifically, we will consider a subclass of the single spike models, i.e., r = 1. Because we are concerned only with the support recovery of the left singular vectors, we fix β0 in ℝq so that ‖β0‖2 = 1. We also fix ρ ∈ (0, 1) and consider a finite collection of α's contained in {α ∈ ℝp : ‖α‖2 = 1}. Both ρ and this collection will be chosen later. We restrict our attention to the submodel (sx,sy,ρ,) given by

{P𝒫(1,sx,sy,):PNp+q(0,Σ)whereΣ}{is of the form(41)withα,β=β0},

where (41) is as follows:

Σ=[IpραβTρβαTIq]. (41)

That Σ is positive definite for ρ ∈ (0, 1) can be shown either using elementary linear algebra or the the hidden variable representation (35). During the proof of part (B), we will choose so that Sigx2(B21)(log(psx))8n, which will ensure that (sx,sy,ρ,)𝒫Sig(r,sx,sy,) as well.

Note that for P(sx,sy,ρ,), U corresponds to α, and hence D(U) = D(α). Therefore for the proof of both parts, it suffices to show that for any decoder D^α of D(α),

infD^αsupP(sx,sy,)P(D^αD(α))>12. (42)

In both proofs, the collection of candidate α's will be a finite set. Our goal is to choose it so that (sx,sy,ρ,) is structurally rich enough to guarantee (42), yet lends itself to easy computations. The guidance for choosing this collection comes from our main technical tool for this proof, which is Fano's inequality. We use the version of Fano's inequality in [53] (Fano's Lemma). Applied to our problem, this inequality yields

infD^αsupP(sx,sy,ρ,)P(D^αD(α))1P1,P2(sx,sy,ρ,)KL(P1nP2n)(sx,sy,ρ,)2+log2log((sx,sy,ρ,)1), (43)

where Pn denotes the product measure corresponding to n i.i.d. observations from P. For product measures we also have KL(P1n∥P2n) = n KL(P1∥P2). Moreover, when P1, P2 ∈ (sx,sy,ρ,) with left singular vectors α1 and α2, respectively,

KL(P1P2)=logdet(Σ2)det(Σ1)(p+q)+Tr(Σ21Σ1),

where det(Σ1)=det(Σ2)=1ρ2 by Lemma 13, and

(p+q)+Tr(Σ21Σ1)=2ρ21ρ2(1(α1Tα2)β022)

by Lemma 14. Noting α1, α2, and β0 are unit vectors, we derive KL(P1P2)=ρ2(α1α222)(1ρ2). Therefore, in our case, (43) reduces to

infD^αsupP(sx,sy,ρ,)P(D^αD(α))1nρ2supα1,α2α1α22(1ρ2)+log2log(1). (44)

Thus, to ensure that the right hand side of (44) is non-negligible, the key is to choose the collection so that the α's in it are close in ℓ2 norm, yet the collection itself is sufficiently large. Note that the former ensures that distinguishing the α's in the collection is difficult.

A. Proof of part (A)

Note that our main job is to choose and ρ suitably. Let us denote

α0 = (1∕√sx, …, 1∕√sx, 0, …, 0), where the first sx coordinates equal 1∕√sx and the remaining p − sx coordinates are zero.

We generate a class of α's from α0 by replacing one of the entries equal to 1∕√sx by 0, and one of the zero entries by 1∕√sx. A typical α obtained this way looks like

α = (1∕√sx, …, 0, …, 1∕√sx, 0, …, 1∕√sx, …, 0), where one entry among the first sx coordinates has been set to zero and one entry among the last p − sx coordinates has been set to 1∕√sx.

Let the class consist of α0 and all such resulting α's. Note that there are sx(p−sx) α's of the latter kind, and any α1, α2 in the class satisfy

α1α222α1α022+α2α0224sx1.

Because p > sx > 1, we have

log(sx(psx)1)log(psx).

Therefore, (44) leads to

infD^αsupP(sx,sy,ρ,)P(D^αD(α))14ρ2nsx1(1ρ2)+log2log(psx),

which is bounded below by 1/2 whenever

sx>8ρ2n(1ρ2){log(psx)log4},

which follows if

sx>16ρ2n(1ρ2)log(psx)

because 4=16<psx. To get the best bound on sx, we choose the value of ρ which minimizes ρ2/(1 – ρ2) for P𝒫(r,sx,sy,), that is ρ=1. Plugging in ρ=1, the proof follows.

B. Proof of part (B)

Suppose each α is of the following form

α = (b, …, b, 0, …, 0, z, 0, …, 0), where the first sx − 1 coordinates equal b and the single entry z sits in one of the last p − sx + 1 coordinates.

We fix z ∈ (0, 1), and hence b = √((1−z²)∕(sx−1)) is also fixed. We will choose the values of ρ and z later so that the submodel (sx,sy,ρ,) is contained in 𝒫Sig(r,sx,sy,). Since z is fixed, such an α can be chosen in p − sx + 1 ways; therefore the class has p − sx + 1 elements. Also note that any two α, α′ in the class satisfy ‖α − α′‖2² ≤ 2z². Therefore (44) implies

infD^αsupP(sx,sy,ρ,)P(D^αD(α)) (45)
12nρ2z2(1ρ2)+log2log(psx), (46)

which is greater than 1/2 whenever

z2<1ρ24nρ2log(psx4),

which holds if

z2=1ρ28nρ2log(psx)

because 16 < psx. To get the best bound on z, we choose the value of ρ for P𝒫(r,sx,sy,) which maximizes (1 – ρ2)/ρ2, that is ρ=1. Thus (42) is satisfied when ρ=1, and corresponds to

z2=(21)log(psx)(8n)

Since the minimal signal strength Sigx for any P(sx,sy,1,) equals min(z, b) ≤ z, we have 𝒫Sig(r,sx,sy,)(sx,sy,1,), which completes the proof.

Appendix E. Proof of Theorem 3

We first introduce some notation and terminology required for the proof. For w ∈ ℤm and x ∈ ℝm, we denote w! = ∏_{i=1}^m wi! and x^w = ∏_{i=1}^m xi^{wi}. In the low-degree polynomial literature, when w ∈ ℤm, the notation ∣w∣ is commonly used to denote the sum ∑_{i=1}^m wi for the sake of simplicity. We also follow this convention. Here the notation ∣·∣ should not be confused with the absolute value of real numbers. Also, for any function f : ℝm → ℝ, w ∈ ℤm, and t = (t1, …, tm), we denote

twf(t)=wt1w1trwrf(t).

We will also use the shorthand notation Eπ to denote Eαπx,βπy, sometimes.

Our analysis relies on Hermite polynomials, which we discuss very briefly here. For a detailed account of Hermite polynomials, see Chapter V of [62]. The univariate Hermite polynomial of degree k will be denoted by hk. For k ≥ 0, the univariate Hermite polynomials hk : ℝ → ℝ are defined recursively as follows:

h0(x) = 1, h1(x) = x h0(x), …, h_{k+1}(x) = x hk(x) − hk′(x).

The normalized univariate Hermite polynomials are given by h^k(x)=hk(x)k!. The univariate Hermite polynomials form an orthogonal basis of L2(N(0, 1)). For wZm, the m-variate Hermite polynomials are given by Hw(y)=i=1mhwi(yi), where yRm. The normalized version H^w of Hw equals Hww!. The polynomials H^ws form an orthogonal basis of L2(Nm(0, Im)). We denote by ΠnDn the linear span of all n(p + q)-variate Hermite polynomials of degree at most Dn. Since LnDn is the projection of Ln on ΠDn, it then follows that

LnDnL2(Qn)2=wZn(p+q)wDnLn,H^wL2(Qn)2. (47)
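As a small illustration of the Hermite machinery above, the sketch below evaluates the univariate polynomials through the equivalent three-term recurrence h_{k+1}(x) = x hk(x) − k h_{k−1}(x) (which follows from hk′ = k h_{k−1}) and checks orthonormality of the normalized versions under N(0, 1) by Monte Carlo; all names are ours.

import numpy as np
from math import factorial, sqrt

def hermite(k, x):
    """Probabilists' Hermite polynomial h_k evaluated at x (x may be an array)."""
    x = np.asarray(x, dtype=float)
    h_prev, h = np.ones_like(x), x.copy()
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev      # h_{j+1} = x h_j - j h_{j-1}
    return h

def hermite_normalized(k, x):
    """hat h_k(x) = h_k(x) / sqrt(k!), orthonormal in L2(N(0, 1))."""
    return hermite(k, x) / sqrt(factorial(k))

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)
print(np.mean(hermite_normalized(3, z) ** 2))                        # approximately 1
print(np.mean(hermite_normalized(3, z) * hermite_normalized(2, z)))  # approximately 0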

From now on, the degree-index vector w of H^w or Hw will be assumed to lie in Zn(p+q). We will partition w into n components, which gives w = (w1,…,wn), where wiZp+q for each i ∈ [n]. Clearly, i here corresponds to the i-th observation. We also separate each wi into two parts wixZp and wiyZq so that wi=(wix,wiy). We will also denote wx=(w1x,,wnx), and wy=(w1y,,wny). Note that wxZnp and wyZnq, but w ≠ (wx, wy) in general, although ∣w∣= ∣wx∣+∣wy∣.

Now we state the main lemmas, which yield the value of LnDnL2(Qn)2. The first lemma, proved in Appendix H-C, gives the form of the inner products Ln,H^wL2(Qn).

Lemma 5. Suppose w is as defined above and Ln is as in (20). Then it holds that

Ln,H^wL2(Qn)2={ww!{Eπ[1{α2β2<}αi=1nwixβi=1nwiy]}2×(i=1nwix!)2ifwix=wiyforalli[n],0o.w.}

Here the priors πx and πy are the Rademacher priors defined in (17).

Our next lemma uses Lemma 5 to give the form of LnDnL2(Qn)2. This lemma uses replicas of α and β. Suppose α1, α2 ∼ πx and β1, β2 ∼ πy are all independent, where πx and πy are the Rademacher priors defined in (17). We overload notation and use Eπ to denote the expectation under α1, α2, β1, and β2.

Lemma 6. Suppose W is the indicator function of the event {α12β12<, α22β22<}. Then for any Dn ∈ ℕ, LnDnL2(Qn)2 equals

Eπ[Wd=0Dn2(d+n1d){2(α1Tα2)(β1Tβ2)}d].

The proof of Lemma 6 is also deferred to Appendix H-C. We remark in passing that the negative binomial series expansion yields

(1x)n=d=0(n+d1d)xd,forx<1, (48)

whose Dn-th order truncation equals

((1x)n)Dn=d=0Dn(n+d1d)xd.

Note that W is nonzero if and only if α12β12< and α22β22<, which, by Cauchy Schwarz inequality, implies

(α1Tα2)(β1Tβ2)<2.

Thus 2(α1Tα2)(β1Tβ2)<1 when W = 1. Hence Lemma 6 can also be written as

LnDnL2(Qn)2=Eπ[W{(12(α1Tα2)(β1Tβ2))n}Dn2].

Now we are ready to prove Theorem 3.

Proof of Theorem 3. Our first task is to get rid of W from the expression of LnDnL2(Qn) in Lemma 6. However, we can not directly bound W by one since the term (α1Tα2)d(β1Tβ2)dW may be negative for odd dN. We claim that E[(α1Tα2)d(β1Tβ2)dW]=0 if dN is odd. To see this, first we write

E[(α1Tα2)d(β1Tβ2)dW]=E[E[[(α1Tα2)dWβ1,β2](β1Tβ2)d]. (49)

Note that (α1Tα2)dWβ1, β2 has the same distribution as

1{α12<β121}1{α22<β221}(α1Tα2).

Notice from (17) that, marginally, −α1 has the same distribution as α1, and α1 is independent of α2, β1, and β2. Therefore,

(α1ᵀα2)W ∣ β1, β2 =d −(α1ᵀα2)W ∣ β1, β2.

Hence, conditional on β1 and β2, (α1ᵀα2)W is a symmetric random variable, and E[(α1ᵀα2)^d W^d ∣ β1, β2] = 0 for any odd positive integer d. Since W is a binary random variable, W^d = W. Thus, E[(α1ᵀα2)^d W ∣ β1, β2] = 0 as well for any odd d ∈ ℕ. Thus the claim follows from (49). Therefore, from Lemma 6, it follows that

LnDnL2(Qn)2=Eπ[Wd=0Dn22(2d+n12d){2(α1Tα2)(β1Tβ2)}2d].

Observe that Dn22Dn4. Hence, Dn22Dn4. Also the summands in the last expression are non-negative. Therefore, using the fact that W ≤ 1, we obtain

LnDnL2(Qn)2Eπ[d=0Dn4(2d+n12d){2(α1Tα2)(β1Tβ2)}2d]. (50)

Our next step is to simplify the above bound on LnDnL2(Qn)2. To that end, define the random variables ξi=α1iα2i for i[p], and ξj=β1jβ2j for j[q]. Denoting

ν=(sxp)2,andω=(syq)2,

we note that

ξi={+1sxw.p.ν21sxw.p.ν20w.p.1ν,}andξj={+1syw.p.ω21syw.p.ω20w.p.1ω.}

Also, since ξi and ξjs are symmetric, Eξi2k+1 and Eξj2k+1 vanishes for any kZ. Then for any dZ,

Eπ[(α1Tα2)2d(β1Tβ2)2d]=Eπx[(i=1pξi)2d]Eπy[(j=1qξj)2d]=(zZp,z=2d(2d)!z!i=1pE[ξizi])(lZq,l=2d(2d)!l!j=1qE[(ξj)lj])

by Fact 6. Since the odd moments of ξ and ξ vanish, the above equals

(zZp,z=2d(2d)!(2z)!i=1pE[ξi2zi])(lZq,l=d(2d)!(2l)!j=1qE[(ξj)2lj])=(zZp,z=dνD(z)(2d)!(2z)!i=1psx2zi)×(lZq,l=dωD(z)(2d)!(2l)!j=1qsy2lj),

where we remind the readers that ∣D(z)∣ denotes the cardinality of the support of z for any vector z. The above implies

Eπ[(α1Tα2)2d(β1Tβ2)2d]=(sxsy)2dzZp,z=d(2d)!(2z)!νD(z)𝒥(d;p)lZq,l=d(2d)!(2l)!νD(l)𝒥(d;q).

Plugging the above into (50) yields

LnDnL2(Qn)2d=0Dn4(2d+n12d)(sxsy)2d4d𝒥(d;p)𝒥d(d;q)(a)d=0Dn4((2d+n1)e2d)2d(sxsy)2d4d𝒥(d;p)𝒥(d;q),

where (a) follows since (ab)(aeb)b for a, bN. Let us denote μx=ne(p) and μy=ne(q). By (21), μx, μy<13 and

Dnmin{sy2,sx2}2ne.

Therefore we have

μx2Dn<sx2pandμy2Dn<sy2q.

Hence Lemma 4.5 of [36] implies that for any 11 ≤ dDn,

𝒥(d;p)(2d)!(pd)ded2p+d223d2μx2dνd,𝒥(d;q)(2d)!(qd)ded2q+d223d2μy2dωd.

For d ≤ 1, Theorem 5 of [68] gives

(2d)!(2d)2d+1e2d2π2d1.

Also since (pd)(ped)d, we have

𝒥(d;p)(2d)2d+12(ped)dded2pd23d2μx2dνd,𝒥(d;q)(2d)2d+12(qed)dded2qd23d2μy2dωd,

leading to

𝒥(d;p)𝒥(d;q)d2d+2ed2p+d2q2d+1(μxμy)2d(νp)d(ωq)d.

Therefore LnDL2(Qn)2 is bounded by a constant multiple of

d=11Dn4((2d+n1)e2d)2d(sxsy)2dd2d+2ed2p+d2q×2d+1(μxμy)2d(νp)d(ωq)d4dd=11Dn4d{4(2d+n1)2e22μx2μy2pq}ded2p+d2q.

Since Dn2min{p,q}, it follows that ed2p+d2qe2. Note that the above sum converges if

(Dn2+n1)2e2<24μx2μy2pq=2n2e2,

or equivalently (Dn2+n1)2<2n2, which is satisfied for all nN since Dn<n. Thus the proof follows. □

Appendix F. Proof of Theorem 4

We invoke the decomposition of Σ^n,xy in (37). But first, we will derive a simplified form for the matrices 1 and 2 in (37). Note that we can write 1 as

1=Σx12(IpΣx12UΛUTΣx12)Σx12.

Let us denote

Bx = diag(1−Λ1, …, 1−Λr, 1, …, 1), with the first r diagonal entries equal to 1−Λi and the remaining p−r entries equal to one. (51)

Because Σx12U~ is an orthogonal matrix, Σx12U~BxU~TΣx12 is a spectral decomposition, which leads to

1=Σx12(Σx12U~BxU~TΣx12)12Σx12=ΣxU~Bx12U~TΣx.

Similarly, we can show that the matrix 2 in (37) equals ΣyV~By12V~TΣy, where

By = diag(1−Λ1, …, 1−Λr, 1, …, 1), with the first r diagonal entries equal to 1−Λi and the remaining q−r entries equal to one.

Finally, the facts that 𝒲1 = ΣxUΛ^{1∕2} and 𝒲2 = ΣyVΛ^{1∕2}, in conjunction with (37), produce the following representation for Σ~xy=Σx1Σ^n,xyΣy1:

Σ~xy=1n{UΛ12ZTZΛ12VT+UΛ12ZTZ2Σy(V~ByV~T)}+(U~ByU~T)ΣxZ1TZΛ12VT{+(U~BxU~T)ΣxZ1TZ2Σy(V~ByV~T)}.

Now recall the sets E, F, and G defined in (24) and (25) in Section III-D, and the decomposition of Σ~xy in (26). From (26) it follows that

η(Σ~xy)=η(𝒫E{Σ~xy})+η(𝒫F{Σ~xy})+η(𝒫G{Σ~xy}).

Recall that for any matrix ARp×q, and S[p], we denote by AS the matrix 𝒫S×[q]{A}. Then it is not hard to see that UE1=U and VE2=V, which leads to

S1=1n{UΛ12ZTZΛ12VT}+UΛ12ZTZ2Σy(V~By(V~E2)T)+(U~E1ByU~T)ΣxZ1TZΛ12VT{+(U~E1BxU~T)ΣxZ1TZ2Σy(V~By(V~E2)T)}. (52)

Next, note that UF1=0 and VF2=0. Therefore,

S2=1n{(U~F1BxU~T)ΣxZ1TZ2Σy(V~By(V~F2)T)}. (53)

Finally, we note that S3 = (H1 + H2), where

H1=1n{(UΛ12ZTZ2Σy(V~By(V~F2)T)}{+(U~E1BxU~T)ΣxZ1TZ2Σy(V~By(V~F2)T)} (54)

and

H2=1n{(U~F1ByU~T)ΣxZ1TZΛ12VT}{+(U~F1BxU~T)ΣxZ1TZ2Σy(V~By(V~E2)T)}.

Here the term S1 holds the information about Σx1ΣxyΣy1=UΛVT. Its elements are not killed off by co-ordinate thresholding because it contains the Wishart matrix ZᵀZ∕n, which concentrates around Ir by the Bai-Yin law (cf. Theorem 4.7.1 of [42]). Among S1, S2, and S3, the only term that contributes to the signal part of Σ~xy is S1. Lemma 7 entails that η(S1) concentrates around Σx1ΣxyΣy1 in operator norm. The proof of Lemma 7 is deferred to Appendix H-B.

Lemma 7. Suppose sx, sy<n. Then with probability 1 – o(1),

η(S1)Σx1ΣxyΣy1opThrmin{sx,sy}n+C2max{sx,sy}n.

The entries of S2 and S3 are linear combinations of the entries of Z1TZ2, ZTZ1 and ZTZ2. Since Z, Z1, Z2 are independent, the entries from the latter matrices are of order Op(n−1/2), and as we will see, they are killed off by the thresholding operator η. Our main work boils down to showing that thresholding kills off most terms of the noise matrices S2 and S3, making η(S2)op and η(S3)op small. To that end, we state some general lemmas, which are proved in Appendix H-B. That η(S2)op and η(S3)op are small follows as corollaries to these lemmas. Our next lemma provides a sharp concentration bound which is our main tool in analyzing the difficult regime, i.e., the sx+syp+q case.

Lemma 8. Suppose Z1Rn×p and Z2Rn×q are independent standard Gaussian data matrices. Let us also denote QM,N=MZ1TZ2N where MRp×p and NRq×q are fixed matrices so that pp and qq. Further suppose log(pq)=o(n) and logn=o(pq). Let K0=161Mop2Nop2. Suppose KK0 is such that threshold level τ satisfies τ[K0,Klog(max{p,q})2]. Then there exists a constant C > 0 so that with probability 1 – o(1),

η(QM,N;τn)opCMopNop(p+qnp+qn)×eτ2K.

Our next lemma, which also is proved in Appendix H-B, handles the easier case when the threshold is exactly of the order log(p+q). This thresholding, as we will see, is required in the easier sparsity regime, i.e., sx+syp+q. Although Lemma 9 follows as a corollary to Lemma A.3 of [50], we include it here for the sake of completeness.

Lemma 9. Suppose Z1, Z2, M, N, and QM,N are as in Lemma 8, and log(p+q)=o(n). Further suppose Mop, NopC where C > 0 is an absolute constant. Let τ=C1log(p+q). Here the tuning parameter C1>C4 where C > 0 is a sufficiently large constant. Then η(QM,N;τn)=0 with probability tending to one.

We will need another technical lemma for handling the terms S2 and S3.

Lemma 10. Suppose A ∈ ℝm×p and D = D1×D2 ⊂ [m]×[p]. Then the following hold:

  1. 𝒫D(η(A))=η(𝒫D(A)).

  2. 𝒫D(A)opAop

Note that M=U~F1BxU~TΣx satisfies

MopΣx12opΣx12U~FopBxopΣx12U~opΣx12op.

However, Bxop1. Also because P𝒫(r,sx,sy,), it follows that Σxop,Σx1op. Moreover, since Σx12U~ is orthogonal, Σx12U~op=1. Part 2 of Lemma 10 then yields Σx12U~FopΣx12U~op=1. Therefore

Mop=U~F1BxU~TΣxop. (55)

Similarly, we can show that the matrix N=V~By(V~F2)TΣy satisfies Nop. Because S2 = MZ1ᵀZ2N∕n by (53), the fact that η(S2) is small follows immediately from Lemma 8. Under the conditions of Lemma 8, we have

η(S2)opC2(p+qnp+qn)×eThr2K (56)

with high probability provided K1614 and Thr[132,Klog(max{p,q})2]. On the other hand, under the setup of Lemma 9, P(‖η(S2)‖op = 0) → 1 as n → ∞. Lemma 11, which we prove in Appendix H-B, entails that the same holds for S3.

Lemma 11. Consider the setup of Lemma 8. Suppose K12884 is such that Thr[362,Klog(2max{p,q})2]. Then there exists a constant C > 0 so that with probability tending to one,

η(S3)opC2(p+qnp+qn)eThr2K.

Under the setup of Lemma 9, on the other hand, ‖η(S3)‖op = 0 with probability tending to one.

We will now combine all the above lemmas and finish the proof. First, we consider the regime when (sx+sy)2(p+q)e, so that there is thresholding, i.e., Thr>0. We split this regime into two subregimes: 214(p+q)34(sx+sy)2(p+q)e and (sx+sy)2214(p+q)34.

1). Regime 2^{1∕4}(p+q)^{3∕4} ≤ (sx+sy)² ≤ (p+q)∕e:

First, we explain why we needed to split the (sx+sy)² ≤ (p+q)∕e regime into two parts. Since sx, sy<n, Lemma 7 applies. Note that if Thr[362, Klog(max{p,q})2] with K12884, then Lemma 11 and (56) also apply. Therefore it follows that in this case

η(Σ~xy)Σx1ΣxyΣy1opC2((sx+sy)Thrn)(+sx+syn+p+qnp+qneThr2K). (57)

We will shortly show that under (sx+sy)2(p+q)e, setting Thr2=Klog((p+q)(sx+sy)2) ensures that the bound in (57) is small. However, for (57) to hold, Thr2 needs to satisfy

Thr2Klog(max{p,q}))4,

which holds with Thr2=Klog((p+q)(sx+sy)2) if and only if

log((p+q)(sx+sy)2)log(max{p,q}14).

Since max{p,q}(p+q)2 the above holds when

(p+q)34214(sx+sy)2.

Therefore, setting Thr² = K log((p+q)∕(sx+sy)²) is useful when we are in the regime 2^{1∕4}(p+q)^{3∕4} ≤ (sx+sy)² ≤ (p+q)∕e. We will analyze the regime (sx+sy)² < 2^{1∕4}(p+q)^{3∕4} using a separate procedure.

In the 214(p+q)34(sx+sy)2(p+q)e case,

p+qneThr2K=p+qn(sx+sy)2(p+q)=(sx+sy)nsx+syp+q<(sx+sy)en

because (sx+sy)2(p+q)e, and similarly,

p+qneThr2K=p+qn(sx+sy)2(p+q)=(sx+sy)2n

since we also assume sx+sy2n. The above bounds entail that, in this regime, the first term on the bound in (57) is the leading term provided Thr>1, i.e.,

η(Σ~xy)Σx1ΣxyΣy1opC2((sx+sy)Thrn+(sx+sy)n)C2(sx+sy)max(Thr,1)n

with probability 1o(1). Plugging in the value of Thr leads to

η(Σ~xy)Σx1ΣxyΣy1opC2sx+syn(max{Klog((sx+sy)2p+q),1})12

in the regime 214(p+q)34(sx+sy)2(p+q)e. In our case, (sx+sy)2(p+q)e. Also since >1 by definition of 𝒫(r,sx,sy,), we also have K1, indicating

η(Σ~xy)Σx1ΣxyΣy1opC2sx+syn(Klog((sx+sy)2p+q))12.

2). Regime (sx+sy)² < 2^{1∕4}(p+q)^{3∕4}:

When (sx+sy)² < 2^{1∕4}(p+q)^{3∕4}, the above line of argument may not work, although this is in fact an easier regime because sx+sy is small compared to √((p+q)∕log(p+q)). In this regime, we set Thr = C1√(log(p+q)), where C1 is a constant chosen as in Lemma 9. For this threshold, we have shown that ‖η(S2)‖op = 0 with probability tending to one. Lemma 11 implies that the same holds for ‖η(S3)‖op as well. Thus, from the decomposition of Σ~xy in (26), it follows that the asymptotic error occurs only due to the estimation of Σx1ΣxyΣy1 by η(S1). Using Lemma 7, we thus obtain

Σ~xyΣx1ΣxyΣy1opC2(sx+sy)max{Thr,1}n.

On the other hand, since (p+q)34>214(sx+sy)2, rearranging terms, we have

log((p+q)(sx+sy)2)>(log(p+q)log2)4>Clog(p+q).

Thus, in the regime (sx+sy)2<214(p+q)34, we have

Σ~xyΣx1ΣxyΣy1opC2sx+synmax{C1log(p+q(sx+sy)2),1}. (58)

3). Regime (sx+sy)2>(p+q)e:

It remains to analyze the case (sx+sy)2>(p+q)e. In that case, there is no thresholding, i.e., Thr = 0. We will show that the assertions of Theorem 4 hold in this case as well. To that end, note that (26) implies

Σ~xyΣx1ΣxyΣy1opS1Σx1ΣxyΣy1op+S2+S3op.

From the proof of Lemma 7 it follows that S1Σx1ΣxyΣy1opC2max{sx,sy}n. For S2, we have shown that it is of the form MZ1TZ2Nn where Mop,Nop. On the other hand, we showed that S3=H1+H2, where the proof of Lemma 11 shows H1 and H2 are of the form M AN where MopNop22 and A is either [ZZ1]TZ2n (for H1) or Z1T[ZZ2]n (for H2). Therefore, it is not hard to see that S2+S3op is bounded by

C2(Z1TZ2op+Z1T[ZZ2]op+[ZZ1]TZ2op).

For standard Gaussian matrices Z1Rn×p and Z2Rn×q it holds that Z1TZ2nopC((p+q)n+(p+q)n) with probability 1o(1) (cf. Theorem 4.7.1 of [42]). Since rmin{p,q}, it follows that S2+S3opC2(p+qn+(p+q)n) with probability 1o(1). The above discussion leads to

Σ~xyΣx1ΣxyΣy1opC2((sx+syn)12+(p+qn)12)+(p+qn)2C2((p+qn)12+p+qn)

because sx+sy<p+q. If (p+q)e(sx+sy)2, the above bound is of the order (sx+sy)n. Thus Theorem 4 follows.

Appendix G. Proof of Corollary 2

Proof of Corollary 2. We will first show that there exist C,c>0 so that

maxi[r]u^n,iwui2C(sx+sy)nmax{(clog(p+q(sx+sy)2))12,1}. (59)

For the sake of simplicity, we denote the matrix U^(1) in Algorithm 2 by U^. Denoting ϵn=η(Σ~xy)Σx1ΣxyΣy1op, we note that

Σx12η(Σ~xy)Σy12Σx12UΛVTΣy12opϵn.

Also the matrix U^pre defined in Algorithm 2 and Σx12U are the matrices corresponding to the leading r singular vectors of Σx12η(Σ~xy)Σy12 and Σx12UΛVTΣy12, respectively. By Wedin’s sin-theta theorem (we use Theorem 4 of [52]), for any 1i<r,

minw{±1}U^iprewΣx12ui232(2Λ1+ϵn)ϵnmin{Λi12Λi2,Λi2Λi+12},

where Λ0 is taken to be ∞, and

minw{±1}U^rprewΣx12ur2232(2Λ1+ϵn)ϵnΛr12Λr2.

Since P𝒫G(r,sx,sy,), mini[r](Λi1Λi)>1 and mini[r]Λi>1. Therefore, for ϵn<1, we have

maxi[r]minw{±1}U^iprewΣx12ui2Cϵn.

We have to show ϵn<1. Theorem 4 gives a bound on ϵn, which can be made smaller than one if the C in (23) is chosen to be sufficiently large. Hence, the above inequality holds. Because U^ipre=Σx12u^n,i, using the fact Σxop, the last display implies

maxi[r]minw{±1}u^n,iwui212Cϵn,

which, combined with Theorem 4, proves (59). Now note that the constant C in (23) can be chosen so large such that the right hand side of (59) is smaller than 1(22r). Since Σxop<, it follows that Condition 1 is satisfied, and the rest of the proof then follows from Theorem 1.

Appendix H. Proof of Auxiliary Lemmas

A. Proof of Technical Lemmas for Theorem 2

The following lemma can be verified using elementary linear algebra, and hence its proof is omitted.

Lemma 12. Suppose Σ is of the form (41). Then the spectral decomposition of Σ is as follows:

Σ=i=1p1x1(i)(x1(i))T+i=1q1x2(i)(x2(i))T+(1+ρ)x3x3T+(1ρ)x4x4T,

where the eigenvectors are of the following form:

  1. For i[p1],x1(i)=(yi,0q), where {yi}i=1p1Rp forms an orthonormal basis of the orthogonal complement of α.

  2. For i[q1],x2(i)=(0p,zi), where {zi}i=1q1Rq forms an orthonormal basis of the orthogonal complement of β.

  3. x3=(α2,β2) and x4=(α2,β2).

Here for kN, 0k denotes the k-dimensional vector whose all entries are zero.

Lemma 13. Suppose Σ is as in (41). Then det(Σ)=1ρ2 and

Σ1=[IααT00IββT]+12(1+ρ)[ααTαβTβαTββT]+12(1ρ)[ααTαβTβαTββT].

Proof of Lemma 13. Follows directly from Lemma 12.
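
The following numpy sketch (ours) checks the eigenstructure of Lemma 12 and the determinant in Lemma 13 numerically, assuming Σ in (41) has identity diagonal blocks and cross-covariance block ραβᵀ with unit vectors α, β; this assumed form is consistent with the expressions in Lemmas 12–14 but is not spelled out here.

```python
# Numerical sanity check (ours) of Lemmas 12 and 13 under the assumed block
# form of (41): identity diagonal blocks and cross block rho * alpha beta^T.
import numpy as np

rng = np.random.default_rng(2)
p, q, rho = 5, 4, 0.6
alpha = rng.normal(size=p); alpha /= np.linalg.norm(alpha)
beta = rng.normal(size=q);  beta /= np.linalg.norm(beta)

Sigma = np.block([[np.eye(p),                   rho * np.outer(alpha, beta)],
                  [rho * np.outer(beta, alpha), np.eye(q)]])

eigs = np.sort(np.linalg.eigvalsh(Sigma))
# All eigenvalues equal 1 except for one eigenvalue 1 - rho and one 1 + rho.
assert np.isclose(eigs[0], 1 - rho) and np.isclose(eigs[-1], 1 + rho)
assert np.allclose(eigs[1:-1], 1.0)
assert np.isclose(np.linalg.det(Sigma), 1 - rho**2)   # det(Sigma) = 1 - rho^2
print("Lemma 12/13 eigenvalue and determinant checks passed")
```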

Lemma 14. Suppose Σ1 and Σ2 are of the form (41) with singular vectors α1, β1, α2, and β2, respectively. Then

Tr(Σ1Σ21)=p+q+2ρ21ρ2(1(β1Tβ2)(α1Tα2)).

Proof. Lemma 13 can be used to obtain the form of Σ21, which implies Σ1Σ21 equals

[Ipρα1β1Tρβ1α1TIq][Ip+ρCρα2α2TCρα2β2TCρβ2α2TIq+ρCρβ2β2T,]

where Cρ=ρ1ρ2. Since Tr(Σ1Σ21) equals the sum of the two p × p and q × q diagonal submatrices, we obtain that

Tr(Σ1Σ21)=Tr(Ip+ρCρα2α2TρCρ(β1Tβ2)α1α2T)+Tr(Iq+ρCρβ2β2TρCρ(α1Tα2)β1β2T)=p+q+ρ21ρ2(Tr(α2Tα2)+Tr(β2Tβ2))((β1Tβ2)Tr(α1α2T)(α1Tα2)Tr(β1β2T)),

where we used the linearity of the trace operator, as well as the fact that Tr(AB)=Tr(BA). Noticing α22=β22=1, the result follows.
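
The trace identity of Lemma 14 can likewise be confirmed numerically; the sketch below (ours) uses the same assumed block form of (41) as in the previous check.

```python
# Quick numerical check (ours) of the trace identity in Lemma 14.
import numpy as np

def make_sigma(alpha, beta, rho):
    p, q = len(alpha), len(beta)
    return np.block([[np.eye(p),                   rho * np.outer(alpha, beta)],
                     [rho * np.outer(beta, alpha), np.eye(q)]])

rng = np.random.default_rng(3)
p, q, rho = 6, 5, 0.4
unit = lambda v: v / np.linalg.norm(v)
a1, a2 = unit(rng.normal(size=p)), unit(rng.normal(size=p))
b1, b2 = unit(rng.normal(size=q)), unit(rng.normal(size=q))

lhs = np.trace(make_sigma(a1, b1, rho) @ np.linalg.inv(make_sigma(a2, b2, rho)))
rhs = p + q + 2 * rho**2 / (1 - rho**2) * (1 - (b1 @ b2) * (a1 @ a2))
assert np.isclose(lhs, rhs)
print("Lemma 14 trace identity verified numerically:", lhs, "==", rhs)
```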

B. Proof of Key Lemmas for Theorem 4

1). Proof of Lemma 7:

Proof of Lemma 7. Note that

η(S1)Σx1ΣxyΣy1opη(S1)S1opT1+S1Σx1ΣxyΣy1opT2.

We deal with the term T1 first. Recall from (26) that S1=𝒫E1×E2(Σ~xy) is a sparse matrix. In particular, each row and column of S1 can have at most sy and sx many nonzero elements, respectively. Now we make use of two elementary facts. First, for x0, η(x)xThrn, and second, for any matrix ARp×q,

Aopmax1ipj=1pAijmax1jqi=1nAij.

The above results, combined with the row and column sparsity of S1, lead to

T1=η(S1)S1op(max1ip(S1)i0)(max1jq(S1)j0)Thrnmin{sx,sy}Thrn,

which gives the first term in the bound on η(S1)Σx1ΣxyΣy1op.

Now for T2, noting Σx1ΣxyΣy1=UΛVT, observe that

S1Σx1ΣxyΣy1=UΛ12(ZTZnIr)Λ12VTS11+UΛ12ZTZ2nΣy(V~By(V~E2)T)S12+(U~E1ByU~T)ΣxZ1TZnΛ12VTS13+(U~E1BxU~T)ΣxZ1TZnΣy(V~By(V~E2)T)S14.

It is easy to see that

S11opΣx12opΣx12UopΣy12opΣy12VopΛop×ZTZnIrop.

Since (X,Y)P𝒫G(r,sx,sy,), Σx1 and Σy1 are bounded in operator norm by . Also, Σx12U~ and Σy12V~ are orthonormal matrices. Therefore the operator norms of the matrices Σx12U, Σy12V, and Λ are bounded by one. On the other hand, by Bai-Yin’s law on eigenvalues of Wishart matrices (cf. Theorem 4.7.1 of [42]), ZTZnIropC(rn+rn) with high probability. Since r<sx<n, clearly rn<1. Thus S11opCrn with high probability. Hence it suffices to show that the terms S12, S13, and S14 are small in operator norm, for which we will make use of Lemma 4. First let us consider the case of S12. Clearly,

S12opΣx12opΣx12UopΛ12op×ZTZnΣy(V~By(V~E2)T)op.

We already mentioned the bound on Σx1op, and that Σx12Uop and Λop are bounded by one. Therefore, it follows that

S12op12ZTZ2nΣy(V~By(V~E2)T)op.

Now we apply Lemma 4 on the term ZTZ2Σy(V~By(V~E2)T) with A=Ir, and B=ΣyV~By(V~E2)T. Note that Σy, V~, and By are full rank matrices, i.e., they have rank q. Therefore, the rank of B equals rank of V~E2. Note that the rows of the matrix V~ are linearly independent because the square matrix V~ has full rank. Therefore, the rank of V~E2 is E2, which is sy. Hence, the rank of B is also sy. Also note that (A)=rsyn. Therefore Lemma 4 can be applied with a=r and b=sy. Also, Aop=1 trivially follows. Using the same arguments which led to (55), on the other hand, we can show that Bop by (26). Therefore Lemma 4 implies that for any t > 0, the following holds with probability at least 1exp(Cn)exp(t22):

ZTZ2nΣy(V~By(V~E2)T)opCmax{sy,t}n,

which implies S12C32max{sy,t}n with high probability. Exchanging the role of X and Y in the above arguments, we can show that S13C32max{sx,t}n with high probability. For S14, we note that

S14op(U~E1BxU~T)ΣxZ1TZ2nΣy(V~By(V~E2)T)op.

We intend to apply Lemma 4 with A=ΣxU~Bx(U~E1)T and B=ΣyV~By(V~E2)T. Arguing along the lines of the proof for the term S12, we can show that A and B have rank a=sx and b=sy, respectively. Without loss of generality we assume sysx, which yields ba, as required by Lemma 4. Otherwise, we can just take the transpose of S14, which leads to a=sy and b=sx, implying ba. Using (55), as before, we can show that the operator norms of A and B are bounded by . Therefore, Lemma 4 implies that for all t0,

S14opC2max{sx,sy,t}n

with probability at least 1exp(Cn)exp(t22). Hence, it follows that with probability 1o(1),

S1Σx1ΣxyΣy1opC2max{sx,sy}n.

2). Proof of Lemma 8:

Without loss of generality, we will assume that p>q. We will also assume, without loss of generality, that p=p and q=q. If that is not the case, we can add some zero rows to M and zero columns to N, respectively, which does not change their operator norm, but ensures p=p and q=q. For any pN, let Sp1 denote the unit sphere in Rp. We denote an ϵ-net (with respect to the Euclidean norm) on any set 𝒳Rp by Tϵ(𝒳). When 𝒳=Sp1, there exists an ϵ-net of Sp1 so that

Tϵ(Sp1)(1+2ϵ)p.

By Tpϵ, we denote such an ϵ-net. Although Tpϵ may not be unique, that is not necessary for our purpose. For a subset S[p], Tpϵ(S) will denote an ϵ-net of the set {xSp1:xi=0 if iS}. Note that each element of the latter set has at most S1 many degrees of freedom, from which one can show that Tkϵ(S)(1+2ϵ)S. The following Fact on ϵ-nets will be very useful for us. The proof is standard and can be found, for example, in [42].

Fact 5. Let ARp×q for p, qN. Then there exist xTpϵ and yTqϵ such that x,Ay(12ϵ)Aop.

Letting An=η(MZ1TZ2N), and using Fact 5, we obtain that

P(Anop>δ)P(maxxTpϵ,yTqϵx,Any(12ϵ)δ)

for any δ>0. Proceeding as in Proposition 15 of [1], we fix 1<Jp,qmin{p,q} and introduce the sets

Sx={i[p]:xiJp,qp}Sy={i[q]:yiJp,qq}, (60)

and their complements Sxc=[p]Sx and Syc=[q]Sy. The precise value of Jp,q will be chosen later. For any subset A[k], kN, and vector xRk, we denote by xA the projection of x onto A, which means xARk and (xA)i=xi if iA, and zero otherwise. Let us denote the projections of x and y on Sx, Sxc, Sy, and Syc by xSx, xSxc, ySy, and ySyc, respectively. Note that this implies

x=xSx+xSxc,y=ySy+ySyc,

as well as

xSx,xSxcRp,ySy,ySycRq.

There are fewer elements in the sets Sx and Sy than in their complements. Therefore, we will treat these sets separately. To that end, we consider the splitting

P(maxxTpϵ,yTqϵx,Any4δ(12ϵ))P(maxxTpϵ,yTqϵxSx,AnySyδ(12ϵ))T1+P(maxxTpϵ,yTqϵxSx,AnySycδ(12ϵ))T2+P(maxxTpϵ,yTqϵxSxc,Anyδ(12ϵ))T3 (61)

The term T1 can be bounded by Lemma 15.

Lemma 15. Suppose M and N are as in Lemma 8 and An=η(QM,N) where QM,N=MZ1TZ2Nn. Then for any Δ>0, there exist absolute constants C, c>0 such that

P{maxxTpϵ,yTqϵxSx,AnySyΔ}Cexp((p+q)log(CJp,q)Jp,q)(n2Δ24CMop2Nop2(2n+p+q))+CΔ2Mop2Nop2(n(p+q))C{ec(n+q)+ec(n+p)}

We state another lemma which helps in controlling the terms T2 and T3.

Lemma 16. Suppose M, N, Z1, Z2, and An are as in Lemma 8. Let K0=161Mop2Nop2. Suppose K>0 is such that KK0 and moreover, τ[K0,Klogp2]. Let 𝒯2 be either the set Tqϵ or the set T~qϵ={ySy:yTqϵ}. Then there exist absolute constants C, c>0 such that the following holds for any Δ>0:

P{maxxTpϵ,y𝒯2xSxc,AnyΔ}Cexp(C(p+q)Δ2n2eτ2KCMop2Nop2Jp,q(2n+p+q))+CMop2Nop2Δ2(n(p+q))Cexp(cmin(n,p)).

Note that when 𝒯2=Tqϵ, Lemma 16 yields a bound on T3. On the other hand, the case 𝒯2=T~qϵ yields a bound on the term

T2=P(maxxTpϵ,yTqϵxSxc,AnySyδ(12ϵ)). (62)

The term defined in (62) is not exactly the T2 appearing in (61); however, interchanging the roles of x and y in (62) gives T2. Since the upper bound given by Lemma 16 is symmetric in p and q, it is not hard to see that the same bound works for T2.

If we let ϵ=14, then Δ=δ2. Combining the bounds on T1, T2, and T3, we conclude that the right hand side of (61) is o(1) if Δ2 is larger than some constant multiple of

max{(n+p+q)(p+q)n2(logJp,qJp,q+Jp,qeτ2K0)},{(n(p+q))Cexp(cmin{n,p})}Mop2Nop2

where K0=320Mop2Nop2. We will show that the first term dominates the second term. By our assumption on τ, τ2<80logpMop2Nop2, which implies τ2K0<log(pq)2, which combined with the fact Jp,q>1, yields Jp,qexp(τ2K0)>Jp,qpq. On the other hand, under p > q, our assumption on n implies logn=o(p). Also because p+q=o(logn), it follows that (n(p+q))Cexp(cmin{n,p}) is small, in particular

(n+p+q)(p+q)n2(logJp,qJp,q+Jp,qeτ2K0)(n+p+q)(p+q)n2pq(n(p+q))Cexp(cmin{n,p}).

Therefore, for P(Anop>δ) to be small,

δ2>Cmin1<Jp,q<pqMop2Nop2(n+p+q)(p+q)n2×(logJp,qJp,q+Jp,qeτ2K0)

suffices. In particular, we choose Jp,q=exp(τ2(2K0)). Note that because τ2K0log(pq)2, this choice of Jp,q ensures that Jp,qmin{p,q}, as required. The proof follows noting this choice of Jp,q also implies

logJp,qJp,q+Jp,qeτ2K0eτ2(2.5K0)={exp(τ2402Mop2Nop2)}2.

3). Proof of Lemma 9:

Proof of Lemma 9. For any i[p] and j[q],

Z1MiMi2N(0,In)

and Z2NjNj2N(0,In) are independent. In this case, there exist absolute constants δ, c and C>0, so that (cf. Lemma A.3 of [50])

P(MiTZ1TZ2NjMi2Nj2nt)Cexp(cnt2)

for all tδ. Since (MZ1TZ2N)ij=MiTZ1TZ2Nj, and Mi2,Nj2, using a union bound we obtain

P(MZ1TZ2Nnt)Cexp(log(pq)cnt24).

Letting τ=2Clog(p+q) and t=τn, we observe that for our choice of τ, t<δ for all sufficiently large n since log(p+q)=o(n). Therefore, the above inequality leads to

P(η(QM,N)0)=P(QM,Nτn)Cexp(2log(p+q)cClog(p+q)).

Because pp and qq by our assumption on M and N, C>2c suffices. Hence the proof follows.

4). Proof of Lemma 11:

Proof of Lemma 11. From the definition of S3 in (26), and (54), it is not hard to see that η(S3)=η(H1)+η(H2). We will show that H1 is of the form M[ZZ1]TZ2N where Mop2 and Nop. Then the first part would follow from Lemma 8, which, when applied to this case, would imply

η(H1)opC2(p+qnp+qn)eThr2K

provided Thr[362, Klog(max{p+r,q})2] and K12884. Since r<min{p,q}, the upper bound on Thr becomes Klog(2max{p,q})2. The proof for η(H2)op follows in a similar way and is hence skipped.

Letting

A1=Λ12UT,A2=ΣxU~Bx(U~E1)T,A3=ΣyV~By(V~F2)T,

we note that (54) implies H1=A1TZTZ2A3+A2TZ1TZ2A3, which can be written as

H1=A4T([ZZ1])TZ2A3,whereA4=[A1A2].

We will now invoke Lemma 8 because Z3=[ZZ1] is a Gaussian data matrix with n rows and p+r2p columns, and the matrices A4 and A3 are also bounded in operator norm. To see the latter, first noting A4op2=A4TA4op, we observe that

A4TA4op=A1TA1+A2TA2opA1op2+A2op2.

Therefore it suffices to bound the operator norms of A1, A2, and A3 only. Using (55), we can show that the operator norm of the matrices of the form A2, or A3 is bounded by for (X,Y)P𝒫(r,sx,sy,). Since Σx12U has orthogonal columns, it can be easily seen that A1op1. Therefore

A4opA1op+A2op1+2

because >1 as per the definition of 𝒫(r,sx,sy,). The proof of the first part now follows by Lemma 8. Because A4op2 and A3op, the proof of the second part follows directly from Lemma 9, and hence skipped.

C. Proof of Additional Lemmas for Section III-C and Theorem 3

Proof of Lemma 2. To prove the current lemma, we will require a result on the concentration of α and β under πx and πy. To that end, for s, mN satisfying sm, let us define the set

𝒲(s,m)={xRm:x0[s2,2s],x2[0.9,1.1]}.

Suppose πx and πy are the Rademacher priors on α and β as defined in Section III-C. The following lemma then says that α and β concentrate on 𝒲(sx,p) and 𝒲(sy,q) with probability tending to one.

Lemma 17. Suppose sx and sy tend to infinity. Then

limnπx(α𝒲(sx,p))=1;limnπy(β𝒲(sy,q))=1. (63)

Here the probability πx(α𝒲(sx,p)) depends on n through sx and p. Similarly πy(β𝒲(sy,q)) depends on n through sy and q.

Recall the definition of Pα,β from (19). Let us consider the class

𝒫sub()={Pα,β:α𝒲(sx,p),β𝒲(sy,q)}.

If α𝒲(sx,p) and β𝒲(sy,q), then α2β2(1.1)2< because >2. Therefore (19) implies that (X,Y)P𝒫sub() has canonical correlation 1. Thus 𝒫sub()𝒫G(r,2sx,2sy,), implying

liminfnsupPn𝒫G(r,2sx,2sy,)nPn(Φn(X,Y)=1))liminfnsupPn𝒫sub()Pn(Φn(X,Y)=1)).

Suppose x and y are the Borel σ-field associated with 𝒲(sx,p) and 𝒲(sy,q), respectively. Define the probability measures π~x and π~y on (𝒲(sx,p),x) and (𝒲(sy,q),y), respectively, by

π~x(A)=πx(A)πx(𝒲(sx,p))for allAx,

and

π~y(B)=πy(B)πy(𝒲(sy,q))for allBy.

Note also that if α𝒲(sx,p) and β𝒲(sy,q), then Pα,β𝒫sub(). Therefore

liminfnsupPn𝒫sub()Pn(Φn(X,Y)=1))liminfn𝒲(sx,p)×𝒲(sy,q)Pn,α,β(Φn(X,Y)=1)dπ~x(α)dπ~y(β)=lim infn𝒲(sx,p)×𝒲(sy,q)Pn,α,β(Φn(X,Y)=1)dπx(α)dπy(β)lim supn(πx(𝒲(sx,p))πy(𝒲(sy,q))),

whose denominator is one by Lemma 17. Denoting 𝒲(sy,q)c=Rq𝒲(sy,q), we note that

Rp×𝒲(sy,q)cPn,α,β(Φn(X,Y)=1)dπx(α)dπy(β)1πy(𝒲(sy,q))n0

by Lemma 17. Similarly, denoting 𝒲(sx,p)c=Rp𝒲(sx,p), we can show that

𝒲(sx,p)c×RqPn,α,β(Φn(X,Y)=1)dπx(α)dπy(β)n0.

Therefore, it holds that

liminfn𝒲(sx,p)×𝒲(sy,q)Pn,α,β(Φn(X,Y)=1)dπx(α)dπy(β)=&liminfnEπ[Pn,α,β(Φn(X,Y)=0)].

Thus the proof follows.

Proof of Lemma 17:

Proof of Lemma 17. We are going to show (63) only for πx because the proof for πy follows in an identical manner. Throughout we will denote by Eπx and varπx the expectation and variance under πx. Note that when απx, α0=i=1pI[αi0], where I[αi0]’s are i.i.d. Bernoulli random variables with success probability sxp. Therefore, Chebyshev’s inequality yields that for any ϵ>0,

πx(α0sx>sxϵ)pvarπx(I[αi0])sx2ϵ2=1sxpsxϵ2,

which goes to zero if sx tends to infinity. Therefore, for ϵ=12, we have

πx(α0[sx2,2sx])πx(α0sx>sxϵ)0.

Also, since Eπx[i=1pαi2]=1, Chebyshev’s inequality implies that

πx(i=1pαi21ϵ)varπx(i=1pαi2)ϵ2=(a)p.varπx(αi2)ϵ2pEπx[αi4]ϵ2=1sxϵ2,

which goes to zero as sx tends to infinity, for any fixed ϵ>0. Here (a) uses the fact that the αi’s are i.i.d. The proof now follows by setting ϵ=0.1.
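
A small simulation sketch (ours) of Lemma 17 follows, assuming, consistently with the moments used in the proof above, that under πx each coordinate of α equals ±1/√sx with probability sx/(2p) each and 0 otherwise.

```python
# Simulation sketch (ours) of Lemma 17 under the assumed sparse Rademacher prior.
import numpy as np

rng = np.random.default_rng(5)
p, s_x, reps = 20000, 400, 2000

hits = 0
for _ in range(reps):
    nonzero = rng.random(p) < s_x / p                 # coordinate is nonzero w.p. s_x/p
    signs = rng.choice([-1.0, 1.0], size=p)
    alpha = nonzero * signs / np.sqrt(s_x)
    in_W = (s_x / 2 <= nonzero.sum() <= 2 * s_x) and (0.9 <= alpha @ alpha <= 1.1)
    hits += in_W
print(f"empirical pi_x(alpha in W(s_x, p)) = {hits / reps:.4f}")  # close to 1 for large s_x
```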

1). Proof of Lemma 5:

The proof of Lemma 5 depends on two auxiliary lemmas. We state and prove these lemmas first.

Lemma 18. Suppose wZm, and ARm×m is a matrix. Let P be the measure induced by the m-dimensional standard Gaussian random vector and denote by EP the corresponding expectation. Then for any tRm we have

jZmtjj!EP[Hj(AZ)]=etT(A2I)t2.

Proof of Lemma 18. The generating function of Hw has the convergent expansion [69, Proposition 6]

jZmtjj!Hj(x)=exp{tTxtTt2}

for any xRm. Therefore,

jZmtjj!Hj(Ax)=exp{tTAxtTt2}.

Multiplying both sides by the density dP of P and then integrating over Rm gives

jZmtjj!EP[Hj(AZ)]=EP[etTAZ]etTt2=etT(A2I)t2.
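
A one-dimensional illustration (ours) of Lemma 18: for a scalar A = a, the identity reads Σj tj/j! E[Hej(aZ)] = exp(t²(a²−1)/2), so E[He2k(aZ)] = (a²−1)k(2k−1)!! and the odd-order expectations vanish. The Monte Carlo sketch below checks this using SciPy's probabilists' Hermite polynomials.

```python
# Monte Carlo check (ours) of the scalar case of the generating-function identity.
import numpy as np
from scipy.special import eval_hermitenorm, factorial2

rng = np.random.default_rng(4)
a = 1.5
Z = rng.normal(size=2_000_000)

for k in range(1, 4):
    mc = eval_hermitenorm(2 * k, a * Z).mean()        # Monte Carlo E[He_{2k}(aZ)]
    exact = (a**2 - 1) ** k * factorial2(2 * k - 1)   # coefficient from exp(t^2 (a^2-1)/2)
    print(f"k={k}: Monte Carlo {mc:.3f}  vs  exact {exact:.3f}")
```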

Lemma 19. Let Σ(α,β,1) be as defined in (18). Suppose z=(zx,zy) where zxZp and zyZq. Then for any tRp+q, we have

tzexp{12tT(Σ(α,β,1)Ip+q)t}t=(0,,0)={zxzx!αzxβzyifzx=zy,0o.w.}

Proof of Lemma 19. Let us partition t as (tx,ty) where tx=(tx(1),,tx(p))Rp and ty=(ty(1),,ty(q))Rq. We then calculate

tT(Σ(α,β,1)Ip+q)t2=1txTαβTty,

which implies

exp{12tT(Σ(α,β,1)Ip+q)t}=exp{1i=1pj=1qαiβjtx(i)ty(j)}=k=0k(i=1pαitx(i))k(j=1qβjty(j))kk!

which equals

k=0kk!zxZp,zx=kzyZq,zy=kk!zx!k!zy!αzxβzytxzxtyzy=k=0zxZp,zx=kzyZq,zy=kkk!z!αzxβzytxzxtyzy=(a)zZp+qzx=zyzxzx!z!αzxβzytz.

In step (a), we stacked the variables zx and zy to form z=(zx,zy)T. Note that following the terminologies set in the beginning of Appendix E, z!=zx!zy! and tz=txzxtyzy. Note that if zxzy, then the term tz has zero coefficient in the above expansion. Thus the lemma follows.

Proof of Lemma 5.

Ln,HwL2(Qn)=E(X,Y)Qn[Eπ[Hw(X,Y)dPn,α,βdQn]]=Eπ[E(X,Y)Pn,α,β[Hw(X,Y)]]=(a)Eπ[E(Xi,Yi)Pα,β,i[n][i[n]Hwi(Xi,Yi)]]=Eπ[i[n]E(Xi,Yi)Pα,β[Hwi(Xi,Yi)]]

where (a) follows because (Xi,Yi)’s are independent observations. Now note that if αβ2, then (19) implies

E(Xi,Yi)Pα,β[Hwi(Xi,Yi)]=E(Xi,Yi)Q[Hwi(Xi,Yi)]=0,

where the last step follows because EZQ[Hwi(Z)]=0 for any i[n]. If αβ2<, then Σ(α,β,1) defined in (18) is positive definite, and (19) implies

E(Xi,Yi)Pα,β[Hwi(Xi,Yi)]=EZQ[Hwi(Σ(α,β,1)12Z)]=twi(exp{12tT(Σ(α,β,1)Ip+q)t})t=(0,,0)

by Lemma 18. Here Σ(α,β,1) is as in (18), and Σ(α,β,1) is positive definite because α2β2<, as discussed in Section III-C. Therefore, we can write

E(Xi,Yi)Pα,β[Hwi(Xi,Yi)]=1{α2β2<}×twi(exp{12tT(Σ(α,β,1)Ip+q)t})t=(0,,0)

Lemma 19 gives the form of the partial derivative in the above expression, and implies that the partial derivative is zero unless wix=wiy. Therefore, Ln,HwL2(Qn)0 only if wix=wiy for all i[n]. In this case, wi=2wix is even, and by Lemma 19,

Ln,HwL2(Qn)=Eπ[1{α2β2<}i[n]wixwix!αwixβwiy]={i=1nwixi=1nwix!}×Eπ[1{α2β2<}αi=1nwixβi=1nwiy]=w2{i=1nwix!}×Eπ[1{α2β2<}αi=1nwixβi=1nwiy].

Therefore,

Ln,H^wL2(Qn)2={ww!Eπ[1{α2β2<}αi=1nwixβi=1nwiy]2×{i=1nwix!}2ifwix=wiyfor alli[n],0o.w.}

2). Proof of Lemma 6:

Proof. Lemma 5 implies that Ln belongs to the subspace generated by those Hw’s whose degree-index w satisfies wix=wiy for all i[n]. The degree of the polynomial Hw is w, which is even in the above case. Therefore, if Dn1 is odd, LnDnL2(Qn)2 equals Ln(Dn1)L2(Qn)2. Hence, it suffices to compute the norm of Ln2𝒟n, where 𝒟n=Dn2. Suppose wZn(p+q) is such that wix=wiy for all i[n]. Lemma 5 gives

Ln,H^wL2(Qn)2=ww!{Eπ[1{α2β2<}αi=1nwixβi=1nwiy]}2×{i=1nwix!}2.

Consider the pair of replicas α1,α2iidπx and β1,β2iidπy. Letting W denote the indicator function of the event {α12β12<,α22β22<}, we can then write

Ln,H^wL2(Qn)2=ww!Eπ[(α1α2)i=1nwix(β1β2)i=1nwiyW]×{i=1nwix!}2. (64)

Denote by d¯=(d1,,dn)Zn. Using (64), we obtain the following expression:

Ln2𝒟nL2(Q)=d=0𝒟n2dd¯:di=dw:wixZp,wiyZq,wix=wiy=diTd¯,w

where

Td¯,w=Eπ[Wi=1n(di2wix!wiy!(α1α2)wix(β1β2)wiy)].

Therefore Ln2𝒟nL2(Q) equals

d=0𝒟n2dd¯:di=dEπ[Ww:wixZp,wiyZq,wix=wiy=di(i=1ndi!wix!(α1α2)wix)]×[(i=1ndi!wiy!(β1β2)wiy)]=d=0𝒟n2dd¯:di=dEπ[W(wx:wixZpwix=dii=1ndi!wix!(α1α2)wix)]×[(wy:wiyZqwiy=dii=1ndi!wiy!(β1β2)wiy)]

In the last step, we used the variables wx=(w1x,,wnx), and wy=(w1y,,wny). Suppose ziZp for each i[n]. For any xRp and yRq, it holds that

ziZp,zi=dii=1ndi!zi!xziyzi=i=1nziZp,zi=didi!zi!xziyzi=(a)i=1n(xTy)di=(xTy)i=1ndi,

where (a) follows from Fact 6.

Fact 6. [Multinomial Theorem] Suppose αRp. Then for mZ,

(i=1pαi)m=zZp,z=mm!αzz!.

Therefore it follows that

(wx:wixZpwix=didi!wix!(α1α2)wix)(wy:wiyZqwiy=didi!wiy!(β1β2)wiy)=(α1Tα2)i=1ndi(β1Tβ2)i=1ndi,

which implies

Ln2𝒟nL2(Q)=d=0𝒟n2dd¯:di=dEπ[W(α1Tα2)i=1ndi(β1Tβ2)i=1ndi]=(a)d=0𝒟n2d(d+n1d)Eπ[W(α1Tα2)d(β1Tβ2)d]=Eπ[Wd=0𝒟n{(d+n1d)(2(α1Tα2)(β1Tβ2)d}].

where (a) follows since the number of d¯Zn such that d¯=d equals (n+d1d). Noting 𝒟n=Dn2, the proof follows.
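
The counting step (a) above can be checked directly by brute force; the tiny sketch below (ours) verifies that the number of nonnegative integer vectors in Zn with coordinates summing to d equals the stated binomial coefficient.

```python
# Brute-force check (ours) of the stars-and-bars count used in step (a).
from itertools import product
from math import comb

n, d = 4, 6
count = sum(1 for v in product(range(d + 1), repeat=n) if sum(v) == d)
assert count == comb(n + d - 1, d)
print(f"n={n}, d={d}: {count} compositions == C({n + d - 1},{d}) = {comb(n + d - 1, d)}")
```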

D. Proof of Technical Lemmas for Theorem 4

First, we introduce some additional notation and state some useful results that will be used repeatedly throughout the proof. Suppose ARp×q. We can write A as

A=[A1A2Aq].

We define the vectorization operator as

Vec(A)=[A1Aq].

We will use two well-known identities for the vectorization operator, which follow from Section 10.2.2 of [70].

Fact 7. A. Trace(ATB)=Vec(A)TVec(B).

B. Vec(AXB)=(BTA)Vec(X), where denotes the Kronecker product.

Often times we will also use the fact that [71, Theorem 13.12]

ABop=AopBop. (65)

Define the Hadamard product between vectors x=(x1,,xp) and y=(y1,,yp) by

xy=(x1y1,,xpyp)T.

Note that the Cauchy-Schwarz inequality implies that

xy2x2y2 (66)

We will also often use Fact 1, which states that ABF2Aop2BF2.
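
The identities collected above are easy to confirm numerically; the sketch below (ours) checks Facts 7A and 7B, the Kronecker norm identity (65), the Hadamard bound (66), and Fact 1 on random matrices, using the column-stacking convention for Vec.

```python
# Numerical check (ours) of the vectorization / Kronecker / Hadamard identities.
import numpy as np

rng = np.random.default_rng(6)
A, B = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
X = rng.normal(size=(3, 5))
C = rng.normal(size=(5, 2))
vec = lambda M: M.flatten(order="F")                 # stack the columns, as in the text

# Fact 7A: Trace(A^T B) = Vec(A)^T Vec(B)
assert np.isclose(np.trace(A.T @ B), vec(A) @ vec(B))
# Fact 7B: Vec(A X C) = (C^T kron A) Vec(X)
assert np.allclose(vec(A @ X @ C), np.kron(C.T, A) @ vec(X))
# (65): ||A kron B||_op = ||A||_op ||B||_op
assert np.isclose(np.linalg.norm(np.kron(A, B), 2),
                  np.linalg.norm(A, 2) * np.linalg.norm(B, 2))
# (66): ||x o y||_2 <= ||x||_2 ||y||_2 for the Hadamard product of vectors
x, y = rng.normal(size=7), rng.normal(size=7)
assert np.linalg.norm(x * y) <= np.linalg.norm(x) * np.linalg.norm(y)
# Fact 1: ||A X||_F^2 <= ||A||_op^2 ||X||_F^2
assert np.linalg.norm(A @ X, "fro")**2 <= np.linalg.norm(A, 2)**2 * np.linalg.norm(X, "fro")**2
print("Vectorization / Kronecker / Hadamard identities verified")
```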

1). Proof of Lemma 10:

Proof. The first result is immediate. For the second result, for any xRm and yRp, denote by xD1 and yD2 the projections of x and y onto the coordinate sets D1 and D2, respectively. Then

xTD(A)yxy=xD1TAyD2xyxD1TAyD2xD1yD2

Thus the maximum singular value of D(A) is smaller than that of A, indicating that

D(A)A.

2). Proof of Lemma 4:

First, we state and prove two facts, which are used in the proof of Lemma 4.

Fact 8. Suppose ARn×r, BRp×s are potentially random matrices satisfying ATA=Ir and BTB=Is. Let XRn×p be such that r,sp, and XA,B is distributed as a standard Gaussian data matrix. Then the matrix ATXBA,B is distributed as a standard Gaussian data matrix.

Proof of Fact 8. XRn×p is a Gaussian data matrix with covariance ΣRp×p if and only if

Vec(XT)Nnp(0,InΣ). (67)

Now

Vec((ATXB)T)=Vec(BTXTA)=(a)(ATBT)Vec(XT)

where (a) follows from Fact 7B. However, since (ATBT)Rrs×np, (67) implies

(ATBT)Vec(XT)A,BNrs(0,(ATBT)(AB)),

but

(ATBT)(AB)=ATABTB=IrIs=Irs.

Therefore,

Vec((ATXB)T)A,BNrs(0,Irs).

Then the result follows from (67).

In the above fact, it may appear that ATXB is independent of the matrices A and B since its conditional distribution is standard Gaussian. However, ATXB still depends on A and B through r and s, which may be random quantities.
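
A Monte Carlo sketch (ours) of Fact 8: with A and B having orthonormal columns and X a standard Gaussian data matrix, the entries of ATXB should again behave like i.i.d. N(0,1) variables.

```python
# Monte Carlo illustration (ours) of the rotational invariance in Fact 8.
import numpy as np

rng = np.random.default_rng(7)
n, p, r, s = 10, 8, 3, 2
A, _ = np.linalg.qr(rng.normal(size=(n, r)))    # A^T A = I_r
B, _ = np.linalg.qr(rng.normal(size=(p, s)))    # B^T B = I_s

samples = np.array([(A.T @ rng.normal(size=(n, p)) @ B).ravel()
                    for _ in range(50_000)])    # each row: one draw of Vec(A^T X B)

cov = np.cov(samples, rowvar=False)             # should be close to the identity I_{rs}
print("max |mean|    :", np.abs(samples.mean(axis=0)).max())
print("max |cov - I| :", np.abs(cov - np.eye(r * s)).max())
```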

Fact 9. Suppose ARn×k, XRn×p, BRp×s are such that conditional on A and B, X is distributed as a standard Gaussian data matrix. Further suppose that the ranks of A and B are a and b, respectively. Then the following assertion holds:

ATXBopAopBopZop

where ZA,B is distributed as a standard Gaussian data matrix in Ra×b.

Proof of Fact 9. Suppose PA and PB are the projection matrices onto the column spaces of A and B, respectively. Then we can write PA=VAVAT and PB=VBVBT, where VARn×a and VBRp×b are matrices with full column rank so that VATVA=Ia and VBTVB=Ib. Writing A=PAA and B=PBB, we obtain that

ATXBop=ATVAVATXVBVBTBop

which is bounded by

AopVAopVATXVBopVBopBop

That VAop and VBop are one follows from the definitions of VA and VB. Fact 8 implies conditional on VA and VB, VATXVBRa×b is distributed as a standard Gaussian data matrix. Hence, the proof follows.

Proof of Lemma 4. Let us denote the rank of Z1D by a′. Note that a′rank(D)=a. Letting A=Z1D and applying Fact 9, we have the bound

DTZ1TZ2BopZ1DopZopBop

where ZZ1 is distributed as a standard Gaussian data matrix in Ra×b. Next we apply Fact 9 again, but now on the term Z1Dop, which leads to

Z1DopDopZop,

where ZRn×a is a standard Gaussian data matrix. Therefore,

DTZ1TZ2BopAopZopZopBop.

We use the Gaussian matrix concentration inequality in Fact 2 to show that with probability at least 1exp(Cn),Zop2(n+a). Also, for ZRa×b, the first part of Fact 2 implies

P(Zopa+b+tZ1)1exp(t22)

for any t>0. Since a′a, and t is deterministic, the above implies

P(Zopa+b+t)1exp(t22).

Hence, for any t>0, we have the following with probability at least 1exp(Cn)exp(t22):

DTZ1TZ2Bop2DopBop(n+a)(a+b+t).

Since abn, it follows that

DTZ1TZ2BopCDopBopnmax{b,t}.

Therefore, the proof follows.

3). Proof of Lemma 15:

Proof of Lemma 15. Denoting

𝒯={(x,y)Rp:x=xSx,y=ySy,xTpϵ,yTqϵ},

we note that

maxxTpϵ,yTqϵxSx,AnySy=max(x,y)Tx,Any.

Therefore it suffices to show that there exist absolute constants C, c>0 such that

P{max(x,y)𝒯x,AnyΔ}Cexp{(p+q)log(CJp,q)Jp,qn2Δ24CMop2Nop2(2n+p+q)}.+CΔ2Mop2Nop2(n(p+q))C{ec(n+q)+ec(n+p)}

Let us denote 𝒵1=Vec(Z1T), 𝒵2=Vec(Z2T), and 𝒵=(𝒵1T,𝒵2T)T. Thus

𝒵T={(𝒵1)1T,,(Z1)pT,(Z2)1T,,(Z1)qT}.

Recalling QM,N=MZ1TZ2Nn, we define

fx,y(𝒵1,𝒵2)=x,η(QM,N)y=x,Any. (68)

To obtain a tight concentration inequality for fx,y(𝒵1,𝒵2), we want to use the following Gaussian concentration lemma from [1].

Lemma 20 (Corollary 10 of [1]). Let 𝒵N(0,In) be a vector of n i.i.d. standard Gaussian variables. Suppose ℬ is a finite index set and we have functions Fb:RnR for every bℬ. Assume 𝒢Rn×Rn is a Borel set such that for Lebesgue-almost every (Z,Z)𝒢:

maxbsupt[0,1]Fb(tZ+1tZ)2.

Then, there exists an absolute constant C>0 so that for any Δ>0,

P(maxbFb(𝒵)EFb(𝒵)Δ)Cexp(Δ2C2)+CΔ2E[maxb(Fb(𝒵)Fb(𝒵))4]P(𝒢c)12.

Here 𝒵 is an independent copy of 𝒵.

In our case, the index b corresponds to (x, y), the index set ℬ corresponds to 𝒯, and the function Fb(𝒵) corresponds to fx,y(𝒵). To find the centering and the Lipschitz constant, we need to compute Efx,y(𝒵1,𝒵2) and the gradient 𝒵fx,y(𝒵1,𝒵2), respectively.

First, note that since Z1 and Z2 are independent standard Gaussian data matrices, QM,N is equal in distribution to −QM,N. Noting Eη(X)=0 for any symmetric random variable X, we deduce

Ex,Any=x,E[η(QM,N)]y=0.

Using Lemma 21 we obtain that

fx,y(𝒵1,𝒵2)𝒵12g(Z2)opvη(Vec(QM,N))2

and

fx,y(𝒵1,𝒵2)𝒵22h(Z1)opvη(Vec(QM,N))2

where

v=Vec(xyT),g(Z2)=Z2NMTnh(Z1)=Z1MTNn.

Because the derivative of η is bounded in absolute value by one,

vη(Vec(QM,N))2supxη(x)v2v2=x2y2

since v22=xyTF2=x22y22. Also, because ABop=AopBop by (65), we have

g(Z2)op2=Z2NMTop2n2=Z2Nop2Mop2n2Mop2Nop2Z2op2n2. (69)

and similarly,

h(Z1)op2Mop2Nop2Z1op2n2. (70)

Therefore,

fx,y(𝒵1,𝒵2)𝒵12x2y2MopNopZ2opn,fx,y(𝒵1,𝒵2)𝒵22x2y2MopNopZ1opn.

Letting fx,y(𝒵) denote fx,y(𝒵)𝒵, we note that the above two inequalities imply

fx,y(𝒵)22x22y22Mop2Nop2(Z1op2+Z2op2)n2.

Because x2, y21, we have

fx,y(𝒵)22Mop2Nop2(Z1op2+Z2op2)n2. (71)

We choose a good set 𝒢1 where the above bound is small. To that end, we take 𝒢1 to be

𝒢1={(Z~1,Z~1,Z~2,Z~2):Z~1Rn×p,Z~1Rn×p,Z~2Rn×q,}Z~2Rn×q,max{Z1op,Z1op}2(n+p),{max{Z2op,Z2op}2(n+q)}. (72)

Let us denote 𝒵i=Vec(ZiT) and 𝒵~i=Vec(Z~iT). To apply Lemma 20, we now define the process

𝒵i(t)=t𝒵~i+1t𝒵~i,t[0,1],i=1,2.

Equation 71 implies that on 𝒢1,

𝒵fx,y(𝒵1(t),𝒵2(t))224Mop2Nop2(2n+p+q)n2=.

We are now in a position to apply Lemma 20, which yields that

P(max(x,y)𝒯fx,y(𝒵1,𝒵2)Δ)C𝒯exp(Δ2C2)+CΔ2E[max(x,y)Tfx,y(𝒵1,𝒵2)4]12×P(𝒢1c)12. (73)

From equation 79 of [1] it follows that C can be chosen so large that

𝒯exp((p+q)log(CJp,q)Jp,q).

Thus, after plugging in the value of , the first term on the right hand side of (73) can be bounded above by

Cexp{(p+q)log(CJp,q)Jp,qn2Δ24CM2N2(2n+p+q)}.

To bound the second term in (73), notice that Lemma 22 yields the bound

E[max(x,y)𝒯fx,y(𝒵1,𝒵2)4]CMop4Nop4(n(p+q))C,

whereas Fact 2 leads to the bound

P(𝒢1c)122(exp(c(n+p))+exp(c(n+q))). (74)

Therefore the proof follows.

Lemma 21. Suppose fx,y is as defined in (68) and QM,N=MZ1TZ2Nn. Then

fx,y(𝒵1,𝒵2)𝒵12g(Z2)opvη(Vec(QM,N))2,fx,y(𝒵1,𝒵2)𝒵22h(Z1)opvη(Vec(QM,N))2

where v=Vec(xyT), g(Z2)=Z2NMTn, and h(Z1)=Z1MTNn.

Proof. Using v=Vec(xyT), and the fact that Tr(AB)=Tr(BA), we calculate that

fx,y(𝒵1,𝒵2)=Tr(yxTη(QM,N))=Tr((xyT)Tη(MZ1TZ2Nn))=Vec(xyT)TVec(η(MZ1TZ2Nn))=vTη(Vec(MZ1TZ2Nn)).

Fact 7 implies

Vec(QM,N)=(NTZ2TM)n𝒵1=g(Z2)T𝒵1, (75)

which yields fx,y(𝒵1,𝒵2)=vTη(g(Z2)T𝒵1). Noting vRpq, we can hence write fx,y(𝒵1,𝒵2) as

fx,y(𝒵1,𝒵2)=i=1pqviη([g(Z2)i]T𝒵1).

Let us denote by η′(x) the derivative of η evaluated at xR. For ARp×q, we denote by η′(A) the matrix whose (i, j)-th entry equals η′(Ai,j). Then we obtain that for j[np],

fx,y(𝒵1,𝒵2)(𝒵1)j=i=1pqviη([g(Z2)i]T𝒵1)g(Z2)ij,

indicating that

fx,y(𝒵1,𝒵2)𝒵1=i=1pqviη([g(Z2)i]T𝒵1)g(Z2)i=g(Z2)[vη(g(Z2)T𝒵1)]

where denotes the Hadamard product. It follows that

fx,y(𝒵1,𝒵2)𝒵12g(Z2)opvη(g(Z2)T𝒵1)2.

Then the first part of the proof follows from (75). The second part follows from a similar argument, which we spell out next.

Writing v=Vec(yxT), we have

fx,y(𝒵1,𝒵2)=Tr(η(NTZ2TZ1MTn)xyT)=Tr(xyTη(NTZ2TZ1MTn)),

which equals

Tr((yxT)Tη(NTZ2TZ1MTn))=Vec(yxT)TVec(η(NTZ2TZ1MTn))=(v)Tη(Vec(NTZ2TZ1MTn)).

Fact 7 implies that the above equals

(v)Tη((MZ1TNT)n𝒵2)=(v)Tη(h(Z1)T𝒵2).

where h(Z1)=Z1MTNn. Thus, similarly we can show that

fx,y(𝒵1,𝒵2)𝒵22h(Z1)opvη(h(Z1)T𝒵2)2=h(Z1)opVec((xyT)T)Vec(η([MZ1TZ2Nn])T)2=h(Z1)opVec(xyT)Vec(η([MZ1TZ2Nn]))2=h(Z1)opvη(g(Z2)T𝒵1)2.

Therefore, the proof follows.

Lemma 22. There exists an absolute constant C so that the function fx,y defined in (68) satisfies

E[maxx21,y21fx,y(𝒵1,𝒵2)4]CMop4Nop4(n(p+q))C.

Proof. As usual, we let QM,N=MZ1TZ2Nn. Since x2,y21, we have

fx,y(𝒵1,𝒵2)4η(QM,N)op4(a)η(QM,N)F4(b)QM,NF4(c)n4Mop4Nop4Z1F4Z2F4.

Here (a) follows because the operator norm is smaller than the Frobenius norm, (b) follows because η(x)x, and (c) follows from Fact 1. Since Z1 and Z2 are independent,

E[maxx21,y21fx,y(𝒵1,𝒵2)4]n4Mop4Nop4E[Z1F4]E[Z2F4].

Now note that since Z1 and Z2 are standard Gaussian data matrices,

E[Z1F4]E[Tr(Z1TZ1)2]k1(n+p)k2

for some absolute constants k1 and k2. We can choose C so large such that k1(n+p)k2C(n+p)C. Similarly, we can show that

E[Z2F4]E[Tr(Z2TZ2)2]C(n+q)C,

implying

E[maxx21,y21fx,y(𝒵1,𝒵2)4]CMop4Nop4(n(p+q))C

for sufficiently large C.

4). Proof of Lemma 16:

Proof. The framework will be the same as in the proof of Lemma 15. Define 𝒯=𝒯1×𝒯2 where

𝒯1={xRp:x=xSx,xTpϵ}.

Let 𝒵1, 𝒵2, 𝒵, and fx,y be as in Lemma 15. In this case, the main difference from Lemma 15 is that 𝒯 is much larger. Eventually we will arrive at (73) using the concentration inequality in Lemma 20, but the larger 𝒯 makes the right hand side of the inequality in (73) much larger. Therefore, we require a tighter bound on the Lipschitz constant of fx,y(𝒵) on the good set, so that the concentration inequality in (73) remains useful. To bound the Lipschitz constant, as before, we bound the gradient fx,y(𝒵1,𝒵2)2 using Lemma 21, which implies that

fx,y(𝒵1,𝒵2)𝒵12g(Z2)opvη(Vec(QM,N))2,

where v=Vec(xyT) and g(Z2)=Z2NMTn. From (69) it follows that

g(Z2)op2Mop2Nop2Z2op2n2. (76)

In the proof of Lemma 15, we bounded vη(Vec(QM,N))2 by v2, which was in turn bounded by 1. We require a tighter bound on vη(Vec(QM,N))2 this time. Note that any directional derivative η′ of η satisfies η′(z)1{zτn} for all zR. Noting xJp,qp for x𝒯1, we deduce that any ARp×q and (x,y)𝒯 satisfy

vη(Vec(A))22=i=1pj=1qxi2yj2η(Ai,j)2Jp,qpj=1qyj2supj[q]i=1pη(Ai,j)2=Jp,qpy22supj[q]i=1p1{Ai,j>τn},

which is not greater than

Jp,qsupj[q]i=1p1{Ai,j>τn}p

because y221 for y𝒯2.

Thus, it follows that

fx,y(𝒵1,𝒵2)𝒵1222Jp,qMop2Nop2Z2op2pn2×supj[q]{i=1p1{(QM,N)i,j>τn}}.

Similarly, we can show that

fx,y(𝒵1,𝒵2)𝒵2222Jp,qMop2Nop2Z1op2pn2×supj[q]{i=1p1{(QM,N)i,j>τn}}.

Thus,

fx,y(𝒵)222Jp,qMop2Nop2(Z1op2+Z2op2)n2×supj[q]{i=1p1{(QM,N)i,j>τn}}p.

We want to define the good set 𝒢2 of (Z~1,Z~2,Z~1,Z~2) such that

Zi(t)=tZ~i+1tZ~i,t[0,1],i=1,2,

satisfies both Z1(t)op2+Z2(t)op24(2n+p+q) and

supj[q]i=1p1{(MZ1(t)TZ2(t)N)i,j>τn}4peτ2K.

We claim that the above holds if (Z~1,Z~2,Z~1,Z~2)𝒢1 defined in (72), and for all j[q],

i=1p1{(MZ~1TZ~2N)i,j>τn2},i=1p1{(M(Z~1)TZ~2N)i,j>τn2}2peτ2Ki=1p1{(M(Z~1)TZ~2N)i,j>τn2},i=1p1{(M(Z~1)TZ~2N)i,j>τn2}2peτ2K. (77)

The above claim follows from (89) and (90) of [1]. Therefore we define the good set 𝒢2 to be the subset of 𝒢1 where (77) is satisfied. Defining 𝒵1(t)=Vec(Z1(t)T) and 𝒵2(t)=Vec(Z2(t)T), we obtain that for some absolute constant C>0, it holds that

fx,y(𝒵1(t),𝒵2(t))22qCJp,q(2n+p+q)Mop2Nop2eτ2K0n22=C2

provided Z~1,Z~1,Z~2,Z~2𝒢2. As in the proof of Lemma 15, using Lemma 20, we obtain that there exists an absolute constant C so that

P{max(x,y)𝒯fx,y(𝒵1,𝒵2)Δ}C𝒯exp(Δ2C2)+CΔ2E[max(x,y)𝒯fx,y(Z1,Z2)4]12P(𝒢2c)12. (78)

Now since 𝒯Tpϵ×Tqϵ, and for any kN, the ϵ-net Tkϵ is chosen so as to satisfy Tkϵ(1+2ϵ)k, we have 𝒯(1+2ϵ)p+q. Therefore, we conclude that the first term of the bound in (78) is not larger than

Cexp(Δ2C2+(p+q)log(1+2ϵ)).

The rest of the proof is devoted to bounding the second term of the bound in (78). The expectation term can be bounded easily using Lemma 22, which yields

E[max(x,y)Tfx,y(Z1,Z2)4]CMop4Nop4{n(p+q)}C.

We will now show that P(𝒢2c) is small. Note that by definition, 𝒢2=𝒢1∩𝒱, where 𝒱 is the set of (Z~1,Z~2,Z~1,Z~2) that satisfy the system of inequalities in (77). Notice that by (74), we already have P(𝒢1c)ec(n+p)+ec(n+q) for some c>0. Thus it suffices to show that P(𝒱c) is small. To this end, note that since Z~1, Z~1, Z~2, Z~2 are independent, (77) implies

P(𝒱c)4P(i=1p1{MiTZ~1TZ~2Nj>τn4}>2peτ2K)(for allj[q]).

Defining the set 𝒜j={Z~2Nj22nNj2}, we bound the above probability as follows:

P(𝒱c)4j=1qP(i=1p1{MiTZ~1TZ~2Nj>τn4)(>2pexp(τ2K0)Z~2𝒜j)+4j=1qP(Z~2𝒜jc). (79)

Now note that Z~2NjNn(0,Nj22In), or Z~2NjNj2Nn(0,In). Therefore, there exists a universal constant c>0 so that

j=1qP(Z~2𝒜jc)=j=1qP(Nj21Z~2Nj2>2n)qexp(cn), (80)

where the last bound is due to the Chi-square tail bound in Fact 4 (see also Lemma 1 of [72] and Lemma 12 of [1]). Therefore, it only remains to bound the first term in (79). We begin with an expansion of MiTZ~1TZ~2Nj as follows

MiTZ~1TZ~2Nj=l=1pk=1nMil(Z~1)kl(Z~2N)kj=l=1pMilk=1n(Z~1)kl(Z~2N)kjΨlj.

Since Z~1 and Z~2 are independent, Z~1 conditioned on Z~2 is still a standard Gaussian data matrix. Hence, for l[p], conditional on Z~2, the Ψlj’s are independent N(0,Z~2Nj22) random variables. As a result, for each l[p] and j[q], Ψlj can be written as Z~2Nj2Zl, where Zl=ΨljZ~2Nj2, and Z1,,ZpZ~2iidN(0,1). Noting Nj2Nop for every j[q], we derive the following bound provided Z~2𝒜j:

i=1p1{MiTZ~1TZ~2Nj>τn4}=i=1p1[Z~2Nj2l=1pMilZl>τn4]i=1p1[2Nopl=1pMilZl>τ4].

Defining

f(x)f(x1,,xp)=i=1p1[l=1pMilxl>τ(42Nop)]p, (81)

we notice that the above calculation implies that, conditional on Z~2𝒜j,

i=1p1{MiTZ~1TZ~2Nj>τn4}pf(Z1,,Zp).

Therefore,

P(i=1p1{MiTZ~1TZ~2Nj>τn4}>2peτ2KZ~2𝒜j)P(f(Z1,,Zp)>2exp(τ2K)Z~2𝒜j), (82)

which is bounded by exp(2p) by Lemma 23. Therefore, (79), (80), and (82) jointly imply that

P(𝒱c)4qexp(cn)+4qexp(2p).

Therefore P(𝒢2c) is bounded by

exp(c(n+p))+exp(c(n+q))+4qexp(cn)+4qexp(2p)4qexp(cmin{n,p}),

which completes the proof.

Lemma 23. Suppose 160Mop2Nop2<K and τ<Klogp2. Further suppose Z1,Zp are independent standard Gaussian random variables. Then the function f defined in (81) satisfies

P(f(Z1,,Zp)>2exp(τ2K))exp(2p).

Proof of Lemma 23. Note that pf(Z1,,Zp) is a sum of dependent Bernoulli random variables. Therefore the traditional Chernoff or Hoeffding bounds for independent Bernoulli random variables do not apply. We use a generalized version of Chernoff’s inequality, originally due to [65] (also discussed by [73], [74], among others), which applies to weakly dependent Bernoulli random variables.

Lemma 24 ([65]). Let X1,,Xp be Bernoulli random variables and ϵ(0,1). Suppose there exists δ(0,ϵ) such that for any I[p], the following assertion holds:

E[iIXi]δ. (83)

For x,y(0,1), we denote

D(xy)=ylog(yx)+(1y)log((1y)(1x)).

Then we have

P[i=1pXipϵ]exp(pD(δϵ)).
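
A numeric illustration (ours) of Lemma 24 in the simplest case of independent Bernoulli(μ) variables, which satisfy (83) with δ=μ whenever μ<ϵ: the empirical tail stays below the exp(−pD(δ‖ϵ)) bound. The specific values of p, μ, and ϵ below are arbitrary choices for the illustration.

```python
# Numeric sketch (ours) of Lemma 24 for independent Bernoulli(mu) variables.
import numpy as np

def D(x, y):
    """D(x || y) as defined above: y log(y/x) + (1-y) log((1-y)/(1-x))."""
    return y * np.log(y / x) + (1 - y) * np.log((1 - y) / (1 - x))

rng = np.random.default_rng(8)
p, mu, eps, reps = 200, 0.05, 0.125, 1_000_000

counts = rng.binomial(p, mu, size=reps)              # sums of independent Bernoulli(mu)
empirical = np.mean(counts >= p * eps)
bound = np.exp(-p * D(mu, eps))                      # Lemma 24 bound with delta = mu
print(f"empirical tail {empirical:.2e}  <=  generalized Chernoff bound {bound:.2e}")
```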

Note that if we take Xi=1{l=1pMilZl>τ(42Nop)} and ϵ=2exp(τ2K), then the above lemma can be applied to bound P(f(Z1,,Zp)>2exp(τ2K)) provided (83) holds, which will be referred to as the weak dependence condition from now on. Suppose I=k. For the sake of simplicity, we take I={1,,k}. The arguments to follow would hold for any other choice of I as long as I=k. Denote by Mk the submatrix of M containing only the first k rows of M. Let us denote Z1:k=(Z1,,Zk). Letting t=τ(42Nop), we observe that for our choice of Xi’s, E[iIXi] equals

P(MiTZ1:k>t,l[k])P(Z1:kTMkTMkZ1:k>kt2)P(MkTMkopi=1kZl2>kt2).

The operator norm MkTMkop equals Mkop2, which is bounded by Mop2 by Lemma 10B. Therefore, the right hand side of the last display is bounded by P(l=1kZl2>kt2Mop2). By Chi-square tail bounds (see for instance Fact 4), the latter probability is bounded above by exp(kt2(5Mop2)) for all t>5Mop. Since t=τ(42Nop), note that τ>160MopNop suffices. For such τ, we have thus shown that

E[iXi]exp(τ2160Mop2Nop2).

Thus our δ=exp(τ2160Mop2Nop2), which is less than ϵ2=exp(τ2K) because K>160Mop2Nop2. Thus our (δ, ϵ) pair satisfies the weak dependence condition. Therefore by Lemma 24, it follows that

P(f(Z1,,Zp)>2exp(τ2K))exp(pD(δϵ)).

We will now use the lower bound D(xy)2(xy)2 for x,y(0,1). Because δϵϵ2, D(δϵ)ϵ22, indicating pD(δϵ)2pexp(2τ2K), which is greater than 2p if 2τ2Klogp2, or equivalently τ2(Klogp)4. Therefore, the current lemma follows.

Footnotes

1

In this paper, by support recovery, we refer to the exact recovery of the combined support of the ui’s (or the vi’s) corresponding to nonzero Λi’s.

2

e.g., Σ^n,x(j), Σ^n,y(j), and Σ^n,xy(j) will stand for the empirical estimators created from the jth-equal split of the data.

3

here and later, we will use s to generically denote the sparsity of relevant parameter vectors in parallel problems like sparse PCA or sparse linear regression.

4

e.g., Planted Clique Conjecture [28], [54], [55], Statistical Query based lower bounds [56]-[59], and Overlap Gap Property based analysis [37], [60], [61].

Contributor Information

Nilanjana Laha, Department of Statistics, Texas A&M University, College Station, TX 77843.

Rajarshi Mukherjee, Department of Biostatistics, Harvard T. H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115.

References

  • [1].Deshpande Y and Montanari A, “Sparse pca via covariance thresholding,” Journal of Machine Learning Research, vol. 17, no. 141, pp. 1–14, 2016. [Google Scholar]
  • [2].Kunisky D, Wein AS, and Bandeira AS, “Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio,” in Mathematical Analysis, its Applications and Computation, 2022, pp. 1–50, part of the Springer Proceedings in Mathematics and Statistics book series. [Google Scholar]
  • [3].Hardoon DR, Szedmak S, and Shawe-Taylor J, “Canonical correlation analysis: An overview with application to learning methods,” Neural computation, vol. 16, no. 12, pp. 2639–2664, 2004. [DOI] [PubMed] [Google Scholar]
  • [4].Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet GR, Levy R, and Vasconcelos N, “A new approach to cross-modal multimedia retrieval,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 251–260. [Google Scholar]
  • [5].Gong Y, Ke Q, Isard M, and Lazebnik S, “A multi-view embedding space for modeling internet images, tags, and their semantics,” International journal of computer vision, vol. 106, no. 2, pp. 210–233, 2014. [Google Scholar]
  • [6].Bin G, Gao X, Yan Z, Hong B, and Gao S, “An online multi-channel ssvep-based brain–computer interface using a canonical correlation analysis method,” Journal of neural engineering, vol. 6, no. 4, p. 046002, 2009. [DOI] [PubMed] [Google Scholar]
  • [7].Avants BB, Cook PA, Ungar L, Gee JC, and Grossman M, “Dementia induces correlated reductions in white matter integrity and cortical thickness: a multivariate neuroimaging study with sparse canonical correlation analysis,” Neuroimage, vol. 50, no. 3, pp. 1004–1016, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Witten DM, Tibshirani R, and Hastie T, “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis,” Biostatistics, vol. 10, no. 3, pp. 515–534, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Bagozzi RP, “Measurement and meaning in information systems and organizational research: Methodological and philosophical foundations,” Management Information Systems quarterly, vol. 35, no. 2, pp. 261–292, 2011. [Google Scholar]
  • [10].Dhillon P, Foster DP, and Ungar L, “Multi-view learning of word embeddings via CCA,” in Proceedings of the 24th International Conference on Neural Information Processing Systems, vol. 24, 2011, p. 199207. [Google Scholar]
  • [11].Faruqui M and Dyer C, “Improving vector space word representations using multilingual correlation,” in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014, pp. 462–471. [Google Scholar]
  • [12].Friman O, Borga M, Lundberg P, and Knutsson H, “Adaptive analysis of fmri data,” NeuroImage, vol. 19, no. 3, pp. 837–845, 2003. [DOI] [PubMed] [Google Scholar]
  • [13].Kim T-K, Wong S-F, and Cipolla R, “Tensor canonical correlation analysis for action classification,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8. [Google Scholar]
  • [14].Arora R and Livescu K, “Multi-view cca-based acoustic features for phonetic recognition across speakers and domains,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7135–7139. [Google Scholar]
  • [15].Wang W, Arora R, Livescu K, and Bilmes J, “On deep multi-view representation learning,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning, vol. 37, 2015, pp. 1083–1092. [Google Scholar]
  • [16].Anderson TW, An Introduction to Multivariate Statistical Analysis, ser. Wiley Series in Probability and Statistics. Wiley, 2003. [Google Scholar]
  • [17].Lê Cao K-A, Martin PG, Robert-Granié C, and Besse P, “Sparse canonical methods for biological data integration: application to a cross-platform study,” BMC bioinformatics, vol. 10, no. 1, pp. 1–17, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Lee W, Lee D, Lee Y, and Pawitan Y, “Sparse canonical covariance analysis for high-throughput data,” Statistical Applications in Genetics and Molecular Biology, vol. 10, no. 1, pp. 1–24, 2011.21291411 [Google Scholar]
  • [19].Waaijenborg S, de Witt Hamer PCV, and Zwinderman AH, “Quantifying the association between gene expressions and dna-markers by penalized canonical correlation analysis,” Statistical applications in genetics and molecular biology, vol. 7, no. 1, 2008. [DOI] [PubMed] [Google Scholar]
  • [20].Anderson TW, “Asymptotic theory for canonical correlation analysis,” Journal of Multivariate Analysis, vol. 70, no. 1, pp. 1–29, 1999. [Google Scholar]
  • [21].Cai TT and Zhang A, “Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics,” The Annals of Statistics, vol. 46, no. 1, pp. 60 – 89, 2018. [Google Scholar]
  • [22].Ma Z, Li X et al. , “Subspace perspective on canonical correlation analysis: Dimension reduction and minimax rates,” Bernoulli, vol. 26, no. 1, pp. 432–470, 2020. [Google Scholar]
  • [23].Bao Z, Hu J, Pan G, and Zhou W, “Canonical correlation coefficients of high-dimensional gaussian vectors: Finite rank case,” The Annals of Statistics, vol. 47, no. 1, pp. 612–640, February 2019. [Google Scholar]
  • [24].Mai Q and Zhang X, “An iterative penalized least squares approach to sparse canonical correlation analysis,” Biometrics, vol. 75, no. 3, pp. 734–744, 2019. [DOI] [PubMed] [Google Scholar]
  • [25].Solari OS, Brown JB, and Bickel PJ, “Sparse canonical correlation analysis via concave minimization,” arXiv preprint arXiv:1909.07947, 2019. [Google Scholar]
  • [26].Chen M, Gao C, Ren Z, and Zhou HH, “Sparse cca via precision adjusted iterative thresholding,” arXiv preprint arXiv:1311.6186, 2013. [Google Scholar]
  • [27].Gao C, Ma Z, Ren Z, Zhou HH et al. , “Minimax estimation in sparse canonical correlation analysis,” The Annals of Statistics, vol. 43, no. 5, pp. 2168–2197, 2015. [Google Scholar]
  • [28].Gao C, Ma Z, Zhou HH et al. , “Sparse cca: Adaptive estimation and computational barriers,” The Annals of Statistics, vol. 45, no. 5, pp. 2074–2101, 2017. [Google Scholar]
  • [29].Wainwright MJ, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting,” IEEE Transactions on Information Theory, vol. 55, no. 12, pp. 5728–5741, 2009. [Google Scholar]
  • [30].Amini AA and Wainwright MJ, “High-dimensional analysis of semidefinite relaxations for sparse principal components,” The Annals of Statistics, vol. 37, no. 5B, pp. 2877 – 2921, 2009. [Google Scholar]
  • [31].Butucea C, Ingster YI, and Suslina IA, “Sharp variable selection of a sparse submatrix in a high-dimensional noisy matrix,” ESAIM: Probability and Statistics, vol. 19, pp. 115–134, 2015. [Google Scholar]
  • [32].Butucea C and Stepanova N, “Adaptive variable selection in nonparametric sparse additive models,” Electronic Journal of Statistics, vol. 11, no. 1, pp. 2321–2357, 2017. [Google Scholar]
  • [33].Meinshausen N and Bühlmann P, “Stability selection,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 4, pp. 417–473, 2010. [Google Scholar]
  • [34].Johnstone IM and Lu AY, “On consistency and sparsity for principal components analysis in high dimensions,” Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [35].Krauthgamer R, Nadler B, Vilenchik D et al. , “Do semidefinite relaxations solve sparse pca up to the information limit?” The Annals of Statistics, vol. 43, no. 3, pp. 1300–1322, 2015. [Google Scholar]
  • [36].Ding Y, Kunisky D, Wein AS, and Bandeira AS, “Subexponential-time algorithms for sparse pca,” arXiv preprint arXiv:1907.11635, 2019. [Google Scholar]
  • [37].Arous GB, Wein AS, and Zadik I, “Free energy wells and overlap gap property in sparse pca,” in Proceedings of the 33rd Annual Conference on Learning Theory, vol. 125, 2020, pp. 479–482. [Google Scholar]
  • [38].Laha N and Mukherjee R, “Support.CCA,” https://github.com/nilanjanalaha/Support.CCA, 2021. [Google Scholar]
  • [39].Hopkins SB and Steurer D, “Efficient bayesian estimation from few samples: community detection and related problems,” in IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 2017, pp. 379–390. [Google Scholar]
  • [40].Hopkins S, “Statistical inference and the sum of squares method,” Ph.D. dissertation, Cornell University, 2018. [Google Scholar]
  • [41].Ma Z and Yang F, “Sample canonical correlation coefficients of high-dimensional random vectors with finite rank correlations,” arXiv preprint arXiv:2102.03297, 2021. [Google Scholar]
  • [42].Vershynin R, High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018. [Google Scholar]
  • [43].Laha N, Huey N, Coull B, and Mukherjee R, “On statistical inference with high dimensional sparse cca,” arXiv preprint arXiv:2109.11997, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [44].Johnstone IM, “On the distribution of the largest eigenvalue in principal components analysis,” The Annals of statistics, vol. 29, no. 2, pp. 295–327, 2001. [Google Scholar]
  • [45].Janková J and van de Geer S, “De-biased sparse pca: Inference and testing for eigenstructure of large covariance matrices,” IEEE Transactions on Information Theory, vol. 67, no. 4, pp. 2507–2527, 2021. [Google Scholar]
  • [46].Meloun M and Militky J, Statistical data analysis: A practical guide. Woodhead Publishing Limited, 2011. [Google Scholar]
  • [47].Dutta D, Sen A, and Satagopan J, “Sparse canonical correlation to identify breast cancer related genes regulated by copy number aberrations,” medRxiv, 2022. [Online]. Available: https://www.medrxiv.org/content/early/2022/05/09/2021.08.29.21262811 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [48].Lock EF, Hoadley KA, Marron JS, and Nobel AB, “Joint and individual variation explained (jive) for integrated analysis of multiple data types,” The annals of applied statistics, vol. 7, no. 1, p. 523, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [49].van de Geer S, Bhlmann P, Ritov Y, and Dezeure R, “On asymptotically optimal confidence regions and tests for high-dimensional models,” The Annals of Statistics, vol. 42, no. 3, pp. 1166–1202, 2014. [Google Scholar]
  • [50].Bickel PJ and Levina E, “Covariance regularization by thresholding,” The Annals of Statistics, vol. 36, no. 6, pp. 2577–2604, 2008. [Google Scholar]
  • [51].Cai TT, Liu W, and Luo X, “A constrained l1 minimization approach to sparse precision matrix estimation,” Journal of the American Statistical Association, vol. 106, no. 494, pp. 594–607, 2011. [Google Scholar]
  • [52].Yu Y, Wang T, and Samworth RJ, “A useful variant of the davis–kahan theorem for statisticians,” Biometrika, vol. 102, no. 2, pp. 315–323, 2015. [Google Scholar]
  • [53].Yatracos YG, “A lower bound on the error in nonparametric regression type problems,” The Annals of Statistics, vol. 16, no. 3, pp. 1180–1187, 1988. [Google Scholar]
  • [54].Berthet Q and Rigollet P, “Complexity theoretic lower bounds for sparse principal component detection,” in Proceedings of the 26th Annual Conference on Learning Theory, vol. 30, 2013, pp. 1046–1066. [Google Scholar]
  • [55].Brennan M, Bresler G, and Huleihel W, “Reducibility and computational lower bounds for problems with planted sparse structure,” in Proceedings of the 31st Conference On Learning Theory, vol. 75, 2018, pp. 48–166. [Google Scholar]
  • [56].Kearns M, “Efficient noise-tolerant learning from statistical queries,” Journal of the ACM (JACM), vol. 45, no. 6, pp. 983–1006, 1998. [Google Scholar]
  • [57].Feldman V and Kanade V, “Computational bounds on statistical query learning,” in Proceedings of the 25th Annual Conference on Learning Theory, vol. 23, 2012, pp. 16.1–16.22. [Google Scholar]
  • [58].Brennan MS, Bresler G, Hopkins S, Li J, and Schramm T, “Statistical query algorithms and low degree tests are almost equivalent,” in Proceedings of Thirty Fourth Conference on Learning Theory, vol. 134, 2021, pp. 774–774. [Google Scholar]
  • [59].Dudeja R and Hsu D, “Statistical query lower bounds for tensor pca,” Journal of Machine Learning Research, vol. 22, no. 83, pp. 1–51, 2021. [Google Scholar]
  • [60].David G and Ilias Z, “High dimensional regression with binary coefficients. Estimating squared error and a phase transtition,” in Proceedings of the Conference on Learning Theory, vol. 65, 2017, pp. 948–953. [Google Scholar]
  • [61].Gamarnik D, Jagannath A, and Sen S, “The overlap gap property in principal submatrix recovery,” Probability Theory and Related Fields volume, vol. 181, pp. 757–814, 2021. [Google Scholar]
  • [62].Szegö G, Orthogonal polynomials, ser. American Mathematical Society colloquium publications. American Mathematical Society, 1939. [Google Scholar]
  • [63].Cai TT, Zhou HH et al. , “Optimal rates of convergence for sparse covariance matrix estimation,” The Annals of Statistics, vol. 40, no. 5, pp. 2389–2420, 2012. [Google Scholar]
  • [64].Bach FR and Jordan MI, “A probabilistic interpretation of canonical correlation analysis,” Tech. Rep, 2005. [Online]. Available: https://www.di.ens.fr/~fbach/probacca.pdf [Google Scholar]
  • [65].Panconesi A and Srinivasan A, “Randomized distributed edge coloring via an extension of the chernoff–hoeffding bounds,” SIAM Journal on Computing, vol. 26, no. 2, pp. 350–368, 1997. [Google Scholar]
  • [66].Wang S, Fan J, Pocock G, Arena ET, Eliceiri KW, and Yuan M, “Structured correlation detection with application to colocalization analysis in dual-channel fluorescence microscopic imaging,” Statistica Sinica, vol. 31, no. 1, pp. 333–360, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [67].Bai Z-D and Yin Y-Q, “Limit of the Smallest Eigenvalue of a Large Dimensional Sample Covariance Matrix,” The Annals of Probability, vol. 21, no. 3, pp. 1275 – 1294, 1993. [Google Scholar]
  • [68].Sandór J and Debnath L, “On certain inequalities involving the constant e and their applications,” Journal of mathematical analysis and applications, vol. 249, no. 2, pp. 569–582, 2000. [Google Scholar]
  • [69].Rahman S, “Wiener–hermite polynomial expansion for multivariate gaussian probability measures,” Journal of Mathematical Analysis and Applications, vol. 454, no. 1, pp. 303–334, 2017. [Google Scholar]
  • [70].Petersen KB and Pedersen MS, “The matrix cookbook,” 2015, version: Nov 12, 2015. [Online]. Available: http://www2.compute.dtu.dk/pubdb/pubs/3274-full.html [Google Scholar]
  • [71].Laub A, Matrix Analysis for Scientists and Engineers. Society for Industrial and Applied Mathematics, 2005. [Google Scholar]
  • [72].Laurent B and Massart P, “Adaptive estimation of a quadratic functional by model selection,” The Annals of Statistics, vol. 28, no. 5, pp. 1302–1338, 2000. [Google Scholar]
  • [73].Pelekis C and Ramon J, “Hoeffding’s inequality for sums of weakly dependent random variables,” Mediterranean Journal of Mathematics volume, vol. 14, no. 243, 2017. [Google Scholar]
  • [74].Linial N and Luria Z, “Chernoff’s inequality-a very elementary proof,” arXiv preprint arXiv:1403.7739, 2014. [Google Scholar]
