Abstract
In this paper, we consider asymptotically exact support recovery in the context of high dimensional and sparse Canonical Correlation Analysis (CCA). Our main results describe four regimes of interest based on information theoretic and computational considerations. In regimes of “low” sparsity we describe a simple, general, and computationally easy method for support recovery, whereas in a regime of “high” sparsity, it turns out that support recovery is information theoretically impossible. For the sake of information theoretic lower bounds, our results also demonstrate a non-trivial requirement on the “minimal” size of the nonzero elements of the canonical vectors that is required for asymptotically consistent support recovery. Subsequently, the regime of “moderate” sparsity is further divided into two subregimes. In the lower of the two sparsity regimes, we show that polynomial time support recovery is possible by using a sharp analysis of a co-ordinate thresholding [1] type method. In contrast, in the higher end of the moderate sparsity regime, appealing to the “Low Degree Polynomial” Conjecture [2], we provide evidence that polynomial time support recovery methods are inconsistent. Finally, we carry out numerical experiments to compare the efficacy of various methods discussed.
Keywords: Canonical Correlation Analysis, Support Recovery, Low Degree Polynomials, Variable Selection, High Dimension
I. Introduction
Canonical Correlation Analysis (CCA) is a highly popular technique to perform initial dimension reduction while exploring relationships between two multivariate objects. Due to its natural interpretability and success in finding latent information, CCA has found enthusiasm across vast canvas of disciplines, which include, but are not limited to psychology and agriculture, information retrieving [3]-[5], brain-computer interface [6], neuroimaging [7], genomics [8], organizational research [9], natural language processing [10], [11], fMRI data analysis [12], computer vision [13], and speech recognition [14], [15].
Early developments in the theory and applications of CCA have now been well documented in the statistical literature, and we refer the interested reader to [16] and references therein for further details. However, the modern surge in interest for CCA, often being motivated by data from high throughput biological experiments [17]-[19], requires re-thinking several aspects of the traditional theory and methods. A natural structural constraint that has gained popularity in this regard, is that of sparsity, i.e., the phenomenon of an (unknown) collection of variables being related to each other. In order to formally introduce the framework of sparse CCA, we present our statistical setup next. We shall consider n-i.i.d. samples with and being multivariate mean zero random variables with joint variance covariance matrix
| (1) |
The first canonical correlation is then defined as the maximum possible correlation between two linear combinations of X and Y. This definition interprets as the optimal value of the following maximization problem:
| (2) |
The solutions to (2) are the vectors that maximize the correlation of the projections of X and Y in those respective directions. Higher order canonical correlations can thereafter be defined in a recursive fashion (cf. [20]). In particular, for , we define the jth canonical correlation and the corresponding directions uj and vj by maximizing (2) with the additional constraint
| (3) |
As mentioned earlier, in many modern data examples, the sample size n is typically at most comparable to or much smaller than p or q – rendering the classical CCA inconsistent and inadequate without further structural assumptions [21]-[23]. The framework of Sparse Canonical Correlation Analysis (SCCA) (cf. [8], [24]), where the ui’s and the vi’s are sparse vectors, was subsequently developed to target low dimensional structures (that allows consistent estimation) when p, q are potentially larger than n. The corresponding sparse estimates of the leading canonical directions naturally perform variable selection, thereby leading to the recovery of their support (cf. [8], [19], [24], [25]). It is unknown, however, under what settings, this naïve method of support recovery, or any other method for the matter, is consistent. The support recovery of the leading canonical directions serves an important purpose of identifying groups of variables that explain the most linear dependence among high dimensional random objects (X and Y) under study – and thereby renders crucial interpretability. Asymptotically optimal support recovery is yet to be explored systematically in the context of SCCA – both theoretically, and from the computational viewpoint. In fact, despite the renewed enthusiasm for CCA, both the theoretical and applied communities have mainly focused on the estimation of the leading canonical directions, and relevant scalable algorithms – see, e.g., [22], [24], [26]-[28]. This paper explores the crucial question of support recovery in the context of SCCA. 1
The problem of support recovery for SCCA naturally connects to a vast class of variable selection problems (cf. [29]-[33]). The problem closest in terms of complexity turns out to be the sparse PCA (SPCA) problem [34]. Support recovery in the latter problem is known to present interesting information theoretic and computational bottlenecks (cf. [30], [35]-[37]). Moreover, information theoretic and computational issues also arise in context of SCCA estimation problem (cf. [24], [26]-[28]). In view of the above, it is natural to expect that such information theoretic and computational issues exist in context of SCCA support recovery problem as well. However, the techniques used in SPCA support recovery analysis are not directly applicable to the SCCA problem, which poses additional challenges due to the presence of high dimensional nuisance parameters and . The main focus of our work is therefore retrieving the complete picture of the information theoretic and computational limitations of SCCA support recovery. Before going into further details, we present a brief summary of our contributions, and defer the discussions on the main subtleties to Section III. Our methods can be implemented using the R package Support.CCA [38].
A. Summary of Main Results
We say a method successfully recovers the support if it achieves exact recovery with probability tending to one uniformly over the sparse parameter spaces defined in Section II. In the sequel, we denote the cardinality of the combined support of the ui’s and the vi’s by sx and sy, respectively. Thus sx and sy will be our respective sparsity parameters. Our main contributions are listed below.
1). General methodology:
In Section III-A, we construct a general algorithm called RecoverSupp, which leads to successful support recovery whenever the latter is information theoretically tractable. This also serves as the first step in creating a polynomial time procedure for recovering support in one of the difficult regimes of the problem – see e,g. Corollary 2, which shows that RecoverSupp accompanied by a co-ordinate thresholding type method recovers the support in polynomial time in a regime that requires subtle analysis. Moreover, Theorem 1 shows that the minimal signal strength required by RecoverSupp matches the information theoretic limit whenever the nuisance precision matrices and are sufficiently sparse.
2). Information theoretic and computational hardness as a function of sparsity:
As the sparsity level increases, we show that the CCA support recovery problem transitions from being efficiently solvable, to NP hard (conjectured), and to information theoretically impossible. According to this hardness pattern, the sparsity domain can be partitioned into the following three regimes: (i) , , , , and (iii) , . We describe below the distinguishing behaviours of these three regimes, which is consistent with the sparse PCA scenario.
We show that when , (“easy regime”), polynomial time support recovery is possible, and well-known consistent estimators of the canonical correlates (cf. [24], [28]) can be utilized to that end. When , (“difficult regime”), we show that a co-ordinate thresholding type algorithm (inspired by [1]) succeeds provided . We call the last regime “difficult” because it is unknown whether existing estimation methods like COLAR [28] or SCCA [24] have valid statistical guarantees in this regime – see Section III-A and Section III-D for more details.
In Section III-C, we show that when , (“hard regime”), support recovery is computationally hard subject to the so called “low degree polynomial conjecture” recently popularized by [39], [40], and [2]. Of course, this phenomenon is observable only when p, , because otherwise, the problem would be solvable by the ordinary CCA analysis (cf. [23], [41]). Our findings are consistent with the conjectured computational barrier in context of SCCA estimation problem [28].
When , , we show that support recovery is information theoretically impossible (see Section III-B).
3). Information theoretic hardness as a function of minimal signal strength:
In context of support recovery, the signal strength is quantified by
Generally, support recovery algorithms require the signal strength to lie above some threshold. As a concrete example, the detailed analyses provided in [1], [30], and [35] are all based on the nonzero principal component elements being of the order . To the best of our knowledge, prior to our work, there was no result in the PCA/CCA literature on the information theoretic limit of the minimal signal strength. Generally, PCA studies assume that the top eigenvectors are de-localized, i.e., the principal components have elements of the order and thereby mostly considered the cases of de-localized eigenvectors. We do not make any such assumption on the canonical covariates, and thereby we believe that our study paints a more complete picture of the support recovery.
In Section III-B, we show that (or ) is a necessary requirement for successful support recovery by U (or V).
B. Notation
For a vector , we denote its support by . We will overload notation, and for a matrix , we will denote by D(A) the indexes of the nonzero rows of A. By an abuse of notation, sometimes we will refer to D(A) as the support of A as well. When and are unknown parameters, generally, the estimator of their supports will be denoted by and , respectively. We let denote the set of all positive numbers, and write for the set of all natural numbers {0, 1, 2, … ,}. For any , We let [n] denote the set {1, … , n}. We define the projection of A onto by
| (4) |
For any finite set , we denote its cardinality by . Also, for any event , we let be the indicator of the event . For any , we let denote the unit sphere in .
We let be the usual lk norm in for . In particular, we let denote the number of nonzero elements of a vector . For any probability measure on the Borel sigma field of , we let to be the set of all measurable functions such that . The corresponding inner product will be denoted by . We denote the operator norm and the Frobenius norm of a matrix by and , respectively. We let Ai* and Aj denote the i-th row and j-th column of A, respectively. For , we define the norms and . The maximum and minimum eigenvalues of a square matrix A will be denoted by and , respectively. Also, we let s(A) denote the maximum number of nonzero entries in any column of A, i.e., .
The results in this paper are mostly asymptotic (in n) in nature and thus require some standard asymptotic notations. If an and bn are two sequences of real numbers then (and ) implies that (and ) as , respectively. Similarly (and ) implies that for some (and for some ). Alternatively, will also imply and will imply that for some . We write if there are positive constants C1 and C2 such that for all . We will write to indicate an and bn are asymptotically of the same order up to a poly-log term. Finally, in our mathematical statements, C and c will be two different generic constants which can vary from line to line.
II. Mathematical Formalism
We denote the rank of by r. It can be shown that exactly r canonical correlations are positive and the rest are zero in the model (2). We will consider the matrices and . From (2) and (3), it is not hard to see that and . The indexes of the nonzero rows of U and V, respectively, are the combined support of the ui’s and the vi’s. Since we are interested in the recovery of the latter, it will be useful for us to study of the properties of U and V. To that end, we often make use of the following representation connecting to U and V [16]:
| (5) |
To keep our results straightforward, we restrict our attention to a particular model throughout, defined as follows.
Definition 1. Suppose . Let be a constant. We say if
-
A1
(Sub-Gaussian) X and Y are sub-Gaussian random vectors (cf. [42]), with joint covariance matrix Σ as defined in (1). Also rank.
-
A2
Recall the definition of the canonical correlation from (3). Note that by definition, . For , additionally satisfies .
-
A3
(Sparsity) The number of nonzero rows of U and V are sx and sy, respectively, that is and . Here U and V are as defined in (5).
-
A4(Bounded eigenvalue)
-
A5
(Positive eigen-gap) for i = 2, … , r.
Sometimes we will consider a sub-model of where each is Gaussian. This model will be denoted by , where “G” stands for the Gaussian assumption. Some remarks on the modeling assumptions A1—A5 are in order, which we provide next.
-
A1.
We begin by noting that we do not require X and Y to be jointly sub-Gaussian. Moreover, the individual sub-Gaussian assumption itself is common in the , regime in the SCCA literature (cf. [24], [28], [43]). Our proof techniques depend crucially on the sub-Gaussian assumption. We also anticipate that the results derived in this paper will change under the violation of this assumption. For the sharper analysis in the difficult regime (, ), our proof techniques require the Gaussian model – which is in parallel with [1]’s treatment of the sparse PCA in the corresponding difficult regime. In general, the Gaussian spiked model assumption in sparse PCA goes back to [44], and is common in the PCA literature (cf. [30], [35]).
-
A2-A4.
These assumptions are standard in the analysis of canonical correlations (cf. [24], [28]).
-
A5.
This assumption concerns the gap between consecutive canonical correlation strengths. However, we refer to this gap as “Eigengap” because of its similarity with the Eigengap in the sparse PCA literature (cf. [1], [45]). This assumption is necessary for the estimation of the i-th canonical covariates. Indeed, if then there is no hope of estimating the i-th canonical covariates because they are not identifiable, and so support recovery also becomes infeasible. This assumption can be relaxed to requiring only k many λi’s to be strictly larger than where k ≤ r. In this case, we can recover the support of only the first k canonical covariates.
In the following sections, we will denote the preliminary estimators of U and V by and , respectively. The columns of and will be denoted by and , respectively. Therefore and will stand for the corresponding preliminary estimators of ui and vi. In case of CCA, the ui’s and vi’s are identifiable only up to a sign flip. Hence, they are also estimable only up to a sign flip. Finally, we denote the empirical estimates of , , and , by , , and , respectively – which will often be appended with superscripts to denote their estimation through suitable subsamples of the data 2. Finally, we let denote a positive constant which depends on only through , but can vary from line to line.
III. Main Results
We divide our main results into the following parts based on both statistical and computational difficulties of different regimes. First, in Section III-A we present a general method and associated sufficient conditions for support recovery. This allows us to elicit a sequence of questions regarding necessity of the conditions and remaining gaps both from statistical and computational perspectives. Our subsequent sections are devoted to answering these very questions. In particular, in Section III-B we discuss information theoretic lower bounds followed by evidence for statistical-computational gaps in Section III-C. Finally, we close a final computational gap in asymptotic regime through sharp analysis of a special co-ordinate-thresholding type method in Section III-D.
A. A Simple and General Method:
We begin with a simple method for estimating the support, which readily establishes the result for the easy regime, and sets the directions for the investigation into other more subtle regimes. Since the estimation of D(U) and D(V) are similar, we focus only on the estimation of D(V) for the time being.
Suppose is a row sparse estimator of V. The nonzero indexes of is the most intuitive estimator of D(V). Such an is also easily attainable because most estimators of the canonical directions in the high dimension are sparse (cf. [24], [26], [28] among others). Although we have not yet been able to show the validity of this apparently “naïve” method, we provide numerical results in Section IV to explore its finite sample performance. However, a simple method can refine these initial estimators, to often optimally recover the support D(V). We now provide the details of this method and derive its asymptotic properties.
To that end, suppose we have at our disposal an estimating procedure for , which we generically denote by and an estimator of U. We split the sample in two equal parts, and compute and from the first part of the sample, and the estimator from the second part of the sample. Define . Our estimator of D(V) is then given by
| (6) |
where cut is a pre-specified cut-off or threshold. We will discuss more on cut later. The resulting algorithm will be referred as RecoverSupp from now on. Algorithm 1 gives the algorithm for the support recovery of V, but the full version of RecoverSupp, which estimates D(U) and D(V) simultaneously, can be found in Appendix A; see Algorithm 3 there. RecoverSupp is similar in spirit to the “cleaning” step in the sparse PCA support recovery literature (cf. [1]). One thing to remember here is that is not an estimator V. In fact, the (i, j)-th element of is an estimator of .
Remark 1. In many applications, the rank r may be unknown. [46] (see Section 4.6.5 therein) suggests to use the screeplot of the canonical correlations to estimate r. Screeplot is also a popular tool to estimate the number of nonzero principal components in PCA analysis [1]. For CCA, the screeplot is the plot of the estimated canonical correlations versus their orders. If there is a clear gap between two successive correlations, [46] suggests taking the larger correlation as the estimator of . One can use [24]’s SCCA method to estimate the canonical correlations to obtain the screeplot. There can be other ways of estimating r. For example, in their trans-eQTL study, [47] uses a resampling technique on a holdout dataset to generate observations from the null distribution of the i-th canonical correlation estimate under the hypothesis , where . The largest i, for which the test is rejected, is taken as the estimated rank. A similar technique has been used by [48] to select the ranks for a related method JIVE.
| Algorithm 1 RecoverSupp ): sup-port recovery of V | |
|---|---|
|
|
It turns out that, albeit being so simple, RecoverSupp has desirable statistical guarantees provided and are reasonable estimators of U and , respectively. These theoretical properties of RecoverSupp, and the hypotheses and queries generated thereof, lay out the roadmap for the rest of our paper. However, before getting into the detailed theoretical analysis of RecoverSupp, we state a l2-consistency condition on and , where we remind the readers that we let and denote the i-th columns of and , respectively. Recall also that the i-th columns of U and V are denoted by ui and vi, respectively.
Condition 1 (l2 consistency ). There exists a function so that and the estimators and of ui and vi satisfy
with probability 1 − o(1) uniformly over .
We will discuss the estimators which satisfy Condition 1 later. Theorem 1 also requires the signal strength Sigy to be at least of the order , where the parameter depends on the type of as follows:
is of type A if there exists Cpre > 0 so that satisfies with probability 1 − o(1) uniformly over . Here we remind the readers that . In this case, .
is of type B if with probability 1 − o(1) uniformly over for some Cpre > 0. In this case, .
is of type C if . In this case, .
The estimation error of clearly decays from type A to C, with the error being zero at type C. Because is generally much smaller than , shrinks from Case A to Case C monotonously as well. Thus it is fair to say that reflects the precision of the estimator in that is smaller if is a sharper estimator. We are now ready to state Theorem 1. This theorem is proved in Appendix C.
Theorem 1. Suppose and the estimators satisfy Condition 1. Further suppose is of type A, B, or C, which are stated above. Let where depends on the type of as outlined above. Then there exists a constant , depending only on , so that if
| (7) |
and , with , then the algorithm RecoverSupp fully recovers D(V) with probability 1 − o(1) uniformly over (for of type A and C), or uniformly over (for of type B).
The assumption that log p and log q are o(n) appears in all theoretical works of CCA (cf. [24], [28]). A requirement of this type is generally unavoidable. Note that Theorem 1 implies a more precise estimator requires smaller signal strength for full support recovery.
Main idea behind the proof of Theorem 1:
Because , is an estimator of for and . If , then for all . Therefore, in this case, we expect to be small for all . We will show that whenever , is uniformly bounded by for and with high probability. Here is a constant. Second, when , we will show that can not be too small. In fact, we will show that
| (8) |
for some with high probability in this case. The lower bound in the above inequality is bounded below by . Thus, if the minimal signal strength Sigy is bounded below by a large enough multiple of ϵn, then the lower bound will be larger than the upper bound in the case. Therefore, in this scenario, we can choose C > 0 so that
If we set , then the above inequality leads to.
These C1 and C2 are behind the constant in (7) and our choice of θn.
Thus the key step in the proof of Theorem 1 is analyzing the bias of , which hinges on the following bias decomposition:
| (9) |
Note that the term corresponds to the bias in estimating . Similarly, the error terms and incur due to the bias in estimating and ui, respectively. The main contributing term in the upper bound in (9) is . One can use the consistency property of to show that is of the order . Since has different rates and modes of convergence in cases A, B, and C, has different orders in cases A, B, and C, which explains why ϵn is of different order in these cases.
The term is much smaller – it is of the order . The proof bases on the fact that the l∞ error of estimating by is of the order for subgaussian X and Y. The error term is exactly zero for , and hence does not contribute. Thus only and contribute to the bias of for , which is therefore bounded by for some C1 > 0 with high probability in this case. The term does contribute to the bias of for , however, and it is of the order in this case. Because Err is small by Condition 1, we can show that is smaller than , which eventually leads to the relation in (8), thus completing the proof. We have already mentioned that RecoverSupp is analogous to the cleaning step in sparse PCA. Therefore the proof of Theorem 1 has similarities with some analogous results in sparse PCA. See for example Theorem 3 of [1], which proves the consistency of a “cleaned” estimator of the joint support of the spiked principal components. However, the proof in the CCA case is a bit more involved because of the presence of , which needs to be estimated for the cleaning step. Different estimators of can have different rates of convergence, which leads to the different types of the estimators. This ultimately leads to different requirements on the order of the threshold cut and the minimal signal strength Sigy.
Next we will discuss the implications of Theorem 1, but before getting into that detail, we will make two important remarks.
Remark 2. Although the estimation of the high dimensional precision matrix is potentially complicated, it is often unavoidable owing to the inherent subtlety of the CCA framework due to the presence of high dimensional nuisance parameters and . [26] also used precision matrix estimator for partial recovery of the support. In case of sparse CCA, to the best of our knowledge, there does not exist an algorithm that can recover the support, partially or completely, without estimating the precision matrix. However, our requirements on are not strict in that many common precision matrix estimators, e.g., the nodewise Lasso [49, Theorem 2.4], the thresholding estimator [50, Theorem 1 and Section 2.3], and the CLIME estimator [51, Theorem 6] exhibit the decay rate of type A and B under standard sparsity assumptions on . We will not get into the detail of the sparsity requirements on because they are unrelated to the sparsity of U or V, and hence are irrelevant to the primary goal of the current paper.
Remark 3. In the easy regime , polynomial time estimators satisfying Condition 1 are already available, e.g., COLAR [28, Theorem 4.2] or SCCA [24, Condition C4]. Thus it is easily seen that polynomial time support recovery is possible in the easy regime provided (7) is satisfied.
The implications of Theorem 1 in context of the sparsity requirements on D(U) and D(V) for full support recovery are somewhat implicit through the assumptions and conditions. However, the restriction on the sparsity is indirectly imposed by two different sources – which we elaborate on now. To keep the interpretations simple, throughout the following discussion, we assume that (a) , (b) p and q are of the same order, and (c) sx and sy are also of the same order. Note that (a) implies for a type B estimator of . Since we separate the task of estimating the nuisance parameter from the support recovery of V, we also assume that , which implies for a type A estimator of . The assumption , combined with (a), reduces the minimal signal strength condition (7) in Theorem 1 to .
In lieu of the discussion above, the first source of sparsity restriction is the minimal signal strength condition (7) on Sigy. To see this, first note that
where . Since ,
implying . Therefore, implicit in Theorem 1 lies the condition
| (10) |
which is enforced by the minimal signal strength requirement (7). Thus Theorem 1 does not hold for even when and r are small. This regime requires some attention because in case of sparse PCA [30] and linear regression [29], support recovery at 3 is proven to be information theoretically impossible. However, although a parallel result can be intuited to hold for CCA, the details of the nuances of SCCA support recovery in this regime is yet to be explored. Therefore, the sparsity requirement in (10) raises the question whether support recovery for CCA is at all possible when , even if and is known.
Question 1. Does there exist any decoder such that when ?
A related question is whether the minimal signal strength requirement (7) is necessary. To the best of our knowledge, there is no formal study on the information theoretic limit of the minimal signal strength even in context of the sparse PCA support recovery. Indeed, as we noted before, the detailed analyses of support recovery for SPCA provided in [1], [30], and [35] are all based on the nonzero principal component elements being of the order . Finally, although this question is not directly related to the sparsity conditions, it indeed probes the sharpness of the results in Theorem 1.
Question 2. What is the minimum signal strength required for the recovery of D(V)?
We will discuss Question 1 and Question 2 at greater length in Section III-B. In particular, Theorem 2(A) shows that there exists C > 0 so that support recovery at is indeed information theoretically intractable. On the other hand, in Theorem 2(B), we show that the minimal signal strength has to be of the order for full recovery of D(V). Thus when , (7) is indeed necessary from information theoretic perspectives.
The second source of restriction on the sparsity lies in Condition 1. Condition 1 is a l2-consistency condition, which has sparsity requirement itself owing the inherent hardness in the estimation of U. Indeed, Theorem 3.3 of [28] entails that it is impossible to estimate the canonical directions ui’s consistently if for some large C > 0. Hence, Condition 1 indirectly imposes the restriction . However, when , , and , the above restriction is already absorbed into the condition elicited in the last paragraph. In fact, there exist consistent estimators of U whenever and (see [27] or Section 3 of [28]). Therefore, in the latter regime, RecoverSupp coupled with the above-mentioned estimators succeeds. In view of the above, it might be tempting to think that Condition 1 does not impose significant additional restrictions. The restriction due to Condition 1, however, is rather subtle and manifests itself through computational challenges. Note that when support recovery is information theoretically possible, the computational hardness of recovery by RecoverSupp will be at least as much as that of the estimation of U. Indeed, the estimators of U which work in the regime , are not adaptive of the sparsity, and they require a search over exponentially many sets of size sx and sy. Furthermore, under , all polynomial time consistent estimators of U in the literature, e.g., COLAR [28, Theorem 4.2] or SCCA [24, Condition C4], require sx, sy to be of the order . In fact, [28] indicates that estimation of U or V for sparsity of larger order is NP hard.
The above raises the question whether RecoverSupp (or any method as such) can succeed at polynomial time when , . We turn to the landscape of sparse PCA for intuition. Indeed, in case of sparse PCA, different scenarios are observed in the regime , depending on whether , or (we recall that for SPCA we denote the sparsity of the leading principal component direction generically through s). We focus on the sub-regime first. In this case, both estimation and support recovery for sparse PCA are conjectured to be NP hard, which means no polynomial time method succeeds; see Section III-C for more details. The above hints that the regime , is NP hard for sparse CCA as well.
Question 3. Is there any polynomial time method that can recover the support D(V) when , ?
We dedicate Section III-C to answering this question. Subject to the recent advances in the low degree polynomial conjecture, we establish computational hardness of the regime , (up to a logarithmic factor gap) subject to , q. Our results are consistent with [28]’s findings in the estimation case and cover a broader regime; see Remark 5 for a comparison.
When the sparsity is of the order and , however, polynomial time support recovery and estimation are possible for the sparse PCA case. [1] showed that a co-ordinate thresholding type spectral algorithm works in this regime. Thus the following question is immediate.
Question 4. Is there any polynomial time method that can recover the support D(V) when , ?
We give an affirmative answer to Question 4 in Section III-D, which is in parallel with the observations for the sparse PCA. In fact, Corollary 2 shows that when and are known, , and , , estimation is possible in polynomial time. Since estimation is possible, RecoverSupp suffices for polynomial time support recovery in this regime, where is well below the information theoretic limit of . The main tool used in Section III-D is co-ordinate thresholding, which is originally a method for high dimensional matrix estimation [50], and apparently has nothing to do with estimation of canonical directions. However, under our setup, if the covariance matrix is consistently estimated in operator norm, by Wedin’s Sin θ Theorem [52], an SVD is enough to get a consistent estimator of U and V suitable for further precise analysis.
Remark 4. RecoverSupp uses sample splitting, which can reduce the efficiency. One can swap between the samples and compute two estimators of the supports. One can easily show that both the intersection and the union of the resulting supports enjoy the asymptotic guarantees of Theorem 1.
This section can be best summarized by Figure 1, which gives the information theoretic and computational landscape of sparse CCA analysis in terms of the sparsity. In other words, Figure 1 gives the phase transition plot for SCCA support recovery with respect to sparsity. It can be seen that our contributions (colored in red) complete the picture, which was initiated by [28].
Fig. 1:
Phase transition plots for SCCA estimation and support recovery problems with respect to sparsity. We have taken here. COLAR corresponds to the estimation method of [28]. Our contributions are colored in red. See [28] for more details on the regions colored in blue.
B. Information Theoretic Lower Bounds: Answers to Question 1 and 2
Theorem 2 establishes the information theoretic limits on the sparsity levels sx, sy, and the signal strengths Sigx and Sigy. The proof of Theorem 2 is deferred to Appendix D.
Theorem 2. Suppose and are estimators of D(U) and D(V), respectively. Let sx, sy > 1, and , . Then the following assertions hold:
-
If , thenOn the other hand, if , then
-
Let be the class of distributions satisfying . ThenOn the other hand, if , Then
In both cases, the infimum is over all possible decoders and .
First, we discuss the implications of part A of Theorem 2. This part entails that for full support recovery of V, the minimum sample size requirement is of the order . This requirement is consistent with the traditional lower bound on n in context of support recovery for sparse PCA [30, Theorem 3] and L1 regression [29, Corollary 1]. However, when , the sample size requirement for estimation of V is slightly relaxed, that is, [28, Theorem 3.2]. Therefore, from information theoretic point of view, the task of full support recovery appears to be slightly harder than the task of estimation. The scenario for partial support recovery might be different and we do not pursue it here. Moreover, as mentioned earlier, in the regime , RecoverSupp works with [28]’s (see Section 3 therein) estimator of U. Thus part A of Theorem 2 implies that is the information theoretic upper bound on the sparsity for the full support recovery of sparse CCA.
Part B of Theorem 2 implies that it is not possible to push the minimum signal strength below the level . Thus the minimal signal strength requirement (7) by Theorem 1 is indeed minimal up to a factor of . The last statement can be refined further. To that end, we remind the readers that for a good estimator of , i.e., a type B estimator, if . However, the latter always holds if support recovery is at all possible, because in that case , and elementary linear algebra gives . Thus, it is fair to say that, provided a good estimator of , the requirement (7) is minimal up to a factor of . Indeed, this implies that for banded inverses with finite band-width our results are rate optimal.
It is further worth comparing this part of the result to the SPCA literature. In the SPCA support recovery literature, generally, the lower bound on the signal strength is depicted in terms of the sparsity s, and usually a signal strength of order is postulated (cf. [1], [30], [35]). Using our proof strategies, it can be easily shown that for SPCA, the analogous lower bound on the signal strength would be . The latter is generally much smaller than and only when , the requirement of close to the lower bound. Thus, in the regime , the lower bound should rather be of the order . Therefore the minimum signal strength requirement of typically assumed in SPCA literature seems larger than necessary.
Main idea behind the proof of Theorem 2:
The main device used in this proof is Fano’s inequality [53]. Note that for any ,
| (11) |
Therefore it suffices to show that the left hand side in the above inequality is bounded away from 1/2 for some carefully chosen . If is finite, we can lower bound the left hand side of (11) using Fano’s inequality [53], which yields
| (12) |
Thus the main task is to choose in a way so that the right hand side (RHS) of (12) is large. We will choose so that X and Y are jointly Gaussian. In particular, , , and where and are fixed, and α is allowed to vary in a set . In this model, r = 1, ρ is the canonical correlation, and α and β0 are the left and right canonical covariates, respectively. Also, varies across as α varies across . Moreover, . Our main task boils down to choosing carefully.
The idea behind choosing is as follows. For any decoder, i.e., an estimator of the support, the chance of making error increases when is large. This can also be seen noting that the right hand side of (12) increases as increases. However, even if we prefer a larger , we need to ensure that the KL divergence between the distributions in the resulting is small. The reason is that, for a large , the right hand side of (12) can be small unless the KL divergence between the corresponding distributions in is small. In other words, any decoder will face a challenge detecting the true support of α when there are many distributions to choose from, and these distributions are also close to each other in KL distance.
For part A of Theorem 2, we choose in the following way. Letting
we let be the class of α’s which are obtained by replacing one of the in α0 by 0, and one of the zero’s in α0 by . A typical α obtained this way looks like
In this case, it turns out that . Under the conditions of part A of 2, we can show that the RHS of (12) is bounded below by 1/2 for this . The proof of part A is similar to its PCA analogue, which is Theorem 3 of [30]. The latter theorem is also based on Fano’s lemma and uses a similar construction for . However, there is no PCA analogue of part B. For part B of Theorem 2, we let be the class of all α’s so that
where
can take any position out of the positions. Clearly, . It can be shown that the RHS of (12) is bounded below by 1/2 in this case as well.
C. Computational Limits and Low Degree Polynomials: Answer to Question 3
We have so far explored the information theoretic upper and lower bounds for recovering the true support of leading canonical correlation directions. However, as indicated in the discussion preceding Question 3, the statistically optimal procedures in the regime where , are computationally intensive and is of exponential complexity (as a function of p, q). In particular, [28] have already showed that when sx and sy belong to parts of this regime, estimation of the canonical correlates is computationally hard, subject to a computational complexity based Planted Clique Conjecture. For the case of support recovery, the SPCA has been explored in detail and the corresponding computational hardness has been established in analogous regimes – see, e.g., [1], [30], and [35] for details. A similar phenomenon of computational hardness is observed in case of SPCA spike detection problem [54]. In light of the above, it is natural to believe that the SCCA support recovery is also computationally hard in the regime , , and, as a result, yields a statistical-computational gap. Although several paths exist to provide evidence towards such gaps 4, the recent developments using “Predictions from Low Degree Polynomials” [2], [39], [40] is particularly appealing due its simplicity in exposition. In order to show computationally hardness of the SCCA support recovery problem in the , regime, we shall resort to this very style of ideas, which has so far been applied successfully to explore statistical-computational gaps under sparse PCA [36], Stochastic Block Models, and tensor PCA [40], among others. This will allow us to explore the computational hardness of the problem in the entire regime where
| (13) |
compared to the somewhat partial results (see Remark 5 for detailed comparison) in earlier literature.
We divide our discussions to argue the existence of a statistical-computational gap in this regime as follows. Starting with a brief background on the statistical literature on such gaps, we first present a natural reduction of our problem to a suitable hypothesis testing problem in Section III-C1. Subsequently, in Section III-C2 we present the main idea of the “low degree polynomial conjecture” by appealing to the recent developments in [39], [40], and [2]. Finally, we present our main result for this regime in Section III-C3, thereby providing evidence of the aforementioned gap modulo the Low Degree Polynomial Conjecture presented in Conjecture 1.
1). Reduction to Testing Problem::
Denote by the distribution of a random vector. Therefore corresponds to the case when X and Y are uncorrelated. We first show that there is any scope of support recovery in only if is distinguishable from , i.e., the test vs. has asymptotic zero error.
To formalize the ideas, suppose we observe i.i.d random vectors which are distributed either as or . We denote the n-fold product measures corresponding to and by and , respectively. Note that if , then . We overload notation, and denote the combined sample and by X and Y respectively. In this section, X and Y should be viewed as unordered sets. The test for testing the null vs. the alternative is said to strongly distinguish and if
The above implies that both the type I error and the type II error of Φn converges to zero as . In case of composite alternative , the test strongly distinguishes from if
Now we explain how support recovery and the testing framework are connected. Suppose there exist decoders which exactly recover D(U) and D(V) under for . Then the trivial test, which rejects the null if either of the estimated supports is non-empty, strongly distinguishes from . The above can be coined as the following lemma.
Lemma 1. Suppose there exist polynomial time decoders and of D(U) and D(V) so that
| (14) |
Further assume, , and . Then there exists a polynomial time test which strongly distinguishes and .
Thus, if a regime does not allow any polynomial time test for distinguishing from , there can be no polynomial time computable consistent decoder for D(U) and D(V). Therefore, it suffices to show that there is no polynomial time test which distinguishes from in the regime , . To be more explicit, we want to show that if , , then
| (15) |
for any Φn that is computable in polynomial time.
The testing problem under concern is commonly known as the CCA detection problem, owing to its alternative formulation as vs. . In other words, the test tries to detect if there is any signal in the data. Note that, Lemma 1 also implies that detection is an easier problem than support recovery in that the former is always possible whenever the latter is feasible. The opposite direction may not be true, however, since detection does not reveal much information on the support.
2). Background on the Low-degree Framework:
We shall provide a brief introduction to the low-degree polynomial conjecture which forms the basis of our analyses here, and refer the interested reader to [39], [40], and [2] for in-depth discussions on the topic. We will apply this method in context of the test vs. . The low-degree method centers around the likelihood ratio , which takes the form in the above framework. Our key tool here will be the Hermite polynomials, which form a basis system of [62]. Central to the low-degree approach lies the projection of onto the subspace (of ) formed by the Hermite polynomials of degree at most . The latter projection, to be denoted by from now on, is important because it measures how well polynomials of degree ≤ Dn can distinguish from . In particular,
| (16) |
where the maximization is over polynomials of degree at most Dn [36].
The norm of the untruncated likelihood ratio has long held an important place in the theory hypothesis testing since implies and are asymptotically indistinguishable. While the untruncated likelihood ratio is connected to the existence of any distinguishing test, degree Dn projections of are connected to the existence of polynomial time distinguishing tests. The implications of the above heuristics are made precise by the following conjecture [40, Hypothesis 2.1.5].
Conjecture 1 (Informal). Suppose . For “nice” sequences of distributions and , if as n → ∞ whenever Dn ≤ t(n)polylog(n), then there is no time-nt(n) test that strongly distinguishes and .
Thus Conjecture 1 implies that the degree-Dn polynomial is a proxy for time-nt(n) algorithms [2]. If we can show that for a Dn of the order for some ϵ > 0, then the low degree Conjecture says that no polynomial time test can strongly distinguish and [2, Conjecture 1.16].
Conjecture 1 is informal in the sense that we do not specify the “nice” distributions, which are defined in Section 4.2.4 of [2] (see also Conjecture 2.2.4 of [40]). Niceness requires to be sufficiently symmetric, which is generally guaranteed by naturally occurring high dimensional problems like ours. The condition of “niceness” is attributed to eliminate pathological cases where the testing can be made easier by methods like Gaussian elimination. See [40] for more details.
3). Main Result:
Similar to [36], we will consider a Bayesian framework. It might not be immediately clear how a Bayesian formulation will fit into the low-degree framework, and lead to (15). However, the connection will be clear soon. We put independent Rademacher priors and on α and β. We say if are i.i.d., and for each ,
| (17) |
The Rademacher prior can be defined similarly. We will denote the product measure by . Let us define
| (18) |
When , is the covariance matrix corresponding to and with covariance . Hence, for to be positive definite, is a sufficient condition. The priors and put positive weight on α and β that do not lead to a positive definite , and hence calls for extra care during the low-degree analysis. This subtlety is absent in the sparse PCA analogue [36].
Let us define
| (19) |
We denote the n-fold product measure corresponding to by . If , then the marginal density of (X, Y) is . The following lemma, which is proved in Appendix H-C, explains how the Bayesian framework is connected to (15).
Lemma 2. Suppose and . Then
where is the shorthand for .
Note that a similar result holds for as well because . Lemma 2 implies that to show (15), it suffices to show that a polynomial time computable fails to strongly distinguish the marginal distribution of X and Y from . However, the latter falls within the realms of the low degree framework because the corresponding likelihood ratio takes the form
| (20) |
Using priors on the alternative space is a common trick to convert a composite alternative to a simple alternative, which generally yields more easily to various mathematical tools.
If we can show that for some , then Conjecture 1 would indicate that a time computable fails to distinguish the distribution of from . Theorem 3 accomplishes the above under some additional conditions on p, q, and n, which we will discuss shortly. Theorem 3 is proved in Appendix E.
Theorem 3. Suppose ,
| (21) |
Then is O(1) where is as defined in (20).
The following Corollary results from combining Lemma 2 with Theorem 3.
Corollary 1.Suppose| (22) |
If Conjecture 1 is true, then for , there is no time test that strongly distinguishes and .
Corollary 1 conjectures that polynomial time algorithms can not strongly distinguish and provided sx, sy, p, and q satisfy (22). Therefore under (22), Lemma 1 conjectures support recovery to be NP hard.
Now we discuss a bit on condition (22). The first constraint in (22) is expected because it ensures , , which indicates that the sparsity is in the hard regime. We need to explain a bit on why the other constraint p, is needed. If , the sample canonical correlations are consistent, and therefore strong separation is possible in polynomial time without any restriction on the sparsity [23], [41]. Even if and , then also strong separation is possible in model 18 provided the canonical correlation ρ is larger than some threshold depending on c1 and c2 [23]. The restriction p, ensures that the problem is hard enough so that the vanilla CCA does not lead to successful detection. The constant 3e is not sharp and possibly can be improved. The necessity of the condition p, is unknown for support recovery, however. Since support recovery is a harder problem than detection, in the hard regime, polynomial time support recovery algorithms may fail at a weaker condition on n, p, and q.
Remark 5. [Comparison with previous work:] As mentioned earlier, [28] was the first to discover the existence of computational gap in context of sparse CCA. In their seminal work, [28] established the computational hardness of CCA estimation problem at a particular subregime of , provided is allowed. In view of the above, it was hinted that sparse CCA becomes computationally hard when , . However, when is bounded, the entire regime , is probably not computationally hard. In Section III-D, we show that if , then both polynomial time estimation and support recovery are possible if , at least in the known and case. The latter sparsity regime can be considerably larger than , . Together, Section III-D and the current section indicate that in the bounded case, the transition of computational hardness for sparse CCA probably happens at the sparsity level , not , which is consistent with sparse PCA. Also, the low-degree polynomial conjecture allowed us to explore almost the entire targeted regime , , where [28], who used the planted clique conjecture, considers only a subregime of , .
We will end the current section with a brief outline of the proof of Theorem 3.
The main idea behind the proof of Theorem 3:
Let us denote by the linear span of all -variate Hermite polynomials of degree at most Dn. For each and , we let , where is the univariate normalized Hermite polynomial of degree zi. We will discuss the Hermite polynomials in greater detail in Appendix E. Any normalized m-variate Hermite polynomial is of the form , where . Then is the linear span of all with
Since is the projection of on , it then holds that
The first step of the proof is to find out the expression of . Since , we can partition w into w = (w1, … , wn), where for each i ∈ [n]. Using some algebra, we can show that
Exploiting the properties of Hermite polynomials, it can be shown that
where for , , and any function , the notation stands for the z-th order partial derivative of f with respect to t evaluated at the origin. The rest of the proof is similar to the PCA analogue in [36], but there is an extra indicator term for the CCA case. Following [36], we use the common trick of using replicas of α and β to simplify the algebra. Suppose , and , are independent. Let W be the indicator function of the event . Denote by the p-th order truncation of the Taylor series expansion of at x = 0. Following some algebra, it can be shown that
Comparing the above with the analogous result for PCA, namely Lemma 4.2 of [36], we note that the indicator term W does not appear in the PCA analogue. The indicator term W appears in the CCA case because we had set to be for to tackle the extra restrictions on α and β in this case.
D. A Polynomial Time Algorithm for Regime : Answer to Question 4
In this subsection, we show that in the difficult regime , using a soft co-ordinate thresholding (CT) type algorithm, we can estimate the canonical directions consistently when . CT was introduced by the seminal work of [50] for the purpose of estimating high dimensional covariance matrices. For SPCA, [1]’s CT is the only algorithm that provably recovers the full support in the difficult regime (see also [35]). In context of CCA, [26] uses CT for partial support recovery in the rank one model under what we referred to as the easy regime. However, [26]’s main goal was the estimation of the leading canonical vectors, not support recovery. As a result, [26] detects the support of the relatively large elements of the leading canonical directions, which are subsequently used to obtain consistent preliminary estimators of the leading canonical directions. Our thresholding level and theoretical analysis are different from that of [26] because the analytical tools used in the easy regime do not work in the difficult regime.
1). Methodology: Estimation via CT:
By “thresholding a matrix A co-ordinate-wise”, we will roughly mean the process of assigning the value zero to any element of A which is below a certain threshold in absolute value. Similar to [1], we will consider the soft thresholding operator, which, at threshold level t, takes the form
It will be worth noting that the soft thresholding operator is continuous.
| Algorithm 2 Coordinate Thresholding (CT) for estimating D(V) | |
|---|---|
|
|
We will also assume that the covariance matrices and are known. To understand the difficulty of unknown and , we remind the readers that . Because the matrices U and V are sandwiched between the matrices and , their sparsity pattern does not get reflected in the sparsity pattern of . Therefore, if one blindly applies CT to , they can at best hope to recover the sparsity pattern of the outer matrices and . If the supports of the matrices U and V are of main concern, CT should rather be applied on the matrix . If and are unknown, one needs to efficiently estimate before the application of CT. Although under certain structural conditions, it is possible to find rate optimal estimators and of and at least in theory, the errors and may still blow up due to the presence of the high dimensional matrix , which can be as big as in operator norm. One may be tempted to replace with a sparse estimator of to facilitate faster estimation, but that does not work because we explicitly require the formulation of as the sum of Wishart matrices (see equation 37 in the proof). The latter representation, which is critical for the sharp analysis, may not be preserved by a CLIME [51] or nodewise Lasso estimator [49] of Σxy.
We remark in passing that it is possible to obtain an estimator so that . Although the latter does not provide much control over the operator norm of , it is sufficient for partial support recovery, e.g., the recovery of the rows of U or V with strongest signals. (See Appendix B of [26] for example, for some results in this direction under the easy regime when r = 1.)
As indicated by the previous paragraph, we apply coordinate thresholding to the matrix , which directly targets the matrix . We call this step the peeling step because it extracts the matrix from the sandwiched matrix . We then perform the entry-wise co-ordinate thresholding algorithm on the peeled form with threshold Thr so as to obtain . We postpone the discussion on Thr to Section III-D2. The thresholded matrix is an estimator of , but we need an estimator of . Therefore, we again sandwich between and . The motivation behind this sandwiching is that if , then is a good estimator of in that
However, is an SVD of . Using Davis-Kahan sin theta theorem [52], one can show that the SVD of produces estimators and of and , where the columns of and are ϵn-consistent in l2 norm for the columns of and , respectively, up to a sign flip (cf. Theorem 2 of [52]). Pre-multiplying the resulting U′ by yields an estimator of U up to a sign flip of the columns. We do not worry about the sign flip because Condition 1 allows for the sign flips of the columns. Therefore, we feed this into RecoverSupp as our final step. See Algorithm 2 for more details.
Remark 6. In case of electronic health records data, it is possible to obtain large surrogate data on X and Y separately and thus might allow relaxing the known precision matrices assumption above. We do not pursue such semi-supervised setups here.
2). Analysis of the CT Algorithm:
For the asymptotic analysis of the CT algorithm, we will assume the underlying distribution to be Gaussian, i.e., . This Gaussian assumption will be used to perform a crucial decomposition of sample covariance matrix, which typically holds for Gaussian random vectors. [1], who used similar devices for obtaining the sharp rate results in SPCA, also required a similar Gaussian assumption. We do not yet know how to extend these results to sub-Gaussian random vectors.
Let us consider the threshold , where Thr is explicitly given in Theorem 4. Unfortunately, tuning Thr requires knowledge of the underlying sparsities sx and sy. Similar to [1], our thresholding level differs from the traditional choice of order in the easy regime analyzed in [50], [63], and [26]. The latter level is too large to successfully recover all the nonzero elements in the difficult regime. We threshold at a lower level, which, in turn, complicates the analysis considerably. Our main result in this direction, stated in Theorem 4, is proved in Appendix F.
Theorem 4. Suppose . Further suppose , , and . Let K and C1 be constants so that and , where C > 0 is an absolute constant. Suppose the threshold level Thr is defined by
Suppose is a constant that takes the value K, C1, or one in case (i), (ii), and (iii), respectively. Then there exists an absolute constant C > 0 so that the following holds with probability 1 − o(1) for :
To unpack the statement of Theorem 4, let us assume for the time being. Then case (ii) in the theorem corresponds to . Thus, CT works in the difficult regime provided . It should be noted that the threshold for this case is almost of the order , which is much smaller than , the traditional threshold for the easy regime. Next, observe that case (i) is an easy case because is much smaller than . Therefore, in this case, the traditional threshold of the easy regime works. Case (iii) includes the hard regime, where polynomial time support recovery is probably impossible. Because it is unlikely that CT can improve over the vanilla estimator in this regime, the threshold is set to zero.
Remark 7. Theorem 4 requires because one of our concentration inequalities in the analysis of case (ii) needs this technical condition (see Lemma 8). The omitted regime is indeed an easier one, where specialized methods like CT are not even required. In fact, it is well known that subgaussian X and Y satisfy (cf. Theorem 4.7.1 of [42])
which is in the regime under consideration. Including this result in the statement of Theorem 4 would unnecessarily lengthen the exposition. Therefore, we exclude this regime from Theorem 4 and focus on the regime.
Remark 8. The statement of Theorem 4 is not explicit about the lower bound on the constant C1. However, our simulations show that the algorithm works for . Both threshold parameters C1 and K in Theorem 4 depend on the unknown . The proof actually shows that can be replaced by , , , .
Finally, Theorem 4 leads to the following corollary, which establishes that in the difficult regime, there exist estimators which satisfy Condition 1, so that Algorithm 2 succeeds with probability tending to one provided . This answers Question 4 in the affirmative for Gaussian distributions.
Corollary 2. Instate the conditions of Theorem 4. Then there exists so that if
| (23) |
then the defined in Algorithm 2 satisfies Condition 1, and Algorithm 2 correctly recovers .
We defer the proof of Corollary 2 to Appendix G. We will now present a brief outline of the proof of Theorem 4.
Main idea behind the proof of Theorem 4:
The proof hinges on the hidden variable representation of X and Y due to [64]. We discuss this representation in detail in Appendix C-2; it basically says that the data matrices X and Y can be represented as
where , , and are independent standard Gaussian data matrices, and , , , and . We will later show in Section C-2 that and are well defined positive definite matrices. It follows that has the representation
Next, we define some sets. Let , , , and . Thus E1 and E2 correspond to the supports, whereas F1 and F2 correspond to their complements. Now we partition [p] × [q] into the following three sets:
| (24) |
and
| (25) |
Therefore E is the set that contains the joint support. We can decompose as
| (26) |
where is the projection operator defined in (4).
The usefulness of the decomposition in (26) is that S1, S2, and S3 have different supports, which enables us to write
We can therefore analyze the three terms , , and separately. In general, the thresholding operator η is not additive: for matrices A and B, η(A + B) = η(A) + η(B) does not hold in general unless A and B have disjoint supports.
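As a quick illustration of why disjoint supports matter here, the toy example below (threshold 1.0, chosen arbitrarily) shows that entry-wise hard thresholding is not additive in general, while it trivially distributes over matrices with disjoint supports.

```python
import numpy as np

eta = lambda A, thr: A * (np.abs(A) > thr)      # entry-wise hard thresholding

A = np.array([[0.6, 0.0], [0.0, 0.6]])
B = np.array([[0.6, 0.0], [0.0, -0.2]])
print(eta(A + B, 1.0))                # the (1, 1) entry 1.2 survives the threshold
print(eta(A, 1.0) + eta(B, 1.0))      # both summands are killed, so the sum is zero
# If A and B had disjoint supports, eta(A + B) and eta(A) + eta(B) would agree entry by entry.
```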
As indicated above, we analyze the operator norms of , , and separately. Among S1, S2, and S3, S1 is the only matrix that is supported on E, the true support. The basic idea of the proof is to show that co-ordinate thresholding preserves the matrix S1 and kills off the other matrices S2 and S3, which contain the noise terms. S1 includes the matrix . Because concentrates around Ir by the Bai-Yin law (cf. Theorem 4.7.1 of [42]), concentrates around . Therefore the analysis of η(S1) is relatively straightforward.
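The concentration driving the analysis of η(S1) is easy to check numerically. The sketch below (with arbitrary n and r) verifies that the operator-norm deviation of the r × r Wishart factor ZᵀZ/n from Ir is of order (r/n)^{1/2}, where Z stands for the n × r latent Gaussian data matrix of the hidden variable representation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5000, 3
Z = rng.standard_normal((n, r))            # latent standard Gaussian data matrix
W = Z.T @ Z / n                            # Wishart factor appearing in S1
dev = np.linalg.norm(W - np.eye(r), 2)     # operator-norm deviation from the identity
print(dev, np.sqrt(r / n))                 # the two numbers are of the same order
```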
Most of the proof is devoted to showing that and are small, i.e., that co-ordinate thresholding kills off the noise terms. The difficulty arises because the threshold is kept smaller than the traditional threshold of order to adjust for the hard regime. Therefore the approaches of [50] or [28] do not work in this regime. The noise matrices S2 and S3 are sums of matrices of the form , , or , or their transposes, where for the rest of this section, M and N should be understood as deterministic matrices of appropriate dimensions, whose definitions can change from line to line. Analyzing and essentially hinges on Lemma 8, which upper bounds the operator norm of matrices of the form . The proof of Lemma 8 uses, among other tools, a sharp Gaussian concentration result from [1] (see Corollary 10 therein) and a generalized Chernoff inequality for dependent Bernoulli random variables [65]. Using Lemma 8, we can also upper bound the operator norms of matrices of the form because can be represented as for some matrix M3 of appropriate dimension. Therefore, Lemma 8 suffices to show that and are small, which also completes the proof.
The proof of Theorem 4 has similarities with the proof of the analogous result for PCA in [1] (see Theorem 1 therein). However, one main difference is that for PCA, the key instrument is the representation of X as the spiked model [44], which yields the representation
| (27) |
where and are standard Gaussian data matrices, and is a deterministic matrix. The analysis in PCA revolves around the sample covariance matrix , which, following (27), writes as
From the above representation, it can be shown that the analogues of S2 and S3 in the PCA case are sums of matrices of the form or their transposes. [1] uses an upper bound on to bound the PCA analogues of and (see Proposition 13 therein). In contrast, we encounter terms of the form since CCA is concerned with X^T Y/n. To deal with these terms, we need an upper bound on instead, which requires a separate, elaborate proof. Although the basic ideas behind bounding and are similar, the proof of the bound on is more involved. For example, some independence structures are destroyed by the pre- and post-multiplication by the matrices M1 and N1, respectively. We require concentration inequalities for dependent Bernoulli random variables to tackle the latter.
IV. Numerical Experiments
This section illustrates the performance of different polynomial time CCA support recovery methods as the sparsity transitions from the easy to the difficult regime. We base our demonstration on a Gaussian rank-one model, i.e., (X, Y) are jointly Gaussian with covariance matrix . For simplicity, we take p = q and sx = sy = s. In all our simulations, ρ is set to 0.5, and , where
are unit norm vectors. Note that most elements of β are of order O(s^{-2/3}), whereas a typical element of α is of order O(s^{-1/2}). Therefore, we will refer to α and β as the moderate and the small signal case, respectively. For the population covariance matrices Σx and Σy of X and Y, we consider the following two scenarios:
- A (Identity): Σx = Ip and Σy = Iq. Since p = q, they are essentially the same.
- B (Sparse inverse): This example is taken from [28]. In this case, are banded matrices, whose entries are given by
Now we explain our common simulation scheme. We take the sample size n to be 1000, and consider three values of p: 100, 200, and 300. The largest value of p + q is thus 600, which is smaller than, but comparable to, n. Our simulations indicate that all of the methods considered here require n to be considerably larger than p + q for the asymptotics to kick in at ρ = 0.5. We will discuss this point in detail later. We further let vary in the set [0.01, 2]. To be more specific, we consider 16 equidistant points in [0.01, 2] for the ratio .
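As a concrete illustration of this setup, the Python sketch below draws (X, Y) from a rank-one Gaussian model with identity marginal covariances (case A) and canonical correlation ρ = 0.5. The uniform-on-support α and β used here are stand-ins: the exact moderate- and small-signal vectors of our simulations, as well as the map from the plotted sparsity ratio to s, are not reproduced.

```python
import numpy as np

def generate_rank_one(n, rho, alpha, beta, rng):
    """Draw n samples of (X, Y) with identity marginal covariances and
    cross-covariance rho * alpha beta^T (so the canonical correlation is rho)."""
    p, q = alpha.size, beta.size
    z = rng.standard_normal(n)                  # shared one-dimensional latent variable
    E1, E2 = rng.standard_normal((n, p)), rng.standard_normal((n, q))
    c = 1.0 - np.sqrt(1.0 - rho)                # (I - rho * a a^T)^{1/2} = I - c * a a^T
    X = np.sqrt(rho) * np.outer(z, alpha) + E1 - c * np.outer(E1 @ alpha, alpha)
    Y = np.sqrt(rho) * np.outer(z, beta) + E2 - c * np.outer(E2 @ beta, beta)
    return X, Y

rng = np.random.default_rng(1)
n, p, s, rho = 1000, 100, 10, 0.5
alpha = np.zeros(p); alpha[:s] = 1.0 / np.sqrt(s)   # stand-in unit-norm sparse directions
beta = np.zeros(p); beta[:s] = 1.0 / np.sqrt(s)
X, Y = generate_rank_one(n, rho, alpha, beta, rng)
print(np.corrcoef(X @ alpha, Y @ beta)[0, 1])        # close to rho = 0.5
```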
Now we discuss the error metrics used to compare the performance of the different support recovery methods. Type I and type II errors are commonly used to measure the performance of support recovery [1]. In the case of support recovery of α, we define the type I error to be the proportion of zero elements of α that appear in the estimated support . Thus, we quantify the type I error of α by . On the other hand, the type II error for α is the proportion of elements of D(α) that are absent from , i.e., the type II error is quantified by . The type I and type II errors corresponding to β are defined similarly. Our simulations demonstrate that methods with a low type I error often exhibit a high type II error, and vice versa. In such situations, comparing the corresponding methods becomes difficult if one uses the type I and type II errors separately. Therefore, we consider a scaled Hamming loss type metric, which suitably combines the type I and type II errors. The symmetric Hamming error of estimating D(α) by is [66, Section 2.1]
Note that the above quantity is always bounded above by one. We can similarly define the symmetric Hamming distance between D(β) and . Finally, the estimates of these three errors (Type I, Type II, and scaled Hamming Loss) are obtained based on 1000 Monte Carlo replications.
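The three error metrics can be computed from the true and estimated supports as in the sketch below. The normalization of the symmetric Hamming error used here (size of the symmetric difference divided by the sum of the two support sizes) is one convention that stays bounded by one; the exact scaling in [66, Section 2.1] may differ.

```python
import numpy as np

def support_errors(true_supp, est_supp, p):
    """Type I, type II, and a symmetric Hamming-type error for support recovery."""
    true_supp, est_supp = set(true_supp), set(est_supp)
    n_zeros = p - len(true_supp)
    type1 = len(est_supp - true_supp) / n_zeros if n_zeros else 0.0   # false positives / zeros
    type2 = len(true_supp - est_supp) / len(true_supp)                # misses / signals
    denom = len(true_supp) + len(est_supp)
    hamming = len(true_supp ^ est_supp) / denom if denom else 0.0     # scaled symmetric difference
    return type1, type2, hamming

print(support_errors({0, 1, 2}, {1, 2, 3, 4}, p=10))   # (2/7, 1/3, 3/7)
```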
Now we discuss the support recovery methods we compare here.
Naïve SCCA. We estimate α and β using the SCCA method of [24], and set and , where and are the corresponding SCCA estimators. To implement the SCCA method of [24], we use the R code referred to therein with default tuning parameters.
Cleaned SCCA. This method implements RecoverSupp with the above mentioned SCCA estimators of α and β as the preliminary estimators.
CT. This is the method outlined in Algorithm 2, which is RecoverSupp coupled with the CT estimators of α and β.
Our CT method requires knowledge of the population covariance matrices Σx and Σy. Therefore, to keep the comparison fair, we implement RecoverSupp with the population covariance matrices in the case of the cleaned SCCA method as well. Because of their reliance on RecoverSupp, both cleaned SCCA and CT depend on the threshold cut, tuning which seems to be a non-trivial task. We set , where C is the thresholding constant. Our simulations show that a large C results in a high type II error, whereas insufficient thresholding inflates the type I error. Taking the Hamming loss into account, we observe that C ≈ 1 leads to better overall performance in case A. Case B, on the other hand, requires a smaller value of the thresholding parameter. In particular, we let C be one in case A, and set C = 0.05 and 0.2, respectively, for the support recovery of α and β in case B. The CT algorithm requires an extra threshold parameter, namely the parameter Thr in Algorithm 2, which corresponds to the co-ordinate thresholding step. We set Thr in accordance with Theorem 4 and Remark 8, with K being and C1 being . We set as in Remark 8, that is
The errors incurred by our methods in case A are displayed in Figure 2 (for α) and Figure 3 (for β). Figures 4 and 5, on the other hand, display the errors in the recovery of α and β, respectively, in case B.
Fig. 2:
Support recovery for α when and . Here threshold refers to cut in Theorem 1.
Fig. 3:
Support recovery for β when and . Here threshold refers to cut in Theorem 1.
Fig. 4:
Support recovery for α when and are the sparse covariance matrices. Here threshold refers to cut in Theorem 1.
Fig. 5:
Support recovery for β when and are the sparse covariance matrices. Here threshold refers to cut in Theorem 1.
Now we discuss the main observations from the above plots. When the sparsity parameter s is quite low (less than ten in the current settings), the naïve SCCA method is sufficient in the sense that the specialized methods do not perform any better. Moreover, the naïve method is the most conservative of the three methods. As a consequence, its type I error is always small, although its type II error grows faster than that of any other method. The specialized methods are able to improve the type II error at the cost of a higher type I error. At higher sparsity levels, however, the specialized methods can outperform the naïve method in terms of the Hamming error. This is most evident when the setting is also complex, i.e., the signal is small or the underlying covariance matrices are not the identity. In particular, Figures 2 and 4 show that when the signal strength is moderate and the sparsity is high, the cleaned SCCA has the lowest Hamming error. In the small signal case, however, CT exhibits the best Hamming error as increases; cf. Figures 3 and 5.
The type I error of CT can be slightly improved if the sparsity information is incorporated into the thresholding step. We simply replace cut by the maximum of cut and the s-th largest element of , where the latter is as in Algorithm RecoverSupp. See, e.g., Figure 6, which shows that this modification reduces the Hamming error of the CT algorithm in case A; a small sketch of the modification appears after Figure 6. Our empirical analysis hints that the CT algorithm has potential for improvement from the implementation perspective. In particular, it may be desirable to obtain a more efficient procedure for choosing cut in a systematic way. However, such a detailed numerical analysis is beyond the scope of the current paper and will require further modifications of the initial methods for estimating α and β, both for scalability and for finite sample performance. We leave these explorations as important future directions.
Fig. 6:
Support recovery by the CT algorithm when we use the information on sparsity to improve the type I error. Here and are and , respectively, and threshold refers to cut in Theorem 1. To see the decrease in type I error, compare the errors with those of Figures 2 and 3.
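For completeness, here is a minimal sketch of the sparsity-informed modification described above: the cut-off is replaced by the maximum of cut and the s-th largest cleaning score. The vector `scores` stands in for the quantity thresholded inside RecoverSupp, whose exact definition is given in Algorithm 3.

```python
import numpy as np

def sparsity_informed_support(scores, cut, s):
    """Select coordinates whose score exceeds max(cut, s-th largest score);
    for distinct scores this caps the number of selected coordinates at s."""
    s_th_largest = np.sort(np.abs(scores))[-s]
    new_cut = max(cut, s_th_largest)
    return np.where(np.abs(scores) >= new_cut)[0]
```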
It is natural to wonder about the effect of cleaning via RecoverSupp on SCCA. As mentioned earlier, during our simulations we observed that a cleaning step generally improves the type II error of the naïve SCCA, but it also increases the type I error. In terms of the combined measure, i.e., the Hamming error, it turns out that cleaning does have an edge at higher sparsity levels in case B; cf. Figures 4 and 5. However, the scenario is different in case A. Although Figures 2 and 3 indicate that almost no cleaning occurs at the chosen threshold level of one, we did observe cleaning at lower threshold levels. However, the latter does not improve the overall Hamming error of naïve SCCA. The consequence of cleaning may be different for other SCCA methods.
To summarize, when the sparsity is low, support recovery using the naïve SCCA is probably as good as the specialized methods. However, at higher sparsity level, specialized support recovery methods may be preferable. Consequently, the precise analysis of the apparently naïve SCCA will indeed be an interesting future direction.
V. Discussion
In this paper, we have discussed the rate optimal behavior of the information theoretic and computational limits of joint support recovery for the sparse canonical correlation analysis problem. Inspired by recent results in the estimation theory of sparse CCA, a flurry of results in sparse PCA, and related developments based on the low-degree polynomial conjecture, we are able to paint a complete picture of the landscape of support recovery for SCCA. As for future directions, it is worth noting that our results are so far not designed to recover D(vi) for individual i ∈ [r] separately (hence the term joint recovery). Although this is also the case for most of the state of the art in the sparse PCA problem (results often exist only for the combined support [1] or for the single spike model where r = 1 [29]), we believe that it is an interesting question for deeper exploration in the future. Moreover, moving beyond asymptotically exact recovery of the support to more nuanced metrics (e.g., Hamming loss) will also require new ideas worth studying. Finally, it remains an interesting question whether polynomial time support recovery is possible in the , regime using a CT type idea, but with unknown yet structured high dimensional nuisance parameters Σx, Σy.
Acknowledgments
This work was supported by National Institutes of Health grant P42ES030990.
Biographies
Nilanjana Laha received a Bachelor of Statistics in 2012 and a Master of Statistics in 2014 from the Indian Statistical Institute, Kolkata. She then received a Ph.D. in Statistics in 2019 from the University of Washington, Seattle. She was a postdoctoral research fellow in the Department of Biostatistics at Harvard University from 2019 to 2022. She is currently an Assistant Professor of Statistics at Texas A&M University. Her research interests include dynamic treatment regimes, high dimensional association, and shape constrained inference.
Rajarshi Mukherjee received a Bachelor of Statistics in 2007 and a Master of Statistics in 2009 from the Indian Statistical Institute, Kolkata. He received his Ph.D. in Biostatistics from Harvard University in 2014. He was a Stein Fellow in the Department of Statistics at Stanford University from 2014 to 2017, and an Assistant Professor in the Division of Biostatistics at the University of California, Berkeley, from 2017 to 2018. Since 2018, he has been an Assistant Professor in the Department of Biostatistics at Harvard University. His research interests primarily lie in structured signal detection problems in high dimensional and network models, and functional estimation and adaptation theory in nonparametric statistics.
Appendix A. Full version of RecoverSupp
Algorithm 3: RecoverSupp, simultaneous support recovery of U and V
In Algorithm 3, we used different cut-offs for estimating and , which are cutx and cuty, respectively. In practice, one can choose the same threshold cut for both of them.
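Since the body of Algorithm 3 is not reproduced above, the sketch below is only a schematic reading of the cleaning step: row-wise scores built from preliminary direction estimates and a cross-covariance computed on an independent sample are thresholded at cutx and cuty separately. The particular scores used here are stand-ins for the exact statistics of Algorithm 3.

```python
import numpy as np

def recover_supp_sketch(Sxy_hat, U_prelim, V_prelim, cutx, cuty):
    """Schematic cleaning: threshold row-wise scores at separate cut-offs.
    The scores below are illustrative placeholders, not the exact RecoverSupp statistics."""
    score_x = np.abs(Sxy_hat @ V_prelim).max(axis=1)     # one score per X-coordinate
    score_y = np.abs(Sxy_hat.T @ U_prelim).max(axis=1)   # one score per Y-coordinate
    D_U = np.where(score_x > cutx)[0]
    D_V = np.where(score_y > cuty)[0]
    return D_U, D_V
```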
Appendix B. Proof preliminaries
The Appendix collects the proofs of all our theorems and lemmas. This section introduces some new notation and collects some facts that are used repeatedly in our proofs.
A. New Notations
Since the columns of , i.e., [], are orthogonal, we can extend them to an orthogonal basis of , which can also be expressed in the form [] since Σx is non-singular. Let us denote the matrix [u1,…,up] by , whose first r columns form the matrix U. Along the same lines, we can define , whose first q columns constitute the matrix V.
Suppose is a matrix. Recall the projection operator defined in (4). For any S ⊂ [p], we let denote the matrix . Similarly, for , we let AF be the matrix . For , we define the norms and . We will use the notation ∣A∣∞ to denote the quantity .
The Kullback Leibler (KL) divergence between two probability distributions P1 and P2 will be denoted by KL(P1 ∣ P2). For , we let denote greatest integer less than or equal to .
B. Facts on
First, note that since by (2) for all i ∈ [q], we have . Similarly, we can also show that . Second, we note that , and
| (28) |
because the largest element of Λ is not larger than one. Since the Xi's and Yi's are subgaussian, for any random vector v independent of X and Y, it follows that [45, Lemma 7]
| (29) |
with probability 1 – o(1) uniformly over . Also, we can show that satisfies
where the Cauchy-Schwarz inequality was used in the first step.
C. General Technical Facts
Fact 1. For two matrices and , we have
Fact 2 (Lemma 11 of [1]). Let be a matrix with i.i.d. standard normal entries, i.e., Zi,j ~ N(0, 1). Then for every t > 0,
As a consequence, there exists an absolute constant C > 0 such that
Recall that for , in Appendix B-A, we defined ∥A∥1,∞ and ∥A∥∞1 to be the matrix norms and , respectively.
The following fact is a Corollary to (29).
Fact 3. Suppose X and Y are jointly subgaussian. Then .
Fact 4 (Chi-square tail bound). Suppose . Then for any y > 5, we have
Proof of Fact 4. Since the Zl's are independent standard Gaussian random variables, tail bounds on chi-squared random variables yield (the form below is from Lemma 12 of [1])
Plugging in x = yk, we obtain that
which implies for y > 1,
which can be rewritten as
as long as y > 5. □
Appendix C. Proof of Theorem 1
For the sake of simplicity, we denote , , and by , , and , respectively. The reader should keep in mind that and are independent of and because they are constructed from a different sample. Next, using Condition 1, we can show that there exists so that
as n → ∞. Without loss of generality, we assume wi = 1 for all i ∈ [r]. The proof will be similar for general wi’s. Thus
| (30) |
Therefore for all i ∈ [r] with probability tending to one.
Now we will collect some facts which will be used during the proof. Because and are independent, (29) implies that
Using (30), we obtain that . Because , we have
| (31) |
Noting (28) implies , and that , using (31), we obtain that
| (32) |
with probability 1 – o(1).
Now we are ready to prove Theorem 1. We will denote the columns of by for i ∈ [r]. Because , it holds that
leading to
Handling the term T2 is the easiest because
with probability 1 – o(1) uniformly over , where we used (31) and the fact that . The difference in cases (A), (B), (C) arises only due to different bounds on T1(i, k) in these cases. We demonstrate the whole proof only for case (A). For the other two cases, we only discuss the analysis of T1(i, k) because the rest of the proof remains identical in these cases.
1). Case (A):
Since we have shown in (32) that , we calculate
with probability tending to one, uniformly over , where to get the last inequality, we also used the bound on in case (A).
Finally, for T3, we notice that
since Λ1 ≤ 1. Since (vj)k = Vkj, it is clear that T3(i, k) is identically zero if . Otherwise, the Cauchy-Schwarz inequality implies,
because are orthogonal. Thus
Now we will combine the above pieces together. Note that
| (33) |
For , denoting the i-th column of by we observe that,
| (34) |
with probability 1 – o(1) uniformly over . On the other hand, if k ∈ D(vi), then we have for all i ∈ [r],
which implies
Since and , we have
Thus, noting Vki = (vi)k, we obtain that
with probability 1 – o(1) uniformly over . Suppose . Note that
where θn > 2. Then with probability 1 – o(1) uniformly over ,
This, combined with (34), implies that setting leads to full support recovery with probability 1 – o(1). The proof of the first part follows.
2). Case (B):
In the Gaussian case, we resort to the hidden variable representation of X and Y due to [64], which enables a sharper bound on the term T1(i, k). Suppose Z ~ Nr(0, Ir), where r is the rank of Σxy. Consider Z1 ~ Np(0, Ip) and Z2 ~ Nq(0, Iq) independent of Z. Then X and Y can be represented as
| (35) |
where
and
Here is well defined because
where Λx is a p × p diagonal matrix whose first r elements are Λ1,…,Λr, and the rest are zero. Because Λ1 ≤ 1, we have
Similarly, we can show that
where Λy is the diagonal matrix whose first r elements are Λ1,…,Λr, and the rest are zero. It can be easily verified that
and
which ensures that the joint variance of (X, Y) is still the covariance matrix in (1). Also, some linear algebra leads to
| (36) |
Suppose we have n independent realizations of the pseudo-observations Z1, Z2, and Z. Denote by Z1, Z2, and Z the stacked data matrices with the i-th row given by (Z1)i, (Z2)i, and Zi, respectively, for i ∈ [n]. Here we use the term data matrix although we do not observe Z, Z1, and Z2 directly. Due to the representation in (35), the data matrices X and Y have the form
We can write the covariance matrix as
| (37) |
Therefore, for any vector and , we have
| (38) |
By the Bai-Yin law on the eigenvalues of Wishart matrices [67], there exists an absolute constant C > 0 so that for any t > 1,
which, combined with (36), implies
Now we will state a lemma which will be required to control the other terms on the right hand side of (38).
Lemma 3. Suppose and are independent Gaussian data matrices. Further suppose and are either deterministic or independent of both Z1 and Z2. Then there exists a constant C > 0 so that for any t > 1,
The proof of Lemma 3 follows directly by setting b = 1 in the following lemma, which is proved in Appendix H-D.
Lemma 4. Suppose and are independent standard Gaussian data matrices, and and are deterministic matrices with rank a and b, respectively. Let a ≤ b ≤ n. Then there exists an absolute constant C > 0 so that for any t ≥ 0, the following holds with probability at least :
Lemma 3, in conjunction with (36), implies that there exists an absolute constant C > 0 so that
with probability at least for all . Therefore, there exists C > 0 so that
| (39) |
for all . Note that
Now suppose and . By our assumption, with probability 1 – o(1) uniformly across . We also showed that . It is not hard to see that
| (40) |
with probability 1 – o(1) uniformly across . For T11, observe that (39) applies because and are independent of . Thus we can write that for any t > 1, there exists such that
Applying union bound, we obtain that for any ,
Since r < q and log q = o(n), setting , we obtain that
is o(1). Using (33) and (40), one can show that
in this case.
3). Case (C):
Note that when . Therefore, (33) implies in this case.
Appendix D. Proof of Theorem 2
Since the proof for U and V follows in a similar way, we will only consider the support recovery of U. The proof for both cases follows a common structure. Therefore, we will elaborate the common structure first. Since the model is fairly large, we will work with a smaller submodel. Specifically, we will consider a subclass of the single spike models, i.e., r = 1. Because we are concerned with only the support recovery of the left singular vectors, we fix in so that . We also fix ρ ∈ (0, 1) and consider the subset . Both ρ and will be chosen later. We restrict our attention to the submodel given by
where (41) is as follows:
| (41) |
That Σ is positive definite for ρ ∈ (0, 1) can be shown either using elementary linear algebra or using the hidden variable representation (35). During the proof of part (B), we will choose so that , which will ensure that as well.
Note that for , U corresponds to α, and hence D(U) = D(α). Therefore for the proof of both parts, it suffices to show that for any decoder of D(α),
| (42) |
In both of the proofs, our will be a finite set. Our goal is to choose so that is structurally rich enough to guarantee (42), yet lends itself to easy computations. The guidance for choosing comes from our main technical tool for this proof, which is Fano's inequality. We use the version of Fano's inequality in [53] (Fano's Lemma). Applied to our problem, this inequality yields
| (43) |
where denotes the product measure corresponding to n i.i.d. observations from . We also have the following result for product measures, . Moreover, when , with left singular vectors and , respectively,
where by Lemma 13, and
by Lemma 14. Noting , , and are unit vectors, we derive . Therefore, in our case, (43) reduces to
| (44) |
Thus, to ensure the right hand side of (44) is non-negligible, the key is to choose so that the α’s in are close in l2 norm, but is sufficiently large. Note that the above ensures that distinguishing the α’s in is difficult.
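For the reader's convenience, one commonly used form of Fano's inequality reads as follows; the precise version from [53] applied in (43) may differ in its constants.
\[
\inf_{\widehat{T}} \ \max_{1 \le j \le M} \mathbb{P}_j\big(\widehat{T} \neq j\big) \;\ge\; 1 - \frac{\max_{j \neq k} \mathrm{KL}\big(\mathbb{P}_j \,\|\, \mathbb{P}_k\big) + \log 2}{\log M},
\]
where M is the number of hypotheses, the infimum runs over all estimators (decoders) of the index j, and in our application the hypotheses index the elements of the finite class of α's while the \(\mathbb{P}_j\)'s are the corresponding n-fold product measures.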
A. Proof of part (A)
Note that our main job is to choose and ρ suitably. Let us denote
We generate a class of α's by replacing one of the in by 0, and one of the zeros in by . A typical α obtained this way looks like
Let be the class, which consists of , and all such resulting α’s. Note that , and , satisfy
Because p > sx > 1, we have
Therefore, (44) leads to
which is bounded below by 1/2 whenever
which follows if
because . To get the best bound on sx, we choose the value of ρ that minimizes ρ²/(1 − ρ²) for , that is, . Plugging in , the proof follows.
B. Proof of part (B)
Suppose each is of the following form
We fix z ∈ (0, 1), and hence is also fixed. We will choose the values of ρ and z later so that . Since z is fixed, such an α can be chosen in p − sx + 1 ways. Therefore . Also note that for α, , . Therefore (44) implies
| (45) |
| (46) |
which is greater than 1/2 whenever
which holds if
because 16 < p − sx. To get the best bound on z, we choose the value of ρ that maximizes (1 − ρ²)/ρ², that is, . Thus (42) is satisfied when , and corresponds to
Since the minimal signal strength Sigx for any equals min(z, b) ≤ z, we have , which completes the proof.
Appendix E. Proof of Theorem 3
We first introduce some notation and terminology required for the proof. For , and , we denote and . In the low-degree polynomial literature, when , the notation ∣w∣ is commonly used to denote the sum for the sake of simplicity. We follow this convention as well. Here the notation ∣·∣ should not be confused with the absolute value of real numbers. Also, for any function , , and t = (t1,…, tm), we denote
We will sometimes also use the shorthand notation to denote .
Our analysis relies on Hermite polynomials, which we discuss very briefly here. For a detailed account of Hermite polynomials, see Chapter V of [62]. The univariate Hermite polynomial of degree k will be denoted by hk. For k ≥ 0, the univariate Hermite polynomials are defined recursively as follows:
The normalized univariate Hermite polynomials are given by . The univariate Hermite polynomials form an orthogonal basis of L2(N(0, 1)). For , the m-variate Hermite polynomials are given by , where . The normalized version of Hw equals . The polynomials form an orthogonal basis of L2(Nm(0, Im)). We denote by the linear span of all n(p + q)-variate Hermite polynomials of degree at most Dn. Since is the projection of on , it then follows that
| (47) |
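The recursion and the normalization can be sanity-checked numerically. The sketch below uses the standard probabilists' three-term recursion h_{k+1}(x) = x h_k(x) − k h_{k−1}(x) with h_0 = 1 and h_1(x) = x, which we assume matches the recursion displayed above, and verifies the orthonormality of the normalized polynomials under N(0, 1) by Monte Carlo.

```python
import numpy as np
from math import factorial, sqrt

def hermite(k, x):
    """Probabilists' Hermite polynomial h_k(x) via the three-term recursion."""
    x = np.asarray(x, dtype=float)
    h_prev, h = np.ones_like(x), x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev
    return h

def normalized_hermite(k, x):
    """Normalized Hermite polynomial h_k / sqrt(k!)."""
    return hermite(k, x) / sqrt(factorial(k))

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)
print(np.mean(normalized_hermite(2, z) * normalized_hermite(2, z)))   # approximately 1
print(np.mean(normalized_hermite(2, z) * normalized_hermite(3, z)))   # approximately 0
```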
From now on, the degree-index vector w of or Hw will be assumed to lie in . We will partition w into n components, which gives w = (w1,…,wn), where for each i ∈ [n]. Clearly, i here corresponds to the i-th observation. We also separate each wi into two parts and so that . We will also denote , and . Note that and , but w ≠ (wx, wy) in general, although ∣w∣= ∣wx∣+∣wy∣.
Now we state the main lemmas which yield the value of . The first lemma, proved in Appendix H-C, gives the form of the inner products .
Lemma 5. Suppose w is as defined above and is as in (20). Then it holds that
Here the priors πx and πy are the Rademacher priors defined in (17).
Our next lemma uses Lemma 5 to give the form of . This lemma uses replicas of α and β. Suppose , and , are all independent Rademacher priors, where πx and πy are defined as in (17). We overload notation, and use to denote the expectation under , , , and .
Lemma 6. Suppose W is the indicator function of the event , . Then for any , equals
The proof of Lemma 6 is also deferred to Appendix H-C. We remark in passing that the negative binomial series expansion yields
| (48) |
whose Dn-th order truncation equals
Note that W is nonzero if and only if and , which, by Cauchy Schwarz inequality, implies
Thus when W = 1. Hence Lemma 6 can also be written as
Now we are ready to prove Theorem 3.
Proof of Theorem 3. Our first task is to get rid of W from the expression for in Lemma 6. However, we cannot directly bound W by one since the term may be negative for odd . We claim that if is odd. To see this, first we write
| (49) |
Note that , has the same distribution as
Notice from (17) that marginally, , and is independent of , and . Therefore,
Hence, conditional on and , is a symmetric random variable, and for any odd positive integer d. Since W is a binary random variable, Wd = W. Thus as well for an odd number , and the claim follows from (49). Therefore, from Lemma 6, it follows that
Observe that . Hence, . Also the summands in the last expression are non-negative. Therefore, using the fact that W ≤ 1, we obtain
| (50) |
Our next step is to simplify the above bound on . To that end, define the random variables for , and for . Denoting
we note that
Also, since and are symmetric, and vanishes for any . Then for any ,
by Fact 6. Since the odd moments of and vanish, the above equals
where we remind the readers that ∣D(z)∣ denotes the cardinality of the support of z for any vector z. The above implies
Plugging the above into (50) yields
where (a) follows since for a, . Let us denote and . By (21), , and
Therefore we have
Hence Lemma 4.5 of [36] implies that for any 11 ≤ d ≤ Dn,
For d ≤ 1, Theorem 5 of [68] gives
Also since , we have
leading to
Therefore is bounded by a constant multiple of
Since , it follows that . Note that the above sum converges if
or equivalently , which is satisfied for all since . Thus the proof follows. □
Appendix F. Proof of Theorem 4
We invoke the decomposition of in (37). But first, we will derive a simplified form for the matrices and in (37). Note that we can write as
Let us denote
| (51) |
Because is an orthogonal matrix, is a spectral decomposition, which leads to
Similarly, we can show that the matrix in (37) equals , where
Finally, the fact that and , in conjunction with (37), produces the following representation for :
Now recall the sets E, F, and G defined in (24) and (25) in Section III-D, and the decomposition of in (26). From (26) it follows that
Recall that for any matrix , and , we denote by the matrix . Then it is not hard to see that and , which leads to
| (52) |
Next, note that and . Therefore,
| (53) |
Finally, we note that S3 = (H1 + H2), where
| (54) |
and
Here the term S1 holds the information about . Its elements are not killed off by co-ordinate thresholding because it contains the Wishart matrix , which concentrates around Ir by the Bai-Yin law (cf. Theorem 4.7.1 of [42]). The only term that contributes to is S1. Lemma 7 entails that η(S1) concentrates around in operator norm. The proof of Lemma 7 is deferred to Appendix H-B.
Lemma 7. Suppose , . Then with probability 1 – o(1),
The entries of S2 and S3 are linear combinations of the entries of , and . Since Z, Z1, Z2 are independent, the entries of the latter matrices are of order O_p(n^{-1/2}), and, as we will see, they are killed off by the thresholding operator η. Our main work boils down to showing that thresholding kills off most terms of the noise matrices S2 and S3, making and small. To that end, we state some general lemmas, which are proved in Appendix H-B. That and are small follows as corollaries to these lemmas. Our next lemma provides a sharp concentration bound, which is our main tool in analyzing the difficult regime, i.e., the case.
Lemma 8. Suppose and are independent standard Gaussian data matrices. Let us also denote where and are fixed matrices so that and . Further suppose and . Let . Suppose K ≥ K0 is such that threshold level τ satisfies . Then there exists a constant C > 0 so that with probability 1 – o(1),
Our next lemma, which is also proved in Appendix H-B, handles the easier case when the threshold is exactly of the order . This thresholding, as we will see, is required in the easier sparsity regime, i.e., . Although Lemma 9 follows as a corollary to Lemma A.3 of [50], we include it here for the sake of completeness.
Lemma 9. Suppose Z1, Z2, M, N, and QM,N are as in Lemma 8, and . Further suppose , where C > 0 is an absolute constant. Let . Here the tuning parameter where C > 0 is a sufficiently large constant. Then with probability tending to one.
We will need another technical lemma for handling the terms S2 and S3.
Lemma 10. Suppose and . Then the following hold:
.
Note that satisfies
However, . Also because , it follows that . Moreover, since is orthogonal, . Part B of Lemma 10 then yields . Therefore
| (55) |
Similarly we can show that the matrix satisfies . Because by (53), that is small follows immediately from Lemma 8. Under the conditions of Lemma 8, we have
| (56) |
with high probability provided and . On the other hand, under the setup of Lemma 9, . Lemma 11, which we prove in Appendix H-B, entails that the same holds for S3.
Lemma 11. Consider the setup of Lemma 8. Suppose is such that . Then there exists a constant C > 0 so that with probability tending to one,
Under the setup of Lemma 9, on the other hand, with probability tending to one.
We will now combine all the above lemmas and finish the proof. First, we consider the regime when , so that there is thresholding, i.e., . We split this regime into two subregimes: and .
1). Regime
First, we explain why we needed to split the regime into two parts. Since sx, , Lemma 7 applies. Note that if , with , then Lemma 11 and (56) also apply. Therefore it follows that in this case
| (57) |
We will shortly show that under , setting ensures that the bound in (57) is small. However, for (57) to hold, needs to satisfy
which holds with if and only if
Since the above holds when
Therefore, setting is useful when we are in the regime . We will analyze the regime using a separate procedure.
In the case,
because , and similarly,
since we also assume . The above bounds entail that, in this regime, the first term on the bound in (57) is the leading term provided , i.e.,
with probability . Plugging in the value of Thr leads to
in the regime . In our case, . Also since by definition of , we also have , indicating
2). Regime
When , of course, the above line of argument may not work, although this is indeed an easier regime because is less than . In this regime, we set , where is a constant depending on as in Lemma 9. For this τ, we have shown that with probability tending to one. Lemma 11 implies the same holds for as well. Thus, from the decomposition of in (26), it follows that the asymptotic error occurs only due to the estimation of by . Using Lemma 7, we thus obtain
On the other hand, since , rearranging terms, we have
Thus, in the regime , we have
| (58) |
3). Regime :
It remains to analyze the case when either . In that case, there is no thresholding, i.e., Thr = 0. We will show that the assertions of Theorem 4 hold in this case as well. To that end, note that (26) implies
From the proof of Lemma 7 it follows that . For S2, we have shown that it is of the form where . On the other hand, we showed that , where the proof of Lemma 11 shows H1 and H2 are of the form M AN where and A is either (for H1) or (for H2). Therefore, it is not hard to see that is bounded by
For standard Gaussian matrices and , it holds that with probability (cf. Theorem 4.7.1 of [42]). Since , it follows that with probability . The above discussion leads to
because . If , the above bound is of the order . Thus Theorem 4 follows.
Appendix G. Proof of Corollary 2
Proof of Corollary 2. We will first show that there exist so that
| (59) |
For the sake of simplicity, we denote the matrix in Algorithm 2 by . Denoting , we note that
Also the matrix defined in Algorithm 2 and are the matrices corresponding to the leading r singular vectors of and , respectively. By Wedin’s sin-theta theorem (we use Theorem 4 of [52]), for any ,
where Λ0 is taken to be ∞, and
Since , and . Therefore, for , we have
We have to show . Theorem 4 gives a bound on , which can be made smaller than one if the in (23) is chosen to be sufficiently large. Hence, the above inequality holds. Because , using the fact , the last display implies
which, combined with Theorem 4, proves (59). Now note that the constant in (23) can be chosen large enough that the right hand side of (59) is smaller than . Since , it follows that Condition 1 is satisfied, and the rest of the proof then follows from Theorem 1.
Appendix H. Proof of Auxiliary Lemmas
A. Proof of Technical Lemmas for Theorem 2
The following lemma can be verified using elementary linear algebra, and hence its proof is omitted.
Lemma 12. Suppose Σ is of the form (41). Then the spectral decomposition of Σ is as follows:
where the eigenvectors are of the following form:
For , where forms an orthonormal basis system of the orthogonal space of α.
For , where forms an orthonormal basis system of the orthogonal space of β.
and .
Here, for , 0k denotes the k-dimensional vector all of whose entries are zero.
Lemma 13. Suppose Σ is as in (41). Then and
Proof of Lemma 13. Follows directly from Lemma 12.
Lemma 14. Suppose Σ1 and Σ2 are of the form (41) with singular vectors α1, β1, α2, and β2, respectively. Then
Proof. Lemma 13 can be used to obtain the form of , which implies equals
where . Since equals the sum of the two p × p and q × q diagonal submatrices, we obtain that
where we used the linearity of the trace operator, as well as the fact that . Noticing , the result follows.
B. Proof of Key Lemmas for Theorem 4
1). Proof of Lemma 7:
Proof of Lemma 7. Note that
We deal with the term T1 first. Recall from (26) that is a sparse matrix. In particular, each row and column of S1 can have at most sy and sx many nonzero elements, respectively. Now we make use of two elementary facts. First, for , , and second, for any matrix ,
The above results, combined with the row and column sparsity of S1, lead to
which is the first term in the bound of .
Now for T2, noting , observe that
It is easy to see that
Since , and are bounded in operator norm by . Also, and are orthonormal matrices. Therefore the operator norms of the matrices , , and Λ are bounded by one. On the other hand, by the Bai-Yin law on the eigenvalues of Wishart matrices (cf. Theorem 4.7.1 of [42]), with high probability. Since , clearly . Thus with high probability. Hence it suffices to show that the terms S12, S13, and S14 are small in operator norm, for which we will make use of Lemma 4. First let us consider the case of S12. Clearly,
We already mentioned that , and and are bounded by one. Therefore, it follows that
Now we apply Lemma 4 on the term with , and . Note that , , and are full rank matrices, i.e., they have rank q. Therefore, the rank of B equals rank of . Note that the rows of the matrix are linearly independent because the square matrix has full rank. Therefore, the rank of is , which is . Hence, the rank of is also . Also note that . Therefore Lemma 4 can be applied with and . Also, trivially follows. Using the same arguments which led to (55), on the other hand, we can show that by (26). Therefore Lemma 4 implies that for any t > 0, the following holds with probability at least :
which implies with high probability. Exchanging the role of X and Y in the above arguments, we can show that with high probability. For S14, we note that
We intend to apply Lemma 4 with and . Arguing along the lines of the proof for the term S12, we can show that A and B have rank and , respectively. Without loss of generality we assume , which yields , as required by Lemma 4. Otherwise, we can just take the transpose of C14, which leads to and , implying . Using (55), as before, we can show that the operator norms of A and B are bounded by . Therefore, Lemma 4 implies that for all ,
with probability at least . Hence, it follows that with probability ,
2). Proof of Lemma 8:
Without loss of generality, we will assume that . We will also assume, without loss of generality, that and . If that is not the case, we can add some zero rows to M and zero columns to N, which does not change their operator norms but ensures and . For any , let denote the unit sphere in . We denote an ϵ-net (with respect to the Euclidean norm) of any set by . When , there exists an ϵ-net of so that
By , we denote such an ϵ-net. Although may not be unique, uniqueness is not necessary for our purpose. For a subset , will denote an ϵ-net of the set . Note that each element of the latter set has at most many degrees of freedom, from which one can show that . The following fact on ϵ-nets will be very useful for us. The proof is standard and can be found, for example, in [42].
Fact 5. Let for p, . Then there exist and such that .
Letting , and using Fact 5, we obtain that
for any . Proceeding as in Proposition 15 of [1], we fix , and introduce the sets
| (60) |
and their complements and . The precise value of Jp,q will be chosen later. For any subset , , and vector , we denote by the projection of x onto A, which means and if , and zero otherwise. Let us denote the projections of x and y on , , , and , by , , , and , respectively. Note that this implies
as well as
The sets and have fewer elements than their complements. Therefore, we will treat these sets separately. To that end, we consider the splitting
| (61) |
The term T1 can be bounded by Lemma 15.
Lemma 15. Suppose M and N are as in Lemma 8 and where . Then for any , there exist absolute constants C, such that
We state another lemma which helps in controlling the terms T2 and T3.
Lemma 16. Suppose M, N, Z1, Z2, and An are as in Lemma 8. Let . Suppose is such that and moreover, . Let be either the set or the set . Then there exist absolute constants C, such that the following holds for any ;
Note that when , Lemma 16 yields a bound on T3. On the other hand, the case yields a bound on the term
| (62) |
While is not exactly equal to T2, interchanging the role of x and y in gives T2. Since the upper bound on given by Lemma 16 is symmetric in p and q, it is not hard to see that the same bound works for T2.
If we let , then . Combining the bounds on T1, T2, and T3, we conclude that the right hand side of (61) is if is larger than some constant multiple of
where . We will show that the first term dominates the second term. By our assumption on τ, , which implies , which combined with the fact , yields . On the other hand, under p > q, our assumption on n implies . Also because , it follows that is small, in particular
Therefore, for to be small,
suffices. In particular, we choose . Note that because , this choice of Jp,q ensures that , as required. The proof follows noting this choice of Jp,q also implies
3). Proof of Lemma 9:
Proof of Lemma 9. For any and ,
and are independent. In this case, there exist absolute constants δ, c and , so that (cf. Lemma A.3 of [50])
for all . Since , and , using union bound we obtain
Letting and , we observe that for our choice of τ, for all sufficiently large n since . Therefore, the above inequality leads to
Because and by our assumption on M and N, suffices. Hence the proof follows.
4). Proof of Lemma 11:
Proof of Lemma 11. From the definition of S3 in (26), and (54), it is not hard to see that . We will show that H1 is of the form where and . Then the first part would follow from Lemma 8, which, when applied to this case, would imply
provided , and . Since , the upper bound on Thr becomes . The proof for follows in a similar way and is hence skipped.
Letting
we note that (54) implies , which can be written as
We will now invoke Lemma 8 because is a Gaussian data matrix with n rows and columns, and the matrices A4 and A3 are also bounded in operator norm. To see the latter, first, noting , and we observe that
Therefore it suffices to bound the operator norms of A1, A2, and A3 only. Using (55), we can show that the operator norm of the matrices of the form A2, or A3 is bounded by for . Since has orthogonal columns, it can be easily seen that . Therefore
because as per the definition of . The proof of the first part now follows by Lemma 8. Because and , the proof of the second part follows directly from Lemma 9 and is hence skipped.
C. Proof of Additional Lemmas for Section III-C and Theorem 3
Proof of Lemma 2. To prove the current lemma, we will require a result on the concentration of and under and . To that end, for s, satisfying , let us define the set
Suppose and are the Rademacher priors on and as defined in Section III-C. The following lemma then says that and concentrate on and with probability tending to one.
Lemma 17. Suppose . Then
| (63) |
Here the probability depends on n through and q. Similarly depends on n through and q.
Recall the definition of from (19). Let us consider the class
If and , then because . Therefore (19) implies that has canonical correlation . Thus , implying
Suppose and are the Borel σ-field associated with and , respectively. Define the probability measures and on and , respectively, by
and
Note also that if and , then . Therefore
whose denominator is one by Lemma 17. Denoting , we note that
by Lemma 17. Similarly, denoting , we can show that
Therefore, it holds that
Thus the proof follows.
Proof of Lemma 17:
Proof of Lemma 17. We are going to show (63) only for because the proof for follows in an identical manner. Throughout, we will denote by and the expectation and variance under . Note that when , , where the 's are i.i.d. Bernoulli random variables with success probability . Therefore, Chebyshev's inequality yields that for any ,
which goes to zero if . Therefore, for , we have
Also, since , Chebyshev’s inequality implies that
which goes to zero if for any fixed . Here (a) uses the fact that ’s are i.i.d. The proof now follows setting .
1). Proof of Lemma 5:
The proof of Lemma 5 depends on two auxiliary lemmas. We state and prove these lemmas first.
Lemma 18. Suppose , and is a matrix. Let be the measure induced by the m-dimensional standard Gaussian random vector and denote by the corresponding expectation. Then for any we have
Proof of Lemma 18. The generating function of Hw has the convergent expansion [69, Proposition 6]
for any . Therefore,
Multiplying both sides by the density of and then integrating over gives us
Lemma 19. Let be as defined in (18). Suppose where and . Then for any , we have
Proof of Lemma 19. Let us partition t as where and . We then calculate
which implies
which equals
In step (a), we stacked the variables and to form . Note that, following the terminology set at the beginning of Appendix E, and . Note that if , then the term has zero coefficient in the above expansion. Thus the lemma follows.
Proof of Lemma 5.
where (a) follows because ’s are independent observations. Now note that if , then (19) implies
where the last step follows because for any . If , then defined in (18) is positive definite, and (19) implies
by Lemma 18. Here is as in (18), and is positive definite because , as discussed in Section III-C. Therefore, we can write
Lemma 19 gives the form of the partial derivative in the above expression, and implies that the partial derivative is zero unless . Therefore, only if for all . In this case, is even, and by Lemma 19,
Therefore,
2). Proof of Lemma 6:
Proof. Lemma 5 implies that belongs to the subspace generated by those ’s whose degree-index ω has for all . The degree of the polynomial is , which is even in the above case. Therefore, if is odd, equals . Hence, it suffices to compute the norm of , where . Suppose is such that for all . Lemma 5 gives
Consider the pair of replicas and . Letting W denote the indicator function of the event , we can then write
| (64) |
Denote by . Using (64), we obtain the following expression:
where
Therefore equals
In the last step, we used the variables , and . Suppose for each . For any and , it holds that
where (a) follows from Fact 6.
Fact 6. [Multinomial Theorem] Suppose . Then for ,
Therefore it follows that
which implies
where (a) follows since the number of such that equals . Noting , the proof follows.
D. Proof of Technical Lemmas for Theorem 4
First, we introduce some additional notations and state some useful results that will be used repeatedly throughout the proof. Suppose . We can write A as
We define the vectorization operator as
We will use two well-known identities for the vectorization operator, which follow from Section 10.2.2 of [70].
Fact 7. A. .
B. where denotes the Kronecker product.
Often times we will also use the fact that [71, Theorem 13.12]
| (65) |
Define the Hadamard product between vectors and by
Note that Cauchy-Schwarz inequality implies that
| (66) |
We will also often use Fact 1, which states .
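The vectorization identity in Fact 7B is easy to verify numerically. The vec convention used above is not reproduced, so the check below assumes column-stacking, under which vec(AXB) = (Bᵀ ⊗ A) vec(X).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.flatten(order="F")       # column-stacking vectorization
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)             # (B^T kron A) vec(X)
print(np.allclose(lhs, rhs))               # True
```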
1). Proof of Lemma 10:
Proof. The first result is immediate. For the second result, denote by the projection of y on . Note that for any and .
Thus the maximum singular value of D(A) is smaller than that of A, indicating that
2). Proof of Lemma 4:
First, we state and prove two facts, which are used in the proof of Lemma 4.
Fact 8. Suppose , are potentially random matrices satisfying and . Let be such that , and is distributed as a standard Gaussian data matrix. Then the matrix is distributed as a standard Gaussian data matrix.
Proof of Fact 8. is a Gaussian data matrix with covariance if and only if
| (67) |
Now
where (a) follows from Fact 7B. However, since , (67) implies
but
Therefore,
Then the result follows from (67).
In the above fact, it may appear that is independent of matrices A and B since its conditional distribution is standard Gaussian. However, still depends on A and B through r and s, which may be random quantities.
Fact 9. Suppose , , are such that conditional on A and B, X is distributed as a standard Gaussian data matrix. Further suppose that the rank of A and B are a and b, respectively. Then the following assertion holds:
where is distributed as a standard Gaussian data matrix in .
Proof of Fact 9. Suppose and are the projection matrices onto the column spaces of A and B, respectively. Then we can write and , where and are matrices with full column rank so that and . Writing and , we obtain that
which is bounded by
That and are one follows from the definitions of and . Fact 8 implies conditional on and , is distributed as a standard Gaussian data matrix. Hence, the proof follows.
Proof of Lemma 4. Let us denote the rank of by . Note that . Letting , and applying Fact 9, we have the bound
where is distributed as a standard Gaussian data matrix in . Next we apply Fact 9 again, but now on the term , which leads to
where is a standard Gaussian data matrix. Therefore,
We use the Gaussian matrix concentration inequality in Fact 2 to show that with probability at least . Also, for , the first part of Fact 2 implies
for any . Since , and is deterministic, the above implies
Hence, for any , we have the following with probability at least :
Since , it follows that
Therefore, the proof follows.
3). Proof of Lemma 15:
Proof of Lemma 15. Denoting
we note that
Therefore it suffices to show that there exist absolute constants C, such that
Let us denote , , and . Thus
Recalling , we define
| (68) |
To obtain a tight concentration inequality for , we want to use the following Gaussian concentration lemma due to [1]
Lemma 20 (Corollary 10 of [1]). Let be a vector of n i.i.d. standard Gaussian variables. Suppose is a finite set and we have functions for every . Assume is a Borel set such that for Lebesgue-almost every :
Then, there exists an absolute constant so that for any ,
Here is an independent copy of .
In our case, the index b corresponds to (x, y), the set corresponds to , and the function corresponds to . To find the centering and the Lipschitz constant , we need to compute and , respectively.
First, note that since Z1 and Z2 are independent standard Gaussian data matrices, . Noting for any symmetric random variable X, we deduce
Using Lemma 21 we obtain that
and
where
Because for each ,
since . Also, because equals , we have
| (69) |
and similarly,
| (70) |
Therefore,
Letting denote , we note that the above two inequalities imply
Because , , we have
| (71) |
We choose a good set where the above bound is small. To that end, we take to be
| (72) |
Let us denote and . To apply Lemma 21, now we define the process
Equation 71 implies that on ,
We are now in a position to apply Lemma 21, which yields that
| (73) |
From equation (79) of [1] it follows that C can be chosen large enough that
Thus, after plugging in the value of , the first term on the right hand side of (73) can be bounded above by
To bound the second term in (73), notice that Lemma 22 yields the bound
whereas Fact 2 leads to the bound
| (74) |
Therefore the proof follows.
Lemma 21. Suppose is as defined in (68) and . Then
where , , and .
Proof. Using , and the fact that , we calculate that
Fact 7 implies
| (75) |
which yields . Noting , we can hence write as
Let us denote by the derivative of evaluated at . For , we denote by the matrix whose (i, j)-th entry equals . Then we obtain that for ,
indicating that
where denotes the Hadamard product. It follows that
Then the first part of the proof follows from (75). The proof of the second part follows similarly and is hence skipped.
Writing , we have
which equals
Fact 7 implies that the above equals
where . Thus, similarly we can show that
Therefore, the proof follows.
Lemma 22. There exists an absolute constant C so that the function defined in (68) satisfies
Proof. As usual, we let . Since , we have
Here (a) follows because the operator norm is smaller than the Frobenius norm, (b) follows because , and (c) follows from Fact 1. Since Z1 and Z2 are independent,
Now note that since Z1 and Z2 are standard Gaussian data matrices,
for some absolute constants and . We can choose C large enough that . Similarly, we can show that
implying
for sufficiently large C.
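Step (a) in the proof of Lemma 22 uses the elementary comparison between the operator and Frobenius norms: for any real matrix $A$ of rank $r$,
\[
\|A\|_{\mathrm{op}} \;\le\; \|A\|_{F} \;\le\; \sqrt{r}\, \|A\|_{\mathrm{op}}.
\]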
4). Proof of Lemma 16:
Proof. The framework is the same as in the proof of Lemma 15. Define where
Let , , , and be as in Lemma 15. In this case, the main difference from Lemma 15 is that is much larger. Eventually we will arrive at (73) using the concentration inequality in Lemma 20, but a large makes the right hand side of the inequality in (73) much larger. Therefore, we require a tighter bound on , which is the bound on the Lipschitz constant of on the good set, so that the concentration inequality in (73) remains useful. To bound the Lipschitz constant, as before, we bound using Lemma 21, which implies that
where and . From (69) it follows that
| (76) |
In Lemma 15, we bounded by , which was later bounded by 1. We require a tighter bound on this time. Note that at all for any directional derivative of η. Noting for , we deduce that any and satisfy
which is not greater than
because for .
Thus, it follows that
Similarly, we can show that
Thus,
We want to define the good set of such that
satisfies both and
We claim that the above holds if defined in (72), and for all ,
| (77) |
The above claim follows from (89) and (90) of [1]. Therefore we define the good set to be the subset of where (77) is satisfied. Defining and , we obtain that for some absolute constant , it holds that
provided . Similar to the proof of Lemma 15, using Lemma 21, we obtain that there exists an absolute constant C so that
| (78) |
Now since , and for any , the ϵ-net is chosen so as to satisfy , we have . Therefore, we conclude that the first term of the bound in (78) is not larger than
The rest of the proof is devoted to bounding the second term of the bound in (78). The expectation term can be bounded easily using Lemma 22, which yields
We will now show that is small. Note that by definition, , where is the set of that satisfies the system of equations in (77). Notice that by (74), we already have for some . Thus it suffices to show that is small. To this end, note that since , , , are independent, (77) implies
Defining the set , we bound the above probability as follows:
| (79) |
Now note that , or . Therefore, there exists a universal constant so that
| (80) |
where the last bound is due to the Chi-square tail bound in Fact 4 (see also Lemma 1 of [72] and Lemma 12 of [1]). Therefore, it only remains to bound the first term in (79). We begin with an expansion of as follows
Since and are independent, conditioned on is still a standard Gaussian data matrix. Hence, for , conditional on , ’s are independent random variables. As a result, for each and , can be written as , where , and . Noting for every , we derive the following bound provided :
Defining
| (81) |
we notice that the above calculations imply that, conditional on ,
Therefore,
| (82) |
which is bounded by by Lemma 23. Therefore, (79), (80), and (82) jointly imply that
Therefore is bounded by
which completes the proof.
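The chi-square tail bound invoked in (80) via Fact 4 is of the type given by Lemma 1 of [72]; in its standard form (the constants used in Fact 4 may differ), it states that for $X \sim \chi^2_k$ and any $x > 0$,
\[
\mathbb{P}\big( X \ge k + 2\sqrt{kx} + 2x \big) \;\le\; e^{-x}
\qquad \text{and} \qquad
\mathbb{P}\big( X \le k - 2\sqrt{kx} \big) \;\le\; e^{-x}.
\]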
Lemma 23. Suppose and . Further suppose are independent standard Gaussian random variables. Then the function f defined in (81) satisfies
Proof of Lemma 23. Note that is a sum of dependent Bernoulli random variables. Therefore, the traditional Chernoff or Hoeffding bounds for independent Bernoulli random variables do not apply. We use a generalized version of Chernoff's inequality, originally due to [65] (also discussed by [73], [74], among others), which applies to weakly dependent Bernoulli random variables.
Lemma 24 ([65]). Let be Bernoulli random variables and . Suppose there exists such that for any , the following assertion holds:
| (83) |
For , we denote
Then we have
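For orientation, Lemma 24 generalizes, to weakly dependent indicators, the classical Chernoff–Hoeffding bound for independent Bernoulli random variables (see, e.g., [74]), which in its relative-entropy form states that for i.i.d. $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$ and any $a \in (p, 1)$,
\[
\mathbb{P}\Big( \tfrac{1}{n} \sum_{i=1}^{n} X_i \ge a \Big) \;\le\; \exp\big( -n\, \mathrm{KL}(a \,\|\, p) \big),
\qquad
\mathrm{KL}(a \,\|\, p) = a \log\tfrac{a}{p} + (1 - a) \log\tfrac{1 - a}{1 - p}.
\]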
Note that if we take and , then Lemma 24 can be applied to bound provided (83) holds, which will be referred to as the weak dependence condition from now on. Suppose . For the sake of simplicity, we take . The arguments that follow hold for any other choice of as well, as long as . Denote by the submatrix of M containing only the first k rows of M. Let us denote . Letting , we observe that for our choice of ’s, equals
The operator norm equals , which is bounded by by Lemma 10B. Therefore, the right hand side of the last display is bounded by . By Chi-square tail bounds (see for instance Fact 4), the latter probability is bounded above by for all . Since , note that suffices. For such τ, we have thus shown that
Thus our , which is less than because . Hence our (δ, ϵ) pair satisfies the weak dependence condition, and by Lemma 24 it follows that
We now use the lower bound on . Because , it follows that , which is greater than , or equivalently . Therefore, the lemma follows.
Footnotes
In this paper, by support recovery, we refer to the exact recovery of the combined support of the ’s (or the ’s) corresponding to nonzero ’s.
e.g., , , and will stand for the empirical estimators created from the jth equal split of the data.
here and later, we will use s to generically denote the sparsity of relevant parameter vectors in parallel problems like sparse PCA or sparse linear regression.
Contributor Information
Nilanjana Laha, Department of Statistics, Texas A&M University, College Station, TX 77843.
Rajarshi Mukherjee, Department of Biostatistics, Harvard T. H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115.
References
- [1].Deshpande Y and Montanari A, “Sparse pca via covariance thresholding,” Journal of Machine Learning Research, vol. 17, no. 141, pp. 1–14, 2016. [Google Scholar]
- [2].Kunisky D, Wein AS, and Bandeira AS, “Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio,” in Mathematical Analysis, its Applications and Computation, 2022, pp. 1–50, part of the Springer Proceedings in Mathematics and Statistics book series. [Google Scholar]
- [3].Hardoon DR, Szedmak S, and Shawe-Taylor J, “Canonical correlation analysis: An overview with application to learning methods,” Neural computation, vol. 16, no. 12, pp. 2639–2664, 2004. [DOI] [PubMed] [Google Scholar]
- [4].Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet GR, Levy R, and Vasconcelos N, “A new approach to cross-modal multimedia retrieval,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 251–260. [Google Scholar]
- [5].Gong Y, Ke Q, Isard M, and Lazebnik S, “A multi-view embedding space for modeling internet images, tags, and their semantics,” International journal of computer vision, vol. 106, no. 2, pp. 210–233, 2014. [Google Scholar]
- [6].Bin G, Gao X, Yan Z, Hong B, and Gao S, “An online multi-channel ssvep-based brain–computer interface using a canonical correlation analysis method,” Journal of neural engineering, vol. 6, no. 4, p. 046002, 2009. [DOI] [PubMed] [Google Scholar]
- [7].Avants BB, Cook PA, Ungar L, Gee JC, and Grossman M, “Dementia induces correlated reductions in white matter integrity and cortical thickness: a multivariate neuroimaging study with sparse canonical correlation analysis,” Neuroimage, vol. 50, no. 3, pp. 1004–1016, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Witten DM, Tibshirani R, and Hastie T, “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis,” Biostatistics, vol. 10, no. 3, pp. 515–534, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Bagozzi RP, “Measurement and meaning in information systems and organizational research: Methodological and philosophical foundations,” Management Information Systems quarterly, vol. 35, no. 2, pp. 261–292, 2011. [Google Scholar]
- [10].Dhillon P, Foster DP, and Ungar L, “Multi-view learning of word embeddings via CCA,” in Proceedings of the 24th International Conference on Neural Information Processing Systems, vol. 24, 2011, pp. 199–207. [Google Scholar]
- [11].Faruqui M and Dyer C, “Improving vector space word representations using multilingual correlation,” in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014, pp. 462–471. [Google Scholar]
- [12].Friman O, Borga M, Lundberg P, and Knutsson H, “Adaptive analysis of fmri data,” NeuroImage, vol. 19, no. 3, pp. 837–845, 2003. [DOI] [PubMed] [Google Scholar]
- [13].Kim T-K, Wong S-F, and Cipolla R, “Tensor canonical correlation analysis for action classification,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8. [Google Scholar]
- [14].Arora R and Livescu K, “Multi-view cca-based acoustic features for phonetic recognition across speakers and domains,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7135–7139. [Google Scholar]
- [15].Wang W, Arora R, Livescu K, and Bilmes J, “On deep multi-view representation learning,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning, vol. 37, 2015, pp. 1083–1092. [Google Scholar]
- [16].Anderson TW, An Introduction to Multivariate Statistical Analysis, ser. Wiley Series in Probability and Statistics. Wiley, 2003. [Google Scholar]
- [17].Lê Cao K-A, Martin PG, Robert-Granié C, and Besse P, “Sparse canonical methods for biological data integration: application to a cross-platform study,” BMC bioinformatics, vol. 10, no. 1, pp. 1–17, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [18].Lee W, Lee D, Lee Y, and Pawitan Y, “Sparse canonical covariance analysis for high-throughput data,” Statistical Applications in Genetics and Molecular Biology, vol. 10, no. 1, pp. 1–24, 2011. [Google Scholar]
- [19].Waaijenborg S, de Witt Hamer PCV, and Zwinderman AH, “Quantifying the association between gene expressions and dna-markers by penalized canonical correlation analysis,” Statistical applications in genetics and molecular biology, vol. 7, no. 1, 2008. [DOI] [PubMed] [Google Scholar]
- [20].Anderson TW, “Asymptotic theory for canonical correlation analysis,” Journal of Multivariate Analysis, vol. 70, no. 1, pp. 1–29, 1999. [Google Scholar]
- [21].Cai TT and Zhang A, “Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics,” The Annals of Statistics, vol. 46, no. 1, pp. 60 – 89, 2018. [Google Scholar]
- [22].Ma Z, Li X et al. , “Subspace perspective on canonical correlation analysis: Dimension reduction and minimax rates,” Bernoulli, vol. 26, no. 1, pp. 432–470, 2020. [Google Scholar]
- [23].Bao Z, Hu J, Pan G, and Zhou W, “Canonical correlation coefficients of high-dimensional gaussian vectors: Finite rank case,” The Annals of Statistics, vol. 47, no. 1, pp. 612–640, February 2019. [Google Scholar]
- [24].Mai Q and Zhang X, “An iterative penalized least squares approach to sparse canonical correlation analysis,” Biometrics, vol. 75, no. 3, pp. 734–744, 2019. [DOI] [PubMed] [Google Scholar]
- [25].Solari OS, Brown JB, and Bickel PJ, “Sparse canonical correlation analysis via concave minimization,” arXiv preprint arXiv:1909.07947, 2019. [Google Scholar]
- [26].Chen M, Gao C, Ren Z, and Zhou HH, “Sparse cca via precision adjusted iterative thresholding,” arXiv preprint arXiv:1311.6186, 2013. [Google Scholar]
- [27].Gao C, Ma Z, Ren Z, Zhou HH et al. , “Minimax estimation in sparse canonical correlation analysis,” The Annals of Statistics, vol. 43, no. 5, pp. 2168–2197, 2015. [Google Scholar]
- [28].Gao C, Ma Z, Zhou HH et al. , “Sparse cca: Adaptive estimation and computational barriers,” The Annals of Statistics, vol. 45, no. 5, pp. 2074–2101, 2017. [Google Scholar]
- [29].Wainwright MJ, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting,” IEEE Transactions on Information Theory, vol. 55, no. 12, pp. 5728–5741, 2009. [Google Scholar]
- [30].Amini AA and Wainwright MJ, “High-dimensional analysis of semidefinite relaxations for sparse principal components,” The Annals of Statistics, vol. 37, no. 5B, pp. 2877 – 2921, 2009. [Google Scholar]
- [31].Butucea C, Ingster YI, and Suslina IA, “Sharp variable selection of a sparse submatrix in a high-dimensional noisy matrix,” ESAIM: Probability and Statistics, vol. 19, pp. 115–134, 2015. [Google Scholar]
- [32].Butucea C and Stepanova N, “Adaptive variable selection in nonparametric sparse additive models,” Electronic Journal of Statistics, vol. 11, no. 1, pp. 2321–2357, 2017. [Google Scholar]
- [33].Meinshausen N and Bühlmann P, “Stability selection,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 4, pp. 417–473, 2010. [Google Scholar]
- [34].Johnstone IM and Lu AY, “On consistency and sparsity for principal components analysis in high dimensions,” Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [35].Krauthgamer R, Nadler B, Vilenchik D et al. , “Do semidefinite relaxations solve sparse pca up to the information limit?” The Annals of Statistics, vol. 43, no. 3, pp. 1300–1322, 2015. [Google Scholar]
- [36].Ding Y, Kunisky D, Wein AS, and Bandeira AS, “Subexponential-time algorithms for sparse pca,” arXiv preprint arXiv:1907.11635, 2019. [Google Scholar]
- [37].Arous GB, Wein AS, and Zadik I, “Free energy wells and overlap gap property in sparse pca,” in Proceedings of the 33rd Annual Conference on Learning Theory, vol. 125, 2020, pp. 479–482. [Google Scholar]
- [38].Laha N and Mukherjee R, “Support.CCA,” https://github.com/nilanjanalaha/Support.CCA, 2021. [Google Scholar]
- [39].Hopkins SB and Steurer D, “Efficient bayesian estimation from few samples: community detection and related problems,” in IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 2017, pp. 379–390. [Google Scholar]
- [40].Hopkins S, “Statistical inference and the sum of squares method,” Ph.D. dissertation, Cornell University, 2018. [Google Scholar]
- [41].Ma Z and Yang F, “Sample canonical correlation coefficients of high-dimensional random vectors with finite rank correlations,” arXiv preprint arXiv:2102.03297, 2021. [Google Scholar]
- [42].Vershynin R, High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018. [Google Scholar]
- [43].Laha N, Huey N, Coull B, and Mukherjee R, “On statistical inference with high dimensional sparse cca,” arXiv preprint arXiv:2109.11997, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [44].Johnstone IM, “On the distribution of the largest eigenvalue in principal components analysis,” The Annals of statistics, vol. 29, no. 2, pp. 295–327, 2001. [Google Scholar]
- [45].Janková J and van de Geer S, “De-biased sparse pca: Inference and testing for eigenstructure of large covariance matrices,” IEEE Transactions on Information Theory, vol. 67, no. 4, pp. 2507–2527, 2021. [Google Scholar]
- [46].Meloun M and Militky J, Statistical data analysis: A practical guide. Woodhead Publishing Limited, 2011. [Google Scholar]
- [47].Dutta D, Sen A, and Satagopan J, “Sparse canonical correlation to identify breast cancer related genes regulated by copy number aberrations,” medRxiv, 2022. [Online]. Available: https://www.medrxiv.org/content/early/2022/05/09/2021.08.29.21262811 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [48].Lock EF, Hoadley KA, Marron JS, and Nobel AB, “Joint and individual variation explained (jive) for integrated analysis of multiple data types,” The annals of applied statistics, vol. 7, no. 1, p. 523, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [49].van de Geer S, Bühlmann P, Ritov Y, and Dezeure R, “On asymptotically optimal confidence regions and tests for high-dimensional models,” The Annals of Statistics, vol. 42, no. 3, pp. 1166–1202, 2014. [Google Scholar]
- [50].Bickel PJ and Levina E, “Covariance regularization by thresholding,” The Annals of Statistics, vol. 36, no. 6, pp. 2577–2604, 2008. [Google Scholar]
- [51].Cai TT, Liu W, and Luo X, “A constrained minimization approach to sparse precision matrix estimation,” Journal of the American Statistical Association, vol. 106, no. 494, pp. 594–607, 2011. [Google Scholar]
- [52].Yu Y, Wang T, and Samworth RJ, “A useful variant of the davis–kahan theorem for statisticians,” Biometrika, vol. 102, no. 2, pp. 315–323, 2015. [Google Scholar]
- [53].Yatracos YG, “A lower bound on the error in nonparametric regression type problems,” The Annals of Statistics, vol. 16, no. 3, pp. 1180–1187, 1988. [Google Scholar]
- [54].Berthet Q and Rigollet P, “Complexity theoretic lower bounds for sparse principal component detection,” in Proceedings of the 26th Annual Conference on Learning Theory, vol. 30, 2013, pp. 1046–1066. [Google Scholar]
- [55].Brennan M, Bresler G, and Huleihel W, “Reducibility and computational lower bounds for problems with planted sparse structure,” in Proceedings of the 31st Conference On Learning Theory, vol. 75, 2018, pp. 48–166. [Google Scholar]
- [56].Kearns M, “Efficient noise-tolerant learning from statistical queries,” Journal of the ACM (JACM), vol. 45, no. 6, pp. 983–1006, 1998. [Google Scholar]
- [57].Feldman V and Kanade V, “Computational bounds on statistical query learning,” in Proceedings of the 25th Annual Conference on Learning Theory, vol. 23, 2012, pp. 16.1–16.22. [Google Scholar]
- [58].Brennan MS, Bresler G, Hopkins S, Li J, and Schramm T, “Statistical query algorithms and low degree tests are almost equivalent,” in Proceedings of Thirty Fourth Conference on Learning Theory, vol. 134, 2021, pp. 774–774. [Google Scholar]
- [59].Dudeja R and Hsu D, “Statistical query lower bounds for tensor pca,” Journal of Machine Learning Research, vol. 22, no. 83, pp. 1–51, 2021. [Google Scholar]
- [60].Gamarnik D and Zadik I, “High dimensional regression with binary coefficients. Estimating squared error and a phase transition,” in Proceedings of the Conference on Learning Theory, vol. 65, 2017, pp. 948–953. [Google Scholar]
- [61].Gamarnik D, Jagannath A, and Sen S, “The overlap gap property in principal submatrix recovery,” Probability Theory and Related Fields, vol. 181, pp. 757–814, 2021. [Google Scholar]
- [62].Szegö G, Orthogonal polynomials, ser. American Mathematical Society colloquium publications. American Mathematical Society, 1939. [Google Scholar]
- [63].Cai TT, Zhou HH et al. , “Optimal rates of convergence for sparse covariance matrix estimation,” The Annals of Statistics, vol. 40, no. 5, pp. 2389–2420, 2012. [Google Scholar]
- [64].Bach FR and Jordan MI, “A probabilistic interpretation of canonical correlation analysis,” Tech. Rep, 2005. [Online]. Available: https://www.di.ens.fr/~fbach/probacca.pdf [Google Scholar]
- [65].Panconesi A and Srinivasan A, “Randomized distributed edge coloring via an extension of the chernoff–hoeffding bounds,” SIAM Journal on Computing, vol. 26, no. 2, pp. 350–368, 1997. [Google Scholar]
- [66].Wang S, Fan J, Pocock G, Arena ET, Eliceiri KW, and Yuan M, “Structured correlation detection with application to colocalization analysis in dual-channel fluorescence microscopic imaging,” Statistica Sinica, vol. 31, no. 1, pp. 333–360, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [67].Bai Z-D and Yin Y-Q, “Limit of the Smallest Eigenvalue of a Large Dimensional Sample Covariance Matrix,” The Annals of Probability, vol. 21, no. 3, pp. 1275 – 1294, 1993. [Google Scholar]
- [68].Sandór J and Debnath L, “On certain inequalities involving the constant e and their applications,” Journal of mathematical analysis and applications, vol. 249, no. 2, pp. 569–582, 2000. [Google Scholar]
- [69].Rahman S, “Wiener–hermite polynomial expansion for multivariate gaussian probability measures,” Journal of Mathematical Analysis and Applications, vol. 454, no. 1, pp. 303–334, 2017. [Google Scholar]
- [70].Petersen KB and Pedersen MS, “The matrix cookbook,” 2015, version: Nov 12, 2015. [Online]. Available: http://www2.compute.dtu.dk/pubdb/pubs/3274-full.html [Google Scholar]
- [71].Laub A, Matrix Analysis for Scientists and Engineers. Society for Industrial and Applied Mathematics, 2005. [Google Scholar]
- [72].Laurent B and Massart P, “Adaptive estimation of a quadratic functional by model selection,” The Annals of Statistics, vol. 28, no. 5, pp. 1302–1338, 2000. [Google Scholar]
- [73].Pelekis C and Ramon J, “Hoeffding’s inequality for sums of weakly dependent random variables,” Mediterranean Journal of Mathematics, vol. 14, no. 243, 2017. [Google Scholar]
- [74].Linial N and Luria Z, “Chernoff’s inequality-a very elementary proof,” arXiv preprint arXiv:1403.7739, 2014. [Google Scholar]