Author manuscript; available in PMC: 2014 Oct 14.
Published in final edited form as: Ann Stat. 2013 Jun;41(3):1055–1084. doi: 10.1214/12-AOS1014

MINIMAX BOUNDS FOR SPARSE PCA WITH NOISY HIGH-DIMENSIONAL DATA

Aharon Birnbaum 1, Iain M Johnstone 2, Boaz Nadler 3, Debashis Paul 4
PMCID: PMC4196701  NIHMSID: NIHMS632099  PMID: 25324581

Abstract

We study the problem of estimating the leading eigenvectors of a high-dimensional population covariance matrix based on independent Gaussian observations. We establish a lower bound on the minimax risk of estimators under the l2 loss, in the joint limit as dimension and sample size increase to infinity, under various models of sparsity for the population eigenvectors. The lower bound on the risk points to the existence of different regimes of sparsity of the eigenvectors. We also propose a new method for estimating the eigenvectors by a two-stage coordinate selection scheme.

Keywords: Minimax risk, high-dimensional data, principal component analysis, sparsity, spiked covariance model

1. Introduction

Principal components analysis (PCA) is widely used to reduce dimensionality of multivariate data. A traditional setting involves repeated observations from a multivariate normal distribution. Two key theoretical questions are: (i) what is the relation between the sample and population eigenvectors, and (ii) how well can population eigenvectors be estimated under various sparsity assumptions? When the dimension N of the observations is fixed and the sample size n → ∞, the asymptotic properties of the sample eigenvalues and eigenvectors are well known [Anderson (1963), Muirhead (1982)]. This asymptotic analysis works because the sample covariance approximates the population covariance well when the sample size is large. However, it is increasingly common to encounter statistical problems where the dimensionality N is comparable to, or larger than, the sample size n. In such cases, the sample covariance matrix, in general, is not a reliable estimate of its population counterpart.

Better estimators of large covariance matrices, under various models of sparsity, have been studied recently. These include development of banding and thresholding schemes [Bickel and Levina (2008a, 2008b), Cai and Liu (2011), El Karoui (2008), Rothman, Levina and Zhu (2009)], and analysis of their rate of convergence in the spectral norm. More recently, Cai, Zhang and Zhou (2010) and Cai and Zhou (2012) established the minimax rate of convergence for estimation of the covariance matrix under the matrix l1 norm and the spectral norm, and its dependence on the assumed sparsity level.

In this paper we consider a related but different problem, namely, the estimation of the leading eigenvectors of the covariance matrix. We formulate this eigenvector estimation problem under the well-studied “spiked population model” which assumes that the ordered set of eigenvalues ℒ(Σ) of the population covariance matrix Σ satisfies

$$\mathcal{L}(\Sigma) = \{\lambda_1 + \sigma^2, \ldots, \lambda_M + \sigma^2, \sigma^2, \ldots, \sigma^2\} \tag{1.1}$$

for some M ≥ 1, where σ2 > 0 and λ1 > λ2 > ⋯ > λM > 0. This is a standard model in several scientific fields, including, for example, array signal processing [see, e.g., van Trees (2002)] where the observations are modeled as the sum of an M-dimensional random signal and an independent, isotropic noise. It also arises as a latent variable model for multivariate data, for example, in factor analysis [Jolliffe (2002), Tipping and Bishop (1999)]. The assumption that the leading M eigenvalues are distinct is made to simplify the analysis, as it ensures that the corresponding eigenvectors are identifiable up to a sign change. The assumption that all remaining eigenvalues are equal is not crucial as our analysis can be generalized to the case when these are only bounded by σ2. Asymptotic properties of the eigenvalues and eigenvectors of the sample covariance matrix under this model, in the setting when N/nc ∈ (0, ∞) as n → ∞, have been studied by Lu (2002), Baik and Silverstein (2006), Nadler (2008), Onatski (2006) and Paul (2007), among others. A key conclusion is that when N/nc > 0, the eigenvectors of standard PCA are inconsistent estimators of the population eigenvectors.

Eigenvector and covariance matrix estimation are related in the following way. When the population covariance is a low rank perturbation of the identity, as in this paper, sparsity of the eigenvectors corresponding to the nonunit eigenvalues implies sparsity of the whole covariance. Consistency of an estimator of the whole covariance matrix in spectral norm implies convergence of its leading eigenvalues to their population counterparts. If the gaps between the distinct eigenvalues remain bounded away from zero, it also implies convergence of the corresponding eigen-subspaces [El Karoui (2008)]. In such cases, upper bounds for sparse covariance estimation in the spectral norm, as in Bickel and Levina (2008a) and Cai and Zhou (2012), also yield upper bounds on the rate of convergence of the corresponding eigenvectors under the l2 loss. These works, however, did not study the following fundamental problem, considered in this paper: How well can the leading eigenvectors be estimated; that is, what are the minimax rates for eigenvector estimation? Indeed, it turns out that the optimal rates for covariance matrix estimation and leading eigenvector estimation are different. Moreover, schemes based on thresholding the entries of the sample covariance matrix do not achieve the minimax rate for eigenvector estimation. The latter result is beyond the scope of this paper and will be reported in a subsequent publication by the current authors.

Several works considered various models of sparsity for the leading eigenvectors and developed improved sparse estimators. For example, Witten, Tibshirani and Hastie (2009) and Zou, Hastie and Tibshirani (2006), among others, imposed l1-type sparsity constraints directly on the eigenvector estimates and proposed optimization procedures for obtaining them. Shen and Huang (2008) suggested a regularized low rank approach to sparse PCA. The consistency of the resulting leading eigenvectors was recently proven in Shen, Shen and Marron (2011), in a model in which the sample size n is fixed while N → ∞. d’Aspremont et al. (2007) suggested a semi-definite programming (SDP) problem as a relaxation to the l0-penalty for sparse population eigenvectors. Assuming a single spike, Amini and Wainwright (2009) studied the asymptotic properties of the resulting leading eigenvector of the covariance estimator in the joint limit as both sample size and dimension tend to infinity. Specifically, they considered a leading eigenvector with exactly k ≤ N nonzero entries, all equal to $\pm 1/\sqrt{k}$. For this hardest subproblem in the k-sparse l0-ball, Amini and Wainwright (2009) derived information theoretic lower bounds for such eigenvector estimation.

In this paper, in contrast, following Johnstone and Lu (2009) (JL), we study estimation of the leading eigenvectors of Σ assuming that these are approximately sparse, with a bounded lq norm. Under this model, JL developed an estimation procedure based on coordinate selection by thresholding the diagonal of the sample covariance matrix, followed by the spectral decomposition of the submatrix corresponding to the selected coordinates. JL further proved consistency of this estimator assuming dimension grows at most polynomially with sample size, but did not study its convergence rate. Since this estimation procedure is considerably simpler to implement and computationally much faster than the l1 penalization procedures cited above, it is of interest to understand its theoretical properties. More recently, Ma (2011) developed iterative thresholding sparse PCA (ITSPCA), which is based on repeated filtering, thresholding and orthogonalization steps that result in sparse estimators of the subspaces spanned by the leading eigenvectors. He also proved consistency and derived rates of convergence under appropriate loss functions and sparsity assumptions. In a later work, Cai, Ma and Wu (2012) considered a two-stage estimation scheme for the leading population eigenvector, in which the first stage is similar to the diagonal thresholding (DT) scheme applied to a stochastically perturbed version of the data. The estimates of the leading eigenvectors from this step are then used to project another stochastically perturbed version of the data to obtain the final estimates of the eigenvectors through solving an orthogonal regression problem. They showed that this two-stage scheme achieves the optimal rate for estimation of eigen-subspaces under suitable sparsity conditions.

In this paper, which is partly based on the Ph.D. thesis of Paul [Paul (2005)], we study the estimation of the leading eigenvectors of Σ, all assumed to belong to appropriate lq spaces. Our analysis thus extends the JL setting and complements the work of Amini and Wainwright (2009) in the l0-sparsity setting. For simplicity, we assume Gaussian observations in our analysis.

The main contributions of this paper are as follows. First, we establish lower bounds on the rate of convergence of the minimax risk for any eigenvector estimator under the l2 loss. This analysis points to three different regimes of sparsity, which we denote dense, thin and sparse, each having its own rate of convergence. We show that in the “dense” setting (as defined in Section 3), the standard PCA estimator attains the optimal rate of convergence, whereas in sparser settings it is not even consistent. Next, we show that while the JL diagonal thresholding (DT) scheme is consistent under these sparsity assumptions, it is not rate optimal in general. This motivates us to propose a new refined thresholding method (Augmented Sparse PCA, or ASPCA) that is based on a two-stage coordinate selection scheme. In the sparse setting, both our ASPCA procedure and the method of Ma (2011) achieve our lower bound on the minimax risk, and are thus rate-optimal procedures, so long as DT is consistent. For proofs see Ma (2011) and Paul and Johnstone (2007). There is a somewhat special, intermediate, “thin” region where a gap exists between the current lower bound and the upper bound on the risk. It is an open question whether the lower bound can be improved in this scenario, or a better estimator can be derived. Table 1 provides a comparison of the lower bounds and rates of convergence of various estimators.

Table 1.

Comparison of lower bounds on eigenvector estimation and worst case rates of various procedures

Estimator      Dense             Thin                 Sparse
Lower bound    O(N/n)            O(n^{-(1-q/2)})      O((log N/n)^{1-q/2})
PCA            Rate optimal*     Inconsistent         Inconsistent
DT             Inconsistent      Inconsistent         Not rate optimal
ASPCA          Inconsistent      Inconsistent         Rate optimal†

* When N/n → 0.

† So long as DT is consistent.

The theoretical results also show that under comparable scenarios, the optimal rate for eigenvector estimation, O((log N/n)^{1−q/2}) under squared-error loss, is faster than the rate obtained for sparse covariance estimation, O((log N/n)^{1−q}) under squared operator norm loss, by Bickel and Levina (2008a) and shown to be optimal by Cai and Zhou (2012).

Finally, we emphasize that to obtain good finite-sample performance for both our two-stage scheme, as well as for other thresholding methods, the exact thresholds need to be carefully tuned. This issue and the detailed theoretical analysis of the ASPCA estimator are beyond the scope of this paper, and will be presented in a future publication. After this paper was completed, we learned of Vu and Lei (2012), which cites Paul and Johnstone (2007) and contains results overlapping with some of the work of Paul and Johnstone (2007) and this paper.

The rest of the paper is organized as follows. In Section 2, we describe the model for the eigenvectors and analyze the risk of the standard PCA estimator. In Section 3, we present the lower bounds on the minimax risk of any eigenvector estimator. In Section 4, we derive a lower bound on the risk of the diagonal thresholding estimator proposed by Johnstone and Lu (2009). In Section 5, we propose a new estimator named ASPCA (augmented sparse PCA) that is a refinement of the diagonal thresholding estimator. In Section 6, we discuss the question of attainment of the risk bounds. Proofs of the results are given in Appendix A.

Throughout, 𝕊N−1 denotes the unit sphere in ℝN centered at the origin, ⌊x⌋ denotes the largest integer less than or equal to x ∈ ℝ.

2. Problem setup

We suppose a triangular array model, in which for each n, the random vectors $X_i \equiv X_i^n$, $i = 1, \ldots, n$, each have dimension N = N(n) and are independent and identically distributed on a common probability space. Throughout we assume that the Xi’s are i.i.d. $N_N(0, \Sigma)$, where the population matrix Σ, also depending on N, is a finite rank perturbation of (a multiple of) the identity. In other words,

$$\Sigma = \sum_{\nu=1}^{M} \lambda_\nu \theta_\nu \theta_\nu^T + \sigma^2 I, \tag{2.1}$$

where λ1 > λ2 > ⋯ > λM > 0, and the vectors θ1, … , θM ∈ ℝN are orthonormal, which implies (1.1). Here θν is the eigenvector of Σ corresponding to the νth largest eigenvalue, namely, λν + σ². The term “finite rank” means that M remains fixed even as n → ∞. The asymptotic setting involves letting both n and N grow to infinity simultaneously. For simplicity, we assume that the λν’s are fixed while the parameter space for the θν’s varies with N.

The observations can be described in terms of the model

$$X_{ik} = \sum_{\nu=1}^{M} \sqrt{\lambda_\nu}\, v_{\nu i}\, \theta_{\nu k} + \sigma Z_{ik}, \qquad i = 1, \ldots, n,\; k = 1, \ldots, N. \tag{2.2}$$

Here, for each i, the $v_{\nu i}$, $Z_{ik}$ are i.i.d. N(0, 1). Since the eigenvectors of Σ are invariant to a scale change in the original observations, it is henceforth assumed that σ = 1. Hence, λ1, …, λM in the asymptotic results should be replaced by λ1/σ², … , λM/σ² when (2.1) holds with an arbitrary σ > 0. Since the main focus of this paper is estimation of eigenvectors, without loss of generality we consider the uncentered sample covariance matrix $S \coloneqq n^{-1} XX^T$, where X = [X1 : ⋯ : Xn].
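As a concrete illustration, model (2.2) is easy to simulate. The following minimal numpy sketch (function and variable names are ours, not the paper's) draws n observations and forms the uncentered sample covariance $S = n^{-1}XX^T$:

```python
import numpy as np

def sample_spiked_data(n, N, lambdas, thetas, sigma=1.0, seed=None):
    """Draw n i.i.d. columns from model (2.2):
    X_ik = sum_nu sqrt(lambda_nu) v_{nu,i} theta_{nu,k} + sigma Z_ik."""
    rng = np.random.default_rng(seed)
    M = len(lambdas)
    V = rng.standard_normal((n, M))             # factors v_{nu,i} ~ N(0,1)
    Z = rng.standard_normal((n, N))             # isotropic noise Z_ik ~ N(0,1)
    X = V @ (np.sqrt(lambdas)[:, None] * thetas) + sigma * Z
    return X.T                                  # N x n; columns are the X_i

# single spike: theta_1 = e_1, lambda_1 = 2
N, n = 50, 200
theta = np.zeros((1, N)); theta[0, 0] = 1.0
X = sample_spiked_data(n, N, np.array([2.0]), theta, seed=0)
S = X @ X.T / n                                 # uncentered sample covariance
assert S.shape == (N, N) and np.allclose(S, S.T)
```

With this single spike, the first diagonal entry of S concentrates around λ1 + σ² = 3 while the remaining diagonal entries concentrate around 1, which is exactly the contrast that the diagonal thresholding scheme of Section 4 exploits.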

The following condition, termed Basic assumption, will be used throughout the asymptotic analysis, and will be referred to as (BA).

(BA) (2.2) holds with σ = 1; N = N(n) → ∞ as n → ∞; λ1 > ⋯ > λM > 0 are fixed (do not vary with N); M is unknown but fixed.

2.1. Eigenvector estimation with squared error loss

Given data {Xi}i=1n, the goal is to estimate M and the eigenvectors θ1, … ,θM. For simplicity, to derive the lower bounds, we first assume that M is known. In Section 5.2 we derive an estimator of M, which can be shown to be consistent under the assumed sparsity conditions. To assess the performance of any estimator, a minimax risk analysis approach is proposed. The first task is to specify a loss function L(θ̂ν, θν) between the estimated and true eigenvector.

Eigenvectors are invariant to choice of sign, so we introduce a notation for the acute (angle) difference between unit vectors,

$$a \ominus b = a - \operatorname{sign}(\langle a, b\rangle)\, b,$$

where a and b are N × 1 vectors with unit l2 norm. We consider the following loss function, also invariant to sign changes:

$$L(a, b) \coloneqq 2\bigl(1 - |\langle a, b\rangle|\bigr) = \|a \ominus b\|^2. \tag{2.3}$$

An estimator θ̂ν is called consistent with respect to L if L(θ̂ν, θν) → 0 in probability as n → ∞.
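Reading (2.3) as $L(a,b) = 2(1 - |\langle a,b\rangle|) = \|a \ominus b\|^2$ for unit vectors, the loss and the acute difference are straightforward to code. This sketch (helper names are ours) also checks that the two expressions agree and that the loss is sign-invariant:

```python
import numpy as np

def acute_diff(a, b):
    """a (-) b = a - sign(<a,b>) b: the sign-invariant difference."""
    return a - np.sign(a @ b) * b

def loss(a, b):
    """L(a,b) = 2(1 - |<a,b>|), which equals ||a (-) b||^2 for unit vectors."""
    return 2.0 * (1.0 - abs(a @ b))

rng = np.random.default_rng(1)
a = rng.standard_normal(10); a /= np.linalg.norm(a)
b = rng.standard_normal(10); b /= np.linalg.norm(b)
# the two expressions for L agree, and L is invariant to a sign flip of b
assert np.isclose(loss(a, b), np.linalg.norm(acute_diff(a, b))**2)
assert np.isclose(loss(a, b), loss(a, -b))
```

The identity follows from expanding $\|a - sb\|^2 = 2 - 2s\langle a,b\rangle$ with $s = \operatorname{sign}(\langle a,b\rangle)$.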

2.2. Rate of convergence for ordinary PCA

We first consider the asymptotic risk of the leading eigenvectors of the sample covariance matrix (henceforth referred to as the standard PCA estimators) when the ratio N/n → 0 as n → ∞. For future use, we define

$$h(\lambda) \coloneqq \frac{\lambda^2}{1+\lambda}, \qquad \lambda > 0, \tag{2.4}$$

and

$$g(\lambda, \tau) = \frac{(\lambda - \tau)^2}{(1+\lambda)(1+\tau)}, \qquad \lambda, \tau > 0. \tag{2.5}$$

In Johnstone and Lu (2009) (Theorem 1) it was shown that under a single spike model, as N/n → 0, the standard PCA estimator of the leading eigenvector is consistent. The following result, proven in the Appendix, is a refinement of that, as it also provides the leading error term.

Theorem 2.1. Let θ̂ν,PCA be the eigenvector corresponding to the νth largest eigenvalue of S. Assume that (BA) holds and N, n → ∞ such that N/n → 0. Then, for each ν = 1, … , M,

$$\sup_{\theta_\nu \in \mathbb{S}^{N-1}} \mathbb{E}\, L(\hat\theta_{\nu,\mathrm{PCA}}, \theta_\nu) = \left[\frac{N-M}{n\, h(\lambda_\nu)} + \frac{1}{n} \sum_{\mu \ne \nu} \frac{1}{g(\lambda_\mu, \lambda_\nu)}\right] (1 + o(1)). \tag{2.6}$$

Remark 2.1. Observe that Theorem 2.1 does not assume any special structure such as sparsity for the eigenvectors. The first term on the right-hand side of (2.6) is a nonparametric component which arises from the interaction of the noise terms with the different coordinates. The second term is “parametric” and results from the interaction with the remaining M − 1 eigenvectors corresponding to different eigenvalues. The second term shows that the closer the successive eigenvalues, the larger the estimation error. The upshot of (2.6) is that standard PCA yields a consistent estimator of the leading eigenvectors of the population covariance matrix when the dimension-to-sample-size ratio (N/n) is asymptotically negligible.
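For intuition, the right-hand side of (2.6) can be evaluated numerically. A small sketch (function names are ours) using h and g from (2.4) and (2.5), illustrating the remark that closer successive eigenvalues inflate the parametric term:

```python
import numpy as np

def h(lam):
    """Equation (2.4)."""
    return lam**2 / (1.0 + lam)

def g(lam, tau):
    """Equation (2.5)."""
    return (lam - tau)**2 / ((1.0 + lam) * (1.0 + tau))

def pca_risk(nu, lambdas, N, n):
    """Leading-order maximal risk of standard PCA: the bracket in (2.6)."""
    M = len(lambdas)
    nonparam = (N - M) / (n * h(lambdas[nu]))          # noise x dimension term
    param = sum(1.0 / g(lambdas[mu], lambdas[nu])      # eigenvalue-gap term
                for mu in range(M) if mu != nu) / n
    return nonparam + param

# closer eigenvalues inflate the parametric term
r_wide  = pca_risk(0, [4.0, 1.0], N=100, n=10_000)
r_close = pca_risk(0, [4.0, 3.5], N=100, n=10_000)
assert r_close > r_wide
```

The nonparametric term is identical in both calls; only the gap-dependent sum changes, which makes the comparison a direct reading of (2.6).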

2.3. lq constraint on eigenvectors

When N/nc ∈ (0, ∞], standard PCA provides inconsistent estimators for the population eigenvectors, as shown by various authors [Johnstone and Lu (2009), Lu (2002), Nadler (2008), Onatski (2006), Paul (2007)]. In this subsection we consider the following model for approximate sparsity of the eigenvectors. For each ν = 1, … , M, assume that θν belongs to an lq ball with radius C, for some q ∈ (0, 2), thus θν ∈ Θq (C), where

$$\Theta_q(C) \coloneqq \Bigl\{a \in \mathbb{S}^{N-1} : \sum_{k=1}^{N} |a_k|^q \le C^q\Bigr\}. \tag{2.7}$$

Note that our condition of sparsity is slightly different from that of Johnstone and Lu (2009). Since 0 < q < 2, for Θq(C) to be nonempty, one needs C ≥ 1. Further, if $C^q \ge N^{1-q/2}$, then the space Θq(C) is all of 𝕊N−1 because in this case, the least sparse vector $N^{-1/2}(1, 1, \ldots, 1)$ is in the parameter space.
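A quick numerical check of membership in Θq(C) as defined in (2.7), including the $N^{1-q/2}$ borderline for the least sparse vector (the helper name is ours):

```python
import numpy as np

def in_lq_ball(a, q, C, tol=1e-12):
    """Check a in Theta_q(C): unit l2 norm and sum_k |a_k|^q <= C^q, per (2.7)."""
    return (abs(np.linalg.norm(a) - 1.0) < tol
            and np.sum(np.abs(a)**q) <= C**q + tol)

N, q = 100, 1.0
dense = np.full(N, N**-0.5)        # least sparse unit vector, N^{-1/2}(1,...,1)
e1 = np.zeros(N); e1[0] = 1.0      # sparsest unit vector

assert in_lq_ball(e1, q, C=1.0)                  # always in, since C >= 1
assert not in_lq_ball(dense, q, C=2.0)           # sum |a_k|^q = N^{1-q/2} = 10
assert in_lq_ball(dense, q, C=N**((1 - q/2)/q))  # C^q = N^{1-q/2} suffices
```

For q = 1 and N = 100, the least sparse vector has l1 norm $N^{1/2} = 10$, so the entire sphere is included exactly when $C \ge 10$, matching the $C^q \ge N^{1-q/2}$ condition.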

The parameter space for θ ≔ [θ1: … :θM] is denoted by

$$\Theta_q^M(C_1, \ldots, C_M) \coloneqq \Bigl\{\theta \in \bigotimes_{\nu=1}^{M} \Theta_q(C_\nu) : \langle \theta_\nu, \theta_{\nu'}\rangle = 0 \text{ for } \nu \ne \nu'\Bigr\}, \tag{2.8}$$

where Θq (C) is defined through (2.7), and Cν ≥ 1 for all ν = 1, … , M. Thus ΘqM consists of sparse orthonormal M-frames, with sparsity measured in lq. Note that in the analysis that follows we allow the Cν’s to increase with N.

Remark 2.2. While our focus is on eigenvector sparsity, condition (2.8) also implies sparsity of the covariance matrix itself. In particular, for q ∈ (0, 1), a spiked covariance matrix satisfying (2.8) also belongs to the class of sparse covariance matrices analyzed by Bickel and Levina (2008a), Cai and Liu (2011) and Cai and Zhou (2012). Indeed, Cai and Zhou (2012) obtained the minimax rate of convergence for covariance matrix estimators under the spectral norm when the rows of the population matrix satisfy a weak-lq constraint. However, as we will show below, the minimax rate for estimation of the leading eigenvectors is faster than that for covariance estimation.

3. Lower bounds on the minimax risk

We now derive lower bounds on the minimax risk of estimating θν under the loss function (2.3). To aid in describing and interpreting the lower bounds, we define the following two auxiliary parameters. The first is an effective noise level per coordinate

$$\tau_\nu^2 = \frac{1}{n\, h(\lambda_\nu)} \tag{3.1}$$

and the second is an effective dimension

$$m_\nu \coloneqq A_q (\bar C_\nu / \tau_\nu)^q, \tag{3.2}$$

where $a_q \coloneqq (2/9)^{1-q/2}$, $c_1 \coloneqq \log(9/8)$, $A_q \coloneqq 1/(a_q c_1^{q/2})$ and, finally, $\bar C_\nu^q \coloneqq C_\nu^q - 1$.

The phrase effective noise level per coordinate is motivated by the risk bound in Theorem 2.1: dividing both sides of (2.6) by N, the expected “per coordinate” risk (or variance) of the PCA estimator is asymptotically $\tau_\nu^2$. Next, following Nadler (2009), let us provide a different interpretation of $\tau_\nu$. Consider a sparse θν and an oracle that, regardless of the observed data, selects the set Jτ of all coordinates of θν that are larger than τ in absolute value, and then performs PCA on the sample covariance restricted to these coordinates. Since θν ∈ Θq(Cν), the maximal squared bias is

$$\sup_{\theta_\nu \in \Theta_q(C_\nu)} \sum_{k \notin J_\tau} |\theta_{\nu k}|^2 \le \sup\Bigl\{\sum_{k=1}^N x_k^{2/q} : \sum_{k=1}^N x_k \le C_\nu^q,\ 0 \le x_k \le \tau^q\Bigr\} \le C_\nu^q\, \tau^{2-q},$$

which follows by the correspondence $x_k = |\theta_{\nu k}|^q$ and the convexity of the function $\sum_{k=1}^N x_k^{2/q}$. On the other hand, by Theorem 2.1, the maximal variance term of this oracle estimator is of the order $k_\tau/(n h(\lambda_\nu))$, where $k_\tau$ is the maximal number of coordinates of θν exceeding τ. Again, θν ∈ Θq(Cν) implies that $k_\tau \le C_\nu^q \tau^{-q}$. Thus, to balance the bias and variance terms, we need $\tau \asymp 1/\sqrt{n h(\lambda_\nu)} = \tau_\nu$. This heuristic analysis shows that τν can be viewed as an oracle threshold for the coordinate selection scheme; that is, the best possible estimator of θν based on individual coordinate selection can expect to recover only those coordinates that are above the threshold τν.

To understand why mν is an effective dimension, consider the least sparse vector θν ∈ Θq(Cν). This vector should have as many nonzero coordinates of equal size as possible. If $C_\nu^q > N^{1-q/2}$, then the vector with coordinates ±N^{−1/2} does the job. Otherwise, we set the first coordinate of the vector to be $\sqrt{1-r^2}$ for some r ∈ (0, 1) and choose all the remaining nonzero coordinates to be of magnitude τν. Clearly, we must have $r^2 = m\tau_\nu^2$, where m + 1 is the maximal number of nonzero coordinates, while the lq constraint implies that $(1-r^2)^{q/2} + m\tau_\nu^q \le C_\nu^q$. The last inequality shows that the maximal m is just a constant multiple of mν. This construction also constitutes the key idea in the proof of Theorems 3.1 and 3.2. Finally, we set

$$N' = N - M. \tag{3.3}$$
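The least favorable construction described above (one large coordinate $\sqrt{1-r^2}$ plus m spike coordinates of size τν, with $r^2 = m\tau_\nu^2$) is easy to carry out numerically. In this illustrative sketch (function name ours), m is grown greedily until the lq constraint binds:

```python
import numpy as np

def least_favorable(N, q, C, tau):
    """Build the near-least-sparse unit vector: one coordinate sqrt(1 - r^2)
    plus m coordinates of magnitude tau, with m the largest integer keeping
    (1 - r^2)^{q/2} + m tau^q <= C^q, where r^2 = m tau^2."""
    m = 0
    while m + 1 < N and (m + 1) * tau**2 < 1.0:
        r2 = (m + 1) * tau**2
        if (1.0 - r2)**(q/2) + (m + 1) * tau**q > C**q:
            break                               # lq constraint would be violated
        m += 1
    v = np.zeros(N)
    v[0] = np.sqrt(1.0 - m * tau**2)
    v[1:m+1] = tau
    return v, m

v, m = least_favorable(N=500, q=1.0, C=3.0, tau=0.05)
assert np.isclose(np.linalg.norm(v), 1.0)       # unit l2 norm by construction
assert np.sum(np.abs(v)) <= 3.0                 # inside the l1 ball
```

With τ set to the effective noise level τν, the resulting m is (up to constants) the effective dimension mν of (3.2).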

Theorem 3.1. Assume that (BA) holds, 0 < q < 2 and n, N → ∞. Then, there exists a constant B1 > 0 such that for n sufficiently large,

$$R_\nu^* \coloneqq \inf_{\hat\theta_\nu} \sup_{\Theta_q(C)} \mathbb{E}\, L(\hat\theta_\nu, \theta_\nu) \ge B_1 \delta_n, \tag{3.4}$$

where δn is given by

$$\delta_n = \begin{cases} \tau_\nu^2 N', & \text{if } \tau_\nu^2 N' < 1 \text{ and } N' < m_\nu \quad (\textit{dense setting}),\\ \tau_\nu^2 m_\nu, & \text{if } \tau_\nu^2 m_\nu < 1 \text{ and } m_\nu < N' \quad (\textit{thin setting}),\\ 1, & \text{if } \tau_\nu^2 \cdot \min\{N', m_\nu\} > 1 \quad (\textit{weak signal}).\end{cases}$$

We may think of $m_{n,\nu} \coloneqq \min\{N', m_\nu\}$ as the effective dimension of the least favorable configuration.

In the thin setting, $m_{n,\nu} = A_q \bar C_\nu^q [n h(\lambda_\nu)]^{q/2} < N'$ (i.e., $\bar C_\nu^q n^{q/2} < c' N$ for some c′ > 0), and the lower bound is of the order

$$\delta_n = A_q \bar C_\nu^q \tau_\nu^{2-q} = A_q \bar C_\nu^q [n h(\lambda_\nu)]^{-(1-q/2)} \asymp \bar C_{\nu,n}^q\, n^{-(1-q/2)}. \tag{3.5}$$

In the dense setting, on the other hand, $m_{n,\nu} = N' = N - M$, and

$$\delta_n = \frac{N - M}{n\, h(\lambda_\nu)} \asymp \frac{N}{n}. \tag{3.6}$$

If N/nc for some c > 0, then δn ≍ 1, and so any estimator of the eigenvector θν is inconsistent. If N/n → 0, then equation (3.6) and Theorem 2.1 imply that the standard PCA estimator θ̂ν,PCA attains the optimal rate of convergence.

A sharper lower bound is possible if $\bar C_\nu^q n^{q/2} = O(N^{1-\alpha})$ for some α ∈ (0, 1). We call this a sparse setting, noting that it is a special case of the thin setting. In this case the dimension N is much larger than the quantity $\bar C_\nu^q n^{q/2}$ measuring the effective dimension. Hence, we define a modified effective noise level per coordinate

$$\bar\tau_\nu^2 = \frac{\alpha}{9}\, \frac{\log N}{n\, h(\lambda_\nu)},$$

and a modified effective dimension

$$\bar m_\nu = a_q^{-1} (\bar C_\nu / \bar\tau_\nu)^q.$$

Theorem 3.2. Assume that (BA) holds, 0 < q < 2 and n, N → ∞ in such a way that $\bar C_\nu^q n^{q/2} = O(N^{1-\alpha})$ for some α ∈ (0, 1). Then there exists a constant B1 such that for n sufficiently large, the minimax bound (3.4) holds with

$$\delta_n = \bar m_\nu \bar\tau_\nu^2 = a_q^{-1} \bar C_\nu^q \left(\frac{\alpha \log N}{9\, n\, h(\lambda_\nu)}\right)^{1-q/2} \qquad (\textit{sparse setting}) \tag{3.7}$$

so long as this quantity is ≤ 1.

Note that in the sparse setting δn is larger by a factor of $(\log N)^{1-q/2}$ compared to the thin setting [equation (3.5)].

It should be noted that for fixed signal strength λν, for the corresponding eigenvector to be thin but not sparse is somewhat of a “rarity,” as the following argument shows. Consider first the case N = o(n). If $N < A_q (h(\lambda_\nu))^{q/2} \bar C_\nu^q n^{q/2}$, then we are in the dense setting, since $\tau_\nu^2 N' \asymp N/n \to 0$. On the other hand, if N = o(n) and $\bar C_\nu^q n^{q/2} = O(N^{1-\alpha})$ for some α ∈ (0, 1), then θν is sparse, according to the discussion preceding Theorem 3.2. So, if N = o(n), for the eigenvector θν to be thin but not sparse, we need $\bar C_\nu^q n^{q/2} \asymp N s_N$, where sN is a term which may be constant or may converge to zero at a rate slower than any polynomial in N. Next, consider the case n = o(N). For a meaningful lower bound, we require $\tau_\nu^2 m_\nu < 1$, which means that $\bar C_\nu^q n^{q/2} < c_{q,\nu}\, n$ for some constant $c_{q,\nu} > 0$. Thus, as long as n = O(N^{1−α}) for some α ∈ (0, 1), θν cannot be thin but not sparse. Finally, suppose that N ≍ n, and let $\bar C_\nu^q = N^\beta$ for some β ≥ 0. If β < 1 − q/2, then we are in the sparse case. On the other hand, if β > 1 − q/2, then there is no sparsity at all, since when $C_\nu^q \ge N^{1-q/2}$ the entire 𝕊N−1 belongs to the relevant lq ball for θν. Hence, only if β = 1 − q/2 exactly is it possible for θν to be thin but not sparse. This analysis emphasizes the point that, at least for a fixed signal strength, thin but not sparse is a somewhat special situation.
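The case analysis of Theorem 3.1 (and the sparse-setting refinement of Theorem 3.2) can be packaged as a small calculator for δn. This sketch (the function name and argument convention are ours) follows the constants of Section 3 as stated:

```python
import numpy as np

def minimax_lower_bound(n, N, M, q, C, lam, alpha=None):
    """delta_n from Theorem 3.1; passing the sparse-setting exponent alpha
    switches to the Theorem 3.2 bound. Constants follow Section 3."""
    h = lam**2 / (1.0 + lam)                   # (2.4)
    tau2 = 1.0 / (n * h)                       # effective noise level (3.1)
    a_q = (2.0/9.0)**(1 - q/2)
    c1 = np.log(9.0/8.0)
    A_q = 1.0 / (a_q * c1**(q/2))
    Cbar_q = C**q - 1.0
    m_nu = A_q * Cbar_q / tau2**(q/2)          # effective dimension (3.2)
    N_prime = N - M                            # (3.3)
    if alpha is not None:                      # sparse setting, Theorem 3.2
        tau2_bar = (alpha/9.0) * np.log(N) / (n * h)
        m_bar = (Cbar_q / tau2_bar**(q/2)) / a_q
        return m_bar * tau2_bar
    if tau2 * min(N_prime, m_nu) > 1:
        return 1.0                             # weak signal: no consistency
    return tau2 * min(N_prime, m_nu)           # dense or thin setting

dense = minimax_lower_bound(n=10_000, N=100, M=1, q=1.0, C=50.0, lam=1.0)
assert np.isclose(dense, 99 / (10_000 * 0.5))  # tau_nu^2 N' when N' < m_nu
```

In the example, λ = 1 gives h(λ) = 1/2 and N′ = 99 is far below mν, so the dense-setting rate τν²N′ ≍ N/n applies, matching (3.6).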

4. Risk of the diagonal thresholding estimator

In this section, we analyze the convergence rate of the diagonal thresholding (DT) approach to sparse PCA proposed by Johnstone and Lu (2009) (JL). In this section and in Section 5, we assume for simplicity that N ≥ n. Let the sample variance of the kth coordinate, the kth diagonal entry of S, be denoted by Skk. Then DT consists of the following steps:

  1. Define $I = I(\gamma_n)$ to be the set of indices k ∈ {1, … , N} such that $S_{kk} > 1 + \gamma_n$ for some threshold γn > 0.

  2. Let SII be the submatrix of S corresponding to the coordinates I. Perform an eigen-analysis of SII and denote its eigenvectors by fi, i = 1, … , min{n, |I|}.

  3. For ν = 1, … , M, estimate θν by the N × 1 vector θ̂ν, obtained from fν by augmenting zeros to all the coordinates in Ic ≔ {1, … , N} \ I.
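The three steps above can be sketched directly in numpy, using the JL threshold $\gamma_n = \gamma\sqrt{\log N/n}$ with a user-chosen constant γ (the function name, the constant, and the synthetic example are ours):

```python
import numpy as np

def diagonal_thresholding(X, M, gamma=1.5):
    """DT estimator: (1) keep coordinates with high sample variance,
    (2) eigen-analysis of the selected submatrix, (3) pad with zeros.
    X is N x n with i.i.d. columns; noise variance normalized to 1."""
    N, n = X.shape
    S = X @ X.T / n                                    # uncentered covariance
    gamma_n = gamma * np.sqrt(np.log(N) / n)
    I = np.flatnonzero(np.diag(S) > 1.0 + gamma_n)     # step 1
    vals, vecs = np.linalg.eigh(S[np.ix_(I, I)])       # step 2 (ascending)
    order = np.argsort(vals)[::-1]
    theta_hat = np.zeros((N, M))
    theta_hat[I, :] = vecs[:, order[:M]]               # step 3: pad with zeros
    return theta_hat

# single spike with lambda = 4, supported on 5 coordinates
rng = np.random.default_rng(0)
N, n = 200, 500
theta = np.zeros(N); theta[:5] = 1/np.sqrt(5)
X = 2.0 * rng.standard_normal(n)[None, :] * theta[:, None] \
    + rng.standard_normal((N, n))
est = diagonal_thresholding(X, M=1)[:, 0]
assert abs(est @ theta) > 0.8                          # close up to sign
```

The selected signal coordinates have sample variance concentrating around 1 + λ1θ²1k = 1.8 here, well above the threshold, while most noise coordinates fall below it.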

Assuming that θν ∈ Θq(Cν) and a threshold $\gamma_n = \gamma\sqrt{\log N/n}$ for some γ > 0, JL showed that DT yields a consistent estimator of θν, but did not further analyze the risk. Indeed, as we prove below, the risk of the DT estimator is not rate optimal. This might be anticipated from the lower bounds on the minimax risk (Theorems 3.1 and 3.2), which indicate that to attain the optimal risk, a coordinate selection scheme must select all coordinates of θν of size at least $c\sqrt{\log N/n}$ for some c > 0. With a threshold of the form γn above, however, only coordinates of size of order (log N/n)^{1/4} are selected. Even for the case of a single signal, M = 1, this leads to a much larger lower bound.

Theorem 4.1. Suppose that (BA) holds with M = 1. Let C > 1 (possibly depending on n), 0 < q < 2 and n, N → ∞ be such that $C^q n^{q/4} = o(\sqrt{n})$. Then the diagonal thresholding estimator θ̂1,DT satisfies

$$\sup_{\theta_1 \in \Theta_q(C)} \mathbb{E}\, L(\hat\theta_{1,\mathrm{DT}}, \theta_1) \ge K_q \bar C^q\, n^{-(1-q/2)/2} \tag{4.1}$$

for a constant Kq > 0, where $\bar C^q = C^q - 1$.

A comparison of (4.1) with the lower bound (3.5) shows a large gap between the two rates, $n^{-(1-q/2)/2}$ versus $n^{-(1-q/2)}$. This gap arises because DT uses only the diagonal of the sample covariance matrix S, ignoring the information in its off-diagonal entries. In the next section we propose a refinement of the DT scheme, denoted ASPCA, that constructs an improved eigenvector estimate using all entries of S.

In the sparse setting, the ITSPCA estimator of Ma (2011) attains the same asymptotic rate as the lower bound of Theorem 3.2, provided DT yields consistent estimates of the eigenvectors. The latter condition can be shown to hold if, for example, $C_\nu^q n^{q/4} (\log N)^{1-q/2} = o(\sqrt{n})$ for all ν = 1, … , M. Thus, in the sparse setting, with this additional restriction, the lower bound on the minimax rate is sharp, and consequently, the DT estimator is not rate optimal.

5. A two-stage coordinate selection scheme

As discussed above, the DT scheme can reliably detect only those eigenvector coordinates k for which $|\theta_{\nu k}| \ge c(\log N/n)^{1/4}$ (for some c > 0), whereas to reach the lower bound one needs to detect those coordinates for which $|\theta_{\nu k}| \ge c(\log N/n)^{1/2}$.

To motivate an improved coordinate selection scheme, consider the single component (i.e., M = 1) case, and form a partition of the N coordinates into two sets A and B, where the former contains all those k such that |θ1k| is “large” (selected by DT), and the latter contains the remaining smaller coordinates. Partition the matrix Σ as

$$\Sigma = \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix}.$$

Observe that $\Sigma_{BA} = \lambda_1 \theta_{1,B}\,\theta_{1,A}^T$. Let θ̃1 be a “preliminary” estimator of θ1 such that limn→∞ ℙ(〈θ̃1,A, θ1,A〉 ≥ δ0) = 1 for some δ0 > 0 (e.g., θ̃1 could be the DT estimator). Then we have the relationship

$$\Sigma_{BA}\tilde\theta_{1,A} = \langle \tilde\theta_{1,A}, \theta_{1,A}\rangle\, \lambda_1\, \theta_{1,B} = c(\delta_0)\, \lambda_1\, \theta_{1,B}$$

for some c0) bounded below by δ0/2, say. Thus one possible strategy is to additionally select all those coordinates of ∑BAθ̃1,A that are larger (in absolute value) than some constant multiple of logN/nh(λ1). Neither ∑BA nor λ1 is known, but we can use SBA as a surrogate for the former and the largest eigenvalue of SAA to obtain an estimate for the latter. A technical challenge is to show that, with probability tending to 1, such a scheme indeed recovers all coordinates k with |θ1k|>γ+logN/nh(λ1), while discarding all coordinates k with |θ1k|<γlogN/nh(λ1) for some constants γ+ > γ > 0. Figure 1 provides a pictorial description of the DT and ASPCA coordinate selection schemes.

Fig. 1.

Schematic diagram of the DT and ASPCA thresholding schemes under the single component setting. The x-axis represents the indices of different coordinates of the first eigenvector and the vertical lines depict the absolute values of the coordinates. The threshold for the DT scheme is γ(log N/n)^{1/4} while the threshold for the ASPCA scheme is γ(log N/n)^{1/2}. For some generic constants γ+ > γ > γ− > 0, with high probability, the schemes select all coordinates above the upper limits (indicated by the multiplier γ+) and discard all coordinates below the lower limits (indicated by the multiplier γ−).

5.1. ASPCA scheme

Based on the ideas described above, we now present the ASPCA algorithm. It first performs two stages of coordinate selection; the final stage consists of an eigen-analysis of the submatrix of S corresponding to the selected coordinates. The algorithm is described below.

For any γ > 0 define

$$I(\gamma) = \{k : S_{kk} > 1 + \gamma\}. \tag{5.1}$$

Let γi > 0 for i = 1, 2 and κ > 0 be constants to be specified later.

Stage 1.

  • Let $I = I(\gamma_{1,n})$, where $\gamma_{1,n} = \gamma_1\sqrt{\log N/n}$.

  • Denote the eigenvalues and eigenvectors of $S_{II}$ by $\hat\ell_1 > \cdots > \hat\ell_{m_1}$ and $f_1, \ldots, f_{m_1}$, respectively, where $m_1 = \min\{n, |I|\}$.

  • Estimate M by M̂, defined in Section 5.2.

Stage 2.

  • Let $E = [\hat\ell_1^{1/2} f_1 : \cdots : \hat\ell_{\hat M}^{1/2} f_{\hat M}]$ and $Q = S_{I^c I} E$.

  • Let $J = \{k \in I^c : (QQ^T)_{kk} > \gamma_{2,n}^2\}$ for some γ2,n > 0. Define $K = I \cup J$.

Stage 3.

  • For ν = 1, …, M̂, denote by θ̂ν the νth eigenvector of $S_{KK}$, augmented with zeros in the coordinates $K^c$.
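The three stages can be sketched as follows, treating M as known for simplicity (Section 5.2 gives an estimator M̂) and with user-chosen constants standing in for the carefully tuned thresholds; the form of γ2,n mirrors (5.2), and Stage 2's selection set is read as J ⊆ I^c:

```python
import numpy as np

def aspca(X, M, gamma1=4.0, kappa=2.0):
    """Sketch of the three ASPCA stages; M is treated as known and
    gamma1, kappa are illustrative stand-ins for the tuned constants."""
    N, n = X.shape
    S = X @ X.T / n
    # Stage 1: diagonal thresholding at gamma_{1,n} = gamma1 sqrt(log N / n)
    I = np.flatnonzero(np.diag(S) > 1.0 + gamma1 * np.sqrt(np.log(N) / n))
    vals, vecs = np.linalg.eigh(S[np.ix_(I, I)])
    order = np.argsort(vals)[::-1][:M]
    ell, f = vals[order], vecs[:, order]
    # Stage 2: augment with coordinates correlated with the stage-1 directions
    Ic = np.setdiff1d(np.arange(N), I)
    E = f * np.sqrt(ell)                      # columns ell_nu^{1/2} f_nu
    Q = S[np.ix_(Ic, I)] @ E                  # Q = S_{I^c I} E
    gamma2n = kappa * np.sqrt(np.log(N) / n) + np.sqrt(M / n)
    J = Ic[np.sum(Q**2, axis=1) > gamma2n**2]
    K = np.union1d(I, J)
    # Stage 3: eigen-analysis on the augmented set K, pad with zeros
    valsK, vecsK = np.linalg.eigh(S[np.ix_(K, K)])
    orderK = np.argsort(valsK)[::-1][:M]
    theta_hat = np.zeros((N, M))
    theta_hat[K, :] = vecsK[:, orderK]
    return theta_hat

rng = np.random.default_rng(0)
N, n = 200, 500
theta = np.zeros(N); theta[:5] = 1/np.sqrt(5)     # single spike, lambda = 4
X = 2.0 * rng.standard_normal(n)[None, :] * theta[:, None] \
    + rng.standard_normal((N, n))
est = aspca(X, M=1)[:, 0]
assert abs(est @ theta) > 0.8
```

On this easy example Stage 2 adds little, since all signal coordinates clear the diagonal threshold; its benefit appears when many coordinates of θ1 sit between the (log N/n)^{1/4} and (log N/n)^{1/2} scales.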

Remark 5.1. The ASPCA scheme is specified up to the choice of parameters γ1 and γ2,n that determine its rate of convergence. It can be shown that choosing γ1 = 4 and

$$\gamma_{2,n} = \kappa \sqrt{\frac{\log N}{n}} + \sqrt{\frac{\hat M}{n}} \tag{5.2}$$

with $\kappa = \sqrt{3} + \varepsilon$ for some ε > 0, results in an asymptotically optimal rate. Again, we note that for finite N, n, the actual performance in terms of the risk of the resulting eigenvector estimate may depend strongly on the threshold. In practice, a delicate choice of thresholds can be highly beneficial. This issue, as well as the analysis of the risk of the ASPCA estimator, are beyond the scope of this paper and will be studied in a separate publication.

5.2. Estimation of M

Estimation of the dimension of the signal subspace is a classical problem. If the signal eigenvalues are strong enough (i.e., $\lambda_\nu > c\sqrt{N/n}$ for all ν = 1, … , M, for some c > 1 independent of N, n), then nonparametric methods that do not assume eigenvector sparsity can asymptotically estimate the correct M; see, for example, Kritchman and Nadler (2008). When the eigenvectors are sparse, we can detect much weaker signals, as we describe below.

We estimate M by thresholding the eigenvalues of the submatrix $S_{\bar I \bar I}$, where $\bar I \coloneqq I(\bar\gamma\sqrt{\log N/n})$ for some γ̄ ∈ (0, γ1). Let $\bar m = \min\{n, |\bar I|\}$ and $\bar\ell_1 > \cdots > \bar\ell_{\bar m}$ be the nonzero eigenvalues of $S_{\bar I \bar I}$. Let αn > 0 be a threshold of the form

$$\alpha_n = 2\sqrt{\frac{|\bar I|}{n}} + \left(1 + c_0\sqrt{\frac{\log n}{n}}\right)\frac{|\bar I|}{n}$$

for some user-defined constant c0 > 0. Then, define M̂ by

$$\hat M \coloneqq \max\{1 \le k \le \bar m : \bar\ell_k > 1 + \alpha_n\}. \tag{5.3}$$

The idea is that, for large enough n, $I(\gamma_{1,n}) \subset \bar I$ with high probability, and thus $|\bar I|$ acts as an upper bound on $|I(\gamma_{1,n})|$. Using this and the behavior of the extreme eigenvalues of a Wishart matrix, it can be shown that, with a suitable choice of c0 and γ̄, M̂ is a consistent estimator of M.
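A minimal numpy rendering of this estimator (the function name is ours; the counting form below equals the max in (5.3) because the ℓ̄k are sorted in decreasing order):

```python
import numpy as np

def estimate_M(X, gamma_bar=2.0, c0=1.0):
    """Estimate the number of spikes by thresholding the eigenvalues of S
    restricted to the mildly-thresholded coordinate set I-bar, as in (5.3)."""
    N, n = X.shape
    S = X @ X.T / n
    Ibar = np.flatnonzero(np.diag(S) > 1.0 + gamma_bar * np.sqrt(np.log(N) / n))
    vals = np.linalg.eigvalsh(S[np.ix_(Ibar, Ibar)])[::-1]   # decreasing
    k = len(Ibar)
    alpha_n = 2*np.sqrt(k/n) + (1 + c0*np.sqrt(np.log(n)/n)) * k/n
    # since vals is sorted, counting exceedances equals the max index in (5.3)
    return int(np.sum(vals > 1.0 + alpha_n))

# single spike, lambda = 4, supported on 5 coordinates
rng = np.random.default_rng(0)
N, n = 200, 500
theta = np.zeros(N); theta[:5] = 1/np.sqrt(5)
X = 2.0 * rng.standard_normal(n)[None, :] * theta[:, None] \
    + rng.standard_normal((N, n))
Mhat = estimate_M(X)
# at these sample sizes alpha_n sits close to the noise bulk edge, so the
# estimate may occasionally overshoot the true value M = 1 by one
assert 1 <= Mhat <= 2
```

The threshold 1 + αn is an inflated version of the Wishart bulk edge (1 + √(|Ī|/n))², which is exactly why the consistency argument rests on the extreme-eigenvalue behavior cited above.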

6. Summary and discussion

In this paper we have derived lower bounds on eigenvector estimates under three different sparsity regimes, denoted dense, thin and sparse. In the dense setting, Theorems 2.1 and 3.1 show that when N/n → 0, the standard PCA estimator attains the optimal rate of convergence.

In the sparse setting, Theorem 3.1 of Ma (2011) shows that the maximal risk of the ITSPCA estimator proposed by him attains the same asymptotic rate as the corresponding lower bound of Theorem 3.2. This implies that in the sparse setting, the lower bound on the minimax rate is indeed sharp. In a separate paper, we prove that in the sparse regime, the ASPCA algorithm also attains the minimax rate. All these sparse setting results currently require the additional condition of consistency of DT—without this condition, the rate optimality question remains open.

Finally, our analysis leaves some open questions in the intermediate thin regime. According to Theorem 3.1, the lower bound in this regime is smaller by a factor of (log N)^{1−q/2} than in the sparse setting. Whether there exists an estimator (in particular, one of low computational complexity) that attains the current lower bound, or whether the lower bound itself can be improved, remains open for future research. However, as we indicated at the end of Section 3, the regime in which an eigenvector is thin but not sparse occupies only a narrow range of the possible parameter configurations.

Acknowledgement

The authors thank the Editor and an anonymous referee for their thoughtful suggestions. The final draft was completed with the hospitality of the Institute of Mathematical Sciences at the National University of Singapore.

APPENDIX A: PROOFS

A.1. Asymptotic risk of the standard PCA estimator

To prove Theorem 2.1, on the risk of the PCA estimator, we use the following lemmas. Throughout, ‖B‖ = sup{‖Bx‖₂ : ‖x‖₂ = 1} denotes the spectral norm on square matrices.

Deviation of extreme Wishart eigenvalues and quadratic forms

In our analysis, we will need a probabilistic bound for deviations of ‖n⁻¹ZZᵀ − I‖. This is given in the following lemma, proven in Appendix B.

Lemma A.1. Let Z be an N × n matrix with i.i.d. N(0, 1) entries. Suppose N < n and set t_n = 8√(n⁻¹ log n) and γ_n = N/n. Then for any c > 0, there exists n_c ≥ 1 such that for all n ≥ n_c,

ℙ(‖n⁻¹ZZᵀ − I_N‖ > γ_n + 2√γ_n + ct_n) ≤ 2n^{−c²}. (A.1)

Lemma A.2 [Johnstone (2001)]. Let χ_n² denote a chi-square random variable with n degrees of freedom. Then

ℙ(χ_n² > n(1 + ε)) ≤ e^{−3nε²/16}  (0 < ε < 1/2), (A.2)
ℙ(χ_n² < n(1 − ε)) ≤ e^{−nε²/4}  (0 < ε < 1), (A.3)
ℙ(χ_n² > n(1 + ε)) ≤ (2/(ε√n)) e^{−nε²/4}  (0 < ε < 1/2, n ≥ 16). (A.4)
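The tail bounds (A.2)–(A.3) can be sanity-checked by simulation. The values of n, ε and the replication count below are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of the chi-square tail bounds (A.2) and (A.3).
rng = np.random.default_rng(1)
n, eps, reps = 100, 0.3, 200_000
chi2 = rng.chisquare(n, size=reps)

upper_emp = np.mean(chi2 > n * (1 + eps))      # left side of (A.2)
lower_emp = np.mean(chi2 < n * (1 - eps))      # left side of (A.3)
bound_up = np.exp(-3 * n * eps**2 / 16)        # right side of (A.2)
bound_low = np.exp(-n * eps**2 / 4)            # right side of (A.3)

assert upper_emp <= bound_up
assert lower_emp <= bound_low
```

Both empirical tail probabilities are roughly an order of magnitude below the corresponding exponential bounds at these parameter values, as expected for bounds of this type.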

Lemma A.3 [Johnstone and Lu (2009)]. Let y_{1i}, y_{2i}, i = 1, …, n, be two sequences of mutually independent, i.i.d. N(0, 1) random variables. Then for large n and any b such that 0 < b ≤ √n,

ℙ(|n⁻¹ ∑_{i=1}^n y_{1i}y_{2i}| > b/√n) ≤ 2 exp{−(3/4)b² + O(n⁻¹b⁴)}. (A.5)

Perturbation of eigen-structure

The following lemma, modified in Appendix B from Paul (2005), is convenient for risk analysis of estimators of eigenvectors. Several variants of this lemma appear in the literature, most based on the approach of Kato (1980). To state it, let the eigenvalues of a symmetric matrix A be denoted by λ1(A) ≥ ⋯ ≥ λm (A), with the convention that λ0(A) = ∞ and λm+1 (A) = −∞. Let Ps denote the projection matrix onto the possibly multidimensional eigenspace corresponding to λs (A) and define

H_r(A) = ∑_{s≠r} (1/(λ_s(A) − λ_r(A))) P_s(A).

Note that Hr (A) may be viewed as the resolvent of A “evaluated at λr (A).”

Lemma A.4. Let A and B be symmetric m × m matrices. Suppose that λr (A) is a unique eigenvalue of A with

δ_r(A) = min{|λ_j(A) − λ_r(A)| : 1 ≤ j ≠ r ≤ m}.

Let p_r(A) denote the unit eigenvector associated with λ_r(A). Then

p_r(A + B) − p_r(A) = H_r(A)Bp_r(A) + R_r, (A.6)

where, if 4‖B‖δ_r⁻¹(A) ≤ 1,

‖R_r‖ ≤ Kδ_r⁻¹(A)‖H_r(A)Bp_r(A)‖ ‖B‖, (A.7)

and we may take K = 30.
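Lemma A.4 can be illustrated numerically on a small, hypothetical pair (A, B) chosen for the purpose; the particular matrices below are our own illustrative choices.

```python
import numpy as np

# Numerical illustration of Lemma A.4 on small symmetric matrices.
rng = np.random.default_rng(2)
m, r = 6, 0                                   # track the leading eigenvector
A = np.diag([5.0, 3.0, 2.0, 1.5, 1.0, 0.5])
G = rng.standard_normal((m, m))
B = 0.05 * (G + G.T) / 2                      # small symmetric perturbation

eigs = np.diag(A)
delta_r = np.min(np.abs(np.delete(eigs, r) - eigs[r]))   # eigengap delta_r(A)
assert 4 * np.linalg.norm(B, 2) <= delta_r               # condition of the lemma

p_r = np.eye(m)[:, r]
# resolvent H_r(A): the P_s are coordinate projections since A is diagonal
H_r = np.diag([0.0 if s == r else 1.0 / (eigs[s] - eigs[r]) for s in range(m)])

w, V = np.linalg.eigh(A + B)
p_pert = V[:, np.argmax(w)]                   # perturbed leading eigenvector
if p_pert @ p_r < 0:                          # resolve the sign ambiguity
    p_pert = -p_pert

R_r = p_pert - p_r - H_r @ B @ p_r            # residual in (A.6)
resid_norm = np.linalg.norm(R_r)
bound = 30 / delta_r * np.linalg.norm(H_r @ B @ p_r) * np.linalg.norm(B, 2)
assert resid_norm <= bound                    # inequality (A.7) with K = 30
```

In this example the residual is second order in ‖B‖ and sits well inside the bound (A.7).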

Proof of Theorem 2.1. First we outline the approach. For notational simplicity, throughout this subsection, we write θ̂_ν to mean θ̂_{ν,PCA}. Recall that the loss function is L(θ̂_ν, θ_ν) = ‖θ̂_ν ⊝ θ_ν‖². Invoking Lemma A.4 with A = ∑ and B = S − ∑, we get

θ̂_ν − θ_ν = H_νSθ_ν + R_ν, (A.8)

where

H_ν ≔ H_ν(∑) = ∑_{1≤μ≠ν≤M} (1/(λ_μ − λ_ν)) θ_μθ_μᵀ − (1/λ_ν) P (A.9)

and P = I − ∑_{μ=1}^M θ_μθ_μᵀ. Note that H_νθ_ν = 0 and that H_ν∑θ_ν = 0.

Let ε_{nν} = Kδ_ν⁻¹(∑)‖S − ∑‖. We have from (A.7) that

‖R_ν‖ ≤ ‖H_νSθ_ν‖ ε_{nν},

and we will show that as n → ∞, εnν → 0 with probability approaching 1 and that

‖H_νSθ_ν‖²(1 − ε_{nν})² ≤ L(θ̂_ν, θ_ν) ≤ ‖H_νSθ_ν‖²(1 + ε_{nν})². (A.10)

Theorem 2.1 then follows from an (exact, nonasymptotic) evaluation,

𝔼[‖H_νSθ_ν‖²] = (N − M)/(n h(λ_ν)) + (1/n) ∑_{μ≠ν} (1 + λ_μ)(1 + λ_ν)/(λ_μ − λ_ν)². (A.11)

We begin with the evaluation of (A.11). First we derive a convenient representation of HνSθν. In matrix form, model (2.2) becomes

X = ∑_{ν=1}^M √λ_ν θ_νυ_νᵀ + Z, (A.12)

where υ_ν = (υ_{νi})_{i=1}^n, for ν = 1, …, M. Also, define

z_ν = Zᵀθ_ν,  w_ν = Xᵀθ_ν = √λ_ν υ_ν + z_ν (A.13)

and

⟨a, b⟩_n = (1/n) ∑_{i=1}^n a_ib_i  for arbitrary a, b ∈ ℝⁿ. (A.14)

Then we have

Sθ_ν = (1/n)Xw_ν = ∑_{μ=1}^M √λ_μ ⟨υ_μ, w_ν⟩_n θ_μ + (1/n)Zw_ν.

Using (A.13),

(1/n)H_νZw_ν = ∑_{μ≠ν} (⟨z_μ, w_ν⟩_n/(λ_μ − λ_ν)) θ_μ − (1/(nλ_ν))PZw_ν.

Using (A.9), H_νθ_μ = (λ_μ − λ_ν)⁻¹θ_μ for μ ≠ ν, and we arrive at the desired representation

H_νSθ_ν = ∑_{μ≠ν} (⟨w_μ, w_ν⟩_n/(λ_μ − λ_ν)) θ_μ − (1/(nλ_ν))PZw_ν. (A.15)

By orthogonality,

‖H_νSθ_ν‖² = ∑_{μ≠ν} ⟨w_μ, w_ν⟩_n²/(λ_μ − λ_ν)² + (1/(n²λ_ν²)) w_νᵀZᵀPZw_ν. (A.16)

Now we compute the expectation. One verifies that the z_ν ~ N(0, I_n) are independent of each other and of the υ_ν ~ N(0, I_n), so that the w_ν ~ N(0, (1 + λ_ν)I_n) are independent. Hence, for μ ≠ ν,

𝔼[⟨w_μ, w_ν⟩_n²] = n⁻²𝔼 tr(w_νw_νᵀw_μw_μᵀ) = n⁻² tr((1 + λ_μ)(1 + λ_ν)I_n) = n⁻¹(1 + λ_μ)(1 + λ_ν). (A.17)

From (A.13),

𝔼[w_νᵀZᵀPZw_ν | Z] = z_νᵀZᵀPZz_ν + λ_ν𝔼[υ_νᵀZᵀPZυ_ν | Z] = tr(ZZᵀPZZᵀθ_νθ_νᵀ) + λ_ν tr(PZZᵀ).

Now, it can be easily verified that if W ≔ ZZᵀ ~ W_N(n, I), then for arbitrary symmetric N × N matrices Q, R, we have

𝔼[tr(WQWR)] = n[tr(QR) + tr(Q)tr(R)] + n² tr(QR). (A.18)

Taking Q = P and R = θ_νθ_νᵀ and noting that QR = 0, by (A.18) we have

𝔼[w_νᵀZᵀPZw_ν] = n tr(P) + nλ_ν tr(P) = n(N − M)(1 + λ_ν). (A.19)

Combining (A.17) with (A.19) in computing the expectation of (A.16), we obtain the expression (A.11) for 𝔼‖H_νSθ_ν‖².
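The trace identity (A.18) is easy to confirm by Monte Carlo for small N and n; the matrices Q and R below are arbitrary symmetric test cases of our own choosing.

```python
import numpy as np

# Monte Carlo check of the Wishart trace identity (A.18).
rng = np.random.default_rng(3)
N_, n_, reps = 3, 5, 200_000
Q = np.diag([1.0, 2.0, 3.0])
R = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

Z = rng.standard_normal((reps, N_, n_))
W = Z @ np.swapaxes(Z, 1, 2)                  # W ~ W_N(n, I), one per replicate
mc = np.einsum('rij,jk,rkl,li->', W, Q, W, R) / reps   # average of tr(WQWR)

exact = (n_ * (np.trace(Q @ R) + np.trace(Q) * np.trace(R))
         + n_**2 * np.trace(Q @ R))
assert abs(mc - exact) <= 0.05 * abs(exact)
```

With these choices the exact value is 270, and the Monte Carlo average agrees to well within the 5% tolerance.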

A.2. Bound for ‖S − ∑‖

We begin with the decomposition of the sample covariance matrix S. Introduce the abbreviation ξ_μ = n⁻¹Zυ_μ. Then

S = ∑_{μ,μ′=1}^M √(λ_μλ_μ′) ⟨υ_μ, υ_μ′⟩_n θ_μθ_μ′ᵀ + ∑_{μ=1}^M √λ_μ (θ_μξ_μᵀ + ξ_μθ_μᵀ) + n⁻¹ZZᵀ (A.20)

and from (2.1), with Vμμ′= |〈υμ, υμ′n − δμμ′| and δμμ′ denoting the Kronecker symbol,

‖S − ∑‖ ≤ ∑_{μ,μ′=1}^M √(λ_μλ_μ′) V_{μμ′} + 2 ∑_{μ=1}^M √λ_μ ‖ξ_μ‖ + ‖n⁻¹ZZᵀ − I‖. (A.21)

We establish a bound for ‖S − ∑‖ with probability converging to one. Introduce notation

η_n = √(N⁻¹ log n),  η̄_n = √(n⁻¹ log n),  γ_n = N/n,  Λ = ∑_{μ=1}^M λ_μ.

Fix c > 0 and assume that γn ≤ 1. Initially, we assume that 2cηn ≤ 1/2, which is equivalent to N ≥ 16c2 log n.

We introduce some events of high probability under which (A.21) may be bounded. Thus, let D1 be the intersection of the events

|⟨υ_μ, υ_μ⟩_n − 1| ≤ 2cη̄_n,  1 ≤ μ ≤ M,
|⟨υ_μ, υ_μ′⟩_n| ≤ cη̄_n,  1 ≤ μ ≠ μ′ ≤ M,
N⁻¹‖Zυ_μ‖²/‖υ_μ‖² ≤ 1 + 2cη_n,  1 ≤ μ ≤ M, (A.22)

and let D2 be the event

‖n⁻¹ZZᵀ − I‖ ≤ γ_n + 2√γ_n + 8cη̄_n. (A.23)

To bound the probability of D₁ᶜ, in the case of the first line of (A.22), use (A.3) and (A.4) with ε = 2cη̄_n. For the second, use (A.5) with b = c√(2 log n). For the third, observe that Zυ_μ/‖υ_μ‖ ~ N_N(0, I), and again use (A.4), this time with ε = 2cη_n ≤ 1/2. For D₂ᶜ, we appeal to Lemma A.1. As a result,

ℙ(D₁ᶜ) ≤ 3Mn^{−c²} + M(M − 1)n^{−(3/2)c²}(1 + O(n⁻¹ log² n)),  ℙ(D₂ᶜ) ≤ 2n^{−c²}. (A.24)

To bound (A.21) on the event D1D2, we use bounds (A.22) and (A.23), and also write

‖ξ_μ‖ = √γ_n · (‖Zυ_μ‖/(√N‖υ_μ‖)) · ⟨υ_μ, υ_μ⟩_n^{1/2} ≤ √γ_n (1 + 2cη_n)^{1/2}(1 + 2cη̄_n)^{1/2} = √γ_n H_n, (A.25)

say, and also noting that η̄n ≤ ηn, we obtain

‖S − ∑‖ ≤ √γ_n [2cη_nΛ + 2√Λ H_n + 4(1 + 2cη_n)]. (A.26)

Now combine the bound 2cη_n ≤ 1/2 with H_n ≤ 3/2 and 2√Λ ≤ Λ + 1 to conclude that on D₁ ∩ D₂,

‖S − ∑‖ ≤ 2(Λ + 4)√γ_n

and so

ε_{nν} ≤ 2Kδ_ν⁻¹(∑)(Λ + 4)√γ_n → 0

since N/n → 0.

Let us now turn to the case N ≤ 16c² log n. We can replace the last event in (A.22) by the event

N⁻¹‖Zυ_μ‖²/‖υ_μ‖² ≤ 2(c² log n + log N),  1 ≤ μ ≤ M,

and the corresponding bound on ℙ(D₁ᶜ) holds for sufficiently large n, using the bound ℙ(N⁻¹χ_N² > a) ≤ 2N(1 − Φ(√a)) ≤ N√(2/(aπ)) e^{−a/2} for any a > 0. In (A.25), we replace the term (1 + 2cη_n)^{1/2} by (2c² log n + 2 log N)^{1/2}, which may be bounded by a₁√(log n). As soon as N ≥ 4c², we also have 2cη_n ≤ √(log n), and so 1 + 2cη̄_n ≤ 1 + √(γ_n log n). This leads to a bound for the analog of H_n in (A.26) and so to

‖S − ∑‖ ≤ √(γ_n log n) {Λ + 2a₁√Λ (1 + √(γ_n log n))^{1/2} + a₂}.

When N ≤ 16c² log n, we have √(γ_n log n) ≤ 4c log n/√n and so

ε_{nν} ≤ a₃Kδ_ν⁻¹(∑)(Λ + 1) log n/√n → 0.

To summarize, choose c = √2, say, so that D_n = D₁ ∩ D₂ has probability at least 1 − O(n⁻²), and on D_n we have ε_{nν} → 0. This completes the proof of (A.10).

Theorem 2.1 now follows from noticing that L(θ̂ν, θν) ≤ 2 and so

𝔼[L(θ̂_ν, θ_ν); D_nᶜ] ≤ 2ℙ(D_nᶜ) = O(n⁻²) = o(𝔼‖H_νSθ_ν‖²)

and an additional computation using (A.16) which shows that

𝔼[‖H_νSθ_ν‖²; D_nᶜ] ≤ (𝔼[‖H_νSθ_ν‖⁴])^{1/2}(ℙ(D_nᶜ))^{1/2} = o(𝔼‖H_νSθ_ν‖²).

A.3. Lower bound on the minimax risk

In this subsection, we prove Theorems 3.1 and 3.2. The key idea in the proofs is to utilize the geometry of the parameter space in order to construct appropriate finite-dimensional subproblems for which bounds are easier to obtain. We first give an overview of the general machinery used in the proof.

Risk bounding strategy

A key tool for deriving lower bounds on the minimax risk is Fano’s lemma. In this subsection, we use superscripts on vectors θ as indices, not exponents. First, we fix ν ∈ {1, … , M} and then construct a large finite subset ℱ of ΘqM (C1, … , CM), such that for some δ > 0, to be chosen

for all θ¹, θ² ∈ ℱ with θ¹ ≠ θ²,  L(θ_ν¹, θ_ν²) ≥ 4δ.

This property will be referred to as "4δ-distinguishability in θ_ν." Given any estimator θ̂ of θ, based on data Xⁿ = (X₁, …, X_n), define a new estimator ϕ(Xⁿ) = θ* ∈ ℱ, whose M components are given by θ_μ* = arg min_{θ∈ℱ} L(θ̂_μ, θ_μ), where θ̂_μ is the μth column of θ̂. Then, by Chebyshev's inequality and the 4δ-distinguishability in θ_ν, it follows that

sup_{θ∈Θ_q^M(C₁,…,C_M)} 𝔼_θ L(θ̂_ν, θ_ν) ≥ δ sup_{θ∈ℱ} ℙ_θ(ϕ(Xⁿ) ≠ θ). (A.27)

The task is then to find an appropriate lower bound for the quantity on the right-hand side of (A.27). For this, we use the following version of Fano’s lemma, due to Birgé (2001), modifying a result of Yang and Barron (1999), pages 1570 and 1571.

Lemma A.5. Let {P_θ : θ ∈ Θ} be a family of probability distributions on a common measurable space, where Θ is an arbitrary parameter set. Let p_max be the minimax risk over Θ, with the loss function L′(θ, θ′) = 1_{θ≠θ′},

p_max = inf_T sup_{θ∈Θ} ℙ_θ(T ≠ θ) = inf_T sup_{θ∈Θ} 𝔼 L′(θ, T),

where T denotes an arbitrary estimator of θ with values in Θ. Then for any finite subset ℱ of Θ, with elements θ1, … ,θJ where J = |ℱ |,

p_max ≥ 1 − [inf_Q J⁻¹ ∑_{i=1}^J K(P_i, Q) + log 2]/log J, (A.28)

where Pi = ℙθi, and Q is an arbitrary probability distribution, and K(Pi,Q) is the Kullback–Leibler divergence of Q from Pi.

The following lemma, proven in Appendix B, gives the Kullback–Leibler discrepancy corresponding to two different values of the parameter.

Lemma A.6. Let θ^j ≔ [θ₁^j : ⋯ : θ_M^j], j = 1, 2, be two parameters (i.e., for each j, the θ_k^j are orthonormal). Let ∑_j denote the matrix given by (2.1) with θ = θ^j (and σ = 1). Let P_j denote the joint probability distribution of n i.i.d. observations from N(0, ∑_j) and let η(λ) = λ/(1 + λ). Then the Kullback–Leibler discrepancy of P₂ with respect to P₁ is given by

Κ₁,₂ ≔ K(θ¹, θ²) = (n/2)[∑_{μ=1}^M η(λ_μ)λ_μ − ∑_{μ=1}^M ∑_{μ′=1}^M λ_μ η(λ_μ′)⟨θ_μ¹, θ_μ′²⟩²]. (A.29)

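As a check on (A.29), the formula can be compared with the standard closed-form Kullback–Leibler divergence between two centered Gaussians; the dimensions, eigenvalues and random orthonormal frames below are illustrative choices.

```python
import numpy as np

# Check of formula (A.29) (with n = 1) against the closed-form KL divergence.
rng = np.random.default_rng(4)
N, M = 6, 2
lam = np.array([3.0, 1.5])
eta = lam / (1 + lam)

th1, _ = np.linalg.qr(rng.standard_normal((N, M)))   # orthonormal columns
th2, _ = np.linalg.qr(rng.standard_normal((N, M)))
S1 = np.eye(N) + (th1 * lam) @ th1.T                 # Sigma_1 as in (2.1)
S2 = np.eye(N) + (th2 * lam) @ th2.T

# closed-form KL( N(0, S1) || N(0, S2) ) for one observation
direct = 0.5 * (np.trace(np.linalg.solve(S2, S1)) - N
                + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

# formula (A.29): cross[mu, mu'] = <theta^1_mu, theta^2_mu'>^2
cross = (th1.T @ th2) ** 2
formula = 0.5 * (np.sum(eta * lam) - lam @ cross @ eta)

assert abs(direct - formula) < 1e-10
```

The two quantities agree to machine precision, as they must, since the log-determinant terms cancel when the two covariances share the same spiked eigenvalues.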

Geometry of the hypothesis set and sphere packing

Next, we describe the construction of a large set of hypotheses ℱ, satisfying the 4δ distinguishability condition. Our construction is based on the well-studied sphere-packing problem, namely how many unit vectors can be packed onto 𝕊m−1 with given minimal pairwise distance between any two vectors.

Here we follow the construction due to Zong (1999) (page 77). Let m be a large positive integer, and m0 = ⌊2m/9⌋. Define Ym* as the maximal set of points of the form z = (z1, … , zm) in 𝕊m−1 such that the following is true:

√m₀ z_i ∈ {−1, 0, 1} for all i,  ∑_{i=1}^m |z_i| = √m₀

and

for z ≠ z′ ∈ Y_m*,  ‖z − z′‖ ≥ 1.

For any m ≥ 1, the maximal number of points lying on 𝕊^{m−1} such that any two points are at distance at least 1, is called the kissing number of an m-sphere. Zong (1999) used the construction described above to derive a lower bound on the kissing number, by showing that |Y_m*| ≥ (9/8)^{m(1+o(1))} for m large.
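A greedy version of this construction can be carried out explicitly for small m. A greedy pass generally produces fewer points than the maximal set Y_m*, but it already exceeds the (9/8)^m count needed for the lower bound; the value of m below is an illustrative choice.

```python
import itertools
import numpy as np

# Greedy sketch of a packing of the type used for Y*_m, for small m.
m = 9
m0 = (2 * m) // 9                            # = floor(2m/9)
candidates = []
for support in itertools.combinations(range(m), m0):
    for signs in itertools.product([-1.0, 1.0], repeat=m0):
        z = np.zeros(m)
        z[list(support)] = np.array(signs) / np.sqrt(m0)
        candidates.append(z)                 # unit vector, entries in {0, +-1/sqrt(m0)}

packing = []
for z in candidates:                         # keep z if >= 1 away from all kept points
    if all(np.linalg.norm(z - y) >= 1.0 for y in packing):
        packing.append(z)

assert all(abs(np.linalg.norm(z) - 1.0) < 1e-12 for z in packing)
assert len(packing) >= (9 / 8) ** m          # consistent with |Y*_m| >= (9/8)^{m(1+o(1))}
```

Even at m = 9 the greedy packing retains far more than (9/8)^9 ≈ 2.9 points, since most pairs of candidates with m₀ = 2 nonzero entries are automatically at distance at least 1.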

Next, for m ≤ N − M we use the sets Y_m* to construct our hypothesis set ℱ of the same size, |ℱ| = |Y_m*|. To this end, let {e_μ}_{μ=1}^N denote the standard basis of ℝᴺ. Our initial hypothesis θ⁰ is composed of the first M standard basis vectors, θ⁰ = [e₁ : ⋯ : e_M]. Then, for fixed ν, and values of m, r yet to be determined, each of the other hypotheses θ^j ∈ ℱ has the same columns as θ⁰ for k ≠ ν. The difference is that the νth column is instead given by

θ_ν^j = √(1 − r²) e_ν + r ∑_{l=1}^m z_l^j e_{M+l},  j = 1, …, |ℱ|, (A.30)

where z^j = (z₁^j, …, z_m^j), j ≥ 1, is an enumeration of the elements of Y_m*. Thus θ_ν^j perturbs e_ν in subsets of the fixed set of coordinates {M + 1, …, M + m}, according to the sphere-packing construction for 𝕊^{m−1}.

The construction ensures that θ₁^j, …, θ_M^j are orthonormal for each j. In particular, ⟨θ_μ^j, e_μ′⟩ vanishes unless μ = μ′, and so (A.29) simplifies to

K(θ^j, θ⁰) = (1/2) n h(λ_ν)(1 − ⟨θ_ν^j, θ_ν⁰⟩²) = (1/2) n h(λ_ν) r² (A.31)

for j = 1, …, |ℱ|. Finally, ⟨θ_ν^j, θ_ν^k⟩ = 1 − r² + r²⟨z^j, z^k⟩, and so by construction, for any θ^j, θ^k ∈ ℱ with j ≠ k, we have

L(θ_ν^j, θ_ν^k) ≥ r². (A.32)

In other words, the set ℱ is r²-distinguishable in θ_ν. Consequently, combining (A.27), (A.28) and (A.31) (taking Q = P_{θ⁰} in Lemma A.5), we have

R_ν* = inf_{θ̂_ν} sup_{Θ_q(C)} 𝔼 L(θ̂_ν, θ_ν) ≥ (r²/4)[1 − a(r, ℱ)] (A.33)

with

a(r, ℱ) = [n h(λ_ν) r²/2 + log 2]/log |ℱ|. (A.34)

Proof of Theorem 3.1. It remains to specify m and let r ∈ (0, 1). Let Ym* be the sphere-packing set defined above, and let ℱ be the corresponding set of hypotheses, defined via (A.30).

Let c₁ = log(9/8); then we have log |ℱ| ≥ b_m c₁ m, where b_m → 1 as m → ∞. Now choose r = r(m) so that a(r, ℱ) ≤ 3/4 asymptotically in the bound (A.33). To accomplish this, set

r² = c₁ m/(n h(λ_ν)). (A.35)

Indeed, inserting this into (A.34) we find that

a(r, ℱ) ≤ [c₁ m/2 + log 2]/(b_m c₁ m).

Therefore, so long as m ≥ m*, an absolute constant, we have a(r, ℱ) ≤ 3/4 and hence R_ν* ≥ r²/16 = (c₁/16) m τ_ν².

We also need to ensure that θ_ν^j ∈ Θ_q(C_ν). Since exactly m₀ coordinates are nonzero out of {M + 1, …, M + m},

‖θ_ν^j‖_q^q = (1 − r²)^{q/2} + r^q m₀^{1−q/2} ≤ 1 + a_q r^q m^{1−q/2},

where a_q = (2/9)^{1−q/2}. A sufficient condition for θ_ν^j ∈ Θ_q(C_ν) is that

a_q m (r²/m)^{q/2} ≤ C̄_ν^q. (A.36)

Our choice (A.35) fixes r²/m, and so, recalling that A_q = 1/(a_q c₁^{q/2}), the previous display becomes

m ≤ A_q C̄_ν^q [n h(λ_ν)]^{q/2}.

To simultaneously ensure that (i) r² < 1, (ii) m does not exceed the number of available coordinates, N − M, and (iii) θ_ν^j ∈ Θ_q(C_ν), we set

m = ⌊m′⌋,  where m′ = min{n h(λ_ν), N − M, A_q C̄_ν^q (n h(λ_ν))^{q/2}}.

Recalling (3.1), (3.2) and (3.3), we have

m′ = min{τ_ν⁻², N − M, m_ν} = τ_ν⁻² min{1, τ_ν² · min{N − M, m_ν}}.

To complete the proof of Theorem 3.1, set B₁ = [m*/(m* + 1)] c₁/16 and observe that

R_ν* ≥ B₁ m′ τ_ν².

Proof of Theorem 3.2. The construction of the set of hypotheses in the proof of Theorem 3.1 considered a fixed set of potential nonzero coordinates, namely {M + 1, … ,M + m}. However, in the sparse setting, when the effective dimension is significantly smaller than the nominal dimension N, it is possible to construct a much larger collection of hypotheses by allowing the set of nonzero coordinates to span all remaining coordinates {M + 1, … , N}.

In the proof of Theorem 3.2 we shall use the following lemma, proven in Appendix B. Call A ⊂ {1, … , N} an m-set if |A| = m.

Lemma A.7. Let k be fixed, and let 𝒜k be the maximal collection of m-sets such that the intersection of any two members has cardinality at most k −1. Then, necessarily,

|𝒜_k| ≥ (N choose k)/(m choose k)². (A.37)

Let k = ⌊m₀/2⌋ + 1 and m₀ = ⌊βm⌋ with 0 < β < 1. Suppose that m, N → ∞ with m = o(N). Then

|𝒜_k| ≥ exp{[N ε(βm/2N) − 2m ε(β/2)](1 + o(1))}, (A.38)

where ε(x) is the Shannon entropy function,

ε(x) = −x log(x) − (1 − x) log(1 − x),  0 < x < 1.

Let π be an m-set contained in {M + 1, … , N}, and construct a family ℱπ by modifying (A.30) to use the set π rather than the fixed set {M +1, … , M + m} as in Theorem 3.1,

θ_ν^{(j,π)} = √(1 − r²) e_ν + r ∑_{l∈π} z_l^j e_l,  j = 1, …, |Y_m*|.

We will choose m below to ensure that θν(j,π)Θq(Cν). Let 𝒫 be a collection of sets π such that, for any two sets π and π′ in 𝒫, the set π ∩ π′ has cardinality at most m0/2. This ensures that the sets ℱπ are disjoint for π ≠ π′, since each θν(j,π) is nonzero in exactly m0 + 1 coordinates. This construction also ensures that

for all θ¹ ≠ θ² ∈ ∪_{π∈𝒫} ℱ_π,  L(θ_ν¹, θ_ν²) ≥ (m₀/2 + m₀/2)(r/√m₀)² = r².

Define ℱ ≔ ∪_{π∈𝒫} ℱ_π. Then

|ℱ| = |∪_{π∈𝒫} ℱ_π| = |𝒫| |Y_m*| ≥ |𝒫| (9/8)^{m(1+o(1))}. (A.39)

By Lemma A.7, there is a collection 𝒫 such that |𝒫| is at least exp([Nε(m/9N)− 2mε(1/9)](1 + o(1))). Since ε(x) ≥ − x log x, it follows from (A.39) that

log |ℱ| ≥ m[(1/9) log(9N/m) − 2ε(1/9) + log(9/8)](1 + o(1)) ≥ (α/9) m log N (1 + o(1)),

since m = O(N^{1−α}).

Proceeding as for Theorem 3.1, we have log |ℱ| ≥ bm (α/9)m log N, where bm → 1. Let us set (with m still to be specified)

r² = m(α/9) log N/(n h(λ_ν)) = m τ̄_ν². (A.40)

Again, so long as m ≥ m*, we have a(r, ℱ) ≤ 3/4 and R_ν* ≥ r²/16 = (1/16) m τ̄_ν². We also need to ensure that θ_ν^{(j,π)} ∈ Θ_q(C_ν), which as before is implied by (A.36). Substituting (A.40) puts this into the form

m ≤ m̄_ν = a_q⁻¹ (C̄_ν/τ̄_ν)^q.

To simultaneously ensure that (i) r² < 1, (ii) m does not exceed the number of available coordinates, N − M, and (iii) θ_ν^{(j,π)} ∈ Θ_q(C_ν), we set

m = ⌊m′⌋,  where m′ = min{τ̄_ν⁻², N − M, m̄_ν}.

The assumption that C̄_ν^q n^{q/2} = O(N^{1−α}) for some α ∈ (0, 1) is equivalent to the assertion m̄_ν = O(N^{1−α}), and so for n sufficiently large, m̄_ν ≤ N − M; hence m′ = m̄_ν so long as m̄_ν τ̄_ν² ≤ 1. Theorem 3.2 now follows from our bound on R_ν*.

A.4. Lower bound on the risk of the DT estimator

To prove Theorem 4.1, assume w.l.o.g. that ⟨θ̂_{1,DT}, θ₁⟩ > 0, and decompose the loss as

L(θ̂_{1,DT}, θ₁) = ‖θ₁ − θ_{1,I}‖² + ‖θ̂_{1,DT} − θ_{1,I}‖², (A.41)

where I = In) is the set of coordinates selected by the DT scheme and θ1, I denotes the subvector of θ1 corresponding to this set. Note that, in (A.41), the first term on the right can be viewed as a bias term while the second term can be seen as a variance term.

We choose a particular vector θ1 = θ* ∈ Θq(C) so that

𝔼‖θ* − θ*_{,I}‖² ≥ K C̄_q n^{−(1−q/2)/2}. (A.42)

This, together with (A.41), proves Theorem 4.1, since the worst-case risk is clearly at least as large as (A.42). Accordingly, set r_n = C̄_q^{1/2} n^{−(1−q/2)/4}, where C̄_q = C^q − 1. Since C̄_q n^{q/4} = o(n^{1/2}), we have r_n = o(1), and so for sufficiently large n, we can take r_n < 1 and define

θ*_{,k} = √(1 − r_n²)  if k = 1,
θ*_{,k} = r_n/√m_n  if 2 ≤ k ≤ m_n + 1,
θ*_{,k} = 0  if m_n + 2 ≤ k ≤ N,

where m_n = ⌊(1/2) C̄_q n^{q/4}⌋. Then by construction θ* ∈ Θ_q(C), since

∑_{k=1}^N |θ*_{,k}|^q = (1 − r_n²)^{q/2} + r_n^q m_n^{1−q/2} < 1 + r_n^q m_n^{1−q/2} ≤ 1 + C̄_q 2^{−(1−q/2)} < C^q,

where the last inequality is due to q ∈ (0, 2) and C̄_q = C^q − 1.
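The membership θ* ∈ Θ_q(C) claimed above can be verified numerically for specific parameter values; the choices of q, C, n, N below are illustrative.

```python
import numpy as np

# Numerical check that theta* as defined above lies in Theta_q(C).
q, C, n, N = 1.0, 2.0, 10_000, 500
Cq_bar = C**q - 1.0                                  # bar{C}_q = C^q - 1
r_n = np.sqrt(Cq_bar) * n ** (-(1 - q / 2) / 4)
m_n = int(0.5 * Cq_bar * n ** (q / 4))               # floor((1/2) bar{C}_q n^{q/4})

theta = np.zeros(N)
theta[0] = np.sqrt(1 - r_n**2)
theta[1:m_n + 1] = r_n / np.sqrt(m_n)

l2 = np.sum(theta**2)
lq = np.sum(np.abs(theta) ** q)
assert abs(l2 - 1.0) < 1e-9                          # unit vector
assert lq < C**q                                     # theta* lies in Theta_q(C)
```

For these values r_n ≈ 0.316 and m_n = 5, and the ℓ_q mass comes out around 1.66, comfortably below C^q = 2.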

For notational convenience, let α_n = γ√(log N/n). Recall that DT selects all coordinates k for which S_kk > 1 + α_n. Since S_kk ~ (1 + λ₁θ*_{,k}²)χ_n²/n, coordinate k is not selected with probability

p_k = ℙ(S_kk < 1 + α_n) = ℙ(χ_n² < n(1 + ε_n)), (A.43)

where ε_n = (1 + α_n)/(1 + λ₁θ*_{,k}²) − 1. Notice that, for k = 2, …, m_n + 1, p_k = p₂, and θ*_{,k} = 0 for k > m_n + 1. Hence,

𝔼‖θ* − θ*_{,I}‖² = ∑_{k=1}^N p_k|θ*_{,k}|² > p₂ ∑_{k=2}^{m_n+1} |θ*_{,k}|² = p₂ r_n² = p₂ C̄_q n^{−(1−q/2)/2}.

Now, use bound (A.2) to show that nε_n² → ∞ in (A.43) and hence that p₂ → 1. Indeed, θ*_{,2}² = r_n²/m_n = 2n^{−1/2}(1 + o(1)), and so

ε_n = (α_n − λ₁θ*_{,k}²)/(1 + λ₁θ*_{,k}²) ≥ (1/(2√n))[γ√(log N) − 2λ₁]

for sufficiently large n. Hence, nεn2, and the proof is complete.

APPENDIX B: PROOFS OF RELEVANT LEMMAS

B.1. Proof of Lemma A.1

We use the following result on extreme eigenvalues of Wishart matrices from Davidson and Szarek (2001).

Lemma B.1. Let Z be a p × q matrix of i.i.d. N (0, 1) entries with p ≤ q. Let smax (Z) and smin(Z) denote the largest and the smallest singular value of Z, respectively. Then

ℙ(s_max(Z/√q) > 1 + √(p/q) + t) ≤ e^{−qt²/2}, (B.1)
ℙ(s_min(Z/√q) < 1 − √(p/q) − t) ≤ e^{−qt²/2}. (B.2)

Observe first that

Δ ≔ ‖n⁻¹ZZᵀ − I_N‖ = max{λ₁(n⁻¹ZZᵀ) − 1, 1 − λ_N(n⁻¹ZZᵀ)}.

Let s₊ and s₋ denote the maximum and minimum singular values of n^{−1/2}Z. Define γ(t) ≔ √(N/n) + t for t > 0. Then, since Δ = max{s₊² − 1, 1 − s₋²}, and letting Δ_n(t) ≔ 2γ(t) + γ(t)², we have

{Δ > Δ_n(t)} ⊂ {s₊ > 1 + γ(t)} ∪ {s₋ < 1 − γ(t)}.

We apply Lemma B.1 with p = N and q = n, and get

ℙ(Δ > Δ_n(t)) ≤ 2e^{−nt²/2}.

We observe that, with γn = N/n ≤ 1,

Δ_n(t) = (N/n + 2√(N/n)) + t(2 + t + 2√(N/n)) ≤ γ_n + 2√γ_n + t(4 + t). (B.3)

Now choose t = c√(2 log n/n), so that the tail probability is at most 2e^{−nt²/2} = 2n^{−c²}. The result is now proved, since if c√(log n/n) ≤ 1, then t(4 + t) ≤ ct_n.

B.2. Proof of Lemma A.4

Paul (2005) introduced the quantities

Δ_r ≔ (1/2)[‖H_r(A)B‖ + |λ_r(A + B) − λ_r(A)| ‖H_r(A)‖], (B.4)

Δ̄_r ≔ ‖B‖ / min_{1≤j≠r≤m} |λ_j(A) − λ_r(A)|, (B.5)

and showed that the residual term Rr can be bounded by

‖R_r‖ ≤ min{10Δ̄_r², ‖H_r(A)Bp_r(A)‖ [2Δ_r(1 + 2Δ_r)/(1 − 2Δ_r(1 + 2Δ_r)) + ‖H_r(A)Bp_r(A)‖/(1 − 2Δ_r(1 + 2Δ_r))²]}, (B.6)

where the second bound holds only if Δ_r < (√5 − 1)/4.

We now show that if Δ̄_r ≤ 1/4, then we can simplify bound (B.6) to obtain (A.7). To see this, note that |λ_r(A + B) − λ_r(A)| ≤ ‖B‖ and that ‖H_r(A)‖ ≤ [min_{j≠r} |λ_j(A) − λ_r(A)|]⁻¹, so that

Δ_r ≤ ‖H_r(A)‖ ‖B‖ ≤ Δ̄_r.

Now, defining δ ≔ 2Δ̄_r(1 + 2Δ̄_r) and β ≔ ‖H_r(A)Bp_r(A)‖, we have 10Δ̄_r² ≤ (5/2)δ², and the bound (B.6) may be expressed as

‖R_r‖ ≤ (βδ/(1 − δ)) min{(5/2)·δ(1 − δ)/β, 1 + β/(δ(1 − δ))}.

For x > 0, the function x ↦ min{5x/2, 1 + 1/x} is bounded by 5/2. Further, if Δ̄_r ≤ 1/4, then δ ≤ 3Δ̄_r ≤ 3/4, and so we conclude that

‖R_r‖ ≤ 10βδ ≤ 30βΔ̄_r.

B.3. Proof of Lemma A.6

Recall that, if distributions F1 and F2 have density functions f1 and f2, respectively, such that the support of f1 is contained in the support of f2, then the Kullback–Leibler discrepancy of F2 with respect to F1, to be denoted by K(F1, F2), is given by

K(F₁, F₂) = ∫ log(f₁(y)/f₂(y)) f₁(y) dy. (B.7)

For n i.i.d. observations Xi, i = 1, … , n, the Kullback–Leibler discrepancy is just n times the Kullback–Leibler discrepancy for a single observation. Therefore, without loss of generality we take n = 1. Since

∑⁻¹ = I − ∑_{μ=1}^M η(λ_μ) θ_μθ_μᵀ, (B.8)

the log-likelihood function for a single observation is given by

log f(x|θ) = −(N/2) log(2π) − (1/2) log|∑| − (1/2) xᵀ∑⁻¹x = −(N/2) log(2π) − (1/2) ∑_{μ=1}^M log(1 + λ_μ) − (1/2)(⟨x, x⟩ − ∑_{μ=1}^M η(λ_μ)⟨x, θ_μ⟩²). (B.9)

From (B.9), we have

Κ₁,₂ = 𝔼_{θ¹}[log f(X|θ¹) − log f(X|θ²)]
= (1/2) ∑_{μ=1}^M η(λ_μ)[𝔼_{θ¹}⟨X, θ_μ¹⟩² − 𝔼_{θ¹}⟨X, θ_μ²⟩²]
= (1/2) ∑_{μ=1}^M η(λ_μ)[⟨θ_μ¹, ∑₁θ_μ¹⟩ − ⟨θ_μ², ∑₁θ_μ²⟩]
= (1/2) ∑_{μ=1}^M η(λ_μ)[λ_μ − ∑_{μ′=1}^M λ_μ′⟨θ_μ², θ_μ′¹⟩²],

which equals the RHS of (A.29) (after exchanging the roles of μ and μ′ in the double sum), since the columns of θ^j are orthonormal for each j = 1, 2.

B.4. Proof of Lemma A.7

Let 𝒫_m be the collection of all m-sets of {1, …, N}; clearly |𝒫_m| = (N choose m). For any m-set A, let ℐ(A) denote the collection of "inadmissible" m-sets A′ for which |A ∩ A′| ≥ k. Clearly

|ℐ(A)| ≤ (m choose k)(N−k choose m−k).

If 𝒜k is maximal, then 𝒫m = ∪A∈𝒜k ℐ(A), and so (A.37) follows from the inequality

|𝒫_m| ≤ |𝒜_k| · max_A |ℐ(A)|

and rearrangement of factorials.
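The counting bound (A.37) can be checked by brute force for small parameters. A greedy pass over all m-sets produces a maximal (non-extendable) family with pairwise intersections of size at most k − 1, and the argument above applies to any such maximal family; N, m, k below are chosen small enough to enumerate.

```python
import itertools
from math import comb

# Brute-force check of the counting bound (A.37) for small parameters.
N, m, k = 10, 4, 3
family = []
for cand in itertools.combinations(range(N), m):
    # admit cand only if it intersects every current member in at most k - 1 points
    if all(len(set(cand) & set(A)) <= k - 1 for A in family):
        family.append(cand)

bound = comb(N, k) / comb(m, k) ** 2          # = 120 / 16 = 7.5 here
assert len(family) >= bound
```

The greedy family here is much larger than the bound, as expected: (A.37) is a crude lower bound obtained from the covering argument above.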

Turning to the second part, we recall that Stirling’s formula shows that if k and N → ∞,

(N choose k) = ζ (N/(2πk(N − k)))^{1/2} exp{N ε(k/N)},

where ζ ∈ (1 − (6k)⁻¹, 1 + (12N)⁻¹). The coefficient multiplying the exponential in (N choose k)/(m choose k)² is

√(2πk) (1 − k/N)^{−1/2} (1 − k/m) ~ √(πβm) (1 − β/2)

under our assumptions, and this yields (A.38).

Footnotes

1. Supported by NSF Grant DMS-09-06812 and NIH Grant R01 EB001988.

2. Supported by a grant from the Israeli Science Foundation (ISF).

3. Supported by NSF Grants DMR-1035468 and DMS-11-06690.

Contributor Information

Aharon Birnbaum, School of Computer Science and Engineering Hebrew University of Jerusalem The Edmond J. Safra Campus Jerusalem, 91904 Israel aharob01@cs.huji.ac.il.

Iain M. Johnstone, Department of Statistics Stanford University Stanford, California 94305 USA imj@stanford.edu.

Boaz Nadler, Department of Computer Science and Applied Mathematics Weizmann Institute of Science P.O. Box 26, Rehovot, 76100 Israel boaz.nadler@weizmann.ac.il.

Debashis Paul, Department of Statistics University of California Davis, California 95616 USA debashis@wald.ucdavis.edu.

REFERENCES

1. Amini AA, Wainwright MJ. High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 2009;37:2877–2921. MR2541450.
2. Anderson TW. Asymptotic theory for principal component analysis. Ann. Math. Statist. 1963;34:122–148. MR0145620.
3. Baik J, Silverstein JW. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 2006;97:1382–1408. MR2279680.
4. Bickel PJ, Levina E. Covariance regularization by thresholding. Ann. Statist. 2008a;36:2577–2604. MR2485008.
5. Bickel PJ, Levina E. Regularized estimation of large covariance matrices. Ann. Statist. 2008b;36:199–227. MR2387969.
6. Birgé L. A new look at an old result: Fano's lemma. Technical report. Univ. Paris; 2001.
7. Cai T, Liu W. Adaptive thresholding for sparse covariance matrix estimation. J. Amer. Statist. Assoc. 2011;106:672–684. MR2847949.
8. Cai TT, Ma Z, Wu Y. Sparse PCA: Optimal rates and adaptive estimation. Technical report. 2012. Available at arXiv:1211.1309.
9. Cai TT, Zhang C-H, Zhou HH. Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 2010;38:2118–2144. MR2676885.
10. Cai TT, Zhou HH. Minimax estimation of large covariance matrices under l1 norm. Statist. Sinica. 2012;22:1319–1378.
11. D'Aspremont A, El Ghaoui L, Jordan MI, Lanckriet GRG. A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 2007;49:434–448 (electronic). MR2353806.
12. Davidson KR, Szarek SJ. Local operator theory, random matrices and Banach spaces. In: Johnson WB, Lindenstrauss J, editors. Handbook of the Geometry of Banach Spaces, Vol. I. Amsterdam: North-Holland; 2001. pp. 317–366. MR1863696.
13. El Karoui N. Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 2008;36:2717–2756. MR2485011.
14. Johnstone IM. Chi-square oracle inequalities. In: de Gunst M, Klaassen C, van der Vaart A, editors. State of the Art in Probability and Statistics (Leiden, 1999). Institute of Mathematical Statistics Lecture Notes—Monograph Series, Vol. 36. Beachwood, OH: IMS; 2001. pp. 399–418. MR1836572.
15. Johnstone IM, Lu AY. On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Statist. Assoc. 2009;104:682–693. doi:10.1198/jasa.2009.0121. MR2751448.
16. Jolliffe IT. Principal Component Analysis. Berlin: Springer; 2002.
17. Kato T. Perturbation Theory for Linear Operators. New York: Springer; 1980.
18. Kritchman S, Nadler B. Determining the number of components in a factor model from limited noisy data. Chemometrics and Intelligent Laboratory Systems. 2008;94:19–32.
19. Lu AY. Sparse principal components analysis for functional data. Ph.D. thesis. Stanford Univ.; 2002.
20. Ma Z. Sparse principal component analysis and iterative thresholding. Technical report, Dept. Statistics, The Wharton School, Univ. Pennsylvania, Philadelphia, PA; 2011.
21. Muirhead RJ. Aspects of Multivariate Statistical Theory. New York: Wiley; 1982. MR0652932.
22. Nadler B. Finite sample approximation results for principal component analysis: A matrix perturbation approach. Ann. Statist. 2008;36:2791–2817. MR2485013.
23. Nadler B. Discussion of "On consistency and sparsity for principal component analysis in high dimensions". J. Amer. Statist. Assoc. 2009;104:694–697. doi:10.1198/jasa.2009.0121. MR2751449.
24. Onatski A. Determining the number of factors from empirical distribution of eigenvalues. Technical report. Columbia Univ.; 2006.
25. Paul D. Nonparametric estimation of principal components. Ph.D. thesis. Stanford Univ.; 2005.
26. Paul D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sinica. 2007;17:1617–1642. MR2399865.
27. Paul D, Johnstone IM. Augmented sparse principal component analysis for high dimensional data. Technical report. Univ. California, Davis; 2007. Available at arXiv:1202.1242.
28. Rothman AJ, Levina E, Zhu J. Generalized thresholding of large covariance matrices. J. Amer. Statist. Assoc. 2009;104:177–186. MR2504372.
29. Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. J. Multivariate Anal. 2008;99:1015–1034. MR2419336.
30. Shen D, Shen H, Marron JS. Consistency of sparse PCA in high dimension, low sample size contexts. Technical report. 2011. Available at http://arxiv.org/pdf/1104.4289v1.pdf.
31. Tipping ME, Bishop CM. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 1999;61:611–622. MR1707864.
32. Van Trees HL. Optimum Array Processing. New York: Wiley; 2002.
33. Vu VQ, Lei J. Minimax rates of estimation for sparse PCA in high dimensions. Technical report. 2012. Available at http://arxiv.org/pdf/1202.0786.pdf.
34. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10:515–534. doi:10.1093/biostatistics/kxp008.
35. Yang Y, Barron A. Information-theoretic determination of minimax rates of convergence. Ann. Statist. 1999;27:1564–1599. MR1742500.
36. Zong C. Sphere Packings. New York: Springer; 1999. MR1707318.
37. Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J. Comput. Graph. Statist. 2006;15:265–286. MR2252527.
