Abstract
This paper considers a sparse spiked covariance matrix model in the high-dimensional setting and studies the minimax estimation of the covariance matrix and the principal subspace as well as the minimax rank detection. The optimal rate of convergence for estimating the spiked covariance matrix under the spectral norm is established, which requires significantly different techniques from those for estimating other structured covariance matrices such as bandable or sparse covariance matrices. We also establish the minimax rate under the spectral norm for estimating the principal subspace, the primary object of interest in principal component analysis. In addition, the optimal rate for the rank detection boundary is obtained. This result also resolves a gap in a recent paper by Berthet and Rigollet [2], where the special case of rank one is considered.
Keywords: Covariance matrix, Group sparsity, Low-rank matrix, Minimax rate of convergence, Sparse principal component analysis, Principal subspace, Rank detection
1 Introduction
The covariance matrix plays a fundamental role in multivariate analysis. Many methodologies, including discriminant analysis, principal component analysis and clustering analysis, rely critically on the knowledge of the covariance structure. Driven by a wide range of contemporary applications in many fields including genomics, signal processing, and financial econometrics, estimation of covariance matrices in the high-dimensional setting is of particular interest.
There have been significant recent advances on the estimation of a large covariance matrix and its inverse, the precision matrix. A variety of regularization methods, including banding, tapering, thresholding and penalization, have been introduced for estimating several classes of covariance and precision matrices with different structures. See, for example, [3, 4, 8, 12, 13, 17, 22, 26, 36, 41], among many others.
1.1 Sparse spiked covariance matrix model
In the present paper, we consider spiked covariance matrix models in the high-dimensional setting, which arise naturally from factor models with homoscedastic noise. To be concrete, suppose that we observe an n × p data matrix X with the rows X1*, …, Xn* i.i.d. following a multivariate normal distribution with mean 0 and covariance matrix Σ, denoted by N(0, Σ), where the covariance matrix Σ is given by
(1) Σ = VΛV′ + σ2Ip,
where Λ = diag(λ1, …, λr) with λ1 ≥ ⋯ ≥ λr > 0, and V = [v1, …, vr] is p × r with orthonormal columns. The r largest eigenvalues of Σ are λi + σ2, i = 1, …, r, and the rest are all equal to σ2. The r leading eigenvectors of Σ are given by the column vectors of V. Since the spectrum of Σ has r spikes, (1) is termed by [19] as the spiked covariance matrix model. This covariance structure and its variations have been widely used in signal processing, chemometrics, econometrics, population genetics, and many other fields. See, for instance, [16, 24, 32, 34]. In the high-dimensional setting, various aspects of this model have been studied by several recent papers, including but not limited to [2, 5, 10, 20, 21, 31, 33, 35]. For simplicity, we assume σ is known. Since σ can always be factored out by scaling X, without loss of generality, we assume σ = 1. Data-based estimation of σ will be discussed in Section 6.
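For concreteness, the following Python (NumPy) sketch generates data from model (1) with σ = 1; the dimensions, rank and spike values below are illustrative choices of ours, not quantities used elsewhere in the paper.

import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 50, 2
lam = np.array([5.0, 3.0])                                # spikes lambda_1 >= ... >= lambda_r > 0
V, _ = np.linalg.qr(rng.standard_normal((p, r)))          # p x r matrix with orthonormal columns
Sigma = V @ np.diag(lam) @ V.T + np.eye(p)                # model (1) with sigma = 1
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # rows are i.i.d. N(0, Sigma)
# sanity check: the r largest eigenvalues are lambda_i + 1, the remaining ones equal 1
print(np.round(np.linalg.eigvalsh(Sigma)[::-1][:4], 4))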
The primary focus of this paper is on the setting where V and Σ are sparse, and our goal is threefold. First, we consider the minimax estimation of the spiked covariance matrix Σ under the spectral norm. The method as well as the optimal rates of convergence in this problem are considerably different from those for estimating other recently studied structured covariance matrices, such as bandable and sparse covariance matrices. Second, we are interested in rank detection. The rank r plays an important role in principal component analysis (PCA) and is also of significant interest in signal processing and other applications. Last but not least, we consider optimal estimation of the principal subspace span(V) under the spectral norm, which is the main object of interest in PCA. Each of these three problems is important in its own right.
We now explain the sparsity model of V and Σ. The difficulty of estimation and rank detection depends on the joint sparsity of the columns of V. Let Vj* denote the jth row of V. The row support of V is defined by
(2) supp(V) = {j ∈ [p] : Vj* ≠ 0},
whose cardinality is denoted by |supp(V)|. Let the collection of p × r matrices with orthonormal columns be O(p, r) = {V ∈ ℝp × r : V′V = Ir}. Define the following parameter spaces for Σ,
(3) |
where τ ≥ 1 is a constant and r ≤ k ≤ p is assumed throughout the paper. Note that the condition number of Λ is at most τ. Moreover, for each covariance matrix in Θ0 (k, p, r, λ, τ), the leading r singular vectors (columns of V) are jointly k-sparse in the sense that the row support size of V is upper bounded by k. The structure of group sparsity has proved useful for high-dimensional regression; See, for example, [30]. In addition to (3), we also define the following parameter spaces by dropping the dependence on τ and r, respectively:
(4) |
and
(5) |
As a consequence of the group sparsity in V, a covariance matrix Σ in any of the above parameter spaces has at most k rows and k columns containing nonzero off-diagonal entries. We note that the matrix is more structured than the so-called “k-sparse” matrices considered in [3, 8, 13], where each row (or column) has at most k nonzero off-diagonals.
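To illustrate the group sparsity structure, the sketch below (with arbitrary illustrative sizes) constructs a V ∈ O(p, r) whose row support has cardinality k and verifies that the resulting Σ = VΛV′ + Ip has nonzero off-diagonal entries only in rows and columns indexed by the support.

import numpy as np

rng = np.random.default_rng(1)
p, r, k = 40, 3, 6
support = np.sort(rng.choice(p, size=k, replace=False))
Vk, _ = np.linalg.qr(rng.standard_normal((k, r)))   # k x r block with orthonormal columns
V = np.zeros((p, r))
V[support] = Vk                                     # row support of V has size k
Sigma = V @ np.diag([4.0, 2.0, 1.0]) @ V.T + np.eye(p)
off_diag = Sigma - np.diag(np.diag(Sigma))
active = np.nonzero(np.abs(off_diag).sum(axis=1) > 1e-12)[0]
print(set(active).issubset(set(support)))           # True: at most k rows/columns are affected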
1.2 Main contributions
In statistical decision theory, the minimax rate quantifies the difficulty of an inference problem and is frequently used as a benchmark for the performance of inference procedures. The main contributions of this paper include the sharp non-asymptotic minimax rates for estimating the covariance matrix Σ and the principal subspace span(V) under the squared spectral norm loss, as well as for detecting the rank r of the principal subspace. In addition, we also establish the minimax rates for estimating the precision matrix Ω = Σ−1 as well as the eigenvalues of Σ under the spiked covariance matrix model (1).
We establish the minimax rate for estimating the spiked covariance matrix Σ in (1) under the spectral norm
(6) |
where for a matrix A its spectral norm is defined as ∥A∥ = sup∥x∥2=1 ∥Ax∥2 with ∥·∥2 the vector ℓ2 norm. The minimax upper and lower bounds developed in Sections 3 and 4 yield the following optimal rate for estimating sparse spiked covariance matrices under the spectral norm
(7) |
subject to certain mild regularity conditions.1 The two terms in the square brackets are contributed by the estimation error of the eigenvectors V and the eigenvalues, respectively. Note that the second term can be dominant if λ is large.
An important quantity of the spiked model is the rank r of the principal subspace span(V), or equivalently, the number of spikes in the spectrum of Σ, which is of significant interest in chemometrics [24], signal array processing [25], and other applications. Our second goal is the minimax estimation of the rank r under the zero–one loss, or equivalently, the minimax detection of the rank r. It is intuitively clear that the difficulty in estimating the rank r depends crucially on the magnitude of the minimum spike λr. Results in Sections 3 and 4 show that the optimal rank detection boundary over the parameter space Θ2(k, p, λ, τ) is of order . Equivalently, the rank r can be exactly recovered with high probability if for a sufficiently large constant β; on the other hand, reliable detection becomes impossible by any method if for some positive constant β0. Lying at the heart of the arguments is a careful analysis of the moment generating function of a squared symmetric random walk stopped at a hypergeometrically distributed time, which is summarized in Lemma 1. It is worth noting that the optimal rate for rank detection obtained in the current paper resolves a gap left open in a recent paper by Berthet and Rigollet [2], where the authors obtained the optimal detection rate for the rank-one case in the regime of , but the lower bound deteriorates to when which is strictly suboptimal.
In many statistical applications, instead of the covariance matrix itself, the object of direct interest is often a lower dimensional functional of the covariance matrix, e.g., the principal subspace span(V). This problem is known in the literature as sparse PCA [5, 10, 20, 31]. The third goal of the paper is the minimax estimation of the principal subspace span(V). To this end, we note that the principal subspace can be uniquely identified with the associated projection matrix VV′. Moreover, any estimator can be identified with a projection matrix V̂V̂′, where the columns of V̂ constitute an orthonormal basis for the subspace estimator. Thus, estimating span(V) is equivalent to estimating VV′. We aim to optimally estimate span(V) under the loss [37, Section II.4]
(8) ∥V̂V̂′ − VV′∥2,
which equals the squared sine of the largest canonical angle between the respective linear spans. In the sparse PCA literature, the loss (8) was first used in [31] for multi-dimensional subspaces. For this problem, we shall show that, under certain regularity conditions, the minimax rate of convergence is
(9) |
In the present paper we consider estimation of the principal subspace span(V) under the spectral norm loss (8). It is interesting to compare the results with those for optimal estimation under the Frobenius norm loss [10, 40], whose ratio to (8) is between 1 and 2r. The optimal rate under the spectral norm loss given in (9) does not depend on the rank r, whereas the optimal rate under the Frobenius norm loss has an extra term , which depends on the rank r quadratically through r(k − r) [10]. Therefore the rate under the Frobenius norm far exceeds (9) when . When r = 1, both norms lead to the same rate and the result in (9) recovers earlier results on estimating the leading eigenvector obtained in [5, 39, 29].
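The following sketch evaluates the loss (8) and its Frobenius-norm counterpart on a randomly perturbed subspace and checks numerically that their ratio lies in [1, 2r], as used in the comparison above; the perturbation mechanism is purely illustrative.

import numpy as np

rng = np.random.default_rng(2)
p, r = 30, 4
V, _ = np.linalg.qr(rng.standard_normal((p, r)))
Vhat, _ = np.linalg.qr(V + 0.3 * rng.standard_normal((p, r)))  # a perturbed orthonormal basis
D = Vhat @ Vhat.T - V @ V.T
spec_loss = np.linalg.norm(D, 2) ** 2       # loss (8): squared spectral norm
frob_loss = np.linalg.norm(D, 'fro') ** 2   # squared Frobenius norm
ratio = frob_loss / spec_loss
print(1 - 1e-9 <= ratio <= 2 * r + 1e-9)    # ratio in [1, 2r] since rank(D) <= 2r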
In addition to the optimal rates for estimating the covariance matrix Σ, the rank r and the principal subspace span(V), the minimax rates for estimating the precision matrix Ω = Σ−1 as well as the eigenvalues of Σ are also established.
1.3 Other related work
Apart from the spiked covariance matrix model studied in this paper, other covariance matrix models have been considered in the literature. The most commonly imposed structural assumptions include “Toeplitz”, where each descending diagonal from left to right is constant, “bandable”, where the entries of the covariance matrix decay as they move away from the diagonal, and “sparse”, where only a small number of entries in each row/column are nonzero. The optimal rates of convergence were established in [11], [12] and [13] for estimating Toeplitz, bandable, and sparse covariance matrices, respectively. Estimation of sparse precision matrices has also been actively studied due to its close connection to Gaussian graphical models [9, 36, 41]. In addition, our work is also connected to the estimation of effective low-rank covariance matrices. See, for example, [28, 7] and the references therein.
1.4 Organization
The rest of the paper is organized as follows. Section 2 introduces basic notation and then gives a detailed description of the procedures for estimating the spiked covariance matrix Σ, the rank r and the principal subspace span(V). The rates of convergence of these estimators are given in Section 3. Section 4 presents the minimax lower bounds that match the upper bounds in Section 3 in terms of the convergence rates, thereby establishing the minimax rates of convergence and rate-optimality of the estimators constructed in Section 2. The minimax rates for estimating the eigenvalues and the precision matrix are given in Section 5. Section 6 discusses computational and other related issues. The proofs are given in Section 7.
2 Estimation Procedure
We give a detailed description of the estimation procedure in this section and study its properties in Section 3. Throughout, we shall focus on minimax estimation and assume the sparsity k is known, while the rank r will be selected based on data. Adaptation to k will be discussed in Section 6.
Notation
We first introduce some notation. For any matrix X = (xij) and any vector u, denote by ∥X∥ the spectral norm, ∥X∥F the Frobenius norm, and ∥u∥ the vector ℓ2 norm. Moreover, the ith row of X is denoted by Xi* and the jth column by X*j. Let supp(X) = {i : Xi* ≠ 0} denote the row support of X. For a positive integer p, [p] denotes the index set {1, 2, …, p}. For any set A, |A| denotes its cardinality, and Ac its complement. For two subsets I and J of indices, denote by XIJ the |I| × |J| submatrix formed by xij with (i, j) ∈ I × J. Let XI* = XI[p] and X*J = X[n]J. For any square matrix A = (aij), we let Tr(A) = Σi aii be its trace. Define the inner product of any two matrices B and C of the same size by 〈B, C〉 = Tr(B′C). For any matrix A, we use σi(A) to denote its ith largest singular value. When A is positive semi-definite, σi(A) is also the ith largest eigenvalue of A. For any real numbers a and b, set a ∨ b = max{a, b} and a ∧ b = min{a, b}. Let Sp−1 denote the unit sphere in ℝp. For any event E, we write 1{E} for its indicator function.
For any set B ⊂ [p], let Bc be its complement. For any symmetric matrix A ∈ ℝp × p, we use AB to denote the p × p matrix whose B × B block is ABB, the remaining diagonal elements are all ones and the remaining off-diagonal elements are all zeros, i.e.,
(10) |
In other words, after proper reordering of rows and columns, we have
Let P ⊗ Q denote the product measure of P and Q and P⊗n the n-fold product of P. For random variables X and Y, we write if they follow the same distribution, and if Ρ(X > t) ≤ Ρ(Y > t) for all t ∈ ℝ. Throughout the paper, we use C to denote a generic positive absolute constant, whose actual value may depend on the context. For any two sequences {an} and {bn} of positive numbers, we write an ≲ bn when an ≤ Cbn for some numeric constant C, and an ≳ bn when bn ≲ an, and an ≍ bn when both an ≳ bn and an ≲ bn hold.
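The operation in (10) is implemented by the small helper below (the function name is ours); it embeds the B × B block of A into a p × p matrix that is the identity off B.

import numpy as np

def block_embed(A, B):
    # A_B as in (10): B x B block copied from A, ones on the diagonal off B, zeros elsewhere
    p = A.shape[0]
    out = np.eye(p)
    idx = np.ix_(B, B)
    out[idx] = A[idx]
    return out

A = np.arange(16, dtype=float).reshape(4, 4)
A = (A + A.T) / 2                 # a symmetric example matrix
print(block_embed(A, [0, 2]))     # keeps only the {0, 2} x {0, 2} block of A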
Estimators
We are now ready to present the procedure for estimating the spiked covariance matrix Σ, the rank r, and the principal subspace span(V).
Let
(11) A = supp(V) = {j ∈ [p] : Vj* ≠ 0}
be the row support of V. For any m ∈ [p], let
(12) |
Recall that the observed matrix X has i.i.d. rows Xi* ~ N(0, Σ). We define S = X′X/n as the sample covariance matrix. Also recall that we assume knowledge of the sparsity level k, which is an upper bound for the support size |A|. The first step in the estimation scheme is to select a subset  of k features based on the data. To this end, let
(13) |
The appropriate value of γ1 will be specified later in the statement of the theorems. Intuitively speaking, the requirements in (13) aim to ensure that for any B ∈ Bk, there is no evidence in the data suggesting that Bc overlaps with the row support A of V. If Bk ≠ 0̸, denote by  an arbitrary element of Bk (or we can let  = argmaxB∈Bk Tr(SBB) for concreteness). As we will show later, Bk is non-empty with high probability; see Lemma 3 in Section 7.1. The set  represents the collection of selected features, which turns out to be instrumental in constructing optimal estimators for the three objects we are interested in: the covariance matrix, the rank of the spiked model, and the principal subspace. The estimator  of the support set A is obtained through searching over all subsets of size k. Such a global search, though computationally expensive, appears to be necessary in order for our procedure to optimally estimate Σ and V under the spectral norm. For example, estimating row-by-row is not guaranteed to always yield optimal results. Whether there exist computationally efficient procedures attaining the optimal rate is currently unknown. See Section 6 for more discussions.
Given Â, the estimators for the above three objects are defined as follows. Recalling the notation in (10), we define the covariance matrix estimator as
(14) Σ̂ = SÂ 1{Bk ≠ 0̸} + Ip 1{Bk = 0̸}.
The estimator for the rank is
(15) |
The appropriate value of γ2 will be specified later in the statement of the theorems. Last but not least, the estimator for the principal subspace is span(V̂), where
(16) V̂ = [v̂1, …, v̂r̂],
with v̂l the lth eigenvector of Σ̂. When Bk = 0̸, we set r̂ = 0 and V̂ = 0 since Σ̂ = Ip. Note that the estimator V̂ is based on the estimated rank r̂. Whenever r̂ ≠ r, the value of the loss function (8) equals 1.
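The three estimators can be summarized in the schematic Python sketch below. It replaces the deviation conditions defining Bk in (13) by the concrete choice Â = argmaxB Tr(SBB) mentioned above, and the rank rule (15) by a generic eigenvalue threshold, since the exact constants γ1 and γ2 are specified only in the theorems; it therefore illustrates the structure of the procedure rather than the tuned version analyzed below.

import numpy as np
from itertools import combinations

def estimate_spiked(X, k, rank_threshold):
    # Schematic version of the procedure of Section 2; the sparsity k is assumed known.
    n, p = X.shape
    S = X.T @ X / n                                   # sample covariance (mean is zero)
    # Support selection: maximize Tr(S_BB) over all size-k subsets (a simplification of (13)).
    A_hat = max(combinations(range(p), k),
                key=lambda B: np.trace(S[np.ix_(B, B)]))
    # Covariance estimator (14): embed the selected block into the identity as in (10).
    Sigma_hat = np.eye(p)
    Sigma_hat[np.ix_(A_hat, A_hat)] = S[np.ix_(A_hat, A_hat)]
    # Rank estimator: count eigenvalues exceeding 1 + rank_threshold (a stand-in for (15)).
    evals, evecs = np.linalg.eigh(Sigma_hat)
    evals, evecs = evals[::-1], evecs[:, ::-1]        # descending order
    r_hat = int(np.sum(evals >= 1.0 + rank_threshold))
    # Principal subspace estimator (16): leading r_hat eigenvectors of Sigma_hat.
    V_hat = evecs[:, :r_hat]
    return Sigma_hat, r_hat, V_hat

The enumeration over all size-k subsets makes the cost of this search grow combinatorially in p, which is precisely the computational issue discussed in Section 6.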
For a brief discussion on the comparison between the foregoing estimation procedure and that in [40], see Section 6.
3 Minimax Upper Bounds
We now investigate the properties of the estimation procedure introduced in Section 2. Rates of convergence for estimating the whole covariance matrix and its principal subspace under the spectral norm as well as for rank detection are established. The minimax lower bounds given in Section 4 will show that these rates are optimal and together they yield the minimax rates of convergence.
We begin with the estimation error of the covariance matrix estimator (14). Note that here we do not require the ratio λ1/λr to be bounded.
Theorem 1
Let Σ̂ be defined in (14) with Bk given by (13) for some γ1 ≥ 10. If for a sufficiently small constant c0 > 0, then
(17) |
where the parameter space Θ1 (k, p, r, λ) is defined in (4).
As we shall show later in Theorem 4, the rates in (17) are optimal with respect to all the model parameters, namely k, p, r and λ.
Remark 1
Since the parameter space Θ1 in (4) is contained in the set of k-sparse covariance matrices, it is of interest to compare the minimax rates for covariance matrix estimation in these two nested parameter spaces. For simplicity, we only consider the case where the spectral norms of covariance matrices are uniformly bounded by a constant. Cai and Zhou [13] showed that, under certain regularity conditions, the minimax rate of convergence for estimating k-sparse matrices is , while the rate over Θ1 in (7) reduces to when the spectral norm of the matrix and hence λ is bounded. Thus, ignoring the logarithmic terms, the rate over the smaller parameter space Θ1 is faster by a factor of k. This faster rate can be achieved because the group k-sparsity considered in our parameter space imposes much more structure than the row-wise k-sparsity does for the general k-sparse matrices.
The next result concerns the detection rate of the rank estimator (15) under the extra assumption that the ratio of the largest spike to the smallest, i.e., λ1/λr, is bounded.
Theorem 2
Let r̂ = r̂(γ1, γ2) be defined in (15) for some constants γ1 ≥ 10 and . Assume that
(18) |
for a sufficiently small constant c0 ∈ (0, 1) which depends on γ1. If for some sufficiently large β depending only on γ1, γ2 and τ, then
(19) |
where the parameter space Θ2(k, p, λ, τ) is defined in (5).
By Theorem 5 to be introduced later, the detection rate of is optimal. For more details, see the discussion in Section 4.
Finally, we turn to the risk of the principal subspace estimator. As in Theorem 2, we require λ1/λr to be bounded.
Theorem 3
Suppose
(20) |
holds for some absolute constants M0, M1 > 0. Let V̂ be defined in (16) with constants and in (13) and (15). If (18) holds for a sufficiently small constant c0 depending on γ1, then
(21) |
where the parameter space Θ0(k, p, r, λ, τ) is defined in (3).
Remark 2
To ensure that the choice of γ1 for achieving (21) is data-driven, we only need an under-estimate for M0 = log n/ log λ, or equivalently an overestimate for λ. (Note that M1 = log n/ log p can be obtained directly given the data matrix.) To this end, we first estimate Σ by Σ̂ in (14) with an initial γ1 = 10 in (13). Then we control λ by 2∥Σ̂∥ − 1. By the proof of Theorem 2, and in particular (75), this is an over-estimate of λ with high probability. The upper bound in (21) remains valid if we compute V̂ with a (possibly) new γ1 = 10 ∧ (1 + 2/M̂0)M1 in (13), where M̂0 = log n/ log(2∥ Σ̂ ∥ − 1).
It is worth noting that the rate in (21) does not depend on r, and is optimal, by the lower bound given in Theorem 4 later.
The problems of estimating Σ and V are clearly related, but they are also distinct from each other. To discuss their relationship, we first note the following result (proved in Section 7.5) which is a variation of the well-known sin-theta theorem [15]:
Proposition 1
Let Σ and Σ̂ be p × p symmetric matrices. Let r ∈ [p − 1] be arbitrary and let V, V̂ ∈ O(p, r) be formed by the r leading singular vectors of Σ and Σ̂, respectively. Then
(22) |
In view of Proposition 1, the minimax risks for estimating the spiked covariance matrix Σ and the principal subspace V under the spectral norm can be tied as follows:
(23) |
where Θ = Θ0(k, p, r, λ, τ).
The results of Theorems 1 and 3 suggest, however, that the above inequality is not tight when λ is large. The optimal rate for estimating V is not equivalent to that for estimating Σ divided by λ2 when . Consequently, Theorem 3 cannot be directly deduced from Theorem 1 but requires a different analysis by introducing an intermediate matrix S0 defined later in (50). This is because the estimation of Σ needs to take into account the extra error in estimating the eigenvalues in addition to those in estimating V. On the other hand, in proving Theorem 1 we need to contend with the difficulty that the loss function is unbounded.
4 Minimax Lower Bounds and Optimal Rates of Convergence
In this section we derive minimax lower bounds for estimating the spiked covariance matrix Σ and the principal subspace span(V) as well as for the rank detection. These lower bounds hold for all parameters and are non-asymptotic. The lower bounds together with the upper bounds given in Section 3 establish the optimal rates of convergence for the three problems.
The technical analysis heavily relies on a careful study of a rank-one testing problem and analyzing the moment generating function of a squared symmetric random walk stopped at a hypergeometrically distributed time. This lower bound technique is of independent interest and can be useful for other related matrix estimation and testing problems.
4.1 Lower bounds and minimax rates for matrix and subspace estimation
We first consider the lower bounds for estimating the spiked covariance matrix Σ and the principal subspace V under the spectral norm.
Theorem 4
For any 1 ≤ r ≤ k ≤ p and n ∈ ℕ,
(24) |
and
(25) |
where the parameter spaces Θ1 (k, p, r, λ) and Θ0 (k, p, r, λ, τ) are defined in (4) and (3), respectively.
To better understand the lower bound (24), it is helpful to write it equivalently as
which can be proved by showing that the minimax risk is lower bounded by each of these two terms. The first term does not depend on r and is the same as the lower bound in the rank-one case. The second term is the oracle risk when the true support of V is known. The key to the proof is the analysis of the rank-one case which will be discussed in more detail in Section 4.3. The proof of (25) is relatively straightforward by using known results on rank-one estimation.
In view of the upper bounds given in Theorems 1 and 3 and the lower bounds given in Theorem 4, we establish the following minimax rates of convergence for estimating the spiked covariance matrix Σ and the principal subspace V, subject to the constraints on the parameters given in Theorems 1 and 3:
(26) |
(27) |
where (27) holds under the additional condition that k − r ≳ k. Therefore, the estimators of Σ and V given in Section 2 are rate optimal. In (26), the trivial upper bound of λ2 can always be achieved by using the identity matrix as the estimator.
4.2 Lower bound and minimax rate for rank detection
We now turn to the lower bound and minimax rate for the rank detection problem.
Theorem 5
Let be a constant. For any 1 ≤ r ≤ k ≤ p and n ∈ ℕ, if , then
(28) |
where the function w : (0, ) → (0, 1) satisfies w(0+) = 1.
The upper and lower bounds given in Theorems 2 and 5 show that the optimal detection boundary for the rank r is . That is, the rank r can be estimated with an arbitrarily small error probability when for a sufficiently large constant β, whereas this is impossible to achieve by any method if for some small positive constant β0. Note that Theorem 5 applies to the full range of sparsity including the non-sparse case k = p, which requires . This observation turns out to be useful in proving the “parametric term” in the minimax lower bound for estimating Σ in Theorem 4.
The rank detection lower bound in Theorem 5 is in fact a direct consequence of the next proposition concerning testing the identity covariance matrix against rank-one alternatives,
(29) H0 : Σ = Ip versus H1 : Σ = Ip + λvv′ for some v ∈ Sp−1 ∩ B0(k),
where B0(k) ≜ {x ∈ ℝp : |supp(x)| ≤ k}. Note that Σ is in the parameter space Θ2 under both the null and the alternative hypotheses. The rank-one testing problem (29) has been studied in [2], where there is a gap between the lower and upper bounds when . The following result shows that their lower bound is in fact sub-optimal in this case. We shall give below a dimension-free lower bound for the optimal probability of error and determine the optimal rate of separation. The proof is deferred to Section 7.2.2.
Proposition 2
Let be a constant. Let X be an n × p random matrix whose rows are independently drawn from N(0, Σ). For any k ∈ [p] and n ∈ ℕ, if , the minimax sum of Type-I and Type-II error probabilities for the testing problem (29) satisfies
where the function w : (0, ) → (0, 1) satisfies w(0+) = 1.
Proposition 2 shows that testing independence in the rank-one spiked covariance model can be achieved reliably only if the effective signal-to-noise ratio
Furthermore, the lower bound in Proposition 2 also captures the following phenomenon: if β0 vanishes, then the optimal probability of error converges to one. In fact, the lower bound in Proposition 2 is optimal in the sense that the following test succeeds with vanishing probability of error if β0 → ∞:
for some c that depends only on β0. See, e.g., [2, Section 4]. However, the above test has high computational complexity since one needs to enumerate all k × k submatrices of S. It remains an open problem to construct tests that are both computationally feasible and minimax rate-optimal.
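A schematic version of this brute-force test is given below; the calibration of the threshold, which the discussion above ties to the detection boundary and to β0, is deliberately left as an input since the exact constant is not reproduced here.

import numpy as np
from itertools import combinations

def sparse_spike_test(X, k, threshold):
    # Reject H0 : Sigma = I_p if some k x k principal submatrix of S has a large top eigenvalue.
    n, p = X.shape
    S = X.T @ X / n
    stat = max(np.linalg.eigvalsh(S[np.ix_(B, B)])[-1] for B in combinations(range(p), k))
    return stat > 1.0 + threshold      # True means a spike is detected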
4.3 Testing rank-one spiked model
As mentioned earlier, a careful study of the rank-one testing problem (29) provides a major tool for the lower bound arguments. A key step in this study is the analysis of the moment generating function of a squared symmetric random walk stopped at a hypergeometrically distributed time. We present the main ideas in this section as the techniques can also be useful for other related matrix estimation and testing problems.
It is well-known that the minimax risk is given by the least-favorable Bayesian risk under mild regularity conditions on the model [27]. For the composite testing problem (29), it turns out that the rate-optimal least-favorable prior for v is given by the distribution of the following random vector:
(30) u = k−1/2JIw,
where w = (w1, …, wp) consists of iid Rademacher entries, and JI is a diagonal matrix given by (JI)ii = 1{i∈I} with I uniformly chosen from all subsets of [p] of size k. In other words, u is uniformly distributed on the collection of k-sparse vectors of unit length with equal-magnitude non-zeros. Hence u ∈ Sp−1 ∩ B0(k). We set
where β0 > 0 is a sufficiently small absolute constant.2 The desired lower bound then follows if we establish that the following (Bayesian) hypotheses
(31) H0 : Σ = Ip versus H1 : Σ = Ip + λuu′ with u distributed according to (30)
cannot be separated with vanishing probability of error.
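Sampling from the prior (30) is straightforward; the sketch below is a direct transcription of the description above, with all sizes chosen for illustration only.

import numpy as np

def sample_prior_u(p, k, rng):
    # u = k^{-1/2} J_I w as in (30): uniform size-k support I, i.i.d. Rademacher signs w
    I = rng.choice(p, size=k, replace=False)
    u = np.zeros(p)
    u[I] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    return u

rng = np.random.default_rng(3)
u = sample_prior_u(p=20, k=5, rng=rng)
print(np.isclose(np.linalg.norm(u), 1.0), np.count_nonzero(u) == 5)  # unit norm, k-sparse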
Remark 3
The composite testing problem (31) has also been considered in [2]. In particular, the following suboptimal lower bound is given in [2, Theorem 5.1]: If
(32) |
then the optimal error probability satisfies εn(k, p, λ) ≥ C(υ), where C(υ) → 1 as υ → 0. This result is established based on the following prior:
(33) ui = k−1/2 1{i∈I}, i ∈ [p],
which is a binary sparse vector with uniformly chosen support.
Compared to the result in Proposition 2, (32) is rate-optimal in the very sparse regime of . However, since log(1 + x) ≍ x when x ≲ 1, in the moderately sparse regime of , and so the lower bound in (32) is substantially smaller than the optimal rate in Proposition 2 by a factor of , which is a polynomial factor in k when k ≳ pα for any α > 1/2. In fact, by strengthening the proof in [2], one can show that the optimal separation for discriminating (31) using the binary prior (33) is . Therefore the prior (33) is rate-optimal only in the regime of , while (30) is rate-optimal for all k. Examining the role of the prior (30) in the proof of Theorem 5, we see that it is necessary to randomize the signs of the singular vector in order to take advantage of the central limit theorem and Gaussian approximation. When , the fact that the singular vector u is positive componentwise reduces the difficulty of the testing problem.
The main technical tool for establishing the rank-detection lower bound in Proposition 2 is the following lemma, which can be of independent interest. It deals with the behavior of a symmetric random walk stopped after a hypergeometrically distributed number of steps. Moreover, note that Lemma 1 also incorporates the non-sparse case (k = p and H = k), which proves to be useful in establishing the minimax lower bound for estimating Σ in Theorem 4. The proof of Lemma 1 is deferred to Section 7.6.
Lemma 1
Let p ∈ ℕ and k ∈ [p]. Let B1, …, Bk be i.i.d. Rademacher random variables. Denote the symmetric random walk on ℤ stopped at the mth step by
(34) Gm ≜ B1 + ⋯ + Bm.
Let H ~ Hypergeometric(p, k, k) with , i = 0, …, k. Then there exists a function g : (0, ) → (1, ∞) with g(0+) = 1, such that for any ,
(35) |
where .
Remark 4 (Tightness of Lemma 1)
The purpose of Lemma 1 is to seek the largest t, as a function of (p, k), such that Ε exp ( ) is upper bounded by a constant non-asymptotically. The condition that is in fact both necessary and sufficient. To see the necessity, note that Ρ {GH = H∣H = i} = 2−i. Therefore
which cannot be upper bounded by an absolute constant unless .
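The quantity controlled by Lemma 1 can be explored by simulation, as in the sketch below: H is realized as the overlap of two independent uniform size-k subsets of [p], GH is the stopped Rademacher walk, and the sample mean of exp(tGH2) is reported. The value of t used here is purely illustrative and is not the critical scaling identified in the lemma.

import numpy as np

rng = np.random.default_rng(4)
p, k, t, trials = 200, 10, 0.02, 100000
H = rng.hypergeometric(ngood=k, nbad=p - k, nsample=k, size=trials)  # overlap of two size-k subsets
signs = rng.choice([-1, 1], size=(trials, k))                        # Rademacher steps
G_H = np.array([signs[i, :h].sum() for i, h in enumerate(H)])        # walk stopped at step H
print(np.mean(np.exp(t * G_H ** 2)))                                 # estimate of E exp(t G_H^2)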
5 Estimation of Precision Matrix and Eigenvalues
We have so far focused on the optimal rates for estimating the spiked covariance matrix Σ, the rank r and the principal subspace span(V). The technical results and tools developed in the earlier sections turn out to be readily useful for establishing the optimal rates of convergence for estimating the precision matrix Ω = Σ−1 as well as the eigenvalues of Σ under the spiked covariance matrix model (1).
Besides the covariance matrix Σ, it is often of significant interest to estimate the precision matrix Ω, which is closely connected to the Gaussian graphical model as the support of the precision matrix Ω coincides with the edges in the corresponding Gaussian graph. Let Σ̂ be defined in (14) and let σi(Σ̂) denote its ith largest eigenvalue for all i ∈ [p]. Define the precision matrix estimator as
(36) |
The following result gives the optimal rates for estimating Ω under the spectral norm.
Proposition 3 (Precision matrix estimation)
Assume that λ ≍ 1. If for a sufficiently small constant c0 > 0, then
(37) |
where the upper bound is attained by the estimator (36) with γ1 ≥ 10 in obtaining Σ̂.
The upper bound follows along the lines of the proof of Theorem 1 after we control the smallest eigenvalue of Σ̂ as in (36). Proposition 2 can be readily applied to yield the desired lower bound.
Note that the optimal rate in (37) is quite different from the minimax rate of convergence for estimating k-sparse precision matrices where each row/column has at most k nonzero entries. Here M is the ℓ1 norm bound for the precision matrices. See [9]. So the sparsity in the principal eigenvectors and the sparsity in the precision matrix itself have significantly different implications for estimation of Ω under the spectral norm.
We now turn to estimation of the eigenvalues. Since σ is assumed to be equal to one, it suffices to estimate the eigenvalue matrix E = diag(λi) where λi ≜ 0 for i > r. For any estimator Ẽ = diag(λ̃i), we quantify the estimation error by the loss function ∥Ẽ − E∥2 = maxi∈[p] |λ̃i − λi|2. The following result gives the optimal rate of convergence for this estimation problem.
Proposition 4 (Uniform estimation of spectra)
Under the same conditions as in Proposition 3,
(38) |
where the upper bound is attained by the estimator Ê = diag(σi(Σ̂)) − Ip with Σ̂ defined in (14) with γ1 ≥ 10.
Hence, the spikes and the eigenvalues can be estimated uniformly at the rate of when k ≪ n/ log p. Proposition 4 is a direct consequence of Theorem 1 and Proposition 2. The proofs of Propositions 3 and 4 are deferred to Section 7.4.
6 Discussions
We have assumed the knowledge of the noise level σ = 1 and the support size k. For a given value of σ2, one can always rescale the data and reduce to the case of unit variance. As a consequence of rescaling, the results in this paper remain valid for a general σ2 by replacing each λ with λ/σ2 in both the expressions of the rates and the definitions of the parameter spaces. When σ2 is unknown, it can be easily estimated based on the data. Under the sparsity models (3)-(5), when k < p/2, σ2 can be well estimated by σ̂2 = median(sjj) as suggested in [20], where sjj is the jth diagonal element of the sample covariance matrix.
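A minimal version of this noise estimate, assuming mean-zero data as in the rest of the paper:

import numpy as np

def estimate_noise_variance(X):
    # sigma^2 estimated by the median of the diagonal of S = X'X / n (suggested for k < p/2)
    n = X.shape[0]
    diag_S = np.einsum('ij,ij->j', X, X) / n
    return float(np.median(diag_S))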
The knowledge of the support size k is much more crucial for our procedure. An interesting topic for future research is the construction of adaptive estimators which could achieve the minimax rates in Theorems 1–3 without knowing k. One possibility is to define Bk in (13) for all k ∈ [p], find the smallest k such that Bk is non-empty, and then define estimators for that particular k similar to those in Section 2 with necessary adjustments accounting for the extra multiplicity in the support selection procedure.
For ease of exposition, we have assumed that the data are normally distributed and so various non-asymptotic tail bounds used in the proofs follow. Since these bounds typically only rely on sub-Gaussianity assumptions on the data, we expect the results in the present paper to be readily extendable to data generated from distributions with appropriate sub-Gaussian tail conditions. The estimation procedure in Section 2 is different from that in [40]. Although both are based on enumerating all possible support sets of size k, Vu and Lei [40] proposed to pick the support set which maximizes a quadratic form, while ours is based on picking the set satisfying certain deviation bounds.
A more important issue is the computational complexity required to obtain the minimax rate optimal estimators. The procedure described in Section 2 entails a global search for the support set A, which can be computationally intensive. In many cases, this seems unavoidable since the spectral norm is not separable in terms of the entries/rows/columns. However, in some other cases, there are estimators that are computationally more efficient and can attain the same rates of convergence. For instance, in the low rank cases where , the minimax rates for estimating span(V) under the spectral norm and under the Frobenius norm coincide with each other. See the discussion following Theorem 3 in Section 3. Therefore the procedure introduced in [10, Section 3] attains the optimal rates under both norms simultaneously. As shown in [10], this procedure is not only computationally efficient, but also adaptive to the sparsity k. Finding the general complexity-theoretic limits for attaining the minimax rates under the spectral norm is an interesting and challenging topic for future research. Following the initial post of the current paper, Berthet and Rigollet [1] showed in a closely related sparse principal component detection problem that the minimax detection rate cannot be achieved by any computationally efficient algorithm in the highly sparse regime.
7 Proofs
We first collect a few useful technical lemmas in Section 7.1 before proving the main theorems in Section 7.2 in the order of Theorems 1 - 4. We then give the proofs of the propositions in the order of Propositions 2, 3, 4, and 1. As mentioned in Section 4, Theorem 5 on the rank detection lower bound is a direct consequence of Proposition 2. We complete this section with the proof of Lemma 1.
Recall that the row vectors of X are i.i.d. samples from the N(0, Σ) distribution with Σ specified by (1). Equivalently, one can think of X as an n × p data matrix generated as
(39) |
where U is an n × r random effects matrix with iid N(0, 1) entries, , V is p × r orthonormal, and Z has iid N(0, 1) entries which are independent of U.
7.1 Technical Lemmas
Lemma 2
Let S = X′X/n be the sample covariance matrix. Then
Proof
Note that
The result follows from the Davidson–Szarek bound [14, Theorem II.7] and the union bound.
Lemma 3
Let Bk be defined as in (13) with γ1 ≥ 3. Then Ρ(A ∉ Bk) ≤ 5(ep)1−γ1/2.
Proof
Note that by the union bound
We now bound the two terms on the right-hand side separately.
For the first term, note that A = supp(V). Then for any D ⊂ Ac, . Hence
where the second inequality is [10, Proposition 4], and the last inequality holds for all γ1 ≥ 3 and p ≥ 2.
For the second term, note that for any fixed D ⊂ Ac, , where Z*D and X*A are independent. Thus, letting W be the left singular vector matrix of X*A, we obtain that3
where Y is a |D| × k matrix with i.i.d. N(0, 1) entries. Thus, we have
where the third inequality is due to the Davidson-Szarek bound. Combining the two bounds completes the proof.
Lemma 4
Let γ1 ≥ 3. Suppose that for a sufficiently small constant c0 > 0. Then with probability at least 1 − 12(ep)1−γ1/2, Bk ≠ 0̸ and
(40) |
where .
Proof
We focus on the event Bk ≠ 0̸. Define the sets
(41) |
which correspond to the sets of correctly identified, missing, and overly identified features by the selected support set Â, respectively. By the triangle inequality, we have
(42) |
We now provide high probability bounds for each term on the right-hand side.
For the first term, recall that  ∈ Bk which is defined in (13). Since M ⊂ Âc, we have
(43) |
For the second term, by similar calculation to that in the proof of Lemma 3, we have that when γ1 ≥ 3,
(44) |
with probability at least 1 − 4(ep)1−γ1/2. For the third term, we turn to the definition of Bk in (13) again to obtain
By Lemma 2, with probability at least 1 − (ep)1−γ1/2,
(45) |
where the second inequality holds when and the last inequality holds for sufficiently small constant c0. Moreover, the last two displays jointly imply
(46) |
For the fourth term, we obtain by similar arguments that with probability at least 1 − (ep)1 − γ1/2,
(47) |
Note that Ρ(Bk = 0̸) ≤ Ρ(A ∉ Bk), and by Lemma 3, the latter is bounded above by 5(ep)1 −γ1/2. So the union bound implies that the intersection of the event {Bk ≠ 0̸} and the event that (43)-(47) all hold has probability at least 1 − 12(ep)1 −γ1/2. On this event, we assemble (42)-(47) to obtain
This completes the proof.
Lemma 5
Let Y ∈ ℝn × k and Z ∈ ℝn × m be two independent matrices with i.i.d. N (0, 1) entries. Then there exists an absolute constant C > 0, such that
(48) |
(49) |
Proof
The inequality (48) follows directly from integrating the high-probability upper bound in [10, Proposition 4].
Let the SVD of Y be Y = ACB′, where A and B are n × (n ∧ k) and k × (n ∧ k) matrices with orthonormal columns, uniformly (Haar) distributed, and C is an (n ∧ k) × (n ∧ k) diagonal matrix with ∥C∥ = ∥Y∥. Since A and Z are independent, A′Z has the same law as an (n ∧ k) × m matrix with iid N(0, 1) entries. Therefore
for some absolute constant C0, where the last inequality follows from the Davidson–Szarek theorem [14, Theorem II.7]. Exchanging the roles of Y and Z, we have E∥Y′Z∥2 ≤ C0(n + m)(n ∧ m + k). Consequently,
This completes the proof.
In the proofs, the following intermediate matrix
(50) |
plays a key role. In particular, the following results on S0 will be used repeatedly.
Lemma 6
Suppose λ1/λr ≤ τ for some constant τ ≥ 1. If , then
(51) |
Moreover, VV′ is the projection matrix onto the rank r principal subspace of S0.
Proof
It is straightforward to verify that
(52) |
When , . So Weyl’s inequality [18, Theorem 4.3.1] leads to
Note that S0 always has 1 as its eigenvalue with multiplicity at least p − r. We thus obtain (51).
When (51) holds, (50) shows that the rank r principal subspace of S0 is equal to that of . Therefore, the subspace is spanned by the column vectors of V, and VV′ is the projection matrix onto it since V ∈ O(p, r).
To prove the lower bound for rank detection, we need the following lemma concerning the χ2-divergence in covariance models. Recall that the χ2-divergence between two probability measures P and Q with P ≪ Q is defined as χ2(P‖Q) ≜ ∫ (dP/dQ)2 dQ − 1.
For a distribution F, we use F⊗n to denote the product distribution of n independent copies of F.
Lemma 7
Let ν be a probability distribution on the space of p × p symmetric random matrices M such that ∥M∥ ≤ 1 almost surely. Consider the scale mixture distribution Ε [N(0, Ip + M)⊗n] = ∫ N(0, Ip + M)⊗n ν(dM). Then
(53) |
(54) |
where M1 and M2 are independently drawn from ν. Moreover, if rank(M) = 1 a.s., then (54) holds with equality.
Proof
Denote by gi the probability density function of N (0, Σi) for i = 0, 1 and 2, respectively. Then it is straightforward to verify that
(55) |
if Σ0(Σ1 + Σ2) ≥ Σ1Σ2; otherwise, the integral on the left-hand side of (55) is infinite. Applying (55) to Σ0 = Ip and Σi = Ip + Mi and using Fubini’s theorem, we have
(56) |
(57) |
(58) |
where (56) is due to ∥M1M2∥ ≤ ∥M1∥∥M2∥ ≤ 1 a.s., and (58) is due to log det(I + A) ≤ Tr(A), with equality if and only if rank(A) = 1.
7.2 Proofs of Main Results
7.2.1 Proofs of the Upper Bounds
Proof of Theorem 1
Let γ1 ≥ 10 be a constant. Denote by E the event that (40) holds, which, in particular, contains the event {Bk ≠ 0̸}. By the triangle inequality,
(59) |
where the last step holds because . In view of the definition of η in (12), we have .
To bound the second term in (59), define JB as the diagonal matrix given by
(60) (JB)ii = 1{i∈B}, i ∈ [p].
Then, for S0 in (50),
(61) |
Therefore
(62) |
In view of (52) and Lemma 5, we have
(63) |
where the last step is due to by assumption. Similarly,
(64) |
Again by (49) in Lemma 5,
(65) |
Assembling (62) - (65), we have
(66) |
Combining (59) and (66) yields
(67) |
Next we control the estimation error conditioned on the complement of the event E. First note that . Then ∥SÂ − Σ∥ ≤ ∥S∥ + ∥Σ∥ + 1 ≤ ∥Σ∥ (∥Wp∥ + 1), where Wp is equal to times a p × p Wishart matrix with n degrees of freedom. Also, ∥Σ − I∥ = λ1. In view of (14), we have ∥Σ̂ − Σ∥ ≤ (1 + λ1)(∥Wp∥ + 2). Using the Cauchy–Schwarz inequality, we have
Note that . By [14, Theorem II.7], . Then . By Lemma 4, we have Ρ {Ec} ≤ 12(ep)1−γ1/2. Therefore
(68) |
Choose a fixed γ1 ≥ 10. Assembling (67) and (68), we have
Proof of Theorem 2
To prove the theorem, it suffices to show that with the desired probability,
(69) |
Recall (50) and (52). By [10, Proposition 4], with probability at least 1 − 2(ep)−γ1/2, . Under (18), when c0 is sufficiently small, and so η(k, n, p, γ1) ≤ 1/(2τ). Thus,
(70) |
Therefore, Lemma 6 leads to (51). Moreover, Weyl’s inequality leads to
(71) |
Next, we consider SA − S0. By (61), we have
By [10, Proposition 4] and (18), when c0 is sufficiently small, with probability at least 1 − (ep + 1)(ep)−γ1/2,
Moreover, [10, Proposition 3] implies that with probability at least 1−2(ep)−γ1/2,
Assembling the last three displays, we obtain that with probability at least 1 − (ep + 3)(ep)−γ1/2,
(72) |
Last but not least, Lemma 4 implies that, under the condition of Theorem 2, with probability at least 1 − 12(ep)1−γ1/2, (40) holds. Together with (72), it implies that
(73) |
where . By (18), we could further upper bound the right-hand side by λ1/4 ∨ 1/2. When λ1 ≥ 1, , so for sufficiently small c0, (18) implies that the right side of (73) is further bounded by λ1/4. When λ1 ≤ 1, , and so the right side of (73) is further bounded by 1/2 for sufficiently small c0.
Thus, the last display, together with (70), implies
(74) |
Here, the last inequality comes from the above discussion, and the fact that η(k, n, p, γ1) < 1/4 for small c0. The triangle inequality further leads to
(75) |
Set . Then (51), (73) and (75) jointly imply that, with probability at least 1−(13ep+5)(ep)−γ1/2, the second inequality in (69) holds. Moreover, (74) and the triangle inequality imply that, with the same probability,
Here, the last inequality holds when for a sufficiently large β which depends only on γ1, γ2 and τ. In view of (75), the last display implies the first inequality of (69). This completes the proof of the upper bound.
Proof of Theorem 3
Let E be the event such that Lemma 3, Lemma 4, the upper bound in Theorem 2, and (71) hold. Then Ρ(Ec) ≤ C(ep)1−γ1/2.
On the event E, Σ̂ = SÂ. Moreover, Lemma 6 shows that VV′ is the projection matrix onto the principal subspace of S0, and Theorem 2 ensures V̂ ∈ ℝp × r. Thus, Proposition 1 leads to
and so
To further bound the right-hand side of the last display, we apply (61), (64), (65), Lemma 3 and Lemma 4 to obtain
Together with the second last display, this implies
(76) |
Now consider the event Ec. Note that ∥V̂V̂′ − VV′∥ ≤ 1 always holds. Thus,
(77) |
where the last inequality holds under condition (20) for all . Assembling (76) and (77), we obtain the upper bounds.
7.2.2 Proofs of the Lower Bounds
Proof of Theorem 4
1° The minimax lower bound for estimating span(V) follows straightforwardly from previous results on estimating the leading singular vector, i.e., the rank-one case (see, e.g., [5, 39]). The desired lower bound (25) can be found in [10, Eq. (58)] in the proof of [10, Theorem 3].
2° Next we establish the minimax lower bound for estimating the spiked covariance matrix Σ under the spectral norm, which is considerably more involved. Let Θ1 = Θ1(k, p, r, λ). In view of the fact that a ∧ (b + c) ≤ a ∧ b + a ∧ c ≤ 2(a ∧ (b + c)) for all a, b, c ≥ 0, it is convenient to prove the following equivalent lower bound
(78) |
To this end, we show that the minimax risk is lower bounded by the two terms on the right-hand side of (78) separately. In fact, the first term is the minimax rate in the rank-one case and the second term is the rate of the oracle risk when the estimator knows the true support of V.
2.1° Consider the following rank-one subset of the parameter space
Then σ1(Σ) = 1 + λ and σ2(Σ) = 1. For any estimator Σ̂ denote by v̂ its leading singular vector. Applying Proposition 1 yields
Then
(79) |
(80) |
where (79) follows from rank(v̂v̂′− vv′) ≤ 2 and (80) follows from the minimax lower bound in [39, Theorem 2.1] (see also [5, Theorem 2]) for estimating the leading singular vector.
2.2° To prove the lower bound , consider the following (composite) hypothesis testing problem:
(81) |
where with a sufficiently small absolute constant b > 0. Since r ∈ [k] and , both the null and the alternative hypotheses belong to the parameter set Θ1 defined in (4). Following Le Cam's two-point argument [38, Section 2.3], we next show that the minimal sum of Type-I and Type-II error probabilities of testing (81) is non-vanishing. Since any pair of covariance matrices in H0 and H1 differ in operator norm by at least , we obtain a lower bound of rate .
To this end, let X consist of n iid rows drawn from N(0, Σ), where Σ is either from H0 or H1. Since under both the null and the alternative, the last p−r columns of X are standard normal and independent of the first r columns, we conclude that the first r columns form a sufficient statistic. Therefore the minimal Type-I+II error probability of testing (81), denoted by εn, is equal to that of the following testing problem of dimension r and sample size n:
(82) |
Recall that the minimax risk is lower bounded by the Bayesian risk. For any random vector u taking values in Sr−1, denote by Ε[N(0, Ir + ρuu′)⊗n] the mixture alternative distribution with a prior equal to the distribution of u. Applying [38, Theorem 2.2 (iii)] we obtain the following lower bound in terms of the χ2-divergence from the mixture alternative to the null:
(83) |
Consider the unit random vector u with iid coordinates taking values in {±r−1/2} uniformly. Since ρ ≤ 1, applying the equality case of Lemma 7 yields
where in the last step we recall that Gr is the symmetric random walk on ℤ at the rth step defined in (34). Since , choosing as a fixed constant and applying Lemma 1 with p = k = r (the non-sparse case), we conclude that
(84) |
where g is given by in Lemma 1 and satisfies g > 1. Combining (83) - (84), we obtain the following lower bound for estimating Σ:
(85) |
As we mentioned in Section 4, the rank-detection lower bound in Theorem 5 is a direct consequence of Proposition 2 concerning testing rank-zero versus rank-one perturbation, which we prove below.
7.3 Proof of Proposition 2
Proof
Analogous to (83), any random vector u taking values in Sp−1 ∩ B0(k) gives the Bayesian lower bound
(86) |
Put
Let the random sparse vector u be defined in (30). In view of Lemma 7 as well as the facts that rank(λvv′) = 1 and ∥λvv′∥ = λ ≤ 1, we have
where in the last step we have defined H ≜ |I ∩ Ĩ| ~ Hypergeometric(p, k, k) and {Gm} is the symmetric random walk on ℤ defined in (34). Now applying Lemma 1, we conclude that
where g is given by in Lemma 1 satisfying g(0+) = 1. In view of (86), we conclude that
Note that the function w satisfies w(0+) = 1.
7.4 Proof of Propositions 3 and 4
We give here a joint proof of Propositions 3 and 4.
Proof
1° (Upper bounds) By assumption, we have λ ≍ 1 and . Since r ∈ [k], applying Theorem 1 yields
(87) |
Note that
(88) |
where the last inequality follows from σp (Σ) = 1 and Weyl’s inequality. By Chebyshev’s inequality, . Let . Then
(89) |
Moreover, again by Weyl’s inequality, we have hence in view of (36). On the other hand, by definition we always have ∥Ω̂∥ ≤ 2. Therefore
(90) |
(91) |
where (90) follows from (88) and (91) is due to (87) and (89).
Next we prove the upper bound for estimating E. Recall that E+Ip and Ê+Ip give the diagonal matrices formed by the ordered singular values of Σ and Σ̂, respectively. Similar to the proof of Proposition 4, Weyl's inequality implies that ∥Ê−E∥ = ∥Ê+Ip−(E+Ip)∥ = maxi |σi(Σ̂)−σi(Σ)| ≤ ∥Σ̂−Σ∥, where for any i ∈ [p], σi(Σ) is the ith largest eigenvalue of Σ. In view of Theorem 1, we have .
2° (Lower bounds) The lower bound follows from the testing result in Proposition 2. Consider the testing problem (29); then both the null (Σ = Ip) and alternatives (Σ = Ip+λvv′) are contained in the parameter space Θ1. By Proposition 2, they cannot be distinguished with probability 1 if . The spectra differ by at least |σ1(Ip) − σ1(Ip + λvv′)| = λ. By the Woodbury identity, , hence . The proof of the lower bound is now completed by the usual two-point argument.
7.5 Proof of Proposition 1
Proof
Recall that σ1 (Σ) ≥ … ≥ σp(Σ) ≥ 0 denote the ordered singular values of Σ. If , then by Weyl's inequality, we have . If , then by the Davis–Kahan sin-theta theorem [15] (see also [10, Theorem 10]) we have
completing the proof of (22) in view of the fact that ∥V̂V̂′ − VV′∥ ≤ 1.
7.6 Proof of Lemma 1
Proof
First of all, we can safely assume that p ≥ 5, for otherwise the expectation on the right-hand side of (35) is obviously upper bounded by an absolute constant. In the sequel we shall assume that
(92) |
We use a normal approximation of the random walk Gm for small m and a truncation argument to deal with large m. To this end, we divide [p], the whole range of k, into three regimes.
Case I: Large k
Assume that . Then . By the following non-asymptotic version of Tusnády's coupling lemma (see, for example, [6, Lemma 4, p. 242]), for each m, there exists Zm ~ N(0, m), such that
(93) |
Since H ≤ p, in view of (92), we have
(94) |
(95) |
Case II: Small k
Assume that , which, in particular, implies that since p ≥ 5. Using , we have
(96) |
(97) |
(98) |
where (96) follows from the stochastic dominance of hypergeometric distributions by binomial distributions (see, e.g., [23, Theorem 1.1(d)])
(99) |
and the moment generating function of binomial distributions, (97) is due to (1 + x)k ≤ exp(kx) for x ≥ 0 and , (98) is due to
and
Case III: Moderate k
Assume that . Define
(100) |
and write
(101) |
Next we bound the two terms in (101) separately: For the first term, we use normal approximation of Gm. By (93) - (94), for each fixed m ≤ A, we have
(102) |
where the last inequality follows from and (100).
To control the second term in (101), without loss of generality, we assume that A ≤ k, i.e., . We proceed similarly to (96) - (98):
(103) |
(104) |
(105) |
(106) |
where (103) follows from (99), (104) follows from the fact that , p ≥ 2k and m ≤ A, (105) follows from the fact that e log x ≤ x for all x ≥ 1, and (106) is due to (100) and our choice of a in (92).
(107) |
We complete the proof of (35), with g(a) defined as the maximum of the right-hand sides of (95), (98) and (107).
Acknowledgments
The research of Tony Cai was supported in part by NSF FRG Grant DMS-0854973, NSF Grant DMS-1208982, and NIH Grant R01 CA 127334. The research of Zongming Ma was supported in part by the Dean's Research Fund of the Wharton School. The research of Yihong Wu was supported in part by NSF FRG Grant DMS-0854973.
Footnotes
Here and after, a ∧ b ≜ min(a, b) and an ≍ bn means that an/bn is bounded from both below and above by constants independent of n and all model parameters.
Here β0 can be chosen to be any constant smaller than . See Proposition 2. The number is certainly not optimized.
If rank(X*A) < k, then is changed to , and the subsequent arguments continue to hold verbatim.
References
- 1.Berthet Q, Rigollet P. Complexity theoretic lower bounds for sparse principal component detection. Journal of Machine Learning Research: Workshop and Conference Proceedings. 2013;30:1–21.
- 2.Berthet Q, Rigollet P. Optimal detection of sparse principal components in high dimension. Annals of Statistics. 2013;41(4):1780–1815.
- 3.Bickel P, Levina E. Covariance regularization by thresholding. The Annals of Statistics. 2008;36(6):2577–2604.
- 4.Bickel P, Levina E. Regularized estimation of large covariance matrices. The Annals of Statistics. 2008;36(1):199–227.
- 5.Birnbaum A, Johnstone I, Nadler B, Paul D. Minimax bounds for sparse PCA with noisy high-dimensional data. arXiv preprint arXiv:1203.0967. 2012. doi: 10.1214/12-AOS1014.
- 6.Bretagnolle J, Massart P. Hungarian constructions from the nonasymptotic viewpoint. The Annals of Probability. 1989;17(1):239–256.
- 7.Bunea F, Xiao L. On the sample covariance matrix estimator of reduced effective rank population matrices, with applications to fPCA. arXiv preprint arXiv:1212.5321. 2012.
- 8.Cai T, Liu W. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association. 2011;106:672–684.
- 9.Cai T, Liu W, Zhou H. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. arXiv preprint arXiv:1212.2882. 2012.
- 10.Cai T, Ma Z, Wu Y. Sparse PCA: Optimal rates and adaptive estimation. Preprint. 2012. URL http://arxiv.org/abs/1211.1309.
- 11.Cai T, Ren Z, Zhou H. Optimal rates of convergence for estimating Toeplitz covariance matrices. Probability Theory and Related Fields. 2012:1–43.
- 12.Cai T, Zhang CH, Zhou H. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics. 2010;38(4):2118–2144.
- 13.Cai T, Zhou H. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics. 2012.
- 14.Davidson K, Szarek S. Local operator theory, random matrices and Banach spaces. In: Handbook on the Geometry of Banach Spaces, Vol. 1. Elsevier Science; 2001. pp. 317–366.
- 15.Davis C, Kahan W. The rotation of eigenvectors by a perturbation. III. SIAM J Numer Anal. 1970;7(1):1–46.
- 16.Fan J, Fan Y, Lv J. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics. 2008;147(1):186–197.
- 17.Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. doi: 10.1093/biostatistics/kxm045.
- 18.Horn RA, Johnson CR. Matrix Analysis. Cambridge University Press; 1990.
- 19.Johnstone I. On the distribution of the largest eigenvalue in principal component analysis. The Annals of Statistics. 2001;29:295–327.
- 20.Johnstone I, Lu A. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association. 2009;104:682–693. doi: 10.1198/jasa.2009.0121.
- 21.Jung S, Marron J. PCA consistency in high dimension, low sample size context. The Annals of Statistics. 2009;37(6B):4104–4130.
- 22.Karoui N. Operator norm consistent estimation of large-dimensional sparse covariance matrices. The Annals of Statistics. 2008;36:2717–2756.
- 23.Klenke A, Mattner L. Stochastic ordering of classical discrete distributions. Advances in Applied Probability. 2010;42(2):392–410.
- 24.Kritchman S, Nadler B. Determining the number of components in a factor model from limited noisy data. Chemometrics and Intelligent Laboratory Systems. 2008;94(1):19–32.
- 25.Kritchman S, Nadler B. Non-parametric detection of the number of signals: hypothesis testing and random matrix theory. IEEE Transactions on Signal Processing. 2009;57(10):3930–3941.
- 26.Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics. 2009;37:4254–4278. doi: 10.1214/09-AOS720.
- 27.Le Cam L. Asymptotic Methods in Statistical Decision Theory. Springer-Verlag New York, Inc; 1986.
- 28.Lounici K. High-dimensional covariance matrix estimation with missing observations. arXiv preprint arXiv:1201.2577. 2012.
- 29.Lounici K. Sparse principal component analysis with missing observations. arXiv preprint arXiv:1205.7060. 2012.
- 30.Lounici K, Pontil M, Van De Geer S, Tsybakov A. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics. 2011;39(4):2164–2204.
- 31.Ma Z. Sparse principal component analysis and iterative thresholding. arXiv preprint arXiv:1112.2432. 2011.
- 32.Onatski A. Asymptotics of the principal components estimator of large factor models with weak factors. J Econometrics. 2012;168:244–258.
- 33.Onatski A, Moreira M, Hallin M. Signal detection in high dimension: The multispiked case. arXiv preprint arXiv:1210.5663. 2012.
- 34.Patterson N, Price A, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. doi: 10.1371/journal.pgen.0020190.
- 35.Paul D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica. 2007;17(4):1617–1642.
- 36.Ravikumar P, Wainwright M, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics. 2011;5:935–980.
- 37.Stewart G, Sun JG. Matrix Perturbation Theory. Computer Science and Scientific Computing. Academic Press; 1990.
- 38.Tsybakov A. Introduction to Nonparametric Estimation. Springer Verlag; 2009.
- 39.Vu V, Lei J. Minimax rates of estimation for sparse PCA in high dimensions. In: The Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS'12); 2012. URL http://arxiv.org/abs/1202.0786.
- 40.Vu V, Lei J. Minimax sparse principal subspace estimation in high dimensions. arXiv preprint arXiv:1211.0373. 2012.
- 41.Yuan M. High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research. 2010;11:2261–2286.