Abstract
This paper considers a sparse spiked covariance matrix model in the high-dimensional setting and studies the minimax estimation of the covariance matrix and the principal subspace as well as the minimax rank detection. The optimal rate of convergence for estimating the spiked covariance matrix under the spectral norm is established, which requires significantly different techniques from those for estimating other structured covariance matrices such as bandable or sparse covariance matrices. We also establish the minimax rate under the spectral norm for estimating the principal subspace, the primary object of interest in principal component analysis. In addition, the optimal rate for the rank detection boundary is obtained. This result also resolves a gap in a recent paper by Berthet and Rigollet [2], where the special case of rank one is considered.
Keywords: Covariance matrix, Group sparsity, Low-rank matrix, Minimax rate of convergence, Sparse principal component analysis, Principal subspace, Rank detection
1 Introduction
The covariance matrix plays a fundamental role in multivariate analysis. Many methodologies, including discriminant analysis, principal component analysis and clustering analysis, rely critically on the knowledge of the covariance structure. Driven by a wide range of contemporary applications in many fields including genomics, signal processing, and financial econometrics, estimation of covariance matrices in the high-dimensional setting is of particular interest.
There have been significant recent advances on the estimation of a large covariance matrix and its inverse, the precision matrix. A variety of regularization methods, including banding, tapering, thresholding and penalization, have been introduced for estimating several classes of covariance and precision matrices with different structures. See, for example, [3, 4, 8, 12, 13, 17, 22, 26, 36, 41], among many others.
1.1 Sparse spiked covariance matrix model
In the present paper, we consider spiked covariance matrix models in the high-dimensional setting, which arise naturally from factor models with homoscedastic noise. To be concrete, suppose that we observe an n × p data matrix X with the rows X1*, …, Xn* i.i.d. following a multivariate normal distribution with mean 0 and covariance matrix Σ, denoted by N(0, Σ), where the covariance matrix Σ is given by
(1) Σ = VΛV′ + σ2Ip,
where Λ = diag(λ1, …, λr) with λ1 ≥ ⋯ ≥ λr > 0, and V = [v1, …, vr] is p × r with orthonormal columns. The r largest eigenvalues of Σ are λi + σ2, i = 1, …, r, and the rest are all equal to σ2. The r leading eigenvectors of Σ are given by the column vectors of V. Since the spectrum of Σ has r spikes, (1) is termed by [19] as the spiked covariance matrix model. This covariance structure and its variations have been widely used in signal processing, chemometrics, econometrics, population genetics, and many other fields. See, for instance, [16, 24, 32, 34]. In the high-dimensional setting, various aspects of this model have been studied by several recent papers, including but not limited to [2, 5, 10, 20, 21, 31, 33, 35]. For simplicity, we assume σ is known. Since σ can always be factored out by scaling X, without loss of generality, we assume σ = 1. Data-based estimation of σ will be discussed in Section 6.
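For concreteness, the following Python (NumPy) sketch generates data from model (1) with σ = 1; the dimensions, rank and spike values below are illustrative choices of ours, not quantities used elsewhere in the paper.

import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 50, 2
lam = np.array([5.0, 3.0])                                # spikes lambda_1 >= ... >= lambda_r > 0
V, _ = np.linalg.qr(rng.standard_normal((p, r)))          # p x r matrix with orthonormal columns
Sigma = V @ np.diag(lam) @ V.T + np.eye(p)                # model (1) with sigma = 1
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # rows are i.i.d. N(0, Sigma)
# sanity check: the r largest eigenvalues are lambda_i + 1, the remaining ones equal 1
print(np.round(np.linalg.eigvalsh(Sigma)[::-1][:4], 4))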
The primary focus of this paper is on the setting where V and Σ are sparse, and our goal is threefold. First, we consider the minimax estimation of the spiked covariance matrix Σ under the spectral norm. The method as well as the optimal rates of convergence in this problem are considerably different from those for estimating other recently studied structured covariance matrices, such as bandable and sparse covariance matrices. Second, we are interested in rank detection. The rank r plays an important role in principal component analysis (PCA) and is also of significant interest in signal processing and other applications. Last but not least, we consider optimal estimation of the principal subspace span(V) under the spectral norm, which is the main object of interest in PCA. Each of these three problems is important in its own right.
We now explain the sparsity model of V and Σ. The difficulty of estimation and rank detection depends on the joint sparsity of the columns of V. Let Vj* denote the jth row of V. The row support of V is defined by
(2) supp(V) = {j ∈ [p] : Vj* ≠ 0},
whose cardinality is denoted by |supp(V)|. Let the collection of p × r matrices with orthonormal columns be O(p, r) = {V ∈ ℝp × r : V′V = Ir}. Define the following parameter spaces for Σ,
(3) |
where τ ≥ 1 is a constant and r ≤ k ≤ p is assumed throughout the paper. Note that the condition number of Λ is at most τ. Moreover, for each covariance matrix in Θ0 (k, p, r, λ, τ), the leading r singular vectors (columns of V) are jointly k-sparse in the sense that the row support size of V is upper bounded by k. The structure of group sparsity has proved useful for high-dimensional regression; See, for example, [30]. In addition to (3), we also define the following parameter spaces by dropping the dependence on τ and r, respectively:
(4) |
and
(5) |
As a consequence of the group sparsity in V, a covariance matrix Σ in any of the above parameter spaces has at most k rows and k columns containing nonzero off-diagonal entries. We note that the matrix is more structured than the so-called “k-sparse” matrices considered in [3, 8, 13], where each row (or column) has at most k nonzero off-diagonals.
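To illustrate the group sparsity structure, the sketch below (with arbitrary illustrative sizes) constructs a V ∈ O(p, r) whose row support has cardinality k and verifies that the resulting Σ = VΛV′ + Ip has nonzero off-diagonal entries only in rows and columns indexed by the support.

import numpy as np

rng = np.random.default_rng(1)
p, r, k = 40, 3, 6
support = np.sort(rng.choice(p, size=k, replace=False))
Vk, _ = np.linalg.qr(rng.standard_normal((k, r)))   # k x r block with orthonormal columns
V = np.zeros((p, r))
V[support] = Vk                                     # row support of V has size k
Sigma = V @ np.diag([4.0, 2.0, 1.0]) @ V.T + np.eye(p)
off_diag = Sigma - np.diag(np.diag(Sigma))
active = np.nonzero(np.abs(off_diag).sum(axis=1) > 1e-12)[0]
print(set(active).issubset(set(support)))           # True: at most k rows/columns are affected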
1.2 Main contributions
In statistical decision theory, the minimax rate quantifies the difficulty of an inference problem and is frequently used as a benchmark for the performance of inference procedures. The main contributions of this paper include the sharp non-asymptotic minimax rates for estimating the covariance matrix Σ and the principal subspace span(V) under the squared spectral norm loss, as well as for detecting the rank r of the principal subspace. In addition, we also establish the minimax rates for estimating the precision matrix Ω = Σ−1 as well as the eigenvalues of Σ under the spiked covariance matrix model (1).
We establish the minimax rate for estimating the spiked covariance matrix Σ in (1) under the spectral norm
(6) |
where for a matrix A its spectral norm is defined as ∥A∥ = sup∥x∥2=1 ∥Ax∥2 with ∥·∥2 the vector ℓ2 norm. The minimax upper and lower bounds developed in Sections 3 and 4 yield the following optimal rate for estimating sparse spiked covariance matrices under the spectral norm
(7) |
subject to certain mild regularity conditions.1 The two terms in the square brackets are contributed by the estimation error of the eigenvectors V and the eigenvalues, respectively. Note that the second term can be dominant if λ is large.
An important quantity of the spiked model is the rank r of the principal subspace span(V), or equivalently, the number of spikes in the spectrum of Σ, which is of significant interest in chemometrics [24], signal array processing [25], and other applications. Our second goal is the minimax estimation of the rank r under the zero–one loss, or equivalently, the minimax detection of the rank r. It is intuitively clear that the difficulty in estimating the rank r depends crucially on the magnitude of the minimum spike λr. Results in Sections 3 and 4 show that the optimal rank detection boundary over the parameter space Θ2(k, p, λ, τ) is of order . Equivalently, the rank r can be exactly recovered with high probability if for a sufficiently large constant β; on the other hand, reliable detection becomes impossible by any method if for some positive constant β0. Lying at the heart of the arguments is a careful analysis of the moment generating function of a squared symmetric random walk stopped at a hypergeometrically distributed time, which is summarized in Lemma 1. It is worth noting that the optimal rate for rank detection obtained in the current paper resolves a gap left open in a recent paper by Berthet and Rigollet [2], where the authors obtained the optimal detection rate for the rank-one case in the regime of , but the lower bound deteriorates to when which is strictly suboptimal.
In many statistical applications, instead of the covariance matrix itself, the object of direct interest is often a lower dimensional functional of the covariance matrix, e.g., the principal subspace span(V). This problem is known in the literature as sparse PCA [5, 10, 20, 31]. The third goal of the paper is the minimax estimation of the principal subspace span(V). To this end, we note that the principal subspace can be uniquely identified with the associated projection matrix VV′. Moreover, any estimator can be identified with a projection matrix V̂V̂′, where the columns of V̂ constitute an orthonormal basis for the subspace estimator. Thus, estimating span(V) is equivalent to estimating VV′. We aim to optimally estimate span(V) under the loss [37, Section II.4]
(8) ∥V̂V̂′ − VV′∥2,
which equals the squared sine of the largest canonical angle between the respective linear spans. In the sparse PCA literature, the loss (8) was first used in [31] for multi-dimensional subspaces. For this problem, we shall show that, under certain regularity conditions, the minimax rate of convergence is
(9) |
In the present paper we consider estimation of the principal subspace span(V) under the spectral norm loss (8). It is interesting to compare the results with those for optimal estimation under the Frobenius norm loss [10, 40], whose ratio to (8) is between 1 and 2r. The optimal rate under the spectral norm loss given in (9) does not depend on the rank r, whereas the optimal rate under the Frobenius norm loss has an extra term , which depends on the rank r quadratically through r(k − r) [10]. Therefore the rate under the Frobenius norm far exceeds (9) when . When r = 1, both norms lead to the same rate and the result in (9) recovers earlier results on estimating the leading eigenvector obtained in [5, 39, 29].
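The following sketch evaluates the loss (8) and its Frobenius-norm counterpart on a randomly perturbed subspace and checks numerically that their ratio lies in [1, 2r], as used in the comparison above; the perturbation mechanism is purely illustrative.

import numpy as np

rng = np.random.default_rng(2)
p, r = 30, 4
V, _ = np.linalg.qr(rng.standard_normal((p, r)))
Vhat, _ = np.linalg.qr(V + 0.3 * rng.standard_normal((p, r)))  # a perturbed orthonormal basis
D = Vhat @ Vhat.T - V @ V.T
spec_loss = np.linalg.norm(D, 2) ** 2       # loss (8): squared spectral norm
frob_loss = np.linalg.norm(D, 'fro') ** 2   # squared Frobenius norm
ratio = frob_loss / spec_loss
print(1 - 1e-9 <= ratio <= 2 * r + 1e-9)    # ratio in [1, 2r] since rank(D) <= 2r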
In addition to the optimal rates for estimating the covariance matrix Σ, the rank r and the principal subspace span(V), the minimax rates for estimating the precision matrix Ω = Σ−1 as well as the eigenvalues of Σ are also established.
1.3 Other related work
Apart from the spiked covariance matrix model studied in this paper, other covariance matrix models have been considered in the literature. The most commonly imposed structural assumptions include “Toeplitz”, where each descending diagonal from left to right is constant, “bandable”, where the entries of the covariance matrix decay as they move away from the diagonal, and “sparse”, where only a small number of entries in each row/column are nonzero. The optimal rates of convergence were established in [11], [12] and [13] for estimating Toeplitz, bandable, and sparse covariance matrices, respectively. Estimation of sparse precision matrices has also been actively studied due to its close connection to Gaussian graphical models [9, 36, 41]. In addition, our work is also connected to the estimation of effective low-rank covariance matrices. See, for example, [28, 7] and the references therein.
1.4 Organization
The rest of the paper is organized as follows. Section 2 introduces basic notation and then gives a detailed description of the procedures for estimating the spiked covariance matrix Σ, the rank r and the principal subspace span(V). The rates of convergence of these estimators are given in Section 3. Section 4 presents the minimax lower bounds that match the upper bounds in Section 3 in terms of the convergence rates, thereby establishing the minimax rates of convergence and rate-optimality of the estimators constructed in Section 2. The minimax rates for estimating the eigenvalues and the precision matrix are given in Section 5. Section 6 discusses computational and other related issues. The proofs are given in Section 7.
2 Estimation Procedure
We give a detailed description of the estimation procedure in this section and study its properties in Section 3. Throughout, we shall focus on minimax estimation and assume the sparsity k is known, while the rank r will be selected based on data. Adaptation to k will be discussed in Section 6.
Notation
We first introduce some notation. For any matrix X = (xij) and any vector u, denote by ∥X∥ the spectral norm, ∥X∥F the Frobenius norm, and ∥u∥ the vector ℓ2 norm. Moreover, the ith row of X is denoted by Xi* and the jth column by X*j. Let supp(X) = {i : Xi* ≠ 0} denote the row support of X. For a positive integer p, [p] denotes the index set {1, 2, …, p}. For any set A, |A| denotes its cardinality, and Ac its complement. For two subsets I and J of indices, denote by XIJ the |I| × |J| submatrix formed by xij with (i, j) ∈ I × J. Let XI* = XI[p] and X*J = X[n]J. For any square matrix A = (aij), we let Tr(A) = Σi aii be its trace. Define the inner product of any two matrices B and C of the same size by 〈B, C〉 = Tr(B′C). For any matrix A, we use σi(A) to denote its ith largest singular value. When A is positive semi-definite, σi(A) is also the ith largest eigenvalue of A. For any real numbers a and b, set a ∨ b = max{a, b} and a ∧ b = min{a, b}. Let Sp−1 denote the unit sphere in ℝp. For any event E, we write 1{E} for its indicator function.
For any set B ⊂ [p], let Bc be its complement. For any symmetric matrix A ∈ ℝp × p, we use AB to denote the p × p matrix whose B × B block is ABB, the remaining diagonal elements are all ones and the remaining off-diagonal elements are all zeros, i.e.,
(10) |
In other words, after proper reordering of rows and columns, we have
Let P ⊗ Q denote the product measure of P and Q and P⊗n the n-fold product of P. For random variables X and Y, we write if they follow the same distribution, and if Ρ(X > t) ≤ Ρ(Y > t) for all t ∈ ℝ. Throughout the paper, we use C to denote a generic positive absolute constant, whose actual value may depend on the context. For any two sequences {an} and {bn} of positive numbers, we write an ≲ bn when an ≤ Cbn for some numeric constant C, and an ≳ bn when bn ≲ an, and an ≍ bn when both an ≳ bn and an ≲ bn hold.
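The operation in (10) is implemented by the small helper below (the function name is ours); it embeds the B × B block of A into a p × p matrix that is the identity off B.

import numpy as np

def block_embed(A, B):
    # A_B as in (10): B x B block copied from A, ones on the diagonal off B, zeros elsewhere
    p = A.shape[0]
    out = np.eye(p)
    idx = np.ix_(B, B)
    out[idx] = A[idx]
    return out

A = np.arange(16, dtype=float).reshape(4, 4)
A = (A + A.T) / 2                 # a symmetric example matrix
print(block_embed(A, [0, 2]))     # keeps only the {0, 2} x {0, 2} block of A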
Estimators
We are now ready to present the procedure for estimating the spiked covariance matrix Σ, the rank r, and the principal subspace span(V).
Let
(11) A = supp(V) = {j ∈ [p] : Vj* ≠ 0}
be the row support of V. For any m ∈ [p], let
(12) |
Recall that the observed matrix X has i.i.d. rows Xi* ~ N(0, Σ). We define S = X′X/n as the sample covariance matrix. Also recall that we assume knowledge of the sparsity level k, which is an upper bound for the support size |A|. The first step in the estimation scheme is to select a subset  of k features based on the data. To this end, let
(13) |
The appropriate value of γ1 will be specified later in the statement of the theorems. Intuitively speaking, the requirements in (13) aim to ensure that for any B ∈ Bk, there is no evidence in the data suggesting that Bc overlaps with the row support A of V. If Bk ≠ 0̸, denote by  an arbitrary element of Bk (or we can let  = argmaxB∈Bk Tr(SBB) for concreteness). As we will show later, Bk is non-empty with high probability; see Lemma 3 in Section 7.1. The set  represents the collection of selected features, which turns out to be instrumental in constructing optimal estimators for the three objects we are interested in: the covariance matrix, the rank of the spiked model, and the principal subspace. The estimator  of the support set A is obtained through searching over all subsets of size k. Such a global search, though computationally expensive, appears to be necessary in order for our procedure to optimally estimate Σ and V under the spectral norm. For example, estimating row-by-row is not guaranteed to always yield optimal results. Whether there exist computationally efficient procedures attaining the optimal rate is currently unknown. See Section 6 for more discussions.
Given Â, the estimators for the above three objects are defined as follows. Recalling the notation in (10), we define the covariance matrix estimator as
(14) Σ̂ = SÂ 1{Bk ≠ 0̸} + Ip 1{Bk = 0̸}.
The estimator for the rank is
(15) |
The appropriate value of γ2 will be specified later in the statement of the theorems. Last but not least, the estimator for the principal subspace is span(V̂), where
(16) V̂ = [v̂1, …, v̂r̂],
with v̂l the lth eigenvector of Σ̂. When Bk = 0̸, we set r̂ = 0 and V̂ = 0 since Σ̂ = Ip. Note that the estimator V̂ is based on the estimated rank r̂. Whenever r̂ ≠ r, the value of the loss function (8) equals 1.
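The three estimators can be summarized in the schematic Python sketch below. It replaces the deviation conditions defining Bk in (13) by the concrete choice Â = argmaxB Tr(SBB) mentioned above, and the rank rule (15) by a generic eigenvalue threshold, since the exact constants γ1 and γ2 are specified only in the theorems; it therefore illustrates the structure of the procedure rather than the tuned version analyzed below.

import numpy as np
from itertools import combinations

def estimate_spiked(X, k, rank_threshold):
    # Schematic version of the procedure of Section 2; the sparsity k is assumed known.
    n, p = X.shape
    S = X.T @ X / n                                   # sample covariance (mean is zero)
    # Support selection: maximize Tr(S_BB) over all size-k subsets (a simplification of (13)).
    A_hat = max(combinations(range(p), k),
                key=lambda B: np.trace(S[np.ix_(B, B)]))
    # Covariance estimator (14): embed the selected block into the identity as in (10).
    Sigma_hat = np.eye(p)
    Sigma_hat[np.ix_(A_hat, A_hat)] = S[np.ix_(A_hat, A_hat)]
    # Rank estimator: count eigenvalues exceeding 1 + rank_threshold (a stand-in for (15)).
    evals, evecs = np.linalg.eigh(Sigma_hat)
    evals, evecs = evals[::-1], evecs[:, ::-1]        # descending order
    r_hat = int(np.sum(evals >= 1.0 + rank_threshold))
    # Principal subspace estimator (16): leading r_hat eigenvectors of Sigma_hat.
    V_hat = evecs[:, :r_hat]
    return Sigma_hat, r_hat, V_hat

The enumeration over all size-k subsets makes the cost of this search grow combinatorially in p, which is precisely the computational issue discussed in Section 6.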
For a brief discussion on the comparison between the foregoing estimation procedure and that in [40], see Section 6.
3 Minimax Upper Bounds
We now investigate the properties of the estimation procedure introduced in Section 2. Rates of convergence for estimating the whole covariance matrix and its principal subspace under the spectral norm as well as for rank detection are established. The minimax lower bounds given in Section 4 will show that these rates are optimal and together they yield the minimax rates of convergence.
We begin with the estimation error of the covariance matrix estimator (14). Note that here we do not require the ratio λ1/λr to be bounded.
Theorem 1
Let Σ̂ be defined in (14) with Bk given by (13) for some γ1 ≥ 10. If for a sufficiently small constant c0 > 0, then
(17) |
where the parameter space Θ1 (k, p, r, λ) is defined in (4).
As we shall show later in Theorem 4, the rates in (17) are optimal with respect to all the model parameters, namely k, p, r and λ.
Remark 1
Since the parameter space Θ1 in (4) is contained in the set of k-sparse covariance matrices, it is of interest to compare the minimax rates for covariance matrix estimation in these two nested parameter spaces. For simplicity, we only consider the case where the spectral norms of covariance matrices are uniformly bounded by a constant. Cai and Zhou [13] showed that, under certain regularity conditions, the minimax rate of convergence for estimating k-sparse matrices is , while the rate over Θ1 in (7) reduces to when the spectral norm of the matrix and hence λ is bounded. Thus, ignoring the logarithmic terms, the rate over the smaller parameter space Θ1 is faster by a factor of k. This faster rate can be achieved because the group k-sparsity considered in our parameter space imposes much more structure than the row-wise k-sparsity does for the general k-sparse matrices.
The next result concerns the detection rate of the rank estimator (15) under the extra assumption that the ratio of the largest spike to the smallest, i.e., λ1/λr, is bounded.
Theorem 2
Let r̂ = r̂(γ1, γ2) be defined in (15) for some constants γ1 ≥ 10 and . Assume that
(18) |
for a sufficiently small constant c0 ∈ (0, 1) which depends on γ1. If for some sufficiently large β depending only on γ1, γ2 and τ, then
(19) |
where the parameter space Θ2(k, p, λ, τ) is defined in (5).
By Theorem 5 to be introduced later, the detection rate of is optimal. For more details, see the discussion in Section 4.
Finally, we turn to the risk of the principal subspace estimator. As in Theorem 2, we require λ1/λr to be bounded.
Theorem 3
Suppose
(20) |
holds for some absolute constants M0, M1 > 0. Let V̂ be defined in (16) with constants and in (13) and (15). If (18) holds for a sufficiently small constant c0 depending on γ1, then
(21) |
where the parameter space Θ0(k, p, r, λ, τ) is defined in (3).
Remark 2
To ensure that the choice of γ1 for achieving (21) is data-driven, we only need an under-estimate for M0 = log n/ log λ, or equivalently an overestimate for λ. (Note that M1 = log n/ log p can be obtained directly given the data matrix.) To this end, we first estimate Σ by Σ̂ in (14) with an initial γ1 = 10 in (13). Then we control λ by 2∥Σ̂∥ − 1. By the proof of Theorem 2, and in particular (75), this is an over-estimate of λ with high probability. The upper bound in (21) remains valid if we compute V̂ with a (possibly) new γ1 = 10 ∧ (1 + 2/M̂0)M1 in (13), where M̂0 = log n/ log(2∥ Σ̂ ∥ − 1).
It is worth noting that the rate in (21) does not depend on r, and is optimal, by the lower bound given in Theorem 4 later.
The problems of estimating Σ and V are clearly related, but they are also distinct from each other. To discuss their relationship, we first note the following result (proved in Section 7.5) which is a variation of the well-known sin-theta theorem [15]:
Proposition 1
Let Σ and Σ̂ be p × p symmetric matrices. Let r ∈ [p − 1] be arbitrary and let V, V̂ ∈ O(p, r) be formed by the r leading singular vectors of Σ and Σ̂, respectively. Then
(22) |
In view of Proposition 1, the minimax risks for estimating the spiked covariance matrix Σ and the principal subspace V under the spectral norm can be tied as follows:
(23) |
where Θ = Θ0(k, p, r, λ, τ).
The results of Theorems 1 and 3 suggest, however, that the above inequality is not tight when λ is large. The optimal rate for estimating V is not equivalent to that for estimating Σ divided by λ2 when . Consequently, Theorem 3 cannot be directly deduced from Theorem 1 but requires a different analysis by introducing an intermediate matrix S0 defined later in (50). This is because the estimation of Σ needs to take into account the extra error in estimating the eigenvalues in addition to those in estimating V. On the other hand, in proving Theorem 1 we need to contend with the difficulty that the loss function is unbounded.
4 Minimax Lower Bounds and Optimal Rates of Convergence
In this section we derive minimax lower bounds for estimating the spiked covariance matrix Σ and the principal subspace span(V) as well as for the rank detection. These lower bounds hold for all parameters and are non-asymptotic. The lower bounds together with the upper bounds given in Section 3 establish the optimal rates of convergence for the three problems.
The technical analysis heavily relies on a careful study of a rank-one testing problem and analyzing the moment generating function of a squared symmetric random walk stopped at a hypergeometrically distributed time. This lower bound technique is of independent interest and can be useful for other related matrix estimation and testing problems.
4.1 Lower bounds and minimax rates for matrix and subspace estimation
We first consider the lower bounds for estimating the spiked covariance matrix Σ and the principal subspace V under the spectral norm.
Theorem 4
For any 1 ≤ r ≤ k ≤ p and n ∈ ℕ,
(24) |
and
(25) |
where the parameter spaces Θ1 (k, p, r, λ) and Θ0 (k, p, r, λ, τ) are defined in (4) and (3), respectively.
To better understand the lower bound (24), it is helpful to write it equivalently as
which can be proved by showing that the minimax risk is lower bounded by each of these two terms. The first term does not depend on r and is the same as the lower bound in the rank-one case. The second term is the oracle risk when the true support of V is known. The key to the proof is the analysis of the rank-one case which will be discussed in more detail in Section 4.3. The proof of (25) is relatively straightforward by using known results on rank-one estimation.
In view of the upper bounds given in Theorems 1 and 3 and the lower bounds given in Theorem 4, we establish the following minimax rates of convergence for estimating the spiked covariance matrix Σ and the principal subspace V, subject to the constraints on the parameters given in Theorems 1 and 3:
(26) |
(27) |
where (27) holds under the additional condition that k − r ≳ k. Therefore, the estimators of Σ and V given in Section 2 are rate optimal. In (26), the trivial upper bound of λ2 can always be achieved by using the identity matrix as the estimator.
4.2 Lower bound and minimax rate for rank detection
We now turn to the lower bound and minimax rate for the rank detection problem.
Theorem 5
Let be a constant. For any 1 ≤ r ≤ k ≤ p and n ∈ ℕ, if , then
(28) |
where the function w : (0, ) → (0, 1) satisfies w(0+) = 1.
The upper and lower bounds given in Theorems 2 and 5 show that the optimal detection boundary for the rank r is . That is, the rank r can be estimated with an arbitrarily small error probability when for a sufficiently large constant β, whereas this is impossible to achieve by any method if for some small positive constant β0. Note that Theorem 5 applies to the full range of sparsity including the non-sparse case k = p, which requires . This observation turns out to be useful in proving the “parametric term” in the minimax lower bound for estimating Σ in Theorem 4.
The rank detection lower bound in Theorem 5 is in fact a direct consequence of the next proposition concerning testing the identity covariance matrix against rank-one alternatives,
(29) H0 : Σ = Ip versus H1 : Σ = Ip + λvv′ for some v ∈ Sp−1 ∩ B0(k),
where B0(k) ≜ {x ∈ ℝp : |supp(x)| ≤ k}. Note that Σ is in the parameter space Θ2 under both the null and the alternative hypotheses. The rank-one testing problem (29) has been studied in [2], where there is a gap between the lower and upper bounds when . The following result shows that their lower bound is in fact sub-optimal in this case. We shall give below a dimension-free lower bound for the optimal probability of error and determine the optimal rate of separation. The proof is deferred to Section 7.2.2.
Proposition 2
Let be a constant. Let X be an n × p random matrix whose rows are independently drawn from N(0, Σ). For any k ∈ [p] and n ∈ ℕ, if , the minimax sum of Type-I and Type-II error probabilities for the testing problem (29) satisfies
where the function w : (0, ) → (0, 1) satisfies w(0+) = 1.
Proposition 2 shows that testing independence in the rank-one spiked covariance model can be achieved reliably only if the effective signal-to-noise ratio
Furthermore, the lower bound in Proposition 2 also captures the following phenomenon: if β0 vanishes, then the optimal probability of error converges to one. In fact, the lower bound in Proposition 2 is optimal in the sense that the following test succeeds with vanishing probability of error if β0 → ∞:
for some c that depends only on β0. See, e.g., [2, Section 4]. However, the above test has high computational complexity since one needs to enumerate all k × k submatrices of S. It remains an open problem to construct tests that are both computationally feasible and minimax rate-optimal.
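A schematic version of this brute-force test is given below; the calibration of the threshold, which the discussion above ties to the detection boundary and to β0, is deliberately left as an input since the exact constant is not reproduced here.

import numpy as np
from itertools import combinations

def sparse_spike_test(X, k, threshold):
    # Reject H0 : Sigma = I_p if some k x k principal submatrix of S has a large top eigenvalue.
    n, p = X.shape
    S = X.T @ X / n
    stat = max(np.linalg.eigvalsh(S[np.ix_(B, B)])[-1] for B in combinations(range(p), k))
    return stat > 1.0 + threshold      # True means a spike is detected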
4.3 Testing rank-one spiked model
As mentioned earlier, a careful study of the rank-one testing problem (29) provides a major tool for the lower bound arguments. A key step in this study is the analysis of the moment generating function of a squared symmetric random walk stopped at a hypergeometrically distributed time. We present the main ideas in this section as the techniques can also be useful for other related matrix estimation and testing problems.
It is well-known that the minimax risk is given by the least-favorable Bayesian risk under mild regularity conditions on the model [27]. For the composite testing problem (29), it turns out that the rate-optimal least-favorable prior for v is given by the distribution of the following random vector:
(30) u = k−1/2JIw,
where w = (w1, …, wp) consists of iid Rademacher entries, and JI is a diagonal matrix given by (JI)ii = 1{i∈I} with I uniformly chosen from all subsets of [p] of size k. In other words, u is uniformly distributed on the collection of k-sparse vectors of unit length with equal-magnitude non-zeros. Hence u ∈ Sp−1 ∩ B0(k). We set
where β0 > 0 is a sufficiently small absolute constant.2 The desired lower bound then follows if we establish that the following (Bayesian) hypotheses
(31) H0 : Σ = Ip versus H1 : Σ = Ip + λuu′ with u distributed according to (30)
cannot be separated with vanishing probability of error.
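Sampling from the prior (30) is straightforward; the sketch below is a direct transcription of the description above, with all sizes chosen for illustration only.

import numpy as np

def sample_prior_u(p, k, rng):
    # u = k^{-1/2} J_I w as in (30): uniform size-k support I, i.i.d. Rademacher signs w
    I = rng.choice(p, size=k, replace=False)
    u = np.zeros(p)
    u[I] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    return u

rng = np.random.default_rng(3)
u = sample_prior_u(p=20, k=5, rng=rng)
print(np.isclose(np.linalg.norm(u), 1.0), np.count_nonzero(u) == 5)  # unit norm, k-sparse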
Remark 3
The composite testing problem (31) has also been considered in [2]. In particular, the following suboptimal lower bound is given in [2, Theorem 5.1]: If
(32) |
then the optimal error probability satisfies εn(k, p, λ) ≥ C(υ), where C(υ) → 1 as υ → 0. This result is established based on the following prior:
(33) ui = k−1/2 1{i∈I}, i ∈ [p],
which is a binary sparse vector with uniformly chosen support.
Compared to the result in Proposition 2, (32) is rate-optimal in the very sparse regime of . However, since log(1 + x) ≍ x when x ≲ 1, in the moderately sparse regime of , and so the lower bound in (32) is substantially smaller than the optimal rate in Proposition 2 by a factor of , which is a polynomial factor in k when k ≳ pα for any α > 1/2. In fact, by strengthening the proof in [2], one can show that the optimal separation for discriminating (31) using the binary prior (33) is . Therefore the prior (33) is rate-optimal only in the regime of , while (30) is rate-optimal for all k. Examining the role of the prior (30) in the proof of Theorem 5, we see that it is necessary to randomize the signs of the singular vector in order to take advantage of the central limit theorem and Gaussian approximation. When , the fact that the singular vector u is positive componentwise reduces the difficulty of the testing problem.
The main technical tool for establishing the rank-detection lower bound in Proposition 2 is the following lemma, which can be of independent interest. It deals with the behavior of a symmetric random walk stopped after a hypergeometrically distributed number of steps. Moreover, note that Lemma 1 also incorporates the non-sparse case (k = p and H = k), which proves to be useful in establishing the minimax lower bound for estimating Σ in Theorem 4. The proof of Lemma 1 is deferred to Section 7.6.
Lemma 1
Let p ∈ ℕ and k ∈ [p]. Let B1, …, Bk be i.i.d. Rademacher random variables. Denote the symmetric random walk on ℤ stopped at the mth step by
(34) Gm ≜ B1 + ⋯ + Bm.
Let H ~ Hypergeometric(p, k, k) with , i = 0, …, k. Then there exists a function g : (0, ) → (1, ∞) with g(0+) = 1, such that for any ,
(35) |
where .
Remark 4 (Tightness of Lemma 1)
The purpose of Lemma 1 is to seek the largest t, as a function of (p, k), such that Ε exp ( ) is upper bounded by a constant non-asymptotically. The condition that is in fact both necessary and sufficient. To see the necessity, note that Ρ {GH = H∣H = i} = 2−i. Therefore
which cannot be upper bounded by an absolute constant unless .
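The quantity controlled by Lemma 1 can be explored by simulation, as in the sketch below: H is realized as the overlap of two independent uniform size-k subsets of [p], GH is the stopped Rademacher walk, and the sample mean of exp(tGH2) is reported. The value of t used here is purely illustrative and is not the critical scaling identified in the lemma.

import numpy as np

rng = np.random.default_rng(4)
p, k, t, trials = 200, 10, 0.02, 100000
H = rng.hypergeometric(ngood=k, nbad=p - k, nsample=k, size=trials)  # overlap of two size-k subsets
signs = rng.choice([-1, 1], size=(trials, k))                        # Rademacher steps
G_H = np.array([signs[i, :h].sum() for i, h in enumerate(H)])        # walk stopped at step H
print(np.mean(np.exp(t * G_H ** 2)))                                 # estimate of E exp(t G_H^2)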
5 Estimation of Precision Matrix and Eigenvalues
We have so far focused on the optimal rates for estimating the spiked covariance matrix Σ, the rank r and the principal subspace span(V). The technical results and tools developed in the earlier sections turn out to be readily useful for establishing the optimal rates of convergence for estimating the precision matrix Ω = Σ−1 as well as the eigenvalues of Σ under the spiked covariance matrix model (1).
Besides the covariance matrix Σ, it is often of significant interest to estimate the precision matrix Ω, which is closely connected to the Gaussian graphical model as the support of the precision matrix Ω coincides with the edges in the corresponding Gaussian graph. Let Σ̂ be defined in (14) and let σi(Σ̂) denote its ith largest eigenvalue for all i ∈ [p]. Define the precision matrix estimator as
(36) |
The following result gives the optimal rates for estimating Ω under the spectral norm.
Proposition 3 (Precision matrix estimation)
Assume that λ ≍ 1. If for a sufficiently small constant c0 > 0, then
(37) |
where the upper bound is attained by the estimator (36) with γ1 ≥ 10 in obtaining Σ̂.
The upper bound follows along the lines of the proof of Theorem 1 after we control the smallest eigenvalue of Σ̂ as in (36). Proposition 2 can be readily applied to yield the desired lower bound.
Note that the optimal rate in (37) is quite different from the minimax rate of convergence for estimating k-sparse precision matrices where each row/column has at most k nonzero entries. Here M is the ℓ1 norm bound for the precision matrices. See [9]. So the sparsity in the principal eigenvectors and the sparsity in the precision matrix itself have significantly different implications for estimation of Ω under the spectral norm.
We now turn to estimation of the eigenvalues. Since σ is assumed to be equal to one, it suffices to estimate the eigenvalue matrix E = diag(λi) where λi ≜ 0 for i > r. For any estimator Ẽ = diag(λ̃i), we quantify the estimation error by the loss function ∥Ẽ − E∥2 = maxi∈[p] |λ̃i − λi|2. The following result gives the optimal rate of convergence for this estimation problem.
Proposition 4 (Uniform estimation of spectra)
Under the same conditions as in Proposition 3,
(38) |
where the upper bound is attained by the estimator Ê = diag(σi(Σ̂)) − Ip with Σ̂ defined in (14) with γ1 ≥ 10.
Hence, the spikes and the eigenvalues can be estimated uniformly at the rate of when k ≪ n/ log p. Proposition 4 is a direct consequence of Theorem 1 and Proposition 2. The proofs of Propositions 3 and 4 are deferred to Section 7.4.
6 Discussions
We have assumed the knowledge of the noise level σ = 1 and the support size k. For a given value of σ2, one can always rescale the data and reduce to the case of unit variance. As a consequence of rescaling, the results in this paper remain valid for a general σ2 by replacing each λ with λ/σ2 in both the expressions of the rates and the definitions of the parameter spaces. When σ2 is unknown, it can be easily estimated based on the data. Under the sparsity models (3)-(5), when k < p/2, σ2 can be well estimated by σ̂2 = median(sjj) as suggested in [20], where sjj is the jth diagonal element of the sample covariance matrix.
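A minimal version of this noise estimate, assuming mean-zero data as in the rest of the paper:

import numpy as np

def estimate_noise_variance(X):
    # sigma^2 estimated by the median of the diagonal of S = X'X / n (suggested for k < p/2)
    n = X.shape[0]
    diag_S = np.einsum('ij,ij->j', X, X) / n
    return float(np.median(diag_S))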
The knowledge of the support size k is much more crucial for our procedure. An interesting topic for future research is the construction of adaptive estimators which could achieve the minimax rates in Theorems 1–3 without knowing k. One possibility is to define Bk in (13) for all k ∈ [p], find the smallest k such that Bk is non-empty, and then define estimators for that particular k similar to those in Section 2 with necessary adjustments accounting for the extra multiplicity in the support selection procedure.
For ease of exposition, we have assumed that the data are normally distributed and so various non-asymptotic tail bounds used in the proofs follow. Since these bounds typically only rely on sub-Gaussianity assumptions on the data, we expect the results in the present paper to be readily extendable to data generated from distributions with appropriate sub-Gaussian tail conditions. The estimation procedure in Section 2 is different from that in [40]. Although both are based on enumerating all possible support sets of size k, Vu and Lei [40] proposed to pick the support set which maximizes a quadratic form, while ours is based on picking the set satisfying certain deviation bounds.
A more important issue is the computational complexity required to obtain the minimax rate optimal estimators. The procedure described in Section 2 entails a global search for the support set A, which can be computationally intensive. In many cases, this seems unavoidable since the spectral norm is not separable in terms of the entries/rows/columns. However, in some other cases, there are estimators that are computationally more efficient and can attain the same rates of convergence. For instance, in the low rank cases where , the minimax rates for estimating span(V) under the spectral norm and under the Frobenius norm coincide with each other. See the discussion following Theorem 3 in Section 3. Therefore the procedure introduced in [10, Section 3] attains the optimal rates under both norms simultaneously. As shown in [10], this procedure is not only computationally efficient, but also adaptive to the sparsity k. Finding the general complexity-theoretic limits for attaining the minimax rates under the spectral norm is an interesting and challenging topic for future research. Following the initial post of the current paper, Berthet and Rigollet [1] showed in a closely related sparse principal component detection problem that the minimax detection rate cannot be achieved by any computationally efficient algorithm in the highly sparse regime.
7 Proofs
We first collect a few useful technical lemmas in Section 7.1 before proving the main theorems in Section 7.2 in the order of Theorems 1 - 4. We then give the proofs of the propositions in the order of Propositions 2, 3, 4, and 1. As mentioned in Section 4, Theorem 5 on the rank detection lower bound is a direct consequence of Proposition 2. We complete this section with the proof of Lemma 1.
Recall that the row vectors of X are i.i.d. samples from the N(0, Σ) distribution with Σ specified by (1). Equivalently, one can think of X as an n × p data matrix generated as
(39) |
where U is an n × r random effects matrix with iid N(0, 1) entries, , V is p × r orthonormal, and Z has iid N(0, 1) entries which are independent of U.
7.1 Technical Lemmas
Lemma 2
Let S = X′X/n be the sample covariance matrix. Then
Proof
Note that
The result follows from the Davidson–Szarek bound [14, Theorem II.7] and the union bound.
Lemma 3
Let Bk be defined as in (13) with γ1 ≥ 3. Then Ρ(A ∉ Bk) ≤ 5(ep)1−γ1/2.
Proof
Note that by the union bound
We now bound the two terms on the right-hand side separately.
For the first term, note that A = supp(V). Then for any D ⊂ Ac, . Hence
where the second inequality is [10, Proposition 4], and the last inequality holds for all γ1 ≥ 3 and p ≥ 2.
For the second term, note that for any fixed D ⊂ Ac, , where Z*D and X*A are independent. Thus, letting W be the left singular vector matrix of X*A, we obtain that3
where Y is a |D| × k matrix with i.i.d. N(0, 1) entries. Thus, we have
where the third inequality is due to the Davidson-Szarek bound. Combining the two bounds completes the proof.
Lemma 4
Let γ1 ≥ 3. Suppose that for a sufficiently small constant c0 > 0. Then with probability at least 1 − 12(ep)1−γ1/2, Bk ≠ 0̸ and
(40) |
where .
Proof
We focus on the event Bk ≠ 0̸. Define the sets
(41) |
which correspond to the sets of correctly identified, missing, and overly identified features by the selected support set Â, respectively. By the triangle inequality, we have
(42) |
We now provide high probability bounds for each term on the right-hand side.
For the first term, recall that  ∈ Bk which is defined in (13). Since M ⊂ Âc, we have
(43) |
For the second term, by similar calculation to that in the proof of Lemma 3, we have that when γ1 ≥ 3,
(44) |
with probability at least 1 − 4(ep)1−γ1/2. For the third term, we turn to the definition of Bk in (13) again to obtain
By Lemma 2, with probability at least 1 − (ep)1−γ1/2,
(45) |
where the second inequality holds when and the last inequality holds for sufficiently small constant c0. Moreover, the last two displays jointly imply
(46) |
For the fourth term, we obtain by similar arguments that with probability at least 1 − (ep)1 − γ1/2,
(47) |
Note that Ρ(Bk = 0̸) ≤ Ρ(A ∉ Bk), and by Lemma 3, the latter is bounded above by 5(ep)1 −γ1/2. So the union bound implies that the intersection of the event {Bk ≠ 0̸} and the event that (43)-(47) all hold has probability at least 1 − 12(ep)1 −γ1/2. On this event, we assemble (42)-(47) to obtain
This completes the proof.
Lemma 5
Let Y ∈ ℝn × k and Z ∈ ℝn × m be two independent matrices with i.i.d. N (0, 1) entries. Then there exists an absolute constant C > 0, such that
(48) |
(49) |
Proof
The inequality (48) follows directly from integrating the high-probability upper bound in [10, Proposition 4].
Let the SVD of Y be Y = ACB′, where A and B are n × (n ∧ k) and k × (n ∧ k) matrices with orthonormal columns, uniformly (Haar) distributed, and C is an (n ∧ k) × (n ∧ k) diagonal matrix with ∥C∥ = ∥Y∥. Since A and Z are independent, A′Z has the same law as an (n ∧ k) × m matrix with iid N(0, 1) entries. Therefore
for some absolute constant C0, where the last inequality follows from the Davidson–Szarek theorem [14, Theorem II.7]. Exchanging the roles of Y and Z, we have E∥Y′Z∥2 ≤ C0(n + m)(n ∧ m + k). Consequently,
This completes the proof.
In the proofs, the following intermediate matrix
(50) |
plays a key role. In particular, the following results on S0 will be used repeatedly.
Lemma 6
Suppose λ1/λr ≤ τ for some constant τ ≥ 1. If , then
(51) |
Moreover, VV′ is the projection matrix onto the rank r principal subspace of S0.
Proof
It is straightforward to verify that
(52) |
When , . So Weyl’s inequality [18, Theorem 4.3.1] leads to
Note that S0 always has 1 as its eigenvalue with multiplicity at least p − r. We thus obtain (51).
When (51) holds, (50) shows that the rank r principal subspace of S0 is equal to that of . Therefore, the subspace is spanned by the column vectors of V, and VV′ is the projection matrix onto it since V ∈ O(p, r).
To prove the lower bound for rank detection, we need the following lemma concerning the χ2-divergence in covariance models. Recall that the χ2-divergence between two probability measures P and Q with P ≪ Q is defined as χ2(P‖Q) ≜ ∫ (dP/dQ)2 dQ − 1.
For a distribution F, we use F⊗n to denote the product distribution of n independent copies of F.
Lemma 7
Let ν be a probability distribution on the space of p × p symmetric random matrices M such that ∥M∥ ≤ 1 almost surely. Consider the scale mixture distribution Ε [N(0, Ip + M)⊗n] = ∫ N(0, Ip + M)⊗n ν(dM). Then
(53) |
(54) |
where M1 and M2 are independently drawn from ν. Moreover, if rank(M) = 1 a.s., then (54) holds with equality.
Proof
Denote by gi the probability density function of N (0, Σi) for i = 0, 1 and 2, respectively. Then it is straightforward to verify that
(55) |
if Σ0(Σ1 + Σ2) ≥ Σ1Σ2; otherwise, the integral on the left-hand side of (55) is infinite. Applying (55) to Σ0 = Ip and Σi = Ip + Mi and using Fubini’s theorem, we have
(56) |
(57) |
(58) |
where (56) is due to ∥M1M2∥ ≤ ∥M1∥∥M2∥ ≤ 1 a.s., and (58) is due to log det(I + A) ≤ Tr(A), with equality if and only if rank(A) = 1.
7.2 Proofs of Main Results
7.2.1 Proofs of the Upper Bounds
Proof of Theorem 1
Let γ1 ≥ 10 be a constant. Denote by E the event that (40) holds, which, in particular, contains the event {Bk ≠ 0̸}. By the triangle inequality,
(59) |
where the last step holds because . In view of the definition of η in (12), we have .
To bound the second term in (59), define JB as the diagonal matrix given by
(60) (JB)ii = 1{i∈B}, i ∈ [p].
Then, for S0 in (50),
(61) |
Therefore
(62) |
In view of (52) and Lemma 5, we have
(63) |
where the last step is due to by assumption. Similarly,
(64) |
Again by (49) in Lemma 5,
(65) |
Assembling (62) - (65), we have
(66) |
Combining (59) and (66) yields
(67) |
Next we control the estimation error conditioned on the complement of the event E. First note that . Then ∥SÂ − Σ∥ ≤ ∥S∥ + ∥Σ∥ + 1 ≤ ∥Σ∥ (∥Wp∥ + 1), where Wp is equal to times a p × p Wishart matrix with n degrees of freedom. Also, ∥Σ − I∥ = λ1. In view of (14), we have ∥Σ̂ − Σ∥ ≤ (1 + λ1)(∥Wp∥ + 2). Using the Cauchy–Schwarz inequality, we have
Note that . By [14, Theorem II.7], . Then . By Lemma 4, we have Ρ {Ec} ≤ 12(ep)1−γ1/2. Therefore
(68) |
Choose a fixed γ1 ≥ 10. Assembling (67) and (68), we have
Proof of Theorem 2
To prove the theorem, it suffices to show that with the desired probability,
(69) |
Recall (50) and (52). By [10, Proposition 4], with probability at least 1 − 2(ep)−γ1/2, . Under (18), when c0 is sufficiently small, and so η(k, n, p, γ1) ≤ 1/(2τ). Thus,
(70) |
Therefore, Lemma 6 leads to (51). Moreover, Weyl’s inequality leads to
(71) |
Next, we consider SA − S0. By (61), we have
By [10, Proposition 4] and (18), when c0 is sufficiently small, with probability at least 1 − (ep + 1)(ep)−γ1/2,
Moreover, [10, Proposition 3] implies that with probability at least 1−2(ep)−γ1/2,
Assembling the last three displays, we obtain that with probability at least 1 − (ep + 3)(ep)−γ1/2,
(72) |
Last but not least, Lemma 4 implies that, under the condition of Theorem 2, with probability at least 1 − 12(ep)1−γ1/2, (40) holds. Together with (72), it implies that
(73) |
where . By (18), we could further upper bound the right-hand side by λ1/4 ∨ 1/2. When λ1 ≥ 1, , so for sufficiently small c0, (18) implies that the right side of (73) is further bounded by λ1/4. When λ1 ≤ 1, , and so the right side of (73) is further bounded by 1/2 for sufficiently small c0.
Thus, the last display, together with (70), implies
(74) |
Here, the last inequality comes from the above discussion, and the fact that η(k, n, p, γ1) < 1/4 for small c0. The triangle inequality further leads to
(75) |
Set . Then (51), (73) and (75) jointly imply that, with probability at least 1−(13ep+5)(ep)−γ1/2, the second inequality in (69) holds. Moreover, (74) and the triangle inequality imply that, with the same probability,
Here, the last inequality holds when for a sufficiently large β which depends only on γ1, γ2 and τ. In view of (75), the last display implies the first inequality of (69). This completes the proof of the upper bound.
Proof of Theorem 3
Let E be the event such that Lemma 3, Lemma 4, the upper bound in Theorem 2, and (71) hold. Then Ρ(Ec) ≤ C(ep)1−γ1/2.
On the event E, Σ̂ = SÂ. Moreover, Lemma 6 shows that VV′ is the projection matrix onto the principal subspace of S0, and Theorem 2 ensures V̂ ∈ ℝp × r. Thus, Proposition 1 leads to
and so
To further bound the right-hand side of the last display, we apply (61), (64), (65), Lemma 3 and Lemma 4 to obtain
Together with the second last display, this implies
(76) |
Now consider the event Ec. Note that ∥V̂V̂′ − VV′∥ ≤ 1 always holds. Thus,
(77) |
where the last inequality holds under condition (20) for all . Assembling (76) and (77), we obtain the upper bounds.
7.2.2 Proofs of the Lower Bounds
Proof of Theorem 4
1° The minimax lower bound for estimating span(V) follows straightforwardly from previous results on estimating the leading singular vector, i.e., the rank-one case (see, e.g., [5, 39]). The desired lower bound (25) can be found in [10, Eq. (58)] in the proof of [10, Theorem 3].
2° Next we establish the minimax lower bound for estimating the spiked covariance matrix Σ under the spectral norm, which is considerably more involved. Let Θ1 = Θ1(k, p, r, λ). In view of the fact that a ∧ (b + c) ≤ a ∧ b + a ∧ c ≤ 2(a ∧ (b + c)) for all a, b, c ≥ 0, it is convenient to prove the following equivalent lower bound
(78) |
To this end, we show that the minimax risk is lower bounded by the two terms on the right-hand side of (78) separately. In fact, the first term is the minimax rate in the rank-one case and the second term is the rate of the oracle risk when the estimator knows the true support of V.
2.1° Consider the following rank-one subset of the parameter space
Then σ1(Σ) = 1 + λ and σ2(Σ) = 1. For any estimator Σ̂ denote by v̂ its leading singular vector. Applying Proposition 1 yields
Then
(79) |
(80) |
where (79) follows from rank(v̂v̂′− vv′) ≤ 2 and (80) follows from the minimax lower bound in [39, Theorem 2.1] (see also [5, Theorem 2]) for estimating the leading singular vector.
2.2° To prove the lower bound , consider the following (composite) hypothesis testing problem:
(81) |
where with a sufficiently small absolute constant b > 0. Since r ∈ [k] and , both the null and the alternative hypotheses belong to the parameter set Θ1 defined in (4). Following Le Cam's two-point argument [38, Section 2.3], we next show that the minimal sum of Type-I and Type-II error probabilities of testing (81) is non-vanishing. Since any pair of covariance matrices in H0 and H1 differ in operator norm by at least , we obtain a lower bound of rate .
To this end, let X consist of n iid rows drawn from N(0, Σ), where Σ is either from H0 or H1. Since under both the null and the alternative, the last p−r columns of X are standard normal and independent of the first r columns, we conclude that the first r columns form a sufficient statistic. Therefore the minimal Type-I+II error probability of testing (81), denoted by εn, is equal to that of the following testing problem of dimension r and sample size n:
(82) |
Recall that the minimax risk is lower bounded by the Bayesian risk. For any random vector u taking values in Sr−1, denote by Ε[N(0, Ir + ρuu′)⊗n] the mixture alternative distribution with a prior equal to the distribution of u. Applying [38, Theorem 2.2 (iii)] we obtain the following lower bound in terms of the χ2-divergence from the mixture alternative to the null:
(83) |
Consider the unit random vector u with iid coordinates taking values in {±r−1/2} uniformly. Since ρ ≤ 1, applying the equality case of Lemma 7 yields
where in the last step we recall that Gr is the symmetric random walk on ℤ at the rth step defined in (34). Since , choosing as a fixed constant and applying Lemma 1 with p = k = r (the non-sparse case), we conclude that
(84) |
where g is given by in Lemma 1 and satisfies g > 1. Combining (83) - (84), we obtain the following lower bound for estimating Σ:
(85) |
As we mentioned in Section 4, the rank-detection lower bound in Theorem 5 is a direct consequence of Proposition 2 concerning testing rank-zero versus rank-one perturbation, which we prove below.
7.3 Proof of Proposition 2
Proof
Analogous to (83), any random vector u taking values in Sp−1 ∩ B0(k) gives the Bayesian lower bound
(86) |
Put
Let the random sparse vector u be defined in (30). In view of Lemma 7 as well as the facts that rank(λvv′) = 1 and ∥λvv′∥ = λ ≤ 1, we have
where in the last step we have defined H ≜ |I ∩ Ĩ| ~ Hypergeometric(p, k, k) and {Gm} is the symmetric random walk on ℤ defined in (34). Now applying Lemma 1, we conclude that
where g is given by in Lemma 1 satisfying g(0+) = 1. In view of (86), we conclude that
Note that the function w satisfies w(0+) = 1.
7.4 Proof of Propositions 3 and 4
We give here a joint proof of Propositions 3 and 4.
Proof
1° (Upper bounds) By assumption, we have λ ≍ 1 and . Since r ∈ [k], applying Theorem 1 yields
(87) |
Note that
(88) |
where the last inequality follows from σp (Σ) = 1 and Weyl’s inequality. By Chebyshev’s inequality, . Let . Then
(89) |
Moreover, again by Weyl’s inequality, we have hence in view of (36). On the other hand, by definition we always have ∥Ω̂∥ ≤ 2. Therefore
(90) |
(91) |
where (90) follows from (88) and (91) is due to (87) and (89).
Next we prove the upper bound for estimating E. Recall that E+Ip and Ê+Ip give the diagonal matrices formed by the ordered singular values of Σ and Σ̂, respectively. Similar to the proof of Proposition 4, Weyl's inequality implies that ∥Ê−E∥ = ∥Ê+Ip−(E+Ip)∥ = maxi |σi(Σ̂)−σi(Σ)| ≤ ∥Σ̂−Σ∥, where for any i ∈ [p], σi(Σ) is the ith largest eigenvalue of Σ. In view of Theorem 1, we have .
2° (Lower bounds) The lower bound follows from the testing result in Proposition 2. Consider the testing problem (29); then both the null (Σ = Ip) and alternatives (Σ = Ip+λvv′) are contained in the parameter space Θ1. By Proposition 2, they cannot be distinguished with probability 1 if . The spectra differ by at least |σ1(Ip) − σ1(Ip + λvv′)| = λ. By the Woodbury identity, , hence . The proof of the lower bound is now completed by the usual two-point argument.
7.5 Proof of Proposition 1
Proof
Recall that σ1 (Σ) ≥ … ≥ σp(Σ) ≥ 0 denote the ordered singular values of Σ. If , then by Weyl's inequality, we have . If , then by the Davis–Kahan sin-theta theorem [15] (see also [10, Theorem 10]) we have
completing the proof of (22) in view of the fact that ∥V̂V̂′ − VV′∥ ≤ 1.
7.6 Proof of Lemma 1
Proof
First of all, we can safely assume that p ≥ 5, for otherwise the expectation on the right-hand side of (35) is obviously upper bounded by an absolute constant. In the sequel we shall assume that
(92) |
We use a normal approximation of the random walk Gm for small m and a truncation argument to deal with large m. To this end, we divide [p], the whole range of k, into three regimes.
Case I: Large k
Assume that . Then . By the following non-asymptotic version of Tusnády's coupling lemma (see, for example, [6, Lemma 4, p. 242]), for each m, there exists Zm ~ N(0, m), such that
(93) |
Since H ≤ p, in view of (92), we have
(94) |
(95) |
Case II: Small k
Assume that , which, in particular, implies that since p ≥ 5. Using , we have
(96) |
(97) |
(98) |
where (96) follows from the stochastic dominance of hypergeometric distributions by binomial distributions (see, e.g., [23, Theorem 1.1(d)])
(99) |
and the moment generating function of binomial distributions, (97) is due to (1 + x)k ≤ exp(kx) for x ≥ 0 and , (98) is due to
and
Case III: Moderate k
Assume that . Define
(100) |
and write
(101) |
Next we bound the two terms in (101) separately: For the first term, we use normal approximation of Gm. By (93) - (94), for each fixed m ≤ A, we have
(102) |
where the last inequality follows from and (100).
To control the second term in (101), without loss of generality, we assume that A ≤ k, i.e., . We proceed similarly to (96) - (98):
(103) |
(104) |
(105) |
(106) |
where (103) follows from (99), (104) follows from the fact that , p ≥ 2k and m ≤ A, (105) follows from the fact that e log x ≤ x for all x ≥ 1, and (106) is due to (100) and our choice of a in (92).
(107) |
We complete the proof of (35), with g(a) defined as the maximum of the right-hand sides of (95), (98) and (107).
Acknowledgments
The research of Tony Cai was supported in part by NSF FRG Grant DMS-0854973, NSF Grant DMS-1208982, and NIH Grant R01 CA 127334. The research of Zongming Ma was supported in part by the Dean's Research Fund of the Wharton School. The research of Yihong Wu was supported in part by NSF FRG Grant DMS-0854973.
Footnotes
Here and after, a ∧ b ≜ min(a, b) and an ≍ bn means that an/bn is bounded from both below and above by constants independent of n and all model parameters.
Here β0 can be chosen to be any constant smaller than . See Proposition 2. The number is certainly not optimized.
If rank(X*A) < k, then is changed to , and the subsequent arguments continue to hold verbatim.
References
- 1.Berthet Q, Rigollet P. Complexity theoretic lower bounds for sparse principal component detection. Journal of Machine Learning Research: Workshop and Conference Proceedings. 2013;30:1–21.
- 2.Berthet Q, Rigollet P. Optimal detection of sparse principal components in high dimension. Annals of Statistics. 2013;41(4):1780–1815.
- 3.Bickel P, Levina E. Covariance regularization by thresholding. The Annals of Statistics. 2008;36(6):2577–2604.
- 4.Bickel P, Levina E. Regularized estimation of large covariance matrices. The Annals of Statistics. 2008;36(1):199–227.
- 5.Birnbaum A, Johnstone I, Nadler B, Paul D. Minimax bounds for sparse PCA with noisy high-dimensional data. arXiv preprint arXiv:1203.0967. 2012. doi: 10.1214/12-AOS1014.
- 6.Bretagnolle J, Massart P. Hungarian constructions from the nonasymptotic viewpoint. The Annals of Probability. 1989;17(1):239–256.
- 7.Bunea F, Xiao L. On the sample covariance matrix estimator of reduced effective rank population matrices, with applications to fPCA. arXiv preprint arXiv:1212.5321. 2012.
- 8.Cai T, Liu W. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association. 2011;106:672–684.
- 9.Cai T, Liu W, Zhou H. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. arXiv preprint arXiv:1212.2882. 2012.
- 10.Cai T, Ma Z, Wu Y. Sparse PCA: Optimal rates and adaptive estimation. Preprint. 2012. URL http://arxiv.org/abs/1211.1309.
- 11.Cai T, Ren Z, Zhou H. Optimal rates of convergence for estimating Toeplitz covariance matrices. Probability Theory and Related Fields. 2012:1–43.
- 12.Cai T, Zhang CH, Zhou H. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics. 2010;38(4):2118–2144.
- 13.Cai T, Zhou H. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics. 2012.
- 14.Davidson K, Szarek S. Local operator theory, random matrices and Banach spaces. In: Handbook on the Geometry of Banach Spaces, Vol. 1. Elsevier Science; 2001. pp. 317–366.
- 15.Davis C, Kahan W. The rotation of eigenvectors by a perturbation. III. SIAM J Numer Anal. 1970;7(1):1–46.
- 16.Fan J, Fan Y, Lv J. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics. 2008;147(1):186–197.
- 17.Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. doi: 10.1093/biostatistics/kxm045.
- 18.Horn RA, Johnson CR. Matrix Analysis. Cambridge University Press; 1990.
- 19.Johnstone I. On the distribution of the largest eigenvalue in principal component analysis. The Annals of Statistics. 2001;29:295–327.
- 20.Johnstone I, Lu A. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association. 2009;104:682–693. doi: 10.1198/jasa.2009.0121.
- 21.Jung S, Marron J. PCA consistency in high dimension, low sample size context. The Annals of Statistics. 2009;37(6B):4104–4130.
- 22.Karoui N. Operator norm consistent estimation of large-dimensional sparse covariance matrices. The Annals of Statistics. 2008;36:2717–2756.
- 23.Klenke A, Mattner L. Stochastic ordering of classical discrete distributions. Advances in Applied Probability. 2010;42(2):392–410.
- 24.Kritchman S, Nadler B. Determining the number of components in a factor model from limited noisy data. Chemometrics and Intelligent Laboratory Systems. 2008;94(1):19–32.
- 25.Kritchman S, Nadler B. Non-parametric detection of the number of signals: hypothesis testing and random matrix theory. IEEE Transactions on Signal Processing. 2009;57(10):3930–3941.
- 26.Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics. 2009;37:4254–4278. doi: 10.1214/09-AOS720.
- 27.Le Cam L. Asymptotic Methods in Statistical Decision Theory. Springer-Verlag New York, Inc; 1986.
- 28.Lounici K. High-dimensional covariance matrix estimation with missing observations. arXiv preprint arXiv:1201.2577. 2012.
- 29.Lounici K. Sparse principal component analysis with missing observations. arXiv preprint arXiv:1205.7060. 2012.
- 30.Lounici K, Pontil M, Van De Geer S, Tsybakov A. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics. 2011;39(4):2164–2204.
- 31.Ma Z. Sparse principal component analysis and iterative thresholding. arXiv preprint arXiv:1112.2432. 2011.
- 32.Onatski A. Asymptotics of the principal components estimator of large factor models with weak factors. J Econometrics. 2012;168:244–258.
- 33.Onatski A, Moreira M, Hallin M. Signal detection in high dimension: The multispiked case. arXiv preprint arXiv:1210.5663. 2012.
- 34.Patterson N, Price A, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. doi: 10.1371/journal.pgen.0020190.
- 35.Paul D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica. 2007;17(4):1617–1642.
- 36.Ravikumar P, Wainwright M, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics. 2011;5:935–980.
- 37.Stewart G, Sun JG. Matrix Perturbation Theory. Computer Science and Scientific Computing. Academic Press; 1990.
- 38.Tsybakov A. Introduction to Nonparametric Estimation. Springer Verlag; 2009.
- 39.Vu V, Lei J. Minimax rates of estimation for sparse PCA in high dimensions. In: The Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS'12); 2012. URL http://arxiv.org/abs/1202.0786.
- 40.Vu V, Lei J. Minimax sparse principal subspace estimation in high dimensions. arXiv preprint arXiv:1211.0373. 2012.
- 41.Yuan M. High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research. 2010;11:2261–2286.