Asymptotics of empirical eigenstructure for high dimensional spiked covariance

Weichen Wang; Jianqing Fan

doi:10.1214/16-AOS1487

. Author manuscript; available in PMC: 2017 Aug 21.

Published in final edited form as: Ann Stat. 2017 Jun 13;45(3):1342–1374. doi: 10.1214/16-AOS1487

Asymptotics of empirical eigenstructure for high dimensional spiked covariance

Weichen Wang ^†, Jianqing Fan ^†,^‡,^✉

PMCID: PMC5563862 NIHMSID: NIHMS842544 PMID: 28835726

Abstract

We derive the asymptotic distributions of the spiked eigenvalues and eigenvectors under a generalized and unified asymptotic regime, which takes into account the magnitude of spiked eigenvalues, sample size, and dimensionality. This regime allows high dimensionality and diverging eigenvalues and provides new insights into the roles that the leading eigenvalues, sample size, and dimensionality play in principal component analysis. Our results are a natural extension of those in Paul (2007) to a more general setting and solve the rates of convergence problems in Shen et al. (2013). They also reveal the biases of estimating leading eigenvalues and eigenvectors by using principal component analysis, and lead to a new covariance estimator for the approximate factor model, called shrinkage principal orthogonal complement thresholding (S-POET), that corrects the biases. Our results are successfully applied to outstanding problems in estimation of risks of large portfolios and false discovery proportions for dependent test statistics and are illustrated by simulation studies.

Keywords and phrases: Asymptotic distributions, Principal component analysis, Spiked covariance model, Diverging eigenvalues, Approximate factor model, Relative risk management, False discovery proportion

1. Introduction

Principal Component Analysis (PCA) is a powerful tool for dimensionality reduction and data visualization. Its theoretical properties such as consistency and asymptotic distributions of the empirical eigenvalues and eigenvectors are challenging especially in high dimensional regime. For the past half century, substantial amount of efforts have been devoted to understanding empirical eigen-structure. An early effort came from Anderson (1963) who established the asymptotic normality of sample eigenvalues and eigenvectors under the classical regime with large sample size n and fixed dimension p. However, when dimensionality diverges at the same rate as the sample size, sample covariance matrix is a notoriously bad estimator with dramatically different eigen-structure from the population one. A lot of recent literature makes the endeavor to understand the behavior of the empirical eigenvalues and eigenvectors under high dimensional regime where both n and p go to infinity. See for example Baik, Ben Arous and Péché (2005); Bai (1999); Paul (2007); Johnstone and Lu (2009); Onatski (2012); Shen et al. (2013) and many related papers. For additional developments and references, see Bai and Silverstein (2009).

Among different structures of population covariance, the spiked covariance model is of great interest. It typically assumes several eigenvalues larger than the remaining, and focuses on recovering only these leading eigenvalues and their associated eigenvectors. The spiked part is of importance, as we are usually interested in the directions that explain the most variations of the data. In this paper, we consider a high dimensional spiked covariance model with the leading eigenvalues larger than the rest. We provide new understanding on how the spiked empirical eigenvalues and eigenvectors fluctuate around their theoretical counterparts and what their asymptotic biases are. Three quantities play an essential role in determining the asymptotic behavior of empirical eigen-structure: the sample size n, the dimension p, and the magnitude of leading eigenvalues ${λ_{j}}_{j = 1}^{m}$ . The natural question to ask is how the asymptotics of empirical engen-structure depends on the interplay of those quantities. We will give a unified answer to this important question in the principal component analysis. Theoretical properties of PCA have been investigated from three different perspectives: (i) random matrix theories, (ii) sparse PCA and (iii) approximate factor model.

The first angle to analyze PCA is through random matrix theories, where it is typically assumed p/n → γ ∈ (0, ∞) with bounded spike sizes. It is well known that if the true covariance matrix is identity, the empirical spectral distribution converges almost surely to the Marcenko-Pastur distribution (Bai, 1999) and when γ < 1 the largest and smallest eigenvalues converge almost surely to ${(1 + \sqrt{γ})}^{2}$ and ${(1 - \sqrt{γ})}^{2}$ respectively (Bai and Yin, 1993; Johnstone, 2001). If the true covariance structure takes the form of a spiked matrix, Baik, Ben Arous and Péché (2005) showed that the asymptotic distribution of the top empirical eigenvalue exhibits an n^2/3 scaling when the eigenvalue lies below a threshold $1 + \sqrt{γ}$ , and an n^1/2 scaling when it is above the threshold (named BBP phase transition after the authors). The phase transition is further studied by Benaych-Georges and Nadakuditi (2011) and Bai and Yao (2012) under more general assumptions. For the case where we have the regular scaling, Paul (2007) investigated the asymptotic behavior of the corresponding empirical eigenvectors and showed that the major part of an eigenvector is normally distributed with a regular scaling n^1/2. The convergence of principal component scores under this regime was considered by Lee, Zou and Wright (2010). The same random matrix regime has also been considered by Onatski (2012) in studying the principal component estimator for high-dimensional factor models. More recently, Koltchinskii and Lounici (2014a,b) revealed a profound link of concentration bounds of empirical eigen-structure with the effective rank defined as r̄ = tr(Σ)/λ₁ (Vershynin, 2010). Their results extend the regime of bounded eigenvalues to a more general setting, although the asymptotic results in most cases still rely on the assumption r̄ = o(n), which essentially requires a low dimensionality, i.e. p/n → 0, if λ₁ is bounded. In this paper, we consider the general regime of bounded p/(nλ₁), which implies r̄ = O(n) and allows diverging λ₁. More discussions will be given in Section 3.

A second line of efforts is through sparse PCA. According to Johnstone and Lu (2009), PCA does not generate consistent estimators for leading eigenvectors if p/n → γ ∈ (0, 1) with bounded eigenvalues. This motivates the development of sparse PCA, which leverages the extra assumption on the sparsity of eigenvectors. A large amount of literature has contributed to the topic of sparse PCA, for example Amini and Wainwright (2008); Vu and Lei (2012); Birnbaum et al. (2013); Berthet and Rigollet (2013); Ma (2013). Specifically, Vu and Lei (2012) derived optimal bound for the minimax estimation error of the first sparse leading eigenvector, while Cai, Ma and Wu (2015) conducted a more thorough study on the minimax optimal rates for estimating top eigenvalues and eigenvectors of spiked covariance matrices with jointly k-sparse eigenvectors. This type of work assumes bounded eigenvalues, which ignore the contributions of the strong signals from the data in many real applications. To make the problem solvable, sparsity assumptions on the eigenvectors are imposed. In contrast, driven by applications such as genomics, economics, and finance, this paper studies the contributions of the diverging eigenvalues (signals) to the estimation of their associated eigenvectors, without relying on sparsity assumptions on the eigenvectors.

In order to illustrate the third perspective, let us briefly review the approximate factor model (Bai, 2003; Fan, Liao and Mincheva, 2013) and see how the spiked eigenvalues arise naturally from the model. Consider the following data generating model:

y_{t} = B f_{t} + ε_{t}, for t = 1, \dots, T,

where y_t is a p-dimensional vector observed at time t, f_t ∈ ℝ^m is the vector of latent factors that drive the cross-sectional dependence at time t, B is the matrix of the corresponding loading coefficients, and ε_t is the idiosyncratic part that can not be explained by the factors. Assume without loss of generality that var(f_t) = I_m, the m × m identity matrix. Then, the model implies Σ = var(y_t) = BB′ + Σ_ε, where Σ_ε = var(ε). It admits a low-rank plus sparse structure, when Σ_ε is assumed to be sparse (Fan, Fan and Lv, 2008; Fan, Liao and Mincheva, 2013). The recovery of the low-rank and sparse matrices was considered thoroughly by Candès et al. (2011) and Chandrasekaran et al. (2011) under the incoherence condition in the noise-less setting and by Agarwal, Negahban and Wainwright (2012) in the noisy case. If the factor loadings {b_i}_i≤p (the transpose of rows of B) are i.i.d. samples from a population with mean zero and covariance Σ_b [this is a pervasive assumption commonly used in the factor models (Fan, Liao and Mincheva, 2013)], then by the law of large numbers, $p^{- 1} B' B = p^{- 1} \sum_{i = 1}^{p} b_{i} b_{i}^{'} \to Σ_{b}$ , as p → ∞. In other words, the eigenvalues of BB′ are approximately

p λ_{1} (Σ_{b}) (1 + o (1)), \dots, p λ_{m} (Σ_{b}) (1 + o (1)), 0, \dots, 0,

where λ_j(Σ_b) is the j^th eigenvalue of Σ_b. Then, by Weyl’s theorem, we conclude that the eigenvalues of Σ

λ_{j} = p λ_{j} (Σ_{b}) (1 + o (1)), for j = 1, \dots, m,

(1.1)

and the remaining are bounded, if ‖Σ_ε‖ is bounded. Therefore, the factor model implies a spiked covariance with diverging leading eigenvalues. Fan, Liao and Mincheva (2013) showed that if the leading eigenvalues grow linearly with the dimension, then the corresponding eigenvectors can be consistently estimated as long as sample size goes to infinity. See Section 4 for more details.

Deviating from the classical random matrix and sparse PCA literature, we consider the high dimensional regime, allowing p/n → ∞. To take into account the contributions of the signals for PCA, we allow λ_j → ∞ for the first m leading eigenvalues. This leads to the third perspective for understanding PCA from this high dimensional setting. Shen et al. (2013) adopted this point of view and considered the regime of p/(nλ_j) → γ_j where 0 ≤ γ_j < ∞ for leading eigenvalues. This is more general than the bounded eigenvalue condition. Specifically if eigenvalues are bounded, we require the ratio p/n converges to a bounded constant as in the random matrix regime. On the other hand, if the dimension is much larger than the sample size, we offset the dimensionality by assuming increased signals or sample size, without additional sparse eigenvector assumption as in sparse PCA regime. In particular, as shown in (1.1), the strong (or pervasive) factors considered in financial applications corresponds to γ_j = 0 with the leading eigenvalues λ_j ≍ p; see for example Stock and Watson (2002); Bai (2003); Bai and Ng (2002); Fan, Liao and Mincheva (2013); Fan, Liao and Wang (2016). The weak or semi-strong factors considered by De Mol, Giannone and Reichlin (2008) and Onatski (2012) also imply bounded p/(nλ₁), with p/n bounded and λ_j ≍ p^θ for some θ ∈ [0, 1).

Hall, Marron and Neeman (2005); Jung and Marron (2009) started the research of high dimension low sample size (HDLSS) regime. With n fixed, Jung and Marron (2009) concluded that consistency of leading eigenvalues and eigenvectors is granted if λ_j ≍ p^θ for θ > 1, which also corresponds to γ_j = 0. Shen et al. (2013) revealed an interesting fact that when γ_j ≠ 0, spiked sample eigenvalues almost surely converge to a biased quantity of the true eigenvalues; furthermore the corresponding sample eigenvectors show an asymptotic conical structure. However, their work focuses only on the consistency problem. In this study, we will consider the same regime as theirs, but focus more on the rates of convergence and the asymptotic distributions of the empirical eigen-structure, and under more relaxed conditions. Our results can be viewed as a natural extension of Paul (2007) to the high dimensional setting.

We would like to emphasize more on the scope and importance of our contributions here. Firstly, the regime we consider in this paper is p/(nλ_j) → γ_j ∈ [0, ∞) for j ≤ m, which permits high dimensionality p/n → ∞ and diverging eigenvalues without specifying their divergence rates. As we have argued, this encompasses many situations considered in the existing literature. It puts into the same framework of the typical random matrix regime with bounded eigenvalues and HDLSS analysis with fixed sample size. Secondly, the contributions of diverging eigenvalues are indeed recognized and accounted for in our theoretical developments. This avoids the restrictive assumptions on sparse eigenvectors. PCA without sparsity assumptions has been widely used in the diverging fields such as population association study (Yamaguchi-Kabata et al., 2008), genome-wide association study (Ringnér, 2008), microarray data (Landgrebe, Wurst and Welzl, 2002; Price et al., 2006), fMRI data (Thomas, Harshman and Menon, 2002), and financial returns (Chen and Shimerda, 1981; Chamberlain and Rothschild, 1983). Our efforts contribute to theoretical understanding why such a plain PCA works in these diverse fields. Finally, by allowing certain generality, we gain theoretical insights into how n, p and signal strength λ_j interplay.

The results are useful in two ways. On the one hand, they help quantify the biases of empirical eigen-structure and explain where they come from. Specifically, in Theorem 3.1, the bias of the j^th sample eigenvalue (j ≤ m) is quantified by p/(nλ_j), which is also showed by Yata and Aoshima (2012, 2013) under different assumptions of the spiked covariance model. Our novel contribution lies in Theorem 3.2, revealing the bias of the j^th sample eigenvector (j ≤ m). In (3.7), we provide a decomposition of each empirical eigenvector into a spiked part, which converges to the true eigenvector with a deflation factor also quantified by p/(nλ_j), and a non-spiked part, which creates a random bias distributed uniformly on an ellipse. More details will be presented in Section 3. On the other hand, the theoretical results provide new technical tools for analyzing factor models, which motivate the study. As we have seen, although it is natural to assume eigenvalues grow linearly with dimension, the assumption imposes a strong signal. Note that when p/(nλ_j) → 0, no biases will occur. So in Section 4, we consider to relax the order of spikes to slightly faster than $\sqrt{p}$ . By correcting the biases, we propose a new method called Shrinkage Principal Orthogonal complEment Thresholding (S-POET) and employ it to two applications: risk assessment of large portfolios (Pesaran and Zaffaroni, 2008; Fan, Liao and Shi, 2015) and false discovery proportion estimation for dependent test statistics (Leek and Storey, 2008; Fan, Han and Gu, 2012). Existing methodologies for those two problems reply on rather strong signal level, but we are able to relax it with the help of S-POET.

The paper is organized as follows. Section 2 introduces the notations, assumptions, and an important fact which serves as basis of our proofs. Sections 3.1 and 3.2 devote to the theoretical results for the sample eigenvalues and eigenvectors of the spiked covariance matrix. In Section 4, we discuss several applications of the theories in the previous section. Simulations are conducted in Section 5 to demonstrate the theoretical results at the finite sample and the performance of S-POET. Section 6 provides concluding remarks. The proofs for Section 3 are provided in the appendix and those for Section 4 are relegated to the supplementary material.

2. Assumptions and a simple fact

Assume that ${Y_{i}}_{i = 1}^{n}$ is a sequence of i.i.d. random variables with zero mean and covariance matrix Σ_p×p. Let λ₁, …, λ_p be the eigenvalues of Σ in descending order. We consider the spiked covariance model as follows.

Assumption 2.1

λ₁ > λ₂ > ⋯ > λ_m > λ_m+1 ≥ ⋯ ≥ λ_p > 0, where the non-spiked eigenvalues are bounded, i.e. c₀ ≤ λ_j ≤ C₀, j > m for constants c₀, C₀ > 0 and the spiked eigenvalues are well separated, i.e. ∃ δ₀ > 0 such that min_j≤m(λ_j − λ_j+1)/λ_j ≥ δ₀.

The eigenvalues are divided into the spiked ones and bounded non-spiked ones. We do not need to specify the order of the leading eigenvalues nor require them to diverge. Thus, our results in Section 3 are applicable to both bounded and diverging leading eigenvalues. For simplicity, we only consider distinguishable eigenvalues (multiplicity 1) for the largest m eigenvalues and a fixed number m, independent of n and p.

The factor model y = Bf + ε with pervasive factors considered in Fan, Liao and Mincheva (2013) implies a spiked covariance model with λ_j ≍ p in (1.1) and satisfies the above assumption. For the interplay of the sample size n, dimension p and the spikes λ_j ’s, the following relationship is assumed as in Shen et al. (2013).

Assumption 2.2

Assume p > n. For the spiked part 1 ≤ j ≤ m, c_j = p/(nλ_j) is bounded, and for the non-spiked part, ${(p - m)}^{- 1} \sum_{j = m + 1}^{p} λ_{j} = \bar{c} + o (n^{- 1 / 2})$ .

We allow p/n → ∞ in any manner, though λ_j also needs to grow fast enough to ensure bounded c_j. In particular, c_j = o(1) is allowed as in the factor model. We do not assume the non-spiked eigenvalues are identical, as in most spiked covariance model literature (e.g. Paul (2007); Johnstone and Lu (2009)).

By spectral decomposition, Σ = ΓΛΓ′, where the orthonormal matrix Γ is constructed by the eigenvectors of Σ and Λ = diag(λ₁, …, λ_p). Let X_i = Γ′Y_i. Since the empirical eigenvalues are invariant and the empirical eigenvectors are equivariant under an orthonormal transformation, we focus the analysis on the transformed domain of X_i and then translate the results into those of the original data. Note that var(X_i) = Λ. Let Z_i = Λ^−1/2X_i be the elementwise standardized random vector.

Assumption 2.3

${Z_{i}}_{i = 1}^{n}$ are i.i.d copies of Z. The standardized random vector Z = (Z₁, …, Z_p) is sub-Gaussian with independent entries of mean zero and variance one. The sub-Gaussian norms of all components are uniformly bounded: max_j ‖Z_j‖_ψ₂ ≤ C₀, where ‖Z_j‖_ψ₂ = sup_q≥1 q^−1/2(E|Z_j|^q)^1/q.

Since Var(X_i) = diag(λ₁, λ₂, …, λ_p), the first m population eigenvectors are simply unit vectors e₁, e₂, …, e_m with only one nonvanishing element. Denote the n by p transformed data matrix by X = (X₁, X₂, …, X_n)′. Then the sample covariance matrix is

{\hat{Σ}}_{p \times p} = \frac{1}{n} X' X = \frac{1}{n} \sum_{i = 1}^{n} X_{i} X_{i}^{'},

whose eigenvalues are denoted as λ̂₁, λ̂₂, …, λ̂_p (λ̂_j = 0 for j > n) with corresponding eigenvectors ξ̂₁, ξ̂₂, …, ξ̂_p. Note that the empirical eigenvectors of data Y_i’s are ${\hat{ξ}}_{j}^{(Y)} = Γ {\hat{ξ}}_{j}$ .

Let Z_j be the j^th column of the standardized X. Then each Z_j has i.i.d sub-Gaussian entries with zero mean and unit variance. Exchanging the roles of rows and columns, we get the n by n Gram matrix

{\tilde{Σ}}_{n \times n} = \frac{1}{n} X X' = \frac{1}{n} \sum_{j = 1}^{p} λ_{j} Z_{j} Z_{j}^{'},

with the same nonzero eigenvalues λ̂₁, λ̂₂, …, λ̂_n as Σ̂ and the corresponding eigenvectors u₁, u₂, …, u_n. It is well known that for i = 1, 2, …, n

{\hat{ξ}}_{i} = {(n {\hat{λ}}_{i})}^{- 1 / 2} X' u_{i} and u_{i} = {(n {\hat{λ}}_{i})}^{- 1 / 2} X {\hat{ξ}}_{i},

(2.1)

while the other eigenvectors of Σ̂ constitute a (p−n)-dimensional orthogonal complement of ξ̂₁, …, ξ̂_n.

By using this simple fact, for the specific case with c₀ = C₀ = 1 in Assumption 2.1, λ_j = 1 for j > m in Assumption 2.2, and Gaussian data in Assumption 2.3, Shen et al. (2013) showed that

\frac{{\hat{λ}}_{j}}{λ_{j}} \overset{a . s .}{\to} 1 + c_{j}, 1 \leq j \leq m;

and

| 〈 {\hat{ξ}}_{j}, e_{j} 〉 | \overset{a . s .}{\to} {(1 + c_{j})}^{- \frac{1}{2}},

where 〈a, b〉 denotes the inner product of two vectors. However, they fail to establish any results on convergence rates or asymptotic distributions of the empirical eigen-structure. This motivates the current paper.

The aim of this paper is to establish the asymptotic normality of the empirical eigenvalues and eigenvectors under more relaxed conditions. Our results are a natural extension of Paul (2007) to a more general setting with new insights, where the asymptotic normality of sample eigenvectors is derived using complicated random matrix techniques for Gaussian data under the regime of p/n → γ ∈ [0, 1). In comparison, our proof, based on the relationship (2.1), is much simpler and insightful for understanding the behavior of high dimensional PCA.

Here are some notations that we will use in the paper. For a general matrix M, we denote its matrix entry-wise max norm as ‖M‖_max = max_i,j{|M_i,j|} and define the quantities $‖ M ‖ = λ_{max}^{1 / 2} (M' M), {‖ M ‖}_{F} = {(\sum_{i, j} M_{i, j}^{2})}^{1 / 2}$ , ‖M‖_∞ = max_i ∑_j |M_i,j| to be its spectral, Frobenius and induced ℓ_∞ norms. If M is symmetric, we define λ_j(M) to be the j^th largest eigenvalue of M and λ_max(M), λ_min(M) to be the maximal and minimal eigenvalues respectively. We denote tr(M) as the trace of M. For any vector v, its ℓ₂ norm is represented by ‖v‖ while ℓ₁ norm is written as ‖v‖₁. We use diag(v) to denote the diagonal matrix with the same diagonal entries as v. For two random vectors a, b of the same length, we say a = b+O_P(δ) if ‖a−b‖ = O_P(δ) and a = b + o_P(δ) if ‖a − b‖ = o_P(δ). We denote $a \overset{d}{\Rightarrow} ℒ$ for some distribution ℒ if there exists b ~ ℒ such that a = b + o_P(1). Throughout the paper, C is a generic constant that may differ from line to line.

3. Asymptotic behavior of empirical eigen-structure

3.1. Asymptotic normality of empirical eigenvalues

Let us first study the behavior of the leading m empirical eigenvalues of Σ̂. Denote by λ_j(A) the j^th largest eigenvalue of matrix A and recall that λ̂_j = λ_j(Σ̂). We have the following asymptotic normality of λ̂_j.

Theorem 3.1

Under Assumptions 2.1 – 2.3, ${{\hat{λ}}_{j}}_{j = 1}^{m}$ ’s have independent limiting distributions. In addition,

\sqrt{n} {\frac{{\hat{λ}}_{j}}{λ_{j}} - (1 + \bar{c} c_{j} + O_{P} (λ_{j}^{- 1} \sqrt{p / n}))} \overset{d}{\Rightarrow} N (0, κ_{j} - 1),

(3.1)

where κ_j is the kurtosis of X_j.

The theorem shows that the bias of λ̂_j/λ_j is $\bar{c} c_{j} + O_{P} (λ_{j}^{- 1} \sqrt{p / n})$ . The second term is dominated by the first term since p > n and it is of order o_P(n^−1/2) if $\sqrt{p} = o (λ_{j})$ . The latter assumption is satisfied by the strong factor model in Fan, Liao and Mincheva (2013) and a part of weak or semi-strong factor model in Onatski (2012). The theorem reveals the bias is controlled by a term of rate p/(nλ_j). To get the asymptotically unbiased estimate, it requires c_j = p/(nλ_j) → 0 for j ≤ m. This result is more general than that of Shen et al. (2013) and sheds a similar light to that of Koltchinskii and Lounici (2014a,b) i.e. ‖Σ̂ − Σ‖/‖Σ‖ → 0 almost surely if and only if the effective rank r̄ = tr(Σ)/λ₁ is of order o(n), which is true when c₁ = o(1). Our result here holds for each individual spike. Yata and Aoshima (2012, 2013) employed a similar technical trick and gave a comprehensive study on the asymptotic consistency and distributions of the eigenvalues. They got similar results under different conditions from ours. Our framework is more general here. If c_j ↛ 0, bias reduction can also be made; see Section 4.2, where an estimator for c̄ is proposed. Under the bounded spiked covariance model considered in Baik, Ben Arous and Péché (2005), Johnstone and Lu (2009) and Paul (2007), it is assumed λ_j = c₀ = C₀, j > m so that c̄ = c₀, the minimum eigenvalue of the population covariance matrix. Our result is also consistent with Anderson (1963)’s result that

\sqrt{n} ({\hat{λ}}_{j} - λ_{j}) \overset{d}{\Rightarrow} N (0, 2 λ_{j}^{2}),

for Gaussian data and fixed p and λ_j’s, where the non-spiked part does not exist and thus the bias $O_{P} (λ_{j}^{- 1} \sqrt{p / n})$ disappears. The proof is relegated to the appendix.

3.2. Behavior of empirical eigenvectors

Let us consider the asymptotic distribution of the empirical eigenvectors ξ̂_j’s corresponding to λ̂_j, j = 1, 2, …, m. As in Paul (2007), each ξ̂_j is divided into two parts corresponding to the spiked and non-spiked components, i.e. ${\hat{ξ}}_{j} = ({\hat{ξ}}_{j A}^{'}, {\hat{ξ}}_{j B}^{'})'$ where ξ̂_jA is of length m.

Theorem 3.2

Under Assumptions 2.1 – 2.3, we have

For the spiked part, if m = 1,
$\frac{2 (1 + \bar{c} c_{1})}{\bar{c} c_{1}} \sqrt{n} (\sqrt{1 + \bar{c} c_{1}} {\hat{ξ}}_{1 A} - 1 + O_{P} (\sqrt{\frac{p}{n λ_{1}^{2}}})) \overset{d}{\Rightarrow} N (0, κ_{1} - 1),$ (3.2)
while if m > 1,
$\sqrt{n} (\frac{{\hat{ξ}}_{j A}}{‖ {\hat{ξ}}_{j A} ‖} - e_{j A} + O_{P} (\sqrt{\frac{p}{n λ_{j}^{2}}})) \overset{d}{\Rightarrow} N_{m} (0, Σ_{j}),$ (3.3)
for j = 1, 2, …, m, with
$Σ_{j} = \sum_{k \in [m] \ j} a_{j k}^{2} e_{k A} e_{k A}^{'},$
where [m] = {1, ⋯, m}, e_kA is the first m elements of the unit vector e_k, and $a_{j k} = {lim}_{λ_{j}, λ_{k} \to \infty} \sqrt{λ_{j} λ_{k}} / (λ_{j} - λ_{k})$ , which is assumed to exist.
For the non-spiked part, if we further assume the data is Gaussian, there exists p − m dimensional vector h₀ ~ Unif (B_p−m(1)) such that
$‖ D_{0} \frac{{\hat{ξ}}_{j B}}{‖ {\hat{ξ}}_{j B} ‖} - h_{0} ‖ = O_{P} (\sqrt{\frac{n}{p}}) + o_{P} (\frac{1}{\sqrt{n}}),$ (3.4)
where $D_{0} = diag (\sqrt{\bar{c} / λ_{m + 1}}, \dots, \sqrt{\bar{c} / λ_{p}})$ is a diagonal matrix and Unif (B_k(r)) denotes the uniform distribution over the centered sphere of radius r. In addition, the max norm of ξ̂_jB satisfies
${‖ {\hat{ξ}}_{j B} ‖}_{max} = O_{P} (p / (n λ_{j}^{3 / 2}) + \sqrt{log p / (n λ_{j})}) .$ (3.5)
Furthermore, $‖ {\hat{ξ}}_{j A} ‖ = {(1 + \bar{c} c_{j})}^{- 1 / 2} + O_{P} (λ_{j}^{- 1} \sqrt{p / n} + p / (n^{3 / 2} λ_{j}))$ and $‖ {\hat{ξ}}_{j B} ‖ = {(\frac{\bar{c} c_{j}}{1 + \bar{c} c_{j}})}^{1 / 2} + O_{P} (\sqrt{1 / λ_{j}} + \sqrt{p / (n^{2} λ_{j})})$ . Together with (i), this implies the inner product between empirical eigenvector and the population one converges to (1 + c̄c_j)^−1/2 in probability and
$〈 {\hat{ξ}}_{j}, e_{j} 〉 - \frac{1}{\sqrt{1 + \bar{c} c_{j}}} = O_{P} (λ_{j}^{- 1} \sqrt{p / n} + p / (n^{3 / 2} λ_{j})) + O_{P} (n^{- 1}) I_{{m > 1}} .$ (3.6)

In the above theory, we assume $a_{j k} = {lim}_{λ_{j}, λ_{k} \to \infty} \frac{\sqrt{λ_{j} λ_{k}}}{λ_{j} - λ_{k}}$ exists. This is not restrictive if eigenvalues are well separated i.e. min_j≠k≤m |λ_j − λ_k|/λ_j ≥ δ₀ from assumption 2.1. The assumption obviously holds for the pervasive factor model, in which $a_{j k} = \sqrt{λ_{j} (Σ_{b}) λ_{k} (Σ_{b})} / (λ_{k} (Σ_{b}) - λ_{j} (Σ_{b}))$ .

Theorem 3.2 is an extension of random matrix results into high dimensional regime. Its proof sheds light on how to use the smaller n × n matrix Σ̃ as a tool to understand the behavior of the larger p × p covariance matrix Σ̂. Specifically, we start from Σ̃u_j = λ̂_ju_j or identity (A.3) and then use the simple fact (2.1) to get a relationship (A.4) of eigenvector ξ̂_j. Then (A.4) is rearranged as (A.6) which gives a clear separation of the dominating term, that is asymptotically normal, and the error term. This makes the whole proof much simpler in comparison with Paul (2007) who showed a similar type of result through a complicated representation of ξ̂_j and λ̂_j under more restricted assumptions. From this simple trick, we can understand deeply how some important high and low dimensional quantities link together and differ from each other.

Several remarks are in order. Firstly, since ${\hat{ξ}}_{j}^{(Y)} = Γ {\hat{ξ}}_{j}$ is the j^th empirical eigenvector based on observed data Y, we have decomposition

{\hat{ξ}}_{j}^{(Y)} = Γ_{A} {\hat{ξ}}_{j A} + Γ_{B} {\hat{ξ}}_{j B},

(3.7)

where Γ = (Γ_A, Γ_B). Note that Γ_Aξ̂_jA converges to the true eigenvector deflated by a factor of $\sqrt{1 + \bar{c} c_{j}}$ with the convergence rate $O_{P} (\sqrt{p / (n λ_{j}^{2})} + p / (n^{3 / 2} λ_{j}) + n^{- 1 / 2})$ while Γ_Bξ̂_jB creates a random bias, which is distributed uniformly on an ellipse of (p − m) dimension and projected into the p dimensional space spaned by Γ_B. The two parts intertwine in such a way that correction for the biases of estimating eigenvectors is almost impossible. More details are discussed in Section 4 for the factor models. Secondly, it is clearly as in the eigenvalue case, the bias term $λ_{j}^{- 1} \sqrt{p / n}$ Theorem 3.2 (i) disappears when $\sqrt{p} = o (λ_{j})$ . In particular, for the stronger factor given by (1.1), ${\hat{ξ}}_{j}^{(Y)}$ is a consistent estimator. Thirdly, the situations m = 1 and m > 1 have slight difference in that multiple spikes could interact with each other. Especially this is reflected in the convergence of the angle between the empirical eigenvector and its population counterpart: the angle converges to (1 + c̄c_j)^−1/2 with an extra rate O_P(1/n) which stems from estimating ξ̂_jk for j ≠ k ≤ m (see proof of Theorem 3.2 (iii)). The difference will only be seen when the spike magnitude is higher than the order $\sqrt{p n} \lor p n^{- 1 / 2}$ . We will verify this by a simple simulation in Section 5. Finally, it is the first time that the max norm bound of the non-spiked part is derived. This bound will be useful for analyzing factor models in Section 4.

Theorem 3.2 again implies the results of Shen et al. (2013). It also generalizes the asymptotic distribution of non-spiked part from pure orthogonal invariant case of Paul (2007) to a more general setting. In particular, when p/n → ∞, the asymptotic distribution of the normalized non-spiked component is not uniform over a sphere any more, but over an ellipse. In addition, our result can be compared with the low dimensional case, where Anderson (1963) showed that

\sqrt{n} ({\hat{ξ}}_{j} - e_{j}) \overset{d}{\Rightarrow} N_{p} (0, \sum_{k \in [m] \ j} \frac{λ_{j} λ_{k}}{{(λ_{j} - λ_{k})}^{2}} e_{k} e_{k}^{'}),

(3.8)

for fixed p and λ_j’s. Under our assumptions, since the spiked eigenvalues may go to infinity, the constants in the asymptotic covariance matrix are replaced by the limits a_jk’s. Similar to the behavior of eigenvalues, the spiked part ξ̂_jA preserves the normality property except for a bias factor 1/(1+ c̄c_j) caused by the high dimensionality. Also, recent work of Koltchinskii and Lounici (2014b) provides general asymptotic results for the empirical eigenvectors from a spectral projector point of view, but they mainly focus on the regime of p/nλ_j → 0 or r̄ = o(n). Last but not least, it has been shown by Johnstone and Lu (2009) that PCA generates consistent eigenvector estimation if and only if p/n → 0 when the spike sizes are fixed. This motivates the study of sparse PCA. We take the spike magnitude into account and provide additional insights by showing that PCA consistently estimate eigenvalues and eigenvectors if and only if p/(nλ_j) → 0. This explains why Fan, Liao and Mincheva (2013) can consistently estimate the eigenvalues and eigenvectors while Johnstone and Lu (2009) can not.

4. Applications to factor models

In this section, we propose a method named Shrinkage Principal Orthogonal complEment Thresholding (S-POET) for estimating large covariance matrices induced by the approximate factor models. The estimator is based on correction of the bias of the empirical eigenvalues as specified in (3.1). We derive for the first time the bound for the relative estimation errors of covariance matrices under the spectral norm. The results are then applied to assessing large portfolio risks and estimating false discovery proportions, where the conditions in existing literature are significantly relaxed.

4.1. Approximate factor models

Factor models have been widely used in various disciplines. For example, it is used to extract information from financial market for sufficient forecasting of other time series (Stock and Watson, 2002; Fan, Xue and Yao, 2015) and to adjust heterogeneity for biological data aggregation of multiple sources (Leek et al., 2010; Fan et al., 2016). Consider the approximate factor model

y_{i t} = b_{i}^{'} f_{t} + u_{i t},

(4.1)

where y_it is the observed data for the i^th (i = 1, …, p) individual (e.g. returns of stocks) or component (e.g. expressions of genes) at time t = 1, …, T; f_t is an m × 1 vector of latent common factors and b_i is the factor loadings for the i^th individual or component; u_it is the idiosyncratic error, uncorrelated with the common factors. In genomics application, t can also index repeated experiments. For simplicity we assume there is no time dependency.

The factor model can be written into a matrix form as follows:

Y = B F' + U,

(4.2)

where Y_p×T, B_p×m, F_T×m, U_p×T are respectively the matrix form of the observed data, the factor loading matrix, the factor matrix, and the error matrix. For identifiability, we impose the condition that cov(f_t) = I. Thus, the covariance matrix is given by

Σ = B B' + Σ_{u},

(4.3)

where Σ_u is the covariance matrix of the idiosyncratic error at any time t.

Under the assumption that Σ_u = (σ_u,ij)_i,j≤p is sparse with its eigenvalues bounded away from zero and infinity, the population covariance exhibits a “low-rank plus sparse” structure. The sparsity is measured by the following quantity

m_{p} = max_{i \leq p} \sum_{j \leq p} {| σ_{u, i j} |}^{q},

for some q ∈ [0, 1] (Bickel and Levina, 2008). In particular, with q = 0, m_p equals the maximum number of nonzero elements in each row of Σ_u.

In order to estimate the true covariance matrix with the above factor structure, Fan, Liao and Mincheva (2013) proposed a method called “POET” to recover the unknown factor matrix as well as the factor loadings. The idea is simply to first decompose the sample covariance matrix into the spiked and non-spiked part and estimate them separately. Specifically, recall Σ̂ = T⁻¹YY′ and let {λ̂_j} and {ξ̂_j} be its corresponding eigenvalues and eigenvectors. They define

{\hat{Σ}}^{⊤} = \sum_{j = 1}^{m} {\hat{λ}}_{j} {\hat{ξ}}_{j} {\hat{ξ}}_{j}^{'} + {\hat{Σ}}_{u}^{⊤},

(4.4)

where ${\hat{Σ}}_{u}^{⊤}$ is the matrix after applying thresholding method (Bickel and Levina, 2008) to ${\hat{Σ}}_{u} = \hat{Σ} - \sum_{j = 1}^{m} {\hat{λ}}_{j} {\hat{ξ}}_{j} {\hat{ξ}}_{j}^{'}$ .

They showed that the above estimation procedure is equivalent to the least square approach that minimizes

(\hat{B}, \hat{F}) = arg min_{B, F} {‖ Y - B F' ‖}_{F}^{2} s . t . \frac{1}{T} F' F = I_{m}, B' B is diagonal .

(4.5)

The columns of $\hat{F} / \sqrt{T}$ are the eigenvectors corresponding to the m largest eigenvalues of the T × T matrix T⁻¹Y′Y and B̂ = T⁻¹YF̂. After B and F are estimated, the sample covariance of Û = Y − B̂F̂′ can be formed: Σ̂_u = T⁻¹ÛÛ′. Finally thresholding is applied to Σ̂_u to generate ${\hat{Σ}}_{u}^{⊤} = {({\hat{σ}}_{u, i j}^{⊤})}_{p \times p}$ , where

{\hat{σ}}_{u, i j}^{⊤} {\begin{matrix} {\hat{σ}}_{u, i j}, & i = j; \\ s_{i j} ({\hat{σ}}_{u, i j}) I (| {\hat{σ}}_{u, i j} | \geq τ_{i j}), & i \neq j . \end{matrix}

(4.6)

Here s_ij(·) is the generalized shrinkage function (Antoniadis and Fan, 2001; Rothman, Levina and Zhu, 2009) and τ_ij = τ(σ̂_u,iiσ̂_u,jj)^1/2 is the entry-dependent threshold. The above adaptive threshold corresponds to applying thresholding with parameter τ to the correlation matrix of Σ̂_u. The positive parameter τ will be determined later.

Fan, Liao and Mincheva (2013) showed that under Assumptions ?? - ?? listed in Appendix ?? in the supplementary material (Wang and Fan, 2015),

{‖ {\hat{Σ}}^{⊤} - Σ ‖}_{Σ, F} = O_{P} (\frac{\sqrt{p} log p}{T} + m_{p} {(\frac{log p}{T} + \frac{1}{p})}^{(1 - q) / 2}),

(4.7)

where ‖A‖_Σ,F = p^−1/2‖Σ^−1/2AΣ^−1/2‖_F and ‖ · ‖_F is the Frobenius norm. Note that

{‖ {\hat{Σ}}^{⊤} - Σ ‖}_{Σ, F} = p^{- 1 / 2} {‖ Σ^{- 1 / 2} {\hat{Σ}}^{⊤} Σ^{- 1 / 2} - I_{p} ‖}_{F},

which measures the relative error in Frobenius norm. A more natural metric is relative error under the operator norm ‖A‖_Σ = ‖Σ^−1/2AΣ^−1/2‖, which can not be obtained by using the technical device of Fan, Liao and Mincheva (2013). Note ‖A‖_Σ,F ≤ ‖A‖_Σ. Via our new results in the last section, we will establish a result under those two relative norms, under weaker conditions than their pervasiveness assumption. Note that the relative error convergence is particularly meaningful for spiked covariance matrix, as eigenvalues are in different scales.

4.2. Shrinkage POET under relative spectral norm

The discussion above reveals several drawbacks of POET. First, the spike size has to be of order p which rules out relatively weak factors. Second, it is well known that the empirical eigenvalues are inconsistent if the spiked eigenvalues do not significantly dominate the non-spiked part. Therefore, a proper correction or shrinkage is needed. See a recent paper by Donoho, Gavish and Johnstone (2014) for optimal shrinkage of eigenvalues.

Regarding to the first drawback, we relax the assumption ‖p⁻¹B′B − Ω₀‖ = o(1) in Assumption ?? to the following weaker assumption.

Assumption 4.1

$‖ Λ_{A}^{- 1 / 2} B' B Λ_{A}^{- 1 / 2} - Ω_{0} ‖ = o (1)$ for some Ω₀ with eigenvalues bounded from above and below, where Λ_A = diag(λ₁, …, λ_m). In addition, we assume λ_m → ∞, λ₁/λ_m is bounded from above and below.

This assumption does not require the first m eigenvalues of Σ to take on any specific rate. They can still be much smaller than p, although for simplicity we require them to diverge and share the same diverging rate. Since ‖Σ_u‖ is assumed to be bounded, the assumption λ_m → ∞ is also imposed to avoid the issue of identifiability. When λ_m does not diverge, more sophisticated condition is needed for identifiability (Chandrasekaran et al., 2011).

In order to handle the second drawback, we propose the Shrinkage POET (S-POET) method. Inspired by (3.1), the shrinkage POET modifies the first part in POET estimator (4.4) as follows:

{\hat{Σ}}^{S} = \sum_{j = 1}^{m} {\hat{λ}}_{j}^{S} {\hat{ξ}}_{j} {\hat{ξ}}_{j}^{'} + {\hat{Σ}}_{u}^{⊤},

(4.8)

where ${\hat{λ}}_{j}^{S} = max {{\hat{λ}}_{j} - \bar{c} p / n, 0}$ , a simple soft thresholding correction. Obviously if λ̂_j is sufficiently large, ${\hat{λ}}_{j}^{S} / λ_{j} = {\hat{λ}}_{j} / λ_{j} - \bar{c} c_{j} = 1 + o_{P} (1)$ . Since c̄ is unknown, a natural estimator ĉ is such that the total of the eigenvalues remains unchanged:

tr (\hat{Σ}) = \sum_{j = 1}^{m} ({\hat{λ}}_{j} - ĉ p / n) + (p - m) ĉ

or $ĉ = (tr (\hat{Σ}) - \sum_{j = 1}^{m} {\hat{λ}}_{j}) / (p - m - p m / n)$ . It has been shown by Lemma 7 of Yata and Aoshima (2012) that

(ĉ - \bar{c}) \frac{p}{n λ_{j}} = O_{P} (\frac{tr (\hat{Σ}) - \sum_{j = 1}^{m} {\hat{λ}}_{j}}{(n - m) λ_{m}} - \frac{\bar{c} p}{n λ_{m}}) = O_{P} (n^{- 1}) .

Thus, replacing c̄ by ĉ, we have ${\hat{λ}}_{j}^{S} / λ_{j} - 1 = O_{P} (λ_{j}^{- 1} \sqrt{p / n} + n^{- 1 / 2})$ , i.e. the estimation error in ĉ is negligible. From Lemma 3.1, we can easily obtain the asymptotic normality, $\sqrt{n} ({\hat{λ}}_{j}^{S} / λ_{j} - 1) \overset{d}{\Rightarrow} N (0, κ_{j} - 1)$ if $\sqrt{p} = o (λ_{j})$ .

To get the convergence of relative errors under the operator norm, we also need the following additional assumptions:

Assumption 4.2

{u_t, f_t}_t≥1 are independently and identically distributed with 𝔼[u_it] = 𝔼[u_itf_jt] = 0 for all i ≤ p, j ≤ m and t ≤ T.
There exist positive constants c₁ and c₂ such that λ_min(Σ_u) > c₁, ‖Σ_u‖_∞ < c₂, and min_i,j Var(u_itu_jt) > c₁.
There exist positive constants r₁, r₂, b₁ and b₂ such that for s > 0, i ≤ p, j ≤ m,
$ℙ (| u_{i t} | > s) \leq exp (- {(s / b_{1})}^{r_{1}}) and ℙ (| f_{j t} | > s) \leq exp (- {(s / b_{2})}^{r_{2}}) .$
There exists M > 0 such that for all i ≤ p, j ≤ m, $| b_{i j} | \leq M \sqrt{λ_{j} / p}$ .
$\sqrt{p} {(log T)}^{1 / r_{2}} = o (λ_{m})$ .

The first three conditions are common in factor model literature. If we write B = (b̃₁, …, b̃_m), by Weyl’s inequality we have max_1≤j≤m ‖b̃_j‖²/λ_j ≤ 1+‖Σ_u‖/λ_j = 1+o(1). Thus it is reasonable to assume the magnitude |b_ij | of factor loadings is of order $\sqrt{λ_{j} / p}$ in the fourth condition. The last condition is imposed to ease technical presentation.

Now we are ready to investigate ‖Σ̂^S −Σ‖_Σ. Suppose the SVD decomposition of Σ,

Σ = (Γ_{p \times m} Ω_{p \times (p - m)}) (\begin{matrix} Λ_{m \times m} \\ Θ_{(p - m) \times (p - m)} \end{matrix}) (\begin{matrix} Γ' \\ Ω' \end{matrix}) .

Then obviously

{‖ {\hat{Σ}}^{S} - Σ ‖}_{Σ} \leq ‖ Σ^{- \frac{1}{2}} (\hat{Γ} {\hat{Λ}}^{S} \hat{Γ}' - B B') Σ^{- \frac{1}{2}} ‖ + ‖ Σ^{- \frac{1}{2}} ({\hat{Σ}}_{u}^{⊤} - Σ_{u}) Σ^{- \frac{1}{2}} ‖ ≕ Δ_{L} + Δ_{S},

(4.9)

and

Δ_{S} \leq ‖ Σ^{- 1} ‖ ‖ {\hat{Σ}}_{u}^{⊤} - Σ_{u} ‖ \leq C ‖ {\hat{Σ}}_{u}^{⊤} - Σ_{u} ‖ .

(4.10)

It can be shown

Δ_{L} = ‖ (\begin{matrix} Λ^{- \frac{1}{2}} Γ' \\ Θ^{- \frac{1}{2}} Ω' \end{matrix}) (\hat{Γ} {\hat{Λ}}^{S} \hat{Γ}' - B B') (Γ Λ^{- \frac{1}{2}} Ω Θ^{- \frac{1}{2}}) ‖ \leq Δ_{L 1} + Δ_{L 2},

(4.11)

where $Δ_{L 1} = ‖ Λ^{- \frac{1}{2}} Γ' (\hat{Γ} {\hat{Λ}}^{S} \hat{Γ}' - B B') Γ Λ^{- \frac{1}{2}} ‖$ and $Δ_{L 2} = ‖ Θ^{- \frac{1}{2}} Ω' (\hat{Γ} {\hat{Λ}}^{S} \hat{Γ}' - B B') Ω Θ^{- \frac{1}{2}} ‖$ . Thus in order to find the convergence rate of relative spectral norm, we need to consider the terms Δ_L1, Δ_L2 and Δ_S separately. Notice that Δ_L1 measures the relative error of the estimated spiked eigenvalues, Δ_L2 reflects the goodness of the estimated eigenvectors, and Δ_S controls the error of estimating the sparse idiosyncratic covariance matrix. To bound the relative Frobenius norm ‖Σ̂_S − Σ‖_Σ,F, we define similar quantities Δ̃_L1, Δ̃_L2, Δ̃_S which replace the spectral norm by Frobenius norm multiplied by p^−1/2. Note that (4.9) – (4.11) also hold for relative Frobenius norm with Δ̃_L1, Δ̃_L2, Δ̃_S. The following theorem reveals the rate of each term. Its proof will be provided in Appendix ?? of the supplementary material (Wang and Fan, 2015).

Theorem 4.1

Under Assumptions 2.1, 2.2, 2.3, 4.1 and 4.2, if p log p > max{T(log T)^4/r₂, T(log(pT))^2/r₁}, we have

{\tilde{Δ}}_{L 1} \leq Δ_{L 1} = O_{P} (T^{- 1 / 2}), Δ_{L 2} = O_{P} (\frac{p}{T}), {\tilde{Δ}}_{L 2} = O_{P} (\frac{\sqrt{p}}{T}),

and by the applying adaptive thresholding estimator (4.6) with

τ_{i j} = C ω_{T} {({\hat{σ}}_{u, i i} {\hat{σ}}_{u, j j})}^{1 / 2}, and ω_{T} = \sqrt{log p / T} + \sqrt{1 / p},

we have

{\tilde{Δ}}_{S} \leq Δ_{S} = O_{P} (m_{p} ω_{T}^{1 - q}) .

Combining the three terms, ${‖ {\hat{Σ}}^{S} - Σ ‖}_{Σ} = O_{P} (p / T + m_{p} ω_{T}^{1 - q})$ and ${‖ {\hat{Σ}}^{S} - Σ ‖}_{Σ, F} = O_{P} (\sqrt{p} / T + m_{p} ω_{T}^{1 - q})$ .

The relative error convergence characterizes the accuracy of estimation for spiked covariance matrix. In contrast with the result on relative Frobenius norm, this is the first time that the relative rate under spectral norm is derived. As long as λ_m is slightly above $\sqrt{p}$ , we reach the same rates of convergence. Therefore, we can conclude S-POET is effective even under a much weaker signal level. Comparing the rate with (4.7), under relative Frobenius norm, we achieve a better rate without the artificial log term, thanks to the new asymptotic results.

4.3. Portfolio risk management

The risk of a given portfolio with allocation weight w is conventionally measured by its variance w′Σw, where Σ is the volatility (covariance) matrix of the returns of underlying assets. To estimate large portfolio’s risks, it needs to estimate a large covariance matrix Σ and factor models are frequently used to reduce the dimensionality. This was the idea of Fan, Liao and Shi (2015) in which they used POET estimator to estimate Σ. However, the basic method for bounding the risk error |w′Σ̂w − w′Σw| in their paper is

| w' \hat{Σ} w - w' Σ w | \leq {‖ w ‖}_{1}^{2} {‖ \hat{Σ} - Σ ‖}_{max} .

They assumed that the gross exposure of the portfolio is bounded, i.e. ‖w‖₁ = O(1). Technically, when p is large, w′Σw can be small. What an investor cares mostly is the relative risk error RE(w) = |w′Σw/w′Σw − 1|. Often w is a data-driven investment strategy, which depends on the past data. Regardless of what w is,

max_{w} R E (w) = {‖ \hat{Σ} - Σ ‖}_{Σ},

which does not converge by Theorem 4.1 for p > T. Thus the question of interest is what kind of portfolio w will make the relative error converge. Decompose w as a linear combination of the eigenvectors of Σ, namely w = (Γ, Ω)η and $η = (η_{A}^{'}, η_{B}^{'})'$ . We have the following useful result for risk management.

Theorem 4.2

Under Assumptions 2.1, 2.2, 4.1,4.2 and the factor model (4.1) with Gaussian noises and factors, if there exists C₁ > 0 such that ‖η_B‖₁ ≤ C₁, and assume λ_j ∝ p^α for j = 1, …, m and T ≥ Cp^β for α > 1/2, 0 < β < 1, α + β > 1, then the relative risk error is of order

R E (w) = | \frac{w' {\hat{Σ}}^{S} w}{w' Σ w} - 1 | = O_{P} (T^{- min {\frac{2 (α + β - 1)}{β}, \frac{1}{2}}} + m_{p} w_{T}^{1 - q}),

for α < 1. If α ≥ 1 or ‖η_A‖ ≥ C₂, $R E (w) = O_{P} (m_{p} w_{T}^{1 - q})$ .

The condition ‖η_B‖₁ ≤ C₁ is generally weaker than ‖w‖₁ = O(1). It does not limit the total exposure of investor’s position, but only put constraint on investment of the non-spiked section. Note that under the conditions of Theorem 4.2, p/(Tλ_j) → 0, and S-POET and POET are approximately the same. So the stated result is valid for POET too.

4.4. Estimation of false discovery proportion

Another important application of the factor model is the estimation of false discovery proportion. For simplicity, we assume Gaussian data X_i ~ N(μ, Σ) with an unknown correlation matrix Σ and wish to test which coordinates of μ are nonvanishing. Consider the test statistic $Z = \sqrt{n} \bar{X}$ where X̄ is the sample mean of all data. Then Z ~ N(μ*, Σ) with $μ^{*} = \sqrt{n} μ$ . The problem is to test

H_{0 j} : μ_{j}^{*} = 0 v . s . H_{1 j} : μ_{j}^{*} \neq 0 .

Define the number of discoveries R(t) = #{j : P_j ≤ t} and the number of false discoveries V(t) = #{true null : P_j ≤ t}, where P_j is the p-value associated with the j^th test. Note that R(t) is observable while V(t) needs to be estimated. The false discovery proportion (FDP) is defined as FDP(t) = V(t)/R(t).

Fan and Han (2013) proposed to employ the factor structure

Σ = B B' + A,

(4.12)

where $B = (\sqrt{λ_{1}} ξ_{1}, \dots, \sqrt{λ_{m}} ξ_{m})$ . λ_j and ξ_j are respectively the j^th eigenvalue and eigenvector of Σ as before. Then Z can be stochastically decomposed as

Z = μ^{*} + B W + K,

where W ~ N(0, I_m) are m common factors and K ~ N(0, A), independent of W, are the idiosyncratic errors. For simplicity, assume the maximal number of nonzero elements of each row of A is bounded. In Fan and Han (2013), they argued that the asymptotic upper bound

F D P_{A} (t) = \sum_{i = 1}^{p} [Φ (a_{i} (z_{t / 2} + η_{i})) + Φ (a_{i} (z_{t / 2} - η_{i}))] / R (t)

(4.13)

of FDP(t) should be a realistic target to estimate for dependence tests, where z_t/2 is the t/2-quantile of the standard normal distribution a_i = (1 − ‖b_i‖²)^−1/2, $η_{i} = b_{i}^{'} W$ and $b_{i}^{'}$ is the i^th row of B.

Realized factors W and the loading matrix B are typically unknown. If a generic estimator Σ̂ is provided, then we are able to estimate B and thus b_i from its empirical eigenvalues and eigenvectors λ̂_j’s and ξ̂_j’s. W can be estimated by the least-squares estimate Ŵ = (B̂′B̂)⁻¹B̂′Z. Fan and Han (2013) proposed the following estimator for FDP_A(t):

{\hat{F D P}}_{U} (t) = \sum_{i = 1}^{p} [Φ (â_{i} (z_{t / 2} + {\hat{η}}_{i})) + Φ (â_{i} (z_{t / 2} - {\hat{η}}_{i}))] / R (t),

(4.14)

where â_i = (1 − ‖b̂_i‖²)^−1/2 and ${\hat{η}}_{i} = {\hat{b}}_{i}^{'} Ŵ$ . The following assumptions are in their paper.

Assumption 4.3

There exists a constant h > 0 such that (i) R(t)/p > hp^−θ for h > 0 and θ ≥ 0 as p → ∞ and (ii) â_i ≤ h, a_i ≤ h for all i = 1, …, p.

They showed that if Σ̂ is based on the POET estimator with a spike size λ_m ≍ p, under Assumptions ?? - ??, on the event that Assumption 4.3 holds,

| {\hat{F D P}}_{U, P O E T} (t) - F D P_{A} (t) | = O_{P} (p^{θ} (\sqrt{\frac{log p}{T}} + \frac{‖ μ^{*} ‖}{\sqrt{p}})) .

(4.15)

Again we can relax the assumption on the spike magnitude from order p to much weaker Assumption 4.1. Since Σ is a correlation matrix, λ₁ ≤ tr(Σ) = p. This, together with Assumption 4.1, leads us to consider leading eigenvalues of order p^α for 1/2 < α ≤ 1.

Now we apply the proposed S-POET method to obtain Σ̂^S and use it for FDP estimation. The following theorem shows the estimation error.

Theorem 4.3

If Assumptions 2.1, 2.2, 4.1, and 4.2 are applied to Gaussian independent data X_i ~ N(μ, Σ), and λ_j ∝ p^α for j = 1, …, m, T ≥ Cp^β for 1/2 < α ≤ 1, 0 < β < 1, α + β > 1, on the event that Assumption 4.3 holds, we have

| {\hat{F D P}}_{U, S P O E T} (t) - F D P_{A} (t) | = O_{P} (p^{θ} (‖ μ^{*} ‖ p^{- \frac{1}{2}} + T^{- min {\frac{α + β - 1}{β}, \frac{1}{2}}})) .

Comparing the result with (4.15), this convergence rate attained by S-POET is more general than the rate achieved before. The only difference is the second term, which is O(T^−1/2) if $α + \frac{1}{2} β \geq 1$ and T^{−(α +
β−1)/β} otherwise. So we relax the condition from α = 1 in Fan and Han (2013) to α ∈ (1/2, 1]. This means a weaker signal than order p is actually allowed to obtain a consistent estimate of false discovery proportion.

5. Simulations

We conducted some simulations to demonstrate the finite sample behavior of empirical eigen-structure, the performance of S-POET, and validity of applying it to estimate false discovery proportion.

5.1. Eigen-structure

In this simulation, we set n = 50, p = 500 and Σ = diag(50, 20, 10, 1, …, 1), which has three spike eigenvalues (m = 3) λ₁ = 50, λ₂ = 20, λ₃ = 10 and correspondingly c₁ = 0.2, c₂ = 0.5, c₃ = 1. Data was generated from multivariate Gaussian. The number of simulations is 1000. The histograms of the standardized empirical eigenvalues $\sqrt{n / 2} ({\hat{λ}}_{j} / λ_{j} - 1 - c_{j})$ , and their associated asymptotic distributions (standard normal) are plotted in Figure 1. The approximations are very good even for this low sample size n = 50.

Fig 1 — Behavior of empirical eigenvalues. The empirical distributions of $\sqrt{n / 2} ({\hat{λ}}_{j} / λ_{j} - 1 - c_{j})$ for j = 1, 2, 3 are compared with their asymptotic distributions N(0, 1).

Figure 2 shows the histograms of $\sqrt{n} ({\hat{ξ}}_{j A} / ‖ {\hat{ξ}}_{j A} ‖ - e_{j A})$ for the first three elements (the spiked part) of the first three eigenvectors. On the one hand, according to the asymptotic results, the values in the diagonal position should stochastically converge to 0 as observed. On the other hand, plots in the off-diagonal positions should converge in distribution to N(0, 1) for k ≠ j after standardization, which is indeed the case. We also report the correlations between the first three elements for the three eigenvectors based on those 1000 repetitions in Table 1. The correlations are all quite close to 0, which is consistent with the theory.

Fig 2 — Behavior of empirical eigenvectors. The histogram of the *k^th* element of the *j^th* empirical vector is depicted in the location (*k, j*) for *k, j* ≤ 3. Off-diagonal plots of values $\sqrt{n} {\hat{ξ}}_{j k} / ‖ {\hat{ξ}}_{j A} ‖ / \sqrt{\frac{c_{j} c_{k}}{{(c_{j} - c_{k})}^{2}}}$ are compared to their asymptotic distributions N(0, 1) for k ≠ j while diagonal plots of values $\sqrt{n} ({\hat{ξ}}_{j j} / ‖ {\hat{ξ}}_{j A} ‖ - 1)$ are compared to stochastically 0.

Table 1.

The correlations between the first three elements for each of the three empirical eigenvectors based on 1000 repetitions

	1st & 2nd elements	1st & 3rd elements	2nd & 3rd elements
1st Eigenvector	0.00156	−0.00192	−0.04112
2nd Eigenvector	−0.02318	−0.00403	0.01483
3rd Eigenvector	−0.02529	−0.04004	0.12524

Open in a new tab

For the normalized non-spiked part ξ̂_jB/‖ξ̂_jB‖, it should be distributed uniformly over the unit sphere. This can be tested by the results of Cai, Fan and Jiang (2013). For any n data points X₁, …, X_n on a p-dimensional sphere, define the normalized empirical distribution of angles of each pair of vectors as

μ_{n, p} = \frac{1}{(\begin{matrix} n \\ 2 \end{matrix})} \sum_{1 \leq i < j \leq n} δ_{\sqrt{p - 2} (π / 2 - Θ_{i j})},

where Θ_ij ∈ [0, π] is the angle between vectors X_i and X_j. When the data are generated uniformly from a sphere, μ_n,p converges to the standard normal distribution with probability 1. Figure 3 shows the empirical distributions of all pairwise angles of the realized ξ̂_jB/‖ξ̂_jB‖ (j = 1, 2, 3) in 1000 simulations. Since number of such pairwise angels is $(\begin{matrix} 1000 \\ 2 \end{matrix})$ , the empirical distributions and the asymptotic distributions N(0, 1) are almost identical. The normality holds even for a small subset of the angles.

Fig 3 — The empirical distributions of all pairwise angles of the 1000 realized ξ̂_jB/‖ξ̂_jB‖ (j = 1, 2, 3) compared with their asymptotic distributions N(0, 1).

Lastly, we did simulations to verify the rate difference of 〈ξ̂_j, e_j〉 for m = 1 and m > 1, revealed in Theorem 3.2 (iii). We choose n = [10 × 1.2^l] for l = 0, …, 9, p = [n³/100], where [·] represents rounding. We set λ_j = 1 for j ≥ 3 and consider two situations: (1) λ₁ = p, λ₂ = 1, (2) λ₁ = 2λ₂ = p. Under both cases, simulations were carried out 500 times and the corresponding angle of the empirical eigenvector and its truth was calculated for each simulation. The logarithm of the median absolute error of $〈 {\hat{ξ}}_{1}, e_{1} 〉 - 1 / \sqrt{1 + c_{1}}$ was plotted against log(n). Under the two situations, the rates of convergence are O_P(n^−3/2) and O_P(n⁻¹) respectively. Thus the slope of the curves should be −3/2 for a single spike and −1 for two spikes, which is indeed the case as shown in Figure 4.

Fig 4 — Difference of convergence rate of $〈 {\hat{ξ}}_{1}, e_{1} 〉 - 1 / \sqrt{1 + c_{1}}$ for models with a single spike and two spikes. The error should be expected to decrease at the rate of *O_P*(n^−3/2) and *O_P*(n⁻¹) respectively.

In short, all the simulation results match well with the theoretical results for the high dimensional regime.

5.2. Performance of S-POET

We demonstrate the effectiveness of S-POET in comparison with POET. A similar setting to the last section is used, i.e. m = 3 and c₁ = 0.2, c₂ = 0.5, c₃ = 1. The sample size T ranges from 50 to 150 and p = [T^3/2]. Note that when T = 150, p ≈ 1800. The spiked eigenvalues are determined from p/(Tλ_j) = c_j so that λ_j is of order $\sqrt{T}$ , which is much smaller than p. For each pair of T and p, the following steps are used to generate observed data from the factor model for 200 times.

Each row of B is simulated from the standard multivariate normal distribution and the j^th column is normalized to have norm λ_j for j = 1, 2, 3.
Each row of F is simulated from standard multivariate normal distribution.
Set $Σ_{u} = diag (σ_{1}^{2}, \dots, σ_{p}^{2})$ where σ_i’s are generated from Gamma(α, β) with α = β = 100 (mean 1, standard deviation 0.1). The idiosyncratic error U is simulated from N(0, Σ_u).
Compute the observed data Y = BF′ + U.

Both S-POET and POET are applied to estimate the covariance matrix Σ = BB′ + Σ_u. Their mean estimation errors over 200 simulations, measured in relative spectral norm ‖Σ̂ − Σ‖_Σ, relative Frobenius norm ‖Σ̂ − Σ‖_Σ,F, spectral norm ‖Σ̂ − Σ‖ and max norm ‖Σ̂ − Σ‖_max, are reported in Figure 5. The errors for sample covariance matrix are also depicted for comparison. First notice that no matter in what norm, S-POET uniformly outperforms POET and the sample covariance. It affirms the claim that shrinkage of spiked eigenvalues is necessary to maintain good performance when the spikes are not sufficiently large. Since the low rank part is not shrunk for POET, its error under the spectral norm is comparable and even slightly larger than that of the sample covariance matrix. The errors under max norm and relative Frobenius norm as expected decrease as T and p increase. However the error under the relative spectral norm does not converge: our theory shows it should increase in the order $p / T = \sqrt{T}$ .

Fig 5 — Estimation errors of covariance matrix under relative spectral, relative Frobenius, spectral and max norms using S-POET (red), POET (black) and sample covariance (blue).

5.3. FDP estimation

In this section, we report simulation results on FDP estimation by using both POET and S-POET. The data are simulated in a similar way as in Section 5.2 with p = 1000 and n = 100. The first m = 3 eigenvalues have spike size proportional to $p / \sqrt{n}$ which corresponds to α = β = 2/3 in Theorem 4.3. The true FDP is calculated by using FDP(t) = V (t)/R(t) with t = 0.01. The approximate FDP, FDP_A(t), is calculated as in (4.13) with known B but estimated W given by Ŵ = (BB′)⁻¹B′Z. This FDP_A(t) based on a known covariance matrix serves as a benchmark for our estimated covariance matrix to compare with. We employed POET and S-POET to get ${\hat{F D P}}_{U, P O E T} (t)$ and ${\hat{F D P}}_{U, S P O E T} (t)$ .

In Figure 6, three scatter plots are drawn to compare FDP_A(t), ${\hat{F D P}}_{U, P O E T} (t)$ and ${\hat{F D P}}_{U, S P O E T} (t)$ with the true FDP(t). The points are basically aligned along the 45 degree line, meaning that all of them are quite close to the true FDP. With the semi-strong signal $λ \propto p / \sqrt{n}$ , although much weaker than order p, POET accomplishes the task as well as S-POET. Both estimators perform as well as if we know the covariance matrix Σ, the benchmark.

Fig 6 — Comparison of estimated FDP’s with true values. The left plot assumes knowledge of B, the middle and right ones are corresponding to POET and S-POET methods respectively. The results are aligned along the 45-degree line, indicating the accuracy of the estimated FDP.

6. Conclusions

In this paper, we studied two closely related problems: the asymptotic behavior of empirical eigenvalues and eigenvectors under a general regime of bounded p/(nλ_j) and the large covariance estimation for factor models with relaxed signal level of $\sqrt{p} = o (λ_{j})$ .

The first study provides new technical tools for the derivation of error bounds for large covariance estimation under relative Frobenius norm (with better rate) and relative spectral norm (for the first time). The results motivate the new proposed covariance estimator S-POET for the second problem by correcting biases of the estimated leading eigenvalues. S-POET is demonstrated to have better sampling properties than POET, and this is convincingly verified in the simulation study. In addition, we are able to apply S-POET to two important applications, risk management and false discovery control, and relax the required signal to $\sqrt{p}$ . Those conclusions shed new lights for applications of factor models.

On the other hand, the second problem is a key motivation for us to study the empirical engen-structure in a more general high dimensional regime. We aim to understand why PCA works for pervasive factor models but fails classical random matrix problems, without sparsity assumptions. What are the fundamental limit for the PCA in high dimension? We clearly showed that for both empirical eigenvalues and vectors, consistency is granted once p/(nλ_j) → 0. Further, our theories give a fine-grained characterization of the asymptotic behavior under the generalized and unified regime, which includes the situation of bounded eigenvalues, HDLSS, and pervasive factor models, especially for empirical eigenvectors. The asymptotic rate of convergence is obtained as long as p/(nλ_j) is bounded, while the asymptotic distribution is fully described when $\sqrt{p} = o (λ_{j})$ . Some interesting phenomena, such as interaction between multiple spikes, are also revealed in our results. Our proofs are novel in that we clearly identify terms that keep the low-dimensional asymptotic normality and terms that generate the random biases. In sum, our results serve as a necessary complement of the random matrix literature when the signal diverges with dimensionality.

Supplementary Material

supp1

NIHMS842544-supplement-supp1.pdf^{(349.6KB, pdf)}

Acknowledgments

The research was partially supported by NSF grants DMS-1206464 and DMS-1406266 and NIH grants R01-GM072611-11 and NIH R01GM100474-04. We would like to thank the Editor, associate editor, and anonymous referees for constructive comments that lead to substantial improvement of the presentation and results of the paper.

APPENDIX A

PROOFS FOR SECTION 3

A.1. Proof of Theorem 3.1

We first provide three useful lemmas for the proof. Lemma A.1 provides non-asymptotic upper and lower bound for the eigenvalues of weighted Wishart matrix for sub-Gaussian distributions.

Lemma A.1

Let A₁, …, A_n’s be n independent p dimensional sub-Gaussian random vectors with zero mean and identity variance, and the sub-Gaussian norms bounded by a constant C₀. Then for every t ≥ 0, with probability at least 1 − 2 exp(−ct²), one has

\bar{w} - max {δ, δ^{2}} \leq λ_{p} (\frac{1}{n} \sum_{i = 1}^{n} w_{i} A_{i} A_{i}^{'}) \leq λ_{1} (\frac{1}{n} \sum_{i = 1}^{n} w_{i} A_{i} A_{i}^{'}) \leq \bar{w} + max {δ, δ^{2}} .

where $δ = C \sqrt{p / n} + t / \sqrt{n}$ for constants C, c > 0, depending on C₀. Here |w_i|’s is bounded for all i and $\bar{w} = n^{- 1} \sum_{i = 1}^{n} w_{i}$ .

The above lemma is the extension of the classical Davidson-Szarek bound [Theorem II.7 of Davidson and Szarek (2001)] to the weighted sample co-variance with sub-Gaussian distribution. It was shown by Vershynin (2010) that the conclusion holds with w_i = 1 for all i. With similar techniques to those developed in Vershynin (2010), we can obtain the above lemma for general bounded weights. The details are omitted.

Now in order to prove the theorem, let us define two quantities and treat them separately in the following two lemmas. Let

A = n^{- 1} \sum_{j = 1}^{m} λ_{j} Z_{j} Z_{j}^{'}, and B = n^{- 1} \sum_{j = m + 1}^{p} λ_{j} Z_{j} Z_{j}^{'},

where Z_j is columns of $X Λ^{- \frac{1}{2}}$ . Then,

\tilde{Σ} = \frac{1}{n} \sum_{j = 1}^{p} λ_{j} Z_{j} Z_{j}^{'} = A + B .

(A.1)

Lemma A.2

Under Assumptions 2.1 – 2.3, as n → ∞,

\sqrt{n} (λ_{j} (A) / λ_{j} - 1) \overset{d}{\Rightarrow} N (0, κ_{j} - 1), for j = 1, \dots, m .

In addition, they are asymptotically independent.

Lemma A.3

Under Assumptions 2.1 – 2.3, for j = 1, ⋯, m, we have

λ_{k} (B) / λ_{j} = \bar{c} c_{j} + O_{P} (λ_{j}^{- 1} \sqrt{p / n}) + o_{P} (n^{- \frac{1}{2}}), for k = 1, 2, \dots, n .

The proofs of the above two lemmas will be given in Appendix?? in the supplementary material (Wang and Fan, 2015).

Proof of Theorem 3.1

By Wely’s Theorem, λ_j(A) + λ_n(B) ≤ λ̂_j ≤ λ_j(A) + λ₁(B): Therefore from Lemma A.3,

\frac{{\hat{λ}}_{j}}{λ_{j}} = \frac{λ_{j} (A)}{λ_{j}} + \bar{c} c_{j} + O_{P} (λ_{j}^{- 1} \sqrt{\frac{p}{n}}) + o_{P} (c_{j} n^{- 1 / 2}),

By Lemma A.2 and Slutsky’s theorem, we conclude that $\sqrt{n} ({\hat{λ}}_{j} / λ_{j} - (1 + \bar{c} c_{j} + O_{P} (λ_{j}^{- 1} \sqrt{p / n)}))$ converges in distribution to N(0, κ_j − 1) and the limiting distributions of the first m eigenvalues are independent.

A.2. Proofs of Theorem 3.2

The proof of Theorem 3.2 is mathematically involved. The basic idea for proving part (i) is outlined in Section 2. We relegate less important technical lemmas A.4 – A.6 to Appendix ?? in the supplementary material (Wang and Fan, 2015) in order not to distract the readers. The proof of part (ii) utilizes the invariance of standard Gaussian distribution under orthogonal transformations.

Proof of Theorem 3.2

(i) Let us start by proving the asymptotic normality of ξ̂_jA for the case m > 1. Write

X = (Z_{A} Λ_{A}^{\frac{1}{2}}, Z_{B} Λ_{B}^{\frac{1}{2}}) = (\sqrt{λ_{1}} Z_{1}, \dots, \sqrt{λ_{m}} Z_{m}, \sqrt{λ_{m + 1}} Z_{m + 1}, \dots, \sqrt{λ_{p}} Z_{p}),

where each Z_j follows a sub-Gaussian distribution with mean 0 and identity variance I_n. Then by the eigenvalue relationship of equation (2.1), we have

{\hat{ξ}}_{j A} = \frac{Λ_{A}^{\frac{1}{2}} Z_{A}^{'} u_{j}}{\sqrt{n {\hat{λ}}_{j}}} and u_{j} = \frac{X {\hat{ξ}}_{j}}{\sqrt{n {\hat{λ}}_{j}}} = \frac{Z_{A} Λ_{A}^{\frac{1}{2}} {\hat{ξ}}_{j A}}{\sqrt{n {\hat{λ}}_{j}}} + \frac{Z_{B} Λ_{B}^{\frac{1}{2}} {\hat{ξ}}_{j B}}{\sqrt{n {\hat{λ}}_{j}}} .

(A.2)

Recall u_j is the eigenvector of the matrix Σ̃, that is, $\frac{1}{n} X X' u_{j} = {\hat{λ}}_{j} u_{j}$ . Using $X = (Z_{A} Λ_{A}^{\frac{1}{2}}, Z_{B} Λ_{B}^{\frac{1}{2}})$ , we obtain

(I_{n} - \frac{1}{n} Z_{A} \frac{Λ_{A}}{λ_{j}} Z_{A}^{'}) u_{j} = D u_{j} - Δ u_{j},

(A.3)

where we denote $D = {(n λ_{j})}^{- 1} Z_{B} Λ_{B} Z_{B}^{'} - \bar{c} c_{j} I_{n}$ , Δ = λ̂_j/λ_j − (1 + c̄c_j). We then left-multiply equation (A.3) by $Λ_{A}^{\frac{1}{2}} Z_{A}^{'} / \sqrt{n {\hat{λ}}_{j}}$ and employ relationship (A.2) to replace u_j by ξ̂_jA and ξ̂_jB as follows:

(I_{m} - \frac{Λ_{A}}{λ_{j}}) {\hat{ξ}}_{j A} = \frac{Λ_{A}^{\frac{1}{2}} (\frac{1}{n} Z_{A}^{'} Z_{A} - I_{m}) Λ_{A}^{\frac{1}{2}}}{λ_{j}} {\hat{ξ}}_{j A} + \frac{Λ_{A}^{\frac{1}{2}} Z_{A}^{'} D Z_{A} Λ_{A}^{\frac{1}{2}}}{n {\hat{λ}}_{j}} {\hat{ξ}}_{j A} + \frac{Λ_{A}^{\frac{1}{2}} Z_{A}^{'} D Z_{B} Λ_{B}^{\frac{1}{2}}}{n {\hat{λ}}_{j}} {\hat{ξ}}_{j B} - Δ {\hat{ξ}}_{j A} .

(A.4)

Further define

R = \sum_{k \in [m] \ j} \frac{λ_{j}}{λ_{j} - λ_{k}} e_{k A} e_{k A}^{'} .

Then we have $R (I - Λ_{A} / λ_{j}) = I_{m} - e_{j A} e_{j A}^{'}$ . Note that R is only well defined if m > 1. Therefore, by left multiplying R to equation (A.4),

{\hat{ξ}}_{j A} - 〈 {\hat{ξ}}_{j A}, e_{j A} 〉 e_{j A} = R {(\frac{Λ_{A}}{λ_{j}})}^{\frac{1}{2}} K {(\frac{Λ_{A}}{λ_{j}})}^{\frac{1}{2}} {\hat{ξ}}_{j A} + R \frac{Λ_{A}^{\frac{1}{2}} Z_{A}^{'} D Z_{B} Λ_{B}^{\frac{1}{2}}}{n {\hat{λ}}_{j}} {\hat{ξ}}_{j B} - Δ R {\hat{ξ}}_{j A},

(A.5)

where $K = n^{- 1} Z_{A}^{'} Z_{A} - I_{n} + λ_{j} {(n {\hat{λ}}_{j})}^{- 1} Z_{A}^{'} D Z_{A}$ . Dividing both side by ‖ξ̂_jA‖, we are able to write

\frac{{\hat{ξ}}_{j A}}{‖ {\hat{ξ}}_{j A} ‖} - e_{j A} = R {(\frac{Λ_{A}}{λ_{j}})}^{\frac{1}{2}} K {(\frac{Λ_{A}}{λ_{j}})}^{\frac{1}{2}} e_{j A} + r_{n},

(A.6)

where

r_{n} = (〈 \frac{{\hat{ξ}}_{j A}}{‖ {\hat{ξ}}_{j A} ‖}, e_{j A} 〉 - 1) e_{j A} + R {(\frac{Λ_{A}}{λ_{j}})}^{\frac{1}{2}} K {(\frac{Λ_{A}}{λ_{j}})}^{\frac{1}{2}} (\frac{{\hat{ξ}}_{j A}}{‖ {\hat{ξ}}_{j A} ‖} - e_{j A}) + R \frac{Λ_{A}^{\frac{1}{2}} Z_{A}^{'} D Z_{B} Λ_{B}^{\frac{1}{2}}}{n {\hat{λ}}_{j}} \frac{{\hat{ξ}}_{j B}}{‖ {\hat{ξ}}_{j A} ‖} - Δ R (\frac{{\hat{ξ}}_{j A}}{‖ {\hat{ξ}}_{j A} ‖} - e_{j A}) .

(A.7)

Lemma A.4

As n → ∞, $‖ r_{n} ‖ = O_{P} (λ_{j}^{- 1} \sqrt{p / n} + 1 / n)$ .

By Lemma A.4, r_n is a smaller order term. Since ${(Λ_{A} / λ_{j})}^{\frac{1}{2}} e_{j A} = e_{j A}$ ,

\sqrt{n} (\frac{{\hat{ξ}}_{j A}}{‖ {\hat{ξ}}_{j A} ‖} - e_{j A} + O_{P} (\sqrt{\frac{p}{n λ_{j}^{2}}})) = \sqrt{n} R {(\frac{Λ_{A}}{λ_{j}})}^{\frac{1}{2}} K e_{j A} + o_{P} (1) .

(A.8)

Now let us derive normality of the right hand side of (A.8). According to the definition of R,

R {(\frac{Λ_{A}}{λ_{j}})}^{\frac{1}{2}} = \sum_{k \in [m] \ j} \frac{\sqrt{λ_{j} λ_{k}}}{λ_{j} - λ_{k}} e_{k A} e_{k A}^{'} \to \sum_{k \in [m] \ j} a_{j k} e_{k A} e_{k A}^{'} .

(A.9)

Let $W = \sqrt{n} K e_{j A} = (W_{1}, \dots, W_{m})$ and W^(−j) be the (m − 1)-dimensional vector without the j^th element in W. Since the j^th diagonal element of R is zero, $R {(Λ_{A} / λ_{j})}^{\frac{1}{2}} W$ depends only on W^(−j).

Lemma A.5

$W^{(- j)} + O_{P} (λ_{j}^{- 1} \sqrt{p / n}) \overset{d}{\Rightarrow} N (0, I_{m - 1})$ .

Therefore, by Lemma A.5 and Slutsky’s theorem,

\sqrt{n} R {(\frac{Λ_{A}}{λ_{j}})}^{\frac{1}{2}} K e_{j A} + O_{P} (\sqrt{\frac{p}{n λ_{j}^{2}}}) \overset{d}{\Rightarrow} N_{m} (0, \sum_{k \in [m] \ j} a_{j k}^{2} e_{k A} e_{k A}^{'}) .

Together with (A.8), we concludes (3.3) for the case m > 1.

Now let us turn to the case of m = 1. Since R is not defined for m = 1, we need to find a different derivation. Equivalently, (A.3) can be written as

\frac{1}{n} Z_{1} Z_{1}^{'} u_{1} + \frac{1}{n λ_{1}} Z_{B} Λ_{B} Z_{B}^{'} u_{1} = \frac{{\hat{λ}}_{1}}{λ_{1}} u_{1} .

Left-multiplying $u_{1}^{'}$ and using relationship (A.2), we obtain easily

{\hat{ξ}}_{1 A}^{2} = 1 - \frac{\bar{c} c_{1}}{{\hat{λ}}_{1} / λ_{1}} - \frac{λ_{1}}{{\hat{λ}}_{1}} u_{1}^{'} D u_{1} = 1 - \frac{\bar{c} c_{1}}{{\hat{λ}}_{1} / λ_{1}} + O_{P} (λ_{1}^{- 1} \sqrt{p / n}),

where D is defined as before and $‖ D ‖ = O_{P} (λ_{1}^{- 1} \sqrt{p / n})$ according to the proof of Lemma A.4. Expanding $\sqrt{1 - \bar{c} c_{1} / x}$ at the point of (1 + c̄c₁), we have

{\hat{ξ}}_{1 A} = \frac{1}{\sqrt{1 + \bar{c} c_{1}}} + \frac{\bar{c} c_{1}}{2 {(1 + \bar{c} c_{1})}^{3 / 2}} ({\hat{λ}}_{1} / λ_{1} - (1 + \bar{c} c_{1})) + O_{P} (\sqrt{\frac{p}{n λ_{1}^{2}}} + c_{1} n^{- 1}) .

Note that from Lemmas A.2 and A.3, ${\hat{λ}}_{1} / λ_{1} - (1 + \bar{c} c_{1}) = ({‖ Z_{1} ‖}^{2} / n - 1) + O_{P} (λ_{1}^{- 1} {(p / n)}^{1 / 2}) + o_{P} (c_{j} n^{- 1 / 2})$ . Therefore due to the fact $\sqrt{n} ({‖ Z_{1} ‖}^{2} / n - 1)$ is asymptotically N(0, κ₁ − 1), we conclude

\frac{2 {(1 + \bar{c} c_{1})}^{3 / 2}}{\bar{c} c_{1}} \sqrt{n} ({\hat{ξ}}_{1 A} - \frac{1}{\sqrt{1 + \bar{c} c_{1}}} + O_{P} (\sqrt{\frac{p}{n λ_{1}^{2}}})) \overset{d}{\Rightarrow} N (0, κ_{1} - 1) .

This completes the first part of the proof.

(ii) We now prove the conclusion for the non-spiked part ξ̂_jB. Recall that X_i follows N(0, Λ). Consider $X_{i}^{R} = diag (I_{m}, D_{0}) X_{i}$ where as defined in the theorem $D_{0} = diag (\sqrt{\bar{c} / λ_{m + 1}}, \dots, \sqrt{\bar{c} / λ_{p}})$ . Here the superscript R indicates rescaled data by diag(I_m, D₀). After rescaling, we have $X_{i}^{R} ~ N (0, diag (Λ_{A}, \bar{c} I_{p - m}))$ . Correspondingly, the n × p data matrix X^R = X diag(I_m, D₀) = (X_A, X_BD₀) where $X_{A} = Z_{A} Λ_{A}^{\frac{1}{2}}$ and $X_{B} = Z_{B} Λ_{B}^{\frac{1}{2}}$ as the notations before. Assume ${\hat{ξ}}_{j}^{R}$ and $u_{j}^{R}$ are eigenvectors given by Σ̂^R and Σ̃^R of the rescaled data X^R and ${\hat{ξ}}_{j}^{R} = ({\hat{ξ}}_{j A}^{R}, {\hat{ξ}}_{j B}^{R})'$ . It has been proved by Paul (2007) that $h_{0} ≔ {\hat{ξ}}_{j B}^{R} / ‖ {\hat{ξ}}_{j B}^{R} ‖$ is distributed uniformly over the unit sphere and is independent of $‖ {\hat{ξ}}_{j B}^{R} ‖$ due to the orthogonal invariance of the non-spiked part of ${\hat{ξ}}_{j B}^{R}$ . Hence it only remains to link ξ̂_jB/‖ξ̂_jB‖ with h₀.

Note that Σ̃ = n⁻¹XX′ and Σ̃^R = n⁻¹X^RX^R′, so

‖ \tilde{Σ} - {\tilde{Σ}}^{R} ‖ = ‖ \frac{1}{n} X_{B} (I - D_{0}^{2}) X_{B}^{'} ‖ = ‖ \frac{1}{n} \sum_{j = m + 1}^{p} (λ_{j} - \bar{c}) Z_{j} Z_{j} ‖,

where the last term is of order $O_{P} (\sqrt{p / n})$ by Lemma A.1. Thus by the sin θ theorem of Davis and Kahan (1970), $‖ u_{j} - u_{j}^{R} ‖ = O_{P} (λ_{j}^{- 1} \sqrt{p / n})$ . Next we convert from u_j to ξ̂_jB using the basic relationship (2.1). We have,

‖ D_{0} \frac{{\hat{ξ}}_{j B}}{‖ {\hat{ξ}}_{j B} ‖} - \frac{{\hat{ξ}}_{j B}^{R}}{‖ {\hat{ξ}}_{j B}^{R} ‖} ‖ = ‖ \frac{D_{0} X_{B}^{'} u_{j}}{\sqrt{n {\hat{λ}}_{j}} ‖ {\hat{ξ}}_{j B} ‖} - \frac{D_{0} X_{B}' u_{j}^{R}}{\sqrt{n {\hat{λ}}_{j}^{R}} ‖ {\hat{ξ}}_{j B}^{R} ‖} ‖ \leq ‖ \frac{D_{0} X_{B}^{'} u_{j}}{\sqrt{n λ_{j}}} ‖ | \sqrt{\frac{λ_{j}}{{\hat{λ}}_{j} {‖ {\hat{ξ}}_{j B} ‖}^{2}}} - \sqrt{\frac{λ_{j}}{{\hat{λ}}_{j}^{R} {‖ {\hat{ξ}}_{j B}^{R} ‖}^{2}}} | + ‖ \frac{D_{0} X_{B}^{'}}{\sqrt{n {\hat{λ}}_{j}^{R}} ‖ {\hat{ξ}}_{j B}^{R} ‖} ‖ ‖ u_{j} - u_{j}^{R} ‖ ≕ I + I I .

First it is not hard to see $I I = O_{P} (λ_{j}^{- 1} \sqrt{p / n})$ since $‖ u_{j} - u_{j}^{R} ‖ = O_{P} (λ_{j}^{- 1} \sqrt{p / n}), ‖ X_{B}^{'} / \sqrt{n λ_{j}} ‖ = O_{P} (\sqrt{c_{j}}), λ_{j} / {\hat{λ}}_{j}^{R} = O_{P} (1)$ and $1 / ‖ {\hat{ξ}}_{j B}^{R} ‖ = O_{P} (1 / \sqrt{c_{j}})$ . The last result is due to the following lemma.

Lemma A.6

$‖ {\hat{ξ}}_{j A} ‖ = {(1 + \bar{c} c_{j})}^{- 1 / 2} + O_{P} (λ_{j}^{- 1} \sqrt{p / n} + c_{j} n^{- 1 / 2})$ and $‖ {\hat{ξ}}_{j B} ‖ = {(\frac{\bar{c} c_{j}}{1 + \bar{c} c_{j}})}^{1 / 2} + O_{P} (\sqrt{1 / λ_{j}} + \sqrt{c_{j}} n^{- 1 / 2})$ .

We claim $I = O_{P} (\sqrt{n / p}) + o_{P} (n^{- 1 / 2})$ . Actually from the proof of Lemma A.6, we have

{\hat{λ}}_{j} {‖ {\hat{ξ}}_{j B} ‖}^{2} / λ_{j} = \bar{c} c_{j} + O_{P} (λ_{j}^{- 1} \sqrt{p / n}) + o_{P} (c_{j} n^{- 1 / 2}) .

Then some elementary calculation gives the rate of I. Therefore, $‖ D_{0} {\hat{ξ}}_{j B} / ‖ {\hat{ξ}}_{j B} ‖ - h_{0} ‖ = O_{P} (\sqrt{n / p}) + o_{P} (n^{- 1 / 2})$ . The conclusion (3.4) follows.

To prove the max norm bound (3.5) of ‖ξ̂_jB‖_max, we first show ${‖ h_{0} ‖}_{max} = O_{P} (\sqrt{log p / p})$ . Recall that h₀ is uniformly distributed on the unit sphere of dimension p − m. This follows easily from its normal representation. Let G to be (p − m)-dimensional multivariate standard normal distributed, then $h_{0} \overset{d}{=} G / ‖ G ‖$ . It then follows

{‖ h_{0} ‖}_{max} = max_{i \leq p - m} | G_{i} | / ‖ G ‖ = O_{P} (\sqrt{log p / p}) .

From the derivation above,

{‖ {\hat{ξ}}_{j B} ‖}_{max} \leq \sqrt{{\hat{λ}}_{j}^{R} / {\hat{λ}}_{j}} ‖ D_{0}^{- 1} ‖ ‖ {\hat{ξ}}_{j B}^{R} ‖ (I I + {‖ h_{0} ‖}_{max}),

which gives $O_{P} (\sqrt{c_{j}} (\sqrt{p / (n λ_{j}^{2}}) + \sqrt{log p / p})) = O_{P} (p / n λ_{j}^{3 / 2}) + \sqrt{log p / (n λ_{j}))}$ , given the fact that $‖ {\hat{ξ}}_{j B}^{R} ‖ = O_{P} (\sqrt{c_{j}})$ by Lemma A.6. Thus we are done with the second part of the proof.

(iii) The proof for the convergence of ‖ξ̂_jA‖ and ‖ξ̂_jB‖ are given in Lemma A.6. If m = 1, the result for ‖ξ̂_jA‖ directly gives (3.6) with the same rate. For m > 1, from Lemma A.6 we have

{‖ {\hat{ξ}}_{j A} ‖}^{2} = {(1 + \bar{c} c_{j})}^{- 1} + O_{P} (\sqrt{p / (n λ_{j}^{2})} + c_{j} n^{- 1 / 2}) .

On the other hand, from Theorem 3.2 (i), ${\hat{ξ}}_{j k}^{2} = O_{P} (p / (n λ_{j}^{2}) + 1 / n)$ for k ≠ j ≤ m. So ${\hat{ξ}}_{j 1}^{2} = {(1 + \bar{c} c_{j})}^{- 1} + O_{P} (\sqrt{p / (n λ_{j}^{2})} + c_{j} n^{- 1 / 2} + 1 / n)$ , which implies (3.6).

Footnotes

SUPPLEMENTARY MATERIAL

Supplement: Technical proofs Wang and Fan (2015) (). This document contains technical lemmas for Section 3 and the comparison of assumptions and theoretical proofs for Section 4.

REFERENCES

Agarwal A, Negahban S, Wainwright MJ. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics. 2012;40:1171–1197. [Google Scholar]
Amini AA, Wainwright MJ. Information Theory 2008. ISIT 2008. IEEE International Symposium on. IEEE; 2008. High-dimensional analysis of semidefinite relaxations for sparse principal components; pp. 2454–2458. [Google Scholar]
Anderson TW. Asymptotic theory for principal component analysis. The Annals of Mathematical Statistics. 1963;34:122–148. [Google Scholar]
Antoniadis A, Fan J. Regularization of wavelet approximations. Journal of the American Statistical Association. 2001;96 [Google Scholar]
Bai Z. Methodologies in spectral analysis of large-dimensional random matrices, a review. Statist. Sinica. 1999;9:611–677. [Google Scholar]
Bai J. Inferential theory for factor models of large dimensions. Econometrica. 2003;71:135–171. [Google Scholar]
Bai J, Ng S. Determining the number of factors in approximate factor models. Econometrica. 2002;70:191–221. [Google Scholar]
Bai Z, Silverstein JW. Spectral analysis of large dimensional random matrices. 2nd. Springer; 2009. [Google Scholar]
Bai Z, Yao J. on sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis. 2012;106:167–177. [Google Scholar]
Bai Z, Yin Y. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability. 1993:1275–1294. [Google Scholar]
Baik J, Ben Arous G, Péché S. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability. 2005:1643–1697. [Google Scholar]
Benaych-Georges F, Nadakuditi RR. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics. 2011;227:494–521. [Google Scholar]
Berthet Q, Rigollet P. Optimal detection of sparse principal components in high dimension. The Annals of Statistics. 2013;41:1780–1815. [Google Scholar]
Bickel PJ, Levina E. Covariance regularization by thresholding. The Annals of Statistics. 2008:2577–2604. [Google Scholar]
Birnbaum A, Johnstone IM, Nadler B, Paul D. Minimax bounds for sparse PCA with noisy high-dimensional data. The Annals of Statistics. 2013;41:1055. doi: 10.1214/12-AOS1014. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cai T, Fan J, Jiang T. Distributions of angles in random packing on spheres. The Journal of Machine Learning Research. 2013;14:1837–1864. [PMC free article] [PubMed] [Google Scholar]
Cai T, Ma Z, Wu Y. Optimal estimation and rank detection for sparse spiked covariance matrices. Probability Theory and Related Fields. 2015;161:781–815. doi: 10.1007/s00440-014-0562-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Candès EJ, Li X, Ma Y, Wright J. Robust principal component analysis? Journal of the ACM (JACM) 2011;58:11. [Google Scholar]
Chamberlain G, Rothschild M. Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica. 1983;51:1305–1324. [Google Scholar]
Chandrasekaran V, Sanghavi S, Parrilo PA, Willsky AS. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization. 2011;21:572–596. [Google Scholar]
Chen KH, Shimerda TA. An empirical analysis of useful financial ratios. Financial Management. 1981:51–60. [Google Scholar]
Davidson KR, Szarek SJ. Local operator theory, random matrices and Banach spaces. In: Johnson WB, Lindenstrauss J, editors. Handbook of the Geometry of Banach Spaces. Vol. 1. Elsevier Science BV; 2001. pp. 317–366. [Google Scholar]
Davis C, Kahan WM. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis. 1970;7:1–46. [Google Scholar]
De Mol C, Giannone D, Reichlin L. Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components? Journal of Econometrics. 2008;146:318–328. [Google Scholar]
Donoho DL, Gavish M, Johnstone IM. Optimal shrinkage of eigenvalues in the Spiked Covariance Model. arXiv preprint arXiv:1311.0851. 2014 doi: 10.1214/17-AOS1601. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Fan Y, Lu J. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics. 2008;147:186–197. [Google Scholar]
Fan J, Han X, Gu W. Estimating false discovery proportion under arbitrary covariance dependence. Journal of the American Statistical Association. 2012;107:1019–1035. doi: 10.1080/01621459.2012.720478. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Han X. Estimation of false discovery proportion with unknown dependence. arXiv preprint arXiv:1305.7007. 2013 doi: 10.1111/rssb.12204. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B. 2013;75:1–44. doi: 10.1111/rssb.12016. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Liao Y, Shi X. Risks of large portfolios. Journal of Econometrics. 2015;186:367–387. doi: 10.1016/j.jeconom.2015.02.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Liao Y, Wang W. Projected Principal Component Analysis in Factor Models. The Annals of Statistics. 2016;44:219–254. doi: 10.1214/15-AOS1364. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Xue L, Yao J. Sufficient Forecasting Using Factor Models. arXiv preprint arXiv:1505.07414. 2015 [Google Scholar]
Fan J, Liu H, Wang W, Zhu Z. Heterogeneity Adjustment with Applications to Graphical Model Inference. arXiv preprint arXiv:1602.05455. 2016 doi: 10.1214/18-EJS1466. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hall P, Marron J, Neeman A. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005;67:427–444. [Google Scholar]
Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics. 2001:295–327. [Google Scholar]
Johnstone IM, Lu AY. On Consistency and Sparsity for Principal Components Analysis in High Dimensions. Journal of the American Statistical Association. 2009;104:682–693. doi: 10.1198/jasa.2009.0121. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jung S, Marron J. PCA consistency in high dimension, low sample size context. The Annals of Statistics. 2009;37:4104–4130. [Google Scholar]
Koltchinskii V, Lounici K. Concentration Inequalities and Moment Bounds for Sample Covariance Operators. arXiv preprint arXiv:1405.2468. 2014a [Google Scholar]
Koltchinskii V, Lounici K. Asymptotics and Concentration Bounds for Bilinear Forms of Spectral Projectors of Sample Covariance. arXiv preprint arXiv:1408.4643. 2014b [Google Scholar]
Landgrebe J, Wurst W, Welzl G. Permutation-validated principal components analysis of microarray data. Genome Biol. 2002;3:1–11. doi: 10.1186/gb-2002-3-4-research0019. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lee S, Zou F, Wright FA. Convergence and prediction of principal component scores in high-dimensional settings. The Annals of Statistics. 2010;38:3605. doi: 10.1214/10-AOS821. [DOI] [PMC free article] [PubMed] [Google Scholar]
Leek JT, Storey JD. A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences. 2008;105:18718–18723. doi: 10.1073/pnas.0808709105. [DOI] [PMC free article] [PubMed] [Google Scholar]
Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics. 2010;11:733–739. doi: 10.1038/nrg2825. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ma Z. Sparse principal component analysis and iterative thresholding. The Annals of Statistics. 2013;41:772–801. [Google Scholar]
Onatski A. Asymptotics of the principal components estimator of large factor models with weakly inuential factors. Journal of Econometrics. 2012;168:244–258. [Google Scholar]
Paul D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica. 2007;17:1617–1642. [Google Scholar]
Pesaran MH, Zaffaroni P. Optimal asset allocation with factor models for large portfolios. 2008 [Google Scholar]
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38:904–909. doi: 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
Ringnér M. What is principal component analysis? Nature Biotechnology. 2008;26:303–304. doi: 10.1038/nbt0308-303. [DOI] [PubMed] [Google Scholar]
Rothman AJ, Levina E, Zhu J. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association. 2009;104:177–186. [Google Scholar]
Shen D, Shen H, Zhu H, Marron J. Surprising asymptotic conical structure in critical sample eigen-directions. arXiv preprint arXiv:1303.6171. 2013 [Google Scholar]
Stock JH, Watson MW. Forecasting using principal components from a large number of predictors. Journal of the American statistical association. 2002;97:1167–1179. [Google Scholar]
Thomas CG, Harshman RA, Menon RS. Noise reduction in BOLD-based fMRI using component analysis. Neuroimage. 2002;17:1521–1537. doi: 10.1006/nimg.2002.1200. [DOI] [PubMed] [Google Scholar]
Vershynin R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027. 2010 [Google Scholar]
Vu VQ, Lei J. Minimax rates of estimation for sparse PCA in high dimensions. arXiv preprint arXiv:1202.0786. 2012 [Google Scholar]
Wang W, Fan J. Supplementary appendix to the paper “Asymptotics of empirical eigen-structure for high dimensional spiked covariance”. 2015 [Google Scholar]
Yamaguchi-Kabata Y, Nakazono K, Takahashi A, Saito S, Hosono N, Kubo M, Nakamura Y, Kamatani N. Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: effects on population-based association studies. The American Journal of Human Genetics. 2008;83:445–456. doi: 10.1016/j.ajhg.2008.08.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yata K, Aoshima M. Effective PCA for high-dimension, low-sample-size data with noise reduction via geometric representations. Journal of Multivariate Analysis. 2012;105:193–215. [Google Scholar]
Yata K, Aoshima M. PCA consistency for the power spiked model in high-dimensional settings. Journal of Multivariate Analysis. 2013;122:334–354. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supp1

NIHMS842544-supplement-supp1.pdf^{(349.6KB, pdf)}

[R1] Agarwal A, Negahban S, Wainwright MJ. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics. 2012;40:1171–1197. [Google Scholar]

[R2] Amini AA, Wainwright MJ. Information Theory 2008. ISIT 2008. IEEE International Symposium on. IEEE; 2008. High-dimensional analysis of semidefinite relaxations for sparse principal components; pp. 2454–2458. [Google Scholar]

[R3] Anderson TW. Asymptotic theory for principal component analysis. The Annals of Mathematical Statistics. 1963;34:122–148. [Google Scholar]

[R4] Antoniadis A, Fan J. Regularization of wavelet approximations. Journal of the American Statistical Association. 2001;96 [Google Scholar]

[R5] Bai Z. Methodologies in spectral analysis of large-dimensional random matrices, a review. Statist. Sinica. 1999;9:611–677. [Google Scholar]

[R6] Bai J. Inferential theory for factor models of large dimensions. Econometrica. 2003;71:135–171. [Google Scholar]

[R7] Bai J, Ng S. Determining the number of factors in approximate factor models. Econometrica. 2002;70:191–221. [Google Scholar]

[R8] Bai Z, Silverstein JW. Spectral analysis of large dimensional random matrices. 2nd. Springer; 2009. [Google Scholar]

[R9] Bai Z, Yao J. on sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis. 2012;106:167–177. [Google Scholar]

[R10] Bai Z, Yin Y. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability. 1993:1275–1294. [Google Scholar]

[R11] Baik J, Ben Arous G, Péché S. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability. 2005:1643–1697. [Google Scholar]

[R12] Benaych-Georges F, Nadakuditi RR. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics. 2011;227:494–521. [Google Scholar]

[R13] Berthet Q, Rigollet P. Optimal detection of sparse principal components in high dimension. The Annals of Statistics. 2013;41:1780–1815. [Google Scholar]

[R14] Bickel PJ, Levina E. Covariance regularization by thresholding. The Annals of Statistics. 2008:2577–2604. [Google Scholar]

[R15] Birnbaum A, Johnstone IM, Nadler B, Paul D. Minimax bounds for sparse PCA with noisy high-dimensional data. The Annals of Statistics. 2013;41:1055. doi: 10.1214/12-AOS1014. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] Cai T, Fan J, Jiang T. Distributions of angles in random packing on spheres. The Journal of Machine Learning Research. 2013;14:1837–1864. [PMC free article] [PubMed] [Google Scholar]

[R17] Cai T, Ma Z, Wu Y. Optimal estimation and rank detection for sparse spiked covariance matrices. Probability Theory and Related Fields. 2015;161:781–815. doi: 10.1007/s00440-014-0562-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] Candès EJ, Li X, Ma Y, Wright J. Robust principal component analysis? Journal of the ACM (JACM) 2011;58:11. [Google Scholar]

[R19] Chamberlain G, Rothschild M. Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica. 1983;51:1305–1324. [Google Scholar]

[R20] Chandrasekaran V, Sanghavi S, Parrilo PA, Willsky AS. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization. 2011;21:572–596. [Google Scholar]

[R21] Chen KH, Shimerda TA. An empirical analysis of useful financial ratios. Financial Management. 1981:51–60. [Google Scholar]

[R22] Davidson KR, Szarek SJ. Local operator theory, random matrices and Banach spaces. In: Johnson WB, Lindenstrauss J, editors. Handbook of the Geometry of Banach Spaces. Vol. 1. Elsevier Science BV; 2001. pp. 317–366. [Google Scholar]

[R23] Davis C, Kahan WM. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis. 1970;7:1–46. [Google Scholar]

[R24] De Mol C, Giannone D, Reichlin L. Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components? Journal of Econometrics. 2008;146:318–328. [Google Scholar]

[R25] Donoho DL, Gavish M, Johnstone IM. Optimal shrinkage of eigenvalues in the Spiked Covariance Model. arXiv preprint arXiv:1311.0851. 2014 doi: 10.1214/17-AOS1601. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Fan J, Fan Y, Lu J. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics. 2008;147:186–197. [Google Scholar]

[R27] Fan J, Han X, Gu W. Estimating false discovery proportion under arbitrary covariance dependence. Journal of the American Statistical Association. 2012;107:1019–1035. doi: 10.1080/01621459.2012.720478. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] Fan J, Han X. Estimation of false discovery proportion with unknown dependence. arXiv preprint arXiv:1305.7007. 2013 doi: 10.1111/rssb.12204. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B. 2013;75:1–44. doi: 10.1111/rssb.12016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] Fan J, Liao Y, Shi X. Risks of large portfolios. Journal of Econometrics. 2015;186:367–387. doi: 10.1016/j.jeconom.2015.02.015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] Fan J, Liao Y, Wang W. Projected Principal Component Analysis in Factor Models. The Annals of Statistics. 2016;44:219–254. doi: 10.1214/15-AOS1364. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] Fan J, Xue L, Yao J. Sufficient Forecasting Using Factor Models. arXiv preprint arXiv:1505.07414. 2015 [Google Scholar]

[R33] Fan J, Liu H, Wang W, Zhu Z. Heterogeneity Adjustment with Applications to Graphical Model Inference. arXiv preprint arXiv:1602.05455. 2016 doi: 10.1214/18-EJS1466. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] Hall P, Marron J, Neeman A. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005;67:427–444. [Google Scholar]

[R35] Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics. 2001:295–327. [Google Scholar]

[R36] Johnstone IM, Lu AY. On Consistency and Sparsity for Principal Components Analysis in High Dimensions. Journal of the American Statistical Association. 2009;104:682–693. doi: 10.1198/jasa.2009.0121. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] Jung S, Marron J. PCA consistency in high dimension, low sample size context. The Annals of Statistics. 2009;37:4104–4130. [Google Scholar]

[R38] Koltchinskii V, Lounici K. Concentration Inequalities and Moment Bounds for Sample Covariance Operators. arXiv preprint arXiv:1405.2468. 2014a [Google Scholar]

[R39] Koltchinskii V, Lounici K. Asymptotics and Concentration Bounds for Bilinear Forms of Spectral Projectors of Sample Covariance. arXiv preprint arXiv:1408.4643. 2014b [Google Scholar]

[R40] Landgrebe J, Wurst W, Welzl G. Permutation-validated principal components analysis of microarray data. Genome Biol. 2002;3:1–11. doi: 10.1186/gb-2002-3-4-research0019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R41] Lee S, Zou F, Wright FA. Convergence and prediction of principal component scores in high-dimensional settings. The Annals of Statistics. 2010;38:3605. doi: 10.1214/10-AOS821. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R42] Leek JT, Storey JD. A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences. 2008;105:18718–18723. doi: 10.1073/pnas.0808709105. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R43] Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics. 2010;11:733–739. doi: 10.1038/nrg2825. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R44] Ma Z. Sparse principal component analysis and iterative thresholding. The Annals of Statistics. 2013;41:772–801. [Google Scholar]

[R45] Onatski A. Asymptotics of the principal components estimator of large factor models with weakly inuential factors. Journal of Econometrics. 2012;168:244–258. [Google Scholar]

[R46] Paul D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica. 2007;17:1617–1642. [Google Scholar]

[R47] Pesaran MH, Zaffaroni P. Optimal asset allocation with factor models for large portfolios. 2008 [Google Scholar]

[R48] Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38:904–909. doi: 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]

[R49] Ringnér M. What is principal component analysis? Nature Biotechnology. 2008;26:303–304. doi: 10.1038/nbt0308-303. [DOI] [PubMed] [Google Scholar]

[R50] Rothman AJ, Levina E, Zhu J. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association. 2009;104:177–186. [Google Scholar]

[R51] Shen D, Shen H, Zhu H, Marron J. Surprising asymptotic conical structure in critical sample eigen-directions. arXiv preprint arXiv:1303.6171. 2013 [Google Scholar]

[R52] Stock JH, Watson MW. Forecasting using principal components from a large number of predictors. Journal of the American statistical association. 2002;97:1167–1179. [Google Scholar]

[R53] Thomas CG, Harshman RA, Menon RS. Noise reduction in BOLD-based fMRI using component analysis. Neuroimage. 2002;17:1521–1537. doi: 10.1006/nimg.2002.1200. [DOI] [PubMed] [Google Scholar]

[R54] Vershynin R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027. 2010 [Google Scholar]

[R55] Vu VQ, Lei J. Minimax rates of estimation for sparse PCA in high dimensions. arXiv preprint arXiv:1202.0786. 2012 [Google Scholar]

[R56] Wang W, Fan J. Supplementary appendix to the paper “Asymptotics of empirical eigen-structure for high dimensional spiked covariance”. 2015 [Google Scholar]

[R57] Yamaguchi-Kabata Y, Nakazono K, Takahashi A, Saito S, Hosono N, Kubo M, Nakamura Y, Kamatani N. Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: effects on population-based association studies. The American Journal of Human Genetics. 2008;83:445–456. doi: 10.1016/j.ajhg.2008.08.019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R58] Yata K, Aoshima M. Effective PCA for high-dimension, low-sample-size data with noise reduction via geometric representations. Journal of Multivariate Analysis. 2012;105:193–215. [Google Scholar]

[R59] Yata K, Aoshima M. PCA consistency for the power spiked model in high-dimensional settings. Journal of Multivariate Analysis. 2013;122:334–354. [Google Scholar]

PERMALINK

Asymptotics of empirical eigenstructure for high dimensional spiked covariance

Weichen Wang

Jianqing Fan

Abstract

1. Introduction

2. Assumptions and a simple fact

Assumption 2.1

Assumption 2.2

Assumption 2.3

3. Asymptotic behavior of empirical eigen-structure

3.1. Asymptotic normality of empirical eigenvalues

Theorem 3.1

3.2. Behavior of empirical eigenvectors

Theorem 3.2

4. Applications to factor models

4.1. Approximate factor models

4.2. Shrinkage POET under relative spectral norm

Assumption 4.1

Assumption 4.2

Theorem 4.1

4.3. Portfolio risk management

Theorem 4.2

4.4. Estimation of false discovery proportion

Assumption 4.3

Theorem 4.3

5. Simulations

5.1. Eigen-structure

Fig 1.

Fig 2.

Table 1.

Fig 3.

Fig 4.

5.2. Performance of S-POET

Fig 5.

5.3. FDP estimation

Fig 6.

6. Conclusions

Supplementary Material

Acknowledgments

APPENDIX A

PROOFS FOR SECTION 3

A.1. Proof of Theorem 3.1

Lemma A.1

Lemma A.2

Lemma A.3

Proof of Theorem 3.1

A.2. Proofs of Theorem 3.2

Proof of Theorem 3.2

Lemma A.4

Lemma A.5

Lemma A.6

Footnotes

REFERENCES

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases