Author manuscript; available in PMC 2014 Mar 1.
Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2013;75(4). doi: 10.1111/rssb.12016

Large Covariance Estimation by Thresholding Principal Orthogonal Complements

Jianqing Fan, Yuan Liao, Martina Mincheva

Abstract

This paper deals with the estimation of a high-dimensional covariance matrix with a conditional sparsity structure and fast-diverging eigenvalues. By assuming a sparse error covariance matrix in an approximate factor model, we allow for the presence of some cross-sectional correlation even after taking out common but unobservable factors. We introduce the Principal Orthogonal complEment Thresholding (POET) method to explore such an approximate factor structure with sparsity. The POET estimator includes the sample covariance matrix, the factor-based covariance matrix (Fan, Fan and Lv, 2008), the thresholding estimator (Bickel and Levina, 2008) and the adaptive thresholding estimator (Cai and Liu, 2011) as specific examples. We provide mathematical insights into when factor analysis is approximately the same as principal component analysis for high-dimensional data. The rates of convergence of the sparse residual covariance matrix and the conditional sparse covariance matrix are studied under various norms. It is shown that the impact of estimating the unknown factors vanishes as the dimensionality increases. The uniform rates of convergence for the unobserved factors and their factor loadings are derived. The asymptotic results are also verified by extensive simulation studies. Finally, a real data application on portfolio allocation is presented.

Keywords: High-dimensionality, approximate factor model, unknown factors, principal components, sparse matrix, low-rank matrix, thresholding, cross-sectional correlation, diverging eigenvalues

1 Introduction

Information and technology make large data sets widely available for scientific discovery. Much statistical analysis of such high-dimensional data involves the estimation of a covariance matrix or its inverse (the precision matrix). Examples include portfolio management and risk assessment (Fan, Fan and Lv, 2008), high-dimensional classification such as the Fisher discriminant (Hastie, Tibshirani and Friedman, 2009), graphical models (Meinshausen and Bühlmann, 2006), statistical inference such as controlling false discoveries in multiple testing (Leek and Storey, 2008; Efron, 2010), finding quantitative trait loci based on longitudinal data (Yap, Fan, and Wu, 2009; Xiong et al. 2011), and testing the capital asset pricing model (Sentana, 2009), among others. See Section 5 for some of those applications. Yet, the dimensionality is often either comparable to the sample size or even larger. In such cases, the sample covariance is known to have poor performance (Johnstone, 2001), and some regularization is needed.

Realizing the importance of estimating large covariance matrices and the challenges brought by the high dimensionality, in recent years researchers have proposed various regularization techniques to consistently estimate Σ. One of the key assumptions is that the covariance matrix is sparse, namely, many entries are zero or nearly so (Bickel and Levina, 2008; Rothman et al., 2009; Lam and Fan, 2009; Cai and Zhou, 2010; Cai and Liu, 2011). In many applications, however, the sparsity assumption directly on Σ is not appropriate. For example, financial returns depend on the equity market risks, housing prices depend on the economic health, gene expressions can be stimulated by cytokines, among others. Due to the presence of common factors, it is unrealistic to assume that many outcomes are uncorrelated. An alternative method is to assume a factor model structure, as in Fan, Fan and Lv (2008). However, they restrict themselves to strict factor models with known factors.

A natural extension is conditional sparsity: given the common factors, the outcomes are only weakly correlated. To model this, we consider an approximate factor model, which has been frequently used in economic and financial studies (Chamberlain and Rothschild, 1983; Fama and French, 1993; Bai and Ng, 2002, etc.):

y_{it} = b_i' f_t + u_{it}.   (1.1)

Here yit is the observed response for the ith (i = 1, …, p) individual at time t = 1, …, T; bi is a vector of factor loadings; ft is a K × 1 vector of common factors, and uit is the error term, usually called idiosyncratic component, uncorrelated with ft. Both p and T diverge to infinity, while K is assumed fixed throughout the paper, and p is possibly much larger than T.

We emphasize that in model (1.1), only yit is observable. It is intuitively clear that the unknown common factors can only be inferred reliably when there are sufficiently many cases, that is, p → ∞. In a data-rich environment, p can diverge at a rate faster than T. The factor model (1.1) can be put in a matrix form as

y_t = B f_t + u_t,   (1.2)

where yt = (y1t, …, ypt)′, B = (b1, …, bp)′ and ut = (u1t, …, upt)′. We are interested in Σ, the p × p covariance matrix of yt, and its inverse, which are assumed to be time-invariant. Under model (1.1), Σ is given by

\Sigma = B\,\mathrm{cov}(f_t)\,B' + \Sigma_u,   (1.3)

where Σ_u = (σ_{u,ij})_{p×p} is the covariance matrix of u_t. The literature on approximate factor models typically assumes that the first K eigenvalues of B cov(f_t)B′ diverge at rate O(p), whereas all the eigenvalues of Σ_u are bounded as p → ∞. This assumption holds easily when the factors are pervasive in the sense that a non-negligible fraction of the factor loadings are non-vanishing. The decomposition (1.3) is then asymptotically identified as p → ∞. In addition, in this paper we assume that Σ_u is approximately sparse, as in Bickel and Levina (2008) and Rothman et al. (2009): for some q ∈ [0, 1),

m_p = \max_{i \le p} \sum_{j \le p} |\sigma_{u,ij}|^q

does not grow too fast as p → ∞. In particular, this includes the exact sparsity assumption (q = 0), under which m_p = \max_{i \le p} \sum_{j \le p} I(\sigma_{u,ij} \ne 0), the maximum number of nonzero elements in each row.

The conditional sparsity structure of (1.2) was explored by Fan, Liao and Mincheva (2011) in estimating the covariance matrix, when the factors {f_t} are observable. This allows them to use regression analysis to estimate \{u_t\}_{t=1}^T. This paper deals with the situation in which the factors are unobservable and have to be inferred. Our approach is simple, optimization-free, and it uses the data only through the sample covariance matrix. Run the singular value decomposition on the sample covariance matrix Σ̂_sam of y_t, keep the covariance matrix formed by the first K principal components, and apply the thresholding procedure to the remaining covariance matrix. This results in a Principal Orthogonal complEment Thresholding (POET) estimator. When the number of common factors K is unknown, it can be estimated from the data. See Section 2 for additional details. We will investigate various properties of POET under the assumption that the data are serially dependent, which includes independent observations as a specific example. The rates of convergence under various norms for both the estimated Σ and Σ_u and their precision (inverse) matrices will be derived. We show that the effect of estimating the unknown factors on the rate of convergence vanishes when p log p ≫ T, and in particular, the rate of convergence for Σ_u achieves the optimal rate in Cai and Zhou (2012).

This paper focuses on the high-dimensional static factor model (1.2), which is innately related to the principal component analysis (PCA), as clarified in Section 2. This feature makes it different from the classical factor model with fixed dimensionality (e.g., Lawley and Maxwell 1971). In the last ten years, much theory on the estimation and inference of the static factor model has been developed, for example, Stock and Watson (1998, 2002), Bai and Ng (2002), Bai (2003), Doz, Giannone and Reichlin (2011), among others. Our contribution is on the estimation of covariance matrices and their inverse in large factor models.

The static model considered in this paper is to be distinguished from the dynamic factor model as in Forni, Hallin, Lippi and Reichlin (2000); the latter allows yt to also depend on ft with lags in time. Their approach is based on the eigenvalues and principal components of spectral density matrices, and on the frequency domain analysis. Moreover, as shown in Forni and Lippi (2001), the dynamic factor model does not really impose a restriction on the data generating process, and the assumption of idiosyncrasy (in their terminology, a p-dimensional process is idiosyncratic if all the eigenvalues of its spectral density matrix remain bounded as p → ∞) asymptotically identifies the decomposition of yit into the common component and idiosyncratic error. The literature includes, for example, Forni et al. (2000, 2004), Forni and Lippi (2001), Hallin and Liška (2007, 2011), and many other references therein. Above all, both the static and dynamic factor models are receiving increasing attention in applications of many fields where information usually is scattered through a (very) large number of interrelated time series.

An extensive literature in recent years deals with sparse principal components, which have been widely used to enhance the convergence of principal components in high-dimensional spaces. d'Aspremont, Bach and El Ghaoui (2008), Shen and Huang (2008), Witten, Tibshirani and Hastie (2009) and Ma (2011) proposed and studied various algorithms for computation. More literature on sparse PCA is found in Johnstone and Lu (2009), Amini and Wainwright (2009), Zhang and El Ghaoui (2011), Birnbaum et al. (2012), among others. In addition, there has been a growing literature that theoretically studies recovery in the low-rank plus sparse matrix estimation problem; see, for example, Wright et al. (2009), Lin et al. (2009), Candès et al. (2011), Luo (2011), Agarwal, Negahban and Wainwright (2012), Pati et al. (2012). This line of work corresponds to the identifiability issue of our problem.

There is a big difference between our model and those considered in the aforementioned literature. In the current paper, the first K eigenvalues of Σ are spiked and grow at a rate O(p), whereas the eigenvalues of the matrices studied in the existing literature on covariance estimation are usually assumed to be either bounded or slowly growing. Due to this distinctive feature, the common components and the idiosyncratic components can be identified, and in addition, PCA on the sample covariance matrix can consistently estimate the space spanned by the eigenvectors of Σ. The existing methods of either thresholding directly or solving a constrained optimization problem can fail in the presence of very spiked principal eigenvalues. However, there is a price to pay here: as the first K eigenvalues are "too spiked", one can hardly obtain a satisfactory rate of convergence for estimating Σ in absolute terms, but it can be estimated accurately in relative terms (see Section 3.3 for details). In addition, Σ^{-1} can be estimated accurately.

We would like to further note that the low-rank plus sparse representation of our model is on the population covariance matrix, whereas Candès et al. (2011), Wright et al. (2009) and Lin et al. (2009) considered such a representation on the data matrix. As there is no Σ to estimate, their goal is limited to producing a low-rank plus sparse matrix decomposition of the data matrix, which corresponds to the identifiability issue of our study, and does not involve estimation and inference. In contrast, our ultimate goal is to estimate the population covariance matrices as well as the precision matrices. For this purpose, we require the idiosyncratic components and common factors to be uncorrelated and the data generating process to be strictly stationary. The covariances considered in this paper are constant over time, though slowly time-varying covariance matrices are applicable through localization in time (time-domain smoothing). Our consistency result on Σ_u demonstrates that the decomposition (1.3) is identifiable, and hence our results also shed light on the "surprising phenomenon" of Candès et al. (2011) that one can fully separate a sparse matrix from a low-rank matrix when only the sum of these two components is available.

The rest of the paper is organized as follows. Section 2 gives our estimation procedures and builds the relationship between principal components analysis and factor analysis in high-dimensional space. Section 3 provides the asymptotic theory for various estimated quantities. Section 4 illustrates how to choose the thresholds using cross-validation and how to guarantee positive definiteness in any finite sample. Specific applications of regularized covariance matrices are given in Section 5. Numerical results are reported in Section 6. Finally, Section 7 presents a real data application on portfolio allocation. All proofs are given in the appendix. Throughout the paper, we use λ_min(A) and λ_max(A) to denote the minimum and maximum eigenvalues of a matrix A. We also denote by ||A||_F, ||A||, ||A||_1 and ||A||_max the Frobenius norm, spectral norm (also called the operator norm), L_1-norm, and elementwise norm of a matrix A, defined respectively by ||A||_F = tr^{1/2}(A′A), ||A|| = λ_max^{1/2}(A′A), ||A||_1 = max_j Σ_i |a_{ij}| and ||A||_max = max_{i,j} |a_{ij}|. Note that when A is a vector, both ||A||_F and ||A|| are equal to the Euclidean norm. Finally, for two sequences, we write a_T ≫ b_T if b_T = o(a_T), and a_T ≍ b_T if a_T = O(b_T) and b_T = O(a_T).

2 Regularized Covariance Matrix via PCA

There are three main objectives of this paper: (i) understand the relationship between principal component analysis (PCA) and the high-dimensional factor analysis; (ii) estimate both covariance matrices Σ and the idiosyncratic Σu and their precision matrices in the presence of common factors, and (iii) investigate the impact of estimating the unknown factors on the covariance estimation. The propositions in Section 2.1 below show that the space spanned by the principal components in the population level Σ is close to the space spanned by the columns of the factor loading matrix B.

2.1 High-dimensional PCA and factor model

Consider a factor model

y_{it} = b_i' f_t + u_{it}, \quad i \le p, \ t \le T,

where the number of common factors, K = dim(f_t), is small compared to p and T, and is thus assumed to be fixed throughout the paper. In the model, the only observable variable is the data y_{it}. One of the distinguishing features of the factor model is that the principal eigenvalues of Σ are no longer bounded, but grow quickly with the dimensionality. We illustrate this in the following example.

Example 2.1

Consider a single-factor model y_{it} = b_i f_t + u_{it} where b_i ∈ ℝ. Suppose that the factor is pervasive in the sense that it has non-negligible impact on a non-vanishing proportion of outcomes. It is then reasonable to assume \sum_{i=1}^p b_i^2 > cp for some c > 0. Therefore, assuming that λ_max(Σ_u) = o(p), an application of (1.3) yields

\lambda_{\max}(\Sigma) \ge \mathrm{var}(f_t)\sum_{i=1}^p b_i^2 - \lambda_{\max}(\Sigma_u) > \frac{c}{2}\,\mathrm{var}(f_t)\,p

for all large p, assuming var(ft) > 0.

We now elucidate why PCA can be used for factor analysis in the presence of spiked eigenvalues. Write B = (b_1, …, b_p)′ as the p × K loading matrix. Note that the linear space spanned by the first K principal components of B cov(f_t)B′ is the same as that spanned by the columns of B when cov(f_t) is non-degenerate. Thus, we can assume without loss of generality that the columns of B are orthogonal and cov(f_t) = I_K, the identity matrix. This canonical form corresponds to the identifiability condition in decomposition (1.3). Let \tilde{b}_1, …, \tilde{b}_K be the columns of B, ordered such that \{\|\tilde{b}_j\|\}_{j=1}^K is non-increasing. Then \{\tilde{b}_j/\|\tilde{b}_j\|\}_{j=1}^K are eigenvectors of the matrix BB′ with eigenvalues \{\|\tilde{b}_j\|^2\}_{j=1}^K, and the remaining eigenvalues are zero. We will impose the pervasiveness assumption that all eigenvalues of the K × K matrix p^{-1}B′B are bounded away from zero, which holds if the factor loadings \{b_i\}_{i=1}^p are independent realizations from a non-degenerate population. Since the non-vanishing eigenvalues of the matrix BB′ are the same as those of B′B, the pervasiveness assumption implies that \{\|\tilde{b}_j\|^2\}_{j=1}^K all grow at rate O(p).

Let {λj}j=1p be the eigenvalues of Σ in a descending order and {ξj}j=1p be their corresponding eigenvectors. Then, an application of Weyl’s eigenvalue theorem (see the appendix) yields that

Proposition 2.1

Assume that the eigenvalues of p−1B′B are bounded away from zero for all large p. For the factor model (1.3) with the canonical condition

\mathrm{cov}(f_t) = I_K \quad \text{and} \quad B'B \text{ is diagonal},   (2.1)

we have

|\lambda_j - \|\tilde{b}_j\|^2| \le \|\Sigma_u\| \ \text{ for } j \le K, \qquad \lambda_j \le \|\Sigma_u\| \ \text{ for } j > K.

In addition, for j ≤ K, \liminf_{p\to\infty} \|\tilde{b}_j\|^2/p > 0.

Using Proposition 2.1 and the sin θ theorem of Davis and Kahan (1970; see the appendix), we have the following:

Proposition 2.2

Under the assumptions of Proposition 2.1, if \{\|\tilde{b}_j\|\}_{j=1}^K are distinct, then

\|\xi_j - \tilde{b}_j/\|\tilde{b}_j\|\| = O(p^{-1}\|\Sigma_u\|), \quad \text{for } j \le K.

Propositions 2.1 and 2.2 state that PCA and factor analysis are approximately the same if ||Σu|| = o(p). This is assured through a sparsity condition on Σu = (σu,ij)p×p, which is frequently measured through

m_p = \max_{i \le p} \sum_{j \le p} |\sigma_{u,ij}|^q, \quad \text{for some } q \in [0, 1).   (2.2)

The intuition is that, after taking out the common factors, many pairs of the cross-sectional units become weakly correlated. This generalized notion of sparsity was used in Bickel and Levina (2008) and Cai and Liu (2011). Under this generalized measure of sparsity, we have

\|\Sigma_u\| \le \|\Sigma_u\|_1 \le \max_{i \le p} \sum_{j=1}^p |\sigma_{u,ij}|^q (\sigma_{u,ii}\sigma_{u,jj})^{(1-q)/2} = O(m_p),

if the noise variances \{\sigma_{u,ii}\} are bounded. Therefore, when m_p = o(p), Proposition 2.1 implies that the eigenvalues of the principal components \{\lambda_j\}_{j=1}^K are well separated from the remaining eigenvalues \{\lambda_j\}_{j=K+1}^p, and Proposition 2.2 ensures that the first K principal components are approximately the same as the columns of the factor loadings.

The aforementioned sparsity assumption appears reasonable in empirical applications. Boivin and Ng (2006) conducted an empirical study and showed that imposing zero correlation between weakly correlated idiosyncratic components improves forecasts. More recently, Phan (2012) empirically estimated the level of sparsity of the idiosyncratic covariance using UK market data.

Recent developments in random matrix theory, for example Johnstone and Lu (2009) and Paul (2007), have shown that when p/T is not negligible, the eigenvalues and eigenvectors of Σ might not be consistently estimated from the sample covariance matrix. A distinguishing feature of the covariance considered in this paper is that there are some very spiked eigenvalues. By Propositions 2.1 and 2.2, in the factor model, the pervasiveness condition

\lambda_{\min}(p^{-1}B'B) > c > 0   (2.3)

implies that the first K eigenvalues grow at rate O(p). Moreover, when p is large, the principal components \{\xi_j\}_{j=1}^K are close to the normalized vectors \{\tilde{b}_j/\|\tilde{b}_j\|\}_{j=1}^K when m_p = o(p). This provides the mathematics for using the first K principal components as a proxy for the space spanned by the columns of the factor loading matrix B. In addition, due to (2.3), the signals of the first K eigenvalues are stronger than those of the spiked covariance model considered by Jung and Marron (2009) and Birnbaum et al. (2012). Therefore, our conditions for the consistency of principal components at the population level are much weaker than those in the spiked covariance literature. On the other hand, this also shows that, under our setting, PCA is a valid approximation to factor analysis only if p → ∞. The fact that PCA on the sample covariance is inconsistent when p is bounded was also previously demonstrated in the literature (see, e.g., Bai (2003)).

With assumption (2.3), the standard literature on approximate factor models has shown that the PCA on the sample covariance matrix Σ̂sam can consistently estimate the space spanned by the factor loadings (e.g., Stock and Watson (1998), Bai (2003)). Our contribution in Propositions 2.1 and 2.2 is that we connect the high-dimensional factor model to the principal components, and obtain the consistency of the spectrum in the population level Σ instead of the sample level Σ̂sam. The spectral consistency also enhances the results in Chamberlain and Rothschild (1983). This provides the rationale behind the consistency results in the factor model literature.

2.2 POET

A sparsity assumption directly on Σ is inappropriate in many applications due to the presence of common factors. Instead, we propose a nonparametric estimator of Σ based on principal component analysis. Let λ̂_1 ≥ λ̂_2 ≥ · · · ≥ λ̂_p be the ordered eigenvalues of the sample covariance matrix Σ̂_sam and \{\hat{\xi}_i\}_{i=1}^p be their corresponding eigenvectors. Then the sample covariance has the following spectral decomposition:

\hat{\Sigma}_{sam} = \sum_{i=1}^K \hat{\lambda}_i \hat{\xi}_i \hat{\xi}_i' + \hat{R}_K,   (2.4)

where \hat{R}_K = \sum_{i=K+1}^p \hat{\lambda}_i \hat{\xi}_i \hat{\xi}_i' = (\hat{r}_{ij})_{p\times p} is the principal orthogonal complement, and K is the number of diverging eigenvalues of Σ. Let us first assume K is known.

Now we apply thresholding on \hat{R}_K. Define

\hat{R}_K^{\mathcal{T}} = (\hat{r}_{ij}^{\mathcal{T}})_{p \times p}, \qquad \hat{r}_{ij}^{\mathcal{T}} = \begin{cases} \hat{r}_{ii}, & i = j; \\ s_{ij}(\hat{r}_{ij})\, I(|\hat{r}_{ij}| \ge \tau_{ij}), & i \ne j, \end{cases}   (2.5)

where sij(·) is a generalized shrinkage function of Antoniadis and Fan (2001), employed by Rothman et al. (2009) and Cai and Liu (2011), and τij > 0 is an entry-dependent threshold. In particular, the hard-thresholding rule sij(x) = xI(|x| ≥ τij) (Bickel and Levina, 2008) and the constant thresholding parameter τij = δ are allowed. In practice, it is more desirable to have τij be entry-adaptive. An example of the adaptive thresholding is

\tau_{ij} = \tau(\hat{r}_{ii}\hat{r}_{jj})^{1/2}, \quad \text{for a given } \tau > 0,   (2.6)

where \hat{r}_{ii} is the ith diagonal element of \hat{R}_K. This corresponds to applying thresholding with parameter τ to the correlation matrix of \hat{R}_K.

The estimator of Σ is then defined as:

\hat{\Sigma}_K = \sum_{i=1}^K \hat{\lambda}_i \hat{\xi}_i \hat{\xi}_i' + \hat{R}_K^{\mathcal{T}}.   (2.7)

We will call this estimator the Principal Orthogonal complEment Thresholding (POET) estimator. It is obtained by thresholding the remaining components of the sample covariance matrix, after taking out the first K principal components. One attractive feature of POET is that it is optimization-free, and hence computationally appealing.
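To make the construction concrete, the following is a minimal numpy sketch of (2.4)–(2.7), using the correlation-based threshold (2.6) and soft thresholding as the shrinkage rule s_ij. It is an illustration under those choices, not the authors' code.

```python
import numpy as np

def poet(Y, K, tau=0.5):
    """Minimal sketch of the POET estimator (2.4)-(2.7); an illustration, not the authors' code.

    Y   : p x T data matrix whose columns are the (centered) observations y_t.
    K   : number of diverging eigenvalues (factors) kept in the low-rank part.
    tau : constant of the correlation-based threshold (2.6), tau_ij = tau * sqrt(r_ii * r_jj).
    """
    p, T = Y.shape
    S = Y @ Y.T / T                                   # sample covariance of the centered data
    lam, xi = np.linalg.eigh(S)                       # eigenvalues in ascending order
    lam, xi = lam[::-1], xi[:, ::-1]                  # reorder to descending

    low_rank = (xi[:, :K] * lam[:K]) @ xi[:, :K].T    # first K principal components
    R = S - low_rank                                  # principal orthogonal complement R_hat_K

    tau_ij = tau * np.sqrt(np.outer(np.diag(R), np.diag(R)))   # entry-adaptive threshold (2.6)
    R_thr = np.sign(R) * np.maximum(np.abs(R) - tau_ij, 0.0)   # soft thresholding as s_ij
    np.fill_diagonal(R_thr, np.diag(R))               # the diagonal is not thresholded

    return low_rank + R_thr                           # the POET estimator in (2.7)
```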

With the choice of τij in (2.6) and the hard thresholding rule, our estimator encompasses many popular estimators as its specific cases. When τ = 0, the estimator is the sample covariance matrix and when τ = 1, the estimator becomes that based on the strict factor model (Fan, Fan, and Lv, 2008). When K = 0, our estimator is the same as the thresholding estimator of Bickel and Levina (2008) and (with a more general thresholding function) Rothman et al. (2009) or the adaptive thresholding estimator of Cai and Liu (2011) with a proper choice of τij.

In practice, the number of diverging eigenvalues (or common factors) can be estimated based on the sample covariance matrix. Determining K in a data-driven way is an important topic, and is well understood in the literature. We will describe the POET with a data-driven K in Section 2.4.

2.3 Least squares point of view

The POET estimator (2.7) has an equivalent representation using a constrained least squares method. The least squares method seeks \hat{\Lambda}_K = (\hat{b}_1^K, \dots, \hat{b}_p^K)' and \hat{F}_K = (\hat{f}_1^K, \dots, \hat{f}_T^K)' such that

(\hat{\Lambda}_K, \hat{F}_K) = \arg\min_{b_i \in \mathbb{R}^K,\, f_t \in \mathbb{R}^K} \sum_{i=1}^p \sum_{t=1}^T (y_{it} - b_i' f_t)^2,   (2.8)

subject to the normalization

\frac{1}{T}\sum_{t=1}^T f_t f_t' = I_K, \quad \text{and} \quad \frac{1}{p}\sum_{i=1}^p b_i b_i' \text{ is diagonal}.   (2.9)

The constraints (2.9) correspond to the normalization (2.1). Here we assume that the mean of each variable {yit}t=1T has been removed, that is, Eyit = Efjt = 0 for all ip, jK and tT. Putting it in a matrix form, the optimization problem can be written as

\arg\min_{B,F} \|Y - BF'\|_F^2 \quad \text{subject to} \quad T^{-1}F'F = I_K, \ \ B'B \text{ is diagonal},   (2.10)

where Y = (y_1, …, y_T) and F′ = (f_1, · · ·, f_T). For each given F, the least-squares estimator of B is \hat{\Lambda} = T^{-1}YF, using the constraint (2.9) on the factors. Substituting this into (2.10), the objective function becomes \|Y - T^{-1}YFF'\|_F^2 = \mathrm{tr}[(I_T - T^{-1}FF')Y'Y]. The minimizer is now clear: the columns of \hat{F}_K/\sqrt{T} are the eigenvectors corresponding to the K largest eigenvalues of the T × T matrix Y′Y, and \hat{\Lambda}_K = T^{-1}Y\hat{F}_K (see, e.g., Stock and Watson (2002)).
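The following is a short sketch of this least squares / eigen-decomposition solution (an illustration of the formulas above, not the authors' code):

```python
import numpy as np

def pca_factors(Y, K):
    """Sketch of the constrained least squares solution of (2.10); not the authors' code.

    Y : p x T centered data matrix.  Returns (Lambda_hat, F_hat), where F_hat / sqrt(T)
    collects the eigenvectors of the T x T matrix Y'Y for its K largest eigenvalues,
    so that F_hat'F_hat / T = I_K, and Lambda_hat = Y F_hat / T.
    """
    p, T = Y.shape
    evals, evecs = np.linalg.eigh(Y.T @ Y)        # ascending eigenvalues of Y'Y
    top = evecs[:, ::-1][:, :K]                   # eigenvectors of the K largest eigenvalues
    F_hat = np.sqrt(T) * top                      # T x K estimated factors
    Lambda_hat = Y @ F_hat / T                    # p x K estimated loadings
    return Lambda_hat, F_hat
```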

We will show that under some mild regularity conditions, as p and T → ∞, (\hat{b}_i^K)'\hat{f}_t^K consistently estimates the true b_i'f_t uniformly over i ≤ p and t ≤ T. Since Σ_u is assumed to be sparse, we can construct an estimator of Σ_u using the adaptive thresholding method of Cai and Liu (2011) as follows. Let \hat{u}_{it} = y_{it} - (\hat{b}_i^K)'\hat{f}_t^K, \hat{\sigma}_{ij} = \frac{1}{T}\sum_{t=1}^T \hat{u}_{it}\hat{u}_{jt}, and \hat{\theta}_{ij} = \frac{1}{T}\sum_{t=1}^T (\hat{u}_{it}\hat{u}_{jt} - \hat{\sigma}_{ij})^2. For some pre-determined decreasing sequence ω_T > 0 and large enough C > 0, define the adaptive threshold parameter as \tau_{ij} = C\sqrt{\hat{\theta}_{ij}}\,\omega_T. The estimated idiosyncratic covariance estimator is then given by

\hat{\Sigma}_{u,K}^{\mathcal{T}} = (\hat{\sigma}_{ij}^{\mathcal{T}})_{p\times p}, \qquad \hat{\sigma}_{ij}^{\mathcal{T}} = \begin{cases} \hat{\sigma}_{ii}, & i = j, \\ s_{ij}(\hat{\sigma}_{ij}), & i \ne j, \end{cases}   (2.11)

where for all z ∈ ℝ (see Antoniadis and Fan, 2001),

s_{ij}(z) = 0 \ \text{ when } |z| \le \tau_{ij}, \qquad |s_{ij}(z) - z| \le \tau_{ij}.

It is easy to verify that sij(·) includes many interesting thresholding functions such as the hard thresholding (sij(z) = zI(|z|≥τij)), soft thresholding (sij (z) = sign(z)(|z| − τij)+), SCAD, and adaptive lasso (See Rothman et al. (2009)).
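For illustration, here is a small sketch of such shrinkage rules (hard, soft and SCAD); these are standard implementations shown for reference, not the authors' code.

```python
import numpy as np

def hard_threshold(z, tau):
    """Hard thresholding: s(z) = z * I(|z| >= tau)."""
    return z * (np.abs(z) >= tau)

def soft_threshold(z, tau):
    """Soft thresholding: s(z) = sign(z) * (|z| - tau)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def scad_threshold(z, tau, a=3.7):
    """SCAD thresholding rule with the conventional choice a = 3.7."""
    absz = np.abs(z)
    return np.where(absz <= 2 * tau,
                    soft_threshold(z, tau),
                    np.where(absz <= a * tau,
                             ((a - 1) * z - np.sign(z) * a * tau) / (a - 2),
                             z))
```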

Analogous to the decomposition (1.3), we obtain the following substitution estimators

\tilde{\Sigma}_K = \hat{\Lambda}_K\hat{\Lambda}_K' + \hat{\Sigma}_{u,K}^{\mathcal{T}},   (2.12)

and by the Sherman-Morrison-Woodbury formula, noting that \frac{1}{T}\sum_{t=1}^T \hat{f}_t^K (\hat{f}_t^K)' = I_K,

\tilde{\Sigma}_K^{-1} = (\hat{\Sigma}_{u,K}^{\mathcal{T}})^{-1} - (\hat{\Sigma}_{u,K}^{\mathcal{T}})^{-1}\hat{\Lambda}_K[I_K + \hat{\Lambda}_K'(\hat{\Sigma}_{u,K}^{\mathcal{T}})^{-1}\hat{\Lambda}_K]^{-1}\hat{\Lambda}_K'(\hat{\Sigma}_{u,K}^{\mathcal{T}})^{-1}.   (2.13)
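As a sketch of how (2.13) can be computed in practice, the following helper (illustrative, not the authors' code) inverts only a K × K matrix:

```python
import numpy as np

def woodbury_precision(Lambda_hat, Sigma_u_inv):
    """Sketch of (2.13): the precision matrix via the Sherman-Morrison-Woodbury identity.

    Lambda_hat  : p x K estimated loading matrix.
    Sigma_u_inv : p x p inverse of the thresholded idiosyncratic covariance.
    Only a K x K system is inverted, which is cheap when K is small.
    """
    K = Lambda_hat.shape[1]
    SuL = Sigma_u_inv @ Lambda_hat                            # p x K
    middle = np.linalg.inv(np.eye(K) + Lambda_hat.T @ SuL)    # K x K
    return Sigma_u_inv - SuL @ middle @ SuL.T
```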

In practice, the true number of factors K might be unknown to us. However, for any determined K_1 ≤ p, we can always construct either (\hat{\Sigma}_{K_1}, \hat{R}_{K_1}^{\mathcal{T}}) as in (2.7) or (\tilde{\Sigma}_{K_1}, \hat{\Sigma}_{u,K_1}^{\mathcal{T}}) as in (2.12) to estimate (Σ, Σ_u). The following theorem shows that for each given K_1, the two estimators based on either regularized PCA or least squares substitution are equivalent. Similar results were obtained by Bai (2003) when K_1 = K and no thresholding was imposed.

Theorem 2.1

Suppose that the entry-dependent threshold in (2.5) is the same as the thresholding parameter used in (2.11). Then for any K_1 ≤ p, the estimator (2.7) is equivalent to the substitution estimator (2.12), that is,

\hat{\Sigma}_{K_1} = \tilde{\Sigma}_{K_1}, \quad \text{and} \quad \hat{\Sigma}_{u,K_1}^{\mathcal{T}} = \hat{R}_{K_1}^{\mathcal{T}}.

In this paper, we will use a data-driven \hat{K} to construct the POET estimator (see Section 2.4 below), which has two equivalent representations according to Theorem 2.1.

2.4 POET with Unknown K

Determining the number of factors in a data-driven way has been an important research topic in the econometric literature. Bai and Ng (2002) proposed a consistent estimator as both p and T diverge. Other recent criteria are proposed by Kapetanios (2010), Onatski (2010), Alessi et al. (2010), etc.

Our method also allows a data-driven \hat{K} to estimate the covariance matrices. In principle, any procedure that gives a consistent estimate of K can be adopted. In this paper we apply the well-known method of Bai and Ng (2002). It estimates K by

\hat{K} = \arg\min_{0 \le K_1 \le M} \log\Big\{\frac{1}{pT}\|Y - T^{-1}Y\hat{F}_{K_1}\hat{F}_{K_1}'\|_F^2\Big\} + K_1\, g(T,p),   (2.14)

where M is a prescribed upper bound, \hat{F}_{K_1} is a T × K_1 matrix whose columns are \sqrt{T} times the eigenvectors corresponding to the K_1 largest eigenvalues of the T × T matrix Y′Y, and g(T, p) is a penalty function of (p, T) such that g(T, p) = o(1) and min\{p, T\}\,g(T, p) → ∞. Two examples suggested by Bai and Ng (2002) are

IC_1: \ g(T,p) = \frac{p+T}{pT}\log\Big(\frac{pT}{p+T}\Big), \qquad IC_2: \ g(T,p) = \frac{p+T}{pT}\log\min\{p,T\}.

Throughout the paper, we let \hat{K} be the solution to (2.14) using either IC_1 or IC_2. The asymptotic results are not affected by the specific choice of g(T, p). We define the POET estimator with unknown K as

\hat{\Sigma}_{\hat{K}} = \sum_{i=1}^{\hat{K}} \hat{\lambda}_i \hat{\xi}_i \hat{\xi}_i' + \hat{R}_{\hat{K}}^{\mathcal{T}}.   (2.15)

The procedure is as stated in Section 2.2 except that K is now replaced by the data-driven \hat{K}.
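A minimal sketch of the criterion (2.14), assuming centered data and using the notation above (illustrative only, not the authors' code):

```python
import numpy as np

def estimate_K(Y, M=10, criterion="IC1"):
    """Sketch of the Bai and Ng (2002) estimator (2.14) of the number of factors.

    Y : p x T centered data matrix; M : prescribed upper bound.  Illustrative only.
    """
    p, T = Y.shape
    if criterion == "IC1":
        g = (p + T) / (p * T) * np.log(p * T / (p + T))
    else:  # IC2
        g = (p + T) / (p * T) * np.log(min(p, T))

    evals, evecs = np.linalg.eigh(Y.T @ Y)
    evecs = evecs[:, ::-1]                            # descending order of eigenvalues

    best_K, best_val = 0, np.inf
    for K1 in range(M + 1):
        F = np.sqrt(T) * evecs[:, :K1]                # columns are sqrt(T) * eigenvectors of Y'Y
        resid = Y - (Y @ F @ F.T) / T                 # Y - T^{-1} Y F F'
        val = np.log(np.sum(resid ** 2) / (p * T)) + K1 * g
        if val < best_val:
            best_K, best_val = K1, val
    return best_K
```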

3 Asymptotic Properties

3.1 Assumptions

This section presents the assumptions on the model (1.2), in which only {yt}t=1T are observable. Recall the identifiability condition (2.1).

The first assumption has been one of the most essential ones in the literature of approximate factor models. Under this assumption and other regularity conditions, the number of factors, loadings and common factors can be consistently estimated (e.g., Stock and Watson (1998, 2002), Bai and Ng (2002), Bai (2003), etc.).

Assumption 3.1

All the eigenvalues of the K × K matrix p−1B′B are bounded away from both zero and infinity as p → ∞.

Remark 3.1
  1. It follows from Proposition 2.1 in Section 2 that the first K eigenvalues of Σ grow at rate O(p). This unique feature distinguishes our work from most of the other low-rank plus sparse covariances considered in the literature, e.g., Luo (2011), Pati et al. (2012), Agarwal et al. (2012), Birnbaum et al. (2012).

  2. Assumption 3.1 requires the factors to be pervasive, that is, to impact a non-vanishing proportion of individual time series. See Example 2.1 for its meaning.

  3. As illustrated in Section 3.3 below, due to the fast diverging eigenvalues, one can hardly achieve a good rate of convergence for estimating Σ under either the spectral norm or the Frobenius norm when p > T. This phenomenon arises naturally from the characteristics of the high-dimensional factor model, which is another distinguishing feature compared to the convergence results in the existing literature.

Assumption 3.2
  1. \{u_t, f_t\}_{t\ge 1} is strictly stationary. In addition, E u_{it} = E u_{it}f_{jt} = 0 for all i ≤ p, j ≤ K and t ≤ T.

  2. There exist constants c_1, c_2 > 0 such that \lambda_{\min}(\Sigma_u) > c_1, \|\Sigma_u\|_1 < c_2, and \min_{i\le p, j\le p} \mathrm{var}(u_{it}u_{jt}) > c_1.

  3. There exist r_1, r_2 > 0 and b_1, b_2 > 0 such that for any s > 0, i ≤ p and j ≤ K,
     P(|u_{it}| > s) \le \exp(-(s/b_1)^{r_1}), \qquad P(|f_{jt}| > s) \le \exp(-(s/b_2)^{r_2}).

Condition (i) requires strict stationarity as well as non-correlation between \{u_t\} and \{f_t\}. These conditions are slightly stronger than those in the literature, e.g., Bai (2003), but are still standard and simplify our technicalities. Condition (ii) requires that Σ_u be well-conditioned. The condition \|\Sigma_u\|_1 \le c_2, instead of the weaker condition \lambda_{\max}(\Sigma_u) \le c_2, is imposed here in order to consistently estimate K. But it is still standard in the approximate factor model literature, as in Bai and Ng (2002), Bai (2003), etc. When K is known, such a condition can be removed. Our working paper (Fan et al., 2011) shows that the results continue to hold for a growing (known) K under the weaker condition \lambda_{\max}(\Sigma_u) \le c_2. Condition (iii) requires exponential-type tails, which allows us to apply large deviation theory to \frac{1}{T}\sum_{t=1}^T u_{it}u_{jt} - \sigma_{u,ij} and \frac{1}{T}\sum_{t=1}^T f_{jt}u_{it}.

We impose the strong mixing condition. Let \mathcal{F}_{-\infty}^0 and \mathcal{F}_T^{\infty} denote the σ-algebras generated by \{(f_t, u_t) : t \le 0\} and \{(f_t, u_t) : t \ge T\} respectively. In addition, define the mixing coefficient

\alpha(T) = \sup_{A \in \mathcal{F}_{-\infty}^0,\, B \in \mathcal{F}_T^{\infty}} |P(A)P(B) - P(AB)|.   (3.1)

Assumption 3.3

Strong mixing: There exists r_3 > 0 such that 3r_1^{-1} + 1.5r_2^{-1} + r_3^{-1} > 1, and C > 0 satisfying, for all T ∈ ℤ^+,

\alpha(T) \le \exp(-CT^{r_3}).

In addition, we impose the following regularity conditions.

Assumption 3.4

There exists M > 0 such that for all i ≤ p, t ≤ T and s ≤ T,

  1. \|b_i\|_{\max} < M,

  2. E[p^{-1/2}(u_s'u_t - Eu_s'u_t)]^4 < M,

  3. E\|p^{-1/2}\sum_{i=1}^p b_i u_{it}\|^4 < M.

These conditions are needed to consistently estimate the transformed common factors as well as the factor loadings. Similar conditions were also assumed in Bai (2003), and Bai and Ng (2006). The number of factors is assumed to be fixed. Our conditions in Assumption 3.4 are weaker than those in Bai (2003) as we focus on different aspects of the study.

3.2 Convergence of the idiosyncratic covariance

Estimating the covariance matrix Σu of the idiosyncratic components {ut} is important for many statistical inferences. For example, it is needed for large sample inference of the unknown factors and their loadings, for testing the capital asset pricing model (Sentana, 2009), and large-scale hypothesis testing (Fan, Han and Gu, 2012). See Section 5.

We estimate Σ_u by thresholding the principal orthogonal complement after the first \hat{K} principal components of the sample covariance are taken out: \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}} = \hat{R}_{\hat{K}}^{\mathcal{T}}. By Theorem 2.1, it also has the equivalent expression given by (2.11), with \hat{u}_{it} = y_{it} - (\hat{b}_i^{\hat{K}})'\hat{f}_t^{\hat{K}}. Throughout the paper, we apply the adaptive threshold

\tau_{ij} = C\sqrt{\hat{\theta}_{ij}}\,\omega_T, \qquad \omega_T = \frac{1}{\sqrt{p}} + \sqrt{\frac{\log p}{T}},   (3.2)

where C > 0 is a sufficiently large constant, though the results hold for other types of thresholding. As in Bickel and Levina (2008) and Cai and Liu (2011), the threshold chosen in the current paper is in fact obtained from the optimal uniform rate of convergence of \max_{i\le p, j\le p} |\hat{\sigma}_{ij} - \sigma_{u,ij}|. When direct observation of u_{it} is not available, the effect of estimating the unknown factors also contributes to this uniform estimation error, which is why p^{-1/2} appears in the threshold.

The following theorem gives the rate of convergence of the estimated idiosyncratic covariance. Let \gamma^{-1} = 3r_1^{-1} + 1.5r_2^{-1} + r_3^{-1} + 1. In the convergence rate below, recall that m_p and q are defined in the measure of sparsity (2.2).

Theorem 3.1

Suppose \log p = o(T^{\gamma/6}), T = o(p^2), and Assumptions 3.1–3.4 hold. Then for a sufficiently large constant C > 0 in the threshold (3.2), the POET estimator \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}} satisfies

\|\hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}} - \Sigma_u\| = O_p(\omega_T^{1-q} m_p).

If further \omega_T^{1-q} m_p = o(1), then the eigenvalues of \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}} are all bounded away from zero with probability approaching one, and

\|(\hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\| = O_p(\omega_T^{1-q} m_p).

When estimating Σ_u, p is allowed to grow exponentially fast in T, and \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}} can be made consistent under the spectral norm. In addition, \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}} is asymptotically invertible, while the classical sample covariance matrix based on the residuals is not when p > T.

Remark 3.2
  1. Consistent estimation of Σ_u indicates that Σ_u is identifiable in (1.3), namely, the sparse Σ_u can be separated perfectly from the low-rank matrix there. The result here gives another proof (when assuming \omega_T^{1-q} m_p = o(1)) of the "surprising phenomenon" in Candès et al. (2011) under different technical conditions.

  2. Fan, Liao and Mincheva (2011) recently showed that when \{f_t\}_{t=1}^T are observable and q = 0, the rate of convergence of the adaptive thresholding estimator is given by \|\hat{\Sigma}_u^{\mathcal{T}} - \Sigma_u\| = O_p(m_p\sqrt{\log p/T}) = \|(\hat{\Sigma}_u^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\|. Hence when the common factors are unobservable, the rate of convergence has an additional term m_p/\sqrt{p}, coming from the impact of estimating the unknown factors. This impact vanishes when p \log p \gg T, in which case the minimax rate as in Cai and Zhou (2010) is achieved. As p increases, more information about the common factors is collected, which results in more accurate estimation of the common factors \{f_t\}_{t=1}^T.

  3. When K is known and grows with p and T, with slightly weaker assumptions, our working paper (Fan et al., 2011) shows that under the exactly sparse case (that is, q = 0), the result continues to hold with convergence rate m_p\big(K^2\sqrt{\log p/T} + K^3/\sqrt{p}\big).

3.3 Convergence of the POET estimator

Since the first K eigenvalues of Σ grow with p, one can hardly estimate Σ with satisfactory accuracy in absolute terms. This problem arises not from the limitation of any estimation method, but is due to the nature of the high-dimensional factor model. We illustrate this using a simple example.

Example 3.1

Consider an ideal case where we know the spectrum except for the first eigenvector of Σ. Let \{\lambda_j, \xi_j\}_{j=1}^p be the eigenvalues and vectors, and assume that the largest eigenvalue \lambda_1 \ge cp for some c > 0. Let \hat{\xi}_1 be the estimated first eigenvector and define the covariance estimator \hat{\Sigma} = \lambda_1\hat{\xi}_1\hat{\xi}_1' + \sum_{j=2}^p \lambda_j\xi_j\xi_j'. Assume that \hat{\xi}_1 is a good estimator in the sense that \|\hat{\xi}_1 - \xi_1\|^2 = O_p(T^{-1}). However,

\|\hat{\Sigma} - \Sigma\| = \|\lambda_1(\hat{\xi}_1\hat{\xi}_1' - \xi_1\xi_1')\| = \lambda_1 O_p(\|\hat{\xi}_1 - \xi_1\|) = O_p(\lambda_1 T^{-1/2}),

which can diverge when T = O(p^2).

In the presence of very spiked eigenvalues, while the covariance Σ cannot be consistently estimated in absolute terms, it can be well estimated in terms of the relative error matrix

\Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2} - I_p,

which is more relevant for many applications (see Example 5.2). The relative error matrix can be measured by either its spectral norm or the normalized Frobenius norm defined by

p^{-1/2}\|\Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2} - I_p\|_F = \big(p^{-1}\,\mathrm{tr}[(\Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2} - I_p)^2]\big)^{1/2}.   (3.3)

In the last equality, there are p terms being added in the trace operation and the factor p−1 plays the role of normalization. The loss (3.3) is closely related to the entropy loss, introduced by James and Stein (1961). Also note that

p^{-1/2}\|\Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2} - I_p\|_F = \|\hat{\Sigma} - \Sigma\|_{\Sigma},

where \|A\|_{\Sigma} = p^{-1/2}\|\Sigma^{-1/2}A\Sigma^{-1/2}\|_F is the weighted quadratic norm in Fan et al. (2008).
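For reference, a small sketch of this weighted quadratic norm (an illustration of (3.3), not the authors' code):

```python
import numpy as np

def weighted_quadratic_norm(A, Sigma):
    """Sketch of ||A||_Sigma = p^{-1/2} ||Sigma^{-1/2} A Sigma^{-1/2}||_F, the loss in (3.3)."""
    p = Sigma.shape[0]
    w, V = np.linalg.eigh(Sigma)                      # Sigma = V diag(w) V'
    Sigma_inv_half = (V / np.sqrt(w)) @ V.T           # Sigma^{-1/2}
    M = Sigma_inv_half @ A @ Sigma_inv_half
    return np.linalg.norm(M, "fro") / np.sqrt(p)

# Relative error of an estimator Sigma_hat:
#   weighted_quadratic_norm(Sigma_hat - Sigma, Sigma)
```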

Fan et al. (2008) showed that in a large factor model, the sample covariance is such that \|\hat{\Sigma}_{sam} - \Sigma\|_{\Sigma} = O_p(\sqrt{p/T}), which does not converge if p > T. On the other hand, Theorem 3.2 below shows that \|\hat{\Sigma} - \Sigma\|_{\Sigma} can still be convergent as long as p = o(T^2). Technically, the impact of high-dimensionality on the convergence rate of \hat{\Sigma} - \Sigma is via the number of rows in B. We show in the appendix that B appears in \|\hat{\Sigma} - \Sigma\|_{\Sigma} through B'\Sigma^{-1}B, whose eigenvalues are bounded. Therefore it successfully cancels out the curse of high-dimensionality introduced by B.

Compared to estimating Σ, in a large approximate factor model, we can estimate the precision matrix with a satisfactory rate under the spectral norm. The intuition follows from the fact that Σ−1 has bounded eigenvalues.

The following theorem summarizes the rate of convergence under various norms.

Theorem 3.2

Under the assumptions of Theorem 3.1, the POET estimator defined in (2.15) satisfies

\|\hat{\Sigma}_{\hat{K}} - \Sigma\|_{\Sigma} = O_p\Big(\frac{\sqrt{p}\,\log p}{T} + m_p\omega_T^{1-q}\Big), \qquad \|\hat{\Sigma}_{\hat{K}} - \Sigma\|_{\max} = O_p(\omega_T).

In addition, if m_p\omega_T^{1-q} = o(1), then \hat{\Sigma}_{\hat{K}} is nonsingular with probability approaching one, with

\|\hat{\Sigma}_{\hat{K}}^{-1} - \Sigma^{-1}\| = O_p(m_p\omega_T^{1-q}).
Remark 3.3
  1. When estimating Σ^{-1}, p is allowed to grow exponentially fast in T, and the estimator has the same rate of convergence as that of the estimator \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}} in Theorem 3.1. When p becomes much larger than T, the precision matrix can be estimated at the same rate as if the factors were observable.

  2. As in Remark 3.2, when K > 0 is known and grows with p and T, the working paper Fan et al. (2011) proves the following results (when q = 0):
     \|\hat{\Sigma}^{\mathcal{T}} - \Sigma\|_{\Sigma} = O_p\Big(\frac{K\sqrt{p}\,\log p}{T} + K^2 m_p\sqrt{\frac{\log p}{T}} + \frac{m_p K^3}{\sqrt{p}}\Big), \qquad \|\hat{\Sigma}^{\mathcal{T}} - \Sigma\|_{\max} = O_p\Big(K^3\sqrt{\frac{\log p}{T}} + \frac{K^3}{\sqrt{p}}\Big), \qquad \|(\hat{\Sigma}^{\mathcal{T}})^{-1} - \Sigma^{-1}\| = O_p\Big(K^2 m_p\sqrt{\frac{\log p}{T}} + \frac{K^3 m_p}{\sqrt{p}}\Big).

    The results state explicitly the dependence of the rate of convergence on the number of factors.

  3. The relative error \|\Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2} - I_p\| in operator norm can be shown to have the same order as the maximum relative error of the estimated eigenvalues. It neither converges to zero nor diverges. It is much smaller than \|\hat{\Sigma} - \Sigma\|, which is of order p/\sqrt{T} (see Example 3.1).

3.4 Convergence of unknown factors and factor loadings

Many applications of the factor model require estimating the unknown factors. In general, factor loadings in B and the common factors ft are not separably identifiable, as for any matrix H such that H′H = IK, Bft = BH′Hft. Hence (B, ft) cannot be identified from (BH′, Hft). Note that the linear space spanned by the rows of B is the same as that by those of BH′. In practice, it often does not matter which one is used.

Let V denote the \hat{K} \times \hat{K} diagonal matrix of the first \hat{K} largest eigenvalues of the sample covariance matrix in decreasing order. Recall that F′ = (f_1, …, f_T) and define a \hat{K} \times K matrix H = \frac{1}{T}V^{-1}\hat{F}'FB'B. Then for t ≤ T, Hf_t = T^{-1}V^{-1}\hat{F}'(Bf_1, …, Bf_T)'Bf_t. Note that Hf_t depends only on the data V^{-1}\hat{F}' and an identifiable part of the parameters \{Bf_t\}_{t=1}^T. Therefore, there is no identifiability issue in Hf_t regardless of the imposed identifiability condition.

Bai (2003) obtained the rates of convergence of \hat{b}_i and \hat{f}_t for any fixed (i, t). However, the uniform rate of convergence is more relevant for many applications (see Example 5.1). The following theorem extends those results in Bai (2003) in a uniformity sense. In particular, with a more refined technique, we have improved the uniform convergence rate for \hat{f}_t.

Theorem 3.3

Under the assumptions of Theorem 3.1,

\max_{i\le p}\|\hat{b}_i - Hb_i\| = O_p(\omega_T), \qquad \max_{t\le T}\|\hat{f}_t - Hf_t\| = O_p\Big(\frac{1}{T^{1/2}} + \frac{T^{1/4}}{\sqrt{p}}\Big).

As a consequence of Theorem 3.3, we obtain the following: (recall that the constant r2 is defined in Assumption 3.2.)

Corollary 3.1

Under the assumptions of Theorem 3.1,

\max_{i\le p,\, t\le T}|\hat{b}_i'\hat{f}_t - b_i'f_t| = O_p\Big((\log T)^{1/r_2}\sqrt{\frac{\log p}{T}} + \frac{T^{1/4}}{\sqrt{p}}\Big).

The rates of convergence obtained above also explain the condition T = o(p^2) in Theorems 3.1 and 3.2. It is needed in order to estimate the common factors \{f_t\}_{t=1}^T uniformly in t ≤ T. When we do not observe \{f_t\}_{t=1}^T, in addition to the factor loadings, there are KT factors to estimate. Intuitively, the condition T = o(p^2) requires the number of parameters introduced by the unknown factors to be "not too many", so that we can estimate them consistently and uniformly. Technically, as demonstrated by Bickel and Levina (2008), Cai and Liu (2011) and many other authors, achieving uniform accuracy is essential for large covariance estimation.

4 Choice of Threshold

4.1 Finite-sample positive definiteness

Recall the threshold value \tau_{ij} = C\sqrt{\hat{\theta}_{ij}}\,\omega_T, where C is determined by the user. To make POET operational in practice, one has to choose C to maintain the positive definiteness of the estimated covariances for any given finite sample. We write \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}}(C) = \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}}, where the covariance estimator depends on C via the threshold. We choose C in the range where \lambda_{\min}(\hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}}(C)) > 0. Define

C_{\min} = \inf\{C > 0 : \lambda_{\min}(\hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}}(M)) > 0, \ \forall M > C\}.   (4.1)

When C is sufficiently large, the estimator becomes diagonal, and its minimum eigenvalue is then strictly positive. Thus, C_{\min} is well defined, and for all C > C_{\min}, \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}}(C) is positive definite in finite samples. We can obtain C_{\min} by solving \lambda_{\min}(\hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}}(C)) = 0, C \ne 0. We can also approximate C_{\min} by plotting \lambda_{\min}(\hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}}(C)) as a function of C, as illustrated in Figure 1. In practice, we can choose C in the range (C_{\min} + \varepsilon, M) for a small ε and a large enough M. Choosing the threshold in a range that guarantees finite-sample positive definiteness was also previously suggested by Fryzlewicz (2010).
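A minimal sketch of this grid-based approximation of C_min, where sigma_u_of_C is a hypothetical helper mapping a constant C to the thresholded idiosyncratic covariance:

```python
import numpy as np

def approximate_C_min(C_grid, sigma_u_of_C):
    """Sketch of approximating C_min in (4.1) on a grid of C values.

    sigma_u_of_C is assumed to be a user-supplied function C -> thresholded idiosyncratic
    covariance (a hypothetical helper); we return the smallest grid value with a positive
    minimum eigenvalue.
    """
    min_eigs = np.array([np.linalg.eigvalsh(sigma_u_of_C(C)).min() for C in C_grid])
    positive = np.asarray(C_grid)[min_eigs > 0]
    return positive.min() if positive.size else None

# Example: approximate_C_min(np.linspace(0.0, 3.0, 61), sigma_u_of_C)
```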

Figure 1. Minimum eigenvalue of \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}}(C) as a function of C for three choices of thresholding rules. The plot is based on the simulated data set in Section 6.2.

4.2 Multifold Cross-Validation

In practice, C can be data-driven and chosen through multifold cross-validation. After obtaining the estimated residuals \{\hat{u}_t\}_{t\le T} by PCA, we divide them randomly into two subsets, which are, for simplicity, denoted by \{\hat{u}_t\}_{t\in J_1} and \{\hat{u}_t\}_{t\in J_2}. The sizes of J_1 and J_2, denoted by T(J_1) and T(J_2), satisfy T(J_1) ≍ T and T(J_1) + T(J_2) = T. For example, in sparse matrix estimation, Bickel and Levina (2008) suggested choosing T(J_1) = T(1 − (\log T)^{-1}).

We repeat this procedure H times. At the jth split, we denote by \hat{\Sigma}_u^{\mathcal{T},j}(C) the POET estimator with threshold C\sqrt{\hat{\theta}_{ij}}\,\omega_T on the training data set \{\hat{u}_t\}_{t\in J_1}. We also denote by \hat{\Sigma}_u^j the sample covariance based on the validation set, defined by \hat{\Sigma}_u^j = T(J_2)^{-1}\sum_{t\in J_2}\hat{u}_t\hat{u}_t'. Then we choose the constant C^* by minimizing a cross-validation objective function over a compact interval:

C^* = \arg\min_{C_{\min}+\varepsilon \le C \le M} \frac{1}{H}\sum_{j=1}^H \|\hat{\Sigma}_u^{\mathcal{T},j}(C) - \hat{\Sigma}_u^j\|_F^2.   (4.2)

Here C_{\min} is the minimum constant that guarantees the positive definiteness of \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}}(C) for C > C_{\min} as described in the previous subsection, and M is a large constant such that \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}}(M) is diagonal. The resulting C^* is data-driven, and so depends on Y as well as on p and T via the data. On the other hand, for each given p × T data matrix Y, C^* is a universal constant in the threshold \tau_{ij} = C\sqrt{\hat{\theta}_{ij}}\,\omega_T in the sense that it does not change with respect to the position (i, j). We also note that the cross-validation is based on the estimate of Σ_u rather than Σ, because POET thresholds the error covariance matrix. Thus cross-validation improves the performance of thresholding.
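A minimal sketch of the cross-validation (4.2), where poet_sigma_u is a hypothetical helper that applies the threshold with constant C to the training residuals:

```python
import numpy as np

def choose_C_by_cv(U_hat, C_grid, poet_sigma_u, H=20, seed=0):
    """Sketch of the multifold cross-validation (4.2) for the threshold constant C.

    U_hat        : p x T matrix of estimated residuals u_hat_t from the PCA step.
    poet_sigma_u : assumed helper (C, U_train) -> thresholded idiosyncratic covariance.
    The split size follows Bickel and Levina (2008): T(J1) = T(1 - 1/log T).
    """
    rng = np.random.default_rng(seed)
    p, T = U_hat.shape
    T1 = int(T * (1 - 1 / np.log(T)))
    losses = np.zeros(len(C_grid))
    for _ in range(H):
        perm = rng.permutation(T)
        J1, J2 = perm[:T1], perm[T1:]
        S_val = U_hat[:, J2] @ U_hat[:, J2].T / len(J2)      # validation covariance
        for k, C in enumerate(C_grid):
            losses[k] += np.linalg.norm(poet_sigma_u(C, U_hat[:, J1]) - S_val, "fro") ** 2
    return C_grid[int(np.argmin(losses))]
```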

It is possible to derive the rate of convergence for ^u,K^T(C) under the current model setting, but it ought to be much more technically involved than the regular sparse matrix estimation considered by Bickel and Levina (2008) and Cai and Liu (2011). To keep our presentation simple we do not pursue it in the current paper.

5 Applications of POET

We give four examples to which the results in Theorems 3.1–3.3 can be applied. Detailed pursuits of these are beyond the scope of the paper.

Example 5.1 (Large-scale hypothesis testing)

Controlling the false discovery rate in large-scale hypothesis testing based on correlated test statistics is an important and challenging problem in statistics (Leek and Storey, 2008; Efron, 2010; Fan et al., 2012). Suppose that the test statistic for each of the hypotheses

H_{i0}: \mu_i = 0 \quad \text{vs.} \quad H_{i1}: \mu_i \ne 0

is Z_i ~ N(μ_i, 1), and these test statistics Z are jointly normal N(μ, Σ), where Σ is unknown. For a given critical value x, the false discovery proportion is then defined as FDP(x) = V(x)/R(x), where V(x) = \sum_{\mu_i = 0} I(|Z_i| > x) and R(x) = \sum_{i=1}^p I(|Z_i| > x) are the total number of false discoveries and the total number of discoveries, respectively. Our interest is to estimate FDP(x) for each given x. Note that R(x) is an observable quantity. Only V(x) needs to be estimated.

If the covariance Σ admits the approximate factor structure (1.3), then the test statistics can be stochastically decomposed as

Z = \mu + Bf + u, \quad \text{where } \mathrm{cov}(u) \text{ is sparse}.   (5.1)

By the principal factor approximation (Theorem 1, Fan, Han, Gu, 2012)

V(x) = \sum_{i=1}^p \{\Phi(a_i(z_{x/2} + \eta_i)) + \Phi(a_i(z_{x/2} - \eta_i))\} + o_P(p),   (5.2)

when m_p = o(p) and the number of true significant hypotheses \{i : \mu_i \ne 0\} is o(p), where z_x is the upper x-quantile of the standard normal distribution, \eta_i = (Bf)_i and a_i = \mathrm{var}(u_i)^{-1/2}.

Now suppose that we have n repeated measurements from the model (5.1). Then, by Corollary 3.1, {ηi} can be uniformly consistently estimated, and hence p−1V (x) and FDP(x) can be consistently estimated. Efron (2010) obtained these repeated test statistics based on the bootstrap sample from the original raw data. Our theory (Theorem 3.3) gives a formal justification to the framework of Efron (2007, 2010).

Example 5.2 (Risk management)

The maximum elementwise estimation error \|\hat{\Sigma} - \Sigma\|_{\max} appears in risk assessment, as in Fan, Zhang and Yu (2012). For a fixed portfolio allocation vector w, the true portfolio variance and the estimated one are given by w'\Sigma w and w'\hat{\Sigma}w respectively. The estimation error is bounded by

|w'\hat{\Sigma}_{\hat{K}}w - w'\Sigma w| \le \|\hat{\Sigma}_{\hat{K}} - \Sigma\|_{\max}\,\|w\|_1^2,

where \|w\|_1, the L_1-norm of w, is the gross exposure of the portfolio. Usually a constraint is placed on the total percentage of the short positions, in which case we have a restriction \|w\|_1 \le c for some c > 0. In particular, c = 1 corresponds to a portfolio with no short positions (all weights are nonnegative). Theorem 3.2 quantifies the maximum approximation error; a small numerical check of this elementwise bound is sketched below.
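The bound follows from |w'(\hat{\Sigma}-\Sigma)w| \le \sum_{i,j}|w_i||w_j||\hat{\sigma}_{ij}-\sigma_{ij}|. The following sketch checks it numerically on synthetic matrices (all quantities here are made up for illustration):

```python
import numpy as np

# Numerical check of |w' Sigma_hat w - w' Sigma w| <= ||Sigma_hat - Sigma||_max * ||w||_1^2
rng = np.random.default_rng(0)
p = 50
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)                       # a positive definite "true" covariance
Sigma_hat = Sigma + 0.01 * rng.standard_normal((p, p))
Sigma_hat = (Sigma_hat + Sigma_hat.T) / 2             # symmetrize the perturbed estimate

w = rng.dirichlet(np.ones(p))                         # no-short portfolio, ||w||_1 = 1
lhs = abs(w @ Sigma_hat @ w - w @ Sigma @ w)
rhs = np.max(np.abs(Sigma_hat - Sigma)) * np.sum(np.abs(w)) ** 2
assert lhs <= rhs
```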

The above compares the absolute error of perceived risk and true risk. The relative error is bounded by

|w'\hat{\Sigma}_{\hat{K}}w / w'\Sigma w - 1| \le \|\Sigma^{-1/2}\hat{\Sigma}_{\hat{K}}\Sigma^{-1/2} - I_p\|

for any allocation vector w. Theorem 3.2 quantifies this relative error.

Example 5.3 (Panel regression with a factor structure in the errors)

Consider the following panel regression model

Y_{it} = x_{it}'\beta + \varepsilon_{it}, \qquad \varepsilon_{it} = b_i'f_t + u_{it}, \quad i \le p, \ t \le T,

where xit is a vector of observable regressors with fixed dimension. The regression error εit has a factor structure and is assumed to be independent of xit, but bi, ft and uit are all unobservable. We are interested in the common regression coefficients β. The above panel regression model has been considered by many researchers, such as Ahn, Lee and Schmidt (2001), Pesaran (2006), and has broad applications in social sciences.

Although OLS (ordinary least squares) produces a consistent estimator of β, a more efficient estimate can be obtained by GLS (generalized least squares). The GLS method depends, however, on an estimator of \Sigma_{\varepsilon}^{-1}, the inverse of the covariance matrix of \varepsilon_t = (\varepsilon_{1t}, \dots, \varepsilon_{pt})'. By assuming the covariance matrix of (u_{1t}, \dots, u_{pt}) to be sparse, we can successfully solve this problem by applying Theorem 3.2. Although \varepsilon_{it} is unobservable, it can be replaced by the regression residuals \hat{\varepsilon}_{it}, obtained by first regressing Y_{it} on x_{it}. We then apply the POET estimator to T^{-1}\sum_{t=1}^T\hat{\varepsilon}_t\hat{\varepsilon}_t'. By Theorem 3.2, the inverse of the resulting estimator is a consistent estimator of \Sigma_{\varepsilon}^{-1} under the spectral norm. A slight difference lies in the fact that when we apply POET, T^{-1}\sum_{t=1}^T\varepsilon_t\varepsilon_t' is replaced with T^{-1}\sum_{t=1}^T\hat{\varepsilon}_t\hat{\varepsilon}_t', which introduces an additional term O_p(\sqrt{\log p/T}) in the estimation error.

Example 5.4 (Validating an asset pricing theory)

A celebrated financial economic theory is the capital asset pricing model (CAPM; Sharpe, 1964), which won William Sharpe the Nobel Prize in Economics in 1990, and whose extension is the multi-factor model (Ross, 1976; Chamberlain and Rothschild, 1983). It states that, in a frictionless market, the excess return of any financial asset equals the excess returns of the risk factors times its factor loadings, plus noise. In the multi-period model, the excess return y_{it} of firm i at time t follows model (1.1), in which f_t is the vector of excess returns of the risk factors at time t. To test this theory, one embeds the model into the multivariate linear model

y_t = \alpha + Bf_t + u_t, \quad t = 1, \dots, T,   (5.3)

and wishes to test H_0: \alpha = 0. The F-test statistic involves the estimation of the covariance matrix Σ_u, whose estimates are degenerate without regularization when p ≥ T. Therefore, in the literature (Sentana, 2009, and references therein), one focuses on the case where p is relatively small. The typical choices of parameters are T = 60 monthly data points and p = 5, 10 or 25 assets. However, the CAPM should hold for all tradeable assets, not just a small fraction of them. With our regularization technique, a non-degenerate estimate \hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}} can be obtained, and the F-test or likelihood-ratio test statistics can be employed even when p ≥ T.

To provide some insights, let \hat{\alpha} be the least-squares estimator of (5.3). Then, when u_t \sim N(0, \Sigma_u), \hat{\alpha} \sim N(\alpha, \Sigma_u/c_T) for a constant c_T which depends on the observed factors. When Σ_u is known, the Wald test statistic is W = c_T\hat{\alpha}'\Sigma_u^{-1}\hat{\alpha}. When it is unknown and p is large, it is natural to use the F-type test statistic \hat{W} = c_T\hat{\alpha}'(\hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}})^{-1}\hat{\alpha}. The difference between these two statistics is bounded by

|\hat{W} - W| \le c_T\|(\hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\|\,\|\hat{\alpha}\|^2.

Since under the null hypothesis \hat{\alpha} \sim N(0, \Sigma_u/c_T), we have c_T\|\Sigma_u^{-1/2}\hat{\alpha}\|^2 = O_p(p). Thus, it follows from the boundedness of \|\Sigma_u\| that |\hat{W} - W| = O_p(p)\,\|(\hat{\Sigma}_{u,\hat{K}}^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\|. Theorem 3.1 provides the rate of convergence for this difference. A detailed development is beyond the scope of the current paper, and we will leave it as a separate research project.

6 Monte Carlo Experiments

In this section, we will examine the performance of the POET method in a finite sample. We will also demonstrate the effect of this estimator on the asset allocation and risk assessment. Similarly to Fan, et al. (2008, 2011), we simulated from a standard Fama-French three-factor model, assuming a sparse error covariance matrix and three factors. Throughout this section, the time span is fixed at T = 300, and the dimensionality p increases from 1 to 600. We assume that the excess returns of each of p stocks over the risk-free interest rate follow the following model:

y_{it} = b_{i1}f_{1t} + b_{i2}f_{2t} + b_{i3}f_{3t} + u_{it}.

The factor loadings are drawn from a trivariate normal distribution b ~ N3(μB, ΣB), the idiosyncratic errors from ut ~ Np(0, Σu), and the factor returns ft follow a VAR(1) model. To make the simulation more realistic, model parameters are calibrated from the financial returns, as detailed in the following section.

6.1 Calibration

To calibrate the model, we use the data on annualized returns of 100 industrial portfolios from the website of Kenneth French, and the data on 3-month Treasury bill rates from the CRSP database. These industrial portfolios are formed as the intersection of 10 portfolios based on size (market equity) and 10 portfolios based on the ratio of book equity to market equity. Their excess returns \{y_t\} are computed for the period from January 1st, 2009 to December 31st, 2010. Here, we present a short outline of the calibration procedure.

  1. Given \{y_t\}_{t=1}^{500} as the input data, we fit a Fama-French three-factor model and calculate a 100 × 3 loading matrix \hat{B} and a 500 × 3 factor matrix \hat{F}, using the principal components method described in Section 2.3.

  2. We summarize the 100 factor loadings (the rows of \hat{B}) by their sample mean vector μ_B and sample covariance matrix Σ_B, which are reported in Table 1. The factor loadings b_i = (b_{i1}, b_{i2}, b_{i3})′ for i = 1, …, p are drawn from N_3(μ_B, Σ_B).

  3. We fit the stationary vector autoregressive model f_t = μ + Φf_{t-1} + ε_t, a VAR(1) model, to the data to obtain the multivariate least squares estimators of μ and Φ, and estimate Σ_ε. Note that all eigenvalues of Φ in Table 2 fall within the unit circle, so our model is stationary. The covariance matrix cov(f_t) can be obtained by solving the linear equation cov(f_t) = Φcov(f_t)Φ′ + Σ_ε. The estimated parameters are depicted in Table 2 and are used to generate f_t.

  4. For each value of p, we generate a sparse covariance matrix Σ_u of the form
     \Sigma_u = D\,\Sigma_0\,D.
     Here, Σ_0 is the error correlation matrix, and D is the diagonal matrix of the standard deviations of the errors. We set D = diag(σ_1, …, σ_p), where each σ_i is generated independently from a Gamma distribution G(α, β), and α and β are chosen to match the sample mean and sample standard deviation of the standard deviations of the errors. A similar approach to that of Fan et al. (2011) has been used in this calibration step. The off-diagonal entries of Σ_0 are generated independently from a normal distribution, with mean and standard deviation equal to the sample mean and sample standard deviation of the sample correlations among the estimated residuals, conditional on their absolute values being no larger than 0.95. We then employ hard thresholding to make Σ_0 sparse, where the threshold is found as the smallest constant that provides the positive definiteness of Σ_0. More precisely, we start with threshold value 1, which gives Σ_0 = I_p, and then decrease the threshold value along a grid until positive definiteness is violated, keeping the last threshold that preserves it. A sketch of this construction is given below.
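The following is an illustrative sketch of step 4 (the Gamma and normal parameters are assumed to have been calibrated from the residuals as described above; clipping is used in place of the truncation for brevity):

```python
import numpy as np

def generate_sparse_sigma_u(p, sd_mean, sd_std, corr_mean, corr_std, seed=None):
    """Sketch of calibration step 4: Sigma_u = D * Sigma_0 * D with a hard-thresholded
    correlation matrix.  Not the authors' code; parameters are placeholders."""
    rng = np.random.default_rng(seed)

    # D: standard deviations from Gamma(alpha, beta) matched to (sd_mean, sd_std)
    beta = sd_std ** 2 / sd_mean
    alpha = sd_mean / beta
    D = np.diag(rng.gamma(alpha, beta, size=p))

    # off-diagonal correlations from a normal distribution, bounded by 0.95 in absolute value
    C = rng.normal(corr_mean, corr_std, size=(p, p))
    C = np.clip((C + C.T) / 2, -0.95, 0.95)
    np.fill_diagonal(C, 1.0)

    # start from threshold 1 (Sigma_0 = I_p) and lower it on a grid until positive
    # definiteness would be violated, keeping the last valid Sigma_0
    Sigma0 = np.eye(p)
    for thr in np.arange(1.0, 0.0, -0.05):
        candidate = np.where(np.abs(C) >= thr, C, 0.0)
        np.fill_diagonal(candidate, 1.0)
        if np.linalg.eigvalsh(candidate).min() <= 0:
            break
        Sigma0 = candidate
    return D @ Sigma0 @ D
```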

Table 1.

Mean and covariance matrix used to generate b

μB ΣB
0.0047 0.0767 −0.00004 0.0087
0.0007 −0.00004 0.0841 0.0013
−1.8078 0.0087 0.0013 0.1649

Table 2.

Parameters of ft generating process

μ cov(ft) Φ
−0.0050 1.0037 0.0011 −0.0009 −0.0712 0.0468 0.1413
0.0335 0.0011 0.9999 0.0042 −0.0764 −0.0008 0.0646
−0.0756 −0.0009 0.0042 0.9973 0.0195 −0.0071 −0.0544

6.2 Simulation

For the simulation, we fix T = 300, and let p increase from 1 to 600. For each fixed p, we repeat the following steps N = 200 times, and record the means and the standard deviations of each respective norm.

  1. Generate independently \{b_i\}_{i=1}^p \sim N_3(\mu_B, \Sigma_B), and set B = (b_1, …, b_p)′.

  2. Generate independently \{u_t\}_{t=1}^T \sim N_p(0, \Sigma_u).

  3. Generate \{f_t\}_{t=1}^T as a vector autoregressive sequence of the form f_t = \mu + \Phi f_{t-1} + \varepsilon_t.

  4. Calculate \{y_t\}_{t=1}^T from y_t = Bf_t + u_t.

  5. Set the hard-thresholding threshold to 0.5\sqrt{\hat{\theta}_{ij}}\big(\sqrt{\log p/T} + 1/\sqrt{p}\big). Estimate K using the IC_1 criterion of Bai and Ng (2002). Calculate the covariance estimators using the POET method. Calculate the sample covariance matrix \hat{\Sigma}_{sam}. (A data-generation sketch is given after this list.)
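A minimal sketch of simulation steps 1–4 (an illustration only; the calibrated parameters from Section 6.1 are passed in by the user):

```python
import numpy as np

def simulate_returns(p, T, mu_B, Sigma_B, Sigma_u, mu_f, Phi, Sigma_eps, seed=None):
    """Sketch of simulation steps 1-4: draw loadings, idiosyncratic errors and VAR(1)
    factors, then form y_t = B f_t + u_t.  Not the authors' code."""
    rng = np.random.default_rng(seed)
    K = len(mu_f)
    B = rng.multivariate_normal(mu_B, Sigma_B, size=p)            # p x K loadings
    U = rng.multivariate_normal(np.zeros(p), Sigma_u, size=T).T   # p x T errors
    F = np.zeros((T, K))                                          # T x K factors
    f_prev = np.asarray(mu_f, dtype=float)
    for t in range(T):
        f_prev = mu_f + Phi @ f_prev + rng.multivariate_normal(np.zeros(K), Sigma_eps)
        F[t] = f_prev
    Y = B @ F.T + U                                               # p x T returns y_t
    return Y, B, F
```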

In the graphs below, we plot the averages and standard deviations of the distances from \hat{\Sigma} and \hat{\Sigma}_{sam} to the true covariance matrix Σ, under the norms \|\cdot\|_{\Sigma}, \|\cdot\| and \|\cdot\|_{\max}. We also plot the means and standard deviations of the distances from \hat{\Sigma}^{-1} and \hat{\Sigma}_{sam}^{-1} to \Sigma^{-1} under the spectral norm. The dimensionality p ranges from 20 to 600 in increments of 20. Because the sample covariance matrix is singular when p ≥ T, the spectral norm for \hat{\Sigma}_{sam}^{-1} is plotted only up to p = 280. Also, we zoom into these graphs by plotting the values of p from 1 to 100, this time in increments of 1. Notice that we also plot the distance from \hat{\Sigma}_{obs} to Σ for comparison, where \hat{\Sigma}_{obs} is the estimated covariance matrix proposed by Fan et al. (2011), assuming the factors are observable.

6.3 Results

In a factor model, we expect POET to perform as well as Σ̂obs when p is relatively large, since the effect of estimating the unknown factors should vanish as p increases. This is illustrated in the plots below.

From the simulation results, reported in Figures 2–5, we observe that POET under the unobservable factor model performs just as well as the estimator in Fan et al. (2011) with known factors, when p is large enough. The cost of not knowing the factors is approximately of order O_p(1/\sqrt{p}). It can be seen in Figures 2 and 3 that this cost vanishes for p ≥ 200. To give better insight into the impact of estimating the unknown factors for small p, a separate set of simulations is conducted for p ≤ 100. As we can see from Figures 2 (bottom panel) and 3 (middle and bottom panels), the impact decreases quickly. In addition, when estimating Σ^{-1}, it is hard to distinguish the estimators with known and unknown factors, whose performances are quite stable compared to the sample covariance matrix. Also, the maximum absolute elementwise error (Figure 4) of our estimator performs very similarly to that of the sample covariance matrix, which coincides with our asymptotic result. Figure 5 shows that the performances of the three methods are indistinguishable in the spectral norm, as expected.

Figure 2. Averages (left panel) and standard deviations (right panel) of the relative error p^{-1/2}\|\Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2} - I_p\|_F with known factors (\hat{\Sigma} = \hat{\Sigma}_{obs}, solid red curve), POET (\hat{\Sigma} = \hat{\Sigma}_{\hat{K}}, solid blue curve), and sample covariance (\hat{\Sigma} = \hat{\Sigma}_{sam}, dashed curve) over 200 simulations, as a function of the dimensionality p. Top panel: p ranges from 20 to 600 with increment 20; bottom panel: p ranges from 1 to 100 with increment 1.

Figure 5.

Averages of $\|\hat\Sigma - \Sigma\|$ (left panel) and $\|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_p\|$ (right panel) with known factors (Σ̂ = Σ̂obs, solid red curve), POET (solid blue curve), and the sample covariance (Σ̂ = Σ̂sam, dashed curve) over 200 simulations, as a function of the dimensionality p. The three curves are hardly distinguishable in the left panel.

Figure 3.

Averages (left panels) and standard deviations (right panels) of $\|\hat\Sigma^{-1} - \Sigma^{-1}\|$ with known factors (Σ̂ = Σ̂obs, solid red curve), POET (solid blue curve), and the sample covariance (Σ̂ = Σ̂sam, dashed curve) over 200 simulations, as a function of the dimensionality p. Top panels: p ranges from 20 to 600 in increments of 20; middle panels: p ranges from 1 to 100 in increments of 1; bottom panels: the same as the top panels with the dashed curve excluded.

Figure 4.

Averages (left panel) and standard deviations (right panel) of $\|\hat\Sigma - \Sigma\|_{\max}$ with known factors (Σ̂ = Σ̂obs, solid red curve), POET (solid blue curve), and the sample covariance (Σ̂ = Σ̂sam, dashed curve) over 200 simulations, as a function of the dimensionality p. The curves are nearly indistinguishable.

6.4 Robustness to the estimation of K

The POET estimator depends on the estimated number of factors. Our theory assumes a consistent estimator K̂. To assess the robustness of our procedure to the choice of K in finite samples, we calculate $\hat\Sigma_{u,K}^{\mathcal{T}}$ for K = 1, 2, …, 10. Again, the threshold is fixed at $0.5\sqrt{\hat\theta_{ij}}\big(\sqrt{\log p/T} + 1/\sqrt{p}\big)$.
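For concreteness, here is a minimal sketch of how $\hat\Sigma_{u,K}^{\mathcal{T}}$ and the corresponding POET covariance estimator can be computed for a given K, using the K leading principal components of the sample covariance and entrywise hard thresholding with the threshold above. It is a simplified illustration with hypothetical names, not the packaged implementation.

```python
import numpy as np

def poet(Y, K, C=0.5):
    """POET for a T x p data matrix Y (rows = observations): K principal components
    plus an adaptively hard-thresholded principal orthogonal complement."""
    T, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / T                          # sample covariance (p x p)
    vals, vecs = np.linalg.eigh(S)             # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:K]           # K leading eigenpairs
    lam, xi = vals[idx], vecs[:, idx]
    low_rank = xi @ np.diag(lam) @ xi.T        # sum_{i<=K} lambda_i xi_i xi_i'
    R = S - low_rank                           # principal orthogonal complement
    U = Yc - Yc @ xi @ xi.T                    # PCA residuals u_hat (T x p)
    omega_T = np.sqrt(np.log(p) / T) + 1.0 / np.sqrt(p)
    theta = (U**2).T @ (U**2) / T - R**2       # theta_ij = (1/T) sum_t (u_it u_jt - sigma_ij)^2
    tau = C * np.sqrt(np.maximum(theta, 0)) * omega_T
    R_thr = np.where(np.abs(R) > tau, R, 0.0)  # entrywise hard thresholding
    np.fill_diagonal(R_thr, np.diag(R))        # diagonal entries are not thresholded
    return low_rank + R_thr, R_thr             # (POET estimator, thresholded Sigma_u)
```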

6.4.1 Design 1

The simulation setup is the same as before, where the true number of factors is K0 = 3. We calculate $\|\hat\Sigma_{u,K}^{\mathcal{T}} - \Sigma_u\|$, $\|(\hat\Sigma_{u,K}^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\|$, $\|\hat\Sigma_K^{-1} - \Sigma^{-1}\|$ and $\|\hat\Sigma_K - \Sigma\|_\Sigma$ for K = 1, 2, …, 10. Figure 6 plots these norms as p increases, with T = 300 fixed. The results are quite robust when K ≥ 3; in particular, the spectral-norm estimation errors for large p are close to each other. When K = 1 or 2, the estimators perform badly because of the modeling bias. Therefore, POET is robust to over-estimation of K, but not to under-estimation.

Figure 6.

Robustness to the choice of K as p increases (Design 1, T = 300). Top left: $\|\hat\Sigma_{u,K}^{\mathcal{T}} - \Sigma_u\|$; top right: $\|(\hat\Sigma_{u,K}^{\mathcal{T}})^{-1} - \Sigma_u^{-1}\|$; bottom left: $\|\hat\Sigma_K - \Sigma\|_\Sigma$; bottom right: $\|\hat\Sigma_K^{-1} - \Sigma^{-1}\|$.

6.4.2 Design 2

We also simulated from a new data generating process for the robustness assessment. Consider a banded idiosyncratic matrix

$$\sigma_{u,ij} = \begin{cases} 0.5^{|i-j|}, & |i-j| \le 9,\\ 0, & |i-j| > 9,\end{cases} \qquad (u_1, \dots, u_T) \sim \text{i.i.d. } N_p(0, \Sigma_u).$$

We still consider a K0 = 3 factor model, where the factors and loadings are independently simulated as

$$f_{it} \sim N(0,1), \qquad b_{ji} \sim N(0,1), \qquad i \le 3,\ j \le p,\ t \le T.$$
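A short sketch of this data-generating process, with illustrative names, is given below; it returns the simulated data together with the population Σu and Σ = BB′ + Σu.

```python
import numpy as np

rng = np.random.default_rng(0)

def design2(p, T, K0=3, band=9, rho=0.5):
    """Design 2: banded idiosyncratic covariance sigma_{u,ij} = rho^|i-j| for |i-j| <= band
    (0 otherwise), with i.i.d. N(0,1) factors and loadings."""
    i, j = np.indices((p, p))
    Sigma_u = np.where(np.abs(i - j) <= band, rho ** np.abs(i - j), 0.0)
    U = rng.multivariate_normal(np.zeros(p), Sigma_u, size=T)   # errors (T x p)
    F = rng.standard_normal((T, K0))                            # factors f_it ~ N(0,1)
    B = rng.standard_normal((p, K0))                            # loadings b_ji ~ N(0,1)
    Y = F @ B.T + U
    return Y, Sigma_u, B @ B.T + Sigma_u                        # data, Sigma_u, Sigma
```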

Table 3 summarizes the average estimation error of covariance matrices across K in the spectral norm. Each simulation is replicated 50 times and T = 200.

Table 3.

Robustness to the choice of K (Design 2): estimation errors in the spectral norm

                                  K = 1     K = 2     K = 3     K = 4     K = 5     K = 6     K = 8
p = 100
  Σ̂u,K^T                         10.70      5.23      1.63      1.80      1.91      2.04      2.22
  (Σ̂u,K^T)−1                      2.71      2.51      1.51      1.50      1.44      1.84      2.82
  Σ̂K−1                            2.69      2.48      1.47      1.49      1.41      1.56      2.35
  Σ̂K                             94.66     91.36     29.41     31.45     30.91     33.59     33.48
  Σ−1/2Σ̂KΣ−1/2 − Ip              17.37     10.04      2.05      2.83      2.94      2.95      2.93

p = 200
  Σ̂u,K^T                         11.34     11.45      1.64      1.71      1.79      1.87      2.01
  (Σ̂u,K^T)−1                      2.69      3.91      1.57      1.56      1.81      2.26      3.42
  Σ̂K−1                            2.67      3.72      1.57      1.55      1.70      2.13      3.19
  Σ̂K                            200.82    195.64     57.44     63.09     64.53     60.24     56.20
  Σ−1/2Σ̂KΣ−1/2 − Ip              20.86     14.22      3.29      4.52      4.72      4.69      4.76

p = 300
  Σ̂u,K^T                         12.74     15.20      1.66      1.71      1.78      1.84      1.95
  (Σ̂u,K^T)−1                      7.58      7.80      1.74      2.18      2.58      3.54      5.45
  Σ̂K−1                            7.59      7.49      1.70      2.13      2.49      3.37      5.13
  Σ̂K                            302.16    274.12     87.92     92.47     91.90     83.21     92.50
  Σ−1/2Σ̂KΣ−1/2 − Ip              23.43     16.89      4.38      6.04      6.16      6.14      6.20

Table 3 illustrates several interesting patterns. First of all, the best estimation accuracy is achieved when K = K0. Second, the estimation is robust for K ≥ K0: as K increases from K0, the estimation error becomes larger, but it increases slowly in general, which indicates robustness when a slightly larger K is used. Third, when the number of factors is under-estimated, corresponding to K = 1, 2, all the estimators perform badly, which demonstrates the danger of missing any common factors. Therefore, over-estimating the number of factors, while still maintaining a satisfactory estimation accuracy of the covariance matrices, is much less harmful than under-estimating: the bias caused by under-estimation is more severe than the additional variance introduced by over-estimation. Finally, estimating Σ, the covariance of yt, does not achieve good accuracy even when K = K0 in the absolute term ||Σ̂ − Σ||, but the relative error ||Σ−1/2Σ̂KΣ−1/2 − Ip|| is much smaller. This is consistent with our discussion in Section 3.3.

6.5 Comparisons with Other Methods

6.5.1 Comparison with related methods

We compare POET with related methods that address low-rank plus sparse covariance estimation: LOREC proposed by Luo (2012), the strict factor model (SFM) of Fan, Fan and Lv (2008), the Dual method (Dual) of Lin et al. (2009), and the singular value thresholding (SVT) of Cai, Candès and Shen (2008). In particular, SFM is a special case of POET that employs a large threshold, forcing Σ̂u to be diagonal even when the true Σu is not. Note that Dual, SVT and many other methods dealing with low-rank plus sparse decompositions, such as Candès et al. (2011) and Wright et al. (2009), assume a known Σ and focus on recovering the decomposition. Hence they do not estimate Σ or its inverse, but rather decompose the sample covariance into two components. The resulting sparse component may not be positive definite, which can lead to large estimation errors for $\hat\Sigma_u^{-1}$ and Σ̂−1.

Data are generated from the same setup as Design 2 in Section 6.4. Table 4 reports the average estimation errors of the compared methods, each based on 50 replications. Dual and SVT assume that the data matrix has a low-rank plus sparse representation, which is not the case for the sample covariance matrix (though the population Σ has such a representation). The tuning parameters for POET, LOREC, Dual and SVT are chosen to achieve the best performance for each method.8

Table 4.

Method comparison under the spectral norm for T = 100. RelE denotes the relative error $\|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_p\|$

                     Σ̂u        Σ̂u−1       RelE       Σ̂−1        Σ̂
p = 100   POET      1.624      1.336      2.080      1.309     29.107
          LOREC     2.274      1.880      2.564      1.511     32.365
          SFM       2.084      2.039      2.707      2.022     34.949
          Dual      2.306      5.654      2.707      4.674     29.000
          SVT       2.59      13.64       2.806    103.1       29.670

p = 200   POET      1.641      1.358      3.295      1.346     58.769
          LOREC     2.179      1.767      3.874      1.543     62.731
          SFM       2.098      2.071      3.758      2.065     60.905
          Dual      2.41       6.554      4.541      5.813     56.264
          SVT       2.930    362.5        4.680     47.21      63.670

p = 300   POET      1.662      1.394      4.337      1.395     65.392
          LOREC     2.364      1.635      4.909      1.742     91.618
          SFM       2.091      2.064      4.874      2.061     88.852
          Dual      2.475      2.602      6.190      2.234     74.059
          SVT       2.681     > 10³       6.247     > 10³      80.954

6.5.2 Comparison with direct thresholding

This section compares POET with direct thresholding of the sample covariance matrix without taking out the common factors (Rothman et al., 2009; Cai and Liu, 2011); we denote this method by THR. We also run simulations to examine the finite sample performance when Σ itself is sparse and has bounded eigenvalues, corresponding to the case K = 0. Three models are considered, and both POET and THR use soft thresholding (a sketch of THR is given after the model descriptions). We fix T = 200. Reported results are the averages of 100 replications.

Model 1: one-factor

The factors and loadings are independently generated from N(0, 1). The error covariance is the same banded matrix as Design 2 in Section 6.4. Here Σ has one diverging eigenvalue.

Model 2: sparse covariance

Set K = 0, hence Σ = Σu itself is a banded matrix with bounded eigenvalues.

Model 3: cross-sectional AR(1)

Set K = 0, but Σ = Σu = (0.85^{|i−j|})p×p. Now Σ is no longer sparse (or banded), but is not too dense either, since Σij decreases to zero exponentially fast as |i − j| → ∞. This is the correlation matrix if {yit}i=1,…,p follows a cross-sectional AR(1) process: yit = 0.85 yi−1,t + εit.
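For reference, a minimal sketch of the THR benchmark, which soft-thresholds the off-diagonal entries of the sample covariance with an entry-adaptive threshold, is given below; the constant C and the rate √(log p/T) used here are assumptions of this illustration rather than a prescription from the text.

```python
import numpy as np

def soft_threshold_cov(Y, C=0.5):
    """Directly soft-threshold the off-diagonal entries of the sample covariance of
    a T x p data matrix Y, with an entry-adaptive threshold C*sqrt(theta_ij)*omega."""
    T, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / T
    omega = np.sqrt(np.log(p) / T)                  # rate without a factor-estimation term
    theta = (Yc**2).T @ (Yc**2) / T - S**2          # entrywise variance proxy theta_ij
    tau = C * np.sqrt(np.maximum(theta, 0)) * omega
    S_thr = np.sign(S) * np.maximum(np.abs(S) - tau, 0.0)   # soft thresholding
    np.fill_diagonal(S_thr, np.diag(S))             # keep the diagonal intact
    return S_thr
```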

For each model, POET uses an estimated K̂ based on IC1 of Bai and Ng (2002), while THR thresholds the sample covariance directly. We find that in Model 1, POET performs significantly better than THR, as the latter misses the common factor. For Model 2, IC1 estimates K̂ = 0 precisely in each replication, and hence POET is identical to THR. For Model 3, POET still outperforms THR. The results are summarized in Table 5.

Table 5.

Method comparison, T = 200

                       ||Σ̂ − Σ||            ||Σ̂−1 − Σ−1||          K̂
                       POET      THR        POET      THR
p = 200   Model 1      26.20    240.18       1.31      2.67        1
          Model 2       2.04      2.04       2.07      2.07        0
          Model 3       7.73     11.24       8.48     11.40        6.2

p = 300   Model 1      32.60    314.43       2.18      2.58        1
          Model 2       2.03      2.03       2.08      2.08        0
          Model 3       9.41     11.29       8.81     11.41        5.45

The reported numbers are averages based on 100 replications; the last column gives the average estimated number of factors used by POET.

6.6 Simulated portfolio allocation

We demonstrate the improvement of our method over the sample covariance and over the estimator based on the strict factor model (SFM) in a portfolio allocation problem aimed at risk minimization.

Let Σ̂ be a generic estimator of the covariance matrix of the return vector yt, and let w be the allocation vector of a portfolio consisting of the corresponding p financial securities. Then the theoretical and the empirical risks of this portfolio are R(w) = w′Σw and R̂(w) = w′Σ̂w, respectively. Now, define

$$\hat w = \arg\min_{w'\mathbf{1} = 1} w'\hat\Sigma w,$$

the estimated (minimum variance) portfolio. Then the actual risk of the estimated portfolio is defined as R(ŵ) = ŵ′Σŵ, and the estimated risk (also called the empirical risk) is R̂(ŵ) = ŵ′Σ̂ŵ. In practice, the actual risk is unknown, and only the empirical risk can be calculated.
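Since the constrained minimization above has the familiar closed form ŵ = Σ̂−1**1**/(**1**′Σ̂−1**1**), both risks are straightforward to compute; a short sketch follows, with illustrative names.

```python
import numpy as np

def min_variance_portfolio(Sigma_hat):
    """Solve min_w w' Sigma_hat w subject to w'1 = 1 in closed form."""
    ones = np.ones(Sigma_hat.shape[0])
    x = np.linalg.solve(Sigma_hat, ones)     # Sigma_hat^{-1} 1
    return x / (ones @ x)

def risks(w, Sigma, Sigma_hat):
    """Actual risk R(w) = w' Sigma w and empirical risk R_hat(w) = w' Sigma_hat w."""
    return w @ Sigma @ w, w @ Sigma_hat @ w
```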

For each fixed p, the population Σ was generated in the same way as described in Section 6.1, with a sparse but non-diagonal error covariance. We use three different methods to estimate Σ and obtain ŵ: the strict factor model Σ̂diag (which estimates Σu by a diagonal matrix), our POET estimator Σ̂POET (both with unknown factors), and the sample covariance Σ̂sam. We then calculate the corresponding actual and empirical risks.

It is interesting to examine the actual risk of our portfolio ŵ in comparison with the oracle risk R* = min_{w′1=1} w′Σw, the theoretical risk of the portfolio we would have created had we known the true covariance matrix Σ. We thus compare the regret R(ŵ) − R*, which is always nonnegative, for the three estimators Σ̂. The regrets are summarized by box plots over the 200 simulations and reported in Figure 7. In practice, we are also concerned about the difference between the actual and empirical risks of the chosen portfolio ŵ. Hence, in Figure 8, we also compare the average estimation error |R(ŵ) − R̂(ŵ)| and the average relative estimation error |R̂(ŵ)/R(ŵ) − 1| over the 200 simulations. When ŵ is obtained from the strict factor model, both differences, between actual and oracle risk and between actual and empirical risk, are persistently greater than the corresponding differences for the approximate factor estimator. Also, the relative estimation error of the factor-based method is negligible, whereas the sample covariance does not possess such a property.

Figure 7.

Box plots of regrets R(ŵ) − R* for p = 80 and 140. In each panel, the box plots from left to right correspond to ŵ obtained using Σ̂ based on approximate factor model, strict factor model, and sample covariance, respectively.

Figure 8.

Estimation errors for risk assessment as a function of the portfolio size p. The left panel plots the average absolute error |R(ŵ) − R̂(ŵ)| and the right panel depicts the average relative error |R̂(ŵ)/R(ŵ) − 1|. Here, ŵ and R̂ are obtained based on the three estimators of Σ̂.

7 Real Data Example

We demonstrate the sparsity of the approximate factor model on real data, and present the improvement of the POET estimator over the strict factor model (SFM) in a real-world application of portfolio allocation.

7.1 Sparsity of Idiosyncratic Errors

The data were obtained from the CRSP (Center for Research in Security Prices) database and consist of p = 50 stocks and their annualized daily returns for the period January 1st, 2010 to December 31st, 2010 (T = 252). The stocks are chosen from 5 different industry sectors (more specifically, Consumer Goods-Textile & Apparel Clothing, Financial-Credit Services, Healthcare-Hospitals, Services-Restaurants, Utilities-Water Utilities), with 10 stocks from each sector. We made this selection to demonstrate a block-diagonal pattern of sparsity. More specifically, we show that the non-zero elements are clustered mainly among companies in the same industry. We also notice that these are the same groups that show predominantly positive correlation.

The largest eigenvalues of the sample covariance equal 0.0102, 0.0045 and 0.0039, while the rest are bounded by 0.0020. Hence K = 0, 1, 2, 3 are the possible values of the number of factors. Figure 9 shows the heatmap of the thresholded error correlation matrix (for simplicity, we applied hard thresholding). The threshold has been chosen using the cross validation as described in Section 4. We compare the level of sparsity (percentage of non-zero off-diagonal elements) for the 5 diagonal blocks of size 10 × 10, versus the sparsity of the rest of the matrix. For K = 2, our method results in 25.8% non-zero off-diagonal elements in the 5 diagonal blocks, as opposed to 7.3% non-zero elements in the rest of the covariance matrix. Note that, out of the non-zero elements in the central 5 blocks, 100% are positive, as opposed to a distribution of 60.3% positive and 39.7% negative amongst the non-zero elements in off-diagonal blocks. There is a strong positive correlation between the returns of companies in the same industry after the common factors are taken out, and the thresholding has preserved them. The results for K = 1, 2 and 3 show the same characteristics. These provide stark evidence that the strict factor model is not appropriate.
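The within-block versus between-block sparsity levels reported above can be computed from the thresholded error correlation matrix as in the following sketch, assuming the 50 stocks are ordered so that each consecutive group of 10 belongs to one sector; the names are illustrative.

```python
import numpy as np

def block_sparsity(R_thr, block_size=10):
    """Percentage of non-zero off-diagonal entries inside the diagonal sector blocks
    versus in the rest of a thresholded correlation matrix R_thr."""
    p = R_thr.shape[0]
    sectors = np.arange(p) // block_size
    same_block = sectors[:, None] == sectors[None, :]
    off_diag = ~np.eye(p, dtype=bool)
    nz = R_thr != 0
    within = nz[same_block & off_diag].mean() * 100    # within-sector off-diagonal
    between = nz[~same_block].mean() * 100             # cross-sector entries
    return within, between
```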

Figure 9.

Heatmap of thresholded error correlation matrix for number of factors K = 0, K = 1, K = 2 and K = 3.

7.2 Portfolio Allocation

We extend the data set by considering larger industrial portfolios (p = 100) and a longer period (ten years): annualized daily excess returns from January 1st, 2000 to December 31st, 2010. Two portfolios are created at the beginning of each month, based on two different covariance estimates obtained through the approximate and strict factor models with unknown factors. At the end of each month, we compare the risks of the two portfolios.

The number of factors is determined using the penalty function proposed by Bai and Ng (2002), as defined in (2.14). For calibration, we use the last 100 consecutive business days of the above data; both IC1 and IC2 give K̂ = 3. On the 1st of each month, we estimate Σ̂diag (SFM) and Σ̂ (POET with soft thresholding) using the historical excess daily returns for the preceding 12 months (T = 252). The value of the threshold is determined by the cross-validation procedure. We minimize the empirical risk of each portfolio to obtain the two respective optimal portfolio allocations ŵ1 and ŵ2 (based on Σ̂diag and Σ̂, respectively): ŵ = arg min_{w′1=1} w′Σ̂w. At the end of the month (21 trading days), their actual risks are compared, calculated by

$$R_i = \hat w_i'\Big(\frac{1}{21}\sum_{t=1}^{21} y_t y_t'\Big)\hat w_i, \qquad i = 1, 2.$$
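A schematic of this monthly rebalancing loop is sketched below; `returns` is assumed to be a T × p array of daily excess returns, `estimate_cov` stands for any of the covariance estimators being compared (e.g. POET or SFM), and calendar months are approximated by 21-trading-day blocks, so the interface and names are illustrative.

```python
import numpy as np

def rolling_portfolio_risks(returns, estimate_cov, window=252, hold=21):
    """At the start of each holding period, estimate the covariance from the trailing
    `window` days, form the minimum-variance portfolio, and record the realized risk
    w' (1/hold * sum_t y_t y_t') w over the next `hold` trading days."""
    realized = []
    p = returns.shape[1]
    ones = np.ones(p)
    for start in range(window, returns.shape[0] - hold + 1, hold):
        Sigma_hat = estimate_cov(returns[start - window:start])
        x = np.linalg.solve(Sigma_hat, ones)
        w = x / (ones @ x)                         # minimum-variance weights
        held = returns[start:start + hold]
        realized.append(w @ (held.T @ held / hold) @ w)
    return np.array(realized)
```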

We can see from Figure 10 that the minimum-risk portfolio created by the POET estimator performs significantly better, achieving lower variance 76% of the time. Amongst those months, the risk is decreased by 48.63%. On the other hand, during the months that POET produces a higher-risk portfolio, the risk is increased by only 17.66%.

Figure 10.

Risk of portfolios created with POET and SFM (strict factor model)

Next, we demonstrate the impact of the choice of the number of factors and of the threshold on the performance of POET. If cross-validation is computationally too expensive, a common soft threshold can be used throughout the whole investment process. The average constant selected by cross-validation was 0.53, close to our suggested constant 0.5 used in the simulations. We also present the results based on various choices of the constant C, with soft threshold $C\sqrt{\hat\theta_{ij}}\,\omega_T$. The results are summarized in Table 6. The performance of POET is consistent across these choices of the parameters.

Table 6.

Comparison of the risks of portfolios using POET and SFM. In each cell, the first number is the proportion of months in which POET outperforms SFM, and the second is the percentage of average risk improvement. C is the constant in the threshold.

C         K = 1           K = 2           K = 3
0.25      0.58 / 29.6%    0.68 / 38%      0.71 / 33%
0.5       0.66 / 31.7%    0.70 / 38.2%    0.75 / 33.5%
0.75      0.68 / 29.3%    0.70 / 29.6%    0.71 / 25.1%
1         0.66 / 20.7%    0.62 / 19.4%    0.69 / 18%

8 Conclusion and Discussion

We study the problem of estimating a high-dimensional covariance matrix with conditional sparsity. Realizing that an unconditional sparsity assumption is inappropriate in many applications, we introduce a latent factor model that has a conditional sparsity feature, and propose the POET estimator to take advantage of this structure. This considerably expands the scope of models based on the strict factor model, which assumes independent idiosyncratic noise and is too restrictive in practice. By assuming a sparse error covariance matrix, we allow for the presence of cross-sectional correlation even after the common factors are taken out. The sparse error covariance is estimated by the adaptive thresholding technique.

It is found that the rates of convergence of the estimators have an extra term approximately Op(p−1/2) in addition to the results based on observable factors by Fan et al. (2008, 2011), which arises from the effect of estimating the unobservable factors. As we can see, this effect vanishes as the dimensionality increases, as more information about the common factors becomes available. When p gets large enough, the effect of estimating the unknown factors is negligible, and we estimate the covariance matrices as if we knew the factors.

The proposed POET estimator also has wide applicability in statistical genomics. For example, Carvalho et al. (2008) applied a Bayesian sparse factor model to study breast cancer hormonal pathways. Their real-data results identified about two common factors on which a large fraction of the genes (about half of the 250 genes considered) load heavily. As a result, these factors should be treated as "pervasive" (see the explanation in Example 2.1), which results in one or two very spiked eigenvalues of the covariance matrix of the gene expressions. POET can be applied to estimate such a covariance matrix and its network model.

Acknowledgments

The research was partially supported by NIH R01GM100474-01, NIH R01-GM072611, DMS-0704337, and Bendheim Center for Finance at Princeton University.

APPENDIX

A Estimating a sparse covariance with contaminated data

We estimate Σu by applying the adaptive thresholding given by (2.11). However, the task here is slightly different from the standard problem of estimating a sparse covariance matrix in the literature, as no direct observations of $\{u_t\}_{t=1}^T$ are available. In many cases the original data are contaminated; this includes any situation in which the data must themselves be estimated because direct observations are unavailable. This typically happens when $\{u_t\}_{t=1}^T$ represent the error terms in regression models or when the data are subject to measurement errors. Instead, we observe $\{\hat u_t\}_{t=1}^T$. For instance, in the approximate factor model, $\hat u_{it} = y_{it} - \hat b_i'\hat f_t$.

We can estimate Σu using the adaptive thresholding proposed by Cai and Liu (2011): for the threshold $\tau_{ij} = C\sqrt{\hat\theta_{ij}}\,\omega_T$, define

$$\hat\sigma_{ij} = \frac{1}{T}\sum_{t=1}^T \hat u_{it}\hat u_{jt}, \qquad \hat\theta_{ij} = \frac{1}{T}\sum_{t=1}^T\big(\hat u_{it}\hat u_{jt} - \hat\sigma_{ij}\big)^2, \tag{A.1}$$
$$\hat\Sigma_u^{\mathcal T} = \big(s_{ij}(\hat\sigma_{ij})\big)_{p\times p},$$

where sij(·) satisfies: for all z ∈ ℝ, sij(z) = 0 when |z| ≤ τij, and |sij(z) − z| ≤ τij.
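As an illustration, (A.1) can be implemented with the soft-thresholding rule sij(z) = sgn(z)(|z| − τij)+, which satisfies the two stated conditions; the sketch below takes the estimated residuals as input and leaves the diagonal un-thresholded, which is an assumption of this illustration since the generic rule is stated entrywise.

```python
import numpy as np

def adaptive_threshold(U_hat, omega_T, C=0.5):
    """Adaptive thresholding estimator in the spirit of (A.1) from estimated residuals
    U_hat (T x p), using s_ij(z) = sign(z)*(|z| - tau_ij)_+ with tau_ij = C*sqrt(theta_ij)*omega_T."""
    T, p = U_hat.shape
    Sigma_u = U_hat.T @ U_hat / T                                  # sigma_hat_ij
    theta = (U_hat**2).T @ (U_hat**2) / T - Sigma_u**2             # theta_hat_ij
    tau = C * np.sqrt(np.maximum(theta, 0)) * omega_T
    S = np.sign(Sigma_u) * np.maximum(np.abs(Sigma_u) - tau, 0.0)  # soft thresholding
    np.fill_diagonal(S, np.diag(Sigma_u))                          # diagonal kept (illustrative choice)
    return S
```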

When $\{\hat u_t\}_{t=1}^T$ is close enough to $\{u_t\}_{t=1}^T$, we can show that $\hat\Sigma_u^{\mathcal T}$ is also consistent. The following theorem extends the standard thresholding results in Bickel and Levina (2008) and Cai and Liu (2011) to the case when no direct observations are available, or the original data are contaminated. For the tail and mixing parameters r1 and r3 defined in Assumptions 3.2 and 3.3, let $\alpha = 3r_1^{-1} + r_3^{-1} + 1$.

Theorem A.1

Suppose $(\log p)^{6\alpha} = o(T)$, and Assumptions 3.2 and 3.3 hold. In addition, suppose there is a sequence aT = o(1) such that $\max_{i\le p}\frac{1}{T}\sum_{t=1}^T|u_{it} - \hat u_{it}|^2 = O_p(a_T^2)$ and $\max_{i\le p,\,t\le T}|u_{it} - \hat u_{it}| = o_p(1)$. Then there is a constant C > 0 in the adaptive thresholding estimator (A.1) with

$$\omega_T = \sqrt{\frac{\log p}{T}} + a_T$$

such that

$$\big\|\hat\Sigma_u^{\mathcal T} - \Sigma_u\big\| = O_p\big(\omega_T^{1-q}m_p\big).$$

If further $\omega_T m_p = o(1)$, then $\hat\Sigma_u^{\mathcal T}$ is invertible with probability approaching one, and

$$\big\|(\hat\Sigma_u^{\mathcal T})^{-1} - \Sigma_u^{-1}\big\| = O_p\big(\omega_T^{1-q}m_p\big).$$
Proof

By Assumptions 3.2 and 3.3, the conditions of Lemmas A.3 and A.4 of Fan, Liao and Mincheva (2011, Ann. Statist, 39, 3320–3356) are satisfied. Hence for any ε > 0, there are positive constants M, θ1 and θ2 such that each of the events

$$A_1 = \Big\{\max_{i\le p,\,j\le p}|\hat\sigma_{ij} - \sigma_{u,ij}| < M\omega_T\Big\}, \qquad A_2 = \Big\{\theta_1 > \hat\theta_{ij} > \theta_2 \ \text{for all } i\le p,\ j\le p\Big\}$$

occurs with probability at least 1 − ε. By the condition on the threshold function, $s_{ij}(t) = s_{ij}(t)\,I\big(|t| > C\omega_T\sqrt{\hat\theta_{ij}}\big)$. Now for $C = \theta_2^{-1/2}M$, under the event A1 ∩ A2,

$$\begin{aligned}
\big\|\hat\Sigma_u^{\mathcal T} - \Sigma_u\big\| &\le \max_{i\le p}\sum_{j=1}^p\big|s_{ij}(\hat\sigma_{ij}) - \sigma_{u,ij}\big|\\
&= \max_{i\le p}\sum_{j=1}^p\Big|s_{ij}(\hat\sigma_{ij})I\big(|\hat\sigma_{ij}| > C\omega_T\sqrt{\hat\theta_{ij}}\big) - \sigma_{u,ij}I\big(|\hat\sigma_{ij}| > C\omega_T\sqrt{\hat\theta_{ij}}\big) - \sigma_{u,ij}I\big(|\hat\sigma_{ij}| \le C\omega_T\sqrt{\hat\theta_{ij}}\big)\Big|\\
&\le \max_{i\le p}\sum_{j=1}^p\big|s_{ij}(\hat\sigma_{ij}) - \hat\sigma_{ij}\big|I\big(|\hat\sigma_{ij}| > C\omega_T\sqrt{\hat\theta_{ij}}\big) + \sum_{j=1}^p\big|\hat\sigma_{ij} - \sigma_{u,ij}\big|I\big(|\hat\sigma_{ij}| > C\omega_T\sqrt{\hat\theta_{ij}}\big) + \sum_{j=1}^p|\sigma_{u,ij}|\,I\big(|\hat\sigma_{ij}| \le C\omega_T\sqrt{\hat\theta_{ij}}\big)\\
&\le \max_{i\le p}\sum_{j=1}^p C\omega_T\sqrt{\hat\theta_{ij}}\,I\big(|\hat\sigma_{ij}| > C\omega_T\sqrt{\theta_2}\big) + M\omega_T\sum_{j=1}^p I\big(|\hat\sigma_{ij}| > C\omega_T\sqrt{\theta_2}\big) + \sum_{j=1}^p|\sigma_{u,ij}|\,I\big(|\hat\sigma_{ij}| \le C\omega_T\sqrt{\theta_1}\big)\\
&\le \big(C\sqrt{\theta_1} + M\big)\omega_T\max_{i\le p}\sum_{j=1}^p I\big(|\sigma_{u,ij}| > M\omega_T\big) + \max_{i\le p}\sum_{j=1}^p|\sigma_{u,ij}|\,I\big(|\sigma_{u,ij}| \le C\omega_T\sqrt{\theta_1} + M\omega_T\big)\\
&\le \big(C\sqrt{\theta_1} + M\big)\omega_T\max_{i\le p}\sum_{j=1}^p\frac{|\sigma_{u,ij}|^q}{M^q\omega_T^q}\,I\big(|\sigma_{u,ij}| > M\omega_T\big) + \max_{i\le p}\sum_{j=1}^p|\sigma_{u,ij}|\,\frac{(C\sqrt{\theta_1} + M)^{1-q}\omega_T^{1-q}}{|\sigma_{u,ij}|^{1-q}}\,I\big(|\sigma_{u,ij}| \le (C\sqrt{\theta_1} + M)\omega_T\big)\\
&\le \frac{C\sqrt{\theta_1} + M}{M^q}\,\omega_T^{1-q}\max_{i\le p}\sum_{j=1}^p|\sigma_{u,ij}|^q + \max_{i\le p}\sum_{j=1}^p|\sigma_{u,ij}|^q\,(C\sqrt{\theta_1} + M)^{1-q}\omega_T^{1-q}\\
&= m_p\,\omega_T^{1-q}\,\big(C\sqrt{\theta_1} + M\big)\Big(M^{-q} + \big(C\sqrt{\theta_1} + M\big)^{-q}\Big).
\end{aligned}$$

Let $M_1 = (C\sqrt{\theta_1} + M)\big(M^{-q} + (C\sqrt{\theta_1} + M)^{-q}\big)$. Then with probability at least 1 − 2ε, $\|\hat\Sigma_u^{\mathcal T} - \Sigma_u\| \le m_p\omega_T^{1-q}M_1$. Since ε is arbitrary, we have $\|\hat\Sigma_u^{\mathcal T} - \Sigma_u\| = O_p(\omega_T^{1-q}m_p)$. If in addition $\omega_T m_p = o(1)$, then the minimum eigenvalue of $\hat\Sigma_u^{\mathcal T}$ is bounded away from zero with probability approaching one, since $\lambda_{\min}(\Sigma_u) > c_1$. This then implies $\|(\hat\Sigma_u^{\mathcal T})^{-1} - \Sigma_u^{-1}\| = O_p(\omega_T^{1-q}m_p)$.

B Proofs for Section 2

We first cite two useful results, which are needed to prove Propositions 2.1 and 2.2. In Lemma B.1 below, let $\{\lambda_i\}_{i=1}^p$ be the eigenvalues of Σ in descending order and $\{\xi_i\}_{i=1}^p$ be their associated eigenvectors. Correspondingly, let $\{\hat\lambda_i\}_{i=1}^p$ be the eigenvalues of Σ̂ in descending order and $\{\hat\xi_i\}_{i=1}^p$ be their associated eigenvectors.

Lemma B.1
  1. (Weyl's theorem) $|\hat\lambda_i - \lambda_i| \le \|\hat\Sigma - \Sigma\|$.

  2. (sin θ theorem, Davis and Kahan, 1970)
    $$\|\hat\xi_i - \xi_i\| \le \frac{\sqrt{2}\,\|\hat\Sigma - \Sigma\|}{\min\big(|\hat\lambda_{i-1} - \lambda_i|,\ |\lambda_i - \hat\lambda_{i+1}|\big)}.$$
Proof of Proposition 2.1
Proof

Since $\{\lambda_j\}_{j=1}^p$ are the eigenvalues of Σ and $\{\|\tilde b_j\|^2\}_{j=1}^K$ are the first K eigenvalues of BB′ (where $\tilde b_j$ denotes the j-th column of B; the remaining p − K eigenvalues are zero), by Weyl's theorem, for each j ≤ K,

$$\big|\lambda_j - \|\tilde b_j\|^2\big| \le \|\Sigma - BB'\| = \|\Sigma_u\|.$$

For j > K, $|\lambda_j| = |\lambda_j - 0| \le \|\Sigma_u\|$. On the other hand, the first K eigenvalues of BB′ are also the eigenvalues of B′B. By the assumption, the eigenvalues of p−1B′B are bounded away from zero. Thus when j ≤ K, $\|\tilde b_j\|^2/p$ is bounded away from zero for all large p.

Proof of Proposition 2.2
Proof

Applying the sin θ theorem yields

$$\Big\|\xi_j - \tilde b_j/\|\tilde b_j\|\Big\| \le \frac{\sqrt{2}\,\|\Sigma_u\|}{\min\big(|\lambda_{j-1} - \|\tilde b_j\|^2|,\ |\|\tilde b_j\|^2 - \lambda_{j+1}|\big)}.$$

For a generic constant c > 0, $|\lambda_{j-1} - \|\tilde b_j\|^2| \ge |\|\tilde b_{j-1}\|^2 - \|\tilde b_j\|^2| - |\lambda_{j-1} - \|\tilde b_{j-1}\|^2| \ge cp$ for all large p, since $|\|\tilde b_{j-1}\|^2 - \|\tilde b_j\|^2| \ge cp$ while $|\lambda_{j-1} - \|\tilde b_{j-1}\|^2|$ is bounded, by Proposition 2.1. On the other hand, if j < K, the same argument implies $|\|\tilde b_j\|^2 - \lambda_{j+1}| \ge cp$. If j = K, $|\|\tilde b_K\|^2 - \lambda_{K+1}| = p\,\big|\|\tilde b_K\|^2/p - \lambda_{K+1}/p\big|$, where $\|\tilde b_K\|^2/p$ is bounded away from zero while $\lambda_{K+1}/p = O(p^{-1})$. Hence again, $|\|\tilde b_j\|^2 - \lambda_{j+1}| \ge cp$.

Proof of Theorem 2.1
Proof

The sample covariance matrix of the residuals from the least-squares method is given by
$$\hat\Sigma_u = \frac{1}{T}\big(Y - \hat\Lambda\hat F'\big)\big(Y - \hat\Lambda\hat F'\big)' = \frac{1}{T}YY' - \hat\Lambda\hat\Lambda',$$
where we used the normalization condition F̂′F̂ = TIK and Λ̂ = YF̂/T. If we show that $\hat\Lambda\hat\Lambda' = \sum_{i=1}^K\hat\lambda_i\hat\xi_i\hat\xi_i'$, then from the decomposition of the sample covariance

$$\frac{1}{T}YY' = \hat\Lambda\hat\Lambda' + \hat\Sigma_u = \sum_{i=1}^K\hat\lambda_i\hat\xi_i\hat\xi_i' + \hat R,$$

we have R̂ = Σ̂u. Consequently, applying thresholding to Σ̂u is equivalent to applying thresholding to R̂, which gives the desired result.

We now show that $\hat\Lambda\hat\Lambda' = \sum_{i=1}^K\hat\lambda_i\hat\xi_i\hat\xi_i'$ indeed holds. Consider again the least squares problem (2.8) but with the following alternative normalization constraints: $\frac{1}{p}\sum_{i=1}^p b_ib_i' = I_K$, and $\frac{1}{T}\sum_{t=1}^T f_tf_t'$ is diagonal. Let (Λ̃, F̃) be the solution to the new optimization problem. Switching the roles of B and F, the solution of (2.10) is Λ̃ = (ξ̂1, · · ·, ξ̂K) and F̃ = p−1Y′Λ̃. In addition, T−1F̃′F̃ = diag(λ̂1, · · ·, λ̂K). From Λ̂F̂′ = Λ̃F̃′, it follows that $\hat\Lambda\hat\Lambda' = \frac{1}{T}\hat\Lambda\hat F'\hat F\hat\Lambda' = \frac{1}{T}\tilde\Lambda\tilde F'\tilde F\tilde\Lambda' = \sum_{i=1}^K\hat\lambda_i\hat\xi_i\hat\xi_i'$.

C Proofs for Section 3

We will proceed by subsequently showing Theorems 3.3, 3.1 and 3.2.

C.1 Preliminary lemmas

The following results are to be used subsequently. The proofs of Lemmas C.1, C.2 and C.3 are found in Fan, Liao and Mincheva (2011).

Lemma C.1

Suppose A, B are symmetric semi-positive definite matrices, and λmin(B) > cT for a sequence cT > 0. If ||A − B|| = op(cT), then λmin(A) > cT/2, and

$$\|A^{-1} - B^{-1}\| = O_p(c_T^{-2})\,\|A - B\|.$$
Lemma C.2

Suppose that the random variables Z1, Z2 both satisfy the exponential-type tail condition: There exist r1, r2 ∈ (0, 1) and b1, b2 > 0, such that ∀s > 0,

$$P(|Z_i| > s) \le \exp\big(-(s/b_i)^{r_i}\big), \qquad i = 1, 2.$$

Then for some r3 and b3 > 0, and any s > 0,

$$P(|Z_1Z_2| > s) \le \exp\big(1 - (s/b_3)^{r_3}\big). \tag{C.1}$$
Lemma C.3

Under the assumptions of Theorem 3.1,

  1. $\max_{i,j\le K}\Big|\frac{1}{T}\sum_{t=1}^T f_{it}f_{jt} - Ef_{it}f_{jt}\Big| = O_p\big(\sqrt{1/T}\big)$.

  2. $\max_{i,j\le p}\Big|\frac{1}{T}\sum_{t=1}^T u_{it}u_{jt} - Eu_{it}u_{jt}\Big| = O_p\big(\sqrt{(\log p)/T}\big)$.

  3. $\max_{i\le K,\,j\le p}\Big|\frac{1}{T}\sum_{t=1}^T f_{it}u_{jt}\Big| = O_p\big(\sqrt{(\log p)/T}\big)$.

Lemma C.4

Let λ̂K denote the Kth largest eigenvalue of $\hat\Sigma_{sam} = \frac{1}{T}\sum_{t=1}^T y_ty_t'$; then λ̂K > C1p with probability approaching one for some C1 > 0.

Proof

First of all, by Proposition 2.1, under Assumption 3.1, the Kth largest eigenvalue λK of Σ satisfies: for some c > 0,

$$\lambda_K \ge \|\tilde b_K\|^2 - \big|\lambda_K - \|\tilde b_K\|^2\big| \ge cp - \|\Sigma_u\| \ge cp/2$$

for sufficiently large p. Using Weyl's theorem, we only need to prove that ||Σ̂sam − Σ|| = op(p). Without loss of generality, we prove the result under the identifiability condition (2.1). Using model (1.2), $\hat\Sigma_{sam} = T^{-1}\sum_{t=1}^T(Bf_t + u_t)(Bf_t + u_t)'$. Using this and (1.3), Σ̂sam − Σ can be decomposed as the sum of four terms:

$$D_1 = B\Big(T^{-1}\sum_{t=1}^T f_tf_t' - I_K\Big)B', \qquad D_2 = T^{-1}\sum_{t=1}^T(u_tu_t' - \Sigma_u), \qquad D_3 = BT^{-1}\sum_{t=1}^T f_tu_t', \qquad D_4 = D_3'.$$

We now deal with them term by term. We will repeatedly use the fact that for a p × p matrix A,

$$\|A\| \le p\|A\|_{\max}.$$

First of all, by Lemma C.3, $\|T^{-1}\sum_{t=1}^T f_tf_t' - I_K\| \le K\|T^{-1}\sum_{t=1}^T f_tf_t' - I_K\|_{\max} = O_p(\sqrt{1/T})$, which is op(p) if K log p = o(T). Consequently, by Assumption 3.1, we have

$$\|D_1\| \le O_p\big(\sqrt{K(\log K)/T}\big)\,\|BB'\| = O_p\big(p\sqrt{1/T}\big).$$

We now deal with D2. It follows from Lemma C.3 that

$$\|D_2\| \le p\,\Big\|T^{-1}\sum_{t=1}^T(u_tu_t' - \Sigma_u)\Big\|_{\max} = O_p\big(p\sqrt{(\log p)/T}\big).$$

Since ||D4|| = ||D3||, it remains to deal with D3, which is bounded by

$$\|D_3\| \le \Big\|T^{-1}\sum_{t=1}^T f_tu_t'\Big\|\,\|B\| = O_p\big(p\sqrt{(\log p)/T}\big),$$

which is op(p) since log p = o(T).

Lemma C.5

Under Assumption 3.3, $\max_{t\le T}\sum_{s=1}^T|Eu_s'u_t|/p = O(1)$.

Proof

Since $\{u_t\}_{t=1}^T$ is weakly stationary, $\max_{t\le T}\sum_{s=1}^T|Eu_s'u_t|/p \le 2\sum_{t=1}^\infty|Eu_1'u_t|/p$. In addition, $E|u_{it}|^4 < M$ for some constant M and any i, t, since uit has an exponential-type tail. Hence by Davydov's inequality (Corollary 16.2.4 in Athreya and Lahiri, 2006), there is a constant C > 0 such that for all i ≤ p, t ≤ T, $|Eu_{i1}u_{it}| \le C\alpha(t)$, where α(t) is the α-mixing coefficient. By Assumption 3.3, $\sum_{t=1}^\infty\alpha(t) < \infty$. Thus, uniformly in T,

$$\max_{t\le T}\sum_{s=1}^T|Eu_s'u_t|/p \le 2\sum_{t=1}^\infty|Eu_1'u_t|/p \le 2\sum_{t=1}^\infty\max_{i\le p}|Eu_{i1}u_{it}| \le 2C\sum_{t=1}^\infty\alpha(t) < \infty.$$

C.2 Proof of Theorem 3.3

Our derivation below relies on a result obtained by Bai and Ng (2002), who showed that the estimated number of factors is consistent, in the sense that K̂ equals the true K with probability approaching one. Note that under our Assumptions 3.1–3.4, all the assumptions in Bai and Ng (2002) are satisfied. Thus we immediately have the following lemma.

Lemma C.6 (Theorem 2 in Bai and Ng (2002))

For K̂ defined in (2.14),

$$P(\hat K = K) \to 1.$$
Proof

See Bai and Ng (2002).

Using (A.1) in Bai (2003), we have the following identity:

$$\hat f_t - Hf_t = (V/p)^{-1}\Big(\frac{1}{T}\sum_{s=1}^T\hat f_s E(u_s'u_t)/p + \frac{1}{T}\sum_{s=1}^T\hat f_s\zeta_{st} + \frac{1}{T}\sum_{s=1}^T\hat f_s\eta_{st} + \frac{1}{T}\sum_{s=1}^T\hat f_s\xi_{st}\Big) \tag{C.2}$$

where $\zeta_{st} = u_s'u_t/p - E(u_s'u_t)/p$, $\eta_{st} = f_s'\sum_{i=1}^p b_iu_{it}/p$, and $\xi_{st} = f_t'\sum_{i=1}^p b_iu_{is}/p$.

We first prove some preliminary results in the following lemmas. Denote $\hat f_t = (\hat f_{1t}, \dots, \hat f_{\hat K t})'$.

Lemma C.7

For all i,

  1. $\frac{1}{T}\sum_{t=1}^T\Big(\frac{1}{T}\sum_{s=1}^T\hat f_{is}E(u_s'u_t)/p\Big)^2 = O_p(T^{-1})$,

  2. $\frac{1}{T}\sum_{t=1}^T\Big(\frac{1}{T}\sum_{s=1}^T\hat f_{is}\zeta_{st}\Big)^2 = O_p(p^{-1})$,

  3. $\frac{1}{T}\sum_{t=1}^T\Big(\frac{1}{T}\sum_{s=1}^T\hat f_{is}\eta_{st}\Big)^2 = O_p(p^{-1})$,

  4. $\frac{1}{T}\sum_{t=1}^T\Big(\frac{1}{T}\sum_{s=1}^T\hat f_{is}\xi_{st}\Big)^2 = O_p(p^{-1})$.

Proof
  1. We have ∀i, s=1Tf^is2=T. By the Cauchy-Schwarz inequality,
    1Tt=1T(1Ts=1Tf^isE(usut)/p)21Tt=1T1Ts=1T(Eusut/p)2maxtT1Ts=1T(Eusut/p)2maxs,tEusut/pmaxtT1Ts=1TEusut/p

    By Lemma C.5, maxtTs=1TEusut/p=O(1), which then yields the result.

  2. By the Cauchy-Schwarz inequality,
    1Tt=1T(1Ts=1Tf^isζst)2=1T3s=1Tl=1Tf^isf^il(t=1Tζstζlt)1T3(sl(f^isf^il)2sl(t=1Tζstζlt)2)1/21T3s=1Tf^is2(sl(t=1Tζstζlt)2)1/2=1T2(s=1Tl=1T(t=1Tζstζlt)2)1/2.

    Note that E(s=1Tl=1T(t=1Tζstζlt)2)=T2E(t=1Tζstζlt)2T4maxstEζst4. By Assumption 3.4, maxstEζst4=O(p-2), which implies that s,l(t=1Tζstζlt)2=Op(T4/p2), and yields the result.

  3. By definition, ηst=fsi=1pbiuit/p. We first bound || i=1pbiuit||. Assumption 3.4 implies E1Tt=1T||i=1pbiuit||2=E||i=1pbiuit||2=O(p). Therefore, by the Cauchy-Schwarz inequality,
    1Tt=1T(1Ts=1Tf^isηst)2||1Ts=1Tf^isfs||21Tt=1T||j=1pbjujt1p||21Tp2t=1T||j=1pbjujt||2(1Ts=1Tf^is21Ts=1T||fs||2)=Op(1p).
  4. Similar to part (iii), noting that ξst is a scalar, we have:
    1Tt=1T(1Ts=1Tf^isξst)2=1Tt=1T|1Ts=1Tftj=1pbjujs1pf^is|21Tt=1T||ft||2·1Ts=1Tj=1pbjujs1pf^is2Op(1)1Ts=1Tj=1pbjujs1p2·1Ts=1Tf^is2Op(1p),

    where the third line follows from the Cauchy-Schwarz inequality.

Lemma C.8
  1. $\max_{t\le T}\Big\|\frac{1}{Tp}\sum_{s=1}^T\hat f_sE(u_s'u_t)\Big\| = O_p\big(\sqrt{1/T}\big)$,

  2. $\max_{t\le T}\Big\|\frac{1}{T}\sum_{s=1}^T\hat f_s\zeta_{st}\Big\| = O_p\big(T^{1/4}/\sqrt{p}\big)$,

  3. $\max_{t\le T}\Big\|\frac{1}{T}\sum_{s=1}^T\hat f_s\eta_{st}\Big\| = O_p\big(T^{1/4}/\sqrt{p}\big)$,

  4. $\max_{t\le T}\Big\|\frac{1}{T}\sum_{s=1}^T\hat f_s\xi_{st}\Big\| = O_p\big(T^{1/4}/\sqrt{p}\big)$.

Proof
  1. By the Cauchy-Schwarz inequality and the fact that 1Tt=1T||f^t||2=Op(1),
    maxtT||1Tps=1Tf^sE(usut)||maxtT(1Ts=1T||f^s||21Ts=1T(Eusut/p)2)1/2Op(1)maxtT(1Ts=1T(Eusut/p)2)1/2Op(1)maxs,tEusut/pmaxtT(1Ts=1TEusut/p)1/2.

    The result then follows from Assumption 3.3.

  2. By the Cauchy-Schwarz inequality,
    maxtT||1Ts=1Tf^sζst||maxtT1T(s=1T||f^s||2s=1Tζst2)1/2(Op(1)maxt1Ts=1Tζst2)1/2.

    It follows from Assumption 3.4 that E(1Ts=1Tζst2)2maxs,tTEζst4=O(1p2). It then follows from the Chebyshev’s inequality and Bonferroni’s method that 1Ts=1Tζst2=Op(T/p).

  3. By Assumption 3.4, E||1pi=1pbiuit||4K2M. Chebyshev’s inequality and Bonferroni’s method yield maxtT||i=1pbiuit||=Op(T1/4p) with probability one, which then implies: maxtT||1Ts=1Tf^sηst||||1Ts=1Tf^sfs||maxt||1pi=1pbiuit||=op(T1/4/p1/2).

  4. By the Cauchy-Schwarz inequality and Assumption 3.4, we have demonstrated that ||1Ts=1Ti=1pbiuis1pf^s||=Op(p-1/2). In addition, since E||K−2 ft||4 < M, maxtT ||ft|| = Op(T1/4). It follows that maxtT||1Ts=1Tf^sξst||maxtT||ft||·||1Ts=1Ti=1pbiuis1pf^s||=Op(T1/4/p1/2).

Lemma C.9
  1. $\max_{i\le K}\frac{1}{T}\sum_{t=1}^T(\hat f_t - Hf_t)_i^2 = O_p(1/T + 1/p)$.

  2. $\frac{1}{T}\sum_{t=1}^T\|\hat f_t - Hf_t\|^2 = O_p(1/T + 1/p)$.

  3. $\max_{t\le T}\|\hat f_t - Hf_t\| = O_p\big(\sqrt{1/T} + T^{1/4}/\sqrt{p}\big)$.

Proof

We prove this lemma conditioning on the event K̂ = K. Once this is done, since P(K̂ ≠ K) = o(1), the unconditional statements follow.

  1. When = K, by Lemma C.4, all the eigenvalues of V/p are bounded away from zero. Using the inequality (a + b + c + d)2 ≤ 4(a2 + b2 + c2 + d2) and the identity (C.2), we have, for some constant C > 0,
    maxiK1Tt=1T(f^t-Hft)i2CmaxiK1Tt=1T(1Ts=1Tf^isE(usut)/p)2+CmaxiK1Tt=1T(1Ts=1Tf^isζst)2+CmaxiK1Tt=1T(1Ts=1Tf^isηst)2+CmaxiK1Tt=1T(1Ts=1Tf^isξst)2.

    Each of the four terms on the right hand side above are bounded in Lemma C.7, which then yields the desired result.

  2. It follows from part (i) and that 1Tt=1T||f^t-Hft||2KmaxiK1Tt=1T(f^t-Hft)i2. Part (iii) is implied by (C.2) and Lemma C.8.

Lemma C.10
  1. $HH' = I_{\hat K} + O_p\big(1/\sqrt{T} + 1/\sqrt{p}\big)$.

  2. $H'H = I_K + O_p\big(1/\sqrt{T} + 1/\sqrt{p}\big)$.

Proof

We first condition on = K. (i) Lemma C.4 implies ||V−1|| = Op(p−1). Also ||F||=λmax1/2(FF)=λmax1/2(t=1Tftft)=Op(T). In addition, ||F^||=T. It then follows from the definition of H that ||H|| = Op(1). Define cov^(Hft)=1Tt=1THft(Hft). Applying the triangular inequality gives:

||HH-IK^||F||HH-cov^(Hft)||F+||cov^(Hft)-IK^||F (C.3)

By Lemma C.3, the first term in (C.3) is ||HH-cov^(Hft)||F||H||2||IK-cov^(ft)||F=Op(1T). The second term of (C.3) can be bounded, by the Cauchy-Schwarz inequality and Lemma C.9, as follows:

1Tt=1THft(Hft)-1Tt=1Tf^tf^tF1Tt(Hft-f^t)(Hft)F+1Ttf^t(f^t-(Hft))F(1Tt||Hft-f^t||21Tt||Hft||2)+(1Tt||Hft-f^t||21Tt||f^t||2)1/2=Op(1T+1p).

(ii) Still conditioning on = K, since HH=IK+Op(1/T+1/p) and ||H|| = Op(1), right multiplying H gives HHH=H+Op(1/T+1/p). Part (i) also gives, conditioning on = K, ||H−1|| = Op(1). Hence further left multiplying H−1 yields HH=IK+Op(1/T+p). Due to P( = K) → 1, we reach the desired result.

Proof of Theorem 3.3
Proof

The second part of this theorem was proved in Lemma C.9. We now derive the convergence rate of $\max_{i\le p}\|\hat b_i - Hb_i\|$.

Using the facts that $\hat b_i = \frac{1}{T}\sum_{t=1}^T y_{it}\hat f_t$ and that $\frac{1}{T}\sum_{t=1}^T\hat f_t\hat f_t' = I_K$, we have

$$\hat b_i - Hb_i = \frac{1}{T}\sum_{t=1}^T Hf_tu_{it} + \frac{1}{T}\sum_{t=1}^T y_{it}(\hat f_t - Hf_t) + H\Big(\frac{1}{T}\sum_{t=1}^T f_tf_t' - I_K\Big)b_i. \tag{C.4}$$

We bound the three terms on the right hand side respectively. It follows from Lemmas C.3 and C.10 that maxip||1Tt=1THftuit||||H||maxik=1K(1Tt=1Tfktuit)2=Op(logpT). For the second term, Eyit2=O(1). Therefore, maxiT-1t=1Tyit2=Op(1). The Cauchy-Schwarz inequality and Lemma C.9 imply

maxi||1Tt=1Tyit(f^t-Hft)||maxi(1Tt=1Tyit2t=1T||f^t-Hft||2)1/2=Op(1T+1p).

Finally, ||1Tt=1Tftft-IK||=Op(T-1/2) and maxi ||bi|| = O(1) imply that the third term is Op(T−1/2).

Proof of Corollary 3.1

Under Assumption 3.3, it can be shown by Bonferroni’s method that maxtT ||ft|| = Op((log T )1/r2 ). By Theorem 3.3, uniformly in i and t,

||b^if^t-bift||||b^i-Hbi||||f^t-Hft||+||Hbi||||f^t-Hft||+||b^i-Hbi||||Hft||+||bi||ft||||HH-IK||=Op((logT)1/r2logpT+T1/4p).

C.3 Proof of Theorem 3.1

Lemma C.11

$\max_{i\le p}\frac{1}{T}\sum_{t=1}^T|u_{it} - \hat u_{it}|^2 = O_p(\omega_T^2)$, and $\max_{i,t}|u_{it} - \hat u_{it}| = o_p(1)$.

Proof

We have, uit-u^it=biH(f^t-Hft)+(b^i-biH)f^t+bi(HH-IK)ft. Therefore, using the inequality (a + b + c)2 ≤ 4a2 + 4b2 + 4c2, we have:

maxip1Tt=1T(uit-u^it)24maxi||biH||21Tt=1T||f^t-Hft||2+4maxi||b^i-biH||21Tt=1T||f^t||2+4maxi||bi||21Tt=1T||ft||2||HH-IK||F2

The first part of the lemma then follows from Theorem 3.3 and Lemma C.9. The second part follows from Corollary 3.1.

Proof of Theorem 3.1

The theorem follows immediately from Theorem A.1 and Lemma C.11.

C.4 Proof of Theorem 3.2

Define

$$C_T = \hat\Lambda - BH'.$$
Lemma C.12
  1. $\|C_T\|_F^2 = O_p(\omega_T^2\,p)$, and $\|C_TC_T'\|_\Sigma^2 = O_p(\omega_T^4\,p)$.

  2. $\|\hat\Sigma_{u,\hat K}^{\mathcal T} - \Sigma_u\|_\Sigma^2 = O_p(\omega_T^{2-2q}m_p^2)$.

  3. $\|BH'C_T'\|_\Sigma^2 = O_p(\omega_T^2)$.

  4. $\|B(H'H - I_K)B'\|_\Sigma^2 = O_p\big(p^{-2} + (pT)^{-1}\big)$.

Proof
  1. We have ||CT||F2maxip||b^i-Hbi||2p=Op(ωT2p). Moreover, since all the eigenvalues of Σ are bounded away from zero, for any matrix A, ||A||2=Op(p-1)||A||F2. Hence ||CTCT||2=Op(p-1||CT||F4)=Op(pωT4).

  2. By Theorem 3.1, ||^u,K^T-u||2=Op(p-1||^u,K^T-u||F2)=Op(||^u,K^T-u||2)=Op(ωT2-2qmp2).

  3. The same argument of the proof of Theorem 2 in Fan, Fan and Lv (2008) implies that ||B′Σ−1B|| = O(1). Thus, ||BHCT||2=p-1tr(HCT-1CTHB-1B) is upper bounded by p-1||H||2||B-1B||||-1||||CT||F2=Op(p-1||CT||F2)=Op(ωT2).

  4. Again, by ||B′Σ−1B|| = O(1), and Lemma C.10,
    ||B(HH-IK)B||2=p-1tr((HH-IK)B-1B(HH-IK)B-1B)p-1||HH-IK||F2||B-1B||2=Op(p-2+(pT)-1). (C.5)
Proof of Theorem 3.2 (i)
Proof

By Lemma C.12, ||B(HH-IK)B||2+||BHCT||2+||CTCT||2=Op(ωT2+plog2pT2). Hence for a generic constant C > 0,

||^K-||2C||Λ^Λ^-BB||2+C||^u,K^T-u||2C[||B(HH-IK)B||2+||BHCT||2+||CTCT||2+||^u,K^T-u||2]=Op(ωT2-2qmp2+plog2pT2).
Lemma C.13

$$\big\|\hat\Lambda'(\hat\Sigma_{u,\hat K}^{\mathcal T})^{-1}\hat\Lambda - (BH')'\Sigma_u^{-1}BH'\big\| = O_p\big(p\,\omega_T^{1-q}m_p\big).$$

Proof

||CT||F2=Op(ωT2p). Hence

||Λ^(^u,K^T)-1Λ^-(BH)u-1BH||||CT(^u,K^T)-1CT||+2||CT(^u,K^T)-1BH||+||BH((^u,K^T)-1-u-1)BH||=Op(pωT1-qmp) (C.6)
Lemma C.14

If $\omega_T^{1-q}m_p = o(1)$, then with probability approaching one, for some c > 0,

  1. $\lambda_{\min}\big(I_K + (BH')'\Sigma_u^{-1}BH'\big) \ge cp$.

  2. $\lambda_{\min}\big(I_K + \hat\Lambda'(\hat\Sigma_{u,\hat K}^{\mathcal T})^{-1}\hat\Lambda\big) \ge cp$.

  3. $\lambda_{\min}\big(I_K + B'\Sigma_u^{-1}B\big) \ge cp$.

  4. $\lambda_{\min}\big((HH')^{-1} + B'\Sigma_u^{-1}B\big) \ge cp$.

Proof

(i) By Lemma C.10, with probability approaching one, λmin(HH′) is bounded away from zero. Hence,

λmin(IK+(BH)u-1BH)λmin((BH)u-1BH))λmin(u-1)λmin(HBBH)λmin(u-1)λmin(BB)λmin(HH)cp.

(ii) The result follows from part (i) and Lemma C.13. Part (iii) and (iv) follow from a similar argument of part (i) and Lemma C.10.

Proof of Theorem 3.2
Proof

We derive the rate for ||^K^-1--1||. Define

=BHHB+u.

Note that ^K^=Λ^Λ^+^u,K^T and Σ = BB′ + Σu. The triangular inequality gives

||^K^-1--1||||^K^-1--1||+||-1--1||.

Using the Sherman-Morrison-Woodbury formula, we have ||^K^-1--1||i=16Li, where

L1=||(^u,K^T)-1-u-1||L2=||((^u,K^T)-1-u-1)Λ^[IK+Λ^(^u,K^T)-1Λ^]-1Λ^(^u,K^T)-1||L3=||((^u,K^T)-1-u-1)Λ^[IK+Λ^(^u,K^T)-1Λ^]-1Λ^u-1||L4=||u-1(Λ^-BH)[IK+Λ^(^u,K^T)-1Λ^]-1Λ^u-1||L5=||u-1(Λ^-BH)[IK+Λ^(^u,K^T)-1Λ^]-1HBu-1||L6=||u-1BH([IK+Λ^(^u,K^T)-1Λ^]-1-[IK+HBu-1BH]-1)HBu-1||. (C.7)

We bound each of the six terms respectively. First of all, L1 is bounded by Theorem 3.1. Let G=[IK+Λ^(^u,K^T)-1Λ^]-1, then

L2||(^u,K^T)-1-u-1||·||Λ^GΛ^||·||(^u,K^T)-1||.

Note that Theorem 3.1 implies ||(^u,K^T)-1||=Op(1). Lemma C.14 then implies ||G|| = Op(p−1). This shows that L2 = Op(L1). Similarly L3 = Op(L1). In addition, since ||CT||F2=||Λ^-BH||F2=Op(ωT2p),L4||u-1(Λ^-BH)||||G||||Λ^u-1||=Op(ωT). Similarly L5 = Op(L4). Finally, let G1=[IK+(BH)u-1BH]-1. By Lemma C.14, ||G1|| = Op(p−1). Then by Lemma C.13,

||G-G1||=||G(G-1-G1-1)G1||Op(p-2)||(BH)u-1BH-Λ^(^u,K^T)-1Λ^||=Op(p-1ωT1-qmp).

Consequently, L6||u-1BH||2||G-G1||=Op(ωT1-qmp). Adding up L1L6 gives

||^K^-1--1||=Op(ωT1-qmp).

One the other hand, using Sherman-Morrison-Woodbury formula again implies

||-1--1||||u-1B([(HH)-1+Bu-1B]-1-[IK+Bu-1B]-1)Bu-1||O(p)||[(HH)-1+Bu-1B]-1-[IK+Bu-1B]-1||=Op(p-1)||(HH)-1-IK||=op(ωT1-qmp).
Proof of Theorem 3.2

||Σ̂τΣ||max

Proof

We first bound ||Λ̂Λ̂′BB′||max. Repeatedly using the triangular inequality yields

||Λ^Λ^-BB||max=maxi,jpb^ib^j-bibjmaxij[(b^i-Hbi)b^j+biH(b^j-Hbj)+bi(HH-IK)bj](maxi||b^i-Hbi||)2+2maxij||b^i-Hbi||||Hbj||+maxi||bi||2HH-IK||=Op(ωT).

On the other hand, let σu,ij be the (i, j) entry of Σu. Then maxij |σ̂ijσu,ij| = Op(ωT).

maxijsij(σ^ij)-σu,ijmaxijsij(σ^ij)-σ^ij+σ^ij-σu,ijmaxijτij+Op(ωT)=Op(ωT).

Hence ||^u,K^T-u||max=Op(ωT). The result then follows immediately.

Footnotes

1

We thank a referee for reminding us these related works.

2

We thank a referee for this interesting reference.

3

We have written an R package for POET, which outputs the estimated Σ, Σu, K, the factors and loadings.

4

To our best knowledge, the only other papers that estimate large covariances with diverging eigenvalues (growing at the rate of dimensionality O(p)) are Fan et al. (2008, 2011) and Bai and Shi (2011). While Fan et al. (2008, 2011) assumed the factors are observable, Bai and Shi (2011) considered the strict factor model in which Σu is diagonal.

5

It is important to distinguish the model we consider in this paper from the "sparse factor model" in the literature, e.g., Carvalho et al. (2009) and Pati et al. (2012), which assumes that the loading matrix B is sparse. The intuition of a sparse loading matrix is that each factor is related to only a relatively small number of stocks, assets, genes, etc. With B sparse, all the eigenvalues of B′B, and hence those of Σ, are bounded.

6

See Fan, Liao and Mincheva (2011), working paper, arxiv.org/pdf/1201.0175.pdf

7

The assumptions in the working paper Fan et al. (2011) are slightly weaker than those presented here, in that it required λmax(Σu), rather than ||Σu||1, to be bounded.

8

We used the R package for LOREC developed by Luo (2012) and the Matlab codes for Dual and SVT provided on Yi Ma's website "Low-rank matrix recovery and completion via convex optimization" at the University of Illinois. The tuning parameters for each method have been chosen to minimize the sum of relative errors $\|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_p\| + \|\Sigma_u^{-1/2}\hat\Sigma_u\Sigma_u^{-1/2} - I_p\|$. We have also written an R package for POET.

References

  1. Agarwal A, Negahban S, Wainwright MJ. Noisy matrix decomposition via convex relaxation: optimal rates in high dimensions. Ann Statist. 2012;40:1171–1197.
  2. Ahn S, Lee Y, Schmidt P. GMM estimation of linear panel data models with time-varying individual effects. J Econometrics. 2001;101:219–255.
  3. Alessi L, Barigozzi M, Capasso M. Improved penalization for determining the number of factors in approximate factor models. Statistics and Probability Letters. 2010;80:1806–1813.
  4. Amini AA, Wainwright MJ. High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann Statist. 2009;37:2877–2921.
  5. Antoniadis A, Fan J. Regularized wavelet approximations. J Amer Statist Assoc. 2001;96:939–967.
  6. Athreya K, Lahiri S. Measure Theory and Probability Theory. Springer; New York: 2006.
  7. Bai J. Inferential theory for factor models of large dimensions. Econometrica. 2003;71:135–171.
  8. Bai J, Ng S. Determining the number of factors in approximate factor models. Econometrica. 2002;70:191–221.
  9. Bai J, Ng S. Large dimensional factor analysis. Foundations and Trends in Econometrics. 2008;3:89–163.
  10. Bai J, Shi S. Estimating high dimensional covariance matrices and its applications. Annals of Economics and Finance. 2011;12:199–215.
  11. Bickel P, Levina E. Covariance regularization by thresholding. Ann Statist. 2008;36:2577–2604.
  12. Birnbaum A, Johnstone I, Nadler B, Paul D. Minimax bounds for sparse PCA with noisy high-dimensional data. Ann Statist. 2012, to appear. doi: 10.1214/12-AOS1014.
  13. Boivin J, Ng S. Are more data always better for factor analysis? J Econometrics. 2006;132:169–194.
  14. Cai J, Candès E, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM J on Optimization. 2008;20:1956–1982.
  15. Cai T, Liu W. Adaptive thresholding for sparse covariance matrix estimation. J Amer Statist Assoc. 2011;106:672–684.
  16. Cai T, Zhou H. Optimal rates of convergence for sparse covariance matrix estimation. Manuscript, University of Pennsylvania; 2010.
  17. Candès E, Li X, Ma Y, Wright J. Robust principal component analysis? J ACM. 2011;58:3.
  18. Carvalho C, Chang J, Lucas J, Nevins J, Wang Q, West M. High-dimensional sparse factor modeling: applications in gene expression genomics. J Amer Statist Assoc. 2008;103:1438–1456. doi: 10.1198/016214508000000869.
  19. Chamberlain G, Rothschild M. Arbitrage, factor structure and mean-variance analysis in large asset markets. Econometrica. 1983;51:1305–1324.
  20. Doz C, Giannone D, Reichlin L. A two-step estimator for large approximate dynamic factor models based on Kalman filtering. J Econometrics. 2011;164:188–205.
  21. d'Aspremont A, Bach F, El Ghaoui L. Optimal solutions for sparse principal component analysis. J Mach Learn Res. 2008;9:1269–1294.
  22. Davis C, Kahan W. The rotation of eigenvectors by a perturbation III. SIAM J Numer Anal. 1970;7:1–46.
  23. Efron B. Correlation and large-scale simultaneous significance testing. J Amer Statist Assoc. 2007;102:93–103.
  24. Efron B. Correlated z-values and the accuracy of large-scale statistical estimates. J Amer Statist Assoc. 2010;105:1042–1055. doi: 10.1198/jasa.2010.tm09129.
  25. Fama E, French K. The cross-section of expected stock returns. Journal of Finance. 1992;47:427–465.
  26. Fan J, Fan Y, Lv J. High dimensional covariance matrix estimation using a factor model. J Econometrics. 2008;147:186–197.
  27. Fan J, Han X, Gu W. Control of the false discovery rate under arbitrary covariance dependence (with discussion). J Amer Statist Assoc. 2012;107:1019–1048. doi: 10.1080/01621459.2012.720478.
  28. Fan J, Liao Y, Mincheva M. High dimensional covariance matrix estimation in approximate factor models. Ann Statist. 2011;39:3320–3356. doi: 10.1214/11-AOS944.
  29. Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements. Working paper version of this article; 2011. doi: 10.1111/rssb.12016. arxiv.org/pdf/1201.0175.pdf.
  30. Fan J, Zhang J, Yu K. Vast portfolio selection with gross-exposure constraints. J Amer Statist Assoc. 2012;107:592–606. doi: 10.1080/01621459.2012.682825.
  31. Forni M, Hallin M, Lippi M, Reichlin L. The generalized dynamic factor model: identification and estimation. Review of Economics and Statistics. 2000;82:540–554.
  32. Forni M, Hallin M, Lippi M, Reichlin L. The generalized dynamic factor model: consistency and rates. J Econometrics. 2004;119:231–255.
  33. Forni M, Lippi M. The generalized dynamic factor model: representation theory. Econometric Theory. 2001;17:1113–1141.
  34. Fryzlewicz P. Haar-Fisz wavelet method for interpretable estimation of large, sparse, time-varying volatility matrices. Manuscript, London School of Economics and Political Science; 2010.
  35. Hallin M, Liška R. Determining the number of factors in the general dynamic factor model. J Amer Statist Assoc. 2007;102:603–617.
  36. Hallin M, Liška R. Dynamic factors in the presence of blocks. J Econometrics. 2011;163:29–41.
  37. Hastie TJ, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer; New York: 2009.
  38. James W, Stein C. Estimation with quadratic loss. In: Proc Fourth Berkeley Symp Math Statist Probab, Vol. 1. Univ California Press; Berkeley: 1961. pp. 361–379.
  39. Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. Ann Statist. 2001;29:295–327.
  40. Johnstone IM, Lu AY. On consistency and sparsity for principal components analysis in high dimensions. J Amer Statist Assoc. 2009;104:682–693. doi: 10.1198/jasa.2009.0121.
  41. Jung S, Marron JS. PCA consistency in high dimension, low sample size context. Ann Statist. 2009;37:4104–4130.
  42. Kapetanios G. A testing procedure for determining the number of factors in approximate factor models with large datasets. J Bus Econom Statist. 2010;28:397–409.
  43. Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. Ann Statist. 2009;37:4254–4278. doi: 10.1214/09-AOS720.
  44. Lawley D, Maxwell A. Factor Analysis as a Statistical Method. 2nd ed. Butterworths; London: 1971.
  45. Leek J, Storey J. A general framework for multiple testing dependence. PNAS. 2008;105:18718–18723. doi: 10.1073/pnas.0808709105.
  46. Lin Z, Ganesh A, Wright J, Wu L, Chen M, Ma Y. Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. Manuscript, Microsoft Research Asia; 2009.
  47. Luo X. High dimensional low rank and sparse covariance matrix estimation via convex minimization. Manuscript; 2011.
  48. Ma Z. Sparse principal components analysis and iterative thresholding. Manuscript; 2011.
  49. Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34:1436–1462.
  50. Onatski A. Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics. 2010;92:1004–1016.
  51. Pati D, Bhattacharya A, Pillai N, Dunson D. Posterior contraction in sparse Bayesian factor models for massive covariance matrices. Manuscript, Duke University; 2012.
  52. Paul D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist Sinica. 2007;17:1617–1642.
  53. Pesaran MH. Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica. 2006;74:967–1012.
  54. Phan Q. On the sparsity assumption of the idiosyncratic errors covariance matrix: support from the FTSE 100 stock returns. Manuscript, University of Warwick; 2012.
  55. Ross SA. The arbitrage theory of capital asset pricing. Journal of Economic Theory. 1976;13:341–360.
  56. Rothman A, Levina E, Zhu J. Generalized thresholding of large covariance matrices. J Amer Statist Assoc. 2009;104:177–186.
  57. Sentana E. The econometrics of mean-variance efficiency tests: a survey. Econometrics Journal. 2009;12:65–101.
  58. Shen H, Huang J. Sparse principal component analysis via regularized low rank matrix approximation. J Multivariate Analysis. 2008;99:1015–1034.
  59. Sharpe W. Capital asset prices: a theory of market equilibrium under conditions of risk. Journal of Finance. 1964;19:425–442.
  60. Stock J, Watson M. Diffusion indexes. NBER Working Paper 6702; 1998.
  61. Stock J, Watson M. Forecasting using principal components from a large number of predictors. J Amer Statist Assoc. 2002;97:1167–1179.
  62. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10:515–534. doi: 10.1093/biostatistics/kxp008.
  63. Wright J, Peng Y, Ma Y, Ganesh A, Rao S. Robust principal component analysis: exact recovery of corrupted low-rank matrices by convex optimization. Manuscript, Microsoft Research Asia; 2009.
  64. Xiong H, Goulding EH, Carlson EJ, Tecott LH, McCulloch CE, Sen S. A flexible estimating equations approach for mapping function-valued traits. Genetics. 2011;189:305–316. doi: 10.1534/genetics.111.129221.
  65. Yap JS, Fan J, Wu R. Nonparametric modeling of longitudinal covariance structure in functional mapping of quantitative trait loci. Biometrics. 2009;65:1068–1077. doi: 10.1111/j.1541-0420.2009.01222.x.
  66. Zhang Y, El Ghaoui L. Large-scale sparse principal component analysis with application to text data. NIPS; 2011.
