Abstract
Factor modeling is an essential tool for exploring intrinsic dependence structures among high-dimensional random variables. Much progress has been made for estimating the covariance matrix from a high-dimensional factor model. However, the blessing of dimensionality has not yet been fully embraced in the literature: much of the available data are often ignored in constructing covariance matrix estimates. If our goal is to accurately estimate a covariance matrix of a set of targeted variables, shall we employ additional data, which are beyond the variables of interest, in the estimation? In this article, we provide sufficient conditions for an affirmative answer, and further quantify its gain in terms of Fisher information and convergence rate. In fact, even an oracle-like result (as if all the factors were known) can be achieved when a sufficiently large number of variables is used. The idea of using data as much as possible brings computational challenges. A divide-and-conquer algorithm is thus proposed to alleviate the computational burden, and also shown not to sacrifice any statistical accuracy in comparison with a pooled analysis. Simulation studies further confirm our advocacy for the use of full data, and demonstrate the effectiveness of the above algorithm. Our proposal is applied to a microarray data example that shows empirical benefits of using more data. Supplementary materials for this article are available online.
Keywords: Asymptotic normality, Auxiliary data, Divide-and-conquer, Factor model, Fisher information, High-dimensionality
1. Introduction
With the advance of modern information technology, it is now possible to track millions of variables or subjects simultaneously. To discover the relationship among them, the estimation of a high-dimensional covariance matrix Σ has recently received a great deal of attention in the literature. Researchers proposed various regularization methods to obtain consistent estimators of Σ (Bickel and Levina 2008; Rothman et al. 2008; Lam and Fan 2009; Cai, Zhang, and Zhou 2010; Cai and Liu 2011). A key assumption for these regularization methods is that Σ is sparse, that is, many elements of Σ are small or exactly zero.
Different from such a sparsity condition, factor analysis assumes that the intrinsic dependence is mainly driven by some common latent factors (Johnson and Wichern 1992). For example, in modeling stock returns, Fama and French (1993) proposed the well-known Fama–French three-factor model. In the factor model, Σ has spiked eigenvalues and dense entries. In the high-dimensional setting, there are many recent studies on the estimation of the covariance matrix based on the factor model (Fan, Fan, and Lv 2008; Fan, Liao, and Mincheva 2011, 2013; Bai and Li 2012; Bai and Liao 2013), where the number of variables can be much larger than the number of observations.
The interest of this article is in the estimation of the covariance matrix for a given set of variables using auxiliary data information. In the existing literature, only the data on the variables of interest are used. In today's data-rich environment, substantially more data are in fact available but are often ignored in statistical analysis. For example, we might be interested in understanding the covariance matrix of 50 stocks in a portfolio, yet the available data consist of a time series of thousands of stocks. Similarly, an oncologist may wish to study the dependence or network structure among 100 genes that are significantly associated with a certain cancer, yet she has expression data for over 20,000 genes from the whole genome. Can we benefit from using such rich auxiliary data?
The answer to the above question is affirmative when a factor model is imposed. Since the whole system is driven by a few common factors, these common factors can be inferred more accurately from a much larger set of data information (Fan, Liao, and Mincheva 2013), which is indeed a “blessing of dimensionality.” A major contribution of this article is to characterize how much the estimation of the covariance matrix of interest and also common factors can be improved by auxiliary data information (and under what conditions).
Consider the following factor model for all p observable data yt = (y1t, …, ypt)′ ∈ ℝp at time t:
yt = Bft + ut,  (1)
where ft ∈ ℝK is a K-dimensional vector of common factors, B = (b1, …, bp)′ ∈ ℝp×K is a factor loading matrix with bi ∈ ℝK being the factor loading of the ith variable on the latent factor ft, and ut is an idiosyncratic error vector. In the above model, yt is the only observable variable, while B is a matrix of unknown parameters and (ft, ut) are latent random variables. Without loss of generality, we assume E(ft) = E(ut) = 0 and that ft and ut are uncorrelated. Then, the model-implied covariance structure is

Σ = BΣfB′ + Σu,

where Σf = cov(ft) and Σu = cov(ut). Observe that B and ft are not individually identifiable, since Bft = (BH)(H′ft) for any orthogonal matrix H. To this end, an identifiability condition is imposed:
cov(ft) = IK  and  B′Σu⁻¹B is diagonal,  (2)
which is a common assumption in the literature (Bai and Li 2012; Bai and Liao 2013).
Assume that we are only interested in a subset S, of size s, among the total of p variables in model (1). We aim to obtain an efficient estimator of

ΣS = BSBS′ + Σu,S,

the covariance matrix of the s variables in S, where BS is the submatrix of B with row indices in S and Σu,S is the submatrix of Σu with row and column indices in S. As mentioned above, the existing literature uses the following conventional method:
Method 1: Use solely the s variables in the set S to estimate common factors ft, the loading matrix BS, the idiosyncratic matrix Σu,S, and the covariance matrix ΣS.
This idea is apparently strongly influenced by the nonparametric estimation of the covariance matrix and ignores a large portion of the available data in the other p – s variables. An intuitively more efficient method is
Method 2: Use all the p variables to obtain estimators of ft, the loading matrix B, the idiosyncratic matrix Σu, and the entire covariance matrix Σ, and then restrict them to the variables of interest. This is the same as estimating ft using all variables, and then estimating BS and Σu,S based on the model (1) and the subset S with ft being estimated (observed), and obtaining a plug-in estimator of ΣS.
We will show that Method 2 is more efficient than Method 1 in the estimation of ft and ΣS, as more auxiliary data information is incorporated. By treating the common factors as unknown parameters, we show that their Fisher information grows as more data are used in Method 2. In this case, a more efficient factor estimate can be obtained, for example, through the weighted principal component (WPC) method (Bai and Liao 2013). The advantage in factor estimation further carries over to the estimation of ΣS by Method 2 in terms of its convergence rate. Moreover, if the total number of variables is sufficiently large, Method 2 is proven to perform as well as an "oracle method" that observes all latent factors. This lends further support to our aforementioned claim of a "blessing of dimensionality." Such a best-possible rate improvement is new to the existing literature and counts as another contribution of this article. All these conclusions hold when the number of factors K is assumed to be fixed and known, while s, p, and T all tend to infinity.
The idea of using as much data as possible brings computational challenges. Fortunately, all p variables are driven by the same group of latent factors. Accordingly, we can split the p variables into smaller groups and use each group to estimate the latent factors. The final factor estimate is obtained by averaging over these repeatedly estimated factors. Obviously, this divide-and-conquer algorithm can be implemented in a parallel computing environment, and thus produces factor estimators much more efficiently. On the other hand, our theory shows that this new method performs as well as the "pooled analysis," in which we run the method over the whole dataset. Simulation studies further demonstrate the boosted computational speed and satisfactory statistical performance.
The rest of the article is organized as follows. We compare the Fisher information of the factors under the two methods in Section 2. Section 3 describes the WPC method. As a main result, the convergence rates of different estimators of ΣS are compared in Section 4 under various norms. Section 5 introduces the divide-and-conquer method for accelerating computation, while Section 6 presents all simulation results. Section 7 gives a microarray data example to illustrate our proposal. All technical proofs are delegated to the Appendix.
For any vector a, let aS denote the subvector of a with indices in S and ‖a‖ its Euclidean norm. For a symmetric matrix A ∈ ℝd×d, let AI,J be the submatrix of A with row indices in I and column indices in J; we write AS for AS,S for simplicity. Let λj(A) be the jth largest eigenvalue of A. Denote by ‖A‖ = max{|λ1(A)|, |λd(A)|} the operator norm of A, ‖A‖max = maxi,j |aij| the max-norm of A, where aij is the (i, j)th entry of A, ‖A‖1 = maxj Σi |aij| the L1 norm of A, ‖A‖F = (Σi,j aij²)^{1/2} the Frobenius norm of A, and ‖A‖M = d−1/2‖M−1/2AM−1/2‖F the relative norm of A to M, where the weight matrix M is assumed to be positive definite. For a nonsquare matrix C, let CS be the submatrix of C with row indices in S.
2. Fisher Information of Common Factor
In this section, we treat the vector of common factors as a fixed unknown parameter and compute its Fisher information matrices under Method 1 and Method 2. In the computation, the loading matrix B is treated as deterministic in Proposition 2. In Proposition 3, the Fisher information is computed for each given B and then averaged over B by regarding it as a realization of a chance process, which bypasses the block-diagonal assumption that is otherwise needed when no averaging over B is taken. In the other sections, we adopt the convention of regarding the factors as random and B as fixed. We start by calculating the Fisher information of θt := Bft, which serves as an intermediate step in obtaining that of ft. For notational simplicity, the time index t is suppressed in (yt, ft, ut, θt), so that it becomes (y, f, u, θ) in this section.
Given a general density function of y, denoted as h(y; θ), the Fisher information of θ contained in the full data is given by

Ip(θ) = E[{∂ log h(y; θ)/∂θ}{∂ log h(y; θ)/∂θ}′].

When only the data in S are used, the Fisher information of θS is given by

IS(θS) = E[{∂ log hS(yS; θS)/∂θS}{∂ log hS(yS; θS)/∂θS}′],
where hS is the marginal density of yS for the target set of variable S. Our first proposition shows that {Ip(θ)}S, the submatrix of Ip(θ) restricted on S, dominates IS(θS) under a mild condition.
Proposition 1. If h(y; θ) = h(y – θ) and the density function h(y – θ) satisfies the following regularity condition:
∫ ∂h(y − θ)/∂θS dySc = ∂{∫ h(y − θ) dySc}/∂θS,  (3)
then{Ip(θ)}S ≥ IS(θS) in the sense that {Ip(θ)}S – IS(θS) is positive semidefinite.
The regularity condition (3) is fairly mild, as illustrated in the following examples.
Example 1. In model (1), if uS and uSc are independent, then (3) holds.
Example 2. If y follows an elliptical distribution with density of the form h(y − θ) ∝ g((y − θ)′Σ0⁻¹(y − θ)) for some positive definite matrix Σ0, where the mapping function g(t): [0, ∞) → [0, ∞) satisfies |g′(t)| ≤ cg(t) for some positive constant c, and E|y| < ∞, then (3) holds. Example 2 includes some commonly used multivariate distributions as special cases, for example, the multivariate normal distribution and the multivariate t-distribution with degrees of freedom greater than 1. The proof is given in Appendix A.2.
We next compute the Fisher information of f based on the full dataset, denoted as I(f), and on the partial dataset restricted to S, denoted as IS(f). This can be done easily by noting that I(f) = B′Ip(θ)B and IS(f) = BS′IS(θS)BS. Indeed, the WPC estimators used in Methods 1 and 2 achieve such efficiency, since their asymptotic variances are proven to be the inverses of IS(f) and I(f), respectively; see Remark 1.
Proposition 2 shows that I(f) dominates IS(f), if Ip(θ) is block-diagonal, that is, {Ip(θ)}S,Sc = 0. Hence, common factors can be estimated more efficiently using additional data ySc. The above block-diagonal condition implies that the idiosyncratic error of additional variables cannot be confounded with that of the variables-of-interest. For example, if u is normal, then {Ip(θ)}S,Sc = 0 indeed requires that uS is independent of uSc.
Proposition 2. Under condition (3), if {Ip(θ)}S,Sc = 0, I(f) ≥ IS(f).
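To make this concrete, consider the Gaussian case discussed above, in which I(f) = B′Σu⁻¹B and IS(f) = BS′Σu,S⁻¹BS. The following small numerical check (our own illustration, not part of the original analysis) verifies that I(f) − IS(f) is positive semidefinite when Σu is block-diagonal across S and Sc:

```python
import numpy as np

rng = np.random.default_rng(0)
p, s, K = 30, 10, 3                       # p variables in total, the first s are of interest
B = rng.normal(size=(p, K))               # factor loadings

# Block-diagonal idiosyncratic covariance: u_S independent of u_{S^c}
A1 = rng.normal(size=(s, s))
A2 = rng.normal(size=(p - s, p - s))
Sigma_u = np.zeros((p, p))
Sigma_u[:s, :s] = A1 @ A1.T + s * np.eye(s)
Sigma_u[s:, s:] = A2 @ A2.T + (p - s) * np.eye(p - s)

# Gaussian Fisher information of f based on all p variables and on S only
I_full = B.T @ np.linalg.solve(Sigma_u, B)                   # B' Sigma_u^{-1} B
I_sub = B[:s].T @ np.linalg.solve(Sigma_u[:s, :s], B[:s])    # B_S' Sigma_{u,S}^{-1} B_S

# Proposition 2: the difference should be positive semidefinite
print("smallest eigenvalue of I(f) - I_S(f):",
      np.linalg.eigvalsh(I_full - I_sub).min())              # >= 0 up to rounding error
```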
So far we have treated B as deterministic. In contrast, Proposition 3 regards {bi} as a realization of a chance process. Under this assumption, the expectation of I(f) over B is shown to always dominate that of IS(f). In other words, averaging over the loading matrix, a larger dataset contains more information about the unknown factors.
Proposition 3. If b1, …, bp are iid random loadings with E(bi) = 0 and (3) holds, then E[I(f)] ≥ E[IS(f)], where the expectation is taken with respect to the distribution of B.
3. Efficient Estimation of Common Factor
In this section, we construct an efficient estimator of the common factors by showing that its asymptotic variance is exactly the inverse of its Fisher information. This together with the arguments in Section 2 enables us to draw a conclusion that using more data results in a more efficient factor estimator with a smaller asymptotic variance.
From a least-squares perspective, when the loading matrix B is known, ft can be estimated by the weighted least-squares estimator f̂t = (B′Σu⁻¹B)⁻¹B′Σu⁻¹yt. In the high-dimensional setting (p ≫ T), we assume that Σu is a sparse matrix and define its sparsity measurement as
mp = maxi≤p Σj≤p 1{σu,ij ≠ 0},  (4)

where σu,ij is the (i, j)th entry of Σu and 1{·} is the indicator function.
In particular, we assume the following sparsity condition:
| (5) |
Now, we propose to solve the following constrained weighted least-square problem:
minB,{ft} Σt=1T (yt − Bft)′Σ̃u⁻¹(yt − Bft)  subject to  T⁻¹Σt=1T ftft′ = IK and B′Σ̃u⁻¹B diagonal,  (6)
where Σ̃u is a regularized estimator of Σu to be discussed later. The above constraint is a sample analog of the identifiability condition (2). The weight Σ̃u⁻¹ accounts for the heterogeneity among the data and leads to more efficient estimation of (B, ft) (Choi 2012; Bai and Liao 2013).
Indeed, an initial estimator Σ̃u of the idiosyncratic matrix Σu is needed for solving the constrained weighted least-squares problem. We propose to obtain such an estimator by the following procedure, which is in the same spirit as the estimation of the idiosyncratic matrix in the POET method (Fan, Liao, and Mincheva 2013). Let Sy be the sample covariance matrix of y and {(λj, ξj)}j=1p be the eigen-pairs of Sy with λ1 ≥ λ2 ≥ … ≥ λp. Denote R = Sy − Σj=1K λjξjξj′, the principal orthogonal complement, with (i, j)th entry rij. We estimate Σu by Σ̃u, whose (i, j)th entry is

σ̃ij = rij if i = j, and σ̃ij = sij(rij) if i ≠ j,

where sij(·) is a general entry-wise thresholding function (Antoniadis and Fan 2001) such that sij(z) = 0 if |z| ≤ τij and |sij(z) − z| ≤ τij for |z| > τij. In this article, we choose hard thresholding, even though SCAD (Fan and Li 2001) and MCP (Zhang 2010) are also applicable. We specify the entry-wise thresholding level as
τij = Cω(p)(riirjj)^{1/2},  where ω(p) = (log p/T)^{1/2} + 1/√p,  (7)
and C is a constant chosen by cross-validation. The thresholding parameter Cω(p) is applied to the correlation matrix. This is similar to the adaptive thresholding estimator for a general covariance matrix (Rothman, Levina, and Zhu 2009), where the entry-wise thresholding level depends on p.
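This initial step can be sketched in a few lines of Python. The snippet below is our own illustration (not the authors' code); it takes the form ω(p) = (log p/T)^{1/2} + 1/√p from (7) and a fixed C as given, whereas the article chooses C by cross-validation.

```python
import numpy as np

def initial_idiosyncratic_cov(Y, K, C=0.5):
    """POET-style initial estimator of Sigma_u from the p x T data matrix Y.

    The diagonal of the principal orthogonal complement R is kept, and its
    off-diagonal entries are hard-thresholded at tau_ij = C * omega(p) * sqrt(r_ii r_jj).
    """
    p, T = Y.shape
    Sy = Y @ Y.T / T                                     # sample covariance (data assumed centered)
    lam, xi = np.linalg.eigh(Sy)                         # eigenvalues in ascending order
    lam, xi = lam[::-1], xi[:, ::-1]                     # reorder to descending
    R = Sy - (xi[:, :K] * lam[:K]) @ xi[:, :K].T         # principal orthogonal complement
    omega = np.sqrt(np.log(p) / T) + 1.0 / np.sqrt(p)    # omega(p) as in (7)
    d = np.sqrt(np.diag(R))
    tau = C * omega * np.outer(d, d)                     # correlation-adaptive thresholds
    Sigma_u = np.where(np.abs(R) > tau, R, 0.0)          # hard thresholding
    np.fill_diagonal(Sigma_u, np.diag(R))                # diagonal entries are kept as r_ii
    return Sigma_u
```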
With Σ̃u being the thresholding estimator described above, the constrained weighted least-square problem (6) can be solved by the weighted principal component (WPC) method. The solution is given by
B̂ = T⁻¹YF̂,  (8)

where Y = (y1, …, yT), F̂ = (f̂1, …, f̂T)′, and the columns of F̂ are √T times the eigenvectors corresponding to the largest K eigenvalues of the T × T matrix Y′Σ̃u⁻¹Y (Bai and Liao 2013).
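A minimal sketch of this step is given below (our illustration; the √T scaling of F̂ and the (pT)⁻¹ scaling of the weighted Gram matrix are normalization assumptions on our part and do not affect the eigenvectors). Method 1 applies the same routine to the rows of Y in S together with Σ̃u,S, while Method 2 applies it to the full Y.

```python
import numpy as np

def wpc_factors(Y, Sigma_u_tilde, K):
    """Weighted principal components as in (8): returns (F_hat, B_hat).

    Y is p x T.  The columns of F_hat (T x K) are sqrt(T) times the eigenvectors
    of Y' Sigma_u_tilde^{-1} Y associated with its largest K eigenvalues, so that
    F_hat' F_hat / T = I_K, and B_hat = Y F_hat / T.
    """
    p, T = Y.shape
    M = Y.T @ np.linalg.solve(Sigma_u_tilde, Y) / (p * T)   # T x T weighted Gram matrix
    lam, vec = np.linalg.eigh(M)                            # ascending eigenvalues
    F_hat = np.sqrt(T) * vec[:, ::-1][:, :K]                # top-K eigenvectors, scaled by sqrt(T)
    B_hat = Y @ F_hat / T                                   # loading estimator
    return F_hat, B_hat
```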
In the following, we give a result showing that the WPC estimator is asymptotically efficient. Indeed, Bai and Liao (2013) derive the asymptotic normality of f̂t under the following conditions:
(i) All eigenvalues of B′B/p are bounded away from zero and infinity as p → ∞;

(ii) there exists a K × K diagonal matrix Q such that B′Σu⁻¹B/p → Q as p → ∞, and the diagonal elements of Q are distinct and bounded away from infinity;

(iii) for each fixed t ≤ T, B′Σu⁻¹ut/√p →d N(0, Q) as p → ∞;

together with the sparsity assumption (5) and some additional regularity conditions given in Section A.1. When √p log p = o(T), it is shown that

√p(f̂t − Hft) →d N(0, Q⁻¹),  (9)

where H is a specific rotation matrix given by

H = V̂⁻¹(F̂′F/T)(B′Σ̃u⁻¹B/p),  (10)

and V̂ is the K × K diagonal matrix of the largest K eigenvalues of (pT)⁻¹Y′Σ̃u⁻¹Y. The rotation matrix H is introduced here so that Hft is an identifiable quantity from the data. See more discussion about the identifiability in Remark 2.
Condition (i) is a "pervasive condition" requiring that the common factors affect a nonnegligible fraction of the variables. This is a common assumption for principal-component-based methods (Fan, Liao, and Mincheva 2011; Bai and Liao 2013). In condition (ii), B′Σu⁻¹B is indeed the Fisher information (under Gaussian errors) contained in the p variables, while the limit Q can be viewed as the average information per variable. Hence, the asymptotic normality in (9) shows that f̂t is efficient, as its asymptotic variance attains the inverse of the (averaged) Fisher information.
Remark 1. The results in Section 2 together with (9) imply that Method 2 is in general better than Method 1 in the estimation of the common factors. To explain why, we consider two different cases. When p is of a larger order of magnitude than s, the number of variables of interest, Method 2 produces a better estimator of the factors with a faster convergence rate. Even when p and s diverge at the same speed, the factor estimator based on Method 2 possesses a smaller asymptotic variance, as long as Σu,S,Sc = 0. Recall that IS(f) = BS′Σu,S⁻¹BS and I(f) = B′Σu⁻¹B under Gaussian errors, and they also correspond to the inverses of the asymptotic variances given by Methods 1 and 2, respectively. Then, Proposition 2 implies that Method 2 has a smaller asymptotic variance if Σu,S,Sc = 0. Alternatively, if B is treated as random, Proposition 3 immediately implies that E[I(f)] ≥ E[IS(f)]. Therefore, even without the block-diagonal assumption, Method 2 produces a more efficient factor estimate on average.
4. Covariance Matrix Estimation
One primary goal of this article is to obtain an accurate estimator of the covariance matrix for the variables of interest. In this section, we compare three different estimation methods, namely Method 1, Method 2, and the Oracle Method, in terms of their rates of convergence under various norms. Obviously, these rates depend on how accurately the realized factors are estimated, as demonstrated later.
Below we describe these three methods in full detail.
- Method 1:
- Use solely the data in the subset S to obtain estimators of the realized factors F̂(1) and the loading matrix B̂1 = T−1YSF̂(1) based on (8);
- Let f̂t(1) denote the tth row of F̂(1), b̂i(1) the ith row of B̂1, ûit = yit − (b̂i(1))′f̂t(1), and σ̂ij = T⁻¹Σt ûitûjt. The (i, j)th entry of the idiosyncratic matrix estimator Σ̂u,S(1) of Σu,S is given by thresholding σ̂ij (for i ≠ j) at the level Cω(s)(σ̂iiσ̂jj)^{1/2}, where ω(s) = (log s/T)^{1/2} + 1/√s is defined as in (7) with p replaced by s;
- The final estimator is given by Σ̂S(1) = B̂1B̂1′ + Σ̂u,S(1).
- Method 2:
- Use all p variables to obtain the estimate F̂(2) as given in (8) for the realized factors and then estimate the loading BS by B̂2 = T−1YSF̂(2);
- Follow the same procedure as in Method 1 to obtain the estimator but based on F̂(2) and B̂2;
- The final estimator is given by Σ̂S(2) = B̂2B̂2′ + Σ̂u,S(2).
- Oracle Method:
- Estimate the loading by B̂o = T−1YSF, where F = (f1, …, fT)′ are the true factors.
- The idiosyncratic matrix estimator Σ̂u,So is given by the same procedure as in Method 1, with b̂i(1) and f̂t(1) replaced by b̂io (the ith row of B̂o) and ft, respectively.
- The final estimator is given by Σ̂So = B̂oB̂o′ + Σ̂u,So.
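The three constructions differ only in which factor matrix is plugged in. The compact sketch below (our own illustration rather than the authors' implementation, reusing the assumed form of ω(s)) makes this explicit: passing F̂(1), F̂(2), or the true F yields Σ̂S(1), Σ̂S(2), or Σ̂So, respectively.

```python
import numpy as np

def covariance_of_interest(Y_S, F, C=0.5):
    """Plug-in estimator Sigma_S_hat = B_S B_S' + Sigma_u_S given factors F (T x K).

    Y_S is the s x T data matrix of the variables of interest; F may be the factor
    estimate from Method 1 or Method 2, or the true factors (Oracle Method).
    """
    s, T = Y_S.shape
    B_S = Y_S @ F / T                                    # loading estimator restricted to S
    U = Y_S - B_S @ F.T                                  # idiosyncratic residuals, s x T
    Su = U @ U.T / T                                     # raw residual covariance
    omega = np.sqrt(np.log(s) / T) + 1.0 / np.sqrt(s)    # omega(s), assumed form as in (7)
    d = np.sqrt(np.diag(Su))
    tau = C * omega * np.outer(d, d)
    Sigma_u_S = np.where(np.abs(Su) > tau, Su, 0.0)      # threshold off-diagonal entries
    np.fill_diagonal(Sigma_u_S, np.diag(Su))             # keep the diagonal untouched
    return B_S @ B_S.T + Sigma_u_S
```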
Theorem 1 depicts the estimation accuracy of ΣS by the above three methods with respect to the following measurements: the relative norm ‖Σ̂S − ΣS‖ΣS, the max-norm ‖Σ̂S − ΣS‖max, and the operator norm of the inverse ‖Σ̂S⁻¹ − ΣS⁻¹‖, where ‖·‖ΣS is the relative norm defined in Section 1 and measures the relative errors. Note that the results of Fan, Liao, and Mincheva (2013) cannot be directly used here, since we employ weighted principal component analysis to estimate the unobserved factors. This is expected to be more accurate than ordinary principal component analysis, as shown in Bai and Liao (2013). Indeed, the proofs of our results are technically more involved than those of Fan, Liao, and Mincheva (2013).
We assume that s is much smaller than p, that is, s = o(p), but both tend to infinity. Under the pervasive condition (i), ‖ΣS‖ ≥ cs for some constant c > 0 and therefore diverges. For this reason, we consider the relative norm ‖Σ̂S − ΣS‖ΣS instead of ‖Σ̂S − ΣS‖, and the operator norm for estimating the inverse. In addition, we consider the element-wise max norm ‖Σ̂S − ΣS‖max. We show that if p is large relative to s and T, Method 2 performs as well as the Oracle Method, and both outperform Method 1. As a consequence, even if we are only interested in the covariance matrix of a small subset of variables, we should use all the data to estimate the common factors, which ultimately improves the estimation of ΣS. In particular, we are able to specify an explicit regime of (s, p) under which the improvements are substantial. However, when s ≍ p, that is, when they are of the same order, using more data does not yield as dramatic an improvement for estimating ΣS. This is expected and will be clearly seen in the simulation section.
Before stating Theorem 1, we need a few preliminary results: Lemmas 1–3. Specifically, Lemma 1 presents the uniform convergence rates of the factor estimates by Methods 1 and 2. Based on that, Lemmas 2 and 3 further derive the estimation accuracy of factor loadings and idiosyncratic matrix by the three methods, respectively. These results together lead to the estimation error rates of ΣS in Theorem 1 w.r.t. three measures defined above. Additional Lemmas supporting the proof are given in the Appendix. Again, these kinds of results cannot be obtained directly from Fan, Liao, and Mincheva (2013) due to our use of WPC.
Lemma 1. Suppose that conditions (i), (ii), the sparsity condition (5), and the additional regularity conditions (iv)–(vii) in Section A.1 hold for both s and p. If √p log p = o(T) and T = o(s²), then we have
where H1 and H2 are the rotation matrices defined as in (10) under Methods 1 and 2, respectively, V̂1 is the diagonal matrix of the largest K eigenvalues of (sT)⁻¹YS′Σ̃u,S⁻¹YS, and V̂2 is the diagonal matrix of the largest K eigenvalues of (pT)⁻¹Y′Σ̃u⁻¹Y.
Remark 2. H1 and H2 correspond to the rotation matrix H defined in (10) using Methods 1 and 2, respectively. Recall that F = (f1, …, fT)′; then Hft = (pT)⁻¹V̂⁻¹F̂′(FB′)Σ̃u⁻¹(Bft). Note that Hft only depends on the quantities V̂⁻¹, F̂, Σ̃u, and the identifiable components {Bft}. Therefore, there is no identifiability issue regarding Hft. In other words, even though ft itself may not be identifiable, an identifiable rotation of ft can be consistently estimated by f̂t.
Lemma 1 implies that Method 2 produces a better factor estimate when p is of larger order than s; this regime can be made explicit by representing s and p as s ≍ T^γs and p ≍ T^γp.
It is not surprising that the estimation accuracy of the loading matrix also varies among the three methods, as shown in Lemma 2.
Lemma 2. Under conditions of Lemma 1,

maxi≤s ‖b̂io − bi‖ = OP(wo),  maxi≤s ‖b̂i(1) − H1bi‖ = OP(w1),  maxi≤s ‖b̂i(2) − H2bi‖ = OP(w2),

where wo = √(log s/T), w1 = √(log s/T) + 1/√s, and w2 = √(log s/T) + 1/√p.
Similarly, Lemma 2 indicates that Method 2 performs as well as the Oracle Method, both of which are better than Method 1, that is, w2 = wo < w1, if T/log s = O(p) and s log s = o(T); representing s and p in the order of T as above, this amounts to γs < 1 ≤ γp up to logarithmic factors. We remark that the extra terms 1/√s and 1/√p in w1 and w2 (in comparison with the oracle rate wo) are due to the factor estimation. Another preliminary result, regarding the estimation of the identifiable component, is given in Lemma A.1.
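For completeness, the regime can be verified directly from the expressions of wo, w1, and w2 given above: w2 ≍ wo if and only if 1/√p = O(√(log s/T)), that is, T/log s = O(p); and wo = o(w1) if and only if √(log s/T) = o(1/√s), that is, s log s = o(T).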
Similar insights can be drawn from Lemma 3 on the estimation of Σu,S.
Lemma 3. Under conditions of Lemma 1, it holds that

‖Σ̂u,So − Σu,S‖ = OP(mswo),  ‖Σ̂u,S(1) − Σu,S‖ = OP(msw1),  ‖Σ̂u,S(2) − Σu,S‖ = OP(msw2),
where ms is defined as in (4) with p being replaced by s.
Now, we are ready to state our main result on the estimation of ΣS based on the above preliminary results. From Theorem 1, it is easily seen that the comparison of the estimation accuracy of ΣS among three methods is solely determined by the relative magnitude of wo, w1, and w2. Therefore, we should use additional variables to estimate the factors if p is much larger than s in the sense that T/log s = O(p) and s log s = o(T) (implying w2 = wo < w1).
Theorem 1. Under conditions of Lemma 1, it holds that
For the relative norm, , , and .
For the max-norm, , , and .
For the operator norm of the inverse matrix, , and .
Remark 3. So far, we have assumed that the number of factors K is fixed and known. A data-driven choice of K has been extensively studied in the econometrics literature, for example, by Bai and Ng (2002) and Kapetanios (2010). To estimate K, we can adopt the method of Bai and Ng (2002) and propose a consistent estimator of K (by allowing p, T → ∞) as follows:

K̂ = argmin0≤k≤N log{(pT)⁻¹‖Y − T⁻¹YF̂kF̂k′‖F²} + k·g(p, T),

where N is a predefined upper bound, F̂k is a T × k matrix whose columns are √T times the eigenvectors corresponding to the largest k eigenvalues of Y′Y, and g(p, T) is a penalty function. Two examples suggested by Bai and Ng (2002) are

g1(p, T) = ((p + T)/(pT)) log(pT/(p + T))  and  g2(p, T) = ((p + T)/(pT)) log(min(p, T)).
Under our assumptions (i)–(x), all conditions required by theorem 2 of Bai and Ng (2002) hold. Hence, their theorem implies that P(K̂ = K) → 1. Then, conditioning on the event that {K̂ = K}, our Theorem 1 still holds by replacing K with K̂. Other effective methods for selecting the number of factors include the eigen ratio method by Lam and Yao (2012) and Ahn and Horenstein (2013).
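As a sketch of the information-criterion rule described above (our own illustration using the penalty g1), the selection can be coded as follows.

```python
import numpy as np

def estimate_num_factors(Y, N=8):
    """Bai-Ng style information criterion for the number of factors.

    Y is p x T.  For each k, F_k collects sqrt(T) times the top-k eigenvectors of
    Y'Y, and the criterion is log(residual variance) + k * g1(p, T).
    """
    p, T = Y.shape
    lam, vec = np.linalg.eigh(Y.T @ Y)                      # T x T eigen-decomposition
    vec = vec[:, ::-1]                                      # descending eigenvalue order
    g1 = (p + T) / (p * T) * np.log(p * T / (p + T))        # penalty g1(p, T)
    ic = []
    for k in range(N + 1):
        F_k = np.sqrt(T) * vec[:, :k]                       # T x k factor matrix
        resid = Y - (Y @ F_k / T) @ F_k.T                   # Y - B_k F_k'
        V_k = np.sum(resid ** 2) / (p * T)                  # average squared residual
        ic.append(np.log(V_k) + k * g1)
    return int(np.argmin(ic))
```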
Remark 4. When K grows with p and T, Fan, Liao, and Mincheva (2013) gave the explicit dependence of the convergence rates on K for their proposed POET estimator. By adopting their technique, we can obtain the following results:
, , ;
, , ;
-
, , .
Again, the rate difference among three types of estimators only depends on wo, w1, and w2. Therefore, the same conclusion (when p is much larger than s, using additional variables improves the estimation of ΣS) can still be made even if K diverges. As long as K diverges in the rate that , or K = o(1/(msw1)1/3), the same blessing of dimensionality phenomena persist in terms of estimation consistency in relative norm, max norm, or operator norm of the inverse, respectively.
5. Divide-and-Conquer Computing Method
As discussed previously, we prefer to use as much auxiliary data information as possible, even when we are only interested in the covariance matrix of a particular set of variables. However, this can bring a heavy computational burden. This concern motivates a simple divide-and-conquer scheme that splits all p variables in Y. Without loss of generality, assume that the p rows of the matrix Y can be evenly divided into M groups with p/M variables in each group. The s variables of interest can possibly be assigned to different groups.
Divide-and-Conquer Computation Scheme
In the mth group, obtain the initial estimator Σ̃u,m by using the adaptive thresholding method as described in Section 3 based on the data in the mth group only.
Denote by Ym the data matrix corresponding to the variables in the mth group and let F̂m = (f̂m,1, …, f̂m,T)′, whose columns are √T times the eigenvectors corresponding to the largest K eigenvalues of the T × T matrix Ym′Σ̃u,m⁻¹Ym. The computation in the above two steps can be done in a parallel manner.
Average to obtain a single estimator of ft as

f̄t = M⁻¹ Σm=1M f̂m,t,  t = 1, …, T.
The loading matrix estimate is given by B̄S = T−1YSF̄, where F̄ = (f̄1, …, f̄T)′.
The idiosyncratic matrix is estimated as follows. Let f̄t be the tth row of F̄ and b̄i be the ith row of B̄S. Let ūit = yit − b̄i′f̄t and σ̂ij = T⁻¹Σt ūitūjt. The (i, j)th entry of the estimator Σ̄u,S of Σu,S is given by thresholding σ̂ij (for i ≠ j) at the level Cω(s)(σ̂iiσ̂jj)^{1/2}, where ω(s) is defined as in (7) with p replaced by s.
The final estimator of the covariance matrix is given by

Σ̄S = B̄SB̄S′ + Σ̄u,S.
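The scheme can be sketched as follows (our own illustration, reusing the helper routines sketched in Section 3; in practice, steps 1 and 2 would be dispatched to parallel workers).

```python
import numpy as np

def divide_and_conquer_factors(Y, K, M, C=0.5):
    """Average the group-wise WPC factor estimates over M groups of about p/M variables.

    Reuses initial_idiosyncratic_cov and wpc_factors from the sketches in Section 3.
    Returns F_bar (T x K), the averaged factor estimates.
    """
    p, T = Y.shape
    groups = np.array_split(np.arange(p), M)          # split the p variables into M groups
    F_bar = np.zeros((T, K))
    for idx in groups:                                # this loop is embarrassingly parallel
        Ym = Y[idx]
        Su_m = initial_idiosyncratic_cov(Ym, K, C)    # step 1: group-wise Sigma_u estimate
        Fm, _ = wpc_factors(Ym, Su_m, K)              # step 2: group-wise WPC factors
        F_bar += Fm / M                               # step 3: average over the M groups
        # Note: in practice the group-wise estimates may need sign/rotation alignment
        # before averaging, since each F_m is only identified up to a rotation.
    return F_bar
```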
We show that, if M is fixed,
These rates match those attained by Method 2, where all p variables are pooled together for the analysis. The proof is given in Appendix A.3. The simulation results in Section 6 further demonstrate that, without sacrificing estimation accuracy, the divide-and-conquer method runs much faster than Method 2. Therefore, the divide-and-conquer method is practically useful when dealing with massive datasets.
The main computational cost of our method comes from inverting Σ̃u. For Method 2, where all p variables are pooled together for the analysis, the computational complexity of this inversion is O(p³). In contrast, for the divide-and-conquer method, the corresponding estimator Σ̃u,m in the mth group only requires O((p/M)³) operations to invert, so the total computational complexity is O(p³/M²). Hence, the computational speed can be boosted by a factor of M². Such a computational acceleration can also be observed from the simulation results in Figure 1(d). Other operations, such as the eigen-decomposition of the T × T matrix Ym′Σ̃u,m⁻¹Ym, do not have a dominating computational cost, as we assume that p is much larger than T. When M grows too fast, the divide-and-conquer method may lose estimation efficiency compared with the pooled analysis (Method 2). However, considering its computational gains, the divide-and-conquer method is practically useful when dealing with massive datasets.
Figure 1.
Estimation error by four methods and their computational time: the dotted lines represent the means over 100 simulations and the segments represent the corresponding standard deviations.
6. Simulations
We use simulated examples to compare the statistical performances of Methods 1, 2, and the Oracle Method. We fix the number of factors K = 3 and repeat 100 simulations for each combination of (s, p, T). The loading bi, the factor ft and the idiosyncratic error ut are generated as follows:
- b1, …, bp are iid from NK(0, 5IK);
- f1, …, fT are iid from NK(0, IK);
- u1, …, uT are iid from Np(0, 50Ip).
The observations are generated from (1) using the bi, ft, and ut above. Tables 1–4 report the estimation errors of the factors, the loading matrices, and the covariance of interest ΣS in terms of different measurements.
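For reproducibility, the data-generating process above can be written compactly as follows (our own snippet mirroring these settings).

```python
import numpy as np

def simulate_factor_data(s, p, T, K=3, seed=0):
    """Generate y_t = B f_t + u_t with b_i ~ N_K(0, 5 I_K), f_t ~ N_K(0, I_K),
    and u_t ~ N_p(0, 50 I_p); the first s variables form the set of interest S."""
    rng = np.random.default_rng(seed)
    B = rng.normal(scale=np.sqrt(5.0), size=(p, K))
    F = rng.normal(size=(T, K))
    U = rng.normal(scale=np.sqrt(50.0), size=(p, T))
    Y = B @ F.T + U                                   # p x T data matrix
    Sigma_S = B[:s] @ B[:s].T + 50.0 * np.eye(s)      # true covariance of the variables in S
    return Y, F, Sigma_S
```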
Table 1.
Comparison of three methods when s is much smaller than p (T = 200).

| (s, p) | (50, 1000) |  |  | (50, 2000) |  |  |
|---|---|---|---|---|---|---|
| Method | M1 | M2 | ORA | M1 | M2 | ORA |
| ‖Σ̂S − ΣS‖ΣS | 0.271(0.014) | 0.205(0.013) | 0.204(0.013) | 0.270(0.014) | 0.201(0.013) | 0.200(0.013) |
| ‖Σ̂S⁻¹ − ΣS⁻¹‖ | 0.016(0.003) | 0.009(0.002) | 0.009(0.002) | 0.017(0.003) | 0.009(0.002) | 0.009(0.002) |
| ‖Σ̂S − ΣS‖max | 18.828(3.072) | 17.460(3.237) | 17.457(3.261) | 18.076(2.697) | 16.631(2.949) | 16.623(2.950) |
| maxt≤T ‖f̂t − Hft‖ | 1.811(0.195) | 0.445(0.046) | NA | 1.870(0.236) | 0.331(0.025) | NA |
| maxi≤s ‖b̂i − Hbi‖ | 8.064(0.694) | 4.100(0.330) | 3.858(0.274) | 8.150(0.682) | 3.932(0.292) | 3.805(0.297) |
| ‖Σ̂u,S − Σu,S‖ | 11.375(1.262) | 5.519(0.813) | 5.268(0.843) | 11.466(1.353) | 5.253(0.776) | 5.113(0.739) |
NOTE: M1, M2, and ORA stand for Method 1, 2, and Oracle method, respectively.
Table 4.
Comparison of three methods when s is comparable to p (T = 400).

| (s, p) | (800, 1000) |  |  | (800, 2000) |  |  |
|---|---|---|---|---|---|---|
| Method | M1 | M2 | ORA | M1 | M2 | ORA |
| ‖Σ̂S − ΣS‖ΣS | 0.193(0.004) | 0.192(0.004) | 0.189(0.004) | 0.192(0.004) | 0.190(0.004) | 0.188(0.004) |
| ‖Σ̂S⁻¹ − ΣS⁻¹‖ | 0.008(0.001) | 0.008(0.001) | 0.008(0.001) | 0.008(0.001) | 0.008(0.001) | 0.008(0.001) |
| ‖Σ̂S − ΣS‖max | 17.062(2.603) | 17.051(2.612) | 17.041(2.621) | 16.919(2.182) | 16.891(2.206) | 16.888(2.209) |
| maxt≤T ‖f̂t − Hft‖ | 0.467(0.038) | 0.423(0.036) | NA | 0.466(0.038) | 0.304(0.026) | NA |
| maxi≤s ‖b̂i − Hbi‖ | 11.009(0.298) | 10.850(0.302) | 10.225(0.205) | 10.934(0.274) | 10.530(0.213) | 10.189(0.172) |
| ‖Σ̂u,S − Σu,S‖ | 5.367(0.577) | 5.276(0.560) | 4.880(0.528) | 5.293(0.411) | 5.024(0.461) | 4.894(0.420) |
NOTE: M1, M2, and ORA stand for Method 1, 2, and Oracle method, respectively.
We see from Tables 1 and 2 that when s = 50 and p = 1000 or 2000, Method 1 performs much worse than Method 2, for both T = 200 and T = 400. However, when s increases to 800 with p unchanged, Tables 3 and 4 show that the improvement of Method 2 over Method 1 is less pronounced. This is expected, as the set of interest already contains sufficiently rich information to produce an accurate estimator of the realized factors. In general, we note that Method 2 is most advantageous in settings where s is much smaller than p. In addition, from Tables 3 and 4, we can tell that Method 2 comes closer to the Oracle Method as p grows. In practice, we also observe that the WPC factor estimator performs better than the unweighted PC estimator when ut is heteroscedastic. Due to space limitations, we choose not to present the simulation results for this setting.
Table 2.
Comparison of three methods when s is much smaller than p (T = 400).

| (s, p) | (50, 1000) |  |  | (50, 2000) |  |  |
|---|---|---|---|---|---|---|
| Method | M1 | M2 | ORA | M1 | M2 | ORA |
| ‖Σ̂S − ΣS‖ΣS | 0.186(0.009) | 0.132(0.007) | 0.131(0.007) | 0.186(0.009) | 0.131(0.008) | 0.130(0.008) |
| ‖Σ̂S⁻¹ − ΣS⁻¹‖ | 0.011(0.002) | 0.004(0.001) | 0.004(0.001) | 0.011(0.002) | 0.004(0.001) | 0.004(0.001) |
| ‖Σ̂S − ΣS‖max | 14.054(1.945) | 11.922(2.245) | 11.891(2.262) | 14.180(2.154) | 11.901(2.603) | 11.900(2.604) |
| maxt≤T ‖f̂t − Hft‖ | 1.839(0.193) | 0.417(0.036) | NA | 1.843(0.198) | 0.305(0.026) | NA |
| maxi≤s ‖b̂i − Hbi‖ | 6.960(0.584) | 2.830(0.200) | 2.692(0.198) | 7.024(0.605) | 2.761(0.188) | 2.692(0.194) |
| ‖Σ̂u,S − Σu,S‖ | 11.871(1.540) | 4.138(0.510) | 3.824(0.501) | 11.457(1.569) | 4.088(0.516) | 3.889(0.542) |
NOTE: M1, M2, and ORA stand for Method 1, 2, and Oracle method, respectively.
Table 3.
Comparison of three methods when s is comparable to p (T = 200).

| (s, p) | (800, 1000) |  |  | (800, 2000) |  |  |
|---|---|---|---|---|---|---|
| Method | M1 | M2 | ORA | M1 | M2 | ORA |
| ‖Σ̂S − ΣS‖ΣS | 0.440(0.006) | 0.439(0.006) | 0.435(0.006) | 0.439(0.006) | 0.436(0.006) | 0.435(0.006) |
| ‖Σ̂S⁻¹ − ΣS⁻¹‖ | 0.062(0.009) | 0.062(0.009) | 0.062(0.009) | 0.061(0.009) | 0.061(0.009) | 0.062(0.012) |
| ‖Σ̂S − ΣS‖max | 24.565(2.626) | 24.562(2.609) | 24.567(2.599) | 24.511(2.883) | 24.543(2.847) | 24.536(2.851) |
| maxt≤T ‖f̂t − Hft‖ | 0.488(0.047) | 0.447(0.040) | NA | 0.478(0.049) | 0.337(0.038) | NA |
| maxi≤s ‖b̂i − Hbi‖ | 15.550(0.488) | 15.370(0.462) | 14.418(0.271) | 15.595(0.551) | 15.041(0.357) | 14.398(0.243) |
| ‖Σ̂u,S − Σu,S‖ | 6.745(0.611) | 6.680(0.635) | 6.405(0.630) | 6.904(0.734) | 6.697(0.763) | 6.588(0.737) |

NOTE: M1, M2, and ORA stand for Method 1, Method 2, and the Oracle Method, respectively.
For further comparison with the divide-and-conquer method, we vary T from 50 to 500 and set (s, p, M) as s = ⌊T^0.6⌋, p = ⌊T^1.4⌋, and M = ⌊T^0.2⌋. Figure 1 shows the estimation errors of the four methods together with the corresponding computational time. Again, when p is large, Method 2 performs as well as the Oracle Method, and both greatly outperform Method 1. However, its computation becomes much slower in this case. In contrast, the divide-and-conquer method is much faster while maintaining performance comparable to Method 2. In the extreme case where p is around 6000 (T = 500), the divide-and-conquer method boosts the speed of Method 2 by about nine-fold.
7. Real Data Example
We use a real data example to illustrate how different utilization of the available variables can affect the inference for the variables of interest. Krug et al. (2012) carried out a gene profiling study among 40 Portuguese and Spanish adults to identify key genetic risk factors for ischemic stroke. Among them, 20 subjects were patients with ischemic stroke and the others were controls. Their gene profiles were obtained using the GeneChip Human Genome U133 Plus 2.0 microarray. The data are available at the Gene Expression Omnibus under accession name "GSE22255."
To judge how effectively gene expression can distinguish ischemic stroke patients from controls, we applied linear discriminant analysis (LDA) to this dataset. We randomly chose 10 subjects as the test set and the rest as the training set, and repeated the random splitting for 100 runs. In each run, we selected the set of differentially expressed (DE) genes with a threshold of over 1.2-fold change and a Q-value ≤ 0.05, a commonly used criterion to define DE genes (Storey 2002). An LDA rule was then learned from the training set using the selected genes and applied to the test set for classifying cases and controls. The LDA rule classifies a subject as a case if
δ̂′Σ̂⁻¹(x − μ̄) > 0,  (11)
where δ̂ = μ̂1 − μ̂0 ∈ ℝs is the sample mean difference between the two groups (case minus control), s is the number of selected genes, Σ̂ ∈ ℝs×s is an estimator of the true covariance matrix Σ of the selected genes, and μ̄ = (μ̂1 + μ̂0)/2. The quantities μ̄, δ̂, and Σ̂ are obtained from the training set, and x is the gene expression vector of a subject in the test set.
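A sketch of this classification step (our own illustration; Σ̂ is whichever covariance estimator, LDA-1 or LDA-2, is plugged in) is:

```python
import numpy as np

def lda_classify(x, mu1, mu0, Sigma_hat):
    """Return 1 (case) if the LDA rule (11) classifies x as a case, and 0 otherwise.

    mu1 and mu0 are the training-set group means of the selected genes, and
    Sigma_hat is the estimated covariance matrix (factor-model based for LDA-1/LDA-2).
    """
    delta = mu1 - mu0                                        # sample mean difference
    mu_bar = (mu1 + mu0) / 2.0
    score = delta @ np.linalg.solve(Sigma_hat, x - mu_bar)   # delta' Sigma^{-1} (x - mu_bar)
    return int(score > 0)
```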
As s can be larger than the sample size, the traditional LDA, in which Σ̂ is the sample covariance, is no longer applicable. An alternative is to estimate Σ by adopting the factor model. Factor modeling is widely used in the genomics literature to model the dependencies among genes (Kustra, Shioda, and Zhu 2006; Carvalho et al. 2012). Several mechanisms, such as the natural pathway structure (Ogata et al. 2000), can serve as latent factors affecting the correlation among genes. The few spiked eigenvalues of the sample covariance matrix in Figure 2 also suggest the existence of potential latent factors in this dataset. Again, there are two ways of using the factor model. One way is to use Method 1, where all procedures are based on the selected genes only; the resulting rule is referred to as "LDA-1" in Figure 3. Another way is to use auxiliary data as in Method 2. More specifically, it first uses the data from all involved genes and all subjects in the training set to estimate the latent factors. These estimated factors are then applied to the set of selected genes, for which the loading and idiosyncratic matrix estimators are obtained. Combining them produces the covariance matrix estimator, which is still an s × s matrix. The resulting rule is referred to as "LDA-2" in Figure 3. Recall that the only difference between the two rules is that they use different covariance estimators.
Figure 2.
Eigen-values of the sample covariance matrix for GSE22255.
Figure 3.
Misclassification rates of LDA-1 and LDA-2 over 100 random splits: the dotted lines represent the means over 100 splits and the segments represent the corresponding standard deviations.
Figure 3 plots the average misclassification rates on the test set against the number of factors over the 100 random splits. It is clearly seen that LDA-2 gives better misclassification rates than LDA-1, which is solely due to the different estimation of the covariance matrix. These results lend further support to our claim that using more data is beneficial.
Supplementary Material
Acknowledgments
Guang Cheng was on sabbatical at Princeton while this work was carried out and thanks the Princeton ORFE department for its hospitality. The authors thank the editor, the associate editor, and the referees for their constructive comments.
Funding: Guang Cheng gratefully acknowledges NSF CAREER Award DMS-1151692, DMS-1418042, Simons Fellowship in Mathematics, Office of Naval Research (ONR N00014-15-1-2331). Jianqing Fan was supported in part by NSF Grants DMS-1206464, DMS-1406266, and NIH grant R01GM100474-4. Quefeng Li was supported by NIH grant 2R01-GM072611-11 as a postdoctoral fellow at Princeton University and supported by NIH grant 2R01-GM047845-25 at UNC-Chapel Hill.
Footnotes
Supplementary Materials: The online supplement contains the appendices for the article.
References
- Ahn SC, Horenstein AR. Eigenvalue Ratio Test for the Number of Factors. Econometrica. 2013;81:1203–1227.
- Antoniadis A, Fan J. Regularized Wavelet Approximations (with discussion). Journal of the American Statistical Association. 2001;96:939–967.
- Bai J, Li K. Statistical Analysis of Factor Models of High Dimension. The Annals of Statistics. 2012;40:436–465.
- Bai J, Liao Y. Statistical Inferences Using Large Estimated Covariances for Panel Data and Factor Models. arXiv preprint arXiv:1307.2662. 2013.
- Bai J, Ng S. Determining the Number of Factors in Approximate Factor Models. Econometrica. 2002;70:191–221.
- Bickel PJ, Levina E. Covariance Regularization by Thresholding. The Annals of Statistics. 2008;36:2577–2604.
- Cai T, Liu W. Adaptive Thresholding for Sparse Covariance Matrix Estimation. Journal of the American Statistical Association. 2011;106:672–684.
- Cai T, Zhang CH, Zhou H. Optimal Rates of Convergence for Covariance Matrix Estimation. The Annals of Statistics. 2010;38:2118–2144.
- Carvalho CM, Chang J, Lucas JE, Nevins JR, Wang Q, West M. High-Dimensional Sparse Factor Modeling: Applications in Gene Expression Genomics. Journal of the American Statistical Association. 2012;103:1438–1456. doi: 10.1198/016214508000000869.
- Choi I. Efficient Estimation of Factor Models. Econometric Theory. 2012;28:274–308.
- Fama EF, French KR. Common Risk Factors in the Returns on Stocks and Bonds. Journal of Financial Economics. 1993;33:3–56.
- Fan J, Fan Y, Lv J. High Dimensional Covariance Matrix Estimation Using a Factor Model. Journal of Econometrics. 2008;147:186–197.
- Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Liao Y, Mincheva M. High Dimensional Covariance Matrix Estimation in Approximate Factor Models. The Annals of Statistics. 2011;39:3320–3356. doi: 10.1214/11-AOS944.
- Fan J, Liao Y, Mincheva M. Large Covariance Estimation by Thresholding Principal Orthogonal Complements. Journal of the Royal Statistical Society, Series B. 2013;75:603–680. doi: 10.1111/rssb.12016.
- Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. Englewood Cliffs, NJ: Prentice Hall; 1992.
- Kapetanios G. A Testing Procedure for Determining the Number of Factors in Approximate Factor Models With Large Datasets. Journal of Business and Economic Statistics. 2010;28:397–409.
- Krug T, Gabriel JP, Taipa R, Fonseca BV, Domingues-Montanari S, Fernandez-Cadenas I, Manso H, Gouveia LO, Sobral J, Albergaria I, Gaspar G, Jiménez-Conde J, Rabionet R, Ferro JM, Montaner J, Vicente AM, Rui Silva M, Matos I, Lopes G, Oliveira SA. TTC7B Emerges as a Novel Risk Factor for Ischemic Stroke Through the Convergence of Several Genome-Wide Approaches. Journal of Cerebral Blood Flow & Metabolism. 2012;32:1061–1072. doi: 10.1038/jcbfm.2012.24.
- Kustra R, Shioda R, Zhu M. A Factor Analysis Model for Functional Genomics. BMC Bioinformatics. 2006;7:216–228. doi: 10.1186/1471-2105-7-216.
- Lam C, Fan J. Sparsistency and Rates of Convergence in Large Covariance Matrix Estimation. The Annals of Statistics. 2009;37:4254–4278. doi: 10.1214/09-AOS720.
- Lam C, Yao Q. Factor Modeling for High-Dimensional Time Series: Inference for the Number of Factors. The Annals of Statistics. 2012;40:694–726.
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research. 2000;28:27–30. doi: 10.1093/nar/27.1.29.
- Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse Permutation Invariant Covariance Estimation. Electronic Journal of Statistics. 2008;2:494–515.
- Rothman AJ, Levina E, Zhu J. Generalized Thresholding of Large Covariance Matrices. Journal of the American Statistical Association. 2009;104:177–186.
- Storey JD. A Direct Approach to False Discovery Rates. Journal of the Royal Statistical Society, Series B. 2002;64:479–498.
- Zhang CH. Nearly Unbiased Variable Selection Under Minimax Concave Penalty. The Annals of Statistics. 2010;38:894–942.