Abstract
Factor modeling is an essential tool for exploring intrinsic dependence structures among high-dimensional random variables. Much progress has been made for estimating the covariance matrix from a high-dimensional factor model. However, the blessing of dimensionality has not yet been fully embraced in the literature: much of the available data are often ignored in constructing covariance matrix estimates. If our goal is to accurately estimate a covariance matrix of a set of targeted variables, shall we employ additional data, which are beyond the variables of interest, in the estimation? In this article, we provide sufficient conditions for an affirmative answer, and further quantify its gain in terms of Fisher information and convergence rate. In fact, even an oracle-like result (as if all the factors were known) can be achieved when a sufficiently large number of variables is used. The idea of using data as much as possible brings computational challenges. A divide-and-conquer algorithm is thus proposed to alleviate the computational burden, and also shown not to sacrifice any statistical accuracy in comparison with a pooled analysis. Simulation studies further confirm our advocacy for the use of full data, and demonstrate the effectiveness of the above algorithm. Our proposal is applied to a microarray data example that shows empirical benefits of using more data. Supplementary materials for this article are available online.
Keywords: Asymptotic normality, Auxiliary data, Divide-and-conquer, Factor model, Fisher information, High-dimensionality
1. Introduction
With the advance of modern information technology, it is now possible to track millions of variables or subjects simultaneously. To discover the relationship among them, the estimation of a high-dimensional covariance matrix Σ has recently received a great deal of attention in the literature. Researchers proposed various regularization methods to obtain consistent estimators of Σ (Bickel and Levina 2008; Rothman et al. 2008; Lam and Fan 2009; Cai, Zhang, and Zhou 2010; Cai and Liu 2011). A key assumption for these regularization methods is that Σ is sparse, that is, many elements of Σ are small or exactly zero.
Different from such a sparsity condition, factor analysis assumes that the intrinsic dependence is mainly driven by some common latent factors (Johnson and Wichern 1992). For example, in modeling stock returns, Fama and French (1993) proposed the well-known Fama–French three-factor model. In the factor model, Σ has spiked eigenvalues and dense entries. In the high-dimensional setting, there are many recent studies on the estimation of the covariance matrix based on the factor model (Fan, Fan, and Lv 2008; Fan, Liao, and Mincheva 2011, 2013; Bai and Li 2012; Bai and Liao 2013), where the number of variables can be much larger than the number of observations.
The interest of this article is in the estimation of the covariance matrix for a given set of variables using auxiliary data information. In the existing literature, only the data on the variables of interest are used. In today's data-rich environment, substantially more data are in fact available but are often ignored in statistical analysis. For example, we might be interested in understanding the covariance matrix of 50 stocks in a portfolio, yet the available data consist of a time series of thousands of stocks. Similarly, an oncologist may wish to study the dependence or network structure among 100 genes that are significantly associated with a certain cancer, yet she has expression data for over 20,000 genes from the whole genome. Can we benefit from using such rich auxiliary data?
The answer to the above question is affirmative when a factor model is imposed. Since the whole system is driven by a few common factors, these common factors can be inferred more accurately from a much larger set of data information (Fan, Liao, and Mincheva 2013), which is indeed a “blessing of dimensionality.” A major contribution of this article is to characterize how much the estimation of the covariance matrix of interest and also common factors can be improved by auxiliary data information (and under what conditions).
Consider the following factor model for all p observable data yt = (y1t, …, ypt)′ ∈ ℝp at time t:
yt = Bft + ut,  (1)
where ft ∈ ℝK is a K-dimensional vector of common factors, B = (b1, …, bp)′ ∈ ℝp×K is a factor loading matrix with bi ∈ ℝK being the factor loading of the ith variable on the latent factor ft, and ut is an idiosyncratic error vector. In the above model, yt is the only observable variable, while B is a matrix of unknown parameters and (ft, ut) are latent random variables. Without loss of generality, we assume E(ft) = E(ut) = 0 and that ft and ut are uncorrelated. Then, the model-implied covariance structure is

Σ = BΣfB′ + Σu,

where Σf = cov(ft) and Σu = cov(ut). Observe that B and ft are not individually identifiable, since Bft = (BH)(H′ft) for any orthogonal matrix H. To this end, an identifiability condition is imposed:
cov(ft) = IK  and  B′Σu⁻¹B is diagonal,  (2)
which is a common assumption in the literature (Bai and Li 2012; Bai and Liao 2013).
Assume that we are only interested in a subset S, of size s, among the total of p variables in model (1). We aim to obtain an efficient estimator of

ΣS = BSBS′ + Σu,S,

the covariance matrix of the s variables in S, where BS is the submatrix of B with row indices in S and Σu,S is the submatrix of Σu with row and column indices in S. As mentioned above, the existing literature uses the following conventional method:
Method 1: Use solely the s variables in the set S to estimate common factors ft, the loading matrix BS, the idiosyncratic matrix Σu,S, and the covariance matrix ΣS.
This idea is apparently strongly influenced by the nonparametric estimation of the covariance matrix and ignores a large portion of the available data in the other p – s variables. An intuitively more efficient method is
Method 2: Use all the p variables to obtain estimators of ft, the loading matrix B, the idiosyncratic matrix Σu, and the entire covariance matrix Σ, and then restrict them to the variables of interest. This is the same as estimating ft using all variables, and then estimating BS and Σu,S based on the model (1) and the subset S with ft being estimated (observed), and obtaining a plug-in estimator of ΣS.
We will show that Method 2 is more efficient than Method 1 in the estimation of ft and ΣS, as more auxiliary data information is incorporated. By treating the common factors as unknown parameters, we show that their Fisher information grows as more data are used in Method 2. In this case, a more efficient factor estimate can be obtained, for example, through the weighted principal component (WPC) method (Bai and Liao 2013). The advantage in factor estimation further carries over to the estimation of ΣS by Method 2 in terms of its convergence rate. Moreover, if the total number of variables is sufficiently large, Method 2 is proven to perform as well as an "oracle method" that observes all latent factors. This lends further support to our aforementioned claim of a "blessing of dimensionality." Such a best-possible rate improvement is new to the existing literature and counts as another contribution of this article. All these conclusions hold when the number of factors K is assumed to be fixed and known, while s, p, and T all tend to infinity.
The idea of using as much data as possible brings computational challenges. Fortunately, all p variables are driven by the same group of latent factors. Accordingly, we can split the p variables into smaller groups and use each group to estimate the latent factors. The final factor estimate is obtained by averaging over these repeatedly estimated factors. Obviously, this divide-and-conquer algorithm can be implemented in a parallel computing environment, and thus produces factor estimators much more efficiently. On the other hand, our theory shows that this new method performs as well as the "pooled analysis," in which we run the method over the whole dataset. Simulation studies further demonstrate the boosted computational speed and satisfactory statistical performance.
The rest of the article is organized as follows. We compare the Fisher information of the factors under the two methods in Section 2. Section 3 describes the WPC method. As a main result, the convergence rates of different estimators of ΣS are compared in Section 4 under various norms. Section 5 introduces the divide-and-conquer method for accelerating computation, while Section 6 presents all simulation results. Section 7 gives a microarray data example to illustrate our proposal. All technical proofs are delegated to the Appendix.
For any vector a, let aS denote the subvector of a with indices in S and ‖a‖ its Euclidean norm. For a symmetric matrix A ∈ ℝd×d, let AI,J be the submatrix of A with row indices in I and column indices in J; we write AS for AS,S for simplicity. Let λj(A) be the jth largest eigenvalue of A. Denote by ‖A‖ = max{|λ1(A)|, |λd(A)|} the operator norm of A, ‖A‖max = maxi,j |aij| the max-norm of A, where aij is the (i, j)th entry of A, ‖A‖1 = maxj Σi |aij| the L1 norm of A, ‖A‖F = (Σi,j aij²)^{1/2} the Frobenius norm of A, and ‖A‖M = d−1/2‖M−1/2AM−1/2‖F the relative norm of A to M, where the weight matrix M is assumed to be positive definite. For a nonsquare matrix C, let CS be the submatrix of C with row indices in S.
2. Fisher Information of Common Factor
In this section, we treat the vector of common factors as a fixed unknown parameter and compute its Fisher information matrices under Method 1 and Method 2. In the computation, the loading matrix B is treated as deterministic in Proposition 2. In Proposition 3, the Fisher information is computed for each given B and then averaged over B by regarding it as a realization of a chance process, which bypasses the block-diagonal assumption that is otherwise needed when no averaging over B is taken. In the other sections, we adopt the convention of regarding the factors as random and B as fixed. We start by calculating the Fisher information of θt := Bft, which serves as an intermediate step in obtaining that of ft. For notational simplicity, the time index t is suppressed in (yt, ft, ut, θt), so that it becomes (y, f, u, θ) in this section.
Given a general density function of y, denoted as h(y; θ), the Fisher information of θ contained in the full data is given by

Ip(θ) = E[{∂ log h(y; θ)/∂θ}{∂ log h(y; θ)/∂θ}′].

When only the data in S are used, the Fisher information of θS is given by

IS(θS) = E[{∂ log hS(yS; θS)/∂θS}{∂ log hS(yS; θS)/∂θS}′],
where hS is the marginal density of yS for the target set of variable S. Our first proposition shows that {Ip(θ)}S, the submatrix of Ip(θ) restricted on S, dominates IS(θS) under a mild condition.
Proposition 1. If h(y; θ) = h(y – θ) and the density function h(y – θ) satisfies the following regularity condition:
∫ ∂h(y − θ)/∂θS dySc = ∂{∫ h(y − θ) dySc}/∂θS,  (3)
then{Ip(θ)}S ≥ IS(θS) in the sense that {Ip(θ)}S – IS(θS) is positive semidefinite.
The regularity condition (3) is fairly mild, as illustrated in the following examples.
Example 1. In model (1), if uS and uSc are independent, then (3) holds.
Example 2. If y follows an elliptical distribution with density of the form h(y − θ) ∝ g((y − θ)′Σ0⁻¹(y − θ)) for some positive definite matrix Σ0, where the mapping function g(t): [0, ∞) → [0, ∞) satisfies |g′(t)| ≤ cg(t) for some positive constant c, and E|y| < ∞, then (3) holds. Example 2 includes some commonly used multivariate distributions as special cases, for example, the multivariate normal distribution and the multivariate t-distribution with degrees of freedom greater than 1. The proof is given in Appendix A.2.
We next compute the Fisher information of f based on the full dataset, denoted as I(f), and on the partial dataset restricted to S, denoted as IS(f). This can be done easily by noting that I(f) = B′Ip(θ)B and IS(f) = BS′IS(θS)BS. Indeed, the WPC estimators used in Methods 1 and 2 achieve such efficiency, since their asymptotic variances are proven to be the inverses of IS(f) and I(f), respectively; see Remark 1.
Proposition 2 shows that I(f) dominates IS(f), if Ip(θ) is block-diagonal, that is, {Ip(θ)}S,Sc = 0. Hence, common factors can be estimated more efficiently using additional data ySc. The above block-diagonal condition implies that the idiosyncratic error of additional variables cannot be confounded with that of the variables-of-interest. For example, if u is normal, then {Ip(θ)}S,Sc = 0 indeed requires that uS is independent of uSc.
Proposition 2. Under condition (3), if {Ip(θ)}S,Sc = 0, I(f) ≥ IS(f).
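To make this concrete, consider the Gaussian case discussed above, in which I(f) = B′Σu⁻¹B and IS(f) = BS′Σu,S⁻¹BS. The following small numerical check (our own illustration, not part of the original analysis) verifies that I(f) − IS(f) is positive semidefinite when Σu is block-diagonal across S and Sc:

```python
import numpy as np

rng = np.random.default_rng(0)
p, s, K = 30, 10, 3                       # p variables in total, the first s are of interest
B = rng.normal(size=(p, K))               # factor loadings

# Block-diagonal idiosyncratic covariance: u_S independent of u_{S^c}
A1 = rng.normal(size=(s, s))
A2 = rng.normal(size=(p - s, p - s))
Sigma_u = np.zeros((p, p))
Sigma_u[:s, :s] = A1 @ A1.T + s * np.eye(s)
Sigma_u[s:, s:] = A2 @ A2.T + (p - s) * np.eye(p - s)

# Gaussian Fisher information of f based on all p variables and on S only
I_full = B.T @ np.linalg.solve(Sigma_u, B)                   # B' Sigma_u^{-1} B
I_sub = B[:s].T @ np.linalg.solve(Sigma_u[:s, :s], B[:s])    # B_S' Sigma_{u,S}^{-1} B_S

# Proposition 2: the difference should be positive semidefinite
print("smallest eigenvalue of I(f) - I_S(f):",
      np.linalg.eigvalsh(I_full - I_sub).min())              # >= 0 up to rounding error
```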
So far we have treated B as deterministic. In contrast, Proposition 3 regards {bi} as a realization of a chance process. Under this assumption, the expectation of I(f) over B is shown to always dominate that of IS(f). In other words, averaging over the loading matrix, a larger dataset contains more information about the unknown factors.
Proposition 3. If b1, …, bp are iid random loadings with E(bi) = 0 and (3) holds, then E[I(f)] ≥ E[IS(f)], where the expectation is taken with respect to the distribution of B.
3. Efficient Estimation of Common Factor
In this section, we construct an efficient estimator of the common factors by showing that its asymptotic variance is exactly the inverse of its Fisher information. This together with the arguments in Section 2 enables us to draw a conclusion that using more data results in a more efficient factor estimator with a smaller asymptotic variance.
From a least-squares perspective, when the loading matrix B is known, ft can be estimated by the weighted least-squares estimator f̂t = (B′Σu⁻¹B)⁻¹B′Σu⁻¹yt. In the high-dimensional setting (p ≫ T), we assume that Σu is a sparse matrix and define its sparsity measurement as
mp = maxi≤p Σj≤p 1{σu,ij ≠ 0},  (4)

where σu,ij is the (i, j)th entry of Σu and 1{·} is the indicator function.
In particular, we assume the following sparsity condition:
| (5) |
Now, we propose to solve the following constrained weighted least-square problem:
minB,{ft} Σt=1T (yt − Bft)′Σ̃u⁻¹(yt − Bft)  subject to  T⁻¹Σt=1T ftft′ = IK and B′Σ̃u⁻¹B diagonal,  (6)
where Σ̃u is a regularized estimator of Σu to be discussed later. The above constraint is a sample analog of the identifiability condition (2). The weight Σ̃u⁻¹ accounts for the heterogeneity among the data and leads to more efficient estimation of (B, ft) (Choi 2012; Bai and Liao 2013).
Indeed, an initial estimator Σ̃u of the idiosyncratic matrix Σu is needed for solving the constrained weighted least-squares problem. We propose to obtain such an estimator by the following procedure, which is in the same spirit as the estimation of the idiosyncratic matrix in the POET method (Fan, Liao, and Mincheva 2013). Let Sy be the sample covariance matrix of y and {(λj, ξj)}j=1p be the eigen-pairs of Sy with λ1 ≥ λ2 ≥ … ≥ λp. Denote R = Sy − Σj=1K λjξjξj′, the principal orthogonal complement, with (i, j)th entry rij. We estimate Σu by Σ̃u, whose (i, j)th entry is

σ̃ij = rij if i = j, and σ̃ij = sij(rij) if i ≠ j,

where sij(·) is a general entry-wise thresholding function (Antoniadis and Fan 2001) such that sij(z) = 0 if |z| ≤ τij and |sij(z) − z| ≤ τij for |z| > τij. In this article, we choose hard thresholding, even though SCAD (Fan and Li 2001) and MCP (Zhang 2010) are also applicable. We specify the entry-wise thresholding level as
τij = Cω(p)(riirjj)^{1/2},  where ω(p) = (log p/T)^{1/2} + 1/√p,  (7)
and C is a constant chosen by cross-validation. The thresholding parameter Cω(p) is applied to the correlation matrix. This is similar to the adaptive thresholding estimator for a general covariance matrix (Rothman, Levina, and Zhu 2009), where the entry-wise thresholding level depends on p.
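This initial step can be sketched in a few lines of Python. The snippet below is our own illustration (not the authors' code); it takes the form ω(p) = (log p/T)^{1/2} + 1/√p from (7) and a fixed C as given, whereas the article chooses C by cross-validation.

```python
import numpy as np

def initial_idiosyncratic_cov(Y, K, C=0.5):
    """POET-style initial estimator of Sigma_u from the p x T data matrix Y.

    The diagonal of the principal orthogonal complement R is kept, and its
    off-diagonal entries are hard-thresholded at tau_ij = C * omega(p) * sqrt(r_ii r_jj).
    """
    p, T = Y.shape
    Sy = Y @ Y.T / T                                     # sample covariance (data assumed centered)
    lam, xi = np.linalg.eigh(Sy)                         # eigenvalues in ascending order
    lam, xi = lam[::-1], xi[:, ::-1]                     # reorder to descending
    R = Sy - (xi[:, :K] * lam[:K]) @ xi[:, :K].T         # principal orthogonal complement
    omega = np.sqrt(np.log(p) / T) + 1.0 / np.sqrt(p)    # omega(p) as in (7)
    d = np.sqrt(np.diag(R))
    tau = C * omega * np.outer(d, d)                     # correlation-adaptive thresholds
    Sigma_u = np.where(np.abs(R) > tau, R, 0.0)          # hard thresholding
    np.fill_diagonal(Sigma_u, np.diag(R))                # diagonal entries are kept as r_ii
    return Sigma_u
```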
With Σ̃u being the thresholding estimator described above, the constrained weighted least-square problem (6) can be solved by the weighted principal component (WPC) method. The solution is given by
B̂ = T⁻¹YF̂,  (8)

where Y = (y1, …, yT), F̂ = (f̂1, …, f̂T)′, and the columns of F̂ are √T times the eigenvectors corresponding to the largest K eigenvalues of the T × T matrix Y′Σ̃u⁻¹Y (Bai and Liao 2013).
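A minimal sketch of this step is given below (our illustration; the √T scaling of F̂ and the (pT)⁻¹ scaling of the weighted Gram matrix are normalization assumptions on our part and do not affect the eigenvectors). Method 1 applies the same routine to the rows of Y in S together with Σ̃u,S, while Method 2 applies it to the full Y.

```python
import numpy as np

def wpc_factors(Y, Sigma_u_tilde, K):
    """Weighted principal components as in (8): returns (F_hat, B_hat).

    Y is p x T.  The columns of F_hat (T x K) are sqrt(T) times the eigenvectors
    of Y' Sigma_u_tilde^{-1} Y associated with its largest K eigenvalues, so that
    F_hat' F_hat / T = I_K, and B_hat = Y F_hat / T.
    """
    p, T = Y.shape
    M = Y.T @ np.linalg.solve(Sigma_u_tilde, Y) / (p * T)   # T x T weighted Gram matrix
    lam, vec = np.linalg.eigh(M)                            # ascending eigenvalues
    F_hat = np.sqrt(T) * vec[:, ::-1][:, :K]                # top-K eigenvectors, scaled by sqrt(T)
    B_hat = Y @ F_hat / T                                   # loading estimator
    return F_hat, B_hat
```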
In the following, we give a result showing that the WPC estimator is asymptotically efficient. Indeed, Bai and Liao (2013) derive the asymptotic normality of f̂t under the following conditions:
(i) All eigenvalues of B′B/p are bounded away from zero and infinity as p → ∞;

(ii) there exists a K × K diagonal matrix Q such that B′Σu⁻¹B/p → Q as p → ∞, and the diagonal elements of Q are distinct and bounded away from infinity;

(iii) for each fixed t ≤ T, B′Σu⁻¹ut/√p →d N(0, Q) as p → ∞;

together with the sparsity assumption (5) and some additional regularity conditions given in Section A.1. When √p log p = o(T), it is shown that

√p(f̂t − Hft) →d N(0, Q⁻¹),  (9)

where H is a specific rotation matrix given by

H = V̂⁻¹(F̂′F/T)(B′Σ̃u⁻¹B/p),  (10)

and V̂ is the K × K diagonal matrix of the largest K eigenvalues of (pT)⁻¹Y′Σ̃u⁻¹Y. The rotation matrix H is introduced here so that Hft is an identifiable quantity from the data. See more discussion about the identifiability in Remark 2.
Condition (i) is a "pervasive condition" requiring that the common factors affect a nonnegligible fraction of the variables. This is a common assumption for principal-component-based methods (Fan, Liao, and Mincheva 2011; Bai and Liao 2013). In condition (ii), B′Σu⁻¹B is indeed the Fisher information (under Gaussian errors) contained in the p variables, while the limit Q can be viewed as the average information per variable. Hence, the asymptotic normality in (9) shows that f̂t is efficient, as its asymptotic variance attains the inverse of the (averaged) Fisher information.
Remark 1. The results in Section 2 together with (9) imply that Method 2 is in general better than Method 1 in the estimation of the common factors. To explain why, we consider two different cases. When p is of a larger order of magnitude than s, the number of variables of interest, Method 2 produces a better estimator of the factors with a faster convergence rate. Even when p and s diverge at the same speed, the factor estimator based on Method 2 possesses a smaller asymptotic variance, as long as Σu,S,Sc = 0. Recall that IS(f) = BS′Σu,S⁻¹BS and I(f) = B′Σu⁻¹B under Gaussian errors, and they also correspond to the inverses of the asymptotic variances given by Methods 1 and 2, respectively. Then, Proposition 2 implies that Method 2 has a smaller asymptotic variance if Σu,S,Sc = 0. Alternatively, if B is treated as random, Proposition 3 immediately implies that E[I(f)] ≥ E[IS(f)]. Therefore, even without the block-diagonal assumption, Method 2 produces a more efficient factor estimate on average.
4. Covariance Matrix Estimation
One primary goal of this article is to obtain an accurate estimator of the covariance matrix for the variables of interest. In this section, we compare three different estimation methods, namely Method 1, Method 2, and the Oracle Method, in terms of their rates of convergence under various norms. Obviously, these rates depend on how accurately the realized factors are estimated, as demonstrated later.
Below we describe these three methods in full detail.
- Method 1:
- Use solely the data in the subset S to obtain estimators of the realized factors F̂(1) and the loading matrix B̂1 = T−1YSF̂(1) based on (8);
- Let f̂t(1) denote the tth row of F̂(1), b̂i(1) the ith row of B̂1, ûit = yit − (b̂i(1))′f̂t(1), and σ̂ij = T⁻¹Σt ûitûjt. The (i, j)th entry of the idiosyncratic matrix estimator Σ̂u,S(1) of Σu,S is given by thresholding σ̂ij (for i ≠ j) at the level Cω(s)(σ̂iiσ̂jj)^{1/2}, where ω(s) = (log s/T)^{1/2} + 1/√s is defined as in (7) with p replaced by s;
- The final estimator is given by Σ̂S(1) = B̂1B̂1′ + Σ̂u,S(1).
- Method 2:
- Use all p variables to obtain the estimate F̂(2) as given in (8) for the realized factors and then estimate the loading BS by B̂2 = T−1YSF̂(2);
- Follow the same procedure as in Method 1 to obtain the estimator but based on F̂(2) and B̂2;
- The final estimator is given by Σ̂S(2) = B̂2B̂2′ + Σ̂u,S(2).
- Oracle Method:
- Estimate the loading by B̂o = T−1YSF, where F = (f1, …, fT)′ are the true factors.
- The idiosyncratic matrix estimator Σ̂u,So is given by the same procedure as in Method 1, with b̂i(1) and f̂t(1) replaced by b̂io (the ith row of B̂o) and ft, respectively.
- The final estimator is given by Σ̂So = B̂oB̂o′ + Σ̂u,So.
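The three constructions differ only in which factor matrix is plugged in. The compact sketch below (our own illustration rather than the authors' implementation, reusing the assumed form of ω(s)) makes this explicit: passing F̂(1), F̂(2), or the true F yields Σ̂S(1), Σ̂S(2), or Σ̂So, respectively.

```python
import numpy as np

def covariance_of_interest(Y_S, F, C=0.5):
    """Plug-in estimator Sigma_S_hat = B_S B_S' + Sigma_u_S given factors F (T x K).

    Y_S is the s x T data matrix of the variables of interest; F may be the factor
    estimate from Method 1 or Method 2, or the true factors (Oracle Method).
    """
    s, T = Y_S.shape
    B_S = Y_S @ F / T                                    # loading estimator restricted to S
    U = Y_S - B_S @ F.T                                  # idiosyncratic residuals, s x T
    Su = U @ U.T / T                                     # raw residual covariance
    omega = np.sqrt(np.log(s) / T) + 1.0 / np.sqrt(s)    # omega(s), assumed form as in (7)
    d = np.sqrt(np.diag(Su))
    tau = C * omega * np.outer(d, d)
    Sigma_u_S = np.where(np.abs(Su) > tau, Su, 0.0)      # threshold off-diagonal entries
    np.fill_diagonal(Sigma_u_S, np.diag(Su))             # keep the diagonal untouched
    return B_S @ B_S.T + Sigma_u_S
```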
Theorem 1 depicts the estimation accuracy of ΣS by the above three methods with respect to the following measurements: the relative norm ‖Σ̂S − ΣS‖ΣS, the max-norm ‖Σ̂S − ΣS‖max, and the operator norm of the inverse ‖Σ̂S⁻¹ − ΣS⁻¹‖, where ‖·‖ΣS is the relative norm defined in Section 1 and measures the relative errors. Note that the results of Fan, Liao, and Mincheva (2013) cannot be directly used here, since we employ weighted principal component analysis to estimate the unobserved factors. This is expected to be more accurate than ordinary principal component analysis, as shown in Bai and Liao (2013). Indeed, the proofs of our results are technically more involved than those of Fan, Liao, and Mincheva (2013).
We assume that s is much smaller than p, that is, s = o(p), but both tend to infinity. Under the pervasive condition (i), ‖ΣS‖ ≥ cs for some constant c > 0 and therefore diverges. For this reason, we consider the relative norm ‖Σ̂S − ΣS‖ΣS instead of ‖Σ̂S − ΣS‖, and the operator norm for estimating the inverse. In addition, we consider the element-wise max norm ‖Σ̂S − ΣS‖max. We show that if p is large relative to s and T, Method 2 performs as well as the Oracle Method, and both outperform Method 1. As a consequence, even if we are only interested in the covariance matrix of a small subset of variables, we should use all the data to estimate the common factors, which ultimately improves the estimation of ΣS. In particular, we are able to specify an explicit regime of (s, p) under which the improvements are substantial. However, when s ≍ p, that is, when they are of the same order, using more data does not yield as dramatic an improvement for estimating ΣS. This is expected and will be clearly seen in the simulation section.
Before stating Theorem 1, we need a few preliminary results: Lemmas 1–3. Specifically, Lemma 1 presents the uniform convergence rates of the factor estimates by Methods 1 and 2. Based on that, Lemmas 2 and 3 further derive the estimation accuracy of factor loadings and idiosyncratic matrix by the three methods, respectively. These results together lead to the estimation error rates of ΣS in Theorem 1 w.r.t. three measures defined above. Additional Lemmas supporting the proof are given in the Appendix. Again, these kinds of results cannot be obtained directly from Fan, Liao, and Mincheva (2013) due to our use of WPC.
Lemma 1. Suppose that conditions (i), (ii), the sparsity condition (5), and the additional regularity conditions (iv)–(vii) in Section A.1 hold for both s and p. If √p log p = o(T) and T = o(s²), then we have
where H1 and H2 are the rotation matrices defined as in (10) under Methods 1 and 2, respectively, V̂1 is the diagonal matrix of the largest K eigenvalues of (sT)⁻¹YS′Σ̃u,S⁻¹YS, and V̂2 is the diagonal matrix of the largest K eigenvalues of (pT)⁻¹Y′Σ̃u⁻¹Y.
Remark 2. H1 and H2 correspond to the rotation matrix H defined in (10) using Methods 1 and 2, respectively. Recall that F = (f1, …, fT)′; then Hft = (pT)⁻¹V̂⁻¹F̂′(FB′)Σ̃u⁻¹(Bft). Note that Hft only depends on the quantities V̂⁻¹, F̂, Σ̃u, and the identifiable components {Bft}. Therefore, there is no identifiability issue regarding Hft. In other words, even though ft itself may not be identifiable, an identifiable rotation of ft can be consistently estimated by f̂t.
Lemma 1 implies that Method 2 produces a better factor estimate when p is of larger order than s; this regime can be made explicit by representing s and p as s ≍ T^γs and p ≍ T^γp.
It is not surprising that the estimation accuracy of the loading matrix also varies among the three methods, as shown in Lemma 2.
Lemma 2. Under conditions of Lemma 1,

maxi≤s ‖b̂io − bi‖ = OP(wo),  maxi≤s ‖b̂i(1) − H1bi‖ = OP(w1),  maxi≤s ‖b̂i(2) − H2bi‖ = OP(w2),

where wo = √(log s/T), w1 = √(log s/T) + 1/√s, and w2 = √(log s/T) + 1/√p.
Similarly, Lemma 2 indicates that Method 2 performs as well as the Oracle Method, both of which are better than Method 1, that is, w2 = wo < w1, if T/log s = O(p) and s log s = o(T); representing s and p in the order of T as above, this amounts to γs < 1 ≤ γp up to logarithmic factors. We remark that the extra terms 1/√s and 1/√p in w1 and w2 (in comparison with the oracle rate wo) are due to the factor estimation. Another preliminary result, regarding the estimation of the identifiable component, is given in Lemma A.1.
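For completeness, the regime can be verified directly from the expressions of wo, w1, and w2 given above: w2 ≍ wo if and only if 1/√p = O(√(log s/T)), that is, T/log s = O(p); and wo = o(w1) if and only if √(log s/T) = o(1/√s), that is, s log s = o(T).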
Similar insights can be drawn from Lemma 3 on the estimation of Σu,S.
Lemma 3. Under conditions of Lemma 1, it holds that

‖Σ̂u,So − Σu,S‖ = OP(mswo),  ‖Σ̂u,S(1) − Σu,S‖ = OP(msw1),  ‖Σ̂u,S(2) − Σu,S‖ = OP(msw2),
where ms is defined as in (4) with p being replaced by s.
Now, we are ready to state our main result on the estimation of ΣS based on the above preliminary results. From Theorem 1, it is easily seen that the comparison of the estimation accuracy of ΣS among three methods is solely determined by the relative magnitude of wo, w1, and w2. Therefore, we should use additional variables to estimate the factors if p is much larger than s in the sense that T/log s = O(p) and s log s = o(T) (implying w2 = wo < w1).
Theorem 1. Under conditions of Lemma 1, it holds that
For the relative norm, , , and .
For the max-norm, , , and .
For the operator norm of the inverse matrix, , and .
Remark 3. So far, we have assumed that the number of factors K is fixed and known. A data-driven choice of K has been extensively studied in the econometrics literature, for example, by Bai and Ng (2002) and Kapetanios (2010). To estimate K, we can adopt the method of Bai and Ng (2002) and propose a consistent estimator of K (by allowing p, T → ∞) as follows:

K̂ = argmin0≤k≤N log{(pT)⁻¹‖Y − T⁻¹YF̂kF̂k′‖F²} + k·g(p, T),

where N is a predefined upper bound, F̂k is a T × k matrix whose columns are √T times the eigenvectors corresponding to the largest k eigenvalues of Y′Y, and g(p, T) is a penalty function. Two examples suggested by Bai and Ng (2002) are

g1(p, T) = ((p + T)/(pT)) log(pT/(p + T))  and  g2(p, T) = ((p + T)/(pT)) log(min(p, T)).
Under our assumptions (i)–(x), all conditions required by theorem 2 of Bai and Ng (2002) hold. Hence, their theorem implies that P(K̂ = K) → 1. Then, conditioning on the event that {K̂ = K}, our Theorem 1 still holds by replacing K with K̂. Other effective methods for selecting the number of factors include the eigen ratio method by Lam and Yao (2012) and Ahn and Horenstein (2013).
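As a sketch of the information-criterion rule described above (our own illustration using the penalty g1), the selection can be coded as follows.

```python
import numpy as np

def estimate_num_factors(Y, N=8):
    """Bai-Ng style information criterion for the number of factors.

    Y is p x T.  For each k, F_k collects sqrt(T) times the top-k eigenvectors of
    Y'Y, and the criterion is log(residual variance) + k * g1(p, T).
    """
    p, T = Y.shape
    lam, vec = np.linalg.eigh(Y.T @ Y)                      # T x T eigen-decomposition
    vec = vec[:, ::-1]                                      # descending eigenvalue order
    g1 = (p + T) / (p * T) * np.log(p * T / (p + T))        # penalty g1(p, T)
    ic = []
    for k in range(N + 1):
        F_k = np.sqrt(T) * vec[:, :k]                       # T x k factor matrix
        resid = Y - (Y @ F_k / T) @ F_k.T                   # Y - B_k F_k'
        V_k = np.sum(resid ** 2) / (p * T)                  # average squared residual
        ic.append(np.log(V_k) + k * g1)
    return int(np.argmin(ic))
```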
Remark 4. When K grows with p and T, Fan, Liao, and Mincheva (2013) gave the explicit dependence of the convergence rates on K for their proposed POET estimator. By adopting their technique, we can obtain the following results:
, , ;
, , ;
-
, , .
Again, the rate difference among three types of estimators only depends on wo, w1, and w2. Therefore, the same conclusion (when p is much larger than s, using additional variables improves the estimation of ΣS) can still be made even if K diverges. As long as K diverges in the rate that , or K = o(1/(msw1)1/3), the same blessing of dimensionality phenomena persist in terms of estimation consistency in relative norm, max norm, or operator norm of the inverse, respectively.
5. Divide-and-Conquer Computing Method
As discussed previously, we prefer to use as much auxiliary data information as possible, even when we are only interested in the covariance matrix of a particular set of variables. However, this can bring a heavy computational burden. This concern motivates a simple divide-and-conquer scheme that splits all p variables in Y. Without loss of generality, assume that the p rows of the matrix Y can be evenly divided into M groups with p/M variables in each group. The s variables of interest can possibly be assigned to different groups.
Divide-and-Conquer Computation Scheme
In the mth group, obtain the initial estimator Σ̃u,m by using the adaptive thresholding method as described in Section 3 based on the data in the mth group only.
Denote by Ym the data matrix corresponding to the variables in the mth group and let F̂m = (f̂m,1, …, f̂m,T)′, whose columns are √T times the eigenvectors corresponding to the largest K eigenvalues of the T × T matrix Ym′Σ̃u,m⁻¹Ym. The computation in the above two steps can be done in a parallel manner.
Average to obtain a single estimator of ft as

f̄t = M⁻¹ Σm=1M f̂m,t,  t = 1, …, T.
The loading matrix estimate is given by B̄S = T−1YSF̄, where F̄ = (f̄1, …, f̄T)′.
The idiosyncratic matrix is estimated as follows. Let f̄t be the tth row of F̄ and b̄i be the ith row of B̄S. Let ūit = yit − b̄i′f̄t and σ̂ij = T⁻¹Σt ūitūjt. The (i, j)th entry of the estimator Σ̄u,S of Σu,S is given by thresholding σ̂ij (for i ≠ j) at the level Cω(s)(σ̂iiσ̂jj)^{1/2}, where ω(s) is defined as in (7) with p replaced by s.
The final estimator of the covariance matrix is given by

Σ̄S = B̄SB̄S′ + Σ̄u,S.
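The scheme can be sketched as follows (our own illustration, reusing the helper routines sketched in Section 3; in practice, steps 1 and 2 would be dispatched to parallel workers).

```python
import numpy as np

def divide_and_conquer_factors(Y, K, M, C=0.5):
    """Average the group-wise WPC factor estimates over M groups of about p/M variables.

    Reuses initial_idiosyncratic_cov and wpc_factors from the sketches in Section 3.
    Returns F_bar (T x K), the averaged factor estimates.
    """
    p, T = Y.shape
    groups = np.array_split(np.arange(p), M)          # split the p variables into M groups
    F_bar = np.zeros((T, K))
    for idx in groups:                                # this loop is embarrassingly parallel
        Ym = Y[idx]
        Su_m = initial_idiosyncratic_cov(Ym, K, C)    # step 1: group-wise Sigma_u estimate
        Fm, _ = wpc_factors(Ym, Su_m, K)              # step 2: group-wise WPC factors
        F_bar += Fm / M                               # step 3: average over the M groups
        # Note: in practice the group-wise estimates may need sign/rotation alignment
        # before averaging, since each F_m is only identified up to a rotation.
    return F_bar
```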
We show that, if M is fixed,
These rates match those attained by Method 2, where all p variables are pooled together for the analysis. The proof is given in Appendix A.3. The simulation results in Section 6 further demonstrate that, without sacrificing estimation accuracy, the divide-and-conquer method runs much faster than Method 2. Therefore, the divide-and-conquer method is practically useful when dealing with massive datasets.
The main computational cost of our method comes from inverting Σ̃u. For Method 2, where all p variables are pooled together for the analysis, the computational complexity of this inversion is O(p³). In contrast, for the divide-and-conquer method, the corresponding estimator Σ̃u,m in the mth group only requires O((p/M)³) operations to invert, so the total computational complexity is O(p³/M²). Hence, the computational speed can be boosted by a factor of M². Such a computational acceleration can also be observed from the simulation results in Figure 1(d). Other operations, such as the eigen-decomposition of the T × T matrix Ym′Σ̃u,m⁻¹Ym, do not have a dominating computational cost, as we assume that p is much larger than T. When M grows too fast, the divide-and-conquer method may lose estimation efficiency compared with the pooled analysis (Method 2). However, considering its computational gains, the divide-and-conquer method is practically useful when dealing with massive datasets.
Figure 1.
Estimation error by four methods and their computational time: the dotted lines represent the means over 100 simulations and the segments represent the corresponding standard deviations.
6. Simulations
We use simulated examples to compare the statistical performances of Methods 1, 2, and the Oracle Method. We fix the number of factors K = 3 and repeat 100 simulations for each combination of (s, p, T). The loading bi, the factor ft and the idiosyncratic error ut are generated as follows:
- b1, …, bp are iid from NK(0, 5IK);
- f1, …, fT are iid from NK(0, IK);
- u1, …, uT are iid from Np(0, 50Ip).
The observations are generated from (1) using the bi, ft, and ut above. Tables 1–4 report the estimation errors of the factors, the loading matrices, and the covariance of interest ΣS in terms of different measurements.
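For reproducibility, the data-generating process above can be written compactly as follows (our own snippet mirroring these settings).

```python
import numpy as np

def simulate_factor_data(s, p, T, K=3, seed=0):
    """Generate y_t = B f_t + u_t with b_i ~ N_K(0, 5 I_K), f_t ~ N_K(0, I_K),
    and u_t ~ N_p(0, 50 I_p); the first s variables form the set of interest S."""
    rng = np.random.default_rng(seed)
    B = rng.normal(scale=np.sqrt(5.0), size=(p, K))
    F = rng.normal(size=(T, K))
    U = rng.normal(scale=np.sqrt(50.0), size=(p, T))
    Y = B @ F.T + U                                   # p x T data matrix
    Sigma_S = B[:s] @ B[:s].T + 50.0 * np.eye(s)      # true covariance of the variables in S
    return Y, F, Sigma_S
```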
Table 1.
Comparison of three methods when s is much smaller than p (T = 200).

| (s, p) | (50, 1000) |  |  | (50, 2000) |  |  |
|---|---|---|---|---|---|---|
| Method | M1 | M2 | ORA | M1 | M2 | ORA |
| ‖Σ̂S − ΣS‖ΣS | 0.271(0.014) | 0.205(0.013) | 0.204(0.013) | 0.270(0.014) | 0.201(0.013) | 0.200(0.013) |
| ‖Σ̂S⁻¹ − ΣS⁻¹‖ | 0.016(0.003) | 0.009(0.002) | 0.009(0.002) | 0.017(0.003) | 0.009(0.002) | 0.009(0.002) |
| ‖Σ̂S − ΣS‖max | 18.828(3.072) | 17.460(3.237) | 17.457(3.261) | 18.076(2.697) | 16.631(2.949) | 16.623(2.950) |
| maxt≤T ‖f̂t − Hft‖ | 1.811(0.195) | 0.445(0.046) | NA | 1.870(0.236) | 0.331(0.025) | NA |
| maxi≤s ‖b̂i − Hbi‖ | 8.064(0.694) | 4.100(0.330) | 3.858(0.274) | 8.150(0.682) | 3.932(0.292) | 3.805(0.297) |
| ‖Σ̂u,S − Σu,S‖ | 11.375(1.262) | 5.519(0.813) | 5.268(0.843) | 11.466(1.353) | 5.253(0.776) | 5.113(0.739) |
NOTE: M1, M2, and ORA stand for Method 1, 2, and Oracle method, respectively.
Table 4.
Comparison of three methods when s is comparable to p (T = 400).

| (s, p) | (800, 1000) |  |  | (800, 2000) |  |  |
|---|---|---|---|---|---|---|
| Method | M1 | M2 | ORA | M1 | M2 | ORA |
| ‖Σ̂S − ΣS‖ΣS | 0.193(0.004) | 0.192(0.004) | 0.189(0.004) | 0.192(0.004) | 0.190(0.004) | 0.188(0.004) |
| ‖Σ̂S⁻¹ − ΣS⁻¹‖ | 0.008(0.001) | 0.008(0.001) | 0.008(0.001) | 0.008(0.001) | 0.008(0.001) | 0.008(0.001) |
| ‖Σ̂S − ΣS‖max | 17.062(2.603) | 17.051(2.612) | 17.041(2.621) | 16.919(2.182) | 16.891(2.206) | 16.888(2.209) |
| maxt≤T ‖f̂t − Hft‖ | 0.467(0.038) | 0.423(0.036) | NA | 0.466(0.038) | 0.304(0.026) | NA |
| maxi≤s ‖b̂i − Hbi‖ | 11.009(0.298) | 10.850(0.302) | 10.225(0.205) | 10.934(0.274) | 10.530(0.213) | 10.189(0.172) |
| ‖Σ̂u,S − Σu,S‖ | 5.367(0.577) | 5.276(0.560) | 4.880(0.528) | 5.293(0.411) | 5.024(0.461) | 4.894(0.420) |
NOTE: M1, M2, and ORA stand for Method 1, 2, and Oracle method, respectively.
We see from Tables 1 and 2 that when s = 50 and p = 1000 or 2000, Method 1 performs much worse than Method 2, for both T = 200 and T = 400. However, when s increases to 800 with p unchanged, Tables 3 and 4 show that the improvement of Method 2 over Method 1 is less pronounced. This is expected, as the set of interest already contains sufficiently rich information to produce an accurate estimator of the realized factors. In general, we note that Method 2 is most advantageous in settings where s is much smaller than p. In addition, from Tables 3 and 4, we can tell that Method 2 comes closer to the Oracle Method as p grows. In practice, we also observe that the WPC factor estimator performs better than the unweighted PC estimator when ut is heteroscedastic. Due to space limitations, we choose not to present the simulation results for this setting.
Table 2.
Comparison of three methods when s is much smaller than p (T = 400).

| (s, p) | (50, 1000) |  |  | (50, 2000) |  |  |
|---|---|---|---|---|---|---|
| Method | M1 | M2 | ORA | M1 | M2 | ORA |
| ‖Σ̂S − ΣS‖ΣS | 0.186(0.009) | 0.132(0.007) | 0.131(0.007) | 0.186(0.009) | 0.131(0.008) | 0.130(0.008) |
| ‖Σ̂S⁻¹ − ΣS⁻¹‖ | 0.011(0.002) | 0.004(0.001) | 0.004(0.001) | 0.011(0.002) | 0.004(0.001) | 0.004(0.001) |
| ‖Σ̂S − ΣS‖max | 14.054(1.945) | 11.922(2.245) | 11.891(2.262) | 14.180(2.154) | 11.901(2.603) | 11.900(2.604) |
| maxt≤T ‖f̂t − Hft‖ | 1.839(0.193) | 0.417(0.036) | NA | 1.843(0.198) | 0.305(0.026) | NA |
| maxi≤s ‖b̂i − Hbi‖ | 6.960(0.584) | 2.830(0.200) | 2.692(0.198) | 7.024(0.605) | 2.761(0.188) | 2.692(0.194) |
| ‖Σ̂u,S − Σu,S‖ | 11.871(1.540) | 4.138(0.510) | 3.824(0.501) | 11.457(1.569) | 4.088(0.516) | 3.889(0.542) |
NOTE: M1, M2, and ORA stand for Method 1, 2, and Oracle method, respectively.
Table 3.
Comparison of three methods when s is comparable to p (T = 200).

| (s, p) | (800, 1000) |  |  | (800, 2000) |  |  |
|---|---|---|---|---|---|---|
| Method | M1 | M2 | ORA | M1 | M2 | ORA |
| ‖Σ̂S − ΣS‖ΣS | 0.440(0.006) | 0.439(0.006) | 0.435(0.006) | 0.439(0.006) | 0.436(0.006) | 0.435(0.006) |
| ‖Σ̂S⁻¹ − ΣS⁻¹‖ | 0.062(0.009) | 0.062(0.009) | 0.062(0.009) | 0.061(0.009) | 0.061(0.009) | 0.062(0.012) |
| ‖Σ̂S − ΣS‖max | 24.565(2.626) | 24.562(2.609) | 24.567(2.599) | 24.511(2.883) | 24.543(2.847) | 24.536(2.851) |
| maxt≤T ‖f̂t − Hft‖ | 0.488(0.047) | 0.447(0.040) | NA | 0.478(0.049) | 0.337(0.038) | NA |
| maxi≤s ‖b̂i − Hbi‖ | 15.550(0.488) | 15.370(0.462) | 14.418(0.271) | 15.595(0.551) | 15.041(0.357) | 14.398(0.243) |
| ‖Σ̂u,S − Σu,S‖ | 6.745(0.611) | 6.680(0.635) | 6.405(0.630) | 6.904(0.734) | 6.697(0.763) | 6.588(0.737) |

NOTE: M1, M2, and ORA stand for Method 1, Method 2, and the Oracle Method, respectively.
For further comparison with the divide-and-conquer method, we vary T from 50 to 500 and set (s, p, M) as s = ⌊T^0.6⌋, p = ⌊T^1.4⌋, and M = ⌊T^0.2⌋. Figure 1 shows the estimation errors of the four methods together with the corresponding computational time. Again, when p is large, Method 2 performs as well as the Oracle Method, and both greatly outperform Method 1. However, its computation becomes much slower in this case. In contrast, the divide-and-conquer method is much faster while maintaining performance comparable to Method 2. In the extreme case where p is around 6000 (T = 500), the divide-and-conquer method boosts the speed of Method 2 by about nine-fold.
7. Real Data Example
We use a real data example to illustrate how different utilization of the available variables can affect the inference for the variables of interest. Krug et al. (2012) carried out a gene profiling study among 40 Portuguese and Spanish adults to identify key genetic risk factors for ischemic stroke. Among them, 20 subjects were patients with ischemic stroke and the others were controls. Their gene profiles were obtained using the GeneChip Human Genome U133 Plus 2.0 microarray. The data are available at the Gene Expression Omnibus under accession name "GSE22255."
To judge how effectively gene expression can distinguish ischemic stroke patients from controls, we applied linear discriminant analysis (LDA) to this dataset. We randomly chose 10 subjects as the test set and the rest as the training set, and repeated the random splitting for 100 runs. In each run, we selected the set of differentially expressed (DE) genes with a threshold of over 1.2-fold change and a Q-value ≤ 0.05, a commonly used criterion to define DE genes (Storey 2002). An LDA rule was then learned from the training set using the selected genes and applied to the test set for classifying cases and controls. The LDA rule classifies a subject as a case if
δ̂′Σ̂⁻¹(x − μ̄) > 0,  (11)
where δ̂ = μ̂1 − μ̂0 ∈ ℝs is the sample mean difference between the two groups (case minus control), s is the number of selected genes, Σ̂ ∈ ℝs×s is an estimator of the true covariance matrix Σ of the selected genes, and μ̄ = (μ̂1 + μ̂0)/2. The quantities μ̄, δ̂, and Σ̂ are obtained from the training set, and x is the gene expression vector of a subject in the test set.
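A sketch of this classification step (our own illustration; Σ̂ is whichever covariance estimator, LDA-1 or LDA-2, is plugged in) is:

```python
import numpy as np

def lda_classify(x, mu1, mu0, Sigma_hat):
    """Return 1 (case) if the LDA rule (11) classifies x as a case, and 0 otherwise.

    mu1 and mu0 are the training-set group means of the selected genes, and
    Sigma_hat is the estimated covariance matrix (factor-model based for LDA-1/LDA-2).
    """
    delta = mu1 - mu0                                        # sample mean difference
    mu_bar = (mu1 + mu0) / 2.0
    score = delta @ np.linalg.solve(Sigma_hat, x - mu_bar)   # delta' Sigma^{-1} (x - mu_bar)
    return int(score > 0)
```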
As s can be larger than the sample size, the traditional LDA, in which Σ̂ is the sample covariance, is no longer applicable. An alternative is to estimate Σ by adopting the factor model. Factor modeling is widely used in the genomics literature to model the dependencies among genes (Kustra, Shioda, and Zhu 2006; Carvalho et al. 2012). Several mechanisms, such as the natural pathway structure (Ogata et al. 2000), can serve as latent factors affecting the correlation among genes. The few spiked eigenvalues of the sample covariance matrix in Figure 2 also suggest the existence of potential latent factors in this dataset. Again, there are two ways of using the factor model. One way is to use Method 1, where all procedures are based on the selected genes only; the resulting rule is referred to as "LDA-1" in Figure 3. Another way is to use auxiliary data as in Method 2. More specifically, it first uses the data from all involved genes and all subjects in the training set to estimate the latent factors. These estimated factors are then applied to the set of selected genes, for which the loading and idiosyncratic matrix estimators are obtained. Combining them produces the covariance matrix estimator, which is still an s × s matrix. The resulting rule is referred to as "LDA-2" in Figure 3. Recall that the only difference between the two rules is that they use different covariance estimators.
Figure 2.
Eigen-values of the sample covariance matrix for GSE22255.
Figure 3.
Misclassification rates of LDA-1 and LDA-2 over 100 random splits: the dotted lines represent the means over 100 splits and the segments represent the corresponding standard deviations.
Figure 3 plots the average misclassification rates on the test set against the number of factors over the 100 random splits. It is clearly seen that LDA-2 gives better misclassification rates than LDA-1, which is solely due to the different estimation of the covariance matrix. These results lend further support to our claim that using more data is beneficial.
Supplementary Material
Acknowledgments
Guang Cheng was on sabbatical at Princeton while this work was carried out and thanks the Princeton ORFE department for its hospitality. The authors thank the editor, the associate editor, and the referees for their constructive comments.
Funding: Guang Cheng gratefully acknowledges NSF CAREER Award DMS-1151692, DMS-1418042, Simons Fellowship in Mathematics, Office of Naval Research (ONR N00014-15-1-2331). Jianqing Fan was supported in part by NSF Grants DMS-1206464, DMS-1406266, and NIH grant R01GM100474-4. Quefeng Li was supported by NIH grant 2R01-GM072611-11 as a postdoctoral fellow at Princeton University and supported by NIH grant 2R01-GM047845-25 at UNC-Chapel Hill.
Footnotes
Supplementary Materials: The online supplement contains the appendices for the article.
References
- Ahn SC, Horenstein AR. Eigenvalue Ratio Test for the Number of Factors. Econometrica. 2013;81:1203–1227.
- Antoniadis A, Fan J. Regularized Wavelet Approximations (with discussion). Journal of the American Statistical Association. 2001;96:939–967.
- Bai J, Li K. Statistical Analysis of Factor Models of High Dimension. The Annals of Statistics. 2012;40:436–465.
- Bai J, Liao Y. Statistical Inferences Using Large Estimated Covariances for Panel Data and Factor Models. arXiv preprint arXiv:1307.2662. 2013.
- Bai J, Ng S. Determining the Number of Factors in Approximate Factor Models. Econometrica. 2002;70:191–221.
- Bickel PJ, Levina E. Covariance Regularization by Thresholding. The Annals of Statistics. 2008;36:2577–2604.
- Cai T, Liu W. Adaptive Thresholding for Sparse Covariance Matrix Estimation. Journal of the American Statistical Association. 2011;106:672–684.
- Cai T, Zhang CH, Zhou H. Optimal Rates of Convergence for Covariance Matrix Estimation. The Annals of Statistics. 2010;38:2118–2144.
- Carvalho CM, Chang J, Lucas JE, Nevins JR, Wang Q, West M. High-Dimensional Sparse Factor Modeling: Applications in Gene Expression Genomics. Journal of the American Statistical Association. 2012;103:1438–1456. doi: 10.1198/016214508000000869.
- Choi I. Efficient Estimation of Factor Models. Econometric Theory. 2012;28:274–308.
- Fama EF, French KR. Common Risk Factors in the Returns on Stocks and Bonds. Journal of Financial Economics. 1993;33:3–56.
- Fan J, Fan Y, Lv J. High Dimensional Covariance Matrix Estimation Using a Factor Model. Journal of Econometrics. 2008;147:186–197.
- Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Liao Y, Mincheva M. High Dimensional Covariance Matrix Estimation in Approximate Factor Models. The Annals of Statistics. 2011;39:3320–3356. doi: 10.1214/11-AOS944.
- Fan J, Liao Y, Mincheva M. Large Covariance Estimation by Thresholding Principal Orthogonal Complements. Journal of the Royal Statistical Society, Series B. 2013;75:603–680. doi: 10.1111/rssb.12016.
- Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. Englewood Cliffs, NJ: Prentice Hall; 1992.
- Kapetanios G. A Testing Procedure for Determining the Number of Factors in Approximate Factor Models With Large Datasets. Journal of Business and Economic Statistics. 2010;28:397–409.
- Krug T, Gabriel JP, Taipa R, Fonseca BV, Domingues-Montanari S, Fernandez-Cadenas I, Manso H, Gouveia LO, Sobral J, Albergaria I, Gaspar G, Jiménez-Conde J, Rabionet R, Ferro JM, Montaner J, Vicente AM, Rui Silva M, Matos I, Lopes G, Oliveira SA. TTC7B Emerges as a Novel Risk Factor for Ischemic Stroke Through the Convergence of Several Genome-Wide Approaches. Journal of Cerebral Blood Flow & Metabolism. 2012;32:1061–1072. doi: 10.1038/jcbfm.2012.24.
- Kustra R, Shioda R, Zhu M. A Factor Analysis Model for Functional Genomics. BMC Bioinformatics. 2006;7:216–228. doi: 10.1186/1471-2105-7-216.
- Lam C, Fan J. Sparsistency and Rates of Convergence in Large Covariance Matrix Estimation. The Annals of Statistics. 2009;37:4254–4278. doi: 10.1214/09-AOS720.
- Lam C, Yao Q. Factor Modeling for High-Dimensional Time Series: Inference for the Number of Factors. The Annals of Statistics. 2012;40:694–726.
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research. 2000;28:27–30. doi: 10.1093/nar/27.1.29.
- Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse Permutation Invariant Covariance Estimation. Electronic Journal of Statistics. 2008;2:494–515.
- Rothman AJ, Levina E, Zhu J. Generalized Thresholding of Large Covariance Matrices. Journal of the American Statistical Association. 2009;104:177–186.
- Storey JD. A Direct Approach to False Discovery Rates. Journal of the Royal Statistical Society, Series B. 2002;64:479–498.
- Zhang CH. Nearly Unbiased Variable Selection Under Minimax Concave Penalty. The Annals of Statistics. 2010;38:894–942.