Abstract
Factor models are a class of powerful statistical models that have been widely used to deal with dependent measurements arising in applications ranging from genomics and neuroscience to economics and finance. As data are collected at an ever-growing scale, statistical machine learning faces some new challenges: high dimensionality, strong dependence among observed variables, heavy-tailed variables and heterogeneity. High-dimensional robust factor analysis serves as a powerful toolkit to conquer these challenges.
This paper gives a selective overview of recent advances in high-dimensional factor models and their applications to statistics, including Factor-Adjusted Robust Model selection (FarmSelect) and Factor-Adjusted Robust Multiple testing (FarmTest). We show that classical methods, especially principal component analysis (PCA), can be tailored to many new problems and provide powerful tools for statistical estimation and inference. We highlight PCA and its connections to matrix perturbation theory, robust statistics, random projection, false discovery rate, etc., and illustrate through several applications how insights from these fields yield solutions to modern challenges. We also present far-reaching connections between factor models and popular statistical learning problems, including network analysis and low-rank matrix recovery.
Keywords: Factor model, PCA, covariance estimation, perturbation bounds, robustness, random sketch, FarmSelect, FarmTest
1. INTRODUCTION
In modern data analytics, dependence across high-dimensional outcomes or measurements is ubiquitous. For example, stocks within the same industry exhibit significantly correlated returns, housing prices of a country depend on various economic factors, and gene expressions can be stimulated by cytokines. Ignoring such dependence structures can produce significant systematic bias, yield inefficient statistical results, and lead to misleading insights. The problems are more severe for high-dimensional big data, where dependence, non-Gaussianity and heterogeneity of measurements are common.
Factor models aim to capture such dependence by assuming a small number of variates, or “factors”, usually much fewer than the outcomes, that drive the dependence among all the outcomes (Lawley and Maxwell, 1962; Stock and Watson, 2002). Stemming from the early works on measuring human abilities (Spearman, 1927), factor models have become one of the most popular and powerful tools in multivariate analysis and have made a profound impact in the past century on psychology (Bartlett, 1938; McCrae and John, 1992), economics and finance (Chamberlain and Rothschild, 1982; Fama and French, 1993; Stock and Watson, 2002; Bai and Ng, 2002), biology (Hirzel et al., 2002; Hochreiter et al., 2006; Leek and Storey, 2008), etc. Suppose x1, …, xn are n i.i.d. p-dimensional random vectors, which may represent financial returns, housing prices, gene expressions, etc. The generic factor model assumes that
(1.1)  xi = μ + Bfi + ui,  i = 1, …, n,
where μ ∈ ℝp is the mean vector, B ∈ ℝp×K is the matrix of factor loadings, fi ∈ ℝK is the vector of common factors with E fi = 0 (and F = (f1, …, fn)⊤ stores these K-dimensional vectors of common factors), and ui ∈ ℝp represents the error terms (a.k.a. idiosyncratic components), which have mean zero and are uncorrelated with or independent of F. We emphasize that, for most of our discussions in the paper (except Section 3.1), only x1, …, xn are observable, and the goal is to infer B and f1, …, fn through x1, …, xn. Here we use the name “factor model” to refer to a general concept where the idiosyncratic components ui are allowed to be weakly correlated. This is also known as the “approximate factor model” in the literature, in contrast to the “strict factor model” where the idiosyncratic components are assumed to be uncorrelated.
Note that the model (1.1) has identifiability issues: given any invertible matrix R ∈ ℝK×K, simultaneously replacing B with BR and fi with R−1fi does not change the observation xi. To resolve this ambiguity, the following identifiability assumption is usually imposed:
Assumption 1.1 (Identifiability). B⊤B is diagonal and cov(fi) = IK.
Other identifiability assumptions as well as detailed discussions can be found in Bai and Li (2012) and Fan et al. (2013).
Factor analysis is closely related to principal component analysis (PCA), which breaks down the covariance matrix into a set of orthogonal components and identifies the subspace that explains the most variation of the data (Pearson, 1901; Hotelling, 1933). In this selective review, we will mainly leverage PCA, or more generally, spectral methods, to estimate the factors and the loading matrix B in (1.1). Other popular estimators, mostly based on the maximum likelihood principle, can be found in Lawley and Maxwell (1962); Anderson and Amemiya (1988); Bai and Li (2012), etc. The covariance matrix of xi consists of two components: cov(Bfi) and cov(ui). Intuitively, when the contribution of the covariance from the error term ui is negligible compared with that from the factor term Bfi, the top-K eigenspace (namely, the space spanned by the top K eigenvectors) of the sample covariance of x1, …, xn should be well aligned with the column space of B. This can be seen from the assumption that the top K eigenvalues of cov(Bfi) grow much faster than ‖cov(ui)‖2, which occurs frequently in high-dimensional statistics (Fan et al., 2013).
Here is our main message: applying PCA to well-crafted covariance matrices (including vanilla sample covariance matrices and their robust versions) consistently estimates the factors and loadings, as long as the signal-to-noise ratio is large enough. The core theoretical challenge is to characterize how the idiosyncratic covariance cov(ui) perturbs the eigenstructure of the factor covariance BB⊤. In addition, the situation is more complicated in the presence of heavy-tailed data, missing data, computational constraints, heterogeneity, etc.
The rest of the paper is devoted to solutions to these challenges and a wide range of applications to statistical machine learning problems. In Section 2, we will elucidate the relationship between factor models and PCA and present several useful deterministic perturbation bounds for eigenspaces. We will also discuss robust covariance inputs for the PCA procedure to guard against corruption from heavy-tailed data. Exploiting the factor structure of the data helps solve many statistical and machine learning problems. In Section 3, we will see how the factor models and PCA can be applied to high-dimensional covariance estimation, regression, multiple testing and model selection. In Section 4, we demonstrate the connection between PCA and a wide range of machine learning problems including Gaussian mixture models, community detection, matrix completion, etc. We will develop useful tools and establish strong theoretical guarantees for our proposed methods.
Here we collect all the notations for future convenience. We use [m] to refer to {1, 2, …, m}. We adopt the convention of using regular letters for scalars and using bold-face letters for vectors or matrices. For v = (v1, …, vp)⊤ ∈ ℝp and 1 ≤ q < ∞, we define ‖v‖q = (∑j=1p |vj|q)1/q and ‖v‖∞ = maxj |vj|. For a matrix M, we use ‖M‖2, ‖M‖F, ‖M‖max and ‖M‖1 to denote its operator norm (spectral norm), Frobenius norm, entry-wise (element-wise) max-norm, and vector ℓ1 norm, respectively. To be more specific, the last two norms are defined by ‖M‖max = maxi,j |Mij| and ‖M‖1 = ∑i,j |Mij|. Let Ip denote the p × p identity matrix, 1p denote the p-dimensional all-one vector, and 1A denote the indicator of event A, i.e., 1A = 1 if A happens, and 0 otherwise. We use N(μ, Σ) to refer to the normal distribution with mean vector μ and covariance matrix Σ. For two nonnegative numbers a and b that possibly depend on n and p, we use the notation a = O(b) or a ≲ b to mean a ≤ C1b for some constant C1 > 0, and the notation a = Ω(b) or a ≳ b to mean a ≥ C2b for some constant C2 > 0. We write a ≍ b if both a = O(b) and a = Ω(b) hold. For a sequence of random variables {Xn} and a sequence of nonnegative deterministic numbers {an}, we write Xn = OP(an) if for any ε > 0, there exist C > 0 and N > 0 such that P(|Xn| ≥ Can) ≤ ε holds for all n > N; and we write Xn = oP(an) if for any ε > 0 and C > 0, there exists N > 0 such that P(|Xn| ≥ Can) ≤ ε holds for all n > N. We omit the subscripts when it does not cause confusion.
2. FACTOR MODELS AND PCA
2.1. Relationship between PCA and factor models in high dimensions
Under model (1.1) with the identifiability condition, Σ = cov(xi) is given by
(2.1)  Σ = BB⊤ + Σu,  where Σu := cov(ui).
Intuitively, if the magnitude of BB⊤ dominates Σu, the top-K eigenspace of Σ should be approximately aligned with the column space of B. Naturally we expect a large gap between the eigenvalues of BB⊤ and Σu to be important for estimating the column space of B through PCA (see Figure 1). On the other hand, if this gap is small compared with the eigenvalues of Σu, it is known that PCA leads to inconsistent estimation (Johnstone and Lu, 2009). The above discussion motivates a simple vanilla PCA-based method for estimating B and F as follows (assuming the Identifiability Assumption).
Fig 1.
The left panel is the histogram of the eigenvalue distribution from a synthetic dataset. Fix n = 1000, p = 400 and K = 2, and let all the entries of B, F and U be generated as i.i.d. Gaussian variables. The data matrix X is formed according to the factor model (1.1). The right diagram illustrates the Pervasiveness Assumption.
Step 1. Obtain estimators μ̂ and Σ̂ of μ and Σ, e.g., the sample mean and covariance matrix or their robust versions.
Step 2. Compute the eigen-decomposition of Σ̂. Let λ̂1 ≥ λ̂2 ≥ ⋯ ≥ λ̂K be its top K eigenvalues and v̂1, …, v̂K be their corresponding eigenvectors. Set Λ̂ = diag(λ̂1, …, λ̂K) and V̂ = (v̂1, …, v̂K).
Step 3. Obtain the PCA estimators B̂ = V̂Λ̂1/2 and f̂i = Λ̂−1/2V̂⊤(xi − μ̂); namely, B̂ consists of the top-K rescaled eigenvectors of Σ̂, and f̂i is just the rescaled projection of xi − μ̂ onto the space spanned by the top-K eigenvectors.
Let us provide some intuitions for the estimators in Step 3. Recall that bj is the jth column of B. Then, by model (1.1), B⊤(xi − μ) = B⊤Bfi + B⊤ui. In the high-dimensional setting, the second term is averaged out when ui is weakly dependent across its components. This along with the identifiability condition delivers that
(2.2)  fi ≈ (B⊤B)−1B⊤(xi − μ)  and  bj ≈ √λj vj for j ∈ [K],
where λj and vj denote the jth largest eigenvalue of Σ and its associated eigenvector. Now, we estimate λj and vj by λ̂j and v̂j, and hence bj by √λ̂j v̂j and fi by Λ̂−1/2V̂⊤(xi − μ̂). Using the substitution method, we obtain the estimators in Step 3.
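For concreteness, the following minimal numpy sketch implements Steps 1–3 with the sample mean and sample covariance as inputs; the function name and the use of the plain sample covariance (rather than the robust inputs of Section 2.3) are illustrative choices.

```python
import numpy as np

def pca_factor_estimates(X, K):
    """Vanilla PCA estimates of loadings B and factors F from data X (n x p).

    A minimal sketch of Steps 1-3: eigen-decompose the sample covariance,
    keep the top-K eigenpairs, and rescale.
    """
    mu_hat = X.mean(axis=0)                      # Step 1: sample mean
    Sigma_hat = np.cov(X, rowvar=False)          # Step 1: sample covariance
    evals, evecs = np.linalg.eigh(Sigma_hat)     # Step 2: eigen-decomposition
    idx = np.argsort(evals)[::-1][:K]            # indices of the top-K eigenvalues
    lam, V = evals[idx], evecs[:, idx]
    B_hat = V * np.sqrt(lam)                     # Step 3: B_hat = V Lambda^{1/2}
    F_hat = (X - mu_hat) @ V / np.sqrt(lam)      # Step 3: f_i = Lambda^{-1/2} V^T (x_i - mu)
    return mu_hat, B_hat, F_hat

# Example: X = np.random.randn(1000, 400); mu, B, F = pca_factor_estimates(X, K=2)
```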
The above heuristic also reveals that the PCA-based methods work well if the effect of the factors outweighs the noise. To quantify this, we introduce a form of Pervasiveness Assumption from the factor model literature. While this assumption is strong1, it simplifies our discussion and captures the above intuition well: it holds when the factor loadings are random samples from a nondegenerate population (Fan et al., 2013).
Assumption 2.1 (Pervasiveness). The first K eigenvalues of BB⊤ have order Ω(p), whereas ‖Σu‖2 = O(1).
Note that cov(fi) = IK under the Identifiability Assumption 1.1. The first part of this assumption holds when each factor influences a non-vanishing proportion of outcomes. Mathematically speaking, it means that for any k ∈ [K] := {1, 2, …, K}, the average of squared loadings of the kth factor satisfies p−1 ∑j=1p Bjk2 ≥ c for some constant c > 0 (right panel of Figure 1). This holds with high probability if, for example, the rows of B are i.i.d. realizations from a non-degenerate distribution, but we will not make such an assumption in this paper. The second part of the assumption is reasonable, as cross-sectional correlation becomes weak after we take out the common factors. Typically, if Σu is a sparse matrix, the norm bound holds; see Section 3.1 for details. Under this Pervasiveness Assumption, the first K eigenvalues of Σ will be well separated from the rest of the eigenvalues. By the Davis-Kahan theorem (Davis and Kahan, 1970), which we present as Theorem 2.3, we can consistently estimate the column space of B through the top-K eigenspace of Σ. This explains why we can apply PCA to factor model analysis (Fan et al., 2013).
Though factor models and PCA are not identical (see Jolliffe, 1986), they are approximately the same for high-dimensional problems with the pervasiveness assumption (Fan et al., 2013). Thus, PCA-based ideas are important components of estimation and inference for factor models. In later sections (especially Section 4), we discuss statistical and machine learning problems with factor-model-type structures. There PCA is able to achieve consistent estimation even when the Pervasiveness Assumption is weakened, and somewhat surprisingly, PCA can work well down to the information limit. For perspectives from random matrix theory, see Baik et al. (2005); Paul (2007); Johnstone and Lu (2009); Benaych-Georges and Nadakuditi (2011); O’Rourke et al. (2016); Wang and Fan (2017), among others.
2.2. Estimating the number of factors
In high-dimensional factor models, if the factors are unobserved, we need to choose the number of factors K before estimating the loading matrix, factors, etc. The number K can usually be estimated from the eigenvalues of the sample covariance matrix or its robust version. Under certain conditions, such as separation of the top K eigenvalues from the others, the estimation is consistent. Classical methods include likelihood ratio tests (Bartlett, 1950), the scree plot (Cattell, 1966), parallel analysis (Horn, 1965), etc. Here, we introduce a few recent methods: the first one is based on eigenvalue ratios, the second on eigenvalue differences, and the third on eigenvalue magnitudes.
For simplicity, let us use the sample covariance and arrange its eigenvalues in descending order: λ1 ≥ λ2 ≥ ⋯ ≥ λn∧p, where n ∧ p = min{n, p} (the remaining eigenvalues, if any, are zero). Lam and Yao (2012) and Ahn and Horenstein (2013) proposed an estimator based on ratios of consecutive eigenvalues. For a pre-determined kmax, the eigenvalue ratio estimator is
K̂ = argmax1≤k≤kmax λk/λk+1.
Intuitively, when the signal eigenvalues are well separated from the other eigenvalues, the ratio at k = K should be large. Under some conditions, the consistency of this estimator, which does not involve complicated tuning parameters, is established.
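As an illustration, a short numpy sketch of the eigenvalue ratio estimator is given below; the default kmax and the use of the sample covariance are assumptions made only for this example.

```python
import numpy as np

def eigenvalue_ratio_K(X, kmax=15):
    """Estimate the number of factors by maximizing ratios of consecutive
    eigenvalues of the sample covariance (eigenvalue-ratio type estimator).
    kmax should stay well below min(n, p) so the denominators are not ~0."""
    evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order
    ratios = evals[:kmax] / evals[1:kmax + 1]                  # lambda_k / lambda_{k+1}
    return int(np.argmax(ratios)) + 1                          # +1: k is 1-indexed
```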
In an earlier work, Onatski (2010) proposed to use the differences of consecutive eigenvalues. For a given δ > 0 and pre-determined integer kmax, define
K̂(δ) = max{k ≤ kmax : λk − λk+1 ≥ δ}.
Using a result on the eigenvalue empirical distribution from random matrix theory, Onatski (2010) proved consistency of K̂(δ) under the Pervasiveness Assumption. The intuition is that the Pervasiveness Assumption implies that λK − λK+1 tends to ∞ in probability as n → ∞, whereas λi − λi+1 → 0 almost surely for K < i < kmax because these λi's converge to the same limit, which can be determined using random matrix theory. Onatski (2010) also proposed a data-driven way to determine δ from the empirical eigenvalue distribution of the sample covariance matrix.
A third possibility is to use an information criterion. Define
V(k) = minB∈ℝp×k, F∈ℝn×k (np)−1 ∑i=1n ‖xi − x̄ − Bfi‖22 = p−1 ∑j=k+1n∧p λj,
where x̄ is the sample mean, λ1 ≥ ⋯ ≥ λn∧p are the eigenvalues of the sample covariance matrix as before, and the equivalence (second equality) is well known. For a given k, V(k) is interpreted as the scaled sum of squared residuals, which measures how well k factors fit the data. A very natural estimator is to find the best k ≤ kmax such that the following penalized version of V(k) is minimized (Bai and Ng, 2002):
K̂ = argmin0≤k≤kmax { V(k) + k · σ̂2 · g(n, p) },
where g(n, p) is a penalty function satisfying g(n, p) → 0 and min{n, p} · g(n, p) → ∞, and σ̂2 is any consistent estimate of the average idiosyncratic variance. The upper limit kmax is assumed to be no smaller than K, and is typically chosen as 8 or 15 in the empirical studies in Bai and Ng (2002). Consistency results are established under more general choices of g(n, p).
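A compact numpy sketch of such a criterion is shown below; the particular penalty g(n, p) = ((n + p)/(np)) log(np/(n + p)) and the choice σ̂2 = V(kmax) are common options from Bai and Ng (2002), used here purely for illustration.

```python
import numpy as np

def info_criterion_K(X, kmax=15):
    """Estimate K by minimizing V(k) + k * sigma2_hat * g(n, p), where V(k)
    is the scaled sum of squared residuals after removing k factors."""
    n, p = X.shape
    evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]     # descending order
    V = np.array([evals[k:].sum() / p for k in range(kmax + 1)])  # V(0), ..., V(kmax)
    g = (n + p) / (n * p) * np.log(n * p / (n + p))               # one common penalty choice
    sigma2_hat = V[kmax]                                          # crude estimate of the noise level
    return int(np.argmin(V + np.arange(kmax + 1) * sigma2_hat * g))
```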
We conclude this section by remarking that in general, it is impossible to consistently estimate K if the smallest nonzero eigenvalue of B⊤B is too small, because the ‘signals’ (eigenvalues of B⊤B) would not be distinguishable from the noise (eigenvalues of UU⊤). As mentioned before, consistency of PCA is well studied in the random matrix theory literature. See Dobriban (2017) for a recent work that justifies parallel analysis using random matrix theory.
2.3. Robust covariance inputs
To extract latent factors and their factor loadings, we need an initial covariance estimator. Given independent observations x1, …, xn with mean zero, the sample covariance matrix, namely Σ̂sam = n−1 ∑i=1n xixi⊤, is a natural choice to estimate Σ = cov(xi). The finite sample bound on ‖Σ̂sam − Σ‖2 has been well studied in the literature (Vershynin, 2010; Tropp, 2012; Koltchinskii and Lounici, 2017). Before presenting the result from Vershynin (2010), let us review the definition of sub-Gaussian variables.
A random variable ξ is called sub-Gaussian if ‖ξ‖ψ2 := supq≥1 q−1/2 (E|ξ|q)1/q is finite, in which case this quantity defines a norm called the sub-Gaussian norm. Sub-Gaussian variables include as special cases Gaussian variables, bounded variables, and other variables with tails similar to or lighter than Gaussian tails. For a random vector ξ, we define ‖ξ‖ψ2 := sup‖v‖2=1 ‖v⊤ξ‖ψ2; we call ξ sub-Gaussian if ‖ξ‖ψ2 is finite.
Theorem 2.1. Let Σ be the covariance matrix of xi. Assume that x1, …, xn are i.i.d. sub-Gaussian random vectors, and denote κ = ‖Σ−1/2xi‖ψ2. Then for any t ≥ 0, there exist constants C and c only depending on κ such that
(2.3)  ‖Σ̂sam − Σ‖2 ≤ max(δ, δ2) ‖Σ‖2  with probability at least 1 − 2e−ct2,
where δ = C√(p/n) + t/√n.
Remark 2.1. The spectral-norm bound above depends on the ambient dimension p, which can be large in high-dimensional scenarios. Interested readers can refer to Koltchinskii and Lounici (2017) for a refined result that only depends on the intrinsic dimension (or effective rank) of Σ.
An important aspect of the above result is the sub-Gaussian concentration in (2.3), but this depends heavily on the sub-Gaussian or sub-exponential behaviors of the observed random vectors. This condition cannot be validated in high dimensions when tens of thousands of variables are collected. See Fan et al. (2016b). When the distribution is heavy-tailed2, one cannot expect sub-Gaussian or sub-exponential behaviors of the sample covariance in the spectral norm (Catoni, 2012). See also Vershynin (2012) and Srivastava and Vershynin (2013). Therefore, to perform PCA for heavy-tailed data, the sample covariance is not a good choice to begin with. Alternative robust estimators have been constructed to achieve better finite sample performance.
Catoni (2012), Fan et al. (2017b) and Fan et al. (2016b) approached the problem by first considering estimation of a univariate mean μ from a sequence of i.i.d. random variables X1, ⋯, Xn with variance σ2. In this case, the sample mean X̄ = n−1 ∑i=1n Xi provides an estimator but without exponential concentration. Indeed, by the Markov inequality, we have P(|X̄ − μ| ≥ t) ≤ σ2/(nt2), which is tight in general and has a Cauchy tail (in terms of t). On the other hand, if we truncate the data with X̃i = sign(Xi) min(|Xi|, τ) for a suitable level τ (of order σ√n) and compute the mean of the truncated data μ̃ = n−1 ∑i=1n X̃i, then we have (Fan et al., 2016b)
P(|μ̃ − μ| ≥ t) ≤ 2 exp(−cnt2/σ2)
for a universal constant c > 0. In other words, the mean of truncated data with only a finite second moment behaves very much the same as the sample mean from normal data: both estimators have Gaussian tails (in terms of t). This sub-Gaussian concentration is fundamental in high-dimensional statistics as the sample mean is computed tens of thousands or even millions of times.
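A minimal numpy sketch of the truncated mean is given below; the default threshold τ = σ̂√n is one illustrative choice and would be tuned (e.g., by cross-validation) in practice.

```python
import numpy as np

def truncated_mean(x, tau=None):
    """Robust mean of a heavy-tailed sample: truncate each observation at
    level tau before averaging. The default tau ~ sigma_hat * sqrt(n) is an
    illustrative choice; in practice tau is tuned or chosen adaptively."""
    x = np.asarray(x, dtype=float)
    if tau is None:
        tau = np.std(x) * np.sqrt(len(x))
    return np.mean(np.sign(x) * np.minimum(np.abs(x), tau))
```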
As an example, estimating the high-dimensional covariance matrix Σ = (σij) involves O(p2) univariate mean estimations, since each covariance can be expressed as an expectation: σij = E(XiXj) when the means are zero. Estimating each component by the truncated mean yields a covariance matrix estimator Σ̃ = (σ̃ij). Assuming the fourth moments are bounded (as the covariances themselves are second moments), by using the union bound and the above concentration inequality, we can easily obtain
P( ‖Σ̃ − Σ‖max ≥ a√(log p/n) ) ≤ 2p2−c′a2
for any a > 0 and a constant c′ > 0. In other words, with truncation, when the data have merely bounded fourth moments, we can achieve the same estimation rate as the sample covariance matrix under Gaussian data.
Fan et al. (2016b) and Minsker (2016) independently proposed shrinkage variants of the sample covariance with sub-Gaussian behavior under the spectral norm, as long as the fourth moments of the data are finite. For any τ > 0, Fan et al. (2016b) proposed the following shrinkage sample covariance matrix
(2.4)  Σ̃τ = n−1 ∑i=1n x̃ix̃i⊤,  x̃i = (‖xi‖4 ∧ τ) xi/‖xi‖4,
to estimate Σ, where ‖xi‖4 = (∑j=1p xij4)1/4 is the ℓ4-norm. The following theorem establishes the statistical error rate of Σ̃τ in terms of the spectral norm.
Theorem 2.2. Suppose E[(v⊤xi)4] ≤ R for any unit vector v ∈ ℝp. Then, with an appropriate choice of τ (depending on n, p, R and δ), it holds that for any δ > 0,
(2.5)  P( ‖Σ̃τ − Σ‖2 ≥ C√(Rp log(p/δ)/n) ) ≤ δ,
where C is a universal constant.
Applying PCA to the robust covariance estimators as described above leads to more reliable estimation of principal eigenspaces in the presence of heavy-tailed data.
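To make the construction concrete, here is a numpy sketch in the spirit of (2.4): each observation is rescaled so that its ℓ4-norm does not exceed τ before the second-moment matrix is formed. The choice of τ is left to the user.

```python
import numpy as np

def shrinkage_covariance(X, tau):
    """Shrinkage covariance in the spirit of (2.4): rescale each observation
    x_i so that its l4-norm does not exceed tau, then average x_i x_i^T."""
    n, p = X.shape
    l4 = np.sum(np.abs(X) ** 4, axis=1) ** 0.25          # per-observation l4-norms
    scale = np.minimum(l4, tau) / np.maximum(l4, 1e-12)  # (||x_i||_4 ^ tau) / ||x_i||_4
    Xs = X * scale[:, None]
    return Xs.T @ Xs / n
```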
In Theorem 2.2, we assume that the mean of xi is zero. When this does not hold, a natural estimator of Σ is a shrunk U-statistic (Fan et al., 2017a), which applies the same type of shrinkage to the pairwise differences (xi − xj)/√2 over all pairs i ≠ j; these differences have second-moment matrix Σ and are free of the unknown mean. When τ = ∞, it reduces to the usual U-statistic. It possesses a similar concentration property to that in Theorem 2.2 with a proper choice of τ.
2.4. Perturbation bounds
In this section, we introduce several perturbation results on eigenspaces, which serve as fundamental technical tools in factor models and related learning problems. For example, in relating the factor loading matrix B to the principal components of the covariance matrix Σ in (2.1), one can regard Σ as a perturbation of BB⊤ by an amount of Σu and take A = BB⊤ and Ã = Σ in Theorem 2.3 below. Similarly, we can also regard a covariance matrix estimator Σ̂ as a perturbation of Σ by an amount of Σ̂ − Σ.
We will begin with a review of the Davis-Kahan theorem (Davis and Kahan, 1970), which is usually useful for deriving ℓ2-type bounds (which includes spectral norm bounds) for symmetric matrices. Then, based on this classical result, we introduce entry-wise (ℓ∞) bounds, which typically give refined results under structural assumptions. We also derive bounds for rectangular matrices that are similar to Wedin’s theorem (Wedin, 1972). Several recent works on this topic can be found in Yu et al. (2014); Fan et al. (2018b); Koltchinskii and Xia (2016); Abbe et al. (2017); Zhong (2017); Cape et al. (2017); Eldridge et al. (2017).
First, for any two subspaces S and S̃ of the same dimension K in ℝn, we choose any V, Ṽ ∈ ℝn×K with orthonormal columns that span S and S̃, respectively. We can measure the closeness between the two subspaces through the difference between their projection matrices:
d2(S, S̃) = ‖ṼṼ⊤ − VV⊤‖2  and  dF(S, S̃) = ‖ṼṼ⊤ − VV⊤‖F.
The above definitions are both proper metrics (or distances) for subspaces S and S̃ and do not depend on the specific choice of V and Ṽ, since ṼṼ⊤ and VV⊤ are projection operators. Importantly, these two metrics are connected to the well-studied notion of canonical angles (or principal angles). Formally, let the singular values of V⊤Ṽ be σ1 ≥ σ2 ≥ ⋯ ≥ σK ≥ 0, and define the canonical angles θk = cos−1 σk for k = 1, …, K. It is often useful to denote the sine of the canonical (principal) angles by sin Θ(S, S̃) = diag(sin θ1, …, sin θK), which can be interpreted as a generalization of the sine of the angle between two vectors. The following identities are well known (Stewart and Sun, 1990):
d2(S, S̃) = ‖sin Θ(S, S̃)‖2  and  dF(S, S̃) = √2 ‖sin Θ(S, S̃)‖F.
In some cases, it is convenient to fix a specific choice of Ṽ and V. It is known that for both the Frobenius norm and the spectral norm,
minO∈O(K) ‖ṼO − V‖ ≤ √2 ‖sin Θ(S, S̃)‖,
where O(K) is the space of orthogonal matrices of size K × K. The minimizer (best rotation of basis) can be given by the singular value decomposition (SVD) of V⊤Ṽ. For details, see Cape et al. (2017) for example.
Now, we present the Davis-Kahan sin θ theorem (Davis and Kahan, 1970).
Theorem 2.3. Suppose A, Ã ∈ ℝn×n are symmetric, and that V, Ṽ ∈ ℝn×K have orthonormal column vectors which are eigenvectors of A and Ã, respectively. Let L(V) be the set of eigenvalues corresponding to the eigenvectors given in V, and let L(V⊥) (respectively L(Ṽ⊥)) be the set of eigenvalues corresponding to the eigenvectors not given in V (respectively Ṽ). If there exist an interval [α, β] and δ > 0 such that L(V) ⊆ [α, β] and L(Ṽ⊥) ⊆ (−∞, α − δ] ∪ [β + δ, +∞), then for any orthogonal-invariant norm3 ‖·‖,
‖sin Θ(V, Ṽ)‖ ≤ ‖Ã − A‖/δ.
This theorem can be generalized to singular vector perturbation for rectangular matrices; see Wedin (1972). A slightly unpleasant feature of this theorem is that δ depends on the eigenvalues of both A and Ã. However, with the help of Weyl’s inequality, we can immediately obtain a corollary that does not involve the eigenvalues of Ã. Let λj(·) denote the jth largest eigenvalue of a real symmetric matrix. Recall that Weyl’s inequality bounds the differences between the eigenvalues of A and Ã:
(2.6)  maxj |λj(Ã) − λj(A)| ≤ ‖Ã − A‖2.
This inequality suggests that, if the eigenvalues in L(Ṽ) have the same ranks (in descending order) as those in L(V), then L(Ṽ) and L(V) are similar. Below we state our corollary, whose proof is in the appendix.
Corollary 2.1. Assume the setup of the above theorem, and suppose the eigenvalues in L(Ṽ) have the same ranks as those in L(V). If L(V) ⊆ [α, β] and L(V⊥) ⊆ (−∞, α − δ0] ∪ [β + δ0, +∞) for some δ0 > 0, then
‖sin Θ(V, Ṽ)‖2 ≤ 2‖Ã − A‖2/δ0.
We can then use ‖sin Θ(V, Ṽ)‖F ≤ √K ‖sin Θ(V, Ṽ)‖2 to obtain a bound under the Frobenius norm. In the special case where K = 1 and Ṽ = ṽ, V = v reduce to vectors, we can choose α = β = λ, and the above corollary translates into
(2.7)  sin θ(v, ṽ) ≤ 2‖Ã − A‖2/δ0.
We can now see that the factor model and PCA are approximately the same with a sufficiently large eigen-gap. Indeed, under Identifiability Assumption 1.1, we have Σ = BB⊤ + Σu. Applying Weyl’s inequality and Corollary 2.1 to BB⊤ (as A) and Σ (as Ã), we can easily control the eigenvalue/eigenvector differences by the ratio of ‖Σu‖2 to the eigen-gap, which is comparably small under Pervasiveness Assumption 2.1. This difference can be interpreted as the bias incurred by PCA in approximating factor models.
Furthermore, given any covariance estimator Σ̂, we can similarly apply the above results by setting A = Σ and Ã = Σ̂ to bound the difference between the estimated eigenvalues/eigenvectors and their population counterparts. Note that the above corollary gives us an upper bound on the subspace estimation error in terms of the ratio ‖Σ̂ − Σ‖2/δ0.
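The following numpy snippet illustrates this use of the bound on synthetic data: it compares the top-K eigenspace of Σ = BB⊤ + Σu with that of BB⊤ via the sin Θ distance and checks it against the ratio ‖Σu‖2/δ0; all simulation parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 400, 2
B = rng.normal(size=(p, K))                      # pervasive loadings
Sigma_u = 0.5 * np.eye(p)                        # a simple (here diagonal) error covariance
Sigma = B @ B.T + Sigma_u

def top_eigvecs(M, K):
    w, V = np.linalg.eigh(M)
    return V[:, np.argsort(w)[::-1][:K]]

V = top_eigvecs(B @ B.T, K)                      # column space of B
V_tilde = top_eigvecs(Sigma, K)                  # perturbed eigenspace
# sin(Theta) distance via the projector difference (spectral norm)
sin_theta = np.linalg.norm(V_tilde @ V_tilde.T - V @ V.T, 2)
eigengap = np.sort(np.linalg.eigvalsh(B @ B.T))[-K]   # K-th largest eigenvalue of BB^T
print(sin_theta, np.linalg.norm(Sigma_u, 2) / eigengap)   # bound holds up to a constant
```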
Next, we consider entry-wise bounds on the eigenvectors. For simplicity, here we only consider eigenvectors corresponding to unique eigenvalues rather than the general eigenspace. Often, we want to have a bound on each entry of the eigenvector difference ṽk − vk, instead of an ℓ2 norm bound, which is an average-type result. In many cases, none of these entries has dominant perturbation, but the Davis-Kahan theorem falls short of providing a reasonable bound (the naïve bound ‖ṽk − vk‖∞ ≤ ‖ṽk − vk‖2 gives a suboptimal result).
Some recent papers (Abbe et al., 2017) have addressed this problem, and in particular, entry-wise bounds of the following form are established:
‖ṽk − vk‖∞ ≲ μ · ‖ṽk − vk‖2 + (small term),
where μ ∈ [0, 1] is related to the structure of the statistical problem and typically can be as small as √(log n/n), which is very desirable in the high-dimensional setting. The small term is often related to the independence pattern of the data, and is typically small under mild independence conditions.
We illustrate this idea in Figure 2 through a simulated data example (left) and a real data example (right), both of which have factor-type structure. For the left plot, we generated network data according to the stochastic block model with K = 2 blocks (communities), each having n/2 = 2500 nodes: the adjacency matrix that represents the links between nodes is a symmetric matrix, with upper triangular elements generated independently from Bernoulli trials (diagonal elements are taken as 0), with edge probability 5 log n/n for two nodes within the same block and log n/(4n) otherwise. Our task is to classify (cluster) these two communities based on the adjacency matrix. We used the second eigenvector (that is, the one corresponding to the second largest eigenvalue) of the adjacency matrix as a classifier. The left panel of Figure 2 represents the values of the 5000 coordinates (or entries) [v2]i in the y-axis against the indices i = 1, …, 5000 in the x-axis. For comparison, the second eigenvector v2* of the expectation of the adjacency matrix—which is of interest but unknown—has entries taking values only in {±1/√n}, depending on the unknown block membership of each vertex (this statement is not hard to verify). We used the horizontal lines to represent these ideal values: they indicate exactly the membership of each vertex. Clearly, the magnitude of the entry-wise perturbation is uniformly much smaller than 1/√n. Therefore, we can use v2 as an estimate of v2* and classify all nodes with the same sign as belonging to the same community. See Section 4.2 for more details.
Fig 2.
The left plot shows the entries (coordinates) of the second eigenvector v2 computed from the adjacency matrix from the SBM with two equal-sized blocks (n = 5000, K = 2). The plot also shows the expectation counterpart v2*, whose entries all have the same magnitude 1/√n. The deviation of v2 from v2* is quite uniform, which is a phenomenon not captured by the Davis-Kahan theorem. The right plot shows the coordinates of the two leading eigenvectors of the sample covariance matrix calculated from 2012–2017 daily return data of 484 stocks (tiny black dots). We also highlight six stocks during three time windows (2012–2015, 2013–2016, 2014–2017) with big markers, so that the fluctuation/perturbation is shown. The magnitude of these coordinates is typically small, and the fluctuation is also small.
For the right plot, we used daily return data of stocks that are constituents of the S&P 500 index from 2012.1.1–2017.12.31. We considered stocks with exactly n = 1509 records and excluded stocks with incomplete/missing values, which resulted in p = 484 stocks. Then, we calculated the sample covariance matrix using the data in the entire period, computed the two leading eigenvectors (note that they approximately span the column space of B) and plotted their coordinates (entries) using small dots. Stocks with a coordinate smaller than the 5% quantile or larger than the 95% quantile are potentially outlying values and are not shown in the plot. In addition, we also highlighted the fluctuation of six stocks during three time windows: 2012.1–2015.12, 2013.1–2016.12 and 2014.1–2017.12, with different big markers. That is, for each of the three time windows, we re-computed the covariance matrices and the two leading eigenvectors, and then highlighted the coordinates that correspond to the six major stocks. Clearly, the magnitude of the coordinates for these stocks is small, roughly of order 1/√p, and the fluctuation of the coordinates is also very small. Both plots suggest an interesting phenomenon of eigenvectors in high dimensions: the entry-wise behavior of eigenvectors can be benign under factor model structure.
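A compact numpy sketch reproducing the stochastic block model experiment described above (with a smaller n for speed) is given below; clustering by the sign of the second eigenvector recovers the two blocks.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000                                   # smaller than the n = 5000 in the text, for speed
labels = np.repeat([1, -1], n // 2)        # two equal-sized blocks
p_in, p_out = 5 * np.log(n) / n, np.log(n) / (4 * n)
P = np.where(np.equal.outer(labels, labels), p_in, p_out)
A = np.triu(rng.random((n, n)) < P, k=1).astype(float)
A = A + A.T                                # symmetric adjacency matrix, zero diagonal

w, V = np.linalg.eigh(A)
v2 = V[:, np.argsort(w)[::-1][1]]          # eigenvector of the 2nd largest eigenvalue
pred = np.sign(v2)                         # classify nodes by the sign of the entries
accuracy = max(np.mean(pred == labels), np.mean(pred == -labels))
print(accuracy)                            # close to 1 at this signal strength
```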
To state our results rigorously, let us suppose that A, W ∈ ℝn×n are symmetric matrices, with Ã = A + W and rank(A) = K < n. Let the eigen-decompositions of A and Ã be
(2.8)  A = ∑k=1K λk vk vk⊤  and  Ã = ∑k=1n λ̃k ṽk ṽk⊤.
Here the eigenvalues λ1, …, λK and λ̃1, …, λ̃K are the K largest ones of A and Ã, respectively, in terms of absolute values. Both sequences are sorted in descending order. λ̃K+1, …, λ̃n are the eigenvalues of Ã whose absolute values are smaller than |λ̃K|. The eigenvectors vk and ṽk are normalized to have unit norms.
Here the λk's are allowed to take negative values. Thanks to Weyl’s inequality, λ̃1, …, λ̃K and λ̃K+1, …, λ̃n are well separated when the size of the perturbation W is not too large. In addition, we have the freedom to choose signs for eigenvectors, since they are not uniquely defined. Later, we will use ‘up to sign’ to signify that our statement is true for at least one choice of sign. With the conventions λ0 = +∞ and λK+1 = −∞, we define the eigen-gap as
(2.9)  δk = min{ |λk − λk−1|, |λk − λk+1|, |λk| },  k ∈ [K],
which is the smallest distance between λk and other eigenvalues (including 0). This definition coincides with the (usual) eigen-gap in Corollary 2.1 in the special case where we are interested in a single eigenvalue and its associated eigenvector.
We now present an entry-wise perturbation result. Let us first look at only one eigenvector. In this case, when ‖W‖2/δℓ is small, heuristically,
[ṽℓ]m ≈ [Ãvℓ]m/λℓ = [vℓ]m + [Wvℓ]m/λℓ
holds uniformly for each entry m. When E W = 0, that is, Ã is an unbiased version of A, this gives a first-order approximation (rather than a bound on the difference ṽℓ − vℓ) of the random vector ṽℓ. Abbe et al. (2017) proves rigorously this result and generalizes it to eigenspaces. The key technique for the proof is similar to Theorem 2.4 below, which simplifies the one in Abbe et al. (2017) in various ways but holds under more general conditions. It is stated in a deterministic way, and can be powerful if there is certain structural independence in the perturbation matrix W. A self-contained proof can be found in the appendix.
For each m ∈ [n], let W(m) be a modification of W with the mth row and mth column zeroed out, i.e., (W(m))ij = Wij 1{i ≠ m, j ≠ m}.
We also define Ã(m) = A + W(m), and denote its eigenvalues and eigenvectors by λ̃k(m) and ṽk(m), respectively. This construction is related to the leave-one-out technique in probability and statistics. For recent papers using this technique, see Bean et al. (2013); Zhong and Boumal (2018); Abbe et al. (2017) for example.
Theorem 2.4. Fix any ℓ ∈ [K]. Suppose that A and W are as above, and that the eigen-gap δℓ as defined in (2.9) satisfies δℓ ≥ 4‖W‖2. Then, up to sign,
(2.10)  |[ṽℓ]m − [vℓ]m| ≤ C [ (‖W‖2/δℓ) (∑k=1K [vk]m2)1/2 + |wm⊤ṽℓ(m)|/δℓ ],  m ∈ [n],
where wm is the mth column of W and C is an absolute constant.
To understand this theorem, let us compare it with the standard ℓ2 bound (Theorem 2.3), which implies ‖ṽℓ − vℓ‖2 ≲ ‖W‖2/δℓ up to sign. The first term of the upper bound in (2.10) says the perturbation of the mth entry can be much smaller, because the factor (∑k=1K [vk]m2)1/2, always bounded by 1, can usually be much smaller. For example, if the vk's are uniformly distributed on the unit sphere, then this factor is typically of order √(K log n/n). This factor is related to the notion of incoherence in Candès and Recht (2009); Candès et al. (2011), etc.
The second term of the upper bound in (2.10) is typically much smaller than ‖W‖2/δℓ, especially under certain independence assumptions. For example, if wm is independent of the other entries of W, then, by construction, ṽℓ(m) and wm are independent. If, moreover, the entries of wm are i.i.d. standard Gaussian, |wm⊤ṽℓ(m)| is of order √(log n), whereas ‖W‖2 typically scales with √n. This gives a bound for the mth entry, and can be extended to an ℓ∞ bound if we are willing to make the independence assumption for all m ∈ [n] (which is typical for random graphs, for example).
We remark that this result can be generalized to perturbation bounds for eigenspaces (Abbe et al., 2017), and the conditions on eigenvalues can be relaxed using certain random matrix assumptions (Koltchinskii and Xia, 2016; O’Rourke et al., 2017; Zhong, 2017).
Now, we extend this perturbation result to singular vectors of rectangular matrices. Suppose L, E ∈ ℝn×p satisfy L̃ = L + E and rank(L) = K < min{n, p}. Let the SVDs of L and L̃ be4
L = ∑k=1K σk uk vk⊤  and  L̃ = ∑k=1n∧p σ̃k ũk ṽk⊤,
where σk and σ̃k are respectively non-increasing in k, and uk, vk, ũk and ṽk are all normalized to have unit ℓ2 norm. As before, σ̃1, …, σ̃K are the K largest singular values of L̃. Similar to (2.9), we adopt the conventions σ0 = +∞, σK+1 = 0 and define the eigen-gap as
(2.11)  δk = min{ σk−1 − σk, σk − σk+1 },  k ∈ [K].
For j ∈ [p] and i ∈ [n], we define unit vectors ṽk(i) and ũk(j) by replacing a certain row or column of E with zeros. To be specific, in our expression L̃ = L + E, if we replace the ith row of E by zeros, then the normalized right singular vectors of the resulting perturbed matrix are denoted by ṽk(i); and if we replace the jth column of E by zeros, then the normalized left singular vectors of the resulting perturbed matrix are denoted by ũk(j).
Corollary 2.2. Fix any ℓ ∈ [K]. Suppose that L and E are as above, and that δℓ ≥ 4‖E‖2. Then, up to sign,
|[ũℓ]i − [uℓ]i| ≤ C [ (‖E‖2/δℓ) (∑k=1K [uk]i2)1/2 + |Ei·⊤ ṽℓ(i)|/δℓ ],  i ∈ [n],
|[ṽℓ]j − [vℓ]j| ≤ C [ (‖E‖2/δℓ) (∑k=1K [vk]j2)1/2 + |E·j⊤ ũℓ(j)|/δℓ ],  j ∈ [p],
where Ei· ∈ ℝp is the ith row vector of E, E·j ∈ ℝn is the jth column vector of E, and C is an absolute constant.
If we view L̃ as the data matrix (or observation) X, then the low rank matrix L can be interpreted as BF⊤. The above result provides a tool for studying estimation errors of the singular subspace of this low rank matrix. Note that ṽℓ(i) can be interpreted as the result of removing the idiosyncratic error of the ith observation, and ũℓ(j) as the result of removing the jth covariate of the idiosyncratic error.
To better understand this result, let us consider a very simple case: K = 1 and each row of E is i.i.d. N(0, Ip). We are interested in bounding the singular vector difference between the rank-1 matrix L = σ1uv⊤ and its noisy observation L̃ = L + E. This is a spiked matrix model with a single spike. By the independence between E·j and ũ(j) as well as elementary properties of Gaussian variables, Corollary 2.2 implies that with probability 1 − o(1), up to sign,
(2.12)  |[ṽ]j − [v]j| ≲ (‖E‖2/σ1) |[v]j| + √(log n)/σ1.
Random matrix theory gives ‖E‖2 ≍ √n + √p with high probability. Our ℓ2 perturbation inequality (Corollary 2.1) implies that ‖ṽ − v‖2 ≲ ‖E‖2/σ1. This upper bound is much larger than the two terms in (2.12), as |[v]j| is typically much smaller than 1 in high dimensions. Thus, (2.12) gives a better entry-wise control than its ℓ2 counterpart.
Beyond this simple case, there are many desirable features of Corollary 2.2. First of all, we allow K to be moderately large, in which case, as mentioned before, the factor (∑k=1K [vk]j2)1/2 is related to the incoherence structure in the matrix completion and robust PCA literature. Secondly, the result holds deterministically, so random matrices are also covered. Finally, the result holds for each i ∈ [n] and j ∈ [p], and thus it is useful even if the entries of E are not independent, e.g., when a subset of the covariates are dependent.
To sum up, our results Theorem 2.4 and Corollary 2.2 provide flexible tools of studying entry-wise perturbation of eigenvectors and singular vectors. It is also easy to adapt to other problems since their proofs are not complicated (see the appendix).
3. APPLICATIONS TO HIGH-DIMENSIONAL STATISTICS
3.1. Covariance estimation
Estimation of high-dimensional covariance matrices has wide applications in modern data analysis. When the dimensionality p exceeds the sample size n, the sample covariance matrix becomes singular. Structural assumptions are necessary in order to obtain a consistent estimator in this challenging scenario. One typical assumption in the literature is that the population covariance matrix is sparse, with a large fraction of entries being (close to) zero; see Bickel and Levina (2008) and Cai and Liu (2011). In this setting, most variables are nearly uncorrelated. In financial and genetic data, however, the presence of common factors leads to strong dependencies among variables (Fan et al., 2008). The approximate factor model (1.1) better characterizes this structure and helps construct valid estimates. Under this model, the covariance matrix Σ has decomposition (2.1), where Σu is assumed to be sparse (Fan et al., 2013). Intuitively, we may assume that Σu only has a small number of nonzero entries. Formally, we require the sparsity parameter
m0 := maxi∈[p] ∑j∈[p] 1{(Σu)ij ≠ 0}
to be small. This definition can be generalized to a weaker sense of sparsity, which is characterized by mq := maxi∈[p] ∑j∈[p] |(Σu)ij|q, where q ∈ (0, 1) is a parameter. Note that a small mq forces Σu to have few large entries. However, for simplicity, we choose not to use this more general definition when presenting the theoretical results below.
The approximate factor model has the following two important special cases, under which the parameter estimation has been well studied.
The sparse covariance model is (2.1) without factor structure, i.e. Σ = Σu; typically, entry-wise thresholding is employed for estimation.
The strict factor model corresponds to (2.1) with Σu being diagonal; usually, PCA-based methods are used.
The approximate factor model is a combination of the above two models, as it comprises both a low-rank component and a sparse component. A natural idea is to fuse methodologies for the two models into one, by estimating the two components using their corresponding methods. This motivated our high-level idea for estimation under the approximate factor model: (1) estimating the low-rank component (factors and loadings) using regression (when factors are observable) or PCA (when factors are latent); (2) after eliminating it from Σ, employing standard techniques such as thresholding in the sparse covariance matrix literature to estimate Σu; (3) adding the two estimated components together.
First, let us consider the scenario where the factors are observable. In this setting, we do not need the Identifiability Assumption 1.1. Fan et al. (2008) focused on the strict factor model where the Σu in (2.1) is diagonal. This was then extended to the approximate factor model (1.1) by Fan et al. (2011). Later, Fan et al. (2018b) relaxed the sub-Gaussian assumption on the data to a moment condition, and proposed a robust estimator. We are going to present the main idea of these methods using the one in Fan et al. (2011).
Step 1. Estimate B using the ordinary least squares: B̂ = (b̂1, …, b̂p)⊤, where
(âj, b̂j) = argminaj, bj ∑i=1n (xij − aj − bj⊤fi)2,  j ∈ [p].
Step 2. Let â = (â1, …, âp)⊤ be the vector of intercepts, ûi = xi − â − B̂fi be the vector of residuals for i ∈ [n], and Su = n−1 ∑i=1n ûiûi⊤ be the sample covariance of the residuals. Apply thresholding to Su and obtain a regularized estimator Σ̂u.
Step 3. Estimate cov(fi) by the sample covariance ĉov(fi) = n−1 ∑i=1n (fi − f̄)(fi − f̄)⊤, where f̄ = n−1 ∑i=1n fi.
Step 4. The final estimator is Σ̂ = B̂ ĉov(fi) B̂⊤ + Σ̂u.
We remark that in Step 2, there are many thresholding rules for estimating sparse covariance matrices. Two popular choices are the t-statistic-based adaptive thresholding (Cai and Liu, 2011) and correlation-based adaptive thresholding (Fan et al., 2013), with the entry-wise thresholding level chosen to be of order √(log p/n). As the sparsity patterns of the correlation and covariance matrices are the same and the correlation matrix is scale-invariant, one typically applies the thresholding to the correlations and then scales back to the covariance. Except for the dependence on the number of factors K, this coincides with the commonly-used threshold for estimating sparse covariance matrices.
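A minimal numpy sketch of Steps 1–4 is given below; for simplicity it applies plain hard thresholding to the residual correlations at a user-supplied level, whereas the adaptive rules cited above would be used in practice.

```python
import numpy as np

def cov_with_observed_factors(X, F, thresh):
    """Covariance estimation with observed factors: regress X on F,
    threshold the residual covariance (via correlations), and add back
    the factor part. `thresh` plays the role of the level ~ sqrt(log p / n)."""
    n, p = X.shape
    F1 = np.column_stack([np.ones(n), F])                 # add intercept
    coef, *_ = np.linalg.lstsq(F1, X, rcond=None)         # Step 1: OLS, (K+1) x p
    B_hat = coef[1:].T                                    # p x K loadings
    U = X - F1 @ coef                                     # residuals
    Su = U.T @ U / n                                      # Step 2: residual covariance
    d = np.sqrt(np.diag(Su))
    R = Su / np.outer(d, d)                               # residual correlations
    R_thr = R * (np.abs(R) >= thresh)                     # hard-threshold off-diagonals ...
    np.fill_diagonal(R_thr, 1.0)                          # ... but keep the diagonal
    Sigma_u_hat = R_thr * np.outer(d, d)                  # scale back to covariance
    cov_f = np.atleast_2d(np.cov(F, rowvar=False))        # Step 3
    return B_hat @ cov_f @ B_hat.T + Sigma_u_hat          # Step 4
```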
While it is not possible to achieve better convergence of Σ in terms of the operator norm or the Frobenius norm, Fan et al. (2011) considered two other important norms. Under regularity conditions, it is shown that
(3.1) |
Here, for a matrix A, ‖A‖Σ := p−1/2‖Σ−1/2AΣ−1/2‖F and ‖A‖max := maxi,j |Aij| refer to its entropy-loss norm and entry-wise max-norm, respectively. As is pointed out by Fan et al. (2011) and Wang and Fan (2017), they are relevant to portfolio selection and risk management. In addition, convergence rates for Σ̂u and the precision matrices Σ̂u−1 and Σ̂−1 are also established.
Now we come to covariance estimation with latent factors. As is mentioned in Section 2.1, the Pervasiveness Assumption 2.1 helps separate the low-rank part BB⊤ from the sparse part Σu in (2.1). Fan et al. (2013) proposed a Principal Orthogonal complEment Thresholding (POET) estimator, motivated by the relationship between PCA and factor model, and the estimation of sparse covariance matrix Σu in Fan et al. (2011). The procedure is described as follows.
Step 1. Let S = n−1 ∑i=1n (xi − x̄)(xi − x̄)⊤ be the sample covariance matrix, λ̂1 ≥ λ̂2 ≥ ⋯ ≥ λ̂p be the eigenvalues of S in non-ascending order, and v̂1, …, v̂p be their corresponding eigenvectors.
Step 2. Apply thresholding to the principal orthogonal complement Ru := S − ∑k=1K λ̂k v̂k v̂k⊤ and obtain a regularized estimator Σ̂u.
Step 3. The final estimator is Σ̂ = ∑k=1K λ̂k v̂k v̂k⊤ + Σ̂u.
Here K is assumed to be known and bounded to simplify the presentation and emphasize the main ideas. The methodology and theory in Fan et al. (2013) also allow using a data-driven estimate of K. In Step 2 above we can choose from a large class of thresholding rules, and it is recommended to use the correlation-based adaptive thresholding. However, the thresholding level should be set to C(√(log p/n) + 1/√p). Compared to the level we use in covariance estimation with observed factors, the extra term 1/√p here is the price we pay for not knowing the latent factors. It is negligible when p grows much faster than n. Intuitively, thanks to the Pervasiveness Assumption, the latent factors can be estimated accurately in high dimensions. Fan et al. (2013) obtained theoretical guarantees for POET that are similar to (3.1). The analysis allows for general sparsity patterns of Σu by considering mq as the measure of sparsity for q ∈ [0, 1).
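Below is a minimal numpy sketch of POET with simple hard thresholding of the correlations of the principal orthogonal complement; the adaptive thresholding rule and the level C(√(log p/n) + 1/√p) recommended above would replace the plain rule used here.

```python
import numpy as np

def poet(X, K, thresh):
    """POET: keep the top-K principal components of the sample covariance
    and threshold the remainder (here via hard thresholding of correlations)."""
    S = np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(S)
    idx = np.argsort(w)[::-1][:K]
    low_rank = (V[:, idx] * w[idx]) @ V[:, idx].T     # sum_k lambda_k v_k v_k^T
    R_u = S - low_rank                                # principal orthogonal complement
    d = np.sqrt(np.clip(np.diag(R_u), 1e-12, None))
    C = R_u / np.outer(d, d)
    C_thr = C * (np.abs(C) >= thresh)                 # hard-threshold the correlations
    np.fill_diagonal(C_thr, 1.0)
    Sigma_u_hat = C_thr * np.outer(d, d)
    return low_rank + Sigma_u_hat
```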
Robust procedures handling heavy-tailed data are proposed and analyzed by Fan et al. (2018a,b). In another line of research, Li et al. (2017) considered estimation of the covariance matrix of a set of targeted variables, when additional data beyond the variables of interest are available. By assuming a factor model structure, they constructed an estimator taking advantage of all the data and justified the information gain theoretically.
The Pervasiveness Assumption rules out the case where factors are weak and the leading eigenvalues of Σ are not as large as O(p). Shrinkage of eigenvalues is a powerful technique in this scenario. Donoho et al. (2013) systematically studied the optimal shrinkage in spiked covariance model where all the eigenvalues except several largest ones are assumed to be the same. Wang and Fan (2017) considered the approximate factor model, which is more general, and proposed a new version of POET with shrinkage for covariance estimation.
3.2. Principal component regression with random sketch
Principal component regression (PCR), first proposed by Hotelling (1933) and Kendall (1965), is one of the most popular methods of dimension reduction in linear regression. It employs the principal components of the predictors xi to explain or predict the response yi. Why do principal components, not other components, have more prediction power? Here we offer an insight from the perspective of high-dimensional factor models.
The basic assumption is that the unobserved latent factors simultaneously drive the covariates via (1.1) and the responses, as shown in Figure 3. As a specific example, we assume
y = Fγ* + ε,
where y = (y1, …, yn)⊤, F = (f1, …, fn)⊤, γ* ∈ ℝK, and the noise ε = (ε1, …, εn)⊤ has mean zero and covariance matrix σ2In. Since fi is latent and the covariate vector is high dimensional, we naturally infer the latent factors from the observed covariates via PCA. This yields the PCR.
Fig 3.
Illustration of the data generation mechanism in PCR. Both the predictors xi and the responses yi are driven by the latent factors fi. PCR extracts latent factors via the principal components of X, and uses the resulting estimate F̂ as the new predictor. Regressing y against F̂ leads to the PCR estimator, which typically enjoys a smaller variance due to its reduced dimension, though it introduces bias.
By (2.2) (assume μ = 0 for simplicity), y ≈ Xβ† + ε, where β† = B(B⊤B)−1γ*. This suggests that if we directly regress yi over xi, then the regression coefficient β† should lie in the column space spanned by B. This inspires the core idea of PCR, i.e., instead of seeking the least squares estimator in the entire space, we restrict our search scope to the leading singular space of the design matrix, which is approximately the column space of B under the Pervasiveness Assumption.
Let us discuss PCR more rigorously. To be consistent with the rest of this paper, we let X = (x1, …, xn)⊤ ∈ ℝn×p, which is different from some conventions, and consider the linear model
(3.2)  y = Xβ* + ε.
Let X = PΣQ⊤ be the SVD of X, where the diagonal entries of Σ ∈ ℝn×p are the non-increasing singular values of X. For some integer K satisfying 1 ≤ K ≤ min(n, p), write P = (PK, PK+) and Q = (QK, QK+). The PCR estimator solves the following optimization problem:
(3.3)  β̂PCR = argminβ∈ℝp ‖y − Xβ‖22  subject to β ∈ span(QK).
It is easy to verify that
(3.4)  β̂PCR = QK ΣK−1 PK⊤ y,
where ΣK is the top left K × K submatrix of Σ. The following lemma calculates the excess risk of β̂PCR, i.e., n−1 E‖Xβ̂PCR − Xβ*‖22, treating X as fixed. The proof is relegated to the appendix.
Lemma 3.1. Let p1, …, pn be the column vectors of P. For j = 1, …, n ∧ p, denote θj = pj⊤Xβ*. We have
n−1 E‖Xβ̂PCR − Xβ*‖22 = n−1 ∑j=K+1n∧p θj2 + Kσ2/n.
Define the ordinary least squares (OLS) estimator β̂OLS = argminβ ‖y − Xβ‖22 (taking the minimum-norm solution if it is not unique). Note that its excess risk is n−1 E‖Xβ̂OLS − Xβ*‖22 = rank(X)σ2/n. Comparing β̂PCR and β̂OLS, one can clearly see a variance-bias tradeoff: PCR reduces the variance from rank(X)σ2/n to Kσ2/n by introducing a bias term n−1 ∑j>K θj2, which is typically small and vanishes in the ideal case where β* lies in span(QK)—this is the bias incurred by imposing the constraint in (3.3).
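The following numpy sketch computes the PCR estimator via the SVD, following (3.4); the commented OLS line is included only for comparison.

```python
import numpy as np

def pcr(X, y, K):
    """Principal component regression: project onto the top-K singular
    directions of X, i.e. beta_PCR = Q_K Sigma_K^{-1} P_K^T y as in (3.4)."""
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    return Qt[:K].T @ (P[:, :K].T @ y / s[:K])

# Comparison with OLS (minimum-norm solution when p > n):
# beta_ols = np.linalg.pinv(X) @ y
```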
In the high-dimensional setting where p is large, calculating PK using the SVD is computationally expensive. Recently, sketching has gained growing attention in the statistics community and is used for downscaling and accelerating inference tasks with massive data. See the recent surveys by Woodruff (2014) and Yang et al. (2016). The essential idea is to multiply the data matrix by a sketch matrix to reduce its dimension while still preserving the statistical performance of the procedure, since random projection reduces the strength of the idiosyncratic noise. To apply sketching to PCR, we first multiply the design matrix X by an appropriately chosen matrix R ∈ ℝp×m with K ≤ m < p:
(3.5)  X̃ = XR,
where R is called the “sketching matrix”. This creates m indices based on X. From the factor model perspective (assuming μ = 0), with a proper choice of R, we have X̃ = XR ≈ F(B⊤R), since the idiosyncratic components in (1.1) are averaged out due to the weak dependence of u. Hence, the m indices in X̃ are approximately linear combinations of the factors. At the same time, since m ≥ K and R is nondegenerate, the column space of X̃ is approximately the same as that spanned by F. This shows that running linear regression on X̃ is approximately the same as running it on F, without using the computationally expensive PCA.
We now examine the property of the sketching approach beyond factor models. Let X̃ = P̃Σ̃Q̃⊤ be the SVD of X̃ = XR, and write P̃ = (P̃K, P̃K+) and Q̃ = (Q̃K, Q̃K+). Imitating the form of (3.4), we consider the following sketched PCR estimator:
(3.6)  β̂sk = R Q̃K Σ̃K−1 P̃K⊤ y,
where Σ̃K is the top left K × K submatrix of Σ̃.
We now explain the above construction for β̂sk. It is easy to derive from (3.4) that, given XR and y as the design matrix and response vector, the PCR estimator should be Q̃K Σ̃K−1 P̃K⊤ y. Then the corresponding PCR projection of y onto the column space of XR should be XR Q̃K Σ̃K−1 P̃K⊤ y = Xβ̂sk. This leads to the construction of β̂sk in (3.6). Theorem 4 in Mor-Yosef and Avron (2018) gives the excess risk of β̂sk, which holds for any R satisfying the conditions of the theorem.
Theorem 3.1. Assume m ≥ K and rank(R⊤X) ≥ K. If , then
(3.7) |
This theorem shows that the extra bias induced by sketching is controlled by ν. Given the bound on the excess risk of β̂PCR in Lemma 3.1, we can deduce a corresponding bound on the excess risk of the sketched PCR estimator.
As we will see below, a smaller ν requires a larger m, and thus more computation. Therefore, we observe a tradeoff between statistical accuracy and computational resources: if we have more computational resources, we can allow a larger dimension m of the sketched matrix X̃, and the sketched PCR is more accurate, and vice versa.
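A numpy sketch of sketched PCR with a Gaussian sketching matrix is shown below; the 1/√m scaling of the entries and the choice of m are illustrative assumptions.

```python
import numpy as np

def sketched_pcr_fit(X, y, K, m, seed=0):
    """Sketched PCR: compress the columns of X with a p x m Gaussian sketching
    matrix R, run PCR on XR, and return the fitted values (a sketch in the
    spirit of (3.5)-(3.6))."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(p, m)) / np.sqrt(m)      # Gaussian sketching matrix
    Xs = X @ R                                    # sketched design, n x m
    P, s, Qt = np.linalg.svd(Xs, full_matrices=False)
    w = Qt[:K].T @ (P[:, :K].T @ y / s[:K])       # PCR coefficients for the sketched design
    return Xs @ w                                 # fitted values
```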
One natural question thus arises: which R should we choose to guarantee a small ν, so as to retain the statistical rate of β̂PCR? Recent results (Cohen et al., 2015) on approximate matrix multiplication (AMM) suggest several candidate sketching matrices for R. Define the stable rank sr(X) := ‖X‖F2/‖X‖22, which can be interpreted as a soft version of the usual rank—indeed, sr(X) ≤ rank(X) always holds, and sr(X) can be small if X is approximately low-rank. An example of candidate sketching matrices for R is a random matrix with independent and suitably scaled sub-Gaussian entries. As long as the sketch size m = Ω(sr(X) + log(1/δ)/ε2), it will hold for any ε, δ ∈ (0, 1/2) that
(3.8)  ‖XRR⊤X⊤ − XX⊤‖2 ≤ ε‖X‖22  with probability at least 1 − δ.
Combining this with the Davis-Kahan theorem (Corollary 2.1), we can deduce that the distance between the top-K left singular subspaces of X and XR is small under a certain eigen-gap condition. We summarize our argument by presenting a corollary of Theorem 9 in Mor-Yosef and Avron (2018) below. Readers can find more candidate sketching matrices in the examples after Theorem 1 in Cohen et al. (2015).
Corollary 3.1. For any ν, δ ∈ (0, 1/2), let ε be chosen suitably in terms of ν and the gap between the Kth and (K+1)th singular values of X, and let R ∈ ℝp×m be a random matrix with i.i.d. N(0, 1/m) entries. Then there exists a universal constant C > 0 such that if m ≥ C(sr(X) + log(1/δ)/ε2), it holds with probability at least 1 − δ that
(3.9) |
Remark 3.1. The bound in (3.9) is tight when ν is small. Some algebra yields that (3.9) holds once the sketch size m is sufficiently large relative to sr(X), log(1/δ), ν and the singular-value gap of X.
One can see that reducing ν requires a larger sketch size m. Besides, a large eigengap of the design matrix X helps reduce the required sketch size.
3.3. Factor-Adjusted Robust Multiple (FARM) tests
Large-scale multiple testing is a fundamental problem in high-dimensional inference. In genome-wide association studies and many other applications, tens of thousands of hypotheses are tested simultaneously. Standard approaches such as Benjamini and Hochberg (1995) and Storey (2002) cannot control well both false and missed discovery rates in the presence of strong correlations among test statistics. Important efforts on dependence adjustment include Efron (2007), Friguet et al. (2009), Efron (2010), and Desai and Storey (2012). Fan et al. (2012) and Fan and Han (2017) considered FDP estimation under the approximate factor model. Wang et al. (2017) studied a more complicated model with both observed variables and latent factors. All these existing papers rely heavily on the joint normality assumption of the data, which is easily violated in real applications. A recent paper (Fan et al., 2017a) developed a factor-adjusted robust procedure that can handle heavy-tailed data while controlling FDP. We are going to introduce this method in this subsection.
Suppose our i.i.d. observations x1, …, xn satisfy the approximate factor model (1.1), where μ = (μ1, …, μp)⊤ is an unknown mean vector. To make the model identifiable, we use the Identifiability Assumption 1.1. We are interested in simultaneously testing
H0j : μj = 0  versus  H1j : μj ≠ 0,  for j ∈ [p].
Let Tj be a generic test statistic for H0j. For a pre-specified threshold z > 0, we reject H0j whenever |Tj| ≥ z. The numbers of total discoveries R(z) and false discoveries V(z) are defined as
R(z) = #{j : |Tj| ≥ z}  and  V(z) = #{j : |Tj| ≥ z, μj = 0}.
Note that R(z) is observable while V (z) needs to be estimated. Our goal is to control the false discovery proportion FDP(z) = V (z)/R(z) with the convention 0/0 = 0.
Naïve tests based on sample averages suffer from size distortion in FDP control due to the dependence induced by the common factors in (1.1). On the other hand, the factor-adjusted test based on the sample averages of xi − Bfi (B and fi need to be estimated) has two advantages: the noise ui is now weakly dependent so that the FDP can be controlled with high accuracy, and the variance of ui is smaller than that of Bfi + ui in model (1.1), so that the test is more powerful. This will be convincingly demonstrated in Figure 5 below. The factor-adjusted robust multiple test (FarmTest) is a robust implementation of the above idea (Fan et al., 2017a), which replaces the sample mean by the adaptive Huber estimator and extracts latent factors from a robust covariance input.
Fig 5.
Histograms of four different mean estimators for simultaneous inference. Fix n = 100, p = 500 and K = 3, and data are generated i.i.d. from t3, which is heavy-tailed. Dashed lines correspond to μj = 0 and μj = 0.6, which are unknown to the statistician. Robustification and factor adjustment help distinguish nulls from alternatives.
To begin with, we consider the Huber loss (Huber, 1964) with the robustification parameter τ ≥ 0:
ℓτ(x) = x2/2 if |x| ≤ τ,  and  ℓτ(x) = τ|x| − τ2/2 if |x| > τ,
and use μ̂j = argminθ ∑i=1n ℓτ(xij − θ) as a robust M-estimator of μj. Fan et al. (2017a) suggested letting τ grow with the sample size at a suitable rate to deal with possibly asymmetric distributions, and called the resulting estimator the adaptive Huber estimator. They showed, assuming bounded fourth moments only, that
(3.10)  √n (μ̂j − μj − bj⊤f̄) → N(0, σu,jj)  in distribution,
where f̄ = n−1 ∑i=1n fi, and σu,jj is the (j, j)th entry of Σu as defined in (2.1). Assuming for now that bj, f̄ and σu,jj are all observable, the factor-adjusted test statistic Tj = √n (μ̂j − bj⊤f̄)/√σu,jj is asymptotically N(0, 1) under H0j. The law of large numbers implies that V(z) should be close to 2p0Φ(−z) for z ≥ 0, where Φ(·) is the cumulative distribution function of N(0, 1), and p0 = #{j : μj = 0} is the number of true nulls. Hence
FDPA(z) := 2pΦ(−z)/R(z)
serves as a good approximation of FDP(z).
Note that in the high-dimensional and sparse regime, we have p0 = p − o(p) and thus FDPA(z) is only a slightly conservative surrogate. However, we can also estimate the proportion π0 = p0/p and use the less conservative estimate 2π̂0 pΦ(−z)/R(z) instead, where π̂0 is an estimate of π0 whose idea is depicted in Figure 4; see Storey (2002). Finally, we define the critical value zα = inf{z ≥ 0 : FDPA(z) ≤ α} and reject H0j whenever |Tj| ≥ zα.
Fig 4.
Estimation of the proportion of true nulls. The observed P-values (right panel) consist of those from significant variables (genes), which are usually small, and those from insignificant variables, which are uniformly distributed. Assuming the P-values for significant variables are mostly less than λ (taken to be 0.5 in this illustration, left panel), the contributions of observed P-values > λ are mostly from true nulls, and this yields a natural estimator π̂0 = #{j : Pj > λ}/((1 − λ)p), which is proportional to the average height of the histogram with P-values > λ (red line). Note that the histogram above the red line estimates the distribution of P-values from the significant variables (genes) in the left panel.
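For concreteness, here is a small numpy/scipy sketch of the Huber M-estimator of a mean; the default choice of τ (a robust scale estimate times roughly √n) is an illustrative heuristic rather than the exact tuning in Fan et al. (2017a).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_mean(x, tau=None):
    """M-estimator of the mean under the Huber loss with parameter tau.
    The default tau grows with sqrt(n), an illustrative 'adaptive' choice."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if tau is None:
        scale = np.median(np.abs(x - np.median(x))) + 1e-12   # robust scale proxy
        tau = scale * np.sqrt(n / np.log(max(n, 2)))
    def loss(theta):
        r = np.abs(x - theta)
        return np.sum(np.where(r <= tau, 0.5 * r ** 2, tau * r - 0.5 * tau ** 2))
    return minimize_scalar(loss, bounds=(float(x.min()), float(x.max())),
                           method="bounded").x
```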
In practice, we have no access to bj, f̄ or σu,jj in (3.10) and need to use their estimates. This results in the Factor-Adjusted Robust Multiple test (FarmTest) in Fan et al. (2017a). The inputs include the data, a generic robust covariance matrix estimator Σ̂ computed from the data, a pre-specified level α ∈ (0, 1) for FDP control, the number of factors K, and the robustification parameters γ and τ. Note that K can be estimated by the methods in Section 2.2, and overestimating K has little impact on the final outputs.
Step 1. Denote by Σ̂ a generic robust covariance matrix estimator. Compute the eigen-decomposition of Σ̂, set λ̂1 ≥ ⋯ ≥ λ̂K to be its top K eigenvalues in descending order, and v̂1, …, v̂K to be their corresponding eigenvectors. Let B̂ = V̂Λ̂1/2, where Λ̂ = diag(λ̂1, …, λ̂K) and V̂ = (v̂1, …, v̂K), and denote its rows by b̂1⊤, …, b̂p⊤.
Step 2. Let μ̂j be the adaptive Huber estimator (with parameter τ) of the mean of {xij}i=1n for j ∈ [p], and let f̂ = argminf∈ℝK ∑j=1p ℓγ(x̄j − b̂j⊤f), where x̄j = n−1 ∑i=1n xij. Construct the factor-adjusted test statistics
(3.11)  Tj = √n (μ̂j − b̂j⊤f̂)/√σ̂u,jj,
where σ̂u,jj = θ̂j − μ̂j2 − ‖b̂j‖22, and θ̂j is a robust estimate of the second moment θj = E xij2.
Step 3. Calculate the critical value zα = inf{z ≥ 0 : FDPA(z) ≤ α}, where FDPA(z) = 2π̂0 pΦ(−z)/R(z), and reject H0j whenever |Tj| ≥ zα.
In Step 2, we estimate f̄ based on the approximation x̄j ≈ μj + bj⊤f̄, which is implied by the factor model (1.1), and regard a non-vanishing μj as an outlier. In the estimation of σu,jj, we used the identity σu,jj = θj − μj2 − ‖bj‖22 with θj = E xij2, and robustly estimated the second moment θj.
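To illustrate Step 3, the short snippet below computes the critical value zα from a vector of test statistics by scanning a grid of thresholds (the grid search stands in for the infimum); setting pi0_hat = 1 corresponds to the conservative choice.

```python
import numpy as np
from scipy.stats import norm

def farmtest_threshold(T, alpha=0.05, pi0_hat=1.0):
    """Critical value z_alpha = inf{z >= 0 : FDP_A(z) <= alpha} with
    FDP_A(z) = 2 * pi0_hat * p * Phi(-z) / R(z), approximated on a grid."""
    T = np.asarray(T, dtype=float)
    p = len(T)
    for z in np.linspace(0, 10, 2001):               # grid search for the infimum
        R = max(np.sum(np.abs(T) >= z), 1)           # number of discoveries (avoid 0/0)
        if 2 * pi0_hat * p * norm.cdf(-z) / R <= alpha:
            return z
    return np.inf

# rejections = np.abs(T) >= farmtest_threshold(T, alpha=0.1)
```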
Figure 5 is borrowed from Figure 1 in Fan et al. (2017a), which illustrates the effectiveness of this procedure. Here n = 100, p = 500, K = 3, and the entries of ui are generated independently from the t-distribution with 3 degrees of freedom. It is known that t-distributions are not sub-Gaussian and are often used to model heavy-tailed data. The unknown means are fixed as μj = 0.6 for j ≤ 125 and μj = 0 otherwise. We plot the histograms of sample means, robust mean estimators, and their counterparts with factor adjustment. The latent factors and heavy-tailed errors make it difficult to distinguish μj = 0.6 from μj = 0, and that explains why the sample means behave poorly. As is shown in Figure 5, better separation can be obtained by factor adjustment and robustification.
While the existing literature usually imposes the joint normality assumption on the data, FarmTest only requires the coordinates of ui to have bounded fourth moments and fi to be sub-Gaussian. Under standard regularity conditions for the approximate factor model, it is proved by Fan et al. (2017a) that
FDPA(z) − FDP(z) → 0 in probability.
We see that FDPA is a valid approximation of FDP, which is therefore faithfully controlled by FarmTest.
3.4. Factor-Adjusted Robust Model (FARM) selection
Model selection is one of the central tasks in high-dimensional data analysis. Parsimonious models enjoy interpretability, stability and, oftentimes, better prediction accuracy. Numerous methods for model selection have been proposed in the past two decades, including the Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), and the Dantzig selector (Candes and Tao, 2007), among others. However, these methods work only when the covariates are weakly dependent or satisfy certain regularity conditions (Zhao and Yu, 2006; Bickel et al., 2009). When covariates are strongly correlated, Paul et al. (2008); Kneip and Sarda (2011); Wang (2012); Fan et al. (2016a) used factor models to eliminate the dependencies caused by pervasive factors, and to conduct model selection using the resulting weakly correlated variables.
Assume that follow the approximate factor model (1.1). As a standard assumption, the coordinates of are weakly dependent. Thanks to this condition and the decomposition
(3.12) |
where α = μ⊤β and γ = B⊤β, we may treat wi as the new predictors. In other words, by lifting the number of variables from p to p + K, the coordinates of wi are now weakly dependent. The usual regularized estimation can then be applied to this new set of variables. Note that we regard the coefficients B⊤β as free parameters to facilitate the implementation (ignoring the relation γ = B⊤β), and an additional assumption is needed to make this valid (Fan et al., 2016a).
Suppose we wish to fit a model via a loss function . The above idea suggests the following two-step approach, which is called Factor-Adjusted Regularized (or Robust when so implemented) Model selection (FarmSelect) (Fan et al., 2016a).
Step 1: Factor estimation. Fit the approximate factor model (1.1) to get , and .
Step 2: Augmented regularization. Find α, β and γ to minimize
where pλ(·) is a folded concave penalty (Fan and Li, 2001) with parameter λ.
In Step 1, standard estimation procedures such as POET (Fan et al., 2013) and S-POET (Wang and Fan, 2017) can be applied, as long as they produce consistent estimators of B, and . Step 2 is carried out using the usual regularization methods with the new covariates.
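As an illustration, the following sketch implements the two steps with PCA of the covariates in Step 1 and an ℓ1 (Lasso) penalty in place of the folded concave penalty pλ(·) in Step 2; the penalty is applied to γ as well as β only for simplicity, and the function names are illustrative rather than those of the FarmSelect software.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def farm_select_sketch(X, y, K):
    """Two-step factor-adjusted model selection sketch.

    X: n x p covariates, y: length-n response, K: number of factors."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Step 1: estimate factors and loadings by PCA (SVD of the centered design)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    F_hat = np.sqrt(n) * U[:, :K]                 # n x K estimated factors
    B_hat = (Vt[:K].T * s[:K]) / np.sqrt(n)       # p x K estimated loadings
    U_hat = Xc - F_hat @ B_hat.T                  # n x p estimated idiosyncratic parts
    # Step 2: penalized regression on the augmented, weakly correlated design
    W = np.hstack([F_hat, U_hat])
    fit = LassoCV(cv=5).fit(W, y)
    gamma_hat, beta_hat = fit.coef_[:K], fit.coef_[K:]
    return beta_hat, gamma_hat, fit.intercept_
```

The support of beta_hat is the selected model; the coefficients gamma_hat attached to the estimated factors are nuisance parameters introduced by the lifting in (3.12).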
Figure 6, borrowed from Figure 3(a) in Fan et al. (2016a), shows that the proposed method outperforms other popular model selection methods, including the Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001) and the elastic net (Zou and Hastie, 2005), in the presence of correlated covariates. The basic setting is sparse linear regression y = x⊤β* + ε with p = 500 and n growing from 50 to 160. The true coefficients are β* = (β1, ··· , β10, 0p−10)⊤, where are drawn uniformly at random from [2, 5], and . The correlation structure of the covariates x is calibrated from S&P 500 monthly excess returns between 1980 and 2012.
Fig 6.
Model selection consistency rate, i.e., the proportion of simulations in which the selected model is identical to the true one, with p = 500 and n varying from 50 to 160. With a moderate sample size, the proposed method faithfully identifies the correct model while the other methods cannot.
Under the generalized linear model, L(y, z) = −yz + b(z), where b(·) is a convex function. Fan et al. (2016a) analyzed the theoretical properties of the above procedure. As long as the coordinates of wi (rather than xi) are not too strongly dependent and the factor model is estimated with enough precision, enjoys the optimal rates of convergence , where q = 1, 2 or ∞. When the minimum entry of |β*| is at least , model selection consistency is achieved.
When the square loss is used, this method reduces to the one in Kneip and Sarda (2011). By using the square loss and replacing the penalized multiple regression in Step 2 with marginal regression, we recover the factor-profiled variable screening method of Wang (2012). While these papers aim at modeling and then eliminating the dependence in xi via (1.1), Paul et al. (2008) used a factor model to characterize the joint distribution of and developed a related but different approach.
4. RELATED LEARNING PROBLEMS
4.1. Gaussian mixture model
PCA, or more generally spectral decomposition, can also be applied to learn mixture models for heterogeneous data. A thread of recent papers (Hsu and Kakade, 2013; Anandkumar et al., 2014; Yi et al., 2016; Sedghi et al., 2016) applies spectral decomposition to lower-order moments of the data to recover the parameters of interest in a wide class of latent variable models. Here we use the Gaussian mixture model to illustrate the idea. Consider a mixture of K Gaussian distributions with spherical covariances. Let wk ∈ (0, 1) be the probability of choosing component k ∈ {1, … , K}, be the component mean vectors, and be the component covariance matrices; the spherical structure is required by Hsu and Kakade (2013) and Anandkumar et al. (2014). Each data vector is drawn from this mixture of Gaussian distributions. The parameters of interest are .
Hsu and Kakade (2013) and Anandkumar et al. (2014) shed light on the close connection between the lower-order moments of the data and the parameters of interest, which motivates the use of the method of moments (MoM). Denote the population covariance by Σ. Below we present Theorem 1 in Hsu and Kakade (2013) to elucidate the moment structure of the problem.
Theorem 4.1. Suppose that are linearly independent. Then the average variance is the smallest eigenvalue of Σ. Let v be any eigenvector of Σ that is associated with the eigenvalue . Define the following quantities:
Then we have
(4.1) |
where the notation ⊗ represents the tensor product.
Theorem 4.1 gives the relationship between the first three moments of x and the parameters of interest. With replaced by their empirical versions, the remaining task is to solve for the parameters of interest via (4.1). Hsu and Kakade (2013) and Anandkumar et al. (2014) proposed a fast method, called the robust tensor power method, to compute the estimators. The crux is to construct an estimable third-order tensor that can be decomposed as a sum of orthogonal tensors based on μk. This orthogonal tensor decomposition can be regarded as an extension of the spectral decomposition to third-order tensors (simply speaking, three-dimensional arrays). A power iteration method is then applied to the estimate of to recover each μk, as well as the other parameters.
Specifically, consider first the following linear transformation of μk:
(4.2) |
for k ∈ [K], where . The key is a whitening transformation: choosing W so that W⊤M2W = IK (Theorem 4.2 below takes W = UD−1/2 from the spectral decomposition of M2) ensures that are orthogonal to each other. Denoting a⊗3 = a ⊗ a ⊗ a,
(4.3) |
is an orthogonal tensor decomposition; that is, it satisfies orthogonality of . The following theorem from Anandkumar et al. (2014) summarizes the above argument, and more importantly, it shows how to obtain μk back from .
Theorem 4.2. Suppose the vectors are linearly independent, and the scalars are strictly positive. Let M2 = UDU⊤ be the spectral decomposition of M2 and let W = UD−1/2. Then in (4.2) are orthogonal to each other. Furthermore, the Moore-Penrose pseudo-inverse of W is , and we have for k ∈ [K].
As promised, the orthogonal tensor can be estimated from empirical moments. We will make use of the following identity, which is similar to Theorem 4.1.
(4.4) |
where we used the cyclic sum notation
Note that is simply the jth row of W. To obtain an estimate of , we replace the expectation by the empirical average, and substitute W and M1 by their plug-in estimates. It is worth mentioning that, because has a smaller size than M3, computations involving can be implemented more efficiently.
Once we obtain an estimate of , which we denote by , the only task left in recovering , and is to compute the orthogonal tensor decomposition (4.3) for . The tensor power method in Anandkumar et al. (2014) is shown to solve this problem with provable computational guarantees. We omit the details of the algorithm here; interested readers are referred to Section 5 of Anandkumar et al. (2014) for its introduction and analysis.
To conclude this subsection, we summarize the entire procedure of estimating as below.
Step 1. Calculate the sample covariance matrix , its minimum eigenvalue and its associated eigenvector .
Step 2. Derive the estimators , , based on Theorem 4.1 by plug-in of empirical moments of x, and .
Step 3. Calculate the spectral decomposition . Let . Construct an estimator of , denoted by , based on (4.4) by plug-in of empirical moments of , and . Apply the robust tensor power method in Anandkumar et al. (2014) to and obtain and .
Step 4. Set and . Solve the linear equation for .
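A compact sketch of Steps 3 and 4 is given below. It assumes that an estimate M2_hat of the rank-K second-moment matrix M2 in Theorem 4.1 and an estimate M3_whitened of the K × K × K whitened tensor in (4.4) have already been formed from empirical moments as in Steps 1–2; a plain deflation-based power iteration stands in for the robust tensor power method of Anandkumar et al. (2014), and all names are illustrative.

```python
import numpy as np

def recover_gmm_params(M2_hat, M3_whitened, K, n_iter=100, seed=0):
    """Recover (w_k, mu_k) from the second moment and the whitened third moment.

    M2_hat: p x p estimate of the rank-K matrix M2;
    M3_whitened: K x K x K estimate of the whitened third-moment tensor."""
    # Whitening via the top-K spectral decomposition M2 = U D U^T, W = U D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(M2_hat)
    idx = np.argsort(eigvals)[::-1][:K]
    U, D = eigvecs[:, idx], np.maximum(eigvals[idx], 1e-12)
    B = U * np.sqrt(D)                      # pseudo-inverse of W^T, used to "unwhiten"
    T = M3_whitened.copy()
    weights, means = [], []
    rng = np.random.default_rng(seed)
    for _ in range(K):
        v = rng.standard_normal(K)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):             # power iteration: v <- T(I, v, v), normalized
            v = np.einsum('ijk,j,k->i', T, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)       # eigenvalue of the orthogonal tensor
        weights.append(1.0 / lam**2)                     # w_k = lambda_k^{-2} under whitening
        means.append(lam * (B @ v))                      # unwhiten to recover mu_k (cf. Theorem 4.2)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)   # deflate and repeat
    return np.array(weights), np.array(means)
```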
4.2. Community detection
In statistical modeling of networks, the stochastic block model (SBM), first proposed by Holland et al. (1983), has gained much attention in recent years (see Abbe, 2017 for a recent survey). Suppose our observation is a graph of n vertices, each of which belongs to one of K communities (or blocks). Let the vertices be indexed by [n], and the community that vertex i belongs to is indicated by an unknown θi ∈ [K]. In SBM, the probability of an edge between two vertices depends entirely on the membership of the communities. To be specific, let be a symmetric matrix where each entry takes value in [0, 1], and let be the adjacency matrix, i.e., Aij = 1 if there is an edge between vertex i and j, and Aij = 0 otherwise. Then, the SBM assumes
and {Aij}i>j are independent. Here, for ease of presentation, we allow self-connecting edges. Figure 7 gives one realization of the network with two communities.
Fig 7.
In both heatmaps, a dark pixel represents an entry with value 1 and a white pixel represents an entry with value 0. The left heatmap shows the (observed) adjacency matrix A of size n = 40 generated from the SBM with two equal-sized blocks (K = 2), with edge probabilities 5 log n/n (within blocks) and log n/(4n) (between blocks). The right heatmap shows the same matrix with its row and column indices suitably permuted based on the unobserved zi. Clearly, we observe an approximate rank-2 structure in the right heatmap. This motivates estimating zi via the second eigenvector.
Though seemingly different, this problem shares a close connection with PCA and spectral methods. Let zi = ek (the kth canonical basis vector of ) if θi = k, indicating the membership of the ith node, and define . The expectation of A has a low-rank decomposition and
(4.5) |
Loosely speaking, the matrix Z plays a similar role as factors or loading matrices (unnormalized), and is similar to the noise (idiosyncratic component). In the ideal situation, the adjacency matrix A and its expectation are close, and naturally we expect the eigenvectors of A to be useful for estimating θi. Indeed, this observation is the underpinning of many methods (Rohe et al., 2011; Gao et al., 2015; Abbe and Sandon, 2015). The vanilla spectral method for network/graph data is as follows:
Step 1. Construct the adjacency matrix A or other similarity-based matrices;
Step 2. Compute eigenvectors v1, … , vL corresponding to the largest eigenvalues, and form a matrix ;
Step 3. Run a clustering algorithm on the row vectors of V.
There are many variants and improvements of this vanilla spectral method. For example, in Step 1, the graph Laplacian D − A or the normalized Laplacian D−1/2(D − A)D−1/2 is often used in place of the adjacency matrix, where D = diag(d1, … , dn) and di = Σj Aij is the degree of vertex i. If real-valued similarities or distances between vertices are available, weighted graphs are usually constructed. In Step 2, there are many refinements over the raw eigenvectors in the construction of V, for example, projecting the row vectors of V onto the unit sphere (Ng et al., 2002) or calculating scores based on eigenvector ratios (Jin, 2015). In Step 3, a very popular clustering algorithm is the K-means algorithm.
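To make the vanilla method concrete, here is a small simulation in the spirit of Figure 7: an SBM with two balanced blocks is generated, and Steps 1–3 are run with the raw adjacency matrix and K-means. The edge probabilities mirror those in the figure caption and are for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def simulate_sbm(n, K, p_in, p_out, seed=0):
    """Community labels and a symmetric adjacency matrix from a simple SBM."""
    rng = np.random.default_rng(seed)
    theta = rng.integers(K, size=n)
    P = np.where(theta[:, None] == theta[None, :], p_in, p_out)
    upper = np.triu((rng.uniform(size=(n, n)) < P).astype(float), 1)
    return theta, upper + upper.T

def spectral_clustering(A, K):
    """Vanilla spectral method: top-K eigenvectors of A, then K-means on the rows."""
    eigvals, eigvecs = np.linalg.eigh(A)
    V = eigvecs[:, -K:]                      # eigenvectors of the K largest eigenvalues
    return KMeans(n_clusters=K, n_init=10).fit_predict(V)

n = 400
theta, A = simulate_sbm(n, K=2, p_in=5 * np.log(n) / n, p_out=np.log(n) / (4 * n))
labels = spectral_clustering(A, K=2)   # agrees with theta up to a label permutation
```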
We will look at the vanilla spectral algorithm in its simplest form. Our goal is exact recovery, which means finding an estimator of such that as n → ∞,
Note that we can only determine θ up to a permutation since the distribution of our observation is invariant to permutations of [K]. There are nice theoretical results, including information limits for exact recovery in Abbe et al. (2016).
Despite their simplicity, spectral methods can be quite sharp for exact recovery in the SBM: they succeed in a regime that matches the information limit. The next theorem from Abbe et al. (2017) makes this point clear. Consider the SBM with two balanced blocks, i.e., K = 2 and #{i : θi = 1} = #{i : θi = 2} = n/2, and suppose W11 = W22 = a log n/n and W12 = b log n/n, where a > b > 0. In this case, one can easily see that the second eigenvector of is given by , whose ith entry is if θi = 1 and −1 otherwise. In other words, classifies the two communities, where sgn(·) is the sign function applied to each entry of a vector. This is shown in Figure 2 for the case #{i : θi = 1} = 2500 (red curve, left panel), where the second eigenvector v2 of A is also depicted (blue curve). The entrywise closeness between these two quantities is guaranteed by perturbation theory under the ℓ∞-norm (Abbe et al., 2017).
Theorem 4.3. Let v2 be the normalized second eigenvector of A. If , then no estimator achieves exact recovery; if , then both the maximum likelihood estimator and the eigenvector estimator sgn(v2) achieve exact recovery.
The proof of this result is based on entry-wise analysis of eigenvectors in a spirit similar to Theorem 2.4, together with a probability tail bound for differences of binomial variables.
4.3. Matrix completion
In recommendation systems, an important problem is to estimate users’ preferences based on historical data. Usually, the available data per user are very limited compared with the total number of items (each user watches only a small number of movies and buys only a small fraction of books). Matrix completion is one formulation of this problem.
The goal of (noisy) matrix completion is to estimate a low-rank matrix from noisy observations of some of its entries (n1 users and n2 items). Suppose we know rank(M*) = K. For each i ∈ [n1] and j ∈ [n2], let Iij be i.i.d. Bernoulli variables with that indicate whether we observe the entry , i.e., Iij = 1 if and only if it is observed. Also suppose that our observation is if Iij = 1, where the εij are i.i.d. and jointly independent of the Iij.
One natural way to estimate M* is to solve
where is the sampling operator defined by . The minimizer of this problem is essentially the MLE for M*. Due to the nonconvex constraint rank(X) = K, it is desirable to relax this optimization into a convex program. A popular way to achieve this is to replace the rank constraint with a penalty term λ‖X‖* added to the quadratic objective function, where λ is a tuning parameter and is the nuclear norm (that is, the ℓ1 norm of the vector of singular values), which encourages a solution of low rank (a small number of nonzero singular values). A rather surprising conclusion of Candès and Recht (2009) is that, in the noiseless setting, solving the relaxed problem yields the same solution as the nonconvex problem with high probability.
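One simple way to solve this penalized problem is proximal gradient descent, whose proximal step is soft-thresholding of the singular values; the sketch below is such a generic solver (not the algorithm analyzed by Candès and Recht (2009)), with the step size fixed to one since the smooth term has unit Lipschitz constant, and with illustrative argument names.

```python
import numpy as np

def svd_soft_threshold(X, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_completion(Y, mask, lam, n_iter=300):
    """Minimize 0.5 * ||mask * (X - Y)||_F^2 + lam * ||X||_* by proximal gradient.
    Y: observed matrix (arbitrary values where mask == 0); mask: 0/1 pattern."""
    X = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        X = svd_soft_threshold(X - mask * (X - Y), lam)   # gradient step, then prox
    return X
```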
We can view this problem from the perspective of factor models. The assumption that M* has low rank can be justified by interpreting each Mij as a linear combination of a few latent factors. Indeed, if Mij is the preference score of user i for item j, then it is reasonable to posit , where represents the features item j possesses and represents the tendency of user i towards these features. In this regard, M* = BF⊤ can be viewed as the part explained by the factors in the factor model.
This discussion motivates us to write our observation as
since . This decomposition gives the familiar “low-rank plus noise” structure. It is natural to conduct PCA on to extract the low-rank part.
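A sketch of this PCA-type estimator: fill in the unobserved entries with zeros, rescale by the (estimated) sampling probability so that each rescaled entry is unbiased for the corresponding entry of M*, and take the best rank-K approximation. The variable names are illustrative.

```python
import numpy as np

def spectral_completion(Y, mask, K):
    """Best rank-K approximation of the inverse-probability-weighted observations.

    Y: observed matrix (arbitrary values where mask == 0); mask: 0/1 pattern."""
    p_hat = mask.mean()                    # estimated sampling probability
    Y_scaled = (mask * Y) / p_hat          # entrywise unbiased for M* when the noise has mean zero
    U, s, Vt = np.linalg.svd(Y_scaled, full_matrices=False)
    return (U[:, :K] * s[:K]) @ Vt[:K]     # truncated SVD = best rank-K approximation
```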
Let the best rank-K approximation of be given by , where are the largest K singular values in descending order, and columns of , correspond to their normalized left and right singular vectors, respectively. Similarly, we have singular value decomposition . The following result from Abbe et al. (2017) provides entry-wise bounds for our estimates. For a matrix, denote by the largest absolute value of all entries, and the largest ℓ2 norm of all row vectors.
Theorem 4.4. Let n = n1 + n2 and . There exist constants C, C′ > 0 and an orthogonal matrix such that the following holds: if p ≥ 6 log n/n and , then with probability at least 1 − C/n,
We can simplify the bounds with a few additional assumptions. If , then η is of order assuming a bounded coherence number. In addition, if κ is also bounded, then
We remark that the requirement on the sampling ratio is the weakest condition necessary for matrix completion, as it ensures that each row and each column is sampled with high probability. Also, the entry-wise bound above recovers the Frobenius-norm bound (Keshavan et al., 2010) up to a logarithmic factor. It is more precise than the Frobenius-norm bound, because the latter only controls the average error.
4.4. Synchronization problems
Synchronization problems are a class of problems in which one estimates signals from their pairwise comparisons. Consider the phase synchronization problem as an example: estimating n angles θ1, … , θn from noisy measurements of their differences. We can express an angle θℓ equivalently as a unit-modulus complex number zℓ = exp(iθℓ), so the task is to estimate a complex vector . Suppose our measurements have the form , where denotes the conjugate of zℓ, and for all ℓ > k, the wℓk are i.i.d. complex Gaussian variables (namely, the real and imaginary parts of wℓk are and independent). Then the phase of Cℓk (namely arg(Cℓk)) encodes the noisy difference θk − θℓ.
More generally, the goal of a synchronization problem is to estimate n signals from their pairwise measurements, where each signal is an element from a group, e.g., the group of rotations in three dimensions. Synchronization problems are motivated from imaging problems such as cryo-EM (Shkolnisky and Singer, 2012), camera calibration (Tron and Vidal, 2009), etc.
Synchronization problems also admit the “low-rank plus noise” structure. Consider the phase synchronization problem again. If we let wkℓ be the complex conjugate of wℓk (ℓ > k) and wℓℓ = 0, and write , then our measurement matrix has the structure
where * denotes the conjugate transpose. This decomposition has a similar form to (4.5) in community detection. Note that zz* is a complex matrix with a single nonzero eigenvalue n, and is of order with high probability (a basic result in random matrix theory). Therefore, we expect that no estimator can do well if . Indeed, the information-theoretic limit is established in Lelarge and Miolane (2016). Our next result, from Zhong and Boumal (2018), gives estimation guarantees when the reverse inequality holds (up to a log factor).
Theorem 4.5. Let be the leading eigenvector of C such that and . If , then with probability 1 − O(n−2), the relative errors satisfy
Moreover, the above two inequalities also hold for the maximum likelihood estimator.
Note that the eigenvector of a complex matrix is not unique: for any α, the vector eiαv is also an eigenvector, so we fix the global phase eiα by restricting . Note also that the maximum likelihood estimator is different from v, because the MLE must satisfy the entry-wise constraint |zℓ| = 1 for every ℓ ∈ [n]. This result implies consistency of v in terms of both the ℓ2 norm and the ℓ∞ norm if , and thus provides good evidence that spectral methods (or PCA) are simple, generic, yet powerful.
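The spectral estimator in Theorem 4.5 takes only a few lines to compute. In the sketch below, the noise level, the dimension, and the final entrywise normalization (which projects v onto the unit-modulus constraint) are illustrative choices rather than part of the theorem.

```python
import numpy as np

def phase_synchronization_spectral(C):
    """Spectral estimator for phase synchronization: leading eigenvector of the
    Hermitian measurement matrix C, projected entrywise onto the unit circle."""
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh handles complex Hermitian matrices
    v = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    return v / np.abs(v)                      # fix moduli; the global phase remains arbitrary

# toy simulation: z_l = exp(i theta_l), C = z z^* + sigma * W with Hermitian Gaussian noise
rng = np.random.default_rng(0)
n, sigma = 300, 1.0
z = np.exp(1j * rng.uniform(0, 2 * np.pi, n))
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
W = (G + G.conj().T) / np.sqrt(2)
np.fill_diagonal(W, 0)
C = np.outer(z, z.conj()) + sigma * W
z_hat = phase_synchronization_spectral(C)
# align the arbitrary global phase before measuring the error
phase = np.vdot(z_hat, z) / np.abs(np.vdot(z_hat, z))
err = np.linalg.norm(z_hat * phase - z) / np.sqrt(n)
```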
Acknowledgments
The authors gratefully acknowledge the support from the NSF grants DMS-1712591 and DMS-1662139 and NIH grant R01-GM072611.
APPENDIX A: PROOFS
Proof of Corollary 2.1. Notice that the result is trivial if , since and always hold. If , then by Weyl’s inequality,
Thus, we can set in Theorem 2.3 and derive
This proves the spectral norm case.
Proof of Theorem 2.4. Step 1: First, we derive a few elementary inequalities: for any m ∈ [n],
(A.1) |
To prove these inequalities, recall the (equivalent) definition of spectral norm for symmetric matrices:
where Sn−1 is the unit sphere in , and x = (x1, … , xn)⊤, y = (y1, … , yn)⊤. The first and second inequalities follow from
The third inequality follows from the first one and the triangle inequality.
Step 2: Next, by the definition of eigenvectors,
(A.2) |
We first control the entries of the first term on the right-hand side. Using the decomposition (2.8), we have
(A.3) |
Using the triangle inequality, we have
By Weyl’s inequality, , and thus . Also, by Corollary 2.1 (simplified Davis-Kahan’s theorem) and its following remark, . Therefore, under the condition δℓ ≥ 2‖W‖2,
Using Corollary 2.1 again, we obtain
where the first inequality is due to and the condition , and the second inequality is due to the fact that is a subset of an orthonormal basis. Now we use the Cauchy-Schwarz inequality to bound the second term on the right-hand side of (A.3) and get
(A.4) |
Step 3: To bound the entries of the second term in (A.2), we use the leave-one-out idea as follows.
(A.5) |
We can bound the second term using the Cauchy-Schwarz inequality: . The crucial observation is that, if we view as the perturbed version of , then by Theorem 2.3 (Davis-Kahan’s theorem) and Weyl’s inequality, for any ℓ ∈ [K],
Here, is the eigen-gap of A + W(m), and it satisfies since for all i ∈ [n], by Weyl’s inequality. By (A.1), we have . Thus, under the condition δℓ ≥ 5‖W‖2, we have
Note that the mth entry of the vector is exactly , and other entries are where . Thus,
where we used . The above inequality, together with , leads to a bound on in (A.5).
(A.6) |
where we used . We claim that . Once this is proved, combining it with (A.4) and (A.6) yields the desired bound on the entries of in (A.2):
where, in the first inequality, we used , and in the second inequality, we used and the claim.
Step 4: Finally, we prove our claim that . By definition, . Note that the mth row of is 0, since W(m) has only zeros in its mth row. Thus,
With an argument similar to the one that leads to (A.4), we can bound the first term on the right-hand side.
Clearly, is also upper bounded by the right-hand side above. This proves our claim and concludes the proof.
Proof of Corollary 2.2. Let us construct symmetric matrices A, W, of size n + p via a standard dilation technique (Paulsen, 2002). Define
It can be checked that rank(A) = 2K, and importantly,
(A.7) |
Step 1: Check the conditions of Theorem 2.4. The nonzero eigenvalues of A are ±σk, (k ∈ [K]), and the corresponding eigenvectors are . It is clear that the eigenvalue condition in Theorem 2.4 is satisfied, and the eigen-gap δℓ of A is exactly γℓ. Since the identity (A.7) holds for any matrix constructed from dilation, by applying it to W we get .
Step 2: Apply the conclusion of Theorem 2.4. Similarly as before, we write W(m) as the matrix obtained by setting mth row and mth column of W to zero, where m ∈ [n + p]. We also denote . Using a similar argument as Step 1, we find
the eigenvectors of are ,
the eigenvectors of are , and
the eigenvectors of are ,
where * means some appropriate vectors we do not need in the proof (we do not bother introducing notations for them). We also observe that
Note that the inner product between wm and the eigenvector of is if m = i ∈ [n], or if m = n + j, j ∈ [p]. Therefore, applying Theorem 2.4 to the first n entries of
we obtain the first inequality of Corollary 2.2, and applying Theorem 2.4 to the last p entries leads to the second inequality.
Proof of Lemma 3.1.
Footnotes
There is a weaker assumption, under which (1.1) is usually called the weak factor model; see Onatski (2012).
Here, we mean it has a bounded second moment when estimating the mean and a bounded fourth moment when estimating the variance.
A norm is orthogonal-invariant if for any matrix B and any orthogonal matrices U and V.
Here, we prefer using uk to refer to the singular vectors (not to be confused with the noise term in factor models). The same applies to Section 4.
Contributor Information
Jianqing Fan, Department of Operations Research and Financial Engineering, Princeton University, Princeton, 08540, NJ, USA.
Kaizheng Wang, Department of Industrial Engineering and Operations Research, Columbia University, New York, 10027, NY, USA.
Yiqiao Zhong, Packard 239, Stanford University, Stanford, 94305, CA, USA.
Ziwei Zhu, Department of Statistics, University of Michigan, Ann Arbor, 48109, MI, USA.
REFERENCES
- Abbe E (2017). Community detection and stochastic block models: recent developments. arXiv preprint arXiv:1703.10146
- Abbe E, Bandeira AS and Hall G (2016). Exact recovery in the stochastic block model. IEEE Transactions on Information Theory 62 471–487.
- Abbe E, Fan J, Wang K and Zhong Y (2017). Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565
- Abbe E and Sandon C (2015). Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS). IEEE.
- Ahn SC and Horenstein AR (2013). Eigenvalue ratio test for the number of factors. Econometrica 81 1203–1227.
- Anandkumar A, Ge R, Hsu D, Kakade SM and Telgarsky M (2014). Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research 15 2773–2832.
- Anderson TW and Amemiya Y (1988). The asymptotic normal distribution of estimators in factor analysis under general conditions. The Annals of Statistics 16 759–771.
- Bai J and Li K (2012). Statistical analysis of factor models of high dimension. The Annals of Statistics 40 436–465.
- Bai J and Ng S (2002). Determining the number of factors in approximate factor models. Econometrica 70 191–221.
- Baik J, Ben Arous G and Péché S (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability 1643–1697.
- Bartlett MS (1938). Methods of estimating mental factors. Nature 141 609–610.
- Bartlett MS (1950). Tests of significance in factor analysis. British Journal of Mathematical and Statistical Psychology 3 77–85.
- Bean D, Bickel PJ, El Karoui N and Yu B (2013). Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences 110 14563–14568.
- Benaych-Georges F and Nadakuditi RR (2011). The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics 227 494–521.
- Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 289–300.
- Bickel PJ and Levina E (2008). Covariance regularization by thresholding. The Annals of Statistics 36 2577–2604.
- Bickel PJ, Ritov Y and Tsybakov AB (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 37 1705–1732.
- Cai T and Liu W (2011). Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association 106 672–684.
- Candes E and Tao T (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics 35 2313–2351.
- Candès EJ, Li X, Ma Y and Wright J (2011). Robust principal component analysis? Journal of the ACM (JACM) 58 11.
- Candès EJ and Recht B (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9 717.
- Cape J, Tang M and Priebe CE (2017). The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. arXiv preprint arXiv:1705.10735
- Catoni O (2012). Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 48. Institut Henri Poincaré.
- Cattell RB (1966). The scree test for the number of factors. Multivariate Behavioral Research 1 245–276.
- Chamberlain G and Rothschild M (1982). Arbitrage, factor structure, and mean-variance analysis on large asset markets.
- Cohen MB, Nelson J and Woodruff DP (2015). Optimal approximate matrix product in terms of stable rank. arXiv preprint arXiv:1507.02268
- Davis C and Kahan WM (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis 7 1–46.
- Desai KH and Storey JD (2012). Cross-dimensional inference of dependent high-dimensional data. Journal of the American Statistical Association 107 135–151.
- Dobriban E (2017). Factor selection by permutation. arXiv preprint arXiv:1710.00479
- Donoho DL, Gavish M and Johnstone IM (2013). Optimal shrinkage of eigenvalues in the spiked covariance model. arXiv preprint arXiv:1311.0851
- Efron B (2007). Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association 102 93–103.
- Efron B (2010). Correlated z-values and the accuracy of large-scale statistical estimates. Journal of the American Statistical Association 105 1042–1055.
- Eldridge J, Belkin M and Wang Y (2017). Unperturbed: spectral analysis beyond Davis-Kahan. arXiv preprint arXiv:1706.06516
- Fama EF and French KR (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics 33 3–56.
- Fan J, Fan Y and Lv J (2008). High dimensional covariance matrix estimation using a factor model. Journal of Econometrics 147 186–197.
- Fan J and Han X (2017). Estimation of the false discovery proportion with unknown dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 1143–1164.
- Fan J, Han X and Gu W (2012). Estimating false discovery proportion under arbitrary covariance dependence. Journal of the American Statistical Association 107 1019–1035.
- Fan J, Ke Y, Sun Q and Zhou W-X (2017a). Farm-test: Factor-adjusted robust multiple testing with false discovery control. arXiv preprint arXiv:1711.05386
- Fan J, Ke Y and Wang K (2016a). Decorrelation of covariates for high dimensional sparse regression. arXiv preprint arXiv:1612.08490
- Fan J, Li Q and Wang Y (2017b). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 247–265.
- Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 1348–1360.
- Fan J, Liao Y and Mincheva M (2011). High-dimensional covariance matrix estimation in approximate factor models. The Annals of Statistics 39 3320–3356.
- Fan J, Liao Y and Mincheva M (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 603–680.
- Fan J, Liu H and Wang W (2018a). Large covariance estimation through elliptical factor models. Annals of Statistics 46 1383–1414.
- Fan J, Wang W and Zhong Y (2018b). An ℓ∞ eigenvector perturbation bound and its application. Journal of Machine Learning Research 18 1–42.
- Fan J, Wang W and Zhu Z (2016b). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. arXiv preprint arXiv:1603.08315
- Friguet C, Kloareg M and Causeur D (2009). A factor model approach to multiple testing under dependence. Journal of the American Statistical Association 104 1406–1415.
- Gao C, Ma Z, Zhang AY and Zhou HH (2015). Achieving optimal misclassification proportion in stochastic block model. arXiv preprint arXiv:1505.03772
- Hirzel AH, Hausser J, Chessel D and Perrin N (2002). Ecological-niche factor analysis: how to compute habitat-suitability maps without absence data? Ecology 83 2027–2036.
- Hochreiter S, Clevert D-A and Obermayer K (2006). A new summarization method for Affymetrix probe level data. Bioinformatics 22 943–949.
- Holland PW, Laskey KB and Leinhardt S (1983). Stochastic blockmodels: First steps. Social Networks 5 109–137.
- Horn JL (1965). A rationale and test for the number of factors in factor analysis. Psychometrika 30 179–185.
- Hotelling H (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24 417.
- Hsu D and Kakade SM (2013). Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science. ACM.
- Huber PJ (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics 73–101.
- Jin J (2015). Fast community detection by SCORE. The Annals of Statistics 43 57–89.
- Johnstone IM and Lu AY (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association 104 682–693.
- Jolliffe IT (1986). Principal component analysis and factor analysis. In Principal Component Analysis. Springer, 115–128.
- Kendall MG (1965). A Course in Multivariate Analysis.
- Keshavan RH, Montanari A and Oh S (2010). Matrix completion from noisy entries. Journal of Machine Learning Research 11 2057–2078.
- Kneip A and Sarda P (2011). Factor models and variable selection in high-dimensional regression analysis. The Annals of Statistics 39 2410–2447.
- Koltchinskii V and Lounici K (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli 23 110–133.
- Koltchinskii V and Xia D (2016). Perturbation of linear forms of singular vectors under Gaussian noise. In High Dimensional Probability VII. Springer, 397–423.
- Lam C and Yao Q (2012). Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics 40 694–726.
- Lawley D and Maxwell A (1962). Factor analysis as a statistical method. Journal of the Royal Statistical Society, Series D (The Statistician) 12 209–229.
- Leek JT and Storey JD (2008). A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences 105 18718–18723.
- Lelarge M and Miolane L (2016). Fundamental limits of symmetric low-rank matrix estimation. arXiv preprint arXiv:1611.03888
- Li Q, Cheng G, Fan J and Wang Y (2017). Embracing the blessing of dimensionality in factor models. Journal of the American Statistical Association 1–10.
- McCrae RR and John OP (1992). An introduction to the five-factor model and its applications. Journal of Personality 60 175–215.
- Minsker S (2016). Sub-gaussian estimators of the mean of a random matrix with heavy-tailed entries. arXiv preprint arXiv:1605.07129
- Mor-Yosef L and Avron H (2018). Sketching for principal component regression. arXiv preprint arXiv:1803.02661
- Ng AY, Jordan MI and Weiss Y (2002). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems.
- Onatski A (2010). Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics 92 1004–1016.
- Onatski A (2012). Asymptotics of the principal components estimator of large factor models with weakly influential factors. Journal of Econometrics 168 244–258.
- O’Rourke S, Vu V and Wang K (2016). Eigenvectors of random matrices: a survey. Journal of Combinatorial Theory, Series A 144 361–442.
- O’Rourke S, Vu V and Wang K (2017). Random perturbation of low rank matrices: Improving classical bounds. Linear Algebra and its Applications.
- Paul D (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica 1617–1642.
- Paul D, Bair E, Hastie T and Tibshirani R (2008). “Preconditioning” for feature selection and regression in high-dimensional problems. The Annals of Statistics 1595–1618.
- Paulsen V (2002). Completely Bounded Maps and Operator Algebras, vol. 78. Cambridge University Press.
- Pearson K (1901). Principal components analysis. The London, Edinburgh and Dublin Philosophical Magazine and Journal 6 566.
- Rohe K, Chatterjee S and Yu B (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39 1878–1915.
- Sedghi H, Janzamin M and Anandkumar A (2016). Provable tensor methods for learning mixtures of generalized linear models. In Artificial Intelligence and Statistics.
- Shkolnisky Y and Singer A (2012). Viewing direction estimation in cryo-EM using synchronization. SIAM Journal on Imaging Sciences 5 1088–1110.
- Spearman C (1927). The Abilities of Man.
- Srivastava N and Vershynin R (2013). Covariance estimation for distributions with 2 + ε moments. The Annals of Probability 41 3081–3111.
- Stewart G and Sun J (1990). Matrix Perturbation Theory.
- Stock JH and Watson MW (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97 1167–1179.
- Storey JD (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 479–498.
- Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 267–288.
- Tron R and Vidal R (2009). Distributed image-based 3-D localization of camera sensor networks. In Proceedings of the 48th IEEE Conference on Decision and Control, held jointly with the 28th Chinese Control Conference (CDC/CCC 2009). IEEE.
- Tropp JA (2012). User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics 12 389–434.
- Vershynin R (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027
- Vershynin R (2012). How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability 25 655–686.
- Wang H (2012). Factor profiled sure independence screening. Biometrika 99 15–28.
- Wang J, Zhao Q, Hastie T and Owen AB (2017). Confounder adjustment in multiple hypothesis testing. The Annals of Statistics 45 1863–1894.
- Wang W and Fan J (2017). Asymptotics of empirical eigenstructure for high dimensional spiked covariance. The Annals of Statistics 45 1342–1374.
- Wedin P-A (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics 12 99–111.
- Woodruff DP (2014). Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science 10 1–157.
- Yang J, Meng X and Mahoney MW (2016). Implementing randomized matrix algorithms in parallel and distributed environments. Proceedings of the IEEE 104 58–92.
- Yi X, Caramanis C and Sanghavi S (2016). Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749
- Yu Y, Wang T and Samworth RJ (2014). A useful variant of the Davis–Kahan theorem for statisticians. Biometrika 102 315–323.
- Zhao P and Yu B (2006). On model selection consistency of Lasso. Journal of Machine Learning Research 7 2541–2563.
- Zhong Y (2017). Eigenvector under random perturbation: A nonasymptotic Rayleigh–Schrödinger theory. arXiv preprint arXiv:1702.00139
- Zhong Y and Boumal N (2018). Near-optimal bounds for phase synchronization. SIAM Journal on Optimization 28 989–1016.
- Zou H and Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 301–320.