Robust high dimensional factor models with applications to statistical machine learning

Jianqing Fan; Kaizheng Wang; Yiqiao Zhong; Ziwei Zhu

doi:10.1214/20-sts785

. Author manuscript; available in PMC: 2021 Jul 27.

Published in final edited form as: Stat Sci. 2021 Apr 19;36(2):303–327. doi: 10.1214/20-sts785

Robust high dimensional factor models with applications to statistical machine learning

Jianqing Fan ¹, Kaizheng Wang ², Yiqiao Zhong ³, Ziwei Zhu ⁴

PMCID: PMC8315369 NIHMSID: NIHMS1639567 PMID: 34321713

Abstract

Factor models are a class of powerful statistical models that have been widely used to deal with dependent measurements that arise frequently from various applications from genomics and neuroscience to economics and finance. As data are collected at an ever-growing scale, statistical machine learning faces some new challenges: high dimensionality, strong dependence among observed variables, heavy-tailed variables and heterogeneity. High-dimensional robust factor analysis serves as a powerful toolkit to conquer these challenges.

This paper gives a selective overview on recent advance on high-dimensional factor models and their applications to statistics including Factor-Adjusted Robust Model selection (FarmSelect) and Factor-Adjusted Robust Multiple testing (FarmTest). We show that classical methods, especially principal component analysis (PCA), can be tailored to many new problems and provide powerful tools for statistical estimation and inference. We highlight PCA and its connections to matrix perturbation theory, robust statistics, random projection, false discovery rate, etc., and illustrate through several applications how insights from these fields yield solutions to modern challenges. We also present far-reaching connections between factor models and popular statistical learning problems, including network analysis and low-rank matrix recovery.

Keywords: Factor model, PCA, covariance estimation, perturbation bounds, robustness, random sketch, FarmSelect, FarmTest

1. INTRODUCTION

In modern data analytics, dependence across high-dimensional outcomes or measurements is ubiquitous. For example, stocks within the same industry exhibit significantly correlated returns, housing prices of a country depend on various economic factors, gene expressions can be stimulated by cytokines. Ignoring such dependence structure can produce significant systematic bias and yields inefficient statistical results and misleading insights. The problems are more severe for high-dimensional big data, where dependence, non-Gaussianity and heterogeneity of measurements are common.

Factor models aim to capture such dependence by assuming several variates or “factors”, usually much fewer than the outcomes, that drive the dependence of the entire outcomes (Lawley and Maxwell, 1962; Stock and Watson, 2002). Stemming from the early works on measuring human abilities (Spearman, 1927), factor models have become one of the most popular and powerful tools in multivariate analysis and have made profound impact in the past century on psychology (Bartlett, 1938; McCrae and John, 1992), economics and finance (Chamberlain and Rothschild, 1982; Fama and French, 1993; Stock and Watson, 2002; Bai and Ng, 2002), biology (Hirzel et al., 2002; Hochreiter et al., 2006; Leek and Storey, 2008), etc. Suppose x₁, …, x_n are n i.i.d. p-dimensional random vectors, which may represent financial returns, housing prices, gene expressions, etc. The generic factor model assumes that

x_{i} = μ + B f_{i} + u_{i}, or in matrix form, X = μ 1_{n}^{⊤} + B F^{⊤} + U,

(1.1)

where $X = (x_{1}, \dots, x_{n}) \in ℝ^{p \times n}$ , $μ = {(μ_{1}, \dots, μ_{p})}^{⊤}$ is the mean vector, $B = {(b_{1}, \dots, b_{p})}^{⊤} \in ℝ^{p \times K}$ is the matrix of factor loadings, $F = {(f_{1}, \dots, f_{n})}^{⊤} \in ℝ^{n \times K}$ stores K-dimensional vectors of common factors with $E f_{i} = 0$ , and $U = (u_{1}, \dots, u_{n}) \in ℝ^{p \times n}$ represents the error terms (a.k.a. idiosyncratic components), which has mean zero and is uncorrelated with or independent of F. We emphasize that, for most of our discussions in the paper (except Section 3.1), only ${x_{i}}_{i = 1}^{n}$ are observable, and the goal is to infer B and ${f_{i}}_{i = 1}^{n}$ through ${x_{i}}_{i = 1}^{n}$ . Here we use the name “factor model” to refer to a general concept where the idiosyncratic components u_i are allowed to be weakly correlated. This is also known as the “approximate factor model” in the literature, in contrast to the “strict factor model” where the idiosyncratic components are assumed to be uncorrelated.

Note that the model (1.1) has identifiability issues: given any invertible matrix $R \in ℝ^{K \times K}$ , simultaneously replacing B with BR and f_i with R⁻¹f_i does not change the observation x_i. To resolve this ambiguity issue, the following identifiability assumption is usually imposed:

Assumption 1.1 (Identifiability). B^⊤B is diagonal and cov(f_i) = I_p.

Other identifiability assumptions as well as detailed discussions can be found in Bai and Li (2012) and Fan et al. (2013).

Factor analysis is closely related to principal component analysis (PCA), which breaks down the covariance matrix into a set of orthogonal components and identifies the subspace that explains the most variation of the data (Pearson, 1901; Hotelling, 1933). In this selective review, we will mainly leverage PCA, or more generally, spectral methods, to estimate the factors ${f_{i}}_{i = 1}^{n}$ and the loading matrix B in (1.1). Other popular estimators, mostly based on the maximum likelihood principle, can be found in Lawley and Maxwell (1962); Anderson and Amemiya (1988); Bai and Li (2012), etc. The covariance matrix of x_i consists of two components: cov(Bf_i) and cov(u_i). Intuitively, when the contribution of the covariance from the error term u_i is negligible compared with those from the factor term Bf_i, the top-K eigenspace (namely, the space spanned by top K eigenvectors) of the sample covariance of ${x_{i}}_{i = 1}^{n}$ should be well aligned with the column space of B. This can be seen from the assumption that $cov (x_{i}) = B B^{⊤} + cov (u_{i}) \approx B B^{⊤}$ , which occurs frequently in high-dimensional statistics (Fan et al., 2013).

Here is our main message: applying PCA to well-crafted covariance matrices (including vanilla sample covariance matrices and their robust version) consistently estimates the factors and loadings, as long as the signal-to-noise ratio is large enough. The core theoretical challenge is to characterize how idiosyncratic covariance cov(u_i) perturb the eigenstructure of the factor covariance BB^⊤. In addition, the situation is more complicated with the presence of heavy-tailed data, missing data, computational constraints, heterogeneity, etc.

The rest of the paper is devoted to solutions to these challenges and a wide range of applications to statistical machine learning problems. In Section 2, we will elucidate the relationship between factor models and PCA and present several useful deterministic perturbation bounds for eigenspaces. We will also discuss robust covariance inputs for the PCA procedure to guard against corruption from heavy-tailed data. Exploiting the factor structure of the data helps solve many statistical and machine learning problems. In Section 3, we will see how the factor models and PCA can be applied to high-dimensional covariance estimation, regression, multiple testing and model selection. In Section 4, we demonstrate the connection between PCA and a wide range of machine learning problems including Gaussian mixture models, community detection, matrix completion, etc. We will develop useful tools and establish strong theoretical guarantees for our proposed methods.

Here we collect all the notations for future convenience. We use [m] to refer to {1, 2, …, m}. We adopt the convention of using regular letters for scalars and using bold-face letters for vectors or matrices. For $x = {(x_{1}, \dots, x_{p})}^{⊤} \in ℝ^{p}$ , and 1 ≤ q < ∞, we define ${‖ x ‖}_{q} = {(\sum_{j = 1}^{p} {| x_{j} |}^{q})}^{1 / q}$ , ${‖ x ‖}_{0} = | supp (x) |$ , where $supp (x) = {j : x_{j} \neq 0}$ , and ${‖ x ‖}_{\infty} = {max}_{1 \leq j \leq p} | x_{j} |$ . For a matrix M, we use ${‖ M ‖}_{2}$ , ${‖ M ‖}_{F}$ , ${‖ M ‖}_{max}$ and ${‖ M ‖}_{1}$ to denote its operator norm (spectral norm), Frobenius norm, entry-wise (element-wise) max-norm, and vector ℓ₁ norm, respectively. To be more specific, the last two norms are defined by ${‖ M ‖}_{max} = {max}_{j, k} | M_{j k} |$ and ${‖ M ‖}_{1} = \sum_{j, k} | M_{j k} |$ . Let I_p denote the p × p identity matrix, 1_p denote the p-dimensional all-one vector, and $1_{A}$ denote the indicator of event A, i.e., $1_{A} = 1$ if A happens, and 0 otherwise. We use $N (μ, Σ)$ to refer to the normal distribution with mean vector μ and covariance matrix Σ. For two nonnegative numbers a and b that possibly depend on n and p, we use the notation a = O(b) and $a ≲ b$ to mean a ≤ C₁b for some constant C₁ > 0, and the notation a = Ω(b) and $a ≳ b$ to mean a ≥ C₂b for some constant C₂ > 0. We write $a ≍ b$ if both a = O(b) and a = Ω(b) hold. For a sequence of random variables ${X_{n}}_{n = 1}^{\infty}$ and a sequence of nonnegative deterministic numbers ${a_{n}}_{n = 1}^{\infty}$ , we write $X_{n} = O_{ℙ} (a_{n})$ if for any ε > 0, there exists C > 0 and N > 0 such that $ℙ (| X_{n} | \geq C a_{n}) \leq ε$ holds for all n > N; and we write $X_{n} = o_{ℙ} (a_{n})$ if for any ε > 0 and C > 0, there exists N > 0 such that $ℙ (| X_{n} | \geq C a_{n}) \leq ε$ holds for all n > N. We omit the subscripts when it does not cause confusion.

2. FACTOR MODELS AND PCA

2.1. Relationship between PCA and factor models in high dimensions

Under model (1.1) with the identifiability condition, Σ = cov(x_i) is given by

Σ = B B^{⊤} + Σ_{u}, Σ_{u} = {(σ_{u, j k})}_{1 \leq j, k \leq p} = cov (u_{i}) .

(2.1)

Intuitively, if the magnitude of BB^⊤ dominates Σ_u, the top-K eigenspace of Σ should be approximately aligned with the column space of B. Naturally we expect a large gap between the eigenvalues of BB^⊤ and Σ_u to be important for estimating the column space of B through PCA (see Figure 1). On the other hand, if this gap is small compared with the eigenvalues of Σ_u, it is known that PCA leads to inconsistent estimation (Johnstone and Lu, 2009). The above discussion motivates a simple vanilla PCA-based method for estimating B and F as follows (assuming the Identifiability Assumption).

Fig 1. — *The* **left panel** *is the histogram of the eigenvalue distribution from a synthetic dataset. Fix n* = 1000, p = 400 *and K* = 2 *and let all the entries of* B *be i.i.d. Gaussian $N (0, 1 / 4)$ . Each entry of* F *and* U *is generated from i.i.d. $N (0, 1)$ and i.i.d. $N (0, 5^{2})$ respectively. The data matrix* X *is formed according to the factor model* (1.1). *The* **right diagram** *illustrates the Pervasiveness Assumption.*

Step 1. Obtain an estimator $\hat{μ}$ and $\hat{Σ}$ of μ and Σ, e.g., the sample mean and covariance matrix or their robust versions.

Step 2. Compute the eigen-decomposition of $\hat{Σ} = \sum_{j = 1}^{p} {\hat{λ}}_{j} {\hat{v}}_{j} {\hat{v}}_{j}^{⊤}$ . Let ${{\hat{λ}}_{k}}_{k = 1}^{K}$ be the top K eigenvalues and ${{\hat{v}}_{k}}_{k = 1}^{K}$ be their corresponding eigenvectors. Set $\hat{V} = ({\hat{v}}_{1}, \dots, {\hat{v}}_{K}) \in ℝ^{p \times K}$ and $\hat{Λ} = diag ({\hat{λ}}_{1}, \dots, {\hat{λ}}_{K}) \in ℝ^{K \times K}$ .

Step 3. Obtain PCA estimators $\hat{B} = \hat{V} {\hat{Λ}}^{1 / 2}$ and $\hat{F} = {(X - \hat{μ} 1^{⊤})}^{⊤} \hat{V} {\hat{Λ}}^{- 1 / 2}$ , namely, $\hat{B}$ consists of the top-K rescaled eigenvectors of $\hat{Σ}$ an ${\hat{f}}_{i}$ is just the rescaled projection of $x_{i} - \hat{μ}$ onto the space spanned by the eigen-space: ${\hat{f}}_{i} = {\hat{Λ}}^{- 1 / 2} {\hat{V}}^{T} (x_{i} - \hat{μ})$ .

Let us provide some intuitions for the estimators in Step 3. Recall that b_j is the jth column of B. Then, by model (1.1), $B^{⊤} (x_{i} - μ) = B^{⊤} B f_{i} + B^{⊤} u_{i}$ . In the high-dimensional setting, the second term is averaged out when u_i is weakly dependent across its component. This along with the identifiability condition delivers that

f_{i} \approx diag {(B^{⊤} B)}^{- 1} B^{⊤} (x_{i} - μ) = diag {({‖ b_{1} ‖}^{2}, \dots, {‖ b_{K} ‖}^{2})}^{- 1} B^{⊤} (x_{i} - μ) .

(2.2)

Now, we estimate $\sum_{j = 1}^{K} {\hat{λ}}_{j} {\hat{v}}_{j} {\hat{v}}_{j}^{⊤}$ and hence b_j by ${\hat{λ}}_{j}^{1 / 2} {\hat{v}}_{j}$ and ${‖ b_{j} ‖}^{2}$ by ${\hat{λ}}_{j}$ . Using the substitution method, we obtain the estimators in Step 3.

The above heuristic also reveals that the PCA-based methods work well if the effect of the factors outweighs the noise. To quantify this, we introduce a form of Pervasiveness Assumption from the factor model literature. While this assumption is strong¹, it simplifies our discussion and captures the above intuition well: it holds when the factor loadings ${b_{j}}_{j = 1}^{p}$ are random samples from a nondegenerate population (Fan et al., 2013).

Assumption 2.1 (Pervasiveness). The first K eigenvalues of BB^⊤ have order Ω(p), whereas ${‖ Σ_{u} ‖}_{2} = O (1)$ .

Note that cov(f_i) = I_K under the Identifiability Assumption 1.1. The first part of this assumption holds when each factor influences a non-vanishing proportion of outcomes. Mathematically speaking, it means that for any k ∈ [K] := {1, 2, …, K}, the average of squared loadings of the kth factor satisfies $p^{- 1} \sum_{j = 1}^{p} B_{j k}^{2} = Ω (1)$ (right panel of Figure 1). This holds with high probability if, for example, ${B_{j k}}_{j = 1}^{p}$ are i.i.d. realizations from a non-degenerate distribution, but we will not make such assumption in this paper. The second part of the assumption is reasonable, as cross-sectional correlation becomes weak after we take out the common factors. Typically, if Σ_u is a sparse matrix, the norm bound ${‖ Σ_{u} ‖}_{2} = O (1)$ holds; see Section 3.1 for details. Under this Pervasiveness Assumption, the first K eigenvalues of Σ will be well separated with the rest of eigenvalues. By the Davis-Kahan theorem (Davis and Kahan, 1970), which we present as Theorem 2.1, we can consistently estimate the column space of B through the top-K eigenspace of Σ. This explains why we can apply PCA to factor model analysis (Fan et al., 2013).

Though factor models and PCA are not identical (see Jolliffe, 1986), they are approximately the same for high-dimensional problems with the pervasiveness assumption(Fan et al., 2013). Thus, PCA-based ideas are important components of estimation and inference for factor models. In later sections (especially Section 4), we discuss statistical and machine learning problems with factor-model-type structures. There PCA is able to achieve consistent estimation even when the Pervasiveness Assumption is weakened—and somewhat surprisingly—PCA can work well down to the information limit. For perspectives from random matrix theory, see Baik et al. (2005); Paul (2007); Johnstone and Lu (2009); Benaych-Georges and Nadakuditi (2011); O’Rourke et al. (2016); Wang and Fan (2017), among others.

2.2. Estimating the number of factors

In high-dimensional factor models, if the factors are unobserved, we need to choose the number of factors K before estimating the loading matrix, factors, etc. The number K can be usually estimated from the eigenvalues of the the sample covariance matrix or its robust version. With certain conditions such as separation of the top K eigenvalues from the others, the estimation is consistent. Classical methods include likelihood ratio tests (Bartlett, 1950), the scree plot (Cattell, 1966), parallel analysis (Horn, 1965), etc. Here, we introduce a few recent methods: the first one is based on the eigenvalue ratio, the second on eigenvalue differences, and the third on the eigenvalue magnitude.

For simplicity, let us use the sample covariance and arrange its eigenvalues in descending order: λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_n∧p, where n ∧ p = min{n, p} (the remaining eigenvalues, if any, are zero). Lam and Yao (2012) and Ahn and Horenstein (2013) proposed an estimator ${\hat{K}}_{1}$ based on ratios of consecutive eigenvalues. For a pre-determined k_max, the eigenvalue ratio estimator is

{\hat{K}}_{1} = \underset{i \leq k_{max}}{argmax} \frac{λ_{i}}{λ_{i + 1}} .

Intuitively, when the signal eigenvalues are well separated from the other eigenvalues, the ratio at k = K should be large. Under some conditions, the consistency of this estimator, which does not involve complicated tuning parameters, is established.

In an earlier work, Onatski (2010) proposed to use the differences of consecutive eigenvalues. For a given δ > 0 and pre-determined integer k_max, define

{\hat{K}}_{2} (δ) = max {i \leq k_{max} : λ_{i} - λ_{i + 1} \geq δ} .

Using a result on eigenvalue empirical distribution from random matrix theory, Onatski (2010) proved consistency of ${\hat{K}}_{2} (δ)$ under the Pervasiveness Assumption. The intuition is that, the Pervasiveness Assumption implies that λ_K − λ_K+1 tends on ∞ in probability as n → ∞; whereas λ_i − λ_i+1 → 0 almost surely for K < i < k_max because these λ_i-s converge to the same limit, which can be determined using random matrix theory. Onatski (2010) also proposed a data-driven way to determine δ from the empirical eigenvalue distribution of the sample covariance matrix.

A third possibility is to use an information criterion. Define

V (k) = \frac{1}{n p} min_{\hat{B} \in ℝ^{p \times k}, \hat{F} \in ℝ^{n \times k}} {‖ X - \hat{μ} 1_{n}^{⊤} - \hat{B} {\hat{F}}^{⊤} ‖}_{F}^{2} = p^{- 1} \sum_{j > k} λ_{j},

where $\hat{μ}$ is the sample mean, and the equivalence (second equality) is well known. For a given k, V (k) is interpreted as the scaled sum of squared residuals, which measures how well k factors fit the data. A very natural estimator ${\hat{K}}_{3}$ is to find the best k ≤ k_max such that the following penalized version of V (k) is minimized (Bai and Ng, 2002):

P C (k) = V (k) + k {\hat{σ}}^{2} g (n, p), where g (n, p) : = \frac{n + p}{n p} log (\frac{n p}{n + p}),

and ${\hat{σ}}^{2}$ is any consistent estimate ${(n p)}^{- 1} \sum_{i = 1}^{n} \sum_{j = 1}^{d} E u_{j i}^{2}$ . The upper limit k_max is assumed to be no smaller than K, and is typically chosen as 8 or 15 in empirical studies in Bai and Ng (2002). Consistency results are established under more general choices of g(n, p).

We conclude this section by remarking that in general, it is impossible to consistently estimate K if the smallest nonzero eigenvalue B^⊤B is much smaller than ${‖ Σ_{u} ‖}_{2}$ , because the ‘signals’ (eigenvalues of B^⊤B) would not be distinguishable from the the noise (eigenvalues of UU^⊤). As mentioned before, consistency of PCA is well studied in the random matrix theory literature. See Dobriban (2017) for a recent work that justifies parallel analysis using random matrix theory.

2.3. Robust covariance inputs

To extract latent factors and their factor loadings, we need an initial covariance estimator. Given independent observations x₁, …, x_n with mean zero, the sample covariance matrix, namely ${\hat{Σ}}_{sam} : = n^{- 1} \sum_{i = 1}^{n} x_{i} x_{i}^{⊤}$ , is a natural choice to estimate $Σ \in ℝ^{p \times p}$ . The finite sample bound on ${‖ {\hat{Σ}}_{sam} - Σ ‖}_{2}$ has been well studied in the literature (Vershynin, 2010; Tropp, 2012; Koltchinskii and Lounici, 2017). Before presenting the result from Vershynin (2010), let us review the definition of sub-Gaussian variables.

A random variable ξ is called sub-Gaussian if ${‖ ξ ‖}_{ψ_{2}} \equiv {sup}_{q \geq 1} q^{- 1 / 2} {(E {| ξ |}^{q})}^{1 / q}$ is finite, in which case this quantity defines a norm ${‖ \cdot ‖}_{ψ_{2}}$ called the sub-Gaussian norm. Sub-Gaussian variables include as special cases Gaussian variables, bounded variables, and other variables with tails similar to or lighter than Gaussian tails. For a random vector ξ, we define ${‖ ξ ‖}_{ψ_{2}} : = {sup}_{{‖ v ‖}_{2} = 1} {‖ ξ^{⊤} v ‖}_{ψ_{2}}$ ; we call ξ sub-Gaussian if ${‖ ξ ‖}_{ψ_{2}}$ is finite.

Theorem 2.1. Let Σ be the covariance matrix of x_i. Assume that ${Σ^{- \frac{1}{2}} x_{i}}_{i = 1}^{n}$ are i.i.d. sub-Gaussian random vectors, and denote $κ = {sup}_{{‖ v ‖}_{2} = 1} {‖ x_{i}^{⊤} v ‖}_{ψ_{2}}$ . Then for any t ≥ 0, there exist constants C and c only depending on κ such that

ℙ ({‖ {\hat{Σ}}_{sam} - Σ ‖}_{2} \geq max (δ, δ^{2}) {‖ Σ ‖}_{2}) \leq 2 exp (- c t^{2}),

(2.3)

where $δ = C \sqrt{p / n} + t / \sqrt{n}$ .

Remark 2.1. The spectral-norm bound above depends on the ambient dimension p, which can be large in high-dimensional scenarios. Interested readers can refer to Koltchinskii and Lounici (2017) for a refined result that only depends on the intrinsic dimension (or effective rank) of Σ.

An important asepect of the above result is the sub-Gaussian concentration in (2.3), but this depends heavily on the sub-Gaussian or sub-exponential behaviors of observed random vectors. This condition can not be validated in high dimensions when tens of thousands of variables are collected. See Fan et al. (2016b). When the distribution is heavy-tailed², one cannot expect sub-Gaussian or sub-exponential behaviors of the sample covariance in the spectral norm (Catoni, 2012). See also Vershynin (2012) and Srivastava and Vershynin (2013). Therefore, to perform PCA for heavy-tailed data, the sample covariance is not a good choice to begin with. Alternative robust estimators have been constructed to achieve better finite sample performance.

Catoni (2012), Fan et al. (2017b) and Fan et al. (2016b) approached the problem by first considering estimation of a univariate mean μ from a sequence of i.i.d random variables X₁, ⋯, X_n with variance σ². In this case, the sample mean $\bar{X}$ provides an estimator but without exponential concentration. Indeed, by Markov inequality, we have $ℙ (| \bar{X} - μ | \geq t σ / \sqrt{n}) \leq t^{- 2}$ , which is tight in general and has a Cauchy tail (in terms of t). On the other hand, if we truncate the data ${\tilde{X}}_{i} = sign (X_{i}) min (| X_{i} |, τ)$ with $τ ≍ σ \sqrt{n}$ and compute the mean of the truncated data, then we have (Fan et al., 2016b)

ℙ (| \frac{1}{n} \sum_{i = 1}^{n} {\tilde{X}}_{i} - μ | \geq t \frac{σ}{\sqrt{n}}) \leq 2 exp (- c t^{2}),

for a universal constant c > 0. In other words, the mean of truncated data with only a finite second moment behaves very much the same as the sample mean from the normal data: both estimators have Gaussian tails (in terms of t). This sub-Gaussian concentration is fundamental in high-dimensional statistics as the sample mean is computed tens of thousands or even millions of times.

As an example, estimating the high-dimensional covariance matrix Σ = (σ_ij) involves O(p²) univariate mean estimation, since the covariance can be expressed as an expectation: as $σ_{i j} = E (X_{i} X_{j}) - E (X_{i}) E (X_{j})$ . Estimating each component by the truncated mean yields a covariance matrix $\tilde{Σ}$ . Assuming the fourth moment is bounded (as the covariance itself are second moments), by using the union bound and the above concentration inequality, we can easily obtain

ℙ ({‖ \tilde{Σ} - Σ ‖}_{max} \geq \sqrt{\frac{a log p}{c^{'} n}}) ≲ p^{2 - a}

for any a > 0 and a constant c′ > 0. In other words, with truncation, when the data have merely bounded fourth moments, we can achieve the same estimation rate as the sample covariance matrix under the Gaussian data.

Fan et al. (2016b) and Minsker (2016) independently proposed shrinkage variants of the sample covariance with sub-Gaussian behavior under the spectral norm, as long as the fourth moments of X are finite. For any $τ \in ℝ^{+}$ , Fan et al. (2016b) proposed the following shrinkage sample covariance matrix

{\hat{Σ}}_{s} (τ) = \frac{1}{n} \sum_{i = 1}^{n} {\tilde{x}}_{i} {\tilde{x}}_{i}^{⊤}, {\tilde{x}}_{i} : = ({‖ x_{i} ‖}_{4} \land τ) x_{i} / {‖ x_{i} ‖}_{4},

(2.4)

to estimate Σ, where ${‖ \cdot ‖}_{4}$ is the ℓ₄-norm. The following theorem establishes the statistical error rate of ${\tilde{Σ}}_{s} (τ)$ in terms of the spectral norm.

Theorem 2.2. Suppose $E {(v^{⊤} x_{i})}^{4} \leq R$ for any unit vector $v \in S^{p - 1}$ . Then it holds that for any δ > 0,

ℙ ({‖ {\hat{Σ}}_{s} (τ) - Σ ‖}_{2} \geq \sqrt{\frac{δ R p log p}{n}}) \leq p^{1 - C δ},

(2.5)

where $τ ≍ {(n R / (δ log p))}^{1 / 4}$ and C is a universal constant.

Applying PCA to the robust covariance estimators as described above leads to more reliable estimation of principal eigenspaces in the presence of heavy-tailed data.

In Theorem 2.2, we assume that the mean of x_i is zero. When this does not hold, a natural estimator of $Σ = \frac{1}{2} E (x_{1} - x_{2}) {(x_{1} - x_{2})}^{⊤}$ is to use the shrunk U-statistic (Fan et al., 2017a):

{\hat{Σ}}_{U} (τ) = \frac{1}{2 (\begin{array}{l} n \\ 2 \end{array})} \sum_{j \neq k} \frac{ψ_{τ} ({‖ x_{j} - x_{k} ‖}_{2}^{2})}{{‖ x_{j} - x_{k} ‖}_{2}^{2}} (x_{j} - x_{k}) {(x_{j} - x_{k})}^{⊤} = \frac{1}{2 (\begin{array}{l} n \\ 2 \end{array})} \sum_{j \neq k} min (1, τ / {‖ x_{j} - x_{k} ‖}_{2}^{2}) (x_{j} - x_{k}) {(x_{j} - x_{k})}^{⊤},

where $ψ_{τ} (u) = (| u | \land τ) sign (u)$ . When τ = ∞, it reduces to the usual U-statistics. It possesses a similar concentration property to that in Theorem 2.2 with a proper choice of τ.

2.4. Perturbation bounds

In this section, we introduce several perturbation results on eigenspaces, which serve as fundamental technical tools in factor models and related learning problems. For example, in relating the factor loading matrix B to the principal components of covariance matrix Σ in (2.1), one can regard Σ as a perturbation of BB^⊤ by an amount of Σ_u and take A = BB^⊤ and $\tilde{A} = Σ$ in Theorem 2.3 below. Similarly, we can also regard a covariance matrix estimator $\hat{Σ}$ as a perturbation of Σ by an amount of $\hat{Σ} - Σ$ .

We will begin with a review of the Davis-Kahan theorem (Davis and Kahan, 1970), which is usually useful for deriving ℓ₂-type bounds (which includes spectral norm bounds) for symmetric matrices. Then, based on this classical result, we introduce entry-wise (ℓ_∞) bounds, which typically give refined results under structural assumptions. We also derive bounds for rectangular matrices that are similar to Wedin’s theorem (Wedin, 1972). Several recent works on this topic can be found in Yu et al. (2014); Fan et al. (2018b); Koltchinskii and Xia (2016); Abbe et al. (2017); Zhong (2017); Cape et al. (2017); Eldridge et al. (2017).

First, for any two subspaces $S$ and $\tilde{S}$ of the same dimension K in $ℝ^{p}$ , we choose any V, $\tilde{V} \in ℝ^{p \times K}$ with orthonormal columns that span $S$ and $\tilde{S}$ , respectively. We can measure the closeness between two subspaces though the difference between their projectors:

d_{2} (S, \tilde{S}) = {‖ \tilde{V} {\tilde{V}}^{⊤} - V V^{⊤} ‖}_{2} or d_{F} (S, \tilde{S}) = {‖ \tilde{V} {\tilde{V}}^{⊤} - V V^{⊤} ‖}_{F} .

The above definitions are both proper metrics (or distances) for subspaces S and $\tilde{S}$ and do not depend on the specific choice of V and $\tilde{V}$ , since $\tilde{V} {\tilde{V}}^{⊤}$ and VV^⊤ are projection operators. Importantly, these two metrics are connected to the well-studied notion of canonical angles (or principal angles). Formally, let the singular values of ${\tilde{V}}^{⊤} V$ be ${σ_{k}}_{k = 1}^{K}$ , and define the canonical angles θ_k = cos⁻¹ σ_k for k = 1, …, K. It is often useful to denote the sine of the canonical (principal) angles by $sin Θ (\hat{V}, V) : = diag (sin θ_{1}, \dots, sin θ_{K}) \in ℝ^{K \times K}$ , which can be interpreted as a generalization of sine of angles between two vectors. The following identities are well known (Stewart and Sun, 1990).

{‖ sin Θ (\tilde{V}, V) ‖}_{2} = d_{2} (S, \tilde{S}), \sqrt{2} {‖ sin Θ (\tilde{V}, V) ‖}_{F} = d_{F} (S, \tilde{S}) .

In some cases, it is convenient to fix a specific choice of $\tilde{V}$ and V. It is known that for both Frobenius norm and spectral norm,

‖ sin Θ (\tilde{V}, V) ‖ \leq min_{R \in O (K)} ‖ \tilde{V} R - V ‖ \leq \sqrt{2} ‖ sin Θ (\tilde{V}, V) ‖,

where $O (K)$ is the space of orthogonal matrices of size K × K. The minimizer (best rotation of basis) can be given by the singular value decomposition (SVD) of ${\tilde{V}}^{⊤} V$ . For details, see Cape et al. (2017) for example.

Now, we present the Davis-Kahan sin θ theorem (Davis and Kahan, 1970).

Theorem 2.3. Suppose A, $\tilde{A} \in ℝ^{n \times n}$ are symmetric, and that V, $\tilde{V} \in ℝ^{n \times K}$ have orthonormal column vectors which are eigenvectors of A and $\tilde{A}$ respectively. Let $L (V)$ be the set of eigenvalues corresponding to the eigenvectors given in V, and let $L (V^{⊥})$ (respectively $L ({\tilde{V}}^{⊥})$ ) be the set of eigenvalues corresponding to the eigenvectors not given in V (respectively $\tilde{V}$ ). If there exists an interval [α, β] and δ > 0 such that $L (V) \subset [α, β]$ and $L ({\tilde{V}}^{⊥}) \subset (- \infty, α - δ] \cup [β + δ, + \infty)$ , then for any orthogonal-invariant norm³

‖ sin Θ (\tilde{V}, V) ‖ \leq δ^{- 1} ‖ (\tilde{A} - A) V ‖ .

This theorem can be generalized to singular vector perturbation for rectangular matrices; see Wedin (1972). A slightly unpleasant feature of this theorem is that δ depends on the eigenvalues of both A and $\tilde{A}$ . However, with the help of Weyl’s inequality, we can immediately obtain a corollary that does not involve the eigenvalues of $\tilde{A}$ . Let λ_j(·) denote the jth largest eigenvalue of a real symmetric matrix. Recall that Weyl’s inequality bounds the differences between the eigenvalues of A and $\tilde{A}$ :

max_{1 \leq j \leq n} | λ_{j} (\tilde{A}) - λ_{j} (A) | \leq {‖ \tilde{A} - A ‖}_{2} .

(2.6)

This inequality suggests that, if the eigenvalues in $L ({\tilde{V}}^{⊥})$ have the same ranks (in descending order) as those in $L (V^{⊥})$ , then $L ({\tilde{V}}^{⊥})$ and $L (V^{⊥})$ are similar. Below we state our corollary, whose proof is in the appendix.

Corollary 2.1. Assume the setup of the above theorem, and suppose the eigenvalues in $L (\tilde{V})$ have the same ranks as those in $L (V)$ . If $L (V) \subset [α, β]$ and $L (V^{⊥}) \subset (- \infty, α - δ_{0}] \cup [β + δ_{0}, + \infty)$ for some δ₀ > 0, then

{‖ sin Θ (\tilde{V}, V) ‖}_{2} \leq 2 δ_{0}^{- 1} {‖ \tilde{A} - A ‖}_{2} .

We can then use ${‖ sin Θ (\tilde{V}, V) ‖}_{F} \leq \sqrt{K} {‖ sin Θ (\tilde{V}, V) ‖}_{2}$ to obtain a bound under the Frobenius norm. In the special case where $L (V) = {λ}$ and V = v, $\tilde{V} = \tilde{v}$ reduce to vectors, we can choose α = β = λ, and the above corollary translates into

min_{s \in {\pm 1}} {‖ \hat{v} - s v ‖}_{2} \leq \sqrt{2} sin θ (\hat{v}, v) \leq 2 \sqrt{2} δ_{0}^{- 1} {‖ \tilde{A} - A ‖}_{2} .

(2.7)

We can now see that the factor model and PCA are approximately the same with sufficiently large eigen-gap. Indeed, under Identifiability Assumption 1.1, we have $Σ = B B^{⊤} + Σ_{u}$ . Applying Weyl’s inequality and Corollary 2.1 to BB^⊤ (as A) and Σ (as $\tilde{A}$ ), we can easily control the eigenvalue/eigenvector differences by ${‖ Σ_{u} ‖}_{2}$ and the eigengap, which is comparably small under Pervasiveness Assumption 2.1. This difference can be interpreted as the bias incurred by PCA on approximating factor models.

Furthermore, given any covariance estimator $\hat{Σ}$ , we can similarly apply the above results by setting A = Σ and $\tilde{A} = \hat{Σ}$ to bound the difference between the estimated eigenvalues/eigenvectors and the population counterparts. Note that the above corollary gives us an upper bound on the subspace estimation error in terms of the ratio ${‖ \hat{Σ} - Σ ‖}_{2} / δ_{0}$ .

Next, we consider entry-wise bounds on the eigenvectors. For simplicity, here we only consider eigenvectors corresponding to unique eigenvalues rather than the general eigenspace. Often, we want to have a bound on each entry of the eigenvector difference $\tilde{v} - v$ , instead of an ℓ₂ norm bound, which is an average-type result. In many cases, none of these entries has dominant perturbation, but the Davis-Kahan’s theorem falls short of providing a reasonable bound (the naïve bound ${‖ \cdot ‖}_{\infty} \leq {‖ \cdot ‖}_{2}$ gives a suboptimal result).

Some recent papers (Abbe et al., 2017) have addressed this problem, and in particular, entry-wise bounds of the following form are established.

| {[\tilde{v} - v]}_{m} | ≲ μ \frac{{‖ \tilde{A} - A ‖}_{2}}{δ_{0}} + small term, m \in [n],

where μ ∈ [0, 1] is related to the structure of the statistical problem and typically can be as small as $O (1 / \sqrt{n})$ , which is very desirable in high-dimensional setting. The small term is often related to independence pattern of the data, which is typically small under mild independence conditions.

We illustrate this idea in Figure 2 through a simulated data example (left) and a real data example (right), both of which have factor-type structure. For the left plot, we generated a network data according to the stochastic block model with K = 2 blocks (communities), each having nodes n/2 = 2500: the adjacency matrix that represents the links between nodes is a symmetric matrix, with upper triangular elements generated independently from Bernoulli trials (diagonal elements are taken as 0), with the edge probability 5 log n/n for two nodes within blocks and log n/(4n) otherwise. Our task is to classify (cluster) these two communities based on the adjacency matrix. We used the second eigenvector $v_{2} \in ℝ^{5000}$ (that is, corresponding to the second largest eigenvalue) of the adjacency matrix as a classifier. The left panel of Figure 2 represents the values of the 5000 coordinates (or entries) [v₂]_i in the y-axis against the indices i = 1, …, 5000 in the x-axis. For comparison, the second eigenvector $v_{2}^{*} \in ℝ^{5000}$ of the expectation of the adjacency matrix—which is of interest but unknown—have entries taking values only in ${\pm 1 / \sqrt{5000}}$ , depending on the unknown nature of which block a vertex belongs to (this statement is not hard to verify). We used the horizontal line to represent these ideal values: they indicate exactly the membership of each vertex. Clearly, the magnitude of entry-wise perturbation is $O (1 / \sqrt{n})$ . Therefore, we can use $sign (v_{2}) / \sqrt{5000}$ as an estimate of v* and classify all nodes with the same sign as the same community. See Section 4.2 for more details.

Fig 2. — *The* ***left plot*** *shows the entries (coordinates) of the second eigenvectors* v₂ *computed from the adjacency matrix from the SBM with two equal-sized blocks (n* = 5000, K = 2). The plot also shows the expectation counterpart $v_{2}^{*}$ , whose entries have the same magnitude $O (1 / \sqrt{n})$ . The deviation of v₂ *from $v_{2}^{*}$ is quite uniform, which is a phenomenon not captured by the Davis-Kahan’s theorem. The* ***right plot*** shows the coordinates of two leading eigenvectors of the sample covariance matrix calculated from 2012–2017 daily return data of 484 stocks (tiny black dots). We also highlight six stocks during three time windows (2012–2015, 2013–2016, 2014–2017) with big markers, so that the fluctuation/perturbation is shown. The magnitude of these coordinates is typically small, and the fluctuation is also small.

For the right plot, we used daily return data of stocks that are constituents of S&P 500 index from 2012.1.1–2017.12.31. We considered stocks with exactly n = 1509 records and excluded stocks with incomplete/missing values, which resulted in p = 484 stocks. Then, we calculated the sample covariance matrix ${\hat{Σ}}_{sam} \in ℝ^{p \times p}$ using the data in the entire period, and computed two leading eigenvectors (note that they span the column space of B) and plotted the coordinates (entries) using small dots. Stocks with an coordinate smaller than 5% quantile or larger than 95% quantile are potentially outlying values and are not shown in the plot. In addition, we also highlighted the fluctuation of six stocks during three time windows: 2012.1–2015.12, 2013.1–2016.12 and 2014.1–2017.12, with different big markers. That is, for each of the three time windows, we re-computed the covariance matrices and the two leading eigenvectors, and then highlighted coordinates that correspond to the six major stocks. Clearly, the magnitude for these stocks is small, which is roughly $O (1 / \sqrt{p})$ , and the fluctuation of coordinates is also very small. Both plots suggest an interesting phenomenon of eigenvectors in high dimensions: entry-wise behavior of eigenvectors can be benign under factor model structure.

To state our results rigorously, let us suppose that A, $\tilde{A}$ , $W \in ℝ^{n \times n}$ are symmetric matrices, with $\tilde{A} = A + W$ and rank(A) = K < n. Let the eigen-decomposition of A and $\tilde{A}$ be

A = \sum_{k = 1}^{K} λ_{k} v_{k} v_{k}^{⊤}, and \tilde{A} = \sum_{k = 1}^{K} {\tilde{λ}}_{k} {\tilde{v}}_{k} {\tilde{v}}_{k}^{⊤} + \sum_{k = K + 1}^{n} {\tilde{λ}}_{k} {\tilde{v}}_{k} {\tilde{v}}_{k}^{⊤} .

(2.8)

Here the eigenvalues ${λ_{k}}_{k = 1}^{K}$ and ${{\tilde{λ}}_{k}}_{k = 1}^{K}$ are the K largest ones of A and $\tilde{A}$ , respectively, in terms of absolute values. Both sequences are sorted in descending order. ${{\tilde{λ}}_{k}}_{k = K + 1}^{K}$ are eigenvalues of $\tilde{A}$ whose absolute values are smaller than $| {\tilde{λ}}_{K} |$ . The eigenvectors ${v_{k}}_{k = 1}^{K}$ and ${{\tilde{v}}_{k}}_{k = 1}^{n}$ are normalized to have unit norms.

Here ${λ_{k}}_{k = 1}^{K}$ are allowed to take negative values. Thanks to Weyl’s inequality, ${{\tilde{λ}}_{k}}_{k = 1}^{K}$ and ${{\tilde{λ}}_{k}}_{k = K + 1}^{n}$ are well-separated when the size of perturbation W is not too large. In addition, we have the freedom to choose signs for eigenvectors, since they are not uniquely defined. Later, we will use ‘up to sign’ to signify that our statement is true for at least one choice of sign. With the conventions λ₀ = +∞ and λ_K+1 = −∞, we define the eigen-gap as

δ_{k} = min {λ_{k - 1} - λ_{k}, λ_{k} - λ_{k + 1}, | λ_{k} |}, \forall k \in [K],

(2.9)

which is the smallest distance between λ_k and other eigenvalues (including 0). This definition coincides with the (usual) eigen-gap in Corollary 2.1 in the special case $L (v_{k}) = {λ_{k}}$ where we are interested in a single eigenvalue and its associated eigenvector.

We now present an entry-wise perturbation result. Let us first look at only one eigenvector. In this case, when $‖ \tilde{A} - A ‖$ is small, heuristically,

{\tilde{v}}_{k} = \frac{\tilde{A} {\tilde{v}}_{k}}{{\tilde{λ}}_{k}} \approx \frac{\tilde{A} v_{k}}{λ_{k}} = v_{k} + \frac{(\tilde{A} - A) v_{k}}{λ_{k}}

holds uniformly for each entry. When $A = E \tilde{A}$ , that is, $\tilde{A}$ is unbiased, this gives the first-order approximation (rather than bounds on the difference ${\tilde{v}}_{k} - v_{k}$ ) of the random vector ${\tilde{v}}_{k}$ . Abbe et al. (2017) proves rigorously this result and generalizes to eigenspaces. The key technique for the proof is similar to Theorem 2.4 below, which simplifies the one in Abbe et al. (2017) in various ways but holds under more general conditions. It is stated in a deterministic way, and can be powerful if there is certain structural independence in the perturbation matrix W. A self-contained proof can be found in the appendix.

For each m ∈ [n], let $W^{(m)} \in ℝ^{n \times n}$ be a modification of W with the mth row and mth column zeroed out, i.e.,

W_{i j}^{(m)} = W_{i j} 1_{{i \neq m}} 1_{{j \neq m}}, \forall i, j \in [n] .

We also define ${\tilde{A}}^{(m)} = A + W^{(m)}$ , and denote its eigenvalues and eigenvectors by ${{\tilde{λ}}_{k}^{(m)}}_{k = 1}^{n}$ and ${{\tilde{v}}_{k}^{(m)}}_{k = 1}^{n}$ , respectively. This construction is related to the leave-one-out technique in probability and statistics. For recent papers using this technique, see Bean et al. (2013); Zhong and Boumal (2018); Abbe et al. (2017) for example.

Theorem 2.4. Fix any ℓ ∈ [K]. Suppose that $| λ_{ℓ} | ≍ {max}_{k \in [K]} | λ_{k} |$ , and that the eigen-gap δ_ℓ as defined in (2.9) satisfies $δ_{ℓ} \geq 5 {‖ W ‖}_{2}$ . Then, up to sign,

| {[{\tilde{v}}_{ℓ} - v_{ℓ}]}_{m} | ≲ \frac{{‖ W ‖}_{2}}{δ_{ℓ}} {(\sum_{k = 1}^{K} {[v_{k}]}_{m}^{2})}^{1 / 2} + \frac{| 〈 w_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉 |}{δ_{ℓ}}, \forall m \in [n],

(2.10)

where w_m is the mth column of W.

To understand this theorem, let us compare it with the standard ℓ₂ bound (Theorem 2.3), which implies ${‖ {\tilde{v}}_{ℓ} - v_{ℓ} ‖}_{2} \leq {‖ W ‖}_{2} / δ_{ℓ}$ . The first term of the upper bound in (2.10) says the perturbation on the mth entry can be much smaller, because the factor ${(\sum_{k = 1}^{K} {[v_{k}]}_{m}^{2})}^{1 / 2}$ , always bounded by 1, can be usually much smaller. For example, if v_k’s are uniformly distributed on the unit sphere, then this factor is typically of order $O (\sqrt{K log n / n})$ . This factor is related to the notion of incoherence in Candès and Recht (2009); Candès et al. (2011), etc.

The second term of the upper bound in (2.10) is typically much smaller than ${‖ W ‖}_{2} / δ_{ℓ}$ , especially under certain independence assumption. For example, if w_m is independent of other entries, then, by construction, ${\tilde{v}}_{ℓ}^{(m)}$ and w_m are independent. If, moreover, entries of w_m are i.i.d. standard Gaussian, $| 〈 w_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉 |$ is of order $O_{ℙ} (1)$ , whereas ${‖ W ‖}_{2}$ typically scales with $\sqrt{n}$ . This gives a bound for the mth entry, and can be extended to an ℓ_∞ bound if we are willing to make independence assumption for all m ∈ [n] (which is typical for random graphs for example).

We remark that this result can be generalized to perturbation bounds for eigenspaces (Abbe et al., 2017), and the conditions on eigenvalues can be relaxed using certain random matrix assumptions (Koltchinskii and Xia, 2016; O’Rourke et al., 2017; Zhong, 2017).

Now, we extend this perturbation result to singular vectors of rectangular matrices. Suppose L, $\tilde{L}$ , $E \in ℝ^{n \times p}$ satisfy $\tilde{L} = L + E$ and rank(L) = K < min{n, p}. Let the SVD of L and $\tilde{L}$ be⁴

L = \sum_{k = 1}^{K} σ_{k} u_{k} v_{k}^{⊤} and \tilde{L} = \sum_{k = 1}^{K} {\tilde{σ}}_{k} {\tilde{u}}_{k} {\tilde{v}}_{k}^{⊤} + \sum_{k = K + 1}^{min {n, p}} {\tilde{σ}}_{k} {\tilde{u}}_{k} {\tilde{v}}_{k}^{⊤},

where σ_k and ${\tilde{σ}}_{k}$ are respectively non-increasing in k, and u_k and v_k are all normalized to have unit ℓ₂ norm. As before, let ${{\tilde{σ}}_{k}}_{k = 1}^{K}$ have K largest absolute values. Similar to (2.9), we adopt the conventions σ₀ = +∞, σ_K+1 = 0 and define the eigen-gap as

γ_{k} = min {σ_{k - 1} - σ_{k}, σ_{k} - σ_{k + 1}}, \forall k \in [K] .

(2.11)

For j ∈ [p] and i ∈ [n], we define unit vectors ${{\tilde{u}}_{k}^{(j)}}_{k = 1}^{min {n, p}} \subseteq ℝ^{n}$ and ${{\tilde{v}}_{k}^{(i)}}_{k = 1}^{min {n, p}} \subseteq ℝ^{p}$ by replacing certain row or column of E with zeros. To be specific, in our expression $\tilde{L} = L + E$ , if we replace the ith row of E by zeros, then the normalized right singular vectors of the resulting perturbed matrix are denoted by ${{\tilde{v}}_{k}^{(i)}}_{k = 1}^{min {n, p}}$ ; and if we replace the jth column of E by zeros, then the normalized left singular vectors of the resulting perturbed matrix are denoted by ${{\tilde{u}}_{k}^{(j)}}_{k = 1}^{min {n, p}}$ .

Corollary 2.2. Fix any ℓ ∈ [K]. Suppose that $σ_{ℓ} ≍ {max}_{k \in [K]} σ_{k}$ , and that $γ_{ℓ} \geq 5 {‖ E ‖}_{2}$ . Then, up to sign,

| {[{\tilde{u}}_{ℓ} - u_{ℓ}]}_{i} | ≲ \frac{{‖ E ‖}_{2}}{γ ℓ} {(\sum_{k = 1}^{K} {[u_{k}]}_{i}^{2})}^{1 / 2} + \frac{| 〈 {(e_{i}^{row})}^{⊤}, {\tilde{v}}_{ℓ}^{(i)} 〉 |}{γ ℓ}, \forall i \in [n], a n d

| {[{\tilde{v}}_{ℓ} - v_{ℓ}]}_{j} | ≲ \frac{{‖ E ‖}_{2}}{γ ℓ} {(\sum_{k = 1}^{K} {[v_{k}]}_{j}^{2})}^{1 / 2} + \frac{| 〈 e_{j}^{col}, {\tilde{u}}_{ℓ}^{(j)} 〉 |}{γ ℓ}, \forall j \in [p],

where $e_{i}^{row} \in ℝ^{p}$ is the ith row vector of E, and $e_{j}^{col} \in ℝ^{n}$ is the jth column vector of E.

If we view $\tilde{L}$ as the data matrix (or observation) X, then, the low rank matrix L can be interpreted as BF^⊤. The above result provides a tool of studying estimation errors of the singular subspace of this low rank matrix. Note that ${\tilde{v}}_{ℓ}^{(i)}$ can be interpreted as the result of removing the idiosyncratic error of the ith observation, and ${\tilde{u}}_{ℓ}^{(j)}$ as the result of removing the jth covariate of the idiosyncratic error.

To better understand this result, let us consider a very simple case: K = 1 and each row of E is i.i.d. $N (0, I_{p})$ . We are interested in bounding the singular vector difference between the rank-1 matrix L = σ₁uv^⊤ and its noisy observation $\tilde{L} = L + E$ . This is a spiked matrix model with a single spike. By independence between $e_{i}^{row}$ and ${\tilde{v}}_{ℓ}^{(i)}$ as well as elementary properties of Gaussian variables, Corollary 2.2 implies that with probability 1 – o(1), up to sign,

{‖ {\tilde{u}}_{1} - u_{1} ‖}_{\infty} \leq \frac{{‖ E ‖}_{2}}{σ_{1}} {‖ u_{1} ‖}_{\infty} + \frac{O (\sqrt{log n})}{σ_{1}} .

(2.12)

Random matrix theory gives ${‖ E ‖}_{2} ≍ \sqrt{n} + \sqrt{p}$ with high probability. Our ℓ₂ perturbation inequality (Corollary 2.1) implies that ${‖ {\tilde{u}}_{1} - u_{1} ‖}_{2} \leq {‖ E ‖}_{2} / σ_{1}$ . This upper bound is much larger than the two terms in (2.12), as ${‖ u_{1} ‖}_{\infty}$ is typically much smaller than 1 in high dimensions. Thus, (2.12) gives a better entry-wise control over the ℓ₂ counterpart.

Beyond this simple case, there are many desirable features of Corollary 2.2. First of all, we allow K to be moderately large, in which case, as mentioned before, the factor ${(\sum_{k = 1}^{K} {[u_{k}]}_{i}^{2})}^{1 / 2}$ is related to the incoherence structure in the matrix completion and robust PCA literature. Secondly, the result holds deterministically, so random matrices are also applicable. Finally, the result holds for each i ∈ [n] and j ∈ [p], and thus it is useful even if the entries of E are not independent, e.g. when a subset of covariates are dependent.

To sum up, our results Theorem 2.4 and Corollary 2.2 provide flexible tools of studying entry-wise perturbation of eigenvectors and singular vectors. It is also easy to adapt to other problems since their proofs are not complicated (see the appendix).

3. APPLICATIONS TO HIGH-DIMENSIONAL STATISTICS

3.1. Covariance estimation

Estimation of high-dimensional covariance matrices has wide applications in modern data analysis. When the dimensionality p exceeds the sample size n, the sample covariance matrix becomes singular. Structural assumptions are necessary in order to obtain a consistent estimator in this challenging scenario. One typical assumption in the literature is that the population covariance matrix is sparse, with a large fraction of entries being (close to) zero, see Bickel and Levina (2008) and Cai and Liu (2011). In this setting, most variables are nearly uncorrelated. In financial and genetic data, however, the presence of common factors leads to strong dependencies among variables (Fan et al., 2008). The approximate factor model (1.1) better characterizes this structure and helps construct valid estimates. Under this model, the covariance matrix Σ has decomposition (2.1), where $Σ_{u} = cov (u_{i}) = {(σ_{u, j k})}_{1 \leq j, k \leq p}$ is assumed to be sparse (Fan et al., 2013). Intuitively, we may assume that Σ_u only has a small number of nonzero entries. Formally, we require the sparsity parameter

m_{0} : = max_{j \in [p]} \sum_{k = 1}^{p} 1 {σ_{u, j k} \neq 0}

to be small. This definition can be generalized to a weaker sense of sparsity, which is characterized by $m_{q} = {max}_{j \in [p]} \sum_{k = 1}^{p} {| σ_{u, j k} |}^{q}$ , where q ∈ (0g, 1) is a parameter. Note that small m_q forces Σ_u to have few large entries. However, for simplicity, we choose not to use this more general definition when presenting theoretical results below.

The approximate factor model has the following two important special cases, under which the parameter estimation has been well studied.

The sparse covariance model is (2.1) without factor structure, i.e. Σ = Σ_u; typically, entry-wise thresholding is employed for estimation.
The strict factor model corresponds to (2.1) with Σ_u being diagonal; usually, PCA-based methods are used.

The approximate factor model is a combination of the above two models, as it comprises both a low-rank component and a sparse component. A natural idea is to fuse methodologies for the two models into one, by estimating the two components using their corresponding methods. This motivated our high-level idea for estimation under the approximate factor model: (1) estimating the low-rank component (factors and loadings) using regression (when factors are observable) or PCA (when factors are latent); (2) after eliminating it from Σ, employing standard techniques such as thresholding in the sparse covariance matrix literature to estimate Σ_u; (3) adding the two estimated components together.

First, let us consider the scenario where the factors ${f_{i}}_{i = 1}^{n}$ are observable. In this setting, we do not need the Identifiability Assumption 1.1. Fan et al. (2008) focused on the strict factor model where the Σ_u in (2.1) is diagonal. It is then extended to the approximate factor model (1.1) by Fan et al. (2011). Later, Fan et al. (2018b) relaxed the sub-Gaussian assumption on the data to moment condition, and proposed a robust estimator. We are going to present the main idea of these methods using the one in Fan et al. (2011).

Step 1. Estimate B using the ordinary least-squares: $\hat{B} = {({\hat{b}}_{1}, \dots, {\hat{b}}_{p})}^{⊤}$ where

({\hat{a}}_{j}, {\hat{b}}_{j}) = {argmin}_{a, \hat{b}} \frac{1}{n} \sum_{i = 1}^{n} {(x_{i j} - a - b^{⊤} f_{i})}^{2} .

Step 2. Let $\hat{a} = {({\hat{a}}_{1}, \dots, {\hat{a}}_{p})}^{⊤}$ be the vector of intercepts, ${\hat{u}}_{i} = x_{i} - \hat{a} - \hat{B} f_{i}$ be the vector of residual for i ∈ [n], and $S_{u} = \frac{1}{n} \sum_{i = 1}^{n} {\hat{u}}_{i} {\hat{u}}^{⊤}$ be the sample covariance. Apply thresholding to S_u and obtain a regularized estimator ${\hat{Σ}}_{u}$ .

Step 3. Estimate cov(f_i) by $\hat{cov} (f_{i}) = \frac{1}{n} \sum_{i = 1}^{n} (f_{i} - \bar{f}) {(f_{i} - \bar{f})}^{⊤}$ .

Step 4. The final estimator is $\hat{Σ} = \hat{B} \hat{cov} (f_{i}) {\hat{B}}^{⊤} + {\hat{Σ}}_{u}$ .

We remark that in Step 2, there are many thresholding rules for estimating sparse covariance matrices. Two popular choices are the t-statistic-based adaptive thresholding (Cai and Liu, 2011) and correlation-based adaptive thresholding (Fan et al., 2013), with the entry-wise thresholding level chosen to be $ω ≍ K \sqrt{\frac{log p}{n}}$ . As the sparsity pattern of correlation and covariance are the same and the correlation matrix is scale-invariant, one typically applies the thresholding on the correlation and then scales it back to the covariance. Except for the number of factors K, this coincides with the commonly-used threshold for estimating sparse covariance matrices.

While it is not possible to achieve better convergence of Σ in terms of the operator norm or the Frobenius norm, Fan et al. (2011) considered two other important norms. Under regularity conditions, it is shown that

{‖ \hat{Σ} - Σ ‖}_{Σ} = O_{ℙ} (m_{0} K \sqrt{\frac{log p}{n}} + \frac{K \sqrt{p} log p}{n}), {‖ \hat{Σ} - Σ ‖}_{max} = O_{ℙ} (K \sqrt{\frac{log p}{n}} + K^{2} \sqrt{\frac{log n}{n}}) .

(3.1)

Here for $A \in ℝ^{p \times p}$ , ${‖ A ‖}_{Σ}$ and ${‖ A ‖}_{max}$ refer to its entropy-loss norm $p^{- 1 / 2} {‖ Σ^{- 1 / 2} A Σ^{- 1 / 2} ‖}_{F}$ and entry-wise max-norm max_i,j |A_ij|. As is pointed out by Fan et al. (2011) and Wang and Fan (2017), they are relevant to portfolio selection and risk management. In addition, convergence rates for ${‖ {\hat{Σ}}^{- 1} - Σ^{- 1} ‖}_{2}$ , ${‖ {\hat{Σ}}_{u} - Σ_{u} ‖}_{2}$ and ${‖ {\hat{Σ}}_{u}^{- 1} - Σ_{u}^{- 1} ‖}_{2}$ are also established.

Now we come to covariance estimation with latent factors. As is mentioned in Section 2.1, the Pervasiveness Assumption 2.1 helps separate the low-rank part BB^⊤ from the sparse part Σ_u in (2.1). Fan et al. (2013) proposed a Principal Orthogonal complEment Thresholding (POET) estimator, motivated by the relationship between PCA and factor model, and the estimation of sparse covariance matrix Σ_u in Fan et al. (2011). The procedure is described as follows.

Step 1. Let $S = \frac{1}{n} \sum_{i = 1}^{n} x_{i} x_{i}^{⊤}$ be the sample covariance matrix, ${{\hat{λ}}_{j}}_{j = 1}^{p}$ be the eigenvalues of S in non-ascending order, ${{\hat{ξ}}_{j}}_{j = 1}^{p}$ be their corresponding eigenvectors.

Step 2. Apply thresholding to $S_{u} = S - \sum_{j = 1}^{K} {\hat{λ}}_{j} {\hat{ξ}}_{j} {\hat{ξ}}_{j}^{⊤}$ and obtain a regularized estimator ${\hat{Σ}}_{u}$ .

Step 3. The final estimator is $\hat{Σ} = \sum_{j = 1}^{K} {\hat{λ}}_{j} {\hat{ξ}}_{j} {\hat{ξ}}_{j}^{⊤} + {\hat{Σ}}_{u}$ .

Here K is assumed to be known and bounded to simplify presentation and emphasize the main ideas. The methodology and theory in Fan et al. (2013) also allow using a data-driven estimate $\hat{K}$ of K. In Step 2 above we can choose from a large class of thresholding rules, and it is recommended to use the correlation-based adaptive thresholding. However, the thresholding level should be set to $\tilde{ω} ≍ \sqrt{\frac{log p}{n}} + \frac{1}{\sqrt{p}}$ . Compared to the level $\sqrt{\frac{log p}{n}}$ we use in covariance estimation with observed factors, the extra term $1 / \sqrt{p}$ here is the price we pay for not knowing the latent factors. It can be negligible when p grows much faster than n. Intuitively, thanks to the Pervasiveness Assumption, the latent factors can be estimated accurately in high dimensions. Fan et al. (2013) obtained theoretical guarantees for the POET that are similar to (3.1). The analysis allows for general sparsity patterns of Σ_u by considering m_q as the measure of sparsity for q ∈ [0, 1).

Robust procedures handling heavy-tailed data are proposed and analyzed by Fan et al. (2018a,b). In another line of research, Li et al. (2017) considered estimation of the covariance matrix of a set of targeted variables, when additional data beyond the variables of interest are available. By assuming a factor model structure, they constructed an estimator taking advantage of all the data and justified the information gain theoretically.

The Pervasiveness Assumption rules out the case where factors are weak and the leading eigenvalues of Σ are not as large as O(p). Shrinkage of eigenvalues is a powerful technique in this scenario. Donoho et al. (2013) systematically studied the optimal shrinkage in spiked covariance model where all the eigenvalues except several largest ones are assumed to be the same. Wang and Fan (2017) considered the approximate factor model, which is more general, and proposed a new version of POET with shrinkage for covariance estimation.

3.2. Principal component regression with random sketch

Principal component regression (PCR), first proposed by Hotelling (1933) and Kendall (1965), is one of the most popular methods of dimension reduction in linear regression. It employs the principal components of the predictors x_i to explain or predict the response y_i. Why do principal components, not other components, have more prediction power? Here we offer an insight from the perspective of high-dimensional factor models.

The basic assumption is that the unobserved latent factors $f_{i} \in ℝ^{K}$ drive simultaneously the covariates via (1.1) and responses, as shown in Figure 3. As a specific example, we assume

y_{i} = θ^{* ⊤} f_{i} + ε_{i}, i = 1, \dots, n, or in matrix form, y = F θ^{*} + ε,

where y = (y₁, …, y_n)^⊤ and the noise ε = (ε₁, …, ε_n)^⊤ has zero means. Since f_i is latent and the covariate vector is high dimensional, we naturally infer the latent factors from the observed covariates via PCA. This yields the PCR.

Fig 3. — *Illustration of the data generation mechanism in PCR. Both predictors* x_i *and responses y*_i *are driven by the latent factors* f_i. PCR extracts latent factors via the principal components from X, and uses the resulting estimate $\hat{F}$ as the new predictor. Regressing y *against $\hat{F}$ leads to the PCR estimator $\hat{θ} \in ℝ^{K}$ , which typically enjoys a smaller variance due to its reduced dimension, though it introduces bias.*

By (2.2) (assume μ = 0 for simplicity), $y_{i} \approx {(β^{†})}^{⊤} x_{i} + ε_{i}$ , where $β^{†} : = B {(B^{⊤} B)}^{- 1} θ^{*} \in ℝ^{p}$ . This suggests that if we directly regress y_i over x_i, then the regression coefficient β^† should lie in the column space spanned by B. This inspires the core idea of PCR, i.e., instead of seeking the least square estimator in the entire $ℝ^{p}$ space, we restrict our search scope to be the left leading singular space of X, which is approximately the column space of B under the Pervasiveness Assumption.

Let us discuss PCR more rigorously. To be consistent with the rest of this paper, we let $X \in ℝ^{p \times n}$ , which is different from conventions, and

y_{i} = x_{i}^{⊤} β^{*} + ε_{i}, i = 1, \dots, n, or in matrix form, y = X^{⊤} β^{*} + ε .

(3.2)

Let $X = (x_{1}, \dots, x_{n}) = PΣ Q^{⊤}$ be the SVD of X, where $Σ = diag (σ_{1}, \dots, σ_{min (n, p)})$ with non-increasing singular values. For some integer K satisfying 1 ≤ K ≤ min(n, p), write P = (P_K, P_K+) and Q = (Q_K, Q_K+). The PCR estimator ${\hat{β}}_{K}$ solves the following optimization problem:

{\hat{β}}_{K} : = {argmin}_{P_{K +}^{⊤} β = 0} {‖ y - X^{⊤} β ‖}_{2} .

(3.3)

It is easy to verify that

{\hat{β}}_{K} = P_{K} Σ_{K}^{- 1} Q_{K}^{⊤} y = P_{K} P_{K}^{⊤} β^{*} + P_{K} Σ_{K}^{- 1} Q_{K}^{⊤} ε,

(3.4)

where $Σ_{K} \in ℝ^{K \times K}$ is the top left submatrix of Σ. The following lemma calculates the excess risk of ${\hat{β}}_{K}$ , i.e., $E ({\hat{β}}_{K}) : = E_{ε} [{‖ X^{⊤} {\hat{β}}_{K} - X^{⊤} β^{*} ‖}_{2}^{2} / n]$ , treating X as fixed. The proof is relegated to the appendix.

Lemma 3.1. Let $p_{1}, \dots, p_{min {n, p}} \in ℝ^{p}$ be the column vectors of P. For j = 1, …, p, denote $α_{j} = {(β^{*})}^{⊤} p_{j}$ . We have

E ({\hat{β}}_{K}) = \frac{K σ^{2}}{n} + \sum_{j = K + 1}^{p} λ_{j}^{2} α_{j}^{2} .

Define the ordinary least squares (OLS) estimator $\hat{β} : = {(X X^{⊤})}^{- 1} Xy$ . Note that $E (\hat{β}) = E_{ε} [{‖ X^{⊤} \hat{β} - X^{⊤} β^{*} ‖}_{2}^{2} / n] = min (n, p) σ^{2} / n$ . Comparing $E ({\hat{β}}_{K})$ and $E (\hat{β})$ , one can clearly see a variance-bias tradeoff: PCR reduces the variance by introducing a bias term $\sum_{j} λ_{j}^{2} α_{j}^{2}$ , which is typically small and vanishes in the ideal case $P_{K +}^{⊤} β^{*} = 0$ —this is the bias incurred by imposing the constraint in (3.3).

In the high-dimensional setting where p is large, calculating P_K using SVD is computationally expensive. Recently, sketching has gained growing attention in statistics community and is used for downscaling and accelerating inference tasks with massive data. See recent surveys by Woodruff (2014) and Yang et al. (2016). The essential idea is to multiply the data matrix by a sketch matrix to reduce its dimension while still preserving the statistical performance of the procedure, since random projection reduces the strength of the idiosyncratic noise. To apply sketching to PCR, we first multiply the design matrix X by an appropriately chosen matrix $R \in ℝ^{p \times m}$ with K ≤ m < p:

\tilde{X} : = R^{⊤} X,

(3.5)

where R is called the “sketching matrix”. This creates m indices based on X. From the factor model perspective (assuming μ = 0), with a proper choice of R, we have $\tilde{X} \approx R^{⊤} B F^{⊤}$ , since the idiosyncratic components in (1.1) is averaged out due to weak dependence of u. Hence, the indices in $\tilde{X}$ are approximately linear combinations of the factors ${f_{i}}_{i = 1}^{n}$ . At the same time, since m ≥ K and R is nondegenerate, the row space of $\tilde{X}$ is approximately the same as that spanned by F^⊤. This shows running linear regression on $\tilde{X}$ is approximately the same as running it on F^⊤, without using the computationally expensive PCA.

We now examine the property of sketching approach beyond the factor models. Let $\tilde{X} = \tilde{P} \tilde{Σ} {\tilde{Q}}^{⊤}$ be the SVD of $\tilde{X}$ , and write $\tilde{P} = ({\tilde{P}}_{K}, {\tilde{P}}_{K +})$ and $\tilde{Q} = ({\tilde{Q}}_{K}, {\tilde{Q}}_{K +})$ . Imitating the form of (3.4), we consider the following sketched PCR estimator:

{\tilde{β}}_{K} : = R {\tilde{P}}_{K} {\tilde{Σ}}_{K}^{- 1} {\tilde{Q}}_{K}^{⊤} y,

(3.6)

where ${\tilde{Σ}}_{K} \in ℝ^{K \times K}$ is the top left submatrix of $\tilde{Σ}$ .

We now explain the above construction for ${\tilde{β}}_{K}$ . It is easy to derive from (3.4) that given R^⊤X and y as the design matrix and response vector, the PCR estimator should be ${\tilde{β}}_{K}^{0} : = {\tilde{P}}_{K} {\tilde{Σ}}_{K}^{- 1} {\tilde{Q}}_{K}^{⊤} y$ . Then the corresponding PCR projection of y onto R^⊤X should be $X^{⊤} R {\tilde{β}}_{K}^{0} = X^{⊤} R {\tilde{P}}_{K} {\tilde{Σ}}_{K}^{- 1} {\tilde{Q}}_{K}^{⊤} y = X^{⊤} {\tilde{β}}_{K}$ . This leads to the construction of ${\tilde{β}}_{K}$ in (3.6). Theorem 4 in Mor-Yosef and Avron (2018) gives the excess risk of ${\tilde{β}}_{K}$ , which holds for any R satisfying the conditions of the theorem.

Theorem 3.1. Assume m ≥ K and rank(R^⊤X) ≥ K. If ${‖ sin Θ ({\tilde{P}}_{K}, P_{K}) ‖}_{2} \leq ν < 1$ , then

E ({\tilde{β}}_{K}) \leq E ({\hat{β}}_{K}) + \frac{(2 ν + ν^{2}) {‖ X^{⊤} β^{*} ‖}_{2}^{2}}{n},

(3.7)

This theorem shows that the extra bias induced by sketching is $(2 ν + ν^{2}) {‖ X^{⊤} β^{*} ‖}_{2}^{2} / n$ . Given the bound of $E ({\hat{β}}_{K})$ in Lemma 3.1, we can deduce that

E ({\tilde{β}}_{K}) \leq \frac{K σ^{2}}{n} + \sum_{j = K + 1}^{p} α_{j}^{2} σ_{j}^{2} + \frac{(2 ν + ν^{2}) {‖ X^{⊤} β^{*} ‖}_{2}^{2}}{n} .

As we will see below, a smaller ν requires a larger m, and thus more computation. Therefore, we observe a tradeoff between statistical accuracy and computational resources: if we have more computational resources, we can allow a large dimension of sketched matrix $\tilde{X}$ , and the sketched PCR is more accurate, and vice versa.

One natural question thus arises: which R should we choose to guarantee a small ν to retain the statistical rate of ${\hat{β}}_{K}$ ? Recent results (Cohen et al., 2015) on approximate matrix multiplication (AMM) suggest several candidate sketching matrices for R. Define the stable rank $sr (X) : = {‖ X ‖}_{F}^{2} / {‖ X ‖}_{2}^{2}$ , which can be interpreted as a soft version of the usual rank—indeed, sr(X) ≤ rank(X) always holds, and sr(X) can be small if X is approximately low-rank. An example of candidate sketching matrices for R is a random matrix with independent and suitably scaled sub-Gaussian entries. As long as the sketch size m = Ω(sr(X) + log(1/δ)/ε²), it will hold for any ε, δ ∈ (0, 1/2) that

ℙ ({‖ X^{⊤} R R^{⊤} X - X^{⊤} X ‖}_{2}^{2} \geq ε {‖ X ‖}_{2}^{2}) \leq δ .

(3.8)

Combining this with the Davis-Kahan Theorem (Corollary 2.1), we can deduce that ${‖ sin Θ ({\tilde{P}}_{K}, P_{K}) ‖}_{2}$ is small with certain eigen-gap condition. We summarize our argument by presenting a corollary of Theorem 9 in Mor-Yosef and Avron (2018) below. Readers can find more candidate sketching matrices in the examples after Theorem 1 in Cohen et al. (2015).

CorollarY 3.1. For any ν, δ ∈ (0, 1/2), let

ε = ν {(1 + ν)}^{- 1} (σ_{K}^{2} - σ_{K + 1}^{2}) / σ_{1}^{2} .

Let $R \in ℝ^{p \times m}$ a random matrix with i.i.d. $N (0, 1 / m)$ entries. Then there exists a universal constant C > 0 such that for any δ > 0, if m ≥ C(sr(X)+ log(1/δ)/ε²), it holds with probability at least 1 − δ that

E ({\tilde{β}}_{K}) \leq E ({\hat{β}}_{K}) + \frac{(2 ν + ν^{2}) {‖ X^{⊤} β^{*} ‖}_{2}^{2}}{n} .

(3.9)

Remark 3.1. Note that $ε \leq ν (σ_{K}^{2} - σ_{K + 1}^{2}) / σ_{1}^{2}$ , and this bound is tight with a small ν. Some algebra yields that (3.9) holds when

m = Ω (sr (X) + \frac{σ_{1}^{2} log (1 / δ)}{ν^{2} {(σ_{K}^{2} - σ_{K + 1}^{2})}^{2}}) .

One can see that reducing ν requires a larger sketch size m. Besides, a large eigengap of the design matrix X helps reduce the required sketch size.

3.3. Factor-Adjust Robust Multiple (FARM) tests

Large-scale multiple testing is a fundamental problem in high-dimensional inference. In genome-wide association studies and many other applications, tens of thousands of hypotheses are tested simultaneously. Standard approaches such as Benjamini and Hochberg (1995) and Storey (2002) can not control well both false and missed discovery rates in the presence of strong correlations among test statistics. Important efforts on dependence adjustment include Efron (2007), Friguet et al. (2009), Efron (2010), and Desai and Storey (2012). Fan et al. (2012) and Fan and Han (2017) considered FDP estimation under the approximate factor model. Wang et al. (2017) studied a more complicated model with both observed variables and latent factors. All these existing papers heavily rely on the joint normality assumption of the data, which is easily violated in real applications. A recent paper (Fan et al., 2017a) developed a factor-adjusted robust procedure that can handle heavy-tailed data while controlling FDP. We are going to introduce this method in this subsection.

Suppose our i.i.d. observations ${x_{i}}_{i = 1}^{n}$ satisfy the approximate factor model (1.1) where $μ \in ℝ^{p}$ is an unknown mean vector. To make the model identifiable, we use the Identifiability Assumption 1.1. We are interested in simultaneously testing

H_{0 j} : μ_{j} = 0 versus H_{1 j} : μ_{j} \neq 0, for j \in [p] .

Let T_j be a generic test statistic for H_0j. For a pre-specified level z > 0, we reject H_0j whenever |T_j| ≥ z. The numbers of total discoveries R(z) and false discoveries V (z) are defined as

R (z) = # {j : | T_{j} | \geq z} and V (z) = # {j : | T_{j} | \geq z, μ_{j} = 0} .

Note that R(z) is observable while V (z) needs to be estimated. Our goal is to control the false discovery proportion FDP(z) = V (z)/R(z) with the convention 0/0 = 0.

Naïve tests based on sample averages $\frac{1}{n} \sum_{i = 1}^{n} x_{i}$ suffer from size distortion of FDP control due to dependence of common factors in (1.1). On the other hand, the factor-adjusted test based on the sample averages of x_i − Bf_i (B and f_i need to be estimated) has two advantages: the noise u_i is now weakly dependent so that FDP can be controlled with high accuracy, and the variance of u_i is smaller than that of Bf_i + u_i in model (1.1), so that it is more powerful. This will be convincingly demonstrated in Figure 5 below. The factor-adjusted robust multiple test (FarmTest) is a robust implementation of the above idea (Fan et al., 2017a), which replaces the sample mean by its adaptive Huber estimation and extracts latent factors from a robust covariance input.

Fig 5. — *Histograms of four different mean estimators for simultaneous inference. Fix n* = 100, p = 500 *and K* = 3, and data are generated i.i.d. from t₃*, which is heavy-tailed. Dashes lines correspond to μ*_j = 0 *and μ*_j = 0.6, which is unknown. Robustification and factor adjustment help distinguish nulls and alternatives.

To begin with, we consider the Huber loss (Huber, 1964) with the robustification parameter τ ≥ 0:

ℓ_{τ} (u) = {\begin{array}{l} u^{2} / 2, & if | u | \leq τ \\ τ | u | - τ^{2} / 2, & if | u | > τ \end{array},

and use ${\hat{μ}}_{j} = {argmax}_{θ \in ℝ} \sum_{i = 1}^{n} ℓ_{τ} (x_{i j} - θ)$ as a robust M -estimator of μ_j. Fan et al. (2017a) suggested choosing $τ ≍ \sqrt{n / \log (n p)}$ to deal with possible asymmetric distribution and called it adaptive Huber estimator. They showed, assuming bounded fourth moments only, that

\sqrt{n} ({\hat{μ}}_{j} - μ_{j} - b_{j}^{⊤} \bar{f}) = N (0, σ_{u, j j}) + o_{ℙ} (1) uniformly over j \in [p];

(3.10)

where $\bar{f} = \frac{1}{n} \sum_{i = 1}^{n} f_{i}$ , and σ_u,jj is the (j, j)th entry of Σ_u as is defined in (2.1). Assuming for now that ${b_{j}}_{j = 1}^{p}$ , $\bar{f}$ and ${σ_{u, j j}}_{j = 1}^{p}$ are all observable, then the factor-adjusted test statistic $T_{j} = \sqrt{n / σ_{u, j j}} ({\hat{μ}}_{j} - b_{j}^{⊤} \bar{f})$ is asymptotically $N (0, 1)$ . The law of large numbers implies that V (z) should be close to 2p₀Φ(−z) for z ≥ 0, where Φ(·) is the cumulative distribution function of $N (0, 1)$ , and p₀ = #{j : μ_j = 0} is the number of true nulls. Hence

FDP (z) = \frac{V (z)}{R (z)} \approx \frac{2 p_{0} Φ (- z)}{R (z)} \leq \frac{2 p Φ (- z)}{R (z)} = : {FDP}^{A} (z) .

Note that in the high-dimensional and sparse regime, we have p₀ = p − o(p) and thus FDP^A(z) is only a slightly conservative surrogate. However, we can also estimate the proportion π₀ = p₀/p and use less conservative estimate ${FDP}^{A} (z) = 2 p {\hat{π}}_{0} Φ (- z) / R (z)$ instead, where ${\hat{π}}_{0}$ is an estimate of π₀ whose idea is depicted in Figure 4; see Storey (2002). Finally, we define the critical value z_α = inf{z ≥ 0 : FDP^A(z) ≤ α} and reject H_0j whenever |T_j| ≥ z_α.

Fig 4. — Estimation of proportion of true nulls. The observed P-values (right panel) consist of those from significant variables (genes), which are usually small, and those from insignificant variables, which are uniformly distributed. Assuming the P-values for significant variables are mostly less than λ *(taken to be 0.5 in this illustration, left panel), the contributions of observed P-values* > λ *are mostly from true nulls and this yields a natural estimator* ${\hat{π}}_{0} (λ) = \frac{1}{(1 - λ)} \sum_{j = 1}^{p} 1 ({\hat{P}}_{j} > λ)$ , *which is the average height of the histogram with P-values* > λ *(red line). Note that the histograms above the red line estimates the distributions of P-values from the significant variavles (genes) in the left panel*.

In practice, we have no access to ${b_{j}}_{j = 1}^{p}$ , $\bar{f}$ or ${σ_{u, j j}}_{j = 1}^{p}$ in (3.10) and need to use their estimates. This results in the Factor-Adjusted Robust Multiple test (FarmTest) in Fan et al. (2017a). The inputs include ${x_{i}}_{i = 1}^{n}$ , a generic robust covariance matrix estimator $\hat{Σ} \in ℝ^{p \times p}$ from the data, a pre-specified level α ∈ (0, 1) for FDP control , the number of factors K, and the robustification parameters γ and ${τ_{j}}_{j = 1}^{p}$ . Note that K can be estimated by the methods in Section 2.2, and overestimating K has little impact on final outputs.

Step 1. Denote by $\hat{Σ} \in ℝ^{p \times p}$ a generic robust covariance matrix estimator. Compute the eigen-decomposition of $\hat{Σ}$ , set ${{\hat{λ}}_{j}}_{j = 1}^{K}$ to be its top K eigenvalues in descending order, and ${{\hat{v}}_{j}}_{j = 1}^{K}$ to be their corresponding eigenvectors. Let $\hat{B} = ({\tilde{λ}}_{1}^{1 / 2} {\hat{v}}_{1}, \dots, {\tilde{λ}}_{1}^{1 / 2} {\hat{v}}_{K}) \in ℝ^{p \times K}$ where ${\tilde{λ}}_{j} = \max {{\hat{λ}}_{j}, 0}$ , and denote its rows by ${{\hat{b}}_{j}}_{j = 1}^{p}$ .

Step 2. Let ${\bar{x}}_{j} = \frac{1}{n} \sum_{i = 1}^{n} x_{i j}$ for j ∈ [p] and $\hat{f} = {argmax}_{f \in ℝ^{K}} \sum_{j = 1}^{p} ℓ_{γ} ({\bar{x}}_{j} - {\hat{b}}_{j}^{⊤} f)$ . Construct factor-adjusted test statistics

T_{j} = \sqrt{n / {\hat{σ}}_{u, j j}} ({\hat{μ}}_{j} - {\hat{b}}_{j}^{⊤} \hat{f}) for j \in [p],

(3.11)

where ${\hat{σ}}_{u, j j} = {\hat{θ}}_{j} - {\hat{μ}}_{j}^{2} - {‖ {\hat{b}}_{j} ‖}_{2}^{2}$ , ${\hat{θ}}_{j} = {argmin}_{θ \geq {\hat{μ}}_{j}^{2} + {‖ {\hat{b}}_{j} ‖}_{2}^{2}} ℓ_{τ_{j}} (x_{i j}^{2} - θ)$ .

Step 3. Calculate the critical value z_α = inf{z ≥ 0 : FDP^A(z) ≤ α}, where ${FDP}^{A} (z) = 2 {\hat{π}}_{0} p Φ (- z) / R (z)$ , and reject H_0j whenever |T_j| ≥ z_α.

In Step 2, we estimate $\bar{f}$ based on ${\bar{x}}_{j} = μ_{j} + {\hat{b}}_{j}^{⊤} \bar{f} + {\bar{u}}_{j}$ , which is implied by the factor model (1.1), and regard non-vanishing μ_j as an outlier. In the estimation of σ_u,jj, we used the identity $θ_{j} : = E x_{i j}^{2} = μ_{j}^{2} + {‖ b_{j} ‖}^{2} + σ_{u, j j}$ and robustly estimated the second moment θ_j.

Figure 5 is borrowed from Figure 1 in Fan et al. (2017a) that illustrates the effectiveness of this procedure. Here n = 100, p = 500, K = 3, $f_{i} ~ N (0, I_{3})$ and the entries of u_i are generated independently from the t-distribution with 3 degrees of freedom. It is known that t—distributions are not sub-Guassian variables and are often used to model heavy-tailed data. The unknown means $μ \in ℝ^{p}$ are fixed as μ_j = 0.6 for j ≤ 125 and μ_j = 0 otherwise. We plot the histograms of sample means, robust mean estimators, and their counterparts with factor-adjustment. The latent factors and heavy-tailed errors make it difficult to distinguish μ_j = 0.6 from μ_j = 0, and that explains why the sample means behave poorly. As is shown in Figure 5, better separation can be obtained by factor adjustment and robustification.

While existing literature usually imposes the joint normal assumption on ${f_{i}, u_{i}}_{i = 1}^{n}$ , the FarmTest only requires the coordinates of ${u_{i}}_{i = 1}^{n}$ to have bounded fourth-order moments, and ${f_{i}}_{i = 1}^{n}$ to be sub-Gaussian. Under standard regularity conditions for the approximate factor model, it is proved by Fan et al. (2017a) that

{FDA}^{A} (z) - FDP (z) = o_{ℙ} (1) .

We see that FDP^A is a valid approximation of FDP, which is therefore faithfully controlled by the FARM-Test.

3.4. Factor-Adjusted Robust Model (FARM) selection

Model selection is one of the central tasks in high dimensional data analysis. Parsimonious models enjoy interpretability, stability and oftentimes, better prediction accuracy. Numerous methods for model selection have been proposed in the past two decades, including, Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), the elastic net Zou and Hastie (2005), the Dantzig selector (Candes and Tao, 2007), among others. However, these methods work only when the covariates are weakly dependent or statisfy certain regularity conditions (Zhao and Yu, 2006; Bickel et al., 2009). When covariates are strongly correlated, Paul et al. (2008); Kneip and Sarda (2011); Wang (2012); Fan et al. (2016a) used factor model to eliminate the dependencies caused by pervasive factors, and to conduct model selection using the resulting weakly correlated variables.

Assume that ${x_{i}}_{i = 1}^{n}$ follow the approximate factor model (1.1). As a standard assumption, the coordinates of $w_{i} = {(f_{i}^{⊤}, u_{i}^{⊤})}^{⊤} \in ℝ^{K + p}$ are weakly dependent. Thanks to this condition and the decomposition

x_{i}^{⊤} β = {(μ + B f_{i} + u_{i})}^{⊤} β = α + u_{i}^{⊤} β + f_{i}^{⊤} γ

(3.12)

where α = μ^T β and γ = B^⊤β. we may treat w_i as the new predictors. In other words, by lifting the number of variables from p to p + K, the covariates of w_i are now weakly dependent. The usual regularized estimation can now be applied to this new set of variables. Note that we regard the coefficients B^⊤β as free parameters to facilitate the implementation (ignoring the relation γ = B^⊤β) and this requires an additional assumption to make this valid (Fan et al., 2016a).

Suppose we wish to fit a model $y_{i} = g (x_{i}^{⊤} β, ε_{i})$ via a loss function $L (y_{i}, x_{i}^{⊤} β)$ . The above idea suggests the following two-step approach, which is called Factor-Adjusted Regularized (or Robust when so implemented) Model selection (FarmSelect) (Fan et al., 2016a).

Step 1: Factor estimation. Fit the approximate factor model (1.1) to get $\hat{B}$ , ${\hat{f}}_{i}$ and ${\hat{u}}_{i} = x_{i} - \hat{B} {\hat{f}}_{i}$ .

Step 2: Augmented regularization. Find α, β and γ to minimize

\sum_{i = 1}^{n} L (y_{i}, α + {\hat{u}}_{i}^{⊤} β + {\hat{f}}_{i}^{⊤} γ) + \sum_{j = 1}^{p} p_{λ} (| β_{j} |),

where p_λ(·) is a folded concave penalty (Fan and Li, 2001) with parameter λ.

In Step 1, standard estimation procedures such as POET (Fan et al., 2013) and S-POET (Wang and Fan, 2017) can be applied, as long as they produce consistent estimators of B, ${f_{i}}_{i = 1}^{n}$ and ${u_{i}}_{i = 1}^{n}$ . Step 2 is carried out using usual regularization methods with new covariates.

Figure 6, borrowed from Figure 3 (a) in Fan et al. (2016a), shows that the proposed method outperforms other popular ones for model selection including Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001) and elastic net (Zou and Hastie, 2005), in the presence of correlated covariates. The basic setting is sparse linear regression y = x^⊤β* + ε with p = 500 and n growing from 50 to 160. The true coefficients are β* = (β₁, ··· , β₁₀, 0_p−10)^⊤, where ${β_{j}}_{j = 1}^{10}$ are drawn uniformly at random from [2, 5], and $ε \sim N (0, 0.3)$ . The correlation structure of covariates x is calibrated from S&P 500 monthly excess returns between 1980 and 2012.

Fig 6. — *Model selection consistency rate, i.e., the proportion of simulations that the selected model is identical to the true one, with p* = 500 *and n varying from* 50 to 160. With moderate sample size, the proposed method faithfully identifies the correct model while other methods cannot.

Under the generalized linear model, L(y, z) = −yz + b(z) and b(·) is a convex function. Fan et al. (2016a) analyzed theoretical properties of the above procedure. As long as the coordinates of w_i (rather than x_i) are not too strongly dependent and the factor model is estimated to enough precision, $\hat{β}$ enjoys optimal rates of convergence ${‖ \hat{β} - β^{*} ‖}_{q} = O_{ℙ} ({| S |}^{1 / q} \sqrt{\log p / n})$ , where q = 1, 2 or ∞. When the minimum entry of |β*| is at least $Ω (\sqrt{\log p / n})$ , the model selection consistency is achieved.

When we use the square loss, this method reduces to the one in Kneip and Sarda (2011). By using the square loss and replacing the penalized multiple regression in Step 2 with marginal regression, we recover the factor-profiled variable screening method in Wang (2012). While these papers aim at modeling and then eliminating the dependencies in x_i via (1.1), Paul et al. (2008) used a factor model to characterize the joint distribution of ${(y_{i}, x_{i}^{⊤})}^{⊤}$ and develops a related but different approach.

4. RELATED LEARNING PROBLEMS

4.1. Gaussian mixture model

PCA, or more generally, spectral decomposition, can be also applied to learn mixture models for heterogeneous data. A thread of recent papers (Hsu and Kakade, 2013; Anandkumar et al., 2014; Yi et al., 2016; Sedghi et al., 2016) apply spectral decomposition to lower-order moments of the data to recover the parameters of interest in a wide class of latent variable models. Here we use the Gaussian mixture model to illustrate their idea. Consider a mixture of K Gaussian distributions with spherical covariances. Let w_k ∈ (0, 1) be the probability of choosing component k ∈ {1, … , K}, and ${μ_{1}, \dots, μ_{k}} \subseteq ℝ^{p}$ be the component mean vectors, and ${σ_{k}^{2} I_{p}}_{k = 1}^{K}$ be the component covariance matrices, which is required by Hsu and Kakade (2013) and Anandkumar et al. (2014). Each data vector $x ~ w_{1} N (μ_{1}, σ_{1}^{2} I_{p}) + \dots + w_{K} N (μ_{K}, σ_{K}^{2} I_{p})$ follows the mixture of the Gaussian distribution. The parameters of interest are ${w_{k}, μ_{k}, σ_{k}^{2}}_{k = 1}^{K}$ .

Hsu and Kakade (2013) and Anandkumar et al. (2014) shed lights on the close connection between the lower-order moments of the data and the parameters of interest, which motivates the use of Method of Moments (MoM). Denote the population covariance $E [(x - E x) {(x - E x)}^{⊤}]$ by Σ. Below we present Theorem 1 in Hsu and Kakade (2013) to elucidate the moment structure of the problem.

Theorem 4.1. Suppose that ${μ_{k}}_{k = 1}^{K}$ are linearly independent. Then the average variance $σ_{ave}^{2} : = K^{- 1} \sum_{k = 1}^{K} σ_{k}^{2}$ is the smallest eigenvalue of Σ. Let v be any eigenvector of Σ that is associated with the eigenvalue $σ_{ave}^{2}$ . Define the following quantities:

M_{1} = E [{(v^{⊤} (x - E x))}^{2} x] \in ℝ^{p},

M_{2} = E [x \otimes x] - σ_{ave}^{2} \cdot I_{p} \in ℝ^{p \times p},

M_{3} = E [x \otimes x \otimes x] - \sum_{j = 1}^{p} (M_{1} \otimes e_{j} \otimes e_{j} + e_{j} \otimes M_{1} \otimes e_{j} + e_{j} \otimes e_{j} \otimes M_{1}) \in ℝ^{p \times p \times p} .

Then we have

M_{1} = \sum_{k = 1}^{K} w_{k} σ_{k}^{2} μ_{k}, M_{2} = \sum_{k = 1}^{K} w_{k} μ_{k} \otimes μ_{k}, M_{3} = \sum_{k = 1}^{K} w_{k} μ_{k} \otimes μ_{k} \otimes μ_{k},

(4.1)

where the notation ⊗ represents the tensor product.

Theorem 4.1 gives the relationship between the moments of the first three orders of x and the parameters of interest. With ${M_{i}}_{i = 1}^{3}$ replaced by their empirical versions, the remaining task is to solve for all the parameters of interest via (4.1). Hsu and Kakade (2013) and Anandkumar et al. (2014) proposed a fast method called robust tensor power method to compute the estimators. The crux therein is to construct an estimable third-order tensor ${\tilde{M}}_{3}$ that can be decomposed as the sum of orthogonal tensors based on μ_k. This orthogonal tensor decomposition can be regarded as an extension of spectral decomposition to third-order tensors (simply speaking, three-dimensional arrays). Then the power iteration method is applied to the estimate of ${\tilde{M}}_{3}$ to recover each μ_k, as well as other parameters.

Specifically, consider first the following linear transformation of μ_k:

{\tilde{μ}}_{k} : = \sqrt{ω_{k}} W^{⊤} μ_{k}

(4.2)

for k ∈ [K], where $W \in ℝ^{p \times K}$ . The key is to use the whitening transformation by setting W to be a square root of M₂. This ensures that ${{\tilde{μ}}_{k}}_{k = 1}^{K}$ are orthogonal to each other. Denoting a^⊗3 = a ⊗ a ⊗ a,

{\tilde{M}}_{3} : = \sum_{k = 1}^{K} ω_{k} {(W^{⊤} μ_{k})}^{\otimes 3} = \sum_{k = 1}^{K} \frac{1}{\sqrt{ω_{k}}} {\tilde{μ}}_{k}^{\otimes 3} \in ℝ^{K \times K \times K}

(4.3)

is an orthogonal tensor decomposition; that is, it satisfies orthogonality of ${{\tilde{μ}}_{k}}_{k = 1}^{K}$ . The following theorem from Anandkumar et al. (2014) summarizes the above argument, and more importantly, it shows how to obtain μ_k back from ${\tilde{μ}}_{k}$ .

Theorem 4.2. Suppose the vectors ${{\tilde{μ}}_{k}}_{k = 1}^{K}$ are linearly independent, and the scalars ${ω_{k}}_{k = 1}^{K}$ are strictly positive. Let M₂ = UDU^⊤ be the spectral decomposition of M₂ and let W = UD^−1/2. Then ${{\tilde{μ}}_{k}}_{k = 1}^{K}$ in (4.2) are orthogonal to each other. Furthermore, the Moore-Penrose pseudo-inverse of W is $W^{†} : = D^{1 / 2} u^{⊤} \in ℝ^{K \times p}$ , and we have $μ_{k} = {(W^{†})}^{⊤} {\tilde{μ}}_{k} / \sqrt{ω_{k}}$ for k ∈ [K].

As promised, the orthogonal tensor ${\tilde{M}}_{3}$ can be estimated from empirical moments. We will make use of the following identity, which is similar to Theorem 4.1.

{\tilde{M}}_{3} = E [{(W^{⊤} x)}^{\otimes 3}] - \sum_{j = 1}^{p} \sum_{cyc} (W^{⊤} M_{1}) \otimes (W^{⊤} e_{j}) \otimes (W^{⊤} e_{j}),

(4.4)

where we used the cyclic sum notation

\sum_{cyc} a \otimes b \otimes c : = a \otimes b \otimes c + b \otimes c \otimes a + c \otimes a \otimes b .

Note that $W^{⊤} e_{j} \in ℝ^{K}$ is simply the jth row of W. To obtain an estimate of ${\tilde{M}}_{3}$ , we replace the expectation $E$ by the empirical average, and substitute W and M₁ by their plug-in estimates. It is worth mentioning that, because ${\tilde{M}}_{3}$ has a smaller size than M₃, computations involving ${\tilde{M}}_{3}$ can be implemented more efficiently.

Once we obtain an estimate of ${\tilde{M}}_{3}$ , which we denote by ${\bar{M}}_{3}$ , to recover ${μ_{k}}_{k = 1}^{K}$ , ${ω_{k}}_{k = 1}^{K}$ and ${σ_{k}^{2}}_{k = 1}^{K}$ , the only task left is computing the orthogonal tensor decomposition (4.3) for ${\bar{M}}_{3}$ . The tensor power method in Anandkumar et al. (2014) is shown to solve this problem with provable computational guarantees. We omit the details of the algorithm here. Interested readers are referred to Section 5 of Anandkumar et al. (2014) for the introduction and analysis of this algorithm.

To conclude this subsection, we summarize the entire procedure of estimating ${μ_{k}, σ_{k}, ω_{k}}_{k = 1}^{K}$ as below.

Step 1. Calculate the sample covariance matrix $\hat{Σ} : = n^{- 1} \sum_{i = 1}^{n} (x_{i} - \bar{x}) {(x_{i} - \bar{x})}^{⊤}$ , its minimum eigenvalue ${\hat{σ}}_{ave}^{2}$ and its associated eigenvector $\hat{v}$ .

Step 2. Derive the estimators ${\hat{M}}_{1}$ , ${\hat{M}}_{2}$ , ${\hat{M}}_{3}$ based on Theorem 4.1 by plug-in of empirical moments of x, $\hat{v}$ and ${\hat{σ}}_{ave}^{2}$ .

Step 3. Calculate the spectral decomposition ${\hat{M}}_{2} = \hat{U} \hat{D} {\hat{U}}^{⊤}$ . Let $\hat{W} = \hat{U} {\hat{D}}^{- 1 / 2}$ . Construct an estimator of ${\tilde{M}}_{3}$ , denoted by ${\bar{M}}_{3}$ , based on (4.4) by plug-in of empirical moments of ${\hat{W}}^{⊤} x$ , $\hat{W}$ and ${\hat{M}}_{1}$ . Apply the robust tensor power method in Anandkumar et al. (2014) to ${\bar{M}}_{3}$ and obtain ${{\bar{μ}}_{k}}_{k = 1}^{K}$ and ${{\hat{ω}}_{k}}_{k = 1}^{K}$ .

Step 4. Set ${\hat{W}}^{†} = {\hat{D}}^{1 / 2} {\hat{U}}^{⊤}$ and ${\hat{μ}}_{k} = {({\hat{W}}^{†})}^{⊤} {\bar{μ}}_{k} / \sqrt{{\hat{ω}}_{k}}$ . Solve the linear equation ${\hat{M}}_{1} = \sum_{k = 1}^{K} {\hat{ω}}_{k} {\hat{σ}}_{k}^{2} {\hat{μ}}_{k}$ for ${{\hat{σ}}_{k}^{2}}_{k = 1}^{K}$ .

4.2. Community detection

In statistical modeling of networks, the stochastic block model (SBM), first proposed by Holland et al. (1983), has gained much attention in recent years (see Abbe, 2017 for a recent survey). Suppose our observation is a graph of n vertices, each of which belongs to one of K communities (or blocks). Let the vertices be indexed by [n], and the community that vertex i belongs to is indicated by an unknown θ_i ∈ [K]. In SBM, the probability of an edge between two vertices depends entirely on the membership of the communities. To be specific, let $W \in ℝ^{K \times K}$ be a symmetric matrix where each entry takes value in [0, 1], and let $A \in ℝ^{n \times n}$ be the adjacency matrix, i.e., A_ij = 1 if there is an edge between vertex i and j, and A_ij = 0 otherwise. Then, the SBM assumes

ℙ (A_{i j} = 1) = W_{k ℓ} with θ_{i} = k, θ_{j} = ℓ

and {A_ij}_i>j are independent. Here, for ease of presentation, we allow self-connecting edges. Figure 7 gives one realization of the network with two communities.

Fig 7. — *In both heatmaps, a dark pixel represents an entry with value* 1 *in a matrix, and a white pixel represents an entry with value* 0. The ***left heatmap*** *shows the (observed) adjacency matrix* A *of size n* = 40 *generated from the SBM with two equal-sized blocks (K* = 2), with edge probabilities 5 log *n/n (within blocks) and* log n/(4n) *(between blocks). The* ***right heatmap*** *shows the same matrix with its row indices and column indices suitably permutated based on unobserved z*_i. Clearly, we observe an approximate rank-2 *structure in the right heatmap. This motivates estimating z*_i *via the second eigenvector.*

Though seemingly different, this problem shares a close connection with PCA and spectral methods. Let z_i = e_k (namely, the kth canonical basis in $ℝ^{K}$ ) if θ_i = k, indicating the membership of ith node, and define $Z = {[z_{1}, \dots, z_{n}]}^{⊤} \in ℝ^{n \times K}$ . The expectation of A has a low-rank decomposition $E A = ZW Z^{⊤}$ and

A = ZW Z^{⊤} + (A - E A) .

(4.5)

Loosely speaking, the matrix Z plays a similar role as factors or loading matrices (unnormalized), and $A - E A$ is similar to the noise (idiosyncratic component). In the ideal situation, the adjacency matrix A and its expectation are close, and naturally we expect the eigenvectors of A to be useful for estimating θ_i. Indeed, this observation is the underpinning of many methods (Rohe et al., 2011; Gao et al., 2015; Abbe and Sandon, 2015). The vanilla spectral method for network/graph data is as follows:

Step 1. Construct the adjacency matrix A or other similarity-based matrices;

Step 2. Compute eigenvectors v₁, … , v_L corresponding to the largest eigenvalues, and form a matrix $V = [v_{1}, \dots, v_{ℓ}] \in ℝ^{n \times L}$ ;

Step 3. Run a clustering algorithm on the row vectors of V.

There are many variants and improvements of this vanilla spectral method. For example, in Step 1, very often the graph Laplacian D − A or normalized Laplacian D^−1/2(D − A)D^−1/2 is used in place of the adjacency matrix, where D = diag(d₁, … , d_n), and $d_{i} = \sum_{j} A_{i j}$ is the degree of vertex i. If real-valued similarities or distances between vertices are available, weighted graphs are usually constructed. In Step 2, there are many other refinements over raw eigenvectors in the construction of V, for example, projecting row vectors of V onto the unit sphere (Ng et al., 2002), and calculating scores based on eigenvector ratios (Jin, 2015), etc. In Step 3, a very popular algorithm for clustering is the K-means algorithm.

We will look at the vanilla spectral algorithm in its simplest form. Our goal is exact recovery, which means finding an estimator $\hat{θ}$ of $θ = {(θ_{1}, \dots, θ_{n})}^{⊤}$ such that as n → ∞,

ℙ (there exists a permutation π of [K] s . t . {\hat{θ}}_{i} = π (θ_{i}), \forall i \in [n]) = 1 - o (1) .

Note that we can only determine θ up to a permutation since the distribution of our observation is invariant to permutations of [K]. There are nice theoretical results, including information limits for exact recovery in Abbe et al. (2016).

Despite its simplicity, spectral methods can be quite sharp for exact recovery in SBM, which succeed in a regime that matches the information limit. The next theorem from Abbe et al. (2017) will make this point clear. Consider the SBM with two balanced blocks, i.e., K = 2 and {i : θ_i = 1} = {i : θ_i = 2} = n/2, and suppose W₁₁ = W₂₂ = a log n/n, W₁₂ = b log n/n where a > b > 0. In this case, one can easily see that the second eigenvector of $E A$ is given by $v_{2}^{*}$ whose ith entry is given by $1 / \sqrt{n}$ if θ_i = 1 and −1 otherwise. In other words, $sgn (v_{2}^{*})$ classifies the two communities, where sgn(·) is the sign function applied to each entry of a vector. This is shown in Figure 2 for the case that #{i : θ_i = 1} = 2500 (red curve, left panel), where the second eigenvector v₂ of A is also depicted (blue curve). The entrywise closeness between these two quantities is guaranteed by the perturbation theory under ℓ_∞-norm (Abbe et al., 2017).

Theorem 4.3. Let v₂ be the normalized second eigenvector of A. If $\sqrt{a} - \sqrt{b} < \sqrt{2}$ , then no estimator achieves exact recovery; if $\sqrt{a} - \sqrt{b} > \sqrt{2}$ , then both the maximum likelihood estimator and the eigenvector estimator sgn(v₂) achieves exact recovery.

The proof of this result is based on entry-wise analysis of eigenvectors in a spirit similar to Theorem 2.4, together with a probability tail bound for differences of binomial variables.

4.3. Matrix completion

In recommendation systems, an important problem is to estimate users’ preferences based on history data. Usually, the available data per user is very small compared with the total number of items (each user sees only a small number of movies and buys only a small fraction of books, comparing to the total). Matrix completion is one formulation of such problem.

The goal of (noisy) matrix completion is to estimate a low-rank matrix $M^{*} \in ℝ^{n_{1} \times n_{2}}$ from noisy observations of some entries (n₁ users and n₂ items). Suppose we know rank(M*) = K. For each i ∈ [n₁] and j ∈ [n₂], let I_ij be i.i.d. Bernoulli variable with $ℙ (I_{i j} = 1) = p$ that indicates if we have observed information about the entry $M_{i j}^{*}$ , i.e., I_ij = 1 if and only if it is observed. Also suppose that our observation is $M_{i j} = M_{i j}^{*} + ε_{i j}$ if I_ij = 1, where ε_ij is i.i.d. $N (0, σ^{2})$ jointly independent of I_ij.

One natural way to estimate M* is to solve

\min_{X \in ℝ^{n_{1} \times n_{2}}} \frac{1}{2} {‖ P_{Ω} (M) P_{Ω} (X) ‖}^{2} subject to rank (X) = K,

where $P_{Ω} : ℝ^{n_{1} \times n_{2}} \to ℝ^{n_{1} \times n_{2}}$ is the sampling operator defined by ${[P_{Ω} (x)]}_{i j} = I_{i j} X_{i j}, \forall i, j$ . The minimizer of this problem is essentially the MLE for M *. Due to the nonconvex constraint rank(X) = K, it is desirable to relax this optimization into a convex program. A popular way to achieve that is to transform the rank constraint into a penalty term λǁXǁ_* that is added to the quadratic objective function, where λ is a tuning parameter and ${‖ \cdot ‖}_{*}$ is the nuclear norm (that is, the ℓ₁ norm of the vector of all its singular values), which encourages a solution with low rank (number of nonzero components in that vector). A rather surprising conclusion from Candès and Recht (2009) is that in the noiseless setting, solving the relaxed problem yields the same solution as the nonconvex problem with high probability.

We can view this problem from the perspective of factor models. The assumption that M* has low rank can be justified by interpreting each M_ij as the linear combination of a few latent factors. Indeed, if M_ij is the preference score of user i for item j, then it is reasonable to posit $M_{i j} = b_{i}^{⊤} f_{j}$ , where $f_{j} \in ℝ^{K}$ is the features item j possesses and $b_{i} \in ℝ^{K}$ is the tendency of user i towards the features. In this regard, M* = BF^⊤ can be viewed as the part explained by the factors in the factor models.

This discussion motivates us to write our observation as

P_{Ω} (M) = p M^{*} + (P_{Ω} (M^{*}) - E P_{Ω} (M^{*}) + P_{Ω} (E)), where E : = {(ε_{i j})}_{i, j} \in ℝ^{n_{1} \times n_{2}},

since $E P_{Ω} (M^{*}) = p M^{*}$ . This decomposition gives the familiar “low-rank plus noise” structure. It is natural to conduct PCA on $P_{Ω} (M)$ to extract the low-rank part.

Let the best rank-K approximation of $P_{Ω} (M)$ be given by $U diag (σ_{1}, \dots, σ_{K}) V^{⊤}$ , where ${σ_{k}}_{k = 1}^{K}$ are the largest K singular values in descending order, and columns of $U \in ℝ^{n_{1} \times K}$ , $V \in ℝ^{n_{2} \times K}$ correspond to their normalized left and right singular vectors, respectively. Similarly, we have singular value decomposition $M^{*} = U^{*} diag (σ_{1}^{*}, \dots, σ_{K}^{*}) {(V^{*})}^{⊤}$ . The following result from Abbe et al. (2017) provides entry-wise bounds for our estimates. For a matrix, denote by ${‖ \cdot ‖}_{\max}$ the largest absolute value of all entries, and ${‖ \cdot ‖}_{2 \to \infty}$ the largest ℓ₂ norm of all row vectors.

Theorem 4.4. Let n = n₁ + n₂, $η = \max {{‖ U ‖}_{2 \to \infty}, {‖ V ‖}_{2 \to \infty}}$ and $κ = σ_{1}^{*} / σ_{K}^{*}$ . There exist constants C, C′ > 0 and an orthogonal matrix $R \in ℝ^{K \times K}$ such that the following holds. If p ≥ 6 log n/n and . There exist constants C, C′ > 0 and an orthogonal matrix $κ \frac{n ({‖ M^{*} ‖}_{\max} + σ)}{σ_{r}^{*}} \sqrt{\frac{\log n}{n p}} \leq 1 / C$ , then with at least probability 1 − C/n,

max {{‖ U R - U^{*} ‖}_{\max}, {‖ V R - V^{*} ‖}_{\max}} \leq C^{'} η κ \frac{n ({‖ M^{*} ‖}_{\max} + σ)}{σ_{r}^{*}} \sqrt{\frac{\log n}{n p}},

{‖ U diag {σ_{1}, \dots, σ_{K}} V^{⊤} - M^{*} ‖}_{\max} \leq C^{'} η^{2} κ^{4} ({‖ M^{*} ‖}_{\max} + σ) \sqrt{\frac{n \log n}{p}} .

We can simplify the bounds with a few additional assumptions. If $n_{1} ≍ n_{2}$ , then η is of order $O (\sqrt{K / n})$ assuming a bounded coherence number. In addition, if κ is also bounded, then

{‖ U diag {σ_{1}, \dots, σ_{K}} V^{⊤} - M^{*} ‖}_{\max} ≲ ({‖ M^{*} ‖}_{\max} + σ) \sqrt{\frac{\log n}{n p}} .

We remark that the requirement on the sample ratio $p ≳ \log n / n$ is the weakest condition necessary for matrix completion, which ensures each row and column and sampled with high probability. Also, the entry-wise bound above can recover the Frobenius bound (Keshavan et al., 2010) up to a log factor. It is more precise than the Frobenius bound, because the latter only provides control on average error.

4.4. Synchronization problems

Synchronization problems are a class of problems in which one estimates signals from their pairwise comparisons. Consider the phase synchronization problem as an example, that is, estimating n angles θ₁, … , θ_n from noisy measurements of their differences. We can express an angle θ_ℓ in the equivalent form of a unit-modulus complex number z_ℓ = exp(iθ_ℓ), and thus, the task is to estimate a complex vector $z = {(z_{1}, \dots, z_{n})}^{⊤} \in ℂ^{n}$ . Suppose our measurements have the form $C_{ℓ k} = {\bar{z}}_{ℓ} z_{k} + σ w_{ℓ k}$ , where ${\bar{z}}_{ℓ}$ denotes the conjugate of z_ℓ, and for all ℓ > k, $w_{ℓ k} \in ℂ$ is i.i.d. complex Gaussian variable (namely, the real part and imaginary part of w_ℓk are $N (0, 1 / 2)$ and independent). Then, the phase of C_ℓk (namely arg(C_ℓk)) encodes the noisy difference θ_k − θ_A.

More generally, the goal of a synchronization problem is to estimate n signals from their pairwise measurements, where each signal is an element from a group, e.g., the group of rotations in three dimensions. Synchronization problems are motivated from imaging problems such as cryo-EM (Shkolnisky and Singer, 2012), camera calibration (Tron and Vidal, 2009), etc.

Synchronization problems also admit the “low-rank plus noise” structure. Consider our phase synchronization problem again. If we let w_kℓ = w_ℓk (ℓ > k) and w_ℓℓ = 0, and write $W = {(w_{ℓ k})}_{ℓ, k = 1}^{n}$ , then our measurement matrix $C = {(C_{ℓ k})}_{ℓ, k = 1}^{n}$ has the structure

C = z z^{*} + σ W,

where * denotes the conjugate transpose. This decomposition has a similar form to (4.5) in community detection. Note that zz* is a complex matrix with a single nonzero eigenvalue n, and ${‖ σ W ‖}_{2}$ is of order $σ \sqrt{n}$ with high probability (which is a basic result in random matrix theory). Therefore, we expect that no estimators can do well if $σ ≳ \sqrt{n}$ . Indeed, the information-theoretic limit is established in Lelarge and Miolane (2016). Our next result from Zhong and Boumal (2018) gives estimation guarantees if the reverse inequality is true (up to a log factor).

Theorem 4.5. Let $v \in ℂ^{n}$ be the leading eigenvector of C such that ${‖ v ‖}_{2} = \sqrt{n}$ and $v^{*} z = | v^{*} z |$ . Then, if $σ ≲ \sqrt{n / \log n}$ , then with probability 1 − O(n⁻²), the relative errors satisfy

n^{- 1 / 2} {‖ v - z ‖}_{2} ≲ σ / \sqrt{n}, a n d {‖ v - z ‖}_{\infty} ≲ σ \sqrt{\log n / n}

Moreover, the above two inequalities also hold for the maximum likelihood estimator.

Note that the eigenvector of a complex matrix is not unique: for any $α \in ℝ$ , the vector e^iαv is also an eigenvector, so we fix the global phase e^iα by restricting $v^{*} z = | v^{*} z |$ . Note also that the maximum likelihood estimator is different from v, because the MLE must satisfy the entry-wise constraint |z_ℓ| = 1 for any ℓ ∈ [n]. This result implies consistency of v in terms of both the ℓ₂ norm and the ℓ_∞ norm if $σ ≪ \sqrt{n / \log n}$ , and thus, provides good evidence that spectral methods (or PCA) are simple, generic, yet powerful.

Acknowledgments

The authors gratefully acknowledge the support from the NSF grants DMS-1712591 and DMS-1662139 and NIH grant R01-GM072611.

APPENDIX A: PROOFS

Proof of Corollary 2.1. Notice that the result is trivial if $δ_{0} \leq 2 {‖ \tilde{A} - A ‖}_{2}$ , since ${‖ (\tilde{A} - A) V ‖}_{2} \leq {‖ \tilde{A} - A ‖}_{2}$ and ${‖ \sin Θ (\tilde{A}, A) ‖}_{2} \leq 1$ always hold. If $δ_{0} > 2 {‖ \tilde{A} - A ‖}_{2}$ , then by Weyl’s inequality,

L ({\tilde{V}}^{⊥}) \subset (- \infty, α - δ_{0} + {‖ \tilde{A} - A ‖}_{2}] \cup [β + δ_{0} - {‖ \tilde{A} - A ‖}_{2}, + \infty) .

Thus, we can set $δ = δ_{0} - {‖ \tilde{A} - A ‖}_{2}$ in Theorem 2.3 and derive

{‖ \sin Θ (\tilde{V}, V) ‖}_{2} \leq \frac{{‖ \tilde{A} - A ‖}_{2}}{δ_{0} - {‖ \tilde{A} - A ‖}_{2}} \leq \frac{{‖ \tilde{A} - A ‖}_{2}}{δ_{0} - δ_{0} / 2} = 2 δ_{0}^{- 1} {‖ \tilde{A} - A ‖}_{2} .

This proves the spectral norm case.

Proof of Theorem 2.4. Step 1: First, we derive a few elementary inequalities: for any m ∈ [n],

{‖ W^{(m)} ‖}_{2} \leq {‖ W ‖}_{2}, {‖ W_{m} ‖}_{2} \leq {‖ W ‖}_{2}, {‖ W - W^{(m)} ‖}_{2} \leq 2 {‖ W ‖}_{2} .

(A.1)

To prove these inequalities, recall the (equivalent) definition of spectral norm for symmetric matrices:

{‖ W ‖}_{2} = \max_{x, y \in S^{n - 1}} x^{⊤} W y = \max_{x \in S^{n - 1}} {‖ W x ‖}_{2},

where Sⁿ⁻¹ is the unit sphere in $ℝ^{n}$ , and x = (x₁, … , x_n)^⊤, y = (y₁, … , y_n)^⊤. The first and second inequalities follow from

{‖ W ‖}_{2} \geq \max {x^{⊤} W y; x, y \in S^{n - 1}, x_{m} = y_{m} = 0} = {‖ W^{(m)} ‖}_{2}, and

{‖ W ‖}_{2} \geq \max_{x \in S^{n - 1}} | 〈 W_{m}, x 〉 | = {‖ W_{m} ‖}_{2} .

The third inequality follows from the first one and the triangle inequality.

Step 2: Next, by the definition of eigenvectors,

{\tilde{v}}_{ℓ} - v_{ℓ} = \frac{\tilde{A} {\tilde{v}}_{ℓ}}{{\tilde{λ}}_{ℓ}} - v_{ℓ} = (\frac{A {\tilde{v}}_{ℓ}}{{\tilde{λ}}_{ℓ}} - v_{ℓ}) + \frac{W {\tilde{v}}_{ℓ}}{{\tilde{λ}}_{ℓ}} .

(A.2)

We first control the entries of the first term on the right-hand side. Using the decomposition (2.8), we have

[\frac{A {\tilde{v}}_{ℓ}}{{\tilde{λ}}_{ℓ}} - v_{ℓ}] = (\frac{λ_{ℓ}}{{\tilde{λ}}_{ℓ}} 〈 v_{ℓ}, {\tilde{v}}_{ℓ} 〉 - 1) {[v_{ℓ}]}_{m} + \sum_{k \neq ℓ} \frac{λ_{ℓ}}{{\tilde{λ}}_{ℓ}} 〈 v_{ℓ}, {\tilde{v}}_{ℓ} 〉 {[v_{k}]}_{m}, \forall m \in [n] .

(A.3)

Using the triangle inequality, we have

| \frac{λ_{ℓ}}{{\tilde{λ}}_{ℓ}} 〈 v_{ℓ}, {\tilde{v}}_{ℓ} 〉 - 1 | \leq | \frac{λ_{ℓ}}{{\tilde{λ}}_{ℓ}} 〈 v_{ℓ}, {\tilde{v}}_{ℓ} 〉 - 〈 v_{ℓ}, {\tilde{v}}_{ℓ} 〉 | + | 〈 v_{ℓ}, {\tilde{v}}_{ℓ} 〉 - 1 | \leq \frac{| λ_{ℓ} - {\tilde{λ}}_{ℓ} |}{| {\tilde{λ}}_{ℓ} |} + \frac{1}{2} {‖ {\tilde{v}}_{ℓ} - v_{ℓ} ‖}^{2} .

By Weyl’s inequality, $| {\tilde{λ}}_{ℓ} - λ_{ℓ} | \leq {‖ W ‖}_{2}$ , and thus $| {\tilde{λ}}_{ℓ} | \geq | λ_{ℓ} | - {‖ W ‖}_{2} \geq δ_{ℓ} - {‖ W ‖}_{2}$ . Also, by Corollary 2.1 (simplified Davis-Kahan’s theorem) and its following remark, ${‖ {\tilde{v}}_{ℓ} - v_{ℓ} ‖}_{2} \leq 2 \sqrt{2} {‖ W ‖}_{2} / δ_{ℓ}$ . Therefore, under the condition δ_ℓ ≥ 2ǁWǁ₂,

| \frac{λ_{ℓ}}{{\tilde{λ}}_{ℓ}} 〈 v_{ℓ}, {\tilde{v}}_{ℓ} 〉 - 1 | \leq \frac{{‖ W ‖}_{2}}{δ_{ℓ} - {‖ W ‖}_{2}} + \frac{4 {‖ W ‖}_{2}^{2}}{δ_{ℓ}^{2}} \leq \frac{2 {‖ W ‖}_{2}^{2}}{δ_{ℓ}} + \frac{2 {‖ W ‖}_{2}}{δ_{ℓ}} = \frac{4 {‖ W ‖}_{2}}{δ_{ℓ}} .

Using Corollary 2.1 again, we obtain

\sum_{k \neq ℓ, k \leq K} \frac{λ_{k}^{2}}{{\tilde{λ}}_{ℓ}^{2}} {〈 v_{k}, {\tilde{v}}_{ℓ} 〉}^{2} ≲ \sum_{k \neq ℓ, k \leq K} {〈 v_{k}, {\tilde{v}}_{ℓ} 〉}^{2} \leq 1 - {〈 v_{ℓ}, {\tilde{v}}_{ℓ} 〉}^{2} = \sin^{2} θ (v_{ℓ}, {\tilde{v}}_{ℓ}) \leq \frac{4 {‖ W ‖}_{2}^{2}}{δ_{ℓ}^{2}},

where the first inequality is due to $| {\tilde{λ}}_{ℓ} | \geq | λ_{ℓ} | - {‖ W ‖}_{2} \geq 4 | λ_{ℓ} | / 5$ and the condition $| λ_{ℓ} | ≍ \max_{k \in [K]} | λ_{k} |$ , and the second inequality is due to the fact that ${v_{k}}_{k = 1}^{K}$ is a subset of orthonormal basis. Now we use the Cauchy-Schwarz inequality to bound the second term on the right-hand side of (A.3) and get

| {[\frac{A {\tilde{v}}_{ℓ}}{{\tilde{λ}}_{ℓ}} - v_{ℓ}]}_{m} | ≲ \frac{{‖ W ‖}_{2}}{δ_{ℓ}} {(\sum_{k = 1}^{K} {[v_{k}]}_{m}^{2})}^{1 / 2} .

(A.4)

Step 3: To bound the entries of the second term in (A.2), we use the leave-one-out idea as follows.

{[W {\tilde{v}}_{ℓ}]}_{m} = {[W {\tilde{v}}_{ℓ}^{(m)}]}_{m} + {[W ({\tilde{v}}_{ℓ} - {\tilde{v}}_{ℓ}^{(m)})]}_{m} = 〈 w_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉 + 〈 w_{m}, {\tilde{v}}_{ℓ} - {\tilde{v}}_{ℓ}^{(m)} 〉, \forall m \in [n] .

(A.5)

We can bound the second term using the Cauchy-Schwarz inequality: $| 〈 w_{m}, {\tilde{v}}_{ℓ} - {\tilde{v}}_{ℓ}^{(m)} 〉 | \leq {‖ w_{m} ‖}_{2} {‖ {\tilde{v}}_{ℓ} - {\tilde{v}}_{ℓ}^{(m)} ‖}_{2}$ . The crucial observation is that, if we view ${\tilde{v}}_{ℓ}$ as the perturbed version of ${\tilde{v}}_{ℓ}^{(m)}$ , then by Theorem 2.3 (Davis-Kahan’s theorem) and Weyl’s inequality, for any ℓ ∈ [K],

{‖ {\tilde{v}}_{ℓ} - {\tilde{v}}_{ℓ}^{(m)} ‖}_{2} \leq \frac{\sqrt{2} {‖ Δ^{(m)} {\tilde{v}}_{ℓ}^{(m)} ‖}_{2}}{{\tilde{δ}}_{ℓ}^{(m)} - {‖ Δ^{(m)} ‖}_{2}}, where Δ^{(m)} : = W - W^{(m)} .

Here, ${\tilde{δ}}_{ℓ}^{(m)}$ is the eigen-gap of A + W^(m), and it satisfies ${\tilde{δ}}_{ℓ}^{(m)} \geq δ_{ℓ} - 2 {‖ W^{(m)} ‖}_{2}$ since $| {\tilde{λ}}_{i}^{(m)} - λ_{i} | \leq {‖ W^{(m)} ‖}_{2}$ for all i ∈ [n], by Weyl’s inequality. By (A.1), we have ${\tilde{δ}}_{ℓ}^{(m)} - {‖ Δ^{(m)} ‖}_{2} \geq δ_{ℓ} - 4 {‖ W ‖}_{2}$ . Thus, under the condition δ_ℓ ≥ 5ǁWǁ₂, we have

{‖ {\tilde{v}}_{ℓ} - {\tilde{v}}_{ℓ}^{(m)} ‖}_{2} ≲ \frac{{‖ Δ^{(m)} {\tilde{v}}_{ℓ}^{(m)} ‖}_{2}}{δ_{ℓ}} .

Note that the mth entry of the vector $Δ^{(m)} {\tilde{v}}_{ℓ}^{(m)}$ is exactly $〈 W_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉$ , and other entries are $W_{i m} {[{\tilde{v}}_{ℓ}^{(m)}]}_{m}$ where $i \neq m$ . Thus,

{‖ {\tilde{v}}_{ℓ} - {\tilde{v}}_{ℓ}^{(m)} ‖}_{2} ≲ \frac{1}{δ_{ℓ}} {({〈 W_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉}^{2} + \sum_{i \neq m} W^{2} {[{\tilde{v}}_{ℓ}^{(m)}]}_{m}^{2})}^{1 / 2} \leq \frac{1}{δ_{ℓ}} (| 〈 W_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉 | + {‖ W_{m} ‖}_{2} | {[{\tilde{v}}_{ℓ}^{(m)}]}_{m} |),

where we used $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b} (a, b \geq 0)$ . The above inequality, together with $| 〈 w_{m}, {\tilde{v}}_{ℓ} - {\tilde{v}}_{ℓ}^{(m)} 〉 | \leq {‖ w_{m} ‖}_{2} {‖ {\tilde{v}}_{ℓ} - {\tilde{v}}_{ℓ}^{(m)} ‖}_{2}$ , leads to a bound on ${[W {\tilde{v}}_{ℓ}]}_{m}$ in (A.5).

| {[W {\tilde{v}}_{ℓ}]}_{m} | ≲ | 〈 w_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉 | + \frac{{‖ w_{m} ‖}_{2}}{δ_{ℓ}} (| 〈 w_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉 | + {‖ w_{m} ‖}_{2} | {[{\tilde{v}}_{ℓ}^{(m)}]}_{m} |) ≲ | 〈 w_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉 | + {‖ w_{m} ‖}_{2} | {[{\tilde{v}}_{ℓ}^{(m)}]}_{m} |

(A.6)

where we used $δ_{ℓ}^{- 1} {‖ w_{m} ‖}_{2} \leq δ_{ℓ}^{- 1} {‖ W ‖}_{2} < 1$ . We claim that $| {[{\tilde{v}}_{ℓ}^{(m)}]}_{m} | ≲ {(\sum_{k = 1}^{K} {[v_{k}]}_{m}^{2})}^{1 / 2}$ . Once this is proved, combining it with (A.4) and (A.6) yields the desired bound on the entries of ${\tilde{v}}_{ℓ} - v_{ℓ}$ in (A.2):

| {[{\tilde{v}}_{ℓ} - v_{ℓ}]}_{m} | ≲ \frac{{‖ W ‖}_{2}}{δ_{ℓ}} {(\sum_{k = 1}^{K} {[v_{k}]}_{m}^{2})}^{1 / 2} + \frac{1}{δ_{ℓ}} (| 〈 w_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉 | + {‖ w_{m} ‖}_{2} | {[{\tilde{v}}_{ℓ}^{(m)}]}_{m} |) ≲ \frac{{‖ W ‖}_{2}}{δ_{ℓ}} {(\sum_{k = 1}^{K} {[v_{k}]}_{m}^{2})}^{1 / 2} + \frac{| 〈 w_{m}, {\tilde{v}}_{ℓ}^{(m)} 〉 |}{δ_{ℓ}},

where, in the first inequality, we used $| {\tilde{λ}}_{ℓ} | \geq | λ_{ℓ} | - {‖ W ‖}_{2} \geq δ_{ℓ} - δ_{ℓ} / 5 = 4 δ_{ℓ} / 5$ , and in the second inequality, we used ${‖ W_{m} ‖}_{2} \leq {‖ W ‖}_{2}$ and the claim.

Step 4: Finally, we prove our claim that $| {[{\tilde{v}}_{ℓ}^{(m)}]}_{m} | ≲ {(\sum_{k = 1}^{K} {[v_{k}]}_{m}^{2})}^{1 / 2}$ . By definition, ${\tilde{λ}}_{ℓ}^{(m)} {\tilde{v}}_{ℓ}^{(m)} = (A+ W^{(m)}) {\tilde{v}}_{ℓ}^{(m)}$ . Note that the mth row of $W^{(m)} {\tilde{v}}_{ℓ}^{(m)}$ is 0, since W^(m) has only zeros in its mth row. Thus,

{[{\tilde{v}}_{ℓ}^{(m)}]}_{m} = ({[{\tilde{v}}_{ℓ}^{(m)}]}_{m} - {[v_{ℓ}]}_{m}) + {[v_{ℓ}]}_{m} = {[\frac{A {\tilde{v}}_{ℓ}^{(m)}}{{\tilde{λ}}_{ℓ}^{(m)}} - v_{ℓ}]}_{m} + {[v_{ℓ}]}_{m} .

With an argument similar to the one that leads to (A.4), we can bound the first term on the right-hand side.

| {[\frac{A {\tilde{v}}_{ℓ}^{(m)}}{{\tilde{λ}}_{ℓ}^{(m)}} - v_{ℓ}]}_{m} | ≲ \frac{{‖ W^{(m)} ‖}_{2}}{δ_{ℓ}} {(\sum_{k = 1}^{K} {[v_{k}]}_{m}^{2})}^{1 / 2} \leq {(\sum_{k = 1}^{K} {[v_{k}]}_{m}^{2})}^{1 / 2} .

Clearly, $| {[v_{ℓ}]}_{m} |$ is also upper bounded by the right-hand side above. This proves our claim and concludes the proof.

Proof of Corollary 2.2. Let us construct symmetric matrices A, W, $\tilde{W}$ of size n + p via a standard dilation technique (Paulsen, 2002). Define

A = (\begin{matrix} 0 & L \\ L^{⊤} & 0 \end{matrix}), W = (\begin{matrix} 0 & E \\ E^{⊤} & 0 \end{matrix}), and \tilde{A} = A + W .

It can be checked that rank(A) = 2K, and importantly,

A = \frac{1}{2} \sum_{k = 1}^{K} σ_{k} (\begin{array}{l} u_{k} \\ v_{k} \end{array}) (\begin{matrix} u_{k}^{⊤} & v_{k}^{⊤} \end{matrix}) - \frac{1}{2} \sum_{k = 1}^{K} σ_{k} (\begin{matrix} u_{k} \\ - v_{k} \end{matrix}) (\begin{matrix} u_{k}^{⊤} & - v_{k}^{⊤} \end{matrix}) .

(A.7)

Step 1: Check the conditions of Theorem 2.4. The nonzero eigenvalues of A are ±σ_k, (k ∈ [K]), and the corresponding eigenvectors are ${(u_{k}^{⊤}, \pm v_{k}^{⊤})}^{⊤} / \sqrt{2} \in ℝ^{n + p}$ . It is clear that the eigenvalue condition $| λ_{ℓ} | ≍ \max_{k \in [K]} | λ_{k} |$ in Theorem 2.4 is satisfied, and the eigen-gap δ_ℓ of A is exactly γ_ℓ. Since the identity (A.7) holds for any matrix constructed from dilation, by applying it to W we get ${‖ W ‖}_{2} = {‖ E ‖}_{2}$ .

Step 2: Apply the conclusion of Theorem 2.4. Similarly as before, we write W^(m) as the matrix obtained by setting mth row and mth column of W to zero, where m ∈ [n + p]. We also denote ${\tilde{A}}^{(m)} = A + W^{(m)}$ . Using a similar argument as Step 1, we find

the eigenvectors of $\tilde{A}$ are $(\begin{matrix} {\tilde{u}}_{k} \\ \pm {\tilde{v}}_{k} \end{matrix}) / \sqrt{2}$ ,
the eigenvectors of ${\tilde{A}}^{(i)}$ are $(\begin{matrix} * \\ \pm {\tilde{v}}_{k}^{(i)} \end{matrix}) / \sqrt{2}, \forall i \in [n]$ , and
the eigenvectors of ${\tilde{A}}^{(n + j)}$ are $(\begin{matrix} {\tilde{u}}_{k}^{(j)} \\ * \end{matrix}) / \sqrt{2}, \forall j \in [p]$ ,

where * means some appropriate vectors we do not need in the proof (we do not bother introducing notations for them). We also observe that

W_{m} = {\begin{array}{l} {(\begin{matrix} 0 & e_{i}^{row} \end{matrix})}^{⊤}, & m = i \in [n] \\ {(\begin{matrix} {(e_{j}^{col})}^{⊤} & 0 \end{matrix})}^{⊤}, & m = n + j, j \in [p] \end{array}

Note that the inner product between w_m and the eigenvector of ${\tilde{A}}^{(m)}$ is $〈 {(e_{i}^{row})}^{⊤}, \pm {\tilde{v}}_{k}^{(i)} 〉$ if m = i ∈ [n], or $〈 e_{j}^{col}, {\tilde{u}}_{k}^{(j)} 〉$ if m = n + j, j ∈ [p]. Therefore, applying Theorem 2.4 to the first n entries of

\frac{1}{\sqrt{2}} (\begin{matrix} {\tilde{u}}_{ℓ} - u_{ℓ} \\ {\tilde{v}}_{ℓ} - v_{ℓ} \end{matrix}),

we obtain the first inequality of Corollary 2.2, and applying Theorem 2.4 to the last p entries leads to the second inequality.

Proof of Lemma 3.1.

E_{ε} [{‖ X^{⊤} {\hat{β}}_{K} - X^{⊤} β^{*} ‖}_{2}^{2} / n] = E_{ε} [{‖ Q_{K} Σ_{K} P_{K}^{⊤} β^{*} + Q_{K} Q_{K}^{⊤} ε - X^{⊤} β^{*} ‖}_{2}^{2} / n] = E_{ε} [{‖ Q_{K} Q_{K}^{⊤} ε - Q_{K +} Σ_{K +} P_{K +}^{⊤} β^{*} ‖}_{2}^{2} / n] = \frac{K σ^{2}}{n} + \underset{α^{⊤}}{\underset{︸}{β^{* ⊤} P_{K +}}} Σ_{K +}^{2} \underset{α}{\underset{︸}{P_{K +}^{⊤} β^{*}}} . = \frac{K σ^{2}}{n} + \sum_{j = K + 1}^{d} λ_{j}^{2} α_{j}^{2} .

Footnotes

There is a weaker assumption, under which (1.1) is usually called the weak factor model; see Onatski (2012).

Here, we mean it has second bounded moment when estimating the mean and has bounded fourth moment when estimating the variance.

A norm $‖ \cdot ‖$ is orthogonal-invariant if $‖ u^{⊤} BV ‖ = ‖ b ‖$ for any matrix B and any orthogonal matrices U and V.

⁴

Here, we prefer using u_k to refer to the singular vectors (not to be confused with the noise term in factor models). The same applies to Section 4.

Contributor Information

Jianqing Fan, Department of Operations Research and Financial Engineering, Princeton University, Princeton, 08540, NJ, USA.

Kaizheng Wang, Department of Industrial Engineering and Operations Research, Columbia University, New York, 10027, NY, USA.

Yiqiao Zhong, Packard 239, Stanford University, Stanford, 94305, CA, USA.

Ziwei Zhu, Department of Statistics, University of Michigan, Ann Arbor, 48109, MI, USA.

REFERENCES

Abbe E (2017). Community detection and stochastic block models: recent developments. arXiv preprint arXiv:1703.10146
Abbe E, Bandeira AS and Hall G (2016). Exact recovery in the stochastic block model. IEEE Transactions on Information Theory 62 471–487. [Google Scholar]
Abbe E, Fan J, Wang K and Zhong Y (2017). Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565 [DOI] [PMC free article] [PubMed]
Abbe E and Sandon C (2015). Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on. IEEE. [Google Scholar]
Ahn SC and Horenstein AR (2013). Eigenvalue ratio test for the number of factors. Econometrica 81 1203–1227. [Google Scholar]
Anandkumar A, Ge R, HSU D, Kakade SM and Telgarsky M (2014). Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research 15 2773–2832. [Google Scholar]
Anderson TW and Amemiya Y (1988). The asymptotic normal distribution of estimators in factor analysis under general conditions. The Annals of Statistics 16 759–771. [Google Scholar]
Bai J and Li K (2012). Statistical analysis of factor models of high dimension. The Annals of Statistics 40 436–465. [Google Scholar]
Bai J and NG S (2002). Determining the number of factors in approximate factor models. Econometrica 70 191–221. [Google Scholar]
Baik J, Ben AROUS G and PÉchÉ S (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability 1643–1697.
Bartlett MS (1938). Methods of estimating mental factors. Nature 141 609–610. [Google Scholar]
Bartlett MS (1950). Tests of significance in factor analysis. British Journal of Mathematical and Statistical Psychology 3 77–85. [Google Scholar]
Bean D, Bickel PJ, El KarOUI N and YU B (2013). Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences 110 14563–14568. [DOI] [PMC free article] [PubMed] [Google Scholar]
Benaych-GeorgeS F and Nadakuditi RR (2011). The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics 227 494–521. [Google Scholar]
Benjamini Y and HochberG Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological) 289–300.
Bickel PJ and Levina E (2008). Covariance regularization by thresholding. The Annals of Statistics 36 2577–2604. [Google Scholar]
Bickel PJ, Ritov Y and Tsybakov AB (2009). Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics 37 1705–1732. [Google Scholar]
Cai T and LIU W (2011). Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association 106 672–684. [Google Scholar]
Candes E and TaO T (2007). The dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics 35 2313–2351. [Google Scholar]
CandÈs EJ, LI X, Ma Y and Wright J (2011). Robust principal component analysis? Journal of the ACM (JACM) 58 11. [Google Scholar]
CandÈs EJ and Recht B (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9 717. [Google Scholar]
Cape J, TanG M and Priebe CE (2017). The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. arXiv preprint arXiv:1705.10735
Catoni O (2012). Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 48. Institut Henri Poincaré. [Google Scholar]
Cattell RB (1966). The scree test for the number of factors. Multivariate behavioral research 1 245–276. [DOI] [PubMed] [Google Scholar]
Chamberlain G and Rothschild M (1982). Arbitrage, factor structure, and mean-variance analysis on large asset markets
Cohen MB, Nelson J and WoodrUFF DP (2015). Optimal approximate matrix product in terms of stable rank. arXiv preprint arXiv:1507.02268
Davis C and Kahan WM (1970). The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis 7 1–46. [Google Scholar]
Desai KH and Storey JD (2012). Cross-dimensional inference of dependent high-dimensional data. Journal of the American Statistical Association 107 135–151. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dobriban E (2017). Factor selection by permutation. arXiv preprint arXiv:1710.00479
Donoho DL, Gavish M and Johnstone IM (2013). Optimal shrinkage of eigenvalues in the spiked covariance model. arXiv preprint arXiv:1311.0851 [DOI] [PMC free article] [PubMed]
EFRon B (2007). Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association 102 93–103. [Google Scholar]
EFRon B (2010). Correlated z-values and the accuracy of large-scale statistical estimates. Journal of the American Statistical Association 105 1042–1055. [DOI] [PMC free article] [PubMed] [Google Scholar]
Eldridge J, Belkin M and Wang Y (2017). Unperturbed: spectral analysis beyond Davis-Kahan. arXiv preprint arXiv:1706.06516
Fama EF and French KR (1993). Common risk factors in the returns on stocks and bonds. Journal of financial economics 33 3–56. [Google Scholar]
Fan J, Fan Y and LV J (2008). High dimensional covariance matrix estimation using a factor model. Journal of Econometrics 147 186–197. [Google Scholar]
Fan J and Han X (2017). Estimation of the false discovery proportion with unknown dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 1143–1164. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Han X and Gu W (2012). Estimating false discovery proportion under arbitrary covariance dependence. Journal of the American Statistical Association 107 1019–1035. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Ke Y, Sun Q and Zhou W-X (2017a). Farm-test: Factor-adjusted robust multiple testing with false discovery control. arXiv preprint arXiv:1711.05386 [DOI] [PMC free article] [PubMed]
Fan J, Ke Y and Wang K (2016a). Decorrelation of covariates for high dimensional sparse regression. arXiv preprint arXiv:1612.08490
Fan J, Li Q and Wang Y (2017b). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 247–265. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96 1348–1360. [Google Scholar]
Fan J, LiaO Y and Mincheva M (2011). High-dimensional covariance matrix estimation in approximate factor models. The Annals of Statistics 39 3320–3356. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, LiaO Y and Mincheva M (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 603–680. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, LIU H and Wang W (2018a). Large covariance estimation through elliptical factor models. Annals of Statistics 46 1383–1414. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Wang W and Zhong Y (2018b). An ℓ∞ eigenvector perturbation bound and its application. Journal of Machine Learning Research 18 1–42. [PMC free article] [PubMed] [Google Scholar]
Fan J, Wang W and ZHU Z (2016b). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. arXiv preprint arXiv:1603.08315 [DOI] [PMC free article] [PubMed]
Friguet C, Kloareg M and Causeur D (2009). A factor model approach to multiple testing under dependence. Journal of the American Statistical Association 104 1406–1415. [Google Scholar]
GaO C, Ma Z, Zhang AY and ZHOU HH (2015). Achieving optimal misclassification proportion in stochastic block model. arXiv preprint arXiv:1505.03772
Hirzel AH, Hausser J, Chessel D and Perrin N (2002). Ecological-niche factor analysis: how to compute habitat-suitability maps without absence data? Ecology 83 2027–2036. [Google Scholar]
Hochreiter S, Clevert D-A and ObermaYer K (2006). A new summarization method for affymetrix probe level data. Bioinformatics 22 943–949. [DOI] [PubMed] [Google Scholar]
Holland PW, Laskey KB and Leinhardt S (1983). Stochastic blockmodels: First steps. Social networks 5 109–137. [Google Scholar]
Horn JL (1965). A rationale and test for the number of factors in factor analysis. Psychometrika 30 179–185. [DOI] [PubMed] [Google Scholar]
Hotelling H (1933). Analysis of a complex of statistical variables into principal components. Journal of educational psychology 24 417. [Google Scholar]
HSU D and Kakade SM (2013). Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science. ACM. [Google Scholar]
Huber PJ (1964). Robust estimation of a location parameter. The annals of mathematical statistics 73–101.
Jin J (2015). Fast community detection by score. The Annals of Statistics 43 57–89. [Google Scholar]
Johnstone IM and LU AY (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association 104 682–693. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jolliffe IT (1986). Principal component analysis and factor analysis. In Principal component analysis Springer, 115–128. [Google Scholar]
Kendall MG (1965). A course in multivariate analysis
Keshavan RH, Montanari A and Oh S (2010). Matrix completion from noisy entries. Journal of Machine Learning Research 11 2057–2078. [Google Scholar]
Kneip A and Sarda P (2011). Factor models and variable selection in high-dimensional regression analysis. The Annals of Statistics 39 2410–2447. [Google Scholar]
Koltchinskii V and Lounici K (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli 23 110–133. [Google Scholar]
Koltchinskii V and Xia D (2016). Perturbation of linear forms of singular vectors under gaussian noise. In High Dimensional Probability VII Springer, 397–423. [Google Scholar]
Lam C and YaO Q (2012). Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics 40 694–726. [Google Scholar]
Lawley D and Maxwell A (1962). Factor analysis as a statistical method. Journal of the Royal Statistical Society. Series D (The Statistician) 12 209–229. [Google Scholar]
Leek JT and Storey JD (2008). A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences 105 18718–18723. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lelarge M and Miolane L (2016). Fundamental limits of symmetric low-rank matrix estimation. arXiv preprint arXiv:1611.03888
LI Q, Cheng G, Fan J and Wang Y (2017). Embracing the blessing of dimensionality in factor models. Journal of the American Statistical Association 1–10. [DOI] [PMC free article] [PubMed]
McCrae RR and John OP (1992). An introduction to the five-factor model and its applications. Journal of personality 60 175–215. [DOI] [PubMed] [Google Scholar]
Minsker S (2016). Sub-gaussian estimators of the mean of a random matrix with heavy-tailed entries. arXiv preprint arXiv:1605.07129
MOr-Yosef L and AvRon H (2018). Sketching for principal component regression. arXiv preprint arXiv:1803.02661
NG AY, Jordan MI and Weiss Y (2002). On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems
Onatski A (2010). Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics 92 1004–1016. [Google Scholar]
Onatski A (2012). Asymptotics of the principal components estimator of large factor models with weakly influential factors. Journal of Econometrics 168 244–258. [Google Scholar]
O’Rourke S, Vu V and Wang K (2016). Eigenvectors of random matrices: a survey. Journal of Combinatorial Theory, Series A 144 361–442. [Google Scholar]
O’Rourke S, Vu V and Wang K (2017). Random perturbation of low rank matrices: Improving classical bounds. Linear Algebra and its Applications
Paul D (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica 1617–1642.
Paul D, Bair E, Hastie T and Tibshirani R (2008). ” preconditioning” for feature selection and regression in high-dimensional problems. The Annals of Statistics 1595–1618.
Paulsen V (2002). Completely bounded maps and operator algebras, vol. 78. Cambridge University Press. [Google Scholar]
Pearson K (1901). Principal components analysis. The London, Edinburgh and Dublin Philosophical Magazine and Journal 6 566. [Google Scholar]
Rohe K, Chatterjee S and YU B (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39 1878–1915. [Google Scholar]
Sedghi H, Janzamin M and Anandkumar A (2016). Provable tensor methods for learning mixtures of generalized linear models. In Artificial Intelligence and Statistics
Shkolnisky Y and Singer A (2012). Viewing direction estimation in cryo-EM using synchronization. SIAM journal on imaging sciences 5 1088–1110. [DOI] [PMC free article] [PubMed] [Google Scholar]
Spearman C (1927). The abilities of man. [DOI] [PubMed]
SRIvastava N and Vershynin R (2013). Covariance estimation for distributions with 2 + ε moments. The Annals of Probability 41 3081–3111. [Google Scholar]
Stewart G and Sun J (1990). Matrix perturbation theory
Stock JH and Watson MW (2002). Forecasting using principal components from a large number of predictors. Journal of the American statistical association 97 1167–1179. [Google Scholar]
Storey JD (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 479–498. [Google Scholar]
Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 267–288. [Google Scholar]
TRon R and Vidal R (2009). Distributed image-based 3-D localization of camera sensor networks. In Decision and Control, 2009 held jointly with the 2009 28th Chinese Control Conference. CDC/CCC 2009. Proceedings of the 48th IEEE Conference on. IEEE. [Google Scholar]
TROpp JA (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12 389–434. [Google Scholar]
Vershynin R (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027
Vershynin R (2012). How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability 25 655–686. [Google Scholar]
Wang H (2012). Factor profiled sure independence screening. Biometrika 99 15–28. [Google Scholar]
Wang J, ZhaO Q, Hastie T and Owen AB (2017). Confounder adjustment in multiple hypothesis testing. The Annals of Statistics 45 1863–1894. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang W and Fan J (2017). Asymptotics of empirical eigenstructure for high dimensional spiked covariance. Ann. Statist 45 1342–1374. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wedin P-A (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics 12 99–111. [Google Scholar]
WoodrUFF DP (2014). Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science 10 1–157. [Google Scholar]
Yang J, Meng X and Mahoney MW (2016). Implementing randomized matrix algorithms in parallel and distributed environments. Proceedings of the IEEE 104 58–92. [Google Scholar]
Yi X, Caramanis C and Sanghavi S (2016). Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749
YU Y, Wang T and SamwORth RJ (2014). A useful variant of the davis–kahan theorem for statisticians. Biometrika 102 315–323. [Google Scholar]
ZhaO P and YU B (2006). On model selection consistency of lasso. Journal of Machine learning research 7 2541–2563. [Google Scholar]
Zhong Y (2017). Eigenvector under random perturbation: A nonasymptotic Rayleigh-Schrö dinger theory. arXiv preprint arXiv:1702.00139
Zhong Y and Boumal N (2018). Near-optimal bounds for phase synchronization. SIAM Journal on Optimization 28 989–1016. [Google Scholar]
ZOU H and Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 301–320. [Google Scholar]

[R1] Abbe E (2017). Community detection and stochastic block models: recent developments. arXiv preprint arXiv:1703.10146

[R2] Abbe E, Bandeira AS and Hall G (2016). Exact recovery in the stochastic block model. IEEE Transactions on Information Theory 62 471–487. [Google Scholar]

[R3] Abbe E, Fan J, Wang K and Zhong Y (2017). Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565 [DOI] [PMC free article] [PubMed]

[R4] Abbe E and Sandon C (2015). Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on. IEEE. [Google Scholar]

[R5] Ahn SC and Horenstein AR (2013). Eigenvalue ratio test for the number of factors. Econometrica 81 1203–1227. [Google Scholar]

[R6] Anandkumar A, Ge R, HSU D, Kakade SM and Telgarsky M (2014). Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research 15 2773–2832. [Google Scholar]

[R7] Anderson TW and Amemiya Y (1988). The asymptotic normal distribution of estimators in factor analysis under general conditions. The Annals of Statistics 16 759–771. [Google Scholar]

[R8] Bai J and Li K (2012). Statistical analysis of factor models of high dimension. The Annals of Statistics 40 436–465. [Google Scholar]

[R9] Bai J and NG S (2002). Determining the number of factors in approximate factor models. Econometrica 70 191–221. [Google Scholar]

[R10] Baik J, Ben AROUS G and PÉchÉ S (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability 1643–1697.

[R11] Bartlett MS (1938). Methods of estimating mental factors. Nature 141 609–610. [Google Scholar]

[R12] Bartlett MS (1950). Tests of significance in factor analysis. British Journal of Mathematical and Statistical Psychology 3 77–85. [Google Scholar]

[R13] Bean D, Bickel PJ, El KarOUI N and YU B (2013). Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences 110 14563–14568. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Benaych-GeorgeS F and Nadakuditi RR (2011). The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics 227 494–521. [Google Scholar]

[R15] Benjamini Y and HochberG Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological) 289–300.

[R16] Bickel PJ and Levina E (2008). Covariance regularization by thresholding. The Annals of Statistics 36 2577–2604. [Google Scholar]

[R17] Bickel PJ, Ritov Y and Tsybakov AB (2009). Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics 37 1705–1732. [Google Scholar]

[R18] Cai T and LIU W (2011). Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association 106 672–684. [Google Scholar]

[R19] Candes E and TaO T (2007). The dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics 35 2313–2351. [Google Scholar]

[R20] CandÈs EJ, LI X, Ma Y and Wright J (2011). Robust principal component analysis? Journal of the ACM (JACM) 58 11. [Google Scholar]

[R21] CandÈs EJ and Recht B (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9 717. [Google Scholar]

[R22] Cape J, TanG M and Priebe CE (2017). The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. arXiv preprint arXiv:1705.10735

[R23] Catoni O (2012). Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 48. Institut Henri Poincaré. [Google Scholar]

[R24] Cattell RB (1966). The scree test for the number of factors. Multivariate behavioral research 1 245–276. [DOI] [PubMed] [Google Scholar]

[R25] Chamberlain G and Rothschild M (1982). Arbitrage, factor structure, and mean-variance analysis on large asset markets

[R26] Cohen MB, Nelson J and WoodrUFF DP (2015). Optimal approximate matrix product in terms of stable rank. arXiv preprint arXiv:1507.02268

[R27] Davis C and Kahan WM (1970). The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis 7 1–46. [Google Scholar]

[R28] Desai KH and Storey JD (2012). Cross-dimensional inference of dependent high-dimensional data. Journal of the American Statistical Association 107 135–151. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] Dobriban E (2017). Factor selection by permutation. arXiv preprint arXiv:1710.00479

[R30] Donoho DL, Gavish M and Johnstone IM (2013). Optimal shrinkage of eigenvalues in the spiked covariance model. arXiv preprint arXiv:1311.0851 [DOI] [PMC free article] [PubMed]

[R31] EFRon B (2007). Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association 102 93–103. [Google Scholar]

[R32] EFRon B (2010). Correlated z-values and the accuracy of large-scale statistical estimates. Journal of the American Statistical Association 105 1042–1055. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] Eldridge J, Belkin M and Wang Y (2017). Unperturbed: spectral analysis beyond Davis-Kahan. arXiv preprint arXiv:1706.06516

[R34] Fama EF and French KR (1993). Common risk factors in the returns on stocks and bonds. Journal of financial economics 33 3–56. [Google Scholar]

[R35] Fan J, Fan Y and LV J (2008). High dimensional covariance matrix estimation using a factor model. Journal of Econometrics 147 186–197. [Google Scholar]

[R36] Fan J and Han X (2017). Estimation of the false discovery proportion with unknown dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 1143–1164. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] Fan J, Han X and Gu W (2012). Estimating false discovery proportion under arbitrary covariance dependence. Journal of the American Statistical Association 107 1019–1035. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] Fan J, Ke Y, Sun Q and Zhou W-X (2017a). Farm-test: Factor-adjusted robust multiple testing with false discovery control. arXiv preprint arXiv:1711.05386 [DOI] [PMC free article] [PubMed]

[R39] Fan J, Ke Y and Wang K (2016a). Decorrelation of covariates for high dimensional sparse regression. arXiv preprint arXiv:1612.08490

[R40] Fan J, Li Q and Wang Y (2017b). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 247–265. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R41] Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96 1348–1360. [Google Scholar]

[R42] Fan J, LiaO Y and Mincheva M (2011). High-dimensional covariance matrix estimation in approximate factor models. The Annals of Statistics 39 3320–3356. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R43] Fan J, LiaO Y and Mincheva M (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 603–680. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R44] Fan J, LIU H and Wang W (2018a). Large covariance estimation through elliptical factor models. Annals of Statistics 46 1383–1414. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R45] Fan J, Wang W and Zhong Y (2018b). An ℓ∞ eigenvector perturbation bound and its application. Journal of Machine Learning Research 18 1–42. [PMC free article] [PubMed] [Google Scholar]

[R46] Fan J, Wang W and ZHU Z (2016b). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. arXiv preprint arXiv:1603.08315 [DOI] [PMC free article] [PubMed]

[R47] Friguet C, Kloareg M and Causeur D (2009). A factor model approach to multiple testing under dependence. Journal of the American Statistical Association 104 1406–1415. [Google Scholar]

[R48] GaO C, Ma Z, Zhang AY and ZHOU HH (2015). Achieving optimal misclassification proportion in stochastic block model. arXiv preprint arXiv:1505.03772

[R49] Hirzel AH, Hausser J, Chessel D and Perrin N (2002). Ecological-niche factor analysis: how to compute habitat-suitability maps without absence data? Ecology 83 2027–2036. [Google Scholar]

[R50] Hochreiter S, Clevert D-A and ObermaYer K (2006). A new summarization method for affymetrix probe level data. Bioinformatics 22 943–949. [DOI] [PubMed] [Google Scholar]

[R51] Holland PW, Laskey KB and Leinhardt S (1983). Stochastic blockmodels: First steps. Social networks 5 109–137. [Google Scholar]

[R52] Horn JL (1965). A rationale and test for the number of factors in factor analysis. Psychometrika 30 179–185. [DOI] [PubMed] [Google Scholar]

[R53] Hotelling H (1933). Analysis of a complex of statistical variables into principal components. Journal of educational psychology 24 417. [Google Scholar]

[R54] HSU D and Kakade SM (2013). Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science. ACM. [Google Scholar]

[R55] Huber PJ (1964). Robust estimation of a location parameter. The annals of mathematical statistics 73–101.

[R56] Jin J (2015). Fast community detection by score. The Annals of Statistics 43 57–89. [Google Scholar]

[R57] Johnstone IM and LU AY (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association 104 682–693. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R58] Jolliffe IT (1986). Principal component analysis and factor analysis. In Principal component analysis Springer, 115–128. [Google Scholar]

[R59] Kendall MG (1965). A course in multivariate analysis

[R60] Keshavan RH, Montanari A and Oh S (2010). Matrix completion from noisy entries. Journal of Machine Learning Research 11 2057–2078. [Google Scholar]

[R61] Kneip A and Sarda P (2011). Factor models and variable selection in high-dimensional regression analysis. The Annals of Statistics 39 2410–2447. [Google Scholar]

[R62] Koltchinskii V and Lounici K (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli 23 110–133. [Google Scholar]

[R63] Koltchinskii V and Xia D (2016). Perturbation of linear forms of singular vectors under gaussian noise. In High Dimensional Probability VII Springer, 397–423. [Google Scholar]

[R64] Lam C and YaO Q (2012). Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics 40 694–726. [Google Scholar]

[R65] Lawley D and Maxwell A (1962). Factor analysis as a statistical method. Journal of the Royal Statistical Society. Series D (The Statistician) 12 209–229. [Google Scholar]

[R66] Leek JT and Storey JD (2008). A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences 105 18718–18723. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R67] Lelarge M and Miolane L (2016). Fundamental limits of symmetric low-rank matrix estimation. arXiv preprint arXiv:1611.03888

[R68] LI Q, Cheng G, Fan J and Wang Y (2017). Embracing the blessing of dimensionality in factor models. Journal of the American Statistical Association 1–10. [DOI] [PMC free article] [PubMed]

[R69] McCrae RR and John OP (1992). An introduction to the five-factor model and its applications. Journal of personality 60 175–215. [DOI] [PubMed] [Google Scholar]

[R70] Minsker S (2016). Sub-gaussian estimators of the mean of a random matrix with heavy-tailed entries. arXiv preprint arXiv:1605.07129

[R71] MOr-Yosef L and AvRon H (2018). Sketching for principal component regression. arXiv preprint arXiv:1803.02661

[R72] NG AY, Jordan MI and Weiss Y (2002). On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems

[R73] Onatski A (2010). Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics 92 1004–1016. [Google Scholar]

[R74] Onatski A (2012). Asymptotics of the principal components estimator of large factor models with weakly influential factors. Journal of Econometrics 168 244–258. [Google Scholar]

[R75] O’Rourke S, Vu V and Wang K (2016). Eigenvectors of random matrices: a survey. Journal of Combinatorial Theory, Series A 144 361–442. [Google Scholar]

[R76] O’Rourke S, Vu V and Wang K (2017). Random perturbation of low rank matrices: Improving classical bounds. Linear Algebra and its Applications

[R77] Paul D (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica 1617–1642.

[R78] Paul D, Bair E, Hastie T and Tibshirani R (2008). ” preconditioning” for feature selection and regression in high-dimensional problems. The Annals of Statistics 1595–1618.

[R79] Paulsen V (2002). Completely bounded maps and operator algebras, vol. 78. Cambridge University Press. [Google Scholar]

[R80] Pearson K (1901). Principal components analysis. The London, Edinburgh and Dublin Philosophical Magazine and Journal 6 566. [Google Scholar]

[R81] Rohe K, Chatterjee S and YU B (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39 1878–1915. [Google Scholar]

[R82] Sedghi H, Janzamin M and Anandkumar A (2016). Provable tensor methods for learning mixtures of generalized linear models. In Artificial Intelligence and Statistics

[R83] Shkolnisky Y and Singer A (2012). Viewing direction estimation in cryo-EM using synchronization. SIAM journal on imaging sciences 5 1088–1110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R84] Spearman C (1927). The abilities of man. [DOI] [PubMed]

[R85] SRIvastava N and Vershynin R (2013). Covariance estimation for distributions with 2 + ε moments. The Annals of Probability 41 3081–3111. [Google Scholar]

[R86] Stewart G and Sun J (1990). Matrix perturbation theory

[R87] Stock JH and Watson MW (2002). Forecasting using principal components from a large number of predictors. Journal of the American statistical association 97 1167–1179. [Google Scholar]

[R88] Storey JD (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 479–498. [Google Scholar]

[R89] Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 267–288. [Google Scholar]

[R90] TRon R and Vidal R (2009). Distributed image-based 3-D localization of camera sensor networks. In Decision and Control, 2009 held jointly with the 2009 28th Chinese Control Conference. CDC/CCC 2009. Proceedings of the 48th IEEE Conference on. IEEE. [Google Scholar]

[R91] TROpp JA (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12 389–434. [Google Scholar]

[R92] Vershynin R (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027

[R93] Vershynin R (2012). How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability 25 655–686. [Google Scholar]

[R94] Wang H (2012). Factor profiled sure independence screening. Biometrika 99 15–28. [Google Scholar]

[R95] Wang J, ZhaO Q, Hastie T and Owen AB (2017). Confounder adjustment in multiple hypothesis testing. The Annals of Statistics 45 1863–1894. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R96] Wang W and Fan J (2017). Asymptotics of empirical eigenstructure for high dimensional spiked covariance. Ann. Statist 45 1342–1374. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R97] Wedin P-A (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics 12 99–111. [Google Scholar]

[R98] WoodrUFF DP (2014). Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science 10 1–157. [Google Scholar]

[R99] Yang J, Meng X and Mahoney MW (2016). Implementing randomized matrix algorithms in parallel and distributed environments. Proceedings of the IEEE 104 58–92. [Google Scholar]

[R100] Yi X, Caramanis C and Sanghavi S (2016). Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749

[R101] YU Y, Wang T and SamwORth RJ (2014). A useful variant of the davis–kahan theorem for statisticians. Biometrika 102 315–323. [Google Scholar]

[R102] ZhaO P and YU B (2006). On model selection consistency of lasso. Journal of Machine learning research 7 2541–2563. [Google Scholar]

[R103] Zhong Y (2017). Eigenvector under random perturbation: A nonasymptotic Rayleigh-Schrö dinger theory. arXiv preprint arXiv:1702.00139

[R104] Zhong Y and Boumal N (2018). Near-optimal bounds for phase synchronization. SIAM Journal on Optimization 28 989–1016. [Google Scholar]

[R105] ZOU H and Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 301–320. [Google Scholar]

PERMALINK

Robust high dimensional factor models with applications to statistical machine learning

Jianqing Fan

Kaizheng Wang

Yiqiao Zhong

Ziwei Zhu

Abstract

1. INTRODUCTION

2. FACTOR MODELS AND PCA

2.1. Relationship between PCA and factor models in high dimensions

Fig 1.

2.2. Estimating the number of factors

2.3. Robust covariance inputs

2.4. Perturbation bounds

Fig 2.

3. APPLICATIONS TO HIGH-DIMENSIONAL STATISTICS

3.1. Covariance estimation

3.2. Principal component regression with random sketch

Fig 3.

3.3. Factor-Adjust Robust Multiple (FARM) tests

Fig 5.

Fig 4.

3.4. Factor-Adjusted Robust Model (FARM) selection

Fig 6.

4. RELATED LEARNING PROBLEMS

4.1. Gaussian mixture model

4.2. Community detection

Fig 7.

4.3. Matrix completion

4.4. Synchronization problems

Acknowledgments

APPENDIX A: PROOFS

Footnotes

Contributor Information

REFERENCES

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Robust high dimensional factor models with applications to statistical machine learning

Jianqing Fan

Kaizheng Wang

Yiqiao Zhong

Ziwei Zhu

Abstract

1. INTRODUCTION

2. FACTOR MODELS AND PCA

2.1. Relationship between PCA and factor models in high dimensions

Fig 1.

2.2. Estimating the number of factors

2.3. Robust covariance inputs

2.4. Perturbation bounds

Fig 2.

3. APPLICATIONS TO HIGH-DIMENSIONAL STATISTICS

3.1. Covariance estimation

3.2. Principal component regression with random sketch

Fig 3.

3.3. Factor-Adjust Robust Multiple (FARM) tests

Fig 5.

Fig 4.

3.4. Factor-Adjusted Robust Model (FARM) selection

Fig 6.

4. RELATED LEARNING PROBLEMS

4.1. Gaussian mixture model

4.2. Community detection

Fig 7.

4.3. Matrix completion

4.4. Synchronization problems

Acknowledgments

APPENDIX A: PROOFS

Footnotes

Contributor Information

REFERENCES

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases