Abstract
High-dimensional statistical tests often ignore correlations to gain simplicity and stability, leading to null distributions that depend on functionals of correlation matrices such as their Frobenius norm and other ℓr norms. Motivated by the computation of critical values of such tests, we investigate the difficulty of estimating functionals of sparse correlation matrices. Specifically, we show that simple plug-in procedures based on thresholded estimators of correlation matrices are sparsity-adaptive and minimax optimal over a large class of correlation matrices. Akin to previous results on functional estimation, the minimax rates exhibit an elbow phenomenon. Our results are further illustrated on simulated data as well as in an empirical study of data arising in financial econometrics.
Keywords: Covariance matrix, functional estimation, high-dimensional testing, minimax, elbow effect
1. Introduction
Covariance matrices are at the core of many statistical procedures such as principal component analysis or linear discriminant analysis. Moreover, not only do they arise as natural quantities to capture interactions between variables but, as we illustrate below, they often characterize the asymptotic variance of commonly used estimators. Following the original papers of Bickel and Levina (2008a, 2008b), much work has focused on the inference of high-dimensional covariance matrices under sparsity [Cai and Liu (2011), Cai, Ren and Zhou (2013), Cai and Yuan (2012), Cai, Zhang and Zhou (2010), Cai and Zhou (2012), El Karoui (2008), Lam and Fan (2009), Ravikumar et al. (2011)] and other structural assumptions related to sparse principal component analysis [Amini and Wainwright (2009), Berthet and Rigollet (2013a, 2013b), Birnbaum et al. (2013), Cai, Ma and Wu (2013, 2015), Fan, Fan and Lv (2008), Johnstone and Lu (2009), Levina and Vershynin (2012), Ma (2013), Onatski, Moreira and Hallin (2013), Paul and Johnstone (2012), Rothman, Levina and Zhu (2009), Fan, Liao and Mincheva (2011, 2013), Jung and Marron (2009), Vu and Lei (2012), Zou, Hastie and Tibshirani (2006)]. This area of research is very active and, as a result, this list of references is illustrative rather than comprehensive. This line of work can be split into two main themes: estimation and detection. The former is the main focus of the present paper. However, while most of the literature has focused on estimating the covariance matrix itself, under various performance measures, we depart from this line of work by focusing on functionals of the covariance matrix rather than the covariance matrix itself.
Estimation of functionals of unknown signals such as regression functions or densities is known to be different in nature from estimation of the signal itself. This problem has received the most attention in nonparametric estimation, originally in the Gaussian white noise model [Efromovich and Low (1996), Fan (1991), Ibragimov, Nemirovskiĭ and Khas'minskiĭ (1987), Nemirovskiĭ and Khas'minskiĭ (1987)] [see also Nemirovski (2000) for a survey of results in the Gaussian white noise model] and later extended to density estimation [Bickel and Ritov (1988), Hall and Marron (1987)] and various other models such as regression [Donoho and Nussbaum (1990), Cai and Low (2005, 2006), Klemelä (2006)] and inverse problems [Butucea (2007), Butucea and Meziani (2011)]. Most of these papers study the estimation of quadratic functionals and, interestingly, exhibit an elbow in the rates of convergence: there exists a critical regularity parameter below which the rate of estimation is nonparametric and above which it becomes parametric. As we will see below, this phenomenon also arises when regularity is measured by sparsity.
Over the past decade, sparsity has become the prime measure of regularity, both for its flexibility and generality. In particular, smooth functions can be viewed as functions with a sparse expansion in an appropriate basis. At a high level, sparsity assumes that many of the unknown parameters are equal to zero or nearly so, so that the few nonzero parameters can be consistently estimated using a small number of observations relative to the apparent dimensionality of the problem. Moreover, sparsity acts not only as a regularity parameter that stabilizes statistical procedures but also as a key feature for interpretability. Indeed, it is often the case that setting many parameters to zero simply corresponds to a simpler sub-model, and the main idea is to let the data select the correct sub-model. This is the case in particular for covariance matrix estimation, where zeros in the matrix correspond to uncorrelated variables. Yet, while the value of sparsity for covariance matrix estimation has been well established, to the best of our knowledge, this paper provides the first analysis of the estimation of functionals of sparse covariance matrices. Indeed, the actual performance of many estimators critically depends on such functionals. Therefore, accurate functional estimation leads to a better understanding of the performance of many estimators and can ultimately serve as a guide to selecting the best estimator. Applications of our results are illustrated in Section 2.
Our work is not only motivated by real applications, but also by a natural extension of the theoretical analysis carried out in the sparse Gaussian sequence model [Cai and Low (2005)]. In that paper, Cai and Low assume that the unknown parameter θ belongs to an ℓq-ball, where q > 0 can be arbitrarily close to 0. Such balls are known to emulate sparsity and actually correspond to a more accurate notion of sparsity for signals θ encountered in applications [see, e.g., Foucart and Rauhut (2013)]. They also show that a nonquadratic estimator can be fully efficient for estimating quadratic functionals. We extend some of these results to covariance matrix estimation. Such an extension is not trivial since, unlike in the Gaussian sequence model, a covariance matrix lies on a high-dimensional manifold and its estimation exhibits complicated dependencies in the noise structure.
We also compare our optimal rates for estimating matrix functionals with those for estimating the matrix itself. Many methods have been proposed to estimate covariance matrices under various notions of sparsity using different techniques, including thresholding [Bickel and Levina (2008a)], tapering [Bickel and Levina (2008b), Cai, Zhang and Zhou (2010), Cai and Zhou (2012)] and penalized likelihood [Lam and Fan (2009)], to name only a few. These methods often lead to minimax optimal rates over various classes and under several metrics [Cai, Zhang and Zhou (2010), Cai and Zhou (2012), Rigollet and Tsybakov (2012)]. However, optimal rates for estimating matrix functionals have received little attention in the literature. Intuitively, estimating a matrix functional should enjoy faster rates of convergence than estimating the matrix itself, since it is a one-dimensional estimation problem and the elementwise estimation errors partially cancel when summed. We will see that this is indeed the case when we compare the minimax rates of estimating matrix functionals with those of estimating matrices.
The rest of the paper is organized as follows. We begin in Section 2 with two motivating examples of high-dimensional hypothesis testing problems: a two-sample testing problem for Gaussian means that arises in genomics, and validating the efficiency of markets based on the Capital Asset Pricing Model (CAPM). Next, in Section 3, we introduce an estimator of the quadratic functional of interest that is based on the thresholding estimator introduced in Bickel and Levina (2008a). We also prove its optimality in a minimax sense over a large class of sparse covariance matrices. The study is further extended, in Section 4, to estimating other measures of sparsity of a covariance matrix. Finally, we study the numerical performance of our estimator in Section 5 on simulated experiments as well as in the framework of the two applications described in Section 2. Due to space restrictions, the proofs for the upper bounds are relegated to the Appendix in the supplementary material [Fan, Rigollet and Wang (2015)].
Notation: Let d be a positive integer. The space of d × d positive semi-definite matrices is denoted by . For any two integers c < d, define [c : d] = {c, c + 1, …, d} to be the sequence of contiguous integers between c and d, and we simply write [d] = {1, …, d}. Id denotes the identity matrix of . Moreover, for any subset S ⊂ [d], denote by 1S ∈ {0, 1}d the column vector with jth coordinate equal to one iff j ∈ S. In particular, 1[d] denotes the d dimensional vector of all ones.
We denote by tr the trace operator on square matrices and by diag (resp., off) the linear operator that sets to 0 all the off-diagonal (resp., diagonal) elements of a square matrix. The Frobenius norm of a real matrix M is denoted by ∥M∥F and is defined by ∥M∥F = (tr(M⊺M))1/2. Note that ∥M∥F is the Hilbert–Schmidt norm associated with the inner product 〈A, B〉 = tr(A⊺B) defined on the space of real rectangular matrices of the same size. Moreover, |A| denotes the determinant of a square matrix A. The variance of a random variable X is denoted by var(X).
In the proofs, we often employ C to denote a generic positive constant that may change from line to line.
2. Two motivating examples
In this section, we describe our main motivation for estimating quadratic functionals of a high-dimensional covariance matrix in the light of two applications to high-dimensional testing problems. The first one is a high-dimensional two-sample hypothesis testing with applications in gene-set testing. The second example is about testing the validity of the capital asset pricing model (CAPM) from financial economics.
2.1. Two-sample hypothesis testing in high-dimensions
In various statistical applications, in particular in genomics, the dimensionality of the problem is so large that statistical procedures involving inverse covariance matrices are not viable due to their lack of stability from both a statistical and a numerical point of view. This limitation can be well illustrated on a showcase example: two-sample hypothesis testing [Bai and Saranadasa (1996)] in high dimensions.
Suppose that we observe two independent samples that are i.i.d. and that are i.i.d. . Let n = n1 + n2. The goal is to test H0 : μ1 = μ2 vs. H1 : μ1 ≠ μ2.
Assume first that Σ1 = Σ2 = Σ. In this case, Hotelling's test is commonly employed when p is small. Nevertheless, when p is large, Bai and Saranadasa (1996) showed that the test based on Hotelling's T 2 has low power and suggested a new statistic M for the random-matrix asymptotic regime where n, p → ∞, , . The statistic, which implements the naive Bayes rule, is defined as
and is proved to be asymptotically normal under the null hypothesis with
Clearly, the asymptotic variance of M depends on the unknown covariance matrix Σ through its quadratic functional, and in order to compute the critical value of the test, Bai and Saranadasa suggest estimating by the quantity
They show that B2 is a ratio-consistent estimator of in the sense that . Clearly, this solution does not leverage any sparsity assumption and may suffer from power deficiency if the matrix Σ is indeed sparse. Rather, if the covariance matrix Σ is believed to be sparse, one may prefer to use a thresholded estimator for Σ as in Bickel and Levina (2008a) rather than the empirical covariance matrix . In this case, we estimate by , where could be any consistent estimator of σij and τ > 0 is a threshold parameter.
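To make the thresholding alternative concrete, here is a minimal Python sketch of a plug-in estimate of tr(Σ²) built from a thresholded sample covariance matrix. The choice of pilot estimator for σij, the decision not to threshold the diagonal, and the constant in the threshold are assumptions of this sketch rather than prescriptions from the text.

```python
import numpy as np

def tr_sigma_squared_thresholded(X, C=1.5):
    """Sketch: estimate tr(Sigma^2) by squaring the entries of a thresholded
    sample covariance matrix. Off-diagonal entries smaller than
    tau = C * sqrt(log p / n) are set to zero; the diagonal is kept untouched
    (an assumption of this sketch)."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)              # sample covariance (one of many possible pilots)
    tau = C * np.sqrt(np.log(p) / n)
    T = S.copy()
    mask = np.abs(T) < tau
    np.fill_diagonal(mask, False)            # do not threshold the variances
    T[mask] = 0.0
    return np.sum(T ** 2)                    # equals tr(T @ T) for symmetric T
```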
More recently, Chen and Qin (2010) considered the case Σ1 ≠ Σ2 and proposed a test statistic based on an unbiased estimate of each of the three quantities in . In this case, the quantities and 〈Σ1, Σ2〉 appear in the asymptotic variance. The detailed formulation and assumptions of this statistic, as well as discussions of other testing methods such as Srivastava and Du (2008), are provided in the supplementary material [Fan, Rigollet and Wang (2015)] for completeness. If Σ1 and Σ2 are indeed sparse, akin to the above reasoning, we can also estimate and 〈Σ1, Σ2〉 using thresholding to leverage the sparsity assumption. It is not hard to derive a theory for estimating quadratic functionals involving two covariance matrices, but the details of this procedure are beyond the scope of the present paper.
2.2. Testing high-dimensional CAPM model
The capital asset pricing model (CAPM) is a simple financial model that postulates how individual asset returns are related to the market risks. Specifically, the individual excessive return of asset i ∈ [N] over the risk-free rate at time t ∈ [T] can be expressed as an affine function of a vector of K risk factors :
(2.1) |
where we assume for any t ∈ [T], are observed. The case K = 1 with ft being the excessive return of the market portfolio corresponds to the CAPM [Lintner (1965), Mossin (1966), Sharpe (1964)]. It is nowadays more common to employ the Fama–French three-factor model [see Fama and French (1993) for a definition] for the US equity market, corresponding to K = 3.
For simplicity, let us rewrite the model (2.1) in the vectorial form
The multi-factor pricing model postulates α = 0. Namely, all returns are fully compensated by their risks: no extra returns are possible and the market is efficient. This leads us to naturally consider the hypothesis testing problem H0 : α = 0 vs. H1 : α ≠ 0.
Let and be the least-squares estimate and be a residual vector. Then an unbiased estimator of Σ = var(εt) is
Let and MF = IT − F(F⊺ F)−1F⊺ where F = (f1, …, fT)⊺ . Define the Wald-type test statistic with correlations ignored, whose normalized version is given by
(2.2) |
Under some conditions, it was shown by Pesaran and Yamagata (2012) that, under H0, as N → ∞. Moreover, if 's are i.i.d. Gaussian, it holds that and
where ν = T − K − 1 is the degrees of freedom and
where ρ = D−1/2ΣD−1/2 with D = diag(Σ) is the correlation matrix of the stationary process (εt)t∈[T]. The authors go on to propose an estimator of this quadratic functional of ρ by replacing the correlation coefficients ρij in the above expression by , where and τ > 0 is a threshold parameter. However, they did not provide any analysis of this method, nor any guidance on how to choose τ.
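As an illustration of the thresholded plug-in just described, here is a hedged Python sketch that forms sample correlations from the regression residuals and sums the squared thresholded off-diagonal entries. The exact normalization entering the asymptotic variance in (2.2) is not reproduced, and the threshold level is an assumption of the sketch.

```python
import numpy as np

def thresholded_corr_quadratic(residuals, C=1.5):
    """Sketch: plug-in estimate of the off-diagonal quadratic functional of the
    correlation matrix rho from OLS residuals (a T x N array), replacing each
    rho_ij by its thresholded sample counterpart."""
    T_obs, N = residuals.shape
    S = np.cov(residuals, rowvar=False)      # N x N residual covariance
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)                   # sample correlation matrix
    tau = C * np.sqrt(np.log(N) / T_obs)     # threshold of order sqrt(log N / T)
    R_off = R - np.diag(np.diag(R))          # off-diagonal part
    R_off[np.abs(R_off) < tau] = 0.0         # hard thresholding
    return np.sum(R_off ** 2)                # sum_{i != j} rho_hat_ij^2
```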
3. Optimal estimation of quadratic functionals
In the previous section, we described rather general questions involving the estimation of quadratic functionals of covariance or correlation matrices. We begin by observing that consistent estimation of is impossible unless p = o(n). This precludes in particular the high-dimensional framework that motivates our study.
Our goal is to estimate the Frobenius norm of a sparse p × p covariance matrix Σ using n i.i.d. observations . Observe that can be decomposed as where corresponds to the off-diagonal elements and corresponds to the diagonal elements. The following proposition implies that even if Σ = diag(Σ) is diagonal, the quadratic functional cannot be estimated consistently in absolute error if p ≥ n. This makes sense intuitively, as the diagonal of Σ consists of p unknown parameters while we have only n observations. Note that the situation is quite different when it comes to relative error. Indeed, the estimator of Bai and Saranadasa (1996) is consistent in relative error with no sparsity assumption, even in the high-dimensional regime. The study of relative error in the presence of sparsity is an interesting question that deserves further developments.
Proposition 3.1. Fix n, p ≥ 1 and let
be the class of diagonal covariance matrices with diagonal elements bounded by 1. Then there exists a universal constant C > 0 such that
In particular, it implies that
where the infima are taken over all real-valued measurable functions of the observations.
Proof
Our lower bounds rely on standard arguments from minimax theory. We refer to Chapter 2 of Tsybakov (2009) for more details. In the sequel, let denote the Kullback–Leibler divergence between two distributions P and , where . It is defined by
We are going to employ a simple two-point lower bound. Fix ε ∈ (0, 1/2) and let (resp., ) denote the distribution of a sample X1, …, Xn where [resp., ]. Next, observe that Ip, so that
(3.1) |
Moreover, |D(Ip) − D((1 − ε)Ip)| = p(2ε − ε2) > pε. Then it follows from the Markov inequality that
(3.2) |
where the last inequality follows from Theorem 2.2(iii) of Tsybakov (2009).
Completion of the proof requires an upper bound on . To that end, note that it follows from the chain rule and simple algebra that
Taking now yields . Together with (3.1) and (3.2), it yields
To complete the proof, we square the above inequality and employ Jensen's inequality.
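For the reader's convenience, here is the standard Gaussian computation behind the "chain rule and simple algebra" step, written as a sketch. The direction of the divergence, the roles of the two matrices and the choice of ε of order (np)^{−1/2} are our reading of the partially elided displays, not a verbatim reproduction.

```latex
\[
  \mathrm{KL}\bigl(\mathcal N(0,\Sigma_0)^{\otimes n}\,\big\|\,\mathcal N(0,\Sigma_1)^{\otimes n}\bigr)
  \;=\;\frac n2\Bigl[\operatorname{tr}\bigl(\Sigma_1^{-1}\Sigma_0\bigr)-p
    +\log\frac{|\Sigma_1|}{|\Sigma_0|}\Bigr],
\]
\[
  \text{so with }\Sigma_0=I_p,\ \Sigma_1=(1-\varepsilon)I_p:\qquad
  \mathrm{KL}
  \;=\;\frac{np}2\Bigl[\frac{\varepsilon}{1-\varepsilon}+\log(1-\varepsilon)\Bigr]
  \;=\;\frac{np\,\varepsilon^{2}}{4}\bigl(1+o(1)\bigr),\quad \varepsilon\to0 .
\]
```

Taking ε proportional to (np)^{−1/2} therefore keeps the divergence bounded by a constant, while the separation |D(Ip) − D((1 − ε)Ip)| > pε is of order (p/n)^{1/2}, consistent with the p/n rate for the diagonal part discussed below.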
To overcome the above limitation, we consider the following class of sparse covariance matrices (indeed correlation matrices). For any q ∈ [0, 2), R > 0 let denote the set of p × p covariance matrices defined by
(3.3) |
Note that for this class of matrices, we assume that the variance along each coordinate is normalized to 1. This normalization is frequently obtained via sample estimates, as shown in the previous section. This simplifying assumption is also motivated by Proposition 3.1 above, which implies that for a general covariance matrix cannot be estimated accurately in absolute error in the large p, small n regime, since sparsity assumptions on the diagonal elements are implausible. Note that the condition diag(Σ) = Ip implies that the diagonal part D(Σ) of matrices in can be estimated without error, so that we may achieve consistency even in the case of large p, small n.
Matrices in have many small coefficients for small values of q and R. In particular, when q = 0, there are no more than R nonvanishing correlation entries. Following a major trend in the estimation of sparse covariance matrices [Bickel and Levina (2008a, 2008b), Cai and Liu (2011), Cai and Yuan (2012), Cai, Zhang and Zhou (2010), Cai and Zhou (2012), El Karoui (2008), Lam and Fan (2009)], we employ a thresholding estimator of the covariance matrix as a workhorse to estimate the quadratic functionals. From the n i.i.d. observations , we form the empirical covariance matrix that is defined by
(3.4) |
with elements and for any threshold τ > 0, let denote the thresholding estimator of Σ defined by if i ≠ j and .
Next, we employ a simple plug-in estimator for Q(Σ):
(3.5) |
Note that no value of the diagonal elements is used to estimate Q(Σ).
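A minimal Python sketch of the estimator (3.4)–(3.5) follows. The centering convention of the empirical covariance matrix and the constant in the threshold, which Theorem 3.1 takes of order sqrt(log p/n), are assumptions of this sketch.

```python
import numpy as np

def Q_hat(X, tau):
    """Plug-in estimate of Q(Sigma) = sum_{i != j} sigma_ij^2: hard-threshold
    the off-diagonal entries of the empirical covariance matrix at tau and sum
    their squares. Diagonal entries are never used, as noted in the text."""
    S = np.cov(X, rowvar=False)              # empirical covariance of the n rows of X
    off = S - np.diag(np.diag(S))            # off-diagonal part
    off[np.abs(off) < tau] = 0.0             # hard thresholding
    return np.sum(off ** 2)

# Example usage with a threshold of order sqrt(log p / n) (constant is illustrative):
# n, p = X.shape; tau = 2.0 * np.sqrt(np.log(p) / n); Q_hat(X, tau)
```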
In the rest of this section, we establish that is minimax adaptive over the scale {, q ∈ [0, 2), R > 0}. Interestingly, we will see that the minimax rate presents an elbow, as is often the case in quadratic functional estimation.
Theorem 3.1
Assume that γ log(p) < n for some constant γ > 8 and fix C0 ≥ 4. Consider the threshold
and assume that τ ≤ 1. Then, for any q ∈ [0, 2), R > 0, the plug-in estimator satisfies
where
and C1, C2 are positive constants depending on γ, C0, q.
The proof is postponed to the supplementary material.
Note that the rates ψn,p(q, R) present an elbow at q = 1 − log log p/log n, as is usually the case in functional estimation. We now argue that the rates ψn,p(q, R) are optimal in a minimax sense for a wide range of settings. In particular, the elbow effect arising from the maximum in the definition of ψ is not an artifact. In the following theorem, we emphasize the dependence on Σ by using the notation for the expectation with respect to the distribution of the sample X1, …, Xn, where .
Theorem 3.2
Fix q ∈ [0, 2), R > 0 and assume 2 log p < n and R2 < (p − 1)n−q/2. Then there exists a positive constant C3 > 0 such that
where ϕn,p(q, R) is defined by
(3.6) |
and the infimum is taken over all measurable functions of the sample X1, …, Xn.
Before proceeding to the proof, a few remarks are in order.
- The additional term of order p4−γ/2 in Theorem 3.1 can be made negligible by taking γ large enough. We decided to keep this term in order to show this tradeoff explicitly.
- When 1 ≤ R2 < pαn−q for some constant α < 1, a slightly stronger requirement than in Theorem 3.2, the lower bound there can be written as
(3.7) |
Observe that the above lower bound matches the upper bound presented in Theorem 3.1 when R2/(2−q) log p ≤ n. Arguably, this is the most interesting range as it characterizes rates of convergence (to zero) rather than rates of divergence, which may be of a different nature [see, e.g., Verzelen (2012)]. In other words, the rates given in (3.7) are minimax adaptive with respect to n, R, p and q. In our formulation, we allow R = Rn,p to depend on other parameters of the problem; we choose here to keep the notation light.
- The reason we choose the correlation matrix class to present the elbow effect is just for simplicity. Actually, we can replace the constraint diag(Σ) = Ip in the definition of by boundedness of the diagonal elements of Σ. Then, for estimating the off-diagonal part Q(Σ), following exactly the same derivation, the same elbow phenomenon is observed. Meanwhile, the optimal rate for estimating the diagonal part D(Σ) is again of order p/n. This optimal rate can be attained by the estimator
(3.8) |
We omit the proof here. Thus, if we do not have prior information about the diagonal elements, we can still optimally estimate the quadratic functional of a covariance matrix by applying the thresholding method (3.5) to the off-diagonal elements, together with (3.8) for the diagonal elements.
- The rate ϕn,p(q, R) presents the same elbow phenomenon at q = 1 observed in the estimation of functionals, starting independently with the work of Bickel and Ritov (1988) and Fan (1991). Closer to the present setup is the work of Cai and Low (2005), who study the estimation of functionals of "sparse" sequences in the infinite Gaussian sequence model. There, a parameter controls the speed of decay of the unknown coefficients. Note that while smaller values of q lead to sparser matrices Σ, no estimator can benefit further from sparsity below q = 1 [the estimator has a rate of convergence O(R2/n) for any q < 1], unlike in the case of estimation of Σ. Again, this is inherent to estimating functionals.
- The condition R2 < (p − 1)n−q/2 corresponds to the high-dimensional regime and allows us to keep the terms inside the logarithm clean. Similar assumptions are made in the related literature [see, e.g., Cai and Zhou (2012)].
- The optimal rates obtained here cannot be deduced from existing ones for estimating sparse covariance matrices. In particular, the latter do not exhibit an elbow phenomenon. Specifically, Rigollet and Tsybakov (2012) showed that the optimal rate for estimating Σ over under the Frobenius norm is for 0 ≤ q < 2. Using this, it is not hard to derive that, with high probability,
since if the nonvanishing correlations are bounded away from zero. When q < 2, the first term always dominates, so that we do not observe the elbow effect. In addition, the rate so obtained is not optimal.
We now turn to the proof of Theorem 3.2
Proof of Theorem 3.2
To prove minimax lower bounds, we employ a standard technique that consists of reducing the estimation problem to a testing problem. We split this proof into two parts and begin by proving
for some positive constant C > 0. To that end, for any , let denote the distribution of . It is not hard to show that if |A| > 0 and |B| > 0, , then the Kullback–Leibler divergence between and is given by
(3.9) |
Next, take A and B to be of the form
where a, b ∈ (0, 1/2), 0 is a generic symbol to indicate that the missing space is filled with zeros, and 1 denotes a vector of ones of length k/2. Note that if we have random variables (X, Y, Z1, …, Zp−2) chosen from the distribution , meaning that the Zk's are independent of X, Y but the correlation between X and Y is a, then the random vector (X, …, X, Y, …, Y, Z1, …, Zp−k), with k/2 copies each of X and Y, follows . These two matrices are clearly degenerate and come from perfectly correlated random variables. Since perfectly correlated random variables do not add new information, for such matrices, an application of (3.9) yields
Next, using the convexity inequality log(1 + x) ≥ x − x2/2 for all x > 0, we get that
using the fact that a, b ∈ (0, 1/2). Take now if R > 4
so that we indeed have a, b ∈ (0, 1/2) and also A(k), obviously. If R < 4, take k = 2, , instead. Moreover, this choice leads to . Using standard techniques to reduce estimation problems to testing problems [see, e.g., Theorem 2.5 of Tsybakov (2009)], we find that
For the above choice of A and B, we have
Since A(k), , the above two displays imply that
which completes the proof of the first part of the lower bound.
For the second part of the lower bound, we reduce our problem to a testing problem of the same flavor as Arias-Castro, Bubeck and Lugosi (2015), Berthet and Rigollet (2013b). Note, however, that our construction is different because the covariance matrices considered in these papers do not yield large enough lower bounds. We use the following construction.
Fix an integer k ∈ [p − 1] and let denote the set of subsets of [p − 1] that have cardinality k. Fix a ∈ (0, 1) to be chosen later and for any , recall that 1S is the column vector in {0, 1}p−1 with support given by S. For each , we define the following p × p covariance matrix:
(3.10) |
Let denote the distribution of and denote the distribution of . Let (resp., ) denote the distribution of X = (X1, …, Xn), a collection of n i.i.d. random variables drawn from (resp., ). Moreover, let denote the distribution of X where the Xi 's are drawn as follows: first draw S uniformly at random from and then, conditionally on S, draw X1, …, Xn independently from . Note that is a mixture of distributions of n independent samples rather than the distribution of n independent random vectors drawn from a mixture distribution. Consider the following testing problem:
Using Theorem 2.2, part (iii) of Tsybakov (2009), we get that for any test ψ = ψ(X), we have
where we recall that the χ2-divergence between two probability distributions P and Q is defined by
Lemma A.1 implies that for suitable choices of the parameters a and k, we have so that the test errors are bounded below by a constant C = e−2/4. Since Q(ΣS) = 2ka2 for any , it follows from a standard reduction from hypothesis testing to estimation [see, e.g., Theorem 2.5 of Tsybakov (2009)] that the above result implies the following lower bound:
(3.11) |
for some positive constant C, where the infimum is taken over all estimators of Q(Σ) based on n observations and is the class of covariance matrices defined by
To complete the proof, observe that the values of a and k prescribed in Lemma A.1 imply that and give the desired lower bound. Note first that, for any choice of a and k, the following holds trivially: and diag(ΣS) = Ip for any . Write ΣS = (σij) and observe that
Next, we treat each case of Lemma A.1 separately.
Case 1. Note first that 2kaq = R/2 < R so that . Moreover, k2a4 = CR4/q.
Case 2. Note first that 2kaq ≤ R/2 < R so that . Since k ≥ 2 and k2 ≤ R2nq, we have
Therefore,
Combining the two cases, we get
Together with (3.11), this completes the proof of the second part of the lower bound.
4. Extension to nonquadratic functionals
Closely related to the quadratic functional is the ℓr functional of covariance matrices, which is defined by
(4.1) |
It is often used to measure the sparsity of a covariance matrix and plays an important role in estimating sparse covariance matrices. This, along with the theoretical interest in the difficulty of estimating such a functional, gives rise to the present study. Note that the ℓ1(Σ) functional is simply the ℓ1-norm of the covariance matrix Σ, whereas for r = 2, the ℓr functional is the maximal row-wise quadratic functional. Thus, the nonquadratic ℓr functional is a natural extension of such a maximal quadratic functional, and its optimal estimation is the main focus of this section.
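Display (4.1) did not survive extraction above. Based solely on the surrounding remarks (ℓ1 being the ℓ1-norm of Σ and r = 2 giving the maximal row-wise quadratic functional), our reading of the definition, recorded here only as an assumption used in the code sketches below, is

```latex
\[
  \ell_r(\Sigma)\;=\;\max_{1\le i\le p}\;\sum_{j=1}^{p}\,|\sigma_{ij}|^{\,r},
\]
```

where whether the diagonal terms are included is not determined by the text (Section 4.2 works with ℓr(off(Σ))).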
4.1. Optimal estimation of ℓr functionals
We consider a class of matrices with a row-wise sparsity structure as follows:
(4.2) |
for q ∈ [0, r) and R > 0 which can depend on n and p. A similar class of covariance matrices has been considered by Bickel and Levina (2008a) and Cai and Zhou (2012).
Theorem 4.1. Fix q ∈ [0, r), R > 0 and assume that 2 log p < n and R2 < (p − 1)n−q/2. Then there exists a positive constant C4 > 0 such that,
where is defined by
(4.3) |
and the infimum is taken over all measurable functions of the sample X1, …, Xn.
The proof is similar to that of Theorem 3.2 and is relegated to the Appendix.
As in (3.7), when 1 < R2 < pαn−q for some α < 1, the lower bound in Theorem 4.1 can be written as
(4.4) |
To establish the upper bound, we consider again a thresholding estimator. Naturally, we estimate the ℓr functional of each row, denoted by , using the thresholding technique. Following the same notation as in the previous section, the estimator is defined by
(4.5) |
for a threshold τ > 0. We will see in the next theorem that this estimator achieves the adaptive minimax optimal rate.
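The following Python sketch implements our reading of (4.5): threshold the off-diagonal entries of the sample covariance matrix and take the maximum over rows of the r-th power sums. Whether the diagonal enters the row sums and the exact threshold level are assumptions here.

```python
import numpy as np

def lr_hat(X, tau, r, include_diagonal=False):
    """Sketch of the thresholded plug-in estimator of the l_r functional:
    max over rows of sum_j |sigma_hat_ij|^r, with off-diagonal entries
    hard-thresholded at tau."""
    S = np.cov(X, rowvar=False)
    A = np.abs(S.copy())
    mask = A < tau
    np.fill_diagonal(mask, False)            # threshold only off-diagonal entries
    A[mask] = 0.0
    if not include_diagonal:
        np.fill_diagonal(A, 0.0)             # l_r of the off-diagonal part only
    return np.max(np.sum(A ** r, axis=1))
```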
Theorem 4.2
Assume that γ log(p) < n for some constant γ > 8 and fix C0 ≥ 4. Consider the threshold
and assume that τ ≤ 1. Then, for any q ∈ [0, r), R > 0, the plug-in estimator satisfies
where
and C5 and C6 are positive constants.
The proof of this theorem is a generalization of the proof of Theorem 3.1, but some aspects of independent interest are presented here. In the proof of Theorem 3.1, we used the decomposition
which is actually the Taylor expansion of at σi,j. Carefully scrutinizing the proof, we find that the first term has the parametric rate O(R2/n) whereas the second term contributes to the rate O(R2(log p/n)2−q). This phenomenon can be generalized to the ℓr-functional. In the latter case, we will apply the Taylor expansion of at |σij|, and the first-order term will contribute to the parametric rate of O(R2 log p/n) while the second-order term has the rate O(R2(log p/n)r−q). The elbow effect stems from the dominance of estimation errors of the first- and second-order terms of Taylor's expansion. We relegate the complete proof to the supplementary material.
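For completeness, the exact two-term expansion referred to here, written for a generic estimate σ̂ij of σij, is the following; summing over i ≠ j, the linear term drives the parametric O(R2/n) contribution while the quadratic term contributes the O(R2(log p/n)2−q) term quoted above.

```latex
\[
  \hat\sigma_{ij}^{\,2}-\sigma_{ij}^{2}
  \;=\; 2\,\sigma_{ij}\bigl(\hat\sigma_{ij}-\sigma_{ij}\bigr)
  \;+\;\bigl(\hat\sigma_{ij}-\sigma_{ij}\bigr)^{2}.
\]
```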
A few remarks should be mentioned:
The combination of the two theorems implies that the estimator is minimax adaptive over the space under very mild conditions. The adaptive minimax optimal rate of convergence is given by (4.4). The term p4−γ/2 can be made arbitrarily small by choosing γ large enough.
The ℓr functional involves the maximum of the row sums. Compared with estimating the quadratic functional, we pay the price of an extra log p term in the parametric rate.
The rate presents the elbow phenomenon at q = r − 1 if r > 1. So the row-wise quadratic functional (r = 2) exhibits the same elbow behavior as the quadratic functional .
4.2. Optimal detection of correlations
In this subsection, we illustrate the intrinsic link between functional estimation and hypothesis testing. To that end, consider the following hypothesis testing problem:
This problem is intimately linked to sparse principal component analysis [Berthet and Rigollet (2013a, 2013b)]. A natural question associated with this problem is to find the minimal signal strength κ such that these hypotheses can be tested with high accuracy.
The previous subsection provides the optimal estimate of ℓr(off(Σ)). However, we need a result that holds with high probability rather than in expectation. Using Lemma 4.2 in the supplementary material [Fan, Rigollet and Wang (2015)] and arguments similar to those employed to prove Theorem 4.2, it is not hard to show that
with probability larger than 1 − 4p−(γ−2) =: 1 − δ. Therefore, letting
we get and . Here, denotes the probability under the null hypothesis and denotes the largest probability over the composite alternative. To build a hypothesis test, note that if s1 > s0, then for any s ∈ [s0, s1], the test satisfies . We say that the test ψ discriminates between H0 and H1 with accuracy δ.
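A compact Python sketch of the resulting detection procedure is given below; the cut-off s must be taken in the interval [s0, s1] described above (whose expressions are elided in the text), and the threshold level is again an assumption.

```python
import numpy as np

def correlation_detection_test(X, tau, r, s):
    """Sketch of the test of Section 4.2: compute the thresholded l_r functional
    of the off-diagonal part of the sample covariance matrix and reject H0
    (no correlation) when it exceeds the cut-off s."""
    S = np.cov(X, rowvar=False)
    A = np.abs(S - np.diag(np.diag(S)))      # off-diagonal magnitudes
    A[A < tau] = 0.0                         # hard thresholding
    stat = np.max(np.sum(A ** r, axis=1))    # l_r(off(Sigma_hat_tau))
    return stat >= s                         # True = reject H0
```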
Theorem 4.3
Assume that n, p, R, q, r and δ are such that where
Then, for any and for any s ∈ [s0, s1], the test discriminates between H0 and H1 with accuracy δ.
The minimax risk for the correlation detection is given in the next theorem, which will be proved in the Appendix.
Theorem 4.4
For fixed ν > 0, define by
Then for any ,
where the infimum is taken over all possible tests and Cν > 0 is a continuous function of ν that tends to 1/2 as ν → 0.
If we assume the high-dimensional regime R2 < pαn−q for some α < 1, as discussed before, then the lower bound matches the upper bound. The theorems therefore imply that no test can have asymptotic power for correlation detection unless κ is of higher order than R1/r(log p/n)(r−q)/(2r), and that the detection method based on the optimal ℓr(Σ) estimator is also optimal for testing the existence of correlations.
5. Numerical experiments
Simulations are conducted in this section to evaluate the numerical performance of our plug-in estimator for quadratic functionals. Then the proposed method is applied to two high-dimensional testing problems: simulated two-sample data and real financial equity market data.
5.1. Quadratic functional estimation
We first study the behavior of estimators for the total quadratic functional and for its off-diagonal part. To that end, four sparse covariance matrix structures were used in the simulations:
(M1) auto-correlation AR(1) covariance matrix σij = 0.25|i−j|;
(M2) banded correlation matrix with σij = 0.3 if |i − j| = 1 and 0 otherwise;
(M3) sparse matrix with a block, size p/20 by p/20, of correlation 0.3;
(M4) identity matrix (it attains the maximal level of sparsity).
We chose p = 500 and let n vary from 30 to 100. For estimating the total quadratic functional, our proposed thresholding estimator, the BS [Bai and Saranadasa (1996)] estimator and the CQ [Chen and Qin (2010)] estimator were applied to each setting over 500 repetitions. Their mean absolute estimation errors are reported in log scale (base 2) in Figure 1, with standard deviations omitted here. BS and CQ cannot be used directly for off-diagonal quadratic functional estimation, so we deducted from both of them to serve as estimators of the off-diagonal part only. The mean absolute estimation errors, compared with our proposed estimator , are depicted in log scale (base 2) in Figure 2.
Fig. 1.
Performance of estimating using thresholded estimator (dotted), CQ (solid) and BS (dashed). The mean of absolute errors over 500 repetitions in log scale (base 2) versus the sample size were reported for matrix M1 (top left), M2 (top right), M3 (bottom left), M4 (bottom right).
Fig. 2.
Performance of estimating Q(Σ) using thresholded estimator (dotted), (solid) and (dashed). The mean of absolute errors over 500 repetitions in log scale (base 2) versus the sample size were reported for matrix M1 (top left), M2 (top right), M3 (bottom left), M4 (bottom right).
The four plots correspond to the four covariance structures above. We did not report the estimation error of directly using the naive plug-in, which is an obvious disaster. In all four cases, the BS method (dashed line) does not perform well in the "large p, small n" regime. The CQ method (solid line) exhibits a relatively small estimation error in general, but it can still be improved using the thresholding method. As the theory shows, the CQ method is ratio-consistent [Chen and Qin (2010)], so our method (dotted line) is better only to second order, which is captured by the small gap between the dotted and solid curves. When estimating only off-diagonal quadratic functionals (Figure 2), the advantage of the thresholding method is even sharper, since the error caused by the nonsparse diagonal elements has been eliminated. The improved performance comes from the prior knowledge of sparsity; thus our method works best for very sparse matrices, and especially well for the identity matrix, as seen in Figure 1.
A practical question is how to choose a proper threshold, as this is important for the performance of the thresholding estimator. In the above simulations, we chose with the constant C slightly different across the four cases, but always close to 1.5. In the next two applications to hypothesis testing, we employ cross-validation to choose a proper threshold. The procedure consists of the following steps (a code sketch is given after the list):
(1) The data are split m times, v = 1, 2, …, m, into training data of sample size n1 and testing data of sample size n − n1.
(2) The training data are used to construct the thresholding estimator under a sequence of thresholds, while the testing data are used to construct the nonthresholded ratio-consistent estimator , for example, the CQ estimator of .
(3) The candidate thresholds are for j = 1, 2, …, J, where J is chosen to be a reasonably large number, say 50, and Δ is such that .
(4) The final j* is taken to be the minimizer of the following problem:
(5) The final estimator is obtained by applying the threshold to the empirical covariance matrix of the entire sample of n observations.
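Below is a hedged Python sketch of this cross-validation scheme for the off-diagonal quadratic functional. The grid τj = jΔ/J, the averaged squared-error criterion in step (4), and the use of a naive nonthresholded plug-in on the held-out half as a stand-in for a ratio-consistent estimator such as CQ are our reading of the partially elided steps; Δ and the number of splits m are left as user-supplied parameters since their defining conditions are elided.

```python
import numpy as np

def offdiag_Q(S, tau):
    """Off-diagonal quadratic functional of S after hard-thresholding at tau."""
    off = S - np.diag(np.diag(S))
    off[np.abs(off) < tau] = 0.0
    return np.sum(off ** 2)

def choose_threshold_cv(X, m=50, J=50, Delta=1.0, seed=0):
    """Sketch of the cross-validation threshold selection of Section 5.1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n1 = int(n / np.log(n))                      # splitting rule quoted from the text
    taus = np.arange(1, J + 1) * Delta / J       # candidate thresholds (assumed grid)
    err = np.zeros(J)
    for _ in range(m):
        idx = rng.permutation(n)
        train, test = X[idx[:n1]], X[idx[n1:]]
        S_train = np.cov(train, rowvar=False)
        ref = offdiag_Q(np.cov(test, rowvar=False), 0.0)   # nonthresholded reference
        for j, tau in enumerate(taus):
            err[j] += (offdiag_Q(S_train, tau) - ref) ** 2
    return taus[int(np.argmin(err))]             # tau_{j*}, then applied to the full sample
```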
Bickel and Levina (2008a) suggested using n1 = n/log n for estimating covariance matrices. This is consistent with our experience for estimating functionals when no prior knowledge about the covariance matrix structure is available. We will apply this splitting rule in the later simulation studies on high-dimensional hypothesis testing.
5.2. Application to high-dimensional two-sample testing
In this section, we apply the thresholding estimator of quadratic functionals to the high-dimensional two-sample testing problem. Two groups of data are simulated from the Gaussian models:
The dimensions considered for this problem are (p, n) ∈ {(500, 100), (1000, 150), (2000, 200)}. For simplicity, we choose Σ to be a correlation matrix and take the sparse covariance structure to be block diagonal with 2 × 2 blocks, 250 of which have correlation 0.3 while the rest have correlation 0. So the off-diagonal quadratic functional is always 45, which does not increase with p in our setting. The mean vectors μ1 and μ2 are chosen as follows. Let μ1 = 0 and let the proportion of coordinates with μ1,k = μ2,k range over {0%, 50%, 95%, 100%}. The 100% proportion corresponds to the case where the two groups are identical and thus gives information about the accuracy of the size of the tests. The 95% proportion represents the situation where the alternative hypotheses are sparse. For those k such that μ1,k ≠ μ2,k, we simply set all μ2,k to a common value. To make the power comparable among different configurations, we use a constant signal-to-noise ratio across experiments.
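For reference, here is a short Python sketch of the covariance design used here; the placement of the correlated 2 × 2 blocks at the start of the matrix is an assumption, but any placement gives the same functional value.

```python
import numpy as np

def block_design(p, rho=0.3, n_corr_blocks=250):
    """p x p correlation matrix made of 2 x 2 diagonal blocks; the first
    n_corr_blocks blocks have off-diagonal correlation rho, the rest are identity.
    The off-diagonal quadratic functional equals 2 * n_corr_blocks * rho**2,
    i.e. 45 for rho = 0.3 and n_corr_blocks = 250, matching the value in the text."""
    assert p >= 2 * n_corr_blocks
    Sigma = np.eye(p)
    for b in range(n_corr_blocks):
        i = 2 * b
        Sigma[i, i + 1] = Sigma[i + 1, i] = rho
    return Sigma

# e.g. Sigma = block_design(500); X = np.random.multivariate_normal(np.zeros(500), Sigma, size=100)
```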
Table 1 reports the empirical power and size of six testing methods based on 500 repetitions.
Table 1.
Empirical testing power and size of 6 testing methods based on 500 simulations
Prop. of equalities | BS | newBS | CQ | newCQ | Bonf | BH |
---|---|---|---|---|---|---|
p = 500, n = 100 | ||||||
0% | 0.408 | 0.422 | 0.428 | 0.432 | 0.104 | 0.110 |
50% | 0.396 | 0.422 | 0.418 | 0.428 | 0.110 | 0.116 |
95% | 0.422 | 0.440 | 0.438 | 0.442 | 0.208 | 0.214 |
100% (size) | 0.030 | 0.036 | 0.036 | 0.038 | 0.042 | 0.042 |
p = 1000, n = 150 | ||||||
0% | 0.696 | 0.710 | 0.718 | 0.718 | 0.082 | 0.086 |
50% | 0.698 | 0.712 | 0.712 | 0.714 | 0.106 | 0.112 |
95% | 0.702 | 0.716 | 0.718 | 0.722 | 0.308 | 0.328 |
100% (size) | 0.040 | 0.044 | 0.048 | 0.046 | 0.050 | 0.050 |
p = 2000, n = 200 | ||||||
0% | 0.930 | 0.938 | 0.940 | 0.940 | 0.138 | 0.146 |
50% | 0.918 | 0.922 | 0.924 | 0.928 | 0.104 | 0.106 |
95% | 0.922 | 0.928 | 0.930 | 0.930 | 0.324 | 0.338 |
100% (size) | 0.046 | 0.050 | 0.050 | 0.050 | 0.046 | 0.046 |
(BS) Bai and Saranadasa's original test.
(newBS) Bai and Saranadasa's modified test where tr(Σ2) is estimated by thresholding the sample covariance matrix.
(CQ) Chen and Qin's original test.
(newCQ) Chen and Qin's modified test where and tr(Σ1Σ2) are estimated by thresholding their empirical counterparts.
(Bonf) Bonferroni correction: This method regards the high-dimensional testing problem as p univariate testing problems. If there is a p-value that is less than 0.05/p, the null hypothesis is rejected.
(BH) Benjamini–Hochberg method. The method is similar to the Bonferroni correction, but employs the Benjamini–Hochberg method in decision making.
For estimating quadratic functionals, cross-validation is employed using the n/log(n) splitting rule. The first four methods are evaluated at the 5% significance level, while the Bonferroni and Benjamini–Hochberg corrections are evaluated at a 5% family-wise error rate and FDR, respectively. We also list the average relative estimation errors for the quadratic functionals of the first four methods in Table 2. Here, the average is taken over the four different proportions of equalities, and the average for CQ and newCQ is also taken over the errors in estimating and .
Table 2.
Mean and SD of relative errors for estimating quadratic functionals (in percentage)
p = 500, n = 100 | p = 1000, n = 150 | p = 2000, n = 200 | |
---|---|---|---|
BS | 4.93 (2.48) | 4.47 (1.56) | 5.05 (1.10) |
newBS | 2.12 (1.43) | 0.74 (0.56) | 0.54 (0.40) |
CQ | 3.72 (1.97) | 2.32 (1.24) | 1.70 (0.91) |
newCQ | 2.77 (1.38) | 1.27 (0.64) | 0.62 (0.33) |
Several comments are in order. First, the first four methods, based on Wald-type statistics with correlations ignored, perform much better in terms of power than the last two methods, which combine individual tests. Even when the proportion of equalities is 95%, so that the individual differences are large for the nonidentical means, aggregating the signals in a Wald-type statistic still outperforms. In the case of 0% identical means, the power of the Bonferroni or FDR method is extremely small, due to the small individual differences. Second, the method newCQ, which combines CQ with the thresholding estimator of the quadratic functional, has the highest power and performs best among all methods. The corrected BS method also improves on the original BS by estimating the quadratic functionals better. CQ is indeed more powerful than BS, as claimed by Chen and Qin (2010), but the performance of both methods can be improved further by leveraging the sparsity structure of the covariance matrices.
5.3. Estimation of ℓr functional and correlation detection
In order to check the effectiveness of using the ℓr norm of the thresholded sample covariance matrix to detect correlations, let us take one simple matrix structure as an example and use r = 1. Under H0, assume ; while under , where Σij = 0.8 if i, and is a random subset of size p/20 in {1, 2, …, n}. We chose p = 500 and generated n = 100 independent random vectors under both H0 and H1. The whole simulation was repeated N = 1000 times.
We compare the ℓ1 norm estimates based on the empirical covariance matrix and on the thresholded empirical covariance matrix . The threshold is chosen by cross-validation with the n/log(n) splitting rule. The simulations yielded N estimates under both the null and the alternative hypotheses, which are plotted in Figure 3. The optimal (thresholded) estimator perfectly discriminates between the null and alternative hypotheses, while the estimate based on the raw empirical covariance matrix overestimates the ℓ1 functional and blurs the difference between the two hypotheses.
Fig. 3.
Histogram of 1000 ℓ1 functional estimates for H0 and H1 by (left) and optimal estimator (right).
5.4. Application to testing multifactor pricing model
In this section, we test the validity of the Capital Asset Pricing Model (CAPM) and the Fama–French model using Pesaran and Yamagata's method (2.2) for the securities in the Standard & Poor's 500 (S&P 500) index. Following the literature, we used 60 monthly stock returns to construct the test statistics, since monthly returns are nearly independent. The composition of the index changes annually, so we selected only 276 large stocks. The monthly returns (adjusted for dividends) between January 1990 and December 2012 were downloaded from the Wharton Research Data Services (WRDS) database. The time series on the risk-free rates and the Fama–French three factors were obtained from Ken French's data library. If only the first factor, that is, the excessive return of the market portfolio, is used, the Fama–French model reduces to the CAPM. We tested the null hypothesis H0 : α = 0 for both models. The p-values of the tests, computed based on running windows of the previous 60 months, are depicted in Figure 4.
Fig. 4.
P-values of testing H0 : α = 0 in the CAPM and Fama–French three-factor models based on S&P 500 monthly returns from January 1995 to December 2012.
The results suggest that market efficiency is time dependent and that the Fama–French model is rejected less frequently than the CAPM. Before 1998, the evidence that α ≠ 0 is very strong. After 1998, the Fama–French three-factor model holds most of the time, except for the period 2007–2009, which contains the financial crisis. On the other hand, the CAPM is rejected over an extended portion of this period.
APPENDIX A: A TECHNICAL LEMMA ON χ2 DIVERGENCES
Lemma A.1
Consider a mixture of Gaussian product distributions
where such that . Then
(A.1) |
Furthermore, assume 2(log p) ≤ n. Consider the mixture defined in the proof of Theorem 3.2, where k and a are defined as follows:
If , then take k = 1 and a = (R/4)1/q.
If , then take k to be the largest integer such that
(A.2) |
and
(A.3) |
Such choices yield in both cases
(A.4) |
Moreover, in case 2 we have that (i) k ≥ 2 and (ii) under the assumption that R2 < (p − 1)n−q/2, we also have k2 ≤ R2nq < (p − 1)/2.
Proof
To unify the notation, we will work directly with , rather than , j ∈ [m]. However, in the first part of the proof, we will not use the specific form ΣS nor that of . For now, we simply assume that ΣS is invertible (we will check this later on). Recall that
where denotes the expectation with respect to . Furthermore,
Consider the spectral decomposition of , where U is an orthogonal matrix and Λ is a diagonal matrix with eigenvalues λ1, …, λp on its diagonal. Then, by rotational invariance of the Gaussian distribution, it holds
To ensure that the above expression is finite, note that the Cauchy–Schwarz inequality yields
where the two χ2 divergences are finite because for any . Therefore,
Next, observe that
Since we have not used the specific form of ΣS, , this bound is valid for any mixture and completes the proof of (A.1).
Next, we apply this bound to the specific choice for ΣS of (3.10). Note that the minimal eigenvalue of the matrices ΣS, is . Later we will show 2a2k ≤ 1, which implies that ΣS is always positive definite. In particular, this implies that for any . Moreover, it follows from definition (3.10) that
where 0 is a generic symbol to indicate space filled by zeros. Expanding the determinant along the first row (or column), we get
where in the second equality, we used Sylvester's determinant theorem. By (A.1), we have
As will be verified later, 2a2k ≤ 1. Using the fact that (1 − x)−1 ≤ exp(2x) for x ∈ [0, 1/2] and symmetry, we have
where denotes the expectation with respect to the distribution of S randomly chosen from . In particular, is the sum of k negatively associated random variables. Therefore, using negative association, the above expectation is further bounded by
Next, for a given by (A.3), we have
We now show that in both cases of the lemma we have 2a2k ≤ 1. Indeed, for case 1, we get 2a2k = 2(R/4)2/q < 2(log p)/n ≤ 1. For case 2, 2a2k ≤ 1 follows trivially from the definition of a. Also observe that k ≥ 2 since
This proves part (i) of the statement on k. To prove part (ii), observe that R2 < (p − 1)n−q/2 implies that
which is equivalent to
Therefore, k2 ≤ R2nq < (p − 1)/2.
APPENDIX B: PROOF OF THEOREM 4.1
The proof follows a similar idea to that of Theorem 3.2. For the second part of the lower bound, we use exactly the same construction of two hypotheses as in Lemma A.1. Then it follows that for the ℓr functional,
for some positive constant C, where the infimum is taken over all estimators of ℓr(Σ) based on n observations. In case 1, k2a2r = CR2r/q while in case 2, following the same arguments as before,
This completes the second part of the lower bound.
The first part of the result is a little bit more complicated than the construction of A(k) and B(k) in the proof of Theorem 3.2 due to the extra log p term in the lower bound. We need to consider a mixture of measures in order to capture the complexity of the problem. With a slight abuse of notation, we redefine (2k) × (2k) matrices A(k), B(k) as follows:
where a, b ∈ (0, 1/2) and 1 denotes a vector of ones of length k. Since R2 < p, we now construct the block diagonal covariance matrices
where the mth diagonal block is chosen to be Cm = B(k) while others are Ci = A(k) for i ≠ m and M = ⌊p/R⌋. Also define to be of the same structure with Ci = A(k) for all i. Then we have for m = 0, 1, …, M, since each row of only contains at most R nonzero elements that are bounded by 1.
Let denote the distribution of and denote the distribution of . Let (resp., ) denote the distribution of X = (X1, …, Xn) of n i.i.d. random variables drawn from (resp., ). Moreover, let denote the uniform mixture of over m ∈ [M]. Consider the testing problem
Using Theorem 2.2, part (iii) of Tsybakov (2009) as before, we need to show χ2-divergence can be bounded by a constant. By the same calculation as in Lemma A.1, we have
Note that χ2-divergence here depends on instead of since perfectly correlated random variables do not add additional information and hence do not affect χ2-divergence (see the proof of Theorem 3.2). Using the definition of 's, we obtain
Therefore,
which is bounded by ((1 − 5(a − b)2)−n/2 − 1)/M due to the fact 2(1 + a2)/(1 − a2)2 ≤ 5 for a ≤ 1/2. Now choose
By assumption, there exists a constant c0 > 1 such that R2 ≤ c0p. Thus,
Using standard techniques to reduce estimation problems to testing problems as before, we find
For the above choice of , we have
Since , the above two displays imply that
which together with the other part of the lower bound, completes the proof of the theorem.
APPENDIX C: PROOF OF THEOREM 4.4
The proof is similar to that of Theorem 3.2, but simpler since for r ≤ 1 no elbow effect exists. Consider the hypothesis construction (3.10) with and
Choose ν sufficiently small so that , which implies 2ka2 ≤ 1 and guarantees the positive semi-definiteness of ΣS. Furthermore, kaq ≤ R holds, so . By the same derivation as in Theorem 3.2, we are able to show
which by Theorem 2.2(iii) of Tsybakov (2009) leads to the final conclusion.
SUPPLEMENTARY MATERIAL
Technical proofs Fan, Rigollet and Wang (2015) (DOI: 10.1214/15-AOS1357SUPP; .pdf). This supplementary material contains the introduction to two-sample high-dimensional testing methods and the proofs of upper bounds that were omitted from the paper.
REFERENCES
- Amini AA, Wainwright MJ. High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 2009;37:2877–2921. MR2541450.
- Arias-Castro E, Bubeck S, Lugosi G. Detecting positive correlations in a multivariate sample. Bernoulli. 2015;21:209–241. MR3322317.
- Bai Z, Saranadasa H. Effect of high-dimension: By an example of a two sample problem. Statist. Sinica. 1996;6:311–329. MR1399305.
- Berthet Q, Rigollet P. Complexity theoretic lower bounds for sparse principal component detection. J. Mach. Learn. Res. 2013a;30:1046–1066.
- Berthet Q, Rigollet P. Optimal detection of sparse principal components in high-dimension. Ann. Statist. 2013b;41:1780–1815. MR3127849.
- Bickel PJ, Levina E. Covariance regularization by thresholding. Ann. Statist. 2008a;36:2577–2604. MR2485008.
- Bickel PJ, Levina E. Regularized estimation of large covariance matrices. Ann. Statist. 2008b;36:199–227. MR2387969.
- Bickel PJ, Ritov Y. Estimating integrated squared density derivatives: Sharp best order of convergence estimates. Sankhyā Ser. A. 1988;50:381–393. MR1065550.
- Birnbaum A, Johnstone IM, Nadler B, Paul D. Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Statist. 2013;41:1055–1084. doi: 10.1214/12-AOS1014. MR3113803.
- Butucea C. Goodness-of-fit testing and quadratic functional estimation from indirect observations. Ann. Statist. 2007;35:1907–1930. MR2363957.
- Butucea C, Meziani K. Quadratic functional estimation in inverse problems. Stat. Methodol. 2011;8:31–41. MR2741507.
- Cai T, Liu W. Adaptive thresholding for sparse covariance matrix estimation. J. Amer. Statist. Assoc. 2011;106:672–684. MR2847949.
- Cai TT, Low MG. Nonquadratic estimators of a quadratic functional. Ann. Statist. 2005;33:2930–2956. MR2253108.
- Cai TT, Low MG. Optimal adaptive estimation of a quadratic functional. Ann. Statist. 2006;34:2298–2325. MR2291501.
- Cai TT, Ma Z, Wu Y. Sparse PCA: Optimal rates and adaptive estimation. Ann. Statist. 2013;41:3074–3110. MR3161458.
- Cai T, Ma Z, Wu Y. Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields. 2015;161:781–815. doi: 10.1007/s00440-014-0562-z. MR3334281.
- Cai TT, Ren Z, Zhou HH. Optimal rates of convergence for estimating Toeplitz covariance matrices. Probab. Theory Related Fields. 2013;156:101–143. MR3055254.
- Cai TT, Yuan M. Adaptive covariance matrix estimation through block thresholding. Ann. Statist. 2012;40:2014–2042. MR3059075.
- Cai TT, Zhang C-H, Zhou HH. Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 2010;38:2118–2144. MR2676885.
- Cai TT, Zhou HH. Minimax estimation of large covariance matrices under ℓ1-norm. Statist. Sinica. 2012;22:1319–1349. MR3027084.
- Chen SX, Qin Y-L. A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 2010;38:808–835. MR2604697.
- Donoho DL, Nussbaum M. Minimax quadratic estimation of a quadratic functional. J. Complexity. 1990;6:290–323. MR1081043.
- Efromovich S, Low M. On optimal adaptive estimation of a quadratic functional. Ann. Statist. 1996;24:1106–1125. MR1401840.
- El Karoui N. Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 2008;36:2717–2756. MR2485011.
- Fama EF, French KR. Common risk factors in the returns on stocks and bonds. Journal of Financial Economics. 1993;33:3–56.
- Fan J. On the estimation of quadratic functionals. Ann. Statist. 1991;19:1273–1294. MR1126325.
- Fan J, Fan Y, Lv J. High dimensional covariance matrix estimation using a factor model. J. Econometrics. 2008;147:186–197. MR2472991.
- Fan J, Liao Y, Mincheva M. High-dimensional covariance matrix estimation in approximate factor models. Ann. Statist. 2011;39:3320–3356. doi: 10.1214/11-AOS944. MR3012410.
- Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements. J. R. Stat. Soc. Ser. B. Stat. Methodol. 2013;75:603–680. doi: 10.1111/rssb.12016. MR3091653.
- Fan J, Rigollet P, Wang W. Supplement to "Estimation of functionals of sparse covariance matrices." 2015. DOI: 10.1214/15-AOS1357SUPP.
- Foucart S, Rauhut H. A Mathematical Introduction to Compressive Sensing. Birkhäuser/Springer; New York: 2013. MR3100033.
- Hall P, Marron JS. Estimation of integrated squared density derivatives. Statist. Probab. Lett. 1987;6:109–115. MR0907270.
- Ibragimov IA, Nemirovskiĭ AS, Khas'minskiĭ RZ. Some problems of nonparametric estimation in Gaussian white noise. Theory Probab. Appl. 1987;31:391–406.
- Johnstone IM, Lu AY. On consistency and sparsity for principal components analysis in high-dimensions. J. Amer. Statist. Assoc. 2009;104:682–693. doi: 10.1198/jasa.2009.0121. MR2751448.
- Jung S, Marron JS. PCA consistency in high-dimension, low sample size context. Ann. Statist. 2009;37:4104–4130. MR2572454.
- Klemelä J. Sharp adaptive estimation of quadratic functionals. Probab. Theory Related Fields. 2006;134:539–564. MR2214904.
- Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. 2009;37:4254–4278. doi: 10.1214/09-AOS720. MR2572459.
- Levina E, Vershynin R. Partial estimation of covariance matrices. Probab. Theory Related Fields. 2012;153:405–419. MR2948681.
- Lintner J. The valuation of risk assets and the selection of risky investments in stock portfolios and capital budgets. The Review of Economics and Statistics. 1965;47:13–37.
- Ma Z. Sparse principal component analysis and iterative thresholding. Ann. Statist. 2013;41:772–801. MR3099121.
- Mossin J. Equilibrium in a capital asset market. Econometrica. 1966;34:768–783.
- Nemirovski A. Topics in nonparametric statistics. In: Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738. Springer; Berlin: 2000. pp. 85–277. MR1775640.
- Nemirovskiĭ AS, Khas'minskiĭ RZ. Nonparametric estimation of the functionals of the products of a signal observed in white noise. Problemy Peredachi Informatsii. 1987;23:27–38. MR0914348.
- Onatski A, Moreira MJ, Hallin M. Asymptotic power of sphericity tests for high-dimensional data. Ann. Statist. 2013;41:1204–1231. MR3113808.
- Paul D, Johnstone IM. Augmented sparse principal component analysis for high-dimensional data. 2012. Available at arXiv:1202.1242v1.
- Pesaran MH, Yamagata T. Testing CAPM with a large number of assets. IZA Discussion Papers 6469. Institute for the Study of Labor; 2012.
- Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat. 2011;5:935–980. MR2836766.
- Rigollet P, Tsybakov AB. Comment: "Minimax estimation of large covariance matrices under ℓ1-norm" [MR3027084]. Statist. Sinica. 2012;22:1358–1367. MR3027087.
- Rothman AJ, Levina E, Zhu J. Generalized thresholding of large covariance matrices. J. Amer. Statist. Assoc. 2009;104:177–186. MR2504372.
- Sharpe WF. Capital asset prices: A theory of market equilibrium under conditions of risk. J. Finance. 1964;19:425–442.
- Srivastava MS, Du M. A test for the mean vector with fewer observations than the dimension. J. Multivariate Anal. 2008;99:386–402. MR2396970.
- Tsybakov AB. Introduction to Nonparametric Estimation. Springer; New York: 2009. MR2724359.
- Verzelen N. Minimax risks for sparse regressions: Ultra-high-dimensional phenomenons. Electron. J. Stat. 2012;6:38–90. MR2879672.
- Vu V, Lei J. Minimax rates of estimation for sparse PCA in high-dimensions. In: Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, April 21–23, 2012. JMLR W&CP. 2012;22:1278–1286.
- Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J. Comput. Graph. Statist. 2006;15:265–286. MR2252527.