Abstract
High-dimensional statistical tests often ignore correlations to gain simplicity and stability, leading to null distributions that depend on functionals of correlation matrices such as their Frobenius norm and other ℓr norms. Motivated by the computation of critical values of such tests, we investigate the difficulty of estimating functionals of sparse correlation matrices. Specifically, we show that simple plug-in procedures based on thresholded estimators of correlation matrices are sparsity-adaptive and minimax optimal over a large class of correlation matrices. Akin to previous results on functional estimation, the minimax rates exhibit an elbow phenomenon. Our results are further illustrated on simulated data as well as in an empirical study of data arising in financial econometrics.
Keywords: Covariance matrix, functional estimation, high-dimensional testing, minimax, elbow effect
1. Introduction
Covariance matrices are at the core of many statistical procedures such as principal component analysis or linear discriminant analysis. Moreover, not only do they arise as natural quantities to capture interactions between variables but, as we illustrate below, they often characterize the asymptotic variance of commonly used estimators. Following the original papers of Bickel and Levina (2008a, 2008b), much work has focused on the inference of high-dimensional covariance matrices under sparsity [Cai and Liu (2011), Cai, Ren and Zhou (2013), Cai and Yuan (2012), Cai, Zhang and Zhou (2010), Cai and Zhou (2012), El Karoui (2008), Lam and Fan (2009), Ravikumar et al. (2011)] and other structural assumptions related to sparse principal component analysis [Amini and Wainwright (2009), Berthet and Rigollet (2013a, 2013b), Birnbaum et al. (2013), Cai, Ma and Wu (2013, 2015), Fan, Fan and Lv (2008), Johnstone and Lu (2009), Levina and Vershynin (2012), Ma (2013), Onatski, Moreira and Hallin (2013), Paul and Johnstone (2012), Rothman, Levina and Zhu (2009), Fan, Liao and Mincheva (2011, 2013), Jung and Marron (2009), Vu and Lei (2012), Zou, Hastie and Tibshirani (2006)]. This area of research is very active and, as a result, this list of references is illustrative rather than comprehensive. This line of work can be split into two main themes: estimation and detection. The former is the main focus of the present paper. However, while most of the literature has focused on estimating the covariance matrix itself, under various performance measures, we depart from this line of work by focusing on functionals of the covariance matrix rather than the covariance matrix itself.
Estimation of functionals of unknown signals such as regression functions or densities is known to be different in nature from estimation of the signal itself. This problem has received the most attention in nonparametric estimation, originally in the Gaussian white noise model [Efromovich and Low (1996), Fan (1991), Ibragimov, Nemirovskiĭ and Khas'minskiĭ (1987), Nemirovskiĭ and Khas'minskiĭ (1987)] [see also Nemirovski (2000) for a survey of results in the Gaussian white noise model] and later extended to density estimation [Bickel and Ritov (1988), Hall and Marron (1987)] and various other models such as regression [Donoho and Nussbaum (1990), Cai and Low (2005, 2006), Klemelä (2006)] and inverse problems [Butucea (2007), Butucea and Meziani (2011)]. Most of these papers study the estimation of quadratic functionals and, interestingly, exhibit an elbow in the rates of convergence: there exists a critical regularity parameter below which the rate of estimation is nonparametric and above which it becomes parametric. As we will see below, this phenomenon also arises when regularity is measured by sparsity.
Over the past decade, sparsity has become the prime measure of regularity, both for its flexibility and generality. In particular, smooth functions can be viewed as functions with a sparse expansion in an appropriate basis. At a high level, sparsity assumes that many of the unknown parameters are equal to zero or nearly so, so that the few nonzero parameters can be consistently estimated using a small number of observations relative to the apparent dimensionality of the problem. Moreover, sparsity acts not only as a regularity parameter that stabilizes statistical procedures but also as a key feature for interpretability. Indeed, it is often the case that setting many parameters to zero simply corresponds to a simpler sub-model, and the main idea is to let the data select the correct sub-model. This is the case in particular for covariance matrix estimation, where zeros in the matrix correspond to uncorrelated variables. Yet, while the value of sparsity for covariance matrix estimation has been well established, to the best of our knowledge, this paper provides the first analysis of the estimation of functionals of sparse covariance matrices. Indeed, the actual performance of many estimators critically depends on such functionals. Therefore, accurate functional estimation leads to a better understanding of the performance of many estimators and can ultimately serve as a guide to selecting the best estimator. Applications of our results are illustrated in Section 2.
Our work is not only motivated by real applications, but also by a natural extension of the theoretical analysis carried out in the sparse Gaussian sequence model [Cai and Low (2005)]. In that paper, Cai and Low assume that the unknown parameter θ belongs to an ℓq-ball, where q > 0 can be arbitrarily close to 0. Such balls are known to emulate sparsity and actually correspond to a more accurate notion of sparsity for signals θ encountered in applications [see, e.g., Foucart and Rauhut (2013)]. They also show that a nonquadratic estimator can be fully efficient for estimating quadratic functionals. We extend some of these results to covariance matrix estimation. Such an extension is not trivial since, unlike in the Gaussian sequence model, a covariance matrix lies on a high-dimensional manifold and its estimation exhibits complicated dependencies in the noise structure.
We also compare our optimal rates for estimating matrix functionals with those for estimating the matrix itself. Many methods have been proposed to estimate covariance matrices under various notions of sparsity using different techniques, including thresholding [Bickel and Levina (2008a)], tapering [Bickel and Levina (2008b), Cai, Zhang and Zhou (2010), Cai and Zhou (2012)] and penalized likelihood [Lam and Fan (2009)], to name only a few. These methods often lead to minimax optimal rates over various classes and under several metrics [Cai, Zhang and Zhou (2010), Cai and Zhou (2012), Rigollet and Tsybakov (2012)]. However, optimal rates for estimating matrix functionals have received little attention in the literature. Intuitively, estimating a matrix functional should enjoy faster rates of convergence than estimating the matrix itself, since it is a one-dimensional estimation problem and the elementwise estimation errors partially cancel when summed. We will see that this is indeed the case when we compare the minimax rates of estimating matrix functionals with those of estimating matrices.
The rest of the paper is organized as follows. We begin in Section 2 with two motivating examples of high-dimensional hypothesis testing problems: a two-sample testing problem for Gaussian means that arises in genomics, and validating the efficiency of markets based on the Capital Asset Pricing Model (CAPM). Next, in Section 3, we introduce an estimator of the quadratic functional of interest that is based on the thresholding estimator introduced in Bickel and Levina (2008a). We also prove its optimality in a minimax sense over a large class of sparse covariance matrices. The study is further extended, in Section 4, to estimating other measures of sparsity of a covariance matrix. Finally, we study the numerical performance of our estimator in Section 5 on simulated experiments as well as in the framework of the two applications described in Section 2. Due to space restrictions, the proofs for the upper bounds are relegated to the Appendix in the supplementary material [Fan, Rigollet and Wang (2015)].
Notation: Let d be a positive integer. The space of d × d positive semi-definite matrices is denoted by . For any two integers c < d, define [c : d] = {c, c + 1, …, d} to be the sequence of contiguous integers between c and d, and we simply write [d] = {1, …, d}. Id denotes the identity matrix of . Moreover, for any subset S ⊂ [d], denote by 1S ∈ {0, 1}d the column vector with jth coordinate equal to one iff j ∈ S. In particular, 1[d] denotes the d dimensional vector of all ones.
We denote by tr the trace operator on square matrices and by diag (resp., off) the linear operator that sets to 0 all the off-diagonal (resp., diagonal) elements of a square matrix. The Frobenius norm of a real matrix M is denoted by ∥M∥F and is defined by ∥M∥F = (tr(M⊺M))1/2. Note that ∥M∥F is the Hilbert–Schmidt norm associated with the inner product 〈A, B〉 = tr(A⊺B) defined on the space of real rectangular matrices of the same size. Moreover, |A| denotes the determinant of a square matrix A. The variance of a random variable X is denoted by var(X).
In the proofs, we often employ C to denote a generic positive constant that may change from line to line.
2. Two motivating examples
In this section, we describe our main motivation for estimating quadratic functionals of a high-dimensional covariance matrix in the light of two applications to high-dimensional testing problems. The first one is a high-dimensional two-sample hypothesis testing with applications in gene-set testing. The second example is about testing the validity of the capital asset pricing model (CAPM) from financial economics.
2.1. Two-sample hypothesis testing in high-dimensions
In various statistical applications, in particular in genomics, the dimensionality of the problem is so large that statistical procedures involving inverse covariance matrices are not viable due to their lack of stability from both a statistical and a numerical point of view. This limitation can be well illustrated on a showcase example: two-sample hypothesis testing [Bai and Saranadasa (1996)] in high dimensions.
Suppose that we observe two independent samples that are i.i.d. and that are i.i.d. . Let n = n1 + n2. The goal is to test H0 : μ1 = μ2 vs. H1 : μ1 ≠ μ2.
Assume first that Σ1 = Σ2 = Σ. In this case, Hotelling's test is commonly employed when p is small. Nevertheless, when p is large, Bai and Saranadasa (1996) showed that the test based on Hotelling's T 2 has low power and suggested a new statistic M for the random-matrix asymptotic regime where n, p → ∞, , . The statistic, which implements the naive Bayes rule, is defined as
and is proved to be asymptotically normal under the null hypothesis with
Clearly, the asymptotic variance of M depends on the unknown covariance matrix Σ through its quadratic functional, and in order to compute the critical value of the test, Bai and Saranadasa suggest estimating by the quantity
They show that B2 is a ratio-consistent estimator of in the sense that . Clearly, this solution does not leverage any sparsity assumption and may suffer from power deficiency if the matrix Σ is indeed sparse. Rather, if the covariance matrix Σ is believed to be sparse, one may prefer to use a thresholded estimator for Σ as in Bickel and Levina (2008a) rather than the empirical covariance matrix . In this case, we estimate by , where could be any consistent estimator of σij and τ > 0 is a threshold parameter.
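To make the thresholding alternative concrete, here is a minimal Python sketch of a plug-in estimate of tr(Σ²) built from a thresholded sample covariance matrix. The choice of pilot estimator for σij, the decision not to threshold the diagonal, and the constant in the threshold are assumptions of this sketch rather than prescriptions from the text.

```python
import numpy as np

def tr_sigma_squared_thresholded(X, C=1.5):
    """Sketch: estimate tr(Sigma^2) by squaring the entries of a thresholded
    sample covariance matrix. Off-diagonal entries smaller than
    tau = C * sqrt(log p / n) are set to zero; the diagonal is kept untouched
    (an assumption of this sketch)."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)              # sample covariance (one of many possible pilots)
    tau = C * np.sqrt(np.log(p) / n)
    T = S.copy()
    mask = np.abs(T) < tau
    np.fill_diagonal(mask, False)            # do not threshold the variances
    T[mask] = 0.0
    return np.sum(T ** 2)                    # equals tr(T @ T) for symmetric T
```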
More recently, Chen and Qin (2010) considered the case Σ1 ≠ Σ2 and proposed a test statistic based on an unbiased estimate of each of the three quantities in . In this case, the quantities and 〈Σ1, Σ2〉 appear in the asymptotic variance. The detailed formulation and assumptions of this statistic, as well as discussions of other testing methods such as Srivastava and Du (2008), are provided in the supplementary material [Fan, Rigollet and Wang (2015)] for completeness. If Σ1 and Σ2 are indeed sparse, akin to the above reasoning, we can also estimate and 〈Σ1, Σ2〉 using thresholding to leverage the sparsity assumption. It is not hard to derive a theory for estimating quadratic functionals involving two covariance matrices, but the details of this procedure are beyond the scope of the present paper.
2.2. Testing high-dimensional CAPM model
The capital asset pricing model (CAPM) is a simple financial model that postulates how individual asset returns are related to the market risks. Specifically, the individual excessive return of asset i ∈ [N] over the risk-free rate at time t ∈ [T] can be expressed as an affine function of a vector of K risk factors :
(2.1) |
where we assume for any t ∈ [T], are observed. The case K = 1 with ft being the excessive return of the market portfolio corresponds to the CAPM [Lintner (1965), Mossin (1966), Sharpe (1964)]. It is nowadays more common to employ the Fama–French three-factor model [see Fama and French (1993) for a definition] for the US equity market, corresponding to K = 3.
For simplicity, let us rewrite the model (2.1) in the vectorial form
The multi-factor pricing model postulates α = 0. Namely, all returns are fully compensated by their risks: no extra returns are possible and the market is efficient. This leads us to naturally consider the hypothesis testing problem H0 : α = 0 vs. H1 : α ≠ 0.
Let and be the least-squares estimate and be a residual vector. Then an unbiased estimator of Σ = var(εt) is
Let and MF = IT − F(F⊺ F)−1F⊺ where F = (f1, …, fT)⊺ . Define the Wald-type test statistic with correlations ignored, whose normalized version is given by
(2.2) |
Under some conditions, it was shown by Pesaran and Yamagata (2012) that, under H0, as N → ∞. Moreover, if 's are i.i.d. Gaussian, it holds that and
where ν = T − K − 1 is the degrees of freedom and
where ρ = D−1/2ΣD−1/2 with D = diag(Σ) is the correlation matrix of the stationary process (εt)t∈[T]. The authors go on to propose an estimator of this quadratic functional of ρ by replacing the correlation coefficients ρij in the above expression by , where and τ > 0 is a threshold parameter. However, they did not provide any analysis of this method, nor any guidance on how to choose τ.
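As an illustration of the thresholded plug-in just described, here is a hedged Python sketch that forms sample correlations from the regression residuals and sums the squared thresholded off-diagonal entries. The exact normalization entering the asymptotic variance in (2.2) is not reproduced, and the threshold level is an assumption of the sketch.

```python
import numpy as np

def thresholded_corr_quadratic(residuals, C=1.5):
    """Sketch: plug-in estimate of the off-diagonal quadratic functional of the
    correlation matrix rho from OLS residuals (a T x N array), replacing each
    rho_ij by its thresholded sample counterpart."""
    T_obs, N = residuals.shape
    S = np.cov(residuals, rowvar=False)      # N x N residual covariance
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)                   # sample correlation matrix
    tau = C * np.sqrt(np.log(N) / T_obs)     # threshold of order sqrt(log N / T)
    R_off = R - np.diag(np.diag(R))          # off-diagonal part
    R_off[np.abs(R_off) < tau] = 0.0         # hard thresholding
    return np.sum(R_off ** 2)                # sum_{i != j} rho_hat_ij^2
```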
3. Optimal estimation of quadratic functionals
In the previous section, we described rather general questions involving the estimation of quadratic functionals of covariance or correlation matrices. We begin by observing that consistent estimation of is impossible unless p = o(n). This precludes in particular the high-dimensional framework that motivates our study.
Our goal is to estimate the Frobenius norm of a sparse p × p covariance matrix Σ using n i.i.d. observations . Observe that can be decomposed as where corresponds to the off-diagonal elements and corresponds to the diagonal elements. The following proposition implies that even if Σ = diag(Σ) is diagonal, the quadratic functional cannot be estimated consistently in absolute error if p ≥ n. This makes sense intuitively, as the diagonal of Σ consists of p unknown parameters while we have only n observations. Note that the situation is quite different when it comes to relative error. Indeed, the estimator of Bai and Saranadasa (1996) is consistent in relative error with no sparsity assumption, even in the high-dimensional regime. The study of relative error in the presence of sparsity is an interesting question that deserves further developments.
Proposition 3.1. Fix n, p ≥ 1 and let
be the class of diagonal covariance matrices with diagonal elements bounded by 1. Then there exists a universal constant C > 0 such that
In particular, it implies that
where the infima are taken over all real-valued measurable functions of the observations.
Proof
Our lower bounds rely on standard arguments from minimax theory. We refer to Chapter 2 of Tsybakov (2009) for more details. In the sequel, let denote the Kullback–Leibler divergence between two distributions P and , where . It is defined by
We are going to employ a simple two-point lower bound. Fix ε ∈ (0, 1/2) and let (resp., ) denote the distribution of a sample X1, …, Xn where [resp., ]. Next, observe that Ip, so that
(3.1) |
Moreover, |D(Ip) − D((1 − ε)Ip)| = p(2ε − ε2) > pε. Then it follows from the Markov inequality that
(3.2) |
where the last inequality follows from Theorem 2.2(iii) of Tsybakov (2009).
Completion of the proof requires an upper bound on . To that end, note that it follows from the chain rule and simple algebra that
Taking now yields . Together with (3.1) and (3.2), it yields
To complete the proof, we square the above inequality and employ Jensen's inequality.
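For the reader's convenience, here is the standard Gaussian computation behind the "chain rule and simple algebra" step, written as a sketch. The direction of the divergence, the roles of the two matrices and the choice of ε of order (np)^{−1/2} are our reading of the partially elided displays, not a verbatim reproduction.

```latex
\[
  \mathrm{KL}\bigl(\mathcal N(0,\Sigma_0)^{\otimes n}\,\big\|\,\mathcal N(0,\Sigma_1)^{\otimes n}\bigr)
  \;=\;\frac n2\Bigl[\operatorname{tr}\bigl(\Sigma_1^{-1}\Sigma_0\bigr)-p
    +\log\frac{|\Sigma_1|}{|\Sigma_0|}\Bigr],
\]
\[
  \text{so with }\Sigma_0=I_p,\ \Sigma_1=(1-\varepsilon)I_p:\qquad
  \mathrm{KL}
  \;=\;\frac{np}2\Bigl[\frac{\varepsilon}{1-\varepsilon}+\log(1-\varepsilon)\Bigr]
  \;=\;\frac{np\,\varepsilon^{2}}{4}\bigl(1+o(1)\bigr),\quad \varepsilon\to0 .
\]
```

Taking ε proportional to (np)^{−1/2} therefore keeps the divergence bounded by a constant, while the separation |D(Ip) − D((1 − ε)Ip)| > pε is of order (p/n)^{1/2}, consistent with the p/n rate for the diagonal part discussed below.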
To overcome the above limitation, we consider the following class of sparse covariance matrices (indeed correlation matrices). For any q ∈ [0, 2), R > 0 let denote the set of p × p covariance matrices defined by
(3.3) |
Note that for this class of matrices, we assume that the variance along each coordinate is normalized to 1. This normalization is frequently obtained via sample estimates, as shown in the previous section. This simplifying assumption is also motivated by Proposition 3.1 above, which implies that for a general covariance matrix cannot be estimated accurately in absolute error in the large p, small n regime, since sparsity assumptions on the diagonal elements are implausible. Note that the condition diag(Σ) = Ip implies that the diagonal part D(Σ) of matrices in can be estimated without error, so that we may achieve consistency even in the case of large p, small n.
Matrices in have many small coefficients for small values of q and R. In particular, when q = 0, there are no more than R nonvanishing correlation entries. Following a major trend in the estimation of sparse covariance matrices [Bickel and Levina (2008a, 2008b), Cai and Liu (2011), Cai and Yuan (2012), Cai, Zhang and Zhou (2010), Cai and Zhou (2012), El Karoui (2008), Lam and Fan (2009)], we employ a thresholding estimator of the covariance matrix as a workhorse to estimate the quadratic functionals. From the n i.i.d. observations , we form the empirical covariance matrix that is defined by
(3.4) |
with elements and for any threshold τ > 0, let denote the thresholding estimator of Σ defined by if i ≠ j and .
Next, we employ a simple plug-in estimator for Q(Σ):
(3.5) |
Note that no value of the diagonal elements is used to estimate Q(Σ).
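A minimal Python sketch of the estimator (3.4)–(3.5) follows. The centering convention of the empirical covariance matrix and the constant in the threshold, which Theorem 3.1 takes of order sqrt(log p/n), are assumptions of this sketch.

```python
import numpy as np

def Q_hat(X, tau):
    """Plug-in estimate of Q(Sigma) = sum_{i != j} sigma_ij^2: hard-threshold
    the off-diagonal entries of the empirical covariance matrix at tau and sum
    their squares. Diagonal entries are never used, as noted in the text."""
    S = np.cov(X, rowvar=False)              # empirical covariance of the n rows of X
    off = S - np.diag(np.diag(S))            # off-diagonal part
    off[np.abs(off) < tau] = 0.0             # hard thresholding
    return np.sum(off ** 2)

# Example usage with a threshold of order sqrt(log p / n) (constant is illustrative):
# n, p = X.shape; tau = 2.0 * np.sqrt(np.log(p) / n); Q_hat(X, tau)
```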
In the rest of this section, we establish that is minimax adaptive over the scale {, q ∈ [0, 2), R > 0}. Interestingly, we will see that the minimax rate presents an elbow, as is often the case in quadratic functional estimation.
Theorem 3.1
Assume that γ log(p) < n for some constant γ > 8 and fix C0 ≥ 4. Consider the threshold
and assume that τ ≤ 1. Then, for any q ∈ [0, 2), R > 0, the plug-in estimator satisfies
where
and C1, C2 are positive constants depending on γ, C0, q.
The proof is postponed to the supplementary material.
Note that the rates ψn,p(q, R) present an elbow at q = 1 − log log p/log n, as is usually the case in functional estimation. We now argue that the rates ψn,p(q, R) are optimal in a minimax sense for a wide range of settings. In particular, the elbow effect arising from the maximum in the definition of ψ is not an artifact. In the following theorem, we emphasize the dependence on Σ by using the notation for the expectation with respect to the distribution of the sample X1, …, Xn, where .
Theorem 3.2
Fix q ∈ [0, 2), R > 0 and assume 2 log p < n and R2 < (p − 1)n−q/2. Then there exists a positive constant C3 > 0 such that
where ϕn,p(q, R) is defined by
(3.6) |
and the infimum is taken over all measurable functions of the sample X1, …, Xn.
Before proceeding to the proof, a few remarks are in order.
- The additional term of order p4−γ/2 in Theorem 3.1 can be made negligible by taking γ large enough. We decided to keep this term in order to show this tradeoff explicitly.
- When 1 ≤ R2 < pαn−q for some constant α < 1, a slightly stronger requirement than in Theorem 3.2, the lower bound there can be written as
(3.7) |
Observe that the above lower bound matches the upper bound presented in Theorem 3.1 when R2/(2−q) log p ≤ n. Arguably, this is the most interesting range as it characterizes rates of convergence (to zero) rather than rates of divergence, which may be of a different nature [see, e.g., Verzelen (2012)]. In other words, the rates given in (3.7) are minimax adaptive with respect to n, R, p and q. In our formulation, we allow R = Rn,p to depend on other parameters of the problem; we choose here to keep the notation light.
- The reason we choose the correlation matrix class to present the elbow effect is just for simplicity. Actually, we can replace the constraint diag(Σ) = Ip in the definition of by boundedness of the diagonal elements of Σ. Then, for estimating the off-diagonal part Q(Σ), following exactly the same derivation, the same elbow phenomenon is observed. Meanwhile, the optimal rate for estimating the diagonal part D(Σ) is again of order p/n. This optimal rate can be attained by the estimator
(3.8) |
We omit the proof here. Thus, if we do not have prior information about the diagonal elements, we can still optimally estimate the quadratic functional of a covariance matrix by applying the thresholding method (3.5) to the off-diagonal elements, together with (3.8) for the diagonal elements.
- The rate ϕn,p(q, R) presents the same elbow phenomenon at q = 1 observed in the estimation of functionals, starting independently with the work of Bickel and Ritov (1988) and Fan (1991). Closer to the present setup is the work of Cai and Low (2005), who study the estimation of functionals of "sparse" sequences in the infinite Gaussian sequence model. There, a parameter controls the speed of decay of the unknown coefficients. Note that while smaller values of q lead to sparser matrices Σ, no estimator can benefit further from sparsity below q = 1 [the estimator has a rate of convergence O(R2/n) for any q < 1], unlike in the case of estimation of Σ. Again, this is inherent to estimating functionals.
- The condition R2 < (p − 1)n−q/2 corresponds to the high-dimensional regime and allows us to keep the terms inside the logarithm clean. Similar assumptions are made in the related literature [see, e.g., Cai and Zhou (2012)].
- The optimal rates obtained here cannot be deduced from existing ones for estimating sparse covariance matrices. In particular, the latter do not exhibit an elbow phenomenon. Specifically, Rigollet and Tsybakov (2012) showed that the optimal rate for estimating Σ over under the Frobenius norm is for 0 ≤ q < 2. Using this, it is not hard to derive that, with high probability,
since if the nonvanishing correlations are bounded away from zero. When q < 2, the first term always dominates, so that we do not observe the elbow effect. In addition, the rate so obtained is not optimal.
We now turn to the proof of Theorem 3.2
Proof of Theorem 3.2
To prove minimax lower bounds, we employ a standard technique that consists of reducing the estimation problem to a testing problem. We split this proof into two parts and begin by proving
for some positive constant C > 0. To that end, for any , let denote the distribution of . It is not hard to show that if |A| > 0 and |B| > 0, , then the Kullback–Leibler divergence between and is given by
(3.9) |
Next, take A and B to be of the form
where a, b ∈ (0, 1/2), 0 is a generic symbol to indicate that the missing space is filled with zeros, and 1 denotes a vector of ones of length k/2. Note that if we have random variables (X, Y, Z1, …, Zp−2) chosen from the distribution , meaning that the Zk's are independent of X, Y but the correlation between X and Y is a, then the random vector (X, …, X, Y, …, Y, Z1, …, Zp−k), with k/2 copies each of X and Y, follows . These two matrices are clearly degenerate and come from perfectly correlated random variables. Since perfectly correlated random variables do not add new information, for such matrices, an application of (3.9) yields
Next, using the convexity inequality log(1 + x) ≥ x − x2/2 for all x > 0, we get that
using the fact that a, b ∈ (0, 1/2). Take now if R > 4
so that we indeed have a, b ∈ (0, 1/2) and also A(k), obviously. If R < 4, take k = 2, , instead. Moreover, this choice leads to . Using standard techniques to reduce estimation problems to testing problems [see, e.g., Theorem 2.5 of Tsybakov (2009)], we find that
For the above choice of A and B, we have
Since A(k), , the above two displays imply that
which completes the proof of the first part of the lower bound.
For the second part of the lower bound, we reduce our problem to a testing problem of the same flavor as Arias-Castro, Bubeck and Lugosi (2015), Berthet and Rigollet (2013b). Note, however, that our construction is different because the covariance matrices considered in these papers do not yield large enough lower bounds. We use the following construction.
Fix an integer k ∈ [p − 1] and let denote the set of subsets of [p − 1] that have cardinality k. Fix a ∈ (0, 1) to be chosen later and for any , recall that 1S is the column vector in {0, 1}p−1 with support given by S. For each , we define the following p × p covariance matrix:
(3.10) |
Let denote the distribution of and denote the distribution of . Let (resp., ) denote the distribution of X = (X1, …, Xn), a collection of n i.i.d. random variables drawn from (resp., ). Moreover, let denote the distribution of X where the Xi 's are drawn as follows: first draw S uniformly at random from and then, conditionally on S, draw X1, …, Xn independently from . Note that is a mixture of distributions of n independent samples rather than the distribution of n independent random vectors drawn from a mixture distribution. Consider the following testing problem:
Using Theorem 2.2, part (iii) of Tsybakov (2009), we get that for any test ψ = ψ(X), we have
where we recall that the χ2-divergence between two probability distributions P and Q is defined by
Lemma A.1 implies that for suitable choices of the parameters a and k, we have so that the test errors are bounded below by a constant C = e−2/4. Since Q(ΣS) = 2ka2 for any , it follows from a standard reduction from hypothesis testing to estimation [see, e.g., Theorem 2.5 of Tsybakov (2009)] that the above result implies the following lower bound:
(3.11) |
for some positive constant C, where the infimum is taken over all estimators of Q(Σ) based on n observations and is the class of covariance matrices defined by
To complete the proof, observe that the values of a and k prescribed in Lemma A.1 imply that and give the desired lower bound. Note first that, for any choice of a and k, the following holds trivially: and diag(ΣS) = Ip for any . Write ΣS = (σij) and observe that
Next, we treat each case of Lemma A.1 separately.
Case 1. Note first that 2kaq = R/2 < R so that . Moreover, k2a4 = CR4/q.
Case 2. Note first that 2kaq ≤ R/2 < R so that . Since k ≥ 2 and k2 ≤ R2nq, we have
Therefore,
Combining the two cases, we get
Together with (3.11), this completes the proof of the second part of the lower bound.
4. Extension to nonquadratic functionals
Closely related to the quadratic functional is the ℓr functional of covariance matrices, which is defined by
(4.1) |
It is often used to measure the sparsity of a covariance matrix and plays an important role in estimating sparse covariance matrices. This, along with the theoretical interest in the difficulty of estimating such a functional, gives rise to the present study. Note that the ℓ1(Σ) functional is simply the ℓ1-norm of the covariance matrix Σ, whereas for r = 2, the ℓr functional is the maximal row-wise quadratic functional. Thus, the nonquadratic ℓr functional is a natural extension of such a maximal quadratic functional, and its optimal estimation is the main focus of this section.
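Display (4.1) did not survive extraction above. Based solely on the surrounding remarks (ℓ1 being the ℓ1-norm of Σ and r = 2 giving the maximal row-wise quadratic functional), our reading of the definition, recorded here only as an assumption used in the code sketches below, is

```latex
\[
  \ell_r(\Sigma)\;=\;\max_{1\le i\le p}\;\sum_{j=1}^{p}\,|\sigma_{ij}|^{\,r},
\]
```

where whether the diagonal terms are included is not determined by the text (Section 4.2 works with ℓr(off(Σ))).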
4.1. Optimal estimation of ℓr functionals
We consider a class of matrices with a row-wise sparsity structure as follows:
(4.2) |
for q ∈ [0, r) and R > 0 which can depend on n and p. A similar class of covariance matrices has been considered by Bickel and Levina (2008a) and Cai and Zhou (2012).
Theorem 4.1. Fix q ∈ [0, r), R > 0 and assume that 2 log p < n and R2 < (p − 1)n−q/2. Then there exists a positive constant C4 > 0 such that,
where is defined by
(4.3) |
and the infimum is taken over all measurable functions of the sample X1, …, Xn.
The proof is similar to that of Theorem 3.2 and is relegated to the Appendix.
As in (3.7), when 1 < R2 < pαn−q for some α < 1, the lower bound in Theorem 4.1 can be written as
(4.4) |
To establish the upper bound, we consider again a thresholding estimator. Naturally, we estimate the ℓr functional of each row, denoted by , using the thresholding technique. Following the same notation as in the previous section, the estimator is defined by
(4.5) |
for a threshold τ > 0. We will see in the next theorem that this estimator achieves the adaptive minimax optimal rate.
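The following Python sketch implements our reading of (4.5): threshold the off-diagonal entries of the sample covariance matrix and take the maximum over rows of the r-th power sums. Whether the diagonal enters the row sums and the exact threshold level are assumptions here.

```python
import numpy as np

def lr_hat(X, tau, r, include_diagonal=False):
    """Sketch of the thresholded plug-in estimator of the l_r functional:
    max over rows of sum_j |sigma_hat_ij|^r, with off-diagonal entries
    hard-thresholded at tau."""
    S = np.cov(X, rowvar=False)
    A = np.abs(S.copy())
    mask = A < tau
    np.fill_diagonal(mask, False)            # threshold only off-diagonal entries
    A[mask] = 0.0
    if not include_diagonal:
        np.fill_diagonal(A, 0.0)             # l_r of the off-diagonal part only
    return np.max(np.sum(A ** r, axis=1))
```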
Theorem 4.2
Assume that γ log(p) < n for some constant γ > 8 and fix C0 ≥ 4. Consider the threshold
and assume that τ ≤ 1. Then, for any q ∈ [0, r), R > 0, the plug-in estimator satisfies
where
and C5 and C6 are positive constants.
The proof of this theorem is a generalization of the proof of Theorem 3.1, but some aspects of independent interest are presented here. In the proof of Theorem 3.1, we used the decomposition
which is actually the Taylor expansion of at σi,j. Carefully scrutinizing the proof, we find that the first term has the parametric rate O(R2/n) whereas the second term contributes to the rate O(R2(log p/n)2−q). This phenomenon can be generalized to the ℓr-functional. In the latter case, we will apply the Taylor expansion of at |σij|, and the first-order term will contribute to the parametric rate of O(R2 log p/n) while the second-order term has the rate O(R2(log p/n)r−q). The elbow effect stems from the dominance of estimation errors of the first- and second-order terms of Taylor's expansion. We relegate the complete proof to the supplementary material.
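For completeness, the exact two-term expansion referred to here, written for a generic estimate σ̂ij of σij, is the following; summing over i ≠ j, the linear term drives the parametric O(R2/n) contribution while the quadratic term contributes the O(R2(log p/n)2−q) term quoted above.

```latex
\[
  \hat\sigma_{ij}^{\,2}-\sigma_{ij}^{2}
  \;=\; 2\,\sigma_{ij}\bigl(\hat\sigma_{ij}-\sigma_{ij}\bigr)
  \;+\;\bigl(\hat\sigma_{ij}-\sigma_{ij}\bigr)^{2}.
\]
```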
A few remarks should be mentioned:
The combination of the two theorems implies that the estimator is minimax adaptive over the space under very mild conditions. The adaptive minimax optimal rate of convergence is given by (4.4). The term p4−γ/2 can be made arbitrarily small by choosing γ large enough.
The ℓr functional involves the maximum of the row sums. Compared with estimating the quadratic functional, we pay the price of an extra log p term in the parametric rate.
The rate presents the elbow phenomenon at q = r − 1 if r > 1. So the row-wise quadratic functional (r = 2) exhibits the same elbow behavior as the quadratic functional .
4.2. Optimal detection of correlations
In this subsection, we illustrate the intrinsic link between functional estimation and hypothesis testing. To that end, consider the following hypothesis testing problem:
This problem is intimately linked to sparse principal component analysis [Berthet and Rigollet (2013a, 2013b)]. A natural question associated with this problem is to find the minimal signal strength κ such that these hypotheses can be tested with high accuracy.
The previous subsection provides the optimal estimate of ℓr(off(Σ)). However, we need a result that holds with high probability rather than in expectation. Using Lemma 4.2 in the supplementary material [Fan, Rigollet and Wang (2015)] and arguments similar to those employed to prove Theorem 4.2, it is not hard to show that
with probability larger than 1 − 4p−(γ−2) =: 1 − δ. Therefore, letting
we get and . Here, denotes the probability under the null hypothesis and denotes the largest probability over the composite alternative. To build a hypothesis test, note that if s1 > s0, then for any s ∈ [s0, s1], the test satisfies . We say that the test ψ discriminates between H0 and H1 with accuracy δ.
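A compact Python sketch of the resulting detection procedure is given below; the cut-off s must be taken in the interval [s0, s1] described above (whose expressions are elided in the text), and the threshold level is again an assumption.

```python
import numpy as np

def correlation_detection_test(X, tau, r, s):
    """Sketch of the test of Section 4.2: compute the thresholded l_r functional
    of the off-diagonal part of the sample covariance matrix and reject H0
    (no correlation) when it exceeds the cut-off s."""
    S = np.cov(X, rowvar=False)
    A = np.abs(S - np.diag(np.diag(S)))      # off-diagonal magnitudes
    A[A < tau] = 0.0                         # hard thresholding
    stat = np.max(np.sum(A ** r, axis=1))    # l_r(off(Sigma_hat_tau))
    return stat >= s                         # True = reject H0
```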
Theorem 4.3
Assume that n, p, R, q, r and δ are such that where
Then, for any and for any s ∈ [s0, s1], the test discriminates between H0 and H1 with accuracy δ.
The minimax risk for the correlation detection is given in the next theorem, which will be proved in the Appendix.
Theorem 4.4
For fixed ν > 0, define by
Then for any ,
where the infimum is taken over all possible tests and Cν > 0 is a continuous function of ν that tends to 1/2 as ν → 0.
If we assume the high-dimensional regime R2 < pαn−q for some α < 1, as discussed before, then the lower bound matches the upper bound. The theorems therefore imply that no test can have asymptotic power for correlation detection unless κ is of higher order than R1/r(log p/n)(r−q)/(2r), and that the detection method based on the optimal ℓr(Σ) estimator is also optimal for testing the existence of correlations.
5. Numerical experiments
Simulations are conducted in this section to evaluate the numerical performance of our plug-in estimator for quadratic functionals. Then the proposed method is applied to two high-dimensional testing problems: simulated two-sample data and real financial equity market data.
5.1. Quadratic functional estimation
We first study the behavior of estimators for the total quadratic functional and for its off-diagonal part. To that end, four sparse covariance matrix structures were used in the simulations:
(M1) auto-correlation AR(1) covariance matrix σij = 0.25|i−j|;
(M2) banded correlation matrix with σij = 0.3 if |i − j| = 1 and 0 otherwise;
(M3) sparse matrix with a block, size p/20 by p/20, of correlation 0.3;
(M4) identity matrix (it attains the maximal level of sparsity).
We chose p = 500 and let n vary from 30 to 100. For estimating the total quadratic functional, our proposed thresholding estimator, the BS [Bai and Saranadasa (1996)] estimator and the CQ [Chen and Qin (2010)] estimator were applied to each setting over 500 repetitions. Their mean absolute estimation errors are reported in log scale (base 2) in Figure 1, with standard deviations omitted here. BS and CQ cannot be used directly for off-diagonal quadratic functional estimation, so we deducted from both of them to serve as estimators of the off-diagonal part only. The mean absolute estimation errors, compared with our proposed estimator , are depicted in log scale (base 2) in Figure 2.
Fig. 1.
Performance of estimating using thresholded estimator (dotted), CQ (solid) and BS (dashed). The mean of absolute errors over 500 repetitions in log scale (base 2) versus the sample size were reported for matrix M1 (top left), M2 (top right), M3 (bottom left), M4 (bottom right).
Fig. 2.
Performance of estimating Q(Σ) using thresholded estimator (dotted), (solid) and (dashed). The mean of absolute errors over 500 repetitions in log scale (base 2) versus the sample size were reported for matrix M1 (top left), M2 (top right), M3 (bottom left), M4 (bottom right).
The four plots correspond to the four covariance structures above. We did not report the estimation error of directly using the naive plug-in, which is an obvious disaster. In all four cases, the BS method (dashed line) does not perform well in the "large p, small n" regime. The CQ method (solid line) exhibits a relatively small estimation error in general, but it can still be improved using the thresholding method. As the theory shows, the CQ method is ratio-consistent [Chen and Qin (2010)], so our method (dotted line) is better only to second order, which is captured by the small gap between the dotted and solid curves. When estimating only off-diagonal quadratic functionals (Figure 2), the advantage of the thresholding method is even sharper, since the error caused by the nonsparse diagonal elements has been eliminated. The improved performance comes from the prior knowledge of sparsity; thus our method works best for very sparse matrices, and especially well for the identity matrix, as seen in Figure 1.
A practical question is how to choose a proper threshold, as this is important for the performance of the thresholding estimator. In the above simulations, we chose with the constant C slightly different across the four cases, but always close to 1.5. In the next two applications to hypothesis testing, we employ cross-validation to choose a proper threshold. The procedure consists of the following steps (a code sketch is given after the list):
(1) The data are split m times, v = 1, 2, …, m, into training data of sample size n1 and testing data of sample size n − n1.
(2) The training data are used to construct the thresholding estimator under a sequence of thresholds, while the testing data are used to construct the nonthresholded ratio-consistent estimator , for example, the CQ estimator of .
(3) The candidate thresholds are for j = 1, 2, …, J, where J is chosen to be a reasonably large number, say 50, and Δ is such that .
(4) The final j* is taken to be the minimizer of the following problem:
(5) The final estimator is obtained by applying the threshold to the empirical covariance matrix of the entire sample of n observations.
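Below is a hedged Python sketch of this cross-validation scheme for the off-diagonal quadratic functional. The grid τj = jΔ/J, the averaged squared-error criterion in step (4), and the use of a naive nonthresholded plug-in on the held-out half as a stand-in for a ratio-consistent estimator such as CQ are our reading of the partially elided steps; Δ and the number of splits m are left as user-supplied parameters since their defining conditions are elided.

```python
import numpy as np

def offdiag_Q(S, tau):
    """Off-diagonal quadratic functional of S after hard-thresholding at tau."""
    off = S - np.diag(np.diag(S))
    off[np.abs(off) < tau] = 0.0
    return np.sum(off ** 2)

def choose_threshold_cv(X, m=50, J=50, Delta=1.0, seed=0):
    """Sketch of the cross-validation threshold selection of Section 5.1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n1 = int(n / np.log(n))                      # splitting rule quoted from the text
    taus = np.arange(1, J + 1) * Delta / J       # candidate thresholds (assumed grid)
    err = np.zeros(J)
    for _ in range(m):
        idx = rng.permutation(n)
        train, test = X[idx[:n1]], X[idx[n1:]]
        S_train = np.cov(train, rowvar=False)
        ref = offdiag_Q(np.cov(test, rowvar=False), 0.0)   # nonthresholded reference
        for j, tau in enumerate(taus):
            err[j] += (offdiag_Q(S_train, tau) - ref) ** 2
    return taus[int(np.argmin(err))]             # tau_{j*}, then applied to the full sample
```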
Bickel and Levina (2008a) suggested using n1 = n/log n for estimating covariance matrices. This is consistent with our experience for estimating functionals when no prior knowledge about the covariance matrix structure is available. We will apply this splitting rule in the later simulation studies on high-dimensional hypothesis testing.
5.2. Application to high-dimensional two-sample testing
In this section, we apply the thresholding estimator of quadratic functionals to the high-dimensional two-sample testing problem. Two groups of data are simulated from the Gaussian models:
The dimensions considered for this problem are (p, n) ∈ {(500, 100), (1000, 150), (2000, 200)}. For simplicity, we choose Σ to be a correlation matrix and take the sparse covariance structure to be block diagonal with 2 × 2 blocks, 250 of which have correlation 0.3 while the rest have correlation 0. So the off-diagonal quadratic functional is always 45, which does not increase with p in our setting. The mean vectors μ1 and μ2 are chosen as follows. Let μ1 = 0 and let the proportion of coordinates with μ1,k = μ2,k range over {0%, 50%, 95%, 100%}. The 100% proportion corresponds to the case where the two groups are identical and thus gives information about the accuracy of the size of the tests. The 95% proportion represents the situation where the alternative hypotheses are sparse. For those k such that μ1,k ≠ μ2,k, we simply set all μ2,k to a common value. To make the power comparable among different configurations, we use a constant signal-to-noise ratio across experiments.
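For reference, here is a short Python sketch of the covariance design used here; the placement of the correlated 2 × 2 blocks at the start of the matrix is an assumption, but any placement gives the same functional value.

```python
import numpy as np

def block_design(p, rho=0.3, n_corr_blocks=250):
    """p x p correlation matrix made of 2 x 2 diagonal blocks; the first
    n_corr_blocks blocks have off-diagonal correlation rho, the rest are identity.
    The off-diagonal quadratic functional equals 2 * n_corr_blocks * rho**2,
    i.e. 45 for rho = 0.3 and n_corr_blocks = 250, matching the value in the text."""
    assert p >= 2 * n_corr_blocks
    Sigma = np.eye(p)
    for b in range(n_corr_blocks):
        i = 2 * b
        Sigma[i, i + 1] = Sigma[i + 1, i] = rho
    return Sigma

# e.g. Sigma = block_design(500); X = np.random.multivariate_normal(np.zeros(500), Sigma, size=100)
```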
Table 1 reports the empirical power and size of six testing methods based on 500 repetitions.
Table 1.
Empirical testing power and size of 6 testing methods based on 500 simulations
Prop. of equalities | BS | newBS | CQ | newCQ | Bonf | BH |
---|---|---|---|---|---|---|
p = 500, n = 100 | ||||||
0% | 0.408 | 0.422 | 0.428 | 0.432 | 0.104 | 0.110 |
50% | 0.396 | 0.422 | 0.418 | 0.428 | 0.110 | 0.116 |
95% | 0.422 | 0.440 | 0.438 | 0.442 | 0.208 | 0.214 |
100% (size) | 0.030 | 0.036 | 0.036 | 0.038 | 0.042 | 0.042 |
p = 1000, n = 150 | ||||||
0% | 0.696 | 0.710 | 0.718 | 0.718 | 0.082 | 0.086 |
50% | 0.698 | 0.712 | 0.712 | 0.714 | 0.106 | 0.112 |
95% | 0.702 | 0.716 | 0.718 | 0.722 | 0.308 | 0.328 |
100% (size) | 0.040 | 0.044 | 0.048 | 0.046 | 0.050 | 0.050 |
p = 2000, n = 200 | ||||||
0% | 0.930 | 0.938 | 0.940 | 0.940 | 0.138 | 0.146 |
50% | 0.918 | 0.922 | 0.924 | 0.928 | 0.104 | 0.106 |
95% | 0.922 | 0.928 | 0.930 | 0.930 | 0.324 | 0.338 |
100% (size) | 0.046 | 0.050 | 0.050 | 0.050 | 0.046 | 0.046 |
(BS) Bai and Saranadasa's original test.
(newBS) Bai and Saranadasa's modified test where tr(Σ2) is estimated by thresholding the sample covariance matrix.
(CQ) Chen and Qin's original test.
(newCQ) Chen and Qin's modified test where and tr(Σ1Σ2) are estimated by thresholding their empirical counterparts.
(Bonf) Bonferroni correction: This method regards the high-dimensional testing problem as p univariate testing problems. If there is a p-value that is less than 0.05/p, the null hypothesis is rejected.
(BH) Benjamini–Hochberg method. The method is similar to the Bonferroni correction, but employs the Benjamini–Hochberg method in decision making.
For estimating quadratic functionals, cross-validation is employed using the n/log(n) splitting rule. The first four methods are evaluated at the 5% significance level, while the Bonferroni and Benjamini–Hochberg corrections are evaluated at a 5% family-wise error rate and FDR, respectively. We also list the average relative estimation errors for the quadratic functionals of the first four methods in Table 2. Here, the average is taken over the four different proportions of equalities, and the average for CQ and newCQ is also taken over the errors in estimating and .
Table 2.
Mean and SD of relative errors for estimating quadratic functionals (in percentage)
p = 500, n = 100 | p = 1000, n = 150 | p = 2000, n = 200 | |
---|---|---|---|
BS | 4.93 (2.48) | 4.47 (1.56) | 5.05 (1.10) |
newBS | 2.12 (1.43) | 0.74 (0.56) | 0.54 (0.40) |
CQ | 3.72 (1.97) | 2.32 (1.24) | 1.70 (0.91) |
newCQ | 2.77 (1.38) | 1.27 (0.64) | 0.62 (0.33) |
Several comments are in order. First, the first four methods, based on Wald-type statistics with correlations ignored, perform much better in terms of power than the last two methods, which combine individual tests. Even when the proportion of equalities is 95%, so that the individual differences are large for the nonidentical means, aggregating the signals in a Wald-type statistic still outperforms. In the case of 0% identical means, the power of the Bonferroni or FDR method is extremely small, due to the small individual differences. Second, the method newCQ, which combines CQ with the thresholding estimator of the quadratic functional, has the highest power and performs best among all methods. The corrected BS method also improves on the original BS by estimating the quadratic functionals better. CQ is indeed more powerful than BS, as claimed by Chen and Qin (2010), but the performance of both methods can be improved further by leveraging the sparsity structure of the covariance matrices.
5.3. Estimation of ℓr functional and correlation detection
In order to check the effectiveness of using the ℓr norm of the thresholded sample covariance matrix to detect correlations, let us take one simple matrix structure as an example and use r = 1. Under H0, assume ; while under , where Σij = 0.8 if i, and is a random subset of size p/20 in {1, 2, …, n}. We chose p = 500 and generated n = 100 independent random vectors under both H0 and H1. The whole simulation was repeated N = 1000 times.
We compare the ℓ1 norm estimates based on the empirical covariance matrix and on the thresholded empirical covariance matrix . The threshold is chosen by cross-validation with the n/log(n) splitting rule. The simulations yielded N estimates under both the null and the alternative hypotheses, which are plotted in Figure 3. The optimal (thresholded) estimator perfectly discriminates between the null and alternative hypotheses, while the estimate based on the raw empirical covariance matrix overestimates the ℓ1 functional and blurs the difference between the two hypotheses.
Fig. 3.
Histogram of 1000 ℓ1 functional estimates for H0 and H1 by (left) and optimal estimator (right).
5.4. Application to testing multifactor pricing model
In this section, we test the validity of the Capital Asset Pricing Model (CAPM) and the Fama–French model using Pesaran and Yamagata's method (2.2) for the securities in the Standard & Poor's 500 (S&P 500) index. Following the literature, we used 60 monthly stock returns to construct the test statistics, since monthly returns are nearly independent. The composition of the index changes annually, so we selected only 276 large stocks. The monthly returns (adjusted for dividends) between January 1990 and December 2012 were downloaded from the Wharton Research Data Services (WRDS) database. The time series on the risk-free rates and the Fama–French three factors were obtained from Ken French's data library. If only the first factor, that is, the excessive return of the market portfolio, is used, the Fama–French model reduces to the CAPM. We tested the null hypothesis H0 : α = 0 for both models. The p-values of the tests, computed based on running windows of the previous 60 months, are depicted in Figure 4.
Fig. 4.
P-values of testing H0 : α = 0 in the CAPM and Fama–French three-factor models based on S&P 500 monthly returns from January 1995 to December 2012.
The results suggest that market efficiency is time dependent and that the Fama–French model is rejected less frequently than the CAPM. Before 1998, the evidence that α ≠ 0 is very strong. After 1998, the Fama–French three-factor model holds most of the time, except for the period 2007–2009, which contains the financial crisis. On the other hand, the CAPM is rejected over an extended portion of this period.
APPENDIX A: A TECHNICAL LEMMA ON χ2 DIVERGENCES
Lemma A.1
Consider a mixture of Gaussian product distributions
where such that . Then
(A.1) |
Furthermore, assume 2(log p) ≤ n. Consider the mixture defined in the proof of Theorem 3.2, where k and a are defined as follows:
If , then take k = 1 and a = (R/4)1/q.
If , then take k to be the largest integer such that
(A.2) |
and
(A.3) |
Such choices yield in both cases
(A.4) |
Moreover, in case 2 we have that (i) k ≥ 2 and (ii) under the assumption that R2 < (p − 1)n−q/2, we also have k2 ≤ R2nq < (p − 1)/2.
Proof
To unify the notation, we will work directly with , rather than , j ∈ [m]. However, in the first part of the proof, we will not use the specific form ΣS nor that of . For now, we simply assume that ΣS is invertible (we will check this later on). Recall that
where denotes the expectation with respect to . Furthermore,
Consider the spectral decomposition of , where U is an orthogonal matrix and Λ is a diagonal matrix with eigenvalues λ1, …, λp on its diagonal. Then, by rotational invariance of the Gaussian distribution, it holds
To ensure that the above expression is finite, note that the Cauchy–Schwarz inequality yields
where the two χ2 divergences are finite because for any . Therefore,
Next, observe that
Since we have not used the specific form of ΣS, , this bound is valid for any mixture and completes the proof of (A.1).
Next, we apply this bound to the specific choice for ΣS of (3.10). Note that the minimal eigenvalue of the matrices ΣS, is . Later we will show 2a2k ≤ 1, which implies that ΣS is always positive definite. In particular, this implies that for any . Moreover, it follows from definition (3.10) that
where 0 is a generic symbol to indicate space filled by zeros. Expanding the determinant along the first row (or column), we get
where in the second equality, we used Sylvester's determinant theorem. By (A.1), we have
As will be verified later, 2a2k ≤ 1. Using the fact that (1 − x)−1 ≤ exp(2x) for x ∈ [0, 1/2] and symmetry, we have
where denotes the expectation with respect to the distribution of S randomly chosen from . In particular, is the sum of k negatively associated random variables. Therefore, using negative association, the above expectation is further bounded by
Next, for a given by (A.3), we have
We now show that in both cases of the lemma we have 2a2k ≤ 1. Indeed, for case 1, we get 2a2k = 2(R/4)2/q < 2(log p)/n ≤ 1. For case 2, 2a2k ≤ 1 follows trivially from the definition of a. Also observe that k ≥ 2 since
This proves part (i) of the statement on k. To prove part (ii), observe that R2 < (p − 1)n−q/2 implies that
which is equivalent to
Therefore, k2 ≤ R2nq < (p − 1)/2.
APPENDIX B: PROOF OF THEOREM 4.1
The proof follows a similar idea to that of Theorem 3.2. For the second part of the lower bound, we use exactly the same construction of two hypotheses as in Lemma A.1. Then it follows that for the ℓr functional,
for some positive constant C, where the infimum is taken over all estimators of ℓr(Σ) based on n observations. In case 1, k2a2r = CR2r/q while in case 2, following the same arguments as before,
This completes the second part of the lower bound.
The first part of the result is a little bit more complicated than the construction of A(k) and B(k) in the proof of Theorem 3.2 due to the extra log p term in the lower bound. We need to consider a mixture of measures in order to capture the complexity of the problem. With a slight abuse of notation, we redefine (2k) × (2k) matrices A(k), B(k) as follows:
where a, b ∈ (0, 1/2) and 1 denotes a vector of ones of length k. Since R2 < p, we now construct the block diagonal covariance matrices
where the mth diagonal block is chosen to be Cm = B(k) while others are Ci = A(k) for i ≠ m and M = ⌊p/R⌋. Also define to be of the same structure with Ci = A(k) for all i. Then we have for m = 0, 1, …, M, since each row of only contains at most R nonzero elements that are bounded by 1.
Let denote the distribution of and denote the distribution of . Let (resp., ) denote the distribution of X = (X1, …, Xn) of n i.i.d. random variables drawn from (resp., ). Moreover, let denote the uniform mixture of over m ∈ [M]. Consider the testing problem
Using Theorem 2.2, part (iii) of Tsybakov (2009) as before, we need to show χ2-divergence can be bounded by a constant. By the same calculation as in Lemma A.1, we have
Note that χ2-divergence here depends on instead of since perfectly correlated random variables do not add additional information and hence do not affect χ2-divergence (see the proof of Theorem 3.2). Using the definition of 's, we obtain
Therefore,
which is bounded by ((1 − 5(a − b)2)−n/2 − 1)/M due to the fact 2(1 + a2)/(1 − a2)2 ≤ 5 for a ≤ 1/2. Now choose
By assumption, there exists a constant c0 > 1 such that R2 ≤ c0p. Thus,
Using standard techniques to reduce estimation problems to testing problems as before, we find
For the above choice of , we have
Since , the above two displays imply that
which together with the other part of the lower bound, completes the proof of the theorem.
APPENDIX C: PROOF OF THEOREM 4.4
The proof is similar to that of Theorem 3.2, but simpler since for r ≤ 1 no elbow effect exists. Consider the hypothesis construction (3.10) with and
Choose ν sufficiently small so that , which implies 2ka2 ≤ 1 and guarantees the positive semi-definiteness of ΣS. Furthermore, kaq ≤ R holds, so . By the same derivation as in Theorem 3.2, we are able to show
which by Theorem 2.2(iii) of Tsybakov (2009) leads to the final conclusion.
SUPPLEMENTARY MATERIAL
Technical proofs Fan, Rigollet and Wang (2015) (DOI: 10.1214/15-AOS1357SUPP; .pdf). This supplementary material contains the introduction to two-sample high-dimensional testing methods and the proofs of upper bounds that were omitted from the paper.
REFERENCES
- Amini AA, Wainwright MJ. High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 2009;37:2877–2921. MR2541450.
- Arias-Castro E, Bubeck S, Lugosi G. Detecting positive correlations in a multivariate sample. Bernoulli. 2015;21:209–241. MR3322317.
- Bai Z, Saranadasa H. Effect of high-dimension: By an example of a two sample problem. Statist. Sinica. 1996;6:311–329. MR1399305.
- Berthet Q, Rigollet P. Complexity theoretic lower bounds for sparse principal component detection. J. Mach. Learn. Res. 2013a;30:1046–1066.
- Berthet Q, Rigollet P. Optimal detection of sparse principal components in high-dimension. Ann. Statist. 2013b;41:1780–1815. MR3127849.
- Bickel PJ, Levina E. Covariance regularization by thresholding. Ann. Statist. 2008a;36:2577–2604. MR2485008.
- Bickel PJ, Levina E. Regularized estimation of large covariance matrices. Ann. Statist. 2008b;36:199–227. MR2387969.
- Bickel PJ, Ritov Y. Estimating integrated squared density derivatives: Sharp best order of convergence estimates. Sankhyā Ser. A. 1988;50:381–393. MR1065550.
- Birnbaum A, Johnstone IM, Nadler B, Paul D. Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Statist. 2013;41:1055–1084. doi: 10.1214/12-AOS1014. MR3113803.
- Butucea C. Goodness-of-fit testing and quadratic functional estimation from indirect observations. Ann. Statist. 2007;35:1907–1930. MR2363957.
- Butucea C, Meziani K. Quadratic functional estimation in inverse problems. Stat. Methodol. 2011;8:31–41. MR2741507.
- Cai T, Liu W. Adaptive thresholding for sparse covariance matrix estimation. J. Amer. Statist. Assoc. 2011;106:672–684. MR2847949.
- Cai TT, Low MG. Nonquadratic estimators of a quadratic functional. Ann. Statist. 2005;33:2930–2956. MR2253108.
- Cai TT, Low MG. Optimal adaptive estimation of a quadratic functional. Ann. Statist. 2006;34:2298–2325. MR2291501.
- Cai TT, Ma Z, Wu Y. Sparse PCA: Optimal rates and adaptive estimation. Ann. Statist. 2013;41:3074–3110. MR3161458.
- Cai T, Ma Z, Wu Y. Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields. 2015;161:781–815. doi: 10.1007/s00440-014-0562-z. MR3334281.
- Cai TT, Ren Z, Zhou HH. Optimal rates of convergence for estimating Toeplitz covariance matrices. Probab. Theory Related Fields. 2013;156:101–143. MR3055254.
- Cai TT, Yuan M. Adaptive covariance matrix estimation through block thresholding. Ann. Statist. 2012;40:2014–2042. MR3059075.
- Cai TT, Zhang C-H, Zhou HH. Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 2010;38:2118–2144. MR2676885.
- Cai TT, Zhou HH. Minimax estimation of large covariance matrices under ℓ1-norm. Statist. Sinica. 2012;22:1319–1349. MR3027084.
- Chen SX, Qin Y-L. A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 2010;38:808–835. MR2604697.
- Donoho DL, Nussbaum M. Minimax quadratic estimation of a quadratic functional. J. Complexity. 1990;6:290–323. MR1081043.
- Efromovich S, Low M. On optimal adaptive estimation of a quadratic functional. Ann. Statist. 1996;24:1106–1125. MR1401840.
- El Karoui N. Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 2008;36:2717–2756. MR2485011.
- Fama EF, French KR. Common risk factors in the returns on stocks and bonds. Journal of Financial Economics. 1993;33:3–56.
- Fan J. On the estimation of quadratic functionals. Ann. Statist. 1991;19:1273–1294. MR1126325.
- Fan J, Fan Y, Lv J. High dimensional covariance matrix estimation using a factor model. J. Econometrics. 2008;147:186–197. MR2472991.
- Fan J, Liao Y, Mincheva M. High-dimensional covariance matrix estimation in approximate factor models. Ann. Statist. 2011;39:3320–3356. doi: 10.1214/11-AOS944. MR3012410.
- Fan J, Liao Y, Mincheva M. Large covariance estimation by thresholding principal orthogonal complements. J. R. Stat. Soc. Ser. B. Stat. Methodol. 2013;75:603–680. doi: 10.1111/rssb.12016. MR3091653.
- Fan J, Rigollet P, Wang W. Supplement to "Estimation of functionals of sparse covariance matrices." 2015. DOI: 10.1214/15-AOS1357SUPP.
- Foucart S, Rauhut H. A Mathematical Introduction to Compressive Sensing. Birkhäuser/Springer; New York: 2013. MR3100033.
- Hall P, Marron JS. Estimation of integrated squared density derivatives. Statist. Probab. Lett. 1987;6:109–115. MR0907270.
- Ibragimov IA, Nemirovskiĭ AS, Khas'minskiĭ RZ. Some problems of nonparametric estimation in Gaussian white noise. Theory Probab. Appl. 1987;31:391–406.
- Johnstone IM, Lu AY. On consistency and sparsity for principal components analysis in high-dimensions. J. Amer. Statist. Assoc. 2009;104:682–693. doi: 10.1198/jasa.2009.0121. MR2751448.
- Jung S, Marron JS. PCA consistency in high-dimension, low sample size context. Ann. Statist. 2009;37:4104–4130. MR2572454.
- Klemelä J. Sharp adaptive estimation of quadratic functionals. Probab. Theory Related Fields. 2006;134:539–564. MR2214904.
- Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. 2009;37:4254–4278. doi: 10.1214/09-AOS720. MR2572459.
- Levina E, Vershynin R. Partial estimation of covariance matrices. Probab. Theory Related Fields. 2012;153:405–419. MR2948681.
- Lintner J. The valuation of risk assets and the selection of risky investments in stock portfolios and capital budgets. The Review of Economics and Statistics. 1965;47:13–37.
- Ma Z. Sparse principal component analysis and iterative thresholding. Ann. Statist. 2013;41:772–801. MR3099121.
- Mossin J. Equilibrium in a capital asset market. Econometrica. 1966;34:768–783.
- Nemirovski A. Topics in nonparametric statistics. In: Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738. Springer; Berlin: 2000. pp. 85–277. MR1775640.
- Nemirovskiĭ AS, Khas'minskiĭ RZ. Nonparametric estimation of the functionals of the products of a signal observed in white noise. Problemy Peredachi Informatsii. 1987;23:27–38. MR0914348.
- Onatski A, Moreira MJ, Hallin M. Asymptotic power of sphericity tests for high-dimensional data. Ann. Statist. 2013;41:1204–1231. MR3113808.
- Paul D, Johnstone IM. Augmented sparse principal component analysis for high-dimensional data. 2012. Available at arXiv:1202.1242v1.
- Pesaran MH, Yamagata T. Testing CAPM with a large number of assets. IZA Discussion Papers 6469. Institute for the Study of Labor; 2012.
- Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat. 2011;5:935–980. MR2836766.
- Rigollet P, Tsybakov AB. Comment: "Minimax estimation of large covariance matrices under ℓ1-norm" [MR3027084]. Statist. Sinica. 2012;22:1358–1367. MR3027087.
- Rothman AJ, Levina E, Zhu J. Generalized thresholding of large covariance matrices. J. Amer. Statist. Assoc. 2009;104:177–186. MR2504372.
- Sharpe WF. Capital asset prices: A theory of market equilibrium under conditions of risk. J. Finance. 1964;19:425–442.
- Srivastava MS, Du M. A test for the mean vector with fewer observations than the dimension. J. Multivariate Anal. 2008;99:386–402. MR2396970.
- Tsybakov AB. Introduction to Nonparametric Estimation. Springer; New York: 2009. MR2724359.
- Verzelen N. Minimax risks for sparse regressions: Ultra-high-dimensional phenomenons. Electron. J. Stat. 2012;6:38–90. MR2879672.
- Vu V, Lei J. Minimax rates of estimation for sparse PCA in high-dimensions. In: Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, April 21–23, 2012. JMLR W&CP. 2012;22:1278–1286.
- Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J. Comput. Graph. Statist. 2006;15:265–286. MR2252527.