Sparsistency and Rates of Convergence in Large Covariance Matrix Estimation

Clifford Lam; Jianqing Fan

doi:10.1214/09-AOS720

Author manuscript; available in PMC: 2010 Dec 1.

Published in final edited form as: Ann Stat. 2009;37(6B):4254–4278. doi: 10.1214/09-AOS720

PMCID: PMC2995610 NIHMSID: NIHMS248826 PMID: 21132082

Abstract

This paper studies the sparsistency and rates of convergence for estimating sparse covariance and precision matrices based on penalized likelihood with nonconvex penalty functions. Here, sparsistency refers to the property that all parameters that are zero are actually estimated as zero with probability tending to one. Depending on the application, sparsity may occur a priori in the covariance matrix, its inverse or its Cholesky decomposition. We study these three sparsity exploration problems under a unified framework with a general penalty function. We show that the rates of convergence for these problems under the Frobenius norm are of order (s_n log p_n/n)^{1/2}, where s_n is the number of nonzero elements, p_n is the size of the covariance matrix and n is the sample size. This explicitly spells out that the contribution of high dimensionality is merely a logarithmic factor. The conditions on the rate with which the tuning parameter λ_n goes to 0 have been made explicit and compared under different penalties. As a result, for the L₁-penalty, to guarantee sparsistency and the optimal rate of convergence, the number of nonzero elements should be small: $s_{n}^{'} = O (p_{n})$ at most, among $O (p_{n}^{2})$ parameters, for estimating a sparse covariance or correlation matrix, a sparse precision or inverse correlation matrix, or a sparse Cholesky factor, where $s_{n}^{'}$ is the number of nonzero off-diagonal entries. On the other hand, with the SCAD or hard-thresholding penalty functions, there is no such restriction.

Keywords: Covariance matrix, high dimensionality, consistency, nonconcave penalized likelihood, sparsistency, asymptotic normality

1 Introduction

Covariance matrix estimation is a common statistical problem in many scientific applications. For example, in financial risk assessment or longitudinal study, an input of covariance matrix Σ is needed, whereas an inverse of the covariance matrix, the precision matrix Σ⁻¹, is required for optimal portfolio selection, linear discriminant analysis or graphical network models. Yet, the number of parameters in the covariance matrix grows quickly with dimensionality. Depending on the applications, the sparsity of the covariance matrix or precision matrix is frequently imposed to strike a balance between biases and variances. For example, in longitudinal data analysis [see e.g., Diggle and Verbyla (1998), or Bickel and Levina (2008b)], it is reasonable to assume that remote data in time are weakly correlated, whereas in Gaussian graphical models, the sparsity of the precision matrix is a reasonable assumption (Dempster (1972)).

This has initiated a line of research focusing on the parsimony of a covariance matrix. Smith and Kohn (2002) used priors which admit zeros on the off-diagonal elements of the Cholesky factor of the precision matrix Ω = Σ⁻¹, while Wong, Carter and Kohn (2003) used a zero-admitting prior directly on the off-diagonal elements of Ω to achieve parsimony. Wu and Pourahmadi (2003) used the Modified Cholesky Decomposition (MCD) to find a banded structure for Ω nonparametrically for longitudinal data. Bickel and Levina (2008b) developed consistency theories on banding methods for longitudinal data, for both Σ and Ω.

Various authors have used penalized likelihood methods to achieve parsimony in covariance selection. Fan and Peng (2004) laid down a general framework for penalized likelihood with diverging dimensionality, with general conditions for the oracle property stated and proved. However, it is not clear whether it is applicable to the specific case of covariance matrix estimation. In particular, they did not link the dimensionality p_n with the number of nonzero elements s_n in the true covariance matrix Σ₀, or the precision matrix Ω₀. A direct application of their results to our setting can only handle a relatively small covariance matrix of size p_n = o(n^{1/10}).

Recently, there has been a surge of interest in the estimation of sparse covariance or precision matrices using penalized likelihood methods. Huang, Liu, Pourahmadi and Liu (2006) used the LASSO on the off-diagonal elements of the Cholesky factor from MCD, while Meinshausen and Bühlmann (2006), d’Aspremont, Banerjee, and El Ghaoui (2008) and Yuan and Lin (2007) used different LASSO algorithms to select zero elements in the precision matrix. A novel penalty called the nested Lasso was constructed in Levina, Rothman and Zhu (2008) to penalize off-diagonal elements. Thresholding the sample covariance matrix in the high-dimensional setting was thoroughly studied by El Karoui (2008), Bickel and Levina (2008a) and Cai, Zhang and Zhou (2009), with remarkable results for high-dimensional applications. However, thresholding is not directly applicable to estimating a sparse precision matrix when the dimensionality p_n is greater than the sample size n. Wagaman and Levina (2008) proposed an Isomap method for discovering meaningful orderings of variables based on their correlations that result in block-diagonal or banded correlation structure, resulting in an Isoband estimator. A permutation invariant estimator, called SPICE, was proposed in Rothman, Bickel, Levina and Zhu (2008) based on penalized likelihood with the L₁-penalty on the off-diagonal elements of the precision matrix. They obtained remarkable results on the rates of convergence. The rate for estimating Ω under the Frobenius norm is of order (s_n log p_n/n)^{1/2}, with dimensionality costing only a logarithmic factor in the overall mean-square error, where s_n = p_n + s_n1, p_n is the number of diagonal elements and s_n1 is the number of nonzero off-diagonal entries. However, such a rate of convergence addresses explicitly neither the issues of sparsistency such as those in Fan and Li (2001) and Zhao and Yu (2006), nor the bias issues due to the L₁-penalty and the sampling distribution of the estimated nonzero elements. These are the core issues of the study. By sparsistency, we mean the property that all parameters that are zero are actually estimated as zero with probability tending to one, a weaker requirement than that of Ravikumar, Lafferty, Liu and Wasserman (2008).

In this paper, we investigate the aforementioned problems using the penalized pseudo-likelihood method. Assume a random sample $\{y_{i}\}_{1 \leq i \leq n}$ with mean zero and covariance matrix Σ₀, satisfying some sub-Gaussian tail conditions as specified in Lemma 2 (see Section 5). The sparsity of the true precision matrix Ω₀ can be explored by maximizing the Gaussian quasi-likelihood or, equivalently, minimizing

q_{1} (Ω) = tr (S Ω) - \log ∣ Ω ∣ + \sum_{i \neq j} p_{λ_{n 1}} (∣ ω_{i j} ∣),

(1.1)

which is the penalized negative log-likelihood if the data is Gaussian. The matrix $S = n^{- 1} \sum_{i = 1}^{n} y_{i} y_{i}^{T}$ is the sample covariance matrix, Ω = (ω_ij), and p_{λ_n1}(·) is a penalty function, depending on a regularization parameter λ_n1, which can be nonconvex. For instance, the L₁-penalty p_λ(θ) = λ|θ| is convex, while the hard-thresholding penalty defined by p_λ(θ) = λ² − (|θ| − λ)²1_{|θ|<λ}, and the SCAD penalty defined by

p_{λ}^{'} (θ) = λ 1_{\{θ \leq λ\}} + {(a λ - θ)}_{+} 1_{\{θ > λ\}} ∕ (a - 1), for some a > 2,

(1.2)

are folded-concave. Nonconvex penalties are introduced to reduce bias when the true parameter has a relatively large magnitude. For example, the SCAD penalty remains constant when θ is large, while the L₁-penalty grows linearly with θ. See Fan and Li (2001) for a detailed account of this and other advantages of such a penalty function. The computation can be done via the local linear approximation (Zou and Li, 2008, Fan et al. 2009); see Section 2.1 for additional details.
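As a rough numerical illustration of the contrast just described, the sketch below evaluates the SCAD derivative (1.2) and the L₁ derivative at a few positive values of θ. The function names and the default a = 3.7 (the value suggested in Fan and Li (2001)) are assumptions made for this example only; the text requires only a > 2.

```python
import numpy as np


def l1_derivative(theta, lam):
    """Derivative of the L1 penalty lam * |theta| at theta > 0: constant lam."""
    theta = np.asarray(theta, dtype=float)
    return np.full_like(theta, lam)


def scad_derivative(theta, lam, a=3.7):
    """SCAD derivative (1.2) at theta >= 0; a = 3.7 is an assumed default."""
    theta = np.asarray(theta, dtype=float)
    inner = lam * (theta <= lam)  # equals lam on [0, lam]
    # Linear decay on (lam, a * lam), exactly zero once theta >= a * lam.
    outer = np.maximum(a * lam - theta, 0.0) / (a - 1.0) * (theta > lam)
    return inner + outer


lam = 0.1
theta = np.array([0.05, 0.2, 1.0])
scad_vals = scad_derivative(theta, lam)  # shrinks to 0 for large theta
l1_vals = l1_derivative(theta, lam)      # stays at lam for every theta
print(scad_vals, l1_vals)
```

A zero derivative for large θ is what makes the SCAD penalty flat there, so large coefficients are not shrunk, whereas the L₁ derivative stays at λ.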

Similarly, the sparsity of the true covariance matrix Σ₀ can be explored by minimizing

q_{2} (Σ) = tr (S Σ^{- 1}) + \log ∣ Σ ∣ + \sum_{i \neq j} p_{λ_{n 2}} (∣ σ_{i j} ∣),

(1.3)

where Σ = (σ_ij). Note that we only penalize the off-diagonal elements of Σ or Ω in the aforementioned two methods, since the diagonal elements of Σ₀ and Ω₀ do not vanish.
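To make the two objectives concrete, here is a minimal sketch that evaluates (1.1) and (1.3) with the L₁-penalty p_λ(θ) = λ|θ| on the off-diagonal entries. The helper names (`offdiag_l1`, `logdet_pd`, `q1`, `q2`) are hypothetical, and returning infinity outside the positive definite cone is a convention chosen for this example, not something taken from the paper.

```python
import numpy as np


def offdiag_l1(M, lam):
    """L1 penalty lam * sum_{i != j} |m_ij|; the diagonal is not penalized."""
    off = M - np.diag(np.diag(M))
    return lam * np.abs(off).sum()


def logdet_pd(M):
    """log|M| for a symmetric positive definite M, or None otherwise."""
    eig = np.linalg.eigvalsh(M)
    if eig.min() <= 0:
        return None
    return float(np.log(eig).sum())


def q1(Omega, S, lam):
    """Objective (1.1): tr(S Omega) - log|Omega| + penalty on Omega."""
    ld = logdet_pd(Omega)
    if ld is None:
        return np.inf
    return float(np.trace(S @ Omega) - ld + offdiag_l1(Omega, lam))


def q2(Sigma, S, lam):
    """Objective (1.3): tr(S Sigma^{-1}) + log|Sigma| + penalty on Sigma."""
    ld = logdet_pd(Sigma)
    if ld is None:
        return np.inf
    # tr(Sigma^{-1} S) equals tr(S Sigma^{-1}); solve avoids an explicit inverse.
    return float(np.trace(np.linalg.solve(Sigma, S)) + ld + offdiag_l1(Sigma, lam))


S = np.eye(2)
M = np.array([[2.0, 1.0], [1.0, 2.0]])
val1 = q1(M, S, lam=0.1)  # 4 - log 3 + 0.2
val2 = q2(M, S, lam=0.1)  # 4/3 + log 3 + 0.2
print(val1, val2)
```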

In studying a sparse covariance or precision matrix, it is important to distinguish between the diagonal and off-diagonal elements, since the diagonal elements are always positive and they contribute to the overall mean-square error. For example, the true correlation matrix, denoted by Γ₀, has the same sparsity structure as Σ₀ without the need to estimate its diagonal elements. In view of this fact, we introduce a revised method (3.2) to take advantage of it. It turns out that the correlation matrix can be estimated with a faster rate of convergence, at (s_n1 log p_n/n)^{1/2} instead of ((p_n + s_n1) log p_n/n)^{1/2}, where s_n1 is the number of nonzero correlation coefficients. We can take similar advantage in the estimation of the true inverse correlation matrix, denoted by Ψ₀. See Section 2.5. This is an extension of the work of Rothman et al. (2008) using the L₁-penalty. Such an extension is important since the nonconcave penalized likelihood ameliorates the bias problem of the L₁-penalized likelihood.

The bias issues of the commonly used L₁-penalty, or LASSO, can be seen from our theoretical results. In fact, due to the bias of LASSO, an upper bound on λ_ni is needed in order to achieve a fast rate of convergence. On the other hand, a lower bound is required in order to achieve sparsity of the estimated precision or covariance matrices. This is in fact one of the motivations for introducing nonconvex penalty functions in Fan and Li (2001) and Fan and Peng (2004), but we state and prove the explicit rates in the current context. In particular, we demonstrate that the L₁-penalized estimator can achieve simultaneously the optimal rate of convergence and sparsistency for estimation of Σ₀ or Ω₀ when the number of nonzero off-diagonal entries is no larger than O(p_n), but this is not guaranteed otherwise. On the other hand, with nonconvex penalties like the SCAD or hard-thresholding penalty, such an extra restriction is not needed.

We also compare two different formulations of penalized likelihood using the modified Cholesky decomposition, exploring their respective rates of convergence and sparsity properties.

Throughout this paper, we use λ_min(A), λ_max(A) and tr(A) to denote the minimum eigenvalue, maximum eigenvalue, and trace of a symmetric matrix A respectively. For a matrix B, we define the operator norm and the Frobenius norm, respectively, as $∥ B ∥ = λ_{\max}^{1 ∕ 2} (B^{T} B)$ and ${∥ B ∥}_{F} = tr^{1 ∕ 2} (B^{T} B)$.

2 Estimation of sparse precision matrix

In this section, we present the analysis of (1.1) for estimating a sparse precision matrix. Before this, let us first present an algorithm for computing the nonconcave maximum (pseudo)-likelihood estimator and then state the conditions needed for our technical results.

2.1 Algorithm based on iterated reweighted L₁-penalty

The nonconcave penalized likelihood problem can be solved by a sequence of L₁-penalized likelihood problems via local linear approximation (Zou and Li 2008, Fan et al. 2009). For example, given the current estimate Ω_k = (ω_ij,k), by the local linear approximation to the penalty function,

q_{1} (Ω) \approx tr (S Ω) - \log ∣ Ω ∣ + \sum_{i \neq j} [p_{λ_{n 1}} (∣ ω_{i j, k} ∣) + p_{λ_{n 1}}^{'} (∣ ω_{i j, k} ∣) (∣ ω_{i j} ∣ - ∣ ω_{i j, k} ∣)] .

(2.1)

Hence, Ω_k+1 should be taken to minimize the right-hand side of (2.1):

Ω_{k + 1} = {argmin}_{Ω} [tr (S Ω) - \log ∣ Ω ∣ + \sum_{i \neq j} p_{λ_{n 1}}^{'} (∣ ω_{i j, k} ∣) ∣ ω_{i j} ∣],

(2.2)

after ignoring the two constant terms. Problem (2.2) is a weighted L₁-penalized likelihood problem. In particular, if we take the most primitive initial value Ω₀ = 0, then

Ω_{1} = {argmin}_{Ω} [tr (S Ω) - \log ∣ Ω ∣ + λ_{n 1} \sum_{i \neq j} ∣ ω_{i j} ∣],

is already a good estimator. Iterating (2.2) reduces the bias of the estimator, as larger estimated coefficients in the previous iterations receive less penalty. In fact, in a different setup, Zou and Li (2008) showed that one iteration of such a procedure is sufficient as long as the initial values are good enough.
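The reweighting step in (2.2) can be sketched as follows for the SCAD penalty: given the current iterate Ω_k, the weight attached to |ω_ij| is the penalty derivative at |ω_ij,k|. This is only an illustration of how the weights are formed; the names `scad_derivative` and `lla_weights` and the default a = 3.7 are assumptions for the example, and the weighted L₁ problem itself would still have to be passed to a suitable solver.

```python
import numpy as np


def scad_derivative(theta, lam, a=3.7):
    """SCAD derivative (1.2) at theta >= 0; a = 3.7 is an assumed default."""
    theta = np.asarray(theta, dtype=float)
    inner = lam * (theta <= lam)
    outer = np.maximum(a * lam - theta, 0.0) / (a - 1.0) * (theta > lam)
    return inner + outer


def lla_weights(Omega_k, lam, a=3.7):
    """Weights p'_lam(|omega_ij,k|) for the off-diagonal entries in (2.2)."""
    W = scad_derivative(np.abs(Omega_k), lam, a)
    np.fill_diagonal(W, 0.0)  # diagonal entries are not penalized
    return W


lam = 0.2
W_start = lla_weights(np.zeros((3, 3)), lam)  # every off-diagonal weight is lam

Omega_k = np.array([[2.0, 1.0, 0.1],
                    [1.0, 2.0, 0.0],
                    [0.1, 0.0, 2.0]])
W_next = lla_weights(Omega_k, lam)  # large entries get weight 0, small ones lam
print(W_start)
print(W_next)
```

Starting from the zero matrix reproduces the plain L₁ problem with parameter λ_n1, and entries that were estimated as large in the previous iterate receive zero weight in the next step.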

Fan et al. (2009) have implemented the above algorithm for optimizing (1.1). They have also demonstrated in Section 2.2 of their paper how to utilize the graphical lasso algorithm of Friedman, Hastie and Tibshirani (2008), which is essentially a group coordinate descent procedure, to solve problem (2.2) quickly, even when p_n > n. Such a group coordinate descent algorithm was also used by Meier et al. (2008) to solve the group LASSO problem. Thus iteratively, (2.2), and hence (1.1), can be solved quickly with the graphical lasso algorithm. See also Zhang (2007) for a general solution to the folded-concave penalized least-squares problem. The following is a brief summary of the numerical results in Fan et al. (2009).

2.2 Some numerical results

We give a brief summary of a breast cancer data analysis with p_n > n considered in Fan et al. (2009). For full details, please refer to Section 3.2 of Fan et al. (2009). Other simulation results are also in Section 4 in their paper.

Breast cancer data

Normalized gene expression data from 130 patients with stage I-III breast cancers are analyzed, 33 of whom belong to class 1 and 97 to class 2. The aim is to assess the accuracy of predicting which class a patient belongs to, using a set of pre-selected genes (p_n = 110, chosen by t-tests) as gene expression profile data. The data are randomly divided into training (n = 109) and testing sets. The mean vector of the gene expression levels is obtained from the training data, as is the associated inverse covariance matrix, estimated using the LASSO, adaptive LASSO and SCAD penalties as three different regularization methods. A linear discriminant score is then calculated for each regularization method and applied to the testing set to predict whether a patient belongs to class 1 or 2. This is repeated 100 times.

On average, the estimated precision matrix $\hat{Ω}$ using LASSO has many more nonzeros than that using SCAD (3923 versus 674). This is not surprising when we look at equation (2.3) in our paper, where the L₁ penalty imposes an upper bound on the tuning parameter λ_n1 for consistency, which links to reducing the bias in the estimation. This makes the λ_n1 in practice too small to set many of the elements in $\hat{Ω}$ to zero. While we do not know which elements in the true Ω are zero, the large number of nonzero elements in the L₁ penalized estimator seems spurious, and the resulting gene network is not easy to interpret.

On the other hand, the SCAD-penalized estimator has a much smaller number of nonzero elements, since the tuning parameter λ_n1 is not bounded above under consistency of the resulting estimator. This makes the resulting gene network easier to interpret, with some clusters of genes identified.

Also, classification results on the testing set using the SCAD penalty for precision matrix estimation are better than those using the L₁ penalty, in the sense that the specificity (#True Negative/#class 2) is higher (0.794 versus 0.768) while the sensitivity (#True Positive/#class 1) is similar to that using the L₁-penalized precision matrix estimator.

2.3 Technical conditions

We now introduce some notations and present regularity conditions for the rate of convergence and sparsistency.

Let $S_{1} = \{(i, j) : ω_{i j}^{0} \neq 0\}$, where $Ω_{0} = (ω_{i j}^{0})$ is the true precision matrix. Denote by s_n1 = |S₁| − p_n the number of nonzero off-diagonal entries of Ω₀. Define

a_{n 1} = \max_{(i, j) \in S_{1}} p_{λ_{n 1}}^{'} (∣ ω_{i j}^{0} ∣), b_{n 1} = \max_{(i, j) \in S_{1}} p_{λ_{n 1}}^{″} (∣ ω_{i j}^{0} ∣) .

The term a_n1 is related to the asymptotic bias of the penalized likelihood estimate due to penalization. Note that for L₁-penalty, a_n1 = λ_n1 and b_n1 = 0, whereas for SCAD, a_n1 = b_n1 = 0 for sufficiently large n under the last assumption of condition (B) below.

We assume the following regularity conditions:

(A) There are constants τ₁ and τ₂ such that $0 < τ_{1} < λ_{\min} (Σ_{0}) \leq λ_{\max} (Σ_{0}) < τ_{2} < \infty$ for all n.
(B) a_n1 = O({(1 + p_n/(s_n1 + 1)) log p_n/n}^{1/2}), b_n1 = o(1), and $\min_{(i, j) \in S_{1}} ∣ ω_{i j}^{0} ∣ ∕ λ_{n 1} \to \infty$ as n → ∞.
(C) The penalty p_λ(·) is singular at the origin, with $\lim_{t \downarrow 0} p_{λ} (t) ∕ (λ t) = k > 0$.
(D) There are constants C and D such that, when $θ_{1}, θ_{2} < C λ_{n 1}$, $∣ p_{λ_{n 1}}^{″} (θ_{1}) - p_{λ_{n 1}}^{″} (θ_{2}) ∣ \leq D ∣ θ_{1} - θ_{2} ∣$.

Condition (A) bounds uniformly the eigenvalues of Σ₀, which facilitates the proof of consistency. It also includes a wide class of covariance matrices as noted in Bickel and Levina (2008b). The rates a_n1 and b_n1 in condition (B) are also needed for proving consistency. If they are too large, the bias due to penalty can dominate the variance from the likelihood, resulting in poor estimates.

The last requirement in condition (B) states the rate at which the nonzero parameters should be distinguished from zero asymptotically. It is not explicitly needed in the proofs, but for asymptotically unbiased penalty functions, this is a necessary condition so that a_n1 and b_n1 converge to zero fast enough as needed in the first part of condition (B). In particular, for the SCAD and hard-thresholding penalty functions, this condition implies that a_n1 = b_n1 = 0 exactly for sufficiently large n, thus allowing a flexible choice of λ_n1. For the SCAD penalty (1.2), whose derivative vanishes once θ exceeds aλ, the condition can be relaxed to $\min_{(i, j) \in S_{1}} ∣ ω_{i j}^{0} ∣ ∕ λ_{n 1} > a$.
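The behaviour of the bias term a_n1 under different penalties can be checked on a toy example. The sketch below computes a_n1 for the L₁-penalty and for the hard-thresholding penalty p_λ(θ) = λ² − (|θ| − λ)²1_{|θ|<λ}, whose derivative at θ > 0 is 2(λ − θ)₊. The matrix `Omega0`, the helper names and the values of λ are made up for illustration.

```python
import numpy as np


def hard_derivative(theta, lam):
    """Derivative of the hard-thresholding penalty at theta > 0: 2 (lam - theta)_+."""
    theta = np.asarray(theta, dtype=float)
    return 2.0 * np.maximum(lam - theta, 0.0)


def bias_term(Omega0, derivative):
    """a_n1: the largest penalty derivative over the nonzero entries of Omega0."""
    support = Omega0 != 0  # the index set S_1
    return float(np.max(derivative(np.abs(Omega0[support]))))


# Toy "true" precision matrix; its smallest nonzero entry has magnitude 0.5.
Omega0 = np.array([[2.0, 0.5, 0.0],
                   [0.5, 2.0, 0.0],
                   [0.0, 0.0, 1.0]])

lam = 0.1
a_l1 = bias_term(Omega0, lambda t: np.full_like(t, lam))        # equals lam
a_hard = bias_term(Omega0, lambda t: hard_derivative(t, lam))   # equals 0

# If lam is too large relative to the smallest signal, the bias term reappears.
a_hard_big = bias_term(Omega0, lambda t: hard_derivative(t, 1.0))
print(a_l1, a_hard, a_hard_big)
```

With λ well below the smallest nonzero entry, the hard-thresholding bias term vanishes exactly, while the L₁ bias term stays at λ whatever the signal size.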

The singularity in condition (C) gives sparsity in the estimates [Fan and Li (2001)]. Finally, condition (D) is a smoothing condition for the penalty function, and is needed in proving asymptotic normality. The SCAD penalty, for instance, satisfies this condition by choosing the constant D, independent of n, to be large enough.

2.4 Properties of sparse precision matrix estimation

Minimizing (1.1) involves nonconvex minimization, and we need to prove that there exists a local minimizer $\hat{Ω}$ for the minimization problem with a certain rate of convergence, which is given under the Frobenius norm. The proof is given in Section 5. It is similar to the one given in Rothman et al. (2008), but now the penalty function is nonconvex.

Theorem 1 (Rate of convergence). Under regularity conditions (A)-(D), if $(p_{n} + s_{n 1}) \log p_{n} ∕ n = O (λ_{n 1}^{2})$ and (p_n + s_n1)(log p_n)^k/n = O(1) for some k > 1, then there exists a local minimizer $\hat{Ω}$ such that ${∥ \hat{Ω} - Ω_{0} ∥}_{F}^{2} = O_{P} \{(p_{n} + s_{n 1}) \log p_{n} ∕ n\}$. For the L₁-penalty, we only need $\log p_{n} ∕ n = O (λ_{n 1}^{2})$.

The proofs of this theorem and others are relegated to Section 5 so that readers can get to the results more quickly. As in Fan and Li (2001), the asymptotic bias due to the penalty for each nonzero parameter is a_n1. Since we penalize only the off-diagonal elements, the total bias induced by the penalty is asymptotically of order s_n1a_n1. The square of this total bias over all nonzero elements is of order O_P{(p_n + s_n1) log p_n/n} under condition (B).

Theorem 1 states explicitly how the number of nonzero elements and the dimensionality affect the rate of convergence. Since there are (p_n + s_n1) nonzero elements and each of them can be estimated at best with rate n^{−1/2}, the total squared error is at least of rate (p_n + s_n1)/n. The price that we pay for high dimensionality is merely a logarithmic factor log p_n. The result holds as long as (p_n + s_n1)/n is at a rate O((log p_n)^{−k}) for some k > 1, which decays to zero slowly. This means that in practice p_n can be comparable to n without violating the results. The condition here is not the weakest possible; we expect the result to hold for p ≫ n. Here, by a local minimizer we mean an interior point of a given closed set that minimizes the target function over that set. Following a similar argument to Huang et al. (2008), the local minimizer in Theorem 1 can be taken as the global minimizer under additional conditions on the tail of the penalty function.

Theorem 1 is also applicable to the L₁-penalty function, where the local minimizer becomes the global minimizer. The asymptotic bias of the L₁-penalized estimate is given in the term s_n1a_n1 = s_n1λ_n1 as shown in the technical proof. In order to control the bias, we impose condition (B), which entails an upper bound on λ_n1 = O({(1 + p_n/(s_n1 + 1)) log p_n/n}^{1/2}). The bias problem due to the L₁-penalty in the finite-parameter setting has already been unveiled by Fan and Li (2001) and Zou (2006).

Next we show the sparsistency of the penalized estimator from (1.1). We use S^c to denote the complement of a set S.

Theorem 2 (Sparsistency). Under the conditions given in Theorem 1, for any local minimizer of (1.1) satisfying ${∥ \hat{Ω} - Ω_{0} ∥}_{F}^{2} = O_{P} \{(p_{n} + s_{n 1}) \log p_{n} ∕ n\}$ and ${∥ \hat{Ω} - Ω_{0} ∥}^{2} = O_{P} (η_{n})$ for a sequence $η_{n} \to 0$, if $\log p_{n} ∕ n + η_{n} = O (λ_{n 1}^{2})$, then with probability tending to 1, ${\hat{ω}}_{i j} = 0$ for all $(i, j) \in S_{1}^{c}$.

First, since ${∥ M ∥}^{2} \leq {∥ M ∥}_{F}^{2}$ for any matrix M, we can always take η_n = (p_n + s_n1) log p_n/n in Theorem 2, but this will result in a more stringent requirement on the number of nonzero elements when the L₁-penalty is used, as we now explain. Sparsistency requires a lower bound on the rate of the regularization parameter λ_n1. On the other hand, condition (B) imposes an upper bound on λ_n1 when the L₁-penalty is used in order to control the biases. Explicitly, we need, for L₁-penalized likelihood,

\log p_{n} ∕ n + η_{n} = O (λ_{n 1}^{2}) = (1 + p_{n} ∕ (s_{n 1} + 1)) \log p_{n} ∕ n

(2.3)

for both consistency and sparsistency to be satisfied. We present two scenarios here for the two bounds to be compatible, making use of the inequalities ${∥ M ∥}_{F}^{2} ∕ p_{n} \leq {∥ M ∥}^{2} \leq {∥ M ∥}_{F}^{2}$ for a matrix M of size p_n.

We always have $∥ \hat{Ω} - Ω_{0} ∥ \leq {∥ \hat{Ω} - Ω_{0} ∥}_{F}$ . In the worst case scenario where they have the same order, ${∥ \hat{Ω} - Ω_{0} ∥}^{2} = O_{P} ((p_{n} + s_{n 1}) \log p_{n} ∕ n)$ , so that η_n = (p_n+s_n1) log p_n/n. It is then easy to see from (2.3) that the two bounds are compatible only when s_n1 = O(1).
We also have ${∥ \hat{Ω} - Ω_{0} ∥}_{F}^{2} ∕ p_{n} \leq {∥ \hat{Ω} - Ω_{0} ∥}^{2}$ . In the optimistic scenario where they have the same order,
${∥ \hat{Ω} - Ω_{0} ∥}^{2} = O_{P} ((1 + s_{n 1} ∕ p_{n}) \log p_{n} ∕ n) .$
Hence, η_n = (1 + s_n1/p_n) log p_n/n, and compatibility of the bounds requires s_n1 = O(p_n).

Hence, even in the optimistic scenario, consistency and sparsistency are guaranteed only when s_n1 = O(p_n) if the L₁-penalty is used, i.e., the precision matrix has to be sparse enough.
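The two scenarios above rest on the norm inequalities ∥M∥²_F/p_n ≤ ∥M∥² ≤ ∥M∥²_F. The following sketch checks them numerically and shows one matrix attaining each end; reading the rank-one and identity matrices as stand-ins for the "worst case" and "optimistic" error shapes is an interpretation added here for illustration, and the helper name `squared_norms` is hypothetical.

```python
import numpy as np


def squared_norms(M):
    """Return (squared operator norm, squared Frobenius norm) of M."""
    op_sq = np.linalg.norm(M, 2) ** 2      # largest singular value, squared
    fro_sq = np.linalg.norm(M, "fro") ** 2
    return op_sq, fro_sq


p = 50
rng = np.random.default_rng(0)
A = rng.standard_normal((p, p))
M = (A + A.T) / 2                          # a generic symmetric matrix
op_sq, fro_sq = squared_norms(M)
print(fro_sq / p <= op_sq <= fro_sq)

# Upper end: a rank-one matrix has equal operator and Frobenius norms.
u = np.ones((p, 1)) / np.sqrt(p)
op_r1, fro_r1 = squared_norms(u @ u.T)

# Lower end: for the identity, the squared Frobenius norm is p times larger.
op_id, fro_id = squared_norms(np.eye(p))
print(op_r1, fro_r1, op_id, fro_id)
```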

However, if the penalty function used is unbiased, like the SCAD or the hard-thresholding penalty, we do not impose an extra upper bound for λ_n1 since its first derivative $p_{λ_{n 1}}^{'} (∣ θ ∣)$ goes to zero fast enough as |θ| increases (exactly equals zero for the SCAD and hard-thresholding penalty functions, when n is sufficiently large; see condition (B) and the explanation thereof). Thus, λ_n1 is allowed to decay to zero slowly, allowing even the largest order $s_{n 1} = O (p_{n}^{2})$ .

We remark that asymptotic normality for the estimators of the elements in S₁ has been established in a previous version of this paper. We omit it here for brevity.

2.5 Properties of sparse inverse correlation matrix estimation

The inverse correlation matrix Ψ₀ retains the same sparsity structure as Ω₀. Consistency and sparsistency results can be achieved with p_n as large as log p_n = o(n), as long as (s_n1 + 1)(log p_n)^k/n = O(1) for some k > 1 as n → ∞. We minimize, w.r.t. Ψ = (ψ_ij),

tr (Ψ {\hat{Γ}}_{S}) - \log ∣ Ψ ∣ + \sum_{i \neq j} p_{ν_{n 1}} (∣ ψ_{i j} ∣),

(2.4)

where ${\hat{Γ}}_{S} = {\hat{W}}^{- 1} S {\hat{W}}^{- 1}$ is the sample correlation matrix, with ${\hat{W}}^{2} = D_{S}$ being the diagonal matrix with the diagonal elements of S, and ν_n1 is a regularization parameter. After obtaining $\hat{Ψ}$, Ω₀ can also be estimated by $\tilde{Ω} = {\hat{W}}^{- 1} \hat{Ψ} {\hat{W}}^{- 1}$.
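The standardization and back-transformation just described can be sketched in a few lines. The helper names are hypothetical, and the unpenalized inverse of the sample correlation matrix is used only as a stand-in for $\hat{Ψ}$ (which is possible here because n > p_n in the toy data); in the paper $\hat{Ψ}$ is the penalized minimizer of (2.4).

```python
import numpy as np


def sample_correlation(S):
    """Return (W^{-1} S W^{-1}, w), where w holds the diagonal of W = D_S^{1/2}."""
    w = np.sqrt(np.diag(S))
    return S / np.outer(w, w), w


def rescale_to_precision(Psi_hat, w):
    """Map an inverse-correlation estimate to W^{-1} Psi_hat W^{-1}."""
    return Psi_hat / np.outer(w, w)


rng = np.random.default_rng(1)
# Toy mean-zero data with unequal column scales (made up for illustration).
Y = rng.standard_normal((200, 4)) * np.array([1.0, 2.0, 0.5, 3.0])
S = Y.T @ Y / Y.shape[0]

Gamma_S, w = sample_correlation(S)           # unit diagonal by construction
Psi_unpenalized = np.linalg.inv(Gamma_S)     # stand-in for the penalized estimate
Omega_tilde = rescale_to_precision(Psi_unpenalized, w)
print(np.diag(Gamma_S))
```

Without any penalty the round trip simply recovers the inverse of S, which is a convenient sanity check on the scaling.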

To present the rates of convergence for $\hat{Ψ}$ and $\tilde{Ω}$ , we define

c_{n 1} = \max_{(i, j) \in S_{1}} p_{ν_{n 1}}^{'} (∣ ψ_{i j}^{0} ∣), d_{n 1} = \max_{(i, j) \in S_{1}} p_{ν_{n 1}}^{″} (∣ ψ_{i j}^{0} ∣),

where $Ψ_{0} = (ψ_{i j}^{0})$, modify condition (D) to (D’) with λ_n1 there replaced by ν_n1, and impose (B’) c_n1 = O({log p_n/n}^{1/2}), d_n1 = o(1), and $\min_{(i, j) \in S_{1}} ∣ ψ_{i j}^{0} ∣ ∕ ν_{n 1} \to \infty$ as n → ∞.

Theorem 3. Under regularity conditions (A), (B’), (C) and (D’), if (s_n1 + 1)(log p_n)^k/n = O(1) for some k > 1 and $(s_{n 1} + 1) \log p_{n} ∕ n = o (ν_{n 1}^{2})$, then there exists a local minimizer $\hat{Ψ}$ for (2.4) such that ${∥ \hat{Ψ} - Ψ_{0} ∥}_{F}^{2} = O_{P} (s_{n 1} \log p_{n} ∕ n)$ and ${∥ \tilde{Ω} - Ω_{0} ∥}^{2} = O_{P} ((s_{n 1} + 1) \log p_{n} ∕ n)$ under the operator norm. For the L₁-penalty, we only need $\log p_{n} ∕ n = O (ν_{n 1}^{2})$.

Note that we can allow p_n ≫ n without violating the result as long as log p_n/n = o(1). Note also that an order of {p_n log p_n/n}^{1/2} is removed by estimating the inverse correlation rather than the precision matrix, which is somewhat surprising since the inverse correlation matrix, unlike the correlation matrix, does not have known diagonal elements that contribute no errors to the estimation. This can be explained and proved as follows. If s_n1 = O(p_n), the result is obvious. When s_n1 = o(p_n), most of the off-diagonal elements are zero. Indeed, there are at most O(s_n1) columns of the inverse correlation matrix which contain at least one nonzero off-diagonal element. The rest of the columns, which have all off-diagonal elements equal to zero, must have diagonal entries 1. These columns represent variables that are actually uncorrelated with the rest. Now, it is easy to see from (2.4) that these diagonal elements, which are one, are all estimated exactly as one with no estimation error. Hence, an order of (p_n log p_n/n)^{1/2} is not present even in the case of estimating the inverse correlation matrix.

For the L₁-penalty, our result reduces to that given in Rothman et al. (2008). We offer the sparsistency result as follows.

Theorem 4 (Sparsistency). Under the conditions given in Theorem 3, for any local minimizer of (2.4) satisfying ${∥ \hat{Ψ} - Ψ_{0} ∥}_{F}^{2} = O_{P} (s_{n 1} \log p_{n} ∕ n)$ and ${∥ \hat{Ψ} - Ψ_{0} ∥}^{2} = O_{P} (η_{n})$ for some η_n → 0, if $\log p_{n} ∕ n + η_{n} = O (ν_{n 1}^{2})$, then with probability tending to 1, ${\hat{ψ}}_{i j} = 0$ for all $(i, j) \in S_{1}^{c}$.

The proof is identical to that of Theorem 2 in Section 2.4, and is thus omitted.

For the L₁-penalty, control of bias and sparsistency require ν_n1 to satisfy bounds like (2.3):

\log p_{n} ∕ n + η_{n} = O (ν_{n 1}^{2}) = \log p_{n} ∕ n .

(2.5)

This leads to two scenarios:

The worst case scenario has
${∥ \hat{Ψ} - Ψ_{0} ∥}^{2} = {∥ \hat{Ψ} - Ψ_{0} ∥}_{F}^{2} = O_{P} (s_{n 1} \log p_{n} ∕ n),$
meaning η_n = s_n1 log p_n/n. Then compatibility of the bounds in (2.5) requires s_n1 = O(1).
The optimistic scenario has
${∥ \hat{Ψ} - Ψ_{0} ∥}^{2} = {∥ \hat{Ψ} - Ψ_{0} ∥}_{F}^{2} ∕ p_{n} = O_{P} (s_{n 1} ∕ p_{n} \cdot \log p_{n} ∕ n),$
meaning η_n = s_n1/p_n · log p_n/n. Then compatibility of the bounds in (2.5) requires s_n1 = O(p_n).

On the other hand, for penalties like the SCAD or the hard-thresholding penalty, we do not need an upper bound for s_n1. Hence, we only need (s_n1 + 1)(log p_n)^k/n = O(1) as n → ∞ for some k > 1. It is clear that SCAD results in better sampling properties than the L₁-penalized estimator in precision or inverse correlation matrix estimation.

3 Estimation of sparse covariance matrix

In this section, we analyze sparse covariance matrix estimation using the penalized likelihood (1.3). It is then modified to estimate the correlation matrix, which improves the rate of convergence. We assume that the y_i’s are i.i.d. N(0, Σ₀) throughout this section.

3.1 Properties of sparse covariance matrix estimation

Let $S_{2} = \{(i, j) : σ_{i j}^{0} \neq 0\}$, where $Σ_{0} = (σ_{i j}^{0})$. Denote s_n2 = |S₂| − p_n, so that s_n2 is the number of nonzero off-diagonal entries of Σ₀. Put

a_{n 2} = \max_{(i, j) \in S_{2}} p_{λ_{n 2}}^{'} (∣ σ_{i j}^{0} ∣), b_{n 2} = \max_{(i, j) \in S_{2}} p_{λ_{n 2}}^{″} (∣ σ_{i j}^{0} ∣) .

Technical conditions in Section 2 need some revision. In particular, condition (D) now becomes condition (D2) with λ_n1 there replaced by λ_n2. Condition (B) should now be (B2) a_n2 = O({(1 + p_n/(s_n2 + 1)) log p_n/n}^1/2), b_n2 = o(1), and $\min_{(i, j) \in S_{2}} ∣ σ_{i j}^{0} ∣ ∕ λ_{n 2} \to \infty$ as n → ∞.

Theorem 5 (Rate of convergence). Under regularity conditions (A), (B2), (C) and (D2), if (p_n + s_n2)(log p_n)^k/n = O(1) for some k > 1 and $(p_{n} + s_{n 2}) \log p_{n} ∕ n = O (λ_{n 2}^{2})$, then there exists a local minimizer $\hat{Σ}$ such that ${∥ \hat{Σ} - Σ_{0} ∥}_{F}^{2} = O_{P} ((p_{n} + s_{n 2}) \log p_{n} ∕ n)$. For the L₁-penalty, we only need $\log p_{n} ∕ n = O (λ_{n 2}^{2})$.

Like the case for precision matrix estimation, the asymptotic bias due to the L₁-penalty is of order s_n2a_n2 = s_n2λ_n2. To control this term, for the L₁-penalty, we require λ_n2 = O({(1 + p_n/(s_n2 + 1)) log p_n/n}^1/2).

Theorem 6 (Sparsistency). Under the conditions given in Theorem 5, for any local minimizer $\hat{Σ}$ of (1.3) satisfying ${∥ \hat{Σ} - Σ_{0} ∥}_{F}^{2} = O_{P} ((p_{n} + s_{n 2}) \log p_{n} ∕ n)$ and ${∥ \hat{Σ} - Σ_{0} ∥}^{2} = O_{P} (η_{n})$ for some η_n → 0, if $\log p_{n} ∕ n + η_{n} = O (λ_{n 2}^{2})$, then with probability tending to 1, ${\hat{σ}}_{i j} = 0$ for all $(i, j) \in S_{2}^{c}$.

For the L₁-penalized likelihood, control of bias for consistency together with sparsistency requires

\log p_{n} ∕ n + η_{n} = O (λ_{n 2}^{2}) = (1 + p_{n} ∕ (s_{n 2} + 1)) \log p_{n} ∕ n .

(3.1)

This is the same condition as (2.3), and hence in the worst case scenario where

{∥ \hat{Σ} - Σ_{0} ∥}^{2} = {∥ \hat{Σ} - Σ_{0} ∥}_{F}^{2} = O_{P} ((p_{n} + s_{n 2}) \log p_{n} ∕ n),

we need s_n2 = O(1). In the optimistic scenario where

{∥ \hat{Σ} - Σ_{0} ∥}^{2} = {∥ \hat{Σ} - Σ_{0} ∥}_{F}^{2} ∕ p_{n},

we need s_n2 = O(p_n). In both cases, the matrix Σ₀ has to be very sparse, but the former is much sparser.

On the other hand, if unbiased penalty functions like the SCAD or hard-thresholding penalty are used, we do not need an upper bound on λ_n2 since the bias a_n2 = 0 for sufficiently large n. This gives more flexibility on the order of s_n2.

Similar to Section 2, asymptotic normality for the estimators of the elements in S₂ can be proved under certain assumptions.

3.2 Properties of sparse correlation matrix estimation

The correlation matrix Γ₀ retains the same sparsity structure as Σ₀, with known diagonal elements. This special structure allows us to estimate Γ₀ more accurately. To take advantage of the known diagonal elements, the sparse correlation matrix Γ₀ is estimated by minimizing w.r.t. Γ = (γ_ij),

tr (Γ^{- 1} {\hat{Γ}}_{S}) + \log ∣ Γ ∣ + \sum_{i \neq j} p_{ν_{n 2}} (∣ γ_{i j} ∣),

(3.2)

where ν_n2 is a regularization parameter. After obtaining $\hat{Γ}$, Σ₀ can be estimated by $\tilde{Σ} = \hat{W} \hat{Γ} \hat{W}$.

To present the rates of convergence for $\hat{Γ}$ and $\tilde{Σ}$ , we define

c_{n 2} = \max_{(i, j) \in S_{2}} p_{ν_{n 2}}^{'} (∣ γ_{i j}^{0} ∣), d_{n 2} = \max_{(i, j) \in S_{2}} p_{ν_{n 2}}^{″} (∣ γ_{i j}^{0} ∣),

where $Γ_{0} = (γ_{i j}^{0})$ . We modify condition (D) to (D2′) with λ_n2 there replaced by υ_n2, and (B) to (B2′) as follows: (B2′) c_n2 = O({log p_n/n}^1/2), d_n2 = o(1), and $\min_{(i, j) \in S_{2}} ∣ γ_{i j}^{0} ∣ ∕ ν_{n 2} \to \infty$ as n → ∞.

Theorem 7 Under regularity conditions (A),(B2′),(C) and (D2′), if (p_n+s_n2)(log p_n)^k/n = O(1) for some k > 1 and $(s_{n 2} + 1) \log p_{n} ∕ n = o (ν_{n 2}^{2})$ , then there exists a local minimizer $\hat{Γ}$ for (3.2) such that

{∥ \hat{Γ} - Γ_{0} ∥}_{F}^{2} = O_{P} (s_{n 2} \log p_{n} ∕ n) .

In addition, for the operator norm, we have

{∥ \tilde{Σ} - Σ_{0} ∥}^{2} = O_{P} {(s_{n 2} + 1) \log p_{n} ∕ n} .

For the L₁-penalty, we only need $\log p_{n} ∕ n = O (ν_{n 2}^{2})$ .

The proof is sketched in Section 5. This theorem shows that the correlation matrix, like the inverse correlation matrix, can be estimated more accurately, since diagonal elements are known to be one.

Theorem 8 (Sparsistency). Under the conditions given in Theorem 7, for any local minimizer $\hat{Γ}$ of (3.2) satisfying ${∥ \hat{Γ} - Γ_{0} ∥}_{F}^{2} = O_{P} (s_{n 2} \log p_{n} ∕ n)$ and ${∥ \hat{Γ} - Γ_{0} ∥}^{2} = O_{P} (η_{n})$ for some η_n → 0 , if $\log p_{n} ∕ n + η_{n} = O (ν_{n 2}^{2})$ , then with probability tending to 1, ${\hat{γ}}_{i j} = 0$ for all $(i, j) \in S_{2}^{c}$ .

The proof is exactly the same as that of Theorem 6 in Section 5, and is omitted. For the L₁-penalized likelihood, controlling the bias together with sparsistency requires

\log p_{n} ∕ n + η_{n} = O (ν_{n 2}^{2}) = \log p_{n} ∕ n .

(3.3)

This is the same condition as (2.5), hence in the worst scenario where

{∥ \hat{Γ} - Γ_{0} ∥}^{2} = {∥ \hat{Γ} - Γ_{0} ∥}_{F}^{2} = O_{P} (s_{n 2} \log p_{n} ∕ n),

we need s_n2 = O(1). In the optimistic scenario where

{∥ \hat{Γ} - Γ_{0} ∥}^{2} = {∥ \hat{Γ} - Γ_{0} ∥}_{F}^{2} ∕ p_{n} = O_{P} (s_{n 2} ∕ p_{n} \cdot \log p_{n} ∕ n),

we need s_n2 = O(p_n).

The use of unbiased penalty functions like the SCAD or the hard-thresholding penalty, similar to results in the previous sections, does not impose an upper bound on the regularization parameter since bias c_n2 = 0 for sufficiently large n. This gives more flexibility to the order of s_n2 allowed.

4 Extension to sparse Cholesky decomposition

Pourahmadi (1999) proposed the modified Cholesky decomposition (MCD) which facilitates the sparse estimation of Ω through penalization. The idea is to represent zero-mean data y = (y₁, ⋯ , y_pn)^T using the autoregressive model:

y_{i} = \sum_{j = 1}^{i - 1} ϕ_{i j} y_{j} + ∊_{i}, and T Σ T^{T} = D,

(4.1)

where T is the unique unit lower triangular matrix with ones on its diagonal and (i, j)^th element being −ϕ_ij for j < i, and D is diagonal with i^th element being $σ_{i}^{2} = var (∊_{i})$ . The optimization problem is unconstrained (since the ϕ_ij’s are free variables), and the estimate for Ω is always positive-definite.
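The decomposition in (4.1) can be computed from an ordinary Cholesky factor. The following is a minimal sketch for a known Σ; the function name `modified_cholesky` and the AR(1)-type example covariance are illustrative assumptions, not part of the original text.

```python
import numpy as np

def modified_cholesky(Sigma):
    """Return (T, D) with T unit lower triangular and T @ Sigma @ T.T = D diagonal."""
    L = np.linalg.cholesky(Sigma)        # Sigma = L L^T with L lower triangular
    d_sqrt = np.diag(L)                  # innovation standard deviations sigma_i
    C = L / d_sqrt                       # scale columns: Sigma = C D C^T, C unit lower triangular
    T = np.linalg.inv(C)                 # inverse of a unit lower triangular matrix
    D = np.diag(d_sqrt ** 2)
    return T, D

# Hypothetical AR(1)-type covariance with sigma_ij = 0.6^{|i-j|}.
p = 5
idx = np.arange(p)
Sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])

T, D = modified_cholesky(Sigma)
Omega = T.T @ np.linalg.inv(D) @ T       # precision matrix implied by (T, D)
phi = -np.tril(T, k=-1)                  # autoregressive coefficients phi_ij for j < i
```

For this AR(1) example only the first subdiagonal of `phi` is nonzero, which shows how sparsity in T encodes a short autoregressive memory, and `Omega` is positive definite by construction for any unit lower triangular T and positive diagonal D.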

Huang et al. (2006) and Levina et al. (2008) both used the MCD for estimating Ω₀. The former maximized the log-likelihood (ML) over T and D simultaneously, while the latter suggested also a least square version (LS), with D being first set to the identity matrix and then minimizing over T to obtain $\hat{T}$ . The latter corresponds to the original Cholesky decomposition. The sparse Cholesky factor can be estimated through minimizing

(M L) : q_{3} (T, D) = tr (T^{T} D^{- 1} TS) + \log ∣ D ∣ + 2 \sum_{i < j} p_{λ_{n 3}} (∣ t_{i j} ∣) .

(4.2)

This is indeed the same as (1.1) with the substitution of Ω = T^TD⁻¹T and penalization parameter λ_n3. Noticing that (4.1) can be written as $Ty = ε$ , the least square version is to minimize $tr (ε ε^{T}) = tr (T^{T} T y y^{T})$ in the matrix notation. Aggregating the n observations and adding penalty functions, the least-square criterion is to minimize

(L S) : q_{4} (T) = tr (T^{T} TS) + 2 \sum_{i < j} p_{λ_{n 4}} (∣ t_{i j} ∣) .

(4.3)

In view of the results in Sections 2.5 and 3.2, we can also write the sample covariance matrix in (4.2) as $S = \hat{W} {\hat{Γ}}_{S} \hat{W}$ and then replace $D^{- 1 ∕ 2} T \hat{W}$ by T, resulting in the normalized (NL) version as follows:

(N L) : q_{5} (T) = tr (T^{T} T {\hat{Γ}}_{S}) - 2 \log ∣ T ∣ + 2 \sum_{i < j} p_{λ_{n 5}} (∣ t_{i j} ∣) .

(4.4)

We will also assume the y_i’s are i.i.d. N(0, Σ₀) as in the last section.
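The link between (4.2) and (1.1), and the residual interpretation of (4.3), can be checked numerically. This is a sketch with hypothetical T, D and data; the penalty terms are left out, so only the likelihood and least-squares parts are compared.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 50

# Hypothetical unit lower triangular T and positive diagonal D.
T = np.eye(p) + np.tril(rng.normal(scale=0.3, size=(p, p)), k=-1)
D = np.diag(rng.uniform(0.5, 2.0, size=p))
Y = rng.standard_normal((n, p))
S = Y.T @ Y / n

D_inv = np.diag(1.0 / np.diag(D))
Omega = T.T @ D_inv @ T

# Likelihood part of (ML): tr(T^T D^{-1} T S) + log|D|.
ml_loss = np.trace(T.T @ D_inv @ T @ S) + np.sum(np.log(np.diag(D)))
# Likelihood part of (1.1): tr(S Omega) - log|Omega|; note |Omega| = 1/|D| since |T| = 1.
sign, logdet_Omega = np.linalg.slogdet(Omega)
q1_loss = np.trace(S @ Omega) - logdet_Omega

# Least-squares part of (LS): tr(T^T T S) is the average squared residual norm.
ls_loss = np.trace(T.T @ T @ S)
residuals = Y @ T.T                      # row r holds epsilon_r = T y_r
mean_sq_resid = np.sum(residuals ** 2) / n
```

The two likelihood values agree because a unit lower triangular T has determinant one, so the log-determinant of Ω reduces to minus the log-determinant of D.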

4.1 Properties of sparse Cholesky factor estimation

Since all the T’s introduced in the three models above have the same sparsity structure, let S and s_n3 be the nonzero set and number of nonzeros associated with each T above. Define

a_{n 3} = \max_{(i, j) \in S} p_{λ_{n 3}}^{'} (∣ t_{i j}^{0} ∣), b_{n 3} = \max_{(i, j) \in S} p_{λ_{n 3}}^{″} (∣ t_{i j}^{0} ∣) .

For (ML), condition (D) is adapted to (D3) with λ_n1 there replaced by λ_n3. Condition (B) is modified as (B3) a_n3 = O({(1 + p_n/(s_n3 + 1)) log p_n/n}^1/2), b_n3 = o(1) and $\min_{(i, j) \in S} ∣ ϕ_{i j}^{0} ∣ ∕ λ_{n 3} \to \infty$ as n → ∞.

After obtaining $\hat{T}$ and $\hat{D}$ from minimizing (ML), we set $\hat{Ω} = {\hat{T}}^{T} {\hat{D}}^{- 1} \hat{T}$ .

Theorem 9 Under regularity conditions (A),(B3),(C),(D3), if (p_n + s_n3)(log p_n)^k/n = O(1) for some k > 1 and $(p_{n} + s_{n 3}) \log p_{n} ∕ n = O (λ_{n 3}^{2})$ , then there exists a local minimizer $\hat{T}$ and $\hat{D}$ for (ML) such that ${∥ \hat{T} - T_{0} ∥}_{F}^{2} = O_{P} (s_{n 3} \log p_{n} ∕ n)$ , ${∥ \hat{D} - D_{0} ∥}_{F}^{2} = O_{P} (p_{n} \log p_{n} ∕ n)$ and ${∥ \hat{Ω} - Ω_{0} ∥}_{F}^{2} = O_{P} {(p_{n} + s_{n 3}) \log p_{n} ∕ n}$ . For the L₁-penalty, we only need $\log p_{n} ∕ n = O (λ_{n 3}^{2})$ .

The proof is similar to those of Theorems 5 and 7 and is omitted. The Cholesky factor T has ones on its main diagonal without the need for estimation. Hence, its rate of convergence is faster than that of $\hat{Ω}$ .

Theorem 10 (Sparsistency). Under the conditions in Theorem 9, for any local minimizer $\hat{T}$ , $\hat{D}$ of (4.2) satisfying ${∥ \hat{T} - T_{0} ∥}_{F}^{2} = O_{P} (s_{n 3} \log p_{n} ∕ n)$ and ${∥ \hat{D} - D_{0} ∥}_{F}^{2} = O_{P} (p_{n} \log p_{n} ∕ n)$ , if $\log p_{n} ∕ n + η_{n} + ζ_{n} = O (λ_{n 3}^{2})$ , then sparsistency holds for $\hat{T}$ , provided that ${∥ \hat{T} - T_{0} ∥}^{2} = O_{P} (η_{n})$ and ${∥ \hat{D} - D_{0} ∥}^{2} = O_{P} (ζ_{n})$ , for some η_n, ζ_n → 0.

The proof is in Section 5. For the L₁-penalized likelihood, control of bias and sparsistency impose the following:

\log p_{n} ∕ n + η_{n} + ζ_{n} = O (λ_{n 3}^{2}) = (1 + p_{n} ∕ (s_{n 3} + 1)) \log p_{n} ∕ n .

(4.5)

The worst scenario corresponds to η_n = s_n3 log p_n/n and ζ_n = p_n log p_n/n, so that we need s_n3 = O(1). The optimistic scenario corresponds to η_n = s_n3/p_n · log p_n/n and ζ_n = log p_n/n, so that we need s_n3 = O(p_n).

On the other hand, such a restriction is not needed for unbiased penalties like the SCAD or hard-thresholding penalty, giving more flexibility on the order of s_n3.

4.2 Properties of sparse normalized Cholesky factor estimation

We now turn to analyzing the normalized penalized likelihood (4.4). With T = (t_ij) in (NL) which is lower triangular, define

a_{n 5} = \max_{(i, j) \in S} p_{λ_{n 5}}^{'} (∣ t_{i j}^{0} ∣), b_{n 5} = \max_{(i, j) \in S} p_{λ_{n 5}}^{″} (∣ t_{i j}^{0} ∣) .

Condition (D) is now changed to (D5) with λ_n1 there replaced by λ_n5. Condition (B) is now substituted by (B5) $a_{n 5}^{2} = O (\log p_{n} ∕ n)$ , b_n5 = o(1), $\min_{(i, j) \in S} ∣ t_{i j}^{0} ∣ ∕ λ_{n 5} \to \infty$ as n → ∞.

Theorem 11 (Rate of convergence) Under regularity conditions (A),(B5),(C) and (D5), if s_n3(log p_n)^k/n = O(1) for some k > 1 and $(s_{n 3} + 1) \log p_{n} ∕ n = o (λ_{n 5}^{2})$ , then there exists a local minimizer $\hat{T}$ for (NL) such that ${∥ \hat{T} - T_{0} ∥}_{F}^{2} = O_{P} (s_{n 3} \log p_{n} ∕ n)$ and rate of convergence in the Frobenius norm

{∥ \hat{Ω} - Ω_{0} ∥}_{F}^{2} = O_{P} {(p_{n} + s_{n 3}) \log p_{n} ∕ n},

and in the operator norm, it is improved to

{∥ \hat{Ω} - Ω_{0} ∥}^{2} = O_{P} {(s_{n 3} + 1) \log p_{n} ∕ n} .

For the L₁-penalty, we only need $\log p_{n} ∕ n = O (λ_{n 5}^{2})$ .

The proof is similar to that of Theorems 5 and 7 and is omitted. In this theorem, like Lemma 3, we can have p_n so that p_n/n goes to a constant less than 1. It is evident that normalizing with $\hat{W}$ results in an improvement in the rate of convergence in operator norm.

Theorem 12 (Sparsistency). Under the conditions given in Theorem 11, for any local minimizer $\hat{T}$ of (4.4) satisfying ${∥ \hat{T} - T_{0} ∥}_{F}^{2} = O_{P} (s_{n 3} \log p_{n} ∕ n)$ if $\log p_{n} ∕ n + η_{n} = O (λ_{n 5}^{2})$ , then sparsistency holds for $\hat{T}$ , provided that ${∥ \hat{T} - T_{0} ∥}^{2} = O (η_{n})$ for some η_n → 0.

The proof is omitted since it is exactly the same as that of Theorem 10. The above results apply also to the L₁-penalized estimator. For simultaneous sparsistency and optimal rate of convergence using the L₁-penalty, the biases inherent in it induce the restriction s_n3 = O(1) in the worst scenario where $η_{n} = s_{n 3} \log p_{n} ∕ n$ , and s_n3 = O(p_n) in the optimistic scenario where $η_{n} = s_{n 3} ∕ p_{n} \cdot \log p_{n} ∕ n$ . This restriction does not apply to the SCAD and other asymptotically unbiased penalty functions.

5 Proofs

We first prove three lemmas. The first one concerns inequalities involving the operator and the Frobenius norms. The other two concern order estimation for elements in a matrix of the form A(S − Σ₀)B, which are useful in proving results concerning sparsistency.

Lemma 1 Let A and B be real matrices such that the product AB is defined. Then, defining ${∥ A ∥}_{\min}^{2} = λ_{\min} (A^{T} A)$ , we have

{∥ A ∥}_{\min} {∥ B ∥}_{F} \leq {∥ AB ∥}_{F} \leq ∥ A ∥ {∥ B ∥}_{F} .

(5.1)

In particular, if A = (a_ij), then |a_ij| ≤ ∥A∥ for each i, j.

Proof of Lemma 1. Write B = (b₁, ⋯ , b_q), where b_i is the i-th column vector in B. Then

\begin{matrix} {∥ AB ∥}_{F}^{2} = tr (B^{T} A^{T} AB) = \sum_{i = 1}^{q} b_{i}^{T} A^{T} A b_{i} \leq & λ_{\max} (A^{T} A) \sum_{i = 1}^{q} {∥ b_{i} ∥}^{2} \\ = & {∥ A ∥}^{2} {∥ B ∥}_{F}^{2} . \end{matrix}

Similarly,

\begin{matrix} {∥ AB ∥}_{F}^{2} = \sum_{i = 1}^{q} b_{i}^{T} A^{T} A b_{i} \geq & λ_{\min} (A^{T} A) \sum_{i = 1}^{q} {∥ b_{i} ∥}^{2} \\ = & {∥ A ∥}_{\min}^{2} {∥ B ∥}_{F}^{2}, \end{matrix}

which completes the proof of (5.1). To prove |a_ij| ≤ ∥A∥, note that $a_{i j} = e_{i}^{T} A e_{j}$ , where e_i is the unit column vector with one at the i-th position, and zero elsewhere. Hence, using (5.1),

∣ a_{i j} ∣ = ∣ e_{i}^{T} A e_{j} ∣ \leq {∥ A e_{j} ∥}_{F} \leq ∥ A ∥ \cdot {∥ e_{j} ∥}_{F} = ∥ A ∥,

and this completes the proof of the lemma. □
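Lemma 1 is easy to sanity-check numerically. The sketch below uses hypothetical random matrices; it illustrates the inequalities for one draw and is not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((4, 5))

op_norm_A = np.linalg.norm(A, 2)                       # ||A||: largest singular value
min_norm_A = np.sqrt(np.linalg.eigvalsh(A.T @ A)[0])   # ||A||_min: sqrt of lambda_min(A^T A)
fro_B = np.linalg.norm(B, "fro")
fro_AB = np.linalg.norm(A @ B, "fro")

# Two-sided bound (5.1) on the Frobenius norm of the product.
lower, upper = min_norm_A * fro_B, op_norm_A * fro_B

# Entrywise bound |a_ij| <= ||A||.
max_entry_A = np.max(np.abs(A))
```

The entrywise bound is the form used later in the proofs, where an operator-norm rate is converted into a uniform rate over all matrix entries.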

Lemma 2 Let S be a sample covariance matrix of a random sample {y_i}_1≤i≤n, with E(y_i) = 0 and var(y_i) = Σ₀. Let y_i = (y_i1, ⋯ , y_ip_n) with y_ij ~ F_j, where F_j is the c.d.f. of y_ij , and let G_j be the c.d.f. of $y_{i j}^{2}$ , with

\max_{1 \leq j \leq p_{n}} \int_{0}^{\infty} \exp (λ t) d G_{j} (t) < \infty, 0 < ∣ λ ∣ < λ_{0}

(5.2)

for some λ₀ > 0. Assume log p_n/n = o(1), and that Σ₀ has eigenvalues uniformly bounded above as n → ∞. Then for constant matrices A and B with ∥A∥, ∥B∥ = O(1), we have max_i,j |(A(S − Σ₀)B)_ij | = O_P({log p_n/n}^1/2).

Remark: The conditions on the y_ij’s above are the same as those used in Bickel and Levina (2008b) for relaxing the normality assumption.

Proof of Lemma 2. Let x_i = Ay_i and $w_{i} = B^{T} y_{i}$ . Define $u_{i} = {(x_{i}^{T}, w_{i}^{T})}^{T}$ , with covariance matrix

Σ_{u} = var (u_{i}) = (\begin{matrix} A Σ_{0} A^{T} & A Σ_{0} B \\ B^{T} Σ_{0} A^{T} & B^{T} Σ_{0} B \end{matrix}) .

Since ∥(A^T, B)^T∥ ≤ (∥A∥² + ∥B∥²)^1/2 = O(1) and ∥Σ₀∥ = O(1) uniformly, we have ∥Σ_u∥ = O(1) uniformly. Then, with $S_{u} = n^{- 1} \sum_{i = 1}^{n} u_{i} u_{i}^{T}$ , which is the sample covariance matrix for the random sample {u_i}_1≤i≤n, by Lemma A.3 of Bickel and Levina (2008b), which holds under the assumption for the y_ij’s and log p_n/n = o(1), we have

\max_{i, j} ∣ {(S_{u} - Σ_{u})}_{i j} ∣ = O_{P} ({\log p_{n} ∕ n}^{1 ∕ 2}) .

In particular, it means that

\max_{i, j} ∣ {(A (S - Σ_{0}) B)}_{i j} ∣ = \max_{i, j} ∣ {(n^{- 1} \sum_{r = 1}^{n} x_{r} w_{r}^{T} - A Σ_{0} B)}_{i j} ∣ = O_{P} ({\log p_{n} ∕ n}^{1 ∕ 2}),

which completes the proof of the lemma. □
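The rate in Lemma 2 can be illustrated by simulation in the simplest setting. The sketch below assumes Gaussian data with Σ₀ = I and A = B = I, and the (n, p) pairs are hypothetical; it shows only that the scaled maximum deviation stays of order one in this example and makes no claim about the constant.

```python
import numpy as np

def max_entry_deviation(n, p, rng):
    """max_ij |(S - Sigma_0)_ij| for N(0, I_p) data, i.e. A = B = I and Sigma_0 = I."""
    Y = rng.standard_normal((n, p))
    S = Y.T @ Y / n
    return np.max(np.abs(S - np.eye(p)))

rng = np.random.default_rng(3)
settings = [(200, 20), (400, 50), (800, 100), (1600, 200)]  # hypothetical (n, p) pairs

ratios = []
for n, p in settings:
    dev = max_entry_deviation(n, p, rng)
    # Scale by the claimed rate {log p_n / n}^{1/2}.
    ratios.append(dev / np.sqrt(np.log(p) / n))
ratios = np.array(ratios)
```

If the lemma's rate is sharp, the entries of `ratios` should neither grow nor vanish as p and n increase together along this sequence.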

Lemma 3 Let S be a sample covariance matrix of a random sample {y_i}_1≤i≤n with y_i ~ N(0, Σ₀). Assume p_n/n → y ∈ [0, 1), Σ₀ has eigenvalues uniformly bounded as n → ∞, and A = A₀ + Δ₁, B = B₀ + Δ₂, where the constant matrices A₀ and B₀ satisfy ∥A₀∥, ∥B₀∥ = O(1) and ∥Δ₁∥, ∥Δ₂∥ = o_P(1). Then we still have max_i,j|(A(S−Σ₀)B)_ij| = O_P({log p_n/n}^1/2).

Proof of Lemma 3. Consider

A (S - Σ_{0}) B = K_{1} + K_{2} + K_{3} + K_{4},

(5.3)

where K₁ = A₀(S − Σ₀)B₀, K₂ = Δ₁(S − Σ₀)B₀, K₃ = A₀(S − Σ₀)Δ₂ and K₄ = Δ₁(S − Σ₀)Δ₂. Now max_i,j |(K₁)_ij | = O_P({log p_n/n}^1/2) by Lemma 2. Consider K₂. Suppose the maximum element of the matrix is at the (i, j)-th position. Consider ((S − Σ₀)B₀)_ij, the (i, j)-th element of (S − Σ₀)B₀. Since each element in S − Σ₀ has a rate O_P(n^−1/2), the i-th row of S − Σ₀ has a norm of O_P({p_n/n}^1/2). Also, the j-th column of B₀ has ∥B₀e_j∥ ≤ ∥B₀∥ = O(1). Hence, ((S − Σ₀)B₀)_ij = O_P({p_n/n}^1/2).

Hence, we can find c_n = o({n/p_n}^1/2) such that each element in $c_{n} B_{0}^{T} (S - Σ_{0})$ has an order larger than that in Δ₁, since ∥Δ₁∥ = o_P(1) implies that each element in Δ₁ is also o_P(1) by Lemma 1.

Then a suitable choice of c_n leads to

\max_{i, j} ∣ {(Δ_{1} (S - Σ_{0}) B_{0})}_{i j} ∣ \leq c_{n} \max_{k} ∣ {(B_{0}^{T} {(S - Σ_{0})}^{2} B_{0})}_{k k} ∣ .

(5.4)

At the same time, Theorem 5.10 in Bai and Silverstein (2006) implies that, for y_i ~ N(0, Σ₀) and p_n/n → y ∈ (0, 1), with probability one,

\begin{matrix} - 2 \sqrt{y} - y \leq & \underset{n \to \infty}{\lim \inf} λ_{\min} (Σ_{0}^{- 1 ∕ 2} S Σ_{0}^{- 1 ∕ 2} - I) \\ \leq & \underset{n \to \infty}{\lim \sup} λ_{\max} (Σ_{0}^{- 1 ∕ 2} S Σ_{0}^{- 1 ∕ 2} - I) \leq 2 \sqrt{y} + y . \end{matrix}

Hence, if we have p_n/n = o(1), we must have $∥ Σ_{0}^{- 1 ∕ 2} S Σ_{0}^{- 1 ∕ 2} - I ∥ = o_{P} (1)$ , or it will contradict the above. It means that ∥S−Σ₀∥ = o_P(1) since Σ₀ has eigenvalues uniformly bounded. Or, if p_n/n → y ∈ (0, 1), then we have ∥S − Σ₀∥ = O_P(1) by the above.

Since S − Σ₀ is symmetric, we can find a rotation matrix Q (i.e. Q^TQ = QQ^T = I) so that

S - Σ_{0} = Q Λ Q^{T},

where Λ is a diagonal matrix with real entries. Then we are free to control c_n again so as to satisfy further that c_n∥Λ∥² = o_P(∥Λ∥), since ∥Λ∥ = ∥S − Σ₀∥ = O_P(1) at most. Hence,

\begin{matrix} c_{n} \max_{k} ∣ {(B_{0}^{T} {(S - Σ_{0})}^{2} B_{0})}_{k k} ∣ = & \max_{k} ∣ {(B_{0}^{T} Q c_{n} Λ^{2} Q^{T} B_{0})}_{k k} ∣ \\ \leq & \max_{k} ∣ {(B_{0}^{T} Q Λ Q^{T} B_{0})}_{k k} ∣ \\ = & \max_{k} ∣ {(B_{0}^{T} (S - Σ_{0}) B_{0})}_{k k} ∣ = O_{P} ({\log p_{n} ∕ n}^{1 ∕ 2}), \end{matrix}

where the last line used the previous proof for constant matrix B₀. Hence, combining this with (5.4), we have max_i,j |(K₂)_ij| = O_P({log p_n/n}^1/2). Similar arguments go for K₃ and K₄. □

Proof of Theorem 1. The main idea of the proof is inspired by Fan and Li (2001) and Rothman et al. (2008). Let U be a symmetric matrix of size p_n, D_U be its diagonal matrix and R_U = U − D_U be its off-diagonal matrix. Set Δ_U = α_nR_U + β_nD_U. We would like to show that, for α_n = (s_n1 log p_n/n)^1/2 and β_n = (p_n log p_n/n)^1/2, and for a set $A$ defined as $A = \{ U : {∥ Δ_{U} ∥}_{F}^{2} = C_{1}^{2} α_{n}^{2} + C_{2}^{2} β_{n}^{2} \}$ ,

P (\inf_{U \in A} q_{1} (Ω_{0} + Δ_{U}) > q_{1} (Ω_{0})) \to 1,

for sufficiently large constants C₁ and C₂. This implies that there is a local minimizer in $\{ Ω_{0} + Δ_{U} : {∥ Δ_{U} ∥}_{F}^{2} \leq C_{1}^{2} α_{n}^{2} + C_{2}^{2} β_{n}^{2} \}$ such that ${∥ \hat{Ω} - Ω_{0} ∥}_{F} = O_{P} (α_{n} + β_{n})$ for sufficiently large n, since Ω₀ + Δ_U is positive definite. This is shown by noting that

λ_{\min} (Ω_{0} + Δ_{U}) \geq λ_{\min} (Ω_{0}) + λ_{\min} (Δ_{U}) \geq λ_{\min} (Ω_{0}) - {∥ Δ_{U} ∥}_{F} > 0,

since Ω₀ has eigenvalues uniformly bounded away from 0 and ∞ by condition (A), and ∥Δ_U∥_F = O(α_n + β_n) = o(1).
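The eigenvalue chain used for positive definiteness can be verified on an example. This is a minimal sketch with a hypothetical Ω₀ and a symmetric perturbation of Frobenius norm 0.1; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 6
M = rng.standard_normal((p, p))
Omega0 = M @ M.T / p + np.eye(p)              # hypothetical positive definite Omega_0

U = rng.standard_normal((p, p))
U = (U + U.T) / 2                             # symmetric direction
Delta = 0.1 * U / np.linalg.norm(U, "fro")    # perturbation with ||Delta||_F = 0.1

# eigvalsh returns eigenvalues in ascending order, so [0] is lambda_min.
lam_min_sum = np.linalg.eigvalsh(Omega0 + Delta)[0]
lam_min_Omega0 = np.linalg.eigvalsh(Omega0)[0]
lam_min_Delta = np.linalg.eigvalsh(Delta)[0]
fro_Delta = np.linalg.norm(Delta, "fro")
```

The first inequality is Weyl's inequality for symmetric matrices, and the second holds because every eigenvalue of Δ_U is bounded in absolute value by its Frobenius norm.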

Consider, for Ω = Ω₀ + Δ_U, the difference

q_{1} (Ω) - q_{1} (Ω_{0}) = I_{1} + I_{2} + I_{3},

where

I_{1} = tr (S Ω) - \log ∣ Ω ∣ - (tr (S Ω_{0}) - \log ∣ Ω_{0} ∣),

I_{2} = \sum_{(i, j) \in S_{1}^{c}} (p_{λ_{n 1}} (∣ ω_{i j} ∣) - p_{λ_{n 1}} (∣ ω_{i j}^{0} ∣)),

I_{3} = \sum_{(i, j) \in S_{1}, i \neq j} (p_{λ_{n 1}} (∣ ω_{i j} ∣) - p_{λ_{n 1}} (∣ ω_{i j}^{0} ∣)) .

It is sufficient to show that the difference is positive asymptotically with probability tending to 1. Using Taylor’s expansion with the integral remainder, we have I₁ = K₁+K₂, where

\begin{matrix} K_{1} = tr ((S - Σ_{0}) Δ_{U}), \\ K_{2} = vec {(Δ_{U})}^{T} {\int_{0}^{1} g (v, Ω_{v}) (1 - v) d v} vec (Δ_{U}), \end{matrix}

(5.5)

with the definitions Ω_v = Ω₀ + vΔ_U, and $g (v, Ω_{v}) = Ω_{v}^{- 1} \otimes Ω_{v}^{- 1}$ . Now,

\begin{matrix} K_{2} \geq & \int_{0}^{1} (1 - v) \min_{0 \leq v \leq 1} λ_{\min} (Ω_{v}^{- 1} \otimes Ω_{v}^{- 1}) d v \cdot {∥ vec (Δ_{U}) ∥}^{2} \\ = & {∥ vec (Δ_{U}) ∥}^{2} ∕ 2 \cdot \min_{0 \leq v \leq 1} λ_{\max}^{- 2} (Ω_{v}) \\ \geq & {∥ vec (Δ_{U}) ∥}^{2} ∕ 2 \cdot {(∥ Ω_{0} ∥ + ∥ Δ_{U} ∥)}^{- 2} \\ \geq & (C_{1}^{2} α_{n}^{2} + C_{2}^{2} β_{n}^{2}) ∕ 2 \cdot {(τ_{1}^{- 1} + o (1))}^{- 2}, \end{matrix}

where we used ∥Δ_U∥ ≤ C₁α_n + C₂β_n = O((log p_n)^((1−k)/2)) = o(1) by our assumption.
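The Taylor expansion with integral remainder, I₁ = K₁ + K₂, and the eigenvalue identity for the Kronecker product can both be checked numerically. The sketch below is an illustration under assumed inputs: Ω₀, the data and the perturbation are hypothetical, and the integral in K₂ is approximated with 20-point Gauss-Legendre quadrature, a choice made here for convenience.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 4, 60
M = rng.standard_normal((p, p))
Omega0 = M @ M.T / p + np.eye(p)            # hypothetical true precision matrix
Sigma0 = np.linalg.inv(Omega0)
Y = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
S = Y.T @ Y / n

U = rng.standard_normal((p, p))
Delta = 0.05 * (U + U.T)                    # small symmetric perturbation Delta_U

def loss(Omega):
    """Likelihood part of q_1: tr(S Omega) - log|Omega|."""
    return np.trace(S @ Omega) - np.linalg.slogdet(Omega)[1]

I1 = loss(Omega0 + Delta) - loss(Omega0)
K1 = np.trace((S - Sigma0) @ Delta)

# K2 = vec(Delta)^T { int_0^1 g(v, Omega_v) (1 - v) dv } vec(Delta).
nodes, weights = np.polynomial.legendre.leggauss(20)
v_nodes, v_weights = 0.5 * (nodes + 1.0), 0.5 * weights   # map [-1, 1] to [0, 1]
vec_Delta = Delta.reshape(-1)
K2 = 0.0
for v, w in zip(v_nodes, v_weights):
    Sigma_v = np.linalg.inv(Omega0 + v * Delta)
    g = np.kron(Sigma_v, Sigma_v)           # g(v, Omega_v) = Omega_v^{-1} (x) Omega_v^{-1}
    K2 += w * (1.0 - v) * (vec_Delta @ g @ vec_Delta)

# lambda_min of the Kronecker product equals lambda_max(Omega)^{-2}, checked at v = 1.
Omega1 = Omega0 + Delta
Sigma1 = np.linalg.inv(Omega1)
kron_min = np.linalg.eigvalsh(np.kron(Sigma1, Sigma1))[0]
lam_max_inv_sq = np.linalg.eigvalsh(Omega1)[-1] ** -2
```

Since K₂ is a positive quadratic form in Δ_U, the remaining terms only need to be shown to be of smaller order, which is the strategy of the proof.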

Consider K₁. It is clear that |K₁| ≤ L₁ + L₂, where

L_{1} = ∣ \sum_{(i, j) \in S_{1}} {(S - Σ_{0})}_{i j} {(Δ_{U})}_{i j} ∣,

L_{2} = ∣ \sum_{(i, j) \in S_{1}^{c}} {(S - Σ_{0})}_{i j} {(Δ_{U})}_{i j} ∣ .

Using Lemmas 1 and 2, we have

\begin{matrix} L_{1} \leq & {(s_{n 1} + p_{n})}^{1 ∕ 2} \max_{i, j} ∣ {(S - Σ_{0})}_{i j} ∣ \cdot {∥ Δ_{U} ∥}_{F} \\ \leq & O_{P} (α_{n} + β_{n}) \cdot {∥ Δ_{U} ∥}_{F} \\ = & O_{P} (C_{1} α_{n}^{2} + C_{2} β_{n}^{2}), \end{matrix}

which is dominated by K₂ when C₁ and C₂ are sufficiently large.

Now, consider I₂ − L₂ for penalties other than L₁. Since ${∥ Δ_{U} ∥}_{F}^{2} = C_{1}^{2} α_{n}^{2} + C_{2}^{2} β_{n}^{2}$ on $A$ , we have that |ω_ij| = O(C₁α_n + C₂β_n) = o(1) for all $(i, j) \in S_{1}^{c}$ . Also, note that the condition on λ_n1 ensures that, for $(i, j) \in S_{1}^{c}$ , |ω_ij| = O(α_n + β_n) = o(λ_n1). Hence, by condition (C), for all $(i, j) \in S_{1}^{c}$ , we can find a constant k₁ > 0 such that

p_{λ_{n 1}} (∣ ω_{i j} ∣) \geq λ_{n 1} k_{1} ∣ ω_{i j} ∣ .

This implies that

I_{2} = \sum_{(i, j) \in S_{1}^{c}} p_{λ_{n 1}} (∣ ω_{i j} ∣) \geq λ_{n 1} k_{1} \sum_{(i, j) \in S_{1}^{c}} ∣ ω_{i j} ∣ .

Hence,

\begin{matrix} I_{2} - L_{2} \geq & \sum_{(i, j) \in S_{1}^{c}} \{ λ_{n 1} k_{1} ∣ ω_{i j} ∣ - ∣ {(S - Σ_{0})}_{i j} ∣ \cdot ∣ ω_{i j} ∣ \} \\ \geq & \sum_{(i, j) \in S_{1}^{c}} [λ_{n 1} k_{1} - O_{P} ({\log p_{n} ∕ n}^{1 ∕ 2})] \cdot ∣ ω_{i j} ∣ \\ = & λ_{n 1} \sum_{(i, j) \in S_{1}^{c}} [k_{1} - O_{P} (λ_{n 1}^{- 1} {\log p_{n} ∕ n}^{1 ∕ 2})] \cdot ∣ ω_{i j} ∣ . \end{matrix}

With the assumption that $(p_{n} + s_{n 1}) \log p_{n} ∕ n = O (λ_{n 1}^{2})$ , we see from the above that I₂−L₂ ≥ 0 since $O_{P} (λ_{n 1}^{- 1} {\log p_{n} ∕ n}^{1 ∕ 2}) = o_{P} (1)$ , using $\log p_{n} ∕ n = o ((p_{n} + s_{n 1}) \log p_{n} ∕ n) = o (λ_{n 1}^{2})$ .

For the L₁-penalty, since we have max_i≠j |S − Σ₀| = O_P((log p_n/n)^1/2) by Lemma 2, we can find a positive W = O_P(1) such that

\max_{i \neq j} ∣ S - Σ_{0} ∣ = W {(\log p_{n} ∕ n)}^{1 ∕ 2} .

Then we can set λ_n1 = 2W(log p_n/n)^1/2 or one with order greater than (log p_n/n)^1/2, and the above arguments are still valid, so that I₂ − L₂ > 0.

Now, with L₁ dominated by K₂ and I₂ − L₂ ≥ 0, the proof is complete if we can show that I₃ is also dominated by K₂, since we have proved that K₂ > 0. Using Taylor’s expansion, we can arrive at

∣ I_{3} ∣ \leq \min {(C_{1}, C_{2})}^{- 1} \cdot O (1) \cdot (C_{1}^{2} α_{n}^{2} + C_{2}^{2} β_{n}^{2}) + o (1) \cdot (C_{1}^{2} α_{n}^{2} + C_{2}^{2} β_{n}^{2}),

where o(1) and O(1) are the terms independent of C₁ and C₂. By condition (B), we have

∣ I_{3} ∣ = C \cdot O (α_{n}^{2} + β_{n}^{2}) + C^{2} \cdot o (α_{n}^{2} + β_{n}^{2}),

which is dominated by K₂ with large enough constants C₁ and C₂. This completes the proof of the theorem. □

Proof of Theorem 2. For Ω a minimizer of (1.1), the derivative for q₁(Ω) w.r.t. ω_ij for $(i, j) \in S_{1}^{c}$ is

\frac{\partial q_{1} (Ω)}{\partial ω_{i j}} = 2 (s_{i j} - σ_{i j} + p_{λ_{n 1}}^{'} (∣ ω_{i j} ∣) sgn (ω_{i j})),

where sgn(a) denotes the sign of a. If we can show that the sign of ∂q₁(Ω)/∂ω_ij depends on sgn(ω_ij) only with probability tending to 1, the optimum will be at 0, so that ${\hat{ω}}_{i j} = 0$ for all $(i, j) \in S_{1}^{c}$ with probability tending to 1. We need to estimate the order of s_ij − σ_ij independent of i and j.

Decompose s_ij − σ_ij = I₁ + I₂, where

I_{1} = s_{i j} - σ_{i j}^{0}, I_{2} = σ_{i j}^{0} - σ_{i j} .

By Lemma 2 or Lemma A.3 of Bickel and Levina (2008b), it follows that max_i,j |I₁| = O_P({log p_n/n}^1/2). It remains to estimate the order of I₂.

By Lemma 1, $∣ σ_{i j} - σ_{i j}^{0} ∣ \leq ∥ Σ - Σ_{0} ∥$ , which has order

\begin{matrix} ∥ Σ - Σ_{0} ∥ = & ∥ Σ (Ω - Ω_{0}) Σ_{0} ∥ \\ \leq & ∥ Σ ∥ \cdot ∥ Ω - Ω_{0} ∥ \cdot ∥ Σ_{0} ∥ \\ = & O (∥ Ω - Ω_{0} ∥), \end{matrix}

where we used condition (A) to get ∥Σ₀∥ = O(1), and using η_n → 0 so that $λ_{\min} (Ω - Ω_{0}) = o (1)$ for $∥ Ω - Ω_{0} ∥ = O (η_{n}^{1 ∕ 2})$ ,

\begin{matrix} ∥ Σ ∥ = λ_{\min}^{- 1} (Ω) \leq & {(λ_{\min} (Ω_{0}) + λ_{\min} (Ω - Ω_{0}))}^{- 1} \\ = & {(O (1) + o (1))}^{- 1} = O (1) . \end{matrix}

Hence, $∥ Ω - Ω_{0} ∥ = O (η_{n}^{1 ∕ 2})$ implies $∣ I_{2} ∣ = O (η_{n}^{1 ∕ 2})$ .
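The step from a bound on ∥Ω − Ω₀∥ to an entrywise bound on Σ − Σ₀ rests on an exact matrix identity, which can be checked directly. The sketch below uses a hypothetical Ω₀ and a nearby symmetric Ω; note that the identity carries a minus sign, which does not affect the norm.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 5
M = rng.standard_normal((p, p))
Omega0 = M @ M.T / p + np.eye(p)             # hypothetical true precision matrix
E = rng.standard_normal((p, p))
Omega = Omega0 + 0.05 * (E + E.T)            # nearby symmetric candidate Omega

Sigma0, Sigma = np.linalg.inv(Omega0), np.linalg.inv(Omega)

def op_norm(X):
    """Operator (spectral) norm."""
    return np.linalg.norm(X, 2)

# Identity: Sigma - Sigma_0 = -Sigma (Omega - Omega_0) Sigma_0.
identity_gap = op_norm((Sigma - Sigma0) + Sigma @ (Omega - Omega0) @ Sigma0)

# Submultiplicative bound and the entrywise bound from Lemma 1.
lhs = op_norm(Sigma - Sigma0)
rhs = op_norm(Sigma) * op_norm(Omega - Omega0) * op_norm(Sigma0)
max_entry = np.max(np.abs(Sigma - Sigma0))
```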

Combining the last two results yields that

\begin{matrix} \max_{i, j} ∣ s_{i j} - σ_{i j} ∣ = & O_{P} (∣ s_{i j} - σ_{i j}^{0} ∣ + η_{n}^{1 ∕ 2}) \\ = & O_{P} ({\log p_{n} ∕ n}^{1 ∕ 2} + η_{n}^{1 ∕ 2}) . \end{matrix}

By conditions (C) and (D), we have

p_{λ_{n 1}}^{'} (∣ ω_{i j} ∣) = C_{3} λ_{n 1}

for ω_ij in a small neighborhood of 0 (excluding 0 itself) and some positive constant C₃. Hence, if ω_ij lies in a small neighborhood of 0, we need to have $\log p_{n} ∕ n + η_{n} = O (λ_{n 1}^{2})$ in order for the sign of ∂q₁(Ω)/∂ω_ij to depend on sgn(ω_ij) only with probability tending to 1. The proof of the theorem is completed. □

Proof of Theorem 3. Because of the similarity between equations (2.4) and (1.1), the proof of the Frobenius norm result is nearly identical to that of Theorem 1, except that we now set Δ_U = α_nU. For the operator norm result, we refer readers to the proof of Theorem 2 of Rothman et al. (2008). □

Proof of Theorem 5. The proof is similar to that of Theorem 1. We only sketch briefly the proof, pointing out the important differences.

Let α_n = (s_n2 log p_n/n)^1/2 and β_n = (p_n log p_n/n)^1/2, and define $A = \{ U : {∥ Δ_{U} ∥}_{F}^{2} = C_{1}^{2} α_{n}^{2} + C_{2}^{2} β_{n}^{2} \}$ . We want to show

P (\inf_{U \in A} q_{2} (Σ_{0} + Δ_{U}) > q_{2} (Σ_{0})) \to 1,

for sufficiently large constants C₁ and C₂.

For Σ = Σ₀ + Δ_U, the difference

q_{2} (Σ) - q_{2} (Σ_{0}) = I_{1} + I_{2} + I_{3},

where

I_{1} = tr (S Ω) + \log ∣ Σ ∣ - (tr (S Ω_{0}) + \log ∣ Σ_{0} ∣),

I_{2} = \sum_{(i, j) \in S_{2}^{c}} (p_{λ_{n 2}} (∣ σ_{i j} ∣) - p_{λ_{n 2}} (∣ σ_{i j}^{0} ∣)),

I_{3} = \sum_{(i, j) \in S_{2}, i \neq j} (p_{λ_{n 2}} (∣ σ_{i j} ∣) - p_{λ_{n 2}} (∣ σ_{i j}^{0} ∣)),

with I₁ = K₁ + K₂, where

\begin{matrix} K_{1} = - tr ((S - Σ_{0}) Ω_{0} Δ_{U} Ω_{0}) = - tr ((S_{Ω_{0}} - Ω_{0}) Δ_{U}), \\ K_{2} = vec {(Δ_{U})}^{T} {\int_{0}^{1} g (v, Σ_{v}) (1 - v) d v} vec (Δ_{U}), \end{matrix}

(5.6)

and Σ_v = Σ₀ + vΔ_U, S_Ω₀ is the sample covariance matrix of a random sample {x_i}_1≤i≤n having x_i ~ N(0, Ω₀). Also,

g (v, Σ_{v}) = Σ_{v}^{- 1} \otimes Σ_{v}^{- 1} S Σ_{v}^{- 1} + Σ_{v}^{- 1} S Σ_{v}^{- 1} \otimes Σ_{v}^{- 1} - Σ_{v}^{- 1} \otimes Σ_{v}^{- 1} .

(5.7)

The treatment of K₂ is different from that in Theorem 1. By condition (A), and (p_n + s_n2)(log p_n)^k/n = O(1) for some k > 1, we have

∥ v Δ_{U} Ω_{0} ∥ \leq ∥ Δ_{U} ∥ ∥ Ω_{0} ∥ \leq τ_{1}^{- 1} (C_{1} α_{n} + C_{2} β_{n}) = O ({(\log p_{n})}^{(1 - k) ∕ 2}) = o (1) .

Thus, we can use the Neumann series expansion to arrive at

Σ_{v}^{- 1} = Ω_{0} {(I + v Δ_{U} Ω_{0})}^{- 1} = Ω_{0} (I - v Δ_{U} Ω_{0} + o (1)),

where the little o (or o_P, O or O_P in any matrix expansions in the remainder of this proof) represents a function of the L₂ norm of the residual matrix in the expansion. That is, $Σ_{v}^{- 1} = Ω_{0} + O_{P} (α_{n} + β_{n})$ , and $∥ Σ_{v}^{- 1} ∥ = τ_{1}^{- 1} + O_{P} (α_{n} + β_{n})$ . With S_I defined as the sample covariance matrix formed from a random sample {x_i}_1≤i≤n having x_i ~ N(0, I),

∥ S - Σ_{0} ∥ = O_{P} (∥ S_{I} - I ∥) = o_{P} (1)

(see arguments in Lemma 3). These entail

\begin{matrix} S Σ_{v}^{- 1} = & (S - Σ_{0}) Σ_{v}^{- 1} + Σ_{0} Σ_{v}^{- 1} \\ = & o_{P} (1) + I + O_{P} (α_{n} + β_{n}) \\ = & I + o_{P} (1) . \end{matrix}

Combining these results, we have

g (v, Σ_{v}) = Ω_{0} \otimes Ω_{0} + O_{P} (α_{n} + β_{n}) .

Consequently,

\begin{matrix} K_{2} = & vec {(Δ_{U})}^{T} {\int_{0}^{1} Ω_{0} \otimes Ω_{0} (1 + o_{P} (1)) (1 - v) d v} vec (Δ_{U}) \\ \geq & λ_{\min} (Ω_{0} \otimes Ω_{0}) {∥ vec (Δ_{U}) ∥}^{2} ∕ 2 \cdot (1 + o_{P} (1)) \\ = & τ_{1}^{- 2} (C_{1}^{2} α_{n}^{2} + C_{2}^{2} β_{n}^{2}) ∕ 2 \cdot (1 + o_{P} (1)) . \end{matrix}

All other terms are dealt with similarly as in the proof of Theorem 1, and hence we omit them. □
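The first-order Neumann expansion of $Σ_{v}^{- 1}$ used in this proof can be illustrated numerically. The sketch below takes v = 1 with a hypothetical Σ₀ and shrinking perturbations; the remainder bound in the comments is the standard geometric-series bound, stated here as an aid and not taken from the original text.

```python
import numpy as np

rng = np.random.default_rng(7)
p = 5
M = rng.standard_normal((p, p))
Sigma0 = M @ M.T / p + np.eye(p)             # hypothetical true covariance
Omega0 = np.linalg.inv(Sigma0)
U = rng.standard_normal((p, p))
U = (U + U.T) / 2

def op_norm(X):
    """Operator (spectral) norm."""
    return np.linalg.norm(X, 2)

errors, sizes = [], []
for scale in [1e-1, 1e-2, 1e-3]:
    Delta = scale * U / op_norm(U)           # perturbation with ||Delta|| = scale
    exact = np.linalg.inv(Sigma0 + Delta)    # Sigma_v^{-1} at v = 1
    first_order = Omega0 @ (np.eye(p) - Delta @ Omega0)
    errors.append(op_norm(exact - first_order))
    sizes.append(op_norm(Delta @ Omega0))

# Geometric-series remainder: ||Omega_0|| * x^2 / (1 - x) with x = ||Delta Omega_0||.
bounds = [op_norm(Omega0) * x ** 2 / (1.0 - x) for x in sizes]
```

The error shrinks quadratically in the size of the perturbation, which is why the first-order term suffices once ∥Δ_U Ω₀∥ = o(1).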

Proof of Theorem 6. The proof is similar to that of Theorem 2. We only show the main differences.

It is easy to show

\frac{\partial q_{2} (Σ)}{\partial σ_{i j}} = 2 (- {(Ω S Ω)}_{i j} + ω_{i j} + p_{λ_{n 2}}^{'} (∣ σ_{i j} ∣) sgn (σ_{i j})) .

Our aim is to estimate the order of |(−ΩSΩ + Ω)_ij|, finding an upper bound which is independent of both i and j.

Write

- Ω S Ω + Ω = I_{1} + I_{2},

where I₁ = −Ω(S − Σ₀)Ω and I₂ = Ω(Σ − Σ₀)Ω. Since

\begin{matrix} ∥ Ω ∥ = & λ_{\min}^{- 1} (Σ) \leq {(λ_{\min} (Σ_{0}) + λ_{\min} (Σ - Σ_{0}))}^{- 1} \\ = & τ_{1}^{- 1} + o (1), \end{matrix}

we have

Ω = Ω_{0} + (Ω - Ω_{0}) = Ω_{0} - Ω (Σ - Σ_{0}) Ω_{0} = Ω_{0} + Δ,

where $∥ Δ ∥ \leq ∥ Ω ∥ \cdot ∥ Σ - Σ_{0} ∥ \cdot ∥ Ω_{0} ∥ = O (η_{n}^{1 ∕ 2}) = o (1)$ by Lemma 1, with ∥Σ−Σ₀∥² = O(η_n). Hence, we can apply Lemma 3 and conclude that max_i,j |(I₁)_ij| = O_P({log p_n/n}^1/2).

For I₂, we have

\max_{i, j} ∣ {(I_{2})}_{i j} ∣ \leq ∥ Ω ∥ \cdot ∥ Σ - Σ_{0} ∥ \cdot ∥ Ω ∥ = O (∥ Σ - Σ_{0} ∥) = O (η_{n}^{1 ∕ 2}) .

Hence, we have

\max_{i, j} ∣ {(- Ω S Ω + Ω)}_{i j} ∣ = O ({\log p_{n} ∕ n}^{1 ∕ 2} + η_{n}^{1 ∕ 2}) .

The rest is similar to the proof of Theorem 2, and is omitted. □

Proof of Theorem 7. The proof is nearly identical to that of Theorem 5, except that we now set Δ_U = α_nU. Because ${({\hat{Γ}}_{S})}_{i i} = 1 = γ_{i i}^{0}$ carries no estimation error, the term of order (p_n log p_n/n)^1/2 that would otherwise arise from estimating $tr (({\hat{Γ}}_{S} - Γ_{0}) Ψ_{0} Δ_{U} Ψ_{0})$ for (3.2) is eliminated. This is why we can estimate a sparse correlation matrix more accurately.

For the operator norm result, we refer readers to the proof of Theorem 2 of Rothman et al. (2008). □

Proof of Theorem 10. For (T,D) a minimizer of (4.2), the derivative for q₃(T,D) w.r.t. t_ij for $(i, j) \in S_{3}^{c}$ is

\frac{\partial q_{3} (T, D)}{\partial t_{i j}} = 2 ({(S T^{T} D^{- 1})}_{j i} + p_{λ_{n 3}}^{'} (∣ t_{i j} ∣) sgn (t_{i j})) .

Now ST^TD⁻¹ = I₁ + I₂ + I₃ + I₄, where

I_{1} = (S - Σ_{0}) T^{T} D^{- 1}, I_{2} = Σ_{0} {(T - T_{0})}^{T} D^{- 1},

I_{3} = Σ_{0} T_{0}^{T} (D^{- 1} - D_{0}^{- 1}), I_{4} = Σ_{0} T_{0}^{T} D_{0}^{- 1} .

By the MCD (4.1), $I_{4} = T_{0}^{- 1}$ . Since i > j for $(i, j) \in S_{3}^{c}$ , we must have ${(T_{0}^{- 1})}_{j i} = 0$ . Hence, we can ignore I₄.
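The claim that I₄ reduces to $T_{0}^{- 1}$ and contributes nothing above the diagonal can be confirmed on a small example. This sketch builds Σ₀ from a hypothetical unit lower triangular T₀ and positive diagonal D₀ via (4.1).

```python
import numpy as np

rng = np.random.default_rng(8)
p = 5
# Hypothetical unit lower triangular T_0 and positive diagonal D_0.
T0 = np.eye(p) + np.tril(rng.normal(scale=0.4, size=(p, p)), k=-1)
D0 = np.diag(rng.uniform(0.5, 2.0, size=p))

T0_inv = np.linalg.inv(T0)
Sigma0 = T0_inv @ D0 @ T0_inv.T              # implied by T_0 Sigma_0 T_0^T = D_0

I4 = Sigma0 @ T0.T @ np.linalg.inv(D0)       # I_4 = Sigma_0 T_0^T D_0^{-1}

# Entries (j, i) with j < i lie strictly above the diagonal.
upper_part = I4[np.triu_indices(p, k=1)]
```

Because the inverse of a lower triangular matrix is lower triangular, every (j, i) entry with i > j vanishes, which is what allows I₄ to be dropped from the derivative.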

Since ∥T−T₀∥² = O(η_n) and ∥D−D₀∥² = O(ζ_n) with η_n, ζ_n = o(1), condition (A) readily gives $∥ D^{- 1} - D_{0}^{- 1} ∥ = O (∥ D - D_{0} ∥) = O (ζ_{n}^{1 ∕ 2})$ . Then we can apply Lemma 3 to show that max_ij |(I₁)_ij| = O_P({log p_n/n}^1/2).

For I₂, we have $\max_{i, j} ∣ {(I_{2})}_{i j} ∣ \leq ∥ Σ_{0} ∥ \cdot ∥ T - T_{0} ∥ \cdot ∥ D^{- 1} ∥ = O (η_{n}^{1 ∕ 2})$ . Finally, $\max_{i, j} ∣ {(I_{3})}_{i j} ∣ \leq ∥ Σ_{0} ∥ \cdot ∥ T_{0} ∥ \cdot ∥ D^{- 1} - D_{0}^{- 1} ∥ = O (ζ_{n}^{1 ∕ 2})$ .

With all these, we have $\max_{(i, j) \in S_{3}^{c}} {∣ {(S T^{T} D^{- 1})}_{j i} ∣}^{2} = O_{P} (\log p_{n} ∕ n + η_{n} + ζ_{n})$ . The rest of the proof goes like that of Theorem 2 or 6. □

Footnotes

Financial support from the NSF grant DMS-0354223, DMS-0704337 and NIH grant R01-GM072611 is gratefully acknowledged.

AMS 2000 subject classifications. Primary 62F12; secondary 62J07.

Contributor Information

Clifford Lam, Department of Statistics, London School of Economics and Political Science, London, WC2A 2AE (C.Lam2@lse.ac.uk).

Jianqing Fan, Department of Operation Research and Financial Engineering, Princeton University, Princeton, NJ 08544 (jqfan@princeton.edu).

References

[1].Bai Z, Silverstein JW. Spectral Analysis of Large Dimensional Random Matrices. Science Press; Beijing: 2006. [Google Scholar]
[2].Bickel PJ, Levina E. Covariance Regularization by Thresholding. Ann. Statist. 2008a;36(6):2577–2604. [Google Scholar]
[3].Bickel PJ, Levina E. Regularized Estimation of Large Covariance Matrices. Ann. Statist. 2008b;36(1):199–227. [Google Scholar]
[4].Cai T, Zhang C-H, Zhou H. Optimal rates of convergence for co-variance matrix estimaiton. the Wharton School, University of Pennsylvania; 2009. Technical report. [Google Scholar]
[5].d’Aspremont A, Banerjee O, El Ghaoui L. First-order Methods For Sparse Covariance Selection. SIAM. J. Matrix Anal. and Appl. 2008;30(1):56–66. [Google Scholar]
[6].Dempster AP. Covariance Selection. Biometrics. 1972;28:157–175. [Google Scholar]
[7].Diggle P, Verbyla A. Nonparametric Estimation of Covariance Structure in Longitudinal Data. Biometrics. 1998;54(2):401–415. [PubMed] [Google Scholar]
[8].El Karoui N. Operator Norm Consistent Estimation of a Large Dimensional Sparse Covariance Matrices. Ann. Statist. 2008;36(6):2717–2756. [Google Scholar]
[9].Fan J, Feng Y, Wu Y. Network Exploration via the Adaptive LASSO and SCAD Penalties. Annals of Applied Statistics. 2009;3(2):521–541. doi: 10.1214/08-AOAS215SUPP. [DOI] [PMC free article] [PubMed] [Google Scholar]
[10].Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J. Amer. Statist. Assoc. 2001;96:1348–1360. [Google Scholar]
[11].Fan J, Peng H. Nonconcave Penalized Likelihood With a Diverging Number of Parameters. Ann. Statist. 2004;32:928–961. [Google Scholar]
[12].Friedman J, Hastie T, Tibshirani R. Sparse Inverse Covariance Estimation with the Graphical LASSO. Biostatistics. 2008;9(3):432–441. doi: 10.1093/biostatistics/kxm045. [DOI] [PMC free article] [PubMed] [Google Scholar]
[13].Huang J, Horowitz J, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 2008;36:587–613. [Google Scholar]
[14].Huang J, Liu N, Pourahmadi M, Liu L. Covariance Matrix Selection and Estimation via Penalised Normal Likelihood. Biometrika. 2006;93(1):85–98. [Google Scholar]
[15].Levina E, Rothman AJ, Zhu J. Sparse Estimation of Large Covariance Matrices via a Nested Lasso Penalty. Ann. Applied Statist. 2008;2(1):245–263. [Google Scholar]
[16].Meier L, van de Geer S, Bühlmann P. The group Lasso for logistic regression. Journal of the Royal Statistical Society, B. 2008;70:53–71. [Google Scholar]
[17].Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the Lasso. Ann. Statist. 2006;34:1436–1462. [Google Scholar]
[18].Pourahmadi M. Joint Mean-Covariance Models with Applications to Longitudinal Data: Unconstrained Parameterisation. Biometrika. 1999;86:677–690. [Google Scholar]
[19].Ravikumar P, Lafferty J, Liu H, Wasserman L. Advances in Neural Information Processing Systems. MIT Press; 2008. Sparse additive models; p. 20. [Google Scholar]
[20].Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse Permutation Invariant Covariance Estimation. Electron. J. Statist. 2008;2:494–515.
[21].Smith M, Kohn R. Parsimonious Covariance Matrix Estimation for Longitudinal Data. J. Amer. Statist. Assoc. 2002;97(460):1141–1153.
[22].Wagaman AS, Levina E. Discovering sparse covariance structures with the Isomap. Journal of Computational and Graphical Statistics. 2008;18 to appear.
[23].Wong F, Carter C, Kohn R. Efficient Estimation of Covariance Selection Models. Biometrika. 2003;90:809–830.
[24].Wu WB, Pourahmadi M. Nonparametric Estimation of Large Covariance Matrices of Longitudinal Data. Biometrika. 2003;90:831–844.
[25].Yuan M, Lin Y. Model Selection and Estimation in the Gaussian Graphical Model. Biometrika. 2007;94:19–35.
[26].Zhang CH. Penalized Linear Unbiased Selection. Department of Statistics, Rutgers University; 2007. Technical report 2007-003.
[27].Zhao P, Yu B. On Model Selection Consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
[28].Zou H. The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 2006;101:1418–1429.
[29].Zou H, Li R. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models (With Discussion). Ann. Statist. 2008;36(4):1509–1533. doi: 10.1214/009053607000000802.

