Abstract
This paper studies the sparsistency and rates of convergence for estimating sparse covariance and precision matrices based on penalized likelihood with nonconvex penalty functions. Here, sparsistency refers to the property that all parameters that are zero are actually estimated as zero with probability tending to one. Depending on the case of applications, sparsity priori may occur on the covariance matrix, its inverse or its Cholesky decomposition. We study these three sparsity exploration problems under a unified framework with a general penalty function. We show that the rates of convergence for these problems under the Frobenius norm are of order (sn log pn/n)1/2, where sn is the number of nonzero elements, pn is the size of the covariance matrix and n is the sample size. This explicitly spells out the contribution of high-dimensionality is merely of a logarithmic factor. The conditions on the rate with which the tuning parameter λn goes to 0 have been made explicit and compared under different penalties. As a result, for the L1-penalty, to guarantee the sparsistency and optimal rate of convergence, the number of nonzero elements should be small: at most, among parameters, for estimating sparse covariance or correlation matrix, sparse precision or inverse correlation matrix or sparse Cholesky factor, where is the number of the nonzero elements on the off-diagonal entries. On the other hand, using the SCAD or hard-thresholding penalty functions, there is no such a restriction.
Keywords: Covariance matrix, high dimensionality, consistency, nonconcave penalized likelihood, sparsistency, asymptotic normality
1 Introduction
Covariance matrix estimation is a common statistical problem in many scientific applications. For example, in financial risk assessment or longitudinal study, an input of covariance matrix Σ is needed, whereas an inverse of the covariance matrix, the precision matrix Σ−1, is required for optimal portfolio selection, linear discriminant analysis or graphical network models. Yet, the number of parameters in the covariance matrix grows quickly with dimensionality. Depending on the applications, the sparsity of the covariance matrix or precision matrix is frequently imposed to strike a balance between biases and variances. For example, in longitudinal data analysis [see e.g., Diggle and Verbyla (1998), or Bickel and Levina (2008b)], it is reasonable to assume that remote data in time are weakly correlated, whereas in Gaussian graphical models, the sparsity of the precision matrix is a reasonable assumption (Dempster (1972)).
This initiates a series of researches focusing on the parsimony of a covariance matrix. Smith and Kohn (2002) used priors which admit zeros on the off-diagonal elements of the Cholesky factor of the precision matrix Ω = Σ−1, while Wong, Carter and Kohn (2003) used zero-admitting prior directly on the off-diagonal elements of Ω to achieve parsimony. Wu and Pourahmadi (2003) used the Modified Cholesky Decomposition (MCD) to find a banded structure for Ω nonparametrically for longitudinal data. Bickel and Levina (2008b) developed consistency theories on banding methods for longitudinal data, for both Σ and Ω.
Various authors have used penalized likelihood methods to achieve parsimony on covariance selection. Fan and Peng (2004) has laid down a general framework for penalized likelihood with diverging dimensionality, with general conditions for the oracle property stated and proved. However, it is not clear whether it is applicable to the specific case of covariance matrix estimation. In particular, they did not link the dimensionality pn with the number of nonzero elements sn in the true covariance matrix Σ0, or the precision matrix Ω0. A direct application of their results to our setting can only handle a relatively small covariance matrix of size pn = o(n1/10).
Recently, there is a surge of interest on the estimation of sparse covariance matrix or precision matrix using penalized likelihood method. Huang, Liu, Pourahmadi and Liu (2006) used the LASSO on the off-diagonal elements of the Cholesky factor from MCD, while Meinshausen and Bühlmann (2006), d’Aspremont, Banerjee, and El Ghaoui (2008) and Yuan and Lin (2007) used different LASSO algorithms to select zero elements in the precision matrix. A novel penalty called the nested Lasso was constructed in Levina, Rothman and Zhu (2008) to penalize off-diagonal elements. Thresholding the sample covariance matrix in high-dimensional setting was thoroughly studied by El Karoui (2008) and Bickel and Levina (2008a) and Cai, Zhang and Zhou (2009) with remarkable results for high dimensional applications. However, it is not directly applicable to estimating sparse precision matrix when the dimensionality pn is greater than the sample size n. Wagaman and Levina (2008) proposed an Isomap method for discovering meaningful orderings of variables based on their correlations that result in block-diagonal or banded correlation structure, resulting in an Isoband estimator. A permutation invariant estimator, called SPICE, was proposed in Rothman, Bickel, Levina and Zhu (2008) based on penalized likelihood with L1-penalty on the off-diagonal elements for the precision matrix. They obtained remarkable results on the rates of convergence. The rate for estimating Ω under the Frobenius norm is of order (sn log pn/n)1/2, with dimensionality cost only a logarithmic factor in the overall mean-square error, where sn = pn + sn1, pn is the number of the diagonal elements and sn1 is the number of the nonzero off-diagonal entries. However, such a rate of convergence neither addresses explicitly the issues of sparsistency such as those in Fan and Li (2001) and Zhao and Yu (2006), nor the bias issues due to the L1-penalty and the sampling distribution of the estimated nonzero elements. These are the core issues of the study. By sparsistency, we mean the property that all parameters that are zero are actually estimated as zero with probability tending to one, a weaker requirement than that of Ravikumar, Lafferty, Liu and Wasserman (2008).
In this paper, we investigate the aforementioned problems using the penalized pseudo-likelihood method. Assume a random sample {yi}1≤i≤n with mean zero and covariance matrix Σ0, satisfying some sub-Gaussian tails conditions as specified in Lemma 2 (see Section 5). The sparsity of the true precision matrix Ω0 can be explored by maximizing the Gaussian quasi-likelihood or equivalently minimizing
(1.1) |
which is the penalized negative log-likelihood if the data is Gaussian. The matrix is the sample covariance matrix, Ω = (ωij), and pλn1(·) is a penalty function, depending on a regularization parameter λn1, which can be nonconvex. For instance, the L1-penalty pλ(θ) = λ|θ| is convex, while the hard-thresholding penalty defined by pλ(θ) = λ2 − (|θ| − λ)21{|θ|<λ}, and the SCAD penalty defined by
(1.2) |
are folded-concave. Nonconvex penalty is introduced to reduce bias when the true parameter has a relatively large magnitude. For example, the SCAD penalty remains constant when θ is large, while the L1-penalty grows linearly with θ. See Fan and Li (2001) for a detailed account of this and other advantages of such a penalty function. The computation can be done via the local linear approximation (Zhou and Li, 2008, Fan et al. 2009); see Section 2.1 for additional details.
Similarly, the sparsity of the true covariance matrix Σ0 can be explored by minimizing
(1.3) |
where Σ = (σij). Note that we only penalize the off-diagonal elements of Σ or Ω in the aforementioned two methods, since the diagonal elements of Σ0 and Ω0 do not vanish.
In studying a sparse covariance or precision matrix, it is important to distinguish between the diagonal and off-diagonal elements, since the diagonal elements are always positive and they contribute to the overall mean-squares errors. For example, the true correlation matrix, denoted by Γ0, has the same sparsity structure as Σ0 without the need to estimating its diagonal elements. In view of this fact, we introduce a revised method (3.2) to take this advantage. It turns out that the correlation matrix can be estimated with a faster rate of convergence, at (sn1 log pn/n)1/2 instead of ((pn + sn1) log pn/n)1/2, where sn1 is the number of nonzero correlation coefficients. We can take similar advantages over the estimation of the true inverse correlation matrix, denoted by Ψ0. See Section 2.5. This is an extension of the work of Rothman et al. (2008) using the L1-penalty. Such an extension is important since the nonconcave penalized likelihood ameliorates the bias problem for the L1-penalized likelihood.
The bias issues of the commonly used L1-penalty, or LASSO, can be seen from our theoretical results. In fact, due to the bias of LASSO, an upper bounded of λni is needed in order to achieve fast rate of convergence. On the other hand, a lower bound is required in order to achieve sparsity of estimated precision or covariance matrices. This is in fact one of the motivations for introducing nonconvex penalty functions in Fan and Li (2001) and Fan and Peng (2004), but we state and prove the explicit rates in the current context. In particular, we demonstrate that the L1-penalized estimator can achieve simultaneously the optimal rate of convergence and sparsistency for estimation of Σ0 or Ω0 when the number of nonzero elements in the off-diagonal entries are no larger than O(pn), but not guaranteed so otherwise. On the other hand, using the nonconvex penalties like the SCAD or hard-thresholding penalty, such an extra restriction is not needed.
We also compare two different formulations of penalized likelihood using the modified Cholesky decomposition, exploring their respective rates of convergence and sparsity properties.
Throughout this paper, we use λmin(A), λmax(A) and tr(A) to denote the minimum eigenvalue, maximum eigenvalue, and trace of a symmetric matrix A respectively. For a matrix B, we define the operator norm and the Frobenius norm, respectively, as and ∥B∥F = tr1/2(BTB).
2 Estimation of sparse precision matrix
In this section, we present the analysis of (1.1) for estimating a sparse precision matrix. Before this, let us first present an algorithm for computing the nonconcave maximum (pseudo)-likelihood estimator and then state the conditions needed for our technical results.
2.1 Algorithm based on iterated reweighted L1-penalty
The computation of the nonconcave maximum likelihood problems can be solved by a sequence of L1-penalized likelihood problems via local linear approximation (Zou and Li 2008, Fan et al. 2009). For example, given the current estimate Ωk = (ωij,k), by the local linear approximation to the penalty function,
(2.1) |
Hence, Ωk+1 should be taken to maximize the right-hand side of (2.1):
(2.2) |
after ignoring the two constant terms. Problem (2.2) is the weighted penalized L1-likelihood. In particular, if we take the most primitive initial value Ω0 = 0, then
is already a good estimator. Iterations of (2.2) reduces the biases of the estimator, as larger estimated coefficients in the previous iterations receive less penalty. In fact, in a different setup, Zou and Li (2008) showed that one iteration of such a procedure is sufficient as long as the initial values are good enough.
Fan et al. (2009) has implemented the above algorithm for optimizing (1.1). They have also demonstrated in Section 2.2 in their paper how to utilize the graphical lasso algorithm of Friedman, Hastie and Tibshirani (2008), which is essentially a group coordinate descent procedure, to solve problem (2.2) quickly, even when pn > n. Such a group coordinate decent algorithm was also used by Meier et al. (2008) to solve the group LASSO problem. Thus iteratively, (2.2), and hence (1.1), can be solved quickly with the graphical lasso algorithm. See also Zhang (2007) for a general solution to the folded-concave penalized least-squares problem. The following is a brief summary of the numerical results in Fan et al. (2009).
2.2 Some numerical results
We give a brief summary of a breast cancer data analysis with pn > n considered in Fan et al. (2009). For full details, please refer to Section 3.2 of Fan et al. (2009). Other simulation results are also in Section 4 in their paper.
Breast cancer data
Normalized gene expression data from 130 patients with stage I-III breast cancers are analyzed, with 33 of them belong to class 1 and 97 belong to class 2. The aim is to assess prediction accuracy in predicting which class a patient will belong to, using a set of pre-selected genes (pn = 110, chosen by t-tests) as gene expression profile data. The data is randomly divided into training (n = 109) and testing sets. The mean vector for the genes expression levels is obtained from the training data, as well as the associated inverse covariance matrix estimated using LASSO, adaptive LASSO and SCAD penalties as three different regularization methods. A linear discriminant score is then calculated for each regularization method and applied to the testing set to predict if a patient belongs to class 1 or 2. This is repeated 100 times.
On average, the estimated precision matrix using LASSO has many more nonzeros than that using SCAD (3923 versus 674). This is not surprising when we look at equation (2.3) in our paper, where the L1 penalty imposes an upper bound on the tuning parameter λn1 for consistency, which links to reducing the bias in the estimation. This makes the λn1 in practice too small to set many of the elements in to zero. While we do not know which elements in the true Ω are zero, the large number of nonzero elements in the L1 penalized estimator seems spurious, and the resulting gene network is not easy to interpret.
On the other hand, SCAD-penalized estimator has a much smaller number of nonzero elements, since the tuning parameter λn1 is not bounded above under consistency of the resulting estimator. This makes the resulting gene network easier to interpret, with some clusters of genes identified.
Also, classification results on the testing set using the SCAD penalty for precision matrix estimation is better than that using the L1 penalty, in the sense that the specificity (#True Negative/#class 2) is higher (0.794 to 0.768) while the sensitivity (#True Positive/#class 1) is similar to that using L1-penalized precision matrix estimator.
2.3 Technical conditions
We now introduce some notations and present regularity conditions for the rate of convergence and sparsistency.
Let , where is the true precision matrix. Denote by sn1 = |S1| − pn, which is the number of nonzero elements in the off-diagonal entries of Ω0. Define
The term an1 is related to the asymptotic bias of the penalized likelihood estimate due to penalization. Note that for L1-penalty, an1 = λn1 and bn1 = 0, whereas for SCAD, an1 = bn1 = 0 for sufficiently large n under the last assumption of condition (B) below.
We assume the following regularity conditions:
- There are constants τ1 and τ2 such that
an1 = O({(1 + pn/(sn1 + 1)) log pn/n}1/2), bn1 = o(1), and as n → ∞.
The penalty pλ(·) is singular at the origin, with limt↓0 pλ(t)/(λt) = k > 0.
There are constants C and D such that, when .
Condition (A) bounds uniformly the eigenvalues of Σ0, which facilitates the proof of consistency. It also includes a wide class of covariance matrices as noted in Bickel and Levina (2008b). The rates an1 and bn1 in condition (B) are also needed for proving consistency. If they are too large, the bias due to penalty can dominate the variance from the likelihood, resulting in poor estimates.
The last requirement in condition (B) states the rate at which the nonzero parameters should be distinguished from zero asymptotically. It is not explicitly needed in the proofs, but for asymptotically unbiased penalty functions, this is a necessary condition so that an1 and bn1 are converging to zero fast enough as needed in the first part of condition (B). In particular, for the SCAD and hard-thresholding penalty functions, this condition implies that an1 = bn1 = 0 exactly for sufficiently large n, thus allowing a flexible choice of λn1. For the SCAD penalty (1.2), the condition can be relaxed as .
The singularity in condition (C) gives sparsity in the estimates [Fan and Li (2001)]. Finally, condition (D) is a smoothing condition for the penalty function, and is needed in proving asymptotic normality. The SCAD penalty, for instance, satisfies this condition by choosing the constant D, independent of n, to be large enough.
2.4 Properties of sparse precision matrix estimation
Minimizing (1.1) involves nonconvex minimization, and we need to prove that there exists a local minimizer for the minimization problem with a certain rate of convergence, which is given under the Frobenius norm. The proof is given in Section 5. It is similar to the one given in Rothman et al. (2008), but now the penalty function is nonconvex.
Theorem 1 (Rate of convergence). Under regularity conditions (A)-(D), if and (pn + sn1)(log pn)k/n = O(1) for some k > 1, then there exists a local minimizer such that . For the L1-penalty, we only need .
The proofs of this theorem and others are relegated to Section 5 so that readers can get more quickly what the results are. As in Fan and Li (2001), the asymptotic bias due to the penalty for each nonzero parameter is an1. Since we penalized only on the off-diagonal elements, the total bias induced by the penalty is asymptotically of order sn1an1. The square of this total bias over all nonzero elements is of order OP{(pn + sn1) log pn/n} under condition (B).
Theorem 1 states explicitly how the number of nonzero elements and dimensionality affect the rate of convergence. Since there are (pn + sn1) nonzero elements and each of them can be estimated at best with rate n−1/2, the total square errors are at least of rate (pn + sn1)/n. The price that we pay for high-dimensionality is merely a logarithmic factor log pn. The results holds as long as (pn+sn1)/n is at a rate O((log pn)−k) with some k > 1, which decays to zero slowly. This means that in practice pn can be comparable to n without violating the results. The condition here is not minimum possible; we expect it holds for p ≫ n. Here, we refer the local minimizer as an interior point within a given close set such that it minimizes the target function. Following a similar argument to Huang et al. (2008), the local minimizer in Theorem 1 can be taken as the global minimizer with additional conditions on the tail of the penalty function.
Theorem 1 is also applicable to the L1-penalty function, where the local minimizer becomes the global minimizer. The asymptotic bias of the L1-penalized estimate is given in the term sn1an1 = sn1λn1 as shown in the technical proof. In order to control the bias, we impose condition (B), which entails an upper bound on λn1 = O({(1+pn/(sn1+1)) log pn/n}1/2). The bias problem due to the L1-penalty for finite parameter has already been unveiled by Fan and Li (2001) and Zou (2006).
Next we show the sparsistency of the penalized estimator from (1.1). We use Sc to denote the complement of a set S.
Theorem 2 (Sparsistency). Under the conditions given in Theorem 1, for any local minimizer of (1.1) satisfying and for a sequence of , if , then with probability tending to 1, for all .
First, since for any matrix M, we can always take ηn = (pn + sn1) log pn/n in Theorem 2, but this will result in more stringent requirement on the number of zero elements when L1-penalty is used, as we now explain. The sparsistency requires a lower bound on the rate of the regularization parameter λn1. On the other hand, condition (B) imposes an upper bound on λn1 when L1-penalty is used in order to control the biases. Explicitly, we need, for L1-penalized likelihood,
(2.3) |
for both consistency and sparsistency to be satisfied. We present two scenarios here for the two bounds to be compatible, making use of the inequalities for a matrix M of size pn.
We always have . In the worst case scenario where they have the same order, , so that ηn = (pn+sn1) log pn/n. It is then easy to see from (2.3) that the two bounds are compatible only when sn1 = O(1).
- We also have . In the optimistic scenario where they have the same order,
Hence, ηn = (1 + sn1/pn) log pn/n, and compatibility of the bounds requires sn1 = O(pn).
Hence, even in the optimistic scenario, consistency and sparsistency are guaranteed only when sn1 = O(pn) if the L1-penalty is used, i.e., the precision matrix has to be sparse enough.
However, if the penalty function used is unbiased, like the SCAD or the hard-thresholding penalty, we do not impose an extra upper bound for λn1 since its first derivative goes to zero fast enough as |θ| increases (exactly equals zero for the SCAD and hard-thresholding penalty functions, when n is sufficiently large; see condition (B) and the explanation thereof). Thus, λn1 is allowed to decay to zero slowly, allowing even the largest order .
We remark that asymptotic normality for the estimators of the elements in S1 have been established in a previous version of this paper. We omit it here for brevity.
2.5 Properties of sparse inverse correlation matrix estimation
The inverse correlation matrix Ψ0 retains the same sparsity structure of Ω0. Consistency and sparsistency results can be achieved with pn as large as log pn = o(n), as long as (sn1 + 1)(log pn)k/n = O(1) for some k > 1 as n → ∞. We minimize, w.r.t. Ψ = (ψij),
(2.4) |
where is the sample correlation matrix, with being the diagonal matrix with diagonal elements of S, and υn1 is a regularization parameter. After obtaining , Ω0 can also be estimated by .
To present the rates of convergence for and , we define
where and modify condition (D) to (D’) with λn1 there replaced by υn1, and impose (B’) cn1 = O({log pn/n}1/2), dn1 = o(1). Also, as n → ∞.
Theorem 3 Under regularity conditions (A),(B’),(C) and (D’), if (sn1+1)(log pn)k/n = O(1) for some k > 1 and , then there exists a local minimizer for (2.4) such that and under the operator norm. For the L1-penalty, we only need
Note that we can allow pn ≫ n without violating the result as long as log pn/n = o(1). Note also that an order of {pn log pn/n}1/2 is removed by estimating the inverse correlation rather than the precision matrix, which is somewhat surprising since the inverse correlation matrix, unlike the correlation matrix, does not have known diagonal elements that contribute no errors to the estimation. This can be explained and proved as follows. If sn1 = O(pn), the result is obvious. When sn1 = o(pn), most of the off-diagonal elements are zero. Indeed, there are at most O(sn1) columns of the inverse correlation matrix which contain at least one nonzero element. The rest of the columns that have all zero off-diagonal elements must have diagonal entries 1. These columns represent variables that are actually uncorrelated from the rest. Now, it is easy to see from (2.4) that these diagonal elements, which are one, are all estimated exactly as one with no estimation error. Hence, an order of (pn log pn/n)1/2 is not present even in the case of estimating the inverse correlation matrix.
For the L1-penalty, our result reduces to that given in Rothman et al. (2008). We offer the sparsistency result as follows.
Theorem 4 (Sparsistency) Under the conditions given in Theorem 3, for any local minimizer of (2.4) satisfying and for some ηn → 0, if , then with probability tending to 1, for all .
The proof follows exactly the same as that for Theorem 2 in Section 2.4, and is thus omitted.
For the L1-penalty, control of bias and sparsistency require υn1 to satisfy bounds like (2.3):
(2.5) |
This leads to two scenarios:
- The worst case scenario has
meaning ηn = sn1 log pn/n. Then compatibility of the bounds in (2.5) requires sn1 = O(1). - The optimistic scenario has
meaning ηn = sn1/pn · log pn/n. Then compatibility of the bounds in (2.5) requires sn1 = O(pn).
On the other hand, for penalties like the SCAD or the hard-thresholding penalty, we do not need an upper bound for sn1. Hence, we only need (sn + 1)(log pn)k/n = O(1) as n → ∞ for some k > 1. It is clear that SCAD results in better sampling properties than the L1-penalized estimator in precision or inverse correlation matrix estimation.
3 Estimation of sparse covariance matrix
In this section, we analyze the sparse covariance matrix estimation using the penalized likelihood (1.3). Then it is modified to estimating the correlation matrix, which improves the rate of convergence. We assume that the yi’s are i.i.d. N(0, Σ0) throughout this section.
3.1 Properties of sparse covariance matrix estimation
Let , where . Denote sn2 = |S2| − pn, so that sn2 is the number of nonzero elements in Σ0 on the off-diagonal entries. Put
Technical conditions in Section 2 need some revision. In particular, condition (D) now becomes condition (D2) with λn1 there replaced by λn2. Condition (B) should now be (B2) an2 = O({(1 + pn/(sn2 + 1)) log pn/n}1/2), bn2 = o(1), and as n → ∞.
Theorem 5 (Rate of convergence). Under regularity conditions (A), (B2), (C) and (D2), if (pn + sn2)(log pn)k/n = O(1) for some k > 1 and (pn + sn2) log , then there exists a local minimizer such that . For the L1-penalty, we only need .
Like the case for precision matrix estimation, the asymptotic bias due to the L1-penalty is of order sn2an2 = sn2λn2. To control this term, for the L1-penalty, we require λn2 = O({(1 + pn/(sn2 + 1)) log pn/n}1/2).
Theorem 6 (Sparsistency). Under the conditions given in Theorem 5, for any local minimizer of (1.3) satisfying and for some ηn → 0, if , then with probability tending to 1, for all .
For the L1-penalized likelihood, controlling of bias for consistency together with sparsistency requires
(3.1) |
This is the same condition as (2.3), and hence in the worst case scenario where
we need sn2 = O(1). In the optimistic scenario where
we need sn2 = O(pn). In both cases, the matrix Σ0 has to be very sparse, but the former is much sparser.
On the other hand, if unbiased penalty functions like the SCAD or hard-thresholding penalty are used, we do not need an upper bound on λn2 since the bias an2 = 0 for sufficiently large n. This gives more flexibility on the order of sn2.
Similar to Section 2, asymptotic normality for the estimators of the elements in S2 can be proved under certain assumptions.
3.2 Properties of sparse correlation matrix estimation
The correlation matrix Γ0 retains the same sparsity structure of Σ0 with known diagonal elements. This special structure allows us to estimate Γ0 more accurately. To take advantage of the known diagonal elements, the sparse correlation matrix Γ0 is estimated by minimizing w.r.t. Γ = (γij),
(3.2) |
where υn2 is a regularization parameter. After obtaining Σ0 can be estimated by .
To present the rates of convergence for and , we define
where . We modify condition (D) to (D2′) with λn2 there replaced by υn2, and (B) to (B2′) as follows: (B2′) cn2 = O({log pn/n}1/2), dn2 = o(1), and as n → ∞.
Theorem 7 Under regularity conditions (A),(B2′),(C) and (D2′), if (pn+sn2)(log pn)k/n = O(1) for some k > 1 and , then there exists a local minimizer for (3.2) such that
In addition, for the operator norm, we have
For the L1-penalty, we only need .
The proof is sketched in Section 5. This theorem shows that the correlation matrix, like the inverse correlation matrix, can be estimated more accurately, since diagonal elements are known to be one.
Theorem 8 (Sparsistency). Under the conditions given in Theorem 7, for any local minimizer of (3.2) satisfying and for some ηn → 0 , if , then with probability tending to 1, for all .
The proof follows exactly the same as that of Theorem 6 in Section 5, and is omitted. For the L1-penalized likelihood, controlling of bias and sparsistency requires
(3.3) |
This is the same condition as (2.5), hence in the worst scenario where
we need sn2 = O(1). In the optimistic scenario where
we need sn2 = O(pn).
The use of unbiased penalty functions like the SCAD or the hard-thresholding penalty, similar to results in the previous sections, does not impose an upper bound on the regularization parameter since bias cn2 = 0 for sufficiently large n. This gives more flexibility to the order of sn2 allowed.
4 Extension to sparse Cholesky decomposition
Pourahmadi (1999) proposed the modified Cholesky decomposition (MCD) which facilitates the sparse estimation of Ω through penalization. The idea is to represent zero-mean data y = (y1, ⋯ , ypn)T using the autoregressive model:
(4.1) |
where T is the unique unit lower triangular matrix with ones on its diagonal and (i, j)th element being −ϕij for j < i, and D is diagonal with ith element being . The optimization problem is unconstrained (since the ϕij’s are free variables), and the estimate for Ω is always positive-definite.
Huang et al. (2006) and Levina et al. (2008) both used the MCD for estimating Ω0. The former maximized the log-likelihood (ML) over T and D simultaneously, while the latter suggested also a least square version (LS), with D being first set to the identity matrix and then minimizing over T to obtain . The latter corresponds to the original Cholesky decomposition. The sparse Cholesky factor can be estimated through minimizing
(4.2) |
This is indeed the same as (1.1) with the substitution of Ω = TTD−1T and penalization parameter λn3. Noticing that (4.1) can be written as , the least square version is to minimize in the matrix notation. Aggregating the n observations and adding penalty functions, the least-square criterion is to minimize
(4.3) |
In view of the results in Sections 2.5 and 3.2, we can also write the sample covariance matrix in (4.2) as and then replace by T, resulting in the normalized (NL) version as follows:
(4.4) |
We will also assume the yi’s are i.i.d. N(0, Σ0) as in the last section.
4.1 Properties of sparse Cholesky factor estimation
Since all the T’s introduced in the three models above have the same sparsity structure, let S and sn3 be the nonzero set and number of nonzeros associated with each T above. Define
For (ML), condition (D) is adapted to (D3) with λn1 there replaced by λn3. Condition (B) is modified as (B3) an3 = O({(1 + pn/(sn3 + 1)) log pn/n}1/2), bn3 = o(1) and as n → ∞.
After obtaining and from minimizing (ML), we set .
Theorem 9 Under regularity conditions (A),(B3),(C),(D3), if (pn + sn3)(log pn)k/n = O(1) for some k > 1 and , then there exists a local minimizer and for (ML) such that , and . For the L1-penalty, we only need .
The proof is similar to those of Theorems 5 and 7 and is omitted. The Cholesky factor T has ones on its main diagonal without the need for estimation. Hence, the rate of convergence is faster than .
Theorem 10 (Sparsistency). Under the conditions in Theorem 9, for any local minimizer , of (4.2) satisfying and , if , then sparsistency holds for , provided that and , for some ηn, ζn → 0.
The proof is in Section 5. For the L1-penalized likelihood, control of bias and sparsistency impose the following:
(4.5) |
The worst scenario corresponds to ηn = sn3 log pn/n and ζn = pn log pn/n, so that we need sn3 = O(1). The optimistic scenario corresponds to ηn = sn3/pn · log pn/n and ζn = log pn/n, so that we need sn3 = O(pn).
On the other hand, such a restriction is not needed for unbiased penalties like the SCAD or hard-thresholding penalty, giving more flexibility on the order of sn3.
4.2 Properties of sparse normalized Cholesky factor estimation
We now turn to analyzing the normalized penalized likelihood (4.4). With T = (tij) in (NL) which is lower triangular, define
Condition (D) is now changed to (D5) with λn1 there replaced by λn5. Condition (B) is now substituted by (B5) , bn5 = o(1), as n → ∞.
Theorem 11 (Rate of convergence) Under regularity conditions (A),(B5),(C) and (D5), if sn3(log pn)k/n = O(1) for some k > 1 and , then there exists a local minimizer for (NL) such that and rate of convergence in the Frobenius norm
and in the operator norm, it is improved to
For the L1-penalty, we only need .
The proof is similar to that of Theorems 5 and 7 and is omitted. In this theorem, like Lemma 3, we can have pn so that pn/n goes to a constant less than 1. It is evident that normalizing with results in an improvement in the rate of convergence in operator norm.
Theorem 12 (Sparsistency). Under the conditions given in Theorem 11, for any local minimizer of (4.4) satisfying if , then sparsistency holds for , provided that for some ηn → 0.
Proof is omitted since it goes exactly the same as that of Theorem 10. The above results apply also to the L1-penalized estimator. For simultaneous persistency and optimal rate of convergence using the L1-penalty, the biases inherent in it induce the restriction sn3 = O(1) in the worst scenario where , and sn3 = O(pn) in the optimistic scenario where . This restriction does not apply to the SCAD and other asymptotically unbiased penalty functions.
5 Proofs
We first prove three lemmas. The first one concerns with inequalities involving the operator and the Frobenius norms. The other two concern with order estimation for elements in a matrix of the form A(S − Σ0)B, which are useful in proving results concerning sparsistency.
Lemma 1 Let A and B be real matrices such that the product AB is defined. Then, defining , we have
(5.1) |
In particular, if A = (aij), then |aij| ≤ ∥A∥ for each i, j.
Proof of Lemma 1. Write B = (b1, ⋯ , bq), where bi is the i-th column vector in B. Then
Similarly,
which completes the proof of (5.1). To prove |aij| ≤ ∥A∥, note that , where ei is the unit column vector with one at the i-th position, and zero elsewhere. Hence, using (5.1),
and this completes the proof of the lemma. □
Lemma 2 Let S be a sample covariance matrix of a random sample {yi}1≤i≤n, with E(yi) = 0 and var(yi) = Σ0. Let yi = (yi1, ⋯ , yipn) with yij ~ Fj, where Fj is the c.d.f. of yij , and let Gj be the c.d.f. of , with
(5.2) |
for some λ0 > 0. Assume log pn/n = o(1), and that Σ0 has eigenvalues uniformly bounded above as n → ∞. Then for constant matrices A and B with ∥A∥, ∥B∥ = O(1), we have maxi,j |(A(S − Σ0)B)ij | = OP({log pn/n}1/2).
Remark: The conditions on the yij’s above are the same as those used in Bickel and Levina (2008b) for relaxing the normality assumption.
Proof of Lemma 2. Let xi = Ayi and . Define , with covariance matrix
Since ∥(AT B)T∥ ≤ (∥A∥2 + ∥B∥2)1/2 = O(1) and ∥Σ0∥ = O(1) uniformly, we have ∥Σu∥ = O(1) uniformly, Then, with , which is the sample covariance matrix for the random sample {ui}1≤i≤n, by Lemma A.3 of Bickel and Levina (2008b) which holds under the assumption for the yij’s and log pn/n = o(1), we have
In particular, it means that
which completes the proof of the lemma. □
Lemma 3 Let S be a sample covariance matrix of a random sample yi1≤i≤n with yi ~ N(0, Σ0). Assume pn/n → y ∈ [0, 1), Σ0 has eigenvalues uniformly bounded as n → ∞, and A = A0 + Δ1, B = B0 + Δ2 are such that the constant matrices ∥A0∥, ∥B0∥ = O(1), with ∥Δ1∥, ∥Δ2∥ = oP(1). Then we still have maxi,j|(A(S−Σ0)B)ij| = OP({log pn/n}1/2).
Proof of Lemma 3. Consider
(5.3) |
where K1 = A0(S − Σ0)B0, K2 = Δ1(S − Σ0)B0, K3 = A0(S − Σ0)Δ2 and K4 = Δ1(S − Σ0)Δ2. Now maxi,j |(K1)ij | = OP({log pn/n}1/2) by Lemma 2. Consider K2. Suppose the maximum element of the matrix is at the (i, j)-th position. Consider ((S − Σ0)B0)ij, the (i, j)-th element of (S − Σ0)B0. Since each element in S − Σ0 has a rate OP(n−1/2), the i-th row of S − Σ0 has a norm of OP({pn/n}1/2). Also, the j-th column of B0 has ∥B0ej∥ ≤ ∥B0∥ = O(1). Hence, ((S − Σ0)B0)ij = OP({pn/n}1/2).
Hence, we can find cn = o({n/pn}1/2) such that each element in has an order larger than that in Δ1, since ∥Δ1∥ = oP(1) implies that each element in Δ1 is also oP(1) by Lemma 1.
Then suitable choice of cn leads to
(5.4) |
At the same time, Theorem 5.10 in Bai and Silverstein (2006) implies that, for yi ~ N(0, Σ0) and pn/n → y ∈ (0, 1), with probability one,
Hence, if we have pn/n = o(1), we must have , or it will contradict the above. It means that ∥S−Σ0∥ = oP(1) since Σ0 has eigenvalues uniformly bounded. Or, if pn/n → y ∈ (0, 1), then we have ∥S − Σ0∥ = OP(1) by the above.
Since S − Σ0 is symmetric, we can find a rotation matrix Q (i.e. QTQ = QQT = I) so that
where Λ is a diagonal matrix with real entries. Then we are free to control cn again so as to satisfy further that cn∥Λ∥2 = oP(∥Λ∥), since ∥Λ∥ = ∥S − Σ0∥ = OP(1) at most. Hence,
where the last line used the previous proof for constant matrix B0. Hence, combining this with (5.4), we have maxi,j |(K2)ij| = OP({log pn/n}1/2). Similar arguments go for K3 and K4. □
Proof of Theorem 1. The main idea of the proof is inspired by Fan and Li (2001) and Rothman et al. (2008). Let U be a symmetric matrix of size pn, DU be its diagonal matrix and RU = U − DU be its off-diagonal matrix. Set ΔU = αnRU + βnDU. We would like to show that, for αn = (sn1 log pn/n)1/2 and βn = (pn log pn/n)1/2, and for a set defined as ,
for sufficiently large constants C1 and C2. This implies that there is a local minimizer in such that for sufficiently large n, since Ω0 + ΔU is positive definite. This is shown by noting that
since Ω0 has eigenvalues uniformly bounded away from 0 and ∞ by condition (A), and ∥ΔU∥F = O(αn + βn) = o(1).
Consider, for Σ = Σ0 + ΔU, the difference
where
It is sufficient to show that the difference is positive asymptotically with probability tending to 1. Using Taylor’s expansion with the integral remainder, we have I1 = K1+K2, where
(5.5) |
with the definitions Ωv = Ω0 + vΔU, and . Now,
where we used ∥ΔU∥ ≤ C1αn + C2βn = O((log pn)(1−k)/2) = o(1) by our assumption.
Consider K1. It is clear that |K1| ≤ L1 + L2, where
Using Lemmas 1 and 2, we have
This is dominated by K2 when C1 and C2 are sufficiently large.
Now, consider I2 − L2 for penalties other than L1. Since on , we have that |ωij| = O(C1αn + C2βn) = o(1) for all . Also, note that the condition on λn1 ensures that, for , |ωij| = O(αn + βn) = o(λn1). Hence, by condition (C), for all , we can find a constant k1 > 0 such that
This implies that
Hence,
With the assumption that , we see from the above that I2−L2 ≥ 0 since , using .
For the L1-penalty, since we have maxi≠j |S − Σ0| = OP((log pn/n)1/2) by Lemma 2, we can find a positive W = OP(1) such that
Then we can set λn1 = 2W(log pn/n)1/2 or one with order greater than (log pn/n)1/2, and the above arguments are still valid, so that I2 − L2 > 0.
Now, with L1 dominated by K2 and I2 − L2 ≥ 0, the proof completes if we can show that I3 is also dominated by K2, since we have proved that K2 > 0. Using Taylor’s expansion, we can arrive at
where o(1) and O(1) are the terms independent of C1 and C2. By condition (B), we have
which is dominated by K2 with large enough constants C1 and C2. This completes the proof of the theorem. □
Proof of Theorem 2. For Ω a minimizer of (1.1), the derivative for q1(Ω) w.r.t. ωij for is
where sgn(a) denotes the sign of a. If we can show that the sign of ∂q1(Ω)/∂ωij depends on sgn(ωij) only with probability tending to 1, the optimum will be at 0, so that for all with probability tending to 1. We need to estimate the order of sij − σij independent of i and j.
Decompose sij − σij = I1 + I2, where
By Lemma 2 or Lemma A.3 of Bickel and Levina (2008b), it follows that maxi,j |I1| = OP({log pn/n}1/2). It remains to estimate the order of I2.
By Lemma 1, , which has order
where we used condition (A) to get ∥Σ0∥ = O(1), and using ηn → 0 so that ,
Hence, implies .
Combining the last two results yields that
By conditions (C) and (D), we have
for ωij in a small neighborhood of 0 (excluding 0 itself) and some positive constant C3. Hence, if ωij lies in a small neighborhood of 0, we need to have in order to have the sign of ∂q1(Ω)/∂ωij depends on sgn(ωij) only with probability tending to 1. The proof of the theorem is completed. □
Proof of Theorem 3. Because of the similarity between equations (2.4) and (1.1), the Frobenius norm result has nearly identical proof as Theorem 1, except that we now set ΔU = αnU. For the operator norm result, we refer readers to the proof of Theorem 2 of Rothman et al. (2008). □
Proof of Theorem 5. The proof is similar to that of Theorem 1. We only sketch briefly the proof, pointing out the important differences.
Let αn = (sn2 log pn/n)1/2 and βn = (pn log pn/n)1/2, and define . Want to show
for sufficiently large constants C1and C2.
For Σ = Σ0 + ΔU, the difference
where
with I1 = K1 + K2, where
(5.6) |
and Σv = Σ0 + vΔU, SΩ0 is the sample covariance matrix of a random sample {xi}1≤i≤n having xi ~ N(0, Ω0). Also,
(5.7) |
The treatment of K2is different from that in Theorem 1. By condition (A), and (pn + sn2)(log pn)k/n = O(1) for some k > 1, we have
Thus, we can use the Neumann series expansion to arrive at
where the little o (or oP, O or OP in any matrix expansions in the remainder of this proof) represents a function of the L2 norm of the residual matrix in the expansion. That is, , and . With SI difined as the sample covariance matrix formed from a random sample {xi}1≤i≤n having xi ~ N(0, I),
(see arguments in Lemma 3). These entail
Combining these results, we have
Consequently,
All other terms are dealt with similarly as in the proof of Theorem 1, and hence we omit them. □
Proof of Theorem 6. The proof is similar to that of Theorem 2. We only show the main differences.
It is easy to show
Our aim is to estimate the order of |(−ΩSΩ + Ω)ij|, finding an upper bound which is independent of both i and j.
Write
where I1 = −Ω(S − Σ0)Ω and I2 = Ω(Σ − Σ0)Ω. Since
we have
where by Lemma 1, with ∥Σ−Σ0∥2 = O(ηn). Hence, we can apply Lemma 3 and conclude that maxi,j |(I1)ij| = OP({log pn/n}1/2).
For I2, we have
Hence, we have
The rest goes similar to the proof of Theorem 2, and is omitted. □
Proof of Theorem 7. The proof is nearly identical to that of Theorem 5, except that we now set ΔU = αnU. The fact that has no estimation error eliminates an order (pn log pn/n)1/2 that contributes from estimating for (3.2). This is why we can estimate a sparse correlation matrix more accurately.
For the operator norm result, we refer readers to the proof of Theorem 2 of Rothman et al. (2008). □
Proof of Theorem 10. For (T,D) a minimizer of (4.2), the derivative for q3(T,D) w.r.t. tij for is
Now STTD−1 = I1 + I2 + I3 + I4, where
By the MCD (4.1), . Since i > j for , we must have . Hence, we can ignore I4.
Since ∥T−T0∥2 = O(ηn) and ∥D−D0∥2 = O(ζn) with ηn, ζn = o(1), and by condition (A) we can easily show . Then we can apply Lemma 3 to show that maxij |(I1)ij| = (log pn/n)1/2.
For I2, we have . And finally. .
With all these, we have . The rest of the proof goes like that of Theorem 2 or 6. □
Footnotes
Financial support from the NSF grant DMS-0354223, DMS-0704337 and NIH grant R01-GM072611 is gratefully acknowledged.
AMS 2000 subject classifications. Primary 62F12; secondary 62J07.
Contributor Information
Clifford Lam, Department of Statistics, London School of Economics and Political Science, London, WC2A 2AE (C.Lam2@lse.ac.uk).
Jianqing Fan, Department of Operation Research and Financial Engineering, Princeton University, Princeton, NJ 08544 (jqfan@princeton.edu).
References
- [1].Bai Z, Silverstein JW. Spectral Analysis of Large Dimensional Random Matrices. Science Press; Beijing: 2006. [Google Scholar]
- [2].Bickel PJ, Levina E. Covariance Regularization by Thresholding. Ann. Statist. 2008a;36(6):2577–2604. [Google Scholar]
- [3].Bickel PJ, Levina E. Regularized Estimation of Large Covariance Matrices. Ann. Statist. 2008b;36(1):199–227. [Google Scholar]
- [4].Cai T, Zhang C-H, Zhou H. Optimal rates of convergence for co-variance matrix estimaiton. the Wharton School, University of Pennsylvania; 2009. Technical report. [Google Scholar]
- [5].d’Aspremont A, Banerjee O, El Ghaoui L. First-order Methods For Sparse Covariance Selection. SIAM. J. Matrix Anal. and Appl. 2008;30(1):56–66. [Google Scholar]
- [6].Dempster AP. Covariance Selection. Biometrics. 1972;28:157–175. [Google Scholar]
- [7].Diggle P, Verbyla A. Nonparametric Estimation of Covariance Structure in Longitudinal Data. Biometrics. 1998;54(2):401–415. [PubMed] [Google Scholar]
- [8].El Karoui N. Operator Norm Consistent Estimation of a Large Dimensional Sparse Covariance Matrices. Ann. Statist. 2008;36(6):2717–2756. [Google Scholar]
- [9].Fan J, Feng Y, Wu Y. Network Exploration via the Adaptive LASSO and SCAD Penalties. Annals of Applied Statistics. 2009;3(2):521–541. doi: 10.1214/08-AOAS215SUPP. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J. Amer. Statist. Assoc. 2001;96:1348–1360. [Google Scholar]
- [11].Fan J, Peng H. Nonconcave Penalized Likelihood With a Diverging Number of Parameters. Ann. Statist. 2004;32:928–961. [Google Scholar]
- [12].Friedman J, Hastie T, Tibshirani R. Sparse Inverse Covariance Estimation with the Graphical LASSO. Biostatistics. 2008;9(3):432–441. doi: 10.1093/biostatistics/kxm045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [13].Huang J, Horowitz J, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 2008;36:587–613. [Google Scholar]
- [14].Huang J, Liu N, Pourahmadi M, Liu L. Covariance Matrix Selection and Estimation via Penalised Normal Likelihood. Biometrika. 2006;93(1):85–98. [Google Scholar]
- [15].Levina E, Rothman AJ, Zhu J. Sparse Estimation of Large Covariance Matrices via a Nested Lasso Penalty. Ann. Applied Statist. 2008;2(1):245–263. [Google Scholar]
- [16].Meier L, van de Geer S, Bühlmann P. The group Lasso for logistic regression. Journal of the Royal Statistical Society, B. 2008;70:53–71. [Google Scholar]
- [17].Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the Lasso. Ann. Statist. 2006;34:1436–1462. [Google Scholar]
- [18].Pourahmadi M. Joint Mean-Covariance Models with Applications to Longitudinal Data: Unconstrained Parameterisation. Biometrika. 1999;86:677–690. [Google Scholar]
- [19].Ravikumar P, Lafferty J, Liu H, Wasserman L. Advances in Neural Information Processing Systems. MIT Press; 2008. Sparse additive models; p. 20. [Google Scholar]
- [20].Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse Permutation Invariant Covariance Estimation. Electron. J. Statist. 2008;2:494–515. [Google Scholar]
- [21].Smith M, Kohn R. Parsimonious Covariance Matrix Estimation for Longitudinal Data. J. Amer. Statist. Assoc. 2002;97(460):1141–1153. [Google Scholar]
- [22].Wagaman AS, Levina E. Discovering sparse covariance structures with the Isomap. Journal of Computational and Graphical Statistics. 2008;18 to appear. [Google Scholar]
- [23].Wong F, Carter C, Kohn R. Efficient Estimation of Covariance Selection Models. Biometrika. 2003;90:809–830. [Google Scholar]
- [24].Wu WB, Pourahmadi M. Nonparametric Estimation of Large Covariance Matrices of Longitudinal Data. Biometrika. 2003;94:1–17. [Google Scholar]
- [25].Yuan M, Lin Y. Model Selection and Estimation in the Gaussian Graphical Model. Biometrika. 2007;90:831–844. [Google Scholar]
- [26].Zhang CH. Penalized Linear Unbiased Selection. the statistics dept., Rutgers University; 2007. Technical report 2007-003. [Google Scholar]
- [27].Zhao P, Yu B. On Model Selection Consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563. [Google Scholar]
- [28].Zou H. The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 2006;101:1418–1429. [Google Scholar]
- [29].Zou H, Li R. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models (With Discussion) Ann. Statist. 2008;36(4):1509–1533. doi: 10.1214/009053607000000802. [DOI] [PMC free article] [PubMed] [Google Scholar]