Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2014 Jun 17.
Published in final edited form as: Ann Stat. 2013 Oct 1;41(5):2505–2536. doi: 10.1214/13-AOS1159

CALIBRATING NON-CONVEX PENALIZED REGRESSION IN ULTRA-HIGH DIMENSION

Lan Wang *, Yongdai Kim , Runze Li
PMCID: PMC4060811  NIHMSID: NIHMS590995  PMID: 24948843

Abstract

We investigate high-dimensional non-convex penalized regression, where the number of covariates may grow at an exponential rate. Although recent asymptotic theory established that there exists a local minimum possessing the oracle property under general conditions, it is still largely an open problem how to identify the oracle estimator among potentially multiple local minima. There are two main obstacles: (1) due to the presence of multiple minima, the solution path is nonunique and is not guaranteed to contain the oracle estimator; (2) even if a solution path is known to contain the oracle estimator, the optimal tuning parameter depends on many unknown factors and is hard to estimate. To address these two challenging issues, we first prove that an easy-to-calculate calibrated CCCP algorithm produces a consistent solution path which contains the oracle estimator with probability approaching one. Furthermore, we propose a high-dimensional BIC criterion and show that it can be applied to the solution path to select the optimal tuning parameter which asymptotically identifies the oracle estimator. The theory for a general class of non-convex penalties in the ultra-high dimensional setup is established when the random errors follow the sub-Gaussian distribution. Monte Carlo studies confirm that the calibrated CCCP algorithm combined with the proposed high-dimensional BIC has desirable performance in identifying the underlying sparsity pattern for high-dimensional data analysis.

Keywords: High-dimensional regression, LASSO, MCP, SCAD, variable selection, penalized least squares

1. Introduction

High-dimensional data, where the number of covariates p greatly exceeds the sample size n, arise frequently in modern applications in biology, chemometrics, economics, neuroscience and other scientific fields. To facilitate the analysis, it is often useful and reasonable to assume that only a small number of covariates are relevant for modeling the response variable. Under this sparsity assumption, a widely used approach for analyzing high-dimensional data is regularized or penalized regression. This approach estimates the unknown regression coefficients by solving the following penalized regression problem

minβRp{(2n)1yXβ2+j=1ppλ(βj)}, (1.1)

where y is the vector of responses, X is an n × p matrix of covariates, β = (β1, … ,βp)T is the vector of unknown regression coefficients, ∥ · ∥ denotes the L2 norm (Euclidean norm), and Pλ(·) is a penalty function which depends on a tuning parameter λ > 0. Many commonly used variable selection procedures in the literature can be cast into the above framework, including the best subset selection, L1 penalized regression or Lasso (Tib-shirani, 1996), Bridge regression (Frank and Friedman, 1993), SCAD (Fan and Li, 2001), MCP (Zhang, 2010), among others.

The Lasso penalized regression is computationally attractive and enjoys great performance in prediction. However, it is known that Lasso requires rather stringent conditions on the design matrix to be variable selection consistent (Zou, 2006; Zhao and Yu, 2006). Focusing on identifying the unknown sparsity pattern, non-convex penalized high-dimensional regression has recently received considerable attention. Fan and Li (2001) first systematically studied nonconvex penalized likelihood for fixed finite dimension p. In particular, they recommended the SCAD penalty which enjoys the oracle property for variable selection. That is, it can estimate the zero coefficients as exact zero with probability approaching one, and estimate the non-zero coefficients as efficiently as if the true sparsity pattern is known in advance. Fan and Peng (2004) extended these results by allowing p to grow with n at the rate p = o(n1/5) or p = o(n1/3). For high dimensional nonconvex penalized regression with pn, Kim et al. (2008) proved that the oracle estimator itself is a local minimum of SCAD penalized least squares regression under very relaxed conditions; Zhang (2010) proposed a minimax concave penalty (MCP) and devised a novel PLUS algorithm which when used together can achieve the oracle property under certain regularity conditions. Important insight has also been gained through the recent work on theoretical analysis of the global solution (Kim and Kwon, 2012; Zhang and Zhang, 2012). However, direct computation of the global solution to the nonconvex penalized regression is infeasible in high dimensional setting.

For practical data analysis, it is critical to find an easy-to-implement procedure which can find a local solution with satisfactory theoretical propertyeven when the number of covariates greatly exceeds the sample size. Two challenging issues remain unsolved. One is the problem of multiple local minima; the other is the problem of optimal tuning parameter selection.

A direct consequence of the multiple local minima problem is that the solution path is not unique and is not guaranteed to contain the oracle estimator. This problem is due to the nature of the non-convexity of the penalty. To understand it, we note that the penalized objective function in (1.1) is non-convex in β whenever the convexity of the least squares loss function does not dominate the concavity of the penalty part. In general, the occurrence of multiple minima is unavoidable unless strong assumptions are imposed on both the design matrix and the penalty function. The recent theory for SCAD penalized linear regression (Kim et al., 2008) and for general non-concave penalized generalized linear models (Fan and Lv, 2011) indicates that one of the local minima enjoys the oracle property but it is still an unsolved problem how to identify the oracle estimator among multiple minima when pn. Popularly used algorithms generally only en sure the convergence to a local minimum, which is not necessarily the oracle estimator. Numerical evidence in Section 4 suggests that the local minima identified by some of the popular algorithms have a relatively low probability to recover the unknown sparsity pattern although it may have small estimation error.

Even if a solution path is known to contain the oracle estimator, identifying such a desirable estimator from the path is itself a challenging problem in ultra-high dimension. The main issue is to find the optimal tuning parameter which yields the oracle estimator. The theoretically optimal tuning parameter does not have an explicit representation and depends on unknown factors such as the variance of the unobserved random noise. Cross-validation is commonly adopted in practice to select the tuning parameter but is observed to often result in overfitting. In the case of fixed p, Wang, Li and Tsai (2007) rigorously proved that generalized cross-validation leads to an overfitted model with a positive probability for SCAD-penalized regression. Effective BIC-type criterion for nonconvex penalized regression has been investigated in Wang, Li and Tsai (2007) and Zhang, Li and Tsai (2010) for fixed p; and in Wang, Li and Leng (2009) for diverging p (but p < n). However, to the best of our knowledge, there is still no satisfactory tuning parameter selection procedure for nonconvex penalized regression in ultra-high dimension.

The above two main concerns motivate us to consider calibrating nonconvex penalized regression in ultra-high dimension with the goal to identify the oracle estimator with high probability. To achieve this, we first prove that a calibration of the CCCP algorithm (Kim et al., 2008) for non-convex penalized regression produces a consistent solution path with probability approaching one in merely two steps under conditions much more relaxed than what would be required for the Lasso estimator to be model selection consistent. Furthermore, extending the recent work of Chen and Chen (2008) and Kim et al. (2011) for Bayesian information criterion (BIC) on high dimensional least squares regression, we propose a high-dimensional BIC for a nonconvex penalized solution path and prove its validity under more general conditions when p grows at an exponential rate. The recent independent work of Zhang (2010, 2012) devised a multi-stage convex relaxation scheme and proved that for the capped L1 penalty the algorithm can find a consistent solution path with probability approaching one under certain conditions. Despite the similar flavor shared with the algorithm proposed in this paper, his algorithm takes multiple steps (which can be very large in practice depending on the design condition) and the paper has not studied the problem of tuning parameter selection.

To deepen our understanding of the nonconvex penalized regression, we also derive an interesting auxiliary theoretical result of an upper bound on the L2 distance between a sparse local solution of nonconvex penalized regression and the oracle estimator. This result is new and insightful. It suggests that under general regularity conditions a sparse local minimum can often have small estimation error even though it may not be the oracle estimator. Overall, the theoretical results in this paper fill in important gaps in the literature, thus substantially enlarge the scope of applications of nonconvex penalized regression in ultra-high dimension. In Monte Carlo studies, we demonstrate that the calibrated CCCP algorithm combined with the proposed high-dimensional BIC is effective in identifying the underlying sparsity pattern.

The rest of the paper is organized as follows. In Section 2, we define the notation, review the CCCP algorithm and introduce the new methodology. In Section 3, we establish that the proposed calibrated CCCP solution path contains the oracle estimator with probability approaching one under general conditions, and that the proposed high-dimensional BIC is able to select the optimal tuning parameter with probability tending to one. In Section 4, we report numerical results from Monte Carlo simulations and a real data example. In Section 5, we present an auxiliary theoretical result which sheds light on the estimation accuracy of a local minimum of non-convex penalized regression if it is not the oracle estimator. The proofs are given in Section 6.

2. Calibrated non-convex penalized least squares method

2.1. Notation and setup

Suppose that {(Yi,xi)}i=1n is a random sample from the linear regression model:

y=Xβ+, (2.1)

where y = (Y1, … , Yn)T , X is the n × p non-stochastic design matrix with the ith row (Y1,,Yn)T is the vector of unknown true parameters, and = (1, … , n)T is a vector of independent and identically distributed random errors.

We are interested in the case where p = pn greatly exceeds the sample size n. The vector of the true parameters β* is assumed to be sparse in the sense that the majority of its components are exactly zero. Let A0={j:βj0} be the index set of covariates with non-zero coefficients and let A0=q denote the cardinality of A0. We use d=min{βj:βj0} to denote the minimal absolute value of the non-zero coefficients. Without loss of generality, we may assume that the first q components of β* are non-zero, thus we can write β=(β1T,0T)T, where 0 represents a zero vector of length pq. The oracle estimator is defined as β^(o)=(β^1(o)T,0T)T, where β^1(o) is the least squares estimator fitted using only the covariates whose indices are in A0.

To handle the high-dimensional covariates, we consider the penalized regression in (1.1). The penalty function pλ(t) is assumed to be increasing and concave for t ∈ [0, +∞) with a continuous derivative p.λ(t) on (0, +∞). To induce sparsity of the penalized estimator, it is generally necessary for the penalty function to have a singularity at the origin, i.e., p.λ(0+)>0. Without loss of generality, the penalty function can be standardized such that p.λ(0+)=λ. Furthermore, it is required that

p.λ(t)λ,0<t<α0λ, (2.2)
p.λ(t)=0,t>α0λ, (2.3)

for some positive constant a0. Condition (2.3) plays the key role of not over-penalizing large coefficients, thus alleviating the bias problem associated with Lasso.

The above class of penalty functions include the popularly used SCAD penalty and MCP. The SCAD penalty is defined by

p.λ(t)=λ{I(tλ)+(aλt)+(a1)λI(t>λ)} (2.4)

for some a > 2, where the notation b+ stands for the positive part of b, i.e., b+ = bI(b > 0). Fan and Li (2001) recommended to use a = 3.7 from a Bayesian perspective. On the other hand, the MCP is defined by p.λ(t)=a1(aλt)+ for some a > 0 (as a ↓ 1, it amounts to hard-thresholding, thus in the following we assume a > 1).

Let x(j) be the jth column vector of X. Without loss of generality, we assume that x(j)Yx(j)n=1 for all j. Throughout this paper the following notation is used. For an arbitrary index set A ⊆ {1, 2 … , p}, XA denotes the n × ∣A∣ submatrix of X formed by those columns of X whose indices are in A. For a vector v = (v1, … , vp)’, we use ∥v∥ to denote its L2 norm; on the other hand ∥v0 = #{j : vj ≠ 0} denotes the L0 norm, ∥v1 = ∑jvj∣ denotes the L1 norm and ∥v = maxjvj∣ denotes the L norm. We use vA to represent the size-∣A∣ subvector of v formed by the entries vj with indices in A. For a symmetric matrix B, λmin(B) and λmax(B) stand for the smallest and largest eigenvalues of B, respectively. Furthermore, we let

ξmin(m)=minBm,A0Bλmin(n1XBTXB). (2.5)

Finally, p, q, λ and other related quantities are all allowed to depend on n, but we suppress such dependence for notational simplicity.

2.2. The CCCP algorithm

It is challenging to solve the penalized regression problem in (1.1) when the penalty function is nonconvex. Kim et al. (2008) proposed a fast optimization algorithm called the SCAD-CCCP (CCCP stands for ConCave Convex procedure) algorithm for solving the SCAD-penalized regression. The key idea is to update the solution with the minimizer of the tight convex upper bound of the objective function obtained at the current solution. What makes a fast algorithm practical relies on the possibility of decomposing the non-convexed penalized least squares objective function as the sum of a convex function and a concave function. To be specific, suppose we want to minimize an objective function C(β) which has the representation C(β) = Cvex(β) + Ccav(β) for a convex function Cvex(β) and a concave function Ccav(β). Given a current solution β(k), the tight convex upper bound of C(β) is given by Q(β) = Cvex(β) + ▿Ccav(βk)’ β where ▿Ccav(β) = ∂Ccav(β)/∂β. We then update the solution by minimizing Q(β). Since Q(β) is a convex function, it can be easily minimized.

For the penalized regression in (1.1), we consider a penalty function pλ(∣βj∣) which has the decomposition

pλ(βj)=Jλ(βj)+λβj, (2.6)

where Jλ(∣βj∣) is a differentiable concave function. For example, for the SCAD penalty,

Jλ(βj)=βj22λβj+λ22(a1)I(λβjaλ)+[(a+1)λ22λβj]I(βj>aλ);

while for the MCP penalty,

Jλ(βj)=βj22aI(0βj<aλ)+[aλ22λβj]I(βjaλ).

Hence, using the decomposition in (2.6), the penalized objective function in (1.1) can be rewritten as

12nyXβ2+j=1pJλ(βj)+λj=1pβj,

which is the sum of convex and concave functions. The CCCP algorithm is applied as follows. Given a current solution β(k), the tight convex upper bound is

Q(ββ(k),λ)=12nyXβ2+j=1pJλ(βj(k))βj+λj=1pβj. (2.7)

We then update the current solution by β(k+1) = arg minβ Q(ββ(k), λ).

An important property of the CCCP algorithm is that the objective function always decreases after each iteration (Yuille and Rangarajan, 2003 and Tao and An, 1997), from which it can be deduced that the solution converges to a local minimum. See, for example, Corollary 3.2 of Hunter and Li (2005). However, there is no guarantee that the local minimum found is the oracle estimator itself because there are multiple local minima and the solution of the CCCP algorithm depends on the choice of the initial solution.

2.3. Calibrated non-convex penalized regression

In this paper, we propose and study a calibrated CCCP estimator. More specifically, we start with the initial value β(0) = 0 and a tuning parameter λ > 0, and let Q be the tight convex upper bound defined in (2.7). The calibrated algorithm consists of the following two steps.

  1. Let β^(1)(λ)=agrminβQ(ββ(0),τλ), where the choice of τ > 0 will be discussed later.

  2. 2. Let β^(1)(λ)=agrminβQ(ββ^(1)(λ),λ).

When we consider a sequence of tuning parameter values, we obtain a solution path {β^(λ):λ>0}. The calculation of the path is fast even for very high-dimensional p as for each of the two steps a convex minimization problem is solved. In step 1, a smaller tuning parameter τλ is adopted to increase the estimation accuracy, see Section 3.1 for discussions on the practical choice of τ. We call a solution path “path consistent” if it contains the oracle estimator. In Section 3.1, we will prove that the calibrated CCCP algorithm produces a consistent solution path under rather weak conditions.

Given such a solution path, a critical question is how to tune the regularization parameter λ in order to identify the oracle estimator. The performance of a penalized regression estimator is known to heavily depend on the choice of the tuning parameter. To further calibrate non-convex penalized regression, we consider the following high-dimensional BIC criterion (HBIC) to compare the estimators from the above solution path:

HBIC(λ)=log(σ^λ2)+MλCnlog(p)n, (2.8)

where Mλ={j:β^j(λ)0} is the model identified by β^(λ),Mλ denotes the cardinality of Mλ, and σ^λ2=n1SSEλ with SSEλ=YXβ^(λ)2. As we are interested in the case where p greatly exceeds n, the penalty term also depends on p; and Cn is a sequence of numbers that diverges to ∞, which will be discussed later.

We compare the value of the above HBIC criterion for λ ∈ Λn = {λ : ∣Mλ∣ ≤ Kn}, where Kn > q represents a rough estimate of an upper bound of the sparsity of the model and is allowed to diverge to ∞. We select the tuning parameter

λ^=argminλΛnHBIC(λ).

The above criterion extends the recent works of Chen and Chen (2008) and Kim et al. (2012) on the high-dimensional BIC for the least squares regression to tuning parameter selection for nonconvex penalized regression. In Sections 3.1-3.3, we study asymptotic properties under conditions such as sub-Gaussian random errors, dimension of the covariates growing at the exponential rate and diverging Kn.

3. Theoretical properties

The main theory comprises two parts. We first show that under some general regularity conditions the calibrated CCCP algorithm yields a solution path with the “path consistency” property. We next verify that when the proposed high-dimensional BIC is applied to this solution path to choose the tuning parameter λ, with probability tending to one the resulted estimator is the oracle estimator itself.

To facilitate the presentation, we specify a set of regularity conditions.

(A1) There exists a positive constant C1 such that λmin(n1XA0TXA0)C1.

(A2) The random errors 1, … , ν are i.i.d. mean zero sub-Gaussian random variables with a scale factor 0 < σ < ∞, i.e., E[exp(t∊i)] ≤ eσ2t2/2, ∀ t.

(A3) The penalty function pλ(t) is assumed to be increasing and concave for t ∈ [0, +) with a continuous derivative p.λ(t) on (0, +∞). It admits a convex-concave decomposition as in (2.6) with Jλ(·) satisfies: ▿Jλ(∣t∣) = −λsign(t) for ∣t∣ > , where a > 1 is a constant; and ∣▿Jλ(∣t∣)∣ ≤ ∣t∣ for ∣t∣ ≤ , where ba is a positive constant.

(A4) The design matrix X satisfies: γ=minδ0,δA0c13δA01XδnδA0>0.

(A5) Assume that λ = o(d*) and τ = o(1), where d* is defined on page 5, λ and τ are the two parameters in the modified CCCP algorithm given in the first paragraph of Section 2.3.

Remark 1

Condition (A1) concerns the true model and is a common assumption in the literature on high-dimensional regression. Condition (A2) implies that for a vector a = (a1, … , an)T,

P(aT>t)2exp(t22σ2a2),t0. (3.1)

Condition (A3) is satisfied by popular nonconvex penalty functions such as SCAD and MCP. Note that the condition ▿Jλ(∣t∣) = −λsign(t) for ∣t∣ > is equivalent to assuming that p.λ(t)=0,t>aλ i.e., large coefficients are not penalized, which is exactly the motivation for nonconvex penalties. Condition (A4), which is given in Bickel et al. (2009), ensures a desirable bound on the L1 estimation loss of the Lasso estimator. Note that the CCCP algorithm yields the Lasso estimator after the first iteration, so the asymptotic properties of the CCCP estimator is related to that of the Lasso estimator. Condition (A4) holds under the restricted eigenvalue condition which is known to be a relatively mild condition on the design matrix for high-dimensional estimation. In particular, it is known to hold in some examples where the covariates are highly dependent, and is much weaker than the irrepresentable condition (Zhao and Yu, 2006) which is almost necessary for Lasso to be model selection consistent.

3.1. Property of the solution path

We first state a useful lemma that characterizes a nonasymptotic property of the oracle estimator in high dimension. The result is an extension of that in Kim et al. (2008) under the more general sub-Gaussian random error condition.

Lemma 3.1

For any given 0 < b1 < 1 and 0 < b2 < 1, consider the events

Fn1={maxjA0β^j()βjb1λ}andFn2={maxjA0cSj(β^())b2λ}

where Sj(β) = −n−1x(j)T(yXβ). Then under conditions (A1) and (A2),

P(Fn1Fn2)12qexp[C1b12nλ2(2σ2)]2(pq)exp[nb22λ2(2σ2)].

The proof of Lemma 3.1 is given in the online supplementary material. Theorem 3.2 below provides a non-asymptotic bound of the probability the solution path contains the oracle estimator. Under general conditions, this probability tends to one.

Theorem 3.2

(1) Assume that conditions (A1)-(A5) hold. If τγ−2q = o(1), then for all n sufficiently large,

P(β^(λ)β^())18pexp(nτ2λ2(8σ2)).

(2) Assume that conditions (A1)-(A5) hold. If nτ2λ2 → ∞, log p = o(2λ2) and τγ−2q = o(1), then

P(β^(λ)=β^())1

as n → ∞.

Remark 2

Meinshausen and Yu (2009) considered thresholding Lasso, which has the oracle property under an incoherent design condition in the ultra-high dimension. Zhou (2010) further proposed and investigated a multi-step thresholding procedure which can accurately estimate the sparsity pattern under the restricted eigenvalue condition of Bickel et al. (2009). These theoretical results are derived by assuming the initial Lasso is obtained using a theoretical tuning parameter value, which depends on the unknown random noise variance σ2. Estimating σ2 is a difficult problem in high-dimensional setting, particularly when the random noise is non-Gaussian. On the other hand, if the true value of σ2 is known a priori, then it is possible to derive variable selection consistency under somewhat more relaxed conditions on the design matrix than those in the current paper. Adaptive Lasso, originally proposed by Zou (2006) for fixed dimension, was extended to high dimension by Huang et al. (2008) under a rather strong mutual incoherence condition. Zhou, van der Geer and Bühlmann (2009) derived the consistency of adaptive Lasso in high dimension under similar conditions on X, but still requires complex conditions on s and d*. Some favorable empirical performance of the multi-step thresholded Lasso versus the adaptive Lasso was reported in Zhou (2010). A theoretical comparison of these two procedures in high dimension was considered by van de Geer, Bühlmann and Zhou (2011) and Chapter 7 of Bühlmann and van de Geer (2011). For both adaptive and thresholded Lasso, if a covariate is deleted in the first step, it will be excluded from the final selected model. Zhang (2010) proved that selection consistency holds for the MCP solution at the universal penalty level σ2logpn. The LLA algorithm, which Zou and Li (2008) originally proposed for fixed dimensional models, alleviates this problem and has the potential to be extended to the ultra-high dimension under conditions similar as those in this paper. Needless to say, the performances of the above procedures all depend on the choice of tuning parameter. However, the important issue of tuning parameter selection has not been addressed.

Remark 3

We proved that the calibrated CCCP algorithm which involves merely two iterations is guaranteed to yield a solution path that contains the oracle estimator with high probability under general conditions. To provide some intuition on this theory, we first note that the first step of the algorithm yields the Lasso estimator, albeit with a small penalty level τλ. If we denote the first step estimator by β^j(Lasso)(τλ), then based on the optimization theory, the oracle property is achieved when

minjA0β^j(Lasso)(τλ)aλ>λ,sign(β^j())=sign(βj),jA0,maxjA0Jλ(β^j(Lasso)(τλ))+n1XA0cT(YX)β^()λ.

The proof of Theorem 3.2 relies on the following condition:

β^(Lasso)(τλ)βλ2,minβj0βj>aλ+λ2, (3.2)

for the given a > 1. The proof proceeds by bounding the first part of (3.2) using a result of Bickel et al. (2009) via β^(Lasso)(τλ)ββ^Lasso(τλ)β2. In Section 3.3, we considered an alternative approach using the recent result of Zhang and Zhang (2012), which leads to weaker requirement on the minimal signal strength under slightly stronger assumptions on the design matrix. We also noted that Theorem 3.2 holds for any a > 1, although in the numerical studies we use the familiar a = 3.7.

How fast the probability that our estimator is equal to the oracle estimator approaches one depends on the sparsity level, the magnitude of the smallest signal, the size of the tuning parameter and the condition of the design matrix. Corollary 3.3 below confirms that the path-consistency can hold in ultra-high dimension.

Corollary 3.3

Assume that conditions (A1)-(A4) hold. Suppose there are two positive constants γ0 and K such that γγ0 > 0 and q < K. If d* = O(nc1) for some c1 ≥ 0 and p = O(exp(nc2)) for some c2 > 0, then

P(β^(λ)=β^())1,

provided λ = O(nc3) for some c3 > c1, τ2n1−2c3c2 → ∞ and τ = o(1).

The above corollary indicates that if the true model is very sparse (i.e. q < K) and the design matrix behaves well (i.e. γγ0 > 0), then we can take τ to be a sequence that converges to 0 slowly, for example, τ = 1/log n. On the other hand, if one is concerned that the true model may not be very sparse (q → ∞) and the design matrix may not behave very well (γ → 0), then an alternative choice is to take τ = λ which works also quite well in practice. The following corollary establishes that under some general conditions, the choice of τ = λ yields a consistent solution path under ultra high-dimensionality.

Corollary 3.4

Assume that conditions (A1)-(A4) hold. If q = O(nc1) for some c1 ≥ 0, d* = O(nc2) for some c2 ≥ 0, γ = O(nc3) for some c3 ≥ 0, p = O(exp(nc4)) for some 0 < c4 < 1, λ = O(nc5) for some max(c2, c1 + 2c3) < c5 < (1 − c4)/4 and τ = λ, then

P(β^(λ)=β^())1.

3.2. Property of the high-dimensional BIC

Theorem 3.5 below establishes the effectiveness of the HBIC defined in (2.8) for selecting the oracle estimator along a solution path of the calibrated CCCP.

Theorem 3.5

(Property of HBIC) Assume that the conditions of Theorem 3.2(2) hold, and there exists a positive constant κ such that

limnminAA0,AKn{n1(InPA)XA0βA02}κ, (3.3)

where In denotes the n × n identity matrix and PA denotes the projection matrix onto the linear space spanned by the columns of XA. If Cn, → ∞, qCn log(p) = o(n) and Kn2log(p)log(n)=o(n), then

P(Mλ^=A0)1,

as n, p → ∞.

Remark 4

Condition (3.3) is an asymptotic model identifiability condition, similar to that in Chen and Chen (2008). This condition states that if we consider any model which contains at most Kn covariates, it cannot predict the response variable as well as the true model does if it is not the true model. To give some intuition of this condition, as in Chen and Chen (2008), one can show that for AA0,

n1(InPA)XA0βA02λmin(n1XA0ATXA0A)βA0Ac2λmin(n1XA0ATXA0A)minβj0βj2.

The theorem confirms that the BIC criterion for shrinkage parameter selection investigated in Wang, Li and Tsai (2007), Wang, Li and Leng (2009) and Zhang, Li and Tsai (2010) can be modified and extended to ultra-high dimensionality. Carefully examining the proof, it is worth noting that the consistency of the HBIC only requires a consistent solution path but does not rely on the particular method used to construct the path. Hence, the proposed HBIC has the potential to be generalized to other settings with ultra-high dimensionality. The sequence Cn should diverge to ∞ slowly, e.g. Cn = log(log n), which is used in our numerical studies.

3.3. Relaxing the conditions on the minimal signal

Theorem 3.2, which is the main result of the paper, implies that the oracle property of the calibrated CCCP estimator requires the following lower bound on the magnitude of the smallest nonzero regression coefficient

dλcqlogpn, (3.4)

where ab means limn→∞ a/b = ∞, and c is a constant that depends on the design matrix X and other unknown factors such as σ2. When the true model dimension q is fixed, the lower bound for d* is arbitrarily close to the optimal lower bound clogpn for nonconvex penalized approaches (e.g. Zhang, 2010). However, when q is diverging, this bound is suboptimal. In general, there is a tradeoff between the conditions on d* and the conditions on the design matrix. Comparing to the results in the literature, Theorem 3.2 imposes weak conditions on the design matrix and the algorithm we investigate is transparent. In this section, we will prove that the optimal lower bound of d* can be achieved by the calibrated CCCP procedure under a set of slightly stronger conditions on the design matrix.

Note that the calibrated CCCP estimator depends on β^(1), which is the Lasso estimator obtained after the first iteration of the CCCP algorithm. In fact, the lower bound of d* is proportional to the l convergence rate of β^(1), to β*, and Condition (A4) only implies that maxjβ^j(1)βj is proportional to Op(qlogpnτ). If

maxjβ^j(1)βj=Op(logpnτ), (3.5)

we can show that dclogpnτ for any τ = o(1); and hence we can achieve almost the optimal lower bound for d*. Now, the question is under what conditions inequality (3.5) holds. Let vij be the (i, j) entry of XT X. Lounici (2008) derived the convergence rate (3.5) under the condition of mutual coherence:

maxijvij>bq (3.6)

for some constant b > 0. However, the mutual coherence condition would be too strong for practical purposes when q is diverging, since it requires that the pairwise correlations between all possible pairs are sufficiently small. In this subsection, we give an alternative condition for (3.5) based on the l1 operation norm of XT X.

We replace condition (A4) with the slightly stronger condition (A4’) below. We also introduce an additional condition (A6) based on the matrix l1 operational norm. For a given m × m matrix A, the l1 operational norm ∥A1 is defined by A1=maxi=1,,mj=1maij, where aij is the (i, j)th entry of A. Let

ζmax(m)=maxBm,A0Bn1XBTXB1,ζmin(m)=maxBm,A0B(n1XBTXB)11

Condition (A4’): There exist positive constants α and κmin such that

ξmin((α+1)q)κmin (3.7)

and

ξmax(αq)α1576κmin(13ξmax(αq)ακmin)2, (3.8)

where ξmax(m)=maxBm,A0Bλmax(n1XBTXB).

Condition (A6): Let u = α + 1. There exist finite positive constants ηmax and ηmin such that

limsupnζmax(uq)ηmax<

and

limsupnζmin(uq)ηmin<.

Remark 5

Similar conditions to Condition (A4’) were considered by Meinshausen and Yu (2009) and Bickel et al. (2009) for the l2 convergence of the Lasso estimator. However, (3.8) of Condition (A4’), which essentially assumes that ξmax(αq)/α is sufficiently small, is weaker, at least asymptotically, than the corresponding condition in Meinshausen and Yu (2009) and Bickel et al. (2009), which assumes that ξmax(q + min{n, p}) is bounded. Zhang and Zhang (2012) proved that {j:β^j0}A0(α+1)q under Condition (A4’). In addition, Condition (A4’) implies Condition (A4) (see Bickel et al. 2009). Condition (A6) is not too restrictive. Assume the xi’s are randomly sampled from a distribution with mean 0 and covariance matrix ∑. If the l1 operational norm of ∑ and ∑−1 are bounded, then we have ζmax(uq) ≤ maxB∣≤uq,A0⊂B ∥∑B1 + op(1) and ζmin(uq)maxBuq,A0BB11+op(1) provided that q does not diverge too fast. Here ∑B is the ∣B∣ × ∣B∣ whose entries consist of σjl, the (j, l)th entry ∑, for jB and lB. See Proposition A.1 in the online supplementary material of this paper. An example of ∑ satisfying maxB∣≤uq,A0B ∥∑B1 *** ∞ and maxBuq,A0BB11< is a block diagonal matrix where each block is well posed and of finite dimension. Moreover, Condition (A6) is almost necessary for the l convergence of the Lasso estimator. Suppose that p is small and d* is large so that all coefficients of the Lasso coefficients are nonzero. Then,

β^(1)=β^ls+τλ(XTXn)1δ,

where β^ls is the least square estimator, and δ = (δ1, … , δp) with δj=sign(β^jls). Hence, for the sup norm between β^1β^ls to be the order of τλ, the l1 operational norm of (XTX/n)−1 should be bounded.

Theorem 3.6

Assume that conditions A(1)-A(3), (A4’), (A5) and (A6) hold.

(1) If τ = o(1), then for all n sufficiently large,

P(β^(λ)=β^())18pexp[nτ2λ2(8σ2)].

(2) If τ = o(1) and log p = o(2λ2), then

P(β^(λ)=β^())1

as n → ∞

(3) Assume that the conditions of (2) and (3.3) hold. Let λ^ be the tuning parameter selected by HBIC. If Cn,qCnlog(p)=o(n),Kn2log(p)log(n)=o(n), then P(Mλ^=A0)1, as n, p → ∞.

Remark 6

We only need τ = o(1) in Theorem 3.6 for the probability bound of the calibrated CCCP estimator, while Theorem 3.2 requires τγ−2q = o(1): Under the conditions of Theorem 3.6, the oracle property of β^(λ) holds when

dλ1τlogpn. (3.9)

Since τ can converge to 0 arbitrarily slowly (e.g. τ = 1/log n), the lower bound of d* given by (3.9), logpnτ, is almost optimal.

4. Numerical results

4.1. Monte Carlo studies

We now investigate the sparsity recovery and estimation properties of the proposed estimator via numerical simulations. We compare the following estimators: the oracle estimator which assumes the availability of the knowledge of the true underlying model; the Lasso estimator (implemented using the R package glmnet); the adaptive Lasso estimator (denoted by ALasso, Zou (2006), Section 2.8 of Bühlmann and van de Geer (2011)), the hard-thresholded Lasso estimator (denoted by HLasso, Section 2.8, Bühlmann and van de Geer (2011)), the SCAD estimator from the original CCCP algorithm without calibration (denoted by SCAD); the MCP estimator with a = 1.5 and 3. For Lasso and SCAD, 5-fold cross-validation is used to select the tuning parameter; for ALasso, sequential tuning as described in Chapter 2 of Bühlmann and van de Geer (2011) is applied. For HLasso, following a referee’s suggestion, we first used λ as the tuning parameter to obtain the initial Lasso estimator, then thresholded the Lasso estimator using thresholding parameter η = for some c > 0 and refitted least squares regression. We denote the solution path of HLasso by β^HL(λ,cλ), and apply HBIC to select λ. We consider c = 2 and set Cn = log log n in the HBIC as it is found they lead to overall good performance for HLasso. The MCP estimator is computed using the R package PLUS with the theoretical optimal tuning parameter value λ=σ(2n)logp, where the standard deviation σ is taken to be known. For the proposed calibrated CCCP estimator (denoted by New), we take τ = 1/log n and set Cn = log log n in the HBIC. We observe that the new estimator performs similarly if we take τ = λ. In the following, we report simulation results from two examples. Results of additional simulations can be found in the online supplemental file.

Example 1

We generate a random sample {yi, xi}, i = 1, … , 100 from the following linear regression model:

yi=xiTβ+i,

where β=(3,1.5,0,0,2,0p5TT with 0k denoting a k-dimensional vector of zeros, the p-dimensional vector xi has the N(0p, Σ) distribution with covariance matrix Σ, i is independent of xi and has a normal distribution with mean zero and standard deviation σ = 2. This simulation setup was considered in Fan and Li (2001) for a small p case. In this example, we consider p = 3000 and the following choices of Σ: (1) Case 1a: the (i, j)th entry of Σ is equal to 0.5∣i−j∣, 1 ≤ i, jp; (2) Case 1b: the (i, j)th entry of Σ is equal to 0.8∣i−j∣, 1 ≤ i, jp; (3) Case 1c: the (i, j)th entry of Σ equal to 1 if i = j and 0.5 if 1 ≤ ijp.

Example 2

We consider a more challenging case by modifying Example 1 Case 1a. We divide the p components of β* into continuous blocks of size 20. We randomly select 10 blocks and assign each block the value (3,1.5,0,0,2,015T1.5. Hence, the number of nonzero coefficients is 30. The entries in other blocks are set to be zero. We consider σ = 1. Two different cases are investigated: (1) Case 2a: n = 200 and p = 3000; (2) Case 2b: n = 300 and p = 4000.

In the two examples, based on 100 simulation runs we report the average number of non-zero coefficients correctly estimated to be nonzero (i.e., true positive, denoted by TP) and average number of zero coefficients incorrectly estimated to be nonzero (i.e., false positive, denoted by FP) and the proportion of times the true model is exactly identified (denoted by TM). These three quantities describe the ability of various estimators for sparsity recovery. To measure the estimation accuracy, we report the mean squared error (MSE), which is defined to be 1001m=1100β^(m)β2, where β^(m) is the estimator from the mth simulation run.

The results are summarized in Table 1 and Table 2. It is not surprising that Lasso always overfits. Other procedures improve the performance of Lasso by reducing the false positive rate. The SCAD estimator from the original CCCP algorithm without calibration has no guarantee to find a good local minimum and has low probability of identifying the true model. The best overall performance is achieved by the calibrated new estimator: the probability of identifying the true model is high and the MSE is relatively small. The HLasso (with thresholding parameter selected by our proposed HBIC) and MCP (using PLUS algorithm and the theoretically optimal tuning parameter) also have overall fine performance. We do not report the results of the MCP with a = 1.5 for Example 2 since the PLUS algorithm sometimes runs into convergence problems.

Table 1.

Example 1. We report TP (the average number of non-zero coefficients correctly estimated to be nonzero, i.e., true positive), FP (average number of zero coefficients incorrectly estimated to be nonzero, i.e. false positive), TM (the proportion of the true model being exactly identified) and MSE.

Case method TP FP TM MSE
1a Oracle 3.00 0.00 1.00 0.146
Lasso 3.00 28.99 0.00 1.101
ALasso 3.00 11.47 0.01 1.327
HLasso 3.00 0.49 0.79 0.383
SCAD 3.00 10.12 0.08 1.496
MCP(a = 1.5) 2.89 0.28 0.76 0.561
MCP(a = 3) 2.91 0.42 0.68 1.292
New 2.99 0.09 0.91 0.222

1b Oracle 3.00 0.00 1.00 0.314
Lasso 3.00 20.64 0.00 1.248
ALasso 3.00 8.84 0.02 1.527
HLasso 2.79 0.50 0.56 1.244
SCAD 2.99 7.42 0.17 1.598
MCP(a = 1.5) 2.02 0.51 0.06 5.118
MCP(a = 3) 1.99 0.60 0.02 5.437
New 2.77 0.21 0.66 1.150

1c Oracle 3.00 0.00 1.00 0.195
Lasso 2.99 28.22 0.00 2.987
ALasso 2.96 10.09 0.02 2.433
HLasso 2.84 0.77 0.56 1.361
SCAD 2.96 18.09 0.01 3.428
MCP(a = 1.5) 2.67 0.17 0.72 1.636
MCP (a = 3) 2.77 0.22 0.68 1.677
New 2.79 0.46 0.58 1.244
Table 2.

Example 2. Captions are the same as those in Table 1.

Case method TP FP TM MSE
2a Oracle 30.00 0.00 1.00 0.223
Lasso 30.00 143.14 0.00 3.365
ALasso 29.98 7.50 0.00 0.393
HLasso 29.97 1.09 0.74 0.312
SCAD 29.98 46.15 0.00 2.495
MCP (a = 3) 29.83 0.50 0.92 0.807
New 29.99 0.20 0.89 0.247

2b Oracle 30.00 0.00 1.00 0.137
Lasso 30.00 133.65 0.00 1.089
ALasso 30.00 1.32 0.29 0.165
HLasso 30.00 0.00 1.00 0.137
SCAD 30.00 21.83 0.00 0.599
MCP (a = 3) 30.00 0.08 0.92 0.137
New 30.00 0.00 0.99 0.135

4.2. Real data analysis

To demonstrate the application, we analyze the gene expression data set of Scheetz et al. (2006), which contains expression values of 31042 probe sets on 120 twelve-week-old male offspring of rats. We are interested in identifying genes whose expressions are related to that of gene TRIM32 (known to be associated with human diseases of the retina) corresponding to probe 1389163_at. We first preprocess the data as described in Huang et al.(2008) to exclude genes that are either not expressed or lacking sufficient variation. This leaves 18957 genes.

For the analysis, we select 3000 genes that display the largest variance in expression level. We further analyze the top p (p = 1000 and 2000) genes that have the largest absolute value of marginal correlation with gene TRIM32. We randomly partition the 120 rats into the training data set (80 rates) and testing data set (40 rats). We use the training data set to fit the model and select the tuning parameter; and use the testing data set to evaluate the prediction performance. We perform 1000 random partitions and report in Table 3 the average model sizes and the average prediction error on the testing data set for p = 1000 and 2000. For the MCP estimators, the tuning parameters are selected by cross-validation since the standard deviation of the random error is not known. We observe that the Lasso procedure yields the smallest prediction error. However, this is achieved by fitting substantially more complex models. The calibrated CCCP algorithm as well as ALasso and HLasso result in much sparser models with still small prediction errors. The performance of the MCP procedure is satisfactory but its optimal performance depends on the parameter a. In screening or diagnostic applications, it is often important to develop an accurate diagnostic test using as few features as possible in order to control the cost. The same consideration also matters when selecting target genes in gene therapies.

Table 3.

Gene expression data analysis. The results are based on 100 random partitions of the original data set.

p method ave model size Prediction Error
1000 Lasso 31.17 0.586
ALasso 11.76 0.646
HLasso 12.04 0.676
SCAD 4.81 0.827
MCP(a = 1.5) 11.79 0.668
MCP(a = 3) 7.02 0.768
New 8.50 0.689

2000 Lasso 32.01 0.604
ALasso 11.01 0.661
HLasso 10.82 0.689
SCAD 4.57 0.850
MCP(a = 1.5) 11.33 0.700
MCP(a = 3) 6.78 0.788
New 7.91 0.736

We also applied the calibrated CCCP procedure directly to the 18957 genes and evaluated the predicative performance based on 100 random partitions. The calibrated CCCP estimator has an average model size 8.1 and an average prediction error 0.58. Note that the model size and predictive performance are similar to what we obtain when we first select 1000 (or 2000) genes with the largest variance and marginal correlation. This demonstrates the stability of the calibrated CCCP estimator in ultra-high dimension.

When a probe is simultaneously identified by different variable selection procedures, we consider it as evidence for the strength of the signal. Probe 1368113_at is identified by both Lasso and the calibrated CCCP estimator. This probe corresponds to gene tff2, which was found to up-regulate cell proliferation in developing mice retina (Paunel-Görgülü et al., 2011). On the other hand, the probes identified by the calibrated CCCP but not by Lasso also merit further investigation. For instance, probe 1371168_at was identified by the calibrated CCCP estimator but not by Lasso. This probe corresponds to gene mpp2, which was found to be related to protein metabolism abnormalities in the development of retinopathy in diabetic mice (Gao et al., 2009).

4.3. Extension to penalized logistic regression

Regularized logistic regression is known to automatically result in a sparse set of features for classification in ultra-high dimension (van de Geer, 2008; Kwon and Kim, 2011). We consider the representative two-class classification problem, where the response variable yi takes two possible values 0 or 1, indicating the class membership. It is assumed that

P(yi=1xi)=exp(xiTβ){1+exp(xiTβ)} (4.1)

The penalized logistic regression estimator minimizes

n1i=1n[(xiTβ)yi+log{1+exp(xiTβ)}]+j=1ppλ(βj).

When a nonconvex penalty is adopted, it is easy to see that the CCCP algorithm can be extended to this case without difficulty as the penalized log-likelihood naturally possesses the convex-concave decomposition discussed in Section 2.2 of the main paper, because of the convexity of the negative log-likelihood for the exponential family. For easy implementation, the CCCP algorithm can be combined with the iteratively reweighted least squares algorithm for ordinary logistic regression, thus taking advantage of the CCCP algorithm for linear regression. Denote the nonconvex penalized logistic regression estimator by β^, then for a new feature vector x, the predicted class membership is I(exp(xTβ^)(1+exp(xTβ^))>0.5).

We demonstrate the performance of nonconvex penalized sion logistic regresfor classification through the following example: we generate xi as in Example 1 of the main paper, and the response variable yi is generated according to (4.1) with β(3,1.5,0,0,2,0p50T)T. We consider sample size n = 300 and feature dimension p = 2000. Furthermore, an independent test set of size 1000 is used to evaluate the misclassificaiton error. The simulation results are reported in Table 4. The results demonstrate that the calibrated CCCP estimator is effective in both accurate classification and identifying the relevant features.

Table 4.

Simulations for classification in high dimension (n = 300, p = 2000).

method TP FP TM Misclassification Rate
Oracle 3.00 0.00 1.00 0.116
Lasso 3.00 46.48 0.00 0.134
SCAD 2.08 4.02 0.04 0.161
ALASSO 2.02 4.58 0.00 0.188
HLASSO 2.87 0.00 0.87 0.120
MCP (a = 3) 2.96 0.56 0.54 0.128
New 2.99 0.00 0.99 0.116

We expect that the theory we derived for the linear regression case continues to hold for the logistic regression under similar conditions due to the convexity of the negative log-likelihood function and the fact that the Bernoulli random variables automatically satisfies the sub-Gaussian tail assumption. The latter is essential for obtaining the exponential bounds in deriving the theory.

5. Revisiting local minima of nonconvex penalized regression

In the following, we shall revisit the issue of multiple local minima of non-convex penalized regression. We derive an L2 bound of the distance between a sparse local minimum and the oracle estimator. The result indicates that a local minimum which is sufficiently sparse often enjoys fairly accurate estimation even when it is not the oracle estimator. This result, to our knowledge, is new in the literature on high-dimensional nonconvex penalized regression.

Our theory applies the necessary condition for the local minimizer as in Tao and An (1997) for convex differencing problems. Let

Qn(β)=(2n)1yXβ2+λj=1ppλ(βj)

and

(β)={ξRp:ξj=n1x(j)T(yXβ)+λlj},

where lj = sign(βj) if βj ≠ 0 and lj ∈ [−1, 1] otherwise, 1 ≤ jp. As Qn(β) can be expressed as the difference of two convex functions, a necessary condition for β to be a local minimizer of Qn(β) is

hn(β)β(β), (5.1)

where hn(β)=j=1pJλ(βj), where Jλ(∣βj∣) is defined in Section 2.2 for SCAD and MCP penalty functions.

To facilitate our study, we introduce below a new concept.

Definition 5.1

The relaxed sparse Riesz condition (SRC) in an L0-neighborhood of the true model is satisfied for a positive integer m (2qmn) if

ξmin(m)cfor some0<c<,

where ξmin is defined in (2.5).

Remark

The relaxed SRC condition is related to, but generally weaker than the sparse Reisz condition (Zhang and Huang 2008, Zhang 2010), the restricted eigenvalue condition of Bickel et al. (2009) and the partial orthogonality condition of Huang et al. (2008).

The theorem below unveils that for a given sparse estimator which is a local minimum of (1.1), its L2 distance to the oracle estimator β^(o) has an upper bound, which is determined by three key factors: tuning parameter λ, the sparsity size of the local solution, and the magnitude of the smallest sparse eigenvalue as characterized by the relaxed SRC condition. To this end, we consider any local minimum β^=(β^j,,β^j)T corresponding to the tuning parameter λ. Assume that the sparsity size of this local solution satisfies: β^0qun for some un > 0:

Theorem 5.2

(Properties of the local minima of nonconvex penalized regression) Consider SCAD or MCP penalized least squares regression. Assume that conditions (A1) and (A2) hold, and that the relaxed SRC condition in an L0-neighborhood of the true model is satisfied for m=qun where un=un+1. Then if λ = o(d*), then for all n sufficiently large,

P{β^(λ)β^(o)2λqunξmin1(qun)}12qexp[C1n(daλ)2(2σ2)]2(pq)exp[nλ2(2σ2)], (5.2)

where ξmin(m) is defined in (2.5) and the positive constant C1 is defined in (A1).

Corollary 5.3

Under the conditions of Theorem 5.2, if we take λ=3log(p)n, then we have

P{β^(λ)β^(o)12qunlog(p)nξmin2(qun)}12qexp[C1n(daλ)2(2σ2)]2(pq)exp[nλ2(2σ2)],

The simple form in the above corollary suggests that if a local minimum is sufficiently sparse, in the sense that un diverge to ∞ very slowly, this bound is nevertheless quite tight as the rate q log(p)/n is near-oracle. The factor unξmin2(qun) is expected to go to infinity at a relatively slow rate if the local solution is sufficiently sparse. Our experience with existing algorithms for solving nonconvex penalized regression is that they often yield a sparse local minimum, which however has a low probability to be the oracle estimator itself.

6. Proofs

We will provide here proofs for the main theoretical results in this paper.

Proof of Theorem 3.2

By definition, β^(λ)=argminβQλ(ββ^(1)), where Qλ(ββ^(1))=(2n)1yXβ2+j=1pJλ(β^j(1))βj+λj=1pβj. Since Qλ(ββ^(1)) is a convex function of β, the KKT condition is necessary and sufficient for characterizing the minimum. To verify that β^(o) is the minimizer of Qλ(ββ^(1)), it is sufficient to show that

n1x(j)T(yXβ^(o))+Jλ(β^j(1))+λsign(β^j(o))=0,jA0, (6.1)

and

n1x(j)T(yXβ^(o))+Jλ(β^j(1))λ,jA0. (6.2)

We first verify (6.1). Note that with the initial value 0, we have β^(1)=argminβ{(2n)1yXβ2+τλβ1}. Let Fn3={β^(1)β116τλγ2q}, where ∥ · ∥1 denotes the L1 norm. By modifying the proof of Theorem 7.2 of Bickel et al. (2009), we can show that under the conditions of the theorem,

P(Fn3)12pexp(nτ2λ2(8σ2)). (6.3)

By the assumption of the theorem, on the event Fn3=β^(1)β1λ2 for all n sufficiently large. Furthermore, we consider the event Fn1 defined in Lemma 3.1 with b1 = 1/2. By Lemma 3.1, we have P(β^(o)βλ2)12qexp[C1nλ2(8σ2)]. By the assumption λ = o(d*), for all n sufficiently large, on the event Fn1Fn3, we have sign(β^j(1))=sign(β^j(o)), for jA0 and minjA0β^j(1)>aλ. Hence, by condition (A3), on the event Fn1Fn3,Jλ(β^j(1))=λsign(β^j(1))=λsign(β^j(o)). Furthermore, n1x(j)T(λXβ^(o)=0, for jA0, following the definition of the oracle estimator. Therefore (6.1) holds with probability at least 1 − 2q exp[−C12/(8σ2)] − 2p exp(− 2λ2/(8σ2)).

Next we verify (6.2). On the event Fn3, we have maxjA0β^j(1)λ2, for all n sufficiently large. We consider the event Fn2 defined in Lemma 3.1 with b2 = 1/2. Lemma 3.1 implies that P(Fn2) ≥ 1 − 2(pq) exp[−2/(8σ2)]. On the event Fn2 we have maxjA0n1x(j)T(yXβ^(o))λ2. By condition (A3), on the event Fn2Fn3, (6.2) holds, and this occurs with probability at least 1 − 2(pq) exp[−2/(8σ2)] − 2p exp ( − 2λ2/(8σ2).

The above two steps proves (1). The result in (2) follows immediately from (1).

Proof of Corollary 3.3 and Corollary 3.4

The proof follows immediately from Theorem 3.2.

Proof of Theorem 3.5

Recall that Mλ={j:β^j(λ)0}. We define the following three index sets: Λn = {λ > 0 : λ ∈ Λn, A0 ⊄, Mλ}, Λn0 = {λ > 0 : λ ∈ Λn, A0 = Mλ}, and Λn+ = {λ > 0 : λ ∈ Λn A0Mλ and A0 ≠ = Mλ}. In other words, Λn, Λn0 and Λn+ denote the sets of λ values which lead to underfitted, exactly fitted and overfitted models, respectively. For a given model (or equivalently an index set) M, let SSEM=infβMRMyXMβM2. That is, SSEM is the sum of squared residuals when the least squares method is used to estimate model M. Also, let σ^M2=n1SSEM. From the definition, we always have σ^λ2σ^Mλ2.

Consider λn satisfying the of Theorem 3.2(2). We have P(Mλn = A0) → 1. We will prove that P(infλΛn [HBIC(λ) − HBIC(λn)] > 0) → 1 and P(infλ∈Λn+ [HBIC(λ) − HBIC(λn) > 0) → 1.

Case I

Consider an arbitrary λ ∈ Λn, i.e., the model corresponding to Mλ is underfitted.

P(infλΛn[HBIC(λ)HBIC(λn)]>0)=P(infλΛn[HBIC(λ)HBIC(λn)]>0,Mλn=A0)+P(infλΛn[HBIC(λ)HBIC(λn)]>0,MλnA0)P(infλΛn[log(σ^Mλ2σ^A02)+(Mλq)Cnlog(p)n]>0)+o(1),

where the inequality uses Theorem 3.2(2). Furthermore, we observe that

log(σ^Mλ2σ^A02)=log(1+n[σ^Mλ2σ^A02]T(InPA0)).

Applying the inequality log(1 + x) ≥ min{0.5x, log(2)}, ∀ x > 0, we have

P(infλΛn[HBIC(λ)HBIC(λn)]>0)P(min{infλΛnn(σ^Mλ2σ^A02)2T(InPA0),log(2)}qCnlog(p)n>0)+0(1)

To evaluate T (InPA0), we apply Corollary 1.3 of Mikosch (1991) with their An = InPA0 , Bn = 2σ4(nq), μn = σ2 and yn = (nq)/(log n), we have P(T(InPA0) ≤ 2σ2(nq)) → 1 as n → ∞. Thus

P(infλΛn[HBIC(λ)HBIC(λn)]>0)P(min{infλΛnn(σ^Mλ2σ^A02)4(nq)σ2,log(2)}qCnlog(p)n>0)+o(1).

In what follows, we will prove that qCnlog(p)=o(infλΛnn(σ^Mλ2σ^A02)), which combining with the assumption qCn log(p) = o(n) leads to the conclusion P(infλ∈Λn [HBIC(λ) − HBIC(λn)] > 0) → 1.

We have

n(σ^Mλ2σ^MT2)=μT(InPMλ)μ+2μT(InPMλ)TPMλ+TPA0=I1+I2I3+I4,

where μ = Xβ*, PMλ is the projection matrix into the space spanned by the columns of XMλ, and the definition of Ii, i = 1, 2, 3, 4, should be clear from the context. Let M = {j : jMλ, jMT}. Note that M is nonempty since Mλ underfits.

By assumption (3.3), ∣I1∣ ≥ κn, for all n sufficiently large. To evaluate I2, we have

I2=2μT(InPMλ)μZ(Mλ)=2I1Z(Mλ),

where Z(Mλ)=anT with anT=(μT(InPMλ)μ)12μT(InPMλ). Note that ∥an2 = 1 and Λt=0Kn(pt)σt=0Knpt=pKn+11p12pKn. Applying the sub-Gaussian tail property in (3.1), we have

P(supηΛnZ(Mλ)>nlog(n))4pKnexp(n(2σ2log(n)))=4exp(Knlog(p)n(2σ2log(n)))0,

as Kn log(p) log(n) = o(n). Hence supη∈ΛnI2∣. To evaluate I3, let r(λ) = Trace(PMλ). It follows from Proposition 3 of Zhang (2010) that for the sub-Gaussian random variables i, ∀ t > 0,

P{TPMλr(λ)σ21+t[12(et21+t1)]+2}exp(r(λ)t2)(1+t)r(λ)2. (6.4)

We take t = n/(2σ2Kn log(n)) − 1 in the above inequality. Then t → ∞ by the assumptions of the theorem. Thus for all n sufficiently large,

P(supλΛnTPMλ>nlog(n))P(supλΛnTPMλr(λ)σ2>nσ2Knlog(n))P(supλΛnTPMλr(λ)σ2>1+t[12(et21+t1)]+2)2pKnexp(n(8σ2)Knlog(n)))(2σ2Knlog(n)))Kn22exp(Knlog(p)n(8σ2Knlog(n))+Knlog(n(2σ2Knlog(n)))0,

since Kn2log(p)log(n)=o(n). Finally, TPA0 does not depend on λ. Similarly as above, P(supλ∈ΛnI4∣ ≥ n/log(n)) → ∞ 0 by the sub-Gaussian tail condition. Therefore, with probability approaching one, n(σ^Mλ2σ^A02) is dominated by I1. This finishes the proof for the first case as qCn log(p) = o(n).

Case II

Consider an arbitrary λ ∈ Λn+, i.e., the model corresponding to Mλ is overfitted. In this case, we have yT(InPMλ)y = T(InPMλ). Therefore, n(σ^A02σ^Mλ2)=T(PMλPA0). Let ^=(InPA0), then

log(σ^A02σ^Mλ2)=log(1+T(PMλPA0)T(InPMλ))T(PMλPA0)^T^T(PMλPA0),

by the fact log(1 + x) ≤ x,∀ x ≥ 0.

Similarly as in Case I,

P(infλΛn+[HBIC(λ)HBIC(λn)]>0)=P(infλΛn+[log(σ^A02σ^Mλ2)+(Mλq)Cnlog(p)n]>0)+o(1)P(infλΛn+[(Mλq)Cnlog(p)nT(PMλPA0)^T^T(PMλPA0)]>0)+o(1)=P(infλΛn+{(Mλq)[Cnlog(p)nT(PMλPA0))(Mλq)^T^T(PMλPA0)]}>0)+o(1).

It suffices to show that

P(infλΛn+[Cnlog(p)nT(PMλPA0)(Mλq)^T^T(PMλPA0)]>0)1,

which is implied by

P[Cnlog(p)nsupλΛn+T(PMλPA0)(Mλq)^T^supλΛn+T(PMλPA0)>0)1

Note that E(^T^)=Var(i)Trace(InPA0)(nq)σ2, hence^T^=Op(n). Similarly as in case I, we can show that P(supλ∈Λn+ T(PMλPA0) > n/log(n)) → 0, since Kn2log(p)log(n)=o(n). Thus^T^supλΛn+T(PMλPA0)=Op(n). Furthermore, applying (6.4) by letting t = 8 log(p) − 1, we have for all n sufficiently large,

P(supλΛn+T(PMλPA0)Mλq>16σ2log(p))Mλ=q+1p(pqMλq)exp((Mλq)2)(1+t)Mλq2=k=1pq(pqk)exp(2klog(p))(8log(p))12=k=1pq(pqk)(8log(p)pn2)k(1+8log(p)p2)pq10.

Thus with probability approaching one, for all n sufficiently large,

Cnlog(p)nsupλΛn+T(PMλPA0)(Mλq)^T^supλΛn+T(PMλPA0)>n1Cnlog(p)n1O(log(p))>0,

since Cn → ∞. This finishes the proof.

Proof of Theorem 3.6

We will first prove that there exists a constant C > 0 such that for Fn4={maxjβ^j(1)βjCτλ}, we have

P(Fn4)12pexp(nτ2λ28σ2). (6.5)

Let Fn5 = {∣Sj(β*)∣ ≤ τλ/2 for all j}: Since

P(Fn4c)j=1pP(x(j)Tn>τλ2)2pexp(nτ2λ28σ2),

we have

P(Fn5)12p,exp(nτ2λ28σ2).

Hence to prove (6.5), it suffices to show that Fn5Fn4.

Let

θ=inf{qXTXunu1:uA0c13uA01}.

Corollary 2 of Zhang and Zhang (2012) proves that on the event Fn5, ∣AA0∣ ≤ (α + 1)q, where A={j:β^j(1)=0}, provided

ξmax(αq)α136θ.

Since $\theta\ge\gamma^2/16$ (see (7) of Zhang and Zhang, 2012), where $\gamma$ is defined in (A4), and

\[
\gamma\ \ge\ \kappa_{\min}\Big(1-3\sqrt{\frac{\xi_{\max}(\alpha q)}{\alpha\,\kappa_{\min}}}\Big)
\]

(see Bickel et al., 2009), condition (A4′) implies that

$F_{n5}\subset\{|A\cup A_0|\le(\alpha+1)q\}$. Let $C(\beta)=(2n)^{-1}\|y-X\beta\|^2+\tau\lambda\sum_{j=1}^p|\beta_j|$. Then we have
\[
C(\beta)-C(\beta^*)=-\sum_{j=1}^p(\beta_j-\beta_j^*)S_j(\beta^*)+\frac{(\beta-\beta^*)^TX^TX(\beta-\beta^*)}{2n}+\tau\lambda\sum_{j=1}^p\big(|\beta_j|-|\beta_j^*|\big). \tag{6.6}
\]
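Identity (6.6) is an exact finite-sample expansion and can be checked directly on simulated data; in the sketch below, the design, the value of $\tau\lambda$, and the trial point $\beta$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:3] = 1.5
y = X @ beta_star + rng.standard_normal(n)
tau_lam = 0.3                                    # plays the role of tau * lambda
beta = beta_star + 0.1 * rng.standard_normal(p)  # an arbitrary trial point

def C(b):
    return np.sum((y - X @ b) ** 2) / (2 * n) + tau_lam * np.sum(np.abs(b))

S_star = X.T @ (y - X @ beta_star) / n           # S_j(beta*) = n^{-1} x_(j)'(y - X beta*)
d = beta - beta_star
rhs = (-d @ S_star + d @ (X.T @ X) @ d / (2 * n)
       + tau_lam * (np.abs(beta).sum() - np.abs(beta_star).sum()))
print(np.isclose(C(beta) - C(beta_star), rhs))   # True: identity (6.6) holds exactly
```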

Let $\widehat{X\beta^*}$ be the projection of $X\beta^*$ onto $\mathrm{span}(X_A)$, the linear subspace spanned by the column vectors of $X_A$. We define the $p$-dimensional vector $\gamma^*$ such that $\widehat{X\beta^*}=X_A\gamma_A^*$ and $\gamma_j^*=0$ for $j\in A^c$. We have

\[
(\hat\beta^{(1)}-\beta^*)^TX^TX(\hat\beta^{(1)}-\beta^*)=(\hat\beta_A^{(1)}-\gamma_A^*)^TX_A^TX_A(\hat\beta_A^{(1)}-\gamma_A^*)+\|X\beta^*-X_A\gamma_A^*\|^2.
\]

Therefore, we can write

\[
\hat\beta^{(1)}=\arg\min_{\beta:\,\beta_{A^c}=0}\Big\{-\sum_{j\in A}\beta_jS_j(\beta^*)+\frac{(\beta_A-\gamma_A^*)^TX_A^TX_A(\beta_A-\gamma_A^*)}{2n}+\tau\lambda\sum_{j\in A}|\beta_j|\Big\}.
\]

Hence $\hat\beta_A^{(1)}-\gamma_A^*=\big(\frac{X_A^TX_A}{n}\big)^{-1}\theta_A$, where $\theta\in\mathbb{R}^p$ is such that $\theta_j=0$ for $j\in A^c$ and $\theta_j=S_j(\beta^*)-\mathrm{sign}(\hat\beta_j^{(1)})\tau\lambda$ for $j\in A$. On $F_{n5}$, $\max_j|\theta_j|\le 3\tau\lambda/2$. Therefore, condition (A6) with (6.6) implies that on the event $F_{n5}$,

\[
\max_{j\in A}|\hat\beta_j^{(1)}-\gamma_j^*|\le\eta_{\min}\,\frac{3\tau\lambda}{2}. \tag{6.7}
\]

It follows from (6.7) that inequality (6.5) holds if we show that $A_0\subset A$, in which case $\gamma_A^*=\beta_A^*$. We will prove this by contradiction. Assume that $A^{(-)}=A_0\cap A^c$ is nonempty. Let $\hat x_{(j)}$ be the projection of $x_{(j)}$ onto $\mathrm{span}(X_A)$ and let $\tilde x_{(j)}=x_{(j)}-\hat x_{(j)}$, $j\in A^{(-)}$. Then we can write

\[
X\beta^*=X_A\gamma_A^*+\sum_{j\in A^{(-)}}\tilde x_{(j)}\beta_j^*.
\]

Let $\tilde y=\sum_{j\in A^{(-)}}\tilde x_{(j)}\beta_j^*$. By Lemma 6.1 below, there exists $l\in A^{(-)}$ such that

\[
\Big|\frac{x_{(l)}^T\tilde y}{n}\Big|\ge\kappa_{\min}d^*. \tag{6.8}
\]

By the KKT condition, we have $\big|\frac{x_{(l)}^T(X\beta^*-X\hat\beta^{(1)})}{n}+S_l(\beta^*)\big|\le\tau\lambda$. However, we can write $\frac{x_{(l)}^T(X\beta^*-X\hat\beta^{(1)})}{n}=\frac{x_{(l)}^TX_A(\gamma_A^*-\hat\beta_A^{(1)})}{n}+\frac{x_{(l)}^T\tilde y}{n}$. The inequalities (6.8) and (6.7), together with condition (A6), imply that on $F_{n5}$,

\[
\begin{aligned}
\Big|\frac{x_{(l)}^T(X\beta^*-X\hat\beta^{(1)})}{n}+S_l(\beta^*)\Big|
&\ge\Big|\frac{x_{(l)}^T\tilde y}{n}\Big|-\Big|\frac{x_{(l)}^TX_A(\gamma_A^*-\hat\beta_A^{(1)})}{n}\Big|-|S_l(\beta^*)|\\
&\ge\Big|\frac{x_{(l)}^T\tilde y}{n}\Big|-\Big\|\frac{X_{A\cup A_0}^TX_{A\cup A_0}}{n}\Big\|_\infty\big\|\gamma_A^*-\hat\beta_A^{(1)}\big\|_\infty-|S_l(\beta^*)|\\
&\ge\kappa_{\min}d^*-\eta_{\max}\eta_{\min}\frac{3\tau\lambda}{2}-\frac{\tau\lambda}{2}\ >\ \tau\lambda
\end{aligned}
\]

if $d^*>3\tau\lambda(\eta_{\max}\eta_{\min}+1)/(2\kappa_{\min})$, which contradicts the KKT condition. Hence, we eventually have $A_0\subset A$ on $F_{n5}$, and this proves (6.5).

We now slightly modify the proof of (1) of Theorem 3.2. More specifically, replacing $F_{n3}$ by $F_{n4}$, we can show that $F_{n1}\cap F_{n2}\cap F_{n4}\subset\{\hat\beta(\lambda)=\hat\beta^{(o)}\}$, and this proves (1). The result in (2) follows immediately from (1). The proof of (3) can be done similarly to that of Theorem 3.5.
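The Lasso KKT conditions invoked throughout this proof for the first-stage estimator $\hat\beta^{(1)}$ can be inspected numerically. The sketch below is a minimal illustration, assuming the standard Lasso objective $(2n)^{-1}\|y-X\beta\|^2+\alpha\|\beta\|_1$ with $\alpha$ playing the role of $\tau\lambda$, and using scikit-learn's coordinate-descent solver; the data and the penalty level are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 200, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:4] = 2.0
y = X @ beta + rng.standard_normal(n)

alpha = 0.1                                   # plays the role of tau * lambda
fit = Lasso(alpha=alpha, fit_intercept=False, max_iter=100_000, tol=1e-10).fit(X, y)
b = fit.coef_
S = X.T @ (y - X @ b) / n                     # S_j at the fitted point: n^{-1} x_(j)'(y - X b)

active = b != 0
# KKT: equality on the active set, |S_j| <= alpha off the support (up to solver tolerance).
print(np.allclose(S[active], alpha * np.sign(b[active]), atol=1e-4))
print(np.all(np.abs(S[~active]) <= alpha + 1e-4))
```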

In the proof of Theorem 3.6, we have used the following lemma, whose proof is given in the online supplementary material.

Lemma 6.1

There exists $l\in A^{(-)}$ which satisfies (6.8).

Proof of Theorem 5.2

By (5.1), a local minimizer β necessarily satisfies:

\[
-n^{-1}x_{(j)}^T(y-X\beta)+\xi_j=0,\qquad j=1,\ldots,p, \tag{6.9}
\]

where $\xi_j=\lambda l_j-\frac{\partial h_n(\beta)}{\partial\beta_j}$, with $l_j=\mathrm{sign}(\beta_j)$ if $\beta_j\ne 0$ and $l_j\in[-1,1]$ otherwise, $1\le j\le p$. It is easy to see that $|\xi_j|\le\lambda$, $1\le j\le p$. Although the objective function is nonconvex, abusing the notation a little, we refer to the collection of all vectors of the form of the left-hand side of (6.9) as the subdifferential $\partial Q_n(\beta)$, and we refer to a specific element of this set as a subgradient. The necessary condition stated above can then be considered as an extension of the classical KKT condition.

Alternatively, minimizing $Q_n(\beta)$ can be expressed as a constrained smooth minimization problem (e.g., Kim et al., 2008). By the corresponding second-order sufficiency of the KKT conditions (e.g., page 320 of Bertsekas, 1999), $\hat\beta$ is a local minimizer of $Q_n(\beta)$ if

\[
n^{-1}x_{(j)}^T(y-X\hat\beta)=\mathrm{sgn}(\hat\beta_j)\,p'_\lambda(|\hat\beta_j|)\quad\text{for}\ \hat\beta_j\ne 0,\qquad \big|n^{-1}x_{(j)}^T(y-X\hat\beta)\big|<\lambda\quad\text{for}\ \hat\beta_j=0.
\]

Consider the event $F_n=F_{n2}\cap F_{n6}$, where $F_{n2}$ is defined in Lemma 3.1 with $b_2=1$, and $F_{n6}=\{\min_{j\in A_0}|\hat\beta_j^{(o)}|\ge a\lambda\}$. Since $|\hat\beta_j^{(o)}|\ge|\beta_j^*|-|\hat\beta_j^{(o)}-\beta_j^*|$ and $\lambda=o(d^*)$, similarly as in the proof of Lemma 3.1, we can show that for all $n$ sufficiently large, $P(F_{n6})\ge 1-2q\exp[-C_1n(d^*)^2/(2\sigma^2)]$. By Lemma 3.1, for all $n$ sufficiently large, $P(F_n)\ge 1-2q\exp[-C_1n(d^*)^2/(2\sigma^2)]-2(p-q)\exp[-n\lambda^2/(2\sigma^2)]$. It is apparent that on the event $F_n$, the oracle estimator $\hat\beta^{(o)}$ satisfies the above sufficient condition. Therefore, by (6.9), there exist $|\xi_j^{(o)}|\le\lambda$, $1\le j\le p$, such that

\[
-n^{-1}x_{(j)}^T(y-X\hat\beta^{(o)})+\xi_j^{(o)}=0.
\]

Abusing notation a little, we denote this zero vector by $\nabla_\beta Q_n(\hat\beta^{(o)})$.

Now, for any local minimizer $\hat\beta$ which satisfies the sparsity constraint $\|\hat\beta\|_0\le qu_n$, we will prove by contradiction that under the conditions of the theorem we must have $\|\hat\beta-\hat\beta^{(o)}\|\le 2\lambda\sqrt{qu_n^*}\,\xi_{\min}^{-1}(qu_n^*)$, where $u_n^*=u_n+1$. More specifically, we will derive a contradiction by showing that none of the subgradients of $Q_n(\beta)$ can be zero at $\beta=\hat\beta$.

Assume instead that $\|\hat\beta-\hat\beta^{(o)}\|>2\lambda\sqrt{qu_n^*}\,\xi_{\min}^{-1}(qu_n^*)$. Let $A^*=\{j:\hat\beta_j\ne 0\ \text{or}\ \hat\beta_j^{(o)}\ne 0\}$; then $\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|>2\lambda\sqrt{qu_n^*}\,\xi_{\min}^{-1}(qu_n^*)$. Let $\nabla_\beta Q_n(\hat\beta)$, with $j$th component $-n^{-1}x_{(j)}^T(y-X\hat\beta)+\eta_j$, be an arbitrary subgradient in the subdifferential $\partial Q_n(\hat\beta)$. Let $\eta=(\eta_1,\ldots,\eta_p)^T$; then $\eta_j$ satisfies $|\eta_j|\le\lambda$, $1\le j\le p$. We use $\nabla_{A^*}Q_n(\hat\beta)$ to denote the size-$|A^*|$ subvector of $\nabla_\beta Q_n(\hat\beta)$, i.e., $\nabla_{A^*}Q_n(\hat\beta)=(\nabla_{\beta_j}Q_n(\hat\beta):j\in A^*)^T$; $\nabla_{A^*}Q_n(\hat\beta^{(o)})$ is defined similarly. We have

\[
\begin{aligned}
\frac{\nabla_{A^*}Q_n(\hat\beta)^T(\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*})}{\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|}
&=\frac{\big(\nabla_{A^*}Q_n(\hat\beta)-\nabla_{A^*}Q_n(\hat\beta^{(o)})\big)^T(\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*})}{\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|}\\
&=\frac{n^{-1}(\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*})^TX_{A^*}^TX_{A^*}(\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*})}{\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|}+\frac{(\eta_{A^*}-\xi^{(o)}_{A^*})^T(\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*})}{\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|}\\
&\ge\phi_{\min}\big(n^{-1}X_{A^*}^TX_{A^*}\big)\,\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|-2\lambda\sqrt{qu_n^*}\\
&>\xi_{\min}(qu_n^*)\cdot 2\lambda\sqrt{qu_n^*}\,\xi_{\min}^{-1}(qu_n^*)-2\lambda\sqrt{qu_n^*}=0,
\end{aligned}
\]

where the first equality uses $\nabla_{A^*}Q_n(\hat\beta^{(o)})=0$, the second equality follows from the expression of the subgradient, the second-to-last inequality applies the Cauchy-Schwarz inequality, and the last inequality follows from the relaxed SRC condition in an $L_0$-neighborhood of the true model. This contradicts the fact that at least one of the subgradients must be zero when $\hat\beta$ is a local minimizer, and the theorem is proved.
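The sufficient KKT-type condition displayed in this proof, and the claim that the oracle estimator satisfies it on $F_n$, can be illustrated numerically. The sketch below assumes the SCAD derivative $p'_\lambda(u)=\lambda\{I(u\le\lambda)+(a\lambda-u)_+/((a-1)\lambda)\,I(u>\lambda)\}$ of Fan and Li (2001) with $a=3.7$; the data, the signal strength, and the value of $\lambda$ are arbitrary choices for illustration only.

```python
import numpy as np

def scad_deriv(u, lam, a=3.7):
    """SCAD penalty derivative p'_lambda(u) for u >= 0 (Fan and Li, 2001)."""
    return lam * ((u <= lam) + np.maximum(a * lam - u, 0) / ((a - 1) * lam) * (u > lam))

rng = np.random.default_rng(4)
n, p, q = 400, 200, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:q] = 2.0                     # strong signals, well above a*lambda
y = X @ beta + rng.standard_normal(n)

lam = 0.2
# Oracle estimator: least squares on the true support A0, zero elsewhere.
b_oracle = np.zeros(p)
b_oracle[:q] = np.linalg.lstsq(X[:, :q], y, rcond=None)[0]

S = X.T @ (y - X @ b_oracle) / n
# On the active set: S_j = sgn(b_j) p'_lambda(|b_j|); here p'_lambda(|b_j|) = 0 because
# |b_j| > a*lambda, and S_j = 0 by least-squares orthogonality on A0.
print(np.allclose(S[:q], np.sign(b_oracle[:q]) * scad_deriv(np.abs(b_oracle[:q]), lam)))
# Off the support: |S_j| < lambda, which holds with high probability for this lambda.
print(np.max(np.abs(S[q:])) < lam)
```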

Proof of Corollary 5.3

It follows directly from Theorem 5.2.


Footnotes

SUPPLEMENTARY MATERIAL Supplement to “Calibrating Non-convex Penalized Regression in Ultra-high Dimension”:

(doi: COMPLETED BY THE TYPESETTER; .pdf). This supplemental material includes the proofs of Lemmas 3.1 and 6.1, and some additional numerical results.

*

Supported in part by National Science Foundation grant DMS-1308960.

Supported in part by National Research Foundation of Korea grant number 20100012671, funded by the Korea government.

Supported in part by National Natural Science Foundation of China grant 11028103 and NIH grants P50 DA10075, R21 DA024260, R01 CA168676 and R01 MH096711.

References

  • [1]. Bertsekas DP. Nonlinear Programming. 2nd edition. Athena Scientific; Belmont, MA: 1999.
  • [2]. Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics. 2009;37:1705–1732.
  • [3]. Bühlmann P, van de Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer; 2011.
  • [4]. Cai T, Zhou H. Minimax estimation of large covariance matrices under l1 norm. To appear in Statistica Sinica. 2011.
  • [5]. Chen J, Chen Z. Extended Bayesian information criterion for model selection with large model space. Biometrika. 2008;95:759–771.
  • [6]. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  • [7]. Fan J, Lv J. Non-concave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486.
  • [8]. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics. 2004;32:928–961.
  • [9]. Frank IE, Friedman JH. A statistical view of some chemometric regression tools (with discussion). Technometrics. 1993;35:109–148.
  • [10]. Gao BB, Phipps JA, Bursell D, Clermont AC, Feener EP. Angiotensin AT1 receptor antagonism ameliorates murine retinal proteome changes induced by diabetes. Journal of Proteome Research. 2009;8:5541–5549. doi: 10.1021/pr9006415.
  • [11]. Huang J, Ma SG, Zhang C-H. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
  • [12]. Hunter DR, Li R. Variable selection using MM algorithms. Annals of Statistics. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
  • [13]. Kim Y, Choi H, Oh H-S. Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association. 2008;103:1665–1673.
  • [14]. Kim Y, Kwon S. Global optimality of nonconvex penalized estimators. Biometrika. 2012;99:315–325.
  • [15]. Kim Y, Kwon S, Choi H. Consistent model selection criteria on high dimensions. Journal of Machine Learning Research. 2012;13:1037–1057.
  • [16]. Kwon S, Kim Y. Large sample properties of the smoothly clipped absolute deviation penalized maximum likelihood estimation on high dimensions. Accepted by Statistica Sinica. 2011.
  • [17]. Lounici K. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics. 2008;2:90–102.
  • [18]. Mazumder R, Friedman J, Hastie T. SparseNet: coordinate descent with non-convex penalties. Journal of the American Statistical Association. 2011;106:1125–1138. doi: 10.1198/jasa.2011.tm09738.
  • [19]. Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics. 2009;37:246–270.
  • [20]. Mikosch T. Estimates for tail probabilities of quadratic and bilinear forms in subgaussian random variables. Probability and Mathematical Statistics. 1991;11:169–178.
  • [21]. Paunel-Görgülü AN, Franke AG, Paulsen FP, Dünker N. Trefoil factor family peptide 2 acts pro-proliferative and pro-apoptotic in the murine retina. Histochemistry and Cell Biology. 2011;135:461–473. doi: 10.1007/s00418-011-0810-6.
  • [22]. Rinaldo A. A note on the uniqueness of the Lasso solution. Technical Report. Department of Statistics, Carnegie Mellon University; 2007.
  • [23]. Scheetz TE, Kim K-YA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434. doi: 10.1073/pnas.0602562103.
  • [24]. Tao PD, An LTH. Convex analysis approach to D.C. programming: theory, algorithms and applications. Acta Mathematica Vietnamica. 1997;22:289–355.
  • [25]. Tibshirani RJ. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  • [26]. van de Geer SA. High-dimensional generalized linear models and the lasso. Annals of Statistics. 2008;36:614–645.
  • [27]. van de Geer SA, Bühlmann P, Zhou SH. The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electronic Journal of Statistics. 2011;5:688–749.
  • [28]. Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B. 2009;71:671–683.
  • [29]. Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
  • [30]. Wang L, Kim Y, Li R. Supplement to “Calibrating non-convex penalized regression in ultra-high dimension”. 2013. doi: 10.1214/13-AOS1159.
  • [31]. Yuille A, Rangarajan A. The concave-convex procedure. Neural Computation. 2003;15:915–936. doi: 10.1162/08997660360581958.
  • [32]. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics. 2010;38:894–942.
  • [33]. Zhang C-H, Huang J. The sparsity and bias of the LASSO selection in high-dimensional regression. Annals of Statistics. 2008;36:1567–1594.
  • [34]. Zhang C-H, Zhang T. A general theory of concave regularization for high dimensional sparse estimation problems. Statistical Science. 2012;27:576–593.
  • [35]. Zhang T. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research. 2010;11:1080–1107.
  • [36]. Zhang T. Multi-stage convex relaxation for feature selection. Bernoulli. 2012. To appear.
  • [37]. Zhang Y, Li R, Tsai C-L. Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association. 2010;105:312–323. doi: 10.1198/jasa.2009.tm08013.
  • [38]. Zhao P, Yu B. On model selection consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
  • [39]. Zhou SH, van de Geer SA, Bühlmann P. Adaptive Lasso for high dimensional regression and Gaussian graphical modeling. 2009. arXiv:0903.2515.
  • [40]. Zhou SH. Thresholded Lasso for high dimensional variable selection and statistical estimation. 2010. arXiv:1002.1583.
  • [41]. Zou H. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
  • [42]. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Annals of Statistics. 2008;36:1509–1533. doi: 10.1214/009053607000000802.
