Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2014 Jun 17.
Published in final edited form as: Ann Stat. 2013 Oct 1;41(5):2505–2536. doi: 10.1214/13-AOS1159

CALIBRATING NON-CONVEX PENALIZED REGRESSION IN ULTRA-HIGH DIMENSION

Lan Wang *, Yongdai Kim , Runze Li
PMCID: PMC4060811  NIHMSID: NIHMS590995  PMID: 24948843

Abstract

We investigate high-dimensional non-convex penalized regression, where the number of covariates may grow at an exponential rate. Although recent asymptotic theory established that there exists a local minimum possessing the oracle property under general conditions, it is still largely an open problem how to identify the oracle estimator among potentially multiple local minima. There are two main obstacles: (1) due to the presence of multiple minima, the solution path is nonunique and is not guaranteed to contain the oracle estimator; (2) even if a solution path is known to contain the oracle estimator, the optimal tuning parameter depends on many unknown factors and is hard to estimate. To address these two challenging issues, we first prove that an easy-to-calculate calibrated CCCP algorithm produces a consistent solution path which contains the oracle estimator with probability approaching one. Furthermore, we propose a high-dimensional BIC criterion and show that it can be applied to the solution path to select the optimal tuning parameter which asymptotically identifies the oracle estimator. The theory for a general class of non-convex penalties in the ultra-high dimensional setup is established when the random errors follow the sub-Gaussian distribution. Monte Carlo studies confirm that the calibrated CCCP algorithm combined with the proposed high-dimensional BIC has desirable performance in identifying the underlying sparsity pattern for high-dimensional data analysis.

Keywords: High-dimensional regression, LASSO, MCP, SCAD, variable selection, penalized least squares

1. Introduction

High-dimensional data, where the number of covariates p greatly exceeds the sample size n, arise frequently in modern applications in biology, chemometrics, economics, neuroscience and other scientific fields. To facilitate the analysis, it is often useful and reasonable to assume that only a small number of covariates are relevant for modeling the response variable. Under this sparsity assumption, a widely used approach for analyzing high-dimensional data is regularized or penalized regression. This approach estimates the unknown regression coefficients by solving the following penalized regression problem

minβRp{(2n)1yXβ2+j=1ppλ(βj)}, (1.1)

where y is the vector of responses, X is an n × p matrix of covariates, β = (β1, … ,βp)T is the vector of unknown regression coefficients, ∥ · ∥ denotes the L2 norm (Euclidean norm), and Pλ(·) is a penalty function which depends on a tuning parameter λ > 0. Many commonly used variable selection procedures in the literature can be cast into the above framework, including the best subset selection, L1 penalized regression or Lasso (Tib-shirani, 1996), Bridge regression (Frank and Friedman, 1993), SCAD (Fan and Li, 2001), MCP (Zhang, 2010), among others.

The Lasso penalized regression is computationally attractive and enjoys great performance in prediction. However, it is known that Lasso requires rather stringent conditions on the design matrix to be variable selection consistent (Zou, 2006; Zhao and Yu, 2006). Focusing on identifying the unknown sparsity pattern, non-convex penalized high-dimensional regression has recently received considerable attention. Fan and Li (2001) first systematically studied nonconvex penalized likelihood for fixed finite dimension p. In particular, they recommended the SCAD penalty which enjoys the oracle property for variable selection. That is, it can estimate the zero coefficients as exact zero with probability approaching one, and estimate the non-zero coefficients as efficiently as if the true sparsity pattern is known in advance. Fan and Peng (2004) extended these results by allowing p to grow with n at the rate p = o(n1/5) or p = o(n1/3). For high dimensional nonconvex penalized regression with pn, Kim et al. (2008) proved that the oracle estimator itself is a local minimum of SCAD penalized least squares regression under very relaxed conditions; Zhang (2010) proposed a minimax concave penalty (MCP) and devised a novel PLUS algorithm which when used together can achieve the oracle property under certain regularity conditions. Important insight has also been gained through the recent work on theoretical analysis of the global solution (Kim and Kwon, 2012; Zhang and Zhang, 2012). However, direct computation of the global solution to the nonconvex penalized regression is infeasible in high dimensional setting.

For practical data analysis, it is critical to find an easy-to-implement procedure which can find a local solution with satisfactory theoretical propertyeven when the number of covariates greatly exceeds the sample size. Two challenging issues remain unsolved. One is the problem of multiple local minima; the other is the problem of optimal tuning parameter selection.

A direct consequence of the multiple local minima problem is that the solution path is not unique and is not guaranteed to contain the oracle estimator. This problem is due to the nature of the non-convexity of the penalty. To understand it, we note that the penalized objective function in (1.1) is non-convex in β whenever the convexity of the least squares loss function does not dominate the concavity of the penalty part. In general, the occurrence of multiple minima is unavoidable unless strong assumptions are imposed on both the design matrix and the penalty function. The recent theory for SCAD penalized linear regression (Kim et al., 2008) and for general non-concave penalized generalized linear models (Fan and Lv, 2011) indicates that one of the local minima enjoys the oracle property but it is still an unsolved problem how to identify the oracle estimator among multiple minima when pn. Popularly used algorithms generally only en sure the convergence to a local minimum, which is not necessarily the oracle estimator. Numerical evidence in Section 4 suggests that the local minima identified by some of the popular algorithms have a relatively low probability to recover the unknown sparsity pattern although it may have small estimation error.

Even if a solution path is known to contain the oracle estimator, identifying such a desirable estimator from the path is itself a challenging problem in ultra-high dimension. The main issue is to find the optimal tuning parameter which yields the oracle estimator. The theoretically optimal tuning parameter does not have an explicit representation and depends on unknown factors such as the variance of the unobserved random noise. Cross-validation is commonly adopted in practice to select the tuning parameter but is observed to often result in overfitting. In the case of fixed p, Wang, Li and Tsai (2007) rigorously proved that generalized cross-validation leads to an overfitted model with a positive probability for SCAD-penalized regression. Effective BIC-type criterion for nonconvex penalized regression has been investigated in Wang, Li and Tsai (2007) and Zhang, Li and Tsai (2010) for fixed p; and in Wang, Li and Leng (2009) for diverging p (but p < n). However, to the best of our knowledge, there is still no satisfactory tuning parameter selection procedure for nonconvex penalized regression in ultra-high dimension.

The above two main concerns motivate us to consider calibrating nonconvex penalized regression in ultra-high dimension with the goal to identify the oracle estimator with high probability. To achieve this, we first prove that a calibration of the CCCP algorithm (Kim et al., 2008) for non-convex penalized regression produces a consistent solution path with probability approaching one in merely two steps under conditions much more relaxed than what would be required for the Lasso estimator to be model selection consistent. Furthermore, extending the recent work of Chen and Chen (2008) and Kim et al. (2011) for Bayesian information criterion (BIC) on high dimensional least squares regression, we propose a high-dimensional BIC for a nonconvex penalized solution path and prove its validity under more general conditions when p grows at an exponential rate. The recent independent work of Zhang (2010, 2012) devised a multi-stage convex relaxation scheme and proved that for the capped L1 penalty the algorithm can find a consistent solution path with probability approaching one under certain conditions. Despite the similar flavor shared with the algorithm proposed in this paper, his algorithm takes multiple steps (which can be very large in practice depending on the design condition) and the paper has not studied the problem of tuning parameter selection.

To deepen our understanding of the nonconvex penalized regression, we also derive an interesting auxiliary theoretical result of an upper bound on the L2 distance between a sparse local solution of nonconvex penalized regression and the oracle estimator. This result is new and insightful. It suggests that under general regularity conditions a sparse local minimum can often have small estimation error even though it may not be the oracle estimator. Overall, the theoretical results in this paper fill in important gaps in the literature, thus substantially enlarge the scope of applications of nonconvex penalized regression in ultra-high dimension. In Monte Carlo studies, we demonstrate that the calibrated CCCP algorithm combined with the proposed high-dimensional BIC is effective in identifying the underlying sparsity pattern.

The rest of the paper is organized as follows. In Section 2, we define the notation, review the CCCP algorithm and introduce the new methodology. In Section 3, we establish that the proposed calibrated CCCP solution path contains the oracle estimator with probability approaching one under general conditions, and that the proposed high-dimensional BIC is able to select the optimal tuning parameter with probability tending to one. In Section 4, we report numerical results from Monte Carlo simulations and a real data example. In Section 5, we present an auxiliary theoretical result which sheds light on the estimation accuracy of a local minimum of non-convex penalized regression if it is not the oracle estimator. The proofs are given in Section 6.

2. Calibrated non-convex penalized least squares method

2.1. Notation and setup

Suppose that {(Yi,xi)}i=1n is a random sample from the linear regression model:

y=Xβ+, (2.1)

where y = (Y1, … , Yn)T , X is the n × p non-stochastic design matrix with the ith row (Y1,,Yn)T is the vector of unknown true parameters, and = (1, … , n)T is a vector of independent and identically distributed random errors.

We are interested in the case where p = pn greatly exceeds the sample size n. The vector of the true parameters β* is assumed to be sparse in the sense that the majority of its components are exactly zero. Let A0={j:βj0} be the index set of covariates with non-zero coefficients and let A0=q denote the cardinality of A0. We use d=min{βj:βj0} to denote the minimal absolute value of the non-zero coefficients. Without loss of generality, we may assume that the first q components of β* are non-zero, thus we can write β=(β1T,0T)T, where 0 represents a zero vector of length pq. The oracle estimator is defined as β^(o)=(β^1(o)T,0T)T, where β^1(o) is the least squares estimator fitted using only the covariates whose indices are in A0.

To handle the high-dimensional covariates, we consider the penalized regression in (1.1). The penalty function pλ(t) is assumed to be increasing and concave for t ∈ [0, +∞) with a continuous derivative p.λ(t) on (0, +∞). To induce sparsity of the penalized estimator, it is generally necessary for the penalty function to have a singularity at the origin, i.e., p.λ(0+)>0. Without loss of generality, the penalty function can be standardized such that p.λ(0+)=λ. Furthermore, it is required that

p.λ(t)λ,0<t<α0λ, (2.2)
p.λ(t)=0,t>α0λ, (2.3)

for some positive constant a0. Condition (2.3) plays the key role of not over-penalizing large coefficients, thus alleviating the bias problem associated with Lasso.

The above class of penalty functions include the popularly used SCAD penalty and MCP. The SCAD penalty is defined by

p.λ(t)=λ{I(tλ)+(aλt)+(a1)λI(t>λ)} (2.4)

for some a > 2, where the notation b+ stands for the positive part of b, i.e., b+ = bI(b > 0). Fan and Li (2001) recommended to use a = 3.7 from a Bayesian perspective. On the other hand, the MCP is defined by p.λ(t)=a1(aλt)+ for some a > 0 (as a ↓ 1, it amounts to hard-thresholding, thus in the following we assume a > 1).

Let x(j) be the jth column vector of X. Without loss of generality, we assume that x(j)Yx(j)n=1 for all j. Throughout this paper the following notation is used. For an arbitrary index set A ⊆ {1, 2 … , p}, XA denotes the n × ∣A∣ submatrix of X formed by those columns of X whose indices are in A. For a vector v = (v1, … , vp)’, we use ∥v∥ to denote its L2 norm; on the other hand ∥v0 = #{j : vj ≠ 0} denotes the L0 norm, ∥v1 = ∑jvj∣ denotes the L1 norm and ∥v = maxjvj∣ denotes the L norm. We use vA to represent the size-∣A∣ subvector of v formed by the entries vj with indices in A. For a symmetric matrix B, λmin(B) and λmax(B) stand for the smallest and largest eigenvalues of B, respectively. Furthermore, we let

ξmin(m)=minBm,A0Bλmin(n1XBTXB). (2.5)

Finally, p, q, λ and other related quantities are all allowed to depend on n, but we suppress such dependence for notational simplicity.

2.2. The CCCP algorithm

It is challenging to solve the penalized regression problem in (1.1) when the penalty function is nonconvex. Kim et al. (2008) proposed a fast optimization algorithm called the SCAD-CCCP (CCCP stands for ConCave Convex procedure) algorithm for solving the SCAD-penalized regression. The key idea is to update the solution with the minimizer of the tight convex upper bound of the objective function obtained at the current solution. What makes a fast algorithm practical relies on the possibility of decomposing the non-convexed penalized least squares objective function as the sum of a convex function and a concave function. To be specific, suppose we want to minimize an objective function C(β) which has the representation C(β) = Cvex(β) + Ccav(β) for a convex function Cvex(β) and a concave function Ccav(β). Given a current solution β(k), the tight convex upper bound of C(β) is given by Q(β) = Cvex(β) + ▿Ccav(βk)’ β where ▿Ccav(β) = ∂Ccav(β)/∂β. We then update the solution by minimizing Q(β). Since Q(β) is a convex function, it can be easily minimized.

For the penalized regression in (1.1), we consider a penalty function pλ(∣βj∣) which has the decomposition

pλ(βj)=Jλ(βj)+λβj, (2.6)

where Jλ(∣βj∣) is a differentiable concave function. For example, for the SCAD penalty,

Jλ(βj)=βj22λβj+λ22(a1)I(λβjaλ)+[(a+1)λ22λβj]I(βj>aλ);

while for the MCP penalty,

Jλ(βj)=βj22aI(0βj<aλ)+[aλ22λβj]I(βjaλ).

Hence, using the decomposition in (2.6), the penalized objective function in (1.1) can be rewritten as

12nyXβ2+j=1pJλ(βj)+λj=1pβj,

which is the sum of convex and concave functions. The CCCP algorithm is applied as follows. Given a current solution β(k), the tight convex upper bound is

Q(ββ(k),λ)=12nyXβ2+j=1pJλ(βj(k))βj+λj=1pβj. (2.7)

We then update the current solution by β(k+1) = arg minβ Q(ββ(k), λ).

An important property of the CCCP algorithm is that the objective function always decreases after each iteration (Yuille and Rangarajan, 2003 and Tao and An, 1997), from which it can be deduced that the solution converges to a local minimum. See, for example, Corollary 3.2 of Hunter and Li (2005). However, there is no guarantee that the local minimum found is the oracle estimator itself because there are multiple local minima and the solution of the CCCP algorithm depends on the choice of the initial solution.

2.3. Calibrated non-convex penalized regression

In this paper, we propose and study a calibrated CCCP estimator. More specifically, we start with the initial value β(0) = 0 and a tuning parameter λ > 0, and let Q be the tight convex upper bound defined in (2.7). The calibrated algorithm consists of the following two steps.

  1. Let β^(1)(λ)=agrminβQ(ββ(0),τλ), where the choice of τ > 0 will be discussed later.

  2. 2. Let β^(1)(λ)=agrminβQ(ββ^(1)(λ),λ).

When we consider a sequence of tuning parameter values, we obtain a solution path {β^(λ):λ>0}. The calculation of the path is fast even for very high-dimensional p as for each of the two steps a convex minimization problem is solved. In step 1, a smaller tuning parameter τλ is adopted to increase the estimation accuracy, see Section 3.1 for discussions on the practical choice of τ. We call a solution path “path consistent” if it contains the oracle estimator. In Section 3.1, we will prove that the calibrated CCCP algorithm produces a consistent solution path under rather weak conditions.

Given such a solution path, a critical question is how to tune the regularization parameter λ in order to identify the oracle estimator. The performance of a penalized regression estimator is known to heavily depend on the choice of the tuning parameter. To further calibrate non-convex penalized regression, we consider the following high-dimensional BIC criterion (HBIC) to compare the estimators from the above solution path:

HBIC(λ)=log(σ^λ2)+MλCnlog(p)n, (2.8)

where Mλ={j:β^j(λ)0} is the model identified by β^(λ),Mλ denotes the cardinality of Mλ, and σ^λ2=n1SSEλ with SSEλ=YXβ^(λ)2. As we are interested in the case where p greatly exceeds n, the penalty term also depends on p; and Cn is a sequence of numbers that diverges to ∞, which will be discussed later.

We compare the value of the above HBIC criterion for λ ∈ Λn = {λ : ∣Mλ∣ ≤ Kn}, where Kn > q represents a rough estimate of an upper bound of the sparsity of the model and is allowed to diverge to ∞. We select the tuning parameter

λ^=argminλΛnHBIC(λ).

The above criterion extends the recent works of Chen and Chen (2008) and Kim et al. (2012) on the high-dimensional BIC for the least squares regression to tuning parameter selection for nonconvex penalized regression. In Sections 3.1-3.3, we study asymptotic properties under conditions such as sub-Gaussian random errors, dimension of the covariates growing at the exponential rate and diverging Kn.

3. Theoretical properties

The main theory comprises two parts. We first show that under some general regularity conditions the calibrated CCCP algorithm yields a solution path with the “path consistency” property. We next verify that when the proposed high-dimensional BIC is applied to this solution path to choose the tuning parameter λ, with probability tending to one the resulted estimator is the oracle estimator itself.

To facilitate the presentation, we specify a set of regularity conditions.

(A1) There exists a positive constant C1 such that λmin(n1XA0TXA0)C1.

(A2) The random errors 1, … , ν are i.i.d. mean zero sub-Gaussian random variables with a scale factor 0 < σ < ∞, i.e., E[exp(t∊i)] ≤ eσ2t2/2, ∀ t.

(A3) The penalty function pλ(t) is assumed to be increasing and concave for t ∈ [0, +) with a continuous derivative p.λ(t) on (0, +∞). It admits a convex-concave decomposition as in (2.6) with Jλ(·) satisfies: ▿Jλ(∣t∣) = −λsign(t) for ∣t∣ > , where a > 1 is a constant; and ∣▿Jλ(∣t∣)∣ ≤ ∣t∣ for ∣t∣ ≤ , where ba is a positive constant.

(A4) The design matrix X satisfies: γ=minδ0,δA0c13δA01XδnδA0>0.

(A5) Assume that λ = o(d*) and τ = o(1), where d* is defined on page 5, λ and τ are the two parameters in the modified CCCP algorithm given in the first paragraph of Section 2.3.

Remark 1

Condition (A1) concerns the true model and is a common assumption in the literature on high-dimensional regression. Condition (A2) implies that for a vector a = (a1, … , an)T,

P(aT>t)2exp(t22σ2a2),t0. (3.1)

Condition (A3) is satisfied by popular nonconvex penalty functions such as SCAD and MCP. Note that the condition ▿Jλ(∣t∣) = −λsign(t) for ∣t∣ > is equivalent to assuming that p.λ(t)=0,t>aλ i.e., large coefficients are not penalized, which is exactly the motivation for nonconvex penalties. Condition (A4), which is given in Bickel et al. (2009), ensures a desirable bound on the L1 estimation loss of the Lasso estimator. Note that the CCCP algorithm yields the Lasso estimator after the first iteration, so the asymptotic properties of the CCCP estimator is related to that of the Lasso estimator. Condition (A4) holds under the restricted eigenvalue condition which is known to be a relatively mild condition on the design matrix for high-dimensional estimation. In particular, it is known to hold in some examples where the covariates are highly dependent, and is much weaker than the irrepresentable condition (Zhao and Yu, 2006) which is almost necessary for Lasso to be model selection consistent.

3.1. Property of the solution path

We first state a useful lemma that characterizes a nonasymptotic property of the oracle estimator in high dimension. The result is an extension of that in Kim et al. (2008) under the more general sub-Gaussian random error condition.

Lemma 3.1

For any given 0 < b1 < 1 and 0 < b2 < 1, consider the events

Fn1={maxjA0β^j()βjb1λ}andFn2={maxjA0cSj(β^())b2λ}

where Sj(β) = −n−1x(j)T(yXβ). Then under conditions (A1) and (A2),

P(Fn1Fn2)12qexp[C1b12nλ2(2σ2)]2(pq)exp[nb22λ2(2σ2)].

The proof of Lemma 3.1 is given in the online supplementary material. Theorem 3.2 below provides a non-asymptotic bound of the probability the solution path contains the oracle estimator. Under general conditions, this probability tends to one.

Theorem 3.2

(1) Assume that conditions (A1)-(A5) hold. If τγ−2q = o(1), then for all n sufficiently large,

P(β^(λ)β^())18pexp(nτ2λ2(8σ2)).

(2) Assume that conditions (A1)-(A5) hold. If nτ2λ2 → ∞, log p = o(2λ2) and τγ−2q = o(1), then

P(β^(λ)=β^())1

as n → ∞.

Remark 2

Meinshausen and Yu (2009) considered thresholding Lasso, which has the oracle property under an incoherent design condition in the ultra-high dimension. Zhou (2010) further proposed and investigated a multi-step thresholding procedure which can accurately estimate the sparsity pattern under the restricted eigenvalue condition of Bickel et al. (2009). These theoretical results are derived by assuming the initial Lasso is obtained using a theoretical tuning parameter value, which depends on the unknown random noise variance σ2. Estimating σ2 is a difficult problem in high-dimensional setting, particularly when the random noise is non-Gaussian. On the other hand, if the true value of σ2 is known a priori, then it is possible to derive variable selection consistency under somewhat more relaxed conditions on the design matrix than those in the current paper. Adaptive Lasso, originally proposed by Zou (2006) for fixed dimension, was extended to high dimension by Huang et al. (2008) under a rather strong mutual incoherence condition. Zhou, van der Geer and Bühlmann (2009) derived the consistency of adaptive Lasso in high dimension under similar conditions on X, but still requires complex conditions on s and d*. Some favorable empirical performance of the multi-step thresholded Lasso versus the adaptive Lasso was reported in Zhou (2010). A theoretical comparison of these two procedures in high dimension was considered by van de Geer, Bühlmann and Zhou (2011) and Chapter 7 of Bühlmann and van de Geer (2011). For both adaptive and thresholded Lasso, if a covariate is deleted in the first step, it will be excluded from the final selected model. Zhang (2010) proved that selection consistency holds for the MCP solution at the universal penalty level σ2logpn. The LLA algorithm, which Zou and Li (2008) originally proposed for fixed dimensional models, alleviates this problem and has the potential to be extended to the ultra-high dimension under conditions similar as those in this paper. Needless to say, the performances of the above procedures all depend on the choice of tuning parameter. However, the important issue of tuning parameter selection has not been addressed.

Remark 3

We proved that the calibrated CCCP algorithm which involves merely two iterations is guaranteed to yield a solution path that contains the oracle estimator with high probability under general conditions. To provide some intuition on this theory, we first note that the first step of the algorithm yields the Lasso estimator, albeit with a small penalty level τλ. If we denote the first step estimator by β^j(Lasso)(τλ), then based on the optimization theory, the oracle property is achieved when

minjA0β^j(Lasso)(τλ)aλ>λ,sign(β^j())=sign(βj),jA0,maxjA0Jλ(β^j(Lasso)(τλ))+n1XA0cT(YX)β^()λ.

The proof of Theorem 3.2 relies on the following condition:

β^(Lasso)(τλ)βλ2,minβj0βj>aλ+λ2, (3.2)

for the given a > 1. The proof proceeds by bounding the first part of (3.2) using a result of Bickel et al. (2009) via β^(Lasso)(τλ)ββ^Lasso(τλ)β2. In Section 3.3, we considered an alternative approach using the recent result of Zhang and Zhang (2012), which leads to weaker requirement on the minimal signal strength under slightly stronger assumptions on the design matrix. We also noted that Theorem 3.2 holds for any a > 1, although in the numerical studies we use the familiar a = 3.7.

How fast the probability that our estimator is equal to the oracle estimator approaches one depends on the sparsity level, the magnitude of the smallest signal, the size of the tuning parameter and the condition of the design matrix. Corollary 3.3 below confirms that the path-consistency can hold in ultra-high dimension.

Corollary 3.3

Assume that conditions (A1)-(A4) hold. Suppose there are two positive constants γ0 and K such that γγ0 > 0 and q < K. If d* = O(nc1) for some c1 ≥ 0 and p = O(exp(nc2)) for some c2 > 0, then

P(β^(λ)=β^())1,

provided λ = O(nc3) for some c3 > c1, τ2n1−2c3c2 → ∞ and τ = o(1).

The above corollary indicates that if the true model is very sparse (i.e. q < K) and the design matrix behaves well (i.e. γγ0 > 0), then we can take τ to be a sequence that converges to 0 slowly, for example, τ = 1/log n. On the other hand, if one is concerned that the true model may not be very sparse (q → ∞) and the design matrix may not behave very well (γ → 0), then an alternative choice is to take τ = λ which works also quite well in practice. The following corollary establishes that under some general conditions, the choice of τ = λ yields a consistent solution path under ultra high-dimensionality.

Corollary 3.4

Assume that conditions (A1)-(A4) hold. If q = O(nc1) for some c1 ≥ 0, d* = O(nc2) for some c2 ≥ 0, γ = O(nc3) for some c3 ≥ 0, p = O(exp(nc4)) for some 0 < c4 < 1, λ = O(nc5) for some max(c2, c1 + 2c3) < c5 < (1 − c4)/4 and τ = λ, then

P(β^(λ)=β^())1.

3.2. Property of the high-dimensional BIC

Theorem 3.5 below establishes the effectiveness of the HBIC defined in (2.8) for selecting the oracle estimator along a solution path of the calibrated CCCP.

Theorem 3.5

(Property of HBIC) Assume that the conditions of Theorem 3.2(2) hold, and there exists a positive constant κ such that

limnminAA0,AKn{n1(InPA)XA0βA02}κ, (3.3)

where In denotes the n × n identity matrix and PA denotes the projection matrix onto the linear space spanned by the columns of XA. If Cn, → ∞, qCn log(p) = o(n) and Kn2log(p)log(n)=o(n), then

P(Mλ^=A0)1,

as n, p → ∞.

Remark 4

Condition (3.3) is an asymptotic model identifiability condition, similar to that in Chen and Chen (2008). This condition states that if we consider any model which contains at most Kn covariates, it cannot predict the response variable as well as the true model does if it is not the true model. To give some intuition of this condition, as in Chen and Chen (2008), one can show that for AA0,

n1(InPA)XA0βA02λmin(n1XA0ATXA0A)βA0Ac2λmin(n1XA0ATXA0A)minβj0βj2.

The theorem confirms that the BIC criterion for shrinkage parameter selection investigated in Wang, Li and Tsai (2007), Wang, Li and Leng (2009) and Zhang, Li and Tsai (2010) can be modified and extended to ultra-high dimensionality. Carefully examining the proof, it is worth noting that the consistency of the HBIC only requires a consistent solution path but does not rely on the particular method used to construct the path. Hence, the proposed HBIC has the potential to be generalized to other settings with ultra-high dimensionality. The sequence Cn should diverge to ∞ slowly, e.g. Cn = log(log n), which is used in our numerical studies.

3.3. Relaxing the conditions on the minimal signal

Theorem 3.2, which is the main result of the paper, implies that the oracle property of the calibrated CCCP estimator requires the following lower bound on the magnitude of the smallest nonzero regression coefficient

dλcqlogpn, (3.4)

where ab means limn→∞ a/b = ∞, and c is a constant that depends on the design matrix X and other unknown factors such as σ2. When the true model dimension q is fixed, the lower bound for d* is arbitrarily close to the optimal lower bound clogpn for nonconvex penalized approaches (e.g. Zhang, 2010). However, when q is diverging, this bound is suboptimal. In general, there is a tradeoff between the conditions on d* and the conditions on the design matrix. Comparing to the results in the literature, Theorem 3.2 imposes weak conditions on the design matrix and the algorithm we investigate is transparent. In this section, we will prove that the optimal lower bound of d* can be achieved by the calibrated CCCP procedure under a set of slightly stronger conditions on the design matrix.

Note that the calibrated CCCP estimator depends on β^(1), which is the Lasso estimator obtained after the first iteration of the CCCP algorithm. In fact, the lower bound of d* is proportional to the l convergence rate of β^(1), to β*, and Condition (A4) only implies that maxjβ^j(1)βj is proportional to Op(qlogpnτ). If

maxjβ^j(1)βj=Op(logpnτ), (3.5)

we can show that dclogpnτ for any τ = o(1); and hence we can achieve almost the optimal lower bound for d*. Now, the question is under what conditions inequality (3.5) holds. Let vij be the (i, j) entry of XT X. Lounici (2008) derived the convergence rate (3.5) under the condition of mutual coherence:

maxijvij>bq (3.6)

for some constant b > 0. However, the mutual coherence condition would be too strong for practical purposes when q is diverging, since it requires that the pairwise correlations between all possible pairs are sufficiently small. In this subsection, we give an alternative condition for (3.5) based on the l1 operation norm of XT X.

We replace condition (A4) with the slightly stronger condition (A4’) below. We also introduce an additional condition (A6) based on the matrix l1 operational norm. For a given m × m matrix A, the l1 operational norm ∥A1 is defined by A1=maxi=1,,mj=1maij, where aij is the (i, j)th entry of A. Let

ζmax(m)=maxBm,A0Bn1XBTXB1,ζmin(m)=maxBm,A0B(n1XBTXB)11

Condition (A4’): There exist positive constants α and κmin such that

ξmin((α+1)q)κmin (3.7)

and

ξmax(αq)α1576κmin(13ξmax(αq)ακmin)2, (3.8)

where ξmax(m)=maxBm,A0Bλmax(n1XBTXB).

Condition (A6): Let u = α + 1. There exist finite positive constants ηmax and ηmin such that

limsupnζmax(uq)ηmax<

and

limsupnζmin(uq)ηmin<.

Remark 5

Similar conditions to Condition (A4’) were considered by Meinshausen and Yu (2009) and Bickel et al. (2009) for the l2 convergence of the Lasso estimator. However, (3.8) of Condition (A4’), which essentially assumes that ξmax(αq)/α is sufficiently small, is weaker, at least asymptotically, than the corresponding condition in Meinshausen and Yu (2009) and Bickel et al. (2009), which assumes that ξmax(q + min{n, p}) is bounded. Zhang and Zhang (2012) proved that {j:β^j0}A0(α+1)q under Condition (A4’). In addition, Condition (A4’) implies Condition (A4) (see Bickel et al. 2009). Condition (A6) is not too restrictive. Assume the xi’s are randomly sampled from a distribution with mean 0 and covariance matrix ∑. If the l1 operational norm of ∑ and ∑−1 are bounded, then we have ζmax(uq) ≤ maxB∣≤uq,A0⊂B ∥∑B1 + op(1) and ζmin(uq)maxBuq,A0BB11+op(1) provided that q does not diverge too fast. Here ∑B is the ∣B∣ × ∣B∣ whose entries consist of σjl, the (j, l)th entry ∑, for jB and lB. See Proposition A.1 in the online supplementary material of this paper. An example of ∑ satisfying maxB∣≤uq,A0B ∥∑B1 *** ∞ and maxBuq,A0BB11< is a block diagonal matrix where each block is well posed and of finite dimension. Moreover, Condition (A6) is almost necessary for the l convergence of the Lasso estimator. Suppose that p is small and d* is large so that all coefficients of the Lasso coefficients are nonzero. Then,

β^(1)=β^ls+τλ(XTXn)1δ,

where β^ls is the least square estimator, and δ = (δ1, … , δp) with δj=sign(β^jls). Hence, for the sup norm between β^1β^ls to be the order of τλ, the l1 operational norm of (XTX/n)−1 should be bounded.

Theorem 3.6

Assume that conditions A(1)-A(3), (A4’), (A5) and (A6) hold.

(1) If τ = o(1), then for all n sufficiently large,

P(β^(λ)=β^())18pexp[nτ2λ2(8σ2)].

(2) If τ = o(1) and log p = o(2λ2), then

P(β^(λ)=β^())1

as n → ∞

(3) Assume that the conditions of (2) and (3.3) hold. Let λ^ be the tuning parameter selected by HBIC. If Cn,qCnlog(p)=o(n),Kn2log(p)log(n)=o(n), then P(Mλ^=A0)1, as n, p → ∞.

Remark 6

We only need τ = o(1) in Theorem 3.6 for the probability bound of the calibrated CCCP estimator, while Theorem 3.2 requires τγ−2q = o(1): Under the conditions of Theorem 3.6, the oracle property of β^(λ) holds when

dλ1τlogpn. (3.9)

Since τ can converge to 0 arbitrarily slowly (e.g. τ = 1/log n), the lower bound of d* given by (3.9), logpnτ, is almost optimal.

4. Numerical results

4.1. Monte Carlo studies

We now investigate the sparsity recovery and estimation properties of the proposed estimator via numerical simulations. We compare the following estimators: the oracle estimator which assumes the availability of the knowledge of the true underlying model; the Lasso estimator (implemented using the R package glmnet); the adaptive Lasso estimator (denoted by ALasso, Zou (2006), Section 2.8 of Bühlmann and van de Geer (2011)), the hard-thresholded Lasso estimator (denoted by HLasso, Section 2.8, Bühlmann and van de Geer (2011)), the SCAD estimator from the original CCCP algorithm without calibration (denoted by SCAD); the MCP estimator with a = 1.5 and 3. For Lasso and SCAD, 5-fold cross-validation is used to select the tuning parameter; for ALasso, sequential tuning as described in Chapter 2 of Bühlmann and van de Geer (2011) is applied. For HLasso, following a referee’s suggestion, we first used λ as the tuning parameter to obtain the initial Lasso estimator, then thresholded the Lasso estimator using thresholding parameter η = for some c > 0 and refitted least squares regression. We denote the solution path of HLasso by β^HL(λ,cλ), and apply HBIC to select λ. We consider c = 2 and set Cn = log log n in the HBIC as it is found they lead to overall good performance for HLasso. The MCP estimator is computed using the R package PLUS with the theoretical optimal tuning parameter value λ=σ(2n)logp, where the standard deviation σ is taken to be known. For the proposed calibrated CCCP estimator (denoted by New), we take τ = 1/log n and set Cn = log log n in the HBIC. We observe that the new estimator performs similarly if we take τ = λ. In the following, we report simulation results from two examples. Results of additional simulations can be found in the online supplemental file.

Example 1

We generate a random sample {yi, xi}, i = 1, … , 100 from the following linear regression model:

yi=xiTβ+i,

where β=(3,1.5,0,0,2,0p5TT with 0k denoting a k-dimensional vector of zeros, the p-dimensional vector xi has the N(0p, Σ) distribution with covariance matrix Σ, i is independent of xi and has a normal distribution with mean zero and standard deviation σ = 2. This simulation setup was considered in Fan and Li (2001) for a small p case. In this example, we consider p = 3000 and the following choices of Σ: (1) Case 1a: the (i, j)th entry of Σ is equal to 0.5∣i−j∣, 1 ≤ i, jp; (2) Case 1b: the (i, j)th entry of Σ is equal to 0.8∣i−j∣, 1 ≤ i, jp; (3) Case 1c: the (i, j)th entry of Σ equal to 1 if i = j and 0.5 if 1 ≤ ijp.

Example 2

We consider a more challenging case by modifying Example 1 Case 1a. We divide the p components of β* into continuous blocks of size 20. We randomly select 10 blocks and assign each block the value (3,1.5,0,0,2,015T1.5. Hence, the number of nonzero coefficients is 30. The entries in other blocks are set to be zero. We consider σ = 1. Two different cases are investigated: (1) Case 2a: n = 200 and p = 3000; (2) Case 2b: n = 300 and p = 4000.

In the two examples, based on 100 simulation runs we report the average number of non-zero coefficients correctly estimated to be nonzero (i.e., true positive, denoted by TP) and average number of zero coefficients incorrectly estimated to be nonzero (i.e., false positive, denoted by FP) and the proportion of times the true model is exactly identified (denoted by TM). These three quantities describe the ability of various estimators for sparsity recovery. To measure the estimation accuracy, we report the mean squared error (MSE), which is defined to be 1001m=1100β^(m)β2, where β^(m) is the estimator from the mth simulation run.

The results are summarized in Table 1 and Table 2. It is not surprising that Lasso always overfits. Other procedures improve the performance of Lasso by reducing the false positive rate. The SCAD estimator from the original CCCP algorithm without calibration has no guarantee to find a good local minimum and has low probability of identifying the true model. The best overall performance is achieved by the calibrated new estimator: the probability of identifying the true model is high and the MSE is relatively small. The HLasso (with thresholding parameter selected by our proposed HBIC) and MCP (using PLUS algorithm and the theoretically optimal tuning parameter) also have overall fine performance. We do not report the results of the MCP with a = 1.5 for Example 2 since the PLUS algorithm sometimes runs into convergence problems.

Table 1.

Example 1. We report TP (the average number of non-zero coefficients correctly estimated to be nonzero, i.e., true positive), FP (average number of zero coefficients incorrectly estimated to be nonzero, i.e. false positive), TM (the proportion of the true model being exactly identified) and MSE.

Case method TP FP TM MSE
1a Oracle 3.00 0.00 1.00 0.146
Lasso 3.00 28.99 0.00 1.101
ALasso 3.00 11.47 0.01 1.327
HLasso 3.00 0.49 0.79 0.383
SCAD 3.00 10.12 0.08 1.496
MCP(a = 1.5) 2.89 0.28 0.76 0.561
MCP(a = 3) 2.91 0.42 0.68 1.292
New 2.99 0.09 0.91 0.222

1b Oracle 3.00 0.00 1.00 0.314
Lasso 3.00 20.64 0.00 1.248
ALasso 3.00 8.84 0.02 1.527
HLasso 2.79 0.50 0.56 1.244
SCAD 2.99 7.42 0.17 1.598
MCP(a = 1.5) 2.02 0.51 0.06 5.118
MCP(a = 3) 1.99 0.60 0.02 5.437
New 2.77 0.21 0.66 1.150

1c Oracle 3.00 0.00 1.00 0.195
Lasso 2.99 28.22 0.00 2.987
ALasso 2.96 10.09 0.02 2.433
HLasso 2.84 0.77 0.56 1.361
SCAD 2.96 18.09 0.01 3.428
MCP(a = 1.5) 2.67 0.17 0.72 1.636
MCP (a = 3) 2.77 0.22 0.68 1.677
New 2.79 0.46 0.58 1.244
Table 2.

Example 2. Captions are the same as those in Table 1.

Case method TP FP TM MSE
2a Oracle 30.00 0.00 1.00 0.223
Lasso 30.00 143.14 0.00 3.365
ALasso 29.98 7.50 0.00 0.393
HLasso 29.97 1.09 0.74 0.312
SCAD 29.98 46.15 0.00 2.495
MCP (a = 3) 29.83 0.50 0.92 0.807
New 29.99 0.20 0.89 0.247

2b Oracle 30.00 0.00 1.00 0.137
Lasso 30.00 133.65 0.00 1.089
ALasso 30.00 1.32 0.29 0.165
HLasso 30.00 0.00 1.00 0.137
SCAD 30.00 21.83 0.00 0.599
MCP (a = 3) 30.00 0.08 0.92 0.137
New 30.00 0.00 0.99 0.135

4.2. Real data analysis

To demonstrate the application, we analyze the gene expression data set of Scheetz et al. (2006), which contains expression values of 31042 probe sets on 120 twelve-week-old male offspring of rats. We are interested in identifying genes whose expressions are related to that of gene TRIM32 (known to be associated with human diseases of the retina) corresponding to probe 1389163_at. We first preprocess the data as described in Huang et al.(2008) to exclude genes that are either not expressed or lacking sufficient variation. This leaves 18957 genes.

For the analysis, we select 3000 genes that display the largest variance in expression level. We further analyze the top p (p = 1000 and 2000) genes that have the largest absolute value of marginal correlation with gene TRIM32. We randomly partition the 120 rats into the training data set (80 rates) and testing data set (40 rats). We use the training data set to fit the model and select the tuning parameter; and use the testing data set to evaluate the prediction performance. We perform 1000 random partitions and report in Table 3 the average model sizes and the average prediction error on the testing data set for p = 1000 and 2000. For the MCP estimators, the tuning parameters are selected by cross-validation since the standard deviation of the random error is not known. We observe that the Lasso procedure yields the smallest prediction error. However, this is achieved by fitting substantially more complex models. The calibrated CCCP algorithm as well as ALasso and HLasso result in much sparser models with still small prediction errors. The performance of the MCP procedure is satisfactory but its optimal performance depends on the parameter a. In screening or diagnostic applications, it is often important to develop an accurate diagnostic test using as few features as possible in order to control the cost. The same consideration also matters when selecting target genes in gene therapies.

Table 3.

Gene expression data analysis. The results are based on 100 random partitions of the original data set.

p method ave model size Prediction Error
1000 Lasso 31.17 0.586
ALasso 11.76 0.646
HLasso 12.04 0.676
SCAD 4.81 0.827
MCP(a = 1.5) 11.79 0.668
MCP(a = 3) 7.02 0.768
New 8.50 0.689

2000 Lasso 32.01 0.604
ALasso 11.01 0.661
HLasso 10.82 0.689
SCAD 4.57 0.850
MCP(a = 1.5) 11.33 0.700
MCP(a = 3) 6.78 0.788
New 7.91 0.736

We also applied the calibrated CCCP procedure directly to the 18957 genes and evaluated the predicative performance based on 100 random partitions. The calibrated CCCP estimator has an average model size 8.1 and an average prediction error 0.58. Note that the model size and predictive performance are similar to what we obtain when we first select 1000 (or 2000) genes with the largest variance and marginal correlation. This demonstrates the stability of the calibrated CCCP estimator in ultra-high dimension.

When a probe is simultaneously identified by different variable selection procedures, we consider it as evidence for the strength of the signal. Probe 1368113_at is identified by both Lasso and the calibrated CCCP estimator. This probe corresponds to gene tff2, which was found to up-regulate cell proliferation in developing mice retina (Paunel-Görgülü et al., 2011). On the other hand, the probes identified by the calibrated CCCP but not by Lasso also merit further investigation. For instance, probe 1371168_at was identified by the calibrated CCCP estimator but not by Lasso. This probe corresponds to gene mpp2, which was found to be related to protein metabolism abnormalities in the development of retinopathy in diabetic mice (Gao et al., 2009).

4.3. Extension to penalized logistic regression

Regularized logistic regression is known to automatically result in a sparse set of features for classification in ultra-high dimension (van de Geer, 2008; Kwon and Kim, 2011). We consider the representative two-class classification problem, where the response variable yi takes two possible values 0 or 1, indicating the class membership. It is assumed that

P(yi=1xi)=exp(xiTβ){1+exp(xiTβ)} (4.1)

The penalized logistic regression estimator minimizes

n1i=1n[(xiTβ)yi+log{1+exp(xiTβ)}]+j=1ppλ(βj).

When a nonconvex penalty is adopted, it is easy to see that the CCCP algorithm can be extended to this case without difficulty as the penalized log-likelihood naturally possesses the convex-concave decomposition discussed in Section 2.2 of the main paper, because of the convexity of the negative log-likelihood for the exponential family. For easy implementation, the CCCP algorithm can be combined with the iteratively reweighted least squares algorithm for ordinary logistic regression, thus taking advantage of the CCCP algorithm for linear regression. Denote the nonconvex penalized logistic regression estimator by β^, then for a new feature vector x, the predicted class membership is I(exp(xTβ^)(1+exp(xTβ^))>0.5).

We demonstrate the performance of nonconvex penalized sion logistic regresfor classification through the following example: we generate xi as in Example 1 of the main paper, and the response variable yi is generated according to (4.1) with β(3,1.5,0,0,2,0p50T)T. We consider sample size n = 300 and feature dimension p = 2000. Furthermore, an independent test set of size 1000 is used to evaluate the misclassificaiton error. The simulation results are reported in Table 4. The results demonstrate that the calibrated CCCP estimator is effective in both accurate classification and identifying the relevant features.

Table 4.

Simulations for classification in high dimension (n = 300, p = 2000).

method TP FP TM Misclassification Rate
Oracle 3.00 0.00 1.00 0.116
Lasso 3.00 46.48 0.00 0.134
SCAD 2.08 4.02 0.04 0.161
ALASSO 2.02 4.58 0.00 0.188
HLASSO 2.87 0.00 0.87 0.120
MCP (a = 3) 2.96 0.56 0.54 0.128
New 2.99 0.00 0.99 0.116

We expect that the theory we derived for the linear regression case continues to hold for the logistic regression under similar conditions due to the convexity of the negative log-likelihood function and the fact that the Bernoulli random variables automatically satisfies the sub-Gaussian tail assumption. The latter is essential for obtaining the exponential bounds in deriving the theory.

5. Revisiting local minima of nonconvex penalized regression

In the following, we shall revisit the issue of multiple local minima of non-convex penalized regression. We derive an L2 bound of the distance between a sparse local minimum and the oracle estimator. The result indicates that a local minimum which is sufficiently sparse often enjoys fairly accurate estimation even when it is not the oracle estimator. This result, to our knowledge, is new in the literature on high-dimensional nonconvex penalized regression.

Our theory applies the necessary condition for the local minimizer as in Tao and An (1997) for convex differencing problems. Let

Qn(β)=(2n)1yXβ2+λj=1ppλ(βj)

and

(β)={ξRp:ξj=n1x(j)T(yXβ)+λlj},

where lj = sign(βj) if βj ≠ 0 and lj ∈ [−1, 1] otherwise, 1 ≤ jp. As Qn(β) can be expressed as the difference of two convex functions, a necessary condition for β to be a local minimizer of Qn(β) is

hn(β)β(β), (5.1)

where hn(β)=j=1pJλ(βj), where Jλ(∣βj∣) is defined in Section 2.2 for SCAD and MCP penalty functions.

To facilitate our study, we introduce below a new concept.

Definition 5.1

The relaxed sparse Riesz condition (SRC) in an L0-neighborhood of the true model is satisfied for a positive integer m (2qmn) if

ξmin(m)cfor some0<c<,

where ξmin is defined in (2.5).

Remark

The relaxed SRC condition is related to, but generally weaker than the sparse Reisz condition (Zhang and Huang 2008, Zhang 2010), the restricted eigenvalue condition of Bickel et al. (2009) and the partial orthogonality condition of Huang et al. (2008).

The theorem below unveils that for a given sparse estimator which is a local minimum of (1.1), its L2 distance to the oracle estimator β^(o) has an upper bound, which is determined by three key factors: tuning parameter λ, the sparsity size of the local solution, and the magnitude of the smallest sparse eigenvalue as characterized by the relaxed SRC condition. To this end, we consider any local minimum β^=(β^j,,β^j)T corresponding to the tuning parameter λ. Assume that the sparsity size of this local solution satisfies: β^0qun for some un > 0:

Theorem 5.2

(Properties of the local minima of nonconvex penalized regression) Consider SCAD or MCP penalized least squares regression. Assume that conditions (A1) and (A2) hold, and that the relaxed SRC condition in an L0-neighborhood of the true model is satisfied for m=qun where un=un+1. Then if λ = o(d*), then for all n sufficiently large,

P{β^(λ)β^(o)2λqunξmin1(qun)}12qexp[C1n(daλ)2(2σ2)]2(pq)exp[nλ2(2σ2)], (5.2)

where ξmin(m) is defined in (2.5) and the positive constant C1 is defined in (A1).

Corollary 5.3

Under the conditions of Theorem 5.2, if we take λ=3log(p)n, then we have

P{β^(λ)β^(o)12qunlog(p)nξmin2(qun)}12qexp[C1n(daλ)2(2σ2)]2(pq)exp[nλ2(2σ2)],

The simple form in the above corollary suggests that if a local minimum is sufficiently sparse, in the sense that un diverge to ∞ very slowly, this bound is nevertheless quite tight as the rate q log(p)/n is near-oracle. The factor unξmin2(qun) is expected to go to infinity at a relatively slow rate if the local solution is sufficiently sparse. Our experience with existing algorithms for solving nonconvex penalized regression is that they often yield a sparse local minimum, which however has a low probability to be the oracle estimator itself.

6. Proofs

We will provide here proofs for the main theoretical results in this paper.

Proof of Theorem 3.2

By definition, β^(λ)=argminβQλ(ββ^(1)), where Qλ(ββ^(1))=(2n)1yXβ2+j=1pJλ(β^j(1))βj+λj=1pβj. Since Qλ(ββ^(1)) is a convex function of β, the KKT condition is necessary and sufficient for characterizing the minimum. To verify that β^(o) is the minimizer of Qλ(ββ^(1)), it is sufficient to show that

n1x(j)T(yXβ^(o))+Jλ(β^j(1))+λsign(β^j(o))=0,jA0, (6.1)

and

n1x(j)T(yXβ^(o))+Jλ(β^j(1))λ,jA0. (6.2)

We first verify (6.1). Note that with the initial value 0, we have β^(1)=argminβ{(2n)1yXβ2+τλβ1}. Let Fn3={β^(1)β116τλγ2q}, where ∥ · ∥1 denotes the L1 norm. By modifying the proof of Theorem 7.2 of Bickel et al. (2009), we can show that under the conditions of the theorem,

P(Fn3)12pexp(nτ2λ2(8σ2)). (6.3)

By the assumption of the theorem, on the event Fn3=β^(1)β1λ2 for all n sufficiently large. Furthermore, we consider the event Fn1 defined in Lemma 3.1 with b1 = 1/2. By Lemma 3.1, we have P(β^(o)βλ2)12qexp[C1nλ2(8σ2)]. By the assumption λ = o(d*), for all n sufficiently large, on the event Fn1Fn3, we have sign(β^j(1))=sign(β^j(o)), for jA0 and minjA0β^j(1)>aλ. Hence, by condition (A3), on the event Fn1Fn3,Jλ(β^j(1))=λsign(β^j(1))=λsign(β^j(o)). Furthermore, n1x(j)T(λXβ^(o)=0, for jA0, following the definition of the oracle estimator. Therefore (6.1) holds with probability at least 1 − 2q exp[−C12/(8σ2)] − 2p exp(− 2λ2/(8σ2)).

Next we verify (6.2). On the event Fn3, we have maxjA0β^j(1)λ2, for all n sufficiently large. We consider the event Fn2 defined in Lemma 3.1 with b2 = 1/2. Lemma 3.1 implies that P(Fn2) ≥ 1 − 2(pq) exp[−2/(8σ2)]. On the event Fn2 we have maxjA0n1x(j)T(yXβ^(o))λ2. By condition (A3), on the event Fn2Fn3, (6.2) holds, and this occurs with probability at least 1 − 2(pq) exp[−2/(8σ2)] − 2p exp ( − 2λ2/(8σ2).

The above two steps proves (1). The result in (2) follows immediately from (1).

Proof of Corollary 3.3 and Corollary 3.4

The proof follows immediately from Theorem 3.2.

Proof of Theorem 3.5

Recall that Mλ={j:β^j(λ)0}. We define the following three index sets: Λn = {λ > 0 : λ ∈ Λn, A0 ⊄, Mλ}, Λn0 = {λ > 0 : λ ∈ Λn, A0 = Mλ}, and Λn+ = {λ > 0 : λ ∈ Λn A0Mλ and A0 ≠ = Mλ}. In other words, Λn, Λn0 and Λn+ denote the sets of λ values which lead to underfitted, exactly fitted and overfitted models, respectively. For a given model (or equivalently an index set) M, let SSEM=infβMRMyXMβM2. That is, SSEM is the sum of squared residuals when the least squares method is used to estimate model M. Also, let σ^M2=n1SSEM. From the definition, we always have σ^λ2σ^Mλ2.

Consider λn satisfying the of Theorem 3.2(2). We have P(Mλn = A0) → 1. We will prove that P(infλΛn [HBIC(λ) − HBIC(λn)] > 0) → 1 and P(infλ∈Λn+ [HBIC(λ) − HBIC(λn) > 0) → 1.

Case I

Consider an arbitrary λ ∈ Λn, i.e., the model corresponding to Mλ is underfitted.

P(infλΛn[HBIC(λ)HBIC(λn)]>0)=P(infλΛn[HBIC(λ)HBIC(λn)]>0,Mλn=A0)+P(infλΛn[HBIC(λ)HBIC(λn)]>0,MλnA0)P(infλΛn[log(σ^Mλ2σ^A02)+(Mλq)Cnlog(p)n]>0)+o(1),

where the inequality uses Theorem 3.2(2). Furthermore, we observe that

log(σ^Mλ2σ^A02)=log(1+n[σ^Mλ2σ^A02]T(InPA0)).

Applying the inequality log(1 + x) ≥ min{0.5x, log(2)}, ∀ x > 0, we have

P(infλΛn[HBIC(λ)HBIC(λn)]>0)P(min{infλΛnn(σ^Mλ2σ^A02)2T(InPA0),log(2)}qCnlog(p)n>0)+0(1)

To evaluate T (InPA0), we apply Corollary 1.3 of Mikosch (1991) with their An = InPA0 , Bn = 2σ4(nq), μn = σ2 and yn = (nq)/(log n), we have P(T(InPA0) ≤ 2σ2(nq)) → 1 as n → ∞. Thus

P(infλΛn[HBIC(λ)HBIC(λn)]>0)P(min{infλΛnn(σ^Mλ2σ^A02)4(nq)σ2,log(2)}qCnlog(p)n>0)+o(1).

In what follows, we will prove that qCnlog(p)=o(infλΛnn(σ^Mλ2σ^A02)), which combining with the assumption qCn log(p) = o(n) leads to the conclusion P(infλ∈Λn [HBIC(λ) − HBIC(λn)] > 0) → 1.

We have

n(σ^Mλ2σ^MT2)=μT(InPMλ)μ+2μT(InPMλ)TPMλ+TPA0=I1+I2I3+I4,

where μ = Xβ*, PMλ is the projection matrix into the space spanned by the columns of XMλ, and the definition of Ii, i = 1, 2, 3, 4, should be clear from the context. Let M = {j : jMλ, jMT}. Note that M is nonempty since Mλ underfits.

By assumption (3.3), ∣I1∣ ≥ κn, for all n sufficiently large. To evaluate I2, we have

I2=2μT(InPMλ)μZ(Mλ)=2I1Z(Mλ),

where Z(Mλ)=anT with anT=(μT(InPMλ)μ)12μT(InPMλ). Note that ∥an2 = 1 and Λt=0Kn(pt)σt=0Knpt=pKn+11p12pKn. Applying the sub-Gaussian tail property in (3.1), we have

P(supηΛnZ(Mλ)>nlog(n))4pKnexp(n(2σ2log(n)))=4exp(Knlog(p)n(2σ2log(n)))0,

as Kn log(p) log(n) = o(n). Hence supη∈ΛnI2∣. To evaluate I3, let r(λ) = Trace(PMλ). It follows from Proposition 3 of Zhang (2010) that for the sub-Gaussian random variables i, ∀ t > 0,

P{TPMλr(λ)σ21+t[12(et21+t1)]+2}exp(r(λ)t2)(1+t)r(λ)2. (6.4)

We take t = n/(2σ2Kn log(n)) − 1 in the above inequality. Then t → ∞ by the assumptions of the theorem. Thus for all n sufficiently large,

P(supλΛnTPMλ>nlog(n))P(supλΛnTPMλr(λ)σ2>nσ2Knlog(n))P(supλΛnTPMλr(λ)σ2>1+t[12(et21+t1)]+2)2pKnexp(n(8σ2)Knlog(n)))(2σ2Knlog(n)))Kn22exp(Knlog(p)n(8σ2Knlog(n))+Knlog(n(2σ2Knlog(n)))0,

since Kn2log(p)log(n)=o(n). Finally, TPA0 does not depend on λ. Similarly as above, P(supλ∈ΛnI4∣ ≥ n/log(n)) → ∞ 0 by the sub-Gaussian tail condition. Therefore, with probability approaching one, n(σ^Mλ2σ^A02) is dominated by I1. This finishes the proof for the first case as qCn log(p) = o(n).

Case II

Consider an arbitrary λ ∈ Λn+, i.e., the model corresponding to Mλ is overfitted. In this case, we have yT(InPMλ)y = T(InPMλ). Therefore, n(σ^A02σ^Mλ2)=T(PMλPA0). Let ^=(InPA0), then

log(σ^A02σ^Mλ2)=log(1+T(PMλPA0)T(InPMλ))T(PMλPA0)^T^T(PMλPA0),

by the fact log(1 + x) ≤ x,∀ x ≥ 0.

Similarly as in Case I,

P(infλΛn+[HBIC(λ)HBIC(λn)]>0)=P(infλΛn+[log(σ^A02σ^Mλ2)+(Mλq)Cnlog(p)n]>0)+o(1)P(infλΛn+[(Mλq)Cnlog(p)nT(PMλPA0)^T^T(PMλPA0)]>0)+o(1)=P(infλΛn+{(Mλq)[Cnlog(p)nT(PMλPA0))(Mλq)^T^T(PMλPA0)]}>0)+o(1).

It suffices to show that

P(infλΛn+[Cnlog(p)nT(PMλPA0)(Mλq)^T^T(PMλPA0)]>0)1,

which is implied by

P[Cnlog(p)nsupλΛn+T(PMλPA0)(Mλq)^T^supλΛn+T(PMλPA0)>0)1

Note that E(^T^)=Var(i)Trace(InPA0)(nq)σ2, hence^T^=Op(n). Similarly as in case I, we can show that P(supλ∈Λn+ T(PMλPA0) > n/log(n)) → 0, since Kn2log(p)log(n)=o(n). Thus^T^supλΛn+T(PMλPA0)=Op(n). Furthermore, applying (6.4) by letting t = 8 log(p) − 1, we have for all n sufficiently large,

P(supλΛn+T(PMλPA0)Mλq>16σ2log(p))Mλ=q+1p(pqMλq)exp((Mλq)2)(1+t)Mλq2=k=1pq(pqk)exp(2klog(p))(8log(p))12=k=1pq(pqk)(8log(p)pn2)k(1+8log(p)p2)pq10.

Thus with probability approaching one, for all n sufficiently large,

Cnlog(p)nsupλΛn+T(PMλPA0)(Mλq)^T^supλΛn+T(PMλPA0)>n1Cnlog(p)n1O(log(p))>0,

since Cn → ∞. This finishes the proof.

Proof of Theorem 3.6

We will first prove that there exists a constant C > 0 such that for Fn4={maxjβ^j(1)βjCτλ}, we have

P(Fn4)12pexp(nτ2λ28σ2). (6.5)

Let Fn5 = {∣Sj(β*)∣ ≤ τλ/2 for all j}: Since

P(Fn4c)j=1pP(x(j)Tn>τλ2)2pexp(nτ2λ28σ2),

we have

P(Fn5)12p,exp(nτ2λ28σ2).

Hence to prove (6.5), it suffices to show that Fn5Fn4.

Let

θ=inf{qXTXunu1:uA0c13uA01}.

Corollary 2 of Zhang and Zhang (2012) proves that on the event Fn5, ∣AA0∣ ≤ (α + 1)q, where A={j:β^j(1)=0}, provided

ξmax(αq)α136θ.

Since $\theta\ge\gamma^2/16$ (see (7) of Zhang and Zhang, 2012), where $\gamma$ is defined in (A4), and

\[
\gamma\ \ge\ \kappa_{\min}\Big(1-3\sqrt{\frac{\xi_{\max}(\alpha q)}{\alpha\,\kappa_{\min}}}\Big)
\]

(see Bickel et al., 2009), condition (A4′) implies that

$F_{n5}\subset\{|A\cup A_0|\le(\alpha+1)q\}$. Let $C(\beta)=(2n)^{-1}\|y-X\beta\|^2+\tau\lambda\sum_{j=1}^p|\beta_j|$. Then we have
\[
C(\beta)-C(\beta^*)=-\sum_{j=1}^p(\beta_j-\beta_j^*)S_j(\beta^*)+\frac{(\beta-\beta^*)^TX^TX(\beta-\beta^*)}{2n}+\tau\lambda\sum_{j=1}^p\big(|\beta_j|-|\beta_j^*|\big). \tag{6.6}
\]
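Identity (6.6) is an exact finite-sample expansion and can be checked directly on simulated data; in the sketch below, the design, the value of $\tau\lambda$, and the trial point $\beta$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:3] = 1.5
y = X @ beta_star + rng.standard_normal(n)
tau_lam = 0.3                                    # plays the role of tau * lambda
beta = beta_star + 0.1 * rng.standard_normal(p)  # an arbitrary trial point

def C(b):
    return np.sum((y - X @ b) ** 2) / (2 * n) + tau_lam * np.sum(np.abs(b))

S_star = X.T @ (y - X @ beta_star) / n           # S_j(beta*) = n^{-1} x_(j)'(y - X beta*)
d = beta - beta_star
rhs = (-d @ S_star + d @ (X.T @ X) @ d / (2 * n)
       + tau_lam * (np.abs(beta).sum() - np.abs(beta_star).sum()))
print(np.isclose(C(beta) - C(beta_star), rhs))   # True: identity (6.6) holds exactly
```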

Let $\widehat{X\beta^*}$ be the projection of $X\beta^*$ onto $\mathrm{span}(X_A)$, the linear subspace spanned by the column vectors of $X_A$. We define the $p$-dimensional vector $\gamma^*$ such that $\widehat{X\beta^*}=X_A\gamma_A^*$ and $\gamma_j^*=0$ for $j\in A^c$. We have

\[
(\hat\beta^{(1)}-\beta^*)^TX^TX(\hat\beta^{(1)}-\beta^*)=(\hat\beta_A^{(1)}-\gamma_A^*)^TX_A^TX_A(\hat\beta_A^{(1)}-\gamma_A^*)+\|X\beta^*-X_A\gamma_A^*\|^2.
\]

Therefore, we can write

\[
\hat\beta^{(1)}=\arg\min_{\beta:\,\beta_{A^c}=0}\Big\{-\sum_{j\in A}\beta_jS_j(\beta^*)+\frac{(\beta_A-\gamma_A^*)^TX_A^TX_A(\beta_A-\gamma_A^*)}{2n}+\tau\lambda\sum_{j\in A}|\beta_j|\Big\}.
\]

Hence $\hat\beta_A^{(1)}-\gamma_A^*=\big(\frac{X_A^TX_A}{n}\big)^{-1}\theta_A$, where $\theta\in\mathbb{R}^p$ is such that $\theta_j=0$ for $j\in A^c$ and $\theta_j=S_j(\beta^*)-\mathrm{sign}(\hat\beta_j^{(1)})\tau\lambda$ for $j\in A$. On $F_{n5}$, $\max_j|\theta_j|\le 3\tau\lambda/2$. Therefore, condition (A6) with (6.6) implies that on the event $F_{n5}$,

\[
\max_{j\in A}|\hat\beta_j^{(1)}-\gamma_j^*|\le\eta_{\min}\,\frac{3\tau\lambda}{2}. \tag{6.7}
\]

It follows from (6.7) that inequality (6.5) holds if we show that $A_0\subset A$, in which case $\gamma_A^*=\beta_A^*$. We will prove this by contradiction. Assume that $A^{(-)}=A_0\cap A^c$ is nonempty. Let $\hat x_{(j)}$ be the projection of $x_{(j)}$ onto $\mathrm{span}(X_A)$ and let $\tilde x_{(j)}=x_{(j)}-\hat x_{(j)}$, $j\in A^{(-)}$. Then we can write

\[
X\beta^*=X_A\gamma_A^*+\sum_{j\in A^{(-)}}\tilde x_{(j)}\beta_j^*.
\]

Let $\tilde y=\sum_{j\in A^{(-)}}\tilde x_{(j)}\beta_j^*$. By Lemma 6.1 below, there exists $l\in A^{(-)}$ such that

\[
\Big|\frac{x_{(l)}^T\tilde y}{n}\Big|\ge\kappa_{\min}d^*. \tag{6.8}
\]

By the KKT condition, we have $\big|\frac{x_{(l)}^T(X\beta^*-X\hat\beta^{(1)})}{n}+S_l(\beta^*)\big|\le\tau\lambda$. However, we can write $\frac{x_{(l)}^T(X\beta^*-X\hat\beta^{(1)})}{n}=\frac{x_{(l)}^TX_A(\gamma_A^*-\hat\beta_A^{(1)})}{n}+\frac{x_{(l)}^T\tilde y}{n}$. The inequalities (6.8) and (6.7), together with condition (A6), imply that on $F_{n5}$,

\[
\begin{aligned}
\Big|\frac{x_{(l)}^T(X\beta^*-X\hat\beta^{(1)})}{n}+S_l(\beta^*)\Big|
&\ge\Big|\frac{x_{(l)}^T\tilde y}{n}\Big|-\Big|\frac{x_{(l)}^TX_A(\gamma_A^*-\hat\beta_A^{(1)})}{n}\Big|-|S_l(\beta^*)|\\
&\ge\Big|\frac{x_{(l)}^T\tilde y}{n}\Big|-\Big\|\frac{X_{A\cup A_0}^TX_{A\cup A_0}}{n}\Big\|_\infty\big\|\gamma_A^*-\hat\beta_A^{(1)}\big\|_\infty-|S_l(\beta^*)|\\
&\ge\kappa_{\min}d^*-\eta_{\max}\eta_{\min}\frac{3\tau\lambda}{2}-\frac{\tau\lambda}{2}\ >\ \tau\lambda
\end{aligned}
\]

if $d^*>3\tau\lambda(\eta_{\max}\eta_{\min}+1)/(2\kappa_{\min})$, which contradicts the KKT condition. Hence, we eventually have $A_0\subset A$ on $F_{n5}$, and this proves (6.5).

We now slightly modify the proof of (1) of Theorem 3.2. More specifically, replacing $F_{n3}$ by $F_{n4}$, we can show that $F_{n1}\cap F_{n2}\cap F_{n4}\subset\{\hat\beta(\lambda)=\hat\beta^{(o)}\}$, and this proves (1). The result in (2) follows immediately from (1). The proof of (3) can be done similarly to that of Theorem 3.5.
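The Lasso KKT conditions invoked throughout this proof for the first-stage estimator $\hat\beta^{(1)}$ can be inspected numerically. The sketch below is a minimal illustration, assuming the standard Lasso objective $(2n)^{-1}\|y-X\beta\|^2+\alpha\|\beta\|_1$ with $\alpha$ playing the role of $\tau\lambda$, and using scikit-learn's coordinate-descent solver; the data and the penalty level are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 200, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:4] = 2.0
y = X @ beta + rng.standard_normal(n)

alpha = 0.1                                   # plays the role of tau * lambda
fit = Lasso(alpha=alpha, fit_intercept=False, max_iter=100_000, tol=1e-10).fit(X, y)
b = fit.coef_
S = X.T @ (y - X @ b) / n                     # S_j at the fitted point: n^{-1} x_(j)'(y - X b)

active = b != 0
# KKT: equality on the active set, |S_j| <= alpha off the support (up to solver tolerance).
print(np.allclose(S[active], alpha * np.sign(b[active]), atol=1e-4))
print(np.all(np.abs(S[~active]) <= alpha + 1e-4))
```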

In the proof of Theorem 3.6, we have used the following lemma, whose proof is given in the online supplementary material.

Lemma 6.1

There exists $l\in A^{(-)}$ which satisfies (6.8).

Proof of Theorem 5.2

By (5.1), a local minimizer β necessarily satisfies:

\[
-n^{-1}x_{(j)}^T(y-X\beta)+\xi_j=0,\qquad j=1,\ldots,p, \tag{6.9}
\]

where $\xi_j=\lambda l_j-\frac{\partial h_n(\beta)}{\partial\beta_j}$, with $l_j=\mathrm{sign}(\beta_j)$ if $\beta_j\ne 0$ and $l_j\in[-1,1]$ otherwise, $1\le j\le p$. It is easy to see that $|\xi_j|\le\lambda$, $1\le j\le p$. Although the objective function is nonconvex, abusing the notation a little, we refer to the collection of all vectors of the form of the left-hand side of (6.9) as the subdifferential $\partial Q_n(\beta)$, and we refer to a specific element of this set as a subgradient. The necessary condition stated above can then be considered as an extension of the classical KKT condition.

Alternatively, minimizing $Q_n(\beta)$ can be expressed as a constrained smooth minimization problem (e.g., Kim et al., 2008). By the corresponding second-order sufficiency of the KKT conditions (e.g., page 320 of Bertsekas, 1999), $\hat\beta$ is a local minimizer of $Q_n(\beta)$ if

\[
n^{-1}x_{(j)}^T(y-X\hat\beta)=\mathrm{sgn}(\hat\beta_j)\,p'_\lambda(|\hat\beta_j|)\quad\text{for}\ \hat\beta_j\ne 0,\qquad \big|n^{-1}x_{(j)}^T(y-X\hat\beta)\big|<\lambda\quad\text{for}\ \hat\beta_j=0.
\]

Consider the event $F_n=F_{n2}\cap F_{n6}$, where $F_{n2}$ is defined in Lemma 3.1 with $b_2=1$, and $F_{n6}=\{\min_{j\in A_0}|\hat\beta_j^{(o)}|\ge a\lambda\}$. Since $|\hat\beta_j^{(o)}|\ge|\beta_j^*|-|\hat\beta_j^{(o)}-\beta_j^*|$ and $\lambda=o(d^*)$, similarly as in the proof of Lemma 3.1, we can show that for all $n$ sufficiently large, $P(F_{n6})\ge 1-2q\exp[-C_1n(d^*)^2/(2\sigma^2)]$. By Lemma 3.1, for all $n$ sufficiently large, $P(F_n)\ge 1-2q\exp[-C_1n(d^*)^2/(2\sigma^2)]-2(p-q)\exp[-n\lambda^2/(2\sigma^2)]$. It is apparent that on the event $F_n$, the oracle estimator $\hat\beta^{(o)}$ satisfies the above sufficient condition. Therefore, by (6.9), there exist $|\xi_j^{(o)}|\le\lambda$, $1\le j\le p$, such that

\[
-n^{-1}x_{(j)}^T(y-X\hat\beta^{(o)})+\xi_j^{(o)}=0.
\]

Abusing notation a little, we denote this zero vector by $\nabla_\beta Q_n(\hat\beta^{(o)})$.

Now, for any local minimizer $\hat\beta$ which satisfies the sparsity constraint $\|\hat\beta\|_0\le qu_n$, we will prove by contradiction that under the conditions of the theorem we must have $\|\hat\beta-\hat\beta^{(o)}\|\le 2\lambda\sqrt{qu_n^*}\,\xi_{\min}^{-1}(qu_n^*)$, where $u_n^*=u_n+1$. More specifically, we will derive a contradiction by showing that none of the subgradients of $Q_n(\beta)$ can be zero at $\beta=\hat\beta$.

Assume instead that $\|\hat\beta-\hat\beta^{(o)}\|>2\lambda\sqrt{qu_n^*}\,\xi_{\min}^{-1}(qu_n^*)$. Let $A^*=\{j:\hat\beta_j\ne 0\ \text{or}\ \hat\beta_j^{(o)}\ne 0\}$; then $\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|>2\lambda\sqrt{qu_n^*}\,\xi_{\min}^{-1}(qu_n^*)$. Let $\nabla_\beta Q_n(\hat\beta)$, with $j$th component $-n^{-1}x_{(j)}^T(y-X\hat\beta)+\eta_j$, be an arbitrary subgradient in the subdifferential $\partial Q_n(\hat\beta)$. Let $\eta=(\eta_1,\ldots,\eta_p)^T$; then $\eta_j$ satisfies $|\eta_j|\le\lambda$, $1\le j\le p$. We use $\nabla_{A^*}Q_n(\hat\beta)$ to denote the size-$|A^*|$ subvector of $\nabla_\beta Q_n(\hat\beta)$, i.e., $\nabla_{A^*}Q_n(\hat\beta)=(\nabla_{\beta_j}Q_n(\hat\beta):j\in A^*)^T$; $\nabla_{A^*}Q_n(\hat\beta^{(o)})$ is defined similarly. We have

\[
\begin{aligned}
\frac{\nabla_{A^*}Q_n(\hat\beta)^T(\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*})}{\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|}
&=\frac{\big(\nabla_{A^*}Q_n(\hat\beta)-\nabla_{A^*}Q_n(\hat\beta^{(o)})\big)^T(\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*})}{\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|}\\
&=\frac{n^{-1}(\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*})^TX_{A^*}^TX_{A^*}(\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*})}{\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|}+\frac{(\eta_{A^*}-\xi^{(o)}_{A^*})^T(\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*})}{\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|}\\
&\ge\phi_{\min}\big(n^{-1}X_{A^*}^TX_{A^*}\big)\,\|\hat\beta_{A^*}-\hat\beta^{(o)}_{A^*}\|-2\lambda\sqrt{qu_n^*}\\
&>\xi_{\min}(qu_n^*)\cdot 2\lambda\sqrt{qu_n^*}\,\xi_{\min}^{-1}(qu_n^*)-2\lambda\sqrt{qu_n^*}=0,
\end{aligned}
\]

where the first equality uses $\nabla_{A^*}Q_n(\hat\beta^{(o)})=0$, the second equality follows from the expression of the subgradient, the second-to-last inequality applies the Cauchy-Schwarz inequality, and the last inequality follows from the relaxed SRC condition in an $L_0$-neighborhood of the true model. This contradicts the fact that at least one of the subgradients must be zero when $\hat\beta$ is a local minimizer, and the theorem is proved.
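The sufficient KKT-type condition displayed in this proof, and the claim that the oracle estimator satisfies it on $F_n$, can be illustrated numerically. The sketch below assumes the SCAD derivative $p'_\lambda(u)=\lambda\{I(u\le\lambda)+(a\lambda-u)_+/((a-1)\lambda)\,I(u>\lambda)\}$ of Fan and Li (2001) with $a=3.7$; the data, the signal strength, and the value of $\lambda$ are arbitrary choices for illustration only.

```python
import numpy as np

def scad_deriv(u, lam, a=3.7):
    """SCAD penalty derivative p'_lambda(u) for u >= 0 (Fan and Li, 2001)."""
    return lam * ((u <= lam) + np.maximum(a * lam - u, 0) / ((a - 1) * lam) * (u > lam))

rng = np.random.default_rng(4)
n, p, q = 400, 200, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:q] = 2.0                     # strong signals, well above a*lambda
y = X @ beta + rng.standard_normal(n)

lam = 0.2
# Oracle estimator: least squares on the true support A0, zero elsewhere.
b_oracle = np.zeros(p)
b_oracle[:q] = np.linalg.lstsq(X[:, :q], y, rcond=None)[0]

S = X.T @ (y - X @ b_oracle) / n
# On the active set: S_j = sgn(b_j) p'_lambda(|b_j|); here p'_lambda(|b_j|) = 0 because
# |b_j| > a*lambda, and S_j = 0 by least-squares orthogonality on A0.
print(np.allclose(S[:q], np.sign(b_oracle[:q]) * scad_deriv(np.abs(b_oracle[:q]), lam)))
# Off the support: |S_j| < lambda, which holds with high probability for this lambda.
print(np.max(np.abs(S[q:])) < lam)
```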

Proof of Corollary 5.3

It follows directly from Theorem 5.2.


Footnotes

SUPPLEMENTARY MATERIAL Supplement to “Calibrating Non-convex Penalized Regression in Ultra-high Dimension”:

(doi: COMPLETED BY THE TYPESETTER; .pdf). This supplemental material includes the proofs of Lemmas 3.1 and 6.1, and some additional numerical results.

*

Supported in part by National Science Foundation grant DMS-1308960.

Supported in part by National Research Foundation of Korea grant number 20100012671, funded by the Korea government.

Supported in part by National Natural Science Foundation of China grant 11028103 and NIH grants P50 DA10075, R21 DA024260, R01 CA168676 and R01 MH096711.

References

  • [1]. Bertsekas DP. Nonlinear Programming. 2nd edition. Athena Scientific; Belmont, MA: 1999.
  • [2]. Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics. 2009;37:1705–1732.
  • [3]. Bühlmann P, van de Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer; 2011.
  • [4]. Cai T, Zhou H. Minimax estimation of large covariance matrices under l1 norm. To appear in Statistica Sinica. 2011.
  • [5]. Chen J, Chen Z. Extended Bayesian information criterion for model selection with large model space. Biometrika. 2008;95:759–771.
  • [6]. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  • [7]. Fan J, Lv J. Non-concave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486.
  • [8]. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics. 2004;32:928–961.
  • [9]. Frank IE, Friedman JH. A statistical view of some chemometric regression tools (with discussion). Technometrics. 1993;35:109–148.
  • [10]. Gao BB, Phipps JA, Bursell D, Clermont AC, Feener EP. Angiotensin AT1 receptor antagonism ameliorates murine retinal proteome changes induced by diabetes. Journal of Proteome Research. 2009;8:5541–5549. doi: 10.1021/pr9006415.
  • [11]. Huang J, Ma SG, Zhang C-H. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
  • [12]. Hunter DR, Li R. Variable selection using MM algorithms. Annals of Statistics. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
  • [13]. Kim Y, Choi H, Oh H-S. Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association. 2008;103:1665–1673.
  • [14]. Kim Y, Kwon S. Global optimality of nonconvex penalized estimators. Biometrika. 2012;99:315–325.
  • [15]. Kim Y, Kwon S, Choi H. Consistent model selection criteria on high dimensions. Journal of Machine Learning Research. 2012;13:1037–1057.
  • [16]. Kwon S, Kim Y. Large sample properties of the smoothly clipped absolute deviation penalized maximum likelihood estimation on high dimensions. Accepted by Statistica Sinica. 2011.
  • [17]. Lounici K. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics. 2008;2:90–102.
  • [18]. Mazumder R, Friedman J, Hastie T. SparseNet: coordinate descent with non-convex penalties. Journal of the American Statistical Association. 2011;106:1125–1138. doi: 10.1198/jasa.2011.tm09738.
  • [19]. Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics. 2009;37:246–270.
  • [20]. Mikosch T. Estimates for tail probabilities of quadratic and bilinear forms in subgaussian random variables. Probability and Mathematical Statistics. 1991;11:169–178.
  • [21]. Paunel-Görgülü AN, Franke AG, Paulsen FP, Dünker N. Trefoil factor family peptide 2 acts pro-proliferative and pro-apoptotic in the murine retina. Histochemistry and Cell Biology. 2011;135:461–473. doi: 10.1007/s00418-011-0810-6.
  • [22]. Rinaldo A. A note on the uniqueness of the Lasso solution. Technical Report. Department of Statistics, Carnegie Mellon University; 2007.
  • [23]. Scheetz TE, Kim K-YA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434. doi: 10.1073/pnas.0602562103.
  • [24]. Tao PD, An LTH. Convex analysis approach to D.C. programming: theory, algorithms and applications. Acta Mathematica Vietnamica. 1997;22:289–355.
  • [25]. Tibshirani RJ. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  • [26]. van de Geer SA. High-dimensional generalized linear models and the lasso. Annals of Statistics. 2008;36:614–645.
  • [27]. van de Geer SA, Bühlmann P, Zhou SH. The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electronic Journal of Statistics. 2011;5:688–749.
  • [28]. Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B. 2009;71:671–683.
  • [29]. Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
  • [30]. Wang L, Kim Y, Li R. Supplement to “Calibrating non-convex penalized regression in ultra-high dimension”. 2013. doi: 10.1214/13-AOS1159.
  • [31]. Yuille A, Rangarajan A. The concave-convex procedure. Neural Computation. 2003;15:915–936. doi: 10.1162/08997660360581958.
  • [32]. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics. 2010;38:894–942.
  • [33]. Zhang C-H, Huang J. The sparsity and bias of the LASSO selection in high-dimensional regression. Annals of Statistics. 2008;36:1567–1594.
  • [34]. Zhang C-H, Zhang T. A general theory of concave regularization for high dimensional sparse estimation problems. Statistical Science. 2012;27:576–593.
  • [35]. Zhang T. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research. 2010;11:1080–1107.
  • [36]. Zhang T. Multi-stage convex relaxation for feature selection. Bernoulli. 2012. To appear.
  • [37]. Zhang Y, Li R, Tsai C-L. Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association. 2010;105:312–323. doi: 10.1198/jasa.2009.tm08013.
  • [38]. Zhao P, Yu B. On model selection consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
  • [39]. Zhou SH, van de Geer SA, Bühlmann P. Adaptive Lasso for high dimensional regression and Gaussian graphical modeling. 2009. arXiv:0903.2515.
  • [40]. Zhou SH. Thresholded Lasso for high dimensional variable selection and statistical estimation. 2010. arXiv:1002.1583.
  • [41]. Zou H. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
  • [42]. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Annals of Statistics. 2008;36:1509–1533. doi: 10.1214/009053607000000802.
