Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications

Jian Huang; Cun-Hui Zhang

Author manuscript; available in PMC: 2013 Dec 12.

Published in final edited form as: J Mach Learn Res. 2012 Jun 1;13:1839–1864.

PMCID: PMC3860326 NIHMSID: NIHMS532729 PMID: 24348100

Abstract

The ℓ₁-penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ₁-penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ₁-penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results.

Keywords: variable selection, penalized estimation, oracle inequality, generalized linear models, selection consistency, sparsity

1. Introduction

High-dimensional data arise in many diverse fields of scientific research. For example, in genetic and genomic studies, more and more large data sets are being generated with rapid advances in biotechnology, where the total number of variables p is larger than the sample size n. Fortunately, statistical analysis is still possible for a substantial subset of such problems with a sparse underlying model where the number of important variables is much smaller than the sample size. A fundamental problem in the analysis of such data is to find reasonably accurate sparse solutions that are easy to interpret and can be used for the prediction and estimation of covariable effects. The ℓ₁-penalized method, or the Lasso (Tibshirani, 1996; Chen et al., 1998), has emerged as an important approach to finding such solutions in sparse, high-dimensional statistical problems.

In the last few years, considerable progress has been made in understanding the theoretical properties of the Lasso in p ≫ n settings. Most results have been obtained for linear regression models with a quadratic loss. Greenshtein and Ritov (2004) studied the prediction performance of the Lasso in high-dimensional least squares regression. Meinshausen and Bühlmann (2006) showed that, for neighborhood selection in the Gaussian graphical models, under a neighborhood stability condition on the design matrix and certain additional regularity conditions, the Lasso is selection consistent even when p → ∞ at a rate faster than n. Zhao and Yu (2006) formalized the neighborhood stability condition in the context of linear regression as a strong irrepresentable condition. Candes and Tao (2007) derived an upper bound for the ℓ₂ loss of a closely related Dantzig selector in the estimation of regression coefficients under a condition on the number of nonzero coefficients and a uniform uncertainty principle on the design matrix. Similar results have been obtained for the Lasso. For example, upper bounds for the ℓ_q loss of the Lasso estimator have been established by Bunea et al. (2007) for q = 1, Zhang and Huang (2008) for q ∈ [1, 2], Meinshausen and Yu (2009) for q = 2, Bickel et al. (2009) for q ∈ [1, 2], and Zhang (2009) and Ye and Zhang (2010) for general q ≥ 1. For convex minimization methods beyond linear regression, van de Geer (2008) studied the Lasso in high-dimensional generalized linear models (GLM) and obtained prediction and ℓ₁ estimation error bounds. Negahban et al. (2010) studied penalized M-estimators with a general class of regularizers, including an ℓ₂ error bound for the Lasso in GLM under a restricted strong convexity and other regularity conditions.

Theoretical studies of the Lasso have revealed that it may not perform well for the purpose of variable selection, since its required irrepresentable condition is not properly scaled in the number of relevant variables. In a number of simulation studies, the Lasso has shown weakness in variable selection when the number of nonzero regression coefficients increases. As a remedy, a number of proposals have been introduced in the literature and proven to be variable selection consistent under regularity conditions of milder forms, including concave penalized LSE (Fan and Li, 2001; Zhang, 2010a), adaptive Lasso (Zou, 2006; Meier and Bühlmann, 2007; Huang et al., 2008), stepwise regression (Zhang, 2011a), and multi-stage methods (Hunter and Li, 2005; Zou and Li, 2008; Zhang, 2010b, 2011b).

In this article, we study a class of weighted ℓ₁-penalized estimators with a convex loss function. This class includes the Lasso, adaptive Lasso and multistage recursive application of adaptive Lasso in generalized linear models as special cases. We study prediction, estimation, selection and sparsity properties of the weighted ℓ₁-penalized estimator based on a convex loss in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. The main contributions of this work are as follows.

We extend the existing theory for the unweighted Lasso from linear regression to more general convex loss functions.
We develop a multistage method to approximate concave regularized convex minimization with recursive application of adaptive Lasso, and provide sharper risk bounds for this concave regularization approach in the general setting.
We apply our results to a number of important special cases, including the linear, logistic and log-linear regression models.

This article is organized as follows. In Section 2 we describe a general formulation of the absolute penalized minimization problem with a convex loss, along with two basic inequalities and a number of examples. In Section 3 we develop oracle inequalities for the weighted Lasso estimator for general quasi star-shaped loss functions and an ℓ₂ bound on the prediction error. In Section 4 we develop multistage recursive applications of adaptive Lasso as an approximate concave regularization method and provide sharper oracle inequalities for this approach. In Section 5 we derive sufficient conditions for selection consistency. In Section 6 we provide an upper bound on the dimension of the Lasso estimator. Concluding remarks are given in Section 7. All proofs are provided in an appendix.

2. Absolute Penalized Convex Minimization

In this section, we define the weighted Lasso for a convex loss function and characterize its solutions via the KKT conditions. We then derive some basic inequalities for the weighted Lasso solutions in terms of the symmetrized Bregman divergence (Bregman, 1967; Nielsen and Nock, 2007). We also illustrate the applications of the basic inequalities in several important examples.

2.1 Definition and the KKT Conditions

We consider a general convex loss function of the form

ℓ (β) = ψ (β) - 〈 β, z 〉,

(1)

where ψ(β) is a known convex function, z is observed, and β is unknown. Unless otherwise stated, the inner product space is ℝ^p, so that {z,β} ⊂ ℝ^p and 〈β,z〉 = β′z. Our analysis of (1) requires certain smoothness of the function ψ(β) in terms of its differentiability. In what follows, such smoothness assumptions are always explicitly described by invoking the derivative of ψ. For any v = (v₁,…, v_p)′, we use ||v|| to denote a general norm of v and |v|_q the ℓ_q norm (Σ_j |v_j|^q)^{1/q}, with |v|_∞ = max_j |v_j|. Let ŵ ∈ ℝ^p be a (possibly estimated) weight vector with nonnegative elements ŵ_j, 1 ≤ j ≤ p, and Ŵ = diag(ŵ). The weighted absolute penalized estimator, or weighted Lasso, is defined as

\hat{β} = \underset{β}{arg min} {ℓ (β) + λ {∣ \hat{W} β ∣}_{1}} .

(2)

Here we focus on the case where Ŵ is diagonal. In linear regression, Tibshirani and Taylor (2011) considered non-diagonal, predetermined Ŵ and derived an algorithm for computing the solution paths.

A vector β̂ is a global minimizer in (2) if and only if the negative gradient at β̂ satisfies the Karush-Kuhn-Tucker (KKT) conditions,

g = - \dot{ℓ} (\hat{β}) = z - \dot{ψ} (\hat{β}), \begin{cases} g_{j} = {\hat{w}}_{j} λ sgn ({\hat{β}}_{j}) & if {\hat{β}}_{j} \neq 0 \\ g_{j} \in {\hat{w}}_{j} λ [- 1, 1] & all j, \end{cases}

(3)

where ℓ̇(β) = (∂/∂β)ℓ(β) and ψ̇(β) = (∂/∂β)ψ (β). Since the KKT conditions are necessary and sufficient for (2), results on the performance of β̂ can be viewed as analytical consequences of (3).
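As an illustration of how (3) certifies a solution of (2), the following minimal sketch fits a weighted Lasso in the linear regression case, ψ(β) = |Xβ|₂²/(2n) and z = X′y/n, and then measures the largest violation of the KKT conditions. The coordinate descent solver, the simulated data, the weights and the value of λ are all assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def weighted_lasso_cd(X, y, lam, w, n_sweeps=500):
    """Coordinate descent for (2) with psi(beta) = |X beta|_2^2 / (2n), z = X'y / n."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n       # diagonal of X'X / n
    resid = y.copy()                        # current residual y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            # inner product of column j with the residual that excludes coordinate j
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * w[j]) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def kkt_violation(X, y, beta, lam, w):
    """Largest violation of (3); a value near zero certifies a global minimizer."""
    n = X.shape[0]
    g = X.T @ (y - X @ beta) / n            # g = z - psi_dot(beta)
    active = beta != 0
    v_active = np.abs(g[active] - lam * w[active] * np.sign(beta[active]))
    v_all = np.maximum(np.abs(g) - lam * w, 0.0)
    return max(v_active.max(initial=0.0), v_all.max(initial=0.0))

# Hypothetical simulated data, weights and penalty level.
rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)
w_hat = np.array([1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0])
lam = 0.2

beta_hat = weighted_lasso_cd(X, y, lam, w_hat)
violation = kkt_violation(X, y, beta_hat, lam, w_hat)
print(beta_hat, violation)
```

The check depends only on β̂, λ and ŵ, so it can be applied to the output of any solver.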

The estimator (2) includes the ℓ₁-penalized estimator, or the Lasso, with the choice ŵ_j = 1, 1 ≤ j ≤ p. A careful study of the (unweighted) Lasso in general convex minimization (1) is by itself an interesting and important problem. Our work includes the Lasso as a special case since ŵ_j = 1 is allowed in our theorems.

In practice, unequal ŵ_j arise in many ways. In adaptive Lasso (Zou, 2006), a decreasing function of a certain initial estimator of β_j is used as the weight ŵ_j to remove the bias of the Lasso. In Zou and Li (2008) and Zhang (2010b), the weights ŵ_j are computed iteratively with ŵ_j = ρ̇_λ(β̂_j), where ρ̇_λ(t) = (d/dt)ρ_λ(t) with a suitable concave penalty function ρ_λ(t). This is also designed to remove the bias of the Lasso, since the concavity of ρ_λ(t) guarantees smaller weight for larger β̂_j. In Section 4, we provide results on the improvements of this weighted Lasso over the standard Lasso. In linear regression, Zhang (2010b) gave sufficient conditions under which this iterative algorithm provides smaller weights ŵ_j for most large β_j. Such nearly unbiased methods are expected to produce better results than the Lasso when a significant fraction of nonzero |β_j| are of the order λ or larger. Regardless of the computational methods, the results in this paper demonstrate the benefits of using data dependent weights in a general class of problems with convex losses.

Unequal weights may also arise for computational reasons. The Lasso with ŵ_j = 1 is expected to perform similarly to weighted Lasso with data dependent 1 ≤ ŵ_j ≤ C₀, with a fixed C₀. However, the weighted Lasso is easier to compute since ŵ_j can be determined as a part of an iterative algorithm. For example, in a gradient descent algorithm, one may take larger steps and stop the computation as soon as the KKT conditions (3) are attained for any weights satisfying 1 ≤ ŵ_j ≤ C₀.

The weight function ŵ_j can be also used to standardize the penalty level, for example with ŵ_j = {ψ̈_{jj}(β̂)}^{1/2}, where ψ̈_{jj}(β) is the j-th diagonal element of the Hessian matrix of ψ(β). When ψ(β) is quadratic, for example in linear regression, ŵ_j = {ψ̈_{jj}(β̂)}^{1/2} does not depend on β̂. However, in other convex minimization problems, such weights need to be computed iteratively.

Finally, in certain applications, the effects of a certain set S_* of variables are of primary interest, so that penalization of β̂_{S_*}, and thus the resulting bias, should be avoided. This leads to “semi-penalized” estimators with ŵ_j = 0 for j ∈ S_*, for example, with weights ŵ_j = I{j ∉ S_*}.
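The weight choices discussed in this subsection can be written down in a few lines. The sketch below is illustrative only: the initial estimate, the specific decreasing function 1/(|b| + eps) used for adaptive weights, the constant eps and the index set S_star are assumptions of this example and not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 6
X = rng.standard_normal((n, p))             # hypothetical design matrix

# Unweighted Lasso: w_j = 1 for all j.
w_lasso = np.ones(p)

# Adaptive-Lasso-style weights: a decreasing function of |initial estimate|.
# The form 1 / (|b| + eps) and the value of eps are assumptions.
beta_init = np.array([1.8, -0.9, 0.05, 0.0, 0.0, 0.3])
eps = 1e-3
w_adaptive = 1.0 / (np.abs(beta_init) + eps)

# Standardizing weights in linear regression: psi_ddot_jj = (X'X / n)_jj,
# which does not depend on beta because psi is quadratic.
w_std = np.sqrt(np.diag(X.T @ X / n))

# Semi-penalized weights: w_j = I{j not in S_star}.
S_star = [0, 1]
w_semi = np.ones(p)
w_semi[S_star] = 0.0

print(w_lasso, w_adaptive, w_std, w_semi, sep="\n")
```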

2.2 Basic Inequalities, Prediction, and Bregman Divergence

Let β^* denote a target vector for β. In high-dimensional models, the performance of an estimator β̂ is typically measured by its proximity to a target under conditions on the sparsity of β^* and the size of the negative gradient −ℓ̇(β^*) = z − ψ̇(β^*). For ℓ₁-penalized estimators, such results are often derived from the KKT conditions (3) via certain basic inequalities, which are direct consequences of the KKT conditions and have appeared in different forms in the literature, for example, in the papers cited in Section 1. Let D(β,β^*) = ℓ(β) − ℓ(β^*) − 〈ℓ̇(β^*), β − β^*〉 be the Bregman divergence (Bregman, 1967) and consider its symmetrized version (Nielsen and Nock, 2007)

Δ (β, β^{*}) = D (β, β^{*}) + D (β^{*}, β) = 〈 β - β^{*}, \dot{ψ} (β) - \dot{ψ} (β^{*}) 〉 .

(4)

Since ψ is convex, Δ(β, β^*) ≥ 0. Two basic inequalities below provide upper bounds for the symmetrized Bregman divergence Δ(β̂, β^*). The sparsity of β^* is measured by a weighted ℓ₁ norm of β^* in the first one and by a sparse set in the second one.

Let S be any set of indices satisfying $S \supseteq \{j : β_{j}^{*} \neq 0\}$ and let S^c be the complement of S in {1, …, p}. We shall refer to S as the sparse set. Let W = diag(w) for a possibly unknown vector w ∈ ℝ^p with elements w_j ≥ 0. Define

z_{0}^{*} = {∣ \{z - \dot{ψ} (β^{*})\}_{S} ∣}_{\infty}, z_{1}^{*} = {∣ W_{S^{c}}^{- 1} \{z - \dot{ψ} (β^{*})\}_{S^{c}} ∣}_{\infty},

(5)

Ω_{0} = \{{\hat{w}}_{j} \leq w_{j} \forall j \in S\} \cap \{w_{j} \leq {\hat{w}}_{j} \forall j \in S^{c}\},

(6)

where for any p-vector v and set A, v_A = (v_j : j ∈ A)′. Here and in the sequel M_AB denotes the A × B subblock of a matrix M and M_A = M_AA.

Lemma 1

(i) Let β^* be a target vector. In the event Ω₀ ∩ {|(z − ψ̇(β^*))_j| ≤ ŵ_jλ ∀j},
$Δ (\hat{β}, β^{*}) \leq 2 λ {∣ \hat{W} β^{*} ∣}_{1} \leq 2 λ {∣ W β^{*} ∣}_{1} .$ (7)
(ii) For any target vector β^* and $S \supseteq \{j : β_{j}^{*} \neq 0\}$ , the error h = β̂ − β^* satisfies
$\begin{array}{l} Δ (β^{*} + h, β^{*}) + (λ - z_{1}^{*}) {∣ W_{S^{c}} h_{S^{c}} ∣}_{1} \leq 〈 h_{S}, g_{S} - \{z - \dot{ψ} (β^{*})\}_{S} 〉 \\ \leq ({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*}) {∣ h_{S} ∣}_{1} \end{array}$ (8)

in Ω₀ for a certain negative gradient vector g satisfying |g_j| ≤ ŵ_jλ. Consequently, in $Ω_{0} \cap \{({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*}) / (λ - z_{1}^{*}) \leq ξ\}$ , h ≠ 0 belongs to the sign-restricted cone C_−(ξ, S) = {b ∈ C(ξ, S) : b_j(ψ̇(β + b) − ψ̇(β))_j ≤ 0 ∀ j ∈ S^c}, where
$C (ξ, S) = \{b \in ℝ^{p} : {∣ W_{S^{c}} b_{S^{c}} ∣}_{1} \leq ξ {∣ b_{S} ∣}_{1} \neq 0\} .$ (9)
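To make the cone in (9) concrete, here is a small sketch that computes the ratio (|w_S|_∞λ + z₀^*)/(λ − z₁^*) appearing in the lemma and tests whether a given vector b lies in C(ξ, S). The function names and all numeric inputs are hypothetical and chosen only for illustration.

```python
import numpy as np

def cone_constant(lam, w_S_max, z0, z1):
    """Smallest xi allowed by the event in the lemma: (|w_S|_inf * lam + z0) / (lam - z1)."""
    if lam <= z1:
        raise ValueError("the ratio is only meaningful when lam > z1")
    return (w_S_max * lam + z0) / (lam - z1)

def in_cone(b, S, w, xi):
    """Check b in C(xi, S): |W_{S^c} b_{S^c}|_1 <= xi * |b_S|_1 and |b_S|_1 != 0."""
    mask = np.zeros(b.size, dtype=bool)
    mask[list(S)] = True
    l1_S = np.abs(b[mask]).sum()
    l1_Sc = np.abs(w[~mask] * b[~mask]).sum()
    return bool(l1_S != 0 and l1_Sc <= xi * l1_S)

# Hypothetical values of lambda, the weight bound and the noise levels in (5).
xi = cone_constant(lam=1.0, w_S_max=1.0, z0=0.2, z1=0.4)

S = {0, 1}
w = np.ones(4)
b_inside = np.array([1.0, -1.0, 0.5, 0.2])    # mass concentrated on S
b_outside = np.array([0.1, 0.0, 1.0, 1.0])    # mass concentrated off S
b_zero_on_S = np.array([0.0, 0.0, 1.0, 0.0])  # excluded because |b_S|_1 = 0

print(xi, in_cone(b_inside, S, w, xi), in_cone(b_outside, S, w, xi),
      in_cone(b_zero_on_S, S, w, xi))
```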

Remark 2

Sufficient conditions are given in Subsection 3.2 for {|(z − ψ̇(β^*))_j| ≤ ŵ_j λ ∀j} to hold with high probability in generalized linear models. See Lemma 8, Remarks 10 and 11 and Examples 7, 8, and 9.

A useful feature of Lemma 1 is its explicit statement of the monotonicity of the basic inequality in the weights. By Lemma 1 (ii), it suffices to study the analytical properties of the penalized criterion with the error h = β̂ − β^* in the sign-restricted cone, provided that the event $({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*}) / (λ - z_{1}^{*}) \leq ξ$ has large probability. However, unless C_−(ξ, S) is specified, we will consider the larger cone in (9) in order to simplify the analysis. The choices of the target vector β^*, the sparse set $S \supseteq \{j : β_{j}^{*} \neq 0\}$ , weight vector ŵ and its bound w are quite flexible. The main requirement is that {|S|, $z_{0}^{*}, z_{1}^{*}$ } should be small. In linear regression or generalized linear models, we may conveniently consider β^* as the vector of true regression coefficients under a probability measure P_{β^*}. However, β^* can also be a sparse version of a true β, for example, $β_{j}^{*} = β_{j} I \{∣ β_{j} ∣ \geq τ\}$ for a threshold value τ under P_β.

The upper bound in Lemma 1 (i) gives the so-called “slow rate” of convergence for the Bregman divergence. In Section 3, we provide a “fast rate” of convergence for the Bregman divergence via oracle inequalities for |h_S|₁ in (8).

The symmetrized Bregman divergence Δ(β̂, β^*) has the interpretations as the regret in prediction error in linear regression, the symmetrized Kullback-Leibler (KL) divergence in generalized linear models (GLM) and density estimation, and a spectrum loss for the graphical Lasso, as shown in examples below. These quantities can be all viewed as the size of the prediction error since they measure distances between a target density of the observations and an estimated density.

Example 1 (Linear regression)

Consider the linear regression model

y_{i} = \sum_{j = 1}^{p} x_{i j} β_{j} + ε_{i}, i = 1, \dots, n,

(10)

where y_i is the response variable, x_{ij} are predictors or design variables, and ε_i is the error term. Let y = (y₁, …, y_n)′ and let X be the design matrix whose ith row is xⁱ = (x_{i1}, …, x_{ip}). The estimator (2) can be written as a weighted Lasso with $ψ (β) = {∣ X β ∣}_{2}^{2} / (2 n)$ and z = X′y/n in (1). For predicting a vector ỹ with E_{β^*}[ỹ|X, y] = Xβ^*,

n Δ (\hat{β}, β^{*}) = {∣ X \hat{β} - X β^{*} ∣}_{2}^{2} = E_{β^{*}} [{∣ \tilde{y} - X \hat{β} ∣}_{2}^{2} ∣ X, y] - min_{δ (X, y)} E_{β^{*}} [{∣ \tilde{y} - δ (X, y) ∣}_{2}^{2} ∣ X, y]

is the regret of using the linear predictor Xβ̂ compared with the optimal predictor. See Greenshtein and Ritov (2004) for several implications of (7).

Example 2 (Logistic regression)

We observe (X, y) ∈ ℝ^{n×(p+1)} with independent rows (xⁱ,y_i), where y_i ∈ {0,1} are binary response variables with

P_{β} (y_{i} = 1 ∣ x^{i}) = π_{i} (β) = exp (x^{i} β) / (1 + exp (x^{i} β)), 1 \leq i \leq n .

(11)

The loss function (1) is the average negative log-likelihood:

ℓ (β) = ψ (β) - z^{'} β with ψ (β) = \sum_{i = 1}^{n} \frac{log (1 + exp (x^{i} β))}{n}, z = X^{'} y / n .

(12)

Thus, (2) is a weighted ℓ₁ penalized MLE. For probabilities {π′, π″} ⊂ (0,1), the KL information is K(π′, π″) = π′log(π′/π″) + (1 − π′)log{(1 − π′)/(1 − π″)}. Since $\dot{ψ} (β) = \sum_{i = 1}^{n} x^{i} π_{i} (β) / n$ and logit(π_i(β^*)) − logit(π_i(β)) = xⁱ(β^* − β), (4) gives

Δ (β, β^{*}) = \frac{1}{n} \sum_{i = 1}^{n} {K (π_{i} (β^{*}), π_{i} (β)) + K (π_{i} (β), π_{i} (β^{*}))} .

Thus, the symmetrized Bregman divergence Δ(β^*,β) is the symmetrised KL-divergence.
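The identity in this example can be checked numerically. The sketch below evaluates Δ(β, β^*) from the gradient formula in (4) and compares it with the average symmetrized KL information between the Bernoulli probabilities π_i(β) and π_i(β^*). The design matrix and the two coefficient vectors are arbitrary values assumed for the check.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def psi_dot(X, beta):
    """Gradient of psi(beta) = sum_i log(1 + exp(x^i beta)) / n."""
    return X.T @ sigmoid(X @ beta) / X.shape[0]

def sym_bregman(X, beta, beta_star):
    """Delta(beta, beta*) = <beta - beta*, psi_dot(beta) - psi_dot(beta*)>, as in (4)."""
    return (beta - beta_star) @ (psi_dot(X, beta) - psi_dot(X, beta_star))

def kl_bernoulli(p1, p2):
    """KL information K(p1, p2) between Bernoulli(p1) and Bernoulli(p2)."""
    return p1 * np.log(p1 / p2) + (1 - p1) * np.log((1 - p1) / (1 - p2))

def sym_kl(X, beta, beta_star):
    """Average over i of K(pi_i(beta*), pi_i(beta)) + K(pi_i(beta), pi_i(beta*))."""
    pi, pi_star = sigmoid(X @ beta), sigmoid(X @ beta_star)
    return np.mean(kl_bernoulli(pi_star, pi) + kl_bernoulli(pi, pi_star))

# Arbitrary inputs assumed for the check.
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 3))
beta = np.array([0.5, -1.0, 0.2])
beta_star = np.array([1.0, 0.0, -0.3])

delta = sym_bregman(X, beta, beta_star)
skl = sym_kl(X, beta, beta_star)
print(delta, skl)   # the two values should agree up to rounding error
```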

Example 3 (GLM)

The GLM contains the linear and logistic regression models as special cases. We observe (X,y) ∈ ℝ^{n×(p+1)} with rows (xⁱ, y_i). Suppose that conditionally on X, y_i are independent under P_β with

y_{i} ~ f (y_{i} ∣ θ_{i}) = exp (\frac{θ_{i} y_{i} - ψ_{0} (θ_{i})}{σ^{2}} + \frac{c (y_{i}, σ)}{σ^{2}}), θ_{i} = x^{i} β .

(13)

Let $f_{(n)} (y ∣ X, β) = \prod_{i = 1}^{n} f (y_{i} ∣ x^{i} β)$ . The loss function can be written as a normalized negative likelihood ℓ(β) = −(σ²/n) log f_{(n)}(y|X,β) with $ψ (β) = \sum_{i = 1}^{n} \{ψ_{0} (x^{i} β) - c (y_{i}, σ)\} / n$ and z = X′y/n. The KL divergence is

D (f_{(n)} (\cdot ∣ X, β^{*}) ‖ f_{(n)} (\cdot ∣ X, β)) = E_{β^{*}} log (\frac{f_{(n)} (y ∣ X, β^{*})}{f_{(n)} (y ∣ X, β)}) .

The symmetrized Bregman divergence can be written as

Δ (\hat{β}, β^{*}) = \frac{σ^{2}}{n} {D (f_{(n)} (\cdot ∣ X, β^{*}) ‖ f_{(n)} (\cdot ∣ X, \hat{β})) + D (f_{(n)} (\cdot ∣ X, \hat{β}) ‖ f_{(n)} (\cdot ∣ X, β^{*}))} .

Example 4 (Nonparametric density estimation)

Although the focus of this paper is on regression models, here we illustrate that Δ (β̂, β^*) is the symmetrized KL divergence in the context of nonparametric density estimation. Suppose the observations y = (y₁, …, y_n)′ are iid from f(·|β) = exp{〈β, T(·)〉 − ψ(β)} under P_β, where T(·) = (u_j(·), j ≤ p)′ with certain basis functions u_j(·). Let the loss function ℓ(β) in (1) be the average negative log-likelihood $- n^{- 1} \sum_{i = 1}^{n} log f (y_{i} ∣ β)$ with $z = n^{- 1} \sum_{i = 1}^{n} T (y_{i})$ . Since E_βT (y_i) = ψ̇(β), the KL divergence is

D (f (\cdot ∣ β^{*}) ‖ f (\cdot ∣ β)) = E_{β^{*}} log (\frac{f (y_{i} ∣ β^{*})}{f (y_{i} ∣ β)}) = ψ (β) - ψ (β^{*}) - 〈 β - β^{*}, \dot{ψ} (β^{*}) 〉 .

Again, the symmetrized Bregman divergence is the symmetrised KL divergence between the target density f(·|β^*) and the estimated density f(·|β̂):

Δ (\hat{β}, β^{*}) = D (f (\cdot ∣ β^{*}) ‖ f (\cdot ∣ \hat{β})) + D (f (\cdot ∣ \hat{β}) ‖ f (\cdot ∣ β^{*})) .

van de Geer (2008) pointed out that for this example, the natural choices of the basis functions u_j and weights w_j satisfy ∫u_jdν = 0 and $w_{k}^{2} = \int u_{k}^{2} d ν$ .

Example 5 (Graphical Lasso)

Suppose we observe X ∈ ℝ^{n×p} and would like to estimate the precision matrix β = (EX′X/n)⁻¹ ∈ ℝ^{p×p}. In the graphical Lasso, (1) is the length normalized negative likelihood with ψ(β) = −log det β, z = −X′X/n, and 〈β, z〉 = trace(βz). Since the gradient of ψ is ψ̇(β) = E_βz = −β⁻¹, we find

Δ (\hat{β}, β^{*}) = trace ((\hat{β} - β^{*}) ({(β^{*})}^{- 1} - {\hat{β}}^{- 1})) = \sum_{j = 1}^{p} {(λ_{j} - 1)}^{2} / λ_{j},

where (λ₁, …, λ_p) are the eigenvalues of (β^*)^{−1/2} β̂ (β^*)^{−1/2}. In graphical Lasso, the diagonal elements are typically not penalized. Consider ŵ_jk = I{j ≠ k}, so that the penalty for the off-diagonal elements is uniformly weighted. Since Lemma 1 requires |(z − ψ̇(β^*))_jk| ≤ ŵ_jkλ, β^* is taken to match X′X/n on the diagonal and the true β in correlations. Let S = {(j,k) : β_jk ≠ 0, j ≠ k}. In the event ${max}_{j \neq k} ∣ z_{j k} - β_{j k}^{*} ∣ \leq λ$ , Lemma 1 (i) gives

∣ S ∣ λ max_{j \neq k} ∣ β_{j k}^{*} ∣ = o (1) \Rightarrow {‖ {(β^{*})}^{- 1 / 2} \hat{β} {(β^{*})}^{- 1 / 2} - I_{p \times p} ‖}_{2} = o (1)

where ||·||₂ is the spectral norm. Rothman et al. (2008) proved the consistency of the graphical Lasso under similar conditions with a different analysis.
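The trace and eigenvalue expressions for the symmetrized Bregman divergence in this example can be compared numerically. In the sketch below the two positive-definite matrices standing in for β^* and β̂ are randomly generated; they are assumptions of the check and are not produced by a graphical Lasso fit.

```python
import numpy as np

def random_spd(p, rng):
    """A random symmetric positive-definite p x p matrix (hypothetical test input)."""
    A = rng.standard_normal((p, p))
    return A @ A.T + p * np.eye(p)

rng = np.random.default_rng(3)
p = 4
beta_star = random_spd(p, rng)
beta_hat = random_spd(p, rng)

# Trace form: trace((beta_hat - beta*) ((beta*)^{-1} - beta_hat^{-1})).
lhs = np.trace((beta_hat - beta_star)
               @ (np.linalg.inv(beta_star) - np.linalg.inv(beta_hat)))

# Eigenvalue form: sum_j (lambda_j - 1)^2 / lambda_j, where lambda_j are the
# eigenvalues of (beta*)^{-1/2} beta_hat (beta*)^{-1/2}.
evals, evecs = np.linalg.eigh(beta_star)
inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
lam = np.linalg.eigvalsh(inv_sqrt @ beta_hat @ inv_sqrt)
rhs = np.sum((lam - 1.0) ** 2 / lam)

print(lhs, rhs)   # the two values should agree up to rounding error
```

The agreement follows from trace(β̂(β^*)⁻¹) = Σ_j λ_j and trace(β^*β̂⁻¹) = Σ_j 1/λ_j.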

3. Oracle Inequalities

In this section, we extract upper bounds for the estimation error β̂ − β^* from the basic inequality (8). Since (8) is monotone in the weights, the oracle inequalities are sharper when the weights ŵ_j are smaller in $S \supseteq \{j : β_{j}^{*} \neq 0\}$ and larger in S^c.

We say that a function φ(b) defined in ℝ^p is quasi star-shaped if φ(tb) is continuous and non-decreasing in t ∈ [0,∞) for all b ∈ ℝ^p and lim_{b→0} φ(b) = 0. All seminorms are quasi star-shaped. The sublevel sets {b : φ(b) ≤ t} of a quasi star-shaped function are all star-shaped. Constant factors of the following form play a crucial role in our analysis.

Definition 3

For 0 ≤ η^* ≤ 1 and any pair of quasi star-shaped functions φ₀(b) and φ(b), define a general invertibility factor (GIF) over the cone (9) as follows:

F (ξ, S; φ_{0}, φ) = inf \{\frac{Δ (β^{*} + b, β^{*}) e^{φ_{0} (b)}}{{∣ b_{S} ∣}_{1} φ (b)} : b \in C (ξ, S), φ_{0} (b) \leq η^{*}\},

(14)

where Δ(β, β^*) is as in (4).

The GIF extends the squared compatibility constant (van de Geer and Bühlmann, 2009) and the weak and sign-restricted cone invertibility factors (Ye and Zhang, 2010) from the linear regression model with φ₀(·) = 0 to the general model (1) and from ℓ_q norms to general φ(·). They are all closely related to the restricted eigenvalues (Bickel et al., 2009; Koltchinskii, 2009) as we will discuss in Subsection 3.1.

The basic inequality (8) implies that the symmetrized Bregman divergence Δ(β̂, β^*) is no greater than a linear function of |h_S|₁, where h = β̂ − β^*. If Δ(β̂, β^*) is no smaller than a linear function of the product |h_S|₁φ(h), then an upper bound for φ(h) exists. Since the symmetrized Bregman divergence (4) is approximately quadratic, Δ(β̂, β^*) ≈ 〈h, ψ̈(β^*)h〉, in a neighborhood of β^*, this is reasonable when h = β̂ − β^* is not too large and ψ̈ (β^*) is invertible in the cone. A suitable factor e^{φ₀(b)} in (14) forces the computation of this lower bound in a proper neighborhood of β^*.

We first provide a set of general oracle inequalities.

Theorem 4

Let {$z_{0}^{*}, z_{1}^{*}$} be as in (5) with $S \supseteq \{j : β_{j}^{*} \neq 0\}$, Ω₀ in (6), 0 ≤ η ≤ η^* ≤ 1, and {φ₀(b), φ(b)} be a pair of quasi star-shaped functions. Then, in the event

Ω_{1} = Ω_{0} \cap \{\frac{{∣ w_{S} ∣}_{\infty} λ + z_{0}^{*}}{{(λ - z_{1}^{*})}_{+}} \leq ξ, \frac{{∣ w_{S} ∣}_{\infty} λ + z_{0}^{*}}{F (ξ, S; φ_{0}, φ_{0})} \leq η e^{- η}\},

(15)

the following oracle inequalities hold:

φ_{0} (\hat{β} - β^{*}) \leq η, φ (\hat{β} - β^{*}) \leq \frac{e^{η} ({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*})}{F (ξ, S; φ_{0}, φ)},

(16)

and with φ_{1,S}(b) = |b_S|₁/|S|

Δ (\hat{β}, β^{*}) + (λ - z_{1}^{*}) {∣ W_{S^{c}} {(\hat{β} - β^{*})}_{S^{c}} ∣}_{1} \leq \frac{e^{η} {({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*})}^{2} ∣ S ∣}{F (ξ, S; φ_{0}, φ_{1, S})} .

(17)

Remark 5

Sufficient conditions are given in Subsection 3.2 for (15) to hold with high probability. See Lemma 8, Remarks 10 and 11 and Examples 7, 8, and 9.

The oracle inequalities in Theorem 4 control both the estimation error in terms of φ(β̂ − β^*) and the prediction error in terms of the symmetrized Bregman divergence Δ(β̂,β^*) discussed in Section 2. Since they are based on the GIF (14) in the intersection of the cone and the unit ball {b : φ₀(b) ≤ 1/e}, they are different from typical results in a small-ball analysis based on the Taylor expansion of ψ(β) at β = β^*. An important feature of Theorem 4 is that its regularity condition is imposed only on the GIF (14) evaluated at the target β^*; the uniformity of the order of Δ(β + b, β) in β is not required. Theorem 4 does allow φ₀(·) = 0 with F(ξ, S;φ₀, φ₀) = ∞ and η = 0 in linear regression.

3.1 The Hessian and Related Quantities

In this subsection we describe the relationship between the GIF (14) and the Hessian of the convex function ψ(·) in (1) and examine cases where the quasi star-shaped functions φ₀(·) and φ(·) are familiar seminorms. Throughout, we assume that ψ(β) is twice differentiable. Let ψ̈(β) be the Hessian of ψ(β) and Σ^* = ψ̈(β^*).

The GIF (14) can be simplified under the following condition.

Definition 6

Given a nonnegative-definite matrix Σ and constant η^* > 0, the symmetrized Bregman divergence Δ(β, β^*) satisfies the φ₀-relaxed convexity (φ₀-RC) condition if

Δ (β^{*} + b, β^{*}) e^{φ_{0} (b)} \geq 〈 b, Σ b 〉, \forall b \in C (ξ, S), φ_{0} (b) \leq η^{*} .

(18)

The φ₀-RC condition is related to the restricted strong convexity condition for the Bregman divergence (Negahban et al., 2010): ℓ(β^* + b) − ℓ(β^*) − 〈ℓ̇ (β^*), b〉 ≥ κ̃||b||² for b in a certain restricted set and a loss function ||b||. It actually implies the restricted strong convexity of the symmetrized Bregman divergence with κ̃ = e^{−η^*} and loss ||b||_* = 〈b,Σb〉^{1/2}. However, (18) is used in our analysis mainly to find a quadratic form as a medium for the eventual comparison of Δ(β^* + b, β^*) with |b_S|₁φ(b) in (14), where φ(b) is the loss function. In fact, in our examples, we find quasi star-shaped functions φ₀ for which (18) holds for unrestricted b (η^* = ξ = ∞). In such cases, the φ₀-RC condition is a smoothness condition on the Hessian operator ψ̈(β) = ℓ̈(β), since $Δ (β^{*} + h, β^{*}) = \int_{0}^{1} 〈 h, \ddot{ψ} (β^{*} + t h) h 〉 d t$ by (4).

In what follows, Σ = Σ^* = ψ̈ (β^*) is allowed in all statements unless otherwise stated. Under the φ₀-RC (18), the GIF (14) is bounded from below by the following simple GIF:

F_{0} (ξ, S; φ) = inf_{b \in C (ξ, S)} \frac{〈 b, Σ b 〉}{{∣ b_{S} ∣}_{1} φ (b)} .

(19)
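Since (19) is an infimum over a cone, evaluating the ratio at any b ∈ C(ξ, S) gives an upper bound for F₀(ξ, S; φ), and random sampling can provide a rough numerical sanity check. The sketch below does this for φ(b) = |b_S|₁/|S| in the hypothetical case Σ = I with unit weights, where the Cauchy-Schwarz inequality gives F₀ = 1 exactly. The sampler, the dimensions and the value of ξ are assumptions of this example; sampling does not in general locate the infimum.

```python
import numpy as np

def simple_gif_ratio(b, Sigma, S_mask):
    """<b, Sigma b> / (|b_S|_1 * phi(b)) with phi(b) = |b_S|_1 / |S|."""
    l1_S = np.abs(b[S_mask]).sum()
    return S_mask.sum() * (b @ Sigma @ b) / l1_S ** 2

def sample_cone(rng, S_mask, w, xi):
    """Draw b with |W_{S^c} b_{S^c}|_1 <= xi * |b_S|_1 by shrinking b_{S^c} if needed."""
    b = rng.standard_normal(S_mask.size)
    l1_S = np.abs(b[S_mask]).sum()
    l1_Sc = np.abs(w[~S_mask] * b[~S_mask]).sum()
    if l1_Sc > xi * l1_S:
        b[~S_mask] *= rng.uniform() * xi * l1_S / l1_Sc
    return b

# Hypothetical setting: identity Sigma, unit weights, |S| = 2, xi = 3.
rng = np.random.default_rng(4)
p = 6
S_mask = np.array([True, True, False, False, False, False])
Sigma = np.eye(p)
w = np.ones(p)
xi = 3.0

ratios = [simple_gif_ratio(sample_cone(rng, S_mask, w, xi), Sigma, S_mask)
          for _ in range(2000)]
upper_bound = min(ratios)        # an upper bound for the infimum in (19)

# With Sigma = I the infimum is attained when b_S is constant and b_{S^c} = 0.
b_eq = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
ratio_eq = simple_gif_ratio(b_eq, Sigma, S_mask)
print(upper_bound, ratio_eq)
```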

In linear regression, F₀(ξ, S;φ) is the square of the compatibility factor for φ(b) = φ_{1,S}(b) = |b_S|₁/|S| (van de Geer, 2007) and the weak cone invertibility factor for φ(b) = φ_q(b) = |b|_q/|S|^{1/q} (Ye and Zhang, 2010). They are both closely related to the restricted isometry property (RIP) (Candes and Tao, 2005), the sparse Riesz condition (SRC) (Zhang and Huang, 2008), and the restricted eigenvalue (Bickel et al., 2009). Extensive discussion of these quantities can be found in Bickel et al. (2009), van de Geer and Bühlmann (2009) and Ye and Zhang (2010). The following corollary is an extension of an oracle inequality of Ye and Zhang (2010) from linear regression to the general convex minimization problem (1).

Corollary 7

Let η ≤ η^* ≤ 1. Suppose the φ₀-RC condition (18). Then, in the event

Ω_{0} \cap \{{∣ w_{S} ∣}_{\infty} λ + z_{0}^{*} \leq min (ξ (λ - z_{1}^{*}), η e^{- η} F_{0} (ξ, S; φ_{0}))\},

the oracle inequalities (16) and (17) in Theorem 4 hold with the GIF F(ξ, S;φ₀, φ) replaced by the simple GIF F₀(ξ, S; φ) in (19). In particular, in the same event,

φ_{0} (h) \leq η, {∣ h ∣}_{q} \leq \frac{e^{η} ({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*}) {∣ S ∣}^{1 / q}}{F_{0} (ξ, S; φ_{q})}, \forall q > 0,

with φ_q(b) = |b|_q/|S|^{1/q} and h = β̂ − β^*, and with φ_{1,S}(b) = |b_S|₁/|S|,

e^{- η} 〈 h, Σ h 〉 \leq Δ (\hat{β}, β^{*}) \leq \frac{e^{η} {({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*})}^{2} ∣ S ∣}{F_{0} (ξ, S; φ_{1, S})} - (λ - z_{1}^{*}) {∣ W_{S^{c}} h_{S^{c}} ∣}_{1} .

Here the only differences between the general model (1) and linear regression (φ₀(b) = 0) are the extra factor e^η with η≤ 1, the extra constraint ${∣ w_{S} ∣}_{\infty} λ + z_{0}^{*} \leq η e^{- η} F_{0} (ξ, S; φ_{0})$ , and the extra φ₀-RC condition (18). Moreover, the simple GIF (19) explicitly expresses all conditions on F₀(ξ, S;φ) as properties of a fixed matrix Σ.

Example 6 (Linear regression: oracle inequalities)

For $ψ (β) = {∣ X β ∣}_{2}^{2} / (2 n)$ and Σ = X′X/n, F₀(ξ, S;φ_q) is the weak cone invertibility factor for q ∈ [1,∞] (Ye and Zhang, 2010), where a sharper version is defined as the sign-restricted cone invertibility factor (SCIF):

{SCIF}_{q} (ξ, S) = inf_{b \in C_{-} (ξ, S)} {∣ Σ b ∣}_{\infty} / φ_{q} (b), φ_{q} (b) = {∣ b ∣}_{q} / {∣ S ∣}^{1 / q} .

For q = 1, $F_{0}^{1 / 2} (ξ, S; φ_{1, S})$ is the compatibility constant (van de Geer, 2007)

κ_{*} (ξ, S) = inf_{b \in C (ξ, S)} \frac{{∣ S ∣}^{1 / 2} {∣ X b ∣}_{2}}{{∣ b_{S} ∣}_{1} n^{1 / 2}} = inf_{b \in C (ξ, S)} {(\frac{b^{'} Σ b}{{∣ b_{S} ∣}_{1}^{2} / ∣ S ∣})}^{1 / 2} .

(20)

They are all closely related to the ℓ₂ restricted eigenvalues

{RE}_{2} (ξ, S) = inf_{b \in C (ξ, S)} \frac{{∣ X b ∣}_{2}}{{∣ b ∣}_{2} n^{1 / 2}} = inf_{b \in C (ξ, S)} {(\frac{b^{'} Σ b}{{∣ b ∣}_{2}^{2}})}^{1 / 2}

(Bickel et al., 2009; Koltchinskii, 2009). Since ${∣ b_{S} ∣}_{1}^{2} \leq {∣ b ∣}_{2}^{2} ∣ S ∣$ , κ_*(ξ,S) ≥ RE₂(ξ, S) (van de Geer and Bühlmann, 2009). For the Lasso with ŵ_j = 1,

{∣ \hat{β} - β^{*} ∣}_{2} \leq \frac{{∣ S ∣}^{1 / 2} (λ + z_{0}^{*})}{{SCIF}_{2} (ξ, S)} \leq \frac{{∣ S ∣}^{1 / 2} (λ + z_{0}^{*})}{F_{0} (ξ, S; φ_{2})} \leq \frac{{∣ S ∣}^{1 / 2} (λ + z_{0}^{*})}{κ_{*} (ξ, S) {RE}_{2} (ξ, S)}

(21)

in the event $λ + z_{0}^{*} \leq ξ (λ - z_{1}^{*})$ (Ye and Zhang, 2010). Thus, cone and general invertibility factors yield sharper ℓ₂ oracle inequalities.

The factors in the oracle inequalities in (21) do not always have the same order for large |S|. Although the oracle inequality based on SCIF₂(ξ, S) is the sharpest among them, it does not seem to lead to a simple extension to the general convex minimization in (1). Thus, we settle for extensions of the second sharpest oracle inequality in (21) with F₀(ξ, S;·).

3.2 Oracle Inequalities for the Lasso in GLM

An important special case of the general formulation is the ℓ₁-penalized estimator in a generalized linear model (GLM) (McCullagh and Nelder, 1989). This is Example 3 in Subsection 2.2, where we set up the notation in (13) and gave the KL divergence interpretation to (4). The ℓ₁ penalized, normalized negative likelihood is

ℓ (β) = ψ (β) - z^{'} β, with ψ (β) = \frac{1}{n} \sum_{i = 1}^{n} {ψ_{0} (x^{i} β) - c (y_{i}, σ)} and z = \frac{X^{'} y}{n} .

(22)

Assume that ψ₀ is twice differentiable. Denote the first and second derivatives of ψ₀ by ψ̇₀ and ψ̈₀, respectively. The gradient and Hessian are

\dot{ψ} (β) = X^{'} {\dot{ψ}}_{0} (θ) / n and \ddot{ψ} (β) = X^{'} diag ({\ddot{ψ}}_{0} (θ)) X / n,

(23)

where θ = Xβ and ψ̇₀ and ψ̈₀ are applied to the individual components of θ.
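The gradient and Hessian in (23) translate directly into code. The sketch below uses the logistic case ψ₀(θ) = log(1 + e^θ), for which ψ̇₀(θ) = π(θ) and ψ̈₀(θ) = π(θ)(1 − π(θ)), drops the term c(y_i, σ) because it does not depend on β, and compares both formulas with central finite differences. The data, the step size eps and the function names are assumptions of this example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def psi(X, beta):
    """psi(beta) = mean_i psi_0(x^i beta) with logistic psi_0(t) = log(1 + e^t)."""
    return np.mean(np.logaddexp(0.0, X @ beta))

def psi_dot(X, beta):
    """Gradient in (23): X' psi_0_dot(theta) / n with theta = X beta."""
    return X.T @ sigmoid(X @ beta) / X.shape[0]

def psi_ddot(X, beta):
    """Hessian in (23): X' diag(psi_0_ddot(theta)) X / n."""
    pi = sigmoid(X @ beta)
    return X.T @ (X * (pi * (1.0 - pi))[:, None]) / X.shape[0]

# Hypothetical inputs for the finite-difference comparison.
rng = np.random.default_rng(5)
X = rng.standard_normal((30, 4))
beta = rng.standard_normal(4)
eps = 1e-5
eye = np.eye(4)

grad_fd = np.array([(psi(X, beta + eps * e) - psi(X, beta - eps * e)) / (2 * eps)
                    for e in eye])
hess_fd = np.array([(psi_dot(X, beta + eps * e) - psi_dot(X, beta - eps * e)) / (2 * eps)
                    for e in eye])

grad_err = np.max(np.abs(grad_fd - psi_dot(X, beta)))
hess_err = np.max(np.abs(hess_fd - psi_ddot(X, beta)))
print(grad_err, hess_err)   # both should be small
```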

A crucial condition in our analysis of the Lasso in GLM is the Lipschitz condition

max_{i \leq n} | log ({\ddot{ψ}}_{0} (x^{i} β^{*} + t)) - log ({\ddot{ψ}}_{0} (x^{i} β^{*})) | \leq M_{1} ∣ t ∣, \forall M_{1} ∣ t ∣ \leq η^{*},

(24)

where M₁ and η^* are constants determined by ψ₀. This condition gives

Δ (β^{*} + b, β^{*}) = \int_{0}^{1} 〈 b, \ddot{ψ} (β^{*} + t b) b 〉 d t \geq \int_{0}^{1} \sum_{t M_{1} ∣ x^{i} b ∣ \leq η^{*}} \frac{{\ddot{ψ}}_{0} (x^{i} β^{*}) {(x^{i} b)}^{2}}{n e^{t M_{1} ∣ x^{i} b ∣}} d t,

which implies the following lower bound for the GIF in (14):

F (ξ, S; φ_{0}, φ) \geq inf_{b \in C (ξ, S), φ_{0} (b) \leq η^{*}} \sum_{i = 1}^{n} \frac{{\ddot{ψ}}_{0} (x^{i} β^{*}) {(x^{i} b)}^{2}}{n {∣ b_{S} ∣}_{1} φ (b)} \int_{0}^{1} I {t M_{1} ∣ x^{i} b ∣ \leq φ_{0} (b)} d t .

For seminorms φ₀ and φ, the infimum above can be taken over φ₀(b) = M₂ due to scale invariance. Thus, for φ₀(b) = M₂|b|₂ and seminorms φ, this lower bound is

F^{*} (ξ, S; φ) = inf_{b \in C (ξ, S), {∣ b ∣}_{2} = 1} \sum_{i = 1}^{n} \frac{M_{2} {\ddot{ψ}}_{0} (x^{i} β^{*})}{n {∣ b_{S} ∣}_{1} φ (b)} min (\frac{∣ x^{i} b ∣}{M_{1}}, \frac{{(x^{i} b)}^{2}}{M_{2}}),

(25)

due to ${(x^{i} b)}^{2} \int_{0}^{1} I {t M_{1} ∣ x^{i} b ∣ \leq M_{2}} d t = M_{2} min {∣ x^{i} b ∣ / M_{1}, {(x^{i} b)}^{2} / M_{2}}$ .
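The identity used for (25) can be checked directly. Writing u = xⁱb and assuming M₁ > 0 and u ≠ 0 (the case u = 0 is trivial), the indicator equals one exactly when t ≤ M₂/(M₁|u|), so the integral is the length of that interval truncated to [0, 1]:

```latex
\begin{aligned}
u^{2} \int_{0}^{1} I\{ t M_{1} |u| \le M_{2} \} \, dt
  &= u^{2} \min\Big\{ 1, \frac{M_{2}}{M_{1} |u|} \Big\} \\
  &= \min\Big\{ u^{2}, \frac{M_{2} |u|}{M_{1}} \Big\}
   = M_{2} \min\Big\{ \frac{|u|}{M_{1}}, \frac{u^{2}}{M_{2}} \Big\}.
\end{aligned}
```

The bound is therefore quadratic in xⁱb for small |xⁱb| and only linear for large |xⁱb|.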

If (24) holds with η^* = ∞, $Δ (β^{*} + b, β^{*}) \geq n^{- 1} \int_{0}^{1} \sum_{i} {\ddot{ψ}}_{0} (x^{i} β^{*}) {(x^{i} b)}^{2} e^{- t M_{1} ∣ x^{i} b ∣} d t$ , so that by the Jensen inequality (18) holds with Σ = Σ^* = ψ̈(β^*) and

φ_{0} (b) = \frac{M_{1} \sum_{i = 1}^{n} {\ddot{ψ}}_{0} (x^{i} β^{*}) {∣ x^{i} b ∣}^{3}}{\sum_{i = 1}^{n} {\ddot{ψ}}_{0} (x^{i} β^{*}) {(x^{i} b)}^{2}} \leq M_{1} {∣ X b ∣}_{\infty} .

(26)

This gives a special F₀(ξ, S;φ₀) as

F_{*} (ξ, S) = inf_{b \in C (ξ, S)} \frac{n {〈 b, Σ^{*} b 〉}^{2} / (M_{1} {∣ b_{S} ∣}_{1})}{\sum_{i = 1}^{n} {\ddot{ψ}}_{0} (x^{i} β^{*}) {∣ x^{i} b ∣}^{3}} .

(27)

Since ${∣ X b ∣}_{\infty} \leq {∣ X_{S} ∣}_{\infty} {∣ b_{S} ∣}_{1} + {∣ X_{S^{c}} W_{S^{c}}^{- 1} ∣}_{\infty} {∣ W_{S^{c}} b_{S^{c}} ∣}_{1} \leq {{∣ X_{S} ∣}_{\infty} + ξ {∣ X_{S^{c}} W_{S^{c}}^{- 1} ∣}_{\infty}} {∣ b_{S} ∣}_{1}$ in the cone 𝒞(ξ,S) in (9), for φ₀(b) = M₃|b_S|₁ with $M_{3} = M_{1} {{∣ X_{S} ∣}_{\infty} + ξ {∣ X_{S^{c}} W_{S^{c}}^{- 1} ∣}_{\infty}}$ , the φ₀-RC condition (18) automatically implies the stronger

e^{- φ_{0} (b)} 〈 b, Σ^{*} b 〉 \leq Δ (β^{*} + b, β^{*}) \leq e^{φ_{0} (b)} 〈 b, Σ^{*} b 〉, \forall b \in C (ξ, S), φ_{0} (b) < \infty .

(28)

Under the Lipschitz condition (24), we may also use the following large deviation inequalities to find explicit penalty levels to guarantee the noise bound (15).

Lemma 8

(i) Suppose the model conditions (13) and (24) hold with certain {M₁, η^*}. Let x_j be the columns of X and $Σ_{i j}^{*}$ be the elements of Σ^* = ψ̈(β^*). For penalty levels {λ₀, λ₁}, define t_j = λ₀I{j ∈ S} + w_jλ₁I{j ∉ S}. Suppose the bounds w_j in (6) are deterministic and
$M_{1} max_{j \leq p} ({∣ x_{j} ∣}_{\infty} t_{j} / Σ_{j j}^{*}) \leq η_{0} e^{η_{0}} and \sum_{j = 1}^{p} exp {- \frac{n t_{j}^{2} e^{- η_{0}}}{2 σ^{2} Σ_{j j}^{*}}} \leq \frac{ε_{0}}{2}$ (29)

for certain constants η₀ ≤ η^* and ε₀ > 0. Then, $P_{β^{*}} {z_{0}^{*} \leq λ_{0}, z_{1}^{*} \leq λ_{1}} \geq 1 - ε_{0}$ .
(ii) If c₀ = max_t ψ̈₀(t), then part (i) is still valid if (24) and (29) are replaced by
$\sum_{j = 1}^{p} exp {- \frac{n^{2} t_{j}^{2}}{2 σ^{2} c_{0} {∣ x_{j} ∣}_{2}^{2}}} \leq \frac{ε_{0}}{2} .$ (30)

In particular, if ${∣ x_{j} ∣}_{2}^{2} = n$ , 1 ≤ j ≤ p, w_j = 1, j ∉ S and λ₀ = λ₁ = λ (so t_j = λ), then part (i) still holds if $λ \geq σ \sqrt{(2 c_{0} / n) log (2 p / ε_{0})}$ .
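The closing statement of Lemma 8 can be turned into a small calculation. The sketch below, with hypothetical function names and example values, computes the smallest penalty level allowed by that statement and evaluates the left-hand side of (30) under the normalization |x_j|₂² = n and t_j = λ.

```python
import math

def penalty_level(n, p, sigma, c0, eps0):
    """Smallest lambda with lambda >= sigma * sqrt((2 * c0 / n) * log(2 * p / eps0))."""
    return sigma * math.sqrt((2.0 * c0 / n) * math.log(2.0 * p / eps0))

def tail_sum(n, p, sigma, c0, lam):
    """Left-hand side of (30) when |x_j|_2^2 = n and t_j = lam for every j."""
    return p * math.exp(-n * lam ** 2 / (2.0 * sigma ** 2 * c0))

# Hypothetical sizes; c0 = 1/4 is the logistic value of max_t of the second derivative of psi0.
lam = penalty_level(n=200, p=1000, sigma=1.0, c0=0.25, eps0=0.05)
lhs = tail_sum(n=200, p=1000, sigma=1.0, c0=0.25, lam=lam)
```

At this smallest λ the sum in (30) equals ε₀/2 exactly, so any larger penalty level also satisfies the condition.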

The following theorem is a consequence of Theorem 4, Corollary 7 and Lemma 8.

Theorem 9

(i) Let β̂ be the weighted Lasso estimator in (2) with GLM loss function in (22). Let β^* be a target vector and h = β̂−β^*. Suppose that the data follow the GLM model (13) satisfying the Lipschitz condition (24) with certain {M₁, η^*}. Let F^*(ξ,S;φ) be as in (25) with $S \supseteq {j : β_{j}^{*} \neq 0}$ and a constant M₂. Let η ≤ 1 ∧ η^* and {λ, λ₀, λ₁} satisfy
${∣ w_{S} ∣}_{\infty} λ + λ_{0} \leq min {ξ (λ - λ_{1}), η e^{- η} F^{*} (ξ, S; M_{2} {∣ \cdot ∣}_{2})} .$ (31)

Then, in the event $Ω_{0} \cap {{max}_{k = 0, 1} (z_{k}^{*} / λ_{k}) \leq 1}$ with the $z_{k}^{*}$ in (5) and Ω₀ in (6),
$Δ (β^{*} + h, β^{*}) \leq \frac{e^{η} {({∣ w_{S} ∣}_{\infty} λ + λ_{0})}^{2} ∣ S ∣}{F^{*} (ξ, S; φ_{1, S})}, φ (h) \leq \frac{e^{η} ({∣ w_{S} ∣}_{\infty} λ + λ_{0})}{F^{*} (ξ, S; φ)}$ (32)

for any seminorm φ as the estimation loss. In particular, for φ(b) = M₂|b|₂, (32) gives |h|₂ ≤ η/M₂. Moreover, if either (29) or (30) holds for the penalty level {λ₀, λ₁} and the weight bounds w_j in (6) are deterministic, then
$P_{β^{*}} {(32) holds for all seminorms φ} \geq P_{β^{*}} (Ω_{0}) - ε_{0} .$
(ii) Suppose η^* = ∞ and (31) holds with F^*(ξ,S;M₂|·|₂) replaced by the special simple GIF F_*(ξ,S) in (27) for the φ₀ in (26). Then, the conclusions of part (i) hold with F^*(ξ,S;·) replaced by the simple GIF F₀(ξ, S;·) in (19). Moreover, φ₀(h) ≤ η and (32) can be strengthened with the lower bound Δ(β^* + h,β^*) ≥ e^{−η}〈h,Σ^*h〉.
(iii) For any η^* > 0, the conclusions of part (ii) hold for the φ₀(b) = M₃|b_S|₁ in (28), if F_*(ξ,S) is replaced by $κ_{*}^{2} (ξ, S) / (M_{3} ∣ S ∣)$ in (31), where κ_*(ξ,S) is the compatibility constant in (20).

Remark 10

If either (29) or (30) holds for the penalty levels {λ₀, λ₁} and the bounds w_j in (6) are deterministic, then (32) implies P_β^*{the noise bound (15) holds} ≥ P_β^*(Ω₀) − ε₀.

Remark 11

Suppose that max_{j∉S} 1/w_j, $max_{j} 1 / Σ_{j j}^{*}$ , max_{j∈S} w_j, $max_{j} Σ_{j j}^{*}$ , and M₁ are all bounded, and that ${1 + F_{*}^{2} (ξ, S)} (log p) / n \to 0$ . Then, (29) holds with the penalty level $λ_{0} = λ_{1} = a σ \sqrt{(2 / n) log (p / ε_{0})}$ for certain $a \leq (1 + o (1)) max_{j} {(Σ^{*})}_{j j}^{1 / 2} / w_{j}$ , due to max{λ₀, η, η₀} → 0+. Again, the conditions and conclusions of Theorem 9 “converge” to those for the linear regression as if the Gram matrix is Σ^*.

Remark 12

In Theorem 9, the key condition (31) is weaker in parts (i) and (ii) than part (iii), although part (ii) requires η^* = ∞. For Σ = Σ^* and M₁ = M₂ ≤ M₃/(1 + ξ),

κ_{*}^{2} (ξ, S) / (M_{3} ∣ S ∣) \leq min {F_{*} (ξ, S), F^{*} (ξ, S; M_{2} {∣ \cdot ∣}_{2})},

since $n^{- 1} \sum_{i = 1}^{n} {\ddot{ψ}}_{0} (x^{i} β^{*}) {∣ x^{i} b ∣}^{3} / 〈 b, Σ^{*} b 〉 \leq {∣ X b ∣}_{\infty} \leq {∣ b_{S} ∣}_{1} M_{3} / M_{1}$ as in the derivation of (28) and |b|₂ ≤ |b|₁ ≤ (1 + ξ)|b_S|₁ in the cone (9). For the more familiar $κ_{*}^{2} (ξ, S) / (M_{3} ∣ S ∣)$ with the compatibility constant, (31) essentially requires a small $∣ S ∣ \sqrt{(log p) / n}$ . The sharper Theorem 9 (i) and (ii) provide conditions to relax the requirement to a small |S|(log p)/n.

Remark 13

For ŵ_j = 1, Negahban et al. (2010) considered M-estimators under the restricted strong convexity condition discussed below Definition 6. For the GLM, they considered iid sub-Gaussian xⁱ and used empirical process theory to bound the ratio Δ(β^* + b,β^*)/{|b|₂(|b|₂ − c₀|b|₁)} from below over the cone (9) with a small c₀. Their result extends the ℓ₂ error bound ${∣ S ∣}^{1 / 2} (λ + z_{0}^{*}) / {RE}_{2}^{2} (ξ, S)$ of Bickel et al. (2009), while Theorem 9 extends the sharper (21) with the factor F₀(ξ, S;φ₂). Theorem 9 applies to both deterministic and random designs. Similar to Negahban et al. (2010), for iid sub-Gaussian xⁱ, empirical process theory can be applied to the lower bound (25) for the GIF to verify the key condition (31) with F^*(ξ,S;M₂|·|₂) ≳ |S|^{−1/2}, provided that |S|(log p)/n is small.

Example 7 (Linear regression: oracle inequalities, continuation)

For the linear regression model (10) with quadratic loss, ψ₀(θ) = θ²/2, so that (24) holds with M₁ = 0 and η^* = ∞. It follows that F^*(ξ,S;M₂|·|₂) = ∞ and (31) has the interpretation with η = 0+ and ηe^{−η}F^*(ξ,S;M₂|·|₂) = ∞. Moreover, since M₁ = 0, η₀ = 0+ in (29). Thus, the conditions and conclusions of Theorem 9 “converge” to the case of linear regression as M₁ → 0+. Suppose iid ε_i ~ N(0,σ²) as in (13). For ŵ_j = w_j = 1 and $Σ_{j j}^{*} = \sum_{i = 1}^{n} x_{i j}^{2} / n = 1$ , (29) holds with $λ_{0} = λ_{1} = σ \sqrt{(2 / n) log (p / ε_{0})}$ and (31) holds with λ = λ₀(ξ + 1)/(ξ − 1). The value of σ can be estimated iteratively using the mean residual squares (Städler et al., 2010; Sun and Zhang, 2011). Alternatively, cross-validation can be used to pick λ. For φ(b) = φ₂(b) = |b|₂/|S|^{1/2}, (32) matches the risk bound in (21) with the factor F₀(ξ, S;φ₂).

Example 8 (Logistic regression: oracle inequalities)

The model and loss function are given in (11) and (12) respectively. Here we verify the conditions of Theorem 9. The Lipschitz condition (24) holds with M₁ = 1 and η^* = ∞ since ψ₀(t) = log(1 + e^t) provides

\frac{{\ddot{ψ}}_{0} (θ + t)}{{\ddot{ψ}}_{0} (θ)} = \frac{e^{t} {(1 + e^{θ})}^{2}}{{(1 + e^{θ + t})}^{2}} \geq \begin{cases} e^{- ∣ t ∣} & t < 0 \\ e^{- t} {(1 + e^{θ})}^{2} / {(e^{- t} + e^{θ})}^{2} \geq e^{- ∣ t ∣} & t > 0. \end{cases}

Since max_t ψ̈₀(t) = c₀ = 1/4, we can apply (30). In particular, if ${\hat{w}}_{j} = w_{j} = 1 = {∣ x_{j} ∣}_{2}^{2} / n, λ = {(ξ + 1) / (ξ - 1)} \sqrt{(log (p / ε_{0})) / (2 n)}$ and λ{2ξ/(ξ + 1)}/F_*(ξ,S) ≤ ηe^{−η}, then (32) holds with probability at least 1 − ε₀ under P_{β^*}. For such deterministic Ŵ and X, an adaptive choice of the penalty level is $λ = \hat{σ} \sqrt{(2 / n) log p}$ with ${\hat{σ}}^{2} = \sum_{i = 1}^{n} π_{i} (\hat{β}) {1 - π_{i} (\hat{β})} / n$ , where π_i(β) is as in Example 2.
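The two facts used in this example, the Lipschitz condition (24) with M₁ = 1 and the bound c₀ = 1/4, can be checked numerically on a grid. The following Python sketch does so; the grid range and the helper names are arbitrary choices made for illustration, and a grid check is of course not a proof.

```python
import math

def logistic_ddot(theta):
    """Second derivative of psi0(t) = log(1 + e^t), namely s(theta) * (1 - s(theta))."""
    s = 1.0 / (1.0 + math.exp(-theta))
    return s * (1.0 - s)

def log_ratio(theta, t):
    """log psi0''(theta + t) - log psi0''(theta), the quantity bounded in (24)."""
    return math.log(logistic_ddot(theta + t)) - math.log(logistic_ddot(theta))

grid = [k / 4.0 for k in range(-40, 41)]  # points from -10 to 10 in steps of 0.25
# Largest violation of |log ratio| <= |t| over the grid (non-positive means no violation).
worst = max(abs(log_ratio(th, t)) - abs(t) for th in grid for t in grid)
# Largest value of the second derivative over the grid, attained at theta = 0.
c0 = max(logistic_ddot(th) for th in grid)
```

On this grid `worst` is not positive and `c0` equals 1/4, in agreement with M₁ = 1 and c₀ = 1/4.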

Example 9 (Log-linear models: oracle inequalities)

Consider count data with y_i ∈ {0,1,2,…}. In log-linear models, it is assumed that

E_{β} (y_{i}) = e^{θ_{i}}, θ_{i} = x^{i} β, 1 \leq i \leq n .

This becomes a GLM with the average negative Poisson log-likelihood function

ℓ (β) = ψ (β) - z^{'} β, ψ (β) = \sum_{i = 1}^{n} \frac{exp (x^{i} β) - log (y_{i}!)}{n}, z = X^{'} y / n .

In this model, ψ₀(t) = e^t, so that the Lipschitz condition (24) holds with M₁ = 1 and η^* = ∞. Although (30) is not useful with c₀ = ∞, (29) can be used in Theorem 9.

4. Adaptive and Multistage Methods

We consider in this section an adaptive Lasso and its repeated applications, with weights recursively generated from a concave penalty function. This approach appears to provide the most appealing choice of weights both from heuristic and theoretical standpoints. The analysis here uses the results in Section 3 and an idea in Zhang (2010b).

We first consider adaptive Lasso and provide conditions under which it improves upon its initial estimator. Let ρ_λ(t) be a concave penalty function with ρ̇_λ(0+) = λ, where ρ̇_λ(t) = (∂/∂t)ρ_λ(t). The maximum concavity of the penalty is

κ = sup_{0 < t_{1} < t_{2}} \frac{∣ {\dot{ρ}}_{λ} (t_{2}) - {\dot{ρ}}_{λ} (t_{1}) ∣}{t_{2} - t_{1}} .

(33)

Let 𝒞(ξ,S) be the cone in (9). Let φ₀(b) be a quasi star-shaped function and define

F_{2} (ξ, S; φ_{0}) = inf {\frac{e^{φ_{0} (b)} Δ (β^{*} + b, β^{*})}{{∣ b_{S} ∣}_{2} {∣ b ∣}_{2}} : 0 \neq b \in C (ξ, S), φ_{0} (b) \leq η^{*}} .

(34)

This quantity is an ℓ₂ version of the GIF in (14). The analysis in Section 3 can be used to find lower bounds for (34) in the same way simply by taking φ(b) = |b|₂ and replacing |b_S|₁ with |b_S|₂. For example, in generalized linear models (13) satisfying the Lipschitz condition (24), the derivation of (25) yields

F_{2} (ξ, S; M_{2} {∣ \cdot ∣}_{2}) \geq inf_{b \in C (ξ, S), {∣ b ∣}_{2} = 1} \sum_{i = 1}^{n} \frac{M_{2} {\ddot{ψ}}_{0} (x^{i} β^{*})}{n {∣ b_{S} ∣}_{2}} min (\frac{∣ x^{i} b ∣}{M_{1}}, \frac{{(x^{i} b)}^{2}}{M_{2}}) .

Given 0 < ε₀ < 1, the components of the error vector z − ψ̇(β^*) are sub-Gaussian if for all $0 \leq t \leq σ \sqrt{(2 / n) log (4 p / ε_{0})}$ ,

P_{β^{*}} {∣ {(z - \dot{ψ} (β^{*}))}_{j} ∣ \geq t} \leq 2 e^{- {n t}^{2} / (2 σ^{2})} .

(35)

This condition holds for all GLM when the components of Xβ^* are uniformly in the interior of the natural parameter space for the exponential family.

Theorem 14

Let κ be as in (33), $S_{0} = {j : β_{j}^{*} \neq 0}$ , λ₀ > 0, 0 < η < 1, 0 < γ₀ < 1/κ, A > 1, and ξ ≥ (A+1 − κγ₀)/(A−1). Let φ₀ be a quasi star-shaped function, F(ξ, S;φ₀, φ₀) be the GIF in (14), and F₂(ξ, S;φ₀) its ℓ₂-version in (34). Suppose

λ_{0} {1 + A / (1 - κ γ_{0})} \leq F (ξ, S; φ_{0}, φ_{0}) η e^{- η}, F_{*} \leq F_{2} (ξ, S; φ_{0}),

(36)

for all S ⊇ S₀ with |S\S₀| ≤ ℓ^*. Let β̃ be an initial estimator of β and β̂ be the weighted Lasso in (2) with weights ŵ_j = ρ̇_λ(|β̃_j|)/λ and penalty level λ = Aλ₀/(1 − κγ₀). Then,

{∣ \hat{β} - β^{*} ∣}_{2} \leq \frac{e^{η}}{F_{*}} {{∣ {\dot{ρ}}_{λ} (∣ β_{S_{0}}^{*} ∣) ∣}_{2} + {∣ {z - \dot{ψ} (β^{*})}_{S_{0}} ∣}_{2} + (κ + \frac{1}{γ_{0} A} - \frac{κ}{A}) {∣ \tilde{β} - β^{*} ∣}_{2}}

in the event ${{∣ {(\tilde{β} - β^{*})}_{S_{0}^{c}} ∣}_{2}^{2} \leq γ_{0}^{2} λ^{2} ℓ^{*}} \cap {{∣ z - \dot{ψ} (β^{*}) ∣}_{\infty} \leq λ_{0}}$ . Moreover, if (35) holds and $λ_{0} = σ \sqrt{(2 / n) log (2 p / ε_{0})}$ with 0 < ε₀ < 1, then P_{β^*}{|z − ψ̇(β^*)|_∞ ≥ λ₀} ≤ ε₀.

Theorem 14 raises the possibility that β̂ improves β̃ under proper conditions. Thus it is desirable to repeatedly apply this adaptive Lasso in the following way,

{\hat{β}}^{(k + 1)} = \underset{β}{arg min} {ℓ (β) + \sum_{j = 1}^{p} {\dot{ρ}}_{λ} (∣ {\hat{β}}_{j}^{(k)} ∣) ∣ β_{j} ∣}, k = 0, 1, \dots .

(37)

Such multistage algorithms have been considered in the literature (Fan and Li, 2001; Zou and Li, 2008; Zhang, 2010b). As discussed in Remark 16 below, it is beneficial to use a concave penalty ρ_λ in (37). Natural choices of ρ_λ include the smoothly clipped absolute deviation and minimax concave penalties (Fan and Li, 2001; Zhang, 2010a).
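To make the recursion (37) concrete, the following Python sketch implements it for the squared-error loss ℓ(β) = |y − Xβ|₂²/(2n) with the minimax concave penalty, whose derivative is ρ̇_λ(t) = (λ − t/γ)₊ for t ≥ 0 and whose maximum concavity is κ = 1/γ. It is only an illustration under stated assumptions: the coordinate-descent solver, the function names, the tuning values λ = 0.5 and γ = 2, and the toy orthogonal design are all choices made here and are not taken from the paper.

```python
def soft_threshold(r, thr):
    """Soft-thresholding operator sign(r) * max(|r| - thr, 0)."""
    if r > thr:
        return r - thr
    if r < -thr:
        return r + thr
    return 0.0

def mcp_derivative(t, lam, gamma):
    """Derivative (lam - t / gamma)_+ of the minimax concave penalty at t >= 0."""
    return max(lam - t / gamma, 0.0)

def weighted_lasso(X, y, penalties, n_sweeps=100):
    """Coordinate descent for |y - X b|_2^2 / (2n) + sum_j penalties[j] * |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes coordinate j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, penalties[j]) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def multistage(X, y, lam, gamma, n_stages):
    """Recursion (37): stage 0 is the unweighted Lasso, later stages reweight by the MCP derivative."""
    beta = weighted_lasso(X, y, [lam] * len(X[0]))
    for _ in range(n_stages):
        penalties = [mcp_derivative(abs(b), lam, gamma) for b in beta]
        beta = weighted_lasso(X, y, penalties)
    return beta

# Toy orthogonal design (three columns of a 4 x 4 Hadamard matrix), so X'X/n = I.
X = [[1.0, 1.0, 1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, -1.0, -1.0]]
y = [1.2, 1.0, 3.0, 3.0]  # X beta* + noise with beta* = (2, 0, -1)
lasso = weighted_lasso(X, y, [0.5, 0.5, 0.5])
refined = multistage(X, y, lam=0.5, gamma=2.0, n_stages=3)
```

On this toy design the Lasso shrinks both nonzero coefficients by λ. The later stages remove the penalty on the large coefficient and progressively reduce it on the moderate one, while the zero coefficient stays at zero.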

Theorem 15

Let {κ,S₀, λ₀, η, γ₀, A, ξ, ℓ^*, λ} be as in Theorem 14. Let β̂⁽⁰⁾ be the unweighted Lasso with ŵ_j = 1 in (2) and β̂^(ℓ) be the ℓ-th iteration of the recursion (37) initialized with β̂⁽⁰⁾. Let ξ₀ = (λ + λ₀)/(λ − λ₀). Suppose (36) holds and

e^{η} {1 + (1 - κ γ_{0}) / A} / F (ξ_{0}, S_{0}; φ_{0}, {∣ \cdot ∣}_{2}) \leq γ_{0} \sqrt{ℓ^{*}} .

(38)

Define r₀ = (e^η/F_*){κ + 1/(γ₀A) − κ/A}. Suppose r₀ < 1. Then,

{∣ {\hat{β}}^{(ℓ)} - β^{*} ∣}_{2} \leq \frac{{∣ {\dot{ρ}}_{λ} (∣ β_{S_{0}}^{*} ∣) ∣}_{2} + {∣ {z - \dot{ψ} (β^{*})}_{S_{0}} ∣}_{2}}{e^{- η} F_{*} (1 - r_{0}) / (1 - r_{0}^{ℓ})} + \frac{r_{0}^{ℓ} e^{η} λ {1 + (1 - κ γ_{0}) / A}}{F (ξ_{0}, S_{0}; φ_{0}, {∣ \cdot ∣}_{2})}

(39)

in the event

{{∣ z - \dot{ψ} (β^{*}) ∣}_{\infty} \leq λ_{0}} \cap {\frac{{∣ {\dot{ρ}}_{λ} (∣ β_{S_{0}}^{*} ∣) ∣}_{2} + {∣ {z - \dot{ψ} (β^{*})}_{S_{0}} ∣}_{2}}{e^{- η} F_{*} (1 - r_{0})} \leq γ_{0} λ \sqrt{ℓ^{*}}} .

(40)

Moreover, if (35) holds and $λ_{0} = σ \sqrt{(2 / n) log (4 p / ε_{0})}$ with 0 < ε₀ < 1, then the intersection of the events (40) and ${{∣ {z - \dot{ψ} (β^{*})}_{S_{0}} ∣}_{2} \leq n^{- 1 / 2} σ \sqrt{2 ∣ S_{0} ∣ log (4 ∣ S_{0} ∣ / ε_{0})}}$ happens with P_{β^*}-probability at least 1 − ε₀, provided that

\frac{{∣ {\dot{ρ}}_{λ} (∣ β_{S_{0}}^{*} ∣) ∣}_{2} + n^{- 1 / 2} σ \sqrt{2 ∣ S_{0} ∣ log (4 ∣ S_{0} ∣ / ε_{0})}}{e^{- η} F_{*} (1 - r_{0})} \leq \frac{γ_{0} A λ_{0} \sqrt{ℓ^{*}}}{1 - κ γ_{0}} .

Remark 16

Define R⁽⁰⁾ = λe^η{1 + (1 − κγ₀)/A}/F(ξ₀, S₀; φ₀, |·|₂) and

R^{(\infty)} = \frac{{∣ {\dot{ρ}}_{λ} (∣ β_{S_{0}}^{*} ∣) ∣}_{2} + {∣ {z - \dot{ψ} (β^{*})}_{S_{0}} ∣}_{2}}{e^{- η} F_{*} (1 - r_{0})}, R^{(ℓ)} = (1 - r_{0}^{ℓ}) R^{(\infty)} + r_{0}^{ℓ} R^{(0)} .

E_{β^{*}} R^{(\infty)} \leq {{∣ {\dot{ρ}}_{λ} (∣ β_{S_{0}}^{*} ∣) ∣}_{2} + 2 σ \sqrt{∣ S_{0} ∣ / n}} e^{η} / {F_{*} (1 - r_{0})} .

Since ρ_λ(t) is concave in t, ${∣ {\dot{ρ}}_{λ} (∣ β_{S_{0}}^{*} ∣) ∣}_{2} \leq {\dot{ρ}}_{λ} (0 +) {∣ S_{0} ∣}^{1 / 2} = λ {∣ S_{0} ∣}^{1 / 2}$ . This component of E_{β^*}R^(∞) matches the noise inflation due to model selection uncertainty since $λ ≍ λ_{0} = σ \sqrt{(2 / n) log (p / ε_{0})}$ . This noise inflation diminishes when ${min}_{j \in S_{0}} ∣ β_{j}^{*} ∣ \geq γ λ$ and ρ̇_λ(t) = 0 for |t| ≥ γλ, yielding the super-efficient $E_{β^{*}} R^{(\infty)} \leq {2 σ \sqrt{∣ S_{0} ∣ / n}} e^{η} / {F_{*} (1 - r_{0})}$ without the log p factor. The risk bound R^(∞) is comparable with those for concave penalized least squares in linear regression (Zhang, 2010a).
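The way the bound R^(ℓ) contracts from R⁽⁰⁾ toward R^(∞) can be seen numerically. The sketch below iterates the one-step relation R^(ℓ) = (1 − r₀)R^(∞) + r₀R^(ℓ−1), which is how Theorem 14 is applied stage by stage, and compares it with the closed form given in this remark. The numerical values of r₀, R⁽⁰⁾ and R^(∞) are hypothetical.

```python
def risk_bounds(r0, R0, Rinf, n_stages):
    """Iterate R_l = (1 - r0) * Rinf + r0 * R_{l-1}, starting from R_0 = R0."""
    bounds = [R0]
    for _ in range(n_stages):
        bounds.append((1.0 - r0) * Rinf + r0 * bounds[-1])
    return bounds

def closed_form(r0, R0, Rinf, l):
    """Closed form R_l = (1 - r0**l) * Rinf + r0**l * R0."""
    return (1.0 - r0 ** l) * Rinf + r0 ** l * R0

# Hypothetical values: contraction factor 1/2, initial bound 4, limiting bound 1.
bounds = risk_bounds(r0=0.5, R0=4.0, Rinf=1.0, n_stages=5)
```

With r₀ < 1 the gap R^(ℓ) − R^(∞) shrinks by the factor r₀ at every stage, so a few stages already bring the bound close to R^(∞).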

Remark 17

For log(p/n) ≍ log p, the penalty level λ in Theorems 14 and 15 is comparable with the best proven results and of the smallest possible order in linear regression. For log(p/n) ≪ log p, the proper penalty level is expected to be of the order $σ \sqrt{(2 / n) log (p / ∣ S_{0} ∣)}$ under a vectorized sub-Gaussian condition which is slightly stronger than (35). This refinement for log(p/n) ≪ log p is beyond the scope of this paper.

5. Selection Consistency

In this section, we provide a selection consistency theorem for the ℓ₁ penalized convex minimization estimator, including both the weighted and unweighted cases. Let ‖M‖_∞ = max_{|u|_∞ ≤ 1} |Mu|_∞ be the ℓ_∞-to-ℓ_∞ operator norm of a matrix M.
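The ℓ_∞-to-ℓ_∞ operator norm has a simple closed form: it equals the largest absolute row sum of M, attained at a sign vector u. A minimal Python sketch, with a hypothetical 2 × 2 matrix, is:

```python
def linf_operator_norm(M):
    """max over |u|_inf <= 1 of |M u|_inf, which equals the largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in M)

M = [[1.0, -2.0], [0.5, 0.25]]  # hypothetical example matrix
norm = linf_operator_norm(M)

# A maximizer is the sign pattern of the row with the largest absolute sum.
u = [1.0, -1.0]
Mu = [sum(a * b for a, b in zip(row, u)) for row in M]
```

Here the first row has absolute sum 3, and the sign vector u = (1, −1) gives |Mu|_∞ = 3 as well.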

Theorem 18

Let ψ̈(β) = ℓ̈(β) be the Hessian of the loss in (1), β̂ be as in (2), β^* be a target vector, $z_{k}^{*}$ be as in (5), Ω₀ in (6), $S \supseteq {j : β_{j}^{*} \neq 0}$ and F(ξ,S;φ₀, φ) as in (14).

(i) Let 0 < η ≤ η^* ≤ 1, $B_{0}^{*} = {β : φ_{0} (β - β^{*}) \leq η, β_{S^{c}} = 0}$ and S_β = { j: β_j ≠ 0}. Suppose
$sup_{β \in B_{0}^{*}} {∣ {\hat{W}}_{S^{c}}^{- 1} {\ddot{ψ}}_{S^{c}, S_{β}} (β) {{\ddot{ψ}}_{S_{β}} (β)}^{- 1} {\hat{W}}_{S_{β}} sgn (β_{S_{β}}) ∣}_{\infty} \leq κ_{0} < 1$ (41)

$sup_{β \in B_{0}^{*}} {‖ {\hat{W}}_{S^{c}}^{- 1} {\ddot{ψ}}_{S^{c}, S_{β}} (β) {{\ddot{ψ}}_{S_{β}} (β)}^{- 1} ‖}_{\infty} \leq κ_{1} .$ (42)

Then, {j: β̂_j ≠ 0} ⊆ S in the event
$Ω_{1}^{*} = Ω_{0} \cap {{∣ {\hat{w}}_{S} ∣}_{\infty} λ + z_{0}^{*} \leq η e^{- η} F (0, S; φ_{0}, φ_{0}), κ_{1} z_{0}^{*} + z_{1}^{*} < (1 - κ_{0}) λ} .$ (43)
(ii) Let 0 < η ≤ η^* ≤ 1 and B₀ = {β: φ₀(β − β^*) ≤ η, sgn(β) = sgn(β^*)}. Suppose (41) and (42) hold with $B_{0}^{*}$ replaced by B₀. Then, sgn(β̂) = sgn(β^*) in the event
$Ω_{1}^{*} \cap {sup_{β \in B_{0}} {‖ {{\ddot{ψ}}_{S} (β)}^{- 1} ‖}_{\infty} ({∣ {\hat{w}}_{S} ∣}_{\infty} λ + z_{0}^{*}) < min_{j \in S} ∣ β_{j}^{*} ∣} .$ (44)
(iii) Suppose the conditions of Theorem 9 hold for the GLM. Then, the conclusions of (i) and (ii) hold under the respective conditions if F(0,S;φ₀, φ₀) is replaced by F^*(ξ,S;M₂|·|₂) or F_*(ξ,S) or $κ_{*}^{2} (ξ, S) / (M_{3} ∣ S ∣)$ with the respective φ₀ in Theorem 9.

For ŵ_j = 1, this result is somewhat more specific in the radius η for the uniform irrepresentable condition (41), compared with a similar extension of the selection consistency theory to the graphical Lasso by Ravikumar et al. (2008). In linear regression (10), ψ̈(β) = Σ = X′X/n does not depend on β, so that Theorem 18 with the special w_j = 1 matches the existing selection consistency theory for the unweighted Lasso (Meinshausen and Bühlmann, 2006; Tropp, 2006; Zhao and Yu, 2006; Wainwright, 2009). We discuss below the ℓ₁ penalized logistic regression as a specific example.
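In the linear regression case with unit weights, the quantity inside (41) for a vector β supported on S reduces to |Σ_{S^c,S} Σ_S^{−1} sgn(β_S)|_∞. The following Python sketch evaluates it for one fixed sign vector. This is a simplification assumed here, since (41) takes a supremum over all β in B₀^*, including vectors with smaller supports. The Gram matrix, the helper names and the small Gaussian-elimination solver are illustrative assumptions.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(A[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def irrepresentable_value(Sigma, S, signs):
    """Sup-norm of Sigma[S^c, S] * inverse(Sigma[S, S]) * signs."""
    complement = [j for j in range(len(Sigma)) if j not in S]
    Sigma_S = [[Sigma[a][c] for c in S] for a in S]
    v = solve(Sigma_S, signs)
    return max(abs(sum(Sigma[j][s] * v[k] for k, s in enumerate(S))) for j in complement)

# Hypothetical Gram matrix with unit diagonal, support S = {0, 1}, positive signs.
Sigma = [[1.0, 0.2, 0.5], [0.2, 1.0, 0.3], [0.5, 0.3, 1.0]]
kappa0 = irrepresentable_value(Sigma, S=[0, 1], signs=[1.0, 1.0])
```

For this matrix the value is 2/3, which is below 1, so this particular sign pattern satisfies the strict inequality required in (41).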

Example 10 (Logistic regression: selection consistency)

Suppose $w_{j} = 1 = {∣ x_{j} ∣}_{2}^{2} / n$ where x_j are the columns of X. If (43) and (44) hold with $z_{0}^{*}$ and $z_{1}^{*}$ replaced by $\sqrt{(log (p / ε_{0})) / (2 n)}$ , then the respective conclusions of Theorem 18 hold with at least probability 1 − ε₀ in P_β^*.

6. The Sparsity of the Lasso and SRC

The results in Sections 2, 3, and 4 are concerned with prediction and estimation properties of β̂, but not dimension reduction. Theorem 18 (i) and (iii) provide dimension reduction under ℓ_∞-type conditions (41) and (42). In this section, we provide upper bounds for the dimension of β̂ under conditions of a weaker ℓ₂ type. For this purpose, we introduce

κ_{+} (m) = sup_{∣ B ∣ = m} {λ_{max} (W_{B}^{- 2} \int_{0}^{1} {\ddot{ψ}}_{B} (β^{*} + t b) d t) : B \cap S = \emptyset, b \in C (ξ, S), φ_{0} (b) \leq η^{*}}

(45)

as a restricted upper eigenvalue, where λ_max(M) is the largest eigenvalue of matrix M, B ⊆ {1,…, p}, and ψ̈_B(β) and W_B are the restrictions of the Hessian of (1) and the weight operator W = diag(w₁,…, w_p) to ℝ^B.

Theorem 19

Let β^* be a target vector, $S \supseteq {j : β_{j}^{*} \neq 0}$ , β̂ be the weighted Lasso estimator (2), and $z_{k}^{*}$ be the ℓ_∞-noise level as in (5). Let 0 ≤ η^* ≤ 1, φ_{1,S}(b) = |b_S|₁/|S|, φ₀ be a quasi star-shaped function, and F(ξ,S;φ₀, φ) be the GIF in (14). Then, in the event (15),

# {j : {\hat{β}}_{j} \neq 0, j \notin S} < d_{1} = min {m \geq 1 : \frac{m}{κ_{+} (m)} > \frac{e^{η} ξ^{2} ∣ S ∣}{F (ξ, S; φ_{0}, φ_{1, S})}} .

It follows from the Cauchy-Schwarz inequality that κ₊(m) is sub-additive, κ₊(m₁ + m₂) ≤ κ₊(m₁) + κ₊(m₂), so that m/κ₊(m) is non-decreasing in m. For GLM, lower bounds for the GIF and probability upper bounds for $z_{k}^{*}$ can be found in Subsection 3.2. For $S = {j : β_{j}^{*} \neq 0}$ , Theorem 19 gives an upper bound for the number of false positives.
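The dimension bound d₁ in Theorem 19 is the first m at which m/κ₊(m) exceeds the threshold e^η ξ²|S|/F(ξ,S;φ₀,φ_{1,S}). A minimal Python sketch of this search follows. The restricted upper eigenvalue function, the value standing in for the GIF, and all constants are hypothetical inputs, since in practice these quantities have to be bounded rather than computed.

```python
import math

def dimension_bound(kappa_plus, threshold, m_max):
    """Smallest m >= 1 with m / kappa_plus(m) > threshold, or None if none exists up to m_max."""
    for m in range(1, m_max + 1):
        if m / kappa_plus(m) > threshold:
            return m
    return None

# Hypothetical inputs: eta, xi, |S| and a stand-in value for the GIF.
eta, xi, size_S, gif = 0.2, 1.5, 3, 0.5
threshold = math.exp(eta) * xi ** 2 * size_S / gif

# Hypothetical sub-additive restricted upper eigenvalue, slowly growing in m.
d1 = dimension_bound(lambda m: 1.0 + 0.02 * (m - 1), threshold, m_max=1000)
```

With these inputs the threshold is about 16.5 and the search stops at d₁ = 25. A faster-growing κ₊(m) delays the crossing and therefore weakens the bound.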

In linear regression, upper bounds for the number of false positives of the Lasso or concave penalized LSE can be found in Zhang and Huang (2008) and Zhang (2010a) under a sparse Riesz condition (SRC). We now extend their results to the Lasso for the more general convex minimization problem (1). For this purpose, we strengthen (18) to

e^{- φ_{0} (b)} Σ^{*} \leq \ddot{ψ} (β^{*} + b) \leq e^{φ_{0} (b)} Σ^{*}, \forall b \in C (ξ, S), φ_{0} (b) \leq η^{*},

(46)

and assume the following SRC: for certain constants {c_*, c^*}, integer d^*, 0 < α < 1, 0 < η ≤ η^* ≤ 1, all A ⊃ S with |A| = d^*, and all u ∈ ℝ^A with |u| = 1,

c_{*} \leq 〈 u, {\ddot{ψ}}_{A} (β^{*}) u 〉 \leq c^{*}, \frac{∣ S ∣}{2 (1 - α)} (\frac{e^{2 η} c^{*}}{c_{*}} + 1 - 2 α) \leq d^{*} .

(47)

Theorem 20

Let β̂ be the Lasso estimator (2) with w_j = 1 for all j, β^* be a target vector, $S \supseteq {j : β_{j}^{*} \neq 0}$ , and $z_{k}^{*}$ be the ℓ_∞-noise level as in (5). Let φ₀ be a quasi star-shaped function, and F(ξ,S;φ₀,φ) be the GIF in (14). Suppose (46) and (47) hold. Let d₁ be the integer satisfying d₁ − 1 ≤ |S|(e^{2η}c^*/c_* − 1)/(2 − 2α) < d₁. Then,

# {j : {\hat{β}}_{j} \neq 0, j \notin S} < d_{1}

when $z_{0}^{*} + ξ z_{1}^{*} \leq (ξ - 1) λ, λ + z_{0}^{*} \leq η e^{- η} F (ξ, S; φ_{0}, φ_{0})$ , and

max_{A \supset S, ∣ A ∣ \leq d_{1}} {∣ {(Σ^{*})}_{A}^{- 1 / 2} {\dot{ℓ}}_{A} (β^{*}) ∣}_{2} \leq e^{- η} α λ \sqrt{d_{1} / c^{*}} .

Theorems 19 and 20 use different sets of conditions to derive dimension bounds since different analytical approaches are used. These sets of conditions do not imply each other. In the most optimistic case, the SRC (47) allows d^* = d₁ + |S| to be arbitrarily close to |S| when e^{2η}c^*/c_* ≈ 1, while Theorem 19 requires d₁ ≥ |S| when κ₊(m) ≥ 1 and F(ξ,S;φ₀,φ_{1,S}) ≤ 1 (always true for Σ^* with 1 in the diagonal).
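The integer d₁ of Theorem 20 and the rank required by the SRC (47) are linked by simple arithmetic: writing x = |S|(e^{2η}c^*/c_* − 1)/(2 − 2α), the lower bound on d^* in (47) equals x + |S| and d₁ = ⌊x⌋ + 1. The Python sketch below carries out this computation for hypothetical values of |S|, c^*/c_*, η and α.

```python
import math

def src_dimension(size_S, ratio, eta, alpha):
    """Return (d1, x, d_star_min) where ratio stands for c^* / c_*.

    x          = size_S * (e^{2 eta} * ratio - 1) / (2 - 2 alpha)
    d1         = the integer with d1 - 1 <= x < d1
    d_star_min = x + size_S, the lower bound on d^* in the SRC (47)
    """
    x = size_S * (math.exp(2.0 * eta) * ratio - 1.0) / (2.0 - 2.0 * alpha)
    d1 = math.floor(x) + 1
    return d1, x, x + size_S

# Hypothetical values: |S| = 10, c^*/c_* = 2, eta = 0.1, alpha = 0.5.
d1, x, d_star_min = src_dimension(size_S=10, ratio=2.0, eta=0.1, alpha=0.5)
```

For these values x is about 14.4, so d₁ = 15, and the choice d^* = d₁ + |S| = 25 satisfies the rank requirement in (47).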

7. Discussion

In this paper, we studied the estimation, prediction, selection and sparsity properties of the weighted and adaptive ℓ₁-penalized estimators in a general convex loss formulation. We also studied concave regularization in the form of recursive application of adaptive ℓ₁-penalized estimators.

We applied our general results to several important statistical models, including linear regression and generalized linear models. For linear regression, we extend the existing results to weighted and adaptive Lasso. For the GLMs, ℓ_q error bounds for a general q ≥ 1 are not available in the literature, although ℓ₁ and ℓ₂ bounds have been obtained under different sets of conditions in van de Geer (2008) and Negahban et al. (2010), respectively. Our fixed-sample analysis provides explicit definitions of constant factors in an explicit neighborhood of a target. Our oracle inequalities yield even sharper results for multistage recursive application of adaptive Lasso based on a suitable concave penalty. The results on the sparsity of the solution to the ℓ₁-penalized convex minimization problem are based on a new approach.

An interesting aspect of the approach taken in this paper in dealing with general convex losses such as those for the GLM is that the conditions imposed on the Hessian naturally “converge” to those for the linear regression as the convex loss “converges” to a quadratic form.

A key quantity used in the derivation of the results is the generalized invertibility factor (14), which grows out of the idea of the ℓ₂ restricted eigenvalue but improves upon it. The use of GIF yields sharper bounds on the estimation and prediction errors. This was discussed in detail in the context of linear regression in Ye and Zhang (2010).

We assume that the convex function ψ(·) is twice differentiable. Although this assumption is satisfied in many important and widely used statistical models, it would be interesting to extend the results obtained in this paper to models with less smooth loss functions, such as those in quantile regression and support vector machines.

Acknowledgments

The work of Jian Huang is supported in part by the National Institutes of Health (NIH Grants R01CA120988 and R01CA142774) and the National Science Foundation (NSF Grant DMS-08-05670). The work of Cun-Hui Zhang is supported in part by the National Science Foundation (NSF Grants DMS-0906420 and DMS-1106753) and the National Security Agency (NSA Grant H98230-11-1-0205).

Appendix A

Proof of Lemma 1

Since ψ̇(β̂) − ψ̇(β^*) = z − ψ̇(β^*) − g, (3) implies

Δ (\hat{β}, β^{*}) = 〈 \hat{β}, z - \dot{ψ} (β^{*}) 〉 - λ {∣ \hat{W} \hat{β} ∣}_{1} - 〈 β^{*}, z - \dot{ψ} (β^{*}) - g 〉

and |g_j| ≤ ŵ_jλ. Thus, (7) follows from |(z − ψ̇(β^*))_j| ≤ ŵ_jλ and ŵ_j ≤ w_j in S in Ω₀.

For (8), we have h_{S^c} = β̂_{S^c} and $β_{S^{c}}^{*} = 0$ , so that in Ω₀ (3) gives

\begin{array}{l} Δ (\hat{β}, β^{*}) = 〈 {\hat{β}}_{S^{c}}, {z - \dot{ψ} (β^{*})}_{S^{c}} 〉 - λ {∣ {\hat{W}}_{S^{c}} {\hat{β}}_{S^{c}} ∣}_{1} - 〈 h_{S}, {z - \dot{ψ} (β^{*}) - g}_{S} 〉 \\ \leq {∣ W_{S^{c}} {\hat{β}}_{S^{c}} ∣}_{1} (z_{1}^{*} - λ) + 〈 h_{S}, g_{S} - {z - \dot{ψ} (β^{*})}_{S} 〉 \\ \leq {∣ W_{S^{c}} {\hat{β}}_{S^{c}} ∣}_{1} (z_{1}^{*} - λ) + {∣ h_{S} ∣}_{1} (z_{0}^{*} + {∣ w_{S} ∣}_{\infty} λ) . \end{array}

This gives (8). Since Δ(β̂,β^*)>0, h ∈ 𝒞(ξ,S) when $({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*}) / (λ - z_{1}^{*}) \leq ξ$ . For j ∉ S, h_j(ψ̇(β^* + h) − ψ̇(β^*))_j = β̂_j(z − ψ̇(β^*) − g)_j ≤ |β̂_j|(w_jλ − g_j) ≤ 0.

Proof of Theorem 4

Let h = β̂ − β^*. Since ψ(β) is a convex function,

t^{- 1} Δ (β^{*} + t h, β^{*}) = \frac{\partial}{\partial t} {ψ (β^{*} + t h) - t 〈 h, \dot{ψ} (β^{*}) 〉}

is an increasing function of t. For 0 ≤ t ≤ 1 and in the event Ω₁, (8) implies

t^{- 1} Δ (β^{*} + t h, β^{*}) \leq Δ (h + β^{*}, β^{*}) < ({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*}) {∣ h_{S} ∣}_{1} .

By (9) and (14), F(ξ,S;φ₀,φ₀) ≤ Δ(β^* + th,β^*)e^{φ₀(th)}/{t|h_S|₁φ₀(th)} for φ₀(th) ≤ η^*. Thus, for φ₀(th) ≤ min{η^*,φ₀(h)} and in the event Ω₁,

φ_{0} (t h) e^{- φ_{0} (t h)} \leq \frac{Δ (β^{*} + t h, β^{*})}{t {∣ h_{S} ∣}_{1} F (ξ, S; φ_{0}, φ_{0})} \leq \frac{{∣ w_{S} ∣}_{\infty} λ + z_{0}^{*}}{F (ξ, S; φ_{0}, φ_{0})} \leq η e^{- η} .

If η^* < φ₀(h), the above inequality at φ₀(th) = η^* would give η^*e^{−η^*} < ηe^{−η}, which contradicts η ≤ η^* ≤ 1. Thus, η^* ≥ φ₀(h) and φ₀(th)e^{−φ₀(th)} ≤ ηe^{−η} for all 0 ≤ t ≤ 1. This implies φ₀(h) ≤ η ≤ η^*. Another application of (8) yields

φ (h) \leq \frac{Δ (β^{*} + h, β^{*}) e^{φ_{0} (h)}}{F (ξ, S; φ_{0}, φ) {∣ h_{S} ∣}_{1}} \leq \frac{({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*}) e^{η}}{F (ξ, S; φ_{0}, φ)} .

We obtain (17) by applying (16) with φ = φ_{1,S} to the right-hand side of (8).

Proof of Lemma 8

Since $\dot{ψ} (β) = \sum_{i = 1}^{n} x^{i} {\dot{ψ}}_{0} (x^{i} β) / n$ by (23),
$\begin{array}{l} E_{β} exp {\frac{n}{σ^{2}} 〈 b, z - \dot{ψ} (β) 〉} = exp [\sum_{i = 1}^{n} \frac{ψ_{0} (x^{i} (β + b)) - ψ_{0} (x^{i} β) - (x^{i} b) {\dot{ψ}}_{0} (x^{i} β)}{σ^{2}}] \\ = exp [\sum_{i = 1}^{n} \int_{0}^{1} \frac{{(x^{i} b)}^{2} {\ddot{ψ}}_{0} (x^{i} (β + t b))}{σ^{2}} (1 - t) d t] . \end{array}$ (48)

This and (24) imply that for M₁|Xb|_∞ ≤ η₀,
$E_{β^{*}} exp {\frac{n}{σ^{2}} 〈 b, z - \dot{ψ} (β^{*}) 〉} \leq exp [\frac{n e^{η_{0}} 〈 b, Σ^{*} b 〉}{2 σ^{2}}] .$ (49)

Since ${max}_{k = 0, 1} z_{k}^{*} / λ_{k} = {max}_{j} t_{j}^{- 1} ∣ z_{j} - {\dot{ψ}}_{j} (β^{*}) ∣$ by (5),
$\begin{array}{l} P_{β^{*}} {max_{k = 0, 1} z_{k}^{*} / λ_{k} > 1} \leq \sum_{j = 1}^{p} P_{β^{*}} {∣ z_{j} - {\dot{ψ}}_{j} (β^{*}) ∣ > t_{j}} \\ \leq \sum_{j = 1}^{p} E_{β^{*}} exp {\frac{n}{σ^{2}} b_{j} ∣ z_{j} - {\dot{ψ}}_{j} (β^{*}) ∣ - \frac{n}{σ^{2}} b_{j} t_{j}} \end{array}$

with $b_{j} = e^{- η_{0}} t_{j} / Σ_{j j}^{*}$ . Since M₁ max_{ij}|x_{ij}|b_j ≤ η₀, (49) gives
$P_{β^{*}} {max_{k = 0, 1} z_{k}^{*} / λ_{k} > 1} \leq \sum_{j = 1}^{p} 2 exp (- \frac{n e^{- η_{0}} t_{j}^{2}}{2 σ^{2} Σ_{j j}^{*}}) .$
If (30) holds, we simply replace ψ̈₀(xⁱ(β + tb)) by c₀ in (48). The rest is simpler and omitted.

Proof of Theorem 9

(i) Since F^*(ξ,S;φ) in (25) is a lower bound of F(ξ,S;φ₀,φ) in (14), (32) follows from Theorem 4 with φ₀(b) = M₂|b|₂. The probability statement follows from Lemma 8. (ii) Since (18) holds for the φ₀(b) in (26), we are allowed to use F_*(ξ,S) = F₀(ξ,S;φ₀) in Corollary 7. The condition η^* = ∞ is used since φ₀(b) does not control M₁|Xb|_∞. (iii) We are also allowed to use φ₀(b) = M₃|b_S|₁ in (28) due to M₁|Xb|_∞ ≤ φ₀(b).

Proof of Theorem 14

Let h = β̂ − β^*, w_j = ŵ_j and S = {j : |β̃_j| > γ₀λ} ∪ S₀. For j ∉ S, w_j = ρ̇_λ(|β̃_j|)/λ ≥ {ρ̇_λ(0+) − κγ₀λ}/λ = 1 − κγ₀, so that $z_{1}^{*} = {∣ W_{S^{c}}^{- 1} {z - \dot{ψ} (β^{*})}_{S^{c}} ∣}_{\infty} \leq λ_{0} / (1 - κ γ_{0}) = λ / A$ . We also have $z_{0}^{*} \leq {∣ z - \dot{ψ} (β^{*}) ∣}_{\infty} \leq λ_{0} = (1 - κ γ_{0}) λ / A$ . Since |ŵ|_∞ ≤ 1, these bounds for $z_{0}^{*}$ and $z_{1}^{*}$ yield

\frac{{∣ {\hat{w}}_{S} ∣}_{\infty} λ + z_{0}^{*}}{λ - z_{1}^{*}} \leq \frac{λ + (1 - κ γ_{0}) λ / A}{λ - λ / A} = \frac{A + 1 - κ γ_{0}}{A - 1} \leq ξ .

Thus, since |g_j| ≤ ŵ_jλ in (8), Lemma 1 provides

h \in C (ξ, S), Δ (β^{*} + h, β^{*}) \leq {∣ h_{S} ∣}_{2} ({∣ {\hat{w}}_{S} ∣}_{2} λ + {∣ {z - \dot{ψ} (β^{*})}_{S} ∣}_{2})

Since $∣ S \setminus S_{0} ∣ \leq {∣ {(\tilde{β} - β^{*})}_{S_{0}^{c}} ∣}_{2}^{2} / (γ_{0}^{2} λ^{2}) \leq ℓ^{*}$ , we have by (36)

{∣ w_{S} ∣}_{\infty} λ + z_{0}^{*} \leq λ + λ_{0} = λ_{0} {1 + A / (1 - κ γ_{0})} \leq F (ξ, S; φ_{0}, φ_{0}) η e^{- η} .

Thus, φ₀(h) ≤ η by (16), so that by (34) and (36),

e^{- η} F_{*} {∣ h_{S} ∣}_{2} {∣ h ∣}_{2} \leq Δ (β^{*} + h, β^{*}) \leq {∣ h_{S} ∣}_{2} ({∣ {\hat{w}}_{S} ∣}_{2} λ + {∣ {z - \dot{ψ} (β^{*})}_{S} ∣}_{2}) .

Since |h_S| = 0 implies h = 0 for h ∈ 𝒞(ξ,S), we find

e^{- η} F_{*} {∣ h ∣}_{2} \leq {∣ {\hat{w}}_{S} ∣}_{2} λ + {∣ {z - \dot{ψ} (β^{*})}_{S} ∣}_{2} .

(50)

Since ${\hat{w}}_{j} λ = {\dot{ρ}}_{λ} (∣ {\tilde{β}}_{j} ∣) \leq {\dot{ρ}}_{λ} (∣ β_{j}^{*} ∣) + κ ∣ {\tilde{β}}_{j} - β_{j}^{*} ∣$ , we have

{∣ {\hat{w}}_{S} ∣}_{2} λ \leq {∣ {\dot{ρ}}_{λ} (∣ β_{S_{0}}^{*} ∣) ∣}_{2} + κ {∣ \tilde{β} - β^{*} ∣}_{2} .

Since |z − ψ̇(β^*)|_∞ ≤ λ₀ = (1 − κγ₀)λ/A and $∣ {\tilde{β}}_{j} - β_{j}^{*} ∣ = ∣ {\tilde{β}}_{j} ∣ \geq γ_{0} λ$ for j ∈ S\S₀,

\begin{array}{l} {∣ {z - \dot{ψ} (β^{*})}_{S} ∣}_{2} \leq {∣ {z - \dot{ψ} (β^{*})}_{S_{0}} ∣}_{2} + {∣ S \setminus S_{0} ∣}^{1 / 2} (1 - κ γ_{0}) λ / A \\ \leq {∣ {z - \dot{ψ} (β^{*})}_{S_{0}} ∣}_{2} + {∣ \tilde{β} - β^{*} ∣}_{2} (1 - κ γ_{0}) / (γ_{0} A) . \end{array}

Inserting the above inequalities into (50), we find that

e^{- η} F_{*} {∣ \hat{β} - β^{*} ∣}_{2} \leq {∣ {\dot{ρ}}_{λ} (∣ β_{S_{0}}^{*} ∣) ∣}_{2} + {∣ {z - \dot{ψ} (β^{*})}_{S_{0}} ∣}_{2} + (κ + \frac{1}{γ_{0} A} - \frac{κ}{A}) {∣ \tilde{β} - β^{*} ∣}_{2} .

The probability statement follows directly from (35) with the union bound.

Proof of Theorem 15

Let R^(ℓ) be as in Remark 16. For |z − ψ̇(β^*)|_∞ ≤ λ₀, (16) of Theorem 4 gives |β̂⁽⁰⁾ − β^*|₂ ≤ e^η(λ + λ₀)/F(ξ₀,S₀;φ₀, |·|₂) = R⁽⁰⁾. Under conditions (38) and (40), we have $R^{(ℓ)} \leq γ_{0} λ \sqrt{ℓ^{*}}$ for all ℓ ≥ 0. We prove (39) by induction. We have already proved (39) for ℓ = 0. For ℓ ≥ 1, we let β̃ = β̂^(ℓ−1) and apply Theorem 14: |β̂^(ℓ) − β^*|₂ ≤ (1 − r₀)R^(∞) + r₀R^(ℓ−1) = R^(ℓ). The probability statement follows directly from (35) with the union bound.

Proof of Theorem 18

Let z̃ = z − ψ̇(β^*) and λ be fixed. Consider

\hat{β} (λ, t) = \underset{β}{arg min} {ψ (β) - 〈 β, \dot{ψ} (β^{*}) + t \tilde{z} 〉 + t λ \sum_{j = 1}^{p} {\hat{w}}_{j} ∣ β_{j} ∣ : β_{S^{c}} = 0}

(51)

as an artificial path for 0 ≤ t ≤ 1. For each t, the KKT conditions for β̂(λ, t) are

g_{S} (λ, t) = t λ {\hat{W}}_{S} u_{S} (λ, t), u_{j} (λ, t) \begin{cases} = sgn ({\hat{β}}_{j} (λ, t)) & \forall {\hat{β}}_{j} (λ, t) \neq 0 \\ \in [- 1, 1], & \forall j \in S, \end{cases}

where g(λ, t) = −ψ̇(β̂(λ, t)) + ψ̇(β^*) + tz̃. Since (51) is constrained to β_{S^c} = 0 and both the error z̃ and the penalty level λ are scaled with t, Theorem 4 with ξ = 0 yields

φ_{0} (\hat{β} (λ, t) - β^{*}) \leq η_{t} \to 0 with η_{t} e^{- η_{t}} = t η e^{- η}, \forall 0 < t \leq 1.

(52)
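
The artificial path (51) and its KKT conditions can be checked numerically in the simplest case. The sketch below assumes linear regression, written as ψ(β) = β′Σβ/2 with Σ = X′X/n and z = X′y/n, works only on the coordinates in S, and uses a plain coordinate-descent solver. The simulated data, the values of λ and ŵ, and all function names are hypothetical; this is an illustration under those assumptions, not the authors' procedure.

```python
import numpy as np

def soft_threshold(x, tau):
    """Scalar soft-thresholding operator."""
    return np.sign(x) * max(abs(x) - tau, 0.0)

def solve_weighted_l1_quadratic(Sigma, c, tau, n_sweeps=500):
    """Coordinate descent for min_b 0.5*b'Sigma b - <b, c> + sum_j tau_j*|b_j|."""
    b = np.zeros(len(c))
    for _ in range(n_sweeps):
        for j in range(len(c)):
            # Partial residual that excludes the contribution of coordinate j.
            r_j = c[j] - Sigma[j] @ b + Sigma[j, j] * b[j]
            b[j] = soft_threshold(r_j, tau[j]) / Sigma[j, j]
    return b

def artificial_path(Sigma, z, beta_star, lam, w, t):
    """beta_hat(lam, t) of (51) for psi(b) = 0.5*b'Sigma b, on the coordinates in S."""
    z_tilde = z - Sigma @ beta_star        # z - psi_dot(beta*)
    c = Sigma @ beta_star + t * z_tilde    # psi_dot(beta*) + t * z_tilde
    return solve_weighted_l1_quadratic(Sigma, c, t * lam * w)

rng = np.random.default_rng(0)
n = 200
beta_star = np.array([2.0, -1.5, 1.0])     # hypothetical true coefficients on S
X = rng.standard_normal((n, 3))            # design restricted to S
y = X @ beta_star + rng.standard_normal(n)
Sigma, z = X.T @ X / n, X.T @ y / n
lam, w = 0.1, np.ones(3)                   # hypothetical penalty level and weights

dist, kkt_gap = {}, {}
for t in (1.0, 0.1, 0.01):
    b = artificial_path(Sigma, z, beta_star, lam, w, t)
    # g(lam, t) = -psi_dot(beta_hat) + psi_dot(beta*) + t * z_tilde
    g = Sigma @ beta_star + t * (z - Sigma @ beta_star) - Sigma @ b
    active = b != 0
    # KKT on the active set: g_j = t * lam * w_j * sgn(beta_hat_j)
    kkt_gap[t] = np.max(np.abs(g[active] - t * lam * w[active] * np.sign(b[active])))
    dist[t] = np.linalg.norm(b - beta_star)
    print(t, dist[t], kkt_gap[t])
```

In this quadratic case the distance to β^* shrinks as t decreases, in line with β̂(λ, 0+) = β^*, and the KKT residual on the active set is zero up to numerical error.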

Let S_t = {j : β̂_j(λ, t) ≠ 0}. Applying the differentiation operator D = (∂/∂t) to the KKT conditions, we find that almost everywhere in t,

{(D g)}_{S_{t}} (λ, t) = {\tilde{z}}_{S_{t}} - {\ddot{ψ}}_{S_{t}} (\hat{β} (λ, t)) {(D \hat{β}) (λ, t)}_{S_{t}} = λ {\hat{W}}_{S_{t}} u_{S_{t}} (λ, t) .

It follows that

{(D \hat{β})}_{S_{t}} (λ, t) = {\{ {\ddot{ψ}}_{S_{t}} (\hat{β} (λ, t)) \}}^{- 1} \{ {\tilde{z}}_{S_{t}} - λ {\hat{W}}_{S_{t}} u_{S_{t}} (λ, t) \}

(53)

and with an application of the chain rule,

{(D g)}_{S^{c}} (λ, t) = {\tilde{z}}_{S^{c}} - {\ddot{ψ}}_{S^{c}, S_{t}} (\hat{β} (λ, t)) {\{ {\ddot{ψ}}_{S_{t}} (\hat{β} (λ, t)) \}}^{- 1} \{ {\tilde{z}}_{S_{t}} - λ {\hat{W}}_{S_{t}} u_{S_{t}} (λ, t) \} .

Since g(λ, t) is almost differentiable and β̂(λ, 0+) = β^*, we have g(λ, 0+) = 0 and

g_{S^{c}} (λ, 1 -) = \int_{0}^{1} [{\tilde{z}}_{S^{c}} - {\ddot{ψ}}_{S^{c}, S_{t}} (\hat{β} (λ, t)) {\{ {\ddot{ψ}}_{S_{t}} (\hat{β} (λ, t)) \}}^{- 1} \{ {\tilde{z}}_{S_{t}} - λ {\hat{W}}_{S_{t}} u_{S_{t}} (λ, t) \}] d t .

Thus, (52), (41), and (42) imply

{∣ {\hat{W}}_{S^{c}}^{- 1} g_{S^{c}} (λ, 1 -) ∣}_{\infty} \leq {∣ {\hat{W}}_{S^{c}}^{- 1} {\tilde{z}}_{S^{c}} ∣}_{\infty} + κ_{1} {∣ {\tilde{z}}_{S} ∣}_{\infty} + κ_{0} λ {∣ u_{S_{t}} (λ, t) ∣}_{\infty},

which is smaller than λ in the event in (43). Thus, since ψ̈_S(β̂(λ, 1−)) is of full rank, β̂(λ, 1−) is the unique solution of the KKT conditions (3) for β̂. This completes the proof of part (i).

For part (ii), we observe that (44) implies $S = \{ j : β_{j}^{*} \neq 0 \}$ . Since β̂(λ, 0+) = β^*, there exists t₁ > 0 such that $u_{S} (λ, t) = sgn (β_{S}^{*})$ for all 0 < t < t₁. By (52), β̂(λ, t) ∈ Inline graphic for 0 < t < t₁. It follows from (53) and (44) that

{∣ {(D \hat{β})}_{S} (λ, t) ∣}_{\infty} \leq {‖ {\{ {\ddot{ψ}}_{S_{t}} (\hat{β} (λ, t)) \}}^{- 1} ‖}_{\infty} {∣ {\tilde{z}}_{S} - λ {\hat{W}}_{S} sgn (β_{S}^{*}) ∣}_{\infty} < min_{j \in S} ∣ β_{j}^{*} ∣ - ε_{1}

for 0 < t < t₁ and some ε₁ > 0. Since β̂(λ, 0+) = β^*, this implies ${∣ {\hat{β}}_{S} (λ, t) - β_{S}^{*} ∣}_{\infty} < {min}_{j \in S} ∣ β_{j}^{*} ∣ - ε_{1}$ for all 0 < t < t₁ ∧ 1. It follows that sgn(β̂(λ, t)) = sgn(β^*) for 0 < t ≤ 1 by the continuity of β̂(λ, t) in t, that is, t₁ = 1. Consequently, conditions (41) and (42) are only needed for the smaller class Inline graphic in the proof of part (i). This gives β̂(λ, 1) = β̂ and completes the proof of part (ii).

Finally, in part (iii), F₀(ξ,S;φ₀,φ₀) is simply replaced by its lower bounds with the respective φ₀.

Proof of Theorem 19

Suppose the event Ω₁ in (15) happens, so that ŵ_j ≥ w_j for j ∉ S and the conclusions of Theorem 4 hold. Let h = β̂ − β^* and $\hat{Σ} = \int_{0}^{1} \ddot{ψ} (β^{*} + x h) d x$ . It follows from (1) that Σ̂h = ψ̇(β^* + h) − ψ̇(β^*) = ℓ̇(β̂) − ℓ̇(β^*). By the KKT conditions (3),

∣ {(\hat{Σ} h)}_{j} ∣ = ∣ {(\dot{ℓ} (\hat{β}) - \dot{ℓ} (β^{*}))}_{j} ∣ \geq {\hat{w}}_{j} λ - z_{j} \geq w_{j} (λ - z_{1}^{*}) > 0, j \notin S .

Let B ⊆ {j ∉ S : β̂_j ≠ 0} with |B| ≤ d₁. It follows from Theorem 4 that φ₀(h) ≤ η ≤ η^*, so that (45) implies ${max}_{{∣ u ∣}_{2} = 1} {∣ {(W^{- 1} {\hat{Σ}}^{1 / 2} u)}_{B} ∣}_{2}^{2} = λ_{max} (W_{B}^{- 2} {\hat{Σ}}_{B}) \leq κ_{+} (d_{1})$ . Thus, by the definition of Δ(β, β^*) in (4),

{(λ - z_{1}^{*})}^{2} ∣ B ∣ \leq {∣ {(W^{- 1} \hat{Σ} h)}_{B} ∣}_{2}^{2} \leq κ_{+} (d_{1}) 〈 h, \hat{Σ} h 〉 = κ_{+} (d_{1}) Δ (β^{*} + h, β^{*}) .

This and the prediction bound in Theorem 4 yield

∣ B ∣ \leq \frac{κ_{+} (d_{1}) Δ (β^{*} + h, β^{*})}{{(λ - z_{1}^{*})}^{2}} \leq \frac{κ_{+} (d_{1}) e^{η} {({∣ w_{S} ∣}_{\infty} λ + z_{0}^{*})}^{2} ∣ S ∣}{{(λ - z_{1}^{*})}^{2} F (ξ, S; φ_{0}, φ_{1, S})} \leq \frac{κ_{+} (d_{1}) e^{η} ξ^{2} ∣ S ∣}{F (ξ, S; φ_{0}, φ_{1, S})} < d_{1} .

Since every subset B ⊆ {j ∉ S : β̂_j ≠ 0} with |B| ≤ d₁ satisfies |B| < d₁, it must hold that #{j ∉ S : β̂_j ≠ 0} < d₁; otherwise this set would contain a subset with exactly d₁ elements.

Proof of Theorem 20

Let z̃ = z − ψ̇(β^*) = −ℓ̇(β^*) and β̂(λ, t) be the artificial estimator in (51) with ŵ_j = 1, and h(λ, t) = β̂(λ, t) − β^*. Let λ_* ≤ λ^* be penalty levels satisfying

[λ_{*}, λ^{*}] \subseteq \cap_{0 < t \leq 1} \{ λ : φ_{0} (h (λ, t)) \leq η, h (λ, t) \in C (ξ, S), {∣ {(Σ^{*})}^{- 1 / 2} \tilde{z} ∣}_{2} \leq \frac{α λ \sqrt{d_{1}}}{e^{η} \sqrt{c^{*}}} \} .

(54)

We pick such an interval [λ_*,λ^*] containing the penalty level λ of concern in the theorem. This is allowed by Lemma 1 and Theorem 4. We first prove the stronger conclusion

max_{λ_{*} \leq λ \leq λ^{*}} max_{0 < t \leq 1} \# \{ j : {\hat{β}}_{j} (λ, t) \neq 0, j \notin S \} < d_{1}

(55)

under the additional assumption

min_{λ_{*} \leq λ \leq λ^{*}} min_{0 < t \leq 1} \# \{ j : {\hat{β}}_{j} (λ, t) \neq 0, j \notin S \} \leq d_{1} .

(56)

Let g(λ, t) = tz̃ + ψ̇(β^*) − ψ̇(β̂(λ, t)) be the negative gradient at β̂(λ, t) in (51). By the KKT conditions for (51), sgn(β̂_j(λ, t±)) ≠ 0 implies |g_j(λ, t)| = tλ. Thus, (56) implies the existence of λ ∈ [λ_*,λ^*], t₁ ∈ (0, 1], and A₁ ⊂ {1, …, p} satisfying

\{ j : sgn ({\hat{β}}_{j} (λ, t_{1})) \neq 0 \} \cup S \subseteq A_{1} \subseteq \{ j : ∣ g_{j} (λ, t_{1}) ∣ = t_{1} λ \} \cup S, \quad ∣ A_{1} ∣ \leq d_{1} + ∣ S ∣ .

(57)

Moreover, if max_{λ_* ≤ λ ≤ λ^*} max_{0 < t ≤ 1} #{j : β̂_j(λ, t) ≠ 0, j ∉ S} ≥ d₁, then by the continuity of β̂(λ, t), it would be possible to restrict (57) to |A₁| = d₁ + |S| with some different λ ∈ [λ_*,λ^*] and t₁ ∈ (0, 1]. Therefore, it suffices to deny this possibility by proving |A₁| < d₁ + |S| based on (57) and (54). Let A₀ = A₁\S. We prove |A₀| < d₁, which is equivalent to |A₁| < d₁ + |S|.

Let v_(A) = (v_j I{j ∈ A}, j ∈ A₁)′ ∈ ℝ^{A₁} and v_A = (v_j, j ∈ A)′ ∈ ℝ^A for all vectors v = (v₁, …, v_p)′. Let h = h(λ, t₁), $\hat{Σ} = \int_{0}^{1} \ddot{ψ} (β^{*} + x h) d x$ , and g = g(λ, t₁) = t₁z̃ + ψ̇(β^*) − ψ̇(β^* + h) = t₁z̃ − Σ̂h. Since $h_{A_{1}^{c}} = 0$ , we have ${\hat{Σ}}_{A_{1}}^{- 1} g_{(A_{1})} = t_{1} {\hat{Σ}}_{A_{1}}^{- 1} {\tilde{z}}_{A_{1}} - {\hat{Σ}}_{A_{1}}^{- 1} {(\hat{Σ} h)}_{A_{1}} = t_{1} {\hat{Σ}}_{A_{1}}^{- 1} {\tilde{z}}_{A_{1}} - h_{A_{1}}$ . Thus, since g_j = t₁λ sgn(h_j) for j ∈ A₀ by the KKT conditions,

〈 g_{(A_{0})}, {\hat{Σ}}_{A_{1}}^{- 1} g_{(A_{1})} 〉 = t_{1} 〈 g_{(A_{0})}, {\hat{Σ}}_{A_{1}}^{- 1} {\tilde{z}}_{A_{1}} 〉 - 〈 g_{(A_{0})}, h_{A_{1}} 〉 \leq t_{1} 〈 g_{(A_{0})}, {\hat{Σ}}_{A_{1}}^{- 1} {\tilde{z}}_{A_{1}} 〉 .

Since ${∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} g_{(A_{0})} ∣}_{2}^{2} + {∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} g_{(A_{1})} ∣}_{2}^{2} = {∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} g_{(S)} ∣}_{2}^{2} + 2 〈 g_{(A_{0})}, {\hat{Σ}}_{A_{1}}^{- 1} g_{(A_{1})} 〉$ , we have

{∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} g_{(A_{0})} ∣}_{2}^{2} + {∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} g_{(A_{1})} ∣}_{2}^{2} \leq {∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} g_{(S)} ∣}_{2}^{2} + 2 t_{1} {∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} g_{(A_{0})} ∣}_{2} {∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} {\tilde{z}}_{A_{1}} ∣}_{2} .

By (54) and (46), ${∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} {\tilde{z}}_{A_{1}} ∣}_{2} \leq e^{η / 2} {∣ {(Σ_{A_{1}}^{*})}^{- 1 / 2} {\tilde{z}}_{A_{1}} ∣}_{2} \leq α λ \sqrt{∣ A_{0} ∣ / (c^{*} e^{η})}$ , so that

(1 - α) {∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} g_{(A_{0})} ∣}_{2}^{2} + {∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} g_{(A_{1})} ∣}_{2}^{2} \leq {∣ {\hat{Σ}}_{A_{1}}^{- 1 / 2} g_{(S)} ∣}_{2}^{2} + α t_{1}^{2} λ^{2} ∣ A_{0} ∣ / (c^{*} e^{η}) .

Moreover, since |A₁| ≤ d₁ + |S| ≤ d^*, it follows from (54), (46), and (47) that the eigenvalues of Σ̂_{A₁} all lie between c_*e^{−η} and c^*e^{η}. Thus, since g_{A₀} = t₁λ sgn(β̂_{A₀}),

\frac{(1 - α) t_{1}^{2} λ^{2} ∣ A_{0} ∣}{c^{*} e^{η}} + \frac{t_{1}^{2} λ^{2} ∣ A_{0} ∣ + {∣ g_{S} ∣}_{2}^{2}}{c^{*} e^{η}} \leq \frac{{∣ g_{S} ∣}_{2}^{2}}{c_{*} e^{- η}} + \frac{α t_{1}^{2} λ^{2} ∣ A_{0} ∣}{c^{*} e^{η}} .

Since |g|_∞ ≤ t₁λ, the above inequality gives by algebra the dimension bound

∣ A_{0} ∣ \leq (\frac{e^{2 η} c^{*} / c_{*} - 1}{2 - 2 α}) \frac{{∣ g_{S} ∣}_{2}^{2}}{t_{1}^{2} λ^{2}} \leq (\frac{e^{2 η} c^{*} / c_{*} - 1}{2 - 2 α}) ∣ S ∣ < d_{1} .
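
The algebra behind the last display can be spelled out as follows (a reader's sketch, not part of the original text). Multiplying the preceding inequality by c^*e^η gives

(2 - α) t_{1}^{2} λ^{2} ∣ A_{0} ∣ + {∣ g_{S} ∣}_{2}^{2} \leq (e^{2 η} c^{*} / c_{*}) {∣ g_{S} ∣}_{2}^{2} + α t_{1}^{2} λ^{2} ∣ A_{0} ∣ ,

so that, after moving the terms in |A₀| to the left and the terms in |g_S|₂² to the right,

(2 - 2 α) t_{1}^{2} λ^{2} ∣ A_{0} ∣ \leq (e^{2 η} c^{*} / c_{*} - 1) {∣ g_{S} ∣}_{2}^{2} \leq (e^{2 η} c^{*} / c_{*} - 1) ∣ S ∣ t_{1}^{2} λ^{2} ,

where the last step uses |g_S|₂² ≤ |S| |g|_∞² ≤ |S| t₁²λ².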

This proves (55) under the additional assumption (56).

Now we prove (56). In the special case of φ₀(b) = 0, the condition on λ in (54) is monotone so that we are allowed to pick λ^* = ∞. Since β̂(λ, 1) = 0 for very large λ, (56) holds automatically for φ₀(b) = 0. By (46), this special case is equivalent to linear regression since the Hessian does not depend on β. The difference of the general model (1) from linear regression is that the condition $λ + z_{0}^{*} \leq η e^{- η} F (ξ, S; φ_{0}, φ_{0})$ , which excludes large λ, is needed to prove φ₀(h(λ, t)) ≤ η by Theorem 4. To overcome this difficulty, we consider very small t > 0. Let b = (β − β^*)/t. By (51),

\begin{array}{l} t^{- 1} \{ \hat{β} (λ, t) - β^{*} \} = \underset{b}{arg min} \{ ψ (β^{*} + t b) - 〈 t b, \dot{ψ} (β^{*}) + t \tilde{z} 〉 + t λ {∣ β^{*} + t b ∣}_{1} \} \\ = \underset{b}{arg min} \{ \int_{0}^{1} (1 - x) 〈 t b, \ddot{ψ} (β^{*} + x t b) t b 〉 d x - t^{2} 〈 b, \tilde{z} 〉 + t λ {∣ β^{*} + t b ∣}_{1} \} \\ = \underset{b}{arg min} \{ \int_{0}^{1} (1 - x) 〈 b, \ddot{ψ} (β^{*} + x t b) b 〉 d x - 〈 b, \tilde{z} 〉 + λ {∣ β^{*} / t + b ∣}_{1} \} . \end{array}

Let $S_{0} = \{ j : β_{j}^{*} \neq 0 \}$ . Since $λ {∣ β^{*} / t + b ∣}_{1} - λ {∣ β^{*} ∣}_{1} / t \to λ 〈 sgn (β^{*}), b 〉 + λ {∣ b_{S_{0}^{c}} ∣}_{1}$ as t → 0+, t⁻¹{β̂(λ, t) − β^*} converges (along a subsequence if necessary) to

\hat{b} (λ) = \underset{b}{arg min} \{ 2^{- 1} 〈 b, \ddot{ψ} (β^{*}) b 〉 - 〈 b, \tilde{z} 〉 + λ 〈 sgn (β^{*}), b 〉 + λ {∣ b_{S_{0}^{c}} ∣}_{1} \} .

Moreover, since z̃ − ψ̈(β^*)b̂(λ) is the negative gradient at b̂(λ), we have

\{ j : ∣ g_{j} (λ, t) ∣ = t λ, j \notin S \} \to \{ j \notin S : {(\tilde{z} - \ddot{ψ} (β^{*}) \hat{b} (λ))}_{j} = λ sgn ({\hat{b}}_{j} (λ)) \} .

(58)

Since this limit does not depend on φ₀(·), the dimension bound (55) in the special case of linear regression implies that the right-hand side of (58) contains a smaller number of elements than d₁. This gives (56) in the general case by (58) and completes the proof.

Contributor Information

Jian Huang, Email: JIAN-HUANG@UIOWA.EDU, Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242, USA.

Cun-Hui Zhang, Email: CZHANG@STAT.RUTGERS.EDU, Department of Statistics and Biostatistics, Rutgers University, Piscataway, New Jersey 08854, USA.

References

Bickel PJ, Ritov Y, Tsybakov A. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics. 2009;37(4):1705–1732.
Bregman LM. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics. 1967;7:200–217.
Bunea F, Tsybakov A, Wegkamp MH. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics. 2007;1:169–194.
Candes EJ, Tao T. Decoding by linear programming. IEEE Trans on Information Theory. 2005;51:4203–4215.
Candes EJ, Tao T. The Dantzig selector: statistical estimation when p is much larger than n (with discussion). Annals of Statistics. 2007;35:2313–2404.
Chen S, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM J Sci Comput. 1998;20:33–61.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
Greenshtein E, Ritov Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
Huang J, Ma S, Zhang CH. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
Hunter DR, Li R. Variable selection using MM algorithms. Annals of Statistics. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
Koltchinskii V. The Dantzig selector and sparsity oracle inequalities. Bernoulli. 2009;15:799–828.
McCullagh P, Nelder JA. Generalized Linear Models. Chapman & Hall; 1989.
Meier L, Bühlmann P. Smoothing ℓ1-penalized estimators for high-dimensional time-course data. Electronic Journal of Statistics. 2007;1:597–615.
Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Annals of Statistics. 2006;34:1436–1462.
Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics. 2009;37:246–270.
Negahban S, Ravikumar P, Wainwright MJ, Yu B. Technical Report arXiv:1010.2731, arXiv. 2010. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizer.
Nielsen F, Nock R. On the centroids of symmetrized Bregman divergences. CoRR. 2007 abs/0711.3242.
Ravikumar P, Wainwright MJ, Raskutti G, Yu B. Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized mle. Advances in Neural Information Processing Systems (NIPS) 2008;21.
Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics. 2008;2:494–515.
Städler N, Bühlmann P, van de Geer S. ℓ1-penalization for mixture regression models (with discussion). Test. 2010;19(2):209–285.
Sun T, Zhang C-H. Technical Report arXiv:1104.4595, arXiv. 2011. Scaled sparse linear regression.
Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 1996;58:267–288.
Tibshirani R, Taylor J. The solution path of the generalized lasso. The Annals of Statistics. 2011;39:1335–1371.
Tropp JA. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory. 2006;52:1030–1051.
van de Geer S. Technical Report 140. ETH Zurich; Switzerland: 2007. The deterministic lasso.
van de Geer S. High-dimensional generalized linear models and the lasso. Annals of Statistics. 2008;36:614–645.
van de Geer S, Bühlmann P. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics. 2009;3:1360–1392.
Wainwright MJ. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory. 2009;55:2183–2202.
Ye F, Zhang CH. Rate minimaxity of the lasso and Dantzig selector for the ℓq loss in ℓr balls. Journal of Machine Learning Research. 2010;11:3481–3502.
Zhang C-H. Least squares estimation and variable selection under minimax concave penalty. Mathematisches Forschungsinstitut Oberwolfach: Sparse Recovery Problems in High Dimensions. 2009;3.
Zhang CH. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010a;38:894–942.
Zhang CH, Huang J. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Annals of Statistics. 2008;36(4):1567–1594.
Zhang T. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research. 2010b;11:1087–1107.
Zhang T. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory. 2011a;57:4689–4708.
Zhang T. Technical Report arXiv:1106.0565, arXiv. 2011b. Multi-stage convex relaxation for feature selection.
Zhao P, Yu B. On model selection consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2567.
Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics. 2008;36(4):1509–1533. doi: 10.1214/009053607000000802.
