Author manuscript; available in PMC: 2015 Jan 8.
Published in final edited form as: Ann Stat. 2014 Feb 1;42(1):324–351. doi: 10.1214/13-AOS1191

ADAPTIVE ROBUST VARIABLE SELECTION

Jianqing Fan, Yingying Fan, and Emre Barut
PMCID: PMC4286898  NIHMSID: NIHMS649191  PMID: 25580039

Abstract

Heavy-tailed high-dimensional data are commonly encountered in various scientific fields and pose great challenges to modern statistical analysis. A natural procedure to address this problem is to use penalized quantile regression with weighted L1-penalty, called weighted robust Lasso (WR-Lasso), in which weights are introduced to ameliorate the bias problem induced by the L1-penalty. In the ultra-high dimensional setting, where the dimensionality can grow exponentially with the sample size, we investigate the model selection oracle property and establish the asymptotic normality of the WR-Lasso. We show that only mild conditions on the model error distribution are needed. Our theoretical results also reveal that adaptive choice of the weight vector is essential for the WR-Lasso to enjoy these nice asymptotic properties. To make the WR-Lasso practically feasible, we propose a two-step procedure, called adaptive robust Lasso (AR-Lasso), in which the weight vector in the second step is constructed based on the L1-penalized quantile regression estimate from the first step. This two-step procedure is justified theoretically to possess the oracle property and the asymptotic normality. Numerical studies demonstrate the favorable finite-sample performance of the AR-Lasso.

Keywords and phrases: Adaptive weighted L1, High dimensions, Oracle properties, Robust regularization

1. Introduction

The advent of modern technology makes it easier to collect massive, large-scale data sets. A common feature of these data sets is that the number of covariates greatly exceeds the number of observations, a regime opposite to conventional statistical settings. For example, portfolio allocation with hundreds of stocks in finance involves a covariance matrix with tens of thousands of parameters, but the sample sizes are often only on the order of hundreds (e.g., daily data over a one-year period (Fan et al., 2008)). Genome-wide association studies in biology involve hundreds of thousands of single-nucleotide polymorphisms (SNPs), but the available sample size is usually also in the hundreds. Data sets with a large number of variables but a relatively small sample size pose unprecedented challenges, and opportunities, for statistical analysis.

Regularization methods have been widely used for high-dimensional variable selection (Bickel and Li, 2006; Bickel et al., 2009; Efron et al., 2007; Fan and Li, 2001; Lv and Fan, 2009; Tibshirani, 1996; Zhang, 2010; Zou, 2006). Yet, most existing methods such as penalized least-squares or penalized likelihood (Fan and Lv, 2011) are designed for light-tailed distributions. Zhao and Yu (2006) established the irrepresentability conditions for the model selection consistency of the Lasso estimator. Fan and Li (2001) studied the oracle properties of nonconcave penalized likelihood estimators for fixed dimensionality. Lv and Fan (2009) investigated the penalized least-squares estimator with folded-concave penalty functions in the ultra-high dimensional setting and established a nonasymptotic weak oracle property. Fan and Lv (2008) proposed and investigated the sure independence screening method in the setting of light-tailed distributions. The robustness of the aforementioned methods has not yet been thoroughly studied and well understood.

Robust regularization methods such as the least absolute deviation (LAD) regression and quantile regression have been used for variable selection in the case of fixed dimensionality. See, for example, Li and Zhu (2008); Wang, Li and Jiang (2007); Wu and Liu (2009); Zou and Yuan (2008). The penalized composite quasi-likelihood method was proposed in Bradic et al. (2011) for robust estimation in ultra-high dimensions, with a focus on the efficiency of the method; they still assumed sub-Gaussian tails. Belloni and Chernozhukov (2011) studied the L1-penalized quantile regression in high-dimensional sparse models where the dimensionality can be larger than the sample size. We refer to their method as the robust Lasso (R-Lasso). They showed that the R-Lasso estimate is consistent at the near-oracle rate, gave conditions under which the selected model includes the true model, and derived bounds on the size of the selected model, uniformly over a compact set of quantile indices. Wang (2012) studied the L1-penalized LAD regression and showed that the estimate achieves near-oracle risk performance with a nearly universal penalty parameter, and also established a sure screening property for this estimator. van de Geer and Müller (2012) obtained bounds on the prediction error of a large class of L1-penalized estimators, including quantile regression. Wang et al. (2012) considered nonconvex penalized quantile regression in the ultra-high dimensional setting and showed that the oracle estimate belongs to the set of local minima of the nonconvex penalized quantile regression, under mild assumptions on the error distribution.

In this paper, we introduce the penalized quantile regression with the weighted L1-penalty (WR-Lasso) for robust regularization, as in Bradic et al. (2011). The weights are introduced to reduce the bias problem induced by the L1-penalty. The flexibility in the choice of weights provides flexibility in the shrinkage estimation of the regression coefficients. WR-Lasso shares a similar spirit with the folded-concave penalized quantile regression (Wang et al., 2012; Zou and Li, 2008), but avoids the nonconvex optimization problem. We establish conditions on the error distribution under which the WR-Lasso successfully recovers the true underlying sparse model with asymptotic probability one. It turns out that the required condition is much weaker than the sub-Gaussian assumption in Bradic et al. (2011). The only condition we impose is that the density function of the error is Lipschitz in a neighborhood of 0. This includes a large class of heavy-tailed distributions such as the stable distributions, including the Cauchy distribution. It also covers the double-exponential distribution, whose density function is nondifferentiable at the origin.

Unfortunately, because of its penalized nature, the WR-Lasso estimate has a bias. To reduce the bias, the weights in WR-Lasso need to be chosen adaptively according to the magnitudes of the unknown true regression coefficients, which makes such a choice infeasible in practice.

To make the bias reduction feasible, we introduce the adaptive robust Lasso (AR-Lasso). The AR-Lasso first runs R-Lasso to obtain an initial estimate, and then computes the weight vector of the weighted L1-penalty according to a decreasing function of the magnitude of the initial estimate. After that, AR-Lasso runs WR-Lasso with the computed weights. We formally establish the model selection oracle property of AR-Lasso in the context of Fan and Li (2001) with no assumptions made on the tail distribution of the model error. In particular, the asymptotic normality of the AR-Lasso is formally established.

This paper is organized as follows. First, we introduce our robust estimators in Section 2. Then, to demonstrate the advantages of our estimator, we show in Section 3 with a simple example that Lasso behaves sub-optimally when noise has heavy tails. In Section 4.1, we study the performance of the oracle-assisted regularization estimator. Then in Section 4.2, we show that when the weights are adaptively chosen, WR-Lasso has the model selection oracle property, and performs as well as the oracle-assisted regularization estimate. In Section 4.3, we prove the asymptotic normality of our proposed estimator. The feasible estimator, AR-Lasso, is investigated in Section 5. Section 6 presents the results of the simulation studies. Finally, in Section 7 we present the proofs of the main theorems. Additional proofs, as well as the results of a genome-wide association study, are provided in the supplementary Appendix (Fan et al., 2013).

2. Adaptive Robust Lasso

Consider the linear regression model

y=Xβ+ε, (2.1)

where y is an n-dimensional response vector, $X = (x_1, \ldots, x_n)^T = (\tilde{x}_1, \ldots, \tilde{x}_p)$ is an n × p fixed design matrix, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a p-dimensional regression coefficient vector, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ is an n-dimensional error vector whose components are independently distributed and satisfy $P(\varepsilon_i \le 0) = \tau$ for some known constant τ ∈ (0, 1). Under this model, $x_i^T\beta$ is the conditional τth quantile of $y_i$ given $x_i$. We impose no conditions on the heaviness of the tail probability or the homoscedasticity of $\varepsilon_i$. We consider a challenging setting in which $\log p = o(n^b)$ for some constant b > 0. To ensure model identifiability and to enhance model fitting accuracy and interpretability, the true regression coefficient vector $\beta^*$ is commonly assumed to be sparse with only a small proportion of nonzeros (Fan and Li, 2001; Tibshirani, 1996). Denoting the number of nonzero elements of the true regression coefficients by $s_n$, we allow $s_n$ to diverge slowly with the sample size n and assume that $s_n = o(n)$. To ease the presentation, we suppress the dependence of $s_n$ on n whenever there is no confusion. Without loss of generality, we write $\beta^* = (\beta_1^{*T}, 0^T)^T$, i.e., only the first s entries are non-vanishing. The true model is denoted by

$$\mathcal{M}_* = \mathrm{supp}(\beta^*) = \{1, \ldots, s\},$$

and its complement, $\mathcal{M}_*^c = \{s+1, \ldots, p\}$, represents the set of noise variables.

We consider a fixed design matrix in this paper and denote by $S = (S_1, \ldots, S_n)^T = (\tilde{x}_1, \ldots, \tilde{x}_s)$ the submatrix of X corresponding to the covariates whose coefficients are non-vanishing. These variables will be referred to as the signal covariates, and the rest will be called noise covariates. The submatrix formed by the columns corresponding to the noise covariates is denoted by $Q = (Q_1, \ldots, Q_n)^T = (\tilde{x}_{s+1}, \ldots, \tilde{x}_p)$. We standardize each column of X to have $L_2$-norm $\sqrt{n}$.

To recover the true model and estimate β*, we consider the following regularization problem

$$\min_{\beta\in\mathbb{R}^p}\Big\{\sum_{i=1}^n \rho_\tau(y_i - x_i^T\beta) + n\lambda_n\sum_{j=1}^p p_{\lambda_n}(|\beta_j|)\Big\}, \quad (2.2)$$

where $\rho_\tau(u) = u(\tau - 1\{u \le 0\})$ is the quantile loss function, and $p_{\lambda_n}(\cdot)$ is a nonnegative penalty function on [0, ∞) with a regularization parameter $\lambda_n \ge 0$. The quantile loss function in (2.2) is used to overcome the difficulty caused by the heavy tails of the error distribution. Since $P(\varepsilon_i \le 0) = \tau$, (2.2) can be interpreted as sparse estimation of the conditional τth quantile. Regarding the choice of $p_{\lambda_n}(\cdot)$, it was demonstrated in Lv and Fan (2009) and Fan and Lv (2011) that folded-concave penalties are more advantageous for variable selection in high dimensions than convex ones such as the $L_1$-penalty. It is, however, computationally more challenging to minimize the objective function in (2.2) when $p_{\lambda_n}(\cdot)$ is folded-concave. Noting that with a good initial estimate $\hat\beta^{ini} = (\hat\beta_1^{ini}, \ldots, \hat\beta_p^{ini})^T$ of the true coefficient vector, we have

$$p_{\lambda_n}(|\beta_j|) \approx p_{\lambda_n}(|\hat\beta_j^{ini}|) + p'_{\lambda_n}(|\hat\beta_j^{ini}|)\big(|\beta_j| - |\hat\beta_j^{ini}|\big).$$

Thus, instead of (2.2) we consider the following weighted L1-regularized quantile regression

$$L_n(\beta) = \sum_{i=1}^n \rho_\tau(y_i - x_i^T\beta) + n\lambda_n\|d\circ\beta\|_1, \quad (2.3)$$

where $d = (d_1, \ldots, d_p)^T$ is a vector of non-negative weights, and ∘ is the Hadamard product, i.e., the componentwise product of two vectors. This motivates us to define the weighted robust Lasso (WR-Lasso) estimate as the global minimizer of the convex function $L_n(\beta)$ for a given non-stochastic weight vector:

$$\hat\beta = \arg\min_{\beta} L_n(\beta). \quad (2.4)$$

The uniqueness of the global minimizer is easily guaranteed by adding a negligible L2-regularization in implementation. In particular, when dj = 1 for all j, the method will be referred to as robust Lasso (R-Lasso).

The adaptive robust Lasso (AR-Lasso) refers specifically to the two-stage procedure in which the stochastic weights $\hat{d}_j = p'_{\lambda_n}(|\hat\beta_j^{ini}|)$ for j = 1, ···, p are used in the second step for WR-Lasso and are constructed from a folded-concave penalty $p_{\lambda_n}(\cdot)$ and the initial estimates $\hat\beta_j^{ini}$ from the first step. In practice, we recommend using R-Lasso as the initial estimate and then using SCAD to compute the weights in AR-Lasso. The asymptotic result of this specific AR-Lasso is summarized in Corollary 1 in Section 5 for the ultra-high dimensional robust regression problem. This is a main contribution of the paper.
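To fix ideas, the two-step procedure can be sketched computationally as follows. This is only a minimal illustration, not the authors' implementation: it assumes the cvxpy package is available for the convex program in (2.3)-(2.4), uses SCAD with the conventional choice a = 3.7 and the same λ in both steps, and the function names (wr_lasso, scad_derivative, ar_lasso) are ours.

```python
# Minimal sketch of WR-Lasso / AR-Lasso; illustrative only.
import numpy as np
import cvxpy as cp

def wr_lasso(X, y, lam, d, tau=0.5):
    """Weighted robust Lasso (2.3)-(2.4): quantile loss + weighted L1 penalty."""
    n, p = X.shape
    beta = cp.Variable(p)
    r = y - X @ beta
    # rho_tau(u) = u*(tau - 1{u <= 0}) = max(tau*u, (tau-1)*u)
    quantile_loss = cp.sum(cp.maximum(tau * r, (tau - 1) * r))
    penalty = n * lam * cp.norm1(cp.multiply(d, beta))
    cp.Problem(cp.Minimize(quantile_loss + penalty)).solve()
    return beta.value

def scad_derivative(t, lam, a=3.7):
    """Normalized SCAD derivative p'_lambda as in (5.2), evaluated at |t|."""
    t = np.abs(t)
    return np.where(t <= lam, 1.0,
                    np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))

def ar_lasso(X, y, lam, tau=0.5):
    """Two-step AR-Lasso: R-Lasso initial fit, then SCAD-based adaptive weights."""
    p = X.shape[1]
    beta_ini = wr_lasso(X, y, lam, np.ones(p), tau)   # step 1: R-Lasso (d = 1)
    d_hat = scad_derivative(beta_ini, lam)            # step 2: adaptive weights
    return wr_lasso(X, y, lam, d_hat, tau)
```

Because the weights from the SCAD derivative are close to zero for coefficients whose initial estimates are large, the second step penalizes the apparent signal covariates only lightly, which is exactly the bias-reduction mechanism discussed above.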

3. Suboptimality of Lasso

In this section, we use a specific example to illustrate that, in the case of a heavy-tailed error distribution, Lasso fails at model selection unless the nonzero coefficients $\beta_1^*, \ldots, \beta_s^*$ have a very large magnitude. We assume that the errors ε1, ···, εn have the same symmetric stable distribution, with the characteristic function of ε1 given by

$$E[\exp(iu\varepsilon_1)] = \exp(-|u|^\alpha),$$

where α ∈ (0, 2). By Nolan (2012), $E|\varepsilon_1|^p$ is finite for 0 < p < α, and $E|\varepsilon_1|^p = \infty$ for p ≥ α. Furthermore, as z → ∞,

$$P(\varepsilon_1 \ge z) \sim c_\alpha z^{-\alpha},$$

where $c_\alpha = \sin\big(\tfrac{\pi\alpha}{2}\big)\Gamma(\alpha)/\pi$ is a constant depending only on α, and we use the notation ~ to denote that two terms are equivalent up to some constant. Moreover, for any constant vector $a = (a_1, \ldots, a_n)^T$, the linear combination $a^T\varepsilon$ has the following tail behavior:

$$P(a^T\varepsilon > z) \sim \|a\|_\alpha^\alpha\, c_\alpha z^{-\alpha}, \quad (3.1)$$

with ||·||α denoting the Lα-norm of a vector.

To demonstrate the suboptimality of Lasso, we consider a simple case in which the design matrix satisfies $S^TQ = 0$ and $\frac1n S^TS = I_s$, and the columns of Q satisfy $|\mathrm{supp}(\tilde{x}_j)| = m_n = O(n^{1/2})$ and $\mathrm{supp}(\tilde{x}_k)\cap\mathrm{supp}(\tilde{x}_j) = \emptyset$ for any $k \ne j$ with $k, j \in \{s+1, \ldots, p\}$. Here, $m_n$ is a positive integer measuring the sparsity level of the columns of Q. We assume that there is only a fixed number of true variables, i.e., s is finite, and that $\max_{ij}|x_{ij}| = O(n^{1/4})$. Thus, it is easy to see that $p = O(n^{1/2})$. In addition, we assume further that all nonzero regression coefficients are equal: $\beta_1^* = \cdots = \beta_s^* = \beta_0 > 0$.

We first consider R-Lasso, which is the global minimizer of (2.4) with d the vector of ones. We will later see in Theorem 2 that by choosing the tuning parameter

$$\lambda_n = O\big((\log n)^2\sqrt{(\log p)/n}\,\big),$$

R-Lasso can recover the true support $\mathcal{M}_* = \{1, \ldots, s\}$ with probability tending to 1. Moreover, the signs of the true regression coefficients can also be recovered with asymptotic probability one, as long as the following condition on the signal strength is satisfied:

$$\lambda_n^{-1}\beta_0 \to \infty, \quad\text{i.e.,}\quad (\log n)^{-2}\sqrt{n/(\log p)}\;\beta_0 \to \infty. \quad (3.2)$$

Now, consider Lasso, which minimizes

$$L_n(\beta) = \tfrac12\|y - X\beta\|_2^2 + n\lambda_n\|\beta\|_1. \quad (3.3)$$

We will see that for (3.3) to recover the true model and the correct signs of the coefficients, we need a much stronger signal level than that given in (3.2). By results in optimization theory, the Karush–Kuhn–Tucker (KKT) conditions, which are necessary and sufficient for β̃ with $\tilde{\mathcal{M}} = \mathrm{supp}(\tilde\beta)$ to be a minimizer of (3.3), are

$$\tilde\beta_{\tilde{\mathcal{M}}} + n\lambda_n\big(X_{\tilde{\mathcal{M}}}^T X_{\tilde{\mathcal{M}}}\big)^{-1}\mathrm{sgn}(\tilde\beta_{\tilde{\mathcal{M}}}) = \big(X_{\tilde{\mathcal{M}}}^T X_{\tilde{\mathcal{M}}}\big)^{-1}X_{\tilde{\mathcal{M}}}^T y, \qquad \big\|X_{\tilde{\mathcal{M}}^c}^T\big(y - X_{\tilde{\mathcal{M}}}\tilde\beta_{\tilde{\mathcal{M}}}\big)\big\|_\infty \le n\lambda_n,$$

where $\tilde{\mathcal{M}}^c$ is the complement of $\tilde{\mathcal{M}}$, $\tilde\beta_{\tilde{\mathcal{M}}}$ is the subvector formed by the entries of β̃ with indices in $\tilde{\mathcal{M}}$, and $X_{\tilde{\mathcal{M}}}$ and $X_{\tilde{\mathcal{M}}^c}$ are the submatrices formed by the columns of X with indices in $\tilde{\mathcal{M}}$ and $\tilde{\mathcal{M}}^c$, respectively. It is easy to see from these two conditions that for Lasso to enjoy sign consistency, $\mathrm{sgn}(\tilde\beta) = \mathrm{sgn}(\beta^*)$ with asymptotic probability one, we must have the two conditions satisfied with $\tilde{\mathcal{M}} = \mathcal{M}_*$ with probability tending to 1. Since we have assumed that $Q^TS = 0$ and $n^{-1}S^TS = I_s$, the above sufficient and necessary conditions can also be written as

$$\tilde\beta_{\mathcal{M}_*} + \lambda_n\,\mathrm{sgn}(\tilde\beta_{\mathcal{M}_*}) = \beta^*_{\mathcal{M}_*} + n^{-1}S^T\varepsilon, \quad (3.4)$$
$$\|Q^T\varepsilon\|_\infty \le n\lambda_n. \quad (3.5)$$

It is hard for Lasso to satisfy conditions (3.4) and (3.5) simultaneously. The following proposition summarizes the necessary condition, whose proof is given in the supplementary material (Fan et al., 2013).

Proposition 1

In the above model, with probability at least $1 - \epsilon_0$, where $\epsilon_0$ is some positive constant, Lasso does not have sign consistency unless the following signal condition is satisfied:

$$n^{3/4 - 1/\alpha}\,\beta_0 \to \infty. \quad (3.6)$$

Comparing this with (3.2), it is easy to see that even in this simple case Lasso needs a much stronger signal level than R-Lasso in order to achieve sign consistency in the presence of a heavy-tailed distribution.
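To make the gap concrete, one can specialize the two signal conditions to Cauchy errors (α = 1); the following worked instance reads (3.2) and (3.6) as displayed above:

```latex
% Worked instance of (3.2) and (3.6) for Cauchy errors (alpha = 1)
\begin{align*}
\text{R-Lasso, (3.2):}\quad & (\log n)^{-2}\sqrt{n/(\log p)}\;\beta_0 \to \infty
 \;\Longleftrightarrow\; \beta_0 \gg (\log n)^{2}\sqrt{(\log p)/n} \to 0,\\
\text{Lasso, (3.6):}\quad & n^{3/4-1/\alpha}\,\beta_0 = n^{-1/4}\,\beta_0 \to \infty
 \;\Longleftrightarrow\; \beta_0 \gg n^{1/4} \to \infty .
\end{align*}
```

So R-Lasso tolerates a minimal signal that vanishes at nearly the root-n rate, whereas Lasso requires the minimal signal to diverge with n.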

4. Model Selection Oracle Property

In this section, we establish the model selection oracle property of WR-Lasso. The study enables us to see the bias due to penalization, and that an adaptive weighting scheme is needed in order to eliminate such a bias. We need the following condition on the distribution of noise.

Condition 1

There exist universal constants $c_1 > 0$ and $c_2 > 0$ such that for any u satisfying $|u| \le c_1$, the $f_i(u)$'s are uniformly bounded away from 0 and ∞, and

$$\big|F_i(u) - F_i(0) - u f_i(0)\big| \le c_2 u^2,$$

where $f_i(u)$ and $F_i(u)$ are the density function and distribution function of the error $\varepsilon_i$, respectively.

Condition 1 basically amounts to each $f_i(u)$ having the Lipschitz property in a neighborhood of the origin. Commonly used distributions such as the double-exponential distribution and the stable distributions, including the Cauchy distribution, all satisfy this condition.
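As a quick worked check of Condition 1 (not in the original text), consider the standard double-exponential density $f(u) = \tfrac12 e^{-|u|}$:

```latex
% Condition 1 for the double-exponential (Laplace) distribution
\begin{align*}
&f(u)=\tfrac12 e^{-|u|},\qquad f(0)=\tfrac12,\qquad
 \tfrac12 e^{-c_1}\le f(u)\le \tfrac12 \ \text{ for } |u|\le c_1,\\
&\text{for } u\ge 0:\quad F(u)-F(0)-u f(0)
 =\tfrac12\bigl(1-e^{-u}-u\bigr),\qquad
 \bigl|1-e^{-u}-u\bigr|\le \tfrac{u^2}{2},\\
&\text{so } \bigl|F(u)-F(0)-uf(0)\bigr|\le \tfrac{u^2}{4}
 \quad(\text{the case } u<0 \text{ is symmetric}),\ \text{i.e. } c_2=\tfrac14 .
\end{align*}
```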

Denote by H = diag{f1(0), ···, fn(0)}. The next condition is on the sub-matrix of X that corresponds to signal covariates and the magnitude of the entries of X.

Condition 2

The eigenvalues of $\frac1n S^T H S$ are bounded from below and above by some positive constants $c_0$ and $1/c_0$, respectively. Furthermore,

$$\kappa_n \equiv \max_{i,j}|x_{ij}| = o\big(\sqrt{n}\,s^{-1}\big).$$

Although Condition 2 is on the fixed design matrix, we note that the above condition on $\kappa_n$ is satisfied with asymptotic probability one when the design matrix is generated from certain distributions. For instance, if the entries of X are independent copies from a sub-exponential distribution, the bound on $\kappa_n$ is satisfied with asymptotic probability one as long as $s = o(\sqrt{n}/\log p)$; if the components are generated from a sub-Gaussian distribution, then the condition on $\kappa_n$ is satisfied with probability tending to one when $s = o\big(\sqrt{n/\log p}\big)$.
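Heuristically, the claimed sufficient bounds on s follow from the usual maximal inequalities for the entries of X (a sketch under the assumption that $\log(np)\asymp\log p$):

```latex
% Why the stated bounds on s imply kappa_n = o(sqrt(n)/s), heuristically
\begin{align*}
\text{sub-Gaussian entries:}\quad &\kappa_n = O_P\!\bigl(\sqrt{\log (np)}\bigr),
 \ \text{so } \kappa_n = o\!\bigl(\sqrt{n}\,s^{-1}\bigr)\ \text{when } s=o\!\bigl(\sqrt{n/\log p}\bigr);\\
\text{sub-exponential entries:}\quad &\kappa_n = O_P\!\bigl(\log (np)\bigr),
 \ \text{so } \kappa_n = o\!\bigl(\sqrt{n}\,s^{-1}\bigr)\ \text{when } s=o\!\bigl(\sqrt{n}/\log p\bigr).
\end{align*}
```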

4.1. Oracle Regularized Estimator

To evaluate our newly proposed method, we first study how well one can do with the assistance of the oracle information on the locations of the signal covariates. Then, we use this to establish the asymptotic property of our estimator without the oracle assistance. Denote by $\hat\beta^o = ((\hat\beta_1^o)^T, 0^T)^T$ the oracle regularized estimator (ORE), with $\hat\beta_1^o\in\mathbb{R}^s$ and 0 the vector of all zeros, which minimizes $L_n(\beta)$ over the subspace $\{\beta = (\beta_1^T, \beta_2^T)^T\in\mathbb{R}^p : \beta_2 = 0\in\mathbb{R}^{p-s}\}$. The next theorem shows that ORE is consistent and estimates the correct sign of the true coefficient vector with probability tending to one. We use $d_0$ to denote the subvector consisting of the first s elements of d.

Theorem 1

Let $\gamma_n = C_1\big(\sqrt{s(\log n)/n} + \lambda_n\|d_0\|_2\big)$ with $C_1 > 0$ a constant. If Conditions 1 and 2 hold and $\lambda_n\|d_0\|_2\sqrt{s}\,\kappa_n \to 0$, then there exists some constant c > 0 such that

$$P\big(\|\hat\beta_1^o - \beta_1^*\|_2 \le \gamma_n\big) \ge 1 - n^{-cs}. \quad (4.1)$$

If in addition $\gamma_n^{-1}\min_{1\le j\le s}|\beta_j^*| \to \infty$, then with probability at least $1 - n^{-cs}$,

$$\mathrm{sgn}(\hat\beta_1^o) = \mathrm{sgn}(\beta_1^*),$$

where the above equation should be understood componentwise.

As shown in Theorem 1, the consistency rate of $\hat\beta_1^o$ in terms of the vector $L_2$-norm is given by $\gamma_n$. The first component of $\gamma_n$, $C_1\sqrt{s(\log n)/n}$, is the oracle rate within a factor of log n, and the second component, $C_1\lambda_n\|d_0\|_2$, reflects the bias due to penalization. If no prior information is available, one may choose equal weights $d_0 = (1, 1, \ldots, 1)^T$, which corresponds to R-Lasso. Thus for R-Lasso, with probability at least $1 - n^{-cs}$, it holds that

$$\|\hat\beta_1^o - \beta_1^*\|_2 \le C_1\big(\sqrt{s(\log n)/n} + \sqrt{s}\,\lambda_n\big). \quad (4.2)$$

4.2. WR-Lasso

In this section, we show that even without the oracle information, WR-Lasso enjoys the same asymptotic property as in Theorem 1 when the weight vector is appropriately chosen. Since the regularized estimator β̂ in (2.4) depends on the full design matrix X, we need to impose the following conditions on the design matrix to control the correlation of columns in Q and S.

Condition 3

With γn defined in Theorem 1, it holds that

$$\Big\|\frac1n Q^T H S\Big\|_{2,\infty} < \frac{\lambda_n}{2\,\|d_1^{-1}\|_\infty\,\gamma_n},$$

where $\|A\|_{2,\infty} = \sup_{x\ne 0}\|Ax\|_\infty/\|x\|_2$ for a matrix A and vector x, and $d_1^{-1} = (d_{s+1}^{-1}, \ldots, d_p^{-1})^T$. Furthermore, $\log p = o(n^b)$ for some constant b ∈ (0, 1).

To understand the implications of Condition 3, we consider the case $f_1(0) = \cdots = f_n(0) \equiv f(0)$. In the special case $Q^TS = 0$, Condition 3 is satisfied automatically. In the case of equal correlation, that is, when $n^{-1}X^TX$ has all off-diagonal elements equal to ρ, Condition 3 reduces to

$$\rho < \frac{\lambda_n}{4 f(0)\,\|d_1^{-1}\|_\infty\,\sqrt{s}\,\gamma_n}.$$

This puts an upper bound on the correlation coefficient ρ for such a dense matrix.

It is well known that for Gaussian errors, the optimal choice of regularization parameter $\lambda_n$ is of order $\sqrt{(\log p)/n}$ (Bickel et al., 2009). Model noise with heavy tails demands a larger choice of $\lambda_n$ to filter the noise for R-Lasso. When $\lambda_n \gg \sqrt{(\log n)/n}$, the $\gamma_n$ given in (4.2) is of order $C_1\lambda_n\sqrt{s}$. In this case, Condition 3 reduces to

$$\big\|n^{-1}Q^T H S\big\|_{2,\infty} < O\big(s^{-1/2}\big). \quad (4.3)$$

For WR-Lasso, if the weights are chosen such that $\|d_0\|_2 = O\big(\sqrt{s(\log n)/n}/\lambda_n\big)$ and $\|d_1^{-1}\|_\infty = O(1)$, then $\gamma_n$ is of order $C_1\sqrt{s(\log n)/n}$, and correspondingly, Condition 3 becomes

$$\big\|n^{-1}Q^T H S\big\|_{2,\infty} < O\big(\lambda_n\sqrt{n/(s\log n)}\big).$$

This is a more relaxed condition than (4.3), since with heavy-tailed errors the optimal $\lambda_n$ should be larger than $\sqrt{(\log p)/n}$. In other words, WR-Lasso not only reduces the bias of the estimate, but also allows for stronger correlations between the signal and noise covariates. However, the above choice of weights depends on the unknown locations of the signals. A data-driven choice will be given in Section 5, in which the resulting AR-Lasso estimator will be studied.
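For completeness, the two displayed reductions follow by plugging the corresponding $\gamma_n$ and weights into Condition 3 as written above:

```latex
% Plugging gamma_n and the weights into Condition 3
\begin{align*}
\text{R-Lasso } (d_j\equiv 1,\ \lambda_n\gg\sqrt{(\log n)/n}):\quad
 & \gamma_n\asymp C_1\sqrt{s}\,\lambda_n
 \;\Longrightarrow\;
 \frac{\lambda_n}{2\|d_1^{-1}\|_\infty\gamma_n}
 \asymp \frac{\lambda_n}{2C_1\sqrt{s}\,\lambda_n}=O(s^{-1/2});\\
\text{WR-Lasso } (\|d_1^{-1}\|_\infty=O(1)):\quad
 & \gamma_n\asymp C_1\sqrt{s(\log n)/n}
 \;\Longrightarrow\;
 \frac{\lambda_n}{2\|d_1^{-1}\|_\infty\gamma_n}
 =O\!\Bigl(\lambda_n\sqrt{n/(s\log n)}\Bigr).
\end{align*}
```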

The following theorem shows the model selection oracle property of the WR-Lasso estimator.

Theorem 2

Suppose Conditions 1–3 hold. In addition, assume that $\min_{j\ge s+1} d_j > c_3$ for some constant $c_3 > 0$,

$$\gamma_n s^{3/2}\kappa_n^2(\log_2 n)^2 = o(n\lambda_n^2), \qquad \lambda_n\|d_0\|_2\,\kappa_n\max\{\sqrt{s},\ \|d_0\|_2\}\to 0, \quad (4.4)$$

and $\lambda_n > 2\sqrt{(1+c)(\log p)/n}$, where $\kappa_n$ is defined in Condition 2, $\gamma_n$ is defined in Theorem 1, and c is some positive constant. Then, with probability at least $1 - O(n^{-cs})$, there exists a global minimizer $\hat\beta = ((\hat\beta_1^o)^T, \hat\beta_2^T)^T$ of $L_n(\beta)$ which satisfies

  1. $\hat\beta_2 = 0$;

  2. $\|\hat\beta_1^o - \beta_1^*\|_2 \le \gamma_n$.

Theorem 2 shows that the WR-Lasso estimator enjoys the same property as ORE with probability tending to one. However, we impose non-adaptive assumptions on the weight vector $d = (d_0^T, d_1^T)^T$. For the noise covariates, we assume $\min_{j>s} d_j > c_3$, which implies that each of these coordinates needs to be penalized. For the signal covariates, we impose (4.4), which requires $\|d_0\|_2$ to be small.

When studying the nonconvex penalized quantile regression, Wang et al. (2012) assumed that κn is bounded and the density functions of εi’s are uniformly bounded away from 0 and ∞ in a small neighborhood of 0. Their assumption on the error distribution is weaker than our Condition 1. We remark that the difference is because we have weaker conditions on κn and the penalty function (See Condition 2 and (4.4)). In fact, our Condition 1 can be weakened to the same condition as that in Wang et al. (2012) at the cost of imposing stronger assumptions on κn and the weight vector d.

Belloni and Chernozhukov (2011) and Wang (2012) imposed the restricted eigenvalue assumption of the design matrix and studied the L1-penalized quantile regression and LAD regression, respectively. We impose different conditions on the design matrix and allow flexible shrinkage by choosing d. In addition, our Theorem 2 provides a stronger result than consistency; we establish model selection oracle property of the estimator.

4.3. Asymptotic Normality

We now present the asymptotic normality of our estimator. Define Vn = (STHS)−1/2 and Zn = (Zn1, ⋯, Znn)T = SVn with ZnjRs for j = 1, ⋯, n.

Theorem 3

Assume the conditions of Theorem 2 hold, that the first and second order derivatives $f_i'(u)$ and $f_i''(u)$ are uniformly bounded in a small neighborhood of 0 for all i = 1, ⋯, n, and that $\|d_0\|_2 = O\big(\sqrt{s/n}/\lambda_n\big)$, $\max_i\|H^{1/2}Z_{ni}\|_2 = o\big(s^{-7/2}(\log s)^{-1}\big)$, and $\sqrt{n/s}\,\min_{1\le j\le s}|\beta_j^*|\to\infty$. Then, with probability tending to 1, there exists a global minimizer $\hat\beta = ((\hat\beta_1^o)^T, \hat\beta_2^T)^T$ of $L_n(\beta)$ such that $\hat\beta_2 = 0$. Moreover,

$$c^T\big(Z_n^T Z_n\big)^{-1/2}V_n^{-1}\Big[\big(\hat\beta_1^o - \beta_1^*\big) + \frac{n\lambda_n}{2}V_n^2\tilde{d}_0\Big] \xrightarrow{\;\mathcal{D}\;} N\big(0,\ \tau(1-\tau)\big),$$

where c is an arbitrary s-dimensional vector satisfying $c^Tc = 1$, and $\tilde{d}_0$ is an s-dimensional vector whose jth element is $d_j\,\mathrm{sgn}(\beta_j^*)$.

The proof of Theorem 3 is an extension of the proof of the asymptotic normality of the LAD estimator in Pollard (1990), where the result is proved for fixed dimensionality. The idea is to approximate $L_n(\beta_1, 0)$ in (2.4) by a sequence of quadratic functions whose minimizers converge to a normal distribution. Since $L_n(\beta_1, 0)$ and its quadratic approximation are close, their minimizers are also close, which yields the asymptotic normality in Theorem 3.

Theorem 3 assumes that $\max_i\|H^{1/2}Z_{ni}\|_2 = o\big(s^{-7/2}(\log s)^{-1}\big)$. Since by definition $\sum_{i=1}^n\|H^{1/2}Z_{ni}\|_2^2 = s$, this condition implies $s = o(n^{1/8})$. The assumption is made to guarantee that the quadratic approximation is close enough to $L_n(\beta_1, 0)$. When s is finite, the condition becomes $\max_i\|Z_{ni}\|_2 = o(1)$, as in Pollard (1990). Another important assumption is $\lambda_n\sqrt{n}\,\|d_0\|_2 = O(\sqrt{s})$, which is imposed to make sure that the bias $2^{-1}n\lambda_n c^T(Z_n^TZ_n)^{-1/2}V_n\tilde{d}_0$ caused by the penalty term does not diverge. For instance, using R-Lasso will create a non-diminishing bias and thus cannot be guaranteed to have asymptotic normality.

Note that we do not assume a parametric form of the error distribution. Thus, our oracle estimator is in fact a semiparametric estimator with the error density as a nuisance parameter. Heuristically speaking, Theorem 3 shows that the asymptotic variance of $\sqrt{n}(\hat\beta_1^o - \beta_1^*)$ is $n\tau(1-\tau)V_nZ_n^TZ_nV_n$. Since $V_n = (S^THS)^{-1/2}$ and $Z_n = SV_n$, if the model errors $\varepsilon_i$ are i.i.d. with density function $f_\varepsilon(\cdot)$, then this asymptotic variance reduces to $\tau(1-\tau)\big(n^{-1}f_\varepsilon^2(0)S^TS\big)^{-1}$. In the random design case, where the true covariate vectors $\{S_i\}_{i=1}^n$ are i.i.d. observations, $n^{-1}S^TS$ converges to $E[S_1S_1^T]$ as n → ∞, and the asymptotic variance reduces to $\tau(1-\tau)\big(f_\varepsilon^2(0)E[S_1S_1^T]\big)^{-1}$. This is the semiparametric efficiency bound derived by Newey and Powell (1990) for random designs. In fact, if we assume that $(x_i, y_i)$ are i.i.d., then the conditions of Theorem 3 can hold with asymptotic probability one. Using similar arguments, it can be formally shown that $\sqrt{n}(\hat\beta_1^o - \beta_1^*)$ is asymptotically normal with covariance matrix equal to the aforementioned semiparametric efficiency bound. Hence, our oracle estimator is semiparametric efficient.
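The reduction of the asymptotic variance in the i.i.d. case can be verified directly from the definitions of $V_n$ and $Z_n$ (with $H = f_\varepsilon(0) I_n$):

```latex
% Asymptotic variance under i.i.d. errors
\begin{align*}
V_n &= (S^T H S)^{-1/2} = f_\varepsilon(0)^{-1/2}(S^T S)^{-1/2},\qquad
Z_n^T Z_n = V_n S^T S V_n = f_\varepsilon(0)^{-1} I_s,\\
n\,\tau(1-\tau)\,V_n Z_n^T Z_n V_n
 &= n\,\tau(1-\tau)\,f_\varepsilon(0)^{-2}(S^T S)^{-1}
 = \tau(1-\tau)\bigl(n^{-1}f_\varepsilon^2(0)\,S^T S\bigr)^{-1}.
\end{align*}
```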

5. Properties of the Adaptive Robust Lasso

In the previous sections, we have seen that the choice of the weight vector d plays a pivotal role in whether the WR-Lasso estimate enjoys the model selection oracle property and asymptotic normality. In fact, the conditions in Theorem 2 require that $\min_{j\ge s+1} d_j > c_3$ and that $\|d_0\|_2$ not diverge too quickly. Theorem 3 imposes an even more stringent condition, $\|d_0\|_2 = O\big(\sqrt{s/n}/\lambda_n\big)$, on the weight vector $d_0$. For R-Lasso, $\|d_0\|_2 = \sqrt{s}$ and these conditions become very restrictive. For example, the condition in Theorem 3 becomes $\lambda_n = O(n^{-1/2})$, which is too low a thresholding level even for Gaussian errors. Hence, an adaptive choice of the weights is needed to ensure that these conditions are satisfied. To this end, we propose a two-step procedure.

In the first step, we use R-Lasso, which gives the initial estimate $\hat\beta^{ini}$. As has been shown in Belloni and Chernozhukov (2011) and Wang (2012), R-Lasso is consistent at the near-oracle rate $\sqrt{s(\log p)/n}$ and selects the true model $\mathcal{M}_*$ as a submodel (in other words, R-Lasso has the sure screening property, using the terminology of Fan and Lv (2008)) with asymptotic probability one, namely,

$$\mathrm{supp}(\hat\beta^{ini}) \supseteq \mathrm{supp}(\beta^*) \quad\text{and}\quad \|\hat\beta_1^{ini} - \beta_1^*\|_2 = O\big(\sqrt{s(\log p)/n}\big).$$

We remark that our Theorem 2 also ensures the consistency of R-Lasso. Compared with Belloni and Chernozhukov (2011), Theorem 2 gives stronger results but also needs more restrictive conditions for R-Lasso. As will be shown in later theorems, only the consistency of R-Lasso is needed in the study of AR-Lasso, so we quote the results and conditions on R-Lasso from Belloni and Chernozhukov (2011) with the aim of imposing weaker conditions.

In the second step, we set $\hat{d} = (\hat d_1, \ldots, \hat d_p)^T$ with $\hat d_j = p'_{\lambda_n}(|\hat\beta_j^{ini}|)$, where $p_{\lambda_n}(|\cdot|)$ is a folded-concave penalty function, and then solve the regularization problem (2.4) with the newly computed weight vector. Thus, the vector $\hat d_0$ is expected to be close to the vector $\big(p'_{\lambda_n}(|\beta_1^*|), \ldots, p'_{\lambda_n}(|\beta_s^*|)\big)^T$ in $L_2$-norm. If a folded-concave penalty such as SCAD is used, then $p'_{\lambda_n}(|\beta_j^*|)$ will be close, or even equal, to zero for 1 ≤ j ≤ s, and thus the magnitude of $\|\hat d_0\|_2$ is negligible.

Now, we formally establish the asymptotic properties of AR-Lasso. We first present a more general result and then, in Corollary 1, highlight our recommended procedure, which uses R-Lasso as the initial estimate and SCAD to compute the stochastic weights. Denote $d^* = (d_1^*, \ldots, d_p^*)^T$ with $d_j^* = p'_{\lambda_n}(|\beta_j^*|)$. Using the weight vector $\hat d$, AR-Lasso minimizes the following objective function:

$$\hat{L}_n(\beta) = \sum_{i=1}^n \rho_\tau(y_i - x_i^T\beta) + n\lambda_n\|\hat{d}\circ\beta\|_1. \quad (5.1)$$

We also need the following conditions to show the model selection oracle property of the two-step procedure.

Condition 4

With asymptotic probability one, the initial estimate satisfies $\|\hat\beta^{ini} - \beta^*\|_2 \le C_2\sqrt{s(\log p)/n}$ for some constant $C_2 > 0$.

As discussed above, if R-Lasso is used to obtain the initial estimate, it satisfies the above condition. Our second condition is on the penalty function.

Condition 5

$p'_{\lambda_n}(t)$ is non-increasing in t ∈ (0, ∞) and is Lipschitz with constant $c_5$, that is,

$$\big|p'_{\lambda_n}(|\beta_1|) - p'_{\lambda_n}(|\beta_2|)\big| \le c_5\,|\beta_1 - \beta_2|,$$

for any $\beta_1, \beta_2\in\mathbb{R}$. Moreover, $p'_{\lambda_n}\big(C_2\sqrt{s(\log p)/n}\big) > \tfrac12 p'_{\lambda_n}(0+)$ for large enough n, where $C_2$ is defined in Condition 4.

For the SCAD penalty (Fan and Li, 2001), $p'_{\lambda_n}(\beta)$ is given by

$$p'_{\lambda_n}(\beta) = 1\{\beta\le\lambda_n\} + \frac{(a\lambda_n-\beta)_+}{(a-1)\lambda_n}\,1\{\beta>\lambda_n\}, \quad (5.2)$$

for a given constant a > 2, and it can easily be verified that Condition 5 holds if $\lambda_n > 2(a+1)^{-1}C_2\sqrt{s(\log p)/n}$.
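For completeness, the verification that (5.2) is compatible with Condition 5 under this choice of $\lambda_n$ amounts to the following elementary facts:

```latex
% SCAD derivative (5.2): monotonicity, Lipschitz property, and the 1/2 threshold
\begin{align*}
&p'_{\lambda_n}(t)=1 \text{ on } (0,\lambda_n],\qquad
 p'_{\lambda_n}(t)=\frac{(a\lambda_n-t)_+}{(a-1)\lambda_n} \text{ on } (\lambda_n,\infty),\\
&\text{so } p'_{\lambda_n} \text{ is non-increasing and piecewise linear, with slope magnitude at most } \bigl((a-1)\lambda_n\bigr)^{-1};\\
&p'_{\lambda_n}(t_0)>\tfrac12 \iff t_0<\tfrac{(a+1)\lambda_n}{2},\quad
 \text{and } \lambda_n>2(a+1)^{-1}C_2\sqrt{s(\log p)/n}
 \iff C_2\sqrt{s(\log p)/n}<\tfrac{(a+1)\lambda_n}{2},\\
&\text{hence } p'_{\lambda_n}\!\bigl(C_2\sqrt{s(\log p)/n}\bigr)>\tfrac12=\tfrac12\,p'_{\lambda_n}(0+).
\end{align*}
```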

Theorem 4

Assume the conditions of Theorem 2 hold with $d = d^*$ and $\gamma_n = a_n$, where

$$a_n = C_3\Big(\sqrt{s(\log n)/n} + \lambda_n\big(\|d_0^*\|_2 + C_2c_5\sqrt{s(\log p)/n}\big)\Big),$$

for some constant $C_3 > 0$, and $\lambda_n s\,\kappa_n\sqrt{(\log p)/n}\to 0$. Then, under Conditions 4 and 5, with probability tending to one, there exists a global minimizer $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ of (5.1) such that $\hat\beta_2 = 0$ and $\|\hat\beta_1 - \beta_1^*\|_2 \le a_n$.

The results in Theorem 4 are analogous to those in Theorem 2. The extra term $\lambda_n\sqrt{s(\log p)/n}$ in the convergence rate $a_n$, compared with the rate $\gamma_n$ in Theorem 2, is caused by the bias of the initial estimate $\hat\beta^{ini}$. Since the regularization parameter $\lambda_n$ goes to zero, the bias of AR-Lasso is much smaller than that of the initial estimator $\hat\beta^{ini}$. Moreover, the AR-Lasso estimator $\hat\beta$ possesses the model selection oracle property.

Now we present the asymptotic normality of the AR-Lasso estimate.

Condition 6

The smallest signal satisfies $\min_{1\le j\le s}|\beta_j^*| > 2C_2\sqrt{(s\log p)/n}$. Moreover, it holds that $p'_{\lambda_n}(|\beta|) = o\big(s^{-1}\lambda_n^{-1}(n\log p)^{-1/2}\big)$ for any $|\beta| > 2^{-1}\min_{1\le j\le s}|\beta_j^*|$.

The above condition on the penalty function is satisfied when the SCAD penalty is used and $\min_{1\le j\le s}|\beta_j^*| \ge 2a\lambda_n$, where a is the parameter in the SCAD penalty (5.2).

Theorem 5

Assume conditions of Theorem 3 hold with d = d* and γn = an, where an is defined in Theorem 4. Then, under Conditions 4 – 6, with asymptotic probability one, there exists a global minimizer β̂ of (5.1) having the same asymptotic properties as those in Theorem 3.

With the SCAD penalty, conditions in Theorems 4 and 5 can be simplified and AR-Lasso still enjoys the same asymptotic properties, as presented in the following corollary.

Corollary 1

Assume $\lambda_n = O\big(\sqrt{s(\log p)(\log\log n)/n}\big)$, $\log p = o(\sqrt{n})$, $\min_{1\le j\le s}|\beta_j^*| \ge 2a\lambda_n$ with a the parameter in the SCAD penalty, and $\kappa_n = o\big(n^{1/4}s^{-1/2}(\log n)^{-3/2}(\log p)^{1/2}\big)$. Further assume that $\|n^{-1}Q^THS\|_{2,\infty} < C_4\sqrt{(\log p)(\log\log n)/\log n}$ with $C_4$ some positive constant. Then, under Conditions 1 and 2, with asymptotic probability one, there exists a global minimizer $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ of $\hat{L}_n(\beta)$ such that

$$\|\hat\beta_1 - \beta_1^*\|_2 \le O\big(\sqrt{s(\log n)/n}\big), \qquad \mathrm{sgn}(\hat\beta_1) = \mathrm{sgn}(\beta_1^*), \qquad\text{and}\qquad \hat\beta_2 = 0.$$

If, in addition, $\max_i\|H^{1/2}Z_{ni}\|_2 = o\big(s^{-7/2}(\log s)^{-1}\big)$, then we also have

$$c^T\big(Z_n^TZ_n\big)^{-1/2}V_n^{-1}\big(\hat\beta_1 - \beta_1^*\big) \xrightarrow{\;\mathcal{D}\;} N\big(0,\ \tau(1-\tau)\big),$$

where c is an arbitrary s-dimensional vector satisfying $c^Tc = 1$.

Corollary 1 provides sufficient conditions ensuring the variable selection sign consistency of AR-Lasso. These conditions require that R-Lasso in the initial step have the sure screening property. We remark that, in implementation, AR-Lasso is able to select variables missed by R-Lasso, as demonstrated in our numerical studies in the next section. A theoretical comparison of the variable selection results of R-Lasso and AR-Lasso would be an interesting topic for future study. One set of $(p, n, s, \kappa_n)$ satisfying the conditions in Corollary 1 is $\log p = O(n^{b_1})$, $s = o(n^{(1-b_1)/2})$ and $\kappa_n = o\big(n^{b_1/4}(\log n)^{-3/2}\big)$ with $b_1\in(0, 1/2)$ some constant. Corollary 1 gives one specific choice of $\lambda_n$, not necessarily the smallest $\lambda_n$, that makes our procedure work. In fact, the condition on $\lambda_n$ can be weakened to $\lambda_n > 2(a+1)^{-1}\|\hat\beta_1^{ini} - \beta_1^*\|_\infty$. Currently, we use the $L_2$-norm $\|\hat\beta_1^{ini} - \beta_1^*\|_2$ to bound this $L_\infty$-norm, which is too crude. If one can establish $\|\hat\beta_1^{ini} - \beta_1^*\|_\infty = O_p\big(\sqrt{n^{-1}\log p}\big)$ for an initial estimator $\hat\beta_1^{ini}$, then the choice of $\lambda_n$ can be as small as $O\big(\sqrt{n^{-1}\log p}\big)$, the same order as that used in Wang (2012). On the other hand, since we are using AR-Lasso, the choice of $\lambda_n$ is not as sensitive as for R-Lasso.

6. Numerical Studies

In this section, we evaluate the finite-sample performance of our proposed estimator with synthetic data. Please see the supplementary material (Fan et al., 2013) for the analysis of a real data set, where we provide results of an eQTL study on the CHRNA6 gene.

To assess the performance of the proposed estimator and compare it with other methods, we simulated data from the high-dimensional linear regression model

$$y_i = x_i^T\beta^0 + \varepsilon_i, \qquad x_i\sim N(0, \Sigma_x),$$

where the data had n = 100 observations and the number of parameters was chosen as p = 400. We fixed the true regression coefficient vector as

$$\beta^0 = \{2, 0, 1.5, 0, 0.80, 0, 0, 1, 0, 1.75, 0, 0, 0.75, 0, 0, 0.30, 0, \ldots, 0\}.$$

For the distribution of the noise ε, we considered six symmetric distributions: normal with variance 2 (N(0, 2)); a scale mixture of normals for which $\sigma_i^2 = 1$ with probability 0.9 and $\sigma_i^2 = 25$ otherwise (MN1); a different scale mixture model in which $\varepsilon_i\sim N(0, \sigma_i^2)$ and $\sigma_i\sim\mathrm{Unif}(1, 5)$ (MN2); Laplace; Student's t with 4 degrees of freedom and doubled variance ($\sqrt{2}\times t_4$); and Cauchy. We take τ = 0.5, corresponding to $L_1$-regression, throughout the simulations. The covariance matrix of the covariates, $\Sigma_x$, was either chosen to be the identity (i.e., $\Sigma_x = I_p$) or generated from an AR(1) model with correlation 0.5, that is, $\Sigma_x(i,j) = 0.5^{|i-j|}$.
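A minimal sketch of this data-generating design is given below; it is illustrative only (the scale of the Laplace errors and the random seed are our assumptions, and the helper names are not the authors'):

```python
# Sketch of the simulation design: n = 100, p = 400, heavy-tailed errors.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 400
beta0 = np.zeros(p)
beta0[[0, 2, 4, 7, 9, 12, 15]] = [2, 1.5, 0.80, 1, 1.75, 0.75, 0.30]

def ar1_cov(p, rho=0.5):
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma_x(i,j) = 0.5^{|i-j|}

def draw_errors(n, kind="laplace"):
    if kind == "normal":          # N(0, 2)
        return rng.normal(scale=np.sqrt(2), size=n)
    if kind == "mn1":             # sigma_i^2 = 1 w.p. 0.9, 25 otherwise
        sigma = np.where(rng.random(n) < 0.9, 1.0, 5.0)
        return rng.normal(size=n) * sigma
    if kind == "mn2":             # sigma_i ~ Unif(1, 5)
        return rng.normal(size=n) * rng.uniform(1, 5, size=n)
    if kind == "laplace":         # standard Laplace (scale assumed to be 1)
        return rng.laplace(size=n)
    if kind == "t4":              # sqrt(2) * t_4 (doubled variance)
        return np.sqrt(2) * rng.standard_t(df=4, size=n)
    if kind == "cauchy":
        return rng.standard_cauchy(size=n)
    raise ValueError(kind)

Sigma_x = ar1_cov(p)              # or np.eye(p) for the independent-covariate design
X = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n)
y = X @ beta0 + draw_errors(n, "cauchy")
```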

We implemented five methods for each setting:

  1. L2-Oracle, which is the least squares estimator based on the signal covariates.

  2. Lasso, the penalized least-squares estimator with L1-penalty as in Tibshirani (1996).

  3. SCAD, the penalized least-squares estimator with SCAD penalty as in Fan and Li (2001).

  4. R-Lasso, the robust Lasso defined as the minimizer of (2.4) with d = 1.

  5. AR-Lasso, which is the adaptive robust lasso whose adaptive weights on the penalty function were computed based on the SCAD penalty using the R-Lasso estimate as an initial value.

The tuning parameter, λn, was chosen optimally based on 100 validation data sets. For each of these data sets, we ran a grid search to find the best λn (with the lowest L2 error for β) for the particular setting. This optimal λn was recorded for each of the 100 validation data sets. The median of these 100 optimal values of λn was used in the simulation studies. We preferred this procedure over cross-validation because of the instability of the L2 loss under heavy tails.
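The validation-based tuning can be sketched as follows, where fit stands for any of the estimators above (e.g., the wr_lasso sketch in Section 2); the function names and interface are illustrative assumptions, not the authors' code:

```python
# Sketch of the validation-based choice of lambda described above.
import numpy as np

def select_lambda(fit, datasets, beta_star, lambda_grid):
    """For each validation data set, pick the lambda with the smallest L2 error
    for beta, then return the median of these per-data-set optima."""
    best = []
    for X, y in datasets:
        errs = [np.linalg.norm(fit(X, y, lam) - beta_star) for lam in lambda_grid]
        best.append(lambda_grid[int(np.argmin(errs))])
    return float(np.median(best))
```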

The following four performance measures were calculated:

  1. $L_2$ loss, which is defined as $\|\beta^* - \hat\beta\|_2$.

  2. $L_1$ loss, which is defined as $\|\beta^* - \hat\beta\|_1$.

  3. Number of noise covariates that are included in the model, that is the number of false positives (FP).

  4. Number of signal covariates that are not included, i.e. the number of false negatives (FN).

For each setting, we present the average of each performance measure over 100 simulations. The results are reported in Tables 1 and 2. A boxplot of the L2 losses under the different noise settings is also given in Figure 1 (the boxplot of L2 losses for the independent-covariate setting is similar and omitted). For the results in Tables 1 and 2, one should compare the performance of Lasso with that of R-Lasso, and of SCAD with that of AR-Lasso; this comparison reflects the effectiveness of L1-regression in dealing with heavy-tailed distributions. Furthermore, comparing Lasso with SCAD, and R-Lasso with AR-Lasso, shows the effectiveness of using adaptive weights in the penalty function.

Table 1.

Simulation Results with Independent Covariates

L2 Oracle Lasso SCAD R-Lasso AR-Lasso
N(0, 2) L2 loss 0.833 4.114 3.412 5.342 2.662
L1 loss 0.380 1.047 0.819 1.169 0.785
FP, FN - 27.00, 0.49 29.60, 0.51 36.81, 0.62 17.27, 0.70

MN1 L2 loss 0.977 5.232 4.736 4.525 2.039
L1 loss 0.446 1.304 1.113 1.028 0.598
FP, FN - 26.80, 0.73 29.29, 0.68 34.26, 0.51 16.76, 0.51

MN2 L2 loss 1.886 7.563 7.583 8.121 5.647
L1 loss 0.861 2.085 2.007 2.083 1.845
FP, FN - 20.39, 2.28 23.25, 2.19 24.64, 2.29 11.97, 2.57

Laplace L2 loss 0.795 4.056 3.395 4.610 2.025
L1 loss 0.366 1.016 0.799 1.039 0.573
FP, FN - 26.87, 0.62 29.98, 0.49 34.76, 0.48 18.81, 0.40

√2 × t4
L2 loss 1.087 5.303 5.859 6.185 3.266
L1 loss 0.502 1.378 1.256 1.403 0.951
FP, FN - 24.61, 0.85 36.95, 0.76 33.84, 0.84 18.53, 0.82

Cauchy L2 loss 37.451 211.699 266.088 6.647 3.587
L1 loss 17.136 30.052 40.041 1.646 1.081
FP, FN - 27.39, 5.78 34.32, 5.94 27.33, 1.41 17.28, 1.10

Table 2.

Simulation Results with Correlated Covariates

L2 Oracle Lasso SCAD R-Lasso AR-Lasso
N(0, 2) L2 loss 0.836 3.440 3.003 4.185 2.580
L1 loss 0.375 0.943 0.803 1.079 0.806
FP, FN - 20.62, 0.59 23.13, 0.56 22.72, 0.77 14.49, 0.74

MN1 L2 loss 1.081 4.415 3.589 3.652 1.829
L1 loss 0.495 1.211 1.055 0.901 0.593
FP, FN - 18.66, 0.77 15.71, 0.75 26.65, 0.60 13.29, 0.51

MN2 L2 loss 1.858 6.427 6.249 6.882 4.890
L1 loss 0.844 1.899 1.876 1.916 1.785
FP, FN - 15.16, 2.08 14.77, 1.96 18.22, 1.91 7.86, 2.71

Laplace L2 loss 0.803 3.341 2.909 3.606 1.785
L1 loss 0.371 0.931 0.781 0.927 0.573
FP, FN - 19.32, 0.62 21.60, 0.38 24.44, 0.46 12.90, 0.55

√2 × t4
L2 loss 1.122 4.474 4.259 4.980 2.855
L1 loss 0.518 1.222 1.201 1.299 0.946
FP, FN - 20.00, 0.76 18.49, 0.91 23.56, 0.79 13.40, 1.05

Cauchy L2 loss 31.095 217.395 243.141 5.388 3.286
L1 loss 13.978 31.361 36.624 1.461 1.074
FP, FN - 25.59, 5.48 32.01, 5.43 20.80, 1.16 12.45, 1.17

Fig. 1.

Boxplots for L2 Loss with Correlated Covariates

Our simulation results reveal the following facts. The quantile-based estimators were more robust in dealing with outliers. For example, for the first mixture model (MN1) and for Cauchy errors, R-Lasso outperformed Lasso and AR-Lasso outperformed SCAD in all four metrics, and significantly so when the error distribution is the Cauchy distribution. On the other hand, for light-tailed distributions such as the normal distribution, the efficiency loss was limited. When the tails get heavier, for instance for the Laplace distribution, the quantile-based methods start to outperform the least-squares-based approaches, and increasingly so as the tails get heavier.

The effectiveness of the weights in AR-Lasso is self-evident: SCAD outperformed Lasso, and AR-Lasso outperformed R-Lasso, in almost all of the settings. Furthermore, for all of the error settings AR-Lasso had significantly lower L2 and L1 losses, as well as a smaller model size, compared with the other estimators.

It is seen that when the noise does not have heavy tails, that is, for the normal and the Laplace distributions, all the estimators are comparable in terms of L1 loss. As expected, the estimators that minimize squared loss worked better than R-Lasso and AR-Lasso under Gaussian noise, but their performance deteriorated as the tails got heavier. In addition, in the two heteroscedastic settings (MN1 and MN2), AR-Lasso had the best performance.

For Cauchy noise, the least-squares methods could recover only one or two of the true variables on average. On the other hand, the L1-estimators (R-Lasso and AR-Lasso) had very few false negatives and, as evident from the L2 loss values, missed only variables with smaller magnitudes.

In addition, AR-Lasso consistently selected a smaller set of variables than R-Lasso. For instance, in the setting with independent covariates and Laplace errors, R-Lasso and AR-Lasso had on average 34.76 and 18.81 false positives, respectively. Also note that AR-Lasso consistently outperformed R-Lasso: it estimated β* (lower L1 and L2 losses) and the support of β* (fewer false positives on average) more efficiently.

7. Proofs

In this section, we prove Theorems 1, 2 and 4 and provide the lemmas used in these proofs. The proofs of Theorems 3 and 5 and Proposition 1 are given in the supplementary Appendix (Fan et al., 2013).

We use techniques from empirical process theory to prove the theoretical results. Let $v_n(\beta) = \sum_{i=1}^n\rho_\tau(y_i - x_i^T\beta)$. Then $L_n(\beta) = v_n(\beta) + n\lambda_n\sum_{j=1}^p d_j|\beta_j|$. For a given deterministic M > 0, define the set

$$\mathcal{B}_0(M) = \big\{\beta\in\mathbb{R}^p : \|\beta - \beta^*\|_2 \le M,\ \mathrm{supp}(\beta)\subseteq\mathrm{supp}(\beta^*)\big\}.$$

Then, define the function

$$Z_n(M) = \sup_{\beta\in\mathcal{B}_0(M)}\frac1n\Big|\big(v_n(\beta) - v_n(\beta^*)\big) - E\big(v_n(\beta) - v_n(\beta^*)\big)\Big|. \quad (7.1)$$

Lemma 1 in Section 7.4 gives the rate of convergence for Zn(M).

7.1. Proof of Theorem 1

We first show that for any $\beta = (\beta_1^T, 0^T)^T\in\mathcal{B}_0(M)$ with $M = o\big(\kappa_n^{-1}s^{-1/2}\big)$,

$$E[v_n(\beta) - v_n(\beta^*)] \ge \tfrac12 c_0\,c\,n\|\beta_1 - \beta_1^*\|_2^2, \quad (7.2)$$

for sufficiently large n, where c is the lower bound for $f_i(\cdot)$ in the neighborhood of 0. The intuition follows from the fact that $\beta^*$ is the minimizer of the function $Ev_n(\beta)$, and hence in the Taylor expansion of $E[v_n(\beta) - v_n(\beta^*)]$ around $\beta^*$, the first-order derivative is zero at the point $\beta = \beta^*$. The left-hand side of (7.2) will be controlled by $Z_n(M)$. This yields the $L_2$-rate of convergence in Theorem 1.

To prove (7.2), we set $a_i = S_i^T(\beta_1 - \beta_1^*)$. Then, for $\beta\in\mathcal{B}_0(M)$,

$$|a_i| \le \|S_i\|_2\,\|\beta_1 - \beta_1^*\|_2 \le \sqrt{s}\,\kappa_n M \to 0.$$

Thus, if $S_i^T(\beta_1 - \beta_1^*) > 0$, by $E1\{\varepsilon_i\le 0\} = \tau$, Fubini's theorem, the mean value theorem, and Condition 1, it is easy to derive that

$$E[\rho_\tau(\varepsilon_i - a_i) - \rho_\tau(\varepsilon_i)] = E\big[a_i\big(1\{\varepsilon_i\le a_i\} - \tau\big) - \varepsilon_i 1\{0\le\varepsilon_i\le a_i\}\big] = E\Big[\int_0^{a_i} 1\{0\le\varepsilon_i\le s\}\,ds\Big] = \int_0^{a_i}\big(F_i(s) - F_i(0)\big)\,ds = \tfrac12 f_i(0)a_i^2 + o(1)a_i^2, \quad (7.3)$$

where the o(1) is uniform over all i = 1, ···, n. When $S_i^T(\beta_1 - \beta_1^*) < 0$, the same result can be obtained. Furthermore, by Condition 2,

$$\sum_{i=1}^n f_i(0)a_i^2 = (\beta_1 - \beta_1^*)^T S^T H S\,(\beta_1 - \beta_1^*) \ge c_0 n\|\beta_1 - \beta_1^*\|_2^2.$$

This together with (7.3) and the definition of vn(β) proves (7.2).
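The first equality in (7.3) is the elementary pointwise identity behind this computation; for $a_i > 0$ it can be checked directly (a step spelled out here for completeness):

```latex
% Pointwise identity used in (7.3), for a > 0
\begin{align*}
\rho_\tau(\varepsilon-a)-\rho_\tau(\varepsilon)
 &=(\varepsilon-a)\bigl(\tau-1\{\varepsilon\le a\}\bigr)
   -\varepsilon\bigl(\tau-1\{\varepsilon\le 0\}\bigr)\\
 &=a\bigl(1\{\varepsilon\le a\}-\tau\bigr)
   -\varepsilon\bigl(1\{\varepsilon\le a\}-1\{\varepsilon\le 0\}\bigr)
  =a\bigl(1\{\varepsilon\le a\}-\tau\bigr)-\varepsilon\,1\{0<\varepsilon\le a\}.
\end{align*}
```

Taking expectations and rewriting the right-hand side via Fubini's theorem then yields the remaining equalities in (7.3).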

The inequality (7.2) holds for any $\beta = (\beta_1^T, 0^T)^T\in\mathcal{B}_0(M)$, yet $\hat\beta^o = ((\hat\beta_1^o)^T, 0^T)^T$ may not be in this set. Thus, we let $\tilde\beta = (\tilde\beta_1^T, 0^T)^T$, where

$$\tilde\beta_1 = u\hat\beta_1^o + (1 - u)\beta_1^*, \qquad\text{with } u = M/\big(M + \|\hat\beta_1^o - \beta_1^*\|_2\big),$$

which falls in the set $\mathcal{B}_0(M)$. Then, by convexity and the definition of $\hat\beta_1^o$,

$$L_n(\tilde\beta) \le uL_n(\hat\beta_1^o, 0) + (1-u)L_n(\beta_1^*, 0) \le L_n(\beta_1^*, 0) = L_n(\beta^*).$$

Using this and the triangle inequality, we have

$$E[v_n(\tilde\beta) - v_n(\beta^*)] = -\big\{v_n(\tilde\beta) - Ev_n(\tilde\beta)\big\} + \big\{v_n(\beta^*) - Ev_n(\beta^*)\big\} + L_n(\tilde\beta) - L_n(\beta^*) + n\lambda_n\|d_0\circ\beta_1^*\|_1 - n\lambda_n\|d_0\circ\tilde\beta_1\|_1 \le nZ_n(M) + n\lambda_n\|d_0\circ(\tilde\beta_1 - \beta_1^*)\|_1. \quad (7.4)$$

By the Cauchy–Schwarz inequality, the very last term is bounded by $n\lambda_n\|d_0\|_2\|\tilde\beta_1 - \beta_1^*\|_2 \le n\lambda_n\|d_0\|_2 M$.

Define the event $\mathcal{E}_n = \big\{Z_n(M) \le 2M\sqrt{s(\log n)/n}\big\}$. Then by Lemma 1,

$$P(\mathcal{E}_n) \ge 1 - \exp\big(-c_0 s(\log n)/8\big). \quad (7.5)$$

On the event $\mathcal{E}_n$, by (7.4), we have

$$E[v_n(\tilde\beta) - v_n(\beta^*)] \le 2M\sqrt{sn(\log n)} + n\lambda_n\|d_0\|_2 M.$$

Take $M = 2\sqrt{s/n} + \lambda_n\|d_0\|_2$. By Condition 2 and the assumption $\lambda_n\|d_0\|_2\sqrt{s}\,\kappa_n\to 0$, it is easy to check that $M = o\big(\kappa_n^{-1}s^{-1/2}\big)$. Combining these two results with (7.2), we obtain that on the event $\mathcal{E}_n$,

$$\tfrac12 c_0 n\|\tilde\beta_1 - \beta_1^*\|_2^2 \le \Big(2\sqrt{sn(\log n)} + n\lambda_n\|d_0\|_2\Big)\Big(2\sqrt{s/n} + \lambda_n\|d_0\|_2\Big),$$

which entails that

$$\|\tilde\beta_1 - \beta_1^*\|_2 \le O\big(\lambda_n\|d_0\|_2 + \sqrt{s(\log n)/n}\big).$$

Note that, since $\|\tilde\beta_1 - \beta_1^*\|_2 = M\|\hat\beta_1^o - \beta_1^*\|_2/\big(M + \|\hat\beta_1^o - \beta_1^*\|_2\big)$ by construction, a bound $\|\tilde\beta_1 - \beta_1^*\|_2 \le 2M/3$ implies $\|\hat\beta_1^o - \beta_1^*\|_2 \le 2M$. Thus, on the event $\mathcal{E}_n$,

$$\|\hat\beta_1^o - \beta_1^*\|_2 \le O\big(\lambda_n\|d_0\|_2 + \sqrt{s(\log n)/n}\big).$$

The second result follows trivially.

7.2. Proof of Theorem 2

Since $\hat\beta_1^o$ defined in Theorem 1 is a minimizer of $L_n(\beta_1, 0)$, it satisfies the KKT conditions. To prove that $\hat\beta = ((\hat\beta_1^o)^T, 0^T)^T\in\mathbb{R}^p$ is a global minimizer of $L_n(\beta)$ in the original $\mathbb{R}^p$ space, we only need to check the following condition:

$$\big\|d_1^{-1}\circ Q^T\rho'_\tau\big(y - S\hat\beta_1^o\big)\big\|_\infty < n\lambda_n, \quad (7.6)$$

where $\rho'_\tau(u) = (\rho'_\tau(u_1), \ldots, \rho'_\tau(u_n))^T$ for any n-vector $u = (u_1, \ldots, u_n)^T$, with $\rho'_\tau(u_i) = \tau - 1\{u_i\le 0\}$. Here, $d_1^{-1}$ denotes the vector $(d_{s+1}^{-1}, \ldots, d_p^{-1})^T$. The KKT conditions and the convexity of $L_n(\beta)$ together then ensure that $\hat\beta$ is a global minimizer of $L_n(\beta)$.

Define the events

$$\mathcal{A}_1 = \big\{\|\hat\beta_1^o - \beta_1^*\|_2 \le \gamma_n\big\}, \qquad \mathcal{A}_2 = \Big\{\sup_{\beta\in\mathcal{N}}\big\|d_1^{-1}\circ Q^T\rho'_\tau(y - S\beta_1)\big\|_\infty < n\lambda_n\Big\},$$

where $\gamma_n$ is defined in Theorem 1 and

$$\mathcal{N} = \big\{\beta = (\beta_1^T, \beta_2^T)^T\in\mathbb{R}^p : \|\beta_1 - \beta_1^*\|_2 \le \gamma_n,\ \beta_2 = 0\in\mathbb{R}^{p-s}\big\}.$$

Then, by Theorem 1 and Lemma 2 in Section 7.4, $P(\mathcal{A}_1\cap\mathcal{A}_2) \ge 1 - o(n^{-cs})$. Since $\hat\beta\in\mathcal{N}$ on the event $\mathcal{A}_1$, the inequality (7.6) holds on the event $\mathcal{A}_1\cap\mathcal{A}_2$. This completes the proof of Theorem 2.

7.3. Proof of Theorem 4

The idea of the proof follows that of the proofs of Theorems 1 and 2. We first consider the minimizer of $\hat{L}_n(\beta)$ in the subspace $\{\beta = (\beta_1^T, \beta_2^T)^T\in\mathbb{R}^p : \beta_2 = 0\}$. Let $\beta = (\beta_1^T, 0^T)^T$, where $\beta_1 = \beta_1^* + \tilde{a}_n v_1\in\mathbb{R}^s$ with $\tilde{a}_n = \sqrt{s(\log n)/n} + \lambda_n\big(\|d_0^*\|_2 + C_2c_5\sqrt{s(\log p)/n}\big)$, $\|v_1\|_2 = C$, and C > 0 some large enough constant. By the assumptions of the theorem, $\tilde{a}_n = o\big(\kappa_n^{-1}s^{-1/2}\big)$. Note that

$$\hat{L}_n(\beta_1^* + \tilde{a}_n v_1, 0) - \hat{L}_n(\beta_1^*, 0) = I_1(v_1) + I_2(v_1), \quad (7.7)$$

where $I_1(v_1) = \|\rho_\tau(y - S(\beta_1^* + \tilde{a}_n v_1))\|_1 - \|\rho_\tau(y - S\beta_1^*)\|_1$ and $I_2(v_1) = n\lambda_n\big(\|\hat{d}_0\circ(\beta_1^* + \tilde{a}_n v_1)\|_1 - \|\hat{d}_0\circ\beta_1^*\|_1\big)$, with $\|\rho_\tau(u)\|_1 = \sum_{i=1}^n\rho_\tau(u_i)$ for any vector $u = (u_1, \ldots, u_n)^T$. By the results in the proof of Theorem 1, $E[I_1(v_1)] \ge 2^{-1}c_0 n\tilde{a}_n^2\|v_1\|_2^2$, and moreover, with probability at least $1 - n^{-cs}$,

$$\big|I_1(v_1) - E[I_1(v_1)]\big| \le nZ_n(C\tilde{a}_n) \le 2\tilde{a}_n\sqrt{s(\log n)n}\,\|v_1\|_2.$$

Thus, by the triangle inequality,

$$I_1(v_1) \ge 2^{-1}c_0\tilde{a}_n^2 n\|v_1\|_2^2 - 2\tilde{a}_n\sqrt{s(\log n)n}\,\|v_1\|_2. \quad (7.8)$$

The second term on the right-hand side of (7.7) can be bounded as

$$|I_2(v_1)| \le n\lambda_n\big\|\hat{d}_0\circ(\tilde{a}_n v_1)\big\|_1 \le n\tilde{a}_n\lambda_n\|\hat{d}_0\|_2\,\|v_1\|_2. \quad (7.9)$$

By the triangle inequality and Conditions 4 and 5, it holds that

$$\|\hat{d}_0\|_2 \le \|\hat{d}_0 - d_0^*\|_2 + \|d_0^*\|_2 \le c_5\|\hat\beta_1^{ini} - \beta_1^*\|_2 + \|d_0^*\|_2 \le C_2c_5\sqrt{s(\log p)/n} + \|d_0^*\|_2. \quad (7.10)$$

Thus, combining (7.7)–(7.10) yields

$$\hat{L}_n(\beta_1^* + \tilde{a}_n v_1, 0) - \hat{L}_n(\beta_1^*, 0) \ge 2^{-1}c_0 n\tilde{a}_n^2\|v_1\|_2^2 - 2\tilde{a}_n\sqrt{s(\log n)n}\,\|v_1\|_2 - n\tilde{a}_n\lambda_n\big(\|d_0^*\|_2 + C_2c_5\sqrt{s(\log p)/n}\big)\|v_1\|_2.$$

Making $\|v_1\|_2 = C$ large enough, we obtain that with probability tending to one, $\hat{L}_n(\beta_1^* + \tilde{a}_n v_1, 0) - \hat{L}_n(\beta_1^*, 0) > 0$. It then follows that, with asymptotic probability one, there exists a minimizer $\hat\beta_1$ of $\hat{L}_n(\beta_1, 0)$ such that $\|\hat\beta_1 - \beta_1^*\|_2 \le C_3\tilde{a}_n \equiv a_n$ for some constant $C_3 > 0$.

It remains to prove that, with asymptotic probability one,

$$\big\|\hat{d}_1^{-1}\circ Q^T\rho'_\tau\big(y - S\hat\beta_1\big)\big\|_\infty < n\lambda_n. \quad (7.11)$$

Then, by the KKT conditions, $\hat\beta = (\hat\beta_1^T, 0^T)^T$ is a global minimizer of $\hat{L}_n(\beta)$.

Now we proceed to prove (7.11). Since $\beta_j^* = 0$ for all j = s + 1, ···, p, we have $d_j^* = p'_{\lambda_n}(0+)$. Furthermore, by Condition 4, it holds that $|\hat\beta_j^{ini}| \le C_2\sqrt{s(\log p)/n}$ for j > s with asymptotic probability one. It then follows that

$$\min_{j>s} p'_{\lambda_n}\big(|\hat\beta_j^{ini}|\big) \ge p'_{\lambda_n}\big(C_2\sqrt{s(\log p)/n}\big).$$

Therefore, by Condition 5 we conclude that

$$\big\|\hat{d}_1^{-1}\big\|_\infty = \Big(\min_{j>s} p'_{\lambda_n}\big(|\hat\beta_j^{ini}|\big)\Big)^{-1} < 2/p'_{\lambda_n}(0+) = 2\big\|(d_1^*)^{-1}\big\|_\infty. \quad (7.12)$$

Under the conditions of Theorem 2 with $\gamma_n = a_n$, it follows from Lemma 2 (inequality (7.20)) that, with probability at least $1 - o(p^{-c})$,

$$\sup_{\|\beta_1 - \beta_1^*\|_2\le C_3\tilde{a}_n}\big\|Q^T\rho'_\tau(y - S\beta_1)\big\|_\infty < \frac{n\lambda_n}{2\|(d_1^*)^{-1}\|_\infty}\big(1 + o(1)\big). \quad (7.13)$$

Combining (7.12) and (7.13) and using the triangle inequality, it holds that, with asymptotic probability one,

$$\sup_{\|\beta_1 - \beta_1^*\|_2\le C_3\tilde{a}_n}\big\|\hat{d}_1^{-1}\circ Q^T\rho'_\tau(y - S\beta_1)\big\|_\infty < n\lambda_n.$$

Since the minimizer $\hat\beta_1$ satisfies $\|\hat\beta_1 - \beta_1^*\|_2 \le C_3\tilde{a}_n$ with asymptotic probability one, the above inequality ensures that (7.11) holds with probability tending to one. This completes the proof.

7.4. Lemmas

This subsection contains the lemmas used in the proofs of Theorems 1, 2 and 4.

Lemma 1

Under Condition 2, for any t > 0, we have

$$P\big(Z_n(M) \ge 4M\sqrt{s/n} + t\big) \le \exp\big(-nc_0t^2/(8M^2)\big). \quad (7.14)$$
Proof

Define $\rho(s, y) = (y - s)\big(\tau - 1\{y - s\le 0\}\big)$. Then $v_n(\beta)$ in (7.1) can be rewritten as $v_n(\beta) = \sum_{i=1}^n\rho(x_i^T\beta, y_i)$. Note that the following Lipschitz condition holds for $\rho(\cdot, y_i)$:

$$|\rho(s_1, y_i) - \rho(s_2, y_i)| \le \max\{\tau,\ 1-\tau\}\,|s_1 - s_2| \le |s_1 - s_2|. \quad (7.15)$$

Let $W_1, \ldots, W_n$ be a Rademacher sequence, independent of the model errors $\varepsilon_1, \ldots, \varepsilon_n$. The Lipschitz inequality (7.15), combined with the symmetrization theorem and the concentration inequality (see, for example, Theorems 14.3 and 14.4 in Bühlmann and van de Geer (2011)), yields

$$E[Z_n(M)] \le 2E\Big[\sup_{\beta\in\mathcal{B}_0(M)}\Big|\frac1n\sum_{i=1}^n W_i\big(\rho(x_i^T\beta, y_i) - \rho(x_i^T\beta^*, y_i)\big)\Big|\Big] \le 4E\Big[\sup_{\beta\in\mathcal{B}_0(M)}\Big|\frac1n\sum_{i=1}^n W_i\big(x_i^T\beta - x_i^T\beta^*\big)\Big|\Big]. \quad (7.16)$$

On the other hand, by the Cauchy–Schwarz inequality,

$$\Big|\sum_{i=1}^n W_i\big(x_i^T\beta - x_i^T\beta^*\big)\Big| = \Big|\sum_{j=1}^s\Big(\sum_{i=1}^n W_ix_{ij}\Big)(\beta_j - \beta_j^*)\Big| \le \|\beta_1 - \beta_1^*\|_2\Big\{\sum_{j=1}^s\Big(\sum_{i=1}^n W_ix_{ij}\Big)^2\Big\}^{1/2}.$$

By Jensen's inequality and the concavity of the square-root function, $E(X^{1/2}) \le (EX)^{1/2}$ for any non-negative random variable X. Thus, these two inequalities ensure that the very right-hand side of (7.16) can be further bounded by

$$\sup_{\beta\in\mathcal{B}_0(M)}\|\beta - \beta^*\|_2\, E\Big\{\sum_{j=1}^s\Big(\frac1n\sum_{i=1}^n W_ix_{ij}\Big)^2\Big\}^{1/2} \le M\Big\{\sum_{j=1}^s E\Big(\frac1n\sum_{i=1}^n W_ix_{ij}\Big)^2\Big\}^{1/2} = M\sqrt{s/n}. \quad (7.17)$$

Therefore, it follows from (7.16) and (7.17) that

$$E[Z_n(M)] \le 4M\sqrt{s/n}. \quad (7.18)$$

Next, since $n^{-1}S^TS$ has bounded eigenvalues, for any $\beta = (\beta_1^T, 0^T)^T\in\mathcal{B}_0(M)$,

$$\frac1n\sum_{i=1}^n\big(x_i^T(\beta - \beta^*)\big)^2 = \frac1n(\beta_1 - \beta_1^*)^TS^TS(\beta_1 - \beta_1^*) \le c_0^{-1}\|\beta_1 - \beta_1^*\|_2^2 \le c_0^{-1}M^2.$$

Combining this with the Lipschitz inequality (7.15) and (7.18), and applying Massart's concentration theorem (see Theorem 14.2 in Bühlmann and van de Geer (2011)), yields that for any t > 0,

$$P\big(Z_n(M) \ge 4M\sqrt{s/n} + t\big) \le \exp\big(-nc_0t^2/(8M^2)\big).$$

This proves the Lemma.

Lemma 2

Consider a ball around $\beta^*$: $\mathcal{N} = \{\beta = (\beta_1^T, \beta_2^T)^T\in\mathbb{R}^p : \beta_2 = 0,\ \|\beta_1 - \beta_1^*\|_2\le\gamma_n\}$ with some sequence $\gamma_n \to 0$. Assume that $\min_{j>s} d_j > c_3$, $\sqrt{1 + \gamma_n s^{3/2}\kappa_n^2\log_2 n} = o(\sqrt{n}\,\lambda_n)$, $n^{1/2}\lambda_n(\log p)^{-1/2}\to\infty$, and $\kappa_n\gamma_n^2 = o(\lambda_n)$. Then, under Conditions 1–3, there exists some constant c > 0 such that

$$P\Big(\sup_{\beta\in\mathcal{N}}\big\|d_1^{-1}\circ Q^T\rho'_\tau(y - S\beta_1)\big\|_\infty \ge n\lambda_n\Big) = o(p^{-c}),$$

where $\rho'_\tau(u) = \tau - 1\{u\le 0\}$.

Proof

For a fixed $j \in \{s + 1, \ldots, p\}$ and $\beta = (\beta_1^T, \beta_2^T)^T\in\mathcal{N}$, define

$$\gamma_{\beta,j}(x_i, y_i) = x_{ij}\Big[\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i) - E\big[\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i)\big]\Big],$$

where $x_i^T = (x_{i1}, \ldots, x_{ip})$ is the ith row of the design matrix. The key to the proof is the following decomposition:

$$\sup_{\beta\in\mathcal{N}}\frac1n\big\|Q^T\rho'_\tau(y - S\beta_1)\big\|_\infty \le \sup_{\beta\in\mathcal{N}}\frac1n\big\|Q^TE\big[\rho'_\tau(y - S\beta_1) - \rho'_\tau(\varepsilon)\big]\big\|_\infty + \frac1n\big\|Q^T\rho'_\tau(\varepsilon)\big\|_\infty + \max_{j>s}\sup_{\beta\in\mathcal{N}}\Big|\frac1n\sum_{i=1}^n\gamma_{\beta,j}(x_i, y_i)\Big|. \quad (7.19)$$

We will prove that, with probability at least $1 - o(p^{-c})$,

$$I_1 \equiv \sup_{\beta\in\mathcal{N}}\frac1n\big\|Q^TE\big[\rho'_\tau(y - S\beta_1) - \rho'_\tau(\varepsilon)\big]\big\|_\infty < \frac{\lambda_n}{2\|d_1^{-1}\|_\infty} + o(\lambda_n), \quad (7.20)$$
$$I_2 \equiv n^{-1}\big\|Q^T\rho'_\tau(\varepsilon)\big\|_\infty = o(\lambda_n), \quad (7.21)$$
$$I_3 \equiv \max_{j>s}\sup_{\beta\in\mathcal{N}}\Big|\frac1n\sum_{i=1}^n\gamma_{\beta,j}(x_i, y_i)\Big| = o_p(\lambda_n). \quad (7.22)$$

Combining (7.19)–(7.22) with the assumption minj>s dj > c3 completes the proof of the Lemma.

Now we proceed to prove (7.20). Note that $I_1$ can be rewritten as

$$I_1 = \max_{j>s}\sup_{\beta\in\mathcal{N}}\Big|\frac1n\sum_{i=1}^n x_{ij}E\big[\rho'_\tau(\varepsilon_i) - \rho'_\tau(y_i - x_i^T\beta)\big]\Big|. \quad (7.23)$$

By Condition 1,

$$E\big[\rho'_\tau(\varepsilon_i) - \rho'_\tau(y_i - x_i^T\beta)\big] = F_i\big(S_i^T(\beta_1 - \beta_1^*)\big) - F_i(0) = f_i(0)S_i^T(\beta_1 - \beta_1^*) + I_i,$$

where $F_i(t)$ is the cumulative distribution function of $\varepsilon_i$, and $I_i = F_i\big(S_i^T(\beta_1 - \beta_1^*)\big) - F_i(0) - f_i(0)S_i^T(\beta_1 - \beta_1^*)$. Thus, for any j > s,

$$\sum_{i=1}^n x_{ij}E\big[\rho'_\tau(\varepsilon_i) - \rho'_\tau(y_i - x_i^T\beta)\big] = \sum_{i=1}^n\big(f_i(0)x_{ij}S_i^T\big)(\beta_1 - \beta_1^*) + \sum_{i=1}^n x_{ij}I_i.$$

This, together with (7.23) and the Cauchy–Schwarz inequality, entails that

$$I_1 \le \Big\|\frac1n Q^THS(\beta_1 - \beta_1^*)\Big\|_\infty + \max_{j>s}\Big|\frac1n\sum_{i=1}^n x_{ij}I_i\Big|, \quad (7.24)$$

where $H = \mathrm{diag}\{f_1(0), \ldots, f_n(0)\}$. We consider the two terms on the right-hand side of (7.24) one by one. By Condition 3, the first term can be bounded as

$$\Big\|\frac1n Q^THS(\beta_1 - \beta_1^*)\Big\|_\infty \le \Big\|\frac1n Q^THS\Big\|_{2,\infty}\|\beta_1 - \beta_1^*\|_2 < \frac{\lambda_n}{2\|d_1^{-1}\|_\infty}. \quad (7.25)$$

By Condition 1, $|I_i| \le c_2\big(S_i^T(\beta_1 - \beta_1^*)\big)^2$. This, together with Condition 2, ensures that the second term in (7.24) can be bounded as

$$\max_{j>s}\Big|\frac1n\sum_{i=1}^n x_{ij}I_i\Big| \le \frac{\kappa_n}{n}\sum_{i=1}^n|I_i| \le \frac{C\kappa_n}{n}\sum_{i=1}^n\big(S_i^T(\beta_1 - \beta_1^*)\big)^2 \le C\kappa_n\|\beta_1 - \beta_1^*\|_2^2.$$

Since $\beta\in\mathcal{N}$, it follows from the assumption $\lambda_n^{-1}\kappa_n\gamma_n^2 = o(1)$ that

$$\max_{j>s}\Big|\frac1n\sum_{i=1}^n x_{ij}I_i\Big| \le C\kappa_n\gamma_n^2 = o(\lambda_n).$$

Plugging the above inequality and (7.25) into (7.24) completes the proof of (7.20).

Next we prove (7.21). By Hoeffding’s inequality, if λn>2(1+c)(logp)/n with c is some positive constant, then

P(QTρτ(ε)nλn)j=s+1p2exp(-n2λn24i=1nxij2)=2exp(log(p-s)-nλn2/4)O(p-c).

Thus, with probability at least 1 − O(pc), (7.21) holds.

We now apply Corollary 14.4 in Bühlmann and van de Geer (2011) to prove (7.22). To this end, we need to check the conditions of that corollary. For each fixed j, define the functional space $\Gamma_j = \{\gamma_{\beta,j} : \beta\in\mathcal{N}\}$. First, note that $E[\gamma_{\beta,j}(x_i, y_i)] = 0$ for any $\gamma_{\beta,j}\in\Gamma_j$. Second, since the function $\rho'_\tau$ is bounded, we have

$$\frac1n\sum_{i=1}^n\gamma_{\beta,j}^2(x_i, y_i) = \frac1n\sum_{i=1}^n x_{ij}^2\Big(\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i) - E\big[\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i)\big]\Big)^2 \le 4.$$

Thus, $\|\gamma_{\beta,j}\|_n \equiv \big(n^{-1}\sum_{i=1}^n\gamma_{\beta,j}^2(x_i, y_i)\big)^{1/2} \le 2$.

Third, we calculate the covering number $N(\cdot, \Gamma_j, \|\cdot\|_n)$ of the functional space $\Gamma_j$. For any $\beta = (\beta_1^T, \beta_2^T)^T\in\mathcal{N}$ and $\beta' = (\beta_1'^T, \beta_2'^T)^T\in\mathcal{N}$, by Condition 1 and the mean value theorem,

$$E\big[\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i)\big] - E\big[\rho'_\tau(y_i - x_i^T\beta') - \rho'_\tau(\varepsilon_i)\big] = F_i\big(S_i^T(\beta_1 - \beta_1^*)\big) - F_i\big(S_i^T(\beta_1' - \beta_1^*)\big) = f_i(a_{1i})S_i^T(\beta_1 - \beta_1'), \quad (7.26)$$

where $F_i(t)$ is the cumulative distribution function of $\varepsilon_i$, and $a_{1i}$ lies on the segment connecting $S_i^T(\beta_1 - \beta_1^*)$ and $S_i^T(\beta_1' - \beta_1^*)$. Recall that $\kappa_n = \max_{ij}|x_{ij}|$. Since the $f_i(u)$'s are uniformly bounded, by (7.26),

$$\Big|x_{ij}E\big[\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i)\big] - x_{ij}E\big[\rho'_\tau(y_i - x_i^T\beta') - \rho'_\tau(\varepsilon_i)\big]\Big| \le C|x_{ij}|\,\big|S_i^T(\beta_1 - \beta_1')\big| \le C|x_{ij}|\,\|S_i\|_2\,\|\beta_1 - \beta_1'\|_2 \le C\sqrt{s}\,\kappa_n^2\,\|\beta_1 - \beta_1'\|_2, \quad (7.27)$$

where C > 0 is some generic constant. It is known (see, for example, Lemma 14.27 in Bühlmann and van de Geer (2011)) that the ball $\mathcal{N}$ in $\mathbb{R}^s$ can be covered by $(1 + 4\gamma_n/\delta)^s$ balls of radius δ. Since $\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i)$ can take only the three values {−1, 0, 1}, it follows from (7.27) that the covering number of $\Gamma_j$ satisfies $N\big(2^{2-k}, \Gamma_j, \|\cdot\|_n\big) \le 3\big(1 + C^{-1}2^k\gamma_n s^{1/2}\kappa_n^2\big)^s$. Thus, by calculus, for any $0\le k\le(\log_2 n)/2$,

$$\log\big(1 + N(2^{2-k}, \Gamma_j, \|\cdot\|_n)\big) \le \log(6) + s\log\big(1 + C^{-1}2^k\gamma_n s^{1/2}\kappa_n^2\big) \le \log(6) + C^{-1}2^k\gamma_n s^{3/2}\kappa_n^2 \le 4\big(1 + C^{-1}\gamma_n s^{3/2}\kappa_n^2\big)2^{2k}.$$

Hence, the conditions of Corollary 14.4 in Bühlmann and van de Geer (2011) are satisfied, and we obtain that for any t > 0,

$$P\bigg(\sup_{\beta\in\mathcal{N}}\Big|\frac1n\sum_{i=1}^n\gamma_{\beta,j}(x_i, y_i)\Big| \ge \frac{8}{\sqrt{n}}\Big(3\sqrt{1 + C^{-1}\gamma_n s^{3/2}\kappa_n^2\log_2 n} + 4 + 4t\Big)\bigg) \le 4\exp\Big(-\frac{nt^2}{8}\Big).$$

Taking $t = C\sqrt{(\log p)/n}$ with C > 0 a large enough constant, we obtain

$$P\bigg(\max_{j>s}\sup_{\beta\in\mathcal{N}}\Big|\frac1n\sum_{i=1}^n\gamma_{\beta,j}(x_i, y_i)\Big| \ge \frac{24}{\sqrt{n}}\sqrt{1 + C^{-1}\gamma_n s^{3/2}\kappa_n^2\log_2 n}\bigg) \le 4(p - s)\exp\Big(-\frac{C\log p}{8}\Big) \to 0.$$

Thus, if $\sqrt{1 + \gamma_n s^{3/2}\kappa_n^2\log_2 n} = o(\sqrt{n}\,\lambda_n)$, then with probability at least $1 - o(p^{-c})$, (7.22) holds. This completes the proof of the lemma.

Supplementary Material

supplement

Acknowledgments

The authors sincerely thank the Editor, Associate Editor, and three referees for their constructive comments that led to substantial improvement of the paper.

References

  1. Belloni A, Chernozhukov V. L1-penalized quantile regression in high-dimensional sparse models. Ann Statist. 2011;39:82–130.
  2. Bickel PJ, Li B. Regularization in statistics (with discussion). Test. 2006;15:271–344.
  3. Bickel PJ, Ritov Y, Tsybakov A. Simultaneous analysis of Lasso and Dantzig selector. Ann Statist. 2009;37:1705–1732.
  4. Bradic J, Fan J, Wang W. Penalized composite quasi-likelihood for ultrahigh-dimensional variable selection. J Roy Statist Soc Ser B. 2011;73:325–349. doi: 10.1111/j.1467-9868.2010.00764.x.
  5. Bühlmann P, van de Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. New York: Springer; 2011.
  6. Candès EJ, Tao T. The Dantzig selector: statistical estimation when p is much larger than n (with discussion). Ann Statist. 2007;35:2313–2351.
  7. Efron B, Hastie T, Tibshirani R. Discussion of "The Dantzig selector". Ann Statist. 2007;35(6):2358–2364.
  8. Fan J, Fan Y, Barut E. Supplement to "Adaptive Robust Variable Selection". 2013. doi: 10.1214/13-AOS1191.
  9. Fan J, Fan Y, Lv J. High dimensional covariance matrix estimation using a factor model. J Econometrics. 2008;147:186–197.
  10. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  11. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion). J Roy Statist Soc Ser B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
  12. Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans Inform Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486.
  13. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann Statist. 2004;32:928–961.
  14. Li Y, Zhu J. L1-norm quantile regression. J Comput Graph Statist. 2008;17:163–185.
  15. Lv J, Fan Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann Statist. 2009;37:3498–3528.
  16. Meinshausen N, Bühlmann P. Stability selection. J Roy Statist Soc Ser B. 2010;72:417–473.
  17. Newey WK, Powell JL. Efficient estimation of linear and type I censored regression models under conditional quantile restrictions. Econometric Theory. 1990;3:295–317.
  18. Nolan JP. Stable Distributions - Models for Heavy Tailed Data. Birkhäuser; 2012. (In progress, Chapter 1 online at academic2.american.edu/~jpnolan)
  19. Pollard D. Asymptotics for least absolute deviation regression estimators. Econometric Theory. 1990;7:186–199.
  20. Tibshirani R. Regression shrinkage and selection via the Lasso. J Roy Statist Soc Ser B. 1996;58:267–288.
  21. van de Geer S, Müller P. Quasi-likelihood and/or robust estimation in high dimensions. Stat Sci. 2012;27:469–480.
  22. Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection through the LAD-Lasso. J Bus Econom Statist. 2007;25:347–355.
  23. Wang L. L1 penalized LAD estimator for high dimensional linear regression. Tentatively accepted by Journal of Multivariate Analysis; 2012.
  24. Wang L, Wu Y, Li R. Quantile regression for analyzing heterogeneity in ultra-high dimension. J Amer Statist Assoc. 2012;107:214–222. doi: 10.1080/01621459.2012.656014.
  25. Wu Y, Liu Y. Variable selection in quantile regression. Statist Sin. 2009;37:801–817.
  26. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38:894–942.
  27. Zhao P, Yu B. On model selection consistency of Lasso. J Mach Learn Res. 2006;7:2541–2563.
  28. Zou H. The adaptive lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429.
  29. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Ann Statist. 2008;36(4):1509–1533. doi: 10.1214/009053607000000802.
  30. Zou H, Yuan M. Composite quantile regression and the oracle model selection theory. Ann Statist. 2008;36:1108–1126.
