Author manuscript; available in PMC: 2013 Aug 21.
Published in final edited form as: Ann Inst Stat Math. 2008 Jul 10;62(3):487–514. doi: 10.1007/s10463-008-0184-2

Simultaneous estimation and variable selection in median regression using Lasso-type penalty

Jinfeng Xu 1,, Zhiliang Ying 2
PMCID: PMC3749002  NIHMSID: NIHMS491948  PMID: 23976790

Abstract

We consider median regression with a LASSO-type penalty for variable selection. With a fixed number of variables in the regression model, a two-stage method is proposed for simultaneous estimation and variable selection in which the degree of penalization is chosen adaptively. A Bayesian information criterion type approach is proposed and used to obtain a data-driven procedure that is proved to automatically select asymptotically optimal tuning parameters. It is shown that the resultant estimator achieves the so-called oracle property. The combination of median regression and the LASSO penalty is computationally easy to implement via standard linear programming. A random perturbation scheme can be used to obtain a simple estimator of the standard error. Simulation studies are conducted to assess the finite-sample performance of the proposed method. We illustrate the methodology with a real example.

Keywords: Variable selection, Median regression, Least absolute deviations, Lasso, Perturbation, Bayesian information criterion

1 Introduction

In the general linear model with independent and identically distributed errors, the Least Absolute Deviation (LAD) or L1 method has long been a viable alternative to the least squares method, especially because of its superior robustness properties. Consider the linear regression model

Y_i = β_0^T x_i + e_i,  1 ≤ i ≤ n,   (1)

where the x_i are known p-vectors, β_0 is the unknown p-vector of regression coefficients, and the e_i are i.i.d. random errors with a common distribution F.

The L1 estimator β̂L1 is defined as a minimizer of the L1 loss function

L_n(β) = ∑_{i=1}^n |Y_i − β^T x_i|.   (2)

Although there is no explicit analytic form for β̂_{L1}, the minimization can be carried out easily via linear programming (see, for example, Koenker and D'Orey 1987). The more recent paper by Portnoy and Koenker (1997) gives fast algorithms for the L1 minimization, even for very large problems.
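To make the linear-programming connection concrete, the following is a minimal sketch (our illustration, not code from the paper) that computes the L1 estimator in (2) with scipy.optimize.linprog; the function name lad_fit and the simulated data are ours.

```python
# L1 (LAD) regression as a linear program: write each residual as u_i - v_i
# with u_i, v_i >= 0, so that sum_i |Y_i - x_i^T beta| = sum_i (u_i + v_i)
# at the optimum and the problem is linear in (beta, u, v).
import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """Minimize sum_i |y_i - x_i^T beta| by linear programming."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(n), np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])  # X beta + u - v = y
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# Example usage on simulated data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.standard_t(df=3, size=100)
print(lad_fit(X, y))
```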

An important aspect of (regression) model building is model (variable) selection. For least squares-based regression, there are a number of well established methods, including the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Mallows's Cp, among others. These approaches are characterized by two basic elements: a goodness-of-fit measure and a complexity index. The selection criteria are typically based upon various ways of balancing the two elements so that the resultant prediction errors are minimized. Remarkably, for L1-based regression there has been very little work on model selection. Hurvich and Tsai (1990) proposed a (double-exponential) likelihood function-based criterion and studied its small-sample properties via simulations. A robust modification of Mallows's Cp was proposed by Ronchetti and Staudte (1994) under the framework of M estimation. Developing general variable selection methods with sound theoretical foundation and feasible implementation for L1-based regression remains a great challenge.

An intriguing and novel recent advancement in variable selection is known as Basis Pursuit, proposed by Chen and Donoho (1994), or the Least Absolute Shrinkage and Selection Operator (LASSO), proposed by Tibshirani (1996). In these approaches, estimation and model selection are treated simultaneously as a single minimization problem. Knight and Fu (2000) established some asymptotic results for LASSO-type estimators. Fan and Li (2001) introduced the Smoothly Clipped Absolute Deviation (SCAD) approach and proved its optimal properties. Efron et al. (2004) introduced the Least Angle Regression (LARS) algorithm and discussed extensively its close connection to LASSO.

With fixed p and invertible X^T X, the least squares estimate β̂_{LS} = (X^T X)^{−1} X^T Y uniquely minimizes the squared loss

∑_{i=1}^n (Y_i − β^T x_i)^2.

The LASSO estimate is defined as the minimizer of

∑_{i=1}^n (Y_i − β^T x_i)^2  subject to  ∑_{j=1}^p |β_j| ≤ s · ∑_{j=1}^p |β̂_j^{LS}|,

where 0 ≤ s ≤ 1 controls the amount of shrinkage that is applied to the estimates.

LASSO is similar in form to ridge regression, in which the term in the constraint is β_j^2 rather than |β_j|. A remarkable feature of LASSO, a consequence of the L1 constraint, is that for some β_j's the fitted values are exactly 0. In fact, as the shrinkage parameter s goes from 1 to 0, the estimates go from none being 0 to all being 0. LASSO can also be regarded as a penalized least squares estimator with an L1 penalty: a minimizer of the objective function

∑_{i=1}^n (Y_i − β^T x_i)^2 + λ_n ∑_{j=1}^p |β_j|,

where λn is the tuning parameter.

In the present paper, we propose a parallel approach that borrows the ideas of LASSO by using the L1 penalty, but with the least squares loss replaced by the L1 loss. In doing so, we gain advantages on two fronts. First, it allows us to attack the difficult problem of variable selection for L1 regression. Appealingly, the shrinkage property of the LASSO estimator continues to hold in L1 regression; see Fig. 1. Second, the single criterion function, with both components being of L1 type, reduces the minimization numerically to a strictly linear programming problem, making any resulting methodology extremely easy to implement.

Fig. 1 Graphical display of LASSO shrinkage of the eight coefficients as a function of the shrinkage parameter s in the prostate cancer example. The broken line s = 0.69 is selected by both the BIC and GCV criteria

To be specific, our proposed estimator is a minimizer of the following criterion function

∑_{i=1}^n |Y_i − β^T x_i| + λ_n ∑_{j=1}^p |β_j|.

It can be equivalently defined as a minimizer of the objective function

∑_{i=1}^n |Y_i − β^T x_i|  subject to  ∑_{j=1}^p |β_j| ≤ s · ∑_{j=1}^p |β̂_j^{L1}|,

where βL1 is the usual L1 estimator.

As pointed out by Fan and Li (2001), LASSO does not possess the so-called oracle property, in the sense that it cannot simultaneously attain the best rate of convergence while correctly, with probability tending to one as the sample size increases, setting all unnecessary coefficients to 0. With this in mind, they proposed a variant of the penalty, called SCAD (smoothly clipped absolute deviation). Using this penalty, they were able to achieve the oracle property for the resulting estimator. Unfortunately, if we modify our approach by replacing the L1 penalty function with the SCAD function, the resultant minimization becomes much more complicated. In particular, it is no longer numerically solvable by linear programming.

To maintain the numerical simplicity and uniqueness of solution of the linear programming formulation, and to achieve the desirable oracle property, it is necessary for us to modify and extend the LASSO-type objective function. The tuning parameter λ_n plays the crucial role of striking a balance between estimation of β and variable selection. Large values of λ_n tend to remove variables and increase estimation bias, while small values tend to retain variables. Thus it would be ideal if a large λ_n were used when a regression parameter is 0 (to be removed) and a small value when it is not 0. To this end, it becomes clear that we need a separate λ_n for each parameter component β_j. In other words, we need to consider the estimator as a minimizer of the objective function

∑_{i=1}^n |Y_i − β^T x_i| + ∑_{j=1}^p λ_{nj} |β_j|.

In particular, we are interested in the case λ_{nj} = η_n ξ_{nj}, where the ξ_{nj} are fixed weighting parameters. Likewise, the estimator can be regarded as a minimizer of

∑_{i=1}^n |Y_i − β^T x_i|  subject to  ∑_{j=1}^p ξ_{nj} |β_j| ≤ s · ∑_{j=1}^p ξ_{nj} |β̂_j^{L1}|.

In the Bayesian view, the β_j's have independent double-exponential prior distributions

f(β_j) = (λ_{nj}/2) exp(−λ_{nj} |β_j|).

With a proper choice of tuning parameters, we will show that the resultant penalized estimator exhibits optimal properties. After we obtained our results (Xu 2005), we noticed the recent work on the adaptive LASSO by Zou (2006), which is similar in spirit in that it proposes different scaling parameters in LASSO for fixed p and the squared loss. Our work, however, is motivated by a unified L1-based approach to simultaneous estimation and variable selection and focuses on a theoretical investigation of the resultant data-driven procedure for the absolute deviation loss.

The rest of the paper is organized as follows. In Sect. 2, we introduce some notation for L1 regression, list the conditions under which our main results hold and establish a useful proposition. In Sect. 3, asymptotics for the estimator are considered. The conditions under which consistency or √n-consistency holds are given and limiting distribution results are proved. In Sect. 4, for properly chosen tuning parameters, we establish the oracle property of the estimator and use the perturbation method to estimate its standard error. A two-stage data-driven procedure is also provided and proved to automatically select asymptotically optimal tuning parameters. In Sect. 5, simulation studies as well as a real data application are conducted to examine the performance of the proposed approach.

2 Differentially penalized L1 estimator

We define the differentially penalized L1 estimator β̂ as a minimizer of the objective function

Z_n(β) = (1/n) ∑_{i=1}^n |Y_i − β^T x_i| + (1/n) ∑_{j=1}^p λ_{nj} |β_j|,   (3)

where λ_{nj}, 1 ≤ j ≤ p, are regularization parameters.
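Because both the loss and the penalty in (3) are piecewise linear, the minimization is again a linear program. Below is a minimal sketch of that reformulation (our illustration, not the authors' Fortran code); it reuses the linprog approach above, and the argument lam stands for the weight vector (λ_{n1}, …, λ_{np}).

```python
# Differentially penalized L1 estimator (3) as a linear program:
# write beta_j = s_j - t_j and Y_i - x_i^T beta = u_i - v_i with all auxiliary
# variables nonnegative, so loss and penalty become linear in (s, t, u, v).
import numpy as np
from scipy.optimize import linprog

def penalized_lad(X, y, lam):
    """Minimize (1/n) sum_i |y_i - x_i^T b| + (1/n) sum_j lam_j |b_j|."""
    n, p = X.shape
    lam = np.asarray(lam, dtype=float)
    c = np.concatenate([lam, lam, np.ones(n), np.ones(n)]) / n
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])  # X(s - t) + u - v = y
    res = linprog(c, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * p + 2 * n), method="highs")
    return res.x[:p] - res.x[p:2 * p]
```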

We need to make the following assumptions on the error distribution and the covariates. These assumptions are essentially the same as those made in Pollard (1991) and Rao and Zhao (1992).

  • (C.1)

    {ei} are i.i.d with median 0 and a density function f(·) which is continuous and strictly positive in a neighborhood of 0.

  • (C.2)

{x_i} is a deterministic sequence and there exists a positive definite matrix V for which (1/n) V_n^2 = (1/n) ∑_{i=1}^n x_i x_i^T → V^2.

Now we introduce some notation. Write the true coefficient vector as β_0 = (β_{01}^T, β_{02}^T)^T, where β_{01} is an s-vector and β_{02} is a (p − s)-vector. Without loss of generality, assume 1 ≤ s < p and β_{02} = 0. Considering only the first s covariates, by (C.2) we have (1/n) V_{n1}^2 = (1/n) ∑_{i=1}^n x_{i1} x_{i1}^T → V_1^2, where x_{i1} is the subvector of x_i containing the first s components.

Denote G_n(β) = (1/n) ∑_{i=1}^n E(|Y_i − x_i^T β| − |Y_i − x_i^T β_0|) and R_i(β) = |Y_i − x_i^T β| − |Y_i − x_i^T β_0| + (β − β_0)^T x_i sgn(e_i); then

(1/n)(L_n(β) − L_n(β_0)) = −(1/n) ∑_{i=1}^n (β − β_0)^T x_i sgn(e_i) + G_n(β) + (1/n) ∑_{i=1}^n [R_i(β) − E R_i(β)].

In order to study the asymptotic properties of the penalized estimator, we need to establish the Local Asymptotic Quadratic (LAQ) property of the loss function (2).

Proposition 1 Under (C.1)–(C.2), for every sequence dn > 0 with dn → 0 in probability, we have

n^{−1} L_n(β) − n^{−1} L_n(β_0) = −n^{−1} ∑_{i=1}^n (β − β_0)^T x_i sgn(e_i) + f(0)(β − β_0)^T V^2 (β − β_0) + o_p(‖β − β_0‖^2 + n^{−1})   (4)

holds uniformly in ‖β − β0‖ ≤ dn.

Proof It is easy to see that

|R_i(β)| ≤ 2 |x_i^T(β − β_0)| I(|Y_i − x_i^T β_0| ≤ |x_i^T(β − β_0)|),

so

sup_{‖β − β_0‖ ≤ d_n} |R_i(β)| / ‖β − β_0‖ ≤ 2 ‖x_i‖ I(|Y_i − x_i^T β_0| ≤ 2 d_n ‖x_i‖).

Since, for any compact set B, the class of functions {R_i(β)/‖β − β_0‖ : β ∈ B} is Euclidean with an integrable envelope in the sense of Pakes and Pollard (1989), we can apply the maximal inequality of Pollard (1990, p. 38) to get, for some C > 0,

E[ sup_{‖β − β_0‖ ≤ d_n} |∑_{i=1}^n (R_i(β) − E R_i(β))| / (√n ‖β − β_0‖) ]^2 ≤ (C/n) ∑_{i=1}^n ‖x_i‖^2 I(|ε_i| ≤ 2 d_n ‖x_i‖) = o(1)

as n → ∞ and dn → 0. Thus uniformly in ‖β − β0‖ ≤ dn,

(1/n) ∑_{i=1}^n (R_i(β) − E R_i(β)) = o_p(‖β − β_0‖ / √n),   (5)

and

G_n(β) = (1/n) ∑_{i=1}^n ∫_0^{x_i^T(β_0 − β)} E sgn(ε_i + u) du,

where E sgn(ε_i + u) = 2 ∫_{−u}^0 f(x) dx. Since G_n(β) is a convex function, it has derivative 0 at β_0, and its second derivative at β_0 is (1/n) ∑_{i=1}^n x_i x_i^T · 2 f(0); by Taylor expansion,

G_n(β) = f(0)(β − β_0)^T V^2 (β − β_0) + o(‖β − β_0‖^2).   (6)

Hence, (4) follows directly from (5) and (6).

3 Large sample properties

Knight and Fu (2000) studied the limiting distributions of LASSO-type estimators in the least squares setting. In this section, we establish similar large sample properties for the proposed estimator β̂_n. The key tools we use are the LAQ property of the loss function and a novel inequality. The following result shows that β̂_n is consistent provided λ_{nj} = o_p(n).

Theorem 1 Under (C.1)–(C.2), if λ_{nj}/n →_p λ_{0j} ≥ 0, 1 ≤ j ≤ p, then β̂_n → argmin(Z), where

Z(β) = lim_{n→∞} G_n(β) + ∑_{j=1}^p λ_{0j} |β_j|.

In particular, since β0 is the minimizer of Gn (β), β̂n is consistent, provided λnj = op(n).

Proof By the uniform law of large numbers (Pollard 1990), n^{−1} L_n(β) − n^{−1} L_n(β_0) − G_n(β) = o(1) uniformly for β in any compact set K; hence Z_n(β) − Z(β) = o_p(1). Since Z_n(β) ≥ n^{−1} L_n(β) and argmin(L_n) = O_p(1), we know that argmin(Z_n) = O_p(1). It follows that β̂_n → argmin(Z).

In order to establish the root-n consistency of β̂_n, we need to study the following objective function:

C(u) = u^T D u / 2 − a^T u + ∑_{j=1}^s λ_j u_j + ∑_{j=s+1}^p λ_j |u_j|,

where u ∈ R^p, D is a positive definite matrix, λ_1, …, λ_s are constants, and λ_{s+1}, …, λ_p are nonnegative constants. Suppose that û is a minimizer of C(u); then we have the following proposition.

Proposition 2 For any u, we have C(u) − C(û) ≥ (u − û)^T D (u − û)/2.

Proof First, let us look at the case s = 0. Without loss of generality, assume û = (û_1^T, 0^T)^T, where û_1 ∈ R^r and û_{1i} ≠ 0 for 1 ≤ i ≤ r.

Denote

D = ( D_{11}  D_{12} ; D_{21}  D_{22} ),

where D_{11}, D_{12}, D_{21}, D_{22} are r × r, r × (p − r), (p − r) × r, and (p − r) × (p − r) matrices, respectively, with D_{12}^T = D_{21}, and denote a = (a_1^T, a_2^T)^T, where a_1 and a_2 are r- and (p − r)-dimensional vectors, respectively.

Since (û_1^T, 0^T)^T is a minimizer of C(u), we have

(D_{11} û_1)_i − a_{1i} + λ_i sgn(û_{1i}) = 0,  1 ≤ i ≤ r,   (7)
|(D_{21} û_1)_i − a_{2i}| ≤ λ_i,  r + 1 ≤ i ≤ p.   (8)

The first equality holds because û_1 is componentwise nonzero and is the minimizer of the objective function

u_1^T D_{11} u_1/2 − a_1^T u_1 + ∑_{j=1}^r λ_j |u_j|,

so the derivative of the objective function at the minimizer is 0. The second inequality holds because, in general, 0 ∈ R^p being the minimizer of

u^T M u − b^T u + ∑_{i=1}^p λ_i |u_i|

is equivalent to saying that

|b_i| ≤ λ_i.

To show the proposition holds, we need to prove the inequality

u_1^T D_{11} u_1/2 + u_2^T D_{22} u_2/2 + u_2^T D_{21} u_1 − a_1^T u_1 − a_2^T u_2 + ∑_{i=1}^p λ_i |u_i| ≥ û_1^T D_{11} û_1/2 − a_1^T û_1 + ∑_{i=1}^r λ_i |û_{1i}| + (u_1 − û_1)^T D_{11} (u_1 − û_1)/2 + (u_1 − û_1)^T D_{12} u_2 + u_2^T D_{22} u_2/2,

which is equivalent to

u_2^T D_{21} u_1 − a_1^T u_1 − a_2^T u_2 + ∑_{i=1}^p λ_i |u_i| ≥ û_1^T D_{11} û_1 − a_1^T û_1 + ∑_{i=1}^r λ_i |û_{1i}| − û_1^T D_{11} u_1 + (u_1 − û_1)^T D_{12} u_2.

by (7)

û_1^T D_{11} û_1 − a_1^T û_1 + ∑_{i=1}^r λ_i |û_{1i}| = 0

and

−a_1^T u_1 + ∑_{i=1}^r λ_i |u_i| ≥ −û_1^T D_{11} u_1,

so we only need to prove that

−a_2^T u_2 + ∑_{i=r+1}^p λ_i |u_{2(i−r)}| ≥ −û_1^T D_{12} u_2,

which is apparent by (8).

Now we consider the general case s > 0. Denote

D = ( C  B ; B^T  A ),

where C, B, B^T, A are s × s, s × (p − s), (p − s) × s, and (p − s) × (p − s) matrices, respectively; denote a = (a_1^T, a_2^T)^T, where a_1 and a_2 are s- and (p − s)-dimensional vectors, respectively; and denote b = (λ_1, …, λ_s)^T. Taking the transformation υ_1 = u_1 + C^{−1} B u_2, the objective function becomes

υ_1^T C υ_1/2 + u_2^T (A − B^T C^{−1} B) u_2/2 − (a_2^T − a_1^T C^{−1} B) u_2 − a_1^T υ_1 + ∑_{i=s+1}^p λ_i |u_{2(i−s)}| + b^T υ_1 − b^T C^{−1} B u_2,

and when this function is minimized, u_2 and υ_1 can be separated. For the function of u_2 we apply the result just obtained above, and rewriting the function of υ_1 in its quadratic form, it is straightforward to obtain

C(u) − C(û) ≥ (u_2 − û_2)^T (A − B^T C^{−1} B)(u_2 − û_2)/2 + (υ_1 − υ̂_1)^T C (υ_1 − υ̂_1)/2 = (u − û)^T D (u − û)/2.

Hence the proposition holds in the general case.

Suppose that ûn is a minimizer of the objective function

B_n(u) = −n^{−1/2} ∑_{i=1}^n x_i^T sgn(ε_i) u + f(0) u^T V^2 u + ∑_{i=1}^s (λ_{ni}/√n) sgn(β_{0i}) u_i + ∑_{i=s+1}^p (λ_{ni}/√n) |u_i|.   (9)

The following result shows that β̂_n is √n-consistent provided λ_{nj} = O_p(√n).

Theorem 2 Under (C.1)–(C.2), if λ_{nj}/√n →_p λ_{0j} ≥ 0, 1 ≤ j ≤ p, then √n(β̂_n − β_0) →_d argmin(V), where

V(u) = −W^T u + f(0) u^T V^2 u + ∑_{i=1}^s λ_{0i} sgn(β_{0i}) u_i + ∑_{i=s+1}^p λ_{0i} |u_i|,

and W has a N(0, V^2) distribution. In particular, if λ_{nj} = o_p(√n), the penalized estimator behaves like the full L1 estimator β̂_{L1}.

Proof It follows from Theorem 1 that β̂_n is consistent. By (4),

Z_n(β̂_n) − Z_n(β_0) = −n^{−1} ∑_{i=1}^n x_i^T sgn(e_i)(β̂_n − β_0) + f(0)(β̂_n − β_0)^T V^2 (β̂_n − β_0)/2 + o_p(‖β̂_n − β_0‖^2 + n^{−1}) + ∑_{i=1}^s (λ_{ni}/n)(|β̂_{ni}| − |β_{0i}|) + ∑_{i=s+1}^p (λ_{ni}/n)|β̂_{ni}|
= −n^{−1} ∑_{i=1}^n x_i^T sgn(e_i)(β̂_n − β_0) + f(0)(β̂_n − β_0)^T V^2 (β̂_n − β_0)/2 + o_p(‖β̂_n − β_0‖^2 + n^{−1}) + ∑_{i=1}^s (λ_{ni}/n) sgn(β_{0i})(β̂_{ni} − β_{0i}) + ∑_{i=s+1}^p (λ_{ni}/n)|β̂_{ni} − β_{0i}|.

Let √n(β̂_n − β_0) = ũ_n; then

Z_n(β̂_n) − Z_n(β_0) = n^{−1} B_n(ũ_n) + o_p(n^{−1}‖ũ_n‖^2 + n^{−1}).

It is easy to see that û_n →_p argmin(V) = O_p(1). So n^{−1/2} û_n converges to 0 in probability; again by (4), we have Z_n(β_0 + n^{−1/2} û_n) − Z_n(β_0) = n^{−1} B_n(û_n) + o_p(n^{−1}‖û_n‖^2 + n^{−1}).

Furthermore, since β̂n is a minimizer of Zn(β), we have

0 ≥ n^{−1}(B_n(ũ_n) − B_n(û_n)) + o_p(n^{−1}‖ũ_n‖^2 + n^{−1}‖û_n‖^2 + n^{−1});

by Proposition 2, we then have

(1/n) f(0)(ũ_n − û_n)^T V^2 (ũ_n − û_n) + o_p(n^{−1}‖ũ_n‖^2 + n^{−1}‖û_n‖^2 + n^{−1}) ≤ 0.

This says that √n(β̂_n − β_0) and û_n have the same asymptotic distribution, which completes the proof.

4 Adaptive two-stage procedure

4.1 Oracle property

In this section, we show that for properly chosen tuning parameters, the resultant penalized estimator exhibits the so-called oracle property. Suppose the {λnj} satisfy the following conditions:

  • (C.3)
λ_{nj}/√n →_p 0 for 1 ≤ j ≤ s and λ_{nj}/√n →_p ∞ for s + 1 ≤ j ≤ p.

The first part of (C.3) preserves the √n-consistency of the estimator, and the second part shrinks the zero coefficients directly to zero. Notice that the rates of the regularization parameters are different for zero and nonzero coefficients. In practice, we do not know such information beforehand; indeed, this is exactly the task that the variable selection procedure is trying to accomplish. However, since we can estimate the coefficients with some precision, we can choose data-driven tuning parameters with asymptotically correct rates, and then the penalized estimator exhibits the same asymptotic properties as the one with ideal tuning parameters. An approach based on this idea is given and illustrated in Sect. 4.3.

Perturbation methods are used to estimate the covariance matrix. Define a new loss function

Z_n^*(β) ≡ (1/n) ∑_{i=1}^n |Y_i − β^T x_i| ω_i + (1/n) ∑_{j=1}^p λ_{nj} |β_j|,

where ω_i (i = 1, …, n) are independent positive random variables with E(ω_i) = Var(ω_i) = 1 that are independent of the data (Y_i, x_i) (i = 1, …, n), and let β̂_n^* be a minimizer of Z_n^*(β). We will show in Sect. 4.2 that, conditional on the data (Y_i, x_i) (i = 1, …, n), √n(β̂_n^* − β̂_n) has the same asymptotic distribution as √n(β̂_n − β_0); hence realizations of β̂_n^*, obtained by repeatedly generating the random sample (ω_1, …, ω_n), can be used to estimate the covariance matrix.
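A minimal sketch of this perturbation scheme (ours, not the authors' implementation) is given below. It assumes a helper fit_weighted_lad that minimizes the weighted criterion Z_n^*(β) — for instance, the linear program of Sect. 2 with the costs of u_i and v_i scaled by ω_i — and standard exponential weights, which satisfy E(ω_i) = Var(ω_i) = 1.

```python
# Perturbation-based standard errors: refit the penalized LAD criterion with
# i.i.d. positive weights of mean 1 and variance 1, and use the spread of the
# perturbed estimates across replications as the standard-error estimate.
import numpy as np

def perturbation_se(X, y, lam, fit_weighted_lad, n_rep=1000, rng=None):
    """fit_weighted_lad(X, y, lam, w) is assumed to minimize
    (1/n) sum_i w_i |y_i - x_i^T b| + (1/n) sum_j lam_j |b_j|."""
    rng = np.random.default_rng() if rng is None else rng
    reps = []
    for _ in range(n_rep):
        w = rng.exponential(scale=1.0, size=len(y))  # E(w) = Var(w) = 1
        reps.append(fit_weighted_lad(X, y, lam, w))
    return np.std(np.asarray(reps), axis=0)  # componentwise standard errors
```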

We need to establish the following √n-consistency result for later use.

Proposition 3 (√n-consistency) Under (C.1)–(C.3), we have √n(β̂_n − β_0) = O_P(1).

Proof We only need to show that for any given ε > 0, there exists a large constant C such that

P{ inf_{‖u‖ = C} Z_n(β_0 + u/√n) > Z_n(β_0) } ≥ 1 − ε.   (10)

Together with the convexity of Z_n, this implies that √n(β̂_n − β_0) = O_P(1).

Indeed,

n{Z_n(β_0 + u/√n) − Z_n(β_0)} ≥ L_n(β_0 + u/√n) − L_n(β_0) + ∑_{j=1}^s λ_{nj}(|β_{j0} + u_j/√n| − |β_{j0}|) ≥ −∑_{i=1}^n (u^T x_i/√n) sgn(e_i) + (f(0)/2) u^T V^2 u + o_p(1 + ‖u‖^2);   (11)

for sufficiently large C, the second term of (11) dominates the other terms, hence (10) holds. This completes the proof.

The following Proposition is needed to establish the oracle property of the estimator.

Proposition 4 Under conditions (C.1)–(C.3), with probability tending to one, for any given β_1 satisfying ‖β_1 − β_{01}‖ = O_p(n^{−1/2}) and any constant C,

Z_n((β_1^T, 0^T)^T) = min_{‖β_2‖ ≤ C n^{−1/2}} Z_n((β_1^T, β_2^T)^T).

Proof Denote the gradient of L_n(β) by U_n(β) = −∑_{i=1}^n x_i sgn(Y_i − β^T x_i). It is sufficient to show that, with probability tending to one as n → ∞, for any β_1 satisfying ‖β_1 − β_{01}‖ = O_P(n^{−1/2}) and ‖β_2‖ ≤ C n^{−1/2}, U_{nj}(β) + λ_{nj} sgn(β_j) and β_j have the same sign for β_j ∈ (−C n^{−1/2}, C n^{−1/2}), j = s + 1, …, p.

Similarly as in Proposition 1, we have

U_n(β) = U_n(β_0) + 2 f(0) n V^2 (β − β_0) + o_p(n^{1/2} + n‖β − β_0‖);   (12)

it follows that

U_{nj}(β) + λ_{nj} sgn(β_j) = n^{1/2}{ n^{−1/2} U_{nj}(β_0) + 2 f(0) n^{1/2} [V^2(β − β_0)]_j + o(1 + n^{1/2}‖β − β_0‖) + (λ_{nj}/√n) sgn(β_j) } = n^{1/2}( O_P(1) + (λ_{nj}/√n) sgn(β_j) );

since λ_{nj}/√n → ∞ for j = s + 1, …, p, the sign of U_{nj}(β) + λ_{nj} sgn(β_j) is completely determined by the sign of β_j. This completes the proof.

Now we can establish the following main theorem: the second component of the estimator is exactly zero and the first component is estimated as well as if the correct model were known. This is the so-called oracle property.

Theorem 3 (Oracle property) Under (C.1)–(C.3), with probability tending to one, the penalized estimator β̂_n = (β̂_{n1}^T, β̂_{n2}^T)^T has the following properties:

  1. β̂_{n2} = 0;

  2. √n(β̂_{n1} − β_{01}) →_d N(0, (1/(4 f(0)^2)) V_1^{−2}).

Proof It follows from Proposition 4 that part 1 holds. To prove part 2, notice that β̂_{n1} is the minimizer of the objective function

Z_{n1}(β_1) ≡ (1/n) ∑_{i=1}^n |Y_i − β_1^T x_{i1}| + (1/n) ∑_{j=1}^s λ_{nj} |β_{j1}|,   (13)

which is the penalized criterion using only the first s covariates. Since β̂_{n1} is its minimizer and λ_{nj} = o_p(n^{1/2}), 1 ≤ j ≤ s, by Theorem 2 we know that √n(β̂_{n1} − β_{01}) →_d argmin(V_1), where V_1(u_1) = −W_1^T u_1 + f(0) u_1^T V_1^2 u_1 and W_1 has a N(0, V_1^2) distribution. Moreover, argmin(V_1) = (2 f(0))^{−1} V_1^{−2} W_1 ~ N(0, (1/(4 f(0)^2)) V_1^{−2}). This completes the proof.

Remark 1 By Theorem 2, we can see that although the LASSO estimate shrinks some coefficients to zero with positive probability, it does not necessarily shrink exactly the true zero coefficients; thus it may erroneously retain insignificant variables in the model while at the same time increasing the bias in the estimation of the true nonzero coefficients. However, by Theorem 3, part 2, with differentially scaled tuning parameters the MLASSO performs the correct variable selection and achieves asymptotic efficiency.

4.2 Distributional approximation

Now we establish the asymptotic properties of the perturbed penalized estimator. Recall that β̂_n^* is a minimizer of the loss function

Z_n^*(β) = (1/n) ∑_{i=1}^n |Y_i − β^T x_i| ω_i + (1/n) ∑_{j=1}^p λ_{nj} |β_j|.   (14)

We are able to show that, conditional on the data, the randomly perturbed estimator can be used to approximate the distribution of the estimator. More specifically, we have the following theorem.

Theorem 4 Under conditions (C.1)–(C.3), with probability tending to one, conditional on the data (Y_i, x_i) (i = 1, …, n), β̂_n^* = (β̂_{n1}^{*T}, β̂_{n2}^{*T})^T has the following properties:

  1. β̂_{n2}^* = 0;

  2. √n(β̂_{n1}^* − β̂_{n1}) →_d N(0, (1/(4 f(0)^2)) V_1^{−2}).

Proof Using the same arguments as in the proof of Proposition 1, with L_n^*(β) = ∑_{i=1}^n |Y_i − x_i^T β| ω_i, for every sequence d_n > 0 with d_n → 0 in probability we can prove that

n^{−1} L_n^*(β) − n^{−1} L_n^*(β_0) = −n^{−1} ∑_{i=1}^n (β − β_0)^T x_i sgn(e_i) ω_i + f(0)(β − β_0)^T V^2 (β − β_0) + o_p(‖β − β_0‖^2 + n^{−1})   (15)

holds uniformly in ‖β − β_0‖ ≤ d_n. Then, as in Proposition 3, we can prove that, conditionally on the original data, √n(β̂_n^* − β̂_n) = O_P(1), and, as in Proposition 4, for any given β_1 satisfying ‖β_1 − β_{01}‖ = O_p(n^{−1/2}) and any constant C, conditionally on the original data we have

Z_n^*((β_1^T, 0^T)^T) = min_{‖β_2‖ ≤ C n^{−1/2}} Z_n^*((β_1^T, β_2^T)^T).

By Theorem 3, with probability tending to one, β̂_{n2} = 0; it follows that, conditionally on the original data, β̂_{n2}^* = 0. Then, considering only the first s covariates, β̂_{n1}^* is a minimizer of the function

Z_{n1}^*(β_1) = (1/n) ∑_{i=1}^n |Y_i − β_1^T x_{i1}| ω_i + (1/n) ∑_{j=1}^s λ_{nj} |β_{j1}|.

By Proposition 1 and Theorem 2, we have

V_{n1}(β̂_{n1} − β_{01}) = (1/(2 f(0))) ∑_{i=1}^n V_{n1}^{−1} x_{i1} sgn(e_i) + o_p(1)   (16)

and similarly we have

V_{n1}(β̂_{n1}^* − β_{01}) = (1/(2 f(0))) ∑_{i=1}^n V_{n1}^{−1} x_{i1} sgn(e_i) ω_i + o_p(1).

Thus,

V_{n1}(β̂_{n1}^* − β̂_{n1}) = (1/(2 f(0))) ∑_{i=1}^n V_{n1}^{−1} x_{i1} sgn(e_i)(ω_i − 1) + o_p(1).

It suffices to show that, conditionally on (Y_i, x_i), i = 1, …, n,

(1/(2 f(0))) ∑_{i=1}^n V_{n1}^{−1} x_{i1} sgn(e_i)(ω_i − 1) →_d N(0, I_s/(4 f(0)^2)).   (17)

Since ∑_{i=1}^n V_{n1}^{−1} x_{i1} x_{i1}^T V_{n1}^{−1} sgn(e_i)^2 = I_s and max_{1 ≤ i ≤ n} ‖V_{n1}^{−1} x_{i1} sgn(e_i)‖ → 0, (17) follows by the central limit theorem, which completes the proof.

4.3 Selection of tuning parameters

Although the theoretical results in Sects. 4.1 and 4.2 are interesting, they are impractical if we cannot find λ_{nj} that satisfy (C.3). Noticing that the rates of the tuning parameters for nonzero and zero coefficients are different, we must base our selection of the λ_{nj} on some preliminary estimation procedure. Denote the components of the L1 estimator β̂_{L1} by a_j, j = 1, …, p, and the estimates of their standard errors by b_j, j = 1, …, p. Let

λ_{nj} = η (√n |b_j| / |a_j|)^γ,  γ > 1,  η > 0.   (18)

For the λnj defined in (18), we have the following proposition:

Proposition 5 For a fixed (η, γ), the λnj defined in (18) satisfy C.3.

Proof (i) If the jth component of β_0 is zero, then a_j/b_j converges in distribution to a N(0, 1) random variable. Denoting the sequence by Z_n and the limit by Z, we have λ_{nj}/√n = η n^{γ/2 − 1/2}/|Z_n|^γ →_p ∞, since γ > 1 and |Z_n| = O_p(1).

(ii) If the jth component of β_0 is nonzero, denote it by θ; it is known that √n(a_j − θ) →_d N(0, σ^2), where σ is a fixed positive constant. Denote the sequence by Z_n and its limit by Z. Then √n b_j →_p σ and a_j = Z_n/√n + θ, so λ_{nj}/√n = η (√n |b_j| / |Z_n/√n + θ|)^γ / √n ≈ η (σ / |Z/√n + θ|)^γ / √n → 0.

Now, if the λ_{nj} are chosen as above, we still have to select the parameters η and γ. Resampling-based model selection methods such as cross-validation or generalized degrees of freedom (Shen and Ye 2002) can be employed but are computationally intensive. Tibshirani (1996) constructed a generalized cross-validation style statistic to select the tuning parameter. A key quantity therein is the number of effective parameters, or the degrees of freedom. An interesting property of LASSO-type procedures is that the number of nonzero coefficients of the estimator is an unbiased estimate of its degrees of freedom; see, for example, Efron et al. (2004) or Zou et al. (2004). In this connection, generalized cross-validation (GCV) can be modified to choose η and γ, with the residual sum of squares replaced by the sum of absolute residuals and the degrees of freedom d taken to be the number of nonzero coefficients of the differentially penalized L1 estimator. This greatly reduces the computational burden. In traditional subset selection, the GCV criterion is not consistent, in the sense of choosing the true subset model with probability tending to 1 as the sample size goes to infinity, while the Bayesian information criterion (BIC) is. To obtain a consistent data-driven variable selection procedure, a BIC-type criterion therefore seems necessary. For traditional subset selection in L1 regression, the BIC is defined as

BIC = L_n{β̂}/σ̂ + (log n / 2) α,

where σ̂ = ∑_{i=1}^n |Y_i − β̂_{L1}^T x_i|/n is the maximum likelihood estimate of the scale parameter σ in the full model with all the candidate variables, and α is the size of the subset model. The size of the subset model is its degrees of freedom, while the degrees of freedom of the Lasso-type procedure can be unbiasedly estimated by the number of nonzero coefficients of the estimator. Thus, denoting the number of nonzero coefficients by d, the following BIC-type criterion is proposed for the selection of the tuning parameters η and γ:

BIC = L_n{β̂}/σ̂ + (log n / 2) d.

The selected (ηn, γn) minimizes the BIC function in the region (η, γ) ∈ (0, ∞) × (1, ∞).
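The following sketch (ours) illustrates the resulting two-stage procedure: compute the preliminary LAD fit, form z_{nj} = √n|b_j|/|a_j|, and search a grid of (η, γ) for the pair minimizing the BIC-type criterion. It assumes the lad_fit and penalized_lad sketches above and a vector se of standard-error estimates b_j for the unpenalized fit (e.g. from the perturbation scheme); the grids are illustrative.

```python
# Two-stage, data-driven selection of (eta, gamma) by the BIC-type criterion
# BIC = sum_i |resid_i| / sigma_hat + (log n / 2) * d, with d the number of
# nonzero coefficients of the penalized estimate.
import numpy as np

def select_by_bic(X, y, se, etas, gammas):
    n = len(y)
    a = lad_fit(X, y)                        # preliminary L1 estimates a_j
    z = np.sqrt(n) * np.abs(se) / np.abs(a)  # z_nj = sqrt(n) |b_j| / |a_j|
    sigma_hat = np.mean(np.abs(y - X @ a))   # scale estimate from the full fit
    best = None
    for eta in etas:
        for gam in gammas:
            lam = eta * z ** gam             # lambda_nj = eta * z_nj ** gamma
            beta = penalized_lad(X, y, lam)
            d = int(np.sum(np.abs(beta) > 1e-8))
            bic = np.sum(np.abs(y - X @ beta)) / sigma_hat + 0.5 * np.log(n) * d
            if best is None or bic < best[0]:
                best = (bic, eta, gam, beta)
    return best  # (bic value, eta, gamma, fitted coefficients)
```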

The proposed selection criterion has at least two advantages. First, since the optimal tuning parameters are found by a grid search, it is computationally feasible. Second, asymptotic optimality such as the oracle property is usually established for the penalized estimator with fixed tuning parameters (Fan and Li 2002); it is more worthwhile to investigate the theoretical properties of the penalized estimator with data-driven tuning parameters. Interestingly, as shown in the following theorem, the differentially penalized L1 estimator with the BIC-based data-driven tuning parameters indeed exhibits the oracle property.

Theorem 5 If we define the tuning parameters {λ_{nj}} as in (18), and select (η, γ) by the BIC function defined above, the resultant penalized estimator exhibits the oracle property.

Proof It suffices to prove that the selected tuning parameters satisfy (C.3), or that the penalized estimator is asymptotically equivalent to an estimator with tuning parameters satisfying (C.3). Without loss of generality, we assume that the limits lim_{n→∞} λ_{nj}/√n exist in [0, ∞] for all j. In the case where the limits do not exist, similar arguments can be applied to subsequences of λ_{nj}. It is easy to see that the tuning parameters {λ_{nj}} fall into one of the following five cases.

  1. lim_n λ_{nj}/√n = 0, 1 ≤ j ≤ p;

  2. lim_n λ_{nj}/√n = 0, 1 ≤ j ≤ s; 0 < lim_n λ_{nj}/√n < ∞, s + 1 ≤ j ≤ p;

  3. lim_n λ_{nj}/√n = 0, 1 ≤ j ≤ s; lim_n λ_{nj}/√n = ∞, s + 1 ≤ j ≤ p;

  4. 0 < lim_n λ_{nj}/√n < ∞, 1 ≤ j ≤ s; lim_n λ_{nj}/√n = ∞, s + 1 ≤ j ≤ p;

  5. lim_n λ_{nj}/√n = ∞, 1 ≤ j ≤ s; lim_n λ_{nj}/√n = ∞, s + 1 ≤ j ≤ p.
    1. For tuning parameters in case 1, the estimator has no zero components; denote it by β̂(p). The penalized estimator with tuning parameters in case 3 retains only the s true nonzero coefficients; denote it by β̂(s). Denote the usual L1 estimator with all p covariates and with only the first s covariates by β̂_{L1}(p) and β̂_{L1}(s), respectively. Hence
      BIC(β̂(p)) = ∑_{i=1}^n |Y_i − x_i^T β̂(p)|/σ̂ + (log n / 2) p ≥ ∑_{i=1}^n |Y_i − x_i^T β̂_{L1}(p)|/σ̂ + (log n / 2) p
      and
      BIC(β̂(s)) = ∑_{i=1}^n |Y_i − x_{i1}^T β̂(s)|/σ̂ + (log n / 2) s ≤ ∑_{i=1}^n |Y_i − x_{i1}^T β̂_{L1}(s)|/σ̂ + (log n / 2) s + ∑_{j=1}^s λ_{nj}^0 (|β̂_{L1}(s)_j| − |β̂(s)_j|),
      where {λ_{nj}^0} are the tuning parameters in case 3. Since β̂_{L1}(s) and β̂(s) are both √n-consistent and {λ_{nj}^0} satisfy (C.3), BIC(β̂(s)) ≤ ∑_{i=1}^n |Y_i − x_{i1}^T β̂_{L1}(s)|/σ̂ + (log n / 2) s + o_p(1). And by Proposition 1, we know that
      ∑_{i=1}^n (|Y_i − x_i^T β̂_{L1}(p)| − |e_i|)/σ̂
      and
      ∑_{i=1}^n (|Y_i − x_{i1}^T β̂_{L1}(s)| − |e_i|)/σ̂
      are both O_p(1), so with probability tending to 1, BIC(β̂(p)) > BIC(β̂(s)); hence the selected tuning parameters are not in case 1.
    2. For tuning parameters in case 2, by consistency, the true nonzero components of the estimator remain nonzero, and some of the true zero components might be zero. If all true zero components are zero, then the estimator is asymptotically equivalent to the penalized estimator with tuning parameters in case 3; otherwise, applying the same argument as before, with probability tending to 1 those tuning parameters are not chosen.

    3. For tuning parameters in cases 4 and 5, since L_n(β) = O_p(n), with probability tending to 1 the true zero components of the estimator are zero, while some of the true nonzero components might be zero. If all true nonzero components are nonzero, then its degrees of freedom is the same as in case 3. Suppose that this estimator is β̃(s) and the estimator corresponding to tuning parameters in case 3 is β̂(s). Since its tuning parameters are larger than those in case 3, if the two estimators are not asymptotically equivalent, then L_n(β̃(s)) is strictly larger than L_n(β̂(s)) and those tuning parameters are not chosen. In the other situation, the set of nonzero components of the resultant estimator is a proper subset of {1, 2, …, s}. Without loss of generality, assume the set of nonzero components of the resultant estimator is {1, 2, …, s − 1}, and denote the estimator by β̂(s − 1). To show that these tuning parameters are not chosen, it suffices to prove that, with probability tending to 1, BIC(β̂(s − 1)) > BIC(β̂(s)). Denoting the L1 estimator with only the first s − 1 covariates by β̂_{L1}(s − 1), we only need to prove that ∑_{i=1}^n (|Y_i − x_{i1}^T β̂_{L1}(s − 1)| − |e_i|)/σ̂ > log n + O_p(1).

Actually, we will show that for sufficiently large n,

∑_{i=1}^n (|Y_i − x_{i1}^T β̂_{L1}(s − 1)| − |e_i|) > δn,   (19)

where δ is a positive number.

Denote β_0(s) = (β_{01}, …, β_{0s})^T and let ℬ = {β(s) ∈ R^s : |β(s) − β_0(s)| = |β_{0s}|/2}. By the uniform law of large numbers, we have

sup_{β(s) ∈ ℬ} | (1/n) ∑_{i=1}^n { (|Y_i − x_{i1}^T β(s)| − |e_i|) − E(|Y_i − x_{i1}^T β(s)| − |e_i|) } | →_p 0.   (20)

For large n, |β̂_{L1}(s) − β_0(s)| < |β_{0s}|/2; thus there exists a point β̂(s) ∈ ℬ satisfying

∑_{i=1}^n |Y_i − x_{i1}^T β̂_{L1}(s − 1)| ≥ ∑_{i=1}^n |Y_i − x_{i1}^T β̂(s)|.

Also there exists β̄(s) such that

inf_{β(s) ∈ ℬ} ∑_{i=1}^n E|Y_i − x_{i1}^T β(s)| = ∑_{i=1}^n E|Y_i − x_{i1}^T β̄(s)|.

So

∑_{i=1}^n (|Y_i − x_{i1}^T β̂(s)| − |e_i|) = ∑_{i=1}^n { (|Y_i − x_{i1}^T β̂(s)| − |e_i|) − E(|Y_i − x_{i1}^T β̂(s)| − |e_i|) } + ∑_{i=1}^n E(|Y_i − x_{i1}^T β̂(s)| − |e_i|) ≥ o_p(n) + ∑_{i=1}^n E(|Y_i − x_{i1}^T β̄(s)| − |e_i|) ≥ δn,

thus (19) holds, which concludes the proof.

Remark 2 Other model selection criteria such as GCV and AIC can also be modified accordingly to choose the parameters η and γ. However, as the GCV and AIC criteria are not consistent in the sense of choosing the true model with probability tending to 1 as n goes to infinity, the GCV- or AIC-based differentially scaled L1 penalized estimators do not enjoy the desired oracle property.

Simulation results in Sect. 5 show that the finite-sample performance of the estimates does not vary much with different values of γ. As a referee pointed out, it is worthwhile to explore how the tuning parameter γ changes the behavior of the estimator. To gain some insight into this, we study how the estimate changes with γ in the orthonormal case. Denote √n |b_j|/|a_j| by z_{nj} and let λ_{nj} = λ z_{nj}^γ. In the orthonormal case the covariate vectors x_1, …, x_p are mutually orthogonal, and β̂_j = sgn(x_j^T Y)(|x_j^T Y| − (λ/2) z_{nj}^γ) I(|x_j^T Y| > (λ/2) z_{nj}^γ). For true zero coefficients, z_{nj} = O_p(√n); for true nonzero coefficients, z_{nj} = O_p(1). The magnitudes of z_{nj} for the true zero coefficients are much larger than those for the true nonzero coefficients, and this contrast is amplified by the power γ in λ_{nj} so that the coefficients are shrunk differentially. A large γ, although it magnifies the difference between the tuning parameters for true zero and nonzero coefficients, also inflates the tuning parameters for the nonzero coefficients. Thus the ideal γ should be slightly larger than 1, so as to ensure the correct asymptotic rates while minimizing its impact on inflating the tuning parameters. From the explicit formula for the estimates, we see that a small change in γ will in general have little influence on the behavior of the estimates.
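The orthonormal-case formula quoted above is a soft-thresholding rule whose threshold is inflated by z_{nj}^γ; a tiny sketch (ours) makes this explicit.

```python
# Soft thresholding with adaptive threshold (lambda / 2) * z_nj ** gamma:
# true zeros, whose z_nj grow like sqrt(n), are shrunk much harder.
import numpy as np

def orthonormal_mlasso(XtY, z, lam, gamma):
    thresh = 0.5 * lam * z ** gamma
    return np.sign(XtY) * np.maximum(np.abs(XtY) - thresh, 0.0)
```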

5 Numerical studies

We have conducted extensive numerical studies to compare our proposed method with LASSO, traditional subset selection methods and the oracle least absolute deviations estimates. We denote our method by MLASSO since it is the natural generalization of LASSO with multiple tuning parameters. All simulations are conducted using IMSL's Fortran routine for L1 regression, RLAV. We select the tuning parameters of LASSO and MLASSO by the BIC or GCV function. For MLASSO, the results from selecting both η and γ and from selecting only η with γ fixed at 1.5 are very similar, so we let γ = 1.5 in most of the examples. For the subset selection methods, the best subset is chosen as the one which minimizes the BIC or GCV function. Following Tibshirani (1996) and Fan and Li (2001), we report simulation results in terms of model error instead of prediction error (PE). In the setting of linear models, suppose that

Y=βTX+ε,

From (Y, X) we obtain β̂ as an estimate of β and use β̂^T X_{future} to predict the future response Y_{future}, where (Y_{future}, X_{future}) is an independent copy of (Y, X). The model error (ME) is defined by

ME = E(β̂^T X_{future} − β^T X_{future})^2 = (β̂ − β)^T R (β̂ − β).

The prediction error (PE) is defined as

PE = E(Y_{future} − β̂^T X_{future})^2 = ME + σ^2,

where R is the population covariance matrix of X, and σ2 is the variance of the error.
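For reference, a tiny sketch (ours) of the model-error summary used in the simulations, with R taken to be the AR(1)-type correlation matrix ρ^{|i−j|} of Sect. 5.1.

```python
# Model error ME = (beta_hat - beta)^T R (beta_hat - beta).
import numpy as np

def model_error(beta_hat, beta, R):
    d = beta_hat - beta
    return float(d @ R @ d)

rho, p = 0.5, 8
R = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
```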

5.1 Normal error case

Following the simulation scenario of Fan and Li (2001), we simulate 100 datasets, each consisting of n observations from the model

Y_i = β^T X_i + σ ε_i,   (21)

where β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T and the components of X and ε are standard normal. The correlation between x_i and x_j is ρ^{|i−j|} with ρ = 0.5. We compare the model error of each variable selection procedure to that of the full L1 estimator; the relative value is called the relative model error (RME). We run the simulations for different sample sizes and σ values. In Table 1, we summarize the results in terms of the median of the relative model errors (MRME), the average number of correct 0 coefficients and the average number of incorrect 0 coefficients over the 100 simulated datasets. Our resampling procedure uses 1,000 random samples from the standard exponential distribution; the results are similar for other distributions. From Table 1, we see that with a small sample size and large noise, Best subset performs best in reducing the model error while LASSO tends to identify the fewest incorrect zero components. When the noise level is decreased, even with a small sample size, none of the procedures identifies any nonzero component as zero, and in terms of both the number of correctly identified zero components and the reduction of the model error, MLASSO and Best subset perform much better than LASSO. When the sample size is increased, MLASSO tends to perform better than Best subset and comes closer to the true oracle estimator. It is also interesting to notice that Best subset and MLASSO based on the same criterion (BIC or GCV) perform very similarly. In all the simulations, fixing γ = 1.5 and selecting γ give almost identical performance; hence in the later examples we let γ = 1.5.
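For completeness, a minimal sketch (ours) of this simulation design; the helper name is illustrative.

```python
# Simulate one dataset from model (21): 8 correlated standard-normal covariates
# with corr(x_i, x_j) = rho^{|i-j|}, beta = (3, 1.5, 0, 0, 2, 0, 0, 0), and
# normal errors scaled by sigma.
import numpy as np

def simulate_dataset(n, sigma, rho=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p = len(beta)
    R = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), R, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```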

Table 1.

Variable selection in normal error case

Avg. no. of 0 coefficients

Method MRME (%) Correct Incorrect
n = 40, σ = 3.0
LASSO (BIC) 72.49 3.34 0.15
LASSO (GCV) 75.76 3.55 0.17
LASSO (AIC) 75.69 2.60 0.15
MLASSO1 (BIC) 81.12 3.85 0.32
MLASSO1 (GCV) 78.81 4.21 0.42
MLASSO1 (AIC) 77.33 3.20 0.13
MLASSO2 (BIC) 80.87 3.90 0.32
MLASSO2 (GCV) 79.54 4.21 0.41
MLASSO2 (AIC) 76.22 3.24 0.13
Subset (BIC) 69.40 4.11 0.34
Subset (GCV) 65.09 4.44 0.40
Subset (AIC) 68.52 3.44 0.29
Oracle 37.95 5 0
n = 40, σ = 1.0
LASSO (BIC) 72.49 3.25 0
LASSO (GCV) 72.49 3.46 0
LASSO (AIC) 73.85 2.60 0
MLASSO1 (BIC) 59.63 4.19 0
MLASSO1 (GCV) 51.83 4.51 0
MLASSO1 (AIC) 73.58 3.46 0
MLASSO2 (BIC) 59.63 4.19 0
MLASSO2 (GCV) 50.92 4.52 0
MLASSO2 (AIC) 72.40 3.48 0
Subset (BIC) 56.45 4.16 0
Subset (GCV) 52.21 4.51 0
Subset (AIC) 68.86 3.40 0
Oracle 37.95 5 0
n = 60, σ = 1.0
LASSO (BIC) 69.19 3.45 0
LASSO (GCV) 68.53 3.53 0
LASSO (AIC) 73.82 2.38 0
MLASSO1 (BIC) 53.30 4.35 0
MLASSO1 (GCV) 52.13 4.44 0
MLASSO1 (AIC) 72.79 3.28 0
MLASSO2 (BIC) 54.35 4.37 0
MLASSO2 (GCV) 54.06 4.42 0
MLASSO2 (AIC) 71.60 3.32 0
Subset (BIC) 65.07 4.26 0
Subset (GCV) 55.33 4.39 0
Subset (AIC) 81.48 3.48 0
Oracle 33.63 5 0

The value of γ in MLASSO1 is selected, whereas the value of γ in MLASSO2 is 1.5

We also use the simulations to assess the accuracy of the standard error of the estimator estimated via the perturbation method. The standard deviation of the 100 estimates (SD) is regarded as the true standard error of the estimator. The mean and the standard deviation of the 100 standard errors estimated via the perturbation method (SDm, SDs) are used to assess the performance of the perturbation method. In Table 2, we summarize the results for n = 60, σ = 1.0. From Table 2, it can be seen that SD and SDm are very close, and hence the perturbation method performs very well.

Table 2.

Estimation in normal error case (n = 60, σ = 1.0)

β̂1 β̂2 β̂5



Method SD SDm (SDs) SD SDm (SDs) SD SDm (SDs)
LASSO (BIC) 0.208 0.224 0.055 0.216 0.230 0.050 0.193 0.231 0.058
LASSO (GCV) 0.207 0.225 0.055 0.214 0.231 0.050 0.191 0.231 0.058
LASSO (AIC) 0.207 0.224 0.056 0.215 0.230 0.049 0.191 0.232 0.058
MLASSO1 (BIC) 0.200 0.205 0.053 0.225 0.206 0.051 0.191 0.191 0.054
MLASSO1 (GCV) 0.195 0.206 0.054 0.209 0.205 0.051 0.193 0.190 0.055
MLASSO1 (AIC) 0.196 0.205 0.053 0.217 0.205 0.050 0.192 0.190 0.054
MLASSO2 (BIC) 0.198 0.205 0.053 0.224 0.207 0.050 0.191 0.190 0.053
MLASSO2 (GCV) 0.198 0.206 0.055 0.211 0.205 0.050 0.190 0.189 0.053
MLASSO2 (AIC) 0.199 0.206 0.054 0.212 0.206 0.049 0.190 0.190 0.053
Subset (BIC) 0.199 0.201 0.050 0.226 0.201 0.054 0.181 0.182 0.044
Subset (GCV) 0.194 0.202 0.051 0.215 0.201 0.051 0.181 0.182 0.041
Subset (AIC) 0.195 0.203 0.052 0.217 0.203 0.055 0.183 0.182 0.042
Oracle 0.192 0.209 0.052 0.196 0.205 0.051 0.155 0.180 0.039

The value of γ in MLASSO1 is selected, whereas the value of γ in MLASSO2 is 1.5

To demonstrate the consistency property of the BIC-based MLASSO, we increase the sample size. In Table 3, we summarize the proportions of simulations in which each procedure selects the true model. It can be seen from Table 3 that BIC-based MLASSO and BIC-based Best subset tend to select the true model with proportions increasing to 1 as the sample size increases. The other procedures do not exhibit this good property. It can also be seen that BIC-based MLASSO performs even better than BIC-based Best subset when the sample size is large.

Table 3.

Performance on consistency

sample size (n) 60 100 200 500 1,000 2,000
Subset (BIC) 0.45 (4.26) 0.66 (4.58) 0.64 (4.61) 0.81 (4.79) 0.81 (4.80) 0.80 (4.77)
Subset (GCV) 0.53 (4.39) 0.60 (4.52) 0.56 (4.44) 0.62 (4.54) 0.56 (4.42) 0.52 (4.34)
Subset (AIC) 0.16 (3.48) 0.19 (3.55) 0.18 (3.50) 0.23 (3.60) 0.21 (3.54) 0.22 (3.56)
LASSO (BIC) 0.24 (3.45) 0.27 (3.83) 0.28 (3.75) 0.38 (4.11) 0.44 (3.97) 0.19 (3.53)
LASSO (GCV) 0.29 (3.53) 0.27 (3.79) 0.20 (3.53) 0.25 (3.65) 0.29 (3.47) 0.17 (3.18)
LASSO (AIC) 0.05 (2.38) 0.06 (2.39) 0.08 (2.42) 0.07 (2.40) 0.10 (2.50) 0.10 (2.52)
MLASSO (BIC) 0.65 (4.35) 0.68 (4.59) 0.83 (4.78) 0.84 (4.80) 0.85 (4.83) 0.90 (4.90)
MLASSO (GCV) 0.67 (4.44) 0.67 (4.57) 0.67 (4.52) 0.68 (4.51) 0.71 (4.53) 0.61 (4.40)
MLASSO (AIC) 0.25 (3.28) 0.26 (3.29) 0.28 (3.35) 0.29 (3.36) 0.32 (3.38) 0.32 (3.34)

The number in parentheses is the average number of correctly identified zero coefficients

5.2 Laplace error case

In this example and the next, we change the error distribution in model (21) to explore the robustness of the proposed estimator. We simulate 100 datasets consisting of 60 observations from model (21) with the errors now drawn from the standard double exponential (Laplace) distribution and σ set to 1.0. Tables 4 and 5 summarize the results of the simulations. From Table 4, it can be seen that MLASSO performs favorably compared to the other methods. From Table 5, we see that the perturbation method indeed gives a very accurate estimate of the standard error of the estimator.

Table 4.

Variable selection in Laplace error case

Avg. no. of 0 coefficients

Method MRME (%) Correct Incorrect
LASSO (BIC) 61.17 3.67 0
LASSO (GCV) 60.94 3.81 0
LASSO (AIC) 58.55 2.71 0
MLASSO (BIC) 42.89 4.64 0
MLASSO (GCV) 41.20 4.74 0
MLASSO (AIC) 61.32 3.88 0
Subset (BIC) 43.73 4.69 0
Subset (GCV) 42.59 4.71 0
Subset (AIC) 59.32 3.89 0
Oracle 25.32 5 0

Table 5.

Estimation in Laplace error case

β̂1 β̂2 β̂5



Method SD SDm (SDs) SD SDm (SDs) SD SDm (SDs)
LASSO (BIC) 0.213 0.252 0.059 0.225 0.270 0.072 0.214 0.257 0.069
LASSO (GCV) 0.212 0.253 0.060 0.222 0.270 0.072 0.213 0.258 0.068
LASSO (AIC) 0.212 0.253 0.060 0.224 0.271 0.071 0.214 0.259 0.068
MLASSO (BIC) 0.212 0.222 0.060 0.203 0.229 0.068 0.191 0.204 0.058
MLASSO (GCV) 0.213 0.222 0.0584 0.203 0.229 0.067 0.183 0.204 0.0580
MLASSO (AIC) 0.212 0.222 0.059 0.202 0.228 0.065 0.183 0.205 0.057
Subset (BIC) 0.210 0.217 0.064 0.218 0.217 0.056 0.162 0.187 0.046
Subset (GCV) 0.210 0.218 0.0638 0.218 0.217 0.056 0.164 0.188 0.046
Subset (AIC) 0.211 0.217 0.064 0.219 0.218 0.057 0.164 0.189 0.047
Oracle 0.201 0.224 0.067 0.210 0.221 0.057 0.153 0.190 0.047

5.3 Mixed error case

In this example, we run the same simulations as in the previous example except that we now draw the errors from the standard normal distribution contaminated with 30% outliers from the standard Cauchy distribution. Tables 6 and 7 summarize the simulation results. It can be seen that MLASSO performs the best in this situation and that the perturbation method still performs very well.

Table 6.

Variable selection in mixed error case

Avg. no. of 0 coefficients

Method MRME (%) Correct Incorrect
LASSO (BIC) 71.35 3.90 0.00
LASSO (GCV) 71.29 4.02 0.00
LASSO (AIC) 56.43 3.24 0.00
MLASSO (BIC) 37.84 4.76 0.00
MLASSO (GCV) 36.38 4.80 0.00
MLASSO (AIC) 45.45 4.28 0.00
Subset (BIC) 41.28 4.82 0.03
Subset (GCV) 41.00 4.86 0.03
Subset (AIC) 29.02 4.09 0.00
Oracle 37.27 5 0

Table 7.

Estimation in mixed error case

β̂1 β̂2 β̂5



Method SD SDm (SDs) SD SDm (SDs) SD SDm (SDs)
LASSO (BIC) 0.241 0.272 0.091 0.239 0.271 0.072 0.222 0.269 0.085
LASSO (GCV) 0.251 0.274 0.090 0.237 0.273 0.072 0.213 0.268 0.085
LASSO (AIC) 0.247 0.274 0.090 0.236 0.272 0.073 0.214 0.269 0.086
MLASSO (BIC) 0.231 0.236 0.067 0.244 0.233 0.063 0.171 0.204 0.051
MLASSO (GCV) 0.230 0.236 0.067 0.245 0.234 0.064 0.170 0.204 0.050
MLASSO (AIC) 0.234 0.238 0.066 0.247 0.233 0.064 0.171 0.204 0.050
Subset (BIC) 0.267 0.245 0.066 0.322 0.234 0.075 0.280 0.202 0.053
Subset (GCV) 0.267 0.245 0.066 0.322 0.234 0.074 0.279 0.202 0.053
Subset (AIC) 0.269 0.245 0.067 0.322 0.233 0.074 0.280 0.203 0.054
Oracle 0.215 0.245 0.064 0.239 0.242 0.069 0.188 0.203 0.050

5.4 Prostate cancer example

In this example, we apply the proposed approach to the prostate cancer data. The dataset comes from a study by Stamey et al. (1989). It consists of 97 patients who were about to receive a radical prostatectomy, and a number of clinical measures were recorded for each patient. The purpose of the study was to examine the correlation between the level of prostate specific antigen and eight factors: log(cancer volume) (lcavol), log(prostate weight) (lweight), age, log(benign prostatic hyperplasia amount) (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason) and percentage of Gleason scores 4 or 5 (pgg45). First we standardize the predictors and center the response variable; then we fit a linear model that relates the log(prostate specific antigen) (lpsa) to the predictors. We use the full LAD, the LASSO and the MLASSO methods to estimate the coefficients in the model. The results are summarized in Table 8. With BIC or GCV, LASSO and Best subset result in identical models, and both exclude the variables lcp and pgg45. With AIC, Best subset excludes only the variable pgg45, while LASSO selects η = 0 and results in the full LAD. With GCV, MLASSO selects η = 0.16 and excludes only the variable pgg45, while with BIC it selects η = 0.84 and results in a very parsimonious model that retains only four variables (lcavol, lweight, lbph and svi). With AIC, η is again selected to be 0 and MLASSO produces the full LAD. In Fig. 1, we show the LASSO estimates as a function of the shrinkage parameter s; both the BIC-based and the GCV-based approach select the shrinkage parameter s = 0.69. In Fig. 2, we show the MLASSO estimates as a function of the shrinkage parameter s; the BIC-based and GCV-based approaches select the shrinkage parameters s = 0.28 and 0.71, respectively.

Table 8.

Prostate cancer example

Predictor 1 lcavol 2 lweight 3 age 4 lbph 5 svi 6 lcp 7 gleason 8 pgg45
LAD 0.63(0.13) 0.25 (0.12) −0.16 (0.08) 0.20 (0.11) 0.35 (0.12) −0.17 (0.14) 0.14 (0.11) 0.09 (0.13)
Subset (BIC) 0.58 (0.11) 0.23 (0.11) −0.19 (0.08) 0.25 (0.11) 0.32 (0.13) 0.00 (−) 0.12 (0.09) 0.00 (−)
Subset (GCV) 0.58 (0.11) 0.23 (0.11) −0.19 (0.08) 0.25 (0.11) 0.32 (0.13) 0.00 (−) 0.12 (0.09) 0.00 (−)
Subset (AIC) 0.61 (0.12) 0.22 (0.11) −0.17 (0.09) 0.21 (0.10) 0.38 (0.12) −0.15 (0.08) 0.19 (0.11) 0.00 (−)
LASSO (BIC) 0.59 (0.11) 0.24 (0.11) −0.11 (0.08) 0.17 (0.11) 0.23 (0.13) 0.00 (0.07) 0.04 (0.07) 0.00 (0.09)
LASSO (GCV) 0.59 (0.11) 0.24 (0.11) −0.11 (0.08) 0.17 (0.11) 0.23 (0.13) 0.00 (0.07) 0.04 (0.07) 0.00 (0.09)
LASSO (AIC) 0.63 (0.13) 0.25 (0.12) −0.16 (0.08) 0.20 (0.11) 0.35 (0.12) −0.17 (0.14) 0.14 (0.11) 0.09 (0.13)
MLASSO (BIC) 0.64 (0.12) 0.18 (0.10) 0.00 (0.05) 0.11 (0.07) 0.22 (0.11) 0.00 (0.04) 0.00 (0.03) 0.00 (0.00)
MLASSO (GCV) 0.60 (0.12) 0.22 (0.11) −0.18 (0.08) 0.25 (0.11) 0.35 (0.12) −0.06 (0.09) 0.14 (0.08) 0.00 (0.05)
MLASSO (AIC) 0.63 (0.13) 0.25 (0.12) −0.16 (0.08) 0.20 (0.11) 0.35 (0.12) −0.17 (0.14) 0.14 (0.11) 0.09 (0.13)

Fig. 2 Graphical display of MLASSO shrinkage of the eight coefficients as a function of the shrinkage parameter s in the prostate cancer example. The broken line s = 0.28 is selected by BIC and s = 0.71 is selected by GCV

6 Discussion

Variable selection is a fundamental problem in statistical modeling. A variety of methods have been well developed for least squares-based regression, while their counterparts in median regression are much less understood. With the recent advancement of linear programming techniques for L1 minimization, numerical simplicity is now also an attractive feature of methods for median regression. In this article, we consider the problem of simultaneous estimation and variable selection in the median regression model by penalizing the L1 loss function with an L1 (Lasso) penalty. Combining the L1 loss function with a LASSO-type penalty, the penalized estimator can be computed easily with standard linear programming packages. A differentially scaled L1 penalty is used to achieve desirable properties in terms of both identifying zero coefficients and estimating nonzero coefficients. Large sample properties of the proposed estimator are established by using the local asymptotic quadratic property of the L1 loss function and a novel inequality. Standard errors of the estimator are obtained using the random perturbation method. It is shown that for properly chosen tuning parameters, the differentially penalized L1 estimator exhibits the oracle property. More interestingly, a modified BIC function is employed to obtain data-driven tuning parameters, and the resultant two-stage procedure is proved to enjoy optimal properties. Extensive numerical studies show that the unified L1 method fares well in terms of simultaneous estimation and variable selection and retains the appealing robustness of the L1 estimator. The numerical simplicity of the proposed methodology provides extra benefits in real data analysis.

In spirit, the differentially scaled L1 penalty is similar to the Adaptive Lasso proposed by Zou (2006), although the latter was developed for the squared loss while our investigation is conducted under the L1 loss. A practical issue for both the differentially scaled L1 penalty and the Adaptive Lasso is the construction of tuning parameters that behave differently for the true nonzero coefficients and the true zero coefficients. A slight difference between our construction and the Adaptive Lasso is that we standardize the preliminarily estimated coefficients by their standard errors, while the Adaptive Lasso uses the unstandardized ones. As the magnitudes of the standard errors of the least squares or median estimates may differ substantially in practice, especially when the predictor variables are highly correlated, the differential or adaptive weights should be standardized to decrease their impact on the tuning parameters.

Acknowledgments

This research was supported by grants from the U.S. National Institutes of Health, the U.S. National Science Foundation and the National University of Singapore (R-155-000-075-112). We are very grateful to the editor, the associate editor and the referees for their helpful comments which have greatly improved the paper.

Contributor Information

Jinfeng Xu, Email: staxj@nus.edu.sg, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore.

Zhiliang Ying, Email: zying@stat.columbia.edu, Department of Statistics, Columbia University, New York, NY 10027, USA.

References

  1. Chen S, Donoho D. Basis pursuit. In: Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers; Asilomar; 1994.
  2. Efron B, Johnstone I, Hastie T, Tibshirani R. Least angle regression (with discussion). Annals of Statistics. 2004;32:407–499.
  3. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  4. Fan J, Li R. Variable selection for Cox's proportional hazards model and frailty model. Annals of Statistics. 2002;30:74–99.
  5. Hurvich CM, Tsai CL. Model selection for least absolute deviations regression in small samples. Statistics and Probability Letters. 1990;9:259–265.
  6. Knight K, Fu WJ. Asymptotics for Lasso-type estimators. Annals of Statistics. 2000;28:1356–1378.
  7. Koenker R, D'Orey V. Computing regression quantiles. Applied Statistics. 1987;36:383–393.
  8. Shen X, Ye J. Adaptive model selection. Journal of the American Statistical Association. 2002;97:210–221.
  9. Pakes A, Pollard D. Simulation and the asymptotics of optimization estimators. Econometrica. 1989;57:1027–1057.
  10. Pollard D. Empirical processes: theory and applications. Regional Conference Series in Probability and Statistics. Hayward: Institute of Mathematical Statistics; 1990.
  11. Pollard D. Asymptotics for least absolute deviation regression estimators. Econometric Theory. 1991;7:186–199.
  12. Portnoy S, Koenker R. The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Statistical Science. 1997;12:279–296.
  13. Rao CR, Zhao LC. Approximation to the distribution of M-estimates in linear models by randomly weighted bootstrap. Sankhyā A. 1992;54:323–331.
  14. Ronchetti E, Staudte RG. A robust version of Mallows's Cp. Journal of the American Statistical Association. 1994;89:550–559.
  15. Stamey T, Kabalin J, McNeal J, Johnstone I, Freiha F, Redwine E, Yang N. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate, II: radical prostatectomy treated patients. Journal of Urology. 1989;16:1076–1083. doi: 10.1016/s0022-5347(17)41175-x.
  16. Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  17. Xu J. Parameter estimation, model selection and inferences in L1-based linear regression. PhD dissertation, Columbia University; 2005.
  18. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
  19. Zou H, Hastie T, Tibshirani R. On the "degrees of freedom" of the LASSO. Annals of Statistics. 2007;35:2173–2192.
