ON THE ADAPTIVE ELASTIC-NET WITH A DIVERGING NUMBER OF PARAMETERS

Hui Zou; Hao Helen Zhang

doi:10.1214/08-AOS625

. Author manuscript; available in PMC: 2010 May 4.

Published in final edited form as: Ann Stat. 2009;37(4):1733–1751. doi: 10.1214/08-AOS625

ON THE ADAPTIVE ELASTIC-NET WITH A DIVERGING NUMBER OF PARAMETERS

Hui Zou ^1,^*, Hao Helen Zhang ^2,^†

PMCID: PMC2864037 NIHMSID: NIHMS114303 PMID: 20445770

Abstract

We consider the problem of model selection and estimation in situations where the number of parameters diverges with the sample size. When the dimension is high, an ideal method should have the oracle property (Fan and Li, 2001; Fan and Peng, 2004) which ensures the optimal large sample performance. Furthermore, the high-dimensionality often induces the collinearity problem which should be properly handled by the ideal method. Many existing variable selection methods fail to achieve both goals simultaneously. In this paper, we propose the adaptive Elastic-Net that combines the strengths of the quadratic regularization and the adaptively weighted lasso shrinkage. Under weak regularity conditions, we establish the oracle property of the adaptive Elastic-Net. We show by simulations that the adaptive Elastic-Net deals with the collinearity problem better than the other oracle-like methods, thus enjoying much improved finite sample performance.

Keywords and phrases: Adaptive regularization, Elastic-Net, High dimensionality, Model selection, Oracle property, Shrinkage methods

1. Introduction

1.1. Background

Consider the problem of model selection and estimation in the classical linear regression model

y = X β^{*} + ϵ,

(1.1)

where y = (y₁,…,y_n)^T is the response vector and x_j = (x_1j,…,x_nj)^T, j = 1,…,p, are the linearly independent predictors. Let X = [x₁, ···, x_p] be the predictor matrix. Without loss of generality we assume the data are centered, so the intercept is not included in the regression function. Throughout this paper, we assume the errors are identically and independent distributed with zero mean and finite variance σ². We are interested in the sparse modeling problem where the true model has a sparse representation, i.e., some components of β^* are exactly zero. Let $A = {j : β_{j}^{*} \neq 0, j = 1, 2, \dots, p} .$ In this work we call the size of 𝒜 the intrinsic dimension of the underlying model. We wish to discover the set 𝒜 and estimate the corresponding coefficients.

Variable selection is fundamentally important for knowledge discovery with high-dimensional data (Fan & Li 2006) and it could greatly enhance the prediction performance of the fitted model. Traditional model selection procedures follow best-subset selection and its step-wise variants. However, best-subset selection is computationally prohibitive when the number of predictors is large. Furthermore, as analyzed by Breiman (1996), subset selection is unstable, thus the resulting model has poor prediction accuracy. To overcome the fundamental drawbacks of subset selection, statisticians have recently proposed various penalization methods to perform simultaneous model selection and estimation. In particular, the lasso (Tibshirani 1996) and the SCAD (Fan & Li 2001) are two very popular methods due to their good computational and statistical properties. Efron, Hastie, Johnstone & Tibshirani (2004) proposed the LARS algorithm for computing the entire lasso solution path. Knight & Fu (2000) studied the asymptotic properties of the lasso. Fan & Li (2001) showed that the SCAD enjoys the oracle property, that is, the SCAD estimator can perform as well as the oracle if the penalization parameter is appropriately chosen.

1.2. Two fundamental issues with the ℓ₁ penalty

The lasso estimator (Tibshirani 1996) is obtained by solving the ℓ₁ penalized least squares problem

\hat{β} (lasso) = \underset{β}{arg min} ∥ y - X β ∥_{2}^{2} + λ ∥ β ∥_{1},

(1.2)

where $∥ β ∥_{1} = \sum_{j = 1}^{p} | β_{j} |$ is the ℓ₁-norm of β. The ℓ₁ penalty enables the lasso to simultaneously regularize the least squares fit and shrink some components of β̂(lasso) to zero for some appropriately chosen λ. The entire lasso solution paths can be computed by the LARS algorithm (Efron et al. 2004). These nice properties make the lasso a very popular variable selection method.

Despite its popularity the lasso does have two serious drawbacks: namely the lack of oracle property and instability with high-dimensional data. First of all, the lasso does not have the oracle property. Fan & Li (2001) first pointed out that asymptotically the lasso has non-ignorable bias for estimating the nonzero coefficients. They further conjectured that the lasso may not have the oracle property because of the bias problem. This conjecture was recently proven in Zou (2006). Zou (2006) further showed that the lasso could be inconsistent for model selection unless the predictor matrix (or the design matrix) satisfies a rather strong condition. Zou (2006) proposed the following adaptive lasso estimator

\hat{β} (AdaLasso) = \underset{β}{arg min} ∥ y - X β ∥_{2}^{2} + λ \sum_{j = 1}^{p} {\hat{w}}_{j} | β_{j} |,

(1.3)

where ${{\hat{w}}_{j}}_{j = 1}^{p}$ are the adaptive data-driven weights and can be computed by ${\hat{w}}_{j} = {(| {\hat{β}}_{j}^{ini} |)}^{- γ},$ where γ is a positive constant and β̂ⁱⁿⁱ is an initial root-n consistent estimate of β. Zou (2006) showed that with an appropriately chosen λ, the adaptive lasso performs as well as the oracle. Candes, Wakin & Boyd (2007) used the adaptive lasso idea to enhance sparsity in sparse signal recovery via the reweighted ℓ₁ minimization.

Secondly the ℓ₁ penalization methods can have very poor performance when there are highly correlated variables in the predictor set. The collinearity problem is often encountered in high-dimensional data analysis. Even when the predictors are independent, as long as the dimension is high, the maximum sample correlation can be large, as shown in Fan & Lv (2007). Collinearity can severely degrade the performance of the lasso. As shown in Zou & Hastie (2005), the lasso solution paths are unstable when predictors are highly correlated. Zou & Hastie (2005) proposed the Elastic-Net as an improved version of the lasso for analyzing high-dimensional data. The Elastic-Net estimator is defined as follows:

\hat{β} (enet) = (1 + \frac{λ_{2}}{n}) {\underset{β}{arg min} ∥ y - X β ∥_{2}^{2} + λ_{2} ∥ β ∥_{2}^{2} + λ_{1} ∥ β ∥_{1}} .

(1.4)

If the predictors are standardized (each variable has mean zero and L₂-norm one), then we should change $(1 + \frac{λ_{2}}{n})$ to (1 + λ₂) as in Zou & Hastie (2005). The ℓ₁ part of the Elastic-Net performs automatic variable selection while the ℓ₂ part stabilizes the solution paths and hence improves the prediction. In an orthogonal design where the lasso is shown to be optimal (Donoho, Johnstone, Kerkyacharian & Picard 1995), the Elastic-Net automatically reduces to the lasso. However, when the correlations among the predictors become high, the Elastic-Net can significantly improve the prediction accuracy of the lasso.

1.3. The adaptive Elastic-Net

The adaptively weighted ℓ₁ penalty and the Elastic-Net penalty improve the lasso in two different directions. The adaptive lasso achieves the oracle property of the SCAD and the Elastic-Net handles the collinearity. However, following the arguments in Zou & Hastie (2005) and Zou (2006), we can easily see that the adaptive lasso inherits the instability of the lasso for high-dimensional data, while the Elastic-Net is lack of the oracle property. Thus, it is natural to consider combining the ideas of the adaptively weighted ℓ₁ penalty and the Elastic-Net regularization to obtain a better method which can improve the lasso in both directions. To this end, we propose the adaptive Elastic-Net that penalizes the squared error loss using a combination of the ℓ₂ penalty and the adaptive ℓ₁ penalty. Since the adaptive Elastic-Net is designed for high-dimensional data analysis, we study its asymptotic properties under the assumption that the dimension diverges with the sample size.

Pioneering papers on asymptotic theories with diverging number of parameters include Huber (1988) and Portnoy (1984) which studied the M-estimators. Recently, Fan, Peng & Huang (2005) studied a semi-parametric model with a growing number of nuisance parameters, whereas Lam & Fan (2007) investigated the profile likelihood ratio inference for the growing number of parameters. In particular, our work is influenced by Fan & Peng (2004) who studied the oracle property of nonconcave penalized likelihood estimators. Fan & Peng (2004) provocatively argued why it is important to study the validity of the oracle property when the dimension diverges. We would like to know whether the adaptive Elastic-Net enjoys the oracle property with a diverging number of predictors. This question will be thoroughly investigated in this paper.

The rest of the paper is organized as follows. In Section 2 we introduce the adaptive Elastic-Net. Statistical theory, including the oracle property, of the adaptive Elastic-Net is established in Section 3. In Section 4 we use simulation to compare the finite sample performance of the adaptive Elastic-Net with the SCAD and other competitors. Section 5 discusses how to combine SIS of Fan & Lv (2007) and the adaptive Elastic-Net to deal with the ultra-high dimension cases. Technical proofs are presented in Section 6.

2. Method

The adaptive Elastic-Net can be viewed as a combination of the Elastic-Net and the adaptive lasso. Suppose we first compute the Elastic-Net estimator β̂(enet) as defined in (1.4), and then we construct the adaptive weights by

{\hat{w}}_{j} = {(| {\hat{β}}_{j} (enet) |)}^{- γ}, j = 1, 2, \dots, p,

(2.1)

where γ is a positive constant. Now we solve the following optimization problem to get the adaptive Elastic-Net estimates

\hat{β} (AdaEnet) = (1 + \frac{λ_{2}}{n}) {\underset{β}{arg min} ∥ y - X β ∥_{2}^{2} + λ_{2} ∥ β ∥_{2}^{2} + λ_{1}^{*} \sum_{j = 1}^{p} {\hat{w}}_{j} | β_{j} |} .

(2.2)

From now on, we write β̂ = β̂(AdaEnet) for the sake of convenience.

If we force λ₂ to be zero in (2.2), then the adaptive Elastic-Net reduces to the adaptive lasso. Following the arguments in Zou & Hastie (2005), we can easily show that in an orthogonal design the adaptive Elastic-Net reduces to the adaptive lasso, regardless the value of λ₂. This is desirable because in that setting the adaptive lasso achieves the optimal minimax risk bound (Zou 2006). The role of the ℓ₂ penalty in (2.2) is to further regularize the adaptive lasso fit whenever the collinearity may cause serious trouble.

We know the Elastic-Net naturally adopts a sparse representation. One can use ŵ_j = (|β̂_j(enet)| + 1/n)^−γ to avoid dividing zeros. We can also define ŵ_j = ∞ when β̂_j(enet) = 0. Let 𝒜̂_enet = {j : β̂_j (enet) ≠ 0} and ${\hat{A}}_{enet}^{c}$ denotes its complement set. Then we have ${\hat{β}}_{{\hat{A}}_{enet}^{c}} = 0$ and

{\hat{β}}_{{\hat{A}}_{enet}} = (1 + \frac{λ_{2}}{n}) {\underset{β}{arg min} ∥ y - X_{{\hat{A}}_{enet}} β ∥_{2}^{2} + λ_{2} ∥ β ∥_{2}^{2} + λ_{1}^{*} \sum_{j \in {\hat{Α}}_{enet}} {\hat{w}}_{j} | β_{j} |} .

(2.3)

where β in (2.3) is a vector of length |𝒜̂_enet|, the size of 𝒜̂_enet.

The ℓ₁ regularization parameters, $λ_{1}^{*}$ and λ₁, are directly responsible for the sparsity of the estimates. Their values are allowed to be different. On the other hand, we use the same λ₂ for the ℓ₂ penalty component in the Elastic-Net and the adaptive Elastic-Net estimators, because the ℓ₂ penalty offers the same kind of contribution in both estimators.

3. Statistical Theory

In our theoretical analysis, we assume the following regularity conditions throughout.

(A1) We use λ_min (M) and λ_max (M) to denote the minimum and maximum eigenvalues of a positive definite matrix M, respectively. Then we assume
$b \leq λ_{min} (\frac{1}{n} X^{T} X) \leq λ_{max} (\frac{1}{n} X^{T} X) \leq B$
where b and B are two positive constants.
(A2)
${lim}_{n \to \infty} \frac{{max}_{i = 1, 2, \dots, n} \sum_{j = 1}^{p} x_{ij}^{2}}{n} = 0 .$
(A3) E[|ϵ|^2+δ] < ∞ for some δ > 0.
(A4)
${lim}_{n \to \infty} \frac{log (p)}{log (n)} = ν for some 0 \leq ν < 1 .$

To construct the adaptive weights (ŵ), we take a fixed γ such that $γ > \frac{2 ν}{1 - ν} .$ In our numerical studies we let $γ = ⌈ \frac{2 ν}{1 - ν} ⌉ + 1$ to avoid the tuning on γ. Once γ is chosen, we choose the regularization parameters according to the following conditions
(A5)
$lim_{n \to \infty} \frac{λ_{2}}{n} = 0, lim_{n \to \infty} \frac{λ_{1}}{\sqrt{n}} = 0,$
and
$lim_{n \to \infty} \frac{λ_{1}^{*}}{\sqrt{n}} = 0, lim_{n \to \infty} \frac{λ_{1}^{*}}{\sqrt{n}} n^{\frac{(1 - ν) (1 + γ) - 1}{2}} = \infty .$
(A6)
$lim_{n \to \infty} \frac{λ_{2}}{\sqrt{n}} \sqrt{\sum_{j \in A} β_{j}^{* 2}} = 0, lim_{n \to \infty} min (\frac{n}{λ_{1} \sqrt{p}}, {(\frac{\sqrt{n}}{\sqrt{p} λ_{1}^{*}})}^{\frac{1}{γ}}) (min_{j \in A} | β_{j}^{*} |) \to \infty .$

Conditions (A1) and (A2) assume the predictor matrix has a reasonably good behavior. Similar conditions were considered in Portnoy (1984). Note that in the linear regression setting, condition (A1) is exactly condition (F) in Fan & Peng (2004). Condition (A3) is used to establish the asymptotic normality of β̂ (AdaEnet).

It is worth pointing out that condition (A4) is weaker than that used in Fan & Peng (2004) in which p is assumed to satisfy p⁴/n → 0 or at most p³/n → 0. It means their results require $ν < \frac{1}{3} .$ Our theory removes this limitation. For any 0 ≤ ν < 1, we can choose an appropriate γ to construct the adaptive weights and the oracle property holds as long as $γ > \frac{2 ν}{1 - ν}$ . Also note that in the finite dimension setting ν = 0, thus any positive γ can be used, which agrees with the results in Zou (2006).

Condition (A6) is similar to condition (H) in Fan & Peng (2004). Basically, condition (A6) allows the nonzero coefficients to vanish but at a rate that can be distinguished by the penalized least squares. In the finite dimension setting the condition is implicitly assumed.

THEOREM 3.1

Given the data (y, X), let ŵ = (ŵ₁, …, ŵ_p) be a vector whose components are all non-negative and can depend on (y, X). Define

{\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) = \underset{β}{arg min} {∥ y - X β ∥_{2}^{2} + λ_{2} ∥ β ∥_{2}^{2} + λ_{1} \sum_{j = 1}^{p} {\hat{w}}_{j} | β_{j} |},

for non-negative parameters λ₂ and λ₁. If ŵ_j = 1 for all j, we denote β̂_ŵ(λ₂, λ₁) by β̂(λ₂, λ₁) for convenience.

If we assume the model (1.1) and condition (A1), then

E (∥ {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) - β^{*} ∥_{2}^{2}) \leq 4 \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + λ_{1}^{2} E (\sum_{j = 1}^{p} {\hat{w}}_{j}^{2})}{{(bn + λ_{2})}^{2}} .

In particular, when ŵ_j = 1 for all j, we have

E (∥ \hat{β} (λ_{2}, λ_{1}) - β^{*} ∥_{2}^{2}) \leq 4 \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + λ_{1}^{2} p}{{(bn + λ_{2})}^{2}} .

It is worth mentioning that the derived risk bounds are non-asymptotic. Theorem 3.1 is very useful for the asymptotic analysis. A direct corollary of Theorem 3.1 is that, under conditions (A1)–(A6), β̂(λ₂, λ₁) is a root-(n/p)-consistent estimator. This consistent rate is the same as the result of SCAD (Fan & Peng 2004). The root-(n/p) consistency result suggests that it is appropriate to use the Elastic-Net to construct the adaptive weights.

THEOREM 3.2

Let us write $β^{*} = (β_{A}^{*}, 0)$ and define

{\tilde{β}}_{A}^{*} = \underset{β}{arg min} {∥ y - X_{A} β ∥_{2}^{2} + λ_{2} \sum_{j \in A} β_{j}^{2} + λ_{1}^{*} \sum_{j \in A} {\hat{w}}_{j} | β_{j} |} .

(3.1)

Then with probability tending to 1, $((1 + \frac{λ_{2}}{n}) {\tilde{β}}_{A}^{*}, 0)$ is the solution to (2.2).

Theorem 3.2 provides an asymptotic characterization of the solution to the adaptive Elastic-Net criterion. The definition of ${\tilde{β}}_{A}^{*}$ borrows the concept of “oracle” (Donoho & Johnstone 1994, Fan & Li 2001, Fan & Peng 2004, Zou 2006). If there was an oracle informing us the true subset model, then we would use this oracle information and the adaptive Elastic-Net criterion would become that in (2.3). Theorem 3.2 tells us that asymptotically speaking, the adaptive Elastic-Net works as if it had such oracle information. Theorem 3.2 also suggests that the adaptive Elastic-Net should enjoy the oracle property, which is confirmed in the next theorem.

THEOREM 3.3

Under conditions (A1)–(A6), the adaptive Elastic-Net has the oracle property, that is, the estimator β̂(AdaEnet) must satisfy:

Consistency in selection : Pr ({j : β̂(AdaEnet)_j ≠ 0} = 𝒜) → 1,
Asymptotic normality : $α^{T} \frac{I + λ_{2} \sum_{A}^{- 1}}{1 + \frac{λ_{2}}{n}} \sum_{A}^{\frac{1}{2}} (\hat{β} {(AdaEnet)}_{A} - β_{A}^{*}) \to_{d} N (0, σ^{2}),$ where $\sum_{A} = X_{A}^{T} X_{A}$ and α is a vector of norm 1.

By Theorem 3.3 the selection consistency and the asymptotic normality of the adaptive Elastic-Net are still valid when the number of parameters diverges. Technically speaking, the selection consistency result is stronger than that Theorem 3.2 implies, although Theorem 3.2 plays an important role in the proof of Theorem 3.3. As a special case, when we let λ₂ = 0, which is a choice satisfying conditions (A5) and (A6), Theorem 3.3 tell us that the adaptive lasso enjoys the selection consistency and the asymptotical normality:

α^{T} \sum_{A}^{\frac{1}{2}} (\hat{β} {(AdaLasso)}_{A} - β_{A}^{*}) \to_{d} N (0, σ^{2}) .

4. Numerical Studies

In this section we present simulations to study the finite sample performance of the adaptive Elastic-Net. We considered five methods in the simulation study: the lasso(Lasso), the Elastic-Net(Enet), the adaptive lasso(ALasso), the adaptive Elastic-Net(AEnet) and the SCAD. In our implementation, we let λ₂ = 0 in the adaptive Elastic-Net to get the adaptive lasso fit. There are several commonly used tuning parameter selection methods, such as cross-validation, generalized cross-validation(GCV), AIC and BIC. Zou, Hastie & Tibshirani (2007) suggested using BIC to select the lasso tuning parameter. Wang, Li & Tsai (2007) showed that for the SCAD, BIC is a better tuning parameter selector than GCV and AIC. In this work, we used BIC to select the tuning parameter for each method.

Fan & Peng (2004) considered simulation models in which $p_{n} = [4 n^{\frac{1}{4}}] - 5$ and |𝒜| = 5. Our theory allows p_n = O(n^ν) for any ν < 1. Thus, we are interested in models in which p_n = O(n^ν) with $ν > \frac{1}{3} .$ In addition, we allow the intrinsic dimension (𝒜) to diverge with the sample size as well, because such designs make the model selection and estimation more challenging than in the fixed |𝒜| situations.

Example 1

We generated data from the linear regression model,

y = x^{T} β^{*} + ϵ,

where β^* is a p-dim vector and ε ∼ N(0, σ²), σ = 6 and x follows a p-dim multivariate normal distribution with zero mean and covariance Σ whose (j, k) entry is Σ_j,k = ρ^|j−k| 1 ≤ k, j ≤ p. We considered ρ = 0.5 and ρ = 0.75. Let p = p_n = [4n^1/2] − 5 for n = 100,200,400. Let 1_m/0_m denote a m-vector of 1s/0s. The true coefficients are β* = (3 · 1_q, 3 · 1_q, 3 · 1_q, 0_p−3q)^T and |𝒜| = 3q and q = [p_n/9]. In this example $ν = \frac{1}{2},$ hence we used γ = 3 for computing the adaptive weights in the adaptive Elastic-Net.

For each estimator β̂, its estimation accuracy is measured by the mean squared error (MSE) defined as E[(β̂ − β*)^T Σ(β̂ − β*)]. The variable selection performance is gauged by (C, IC), where C is the number of zero coefficients that are correctly estimated by zero and IC is the number of nonzero coefficients that are incorrectly estimated by zero.

Table 1 documents the simulation results. Several interesting observations can be made.

When the sample size is large (n = 400), the three oracle-like estimators outperform the lasso and the Elastic-Net which do not have the oracle property. That is expected according to the asymptotic theory.
The SCAD and the adaptive Elastic-Net are the best when the sample size is large and the correlation is moderate. However, the SCAD can perform much worse than the adaptive Elastic-Net when the correlation is high (ρ = 0.75) or the sample size is small.
Both the Elastic-Net and the adaptive lasso can do significantly better than the lasso. What is more interesting is that the adaptive Elastic-Net often outperforms the Elastic-Net and the adaptive lasso.

TABLE 1.

Simulation I: model selection and fitting results based on 100 replications.

ρ = 0.5	n	p_n	\|𝒜\|	Model	MSE	C	IC

	100	35	9	Truth		26	0
				Lasso	7.57 (0.31)	24.08	0.01
				ALasso	6.78 (0.42)	25.50	0.42
				Enet	5.91 (0.29)	24.06	0.00
				AEnet	5.07 (0.35)	25.47	0.15
				SCAD	10.55 (0.68)	22.54	0.35

	200	51	15	Truth		36	0
				Lasso	6.63 (0.24)	33.32	0.00
				ALasso	3.78 (0.18)	35.46	0.02
				Enet	4.86 (0.19)	33.36	0.00
				AEnet	3.46 (0.17)	35.47	0.01
				SCAD	4.76 (0.33)	34.63	0.10

	400	75	24	Truth		51	0
				Lasso	4.99 (0.15)	47.31	0.00
				ALasso	2.76 (0.09)	50.33	0.00
				Enet	3.37 (0.12)	48.00	0.00
				AEnet	2.47 (0.08)	50.45	0.00
				SCAD	2.42 (0.09)	50.88	0.00

ρ = 0.75	n	p_n	\|𝒜\|	Model	MSE	C	IC

	100	35	9	Truth		26	0
				Lasso	5.93 (0.26)	24.80	0.14
				ALasso	8.49 (0.39)	25.76	1.84
				Enet	4.18 (0.24)	24.77	0.05
				AEnet	5.24 (0.32)	25.70	0.74
				SCAD	11.59 (0.56)	22.46	1.34

	200	51	15	Truth		36	0
				Lasso	5.10 (0.18)	34.66	0.02
				ALasso	5.32 (0.31)	35.70	0.87
				Enet	3.79 (0.17)	34.79	0.00
				AEnet	3.32 (0.17)	35.80	0.19
				SCAD	5.99 (0.31)	33.10	0.35

	400	75	24	Truth		51	0
				Lasso	3.83 (0.12)	49.03	0.00
				ALasso	2.85 (0.12)	50.53	0.09
				Enet	3.24 (0.11)	49.07	0.00
				AEnet	2.71 (0.09)	50.54	0.03
				SCAD	3.64 (0.17)	48.43	0.09

Open in a new tab

Example 2

We considered the same setup as in example 1, except that we let p = p_n = [4n^2/3] − 5 for n = 100,200,800. Since $ν = \frac{2}{3},$ we used γ = 5 for computing the adaptive weights in the adaptive Elastic-Net and the adaptive lasso. The estimation problem in this example is even more difficult than that in example 1. To see why, note that when n = 200 the dimension increases from 51 in example 1 to 131 in this example, and the intrinsic dimension (|𝒜|) is almost tripled.

The simulation results are presented in Table 2 from which we can see that the three observations made in example 1 are still valid in this example. Furthermore, we see that for every combination of (n,p, |𝒜|,ρ), the adaptive Elastic-Net has the best performance.

TABLE 2.

Example 2: model selection and fitting results based on 100 replications.

ρ = 0.5	n	p_n	\|𝒜\|	Model	MSE	C	IC

	100	81	27	Truth		54	0
				Lasso	31.73 (1.06)	47.06	0.19
				ALasso	28.78 (1.22)	53.01	2.12
				Enet	27.61 (1.04)	46.35	0.13
				AEnet	20.27 (0.94)	53.00	1.15
				SCAD	44.88 (2.65)	47.79	2.37

	200	131	42	Truth		89	0
				Lasso	23.41 (0.67)	80.51	0.00
				ALasso	12.70 (0.48)	87.99	0.14
				Enet	18.94 (0.61)	80.27	0.00
				AEnet	10.68 (0.37)	87.97	0.00
				SCAD	14.14 (0.64)	87.42	0.25

	800	339	111	Truth		228	0
				Lasso	13.72 (0.23)	212.10	0.00
				ALasso	6.44 (0.12)	226.61	0.00
				Enet	11.02 (0.18)	213.91	0.00
				AEnet	6.00 (0.10)	226.75	0.00
				SCAD	7.79 (0.30)	228.00	0.33

ρ = 0.75	n	p_n	\|𝒜\|	Model	MSE	C	IC

	100	81	27	Truth		54	0
				Lasso	22.04 (0.73)	50.74	0.71
				ALasso	33.98 (1.08)	53.73	7.19
				Enet	17.37 (0.62)	50.82	0.46
				AEnet	16.18 (0.80)	53.67	2.36
				SCAD	31.84 (1.77)	50.55	4.74

	200	131	42	Truth		89	0
				Lasso	16.71 (0.50)	85.17	0.06
				ALasso	20.98 (0.92)	88.64	3.98
				Enet	14.12 (0.48)	85.35	0.05
				AEnet	11.16 (0.46)	88.60	0.87
				SCAD	15.27 (0.61)	87.20	1.33

	800	339	111	Truth		228	0
				Lasso	10.01 (0.16)	221.74	0.00
				ALasso	6.39 (0.12)	226.89	0.00
				Enet	8.01 (0.13)	222.74	0.00
				AEnet	6.23 (0.11)	226.94	0.00
				SCAD	6.62 (0.17)	228.00	0.29

Open in a new tab

5. Ultra-high dimensional data

In this section we discuss how the adaptive Elastic-Net can be applied to ultra-high dimensional data in which p > n. When p is much larger than n, Candes & Tao (2007) suggested using the Dantzig selector which can achieve the ideal estimation risk up to a log(p) factor under the uniform uncertainty condition. Fan & Lv (2007) showed that the uniform uncertainty condition may easily fail and the log(p) factor is too large when p is exponentially large. Moreover, the computational cost of the Dantzig selector would be very high when p is large. In order to overcome these difficulties, Fan & Lv (2007) introduced the Sure Independence Screening (SIS) idea which reduces the ultra-high dimensionality to a relatively large scale d_n but d_n < n. Then, the lower dimension methods such as the SCAD can be used to estimate the sparse model. This procedure is referred to as SIS+SCAD. Under regularity conditions, Fan & Lv (2007) proved that SIS misses true features with an exponentially small probability and SIS+SCAD holds the oracle property if $d_{n} = o (n^{\frac{1}{3}}) .$ Furthermore, with the help of SIS, the Dantzig selector can achieve the ideal risk up to a log(d_n) factor, rather than the original log(p).

Inspired by the results of Fan & Lv (2007), we consider combining the adaptive Elastic-Net and SIS when p > n. We first apply SIS to reduce the dimension to d_n and then fit the data by using the adaptive Elastic-Net. We call this procedure SIS+AEnet.

THEOREM 5.1

Suppose the conditions for Theorem 1 in Fan and Lv (2007) hold. Let d_n = O(n^ν), ν < 1, then SIS+AEnet produces an estimator that holds the oracle property.

We make a note here that Theorem 5.1 is a direct consequence of Theorem 1 in Fan & Lv (2007) and Theorem 3.3, thus its proof is omitted. Theorem 5.1 is similar to Theorem 5 in Fan & Lv (2007), but there is a difference. SIS+AEnent can hold the oracle property when d_n exceeds $O (n^{\frac{1}{3}}),$ while Theorem 5 in Fan & Lv (2007) assumes $d_{n} = o (n^{\frac{1}{3}})$ .

To demonstrate SIS+AEnet, we consider the simulation example used in Fan & Lv (2007) (Section 3.3.1). The model is y = x^T β* + 1.5N(0,1), where $β^{*} = {(β_{1}^{T}, 0_{p - | A |})}^{T}$ with |𝒜| = ∀. Here β₁ is a 8-dim vector and each component has the form (−1)^u (a_n + |z|), where $a_{n} = 4 log (n) / \sqrt{n}$ , u is randomly drawn from Ber(0.4) and z is randomly drawn from the standard normal distribution. We generated n = 200 data from the above model. Before applying the adaptive Elastic-Net, we used SIS to reduce the dimensionality from 1000 to $d_{n} = [5.5 n^{\frac{2}{3}}] = 188 .$ The estimation problem is still rather challenging, as we need to estimate 188 parameters by using only 200 observations. From Table 3 we see that SIS+AEnet performs favorably compared to SIS+SCAD.

TABLE 3.

A demonstration of SIS+AEnet: model selection and fitting results based on 100 replications.

d_n = [5.5n^2/3]	Model	MSE	C	IC

188	Truth		992	0
	SIS+AEnet	0.71 (0.18)	987.45	0.05
	SIS+SCAD	1.48 (0.90)	982.20	0.06

Open in a new tab

6. Proofs

PROOF OF THEOREM 3.1

We write

\hat{β} (λ_{2}, 0) = \underset{β}{arg min} ∥ y - X β ∥_{2}^{2} + λ_{2} ∥ β ∥_{2}^{2}

By the definition of β̂_ŵ(λ₂, λ₁) and β̂(λ₂, 0), we know

\begin{matrix} ∥ y - X {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) ∥_{2}^{2} + λ_{2} ∥ {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) ∥_{2}^{2} \\ \geq & ∥ y - X \hat{β} (λ_{2}, 0) ∥_{2}^{2} + λ_{2} ∥ \hat{β} (λ_{2}, 0) ∥_{2}^{2}, \end{matrix}

and

\begin{matrix} ∥ y - X \hat{β} (λ_{2}, 0) ∥_{2}^{2} + λ_{2} ∥ \hat{β} (λ_{2}, 0) ∥_{2}^{2} + λ_{1} \sum_{j = 1}^{p} {\hat{w}}_{j} | \hat{β} {(λ_{2}, 0)}_{j} | \\ \geq & ∥ y - X {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) ∥_{2}^{2} + λ_{2} ∥ {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) ∥_{2}^{2} + λ_{1} \sum_{j = 1}^{p} {\hat{w}}_{j} | {\hat{β}}_{\hat{w}} {(λ_{2}, λ_{1})}_{j} | . \end{matrix}

From the above two inequalities, we have

\begin{matrix} λ_{1} \sum_{j = 1}^{p} {\hat{w}}_{j} (| \hat{β} {(λ_{2}, 0)}_{j} | - | {\hat{β}}_{\hat{w}} {(λ_{2}, λ_{1})}_{j} |) & \geq & (∥ y - X {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) ∥_{2}^{2} + λ_{2} ∥ {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) ∥_{2}^{2}) \\ - (∥ y - X \hat{β} (λ_{2}, 0) ∥_{2}^{2} + λ_{2} ∥ \hat{β} (λ_{2}, 0) ∥_{2}^{2}) . \end{matrix}

(6.1)

On the other hand, we have

\begin{matrix} (∥ y - X {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) ∥_{2}^{2} + λ_{2} ∥ {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) ∥_{2}^{2}) - (∥ y - X \hat{β} (λ_{2}, 0) ∥_{2}^{2} + λ_{2} ∥ \hat{β} (λ_{2}, 0) ∥_{2}^{2}) \\ = & {({\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) - \hat{β} (λ_{2}, 0))}^{T} (X^{T} X + λ_{2} I) ({\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) - \hat{β} (λ_{2}, 0)), \end{matrix}

and

\begin{matrix} \sum_{j = 1}^{p} {\hat{w}}_{j} (| \hat{β} {(λ_{2}, 0)}_{j} | - | {\hat{β}}_{\hat{w}} {(λ_{2}, λ_{1})}_{j} |) & \leq \sum_{j = 1}^{p} {\hat{w}}_{j} | \hat{β} {(λ_{2}, 0)}_{j} - {\hat{β}}_{\hat{w}} {(λ_{2}, λ_{1})}_{j} | \\ \leq \sqrt{\sum_{j = 1}^{p} {\hat{w}}_{j}^{2}} ∥ \hat{β} (λ_{2}, 0) - {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) ∥_{2} . \end{matrix}

Note that λ_min(X^TX + λ₂I) = λ_min(X^TX) + λ₂. Therefore, we end up with

\begin{matrix} (λ_{min} (X^{T} X) + λ_{2}) ∥ {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) - \hat{β} (λ_{2}, 0) ∥_{2}^{2} \\ \leq & {({\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) - \hat{β} (λ_{2}, 0))}^{T} (X^{T} X + λ_{2} I) ({\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) - \hat{β} (λ_{2}, 0)) \\ \leq & λ_{1} \sqrt{\sum_{j = 1}^{p} {\hat{w}}_{j}^{2}} ∥ \hat{β} (λ_{2}, 0) - {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) ∥_{2}, \end{matrix}

(6.2)

which results in the following inequality

∥ {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) - \hat{β} (λ_{2}, 0) ∥_{2} \leq \frac{λ_{1} \sqrt{\sum_{j = 1}^{p} {\hat{w}}_{j}^{2}}}{λ_{min} (X^{T} X) + λ_{2}} .

(6.3)

Note that

\hat{β} (λ_{2}, 0) - β^{*} = - λ_{2} {(X^{T} X + λ_{2} I)}^{- 1} β^{*} + {(X^{T} X + λ_{2} I)}^{- 1} X^{T} ϵ,

which implies that

\begin{matrix} E (∥ \hat{β} (λ_{2}, 0) - β^{*} ∥_{2}^{2}) \\ \leq & 2 λ_{2}^{2} ∥ {(X^{T} X + λ_{2} I)}^{- 1} β^{*} ∥_{2}^{2} + 2 E (∥ {(X^{T} X + λ_{2} I)}^{- 1} X^{T} ϵ ∥_{2}^{2}) \\ \leq & 2 λ_{2}^{2} {(λ_{min} (X^{T} X) + λ_{2})}^{- 2} ∥ β^{*} ∥_{2}^{2} + 2 {(λ_{min} (X^{T} X) + λ_{2})}^{- 2} E (ϵ^{T} X X^{T} ϵ) \\ = & 2 {(λ_{min} (X^{T} X) + λ_{2})}^{- 2} (λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Tr (X^{T} X) σ^{2}) \\ \leq & 2 {(λ_{min} (X^{T} X) + λ_{2})}^{- 2} (λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + p λ_{max} (X^{T} X) σ^{2}) . \end{matrix}

(6.4)

Combing (6.3) and (6.4), we have

\begin{matrix} E (∥ {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) - β^{*} ∥_{2}^{2}) \\ \leq & 2 E (∥ \hat{β} (λ_{2}, 0) - β^{*} ∥_{2}^{2}) + 2 E (∥ {\hat{β}}_{\hat{w}} (λ_{2}, λ_{1}) - \hat{β} (λ_{2}, 0) ∥_{2}^{2}) \\ \leq & \frac{4 λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + 4 p λ_{max} (X^{T} X) σ^{2} + 2 λ_{1}^{2} E [\sum_{j = 1}^{p} {\hat{w}}_{j}^{2}]}{{(λ_{min} (X^{T} X) + λ_{2})}^{2}} \end{matrix}

(6.5)

\leq 4 \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + λ_{1}^{2} E [\sum_{j = 1}^{p} {\hat{w}}_{j}^{2}]}{{(bn + λ_{2})}^{2}} .

(6.6)

We have used condition (A1) in the last inequality. When ŵ_j = 1 for all j, we have

E (∥ \hat{β} (λ_{2}, λ_{1}) - β^{*} ∥_{2}^{2}) \leq 4 \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + p λ_{1}^{2}}{{(bn + λ_{2})}^{2}} .

PROOF OF THEOREM 3.2

We show that $((1 + \frac{λ_{2}}{n}) {\tilde{β}}_{A}^{*}, 0)$ satisfies the Karush-Kuhn-Tucker (KKT) conditions of (2.2) with probability tending to 1. By the definition of ${\tilde{β}}_{A}^{*},$ it suffices to show

Pr (\forall j \in A^{c}, | - 2 X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) | \leq λ_{1}^{*} {\hat{w}}_{j}) \to 1,

or equivalently

Pr (\exists j \in A^{c}, | - 2 X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) | > λ_{1}^{*} {\hat{w}}_{j}) \to 0 .

Let $η = {min}_{j \in A} (| β_{j}^{*} |)$ and $\hat{η} = {min}_{j \in A} (| \hat{β} {(enet)}_{j}^{*} |) .$ We note that

\begin{matrix} Pr (\exists j \in A^{c}, | - 2 X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) | > λ_{1}^{*} {\hat{w}}_{j}) \\ \leq & \sum_{j \in A^{c}} Pr (| - 2 X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) | > λ_{1}^{*} {\hat{w}}_{j}, \tilde{η} > η / 2) + Pr (\tilde{η} \leq η / 2) \end{matrix}

Pr (\tilde{η} \leq η / 2) \leq Pr (∥ \hat{β} (enet) - β^{*} ∥_{2} \geq η / 2) \leq \frac{E (∥ \hat{β} (enet) - β^{*} ∥_{2}^{2})}{η^{2} / 4} .

Then by Theorem 3.1 we obtain

Pr (\hat{η} \leq η / 2) \leq 16 \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + λ_{1}^{2} p}{{(bn + λ_{2})}^{2} η^{2}} .

(6.7)

Moreover, let $M = {(\frac{λ_{1}^{*}}{n})}^{\frac{1}{1 + γ}}$ and we have

\begin{matrix} \sum_{j \in A^{c}} Pr (| - 2 X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) | > λ_{1}^{*} {\hat{w}}_{j}, \hat{η} > η / 2) \\ \leq & \sum_{j \in A^{c}} Pr (| - 2 X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) | > λ_{1}^{*} {\hat{w}}_{j}, \hat{η} > η / 2, | \hat{β} {(enet)}_{j} | \leq M) \\ + \sum_{j \in A^{c}} Pr (| \hat{β} {(enet)}_{j} | > M) \\ \leq & \sum_{j \in A^{c}} Pr (| - 2 X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) | > λ_{1}^{*} M^{- γ}, \hat{η} > η / 2) + \sum_{j \in A^{c}} Pr (| \hat{β} {(enet)}_{j} | > M) \\ \leq & \frac{4 M^{2 γ}}{λ_{1}^{* 2}} E (\sum_{j \in A^{c}} | X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) |^{2} I (\hat{η} > η / 2)) + \frac{1}{M^{2}} E (\sum_{j \in A^{c}} | \hat{β} {(enet)}_{j} |^{2}) \\ \leq & \frac{4 M^{2 γ}}{λ_{1}^{* 2}} E (\sum_{j \in A^{c}} | X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) |^{2} I (\hat{η} > η / 2)) + \frac{E (∥ \hat{β} (enet) - β^{*} ∥_{2}^{2})}{M^{2}} \\ \leq & \frac{4 M^{2 γ}}{λ_{1}^{* 2}} E (\sum_{j \in A^{c}} | X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) |^{2} I (\hat{η} > η / 2)) + 4 \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + λ_{1}^{2} p}{{(bn + λ_{2})}^{2} M^{2}}, \end{matrix}

(6.8)

where we have used Theorem 3.1 in the last step. By the model assumption, we have

\begin{matrix} \sum_{j \in A^{c}} | X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) |^{2} & = \sum_{j \in A^{c}} | X_{j}^{T} (X_{A} β_{A}^{*} - X_{A} {\tilde{β}}_{A}^{*}) + X_{j}^{T} ϵ |^{2} \\ \leq 2 \sum_{j \in A^{c}} | X_{j}^{T} (X_{A} β_{A}^{*} - X_{A} {\tilde{β}}_{A}^{*}) |^{2} + 2 \sum_{j \in A^{c}} | X_{j}^{T} ϵ |^{2} \\ \leq 2 Bn ∥ X_{A} (β_{A}^{*} - {\tilde{β}}_{A}^{*}) ∥_{2}^{2} + 2 Bn ∥ ϵ ∥_{2}^{2} \\ \leq 2 Bn \cdot Bn ∥ β_{A}^{*} - {\tilde{β}}_{A}^{*} ∥_{2}^{2} + 2 Bn ∥ ϵ ∥_{2}^{2}, \end{matrix}

which gives us the below inequality

\begin{matrix} E (\sum_{j \in A^{c}} | X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) |^{2} I (\hat{η} > η / 2)) \\ \leq & 2 B^{2} n^{2} E (∥ β_{A}^{*} - {\tilde{β}}_{A}^{*} ∥_{2}^{2} I (\hat{η} > η / 2)) + 2 Bn σ^{2} . \end{matrix}

(6.9)

We now bound $E (∥ β_{A}^{*} - {\tilde{β}}_{A}^{*} ∥_{2}^{2} I (\hat{η} > η / 2)) .$ Let

{\tilde{β}}_{A}^{*} (λ_{2}, 0) = \underset{β}{arg min} {∥ y - X_{A} β ∥_{2}^{2} + λ_{2} \sum_{j \in A} β_{j}^{2}} .

Then by using the same arguments for deriving (6.1), (6.2) and (6.3), we have

∥ {\tilde{β}}_{A}^{*} - {\tilde{β}}_{A}^{*} (λ_{2}, 0) ∥_{2} \leq \frac{λ_{1}^{*} \cdot {max}_{j \in A} {\hat{w}}_{j} \sqrt{| A |}}{λ_{min} (X_{A}^{T} X_{A}) + λ_{2}} \leq \frac{λ_{1}^{*} {\hat{η}}^{- γ} \sqrt{p}}{bn + λ_{2}}

(6.10)

Note that $λ_{min} (X_{A}^{T} X_{A}) \geq λ_{min} (X^{T} X) \geq bn$ and $λ_{max} (X_{A}^{T} X_{A}) \leq λ_{max} (X^{T} X) \leq Bn .$ Following the rest arguments in the proof of Theorem 3.1, we obtain

\begin{matrix} E (∥ β_{A}^{*} - {\tilde{β}}_{A}^{*} ∥_{2}^{2} I (\hat{η} > η / 2)) \\ \leq & 4 \frac{λ_{2}^{2} ∥ β_{A}^{*} ∥_{2}^{2} + λ_{max} (X_{A}^{T} X_{A}) | A | σ^{2} + λ_{1}^{* 2} {(η / 2)}^{- 2 γ} | A |}{{(λ_{min} (X_{A}^{T} X_{A}) + λ_{2})}^{2}} \\ \leq & 4 \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + λ_{1}^{* 2} {(η / 2)}^{- 2 γ} p}{{(bn + λ_{2})}^{2}} . \end{matrix}

(6.11)

The combination of (6.7), (6.8), (6.9) and (6.11) yields

\begin{matrix} Pr (\exists j \in A^{c}, | - 2 X_{j}^{T} (y - X_{A} {\tilde{β}}_{A}^{*}) | > λ_{1}^{*} {\hat{w}}_{j}) \\ \leq & \frac{4 M^{2 γ} n}{λ_{1}^{* 2}} (8 B^{2} n \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + λ_{1}^{* 2} {(η / 2)}^{- 2 γ} p}{{(bn + λ_{2})}^{2}} + 2 B σ^{2}) \\ + \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + λ_{1}^{2} p}{{(bn + λ_{2})}^{2}} \frac{4}{M^{2}} \\ + \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + λ_{1}^{2} p}{{(bn + λ_{2})}^{2}} \frac{16}{η^{2}} \\ \hat{\equiv} & K_{1} + K_{2} + K_{3} . \end{matrix}

We have chosen $γ > \frac{2 ν}{1 - ν}$ , then under conditions (A1)–(A6) it follows that

\begin{array}{l} K_{1} = O ({(\frac{λ_{1}^{*}}{\sqrt{n}} n^{\frac{(1 + γ) (1 - ν) - 1}{2}})}^{- \frac{2}{1 + γ}}) \to 0, \\ K_{2} = O (\frac{p}{n} {(\frac{n}{λ_{1}^{*}})}^{\frac{2}{1 + γ}}) \to 0, \\ K_{3} = O (\frac{p}{n} \frac{1}{η^{2}}) = O ({(λ_{1}^{*} \sqrt{\frac{p}{n}} η^{- γ})}^{\frac{2}{γ}} {(\frac{p}{n} {(\frac{n}{λ_{1}^{*}})}^{\frac{2}{1 + γ}})}^{\frac{1 + γ}{γ}} p^{- \frac{2}{γ}}) \to 0 . \end{array}

(6.12)

Thus the proof is completed.

PROOF OF THEOREM 3.3

From Theorem 3.2 we have shown that with probability tending to 1 the adaptive Elastic-Net estimator is equal to $((1 + \frac{λ_{2}}{n}) {\tilde{β}}_{A}^{*}, 0) .$ Therefore, in order to prove the model selection consistency result, we only need to show $Pr ({min}_{j \in A} | {\tilde{β}}_{j}^{*} | > 0) \to 1 .$ By (6.10) we have

min_{j \in A} | {\tilde{β}}_{j}^{*} | > min_{j \in A} | {\tilde{β}}^{*} {(λ_{2}, 0)}_{j} | - \frac{λ_{1}^{*} \sqrt{p} {\hat{η}}^{- γ}}{bn + λ_{2}} .

Note that

min_{j \in A} | {\tilde{β}}^{*} {(λ_{2}, 0)}_{j} | > min_{j \in A} | β_{j}^{*} | - ∥ {\tilde{β}}_{A}^{*} (λ_{2}, 0) - β_{A}^{*} ∥_{2}

Following (6.6) it is easy to see that

E (∥ {\tilde{β}}_{A}^{*} (λ_{2}, 0) - β_{A}^{*} ∥_{2}^{2}) \leq 4 \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2}}{{(bn + λ_{2})}^{2}} = O (\frac{p}{n})

Moreover, $\frac{λ_{1}^{*} \sqrt{p} {\hat{η}}^{- γ}}{bn + λ_{2}} = O (\frac{1}{\sqrt{n}}) (\frac{λ_{1}^{*} \sqrt{p}}{\sqrt{n}} η^{- γ}) {(\frac{\hat{η}}{η})}^{- γ}$ and

\begin{matrix} E ({(\frac{\hat{η}}{η})}^{2}) & \leq 2 + \frac{2}{η^{2}} E ({(\hat{η} - η)}^{2}) \\ \leq 2 + \frac{2}{η^{2}} E (∥ \hat{β} (λ_{2}, λ_{1}) - β^{*} ∥_{2}^{2}) \\ \leq 2 + \frac{8}{η^{2}} \frac{λ_{2}^{2} ∥ β^{*} ∥_{2}^{2} + Bpn σ^{2} + λ_{1}^{2} p}{{(bn + λ_{2})}^{2}} . \end{matrix}

In (6.12) we have shown $η^{2} \frac{n}{p} \to \infty .$ Thus

\frac{λ_{1}^{*} \sqrt{p} {\hat{η}}^{- γ}}{bn + λ_{2}} = o (\frac{1}{\sqrt{n}}) O_{P} (1) .

(6.13)

Hence, we have

min_{j \in A} | {\tilde{β}}_{j}^{*} | > η - \sqrt{\frac{p}{n}} O_{P} (1) - o (\frac{1}{\sqrt{n}}) O_{P} (1),

and $Pr ({min}_{j \in A} | {\tilde{β}}_{j}^{*} | > 0) \to 1 .$

We now prove the asymptotic normality. For convenience write

z_{n} = α^{T} \frac{I + λ_{2} \sum_{A}^{- 1}}{1 + \frac{λ_{2}}{n}} \sum_{A}^{\frac{1}{2}} (\hat{β} {(AdaEnet)}_{A} - β_{A}^{*}) .

Note that

\begin{matrix} α^{T} (I + λ_{2} \sum_{A}^{- 1}) \sum_{A}^{\frac{1}{2}} ({\tilde{β}}_{A}^{*} - \frac{β_{A}^{*}}{1 + \frac{λ_{2}}{n}}) \\ = & α^{T} (I + λ_{2} \sum_{A}^{- 1}) \sum_{A}^{\frac{1}{2}} \frac{λ_{2} β_{A}^{*}}{n + λ_{2}} + α^{T} (I + λ_{2} \sum_{A}^{- 1}) \sum_{A}^{\frac{1}{2}} ({\tilde{β}}_{A}^{*} - {\tilde{β}}_{A}^{*} (λ_{2}, 0)) \\ + α^{T} (I + λ_{2} \sum_{A}^{- 1}) \sum_{A}^{\frac{1}{2}} ({\tilde{β}}_{A}^{*} (λ_{2}, 0) - β_{A}^{*}) . \end{matrix}

In addition, we have

(I + λ_{2} \sum_{A}^{- 1}) \sum_{A}^{\frac{1}{2}} ({\tilde{β}}_{A}^{*} (λ_{2}, 0) - β_{A}^{*}) = - λ_{2} \sum_{A}^{- \frac{1}{2}} β_{A}^{*} + \sum_{A}^{- \frac{1}{2}} X_{A}^{T} ϵ .

Therefore, by Theorem 3.2 it follows that with probability tending to 1, z_n = T₁ + T₂ + T₃, where

\begin{array}{l} T_{1} = α^{T} (I + λ_{2} \sum_{A}^{- 1}) \sum_{A}^{\frac{1}{2}} \frac{λ_{2} β_{A}^{*}}{n + λ_{2}} - α^{T} λ_{2} \sum_{A}^{- \frac{1}{2}} β_{A}^{*}, \\ T_{2} = α^{T} (I + λ_{2} \sum_{A}^{- 1}) \sum_{A}^{\frac{1}{2}} ({\tilde{β}}_{A}^{*} - {\tilde{β}}_{A}^{*} (λ_{2}, 0)), \\ T_{3} = α^{T} \sum_{A}^{- \frac{1}{2}} X_{A}^{T} \in . \end{array}

We now show that T₁ = o(1), T₂ = o_P (1) and T₃ → N(0, σ²) in distribution. Then by Slutsky’s theorem we know z_n →_d N(0, σ²). By (A1) and α^Tα = 1, we have

\begin{matrix} T_{1}^{2} & \leq 2 ∥ (I + λ_{2} \sum_{A}^{- 1}) \sum_{A}^{\frac{1}{2}} \frac{λ_{2} β_{A}^{*}}{n + λ_{2}} ∥_{2}^{2} + 2 ∥ λ_{2} \sum_{A}^{- \frac{1}{2}} β_{A}^{*} ∥_{2}^{2} \\ \leq 2 \frac{λ_{2}^{2}}{{(n + λ_{2})}^{2}} ∥ \sum_{A}^{\frac{1}{2}} β_{A}^{*} ∥_{2}^{2} {(1 + \frac{λ_{2}}{bn})}^{2} + 2 λ^{2} ∥ β_{A}^{*} ∥_{2}^{2} \frac{1}{bn} \\ \leq \frac{2 λ_{2}^{2} Bn}{{(n + λ_{2})}^{2}} {(1 + \frac{λ_{2}}{bn})}^{2} ∥ β_{A}^{*} ∥_{2}^{2} + 2 λ^{2} ∥ β_{A}^{*} ∥_{2}^{2} \frac{1}{bn} . \end{matrix}

Hence it follows by (A6) that T₁ = o(1). Similarly, we can bound T₂ as follows

\begin{matrix} T_{2}^{2} & \leq {(1 + \frac{λ_{2}}{bn})}^{2} ∥ \sum_{A}^{\frac{1}{2}} ({\tilde{β}}_{A}^{*} - {\tilde{β}}_{A}^{*} (λ_{2}, 0)) ∥_{2}^{2} \\ \leq {(1 + \frac{λ_{2}}{bn})}^{2} Bn ∥ {\tilde{β}}_{A}^{*} - {\tilde{β}}_{A}^{*} (λ_{2}, 0) ∥_{2}^{2} \\ \leq {(1 + \frac{λ_{2}}{bn})}^{2} Bn {(\frac{λ_{1}^{*} {\hat{η}}^{- γ}}{bn + λ_{2}})}^{2} \end{matrix}

where we have used (6.10) in the last step. Then (6.13) tells us that $T_{2}^{2} = \frac{1}{n^{2}} O_{P} (1)$ . Next we consider T₃. Let X_𝒜[i,] denote the ith row of the matrix X_𝒜. With such notation we can write $T_{3} = \sum_{i = 1}^{n} r_{i} ϵ_{i},$ where $r_{i} = α^{T} {(X_{A}^{T} X_{A})}^{- \frac{1}{2}} {(X_{A} [i,])}^{T}$ . Then it is easy to see that

\begin{matrix} \sum_{i = 1}^{n} r_{i}^{2} & = \sum_{i = 1}^{n} α^{T} {(X_{A}^{T} X_{A})}^{- \frac{1}{2}} {(X_{A} [i,])}^{T} (X_{A} [i,]) {(X_{A}^{T} X_{A})}^{- \frac{1}{2}} α \\ = α^{T} {(X_{A}^{T} X_{A})}^{- \frac{1}{2}} (X_{A}^{T} X_{A}) {(X_{A}^{T} X_{A})}^{- \frac{1}{2}} α \\ = α^{T} α = 1 . \end{matrix}

(6.14)

Furthermore, we have for k = 2 + δ, δ > 0

\sum_{i = 1}^{n} E [| ϵ_{i} |^{2 + δ}] | r_{i}^{2 + δ} | \leq E [| ϵ |^{2 + δ}] (\sum_{i = 1}^{n} | r_{i}^{2} | (max_{i} | r_{i} |^{δ})) = E [| ϵ |^{2 + δ}] {(max_{i} | r_{i}^{2} |)}^{\frac{δ}{2}} .

Note that $r_{i}^{2} \leq ∥ \sum_{A}^{- \frac{1}{2}} {(X_{A} [i,])}^{T} \leq (\sum_{j \in A} x_{i j}^{2}) (λ_{max} (\sum_{A}^{- 1})) \leq \frac{\sum_{j = 1}^{p} x_{i j}^{2}}{bn}$ . Hence,

\sum_{i = 1}^{n} E [| ϵ_{i} |^{2 + δ}] | r_{i}^{2 + δ} | \leq E [| ϵ |^{2 + δ}] {(\frac{{max}_{i} (\sum_{j = 1}^{p} x_{ij}^{2})}{bn})}^{\frac{δ}{2}} \to 0 .

(6.15)

From (6.14) and (6.15) Lyapunov conditions for the central limit theorem are established. Thus, T₃ →_d N(0, σ²). This completes the proof.

Acknowledgments

We sincerely thank an associate editor and referees for their help comments and suggestions.

Contributor Information

Hui Zou, School of Statistics, University of Minnesota, Minneapolis, Mn 55455, E-Mail: hzou@stat.umn.edu.

Hao Helen Zhang, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, E-Mail: hzhang2@stat.ncsu.edu.

REFERENCES

Breiman L. ‘Heuristics of instability and stabilization in model selection’. The Annals of Statistics. 1996;24:2350–2383. [Google Scholar]
Candes E, Tao T. ‘The dantzig selector: statistical estimation when p is much larger than n’. The Annals of Statistics, to appear. 2007 [Google Scholar]
Candes E, Wakin M, Boyd S. California Institute of Technology; 2007. Enhancing sparsity by reweighted ℓ1 minimization, Technical report. [Google Scholar]
Donoho D, Johnstone I. ‘Ideal spatial adaptation via wavelet shrinkage’. Biometrika. 1994;81:425–455. [Google Scholar]
Donoho D, Johnstone I, Kerkyacharian G, Picard D. ‘Wavelet shrinkage: asymptopia? (with discussion)’. Journal of Royal Statistical Society, Series B. 1995;57:301–337. [Google Scholar]
Efron B, Hastie T, Johnstone I, Tibshirani R. ‘Least angle regression’. The Annals of Statistics. 2004;32:407–499. [Google Scholar]
Fan J, Li R. ‘Variable selection via nonconcave penalized likelihood and its oracle properties’. Journal of the American Statistical Association. 2001;96:1348–1360. [Google Scholar]
Fan J, Li R. ‘Statistical challenges with high dimensionality: Feature selection in knowledge discovery’. Proceedings of the Madrid International Congress of Mathematicians 2006. 2006;Vol. III:595–622. [Google Scholar]
Fan J, Lv J. Department of Operations Research and Financial Engineering, Princeton University; 2007. Sure independence screening for ultra-high dimensional feature space, Technical report. [Google Scholar]
Fan J, Peng H. ‘Nonconcave penalized likelihood with a diverging number of parameters’. The Annals of Statistics. 2004;32:928–961. [Google Scholar]
Fan J, Peng H, Huang T. ‘Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency (with discussion)’. Journal of the American Statistical Association. 2005;100:781–813. [Google Scholar]
Huber P. ‘Robust regression: Asymptotics, conjectures and monte carlo’. The Annals of Statistics. 1988;1:799–821. [Google Scholar]
Knight K, Fu W. ‘Asymptotics for lasso-type estimators’. The Annals of Statistics. 2000;28:1356–1378. [Google Scholar]
Lam C, Fan J. ‘Profile-kernel likelihood inference with diverging number of parameters’. The Annals of Statistics. 2007 doi: 10.1214/07-AOS544. to appear. [DOI] [PMC free article] [PubMed] [Google Scholar]
Portnoy S. ‘Asymptotic behavior of M-estimatiors of p regression parameters when p2/n is large. I. consistency’. The Annals of Statistics. 1984;12:1298–1309. [Google Scholar]
Tibshirani R. ‘Regression shrinkage and selection via the lasso’. Journal of the Royal Statistical Society, Series B. 1996;58:267–288. [Google Scholar]
Wang H, Li R, Tsai C. ‘Tuning parameter selectors for the smoothly clipped absolute deviation method’. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zou H. ‘The adaptive lasso and its oracle properties’. Journal of the American Statistical Association. 2006;101:1418–1429. [Google Scholar]
Zou H, Hastie T. ‘Regularization and variable selection via the elastic net’. Journal of the Royal Statistical Society, Series B. 2005;67:301–320. [Google Scholar]
Zou H, Hastie T, Tibshirani R. ‘On the degrees of freedom of the lasso’. The Annals of Statistics. 2007;35:2173–2192. [Google Scholar]

[R1] Breiman L. ‘Heuristics of instability and stabilization in model selection’. The Annals of Statistics. 1996;24:2350–2383. [Google Scholar]

[R2] Candes E, Tao T. ‘The dantzig selector: statistical estimation when p is much larger than n’. The Annals of Statistics, to appear. 2007 [Google Scholar]

[R3] Candes E, Wakin M, Boyd S. California Institute of Technology; 2007. Enhancing sparsity by reweighted ℓ1 minimization, Technical report. [Google Scholar]

[R4] Donoho D, Johnstone I. ‘Ideal spatial adaptation via wavelet shrinkage’. Biometrika. 1994;81:425–455. [Google Scholar]

[R5] Donoho D, Johnstone I, Kerkyacharian G, Picard D. ‘Wavelet shrinkage: asymptopia? (with discussion)’. Journal of Royal Statistical Society, Series B. 1995;57:301–337. [Google Scholar]

[R6] Efron B, Hastie T, Johnstone I, Tibshirani R. ‘Least angle regression’. The Annals of Statistics. 2004;32:407–499. [Google Scholar]

[R7] Fan J, Li R. ‘Variable selection via nonconcave penalized likelihood and its oracle properties’. Journal of the American Statistical Association. 2001;96:1348–1360. [Google Scholar]

[R8] Fan J, Li R. ‘Statistical challenges with high dimensionality: Feature selection in knowledge discovery’. Proceedings of the Madrid International Congress of Mathematicians 2006. 2006;Vol. III:595–622. [Google Scholar]

[R9] Fan J, Lv J. Department of Operations Research and Financial Engineering, Princeton University; 2007. Sure independence screening for ultra-high dimensional feature space, Technical report. [Google Scholar]

[R10] Fan J, Peng H. ‘Nonconcave penalized likelihood with a diverging number of parameters’. The Annals of Statistics. 2004;32:928–961. [Google Scholar]

[R11] Fan J, Peng H, Huang T. ‘Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency (with discussion)’. Journal of the American Statistical Association. 2005;100:781–813. [Google Scholar]

[R12] Huber P. ‘Robust regression: Asymptotics, conjectures and monte carlo’. The Annals of Statistics. 1988;1:799–821. [Google Scholar]

[R13] Knight K, Fu W. ‘Asymptotics for lasso-type estimators’. The Annals of Statistics. 2000;28:1356–1378. [Google Scholar]

[R14] Lam C, Fan J. ‘Profile-kernel likelihood inference with diverging number of parameters’. The Annals of Statistics. 2007 doi: 10.1214/07-AOS544. to appear. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Portnoy S. ‘Asymptotic behavior of M-estimatiors of p regression parameters when p2/n is large. I. consistency’. The Annals of Statistics. 1984;12:1298–1309. [Google Scholar]

[R16] Tibshirani R. ‘Regression shrinkage and selection via the lasso’. Journal of the Royal Statistical Society, Series B. 1996;58:267–288. [Google Scholar]

[R17] Wang H, Li R, Tsai C. ‘Tuning parameter selectors for the smoothly clipped absolute deviation method’. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] Zou H. ‘The adaptive lasso and its oracle properties’. Journal of the American Statistical Association. 2006;101:1418–1429. [Google Scholar]

[R19] Zou H, Hastie T. ‘Regularization and variable selection via the elastic net’. Journal of the Royal Statistical Society, Series B. 2005;67:301–320. [Google Scholar]

[R20] Zou H, Hastie T, Tibshirani R. ‘On the degrees of freedom of the lasso’. The Annals of Statistics. 2007;35:2173–2192. [Google Scholar]

PERMALINK

ON THE ADAPTIVE ELASTIC-NET WITH A DIVERGING NUMBER OF PARAMETERS

Hui Zou

Hao Helen Zhang

Abstract

1. Introduction

1.1. Background

1.2. Two fundamental issues with the ℓ₁ penalty

1.3. The adaptive Elastic-Net

2. Method

3. Statistical Theory

THEOREM 3.1

THEOREM 3.2

THEOREM 3.3

4. Numerical Studies

Example 1

TABLE 1.

Example 2

TABLE 2.

5. Ultra-high dimensional data

THEOREM 5.1

TABLE 3.

6. Proofs

PROOF OF THEOREM 3.1

PROOF OF THEOREM 3.2

PROOF OF THEOREM 3.3

Acknowledgments

Contributor Information

REFERENCES

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

ON THE ADAPTIVE ELASTIC-NET WITH A DIVERGING NUMBER OF PARAMETERS

Hui Zou

Hao Helen Zhang

Abstract

1. Introduction

1.1. Background

1.2. Two fundamental issues with the ℓ1 penalty

1.3. The adaptive Elastic-Net

2. Method

3. Statistical Theory

THEOREM 3.1

THEOREM 3.2

THEOREM 3.3

4. Numerical Studies

Example 1

TABLE 1.

Example 2

TABLE 2.

5. Ultra-high dimensional data

THEOREM 5.1

TABLE 3.

6. Proofs

PROOF OF THEOREM 3.1

PROOF OF THEOREM 3.2

PROOF OF THEOREM 3.3

Acknowledgments

Contributor Information

REFERENCES

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

1.2. Two fundamental issues with the ℓ₁ penalty