Abstract
We consider the problem of model selection and estimation in situations where the number of parameters diverges with the sample size. When the dimension is high, an ideal method should have the oracle property (Fan and Li, 2001; Fan and Peng, 2004) which ensures the optimal large sample performance. Furthermore, the high-dimensionality often induces the collinearity problem which should be properly handled by the ideal method. Many existing variable selection methods fail to achieve both goals simultaneously. In this paper, we propose the adaptive Elastic-Net that combines the strengths of the quadratic regularization and the adaptively weighted lasso shrinkage. Under weak regularity conditions, we establish the oracle property of the adaptive Elastic-Net. We show by simulations that the adaptive Elastic-Net deals with the collinearity problem better than the other oracle-like methods, thus enjoying much improved finite sample performance.
Keywords and phrases: Adaptive regularization, Elastic-Net, High dimensionality, Model selection, Oracle property, Shrinkage methods
1. Introduction
1.1. Background
Consider the problem of model selection and estimation in the classical linear regression model
(1.1)  y = Xβ* + ε,
where y = (y1,…,yn)T is the response vector and xj = (x1j,…,xnj)T, j = 1,…,p, are the linearly independent predictors. Let X = [x1, ···, xp] be the predictor matrix and let ε denote the error vector. Without loss of generality we assume the data are centered, so the intercept is not included in the regression function. Throughout this paper, we assume the errors are independently and identically distributed with zero mean and finite variance σ². We are interested in the sparse modeling problem where the true model has a sparse representation, i.e., some components of β* are exactly zero. Let 𝒜 = {j : β*j ≠ 0}. In this work we call the size of 𝒜 the intrinsic dimension of the underlying model. We wish to discover the set 𝒜 and estimate the corresponding coefficients.
Variable selection is fundamentally important for knowledge discovery with high-dimensional data (Fan & Li 2006), and it can greatly enhance the prediction performance of the fitted model. Traditional model selection procedures follow best-subset selection and its stepwise variants. However, best-subset selection is computationally prohibitive when the number of predictors is large. Furthermore, as analyzed by Breiman (1996), subset selection is unstable, and thus the resulting model can have poor prediction accuracy. To overcome these fundamental drawbacks of subset selection, statisticians have recently proposed various penalization methods to perform simultaneous model selection and estimation. In particular, the lasso (Tibshirani 1996) and the SCAD (Fan & Li 2001) are two very popular methods due to their good computational and statistical properties. Efron, Hastie, Johnstone & Tibshirani (2004) proposed the LARS algorithm for computing the entire lasso solution path. Knight & Fu (2000) studied the asymptotic properties of the lasso. Fan & Li (2001) showed that the SCAD enjoys the oracle property, that is, the SCAD estimator can perform as well as the oracle if the penalization parameter is appropriately chosen.
1.2. Two fundamental issues with the ℓ1 penalty
The lasso estimator (Tibshirani 1996) is obtained by solving the ℓ1 penalized least squares problem
(1.2)  β̂(lasso) = arg min_β { ‖y − Xβ‖² + λ‖β‖₁ },
where ‖β‖₁ = |β1| + ⋯ + |βp| is the ℓ1-norm of β. The ℓ1 penalty enables the lasso to simultaneously regularize the least squares fit and shrink some components of β̂(lasso) to zero for an appropriately chosen λ. The entire lasso solution path can be computed by the LARS algorithm (Efron et al. 2004). These nice properties make the lasso a very popular variable selection method.
Despite its popularity, the lasso has two serious drawbacks: the lack of the oracle property and instability with high-dimensional data. First of all, the lasso does not have the oracle property. Fan & Li (2001) first pointed out that asymptotically the lasso has non-ignorable bias for estimating the nonzero coefficients. They further conjectured that the lasso may not have the oracle property because of this bias problem. This conjecture was recently proven in Zou (2006). Zou (2006) further showed that the lasso can be inconsistent for model selection unless the predictor matrix (or the design matrix) satisfies a rather strong condition. Zou (2006) proposed the following adaptive lasso estimator
(1.3)  β̂(AdaLasso) = arg min_β { ‖y − Xβ‖² + λ ∑_j ŵj|βj| },
where the ŵj are adaptive data-driven weights, computed as ŵj = |β̂ini,j|^(−γ), where γ is a positive constant and β̂ini is an initial root-n-consistent estimate of β. Zou (2006) showed that with an appropriately chosen λ, the adaptive lasso performs as well as the oracle. Candes, Wakin & Boyd (2007) used the adaptive lasso idea to enhance sparsity in sparse signal recovery via reweighted ℓ1 minimization.
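As an illustration of how the adaptive lasso can be computed in practice (this sketch is ours, not part of the original paper), any plain lasso solver suffices after rescaling the columns of X by the weights: minimizing ‖y − X̃θ‖² + λ‖θ‖₁ with x̃j = xj/ŵj and setting β̂j = θ̂j/ŵj recovers (1.3). A minimal Python sketch, assuming scikit-learn is available and taking OLS as the initial root-n-consistent estimate:

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam, gamma=1.0):
    """Adaptive lasso via the column-rescaling trick (a sketch, not the paper's code).

    Solves  min_b ||y - X b||^2 + lam * sum_j w_j |b_j|,
    with weights w_j = |b_ini_j|^(-gamma) built from an OLS initial fit.
    """
    n, p = X.shape
    b_ini = np.linalg.lstsq(X, y, rcond=None)[0]   # initial root-n consistent estimate (OLS)
    w = np.abs(b_ini) ** (-gamma)                  # adaptive weights
    X_tilde = X / w                                # rescale columns: x_j / w_j
    # sklearn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1,
    # so alpha = lam / (2n) matches the penalty above.
    fit = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=50_000).fit(X_tilde, y)
    return fit.coef_ / w                           # undo the rescaling
```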
Secondly the ℓ1 penalization methods can have very poor performance when there are highly correlated variables in the predictor set. The collinearity problem is often encountered in high-dimensional data analysis. Even when the predictors are independent, as long as the dimension is high, the maximum sample correlation can be large, as shown in Fan & Lv (2007). Collinearity can severely degrade the performance of the lasso. As shown in Zou & Hastie (2005), the lasso solution paths are unstable when predictors are highly correlated. Zou & Hastie (2005) proposed the Elastic-Net as an improved version of the lasso for analyzing high-dimensional data. The Elastic-Net estimator is defined as follows:
(1.4)  β̂(enet) = (1 + λ2/n) { arg min_β ‖y − Xβ‖² + λ2‖β‖² + λ1‖β‖₁ }.
If the predictors are standardized (each variable has mean zero and L2-norm one), then the factor (1 + λ2/n) should be changed to (1 + λ2), as in Zou & Hastie (2005). The ℓ1 part of the Elastic-Net performs automatic variable selection, while the ℓ2 part stabilizes the solution paths and hence improves the prediction. In an orthogonal design, where the lasso is shown to be optimal (Donoho, Johnstone, Kerkyacharian & Picard 1995), the Elastic-Net automatically reduces to the lasso. However, when the correlations among the predictors become high, the Elastic-Net can significantly improve the prediction accuracy of the lasso.
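For readers who wish to reproduce (1.4) numerically, the criterion can be mapped onto scikit-learn's ElasticNet, which minimizes (1/(2n))‖y − Xβ‖² + α·l1_ratio·‖β‖₁ + (α/2)(1 − l1_ratio)‖β‖². The conversion below is our own sketch (the function name and interface are ours), and the final (1 + λ2/n) rescaling follows the definition above:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def enet_paper_parameterization(X, y, lam1, lam2):
    """Elastic-Net as in (1.4): (1 + lam2/n) * argmin ||y-Xb||^2 + lam2||b||^2 + lam1||b||_1.

    Sketch only: converts the paper's (lam1, lam2) to sklearn's (alpha, l1_ratio).
    """
    n = X.shape[0]
    # Divide the paper's criterion by 2n to match sklearn's objective:
    #   alpha * l1_ratio       = lam1 / (2n)
    #   alpha * (1 - l1_ratio) = lam2 / n
    alpha = lam1 / (2 * n) + lam2 / n
    l1_ratio = (lam1 / (2 * n)) / alpha
    fit = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                     fit_intercept=False, max_iter=50_000).fit(X, y)
    return (1 + lam2 / n) * fit.coef_
```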
1.3. The adaptive Elastic-Net
The adaptively weighted ℓ1 penalty and the Elastic-Net penalty improve the lasso in two different directions. The adaptive lasso achieves the oracle property of the SCAD and the Elastic-Net handles the collinearity. However, following the arguments in Zou & Hastie (2005) and Zou (2006), we can easily see that the adaptive lasso inherits the instability of the lasso for high-dimensional data, while the Elastic-Net lacks the oracle property. Thus, it is natural to consider combining the ideas of the adaptively weighted ℓ1 penalty and the Elastic-Net regularization to obtain a better method which can improve the lasso in both directions. To this end, we propose the adaptive Elastic-Net that penalizes the squared error loss using a combination of the ℓ2 penalty and the adaptive ℓ1 penalty. Since the adaptive Elastic-Net is designed for high-dimensional data analysis, we study its asymptotic properties under the assumption that the dimension diverges with the sample size.
Pioneering papers on asymptotic theory with a diverging number of parameters include Huber (1973) and Portnoy (1984), which studied M-estimators. Recently, Fan, Peng & Huang (2005) studied a semi-parametric model with a growing number of nuisance parameters, whereas Lam & Fan (2007) investigated profile likelihood ratio inference with a growing number of parameters. In particular, our work is influenced by Fan & Peng (2004), who studied the oracle property of nonconcave penalized likelihood estimators. Fan & Peng (2004) provocatively argued why it is important to study the validity of the oracle property when the dimension diverges. We would like to know whether the adaptive Elastic-Net enjoys the oracle property with a diverging number of predictors. This question will be thoroughly investigated in this paper.
The rest of the paper is organized as follows. In Section 2 we introduce the adaptive Elastic-Net. Statistical theory, including the oracle property, of the adaptive Elastic-Net is established in Section 3. In Section 4 we use simulation to compare the finite sample performance of the adaptive Elastic-Net with the SCAD and other competitors. Section 5 discusses how to combine SIS of Fan & Lv (2007) and the adaptive Elastic-Net to deal with the ultra-high dimension cases. Technical proofs are presented in Section 6.
2. Method
The adaptive Elastic-Net can be viewed as a combination of the Elastic-Net and the adaptive lasso. Suppose we first compute the Elastic-Net estimator β̂(enet) as defined in (1.4), and then we construct the adaptive weights by
(2.1)  ŵj = (|β̂j(enet)|)^(−γ),  j = 1, …, p,
where γ is a positive constant. Now we solve the following optimization problem to get the adaptive Elastic-Net estimates
(2.2)  β̂(AdaEnet) = (1 + λ2/n) { arg min_β ‖y − Xβ‖² + λ2‖β‖² + λ1* ∑_j ŵj|βj| }.
From now on, we write β̂ = β̂(AdaEnet) for the sake of convenience.
If we force λ2 to be zero in (2.2), then the adaptive Elastic-Net reduces to the adaptive lasso. Following the arguments in Zou & Hastie (2005), we can easily show that in an orthogonal design the adaptive Elastic-Net reduces to the adaptive lasso, regardless of the value of λ2; a short verification is sketched below. This is desirable because in that setting the adaptive lasso achieves the optimal minimax risk bound (Zou 2006). The role of the ℓ2 penalty in (2.2) is to further regularize the adaptive lasso fit whenever collinearity may cause serious trouble.
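To make the orthogonal-design claim concrete, here is a short verification (ours, not the paper's; it assumes the rescaled form of (2.2) given above, XᵀX = nI, and writes zj = xjᵀy):

```latex
% Coordinatewise criterion under X^T X = nI (dropping terms free of \beta_j):
-2 z_j \beta_j + (n + \lambda_2)\beta_j^2 + \lambda_1^{*}\hat{w}_j|\beta_j|
\;\Longrightarrow\;
\tilde{\beta}_j
  = \frac{\operatorname{sgn}(z_j)\bigl(|z_j| - \lambda_1^{*}\hat{w}_j/2\bigr)_{+}}{n + \lambda_2}.
% After the (1 + \lambda_2/n) rescaling in (2.2):
\hat{\beta}_j(\mathrm{AdaEnet})
  = \Bigl(1 + \tfrac{\lambda_2}{n}\Bigr)\tilde{\beta}_j
  = \frac{\operatorname{sgn}(z_j)\bigl(|z_j| - \lambda_1^{*}\hat{w}_j/2\bigr)_{+}}{n},
% which does not depend on \lambda_2 and matches the adaptive lasso solution.
```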
We know the Elastic-Net naturally adopts a sparse representation. One can use ŵj = (|β̂j(enet)| + 1/n)^(−γ) to avoid division by zero. We can also define ŵj = ∞ when β̂j(enet) = 0. Let 𝒜̂enet = {j : β̂j(enet) ≠ 0} and let 𝒜̂ᶜenet denote its complement set. Then we have β̂j(AdaEnet) = 0 for all j ∈ 𝒜̂ᶜenet, and

(2.3)  β̂(AdaEnet)_{𝒜̂enet} = (1 + λ2/n) { arg min_β ‖y − X_{𝒜̂enet}β‖² + λ2‖β‖² + λ1* ∑_{j∈𝒜̂enet} ŵj|βj| },

where β in (2.3) is a vector of length |𝒜̂enet|, the size of 𝒜̂enet.
The ℓ1 regularization parameters, λ1 and λ1*, are directly responsible for the sparsity of the estimates, and their values are allowed to be different. On the other hand, we use the same λ2 for the ℓ2 penalty component in the Elastic-Net and the adaptive Elastic-Net estimators, because the ℓ2 penalty offers the same kind of contribution in both estimators.
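To summarize Section 2 in code, here is a minimal two-stage sketch (ours; the paper does not prescribe an implementation, and scikit-learn is assumed). Stage 1 computes the Elastic-Net fit, the weights follow (2.1) with the 1/n correction mentioned above, and stage 2 solves (2.2) by absorbing the ℓ2 penalty through data augmentation and the usual column-rescaling trick for the weighted ℓ1 penalty:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

def _enet(X, y, lam1, lam2):
    """argmin ||y - Xb||^2 + lam2 ||b||^2 + lam1 ||b||_1 via sklearn's ElasticNet."""
    n = X.shape[0]
    alpha = lam1 / (2 * n) + lam2 / n
    l1_ratio = (lam1 / (2 * n)) / alpha
    return ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                      fit_intercept=False, max_iter=50_000).fit(X, y).coef_

def adaptive_enet(X, y, lam1, lam2, lam1_star, gamma):
    """Two-stage adaptive Elastic-Net sketch following (2.1)-(2.2)."""
    n, p = X.shape
    b_enet = _enet(X, y, lam1, lam2)                   # stage 1: Elastic-Net fit
    w = (np.abs(b_enet) + 1.0 / n) ** (-gamma)         # weights (2.1) with the 1/n correction
    # Stage 2: absorb the ridge part via augmented data, then use the
    # column-rescaling trick to turn the weighted l1 penalty into a plain one.
    X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])  # ||y - Xb||^2 + lam2||b||^2
    y_aug = np.concatenate([y, np.zeros(p)])           #   = ||y_aug - X_aug b||^2
    fit = Lasso(alpha=lam1_star / (2 * (n + p)), fit_intercept=False,
                max_iter=50_000).fit(X_aug / w, y_aug)
    return (1 + lam2 / n) * fit.coef_ / w              # undo rescaling; (1 + lam2/n) factor of (2.2)
```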
3. Statistical Theory
In our theoretical analysis, we assume the following regularity conditions throughout.
- (A1) We use λmin(M) and λmax(M) to denote the minimum and maximum eigenvalues of a positive definite matrix M, respectively. Then we assume

  b ≤ λmin(XᵀX/n) ≤ λmax(XᵀX/n) ≤ B,

  where b and B are two positive constants.

- (A2)

- (A3) E[|ε|^(2+δ)] < ∞ for some δ > 0.

- (A4) p = pn = O(n^ν) for some 0 ≤ ν < 1.

To construct the adaptive weights ŵ, we take a fixed γ such that γ > 2ν/(1 − ν). In our numerical studies we let γ = ⌈2ν/(1 − ν)⌉ + 1 to avoid tuning γ. Once γ is chosen, we choose the regularization parameters λ1, λ2 and λ1* according to the following conditions:

- (A5)

- (A6)
Conditions (A1) and (A2) assume that the predictor matrix is reasonably well behaved. Similar conditions were considered in Portnoy (1984). Note that in the linear regression setting, condition (A1) is exactly condition (F) in Fan & Peng (2004). Condition (A3) is used to establish the asymptotic normality of β̂(AdaEnet).
It is worth pointing out that condition (A4) is weaker than that used in Fan & Peng (2004), in which p is assumed to satisfy p⁴/n → 0 or at most p³/n → 0. This means their results require ν < 1/3 at best; our theory removes this limitation. For any 0 ≤ ν < 1, we can choose an appropriate γ to construct the adaptive weights, and the oracle property holds as long as p = O(n^ν). Also note that in the finite dimension setting ν = 0, thus any positive γ can be used, which agrees with the results in Zou (2006).
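For the two simulation designs considered in Section 4 (pn = O(n^(1/2)) and pn = O(n^(2/3))), the rule γ = ⌈2ν/(1 − ν)⌉ + 1 stated above yields the values used there; the arithmetic is spelled out in the short worked example below (ours, assuming that rule).

```latex
% Example 1: p_n = O(n^{1/2}), so \nu = 1/2:
\frac{2\nu}{1-\nu} = \frac{2\cdot 1/2}{1/2} = 2,
\qquad \gamma = \lceil 2\rceil + 1 = 3.
% Example 2: p_n = O(n^{2/3}), so \nu = 2/3:
\frac{2\nu}{1-\nu} = \frac{4/3}{1/3} = 4,
\qquad \gamma = \lceil 4\rceil + 1 = 5.
```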
Condition (A6) is similar to condition (H) in Fan & Peng (2004). Basically, condition (A6) allows the nonzero coefficients to vanish but at a rate that can be distinguished by the penalized least squares. In the finite dimension setting the condition is implicitly assumed.
THEOREM 3.1
Given the data (y, X), let ŵ = (ŵ1, …, ŵp) be a vector whose components are all non-negative and can depend on (y, X). Define

β̂ŵ(λ2, λ1) = arg min_β { ‖y − Xβ‖² + λ2‖β‖² + λ1 ∑_j ŵj|βj| }

for non-negative parameters λ2 and λ1. If ŵj = 1 for all j, we denote β̂ŵ(λ2, λ1) by β̂(λ2, λ1) for convenience.
If we assume the model (1.1) and condition (A1), then
In particular, when ŵj = 1 for all j, we have
It is worth mentioning that the derived risk bounds are non-asymptotic. Theorem 3.1 is very useful for the asymptotic analysis. A direct corollary of Theorem 3.1 is that, under conditions (A1)–(A6), β̂(λ2, λ1) is a root-(n/p)-consistent estimator. This convergence rate is the same as that of the SCAD (Fan & Peng 2004). The root-(n/p) consistency result suggests that it is appropriate to use the Elastic-Net to construct the adaptive weights.
THEOREM 3.2
Let us write 𝒜 = {j : β*j ≠ 0} as before, and define the oracle-assisted estimator β̂(oracle) by β̂j(oracle) = 0 for j ∉ 𝒜 and

(3.1)  β̂_𝒜(oracle) = (1 + λ2/n) { arg min_β ‖y − X_𝒜β‖² + λ2‖β‖² + λ1* ∑_{j∈𝒜} ŵj|βj| }.

Then, with probability tending to 1, β̂(oracle) is the solution to (2.2).
Theorem 3.2 provides an asymptotic characterization of the solution to the adaptive Elastic-Net criterion. The definition of β̂(oracle) borrows the concept of the “oracle” (Donoho & Johnstone 1994, Fan & Li 2001, Fan & Peng 2004, Zou 2006). If there were an oracle informing us of the true subset model, then we would use this oracle information and the adaptive Elastic-Net criterion would become that in (2.3) with 𝒜 in place of 𝒜̂enet. Theorem 3.2 tells us that, asymptotically speaking, the adaptive Elastic-Net works as if it had such oracle information. Theorem 3.2 also suggests that the adaptive Elastic-Net should enjoy the oracle property, which is confirmed in the next theorem.
THEOREM 3.3
Under conditions (A1)–(A6), the adaptive Elastic-Net has the oracle property, that is, the estimator β̂(AdaEnet) must satisfy:
Consistency in selection : Pr ({j : β̂(AdaEnet)j ≠ 0} = 𝒜) → 1,
Asymptotic normality : αᵀ(X_𝒜ᵀX_𝒜)^(1/2) ( β̂(AdaEnet)_𝒜 − β*_𝒜 ) →d N(0, σ²), where β̂(AdaEnet)_𝒜 and β*_𝒜 denote the subvectors indexed by 𝒜 and α is a vector of norm 1.
By Theorem 3.3, the selection consistency and the asymptotic normality of the adaptive Elastic-Net remain valid when the number of parameters diverges. Technically speaking, the selection consistency result is stronger than what Theorem 3.2 implies, although Theorem 3.2 plays an important role in the proof of Theorem 3.3. As a special case, when we let λ2 = 0, which is a choice satisfying conditions (A5) and (A6), Theorem 3.3 tells us that the adaptive lasso enjoys the selection consistency and the asymptotic normality.
4. Numerical Studies
In this section we present simulations to study the finite sample performance of the adaptive Elastic-Net. We considered five methods in the simulation study: the lasso (Lasso), the Elastic-Net (Enet), the adaptive lasso (ALasso), the adaptive Elastic-Net (AEnet) and the SCAD. In our implementation, we let λ2 = 0 in the adaptive Elastic-Net to get the adaptive lasso fit. There are several commonly used tuning parameter selection methods, such as cross-validation, generalized cross-validation (GCV), AIC and BIC. Zou, Hastie & Tibshirani (2007) suggested using BIC to select the lasso tuning parameter. Wang, Li & Tsai (2007) showed that for the SCAD, BIC is a better tuning parameter selector than GCV and AIC. In this work, we used BIC to select the tuning parameter for each method.
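As an illustration of the BIC tuning step (our own sketch; the paper gives no implementation details), one can compare candidate fits with the familiar linear-model BIC, using the number of nonzero coefficients as the degrees of freedom in the spirit of Zou, Hastie & Tibshirani (2007):

```python
import numpy as np

def bic_select(X, y, fits):
    """Pick the fit minimizing BIC = n*log(RSS/n) + log(n)*df.

    `fits` is a dict {tuning_value: coefficient_vector}; df is taken to be the
    number of nonzero coefficients (Zou, Hastie & Tibshirani 2007). Sketch only.
    """
    n = X.shape[0]
    best, best_bic = None, np.inf
    for tune, beta in fits.items():
        rss = np.sum((y - X @ beta) ** 2)
        df = np.count_nonzero(beta)
        bic = n * np.log(rss / n) + np.log(n) * df
        if bic < best_bic:
            best, best_bic = tune, bic
    return best
```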
Fan & Peng (2004) considered simulation models in which pn grows slowly with n and |𝒜| = 5. Our theory allows pn = O(n^ν) for any ν < 1. Thus, we are interested in models in which pn = O(n^ν) with ν = 1/2 and ν = 2/3. In addition, we allow the intrinsic dimension |𝒜| to diverge with the sample size as well, because such designs make model selection and estimation more challenging than in the fixed-|𝒜| situation.
Example 1
We generated data from the linear regression model

y = xᵀβ* + ε,

where β* is a p-dimensional vector, ε ∼ N(0, σ²) with σ = 6, and x follows a p-dimensional multivariate normal distribution with zero mean and covariance Σ whose (j, k) entry is Σj,k = ρ^|j−k|, 1 ≤ j, k ≤ p. We considered ρ = 0.5 and ρ = 0.75. Let p = pn = [4n^(1/2)] − 5 for n = 100, 200, 400, and let 1m/0m denote an m-vector of 1s/0s. The true coefficients are β* = (3 · 1q, 3 · 1q, 3 · 1q, 0p−3q)ᵀ with |𝒜| = 3q and q = [pn/9]. In this example ν = 1/2, hence we used γ = 3 for computing the adaptive weights in the adaptive Elastic-Net.
For each estimator β̂, its estimation accuracy is measured by the mean squared error (MSE) defined as E[(β̂ − β*)T Σ(β̂ − β*)]. The variable selection performance is gauged by (C, IC), where C is the number of zero coefficients that are correctly estimated by zero and IC is the number of nonzero coefficients that are incorrectly estimated by zero.
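For concreteness, here is a minimal sketch (ours) that generates one data set following the design of example 1 and evaluates the MSE and (C, IC) criteria defined above:

```python
import numpy as np

def make_example1(n, rho, sigma=6.0, rng=None):
    """Generate one data set following example 1: AR(1) design, p = [4*sqrt(n)] - 5."""
    rng = np.random.default_rng(rng)
    p = int(4 * np.sqrt(n)) - 5
    q = p // 9
    beta = np.concatenate([3.0 * np.ones(3 * q), np.zeros(p - 3 * q)])   # |A| = 3q
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p))) # Sigma_jk = rho^|j-k|
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta, Sigma

def mse(beta_hat, beta, Sigma):
    """MSE criterion (beta_hat - beta)' Sigma (beta_hat - beta) for one replication."""
    d = beta_hat - beta
    return float(d @ Sigma @ d)

def c_ic(beta_hat, beta):
    """C: true zeros estimated as zero; IC: true nonzeros estimated as zero."""
    zero_hat = beta_hat == 0
    return int(np.sum(zero_hat & (beta == 0))), int(np.sum(zero_hat & (beta != 0)))
```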
Table 1 documents the simulation results. Several interesting observations can be made.
When the sample size is large (n = 400), the three oracle-like estimators outperform the lasso and the Elastic-Net which do not have the oracle property. That is expected according to the asymptotic theory.
The SCAD and the adaptive Elastic-Net are the best when the sample size is large and the correlation is moderate. However, the SCAD can perform much worse than the adaptive Elastic-Net when the correlation is high (ρ = 0.75) or the sample size is small.
Both the Elastic-Net and the adaptive lasso can do significantly better than the lasso. What is more interesting is that the adaptive Elastic-Net often outperforms the Elastic-Net and the adaptive lasso.
TABLE 1.
| ρ | n | pn | ∣𝒜∣ | Model | MSE | C | IC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.5 | 100 | 35 | 9 | Truth | | 26 | 0 |
| | | | | Lasso | 7.57 (0.31) | 24.08 | 0.01 |
| | | | | ALasso | 6.78 (0.42) | 25.50 | 0.42 |
| | | | | Enet | 5.91 (0.29) | 24.06 | 0.00 |
| | | | | AEnet | 5.07 (0.35) | 25.47 | 0.15 |
| | | | | SCAD | 10.55 (0.68) | 22.54 | 0.35 |
| 0.5 | 200 | 51 | 15 | Truth | | 36 | 0 |
| | | | | Lasso | 6.63 (0.24) | 33.32 | 0.00 |
| | | | | ALasso | 3.78 (0.18) | 35.46 | 0.02 |
| | | | | Enet | 4.86 (0.19) | 33.36 | 0.00 |
| | | | | AEnet | 3.46 (0.17) | 35.47 | 0.01 |
| | | | | SCAD | 4.76 (0.33) | 34.63 | 0.10 |
| 0.5 | 400 | 75 | 24 | Truth | | 51 | 0 |
| | | | | Lasso | 4.99 (0.15) | 47.31 | 0.00 |
| | | | | ALasso | 2.76 (0.09) | 50.33 | 0.00 |
| | | | | Enet | 3.37 (0.12) | 48.00 | 0.00 |
| | | | | AEnet | 2.47 (0.08) | 50.45 | 0.00 |
| | | | | SCAD | 2.42 (0.09) | 50.88 | 0.00 |
| 0.75 | 100 | 35 | 9 | Truth | | 26 | 0 |
| | | | | Lasso | 5.93 (0.26) | 24.80 | 0.14 |
| | | | | ALasso | 8.49 (0.39) | 25.76 | 1.84 |
| | | | | Enet | 4.18 (0.24) | 24.77 | 0.05 |
| | | | | AEnet | 5.24 (0.32) | 25.70 | 0.74 |
| | | | | SCAD | 11.59 (0.56) | 22.46 | 1.34 |
| 0.75 | 200 | 51 | 15 | Truth | | 36 | 0 |
| | | | | Lasso | 5.10 (0.18) | 34.66 | 0.02 |
| | | | | ALasso | 5.32 (0.31) | 35.70 | 0.87 |
| | | | | Enet | 3.79 (0.17) | 34.79 | 0.00 |
| | | | | AEnet | 3.32 (0.17) | 35.80 | 0.19 |
| | | | | SCAD | 5.99 (0.31) | 33.10 | 0.35 |
| 0.75 | 400 | 75 | 24 | Truth | | 51 | 0 |
| | | | | Lasso | 3.83 (0.12) | 49.03 | 0.00 |
| | | | | ALasso | 2.85 (0.12) | 50.53 | 0.09 |
| | | | | Enet | 3.24 (0.11) | 49.07 | 0.00 |
| | | | | AEnet | 2.71 (0.09) | 50.54 | 0.03 |
| | | | | SCAD | 3.64 (0.17) | 48.43 | 0.09 |
Example 2
We considered the same setup as in example 1, except that we let p = pn = [4n^(2/3)] − 5 for n = 100, 200, 800. Since ν = 2/3, we used γ = 5 for computing the adaptive weights in the adaptive Elastic-Net and the adaptive lasso. The estimation problem in this example is even more difficult than that in example 1. To see why, note that when n = 200 the dimension increases from 51 in example 1 to 131 in this example, and the intrinsic dimension |𝒜| is almost tripled.
The simulation results are presented in Table 2 from which we can see that the three observations made in example 1 are still valid in this example. Furthermore, we see that for every combination of (n,p, |𝒜|,ρ), the adaptive Elastic-Net has the best performance.
TABLE 2.
| ρ | n | pn | ∣𝒜∣ | Model | MSE | C | IC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.5 | 100 | 81 | 27 | Truth | | 54 | 0 |
| | | | | Lasso | 31.73 (1.06) | 47.06 | 0.19 |
| | | | | ALasso | 28.78 (1.22) | 53.01 | 2.12 |
| | | | | Enet | 27.61 (1.04) | 46.35 | 0.13 |
| | | | | AEnet | 20.27 (0.94) | 53.00 | 1.15 |
| | | | | SCAD | 44.88 (2.65) | 47.79 | 2.37 |
| 0.5 | 200 | 131 | 42 | Truth | | 89 | 0 |
| | | | | Lasso | 23.41 (0.67) | 80.51 | 0.00 |
| | | | | ALasso | 12.70 (0.48) | 87.99 | 0.14 |
| | | | | Enet | 18.94 (0.61) | 80.27 | 0.00 |
| | | | | AEnet | 10.68 (0.37) | 87.97 | 0.00 |
| | | | | SCAD | 14.14 (0.64) | 87.42 | 0.25 |
| 0.5 | 800 | 339 | 111 | Truth | | 228 | 0 |
| | | | | Lasso | 13.72 (0.23) | 212.10 | 0.00 |
| | | | | ALasso | 6.44 (0.12) | 226.61 | 0.00 |
| | | | | Enet | 11.02 (0.18) | 213.91 | 0.00 |
| | | | | AEnet | 6.00 (0.10) | 226.75 | 0.00 |
| | | | | SCAD | 7.79 (0.30) | 228.00 | 0.33 |
| 0.75 | 100 | 81 | 27 | Truth | | 54 | 0 |
| | | | | Lasso | 22.04 (0.73) | 50.74 | 0.71 |
| | | | | ALasso | 33.98 (1.08) | 53.73 | 7.19 |
| | | | | Enet | 17.37 (0.62) | 50.82 | 0.46 |
| | | | | AEnet | 16.18 (0.80) | 53.67 | 2.36 |
| | | | | SCAD | 31.84 (1.77) | 50.55 | 4.74 |
| 0.75 | 200 | 131 | 42 | Truth | | 89 | 0 |
| | | | | Lasso | 16.71 (0.50) | 85.17 | 0.06 |
| | | | | ALasso | 20.98 (0.92) | 88.64 | 3.98 |
| | | | | Enet | 14.12 (0.48) | 85.35 | 0.05 |
| | | | | AEnet | 11.16 (0.46) | 88.60 | 0.87 |
| | | | | SCAD | 15.27 (0.61) | 87.20 | 1.33 |
| 0.75 | 800 | 339 | 111 | Truth | | 228 | 0 |
| | | | | Lasso | 10.01 (0.16) | 221.74 | 0.00 |
| | | | | ALasso | 6.39 (0.12) | 226.89 | 0.00 |
| | | | | Enet | 8.01 (0.13) | 222.74 | 0.00 |
| | | | | AEnet | 6.23 (0.11) | 226.94 | 0.00 |
| | | | | SCAD | 6.62 (0.17) | 228.00 | 0.29 |
5. Ultra-high dimensional data
In this section we discuss how the adaptive Elastic-Net can be applied to ultra-high dimensional data in which p > n. When p is much larger than n, Candes & Tao (2007) suggested using the Dantzig selector, which can achieve the ideal estimation risk up to a log(p) factor under the uniform uncertainty condition. Fan & Lv (2007) showed that the uniform uncertainty condition may easily fail and that the log(p) factor is too large when p is exponentially large. Moreover, the computational cost of the Dantzig selector would be very high when p is large. In order to overcome these difficulties, Fan & Lv (2007) introduced the Sure Independence Screening (SIS) idea, which reduces the ultra-high dimensionality to a relatively large scale dn with dn < n. Then lower-dimensional methods such as the SCAD can be used to estimate the sparse model. This procedure is referred to as SIS+SCAD. Under regularity conditions, Fan & Lv (2007) proved that SIS misses true features with an exponentially small probability and that SIS+SCAD holds the oracle property provided dn does not grow too quickly (cf. the growth conditions of Fan & Peng (2004) discussed in Section 3). Furthermore, with the help of SIS, the Dantzig selector can achieve the ideal risk up to a log(dn) factor, rather than the original log(p).
Inspired by the results of Fan & Lv (2007), we consider combining the adaptive Elastic-Net and SIS when p > n. We first apply SIS to reduce the dimension to dn and then fit the data by using the adaptive Elastic-Net. We call this procedure SIS+AEnet.
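A minimal sketch of the SIS+AEnet pipeline follows (ours; the screening step ranks predictors by marginal correlation with the response, in the spirit of Fan & Lv 2007, and `fit_fn` can be any low-dimensional fitting routine such as the adaptive Elastic-Net sketch in Section 2):

```python
import numpy as np

def sis_screen(X, y, d):
    """Sure Independence Screening sketch: keep the d predictors with the largest
    marginal correlation with y, in the spirit of Fan & Lv (2007)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(score)[::-1][:d]        # indices of the d top-ranked predictors

def sis_aenet(X, y, d, fit_fn, **fit_kwargs):
    """SIS+AEnet sketch: screen down to d predictors, then fit the reduced model
    with `fit_fn` (e.g. the adaptive_enet sketch from Section 2)."""
    keep = sis_screen(X, y, d)
    beta = np.zeros(X.shape[1])
    beta[keep] = fit_fn(X[:, keep], y, **fit_kwargs)
    return beta
```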
THEOREM 5.1
Suppose the conditions for Theorem 1 in Fan & Lv (2007) hold. Let dn = O(n^ν) with ν < 1; then SIS+AEnet produces an estimator that holds the oracle property.
We note here that Theorem 5.1 is a direct consequence of Theorem 1 in Fan & Lv (2007) and Theorem 3.3, so its proof is omitted. Theorem 5.1 is similar to Theorem 5 in Fan & Lv (2007), but there is a difference: SIS+AEnet can hold the oracle property whenever dn = O(n^ν) for some ν < 1, whereas Theorem 5 in Fan & Lv (2007) assumes a more restrictive growth rate for dn.
To demonstrate SIS+AEnet, we consider the simulation example used in Fan & Lv (2007) (Section 3.3.1). The model is y = xᵀβ* + 1.5 N(0, 1), where β* = (β1ᵀ, 0ᵀ)ᵀ with |𝒜| = 8. Here β1 is an 8-dimensional vector and each component has the form (−1)^u (an + |z|), where an is a positive constant specified in Fan & Lv (2007), u is randomly drawn from Ber(0.4) and z is randomly drawn from the standard normal distribution. We generated n = 200 observations from the above model with p = 1000. Before applying the adaptive Elastic-Net, we used SIS to reduce the dimensionality from 1000 to dn = [5.5n^(2/3)] = 188. The estimation problem is still rather challenging, as we need to estimate 188 parameters using only 200 observations. From Table 3 we see that SIS+AEnet performs favorably compared to SIS+SCAD.
TABLE 3.
| dn = [5.5n^(2/3)] | Model | MSE | C | IC |
| --- | --- | --- | --- | --- |
| 188 | Truth | | 992 | 0 |
| | SIS+AEnet | 0.71 (0.18) | 987.45 | 0.05 |
| | SIS+SCAD | 1.48 (0.90) | 982.20 | 0.06 |
6. Proofs
PROOF OF THEOREM 3.1
We write
By the definition of β̂ŵ(λ2, λ1) and β̂(λ2, 0), we know
and
From the above two inequalities, we have
(6.1) |
On the other hand, we have
and
Note that λmin(XTX + λ2I) = λmin(XTX) + λ2. Therefore, we end up with
(6.2) |
which results in the following inequality
(6.3) |
Note that
which implies that
(6.4) |
Combining (6.3) and (6.4), we have
(6.5) |
(6.6) |
We have used condition (A1) in the last inequality. When ŵj = 1 for all j, we have
PROOF OF THEOREM 3.2
We show that β̂(oracle) satisfies the Karush-Kuhn-Tucker (KKT) conditions of (2.2) with probability tending to 1. By the definition of β̂(oracle), it suffices to show
or equivalently
Let and We note that
Then by Theorem 3.1 we obtain
(6.7) |
Moreover, let and we have
(6.8) |
where we have used Theorem 3.1 in the last step. By the model assumption, we have
which gives us the below inequality
(6.9) |
We now bound Let
Then by using the same arguments for deriving (6.1), (6.2) and (6.3), we have
(6.10) |
Note that and Following the rest arguments in the proof of Theorem 3.1, we obtain
(6.11) |
The combination of (6.7), (6.8), (6.9) and (6.11) yields
We have chosen , then under conditions (A1)–(A6) it follows that
(6.12) |
Thus the proof is completed.
PROOF OF THEOREM 3.3
From Theorem 3.2 we have shown that, with probability tending to 1, the adaptive Elastic-Net estimator is equal to β̂(oracle). Therefore, in order to prove the model selection consistency result, we only need to show that every component of β̂_𝒜(oracle) is nonzero with probability tending to 1. By (6.10) we have
Note that
Following (6.6) it is easy to see that
Moreover, and
In (6.12) we have shown Thus
(6.13) |
Hence, we have
and
We now prove the asymptotic normality. For convenience write
Note that
In addition, we have
Therefore, by Theorem 3.2 it follows that with probability tending to 1, zn = T1 + T2 + T3, where
We now show that T1 = o(1), T2 = oP (1) and T3 → N(0, σ2) in distribution. Then by Slutsky’s theorem we know zn →d N(0, σ2). By (A1) and αTα = 1, we have
Hence it follows by (A6) that T1 = o(1). Similarly, we can bound T2 as follows
where we have used (6.10) in the last step. Then (6.13) tells us that . Next we consider T3. Let X𝒜[i,] denote the ith row of the matrix X𝒜. With such notation we can write where . Then it is easy to see that
(6.14) |
Furthermore, we have for k = 2 + δ, δ > 0
Note that . Hence,
(6.15) |
From (6.14) and (6.15), the Lyapunov condition for the central limit theorem is established. Thus, T3 →d N(0, σ²). This completes the proof.
Acknowledgments
We sincerely thank an associate editor and the referees for their helpful comments and suggestions.
Contributor Information
Hui Zou, School of Statistics, University of Minnesota, Minneapolis, MN 55455. E-mail: hzou@stat.umn.edu.

Hao Helen Zhang, Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203. E-mail: hzhang2@stat.ncsu.edu.
REFERENCES
- Breiman L. ‘Heuristics of instability and stabilization in model selection’. The Annals of Statistics. 1996;24:2350–2383.
- Candes E, Tao T. ‘The Dantzig selector: statistical estimation when p is much larger than n’. The Annals of Statistics, to appear. 2007.
- Candes E, Wakin M, Boyd S. ‘Enhancing sparsity by reweighted ℓ1 minimization’. Technical report, California Institute of Technology. 2007.
- Donoho D, Johnstone I. ‘Ideal spatial adaptation via wavelet shrinkage’. Biometrika. 1994;81:425–455.
- Donoho D, Johnstone I, Kerkyacharian G, Picard D. ‘Wavelet shrinkage: asymptopia? (with discussion)’. Journal of the Royal Statistical Society, Series B. 1995;57:301–337.
- Efron B, Hastie T, Johnstone I, Tibshirani R. ‘Least angle regression’. The Annals of Statistics. 2004;32:407–499.
- Fan J, Li R. ‘Variable selection via nonconcave penalized likelihood and its oracle properties’. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Li R. ‘Statistical challenges with high dimensionality: feature selection in knowledge discovery’. Proceedings of the Madrid International Congress of Mathematicians. 2006;Vol. III:595–622.
- Fan J, Lv J. ‘Sure independence screening for ultra-high dimensional feature space’. Technical report, Department of Operations Research and Financial Engineering, Princeton University. 2007.
- Fan J, Peng H. ‘Nonconcave penalized likelihood with a diverging number of parameters’. The Annals of Statistics. 2004;32:928–961.
- Fan J, Peng H, Huang T. ‘Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency (with discussion)’. Journal of the American Statistical Association. 2005;100:781–813.
- Huber P. ‘Robust regression: asymptotics, conjectures and Monte Carlo’. The Annals of Statistics. 1973;1:799–821.
- Knight K, Fu W. ‘Asymptotics for lasso-type estimators’. The Annals of Statistics. 2000;28:1356–1378.
- Lam C, Fan J. ‘Profile-kernel likelihood inference with diverging number of parameters’. The Annals of Statistics. 2007, to appear. doi:10.1214/07-AOS544.
- Portnoy S. ‘Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency’. The Annals of Statistics. 1984;12:1298–1309.
- Tibshirani R. ‘Regression shrinkage and selection via the lasso’. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Wang H, Li R, Tsai C. ‘Tuning parameter selectors for the smoothly clipped absolute deviation method’. Biometrika. 2007;94:553–568. doi:10.1093/biomet/asm053.
- Zou H. ‘The adaptive lasso and its oracle properties’. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Hastie T. ‘Regularization and variable selection via the elastic net’. Journal of the Royal Statistical Society, Series B. 2005;67:301–320.
- Zou H, Hastie T, Tibshirani R. ‘On the degrees of freedom of the lasso’. The Annals of Statistics. 2007;35:2173–2192.