Abstract
In regression problems where covariates can be naturally grouped, the group Lasso is an attractive method for variable selection since it respects the grouping structure in the data. We study the selection and estimation properties of the group Lasso in high-dimensional settings when the number of groups exceeds the sample size. We provide sufficient conditions under which the group Lasso selects a model whose dimension is comparable with the underlying model with high probability and is estimation consistent. However, the group Lasso is, in general, not selection consistent and also tends to select groups that are not important in the model. To improve the selection results, we propose an adaptive group Lasso method which is a generalization of the adaptive Lasso and requires an initial estimator. We show that the adaptive group Lasso is consistent in group selection under certain conditions if the group Lasso is used as the initial estimator.
Keywords: group selection, high-dimensional data, penalized regression, rate consistency, selection consistency
1. Introduction
Consider the linear regression model with p groups of covariates,

Yi = ∑k=1,…,p X′ikβk + εi,  i = 1, …, n,
where Yi is the response variable, εi is the error term, Xik is a dk × 1 covariate vector representing the kth group and βk is the corresponding dk × 1 vector of regression coefficients. For such a model, the group Lasso (Antoniadis and Fan (2001), Yuan and Lin (2006)) is an attractive method for variable selection since it respects the grouping structure in the covariates. This method is a natural extension of the Lasso (Tibshirani (1996)), in which an ℓ2-norm of the coefficients associated with a group of variables is used as a component in the penalty function. However, the group Lasso is, in general, not selection consistent and tends to select more groups than there are in the model. To improve the selection results, we consider an adaptive group Lasso method which is a generalization of the adaptive Lasso (Zou (2006)). We provide sufficient conditions under which the adaptive group Lasso is selection consistent if the group Lasso is used as the initial estimator.
The need to select groups of variables arises in many statistical modeling problems and applications. For example, in multifactor analysis of variance, a factor with multiple levels can be represented by a group of dummy variables. In nonparametric additive regression, each component can be expressed as a linear combination of a set of basis functions. In both cases, the selection of important factors or nonparametric components amounts to the selection of groups of variables. Several recent papers have considered group selection using penalized methods. In addition to the group Lasso, Yuan and Lin (2006) have proposed the group Lars and group non-negative garrote methods. Kim, Kim and Kim (2006) considered the group Lasso in the context of generalized linear models. Zhao, Rocha and Yu (2009) proposed a composite absolute penalty for group selection, which can be considered a generalization of the group Lasso. Meier, van de Geer and Bühlmann (2008) studied the group Lasso for logistic regression. Huang, Ma, Xie and Zhang (2009) proposed a group bridge method that can be used for simultaneous group and individual variable selection.
There has been much work on penalized methods for variable selection and estimation with high-dimensional data. Several approaches have been proposed, including the least absolute shrinkage and selection operator (Lasso, Tibshirani (1996)), the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li (2001), Fan and Peng (2004)), the elastic net (Enet) penalty (Zou and Hastie (2005)) and the minimax concave penalty (Zhang (2007)). Much progress has been made in understanding the statistical properties of these methods in both fixed p and p ≫ n settings. In particular, several recent studies considered the Lasso with regard to its variable selection, estimation and prediction properties; see, for example, Knight and Fu (2000), Greenshtein and Ritov (2004), Meinshausen and Bühlmann (2006), Zhao and Yu (2006), Huang, Ma and Zhang (2008), van de Geer (2008) and Zhang and Huang (2008), among others. All of these studies are concerned with the Lasso for individual variable selection.
In this article, we study the asymptotic properties of the group Lasso and the adaptive group Lasso in high-dimensional settings when p ≫ n. We generalize the results concerning the Lasso obtained in Zhang and Huang (2008) to the group Lasso. We show that, under a generalized sparsity condition and the sparse Riesz condition, as well as certain regularity conditions, the group Lasso selects a model whose dimension has the same order as the underlying model, selects all groups whose ℓ2-norms are of greater order than the bias of the selected model and is estimation consistent. In addition, under a narrow-sense sparsity condition (see (2.3) in Section 2) and using the group Lasso as the initial estimator, the adaptive group Lasso can correctly select important groups with high probability.
Our theoretical and simulation results suggest the following one-step approach to group selection in high-dimensional settings. First, we use the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We then use the adaptive group Lasso to select the final set of groups of variables. Since the computation of the adaptive group Lasso estimator can be carried out using the same algorithm and program for the group Lasso, the computational cost of this one-step approach is approximately twice that of a single group Lasso computation. This approach, iteratively using the group Lasso twice, follows the idea of the adaptive Lasso (Zou (2006)) and a proposal by Bühlmann and Meier (2008) in the context of individual variable selection.
The rest of the paper is organized as follows. In Section 2, we state the results on the selection, the bias of the selected model and the convergence rate of the group Lasso estimator. In Section 3, we describe the selection and estimation consistency results concerning the adaptive group Lasso. In Section 4, we use simulation to compare the group Lasso and the adaptive group Lasso. Concluding remarks are given in Section 5. Proofs are given in Section 6.
2. The asymptotic properties of the group Lasso
Let Y = (Y1, …, Yn)′ and X = (X1, …, Xp), where Xk is the n × dk covariate submatrix corresponding to the kth group. For a given penalty level λ ≥ 0, the group Lasso estimator of β = (β′1, …, β′p)′ is

(2.1)  β̂ ≡ β̂(λ) = arg minβ {||Y − ∑k=1,…,p Xkβk||2² + λ ∑k=1,…,p ||βk||2},

where ||βk||2 = (β′kβk)^{1/2} is the ℓ2-norm of βk.
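The criterion (2.1) can be minimized by a blockwise soft-thresholding (proximal gradient) iteration. The following is a minimal sketch under our own naming conventions; it is not the group LARS algorithm used later in the paper, just an illustration of how the ℓ2 group penalty zeroes whole blocks:

```python
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=1000):
    """Proximal-gradient minimizer of ||y - X b||_2^2 + lam * sum_k ||b_k||_2.

    groups: list of integer index arrays, one per group (a partition of the columns).
    """
    n, p = X.shape
    beta = np.zeros(p)
    # Step size 1/L, where L = 2 * sigma_max(X)^2 is the Lipschitz constant
    # of the gradient of the squared-error loss.
    step = 1.0 / (2 * np.linalg.norm(X, ord=2) ** 2)
    for _ in range(n_iter):
        z = beta - step * 2 * X.T @ (X @ beta - y)   # gradient step on ||y - Xb||^2
        for g in groups:                              # blockwise soft-thresholding
            norm = np.linalg.norm(z[g])
            z[g] = 0.0 if norm <= step * lam else (1 - step * lam / norm) * z[g]
        beta = z
    return beta
```

With λ large enough, entire blocks are set exactly to zero, which is what makes the group Lasso a group-selection method.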
We consider the model selection and estimation properties of β̂ under a generalized sparsity condition (GSC) of the model and a sparse Riesz condition (SRC) on the covariate matrix. These two conditions were first formulated in the study of the Lasso estimator (Zhang and Huang (2008)). The GSC assumes that for some η1 ≥ 0, there exists an A0 ⊂ {1, …, p} such that Σk∈A0||βk||2 ≤ η1, where || · ||2 denotes the ℓ2-norm. Without loss of generality, let A0 = {q + 1, …, p}. The GSC is then
(2.2)  ∑k∈A0 ||βk||2 = ∑k=q+1,…,p ||βk||2 ≤ η1.
The number of truly important groups is thus q. A more rigid way to describe sparsity is to assume η1 = 0, that is,
(2.3)  βk = 0 for every k ∈ A0 = {q + 1, …, p}.
This is a special case of the GSC and we call it the narrow-sense sparsity condition (NSC). In practice, the GSC is a more realistic formulation of a sparse model. However, the NSC can often be considered a reasonable approximation to the GSC, especially when η1 is smaller than the noise level associated with model fitting.
The SRC controls the range of eigenvalues of submatrices of the design. For A ⊂ {1, …, p}, we define XA = (Xk, k ∈ A); note that XA is an n × ∑k∈A dk matrix. The design matrix X satisfies the sparse Riesz condition (SRC) with rank q* and spectrum bounds 0 < c_* ≤ c^* < ∞ if

(2.4)  c_* ≤ ||XAν||2²/(n||ν||2²) ≤ c^*  for every A containing at most q* groups and every ν ≠ 0.
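For small problems, the SRC spectrum bounds can be computed by brute force: take the extreme eigenvalues of X′AXA/n over all group subsets A with at most q* groups. A sketch (function and variable names are ours):

```python
import numpy as np
from itertools import combinations

def src_bounds(X, groups, q_star):
    """Smallest and largest eigenvalues of X_A' X_A / n over all subsets A
    containing at most q_star groups (brute force; exponential in q_star)."""
    n = X.shape[0]
    c_min, c_max = np.inf, 0.0
    for m in range(1, q_star + 1):
        for A in combinations(range(len(groups)), m):
            cols = np.concatenate([groups[k] for k in A])
            eig = np.linalg.eigvalsh(X[:, cols].T @ X[:, cols] / n)
            c_min, c_max = min(c_min, eig[0]), max(c_max, eig[-1])
    return c_min, c_max
```

For a design with orthonormal columns scaled by √n, both bounds equal 1, the most favorable case.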
Let  = {k: ||β̂k||2 > 0, 1 ≤ k ≤ p}, which is the set of indices of the groups selected by the group Lasso. An important quantity is the cardinality of Â, defined as
(2.5)  q̂ ≡ #Â = #{k: ||β̂k||2 > 0, 1 ≤ k ≤ p},
which determines the dimension of the selected model. If q̂ = O(q), then the selected model has dimension comparable to that of the underlying model. Following Zhang and Huang (2008), we also consider two measures of the quality of the selected model. The first measures its error:

(2.6)  ω̃² ≡ ||(I − P̂)Xβ||2²,

where P̂ is the projection matrix from Rⁿ to the linear span of the set of selected groups and I ≡ In×n is the identity matrix. Thus, ω̃² is the sum of squares of the part of the mean vector not accounted for by the selected model. To measure the important groups missing from the selected model, we define
(2.7)  ζ2² ≡ ∑k≤q, k∉Â ||βk||2².
We now describe several quantities that will be used in stating the main results. Let da = max1≤k≤p dk, db = min1≤k≤p dk, d = da/db and c̄ = c^*/c_*. Define
(2.8) |
where η2 ≡ maxA⊂A0 ||∑k∈A Xkβk||2,
(2.9) |
(2.10) |
(2.11) |
Let , where c0 ≥ 0 and an ≥ 0 satisfy pda/(Nd ∨ an)^{1+c0} → 0, and let λ0 = inf{λ: M1q + 1 ≤ q*}, where inf Ø = ∞. We also consider the constraint
(2.12) |
For large p, the lower bound here is allowed to be λn,p = 2σ{8(1 + c0)da d² q* c̄ n c^* log(Nd)}^{1/2} with an = 0; for fixed p, an → ∞ is required.
We assume the following basic condition:

(C1) The errors ε1, …, εn are independent and identically distributed as N(0, σ²).
Theorem 2.1
Suppose that q ≥ 1 and that (C1), the GSC (2.2) and SRC (2.4) are satisfied. Let q̂, ω̃ and ζ2 be defined as in (2.5), (2.6) and (2.7), respectively, for the model  selected by the group Lasso from (2.1). Let M1, M2 and M3 be defined as in (2.9), (2.10) and (2.11), respectively. If the constraint (2.12) is satisfied, then the following assertions hold with probability converging to 1:
where .
Remark 2.1
The condition q ≥ 1 is not necessary since it is only used to express quantities in terms of ratios in (2.8) and Theorem 2.1. If q = 0, we use and to recover M1, M2 and M3 in (2.9), (2.10), (2.11), respectively, giving the results and .
Remark 2.2
If η1 = 0 in (2.2), then r1 = r2 = 0 and
all of which depend only on d and c̄. This suggests that the relative sizes of the groups affect the selection results. Since d ≥ 1, the most favorable case is d = 1, that is, when the groups have equal sizes.
Remark 2.3
If d1 = · · · = dp = 1, the group Lasso simplifies to the Lasso and Theorem 2.1 is a direct generalization of Theorem 1 on the selection properties of the Lasso obtained by Zhang and Huang (2008). In particular, when d1 = · · · = dp = 1, r1, r2, M1, M2, M3 are the same as the constants in Theorem 1 of Zhang and Huang (2008).
Remark 2.4
A more general definition of the group Lasso is

(2.13)  β̂ = arg minβ {||Y − ∑k=1,…,p Xkβk||2² + λ ∑k=1,…,p (β′kRkβk)^{1/2}},

where Rk is a dk × dk positive definite matrix. This is useful when certain relationships among the coefficients need to be specified. By the Cholesky decomposition, there exists a matrix Qk such that Rk = Q′kQk. Let β*k = Qkβk and X*k = XkQk⁻¹. Then, (2.13) becomes the standard criterion (2.1) with (X*k, β*k) in place of (Xk, βk). The GSC for (2.13) is ∑k∈A0 (β′kRkβk)^{1/2} ≤ η1. The SRC can be assumed for X·Q⁻¹, where Q = diag(Q1, …, Qp).
Immediately, from Theorem 2.1, we have the following corollary.
Corollary 2.1
Suppose that the conditions of Theorem 2.1 hold and λ satisfies the constraint (2.12). Then, with probability converging to one, all groups with are selected.
From Theorem 2.1 and Corollary 2.1, the group Lasso possesses similar properties to the Lasso in terms of sparsity and bias (Zhang and Huang (2008)). In particular, the group Lasso selects a model whose dimension has the same order as the underlying model. Furthermore, all of the groups with coefficients whose ℓ2-norms are greater than the threshold given in Corollary 2.1 are selected with high probability.
Theorem 2.2
Let {c̄, σ, r1, r2, c0, d} be fixed and 1 ≤ q ≤ n ≤ p → ∞. Suppose that the conditions in Theorem 2.1 hold. Then, with probability converging to 1, we have
and
Theorem 2.2 is stated for a general λ that satisfies (2.12). The following result is an immediate corollary of Theorem 2.2.
Corollary 2.2
Let with a fixed . Suppose that all of the conditions in Theorem 2.2 hold. We then have
This corollary follows by substituting the given λ value into the expressions in the results of Theorem 2.2.
3. Selection consistency of the adaptive group Lasso
As shown in the previous section, the group Lasso has excellent selection and estimation properties. However, there is room for improvement, particularly with regard to selection. Although the group Lasso selects a model whose dimension is comparable to that of the underlying model, the simulation results reported in Yuan and Lin (2006) and those reported below suggest that it tends to select more groups than there are in the underlying model. To correct the tendency of overselection by the group Lasso, we generalize the idea of the adaptive Lasso (Zou (2006)) for individual variable selection to the present problem of group selection.
Consider a general group Lasso criterion with a weighted penalty term,

(3.1)  ||Y − ∑k=1,…,p Xkβk||2² + λ̃ ∑k=1,…,p wk||βk||2,
where wk is the weight associated with the kth group and λk ≡ λ̃wk can be regarded as the penalty level for the kth group. The penalty level λk may differ across groups. If we apply a lower penalty to groups with large coefficients and a higher penalty to groups with small coefficients (in the ℓ2 sense), then we can expect to improve selection accuracy and reduce estimation bias. One way to learn whether a group has large or small coefficients is to use a consistent initial estimator.
Suppose that an initial estimate β̃ is available. A simple approach to determining the weight is to use the initial estimator. Consider
(3.2)  wk = 1/||β̃k||2,  k = 1, …, p.
Thus, for each group, its penalty is proportional to the inverse of the norm of β̃k. This choice of the penalty level for each group is a natural generalization of the adaptive Lasso (Zou (2006)). In particular, when each group only contains a single variable, (3.2) simplifies to the adaptive Lasso penalty.
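Computing the weights (3.2) from an initial fit is one line per group; groups zeroed by the initial estimator receive an infinite penalty and are effectively dropped. A sketch (function name ours):

```python
import numpy as np

def adaptive_weights(beta_init, groups, tol=1e-10):
    """w_k = 1/||beta_init_k||_2, with w_k = inf for groups the initial fit zeroed."""
    w = np.empty(len(groups))
    for k, g in enumerate(groups):
        s = np.linalg.norm(beta_init[g])
        w[k] = np.inf if s <= tol else 1.0 / s
    return w
```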
Let A0 = {k: ||βk||2 = 0} and θb = min{||βk||2: k ∉ A0}. We say that an initial estimator β̃ is consistent at zero with rate rn if rn maxk∈A0 ||β̃k||2 = Op(1), where rn → ∞ as n → ∞, and if there exists a constant ξb > 0 such that, for any ε > 0, P(mink∉A0 ||β̃k||2 ≥ ξbθb) ≥ 1 − ε for n sufficiently large.
In addition to (C1), we assume the following conditions:
(C2) the initial estimator β̃ is consistent at zero with rate rn → ∞;

(C3)

(C4) all of the eigenvalues of X′A0ᶜXA0ᶜ/n, where A0ᶜ = {1, …, q}, are bounded away from zero and infinity.
Condition (C2) assumes that an initial zero-consistent estimator exists. It is the most critical condition and is generally difficult to establish, since it requires that we can consistently distinguish important from non-important groups. For fixed p and dk, the ordinary least-squares estimator can be used as the initial estimator. However, when p > n, the least-squares estimator is no longer feasible. By Theorems 2.1 and 2.2, the group Lasso estimator β̂ is consistent at zero with rate . Condition (C3) restricts the numbers of important and non-important groups, as well as the numbers of variables within the groups. It also places constraints on the penalty parameter and the ℓ2-norm of the smallest important group. Condition (C4) assumes that the eigenvalues of X′A0ᶜXA0ᶜ/n are finite and bounded away from zero. This is reasonable since the number of important groups is small in a sparse model. This condition ensures that the true model is identifiable.
Define
(3.3)  β̂ ≡ arg minβ {||Y − ∑k=1,…,p Xkβk||2² + λ̃ ∑k=1,…,p ||βk||2/||β̃k||2}.
Theorem 3.1
If (C1)–(C4) and the NSC (2.3) are satisfied, then

P(||β̂k||2 ≠ 0 for all k ∉ A0 and ||β̂k||2 = 0 for all k ∈ A0) → 1.

Therefore, under these conditions, the adaptive group Lasso is selection consistent.
If we use β̂ as the initial estimator, then (C3) can be changed to (C3)*
We often have λ̃ = n^α for some 0 < α < 1/2. In this case, the number of non-important groups can be as large as exp(n^{2α}/(q log q)), with the number of important groups satisfying q^5 log q/n → 0, assuming that θb and the numbers of variables within the groups are finite.
Corollary 3.1
Let the initial estimator β̃ = β̂, where β̂ is the group Lasso estimator. Suppose that the NSC (2.3) holds and that (C1), (C2), (C3)* and (C4) are satisfied. We then have
This corollary follows directly from Theorem 3.1. It shows that the iterated group Lasso procedure that uses a combination of the group Lasso and the adaptive group Lasso is selection consistent.
Theorem 3.2
Suppose that the conditions in Theorem 2.2 hold and that θb > tb for some constant tb > 0. If λ̃ = O(n^α) for some 0 < α < 1/2, then
Theorem 3.2 implies that for the adaptive group Lasso, given a zero-consistent initial estimator, we can reduce a high-dimensional problem to a lower-dimensional one. The convergence rate is improved, compared with that of the group Lasso, by choosing an appropriate penalty parameter λ̃.
4. Simulation studies
In this section, we use simulation to evaluate the finite-sample performance of the group Lasso and the adaptive group Lasso. Let λk = λ̃/||β̂k||2 if ||β̂k||2 > 0 and λk = ∞ if ||β̂k||2 = 0. We can thus drop the corresponding covariates Xk from the model and only consider the groups with ||β̂k||2 > 0. After a scale transformation, we can directly apply the group least angle regression algorithm (Yuan and Lin (2006)) to compute the adaptive group Lasso estimator β̂*. The penalty parameters for the group Lasso and the adaptive group Lasso are selected using the BIC criterion (Schwarz (1978)).
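The scale transformation can be made explicit: substituting γk = βk/||β̂k||2 and X̃k = ||β̂k||2·Xk turns the weighted criterion into a plain group Lasso in (X̃, γ), so any group Lasso or group LARS solver can be reused unchanged. A sketch of the rescaling (function name ours; β̂ is the given initial estimator):

```python
import numpy as np

def rescale_for_adaptive(X, groups, beta_init, tol=1e-10):
    """Drop groups zeroed by the initial fit and scale the rest so that the
    weighted penalty sum_k ||b_k||_2 / ||beta_init_k||_2 becomes a plain
    (unweighted) group-Lasso penalty in the new coordinates."""
    kept, blocks, scales = [], [], []
    for k, g in enumerate(groups):
        s = np.linalg.norm(beta_init[g])
        if s > tol:
            kept.append(k)
            scales.append(s)
            blocks.append(X[:, g] * s)  # X~_k = s_k * X_k, with gamma_k = beta_k / s_k
    return kept, np.hstack(blocks), scales
```

After solving the plain group Lasso in (X̃, γ), back-transform via βk = sk·γk for the kept groups and βk = 0 for the dropped ones; the fitted values X̃γ and Xβ coincide by construction.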
We consider two scenarios of simulation models. In the first scenario, the group sizes are equal; in the second, the group sizes vary. For every scenario, we consider the cases p < n and p > n. In all of the examples, the sample size is n = 200.
Example 1
In this example, there are 10 groups, each consisting of 5 covariates. The covariate vector is X = (X1, …, X10), where Xj = (X5(j−1)+1, …, X5(j−1)+5), 1 ≤ j ≤ 10. To generate X, we first simulate 50 random variables, R1, …, R50, independently from N(0, 1). Then, Zj, j = 1, …, 10, are simulated from a multivariate normal distribution with mean zero and cov(Zj1, Zj2) = 0.6^{|j1−j2|}. The covariates X1, …, X50 are generated as
The random error ε ~ N(0, 3²). The response variable Y is generated from , where β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 2, 2, 2, 2) and β3 = · · · = β10 = (0, 0, 0, 0, 0).
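The display defining the covariates from the Zj and Ri is not recoverable here. The following data-generating sketch uses a hypothetical equal-weights mixing (our assumption, chosen so each covariate has unit variance and covariates within a group share the common Zj):

```python
import numpy as np

def make_example1(n=200, seed=0):
    rng = np.random.default_rng(seed)
    # Group-level variables Z_1, ..., Z_10 with cov(Z_j1, Z_j2) = 0.6^{|j1 - j2|}
    idx = np.arange(10)
    Z = rng.multivariate_normal(np.zeros(10),
                                0.6 ** np.abs(idx[:, None] - idx[None, :]), size=n)
    R = rng.normal(size=(n, 50))  # R_1, ..., R_50 i.i.d. N(0, 1)
    # Hypothetical mixing: covariate i in group j combines Z_j with its own R_i,
    # scaled so each covariate has unit variance.
    X = (np.repeat(Z, 5, axis=1) + R) / np.sqrt(2)
    beta = np.concatenate([[0.5, 1, 1.5, 2, 2.5], [2.0] * 5, np.zeros(40)])
    y = X @ beta + rng.normal(scale=3.0, size=n)  # epsilon ~ N(0, 3^2)
    return X, y, beta
```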
Example 2
In this example, the number of groups is p = 10 and each group consists of 5 covariates. The covariates are generated in the same way as in Example 1, but the regression coefficients are β1 = (0.5, 1, 1.5, 1, 0.5), β2 = (1, 1, 1, 1, 1), β3 = (−1, 0, 1, 2, 1.5), β4 = (−1.5, 1, 0.5, 0.5, 0.5) and β5 = · · · = β10 = (0, 0, 0, 0, 0).
Example 3
In this example, the number of groups, p = 210, exceeds the sample size n. Each group consists of 5 covariates. The covariates are generated in the same way as in Example 1, but the regression coefficients are β1 = (0.5, 1, 1.5, 1, 0.5), β2 = (1, 1, 1, 1, 1), β3 = (−1, 0, 1, 2, 1.5), β4 = (−1.5, 1, 0.5, 0.5, 0.5) and β5 = · · · = β210 = (0, 0, 0, 0, 0).
Example 4
In this example, the group sizes differ across groups: there are 5 groups of size 5 and 5 groups of size 3. The covariate vector is X = (X1, …, X10), where Xj = (X5(j−1)+1, …, X5(j−1)+5), 1 ≤ j ≤ 5, and Xj = (X3(j−6)+26, …, X3(j−6)+28), 6 ≤ j ≤ 10. To generate X, we first simulate 40 random variables, R1, …, R40, independently from N(0, 1). Then, Zj, j = 1, …, 10, are simulated from a multivariate normal distribution with mean zero and cov(Zj1, Zj2) = 0.6^{|j1−j2|}. The covariates X1, …, X40 are generated as
The random error ε ~ N(0, 3²). The response variable Y is generated from , where β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 0, 0, 2, 2), β3 = · · · = β5 = (0, 0, 0, 0, 0), β6 = (−1, −2, −3) and β7 = · · · = β10 = (0, 0, 0).
Example 5
In this example, the number of groups is p = 10 and the group sizes differ across groups. The data are generated in the same way as in Example 4, but the regression coefficients are β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 2, 2, 2, 2), β3 = (−1, 0, 1, 2, 3), β4 = (−1.5, 2, 0, 0, 0), β5 = (0, 0, 0, 0, 0), β6 = (2, −2, 1), β7 = (0, −3, 1.5), β8 = (−1.5, 1.5, 2), β9 = (−2, −2, −2) and β10 = (0, 0, 0).
Example 6
In this example, the number of groups is p = 210 and the group sizes differ across groups. The data are generated in the same way as in Example 4, but the regression coefficients are β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 2, 2, 2, 2), β3 = (−1, 0, 1, 2, 3), β4 = (−1.5, 2, 0, 0, 0), β5 = · · · = β100 = (0, 0, 0, 0, 0), β101 = (2, −2, 1), β102 = (0, −3, 1.5), β103 = (−1.5, 1.5, 2), β104 = (−2, −2, −2) and β105 = · · · = β210 = (0, 0, 0).
The results, based on 400 replications, are given in Table 1. For each method, the columns of the table report the average number of groups selected (with standard error in parentheses), the median number of groups selected ('med', with the 25% and 75% quantiles in parentheses), the model error ('ME'), the percentage of occasions on which all correct groups are included in the selected model ('% incl') and the percentage of occasions on which exactly the correct groups are selected ('% sel'), with standard errors in parentheses.
Table 1. Simulation results for the group Lasso and the adaptive group Lasso, based on 400 replications. The number in brackets after each example is the number of nonzero groups in the true model.

| | Group Lasso | | | | | Adaptive group Lasso | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| σ = 3 | mean | med | ME | % incl | % sel | mean | med | ME | % incl | % sel |
Ex. 1, [2] | 2.04 (0.18) | 2 (2, 2) | 8.79 (0.94) | 100 (0) | 96.5 (0.18) | 2.01 (0.07) | 2 (2, 2) | 8.54 (0.90) | 100 (0) | 99.5 (0.07) |
Ex. 2, [4] | 4.11 (0.34) | 4 (4, 4) | 8.52 (0.94) | 99.5 (0.07) | 88.5 (0.32) | 4.00 (0.14) | 4 (4, 4) | 8.10 (0.87) | 99.5 (0.07) | 98.00 (0.14) |
Ex. 3, [4] | 4.00 (0.38) | 4 (4, 4) | 9.48 (1.19) | 93.0 (0.26) | 86.5 (0.34) | 3.94 (0.27) | 4 (4, 4) | 8.19 (0.96) | 93.0 (0.26) | 92.5 (0.26) |
Ex. 4, [3] | 3.17 (0.45) | 3 (3, 3) | 8.78 (1.00) | 100 (0) | 85.3 (0.35) | 3.00 (0) | 3 (3, 3) | 8.36 (0.90) | 100 (0) | 100 (0) |
Ex. 5, [8] | 8.88 (0.81) | 9 (8, 10) | 7.68 (0.94) | 100 (0) | 40.0 (0.49) | 8.03 (0.16) | 8 (8, 8) | 7.58 (0.86) | 100 (0) | 97.5 (0.16) |
Ex. 6, [8] | 12.90 (12.42) | 9 (8, 11) | 14.61 (7.21) | 66.5 (0.47) | 7.0 (0.26) | 11.49 (12.68) | 8 (7, 8) | 9.28 (5.79) | 66.5 (0.47) | 47.0 (0.50) |
Several observations can be made from Table 1. First, in all six examples, the adaptive group Lasso performs better than the group Lasso in terms of model error and the percentage of correctly selected models; the improvement is considerable for the models with unequal group sizes. The group Lasso, which provides the initial estimator for the adaptive group Lasso, includes the correct groups with high probability. Second, the results from the models with equal group sizes (Examples 1, 2 and 3) are better than those from the models with different group sizes (Examples 4, 5 and 6). Finally, as the dimension of the model increases, the performance of both methods deteriorates. This is to be expected, since selection in models with a larger number of groups is more difficult.
5. Concluding remarks
We have studied the asymptotic selection and estimation properties of the group Lasso and adaptive group Lasso in ‘large p, small n’ linear regression models. For the adaptive group Lasso to be selection consistent, the initial estimator should possess two properties: (a) it does not miss important groups and variables; (b) it is estimation consistent, although it may not be group-selection or variable-selection consistent. Under the conditions stated in Theorem 2.1, the group Lasso is shown to satisfy these two requirements. Thus, the iterated group Lasso procedure, which uses the group Lasso to achieve dimension reduction and generate the initial estimates and then uses the adaptive group Lasso to achieve selection consistency, is an appealing approach to group selection in high-dimensional settings.
6. Proofs
We first introduce some notation which will be used in the proofs. Let . Set A2 = {1, …, p}\A1, A3 = A1\A0, A4 = A1 ∩ A0, A5 = A2\A0 and A6 = A2 ∩ A0. Thus, A1 = A3 ∪ A4 with A3 ∩ A4 = Ø, and A2 = A5 ∪ A6 with A5 ∩ A6 = Ø. Let |Ai| = ∑k∈Ai dk and N(Ai) = #{k: k ∈ Ai}, i = 1, …, 6, and let q1 = N(A1).
Proof of Theorem 2.1
The basic idea used in this proof follows the proof of the rate consistency of the Lasso in Zhang and Huang (2008). However, there are many differences in technical details, for example, in the characterization of the solution via the Karush–Kuhn–Tucker (KKT) conditions, in the constraint needed for the penalty level and in the use of maximal inequalities.
The proof consists of three steps. Step 1 establishes some inequalities relating q1, ω̃ and ζ2. Step 2 translates the results of Step 1 into upper bounds for q̂, ω̃ and ζ2. Step 3 completes the proof by showing that the probability of the event in Step 2 converges to 1. The details of the complete proof are available from the website www.stat.uiowa.edu/techrep. We sketch the proof below.
If β̂ is a solution of (2.1), then, by the KKT condition, and . We then have
(6.1) |
(6.2) |
where , all of the elements of the dki × 1 vector edki×1 equal 1, ki ∈ Ai and .
Step 1
Define
where QAkj is the matrix representing the selection of variables in Ak from Aj. Define . From (6.1) and (6.2), we have and . Then, under GSC,
(6.3) |
Step 2
Define and . In this step, we consider the event . Suppose that the set A1 contains all large βk ≠ 0. From (6.3), , so we have
(6.4) |
For general A1, let C5 = c^*(|A5|)/c_*(|A1| ∨ |A5|). From (6.3),
(6.5) |
From Zhang and Huang (2008), and ||XA2βA2||2 ≤ η2 + ||XA5βA5||2 ≤ η2 + (nc^*(|A5|))^{1/2}||βA5||2. By the Cauchy–Schwarz inequality, we then have
(6.6) |
where c*,5 = c*(|A1 ∪ A5|).
Step 3
Letting c_*(|Am|) = c_* and c^*(|Am|) = c^* for N(Am) ≤ q*, we have
(6.7) |
We have C5 = c^*(|A5|)/c_*(|A1| ∨ |A5|) = c^*/c_* = c̄ and c*,5 = c_*(|A1 ∪ A5|) = c_*. From (6.4), (6.5) and (6.6), (q1 − q)+ + q ≤ M1q when (2.12) is satisfied. Define
(6.8) |
for |A| = q1 = m ≥ 0, , where , ||UAk||2 = 1. Let , where . For a given A, let Vlj = (0,…, 0, 1, 0, …, 0) be the |A| × 1 vector with the jth element in the lth group being 1. Then, by (6.8),
If we define , then for N(A1) ≥ m0 ≥ 0. By the definition of , it is less than the maximum of normal variables with mean 0 and variance , plus the maximum of normal variables with mean 0 and variance σ2. It follows that P{(X, ε) ∈ Ωm0} → 1 when (6.7) holds. This completes the sketch of the proof of Theorem 2.1.
Proof of Theorem 2.2
Consider the case where {c_*, c^*, r1, r2, c0, σ, d} are fixed. The required configurations in Theorem 2.1 then become
(6.9) |
Let A1 = {k: ||β̂k||2 > 0 or k ∉ A0}. Define v1 = XA1(β̂A1 − βA1) and . We then have and . Therefore, . Since with probability converging to 1 under the normality assumption, . We then have
(6.10) |
Since A2 ⊂ A0, by the second inequality in (6.9), . By the SRC and the third inequality in (6.9), and . From (6.10), we then have
This completes the proof of Theorem 2.2.
Proof of Theorem 3.1
Let and , where λk = λ/||β̃k||2. By the KKT conditions, if there exists û such that
(6.11) |
(6.12) |
then ||β̂k||2 ≠ 0 for k = 1,…, q and ||β̂k||2 = 0 for k = q + 1, …, p.
From (6.11) and (6.12), and . Define the events
where (·)k denotes the dk-dimensional subvector of the vector (·) corresponding to the kth group. We then have P(||β̂k||2 ≠ 0, k ∉ A0, and ||β̂k||2 = 0, k ∈ A0) ≥ P(E1 ∩ E2) and .
First, we consider . Define , where c1 is a constant. . By (C2), P(Rc) → 0. Let be the eigenvalues of and γ1, …, γNq be the associated eigenvectors. The jth element in the lth group of vector is . By the Cauchy–Schwarz inequality, . Therefore, .
If we define , then . By Lemmas 1 and 2 of Huang, Ma and Zhang (2008), , where K is a constant, from (C3). We then have
Next, we consider . Similarly to the above, define . By (C2), P(Dᶜ) → 0. , where ul is the lth element of the vector . If we define and CA0 = {maxk∈A0 ||ξk||2 > vA0}, then P(Qᶜ) ≤ P(CA0) and P(CA0) ≤ K(da log(p − q))^{1/2}/vA0 → 0 by (C3). We then have . This completes the proof of Theorem 3.1.
Proof of Theorem 3.2
If we let Â = {k: ||β̂k||2 > 0, k = 1, …, p}, then , the dimension of our problem (3.1) is reduced to q̂, with q̂ ≤ q* and Âᶜ ⊂ A0. By the definition of β̂*, we have
(6.13) |
(6.14) |
If we let and , then . By (6.13) and (6.14), , so . By the triangle inequality, ||δÂ||2 ≤ ||δ − Dε||2 + ||Dε||2. Thus, .
Let Di be the ith column of D. Then, with probability converging to 1,
Thus, for λ̃ = nα for some 0 < α < 1/2, with probability converging to 1,
and . This completes the proof of Theorem 3.2.
Acknowledgments
The authors are grateful to Professor Cun-Hui Zhang for sharing his insights into the problem and related topics. The work of Jian Huang is supported in part by NIH Grant R01CA120988 and NSF Grants DMS-07-06108 and 0805670.
Contributor Information
FENGRONG WEI, Email: fwei@westga.edu.
JIAN HUANG, Email: jian-huang@uiowa.edu.
References
- Antoniadis A, Fan J. Regularization of wavelet approximations (with discussion). J Amer Statist Assoc. 2001;96:939–967.
- Bühlmann P, Meier L. Discussion of "One-step sparse estimates in nonconcave penalized likelihood models," by H. Zou and R. Li. Ann Statist. 2008;36:1534–1541.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
- Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann Statist. 2004;32:928–961.
- Greenshtein E, Ritov Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
- Huang J, Horowitz JL, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Statist. 2008;36:587–613.
- Huang J, Ma S, Xie H, Zhang CH. A group bridge approach for variable selection. Biometrika. 2009;96:339–355.
- Huang J, Ma S, Zhang CH. Adaptive Lasso for sparse high-dimensional regression models. Statist Sinica. 2008;18:1603–1618.
- Kim Y, Kim J, Kim Y. Blockwise sparse regression. Statist Sinica. 2006;16:375–390.
- Knight K, Fu W. Asymptotics for Lasso-type estimators. Ann Statist. 2000;28:1356–1378.
- Meier L, van de Geer S, Bühlmann P. The group Lasso for logistic regression. J R Stat Soc Ser B. 2008;70:53–71.
- Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34:1436–1462.
- Schwarz G. Estimating the dimension of a model. Ann Statist. 1978;6:461–464.
- Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B. 1996;58:267–288.
- van de Geer S. High-dimensional generalized linear models and the Lasso. Ann Statist. 2008;36:614–645.
- Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B. 2006;68:49–67.
- Zhang CH. Penalized linear unbiased selection. Technical Report 2007-003, Dept. of Statistics, Rutgers Univ; 2007.
- Zhang CH, Huang J. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann Statist. 2008;36:1567–1594.
- Zhao P, Rocha G, Yu B. The composite absolute penalties family for grouped and hierarchical variable selection. Ann Statist. 2009;37:3468–3497.
- Zhao P, Yu B. On model selection consistency of Lasso. J Mach Learn Res. 2006;7:2541–2563.
- Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 2005;67:301–320.