Author manuscript; available in PMC: 2011 Nov 7.
Published in final edited form as: Bernoulli (Andover). 2010 Nov;16(4):1369–1384. doi: 10.3150/10-BEJ252

Consistent group selection in high-dimensional linear regression

FENGRONG WEI 1, JIAN HUANG 2
PMCID: PMC3209717  NIHMSID: NIHMS331983  PMID: 22072891

Abstract

In regression problems where covariates can be naturally grouped, the group Lasso is an attractive method for variable selection since it respects the grouping structure in the data. We study the selection and estimation properties of the group Lasso in high-dimensional settings when the number of groups exceeds the sample size. We provide sufficient conditions under which the group Lasso selects a model whose dimension is comparable with the underlying model with high probability and is estimation consistent. However, the group Lasso is, in general, not selection consistent and also tends to select groups that are not important in the model. To improve the selection results, we propose an adaptive group Lasso method which is a generalization of the adaptive Lasso and requires an initial estimator. We show that the adaptive group Lasso is consistent in group selection under certain conditions if the group Lasso is used as the initial estimator.

Keywords: group selection, high-dimensional data, penalized regression, rate consistency, selection consistency

1. Introduction

Consider the linear regression model with p groups of covariates

Y_i = Σ_{k=1}^p X_{ik}′β_k + ε_i,  i = 1, …, n,

where Yi is the response variable, εi is the error term, Xik is a dk × 1 covariate vector representing the kth group and βk is the corresponding dk × 1 vector of regression coefficients. For such a model, the group Lasso (Antoniadis and Fan (2001), Yuan and Lin (2006)) is an attractive method for variable selection since it respects the grouping structure in the covariates. This method is a natural extension of the Lasso (Tibshirani (1996)), in which an ℓ2-norm of the coefficients associated with a group of variables is used as a component in the penalty function. However, the group Lasso is, in general, not selection consistent and tends to select more groups than there are in the model. To improve the selection results, we consider an adaptive group Lasso method which is a generalization of the adaptive Lasso (Zou (2006)). We provide sufficient conditions under which the adaptive group Lasso is selection consistent if the group Lasso is used as the initial estimator.

The need to select groups of variables arises in many statistical modeling problems and applications. For example, in multifactor analysis of variance, a factor with multiple levels can be represented by a group of dummy variables. In nonparametric additive regression, each component can be expressed as a linear combination of a set of basis functions. In both cases, the selection of important factors or nonparametric components amounts to the selection of groups of variables. Several recent papers have considered group selection using penalized methods. In addition to the group Lasso, Yuan and Lin (2006) have proposed the group Lars and group non-negative garrote methods. Kim, Kim and Kim (2006) considered the group Lasso in the context of generalized linear models. Zhao, Rocha and Yu (2008) proposed a composite absolute penalty for group selection, which can be considered a generalization of the group Lasso. Meier, van de Geer and Bühlmann (2008) studied the group Lasso for logistic regression. Huang, Ma, Xie and Zhang (2008) proposed a group bridge method that can be used for simultaneous group and individual variable selection.

There has been much work on the penalized methods for variable selection and estimation with high-dimensional data. Several approaches have been proposed, including the least absolute shrinkage and selection operator (Lasso, Tibshirani (1996)), the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li (2001), Fan and Peng (2004)), the elastic net (Enet) penalty (Zou and Hastie (2006)) and the minimum concave penalty (Zhang (2007)). Much progress has been made in understanding the statistical properties of these methods in both fixed p and p ≫ n settings. In particular, several recent studies considered the Lasso with regard to its variable selection, estimation and prediction properties; see, for example, Knight and Fu (2001), Greenshtein and Ritov (2004), Meinshausen and Buhlmann (2006), Zhao and Yu (2006), Huang, Ma and Zhang (2006), van de Geer (2008) and Zhang and Huang (2008), among others. All of these studies are concerned with the Lasso for individual variable selection.

In this article, we study the asymptotic properties of the group Lasso and the adaptive group Lasso in high-dimensional settings when p ≫ n. We generalize the results concerning the Lasso obtained in Zhang and Huang (2008) to the group Lasso. We show that, under a generalized sparsity condition and the sparse Riesz condition, as well as certain regularity conditions, the group Lasso selects a model whose dimension has the same order as the underlying model, selects all groups whose ℓ2-norms are of greater order than the bias of the selected model and is estimation consistent. In addition, under a narrow-sense sparsity condition (see page 1371) and using the group Lasso as the initial estimator, the adaptive group Lasso can correctly select important groups with high probability.

Our theoretical and simulation results suggest the following two-step approach to group selection in high-dimensional settings. First, we use the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We then use the adaptive group Lasso to select the final set of groups of variables. Since the computation of the adaptive group Lasso estimator can be carried out using the same algorithm and program for the group Lasso, the computational cost of this two-step approach is approximately twice that of a single group Lasso computation. This approach, which applies the group Lasso twice in an iterative fashion, follows the idea of the adaptive Lasso (Zou (2006)) and a proposal by Bühlmann and Meier (2008) in the context of individual variable selection.

The rest of the paper is organized as follows. In Section 2, we state the results on the selection, bias of the selected model and convergence rate of the group Lasso estimator. In Section 3, we describe the selection and estimation consistency results concerning the adaptive group Lasso. In Section 4, we use simulation to compare the group Lasso and adaptive group Lasso. Concluding remarks are given in Section 5 and proofs are collected in Section 6.

2. The asymptotic properties of the group Lasso

Let Y = (Y1, …, Yn)′ and X = (X1, …, Xp), where Xk is the n × dk covariate submatrix corresponding to the kth group. For a given penalty level λ ≥ 0, the group Lasso estimator of β = (β_1′, …, β_p′)′ is

β̂ = arg min_β (1/2)(Y − Xβ)′(Y − Xβ) + λ Σ_{k=1}^p √d_k ||β_k||_2, (2.1)

where β̂ = (β̂_1′, …, β̂_p′)′.
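The group Lasso estimator in (2.1) is computed in Section 4 with the group least angle regression algorithm of Yuan and Lin (2006). Purely as an illustration of the criterion itself, the following sketch minimizes (2.1) by proximal gradient descent with a groupwise soft-thresholding step; the function names, the solver choice and the stopping rule are our own assumptions and are not part of the paper.

    import numpy as np

    def group_lasso(X, y, groups, lam, n_iter=2000, tol=1e-8):
        """Minimize (1/2)||y - X b||^2 + lam * sum_k sqrt(d_k) * ||b_k||_2.
        `groups` is a list of integer index arrays, one array per group."""
        n, p = X.shape
        beta = np.zeros(p)
        # Step size 1/L, where L is the Lipschitz constant of the gradient
        # of the least-squares term (largest eigenvalue of X'X).
        step = 1.0 / np.linalg.eigvalsh(X.T @ X)[-1]
        for _ in range(n_iter):
            z = beta + step * (X.T @ (y - X @ beta))    # gradient step
            new = np.zeros_like(beta)
            for g in groups:
                thr = step * lam * np.sqrt(len(g))
                norm_g = np.linalg.norm(z[g])
                if norm_g > thr:                        # groupwise soft-thresholding
                    new[g] = (1.0 - thr / norm_g) * z[g]
            if np.linalg.norm(new - beta) < tol:
                return new
            beta = new
        return beta

For a single penalty level λ, a call such as group_lasso(X, Y, groups, lam) returns an approximate minimizer of (2.1).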

We consider the model selection and estimation properties of β̂ under a generalized sparsity condition (GSC) of the model and a sparse Riesz condition (SRC) on the covariate matrix. These two conditions were first formulated in the study of the Lasso estimator (Zhang and Huang (2008)). The GSC assumes that, for some η1 ≥ 0, there exists an A0 ⊂ {1, …, p} such that Σ_{k∈A0} ||β_k||_2 ≤ η1, where || · ||_2 denotes the ℓ2-norm. Without loss of generality, let A0 = {q + 1, …, p}. The GSC is then

Σ_{k=q+1}^p ||β_k||_2 ≤ η1. (2.2)

The number of truly important groups is thus q. A more rigid way to describe sparsity is to assume η1 = 0, that is,

||β_k||_2 = 0,  k = q + 1, …, p. (2.3)

This is a special case of the GSC and we call it the narrow-sense sparsity condition (NSC). In practice, the GSC is a more realistic formulation of a sparse model. However, the NSC can often be considered a reasonable approximation to the GSC, especially when η1 is smaller than the noise level associated with model fitting.

The SRC controls the range of eigenvalues of submatrices of the Gram matrix. For A ⊂ {1,…, p}, we define X_A = (X_k, k ∈ A) and Σ_{AA} = X_A′X_A/n. Note that X_A is an n × Σ_{k∈A} d_k matrix. The design matrix X satisfies the sparse Riesz condition (SRC) with rank q* and spectrum bounds 0 < c_* < c^* < ∞ if

c_* ≤ ||X_A ν||_2² / (n||ν||_2²) ≤ c^*  for all A with #{k: k ∈ A} ≤ q* and all ν ∈ R^{Σ_{k∈A} d_k}. (2.4)
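The SRC (2.4) is a statement about all small collections of groups. The short sketch below (our own code, feasible only for small p and q*) computes the extreme eigenvalues of X_A′X_A/n over all collections A of at most q* groups; the smallest and largest values found are empirical candidates for c_* and c^*.

    import numpy as np
    from itertools import combinations

    def src_bounds(X, groups, q_star):
        """Extreme eigenvalues of X_A' X_A / n over all A with at most q_star groups."""
        n = X.shape[0]
        lo, hi = np.inf, 0.0
        for m in range(1, q_star + 1):
            for A in combinations(range(len(groups)), m):
                cols = np.concatenate([groups[k] for k in A])
                eig = np.linalg.eigvalsh(X[:, cols].T @ X[:, cols] / n)
                lo, hi = min(lo, eig[0]), max(hi, eig[-1])
        return lo, hi   # candidates for c_* and c^* in (2.4)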

Let Â = {k: ||β̂_k||_2 > 0, 1 ≤ k ≤ p}, which is the set of indices of the groups selected by the group Lasso. An important quantity is the cardinality of Â, defined as

q̂ = |Â| = #{k: ||β̂_k||_2 > 0, 1 ≤ k ≤ p}, (2.5)

which determines the dimension of the selected model. If q̂ = O(q), then the selected model has dimension comparable to the underlying model. Following Zhang and Huang (2008), we also consider two measures of the selected model. The first measures the error of the selected model:

ω̂ = ||(I − P̂)Xβ||_2, (2.6)

where P̂ is the projection matrix from R^n to the linear span of the set of selected groups and I ≡ I_{n×n} is the identity matrix. Thus, ω̂² is the sum of squares of the mean vector not accounted for by the selected model. To measure the important groups missing in the selected model, we define

ζ_2 = (Σ_{k∉A_0} ||β_k||_2² I{||β̂_k||_2 = 0})^{1/2}. (2.7)
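The three quantities (2.5)–(2.7) are easy to compute for a given fit when the true coefficients are known, as in a simulation. The sketch below (our own code and naming) takes the design matrix, the true and fitted coefficients, the group index sets and the index set A0 of unimportant groups, and returns q̂, ω̂² and ζ_2².

    import numpy as np

    def selection_measures(X, beta_true, beta_hat, groups, A0):
        """Return q_hat of (2.5), omega_hat^2 of (2.6) and zeta_2^2 of (2.7)."""
        selected = [k for k, g in enumerate(groups) if np.linalg.norm(beta_hat[g]) > 0]
        q_hat = len(selected)
        mu = X @ beta_true
        if selected:
            cols = np.concatenate([groups[k] for k in selected])
            proj = X[:, cols] @ np.linalg.pinv(X[:, cols])   # projection P_hat
            omega2 = float(np.sum((mu - proj @ mu) ** 2))
        else:
            omega2 = float(np.sum(mu ** 2))
        # Important groups (k not in A0) that the fit misses entirely.
        zeta2_sq = sum(np.linalg.norm(beta_true[g]) ** 2
                       for k, g in enumerate(groups)
                       if k not in A0 and k not in selected)
        return q_hat, omega2, zeta2_sq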

We now describe several quantities that will be useful in describing the main results. Let d_a = max_{1≤k≤p} d_k, d_b = min_{1≤k≤p} d_k, d = d_a/d_b and N_d = Σ_{k=1}^p d_k. Define

r_1 ≡ r_1(λ; η_1) = (n c_* d_a η_1/(λ d_b q))^{1/2},  r_2 ≡ r_2(λ) = (n c_* η_2²/(λ² d_b q))^{1/2},  c̄ = c^*/c_*, (2.8)

where η_2 ≡ max_{A⊂A_0} ||Σ_{k∈A} X_kβ_k||_2,

M_1 = M_1(λ) = 2 + 4r_1² + 4√(d c̄) r_2 + 4d c̄, (2.9)
M_2 ≡ M_2(λ) = (2/3)(1 + 4r_1² + 2d c̄ + 4√(2d(1 + c̄)c̄) r_2 + (16/3)d c̄²), (2.10)
M_3 ≡ M_3(λ) = (2/3)(1 + 4r_1² + 4√(d c̄)(1 + 2√(1 + c̄)) r_2 + 3r_2² + (2/3)d c̄(7 + 4c̄)). (2.11)

Let λ_{n,p} = 2σ{8(1 + c_0) d_a d² q* c̄ c^* n log(N_d ∨ a_n)}^{1/2}, where c_0 ≥ 0 and a_n ≥ 0 satisfy p d_a/(N_d ∨ a_n)^{1+c_0} → 0, and let λ_0 = inf{λ: M_1(λ)q + 1 ≤ q*}, where inf Ø = ∞. We also consider the constraint

λ ≥ max{λ_0, λ_{n,p}}. (2.12)

For large p, the lower bound here is allowed to be λ_{n,p} = 2σ{8(1 + c_0) d_a d² q* c̄ c^* n log(N_d)}^{1/2} with a_n = 0; for fixed p, a_n → ∞ is required.

We assume the following basic condition.

  • (C1)

    The errors ε1, …, εn are independent and identically distributed as N(0, σ²).

Theorem 2.1

Suppose that q ≥ 1 and that (C1), the GSC (2.2) and the SRC (2.4) are satisfied. Let q̂, ω̂ and ζ_2 be defined as in (2.5), (2.6) and (2.7), respectively, for the model Â selected by the group Lasso from (2.1). Let M_1, M_2 and M_3 be defined as in (2.9), (2.10) and (2.11), respectively. If the constraint (2.12) is satisfied, then the following assertions hold with probability converging to 1:

q̂ ≤ #{k: ||β̂_k||_2 > 0 or k ∉ A_0} ≤ M_1(λ) q,
ω̂² = ||(I − P̂)Xβ||_2² ≤ M_2(λ) B_1²(λ),
ζ_2² = Σ_{k∉A_0} ||β_k||_2² I{||β̂_k||_2 = 0} ≤ M_3(λ) B_1²(λ)/(c_* n),

where B_1(λ) = (λ² d_b² q/(n c_*))^{1/2}.

Remark 2.1

The condition q ≥ 1 is not necessary since it is only used to express quantities in terms of ratios in (2.8) and Theorem 2.1. If q = 0, we use r_1²q = n c_* d_a η_1/(λ d_b) and r_2²q = n c_* η_2²/(λ² d_b) to recover M_1, M_2 and M_3 in (2.9), (2.10) and (2.11), respectively, giving the results q̂ ≤ 4n c_* d_a η_1/(λ d_b), ω̂² ≤ 8λ d_a d_b η_1/3 and ζ_2² = 0.

Remark 2.2

If η1 = 0 in (2.2), then r1 = r2 = 0 and

M_1 = 2 + 4d c̄,  M_2 = (2/3)(1 + 2d c̄ + (16/3)d c̄²),  M_3 = (2/3)(1 + (2/3)d c̄(7 + 4c̄)),

all of which depend only on d and c̄. This suggests that the relative sizes of the groups affect the selection results. Since d ≥ 1, the most favorable case is d = 1, that is, when the groups have equal sizes.

Remark 2.3

If d1 = · · · = dp = 1, the group Lasso simplifies to the Lasso and Theorem 2.1 is a direct generalization of Theorem 1 on the selection properties of the Lasso obtained by Zhang and Huang (2008). In particular, when d1 = · · · = dp = 1, r1, r2, M1, M2, M3 are the same as the constants in Theorem 1 of Zhang and Huang (2008).

Remark 2.4

A more general definition of the group Lasso is

β̂ = arg min_β (1/2)(Y − Xβ)′(Y − Xβ) + λ Σ_{k=1}^p (β_k′R_kβ_k)^{1/2}, (2.13)

where R_k is a d_k × d_k positive definite matrix. This is useful when certain relationships among the coefficients need to be specified. By the Cholesky decomposition, there exists a matrix Q_k such that R_k = d_k Q_k′Q_k. Let β_k* = Q_kβ_k and X_k* = X_kQ_k^{−1}. Then, (2.13) becomes

β̂* = arg min_{β*} (1/2)(Y − X*β*)′(Y − X*β*) + λ Σ_{k=1}^p √d_k ||β_k*||_2.

The GSC for (2.13) is Σ_{k=q+1}^p (β_k′Q_k′Q_kβ_k)^{1/2} ≤ η_1. The SRC can be assumed for X · Q^{−1}, where X · Q^{−1} = (X_1Q_1^{−1}, …, X_pQ_p^{−1}).
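To see why the reparametrization in Remark 2.4 works, note that under the stated factorization R_k = d_k Q_k′Q_k,

(β_k′R_kβ_k)^{1/2} = (d_k β_k′Q_k′Q_kβ_k)^{1/2} = √d_k ||Q_kβ_k||_2 = √d_k ||β_k*||_2  and  X_kβ_k = (X_kQ_k^{−1})(Q_kβ_k) = X_k*β_k*,

so both the penalty and the fit in (2.13) are unchanged by the substitution, and the criterion takes the standard form (2.1) in the transformed variables (X_k*, β_k*).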

Immediately, from Theorem 2.1, we have the following corollary.

Corollary 2.1

Suppose that the conditions of Theorem 2.1 hold and λ satisfies the constraint (2.12). Then, with probability converging to one, all groups with ||β_k||_2² > M_3(λ) d_b² q λ²/(c_*² n²) are selected.

From Theorem 2.1 and Corollary 2.1, the group Lasso possesses similar properties to the Lasso in terms of sparsity and bias (Zhang and Huang (2008)). In particular, the group Lasso selects a model whose dimension has the same order as the underlying model. Furthermore, all of the groups with coefficients whose ℓ2-norms are greater than the threshold given in Corollary 2.1 are selected with high probability.

Theorem 2.2

Let {c̄, σ, r1, r2, c0, d} be fixed and 1 ≤ q ≤ n ≤ p → ∞. Suppose that the conditions in Theorem 2.1 hold. Then, with probability converging to 1, we have

||β̂ − β||_2 ≤ (1/√(n c_*)) (2σ√(M_1 log(N_d) q) + (r_2 + √(d M_1 c̄)) B_1) + √((c^* r_1² + r_2²)/(c_* c^*)) · qλ/n

and

||Xβ̂ − Xβ||_2 ≤ 2σ√(M_1 log(N_d) q) + (2r_2 + √(d M_1 c̄)) B_1.

Theorem 2.2 is stated for a general λ that satisfies (2.12). The following result is an immediate corollary of Theorem 2.2.

Corollary 2.2

Let λ = 2σ{8(1 + c_0′) d_a d² q* c̄ c^* n log(N_d)}^{1/2} with a fixed c_0′ ≥ c_0. Suppose that all of the conditions in Theorem 2.2 hold. We then have

||β̂ − β||_2 = O_p(√(q log(N_d)/n))  and  ||Xβ̂ − Xβ||_2 = O_p(√(q log(N_d))).

This corollary follows by substituting the given λ value into the expressions in the results of Theorem 2.2.

3. Selection consistency of the adaptive group Lasso

As shown in the previous section, the group Lasso has excellent selection and estimation properties. However, there is room for improvement, particularly with regard to selection. Although the group Lasso selects a model whose dimension is comparable to that of the underlying model, the simulation results reported in Yuan and Lin (2006) and those reported below suggest that it tends to select more groups than there are in the underlying model. To correct the tendency of overselection by the group Lasso, we generalize the idea of the adaptive Lasso (Zou (2006)) for individual variable selection to the present problem of group selection.

Consider a general group Lasso criterion with a weighted penalty term,

(1/2)(Y − Xβ)′(Y − Xβ) + λ Σ_{k=1}^p w_k √d_k ||β_k||_2, (3.1)

where w_k is the weight associated with the kth group. The quantity λ_k ≡ λw_k can be regarded as the penalty level applied to the kth group. For different groups, the penalty level λ_k can be different. If we can apply a lower penalty to groups with large coefficients and a higher penalty to groups with small coefficients (in the ℓ2 sense), then we expect to improve selection accuracy and reduce estimation bias. One way to obtain information about whether a group has large or small coefficients is to use a consistent initial estimator.

Suppose that an initial estimate β̃ is available. A simple approach to determining the weight is to use the initial estimator. Consider

w_k = 1/||β̃_k||_2,  k = 1, …, p. (3.2)

Thus, for each group, its penalty is proportional to the inverse of the norm of β̃k. This choice of the penalty level for each group is a natural generalization of the adaptive Lasso (Zou (2006)). In particular, when each group only contains a single variable, (3.2) simplifies to the adaptive Lasso penalty.
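The weighted criterion (3.1) with the weights (3.2) reduces to a standard group Lasso after a scale transformation; this is also how the estimator is computed in Section 4. For the groups with ||β̃_k||_2 > 0, write (in our notation) β_k* = w_kβ_k and X_k* = X_k/w_k. Then, with the sums running over these groups,

(1/2)||Y − Σ_k X_kβ_k||_2² + λ Σ_k w_k√d_k ||β_k||_2 = (1/2)||Y − Σ_k X_k*β_k*||_2² + λ Σ_k √d_k ||β_k*||_2,

so a group Lasso solver applied to the rescaled covariates gives β̂_k*, from which β̂_k = β̂_k*/w_k; groups with β̃_k = 0 receive an infinite penalty and are set to zero (see Section 4).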

Let θ_a = max_{k∈A_0^c} ||β_k||_2 and θ_b = min_{k∈A_0^c} ||β_k||_2. We say that an initial estimator β̃ is consistent at zero with rate r_n if r_n max_{k∈A_0} ||β̃_k||_2 = O_p(1), where r_n → ∞ as n → ∞, and there exists a constant ξ_b > 0 such that, for any ε > 0, P(min_{k∈A_0^c} ||β̃_k||_2 > ξ_bθ_b) > 1 − ε for n sufficiently large.

In addition to (C1), we assume the following conditions:

  • (C2)

    the initial estimator β̃ is consistent at zero with rate rn → ∞;

  • (C3)
    √(d_a(log q)/n)/θ_b → 0,  λ̃ d_a^{3/2} q/(nθ_b²) → 0,  √(n d log(p − q))/(λ̃ r_n) → 0,  d_a^{5/2} q²/(r_n θ_b √d_b) → 0;
  • (C4)

    all of the eigenvalues of Σ_{A_0^c A_0^c} are bounded away from zero and infinity.

Condition (C2) assumes that an initial zero-consistent estimator exists. It is the most critical one and is generally difficult to establish. It assumes that we can consistently differentiate between important and non-important groups. For fixed p and dk, the ordinary least-squares estimator can be used as the initial estimator. However, when p > n, the least-squares estimator is no longer feasible. By Theorems 2.1 and 2.2, the group Lasso estimator β̂ is consistent at zero with rate √(n/(q log(N_d))). Condition (C3) restricts the numbers of important and non-important groups, as well as variables within the groups. It also places constraints on the penalty parameter and the ℓ2-norm of the smallest important group. Condition (C4) assumes that the eigenvalues of Σ_{A_0^c A_0^c} are finite and bounded away from zero. This is reasonable since the number of important groups is small in a sparse model. This condition ensures that the true model is identifiable.

Define

β̂* = arg min_β (1/2)(Y − Xβ)′(Y − Xβ) + λ̃ Σ_{k=1}^p (√d_k/||β̃_k||_2) ||β_k||_2. (3.3)

Theorem 3.1

If (C1)–(C4) and NSC (2.3) are satisfied, then

P(||β̂_k*||_2 ≠ 0, k ∉ A_0, and ||β̂_k*||_2 = 0, k ∈ A_0) → 1.

Therefore, the adaptive group Lasso is selection consistent under the stated conditions.

If we use β̂ as the initial estimator, then (C3) can be changed to (C3)*

√(d_a(log q)/n)/θ_b → 0,  λ̃ d_a^{3/2} q/(nθ_b²) → 0,  √(d q log(p − q) log(N_d))/λ̃ → 0,  (d_a q)^{5/2} √(log(N_d))/(θ_b √(n d_b)) → 0.

We often have λ̃ = n^α for some 0 < α < 1/2. In this case, the number of non-important groups can be as large as exp(n^{2α}/(q log q)) with the number of important groups satisfying q^5 log q/n → 0, assuming that θ_b and the numbers of variables within the groups are finite.

Corollary 3.1

Let the initial estimator β̃ = β̂, where β̂ is the group Lasso estimator. Suppose that the NSC (2.3) holds and that (C1), (C2), (C3)* and (C4) are satisfied. We then have

P(||β̂_k*||_2 ≠ 0, k ∉ A_0, and ||β̂_k*||_2 = 0, k ∈ A_0) → 1.

This corollary follows directly from Theorem 3.1. It shows that the iterated group Lasso procedure that uses a combination of the group Lasso and the adaptive group Lasso is selection consistent.

Theorem 3.2

Suppose that the conditions in Theorem 2.2 hold and that θ_b > t_b for some constant t_b > 0. If λ̃ = O(n^α) for some 0 < α < 1/2, then

||β̂* − β||_2 = O_p(√(q/n + λ̃²/n²)) = O_p(√(q/n)),  ||Xβ̂* − Xβ||_2 = O_p(√(q + λ̃²/n)) = O_p(√q).

Theorem 3.2 implies that, given a zero-consistent initial estimator, the adaptive group Lasso reduces a high-dimensional problem to a lower-dimensional one. The convergence rate is improved, compared with that of the group Lasso, by choosing an appropriate penalty parameter λ̃.

4. Simulation studies

In this section, we use simulation to evaluate the finite-sample performance of the group Lasso and the adaptive group Lasso. Let λ_k = λ̃/||β̂_k||_2 if ||β̂_k||_2 > 0; if ||β̂_k||_2 = 0, then λ_k = ∞ and β̂_k* = 0. We can thus drop the corresponding covariates X_k from the model and only consider the groups with ||β̂_k||_2 > 0. After a scale transformation, we can directly apply the group least angle regression algorithm (Yuan and Lin (2006)) to compute the adaptive group Lasso estimator β̂*. The penalty parameters for the group Lasso and the adaptive group Lasso are selected using the BIC criterion (Schwarz (1978)).
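A sketch of this two-step computation is given below, using the proximal-gradient group Lasso solver sketched after (2.1) in place of the group least angle regression algorithm actually used in the paper. The RSS-based BIC formula and the tuning grid are our own assumptions and may differ from the exact criterion used by the authors.

    import numpy as np

    def bic(y, X, beta):
        """A simple RSS-based BIC with the number of nonzero coefficients as df."""
        n = len(y)
        rss = np.sum((y - X @ beta) ** 2)
        return n * np.log(rss / n) + np.sum(beta != 0) * np.log(n)

    def adaptive_group_lasso(X, y, groups, lam_grid):
        """Two-step procedure: group Lasso initial fit, then adaptive group Lasso
        computed as a group Lasso on rescaled covariates, both tuned by BIC."""
        # Step 1: group Lasso initial estimator.
        fits = [group_lasso(X, y, groups, lam) for lam in lam_grid]
        beta_init = min(fits, key=lambda b: bic(y, X, b))
        # Step 2: keep groups with nonzero initial norm; rescale X_k by ||beta_init_k||.
        keep = [k for k, g in enumerate(groups) if np.linalg.norm(beta_init[g]) > 0]
        if not keep:
            return np.zeros(X.shape[1])
        norms = {k: np.linalg.norm(beta_init[groups[k]]) for k in keep}
        cols, sub_groups, start = [], [], 0
        for k in keep:
            g = list(groups[k])
            cols.extend(g)
            sub_groups.append(np.arange(start, start + len(g)))
            start += len(g)
        scale = np.concatenate([np.full(len(groups[k]), norms[k]) for k in keep])
        Xs = X[:, cols] * scale            # X_k* = X_k / w_k = X_k * ||beta_init_k||
        fits2 = [group_lasso(Xs, y, sub_groups, lam) for lam in lam_grid]
        beta_star = min(fits2, key=lambda b: bic(y, Xs, b))
        # Map back: beta_k = beta_k* / w_k = beta_k* * ||beta_init_k||.
        beta_hat = np.zeros(X.shape[1])
        for k, g_sub in zip(keep, sub_groups):
            beta_hat[list(groups[k])] = beta_star[g_sub] * norms[k]
        return beta_hat

With these pieces, one replication of the simulation study amounts to generating (X, Y) as in the examples below, calling adaptive_group_lasso over a grid of penalty levels, and recording which groups have nonzero fitted norms.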

We consider two scenarios of simulation models. In the first scenario, the group sizes are equal; in the second, the group sizes vary. For every scenario, we consider the cases p < n and p > n. In all of the examples, the sample size is n = 200.

Example 1

In this example, there are 10 groups, each consisting of 5 covariates. The covariate vector is X = (X_1, …, X_10), where X_j = (X_{5(j−1)+1}, …, X_{5(j−1)+5}), 1 ≤ j ≤ 10. To generate X, we first simulate 50 random variables, R_1, …, R_50, independently from N(0,1). Then, Z_j, j = 1, …, 10, are simulated from a multivariate normal distribution with mean zero and cov(Z_{j1}, Z_{j2}) = 0.6^{|j1−j2|}. The covariates X_1, …, X_50 are generated as

X_{5(j−1)+k} = (Z_j + R_{5(j−1)+k})/√2,  1 ≤ j ≤ 10, 1 ≤ k ≤ 5.

The random error ε ~ N(0, 3²). The response variable Y is generated from Y = Σ_{k=1}^{10} X_k′β_k + ε, where β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 2, 2, 2, 2), β3 = · · · = β10 = (0, 0, 0, 0, 0).
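For concreteness, one replication of Example 1 can be generated as follows; the division by √2 reflects our reading of the construction above (so that each covariate combines the group factor Z_j and an independent N(0,1) variable), and all function and variable names are ours.

    import numpy as np

    def example1_data(n=200, sigma=3.0, seed=0):
        """One replication of Example 1: 10 groups of 5 covariates, 2 active groups."""
        rng = np.random.default_rng(seed)
        idx = np.arange(10)
        cov_Z = 0.6 ** np.abs(idx[:, None] - idx[None, :])        # cov(Z_j1, Z_j2) = 0.6^|j1-j2|
        Z = rng.multivariate_normal(np.zeros(10), cov_Z, size=n)  # n x 10 group factors
        R = rng.standard_normal((n, 50))                          # independent N(0,1) variables
        X = (np.repeat(Z, 5, axis=1) + R) / np.sqrt(2)            # X_{5(j-1)+k} = (Z_j + R_{5(j-1)+k})/sqrt(2)
        beta = np.zeros(50)
        beta[:5] = [0.5, 1.0, 1.5, 2.0, 2.5]                      # beta_1
        beta[5:10] = 2.0                                          # beta_2
        y = X @ beta + sigma * rng.standard_normal(n)             # epsilon ~ N(0, 3^2)
        groups = [np.arange(5 * j, 5 * j + 5) for j in range(10)]
        return X, y, beta, groups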

Example 2

In this example, the number of groups is p = 10. Each group consists of 5 covariates. The covariates are generated in the same way as in Example 1. However, the regression coefficients are β1 = (0.5, 1, 1.5, 1, 0.5), β2 = (1, 1, 1, 1, 1), β3 = (−1, 0, 1, 2, 1.5), β4 = (−1.5, 1, 0.5, 0.5, 0.5), β5 = · · · = β10 = (0, 0, 0, 0, 0).

Example 3

In this example, the number of groups, p = 210, is larger than the sample size n. Each group consists of 5 covariates. The covariates are generated in the same way as in Example 1. However, the regression coefficients are β1 = (0.5, 1, 1.5, 1, 0.5), β2 = (1, 1, 1, 1, 1), β3 = (−1, 0, 1, 2, 1.5), β4 = (−1.5, 1, 0.5, 0.5, 0.5), β5 = · · · = β210 = (0, 0, 0, 0, 0).

Example 4

In this example, the group sizes differ across groups. There are 5 groups of size 5 and 5 groups of size 3. The covariate vector is X = (X_1, …, X_10), where X_j = (X_{5(j−1)+1}, …, X_{5(j−1)+5}), 1 ≤ j ≤ 5, and X_j = (X_{3(j−6)+26}, …, X_{3(j−6)+28}), 6 ≤ j ≤ 10. In order to generate X, we first simulate 40 random variables, R_1, …, R_40, independently from N(0,1). Then, Z_j, j = 1, …, 10, are simulated from a multivariate normal distribution with mean zero and cov(Z_{j1}, Z_{j2}) = 0.6^{|j1−j2|}. The covariates X_1, …, X_40 are generated as

X_{5(j−1)+k} = (Z_j + R_{5(j−1)+k})/√2,  1 ≤ j ≤ 5, 1 ≤ k ≤ 5,
X_{3(j−6)+25+k} = (Z_j + R_{3(j−6)+25+k})/√2,  6 ≤ j ≤ 10, 1 ≤ k ≤ 3.

The random error ε ~ N(0, 3²). The response variable Y is generated from Y = Σ_{k=1}^{10} X_k′β_k + ε, where β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 0, 0, 2, 2), β3 = · · · = β5 = (0, 0, 0, 0, 0), β6 = (−1, −2, −3), β7 = · · · = β10 = (0, 0, 0).

Example 5

In this example, the number of groups is p = 10 and the group sizes differ across groups. The data are generated in the same way as in Example 4. However, the regression coefficients are β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 2, 2, 2, 2), β3 = (−1, 0, 1, 2, 3), β4 = (−1.5, 2, 0, 0, 0), β5 = (0, 0, 0, 0, 0), β6 = (2, −2, 1), β7 = (0, −3, 1.5), β8 = (−1.5, 1.5, 2), β9 = (−2, −2, −2), β10 = (0, 0, 0).

Example 6

In this example, the number of groups is p = 210 and the group sizes differ across groups. The data are generated in the same way as in Example 4. However, the regression coefficients are β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 2, 2, 2, 2), β3 = (−1, 0, 1, 2, 3), β4 = (−1.5, 2, 0, 0, 0), β5 = · · · = β100 = (0, 0, 0, 0, 0), β101 = (2, −2, 1), β102 = (0, −3, 1.5), β103 = (−1.5, 1.5, 2), β104 = (−2, −2, −2), β105 = · · · = β210 = (0, 0, 0).

The results, based on 400 replications, are given in Table 1. The columns of the table include the average number of groups selected, with standard error in parentheses ('mean'); the median number of groups selected, with the 25% and 75% quantiles of the number of selected groups in parentheses ('med'); the model error ('ME'); the percentage of occasions on which the correct groups are all included in the selected model ('% incl'); and the percentage of occasions on which exactly the correct groups are selected ('% sel'), with standard errors in parentheses.

Table 1.

Simulation study by the group Lasso and adaptive group Lasso for Examples 1–6. The true numbers of groups are included in [] in the first column

|            | Group Lasso |          |             |        |       | Adaptive group Lasso |          |             |        |       |
| σ = 3      | mean        | med      | ME          | % incl | % sel | mean                 | med      | ME          | % incl | % sel |
| Ex. 1, [2] | 2.04 (0.18) | 2 (2, 2) | 8.79 (0.94) | 100 (0) | 96.5 (0.18) | 2.01 (0.07) | 2 (2, 2) | 8.54 (0.90) | 100 (0) | 99.5 (0.07) |
| Ex. 2, [4] | 4.11 (0.34) | 4 (4, 4) | 8.52 (0.94) | 99.5 (0.07) | 88.5 (0.32) | 4.00 (0.14) | 4 (4, 4) | 8.10 (0.87) | 99.5 (0.07) | 98.00 (0.14) |
| Ex. 3, [4] | 4.00 (0.38) | 4 (4, 4) | 9.48 (1.19) | 93.0 (0.26) | 86.5 (0.34) | 3.94 (0.27) | 4 (4, 4) | 8.19 (0.96) | 93.0 (0.26) | 92.5 (0.26) |
| Ex. 4, [3] | 3.17 (0.45) | 3 (3, 3) | 8.78 (1.00) | 100 (0) | 85.3 (0.35) | 3.00 (0) | 3 (3, 3) | 8.36 (0.90) | 100 (0) | 100 (0) |
| Ex. 5, [8] | 8.88 (0.81) | 9 (8, 10) | 7.68 (0.94) | 100 (0) | 40.0 (0.49) | 8.03 (0.16) | 8 (8, 8) | 7.58 (0.86) | 100 (0) | 97.5 (0.16) |
| Ex. 6, [8] | 12.90 (12.42) | 9 (8, 11) | 14.61 (7.21) | 66.5 (0.47) | 7.0 (0.26) | 11.49 (12.68) | 8 (7, 8) | 9.28 (5.79) | 66.5 (0.47) | 47.0 (0.50) |

Several observations can be made from Table 1. First, in all six examples, the adaptive group Lasso performs better than the group Lasso in terms of model error and the percentage of correctly selected models. The group Lasso, which provides the initial estimator for the adaptive group Lasso, includes the correct groups with high probability, and the improvement from the adaptive step is considerable for models with unequal group sizes. Second, the results from the models with equal group sizes (Examples 1, 2 and 3) are better than those from the models with different group sizes (Examples 4, 5 and 6). Finally, when the dimension of the model increases, the performance of both methods becomes worse. This is to be expected since selection in models with a larger number of groups is more difficult.

5. Concluding remarks

We have studied the asymptotic selection and estimation properties of the group Lasso and adaptive group Lasso in ‘large p, small n’ linear regression models. For the adaptive group Lasso to be selection consistent, the initial estimator should possess two properties: (a) it does not miss important groups and variables; (b) it is estimation consistent, although it may not be group-selection or variable-selection consistent. Under the conditions stated in Theorem 2.1, the group Lasso is shown to satisfy these two requirements. Thus, the iterated group Lasso procedure, which uses the group Lasso to achieve dimension reduction and generate the initial estimates and then uses the adaptive group Lasso to achieve selection consistency, is an appealing approach to group selection in high-dimensional settings.

6. Proofs

We first introduce some notation which will be used in the proofs. Let {k: ||β̂_k||_2 > 0, k ≤ p} ⊆ A_1 ≡ {k: X_k′(Y − Xβ̂) = λ√d_k β̂_k/||β̂_k||_2} ∪ {1, …, q}. Set A_2 = {1, …, p}\A_1, A_3 = A_1\A_0, A_4 = A_1 ∩ A_0, A_5 = A_2\A_0 and A_6 = A_2 ∩ A_0. Thus, we have A_1 = A_3 ∪ A_4, A_3 ∩ A_4 = Ø, A_2 = A_5 ∪ A_6 and A_5 ∩ A_6 = Ø. Let |A_i| = Σ_{k∈A_i} d_k, N(A_i) = #{k: k ∈ A_i}, i = 1, …, 6, and q_1 = N(A_1).

Proof of Theorem 2.1

The basic idea used in this proof follows the proof of the rate consistency of the Lasso in Zhang and Huang (2008). However, there are many differences in technical details, for example, in the characterization of the solution via the Karush–Kuhn–Tucker (KKT) conditions, in the constraint needed for the penalty level and in the use of maximal inequalities.

The proof consists of three steps. Step 1 proves some inequalities related to q_1, ω̂ and ζ_2. Step 2 translates the results of Step 1 into upper bounds for q̂, ω̂ and ζ_2. Step 3 completes the proof by showing that the probability of the event in Step 2 converges to 1. The details of the complete proof are available from the website www.stat.uiowa.edu/techrep. We sketch the proof in the following.

If β̂ is a solution of (2.1), then, by the KKT conditions, X_k′(Y − Xβ̂) = λ√d_k β̂_k/||β̂_k||_2 if ||β̂_k||_2 > 0 and ||X_k′(Y − Xβ̂)||_2 ≤ λ√d_k if ||β̂_k||_2 = 0. We then have

Σ_{11}^{−1}S_{A_1}/n = (β_{A_1} − β̂_{A_1}) + Σ_{11}^{−1}Σ_{12}β_{A_2} + Σ_{11}^{−1}X_{A_1}′ε/n, (6.1)
nΣ_{22}β_{A_2} − nΣ_{21}Σ_{11}^{−1}Σ_{12}β_{A_2} ≤ C_{A_2} − X_{A_2}′ε − Σ_{21}Σ_{11}^{−1}S_{A_1} + Σ_{21}Σ_{11}^{−1}X_{A_1}′ε, (6.2)

where S_{A_i} = (S_{k_1}′, …, S_{k_{q_i}}′)′ with S_{k_i} = λ√d_{k_i} s_{k_i}, s_k = X_k′(Y − Xβ̂)/(λ√d_k), C_{A_i} = (C_{k_1}′, …, C_{k_{q_i}}′)′ with C_{k_i} = λ√d_{k_i} I(||β̂_{k_i}||_2 = 0) e_{d_{k_i}×1}, all of the elements of the vector e_{d_{k_i}×1} equal 1, k_i ∈ A_i, and Σ_{ij} ≡ X_{A_i}′X_{A_j}/n.

Step 1

Define

V_{1j} = Σ_{11}^{−1/2}Q_{A_j1}′S_{A_j}/n,  j = 1, 3, 4,   ω_k = (I − P_{A_1})X_{A_k}β_{A_k},  k = 2, …, 6,

where Q_{A_kj} is the matrix representing the selection of the variables in A_k from A_j. Define u = (X_{A_1}Σ_{11}^{−1}Q_{A_41}′S_{A_4}/n − ω_2)/||X_{A_1}Σ_{11}^{−1}Q_{A_41}′S_{A_4}/n − ω_2||_2. From (6.1) and (6.2), we have V_{14}′(V_{13} + V_{14}) ≤ S_{A_4}′Q_{A_41}Σ_{11}^{−1}Σ_{12}β_{A_2} + S_{A_4}′Q_{A_41}Σ_{11}^{−1}X_{A_1}′ε/n + √d_a λ Σ_{k∈A_4}||β_k||_2 and ||ω_2||_2² ≤ β_{A_2}′(C_{A_2} − X_{A_2}′ε − Σ_{21}Σ_{11}^{−1}S_{A_1} + Σ_{21}Σ_{11}^{−1}X_{A_1}′ε). Then, under the GSC,

||V_{14}||_2² + ||ω_2||_2² ≤ (||V_{14}||_2² + ||ω_2||_2²)^{1/2} u′ε + (||V_{14}||_2 + ||P_{A_1}X_{A_2}β_{A_2}||_2)(λ²d_aN(A_3)/(nc_*(|A_1|)))^{1/2} + √d_a λη_1 + λ√d_a ||β_{A_5}||_2. (6.3)

Step 2

Define B_1² = λ²d_b q/(nc_*(|A_1|)) and B_2² = λ²d_b q/(nc_*(|A_0 ∪ A_1|)). In this step, we consider the event (u′ε)² ≤ (|A_1| ∨ d_b)B_1²/(4qd_a). Suppose that the set A_1 contains all large β_k ≠ 0. From (6.3), ||V_{14}||_2² ≤ B_1² + 4√d_a λη_1 + 4√d η_2B_2 + 4dB_2², so we have

(q_1 − q)_+ ≤ q + (nc_*(|A_1|)/(λ²d_b))(4√d_a λη_1 + 4(λ²d_a q/(nc_*(|A_1|)))^{1/2} η_2 + 4λ²d_a q/(nc_*(|A_1|))). (6.4)

For general A_1, let C_5 = c^*(|A_5|)/c_*(|A_1| ∨ |A_5|). From (6.3),

||ω_2||_2² ≤ (4/3)(B_1²/2 + dB_2² + √(d(1 + C_5)) η_2B_2 + 2√d_a λη_1) + (32/9)dC_5B_2². (6.5)

From Zhang and Huang (2008), ||ω_2||_2² ≥ (||β_{A_5}||_2(nc_{*,5})^{1/2} − η_2)² and ||X_{A_2}β_{A_2}||_2 ≤ η_2 + ||X_{A_5}β_{A_5}||_2 ≤ η_2 + (nc^*(|A_5|))^{1/2}||β_{A_5}||_2. By the Cauchy–Schwarz inequality, we then have

||β_{A_5}||_2² ≤ (2/(nc_{*,5}))[(4/3)(λ²d_a q/(nc_{*,5}))^{1/2}(1 + c^*(|A_5|)/c_*(|A_1|))^{1/2} + 2η_2]² + (8/3)[B_1²/4 + √d_a λη_1 + η_2(λ²d_a q/(nc_*(|A_1|)))^{1/2} + λ²d_a q/(2nc_*(|A_1|)) − (3/4)η_2²], (6.6)

where c_{*,5} = c_*(|A_1 ∪ A_5|).

Step 3

Letting c_*(|A_m|) = c_* and c^*(|A_m|) = c^* for N(A_m) ≤ q*, we have

q_1 ≤ N(A_1 ∪ A_5) ≤ q*,   (u′ε)² ≤ (|A_1| ∨ d_b)λ²d_b/(4d_a nc_*(|A_1|)). (6.7)

We then have c̄ = C_5 = c^*(|A_5|)/c_*(|A_1| ∨ |A_5|) = c^*/c_* and c_{*,5} = c_*(|A_1 ∪ A_5|) = c_*. From (6.4), (6.5) and (6.6), (q_1 − q)_+ + q ≤ M_1q, ||ω_2||_2² ≤ M_2B_1² and nc_*||β_{A_5}||_2² ≤ M_3B_1² when (2.12) is satisfied. Define

x_m ≡ max_{|A| = m} max_{||U_{A_k}||_2 = 1, k = 1, …, m} |ε′(X_A(X_A′X_A)^{−1}S̄_A − (I − P_A)Xβ)| / ||X_A(X_A′X_A)^{−1}S̄_A − (I − P_A)Xβ||_2 (6.8)

for |A| = q_1 = m ≥ 0, where S̄_A = (S̄_{A_1}′, …, S̄_{A_m}′)′, S̄_{A_k} = λ√d_{A_k} U_{A_k} and ||U_{A_k}||_2 = 1. Let Q_A = X̃_A(X̃_A′X̃_A)^{−1}, where X̃_k = λ√d_k X_k. For a given A, let V_{lj} = (0, …, 0, 1, 0, …, 0)′ be the |A| × 1 vector whose jth element in the lth group equals 1. Then, by (6.8),

x_m ≤ max_{|A| = m} max_{l,j} { (|ε′Q_AV_{lj}| / ||Q_AV_{lj}||_2) · ||Q_AV_{lj}||_2 (Σ_{l∈A} d_l)^{1/2} / ||Q_AU_A||_2 + |ε′(I − P_A)Xβ| / ||(I − P_A)Xβ||_2 }.

If we define Ω_{m_0} = {(X, ε): x_m ≤ σ√(8(1 + c_0)V²((md_b) ∨ d_b) log(N_d ∨ a_n)) for all m ≥ m_0}, then (X, ε) ∈ Ω_{m_0} implies (u′ε)² ≤ 2(x_m)² < (|A_1| ∨ d_b)λ²d_b/(4d_a nc_*) for N(A_1) ≥ m_0 ≥ 0. By the definition of x_m, it is less than the maximum of (p choose m)Σ_{k∈A}d_k normal variables with mean 0 and variance σ²V_ε², plus the maximum of (p choose m) normal variables with mean 0 and variance σ². It follows that P{(X, ε) ∈ Ω_{m_0}} → 1 when (6.7) holds. This completes the sketch of the proof of Theorem 2.1.

Proof of Theorem 2.2

Consider the case when {c_*, c^*, r_1, r_2, c_0, d} are fixed. The required configurations in Theorem 2.1 then become

M_1q + 1 < q*,   η_1 ≤ r_1² c_* qλ/n,   η_2² ≤ r_2² c_* qλ²/n². (6.9)

Let A_1 = {k: ||β̂_k||_2 > 0 or k ∉ A_0}. Define v_1 = X_{A_1}(β̂_{A_1} − β_{A_1}) and g_1 = X_{A_1}′(Y − Xβ̂). We then have ||v_1||_2² ≥ c_*n||β̂_{A_1} − β_{A_1}||_2², (β̂_{A_1} − β_{A_1})′g_1 = v_1′(Xβ − X_{A_1}β_{A_1} + ε) − ||v_1||_2² and the blocks of g_1 satisfy max_k ||(g_1)_k||_2 ≤ max_{k: ||β̂_k||_2>0} ||λ√d_k β̂_k/||β̂_k||_2||_2 = λ√d_a. Therefore, ||v_1||_2 ≤ η_2 + ||P_{A_1}ε||_2 + λ√(d_aN(A_1)/(nc_*)). Since ||P_{A_1}ε||_2 ≤ 2σ√(N(A_1) log(N_d)) with probability converging to 1 under the normality assumption, ||X(β̂ − β)||_2 ≤ 2η_2 + ||P_{A_1}ε||_2 + λ√(d_aN(A_1)/(nc_*)). We then have

(Σ_{k∈A_1} ||β̂_k − β_k||_2²)^{1/2} ≤ ||v_1||_2/√(nc_*) ≤ (1/√(nc_*))(η_2 + 2σ√(N(A_1) log(N_d)) + √(dM_1c̄) B_1). (6.10)

Since A_2 ⊂ A_0, by the second inequality in (6.9), #{k ∈ A_0: ||β_k||_2 > λ/n} ≤ r_1²q/c_* = O(q). By the SRC and the third inequality in (6.9), Σ_{k∈A_0} ||β_k||_2² I{||β_k||_2 > λ/n} ≤ Σ_{k∈A_0} ||X_kβ_k I{||β_k||_2 > λ/n}||_2²/(nc_*) ≤ r_2²qλ²/(n²c_*c^*) and Σ_{k∈A_0} ||β_k||_2² I{||β_k||_2 ≤ λ/n} ≤ r_1²qλ²/(c_*n²). From (6.10), we then have

||β̂ − β||_2 ≤ (1/√(nc_*))(2σ√(M_1 log(N_d) q) + (r_2 + √(dM_1c̄)) B_1) + √((c^* r_1² + r_2²)/(c_* c^*)) · qλ/n,   ||Xβ̂ − Xβ||_2 ≤ 2σ√(M_1 log(N_d) q) + (2r_2 + √(dM_1c̄)) B_1.

This completes the proof of Theorem 2.2.

Proof of Theorem 3.1

Let û = β̂* − β, W = X′ε/n, V(u) = Σ_{i=1}^n [(ε_i − x_i′u)² − ε_i²] + Σ_{k=1}^p λ_k√d_k ||u_k + β_k||_2 and û = arg min_u (ε − Xu)′(ε − Xu) + Σ_{k=1}^p λ_k√d_k ||u_k + β_k||_2, where λ_k = λ̃/||β̃_k||_2. By the KKT conditions, if there exists û such that

Σ_{A_0^cA_0^c}(√n û_{A_0^c}) − √n W_{A_0^c} = −S_{A_0^c}/√n,  with ||û_k||_2 < ||β_k||_2 for k ∈ A_0^c, (6.11)
−C_{A_0}/√n ≤ Σ_{A_0A_0^c}(√n û_{A_0^c}) − √n W_{A_0} ≤ C_{A_0}/√n, (6.12)

then ||β̂_k*||_2 ≠ 0 for k = 1, …, q and ||β̂_k*||_2 = 0 for k = q + 1, …, p.

From (6.11) and (6.12), √n û_{A_0^c} − Σ_{A_0^cA_0^c}^{−1}√n W_{A_0^c} = −n^{−1/2}Σ_{A_0^cA_0^c}^{−1}S_{A_0^c} and Σ_{A_0A_0^c}(√n û_{A_0^c}) − √n W_{A_0} = −n^{−1/2}X_{A_0}′(I − P_{A_0^c})ε − n^{−1/2}Σ_{A_0A_0^c}Σ_{A_0^cA_0^c}^{−1}S_{A_0^c}. Define the events

E_1 = {n^{−1/2}||(Σ_{A_0^cA_0^c}^{−1}X_{A_0^c}′ε)_k||_2 < √n||β_k||_2 − n^{−1/2}||(Σ_{A_0^cA_0^c}^{−1}S_{A_0^c})_k||_2, k ∈ A_0^c},
E_2 = {n^{−1/2}||(X_{A_0}′(I − P_{A_0^c})ε)_k||_2 < n^{−1/2}||C_k||_2 − n^{−1/2}||(Σ_{A_0A_0^c}Σ_{A_0^cA_0^c}^{−1}S_{A_0^c})_k||_2, k ∈ A_0},

where (·)_k denotes the d_k-dimensional subvector of the vector (·) corresponding to the kth group. We then have P(||β̂_k*||_2 ≠ 0, k ∉ A_0, and ||β̂_k*||_2 = 0, k ∈ A_0) ≥ P(E_1 ∩ E_2) and P(E_1 ∩ E_2) = 1 − P(E_1^c ∪ E_2^c) ≥ 1 − P(E_1^c) − P(E_2^c).

First, we consider P(E_1^c). Define R = {||β̃_k||_2^{−1} ≤ c_1θ_b^{−1}, k ∈ A_0^c}, where c_1 is a constant. Then P(E_1^c) = P(E_1^c ∩ R) + P(E_1^c ∩ R^c) ≤ P(E_1^c ∩ R) + P(R^c). By (C2), P(R^c) → 0. Let N_q = Σ_{k=1}^q d_k, let τ_1 ≤ ⋯ ≤ τ_{N_q} be the eigenvalues of Σ_{A_0^cA_0^c} and let γ_1, …, γ_{N_q} be the associated eigenvectors. The jth element in the lth group of the vector Σ_{A_0^cA_0^c}^{−1}S_{A_0^c} is u_{lj} = Σ_{l′=1}^{N_q} τ_{l′}^{−1}(γ_{l′}′S_{A_0^c})γ_{l′j}. By the Cauchy–Schwarz inequality, u_{lj}² ≤ τ_1^{−2}Σ_{l′=1}^{N_q}||γ_{l′}||_2²||S_{A_0^c}||_2² = τ_1^{−2}N_q||S_{A_0^c}||_2² ≤ τ_1^{−2}N_q(Σ_{k=1}^q λ_k²d_k). Therefore, ||u_k||_2² ≤ d_kτ_1^{−2}q²d_a²(λ̃c_1θ_b^{−1})².

If we define v_{A_0^c} = √nθ_b − n^{−1/2}c_1τ_1^{−1}qd_a^{3/2}λ̃θ_b^{−1}, η_{A_0^c} = n^{−1/2}Σ_{A_0^cA_0^c}^{−1}X_{A_0^c}′ε, ξ_{A_0} = n^{−1/2}X_{A_0}′(I − P_{A_0^c})ε and C_{A_0^c} = {max_{k∈A_0^c}||η_k||_2 > v_{A_0^c}}, then P(E_1^c ∩ R) ≤ P(C_{A_0^c}). By Lemmas 1 and 2 of Huang, Ma and Zhang (2008), P(C_{A_0^c}) ≤ K(d_a log q)^{1/2}/v_{A_0^c}, where K is a constant, and K(d_a log q)^{1/2}/v_{A_0^c} → 0 from (C3). We then have P(E_1^c ∩ R) → 0 and hence P(E_1^c) → 0.

Next, we consider P(E_2^c). Similarly to the above, define D = {||β̃_k||_2^{−1} > r_n, k ∈ A_0} ∩ R. Then P(E_2^c) ≤ P(E_2^c ∩ D) + P(D^c) and, by (C2), P(D^c) → 0. Moreover, Σ_{l=1}^{N_q}(1/n)Σ_{i=1}^n(X_{A_0})_{ij}(X_{A_0^c})_{il}u_l ≤ Σ_{l=1}^{N_q}|u_l| ≤ τ_1^{−1}q²d_a²λ̃c_1θ_b^{−1}, where u_l is the lth element of the vector Σ_{A_0^cA_0^c}^{−1}S_{A_0^c}. If we define v_{A_0} = n^{−1/2}λ̃r_n√d_b − n^{−1/2}τ_1^{−1}q²d_a^{5/2}λ̃c_1θ_b^{−1} and C_{A_0} = {max_{k∈A_0}||ξ_k||_2 > v_{A_0}}, then P(E_2^c ∩ D) ≤ P(C_{A_0}) and P(C_{A_0}) ≤ K(d_a log(p − q))^{1/2}/v_{A_0} → 0 from (C3). We then have P(E_2^c ∩ D) → 0 and hence P(E_2^c) → 0. This completes the proof of Theorem 3.1.

Proof of Theorem 3.2

If we let Â = {k: ||β̂_k||_2 > 0, k = 1, …, p}, then Σ_{k∈Â^c}||β̂_k||_2 = 0, the dimension of problem (3.1) is reduced to N(Â) ≤ q*, and Â^c ⊂ A_0. By the definition of β̂*, we have

(1/2)||Y − X_Âβ̂*_Â||_2² + λ̃Σ_{k∈Â}(√d_k/||β̃_k||_2)||β̂_k*||_2 ≤ (1/2)||Y − X_Âβ_Â||_2² + λ̃Σ_{k∈Â}(√d_k/||β̃_k||_2)||β_k||_2, (6.13)
η ≡ λ̃Σ_{k∈Â}(√d_k/||β̃_k||_2)(||β_k||_2 − ||β̂_k*||_2) ≤ λ̃Σ_{k∈Â}(√d_k/||β̃_k||_2)||β̂_k* − β_k||_2. (6.14)

If we let δ_Â = Σ_{ÂÂ}^{1/2}(β̂*_Â − β_Â) and D = Σ_{ÂÂ}^{−1/2}X_Â′, then ||Y − X_Âβ̂*_Â||_2²/2 − ||Y − X_Âβ_Â||_2²/2 = δ_Â′δ_Â/2 − (Dε)′δ_Â. By (6.13) and (6.14), δ_Â′δ_Â/2 − (Dε)′δ_Â − η ≤ 0, so ||δ_Â − Dε||_2² − ||Dε||_2² − 2η ≤ 0. By the triangle inequality, ||δ_Â||_2 ≤ ||δ_Â − Dε||_2 + ||Dε||_2. Thus, ||δ_Â||_2² ≤ 6||Dε||_2² + 6η.

Let D_i be the ith column of D. E(||Dε||_2²) = σ²tr(DD′) = σ²q̂. Then, with probability converging to 1, ||β̂*_Â − β_Â||_2² ≤ 6σ²M_1q/(nc_*) + (λ̃√d_a/(ξ_bθ_bnc_*))²/2 + ||β̂*_Â − β_Â||_2²/2.

Thus, for λ̃ = n^α for some 0 < α < 1/2, with probability converging to 1,

||β̂*_Â − β_Â||_2² ≤ 6σ²M_1q/(c_*n) + d_a(ξ_bθ_bc_*)^{−2}(λ̃/n)² ≤ O(q/n)

and ||X_Âβ̂*_Â − X_Âβ_Â||_2 ≤ √(nc^*)||β̂*_Â − β_Â||_2 ≤ O(√q). This completes the proof of Theorem 3.2.

Acknowledgments

The authors are grateful to Professor Cun-Hui Zhang for sharing his insights into the problem and related topics. The work of Jian Huang is supported in part by NIH Grant R01CA120988 and NSF Grants DMS-07-06108 and 0805670.

Contributor Information

FENGRONG WEI, Email: fwei@westga.edu.

JIAN HUANG, Email: jian-huang@uiowa.edu.

References

  1. Antoniadis A, Fan J. Regularization of wavelet approximation (with discussion). J Amer Statist Assoc. 2001;96:939–967.
  2. Bühlmann P, Meier L. Discussion of “One-step sparse estimates in nonconcave penalized likelihood models,” by H. Zou and R. Li. Ann Statist. 2008;36:1534–1541. doi: 10.1214/009053607000000802.
  3. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  4. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann Statist. 2004;32:928–961.
  5. Greenshtein E, Ritov Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
  6. Huang J, Horowitz JL, Ma SG. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Statist. 2008;36:587–613.
  7. Huang J, Ma S, Zhang CH. Adaptive lasso for sparse high-dimensional regression models. Statist Sinica. 2006;18:1603–1618.
  8. Kim Y, Kim J, Kim Y. The blockwise sparse regression. Statist Sinica. 2006;16:375–390.
  9. Knight K, Fu WJ. Asymptotics for lasso-type estimators. Ann Statist. 2001;28:1356–1378.
  10. Meier L, van de Geer S, Bühlmann P. Group Lasso for logistic regression. J R Stat Soc Ser B. 2008;70:53–71.
  11. Meinshausen N, Buhlmann P. High dimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34:1436–1462.
  12. Schwarz G. Estimating the dimension of a model. Ann Statist. 1978;6:461–464.
  13. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B. 1996;58:267–288.
  14. van de Geer S. High-dimensional generalized linear models and the Lasso. Ann Statist. 2008;36:614–645.
  15. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B. 2006;68:49–67.
  16. Zhang CH. Technical Report 2007-003. Dept. Statistics, Rutgers Univ; 2007. Penalized linear unbiased selection.
  17. Zhang CH, Huang J. Model-selection consistency of the LASSO in high-dimensional linear regression. Ann Statist. 2008;36:1567–1594.
  18. Zhao P, Rocha G, Yu B. Grouped and hierarchical model selection through composite absolute penalties. Ann Statist. 2008;36:1567–1594.
  19. Zhao P, Yu B. On model selection consistency of LASSO. J Mach Learn Res. 2006;7:2541–2563.
  20. Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429.
  21. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 2006;67:301–320.
