Author manuscript; available in PMC: 2011 Nov 7.
Published in final edited form as: Bernoulli (Andover). 2010 Nov;16(4):1369–1384. doi: 10.3150/10-BEJ252

Consistent group selection in high-dimensional linear regression

FENGRONG WEI 1, JIAN HUANG 2
PMCID: PMC3209717  NIHMSID: NIHMS331983  PMID: 22072891

Abstract

In regression problems where covariates can be naturally grouped, the group Lasso is an attractive method for variable selection since it respects the grouping structure in the data. We study the selection and estimation properties of the group Lasso in high-dimensional settings when the number of groups exceeds the sample size. We provide sufficient conditions under which the group Lasso selects a model whose dimension is comparable with the underlying model with high probability and is estimation consistent. However, the group Lasso is, in general, not selection consistent and also tends to select groups that are not important in the model. To improve the selection results, we propose an adaptive group Lasso method which is a generalization of the adaptive Lasso and requires an initial estimator. We show that the adaptive group Lasso is consistent in group selection under certain conditions if the group Lasso is used as the initial estimator.

Keywords: group selection, high-dimensional data, penalized regression, rate consistency, selection consistency

1. Introduction

Consider the linear regression model with p groups of covariates

Y_i = Σ_{k=1}^p X_{ik}′β_k + ε_i,  i = 1, …, n,

where Yi is the response variable, εi is the error term, Xik is a dk × 1 covariate vector representing the kth group and βk is the corresponding dk × 1 vector of regression coefficients. For such a model, the group Lasso (Antoniadis and Fan (2001), Yuan and Lin (2006)) is an attractive method for variable selection since it respects the grouping structure in the covariates. This method is a natural extension of the Lasso (Tibshirani (1996)), in which an ℓ2-norm of the coefficients associated with a group of variables is used as a component in the penalty function. However, the group Lasso is, in general, not selection consistent and tends to select more groups than there are in the model. To improve the selection results, we consider an adaptive group Lasso method which is a generalization of the adaptive Lasso (Zou (2006)). We provide sufficient conditions under which the adaptive group Lasso is selection consistent if the group Lasso is used as the initial estimator.

The need to select groups of variables arises in many statistical modeling problems and applications. For example, in multifactor analysis of variance, a factor with multiple levels can be represented by a group of dummy variables. In nonparametric additive regression, each component can be expressed as a linear combination of a set of basis functions. In both cases, the selection of important factors or nonparametric components amounts to the selection of groups of variables. Several recent papers have considered group selection using penalized methods. In addition to the group Lasso, Yuan and Lin (2006) have proposed the group Lars and group non-negative garrote methods. Kim, Kim and Kim (2006) considered the group Lasso in the context of generalized linear models. Zhao, Rocha and Yu (2008) proposed a composite absolute penalty for group selection, which can be considered a generalization of the group Lasso. Meier, van de Geer and Bühlmann (2008) studied the group Lasso for logistic regression. Huang, Ma, Xie and Zhang (2008) proposed a group bridge method that can be used for simultaneous group and individual variable selection.

There has been much work on the penalized methods for variable selection and estimation with high-dimensional data. Several approaches have been proposed, including the least absolute shrinkage and selection operator (Lasso, Tibshirani (1996)), the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li (2001), Fan and Peng (2004)), the elastic net (Enet) penalty (Zou and Hastie (2006)) and the minimum concave penalty (Zhang (2007)). Much progress has been made in understanding the statistical properties of these methods in both fixed p and p ≫ n settings. In particular, several recent studies considered the Lasso with regard to its variable selection, estimation and prediction properties; see, for example, Knight and Fu (2001), Greenshtein and Ritov (2004), Meinshausen and Buhlmann (2006), Zhao and Yu (2006), Huang, Ma and Zhang (2006), van de Geer (2008) and Zhang and Huang (2008), among others. All of these studies are concerned with the Lasso for individual variable selection.

In this article, we study the asymptotic properties of the group Lasso and the adaptive group Lasso in high-dimensional settings when p ≫ n. We generalize the results concerning the Lasso obtained in Zhang and Huang (2008) to the group Lasso. We show that, under a generalized sparsity condition and the sparse Riesz condition, as well as certain regularity conditions, the group Lasso selects a model whose dimension has the same order as the underlying model, selects all groups whose ℓ2-norms are of greater order than the bias of the selected model and is estimation consistent. In addition, under a narrow-sense sparsity condition (see page 1371) and using the group Lasso as the initial estimator, the adaptive group Lasso can correctly select important groups with high probability.

Our theoretical and simulation results suggest the following two-step approach to group selection in high-dimensional settings. First, we use the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We then use the adaptive group Lasso to select the final set of groups of variables. Since the computation of the adaptive group Lasso estimator can be carried out using the same algorithm and program for the group Lasso, the computational cost of this two-step approach is approximately twice that of a single group Lasso computation. This approach, which applies the group Lasso twice in an iterative fashion, follows the idea of the adaptive Lasso (Zou (2006)) and a proposal by Bühlmann and Meier (2008) in the context of individual variable selection.

The rest of the paper is organized as follows. In Section 2, we state the results on the selection, bias of the selected model and convergence rate of the group Lasso estimator. In Section 3, we describe the selection and estimation consistency results concerning the adaptive group Lasso. In Section 4, we use simulation to compare the group Lasso and adaptive group Lasso. Concluding remarks are given in Section 5 and proofs are collected in Section 6.

2. The asymptotic properties of the group Lasso

Let Y = (Y1, …, Yn)′ and X = (X1, …, Xp), where Xk is the n × dk covariate submatrix corresponding to the kth group. For a given penalty level λ ≥ 0, the group Lasso estimator of β = (β_1′, …, β_p′)′ is

β̂ = arg min_β (1/2)(Y − Xβ)′(Y − Xβ) + λ Σ_{k=1}^p √d_k ||β_k||_2, (2.1)

where β̂ = (β̂_1′, …, β̂_p′)′.
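The group Lasso estimator in (2.1) is computed in Section 4 with the group least angle regression algorithm of Yuan and Lin (2006). Purely as an illustration of the criterion itself, the following sketch minimizes (2.1) by proximal gradient descent with a groupwise soft-thresholding step; the function names, the solver choice and the stopping rule are our own assumptions and are not part of the paper.

    import numpy as np

    def group_lasso(X, y, groups, lam, n_iter=2000, tol=1e-8):
        """Minimize (1/2)||y - X b||^2 + lam * sum_k sqrt(d_k) * ||b_k||_2.
        `groups` is a list of integer index arrays, one array per group."""
        n, p = X.shape
        beta = np.zeros(p)
        # Step size 1/L, where L is the Lipschitz constant of the gradient
        # of the least-squares term (largest eigenvalue of X'X).
        step = 1.0 / np.linalg.eigvalsh(X.T @ X)[-1]
        for _ in range(n_iter):
            z = beta + step * (X.T @ (y - X @ beta))    # gradient step
            new = np.zeros_like(beta)
            for g in groups:
                thr = step * lam * np.sqrt(len(g))
                norm_g = np.linalg.norm(z[g])
                if norm_g > thr:                        # groupwise soft-thresholding
                    new[g] = (1.0 - thr / norm_g) * z[g]
            if np.linalg.norm(new - beta) < tol:
                return new
            beta = new
        return beta

For a single penalty level λ, a call such as group_lasso(X, Y, groups, lam) returns an approximate minimizer of (2.1).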

We consider the model selection and estimation properties of β̂ under a generalized sparsity condition (GSC) of the model and a sparse Riesz condition (SRC) on the covariate matrix. These two conditions were first formulated in the study of the Lasso estimator (Zhang and Huang (2008)). The GSC assumes that, for some η1 ≥ 0, there exists an A0 ⊂ {1, …, p} such that Σ_{k∈A0} ||β_k||_2 ≤ η1, where || · ||_2 denotes the ℓ2-norm. Without loss of generality, let A0 = {q + 1, …, p}. The GSC is then

Σ_{k=q+1}^p ||β_k||_2 ≤ η1. (2.2)

The number of truly important groups is thus q. A more rigid way to describe sparsity is to assume η1 = 0, that is,

||β_k||_2 = 0,  k = q + 1, …, p. (2.3)

This is a special case of the GSC and we call it the narrow-sense sparsity condition (NSC). In practice, the GSC is a more realistic formulation of a sparse model. However, the NSC can often be considered a reasonable approximation to the GSC, especially when η1 is smaller than the noise level associated with model fitting.

The SRC controls the range of eigenvalues of submatrices of the Gram matrix. For A ⊂ {1,…, p}, we define X_A = (X_k, k ∈ A) and Σ_{AA} = X_A′X_A/n. Note that X_A is an n × Σ_{k∈A} d_k matrix. The design matrix X satisfies the sparse Riesz condition (SRC) with rank q* and spectrum bounds 0 < c_* < c^* < ∞ if

c_* ≤ ||X_A ν||_2² / (n||ν||_2²) ≤ c^*  for all A with #{k: k ∈ A} ≤ q* and all ν ∈ R^{Σ_{k∈A} d_k}. (2.4)
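The SRC (2.4) is a statement about all small collections of groups. The short sketch below (our own code, feasible only for small p and q*) computes the extreme eigenvalues of X_A′X_A/n over all collections A of at most q* groups; the smallest and largest values found are empirical candidates for c_* and c^*.

    import numpy as np
    from itertools import combinations

    def src_bounds(X, groups, q_star):
        """Extreme eigenvalues of X_A' X_A / n over all A with at most q_star groups."""
        n = X.shape[0]
        lo, hi = np.inf, 0.0
        for m in range(1, q_star + 1):
            for A in combinations(range(len(groups)), m):
                cols = np.concatenate([groups[k] for k in A])
                eig = np.linalg.eigvalsh(X[:, cols].T @ X[:, cols] / n)
                lo, hi = min(lo, eig[0]), max(hi, eig[-1])
        return lo, hi   # candidates for c_* and c^* in (2.4)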

Let Â = {k: ||β̂_k||_2 > 0, 1 ≤ k ≤ p}, which is the set of indices of the groups selected by the group Lasso. An important quantity is the cardinality of Â, defined as

q̂ = |Â| = #{k: ||β̂_k||_2 > 0, 1 ≤ k ≤ p}, (2.5)

which determines the dimension of the selected model. If q̂ = O(q), then the selected model has dimension comparable to the underlying model. Following Zhang and Huang (2008), we also consider two measures of the selected model. The first measures the error of the selected model:

ω̂ = ||(I − P̂)Xβ||_2, (2.6)

where P̂ is the projection matrix from R^n to the linear span of the set of selected groups and I ≡ I_{n×n} is the identity matrix. Thus, ω̂² is the sum of squares of the mean vector not accounted for by the selected model. To measure the important groups missing in the selected model, we define

ζ_2 = (Σ_{k∉A_0} ||β_k||_2² I{||β̂_k||_2 = 0})^{1/2}. (2.7)
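The three quantities (2.5)–(2.7) are easy to compute for a given fit when the true coefficients are known, as in a simulation. The sketch below (our own code and naming) takes the design matrix, the true and fitted coefficients, the group index sets and the index set A0 of unimportant groups, and returns q̂, ω̂² and ζ_2².

    import numpy as np

    def selection_measures(X, beta_true, beta_hat, groups, A0):
        """Return q_hat of (2.5), omega_hat^2 of (2.6) and zeta_2^2 of (2.7)."""
        selected = [k for k, g in enumerate(groups) if np.linalg.norm(beta_hat[g]) > 0]
        q_hat = len(selected)
        mu = X @ beta_true
        if selected:
            cols = np.concatenate([groups[k] for k in selected])
            proj = X[:, cols] @ np.linalg.pinv(X[:, cols])   # projection P_hat
            omega2 = float(np.sum((mu - proj @ mu) ** 2))
        else:
            omega2 = float(np.sum(mu ** 2))
        # Important groups (k not in A0) that the fit misses entirely.
        zeta2_sq = sum(np.linalg.norm(beta_true[g]) ** 2
                       for k, g in enumerate(groups)
                       if k not in A0 and k not in selected)
        return q_hat, omega2, zeta2_sq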

We now describe several quantities that will be useful in describing the main results. Let d_a = max_{1≤k≤p} d_k, d_b = min_{1≤k≤p} d_k, d = d_a/d_b and N_d = Σ_{k=1}^p d_k. Define

r_1 ≡ r_1(λ; η_1) = (n c_* d_a η_1/(λ d_b q))^{1/2},  r_2 ≡ r_2(λ) = (n c_* η_2²/(λ² d_b q))^{1/2},  c̄ = c^*/c_*, (2.8)

where η_2 ≡ max_{A⊂A_0} ||Σ_{k∈A} X_kβ_k||_2,

M_1 = M_1(λ) = 2 + 4r_1² + 4√(d c̄) r_2 + 4d c̄, (2.9)
M_2 ≡ M_2(λ) = (2/3)(1 + 4r_1² + 2d c̄ + 4√(2d(1 + c̄)c̄) r_2 + (16/3)d c̄²), (2.10)
M_3 ≡ M_3(λ) = (2/3)(1 + 4r_1² + 4√(d c̄)(1 + 2√(1 + c̄)) r_2 + 3r_2² + (2/3)d c̄(7 + 4c̄)). (2.11)

Let λ_{n,p} = 2σ{8(1 + c_0) d_a d² q* c̄ c^* n log(N_d ∨ a_n)}^{1/2}, where c_0 ≥ 0 and a_n ≥ 0 satisfy p d_a/(N_d ∨ a_n)^{1+c_0} → 0, and let λ_0 = inf{λ: M_1(λ)q + 1 ≤ q*}, where inf Ø = ∞. We also consider the constraint

λ ≥ max{λ_0, λ_{n,p}}. (2.12)

For large p, the lower bound here is allowed to be λ_{n,p} = 2σ{8(1 + c_0) d_a d² q* c̄ c^* n log(N_d)}^{1/2} with a_n = 0; for fixed p, a_n → ∞ is required.

We assume the following basic condition.

  • (C1)

    The errors ε1, …, εn are independent and identically distributed as N(0, σ²).

Theorem 2.1

Suppose that q ≥ 1 and that (C1), the GSC (2.2) and the SRC (2.4) are satisfied. Let q̂, ω̂ and ζ_2 be defined as in (2.5), (2.6) and (2.7), respectively, for the model Â selected by the group Lasso from (2.1). Let M_1, M_2 and M_3 be defined as in (2.9), (2.10) and (2.11), respectively. If the constraint (2.12) is satisfied, then the following assertions hold with probability converging to 1:

q̂ ≤ #{k: ||β̂_k||_2 > 0 or k ∉ A_0} ≤ M_1(λ) q,
ω̂² = ||(I − P̂)Xβ||_2² ≤ M_2(λ) B_1²(λ),
ζ_2² = Σ_{k∉A_0} ||β_k||_2² I{||β̂_k||_2 = 0} ≤ M_3(λ) B_1²(λ)/(c_* n),

where B_1(λ) = (λ² d_b² q/(n c_*))^{1/2}.

Remark 2.1

The condition q ≥ 1 is not necessary since it is only used to express quantities in terms of ratios in (2.8) and Theorem 2.1. If q = 0, we use r_1²q = n c_* d_a η_1/(λ d_b) and r_2²q = n c_* η_2²/(λ² d_b) to recover M_1, M_2 and M_3 in (2.9), (2.10) and (2.11), respectively, giving the results q̂ ≤ 4n c_* d_a η_1/(λ d_b), ω̂² ≤ 8λ d_a d_b η_1/3 and ζ_2² = 0.

Remark 2.2

If η1 = 0 in (2.2), then r1 = r2 = 0 and

M_1 = 2 + 4d c̄,  M_2 = (2/3)(1 + 2d c̄ + (16/3)d c̄²),  M_3 = (2/3)(1 + (2/3)d c̄(7 + 4c̄)),

all of which depend only on d and c̄. This suggests that the relative sizes of the groups affect the selection results. Since d ≥ 1, the most favorable case is d = 1, that is, when the groups have equal sizes.

Remark 2.3

If d1 = · · · = dp = 1, the group Lasso simplifies to the Lasso and Theorem 2.1 is a direct generalization of Theorem 1 on the selection properties of the Lasso obtained by Zhang and Huang (2008). In particular, when d1 = · · · = dp = 1, r1, r2, M1, M2, M3 are the same as the constants in Theorem 1 of Zhang and Huang (2008).

Remark 2.4

A more general definition of the group Lasso is

β̂ = arg min_β (1/2)(Y − Xβ)′(Y − Xβ) + λ Σ_{k=1}^p (β_k′R_kβ_k)^{1/2}, (2.13)

where R_k is a d_k × d_k positive definite matrix. This is useful when certain relationships among the coefficients need to be specified. By the Cholesky decomposition, there exists a matrix Q_k such that R_k = d_k Q_k′Q_k. Let β_k* = Q_kβ_k and X_k* = X_kQ_k^{−1}. Then, (2.13) becomes

β̂* = arg min_{β*} (1/2)(Y − X*β*)′(Y − X*β*) + λ Σ_{k=1}^p √d_k ||β_k*||_2.

The GSC for (2.13) is Σ_{k=q+1}^p (β_k′Q_k′Q_kβ_k)^{1/2} ≤ η_1. The SRC can be assumed for X · Q^{−1}, where X · Q^{−1} = (X_1Q_1^{−1}, …, X_pQ_p^{−1}).
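To see why the reparametrization in Remark 2.4 works, note that under the stated factorization R_k = d_k Q_k′Q_k,

(β_k′R_kβ_k)^{1/2} = (d_k β_k′Q_k′Q_kβ_k)^{1/2} = √d_k ||Q_kβ_k||_2 = √d_k ||β_k*||_2  and  X_kβ_k = (X_kQ_k^{−1})(Q_kβ_k) = X_k*β_k*,

so both the penalty and the fit in (2.13) are unchanged by the substitution, and the criterion takes the standard form (2.1) in the transformed variables (X_k*, β_k*).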

Immediately, from Theorem 2.1, we have the following corollary.

Corollary 2.1

Suppose that the conditions of Theorem 2.1 hold and λ satisfies the constraint (2.12). Then, with probability converging to one, all groups with ||β_k||_2² > M_3(λ) d_b² q λ²/(c_*² n²) are selected.

From Theorem 2.1 and Corollary 2.1, the group Lasso possesses similar properties to the Lasso in terms of sparsity and bias (Zhang and Huang (2008)). In particular, the group Lasso selects a model whose dimension has the same order as the underlying model. Furthermore, all of the groups with coefficients whose ℓ2-norms are greater than the threshold given in Corollary 2.1 are selected with high probability.

Theorem 2.2

Let {c̄, σ, r1, r2, c0, d} be fixed and 1 ≤ q ≤ n ≤ p → ∞. Suppose that the conditions in Theorem 2.1 hold. Then, with probability converging to 1, we have

||β̂ − β||_2 ≤ (1/√(n c_*)) (2σ√(M_1 log(N_d) q) + (r_2 + √(d M_1 c̄)) B_1) + √((c^* r_1² + r_2²)/(c_* c^*)) · qλ/n

and

||Xβ̂ − Xβ||_2 ≤ 2σ√(M_1 log(N_d) q) + (2r_2 + √(d M_1 c̄)) B_1.

Theorem 2.2 is stated for a general λ that satisfies (2.12). The following result is an immediate corollary of Theorem 2.2.

Corollary 2.2

Let λ = 2σ{8(1 + c_0′) d_a d² q* c̄ c^* n log(N_d)}^{1/2} with a fixed c_0′ ≥ c_0. Suppose that all of the conditions in Theorem 2.2 hold. We then have

||β̂ − β||_2 = O_p(√(q log(N_d)/n))  and  ||Xβ̂ − Xβ||_2 = O_p(√(q log(N_d))).

This corollary follows by substituting the given λ value into the expressions in the results of Theorem 2.2.

3. Selection consistency of the adaptive group Lasso

As shown in the previous section, the group Lasso has excellent selection and estimation properties. However, there is room for improvement, particularly with regard to selection. Although the group Lasso selects a model whose dimension is comparable to that of the underlying model, the simulation results reported in Yuan and Lin (2006) and those reported below suggest that it tends to select more groups than there are in the underlying model. To correct the tendency of overselection by the group Lasso, we generalize the idea of the adaptive Lasso (Zou (2006)) for individual variable selection to the present problem of group selection.

Consider a general group Lasso criterion with a weighted penalty term,

(1/2)(Y − Xβ)′(Y − Xβ) + λ Σ_{k=1}^p w_k √d_k ||β_k||_2, (3.1)

where w_k is the weight associated with the kth group. The quantity λ_k ≡ λw_k can be regarded as the penalty level applied to the kth group. For different groups, the penalty level λ_k can be different. If we can apply a lower penalty to groups with large coefficients and a higher penalty to groups with small coefficients (in the ℓ2 sense), then we expect to improve selection accuracy and reduce estimation bias. One way to obtain information about whether a group has large or small coefficients is to use a consistent initial estimator.

Suppose that an initial estimate β̃ is available. A simple approach to determining the weight is to use the initial estimator. Consider

w_k = 1/||β̃_k||_2,  k = 1, …, p. (3.2)

Thus, for each group, its penalty is proportional to the inverse of the norm of β̃k. This choice of the penalty level for each group is a natural generalization of the adaptive Lasso (Zou (2006)). In particular, when each group only contains a single variable, (3.2) simplifies to the adaptive Lasso penalty.
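The weighted criterion (3.1) with the weights (3.2) reduces to a standard group Lasso after a scale transformation; this is also how the estimator is computed in Section 4. For the groups with ||β̃_k||_2 > 0, write (in our notation) β_k* = w_kβ_k and X_k* = X_k/w_k. Then, with the sums running over these groups,

(1/2)||Y − Σ_k X_kβ_k||_2² + λ Σ_k w_k√d_k ||β_k||_2 = (1/2)||Y − Σ_k X_k*β_k*||_2² + λ Σ_k √d_k ||β_k*||_2,

so a group Lasso solver applied to the rescaled covariates gives β̂_k*, from which β̂_k = β̂_k*/w_k; groups with β̃_k = 0 receive an infinite penalty and are set to zero (see Section 4).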

Let θ_a = max_{k∈A_0^c} ||β_k||_2 and θ_b = min_{k∈A_0^c} ||β_k||_2. We say that an initial estimator β̃ is consistent at zero with rate r_n if r_n max_{k∈A_0} ||β̃_k||_2 = O_p(1), where r_n → ∞ as n → ∞, and there exists a constant ξ_b > 0 such that, for any ε > 0, P(min_{k∈A_0^c} ||β̃_k||_2 > ξ_bθ_b) > 1 − ε for n sufficiently large.

In addition to (C1), we assume the following conditions:

  • (C2)

    the initial estimator β̃ is consistent at zero with rate rn → ∞;

  • (C3)
    √(d_a(log q)/n)/θ_b → 0,  λ̃ d_a^{3/2} q/(nθ_b²) → 0,  √(n d log(p − q))/(λ̃ r_n) → 0,  d_a^{5/2} q²/(r_n θ_b √d_b) → 0;
  • (C4)

    all of the eigenvalues of Σ_{A_0^c A_0^c} are bounded away from zero and infinity.

Condition (C2) assumes that an initial zero-consistent estimator exists. It is the most critical one and is generally difficult to establish. It assumes that we can consistently differentiate between important and non-important groups. For fixed p and dk, the ordinary least-squares estimator can be used as the initial estimator. However, when p > n, the least-squares estimator is no longer feasible. By Theorems 2.1 and 2.2, the group Lasso estimator β̂ is consistent at zero with rate √(n/(q log(N_d))). Condition (C3) restricts the numbers of important and non-important groups, as well as variables within the groups. It also places constraints on the penalty parameter and the ℓ2-norm of the smallest important group. Condition (C4) assumes that the eigenvalues of Σ_{A_0^c A_0^c} are finite and bounded away from zero. This is reasonable since the number of important groups is small in a sparse model. This condition ensures that the true model is identifiable.

Define

β̂* = arg min_β (1/2)(Y − Xβ)′(Y − Xβ) + λ̃ Σ_{k=1}^p (√d_k/||β̃_k||_2) ||β_k||_2. (3.3)

Theorem 3.1

If (C1)–(C4) and NSC (2.3) are satisfied, then

P(||β̂_k*||_2 ≠ 0, k ∉ A_0, and ||β̂_k*||_2 = 0, k ∈ A_0) → 1.

Therefore, the adaptive group Lasso is selection consistent under the stated conditions.

If we use β̂ as the initial estimator, then (C3) can be changed to (C3)*

√(d_a(log q)/n)/θ_b → 0,  λ̃ d_a^{3/2} q/(nθ_b²) → 0,  √(d q log(p − q) log(N_d))/λ̃ → 0,  (d_a q)^{5/2} √(log(N_d))/(θ_b √(n d_b)) → 0.

We often have λ̃ = n^α for some 0 < α < 1/2. In this case, the number of non-important groups can be as large as exp(n^{2α}/(q log q)) with the number of important groups satisfying q^5 log q/n → 0, assuming that θ_b and the numbers of variables within the groups are finite.

Corollary 3.1

Let the initial estimator β̃ = β̂, where β̂ is the group Lasso estimator. Suppose that the NSC (2.3) holds and that (C1), (C2), (C3)* and (C4) are satisfied. We then have

P(||β̂_k*||_2 ≠ 0, k ∉ A_0, and ||β̂_k*||_2 = 0, k ∈ A_0) → 1.

This corollary follows directly from Theorem 3.1. It shows that the iterated group Lasso procedure that uses a combination of the group Lasso and the adaptive group Lasso is selection consistent.

Theorem 3.2

Suppose that the conditions in Theorem 2.2 hold and that θ_b > t_b for some constant t_b > 0. If λ̃ = O(n^α) for some 0 < α < 1/2, then

||β̂* − β||_2 = O_p(√(q/n + λ̃²/n²)) = O_p(√(q/n)),  ||Xβ̂* − Xβ||_2 = O_p(√(q + λ̃²/n)) = O_p(√q).

Theorem 3.2 implies that, given a zero-consistent initial estimator, the adaptive group Lasso reduces a high-dimensional problem to a lower-dimensional one. The convergence rate is improved, compared with that of the group Lasso, by choosing an appropriate penalty parameter λ̃.

4. Simulation studies

In this section, we use simulation to evaluate the finite-sample performance of the group Lasso and the adaptive group Lasso. Let λ_k = λ̃/||β̂_k||_2 if ||β̂_k||_2 > 0; if ||β̂_k||_2 = 0, then λ_k = ∞ and β̂_k* = 0. We can thus drop the corresponding covariates X_k from the model and only consider the groups with ||β̂_k||_2 > 0. After a scale transformation, we can directly apply the group least angle regression algorithm (Yuan and Lin (2006)) to compute the adaptive group Lasso estimator β̂*. The penalty parameters for the group Lasso and the adaptive group Lasso are selected using the BIC criterion (Schwarz (1978)).
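A sketch of this two-step computation is given below, using the proximal-gradient group Lasso solver sketched after (2.1) in place of the group least angle regression algorithm actually used in the paper. The RSS-based BIC formula and the tuning grid are our own assumptions and may differ from the exact criterion used by the authors.

    import numpy as np

    def bic(y, X, beta):
        """A simple RSS-based BIC with the number of nonzero coefficients as df."""
        n = len(y)
        rss = np.sum((y - X @ beta) ** 2)
        return n * np.log(rss / n) + np.sum(beta != 0) * np.log(n)

    def adaptive_group_lasso(X, y, groups, lam_grid):
        """Two-step procedure: group Lasso initial fit, then adaptive group Lasso
        computed as a group Lasso on rescaled covariates, both tuned by BIC."""
        # Step 1: group Lasso initial estimator.
        fits = [group_lasso(X, y, groups, lam) for lam in lam_grid]
        beta_init = min(fits, key=lambda b: bic(y, X, b))
        # Step 2: keep groups with nonzero initial norm; rescale X_k by ||beta_init_k||.
        keep = [k for k, g in enumerate(groups) if np.linalg.norm(beta_init[g]) > 0]
        if not keep:
            return np.zeros(X.shape[1])
        norms = {k: np.linalg.norm(beta_init[groups[k]]) for k in keep}
        cols, sub_groups, start = [], [], 0
        for k in keep:
            g = list(groups[k])
            cols.extend(g)
            sub_groups.append(np.arange(start, start + len(g)))
            start += len(g)
        scale = np.concatenate([np.full(len(groups[k]), norms[k]) for k in keep])
        Xs = X[:, cols] * scale            # X_k* = X_k / w_k = X_k * ||beta_init_k||
        fits2 = [group_lasso(Xs, y, sub_groups, lam) for lam in lam_grid]
        beta_star = min(fits2, key=lambda b: bic(y, Xs, b))
        # Map back: beta_k = beta_k* / w_k = beta_k* * ||beta_init_k||.
        beta_hat = np.zeros(X.shape[1])
        for k, g_sub in zip(keep, sub_groups):
            beta_hat[list(groups[k])] = beta_star[g_sub] * norms[k]
        return beta_hat

With these pieces, one replication of the simulation study amounts to generating (X, Y) as in the examples below, calling adaptive_group_lasso over a grid of penalty levels, and recording which groups have nonzero fitted norms.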

We consider two scenarios of simulation models. In the first scenario, the group sizes are equal; in the second, the group sizes vary. For every scenario, we consider the cases p < n and p > n. In all of the examples, the sample size is n = 200.

Example 1

In this example, there are 10 groups, each consisting of 5 covariates. The covariate vector is X = (X_1, …, X_10), where X_j = (X_{5(j−1)+1}, …, X_{5(j−1)+5}), 1 ≤ j ≤ 10. To generate X, we first simulate 50 random variables, R_1, …, R_50, independently from N(0,1). Then, Z_j, j = 1, …, 10, are simulated from a multivariate normal distribution with mean zero and cov(Z_{j1}, Z_{j2}) = 0.6^{|j1−j2|}. The covariates X_1, …, X_50 are generated as

X_{5(j−1)+k} = (Z_j + R_{5(j−1)+k})/√2,  1 ≤ j ≤ 10, 1 ≤ k ≤ 5.

The random error ε ~ N(0, 3²). The response variable Y is generated from Y = Σ_{k=1}^{10} X_k′β_k + ε, where β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 2, 2, 2, 2), β3 = · · · = β10 = (0, 0, 0, 0, 0).
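For concreteness, one replication of Example 1 can be generated as follows; the division by √2 reflects our reading of the construction above (so that each covariate combines the group factor Z_j and an independent N(0,1) variable), and all function and variable names are ours.

    import numpy as np

    def example1_data(n=200, sigma=3.0, seed=0):
        """One replication of Example 1: 10 groups of 5 covariates, 2 active groups."""
        rng = np.random.default_rng(seed)
        idx = np.arange(10)
        cov_Z = 0.6 ** np.abs(idx[:, None] - idx[None, :])        # cov(Z_j1, Z_j2) = 0.6^|j1-j2|
        Z = rng.multivariate_normal(np.zeros(10), cov_Z, size=n)  # n x 10 group factors
        R = rng.standard_normal((n, 50))                          # independent N(0,1) variables
        X = (np.repeat(Z, 5, axis=1) + R) / np.sqrt(2)            # X_{5(j-1)+k} = (Z_j + R_{5(j-1)+k})/sqrt(2)
        beta = np.zeros(50)
        beta[:5] = [0.5, 1.0, 1.5, 2.0, 2.5]                      # beta_1
        beta[5:10] = 2.0                                          # beta_2
        y = X @ beta + sigma * rng.standard_normal(n)             # epsilon ~ N(0, 3^2)
        groups = [np.arange(5 * j, 5 * j + 5) for j in range(10)]
        return X, y, beta, groups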

Example 2

In this example, the number of groups is p = 10. Each group consists of 5 covariates. The covariates are generated in the same way as in Example 1. However, the regression coefficients are β1 = (0.5, 1, 1.5, 1, 0.5), β2 = (1, 1, 1, 1, 1), β3 = (−1, 0, 1, 2, 1.5), β4 = (−1.5, 1, 0.5, 0.5, 0.5), β5 = · · · = β10 = (0, 0, 0, 0, 0).

Example 3

In this example, the number of groups, p = 210, is larger than the sample size n. Each group consists of 5 covariates. The covariates are generated in the same way as in Example 1. However, the regression coefficients are β1 = (0.5, 1, 1.5, 1, 0.5), β2 = (1, 1, 1, 1, 1), β3 = (−1, 0, 1, 2, 1.5), β4 = (−1.5, 1, 0.5, 0.5, 0.5), β5 = · · · = β210 = (0, 0, 0, 0, 0).

Example 4

In this example, the group sizes differ across groups. There are 5 groups of size 5 and 5 groups of size 3. The covariate vector is X = (X_1, …, X_10), where X_j = (X_{5(j−1)+1}, …, X_{5(j−1)+5}), 1 ≤ j ≤ 5, and X_j = (X_{3(j−6)+26}, …, X_{3(j−6)+28}), 6 ≤ j ≤ 10. In order to generate X, we first simulate 40 random variables, R_1, …, R_40, independently from N(0,1). Then, Z_j, j = 1, …, 10, are simulated from a multivariate normal distribution with mean zero and cov(Z_{j1}, Z_{j2}) = 0.6^{|j1−j2|}. The covariates X_1, …, X_40 are generated as

X_{5(j−1)+k} = (Z_j + R_{5(j−1)+k})/√2,  1 ≤ j ≤ 5, 1 ≤ k ≤ 5,
X_{3(j−6)+25+k} = (Z_j + R_{3(j−6)+25+k})/√2,  6 ≤ j ≤ 10, 1 ≤ k ≤ 3.

The random error ε ~ N(0, 3²). The response variable Y is generated from Y = Σ_{k=1}^{10} X_k′β_k + ε, where β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 0, 0, 2, 2), β3 = · · · = β5 = (0, 0, 0, 0, 0), β6 = (−1, −2, −3), β7 = · · · = β10 = (0, 0, 0).

Example 5

In this example, the number of groups is p = 10 and the group sizes differ across groups. The data are generated in the same way as in Example 4. However, the regression coefficients are β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 2, 2, 2, 2), β3 = (−1, 0, 1, 2, 3), β4 = (−1.5, 2, 0, 0, 0), β5 = (0, 0, 0, 0, 0), β6 = (2, −2, 1), β7 = (0, −3, 1.5), β8 = (−1.5, 1.5, 2), β9 = (−2, −2, −2), β10 = (0, 0, 0).

Example 6

In this example, the number of groups is p = 210 and the group sizes differ across groups. The data are generated in the same way as in Example 4. However, the regression coefficients are β1 = (0.5, 1, 1.5, 2, 2.5), β2 = (2, 2, 2, 2, 2), β3 = (−1, 0, 1, 2, 3), β4 = (−1.5, 2, 0, 0, 0), β5 = · · · = β100 = (0, 0, 0, 0, 0), β101 = (2, −2, 1), β102 = (0, −3, 1.5), β103 = (−1.5, 1.5, 2), β104 = (−2, −2, −2), β105 = · · · = β210 = (0, 0, 0).

The results, based on 400 replications, are given in Table 1. The columns of the table include the average number of groups selected, with standard error in parentheses ('mean'); the median number of groups selected, with the 25% and 75% quantiles of the number of selected groups in parentheses ('med'); the model error ('ME'); the percentage of occasions on which the correct groups are all included in the selected model ('% incl'); and the percentage of occasions on which exactly the correct groups are selected ('% sel'), with standard errors in parentheses.

Table 1.

Simulation study by the group Lasso and adaptive group Lasso for Examples 1–6. The true numbers of groups are included in [] in the first column

|            | Group Lasso |          |             |        |       | Adaptive group Lasso |          |             |        |       |
| σ = 3      | mean        | med      | ME          | % incl | % sel | mean                 | med      | ME          | % incl | % sel |
| Ex. 1, [2] | 2.04 (0.18) | 2 (2, 2) | 8.79 (0.94) | 100 (0) | 96.5 (0.18) | 2.01 (0.07) | 2 (2, 2) | 8.54 (0.90) | 100 (0) | 99.5 (0.07) |
| Ex. 2, [4] | 4.11 (0.34) | 4 (4, 4) | 8.52 (0.94) | 99.5 (0.07) | 88.5 (0.32) | 4.00 (0.14) | 4 (4, 4) | 8.10 (0.87) | 99.5 (0.07) | 98.00 (0.14) |
| Ex. 3, [4] | 4.00 (0.38) | 4 (4, 4) | 9.48 (1.19) | 93.0 (0.26) | 86.5 (0.34) | 3.94 (0.27) | 4 (4, 4) | 8.19 (0.96) | 93.0 (0.26) | 92.5 (0.26) |
| Ex. 4, [3] | 3.17 (0.45) | 3 (3, 3) | 8.78 (1.00) | 100 (0) | 85.3 (0.35) | 3.00 (0) | 3 (3, 3) | 8.36 (0.90) | 100 (0) | 100 (0) |
| Ex. 5, [8] | 8.88 (0.81) | 9 (8, 10) | 7.68 (0.94) | 100 (0) | 40.0 (0.49) | 8.03 (0.16) | 8 (8, 8) | 7.58 (0.86) | 100 (0) | 97.5 (0.16) |
| Ex. 6, [8] | 12.90 (12.42) | 9 (8, 11) | 14.61 (7.21) | 66.5 (0.47) | 7.0 (0.26) | 11.49 (12.68) | 8 (7, 8) | 9.28 (5.79) | 66.5 (0.47) | 47.0 (0.50) |

Several observations can be made from Table 1. First, in all six examples, the adaptive group Lasso performs better than the group Lasso in terms of model error and the percentage of correctly selected models. The group Lasso, which provides the initial estimator for the adaptive group Lasso, includes the correct groups with high probability, and the improvement from the adaptive step is considerable for models with unequal group sizes. Second, the results from the models with equal group sizes (Examples 1, 2 and 3) are better than those from the models with different group sizes (Examples 4, 5 and 6). Finally, when the dimension of the model increases, the performance of both methods becomes worse. This is to be expected since selection in models with a larger number of groups is more difficult.

5. Concluding remarks

We have studied the asymptotic selection and estimation properties of the group Lasso and adaptive group Lasso in ‘large p, small n’ linear regression models. For the adaptive group Lasso to be selection consistent, the initial estimator should possess two properties: (a) it does not miss important groups and variables; (b) it is estimation consistent, although it may not be group-selection or variable-selection consistent. Under the conditions stated in Theorem 2.1, the group Lasso is shown to satisfy these two requirements. Thus, the iterated group Lasso procedure, which uses the group Lasso to achieve dimension reduction and generate the initial estimates and then uses the adaptive group Lasso to achieve selection consistency, is an appealing approach to group selection in high-dimensional settings.

6. Proofs

We first introduce some notation which will be used in the proofs. Let {k: ||β̂_k||_2 > 0, k ≤ p} ⊆ A_1 ≡ {k: X_k′(Y − Xβ̂) = λ√d_k β̂_k/||β̂_k||_2} ∪ {1, …, q}. Set A_2 = {1, …, p}\A_1, A_3 = A_1\A_0, A_4 = A_1 ∩ A_0, A_5 = A_2\A_0 and A_6 = A_2 ∩ A_0. Thus, we have A_1 = A_3 ∪ A_4, A_3 ∩ A_4 = Ø, A_2 = A_5 ∪ A_6 and A_5 ∩ A_6 = Ø. Let |A_i| = Σ_{k∈A_i} d_k, N(A_i) = #{k: k ∈ A_i}, i = 1, …, 6, and q_1 = N(A_1).

Proof of Theorem 2.1

The basic idea used in this proof follows the proof of the rate consistency of the Lasso in Zhang and Huang (2008). However, there are many differences in technical details, for example, in the characterization of the solution via the Karush–Kuhn–Tucker (KKT) conditions, in the constraint needed for the penalty level and in the use of maximal inequalities.

The proof consists of three steps. Step 1 proves some inequalities related to q_1, ω̂ and ζ_2. Step 2 translates the results of Step 1 into upper bounds for q̂, ω̂ and ζ_2. Step 3 completes the proof by showing that the probability of the event in Step 2 converges to 1. The details of the complete proof are available from the website www.stat.uiowa.edu/techrep. We sketch the proof in the following.

If β̂ is a solution of (2.1), then, by the KKT conditions, X_k′(Y − Xβ̂) = λ√d_k β̂_k/||β̂_k||_2 if ||β̂_k||_2 > 0 and ||X_k′(Y − Xβ̂)||_2 ≤ λ√d_k if ||β̂_k||_2 = 0. We then have

Σ_{11}^{−1}S_{A_1}/n = (β_{A_1} − β̂_{A_1}) + Σ_{11}^{−1}Σ_{12}β_{A_2} + Σ_{11}^{−1}X_{A_1}′ε/n, (6.1)
nΣ_{22}β_{A_2} − nΣ_{21}Σ_{11}^{−1}Σ_{12}β_{A_2} ≤ C_{A_2} − X_{A_2}′ε − Σ_{21}Σ_{11}^{−1}S_{A_1} + Σ_{21}Σ_{11}^{−1}X_{A_1}′ε, (6.2)

where S_{A_i} = (S_{k_1}′, …, S_{k_{q_i}}′)′ with S_{k_i} = λ√d_{k_i} s_{k_i}, s_k = X_k′(Y − Xβ̂)/(λ√d_k), C_{A_i} = (C_{k_1}′, …, C_{k_{q_i}}′)′ with C_{k_i} = λ√d_{k_i} I(||β̂_{k_i}||_2 = 0) e_{d_{k_i}×1}, all of the elements of the vector e_{d_{k_i}×1} equal 1, k_i ∈ A_i, and Σ_{ij} ≡ X_{A_i}′X_{A_j}/n.

Step 1

Define

V_{1j} = Σ_{11}^{−1/2}Q_{A_j1}′S_{A_j}/n,  j = 1, 3, 4,   ω_k = (I − P_{A_1})X_{A_k}β_{A_k},  k = 2, …, 6,

where Q_{A_kj} is the matrix representing the selection of the variables in A_k from A_j. Define u = (X_{A_1}Σ_{11}^{−1}Q_{A_41}′S_{A_4}/n − ω_2)/||X_{A_1}Σ_{11}^{−1}Q_{A_41}′S_{A_4}/n − ω_2||_2. From (6.1) and (6.2), we have V_{14}′(V_{13} + V_{14}) ≤ S_{A_4}′Q_{A_41}Σ_{11}^{−1}Σ_{12}β_{A_2} + S_{A_4}′Q_{A_41}Σ_{11}^{−1}X_{A_1}′ε/n + √d_a λ Σ_{k∈A_4}||β_k||_2 and ||ω_2||_2² ≤ β_{A_2}′(C_{A_2} − X_{A_2}′ε − Σ_{21}Σ_{11}^{−1}S_{A_1} + Σ_{21}Σ_{11}^{−1}X_{A_1}′ε). Then, under the GSC,

||V_{14}||_2² + ||ω_2||_2² ≤ (||V_{14}||_2² + ||ω_2||_2²)^{1/2} u′ε + (||V_{14}||_2 + ||P_{A_1}X_{A_2}β_{A_2}||_2)(λ²d_aN(A_3)/(nc_*(|A_1|)))^{1/2} + √d_a λη_1 + λ√d_a ||β_{A_5}||_2. (6.3)

Step 2

Define B_1² = λ²d_b q/(nc_*(|A_1|)) and B_2² = λ²d_b q/(nc_*(|A_0 ∪ A_1|)). In this step, we consider the event (u′ε)² ≤ (|A_1| ∨ d_b)B_1²/(4qd_a). Suppose that the set A_1 contains all large β_k ≠ 0. From (6.3), ||V_{14}||_2² ≤ B_1² + 4√d_a λη_1 + 4√d η_2B_2 + 4dB_2², so we have

(q_1 − q)_+ ≤ q + (nc_*(|A_1|)/(λ²d_b))(4√d_a λη_1 + 4(λ²d_a q/(nc_*(|A_1|)))^{1/2} η_2 + 4λ²d_a q/(nc_*(|A_1|))). (6.4)

For general A_1, let C_5 = c^*(|A_5|)/c_*(|A_1| ∨ |A_5|). From (6.3),

||ω_2||_2² ≤ (4/3)(B_1²/2 + dB_2² + √(d(1 + C_5)) η_2B_2 + 2√d_a λη_1) + (32/9)dC_5B_2². (6.5)

From Zhang and Huang (2008), ||ω_2||_2² ≥ (||β_{A_5}||_2(nc_{*,5})^{1/2} − η_2)² and ||X_{A_2}β_{A_2}||_2 ≤ η_2 + ||X_{A_5}β_{A_5}||_2 ≤ η_2 + (nc^*(|A_5|))^{1/2}||β_{A_5}||_2. By the Cauchy–Schwarz inequality, we then have

||β_{A_5}||_2² ≤ (2/(nc_{*,5}))[(4/3)(λ²d_a q/(nc_{*,5}))^{1/2}(1 + c^*(|A_5|)/c_*(|A_1|))^{1/2} + 2η_2]² + (8/3)[B_1²/4 + √d_a λη_1 + η_2(λ²d_a q/(nc_*(|A_1|)))^{1/2} + λ²d_a q/(2nc_*(|A_1|)) − (3/4)η_2²], (6.6)

where c_{*,5} = c_*(|A_1 ∪ A_5|).

Step 3

Letting c_*(|A_m|) = c_* and c^*(|A_m|) = c^* for N(A_m) ≤ q*, we have

q_1 ≤ N(A_1 ∪ A_5) ≤ q*,   (u′ε)² ≤ (|A_1| ∨ d_b)λ²d_b/(4d_a nc_*(|A_1|)). (6.7)

We then have c̄ = C_5 = c^*(|A_5|)/c_*(|A_1| ∨ |A_5|) = c^*/c_* and c_{*,5} = c_*(|A_1 ∪ A_5|) = c_*. From (6.4), (6.5) and (6.6), (q_1 − q)_+ + q ≤ M_1q, ||ω_2||_2² ≤ M_2B_1² and nc_*||β_{A_5}||_2² ≤ M_3B_1² when (2.12) is satisfied. Define

x_m ≡ max_{|A| = m} max_{||U_{A_k}||_2 = 1, k = 1, …, m} |ε′(X_A(X_A′X_A)^{−1}S̄_A − (I − P_A)Xβ)| / ||X_A(X_A′X_A)^{−1}S̄_A − (I − P_A)Xβ||_2 (6.8)

for |A| = q_1 = m ≥ 0, where S̄_A = (S̄_{A_1}′, …, S̄_{A_m}′)′, S̄_{A_k} = λ√d_{A_k} U_{A_k} and ||U_{A_k}||_2 = 1. Let Q_A = X̃_A(X̃_A′X̃_A)^{−1}, where X̃_k = λ√d_k X_k. For a given A, let V_{lj} = (0, …, 0, 1, 0, …, 0)′ be the |A| × 1 vector whose jth element in the lth group equals 1. Then, by (6.8),

x_m ≤ max_{|A| = m} max_{l,j} { (|ε′Q_AV_{lj}| / ||Q_AV_{lj}||_2) · ||Q_AV_{lj}||_2 (Σ_{l∈A} d_l)^{1/2} / ||Q_AU_A||_2 + |ε′(I − P_A)Xβ| / ||(I − P_A)Xβ||_2 }.

If we define Ω_{m_0} = {(X, ε): x_m ≤ σ√(8(1 + c_0)V²((md_b) ∨ d_b) log(N_d ∨ a_n)) for all m ≥ m_0}, then (X, ε) ∈ Ω_{m_0} implies (u′ε)² ≤ 2(x_m)² < (|A_1| ∨ d_b)λ²d_b/(4d_a nc_*) for N(A_1) ≥ m_0 ≥ 0. By the definition of x_m, it is less than the maximum of (p choose m)Σ_{k∈A}d_k normal variables with mean 0 and variance σ²V_ε², plus the maximum of (p choose m) normal variables with mean 0 and variance σ². It follows that P{(X, ε) ∈ Ω_{m_0}} → 1 when (6.7) holds. This completes the sketch of the proof of Theorem 2.1.

Proof of Theorem 2.2

Consider the case when {c_*, c^*, r_1, r_2, c_0, d} are fixed. The required configurations in Theorem 2.1 then become

M_1q + 1 < q*,   η_1 ≤ r_1² c_* qλ/n,   η_2² ≤ r_2² c_* qλ²/n². (6.9)

Let A_1 = {k: ||β̂_k||_2 > 0 or k ∉ A_0}. Define v_1 = X_{A_1}(β̂_{A_1} − β_{A_1}) and g_1 = X_{A_1}′(Y − Xβ̂). We then have ||v_1||_2² ≥ c_*n||β̂_{A_1} − β_{A_1}||_2², (β̂_{A_1} − β_{A_1})′g_1 = v_1′(Xβ − X_{A_1}β_{A_1} + ε) − ||v_1||_2² and the blocks of g_1 satisfy max_k ||(g_1)_k||_2 ≤ max_{k: ||β̂_k||_2>0} ||λ√d_k β̂_k/||β̂_k||_2||_2 = λ√d_a. Therefore, ||v_1||_2 ≤ η_2 + ||P_{A_1}ε||_2 + λ√(d_aN(A_1)/(nc_*)). Since ||P_{A_1}ε||_2 ≤ 2σ√(N(A_1) log(N_d)) with probability converging to 1 under the normality assumption, ||X(β̂ − β)||_2 ≤ 2η_2 + ||P_{A_1}ε||_2 + λ√(d_aN(A_1)/(nc_*)). We then have

(Σ_{k∈A_1} ||β̂_k − β_k||_2²)^{1/2} ≤ ||v_1||_2/√(nc_*) ≤ (1/√(nc_*))(η_2 + 2σ√(N(A_1) log(N_d)) + √(dM_1c̄) B_1). (6.10)

Since A_2 ⊂ A_0, by the second inequality in (6.9), #{k ∈ A_0: ||β_k||_2 > λ/n} ≤ r_1²q/c_* = O(q). By the SRC and the third inequality in (6.9), Σ_{k∈A_0} ||β_k||_2² I{||β_k||_2 > λ/n} ≤ Σ_{k∈A_0} ||X_kβ_k I{||β_k||_2 > λ/n}||_2²/(nc_*) ≤ r_2²qλ²/(n²c_*c^*) and Σ_{k∈A_0} ||β_k||_2² I{||β_k||_2 ≤ λ/n} ≤ r_1²qλ²/(c_*n²). From (6.10), we then have

||β̂ − β||_2 ≤ (1/√(nc_*))(2σ√(M_1 log(N_d) q) + (r_2 + √(dM_1c̄)) B_1) + √((c^* r_1² + r_2²)/(c_* c^*)) · qλ/n,   ||Xβ̂ − Xβ||_2 ≤ 2σ√(M_1 log(N_d) q) + (2r_2 + √(dM_1c̄)) B_1.

This completes the proof of Theorem 2.2.

Proof of Theorem 3.1

Let û = β̂* − β, W = X′ε/n, V(u) = Σ_{i=1}^n [(ε_i − x_i′u)² − ε_i²] + Σ_{k=1}^p λ_k√d_k ||u_k + β_k||_2 and û = arg min_u (ε − Xu)′(ε − Xu) + Σ_{k=1}^p λ_k√d_k ||u_k + β_k||_2, where λ_k = λ̃/||β̃_k||_2. By the KKT conditions, if there exists û such that

Σ_{A_0^cA_0^c}(√n û_{A_0^c}) − √n W_{A_0^c} = −S_{A_0^c}/√n,  with ||û_k||_2 < ||β_k||_2 for k ∈ A_0^c, (6.11)
−C_{A_0}/√n ≤ Σ_{A_0A_0^c}(√n û_{A_0^c}) − √n W_{A_0} ≤ C_{A_0}/√n, (6.12)

then ||β̂_k*||_2 ≠ 0 for k = 1, …, q and ||β̂_k*||_2 = 0 for k = q + 1, …, p.

From (6.11) and (6.12), √n û_{A_0^c} − Σ_{A_0^cA_0^c}^{−1}√n W_{A_0^c} = −n^{−1/2}Σ_{A_0^cA_0^c}^{−1}S_{A_0^c} and Σ_{A_0A_0^c}(√n û_{A_0^c}) − √n W_{A_0} = −n^{−1/2}X_{A_0}′(I − P_{A_0^c})ε − n^{−1/2}Σ_{A_0A_0^c}Σ_{A_0^cA_0^c}^{−1}S_{A_0^c}. Define the events

E_1 = {n^{−1/2}||(Σ_{A_0^cA_0^c}^{−1}X_{A_0^c}′ε)_k||_2 < √n||β_k||_2 − n^{−1/2}||(Σ_{A_0^cA_0^c}^{−1}S_{A_0^c})_k||_2, k ∈ A_0^c},
E_2 = {n^{−1/2}||(X_{A_0}′(I − P_{A_0^c})ε)_k||_2 < n^{−1/2}||C_k||_2 − n^{−1/2}||(Σ_{A_0A_0^c}Σ_{A_0^cA_0^c}^{−1}S_{A_0^c})_k||_2, k ∈ A_0},

where (·)_k denotes the d_k-dimensional subvector of the vector (·) corresponding to the kth group. We then have P(||β̂_k*||_2 ≠ 0, k ∉ A_0, and ||β̂_k*||_2 = 0, k ∈ A_0) ≥ P(E_1 ∩ E_2) and P(E_1 ∩ E_2) = 1 − P(E_1^c ∪ E_2^c) ≥ 1 − P(E_1^c) − P(E_2^c).

First, we consider P(E_1^c). Define R = {||β̃_k||_2^{−1} ≤ c_1θ_b^{−1}, k ∈ A_0^c}, where c_1 is a constant. Then P(E_1^c) = P(E_1^c ∩ R) + P(E_1^c ∩ R^c) ≤ P(E_1^c ∩ R) + P(R^c). By (C2), P(R^c) → 0. Let N_q = Σ_{k=1}^q d_k, let τ_1 ≤ ⋯ ≤ τ_{N_q} be the eigenvalues of Σ_{A_0^cA_0^c} and let γ_1, …, γ_{N_q} be the associated eigenvectors. The jth element in the lth group of the vector Σ_{A_0^cA_0^c}^{−1}S_{A_0^c} is u_{lj} = Σ_{l′=1}^{N_q} τ_{l′}^{−1}(γ_{l′}′S_{A_0^c})γ_{l′j}. By the Cauchy–Schwarz inequality, u_{lj}² ≤ τ_1^{−2}Σ_{l′=1}^{N_q}||γ_{l′}||_2²||S_{A_0^c}||_2² = τ_1^{−2}N_q||S_{A_0^c}||_2² ≤ τ_1^{−2}N_q(Σ_{k=1}^q λ_k²d_k). Therefore, ||u_k||_2² ≤ d_kτ_1^{−2}q²d_a²(λ̃c_1θ_b^{−1})².

If we define v_{A_0^c} = √nθ_b − n^{−1/2}c_1τ_1^{−1}qd_a^{3/2}λ̃θ_b^{−1}, η_{A_0^c} = n^{−1/2}Σ_{A_0^cA_0^c}^{−1}X_{A_0^c}′ε, ξ_{A_0} = n^{−1/2}X_{A_0}′(I − P_{A_0^c})ε and C_{A_0^c} = {max_{k∈A_0^c}||η_k||_2 > v_{A_0^c}}, then P(E_1^c ∩ R) ≤ P(C_{A_0^c}). By Lemmas 1 and 2 of Huang, Ma and Zhang (2008), P(C_{A_0^c}) ≤ K(d_a log q)^{1/2}/v_{A_0^c}, where K is a constant, and K(d_a log q)^{1/2}/v_{A_0^c} → 0 from (C3). We then have P(E_1^c ∩ R) → 0 and hence P(E_1^c) → 0.

Next, we consider P(E_2^c). Similarly to the above, define D = {||β̃_k||_2^{−1} > r_n, k ∈ A_0} ∩ R. Then P(E_2^c) ≤ P(E_2^c ∩ D) + P(D^c) and, by (C2), P(D^c) → 0. Moreover, Σ_{l=1}^{N_q}(1/n)Σ_{i=1}^n(X_{A_0})_{ij}(X_{A_0^c})_{il}u_l ≤ Σ_{l=1}^{N_q}|u_l| ≤ τ_1^{−1}q²d_a²λ̃c_1θ_b^{−1}, where u_l is the lth element of the vector Σ_{A_0^cA_0^c}^{−1}S_{A_0^c}. If we define v_{A_0} = n^{−1/2}λ̃r_n√d_b − n^{−1/2}τ_1^{−1}q²d_a^{5/2}λ̃c_1θ_b^{−1} and C_{A_0} = {max_{k∈A_0}||ξ_k||_2 > v_{A_0}}, then P(E_2^c ∩ D) ≤ P(C_{A_0}) and P(C_{A_0}) ≤ K(d_a log(p − q))^{1/2}/v_{A_0} → 0 from (C3). We then have P(E_2^c ∩ D) → 0 and hence P(E_2^c) → 0. This completes the proof of Theorem 3.1.

Proof of Theorem 3.2

If we let Â = {k: ||β̂_k||_2 > 0, k = 1, …, p}, then Σ_{k∈Â^c}||β̂_k||_2 = 0, the dimension of problem (3.1) is reduced to N(Â) ≤ q*, and Â^c ⊂ A_0. By the definition of β̂*, we have

(1/2)||Y − X_Âβ̂*_Â||_2² + λ̃Σ_{k∈Â}(√d_k/||β̃_k||_2)||β̂_k*||_2 ≤ (1/2)||Y − X_Âβ_Â||_2² + λ̃Σ_{k∈Â}(√d_k/||β̃_k||_2)||β_k||_2, (6.13)
η ≡ λ̃Σ_{k∈Â}(√d_k/||β̃_k||_2)(||β_k||_2 − ||β̂_k*||_2) ≤ λ̃Σ_{k∈Â}(√d_k/||β̃_k||_2)||β̂_k* − β_k||_2. (6.14)

If we let δ_Â = Σ_{ÂÂ}^{1/2}(β̂*_Â − β_Â) and D = Σ_{ÂÂ}^{−1/2}X_Â′, then ||Y − X_Âβ̂*_Â||_2²/2 − ||Y − X_Âβ_Â||_2²/2 = δ_Â′δ_Â/2 − (Dε)′δ_Â. By (6.13) and (6.14), δ_Â′δ_Â/2 − (Dε)′δ_Â − η ≤ 0, so ||δ_Â − Dε||_2² − ||Dε||_2² − 2η ≤ 0. By the triangle inequality, ||δ_Â||_2 ≤ ||δ_Â − Dε||_2 + ||Dε||_2. Thus, ||δ_Â||_2² ≤ 6||Dε||_2² + 6η.

Let D_i be the ith column of D. E(||Dε||_2²) = σ²tr(DD′) = σ²q̂. Then, with probability converging to 1, ||β̂*_Â − β_Â||_2² ≤ 6σ²M_1q/(nc_*) + (λ̃√d_a/(ξ_bθ_bnc_*))²/2 + ||β̂*_Â − β_Â||_2²/2.

Thus, for λ̃ = n^α for some 0 < α < 1/2, with probability converging to 1,

||β̂*_Â − β_Â||_2² ≤ 6σ²M_1q/(c_*n) + d_a(ξ_bθ_bc_*)^{−2}(λ̃/n)² ≤ O(q/n)

and ||X_Âβ̂*_Â − X_Âβ_Â||_2 ≤ √(nc^*)||β̂*_Â − β_Â||_2 ≤ O(√q). This completes the proof of Theorem 3.2.

Acknowledgments

The authors are grateful to Professor Cun-Hui Zhang for sharing his insights into the problem and related topics. The work of Jian Huang is supported in part by NIH Grant R01CA120988 and NSF Grants DMS-07-06108 and 0805670.

Contributor Information

FENGRONG WEI, Email: fwei@westga.edu.

JIAN HUANG, Email: jian-huang@uiowa.edu.

References

  1. Antoniadis A, Fan J. Regularization of wavelet approximation (with discussion). J Amer Statist Assoc. 2001;96:939–967.
  2. Bühlmann P, Meier L. Discussion of “One-step sparse estimates in nonconcave penalized likelihood models,” by H. Zou and R. Li. Ann Statist. 2008;36:1534–1541. doi: 10.1214/009053607000000802.
  3. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  4. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann Statist. 2004;32:928–961.
  5. Greenshtein E, Ritov Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
  6. Huang J, Horowitz JL, Ma SG. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Statist. 2008;36:587–613.
  7. Huang J, Ma S, Zhang CH. Adaptive lasso for sparse high-dimensional regression models. Statist Sinica. 2006;18:1603–1618.
  8. Kim Y, Kim J, Kim Y. The blockwise sparse regression. Statist Sinica. 2006;16:375–390.
  9. Knight K, Fu WJ. Asymptotics for lasso-type estimators. Ann Statist. 2001;28:1356–1378.
  10. Meier L, van de Geer S, Bühlmann P. Group Lasso for logistic regression. J R Stat Soc Ser B. 2008;70:53–71.
  11. Meinshausen N, Buhlmann P. High dimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34:1436–1462.
  12. Schwarz G. Estimating the dimension of a model. Ann Statist. 1978;6:461–464.
  13. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B. 1996;58:267–288.
  14. van de Geer S. High-dimensional generalized linear models and the Lasso. Ann Statist. 2008;36:614–645.
  15. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B. 2006;68:49–67.
  16. Zhang CH. Technical Report 2007-003. Dept. Statistics, Rutgers Univ; 2007. Penalized linear unbiased selection.
  17. Zhang CH, Huang J. Model-selection consistency of the LASSO in high-dimensional linear regression. Ann Statist. 2008;36:1567–1594.
  18. Zhao P, Rocha G, Yu B. Grouped and hierarchical model selection through composite absolute penalties. Ann Statist. 2008;36:1567–1594.
  19. Zhao P, Yu B. On model selection consistency of LASSO. J Mach Learn Res. 2006;7:2541–2563.
  20. Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429.
  21. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 2006;67:301–320.
