VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS

Jian Huang; Joel L Horowitz; Fengrong Wei

doi:10.1214/09-AOS781

Author manuscript; available in PMC: 2010 Nov 30.

Published in final edited form as: Ann Stat. 2010 Aug 1;38(4):2282–2313. doi: 10.1214/09-AOS781

PMCID: PMC2994588 NIHMSID: NIHMS251165 PMID: 21127739

Abstract

We consider a nonparametric additive model of a conditional mean function in which the number of variables and additive components may be larger than the sample size but the number of nonzero additive components is “small” relative to the sample size. The statistical problem is to determine which additive components are nonzero. The additive components are approximated by truncated series expansions with B-spline bases. With this approximation, the problem of component selection becomes that of selecting the groups of coefficients in the expansion. We apply the adaptive group Lasso to select nonzero components, using the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We give conditions under which the group Lasso selects a model whose number of components is comparable with the underlying model, and the adaptive group Lasso selects the nonzero components correctly with probability approaching one as the sample size increases and achieves the optimal rate of convergence. The results of Monte Carlo experiments show that the adaptive group Lasso procedure works well with samples of moderate size. A data example is used to illustrate the application of the proposed method.

Key words and phrases: Adaptive group Lasso, component selection, high-dimensional data, nonparametric regression, selection consistency

1. Introduction

Let (Y_i, X_i), i = 1, …, n, be random vectors that are independently and identically distributed as (Y, X), where Y is a response variable and X = (X₁, …, X_p)′ is a p-dimensional covariate vector. Consider the nonparametric additive model

Y_{i} = μ + \sum_{j = 1}^{p} f_{j} (X_{ij}) + ε_{i},

(1)

where µ is an intercept term, X_ij is the jth component of X_i, the f_j’s are unknown functions, and ε_i is an unobserved random variable with mean zero and finite variance σ². Suppose that some of the additive components f_j are zero. The problem addressed in this paper is to distinguish the nonzero components from the zero components and estimate the nonzero components. We allow the possibility that p is larger than the sample size n, which we represent by letting p increase as n increases. We propose a penalized method for variable selection in (1) and show that the proposed method can correctly select the nonzero components with high probability.

There has been much work on penalized methods for variable selection and estimation with high-dimensional data. Methods that have been proposed include the bridge estimator [Frank and Friedman (1993), Huang, Horowitz and Ma (2008)], the least absolute shrinkage and selection operator, or Lasso [Tibshirani (1996)], the smoothly clipped absolute deviation (SCAD) penalty [Fan and Li (2001), Fan and Peng (2004)], and the minimum concave penalty [Zhang (2010)]. Much progress has been made in understanding the statistical properties of these methods. In particular, many authors have studied the variable selection, estimation and prediction properties of the Lasso in high-dimensional settings. See, for example, Meinshausen and Bühlmann (2006), Zhao and Yu (2006), Zou (2006), Bunea, Tsybakov and Wegkamp (2007), Meinshausen and Yu (2009), Huang, Ma and Zhang (2008), van de Geer (2008) and Zhang and Huang (2008), among others. All these authors assume a linear or other parametric model. In many applications, however, there is little a priori justification for assuming that the effects of covariates take a linear form or belong to any other known, finite-dimensional parametric family. For example, in studies of economic development, the effects of covariates on the growth of gross domestic product can be nonlinear. Similarly, there is evidence of nonlinearity in the gene expression data used in the empirical example in Section 5.

There is a large body of literature on estimation in nonparametric additive models. For example, Stone (1985, 1986) showed that additive spline estimators achieve the same optimal rate of convergence for a general fixed p as for p = 1. Horowitz and Mammen (2004) and Horowitz, Klemelä and Mammen (2006) showed that if p is fixed and mild regularity conditions hold, then oracle-efficient estimates of the f_j’s can be obtained by a two-step procedure. Here, oracle efficiency means that the estimator of each f_j has the same asymptotic distribution that it would have if all the other f_j’s were known. However, these papers do not discuss variable selection in nonparametric additive models.

Antoniadis and Fan (2001) proposed a group SCAD approach for regularization in wavelet approximation. Zhang et al. (2004) and Lin and Zhang (2006) have investigated the use of penalization methods in smoothing spline ANOVA with a fixed number of covariates. Zhang et al. (2004) used a Lasso-type penalty but did not investigate model-selection consistency. Lin and Zhang (2006) proposed the component selection and smoothing operator (COSSO) method for model selection and estimation in multivariate nonparametric regression models. For fixed p, they showed that the COSSO estimator in the additive model converges at the rate n^−d/(2d+1), where d is the order of smoothness of the components. They also showed that, in the special case of a tensor product design, the COSSO correctly selects the nonzero additive components with high probability. Zhang and Lin (2006) considered the COSSO for nonparametric regression in exponential families.

Meier, van de Geer and Bühlmann (2009) treat variable selection in a nonparametric additive model in which the numbers of zero and nonzero f_j’s may both be larger than n. They propose a penalized least-squares estimator for variable selection and estimation. They give conditions under which, with probability approaching 1, their procedure selects a set of f_j’s containing all the additive components whose distance from zero in a certain metric exceeds a specified threshold. However, they do not establish model-selection consistency of their procedure. Even asymptotically, the selected set may be larger than the set of nonzero f_j’s. Moreover, they impose a compatibility condition that relates the levels and smoothness of the f_j’s. The compatibility condition does not have a straightforward, intuitive interpretation and, as they point out, cannot be checked empirically. Ravikumar et al. (2009) proposed a penalized approach for variable selection in nonparametric additive models. In their approach, the penalty is imposed on the ℓ₂ norm of the nonparametric components, as well as the mean value of the components to ensure identifiability. In their theoretical results, they require that the eigenvalues of a “design matrix” be bounded away from zero and infinity, where the “design matrix” is formed from the basis functions for the nonzero components. It is not clear whether this condition holds in general, especially when the number of nonzero components diverges with n. Another critical condition required in the results of Ravikumar et al. (2009) is similar to the irrepresentable condition of Zhao and Yu (2006). It is not clear for what type of basis functions this condition is satisfied. We do not require such a condition in our results on selection consistency of the adaptive group Lasso.

Several other recent papers have also considered variable selection in nonparametric models. For example, Wang, Chen and Li (2007) and Wang and Xia (2008) considered the use of group Lasso and SCAD methods for model selection and estimation in varying coefficient models with a fixed number of coefficients and covariates. Bach (2007) applies what amounts to the group Lasso to a nonparametric additive model with a fixed number of covariates. He established model selection consistency under conditions that are considerably more complicated than the ones we require for a possibly diverging number of covariates.

In this paper, we propose to use the adaptive group Lasso for variable selection in (1) based on a spline approximation to the nonparametric components. With this approximation, each nonparametric component is represented by a linear combination of spline basis functions. Consequently, the problem of component selection becomes that of selecting the groups of coefficients in the linear combinations. It is natural to apply the group Lasso method, since it is desirable to take into account the grouping structure in the approximating model. To achieve model selection consistency, we apply the group Lasso iteratively as follows. First, we use the group Lasso to obtain an initial estimator and reduce the dimension of the problem. Then we use the adaptive group Lasso to select the final set of nonparametric components. The adaptive group Lasso is a simple generalization of the adaptive Lasso [Zou (2006)] to the method of the group Lasso [Yuan and Lin (2006)]. However, here we apply this approach to nonparametric additive modeling.

We assume that the number of nonzero f_j’s is fixed. This enables us to achieve model selection consistency under simple assumptions that are easy to interpret. We do not have to impose compatibility or irrepresentable conditions, nor do we need to assume conditions on the eigenvalues of certain matrices formed from the spline basis functions. We show that the group Lasso selects a model whose number of components is bounded with probability approaching one by a constant that is independent of the sample size. Then using the group Lasso result as the initial estimator, the adaptive group Lasso selects the correct model with probability approaching 1 and achieves the optimal rate of convergence for nonparametric estimation of an additive model.

The remainder of the paper is organized as follows. Section 2 describes the group Lasso and the adaptive group Lasso for variable selection in nonparametric additive models. Section 3 presents the asymptotic properties of these methods in “large p, small n” settings. Section 4 presents the results of simulation studies to evaluate the finite-sample performance of these methods. Section 5 provides an illustrative application, and Section 6 includes concluding remarks. Proofs of the results stated in Section 3 are given in the Appendix.

2. Adaptive group Lasso in nonparametric additive models

We describe a two-step approach that uses the group Lasso for variable selection based on a spline representation of each component in additive models. In the first step, we use the standard group Lasso to achieve an initial reduction of the dimension in the model and obtain an initial estimator of the nonparametric components. In the second step, we use the adaptive group Lasso to achieve consistent selection.

Suppose that each X_j takes values in [a, b] where a < b are finite numbers. To ensure unique identification of the f_j’s, we assume that E f_j(X_j) = 0, 1 ≤ j ≤ p. Let a = ξ₀ < ξ₁ < ⋯ < ξ_K < ξ_K+1 = b be a partition of [a, b] into K subintervals I_Kt = [ξ_t, ξ_t+1), t = 0, …, K − 1, and I_KK = [ξ_K, ξ_K+1], where K ≡ K_n = n^υ with 0 < υ < 0.5 is a positive integer such that max_1≤k≤K+1 |ξ_k − ξ_k−1| = O(n^−υ). Let 𝒮_n be the space of polynomial splines of degree l ≥ 1 consisting of functions s satisfying: (i) the restriction of s to I_Kt is a polynomial of degree l for 1 ≤ t ≤ K; (ii) for l ≥ 2 and 0 ≤ l ′ ≤ l − 2, s is l′ times continuously differentiable on [a, b]. This definition is phrased after Stone (1985), which is a descriptive version of Schumaker (1981), page 108, Definition 4.1.

There exists a normalized B-spline basis {ϕ_k, 1 ≤ k ≤ m_n} for 𝒮_n, where m_n ≡ K_n + l [Schumaker (1981)]. Thus, for any f_nj ∈ 𝒮_n, we can write

f_{nj} (x) = \sum_{k = 1}^{m_{n}} β_{jk} ϕ_{k} (x), 1 \leq j \leq p .

(2)

Under suitable smoothness assumptions, the f_j’s can be well approximated by functions in 𝒮_n. Accordingly, the variable selection method described in this paper is based on the representation (2).
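
As a concrete illustration of representation (2), the following minimal sketch evaluates a clamped B-spline basis with the Cox-de Boor recursion and forms the sum of β_jk ϕ_k(x). The function names, the evenly spaced knots and the evaluation point are illustrative assumptions and are not taken from the paper; the number of basis functions returned is fixed by the knot vector and the degree passed in.

```python
def bspline_basis(x, knots, degree):
    """Return the values of all B-spline basis functions at x (Cox-de Boor).

    `knots` is a nondecreasing knot vector whose boundary knots are
    repeated degree + 1 times (a "clamped" knot vector).
    """
    n_intervals = len(knots) - 1
    # Degree 0: indicator of the half-open interval [knots[t], knots[t+1]).
    basis = [1.0 if knots[t] <= x < knots[t + 1] else 0.0
             for t in range(n_intervals)]
    if x == knots[-1]:
        # Close the last nonempty interval on the right so x = b is covered.
        last = max(t for t in range(n_intervals) if knots[t] < knots[t + 1])
        basis[last] = 1.0
    for d in range(1, degree + 1):
        raised = []
        for t in range(n_intervals - d):
            left_den = knots[t + d] - knots[t]
            right_den = knots[t + d + 1] - knots[t + 1]
            left = (x - knots[t]) / left_den * basis[t] if left_den > 0 else 0.0
            right = ((knots[t + d + 1] - x) / right_den * basis[t + 1]
                     if right_den > 0 else 0.0)
            raised.append(left + right)
        basis = raised
    return basis


def clamped_knots(a, b, n_interior, degree):
    """Evenly spaced interior knots on [a, b] with repeated boundary knots."""
    step = (b - a) / (n_interior + 1)
    interior = [a + step * (t + 1) for t in range(n_interior)]
    return [a] * (degree + 1) + interior + [b] * (degree + 1)


def spline_value(x, beta, knots, degree):
    """Evaluate sum_k beta[k] * phi_k(x), as in representation (2)."""
    return sum(b * phi for b, phi in zip(beta, bspline_basis(x, knots, degree)))


# Hypothetical setup: cubic splines on [0, 1] with three interior knots.
knots = clamped_knots(0.0, 1.0, n_interior=3, degree=3)
phi = bspline_basis(0.37, knots, degree=3)
```

The basis values are nonnegative and sum to one at every point of [a, b], so a coefficient vector of all ones reproduces the constant function 1.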

Let ${‖ a ‖}_{2} \equiv {(\sum_{j = 1}^{m} {| a_{j} |}^{2})}^{1 / 2}$ denote the ℓ₂ norm of any vector a ∈ ℝ^m. Let β_nj = (β_j1, …, β_{jm_n})′ and $β_{n} = (β_{n 1}^{'}, \dots, β_{np}^{'})'$ . Let w_n = (w_n1, …, w_np)′ be a given vector of weights, where 0 ≤ w_nj ≤ ∞, 1 ≤ j ≤ p. Consider the penalized least squares criterion

L_{n} (μ, β_{n}) = \sum_{i = 1}^{n} {[Y_{i} - μ - \sum_{j = 1}^{p} \sum_{k = 1}^{m_{n}} β_{jk} ϕ_{k} (X_{ij})]}^{2} + λ_{n} \sum_{j = 1}^{p} w_{nj} {‖ β_{nj} ‖}_{2},

(3)

where λ_n is a penalty parameter. We study the estimators that minimize L_n(µ, β_n) subject to the constraints

\sum_{i = 1}^{n} \sum_{k = 1}^{m_{n}} β_{jk} ϕ_{k} (X_{ij}) = 0, 1 \leq j \leq p .

(4)

These centering constraints are sample analogs of the identifying restriction E f_j(X_j) = 0, 1 ≤ j ≤ p. We can convert (3) and (4) to an unconstrained optimization problem by centering the response and the basis functions. Let

{\bar{ϕ}}_{jk} = \frac{1}{n} \sum_{i = 1}^{n} ϕ_{k} (X_{ij}), ψ_{jk} (x) = ϕ_{k} (x) - {\bar{ϕ}}_{jk} .

(5)

For simplicity and without causing confusion, we simply write ψ_k(x) = ψ_jk(x). Define

Z_{ij} = (ψ_{1} (X_{ij}), \dots, ψ_{m_{n}} (X_{ij}))' .

So, Z_ij consists of values of the (centered) basis functions at the ith observation of the jth covariate. Let Z_j = (Z_1j, …, Z_nj)′ be the n × m_n “design” matrix corresponding to the jth covariate. The total “design” matrix is Z = (Z₁, …, Z_p). Let Y = (Y₁ − Y̅, …, Y_n − Y̅)′. With this notation, we can write

L_{n} (β_{n}; λ_{n}) = {‖ Y - Z β_{n} ‖}_{2}^{2} + λ_{n} \sum_{j = 1}^{p} w_{nj} {‖ β_{nj} ‖}_{2} .

(6)

Here, we have dropped µ in the argument of L_n. With the centering, µ̂ = Y̅. Then minimizing (3) subject to (4) is equivalent to minimizing (6) with respect to β_n, but the centering constraints are not needed for (6).
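
The effect of the centering in (5) can be sketched in a few lines. The example below uses linear B-splines ("hat" functions) on evenly spaced knots as a stand-in basis; that choice, the function names, the sample size and the coefficient values are assumptions made only for illustration.

```python
import random


def hat_basis(x, n_basis):
    """Linear B-spline ("hat") functions on [0, 1] with evenly spaced knots.

    A stand-in for phi_1, ..., phi_{m_n}; any spline basis could be used.
    """
    h = 1.0 / (n_basis - 1)
    return [max(0.0, 1.0 - abs(x - k * h) / h) for k in range(n_basis)]


def centered_design(x_column, n_basis):
    """Build the n x m_n matrix Z_j of centered basis values for one covariate."""
    raw = [hat_basis(x, n_basis) for x in x_column]
    n = len(raw)
    # phi_bar[k] is the sample mean of phi_k over the observations, as in (5).
    phi_bar = [sum(row[k] for row in raw) / n for k in range(n_basis)]
    return [[row[k] - phi_bar[k] for k in range(n_basis)] for row in raw]


rng = random.Random(0)
x_j = [rng.random() for _ in range(50)]   # hypothetical covariate values
Z_j = centered_design(x_j, n_basis=5)

# Every column of Z_j sums to zero, so for any coefficient vector the
# centered component sums to zero over the sample.
beta_j = [1.5, -2.0, 0.3, 0.0, 4.0]       # arbitrary illustrative coefficients
fitted = [sum(b * z for b, z in zip(beta_j, row)) for row in Z_j]
column_sums = [sum(row[k] for row in Z_j) for k in range(5)]
```

Because the centered component sums to zero for every coefficient vector, the sample centering restriction holds automatically, which is why no constraint needs to accompany (6).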

We now describe the two-step approach to component selection in the nonparametric additive model (1).

Step 1. Compute the group Lasso estimator. Let

L_{n 1} (β_{n}; λ_{n 1}) = {‖ Y - Z β_{n} ‖}_{2}^{2} + λ_{n 1} \sum_{j = 1}^{p} {‖ β_{nj} ‖}_{2} .

This objective function is the special case of (6) that is obtained by setting w_nj = 1, 1 ≤ j ≤ p. The group Lasso estimator is β̃_n ≡ β̃_n(λ_n1) = arg min_{β_n} L_n1(β_n; λ_n1).

Step 2. Use the group Lasso estimator β̃_n to obtain the weights by setting

w_{nj} = \begin{cases} {‖ {\tilde{β}}_{nj} ‖}_{2}^{- 1}, & \text{if } {‖ {\tilde{β}}_{nj} ‖}_{2} > 0, \\ \infty, & \text{if } {‖ {\tilde{β}}_{nj} ‖}_{2} = 0 . \end{cases}

The adaptive group Lasso objective function is

L_{n 2} (β_{n}; λ_{n 2}) = {‖ Y - Z β_{n} ‖}_{2}^{2} + λ_{n 2} \sum_{j = 1}^{p} w_{nj} {‖ β_{nj} ‖}_{2} .

Here, we define 0 · ∞ = 0. Thus, the components not selected by the group Lasso are not included in Step 2. The adaptive group Lasso estimator is β̂_n ≡ β̂_n(λ_n2) = arg min_{β_n} L_n2(β_n; λ_n2). Finally, the adaptive group Lasso estimators of µ and f_j are

{\hat{μ}}_{n} = \bar{Y} \equiv n^{- 1} \sum_{i = 1}^{n} Y_{i}, {\hat{f}}_{nj} (x) = \sum_{k = 1}^{m_{n}} {\hat{β}}_{jk} ψ_{k} (x), 1 \leq j \leq p .
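
The interplay of Steps 1 and 2 can be illustrated in a simplified special case. Assume, purely for this sketch, that the columns of Z are orthonormal (Z′Z = I). Then ‖Y − Zβ_n‖² equals ‖β_n − Z′Y‖² plus a constant, the criterion separates across groups, and the minimizer for group j is z_j = Z_j′Y shrunk toward zero by λ w_nj / 2 in Euclidean norm. This is not the algorithm used for the computations in the paper, and the function names and numerical values below are hypothetical.

```python
import math


def group_soft_threshold(z, threshold):
    """Shrink the vector z toward zero by `threshold` in Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= threshold:          # also covers threshold = infinity
        return [0.0] * len(z)
    scale = 1.0 - threshold / norm
    return [scale * v for v in z]


def adaptive_weights(beta_tilde):
    """Step 2 weights: 1 / ||beta_tilde_j||_2, or infinity for dropped groups."""
    weights = []
    for group in beta_tilde:
        norm = math.sqrt(sum(v * v for v in group))
        weights.append(1.0 / norm if norm > 0 else math.inf)
    return weights


def two_step(z_groups, lam1, lam2):
    """Group Lasso, then adaptive group Lasso, assuming Z'Z = I."""
    # Step 1: all weights equal 1, so the threshold is lam1 / 2.
    beta_tilde = [group_soft_threshold(z, lam1 / 2.0) for z in z_groups]
    # Step 2: reweight; groups with infinite weight stay exactly zero.
    w = adaptive_weights(beta_tilde)
    beta_hat = [group_soft_threshold(z, lam2 * w_j / 2.0)
                for z, w_j in zip(z_groups, w)]
    return beta_tilde, w, beta_hat


# Hypothetical values of z_j = Z_j'Y for three groups of size 2.
z_groups = [[3.0, 4.0], [0.6, 0.8], [0.3, 0.4]]
beta_tilde, w, beta_hat = two_step(z_groups, lam1=1.6, lam2=0.5)
```

In this toy input the group Lasso keeps the first two groups and drops the third. The second group is heavily shrunk in Step 1, so it receives a large weight and is removed in Step 2, while the strong first group is barely penalized.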

3. Main results

This section presents our results on the asymptotic properties of the estimators defined in Steps 1 and 2 of Section 2.

Let k be a nonnegative integer, and let α ∈ (0, 1] be such that d = k + α > 0.5. Let ℱ be the class of functions f on [a, b] whose kth derivative f^(k) exists and satisfies a Lipschitz condition of order α:

| f^{(k)} (s) - f^{(k)} (t) | \leq C | s - t |^{α} for s, t \in [a, b] .

In (1), without loss of generality, suppose that the first q components are nonzero, that is, f_j(x) ≠ 0, 1 ≤ j ≤ q, but f_j(x) ≡ 0, q + 1 ≤ j ≤ p. Let A₁ = {1, …, q} and A₀ = {q + 1, …, p}. Define ${‖ f ‖}_{2} = {[\int_{a}^{b} f^{2} (x) dx]}^{1 / 2}$ for any function f, whenever the integral exists.

We make the following assumptions.

(A1) The number of nonzero components q is fixed and there is a constant c_f > 0 such that min_1≤j≤q‖f_j‖₂ ≥ c_f.

(A2) The random variables ε₁, …, ε_n are independent and identically distributed with Eε_i = 0 and Var(ε_i) = σ². Furthermore, their tail probabilities satisfy P(|ε_i| > x) ≤ K exp(−Cx²), i = 1, …, n, for all x ≥ 0 and for constants C and K.

(A3) E f_j(X_j) = 0 and f_j ∈ ℱ, j = 1, …, q.

(A4) The covariate vector X has a continuous density and there exist constants C₁ and C₂ such that the density function g_j of X_j satisfies 0 < C₁ ≤ g_j (x) ≤ C₂ < ∞ on [a, b] for every 1 ≤ j ≤ p.

We note that (A1), (A3) and (A4) are standard conditions for nonparametric additive models. They would be needed to estimate the nonzero additive components at the optimal ℓ₂ rate of convergence on [a, b], even if q were fixed and known. Only (A2) strengthens the assumptions needed for nonparametric estimation of a nonparametric additive model. While condition (A1) is reasonable in most applications, it would be interesting to relax this condition and investigate the case when the number of nonzero components can also increase with the sample size. The only technical reason that we assume this condition is related to Lemma 3 given in the Appendix, which is concerned with the properties of the smallest and largest eigenvalues of the “design matrix” formed from the spline basis functions. If this lemma can be extended to the case of a divergent number of components, then (A1) can be relaxed. However, it is clear that there needs to be a restriction on the number of nonzero components to ensure model identification.

3.1. Estimation consistency of the group Lasso

In this section, we consider the selection and estimation properties of the group Lasso estimator. Define Ã₁ = {j : ‖β̃_nj‖₂ ≠ 0, 1 ≤ j ≤ p}. Let |A| denote the cardinality of any set A ⊆ {1, …, p}.

THEOREM 1. Suppose that (A1) to (A4) hold and $λ_{n 1} \geq C \sqrt{n \log (p m_{n})}$ for a sufficiently large constant C.

(i) With probability converging to 1, |Ã₁| ≤ M₁|A₁| = M₁q for a finite constant M₁ > 1.
(ii) If $m_{n}^{2} \log (p m_{n}) / n \to 0$ and $(λ_{n 1}^{2} m_{n}) / n^{2} \to 0$ as n → ∞, then all the nonzero β_nj, 1 ≤ j ≤ q, are selected with probability converging to one.
(iii) $\sum_{j = 1}^{p} {‖ {\tilde{β}}_{nj} - β_{nj} ‖}_{2}^{2} = O_{p} (\frac{m_{n}^{2} \log (p m_{n})}{n}) + O_{p} (\frac{m_{n}}{n}) + O (\frac{1}{m_{n}^{2 d - 1}}) + O (\frac{4 m_{n}^{2} λ_{n 1}^{2}}{n^{2}}) .$

Part (i) of Theorem 1 says that, with probability approaching 1, the group Lasso selects a model whose dimension is a constant multiple of the number of nonzero additive components f_j, regardless of the number of additive components that are zero. Part (ii) implies that every nonzero coefficient will be selected with high probability. Part (iii) shows that the difference between the coefficients in the spline representation of the nonparametric functions in (1) and their estimators converges to zero in probability. The rate of convergence is determined by four terms: the stochastic error in estimating the nonparametric components (the first term) and the intercept µ (the second term), the spline approximation error (the third term) and the bias due to penalization (the fourth term).

Let ${\tilde{f}}_{nj} (x) = \sum_{k = 1}^{m_{n}} {\tilde{β}}_{jk} ψ_{k} (x)$ , 1 ≤ j ≤ p. The following theorem is a consequence of Theorem 1.

THEOREM 2. Suppose that (A1) to (A4) hold and that $λ_{n 1} \geq C \sqrt{n \log (p m_{n})}$ for a sufficiently large constant C. Then:

(i) Let Ã_f = {j : ‖f̃_nj‖₂ > 0, 1 ≤ j ≤ p}. There is a constant M₁ > 1 such that, with probability converging to 1, |Ã_f| ≤ M₁q.
(ii) If (m_n log(pm_n))/n → 0 and $(λ_{n 1}^{2} m_{n}) / n^{2} \to 0$ as n → ∞, then all the nonzero additive components f_j, 1 ≤ j ≤ q, are selected with probability converging to one.
(iii) ${‖ {\tilde{f}}_{nj} - f_{j} ‖}_{2}^{2} = O_{p} (\frac{m_{n} \log (p m_{n})}{n}) + O_{p} (\frac{1}{n}) + O (\frac{1}{m_{n}^{2 d}}) + O (\frac{4 m_{n} λ_{n 1}^{2}}{n^{2}}), j \in {\tilde{A}}_{2},$ where Ã₂ = A₁ ∪ Ã₁.

Thus, under the conditions of Theorem 2, the group Lasso selects all the nonzero additive components with high probability. Part (iii) of the theorem gives the rate of convergence of the group Lasso estimator of the nonparametric components.

For any two sequences {a_n, b_n, n = 1, 2,…}, we write a_n ≍ b_n if there are constants 0 < c₁ < c₂ < ∞ such that c₁ ≤ a_n/b_n ≤ c₂ for all n sufficiently large.

We now state a useful corollary of Theorem 2.

COROLLARY 1. Suppose that (A1) to (A4) hold. If $λ_{n 1} ≍ \sqrt{n \log (p m_{n})}$ and m_n ≍ n^1/(2d+1), then:

(i) If n^−2d/(2d+1) log(p) → 0 as n → ∞, then with probability converging to one, all the nonzero components f_j, 1 ≤ j ≤ q, are selected and the number of selected components is no more than M₁q.
(ii) ${‖ {\tilde{f}}_{nj} - f_{j} ‖}_{2}^{2} = O_{p} (n^{- 2 d / (2 d + 1)} \log (p m_{n})), j \in {\tilde{A}}_{2} .$

For the λ_n1 and m_n given in Corollary 1, the number of zero components can be as large as exp(o(n^2d/(2d+1))). For example, if each f_j has continuous second derivative (d = 2), then it is exp(o(n^4/5)), which can be much larger than n.
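
To see where the rate in Corollary 1 comes from, one can substitute m_n ≍ n^1/(2d+1) and $λ_{n 1} ≍ \sqrt{n \log (p m_{n})}$ into the four terms of part (iii) of Theorem 2. The following is a worked check of the exponents only, with constants suppressed:

\frac{m_{n} \log (p m_{n})}{n} ≍ n^{- 2 d / (2 d + 1)} \log (p m_{n}), \quad \frac{1}{n} = o (n^{- 2 d / (2 d + 1)}), \quad \frac{1}{m_{n}^{2 d}} ≍ n^{- 2 d / (2 d + 1)}, \quad \frac{4 m_{n} λ_{n 1}^{2}}{n^{2}} ≍ \frac{m_{n} \log (p m_{n})}{n} .

The first and fourth terms are of the same order and dominate the other two, which gives the bound n^−2d/(2d+1) log(pm_n).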

3.2. Selection consistency of the adaptive group Lasso

We now consider the properties of the adaptive group Lasso. We first state a general result concerning the selection consistency of the adaptive group Lasso, assuming an initial consistent estimator is available. We then apply it to the case when the group Lasso is used as the initial estimator. We make the following assumptions.

(B1) The initial estimators β̃_nj are r_n-consistent at zero:

r_{n} max_{j \in A_{0}} {‖ {\tilde{β}}_{nj} ‖}_{2} = O_{P} (1), r_{n} \to \infty,

and there exists a constant c_b > 0 such that

P (min_{j \in A_{1}} {‖ {\tilde{β}}_{nj} ‖}_{2} \geq c_{b} b_{n 1}) \to 1,

where b_n1 = min_j∈A₁‖β_nj‖₂.

(B2) Let q be the number of nonzero components and s_n = p − q be the number of zero components. Suppose that:

$\frac{m_{n}}{n^{1 / 2}} + \frac{λ_{n 2} m_{n}^{1 / 4}}{n} = o (1),$
$\frac{n^{1 / 2} {log}^{1 / 2} (s_{n} m_{n})}{λ_{n 2} r_{n}} + \frac{n}{λ_{n 2} r_{n} m_{n}^{(2 d + 1) / 2}} = o (1) .$

We state condition (B1) for a general initial estimator, to highlight the point that the availability of an r_n-consistent estimator at zero is crucial for the adaptive group Lasso to be selection consistent. In other words, any initial estimator satisfying (B1) will ensure that the adaptive group Lasso (based on this initial estimator) is selection consistent, provided that certain regularity conditions are satisfied. We note that it follows immediately from Theorem 1 that the group Lasso estimator satisfies (B1). We will come back to this point below.

For ${\hat{β}}_{n} \equiv ({\hat{β}}_{n 1}^{'}, \dots, {\hat{β}}_{np}^{'})'$ and $β_{n} \equiv (β_{n 1}^{'}, \dots, β_{np}^{'})'$ , we say β̂_n =₀ β_n if sgn₀(‖β̂_nj‖) = sgn₀(‖β_nj‖), 1 ≤ j ≤ p, where sgn₀(|x|) = 1 if |x| > 0 and sgn₀(|x|) = 0 if |x| = 0.

THEOREM 3. Suppose that conditions (B1), (B2) and (A1)–(A4) hold. Then:

(i) $P ({\hat{β}}_{n} =_{0} β_{n}) \to 1 .$
(ii) $\sum_{j = 1}^{q} {‖ {\hat{β}}_{nj} - β_{nj} ‖}_{2}^{2} = O_{p} (\frac{m_{n}^{2}}{n}) + O_{p} (\frac{m_{n}}{n}) + O (\frac{1}{m_{n}^{2 d - 1}}) + O (\frac{4 m_{n}^{2} λ_{n 2}^{2}}{n^{2}}) .$

This theorem is concerned with the selection and estimation properties of the adaptive group Lasso in terms of β̂_n. The following theorem states the results in terms of the estimators of the nonparametric components.

THEOREM 4. Suppose that conditions (B1), (B2) and (A1)–(A4) hold. Then:

(i) $P ({‖ {\hat{f}}_{nj} ‖}_{2} > 0, j \in A_{1}$ and ${‖ {\hat{f}}_{nj} ‖}_{2} = 0, j \in A_{0}) \to 1 .$
(ii) $\sum_{j = 1}^{q} {‖ {\hat{f}}_{nj} - f_{j} ‖}_{2}^{2} = O_{p} (\frac{m_{n}}{n}) + O_{p} (\frac{1}{n}) + O (\frac{1}{m_{n}^{2 d}}) + O (\frac{4 m_{n} λ_{n 2}^{2}}{n^{2}}) .$

Part (i) of this theorem states that the adaptive group Lasso can consistently distinguish nonzero components from zero components. Part (ii) gives an upper bound on the rate of convergence of the estimator.

We now apply the above results to our proposed procedure described in Section 2, in which we first obtain the group Lasso estimator and then use it as the initial estimator in the adaptive group Lasso.

By Theorem 1, if $λ_{n 1} ≍ \sqrt{n log ({pm}_{n})}$ and m_n ≍ n^1/(2d+1) for d ≥ 1, then the group Lasso estimator satisfies (B1) with $r_{n} ≍ n^{d / (2 d + 1)} / \sqrt{log ({pm}_{n})}$ . In this case, (B2) simplifies to

\frac{λ_{n 2}}{n^{(8 d + 3) / (8 d + 4)}} = o (1) and \frac{n^{1 / (4 d + 2)} {log}^{1 / 2} ({pm}_{n})}{λ_{n 2}} = o (1) .

(7)

We summarize the above discussion in the following corollary.

COROLLARY 2. Let the group Lasso estimator β̃_n ≡ β̃_n(λ_n1) with $λ_{n 1} ≍ \sqrt{n log ({pm}_{n})}$ and m_n ≍ n^1/(2d+1) be the initial estimator in the adaptive group Lasso. Suppose that the conditions of Theorem 1 hold. If λ_n2 ≤ O(n^1/2) and satisfies (7), then the adaptive group Lasso consistently selects the nonzero components in (1), that is, part (i) of Theorem 4 holds. In addition,

\sum_{j = 1}^{q} {‖ {\hat{f}}_{nj} - f_{j} ‖}_{2}^{2} = O_{p} (n^{- 2 d / (2 d + 1)}) .

This corollary follows directly from Theorems 1 and 4. The largest λ_n2 allowed is λ_n2 = O(n^1/2). With this λ_n2, the first condition in (7) is satisfied. Substituting it into the second condition in (7), we obtain p = exp(o(n^2d/(2d+1))), which is the largest p permitted and can be larger than n. Thus, under the conditions of this corollary, our proposed adaptive group Lasso estimator using the group Lasso as the initial estimator is selection consistent and achieves the optimal rate of convergence even when p is larger than n. Following model selection, oracle-efficient, asymptotically normal estimators of the nonzero components can be obtained by using existing methods.
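
The exponent arithmetic behind the largest admissible p can be checked directly from the two conditions in (7), taking λ_n2 ≍ n^1/2 (a worked check, with constants suppressed):

\frac{λ_{n 2}}{n^{(8 d + 3) / (8 d + 4)}} ≍ n^{1 / 2 - (8 d + 3) / (8 d + 4)} = n^{- (4 d + 1) / (8 d + 4)} \to 0,

\frac{n^{1 / (4 d + 2)} {\log}^{1 / 2} (p m_{n})}{λ_{n 2}} ≍ n^{- d / (2 d + 1)} {\log}^{1 / 2} (p m_{n}) = o (1) \iff \log (p m_{n}) = o (n^{2 d / (2 d + 1)}) .

Since log m_n is of order log n, the second line is in effect a restriction on log p, which is the bound p = exp(o(n^2d/(2d+1))).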

4. Simulation studies

We use simulation to evaluate the performance of the adaptive group Lasso with regard to variable selection. The generating model is

y_{i} = f (x_{i}) + ε_{i} \equiv \sum_{j = 1}^{p} f_{j} (x_{ij}) + ε_{i}, i = 1, \dots, n .

(8)

Since p can be larger than n, we consider two ways to select the penalty parameter, the BIC [Schwarz (1978)] and the EBIC [Chen and Chen (2008, 2009)]. The BIC is defined as

BIC (λ) = log ({RSS}_{λ}) + {df}_{λ} \cdot \frac{log n}{n} .

Here, RSS_λ is the residual sum of squares for a given λ, and the degrees of freedom df_λ = q̂_λm_n, where q̂_λ is the number of nonzero estimated components for the given λ. The EBIC is defined as

EBIC (λ) = log ({RSS}_{λ}) + {df}_{λ} \cdot \frac{log n}{n} + ν \cdot {df}_{λ} \cdot \frac{log p}{n},

where 0 ≤ ν ≤ 1 is a constant. We use ν = 0.5.
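
The two criteria translate directly into code. The sketch below implements the BIC and EBIC formulas above with df_λ = q̂_λ m_n and picks the penalty parameter with the smallest EBIC; the function names, the value m_n = 6 and the candidate path of (λ, RSS_λ, q̂_λ) triples are hypothetical.

```python
import math


def bic(rss, n_selected, m_n, n):
    """BIC(lambda) = log(RSS) + df * log(n) / n, with df = n_selected * m_n."""
    df = n_selected * m_n
    return math.log(rss) + df * math.log(n) / n


def ebic(rss, n_selected, m_n, n, p, nu=0.5):
    """EBIC adds nu * df * log(p) / n to the BIC."""
    df = n_selected * m_n
    return bic(rss, n_selected, m_n, n) + nu * df * math.log(p) / n


def select_lambda(path, m_n, n, p, nu=0.5):
    """Return the lambda with the smallest EBIC.

    `path` holds (lambda, rss, n_selected) triples, one per candidate lambda.
    """
    return min(path, key=lambda row: ebic(row[1], row[2], m_n, n, p, nu))[0]


# Hypothetical path: (lambda, residual sum of squares, selected components).
path = [(10.0, 1200.0, 1), (5.0, 400.0, 4), (1.0, 380.0, 12)]
best = select_lambda(path, m_n=6, n=200, p=1000)
```

In this made-up path the smallest λ reduces the residual sum of squares only slightly while adding eight components, so the extra log p term steers the choice to the middle value.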

We have also considered two other possible ways of defining df: (a) using the trace of a linear smoother based on a quadratic approximation; (b) using the number of estimated nonzero components. We have decided to use the definition given above based on the results from our simulations. We note that the df for the group Lasso of Yuan and Lin (2006) requires an initial (least squares) estimator, which is not available when p > n. Thus, their df is not applicable to our problem.

In our simulation example, we compare the adaptive group Lasso with the group Lasso and ordinary Lasso. Here, the ordinary Lasso estimator is defined as the value that minimizes

{‖ Y - Z β_{n} ‖}_{2}^{2} + λ_{n} \sum_{j = 1}^{p} \sum_{k = 1}^{m_{n}} | β_{jk} | .

This simple application of the Lasso does not take into account the grouping structure in the spline expansions of the components. The group Lasso and the adaptive group Lasso estimates are computed using the algorithm proposed by Yuan and Lin (2006). The ordinary Lasso estimates are computed using the LARS algorithm [Efron et al. (2004)]. The group Lasso is used as the initial estimate for the adaptive group Lasso.

We also compare the results from the nonparametric additive modeling with those from the standard linear regression model with Lasso. We note that this is not a fair comparison because the generating model is highly nonlinear. Our purpose is to illustrate that it is necessary to use nonparametric models when the underlying model deviates substantially from linear models in the context of variable selection with high-dimensional data and that model misspecification can lead to bad selection results.

EXAMPLE 1. We generate data from the model

y_{i} = f (x_{i}) + ε_{i} \equiv \sum_{j = 1}^{p} f_{j} (x_{ij}) + ε_{i}, i = 1, \dots, n,

where f₁(t) = 5t, f₂(t) = 3(2t − 1)², f₃(t) = 4 sin(2πt)/(2 − sin(2πt)), f₄(t) = 6(0.1 sin(2πt) + 0.2 cos(2πt) + 0.3 sin(2πt)² + 0.4 cos(2πt)³ + 0.5 sin(2πt)³), and f₅(t) = ⋯ = f_p(t) = 0. Thus, the number of nonzero functions is q = 4. This generating model is the same as Example 1 of Lin and Zhang (2006). However, here we use this model in high-dimensional settings. We consider p = 1000 and three different sample sizes: n = 50, 100 and 200. We use cubic B-splines with six evenly distributed knots for all the functions f_j. The number of replications in all the simulations is 400.
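
For reference, the four nonzero components of Example 1 can be written out as follows. This is a plain transcription of the formulas above, reading a power such as sin(2πt)² as the square of sin(2πt); the function names and the helper conditional_mean are illustrative.

```python
import math


def f1(t):
    return 5.0 * t


def f2(t):
    return 3.0 * (2.0 * t - 1.0) ** 2


def f3(t):
    s = math.sin(2.0 * math.pi * t)
    return 4.0 * s / (2.0 - s)


def f4(t):
    s = math.sin(2.0 * math.pi * t)
    c = math.cos(2.0 * math.pi * t)
    # Powers are applied to the trigonometric values, e.g. sin(2*pi*t) ** 2.
    return 6.0 * (0.1 * s + 0.2 * c + 0.3 * s ** 2 + 0.4 * c ** 3 + 0.5 * s ** 3)


NONZERO_COMPONENTS = (f1, f2, f3, f4)


def conditional_mean(x_row):
    """f(x) = f1(x_1) + ... + f4(x_4); components 5, ..., p are zero."""
    return sum(f(x_row[j]) for j, f in enumerate(NONZERO_COMPONENTS))
```

Only the first four entries of a covariate row enter conditional_mean, whatever the value of p.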

The covariates are simulated as follows. First, we generate $w_{i 1}, \dots, w_{ip}, u_{i}, u_{i}^{'}, υ_{i}$ independently from N(0, 1) truncated to the interval [0, 1], i = 1, …, n. Then we set x_ik = (w_ik + tu_i)/(1 + t) for k = 1, …, 4 and x_ik = (w_ik + tυ_i)/(1 + t) for k = 5, …, p, where the parameter t controls the amount of correlation among predictors. For j ≠ k, we have Corr(x_ik, x_ij) = t²/(1 + t²), 1 ≤ j ≤ 4, 1 ≤ k ≤ 4, and Corr(x_ik, x_ij) = t²/(1 + t²), 5 ≤ j ≤ p, 5 ≤ k ≤ p, but the covariates of the nonzero components and zero components are independent. We consider t = 0, 1 in our simulation. The signal-to-noise ratio (SNR) is defined to be sd(f)/sd(ε). The error term is chosen to be ε_i ~ N(0, 1.27²) to give an SNR of 3.11 : 1. This value is the same as the estimated SNR in the real data example below, which is the square root of the ratio of the sum of squared estimated components to the sum of squared residuals.
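
The covariate construction can be sketched as follows. Rejection sampling for the truncated normal, the function names, the seed and the small value p = 6 are choices made for this illustration only; the variable u_i′ listed above does not enter the stated formulas and is omitted.

```python
import random


def truncated_normal(rng):
    """Draw from N(0, 1) truncated to [0, 1] by rejection sampling."""
    while True:
        z = rng.gauss(0.0, 1.0)
        if 0.0 <= z <= 1.0:
            return z


def simulate_covariates(n, p, t, rng):
    """Return n rows of p covariates following the construction above."""
    rows = []
    for _ in range(n):
        w = [truncated_normal(rng) for _ in range(p)]
        u = truncated_normal(rng)   # shared by covariates 1, ..., 4
        v = truncated_normal(rng)   # shared by covariates 5, ..., p
        rows.append([(w[k] + t * (u if k < 4 else v)) / (1.0 + t)
                     for k in range(p)])
    return rows


def correlation(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def column(rows, k):
    return [row[k] for row in rows]


rng = random.Random(2010)
X = simulate_covariates(n=20000, p=6, t=1.0, rng=rng)
within_nonzero = correlation(column(X, 0), column(X, 1))   # both use u_i
within_zero = correlation(column(X, 4), column(X, 5))      # both use v_i
across = correlation(column(X, 0), column(X, 4))           # independent blocks
```

With t = 1 the theoretical within-block correlation is t²/(1 + t²) = 0.5, and the correlation across the two blocks is zero; the sample values should be close to these for a sample of this size.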

The results of 400 Monte Carlo replications are summarized in Table 1. The columns are the mean number of variables selected (NV), model error (ME), the percentage of replications in which all the correct additive components are included in the selected model (IN), and the percentage of replications in which precisely the correct components are selected (CS). The corresponding standard errors are in parentheses. The model error is computed as the average of $n^{- 1} \sum_{i = 1}^{n} {[\hat{f} (x_{i}) - f (x_{i})]}^{2}$ over the 400 Monte Carlo replications, where f is the true conditional mean function.
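
The summary measures reported in Table 1 can be computed from the selected index sets as in the sketch below. The function names and the four toy replications are hypothetical and serve only to make the definitions of NV, IN, CS and the model error concrete.

```python
def selection_summary(selected_sets, true_set):
    """Summarize replications: mean size (NV), % containing the truth (IN),
    and % equal to the truth (CS)."""
    true_set = set(true_set)
    n_rep = len(selected_sets)
    nv = sum(len(s) for s in selected_sets) / n_rep
    inc = 100.0 * sum(true_set <= set(s) for s in selected_sets) / n_rep
    cs = 100.0 * sum(true_set == set(s) for s in selected_sets) / n_rep
    return {"NV": nv, "IN": inc, "CS": cs}


def model_error(f_hat_values, f_values):
    """n^{-1} * sum_i (f_hat(x_i) - f(x_i))^2 for one replication."""
    n = len(f_values)
    return sum((a - b) ** 2 for a, b in zip(f_hat_values, f_values)) / n


# Hypothetical selected index sets from four replications; truth is {1, 2, 3, 4}.
selected = [{1, 2, 3, 4}, {1, 2, 3, 4, 17}, {1, 2, 4}, {1, 2, 3, 4}]
summary = selection_summary(selected, true_set={1, 2, 3, 4})
```

A replication counts toward IN whenever the selected set contains all four true components, even if it also contains extras, whereas it counts toward CS only when the two sets coincide, so CS can never exceed IN.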

TABLE 1.

Example 1. Simulation results for the adaptive group Lasso, group Lasso, ordinary Lasso, and linear model with Lasso, n = 50, 100 or 200, p = 1000. NV, average number of variables selected; ME, model error; IN, percentage of replications in which all the correct components are included in the selected model; CS, percentage of replications in which exactly the correct components are selected. All quantities are computed over 400 replications. Enclosed in parentheses are the corresponding standard errors. Top panel, independent predictors; bottom panel, correlated predictors.

		Adaptive group Lasso				Group Lasso				Ordinary Lasso				Linear model with Lasso

		NV	ME	IN	CS	NV	ME	IN	CS	NV	ME	IN	CS	NV	ME	IN	CS
		Independent predictors
n = 200	BIC	4.15	26.72	90.00	80.00	4.20	27.54	90.00	58.25	9.73	28.44	95.00	18.00	3.35	31.89	0.00	0.00
		(0.43)	(4.13)	(0.30)	(0.41)	(0.43)	(4.45)	(0.30)	(0.54)	(6.72)	(5.55)	(0.22)	(0.40)	(1.75)	(5.65)	(0.00)	(0.00)
	EBIC	4.09	26.64	92.00	81.75	4.18	27.40	92.00	60.00	9.58	28.15	95.00	32.50	3.30	32.08	0.00	0.00
		(0.38)	(4.06)	(0.24)	(0.39)	(0.40)	(4.33)	(0.24)	(0.50)	(6.81)	(5.25)	(0.22)	(0.47)	(1.86)	(5.69)	(0.00)	(0.00)
n = 100	BIC	4.73	28.26	85.00	70.00	5.03	29.07	85.00	35.00	17.25	29.50	82.50	12.00	6.35	31.57	5.00	0.00
		(1.18)	(5.71)	(0.36)	(0.46)	(1.22)	(6.01)	(0.36)	(0.48)	(8.72)	(5.89)	(0.38)	(0.44)	(2.91)	(7.22)	(0.22)	(0.00)
	EBIC	4.62	28.07	84.25	74.00	4.90	28.87	84.25	38.00	15.93	29.35	84.00	27.75	5.90	31.53	5.00	0.00
		(0.89)	(5.02)	(0.36)	(0.42)	(1.20)	(5.72)	(0.36)	(0.50)	(9.06)	(5.25)	(0.36)	(0.45)	(2.97)	(6.40)	(0.22)	(0.00)
n = 50	BIC	4.75	28.86	80.00	65.00	5.12	29.97	80.00	32.00	18.53	30.05	75.00	11.00	12.53	32.52	22.50	0.00
		(1.22)	(5.72)	(0.41)	(0.48)	(1.29)	(6.15)	(0.41)	(0.48)	(12.67)	(6.26)	(0.41)	(0.31)	(3.80)	(8.37)	(0.43)	(0.00)
	EBIC	4.69	28.94	78.00	65.00	5.01	29.82	78.00	36.00	17.27	30.50	77.50	26.00	10.33	31.64	20.00	0.00
		(1.98)	(6.48)	(0.40)	(0.48)	(1.21)	(6.11)	(0.40)	(0.49)	(15.32)	(7.89)	(0.39)	(0.44)	(3.19)	(8.17)	(0.41)	(0.00)
		Correlated predictors
n = 200	BIC	3.20	27.76	66.00	60.00	3.85	28.12	66.00	30.00	9.13	28.80	56.00	11.00	1.08	32.18	0.00	0.00
		(1.27)	(4.74)	(0.46)	(0.50)	(1.49)	(4.76)	(0.46)	(0.46)	(7.02)	(5.36)	(0.51)	(0.31)	(0.33)	(8.99)	(0.00)	(0.00)
	EBIC	3.23	27.60	68.00	63.00	3.92	27.85	68.00	31.00	9.24	28.22	58.00	13.75	1.30	32.00	0.00	0.00
		(1.24)	(4.34)	(0.45)	(0.49)	(1.68)	(4.50)	(0.45)	(0.48)	(7.18)	(5.30)	(0.52)	(0.44)	(1.60)	(8.92)	(0.00)	(0.00)
n = 100	BIC	2.88	27.88	60.00	56.00	3.28	28.33	60.00	22.00	8.80	28.97	52.00	8.00	1.00	32.24	0.00	0.00
		(1.91)	(4.88)	(0.50)	(0.56)	(1.96)	(4.92)	(0.50)	(0.42)	(10.22)	(5.45)	(0.44)	(0.26)	(0.00)	(9.20)	(0.00)	(0.00)
	EBIC	3.04	27.78	61.75	58.00	3.44	28.16	61.75	24.00	9.06	28.55	54.00	10.00	1.00	32.09	0.00	0.00
		(1.46)	(4.85)	(0.49)	(0.54)	(1.52)	(4.90)	(0.49)	(0.43)	(11.24)	(5.42)	(0.46)	(0.28)	(0.00)	(8.98)	(0.00)	(0.00)
n = 50	BIC	2.50	28.36	48.50	38.00	3.10	29.37	48.50	20.00	8.01	30.48	30.00	5.00	1.00	33.28	0.00	0.00
		(1.64)	(5.32)	(0.50)	(0.55)	(1.78)	(5.98)	(0.50)	(0.41)	(11.42)	(6.77)	(0.46)	(0.23)	(0.00)	(9.42)	(0.00)	(0.00)
	EBIC	2.48	28.57	48.00	38.00	3.07	30.13	48.00	18.00	8.24	30.89	32.00	6.00	1.00	33.25	0.00	0.00
		(1.62)	(5.51)	(0.51)	(0.55)	(1.76)	(7.60)	(0.51)	(0.40)	(11.46)	(6.40)	(0.48)	(0.24)	(0.00)	(9.38)	(0.00)	(0.00)

Table 1 shows that the adaptive group Lasso selects all the nonzero components (IN) and selects exactly the correct model (CS) more frequently than the other methods do. For example, with the BIC and n = 200, the percentage of correct selections (CS) by the adaptive group Lasso is 80% with independent predictors and 60% with correlated predictors, which is much higher than the corresponding 58.25% and 30% for the group Lasso and 18% and 11% for the ordinary Lasso. The adaptive group Lasso and group Lasso perform better than the ordinary Lasso in all of the experiments, which illustrates the importance of taking account of the group structure of the coefficients of the spline expansion. Correlation among covariates increases the difficulty of component selection, so it is not surprising that all methods perform better with independent covariates than with correlated ones. The percentage of correct selections increases as the sample size increases. The linear model with Lasso never selects the correct model. This illustrates the poor results that can be produced by a linear model when the true conditional mean function is nonlinear.

Table 1 also shows that the model error (ME) of the group Lasso is only slightly larger than that of the adaptive group Lasso. The models selected by the group Lasso nest those selected by the adaptive group Lasso and, therefore, have more estimated coefficients. As a result, the group Lasso estimators of the conditional mean function have a larger variance and a larger ME. The differences between the MEs of the two methods are small, however, because, as can be seen from the NV column, the models selected by the group Lasso in our experiments have only slightly more estimated coefficients than the models selected by the adaptive group Lasso.

EXAMPLE 2. We now compare the adaptive group Lasso with the COSSO [Lin and Zhang (2006)]. This comparison was suggested to us by the Associate Editor. Because the COSSO algorithm only works when p is smaller than n, we use the same set-up as in Example 1 of Lin and Zhang (2006). In this example, the generating model is as in (8) with 4 nonzero components. Let X_j = (W_j + tU)/(1 + t), j = 1, …, p, where W₁, …, W_p and U are i.i.d. from N(0, 1), truncated to the interval [0, 1]. Therefore, corr(X_j, X_k) = t²/(1 + t²) for j ≠ k. The random error term is ε ~ N(0, 1.32²), which gives an SNR of 3:1. We consider three different sample sizes, n = 50, 100 or 200, and three different numbers of predictors, p = 10, 20 or 50. The COSSO estimator is computed using the Matlab software that is publicly available at http://www4.stat.ncsu.edu/~hzhang/cosso.html.

The COSSO procedure uses either generalized cross-validation or 5-fold cross-validation. Based on the simulation results of Lin and Zhang (2006) and our own simulations, the COSSO with 5-fold cross-validation has better selection performance. Thus, we compare the adaptive group Lasso with BIC or EBIC to the COSSO with 5-fold cross-validation. The results are given in Table 2. For independent predictors, when n = 200 and p = 10, 20 or 50, the adaptive group Lasso and COSSO have similar performance in terms of selection accuracy and model error. However, for smaller n and larger p, the adaptive group Lasso does significantly better. For example, for n = 100 and p = 50, the percentage of correct selections for the adaptive group Lasso is 81–83%, but it is only 11% for the COSSO. The model error of the adaptive group Lasso is similar to or smaller than that of the COSSO. In several experiments, the model error of the COSSO is 2 to more than 7 times larger than that of the adaptive group Lasso. It is interesting to note that when n = 50 and p = 20 or 50, the adaptive group Lasso still does a decent job of selecting the correct model, but the COSSO does poorly in these two cases. In particular, for n = 50 and p = 50, the COSSO did not select the exact correct model in any of the simulation runs. For dependent predictors, the comparison is even more favorable to the adaptive group Lasso, which performs significantly better than the COSSO in terms of both model error and selection accuracy in all the cases.

TABLE 2.

Example 2. Simulation results comparing the adaptive group Lasso and COSSO. n = 50, 100 or 200, p = 10, 20 or 50. NV, average number of the variables being selected; ME, model error; IN, percentage of occasions on which all the correct components are included in the selected model; CS, percentage of occasions on which correct components are selected, averaged over 400 replications. Enclosed in parentheses are the corresponding standard errors

		p = 10				p = 20				p = 50

		NV	ME	IN	CS	NV	ME	IN	CS	NV	ME	IN	CS
		Independent predictors
n = 200	AGLasso(BIC)	4.02	0.27	100.00	98.00	4.01	0.34	96.00	92.00	4.10	0.88	98.00	90.00
		(0.14)	(0.10)	(0.00)	(0.14)	(0.40)	(0.10)	(0.20)	(0.27)	(0.39)	(0.19)	(0.14)	(0.30)
	AGLasso(EBIC)	4.02	0.27	100.00	99.00	4.05	0.32	100.00	94.00	4.08	0.87	98.00	90.00
		(0.14)	(0.09)	(0.00)	(0.10)	(0.22)	(0.09)	(0.00)	(0.24)	(0.30)	(0.16)	(0.14)	(0.30)
	COSSO(5CV)	4.06	0.29	100.00	98.00	4.10	0.37	100.00	92.00	4.49	1.53	94.00	84.00
		(0.24)	(0.07)	(0.00)	(0.14)	(0.39)	(0.11)	(0.00)	(0.27)	(1.10)	(0.86)	(0.24)	(0.37)
n = 100	AGLasso(BIC)	4.06	0.56	99.00	90.00	4.11	0.63	98.00	87.00	4.27	1.04	93.00	81.00
		(0.24)	(0.19)	(0.10)	(0.30)	(0.42)	(0.26)	(0.14)	(0.34)	(0.58)	(0.64)	(0.26)	(0.39)
	AGLasso(EBIC)	4.06	0.54	99.00	91.00	4.10	0.59	98.00	89.00	4.22	1.01	93.00	83.00
		(0.24)	(0.21)	(0.10)	(0.31)	(0.39)	(0.22)	(0.14)	(0.31)	(0.56)	(0.60)	(0.26)	(0.38)
	COSSO(5CV)	4.17	0.53	96.00	89.00	4.18	1.04	83.00	63.00	4.89	6.63	30.00	11.00
		(0.62)	(0.19)	(0.20)	(0.31)	(0.96)	(0.64)	(0.38)	(0.49)	(1.50)	(1.29)	(0.46)	(0.31)
n = 50	AGLasso(BIC)	4.18	0.72	98.00	84.00	4.25	0.99	96.00	79.00	4.30	1.06	90.00	71.00
		(0.66)	(0.56)	(0.14)	(0.36)	(0.72)	(0.60)	(0.20)	(0.41)	(0.89)	(0.68)	(0.30)	(0.46)
	AGLasso(EBIC)	4.16	0.70	98.00	84.00	4.24	1.02	94.00	78.00	4.27	1.04	92.00	73.00
		(0.64)	(0.52)	(0.14)	(0.36)	(0.70)	(0.62)	(0.20)	(0.42)	(0.86)	(0.64)	(0.27)	(0.45)
	COSSO(5CV)	4.41	1.77	61.00	58.00	5.06	5.53	33.00	20.00	5.96	7.60	8.00	0.00
		(1.08)	(1.35)	(0.46)	(0.42)	(1.54)	(1.88)	(0.47)	(0.40)	(2.20)	(2.07)	(0.27)	(0.00)
		Correlated predictors
n = 200	AGLasso(BIC)	3.75	0.49	82.00	70.00	3.71	1.20	75.00	66.00	3.50	1.68	68.00	62.00
		(0.61)	(0.14)	(0.39)	(0.46)	(0.68)	(0.89)	(0.41)	(0.46)	(0.92)	(1.29)	(0.45)	(0.49)
	AGLasso(EBIC)	3.75	0.49	82.00	70.00	3.73	1.18	75.00	68.00	3.58	1.60	70.00	65.00
		(0.61)	(0.14)	(0.39)	(0.46)	(0.65)	(0.88)	(0.41)	(0.45)	(0.84)	(1.27)	(0.46)	(0.46)
	COSSO(5CV)	3.70	0.53	69.00	41.00	3.89	1.24	57.00	36.00	4.11	1.76	41.00	16.00
		(0.58)	(0.17)	(0.46)	(0.49)	(0.60)	(0.90)	(0.50)	(0.48)	(0.86)	(1.33)	(0.49)	(0.37)
n = 100	AGLasso(BIC)	3.72	1.40	78.00	68.00	3.68	1.78	70.00	64.00	3.02	3.07	63.00	59.00
		(0.66)	(0.70)	(0.40)	(0.45)	(0.74)	(1.15)	(0.46)	(0.48)	(1.58)	(2.37)	(0.49)	(0.51)
	AGLasso(EBIC)	3.70	1.46	75.00	66.00	3.71	1.74	72.00	64.00	3.20	2.98	65.00	60.00
		(0.72)	(0.78)	(0.41)	(0.46)	(0.68)	(1.06)	(0.42)	(0.48)	(1.42)	(1.96)	(0.46)	(0.50)
	COSSO(5CV)	3.98	1.42	41.00	26.00	4.14	1.76	30.00	6.00	4.24	6.88	8.00	0.00
		(0.64)	(0.74)	(0.49)	(0.42)	(2.27)	(1.11)	(0.46)	(0.24)	(2.96)	(2.91)	(0.27)	(0.00)
n = 50	AGLasso(BIC)	3.30	2.26	70.00	62.00	3.06	3.02	65.00	60.00	2.87	4.01	52.00	42.00
		(1.16)	(1.09)	(0.46)	(0.49)	(1.52)	(2.14)	(0.46)	(0.50)	(1.56)	(3.69)	(0.44)	(0.52)
	AGLasso(EBIC)	3.32	2.20	70.00	64.00	3.10	3.01	68.00	62.00	2.90	3.88	50.00	42.00
		(1.14)	(1.06)	(0.46)	(0.48)	(1.51)	(2.12)	(0.45)	(0.49)	(1.54)	(3.62)	(0.42)	(0.52)
	COSSO(5CV)	4.14	3.77	25.00	6.00	4.20	6.98	5.00	0.00	4.90	9.93	1.00	0.00
		(2.25)	(2.02)	(0.44)	(0.24)	(2.88)	(2.82)	(0.22)	(0.00)	(3.30)	(4.08)	(0.10)	(0.00)

5. Data example

We use the data set reported in Scheetz et al. (2006) to illustrate the application of the proposed method in high-dimensional settings. For this data set, 120 twelve-week-old male rats were selected for tissue harvesting from the eyes and for microarray analysis. The microarrays used to analyze the RNA from the eyes of these animals contain over 31,042 different probe sets (Affymetrix GeneChip Rat Genome 230 2.0 Array). The intensity values were normalized using the robust multi-chip averaging method [Irizarry et al. (2003)] to obtain summary expression values for each probe set. Gene expression levels were analyzed on a logarithmic scale.

We are interested in finding the genes that are related to the gene TRIM32. This gene was recently found to cause Bardet–Biedl syndrome [Chiang et al. (2006)], which is a genetically heterogeneous disease of multiple organ systems including the retina. Although over 30,000 probe sets are represented on the Rat Genome 230 2.0 Array, many of them are not expressed in the eye tissue, and initial screening using correlation shows that most probe sets have very low correlation with TRIM32. In addition, we expect only a small number of genes to be related to TRIM32. Therefore, we use the 500 probe sets that are expressed in the eye and have the highest marginal correlation with TRIM32 in the analysis. Thus, the sample size is n = 120 (i.e., there are 120 arrays from 120 rats) and p = 500. Because only a few genes are expected to be related to TRIM32, this is a sparse, high-dimensional regression problem.

We use the nonparametric additive model to model the relation between the expression of TRIM32 and those of the 500 genes. We estimate model (1) using the ordinary Lasso, group Lasso, and adaptive group Lasso for the nonparametric additive model. To compare the results of the nonparametric additive model with those of the linear regression model, we also analyzed the data using the linear regression model with Lasso. We scale the covariates so that their values are between 0 and 1 and use cubic splines with six evenly distributed knots to estimate the additive components. The penalty parameters in all the methods are chosen using the BIC or EBIC as in the simulation study. Table 3 lists the probe sets selected by the group Lasso, the adaptive group Lasso and the linear model with Lasso, indicated by check marks. Table 4 shows the number of variables selected and the residual sum of squares obtained with each estimation method. For the ordinary Lasso with the spline expansion, a variable is considered to be selected if any of the estimated coefficients of the spline approximation to its additive component are nonzero. Depending on whether BIC or EBIC is used, the group Lasso selects 16–17 variables, the adaptive group Lasso selects 15 variables, the ordinary Lasso with the spline expansion selects 94–97 variables, and the linear model with Lasso selects 8–14 variables. Table 4 shows that the adaptive group Lasso does better than the other methods in terms of residual sum of squares (RSS). We have also examined the plots (not shown) of the estimated additive components obtained with the group Lasso and the adaptive group Lasso, respectively. Most are highly nonlinear, confirming the need for taking into account nonlinearity.
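To make the preprocessing step concrete, the sketch below rescales a covariate to [0, 1] and builds a centered cubic B-spline basis for it. It is an illustration under stated assumptions rather than the authors' implementation: the six knots are taken to be interior knots evenly spaced in (0, 1), min-max scaling is assumed, the basis is evaluated with the Cox-de Boor recursion, and all function names are hypothetical.

```python
import numpy as np


def minmax_scale(x):
    """Map a covariate to [0, 1] (assumed scaling; the text only states the range)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


def bspline_basis(x, interior_knots, degree=3):
    """Evaluate a clamped B-spline basis on [0, 1] by the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    t = np.concatenate([np.zeros(degree + 1),
                        np.asarray(interior_knots, dtype=float),
                        np.ones(degree + 1)])
    # Degree-0 basis: indicators of the half-open knot intervals.
    B = np.zeros((x.size, len(t) - 1))
    for i in range(len(t) - 1):
        if t[i] < t[i + 1]:
            B[:, i] = (x >= t[i]) & (x < t[i + 1])
    # Put the right endpoint x = 1 into the last non-empty interval.
    last = np.nonzero(t[:-1] < t[1:])[0].max()
    B[x == t[-1], last] = 1.0
    for d in range(1, degree + 1):
        B_new = np.zeros((x.size, len(t) - 1 - d))
        for i in range(len(t) - 1 - d):
            left_den = t[i + d] - t[i]
            right_den = t[i + d + 1] - t[i + 1]
            if left_den > 0:
                B_new[:, i] += (x - t[i]) / left_den * B[:, i]
            if right_den > 0:
                B_new[:, i] += (t[i + d + 1] - x) / right_den * B[:, i + 1]
        B = B_new
    return B


def centered_basis(x, n_interior_knots=6, degree=3):
    """Subtract each column's sample mean, mirroring the centered bases psi_k."""
    knots = np.linspace(0.0, 1.0, n_interior_knots + 2)[1:-1]
    phi = bspline_basis(x, knots, degree)
    return phi - phi.mean(axis=0)


grid = np.linspace(0.0, 1.0, 101)
phi = bspline_basis(grid, np.linspace(0.0, 1.0, 8)[1:-1])
psi = centered_basis(grid)
```

Under these assumptions each covariate contributes 6 + 4 = 10 basis columns, and the coefficients of those columns form the group that the group Lasso penalizes jointly. If the six knots in the text include the boundary points, the column count would differ.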

TABLE 3.

Probe sets selected by the group Lasso and the adaptive group Lasso in the data example using BIC or EBIC for penalty parameter selection. GL, group Lasso; AGL, adaptive group Lasso; Linear, linear model with Lasso

Probes	GL(BIC)	AGL(BIC)	Linear(BIC)	GL(EBIC)	AGL(EBIC)	Linear(EBIC)
1389584_at	✓	✓	✓	✓	✓	✓
1383673_at	✓	✓	✓	✓	✓	✓
1379971_at	✓	✓	✓	✓	✓	✓
1374106_at	✓		✓	✓		✓
1393817_at	✓	✓	✓	✓	✓
1373776_at	✓	✓	✓	✓	✓
1377187_at	✓	✓	✓	✓	✓
1393955_at	✓	✓	✓	✓	✓
1393684_at	✓	✓		✓	✓
1381515_at	✓	✓		✓	✓
1382835_at	✓	✓	✓	✓	✓
1385944_at	✓	✓	✓	✓	✓
1382263_at	✓	✓	✓	✓	✓	✓
1380033_at	✓	✓		✓	✓
1398594_at	✓		✓			✓
1376744_at	✓	✓		✓	✓
1382633_at	✓	✓		✓	✓
1383110_at			✓			✓
1386683_at			✓			✓

TABLE 4.

Analysis results for the data example. No. of probes, the number of probe sets selected; RSS, the residual sum of squares of the fitted model

	BIC		EBIC

	No. of probe sets	RSS	No. of probe sets	RSS
Adaptive group Lasso	15	1.52e–03	15	1.52e–03
Group Lasso	17	3.24e–03	16	3.40e–03
Ordinary Lasso	97	2.96e–07	94	8.10e–08
Linear regression with Lasso	14	2.62e–03	8	3.75e–03

To evaluate the performance of the methods, we use cross-validation and compare the prediction mean square errors (PEs). We randomly partition the data into 6 subsets, each consisting of 20 observations. We then fit the model with 5 subsets as the training set and calculate the PE for the remaining subset, which we treat as the test set. We repeat this process 6 times, using each of the 6 subsets as the test set once. We compute the average of the numbers of probes selected and of the prediction errors over these 6 calculations. Then we replicate this process 400 times (this was suggested to us by the Associate Editor). Table 5 gives the average values over 400 replications. The adaptive group Lasso has smaller average prediction error than the group Lasso, the ordinary Lasso and the linear regression with Lasso. The ordinary Lasso selects far more probe sets than the other approaches, but this does not lead to better prediction performance. Therefore, in this example, the adaptive group Lasso provides the investigator with a more targeted list of probe sets, which can serve as a starting point for further study.
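The repeated 6-fold procedure can be sketched as follows. This is a hedged illustration, not the authors' code: `cv_prediction_error` and `repeated_cv` are hypothetical names, and `ols_fit_predict` is an ordinary least squares stand-in used only so that the example runs; in the analysis above the fitting step would be one of the Lasso-type estimators.

```python
import numpy as np


def cv_prediction_error(X, y, fit_predict, rng, n_folds=6):
    """One random partition into n_folds test sets; returns the average PE."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        pred = fit_predict(X[train_mask], y[train_mask], X[test_idx])
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))


def repeated_cv(X, y, fit_predict, n_reps=400, n_folds=6, seed=0):
    """Repeat the random partition n_reps times; return mean and spread of the PE."""
    rng = np.random.default_rng(seed)
    pes = [cv_prediction_error(X, y, fit_predict, rng, n_folds)
           for _ in range(n_reps)]
    return float(np.mean(pes)), float(np.std(pes, ddof=1))


def ols_fit_predict(X_train, y_train, X_test):
    """Stand-in estimator (assumption): least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef


# Synthetic check with n = 120, so each of the 6 folds has 20 observations.
rng = np.random.default_rng(1)
X = rng.uniform(size=(120, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5])
mean_pe, sd_pe = repeated_cv(X, y, ols_fit_predict, n_reps=5)
```

Passing the estimator as a callable keeps the partitioning logic identical across methods, so differences in PE reflect the estimators and not the splits, provided the same seed is used for each method.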

TABLE 5.

Comparison of adaptive group Lasso, group Lasso, ordinary Lasso, and linear regression model with Lasso for the data example. ANP, the average number of probe sets selected averaged across 400 replications; PE, the average of prediction mean square errors for the test set

	Adaptive group Lasso		Group Lasso		Ordinary Lasso		Linear model with Lasso

	ANP	PE	ANP	PE	ANP	PE	ANP	PE
BIC	15.75	1.86e–02	16.45	2.89e–02	78.48	1.40e–02	9.25	2.26e–02
	(0.85)	(0.47e–02)	(0.88)	(0.49e–02)	(3.62)	(0.90e–02)	(0.88)	(1.41e–02)
EBIC	15.55	1.78e–02	16.75	1.99e–02	80.00	1.23e–02	9.15	2.03e–02
	(0.82)	(0.42e–02)	(0.84)	(0.47e–02)	(3.50)	(0.89e–02)	(0.86)	(1.39e–02)

It is of interest to compare the selection results from the adaptive group Lasso and the linear regression model with Lasso. The adaptive group Lasso and the linear model with Lasso select different sets of genes. When the penalty parameter is chosen with the BIC, the adaptive group Lasso selects 5 genes that are not selected by the linear model with Lasso. In addition, the linear model with Lasso selects 5 genes that are not selected by the adaptive group Lasso. When the penalty parameter is selected with the EBIC, the adaptive group Lasso selects 10 genes that are not selected by the linear model with Lasso. The estimated effects of many of the genes are nonlinear, and the Monte Carlo results of Section 4 show that the performance of the linear model with Lasso can be very poor in the presence of nonlinearity. Therefore, we interpret the differences between the gene selections of the adaptive group Lasso and the linear model with Lasso as evidence that the selections produced by the linear model are misleading.

6. Concluding remarks

In this paper, we propose to use the adaptive group Lasso for variable selection in nonparametric additive models in sparse, high-dimensional settings. A key requirement for the adaptive group Lasso to be selection consistent is that the initial estimator is estimation consistent and selects all the important components with high probability. In low-dimensional settings, finding an initial consistent estimator is relatively easy and can be achieved by many well-established approaches such as the additive spline estimators. However, in high-dimensional settings, finding an initial consistent estimator is difficult. Under the conditions stated in Theorem 1, the group Lasso is shown to be consistent and to select all the important components. Thus, the group Lasso can be used as the initial estimator in the adaptive group Lasso to achieve selection consistency. Following model selection, oracle-efficient, asymptotically normal estimators of the nonzero components can be obtained by using existing methods. Our simulation results indicate that our procedure works well for variable selection in the models considered. Therefore, the adaptive group Lasso is a useful approach for variable selection and estimation in sparse, high-dimensional nonparametric additive models.

Our theoretical results are concerned with a fixed sequence of penalty parameters and are not applicable to the case where the penalty parameters are selected by data-driven procedures such as the BIC. This is an important and challenging problem that deserves further investigation, but it is beyond the scope of this paper. We have only considered linear nonparametric additive models. The adaptive group Lasso can be applied to generalized nonparametric additive models, such as the generalized logistic nonparametric additive model, and to other nonparametric models with high-dimensional data. However, more work is needed to understand the properties of this approach in those more complicated models.

Acknowledgments

The authors wish to thank the Editor, Associate Editor and two anonymous referees for their helpful comments.

APPENDIX: PROOFS

We first prove the following lemmas. Denote the centered versions of 𝒮_n by

𝒮_{nj}^{0} = \{f_{nj} : f_{nj} (x) = \sum_{k = 1}^{m_{n}} β_{jk} ψ_{k} (x), (β_{j 1}, \dots, β_{{jm}_{n}}) \in ℝ^{m_{n}}\}, 1 \leq j \leq p,

where ψ_k’s are the centered spline bases defined in (5).

LEMMA 1. Suppose that f ∈ ℱ and E f(X_j) = 0. Then under (A3) and (A4), there exists an $f_{n} \in 𝒮_{nj}^{0}$ satisfying

‖ f_{n} - f ‖_{2} = O_{p} (m_{n}^{- d} + m_{n}^{1 / 2} n^{- 1 / 2}) .

In particular, if we choose m_n = O(n^{1/(2d+1)}), then

‖ f_{n} - f ‖_{2} = O_{p} (m_{n}^{- d}) = O_{p} (n^{- d / (2 d + 1)}) .

PROOF. By (A4), for f ∈ ℱ, there is an $f_{n}^{*} \in 𝒮_{n}$ such that ${‖ f - f_{n}^{*} ‖}_{2} = O (m_{n}^{- d})$. Let $f_{n} = f_{n}^{*} - n^{- 1} \sum_{i = 1}^{n} f_{n}^{*} (X_{ij})$. Then $f_{n} \in 𝒮_{nj}^{0}$ and $| f_{n} - f | \leq | f_{n}^{*} - f | + | P_{n} f_{n}^{*} |$, where P_n is the empirical measure of the i.i.d. random variables X_1j, …, X_nj. Consider

P_{n} f_{n}^{*} = (P_{n} - P) f_{n}^{*} + P (f_{n}^{*} - f) .

Here, we use the linear functional notation, for example, Pf = ∫ f dP, where P is the probability measure of X_1j. For any ε > 0, the bracketing number $N_{[\cdot]} (ε, 𝒮_{nj}^{0}, L_{2} (P))$ of $𝒮_{nj}^{0}$ satisfies $log N_{[\cdot]} (ε, 𝒮_{nj}^{0}, L_{2} (P)) \leq c_{1} m_{n} log (1 / ε)$ for some constant c₁ > 0 [Shen and Wong (1994), page 597]. Thus, by the maximal inequality [see, for example, van der Vaart (1998), page 288], $(P_{n} - P) f_{n}^{*} = O_{p} (n^{- 1 / 2} m_{n}^{1 / 2})$. By (A4), $| P (f_{n}^{*} - f) | \leq C_{2} {‖ f_{n}^{*} - f ‖}_{2} = O (m_{n}^{- d})$ for some constant C₂ > 0. The lemma follows from the triangle inequality.

LEMMA 2. Suppose that conditions (A2) and (A4) hold. Let

T_{jk} = n^{- 1 / 2} m_{n}^{1 / 2} \sum_{i = 1}^{n} ψ_{k} (X_{ij}) ε_{i}, 1 \leq j \leq p, 1 \leq k \leq m_{n},

and T_n = max_{1≤j≤p,1≤k≤m_n} |T_jk|. Then

E (T_{n}) \leq C_{1} n^{- 1 / 2} m_{n}^{1 / 2} \sqrt{log ({pm}_{n})} {(\sqrt{2 C_{2} m_{n}^{- 1} n log ({pm}_{n})} + 4 log (2 {pm}_{n}) + C_{2} {nm}_{n}^{- 1})}^{1 / 2},

where C₁ and C₂ are two positive constants. In particular, when m_n log(pm_n)/n → 0,

E (T_{n}) = O (1) \sqrt{log ({pm}_{n})} .

PROOF. Let $s_{njk}^{2} = \sum_{i = 1}^{n} ψ_{k}^{2} (X_{ij})$ . Conditional on X_ij’s, T_jk’s are sub-Gaussian. Let $s_{n}^{2} = {max}_{1 \leq j \leq p, 1 \leq k \leq m_{n}} s_{njk}^{2}$ . By (A2) and the maximal inequality for sub-Gaussian random variables [van der Vaart and Wellner (1996), Lemmas 2.2.1 and 2.2.2],

E (max_{1 \leq j \leq p, 1 \leq k \leq m_{n}} | T_{jk} | \mid \{X_{ij}, 1 \leq i \leq n, 1 \leq j \leq p\}) \leq C_{1} n^{- 1 / 2} m_{n}^{1 / 2} s_{n} \sqrt{log ({pm}_{n})} .

Therefore,

E (max_{1 \leq j \leq p, 1 \leq k \leq m_{n}} | T_{jk} |) \leq C_{1} n^{- 1 / 2} m_{n}^{1 / 2} \sqrt{log ({pm}_{n})} E (s_{n}),

(9)

where C₁ > 0 is a constant. By (A4) and the properties of B-splines,

| ψ_{k} (X_{ij}) | \leq | ϕ_{k} (X_{ij}) | + | {\bar{ϕ}}_{jk} | \leq 2 and E {(ψ_{k} (X_{ij}))}^{2} \leq C_{2} m_{n}^{- 1}

(10)

for a constant C₂ > 0, for every 1 ≤ j ≤ p and 1 ≤ k ≤ m_n. By (10),

\sum_{i = 1}^{n} E {[ψ_{k}^{2} (X_{ij}) - E ψ_{k}^{2} (X_{ij})]}^{2} \leq 4 C_{2} {nm}_{n}^{- 1}

(11)

and

max_{1 \leq j \leq p, 1 \leq k \leq m_{n}} \sum_{i = 1}^{n} E ψ_{k}^{2} (X_{ij}) \leq C_{2} {nm}_{n}^{- 1} .

(12)

By Lemma A.1 of van de Geer (2008), (10) and (11) imply

E (max_{1 \leq j \leq p, 1 \leq k \leq m_{n}} | \sum_{i = 1}^{n} {ψ_{k}^{2} (X_{ij}) - E ψ_{k}^{2} (X_{ij})} |) \leq \sqrt{2 C_{2} m_{n}^{- 1} n log ({pm}_{n})} + 4 log (2 {pm}_{n}) .

Therefore, by (12) and the triangle inequality,

E s_{n}^{2} \leq \sqrt{2 C_{2} m_{n}^{- 1} n log ({pm}_{n})} + 4 log (2 {pm}_{n}) + C_{2} {nm}_{n}^{- 1} .

Now since ${E s}_{n} \leq {({E s}_{n}^{2})}^{1 / 2}$ , we have

E s_{n} \leq {(\sqrt{2 C_{2} m_{n}^{- 1} n log ({pm}_{n})} + 4 log (2 {pm}_{n}) + C_{2} {nm}_{n}^{- 1})}^{1 / 2} .

(13)

The lemma follows from (9) and (13).
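The "in particular" statement of Lemma 2 follows from the general bound by comparing the three terms inside the outer square root. The short calculation below is added for the reader; it uses only the assumption m_n log(pm_n)/n → 0 and, for the second line, that pm_n ≥ 2 so that log(2pm_n) is of the same order as log(pm_n).

```latex
\begin{align*}
\sqrt{2 C_2 m_n^{-1} n \log(p m_n)}
  &= n m_n^{-1} \sqrt{2 C_2 \, m_n \log(p m_n) / n} = o(n m_n^{-1}), \\
4 \log(2 p m_n)
  &= n m_n^{-1} \cdot \frac{4 m_n \log(2 p m_n)}{n} = o(n m_n^{-1}), \\
E(T_n)
  &\le C_1 n^{-1/2} m_n^{1/2} \sqrt{\log(p m_n)} \,
     \bigl\{ O(n m_n^{-1}) \bigr\}^{1/2}
   = O(1) \sqrt{\log(p m_n)} .
\end{align*}
```

The term C₂ n m_n⁻¹ therefore dominates, and the factors n^{-1/2} m_n^{1/2} and (n m_n⁻¹)^{1/2} cancel.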

Denote

β_{A} = (β_{j}^{'}, j \in A)' and Z_{A} = (Z_{j}, j \in A) .

Here, β_A is an |A|m_n × 1 vector and Z_A is an n × |A|m_n matrix. Let $C_{A} = Z_{A}^{'} Z_{A} / n$ . When A = {1, …, p}, we simply write C = Z′Z/n. Let ρ_min(C_A) and ρ_max(C_A) be the minimum and maximum eigenvalues of C_A, respectively.

LEMMA 3. Let m_n = O(n^γ) where 0 < γ < 0.5. Suppose that |A| is bounded by a fixed constant independent of n and p. Let $h \equiv h_{n} ≍ m_{n}^{- 1}$ . Then under (A3) and (A4), with probability converging to one,

c_{1} h_{n} \leq ρ_{min} (C_{A}) \leq ρ_{max} (C_{A}) \leq c_{2} h_{n},

where c₁ and c₂ are two positive constants.

PROOF. Without loss of generality, suppose A = {1, …, q}. Then Z_A = (Z₁, …, Z_q). Let $b = (b_{1}^{'}, \dots, b_{q}^{'})'$, where b_j ∈ ℝ^{m_n}. By Lemma 3 of Stone (1985),

{‖ Z_{1} b_{1} + \dots + Z_{q} b_{q} ‖}_{2} \geq c_{3} ({‖ Z_{1} b_{1} ‖}_{2} + \dots + {‖ Z_{q} b_{q} ‖}_{2})

for a certain constant c₃ > 0. By the triangle inequality,

{‖ Z_{1} b_{1} + \dots + Z_{q} b_{q} ‖}_{2} \leq {‖ Z_{1} b_{1} ‖}_{2} + \dots + {‖ Z_{q} b_{q} ‖}_{2} .

Since Z_Ab = Z₁b₁ + ⋯ + Z_qb_q, the above two inequalities imply that

c_{3} ({‖ Z_{1} b_{1} ‖}_{2} + \dots + {‖ Z_{q} b_{q} ‖}_{2}) \leq {‖ Z_{A} b ‖}_{2} \leq {‖ Z_{1} b_{1} ‖}_{2} + \dots + {‖ Z_{q} b_{q} ‖}_{2} .

Therefore,

c_{3}^{2} ({‖ Z_{1} b_{1} ‖}_{2}^{2} + \dots + {‖ Z_{q} b_{q} ‖}_{2}^{2}) \leq {‖ Z_{A} b ‖}_{2}^{2} \leq 2 ({‖ Z_{1} b_{1} ‖}_{2}^{2} + \dots + {‖ Z_{q} b_{q} ‖}_{2}^{2}) .

(14)

Let $C_{j} = n^{- 1} Z_{j}^{'} Z_{j}$ . By Lemma 6.2 of Zhou, Shen and Wolf (1998),

c_{4} h \leq ρ_{min} (C_{j}) \leq ρ_{max} (C_{j}) \leq c_{5} h, j \in A .

(15)

Since $C_{A} = n^{- 1} Z_{A}^{'} Z_{A}$ , it follows from (14) that

c_{3}^{2} (b_{1}^{'} C_{1} b_{1} + \dots + b_{q}^{'} C_{q} b_{q}) \leq b' C_{A} b \leq 2 (b_{1}^{'} C_{1} b_{1} + \dots + b_{q}^{'} C_{q} b_{q}) .

Therefore, by (15),

\begin{matrix} \frac{b_{1}^{'} C_{1} b_{1}}{{‖ b ‖}_{2}^{2}} + \dots + \frac{b_{q}^{'} C_{q} b_{q}}{{‖ b ‖}_{2}^{2}} & = \frac{b_{1}^{'} C_{1} b_{1}}{{‖ b_{1} ‖}_{2}^{2}} \frac{{‖ b_{1} ‖}_{2}^{2}}{{‖ b ‖}_{2}^{2}} + \dots + \frac{b_{q}^{'} C_{q} b_{q}}{{‖ b_{q} ‖}_{2}^{2}} \frac{{‖ b_{q} ‖}_{2}^{2}}{{‖ b ‖}_{2}^{2}} \\ & \geq ρ_{min} (C_{1}) \frac{{‖ b_{1} ‖}_{2}^{2}}{{‖ b ‖}_{2}^{2}} + \dots + ρ_{min} (C_{q}) \frac{{‖ b_{q} ‖}_{2}^{2}}{{‖ b ‖}_{2}^{2}} \\ & \geq c_{4} h . \end{matrix}

Similarly,

\frac{b_{1}^{'} C_{1} b_{1}}{{‖ b ‖}_{2}^{2}} + \dots + \frac{b_{q}^{'} C_{q} b_{q}}{{‖ b ‖}_{2}^{2}} \leq c_{5} h .

Thus, we have

c_{3}^{2} c_{4} h \leq \frac{b' C_{A} b}{b' b} \leq 2 c_{5} h .

The lemma follows.

PROOF OF THEOREM 1. The proof of parts (i) and (ii) essentially follows the proof of Theorem 2.1 of Wei and Huang (2008). The only change that must be made here is that we need to consider the approximation error of the regression functions by splines. Specifically, let ξ_n = ε_n + δ_n, where δ_n = (δ_n1, …, δ_nn)′ with $δ_{ni} = \sum_{j = 1}^{q_{n}} (f_{0 j} (X_{ij}) - f_{nj} (X_{ij}))$. Since ${‖ f_{0 j} - f_{nj} ‖}_{2} = O (m_{n}^{- d}) = O (n^{- d / (2 d + 1)})$ for m_n = n^{1/(2d+1)}, we have

{‖ δ_{n} ‖}_{2} \leq C_{1} \sqrt{n q^{2} m_{n}^{- 2 d}} = C_{1} q n^{1 / (4 d + 2)}

for some constant C₁ > 0. For any integer t, let

χ_{t} = max_{| A | = t} max_{{‖ U_{A_{k}} ‖}_{2} = 1, 1 \leq k \leq t} \frac{| ξ_{n}^{'} V_{A} (s) |}{{‖ V_{A} (s) ‖}_{2}} and χ_{t}^{*} = max_{| A | = t} max_{{‖ U_{A_{k}} ‖}_{2} = 1, 1 \leq k \leq t} \frac{| ε_{n}^{'} V_{A} (s) |}{{‖ V_{A} (s) ‖}_{2}},

where $V_{A} (S_{A}) = ξ_{n}^{'} (Z_{A} {(Z_{A}^{'} Z_{A})}^{- 1} {\bar{S}}_{A} - (I - P_{A}) X β$ for N(A) = q₁ = m ≥ 0, $S_{A} = (S_{A_{1}}^{'}, \dots, S_{A_{m}}^{'})', S_{A_{k}} = λ \sqrt{d_{A_{k}}} U_{A_{k}}$ and ‖U_{A_k}‖₂ = 1.

For a sufficiently large constant C₂ > 0, define

Ω_{t_{0}} = \{(Z, ε_{n}) : χ_{t} \leq σ C_{2} \sqrt{(t \lor 1) m_{n} log ({pm}_{n})}, \forall t \geq t_{0}\}

and

Ω_{t_{0}}^{*} = \{(Z, ε_{n}) : χ_{t}^{*} \leq σ C_{2} \sqrt{(t \lor 1) m_{n} log ({pm}_{n})}, \forall t \geq t_{0}\},

where t₀ ≥ 0.

As in the proof of Theorem 2.1 of Wei and Huang (2008),

(Z, ε_{n}) \in Ω_{q} \Rightarrow | {\tilde{A}}_{1} | \leq M_{1} q

for a constant M₁ > 1. By the triangle and Cauchy–Schwarz inequalities,

\frac{| ξ_{n}^{'} V_{A} (s) |}{{‖ V_{A} (s) ‖}_{2}} = \frac{| ε_{n}^{'} V_{A} (s) + δ_{n}^{'} V_{A} (s) |}{{‖ V_{A} (s) ‖}_{2}} \leq \frac{| ε_{n}^{'} V_{A} (s) |}{{‖ V_{A} (s) ‖}_{2}} + {‖ δ_{n} ‖}_{2} .

(16)

In the proof of Theorem 2.1 of Wei and Huang (2008), it is shown that

P (Ω_{0}^{*}) \geq 2 - \frac{2}{p^{1 + c_{0}}} - exp (\frac{2 p}{p^{1 + c_{0}}}) \to 1 .

(17)

Since

\frac{| δ_{n}^{'} V_{A} (s) |}{{‖ V_{A} (s) ‖}_{2}} \leq {‖ δ_{n} ‖}_{2} \leq C_{1} {qn}^{1 / (2 (2 d + 1))}

and m_n = O(n^{1/(2d+1)}), we have for all t ≥ 0 and n sufficiently large,

{‖ δ_{n} ‖}_{2} \leq C_{1} {qn}^{1 / (2 (2 d + 1))} \leq σ C_{2} \sqrt{(t \lor 1) m_{n} log (p)} .

(18)

It follows from (16), (17) and (18) that P(Ω₀) → 1. This completes the proof of part (i) of Theorem 1.

Before proving part (ii), we first prove part (iii) of Theorem 1. By the definition of ${\tilde{β}}_{n} \equiv ({\tilde{β}}_{n 1}^{'}, \dots, {\tilde{β}}_{np}^{'})'$ ,

{‖ Y - Z {\tilde{β}}_{n} ‖}_{2}^{2} + λ_{n 1} \sum_{j = 1}^{p} {‖ {\tilde{β}}_{nj} ‖}_{2} \leq {‖ Y - Z β_{n} ‖}_{2}^{2} + λ_{n 1} \sum_{j = 1}^{p} {‖ β_{nj} ‖}_{2} .

(19)

Let A₂ = {j : ‖β_nj‖₂ ≠ 0 or ‖β̃_nj‖₂ ≠ 0} and d_n2 = |A₂|. By part (i), d_n2 = O_p(q). By (19) and the definition of A₂,

{‖ Y - Z_{A_{2}} {\tilde{β}}_{n A_{2}} ‖}_{2}^{2} + λ_{n 1} \sum_{j \in A_{2}} {‖ {\tilde{β}}_{nj} ‖}_{2} \leq {‖ Y - Z_{A_{2}} β_{n A_{2}} ‖}_{2}^{2} + λ_{n 1} \sum_{j \in A_{2}} {‖ β_{nj} ‖}_{2} .

(20)

Let η_n = Y − Zβ_n. Write

Y - Z_{A_{2}} {\tilde{β}}_{n A_{2}} = Y - Z β_{n} - Z_{A_{2}} ({\tilde{β}}_{n A_{2}} - β_{n A_{2}}) = η_{n} - Z_{A_{2}} ({\tilde{β}}_{n A_{2}} - β_{n A_{2}}) .

We have

{‖ Y - Z_{A_{2}} {\tilde{β}}_{n A_{2}} ‖}_{2}^{2} = {‖ Z_{A_{2}} ({\tilde{β}}_{n A_{2}} - β_{n A_{2}}) ‖}_{2}^{2} - 2 η_{n}^{'} Z_{A_{2}} ({\tilde{β}}_{n A_{2}} - β_{n A_{2}}) + η_{n}^{'} η_{n} .

We can rewrite (20) as

{‖ Z_{A_{2}} ({\tilde{β}}_{n A_{2}} - β_{n A_{2}}) ‖}_{2}^{2} - 2 η_{n}^{'} Z_{A_{2}} ({\tilde{β}}_{n A_{2}} - β_{n A_{2}}) \leq λ_{n 1} \sum_{j \in A_{1}} {‖ β_{nj} ‖}_{2} - λ_{n 1} \sum_{j \in A_{1}} {‖ {\tilde{β}}_{nj} ‖}_{2} .

(21)

Now

| \sum_{j \in A_{1}} {‖ β_{nj} ‖}_{2} - \sum_{j \in A_{1}} {‖ {\tilde{β}}_{nj} ‖}_{2} | \leq \sqrt{| A_{1} |} \cdot {‖ {\tilde{β}}_{n A_{1}} - β_{n A_{1}} ‖}_{2} \leq \sqrt{| A_{1} |} \cdot {‖ {\tilde{β}}_{n A_{2}} - β_{n A_{2}} ‖}_{2} .

(22)

Let ν_n = Z_A₂(β̃_nA₂ − β_nA₂). Combining (20), (21) and (22), we get

{‖ ν_{n} ‖}_{2}^{2} - 2 η_{n}^{'} ν_{n} \leq λ_{n 1} \sqrt{| A_{1} |} \cdot {‖ {\tilde{β}}_{n A_{2}} - β_{n A_{2}} ‖}_{2} .

(23)

Let $η_{n}^{*}$ be the projection of η_n onto the span of Z_A₂, that is, $η_{n}^{*} = Z_{A_{2}} {(Z_{A_{2}}^{'} Z_{A_{2}})}^{- 1} Z_{A_{2}}^{'} η_{n}$. By the Cauchy–Schwarz inequality,

2 | η_{n}^{'} ν_{n} | \leq 2 {‖ η_{n}^{*} ‖}_{2} \cdot {‖ ν_{n} ‖}_{2} \leq 2 {‖ η_{n}^{*} ‖}_{2}^{2} + \frac{1}{2} {‖ ν_{n} ‖}_{2}^{2} .

(24)

From (23) and (24), we have

{‖ ν_{n} ‖}_{2}^{2} \leq 4 {‖ η_{n}^{*} ‖}_{2}^{2} + 2 λ_{n 1} \sqrt{| A_{1} |} \cdot {‖ {\tilde{β}}_{n A_{2}} - β_{n A_{2}} ‖}_{2} .

Let c_n* be the smallest eigenvalue of $Z_{A_{2}}^{'} Z_{A_{2}} / n$ . By Lemma 3 and part (i), $c_{n *} ≍_{p} m_{n}^{- 1}$ . Since ${‖ ν_{n} ‖}_{2}^{2} \geq {nc}_{n *} {‖ {\tilde{β}}_{{nA}_{2}} - β_{{nA}_{2}} ‖}_{2}^{2}$ and 2ab ≤ a² + b²,

{nc}_{n *} {‖ {\tilde{β}}_{n A_{2}} - β_{n A_{2}} ‖}_{2}^{2} \leq 4 {‖ η_{n}^{*} ‖}_{2}^{2} + \frac{{(2 λ_{n 1} \sqrt{| A_{1} |})}^{2}}{2 {nc}_{n *}} + \frac{1}{2} {nc}_{n *} {‖ {\tilde{β}}_{n A_{2}} - β_{n A_{2}} ‖}_{2}^{2} .

It follows that

{‖ {\tilde{β}}_{n A_{2}} - β_{n A_{2}} ‖}_{2}^{2} \leq \frac{8 {‖ η_{n}^{*} ‖}_{2}^{2}}{{nc}_{n *}} + \frac{4 λ_{n 1}^{2} | A_{1} |}{n^{2} c_{n *}^{2}} .

(25)

Let $f_{0} (X_{i}) = \sum_{j = 1}^{p} f_{0 j} (X_{ij})$ and f_0A(X_i) = ∑_j∈A f_0j(X_ij). Write

η_{i} = Y_{i} - μ - f_{0} (X_{i}) + (μ - \bar{Y}) + f_{0} (X_{i}) - \sum_{j \in A_{2}} Z_{ij}^{'} β_{nj} = ε_{i} + (μ - \bar{Y}) + f_{0 A_{2}} (X_{i}) - f_{n A_{2}} (X_{i}) .

Since |µ − Y̅|² = O_p(n⁻¹) and ${‖ f_{0 j} - f_{nj} ‖}_{\infty} = O (m_{n}^{- d})$ , we have

{‖ η_{n}^{*} ‖}_{2}^{2} \leq 2 {‖ ε_{n}^{*} ‖}_{2}^{2} + O_{p} (1) + O ({nd}_{n 2} m_{n}^{- 2 d}),

(26)

where $ε_{n}^{*}$ is the projection of ε_n = (ε₁, …, ε_n)′ to the span of Z_A₂. We have

{‖ ε_{n}^{*} ‖}_{2}^{2} = {‖ {(Z_{A_{2}}^{'} Z_{A_{2}})}^{- 1 / 2} Z_{A_{2}}^{'} ε_{n} ‖}_{2}^{2} \leq \frac{1}{{nc}_{n *}} {‖ Z_{A_{2}}^{'} ε_{n} ‖}_{2}^{2} .

Now

max_{A : | A | \leq d_{n 2}} {‖ Z_{A}^{'} ε_{n} ‖}_{2}^{2} = max_{A : | A | \leq d_{n 2}} \sum_{j \in A} {‖ Z_{j}^{'} ε_{n} ‖}_{2}^{2} \leq d_{n 2} m_{n} max_{1 \leq j \leq p, 1 \leq k \leq m_{n}} {| 𝒵_{jk}^{'} ε_{n} |}^{2},

where 𝒵_jk = (ψ_k(X_1j), …, ψ_k(X_nj))′. By Lemma 2,

max_{1 \leq j \leq p, 1 \leq k \leq m_{n}} {| 𝒵_{jk}^{'} ε_{n} |}^{2} = {nm}_{n}^{- 1} max_{1 \leq j \leq p, 1 \leq k \leq m_{n}} {| {(m_{n} / n)}^{1 / 2} 𝒵_{jk}^{'} ε_{n} |}^{2} = O_{p} (1) {nm}_{n}^{- 1} log ({pm}_{n}) .

It follows that,

{‖ ε_{n}^{*} ‖}_{2}^{2} = O_{p} (1) \frac{d_{n 2} log ({pm}_{n})}{c_{n *}} .

(27)

Combining (25), (26) and (27), we get

{‖ {\tilde{β}}_{n A_{2}} - β_{n A_{2}} ‖}_{2}^{2} \leq O_{p} (\frac{d_{n 2} log (p m_{n})}{n c_{n *}^{2}}) + O_{p} (\frac{1}{n c_{n *}}) + O (\frac{d_{n 2} m_{n}^{- 2 d}}{c_{n *}}) + \frac{4 λ_{n 1}^{2} | A_{1} |}{n^{2} c_{n *}^{2}} .

Since d_n2 = O_p(q), $c_{n *} ≍_{p} m_{n}^{- 1}$ and $c_{n}^{*} ≍_{p} m_{n}^{- 1}$ , we have

{‖ {\tilde{β}}_{n A_{2}} - β_{n A_{2}} ‖}_{2}^{2} \leq O_{p} (\frac{m_{n}^{2} log (p m_{n})}{n}) + O_{p} (\frac{m_{n}}{n}) + O (\frac{1}{m_{n}^{2 d - 1}}) + O (\frac{4 m_{n}^{2} λ_{n 1}^{2}}{n^{2}}) .

This completes the proof of part (iii).
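To get a feel for how the four terms in the part (iii) bound behave, the following sketch evaluates their orders with every constant set to 1. The function names, the choice m_n = n^{1/(2d+1)} and the choice λ_n1 = (n log(p m_n))^{1/2} are assumptions made only for this illustration; they are not taken from the proof above.

```python
import math


def rate_terms(n, p, m, lam, d):
    """Orders of the four terms in the part (iii) bound (all constants set to 1)."""
    return (
        m ** 2 * math.log(p * m) / n,       # noise term
        m / n,                              # term from |mu - Ybar|^2
        1.0 / m ** (2 * d - 1),             # spline approximation term
        4.0 * m ** 2 * lam ** 2 / n ** 2,   # penalty term
    )


def example(n, p=1000, d=2):
    """Evaluate the terms under the illustrative choices of m_n and lambda_n1."""
    m = n ** (1.0 / (2 * d + 1))              # assumed growth of m_n
    lam = math.sqrt(n * math.log(p * m))      # assumed size of lambda_n1
    return rate_terms(n, p, m, lam, d)


if __name__ == "__main__":
    for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
        print(n, example(n))
```

Under these assumed choices every term shrinks as n grows with p held fixed, which is the behaviour the argument for part (ii) relies on.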

We now prove part (ii). Since ${‖ f_{j} ‖}_{2} \geq c_{f} > 0, 1 \leq j \leq q, {‖ f_{j} - f_{nj} ‖}_{2} = O (m_{n}^{- d})$ and ‖f_nj‖₂ ≥ ‖f_j‖₂ − ‖f_j − f_nj‖₂, we have ‖f_nj‖₂ ≥ 0.5c_f for n sufficiently large. By a result of de Boor (2001), see also (12) of Stone (1986), there are positive constants c₆ and c₇ such that

c_{6} m_{n}^{- 1} {‖ β_{nj} ‖}_{2}^{2} \leq {‖ f_{nj} ‖}_{2}^{2} \leq c_{7} m_{n}^{- 1} {‖ β_{nj} ‖}_{2}^{2} .

It follows that ${‖ β_{nj} ‖}_{2}^{2} \geq c_{7}^{- 1} m_{n} {‖ f_{nj} ‖}_{2}^{2} \geq 0.25 c_{7}^{- 1} c_{f}^{2} m_{n}$ . Therefore, if ‖β_nj‖₂ ≠ 0 but ‖β̃_nj‖₂ = 0, then

{‖ {\tilde{β}}_{nj} - β_{nj} ‖}_{2}^{2} \geq 0.25 c_{7}^{- 1} c_{f}^{2} m_{n} .

(28)

However, since (m_n log(pm_n))/n → 0 and $(λ_{n 1}^{2} m_{n}) / n^{2} \to 0$ , (28) contradicts part (iii).

PROOF OF THEOREM 2. By the definition of f̃_j, 1 ≤ j ≤ p, parts (i) and (ii) follow from parts (i) and (ii) of Theorem 1 directly.

Now consider part (iii). By the properties of splines [de Boor (2001)],

c_{6} m_{n}^{- 1} {‖ {\tilde{β}}_{nj} - β_{nj} ‖}_{2}^{2} \leq {‖ {\tilde{f}}_{nj} - f_{nj} ‖}_{2}^{2} \leq c_{7} m_{n}^{- 1} {‖ {\tilde{β}}_{nj} - β_{nj} ‖}_{2}^{2} .

Thus,

{‖ {\tilde{f}}_{nj} - f_{nj} ‖}_{2}^{2} = O_{p} (\frac{m_{n} log ({pm}_{n})}{n}) + O_{p} (\frac{1}{n}) + O (\frac{1}{m_{n}^{2 d}}) + O (\frac{4 m_{n} λ_{n 1}^{2}}{n^{2}}) .

(29)

By (A3),

{‖ f_{j} - f_{nj} ‖}_{2}^{2} = O (m_{n}^{- 2 d}) .

(30)

Part (iii) follows from (29) and (30).

In the proofs below, for any matrix H, denote its 2-norm by ‖H‖, which is equal to its largest singular value (its largest eigenvalue when H is symmetric and positive semidefinite). This norm satisfies the inequality ‖Hx‖ ≤ ‖H‖‖x‖ for a column vector x whose dimension is the same as the number of columns of H.

Denote $β_{n A_{1}} = (β_{nj}^{'}, j \in A_{1})', {\hat{β}}_{n A_{1}} = ({\hat{β}}_{nj}^{'}, j \in A_{1})'$ and Z_A₁ = (Z_j, j ∈ A₁). Define $C_{A_{1}} = n^{- 1} Z_{A_{1}}^{'} Z_{A_{1}}$ . Let ρ_n1 and ρ_n2 be the smallest and largest eigenvalues of C_A₁, respectively.

PROOF OF THEOREM 3. By the Karush–Kuhn–Tucker (KKT) conditions, a necessary and sufficient condition for β̂_n is

\begin{cases} 2 Z_{j}^{'} (Y - Z {\hat{β}}_{n}) = λ_{n 2} w_{nj} \frac{{\hat{β}}_{nj}}{{‖ {\hat{β}}_{nj} ‖}_{2}}, & {‖ {\hat{β}}_{nj} ‖}_{2} \neq 0, j \geq 1, \\ 2 {‖ Z_{j}^{'} (Y - Z {\hat{β}}_{n}) ‖}_{2} \leq λ_{n 2} w_{nj}, & {‖ {\hat{β}}_{nj} ‖}_{2} = 0, j \geq 1 . \end{cases}

(31)
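The two cases of (31) can be checked by hand in a simplified setting. The sketch below assumes an orthonormal design, Z′Z = I, which is not assumed anywhere in the paper and is used here only because the minimizer then has a closed form (group soft-thresholding of z_j = Z_j′Y) and Z_j′(Y − Zβ̂_n) reduces to z_j − β̂_nj. The function names are hypothetical.

```python
import math


def norm2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def group_soft_threshold(z, lam, w):
    """Minimizer of ||z - b||^2 + lam * w * ||b|| over b (one group, Z'Z = I assumed)."""
    nz = norm2(z)
    if nz <= lam * w / 2.0:
        return [0.0] * len(z)
    scale = 1.0 - lam * w / (2.0 * nz)
    return [scale * x for x in z]


def kkt_holds(z, b, lam, w, tol=1e-9):
    """Check the two cases of (31) for one group, using Z_j'(Y - Z b) = z - b."""
    grad = [2.0 * (zi - bi) for zi, bi in zip(z, b)]
    nb = norm2(b)
    if nb > 0.0:
        # Nonzero group: 2 Z_j'(Y - Z b) must equal lam * w * b / ||b||.
        target = [lam * w * bi / nb for bi in b]
        return all(abs(g - t) <= tol for g, t in zip(grad, target))
    # Zero group: 2 ||Z_j'(Y - Z b)|| must not exceed lam * w.
    return norm2(grad) <= lam * w + tol


if __name__ == "__main__":
    lam = 2.0
    groups = {"large": ([3.0, 4.0], 1.0), "small": ([0.3, 0.4], 1.0)}
    for name, (z, w) in groups.items():
        b = group_soft_threshold(z, lam, w)
        print(name, b, kkt_holds(z, b, lam, w))
```

In this toy setting a larger weight w_nj raises the threshold λ_n2 w_nj / 2 that ‖z_j‖₂ must exceed for the group to stay in the model, which is the mechanism the adaptive weights exploit.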

Let ν_n = (w_njβ̂_nj/(2‖β̂_nj‖), j ∈ A₁)′. Define

{\hat{β}}_{n A_{1}} = {(Z_{A_{1}}^{'} Z_{A_{1}})}^{- 1} (Z_{A_{1}}^{'} Y - λ_{n 2} ν_{n}) .

(32)

If β̂_nA₁ =₀ β_nA₁, then the equation in (31) holds for ${\hat{β}}_{n} \equiv ({\hat{β}}_{n A_{1}}^{'}, 0')'$ . Thus, since Zβ̂_n = Z_A₁β̂_nA₁ for this β̂_n and {Z_j, j ∈ A₁} are linearly independent,

{\hat{β}}_{n} =_{0} β_{n} \quad \text{if} \quad \begin{cases} {\hat{β}}_{n A_{1}} =_{0} β_{n A_{1}}, \\ {‖ Z_{j}^{'} (Y - Z_{A_{1}} {\hat{β}}_{n A_{1}}) ‖}_{2} \leq λ_{n 2} w_{nj} / 2, & \forall j \notin A_{1} . \end{cases}

This is true if

{\hat{β}}_{n} =_{0} β_{n} \quad \text{if} \quad \begin{cases} {‖ β_{nj} ‖}_{2} - {‖ {\hat{β}}_{nj} ‖}_{2} < {‖ β_{nj} ‖}_{2}, & \forall j \in A_{1}, \\ {‖ Z_{j}^{'} (Y - Z_{A_{1}} {\hat{β}}_{n A_{1}}) ‖}_{2} \leq λ_{n 2} w_{nj} / 2, & \forall j \notin A_{1} . \end{cases}

Therefore,

P ({\hat{β}}_{n} \neq_{0} β_{n}) \leq P ({‖ {\hat{β}}_{nj} - β_{nj} ‖}_{2} \geq {‖ β_{nj} ‖}_{2}, \exists j \in A_{1}) + P ({‖ Z_{j}^{'} (Y - Z_{A_{1}} {\hat{β}}_{n A_{1}}) ‖}_{2} > λ_{n 2} w_{nj} / 2, \exists j \notin A_{1}) .

Let f_0j (X_j) = (f_0j(X_1j), …, f_0j(X_nj))′ and δ_n = ∑_j∈A₁ f_0j (X_j) − Z_A₁β_nA₁. By Lemma 1, we have

n^{- 1} {‖ δ_{n} ‖}^{2} = O_{p} (q m_{n}^{- 2 d}) .

(33)

Let $H_{n} = I_{n} - Z_{A_{1}} {(Z_{A_{1}}^{'} Z_{A_{1}})}^{- 1} Z_{A_{1}}^{'}$ . By (32),

{\hat{β}}_{n A_{1}} - β_{n A_{1}} = n^{- 1} C_{A_{1}}^{- 1} (Z_{A_{1}}^{'} (ε_{n} + δ_{n}) - λ_{n 2} ν_{n})

(34)

and

Y - Z_{A_{1}} {\hat{β}}_{n A_{1}} = H_{n} ε_{n} + H_{n} δ_{n} + λ_{n 2} Z_{A_{1}} C_{A_{1}}^{- 1} ν_{n} / n .

(35)

Based on these two equations, Lemma 5 below shows that

P ({‖ {\hat{β}}_{nj} - β_{nj} ‖}_{2} \geq {‖ β_{nj} ‖}_{2}, \exists j \in A_{1}) \to 0,

and Lemma 6 below shows that

P ({‖ Z_{j}^{'} (Y - Z_{A_{1}} {\hat{β}}_{n A_{1}}) ‖}_{2} > λ_{n 2} w_{nj} / 2, \exists j \notin A_{1}) \to 0 .

These two equations lead to part (i) of the theorem.

We now prove part (ii) of Theorem 3. As in (26), for η_n = Y − Zβ_n and

η_{n 1}^{*} = Z_{A_{1}} {(Z_{A_{1}}^{'} Z_{A_{1}})}^{- 1} Z_{A_{1}}^{'} η_{n},

we have

{‖ η_{n 1}^{*} ‖}_{2}^{2} \leq 2 {‖ ε_{n 1}^{*} ‖}_{2}^{2} + O_{p} (1) + O (q n m_{n}^{- 2 d}),

(36)

where $ε_{n 1}^{*}$ is the projection of ε_n = (ε₁, …, ε_n)′ to the span of Z_A₁. We have

{‖ ε_{n 1}^{*} ‖}_{2}^{2} = {‖ {(Z_{A_{1}}^{'} Z_{A_{1}})}^{- 1 / 2} Z_{A_{1}}^{'} ε_{n} ‖}_{2}^{2} \leq \frac{1}{n ρ_{n 1}} {‖ Z_{A_{1}}^{'} ε_{n} ‖}_{2}^{2} = O_{p} (1) \frac{| A_{1} |}{ρ_{n 1}} .

(37)

Now similarly to the proof of (25), we can show that

{‖ {\hat{β}}_{n A_{1}} - β_{n A_{1}} ‖}_{2}^{2} \leq \frac{8 {‖ η_{n 1}^{*} ‖}_{2}^{2}}{n ρ_{n 1}} + \frac{4 λ_{n 2}^{2} | A_{1} |}{n^{2} ρ_{n 1}^{2}} .

(38)

Combining (36), (37) and (38), we get

{‖ {\hat{β}}_{n A_{1}} - β_{n A_{1}} ‖}_{2}^{2} = O_{p} (\frac{8}{n ρ_{n 1}^{2}}) + O_{p} (\frac{1}{n ρ_{n 1}}) + O (\frac{1}{m_{n}^{2 d - 1}}) + O (\frac{4 λ_{n 2}^{2}}{n^{2} ρ_{n 1}^{2}}) .

Since $ρ_{n 1} ≍_{p} m_{n}^{- 1}$ , the result follows.

The following lemmas are needed in the proof of Theorem 3.

LEMMA 4. For ν_n = (w_njβ̃_nj/(2‖β̃_nj‖), j ∈ A₁)′, under condition (B1),

{‖ ν_{n} ‖}^{2} = O_{p} (h_{n}^{2}) = O_{p} ({(b_{n 1}^{2} c_{b})}^{- 2} r_{n}^{- 1} + q b_{n 1}^{- 2}) .

PROOF. Write

{‖ ν_{n} ‖}^{2} = \sum_{j \in A_{1}} w_{nj}^{2} = \sum_{j \in A_{1}} {‖ {\tilde{β}}_{nj} ‖}^{- 2} = \sum_{j \in A_{1}} \frac{{‖ β_{nj} ‖}^{2} - {‖ {\tilde{β}}_{nj} ‖}^{2}}{{‖ β_{nj} ‖}^{2} \cdot {‖ {\tilde{β}}_{nj} ‖}^{2}} + \sum_{j \in A_{1}} {‖ β_{nj} ‖}^{- 2} .

Under (B1),

\sum_{j \in A_{1}} \frac{| {‖ β_{nj} ‖}^{2} - {‖ {\tilde{β}}_{nj} ‖}^{2} |}{{‖ β_{nj} ‖}^{2} \cdot {‖ {\tilde{β}}_{nj} ‖}^{2}} \leq M c_{b}^{- 2} b_{n 1}^{- 4} ‖ {\tilde{β}}_{n} - β_{n} ‖

and $\sum_{j \in A_{1}} {‖ β_{nj} ‖}^{- 2} \leq {qb}_{n 1}^{- 2}$ . The claim follows.

Let ρ_n3 be the maximum of the largest eigenvalues of $n^{- 1} Z_{j}^{'} Z_{j}, j \in A_{0}$ , that is, $ρ_{n 3} = {max}_{j \in A_{0}} {‖ n^{- 1} Z_{j}^{'} Z_{j} ‖}_{2}$ . By Lemma 3,

b_{n 1} ≍ O (m_{n}^{1 / 2}), ρ_{n 1} ≍_{p} m_{n}^{- 1}, ρ_{n 2} ≍_{p} m_{n}^{- 1} and ρ_{n 3} ≍_{p} m_{n}^{- 1} .

(39)

LEMMA 5. Under conditions (B1), (B2), (A3) and (A4),

P ({‖ {\hat{β}}_{nj} - β_{nj} ‖}_{2} \geq {‖ β_{nj} ‖}_{2}, \exists j \in A_{1}) \to 0 .

(40)

PROOF. Let T_nj be an m_n × qm_n matrix with the form

T_{nj} = (0_{m_{n}}, \dots, 0_{m_{n}}, I_{m_{n}}, 0_{m_{n}}, \dots, 0_{m_{n}}),

where 0_{m_n} is an m_n × m_n matrix of zeros, I_{m_n} is the m_n × m_n identity matrix, and I_{m_n} occupies the jth block. By (34), ${\hat{β}}_{nj} - β_{nj} = n^{- 1} T_{nj} C_{A_{1}}^{- 1} (Z_{A_{1}}^{'} ε_{n} + Z_{A_{1}}^{'} δ_{n} - λ_{n 2} ν_{n})$ . By the triangle inequality,

{‖ {\hat{β}}_{nj} - β_{nj} ‖}_{2} \leq n^{- 1} {‖ T_{nj} C_{A_{1}}^{- 1} Z_{A_{1}}^{'} ε_{n} ‖}_{2} + n^{- 1} {‖ T_{nj} C_{A_{1}}^{- 1} Z_{A_{1}}^{'} δ_{n} ‖}_{2} + n^{- 1} λ_{n 2} {‖ T_{nj} C_{A_{1}}^{- 1} ν_{n} ‖}_{2} .

(41)
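The role of T_nj in (41) is simply to pick out the jth block of a stacked coefficient vector. A minimal sketch, using plain Python lists and hypothetical helper names, builds the m_n × qm_n matrix and applies it:

```python
def selector_matrix(j, q, m):
    """m x (q*m) block matrix (0, ..., 0, I_m, 0, ..., 0) with I_m in block j (1-based)."""
    T = [[0.0] * (q * m) for _ in range(m)]
    for r in range(m):
        T[r][(j - 1) * m + r] = 1.0
    return T


def matvec(T, v):
    """Matrix-vector product for a list-of-rows matrix and a list vector."""
    return [sum(t * x for t, x in zip(row, v)) for row in T]


if __name__ == "__main__":
    q, m = 3, 2
    stacked = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # plays the role of (b_1', b_2', b_3')'
    print(matvec(selector_matrix(2, q, m), stacked))
```

Because T_nj only selects coordinates, ‖T_nj x‖₂ ≤ ‖x‖₂ for every x, which is why it can be dropped when bounding each term of (41).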

Let C be a generic constant independent of n. The first term on the right-hand side

\begin{matrix} max_{j \in A_{1}} n^{- 1} {‖ T_{nj} C_{A_{1}}^{- 1} Z_{A_{1}}^{'} ε_{n} ‖}_{2} & \leq n^{- 1} ρ_{n 1}^{- 1} {‖ Z_{A_{1}}^{'} ε_{n} ‖}_{2} \\ = n^{- 1 / 2} ρ_{n 1}^{- 1} {‖ n^{- 1 / 2} Z_{A_{1}}^{'} ε_{n} ‖}_{2} \\ = O_{p} (1) n^{- 1 / 2} ρ_{n 1}^{- 1} m_{n}^{- 1 / 2} {({qm}_{n})}^{1 / 2} . \end{matrix}

(42)

By (33), the second term

\begin{matrix} max_{j \in A_{1}} n^{- 1} {‖ T_{nj} C_{A_{1}}^{- 1} Z_{A_{1}}^{'} δ_{n} ‖}_{2} & \leq {‖ C_{A_{1}}^{- 1} ‖}_{2} \cdot {‖ n^{- 1} Z_{A_{1}}^{'} Z_{A_{1}} ‖}_{2}^{1 / 2} \cdot {‖ n^{- 1 / 2} δ_{n} ‖}_{2} \\ & = O_{p} (1) ρ_{n 1}^{- 1} ρ_{n 2}^{1 / 2} q^{1 / 2} m_{n}^{- d} . \end{matrix}

(43)

By Lemma 4, the third term

max_{j \in A_{1}} n^{- 1} λ_{n 2} {‖ T_{nj} C_{A_{1}}^{- 1} ν_{n} ‖}_{2} \leq n^{- 1} λ_{n 2} ρ_{n 1}^{- 1} {‖ ν_{n} ‖}_{2} = O_{p} (1) ρ_{n 1}^{- 1} n^{- 1} λ_{n 2} h_{n} .

(44)

Thus, (40) follows from (39), (42)–(44) and condition (B2a).

LEMMA 6. Under conditions (B1), (B2), (A3) and (A4),

P ({‖ Z_{j}^{'} (Y - Z_{A_{1}} {\hat{β}}_{n A_{1}}) ‖}_{2} > λ_{n 2} w_{nj} / 2, \exists j \notin A_{1}) \to 0.

(45)

PROOF. By (35), we have

Z_{j}^{'} (Y - Z_{A_{1}} {\hat{β}}_{n A_{1}}) = Z_{j}^{'} H_{n} ε_{n} + Z_{j}^{'} H_{n} δ_{n} + λ_{n 2} n^{- 1} Z_{j}^{'} Z_{A_{1}} C_{A_{1}}^{- 1} ν_{n} .

(46)

Recall s_n = p − q is the number of zero components in the model. By Lemma 2,

E (max_{j \notin A_{1}} {‖ n^{- 1 / 2} Z_{j}^{'} H_{n} ε_{n} ‖}_{2}) \leq O (1) {log (s_{n} m_{n})}^{1 / 2} .

(47)

Since w_nj = ‖β̃_nj‖⁻¹ = O_p (r_n) for j ∉ A₁ and by (47), for the first term on the right-hand side of (46), we have

\begin{matrix} P ({‖ Z_{j}^{'} H_{n} ε_{n} ‖}_{2} > λ_{n 2} w_{nj} / 6, \exists j \notin A_{1}) \\ \leq P ({‖ Z_{j}^{'} H_{n} ε_{n} ‖}_{2} > C λ_{n 2} r_{n}, \exists j \notin A_{1}) + o (1) \\ = P (max_{j \notin A_{1}} {‖ n^{- 1 / 2} Z_{j}^{'} H_{n} ε_{n} ‖}_{2} > C n^{- 1 / 2} λ_{n 2} r_{n}) + o (1) \\ \leq O (1) \frac{n^{1 / 2} {log (s_{n} m_{n})}^{1 / 2}}{C λ_{n 2} r_{n}} + o (1) . \end{matrix}

(48)
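The last inequality in (48) appears to be Markov's inequality, P(X > t) ≤ E(X)/t for a nonnegative X, applied to the maximum in (47) with t = C n^{-1/2} λ_n2 r_n. The sketch below, with a hypothetical function name, illustrates the inequality on the empirical distribution of a small sample, for which it holds exactly.

```python
def empirical_tail_and_markov_bound(samples, t):
    """Return (fraction of samples > t, sample mean / t) for nonnegative samples, t > 0."""
    if t <= 0 or not samples or any(x < 0 for x in samples):
        raise ValueError("need t > 0 and a nonempty list of nonnegative samples")
    tail = sum(1 for x in samples if x > t) / len(samples)
    bound = (sum(samples) / len(samples)) / t
    return tail, bound


if __name__ == "__main__":
    tail, bound = empirical_tail_and_markov_bound([0.5, 1.0, 2.0, 4.0, 0.5], t=2.0)
    print(tail, bound, tail <= bound)
```

The bound is crude but needs only a first moment, which is all that (47) supplies.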

By (33), the second term on the right-hand side of (46)

\begin{matrix} max_{j \notin A_{1}} {‖ Z_{j}^{'} H_{n} δ_{n} ‖}_{2} & \leq n^{1 / 2} max_{j \notin A_{1}} {‖ n^{- 1} Z_{j}^{'} Z_{j} ‖}_{2}^{1 / 2} \cdot {‖ H_{n} ‖}_{2} \cdot {‖ δ_{n} ‖}_{2} \\ = O (1) n ρ_{n 3}^{1 / 2} q^{1 / 2} m_{n}^{- d} . \end{matrix}

(49)

By Lemma 4, the third term on the right-hand side of (46)

\begin{matrix} max_{j \notin A_{1}} λ_{n 2} n^{- 1} {‖ Z_{j}^{'} Z_{A_{1}} C_{A_{1}}^{- 1} ν_{n} ‖}_{2} \\ \leq λ_{n 2} max_{j \notin A_{1}} {‖ n^{- 1 / 2} Z_{j} ‖}_{2} \cdot {‖ n^{- 1 / 2} Z_{A_{1}} C_{A_{1}}^{- 1 / 2} ‖}_{2} \cdot {‖ C_{A_{1}}^{- 1 / 2} ‖}_{2} \cdot {‖ ν_{n} ‖}_{2} \\ = λ_{n 2} ρ_{n 3}^{1 / 2} ρ_{n 1}^{- 1 / 2} O_{p} (q b_{n 1}^{- 1}) . \end{matrix}

(50)

Therefore, (45) follows from (39), (48), (49), (50) and condition (B2b).

PROOF OF THEOREM 4. The proof is similar to that of Theorem 2 and is omitted.

Footnotes

Supported in part by NIH Grant CA120988 and NSF Grant DMS-08-05670.

Supported in part by NSF Grant SES-0817552.

Contributor Information

Jian Huang, Department of Statistics and Actuarial Science, 241 SH, University of Iowa, Iowa City, Iowa 52242, USA, jian-huang@uiowa.edu.

Joel L. Horowitz, Department of Economics, Northwestern University, 2001 Sheridan Road, Evanston, Illinois 60208, USA, joel-horowitz@northwestern.edu.

Fengrong Wei, Department of Mathematics, University of West Georgia, Carrollton, Georgia 30118, USA, fwei@westga.edu.

REFERENCES

Antoniadis A, Fan J. Regularization of wavelet approximation (with discussion). J. Amer. Statist. Assoc. 2001;96:939–967. MR1946364.
Bach FR. Consistency of the group Lasso and multiple kernel learning. J. Mach. Learn. Res. 2007;9:1179–1225. MR2417268.
Bunea F, Tsybakov A, Wegkamp M. Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 2007:169–194. MR2312149.
Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model space. Biometrika. 2008;95:759–771.
Chen J, Chen Z. Extended BIC for small-n-large-P sparse GLM. 2009. Available at http://www.stat.nus.edu.sg/~stachenz/ChenChen.pdf.
Chiang AP, Beck JS, Yen H-J, Tayeh MK, Scheetz TE, Swiderski R, Nishimura D, Braun TA, Kim K-Y, Huang J, Elbedour K, Carmi R, Slusarski DC, Casavant TL, Stone EM, Sheffield VC. Homozygosity mapping with SNP arrays identifies a novel gene for Bardet–Biedl syndrome (BBS10). Proc. Natl. Acad. Sci. USA. 2006;103:6287–6292. doi: 10.1073/pnas.0600158103.
de Boor C. A Practical Guide to Splines. revised ed. New York: Springer; 2001. MR1900298.
Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression (with discussion). Ann. Statist. 2004;32:407–499. MR2060166.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 2001;96:1348–1360. MR1946581.
Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 2004;32:928–961. MR2065194.
Frank IE, Friedman JH. A statistical view of some chemometrics regression tools (with discussion). Technometrics. 1993;35:109–148.
Horowitz JL, Klemelä J, Mammen E. Optimal estimation in additive regression models. Bernoulli. 2006;12:271–298. MR2218556.
Horowitz JL, Mammen E. Nonparametric estimation of an additive model with a link function. Ann. Statist. 2004;32:2412–2443.
Huang J, Horowitz JL, Ma SG. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 2008;36:587–613. MR2396808.
Huang J, Ma S, Zhang C-H. Adaptive Lasso for high-dimensional regression models. Statist. Sinica. 2008;18:1603–1618. MR2469326.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264. doi: 10.1093/biostatistics/4.2.249.
Lin Y, Zhang H. Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 2006;34:2272–2297. MR2291500.
Meier L, van de Geer S, Bühlmann P. High-dimensional additive modeling. Ann. Statist. 2009;37:3779–3821. MR2572443.
Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the Lasso. Ann. Statist. 2006;34:1436–1462. MR2278363.
Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 2009;37:246–270. MR2488351.
Ravikumar P, Liu H, Lafferty J, Wasserman L. Sparse additive models. J. Roy. Statist. Soc. Ser. B. 2009;71:1009–1030.
Scheetz TE, Kim K-YA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Natl. Acad. Sci. USA. 2006;103:14429–14434. doi: 10.1073/pnas.0602562103.
Schwarz G. Estimating the dimension of a model. Ann. Statist. 1978;6:461–464. MR0468014.
Schumaker L. Spline Functions: Basic Theory. New York: Wiley; 1981. MR0606200.
Shen X, Wong WH. Convergence rate of sieve estimates. Ann. Statist. 1994;22:580–615.
Stone CJ. Additive regression and other nonparametric models. Ann. Statist. 1985;13:689–705. MR0790566.
Stone CJ. The dimensionality reduction principle for generalized additive models. Ann. Statist. 1986;14:590–606. MR0840516.
Tibshirani R. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B. 1996;58:267–288. MR1379242.
van de Geer S. High-dimensional generalized linear models and the Lasso. Ann. Statist. 2008;36:614–645. MR2396809.
Van der Vaart AW. Asymptotic Statistics. Cambridge: Cambridge Univ. Press; 1998.
van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996. MR1385671.
Wang L, Chen G, Li H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. 2007;23:1486–1494. doi: 10.1093/bioinformatics/btm125.
Wang H, Xia Y. Shrinkage estimation of the varying coefficient model. J. Amer. Statist. Assoc. 2008;104:747–757. MR2541592.
Wei F, Huang J. Technical Report #387. Dept. Statistics and Actuarial Science, Univ. Iowa; 2008. Consistent group selection in high-dimensional linear regression. Available at http://www.stat.uiowa.edu/techrep/tr387.pdf.
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006;68:49–67. MR2212574.
Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 2010;38:894–942.
Zhang H, Wahba G, Lin Y, Voelker M, Ferris M, Klein R, Klein B. Variable selection and model building via likelihood basis pursuit. J. Amer. Statist. Assoc. 2004;99:659–672. MR2090901.
Zhang C-H, Huang J. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann. Statist. 2008;36:1567–1594. MR2435448.
Zhang HH, Lin Y. Component selection and smoothing for nonparametric regression in exponential families. Statist. Sinica. 2006;16:1021–1041. MR2281313.
Zhao P, Yu B. On model selection consistency of LASSO. J. Mach. Learn. Res. 2006;7:2541–2563. MR2274449.
Zhou S, Shen X, Wolf DA. Local asymptotics for regression splines and confidence regions. Ann. Statist. 1998;26:1760–1782. MR1673277.
Zou H. The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 2006;101:1418–1429. MR2279469.
