Published in final edited form as: J Am Stat Assoc. 2010 Mar 1;105(489):312–323. doi: 10.1198/jasa.2009.tm08013

Regularization Parameter Selections via Generalized Information Criterion

Yiyun Zhang, Runze Li, and Chih-Ling Tsai

Abstract

We apply the nonconcave penalized likelihood approach to obtain variable selections as well as shrinkage estimators. This approach relies heavily on the choice of regularization parameter, which controls the model complexity. In this paper, we propose employing the generalized information criterion (GIC), encompassing the commonly used Akaike information criterion (AIC) and Bayesian information criterion (BIC), for selecting the regularization parameter. Our proposal makes a connection between the classical variable selection criteria and the regularization parameter selections for the nonconcave penalized likelihood approaches. We show that the BIC-type selector enables identification of the true model consistently, and the resulting estimator possesses the oracle property in the terminology of Fan and Li (2001). In contrast, however, the AIC-type selector tends to overfit with positive probability. We further show that the AIC-type selector is asymptotically loss efficient, while the BIC-type selector is not. Our simulation results confirm these theoretical findings, and an empirical example is presented. Some technical proofs are given in the online supplementary material.

Keywords: AIC, BIC, GIC, LASSO, Nonconcave penalized likelihood, SCAD

1 Introduction

Using the penalized likelihood function to simultaneously select variables and estimate unknown parameters has received considerable attention in recent years. To avoid the instability of the classical subset selection procedure, Tibshirani (1996) proposed the Least Absolute Shrinkage and Selection Operator (LASSO). In the same spirit as LASSO, Fan and Li (2001) introduced a unified approach via a nonconcave penalized likelihood function, and demonstrated its usefulness for both linear and generalized linear models. Subsequently, Fan and Li (2002) employed the nonconcave penalized likelihood approach for Cox models, while Fan and Li (2004) used it on semiparametric regression models. In addition, Fan and Peng (2004) investigated the theoretical properties of the nonconcave penalized likelihood when the number of parameters tends to infinity as the sample size increases. Recently, more researchers applied penalized approaches to study variable selections (e.g., Park and Hastie, 2007; Wang and Leng, 2007; Yuan and Lin, 2007; Zhang and Lu, 2007; Li and Liang, 2008).

In employing the nonconcave penalized likelihood in regression analysis, we face two challenges. The first hurdle is to compute the nonconcave penalized likelihood estimate. This issue has been carefully studied in the recent literature: (i.) Fan and Li (2001) proposed the local quadratic approximation (LQA) algorithm, which was further analyzed by Hunter and Li (2005); (ii.) Efron, Hastie, Johnstone, and Tibshirani (2004) introduced the LARS algorithm, which can be used for the adaptive LASSO (Zou, 2006; Wang, Li, and Tsai, 2007a; Zhang and Lu, 2007). With the aid of local linear approximation (LLA) algorithm (Zou and Li, 2008), LARS can be adopted to solve optimization problems of nonconcave penalized likelihood functions. However, the above computational procedures rely on the regularization parameter. Hence, the selection of this parameter becomes the second challenge, and is therefore the primary aim of our paper.

In the literature, selection criteria are usually classified into two categories: consistent (e.g., the Bayesian information criterion BIC, Schwarz, 1978) and efficient (e.g., the Akaike information criterion AIC, Akaike, 1974; the generalized cross-validation GCV, Craven and Wahba, 1979). A consistent criterion identifies the true model with a probability that approaches 1 in large samples when a set of candidate models contains the true model. An efficient criterion selects the model so that its average squared error is asymptotically equivalent to the minimum offered by the candidate models when the true model is approximated by a family of candidate models. Detailed discussions on efficiency and consistency can be found in Shibata (1981, 1984), Li (1987), Shao (1997) and McQuarrie and Tsai (1998). In the context of linear and generalized linear models (GLIM) with a nonconcave penalized function, Fan and Li (2001) proposed applying GCV to choose the regularization parameter. Recently, Wang, Li, and Tsai (2007b) found that the resulting model selected by GCV tends to overfit, while BIC is able to identify the finite-dimensional true linear and partial linear models consistently. Wang et al. (2007b) also indicated that GCV is similar to AIC. However, they only studied the penalized least squares function with the smoothly clipped absolute deviation (SCAD) penalty. This motivated us to study the issues of regularization parameter selection for penalized likelihood-based models with a nonconcave penalized function.

In this paper, we adopt Nishii's (1984) generalized information criterion (GIC) to choose regularization parameters in nonconcave penalized likelihood functions. This criterion not only contains AIC and BIC as its special cases, but also bridges the connection between the classical variable selection criteria and the nonconcave penalized likelihood methodology. This connection provides more flexibility for practitioners to employ their own favored variable selection criteria in choosing desirable models. When the true model is among a set of candidate models with the GLIM structure, we show that the BIC-type tuning parameter selector enables us to identify the true model consistently, whereas the AIC-type selector tends to yield an overfitted model. On the other hand, if the true model is approximated by a set of candidate models with the linear structure, we demonstrate that the AIC-type tuning parameter selector is asymptotically loss efficient, while the BIC-type selector does not have this property in general. These findings are consistent with the features of AIC (see Shibata, 1981; Li, 1987) and BIC (see Shao, 1997) used in best subset variable selections.

The rest of the paper is organized as follows. Section 2 proposes the generalized information criterion under a general nonconcave penalized likelihood setting. Section 3 studies the consistency of GIC for generalized linear models, while Section 4 investigates the asymptotic loss efficiency of the AIC-type selector for linear regression models. Monte Carlo simulations and an empirical example are presented in Section 5 to illustrate the use of the regularization parameter selectors. Section 6 provides a discussion, and technical proofs are given in the Appendix.

2 Nonconcave penalized likelihood function

2.1 Penalized estimators and penalty conditions

Consider data (x_1, y_1), ..., (x_n, y_n) collected independently and identically, where y_i is the response of the ith subject and x_i is the associated d-dimensional predictor vector. Let ℓ(β) be the log likelihood-based (or loss) function of the d-dimensional parameter vector β = (β_1, ..., β_d)^T. Then, adopting Fan and Li's (2001) approach, we define the penalized likelihood to be

Q(\beta) = \ell(\beta) - n \sum_{j=1}^{d} p_\lambda(|\beta_j|), \qquad (1)

where p_λ(·) is a penalty function with regularization parameter λ. Several penalty functions have been proposed in the literature. For example, the L_q (0 < q < 2) penalty, namely p_λ(|β|) = q^{-1} λ |β|^q, leads to bridge regression (Frank and Friedman, 1993). In particular, the L_1 penalty yields the LASSO estimator (Tibshirani, 1996). Fan and Li (2001) proposed the nonconcave penalized likelihood method and advocated the use of the smoothly clipped absolute deviation (SCAD) penalty, whose first derivative is given by

p_\lambda'(|\beta_j|) = \lambda \left\{ I(|\beta_j| \le \lambda) + \frac{(a\lambda - |\beta_j|)_+}{(a-1)\lambda} I(|\beta_j| > \lambda) \right\},

with a = 3.7 and p_λ(0) = 0. With the regularization parameter λ properly chosen between 0 and its upper limit λ_max, the resulting penalized estimator is sparse and therefore suitable for variable selection.
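To make the penalty concrete, here is a minimal Python sketch (ours, not the authors') that evaluates the SCAD derivative p′_λ(|β_j|) displayed above; the function name and the default a = 3.7 are the only items carried over from the text.

```python
import numpy as np

def scad_derivative(beta, lam, a=3.7):
    """Elementwise first derivative p'_lambda(|beta_j|) of the SCAD penalty."""
    b = np.abs(beta)
    flat = lam * (b <= lam)                                       # lambda on [0, lambda]
    taper = np.maximum(a * lam - b, 0.0) / (a - 1.0) * (b > lam)  # (a*lambda - |b|)_+ / (a - 1) beyond lambda
    return flat + taper

# The derivative equals lambda near the origin and vanishes once |beta_j| > a*lambda,
# so small coefficients are shrunk to zero while large ones are left nearly unpenalized.
print(scad_derivative(np.array([0.05, 0.5, 5.0]), lam=0.2))
```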

To investigate the asymptotic properties of the regularization parameter selectors, we present the penalty conditions given below.

  • (C1) Assume that λ_max depends on n and satisfies λ_max → 0 as n → ∞.

  • (C2) There exists a constant m such that the penalty satisfies p′_λ(θ) = 0 for θ > mλ.

  • (C3) If λ_n → 0 as n → ∞, then the penalty function satisfies
    \liminf_{n\to\infty} \liminf_{\zeta\to 0^+} \sqrt{n}\, p_{\lambda_n}'(\zeta) \to \infty.

Condition (C1) indicates that a smaller regularization parameter is needed if the sample size is large. Condition (C2) assures that the resulting penalized likelihood estimate is asymptotically unbiased (Fan and Li, 2001). Both the SCAD and Zhang's (2007) minimax concave penalties (MCP) satisfy this condition. Condition (C3) is adapted from Fan and Li's (2001) Equation (3.5), which is used to study the oracle property.

Remark 1. Both the SCAD penalty and Lq penalty for 0 < q ≤ 1 are singular at the origin. Hence, it becomes challenging to maximize their corresponding penalized likelihood functions. Accordingly, Fan and Li (2001) proposed the LQA algorithm for finding the solution of the nonconcave penalized likelihood. In LQA, pλ(|β|) is locally approximated by a quadratic function qλ(|β|), whose first derivative is given by

q_\lambda'(|\beta_j|) = \left\{ \frac{p_\lambda'(|\beta_j|)}{|\beta_j|} \right\} \beta_j.

The above equation is evaluated at the (k + 1)th step of the Newton–Raphson iteration if β_j^{(k)} is not very close to zero. Otherwise, the resulting parameter estimator of β_j is set to 0.
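The LQA step is easiest to see in the penalized least-squares special case, where each Newton–Raphson iteration reduces to a ridge-type linear system with Σ_λ = diag{p′_λ(|β_j^{(k)}|)/|β_j^{(k)}|}. The sketch below is our own illustration under that assumption (using the ½‖y − Xβ‖² convention of Fan and Li, 2001); it is not the authors' implementation.

```python
import numpy as np

def scad_deriv(b, lam, a=3.7):
    b = np.abs(b)
    return lam * (b <= lam) + np.maximum(a * lam - b, 0.0) / (a - 1.0) * (b > lam)

def lqa_scad_ls(X, y, lam, a=3.7, n_iter=100, eps=1e-4):
    """LQA iterations for SCAD-penalized least squares, 0.5*||y - X b||^2 + n*sum p_lambda(|b_j|)."""
    n, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # initialize at the ordinary LS fit
    for _ in range(n_iter):
        keep = np.abs(beta) > eps                      # coefficients not yet collapsed to zero
        if not keep.any():
            break
        # Sigma_lambda = diag{ p'_lambda(|beta_j|) / |beta_j| } over the retained coordinates
        sigma = scad_deriv(beta[keep], lam, a) / np.abs(beta[keep])
        Xk = X[:, keep]
        new_beta = np.zeros(d)
        # Ridge-type Newton step: (X'X + n * Sigma_lambda) beta = X'y on the retained coordinates
        new_beta[keep] = np.linalg.solve(Xk.T @ Xk + n * np.diag(sigma), Xk.T @ y)
        if np.max(np.abs(new_beta - beta)) < 1e-8:
            beta = new_beta
            break
        beta = new_beta
    beta[np.abs(beta) <= eps] = 0.0                    # treat near-zero coefficients as exact zeros
    return beta
```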

2.2 Generalized information criterion

Before introducing the generalized information criterion, we first define the candidate model, which is involved in variable selections.

Definition 1 (Candidate Model). We define α, a subset of the full index set α_F = {1, ..., d}, as a candidate model, meaning that the predictors labelled by α are included in the model. Accordingly, α_F is the full model. In addition, we denote the size of model α (i.e., the number of nonzero parameters in α) by d_α and the coefficients associated with the predictors in model α by β_α. Moreover, we denote the collection of all candidate models by A. For a penalized estimator β̂_λ that maximizes the objective function (1), denote the model associated with β̂_λ by α_λ.

In the normal linear regression model, y_i = x_i^T β_α + e_i with e_i ~ N(0, σ²) for i = 1, ..., n, Nishii (1984) proposed the generalized information criterion for classical variable selection. It is

\mathrm{GIC}_{\kappa_n}(\alpha) = \log \hat{\sigma}_\alpha^2 + \frac{1}{n} \kappa_n d_\alpha,

where β_α is the parameter vector of the candidate model α, σ̂²_α is the maximum likelihood estimator of σ², and κ_n is a positive number that controls the properties of variable selection. Note that Nishii's GIC is different from the GIC proposed by Konishi and Kitagawa (1996). When κ_n = 2, GIC becomes AIC, while κ_n = log(n) leads to GIC being BIC. Because GIC contains a broad range of selection criteria, it motivates us to propose the following GIC-type regularization parameter selector,

\mathrm{GIC}_{\kappa_n}(\lambda) = \frac{1}{n} \left\{ G(y, \hat{\beta}_\lambda) + \kappa_n \, \mathrm{df}_\lambda \right\}, \qquad (2)

where G(y, β̂_λ) measures the goodness of fit of model α_λ, y = (y_1, ..., y_n)^T, β̂_λ is the penalized parameter estimator obtained by maximizing Equation (1) with respect to β, and df_λ is the degrees of freedom of model α_λ. For any given κ_n, we select the λ that minimizes GIC_{κ_n}(λ). The larger κ_n is, the heavier the penalty on models with more variables; hence, for a given data set, the size of the selected model decreases as κ_n increases.
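In practice, selecting λ by (2) amounts to fitting the penalized estimator over a grid of λ values and keeping the minimizer of GIC_{κ_n}(λ). The sketch below spells out this loop with df_λ taken as the number of nonzero coefficients; fit_fn and goodness_fn are placeholders for any penalized solver and goodness-of-fit measure, and all names are ours rather than the authors'.

```python
import numpy as np

def gic(goodness_of_fit, df, n, kappa_n):
    """GIC_kappa(lambda) = (1/n){ G(y, beta_hat_lambda) + kappa_n * df_lambda }."""
    return (goodness_of_fit + kappa_n * df) / n

def select_lambda(fit_fn, goodness_fn, y, lambda_grid, kappa_n):
    """Return the lambda on the grid minimizing GIC; fit_fn(lam) -> beta_hat_lambda."""
    n = len(y)
    best_lam, best_score = None, np.inf
    for lam in lambda_grid:
        beta_hat = fit_fn(lam)
        df = int(np.count_nonzero(beta_hat))           # df_lambda taken as the selected model size
        score = gic(goodness_fn(y, beta_hat), df, n, kappa_n)
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam

# kappa_n = 2 gives the AIC-type selector; kappa_n = log(n) gives the BIC-type selector.
```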

Remark 2. For any given model α (including the penalized model α_λ and the full model α_F), we can obtain the non-penalized parameter estimator β̂_α by maximizing the log-likelihood function ℓ(β) in (1). Then, Equation (2) becomes

\mathrm{GIC}^*_{\kappa_n}(\alpha) = \frac{1}{n} \left\{ G(y, \hat{\beta}_\alpha) + \kappa_n d_\alpha \right\}, \qquad (3)

which can be used for classical variable selection. In addition, GIC*_{κ_n}(α) turns into Nishii's GIC_{κ_n}(α) if we replace G(y, β̂_α) in (3) with the −2 log-likelihood of the fitted normal regression model.

We next study the degrees of freedom used in the second term of GIC. In the selection of the regularization parameter, Fan and Li (2001, 2002) proposed that the degrees of freedom be the trace of the approximate linear projection matrix, i.e.

\mathrm{df}_L(\lambda) \equiv \mathrm{tr}\left\{ \left[ \nabla^2 Q^*(\hat{\beta}_\lambda) \right]^{-1} \nabla^2 \ell(\hat{\beta}_\lambda) \right\}, \qquad (4)

where Q*(β) = ℓ(β) − n ∑_{j=1}^d q_λ(|β_j|), [∇²Q*(β)]_{jj′} = ∂²Q*(β)/∂β_j∂β_{j′}, and [∇²ℓ(β)]_{jj′} = ∂²ℓ(β)/∂β_j∂β_{j′} for j, j′ such that β̂_j ≠ 0 and β̂_{j′} ≠ 0. To understand the large sample property of df_L(λ), we show its asymptotic behavior below.

Proposition 1. Assume that the penalized likelihood estimator β̂_λ is sparse (i.e., with probability tending to one, β̂_{λj} = 0 if the true value of β_j is 0) and consistent, where β̂_{λj} is the j-th component of β̂_λ. Under conditions (C1) and (C2), we have

P\{\mathrm{df}_L(\lambda) = d_{\alpha_\lambda}\} \to 1,

where d_{α_λ} is the size of model α_λ.

Proof. After algebraic simplification, df_L(λ) = tr{[∇²Q*(β̂_λ)]^{-1} ∇²ℓ(β̂_λ)} evaluated on the nonzero components of β̂_λ, where ∇²Q*(β̂_λ) differs from ∇²ℓ(β̂_λ) only through the diagonal matrix nΣ_λ with Σ_λ = diag{p′_λ(|β̂_{λj}|)/|β̂_{λj}| : β̂_{λj} ≠ 0}. Because of the consistency and sparsity of β̂_λ, β̂_{λj} converges to β_j with probability tending to 1 for all j such that β̂_{λj} ≠ 0. Hence, those β̂_{λj} are bounded away from 0. This result, together with conditions (C1) and (C2), implies that Σ_λ = 0 with probability tending to 1. Subsequently, using the fact that n^{-1}∇²ℓ(β̂_λ) = O_P(1), we complete the proof.

The above proposition suggests that the difference between df_L(λ) and the model size d_{α_λ} is negligible. Because d_{α_λ} is simple to calculate, we use it as the degrees of freedom df_λ in (2). In linear regression models, Efron et al. (2004) and Zou, Hastie, and Tibshirani (2007) also suggested using d_{α_λ} as an estimator of the degrees of freedom for LASSO. Moreover, Zou et al. (2007) showed that d_{α_λ} is an asymptotically unbiased estimator. In this article, our asymptotic results remain valid regardless of whether df_L(λ) or d_{α_λ} is used as the degrees of freedom. When the sample size is small, however, df_L(λ) should be considered. In the following section, we explore the properties of GIC for generalized linear models, which have been widely used in various disciplines.
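For penalized least squares, the trace in (4) reduces to a ridge-type quantity on the selected coordinates, which can be compared directly with the simple count d_{α_λ}. The sketch below is our own illustration of Proposition 1's message under that least-squares assumption; the helper names are hypothetical.

```python
import numpy as np

def scad_deriv(b, lam, a=3.7):
    b = np.abs(b)
    return lam * (b <= lam) + np.maximum(a * lam - b, 0.0) / (a - 1.0) * (b > lam)

def df_trace(X, beta_hat, lam, a=3.7, eps=1e-8):
    """Trace-based degrees of freedom df_L(lambda), restricted to the nonzero coordinates."""
    n = X.shape[0]
    keep = np.abs(beta_hat) > eps
    Xk = X[:, keep]
    sigma = np.diag(scad_deriv(beta_hat[keep], lam, a) / np.abs(beta_hat[keep]))
    XtX = Xk.T @ Xk
    return np.trace(np.linalg.solve(XtX + n * sigma, XtX))

def df_count(beta_hat, eps=1e-8):
    """Simple degrees of freedom: the size d_{alpha_lambda} of the selected model."""
    return int(np.count_nonzero(np.abs(beta_hat) > eps))

# Once the retained |beta_hat_j| exceed a*lambda, Sigma_lambda vanishes and the two quantities
# agree, which is the content of Proposition 1.
```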

3 Consistency

In this section, we assume that the set of candidate models contains the unique true model, and that the number of parameters in the full model is finite. Under this assumption, we are able to study the asymptotic consistency of GIC by introducing the following definition and condition.

Definition 2 (Underfitted and Overfitted Models). We assume that there is a unique true model α_0 in A, whose corresponding coefficients are all nonzero. Any candidate model α with α ⊉ α_0 is referred to as an underfitted model, while any α ⊃ α_0 other than α_0 itself is referred to as an overfitted model.

Based on the above definitions, we partition the tuning parameter interval [0, λmax] into the underfitted, true, and overfitted subsets, respectively,

\Omega_- = \{\lambda : \alpha_\lambda \not\supseteq \alpha_0\}, \quad \Omega_0 = \{\lambda : \alpha_\lambda = \alpha_0\}, \quad \text{and} \quad \Omega_+ = \{\lambda : \alpha_\lambda \supset \alpha_0 \text{ and } \alpha_\lambda \ne \alpha_0\}.

This partition allows us to assess the performance of regularization parameter selections.

To investigate the asymptotic properties of the regularization parameter selectors, we introduce the technical condition given below.

  • (C4) For any candidate model α ∈ A, there exists c_α > 0 such that (1/n) G(y, β̂_α) →_P c_α. In addition, for any underfitted model α ⊉ α_0, c_α > c_{α_0}, where c_{α_0} is the limit in probability of (1/n) G(y, β_{α_0}) and β_{α_0} is the parameter vector of the true model α_0.

The above condition assures that the underfitted model yields a larger measure of model fitting than that of the true model. We next explore the asymptotic consistency of GIC for the generalized linear models which have been used in various disciplines.

Consider the generalized linear model (GLIM, see McCullagh and Nelder, 1989), whose conditional density function of yi given xi is

f_i(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\}, \qquad (5)

where a(·), b(·) and c(·, ·) are suitably chosen functions, θi is the canonical parameter, E(yi|xi) = μi = b′(θi), g(μi) = θi, g is a link function, and ϕ is a scale parameter. Throughout this paper, we assume that ϕ is known (such as in the logistic regression model and the Poisson log-linear model) or that it can be estimated by fitting the data with the full model (for instance, the normal linear model). In addition, we follow the classical regression approach to model θi by xiTβ. Based on (5), the log likelihood-based function in (1) is

\ell(\beta) = \ell(\mu; y) = \ell(\theta) = \sum_{i=1}^{n} \log f_i(y_i; \theta_i, \phi) = \sum_{i=1}^{n} \left[ \frac{y_i x_i^T\beta - b(x_i^T\beta)}{a(\phi)} + c(y_i, \phi) \right], \qquad (6)

where μ = (μ1, · · · , μn)T, y = (y1, · · · , yn)T, and θ = (θ1, · · · , θn)T. Then, the resulting scaled deviance of a penalized estimate β^λ is

D(y; \hat{\mu}_\lambda) = 2\left\{ \ell(y; y) - \ell(\hat{\mu}_\lambda; y) \right\},

where μ̂_λ = (g^{-1}(x_1^T β̂_λ), ..., g^{-1}(x_n^T β̂_λ))^T.

For model αλ, we employ the scaled deviance D(y;μ^λ) as the goodness-of-fit measure, G(y,β^λ), in (2) so that the resulting generalized information criterion for GLIM is

\mathrm{GIC}_{\kappa_n}(\lambda) = \frac{1}{n} D(y; \hat{\mu}_\lambda) + \frac{1}{n} \kappa_n \, \mathrm{df}_\lambda. \qquad (7)

In addition, when we fit the data with the non-penalized likelihood approach under model α, GIC becomes

\mathrm{GIC}^*_{\kappa_n}(\alpha) = \frac{1}{n} D(y; \hat{\mu}_\alpha) + \frac{1}{n} \kappa_n d_\alpha, \qquad (8)

where μ̂_α = (g^{-1}(x_1^T β̂_α), ..., g^{-1}(x_n^T β̂_α))^T, and β̂_α is the non-penalized maximum likelihood estimator of β under model α. Accordingly, GIC* can be used in classical variable selection (see Eq. (3.10) of McCullagh and Nelder, 1989). Next, we show the asymptotic performance of GIC.
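Before turning to the asymptotic results, here is a concrete instance of (7) for a Bernoulli response: for 0/1 data the saturated log-likelihood ℓ(y; y) is zero, so the scaled deviance reduces to −2ℓ(μ̂; y). The code below is a hedged illustration with df_λ counted as the number of nonzero coefficients; it is not the authors' implementation.

```python
import numpy as np

def logistic_deviance(y, X, beta_hat, eps=1e-12):
    """Scaled deviance D(y; mu_hat) = 2{ l(y; y) - l(mu_hat; y) } for 0/1 responses."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    mu = np.clip(mu, eps, 1.0 - eps)
    loglik = np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
    return -2.0 * loglik          # l(y; y) = 0 for Bernoulli data, so D = -2 * l(mu_hat; y)

def gic_glim(y, X, beta_hat, kappa_n):
    """GIC_kappa(lambda) in (7) with df_lambda = number of nonzero coefficients."""
    n = len(y)
    df = int(np.count_nonzero(beta_hat))
    return logistic_deviance(y, X, beta_hat) / n + kappa_n * df / n
```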

Theorem 1. Suppose the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)–(C) and that the technical condition (C4) holds.

  (A) If there exists a positive constant M such that κ_n < M, then the tuning parameter λ̂ selected by minimizing GIC_{κ_n}(λ) in (7) satisfies
    P\{\hat\lambda \in \Omega_-\} \to 0 \quad \text{and} \quad \liminf_{n\to\infty} P\{\hat\lambda \in \Omega_+\} \ge \pi,
    where π is a nonzero probability.
  (B) Suppose that conditions (C1)–(C3) are satisfied. If κ_n → ∞ and κ_n/n → 0, then the tuning parameter λ̂ selected by minimizing GIC_{κ_n}(λ) in (7) satisfies P{α_{λ̂} = α_0} → 1.

The proof is given in Appendix A.

Theorem 1 provides guidance on the choice of the regularization parameter. Theorem 1(A) implies that the GIC selector with bounded κ_n tends to overfit, regardless of which penalty function is used. Analogous to the classical variable selection criterion AIC, we refer to GIC with κ_n = 2 as the AIC selector, and to GIC with κ_n → 2 as the AIC-type selector. In contrast, because κ_n = log(n) fulfills the conditions of Theorem 1(B), we call GIC with κ_n → ∞ and κ_n/n → 0 the BIC-type selector. Theorem 1(B) indicates that the BIC-type selector identifies the true model consistently. Thus, the nonconcave penalized likelihood of the generalized linear model with the BIC-type selector possesses the oracle property.

Remark 3. In linear regression models, Fan and Li (2001) applied the GCV selector given below to choose the regularization parameter,

\mathrm{GCV}(\lambda) = \frac{\| y - X\hat{\beta}_\lambda \|^2}{n\{1 - \mathrm{df}_L(\lambda)/n\}^2}, \qquad (9)

where ‖·‖ denotes the Euclidean norm, X = (x_1, ..., x_n)^T, and df_L(λ) is defined in (4). To extend the application of GCV, we replace the residual sum of squares in (9) by G(y, β̂_λ), and then choose the λ that minimizes

\mathrm{GCV}(\lambda) = \frac{G(y, \hat{\beta}_\lambda)}{n\{1 - \mathrm{df}_\lambda/n\}^2}. \qquad (10)

Using the Taylor expansion, we further have that

\mathrm{GCV}(\lambda) \approx \frac{1}{n}\left\{ G(y, \hat{\beta}_\lambda) + \mathrm{df}_\lambda \left[ \frac{2 G(y, \hat{\beta}_\lambda)}{n} \right] \right\}.

Because G(y, β̂_λ)/n is bounded via Condition (C4), GCV yields an overfitted model with positive probability. Therefore, the penalized likelihood with the GCV selector does not possess the oracle property.
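The Taylor-expansion claim in Remark 3, namely that GCV behaves like a GIC whose per-parameter penalty is roughly 2G(y, β̂_λ)/n, is easy to check numerically; the few lines below use hypothetical values of n, G, and df_λ purely for illustration.

```python
n, G, df = 500, 520.0, 6          # hypothetical sample size, G(y, beta_hat_lambda), df_lambda
gcv = G / (n * (1.0 - df / n) ** 2)
aic_like = (G + df * (2.0 * G / n)) / n
print(gcv, aic_like)              # nearly equal when df/n is small, illustrating GCV ~ AIC-type behavior
```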

Remark 4. In linear regression models, Wang et al. (2007b) demonstrated that Fan and Li's (2001) GCV-selector for the SCAD penalized least squares procedure cannot select the tuning parameter consistently. They further proposed the following BIC tuning parameter selector,

\mathrm{BIC}(\lambda) = \log(\hat{\sigma}^2_\lambda) + \frac{1}{n} \log(n)\, \mathrm{df}_L(\lambda),

where σ̂²_λ = ∑_{i=1}^n (y_i − x_i^T β̂_λ)²/n. Using the result that log(1 + t) ≈ t for small t, BIC(λ) is approximately equal to

\mathrm{BIC}^{**}(\lambda) = \frac{1}{n} D(y; \hat{\mu}_\lambda) + \frac{1}{n} \log(n)\, \mathrm{df}_L(\lambda),

where D(y; μ̂_λ) = nσ̂²_λ/σ̂²_{α_F} is the scaled deviance under the normal distribution, and σ̂²_{α_F} is the dispersion parameter estimator computed from the full model. It can be seen that BIC** is a BIC-type selector. Under the conditions in Theorem 1(B), the SCAD penalized least squares procedure with the BIC** selector possesses the oracle property, which is consistent with the findings in Wang et al. (2007b).

4 Efficiency

Under the assumption that the true model is included in a family of candidate models, we established the consistency of BIC-type selectors. In practice, however, this assumption may not be valid, which motivates us to study the asymptotic efficiency of the AIC-type selector. In the literature, the L2 norm has been commonly used to assess the efficiency of the classical AIC procedure (see Shibata, 1981, 1984 and Li, 1987) in linear regression models. Hence, we focus on the efficiency of linear regression model selections via L2 norm.

Consider the following model

y_i = \mu_i + \varepsilon_i, \quad i = 1, \ldots, n,

where μ = (μ1, ..., μn)T is an unknown mean vector, and εi's are independent and identically distributed (i.i.d) random errors with mean 0 and variance σ2. Furthermore, we assume that Xβ constitutes the nearest representation of the true mean vector μ, and hence the full model is not necessarily a correct model. Adapting the formulation of Li (1987), we allow d, the dimension of β, to tend to infinity with n, but d/n → 0.

For the given data set {(xi, yi) : i = 1, · · · , n}, we follow the formulation of (1) to define the penalized least squares function

Q^{LS}(\beta) = \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + n \sum_{j=1}^{d} p_\lambda(|\beta_j|). \qquad (11)

The resulting penalized estimator of μ in model α_λ is μ̂_λ = Xβ̂_λ. In addition, the non-penalized estimator of μ in model α is μ̂_α = Xβ̂_α. It is noteworthy that β̂_λ and β̂_α have been defined in Definition 1 and Remark 2, respectively. In practice, the tuning parameter λ is unknown and can be selected by minimizing

\mathrm{GIC}^{LS}_{\kappa_n}(\lambda) = \frac{1}{n} \left\{ \sum_{i=1}^{n} (y_i - x_i^T\hat{\beta}_\lambda)^2 + \kappa_n \sigma^2 d_{\alpha_\lambda} \right\}, \qquad (12)

where σ² is assumed to be known; the case of unknown σ² is discussed later. When β̂_λ in (12) is replaced by the least squares estimator β̂_α, GIC^{LS}_{κ_n} with κ_n = 2 becomes Mallows' C_p criterion, which has been used for classical variable selection.
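The criterion in (12) is straightforward to compute once a penalized fit is available; the sketch below (our own, with illustrative names) evaluates GIC^{LS}_{κ_n}(λ) for a known σ².

```python
import numpy as np

def gic_ls(y, X, beta_hat, sigma2, kappa_n):
    """GIC^LS_kappa(lambda) = (1/n){ RSS(beta_hat_lambda) + kappa_n * sigma^2 * d_alpha_lambda }."""
    n = len(y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    d_alpha = int(np.count_nonzero(beta_hat))
    return (rss + kappa_n * sigma2 * d_alpha) / n

# With kappa_n = 2 and the least squares refit plugged in, minimizing this score over candidate
# models is equivalent to minimizing Mallows' Cp; kappa_n = log(n) gives the BIC-type selector.
```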

To assess the performance of the tuning parameter selector, we adopt the approach of Shibata (1981) (also see Li, 1987; Shao, 1997), and define the average squared loss (or the L2 loss) associated with the estimator β^λ to be

L(\hat{\beta}_\lambda) = \frac{1}{n}\| \mu - \hat{\mu}_\lambda \|^2 = \frac{1}{n}\sum_{i=1}^{n} (\mu_i - x_i^T\hat{\beta}_\lambda)^2. \qquad (13)

Accordingly, the risk function is R(β^)=E[L(β^)].

Using the average squared loss measure, we further define the asymptotic loss efficiency.

Definition 3 (Asymptotically Loss Efficient). A tuning parameter selection procedure is said to be asymptotically loss efficient if

\frac{L(\hat{\beta}_{\hat\lambda})}{\inf_{\lambda \in [0, \lambda_{\max}]} L(\hat{\beta}_\lambda)} \to 1, \qquad (14)

in probability, where β^λ^ is associated with the tuning parameter λ^ selected by this procedure. We also say β^λ^ is asymptotically loss efficient if (14) holds.

Moreover, we introduce the following technical conditions for studying the asymptotic loss efficiency of the AIC-type selector in linear regression.

  • (C5) (n^{-1} X^T X)^{-1} exists, and its largest eigenvalue is bounded by a constant C.

  • (C6) E(ε_1^{4q}) < ∞ for some positive integer q.

  • (C7) The risks of the least squares estimators β̂_α for all α ∈ A satisfy
    \sum_{\alpha \in A} \left\{ n R(\hat{\beta}_\alpha) \right\}^{-q} \to 0.
  • (C8) Let b = (b_1, ..., b_d)^T, where b_j = p′_λ(|β̂_{λj}|) sgn(β̂_{λj}) for all j such that β̂_{λj} ≠ 0, and b_j = 0 otherwise, with β̂_{λj} the j-th component of the penalized estimator β̂_λ. In addition, let β̂_{α_λ} be the least squares estimator of β obtained from model α_λ. Then, we assume that, in probability,
    \sup_{\lambda \in [0, \lambda_{\max}]} \frac{\|b\|^2}{R(\hat{\beta}_{\alpha_\lambda})} \to 0.

Condition (C5) has been commonly considered in the literature. Conditions (C6) and (C7) are adopted from conditions (A.2) and (A.3), respectively, in Li (1987). It can be shown that if the true model is approximated by a set of the candidate models (e.g., the true model is of infinite dimension), then Condition (C7) holds. Condition (C8) ensures that the difference between the penalized mean function estimator and the corresponding least squares mean function estimator is small in comparison with the risk of the least squares estimator (see Lemma 3 in Appendix B). Sufficient conditions for (C8) are also given in Appendix C. We next show the asymptotic efficiency of the AIC-type selector.

Theorem 2. Assume conditions (C5)—(C8) hold. Then, the tuning parameter λ^ selected by minimizing GICκnLS(λ) in (12) with κn → 2 yields an asymptotically loss efficient estimator, β^λ^, in the sense of (14).

The proof is given in Appendix B.

Theorem 2 demonstrates that the AIC-type selector is asymptotically loss efficient. In addition, using the result that log(1 + t) ≈ t for small t, the AIC selector behaves similarly to the following AIC* selector,

\mathrm{AIC}^*(\lambda) = \log(\hat{\sigma}^2_\lambda) + \frac{2\sigma^2 d_{\alpha_\lambda}}{n}.

Accordingly, both AIC and AIC* selectors are asymptotically loss efficient.

Applying Lemma 4 and Equation (26) in Appendix B, we find that if sup_{λ∈[0,λ_max]} n^{-1}(κ_n − 2)σ²d_{α_λ}/R(β̂_λ) → 0 in probability, then GIC^{LS}_{κ_n} is asymptotically loss efficient. Note that it can be shown that n^{-1}σ²d_{α_λ}/R(β̂_λ) is bounded by 1. As a result, κ_n → 2 (including κ_n = 2) is critical in establishing asymptotic loss efficiency. This finding is similar to those for the classical efficient criteria (see Shibata, 1980; Shao, 1997; Yang, 2005). It is also not surprising that the BIC-type selectors do not possess asymptotic loss efficiency, which is consistent with the findings for classical variable selection (see Li, 1987, and Shao, 1997).

In practice, σ² is often unknown. It is then natural to replace σ² in GIC^{LS}_2 by a consistent estimator (see Shao, 1997). The following corollary shows that the asymptotic property of GIC still holds.

Corollary 1. If the tuning parameter λ^ is selected by minimizing GICκnLS(λ) with κn → 2 and σ2 being replaced by its consistent estimator σ~2, then the resulting procedure is also asymptotically loss efficient.

The proof is given in Appendix B.

Remark 5. As suggested by an anonymous reviewer, we study the asymptotic loss efficiency of the generalized linear model in (5). Following the spirit of the deviance measure commonly used in the GLIM, we adopt the Kullback-Leibler (KL) distance measure to define the KL loss of an estimate β^ (either the maximum likelihood estimate β^α or the nonconcave penalized estimate β^λ), as

L_{KL}(\hat{\beta}) = \frac{2}{n} E_0\left\{ \ell(\theta_0) - \ell(\hat{\theta}) \right\}, \qquad (15)

where ℓ(·) is defined in (6), θ_0 = (θ_{01}, ..., θ_{0n})^T is the true unknown canonical parameter vector, θ̂ = (θ̂_1, ..., θ̂_n)^T = Xβ̂ under the canonical link function, and E_0 denotes the expectation under the true model (see McQuarrie and Tsai, 1998, for example). It can be shown that the KL loss is identical to the squared loss for normal linear regression models with known variance. To obtain the asymptotic loss efficiency for GLIM, however, it is necessary to employ the Taylor expansion to expand b(θ̂_i) at the true value θ_{0i} for i = 1, ..., n. Accordingly, we need to impose some strong assumptions to establish the asymptotic loss efficiency of the nonconcave penalized estimate under the KL loss when the tuning parameter λ̂ is selected by minimizing GIC in (7) with κ_n → 2. To avoid considerably lengthening the paper, we do not present the detailed discussions, justifications, and proofs here; they are given in the online supplemental material available on the JASA website.

5 Numerical studies

In this section, we present three examples comprised of two Monte Carlo experiments and one empirical example. It is noteworthy that Example 1 considers a setting in which the true model is in a set of candidate models with logistic regressions. This setting allows us to examine the finite sample performance of the consistent selection criterion, which is expected to perform well via Theorem 1. In contrast, Example 2 considers a setting in which the true model is not included in a set of candidate models with Gaussian regressions. This setting enables us to assess the finite sample performance of the efficient selection criterion, which is expected to perform well via Theorem 2.

Example 1. Adapting the model setting from Tibshirani (1996) and Fan and Li (2001), we simulate the data from the logistic regression model, y|x ~ Bernoulli{p(xT β)}, where

p(x^T\beta) = \mu(x^T\beta) = \frac{\exp(x^T\beta)}{1 + \exp(x^T\beta)}.

In addition, β = (3, 1.5, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0)^T, x is a 12-dimensional random vector, the first 9 components of x are multivariate normal with covariance matrix Σ = (σ_{ij}), in which σ_{ij} = 0.5^{|i−j|}, and the last 3 components of x are independent Bernoulli variables with success probability 0.5. Moreover, we conduct 1000 realizations with sample sizes n = 200 and 400.
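A minimal data-generating sketch for this design is given below; it assumes the covariance and coefficient settings quoted above and uses our own function names, so it should be read as an illustration rather than the authors' simulation code.

```python
import numpy as np

def simulate_example1(n, rng):
    beta = np.array([3, 1.5, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0], dtype=float)
    # First nine covariates: multivariate normal with covariance sigma_ij = 0.5^{|i-j|}
    idx = np.arange(9)
    cov = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    x_norm = rng.multivariate_normal(np.zeros(9), cov, size=n)
    # Last three covariates: independent Bernoulli(0.5)
    x_bern = rng.binomial(1, 0.5, size=(n, 3)).astype(float)
    X = np.hstack([x_norm, x_bern])
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = rng.binomial(1, p)
    return X, y, beta

X, y, beta = simulate_example1(n=200, rng=np.random.default_rng(0))
```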

To assess the finite sample performance of the proposed methods, we report the percentage of models correctly fitted, underfitted, and overfitted by 1, 2, 3, 4, or 5 or more parameters for the SCAD-AIC (i.e., κ_n = 2), SCAD-BIC (i.e., κ_n = log(n)), SCAD-GCV, AIC, and BIC selectors, as well as for the oracle procedure (i.e., the simulated data were fitted with the true model). Their corresponding standard errors can be calculated as √{p̂(1 − p̂)/1000}, where p̂ is the observed proportion in the 1000 simulations. Moreover, we report the average number of zero coefficients that were correctly (C) and incorrectly (I) identified by the different methods. To compare model fits, we further calculate the following model error for a new observation (x, y),

\mathrm{ME}(\hat{\beta}) = E_x\left\{ \mu(x^T\beta) - \mu(x^T\hat{\beta}) \right\}^2,

where the expectation is taken with respect to the new observed covariate vector x, and μ(xT β) = E(y|x). Then, we report the median of the relative model error (MRME), where the relative model error is defined as RME = ME/MEfull, and MEfull is the model error calculated by fitting the data with the full model.
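Because the expectation in ME is taken over a new covariate vector, in practice it is approximated by Monte Carlo over a large fresh sample; a sketch under the same covariate distribution as in the data-generating code above is given below (again our own illustration).

```python
import numpy as np

def model_error(beta_true, beta_hat, rng, n_new=100_000):
    """Monte Carlo estimate of ME(beta_hat) = E_x{ mu(x'beta) - mu(x'beta_hat) }^2."""
    idx = np.arange(9)
    cov = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    x_norm = rng.multivariate_normal(np.zeros(9), cov, size=n_new)
    x_bern = rng.binomial(1, 0.5, size=(n_new, 3)).astype(float)
    X_new = np.hstack([x_norm, x_bern])
    mu_true = 1.0 / (1.0 + np.exp(-(X_new @ beta_true)))
    mu_hat = 1.0 / (1.0 + np.exp(-(X_new @ beta_hat)))
    return np.mean((mu_true - mu_hat) ** 2)

# The relative model error of a fit is ME(beta_hat) / ME(beta_hat_full).
```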

Table 1 shows that the MRME of SCAD-BIC is smaller than that of SCAD-AIC. As the sample size increases, the MRME of SCAD-BIC approaches that of the oracle estimator, whereas the MRME of SCAD-AIC remains at the same level. Hence, SCAD-BIC is superior to SCAD-AIC in terms of model error measures in this example. Moreover, Figure 1 presents the boxplots of RMEs, which lead to a similar conclusion.

Table 1.

Simulation results for the logistic regression model. MRME is the median relative model error; C denotes the average number of the nine true zero coefficients that were correctly identified as zero, while I denotes the average number of the three nonzero coefficients that were incorrectly identified as zero; the numbers in parentheses are standard errors; Under, Exact and Overfitted represent the proportions of corresponding models being selected in 1000 Monte Carlo realizations, respectively.

Method      MRME (%)   C          I          Under (%)  Exact (%)  Overfitted by 1 / 2 / 3 / 4 / ≥5 (%)

n = 200
SCAD-AIC    58.03      7.4 (1.6)  0.0 (0.0)  0.2        30.2       23.2 / 19.6 / 13.6 / 7.5 / 5.9
SCAD-BIC    16.90      8.8 (0.5)  0.0 (0.1)  0.7        83.7       13.2 / 2.4 / 0.4 / 0.0 / 0.0
SCAD-GCV    81.91      5.8 (1.9)  0.0 (0.0)  0.0        7.8        12.0 / 18.1 / 18.8 / 17.6 / 25.7
AIC         57.05      7.4 (1.2)  0.0 (0.0)  0.1        19.9       31.5 / 26.7 / 14.8 / 4.3 / 2.8
BIC         18.27      8.8 (0.5)  0.0 (0.1)  0.8        80.4       17.0 / 2.3 / 0.1 / 0.0 / 0.0
Oracle      12.49      9.0 (0.0)  0.0 (0.0)  0.0        100.0      0.0 / 0.0 / 0.0 / 0.0 / 0.0

n = 400
SCAD-AIC    64.45      7.4 (1.5)  0.0 (0.0)  0.0        29.4       21.9 / 21.0 / 15.1 / 8.5 / 4.1
SCAD-BIC    19.03      8.9 (0.4)  0.0 (0.0)  0.0        89.9       7.9 / 1.8 / 0.2 / 0.2 / 0.0
SCAD-GCV    82.13      6.0 (1.9)  0.0 (0.0)  0.0        9.2        14.7 / 18.6 / 19.8 / 15.9 / 21.8
AIC         63.17      7.4 (1.2)  0.0 (0.0)  0.0        21.7       29.2 / 27.3 / 14.6 / 5.5 / 1.7
BIC         20.46      8.8 (0.4)  0.0 (0.0)  0.0        86.3       12.1 / 1.4 / 0.2 / 0.0 / 0.0
Oracle      15.88      9.0 (0.0)  0.0 (0.0)  0.0        100.0      0.0 / 0.0 / 0.0 / 0.0 / 0.0

Figure 1. Boxplot of relative model error (RME) with n = 200 and n = 400.

In model identifications, SCAD-BIC has a higher chance than SCAD-AIC of correctly setting the nine true zero coefficients to zero, while SCAD-BIC is slightly more prone than SCAD-AIC to incorrectly set the three nonzero coefficients to zero when the sample size is small. In addition, SCAD-BIC has a much higher possibility of correctly identifying the true model than that of SCAD-AIC. Moreover, among the overfitted models, the SCAD-BIC method is likely to include only one irrelevant variable, whereas the SCAD-AIC method often includes two or more. When the sample size increases, the SCAD-BIC method yields a lesser degree of overfitting, while the SCAD-AIC method still overfits quite seriously. These results are consistent with the theoretical findings presented in Theorem 1.

It is not surprising that SCAD-GCV performs similarly to SCAD-AIC. Furthermore, the classical AIC and BIC selection criteria behave like SCAD-AIC and SCAD-BIC, respectively, in terms of model errors and identifications. It is of interest to note that SCAD-AIC usually suggests a sparser model than AIC. In sum, SCAD-BIC performs the best.

Example 2. In practice, it is plausible that the full model fails to include some important explanatory variable. This motivates us to mimic such a situation and then assess the various selection procedures. Accordingly, we consider a linear regression model y_i = x_i^T β + ε_i, where the covariate vectors are i.i.d. 13-dimensional multivariate normal with correlation 0.5^{|i−j|} between the ith and jth components, and the ε_i's are i.i.d. N(0, σ²) with σ = 4. In addition, we partition x = (x_full^T, x_exc^T)^T, where x_full contains the d = 12 covariates of the full model and x_exc is the covariate excluded from model fitting. Accordingly, we partition β = (β_full^T, β_exc^T)^T, where β_full is a 12 × 1 vector and β_exc is a scalar. To investigate the performance of the proposed methods under various parameter structures, we let β_full = β_0 + γδ/√n, where β_0 = (3, 1.5, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0)^T, δ = (0, 0, 1.5, 1.5, 1, 1, 0, 0, 0, 0, 0.5, 0.5)^T, γ ranges from 0 to 10, and β_exc = 0.2. Because each candidate model is a subset of the full model containing the 12 covariates in x_full, the above settings ensure that the true model is not included in the set of candidate models.

We simulate 1000 data sets with n = 400, 800, and 1600. To study the performance of the selectors, we define the finite-sample loss efficiency,

\mathrm{LE}(\hat{\beta}_{\hat\lambda}) = \frac{L(\hat{\beta}_{\hat\lambda})}{\inf_\lambda L(\hat{\beta}_\lambda)},

where λ̂ is chosen by SCAD-GCV, SCAD-AIC, or SCAD-BIC. Because β_full depends on the sample size and γ, a sensible way to compare the loss efficiency of a selection criterion across different sample sizes is to compare it at the same value of β_full.
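Operationally, LE compares the loss at the selected λ̂ with the smallest loss attainable on the λ grid; the sketch below (ours, with illustrative names) computes it from the true mean vector, the design matrix, and the path of penalized estimates.

```python
import numpy as np

def squared_loss(mu, X, beta_hat):
    """L(beta_hat) = (1/n) || mu - X beta_hat ||^2, the average squared loss in (13)."""
    return np.mean((mu - X @ beta_hat) ** 2)

def loss_efficiency(mu, X, beta_hats_by_lambda, selected_lambda):
    """LE = L(beta_hat at the selected lambda) / inf_lambda L(beta_hat_lambda) over the grid."""
    losses = {lam: squared_loss(mu, X, b) for lam, b in beta_hats_by_lambda.items()}
    return losses[selected_lambda] / min(losses.values())

# A selector is close to efficient in finite samples when LE is near 1, as SCAD-AIC is in Figure 2(a).
```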

Because the performance of SCAD-GCV is very similar to that of SCAD-AIC, it is not reported here. Figures 2(a) and 2(b) depict the LEs of SCAD-AIC and SCAD-BIC, respectively, across various values of γ. Figure 2(a) clearly indicates that the loss efficiency of SCAD-AIC converges to 1 regardless of the value of γ, which corroborates the theoretical finding in Theorem 2. However, the loss efficiency of SCAD-BIC in Figure 2(b) does not show this tendency. Specifically, when γ is close to 0 (i.e., the model is nearly sparse), SCAD-BIC results in a smaller loss because of its larger penalty. As γ increases so that the full model contains medium-sized coefficients, SCAD-BIC is likely to choose a model that is too sparse and hence incurs a larger loss. When γ becomes large enough that the coefficient β_exc is dominated by the remaining coefficients in β_full, the consistency of SCAD-BIC again tends to yield a smaller loss. In sum, Figure 2 shows that SCAD-AIC is efficient, whereas SCAD-BIC is not. Finally, we examined the efficiencies of AIC and BIC and found that their performances are similar to those of SCAD-AIC and SCAD-BIC, respectively; the results are therefore omitted.

Figure 2. (a) The loss efficiency of SCAD-AIC; (b) the loss efficiency of SCAD-BIC.

Example 3. (Mammographic mass data) Mammography is the most effective method for screening for the presence of breast cancer. However, mammogram interpretations yield approximately a 70% rate of unnecessary biopsies with benign outcomes. Hence, several computer-aided diagnosis (CAD) systems have been proposed to assist physicians in predicting breast biopsy outcomes from the findings of BI-RADS (Breast Imaging Reporting And Data System). To examine the capability of two novel CAD approaches, Elter, Schulz-Wendtland, and Wittenberg (2007) recently analyzed a data set containing 516 benign and 445 malignant instances. We downloaded this data set from UCI Machine Learning Repository, and excluded 131 missing values and 15 coding errors. As a result, we considered a total of 815 cases.

To study the severity (benign or malignant), Elter et al. (2007) considered the following three covariates: x1 (birads), the BI-RADS assessment assigned in a double-review process by physicians (definitely benign = 1 to highly suggestive of malignancy = 5); x2 (age), the patient's age in years; and x3 (mass density): high = 1, iso = 2, low = 3, fat-containing = 4. In addition, we employ the dummy variables x4 to x6 to represent the four mass shapes (round = 1, oval = 2, lobular = 3, irregular = 4), where "irregular" is used as the baseline. Analogously, we apply the dummy variables x7 to x10 to represent the five mass margins (circumscribed = 1, microlobulated = 2, obscured = 3, ill-defined = 4, spiculated = 5), where "spiculated" is used as the baseline.

To predict the value of the binary response, y = 0 (benign) or y = 1 (malignant), on the basis of the 10 explanatory variables, we fit the data with the following logistic regression model:

\log \frac{p(x^T\beta)}{1 - p(x^T\beta)} = \beta_0 + \sum_{j=1}^{10} x_j \beta_j,

where x = (x1, . . . , x10)T, β = (β1, . . . , β10)T, and p(xTβ) is the probability of the case being classified as malignant. As a result, the tuning parameters selected by SCAD-AIC and SCAD-BIC are 0.0332 and 0.1512, respectively, and SCAD-GCV yields the same model as SCAD-AIC. Because the true model is unknown, we follow an anonymous referee's suggestion to include the models selected by the delete-one cross validation (delete-1 CV) and the 5-fold cross validation (5-fold CV). The detailed procedures for fitting those models can be obtained from the first author.

Because the model selected by the delete-1 CV is the same as SCAD-AIC, Table 2 only presents the non-penalized maximum likelihood estimates from the full-model as well as the SCAD-AIC, SCAD-BIC, and 5-fold CV estimates, together with their standard errors. It indicates that the non-penalized maximum likelihood approach fits five spurious variables (x3, x6, and x8 to x10), while SCAD-AIC and 5-fold CV include two variables (x6 and x9) and one variable (x9), respectively, with insignificant effects at a level of 0.05. In contrast, all variables (x1, x2, x4, x5, and x7) selected by SCAD-BIC are significant. Because the sample size n = 815 is large, these findings are consistent with Theorem 1 and the simulation results. In addition, the p-value of the deviance test for assessing the SCAD-BIC model against the full model is 0.41, and as a result, there is no evidence of lack of fit in the SCAD-BIC model.

Table 2.

Estimates for mammographic mass data with standard deviations in parentheses

Variable           MLE            SCAD-AIC       SCAD-BIC       5-Fold CV
Intercept (β0)     −11.04 (1.48)  −11.17 (1.13)  −11.16 (1.07)  −11.49 (1.14)
BIRADS (x1)        2.18 (0.23)    2.19 (0.23)    2.25 (0.23)    2.23 (0.23)
age (x2)           0.05 (0.01)    0.04 (0.01)    0.04 (0.01)    0.05 (0.01)
density (x3)       −0.04 (0.29)   0 (-)          0 (-)          0 (-)
sRound (x4)        −0.98 (0.37)   −0.99 (0.37)   −0.80 (0.34)   −0.81 (0.35)
sOval (x5)         −1.21 (0.32)   −1.22 (0.32)   −1.07 (0.30)   −1.07 (0.30)
sLobular (x6)      −0.53 (0.35)   −0.54 (0.34)   0 (-)          0 (-)
mCircum (x7)       −1.05 (0.42)   −0.98 (0.32)   −1.01 (0.30)   −1.07 (0.31)
mMicro (x8)        −0.03 (0.65)   0 (-)          0 (-)          0 (-)
mObscured (x9)     −0.48 (0.39)   −0.42 (0.30)   0 (-)          −0.47 (0.30)
mIlldef (x10)      −0.09 (0.33)   0 (-)          0 (-)          0 (-)

Based on the model selected by SCAD-BIC, we conclude that a higher BI-RADS assessment or a greater age results in a higher chance of malignancy. In addition, the oval or round mass shape yields lower odds of malignancy than the irregular mass shape, while the odds of malignancy for the lobular mass shape are not significantly different from those for the irregular mass shape. Moreover, the odds of malignancy for the microlobulated, obscured, and ill-defined mass margins are not significantly different from those for the spiculated mass margin. However, the circumscribed mass margin leads to lower odds of malignancy than the other four types of mass margins.

6 Discussion

In the context of variable selection, we propose the generalized information criterion to choose regularization parameters for nonconcave penalized likelihood functions. Furthermore, we study the theoretical properties of GIC. If we believe that the true model is contained in a set of candidate models with the GLIM structure, then the BIC-type selector identifies the true model with probability tending to 1, while GIC selectors with bounded κ_n tend to overfit with positive probability. However, if the true model is only approximated by a family of candidate models, the AIC-type selector is asymptotically loss efficient, whereas the BIC-type selector, in general, is not. Simulation studies support the finite sample performance of the selection criteria.

Although we establish the theoretical properties of GIC for GLIM, the application of GIC is not limited to this setting. For example, GIC could be employed to select the regularization parameter in Cox proportional hazards models and quasi-likelihood models. We believe that such extensions would further enhance the usefulness of GIC in real data analysis.


Acknowledgments

We are grateful to the editor, the associate editor and three referees for their helpful and constructive comments that substantially improved an earlier draft. Zhang's research is supported by National Institute on Drug Abuse grants R21 DA024260 and P50 DA10075 as a research assistant. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH. Li's research is supported by National Science Foundation grants DMS 0348869, 0722351 and 0926187.

Appendix A. Proof of Theorem 1

Before proving the theorem, we show the following two lemmas. Then, Theorems 1(A) and 1(B) follow from Lemmas 1 and 2, respectively. For the sake of convenience, we denote D(y;μ^λ) in (7) and D(y;μ^α) in (8) by D(λ) and D*(α), respectively.

Lemma 1. Suppose the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)—(C). Assume that there exists a positive constant M such that κn < M. Then under condition (C4), we have

P\left\{ \inf_{\lambda \in \Omega_-} \mathrm{GIC}_{\kappa_n}(\lambda) > \mathrm{GIC}^*_{\kappa_n}(\alpha_F) \right\} \to 1 \qquad (16)

and

\liminf_{n\to\infty} P\left\{ \inf_{\lambda \in \Omega_0} \mathrm{GIC}_{\kappa_n}(\lambda) > \mathrm{GIC}^*_{\kappa_n}(\alpha_F) \right\} \ge \pi \qquad (17)

as n → ∞.

Proof. For a given λ, the non-penalized maximum likelihood estimator β̂_{α_λ} maximizes ℓ(β) under model α_λ. This implies D(λ) ≥ D*(α_λ) for any given λ, which leads to

\mathrm{GIC}_{\kappa_n}(\lambda) = \frac{D(\lambda)}{n} + \frac{\kappa_n \mathrm{df}_\lambda}{n} > \frac{D^*(\alpha_\lambda)}{n}. \qquad (18)

Then, we obtain that

\mathrm{GIC}_{\kappa_n}(\lambda) - \mathrm{GIC}^*_{\kappa_n}(\alpha_F) > \frac{D^*(\alpha_\lambda)}{n} - \frac{D^*(\alpha_F)}{n} - \frac{\kappa_n d_{\alpha_F}}{n}

holds for any λ ∈ Ω_− = {λ : α_λ ⊉ α_0}. This, together with condition (C4) and the fact that κ_n d_{α_F}/n = o_P(1) (which follows from κ_n < M), gives

P\left\{ \inf_{\lambda \in \Omega_-} \mathrm{GIC}_{\kappa_n}(\lambda) - \mathrm{GIC}^*_{\kappa_n}(\alpha_F) > 0 \right\} \ge P\left\{ \min_{\alpha \not\supseteq \alpha_0} \frac{D^*(\alpha)}{n} - \frac{D^*(\alpha_F)}{n} - \frac{\kappa_n d_{\alpha_F}}{n} > 0 \right\} = P\left\{ \min_{\alpha \not\supseteq \alpha_0} c_\alpha - c_{\alpha_F} + o_P(1) > 0 \right\} \to 1 \qquad (19)

as n → ∞. The last step uses the fact that the number of underfitted models is finite, so that min_{α⊉α_0} c_α is strictly greater than c_{α_F} (which equals c_{α_0}) under condition (C4). The above equation yields (16) immediately.

Next we show that the model selected by the GIC selector with bounded κ_n is overfitted with probability bounded away from zero. For any λ ∈ Ω_0, α_λ = α_0. Then, subtracting GIC*_{κ_n}(α_F) from both sides of (18) and taking the infimum of GIC_{κ_n}(λ) over λ ∈ Ω_0, we have

\inf_{\lambda \in \Omega_0} \mathrm{GIC}_{\kappa_n}(\lambda) - \mathrm{GIC}^*_{\kappa_n}(\alpha_F) > \frac{D^*(\alpha_0)}{n} - \left[ \frac{D^*(\alpha_F)}{n} + \frac{\kappa_n d_{\alpha_F}}{n} \right]. \qquad (20)

Note that the right-hand side of the above equation does not involve λ.

Observe that D*(α_0) − D*(α_F) = 2[ℓ(β̂_{α_F}) − ℓ(β̂_{α_0})], where β̂_{α_0} and β̂_{α_F} are the non-penalized maximum likelihood estimators computed under the true and full models, respectively. Under regularity conditions (A)–(C) in Fan and Li (2001), it can be shown that β̂_{α_0} and β̂_{α_F} are consistent and asymptotically normal. Hence, the likelihood ratio test statistic 2[ℓ(β̂_{α_F}) − ℓ(β̂_{α_0})] →_L χ²_{d_{α_F} − d_{α_0}}. This result, together with κ_n < M and (20), yields

P\left\{ \inf_{\lambda \in \Omega_0} \mathrm{GIC}_{\kappa_n}(\lambda) - \mathrm{GIC}^*_{\kappa_n}(\alpha_F) > 0 \right\} \ge P\left\{ \frac{2}{n}\left[ \ell(\hat\beta_{\alpha_F}) - \ell(\hat\beta_{\alpha_0}) \right] - \frac{\kappa_n d_{\alpha_F}}{n} > 0 \right\} \ge P\left\{ 2\left[ \ell(\hat\beta_{\alpha_F}) - \ell(\hat\beta_{\alpha_0}) \right] > d_{\alpha_F} M \right\} \to P\left\{ \chi^2_{d_{\alpha_F} - d_{\alpha_0}} \ge d_{\alpha_F} M \right\} \equiv \pi.

This implies (17), and we complete the proof of Lemma 1.

To prove Theorem 1(B), we present the following lemma.

Lemma 2. Suppose the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)–(C). Assume conditions (C1)–(C4) hold, and let λ_n = √(κ_n/n). If κ_n satisfies κ_n → ∞ and λ_n → 0 as n → ∞, then

P\left\{ \mathrm{GIC}_{\kappa_n}(\lambda_n) = \mathrm{GIC}^*_{\kappa_n}(\alpha_0) \right\} \to 1, \qquad (21)

and

P\left\{ \inf_{\lambda \in \Omega_- \cup \Omega_+} \mathrm{GIC}_{\kappa_n}(\lambda) > \mathrm{GIC}_{\kappa_n}(\lambda_n) \right\} \to 1. \qquad (22)

Proof. Without loss of generality, we assume that the first dα0 coefficients of βα0 in the true model are nonzero and the rest are zeros. Note that the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)—(C). These conditions, together with condition (C3) and the assumptions of κn stated in Lemma 2, allow us to apply Fan and Li's Theorems 1 and 2 to show that, with probability tending to 1, the last ddα0 components of β^λn are zeros and the first dα0 components of β^λn satisfy the normal equations

\frac{\partial \ell(\hat{\beta}_{\lambda_n})}{\partial \beta_j} + b_{\lambda_n j} = 0 \quad \text{for } j = 1, \ldots, d_{\alpha_0}, \qquad (23)

where b_{λ_n j} = p′_{λ_n}(|β̂_{λ_n j}|) sgn(β̂_{λ_n j}), and β̂_{λ_n j} is the j-th component of β̂_{λ_n}.

Using the oracle property, β̂_{λ_n j} converges to β_j, whose magnitude is bounded below by min_{1≤j≤d_{α_0}} |β_j| > 0. Then, under conditions (C1) and (C2), there exists a constant m such that p′_{λ_n}(|β̂_{λ_n j}|) = 0 whenever min_{1≤j≤d_{α_0}} |β_j| > mλ_n, which holds as n gets large. Accordingly, P(b_{λ_n j} = 0) → 1 for j = 1, ..., d_{α_0}. This, together with (23), implies that, with probability tending to 1, the first d_{α_0} components of β̂_{λ_n} solve the normal equations

\frac{\partial \ell(\hat{\beta}_{\lambda_n})}{\partial \beta_j} = 0, \quad j = 1, \ldots, d_{\alpha_0},

and the remaining d − d_{α_0} components are zeros. This is exactly the same as the normal equation for the non-penalized maximum likelihood estimator β̂_{α_0}. As a result, β̂_{α_0} = β̂_{λ_n} with probability tending to 1. It follows that

P\left\{ D(\lambda_n) = D^*(\alpha_0) \right\} = P\left\{ \ell(\hat{\beta}_{\lambda_n}) = \ell(\hat{\beta}_{\alpha_0}) \right\} \to 1.

Moreover, using the result from Proposition 1, we have P{df_{λ_n} = d_{α_0}} → 1. Consequently,

P\left\{ \mathrm{GIC}_{\kappa_n}(\lambda_n) = \mathrm{GIC}^*_{\kappa_n}(\alpha_0) \right\} = P\left\{ \frac{1}{n}\left( D(\lambda_n) - D^*(\alpha_0) \right) + \frac{\kappa_n}{n}\left( \mathrm{df}_{\lambda_n} - d_{\alpha_0} \right) = 0 \right\} \to 1.

The proof of (21) is complete.

We next show that, for any λ which cannot identify the true model, the resulting GICκn(λ) is consistently larger than GICκn(λn). To this end, we consider two cases, underfitting and overfitting.

Case 1: Underfitted model (i.e., λ ∈ Ω_− so that α_λ ⊉ α_0). Applying (18) and (21), we obtain that, with probability tending to 1,

\mathrm{GIC}_{\kappa_n}(\lambda) - \mathrm{GIC}_{\kappa_n}(\lambda_n) > \frac{1}{n} D^*(\alpha_\lambda) - \frac{1}{n} D^*(\alpha_0) - \frac{\kappa_n}{n} d_{\alpha_0}.

Subsequently, taking the infimum over λ ∈ Ω_− on both sides of the above equation yields

\inf_{\lambda \in \Omega_-} \mathrm{GIC}_{\kappa_n}(\lambda) - \mathrm{GIC}_{\kappa_n}(\lambda_n) > \min_{\alpha \not\supseteq \alpha_0} \frac{1}{n} D^*(\alpha) - \frac{1}{n} D^*(\alpha_0) - \frac{\kappa_n}{n} d_{\alpha_0}. \qquad (24)

This, in conjunction with condition (C4) and the number of underfitted models being finite, leads to

P\left\{ \min_{\alpha \not\supseteq \alpha_0} \frac{1}{n} D^*(\alpha) - \frac{1}{n} D^*(\alpha_0) - \frac{\kappa_n}{n} d_{\alpha_0} > 0 \right\} = P\left\{ \min_{\alpha \not\supseteq \alpha_0} c_\alpha - c_{\alpha_0} + o_P(1) > 0 \right\} \to 1,

as n → ∞. As a result,

P\left\{ \inf_{\lambda \in \Omega_-} \mathrm{GIC}_{\kappa_n}(\lambda) > \mathrm{GIC}_{\kappa_n}(\lambda_n) \right\} \to 1.

Case 2: Overfitted model (i.e., λ ∈ Ω_+ so that α_λ ⊃ α_0 and α_λ ≠ α_0). As established above, with probability tending to 1, D(λ_n) = D*(α_0). In addition, Proposition 1 indicates that df_λ − df_{λ_n} ≥ τ + o_P(1) for some τ > 0 and all λ ∈ Ω_+. Moreover, as noted in the proof of Lemma 1, D(λ) ≥ D*(α_λ). Therefore, with probability tending to 1,

n\left( \mathrm{GIC}_{\kappa_n}(\lambda) - \mathrm{GIC}_{\kappa_n}(\lambda_n) \right) = D(\lambda) - D(\lambda_n) + (\mathrm{df}_\lambda - \mathrm{df}_{\lambda_n})\kappa_n \ge D^*(\alpha_\lambda) - D^*(\alpha_0) + (\tau + o_P(1))\kappa_n.

Because D*(α_0) − D*(α) asymptotically follows a χ² distribution with d_α − d_{α_0} degrees of freedom for any α ⊃ α_0, it is of order O_P(1). Accordingly,

\inf_{\lambda \in \Omega_+} n\left( \mathrm{GIC}_{\kappa_n}(\lambda) - \mathrm{GIC}_{\kappa_n}(\lambda_n) \right) \ge \min_{\alpha \supset \alpha_0}\left\{ D^*(\alpha) - D^*(\alpha_0) \right\} + (\tau + o_P(1))\kappa_n. \qquad (25)

Because κ_n goes to infinity as n → ∞, the right-hand side of equation (25) goes to positive infinity, which guarantees that the left-hand side of equation (25) is positive as n → ∞. Hence, we finally have

P\left\{ \inf_{\lambda \in \Omega_+} \mathrm{GIC}_{\kappa_n}(\lambda) > \mathrm{GIC}_{\kappa_n}(\lambda_n) \right\} \to 1.

The results of Cases 1 and 2 complete the proof of (22).

Proof of Theorem 1. Lemma 1 implies that GIC_{κ_n}(λ) is, with probability tending to one, larger than GIC*_{κ_n}(α_F) for any λ that produces an underfitted model. Thus, the model selected by minimizing GIC_{κ_n}(λ) contains all of the significant variables with probability tending to one. In addition, Lemma 1 indicates that there is a nonzero probability that the smallest value of GIC_{κ_n}(λ) over λ ∈ Ω_0 is larger than that of the full model. As a result, there is a positive probability that no λ associated with the true model is selected by GIC_{κ_n}(λ) as the regularization parameter. Theorem 1(A) follows.

Lemma 2 indicates that the model identified by λn converges to the true model as the sample size gets large. In addition, it shows λ's that fail to identify the true model cannot be selected by GICκn(λ) asymptotically. Theorem 1(B) follows.

Appendix B. Proofs of Theorem 2 and Corollary 1

Before proving the theorem, we establish the following two lemmas. Lemma 3 evaluates the difference between a penalized mean estimator μ^λ and its corresponding least squares mean estimator μ^αλ, while Lemma 4 demonstrates that the losses of μ^λ and μ^αλ are asymptotically equivalent.

Lemma 3. Under condition (C5),

\| \hat{\mu}_\lambda - \hat{\mu}_{\alpha_\lambda} \|^2 \le nC\|b\|^2,

where C is the constant number in condition (C5) and b is defined in condition (C8).

Proof. Without loss of generality, we assume that the first dαλ components of β^λ and β^αλ are nonzero, and denote them by β^λ(1) and β^αλ(1), respectively. Thus, μ^λ=Xβ^λ=Xαλβ^λ(1) and μ^αλ=Xβ^αλ=Xαλβ^αλ(1). From the proofs of Theorems 1 and 2 in Fan and Li (2001), with probability tending to 1, we have that β^λ(1) is the solution of the following equation,

\frac{1}{n} X_{\alpha_\lambda}^T\left( y - X_{\alpha_\lambda}\beta_{\lambda(1)} \right) + b_{(1)} = 0,

where b_{(1)} is the subvector of b that corresponds to β̂_{λ(1)}. Accordingly,

\hat{\beta}_{\lambda(1)} = \left( X_{\alpha_\lambda}^T X_{\alpha_\lambda} \right)^{-1} X_{\alpha_\lambda}^T y + \left( \frac{1}{n} X_{\alpha_\lambda}^T X_{\alpha_\lambda} \right)^{-1} b_{(1)} = \hat{\beta}_{\alpha_\lambda(1)} + V_{\alpha_\lambda} b_{(1)},

where V_{α_λ} ≡ ((1/n) X_{α_λ}^T X_{α_λ})^{-1}. In addition, the eigenvalues of V_{α_λ} are bounded under condition (C5). Hence,

\| \hat{\mu}_\lambda - \hat{\mu}_{\alpha_\lambda} \|^2 = \left\| X_{\alpha_\lambda}\left( \hat{\beta}_{\lambda(1)} - \hat{\beta}_{\alpha_\lambda(1)} \right) \right\|^2 = n\, b_{(1)}^T V_{\alpha_\lambda} b_{(1)} \le nC\|b\|^2,

for some positive constant number C. This completes the proof.

Lemma 4. If conditions (C5)—(C8) hold, then

\sup_{\lambda \in [0, \lambda_{\max}]} \left| \frac{L(\hat{\beta}_\lambda)}{L(\hat{\beta}_{\alpha_\lambda})} - 1 \right| \to 0,

in probability.

Proof. After algebraic simplification, we have

L(\hat{\beta}_\lambda) - L(\hat{\beta}_{\alpha_\lambda}) = \frac{\| \hat{\mu}_{\alpha_\lambda} - \hat{\mu}_\lambda \|^2}{n} + \frac{2(\mu - \hat{\mu}_{\alpha_\lambda})^T(\hat{\mu}_{\alpha_\lambda} - \hat{\mu}_\lambda)}{n} \equiv I_1 + I_2.

Under conditions (C6) and (C7), Li (1987) showed that

\sup_{\alpha \in A} \left| \frac{L(\hat{\beta}_\alpha)}{R(\hat{\beta}_\alpha)} - 1 \right| \to 0.

This, together with condition (C8) and Lemma 3, implies

\sup_{\lambda \in [0, \lambda_{\max}]} \frac{I_1}{L(\hat{\beta}_{\alpha_\lambda})} = \sup_{\lambda \in [0, \lambda_{\max}]} \left\{ \frac{\| \hat{\mu}_{\alpha_\lambda} - \hat{\mu}_\lambda \|^2}{n R(\hat{\beta}_{\alpha_\lambda})} - \frac{\| \hat{\mu}_{\alpha_\lambda} - \hat{\mu}_\lambda \|^2}{n L(\hat{\beta}_{\alpha_\lambda})} \left[ \frac{L(\hat{\beta}_{\alpha_\lambda})}{R(\hat{\beta}_{\alpha_\lambda})} - 1 \right] \right\} \to 0.

Applying the Cauchy–Schwarz inequality, we next obtain

|I_2| \le \frac{2 \| \mu - \hat{\mu}_{\alpha_\lambda} \| \, \| \hat{\mu}_{\alpha_\lambda} - \hat{\mu}_\lambda \|}{n} = 2\sqrt{L(\hat{\beta}_{\alpha_\lambda})} \cdot \frac{1}{\sqrt{n}} \| \hat{\mu}_{\alpha_\lambda} - \hat{\mu}_\lambda \|.

As a result, sup_{λ∈[0,λ_max]} |I_2|/L(β̂_{α_λ}) → 0, and Lemma 4 follows immediately.

Proof of Theorem 2. To show the asymptotic efficiency of the AIC-type selector, it suffices to demonstrate that minimizing GICκnLS(λ) with κn → 2 is the same as minimizing L(β^λ) asymptotically. To this end, we need to prove that, in probability,

\sup_{\lambda \in [0, \lambda_{\max}]} \frac{\left| \mathrm{GIC}^{LS}_{\kappa_n}(\lambda) - \|\varepsilon\|^2/n - L(\hat{\beta}_\lambda) \right|}{L(\hat{\beta}_\lambda)} \to 0. \qquad (26)

Let the projection matrix corresponding to model α be H_α = X_α(X_α^T X_α)^{-1} X_α^T. Then,

\mathrm{GIC}^{LS}_{\kappa_n}(\lambda) = \frac{\| y - X\hat{\beta}_\lambda \|^2}{n} + \frac{\kappa_n \sigma^2 d_{\alpha_\lambda}}{n} = \frac{\| y - \hat{\mu}_{\alpha_\lambda} \|^2}{n} + \frac{\| \hat{\mu}_{\alpha_\lambda} - \hat{\mu}_\lambda \|^2}{n} + \frac{\kappa_n \sigma^2 d_{\alpha_\lambda}}{n} = \frac{\|\varepsilon\|^2}{n} + L(\hat{\beta}_\lambda) + \left[ L(\hat{\beta}_{\alpha_\lambda}) - L(\hat{\beta}_\lambda) \right] + \frac{\| \hat{\mu}_{\alpha_\lambda} - \hat{\mu}_\lambda \|^2}{n} + \frac{2\varepsilon^T(I - H_{\alpha_\lambda})\mu}{n} + \frac{2\left( \sigma^2 d_{\alpha_\lambda} - \varepsilon^T H_{\alpha_\lambda}\varepsilon \right)}{n} + \frac{(\kappa_n - 2)\sigma^2 d_{\alpha_\lambda}}{n}. \qquad (27)

Let J_1 = L(β̂_{α_λ}) − L(β̂_λ), J_2 = ‖μ̂_{α_λ} − μ̂_λ‖²/n, J_3 = 2ε^T(I − H_{α_λ})μ/n, J_4 = 2(σ²d_{α_λ} − ε^T H_{α_λ}ε)/n, and J_5 = (κ_n − 2)σ²d_{α_λ}/n. Using Lemma 3, Lemma 4, and arguments similar to those in Li (1987), we obtain that, in probability,

\sup_{\lambda \in [0, \lambda_{\max}]} \frac{|J_j|}{L(\hat{\beta}_\lambda)} \to 0, \quad \text{for } j = 1, \ldots, 4.

Because κ_n − 2 → 0 and n^{-1}σ²d_{α_λ}/R(β̂_λ) is bounded by 1, we can further show that sup_{λ∈[0,λ_max]} |J_5|/L(β̂_λ) → 0 using Lemma 4 and condition (C7). Accordingly, (26) holds, which implies that the difference between GIC^{LS}_{κ_n}(λ) − ‖ε‖²/n and L(β̂_λ) is negligible in comparison with L(β̂_λ). This completes the proof.

Proof of Corollary 1. When σ2 is unknown, the GICκnLS(λ) in (27) becomes

\mathrm{GIC}^{LS}_{\kappa_n}(\lambda) = \frac{\|\varepsilon\|^2}{n} + L(\hat{\beta}_\lambda) + J_1 + J_2 + J_3 + J_4 + J_5 + \frac{2(\tilde{\sigma}^2 - \sigma^2) d_{\alpha_\lambda}}{n}.

Using Lemma 4 and condition (C7), we have

\sup_{\lambda \in [0, \lambda_{\max}]} \frac{\left| 2(\tilde{\sigma}^2 - \sigma^2) d_{\alpha_\lambda}/n \right|}{L(\hat{\beta}_\lambda)} \to 0

in probability, which completes the proof.

Appendix C. Sufficient conditions for (C8)

We first present three sufficient conditions for (C8).

  • (S1) There exists a constant M_1 such that λ_max satisfies √n λ_max < M_1 for all n.

  • (S2) There exists a constant M_2 such that the penalty satisfies p′_λ(θ) ≤ M_2 λ for any θ.

  • (S3) The average approximation error of the full model α_F, i.e., Δ_{α_F} ≡ ‖μ − H_{α_F}μ‖²/n, satisfies nΔ_{α_F}/d → ∞ as n → ∞.

We next provide motivation for these conditions. Assume that the true model is x_i^T β_0 = ∑_{j=1}^d x_{ij}β_{j0} for i = 1, ..., n, and write a_n = max_{1≤j≤d}{p′_{λ_n}(|β_{j0}|) : β_{j0} ≠ 0}. Under the condition d = O(n^ν) with ν < 1/4, Fan and Peng (2004) proved that there exists a local minimizer of the penalized least squares function such that ‖β̂ − β_0‖ = O_P{√d(n^{-1/2} + a_n)}. Thus, conditions (S1) and (S2) ensure that the penalized estimator is √(n/d)-consistent. As for condition (S3), consider a case where the full model misses an important variable x_exc that is orthogonal to the rest of the variables. Accordingly, Δ_{α_F} = n^{-1}∑_{i=1}^n x_{exc,i}² β_exc², and hence nΔ_{α_F}/d is of the same order as nβ_exc²/d. Consequently, (S3) holds if the coefficient β_exc satisfies √(n/d)|β_exc| → ∞, which is valid, for example, when β_exc is fixed.

In contrast to imposing condition (S3) on the error of the full model, an alternative condition (S3*) motivated by Fan and Peng (2004) is given below.

  • (S3*) The proportion of penalized coefficients satisfies sup_{λ∈[0,λ_max]} (1/d_{α_λ}) ∑_{j=1}^d I(0 < |β̂_{λj}| ≤ mλ) → 0 as n → ∞, where m is defined in (C2).

Condition (S3*) means that the proportion of coefficients of small or moderate size (0 < √n|β̂_{λj}| ≤ mM_1) is vanishing. This condition is satisfied under the identifiability assumption of Fan and Peng (2004) (i.e., min_{j:β_{0j}≠0} |β_{0j}|/λ → ∞ as n → ∞). As a byproduct, a similar condition can also be established for the adaptive LASSO, which employs data-driven penalties with b_j = λ/|β̂_{αj}|^k for some k > 0 to cope with the bias of LASSO. In sum, (S1) to (S3) (or (S3*)) are mild conditions, and the two propositions given below demonstrate their sufficiency for condition (C8).

Proposition 2. Under (S1) to (S3), condition (C8) holds.

Proof. It is noteworthy that

R(\hat{\beta}_{\alpha_\lambda}) = \Delta_{\alpha_\lambda} + \frac{d_{\alpha_\lambda}\sigma^2}{n} \ge \Delta_{\alpha_F} + \frac{d_{\alpha_\lambda}\sigma^2}{n}. \qquad (28)

On the right-hand side of (28), the first term dominates the second term via condition (S3). From (S1), (S2), and (28), we have that

\frac{\|b\|^2}{R(\hat{\beta}_{\alpha_\lambda})} \le \frac{M_2^2 \lambda^2 d_{\alpha_\lambda}}{R(\hat{\beta}_{\alpha_\lambda})} \le \frac{M_2^2\, n\lambda_{\max}^2}{n\Delta_{\alpha_F}/d},

which goes to zero independently of λ. This completes the proof.

Proposition 3. Assume that the penalty function satisfies condition (C2). Under (S1), (S2), and (S3*), condition (C8) holds.

Proof. Under condition (C2), the components of b are zero except for those with |β̂_{λj}| ≤ mλ. Then, employing (S2) and (28), we obtain that

\frac{\|b\|^2}{R(\hat{\beta}_{\alpha_\lambda})} \le \frac{\|b\|^2}{n^{-1} d_{\alpha_\lambda}\sigma^2} \le \frac{M_2^2}{\sigma^2} \cdot n\lambda_{\max}^2 \cdot \frac{1}{d_{\alpha_\lambda}}\sum_{j=1}^{d} I\left( 0 < |\hat{\beta}_{\lambda j}| \le m\lambda \right).

On the right-hand side of the above equation, the first term is constant, the second term is bounded under (S1), and the third term goes to zero uniformly in λ under (S3*). This completes the proof.

Footnotes

7 Supplemental materials

Technical Details for Remark 5: Detailed discussions, justifications, and proofs referenced in Remark 5 are available under the Paper information link at the JASA website http://pubs.amstat.org/loi/jasa (pdf).

References

  1. Akaike H. A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control. 1974;19:716–723.
  2. Craven P, Wahba G. Smoothing Noisy Data with Spline Functions. Numerische Mathematik. 1979;31:377–403.
  3. Efron B, Hastie T, Johnstone IM, Tibshirani R. Least Angle Regression. The Annals of Statistics. 2004;32:407–499.
  4. Elter M, Schulz-Wendtland R, Wittenberg T. The Prediction of Breast Cancer Biopsy Outcomes Using Two CAD Approaches that Both Emphasize an Intelligible Decision Process. Medical Physics. 2007;34:4164–4172. doi: 10.1118/1.2786864.
  5. Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  6. Fan J, Li R. Variable Selection for Cox's Proportional Hazards Model and Frailty Model. The Annals of Statistics. 2002;30:74–99.
  7. Fan J, Li R. New Estimation and Model Selection Procedures for Semiparametric Modeling in Longitudinal Data Analysis. Journal of the American Statistical Association. 2004;99:710–723.
  8. Fan J, Peng H. Nonconcave Penalized Likelihood with a Diverging Number of Parameters. The Annals of Statistics. 2004;32:928–961.
  9. Frank IE, Friedman JH. A Statistical View of Some Chemometrics Regression Tools. Technometrics. 1993;35:109–148.
  10. Hunter D, Li R. Variable Selection Using MM Algorithms. The Annals of Statistics. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
  11. Konishi S, Kitagawa G. Generalised Information Criteria in Model Selection. Biometrika. 1996;83:875–890.
  12. Li K-C. Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set. The Annals of Statistics. 1987;15:958–975.
  13. Li R, Liang H. Variable Selection in Semiparametric Regression Modeling. The Annals of Statistics. 2008;36:261–286. doi: 10.1214/009053607000000604.
  14. McCullagh P, Nelder JA. Generalized Linear Models. Chapman & Hall/CRC; New York: 1989.
  15. McQuarrie ADR, Tsai C-L. Regression and Time Series Model Selection. 1st ed. World Scientific Publishing Co., Pte. Ltd.; Singapore: 1998.
  16. Nishii R. Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression. The Annals of Statistics. 1984;12:758–765.
  17. Park M-Y, Hastie T. An L1 Regularization-Path Algorithm for Generalized Linear Models. Journal of the Royal Statistical Society, Series B. 2007;69:659–677.
  18. Schwarz G. Estimating the Dimension of a Model. The Annals of Statistics. 1978;6:461–464.
  19. Shao J. An Asymptotic Theory for Linear Model Selection. Statistica Sinica. 1997;7:221–264.
  20. Shibata R. Asymptotically Efficient Selection of the Order of the Model for Estimating Parameters of a Linear Process. The Annals of Statistics. 1980;8:147–164.
  21. Shibata R. An Optimal Selection of Regression Variables. Biometrika. 1981;68:45–54.
  22. Shibata R. Approximation Efficiency of a Selection Procedure for the Number of Regression Variables. Biometrika. 1984;71:43–49.
  23. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  24. Wang H, Leng C. Unified LASSO Estimation via Least Squares Approximation. Journal of the American Statistical Association. 2007;102:1039–1048.
  25. Wang H, Li G, Tsai C-L. Regression Coefficient and Autoregressive Order Shrinkage and Selection via the LASSO. Journal of the Royal Statistical Society, Series B. 2007a;69:63–78.
  26. Wang H, Li R, Tsai C-L. Tuning Parameter Selectors for the Smoothly Clipped Absolute Deviation Method. Biometrika. 2007b;94:553–568. doi: 10.1093/biomet/asm053.
  27. Yang Y. Can the Strengths of AIC and BIC Be Shared? A Conflict Between Model Identification and Regression Estimation. Biometrika. 2005;92:937–950.
  28. Yuan M, Lin Y. On the Non-Negative Garrotte Estimator. Journal of the Royal Statistical Society, Series B. 2007;69:143–161.
  29. Zhang C-H. Penalized Linear Unbiased Selection. Technical Report No. 2007-003, Department of Statistics, Rutgers University; 2007.
  30. Zhang HH, Lu W. Adaptive LASSO for Cox's Proportional Hazards Model. Biometrika. 2007;94:691–703.
  31. Zou H. The Adaptive LASSO and Its Oracle Properties. Journal of the American Statistical Association. 2006;101:1418–1429.
  32. Zou H, Hastie T, Tibshirani R. On the Degrees of Freedom of the LASSO. The Annals of Statistics. 2007;35:2173–2192.
  33. Zou H, Li R. One-Step Sparse Estimates in Nonconcave Penalized Likelihood Models. The Annals of Statistics. 2008;36:1509–1533. doi: 10.1214/009053607000000802.
