Summary
The penalised least squares approach with the smoothly clipped absolute deviation penalty has been demonstrated to be an attractive regression shrinkage and selection method. It not only automatically and consistently selects the important variables, but also produces estimators that are as efficient as the oracle estimator. However, these attractive features depend on appropriate choice of the tuning parameter. We show that the commonly used generalised crossvalidation cannot select the tuning parameter satisfactorily, with a nonignorable overfitting effect in the resulting model. In addition, we propose a bic tuning parameter selector, which is shown to identify the true model consistently. Simulation studies are presented to support the theoretical findings, and an empirical example illustrates its use on the Female Labour Supply data.
Keywords: aic, bic, Generalised crossvalidation, Least absolute shrinkage and selection operator, Smoothly clipped absolute deviation
1. Introduction
In regression analysis, an underfitted model can lead to severely biased estimation and prediction. In contrast, an overfitted model can seriously degrade the efficiency of the resulting parameter estimates and predictions. Hence, selecting a parsimonious model with good predictive ability is essential.
Traditional model selection criteria, such as aic (Akaike, 1973) and bic (Schwarz, 1978), suffer from a number of limitations. Their major drawback arises because parameter estimation and model selection are two different processes, which can result in instability (Breiman, 1996) and complicated stochastic properties (Fan & Li, 2001). Moreover, the total number of candidate models increases exponentially as the number of covariates increases.
To overcome the deficiency of traditional methods, Fan & Li (2001) proposed the smoothly clipped absolute deviation or SCAD method, which estimates parameters while simultaneously selecting important variables. As compared with another popular regression shrinkage and selection method, the least absolute shrinkage and selection operator or Lasso of Tibshirani (1996), the smoothly clipped absolute deviation method not only selects important variables consistently, but also produces parameter estimators as efficient as if the true model were known, i.e., the oracle estimator, a property not enjoyed by the Lasso. The above features of the smoothly clipped absolute deviation method rely on the proper choice of tuning parameter, or regularisation parameter, which is usually selected by generalised crossvalidation (Craven & Wahba, 1979).
We show that the optimal tuning parameter selected by generalised crossvalidation has a nonignorable overfitting effect even as the sample size goes to infinity. Moreover, we propose a bic-based tuning parameter selector for the smoothly clipped absolute deviation method, and prove that the proposed procedure identifies the true model consistently.
2. The Smoothly Clipped Absolute Deviation Method
Consider the linear regression model,
yi = xi′β + ∊i,    (2.1)
where yi is the response from the ith subject, xi = (xi1, ⋯ , xid)′ is the associated d-dimensional explanatory covariate, β = (β1, ⋯ , βd)′ is the regression coefficient vector, and ∊i is the random error with mean 0 and variance σ². Let (xi, yi), i = 1, ⋯ , n, be a random sample from (2.1). To select variables and estimate parameters simultaneously, the smoothly clipped absolute deviation method of Fan & Li (2001) estimates β by minimising the penalised least squares function
(1/2)∥Y − Xβ∥² + n ∑_{j=1}^d pλ(|βj|),    (2.2)
where Y = (y1, ⋯ , yn)′, X = (x1, ⋯ , xn)′, ∥ · ∥ stands for the Euclidean norm, and pλ(·) is the smoothly clipped absolute deviation penalty with a tuning parameter λ to be selected by a data-driven method. The penalty pλ(·) satisfies pλ(0) = 0, and its first-order derivative is
p′λ(t) = λ{I(t ≤ λ) + (aλ − t)+ I(t > λ)/((a − 1)λ)}    (t > 0),
where a is some constant usually taken to be a = 3.7 (Fan & Li, 2001), and (t)+ = tI{t > 0} denotes the positive part of t. For a given tuning parameter, we denote the estimator obtained by minimising (2.2) by β̂λ = (β̂λ1, ⋯ , β̂λd)′.
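As an illustration only, the derivative above can be evaluated numerically. The short Python sketch below is our own (the authors' code is in Matlab); the function name is an assumption and the conventional choice a = 3.7 is used.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """First derivative p'_lambda(t) of the smoothly clipped absolute deviation
    penalty for t >= 0: equal to lambda on [0, lambda], decaying linearly to 0
    on (lambda, a*lambda], and 0 beyond a*lambda."""
    t = np.asarray(t, dtype=float)
    return lam * (t <= lam) + np.maximum(a * lam - t, 0.0) / (a - 1.0) * (t > lam)
```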
Fan & Li (2001) showed that if λ → 0 and √n λ → ∞ as n → ∞, the method consistently identifies irrelevant variables by producing zero solutions for their associated regression coefficients. In addition, the method estimates the coefficients of the relevant variables with the same efficiency as if the true model were known, which is referred to as the oracle property (Fan & Li, 2001). Hence, the choice of λ is critical. In practice, λ is usually selected by minimising the generalised crossvalidation criterion
gcvλ = σ̂²λ/(1 − dfλ/n)²,    (2.3)
where σ̂²λ = ∥Y − Xβ̂λ∥²/n, dfλ is the generalised degrees of freedom (Fan & Li, 2001) given by
dfλ = tr{X(X′X + nΣλ)−1X′},
and Σλ = diag{p′λ(|β̂λ1|)/|β̂λ1|, ⋯ , p′λ(|β̂λd|)/|β̂λd|}. The diagonal elements of Σλ are the coefficients of the quadratic terms in the local quadratic approximation to the smoothly clipped absolute deviation penalty function pλ(·) (Fan & Li, 2001). Since some coefficients of the estimator of β are exactly equal to zero, dfλ is calculated by replacing X with its submatrix corresponding to the selected covariates, and by replacing Σλ with its corresponding submatrix. The resulting optimal tuning parameter is λ̂gcv = argminλ gcvλ.
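To make the generalised degrees of freedom concrete, here is a hedged Python sketch of gcvλ. It is our illustration rather than the authors' implementation; it reuses scad_derivative from the earlier sketch and restricts the trace to the selected covariates, as described above.

```python
import numpy as np

def gcv_score(X, y, beta_hat, lam, a=3.7):
    """Return (gcv_lambda, df_lambda) for a SCAD fit beta_hat at tuning parameter lam."""
    n = len(y)
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    active = np.flatnonzero(beta_hat)                       # covariates kept in the model
    if active.size == 0:
        df = 0.0
    else:
        Xa = X[:, active]
        ba = np.abs(beta_hat[active])
        Sigma = np.diag(scad_derivative(ba, lam, a) / ba)   # local quadratic approximation
        df = float(np.trace(Xa @ np.linalg.solve(Xa.T @ Xa + n * Sigma, Xa.T)))
    return (rss / n) / (1.0 - df / n) ** 2, df
```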
The log-transformation of gcvλ can be approximated by
log gcvλ = log σ̂²λ − 2 log(1 − dfλ/n) ≈ log σ̂²λ + 2dfλ/n.
Hence, log gcvλ is very similar to the traditional model selection criterion aic, which is an efficient selection criterion in the sense that it selects the best finite-dimensional candidate model in terms of prediction accuracy when the true model is of infinite dimension. However, aic is not a consistent selection criterion, since it does not select the correct model with probability approaching 1 in large samples when the true model is of finite dimension. For further discussion of efficiency and consistency in model selection, see Shao (1997), McQuarrie & Tsai (1998) and Yang (2005). Consequently, the model selected by λ̂gcv may not identify the finite-dimensional true model consistently. This motivates us to employ a variable selection criterion known to be consistent, bic (Schwarz, 1978), as the tuning parameter selector. We select the optimal λ by minimising
bicλ = log σ̂²λ + dfλ log(n)/n.    (2.4)
The resulting optimal regularisation parameter is denoted by λ̂bic.
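In the same spirit, a hedged sketch of the bic selector is given below. It is our illustration: scad_fit denotes a penalised least squares solver (one possible version is sketched in §5.1), the grid of candidate λ values is user supplied, and it reuses gcv_score from the previous sketch.

```python
import numpy as np

def bic_score(X, y, beta_hat, lam, a=3.7):
    """bic_lambda = log(sse_lambda / n) + df_lambda * log(n) / n, as in (2.4)."""
    n = len(y)
    _, df = gcv_score(X, y, beta_hat, lam, a)            # df_lambda from the sketch above
    sigma2 = float(np.sum((y - X @ beta_hat) ** 2)) / n
    return np.log(sigma2) + df * np.log(n) / n

def select_lambda_bic(X, y, lambdas):
    """Choose lambda on a grid by minimising bic_lambda."""
    fits = [(lam, scad_fit(X, y, lam)) for lam in lambdas]
    return min(fits, key=lambda pair: bic_score(X, y, pair[1], pair[0]))[0]
```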
3. Theoretical Results
3.1. Notation and conditions
Suppose that there is an integer 0 ≤ d0 ≤ d such that βjk ≠ 0 for 1 ≤ k ≤ d0, with the other βj's equal to 0. Thus, the true model contains only the j1th, ⋯ , jd0th covariates as significant variables. Furthermore, in order to define the underfitted and overfitted models, we denote by 𝒮F = {1, ⋯ , d} and 𝒮T = {j1, ⋯ , jd0} the full model and the true parsimonious submodel, respectively. Then any candidate model 𝒮 ⊅ 𝒮T is referred to as an underfitted model, in the sense that it misses at least one important variable. In contrast, any 𝒮 ⊃ 𝒮T other than 𝒮T itself is referred to as an overfitted model, in the sense that it contains all significant variables but also at least one insignificant variable.
For an arbitrary model 𝒮 = {j1, ⋯ , jd*} ⊂ 𝒮F, we denote its associated covariate matrix by X𝒮, which is an n × d* matrix with the ith row given by (xij1, ⋯ , xijd*). After fitting the data with model 𝒮 by least squares, we denote the resulting ordinary least squares estimator, the residual sum of squares, the variance estimator and the generalised crossvalidation value by
β̂𝒮 = (X′𝒮X𝒮)−1X′𝒮Y,    (3.1)
sse𝒮 = ∥Y − X𝒮β̂𝒮∥²,    (3.2)
σ̂²𝒮 = sse𝒮/n,    (3.3)
gcv𝒮 = σ̂²𝒮/(1 − d*/n)²,    (3.4)
respectively. The smoothly clipped absolute deviation estimator β̂λ, obtained by minimising the objective function in (2.2), naturally identifies the model 𝒮λ = {j : β̂λj ≠ 0}, for which the ordinary least squares estimator is β̂𝒮λ. By the definition of the ordinary least squares estimator, we have
sseλ = ∥Y − Xβ̂λ∥² ≥ ∥Y − X𝒮λβ̂𝒮λ∥² = sse𝒮λ.    (3.5)
Furthermore, the σ̂²λ defined in (2.3) can be simply expressed as σ̂²λ = sseλ/n. If λ = 0, then the penalty term in (2.2) is 0, and β̂0 is exactly the same as the full model's ordinary least squares estimator, β̂𝒮F. Moreover, sse0 = sse𝒮F, and gcv0 = gcv𝒮F.
In practice, however, λ is unknown, and the optimal λ must be searched for over the positive real line ℛ+ or, more conveniently, within a bounded interval Ω = [0, λmax] for some upper limit λmax. We now present the technical conditions needed to study the theoretical properties of the tuning parameter selectors.
Condition 1. For any 𝒮 ⊂ 𝒮F, there is a constant σ²𝒮 > 0 such that σ̂²𝒮 → σ²𝒮 in probability.
Condition 2. For any 𝒮 ⊅ 𝒮T, we have σ²𝒮 > σ²𝒮T, where σ²𝒮 is a positive value such that σ̂²𝒮 → σ²𝒮 in probability.
Condition 3. The ∊i's are independent and identically distributed as N(0, σ²).
Condition 4. The upper limit λmax → 0 as n → ∞.
Condition 5. The matrix cov(xi) = Σx is finite and positive definite.
Condition 1 facilitates the proof of the asymptotic results, while Condition 2 elucidates the underfitting effect. Both conditions are satisfied if (xi, ∊i) follows a non-degenerate multivariate normal distribution. Similar conditions can be found in Shi & Tsai (2002, 2004) and Huang & Yang (2004). Condition 3 is needed only for evaluating generalised crossvalidation's overfitting effect in §3.2, and is not necessary for establishing the consistency of the proposed bic criterion. Condition 4 implies that the search region for λ shrinks towards 0 as the sample size goes to infinity. This condition is used to simplify the proof of the consistency of bic in §3.3. Note that the rate at which λmax converges to 0 is not specified. Finally, Condition 5 ensures the root-n consistency of an unpenalised estimator.
3.2. The overfitting effect of generalised crossvalidation
We define Ω− = {λ ∈ Ω : 𝒮λ ⊅ 𝒮T}, Ω0 = {λ ∈ Ω : 𝒮λ = 𝒮T}, and Ω+ = {λ ∈ Ω : 𝒮λ ⊃ 𝒮T and 𝒮λ ≠ 𝒮T}. In other words, Ω−, Ω0 and Ω+ are the subsets of Ω producing underfitted, true and overfitted models, respectively. We first show that the smoothly clipped absolute deviation method with generalised crossvalidation is conservative, in the sense that it does not miss any important variables as long as the sample size is sufficiently large.
Lemma 1
Under Conditions 1 and 2, we have
pr{infλ∈Ω− gcvλ > gcv𝒮F} → 1.
All proofs are given in the Appendix. According to this lemma, the generalised crossvalidation score evaluated at any tuning parameter that produces an underfitted model is consistently larger than gcv𝒮F = gcv0. As a result, the optimal model selected by minimising the generalised crossvalidation values, i.e., 𝒮λ̂gcv, must contain all significant variables with probability tending to one. However, this does not necessarily imply that 𝒮λ̂gcv is the true model 𝒮T. In the next lemma, we show that the optimal model selected by generalised crossvalidation overfits the true model with a positive probability.
Lemma 2
Under Conditions 1–3, there exists a constant α > 0 such that lim infn pr(infλ∈Ω0 gcvλ > gcv𝒮F = gcv0) ≥ α.
According to this lemma, there is a nonzero probability that the smallest value of generalised crossvalidation associated with the true model is larger than that of the full-model. Hence, there is a positive probability that any λ associated with the true model cannot be selected by generalised crossvalidation as the optimal tuning parameter. Combining the results from Lemmas 1 and 2, we obtain the following theorem.
Theorem 1
If Conditions 1–3 hold, there is a constant α > 0 such that pr(𝒮λ̂gcv ⊃ 𝒮T) → 1 and
lim infn pr(𝒮λ̂gcv ≠ 𝒮T) ≥ α.
Theorem 1 indicates that, with probability tending to 1, the model 𝒮λ̂gcv contains all significant variables, but with nonzero probability includes superfluous variables, thereby leading to overfitting.
3.3. Consistency of bic
To establish the consistency of bic, we first construct a sequence of reference tuning parameters λn chosen so that λn → 0 and √n λn → ∞. According to Theorem 2 of Fan & Li (2001), pr(𝒮λn = 𝒮T) → 1 under appropriate regularity conditions. This implies that the model identified by the reference tuning parameter converges to the true model as the sample size gets large.
Lemma 3
Under Condition 5, pr(bicλn = bic𝒮T) → 1.
According to this lemma, with probability tending to 1,
bicλn = bic𝒮T = log σ̂²𝒮T + d0 log(n)/n.
Applying this result, we finally show that, for any λ which cannot identify the true model, the bic value is consistently larger than bicλn.
Lemma 4
Under Conditions 1, 2, 4 and 5,
pr{infλ∈Ω−∪Ω+ bicλ > bicλn} → 1.
Note that this lemma does not necessarily imply that λn = λ̂bic. However, it does indicate that those λ's which fail to identify the true model cannot be selected by bic asymptotically, because at least the true model identified by λn is a better choice. As a result, the optimal value λ̂bic can only be one of those λ's whose smoothly clipped absolute deviation estimator yields the true model, i.e., λ ∈ Ω0. Hence, the subsequent theorem follows immediately.
Theorem 2
If Conditions 1, 2, 4, and 5 hold, pr(𝒮λ̂bic = 𝒮T) → 1.
In addition to generalised crossvalidation and bic, other selection criteria, such as aic and ric (Shi & Tsai, 2002) can be used to select the tuning parameter for the smoothly clipped absolute deviation method. Techniques similar to those used above show that aic performs like generalised crossvalidation, with a potential for overfitting, while ric consistently identifies the true model.
4. Partially Linear Model
In the context of partially linear models, Bunea (2004) and Bunea & Wegkamp (2004) proposed an information-type criterion and established a consistency property. Fan & Li (2004) extended their nonconvex penalised least squares method to partially linear models with longitudinal data, and showed that the resulting estimator performs as well as the oracle estimator. However, Fan & Li (2004) employed generalised crossvalidation for selecting the tuning parameter. In this section, we demonstrate that this results in overfitting. We further propose a bic approach and show that it can identify the true model consistently.
Consider the partially linear model
yi = xi′β + α(ui) + ∊i,    (4.1)
where ui is a covariate, α(ui) is a nonparametric smooth function of ui, and the remainder of the notation is the same as that for model (2.1). Various estimation procedures have been proposed in the literature (Engle et al., 1986; Heckman, 1986; Robinson, 1988; Speckman, 1988), and a comprehensive survey of the partially linear model is given by Härdle et al. (2000).
Similar to (2.2), we propose
(1/2)∥Y − Θ − Xβ∥² + n ∑_{j=1}^d pλ(|βj|)    (4.2)
as a penalised least squares function for the partially linear model, where Θ = {α(u1), ⋯ , α(un)}′,
the penalty function pλ(|βj|) is defined as in (2.2), and α(·) is a nonparametric smooth function. To obtain the penalised least squares estimator, we first adopt Fan & Li's (2004) profile least squares technique to eliminate the nuisance parameter Θ for a given β. As a result, we have
y*i = α(ui) + ∊i,    (4.3)
where y*i = yi − xi′β. We then use Fan & Gijbels' (1996) local linear regression approach to estimate α(·). For ui in a neighbourhood of u, we find (α̂0, α̂1) by minimising
∑_{i=1}^n {y*i − α0 − α1(ui − u)}² Kh(ui − u),
where K(·) is a kernel function, h is a bandwidth and Kh(·) = h−1K(·/h). The local linear estimator at u is simply α̂(u; β) = α̂0.
Since the local linear estimator is a linear smoother, Θ̂ has the closed-form expression
Θ̂ = Sh(Y − Xβ),    (4.4)
where Sh is the smoothing matrix corresponding to the local linear regression and depends only on ui and Kh(·). Substituting Θ in (4.2) with Θ̂, we obtain the penalised profile least squares function
(1/2)∥(I − Sh)(Y − Xβ)∥² + n ∑_{j=1}^d pλ(|βj|),    (4.5)
where I is an n × n identity matrix. Under certain regularity conditions, Fan & Li (2004) established the oracle property for the penalised profile least squares estimator. Note that the profile least squares estimator of β is closely related to Speckman's (1988) partial residual estimator, which is obtained by minimising the first term of equation (4.5). In addition, Speckman (1988) used the kernel smoothing approach to estimate α, while we employed the local linear smoothing approach.
Based on (4.5), we define gcvλ for the penalised profile least squares problem by substituting X and Y in (2.3) with Xh = (I − Sh)X and Yh = (I − Sh)Y, respectively. Analogously, we define bicλ by replacing X and Y in the calculation of σ̂²λ and dfλ in (2.4) with Xh and Yh, respectively. Applying the selector gcvλ or bicλ, we are finally able to compute the smoothly clipped absolute deviation estimator of β.
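For illustration, the smoother matrix Sh and the transformed data can be formed as in the following Python sketch. This is our own sketch, not the authors' code; the Epanechnikov kernel and all variable names are assumptions.

```python
import numpy as np

def local_linear_smoother(u, h):
    """Smoothing matrix S_h whose ith row gives the local linear weights at u_i."""
    n = len(u)
    S = np.zeros((n, n))
    for i in range(n):
        d = u - u[i]
        k = np.maximum(0.75 * (1.0 - (d / h) ** 2), 0.0) / h   # Epanechnikov K_h(u_j - u_i)
        Z = np.column_stack([np.ones(n), d])                    # local linear basis at u_i
        W = np.diag(k)
        S[i, :] = np.linalg.solve(Z.T @ W @ Z, Z.T @ W)[0, :]   # alpha_hat(u_i; beta) = S[i] @ (y - X beta)
    return S

# Profiled data entering (4.5) and the gcv/bic scores:
#   S = local_linear_smoother(u, h); Xh = X - S @ X; Yh = y - S @ y
```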
As shown in Fan & Li (2004), the penalised profile least squares estimator β̂λ is root-n consistent provided that λ → 0 and √n λ → ∞ as n → ∞. Under the regularity conditions given in the Appendix, the asymptotic bias and variance of α̂(u) are of order Op(h²) and Op{1/(nh)}, respectively, since the parametric convergence rate of β̂ is faster than the nonparametric convergence rate of α̂(u). Furthermore, it can be shown that
supu∈𝒰 |α̂(u; β̂) − α(u)| = op(1)
by using results in Mack & Silverman (1982), where 𝒰 is the support of u; see Fan & Huang (2005) for details. Thus, Conditions 1 and 2 are reasonable assumptions in the proofs of the asymptotic properties of gcv and bic in partially linear models.
Applying Theorem 3.1 of Fan & Huang (2005), we have
(sse𝒮T − sse𝒮F)/σ² → χ²d−d0
in distribution, where the sums of squares are computed from the profile least squares fits. This, together with the arguments used in the proofs of Lemmas 1 and 2, implies that Theorem 1 holds for penalised profile least squares with the smoothly clipped absolute deviation penalty. Theorem 3.1 of Fan & Huang (2005) also implies that
(sse𝒮T − sse𝒮λ)/σ² → χ²dλ−d0
in distribution for any overfitted model 𝒮λ(⊃ 𝒮T) including dλ variables. As a result, equation (A7) remains valid for profile least squares estimators. Applying this result, in conjunction with the same arguments as those employed in the proofs of Lemmas 3 and 4, shows that Theorem 2 is true for penalised profile least squares with the smoothly clipped absolute deviation penalty.
Remark
To facilitate choosing the bandwidth h and the tuning parameter λ, note that β̂ is a root-n consistent estimator of β, so that its convergence rate is faster than the nonparametric convergence rate of α̂(u). This motivates us to substitute β in the expression of y*i = yi − xi′β with its root-n consistent estimator. By following the approach of Fan & Li (2004), we can show that the leading terms in the asymptotic bias and variance of the resulting local linear estimator α̂(u) are the same as those obtained by replacing β with its true value. This indicates that we are able to choose the bandwidth and the tuning parameter separately, which expedites the computation of h and λ. To be specific, we adapt the approach of Fan & Li (2004) to obtain the difference-based estimator of β for the full model, which is root-n consistent (Yatchew, 1997). Subsequently, we replace β in y*i with the difference-based estimator, so that (4.3) becomes a one-dimensional smoothing problem, and we then choose the bandwidth. Here, we use the plug-in method proposed by Ruppert et al. (1995), but this does not exclude the use of other known bandwidth selectors, such as generalised crossvalidation. Finally, we apply gcvλ or bicλ to choose λ.
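The differencing step in the remark can be sketched as follows. This is our Python illustration of the idea behind Yatchew's (1997) estimator, with names of our choosing: after sorting by u, first differences essentially remove the smooth function α, and β is estimated by least squares on the differenced data.

```python
import numpy as np

def difference_based_estimator(u, X, y):
    """Root-n consistent pilot estimate of beta obtained by first differencing in u."""
    order = np.argsort(u)                    # sort by the nonparametric covariate
    dX = np.diff(X[order], axis=0)           # differencing cancels alpha(u) to first order
    dy = np.diff(y[order])
    beta_pilot, *_ = np.linalg.lstsq(dX, dy, rcond=None)
    return beta_pilot
```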
5. Numerical Studies
5.1. Preliminaries
We examine the finite sample performance of the bic and generalised crossvalidation tuning parameter selectors in terms of both model error, i.e., lack of fit, and model complexity. However, we do not compare the smoothly clipped absolute deviation method with best-subset variable selection using bic, since Fan & Li (2001, 2004) have already compared them by Monte Carlo. To facilitate the computational process, we directly applied the local quadratic approximation algorithm to search for the smoothly clipped absolute deviation solution. We set the threshold for shrinking β̂j to zero at 10−6, which is much smaller than half of the standard error of the unpenalised least squares estimator, the threshold used in Fan & Li (2001). Thus, the average number of zeros obtained with the generalised crossvalidation tuning parameter selector is expected to be slightly smaller than that in Fan & Li (2001). All simulations were conducted using Matlab code, which is available from the authors.
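For completeness, a hedged Python sketch of the local quadratic approximation iteration is given below. It is our rendering of the scad_fit routine assumed in the §2 sketches, not the authors' Matlab code; it reuses scad_derivative from §2 and applies the 10−6 threshold mentioned above.

```python
import numpy as np

def scad_fit(X, y, lam, a=3.7, tol=1e-6, max_iter=100):
    """Penalised least squares with the SCAD penalty via local quadratic approximation."""
    n = X.shape[0]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # start from the unpenalised fit
    for _ in range(max_iter):
        absb = np.maximum(np.abs(beta), tol)             # guard against division by zero
        Sigma = np.diag(scad_derivative(absb, lam, a) / absb)
        beta_new = np.linalg.solve(X.T @ X + n * Sigma, X.T @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    beta[np.abs(beta) < tol] = 0.0                       # shrink tiny coefficients to zero
    return beta
```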
5.2. Simulation studies
We first consider Fan & Li's (2001) model error measure. Let (u, x, y) be a new observation from a regression model with E(y|u, x) = μ(u, x), and let μ̂(·,·) be an estimate of the regression function based on the data {(ui, xi, yi), i = 1, …, n}. Then the model error is defined to be E{μ̂(u, x) − μ(u, x)}², where the expectation is the conditional expectation given the data used in calculating μ̂(·,·). For a partially linear model, μ(u, x) = α(u) + x′β, the model error is
ME = E{α̂(u) − α(u)}² + E{x′β̂ − x′β}² + 2E[{α̂(u) − α(u)}{x′β̂ − x′β}].    (5.1)
The first term in (5.1) measures the fit of the nonparametric component, while the second term assesses the fit of the parametric component. To investigate the performance of the smoothly clipped absolute deviation method on just the parametric regression component, we chose the simulation setting so that the cross-product term in (5.1) equals 0. Note that for the linear regression model (2.1), the model error exactly equals E(x′β̂ − x′β)². To compare the generalised crossvalidation and bic approaches, we define the model error as
ME = (β̂ − β)′E(xx′)(β̂ − β),
and define the relative model error as RME = ME/ME𝒮F, where ME𝒮F is the model error obtained by fitting the data with the full model 𝒮F in conjunction with the unpenalised least squares estimator, β̂𝒮F.
In addition to model error, we also calculate the percentages of models correctly fitted, underfitted and overfitted by generalised crossvalidation and bic, and the average number of zero coefficients produced by the smoothly clipped absolute deviation method.
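As a small worked illustration of these measures (ours; it assumes the population covariance E(xx′) = Σx is available, as it is in the simulation designs below), the relative model error can be computed as follows.

```python
import numpy as np

def relative_model_error(beta_hat, beta_full, beta_true, Sigma_x):
    """RME = ME(beta_hat) / ME(beta_full) with ME(b) = (b - beta)' E(xx') (b - beta)."""
    def model_error(b):
        diff = b - beta_true
        return float(diff @ Sigma_x @ diff)
    return model_error(beta_hat) / model_error(beta_full)
```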
Example 1
We simulated 1000 datasets, each consisting of a random sample of size n, from the linear regression model
y = x′β + σ∊ ∊,
where β = (3, 1.5, 0, 0, 2, 0, 0, 0)′, ∊ ∼ N(0, 1) and the 8 × 1 vector x ∼ N8(0, Σx), in which (Σx)ij = ρ^|i−j| for all i and j. The values chosen were σ∊ = 3 and 1, n = 50, 100 and 200, and ρ = 0.75, 0.5 and 0.25.
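A data-generating sketch for this design is given below (ours, in Python rather than the authors' Matlab; the function name and the random number generator are assumptions).

```python
import numpy as np

def simulate_example1(n, rho=0.5, sigma_eps=3.0, seed=None):
    """One dataset from the Example 1 design: y = x'beta + sigma_eps * eps."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    idx = np.arange(8)
    Sigma_x = rho ** np.abs(idx[:, None] - idx[None, :])    # (Sigma_x)_{ij} = rho^|i-j|
    X = rng.multivariate_normal(np.zeros(8), Sigma_x, size=n)
    y = X @ beta + sigma_eps * rng.standard_normal(n)
    return X, y, beta
```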
As a benchmark, we compute the oracle estimator, i.e., the least squares estimator of the true submodel y = β1x1 + β2x2 + β5x5 + ∊. Since the pattern of the results is the same for all three correlations, we only present the results for ρ = 0.5. Table 1 indicates that the median of RME over the 1000 realisations for the bic approach rapidly approaches that of the oracle estimator as the sample size increases or the noise level decreases, whereas the value for the generalised crossvalidation method remains at almost the same level across different noise levels and sample sizes. Hence, the bic approach outperforms the generalised crossvalidation approach in terms of the model error measure.
Table 1. Simulation results for the linear regression model of Example 1 with ρ = 0.5

| σ∊ | n | Method | Underfitted (%) | Correctly fitted (%) | Overfitted: 1 (%) | Overfitted: 2 (%) | Overfitted: ≥3 (%) | I | C | MRME (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 50 | λ̂gcv | 6.4 | 16.9 | 23.0 | 31.6 | 22.1 | 0.064 | 3.279 | 64.17 |
| 3 | 50 | λ̂bic | 10.1 | 30.0 | 31.1 | 20.5 | 8.3 | 0.101 | 3.899 | 62.30 |
| 3 | 50 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 30.63 |
| 3 | 100 | λ̂gcv | 0.2 | 24.0 | 20.4 | 29.8 | 25.6 | 0.002 | 3.369 | 57.72 |
| 3 | 100 | λ̂bic | 1.0 | 52.5 | 27.0 | 14.6 | 4.9 | 0.100 | 4.275 | 50.43 |
| 3 | 100 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 33.05 |
| 3 | 200 | λ̂gcv | 0 | 25.4 | 35.8 | 25.4 | 13.4 | 0 | 3.300 | 55.18 |
| 3 | 200 | λ̂bic | 0 | 72.7 | 21.9 | 4.5 | 0.9 | 0 | 4.528 | 42.12 |
| 3 | 200 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 34.45 |
| 1 | 50 | λ̂gcv | 0 | 17.1 | 24.5 | 33.2 | 25.2 | 0 | 3.272 | 55.64 |
| 1 | 50 | λ̂bic | 0 | 45.6 | 23.9 | 21.1 | 9.4 | 0 | 4.042 | 40.97 |
| 1 | 50 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 30.62 |
| 1 | 100 | λ̂gcv | 0 | 19.0 | 24.5 | 31.8 | 24.7 | 0 | 3.324 | 55.91 |
| 1 | 100 | λ̂bic | 0 | 54.9 | 23.6 | 16.9 | 4.6 | 0 | 4.277 | 40.53 |
| 1 | 100 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 33.05 |
| 1 | 200 | λ̂gcv | 0 | 48.1 | 37.5 | 11.7 | 2.7 | 0 | 3.302 | 55.00 |
| 1 | 200 | λ̂bic | 0 | 81.8 | 16.4 | 1.3 | 0.5 | 0 | 4.405 | 38.36 |
| 1 | 200 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 34.42 |
I, the average number of the three truly nonzero coefficients incorrectly set to zero; C, the average number of the five true zero coefficients that were correctly set to zero; MRME, median of relative model error.
The column labelled 'C' in Table 1 gives the average number of the five true zero coefficients that were correctly set to zero, and the column labelled 'I' gives the average number of the three truly nonzero coefficients incorrectly set to zero. Table 1 also reports the proportions of models underfitted, correctly fitted and overfitted. In the case of overfitting, the columns labelled '1', '2' and '≥3' are the proportions of models including 1, 2 and more than 2 irrelevant covariates, respectively. The table shows that the bic method has a much better rate of correctly identifying the true submodel than does the generalised crossvalidation method. Furthermore, among the overfitted models, the bic method is likely to include just one irrelevant variable, whereas the generalised crossvalidation approach often includes two or more. Not surprisingly, both methods improve as the signal gets stronger, i.e., as σ∊ decreases from 3 to 1. However, the generalised crossvalidation method still seriously overfits even when σ∊ = 1 and n = 200. In contrast, the bic method overfits less often. These results corroborate the theoretical findings.
Example 2
In this example, we considered the partially linear model
y = x′β + α(u) + σ∊ ∊,
where u ∼ Un(0, 1) and α(u) = exp{2 sin(2πu)}. The rest of the simulation settings are the same as in Example 1. As mentioned in the remark of §4, in each simulation we replace β in y*i with the difference-based estimator, and then use the plug-in method proposed by Ruppert et al. (1995) to choose a bandwidth. Table 2 presents the simulation results for ρ = 0.5, and shows that, once more, the bic method outperforms the generalised crossvalidation method both in identifying the true model and in reducing the model error and complexity.
Table 2. Simulation results for the partially linear model of Example 2 with ρ = 0.5

| σ∊ | n | Method | Underfitted (%) | Correctly fitted (%) | Overfitted: 1 (%) | Overfitted: 2 (%) | Overfitted: ≥3 (%) | I | C | MRME (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 50 | λ̂gcv | 10.9 | 15.9 | 24.6 | 25.8 | 22.8 | 0.112 | 3.263 | 66.78 |
| 3 | 50 | λ̂bic | 15.5 | 29.3 | 29.3 | 18.4 | 7.5 | 0.160 | 3.929 | 67.04 |
| 3 | 50 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 29.29 |
| 3 | 100 | λ̂gcv | 0.8 | 23.1 | 22.6 | 29.7 | 23.8 | 0.008 | 3.368 | 58.15 |
| 3 | 100 | λ̂bic | 1.9 | 51.8 | 29.4 | 13.1 | 3.8 | 0.019 | 4.301 | 52.10 |
| 3 | 100 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 33.58 |
| 3 | 200 | λ̂gcv | 0 | 22.9 | 21.5 | 30.5 | 25.1 | 0 | 3.352 | 54.47 |
| 3 | 200 | λ̂bic | 0 | 70.0 | 16.7 | 10.9 | 2.4 | 0 | 4.540 | 43.34 |
| 3 | 200 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 34.50 |
| 1 | 50 | λ̂gcv | 0 | 26.0 | 25.7 | 31.0 | 17.3 | 0 | 3.567 | 51.93 |
| 1 | 50 | λ̂bic | 0.1 | 60.3 | 20.6 | 13.9 | 5.1 | 0.001 | 4.356 | 38.31 |
| 1 | 50 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 29.30 |
| 1 | 100 | λ̂gcv | 0 | 26.3 | 27.5 | 27.5 | 18.7 | 0 | 3.567 | 50.90 |
| 1 | 100 | λ̂bic | 0 | 67.9 | 18.9 | 9.9 | 3.3 | 0 | 4.509 | 39.10 |
| 1 | 100 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 33.42 |
| 1 | 200 | λ̂gcv | 0 | 26.5 | 26.9 | 28.9 | 17.7 | 0 | 3.582 | 49.24 |
| 1 | 200 | λ̂bic | 0 | 75.7 | 15.7 | 7.2 | 1.4 | 0 | 4.656 | 39.01 |
| 1 | 200 | Oracle | 0 | 100 | 0 | 0 | 0 | 0 | 5 | 34.77 |
I, the average number of the three truly nonzero coefficients incorrectly set to zero; C, the average number of the five true zero coefficients that were correctly set to zero; MRME, median of relative model error.
5.3. Real data examples
Example 3
We consider the Female Labour Supply data collected in East Germany in about 1994. The dataset consists of 607 observations and has been analysed by Fan et al. (1998) using additive models. Here we take the response variable y to be the 'wage per hour'. The u-variable in the partially linear model is the 'woman's age', because the relationship between y and u cannot be characterised by a simple functional form; see Fig. 1. There are seven explanatory variables: x1 is the weekly number of working hours; x2 is the 'Treiman prestige index' of the woman's job; x3 is the monthly net income of the woman's husband; x4 = 1 if the woman's years of education are between 13 and 16, and x4 = 0 otherwise; x5 = 1 if the woman's years of education are at least 17, and x5 = 0 otherwise; x6 = 1 if the woman has children younger than 16 years old, and x6 = 0 otherwise; and x7 is the unemployment rate in the place where she lives. After some preliminary analysis, we consider the following partially linear model with seven linear main effects and some first-order interaction effects among x1, x2 and x3:
Here the x-variables have been standardised.
Following Fan & Li's (2004) approach, we first calculate the difference-based estimator of β, and then apply the plug-in method to select the bandwidth 4.6249 for α̂(·). With this bandwidth, we next choose the tuning parameters by minimising the generalised crossvalidation and bic scores, resulting in λ̂gcv = 0.0896 and λ̂bic = 0.2655. Subsequently, we obtain the unpenalised profile least squares estimate (Fan & Li, 2004) and the smoothly clipped absolute deviation estimates based on generalised crossvalidation and bic, together with their standard errors (Table 3). We also consider the model selected by the unpenalised full-model profile least squares method, i.e., equation (4.5) without the penalty term, and the model chosen by the best-subset variable selection criterion bic𝒮 = log(σ̂²h𝒮) + d* log(n)/n, where σ̂²h𝒮 = sseh𝒮/n, sseh𝒮 is the sum of squares of the errors from fitting Yh on Xh𝒮 = (I − Sh)X𝒮, and d* is the dimension of X𝒮. The first four columns of Table 3 clearly show that the unpenalised full-model profile least squares approach fits spurious variables, while the smoothly clipped absolute deviation method based on gcv tends to include variables with small, insignificant effects. In contrast, all variables selected by the smoothly clipped absolute deviation method based on bic are significant at the 0.05 level. Fig. 1 shows that the four estimates of α(·) are fairly similar, although the estimate from the unpenalised full-model profile least squares approach differs slightly from the others. Moreover, the resulting intercept function changes with age and follows no particular functional form.
Table 3. Estimates (standard errors in parentheses) for the Female Labour Supply data

| Variable | Profile LSE | SCAD λ̂gcv | SCAD λ̂bic | Best-subset bic | Best-subset bic (n = 602) |
|---|---|---|---|---|---|
| x1 | 1.244 (0.637) | 1.281 (0.636) | 1.872 (0.562) | 1.343 (0.496) | 0 |
| x1² | −1.451 (0.563) | −1.446 (0.559) | −1.841 (0.517) | −2.192 (0.497) | −0.853 (0.119) |
| x2 | 1.520 (0.721) | 1.602 (0.704) | 1.357 (0.681) | 0 | 0 |
| x2² | 1.162 (0.617) | 1.281 (0.599) | 1.341 (0.601) | 1.433 (0.136) | 1.410 (0.137) |
| x3 | −1.229 (0.692) | −1.063 (0.549) | 0 | 0 | 0 |
| x3² | −0.011 (0.276) | 0 | 0 | 0 | 0 |
| x1x2 | −1.781 (0.702) | −1.885 (0.684) | −1.493 (0.653) | 0 | 0 |
| x1x3 | 0.922 (0.559) | 0.995 (0.549) | 0 | 0 | 0 |
| x2x3 | 0.313 (0.485) | 0 | 0 | 0 | 0 |
| x4 | 0.609 (0.130) | 0.593 (0.130) | 0.249 (0.055) | 0.605 (0.129) | 0.590 (0.129) |
| x5 | 1.194 (0.140) | 1.183 (0.140) | 1.030 (0.131) | 1.168 (0.138) | 1.172 (0.139) |
| x6 | −0.290 (0.189) | −0.028 (0.019) | 0 | 0 | 0 |
| x7 | 0.118 (0.117) | 0.005 (0.006) | 0 | 0 | 0 |

LSE, least squares estimate; SCAD, smoothly clipped absolute deviation.
Table 3 also shows that best-subset variable selection with the bic criterion yields a simpler model than the smoothly clipped absolute deviation method based on bic. However, Breiman (1996) found that the best-subset method suffers from a lack of stability. To demonstrate this point, we exclude the last 5 observations, leaving a dataset with n = 602. The last column of Table 3 shows that best-subset variable selection with the bic criterion then yields a different model from that obtained with n = 607, which corroborates Breiman's finding. The model based on the smoothly clipped absolute deviation method with bic remains unchanged, although details are not given.
We conclude that the best model in this study is that selected by the smoothly clipped absolute deviation method with bic:
Hence, the hourly wage of a woman depends primarily on working hours, job prestige and years of education, while the husband's income, the local unemployment rate and the indicator of whether or not the woman has a young child seem not to affect the hourly wage significantly. Fig. 1 indicates that the hourly wage is almost constant before the age of 50, but decreases rapidly thereafter.
6. Discussion
One could extend the current work by adapting Fan & Li's (2001) approach to define the penalised likelihood function for the generalised linear model by replacing the first term of equation (2.2) with twice the negative of the corresponding loglikelihood function. Subsequently, one could explore the overfitting effect of generalised crossvalidation and the consistency of bic. It is also of interest to compare the generalised crossvalidation approach to the bic method for semiparametric models and single-index models. Research along these lines is currently under investigation.
Acknowledgement
We are grateful to the editor, the associate editor and two referees for their helpful and constructive comments. Li's research was supported by grants from the U.S. National Institute on Drug Abuse and National Science Foundation.
Appendix
Proofs
Proof of Lemma 1
When λ = 0, we have df0 = d and gcv0 = gcv𝒮F in (2.3). Then, applying Condition 1 together with the fact that −2 log(1 − d/n) = O(n−1), we obtain
log gcv0 = log σ̂²𝒮F − 2 log(1 − d/n) → log σ²𝒮F    (A1)
in probability. According to (3.5), sseλ ≥ sse𝒮λ, which leads to
gcvλ = (sseλ/n)(1 − dfλ/n)−2 ≥ sseλ/n ≥ sse𝒮λ/n = σ̂²𝒮λ.
As a result,
infλ∈Ω− log gcvλ ≥ min𝒮⊅𝒮T log σ̂²𝒮.    (A2)
In addition, Condition 2 implies that σ²𝒮 > σ²𝒮T for 𝒮 ⊅ 𝒮T. Hence, min𝒮⊅𝒮T log σ²𝒮 > log σ²𝒮T = log σ²𝒮F, the equality holding because both 𝒮T and 𝒮F contain all significant variables. This result, in conjunction with Conditions 1 and 2 and equations (A1) and (A2), yields
pr{infλ∈Ω− gcvλ > gcv𝒮F} → 1
as n → ∞, and the proof is complete.
Proof of Lemma 2
For any λ ∈ Ω0, we have 𝒮λ = 𝒮T. Hence, gcvλ > (1/n)sse𝒮λ = (1/n)sse𝒮T. This, together with the fact that (1 − d/n)−2 = 1 + 2d/n + O(n−2), leads to
n(infλ∈Ω0 gcvλ − gcv0) ≥ (sse𝒮T − sse𝒮F) − 2d σ̂²𝒮F + op(1),    (A3)
since sse𝒮F/n = Op(1). According to Conditions 1 and 2, we have that σ̂²𝒮F → σ²𝒮F in probability. Furthermore, under Condition 3, (sse𝒮T − sse𝒮F)/σ² follows a χ²d−d0 distribution. As a result,
lim infn pr(infλ∈Ω0 gcvλ > gcv0) ≥ pr(σ²χ²d−d0 > 2dσ²𝒮F) ≡ α > 0.
This completes the proof.
Proof of Lemma 3
Let βS = (βj1, ⋯ , βjd0)′ be the vector of relevant coefficients, and let βN consist of the irrelevant coefficients. Without loss of generality, we assume that βS = (β1, ⋯ , βd0)′ and βN = (βd0+1, ⋯ , βd)′. In addition, let β̂λn = (β̂′Sλn, β̂′Nλn)′, where β̂Sλn and β̂Nλn are the smoothly clipped absolute deviation estimators of βS and βN, respectively. Under Condition 5, we apply Theorem 2 of Fan & Li (2001) to obtain that, with probability tending to 1, β̂Sλn satisfies
X′𝒮T(Y − X𝒮Tβ̂Sλn) = n bn(β̂Sλn),    (A4)
where bn(βS) = {p′λn(|β1|)sign(β1), ⋯ , p′λn(|βd0|)sign(βd0)}′. According to Theorem 1 of Fan & Li (2001), β̂Sλn → βS ≠ 0 in probability. In addition, because λn → 0, we have aλn → 0. As a result, pr(|β̂Sλn| > aλn) → 1, which implies that pr{bn(β̂Sλn) = 0} → 1. This, together with (A4), implies that, with probability tending to 1, the normal equation (A4) is exactly the same as
X′𝒮T(Y − X𝒮Tβ̂Sλn) = 0,
which is the normal equation for the ordinary least squares estimator based on the true model. As a result, with probability tending to 1, the smoothly clipped absolute deviation estimator β̂Sλn is exactly the same as β̂𝒮T = (X′𝒮TX𝒮T)−1X′𝒮TY, the first d0 elements of the oracle estimator. It follows immediately that pr(sseλn = sse𝒮T) → 1, since, with probability tending to one, β̂Nλn = 0 by the sparsity result in Theorem 2 of Fan & Li (2001). Using similar arguments, we can show that, with probability tending to one, the diagonal elements of Σλn corresponding to the selected covariates are exactly zero, which implies that pr(dfλn = d0) → 1. As a result, with probability tending to one, we have bicλn = bic𝒮T. This completes the proof.
Proof of Lemma 4
For 𝒮λ ≠ 𝒮T, i.e., λ ∈ Ω− ∪ Ω+, there are two cases, underfitting and overfitting. We show that Lemma 4 holds in each case.
Case 1: Underfitted model, i.e. 𝒮λ ⊅ 𝒮T
Applying Lemma 3 and Condition 1, we first have that
bicλn = bic𝒮T = log σ̂²𝒮T + d0 log(n)/n → log σ²𝒮T    (A5)
in probability. It then follows from 𝒮λ ⊅ 𝒮T and Conditions 1 and 2 that
infλ∈Ω− bicλ ≥ min𝒮⊅𝒮T log σ̂²𝒮 → min𝒮⊅𝒮T log σ²𝒮 > log σ²𝒮T    (A6)
in probability. Finally, (A5) and (A6) imply that pr{infλ∈Ω− bicλ > bicλn} → 1.
Case 2: Overfitted model, i.e. 𝒮λ ⊃ 𝒮T, but 𝒮λ ≠ 𝒮T
According to Condition 1, σ̂²𝒮λ → σ²𝒮λ in probability. Next, let dλ be the number of variables included in the model 𝒮λ. Then, for an overfitted model, dλ > d0. Moreover, using the theory of the sums of squares decomposition, we can easily show that (sse𝒮T − sse𝒮λ)/σ² → χ²dλ−d0 in distribution as n → ∞. Under the normality assumption, Condition 3, this is also true in finite samples. Thus, for any overfitted model 𝒮,
sse𝒮T − sse𝒮 = Op(1).    (A7)
This, together with Lemma 3 and the definition of bicλ, implies that, with probability tending to 1,
n(bicλ − bicλn) = n log(sseλ/sse𝒮T) + (dfλ − d0) log n = n log(sseλ/sse𝒮T) + (dλ − d0) log n + op(log n),
where the last equality follows because dfλ = dλ + op(1) by Conditions 4 and 5. Thus,
n(infλ∈Ω+ bicλ − bicλn) ≥ min𝒮⊃𝒮T, 𝒮≠𝒮T n log(sse𝒮/sse𝒮T) + log n + op(log n).    (A8)
It follows from (A7) that
n log(sse𝒮/sse𝒮T) = n log{1 − (sse𝒮T − sse𝒮)/sse𝒮T} = Op(1)
for any overfitted model 𝒮. This, together with the fact that sse𝒮T/n → σ²𝒮T in probability, implies that the right-hand side of (A8) diverges to +∞ as n → ∞, which in turn implies that
pr{infλ∈Ω+ bicλ > bicλn} → 1.
The results of Cases 1 and 2 complete the proof.
Regularity Conditions for Partially Linear Model
Suppose that {(ui, xi, yi), i = 1, ⋯ , n}, is a random sample from (4.1). The following regularity conditions are imposed to facilitate the technical proofs:
(i) the kernel function is a symmetric density function with compact support;
(ii) the random variable u1 has a bounded support 𝒰, and its density function f(·) is Lipschitz continuous and bounded away from 0 on its support;
(iii) the function α(·) has a continuous second-order derivative for u ∈ 𝒰;
(iv) the conditional expectation E(x1|u1 = u) is Lipschitz continuous for u ∈ 𝒰;
(v) there is an s > 1 such that E∥x1∥^{2s} < ∞, and there is an η < 2 − s^{−1} such that n^{2η−1}h → ∞;
(vi) the bandwidth h satisfies h = Op(n^{−1/5}).
Contributor Information
Hansheng Wang, Guanghua School of Management, Peking University, Beijing, China, 100871 hansheng@gsm.pku.edu.cn.
Runze Li, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park Pennsylvania, 16802-2111, U.S.A. rli@stat.psu.edu.
Chih-Ling Tsai, Graduate School of Management, University of California, Davis California, 95616-8609, U.S.A. cltsai@ucdavis.edu.
References
- Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F, editors. 2nd International Symposium on Information Theory. Akademiai Kiado; Budapest: 1973. pp. 267–81.
- Breiman L. Heuristics of instability and stabilization in model selection. Ann. Statist. 1996;24:2350–83.
- Bunea F. Consistent covariate selection and post model selection inference in semiparametric regression. Ann. Statist. 2004;32:898–927.
- Bunea F, Wegkamp M. Two-stage model selection procedures in partially linear regression. Can. J. Statist. 2004;32:105–18.
- Craven P, Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math. 1979;31:377–403.
- Engle RF, Granger CWJ, Rice J, Weiss A. Semiparametric estimates of the relation between weather and electricity sales. J. Am. Statist. Assoc. 1986;81:310–20.
- Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. Chapman and Hall; New York: 1996.
- Fan J, Härdle W, Mammen E. Direct estimation of low-dimensional components in additive models. Ann. Statist. 1998;26:943–71.
- Fan J, Huang T. Profile likelihood inference on semiparametric varying-coefficient partially linear models. Bernoulli. 2005;11:1031–57.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 2001;96:1348–60.
- Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J. Am. Statist. Assoc. 2004;99:710–23.
- Härdle W, Liang H, Gao J. Partially Linear Models. Springer Physica-Verlag; Heidelberg: 2000.
- Heckman NE. Spline smoothing in partly linear models. J. R. Statist. Soc. B. 1986;48:244–8.
- Huang J, Yang L. Identification of non-linear additive autoregressive models. J. R. Statist. Soc. B. 2004;66:463–77.
- Mack YP, Silverman BW. Weak and strong uniform consistency of kernel regression estimates. Z. Wahr. verw. Geb. 1982;61:405–15.
- McQuarrie ADR, Tsai CL. Regression and Time Series Model Selection. World Scientific; Singapore: 1998.
- Robinson PM. Root-n-consistent semiparametric regression. Econometrica. 1988;56:931–54.
- Ruppert D, Sheather SJ, Wand MP. An effective bandwidth selector for local least squares regression. J. Am. Statist. Assoc. 1995;90:1257–70.
- Schwarz G. Estimating the dimension of a model. Ann. Statist. 1978;6:461–4.
- Shao J. An asymptotic theory for linear model selection. Statist. Sinica. 1997;7:221–64.
- Shi P, Tsai CL. Regression model selection - a residual likelihood approach. J. R. Statist. Soc. B. 2002;64:237–52.
- Shi P, Tsai CL. A joint regression variable and autoregressive order selection criterion. J. Time Ser. Anal. 2004;25:923–41.
- Speckman P. Kernel smoothing in partially linear models. J. R. Statist. Soc. B. 1988;50:413–36.
- Tibshirani RJ. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B. 1996;58:267–88.
- Yang Y. Can the strengths of aic and bic be shared? A conflict between model identification and regression estimation. Biometrika. 2005;92:937–50.
- Yatchew A. An elementary estimator for the partially linear model. Econ. Lett. 1997;57:135–43.