Abstract
We propose and study a unified procedure for variable selection in partially linear models. A new type of double-penalized least squares is formulated, using the smoothing spline to estimate the nonparametric part and applying a shrinkage penalty on parametric components to achieve model parsimony. Theoretically we show that, with proper choices of the smoothing and regularization parameters, the proposed procedure can be as efficient as the oracle estimator (Fan and Li, 2001). We also study the asymptotic properties of the estimator when the number of parametric effects diverges with the sample size. Frequentist and Bayesian estimates of the covariance and confidence intervals are derived for the estimators. One great advantage of this procedure is its linear mixed model (LMM) representation, which greatly facilitates its implementation by using standard statistical software. Furthermore, the LMM framework enables one to treat the smoothing parameter as a variance component and hence conveniently estimate it together with other regression coefficients. Extensive numerical studies are conducted to demonstrate the effective performance of the proposed procedure.
Keywords: Semiparametric regression, Smoothing splines, Smoothly clipped absolute deviation, Variable selection
1. Introduction
Partially linear models are popular semiparametric modeling techniques which assume that the mean response depends linearly on some covariates, whereas its relation to additional variables is characterized by nonparametric functions. In particular, we consider a partially linear model Y = XTβ + f(T) + ε, where X are explanatory variables of primary interest, β are regression parameters, f(·) is an unknown smooth function of the auxiliary covariate T, and the errors are uncorrelated. This model is a special case of general additive models (Hastie and Tibshirani, 1990). Estimation of β and f has been studied in various contexts, including kernel smoothing (Speckman, 1988), smoothing splines (Engle et al., 1986; Heckman, 1986; Wahba, 1990; Green and Silverman, 1994; Gu, 2002), and penalized splines (Ruppert et al., 2003; Liang, 2006).
Oftentimes, the number of potential explanatory variables, d, is large, but only a subset of them is predictive of the response. Variable selection is necessary to improve the prediction accuracy and interpretability of the final model. In this paper, we treat f(T) as a nuisance effect and focus on automatic selection, estimation, and inference for the important linear effects in the presence of T. For linear models, numerous variable selection methods have been developed, such as stepwise selection, best subset selection, and shrinkage methods including the nonnegative garrote (Breiman, 1995), the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001), least angle regression (Efron et al., 2004), and the adaptive LASSO (Zou, 2006; Zhang and Lu, 2007). Information criteria commonly used for model comparison include Mallows' Cp (Mallows, 1973), Akaike's information criterion (Akaike, 1973), and the Bayesian information criterion (BIC; Schwarz, 1978). Thorough reviews of variable selection for linear models are given in Linhart and Zucchini (1986) and Miller (2002).
Though there is a vast amount of work on variable selection for linear models, limited work has been done on model selection for partially linear models, as noted in Fan and Li (2004). Model selection for partially linear models is challenging because it consists of several interrelated estimation and selection problems: nonparametric estimation, smoothing parameter selection, and variable selection and estimation for the linear covariates. Fan and Li (2004) have done pioneering work in this area. In the framework of kernel smoothing, Fan and Li (2004) proposed an effective kernel estimator for nonparametric function estimation while using the SCAD penalty for variable selection; they were among the first to extend the shrinkage selection idea to partially linear models. Bunea (2004) proposed a class of sieve estimators based on penalized least squares for semiparametric model selection and established the consistency of the estimator. Bunea and Wegkamp (2004) suggested another two-stage estimation procedure and proved that the estimator is minimax adaptive under some regularity conditions. Recently, variable selection for high dimensional data, where either d diverges with n or d > n, has been actively studied. Fan and Peng (2004) established asymptotic properties of nonconcave penalized likelihood estimators for linear model variable selection when d increases with the sample size. Xie and Huang (2007) studied SCAD-penalized regression for partially linear models with high dimensional data, where polynomial regression splines are employed for model estimation.
In this work, we propose a new regularization approach for model selection in the context of partially linear smoothing spline models and study its theoretical and computational properties. As we show in the paper, the elegant smoothing spline theory and formulation can be used to develop a simple yet effective procedure for joint function estimation and variable selection. Inspired by Fan and Li (2004), we adopt the SCAD penalty for model parsimony because of its nice theoretical properties. We show that the new estimator has the oracle property if both the smoothing and regularization parameters are chosen properly as n → ∞, when the dimension d is fixed. In the more challenging case when dn → ∞ as n → ∞, the estimator is shown to be (n/dn)1/2-consistent and to select the important variables correctly with probability tending to one. In addition to these desirable asymptotic properties, the new approach also has advantages in computation and parameter estimation. It naturally admits a linear mixed model (LMM) representation, which allows one to take advantage of standard software and implement it without much extra programming effort. This LMM framework further facilitates the tuning of multiple parameters: the smoothing parameter in the roughness penalty and the regularization parameter associated with the shrinkage penalty. In our work, the smoothing parameter is treated as an additional variance component and estimated jointly with the residual variance using the restricted maximum likelihood (REML) approach, so a two-dimensional grid search can be avoided. We also show that the local quadratic approximation (LQA; Fan and Li, 2001) technique used for computation provides a convenient and robust sandwich formula for standard errors of the resulting estimates.
The rest of the article is organized as follows. In Section 2 we propose the double penalized least squares method for joint variable selection and model estimation, and establish the asymptotic properties of the resulting estimator β̂. We further study the large-sample properties of the estimator, such as the estimation consistency and variable selection consistency, in situations when the input dimension increases with the sample size n. In Section 3 we suggest a linear mixed model (LMM) representation for the proposed procedure, which leads to an iterative algorithm with easy implementation. We also discuss how to select the tuning parameters. In Section 4, we derive the covariance estimates for β̂ and f̂, from both Frequentist and Bayesian perspectives. Sections 5 and 6 present simulation results and a real data application. Section 7 concludes the article with a discussion.
2. Double-Penalized Least Squares Estimators and Their Asymptotics
2.1. Double-Penalized Least Squares Estimators
Suppose that the sample consists of n observations. For the ith observation, denote by yi the response, by xi the covariate vector from which important covariates are to be selected, and by ti the covariate whose effect cannot be adequately characterized by a parametric function. We consider the following partially linear model:
(2.1)  yi = xiTβ + f(ti) + εi,  i = 1, ···, n,
where β is a d×1 vector of regression coefficients, f(t) is an arbitrary twice-differentiable smooth function, and the εi's are assumed to be uncorrelated random variables with mean zero and a common unknown variance σ2. Define Y = (y1, ···, yn)T. Without loss of generality, we further assume that ti ∈ [0, 1] and that f(t) lies in the Sobolev space {f(t): f, f′ are absolutely continuous, and J2(f) < ∞}, where J2(f) = ∫01 {f″(t)}2 dt.
To simultaneously achieve the estimation of the nonparametric function f(t) and the selection of important variables, we propose a double-penalized least squares (DPLS) approach by minimizing
(2.2)  (1/2) Σi=1n {yi − xiTβ − f(ti)}2 + (nλ1/2) ∫01 {f″(t)}2 dt + n Σj=1d pλ2(|βj|).
The first penalty term in (2.2) penalizes the roughness of the nonparametric fit f(t), and the second penalty pλ2(|βj|) is a shrinkage penalty on the βj's. To the best of our knowledge, there has been little work on the DPLS in the literature. We call the minimizers of (2.2) double-penalized least squares estimators (DPLSEs). There are two tuning parameters in (2.2): λ1 ≥ 0 is a smoothing parameter which balances the smoothness of f(t) against fidelity to the data, and λ2 ≥ 0 is a regularization parameter controlling the amount of shrinkage used in the variable selection. The choice of the tuning parameters is very important for effective model selection and estimation and will be discussed later. In the DPLS (2.2), we adopt the nonconcave SCAD penalty proposed by Fan and Li (2001), which is a piecewise quadratic function and satisfies
(2.3)  p′λ2(β) = λ2 { I(β ≤ λ2) + [(aλ2 − β)+ / ((a − 1)λ2)] I(β > λ2) }  for β > 0, with pλ2(0) = 0,
where a > 2 is also a tuning parameter. Fan and Li (2001) showed that the SCAD penalty function results in consistent, sparse and continuous estimators in linear models.
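For concreteness, the sketch below evaluates the SCAD penalty and its first derivative as defined by Fan and Li (2001) and used in (2.3). The vectorized form and the function names (scad, scad_deriv) are our own, and the default a = 3.7 simply anticipates the value recommended later in Section 3.2.

```python
import numpy as np

def scad_deriv(beta, lam, a=3.7):
    """First derivative p'_lambda(|beta|) of the SCAD penalty (Fan and Li, 2001)."""
    b = np.abs(np.asarray(beta, dtype=float))
    linear = (b <= lam)                      # L1-type slope near the origin
    middle = (b > lam) & (b <= a * lam)      # linearly decaying slope
    return np.where(linear, lam,
                    np.where(middle, (a * lam - b) / (a - 1.0), 0.0))

def scad(beta, lam, a=3.7):
    """SCAD penalty p_lambda(|beta|): piecewise quadratic, constant beyond a*lam."""
    b = np.abs(np.asarray(beta, dtype=float))
    return np.where(
        b <= lam,
        lam * b,
        np.where(
            b <= a * lam,
            -(b**2 - 2.0 * a * lam * b + lam**2) / (2.0 * (a - 1.0)),
            (a + 1.0) * lam**2 / 2.0,
        ),
    )
```

Because the derivative vanishes beyond aλ2, large coefficients are left essentially unpenalized, which is what drives the unbiasedness part of the oracle property.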
2.2 Asymptotic Theory: d fixed
First we lay out the regularity conditions on xi, ti and εi that are necessary for the theoretical results. Denote the true coefficient vector by β0 = (β10T, β20T)T, where β20 = 0 and β10 consists of all q nonzero components. Assume the uncorrelated random variables εi have uniformly bounded absolute third moments. In addition, we assume that x1, …, xn are independently and identically distributed with mean zero and finite positive definite covariance matrix R, and that the components of xi have finite third and fourth moments. As in Heckman (1986), we assume that t1, …, tn are distinct values in [0, 1] satisfying ∫0ti u(x) dx = i/n, where u(·) is a continuous and strictly positive function independent of n.
Define X = (x1, …, xn)T, ε = (ε1, …, εn)T and f = (f(t1), …, f(tn))T. The partially linear model (2.1) can then be expressed as Y = Xβ + f + ε. It can be shown that, for given λ1 and λ2, minimizing the DPLS (2.2) leads to a smoothing spline estimate of f(·). Hence, by Theorem 2.1 of Green and Silverman (1994), we can rewrite the DPLS (2.2) as
(2.4)  (1/2)(Y − Xβ − f)T(Y − Xβ − f) + (nλ1/2) fTKf + n Σj=1d pλ2(|βj|),
where K is the nonnegative definite smoothing matrix defined in Green and Silverman (1994). Given λ1, λ2, and β, the minimizer of (2.4) in f is f̂(β) = (I + nλ1K)−1(Y − Xβ), where A(λ1) = (I + nλ1K)−1 is the linear smoother matrix of Craven and Wahba (1979) and Heckman (1986). Plugging f̂(β) into (2.4), we obtain a penalized profile least squares criterion in β only:
Q(β) = (1/2)(Y − Xβ)T{I − A(λ1)}(Y − Xβ) + n Σj=1d pλ2(|βj|).
We call the quadratic term in Q(β) the profile least squares and denote it by L(β).
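To make the profiling step concrete, here is a minimal numerical sketch of f̂(β) = (I + nλ1K)−1(Y − Xβ) and of the profile least squares L(β), assuming the roughness matrix K is available (for instance from the construction sketched in Section 3.1 below). The direct matrix inversion is a simplification of the banded-matrix algorithms in Green and Silverman (1994), and the 1/2 scaling follows the DPLS (2.4) as written above.

```python
import numpy as np

def smoother_matrix(K, lam1):
    """A(lambda1) = (I + n*lambda1*K)^{-1} for the roughness matrix K."""
    n = K.shape[0]
    return np.linalg.inv(np.eye(n) + n * lam1 * K)

def spline_profile_fit(beta, Y, X, K, lam1):
    """f_hat(beta) = A(lambda1)(Y - X beta), the smoothing spline fit at the design points."""
    return smoother_matrix(K, lam1) @ (Y - X @ beta)

def profile_least_squares(beta, Y, X, K, lam1):
    """L(beta) = (1/2)(Y - X beta)'(I - A(lambda1))(Y - X beta)."""
    n = len(Y)
    r = Y - X @ beta
    return 0.5 * r @ ((np.eye(n) - smoother_matrix(K, lam1)) @ r)
```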
In the following, we establish the asymptotic theory for our estimator in terms of both estimation and variable selection. The proofs involve a second-order Taylor expansion of pλ2(|β|), and we adapt the derivations of Fan and Li (2001) to the partially linear model context. Compared to the linear models studied in Fan and Li (2001), the major difficulty here is the presence of the nonparametric component f in (2.1), which affects the estimate of β through the smoother matrix A(λ1). In Lemma 1, we first establish some properties of L(β) that are useful for the proofs of Lemma 2 and Theorems 1 and 2 later in this section.
Lemma 1
Let L′(β0) and L″(β0) be the gradient vector and Hessian matrix of L(β), respectively, evaluated at β0. Assume that the xi are independent and identically distributed with finite fourth moments. If λ1n → 0 and n2λ1n → ∞ as n → ∞, then
(a) n−1/2L′(β0) = Op(1);
(b) n−1L″(β0) = R + op(1).
From Lemma 1, we have n−1/2L′(β0) = Op(1) and n−1L″(β0) = R + op(1); in particular, n−1[L″(β0)]jk = Rjk + op(1), where Rjk is the (j, k)th element of R. Using these results, we can prove the root-n consistency of the DPLSE β̂ and its oracle properties. Since the derivations of Theorems 1 and 2 and Lemma 2 below are similar to those in Fan and Li (2001), they are omitted from the paper.
Theorem 1
As n → ∞, if λ1n → 0, n2λ1n → ∞, and λ2n → 0, then there exists a local minimizer β̂ of Q(β) such that ||β̂ − β0|| = Op(n−1/2).
Theorem 1 says that if we choose proper sequences of λ1n and λ2n as n → ∞, then the DPLSE β̂ is root-n consistent. In the following, we establish through Lemma 2 and Theorem 2 that β̂ can perform as well as the oracle procedure in variable selection.
Lemma 2
As n → ∞, if λ1n → 0, n2λ1n → ∞, λ2n → 0, and n1/2λ2n → ∞, then with probability tending to 1, for any β1 satisfying ||β1 − β10|| = O(n−1/2) and any constant C > 0,
Q{(β1T, 0T)T} = min over ||β2|| ≤ Cn−1/2 of Q{(β1T, β2T)T}.
Theorem 2
As n → ∞, if λ1n → 0, n2λ1n → ∞, λ2n → 0, and n1/2λ2n → ∞, then with probability tending to 1, the local minimizer β̂ = (β̂1T, β̂2T)T in Theorem 1 must satisfy:
Sparsity: β̂2 = 0.
Asymptotic normality: n1/2(β̂1 − β10) → N(0, σ2R11−1) in distribution, where R11 is the q × q upper-left sub-matrix of R.
2.3 Asymptotic Theory: dn → ∞ as n → ∞
In this section, we study the sampling properties of the DPLSEs in the situation where the number of linear predictors tends to ∞ as the sample size n goes to ∞. Similar to Fan and Peng (2004), we show that under certain regularity conditions, the DPLSEs are (n/dn)1/2-consistent and also consistent in selecting important variables, where dn denotes the dimension of β to emphasize its dependence on the sample size n. Similarly, we write qn for the number of important parametric effects. We write the true regression coefficients as βn0 = (βn10T, βn20T)T, where βn20 = 0 and βn10 consists of the qn nonzero components, and the DPLSE as β̂n = (β̂n1T, β̂n2T)T. For any square matrix G, denote its smallest and largest eigenvalues by Λmin(G) and Λmax(G), respectively. The following regularity conditions are assumed to facilitate the technical derivations.
- (C1) The elements {βn10,j} of βn10 satisfy min1≤j≤qn |βn10,j|/λ2n → ∞ as n → ∞.
- (C2) There exist constants c1 and c2 such that 0 < c1 ≤ Λmin(R) ≤ Λmax(R) ≤ c2 < ∞.
Both conditions above are adopted from Fan and Peng (2004), which is the first work to study the large-sample properties of nonconcave penalized estimators for linear models when the dimension of the data diverges with the sample size n. As pointed out by Fan and Peng (2004), condition (C1) gives the rate at which the penalized estimator can distinguish nonvanishing parameters from 0. Condition (C2) assumes that R is positive definite and that its eigenvalues are uniformly bounded.
Theorem 3
Under the conditions (C1) and (C2), as n → ∞, if λ1n → 0, n2λ1n → ∞, λ2n → 0, and dn2/n → 0, then there exists a local minimizer β̂n of Q(βn) such that ||β̂n − βn0|| = Op((dn/n)1/2).
Theorem 3 says that if we choose proper sequences of λ1n and λ2n as n → ∞, then the DPLSE β̂n is (n/dn)1/2-consistent. This consistency rate is the same as that obtained by Fan and Peng (2004), where the number of parameters diverges in linear models. It is also the same as that of the M-estimator studied by Huber (1973) in the diverging dimension situation. Next, Theorem 4 shows that β̂n is also consistent in variable selection, i.e., unimportant linear predictors are estimated to be exactly zero with probability tending to one. All the proofs are given in the Appendix.
Theorem 4
Under the regularity conditions (C1) and (C2), as n → ∞, if λ1n → 0, n2λ1n → ∞, λ2n → 0, (n/dn)1/2λ2n → ∞, and dn2/n → 0, then with probability tending to 1, the local minimizer β̂n = (β̂n1T, β̂n2T)T in Theorem 3 must satisfy β̂n2 = 0.
3. Computational Algorithm and Parameter Tuning
We reformulate the DPLS into a linear mixed model (LMM) representation for ease of computation. The LMM allows us to treat the smoothing parameter as a variance component and provides a unified estimation and inferential framework. An iterative algorithm is then outlined.
3.1. Linear Mixed Model (LMM) Representation
Let t = (t1, …, tn)T be the vector of distinct ti’s and f = (f (t1), …, f(tn))T. In the case where there are ties in ti’s, an incidence matrix can be used to cast the DPLS into a linear mixed model framework as in Zhang et al. (1998). The partially linear model (2.1) can then be expressed as
(3.1)  Y = Xβ + f + ε.
If εi’s were normally distributed, then minimizing (2.4) with respect to (β, f) is equivalent to maximizing the double-penalized likelihood
(3.2)  ℓ(β, f; Y) − (nλ1/(2σ2)) fTKf − (n/σ2) Σj=1d pλ2(|βj|),
where ℓ(β, f; Y) = − (n/2) log σ2 − (Y − Xβ − f)T (Y −Xβ −f)/(2σ2). Following Green (1987), we may write f via a one-to-one linear transformation as f = Tδ + Ba, where T = [1, t], 1 is the vector of 1’s with length n, δ and a are of length 2 and n − 2 respectively, and B = L(LTL)−1 with L being an n × (n − 2) full rank matrix satisfying K = LLT and LT T = 0. It follows that fT Kf = aTa and yields an equivalent double-penalized log-likelihood
(3.3)  −(n/2) log σ2 − (Y − X*β* − Ba)T(Y − X*β* − Ba)/(2σ2) − (nλ1/(2σ2)) aTa − (n/σ2) Σj=1d pλ2(|βj|),
where X* = [T, X], β* = (δT, βT)T.
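The reparameterization above can be carried out numerically as follows: we first form K = QR−1QT as in Green and Silverman (1994), then obtain a full-rank L with K = LLT and LTT = 0 from the eigendecomposition of K, and set B = L(LTL)−1. The eigendecomposition route and the numerical tolerance for identifying the two-dimensional null space are our own implementation choices (any factorization with these properties would do), and the knots are assumed distinct; ties would require the incidence-matrix device mentioned earlier.

```python
import numpy as np

def roughness_matrix(t):
    """K = Q R^{-1} Q^T (Green and Silverman, 1994), so that f'Kf = J2(f) for the
    natural cubic spline interpolant at the sorted distinct knots t."""
    t = np.sort(np.asarray(t, dtype=float))
    n = len(t)
    h = np.diff(t)
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for j in range(n - 2):
        Q[j, j] = 1.0 / h[j]
        Q[j + 1, j] = -1.0 / h[j] - 1.0 / h[j + 1]
        Q[j + 2, j] = 1.0 / h[j + 1]
        R[j, j] = (h[j] + h[j + 1]) / 3.0
        if j < n - 3:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6.0
    return Q @ np.linalg.solve(R, Q.T)

def lmm_design(t):
    """Return K, T = [1, t], L with K = L L^T and L^T T = 0, and B = L (L^T L)^{-1}."""
    t = np.sort(np.asarray(t, dtype=float))
    K = roughness_matrix(t)
    T = np.column_stack([np.ones_like(t), t])
    evals, evecs = np.linalg.eigh(K)
    pos = evals > 1e-10 * evals.max()        # drop the two-dimensional null space of K
    L = evecs[:, pos] * np.sqrt(evals[pos])  # K = L L^T; columns orthogonal to [1, t]
    B = evecs[:, pos] / np.sqrt(evals[pos])  # B = L (L^T L)^{-1} because L^T L = diag(evals)
    return K, T, L, B
```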
For fixed β* (and given λ1,λ2,σ2), (3.3) can be treated as the joint log-likelihood for the following linear mixed model (LMM) subject to the SCAD penalty on β
(3.4)  Y = X*β* + Ba + ε,
where β* represents the fixed effects and a the random effects, with a ~ N(0, τI), τ = σ2/(nλ1), and θ = (τ, σ2) the variance components. We then conduct variable selection by maximizing the penalized log-likelihood of β* subject to the SCAD penalty,
(3.5)  −(1/2) log|V| − (1/2)(Y − X*β*)TV−1(Y − X*β*) − n Σj=1d pλ2(|βj|),
where V = σ2In + τBBT is the variance of Y under the mixed model representation (3.4). After selecting the important variables and obtaining the estimate β̂*, we use δ̂ and the best linear unbiased prediction (BLUP) estimate â to construct the smoothing spline fit f̂(t). This LMM representation suggests that τ, which is inversely proportional to the smoothing parameter λ1, can be treated as a variance component and hence estimated jointly with σ2 by maximum likelihood or restricted maximum likelihood (REML) during the variable selection process, under the working assumption that the εi's are normally distributed. However, it should be noted that the above mixed model representation is merely a computationally convenient framework: the asymptotic results in Section 2 do not depend on the normal error assumption. Simulation results in Section 5 indicate that our procedure is quite robust to the distributional assumption on the εi's.
The SCAD penalty function defined by (2.3) is not differentiable at the origin, which causes difficulty in maximizing (3.5) with gradient-based methods such as Newton–Raphson. Following Fan and Li (2001, 2004), we use a local quadratic approximation (LQA). Assuming β̂j(0) is an initial value close to the maximizer of (3.5), we have the local approximation
pλ2(|βj|) ≈ pλ2(|β̂j(0)|) + (1/2) {p′λ2(|β̂j(0)|)/|β̂j(0)|} (βj2 − β̂j(0)2)  for βj ≈ β̂j(0) with |β̂j(0)| > ξ,
and we set β̂j = 0 if |β̂j(0)| ≤ ξ, where ξ is a pre-specified threshold.
Using Taylor expansions, we can approximate (3.5) by
(3.6)  −(1/2) log|V| − (1/2)(Y − X*β*)TV−1(Y − X*β*) − (n/2) β*T Σλ2(β̂*(0)) β*,
where Σλ2(β̂*(0)) = diag{0, 0, p′λ2(|β̂1(0)|)/|β̂1(0)|, …, p′λ2(|β̂d(0)|)/|β̂d(0)|}, with the two leading zeros corresponding to δ, which is not penalized. For fixed θ = (τ, σ2), we apply the Newton–Raphson method to maximize (3.6) and obtain the updating formula
(3.7)  β̂*(k+1) = {X*TV−1X* + nΣλ2(β̂*(k))}−1 X*TV−1Y.
It is easy to recognize that (3.7) is equivalent to an iterative ridge regression algorithm.
We propose to estimate (β, f) and (τ, σ2) alternately and iteratively. The initial values for β*, τ and σ2 are obtained by fitting the linear mixed model (3.4) with all the covariates using the MIXED procedure in SAS. We then use formula (3.7) to iteratively update β̂*. The LMM framework allows us to treat τ = σ2/(nλ1) as an extra variance component based on the selected important linear covariates, so that it can be estimated together with the error variance σ2 by restricted maximum likelihood (REML). There is a rich literature on the use of REML to estimate smoothing parameters and variance components (e.g., Wahba, 1985; Speed, 1991; Zhang et al., 1998). For example, Zhang et al. (1998) estimated the smoothing parameter via REML for longitudinal data with a nonparametric baseline function and complex variance structures. The partially linear model (3.1) has a form similar to (2) of Zhang et al. (1998), with only two variance components (τ, σ2), and hence the estimation proceeds similarly.
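The sketch below implements one pass of the iterative ridge update (3.7) with the variance components θ = (τ, σ2) held fixed; in the full algorithm this pass alternates with a REML update of θ (e.g., via SAS PROC MIXED or an equivalent mixed-model routine), which is not shown. Using a small floor ξ inside the LQA weights, instead of explicitly deleting near-zero coefficients and refitting, is our own simplification, and scad_deriv refers to the earlier sketch.

```python
import numpy as np

def lqa_ridge_update(Y, Xstar, B, beta_star, tau, sigma2, lam2, a=3.7,
                     n_unpenalized=2, xi=1e-4, max_iter=100, tol=1e-6):
    """Iterative ridge regression (3.7) for fixed variance components (tau, sigma2).

    Xstar = [T, X]; the first n_unpenalized columns (delta) carry no SCAD penalty.
    """
    n, p = Xstar.shape
    V = sigma2 * np.eye(n) + tau * (B @ B.T)          # marginal covariance of Y under (3.4)
    XtVinv = Xstar.T @ np.linalg.inv(V)
    pen = np.arange(n_unpenalized, p)                 # indices of penalized coefficients
    beta_star = np.asarray(beta_star, dtype=float).copy()
    for _ in range(max_iter):
        beta_old = beta_star.copy()
        # LQA weights p'_{lam2}(|b|)/|b|, with a small floor xi for numerical stability
        d = np.zeros(p)
        b = np.maximum(np.abs(beta_star[pen]), xi)
        d[pen] = scad_deriv(b, lam2, a) / b
        beta_star = np.linalg.solve(XtVinv @ Xstar + n * np.diag(d), XtVinv @ Y)
        if np.max(np.abs(beta_star - beta_old)) < tol:
            break
    # coefficients shrunk essentially to zero are set to exact zero (model deletion)
    beta_star[pen] = np.where(np.abs(beta_star[pen]) <= xi, 0.0, beta_star[pen])
    return beta_star
```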
3.2. Choice of Tuning Parameters
Although the smoothing parameter λ1 (or, equivalently, τ) is readily estimated in the LMM framework, we still need to choose the SCAD tuning parameters (λ2, a). One common approach to finding their optimal values is a two-dimensional grid search based on data-driven criteria such as CV or GCV (Craven and Wahba, 1979), which can be computationally prohibitive. Fan and Li (2001) showed numerically that a = 3.7 performs well in terms of Bayes risk and recommended its use in practice. We therefore set a = 3.7 and only tune λ2 in our implementation.
Many selection criteria, such as cross validation (CV), generalized cross validation (GCV), BIC and AIC, can be used for parameter tuning. Wang et al. (2007) suggested using BIC for the SCAD estimator in linear and partially linear models and proved its model selection consistency, i.e., the optimal parameter chosen by BIC identifies the true model with probability tending to one. We also use BIC to select the optimal λ2 from a grid of candidate values, under the working normal distributional assumption for the εi's.
Given λ2, suppose q variables are selected by the algorithm in Section 3.1. Let X1 be the sub-matrix of X corresponding to the q important variables and β1 the corresponding q × 1 coefficient vector. We then use the estimation method of Zhang et al. (1998) to fit the partially linear model (2.1), so that Ŷ = SY, where S is the smoother matrix, and set q1 = trace(S). The BIC criterion is computed as BIC(λ2) = −2ℓ + q1 log n, where ℓ = −(n/2) log(2πσ̂2) − (Y − X1β̂1 − f̂)T(Y − X1β̂1 − f̂)/(2σ̂2). For each grid point of λ2, the iterative ridge regression yields a model with a set of important covariates, and we compute the BIC for this selected model. Based on our empirical evidence and the fact that BIC is consistent in selecting the correct model under certain conditions (Schwarz, 1978), we chose BIC over GCV for tuning λ2 in our numerical analysis.
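Putting the pieces together, the tuning loop below evaluates BIC(λ2) on a grid: for each candidate λ2 it runs the LQA update sketched above, refits β1 and f on the selected covariates by profiling with the smoother A(λ1) (a simplification of the full Zhang et al., 1998, mixed-model refit), and computes −2ℓ + q1 log n with q1 = trace(S). The helper names and the profiling refit are ours, and the grid shown in the comment is only illustrative.

```python
import numpy as np

def bic_for_lambda2(Y, X, T, B, K, tau, sigma2, lam1, lam2):
    """BIC(lambda2) = -2*loglik + q1*log(n) for the model selected at this lambda2."""
    n = len(Y)
    Xstar = np.column_stack([T, X])
    V = sigma2 * np.eye(n) + tau * (B @ B.T)
    XtVinv = Xstar.T @ np.linalg.inv(V)
    beta_init = np.linalg.solve(XtVinv @ Xstar, XtVinv @ Y)     # unpenalized GLS start
    beta_star = lqa_ridge_update(Y, Xstar, B, beta_init, tau, sigma2, lam2)
    selected = np.flatnonzero(beta_star[2:] != 0)
    X1 = X[:, selected]
    # Refit (beta1, f) on the selected covariates by profiling with A(lambda1)
    A = smoother_matrix(K, lam1)
    M = np.eye(n) - A
    beta1 = np.linalg.solve(X1.T @ M @ X1, X1.T @ M @ Y)
    f_hat = A @ (Y - X1 @ beta1)
    resid = Y - X1 @ beta1 - f_hat
    sigma2_hat = resid @ resid / n
    # Effective degrees of freedom q1 = trace(S), where Y_hat = S Y
    H = X1 @ np.linalg.solve(X1.T @ M @ X1, X1.T @ M)
    S = H + A @ (np.eye(n) - H)
    loglik = -0.5 * n * np.log(2 * np.pi * sigma2_hat) - resid @ resid / (2 * sigma2_hat)
    return -2.0 * loglik + np.trace(S) * np.log(n)

# Example grid search (range is illustrative only):
# lam2_grid = np.exp(np.linspace(np.log(0.01), np.log(1.0), 20))
# best_lam2 = min(lam2_grid, key=lambda l2: bic_for_lambda2(Y, X, T, B, K, tau, sigma2, lam1, l2))
```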
4. Frequentist and Bayesian Covariance Estimates
We derive frequentist and Bayesian covariance formulas for β̂ and f̂, parallel to Sections 3.4 and 3.5 of Zhang et al. (1998), except that we also account for the bias introduced by the penalty imposed for variable selection. Using these covariance estimates, we can construct confidence intervals for the regression coefficients and the nonparametric function. The proposed covariance estimates are evaluated via simulation in Section 5.
4.1. Frequentist Covariance Estimates
From the frequentist point of view, cov(Y|t, x) = σ2I, and we can write β̂* = (δ̂T, β̂T)T as an approximately linear function of Y: β̂* = QY. Partition Q = (Q1T, Q2T)T, where Q1 and Q2 have row dimensions corresponding to δ and β, so that δ̂ = Q1Y and β̂ = Q2Y. The estimated covariance matrix for β̂ is given by
(4.1)  cov(β̂) = σ̂2 Q2Q2T,
where σ̂2 is the estimated error variance. It is easy to show that the empirical BLUP estimate of a is â = Ã(Y − X*β̂*) = SaY, where Ã = τ̂BTV̂−1 and Sa = Ã(I − X*Q). Therefore f̂ = Tδ̂ + Bâ = (TQ1 + BSa)Y, and its estimated covariance is
(4.2)  cov(f̂) = σ̂2 (TQ1 + BSa)(TQ1 + BSa)T.
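Continuing the same sketch, the frequentist formulas (4.1) and (4.2) can be evaluated directly once the algorithm has converged. Here Q is taken to be the linear map implied by the converged ridge update (3.7), which is our reading of the approximate linearity β̂* = QY, and scad_deriv is the helper from Section 2.

```python
import numpy as np

def frequentist_covariances(Xstar, B, T, beta_star, tau, sigma2_hat, lam2, a=3.7,
                            n_unpenalized=2, xi=1e-4):
    """Covariance estimates (4.1)-(4.2), treating beta_star_hat = Q Y as linear in Y."""
    n, p = Xstar.shape
    V = sigma2_hat * np.eye(n) + tau * (B @ B.T)
    Vinv = np.linalg.inv(V)
    d = np.zeros(p)
    pen = np.arange(n_unpenalized, p)
    nz = pen[np.abs(beta_star[pen]) > xi]
    d[nz] = scad_deriv(beta_star[nz], lam2, a) / np.abs(beta_star[nz])
    Q = np.linalg.solve(Xstar.T @ Vinv @ Xstar + n * np.diag(d), Xstar.T @ Vinv)
    Q1, Q2 = Q[:n_unpenalized], Q[n_unpenalized:]
    cov_beta = sigma2_hat * (Q2 @ Q2.T)            # (4.1)
    A_tilde = tau * B.T @ Vinv                     # BLUP weights: a_hat = A_tilde (Y - Xstar beta_star)
    Sa = A_tilde @ (np.eye(n) - Xstar @ Q)
    Sf = T @ Q1 + B @ Sa                           # f_hat = Sf Y
    cov_f = sigma2_hat * (Sf @ Sf.T)               # (4.2)
    return cov_beta, cov_f
```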
4.2. Bayesian Covariance Estimates
The LMM representation in Section 3.1 and (3.3) suggests a prior for f(t) of the form f = Tδ + Ba, with a ~ N(0, τI) and a flat prior on δ. As a prior for β, a reasonable choice is the one with kernel exp{−(n/2) βTΣλ2β}, where Σλ2 denotes the diagonal block, corresponding to β, of the matrix defined in Section 3.1. The definition of the SCAD penalty function (2.3) implies that some diagonal elements of Σλ2 can be zero, corresponding to those coefficients with |βj| > aλ2. Assume that, after reordering, Σλ2 = diag(0, Σ22), where Σ22 has positive diagonal elements. It follows that β can be partitioned into β = (β1T, β2T)T, where β1 can be regarded as “fixed” effects and β2 as “random” effects with β2 ~ N{0, (nΣ22)−1}. The matrix X is partitioned into [X1, X2] accordingly. We now reformulate the mixed model (3.4) as Y = Tδ + X1β1 + X2β2 + Ba + ε, or Y = χγ + Zb + ε, where χ = [T, X1], γ = (δT, β1T)T, Z = [X2, B], and b = (β2T, aT)T is the new random effect vector, distributed as b ~ N(0, Σb) with block diagonal covariance matrix Σb = diag{(nΣ22)−1, τI}. Under the reformulated linear mixed model, β consists of both fixed and random effects. Therefore the Bayesian covariances for (β̂, f̂) are
(4.3)  covB(β̂) = σ̂2 [ (HTH + σ̂2D)−1 ]β,
(4.4)  covB(f̂) = σ̂2 G (HTH + σ̂2D)−1 GT,
where H = [χ, Z], D = diag{0, Σb−1} with the zero block corresponding to γ (which has a flat prior), [·]β denotes the sub-matrix corresponding to β = (β1T, β2T)T, and G = [T, 0, 0, B] is arranged so that f = G(γT, bT)T = Tδ + Ba.
These Bayesian variance estimates can be viewed as accounting for the bias in β̂ and f̂ caused by the imposed penalties (Wahba, 1983).
5. Simulation Studies
We conduct Monte Carlo simulation studies to evaluate the finite-sample performance of the proposed DPLS method in terms of both model estimation and variable selection. Furthermore, we compare our procedure with the SCAD and LASSO methods proposed by Fan and Li (2004). In the following, these three methods are referred to as “DPLSE”, “SCAD” and “LASSO”, respectively. When implementing Fan and Li (2004), we adopt their approach to choosing the kernel bandwidth: first compute the difference-based estimator (DBE) of β and then select the bandwidth using the plug-in method of Ruppert et al. (1995). To select the SCAD and LASSO tuning parameters, we tried both BIC and GCV and found that BIC generally gave better performance, so BIC was used for tuning the SCAD and LASSO.
We simulate the data from a partially linear model y = xTβ + f(t) + ε. Adopting the configuration in Tibshirani (1996) and Fan and Li (2001, 2004), we generate the correlated covariates x = (x1, …, x8)T from a multivariate normal distribution with standard normal marginals and AR(1) correlation corr(xi, xj) = 0.5^|i−j|, and we set the true coefficients to β = (3, 1.5, 0, 0, 2, 0, 0, 0)T. Two types of non-normal errors are used to demonstrate that the proposed normal-likelihood-based REML estimation is robust to the error distribution. We compare the three methods in a 2 × 2 × 2 factorial experiment, with two combinations of (f, ε): (1) f1(t) = 4 sin(2πt/4) with a scaled t error ε1 ~ C0·t6 (t distribution with 6 degrees of freedom); (2) f2(t) = 5β(t/20, 11, 5) + 4β(t/20, 5, 11), where β(x, p, q) = {Γ(p + q)/(Γ(p)Γ(q))} x^(p−1)(1 − x)^(q−1) denotes the Beta(p, q) density, with a mixture normal error ε2 ~ C0{0.5N(1, 1) + 0.5N(−1, 3)}. The scale C0 is chosen such that the error variance is σ2 = 1 or 9. We consider two sample sizes, n = 100 and n = 200. The number of observed unique time points ti is 50 in all settings.
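For reproducibility, here is a data-generating sketch for this design. The AR(1) covariance, the true β, and the two error laws follow the description above; the scaling to hit a target error variance and the range of the design points t (taken here as [0, 20], so that t/20 ∈ [0, 1] in f2) are our reading of the setup and should be adjusted if a different range is intended.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ar1_covariates(n, d=8, rho=0.5):
    """x ~ N(0, R) with corr(x_i, x_j) = rho^{|i-j|}."""
    R = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    return rng.multivariate_normal(np.zeros(d), R, size=n)

def f1(t):
    return 4.0 * np.sin(2.0 * np.pi * t / 4.0)

def f2(t):
    # 5*beta(t/20; 11, 5) + 4*beta(t/20; 5, 11), with beta(.; p, q) the Beta density
    return 5.0 * stats.beta.pdf(t / 20.0, 11, 5) + 4.0 * stats.beta.pdf(t / 20.0, 5, 11)

def errors(n, setting, sigma2=1.0):
    """Scaled t_6 errors (setting 1) or the 0.5 N(1,1) + 0.5 N(-1,3) mixture (setting 2)."""
    if setting == 1:
        e = rng.standard_t(df=6, size=n) * np.sqrt(sigma2 / 1.5)   # var(t_6) = 6/4 = 1.5
    else:
        comp = rng.integers(0, 2, size=n)
        e = np.where(comp == 0, rng.normal(1.0, 1.0, n), rng.normal(-1.0, np.sqrt(3.0), n))
        e *= np.sqrt(sigma2 / 3.0)                                 # mixture: mean 0, variance 3
    return e

def simulate(n=100, setting=1, sigma2=1.0, n_knots=50):
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    knots = np.linspace(0.0, 20.0, n_knots)      # 50 distinct design points (assumed range)
    t = rng.choice(knots, size=n)
    X = ar1_covariates(n)
    f = f1(t) if setting == 1 else f2(t)
    Y = X @ beta + f + errors(n, setting, sigma2)
    return Y, X, t, beta
```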
As in Fan and Li (2004), we use the mean squared error (MSE) of β̂ and f̂ to evaluate the goodness-of-fit of the parametric and nonparametric estimates, respectively. They are defined as MSE(β̂) = E(||β̂ − β||2) and MSE(f̂) = E[∫ {f̂(t) − f(t)}2 dt]. In practice, we compute MSE(f̂) by averaging the squared errors over the design knots. Under each setting, we carry out 100 Monte Carlo (MC) simulation runs and report the MC sample mean and standard deviation (given in parentheses) of the MSEs. To evaluate the variable selection performance of each method, we report the number of correct zero coefficients (denoted “Corr.”), the number of coefficients incorrectly set to 0 (denoted “Inc.”), and the model size. In addition, we report the point estimate, bias, and the 95% coverage probability of the frequentist and Bayesian confidence intervals for the DPLSE.
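The summary measures reported in the tables can be computed per Monte Carlo replicate with a small helper like the following (in this design the oracle values are a model size of 3, 5 correct zeros, and 0 incorrect zeros).

```python
import numpy as np

def selection_metrics(beta_hat, beta_true, f_hat_knots, f_true_knots):
    """Per-replicate MSE(beta_hat), knot-averaged MSE(f_hat), model size, and zero counts."""
    mse_beta = np.sum((beta_hat - beta_true) ** 2)
    mse_f = np.mean((f_hat_knots - f_true_knots) ** 2)
    size = int(np.sum(beta_hat != 0))
    corr = int(np.sum((beta_hat == 0) & (beta_true == 0)))   # correctly zeroed ("Corr.")
    inc = int(np.sum((beta_hat == 0) & (beta_true != 0)))    # incorrectly zeroed ("Inc.")
    return mse_beta, mse_f, size, corr, inc
```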
5.1. Overall Model Selection and Estimation Results
Table 5.1 compares the three variable selection procedures when σ2 = 1. The DPLSE outperforms the other methods in terms of both estimation and variable selection in all scenarios, and SCAD performs better than LASSO. Overall, the DPLSE achieves a sparser model, with both “Corr.” and “Inc.” closer to the oracle values (5 and 0, respectively). In our implementation of the SCAD and LASSO, the bandwidth selected by the plug-in method occasionally caused numerical problems and failure to converge; the results for SCAD and LASSO are therefore based only on the converged cases.
Table 5.1.
| (n, f) | Method | MSE(β̂) | MSE(f̂) | Model Size (3) | Corr. (5) | Inc. (0) |
|---|---|---|---|---|---|---|
| (100, f1) | DPLSE | 0.05 (0.06) | 0.07 (0.04) | 3.22 | 4.78 | 0 |
| | SCAD | 0.09 (0.09) | 0.17 (0.07) | 3.39 | 4.61 | 0 |
| | LASSO | 0.10 (0.09) | 0.17 (0.07) | 3.82 | 4.18 | 0 |
| (100, f2) | DPLSE | 0.06 (0.06) | 0.14 (0.05) | 3.21 | 4.79 | 0 |
| | SCAD | 0.08 (0.08) | 0.28 (0.10) | 3.31 | 4.69 | 0 |
| | LASSO | 0.13 (0.10) | 0.29 (0.11) | 3.69 | 4.31 | 0 |
| (200, f1) | DPLSE | 0.02 (0.02) | 0.04 (0.02) | 3.08 | 4.92 | 0 |
| | SCAD | 0.02 (0.02) | 0.09 (0.03) | 3.26 | 4.74 | 0 |
| | LASSO | 0.03 (0.02) | 0.09 (0.03) | 3.45 | 4.55 | 0 |
| (200, f2) | DPLSE | 0.02 (0.02) | 0.08 (0.03) | 3.07 | 4.93 | 0 |
| | SCAD | 0.03 (0.03) | 0.19 (0.05) | 3.24 | 4.76 | 0 |
| | LASSO | 0.04 (0.03) | 0.19 (0.05) | 3.53 | 4.47 | 0 |
SCAD and LASSO estimates are based on M converged MC samples, where M ≥ 90 except M = 72 for (200, f1).
Table 5.2 presents the results for the high-variance case σ2 = 9. As σ2 increases from 1 to 9, the MSEs increase substantially, but the DPLSE still maintains very good model selection performance. The MSEs of the DPLSE remain consistently smaller than those of SCAD and LASSO (not reported here to save space). Incorrect zero coefficients occur only rarely for n = 100 and never for n = 200.
Table 5.2.
| (n, f) | MSE(β̂) | MSE(f̂) | Model Size (3) | Corr. (5) | Inc. (0) |
|---|---|---|---|---|---|
| (100, f1) | 0.58 (0.67) | 0.55 (0.39) | 3.23 | 4.75 | 0.02 |
| (200, f1) | 0.22 (0.24) | 0.27 (0.15) | 3.12 | 4.88 | 0 |
| (100, f2) | 0.71 (0.75) | 0.92 (0.49) | 3.21 | 4.77 | 0.02 |
| (200, f2) | 0.22 (0.22) | 0.48 (0.19) | 3.97 | 4.93 | 0 |
5.2. Performance of DPLSE for Parametric Estimation
Table 5.3 presents the point estimate, relative bias, empirical standard error, the model-based frequentist and Bayesian standard errors, and the 95% coverage probabilities (CP) of the corresponding confidence intervals. To save space, we only report results for the parameters that are truly nonzero. The point estimate is the MC sample average, and the empirical standard error is the MC standard deviation. The relative bias is the ratio of the bias to the true value.
Table 5.3.
| (n, σ2, f) | Parameter | Point Estimate | Relative Bias | Empirical SE | Freq. SE | Bayesian SE | Freq. 95% CP | Bayesian 95% CP |
|---|---|---|---|---|---|---|---|---|
| (100, 1, f1) | β1 | 3.011 | 0.004 | 0.129 | 0.128 | 0.129 | 0.95 | 0.95 |
| | β2 | 1.500 | 0.000 | 0.113 | 0.106 | 0.107 | 0.94 | 0.94 |
| | β5 | 2.024 | 0.012 | 0.134 | 0.105 | 0.107 | 0.89 | 0.90 |
| (200, 1, f1) | β1 | 3.006 | 0.002 | 0.086 | 0.087 | 0.087 | 0.94 | 0.95 |
| | β2 | 1.502 | 0.002 | 0.086 | 0.087 | 0.088 | 0.95 | 0.96 |
| | β5 | 1.994 | −0.002 | 0.075 | 0.076 | 0.077 | 0.96 | 0.96 |
| (200, 1, f2) | β1 | 3.009 | 0.003 | 0.088 | 0.087 | 0.088 | 0.94 | 0.94 |
| | β2 | 1.497 | −0.002 | 0.088 | 0.088 | 0.088 | 0.96 | 0.97 |
| | β5 | 1.996 | −0.002 | 0.074 | 0.077 | 0.078 | 0.98 | 0.99 |
| (200, 9, f2) | β1 | 3.037 | 0.012 | 0.242 | 0.261 | 0.263 | 0.94 | 0.94 |
| | β2 | 1.487 | −0.009 | 0.302 | 0.264 | 0.265 | 0.96 | 0.96 |
| | β5 | 1.983 | −0.012 | 0.246 | 0.230 | 0.232 | 0.96 | 0.96 |
We report the results in four scenarios with varying n, σ2 and f(t), and those in other scenarios are similar and hence omitted. We observe that β̂ is roughly unbiased in all scenarios. Both Bayesian and frequentist SEs of β̂j’s obtained from (4.1) and (4.3) agree well with the empirical SEs; all SEs decrease as n increases or σ2 decreases. Bayesian SEs are slightly larger than their frequentist counterparts, since they also account for bias in β̂j. The confidence intervals based on either Bayesian or frequentist SEs achieve the nominal coverage probability, indicating the accuracy of the SE formulas. Overall, the DPLSE works very well for estimating model parameters.
5.3. Performance of f̂ (t) and Pointwise Standard Errors
In Figure 5.1 we plot the pointwise estimates and biases for estimating f1(t) and f2(t) when n = 200 and σ2 = 1 for all three methods.
In plots (a) and (c), the averaged fitted curves are almost indistinguishable from the true nonparametric functions, indicating small biases in f̂(t) for all three methods. The pointwise biases are magnified in plots (b) and (d), which show that the DPLSE overall has smaller bias than the other two methods. The SCAD and LASSO fits have slightly larger and rougher pointwise biases, which indicates under-smoothing due to the small bandwidth selected by the plug-in method. Our method is more advantageous in that it automatically estimates the smoothing parameter and controls the amount of smoothing more appropriately by treating τ = σ2/(nλ1) as a variance component.
Figure 5.2 depicts the pointwise standard errors and pointwise coverage probabilities of confidence intervals given by the covariance formulas (4.2) and (4.4). Here n = 200 and σ2 = 1; (a) and (b) are for f1 with t6 errors, and (c) and (d) are for f2 with mixture normal errors.
We note that the frequentist pointwise SEs interlace with the empirical SEs, whereas the Bayesian pointwise SEs are a little larger than the frequentist counterparts. Accordingly, as shown in plots (b) and (d), the pointwise coverage probability rates for frequentist confidence intervals are around the nominal level, whereas most of the Bayesian coverage probabilities are higher than 95%.
6. Real Data Application
We apply the proposed DPLS method to the Ragweed Pollen Level data analyzed in Ruppert et al. (2003). The data were collected in Kalamazoo, Michigan, during the 1993 ragweed season and consist of 87 daily observations of ragweed pollen level and related information. The main interest is to develop accurate models for forecasting the daily ragweed pollen level. The raw response ragweed is the daily ragweed pollen level (grains/m3). Among the explanatory variables, x1 is an indicator of significant rain (x1 = 1 if there is at least 3 hours of steady rain or brief but intense rain, and x1 = 0 otherwise), x2 is temperature (°F), and x3 is wind speed (knots). The x-covariates are standardized first. Since the raw response is rather skewed, Ruppert et al. (2003) suggested the square root transformation y = √ragweed. Marginal plots suggest a strong nonlinear relationship between y and the day number in the current ragweed pollen season, so a semiparametric regression model with a nonparametric baseline f(day) is reasonable. Ruppert et al. (2003) fitted a semiparametric model with x1, x2 and x3, whereas we add quadratic and interaction terms and consider the more complex model
y = β1x1 + β2x2 + β3x3 + β4x2^2 + β5x3^2 + β6x1x2 + β7x1x3 + β8x2x3 + f(day) + ε.
The tuning parameter selected by BIC is λ2 = 0.177. Table 6.1 gives the DPLSE for the regression coefficients and their corresponding frequentist and Bayesian standard errors.
Table 6.1.
| Variable | Full model: Estimate | Full model: Freq. SE | Full model: Bayesian SE | Selected model: Estimate | Selected model: Freq. SE | Selected model: Bayesian SE |
|---|---|---|---|---|---|---|
| x1 | 0.64 | 0.22 | 0.23 | 0.70 | 0.18 | 0.18 |
| x2 | 1.31 | 0.37 | 0.39 | 1.16 | 0.36 | 0.37 |
| x3 | 0.87 | 0.19 | 0.20 | 0.76 | 0.19 | 0.20 |
| x2^2 | 0.53 | 0.23 | 0.24 | 0 | – | – |
| x3^2 | 0.04 | 0.19 | 0.19 | 0 | – | – |
| x1x2 | 0.26 | 0.19 | 0.19 | 0 | – | – |
| x1x3 | 0.02 | 0.22 | 0.23 | 0 | – | – |
| x2x3 | 0.34 | 0.20 | 0.20 | 0 | – | – |
For comparison, we also fitted the full model via a traditional partial smoothing spline with only the roughness penalty on f. Table 6.1 shows that the final fitted model is ŷ = f̂(day) + β̂1x1 + β̂2x2 + β̂3x3, indicating that the linear main-effects model suffices. All the estimated coefficients are positive, suggesting that the ragweed pollen level increases as each covariate increases. The shrinkage estimates have relatively smaller standard errors than those under the full model. Figure 6.1 depicts the estimated nonparametric function f̂(day) and its frequentist and Bayesian 95% pointwise confidence intervals. The plot indicates that the baseline f(day) climbs rapidly to its peak around day 25, drops quickly until about day 60, and decreases steadily thereafter.
7. Discussion
We propose a new regularization method for simultaneous variable selection and model estimation in partially linear models via double-penalized least squares. Under certain regularity conditions, the DPLSE β̂ is root-n consistent and has the oracle property. To facilitate computation, we reformulate the problem into a linear mixed model (LMM) framework, which allows us to estimate the smoothing parameter λ1 as an additional variance component instead of conducting the conventional two-dimensional grid search together with the other tuning parameter λ2. Another advantage of the LMM representation is that standard software can be used to implement the DPLS. Simulation studies show that the new method works effectively in terms of both variable selection and model estimation. We have derived both frequentist and Bayesian covariance formulas for the DPLSEs and empirical results favor the frequentist SE formulas for f(t). Furthermore, our empirical results suggest that the DPLSE is robust to the distributional assumption of errors, giving strong support for its application in general situations.
In this paper, we have studied the large sample properties of the new estimators when the dimension d either (i) is fixed or (ii) satisfies dn → ∞ as n → ∞ with dn < n. In future research we will investigate the properties and performance of our estimators in the more challenging situation d ≫ n. The major challenge will be to study how the convergence rate and asymptotic distributions of the linear components, in the presence of nuisance nonparametric components, are affected when d > n. Very recently, Ravikumar et al. (2008) and Meier et al. (2008) considered sparse estimation and function smoothing for additive models in high dimensional settings. We will investigate how these approaches can be adapted to our setting in future work.
The proposed DPLS method assumes that the errors are uncorrelated. In future research, we will generalize it to model selection for correlated data such as longitudinal data. Another interesting problem is model selection for generalized semiparametric models, e.g. E(Y) = g{Xβ + f(t)}, where g is a link function. In that case we will consider the double-penalized likelihood and investigate asymptotic properties for the resulting estimators.
Acknowledgments
The authors thank the editor, the associate editor, and two reviewers for their constructive comments and suggestions. The research of Hao Zhang was supported in part by National Science Foundation grant DMS-0645293 and by National Institutes of Health grant R01 CA085848-08. The research of Daowen Zhang was supported by National Institutes of Health grant R01 CA85848-08.
Appendix
Proofs
Proof of Lemma 1
Differentiating L(β) in Q(β) and evaluating at β0, we get:
(A1)  L′(β0) = −XT{I − A(λ1)}(Y − Xβ0),
(A2)  L″(β0) = XT{I − A(λ1)}X.
For the partially linear model, we have Y − Xβ0 = f + ε. Substitution into (A1) yields
(A3)  L′(β0) = −XT{I − A(λ1)}(f + ε).
Now the proof of Theorem 1 in Heckman (1986) and its four propositions can be used. Under the regularity conditions, we have that if λ1n → 0 and n2λ1n → ∞, then
(A4)  n−1XTA(λ1)X = op(1),
(A5)  n−1/2XT{I − A(λ1)}f = op(1).
Parts (a) and (b) are obtained by applying Slutsky’s theorem to (A1) and (A2).
To prove Theorems 3 and 4, we need the following lemma. Its proof can be derived in a similar fashion to Lemma 1 above and Theorem 1 of Heckman (1986). To save space, we only state the results and omit the proof. For any vector v, we use [v]i to denote its ith component. For any matrix G, we use [G]ij to denote its (i, j)th element.
Lemma 3
Under the regularity conditions (C1) and (C2), if λ1n → 0, then
(a) [L′(βn0)]i = Op(n1/2) for i = 1, …, dn;
(b) [L″(βn0)]ij = nRij + Op(n1/2) for i, j = 1, …, dn.
Proof of Theorem 3
Let cn = (dn/n)1/2. We need to show that for any given ε > 0, there exists a large constant C such that
(A6)  P{ inf over ||r|| = C of Q(βn0 + cnr) > Q(βn0) } ≥ 1 − ε.
Let Δn(r) = Q(βn0 + cnr) − Q(βn0). Recall that the first qn components of βn0 are nonzero, pλ2n(0) = 0 and pλ2n(·) is nonnegative. By Taylor's expansion, we have
Δn(r) ≥ cnL′(βn0)Tr + (1/2)cn2 rTL″(βn0)r + ncn Σj=1qn p′λ2n(|βn10,j|) sgn(βn10,j) rj + (ncn2/2) Σj=1qn p″λ2n(|βn10,j|) rj2 ≜ I1 + I2 + I3 + I4.
By Lemma 3(a), we have |I1| ≤ cn||L′(βn0)|| ||r|| = Op(cn(ndn)1/2)||r|| = Op(dn)||r||. By Lemma 3(b), under the regularity condition (C2), we have
I2 = (1/2)cn2{n rTRr + rT(L″(βn0) − nR)r} ≥ (c1/2)dn||r||2 + op(dn)||r||2;
the last relation is due to the dimension condition dn2/n → 0. With regard to I3 and I4, we have
|I3| ≤ ncn Σj=1qn p′λ2n(|βn10,j|) |rj|
and
|I4| ≤ (ncn2/2) Σj=1qn |p″λ2n(|βn10,j|)| rj2.
Under condition (C1), p′λ2n(|βn10,j|) = 0 and p″λ2n(|βn10,j|) = 0 for j = 1, …, qn when n is large enough, since λ2n → 0 and |βn10,j|/λ2n → ∞, so both I3 and I4 vanish for large n. Therefore, by allowing C to be large enough, all terms I1, I3, I4 are dominated by I2, which is positive. This proves (A6) and completes the proof.
Proof of Theorem 4
Let βn = (βn1T, βn2T)T. It suffices to show that, as n → ∞, with probability tending to 1, for any βn1 satisfying ||βn1 − βn10|| = Op((dn/n)1/2), any constant C > 0, and j = qn + 1, …, dn,
(A7)  ∂Q(βn)/∂βnj > 0 for 0 < βnj < C(dn/n)1/2,
(A8)  ∂Q(βn)/∂βnj < 0 for −C(dn/n)1/2 < βnj < 0.
By Taylor expansion and the fact that L(βn) is quadratic in βn, we get
∂Q(βn)/∂βnj = [L′(βn0)]j + Σk=1dn [L″(βn0)]jk(βnk − βn0,k) + np′λ2n(|βnj|) sgn(βnj).
By Lemma 3 and the regularity conditions (C1) and (C2), the sum of the first two terms on the right-hand side is of smaller order than nλ2n, so
∂Q(βn)/∂βnj = nλ2n{ λ2n−1 p′λ2n(|βnj|) sgn(βnj) + op(1) }.
Since lim inf over θ → 0+ of p′λ2n(θ)/λ2n > 0 for the SCAD penalty, we can see that the sign of ∂Q(βn)/∂βnj is totally determined by the sign of βnj. Therefore, (A7) and (A8) hold for j > qn, which leads to β̂n2 = 0. Combining this with the result of Theorem 3, there is a (n/dn)1/2-consistent local minimizer β̂n of Q(βn), and β̂n has the form (β̂n1T, 0T)T. This completes the proof.
Contributor Information
Xiao Ni, Email: xni@stat.ncsu.edu.
Hao Helen Zhang, Email: hzhang2@stat.ncsu.edu.
Daowen Zhang, Email: zhang@stat.ncsu.edu.
References
- Akaike H. Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika. 1973;60:255–265.
- Breiman L. Better subset selection using the nonnegative garrote. Technometrics. 1995;37:373–384.
- Bunea F. Consistent covariate selection and post model selection inference in semiparametric regression. Annals of Statistics. 2004;32:898–927.
- Bunea F, Wegkamp M. Two-stage model detection procedures in partially linear regression. The Canadian Journal of Statistics. 2004;32:105–118.
- Craven P, Wahba G. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik. 1979;31:377–403.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics. 2004;32:407–451.
- Engle R, Granger C, Rice J, Weiss A. Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association. 1986;81:310–386.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association. 2004;99:710–723.
- Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics. 2004;32:928–961.
- Green PJ. Penalized likelihood for general semi-parametric regression models. International Statistical Review. 1987;55:245–260.
- Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models. Chapman and Hall; London: 1994.
- Gu C. Smoothing Spline ANOVA Models. Springer; New York: 2002.
- Hastie T, Tibshirani RJ. Generalized Additive Models. Chapman and Hall; London: 1990.
- Heckman NE. Spline smoothing in a partly linear model. Journal of the Royal Statistical Society, Series B. 1986;48:244–248.
- Liang H. Estimation in partially linear models and numerical comparisons. Computational Statistics and Data Analysis. 2006;50:675–687. doi:10.1016/j.csda.2004.10.007.
- Linhart H, Zucchini W. Model Selection. Wiley; New York: 1986.
- Mallows C. Some comments on Cp. Technometrics. 1973;15:661–675.
- Meier L, Van de Geer S, Bühlmann P. High-dimensional additive modeling. 2008. http://arxiv.org/abs/0806.4115v1.
- Miller AJ. Subset Selection in Regression. Chapman and Hall; London: 2002.
- Ravikumar P, Liu H, Lafferty J, Wasserman L. SpAM: sparse additive models. Advances in Neural Information Processing Systems. 2008;20:1202–1208.
- Ruppert D, Sheather S, Wand M. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association. 1995;90:1257–1270.
- Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; Cambridge, New York: 2003.
- Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464.
- Speckman P. Kernel smoothing in partial linear models. Journal of the Royal Statistical Society, Series B. 1988;50:413–436.
- Speed T. Discussion of “BLUP is a good thing: the estimation of random effects” by G. K. Robinson. Statistical Science. 1991;6:15–51.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Wahba G. Bayesian “confidence intervals” for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B. 1983;45:133–150.
- Wahba G. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics. 1985;13:1378–1402.
- Wahba G. Spline Models for Observational Data. Society for Industrial and Applied Mathematics; 1990.
- Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi:10.1093/biomet/asm053.
- Zhang D, Lin X, Raz J, Sowers M. Semiparametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association. 1998;93:710–719.
- Zhang H, Lu W. Adaptive lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.