Published in final edited form as: Stat Sin. 2012 Oct 1;22(4):1403–1426. doi: 10.5705/ss.2010.298

Semiparametric Regression Pursuit

Jian Huang 1, Fengrong Wei 2, Shuangge Ma 3
PMCID: PMC3613788  NIHMSID: NIHMS332916  PMID: 23559831

Abstract

The semiparametric partially linear model allows flexible modeling of covariate effects on the response variable in regression. It combines the flexibility of nonparametric regression with the parsimony of linear regression. Existing estimation methods for this model assume a priori that it is known which covariates have a linear effect and which do not; in applied work, however, this is rarely known in advance. We consider the problem of estimation in partially linear models without assuming a priori which covariates have linear effects. We propose a semiparametric regression pursuit method for identifying the covariates with a linear effect. Our proposed method is a penalized regression approach using a group minimax concave penalty. Under suitable conditions we show that the proposed approach is model-pursuit consistent, meaning that with high probability it correctly determines which covariates have a linear effect and which do not. The performance of the proposed method is evaluated using simulation studies, which support our theoretical results. A real data example is used to illustrate the application of the proposed method.

Keywords: Group selection, Minimax concave penalty, Model-pursuit consistency, Penalized regression, Semiparametric models

1. Introduction

Suppose we have a random sample (yi, xi1, …, xip), 1 ≤ i ≤ n, where yi is the response variable and (xi1, …, xip) is a p-dimensional covariate vector. Consider the semiparametric partially linear model

$$y_i = \mu + \sum_{j\in S_1}\beta_j x_{ij} + \sum_{j\in S_2} f_j(x_{ij}) + \varepsilon_i, \quad 1 \le i \le n, \qquad (1)$$

where S1 and S2 are mutually exclusive and complementary subsets of {1, …, p}, {βj : j ∈ S1} are regression coefficients of the covariates with indices in S1, and {fj : j ∈ S2} are unknown functions. In this model, the mean response is linearly related to the covariates in S1, while its relation with the remaining covariates is not specified up to any finite number of parameters. This model combines the flexibility of nonparametric regression and parsimony of linear regression. When the relation between yi and {xij : j ∈ S1} is of main interest and can be approximated by a linear function, it offers more interpretability than a purely nonparametric additive model.

There is a large literature on estimation in partially linear models. Examples include the partial spline estimator (Wahba 1984; Engle, Granger, Rice and Weiss 1986; Heckman 1986), the partial residual estimator (Robinson 1988; Speckman 1988), and the polynomial spline estimator (Chen 1988). An excellent discussion of partially linear models can be found in the book by Härdle, Liang and Gao (2000), which also contains an extensive list of references on this model. A comprehensive treatment of general semiparametric theory and many related models can be found in Bickel, Klaassen, Ritov and Wellner (1993).

The most important assumption in the existing methods for estimation in partially linear models is that it is known a priori which covariates enter the model linearly and which do not. This assumption underlies both the construction of the estimators and the investigation of their theoretical properties in the existing methods. In applied work, however, it is rarely known in advance which covariates have linear effects and which have nonlinear effects.

Recently, Zhang, Cheng and Liu (2010) proposed a novel method for determining the zero, linear and nonlinear components in partially linear models. Their method is a two-step regularization method in the smoothing spline ANOVA framework. In the first step, they obtain an initial consistent estimator for the components in a nonparametric additive model, and then use the initial estimator as the weights in their proposed regularized smoothing spline method in a way similar to the adaptive Lasso (Zou 2006). They obtained the rate of convergence of their proposed estimator. They also showed that their method is selection consistent in the special case of tensor product design. However, they did not prove any selection consistency results for general partially linear models. Also, in their two-step approach, a total of four penalty parameters need to be selected, which may be difficult to implement in practice.

We consider the problem of estimation in partially linear models without assuming a priori which covariates have linear effects and which have nonlinear effects. We propose a semiparametric regression pursuit method for identifying the covariates with linear effects and those with nonlinear effects. We embed partially linear models into a nonparametric additive model. By approximating the nonparametric components using spline series expansions, we transform the problem of model specification into a group variable selection problem. We then determine the linear and nonlinear components with a penalized approach, using the minimax concave penalty (MCP, Zhang 2010) imposed on the norm of the coefficients in the spline expansion. We refer to this penalized approach as the group MCP method. We show that, under suitable conditions, the proposed approach is model-pursuit consistent, meaning that it can correctly determine which covariates have a linear effect and which do not with high probability. We allow for the possibility that the underlying true model is not partially linear; in that case, the proposed approach has the same asymptotic properties as the nonparametric estimator in the nonparametric additive model. We also show that the estimated coefficients of the linear effects are asymptotically normal, with the same limiting distribution as the estimator obtained when the true model is known in advance.

Some of the techniques used in this paper are similar to those in Huang, Horowitz and Wei (2010), which considers the problem of variable selection in nonparametric additive models. In particular, after transforming the present problem of model pursuit into a group selection problem based on spline approximation, some of the techniques used to obtain the rate of convergence of the group Lasso estimator for nonparametric additive models in Huang et al. (2010) can be applied here with some modifications; see the proof of Theorem 2 in the Appendix. However, the problem of model pursuit considered in this paper is very different from that in Huang et al. (2010). Also, here we use the group MCP rather than the group Lasso, which requires a different treatment at the technical level as well.

This article is organized as follows. In Section 2 we describe our proposed semiparametric regression pursuit (SRP) method. We transform the problem of identifying linear and nonlinear components into a group selection problem using the group MCP. In Section 3 we derive a group coordinate descent algorithm to implement the proposed method. In Section 4 we state the theoretical results concerning the selection and estimation properties of the proposed method. Section 5 includes simulation studies and an illustration of the proposed method on a data example. Proofs of the results stated in Section 4 are given in the Appendix.

2. Semiparametric regression pursuit via group minimax concave penalization

2.1. Method

The semiparametric partially linear model (1) can be embedded into the nonparametric additive model (Hastie and Tibshirani 1990),

$$y_i = \mu + f_1(x_{i1}) + \cdots + f_p(x_{ip}) + \varepsilon_i. \qquad (2)$$

Suppose that xij takes values in [a, b], where a < b are finite constants. To ensure unique identification of the fj’s, we assume that Efj(xij) = 0, 1 ≤ j ≤ p. If some of the fj’s are linear, then (2) becomes the partially linear additive model (1). The problem then becomes that of determining which fj’s have a linear form and which do not. For this purpose, we decompose fj into a linear part and a nonparametric part

$$f_j(x) = \beta_{0j} + \beta_j x + g_j(x).$$

Consider a truncated series expansion for approximating gj,

$$g_{nj}(x) = \sum_{k=1}^{m_n}\theta_{jk}\phi_k(x), \qquad (3)$$

where φ1, …, φmn are basis functions and mn → ∞ at a certain rate as n → ∞. If θjk = 0, 1 ≤ k ≤ mn, then fj has a linear form. Therefore, with this formulation, the problem now is to determine which groups of {θjk, 1 ≤ k ≤ mn} are zero.

Let β = (β1, …, βp)′ and θn = (θ′1n, …, θ′pn)′, where θjn = (θj1, …, θjmn)′. Define the penalized least squares criterion

$$L(\mu,\beta,\theta_n;\lambda,\gamma) = \frac{1}{2n}\sum_{i=1}^{n}\Big(y_i - \mu - \sum_{j=1}^{p}x_{ij}\beta_j - \sum_{j=1}^{p}\sum_{k=1}^{m_n}\theta_{jk}\phi_k(x_{ij})\Big)^2 + \sum_{j=1}^{p}\rho_\gamma\big(\|\theta_{jn}\|_{A_j};\,m_n\lambda\big), \qquad (4)$$

where ρ is a penalty function depending on the penalty parameter λ ≥ 0 and a regularization parameter γ. Here, without causing confusion, we still use μ to denote the intercept. The norm ‖θjn‖Aj = (θ′jn Aj θjn)1/2 for a given positive definite matrix Aj. In principle, any positive definite matrix can be used as Aj, since ‖θjn‖Aj = 0 if and only if θjn = 0 as long as Aj is positive definite. However, it is important to choose Aj suitably, so that the amount of penalization is comparable across the groups and the computation is facilitated. We will specify Aj in (9) below.

We use the minimax concave penalty, or MCP introduced by Zhang (2010). This penalty function is defined by

$$\rho_\gamma(t;\lambda) = \lambda\int_0^t\big(1 - x/(\gamma\lambda)\big)_+\,dx, \quad t \ge 0, \qquad (5)$$

where γ is a parameter that controls the concavity of ρ and λ is the penalty parameter. Here x+ denotes the nonnegative part of x, that is, x+ = x1{x≥0}. We require λ ≥ 0 and γ > 1. The term MCP comes from the fact that it minimizes the maximum concavity measure defined in (2.2) of Zhang (2010) subject to conditions on unbiasedness and selection features. The MCP can be easily understood by considering its derivative

$$\dot\rho_\gamma(t;\lambda) = \lambda\big(1 - t/(\gamma\lambda)\big)_+, \quad t \ge 0. \qquad (6)$$

It begins by applying the same rate of penalization as the Lasso, then continuously relaxes that penalization until, when t > γλ, the rate of penalization drops to 0. It thus provides a continuum of penalties, with the ℓ1 (Lasso) penalty as the special case γ = ∞ and the hard-thresholding penalty as the limit γ → 1+. Detailed discussions of the MCP can be found in Zhang (2010).
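As a concrete illustration, the short sketch below (our code, not the authors'; the function names mcp_penalty and mcp_derivative are ours) evaluates the MCP and its derivative using the closed form implied by (5)-(6): for t ≤ γλ the integral equals λt − t²/(2γ), and it is constant at γλ²/2 beyond that point.

import numpy as np

def mcp_penalty(t, lam, gamma):
    # rho_gamma(t; lam) = lam * int_0^t (1 - x/(gamma*lam))_+ dx, t >= 0
    t = np.asarray(t, dtype=float)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

def mcp_derivative(t, lam, gamma):
    # rho_dot_gamma(t; lam) = lam * (1 - t/(gamma*lam))_+, which is the Lasso rate lam near 0
    t = np.asarray(t, dtype=float)
    return lam * np.clip(1.0 - t / (gamma * lam), 0.0, None)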

The penalty in (4) is a composite of the penalty function ργ(·; λ) and a weighted ℓ2-norm of θjn. The function ργ(·; λ) is a penalty for individual variable selection. When it is applied to a norm of θjn, it selects the coefficients in θjn as a group. This is desirable, since the nonlinear components are represented by the coefficients in the θjn’s as groups. Based on the definition of the penalty in (4), it is natural to call it the group minimax concave penalty, or group MCP.

For a given (λ, γ), the penalized least squares solution is defined by

$$(\hat\mu_n,\hat\beta_n,\hat\theta_n) = \arg\min_{\mu,\beta,\theta_n} L(\mu,\beta,\theta_n;\lambda,\gamma),$$

subject to the constraints

$$\sum_{i=1}^n\sum_{k=1}^{m_n}\theta_{jk}\phi_k(x_{ij}) = 0, \quad 1 \le j \le p. \qquad (7)$$

These centering constraints are sample analogs of the identifying restrictions Efj(xij) = 0, 1 ≤ i ≤ n, 1 ≤ j ≤ p.

We convert (7) to an unconstrained optimization problem by centering the response and the covariate functions. Specifically, we center the responses and covariates and standardize the covariates by imposing

$$\sum_{i=1}^n y_i = 0, \qquad \sum_{i=1}^n x_{ij} = 0 \quad\text{and}\quad \sum_{i=1}^n x_{ij}^2 = n.$$

We also center the basis functions. Let

$$\bar\phi_{jk} = \frac{1}{n}\sum_{i=1}^n\phi_k(x_{ij}), \qquad \psi_{jk}(x) = \phi_k(x) - \bar\phi_{jk}. \qquad (8)$$

Define

$$Z_{ij} = \big(\psi_{j1}(x_{ij}),\ldots,\psi_{jm_n}(x_{ij})\big)'.$$

So Zij consists of the centered basis functions evaluated at the ith observation of the jth covariate. Let Z = (Z1, …, Zp), where Zj = (Z1j, …, Znj)′ is the n × mn ‘design’ matrix corresponding to the jth expansion. Let y = (y1, …, yn)′, xj = (x1j, …, xnj)′ and X = (x1, …, xp). We can write

$$(\hat\beta_n,\hat\theta_n) = \arg\min_{\beta,\theta_n}\Big\{L(\beta,\theta_n;\lambda,\gamma) = \frac{1}{2n}\|y - X\beta - Z\theta_n\|^2 + \sum_{j=1}^p\rho_\gamma\big(\|\theta_{nj}\|_{A_j};\,m_n\lambda\big)\Big\}.$$

Here we dropped μ from the arguments of L, since the intercept is zero due to centering. With the centering, the constrained optimization problem becomes an unconstrained one.
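The centering and standardization step is straightforward to carry out in practice. The following minimal sketch (our code and names, not the authors' implementation) centers y, centers the covariates and scales each so that the sum of squares equals n, and column-centers an n × mn basis matrix as in (8).

import numpy as np

def center_and_standardize(y, X):
    # Center y; center each column of X and scale so that sum_i x_ij^2 = n.
    n = len(y)
    y_c = y - y.mean()
    X_c = X - X.mean(axis=0)
    X_c = X_c / np.sqrt((X_c ** 2).sum(axis=0) / n)
    return y_c, X_c

def center_basis(Phi):
    # psi_jk(x_ij) = phi_k(x_ij) - phibar_jk, applied column-wise to an n x m_n matrix.
    return Phi - Phi.mean(axis=0)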

2.2 Penalized profile least squares

To compute (β̂n, θ̂n), we can use a penalized profile least squares approach. For any given θn, the β̂ that minimizes L necessarily satisfies

$$X'(y - X\beta - Z\theta_n) = 0.$$

Thus β = (X′X)−1X′(y − Zθn). Let Q = I − PX, where PX = X(X′X)−1X′ is the projection matrix onto the column space of X. The profile objective function of θn is

$$L(\theta_n;\lambda,\gamma) = \frac{1}{2n}\|Q(y - Z\theta_n)\|^2 + \sum_{j=1}^p\rho_\gamma\big(\|\theta_{nj}\|_{A_j};\,m_n\lambda\big). \qquad (9)$$

As noted above, any positive definite matrix can be used for Aj. Here we use Aj = Z′jQZj/n. The rationale for this choice is based on the following considerations. First, in the profile objective function (9), the covariate matrix for group j is QZj. The Gram matrix associated with it is Z′jQ′QZj/n = Z′jQZj/n = Aj, since Q is symmetric and idempotent. Although the original covariates xij are standardized, the covariate matrices for the groups are not necessarily so. Therefore, this choice of Aj standardizes the covariate matrices associated with the θnj’s and makes the amount of penalization comparable across the groups. Second, it leads to explicit expressions in the update steps of the group coordinate descent algorithm described below. This facilitates the implementation of the algorithm, since the computation in each update step can be carried out in closed form. For any given (λ, γ), the penalized profile least squares solution is defined by θ̂n = arg minθn L(θn; λ, γ). We compute θ̂n using a group coordinate descent algorithm described in Section 3.
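For completeness, a small sketch of these profile quantities (our code; the function name and argument layout are ours, and centered inputs are assumed). It forms Q = I − PX, the Gram matrices Aj = Z′jQZj/n, their Cholesky factors Rj with Aj = R′jRj, and the transformed group designs Z̃j = QZjRj−1 that are used in Section 3.

import numpy as np

def profile_quantities(X, Z_list):
    # X: n x p matrix of centered/standardized covariates; Z_list: list of n x m_n centered basis matrices.
    n = X.shape[0]
    P_X = X @ np.linalg.solve(X.T @ X, X.T)          # projection onto the column space of X
    Q = np.eye(n) - P_X                              # Q = I - P_X (symmetric, idempotent)
    R_list, Zt_list = [], []
    for Z in Z_list:
        A = Z.T @ Q @ Z / n                          # A_j = Z_j' Q Z_j / n
        R = np.linalg.cholesky(A).T                  # upper triangular R_j with A_j = R_j' R_j
        R_list.append(R)
        Zt_list.append(Q @ Z @ np.linalg.inv(R))     # Z~_j = Q Z_j R_j^{-1}, so Z~_j' Z~_j / n = I
    return Q, R_list, Zt_list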

The set of indices of the covariates that are estimated to have the linear form in the regression model (1) is Ŝ1 ≡ {j: ||θ̂nj|| = 0}. Thus,

$$\hat g_{nj}(x) = 0,\ \ j\in\hat S_1 \qquad\text{and}\qquad \hat g_{nj}(x) = \sum_{k=1}^{m_n}\hat\theta_{jk}\psi_{jk}(x),\ \ j\notin\hat S_1.$$

Denote X̂(1) = (xj : j ∈ Ŝ1), Ẑ(2) = (Zj : j ∉ Ŝ1) and θ̂n(2) = (θ̂′nj : j ∉ Ŝ1)′. We have β̂n = (X′X)−1X′(y − Ẑ(2)θ̂n(2)). The estimator of the coefficients of the linear components is β̂n1 = (β̂j : j ∈ Ŝ1)′. Let

$$\hat f_{nj}(x) = \hat\beta_j x + \hat g_{nj}(x), \quad j\notin\hat S_1.$$

Denote f̂nj(xj) = (f̂nj(x1j), …, f̂nj(xnj))′. Then the estimator of the coefficient vector of the linear components can also be written as

$$\hat\beta_{n1} = \big(\hat X_{(1)}'\hat X_{(1)}\big)^{-1}\hat X_{(1)}'\Big(y - \sum_{j\notin\hat S_1}\hat f_{nj}(x_j)\Big).$$

2.3 Spline approximation

We use polynomial splines to approximate the nonparametric components gj, 1 ≤ j ≤ p. Let a = t0 < t1 < ··· < tK < tK+1 = b be a partition of [a, b] into K subintervals IKk = [tk, tk+1), k = 0, …, K − 1, and IKK = [tK, tK+1], where K ≡ Kn = O(nv), with 0 < v < 0.5, is a positive integer such that max1≤k≤K+1 |tk − tk−1| = O(n−v). Let Sn be the space of polynomial splines of degree l ≥ 1 consisting of functions s satisfying: (i) the restriction of s to IKk is a polynomial of degree l for 1 ≤ k ≤ K; (ii) for l ≥ 2 and 0 ≤ l′ ≤ l − 2, s is l′ times continuously differentiable on [a, b] (Schumaker 1981). There exist normalized B-spline basis functions {φk, 1 ≤ k ≤ mn} for Sn, where mn ≡ Kn + l (Schumaker 1981). We can use these basis functions in the approximation (3).
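A B-spline design matrix of this kind can be evaluated, for example, with scipy; the sketch below is ours (equally spaced interior knots on [a, b] are one convenient choice consistent with the partition described above, not the only one). With n_interior = 3 and degree = 3 it produces the seven cubic basis functions used in Section 5.1.

import numpy as np
from scipy.interpolate import splev

def bspline_design(x, a=0.0, b=1.0, n_interior=3, degree=3):
    # With K = n_interior + 1 subintervals this gives m_n = K + degree basis functions,
    # i.e. m_n = n_interior + degree + 1 (e.g. 3 + 3 + 1 = 7 cubic splines).
    interior = np.linspace(a, b, n_interior + 2)[1:-1]
    knots = np.concatenate([[a] * (degree + 1), interior, [b] * (degree + 1)])
    m = len(knots) - degree - 1
    Phi = np.empty((len(x), m))
    for k in range(m):
        coef = np.zeros(m)
        coef[k] = 1.0                                 # k-th basis function = spline with unit coefficient
        Phi[:, k] = splev(np.asarray(x, dtype=float), (knots, coef, degree))
    return Phi

The columns of Phi would then be centered as in (8) before being used as the group design matrices Zj.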

3. Computation

We derive a group coordinate descent algorithm for computing θ̂n. This algorithm is a natural extension of the standard coordinate descent algorithm (Fu 1998; Friedman et al. 2007; Wu and Lange 2007) used in optimization problems with convex penalties such as the Lasso. It has also been used in calculating the penalized estimates based on concave penalty functions (Breheny and Huang 2010).

The group coordinate descent algorithm optimizes a target function with respect to a single group at a time, iteratively cycling through all groups until convergence is reached. This algorithm is particularly suitable for computing θ̂n, since it has a simple closed form expression for a single-group model as given in (10) below.

We write Aj = R′jRj for an mn × mn upper triangular matrix Rj via the Cholesky decomposition. Let bj = Rjθnj, ỹ = Qy and Z̃j = QZjRj−1. Simple algebra shows that

$$L(b;\lambda,\gamma) = \frac{1}{2n}\Big\|\tilde y - \sum_{j=1}^p\tilde Z_j b_j\Big\|^2 + \sum_{j=1}^p\rho_\gamma\big(\|b_j\|;\,m_n\lambda\big).$$

Note that n−1Z̃′jZ̃j = (R′j)−1(n−1Z′jQZj)Rj−1 = Imn. Let ỹj = ỹ − Σk≠j Z̃kbk. Denote

$$L_j(b_j;\lambda,\gamma) = \frac{1}{2n}\|\tilde y_j - \tilde Z_j b_j\|^2 + \rho_\gamma\big(\|b_j\|;\,m_n\lambda\big).$$

Let ηj = (Z̃′jZ̃j)−1Z̃′jỹj = n−1Z̃′jỹj. For γ > 1, it can be verified that the value that minimizes Lj with respect to bj is

$$b_{j,GM}(\lambda,\gamma) = M(\eta_j;\,m_n\lambda,\gamma) \equiv \begin{cases} 0, & \text{if } \|\eta_j\| \le m_n\lambda,\\[4pt] \dfrac{\gamma}{\gamma-1}\Big(1 - \dfrac{m_n\lambda}{\|\eta_j\|}\Big)\eta_j, & \text{if } m_n\lambda < \|\eta_j\| \le \gamma m_n\lambda,\\[4pt] \eta_j, & \text{if } \|\eta_j\| > \gamma m_n\lambda. \end{cases} \qquad (10)$$

In particular, when γ = ∞, we have

$$b_{j,GL} = \Big(1 - \frac{m_n\lambda}{\|\eta_j\|}\Big)_+\eta_j,$$

which is the group Lasso estimate for a single-group model (Yuan and Lin 2006).
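The single-group rule (10) can be transcribed directly into code; the sketch below is ours (the argument thr stands for the group-level threshold written as mnλ above, and gamma = infinity recovers the group-Lasso rule).

import numpy as np

def group_threshold(eta, thr, gamma):
    # M(eta; thr, gamma): firm thresholding of the vector eta at level thr, with concavity gamma > 1.
    nrm = np.linalg.norm(eta)
    if nrm <= thr:
        return np.zeros_like(eta)
    if np.isinf(gamma):                      # group-Lasso limit (gamma = infinity)
        return (1.0 - thr / nrm) * eta
    if nrm <= gamma * thr:
        return (gamma / (gamma - 1.0)) * (1.0 - thr / nrm) * eta
    return eta.copy()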

With the above expressions, the group coordinate descent algorithm can be implemented as follows. Suppose the current values for the group coefficients b(s)k, k ≠ j, are given. We want to minimize L with respect to bj. Define

$$L_j(b_j;\lambda,\gamma) = \frac{1}{2n}\Big\|\tilde y - \sum_{k\ne j}\tilde Z_k b_k^{(s)} - \tilde Z_j b_j\Big\|^2 + \rho_\gamma\big(\|b_j\|;\,m_n\lambda\big).$$

Denote ỹj = Σk≠j Z̃kb(s)k and ηj = n−1Z̃′j(ỹ − ỹj). Let b̃j denote the minimizer of Lj(bj; λ, γ). When γ > 1, we have b̃j = M(ηj; mnλ, γ), where M is defined in (10).

For any given (λ, γ), we use (10) to cycle through one component at a time. Let b(0) = (b(0)1, …, b(0)p) be the initial value. The proposed coordinate descent algorithm is as follows.

Initialize the vector of residuals r = ỹ − ỹ(0), where ỹ(0) = Σj=1p Z̃jb(0)j. For s = 0, 1, …, carry out the following calculation until convergence. For j = 1, …, p, repeat the following steps:

  1. Calculate ηj = n−1Z̃′jr + b(s)j.

  2. Update b(s+1)j = M(ηj; mnλ, γ).

  3. Update r ← r − Z̃j(b(s+1)j − b(s)j) and j ← j + 1.

The last step ensures that r always holds the current values of the residuals. Although the objective function is not necessarily convex, it is convex with respect to a single group when the coefficients of all the other groups are fixed. Thus, Theorem 5.1 of Tseng (2001) implies that the group coordinate descent algorithm described above always converges.
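Putting the pieces together, the following is a minimal sketch of the iteration (our code, not the authors' implementation). It assumes the orthonormalized group designs Z̃j and the profiled response ỹ = Qy from Section 2.2, uses a zero initial value and a simple convergence check, and repeats the firm-thresholding rule of (10) so that the block is self-contained.

import numpy as np

def _threshold(eta, thr, gamma):
    # Single-group minimizer M(eta; thr, gamma) from (10).
    nrm = np.linalg.norm(eta)
    if nrm <= thr:
        return np.zeros_like(eta)
    if np.isinf(gamma):
        return (1.0 - thr / nrm) * eta
    if nrm <= gamma * thr:
        return (gamma / (gamma - 1.0)) * (1.0 - thr / nrm) * eta
    return eta.copy()

def group_coordinate_descent(y_tilde, Zt_list, thr, gamma, max_iter=500, tol=1e-6):
    n = len(y_tilde)
    b = [np.zeros(Z.shape[1]) for Z in Zt_list]                # start from zero
    r = y_tilde - sum(Z @ bj for Z, bj in zip(Zt_list, b))     # current residuals
    for _ in range(max_iter):
        max_change = 0.0
        for j, Z in enumerate(Zt_list):
            eta = Z.T @ r / n + b[j]                           # eta_j = n^{-1} Z~_j' r + b_j
            b_new = _threshold(eta, thr, gamma)                # group update via (10)
            r -= Z @ (b_new - b[j])                            # keep residuals current
            max_change = max(max_change, float(np.max(np.abs(b_new - b[j]))))
            b[j] = b_new
        if max_change < tol:
            break
    return b

Groups with nonzero fitted coefficient vectors are then declared nonlinear, mirroring Ŝ1 = {j: ‖θ̂nj‖ = 0} in Section 2.2.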

4. Theoretical properties

We present the results on the model-pursuit consistency, rate of convergence and asymptotic normality of the proposed SRP estimator. In particular, our model-pursuit consistency result shows that the proposed method can correctly determine the linear and nonlinear components in the partially linear model with high probability.

Denote the underlying regression components by f0j and write

$$f_{0j}(x) = \beta_{0j}x + g_{0j}(x).$$

Suppose the series expansion for approximating g0j is

$$g_{0j}(x) = \sum_{k=1}^{m_n}\theta_{0jk}\phi_k(x).$$

Let θ0jn = (θ0j1, …, θ0jmn)′. Denote ‖g‖2 = (∫ab g2(x)dx)1/2 for any square integrable function g on [a, b]. We have S1 = {j: ‖g0j‖2 = 0} and ‖θ0nj‖ = 0 for j ∈ S1. Let θ0n = (θ′0n1, …, θ′0np)′.

Let q = |S1| be the cardinality of S1, which is the number of linear components in the regression model. Define

$$\tilde\theta_n = \arg\min_{\theta_n}\Big\{\frac{1}{2n}\|Q(y - Z\theta_n)\|^2:\ \theta_{nj} = 0,\ j\in S_1\Big\}. \qquad (11)$$

This is the oracle estimator of θ0n assuming the identity of the linear components were known. We note that the oracle estimator is not computable since S1 is unknown. We use it as the benchmark for our proposed estimator.

Analogous to the actual estimates defined at the end of Section 2.2, define the oracle estimators

$$\tilde g_{nj}(x) = 0,\ \ j\in S_1 \qquad\text{and}\qquad \tilde g_{nj}(x) = \sum_{k=1}^{m_n}\tilde\theta_{jk}\psi_{jk}(x),\ \ j\notin S_1.$$

Denote X(1) = (xj : j ∈ S1), X(2) = (xj : j ∈ S2) and θ̃n(2) = (θ̃′nj : j ∈ S2)′. Let

$$\tilde f_{nj}(x) = \tilde\beta_j x + \tilde g_{nj}(x), \quad j\in S_2.$$

Denote f̃nj(xj) = (f̃nj(x1j), …, f̃nj(xnj))′. The oracle estimator of the coefficients of the linear components is

$$\tilde\beta_{n1} = \big(X_{(1)}'X_{(1)}\big)^{-1}X_{(1)}'\Big(y - \sum_{j\in S_2}\tilde f_{nj}(x_j)\Big).$$

Without loss of generality, suppose that S1 = {1, …, q}. Write θ̃n = (0′qmn, θ̃′n(2))′, where 0qmn is a (qmn)-dimensional vector of zeros and

$$\tilde\theta_{n(2)} = \big(Z_{(2)}'QZ_{(2)}\big)^{-1}Z_{(2)}'Qy. \qquad (12)$$

Define θ* = minj∉S1 ‖θ0nj‖, which is the smallest norm of the coefficients in the spline expansions of the nonlinear components.

Let k be a non-negative integer, and let α ∈ (0, 1] be such that d = k + α > 0.5. Let ℱ be the class of functions g on [0, 1] whose kth derivative g(k) exists and satisfies a Lipschitz condition of order α:

$$|g^{(k)}(s) - g^{(k)}(t)| \le C|s - t|^{\alpha} \quad\text{for } s, t \in [a, b].$$

Define ‖g‖2 = [∫ab g2(x)dx]1/2 for any function g, whenever the integral exists.

We make the following assumptions.

(A1) p and q are fixed, and ε1, …, εn are independent and identically distributed with Eεi = 0 and Var(εi) = σ2. Furthermore, P(|εi| > x) ≤ K exp(−Cx2), i = 1, …, n, for all x ≥ 0 and some constants C and K.

(A2) Egj(xj) = 0 and gj ∈ ℱ, j = q + 1, …, p.

(A3) The covariate vector X has a continuous density, and there exist constants C1 and C2 such that the density function ηj of xj satisfies 0 < C1 ≤ ηj(x) ≤ C2 < ∞ on [a, b] for every 1 ≤ j ≤ p.

Theorem 1

Suppose that mn = O(n1/(2d+1)), that 1/(mnγ) is less than the smallest eigenvalue of Z′QZ/n, and that

$$\frac{1}{m_n^{(2d-1)/2}(\theta_* - \gamma\lambda)} + \frac{1}{\lambda\sqrt{n}} \to 0. \qquad (13)$$

Then under (A1)–(A3),

$$P(\hat\theta_n \ne \tilde\theta_n) \to 0.$$

Consequently,

$$P(\hat S_1 = S_1) \to 1, \qquad P(\hat\beta_{n1} = \tilde\beta_{n1}) \to 1, \qquad\text{and}\qquad P\big(\|\hat f_{nj} - \tilde f_{nj}\|_2 = 0,\ j\in S_2\big) \to 1.$$

Therefore, under the conditions of Theorem 1, the proposed estimator can correctly distinguish linear and nonlinear components with high probability. Furthermore, the proposed estimator has the oracle property in the sense that it is the same as the oracle estimator assuming the identity of the linear and nonlinear components were known, except on an event with probability tending to zero.

We note that, except for the assumption on the tail probabilities in (A1), conditions (A1)–(A3) are standard for nonparametric additive models. They would be needed to estimate the additive components at the optimal ℓ2 rate of convergence in the standard nonparametric additive model setting. The main extra condition needed here is (13), which requires λ = o(n−1/2) and θ* > γλ + an mn−(2d−1)/2 for some an → ∞ simultaneously. The first part of this requirement ensures that the bias resulting from the penalty is small so that it does not interfere with selection, and the second part requires that the smallest norm θ* of the coefficients in the spline expansions of the (nonzero) nonlinear components be larger than the penalty level plus a term due to the spline approximation error.

Theorem 2

Suppose (A1)–(A3) hold. Under model (2), we have

$$\sum_{j=1}^p\|\hat f_{nj} - f_{0j}\|_2^2 = O_p\Big(\frac{m_n}{n}\Big) + O\Big(\frac{1}{m_n^{2d}}\Big) + O(m_n\lambda^2).$$

This theorem gives the rate of convergence of the proposed estimator under the nonparametric additive model (2), which contains the partially linear models as special cases. In particular, if we assume that each component in (2) is second-order differentiable (d = 2) and take mn = O(n1/5) and λ = n−1/2+δ for a small δ > 0, then Σj ‖f̂nj − f0j‖22 = Op(n−4/5), which is the optimal rate of convergence in nonparametric regression.
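For the reader's convenience, the arithmetic behind this choice (with d = 2, mn ≍ n1/5 and λ = n−1/2+δ) is

$$\frac{m_n}{n} \asymp n^{1/5 - 1} = n^{-4/5}, \qquad \frac{1}{m_n^{2d}} \asymp n^{-4/5}, \qquad m_n\lambda^2 \asymp n^{1/5}\,n^{-1+2\delta} = n^{-4/5+2\delta},$$

so the three terms in the bound of Theorem 2 are all of order n−4/5 up to the factor n2δ, which is negligible for small δ.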

We now consider the asymptotic distribution of β̂n1. Denote

$$H_j = \big\{h_j = (h_{jk}: k\in S_1)':\ Eh_{jk}^2(x_j) < \infty,\ Eh_{jk}(x_j) = 0\big\}, \quad j\in S_2.$$

Each element of Hj is a |S1|-vector of square integrable functions with mean zero. Denote the sumspace

$$H = \Big\{h = \sum_{j\in S_2}h_j:\ h_j\in H_j\Big\}.$$

The projection of the centered covariate vector x(1) − E(x(1)) ∈ Rq onto the sumspace H is defined to be the h* = (h*j : j ∈ S2) with Eh*j(xj) = 0, j ∈ S2, that minimizes

$$W(h) \equiv E\Big\|x_{(1)} - E(x_{(1)}) - \sum_{j\in S_2}h_j(x_j)\Big\|^2. \qquad (14)$$

For x(2) = (xj : j ∈ S2), denote

$$h^*(x_{(2)}) = \sum_{j\in S_2}h_j^*(x_j). \qquad (15)$$

Under condition (A3), by Lemma 1 of Stone (1985) and Proposition 2 in Appendix 4 of Bickel, Klaassen, Ritov and Wellner (1993), the sumspace H is closed. Thus the orthogonal projection h* onto H is well defined and unique. Furthermore, each individual component h*j is also well defined and unique. In addition to (A1)–(A3), we also need the following condition for the asymptotic normality of the linear component estimator.

(A4) Let w ≥ 1 be a positive integer. The wth partial derivatives of the joint density of x(2) = (xj : j ∈ S2) are bounded by a constant, and the qth derivative of each component of ξj(v) = E(x(1)|xj = v), j ∈ S2, is bounded by a constant.

Let A = E[x(1) − E(x(1)) − h*(x(2))]⊗2, where h* is defined in (15). Here x⊗2 = xx′ for any column vector x ∈ Rd.

Theorem 3

Suppose that the conditions in Theorem 1 and (A4) are satisfied and that A is nonsingular. Then,

$$n^{1/2}\big(\hat\beta_{n1} - \beta_{(1)}\big) \to_d N(0,\Sigma),$$

where β(1) = (βj : j ∈ S1)′ and Σ = σ2A−1.

Theorem 3 provides sufficient conditions under which the proposed estimator β̂n1 of the linear components in the model is asymptotically normal with the same limiting normal distribution as the oracle estimator β̃n1.

5. Numerical studies

5.1 Simulation studies

We use simulation to evaluate the finite sample performance of the proposed method. Two examples are considered in the simulation. In each of the simulated models, two sample sizes (n=100, 200) are considered and a total of 100 replications are conducted. Consider the following six functions defined on [0, 1]:

$$\begin{aligned}
f_1(x) &= x, \qquad f_2(x) = \frac{\sin(2\pi x)}{2 - \sin(2\pi x)}, \qquad f_3(x) = 0.1\sin(2\pi x) + 0.2\cos(2\pi x) + 0.3\sin^2(2\pi x) + 0.4\cos^3(2\pi x) + 0.5\sin^3(2\pi x),\\
f_4(x) &= (3x - 1)^2, \qquad f_5(x) = \frac{\cos(2\pi x)}{2 - \cos(2\pi x)}, \qquad f_6(x) = 0.1\cos(2\pi x) + 0.2\sin(2\pi x) + 0.3\cos^2(2\pi x) + 0.4\sin^3(2\pi x) + 0.5\cos^3(2\pi x).
\end{aligned}$$

In the implementation, we use cubic B-splines with seven basis functions to approximate each function.

Example 1

Let p = 6. Consider the model

$$y = 3f_1(x_1) + 4f_1(x_2) - 2f_1(x_3) + 8f_2(x_4) + 6f_3(x_5) + 5f_4(x_6) + \varepsilon.$$

In this model, the first three variables have linear effects and the last three have nonlinear effects. The p covariates are simulated in the following way. First we simulate w1, …, wp and u independently from U[0, 1]. Then xik = (wk + u)/2 for k = 1, …, p. The correlation among the predictors is Corr(xij, xik) = 0.5 for j ≠ k. The error term ε is drawn from N(0, 1.57^2) to give a signal-to-noise ratio of 3.
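A sketch of this data-generating mechanism follows (our code; the function name, seed handling, and the reconstructed minus sign on the x3 term are ours).

import numpy as np

def simulate_example1(n=100, p=6, sigma=1.57, seed=None):
    rng = np.random.default_rng(seed)
    w = rng.uniform(size=(n, p))
    u = rng.uniform(size=(n, 1))
    x = (w + u) / 2.0                                   # pairwise correlation 0.5 among predictors
    s = lambda t: np.sin(2 * np.pi * t)
    c = lambda t: np.cos(2 * np.pi * t)
    f1 = lambda t: t
    f2 = lambda t: s(t) / (2.0 - s(t))
    f3 = lambda t: 0.1 * s(t) + 0.2 * c(t) + 0.3 * s(t) ** 2 + 0.4 * c(t) ** 3 + 0.5 * s(t) ** 3
    f4 = lambda t: (3.0 * t - 1.0) ** 2
    y = (3 * f1(x[:, 0]) + 4 * f1(x[:, 1]) - 2 * f1(x[:, 2])
         + 8 * f2(x[:, 3]) + 6 * f3(x[:, 4]) + 5 * f4(x[:, 5])
         + sigma * rng.normal(size=n))
    return x, y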

Example 2

Let p = 10. Consider the model

$$y = 3f_1(x_1) + 4f_1(x_2) - f_1(x_3) - f_1(x_4) + 2f_1(x_5) + 5f_2(x_6) + 4f_3(x_7) + 5f_4(x_8) + 5f_5(x_9) + 4f_6(x_{10}) + \varepsilon.$$

In this model, the first 5 components are linear and the remaining 5 are nonlinear. The covariates are simulated in the same way as in Example 1. The error term ε ~ N(0, 1.80^2), which gives a signal-to-noise ratio of 3.

The group coordinate descent algorithm described in Section 3 is used repeatedly to compute θ̂n over a grid of (λ, γ) values in a rectangle [λmax, λmin] × [γmax, γmin]. Here λmax = max1≤j≤p ‖n−1Z̃′jỹ‖, which is the smallest value of λ that forces all the solutions to be zero, and we take λmin = 0.0001λmax. We use a set of 100 equally spaced grid points on the logarithmic scale in [λmax, λmin]. For the γ parameter in the group MCP, we consider a grid of equally spaced points in the interval [γmax, γmin] = [8.0, 1.1] with grid size 0.1. We note that Zhang (2010) suggested using γ = 2.7 for standardized covariates in linear regression. In our simulation studies, we found that the value of γ also has considerable impact on the results. Thus, instead of using a fixed γ value, we consider a range of γ values.

For the group Lasso, which can be considered a special case of the group MCP with γ= ∞, the algorithm starts at λmax where θ̂n equals 0 and proceeds along the grid values of λ, using the previous solution as the initial value at each grid point. For the group MCP, for each value of λ in the λ-grid and the corresponding initial value from the group Lasso, the algorithm proceeds along the grids of γ in [8.0, 1.1], that is, for each λ grid value, we start the algorithm at γ = 8 using the group Lasso solution as the initial value. This approach follows that of Mazumder, Friedman and Hastie (2009). We then apply the BIC (Schwarz 1978) to select (λ, γ). Here the BIC is defined as

$$\mathrm{BIC}(\lambda,\gamma) = \log(\mathrm{RSS}_{\lambda,\gamma}) + \frac{\log n}{n}\, m_n\, \mathrm{df}_{\lambda,\gamma},$$

where RSSλ,γ is the residual sum of squares and dfλ,γ is the number of nonzero groups selected for a given (λ, γ). Recall that mn is the number of spline basis functions in (3). The optimal value of (λ, γ) is chosen to be the one that minimizes the BIC.
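A schematic version of this tuning loop is sketched below (our code). It reuses the group_coordinate_descent function sketched in Section 3, uses cold starts rather than the warm-start path described above, and passes λ directly as the group-level threshold, so any group-size rescaling of the penalty level is left to the caller.

import numpy as np

def select_by_bic(y_tilde, Zt_list, m_n, n_lambda=100, gammas=np.arange(8.0, 1.05, -0.1)):
    n = len(y_tilde)
    lam_max = max(np.linalg.norm(Z.T @ y_tilde / n) for Z in Zt_list)
    lambdas = np.exp(np.linspace(np.log(lam_max), np.log(1e-4 * lam_max), n_lambda))
    best_bic, best_pair, best_fit = np.inf, None, None
    for lam in lambdas:
        for gamma in gammas:
            b = group_coordinate_descent(y_tilde, Zt_list, lam, gamma)
            resid = y_tilde - sum(Z @ bj for Z, bj in zip(Zt_list, b))
            rss = float(resid @ resid)
            df = sum(np.linalg.norm(bj) > 0 for bj in b)       # number of nonzero groups
            bic = np.log(rss) + np.log(n) * m_n * df / n
            if bic < best_bic:
                best_bic, best_pair, best_fit = bic, (lam, gamma), b
    return best_pair, best_fit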

The simulation results based on 100 replications are presented in Tables 1–3. The columns in Table 1 are: the average number of nonlinear components selected (NL), the average model error (ER), the percentage of occasions on which the correct nonlinear components are included in the selected model (IN%), and the percentage of occasions on which exactly the correct nonlinear components are selected (CS%) in the final model. Enclosed in parentheses are the corresponding standard errors. Table 2 reports the number of times each component is estimated as a nonlinear function. Table 3 shows the average mean square error for each function, with the corresponding standard errors in parentheses.

Table 1.

Simulation results for Examples 1–2. NL, the average number of the nonlinear components being selected; ER, the average model error; IN%, the percentage of occasions on which the correct nonlinear components are included in the selected model; CS%, the percentage of occasions on which exactly correct nonlinear components are selected, averaged over 100 replications. Enclosed in parentheses are the corresponding standard errors.

n = 100 n = 200

NL ER IN% CS% NL ER IN% CS%
Example 1, Group Lasso 3.46 (0.76) 2.66 (0.66) 100 (0.00) 67 (0.47) 3.10 (0.39) 2.71 (0.39) 100 (0.00) 92 (0.27)
Group MCP 3.18 (0.39) 2.28 (0.47) 100 (0.00) 82 (0.39) 3.01 (0.10) 2.43 (0.30) 100 (0.00) 99 (0.10)

Example 2, Group Lasso 4.37 (2.90) 6.26 (4.84) 51 (0.50) 17 (0.38) 5.41 (0.71) 3.55 (0.59) 98 (0.14) 62 (0.49)
Group MCP 5.25 (1.37) 2.98 (1.22) 76 (0.43) 43 (0.50) 5.22 (0.54) 3.09 (0.38) 98 (0.14) 78 (0.42)
Table 3.

The average mean square error for each component based on 100 replications by the group Lasso and group MCP methods in Examples 1–2.

f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
n = 100
Example 1, Group Lasso 0.64 (0.93) 0.66 (0.79) 0.67 (1.05) 7.52 (1.48) 12.23 (6.68) 25.50 (10.02)
Group MCP 0.54 (0.83) 0.55 (0.70) 0.49 (0.65) 7.51 (1.45) 11.39 (6.72) 25.34 (9.77)
Oracle 0.11 (0.25) 0.11 (0.17) 0.12 (0.23) 2.22 (1.07) 0.76 (0.46) 10.05 (2.39)

n = 200

Group Lasso 0.21 (0.28) 0.19 (0.27) 0.20 (0.26) 7.29 (1.05) 12.08 (4.47) 27.24 (7.04)
Group MCP 0.20 (0.28) 0.16 (0.21) 0.19 (0.26) 7.25 (1.03) 11.35 (4.77) 27.08 (7.12)
Oracle 0.09 (0.07) 0.08 (0.06) 0.09 (0.07) 1.88 (0.65) 0.50 (0.18) 9.93 (1.72)

Example 2, Group Lasso 1.22 (1.45) 1.55 (2.63) 1.58 (2.08) 1.40 (2.06) 1.87 (2.95) 3.66 (1.43) 10.24 (7.17) 23.80 (12.7) 3.03 (2.76) 10.09 (5.80)
Group MCP 0.87 (1.02) 1.05 (1.91) 0.90 (1.16) 0.89 (1.51) 1.03 (1.33) 3.55 (1.24) 9.27 (6.88) 22.30 (10.6) 1.96 (1.98) 9.85 (5.08)
Oracle 0.52 (1.00) 0.17 (0.60) 0.27 (0.36) 0.31 (0.63) 0.44 (0.79) 2.57 (0.90) 1.09 (1.54) 13.31 (13.9) 1.28 (1.80) 1.85 (10.45)

n = 200

Group Lasso 0.34 (0.45) 0.36 (0.40) 0.30 (0.41) 0.38 (0.61) 0.39 (0.56) 3.34 (0.71) 8.55 (3.19) 20.09 (6.61) 0.95 (0.81) 9.26 (3.86)
Group MCP 0.30 (0.40) 0.32 (0.39) 0.28 (0.39) 0.31 (0.55) 0.34 (0.52) 3.32 (0.70) 8.52 (3.24) 19.91 (6.50) 0.87 (0.81) 9.19 (3.66)
Oracle 0.23 (0.20) 0.16 (0.23) 0.05 (0.02) 0.16 (0.33) 0.16 (0.41) 0.88 (0.30) 0.36 (0.14) 9.83 (1.68) 0.50 (0.17) 0.33 (0.14)
Table 2.

Number of times each component is selected as a nonlinear component in the 100 replications by the group Lasso and group MCP methods in Examples 1–2.

f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
n = 100
Example 1, Group Lasso 21 13 12 100 100 100
Group MCP 9 4 5 100 100 100

n = 200
Group Lasso 3 4 3 100 100 100
Group MCP 1 0 0 100 100 100

n = 100
Example 2, Group Lasso 19 21 14 17 18 54 73 95 69 57
Group MCP 16 13 9 9 11 89 99 100 97 82

n = 200
Group Lasso 9 8 7 9 11 99 100 100 100 98
Group MCP 5 6 6 5 2 99 100 100 100 99

Several observations can be made from Tables 1 and 2. Table 1 shows that the proposed method with the group MCP performs better than with the group Lasso in terms of the percentage of occasions on which the correct nonlinear components are included in the selected model (IN%) and the percentage of occasions on which exactly the correct nonlinear components are selected (CS%). For instance, in Example 1, when n = 100, the percentage of correct selection (CS%) is 82% with the group MCP and 67% with the group Lasso. Also, when the sample size increases from 100 to 200, the percentages of including all the nonlinear components (IN%) and of selecting exactly the correct model (CS%) increase for both methods. This is not surprising, since data with a larger sample size contain more information about the underlying model. Table 2 shows that the group MCP is more accurate than the group Lasso in distinguishing the linear functions from the nonlinear functions. When n = 200, the group MCP correctly distinguishes the linear from the nonlinear components 99% of the time in Example 1 and 78% of the time in Example 2. In Table 3, we examine the performance of the proposed method for estimating the linear and nonlinear components in the simulated models. In general, the proposed method with the group MCP has smaller mean square errors. Overall, the proposed method with the group MCP is effective in distinguishing the linear components from the nonlinear ones in the simulation models.

5.2 Diabetes data example

This data set is from a study reported in Willems et al. (1997). The data consist of 19 variables on 403 subjects from 1046 African Americans who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia. Diabetes Mellitus Type II (adult onset diabetes) is associated with obesity. The 403 subjects were the ones who were screened for diabetes. Glycosolated hemoglobin > 7.0 is usually taken as a positive diagnosis of this disease.

We consider glycosolated hemoglobin as the response variable and 15 of the remaining variables as covariates. These 15 variables are: cholesterol (chol), stabilized glucose (stab.glu), high density lipoprotein (hdl), cholesterol/hdl ratio (ratio), location, age, gender, height, weight, frame, first systolic blood pressure (bp.1s), first diastolic blood pressure (bp.1d), waist, hip, and postprandial time when labs were drawn (time.ppn). Among these 15 variables, 3 are categorical (location, gender, frame) and 12 are continuous. We are interested in finding which continuous covariates have nonlinear effects on the response variable. In our study, we only consider the subjects with complete information on these variables, that is, without missing values. Thus n = 366 and p = 15.

The results are summarized in Tables 4 and 5. The top panel of Table 4 lists the 12 continuous variables being selected by the group MCP and the group Lasso as linear or nonlinear variables, indicated by 0/1 (1, nonlinear; 0, linear). The top panel of Table 5 shows the number of variables being selected as nonlinear variables and the residual sum of squares by both the group MCP and the group Lasso methods.

Table 4.

Diabetes data: Number of each component being selected by the group Lasso and group MCP methods as nonlinear components. The top panel of Table lists the 12 continuous variables being selected by the group MCP and the group Lasso as linear or nonlinear variables, indicated by 0 or 1 (0, linear; 1, nonlinear). The bottom panel shows the number of times a variable has a nonlinear effect in the 100 partitions.

chol stab.glu hdl ratio age height weight bp.1s bp.1d waist hip time.ppn
whole data set
group Lasso 0 1 0 0 0 1 0 0 0 0 0 0
group MCP 1 1 0 1 1 1 0 0 0 0 0 1

training and testing sets

group Lasso 29 66 7 1 0 72 0 0 0 0 0 0
group MCP 89 100 30 99 65 100 9 2 0 0 4 89

Table 5.

Diabetes data: The top panel shows the number of selected nonlinear components (NL) and the residual sum of squares (RSS) based on the whole data set. The bottom panel shows the NL, the RSS and the prediction error (PE), averaged over 100 replications. Enclosed in parentheses are the corresponding standard errors.

NL RSS PE

whole data
group Lasso 2.00 3.06
group MCP 6.00 2.53

training and testing sets

group Lasso 1.75 (0.76) 3.01 (0.19) 3.44 (1.02)
group MCP 5.87 (0.87) 2.53 (0.16) 3.27 (0.89)

To evaluate the prediction performance of the methods, we randomly select a training set of 300 subjects from the data for estimation and selection and use the remaining 66 subjects as the test set for prediction. We repeat this process 100 times and summarize the results in the bottom panels of Tables 4 and 5. The bottom panel of Table 4 shows the number of times each variable is classified as having a nonlinear effect. The bottom panel of Table 5 shows the number of variables selected as nonlinear components (NL), the residual sum of squares (RSS) and the prediction error (PE), averaged over 100 replications, with standard errors in parentheses. Table 5 shows that the proposed method with the group MCP performs better than with the group Lasso in terms of both the residual sum of squares and the prediction error.

6. Concluding remarks

In this paper, we proposed a semiparametric regression pursuit method for distinguishing linear from nonlinear components in semiparametric partially linear models. This approach determines the parametric and nonparametric components in a semiparametric model adaptively, based on the data. Our proposed method is fundamentally different from the standard semiparametric inference approach, in which the parametric and nonparametric components of a model are pre-specified. We showed that our method has an asymptotic oracle property, meaning that, with high probability, it coincides with the standard semiparametric estimator obtained as if the model structure were known. The asymptotic rates of the penalty parameters required for our theoretical results are derived. However, as in many recent studies, it is not clear whether the penalty parameters selected using the BIC or other procedures can match the asymptotic rates. This is an important and challenging problem that requires further investigation, but is beyond the scope of the current paper. Our simulation study indicates that the proposed method works well in finite sample situations.

We have only considered the proposed semiparametric regression pursuit method in the partially linear model with fixed p. In many applications, such as genomic data analysis, it is possible to have data with p > n. In this case, our proposed method is not directly applicable. In the p > n case, assuming the model is sparse in the sense that the number of important covariates is much smaller than n, we can first reduce the model dimension and then apply the proposed method. For example, we can first use the adaptive group Lasso method to select the important variables in the nonparametric additive model (Huang, Horowitz and Wei 2010). We can then use the method proposed in this paper to determine the linear and nonlinear components in the model. Under the conditions given in Huang et al. (2010) and those given in this paper, this two-step approach has the asymptotic oracle property even in p > n settings. Further work is needed to evaluate the finite sample performance and to spell out the technical details of this two-step approach in p > n settings.

The proposed semiparametric regression pursuit method extends the scope of application of penalized methods from variable selection to model specification. We have focused on the proposed method in the context of semiparametric partially linear models. This method can be extended to other models, such as the generalized partially linear model and the partially linear proportional hazards model (Huang 1999). It would be interesting to generalize the results of this paper to these more complicated models.

Acknowledgments

J. Huang wishes to thank Professor Guang Cheng for sharing with us their unpublished manuscript (Zhang, Cheng and Liu 2010) and Professor Cun-Hui Zhang for sharing his insights on the properties of the minimax concave penalty. We also thank an anonymous referee, the associate editor and editor for their helpful comments which led to considerable improvements in the paper. The research of Huang is partially supported by NIH grants R01CA120988, R01CA142774 and NSF grant DMS 0805670. The research of Ma is partially supported by NIH grants R01CA120988 and R01CA142774.

Appendix

Proof of Theorem 1

Since 1/(mnγ) is less than the smallest eigenvalue of Z′QZ/n, L(·; λ, γ) in (9) is a convex function. By the Karush-Kuhn-Tucker conditions, a necessary and sufficient condition for θ̂n to be the minimizer is

$$\begin{cases} Z_j'Q(y - Z\hat\theta_n) = n\,\dot\rho(\|\hat\theta_{nj}\|;\lambda)\,\dfrac{\hat\theta_{nj}}{\|\hat\theta_{nj}\|}, & \|\hat\theta_{nj}\|_2 \ne 0,\\[4pt] \|Z_j'Q(y - Z\hat\theta_n)\|_2 \le n\lambda, & \|\hat\theta_{nj}\| = 0. \end{cases} \qquad (16)$$

For j ∉ S1, if ‖θ̃nj‖ ≥ γλ, then ρ̇(‖θ̃nj‖; λ) = 0. Thus θ̃n satisfies (16) if also ‖Z′jQ(y − Zθ̃n)‖2 ≤ nλ for j ∈ S1. Therefore, θ̂n = θ̃n on the intersection of the events

$$\Omega_1(\lambda) = \Big\{\min_{j\notin S_1}\|\tilde\theta_{nj}\| \ge \gamma\lambda\Big\} \qquad\text{and}\qquad \Omega_2(\lambda) = \Big\{\max_{j\in S_1}\|Z_j'Q(y - Z\tilde\theta_n)\| \le n\lambda\Big\}. \qquad (17)$$

Let g0j(xj) = (g0j(x1j), …, g0j(xnj))′ and δn = Σj∉S1 g0j(xj) − Z(2)θ0n(2). By the approximation properties of splines to a smooth function, we have

$$n^{-1}\|\delta_n\|^2 = O_p\big((p-q)m_n^{-2d}\big). \qquad (18)$$

Let C(2) = Z′(2)QZ(2) and H = Q − QZ(2)(Z′(2)QZ(2))−1Z′(2)Q. By (12),

$$\tilde\theta_{n(2)} - \theta_{0n(2)} = C_{(2)}^{-1}Z_{(2)}'Q(\varepsilon_n + \delta_n), \qquad (19)$$

and

$$Z_j'Q(y - Z_{(2)}\tilde\theta_{n(2)}) = Z_j'H(\varepsilon_n + \delta_n). \qquad (20)$$

Recall that θ* = minj∉S1 ‖θ0nj‖. If maxj∉S1 ‖θ̃nj − θ0nj‖ ≤ θ* − γλ, then minj∉S1 ‖θ̃nj‖ ≥ γλ. Therefore,

$$1 - P(\Omega_1(\lambda)) \le P\Big(\max_{j\notin S_1}\|\tilde\theta_{nj} - \theta_{0nj}\| > \theta_* - \gamma\lambda\Big).$$

We also have

$$1 - P(\Omega_2(\lambda)) \le P\Big(n^{-1}\max_{j\in S_1}\|Z_j'H(\varepsilon_n + \delta_n)\| > \lambda\Big).$$

Lemma 1 below shows that, when

$$\frac{(p-q)^{1/2}m_n^{-(2d-1)/2}}{\theta_* - \gamma\lambda} \to 0, \qquad P\Big(\max_{j\notin S_1}\|\tilde\theta_{nj} - \theta_{0nj}\| > \theta_* - \gamma\lambda\Big) \le O(1)\,\frac{(p-q)^{1/2}m_n}{\sqrt{n}\,(\theta_* - \gamma\lambda)}.$$

and Lemma 2 below shows that, when

$$\frac{1}{\lambda m_n^{(2d+1)/2}} \to 0, \qquad P\Big(n^{-1}\max_{j\in S_1}\|Z_j'H(\varepsilon_n + \delta_n)\| > \lambda\Big) \le O(1)\,\frac{\{\log(qm_n)\}^{1/2}}{\lambda\sqrt{n}}.$$

Note that when mn = n1/(2d+1), we have mn n−1/2 = mn−(2d−1)/2 and λ mn(2d+1)/2 = λ n1/2. Therefore, under the conditions of Theorem 1, we have P(θ̂n ≠ θ̃n) → 0. This completes the proof.

Lemma 1

Suppose that

$$\frac{(p-q)^{1/2}m_n^{-(2d-1)/2}}{\theta_* - \gamma\lambda} \to 0. \quad\text{Then}\quad P\Big(\max_{j\notin S_1}\|\tilde\theta_{nj} - \theta_{0nj}\| > \theta_* - \gamma\lambda\Big) \le O(1)\,\frac{(p-q)^{1/2}m_n}{\sqrt{n}\,(\theta_* - \gamma\lambda)}. \qquad (21)$$

Proof of Lemma 1

Let Tnj be an mn × (p − q)mn matrix of the form

$$T_{nj} = \big(0_{m_n},\ldots,0_{m_n},I_{m_n},0_{m_n},\ldots,0_{m_n}\big),$$

where 0mn is an mn × mn matrix of zeros and Imn is an mn × mn identity matrix in the jth block. By the triangle inequality,

$$\|\tilde\theta_{nj} - \theta_{0nj}\|_2 \le \|T_{nj}C_{(2)}^{-1}Z_{(2)}'Q\varepsilon_n\|_2 + \|T_{nj}C_{(2)}^{-1}Z_{(2)}'Q\delta_n\|_2. \qquad (22)$$

Let C be a generic constant independent of n. For the first term on the right-hand side, we have

$$E\max_{j\notin S_1}\|T_{nj}C_{(2)}^{-1}Z_{(2)}'Q\varepsilon_n\|_2 \le n^{-1}\rho_{n1}^{-1}E\|Z_{(2)}'Q\varepsilon_n\|_2 = n^{-1/2}\rho_{n1}^{-1}E\|n^{-1/2}Z_{(2)}'Q\varepsilon_n\|_2 \le Cn^{-1/2}\rho_{n1}^{-1}m_n^{-1/2}\big((p-q)m_n\big)^{1/2} \qquad (23)$$
$$= O(1)(p-q)^{1/2}n^{-1/2}m_n. \qquad (24)$$

Thus

$$P\Big(\max_{j\notin S_1}\|T_{nj}C_{(2)}^{-1}Z_{(2)}'Q\varepsilon_n\| \ge (\theta_* - \gamma\lambda)/2\Big) \le O(1)\,\frac{(p-q)^{1/2}m_n}{\sqrt{n}\,(\theta_* - \gamma\lambda)}.$$

By (18), the second term

$$\max_{j\notin S_1}\|T_{nj}C_{(2)}^{-1}Z_{(2)}'Q\delta_n\|_2 \le \|nC_{(2)}^{-1}\|_2\,\|n^{-1}Z_{(2)}'Z_{(2)}\|_2^{1/2}\,\|n^{-1/2}\delta_n\|_2 = O_p(1)\rho_{n1}^{-1}\rho_{n2}^{1/2}(p-q)^{1/2}m_n^{-d} = O_p(1)(p-q)^{1/2}m_n^{-(2d-1)/2}. \qquad (25)$$

Therefore, when

$$\frac{(p-q)^{1/2}m_n}{\sqrt{n}\,(\theta_* - \gamma\lambda)} \to 0,$$

(21) holds. This proves the lemma.

Lemma 2

Suppose that

$$\frac{1}{\lambda m_n^{(2d+1)/2}} \to 0.$$

Then

$$P\Big(n^{-1}\max_{j\in S_1}\|Z_j'H(\varepsilon_n + \delta_n)\| > \lambda\Big) \le O(1)\,\frac{\{\log(qm_n)\}^{1/2}}{\lambda\sqrt{n}}. \qquad (26)$$

Proof of Lemma 2

Write

$$n^{-1}Z_j'H(\varepsilon_n + \delta_n) = n^{-1}Z_j'H\varepsilon_n + n^{-1}Z_j'H\delta_n. \qquad (27)$$

By Lemma 2 of Huang et al. (2010),

$$E\Big(\max_{j\in S_1}\|n^{-1/2}Z_j'H\varepsilon_n\|_2\Big) \le O(1)\{\log(qm_n)\}^{1/2}. \qquad (28)$$

Therefore,

$$P\Big(n^{-1}\max_{j\in S_1}\|Z_j'H\varepsilon_n\|_2 > \lambda/2\Big) \le O(1)\,\frac{\{\log(qm_n)\}^{1/2}}{\lambda\sqrt{n}}. \qquad (29)$$

By (18), the second term on the right-hand side of (27) satisfies

$$n^{-1}\max_{j\in S_1}\|Z_j'H\delta_n\|_2 \le n^{-1/2}\max_{j\in S_1}\|n^{-1}Z_j'Z_j\|_2^{1/2}\cdot\|H\|_2\cdot\|\delta_n\|_2 = O(1)\rho_{n2}^{1/2}(p-q)^{1/2}m_n^{-d} = O(1)(p-q)^{1/2}m_n^{-(2d+1)/2}. \qquad (30)$$

Therefore, when

$$\frac{1}{\lambda m_n^{(2d+1)/2}} \to 0,$$

(26) follows from (29) and (30).

Proof of Theorem 2

By the definition of θ̂n ≡ (θ̂′n1, …, θ̂′np)′,

$$\frac{1}{2n}\|Q(y - Z\hat\theta_n)\|_2^2 + \sum_{j=1}^p\rho_\gamma(\|\hat\theta_{nj}\|;\lambda) \le \frac{1}{2n}\|Q(y - Z\theta_{0n})\|_2^2 + \sum_{j=1}^p\rho_\gamma(\|\theta_{0nj}\|;\lambda). \qquad (31)$$

Let ηn = Q(y − Zθ0n) and νn = QZ(θ̂n − θ0n). Write

$$Q(y - Z\hat\theta_n) = Q(y - Z\theta_{0n}) - QZ(\hat\theta_n - \theta_{0n}) = \eta_n - \nu_n.$$

We have ‖Q(y − Zθ̂n)‖22 = ‖νn‖22 − 2η′nνn + ‖ηn‖22. We can rewrite (31) as

$$\|\nu_n\|_2^2 - 2\eta_n'\nu_n \le 2n\sum_{j=1}^p\big(\rho_\gamma(\|\theta_{0nj}\|;\lambda) - \rho_\gamma(\|\hat\theta_{nj}\|;\lambda)\big). \qquad (32)$$

Since

$$\rho_\gamma(\|\theta_{0nj}\|;\lambda) - \rho_\gamma(\|\hat\theta_{nj}\|;\lambda) \le \lambda\|\theta_{0nj} - \hat\theta_{nj}\|, \qquad (33)$$

combining (32) and (33), we get

$$\|\nu_n\|_2^2 - 2\eta_n'\nu_n \le 2n\lambda\sqrt{p}\,\|\hat\theta_n - \theta_{0n}\|. \qquad (34)$$

Let η*n = QZ(Z′QZ)−1Z′Qηn, the projection of ηn onto the span of QZ. Since νn lies in this span, η′nνn = η*′nνn. By the Cauchy–Schwarz inequality,

$$2\eta_n'\nu_n = 2\eta_n^{*\prime}\nu_n \le 2\|\eta_n^*\|_2\,\|\nu_n\|_2 \le 2\|\eta_n^*\|_2^2 + \tfrac{1}{2}\|\nu_n\|_2^2. \qquad (35)$$

From (34) and (35), we have

$$\|\nu_n\|_2^2 \le 4\|\eta_n^*\|_2^2 + 4n\lambda\sqrt{p}\,\|\hat\theta_n - \theta_{0n}\|_2.$$

Let c*n be the smallest eigenvalue of Z′QZ/n. By Lemma 1 of Huang, Horowitz and Wei (2010), c*n ≍ mn−1. Since ‖νn‖22 ≥ nc*n‖θ̂n − θ0n‖22 and 2ab ≤ a2 + b2,

$$nc_n^*\|\hat\theta_n - \theta_{0n}\|_2^2 \le 4\|\eta_n^*\|_2^2 + \frac{(2n\lambda\sqrt{p})^2}{2nc_n^*} + \frac{1}{2}nc_n^*\|\hat\theta_n - \theta_{0n}\|_2^2.$$

It follows that

$$\|\hat\theta_n - \theta_{0n}\|_2^2 \le \frac{8\|\eta_n^*\|_2^2}{nc_n^*} + \frac{4\lambda^2 p}{c_n^{*2}}. \qquad (36)$$

Let f0(xi) = f01(xi1) + ··· + f0p(xip) and f0 = (f0(x1), …, f0(xn))′. Write

$$\eta_n = Q\big(\varepsilon_n + (\mu - \bar y)\mathbf{1} + f_0 - Z\theta_{0n}\big).$$

Since |μ − ȳ|2 = Op(n−1) and ‖f0j − fnj‖ = O(mn−d), we have

$$\|\eta_n^*\|_2^2 \le 2\|\varepsilon_n^*\|_2^2 + O_p(1) + O\big(npm_n^{-2d}\big), \qquad (37)$$

where ε*n is the projection of εn = (ε1, …, εn)′ onto the span of QZ. We have

$$\|\varepsilon_n^*\|_2^2 = \|(Z'QZ)^{-1/2}Z'Q\varepsilon_n\|_2^2 = O_p(pm_n). \qquad (38)$$

Combining (36), (37), and (38), we get

$$\|\hat\theta_n - \theta_{0n}\|_2^2 \le O_p\Big(\frac{pm_n}{nc_n^*}\Big) + O_p\Big(\frac{1}{nc_n^*}\Big) + O\Big(\frac{pm_n^{-2d}}{c_n^*}\Big) + \frac{4p\lambda^2}{c_n^{*2}}.$$

Since c*n ≍ mn−1, we have

$$\|\hat\theta_n - \theta_{0n}\|_2^2 \le O_p\Big(\frac{pm_n^2}{n}\Big) + O_p\Big(\frac{m_n}{n}\Big) + O\Big(\frac{1}{m_n^{2d-1}}\Big) + O(m_n^2\lambda^2).$$

Now the result follows from the properties of polynomial splines (Schumaker 1981). This completes the proof of the theorem.

Proof of Theorem 3

Let θ̃n be the oracle estimator defined in (11). Define

$$\tilde g_{nj}(x) = 0,\ \ j\in S_1 \qquad\text{and}\qquad \tilde g_{nj}(x) = \sum_{k=1}^{m_n}\tilde\theta_{jk}\psi_{jk}(x),\ \ j\in S_2.$$

Let

$$\tilde f_{nj}(x) = \tilde\beta_j x + \tilde g_{nj}(x), \quad j\in S_2.$$

Denote f̃nj(xj) = (f̃nj(x1j), …, f̃nj(xnj))′. The oracle estimator of the coefficients of the linear components is

$$\tilde\beta_{n1} = \big(X_{(1)}'X_{(1)}\big)^{-1}X_{(1)}'\Big(y - \sum_{j\in S_2}\tilde f_{nj}(x_j)\Big).$$

Using the standard techniques in semiparametric models such as those described in Huang (1996), we can show that

$$\sqrt{n}\,(\tilde\beta_{n1} - \beta_{01}) \to_D N(0,\Sigma).$$

By Theorem 1, P(β̂n1 = β̃n1) → 1, which implies √n(β̂n1 − β̃n1) →P 0. Therefore, by Slutsky’s lemma, we also have

$$\sqrt{n}\,(\hat\beta_{n1} - \beta_{01}) = \sqrt{n}\,(\tilde\beta_{n1} - \beta_{01}) + \sqrt{n}\,(\hat\beta_{n1} - \tilde\beta_{n1}) \to_D N(0,\Sigma).$$

This completes the proof of Theorem 3.

References

  1. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press; Baltimore: 1993. [Google Scholar]
  2. Breheny P, Huang J. Coordinate Descent Algorithms for Nonconvex Penalized Regression Methods. Ann Appl Statist. 2010;5:232–253. doi: 10.1214/10-AOAS388. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Chen H. Convergence rates for parametric components in a partly linear model. Ann Statist. 1988;16:136–146. [Google Scholar]
  4. Chen H. Asymptotically efficient estimation in semiparametric generalized linear models. Ann Statist. 1995;23:1102–1129. [Google Scholar]
  5. Engle RF, Granger CWJ, Rice J, Weiss A. Semiparametric estimates of the relation between weather and electricity sales. J Amer Statist Assoc. 1986;81:310–320. [Google Scholar]
  6. Friedman J, Hastie T, Hoefling H, Tibshirani R. Pathwise coordinate optimization. Ann Appl Statist. 2007;1:302–332. [Google Scholar]
  7. Fu WJ. Penalized regressions: the bridge versus the lasso. J Comp Graph Statist. 1998;7:397–416. [Google Scholar]
  8. Härdle W, Liang H, Gao J. Partially Linear Models. Physica-Verlag; Heidelberg: 2000. [Google Scholar]
  9. Hastie T, Tibshirani R. Generalized additive models. Chapman & Hall; 1990. [DOI] [PubMed] [Google Scholar]
  10. Heckman N. Spline smoothing in partly linear model. J Roy Statist Soc Ser B. 1986;48:244–248. [Google Scholar]
  11. Huang J. Efficient estimation for the Cox model with interval censoring. Ann Statist. 1996;24:540–568. [Google Scholar]
  12. Huang J. Efficient estimation of the partly linear additive Cox model. Ann Statist. 1999;27:1536–1563. [Google Scholar]
  13. Huang J, Horowitz JL, Wei FR. Variable selection in nonparametric additive models. Ann Statist. 2010;38:2282–2313. doi: 10.1214/09-AOS781. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Mazumder R, Friedman J, Hastie T. Preprint. Department of Statistics, Stanford University; 2009. SparseNet: Coordinate descent with non-convex penalties. [Google Scholar]
  15. Rice J. Convergence rates for partially spline models. Statist & Probab Lett. 1986;4:203–208. [Google Scholar]
  16. Shen X, Wong WH. Convergence rate of sieve estimates. Ann Statist. 1994;22:580–615. [Google Scholar]
  17. Schumaker L. Spline Functions: Basic Theory. Wiley; New York: 1981. [Google Scholar]
  18. Speckman P. Spline smoothing and optimal rates of convergence in nonparametric regression models. Ann Statist. 1985;13:970–983. [Google Scholar]
  19. Stone CJ. Additive regression and other nonparametric models. Ann Statist. 1985;13:689–705. [Google Scholar]
  20. Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. J Opt Th & Appl. 2001;109:475–494. [Google Scholar]
  21. Van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Springer Verlag; New York: 1996. [Google Scholar]
  22. Wahba G. Partial spline models for the semiparametric estimation of functions of several variables. In: Analyses for Time Series, Japan-US Joint Seminar. Tokyo: Institute of Statistical Mathematics; 1984. pp. 319–329. [Google Scholar]
  23. Willems JP, Saunders JT, Hunt DE, Schorling JB. Prevalence of coronary heart disease risk factors among rural blacks: A community-based study. Southern Med J. 1997;90:814–820. doi: 10.1097/00007611-199708000-00008. [DOI] [PubMed] [Google Scholar]
  24. Wu T, Lange K. Coordinate descent procedures for lasso penalized regression. Ann Appl Statist. 2007;2:224–244. [Google Scholar]
  25. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Statist Soc B. 2006;68:49–67. [Google Scholar]
  26. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38:894–942. [Google Scholar]
  27. Zhang HH, Cheng G, Liu Y. Linear or nonlinear? Automatic structure discovery for partially linear models. Preprint Under revision for J Amer Statist Assoc. 2010 doi: 10.1198/jasa.2011.tm10281. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429. [Google Scholar]
