PENALIZED VARIABLE SELECTION PROCEDURE FOR COX MODELS WITH SEMIPARAMETRIC RELATIVE RISK

Pang Du; Shuangge Ma; Hua Liang

doi:10.1214/09-AOS780

Author manuscript; available in PMC: 2011 Jan 1.

Published in final edited form as: Ann Stat. 2010 Aug 1;38(4):2092–2117. doi: 10.1214/09-AOS780

PMCID: PMC2928655 NIHMSID: NIHMS175610 PMID: 20802853

Abstract

We study the Cox models with semiparametric relative risk, which can be partially linear with one nonparametric component, or multiple additive or nonadditive nonparametric components. A penalized partial likelihood procedure is proposed to simultaneously estimate the parameters and select variables for both the parametric and the nonparametric parts. Two penalties are applied sequentially. The first penalty, governing the smoothness of the multivariate nonlinear covariate effect function, provides a smoothing spline ANOVA framework that is exploited to derive an empirical model selection tool for the nonparametric part. The second penalty, either the smoothly-clipped-absolute-deviation (SCAD) penalty or the adaptive LASSO penalty, achieves variable selection in the parametric part. We show that the resulting estimator of the parametric part possesses the oracle property, and that the estimator of the nonparametric part achieves the optimal rate of convergence. The proposed procedures are shown to work well in simulation experiments, and then applied to a real data example on sexually transmitted diseases.

Keywords and phrases: Backfitting, partially linear models, penalized variable selection, proportional hazards, penalized partial likelihood, smoothing spline ANOVA

1. Introduction

In survival analysis, a problem of interest is to identify relevant risk factors and evaluate their contributions to survival time. The Cox proportional hazards (PH) model is a popular approach to studying the influence of covariates on survival outcomes. Conventional PH models assume that covariates have a log-linear effect on the hazard function. These PH models have been studied by numerous authors; see, e.g., the references in [15]. The log-linear assumption can be too rigid in practice, especially when continuous covariates are present. This limitation motivates PH models with nonparametric relative risk; examples include [6, 11, 12, 23, 35]. However, nonparametric models may suffer from the curse of dimensionality. They also lack the easy interpretation of parametric risk models. PH models with semiparametric relative risk strike a good balance by allowing nonparametric risk for some covariates and parametric risk for others. The benefits of such models are twofold. First, they have the merits of models with parametric risk, including easy interpretation, easy estimation and easy inference. Second, their nonparametric part allows a flexible form for some continuous covariates whose patterns are unexplored and whose contribution cannot be assessed by simple parametric models. For example, [10] proposed efficient estimation for a partially linear Cox model with additive nonlinear covariate effects. [3] studied partially linear hazard regression for multivariate survival data with time-dependent covariates via a profile pseudo-partial likelihood approach, where the only nonlinear covariate effect was estimated by local polynomials. But these models are limited to one nonparametric component or additive nonparametric components, ignoring the possible interactions between different nonparametric components. [30] proposed a partially linear additive hazard model whose nonlinear varying coefficients represent the interaction between the time-dependent nonlinear covariate and other covariates.

Variable selection in survival data has drawn much attention in the past decade. Traditional procedures such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), as noted by [2], suffer from a lack of stability and do not incorporate the stochastic errors inherited from the variable selection stage. [26] and [32] extended, respectively, the LASSO and the adaptive LASSO variable selection procedures to the Cox model. [8] extended the nonconcave penalized likelihood approach ([7]) to the Cox PH models. [4] studied variable selection for multivariate survival data. The Cox models considered in these three papers all assumed a linear form of covariate effects in the relative risk. More recently, [13] and [14] proposed procedures for selecting variables in semiparametric linear regression models for censored data, where the dependence of the response on the covariates was also assumed to be of linear form. Hence, the aforementioned variable selection procedures are limited by their rigid assumption of parametric covariate effects, which may not be realistic in practice. We fill these gaps in three respects: (i) our models are flexible with semiparametric relative risk, which allows nonadditive nonparametric components, without being limited to single or additive nonlinear covariate effects; (ii) our approach can simultaneously estimate the parametric coefficient vector and select contributing parametric components; and (iii) our approach also provides a model selection tool for the nonparametric components.

Let the hazard function for a subject be

h (t) = h_{0} (t) \exp [β^{T} U + η (W)],

(1.1)

where h₀ is the unknown baseline hazard, Z^T = (U^T, W^T) is the covariate vector, β is the unknown coefficient vector, and η(w) = η(w₁, … , w_q) is an unknown multivariate smooth function. We propose a doubly penalized profile partial likelihood approach for estimation, following the general profile likelihood framework set up by [20]. Given β, η is estimated by smoothing splines through the minimization of a penalized log partial likelihood. Then the smoothing spline ANOVA decomposition not only allows the natural inclusion of interaction effects but also provides the basis for deriving an empirical model selection tool. After substituting the estimate of η, we obtain a profile partial likelihood, which is then penalized to get an estimate of β. To achieve variable selection in β, we use the smoothly clipped absolute deviation (SCAD) penalty. We show that our estimate of η achieves the optimal convergence rate, and our estimate of β possesses the oracle property such that the true zero coefficients are automatically estimated as zeros and the remaining coefficients are estimated as well as if the correct submodel were known in advance. Our numerical studies reveal that the proposed method is promising in both estimation and variable selection. We then apply it to a study on sexually transmitted diseases with 877 subjects.

The rest of the article is organized as follows. Section 2 gives the details of the proposed method, in the order of model description and estimation procedure (§2.1), model selection in the nonparametric part (§2.2), asymptotic properties (§2.3), and miscellaneous issues (§2.4) like standard error estimates and smoothing parameter selection. Section 3 presents the empirical studies, and Section 4 gives an application study. Remarks in Section 5 conclude the article.

2. Method

Let T be the failure time and C be the right-censoring time. Assume that T and C are conditionally independent given the covariate. The observable random variable is (X, Δ, Z), where X = min(T, C), Δ = I_[T≤C], and Z = (U, W) is the covariate vector with U ∈ ℝ^d and W ∈ ℝ^q. With n i.i.d. (X_i, Δ_i, Z_i), i = 1, … , n, we assume a Cox model for the hazard function as in (1.1).

2.1. Estimation and Variable selection for parametric parts

Let Y_i(t) = I_{[X_i≥t]}. We propose to estimate (β, η) through a penalized profile partial likelihood approach. Given β, η is estimated as the minimizer of the penalized partial likelihood

l_{β} (η) \equiv - \frac{1}{n} \sum_{i = 1}^{n} Δ_{i} \left\{ U_{i}^{T} β + η (W_{i}) - \log \sum_{k = 1}^{n} Y_{k} (X_{i}) \exp [U_{k}^{T} β + η (W_{k})] \right\} + λ J (η),

(2.1)

where the summation is the negative log partial likelihood representing the goodness-of-fit, J(η) is a roughness penalty specifying the smoothness of η, and λ > 0 is a smoothing parameter controlling the tradeoff. A popular choice for J is the L₂-penalty which yields tensor product cubic splines (see, e.g., [9]) for multivariate W. Note that η in (2.1) is identifiable up to a constant, so we use the constraint ∫ η = 0.
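As an illustration of the goodness-of-fit term in (2.1), the following minimal sketch evaluates the negative log partial likelihood for given values of the linear predictor U_i^T β + η(W_i). The function name and argument layout are hypothetical and not part of the original method; the roughness penalty λJ(η) is deliberately left out, and ties are handled only through the at-risk indicator Y_k(X_i).

```python
import numpy as np

def neg_log_partial_likelihood(x, delta, lin_pred):
    """Goodness-of-fit term of (2.1), without the penalty lambda * J(eta).

    x        : observed times X_i = min(T_i, C_i)
    delta    : event indicators Delta_i (1 = failure, 0 = censored)
    lin_pred : values of U_i^T beta + eta(W_i), one per subject
    (Names are illustrative assumptions, not taken from the paper.)
    """
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=float)
    lin_pred = np.asarray(lin_pred, dtype=float)
    # at_risk[i, k] = Y_k(X_i): subject k is still at risk at time X_i
    at_risk = x[None, :] >= x[:, None]
    # log of sum_k Y_k(X_i) exp(lin_pred_k), one value per subject i
    log_risk_sum = np.log((at_risk * np.exp(lin_pred)[None, :]).sum(axis=1))
    return -np.mean(delta * (lin_pred - log_risk_sum))
```

Adding a constant to every linear predictor leaves the value unchanged, which is the identifiability issue that the constraint ∫ η = 0 resolves.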

Once an estimate η̂ of η is obtained, the estimator of β is then the maximizer of the penalized profile partial likelihood

l_{\hat{η}} (β) \equiv \sum_{i = 1}^{n} Δ_{i} \left\{ U_{i}^{T} β + \hat{η} (W_{i}) - \log \sum_{k = 1}^{n} Y_{k} (X_{i}) \exp [U_{k}^{T} β + \hat{η} (W_{k})] \right\} - n \sum_{j = 1}^{d} p_{θ_{j}} (| β_{j} |),

(2.2)

where p_{θ_j} (| · |) is the SCAD penalty on β ([7]).
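For reference, the SCAD penalty of [7] and its derivative can be written in a few lines. The sketch below uses the standard piecewise form with shape parameter a; the default a = 3.7 is the value commonly recommended for SCAD and is an assumption here, since the text above does not state which value was used. Function names are illustrative.

```python
import numpy as np

def scad_penalty(u, theta, a=3.7):
    """SCAD penalty p_theta(|u|): linear near zero, quadratic blend, then constant."""
    u = np.abs(np.asarray(u, dtype=float))
    linear = theta * u
    quadratic = -(u ** 2 - 2 * a * theta * u + theta ** 2) / (2 * (a - 1))
    constant = (a + 1) * theta ** 2 / 2
    return np.where(u <= theta, linear,
                    np.where(u <= a * theta, quadratic, constant))

def scad_derivative(u, theta, a=3.7):
    """Derivative p'_theta(|u|): theta on [0, theta], decaying to 0 at a * theta."""
    u = np.abs(np.asarray(u, dtype=float))
    tail = np.maximum(a * theta - u, 0.0) / ((a - 1) * theta)
    return theta * np.where(u <= theta, 1.0, tail)
```

The derivative vanishes for |u| > aθ, so large coefficients are not shrunk, while small coefficients receive a LASSO-like penalty of slope θ.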

The detailed algorithm for our estimation procedure is as follows.

Step 1. Find a proper initial estimate β̂⁽⁰⁾. We note that, as long as the initial estimate is reasonable, convergence to the true optimizer can be achieved. Different choices of the initial estimate will affect the number of iterations needed but not the convergence itself.
Step 2. Let β̂^(k−1) be the estimate of β before the kth iteration. Plug β̂^(k−1) into (2.1) and solve for η by minimizing the penalized partial likelihood l_β̂^(k−1) (η). Let η̂^(k) be the estimate thus obtained.
Step 3. Plug η̂^(k) into (2.2) and solve for β by maximizing the penalized profile partial likelihood l_η̂^(k)(β). Let β̂^(k) be the estimate thus obtained.
Step 4. Replace β̂^(k−1) in Step 2 by β̂^(k) and repeat Steps 2 and 3 until convergence to obtain the final estimates β̂ and η̂.

Our experience shows that the algorithm usually converges quickly within a few iterations. As in the classical Cox proportional hazards model, the estimation of baseline hazard function is of less interest and not required in our estimation procedure.

In Step 3, we use a one-step approximation to the SCAD penalty ([34]). It transforms the SCAD penalty problem into a LASSO-type optimization, where the LARS algorithm proposed in [5] can be used. Let l_η̂(β) be the profile log partial likelihood in Step 3, and I(β) = −∇²l_η̂(β) be the negative Hessian matrix, where the derivative is with respect to β treating η̂ as fixed. Compute the Cholesky decomposition of I(β̂^(k−1)) such that I(β̂^(k−1)) = V^T V. Let $A = \{j : p'_{θ_{j}} (| \hat{β}_{j}^{(k - 1)} |) = 0\}$ and $B = \{j : p'_{θ_{j}} (| \hat{β}_{j}^{(k - 1)} |) > 0\}$. Decompose V and the new estimate β̂^(k) accordingly such that V = [V_A, V_B] and $\hat{β}^{(k)} = (\hat{β}_{A}^{(k) T}, \hat{β}_{B}^{(k) T})^{T}$.

(Step 3a) Let y = Vβ̂^(k−1). Then for each j ∈ B, replace the j-th column υ_j of V by $υ_{j} θ_{j} / p'_{θ_{j}} (| \hat{β}_{j}^{(k - 1)} |)$.
(Step 3b) Let $H_{A} = V_{A} (V_{A}^{T} V_{A})^{- 1} V_{A}^{T}$ be the projection matrix onto the column space of V_A. Compute y* = y − H_A y and V_B* = V_B − H_A V_B.
(Step 3c) Apply the LARS algorithm to solve
$\hat{β}_{B}^{*} = \arg \min_{β} \left\{ \frac{1}{2} ∥ y^{*} - V_{B}^{*} β ∥^{2} + n \sum_{j \in B} θ_{j} | β_{j} | \right\} .$
(Step 3d) Compute $\hat{β}_{A}^{*} = (V_{A}^{T} V_{A})^{- 1} V_{A}^{T} (y - V_{B} \hat{β}_{B}^{*})$ to obtain $\hat{β}^{*} = (\hat{β}_{A}^{* T}, \hat{β}_{B}^{* T})^{T}$.
(Step 3e) For j ∈ A, set $\hat{β}_{j}^{(k)} = \hat{β}_{j}^{*}$. For j ∈ B, set $\hat{β}_{j}^{(k)} = \hat{β}_{j}^{*} θ_{j} / p'_{θ_{j}} (| \hat{β}_{j}^{(k - 1)} |)$.
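Steps 3a to 3e can be sketched as follows. This is only an illustration under stated assumptions: the negative Hessian I(β̂^(k−1)) is supplied by the caller and is positive definite, the penalty derivatives p'_{θ_j}(|β̂_j^(k−1)|) are passed in as the array `dp`, and a plain coordinate descent solver is swapped in for the LARS algorithm of Step 3c. All function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, weights, n_iter=200):
    """Coordinate descent for 0.5 * ||y - X b||^2 + sum_j weights[j] * |b_j|.
    Used here in place of LARS (an assumption of this sketch)."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue
            partial_resid = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ partial_resid, weights[j]) / col_sq[j]
    return b

def one_step_update(neg_hessian, beta_prev, theta, dp, n):
    beta_prev = np.asarray(beta_prev, dtype=float)
    theta = np.asarray(theta, dtype=float)
    dp = np.asarray(dp, dtype=float)
    V = np.linalg.cholesky(neg_hessian).T       # neg_hessian = V^T V
    y = V @ beta_prev
    A = dp == 0                                  # unpenalized coordinates
    B = ~A                                       # penalized coordinates
    scale = np.ones_like(theta)
    scale[B] = theta[B] / dp[B]
    V = V * scale                                # Step 3a: rescale columns in B
    V_A, V_B = V[:, A], V[:, B]
    if A.any():                                  # Step 3b: project out V_A
        H_A = V_A @ np.linalg.solve(V_A.T @ V_A, V_A.T)
    else:
        H_A = np.zeros((len(y), len(y)))
    y_star = y - H_A @ y
    V_B_star = V_B - H_A @ V_B
    # Step 3c: weighted LASSO on the penalized block
    b_B = lasso_cd(V_B_star, y_star, n * theta[B]) if B.any() else np.zeros(0)
    beta_new = np.zeros_like(beta_prev)
    if A.any():                                  # Step 3d: back out the A block
        beta_new[A] = np.linalg.solve(V_A.T @ V_A, V_A.T @ (y - V_B @ b_B))
    beta_new[B] = b_B * scale[B]                 # Step 3e: undo the rescaling
    return beta_new
```

When every derivative in `dp` is zero, the set B is empty and the update returns β̂^(k−1) unchanged, which is a convenient sanity check.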

2.2. Model Selection for Nonparametric Component

While the SCAD penalty takes care of variable selection for the parametric components, we still need an approach to assess the structure of the nonparametric components. In this section, we will first transform the profile partial likelihood problem in (2.1) to a density estimation problem with biased sampling, and then derive a model selection tool based on the Kullback-Leibler geometry. In this part, we treat β as fixed, taking the value from the previous step in the algorithm.

Let (i₁, … , i_N) be the indices for the failed subjects. Then the profile partial likelihood in (2.1) for estimating η is

\prod_{i = 1}^{n} \left[ \frac{e^{U_{i}^{T} β + η (W_{i})}}{\sum_{k = 1}^{n} Y_{k} (X_{i}) e^{U_{k}^{T} β + η (W_{k})}} \right]^{Δ_{i}} = \prod_{p = 1}^{N} \left[ \frac{e^{U_{i_{p}}^{T} β + η (W_{i_{p}})}}{\sum_{k = 1}^{n} Y_{k} (X_{i_{p}}) e^{U_{k}^{T} β + η (W_{k})}} \right] .

Consider the empirical measure $P_{n}^{w}$ on the discrete domain W_n = {W₁, … , W_n} such that $\int f d P_{n}^{w} = \frac{1}{n} \sum_{i = 1}^{n} f (W_{i})$ . Then $e^{η} / \int e^{η} d P_{n}^{w}$ defines a density function on W_n. Let a₁(·), … , a_N (·) be weight functions defined on the discrete domain W_n such that $a_{p} (W_{k}) = Y_{k} (X_{i_{p}}) e^{U_{k}^{T} β}$ , p = 1, … , N. Alternatively, one can think of a_p’s as vectors of weights with length n. Then each term in the profile partial likelihood, with the constant n ignored, becomes $a_{p} (W_{i_{p}}) e^{η (W_{i_{p}})} / \int a_{p} (w) e^{η (w)} d P_{n}^{w}$ . Thus, this resembles a density estimation problem with bias introduced by the known weight function a_p(·).

For two density estimates η₁ and η₂ in the above pseudo biased sampling density estimation problem, define their Kullback-Leibler distance as

KL (η_{1}, η_{2}) = \frac{1}{N} \sum_{p = 1}^{N} \left\{ \frac{\int (η_{1} (w) - η_{2} (w)) a_{p} (w) e^{η_{1} (w)} d P_{n}^{w}}{\int a_{p} (w) e^{η_{1} (w)} d P_{n}^{w}} - \log \int a_{p} (w) e^{η_{1} (w)} d P_{n}^{w} + \log \int a_{p} (w) e^{η_{2} (w)} d P_{n}^{w} \right\} .

(2.3)

Let η₀ be the true function. Suppose the estimation of η₀ has been done in a space ℋ₁, but in fact η₀ ∈ ℋ₂ ⊂ ℋ₁. Let η̂ be the estimate of η₀ in ℋ₁. Let η̃ be the Kullback-Leibler projection of η̂ in ℋ₂, that is, the minimizer of KL(η̂, η) for η ∈ ℋ₂, and let η_c be the estimate from the constant model. Set η = η̃ + α(η̃ − η_c) for α real. Differentiating KL(η̂, η) with respect to α and evaluating at α = 0, one has

\sum_{p = 1}^{N} \frac{\int (\tilde{η} (w) - η_{c} (w)) a_{p} (w) e^{\hat{η} (w)} d P_{n}^{w}}{\int a_{p} (w) e^{\hat{η} (w)} d P_{n}^{w}} = \sum_{p = 1}^{N} \frac{\int (\tilde{η} (w) - η_{c} (w)) a_{p} (w) e^{\tilde{η} (w)} d P_{n}^{w}}{\int a_{p} (w) e^{\tilde{η} (w)} d P_{n}^{w}},

which, through straightforward calculation, yields

KL (\hat{η}, η_{c}) = KL (\hat{η}, \tilde{η}) + KL (\tilde{η}, η_{c}) .

Hence the ratio KL(η̂, η̃)/KL(η̂, η_c) can be used to diagnose the feasibility of a reduced model η ∈ ℋ₂: the smaller the ratio is, the more feasible the reduced model is.
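On the discrete domain W_n, every integral with respect to P_n^w in (2.3) is an average over W₁, … , W_n, so the distance and the diagnostic ratio reduce to array operations. The sketch below assumes that η₁ and η₂ are given as vectors of function values at the W_k, and that the weights are stored in an N × n array `a` with entries a_p(W_k); these names and the storage layout are illustrative assumptions. The projection η̃ itself must be computed separately and is not shown.

```python
import numpy as np

def kl_distance(eta1, eta2, a):
    """KL(eta1, eta2) of (2.3) on the discrete domain {W_1, ..., W_n}.

    eta1, eta2 : length-n arrays of function values at W_1, ..., W_n
    a          : (N, n) array with a[p, k] = a_p(W_k); each row needs a positive entry
    """
    eta1 = np.asarray(eta1, dtype=float)
    eta2 = np.asarray(eta2, dtype=float)
    w1 = a * np.exp(eta1)                      # a_p(w) e^{eta1(w)}
    w2 = a * np.exp(eta2)                      # a_p(w) e^{eta2(w)}
    mean_diff = (w1 * (eta1 - eta2)).sum(axis=1) / w1.sum(axis=1)
    terms = mean_diff - np.log(w1.mean(axis=1)) + np.log(w2.mean(axis=1))
    return float(np.mean(terms))

def kl_ratio(eta_hat, eta_tilde, eta_c, a):
    """Diagnostic ratio KL(eta_hat, eta_tilde) / KL(eta_hat, eta_c)."""
    return kl_distance(eta_hat, eta_tilde, a) / kl_distance(eta_hat, eta_c, a)
```

The distance is zero when the two arguments coincide and does not change when a constant is added to the second argument, in line with η being identifiable only up to a constant.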

2.3. Asymptotic Results

Denote by ℋ^m (W) the Sobolev space of functions on W whose mth order partial derivatives are square integrable. Let

ℋ = \left\{ η \in ℋ^{m} (W) : \int_{W} η (w) d w = 0 \right\},

and η̂* be the estimate of η₀ in ℋ that minimizes the penalized partial likelihood

- \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} \left\{ U_{i}^{T} β + η (W_{i}) - \log \sum_{k = 1}^{n} Y_{k} (t) \exp [U_{k}^{T} β + η (W_{k})] \right\} d N_{i} (t) + \frac{λ}{2} J (η) .

(2.4)

Note that ℋ is an infinite dimensional function space. Hence, in practice, the minimization of (2.4) is usually performed in a data-adaptive finite dimensional space

ℋ_{n} = N_{J} \oplus span \{ R_{J} (W_{i_{l}}, \cdot) : l = 1, \dots, q_{n} \},

where N_J = {η ∈ ℋ : J(η) = 0} is the null space of J, R_J is the reproducing kernel (see, e.g., [28]) in its complement space ℋ_J = ℋ ⊖ N_J, and {W_{i_1}, … , W_{i_{q_n}}} is a random subset of {W_i : i = 1, … , n}. When q_n = n, one selects all the W_i, i = 1, … , n as the knots. This is the number of knots used in conventional smoothing splines. However, under the regression setting, [16] showed that a q_n of the order n^{2/(r+1)+ε}, ∀ε > 0, is sufficient to yield an estimate with the optimal convergence rate. Here r is a constant associated with the Sobolev space ℋ, e.g., r = 2m for splines of order m (one-dimensional w) and r = 2m − δ, ∀δ > 0, for tensor product splines (multi-dimensional w). We shall show that such an order for q_n also works for the η estimation in our partially linear Cox model.

Let $s_{n} [f; β, η] (t) = \frac{1}{n} \sum_{k = 1}^{n} Y_{k} (t) f (U_{k}, W_{k}) \exp (U_{k}^{T} β + η (W_{k}))$ and s_n[β, η](t) = s_n[1; β, η](t). Define

s [f; β, η] (t) = E [Y (t) f (U, W) \exp (U^{T} β + η (W))] = \int \int f (u, w) e^{u^{T} β + η (w)} q (t, u, w) \, du \, dw .

For any functions f and g, define

V (f, g) = \int_{Τ} \left\{ \frac{s [f g; β, η_{0}] (t)}{s [β, η_{0}] (t)} - \frac{s [f; β, η_{0}] (t)}{s [β, η_{0}] (t)} \frac{s [g; β, η_{0}] (t)}{s [β, η_{0}] (t)} \right\} s [β, η_{0}] (t) d Λ_{0} (t) .

(2.5)

Write V (f) ≡ V (f, f). Let η̂ be the estimate that minimizes (2.4) in ℋ_n. Then we have

Theorem 2.1

Under Conditions A1-A7 in the Appendix,

(V + λ J) ({\hat{η}}^{*} - η_{0}) = O_{P} (n^{- r / (r + 1)}) and (V + λ J) (\hat{η} - η_{0}) = O_{P} (n^{- r / (r + 1)}) .

This is the optimal convergence rate for the estimate of a nonparametric function. In view of Lemma 6.1, this theorem also indicates the same convergence rate in terms of the L₂-norm. Also note that although a higher order of q_n such as O(n) would yield the same convergence rate for η̂, it would make the function space ℋ_n too big to apply an entropy bound result that is critical in the proof of Theorem 2.2.

Let $ℒ_{P} (β) = l_{p} (β) - n \sum_{j = 1}^{d} p_{θ_{j}} (| β_{j} |)$ , where

l_{p} (β) = \sum_{i = 1}^{n} \int \left\{ U_{i}^{T} β + \hat{η} (W_{i}) - \log s_{n} [β, \hat{η}] (t) \right\} d N_{i} (t) .

Let $β_{0} = (β_{10}, \dots, β_{d 0})^{T} = (β_{10}^{T}, β_{20}^{T})^{T}$ be the true coefficient vector. Without loss of generality, assume that β₂₀ = 0. Let s be the number of nonzero components in β₀. Define $a_{n} = \max_{j} \{ | p'_{θ_{j}} (| β_{j 0} |) | : β_{j 0} \neq 0 \}$, $b_{n} = \max_{j} \{ | p''_{θ_{j}} (| β_{j 0} |) | : β_{j 0} \neq 0 \}$, and

\begin{array}{rcl} b & = & (p'_{θ_{1}} (β_{10}) sgn (β_{10}), \dots, p'_{θ_{s}} (β_{s 0}) sgn (β_{s 0}))^{T}, \\ Σ_{θ} & = & diag (p'_{θ_{1}} (| β_{10} |) / | β_{10} |, \dots, p'_{θ_{s}} (| β_{s 0} |) / | β_{s 0} |) . \end{array}

Define π₁ : ℝ^{d+q} → ℝ^s such that π₁(u, w) = u₁, where u₁ is the vector of the first s components of u. Let V₀(π₁) be defined like V(π₁) in (2.5) but with β replaced by β₀.

Theorem 2.2

Under Conditions A1-A7 in the Appendix, if a_n = O(n^{−1/2}), b_n = o(1) and q_n = o(n^{1/2}), then

(i) There exists a local maximizer β̂ of ℒ_P(β) such that ||β̂ − β₀|| = O_p(n^{−1/2}).
(ii) Further assume that for all 1 ≤ j ≤ d, θ_j = o(1), $θ_{j}^{- 1} = o (n^{1/2})$, and
$\liminf_{n \to \infty} \liminf_{u \to 0^{+}} θ_{j}^{- 1} p'_{θ_{j}} (u) > 0 .$

With probability approaching one, the root-n consistent estimator β̂ in (i) must satisfy β̂₂ = 0 and
$\sqrt{n} (V_{0} (π_{1}) + Σ_{θ}) \{ \hat{β}_{1} - β_{10} + (V_{0} (π_{1}) + Σ_{θ})^{- 1} b \} \to N (0, V_{0} (π_{1})) .$

2.4. Miscellaneous Issues

In this section, we will propose the standard error estimates for both the parametric and the nonparametric components, and discuss the selection of the smoothing parameters θ and λ.

2.4.1. Standard error estimates

Let l_p(β) be the profile log partial likelihood in the last iteration of Step 3 and

Σ_{θ} (β) = diag \{ p'_{θ_{1}} (| β_{1} |) / | β_{1} |, \dots, p'_{θ_{d}} (| β_{d} |) / | β_{d} | \} .

Then the standard errors for the nonzero coefficients of β̂ are given by the sandwich formula

\widehat{cov} (\hat{β}) = \{ \nabla^{2} l_{p} (\hat{β}) - n Σ_{θ} (\hat{β}) \}^{- 1} \widehat{cov} \{ \nabla l_{p} (\hat{β}) \} \{ \nabla^{2} l_{p} (\hat{β}) - n Σ_{θ} (\hat{β}) \}^{- 1} .

Sometimes the standard errors for zero coefficients are also of interest. A discussion of this problem is in Section 5.

In (2.1), η can be decomposed as η = η^[0] + η^[1] where η^[0] lies in the null space of the penalty J representing the lower order part and η^[1] lies in the complement space representing the higher order part. A Bayes model interprets (2.1) as a posterior likelihood when η^[0] is assigned an improper constant prior and η^[1] is assigned a Gaussian prior with zero mean and certain covariance matrix. The minimizer η̂ of (2.1) then becomes the posterior mode. When the minimization is carried out in a data-adaptive function space ℋ_n with basis functions ψ = (ψ₁, … , ψ_{q_n})^T, we can write η̂ = ψ^T ĉ. Then a quadratic approximation to (2.1) yields an approximate posterior covariance matrix for c, which can be used to construct point-wise confidence intervals for η.

2.4.2. Smoothing parameter selection

As shown in [33], the effective degrees of freedom for an l₁-penalty model is well approximated by the number of nonzero coefficients. Note that our SCAD procedure is implemented by a LASSO approximation at each step. Hence, if we let Â be the set of nonzero coefficients, the AIC score for selecting θ in Step 3 is

AIC = - 2 l_{p} (\hat{β}) + 2 | \hat{A} |,

where |Â| is the cardinality of Â.

As illustrated in Section 2.2, the estimation of η in Step 2 can be cast as a density estimation problem with biased sampling. Let KL(η₀, η̂_λ) be the Kullback-Leibler distance, as defined in (2.3), between the true “density” $e^{η_{0}} / \int e^{η_{0}} d P_{n}^{w}$ and the estimate $e^{{\hat{η}}_{λ}} / \int e^{{\hat{η}}_{λ}} d P_{n}^{w}$ . An optimal λ should minimize KL(η₀, η̂_λ) or the relative Kullback-Leibler distance

RKL (η_{0}, \hat{η}_{λ}) = \frac{1}{N} \sum_{p = 1}^{N} \left\{ \frac{\int (η_{0} (w) - \hat{η}_{λ} (w)) a_{p} (w) e^{η_{0} (w)} d P_{n}^{w}}{\int a_{p} (w) e^{η_{0} (w)} d P_{n}^{w}} + \log \int a_{p} (w) e^{\hat{η}_{λ} (w)} d P_{n}^{w} \right\} .

(2.6)

The second term of (2.6) is directly computable from the estimate η̂_λ. But the first term needs to be estimated. Let ψ be the vector of spline basis functions as in the previous subsection and η = ψ^T c. Through a delete-one cross-validation approximation, a proxy for (2.6) can be derived as

- \frac{1}{N} \sum_{p = 1}^{N} \left\{ η (W_{i_{p}}) - \log \int a_{p} (w) e^{η (w)} d P_{n}^{w} \right\} + \frac{tr (P_{1} Q^{T} H^{- 1} Q P_{1})}{N (N - 1)},

where P₁ = I − 11^T / N, Q = (ψ(W_{i_1}), … , ψ(W_{i_N})), and H is the Hessian matrix for minimizing (2.1) with respect to the coefficient vector c. The smoothing parameter λ is chosen to minimize this score.

3. Numerical Studies

In the simulations, we generated failure times from the exponential hazard model with h(t|U, W) = exp[U^T β₀ + η₀(W)]. We used the same settings for the parametric component, which consists of eight covariates U_j, j = 1, … , 8. The U_j’s were generated from a multivariate normal distribution with zero mean and Cov(U_j, U_k) = 0.5^{|j−k|}. The true coefficient vector was β₀ = (0.8, 0, 0, 1, 0, 0, 0.6, 0)^T.
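A minimal sketch of this data-generating step is given below. It draws the eight covariates with the stated autoregressive covariance and then draws failure times from an exponential distribution with rate exp[U^T β₀ + η₀(W)]. The values of η₀(W_i) are passed in by the caller; censoring is not included, and the function names are hypothetical.

```python
import numpy as np

BETA0 = np.array([0.8, 0.0, 0.0, 1.0, 0.0, 0.0, 0.6, 0.0])

def ar_covariance(dim, rho):
    """Covariance matrix with entries rho^{|j - k|}."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_failure_times(n, eta_values, rng):
    """Draw U (n x 8) and exponential failure times T.

    eta_values : length-n array holding eta_0(W_i) for each subject
    rng        : a numpy.random.Generator
    """
    U = rng.multivariate_normal(np.zeros(8), ar_covariance(8, 0.5), size=n)
    hazard = np.exp(U @ BETA0 + np.asarray(eta_values, dtype=float))
    # A constant hazard h gives an exponential failure time with mean 1 / h.
    T = rng.exponential(1.0 / hazard)
    return U, T
```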

The theory in Section 2.3 gives the sufficient order for q_n, the number of knots in our smoothing spline estimation of η. In practice, [16] suggested q_n = k n^{2/(2m+1)} with k = 10 if the tensor product splines of order m are used. Since we use tensor product cubic splines in all the simulations below, our choice is q_n = 10 n^{2/5}.
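To give a sense of scale, the rule q_n = k n^{2/(2m+1)} can be evaluated directly. In the sketch below, rounding up to an integer and capping at n are assumptions made for illustration; the text does not say how non-integer values were handled.

```python
import math

def num_knots(n, m=2, k=10):
    """Number of knots q_n = k * n^(2 / (2m + 1)), rounded up and capped at n.

    m = 2 corresponds to cubic splines, giving the exponent 2/5.
    """
    return min(n, math.ceil(k * n ** (2.0 / (2 * m + 1))))
```

For the sample sizes used in Section 3.1, this gives roughly 75 knots for n = 150 and 121 knots for n = 500, well below the n knots of a conventional smoothing spline.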

3.1. Variable Selection for Parametric Components

The nonparametric part had one covariate W generated from Uniform(0, 1). Two different η₀ were used:

η_{0 a} (w) = 1.5 \sin (2 π w - \frac{π}{2}) or η_{0 b} (w) = 4 {(w - 0.3)}^{2} + 4.7 e^{- w} - 3.4643.

Note that both functions satisfy $\int_{0}^{1} η_{0} (w) d w = 0$ . Given U and W, the censoring times were generated from exponential distributions such that the censoring rates are respectively 23% and 40%. Sample sizes n = 150 and 500 were considered. One thousand data replicates were generated for each of the four combinations of η₀ and n.
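The zero-integral constraint on the two test functions can be checked numerically. The sketch below uses a simple trapezoid rule on a fine grid; the helper name `integral_01` is illustrative. For η_0b the integral is zero only up to the four-decimal rounding of the constant 3.4643.

```python
import numpy as np

def eta_0a(w):
    return 1.5 * np.sin(2 * np.pi * w - np.pi / 2)

def eta_0b(w):
    return 4 * (w - 0.3) ** 2 + 4.7 * np.exp(-w) - 3.4643

def integral_01(f, num=100001):
    """Trapezoid-rule approximation of the integral of f over [0, 1]."""
    w = np.linspace(0.0, 1.0, num)
    vals = f(w)
    return float(np.sum((vals[1:] + vals[:-1]) / 2.0) * (w[1] - w[0]))
```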

For a prediction procedure M and the estimator (β̂_M, η̂_M) yielded from the procedure, an appropriate measure for the goodness-of-fit under Cox model with h₀(t) ≡ 1 is the model error: ME(β̂_M, η̂_M) = E[(exp(−U^T β̂_M − η̂_M(W)) − exp(−U^T β₀ − η₀(W)))²]. The relative model error (RME) of M₁ versus M₂ is defined as the ratio ME(β̂_M₁, η̂_M₁)/ME(β̂_M₂, η̂_M₂). The procedure M₀ with complete oracle is used as our benchmark. In M₀, (U₁, U₄, U₇, W) are known to be the only contributing covariates, the exact form of η₀ is known, and the only parameters to be estimated are the coefficients of U₁, U₄, U₇. Note that M₀ can be implemented only in simulations, but is unrealistic in practice since neither the contributing covariates nor the form of η₀ would be known. We then compare the performance of the following four procedures, including the proposed procedures, through their RMEs versus M₀:

M_A: procedure with partial oracle and misspecified parametric η₀, that is, (U₁, U₄, U₇, W) are known to be the only contributing covariates but η₀ is misspecified to be of the parametric form η₀(W) = β_W W and β_W is estimated together with the coefficients for (U₁, U₄, U₇);
M_B: procedure with partial oracle and estimated η₀, that is, (U₁, U₄, U₇, W) are known to be the only contributing covariates but the form of η₀ is unknown, and η₀ is estimated together with the coefficients for (U₁, U₄, U₇) by penalized profile partial likelihood;
M_C: the proposed partial linear procedure with the SCAD penalty on β;
M_D: the proposed partial linear procedure with the adaptive LASSO penalty on β.

Procedure M_A has a mis-specified covariate effect. We intend to show that the estimation results can be unsatisfactory if the semiparametric form of covariate effect is mistakenly specified as parametric. Procedure M_B is “partial oracle” and expected to have equal or better performance than procedures M_C and M_D. Note, however, M_B is unrealistic in practice since the contributing covariates would not be known. M_C and M_D are two versions of the proposed partial linear procedure with different penalties on β.
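The model error and the relative model error defined above are expectations over (U, W), so in a simulation they can be approximated by averaging over a large independent sample of covariates. The sketch below assumes the fitted and true linear predictors U^T β + η(W) have already been evaluated on such a sample; the function names and this Monte Carlo approximation are illustrative assumptions.

```python
import numpy as np

def model_error(lp_hat, lp_true):
    """Monte Carlo estimate of E[(exp(-lp_hat) - exp(-lp_true))^2]."""
    lp_hat = np.asarray(lp_hat, dtype=float)
    lp_true = np.asarray(lp_true, dtype=float)
    return float(np.mean((np.exp(-lp_hat) - np.exp(-lp_true)) ** 2))

def relative_model_error(lp_hat_1, lp_hat_2, lp_true):
    """RME of procedure M1 versus M2: ME(M1) / ME(M2)."""
    return model_error(lp_hat_1, lp_true) / model_error(lp_hat_2, lp_true)
```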

For each combination of η₀ and n, we computed the following quantities out of the 1000 data replicates: the median RMEs of the complete oracle procedure M₀ versus the procedures M_A to M_D, the average number of correctly selected nonzero coefficients (CC), the average number of incorrectly selected nonzero coefficients (IC), the proportion of under-fit replicates that excluded any nonzero coefficients, the proportion of correct-fit replicates that selected the exact subset model, and the proportion of over-fit replicates that included all three significant variables and some noise variables. The results are summarized in Table 1. In general, a partial oracle with misspecified parametric η₀ (Procedure M_A) performs much worse than the other three procedures; the proposed procedure with the SCAD penalty (Procedure M_C) or the adaptive LASSO penalty (Procedure M_D) is competitive with the partial oracle with estimated η₀ (Procedure M_B); the SCAD penalty performs slightly better than the adaptive LASSO penalty. Also, the proposed procedure generally performs as well as the complete oracle. For Procedure M_C, we also did some extra computation to evaluate the proposed standard error estimate of β. In Table 2, SD is the median absolute deviation of the 1000 nonzero β̂’s divided by 0.6745, which can be regarded as the true standard error, SD_m is the median of the 1000 estimated SDs, and SD_mad is the median absolute deviation of the 1000 estimated SDs divided by 0.6745. The standard errors were set to 0 for the coefficients estimated as 0s. The results in Table 2 suggest a good performance of the proposed standard error formula for β.
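The robust spread measure used for SD and SD_mad, the median absolute deviation divided by 0.6745, can be computed as follows. The function name is illustrative. The divisor 0.6745 makes the statistic estimate the standard deviation when the values are approximately normal.

```python
import numpy as np

def mad_sd(values):
    """Median absolute deviation about the median, divided by 0.6745."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    return float(np.median(np.abs(values - center)) / 0.6745)
```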

Table 1.

Variable Selection for Parametric Components (Section 3.1).

		No. of Non-zeros		Proportion of
Procedure	MRME	CC	IC	Under-fit	Correct-fit	Over-fit
	n = 150, η₀ = η_0a (23% censoring)
M_A	0.168	-	-	-	-	-
M_B	0.475	-	-	-	-	-
M_C	0.409	2.998	0.825	0.002	0.476	0.522
M_D	0.387	2.998	0.959	0.002	0.444	0.554
	n = 150, η₀ = η_0b (40% censoring)
M_A	0.167	-	-	-	-	-
M_B	0.711	-	-	-	-	-
M_C	0.518	2.996	0.949	0.004	0.430	0.566
M_D	0.563	2.998	1.131	0.002	0.378	0.620
	n = 500, η₀ = η_0a (23% censoring)
M_A	0.056	-	-	-	-	-
M_B	0.431	-	-	-	-	-
M_C	0.396	3.000	0.717	0.000	0.525	0.475
M_D	0.375	3.000	0.736	0.000	0.540	0.460
	n = 500, η₀ = η_0b (40% censoring)
M_A	0.057	-	-	-	-	-
M_B	0.712	-	-	-	-	-
M_C	0.619	3.000	0.749	0.000	0.512	0.488
M_D	0.628	3.000	0.776	0.000	0.529	0.471

Table 2.

Standard Deviations for β̂’s in the Partial Linear SCAD Procedure M_C (Section 3.1).

	β̂₁		β̂₄		β̂₇

n, censor %	SD	SD_m(SD_mad)	SD	SD_m(SD_mad)	SD	SD_m(SD_mad)
150, 23%	0.124	0.113(0.015)	0.141	0.121(0.017)	0.135	0.109(0.015)
150, 40%	0.159	0.135(0.017)	0.188	0.145(0.021)	0.155	0.128(0.019)
500, 23%	0.065	0.059(0.005)	0.073	0.063(0.005)	0.062	0.057(0.005)
500, 40%	0.075	0.070(0.006)	0.088	0.076(0.006)	0.078	0.066(0.006)

To examine the estimation of η₀, we computed the point-wise estimates at the grid points w = 0, 0.01, … , 1 for each data replicate. Then at each grid point, the mean, the 0.025 and the 0.975 quantiles of the 1000 estimates, together with the mean of the 1000 95% confidence intervals were computed. The results are in Figure 1. The plots show satisfactory nonparametric fits and standard error estimates.

Fig 1 — Estimates for Nonparametric Components (Section 3.1). Dotted lines are true function, solid lines are connected point-wise mean estimates, faded lines are connected 0.025 and 0.975 quantiles of the point-wise estimates, and dashed lines are the connected point-wise 95% confidence intervals.

3.2. Model Selection for Nonparametric Components

In this section, we present some simulations to evaluate the model selection tool for nonparametric part introduced in Section 2.2. We used the SCAD penalty on the parametric components in this section. Two covariates W₁ and W₂, independently generated from Uniform(0, 1), were used. We considered two scenarios for the true model of the nonparametric part: (i) nonparametric univariate model η₀(W) = η₀₁(W₁) and (ii) nonparametric bivariate additive model η₀(W) = η₀₁(W₁) + η₀₂(W₂). For scenario (i), the data sets generated in the last section were used, with W₁ being the existing W covariate and W₂ being an additional noise covariate. The fitted model was nonparametric additive in W₁ and W₂. The ratios KL(η̂, η̃)/KL(η̂, η_c) for the projections to the univariate models η₀(W) = η₀₁(W₁) and η₀(W) = η₀₂(W₂) were computed. For scenario (ii), we considered two sample sizes n = 150 and 300. The true η₀ was

\begin{array}{l} η_{0} (w_{1}, w_{2}) = 0.7 η_{0 a} (w_{1}) + 0.3 η_{0 b} (w_{2}) or \\ η_{0} (w_{1}, w_{2}) = η_{0 a} (w_{1}) + η_{0 b} (w_{2}), \end{array}

where η_0a and η_0b are as defined in Section 3.1. The censoring times were generated from exponential distributions such that the resulting censoring rates were respectively 25% and 39%. Note that both choices of η₀ are additive in w₁ and w₂. The fitted model was the nonparametric bivariate full model with both the main effects and the interaction. Then the ratios KL(η̂, η̃)/KL(η̂, η_c) for the projections to the bivariate additive model and the two univariate models were computed. In both scenarios, we claim a reduced model is feasible when the corresponding ratio KL(η̂, η̃)/KL(η̂, η_c) < 0.05.

For each of the eight combinations of η₀ and n, we simulated 1000 data replicates and computed the proportions of replicates that produced the following results in the reduced model: selected the main effect of W₁, selected the main effect of W₂, selected the interaction W₁ : W₂, under-fitted the model by excluding at least one truly significant effect, correctly fitted the model by reducing to the exact subset model, and over-fitted the model by including all the truly significant effects and some irrelevant effects. These proportion results are summarized in Table 3. It shows that the variable selection tool for the nonparametric component works very well. The better performance appears to be associated with bigger sample sizes and lower censoring rates.

Table 3.

Model Selection for Nonparametric Components (Section 3.2).

Sample Size	Proportion selecting W₁	Proportion selecting W₂	Proportion selecting W₁ : W₂	Proportion under-fit	Proportion correct-fit	Proportion over-fit
	True model: η₀(w₁, w₂) = η_0a(w₁), 23% censoring
n = 150	1.000	0.036	-	0.000	0.964	0.036
n = 500	1.000	0.002	-	0.000	0.998	0.002
	True model: η₀(w₁, w₂) = η_0b(w₁), 40% censoring
n = 150	1.000	0.304	-	0.000	0.696	0.304
n = 500	1.000	0.062	-	0.000	0.938	0.062
	True model: η₀(w₁, w₂) = 0.7η_0a(w₁) + 0.3η_0b(w₂), 25% censoring
n = 150	1.000	0.998	0.084	0.002	0.914	0.084
n = 300	1.000	1.000	0.013	0.000	0.987	0.013
	True model: η₀(w₁, w₂) = η_0a(w₁) + η_0b(w₂), 39% censoring
n = 150	1.000	0.672	0.201	0.328	0.471	0.201
n = 300	1.000	0.616	0.096	0.384	0.520	0.096


4. Example

An example in [17] is a study on two sexually transmitted diseases: gonorrhea and chlamydia. The purpose of the study was to identify factors that are related to time until reinfection by gonorrhea or chlamydia given an initial infection of either disease. A sample of 877 individuals with an initial diagnosis of gonorrhea or chlamydia were followed for reinfection. Recorded for each individual were follow-up time, indicator of reinfection, demographic variables including race (white or black, U₁), marital status (divorced/separated, married, or single, U₂ and U₃), age at initial infection (W₁), years of schooling (W₂), and type of initial infection (gonorrhea, chlamydia, or both, U₄ and U₅), behavior factors at the initial diagnosis including number of partners in the last 30 days (U₆), indicators of oral sex within past 12 months and within past 30 days (U₇ and U₈), indicators of rectal sex within past 12 months and within past 30 days (U₉ and U₁₀), and condom use (always, sometimes, or never, U₁₁ and U₁₂), symptom variables at time of initial infection including presence of abdominal pain (U₁₃), sign of discharge (U₁₄), sign of dysuria (U₁₅), sign of itch (U₁₆), sign of lesion (U₁₇), sign of rash (U₁₈), and sign of lymph involvement (U₁₉), and symptom variables at time of examination including involvement vagina at exam (U₂₀), discharge at exam (U₂₁), and abnormal node at exam (U₂₂).

We used q_n = 10 · 877^{2/5} ≈ 151 knots in all the analyses below. We first considered the partially linear Cox model

h_{i} (t | Z) = h_{0} (t) \exp \left\{ \sum_{j = 1}^{3} η_{j} (W_{j i}) + \sum_{k = 1}^{22} U_{k i} β_{k} \right\},

where η₃(W_3i) = η₃(W_1i, W_2i) is the interaction term between W₁ and W₂. However, the interaction term was found to be negligible, with the ratio KL(η̂, η̃)/KL(η̂, η_c) = 0.003. Hence, we took out this interaction term and refitted the model. In this model, neither W₁ (age) nor W₂ (years of schooling) in the nonparametric component was found to be negligible, with the ratios KL(η̂, η̃)/KL(η̂, η_c) equal to 0.633 for removing W₁ and 0.259 for removing W₂. Their effects are plotted in Figure 2 together with the 95% point-wise confidence intervals. We can see that the hazard increased with age at both ends of the age domain (between ages 13 and 20, and between ages 38 and 48) and stayed flat in the middle, and that the hazard decreased with years of schooling from 6 years to 10 years but stayed flat afterwards. The fitted coefficients from the proposed method with the SCAD penalty are in Table 4 together with their standard error estimates. For comparison, Table 4 also lists the fitted coefficients and standard errors for three other models, namely the proposed semiparametric relative risk model with the adaptive LASSO penalty, and the parametric relative risk models with the SCAD and the adaptive LASSO penalties. We can see that the SCAD penalty yielded sparser models than the adaptive LASSO penalty, and that both parametric models missed the age effect. Common factors identified by all four procedures to be associated with reinfection risk are marital status, type of infection, oral sex behavior, condom use, sign of abdominal pain, sign of lymph involvement and sign of discharge at exam.
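The knot count and the screening of the nonparametric terms in this example can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the helper names `num_knots` and `is_negligible` are hypothetical, rounding the knot count up is an assumption (10 · 877^{2/5} is about 150.4), and reusing the 0.05 cutoff from Section 3.2 for this example is also an assumption.

```python
import math


def num_knots(n: int, multiplier: float = 10.0, exponent: float = 2.0 / 5.0) -> int:
    """Knot count q_n = multiplier * n**exponent, rounded up (rounding is an assumption)."""
    return math.ceil(multiplier * n ** exponent)


def is_negligible(kl_ratio: float, threshold: float = 0.05) -> bool:
    """A term is treated as negligible when its KL ratio falls below the cutoff."""
    return kl_ratio < threshold


if __name__ == "__main__":
    print("knots:", num_knots(877))
    # KL ratios reported in the text for the three nonparametric terms.
    reported = {"W1:W2 interaction": 0.003, "W1 (age)": 0.633, "W2 (schooling)": 0.259}
    for term, ratio in reported.items():
        print(term, "negligible" if is_negligible(ratio) else "retained")
```

Under these assumptions only the interaction term is dropped, in agreement with the refitted additive model.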

Fig 2 — Nonparametric Component Estimates for Sexually Transmitted Diseases Data. Left: Effect of age at initial infection. Right: Effect of years of schooling. Solid lines are the estimates, dashed lines are 95% confidence intervals, and dotted lines are the reference zero lines.

Table 4.

Fitted Coefficients and Their Standard Errors for Sexually Transmitted Diseases Data. (Models from top to bottom: semiparametric relative risk with SCAD penalty and with adaptive LASSO penalty, parametric relative risk with SCAD penalty and with adaptive LASSO penalty.)

age	yschool	npart	raceW	maritalM	maritalS
-(-)	-(-)	0(-)	0(-)	0(-)	0.487(0.212)
-(-)	-(-)	0.060(0.048)	-0.127(0.097)	0(-)	0.448(0.186)
0(-)	-0.059(0.018)	0(-)	0(-)	0(-)	0.332(0.213)
0(-)	-0.119(0.031)	0.026(0.024)	0(-)	0(-)	0.210(0.119)

typeC	typeB	oralY	oralM	rectY	rectM
-0.412(0.149)	-0.337(0.144)	-0.336(0.201)	-0.341(0.235)	0(-)	0(-)
-0.349(0.137)	-0.300(0.130)	-0.330(0.155)	-0.318(0.173)	0(-)	0(-)
-0.376(0.149)	-0.249(0.145)	-0.236(0.202)	-0.348(0.235)	0(-)	0(-)
-0.228(0.096)	-0.083(0.065)	-0.110(0.058)	-0.371(0.117)	0(-)	0(-)

abdom	disc	dysu	condS	condN	itch
0.253(0.151)	0(-)	0.193(0.152)	0(-)	-0.327(0.114)	0(-)
0.177(0.120)	0(-)	0.089(0.074)	0.152(0.114)	-0.291(0.106)	0(-)
0.285(0.148)	0(-)	0(-)	0(-)	-0.296(0.114)	0(-)
0.184(0.094)	0(-)	0(-)	0(-)	-0.223(0.092)	0(-)

lesion	rash	lymph	involve	discE	node
0(-)	0(-)	0(-)	0.423(0.166)	-0.460(0.220)	0(-)
0(-)	0(-)	0(-)	0.327(0.159)	-0.407(0.209)	0(-)
0(-)	0(-)	0(-)	0.392(0.168)	-0.443(0.221)	0(-)
0(-)	0(-)	0(-)	0.289(0.133)	-0.280(0.163)	0(-)


5. Discussion

We have proposed a Cox PH model with semiparametric relative risk. The nonparametric part of the risk is estimated by a smoothing spline ANOVA model, with a model selection procedure derived from a Kullback-Leibler geometry. The parametric part of the risk is estimated by penalized profile partial likelihood, with variable selection achieved through a nonconcave penalty. Both theoretical and numerical studies show promising results for the proposed method. An important question in using the method in practice is which covariate effects should be treated as parametric. We suggest the following guideline. As a starting point, the effects of all the continuous covariates are put in the nonparametric part and those of the discrete covariates in the parametric part. If the estimation results show that some of the continuous covariate effects can be described by a parametric form, such as a linear form, then a new model can be fitted with those continuous covariate effects moved to the parametric part. In this way, one can take full advantage of the flexible exploratory analysis provided by the proposed method.

We thank a referee for raising the interesting question of the standard error estimates for the coefficients estimated to be 0 in β̂. [25] and [7] suggested setting these standard errors to 0, based on the belief that covariates with zero coefficient estimates are not important. This is the approach adopted here. When such a belief is in doubt, nonzero standard errors are preferred even for coefficients estimated to be 0. This problem has been addressed in only a few papers. [22] looked at the problem for the LASSO, but their solution is based on a smooth approximation. [24] presented a Bayesian approach and pointed out that no fully satisfactory frequentist solution had been proposed so far, whether the LASSO or the SCAD variable selection procedure is considered. This problem presents an interesting challenge that we hope to address in future work.

Another choice of p_θ(| · |) is the adaptive LASSO penalty ([31]). Our simulations in §3.1 indicate that its performance is similar to that of the SCAD penalty, so we decided not to present the details here.

Although our method is presented for time-independent covariates, a lengthier argument modifying [23] can yield similar theoretical results for external time-dependent covariates ([15]). However, the implementation of such an extension is more complicated and is not pursued here.

A recently proposed nonparametric component selection procedure in a penalized likelihood framework is the COSSO method in [19] where the penalty switches from J(η) to J^1/2(η). Taking advantage of the smoothing spline ANOVA decomposition, the COSSO method does model selection by applying a soft thresholding type operation to the function components. An extension of COSSO to the Cox proportional hazards model with nonparametric relative risk is available in [18]. Although a similar extension to our proportional hazards model with semiparametric relative risk is of interest, it is not clear whether the theoretical properties of the COSSO method such as the existence and the convergence rate of the COSSO estimator can be transferred to the estimation of η under our semiparametric setting. Furthermore, the dimension of the function space in COSSO is O(n), too big to allow an entropy bound that is critical in deriving the asymptotic properties of β̂.

6. Proofs

For z = (u, w), let p_z(t) = exp(u^T β + η₀(w))q(t, u, w)/s[β, η₀](t) and f̄_t ≡ ∫ ∫ f(u, w)p_z(t)dudw. Let S̃(t, u, w) = E[Y(t) ∣ U = u, W = w] = P(Y(t) = 1 ∣ U = u, W = w) and q(t, u, w) = S̃(t, u, w)p(u, w), where p(u, w) is the density function of (U, W). Let Ƶ = U × W be the domain of the covariate Z = (U^T, W^T)^T. We need the following conditions.

A1. The true coefficient β₀ is an interior point of a bounded subset of ℝ^d.
A2. The domain Ƶ of the covariate is a compact set in ℝ^{d+q}.
A3. Failure time T and censoring time C are conditionally independent given the covariate Z.
A4. Assume the observations are in a finite time interval [0, τ ]. Assume that the baseline hazard function h₀(t) is bounded away from zero and infinity.
A5. Assume that there exist constants k₂ > k₁ > 0 such that k₁ < q(t, u, w) < k₂ and $| \frac{\partial}{\partial t} q (t, u, w) | < k_{2}$ .
A6. Assume the true function η₀ ∈ ℋ. For any η in a sufficiently big convex neighborhood B₀ of η₀, there exist constants c₁, c₂ > 0 such that c₁e^η₀(w) ≤ e^η(w) ≤ c₂e^η₀(w) for all w.
A7. The smoothing parameter λ ≍ n^{−r/(r+1)}.

Condition A1 requires that β₀ is not on the boundary of the parameter space. Condition A2 is a common boundedness assumption on the covariates. Condition A3 assumes noninformative censoring. Condition A4 is the common boundedness assumption on the baseline hazard. Condition A5 bounds the joint density of (T, Z) and thus also the derivatives of the partial likelihood. Condition A6 assumes that η₀ has a proper level of smoothness and integrates to zero. The neighborhood B₀ in Condition A6 should be big enough to contain all the estimates of η₀ considered below. When the members of B₀ are all uniformly bounded, Condition A6 is automatically satisfied. The order for λ in Condition A7 matches that in standard smoothing spline problems.

We first show the equivalence between V (·) and the L₂-norm ${∥ \cdot ∥}_{2}^{2}$ .

Lemma 6.1

Let f ∈ ℋ. Then there exist constants 0 < c₃ ≤ c₄ < ∞ such that

c_{3} {∥ f ∥}_{2}^{2} \leq V (f) \leq c_{4} {∥ f ∥}_{2}^{2} .

Proof

For z = (u, w), let p_z (t) = exp(u^T β + η₀(w))q(t, u, w)/s[β, η₀](t) and f̄_t ≡ ∫ ∫ f(u, w)p_z (t)dudw. Simple algebraic manipulation yields

V (f) = \int_{Τ} \left\{ \int \int {(f (u, w) - {\bar{f}}_{t})}^{2} p_{z} (t) dudw \right\} s [β, η_{0}] (t) d Λ_{0} (t) .

By Conditions A4 and A5, there exist positive constants c₁ and c₂ such that

\begin{array}{l} c_{1} \int_{Τ} {\int \int {(f (u, w) - {\bar{f}}_{t})}^{2} dudw} d Λ_{0} (t) \\ \leq V (f) \leq c_{2} \int_{Τ} {\int \int {(f (u, w) - {\bar{f}}_{t})}^{2} dudw} d Λ_{0} (t) \end{array}

Let m(Ƶ) < ∞ be the Lebesgue measure of Ƶ. Then

\begin{array}{l} \int_{Τ} \left\{ \int \int {(f (u, w) - {\bar{f}}_{t})}^{2} dudw \right\} d Λ_{0} (t) \\ = Λ_{0} (τ) \int \int f^{2} (u, w) dudw + m (Ƶ) \int_{Τ} {[{\bar{f}}_{t}]}^{2} d Λ_{0} (t) . \end{array}

The lemma follows from the Cauchy-Schwarz inequality and Condition A4.

Proof of Theorem 2.1

We will prove the results using an eigenvalue analysis in three steps. In the first step (linear approximation), we show the convergence rate O_p(n^{−r/(r+1)}) for the minimizer η̃ of a quadratic approximation to (2.4). In the second step (approximation error), we show that the difference between η̃ and the estimate η̂* in ℋ is also O_p(n^{−r/(r+1)}), and so is the convergence rate of η̂*. In the third step (semiparametric approximation), we show that the projection η* of η̂* in ℋ_n is not so different from either η̂* or the estimate η̂ in ℋ_n, and then the convergence rate of η̂ follows.

A quadratic functional B is said to be completely continuous with respect to another quadratic functional A if, for any ε > 0, there exists a finite number of linear functionals L₁, … , L_k such that L_j f = 0, j = 1, … , k, implies that B(f) ≤ εA(f). When a quadratic functional B is completely continuous with respect to another quadratic functional A, there exist eigenfunctions {ϕ_ν, ν = 1, 2, ⋯} such that B(ϕ_ν, ϕ_μ) = δ_νμ and A(ϕ_ν, ϕ_μ) = ρ_ν δ_νμ, where δ_νμ is the Kronecker delta and 0 ≤ ρ_ν ↑ ∞. Functions satisfying A(f) < ∞ can then be expressed as a Fourier series expansion f = Σ_ν f_ν ϕ_ν, where f_ν = B(f, ϕ_ν) are the Fourier coefficients. See, e.g., [29] and [9].

We first present two lemmas without proof. The first one follows directly from the results in Section 8.1 of [9] and Lemma 6.1. The second one is exactly Lemma 8.1 in [9].

Lemma 6.2

V is completely continuous with respect to J, and the eigenvalues ρ_ν of J with respect to V satisfy $ρ_{ν}^{- 1} = O (ν^{- r})$ as ν → ∞.

Lemma 6.3

As λ → 0, the sums $\sum_{ν} \frac{λ ρ_{ν}}{{(1 + λ ρ_{ν})}^{2}}$ , $\sum_{ν} \frac{1}{{(1 + λ ρ_{ν})}^{2}}$ , and $\sum_{ν} \frac{1}{1 + λ ρ_{ν}}$ are all of order O(λ^{−1/r}).
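The scaling in Lemma 6.3 can be checked numerically. The following is a minimal sketch under the simplifying assumption that the eigenvalues are exactly ρ_ν = ν^r (the lemma only needs this growth rate); the function names and the truncation point `n_terms` are illustrative choices, not part of the original argument. If the sums are of order λ^{−1/r}, multiplying them by λ^{1/r} should give roughly constant values as λ shrinks.

```python
from typing import Tuple


def lemma_sums(lam: float, r: int, n_terms: int = 200_000) -> Tuple[float, float, float]:
    """Truncated versions of the three sums in Lemma 6.3 with rho_nu = nu**r (assumed)."""
    s1 = s2 = s3 = 0.0
    for nu in range(1, n_terms + 1):
        x = lam * nu ** r  # lambda * rho_nu
        s1 += x / (1.0 + x) ** 2
        s2 += 1.0 / (1.0 + x) ** 2
        s3 += 1.0 / (1.0 + x)
    return s1, s2, s3


def scaled_sums(lam: float, r: int, n_terms: int = 200_000) -> Tuple[float, float, float]:
    """Each sum multiplied by lam**(1/r); roughly constant if the sum is O(lam**(-1/r))."""
    scale = lam ** (1.0 / r)
    s1, s2, s3 = lemma_sums(lam, r, n_terms)
    return s1 * scale, s2 * scale, s3 * scale


if __name__ == "__main__":
    for lam in (1e-4, 1e-6, 1e-8):
        print(lam, scaled_sums(lam, r=4))
```

The first two sums add up to the third term by term, which gives a quick internal consistency check on the computation.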

Step 1: Linear Approximation

A linear approximation η̃ to η̂* is the minimizer of a quadratic approximation to (2.4),

\begin{array}{l} - \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} \left\{ η (W_{i}) - s_{n}^{- 1} [β, η_{0}] (t) s_{n} [η - η_{0}; β, η_{0}] (t) \right\} d N_{i} (t) \\ \quad + \frac{1}{2} V (η - η_{0}) + \frac{λ}{2} J (η) . \end{array}

(6.1)

Let η = Σ_ν η_νϕ_ν and η₀ = Σ_ν η_ν,0ϕ_ν be the Fourier expansions of η and η₀. Plugging them into (6.1) and dropping the terms not involving η yield

\sum_{ν} \left\{ - η_{ν} γ_{ν} + \frac{1}{2} {(η_{ν} - η_{ν, 0})}^{2} + \frac{λ}{2} ρ_{ν} η_{ν}^{2} \right\},

(6.2)

where $γ_{ν} = \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} \{ ϕ_{ν} (W_{i}) - s_{n}^{- 1} [β, η_{0}] (t) s_{n} [ϕ_{ν}; β, η_{0}] (t) \} d N_{i} (t)$ . The Fourier coefficients that minimize (6.2) are $\tilde{η}_{ν} = (γ_{ν} + η_{ν, 0}) / (1 + λ ρ_{ν})$ . Note that ∫ ϕ_ν (w)dw = 0 and V (ϕ_ν) = 1. Straightforward calculation gives E[γ_ν] = 0 and $E [γ_{ν}^{2}] = n^{- 1}$ . Then,

\begin{array}{c} E [V (\tilde{η} - η_{0})] = \frac{1}{n} \sum_{ν} \frac{1}{{(1 + λ ρ_{ν})}^{2}} + λ \sum_{ν} \frac{λ ρ_{ν}}{{(1 + λ ρ_{ν})}^{2}} ρ_{ν} η_{ν, 0}^{2}, \\ E [λ J (\tilde{η} - η_{0})] = \frac{1}{n} \sum_{ν} \frac{λ ρ_{ν}}{{(1 + λ ρ_{ν})}^{2}} + λ \sum_{ν} \frac{{(λ ρ_{ν})}^{2}}{{(1 + λ ρ_{ν})}^{2}} ρ_{ν} η_{ν, 0}^{2} . \end{array}

(6.3)

Combining Lemma 6.3 and (6.3), we obtain that (V + λJ)(η̃ − η₀) = O_p(λ + n⁻¹ λ^{−1/r}), as n → ∞ and λ → 0.
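For readers who want the intermediate algebra behind (6.3), the following worked steps are a sketch that uses only the quantities defined above. Setting the derivative of the ν-th summand of (6.2) with respect to η_ν to zero gives

- γ_{ν} + (η_{ν} - η_{ν, 0}) + λ ρ_{ν} η_{ν} = 0, \quad \text{so that} \quad \tilde{η}_{ν} - η_{ν, 0} = \frac{γ_{ν} - λ ρ_{ν} η_{ν, 0}}{1 + λ ρ_{ν}} .

Since E[γ_ν] = 0 and $E [γ_{ν}^{2}] = n^{- 1}$ , the cross term vanishes in expectation and

E [{(\tilde{η}_{ν} - η_{ν, 0})}^{2}] = \frac{n^{- 1} + {(λ ρ_{ν})}^{2} η_{ν, 0}^{2}}{{(1 + λ ρ_{ν})}^{2}} .

Summing over ν gives the first line of (6.3), and summing after multiplying by λρ_ν gives the second. Because λρ_ν/(1 + λρ_ν)² ≤ 1/4, (λρ_ν)²/(1 + λρ_ν)² ≤ 1 and $\sum_{ν} ρ_{ν} η_{ν, 0}^{2} = J (η_{0}) < \infty$ , the terms involving η_{ν,0} are O(λ), while Lemma 6.3 bounds the remaining terms by O(n⁻¹ λ^{−1/r}).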

Step 2: Approximation Error

We now investigate the approximation error η̂* − η̃ and prove the convergence rate of η̂*. Define A_f,g(α) and B_f,g(α) respectively as the resulting functionals from setting η = f + αg in (2.4) and (6.1). Differentiating them with respect to α and then setting α = 0 yields

{\dot{A}}_{f, g} (0) = - \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} \left\{ g (W_{i}) - s_{n}^{- 1} [β, f] (t) s_{n} [g; β, f] (t) \right\} d N_{i} (t) + λ J (f, g),

(6.4)

\begin{array}{lcl} {\dot{B}}_{f, g} (0) & = & - \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} \left\{ g (W_{i}) - s_{n}^{- 1} [β, η_{0}] (t) s_{n} [g; β, η_{0}] (t) \right\} d N_{i} (t) \\ & & + V (f - η_{0}, g) + λ J (f, g) . \end{array}

(6.5)

Set f = η̂* and g = η̂* − η̃ in (6.4), and set f = η̃ and g = η̂* − η̃ in (6.5). Then subtracting the resulting equations gives

\begin{array}{l} μ_{{\hat{η}}^{*}} ({\hat{η}}^{*} - \tilde{η}) - μ_{\tilde{η}} ({\hat{η}}^{*} - \tilde{η}) + λ J ({\hat{η}}^{*} - \tilde{η}) \\ = V (\tilde{η} - η_{0}, {\hat{η}}^{*} - \tilde{η}) + μ_{η_{0}} ({\hat{η}}^{*} - \tilde{η}) - μ_{\tilde{η}} ({\hat{η}}^{*} - \tilde{η}), \end{array}

(6.6)

where $μ_{f} (g) \equiv \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} s_{n}^{- 1} [β, f] (t) s_{n} [g; β, f] (t) d N_{i} (t)$ . Define

S_{n} [f, g] (t) = \frac{s_{n} [f g; β, η_{0}] (t)}{s_{n} [β, η_{0}] (t)} - \frac{s_{n} [f; β, η_{0}] (t)}{s_{n} [β, η_{0}] (t)} \frac{s_{n} [g; β, η_{0}] (t)}{s_{n} [β, η_{0}] (t)}

and let S[f, g](t) be its limit. The following lemma is needed to proceed.

Lemma 6.4

\frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} S_{n} [f, g] (t) d N_{i} (t) = V (f, g) + o_{p} ( {\{ (V + λ J) (f) (V + λ J) (g) \}}^{1 / 2} ) .

Proof

Let f = Σ_νf_νϕ_ν and g = Σ_μg_μϕ_μ be the Fourier series expansion of f and g. [1] shows that sup_t |s_n[β, η₀] − s[β, η₀]| converges to zero in probability. Note that

M (t) \equiv M (t ∣ Z) = N (t) - \int_{0}^{t} s [β, η_{0}] (τ) d Λ_{0} (τ)

defines a local martingale with mean zero. Combining the above uniform convergence result and the martingale property with the boundedness condition, we obtain that for any ν and μ,

E \left[ {\left\{ \int_{Τ} S [ϕ_{ν}, ϕ_{μ}] (t) d N (t) - V (ϕ_{ν}, ϕ_{μ}) \right\}}^{2} \right] < \infty .

Then from the Cauchy-Schwarz inequality and Lemma 6.3,

\begin{array}{l} \left| \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} S_{n} [f, g] (t) d N_{i} (t) - V (f, g) \right| \\ = \left| \sum_{ν} \sum_{μ} f_{ν} g_{μ} \left\{ \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} S_{n} [ϕ_{ν}, ϕ_{μ}] (t) d N_{i} (t) - V (ϕ_{ν}, ϕ_{μ}) \right\} \right| \\ \leq {\left[ \sum_{ν} \sum_{μ} \frac{1}{1 + λ ρ_{ν}} \frac{1}{1 + λ ρ_{μ}} {\left\{ \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} S_{n} [ϕ_{ν}, ϕ_{μ}] (t) d N_{i} (t) - V (ϕ_{ν}, ϕ_{μ}) \right\}}^{2} \right]}^{1 / 2} \\ \quad \times {\left[ \sum_{ν} \sum_{μ} (1 + λ ρ_{ν}) (1 + λ ρ_{μ}) f_{ν}^{2} g_{μ}^{2} \right]}^{1 / 2} \\ = O_{p} (n^{- 1 / 2} λ^{- 1 / r}) {\{ (V + λ J) (f) (V + λ J) (g) \}}^{1 / 2} . \end{array}

A Taylor expansion at η₀ gives

\begin{matrix} μ_{{\hat{η}}^{*}} ({\hat{η}}^{*} - \tilde{η}) - μ_{\tilde{η}} ({\hat{η}}^{*} - \tilde{η}) = \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} S_{n} [{\hat{η}}^{*} - \tilde{η}, {\hat{η}}^{*} - \tilde{η}] (t) d N_{i} (t) (1 + o_{p} (1)), \\ μ_{\tilde{η}} ({\hat{η}}^{*} - \tilde{η}) - μ_{η_{0}} ({\hat{η}}^{*} - \tilde{η}) = \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} S_{n} [\tilde{η} - η_{0}, {\hat{η}}^{*} - \tilde{η}] (t) d N_{i} (t) (1 + o_{p} (1)) . \end{matrix}

Then by the mean value theorem, Condition A6, Lemma 6.4, and (6.6),

\begin{array}{l} (c_{1} V + λ J) ({\hat{η}}^{*} - \tilde{η}) (1 + o_{p} (1)) \\ \leq {(| 1 - c | V + λ J) ({\hat{η}}^{*} - \tilde{η})}^{1 / 2} O_{p} ({(| 1 - c | V + λ J) (\tilde{η} - η_{0})}^{1 / 2}) \end{array}

for some c ∈ [c₁, c₂]. Then the convergence rate of η̂* follows from that of η̃ proved in the previous step.

Step 3: Semiparametric Approximation

Our last goal is the convergence rate for the minimizer η̂ in the space ℋ_n. For any h ∈ ℋ ⊖ ℋ_n, one has h(W_{i_l}) = J(R_J(W_il,·),h) = 0, so $s_{q_{n}} [h^{j}; β, η_{0}] (t) = \frac{1}{q_{n}} \sum_{k = 1}^{q_{n}} Y_{i_{k}} (t) h^{j} (W_{i_{k}}) \exp (U_{i_{k}}^{T} β + η (W_{i_{k}})) = 0$ for j = 1, 2 and $\sum_{l = 1}^{q_{n}} \int_{Τ} S_{q_{n}} [h, h] (t) d N_{i_{l}} (t) = 0$ . Hence, by the same arguments used in the proof of Lemma 6.4,

\begin{array}{l} V (h) = | \frac{1}{q_{n}} \sum_{l = 1}^{q_{n}} \int_{Τ} S_{q_{n}} [h, h] (t) d N_{i_{l}} (t) - V (h) | \\ = O_{p} (q_{n}^{- 1 / 2} λ^{- 1 / r}) (V + λ J) (h) = o_{p} (λ J (h)), \end{array}

(6.7)

where the last equality follows from q_n ≍ n^{2/(r+1)+ε} and Condition A7.
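The order calculation behind the last equality in (6.7) can be written out as follows. This is a sketch under the stated orders of q_n and λ, with ε > 0 as in the text:

q_{n}^{- 1 / 2} λ^{- 1 / r} \asymp n^{- 1 / (r + 1) - ε / 2} \cdot n^{1 / (r + 1)} = n^{- ε / 2} \to 0 .

Hence V(h) = o_p(1){V(h) + λJ(h)}, and moving the o_p(1)V(h) term to the left-hand side gives V(h) = o_p(λJ(h)).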

Let η* be the projection of η̂* in ℋ_n. Setting f = η̂* and g = η̂* − η* in (6.4) and noting that J(η*, η̂* − η*) = 0, some algebra yields

\begin{array}{l} λ J ({\hat{η}}^{*} - η^{*}) = \left\{ \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} ({\hat{η}}^{*} - η^{*}) (W_{i}) d N_{i} (t) - μ_{η_{0}} ({\hat{η}}^{*} - η^{*}) \right\} \\ \quad - \left\{ μ_{{\hat{η}}^{*}} ({\hat{η}}^{*} - η^{*}) - μ_{η_{0}} ({\hat{η}}^{*} - η^{*}) \right\} . \end{array}

(6.8)

Recall that $γ_{ν} = \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} ϕ_{ν} (W_{i}) d N_{i} (t) - μ_{η_{0}} (ϕ_{ν})$ with E[γ_ν] = 0 and $E [γ_{ν}^{2}] = 1 / n$ . An application of the Cauchy-Schwarz inequality and Lemma 6.3 shows that the first term in (6.8) is of order {(V + λJ)(η̂* − η*)}^{1/2} O_p(n^{−1/2} λ^{−1/(2r)}). By the mean value theorem, Condition A6, Lemma 6.4, and (6.7), the remaining term in (6.8) is of order o_p({λJ(η̂* − η*)(V + λJ)(η̂* − η₀)}^{1/2}). These, combined with (6.8) and the convergence rates of η̂*, yield λJ(η̂* − η*) = O_p(n⁻¹ λ^{−1/r} + λ) and V(η̂* − η*) = O_p(n⁻¹ λ^{−1/r} + λ).

Note that J(η̂* − η*, η*) = J(η̂* − η*, η̂) = 0, so J(η̂*, η̂* − η̂) = J(η̂* − η*) + J(η*, η* − η̂). Set f = η̂ and g = η̂ − η* in (6.4), and set f = η̂* and g = η̂* − η̂ in (6.4). Adding the resulting equations yields

\begin{array}{l} μ_{\hat{η}} (\hat{η} - η^{*}) - μ_{η_{0}} (\hat{η} - η^{*}) + λ J (\hat{η} - η^{*}) + λ J ({\hat{η}}^{*} - η^{*}) \\ = \left\{ \frac{1}{n} \sum_{i = 1}^{n} \int_{Τ} ({\hat{η}}^{*} - η^{*}) (W_{i}) d N_{i} (t) - μ_{η_{0}} ({\hat{η}}^{*} - η^{*}) \right\} \\ \quad - \left\{ μ_{{\hat{η}}^{*}} ({\hat{η}}^{*} - η^{*}) - μ_{η_{0}} ({\hat{η}}^{*} - η^{*}) \right\} + \left\{ μ_{{\hat{η}}^{*}} (\hat{η} - η^{*}) - μ_{η_{0}} (\hat{η} - η^{*}) \right\} . \end{array}

(6.9)

By the mean value theorem, Condition A6, and Lemma 6.4, the left-hand side of (6.9) is bounded from below by

(c_{1} V + λ J) (\hat{η} - η^{*}) (1 + o_{p} (1)) + λ J ({\hat{η}}^{*} - η^{*}) .

For the right-hand side, the terms in the first and second brackets are respectively of the orders {(V + λJ)(η̂* − η*)}^{1/2} O_p(n^{−1/2} λ^{−1/(2r)}) and o_p({λJ(η̂* − η*)(V + λJ)(η̂* − η₀)}^{1/2}) by arguments similar to those for (6.8), and the term in the third bracket is of the order

{(V + λ J) (\hat{η} - η^{*})}^{1 / 2} o_{p} ({λ J ({\hat{η}}^{*} - η^{*})}^{1 / 2})

by Condition 3, Lemma 6.4, and (6.7). Putting all these together, one obtains (V + λJ)(η̂ − η*) = O_p(n⁻¹ λ^{−1/r} + λ) and hence (V + λJ)(η̂ − η₀) = O_p(n⁻¹ λ^{−1/r} + λ). An application of Condition A7 then yields the final convergence rates.
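To spell out the last step (a worked check, not part of the original argument): with λ ≍ n^{−r/(r+1)} from Condition A7,

n^{- 1} λ^{- 1 / r} \asymp n^{- 1} \cdot n^{1 / (r + 1)} = n^{- r / (r + 1)} \asymp λ, \quad \text{so that} \quad (V + λ J) (\hat{η} - η_{0}) = O_{p} (n^{- r / (r + 1)}) .

By Lemma 6.1, V is equivalent to the squared L₂-norm, so this corresponds to ∥η̂ − η₀∥₂ = O_p(n^{−r/[2(r+1)]}), which is the rate used in the proof of the asymptotic normality of β̂₁.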

Proof for the Asymptotic Properties of β̂

Let P_n be the empirical measure of (X_i, Δ_i = 1, Z_i), i = 1, … , n, such that it is related to the empirical measure Q_n of (X_i, Δ_i, Z_i), i = 1, … , n, by $P_{n} f = \int f d P_{n} = \int Δ f d Q_{n} = n^{- 1} \sum_{i = 1}^{n} Δ_{i} f (T_{i}, Δ_{i} Z_{i})$ . Let P be its corresponding (sub)probability measure. Let L₂(P) = {f : ∫ f²dP < ∞} and || · ||₂ be the usual L₂-norm. For any subclass F of L₂(P) and any ε > 0, let N_[](ε, F, L₂(P)) be the bracketing number and $J_{[]} (δ, F, L_{2} (P)) = \int_{0}^{δ} \sqrt{1 + \log N_{[]} (ε, F, L_{2} (P))} d ε$ .

Lemma 6.5

Let m₀(t, u, w; β, η) = u^T β+η(w)−log s [β, η ](t), m₁(t, u, w; s, β, η) = 1_[s≤t] exp(u^T β + η(w)), and m₂(t, u, w; s, β, η, f) = 1_[s≤t]f (u, w) exp(u^T β + η(w)). Define the classes of functions

\begin{array}{l} ℳ_{0} (δ) = \{ m_{0} : ∥ β - β_{0} ∥ \leq δ, {∥ η - η_{0} ∥}_{2} \leq δ \}, \\ ℳ_{1} (δ) = \{ m_{1} : s \in Τ, ∥ β - β_{0} ∥ \leq δ, {∥ η - η_{0} ∥}_{2} \leq δ \}, \\ ℳ_{2} (δ) = \{ m_{2} : s \in Τ, ∥ β - β_{0} ∥ \leq δ, {∥ η - η_{0} ∥}_{2} \leq δ, {∥ f ∥}_{2} \leq δ \} . \end{array}

Then $J_{[]} (δ, ℳ_{0}, L_{2} (P)) \leq c_{0} q_{n}^{1 / 2} δ$ and J_[](δ, ℳ_j, L₂(P)) = c_j δ{q_n + log(1/δ)}^{1/2}, j = 1, 2.

Proof

The proof is similar to that of Corollary A.1 in [10] and thus omitted here.

Lemma 6.6

\sup_{t \in Τ} | s^{- 1} [β_{0}, \hat{η}] (t) s [U; β_{0}, \hat{η}] (t) - s_{n}^{- 1} [β_{0}, \hat{η}] (t) s_{n} [U; β_{0}, \hat{η}] (t) | = o_{p} (n^{- 1 / 2}) .

Proof

Write

\begin{array}{l} s^{- 1} [β_{0}, \hat{η}] (t) s [U; β_{0}, \hat{η}] (t) - s_{n}^{- 1} [β_{0}, \hat{η}] (t) s_{n} [U; β_{0}, \hat{η}] (t) = \\ \frac{s [U; β_{0}, \hat{η}] (t) A_{2 n} (t) - s [β_{0}, \hat{η}] (t) A_{1 n} (t)}{s [β_{0}, \hat{η}] (t) s_{n} [β_{0}, \hat{η}] (t)}, \end{array}

where A_1n(t) = s_n[U; β₀, η̂](t) − s[U; β₀, η̂](t) and A_2n(t) = s_n[β₀, η̂](t) − s[β₀, η̂](t). Note that q_n = o(n^{1/2}); hence the result follows from Lemma 3.4.2 of [27] and Lemma 6.5.

Proof of Theorem 2.2

Let γ_n = n^{−1/2}. To prove Theorem 2.2(i), we need to show that for any δ > 0, there exists a large constant C such that

P \left\{ \sup_{∥ υ ∥ = C} ℒ_{P} (β_{0} + γ_{n} υ) < ℒ_{P} (β_{0}) \right\} \geq 1 - δ .

Consider ℒ_P(β₀ + γ_nυ) − ℒ_P(β₀). We can decompose it into the sum of D_n1 = l_p(β₀ + γ_nυ) − l_p(β₀) and the penalty difference D_n2. As shown in [7], under the assumption of a_n = O(n^{−1/2}) and b_n = o(1), n⁻¹ D_n2 is bounded by

\sqrt{s} γ_{n} a_{n} ∥ υ ∥ + γ_{n}^{2} b_{n} {∥ υ ∥}^{2} = C γ_{n}^{2} (\sqrt{s} + b_{n} C),

(6.10)

where s is the number of nonzero elements in β₀.

Applying the second order Taylor expansion to n⁻¹ D_n1 gives

n^{- 1} D_{n 1} = γ_{n} υ^{T} J_{1 n} - \frac{1}{2} γ_{n}^{2} υ^{T} J_{2 n} υ + o_{p} (n^{- 1})

(6.11)

with $J_{1 n} = P_{n} \{ U - s_{n}^{- 1} [β_{0}, \hat{η}] (t) s_{n} [U; β_{0}, \hat{η}] (t) \}$ and $J_{2 n} = P_{n} \{ s_{n}^{- 2} [β_{0}, \hat{η}] (t) [ s_{n} [{UU}^{T}; β_{0}, \hat{η}] (t) s_{n} [β_{0}, \hat{η}] (t) - s_{n} [U; β_{0}, \hat{η}] (t) s_{n} [U; β_{0}, \hat{η}] {(t)}^{T} ] \}$ , where U(u, w) ≡ u.

Let ${\bar{U}}_{n} = \sum_{i = 1}^{n} U_{i} / n$ . Note that $s_{n}^{- 1} [β_{0}, \hat{η}] (t) s_{n} [{\bar{U}}_{n}; β_{0}, \hat{η}] (t) = {\bar{U}}_{n}$ . We have

\begin{array}{lcl} J_{1 n} & = & P_{n} \left\{ U - {\bar{U}}_{n} - s_{n}^{- 1} [β_{0}, \hat{η}] (t) s_{n} [U - {\bar{U}}_{n}; β_{0}, \hat{η}] (t) \right\} \\ & \equiv & I_{1 n} + I_{2 n} + I_{3 n}, \end{array}

(6.12)

where

\begin{array}{lcl} I_{1 n} & = & (P_{n} - P) \left\{ U - {\bar{U}}_{n} - s^{- 1} [β_{0}, \hat{η}] (t) s [U - {\bar{U}}_{n}; β_{0}, \hat{η}] (t) \right\}, \\ I_{2 n} & = & P_{n} \left\{ s^{- 1} [β_{0}, \hat{η}] (t) s [U; β_{0}, \hat{η}] (t) - s_{n}^{- 1} [β_{0}, \hat{η}] (t) s_{n} [U; β_{0}, \hat{η}] (t) \right\}, \\ I_{3 n} & = & P \left\{ U - {\bar{U}}_{n} - s^{- 1} [β_{0}, \hat{η}] (t) s [U - {\bar{U}}_{n}; β_{0}, \hat{η}] (t) \right\} . \end{array}

Lemma 3.4.2 of [27] and Lemma 6.5 indicate that (P_n − P){s⁻¹[β₀, η̂](t) s[U − Ū_n; β₀, η̂](t)} = o_p(n^{−1/2}), where the fact that q_n = o(n^{1/2}) is used again. Also (P_n − P){U − Ū_n} = O_p(n^{−1/2}) by the LLN. Hence we have I_1n = O_p(n^{−1/2}). Lemma 6.6 gives I_2n = O_p(n^{−1/2}). Lastly, by the boundedness assumption, I_3n = P{s⁻¹[β₀, η̂](t) s[Ū_n − U; β₀, η̂](t)} = O_p(E|Ū_n − U|) = O_p(n^{−1/2}). Hence J_1n = O_p(n^{−1/2}). Also, J_2n converges to V(U) > 0. Thus, when C is sufficiently large, the second term in (6.11) dominates both terms in (6.10), and Theorem 2.2(i) follows.

Next, we show the sparsity of β̂. Since ℒ_P is maximized, it suffices to show that for any given β₁ satisfying ||β₁ − β₁₀|| = O_p(n^{−1/2}) and any j = s + 1, … , d, ∂ℒ_P(β)/∂β_j < 0 for 0 < β_j < Cn^{−1/2} and ∂ℒ_P(β)/∂β_j > 0 for −Cn^{−1/2} < β_j < 0. For β_j ≠ 0 and j = s + 1, … , d,

n^{- 1} \partial ℒ_{P} (β) / \partial β_{j} = P_{n} \left\{ U_{j} - s_{n}^{- 1} [β, \hat{η}] (t) s_{n} [U_{j}; β, \hat{η}] (t) \right\} - p_{θ_{j}}^{'} (| β_{j} |) sgn (β_{j}) .

Similar to bounding J_1n, the first term can be shown to be O_p(n^{−1/2}). Recall that $θ_{j}^{- 1} = o (n^{1/2})$ and $\liminf_{n \to \infty} \liminf_{u \to 0 +} θ_{j}^{- 1} p_{θ_{j}}^{'} (u) > 0$ . Hence the penalty term dominates and the sign of ∂ℒ_P(β)/∂β_j is completely determined by that of β_j. Then β̂₂ = 0.
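The condition on θ_j⁻¹ p′_{θ_j}(u) near zero is easy to check for the SCAD penalty. The sketch below uses the SCAD derivative as given in Fan and Li [7], with their suggested constant a = 3.7; the function name `scad_derivative` is illustrative and not taken from any particular library.

```python
def scad_derivative(u: float, theta: float, a: float = 3.7) -> float:
    """Derivative p'_theta(u) of the SCAD penalty for u >= 0 (Fan and Li form)."""
    if u < 0:
        raise ValueError("u must be nonnegative; apply the function to |beta_j|")
    if theta <= 0 or a <= 2:
        raise ValueError("SCAD requires theta > 0 and a > 2")
    if u <= theta:
        # Near zero the penalty acts like an L1 penalty with slope theta.
        return theta
    # Beyond theta the slope decays linearly and reaches zero at u = a * theta.
    return max(a * theta - u, 0.0) / (a - 1.0)


if __name__ == "__main__":
    theta = 0.1
    for u in (1e-12, 0.05, 0.2, 0.5):
        print(u, scad_derivative(u, theta) / theta)
```

For any fixed θ the ratio p′_θ(u)/θ equals 1 for all u in (0, θ], so the lim inf in the condition is 1 > 0, while the derivative vanishes for u ≥ aθ, which is what leaves large coefficients unpenalized.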

Lastly, we show the asymptotic normality of β̂₁ using the result in [21]. Let z_i = (X_i, Δ_i, Z_i). Note that β̂₁ is the solution of the estimating equation:

\sum_{i = 1}^{n} M (z_{i}, β_{1}, \hat{η}) - n ζ_{1} = 0,

(6.13)

where $M (z, β_{1}, η) = \int \{ U_{1} - s_{n}^{- 1} [β_{1}, η] (t) s_{n} [U_{1}; β_{1}, η] (t) \} d N (t)$ and $ζ_{1} = {(p_{θ_{1}}^{'} (| β_{1} |) sgn (β_{1}), \dots, p_{θ_{s}}^{'} (| β_{s} |) sgn (β_{s}))}^{T}$ . Let

D (z, h) = \int \left\{ \frac{s_{n} [U_{1} h; β_{10}, η_{0}] (t)}{s_{n} [β_{10}, η_{0}] (t)} - \frac{s_{n} [U_{1}; β_{10}, η_{0}] (t)}{s_{n} [β_{10}, η_{0}] (t)} \frac{s_{n} [h; β_{10}, η_{0}] (t)}{s_{n} [β_{10}, η_{0}] (t)} \right\} d N (t)

be the Fréchet derivative of M(z, β₁₀, η) at η₀. Since the convergence rate of η̂ is n^{−r/[2(r+1)]} = o(n^{−1/4}), the linearization assumption (Assumption 5.1) in [21] is satisfied. A derivation similar to bounding (6.12) can verify the stochastic assumption (Assumption 5.2) in [21]. Direct calculation yields E[D(z, η − η₀)] = 0 for η close to η₀. Then the mean-square continuity assumption (Assumption 5.3) in [21] also holds with α(z) ≡ 0. By Lemma 5.1 in [21], β̂₁ thus has the same distribution as the solution to the equation

\sum_{i = 1}^{n} M (z_{i}, β_{1}, η_{0}) - n ζ_{1} = 0 .

A straightforward simplification yields the result.

Acknowledgments

We would like to thank the Associate Editor and two referees for their insightful comments that have improved the article.

Contributor Information

Pang Du, Department of Statistics, Virginia Tech, Blacksburg, VA 24061, USA, pangdu@vt.edu.

Shuangge Ma, Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520, USA, shuangge.ma@yale.edu.

Hua Liang, Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642, USA, hliang@bst.rochester.edu.

References

1.Andersen PK, Gill RD. Cox’s regression model for counting processes: A large sample study. Ann Statist. 1982;10:1100–1120. [Google Scholar]
2.Breiman Leo. Heuristics of instability and stabilization in model selection. Ann Statist. 1996;24(6):2350–2383. [Google Scholar]
3.Cai J, Fan J, Jiang J, Zhou H. Partially linear hazard regression for multivariate survival data. J Amer Statist Assoc. 2007;102(478):538–551. [Google Scholar]
4.Cai J, Fan J, Li R, Zhou H. Variable selection for multivariate failure time data. Biometrika. 2005;92:303–316. doi: 10.1093/biomet/92.2.303. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression (with discussion) Ann Statist. 2004;32(2):407–499. [Google Scholar]
6.Fan J, Gijbels I, King M. Local likelihood and local partial likelihood in hazard regression. Ann Statist. 1997;25(4):1661–1690. [Google Scholar]
7. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96(456):1348–1360.
8. Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. Ann Statist. 2002;30(1):74–99.
9. Gu C. Smoothing Spline ANOVA Models. Springer-Verlag; New York: 2002.
10. Huang J. Efficient estimation of the partly linear additive Cox model. Ann Statist. 1999;27(5):1536–1563.
11. Huang JZ, Kooperberg C, Stone CJ, Truong YK. Functional ANOVA modeling for proportional hazards regression. Ann Statist. 2000;28(4):961–999.
12. Huang JZ, Liu L. Polynomial spline estimation and inference of proportional hazards regression models with flexible relative risk form. Biometrics. 2006;62:793–802. doi: 10.1111/j.1541-0420.2005.00519.x.
13. Johnson BA. Variable selection in semiparametric linear regression with censored data. J Roy Statist Soc Ser B. 2008;70(2):351–370.
14. Johnson BA, Lin DY, Zeng D. Penalized estimating functions and variable selection in semiparametric regression models. J Amer Statist Assoc. 2008;103(482):672–680. doi: 10.1198/016214508000000184.
15. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. Wiley; New York: 2002.
16. Kim Y-J, Gu C. Smoothing spline Gaussian regression: More scalable computation via efficient approximation. J Roy Statist Soc Ser B. 2004;66:337–356.
17. Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. Springer-Verlag Inc; 1997.
18. Leng C, Zhang HH. Model selection in nonparametric hazard regression. Journal of Nonparametric Statistics. 2006;18(7-8):417–429.
19. Lin Y, Zhang HH. Component selection and smoothing in smoothing spline analysis of variance models. Ann Statist. 2006;34(5):2272–2297.
20. Murphy SA, van der Vaart AW. On profile likelihood (with discussion). J Amer Statist Assoc. 2000;95(450):449–465.
21. Newey WK. The asymptotic variance of semiparametric estimators. Econometrica. 1994;62:1349–1382.
22. Osborne MR, Presnell B, Turlach BA. On the LASSO and its dual. J Comput Graph Statist. 2000;9(2):319–337.
23. O’Sullivan F. Nonparametric estimation in the Cox model. Ann Statist. 1993;21:124–145.
24. Park T, Casella G. The Bayesian Lasso. J Amer Statist Assoc. 2008;103(482):681–686.
25. Tibshirani RJ. Regression shrinkage and selection via the lasso. J Roy Statist Soc Ser B. 1996;58:267–288.
26. Tibshirani R. The lasso method for variable selection in the Cox model. Statist in Med. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
27. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Springer-Verlag Inc; 1996.
28. Wahba G. Spline Models for Observational Data. Volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM; Philadelphia: 1990.
29. Weinberger HF. Variational Methods for Eigenvalue Approximation. Volume 15 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM; Philadelphia: 1974.
30. Yin G, Li H, Zeng D. Partially linear additive hazards regression with varying coefficients. J Amer Statist Assoc. 2008;103(483):1200–1213.
31. Zou H. The adaptive LASSO and its oracle properties. J Amer Statist Assoc. 2006;101(476):1418–1429.
32. Zou H. A note on path-based variable selection in the penalized proportional hazards model. Biometrika. 2008;95(1):241–247.
33. Zou H, Hastie T, Tibshirani R. On the “degrees of freedom” of the LASSO. Ann Statist. 2007;35(5):2173–2192.
34. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Ann Statist. 2008;36(4):1509–1533. doi: 10.1214/009053607000000802.
35. Zucker DM, Karr AF. Nonparametric survival analysis with time-dependent covariate effects: A penalized partial likelihood approach. Ann Statist. 1990;18:329–353.
