Feature screening in ultrahigh-dimensional varying-coefficient Cox model

Guangren Yang; Ling Zhang; Runze Li; Yuan Huang

doi:10.1016/j.jmva.2018.12.009

. Author manuscript; available in PMC: 2020 May 1.

Published in final edited form as: J Multivar Anal. 2018 Dec 28;171:284–297. doi: 10.1016/j.jmva.2018.12.009

Feature screening in ultrahigh-dimensional varying-coefficient Cox model

Guangren Yang ^a,^*, Ling Zhang ^b, Runze Li ^c, Yuan Huang ^d

PMCID: PMC6924954 NIHMSID: NIHMS1517567 PMID: 31866699

Abstract

The varying-coefficient Cox model is flexible and useful for modeling the dynamic changes of regression coefficients in survival analysis. In this paper, we study feature screening for varying-coefficient Cox models in ultrahigh-dimensional covariates. The proposed screening procedure is based on the joint partial likelihood of all predictors, thus different from marginal screening procedures available in the literature. In order to carry out the new procedure, we propose an effective algorithm and establish its ascent property. We further prove that the proposed procedure possesses the sure screening property. That is, with probability tending to 1, the selected variable set includes the actual active predictors. We conducted simulations to evaluate the finite-sample performance of the proposed procedure and compared it with marginal screening procedures. A genomic data set is used for illustration purposes.

Keywords: Cox model, Partial likelihood, Penalized likelihood, Ultrahigh-dimensional survival data

1. Introduction

Feature screening can effectively reduce ultrahigh dimensionality and therefore has attracted considerable attention in the recent literature. Fan and Lv [12] proposed a marginal screening procedure for ultrahigh-dimensional Gaussian linear model, and further showed that marginal screening procedures may possess a sure screening property under certain conditions. Feature screening procedures for varying-coefficient models (VCM) with ultrahigh-dimensional covariates have been proposed in the literature. Liu et al. [20] developed a sure independence screening (SIS) procedure for ultrahigh-dimensional VCM by taking conditional Pearson correlation coefficients as a marginal utility for ranking the importance of predictors. Fan et al. [13] proposed an SIS procedure for ultrahigh-dimensional VCM by extending B-spline techniques in Fan et al. [10] for additive models. Xia et al. [26] further extended the SIS procedure proposed in [13] to generalized varying-coefficient models (GVCM). Cheng et al. [5] proposed a forward variable selection procedure for ultrahigh-dimensional VCM based on techniques related to B-splines regression and grouped variable selection. Song et al. [22] extended the procedure in [13] to longitudinal data without taking into account within-subject correlation, while Chu et al. [6] proposed an SIS procedure for longitudinal data based on a weighted residual sum of squares to use within-subjection correlation to improve accuracy of feature screening. Kong et al. [17] proposed a new screening method that leaves a variable in the active set if it has, jointly with some other variables, a high canonical correlation with the response.

Survival analysis has been widely used in medical science, economics, finance, and social science, among others. In many studies, survival data have primary outcomes or responses that are subject to censoring. The Cox model [7, 8] is the most commonly used regression model for survival data, and the partial likelihood method has become a standard approach to parameter estimation and statistical inference. Recently, variable selection and parameter estimation in Cox regression models have been considered by various authors (see, e.g., [4, 9, 14, 18, 19, 30]). Huang et al. [15] studied the penalized partial likelihood with the ℓ₁-penalty for the Cox model with high-dimensional covariates. Yan and Huang [29] proposed the adaptive group Lasso in a Cox regression model with time-varying coefficients. However, they have not considered varying-coefficient models.

In this paper, we propose a new feature screening procedure for ultrahigh-dimensional varying-coefficient Cox models. It is distinguished from SIS procedures [11, 32] in that the proposed procedure is based on the joint partial likelihood of potentially important features, rather than the marginal partial likelihood of individual features. Xu and Chen [27] proposed a joint screening procedure and showed its advantage over SIS procedures in the context of generalized linear models. Yang et al. [28] extended the procedures in [27] to the Cox models. This work further extends the joint screening strategy and develops a feature screening procedure for varying-coefficient Cox models, which are natural extensions of Cox models and can be useful to explore nonlinear interaction effects between a primary covariate and other covariates.

The asymptotic properties of the proposed procedure are studied systematically. It is technically challenging to establish its sure screening property. The techniques used in [28] and other works related to SIS procedures cannot be applied for the present setting. We first develop Hoeffding’s inequality for a sequence of martingale differences and then establish a concentration inequality for the score function of a partial likelihood. Based on the concentration inequality, we prove the screening property for our proposed sure joint screening procedure. We also conduct simulation studies to assess the finite-sample performance of the proposed procedure and compare its performance with existing sure screening procedures for ultrahigh-dimensional survival data. The proposed methodology is demonstrated through an empirical analysis of a genomic data set.

The rest of this paper is organized as follows. In Section 2, we propose a new feature screening procedure for the varying-coefficient Cox model, develop an algorithm to carry it out, and demonstrate the ascent property of the proposed algorithm. We study the sampling property of the proposed procedure and establish its sure screening property. In Section 3, we present numerical comparisons and an empirical analysis of a real data set. Discussion is in Section 4. Technical proofs are in the Appendix.

2. New feature screening procedure for varying-coefficient Cox model

Let T be the survival time and x and U be p-dimensional covariate vector and univariate covariate, respectively. Throughout this paper, we consider the varying-coefficient Cox proportional hazard model given by

h (t | x, U) = h_{0} (t) + \exp {x^{⊤} α (U)},

(1)

where h₀(t) is an unspecified baseline hazard function and α(U) = (α₁(U), …, α_p(U))^T consists of the unknown nonparametric coefficient functions. It is assumed that the support of U is finite and denoted by [a, b]. In survival data analysis, survival times are subject to a censoring time C. Denote the observed time by Z = min(T, C) and the event indicator by δ = 1(T ≤ C). It is assumed throughout this paper that the censoring mechanism is noninformative. That is, given x and U, T and C are conditionally independent.

Suppose that (x₁, U₁, Z₁, δ₁), …, (x_n, U_n, Z_n, δ_n) is a random sample from model (1). Let $t_{1}^{0} < \dots < t_{N}^{0}$ be the ordered observed failure times. Let ( j) be the label for the subject failing at time $t_{j}^{0}$ , so that the covariates associated with the N failures are x₍₁₎, …, x_(N) and U₍₁₎, …, U_(N). Denote the risk set right before time $t_{j}^{0}$ by $R_{j} = {i : Z_{i} \leq t_{j}^{0}}$ . The partial likelihood function [8] of the random sample is

ℓ_{p} {α (U)} = \sum_{j = 1}^{N} [x_{(j)}^{⊤} α (U_{(j)}) - \ln [\sum_{i \in R_{j}} \exp {x_{i}^{⊤} α (U_{i})}]] .

(2)

To estimate the nonparametric regression, we use a B-spline basis. Let S_n be the space of polynomial splines of degree ℓ ≥ 1 and ${ψ_{j 1}, \dots, ψ_{j d_{n j}}}$ denote a normalized B-spline basis with ${‖ ψ_{j k} ‖}_{\infty} \leq 1$ and d_n j = O(n^1/5), where ${‖ \cdot ‖}_{\infty}$ is the supremum norm. For any j ∈ {1, …, p} and $α_{n j} \in S_{n}$ , we have

α_{n j} (U) = \sum_{k = 1}^{d_{n j}} β_{j k} ψ_{j k} (U) = β_{j}^{⊤} ψ_{j} (U)

(3)

for some coefficients $β_{j 1}, \dots, β_{j d_{n j}}$ . Here we allow d_{n j} to increase with n and differ for different j because different coefficient functions may have different smoothness. Under some conditions, the nonparametric coefficient functions $α_{1} (U), \dots, α_{p} (U)$ can be well approximated by functions in $S_{n}$ .

Denote $β = {(β_{1}^{⊤}, \dots, β_{p}^{⊤})}^{⊤}$ and $z_{i} {(x_{i 1} ψ_{1} {(U_{i})}^{⊤}, \dots, x_{i p} ψ_{p} {(U_{i})}^{⊤})}^{⊤}$ , and define z_(j) similarly to x_(j). Substituting (3) into (2), the maximum partial likelihood estimate of (2) is to maximize

ℓ_{p} (β) \hat{=} \sum_{j = 1}^{N} [z_{(j)}^{⊤} β - \ln {\sum_{i \in R_{j}} \exp (z_{i}^{⊤} β)}]

(4)

with respect to β. We next propose a feature screening procedure based on (4).

2.1. A new feature screening procedure

Denote ${‖ α_{j} (\cdot) ‖}_{2} = {E α_{j}^{2} (U)}^{1 / 2}$ , the L₂-norm of α _j(·). For ease of presentation, denote s as an arbitrary subset of {1, …, p}, x_s = {x _j : j ∈ s} and α_s(U) = {α _j(U) : j ∈ s}. For a set s, τ(s) stands for the cardinality of s. Suppose the effect of x is sparse and the true value of α(U) is α∗(U), where β∗ is the corresponding coefficients of α∗(U). Denote $s^{*} = {j : {‖ α_{j} (\cdot) ‖}_{2} > 0}$ . By sparsity, we mean that τ(s∗) is much less than p. The goal of feature screening is to identify a subset s such that s∗ ⊂ s with overwhelming probability and τ(s) is also much less than p. According to (4), we propose screening features for the varying-coefficient Cox model by the constrained partial likelihood

{\hat{β}}_{m} = \underset{β}{\arg \max} ℓ_{p} (β) subject to τ ({j : {‖ β_{j} ‖}_{2} > 0}) \leq m

(5)

for a pre-specified m, which is assumed to be greater than the number of nonzero elements of β*.

For high-dimensional problems, it becomes almost impossible to solve the constrained maximization problem (5) directly. Alternatively, we consider a proxy of the partial likelihood function. It follows by the Taylor expansion for the partial likelihood function ℓ_p(γ) at β lying within a neighborhood of γ that

ℓ_{p} (γ) \approx ℓ_{p} (β) + {(γ - β)}^{⊤} ℓ_{p}^{'} (β) + {(γ - β)}^{⊤} ℓ_{p}^{″} (β) (γ - β) / 2,

where $ℓ_{p}^{'} (β) = \partial ℓ_{p} (γ) / \partial γ |_{γ = β}$ and $ℓ_{p}^{″} (β) = \partial^{2} ℓ_{p} (γ) / \partial γ \partial γ^{⊤} |_{γ = β}$ . Denote $P_{t} = d_{n 1} + \dots + d_{n p}$ . If $ℓ_{p}^{″} (β)$ is invertible, the computational complexity of calculating the inverse of $ℓ_{p}^{″} (β)$ is $O (P_{t}^{3})$ . For large P_t, small n problems (i.e., P_t ≫ n), $ℓ_{p}^{″} (β)$ becomes not invertible. Low computational cost is always desirable for feature screening. To deal with singularity of the Hessian matrix and save computational cost, we propose using the approximation

h (γ | β) = ℓ_{p} (β) + {(γ - β)}^{⊤} ℓ_{p}^{'} (β) - u {(γ - β)}^{⊤} W (β) (γ - β) / 2

(6)

for $ℓ_{p}^{″} (γ)$ , where u is a scaling constant to be specified and W(β) = diag{W₁(β), …, W_p(β)}, a block diagonal matrix with W_j(β) being a d_nj × d_nj matrix. Here (6) is the minimization of the original objective function, $h (γ | β) \leq ℓ_{p} (β)$ , for all γ under some conditions. Due to the properties of the majorization and minorization algorithm, using (6) we can obtain the same estimates as the original objective function. The two functions themselves, however, are not numerically equal. Here we allow W(β) to depend on β. This implies that we approximate $ℓ_{p}^{″} (β)$ by −uW(β). Throughout this paper, we will use $W_{j} (β) = - \partial^{2} ℓ_{p} (β) / \partial β_{j} \partial β_{j}^{⊤}$ .

It can be seen that $h (β | β) = ℓ_{p} (β)$ , and, under some conditions, $h (γ | β) \leq ℓ_{p} (β)$ for all γ. This ensures the ascent property. See Theorem 1 below for more details. Since W(β) is a block diagonal matrix, h(γ|β) is an additive function of γ _j for any given β. The additivity enables us to have a closed form solution for the maximization problem

\max_{γ} h (γ | β) subject to τ ({j : {‖ γ_{j} ‖}_{2} > 0}) \leq m

(7)

for given β and m. Define ${\tilde{γ}}_{j} = β_{j} + u^{- 1} W_{j}^{- 1} (β_{j}) \partial ℓ_{p} (β) / \partial β_{j}$ for j ∈ {1, …, p}, and $\tilde{γ} = {({\tilde{γ}}_{1}^{⊤}, \dots, {\tilde{γ}}_{p}^{⊤})}^{⊤} = β + u^{- 1} W^{- 1} (β) ℓ_{p}^{'} (β)$ is the maximizer of h(γ|β). Denote $g_{j} = {\tilde{γ}}_{j}^{⊤} W_{j} (β_{j}) {\tilde{γ}}_{j}$ for j ∈ {1, …, p}, and sort g_j so that $g_{(1)} \geq \dots \geq g_{(p)}$ . The solution of the maximization problem (7) is the hard-thresholding rule defined below:

{\hat{γ}}_{j} = {\hat{γ}}_{j} 1 {g_{j} > g_{(m + 1)}} .

This enables us to effectively screen features by using the following algorithm.

\begin{array}{l} Feature Screening Algorithm of Varying Coefficient Cox' s Models \\ \bar{\begin{array}{l} Step 1 . Set the initial value β_{1}^{(0)} = \dots = β_{p}^{(0)} = 0 . \\ Step 2 . For t \in {0, 1, \dots}, iteratively conduct Step 2a and Step 2b below \\ until the algorithm converges : \\ Step 2 a . Calculate {\tilde{γ}}_{j}^{(t)} = β_{j}^{(t)} + u_{t}^{- 1} W_{j}^{- 1} (β_{j}) \partial ℓ (β^{(t)}) / \partial β_{j}, and g_{j}^{(t)} = {{\tilde{γ}}_{j}^{(t)}}^{⊤} W_{j} (β^{(t)}) {\tilde{γ}}_{j}^{(t)} . \end{array}} \\ Let g_{(1)}^{(t)} \geq \dots \geq g_{(p)}^{(t)}, the order statistics o f g_{j}^{(t)} s . Set S_{t} = {j : g_{j}^{(t)} \geq g_{(m + 1)}^{(t)}}, \\ the nonzero index set . \\ Step 2 b . Update β by β^{(t + 1)} = {(β_{1}^{(t + 1)}, \dots, β_{p}^{(t + 1)})}^{⊤} as follows . If j \notin S_{t}, set β_{j}^{(t + 1)} = 0, \\ \underline{otherwise, set {β_{j}^{(t + 1)} : j \in S_{t}} be the partial likelihood estimate of the submodel S_{t} .} \end{array}

Theorem 1. Suppose that Conditions (D1)–(D4) in the Appendix hold. Let β^(t) be the sequence defined in Step 2b in the above algorithm. Denote

ρ^{(t)} = \sup_{β} [λ_{\max} {W^{- 1 / 2} (β^{(t)}) {- ℓ_{p}^{″} (β)} W^{- 1 / 2} (β^{(t)})}],

where λ_max(A) stands for the maximal eigenvalue of a matrix A. If u_t ≥ ρ^(t), then $ℓ_{p} (β^{(t + 1)}) \geq ℓ_{p} (β^{(t)})$ , where β^(t+1) is defined in Step 2b in the above algorithm.

Theorem 1 claims the ascent property of the proposed algorithm if u_t is appropriately chosen. That is, the proposed algorithm may improve the current estimate within the feasible region (i.e., $τ ({j : {‖ α_{j} (U) ‖}_{2} > 0}) \leq m$ ), and the resulting estimate in the current step may serve as a refinement of the last step. This theorem provides us with some insights into how to choose u_t in practical implementation.

2.2. Sure screening property

For a subset s of {1, …, p} with size τ(s), recall the notation x_s = {x_j : j ∈ s} and associated coefficients α_s(U) = {α _j(U) : j ∈ s} corresponding to β_s = {β _j : j ∈ s} with $β_{j} = {(β_{j 1}, \dots β_{j d_{n_{j}}})}^{⊤}$ . We denote the true model by $s^{*} = {j : E α_{j}^{2} (U) > 0, 1 \leq j \leq p}$ with τ(s∗) = q. The objective of feature screening is to obtain a subset ŝ such that s∗ ⊂ ŝ with very high probability.

We now provide some theoretical justifications for the proposed screening procedure for the ultrahigh-dimensional varying-coefficient Cox model. The sure screening property [12] is referred to as

\lim_{n \to \infty} \Pr (s^{*} \subset \hat{s}) = 1.

(8)

To establish this sure screening property for the proposed screening procedure, we introduce some additional notation as follows. For any model s, let $ℓ^{'} (β_{s}) = \partial ℓ (β_{s}) / \partial β_{s}$ and $ℓ^{″} (β_{s}) = \partial^{2} ℓ (β_{s}) / \partial β_{s} \partial β_{s}^{⊤}$ be the score function and the Hessian matrix of ℓ as a function of β_s, respectively. Assume that a screening procedure retains m out of p features such that τ(s∗) = q < m. So, we define

S_{+}^{m} = {s : s^{*} \subset s, {‖ s ‖}_{0} \leq m} and S_{-}^{m} = {s : s^{*} ⊄ s, {‖ s ‖}_{0} \leq m}

as the collections of the over-fitted models and the under-fitted models, respectively. We investigate the asymptotic properties of ${\hat{β}}_{m}$ under the scenario where p, q, m and β* are allowed to depend on the sample size n. We impose the following conditions, some of which are purely technical and merely serve to facilitate theoretical understanding of the proposed feature screening procedure. For ease of presentation and without loss of generality, it is assumed that $d_{n 1} = \dots = d_{n p} \hat{=} d_{n}$ .

(C1) The support of U is bounded on [a, b].

(C2) The functions α₁(U), …, α_p(U) belong to a class of functions $F$ , whose rth derivative $α_{j}^{(r)}$ exists and is Lipschitz of order η,

F = {α_{j} : | α_{j}^{(r)} (s) - α_{j}^{(r)} (t) | \leq K {| s - t |}^{η} for s, t \in [a, b]},

for some positive constant K, where r is a nonnegative integer and η ∈ (0, 1] such that ν = r + η > 0.5.

(C3) There exist w₁, w₂ > 0 and some non-negative constants τ₁, τ₂ such that τ₁ + τ₂ < ½ and

\min_{j \in s^{*}} {‖ α_{j} (U) ‖}_{2} \geq w_{1} n^{- τ_{1}} and q < m \leq w_{2} n^{- τ_{2}} .

(C4) ln p = O(n^κ) for some 0 ≤ κ < 1 − 2(τ₁ + τ₂).

(C5) There exist constants C₁, C₂ > 0, δ > 0, such that for sufficiently large n,

C_{1} d_{n}^{- 1} \leq λ_{\min} {- n^{- 1} ℓ_{p}^{″} (β_{s})} \leq λ_{\max} {- n^{- 1} ℓ_{p}^{″} (β_{s})} \leq C_{2} d_{n}^{- 1},

for $β_{s} \in {β : {‖ β_{s} - β_{s}^{*} ‖}_{2} \leq δ}$ and $s \in S_{+}^{2 m}$ , where λ_min and λ_max denote the smallest and largest eigenvalues of a matrix, respectively.

Under Conditions (C1)–(C2), the following two properties of B-splines are valid.

(a) de Boor [3]: For k ∈ {1, …, d_n}, ψ _jk (U) ≥ 0 and $ψ_{j 1} (U) + \dots + ψ_{j d_{n}} (U) = 1$ , U ∈ [a, b]. In addition, there exist positive constants C₃ and C₄ such that $C_{3} d_{n}^{- 1} \leq E ψ_{j k}^{2} (U) \leq C_{4} d_{n}^{- 1}$ .

(b) Stone [23, 24]: If {α₁, …, α_p} is a set of functions in $F$ described in Condition (C2), there exists a positive constant C₅ that does not depend on α _j(U); then the uniform approximation error satisfies $ρ = \sup_{U \in [a, b]} {‖ α_{j} (U) - α_{n j} (U) ‖}_{2} \leq C_{5} d_{n}^{- v}$ for all j ∈ {1, …, p}, as d_n → ∞.

Conditions (C1)–(C2) ensure properties (a) and (b), which are required for the B-spline approximation and establishing the sure screening properties. Note that ${‖ α_{n j} (U) ‖}_{2}^{2} = β_{j}^{⊤} E{ψ_{j} (U) ψ_{j} {(U)}^{⊤}} β_{j}$ . Based on properties (a) and (b) and Condition (C3), we can derive that

\min_{j \in s^{*}} {‖ β_{j} ‖}_{2} \geq w_{1} d_{n} n^{- τ_{1}} .

Condition (C3) states a few requirements for establishing the sure screening property of the proposed procedure. The first one is the sparsity of α∗(U), which makes the sure screening possible with $τ (\hat{s}) = m > q$ . Also, it requires that the minimal component in α∗(U) does not degenerate too quckly, so that the signal is detectable in the asymptotic sequence. Meanwhile, together with (C4), it confines an appropriate order of m that guarantees the identifiability of s∗ over s for τ(s) ≤ m. Condition (C5) assumes that p diverges from n at up to an exponential rate; it implies that the number of covariates can be substantially larger than the sample size.

We establish the sure screening property of the quasi-likelihood estimation in the following theorem.

Theorem 2. Suppose that Conditions (C1)–(C5) and Conditions (D1)–(D7) in the Appendix hold. Let ŝ be the model obtained by Eq. (5) of size m. We have $\Pr (s^{*} \subset \hat{s}) \to 1$ as n → ∞.

The proof is given in the Appendix. The sure screening property is an appealing property of a screening procedure because it ensures that the true active predictors are retained in the model selected by the screening procedure. To be distinguished from the SIS procedure, the proposed procedure is referred to as a sure joint screening (SJS) procedure.

3. Numerical studies

In this section, we assess the finite-sample performance of the proposed procedure, compare it with existing procedures via simulation, and illustrate the proposed procedure by an empirical analysis of a genomic data set.

3.1. Simulation studies

The main purpose of our simulation studies is to assess the performance of the proposed procedure by comparing it with the SIS [11] and the SJS [28] procedures for the Cox model. The model sizes selected by the three methods are set to be the same for comparison. We vary the dimension of predictors p, sample size n and sample correlation ρ to examine their impact on the performance of the proposed procedure. We use the success rate of active predictors being selected and computing time as our criteria to compare the performance of screening procedures.

In our simulation, the predictors x are generated from a p-dimensional normal distribution with mean zero and covariance matrix $Σ = (σ_{i j})$ . Two commonly used covariance structures are used in our simulation:

(S1) Σ is compound symmetric, (i.e., σ_ij = ρ for i ≠ j and equal 1 for i = j). We choose ρ ∈ {0.25, 0.5, 0.75}.

(S2) Σ has autoregressive structure with AR(1), (i.e., σ_ij = ρ^{|i− j|}). We choose ρ ∈ {0.25, 0.5, 0.75}.

We generate the survival time from the Cox model with h₀(t) = 1 and the censoring time from a uniform distribution $U [0, 10]$ . Three different coefficient function settings α(u)s are considered:

(a1): $α_{1}^{(1)} (u) = 1 + 2 \sin (2 π u)$ , $α_{2}^{(1)} (u) = 1 - 2 \cos (2 π u)$ , $α_{3}^{(1)} (u) = 0.5 + 2 u^{2}$ ;

(a2): $α_{1}^{(2)} (u) = 5 \sin (2 π u)$ , $α_{2}^{(2)} (u) = 5 \cos (2 π u)$ , $α_{3}^{(2)} (u) = 2.5 + 5 u^{2}$ ;

(a3): $α_{1}^{(3)} (u) = e^{0.5 u}$ , $α_{2}^{(3)} (u) = 2 (u^{3} + 1.5 {(u - 0.5)}^{2})$ , $α_{3}^{(3)} (u) = 2 u$ .

We consider n ∈ {200, 400}, and p ∈ {2000, 5000}. For the feature screening model size, we follow Liu et al. [20] and set m = ⌊n^0.8/ ln(n^0.8)⌋, where ⌊a⌋ denotes the integer part of a. For each combination of different inputs, we conduct 1000 repetitions.

To illustrate the performance of a statistical procedure in survival data analysis, we want the censoring rates to lie within a reasonable range. Table 1 depicts the censoring rates for the 18 combinations of covariance structure, sample correlation ρ and the values of α(u). The censoring rates range from 22% to 37%, which is reasonable for simulation studies.

Table 1:

Censoring rates.

	ρ = 0.25			ρ = 0.5			ρ = 0.75

Σ	(a1)	(a2)	(a3)	(a1)	(a2)	(a3)	(a1)	(a2)	(a3)
S1	.276	.367	.223	.277	.356	.260	.277	.340	.248
S2	.275	.365	.265	.279	.358	.283	.278	.347	.245

Open in a new tab

We compare the performance of feature screening procedures using the following two criteria: P_s, the probability that an individual active predictor is selected, and P_a, the probability that all active predictors are selected. It is expected that the performance of the proposed varying-coefficient SJS (VSJS) procedure depends on the following factors: the structure of the covariance matrix, the values of α(u), the dimension of all candidate features p, the sample correlation ρ and the sample size n.

Tables 2–3 report P_s and P_a of VSJS, SIS and SJS for the active predictors under (S1). Overall, VSJS outperforms both SIS and SJS for all the three sets of α(u) in terms of Ps and Pa. For (a1), VSJS achieves a high success rate in detecting signals of $α_{1}^{(1)}$ and $α_{2}^{(1)}$ , while SIS and SJS fail from time to time.

Table 2:

Comparison between VSJS, SIS and SJS with Σ = (1 − ρ)I + ρ11^T (n = 200).

	VSJS					SIS					SJS

	P_s			P_a	Time	P_s			P_a	Time	P_s			P_a	Time

α(U)	X₁	X₂	X₃	all	(s)	X₁	X₂	X₃	all	(s)	X₁	X₂	X₃	all	(s)
	n = 200, p = 2000 and ρ = .25

α⁽¹⁾	.989	1	1	.989	74.5	.796	.747	.990	.580	9.5	.499	.419	.936	.190	3.6
α⁽²⁾	.999	.998	.999	.996	67.7	.016	.002	1	0	8.3	.018	.037	.999	.002	2.4
α⁽³⁾	1	.810	.993	.803	82.2	1	.771	.992	.763	6.0	1	.785	.996	.781	2.8

	n = 200, p = 2000 and ρ = .5

α⁽¹⁾	.970	.976	.915	.868	68.9	.621	.557	.968	.325	9.2	.392	.311	.863	.092	2.9
α⁽²⁾	.922	.922	.990	.848	66.8	.006	.003	1	0	7.8	.020	.052	.997	0	2.5
α⁽³⁾	.998	.617	.938	.581	74.8	.999	.611	.932	.573	5.3	1	.574	.932	.542	3.2

	n = 200, p = 2000 and ρ = .75

α⁽¹⁾	.628	.670	.682	.259	62.4	.357	.316	.879	.093	9.4	.247	.211	.701	.031	3.0
α⁽²⁾	.485	.535	.738	.204	67.3	.005	.001	1	0	6.8	.018	.059	.935	0	3.4
α⁽³⁾	.910	.361	.686	.247	62.5	.987	.341	.736	.250	5.3	.958	.286	.644	.181	3.4

	n = 200, p = 5000 and ρ = .25

α⁽¹⁾	1	1	.993	.993	464.0	.721	.649	.983	.456	15.4	.391	.326	.865	.097	32.9
α⁽²⁾	.996	.994	1	.990	416.3	.004	.004	1	0	18.1	.007	.016	.994	0	17.6
α⁽³⁾	1	.708	.984	.694	451.5	1	.684	.974	.667	15.2	1	.627	.980	.615	16.8

	n = 200, p = 5000 and ρ = .5

α⁽¹⁾	.925	.930	.845	.725	412.7	.496	.430	.954	.199	22.9	.281	.224	.779	.040	16.8
α⁽²⁾	.856	.876	.976	.740	423.7	.005	.002	1	0	16.1	.007	.030	.968	0	18.9
α⁽³⁾	.992	.508	.884	.446	390.4	.999	.455	.866	.38	15.2	.998	.435	.878	.383	24.0

	n = 200, p = 5000 and ρ = .75

α⁽¹⁾	.510	.501	.504	.121	398.1	.261	.218	.803	.042	15.3	.135	.140	.541	.010	20.3
α⁽²⁾	.372	.399	.625	.093	396.6	.002	0	.999	0	14.9	.006	.022	.867	0	22.2
α⁽³⁾	.892	.276	.597	.158	369.5	.977	.258	.624	.159	13.3	.909	.164	.493	.075	24.7

Open in a new tab

Table 3:

Comparison between VSJS, SIS and SJS with Σ = (1 − ρ)I + ρ11^T (n = 400).

	VSJS					SIS					SJS

	P_s			P_a	Time	P_s			P_a	Time	P_s			P_a	Time

α(U)	X₁	X₂	X₃	all	(s)	X₁	X₂	X₃	all	(s)	X₁	X₂	X₃	all	(s)
	n = 400, p = 2000 and ρ = .25

α⁽¹⁾	1	1	1	1	217.7	1	.960	1	.960	8.8	.859	.805	.999	.686	5.8
α⁽²⁾	1	1	1	1	205.9	.020	.001	1	0	7.9	.010	.076	1	0	5.6
α⁽³⁾	1	1	1	1	215.3	1	.974	1	.974	8.3	1	.997	1	.997	4.9

	n = 400, p = 2000 and ρ = .5

α⁽¹⁾	1	1	1	1	190.2	.900	.871	.999	.779	8.5	.736	.607	.998	.437	4.6
α⁽²⁾	1	1	1	1	184.3	.010	.001	1	0	8.5	.023	.133	1	.002	6.3
α⁽³⁾	1	.988	1	.988	199.5	1	.918	.997	.916	8.2	1	.944	1	.944	5.1

	n = 400, p = 2000 and ρ = .75

α⁽¹⁾	.984	.991	.976	.955	169.0	.655	.566	.997	.349	8.6	.474	.356	.955	.155	6.3
α⁽²⁾	.998	.995	1	.994	162.2	.001	0	1	0	9.5	.035	.193	.999	.004	6.6
α⁽³⁾	1	.733	.982	.719	162.8	1	.676	.968	.657	8.2	1	.576	.938	.540	6.1

	n = 400, p = 5000 and ρ = .25

α⁽¹⁾	1	1	1	1	1202	.963	.957	1	.920	21.6	.963	.957	1	.920	21.6
α⁽²⁾	1	1	1	1	1164	.006	.001	1	0	20.6	.004	.038	1	.001	31.1
α⁽³⁾	1	1	1	1	1180	1	.960	1	.960	18.2	1	.993	1	.993	36.5

	n = 400, p = 5000 and ρ = .5

α⁽¹⁾	1	1	1	1	1086	.849	.798	.999	.669	21.0	.849	.798	.999	.669	21.1
α⁽²⁾	1	1	1	1	1101	.001	0	1	0	22.2	.011	.071	1	.002	32.1
α⁽³⁾	1	.975	1	.975	1071	1	.840	.998	.838	19.6	1	.872	1	.872	40.3

	n = 400, p = 5000 and ρ = .75

α⁽¹⁾	1	1	.980	.980	929.0	.562	.426	.994	.224	21.0	.336	.267	.933	.073	35.9
α⁽²⁾	.994	.992	.997	.988	936.7	.001	0	1	0	20.8	.016	.109	1	.001	35.3
α⁽³⁾	.995	.621	.926	.586	909.6	1	.580	.935	.535	18.3	.999	.446	.900	.401	46.1

Open in a new tab

We next consider the performance of VSJS under (a2). For the zero-centered $α_{1}^{(2)}$ and $α_{2}^{(2)}$ , VSJS successfully detects their variation signal and achieves high success rates. As a comparison, SIS and SJS fail to identify $α_{1}^{(2)}$ and $α_{2}^{(2)}$ as active predictors completely in (a2). In general, VSJS still performs better to some extent in (a3), though SIS slightly outperforms VSJS in a few cases.

Tables 2–3 clearly show how performance is affected by sample correlation ρ, predictor dimension p, and sample size n. When ρ increases, n decreases, or p increases, the three methods perform worse under (S1). Compared to SIS and SJS, the performance of VSJS is more resistant to these changes. Also, Tables 2–3 suggest that VSJS is more computationally inefficient than SIS and SJS.

Tables 4–5 report P_s and P_a of VSJS, SIS, and SJS for the active predictors under (S2). Overall, VSJS still outperforms SIS and SJS. It is worth noting that the three methods have much better performance under (S2) than previous cases under (S1), especially when the correlation ρ is larger. In (a1), VSJS and SIS both perform perfectly and slightly better than SJS. When we consider (a2), SIS and SJS perform better under (S2) and successfully identify $α_{1}^{(2)}$ and $α_{2}^{(2)}$ from time to time. However, VSJS again outperforms them in (a2). For (a3), the three methods achieve almost 100% success rate for selecting active predictors. SJS misses some active predictors in a few cases.

Table 4:

Comparison between VSJS, SIS and SJS with Σ = (ρ^{|i− j|}) (n = 200).

	VSJS					SIS					SJS

	P_s			P_a	Time	P_s			P_a	Time	P_s			P_a	Time

α(U)	X₁	X₂	X₃	all	(s)	X₁	X₂	X₃	all	(s)	X₁	X₂	X₃	all	(s)
	n = 200, p = 2000 and ρ = .25

α⁽¹⁾	1	1	1	1	76.9	1	1	.988	.988	5.2	.856	.809	.997	.684	2.6
α⁽²⁾	1	1	1	1	70.8	.042	.116	1	.008	5.9	.047	.027	1	0	2.4
α⁽³⁾	1	1	1	1	86.6	1	1	1	1	7.1	1	.981	1	.981	3.0

	n = 200, p = 2000 and ρ = .5

α⁽¹⁾	1	1	1	1	73.1	1	1	.999	.999	8.2	.889	.792	.990	.690	2.5
α⁽²⁾	1	1	1	1	67.6	.166	.611	1	.145	5.8	.052	.065	1	.011	2.4
α⁽³⁾	1	1	1	1	82.8	1	1	1	1	7.7	1	.977	1	.977	3.1

	n = 200, p = 2000 and ρ = .75

α⁽¹⁾	1	1	1	1	75.5	1	1	1	1	5.2	.877	.768	.990	.642	3.0
α⁽²⁾	1	1	1	1	68.6	.722	.968	1	.720	5.8	.125	.417	.997	.076	2.6
α⁽³⁾	1	.997	1	.997	79.4	1	1	1	1	8.4	1	.926	.991	.917	3.1

	n = 200, p = 5000 and ρ = .25

α⁽¹⁾	1	1	1	1	456.4	.968	.997	1	.965	15.4	.785	.734	.989	.559	16.1
α⁽²⁾	1	1	1	1	463.8	.016	.067	1	.004	14.6	.016	.022	.999	0	14.9
α⁽³⁾	1	1	.998	.998	477.1	1	.999	1	.999	16.2	1	.967	1	.967	20.1

	n = 200, p = 5000 and ρ = .5

α⁽¹⁾	1	1	1	1	451.1	1	1	1	1	13.1	.799	.730	.979	.543	13.2
α⁽²⁾	1	1	1	1	439.9	.121	.501	1	.103	14.3	.030	.025	1	.003	16.0
α⁽³⁾	1	1	1	1	475.4	1	1	1	1	15.8	1	.966	.997	.963	20.3

	n = 200, p = 5000 and ρ = .75

α⁽¹⁾	1	1	1	1	448.2	1	1	1	1	15.4	.844	.685	.987	.538	19.0
α⁽²⁾	1	1	1	1	427.3	.627	.938	1	.626	14.8	.062	.327	1	.040	15.9
α⁽³⁾	1	.996	1	.996	453.9	1	1	1	1	14.4	1	.916	.980	.896	23.3

Open in a new tab

Table 5:

Comparison between VSJS, SIS and SJS with Σ = (ρ^{|i− j|}) (n = 400).

	VSJS					SIS					SJS

	_Ps			P_a	Time	P_s			P_a	Time	P_s			P_a	Time

α(U)	X₁	X₂	X₃	all	(s)	X₁	X₂	X₃	all	(s)	X₁	X₂	X₃	all	(s)
	n = 400, p = 2000 and ρ = .25

α⁽¹⁾	1	1	1	1	229.6	1	1	1	1	8.6	.991	.979	1	.970	6.3
α⁽²⁾	1	1	1	1	223.3	.083	.251	1	.036	8.5	.047	.040	1	.001	5.2
α⁽³⁾	1	1	1	1	240.1	1	1	1	1	11.9	1	1	1	1	7.0

	n = 400, p = 2000 and ρ = .5

α⁽¹⁾	1	1	1	1	225.9	1	1	1	1	7.5	.992	.959	1	.951	5.3
α⁽²⁾	1	1	1	1	226.1	.387	.922	1	.382	8.8	.070	.263	1	.031	5.2
α⁽³⁾	1	1	1	1	236.8	1	1	1	1	8.5	1	1	1	1	7.3

	n = 400, p = 2000 and ρ = .75

α⁽¹⁾	1	1	1	1	217.9	1	1	1	1	8.9	.979	.907	1	.886	6.4
α⁽²⁾	1	1	1	1	218.4	.969	1	1	.969	9.1	.139	.598	1	.080	5.8
α⁽³⁾	1	.999	1	.999	227.8	1	1	1	1	11.9	1	.997	1	.997	7.6

	n = 400, p = 5000 and ρ = .25

α⁽¹⁾	1	1	1	1	1264	1	1	1	1	20.6	.988	.962	1	.952	29.5
α⁽²⁾	1	1	1	1	1265	.054	.183	1	.018	18.7	.029	.032	1	0	28.8
α⁽³⁾	1	1	1	1	1215	1	1	1	1	20.8	1	1	1	1	33.8

	n = 400, p = 5000 and ρ = .5

α⁽¹⁾	1	1	1	1	1274	1	1	1	1	20.5	.976	.924	1	.900	32.5
α⁽²⁾	1	1	1	1	1256	.318	.884	1	.312	19.9	.038	.162	1	.017	29.1
α⁽³⁾	1	1	1	1	1194	1	1	1	1	20.6	1	.999	1	.999	35.6

	n = 400, p = 5000 and ρ = .75

α⁽¹⁾	1	1	1	1	1202	1	1	1	1	20.7	.969	.902	1	.871	36.9
α⁽²⁾	1	1	1	1	1225	.954	1	1	.954	21.9	.085	.548	1	.051	29.9
α⁽³⁾	1	1	1	1	1139	1	1	1	1	29.5	1	.995	1	.995	34.6

Open in a new tab

We can conclude from Tables 4 and 5 that SIS and SJS tend to perform better when ρ increases, n increases, or p decreases. For VSJS, it performs perfectly in all three settings under (S1). Similarly, Tables 4 and 5 show that VSJS is more computationally intensive than SIS and SJS.

3.2. Real data analysis

We analyze The Cancer Genome Atlas (TCGA; http://cancergenome.nih.gov/) data on liver hepatocellular carcinoma to illustrate the proposed procedure. Liver hepatocellular carcinoma is the most common form of liver cancer and the third cancer death cause worldwide. Zhang and Sun [31] studied 17,255 patients in the Surveillance, Epidemiology, and End Results Program (SEER; https://seer.cancer.gov/) cancer registry and suggested that age is a prognostic factor for liver cancer. Therefore, we consider age as the univariate covariate for coefficient functions, allowing the effects of gene expression on survival time to vary with age. After removing five subjects whose survival time is zero, we obtain 354 subjects with gene expressions (IlluminaHiSeq RNA-seq v2 platform), age at diagnosis, and survival months. We apply a log 2 transformation to gene expressions and analyze 14,683 genes that have more than 90% nonzero observations.

For VSJS, we use a linear combination of five B-spline basis functions to approximate the varying-coefficient functions. As a result, VSJS retains 23 = ⌊354^0.8/ ln(354^0.8)⌋ genes and the partial likelihood function value for the corresponding model is −544.9. With the same number of genes retained, the resulting partial likelihood function values for SIS and SJS are −589.2 and −588.4, respectively. Simultaneous modeling of the 23 retained genes shows a clear advantage of VSJS in terms of higher partial likelihood value.

To better understand the screening result of VSJS, we apply the backward selection procedure to those 23 genes and obtain a more parsimonious model. Specifically, each backward elimination step removes a gene with the smallest likelihood ratio test statistic until all the genes are significant at level 0.05. Table 6 provides the final list of 11 genes after applying the backward elimination, and Figure 1 depicts their varying coefficients.

Table 6:

Genes selected by backward elimination.

Gene Name	ANLN	CEP55	DYNC1LI1	GTPBP4
LRT Stat	15.869	14.137	18.171	22.658
p-value	0.00723	0.0148	0.00274	< 001

Gene Name	SLC2A1	KIF2C	KIF20A	KPNA2
LRT Stat	18.465	26.261	15.839	14.511
p-value	0.00241	< 0.001	0.00731	0.0127

Gene Name	LIMS2	TRIP13	UCK2
LRT Stat	23.093	17.517	14.671
p-value	< 0.001	0.00361	0.0119

Open in a new tab

Figure 1: — Estimated coefficient functions and the pointwise conference intervals of selected genes. The red line represents the average level of the varying-coefficient functions.

Our literature search reveals that those 11 genes are all associated with cancer risk and some genes. For example, GTPBP4 [21] and SLC2A2 [16] are promising prognostic factors for hepatocellular carcinoma. To test whether those 11 genes have varying coefficients versus constant coefficients, a test of $H_{0} : α_{j} (u) = α_{j}$ for some constant α _j versus $H_{1} : α_{j} (u) \equiv α_{j}$ can be conducted for each j in the selected gene set. The test result is shown in Table 7, and all the genes except DYNC1LI1 have significant varying-coefficient functions of age at the 5% level of significance. There is no evidence of their time-varying effects in the current medical literature, but our study may suggest some evidence for potential granular investigation on those genes.

Table 7:

LRT statistics and p-values for the varying coefficients of the final selected genes

Gene Name	ANLN	CEP55	DYNC1LI1	GTPBP4
LRT Stat	15.058	10.495	8.268	19.036
p-value	0.00458	0.0328	0.0822	0.000773

Gene Name	SLC2A1	KIF2C	KIF20A	KPNA2
LRT Stat	17.473	24.253	15.183	14.238
p-value	0.00156	0.000071	0.00433	0.00657

Gene Name	LIMS2	TRIP13	UCK2
LRT Stat	23.097	16.191	13.803
p-value	0.000121	0.00277	0.00795

Open in a new tab

4. Discussion

We have proposed an SJS procedure for the varying-coefficient Cox model with ultrahigh-dimensional covariates based on partial likelihood. The proposed SJS is distinguished from the existing SIS procedure in that the proposed procedure is based on the joint likelihood of potential candidate features. We also proposed an effective algorithm to carry out the feature screening procedures and show that the proposed algorithm possesses an ascent property. We studied the sampling property of SJS and established the sure screening property for SJS.

Theorem 1 ensures the ascent property of the proposed algorithm under certain conditions, but it does not imply that the proposed algorithm converges to the global optimizer. If the proposed algorithm converges to a global maximizer of (5), then Theorem 2 shows that such a solution enjoys the sure screen property.

Acknowledgments

Yang’s research was supported by the National Nature Science Foundation of China grants 11471086 and 11871173, the National Social Science Foundation of China grants 16BTJ032, the National Statistical Scientific Center 2015LD02 and the Fundamental Research Funds for the Central Universities of Jinan University Qimingxing Plan 15JNQM019. Zhang and Li’s research was supported by NIDA grant P50 DA039838, and NSF grant DMS 1820702, and grants from NNSFC 11690014 and 11690015. Huang’s research was supported by the NIH grant 5R01NS091161 and the CHDI Foundation grant A-13343. The content is solely the responsibility of the authors and does not necessarily represent the official views of the CHDI, the NIDA, the NIAID, the NIH, the NSF, or the NNSFC.

Appendix

We use the following notation to present the regularity conditions for the partial likelihood and the Cox model. Most notations are adapted from Andersen and Gill [1], in which counting processes were introduced for the Cox model and the consistency and asymptotic normality of the partial likelihood estimate were established. Denote ${\bar{N}}_{i} (t) = 1 (T_{i} \leq t, T_{i} \leq C_{i})$ and R_i(t) = {T_i ≥ t, C_i ≥ t}. Assume that there are no two component processes N_i(t) jumping at the same time. For simplicity, we work on the finite interval [0, τ].

In Cox’s model, properties of stochastic processes, such as being a local martingale or a predictable process, are relative to a right-continuous nondecreasing family ${F_{t} : t \in [0, τ]}$ of sub σ-algebras on a sample space $(Ω, F, P)$ ; $F_{t}$ represents everything that happens up to time t. Throughout this section, we define $Λ_{0} (t) = \int_{0}^{t} h_{0} (u) d u$

By stating that ${\bar{N}}_{i} (t)$ has intensity process $h_{i} (t) \hat{=} h (t | x_{i}, U_{i})$ , we mean that the processes M_i(t) defined, for each i ∈ {1, …, n}, by

M_{i} (t) = {\bar{N}}_{i} (t) - \int_{0}^{⊤} h_{i} (u) d u,

are local martingales on the time interval [0, τ]. For k ∈ {0, 1, 2}, define

S^{(k)} (β, t) = \frac{1}{n} \sum_{i = 1}^{n} R_{i} (t) \exp (z_{i}^{⊤} β) z_{i}^{\otimes k}, s^{(k)} (β, t) = E {S^{(k)} (β, t)}

and

E (β, t) = S^{(1)} (β, t) / S^{(0)} (β, t), V (β, t) = S^{(2)} (β, t) / S^{(0)} (β, t) - E {(β, t)}^{\otimes 2},

where $z_{i}^{\otimes 0} = 1$ , $z_{i}^{\otimes 1} = z_{i}$ and $z_{i}^{\otimes 2} = z_{i} z_{i}^{⊤}$ . Note that S⁽⁰⁾(β, t) is a scalar, S⁽¹⁾(β, t) and E(β, t) are p-vector, and S⁽²⁾(β, t) and V(β, t) are p × p matrices. Define

Q_{j} = \sum_{i = 1}^{n} \int_{0}^{t_{j}} {z_{i} - \sum_{i \in R_{j}} z_{i} \exp (z_{i}^{⊤} β) / \sum_{i \in R_{j}} \exp (z_{i}^{⊤} β)} d M_{i} .

Here, $E (Q_{j} | F_{j - 1}) = Q_{j - 1}$ , i.e., $E (Q_{j} - Q_{j - 1} | F_{j - 1}) = 0$ . Let b_j = Q_j − Q_j−1, then b₁, b₂, … is a sequence of bounded martingale differences on $(Ω, F, P)$ . That is, b_j is bounded almost surely and $E (b_{j} | F_{j - 1}) = 0$ as for j ∈ {1, 2, …}.

(D1) Finite interval: $Λ_{0} (τ) = \int_{0}^{τ} h_{0} (t) d t < \infty$ .

(D2) Asymptotic stability: There exists a neighborhood $B$ of β* and scalar, vector and matrix functions s⁽⁰⁾,s⁽¹⁾ and s⁽²⁾ defined on $B \times [0, τ]$ such that for k ∈ {0, 1, 2},

\sup_{t \in [0, τ], β \in B} ‖ S^{(k)} (β, t) - s^{(k)} (β, t) ‖ \overset{p}{\to} 0.

(D3) Lindeberg condition: There exists δ > 0 such that

n^{- 1 / 2} \sup_{i, t} | z_{i} | R_{i} (t) 1 {β_{0}^{⊤} z_{i} > - δ | z_{i} |} \overset{p}{\to} 0,

(D4) Asymptotic regularity conditions: Let $B$ , s⁽⁰⁾, s⁽¹⁾ and s⁽²⁾ be as in Condition (D2) and define e = s⁽¹⁾/s⁽⁰⁾ and v = s⁽²⁾/s⁽⁰⁾ − e^⊗2. For all $β \in B$ , t ∈ [0, τ],

s^{(1)} (β, t) = \partial s^{(0)} (β, t) / \partial β, s^{(2)} (β, t) = \partial^{2} s^{(0)} (β, t) / \partial β^{2},

s⁽⁰⁾(·, t), s⁽¹⁾(·, t) and s⁽²⁾(·, t) are continuous functions of $β \in B$ , uniformly in t ∈ [0, τ], s⁽⁰⁾, s⁽¹⁾ and s⁽²⁾ are bounded on $B \times [0, τ]$ ; s⁽⁰⁾ is bounded away from zero on $B \times [0, τ]$ , and the matrix

S = \int_{0}^{τ} v (β_{0}, t) s^{(0)} (β_{0}, t) h_{0} (t) d t

is positive definite.

(D5) The function S⁽⁰⁾(β*, t) and s⁽⁰⁾(β*, t) are bounded away from 0 on [0, τ].

(D6) There exist constants C₁, C₂ > 0, such that max_ij |z_ij| < C₁ and $\max_{i} | z_{i}^{⊤} β^{*} | < C_{2}$ .

(D7) b₁, b₂, … is a sequence of martingale differences and there exist nonnegative constants c₁, …, c_N such that for every real number t and all j ∈ {1, …, N}, $E {\exp (t b_{j}) | F_{j - 1}} \leq \exp (c_{j}^{2} t^{2} / 2)$ almost surely. For each j ∈ {1, …, N}, the minimum of those c_j is denoted by η(b_j) and |b_j| ≤ K_j as and E(b_j1, b_j2, …, b_jk ) = 0 for b_j1 < … < b_jk.

Note that the partial derivative conditions on s⁽⁰⁾, s⁽¹⁾ and s⁽²⁾ are satisfied by S⁽⁰⁾, S⁽¹⁾ and S⁽²⁾; furthermore, S is automatically positive semidefinite. Moreover, the interval [0, τ] in the conditions may everywhere be replaced by the set {t : h₀(t) > 0}.

Conditions (D1)–(D5) are standard requirements for the proportional hazards model [1], which are weaker than those required by Bradic et al. [4], and S^(k)(β₀, t) converges uniformly to s^(k)(β₀, t). Condition (D6) is a routine one, which is needed to apply the concentration inequality for general empirical processes. For example, the bounded covariate assumption is used by Huang et al. [15] for discussing the Lasso estimator of proportional hazards models. Condition (D7) is needed for the asymptotic behavior of the score function $ℓ_{p}^{'} (β)$ of partial likelihood because the score function cannot be represented as a sum of independent random vectors, but it can be represented as sum of a sequence of martingale differences.

Proof of Theorem 1. Applying the Taylor expansion to ℓ_p(γ) at γ = β, one finds

ℓ_{p} (γ) = ℓ_{p} (β) + ℓ_{p}^{'} (γ) (γ - β) + {(γ - β)}^{⊤} ℓ_{p}^{″} (\tilde{β}) (γ - β) / 2,

where $\tilde{β}$ lies between γ and β.

{(γ - β)}^{⊤} {- ℓ_{p}^{″} (\tilde{β})} (γ - β) = {(γ - β)}^{⊤} W^{1 / 2} (β) W^{- 1 / 2} (β) {- ℓ_{p}^{″} (\tilde{β})} W^{- / 2} (β) W^{1 / 2} (β) (γ - β) \leq λ_{\max} [W^{- 1 / 2} (β) {- ℓ_{p}^{″} (\tilde{β}) W^{- 1 / 2} (β)] {(γ - β)}^{⊤} W (β) (γ - β),

where W(β) is a block diagonal matrix with W_j(β) being a $d_{n_{j}} \times d_{n_{j}}$ matrix. Given that −ℓ″(β) is non-negative definite, $λ_{\max} [W^{- 1 / 2} (β) {- ℓ_{p}^{″} (\tilde{β})} W^{- 1 / 2} (β)] \geq 0$ . Thus, if $u > λ_{\max} [W^{- 1 / 2} (β) {- ℓ_{p}^{″} (\tilde{β})} W^{- 1 / 2} (β)] \geq 0$ , then

ℓ_{p} (γ) \geq ℓ_{p} (β) + ℓ_{p}^{'} (β) (γ - β) - u {(γ - β)}^{⊤} W (β) (γ - β) / 2 = h (γ | β) .

Thus it follows that $ℓ_{p} (γ) \geq h (γ | β)$ and $ℓ_{p} (β) = h (β | β)$ by the definition of h(γ|β). The solution of $\partial h (γ | β) / \partial γ = 0$ is $γ = β + u^{- 1} W (β) ℓ^{'} (β)$ . Hence, under the conditions of Theorem 1, it follows that

ℓ_{p} (β_{j}^{* (t + 1)}) \geq h (β^{* (t + 1)} | β^{(t)}) \geq h (β^{(t)} | β^{(t)}) = ℓ (β^{(t)}) .

The second inequality is due to the fact that

τ [{j : {‖ β_{j}^{* (t + 1)} ‖}_{2} > 0}] = τ [{j : {‖ β_{j}^{(t)} ‖}_{2} > 0}] = m

and $β^{* (t + 1)} = \arg \min_{γ} h (γ | β^{(t)})$ subject to $τ [{j : {‖ γ_{j} ‖}_{2} > 0}] \leq m$ . By definition of β^(t+1), $ℓ_{p} (β^{(t + 1)}) \geq ℓ_{p} (β^{* (t + 1)})$ and $τ [{j : {‖ β_{j}^{(t + 1)} ‖}_{2} > 0}] = m$ . This proves Theorem 1. □

Proof of Theorem 2. For a given model s, a subset of {1, …, p}, let ${\hat{α}}_{s} (U)$ be the partial likelihood estimate of α_s(U) based on the spline approximation. The theorem is implied if $\Pr (\hat{s} \in S_{+}^{m}) \to 1$ . Thus, it suffices to show that

\lim_{n \to \infty} \Pr [\max_{s \in S_{-}^{m}} ℓ_{p} {{\hat{α}}_{s} (U)} \geq \min_{s \in S_{+}^{m}} ℓ_{p} {{\hat{α}}_{s} (U)}] = 0,

(A.1)

For each j ∈ {1, …, p}, we approximate the coefficient function α _j(U) by

α_{n j} (U) = \sum_{k = 1}^{d_{n}} β_{j k} ψ_{j k} (U) = β_{j}^{⊤} ψ_{j} (U) .

(A.2)

where $ψ_{j 1} (U), \dots, ψ_{j d_{n}} (U)$ are basis functions and d_n is the number of basis functions, which is allowed to increase with the sample size n. For α_{n j}(U), define the approximation error for each j ∈ {1, …, p}, by

ρ_{j} (U) = α_{j} (U) - α_{n j} (U) = α_{j} (U) - β_{j}^{⊤} ψ_{j} (U) .

Let ${α_{j} (U), S_{j}} = \inf_{α_{n j} (U) \in S_{j}} \sup_{U \in [a, b]} {‖ ρ_{j} (U) ‖}_{2}$ , and take $ρ = \max_{1 \leq j \leq p} dist {α_{j} (U), S_{j}}$ . Let α_n(U) = (α_n1(U), …, α_np(U))^T and α(U) = (α₁(U), …, α_p(U))^T. For any s,

α_{s} (U) = {(\begin{matrix} ψ_{1} (U) \\ ⋱ \\ ψ_{s} (U) \end{matrix})}_{s \times s d_{n}} {(\begin{matrix} β_{1} \\ ⋮ \\ β_{s} \end{matrix})}_{s d_{n} \times 1} + (\begin{matrix} ρ_{1} (U) \\ ⋮ \\ ρ_{s} (U) \end{matrix}) \hat{=} Ψ_{s} (U) β_{s} + ρ_{s} (U),

where Ψ_s(U) = diag{ψ₁(U), …, ψ_s(U)} with $ψ_{j} (U) = (ψ_{j 1} (U), \dots, ψ_{j d_{n}} (U))$ , and $β_{j} = {(β_{j 1}, \dots, β_{j d_{n}})}^{⊤}$ for all j ∈ {1, …, s}. For any $s \in S_{-}^{m}$ , define $s' = s \cup s^{*} \in S_{+}^{2 m}$ . So, we have

ℓ_{p} {α_{s'} (U)} - ℓ_{p} {α_{s'}^{*} (U)} = ℓ_{p} {Ψ_{s'} (U) β_{s'} + ρ_{s'} (U)} - ℓ_{p} {Ψ_{s'} (U) β_{s'}^{*} + ρ_{s'}^{*} (U)} = ℓ_{p} {Ψ_{s'} (U) β_{s'}} + ℓ_{p}^{'} {Ψ_{s'} (U) {\tilde{β}}_{s'}} ρ_{s'} (U) - ℓ_{p} {Ψ_{s'} (U) β_{s'}^{*}} - ℓ_{p}^{'} {Ψ_{s'} (U) {\tilde{β}}_{s'}^{*}} ρ_{s'}^{*} (U),

where ${\tilde{β}}_{s'}$ and ${\tilde{β}}_{s'}^{*}$ are two immediate values. Denote

Δ_{1} = {ℓ_{p} (β_{s'}) - ℓ_{p} (β_{s'}^{*})}, Δ_{2} = ℓ_{p}^{'} ({\tilde{β}}_{s'}) ρ_{s^{'}} (U), Δ_{3} = ℓ_{p}^{'} ({\tilde{β}}_{s'}^{*}) ρ_{s'}^{*} (U) .

Thus, we have $ℓ_{p} {α_{s'} (U)} - ℓ_{p} {α_{s'}^{*} (U)} = Δ_{1} + Δ_{2} + Δ_{3}$ . For ∆₂, by the Cauchy–Schwarz inequality, we have

E | Δ_{2} | = E | ℓ_{p}^{'} ({\tilde{β}}_{s'}) ρ_{s'} (U) | \leq \sqrt{E {‖ ℓ_{p}^{'} ({\tilde{β}}_{s'}) ‖}^{2}} \sqrt{E {‖ ρ_{s'} (U) ‖}^{2}} .

By Condition (C5) and Corollary 1 in [25], we obtain ∆₂ = o_p(1). Similarly to ∆₂, we can also conclude that ∆₃ = o_p(1).

Next, we consider the term ∆₁. For any $s \in S_{-}^{m}$ , define $s^{'} = s \cup s^{*} \in S_{+}^{2 m}$ . Under Condition (C3), we consider $β_{s'}$ close to $β_{s'}^{*}$ such that $‖ β_{s'} - β_{s'}^{*} ‖ = w_{1} d_{n} n^{- τ_{1}}$ for some w₁, τ₁ > 0. Clearly, when n is sufficiently large, β_s′ falls into a small neighborhood of $β_{s'}^{*}$ , so that Condition (C5) becomes applicable. Thus, it follows from Condition (C5) and the Cauchy–Schwarz inequality that

ℓ_{p} (β_{s'}) - ℓ_{p} (β_{s'}^{*}) = {(β_{s'} - β_{s'}^{*})}^{⊤} ℓ_{p}^{'} (β_{s'}^{*}) + (1 / 2) {(β_{s'} - β_{s'}^{*})}^{⊤} ℓ_{p}^{″} ({\tilde{β}}_{s'}) (β_{s'} - β_{s'}^{*}) \leq {(β_{s'} - β_{s'}^{*})}^{⊤} ℓ_{p}^{'} (β_{s'}^{*}) - (C_{1} d_{n}^{- 1} / 2) n {‖ β_{s'} - β_{s'}^{*} ‖}_{2}^{2} \leq w_{1} d_{n} n^{- τ_{1}} {‖ ℓ_{p}^{'} (β_{s'}^{*}) ‖}_{2} - (C_{1} d_{n} / 2) w_{1}^{2} n^{1 - 2 τ_{1}},

(A.3)

where ${\tilde{β}}_{s'}$ is an intermediate value between β_s′ and $β_{s'}^{*}$ . Thus, we have

\Pr {ℓ_{p} (β_{s'}) - ℓ_{p} (β_{s'}^{*}) \geq 0} \leq \Pr {{‖ ℓ_{p}^{'} (β_{s'}^{*}) ‖}_{2} \geq (C_{1} w_{1} / 2) n^{1 - τ_{1}}} = \Pr [\sum_{j \in s'} {ℓ_{p}^{'} (β_{s'}^{*})}^{2} \geq {(C_{1} w_{1} / 2)}^{2} n^{2 - 2 τ_{1}}] \leq \sum_{j \in s'} \Pr [{ℓ_{j}^{'} (β_{s'}^{*})}^{2} \geq {(2 m)}^{- 1} {(C_{1} w_{1} / 2)}^{2} n^{2 - 2 τ_{1}}] .

Also, by (C3), we have $m \leq w_{2} n^{τ_{2}}$ , and also the following probability inequality

\Pr {ℓ_{j}^{'} (β_{s'}^{*}) \geq {(2 m)}^{- 1 / 2} (C_{1} w_{1} / 2) n^{1 - τ_{1}}} \leq \Pr {ℓ_{j}^{'} (β_{s'}^{*}) \geq {(2 w_{2} n^{τ_{2}})}^{- 1 / 2} (C_{1} w_{1} / 2) n^{1 - τ_{1}}} = \Pr {ℓ_{j}^{'} (β_{s'}^{*}) \leq c n^{1 - τ_{1} - 0.5 τ_{2}}} = \Pr {ℓ_{j}^{'} (β_{s'}^{*}) \leq n c n^{1 - τ_{1} - 0.5 τ_{2}}},

(A.4)

where $c = C_{1} w_{1} / (2 \sqrt{2 w_{2}})$ denotes some generic positive constant. Recall (2), by differentiation and rearrangement of terms, it can be shown as in [1] that the gradient of ℓ_p(β) is

ℓ_{p}^{'} (β) \equiv \partial ℓ_{p} (β) / \partial β = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{\infty} {z_{i} - {\bar{z}}_{n} (β, t)} d {\bar{N}}_{i} (t),

(A.5)

where ${\bar{z}}_{n} (β, t) = \sum_{i \in R} z_{i} \exp (z_{i}^{⊤} β) / \sum_{i \in R} \exp (z_{i}^{⊤} β)$ . As a result, the partial score function $ℓ_{p}^{'} (β)$ no longer has a martingale structure, and the large deviation results for continuous time martingale in [4] and [15] are not directly applicable. The martingle process associated with ${\bar{N}}_{i} (t)$ is given by

M_{i} (t) = {\bar{N}}_{i} (t) \int_{0}^{⊤} R_{i} (u) \exp (z^{⊤} β^{*}) d Λ_{0} (u) .

For each j ∈ {1, …, N}, let t_j be the time of the jth jump of the process $\sum_{i = 1}^{n} \int_{0}^{\infty} R_{i} (t) d {\bar{N}}_{i} (t)$ and set t₀ = 0. Then, t _j are stopping times. For j ∈ {0, …, N}, further define

Q_{j} = \sum_{i = 1}^{n} \int_{0}^{t_{j}} b_{i} (u) d {\bar{N}}_{i} (u) = \sum_{i = 1}^{n} \int_{0}^{t_{j}} b_{i} (u) d M_{i} (u) .

(A.6)

where $b_{i} (u) = z_{i} - {\bar{z}}_{n} (β, u)$ for all i ∈ {1, …, n} are predictable, provided that no two component processes jump at the same time, (D6 holds), and |b_i(u)| ≤ 1.

Since M_i(u) are martingales and b_i(u) are predictable, {Q₀, Q₁, …} is a martingale with the difference |Q_j − Q_j−1| ≤ max_u,i |b_i(u)| ≤ 1. Recall definition of N in Section 2, we define $C_{0}^{2} n \leq N$ , where C₀ is a constant. So, by the martingale version of Hoeffding’s inequality [2] and under Condition (D7), we have

\Pr (| Q_{N} | > n C_{0} x) \leq 2 \exp {- n^{2} C_{0}^{2} x^{2} / (2 N)} \leq 2 \exp (- n x^{2} / 2) .

By (A.6), $Q_{N} = n ℓ_{p}^{'} (β)$ if and only if $\sum_{i = 1}^{n} \int_{0}^{\infty} R_{i} (t) d {\bar{N}}_{i} (t) \leq N$ Thus, the left-hand side of (3.15) in Lemma 3.3 of [15] is no greater than Pr(|Q_N| > nC₀ x) ≤ 2 exp(−nx²/2). Now (A.4) can be rewritten as follows:

\Pr {ℓ_{j}^{'} (β_{s'}^{*}) \geq n c n^{- τ_{1} - 0.5 τ_{2}}} \leq \exp {- 0.5 n n^{- 2 τ_{1} - τ_{2}}} = \exp {- 0.5 n^{1 - 2 τ_{1} - τ_{2}}} .

(A.7)

By the same arguments, we have

\Pr {ℓ_{j}^{'} (β_{s'}^{*}) \leq - m^{- 1 / 2} (C_{1} w_{1} / 2) n^{1 - τ_{1}}} \leq \exp {- 0.5 n^{1 - 2 τ_{1} - τ_{2}}} .

(A.8)

Inequalities (A.7) and (A.8) imply that,

\Pr {ℓ_{p} (β_{s'}) \geq ℓ_{p} (β_{s'}^{*})} \leq 4 m \exp {- 0.5 n^{1 - 2 τ_{1} - τ_{2}}} .

Consequently, by Bonferroni’s inequality and under conditions (C3)–(C4), we have

\Pr {\max_{s \in S_{-}^{m}} ℓ_{p} (β_{s'}) \geq ℓ_{p} (β_{s'}^{*})} \leq \sum_{s \in S_{-}^{m}} \Pr {ℓ_{p} (β_{s'}^{*})} \leq 4 m p^{m} \exp {- 0.5 n^{1 - 2 τ_{1} - τ_{1}}} = 4 \exp {\log m + m \log p - 0.5 n^{1 - 2 τ_{1} - τ_{1}}} \leq 4 \exp {\log w_{2} + τ_{2} \log n + w_{2} n^{τ_{2}} \tilde{c} n^{κ} - {0.5}^{1 - 2 τ_{1} - τ_{2}}} = 4 w_{2} \exp {τ_{2} \log n + w_{2} \tilde{c} n^{τ_{2} + κ} - 0.5 n^{1 - 2 τ_{1} - τ_{2}}} = a_{1} \exp {τ_{2} \log n + a_{2} n^{τ_{2} + κ} - 0.5 n^{1 - 2 τ_{1} - τ_{2}}} = o (1),

(A.9)

as n → ∞ for some generic positive constants a₁ = 4w₂ and $a_{2} = w_{2} \tilde{c}$ . By Condition (C5), ℓ_p(β_s′ ) is concave in β_s′ and (A.9) holds for any β_s′ such that $‖ β_{s'} - β_{s'}^{*} ‖ = w_{1} d_{n} n^{- τ_{1}}$ .

For any $s \in S_{-}^{m m}$ , let ${\overset{⌣}{β}}_{s'}$ be ${\hat{β}}_{s}$ augmented with zeros corresponding to the elements in s′/s∗, i.e., s′ = {s ∪ (s∗/s′)} ∪ (s′/s∗). By Condition (C3),

{‖ {\overset{⌣}{β}}_{s'} - β_{s'}^{*} ‖}_{2} = {‖ {\overset{⌣}{β}}_{s^{*} \cup (s' / s^{*})} - β_{s^{*} \cup (s' / s^{*})}^{*} ‖}_{2} = {‖ {\overset{⌣}{β}}_{s^{*} \cup (s' / s^{*})} - β_{s^{*}}^{*} ‖}_{2} \geq {‖ β_{s^{*} \cup (s' / s^{*})}^{*} - β_{s^{*}}^{*} ‖}_{2} \geq {‖ β_{s' / s^{*}}^{*} ‖}_{2} = w_{1} d_{n} n^{- τ_{1}} .

Consequently,

\Pr {\max_{s \in S_{-}^{m}} ℓ_{p} ({\hat{β}}_{s}) \geq \min_{s \in S_{+}^{m}} ℓ_{p} ({\hat{β}}_{s})} \leq \Pr {\max_{s \in S_{-}^{m}} ℓ_{p} ({\overset{⌣}{β}}_{s'}) \geq ℓ_{p} (β_{s'}^{*})} = o (1) .

So, we have shown that

\lim_{n \to \infty} \Pr [\max_{s \in S_{-}^{m}} ℓ {{\hat{α}}_{s} (U)} \geq \min_{s \in S_{+}^{m}} ℓ {{\hat{α}}_{s} (U)}] = 0.

Therefore, the theorem is proved.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

[1].Andersen PK, Gill RD, Cox’s regression model for counting processes: A large sample study, Ann. Statist 10 (1982) 1100–1120. [Google Scholar]
[2].Azuma K, Weighted sums of certain dependent random variables, Tohoku Math. J 19 (1967) 357–367. [Google Scholar]
[3].de Boor C, A Practical Guide to Splines, Springer, New York, 1978. [Google Scholar]
[4].Bradic JFJ, Fan J, Jiang J, Regularization for Cox’s proportional hazards model with NP-dimensionality, Ann. Statist 39 (2011) 3092–3120. [DOI] [PMC free article] [PubMed] [Google Scholar]
[5].Cheng M-Y, Honda T, Zhang J-T, Forward variable selection for sparse ultra-high dimensional varying-coefficient models, J. Amer. Statist. Assoc 111 (2016) 1209–1221. [Google Scholar]
[6].Chu W, Li R, Reimherr M, Feature screening for time-varying coefficient models with ultrahigh dimensional longitudinal data, Ann. Appl. Statist 10 (2016) 596–617. [DOI] [PMC free article] [PubMed] [Google Scholar]
[7].Cox DR, Regression models and life tables (with discussion), J. R. Stat. Soc. Ser. B 34 (1972) 187–220. [Google Scholar]
[8].Cox DR, Partial likelihood, Biometrika 62 (1975) 269–276. [Google Scholar]
[9].Du P, Ma S, Liang H, Penalized variable selection procedure for Cox models with semiparametric relative risk, Ann. Statist 38 (2010) 2092–2117. [DOI] [PMC free article] [PubMed] [Google Scholar]
[10].Fan J, Feng Y, Song R, Nonparametric independence screening in sparse ultra-high-dimensional additive models, J. Amer. Statist. Assoc 106 (2011) 544–557. [DOI] [PMC free article] [PubMed] [Google Scholar]
[11].Fan J, Feng Y, Wu Y, High-dimensional variable selection for Cox’s proportional hazards model, Borrowing Strength: Theory Powering Applications — A Festschrift for Lawrence D. Brown, 70–86, Inst. Math. Stat. (IMS) Collect., 6, Inst. Math. Statist, Beachwood, OH, 2010. [Google Scholar]
[12].Fan J, Lv J, Sure independence screening for ultrahigh dimensional feature space (with discussion), J. R. Stat. Soc. Ser. B 70 (2008) 849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]
[13].Fan J, Ma Y, Dai W, Nonparametric independence screening in sparse ultra-high dimensional varying-coefficient models, J. Amer. Statist. Assoc 109 (2014) 1270–1284. [DOI] [PMC free article] [PubMed] [Google Scholar]
[14].Hu Y, Liang H, Variable selection in a partially linear proportional hazards model with a diverging dimensionality, Statist. Probab. Lett 83 (2013) 61–69. [Google Scholar]
[15].Huang J, Sun T, Ying Z, Yu Y, Zhang C-H, Oracle inequalities for the LASSO in the Cox model, Ann. Statist 41 (2013) 1142–1165. [DOI] [PMC free article] [PubMed] [Google Scholar]
[16].Kim YH, Jeong DC, Pak K, Han M-E, Kim J-Y, Liangwen L, Kim HJ, Kim TW, Kim TH, Hyun DW, Oh S-O, Slc2a2 (glut2) as a novel prognostic factor for hepatocellular carcinoma, Oncotarget 8 (2017) 68381–68392. [DOI] [PMC free article] [PubMed] [Google Scholar]
[17].Kong X-B, Liu Z, Yao Y, Zhou W, Sure screening by ranking the canonical correlations, Test 26 (2017) 46–70. [Google Scholar]
[18].Leng C, Zhang H, Model selection in nonparametric hazard regression, J. Nonparametr. Stat 18 (2006) 417–429. [Google Scholar]
[19].Lian H, Li J, Hu Y, Shrinkage variable selection and estimation in proportional hazards models with additive structure and high dimensionality, Comput. Stat. Data Anal 63 (2013) 99–112. [Google Scholar]
[20].Liu J, Li R, Wu R, Feature selection for varying-coefficient models with ultrahigh-dimensional covariates, J. Amer. Statist. Assoc 109 (2014) 266–274. [DOI] [PMC free article] [PubMed] [Google Scholar]
[21].Liu W-B, Jia W-D, Ma J-L, Xu G-L, Zhou H-C, Peng Y, Wang W, Krockdown of gtpbp4 inhibits cell growth and survival in human hepatocellular carcinoma and its prognostic significance, Oncotarget 8 (2017) 93984–93997. [DOI] [PMC free article] [PubMed] [Google Scholar]
[22].Song R, Yi F, Zou H, On varying-coefficient independence screening for high dimensional varying-coefficient models, Stat. Sinica 24 (2014) 1735–1752. [PMC free article] [PubMed] [Google Scholar]
[23].Stone CJ, Optimal global rates of convergence for nonparametric regression, Ann. Statist 10 (1982) 1040–1053. [Google Scholar]
[24].Stone CJ, Additive regression and other nonparametric models, Ann. Statist 13 (1985) 689–705. [Google Scholar]
[25].Wei F, Huang J, Li H, Variable selection and estimation in high-dimensional varying-coefficient models, Stat. Sinica 21 (2011) 1515–1540. [DOI] [PMC free article] [PubMed] [Google Scholar]
[26].Xia X, Yang H, Li J, Feature screening for generalized varying-coefficient models with application to dichotomous response, Comput. Stat. Data Anal 102 (2016) 85–97. [Google Scholar]
[27].Xu C, Chen J, The sparse MLE for ultrahigh-dimensional feature screening, J. Amer. Statist. Assoc 109 (2014) 1257–1269. [DOI] [PMC free article] [PubMed] [Google Scholar]
[28].Yang G, Yu Y, Li R, Buu A, Feature screening in ultrahigh dimensional Cox’s model, Stat. Sinica 26 (2016) 881–901. [DOI] [PMC free article] [PubMed] [Google Scholar]
[29].Yan J, Huang J, Model selection for Cox models with time-varying coefficients, Biometrics 68 (2012) 419–428. [DOI] [PMC free article] [PubMed] [Google Scholar]
[30].Zhang H, Lu W, Adaptive Lasso for Cox’s proportional hazards model, Biometrika 94 (2007) 691–703. [Google Scholar]
[31].Zhang W, Sun B, Impact of age on the survival of patients with liver cancer: An analysis of 27,255 patients in the seer database, Oncotarget 6 (2015) 633–641. [DOI] [PMC free article] [PubMed] [Google Scholar]
[32].Zhao S, Li Y, Principled sure independence screening for Cox models with ultra-high-dimensional covariates, J. Multivariate Anal 105 (2012) 397–411. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R1] [1].Andersen PK, Gill RD, Cox’s regression model for counting processes: A large sample study, Ann. Statist 10 (1982) 1100–1120. [Google Scholar]

[R2] [2].Azuma K, Weighted sums of certain dependent random variables, Tohoku Math. J 19 (1967) 357–367. [Google Scholar]

[R3] [3].de Boor C, A Practical Guide to Splines, Springer, New York, 1978. [Google Scholar]

[R4] [4].Bradic JFJ, Fan J, Jiang J, Regularization for Cox’s proportional hazards model with NP-dimensionality, Ann. Statist 39 (2011) 3092–3120. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] [5].Cheng M-Y, Honda T, Zhang J-T, Forward variable selection for sparse ultra-high dimensional varying-coefficient models, J. Amer. Statist. Assoc 111 (2016) 1209–1221. [Google Scholar]

[R6] [6].Chu W, Li R, Reimherr M, Feature screening for time-varying coefficient models with ultrahigh dimensional longitudinal data, Ann. Appl. Statist 10 (2016) 596–617. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] [7].Cox DR, Regression models and life tables (with discussion), J. R. Stat. Soc. Ser. B 34 (1972) 187–220. [Google Scholar]

[R8] [8].Cox DR, Partial likelihood, Biometrika 62 (1975) 269–276. [Google Scholar]

[R9] [9].Du P, Ma S, Liang H, Penalized variable selection procedure for Cox models with semiparametric relative risk, Ann. Statist 38 (2010) 2092–2117. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] [10].Fan J, Feng Y, Song R, Nonparametric independence screening in sparse ultra-high-dimensional additive models, J. Amer. Statist. Assoc 106 (2011) 544–557. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] [11].Fan J, Feng Y, Wu Y, High-dimensional variable selection for Cox’s proportional hazards model, Borrowing Strength: Theory Powering Applications — A Festschrift for Lawrence D. Brown, 70–86, Inst. Math. Stat. (IMS) Collect., 6, Inst. Math. Statist, Beachwood, OH, 2010. [Google Scholar]

[R12] [12].Fan J, Lv J, Sure independence screening for ultrahigh dimensional feature space (with discussion), J. R. Stat. Soc. Ser. B 70 (2008) 849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] [13].Fan J, Ma Y, Dai W, Nonparametric independence screening in sparse ultra-high dimensional varying-coefficient models, J. Amer. Statist. Assoc 109 (2014) 1270–1284. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] [14].Hu Y, Liang H, Variable selection in a partially linear proportional hazards model with a diverging dimensionality, Statist. Probab. Lett 83 (2013) 61–69. [Google Scholar]

[R15] [15].Huang J, Sun T, Ying Z, Yu Y, Zhang C-H, Oracle inequalities for the LASSO in the Cox model, Ann. Statist 41 (2013) 1142–1165. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] [16].Kim YH, Jeong DC, Pak K, Han M-E, Kim J-Y, Liangwen L, Kim HJ, Kim TW, Kim TH, Hyun DW, Oh S-O, Slc2a2 (glut2) as a novel prognostic factor for hepatocellular carcinoma, Oncotarget 8 (2017) 68381–68392. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] [17].Kong X-B, Liu Z, Yao Y, Zhou W, Sure screening by ranking the canonical correlations, Test 26 (2017) 46–70. [Google Scholar]

[R18] [18].Leng C, Zhang H, Model selection in nonparametric hazard regression, J. Nonparametr. Stat 18 (2006) 417–429. [Google Scholar]

[R19] [19].Lian H, Li J, Hu Y, Shrinkage variable selection and estimation in proportional hazards models with additive structure and high dimensionality, Comput. Stat. Data Anal 63 (2013) 99–112. [Google Scholar]

[R20] [20].Liu J, Li R, Wu R, Feature selection for varying-coefficient models with ultrahigh-dimensional covariates, J. Amer. Statist. Assoc 109 (2014) 266–274. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] [21].Liu W-B, Jia W-D, Ma J-L, Xu G-L, Zhou H-C, Peng Y, Wang W, Krockdown of gtpbp4 inhibits cell growth and survival in human hepatocellular carcinoma and its prognostic significance, Oncotarget 8 (2017) 93984–93997. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] [22].Song R, Yi F, Zou H, On varying-coefficient independence screening for high dimensional varying-coefficient models, Stat. Sinica 24 (2014) 1735–1752. [PMC free article] [PubMed] [Google Scholar]

[R23] [23].Stone CJ, Optimal global rates of convergence for nonparametric regression, Ann. Statist 10 (1982) 1040–1053. [Google Scholar]

[R24] [24].Stone CJ, Additive regression and other nonparametric models, Ann. Statist 13 (1985) 689–705. [Google Scholar]

[R25] [25].Wei F, Huang J, Li H, Variable selection and estimation in high-dimensional varying-coefficient models, Stat. Sinica 21 (2011) 1515–1540. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] [26].Xia X, Yang H, Li J, Feature screening for generalized varying-coefficient models with application to dichotomous response, Comput. Stat. Data Anal 102 (2016) 85–97. [Google Scholar]

[R27] [27].Xu C, Chen J, The sparse MLE for ultrahigh-dimensional feature screening, J. Amer. Statist. Assoc 109 (2014) 1257–1269. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] [28].Yang G, Yu Y, Li R, Buu A, Feature screening in ultrahigh dimensional Cox’s model, Stat. Sinica 26 (2016) 881–901. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] [29].Yan J, Huang J, Model selection for Cox models with time-varying coefficients, Biometrics 68 (2012) 419–428. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] [30].Zhang H, Lu W, Adaptive Lasso for Cox’s proportional hazards model, Biometrika 94 (2007) 691–703. [Google Scholar]

[R31] [31].Zhang W, Sun B, Impact of age on the survival of patients with liver cancer: An analysis of 27,255 patients in the seer database, Oncotarget 6 (2015) 633–641. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] [32].Zhao S, Li Y, Principled sure independence screening for Cox models with ultra-high-dimensional covariates, J. Multivariate Anal 105 (2012) 397–411. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Feature screening in ultrahigh-dimensional varying-coefficient Cox model

Guangren Yang

Ling Zhang

Runze Li

Yuan Huang

Abstract

1. Introduction

2. New feature screening procedure for varying-coefficient Cox model

2.1. A new feature screening procedure

2.2. Sure screening property

3. Numerical studies

3.1. Simulation studies

Table 1:

Table 2:

Table 3:

Table 4:

Table 5:

3.2. Real data analysis

Table 6:

Figure 1:

Table 7:

4. Discussion

Acknowledgments

Appendix

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Feature screening in ultrahigh-dimensional varying-coefficient Cox model

Guangren Yang

Ling Zhang

Runze Li

Yuan Huang

Abstract

1. Introduction

2. New feature screening procedure for varying-coefficient Cox model

2.1. A new feature screening procedure

2.2. Sure screening property

3. Numerical studies

3.1. Simulation studies

Table 1:

Table 2:

Table 3:

Table 4:

Table 5:

3.2. Real data analysis

Table 6:

Figure 1:

Table 7:

4. Discussion

Acknowledgments

Appendix

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases