Abstract
Forward regression, a classical variable screening method, has been widely used for model building when the number of covariates is relatively low. However, forward regression is seldom used in high-dimensional settings because of the cumbersome computation and unknown theoretical properties. Some recent works have shown that forward regression, coupled with an extended Bayesian information criterion (EBIC)-based stopping rule, can consistently identify all relevant predictors in high-dimensional linear regression settings. However, the results are based on the sum of residual squares from linear models and it is unclear whether forward regression can be applied to more general regression settings, such as Cox proportional hazards models. We introduce a forward variable selection procedure for Cox models. It selects important variables sequentially according to the increment of partial likelihood, with an EBIC stopping rule. To our knowledge, this is the first study that investigates the partial likelihood-based forward regression in high-dimensional survival settings and establishes selection consistency results. We show that, if the dimension of the true model is finite, forward regression can discover all relevant predictors within a finite number of steps and their order of entry is determined by the size of the increment in partial likelihood. As partial likelihood is not a regular density-based likelihood, we develop some new theoretical results on partial likelihood and use these results to establish the desired sure screening properties. The practical utility of the proposed method is examined via extensive simulations and analysis of a subset of the Boston Lung Cancer Survival Cohort study, a hospital-based study for identifying biomarkers related to lung cancer patients’ survival.
Keywords: Forward selection, partial likelihood, sure screening properties, extended Bayesian information criteria, high-dimensional predictors
1. Introduction
New biotechnologies have generated a vast amount of high-throughput data. In the Boston Lung Cancer Survival Cohort study, a hospital-based study for lung cancer patients, identifying high-throughput predictors such as molecular profiles that are associated with patients’ survival is a major research goal for understanding disease progression processes and designing more effective gene therapies. When the number p of covariates is less than the sample size n, the Cox proportional hazards model has been routinely used for modeling survival data in many practical settings. When p > n, penalized partial likelihood methods have been proposed by various authors [29, 41] and the oracle properties and statistical error bounds of estimation have been established [18]. However, when p >> n, these methods often fail because of serious challenges in “computational expediency, statistical accuracy, and algorithmic stability” [8]. Recently, Bradic et al. [3] established the oracle properties of the regularized partial likelihood estimates under a high-dimensional setting. However, the results require the estimates to be unique and global optimizers, which is, in general, difficult to verify, especially when the dimension of covariates is exceedingly high.
Forward regression has been widely used for model selection, but it has often been criticized for not achieving selection consistency as it fails to account for multiple comparisons in the model building process. Recently, some authors, e.g., [5, 13, 19, 23, 34, 39] have revamped forward regression in the context of linear regression or varying-coefficient linear models. The advantages can be summarized as follows. First, these authors have shown that, with some proper stopping criteria, forward regression can achieve screening consistency even in high-dimensional settings. Second, the variables are sequentially selected into the final model with the entry order determined by the size of the likelihood increment, which might reflect the relative importance of each selected variable. Third, the implementation is simple as no cross-validation for tuning parameters is needed. Finally, the method only needs assumptions on the original model and does not require restrictive faithfulness assumptions, in which the marginal models reflect the original model. However, to our knowledge, the aforementioned forward regression approaches are either based on the sum of residual squares from linear models [5, 34] or Lasso estimation [23]. It is unclear whether forward regression can be applied to more general regression settings, such as the Cox proportional hazards models.
In contrast, there has been active research in developing high-dimensional screening tools for survival data. These works include principled sure screening [37], feature aberration at survival times screening [12] and conditional screening [16], quantile adaptive sure independence screening [14], a censored rank independence screening procedure [26], and integrated powered density screening [15]; see [17] for an extensive review. However, the screening methods require a threshold to dictate how many variables to retain, for which unfortunately there are no clear rules. Zhao and Li [37] did tie the threshold to false discoveries, but it still requires pre-specifying the number of false positives that users are willing to tolerate. Recently, Li et al. [21] designed a model-free measure, namely the survival impact index, that sensibly captures the overall influence of a covariate on the survival outcome and can help guide selecting important variables. However, even this method, like the other screening methods, does not directly lead to a final model, for which extra modeling steps have to be implemented.
We introduce a new forward variable selection procedure for survival data based on partial likelihood. It selects important variables sequentially according to the increment of partial likelihood, with a stopping rule based on EBIC. We show that if the dimension of the true model is finite, within a finite number of steps forward regression can discover all relevant predictors, with the entry order determined by the size of the likelihood increment.
Our work is novel in the following aspects. To our knowledge, it is the first attempt to thoroughly investigate forward regression in high-dimensional survival settings, methodologically, theoretically, and numerically. The paper is also technically novel. First, our work represents technical advances and a broadened scope compared to the existing forward regression [5, 23, 34]. This may be the first work that investigates the partial likelihood-based forward regression in survival models with high-dimensional predictors, and establishes rigorous selection consistency results when the extended Bayesian information criterion (EBIC) [4] is used. It improves the partial likelihood-based variable selection developed by Volinsky and Raftery [33] and Xu et al. [35] for survival data in low dimensional settings. Second, as partial likelihood is not a regular density-based likelihood, it fails to satisfy the requirements for theories of forward regression. We revisit partial likelihood and develop some new inequalities, based on which we establish the desired sure screening properties. The derived theoretical framework and techniques will facilitate the extension of the procedure to other general likelihood-based settings, such as generalized linear regression models. Finally, we note that forward selection starts with an empty model or some important variables identified a priori, and then sequentially recruits variables given important variables identified in the previous steps. This may resemble the conditional screening approach [16], which incorporates prior knowledge into variable screening. However, our method is valid even in the absence of such information.
The rest of the paper is organized as follows. In Section 2, we introduce the proposed forward regression procedure. In Section 3, we rigorously establish forward regression’s screening consistency and false discovery control under some regularity conditions. We carry out simulation studies to assess the performance of the proposed method in Section 4, and apply the method in Section 5 to analyze a subset of the Boston Lung Cancer Survival Cohort study, our motivating study for identifying biomarkers related to lung cancer patients’ survival. We conclude the paper with a natural extension of the proposal in Section 6. Technical proofs and all of the lemmas are presented in the Appendix.
2. Partial likelihood-based forward regression
Suppose we have n independent subjects with p covariates, where p >> n. For subject i, denote by Xij the jth covariate, write Xi = (Xi1, … , Xip)⊺, and let Ti and Ci be the underlying survival and censoring times. We only observe Yi = min(Ti, Ci) and the event indicator δi = 1(Ti ≤ Ci), where 1(·) is the indicator function. We assume random censoring, such that Ci and Ti are independent given Xi. We assume that (Y1, δ1, X1), … , (Yn, δn, Xn) are independent and identically distributed (iid). In particular, (Y1, T1, X1j), … , (Yn, Tn, Xnj) are iid copies of (Y, T, Xj), the random variables that underlie the observed survival time, true survival time, and covariates.
To link Ti to Xi, for each i ∈ {1, … , n}, we consider the Cox proportional hazards model, viz.
λ(t | Xi) = λ0(t) exp(β0⊺Xi), (1)
where λ0 is the unspecified baseline hazard function and β0 = (β01, … , β0p)⊺ is the vector of regression coefficients. Without loss of generality, we assume that E(X1) = ⋯ = E(Xp) = 0. Denote the true model by S⋆ = {j : β0j ≠ 0}, with p0 = |S⋆| its size. The overarching goal of variable screening is to estimate S⋆, and we denote its estimate by Ŝ.
We introduce more notation. For an index set S ⊂ {1, … , p} and a p-dimensional vector A, we use AS = {Aj : j ∈ S} to denote the subvector of A corresponding to S. For example, XiS denotes the collection of covariates for the ith individual corresponding to S. We use |S| to denote the cardinality of S and let SC denote the complement of S.
Now we elaborate on the idea of forward regression under model (1). Initialize S0 = Ø. We can also start with a set of given variables according to some prior knowledge, which is in the same spirit as conditional screening [16] but is followed by a sequential selection process. Specifically, we sequentially select the sets of covariates in such a way that

S0 ⊂ S1 ⊂ ⋯ ⊂ Sk ⊂ ⋯,
where Sk ⊂ {1, … , p} is the index set of the selected covariates upon completion of the kth step, with k ≥ 0. At the (k + 1)th step, we need to choose a new candidate variable not in Sk and decide whether to stop at the kth step or to include the new candidate in our selection and proceed to the next step. We emphasize that our selection criterion is based on the partial likelihood. The framework is much broader than the one based on the reduction in the sum of squared residuals proposed in [5, 34], and can be extended to more general regression settings.
Now, given Sk, we consider estimation of the extended Cox model formed by adding a new variable index to Sk. Specifically, for every j ∈ SkC, we denote Sk,j = Sk ∪ {j} and fit a Cox model on the variables indexed by Sk,j. We then compute the increment of the log partial likelihood for each j ∈ SkC, viz.

ℓSk,j(β̂Sk,j) − ℓSk(β̂Sk).
Here, for a covariate set S, ℓS(βS) is the log partial likelihood function given XS, viz.
ℓS(βS) = (1/n) ∑i=1n ∫0τ [βS⊺XiS − ln{(1/n) ∑ℓ=1n Yℓ(t) exp(βS⊺XℓS)}] dNi(t), (2)
and β̂S = arg maxβS ℓS(βS) maximizes (2), where Ni(t) = 1(Yi ≤ t, δi = 1) is the counting process, Yi(t) = 1(Yi ≥ t) is the at-risk process, and τ > 0 is the study duration such that Pr(Y ≥ τ) > 0. Then, the candidate index is chosen as

j* = arg maxj∈SkC {ℓSk,j(β̂Sk,j) − ℓSk(β̂Sk)}.
Upon completion of the (k + 1)th step, update Sk+1 = Sk ∪ {j*}.
We are now in a position to decide whether to stop at the kth step or to include variable j* in our selection and proceed to the next step. In the survival setting, the effective sample size is the number of uncensored events, in which case Volinsky and Raftery [33] showed that replacement of the sample size with the number of uncensored events in the penalty term gives a better approximation to the Bayes factor. Therefore, we propose the following modified EBIC criterion for ultrahigh-dimensional survival data:

EBIC(S) = −2nℓS(β̂S) + |S|(ln d + 2η ln p),

where d = δ1 + ⋯ + δn is the number of events and η is some positive constant.
We stop and declare Ŝ = Sk if EBIC(Sk+1) > EBIC(Sk); otherwise, we proceed to the next step.
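For concreteness, the following is a minimal implementation sketch of the procedure in Python. It is not the authors’ software: it assumes the lifelines package, whose CoxPHFitter exposes the maximized log partial likelihood as log_likelihood_, and all helper names are ours.

```python
# Minimal sketch of the proposed procedure (not the authors' software).
# Assumes the `lifelines` package; CoxPHFitter exposes the maximized log
# partial likelihood as `log_likelihood_`. All helper names are ours.
import numpy as np
from lifelines import CoxPHFitter

def null_loglik(time, event):
    # Log partial likelihood of the empty model (beta = 0): each event
    # contributes -log(risk-set size). Ties are handled only crudely.
    order = np.argsort(time)
    ev = np.asarray(event)[order].astype(bool)
    at_risk = len(time) - np.arange(len(time))
    return -np.log(at_risk[ev]).sum()

def ebic(loglik, size, d, p, eta):
    # Modified EBIC: the number of events d replaces n in the penalty [33].
    return -2.0 * loglik + size * (np.log(d) + 2.0 * eta * np.log(p))

def fit_loglik(df, cols, tcol, ecol):
    cph = CoxPHFitter()
    cph.fit(df[cols + [tcol, ecol]], tcol, ecol)
    return cph.log_likelihood_

def forward_cox(df, tcol, ecol, eta=1.0, init=()):
    cand = [c for c in df.columns if c not in (tcol, ecol)]
    d, p = int(df[ecol].sum()), len(cand)
    S = list(init)
    ll = (fit_loglik(df, S, tcol, ecol) if S
          else null_loglik(df[tcol].values, df[ecol].values))
    crit = ebic(ll, len(S), d, p, eta)
    while len(S) < p:
        # Step k+1: candidate with the largest partial-likelihood increment.
        ll_new, j_star = max((fit_loglik(df, S + [j], tcol, ecol), j)
                             for j in cand if j not in S)
        crit_new = ebic(ll_new, len(S) + 1, d, p, eta)
        if crit_new > crit:          # EBIC stopping rule
            break
        S.append(j_star)
        ll, crit = ll_new, crit_new
    return S
```

For instance, forward_cox(df, "time", "status") runs the default η = 1 variant on a data frame holding the survival time and event indicator columns.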
3. Theoretical properties
We first introduce more notation. Let →p denote convergence in probability. Given a random sample Z1, … , Zn, let ℙnf = (1/n) ∑i=1n f(Zi) denote the empirical average of a function f and Pf = E{f(Z1)} its population counterpart. For a column vector v, let v⊗0 = 1, v⊗1 = v, and v⊗2 = vv⊺. We denote the ℓq norm of v by ||v||q for q ≥ 1, and, in particular, denote its ℓ2 norm by ||v||. For any symmetric matrix A, let λmin(A) and λmax(A) represent the smallest and largest eigenvalues. Given an index set S and an index j ∈ S, we use S \ j to denote the set {r : r ∈ S, r ≠ j}.
Given an index set S ⊂ {1, … , p}, for k ∈ {0, 1, 2}, define

μS(k)(βS, t) = E{1(Y ≥ t) XS⊗k exp(βS⊺XS)}.
In addition, we use β*S to denote the least false value, which is the unique root (in βS) of the population score equation

E[∫0τ {X1S − μS(1)(βS, t)/μS(0)(βS, t)} dN1(t)] = 0.
We use FT(⋅ | XS⋆), fT(⋅ | XS⋆), and ST(⋅ | XS⋆) to denote the conditional cumulative distribution function (cdf), probability density function (pdf), and survival function of T given the true model S⋆, respectively. Likewise, the conditional cdf, pdf, and survival function of C are denoted by FC(⋅ | XSc), fC(⋅ | XSc), and SC(⋅ | XSc), respectively, where Sc is the collection of covariates that are related to the censoring time C.
3.1. Regularity conditions
We posit the regularity conditions, followed by explanations. The assumptions are, in general, mild, well justified, and follow the same lines as suggested by the existing literature.
(A) The study has a finite duration τ such that ω = Pr(Y ≥ τ) > 0.
(B) The Xj are time-independent and bounded by a constant K > 1, with E(Xj) = 0 and var(Xj) = 1 for all j ∈ {1, … , p}.
(C) There exist two positive constants 0 < κmin < κmax < ∞ such that

κmin ≤ λmin{E(XSXS⊺)} ≤ λmax{E(XSXS⊺)} ≤ κmax,

uniformly in S ⊂ {1, … , p} satisfying |S| ≤ ρ, for some positive integer ρ.
(D) sup|S|≤ρ ||β*S||1 ≤ L for some constant L.
(E) For some constants c1 > 0 and α ∈ (0, 1/2), minj∈S⋆ |β0j| ≥ c1n^{−α}.
(F) There exists ζ > 0 such that the curvature of the expected log partial likelihood (the eigenvalues of its negative second derivative) is bounded away from 0 and ∞ in the ζ-neighborhood of β*S, uniformly over |S| ≤ ρ.
(G) For any j ∈ S⋆, the two covariance-type processes appearing in Lemma A have the same sign across t.
Condition (A) is standard in survival models with censored data; see, e.g., [20]. Conditions (B) and (C) are commonly assumed in the literature for variable selection and screening; see, e.g., [5, 23, 34, 38]. The boundedness of X is adopted to simplify our theoretical development and can be relaxed to the Cramér condition as in [3]. Condition (D) replaces the Lipschitz assumption in [30] and has a similar flavor to the conditions in [37] and [20]. Condition (E) is introduced in [37], which is an adapted version of the conditions in [7] to survival data. Condition (F) is analogous to Condition 2 considered in [3] for regularized Cox models; it essentially requires that the concavity of the log partial likelihood be well bounded in a neighborhood of β*S. We invoke Condition (G) in order to analyze the least false value β*S. The condition often holds in practice; Lemma A in the Appendix provides a sufficient condition under which it is satisfied.
Since the log partial likelihood function in (2) is the sum of non-iid random variables, we consider its asymptotically equivalent version, which can be expressed as the sum of iid terms, viz.
ℓ̃S(βS) = (1/n) ∑i=1n ∫0τ [βS⊺XiS − ln μS(0)(βS, t)] dNi(t). (3)
According to Kong and Nan [20], the log partial likelihood function (2) can then be viewed as a “working” model of (3), and the corresponding loss function becomes

γS(βS; Xi, Yi, δi) = −∫0τ [βS⊺XiS − ln μS(0)(βS, t)] dNi(t),

with the expected loss ΓS(βS) = E{γS(βS; Xi, Yi, δi)}.
To validate the replacement of the log partial likelihood (2) by its iid version (3), a commonly assumed condition in the literature is that there exists a neighborhood of β*S such that, for k ∈ {0, 1, 2},

sup ‖∂kℓS(βS)/∂βSk − ∂kℓ̃S(βS)/∂βSk‖ →p 0,

where the supremum is taken over the neighborhood.
See, e.g., [3, 37]. Under Conditions (B) and (D), Lemma C shows that the above condition holds uniformly for all S ⊂ {1, … , p} satisfying |S| ≤ ρ.
We note that the proportional hazards assumption is made only on the true (and sparse) model. At each step of the screening procedure, we treat the misspecified Cox proportional hazards model as a working model following [10, 22]. Our theoretical derivations depend on the least false value, which helps us characterize the asymptotic behavior of our estimator at each step even without the proportional hazards assumption. Specifically, similar to [10, 22], the proposed estimator will converge to the least false value at each step under the working model, and when the second derivative of the log partial likelihood is bounded in a neighborhood of the least false value, adding an active variable will increase the partial likelihood, even if a misspecified model is under consideration.
3.2. Main results
Theorem 1. Under Conditions (A)–(G), if ρ^4 ln p/n → 0, then, with probability at least 1 – 8 exp(−3ρ ln p), there exists a constant c2 > 0 not depending on n such that, for any Sk with S⋆ ⊄ Sk and |Sk| < ρ,

maxj∈S⋆∖Sk {ℓSk,j(β̂Sk,j) − ℓSk(β̂Sk)} ≥ c2n^{−2α},

where α ∈ (0, 1/2) is the constant in Condition (E).
Theorem 1 shows that if S⋆ ⊄ Sk and |Sk| < ρ, then the increment of the log partial likelihood at the (k + 1)th step is at least c2n^{−2α}. Since the total increment is bounded by |ℓS0(0)|, we naturally obtain an upper bound on the number of steps for the forward selection, which is stated in the following corollary.
Corollary 1. Suppose the same conditions as in Theorem 1 hold. If ρ > M, where M = ⌈c4n^{2α}⌉ for a sufficiently large constant c4 and some α ∈ (0, 1/2), then S⋆ ⊆ Sk for some k ≤ M, with probability at least 1 – 11 exp(−3ρ ln p).
Corollary 1 establishes the screening consistency of the forward selection procedure. However, the upper bound M is not sharp, as it is calculated from the lower bound on the increment of the log partial likelihood. The following corollary establishes a sharper upper bound on the number of steps by evaluating how likely a signal variable is to be selected at each step.
Corollary 2. Under the same conditions as in Corollary 1, if the {Xj : j ∉ S⋆} are independent of {Xj : j ∈ S⋆}, then S⋆ ⊆ Sk for some k ≤ c3|S⋆|, for some c3 ∈ (1, ∞), with probability at least 1 – 8 exp(−2ρ ln p).
The condition that the {Xj : j ∉ S⋆} are independent of {Xj : j ∈ S⋆} stems from the assumption imposed in [37], and is similar to the partial orthogonality assumption introduced in [9]. It ensures that selecting a noise variable brings a smaller increment of the log partial likelihood than choosing a signal variable. Thus, it is much more likely for our procedure to select a signal variable at each step.
The following result follows from Corollary 2: we expect the proposed forward procedure to stop early, with S⋆ ⊆ Sk for some k.
Corollary 3. Under the same conditions as in Corollary 2, if |S⋆| is finite and ln p = o(n^{1−2α}), then with probability going to 1,
(i) (screening consistency) the procedure stops at the kth step and S⋆ ⊆ Sk;
(ii) (false discovery rate control) |Sk ∖ S⋆| ≤ (c3 − 1)|S⋆|.
By Corollary 3, our proposed forward selection procedure will stop at a final step, denoted by k̂, which is at most c3|S⋆|. The final model not only achieves screening consistency, but also has well-controlled false discoveries.
4. Numerical studies
Simulations were conducted to compare the performance of the proposed forward regression (FR) with two partial likelihood based screening methods, the principled sure independence screening (PSIS) by Zhao and Li [37], and the conditional screening (CS) by Hong et al. [16]. The size of the models selected by PSIS and CS was initially set to ⌊n/ln n⌋, as suggested in [7]. To reduce false positives, we then applied Lasso [28], SCAD [6], and MCP [36] penalties to shrink the models selected by each method. In the tables, we used screening method + penalty to denote the corresponding procedure. Although FR could start from a null model, we tried different initial sets for FR, including active or inactive variables. In particular, we chose X1 and X10 to represent the active and inactive initial sets, respectively. When computing the model size for both FR and CS, we included the given initial set.
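As an illustration of these two-stage “screening + penalty” procedures, the sketch below keeps the top ⌊n/ln n⌋ covariates from a screening step and refits a Lasso-penalized Cox model. It assumes the lifelines package (which supports elastic-net penalties; SCAD and MCP require other software), and the objects ranked, n, and df are hypothetical inputs.

```python
# Sketch of the two-stage "screening + penalty" refits compared here.
# `ranked` (covariates ordered by a marginal screening statistic), `n`,
# and the data frame `df` are assumed inputs.
import numpy as np
from lifelines import CoxPHFitter

keep = ranked[: int(np.floor(n / np.log(n)))]    # retain top n/ln(n) covariates
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)   # l1_ratio = 1 gives a Lasso fit
cph.fit(df[keep + ["time", "status"]], "time", "status")
# lifelines' soft L1 rarely yields exact zeros, so threshold the coefficients.
selected = cph.params_[cph.params_.abs() > 1e-8].index.tolist()
```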
In this paper, we considered η as a fixed constant, which is analogous to the constant “a” parameter in the SCAD penalty function [6]. This distinguishes our approach from the other screening methods, which typically require data-driven thresholding tuning parameters and may incur more of a computational burden for finding them. To further justify the use of a fixed η, we considered various values of η between 0 and 1, the theoretical range of η in EBIC [4]. The BIC is a special case of EBIC when η = 0. We explored using BIC as the stopping criterion, but it incurred too many false positives compared to EBIC. This may cause overfitting of the Cox proportional hazards model with unreliably estimated regression coefficients and spuriously detected associations [32]. Thus, we elected not to use BIC.
We next considered three different values of η, 0.5, 1 – ln d/(3 ln p), and 1; see Tables 1–3. Essentially, a larger η gives more penalty to a complicated model, which may incur more false negatives, while a smaller η penalizes model complexity less and may lead to more false positives. Based on Tables 1–3, it seems that under all of the scenarios considered, the choice of η = 1 – ln d/(3 ln p) strikes a good balance between false negatives and false positives.
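The following toy computation illustrates how the per-variable EBIC penalty ln d + 2η ln p varies with η; the event count d below is a hypothetical value chosen for illustration.

```python
# Per-variable EBIC penalty, ln d + 2*eta*ln p, for the three choices of eta
# (the event count d below is a hypothetical value for illustration).
import numpy as np

n, p, d = 200, 1000, 150
for label, eta in [("eta = 0.5", 0.5),
                   ("eta = 1 - ln d/(3 ln p)", 1 - np.log(d) / (3 * np.log(p))),
                   ("eta = 1", 1.0)]:
    print(label, "->", round(np.log(d) + 2 * eta * np.log(p), 2))
# Larger eta means a larger per-variable penalty, hence earlier stopping.
```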
Table 1:
Comparisons of methods under mild censoring.
Example | Method | (n, p) = (200,1000) | (n, p) = (400,1000) | ||||
---|---|---|---|---|---|---|---|
PIT | TP | FP | PIT | TP | FP | ||
1 (p0 = 4) | FR | 3.65 (0.52) | 3.61 (0.49) | 0.05 (0.21) | 3.99 (0.18) | 3.98 (0.14) | 0.01 (0.12) |
FR+Lasso | 3.65 (0.52) | 3.02 (0.15) | 0.63 (0.48) | 3.99 (0.18) | 3.01 (0.12) | 0.98 (0.14) | |
FR+MCP | 3.65 (0.52) | 3.02 (0.15) | 0.63 (0.48) | 3.99 (0.18) | 3.01 (0.12) | 0.98 (0.14) | |
FR+SCAD | 3.65 (0.52) | 3.02 (0.15) | 0.63 (0.48) | 3.99 (0.18) | 3.01 (0.12) | 0.98 (0.14) | |
FRx1 | 3.65 (0.52) | 3.61 (0.49) | 0.05 (0.21) | 3.99 (0.18) | 3.98 (0.14) | 0.01 (0.12) | |
FRx10 | 4.64 (0.52) | 3.60 (0.49) | 1.04 (0.20) | 4.99 (0.19) | 3.98 (0.15) | 1.01 (0.12) | |
FR(η1) | 5.04 (1.41) | 3.89 (0.32) | 1.16 (1.38) | 4.53 (0.85) | 4.00 (0.00) | 0.53 (0.85) | |
FR(η2) | 3.95 (0.62) | 3.77 (0.42) | 0.18 (0.45) | 4.08 (0.33) | 3.99 (0.09) | 0.09 (0.31) | |
PSIS | 38.00 (0.00) | 3.03 (0.55) | 34.97 (0.55) | 67.00 (0.00) | 3.39 (0.49) | 63.61 (0.49) | |
PSIS+Lasso | 4.44 (2.71) | 2.43 (0.93) | 2.01 (2.12) | 6.82 (2.70) | 3.13 (0.48) | 3.69 (2.64) | |
PSIS+MCP | 4.05 (2.14) | 2.47 (0.85) | 1.58 (1.67) | 4.81 (1.60) | 3.08 (0.47) | 1.74 (1.55) | |
PSIS+SCAD | 4.22 (2.55) | 2.36 (0.89) | 1.86 (1.98) | 5.33 (1.96) | 3.06 (0.53) | 2.27 (1.85) | |
CS | 38.00 (0.00) | 3.08 (0.48) | 34.92 (0.48) | 67.00 (0.00) | 3.23 (0.42) | 63.77 (0.42) | |
CS+Lasso | 4.63 (2.49) | 2.51 (0.77) | 2.12 (2.09) | 6.50 (2.45) | 2.95 (0.43) | 3.54 (2.35) | |
CS+MCP | 4.00 (1.86) | 2.44 (0.76) | 1.55 (1.47) | 4.78 (1.40) | 2.92 (0.37) | 1.86 (1.33) | |
CS+SCAD | 4.30 (2.23) | 2.37 (0.75) | 1.93 (1.81) | 5.39 (1.82) | 2.90 (0.44) | 2.48 (1.67) | |
2 (p0 = 6) | FR | 0.91 (0.30) | 5.89 (0.40) | 1.34 (1.40) | 1.00 (0.00) | 6.00 (0.00) | 1.01 (1.20) |
FR | 6.16 (0.47) | 6.00 (0.00) | 0.16 (0.47) | 6.02 (0.15) | 6.00 (0.00) | 0.02 (0.15) | |
FR+Lasso | 6.16 (0.47) | 6.00 (0.00) | 0.16 (0.47) | 6.02 (0.15) | 6.00 (0.00) | 0.02 (0.15) | |
FR+MCP | 6.16 (0.47) | 6.00 (0.00) | 0.16 (0.47) | 6.02 (0.15) | 6.00 (0.00) | 0.02 (0.15) | |
FR+SCAD | 6.08 (0.35) | 5.91 (0.32) | 0.16 (0.47) | 6.02 (0.15) | 6.00 (0.00) | 0.02 (0.15) | |
FRx1 | 6.08 (0.33) | 6.00 (0.00) | 0.08 (0.33) | 6.02 (0.15) | 6.00 (0.00) | 0.02 (0.15) | |
FRx10 | 7.13 (0.42) | 6.00 (0.00) | 1.13 (0.42) | 7.01 (0.13) | 6.00 (0.00) | 1.01 (0.13) | |
FR(η1) | 7.51 (1.78) | 6.00 (0.00) | 1.51 (1.78) | 6.78 (1.16) | 6.00 (0.00) | 0.78 (1.16) | |
FR(η2) | 6.34 (0.72) | 6.00 (0.00) | 0.34 (0.72) | 6.11 (0.40) | 6.00 (0.00) | 0.11 (0.40) | |
PSIS | 38.00 (0.00) | 3.42 (0.93) | 34.58 (0.93) | 67.00 (0.00) | 4.83 (0.53) | 62.17 (0.53) | |
PSIS+Lasso | 8.54 (4.97) | 3.71 (0.97) | 4.84 (4.66) | 8.74 (5.33) | 4.82 (0.54) | 3.92 (5.21) | |
PSIS+MCP | 5.75 (2.39) | 3.34 (1.01) | 2.42 (1.97) | 9.21 (3.05) | 4.80 (0.61) | 4.41 (2.77) | |
PSIS+SCAD | 6.73 (3.29) | 3.51 (0.95) | 3.21 (3.08) | 8.80 (4.80) | 4.77 (0.63) | 4.03 (4.56) | |
CS | 38.00 (0.00) | 4.45 (0.83) | 33.55 (0.83) | 67.00 (0.00) | 5.66 (0.53) | 61.34 (0.53) | |
CS+Lasso | 8.68 (4.49) | 4.55 (0.93) | 4.13 (4.07) | 11.60 (4.50) | 5.67 (0.53) | 5.93 (4.22) | |
CS+MCP | 6.41 (2.38) | 4.27 (1.09) | 2.14 (1.97) | 7.44 (2.20) | 5.64 (0.63) | 1.80 (2.46) | |
CS+SCAD | 6.68 (2.92) | 4.34 (0.98) | 2.34 (2.69) | 6.83 (2.54) | 5.63 (0.63) | 1.20 (2.71) | |
3 (p0 = 6) | FR | 5.24 (0.95) | 5.02 (1.03) | 0.22 (0.49) | 5.75 (0.45) | 5.74 (0.44) | 0.01 (0.12) |
FR+Lasso | 5.24 (0.95) | 5.19 (0.87) | 0.05 (0.25) | 5.75 (0.45) | 5.74 (0.44) | 0.01 (0.09) | |
FR+MCP | 5.24 (0.95) | 5.19 (0.87) | 0.05 (0.25) | 5.75 (0.45) | 5.74 (0.44) | 0.01 (0.09) | |
FR+SCAD | 5.21 (0.94) | 5.16 (0.87) | 0.05 (0.25) | 5.75 (0.45) | 5.74 (0.44) | 0.01 (0.09) | |
FRx1 | 5.25 (0.85) | 5.10 (0.84) | 0.16 (0.40) | 5.75 (0.45) | 5.74 (0.44) | 0.01 (0.12) | |
FRx10 | 6.05 (1.28) | 4.87 (1.28) | 1.17 (0.41) | 6.75 (0.45) | 5.73 (0.44) | 1.01 (0.12) | |
FR(η1) | 6.84 (1.87) | 5.38 (0.93) | 1.45 (1.70) | 6.43 (0.87) | 5.93 (0.26) | 0.50 (0.84) | |
FR(η2) | 5.60 (1.09) | 5.18 (1.00) | 0.41 (0.73) | 5.93 (0.48) | 5.85 (0.36) | 0.07 (0.33) | |
PSIS | 38.00 (0.00) | 2.57 (0.72) | 35.43 (0.72) | 67.00 (0.00) | 3.35 (0.58) | 63.65 (0.58) | |
PSIS+Lasso | 3.43 (2.69) | 2.06 (1.05) | 1.37 (1.96) | 6.18 (2.77) | 3.22 (0.56) | 2.96 (2.59) | |
PSIS+MCP | 3.04 (1.64) | 2.21 (0.90) | 0.83 (1.16) | 3.51 (0.95) | 3.11 (0.49) | 0.40 (0.79) | |
PSIS+SCAD | 3.64 (2.24) | 2.21 (0.99) | 1.43 (1.61) | 3.91 (1.18) | 3.13 (0.49) | 0.78 (1.04) | |
CS | 38.00 (0.00) | 3.14 (0.68) | 34.86 (0.68) | 67.00 (0.00) | 4.17 (0.62) | 62.83 (0.62) | |
CS+Lasso | 4.74 (3.02) | 2.67 (1.18) | 2.07 (2.21) | 7.97 (3.03) | 4.05 (0.65) | 3.93 (2.77) | |
CS+MCP | 3.76 (1.67) | 2.84 (0.80) | 0.93 (1.39) | 4.31 (0.97) | 3.94 (0.58) | 0.38 (0.79) | |
CS+SCAD | 4.62 (2.16) | 2.86 (0.95) | 1.76 (1.69) | 4.66 (1.08) | 3.95 (0.58) | 0.71 (0.98) |
NOTE: FR, forward regression; PSIS, the principled sure independence screening; CS, the conditional screening with the given conditioning variable; PIT, estimated probability of including all true predictors in the selected predictors; TP, average number of true positives; FP, average number of false positives; p0 denotes the number of true signals; numbers in parentheses are standard deviations. We used η1 = 0.5 and η2 = 1 – ln d/(3 ln p); when not otherwise noted, η = 1 was used. FRS0 denotes FR performed with the initial set S0; we considered two initial sets, x1 and x10.
Table 3:
Comparisons of methods under covariate-dependent censoring
Example | Method | (n, p) = (200, 1000) | (n, p) = (400, 1000) | ||||
---|---|---|---|---|---|---|---|
PIT | TP | FP | PIT | TP | FP | ||
1* (p0 = 4) | FR | 3.58 (0.51) | 3.55 (0.50) | 0.03 (0.17) | 3.99 (0.15) | 3.98 (0.13) | 0.00 (0.06) |
FR+Lasso | 3.58 (0.51) | 3.01 (0.09) | 0.57 (0.50) | 3.99 (0.15) | 3.00 (0.06) | 0.98 (0.13) | |
FR+MCP | 3.58 (0.51) | 3.01 (0.09) | 0.57 (0.50) | 3.99 (0.15) | 3.00 (0.06) | 0.98 (0.13) | |
FR+SCAD | 3.58 (0.51) | 3.01 (0.09) | 0.57 (0.50) | 3.99 (0.15) | 3.00 (0.06) | 0.98 (0.13) | |
FRx1 | 3.58 (0.51) | 3.55 (0.50) | 0.03 (0.17) | 3.99 (0.15) | 3.98 (0.13) | 0.00 (0.06) | |
FRx10 | 4.58 (0.51) | 3.56 (0.50) | 1.03 (0.17) | 4.98 (0.14) | 3.98 (0.13) | 1.00 (0.04) | |
FR(η1) | 5.05 (1.56) | 3.84 (0.37) | 1.21 (1.53) | 4.56 (0.85) | 4.00 (0.00) | 0.56 (0.85) | |
FR(η2) | 3.89 (0.64) | 3.70 (0.46) | 0.19 (0.48) | 4.08 (0.32) | 4.00 (0.06) | 0.09 (0.31) | |
PSIS | 38.00 (0.00) | 2.96 (0.51) | 35.04 (0.51) | 67.00 (0.00) | 3.41 (0.50) | 63.59 (0.50) | |
PSIS+Lasso | 4.39 (2.82) | 2.29 (0.96) | 2.11 (2.24) | 6.58 (2.63) | 3.06 (0.59) | 3.53 (2.53) | |
PSIS+MCP | 4.00 (2.15) | 2.37 (0.90) | 1.63 (1.70) | 4.81 (1.60) | 3.08 (0.54) | 1.73 (1.55) | |
PSIS+SCAD | 4.30 (2.54) | 2.27 (0.91) | 2.03 (2.01) | 5.34 (1.91) | 3.04 (0.59) | 2.30 (1.83) | |
CS | 38.00 (0.00) | 3.00 (0.49) | 35.00 (0.49) | 67.00 (0.00) | 3.25 (0.43) | 63.75 (0.43) | |
CS+Lasso | 4.50 (2.67) | 2.36 (0.77) | 2.14 (2.28) | 6.50 (2.44) | 2.92 (0.41) | 3.58 (2.33) | |
CS+MCP | 4.01 (2.00) | 2.38 (0.75) | 1.63 (1.65) | 4.72 (1.45) | 2.91 (0.35) | 1.81 (1.38) | |
CS+SCAD | 4.27 (2.39) | 2.29 (0.76) | 1.98 (1.96) | 5.45 (1.87) | 2.89 (0.39) | 2.57 (1.77) | |
2* (p0 = 6) | FR | 6.13 (0.39) | 6.00 (0.04) | 0.13 (0.39) | 6.01 (0.09) | 6.00 (0.00) | 0.01 (0.09) |
FR+Lasso | 6.12 (0.38) | 5.99 (0.09) | 0.13 (0.39) | 6.01 (0.09) | 6.00 (0.00) | 0.01 (0.09) | |
FR+MCP | 6.12 (0.38) | 5.99 (0.09) | 0.13 (0.39) | 6.01 (0.09) | 6.00 (0.00) | 0.01 (0.09) | |
FR+SCAD | 6.04 (0.25) | 5.91 (0.30) | 0.13 (0.39) | 6.01 (0.09) | 6.00 (0.00) | 0.01 (0.09) | |
FRx1 | 6.06 (0.26) | 6.00 (0.04) | 0.06 (0.25) | 6.01 (0.09) | 6.00 (0.00) | 0.01 (0.09) | |
FRx10 | 7.09 (0.32) | 6.00 (0.04) | 1.09 (0.31) | 7.01 (0.11) | 6.00 (0.00) | 1.01 (0.11) | |
FR(η1) | 7.55 (1.72) | 6.00 (0.00) | 1.55 (1.72) | 6.68 (1.05) | 6.00 (0.00) | 0.68 (1.05) | |
FR(η2) | 6.32 (0.64) | 6.00 (0.00) | 0.32 (0.64) | 6.11 (0.39) | 6.00 (0.00) | 0.11 (0.39) | |
PSIS | 38.00 (0.00) | 3.44 (0.92) | 34.56 (0.92) | 67.00 (0.00) | 4.80 (0.59) | 62.20 (0.59) | |
PSIS+Lasso | 8.55 (5.16) | 3.63 (1.00) | 4.92 (4.81) | 9.14 (5.81) | 4.79 (0.58) | 4.34 (5.65) | |
PSIS+MCP | 5.50 (2.47) | 3.25 (1.04) | 2.24 (2.05) | 9.01 (2.91) | 4.76 (0.67) | 4.25 (2.63) | |
PSIS+SCAD | 6.27 (3.19) | 3.42 (0.99) | 2.86 (2.98) | 8.67 (4.77) | 4.73 (0.68) | 3.95 (4.50) | |
CS | 38.00 (0.00) | 4.40 (0.89) | 33.60 (0.89) | 67.00 (0.00) | 5.66 (0.53) | 61.34 (0.53) | |
CS+Lasso | 8.74 (4.59) | 4.47 (0.97) | 4.27 (4.22) | 11.42 (4.57) | 5.66 (0.55) | 5.76 (4.32) | |
CS+MCP | 6.19 (2.39) | 4.12 (1.15) | 2.06 (1.95) | 7.30 (2.11) | 5.62 (0.62) | 1.67 (2.37) | |
CS+SCAD | 6.54 (2.98) | 4.22 (1.08) | 2.33 (2.76) | 6.91 (2.81) | 5.61 (0.66) | 1.30 (2.98) | |
3* (p0 = 6) | FR | 5.11 (1.02) | 4.87 (1.13) | 0.24 (0.51) | 5.72 (0.47) | 5.70 (0.46) | 0.02 (0.13) |
FR+Lasso | 5.11 (1.02) | 5.06 (0.96) | 0.04 (0.22) | 5.72 (0.47) | 5.71 (0.45) | 0.01 (0.10) | |
FR+MCP | 5.11 (1.02) | 5.06 (0.96) | 0.04 (0.22) | 5.72 (0.47) | 5.71 (0.45) | 0.01 (0.10) | |
FR+SCAD | 5.08 (1.01) | 5.03 (0.94) | 0.04 (0.22) | 5.72 (0.47) | 5.71 (0.45) | 0.01 (0.10) | |
FRx1 | 5.11 (0.98) | 4.92 (1.00) | 0.19 (0.46) | 5.72 (0.47) | 5.70 (0.46) | 0.02 (0.13) | |
FRx10 | 5.74 (1.49) | 4.59 (1.50) | 1.15 (0.38) | 6.72 (0.49) | 5.70 (0.46) | 1.03 (0.17) | |
FR(η1) | 7.11 (1.96) | 5.38 (0.87) | 1.73 (1.84) | 6.60 (1.00) | 5.91 (0.28) | 0.69 (0.96) | |
FR(η2) | 5.51 (1.09) | 5.07 (1.02) | 0.44 (0.73) | 5.97 (0.54) | 5.83 (0.37) | 0.14 (0.39) | |
PSIS | 38.00 (0.00) | 2.55 (0.72) | 35.45 (0.72) | 67.00 (0.00) | 3.34 (0.63) | 63.66 (0.63) | |
PSIS+Lasso | 3.08 (2.58) | 1.85 (1.10) | 1.23 (1.79) | 5.93 (2.76) | 3.19 (0.70) | 2.74 (2.46) | |
PSIS+MCP | 3.10 (1.84) | 2.10 (0.96) | 1.00 (1.37) | 3.48 (1.06) | 3.05 (0.57) | 0.43 (0.87) | |
PSIS+SCAD | 3.38 (2.40) | 2.02 (1.11) | 1.36 (1.67) | 4.12 (1.48) | 3.11 (0.58) | 1.01 (1.37) | |
CS | 38.00 (0.00) | 3.12 (0.70) | 34.88 (0.70) | 67.00 (0.00) | 4.10 (0.68) | 62.90 (0.68) | |
CS+Lasso | 4.47 (3.28) | 2.53 (1.21) | 1.94 (2.42) | 7.88 (3.17) | 3.96 (0.69) | 3.92 (2.89) | |
CS+MCP | 3.97 (1.99) | 2.77 (0.89) | 1.20 (1.62) | 4.35 (1.07) | 3.86 (0.62) | 0.50 (0.87) | |
CS+SCAD | 4.60 (2.57) | 2.72 (1.05) | 1.87 (1.97) | 4.83 (1.28) | 3.87 (0.63) | 0.96 (1.13) |
We considered p = 1000 and two sample sizes, viz. n ∈ {200, 400}. The survival time was generated from a Cox model λ(t|X) = λ0(t)exp(β⊺X) with a Weibull baseline hazard. Specifically, λ0(t) = αγtγ−1, with α = 1 and γ = 1.5. We considered various models for X and different parameter configurations for β in the following examples. The censoring time was independently generated from a uniform distribution U(0, c). We varied c for each example in order to yield mild (around 25%) and heavy (around 50%) censoring proportions. For each configuration, a total of 500 simulated datasets were generated.
Example 1. We chose β = (1, 0.5, −1, 0, 1, 0p−5)⊺ and generated X from a multivariate normal distribution where the mean was 0, the variance 1, and cor(Xj, Xj′) = 0.5^|j−j′|.
Example 2. We set β = (1, 1, 1, 1, 1, −2.5, 0p−6)⊺ and generated X from a multivariate normal distribution with mean 0, variance 1, and cor(Xj, Xj′) = 0.5 for j ≠ j′. In this case, since cov(T, X6) = 0, X6 has a lower marginal utility than all of the noise variables, for which cov(T, Xj) = 1.25 for j > 6.
Example 3. We let β = β(ν) with ν = 0.5. We generated X from a multivariate normal distribution with mean 0, variance 1, and cor(Xj, Xj′) = 0.5^|j−j′|. In this case, since cov(T, X6) = 0, X6 is an active but hidden variable. Furthermore, the signals of the active variables are weak due to signal cancellation.
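For reference, the sketch below shows how such data can be generated for Example 1: survival times are drawn from the Cox model with Weibull baseline by inverse transform, and the censoring bound c is an assumed value to be tuned toward the target censoring rate.

```python
# Sketch of the simulation design (Example 1): survival times from a Cox
# model with Weibull baseline hazard alpha*gamma*t^(gamma-1) via inverse
# transform; uniform censoring U(0, c). The value of c is an assumption.
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha, gamma, c = 200, 1000, 1.0, 1.5, 5.0

beta = np.zeros(p)
beta[[0, 1, 2, 4]] = [1.0, 0.5, -1.0, 1.0]     # beta = (1, 0.5, -1, 0, 1, 0, ...)

# X ~ N(0, Sigma) with Sigma_{jk} = 0.5^{|j-k|}, via an AR(1) recursion
X = np.empty((n, p))
X[:, 0] = rng.standard_normal(n)
for j in range(1, p):
    X[:, j] = 0.5 * X[:, j - 1] + np.sqrt(0.75) * rng.standard_normal(n)

# Lambda0(t) = alpha*t^gamma, so T = (E*exp(-X beta)/alpha)^(1/gamma), E ~ Exp(1)
T = (rng.exponential(size=n) * np.exp(-X @ beta) / alpha) ** (1 / gamma)
C = rng.uniform(0, c, size=n)
Y, delta = np.minimum(T, C), (T <= C).astype(int)
print("censoring proportion:", 1 - delta.mean())
```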
Tables 1–2 report the average of the estimated probabilities of including the true models (PIT), the average numbers of true positives (TP) and false positives (FP), and their standard deviations in parentheses, under mild and heavy censoring, respectively. We use p0 to denote the number of true signals. The proposed method performed competitively, as detailed below.
Table 2:
Comparisons of methods under heavy censoring.
Example | Method | (n, p) = (200, 1000) | (n, p) = (400, 1000) | ||||
---|---|---|---|---|---|---|---|
PIT | TP | FP | PIT | TP | FP | ||
1 (p0 = 4) | FR | 3.33 (0.54) | 3.29 (0.49) | 0.04 (0.22) | 3.86 (0.39) | 3.85 (0.36) | 0.01 (0.13) |
FR+Lasso | 3.33 (0.54) | 3.00 (0.16) | 0.33 (0.48) | 3.86 (0.39) | 3.01 (0.10) | 0.85 (0.36) | |
FR+MCP | 3.33 (0.54) | 3.00 (0.16) | 0.33 (0.48) | 3.86 (0.39) | 3.01 (0.10) | 0.85 (0.36) | |
FR+SCAD | 3.33 (0.54) | 3.00 (0.16) | 0.33 (0.48) | 3.86 (0.39) | 3.01 (0.10) | 0.85 (0.36) | |
FRx1 | 3.33 (0.54) | 3.29 (0.49) | 0.04 (0.22) | 3.86 (0.39) | 3.85 (0.36) | 0.01 (0.13) | |
FRx10 | 4.35 (0.55) | 3.29 (0.49) | 1.05 (0.26) | 4.85 (0.39) | 3.84 (0.37) | 1.01 (0.13) | |
FR(η1) | 5.24 (1.98) | 3.62 (0.49) | 1.62 (1.88) | 4.71 (1.04) | 3.98 (0.15) | 0.74 (1.02) | |
FR(η2) | 3.64 (0.72) | 3.45 (0.51) | 0.19 (0.49) | 4.02 (0.42) | 3.93 (0.26) | 0.09 (0.33) | |
PSIS | 38.00 (0.00) | 2.93 (0.59) | 35.07 (0.59) | 67.00 (0.00) | 3.37 (0.49) | 63.63 (0.49) | |
PSIS+Lasso | 3.58 (2.57) | 2.02 (0.92) | 1.55 (1.98) | 5.92 (2.46) | 2.91 (0.58) | 3.01 (2.26) | |
PSIS+MCP | 3.56 (2.23) | 2.13 (0.87) | 1.43 (1.74) | 4.70 (1.60) | 2.92 (0.53) | 1.79 (1.49) | |
PSIS+SCAD | 3.66 (2.48) | 2.01 (0.86) | 1.65 (1.96) | 5.22 (2.17) | 2.85 (0.64) | 2.37 (1.92) | |
CS | 38.00 (0.00) | 2.90 (0.55) | 35.10 (0.55) | 67.00 (0.00) | 3.27 (0.45) | 63.73 (0.45) | |
CS+Lasso | 3.60 (2.29) | 2.09 (0.83) | 1.52 (1.81) | 5.69 (2.51) | 2.80 (0.56) | 2.89 (2.30) | |
CS+MCP | 3.36 (1.91) | 2.07 (0.79) | 1.28 (1.50) | 4.70 (1.66) | 2.83 (0.47) | 1.87 (1.51) | |
CS+SCAD | 3.67 (2.19) | 2.04 (0.78) | 1.63 (1.74) | 5.21 (2.29) | 2.74 (0.56) | 2.48 (2.03) | |
2 (p0 = 6) | FR | 6.22 (0.65) | 5.85 (0.61) | 0.36 (0.74) | 6.03 (0.18) | 6.00 (0.00) | 0.03 (0.18) |
FR+Lasso | 6.21 (0.65) | 5.95 (0.23) | 0.26 (0.60) | 6.03 (0.18) | 6.00 (0.00) | 0.03 (0.18) | |
FR+MCP | 6.21 (0.65) | 5.95 (0.23) | 0.26 (0.60) | 6.03 (0.18) | 6.00 (0.00) | 0.03 (0.18) | |
FR+SCAD | 6.07 (0.5) | 5.81 (0.45) | 0.26 (0.60) | 6.02 (0.17) | 6.00 (0.06) | 0.03 (0.18) | |
FRx1 | 6.10 (0.45) | 5.94 (0.33) | 0.16 (0.46) | 6.02 (0.17) | 6.00 (0.00) | 0.02 (0.17) | |
FRx10 | 7.14 (0.58) | 5.88 (0.49) | 1.25 (0.60) | 7.03 (0.17) | 6.00 (0.00) | 1.03 (0.17) | |
FR(η1) | 8.36 (2.55) | 5.97 (0.27) | 2.39 (2.56) | 6.98 (1.31) | 6.00 (0.00) | 0.98 (1.31) | |
FR(η2) | 6.5 (0.90) | 5.93 (0.44) | 0.57 (0.95) | 6.14 (0.43) | 6.00 (0.00) | 0.14 (0.43) | |
PSIS | 38.00 (0.00) | 3.13 (0.94) | 34.87 (0.94) | 67.00 (0.00) | 4.61 (0.67) | 62.39 (0.67) | |
PSIS+Lasso | 8.17 (4.84) | 3.38 (1.05) | 4.80 (4.46) | 8.98 (5.70) | 4.62 (0.70) | 4.36 (5.48) | |
PSIS+MCP | 5.34 (2.38) | 2.97 (1.09) | 2.37 (1.95) | 7.82 (2.80) | 4.47 (0.82) | 3.35 (2.41) | |
PSIS+SCAD | 6.52 (3.31) | 3.17 (1.06) | 3.36 (3.02) | 7.45 (4.02) | 4.48 (0.77) | 2.97 (3.73) | |
CS | 38.00 (0.00) | 4.04 (0.87) | 33.96 (0.87) | 67.00 (0.00) | 5.42 (0.64) | 61.58 (0.64) | |
CS+Lasso | 8.41 (4.48) | 4.13 (0.99) | 4.29 (4.07) | 10.64 (4.83) | 5.41 (0.66) | 5.23 (4.59) | |
CS+MCP | 5.81 (2.35) | 3.68 (1.13) | 2.12 (1.84) | 7.53 (2.29) | 5.33 (0.78) | 2.20 (2.44) | |
CS+SCAD | 6.75 (3.19) | 3.88 (1.07) | 2.87 (2.81) | 7.13 (3.07) | 5.32 (0.79) | 1.81 (3.17) | |
3 (p0 = 6) | FR | 4.45 (1.51) | 4.08 (1.64) | 0.37 (0.61) | 5.49 (0.54) | 5.46 (0.50) | 0.03 (0.21) |
FR+Lasso | 4.45 (1.51) | 4.40 (1.44) | 0.05 (0.24) | 5.49 (0.54) | 5.48 (0.50) | 0.01 (0.16) | |
FR+MCP | 4.45 (1.51) | 4.40 (1.44) | 0.05 (0.24) | 5.49 (0.54) | 5.48 (0.50) | 0.01 (0.16) | |
FR+SCAD | 4.44 (1.50) | 4.39 (1.43) | 0.05 (0.24) | 5.49 (0.54) | 5.48 (0.50) | 0.01 (0.16) | |
FRx1 | 4.57 (1.39) | 4.30 (1.44) | 0.26 (0.50) | 5.49 (0.54) | 5.46 (0.50) | 0.03 (0.21) | |
FRx10 | 4.97 (1.91) | 3.72 (1.88) | 1.24 (0.48) | 6.48 (0.53) | 5.45 (0.50) | 1.03 (0.21) | |
FR(η1) | 6.99 (2.59) | 4.72 (1.40) | 2.27 (2.29) | 6.55 (1.18) | 5.76 (0.43) | 0.80 (1.09) | |
FR(η2) | 4.93 (1.62) | 4.35 (1.56) | 0.58 (0.84) | 5.74 (0.63) | 5.60 (0.49) | 0.14 (0.42) | |
PSIS | 38.00 (0.00) | 2.32 (0.76) | 35.68 (0.76) | 67.00 (0.00) | 3.20 (0.61) | 63.8 (0.61) | |
PSIS+Lasso | 2.17 (2.01) | 1.46 (1.01) | 0.71 (1.28) | 5.24 (2.50) | 2.94 (0.73) | 2.30 (2.17) | |
PSIS+MCP | 2.66 (1.86) | 1.81 (0.98) | 0.86 (1.25) | 3.44 (1.16) | 2.88 (0.54) | 0.56 (1.02) | |
PSIS+SCAD | 2.78 (2.35) | 1.66 (1.08) | 1.11 (1.55) | 4.23 (1.53) | 2.92 (0.59) | 1.31 (1.37) | |
CS | 38.00 (0.00) | 2.87 (0.71) | 35.13 (0.71) | 67.00 (0.00) | 3.88 (0.69) | 63.12 (0.69) | |
CS+Lasso | 3.34 (2.81) | 2.03 (1.17) | 1.31 (1.92) | 7.16 (3.05) | 3.71 (0.75) | 3.45 (2.71) | |
CS+MCP | 3.45 (1.88) | 2.44 (0.98) | 1.00 (1.34) | 4.19 (1.15) | 3.62 (0.64) | 0.57 (0.92) | |
CS+SCAD | 4.13 (2.60) | 2.42 (1.14) | 1.71 (1.82) | 4.80 (1.36) | 3.66 (0.66) | 1.14 (1.24) |
First, Example 1 was designed in such a way that all of the true signals have nonzero marginal correlations with the outcome and are detectable by marginal screening methods. In particular, the dependence among the active variables in Example 1 can further strengthen the marginal correlations between them and the outcome. Even in these settings, FR performed better than the marginal screening methods, including the conditional screening method, with larger PIT and TP and smaller FP. When the sample size decreased to 200 or with more censored events, FR’s performance was still decent, with, for example, a PIT around 0.8. In contrast, the performances of PSIS and CS deteriorated quickly with smaller sample sizes or with more censoring.
Second, Examples 2 and 3 were designed so that X6, though active, has a zero marginal correlation with the outcome and, therefore, is not detectable by marginal screening methods. As a result, PSIS and CS failed in this challenging situation. In contrast, FR remained competitive and was able to detect the hidden variable; it even outperformed the conditional screening that used the prior information.
Third, even with Lasso, SCAD, or MCP applied to reduce false positives, the screening methods still incurred many false positives; the final models selected under the three penalties differed only slightly and performed similarly. In contrast, with the EBIC-based stopping rule, the proposed FR produced far fewer false positives without the help of any penalty.
Finally, the simulation results suggest that the performance of FR is robust to the choice of the initial set. Even when the “wrong” set was employed as the initial set, the TP was almost the same as when starting from the null set.
We further explored the robustness of the method to violation of the independent censoring assumption. We considered Examples 1*–3*, which have the same setup as Examples 1–3, except that the underlying survival and censoring times share a common latent variable b, generated from the standard Gaussian distribution, and the censoring times Ci are covariate-dependent. That is,
where α = (0.75, 0.75, 0p−2)⊺ and c was chosen to censor approximately 25% of the observations. The results are documented in Table 3.
We found that, with dependent censoring, the performance of all of the methods deteriorated a bit across the board. However, our proposed method still outperformed all the other methods, hinting at the usability of the proposal under dependent censoring. More work is warranted.
5. Analysis of the Boston Lung Cancer Survival Cohort (BLCSC) study
Recent studies demonstrate that aberrant methylation may be the most common mechanism of inactivating cancer-related genes in lung cancer. It occurs in the smoking-damaged bronchial epithelium from cancer-free individuals, can be reversed in vitro by demethylating agents, and may be a useful biomarker for lung cancer risk assessment [40]. It is thus of substantial interest to identify the methylations that play an important role in the pathogenesis of lung cancer, which affects patients’ overall survival.
The motivating data represented a subset of the Boston Lung Cancer Survival Cohort (BLCSC) and included 124 samples, each with 442,613 methylations. The median follow-up time of the subjects was 6.2 years, during which 84 deaths were observed. Each methylation resides within a certain gene. Prior literature has suggested that the following genes are associated with the development of lung cancer: ROS1, RET, PIK3CA, NRAS, BRAF, ALK, AKT1, VGLL2, MET, KRAS, EGFR, KDM4, ST3GAL3, and CDH13. We used the array annotations from the Bioconductor package FDb.InfiniumMethylation.hg19 (version 2.2.0) to identify methylations that lie within these genes; a total of 589 methylations were identified. The other available environmental exposure and demographic variables in the data included lifetime tobacco exposure (SMOK), computed by multiplying the number of packs of cigarettes smoked per day by the number of years the person had smoked until the beginning of the study; AGE, the age at diagnosis in years; and SEX (1 = male; 0 = female).
Our analytical goal was to explore which methylations, and which of their interactions with demographic and environmental exposure variables, might be related to patients’ overall survival. Thus, the outcome was the time to death, while the candidate predictors included the aforementioned demographic information, environmental exposures, and methylations and their interactions, for a total of 2,359 variables. We applied FR to the dataset and identified cg04187088×SEX and cg14363146×SMOK.
To check the model adequacy for the final model obtained by FR, we plotted the Cox-Snell residuals based on the final model that includes two predictors, cg04187088×SEX and cg14363146×SMOK. Figure 1 shows that the final model fits the data reasonably well.
Figure 1:
Cox–Snell residual plot.
Furthermore, we conducted the score test for the scaled Schoenfeld residuals to test the proportional hazards assumption on each included predictor in the final model. We obtained p-values of 0.542 and 0.670 for cg04187088 × SEX and cg14363146 × SMOK, respectively. It appears that the proportional hazards assumption is not rejected for either of them.
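These checks can be reproduced along the following lines. The sketch assumes a data frame df with columns time, status, and the two selected interaction terms, and uses the lifelines package; the Cox–Snell residuals are derived from martingale residuals, since lifelines does not report them directly.

```python
# Sketch of the model checks, assuming a data frame `df` with columns
# "time", "status", and the two selected interaction terms; uses lifelines.
from lifelines import CoxPHFitter, NelsonAalenFitter
from lifelines.statistics import proportional_hazard_test

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="status")

# Score tests based on the scaled Schoenfeld residuals (one per covariate).
print(proportional_hazard_test(cph, df, time_transform="rank").summary)

# Cox-Snell residuals from martingale residuals: r_i = delta_i - M_i.
# If the model fits, the r_i behave like censored unit-exponential data, so
# their Nelson-Aalen cumulative hazard should track the 45-degree line.
mart = cph.compute_residuals(df, kind="martingale")["martingale"]
cox_snell = df["status"] - mart
NelsonAalenFitter().fit(cox_snell, event_observed=df["status"]).plot()
```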
We also applied competing methods to the BLCSC dataset, including PSIS, CS, and the survival impact index (SII) [21]. For each competing method, we selected the top ⌊n/ln(n)⌋ = 25 genes and compared them with the genes selected by FR.
In terms of computing time, PSIS, CS, SII, and FR took 2.47, 2.52, 1473.65, and 20.89 seconds, respectively. Due to the sequential nature of the proposed procedure, FR is understandably more computationally intensive than the marginal screening approaches such as PSIS and CS. However, FR appears to run faster than SII, a nonparametric approach.
Table 4 lists the overlapping genes across the four methods. It appears that two genes selected by FR did not overlap with any genes selected by PSIS, CS and SII.
Table 4:
The number of overlapped genes chosen by PSIS, CS, SII and FR
PSIS | CS | SII | FR | |
---|---|---|---|---|
PSIS | 25 | 1 | 3 | 0 |
CS | 1 | 25 | 0 | 0 |
SII | 3 | 0 | 25 | 0 |
FR | 0 | 0 | 0 | 2 |
In addition, using SMOK, AGE, and SEX as the initial set, FR further selected cg11704212, on top of the identified interactions. Our PubMed review did not find prior literature discussing these methylation markers, which may indicate the ability of the proposed FR to identify novel biomarkers not detected by the existing methods. Future studies are warranted to confirm and study the functionality of these detected biomarkers.
To elucidate the identified effects, we further conducted Kaplan–Meier analyses. We first dichotomized each methylation at its median value, labeling a subject “+” or “−” according to whether the methylation value is above or below the median. Figure 2 compares survival curves across the resulting subgroups. Figure 2(a) clearly indicates that female patients with low cg04187088 had a higher survival probability than the other comparison groups, while Figure 2(b) reveals that heavy smokers with high cg14363146 had a much lower survival probability than the other comparison groups. Figure 2(c) also shows that patients with high cg11704212 had a higher survival probability than those with low cg11704212.
Figure 2:
Kaplan–Meier plots to illustrate the main and interaction effects identified by the proposed method.
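A sketch of the median-split Kaplan–Meier comparison underlying Figure 2(a) is given below; column names are illustrative.

```python
# Sketch of the median-split Kaplan-Meier comparison in Figure 2(a);
# column names are illustrative.
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

high = df["cg04187088"] > df["cg04187088"].median()   # "+" vs "-" groups
ax = plt.subplot(111)
for sex, sex_lab in [(0, "female"), (1, "male")]:
    for hi, m_lab in [(False, "cg04187088-"), (True, "cg04187088+")]:
        grp = df[(df["SEX"] == sex) & (high == hi)]
        KaplanMeierFitter().fit(grp["time"], grp["status"],
                                label=f"{sex_lab}, {m_lab}").plot_survival_function(ax=ax)
plt.show()
```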
6. Concluding remarks
This article has proposed forward regression with partial likelihood for high-dimensional survival data and has obtained computationally and theoretically useful results. We envision that the established theoretical framework will facilitate the extension of the procedure to other general settings, such as generalized linear models.
To further improve the computational efficiency, we can consider a natural extension of the proposed forward selection in the spirit of boosting [11]. That is, at each step, we use the selected variables and the obtained coefficient estimates to construct an offset, and search for a variable that maximizes the partial likelihood with such an offset. The advantage is that we need to maximize the partial likelihood with respect to only one covariate at each step, which may enhance computational efficiency. Specifically, we denote the log partial likelihood with an offset term O and a single covariate j in the model by

ℓO,j(βj) = (1/n) ∑i=1n ∫0τ [Oi + βjXij − ln{(1/n) ∑ℓ=1n Yℓ(t) exp(Oℓ + βjXℓj)}] dNi(t).
Here Oi refers to the offset term O evaluated at the ith subject. We let O(k) be the offset term evaluated at kth step and Sk be the set of indices of covariates selected up to the kth step.
For FR, we initialize S0 = Ø and set O(0) = 0. For j ∈ {1, … , p}, compute ℓO(0),j = maxβj ℓO(0),j(βj), with β̂j the maximizer. Then j1 = arg maxj∈{1,…,p} ℓO(0),j. Now set Oi(1) = Oi(0) + β̂j1Xij1 and S1 = {j1}. Given O(k) and Sk, compute ℓO(k),j = maxβj ℓO(k),j(βj) for j ∈ SkC. Then jk+1 = arg maxj∈SkC ℓO(k),j. Now set Oi(k+1) = Oi(k) + β̂jk+1Xijk+1 and Sk+1 = Sk ∪ {jk+1}. We note this proposal does not require re-estimation of the coefficients of the covariates selected in the previous steps, which expedites computation; we will explore this further. A sketch follows below.
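The following is a minimal sketch of one offset-based forward step, implementing the one-dimensional partial likelihood maximization directly in NumPy/SciPy (Breslow-type handling of ties); all function names are ours.

```python
# Sketch of one offset-based forward step: maximize the log partial
# likelihood in a single coefficient b for covariate x, holding the current
# fit as an offset O. Plain NumPy/SciPy; Breslow-type handling of ties.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_loglik_offset(b, x, offset, time, event):
    eta = offset + b * x
    order = np.argsort(-time)                    # decreasing time
    eta_s = eta[order]
    ev = np.asarray(event)[order].astype(bool)
    # log of the running risk-set sums: log sum_{l: Y_l >= Y_i} exp(eta_l)
    log_risk = np.logaddexp.accumulate(eta_s)
    return -(eta_s[ev] - log_risk[ev]).sum()

def offset_forward_step(X, offset, time, event, active):
    best = (np.inf, None, 0.0)
    for j in range(X.shape[1]):
        if j in active:
            continue
        res = minimize_scalar(neg_loglik_offset,
                              args=(X[:, j], offset, time, event))
        if res.fun < best[0]:
            best = (res.fun, j, res.x)
    _, j_new, b_new = best
    # Only the new one-dimensional term is folded into the offset; earlier
    # coefficients are not re-estimated.
    return j_new, offset + b_new * X[:, j_new]
```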
We employed a modified EBIC to select the final models. Although it worked well in our simulations, it tends to be conservative in real data analysis and recruits too few variables, whereas BIC recruits too many variables. It would be interesting to investigate the optimal η in the EBIC penalty term to strike a balance between EBIC and BIC.
Appendix
A. Preliminary lemmas
We present some preliminary lemmas in this section. Given an index set S ⊂ {1,…,p}, we use S+r to denote S ∪ {r} for some r ∈ SC.
Lemma A. Condition (G) is satisfied if .
Without loss of generality, we assume that Xr is the last element of XS+r, with β*S+r,r being the corresponding least false coefficient under the model S+r.
Lemma B. Given S ⊂ {1,…,p} satisfying |S| < ρ and r ∈ SC, the following statements hold under Conditions (B) and (G).
(i) If r ∉ S⋆ and the {Xj : j ∉ S⋆} are independent of {Xj : j ∈ S⋆}, then β*S+r,r = 0, i.e., the least false coefficient for Xr under the model S+r is 0.
(ii) If r ∈ S⋆ ∖ S, then β*S+r,r ≠ 0.
(iii) If r ∈ S⋆ ∖ S and Conditions (E) and (F) are satisfied, then |β*S+r,r| ≥ c1n^{−α}.
Lemma B quantifies the behavior of the least false coefficient β*S+r,r when a noise variable or a signal variable is added to the model.
Lemma C. Under Conditions (B) and (D), if ρ^3 ln p/n → 0, then for each S ⊂ {1,…, p} satisfying |S| ≤ ρ, we can find a neighborhood of β*S of radius c, for some constant c, such that, for k ∈ {0, 1, 2},
Define
Lemma D. Under the same conditions as in Lemma C,
where
Define
Lemma E. Under Conditions (A), (B), and (D), we have
for some constant A10 that does not depend on n.
Lemma F. Under Conditions (B), (D), and (F), given an index set S satisfying |S| ≤ ρ, for any βS with ‖βS − β*S‖ ≤ ζ,
Lemma G. Under Conditions (A)–(F), if ρ^4 ln p/n → 0, there exist two constants A11 and A12 that do not depend on n such that
(i)
and
(ii)
Lemma H. Under Conditions (B) and (D), there exists some constant A14, which does not depend on n, such that
B. Proofs of main theoretical results in Section 3.2
In the following, we provide the proofs for the theoretical results in Section 3.2.
Proof of Theorem 1. We prove the theorem for a generic index set S satisfying S ⊂ {1,…, p}, and |S| < ρ. The change of log-likelihood by adding a variable Xr, where r ∈ SC, can be decomposed as
We restrict our attention to the intersection of the following two events, where
and
According to Lemmas G and H, this intersection holds with probability at least 1 – 8 exp(−3ρ ln p).
If r ∈ S⋆ ∖ S, then by Lemma B (iii), |β*S+r,r| ≥ c1n^{−α}. For any βS+r such that ‖βS+r − β*S+r‖ ≤ ζ, noting that β*S+r is the solution to (4) under model S+r, we apply Taylor’s expansion to obtain
where β̃S+r lies between βS+r and β*S+r, and the last inequality follows from Condition (F). By the convexity of ΓS+r(βS+r), we thus have
Consequently,
for some constant c2 that does not depend on n. Then, we obtain
Withdrawing this restriction completes the proof of Theorem 1. □
Proof of Corollary 1. As shown in Theorem 1, for any S such that S⋆ ⊄ S and |S| < ρ, with probability at least 1 – 8 exp(−3ρ ln p),
If ln p = o(n^{1−2α}), then ln d + 2η ln p = o(n^{1−2α}), and consequently,
Therefore, our forward selection does not stop when S⋆ ⊄ Sk and |Sk| < ρ, with probability at least 1 – 8 exp(−3ρ ln p). Noting that S0 = Ø, the cumulative increment after k steps is at least c2kn^{−2α} when n is sufficiently large.
As shown in the proof of Lemma H,
with probability at least 1 – 3 exp(−3ρ ln p). If S⋆ ⊄ Sk for all k ≤ M, then
which contradicts the definition of M. Hence, S⋆ ⊆ Sk for some k ≤ M. This completes the proof of Corollary 1. □
Proof of Corollary 2. By Lemma B (i), if r ∉ S⋆ and the {Xj : j ∉ S⋆} are independent of {Xj : j ∈ S⋆}, then β*S+r,r = 0. Thus, on the event considered in the proof of Theorem 1,
If , we have for any S such that |S| < ρ and , when n is sufficiently large.
Withdrawing the restriction, we obtain that, at each step, the probability of selecting a noise variable is at most 8 exp(−3ρ ln p). Since |Sk| > c3|S⋆| implies that a noise variable is selected at more than (c3 − 1)|S⋆| steps, then for k ≤ c3|S⋆|,
Therefore, S⋆ ⊆ Sk for some k ≤ c3|S⋆|, with probability at least 1 – 8 exp(−2ρ ln p). This completes the proof of Corollary 2. □
Proof of Corollary 3. By Corollary 2, we know that S⋆ ⊆ Sk for some k ≤ c3|S⋆|. Thus, both of the following hold, where
(i): It can be shown that EBIC(Sk+1) < EBIC(Sk) if and only if ℓSk+1(β̂Sk+1) − ℓSk(β̂Sk) > (ln d + 2η ln p)/(2n). Following the same arguments used to show Eq. (14) in [24], we can show that, with probability tending to 1,
for all η > 0. Therefore, with probability tending to 1, the procedure stops at the kth step and S⋆ ⊆ Sk.
(ii): From Corollary 2 and Part (i), we have S⋆ ⊆ Sk as well as |Sk| ≤ c3|S⋆| with probability going to 1. Hence, the stated result follows. □
Proof of Lemma A. We first show that , has the same sign across t > 0. Let .
where U = β0rXr. Noting that S0(t) ≤ 1 and , we have . Therefore, given , is monotone increasing with respect to U.
Then by [25],
for all t > 0, and hence
has the same sign as −1/β0r, for all t > 0.
Next, by the same argument used above, has the same sign as −1/β0r across t > 0. This completes the proof of Lemma A. □
Proof of Lemma B. We first note that β*S+r is the root of (4) under model S+r.
(A.1) |
Part (i): Let and . By the condition that the is independent of , the is independent of . Thus, by Condition (B),
where the last equality follows from Condition (B). Combining the result and (A.1), it can be shown that
Therefore, β*S+r,r = 0.
If , then by the same arguments, we can show that .
Part (ii): Suppose, to the contrary, that β*S+r,r = 0. By the martingale property, β*S+r is also the solution to the following equation,
Furthermore, it is straightforward to see that
which has the opposite sign. Besides,
which has the same sign as by Condition (G) and the fact that .
Thus, the two quantities above have opposite signs, which is a contradiction. Therefore, β*S+r,r ≠ 0.
Part (iii): Without loss of generality, we assume that Xr is the last element of XS+r. Let er be a vector of length |S| + 1 with the rth element 1 and all other elements 0. By definition, . Then by the Mean Value Theorem,
which reduces to
where is between and . Thus,
where the first inequality follows from the proof of part (ii) that and
have the opposite signs, and the second inequality follows from Condition (F). Then, by Condition (E)
This completes the proof of Lemma B. □
Proof of Lemma C. We only prove the case k = 1, as the cases k = 0 and k = 2 can be proved similarly. Given an index set S of size |S| = s ≤ ρ, let . By Conditions (B) and (D), it can be shown that for any π ∈ ℝs satisfying ||π|| = 1,
Define . Then by (B) and (C), h(βS, π, u) is bounded between −1 and 1 uniformly over and u ∈ [0, τ]. Define the function class
Following the arguments used for Lemma 11 in [1] and Lemma 2.6.17 in [31], we can show that there exists some universal constant A1 such that the class of functions has a VC index bounded by A1s; for the definition of the VC index, see p. 85 in [31]. By Theorem 2.6.7 in [31], there exists some universal constant A2 such that the covering number, uniformly over probability measures Q, is bounded by (A2/ϵ)^{2A1s} for any ϵ > 0; for the definition of covering numbers, see p. 83 in [31].
Thus, by Theorem 1.1 in [27], there exists some constant A3 that depends on A2 only, such that for all ϵ > 0,
By choosing for some universal constant A4, we obtain
Consequently,
where the inequality follows from the combinatorial fact that the number of index sets S of size s is at most p^s.
Let A5 = A4K exp(c + KL). Then,
when n is sufficiently large. Thus, if ρ^2 ln p/n → 0,
Similarly, we can show that
and
for some constants A6 and A7 that do not depend on n. This completes the proof of Lemma C.
Proof of Lemma D. Given an index set S such that |S| < ρ, it is easy to see that ZS(βS) ≤ I + II, where
For term I, by Conditions (B) and (C), it can be shown that , and , for any . Let ϵ1,…,ϵn be a Rademacher sequence. Then we have
where the first inequality follows from Lemma 2.3.1 in [31], the second inequality is trivial, and the third inequality follows from Condition (B) and Lemma A.1 in [30]. Applying Bousquet’s concentration theorem [2] yields that, for any r > 0,
Choose . As , we obtain that, when n is sufficiently large,
(A.2) |
For term II, let Rs(c) denote a ball with dimensionality s and radius c. Let be a collection of cubes that cover the ball Rs(c), where is a cube containing ξℓ with sides of length . Then and . For any , ||ξ − ξℓ|| ≤ c/(Kn²) ≡ ζn. Let . By the Mean Value Theorem,
where is between ξι and 0. Applying Bernstein’s inequality yields that for any r > 0,
and consequently, by choosing ,
(A.3) |
where n is sufficiently large.
Given any , we can similarly show that
Therefore,
(A.4) |
Combining (A.3) and (A.4) implies that
(A.5) |
By the same combinatorial inequality, we obtain that
This completes the proof of Lemma D. □
Proof of Lemma E. We first define the following events:
By Lemma C, we obtain that Pr(Ω1) ≤ exp(−3ρ ln p) and Pr(Ω2) ≤ exp(−3ρ ln p). In the rest of the proof, we restrict our attention to the complement of Ω1 ∪ Ω2. By the arguments in [20], we can show that
We first consider I. By Condition (B),
for any . We obtain that
Then,
for some constant A8 that does not depend on n.
We now bound II. By Condition (B),
Thus,
for some constant A9 that is free of n.
Withdrawing the restriction, the above results indicate that
for some constant A10. This completes the proof of Lemma E. □
Proof of Lemma F. Given any index set S satisfying |S| ≤ ρ and any βS with ‖βS − β*S‖ ≤ ζ, by Taylor’s expansion,
where β̃S lies between βS and β*S. Noting that
by Condition (F), we have . Similarly,
where the last inequality follows from Condition (F). This completes the proof of Lemma F. □
Proof of Lemma G. Let
We consider the event on which the conclusions of Lemmas D and E hold, which has probability at least 1 – 5 exp(−3ρ ln p). In the rest of the proof, we restrict our attention to this event.
Given an index set S such that |S| ≤ ρ, for any βS with for some constant A11 defined later, we have when n is sufficiently large such that , since ρ^4 ln p/n → 0.
Therefore, , where denotes the boundary of .
Noting that , we observe that
and that the right-hand term can be bounded from below by
with some constant A11 satisfying . Therefore,
By the concavity of ℓS(βS), the maximizer must lie inside the ball. Withdrawing the restriction, we have
Furthermore,
for some constant A12. Similarly, we obtain that
This completes the proof of Lemma G. □

Proof of Lemma H. Given any S ⊂ {1,…,p} and r such that |S| < ρ and r ∈ SC,
where
Noting that , it is easy to check that
We restrict our attention to the complement of Ω1, where Ω1 is defined in the proof of Lemma E. We consider I1 first. We have
where is between and . By the same argument, as well. Therefore, . By Conditions (B) and (D), . Applying Bernstein’s inequality, we get
Thus,
Withdrawing the restriction to , we obtain that
Let A14 = 2A13 + 12KL; this completes the proof of Lemma H. □
References
- [1].Belloni A, Chernozhukov V, ℓ1-penalized quantile regression in high-dimensional sparse models, Ann. Statist. 39 (2011) 82–130.
- [2].Bousquet O, A Bennett concentration inequality and its application to suprema of empirical processes, C. R. Math. 334 (2002) 495–500.
- [3].Bradic J, Fan J, Jiang J, Regularization for Cox’s proportional hazards model with NP-dimensionality, Ann. Statist. 39 (2011) 3092–3120.
- [4].Chen J, Chen Z, Extended Bayesian information criteria for model selection with large model spaces, Biometrika 95 (2008) 759–771.
- [5].Cheng M-Y, Honda T, Zhang J-T, Forward variable selection for sparse ultra-high dimensional varying coefficient models, J. Amer. Statist. Assoc. 111 (2016) 1209–1221.
- [6].Fan J, Li R, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360.
- [7].Fan J, Lv J, Sure independence screening for ultrahigh dimensional feature space (with discussion), J. R. Stat. Soc. Ser. B 70 (2008) 849–911.
- [8].Fan J, Samworth R, Wu Y, Ultrahigh dimensional feature selection: Beyond the linear model, J. Mach. Learn. Res. 10 (2009) 2013–2038.
- [9].Fan J, Song R, Sure independence screening in generalized linear models with NP-dimensionality, Ann. Statist. 38 (2010) 3567–3604.
- [10].Fine J, Comparing nonnested Cox models, Biometrika 89 (2002) 635–648.
- [11].Friedman J, Greedy function approximation: A gradient boosting machine, Ann. Statist. 29 (2001) 1189–1232.
- [12].Gorst-Rasmussen A, Scheike T, Independent screening for single-index hazard rate models with ultrahigh dimensional features, J. R. Stat. Soc. Ser. B 75 (2013) 217–245.
- [13].Hao N, Zhang HH, Interaction screening for ultrahigh-dimensional data, J. Amer. Statist. Assoc. 109 (2014) 1285–1301.
- [14].He X, Wang L, Hong HG, Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data, Ann. Statist. 41 (2013) 342–369.
- [15].Hong HG, Chen X, Christiani DC, Li Y, Integrated powered density: Screening ultrahigh dimensional covariates with survival outcomes, Biometrics 74 (2017) 421–429.
- [16].Hong HG, Kang J, Li Y, Conditional screening for ultra-high dimensional covariates with survival outcomes, Lifetime Data Anal. 24 (2018) 45–71.
- [17].Hong HG, Li Y, Feature selection of ultrahigh-dimensional covariates with survival outcomes: A selective review, Appl. Math. Ser. B 32 (2017) 379–396.
- [18].Huang J, Sun T, Ying Z, Yu Y, Zhang C-H, Oracle inequalities for the Lasso in the Cox model, Ann. Statist. 41 (2013) 1142–1165.
- [19].Ing C-K, Lai TL, A stepwise regression method and consistent model selection for high-dimensional sparse linear models, Statist. Sinica 21 (2011) 1473–1513.
- [20].Kong S, Nan B, Non-asymptotic oracle inequalities for the high-dimensional Cox regression via Lasso, Statist. Sinica 24 (2014) 25–42.
- [21].Li J, Zheng Q, Peng L, Huang Z, Survival impact index and ultrahigh-dimensional model-free screening with survival outcomes, Biometrics 72 (2016) 1145–1154.
- [22].Lin DY, Wei L-J, The robust inference for the Cox proportional hazards model, J. Amer. Statist. Assoc. 84 (1989) 1074–1078.
- [23].Luo S, Chen Z, Sequential Lasso cum EBIC for feature selection with ultra-high dimensional feature space, J. Amer. Statist. Assoc. 109 (2014) 1229–1240.
- [24].Luo S, Xu J, Chen Z, Extended Bayesian information criterion in the Cox model with a high-dimensional feature space, Ann. Inst. Statist. Math. 67 (2015) 287–311.
- [25].Schmidt KD, On the Covariance of Monotone Functions of a Random Variable, Professoren des Inst. für Math. Stochastik, 2003.
- [26].Song R, Lu W, Ma S, Jeng XJ, Censored rank independence screening for high-dimensional survival data, Biometrika 101 (2014) 799–814.
- [27].Talagrand M, Sharper bounds for Gaussian and empirical processes, Ann. Probab. 22 (1994) 28–76.
- [28].Tibshirani RJ, Regression shrinkage and selection via the Lasso, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 58 (1996) 267–288.
- [29].Tibshirani RJ, The Lasso method for variable selection in the Cox model, Statist. Medicine 16 (1997) 385–395.
- [30].van de Geer SA, High-dimensional generalized linear models and the Lasso, Ann. Statist. 36 (2008) 614–645.
- [31].van der Vaart AW, Wellner JA, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer, New York, 1996.
- [32].Vittinghoff E, McCulloch CE, Relaxing the rule of ten events per variable in logistic and Cox regression, Amer. J. Epidemiology 165 (2007) 710–718.
- [33].Volinsky CT, Raftery AE, Bayesian information criterion for censored survival models, Biometrics 56 (2000) 256–262.
- [34].Wang H, Forward regression for ultra-high dimensional variable screening, J. Amer. Statist. Assoc. 104 (2009) 1512–1524.
- [35].Xu R, Vaida F, Harrington DP, Using profile likelihood for semiparametric model selection with application to proportional hazards mixed models, Statist. Sinica 19 (2009) 819–842.
- [36].Zhang C-H, Nearly unbiased variable selection under minimax concave penalty, Ann. Statist. 38 (2010) 894–942.
- [37].Zhao SD, Li Y, Principled sure independence screening for Cox models with ultra-high-dimensional covariates, J. Multivariate Anal. 105 (2012) 397–411.
- [38].Zheng Q, Peng L, He X, Globally adaptive quantile regression with ultra-high dimensional data, Ann. Statist. 43 (2015) 2225.
- [39].Zhong W, Zhang T, Zhu Y, Liu JS, Correlation pursuit: forward stepwise variable selection for index models, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 74 (2012) 849–870.
- [40].Zöchbauer-Müller S, Minna JD, Gazdar AF, Aberrant DNA methylation in lung cancer: Biological and clinical implications, The Oncologist 7 (2002) 451–457.
- [41].Zou H, A note on path-based variable selection in the penalized proportional hazards model, Biometrika 95 (2008) 241–247.