Abstract
Forward regression, a classical variable screening method, has been widely used for model building when the number of covariates is relatively low. However, forward regression is seldom used in high-dimensional settings because of the cumbersome computation and unknown theoretical properties. Some recent works have shown that forward regression, coupled with an extended Bayesian information criterion (EBIC)-based stopping rule, can consistently identify all relevant predictors in high-dimensional linear regression settings. However, the results are based on the sum of residual squares from linear models and it is unclear whether forward regression can be applied to more general regression settings, such as Cox proportional hazards models. We introduce a forward variable selection procedure for Cox models. It selects important variables sequentially according to the increment of partial likelihood, with an EBIC stopping rule. To our knowledge, this is the first study that investigates the partial likelihood-based forward regression in high-dimensional survival settings and establishes selection consistency results. We show that, if the dimension of the true model is finite, forward regression can discover all relevant predictors within a finite number of steps and their order of entry is determined by the size of the increment in partial likelihood. As partial likelihood is not a regular density-based likelihood, we develop some new theoretical results on partial likelihood and use these results to establish the desired sure screening properties. The practical utility of the proposed method is examined via extensive simulations and analysis of a subset of the Boston Lung Cancer Survival Cohort study, a hospital-based study for identifying biomarkers related to lung cancer patients’ survival.
Keywords: Forward selection, partial likelihood, sure screening properties, extended Bayesian information criteria, high-dimensional predictors
1. Introduction
New biotechnologies have generated a vast amount of high-throughput data. In the Boston Lung Cancer Survival Cohort study, a hospital-based study for lung cancer patients, identifying high-throughput predictors such as molecular profiles that are associated with patients’ survival is a major research goal for understanding disease progression processes and designing more effective gene therapies. When the number p of covariates is less than the sample size n, the Cox proportional hazards model has been routinely used for modeling survival data in many practical settings. When p > n, penalized partial likelihood methods have been proposed by various authors [29, 41] and the oracle properties and statistical error bounds of estimation have been established [18]. However, when p >> n, these methods often fail because of serious challenges in “computational expediency, statistical accuracy, and algorithmic stability” [8]. Recently, Bradic et al. [3] established the oracle properties of the regularized partial likelihood estimates under a high-dimensional setting. However, the results require the estimates to be unique and global optimizers, which is, in general, difficult to verify, especially when the dimension of covariates is exceedingly high.
Forward regression has been widely used for model selection, but it has often been criticized for not achieving selection consistency as it fails to account for multiple comparisons in the model building process. Recently, some authors, e.g., [5, 13, 19, 23, 34, 39] have revamped forward regression in the context of linear regression or varying-coefficient linear models. The advantages can be summarized as follows. First, these authors have shown that, with some proper stopping criteria, forward regression can achieve screening consistency even in high-dimensional settings. Second, the variables are sequentially selected into the final model with the entry order determined by the size of the likelihood increment, which might reflect the relative importance of each selected variable. Third, the implementation is simple as no cross-validation for tuning parameters is needed. Finally, the method only needs assumptions on the original model and does not require restrictive faithfulness assumptions, in which the marginal models reflect the original model. However, to our knowledge, the aforementioned forward regression approaches are either based on the sum of residual squares from linear models [5, 34] or Lasso estimation [23]. It is unclear whether forward regression can be applied to more general regression settings, such as the Cox proportional hazards models.
In contrast, there has been active research in developing high-dimensional screening tools for survival data. These works include principled sure screening [37], feature aberration at survival times screening [12] and conditional screening [16], quantile adaptive sure independence screening [14], a censored rank independence screening procedure [26], and integrated powered density screening [15]; see [17] for an extensive review. However, the screening methods require a threshold to dictate how many variables to retain, for which unfortunately there are no clear rules. Zhao and Li [37] did tie the threshold to false discoveries, but it still requires pre-specifying the number of false positives that users are willing to tolerate. Recently, Li et al. [21] designed a model-free measure, namely the survival impact index, that sensibly captures the overall influence of a covariate on the survival outcome and can help guide selecting important variables. However, even this method, like the other screening methods, does not directly lead to a final model, for which extra modeling steps have to be implemented.
We introduce a new forward variable selection procedure for survival data based on partial likelihood. It selects important variables sequentially according to the increment of partial likelihood, with a stopping rule based on EBIC. We show that if the dimension of the true model is finite, within a finite number of steps forward regression can discover all relevant predictors, with the entry order determined by the size of the likelihood increment.
Our work is novel in the following aspects. To our knowledge, it is the first attempt to thoroughly investigate forward regression in high-dimensional survival settings, methodologically, theoretically, and numerically. The paper is also technically novel. First, our work represents technical advances and a broadened scope compared to the existing forward regression [5, 23, 34]. This may be the first work that investigates the partial likelihood-based forward regression in survival models with high-dimensional predictors, and establishes rigorous selection consistency results when the extended Bayesian information criterion (EBIC) [4] is used. It improves the partial likelihood-based variable selection developed by Volinsky and Raftery [33] and Xu et al. [35] for survival data in low dimensional settings. Second, as partial likelihood is not a regular density-based likelihood, it fails to satisfy the requirements for theories of forward regression. We revisit partial likelihood and develop some new inequalities, based on which we establish the desired sure screening properties. The derived theoretical framework and techniques will facilitate the extension of the procedure to other general likelihood-based settings, such as generalized linear regression models. Finally, we note that forward selection starts with an empty model or some important variables identified a priori, and then sequentially recruits variables given important variables identified in the previous steps. This may resemble the conditional screening approach [16], which incorporates prior knowledge into variable screening. However, our method is valid even in the absence of such information.
The rest of the paper is organized as follows. In Section 2, we introduce the proposed forward regression procedure. In Section 3, we rigorously establish forward regression’s screening consistency and false discovery control under some regularity conditions. We carry out simulation studies to assess the performance of the proposed method in Section 4, and apply the method in Section 5 to analyze a subset of the Boston Lung Cancer Survival Cohort study, our motivating study for identifying biomarkers related to lung cancer patients’ survival. We conclude the paper with a natural extension of the proposal in Section 6. Technical proofs and all of the lemmas are presented in the Appendix.
2. Partial likelihood-based forward regression
Suppose we have n independent subjects with p covariates, where p >> n. For subject i, denote by Xij the jth covariate, write Xi = (Xi1, … , Xip)⊺, and let Ti and Ci be the underlying survival and censoring times. We only observe Yi = min(Ti, Ci) and the event indicator δi = 1(Ti ≤ Ci), where 1(·) is the indicator function. We assume random censoring, such that Ci and Ti are independent given Xi. We assume that (Y1, δ1, X1), … , (Yn, δn, Xn) are independent and identically distributed (iid). In particular, (Y1, T1, X1j), … , (Yn, Tn, Xnj) are iid copies of (Y, T, Xj), the random variables that underlie the observed survival time, true survival time, and covariates.
To link Ti to Xi, for each i ∈ {1, … , n}, we consider the Cox proportional hazards model, viz.
λ(t | Xi) = λ0(t) exp(β0⊺Xi), (1)
where λ0 is the unspecified baseline hazard function and β0 = (β01, … , β0p)⊺ is the vector of regression coefficients. Without loss of generality, we assume that E(X1) = ⋯ = E(Xp) = 0. Denote the true model by S⋆ = {j : β0j ≠ 0}, with p0 = |S⋆| its size. The overarching goal of variable screening is to estimate S⋆, and we denote its estimate by Ŝ.
We introduce more notation. For an index set S ⊂ {1, … , p} and a p-dimensional vector A, we use AS = {Aj : j ∈ S} to denote the subvector of A corresponding to S. For example, XiS denotes the collection of covariates for the ith individual corresponding to S. We use |S| to denote the cardinality of S and let SC denote the complement of S.
Now we elaborate on the idea of forward regression under model (1). Initialize S0 = Ø. We can also start with a set of given variables according to some prior knowledge, which is in the same spirit as conditional screening [16] but is followed by a sequential selection process. Specifically, we sequentially select the sets of covariates in such a way that

S0 ⊂ S1 ⊂ ⋯ ⊂ Sk ⊂ ⋯,
where Sk ⊂ {1, … , p} is the index set of the selected covariates upon completion of the kth step, with k ≥ 0. At the (k + 1)th step, we need to choose a new candidate variable not in Sk and decide whether to stop at the kth step or to include the new candidate in our selection and proceed to the next step. We emphasize that our selection criterion is based on the partial likelihood. The framework is much broader than the one based on the reduction in the sum of squared residuals proposed in [5, 34], and can be extended to more general regression settings.
Now, given Sk, we consider estimation of the extended Cox model formed by adding a new variable index to Sk. Specifically, for every j ∈ SkC, we denote Sk,j = Sk ∪ {j} and fit a Cox model on the variables indexed by Sk,j. We then compute the increment of the log partial likelihood for each j ∈ SkC, viz.

ℓSk,j(β̂Sk,j) − ℓSk(β̂Sk).
Here, for a covariate set S, ℓS(βS) is the log partial likelihood function given XS, viz.
ℓS(βS) = (1/n) ∑i=1n ∫0τ [βS⊺XiS − ln{(1/n) ∑ℓ=1n Yℓ(t) exp(βS⊺XℓS)}] dNi(t), (2)
and β̂S = arg maxβS ℓS(βS) maximizes (2), where Ni(t) = 1(Yi ≤ t, δi = 1) is the counting process, Yi(t) = 1(Yi ≥ t) is the at-risk process, and τ > 0 is the study duration such that Pr(Y ≥ τ) > 0. Then, the candidate index is chosen as

j* = arg maxj∈SkC {ℓSk,j(β̂Sk,j) − ℓSk(β̂Sk)}.
Upon completion of the (k + 1)th step, update Sk+1 = Sk ∪ {j*}.
We are now in a position to decide whether to stop at the kth step or to include variable j* in our selection and proceed to the next step. In the survival setting, the effective sample size is the number of uncensored events, in which case Volinsky and Raftery [33] showed that replacement of the sample size with the number of uncensored events in the penalty term gives a better approximation to the Bayes factor. Therefore, we propose the following modified EBIC criterion for ultrahigh-dimensional survival data:

EBIC(S) = −2nℓS(β̂S) + |S|(ln d + 2η ln p),

where d = δ1 + ⋯ + δn is the number of events and η is some positive constant.
We stop and declare Ŝ = Sk if EBIC(Sk+1) > EBIC(Sk); otherwise, we proceed to the next step.
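For concreteness, the following is a minimal implementation sketch of the procedure in Python. It is not the authors’ software: it assumes the lifelines package, whose CoxPHFitter exposes the maximized log partial likelihood as log_likelihood_, and all helper names are ours.

```python
# Minimal sketch of the proposed procedure (not the authors' software).
# Assumes the `lifelines` package; CoxPHFitter exposes the maximized log
# partial likelihood as `log_likelihood_`. All helper names are ours.
import numpy as np
from lifelines import CoxPHFitter

def null_loglik(time, event):
    # Log partial likelihood of the empty model (beta = 0): each event
    # contributes -log(risk-set size). Ties are handled only crudely.
    order = np.argsort(time)
    ev = np.asarray(event)[order].astype(bool)
    at_risk = len(time) - np.arange(len(time))
    return -np.log(at_risk[ev]).sum()

def ebic(loglik, size, d, p, eta):
    # Modified EBIC: the number of events d replaces n in the penalty [33].
    return -2.0 * loglik + size * (np.log(d) + 2.0 * eta * np.log(p))

def fit_loglik(df, cols, tcol, ecol):
    cph = CoxPHFitter()
    cph.fit(df[cols + [tcol, ecol]], tcol, ecol)
    return cph.log_likelihood_

def forward_cox(df, tcol, ecol, eta=1.0, init=()):
    cand = [c for c in df.columns if c not in (tcol, ecol)]
    d, p = int(df[ecol].sum()), len(cand)
    S = list(init)
    ll = (fit_loglik(df, S, tcol, ecol) if S
          else null_loglik(df[tcol].values, df[ecol].values))
    crit = ebic(ll, len(S), d, p, eta)
    while len(S) < p:
        # Step k+1: candidate with the largest partial-likelihood increment.
        ll_new, j_star = max((fit_loglik(df, S + [j], tcol, ecol), j)
                             for j in cand if j not in S)
        crit_new = ebic(ll_new, len(S) + 1, d, p, eta)
        if crit_new > crit:          # EBIC stopping rule
            break
        S.append(j_star)
        ll, crit = ll_new, crit_new
    return S
```

For instance, forward_cox(df, "time", "status") runs the default η = 1 variant on a data frame holding the survival time and event indicator columns.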
3. Theoretical properties
We first introduce more notation. Let →p denote convergence in probability. Given a random sample Z1, … , Zn, let ℙnf = (1/n) ∑i=1n f(Zi) denote the empirical average of a function f and Pf = E{f(Z1)} its population counterpart. For a column vector v, let v⊗0 = 1, v⊗1 = v, and v⊗2 = vv⊺. We denote the ℓq norm of v by ||v||q for q ≥ 1, and, in particular, denote its ℓ2 norm by ||v||. For any symmetric matrix A, let λmin(A) and λmax(A) represent the smallest and largest eigenvalues. Given an index set S and an index j ∈ S, we use S \ j to denote the set {r : r ∈ S, r ≠ j}.
Given an index set S ⊂ {1, … , p}, for k ∈ {0, 1, 2}, define

μS(k)(βS, t) = E{1(Y ≥ t) XS⊗k exp(βS⊺XS)}.
In addition, we use β*S to denote the least false value, which is the unique root (in βS) of the population score equation

E[∫0τ {X1S − μS(1)(βS, t)/μS(0)(βS, t)} dN1(t)] = 0.
We use FT(⋅ | XS⋆), fT(⋅ | XS⋆), and ST(⋅ | XS⋆) to denote the conditional cumulative distribution function (cdf), probability density function (pdf), and survival function of T given the true model S⋆, respectively. Likewise, the conditional cdf, pdf, and survival function of C are denoted by FC(⋅ | XSc), fC(⋅ | XSc), and SC(⋅ | XSc), respectively, where Sc is the collection of covariates that are related to the censoring time C.
3.1. Regularity conditions
We posit the regularity conditions, followed by explanations. The assumptions are, in general, mild, well justified, and follow the same lines as suggested by the existing literature.
(A) The study has a finite duration τ such that ω = Pr(Y ≥ τ) > 0.
(B) The Xj are time-independent and bounded by a constant K > 1, with E(Xj) = 0 and var(Xj) = 1 for all j ∈ {1, … , p}.
(C) There exist two positive constants 0 < κmin < κmax < ∞ such that

κmin ≤ λmin{E(XSXS⊺)} ≤ λmax{E(XSXS⊺)} ≤ κmax,

uniformly in S ⊂ {1, … , p} satisfying |S| ≤ ρ, for some positive integer ρ.
(D) sup|S|≤ρ ||β*S||1 ≤ L for some constant L.
(E) For some constants c1 > 0 and α ∈ (0, 1/2), minj∈S⋆ |β0j| ≥ c1n^{−α}.
(F) There exists ζ > 0 such that the curvature of the expected log partial likelihood (the eigenvalues of its negative second derivative) is bounded away from 0 and ∞ in the ζ-neighborhood of β*S, uniformly over |S| ≤ ρ.
(G) For any j ∈ S⋆, the two covariance-type processes appearing in Lemma A have the same sign across t.
Condition (A) is standard in survival models with censored data; see, e.g., [20]. Conditions (B) and (C) are commonly assumed in the literature for variable selection and screening; see, e.g., [5, 23, 34, 38]. The boundedness of X is adopted to simplify our theoretical development and can be relaxed to the Cramér condition as in [3]. Condition (D) replaces the Lipschitz assumption in [30] and has a similar flavor to the conditions in [37] and [20]. Condition (E) is introduced in [37], which is an adapted version of the conditions in [7] to survival data. Condition (F) is analogous to Condition 2 considered in [3] for regularized Cox models; it essentially requires that the concavity of the log partial likelihood be well bounded in a neighborhood of β*S. We invoke Condition (G) in order to analyze the least false value β*S. The condition often holds in practice; Lemma A in the Appendix provides a sufficient condition under which it is satisfied.
Since the log partial likelihood function in (2) is the sum of non-iid random variables, we consider its asymptotically equivalent version, which can be expressed as the sum of iid terms, viz.
ℓ̃S(βS) = (1/n) ∑i=1n ∫0τ [βS⊺XiS − ln μS(0)(βS, t)] dNi(t). (3)
According to Kong and Nan [20], the log partial likelihood function (2) can then be viewed as a “working” model of (3), and the corresponding loss function becomes

γS(βS; Xi, Yi, δi) = −∫0τ [βS⊺XiS − ln μS(0)(βS, t)] dNi(t),

with the expected loss ΓS(βS) = E{γS(βS; Xi, Yi, δi)}.
To validate the replacement of the log partial likelihood (2) by its iid version (3), a commonly assumed condition in the literature is that there exists a neighborhood of β*S such that, for k ∈ {0, 1, 2},

sup ‖∂kℓS(βS)/∂βSk − ∂kℓ̃S(βS)/∂βSk‖ →p 0,

where the supremum is taken over the neighborhood.
See, e.g., [3, 37]. Under Conditions (B) and (D), Lemma C shows that the above condition holds uniformly for all S ⊂ {1, … , p} satisfying |S| ≤ ρ.
We note that the proportional hazards assumption is made only on the true (and sparse) model. At each step of the screening procedure, we treat the misspecified Cox proportional hazards model as a working model following [10, 22]. Our theoretical derivations depend on the least false value, which helps us characterize the asymptotic behavior of our estimator at each step even without the proportional hazards assumption. Specifically, similar to [10, 22], the proposed estimator will converge to the least false value at each step under the working model, and when the second derivative of the log partial likelihood is bounded in a neighborhood of the least false value, adding an active variable will increase the partial likelihood, even if a misspecified model is under consideration.
3.2. Main results
Theorem 1. Under Conditions (A)–(G), if ρ^4 ln p/n → 0, then, with probability at least 1 – 8 exp(−3ρ ln p), there exists a constant c2 > 0 not depending on n such that, for any Sk with S⋆ ⊄ Sk and |Sk| < ρ,

maxj∈S⋆∖Sk {ℓSk,j(β̂Sk,j) − ℓSk(β̂Sk)} ≥ c2n^{−2α},

where α ∈ (0, 1/2) is the constant in Condition (E).
Theorem 1 shows that if S⋆ ⊄ Sk and |Sk| < ρ, then the increment of the log partial likelihood at the (k + 1)th step is at least c2n^{−2α}. Since the total increment is bounded by |ℓS0(0)|, we naturally obtain an upper bound on the number of steps for the forward selection, which is stated in the following corollary.
Corollary 1. Suppose the same conditions as in Theorem 1 hold. If ρ > M, where M = ⌈c4n^{2α}⌉ for a sufficiently large constant c4 and some α ∈ (0, 1/2), then S⋆ ⊆ Sk for some k ≤ M, with probability at least 1 – 11 exp(−3ρ ln p).
Corollary 1 establishes the screening consistency of the forward selection procedure. However, the upper bound M is not sharp, as it is calculated from the lower bound on the increment of the log partial likelihood. The following corollary establishes a sharper upper bound on the number of steps by evaluating how likely a signal variable is to be selected at each step.
Corollary 2. Under the same conditions as in Corollary 1, if the {Xj : j ∉ S⋆} are independent of {Xj : j ∈ S⋆}, then S⋆ ⊆ Sk for some k ≤ c3|S⋆|, for some c3 ∈ (1, ∞), with probability at least 1 – 8 exp(−2ρ ln p).
The condition that the {Xj : j ∉ S⋆} are independent of {Xj : j ∈ S⋆} stems from the assumption imposed in [37], and is similar to the partial orthogonality assumption introduced in [9]. It ensures that selecting a noise variable brings a smaller increment of the log partial likelihood than choosing a signal variable. Thus, it is much more likely for our procedure to select a signal variable at each step.
The following result follows from Corollary 2: we expect the proposed forward procedure to stop early, with S⋆ ⊆ Sk for some k.
Corollary 3. Under the same conditions as in Corollary 2, if |S⋆| is finite and ln p = o(n^{1−2α}), then with probability going to 1,
(i) (screening consistency) the procedure stops at the kth step and S⋆ ⊆ Sk;
(ii) (false discovery rate control) |Sk ∖ S⋆| ≤ (c3 − 1)|S⋆|.
By Corollary 3, our proposed forward selection procedure will stop at a final step, denoted by k̂, which is at most c3|S⋆|. The final model not only achieves screening consistency, but also has well-controlled false discoveries.
4. Numerical studies
Simulations were conducted to compare the performance of the proposed forward regression (FR) with two partial likelihood based screening methods, the principled sure independence screening (PSIS) by Zhao and Li [37], and the conditional screening (CS) by Hong et al. [16]. The size of the models selected by PSIS and CS was initially set to ⌊n/ln n⌋, as suggested in [7]. To reduce false positives, we then applied Lasso [28], SCAD [6], and MCP [36] penalties to shrink the models selected by each method. In the tables, we used screening method + penalty to denote the corresponding procedure. Although FR could start from a null model, we tried different initial sets for FR, including active or inactive variables. In particular, we chose X1 and X10 to represent the active and inactive initial sets, respectively. When computing the model size for both FR and CS, we included the given initial set.
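As an illustration of these two-stage “screening + penalty” procedures, the sketch below keeps the top ⌊n/ln n⌋ covariates from a screening step and refits a Lasso-penalized Cox model. It assumes the lifelines package (which supports elastic-net penalties; SCAD and MCP require other software), and the objects ranked, n, and df are hypothetical inputs.

```python
# Sketch of the two-stage "screening + penalty" refits compared here.
# `ranked` (covariates ordered by a marginal screening statistic), `n`,
# and the data frame `df` are assumed inputs.
import numpy as np
from lifelines import CoxPHFitter

keep = ranked[: int(np.floor(n / np.log(n)))]    # retain top n/ln(n) covariates
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)   # l1_ratio = 1 gives a Lasso fit
cph.fit(df[keep + ["time", "status"]], "time", "status")
# lifelines' soft L1 rarely yields exact zeros, so threshold the coefficients.
selected = cph.params_[cph.params_.abs() > 1e-8].index.tolist()
```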
In this paper, we considered η as a fixed constant, which is analogous to the constant “a” parameter in the SCAD penalty function [6]. This distinguishes our approach from the other screening methods, which typically require data-driven thresholding tuning parameters and may incur more of a computational burden for finding them. To further justify the use of a fixed η, we considered various values of η between 0 and 1, the theoretical range of η in EBIC [4]. The BIC is a special case of EBIC when η = 0. We explored using BIC as the stopping criterion, but it incurred too many false positives compared to EBIC. This may cause overfitting of the Cox proportional hazards model with unreliably estimated regression coefficients and spuriously detected associations [32]. Thus, we elected not to use BIC.
We next considered three different values of η, 0.5, 1 – ln d/(3 ln p), and 1; see Tables 1–3. Essentially, a larger η gives more penalty to a complicated model, which may incur more false negatives, while a smaller η penalizes model complexity less and may lead to more false positives. Based on Tables 1–3, it seems that under all of the scenarios considered, the choice of η = 1 – ln d/(3 ln p) strikes a good balance between false negatives and false positives.
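The following toy computation illustrates how the per-variable EBIC penalty ln d + 2η ln p varies with η; the event count d below is a hypothetical value chosen for illustration.

```python
# Per-variable EBIC penalty, ln d + 2*eta*ln p, for the three choices of eta
# (the event count d below is a hypothetical value for illustration).
import numpy as np

n, p, d = 200, 1000, 150
for label, eta in [("eta = 0.5", 0.5),
                   ("eta = 1 - ln d/(3 ln p)", 1 - np.log(d) / (3 * np.log(p))),
                   ("eta = 1", 1.0)]:
    print(label, "->", round(np.log(d) + 2 * eta * np.log(p), 2))
# Larger eta means a larger per-variable penalty, hence earlier stopping.
```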
Table 1:
Comparisons of methods under mild censoring.
Example | Method | (n, p) = (200,1000) | (n, p) = (400,1000) | ||||
---|---|---|---|---|---|---|---|
PIT | TP | FP | PIT | TP | FP | ||
1 (p0 = 4) | FR | 3.65 (0.52) | 3.61 (0.49) | 0.05 (0.21) | 3.99 (0.18) | 3.98 (0.14) | 0.01 (0.12) |
FR+Lasso | 3.65 (0.52) | 3.02 (0.15) | 0.63 (0.48) | 3.99 (0.18) | 3.01 (0.12) | 0.98 (0.14) | |
FR+MCP | 3.65 (0.52) | 3.02 (0.15) | 0.63 (0.48) | 3.99 (0.18) | 3.01 (0.12) | 0.98 (0.14) | |
FR+SCAD | 3.65 (0.52) | 3.02 (0.15) | 0.63 (0.48) | 3.99 (0.18) | 3.01 (0.12) | 0.98 (0.14) | |
FRx1 | 3.65 (0.52) | 3.61 (0.49) | 0.05 (0.21) | 3.99 (0.18) | 3.98 (0.14) | 0.01 (0.12) | |
FRx10 | 4.64 (0.52) | 3.60 (0.49) | 1.04 (0.20) | 4.99 (0.19) | 3.98 (0.15) | 1.01 (0.12) | |
FR(η1) | 5.04 (1.41) | 3.89 (0.32) | 1.16 (1.38) | 4.53 (0.85) | 4.00 (0.00) | 0.53 (0.85) | |
FR(η2) | 3.95 (0.62) | 3.77 (0.42) | 0.18 (0.45) | 4.08 (0.33) | 3.99 (0.09) | 0.09 (0.31) | |
PSIS | 38.00 (0.00) | 3.03 (0.55) | 34.97 (0.55) | 67.00 (0.00) | 3.39 (0.49) | 63.61 (0.49) | |
PSIS+Lasso | 4.44 (2.71) | 2.43 (0.93) | 2.01 (2.12) | 6.82 (2.70) | 3.13 (0.48) | 3.69 (2.64) | |
PSIS+MCP | 4.05 (2.14) | 2.47 (0.85) | 1.58 (1.67) | 4.81 (1.60) | 3.08 (0.47) | 1.74 (1.55) | |
PSIS+SCAD | 4.22 (2.55) | 2.36 (0.89) | 1.86 (1.98) | 5.33 (1.96) | 3.06 (0.53) | 2.27 (1.85) | |
CS | 38.00 (0.00) | 3.08 (0.48) | 34.92 (0.48) | 67.00 (0.00) | 3.23 (0.42) | 63.77 (0.42) | |
CS+Lasso | 4.63 (2.49) | 2.51 (0.77) | 2.12 (2.09) | 6.50 (2.45) | 2.95 (0.43) | 3.54 (2.35) | |
CS+MCP | 4.00 (1.86) | 2.44 (0.76) | 1.55 (1.47) | 4.78 (1.40) | 2.92 (0.37) | 1.86 (1.33) | |
CS+SCAD | 4.30 (2.23) | 2.37 (0.75) | 1.93 (1.81) | 5.39 (1.82) | 2.90 (0.44) | 2.48 (1.67) | |
2 (p0 = 6) | FR | 0.91 (0.30) | 5.89 (0.40) | 1.34 (1.40) | 1.00 (0.00) | 6.00 (0.00) | 1.01 (1.20) |
FR | 6.16 (0.47) | 6.00 (0.00) | 0.16 (0.47) | 6.02 (0.15) | 6.00 (0.00) | 0.02 (0.15) | |
FR+Lasso | 6.16 (0.47) | 6.00 (0.00) | 0.16 (0.47) | 6.02 (0.15) | 6.00 (0.00) | 0.02 (0.15) | |
FR+MCP | 6.16 (0.47) | 6.00 (0.00) | 0.16 (0.47) | 6.02 (0.15) | 6.00 (0.00) | 0.02 (0.15) | |
FR+SCAD | 6.08 (0.35) | 5.91 (0.32) | 0.16 (0.47) | 6.02 (0.15) | 6.00 (0.00) | 0.02 (0.15) | |
FRx1 | 6.08 (0.33) | 6.00 (0.00) | 0.08 (0.33) | 6.02 (0.15) | 6.00 (0.00) | 0.02 (0.15) | |
FRx10 | 7.13 (0.42) | 6.00 (0.00) | 1.13 (0.42) | 7.01 (0.13) | 6.00 (0.00) | 1.01 (0.13) | |
FR(η1) | 7.51 (1.78) | 6.00 (0.00) | 1.51 (1.78) | 6.78 (1.16) | 6.00 (0.00) | 0.78 (1.16) | |
FR(η2) | 6.34 (0.72) | 6.00 (0.00) | 0.34 (0.72) | 6.11 (0.40) | 6.00 (0.00) | 0.11 (0.40) | |
PSIS | 38.00 (0.00) | 3.42 (0.93) | 34.58 (0.93) | 67.00 (0.00) | 4.83 (0.53) | 62.17 (0.53) | |
PSIS+Lasso | 8.54 (4.97) | 3.71 (0.97) | 4.84 (4.66) | 8.74 (5.33) | 4.82 (0.54) | 3.92 (5.21) | |
PSIS+MCP | 5.75 (2.39) | 3.34 (1.01) | 2.42 (1.97) | 9.21 (3.05) | 4.80 (0.61) | 4.41 (2.77) | |
PSIS+SCAD | 6.73 (3.29) | 3.51 (0.95) | 3.21 (3.08) | 8.80 (4.80) | 4.77 (0.63) | 4.03 (4.56) | |
CS | 38.00 (0.00) | 4.45 (0.83) | 33.55 (0.83) | 67.00 (0.00) | 5.66 (0.53) | 61.34 (0.53) | |
CS+Lasso | 8.68 (4.49) | 4.55 (0.93) | 4.13 (4.07) | 11.60 (4.50) | 5.67 (0.53) | 5.93 (4.22) | |
CS+MCP | 6.41 (2.38) | 4.27 (1.09) | 2.14 (1.97) | 7.44 (2.20) | 5.64 (0.63) | 1.80 (2.46) | |
CS+SCAD | 6.68 (2.92) | 4.34 (0.98) | 2.34 (2.69) | 6.83 (2.54) | 5.63 (0.63) | 1.20 (2.71) | |
3 (p0 = 6) | FR | 5.24 (0.95) | 5.02 (1.03) | 0.22 (0.49) | 5.75 (0.45) | 5.74 (0.44) | 0.01 (0.12) |
FR+Lasso | 5.24 (0.95) | 5.19 (0.87) | 0.05 (0.25) | 5.75 (0.45) | 5.74 (0.44) | 0.01 (0.09) | |
FR+MCP | 5.24 (0.95) | 5.19 (0.87) | 0.05 (0.25) | 5.75 (0.45) | 5.74 (0.44) | 0.01 (0.09) | |
FR+SCAD | 5.21 (0.94) | 5.16 (0.87) | 0.05 (0.25) | 5.75 (0.45) | 5.74 (0.44) | 0.01 (0.09) | |
FRx1 | 5.25 (0.85) | 5.10 (0.84) | 0.16 (0.40) | 5.75 (0.45) | 5.74 (0.44) | 0.01 (0.12) | |
FRx10 | 6.05 (1.28) | 4.87 (1.28) | 1.17 (0.41) | 6.75 (0.45) | 5.73 (0.44) | 1.01 (0.12) | |
FR(η1) | 6.84 (1.87) | 5.38 (0.93) | 1.45 (1.70) | 6.43 (0.87) | 5.93 (0.26) | 0.50 (0.84) | |
FR(η2) | 5.60 (1.09) | 5.18 (1.00) | 0.41 (0.73) | 5.93 (0.48) | 5.85 (0.36) | 0.07 (0.33) | |
PSIS | 38.00 (0.00) | 2.57 (0.72) | 35.43 (0.72) | 67.00 (0.00) | 3.35 (0.58) | 63.65 (0.58) | |
PSIS+Lasso | 3.43 (2.69) | 2.06 (1.05) | 1.37 (1.96) | 6.18 (2.77) | 3.22 (0.56) | 2.96 (2.59) | |
PSIS+MCP | 3.04 (1.64) | 2.21 (0.90) | 0.83 (1.16) | 3.51 (0.95) | 3.11 (0.49) | 0.40 (0.79) | |
PSIS+SCAD | 3.64 (2.24) | 2.21 (0.99) | 1.43 (1.61) | 3.91 (1.18) | 3.13 (0.49) | 0.78 (1.04) | |
CS | 38.00 (0.00) | 3.14 (0.68) | 34.86 (0.68) | 67.00 (0.00) | 4.17 (0.62) | 62.83 (0.62) | |
CS+Lasso | 4.74 (3.02) | 2.67 (1.18) | 2.07 (2.21) | 7.97 (3.03) | 4.05 (0.65) | 3.93 (2.77) | |
CS+MCP | 3.76 (1.67) | 2.84 (0.80) | 0.93 (1.39) | 4.31 (0.97) | 3.94 (0.58) | 0.38 (0.79) | |
CS+SCAD | 4.62 (2.16) | 2.86 (0.95) | 1.76 (1.69) | 4.66 (1.08) | 3.95 (0.58) | 0.71 (0.98) |
NOTE: FR, forward regression; PSIS, the principled sure independence screening; CS, the conditional screening with the given conditioning variable; PIT, estimated probability of including all true predictors in the selected predictors; TP, average number of true positives; FP, average number of false positives; p0 denotes the number of true signals; numbers in parentheses are standard deviations. We used η1 = 0.5 and η2 = 1 – ln d/(3 ln p); when not otherwise noted, η = 1 was used. FRS0 denotes FR performed with the initial set S0; we considered two initial sets, x1 and x10.
Table 3:
Comparisons of methods under covariate-dependent censoring
Example | Method | (n, p) = (200, 1000) | (n, p) = (400, 1000) | ||||
---|---|---|---|---|---|---|---|
PIT | TP | FP | PIT | TP | FP | ||
1* (p0 = 4) | FR | 3.58 (0.51) | 3.55 (0.50) | 0.03 (0.17) | 3.99 (0.15) | 3.98 (0.13) | 0.00 (0.06) |
FR+Lasso | 3.58 (0.51) | 3.01 (0.09) | 0.57 (0.50) | 3.99 (0.15) | 3.00 (0.06) | 0.98 (0.13) | |
FR+MCP | 3.58 (0.51) | 3.01 (0.09) | 0.57 (0.50) | 3.99 (0.15) | 3.00 (0.06) | 0.98 (0.13) | |
FR+SCAD | 3.58 (0.51) | 3.01 (0.09) | 0.57 (0.50) | 3.99 (0.15) | 3.00 (0.06) | 0.98 (0.13) | |
FRx1 | 3.58 (0.51) | 3.55 (0.50) | 0.03 (0.17) | 3.99 (0.15) | 3.98 (0.13) | 0.00 (0.06) | |
FRx10 | 4.58 (0.51) | 3.56 (0.50) | 1.03 (0.17) | 4.98 (0.14) | 3.98 (0.13) | 1.00 (0.04) | |
FR(η1) | 5.05 (1.56) | 3.84 (0.37) | 1.21 (1.53) | 4.56 (0.85) | 4.00 (0.00) | 0.56 (0.85) | |
FR(η2) | 3.89 (0.64) | 3.70 (0.46) | 0.19 (0.48) | 4.08 (0.32) | 4.00 (0.06) | 0.09 (0.31) | |
PSIS | 38.00 (0.00) | 2.96 (0.51) | 35.04 (0.51) | 67.00 (0.00) | 3.41 (0.50) | 63.59 (0.50) | |
PSIS+Lasso | 4.39 (2.82) | 2.29 (0.96) | 2.11 (2.24) | 6.58 (2.63) | 3.06 (0.59) | 3.53 (2.53) | |
PSIS+MCP | 4.00 (2.15) | 2.37 (0.90) | 1.63 (1.70) | 4.81 (1.60) | 3.08 (0.54) | 1.73 (1.55) | |
PSIS+SCAD | 4.30 (2.54) | 2.27 (0.91) | 2.03 (2.01) | 5.34 (1.91) | 3.04 (0.59) | 2.30 (1.83) | |
CS | 38.00 (0.00) | 3.00 (0.49) | 35.00 (0.49) | 67.00 (0.00) | 3.25 (0.43) | 63.75 (0.43) | |
CS+Lasso | 4.50 (2.67) | 2.36 (0.77) | 2.14 (2.28) | 6.50 (2.44) | 2.92 (0.41) | 3.58 (2.33) | |
CS+MCP | 4.01 (2.00) | 2.38 (0.75) | 1.63 (1.65) | 4.72 (1.45) | 2.91 (0.35) | 1.81 (1.38) | |
CS+SCAD | 4.27 (2.39) | 2.29 (0.76) | 1.98 (1.96) | 5.45 (1.87) | 2.89 (0.39) | 2.57 (1.77) | |
2* (p0 = 6) | FR | 6.13 (0.39) | 6.00 (0.04) | 0.13 (0.39) | 6.01 (0.09) | 6.00 (0.00) | 0.01 (0.09) |
FR+Lasso | 6.12 (0.38) | 5.99 (0.09) | 0.13 (0.39) | 6.01 (0.09) | 6.00 (0.00) | 0.01 (0.09) | |
FR+MCP | 6.12 (0.38) | 5.99 (0.09) | 0.13 (0.39) | 6.01 (0.09) | 6.00 (0.00) | 0.01 (0.09) | |
FR+SCAD | 6.04 (0.25) | 5.91 (0.30) | 0.13 (0.39) | 6.01 (0.09) | 6.00 (0.00) | 0.01 (0.09) | |
FRx1 | 6.06 (0.26) | 6.00 (0.04) | 0.06 (0.25) | 6.01 (0.09) | 6.00 (0.00) | 0.01 (0.09) | |
FRx10 | 7.09 (0.32) | 6.00 (0.04) | 1.09 (0.31) | 7.01 (0.11) | 6.00 (0.00) | 1.01 (0.11) | |
FR(η1) | 7.55 (1.72) | 6.00 (0.00) | 1.55 (1.72) | 6.68 (1.05) | 6.00 (0.00) | 0.68 (1.05) | |
FR(η2) | 6.32 (0.64) | 6.00 (0.00) | 0.32 (0.64) | 6.11 (0.39) | 6.00 (0.00) | 0.11 (0.39) | |
PSIS | 38.00 (0.00) | 3.44 (0.92) | 34.56 (0.92) | 67.00 (0.00) | 4.80 (0.59) | 62.20 (0.59) | |
PSIS+Lasso | 8.55 (5.16) | 3.63 (1.00) | 4.92 (4.81) | 9.14 (5.81) | 4.79 (0.58) | 4.34 (5.65) | |
PSIS+MCP | 5.50 (2.47) | 3.25 (1.04) | 2.24 (2.05) | 9.01 (2.91) | 4.76 (0.67) | 4.25 (2.63) | |
PSIS+SCAD | 6.27 (3.19) | 3.42 (0.99) | 2.86 (2.98) | 8.67 (4.77) | 4.73 (0.68) | 3.95 (4.50) | |
CS | 38.00 (0.00) | 4.40 (0.89) | 33.60 (0.89) | 67.00 (0.00) | 5.66 (0.53) | 61.34 (0.53) | |
CS+Lasso | 8.74 (4.59) | 4.47 (0.97) | 4.27 (4.22) | 11.42 (4.57) | 5.66 (0.55) | 5.76 (4.32) | |
CS+MCP | 6.19 (2.39) | 4.12 (1.15) | 2.06 (1.95) | 7.30 (2.11) | 5.62 (0.62) | 1.67 (2.37) | |
CS+SCAD | 6.54 (2.98) | 4.22 (1.08) | 2.33 (2.76) | 6.91 (2.81) | 5.61 (0.66) | 1.30 (2.98) | |
3* (p0 = 6) | FR | 5.11 (1.02) | 4.87 (1.13) | 0.24 (0.51) | 5.72 (0.47) | 5.70 (0.46) | 0.02 (0.13) |
FR+Lasso | 5.11 (1.02) | 5.06 (0.96) | 0.04 (0.22) | 5.72 (0.47) | 5.71 (0.45) | 0.01 (0.10) | |
FR+MCP | 5.11 (1.02) | 5.06 (0.96) | 0.04 (0.22) | 5.72 (0.47) | 5.71 (0.45) | 0.01 (0.10) | |
FR+SCAD | 5.08 (1.01) | 5.03 (0.94) | 0.04 (0.22) | 5.72 (0.47) | 5.71 (0.45) | 0.01 (0.10) | |
FRx1 | 5.11 (0.98) | 4.92 (1.00) | 0.19 (0.46) | 5.72 (0.47) | 5.70 (0.46) | 0.02 (0.13) | |
FRx10 | 5.74 (1.49) | 4.59 (1.50) | 1.15 (0.38) | 6.72 (0.49) | 5.70 (0.46) | 1.03 (0.17) | |
FR(η1) | 7.11 (1.96) | 5.38 (0.87) | 1.73 (1.84) | 6.60 (1.00) | 5.91 (0.28) | 0.69 (0.96) | |
FR(η2) | 5.51 (1.09) | 5.07 (1.02) | 0.44 (0.73) | 5.97 (0.54) | 5.83 (0.37) | 0.14 (0.39) | |
PSIS | 38.00 (0.00) | 2.55 (0.72) | 35.45 (0.72) | 67.00 (0.00) | 3.34 (0.63) | 63.66 (0.63) | |
PSIS+Lasso | 3.08 (2.58) | 1.85 (1.10) | 1.23 (1.79) | 5.93 (2.76) | 3.19 (0.70) | 2.74 (2.46) | |
PSIS+MCP | 3.10 (1.84) | 2.10 (0.96) | 1.00 (1.37) | 3.48 (1.06) | 3.05 (0.57) | 0.43 (0.87) | |
PSIS+SCAD | 3.38 (2.40) | 2.02 (1.11) | 1.36 (1.67) | 4.12 (1.48) | 3.11 (0.58) | 1.01 (1.37) | |
CS | 38.00 (0.00) | 3.12 (0.70) | 34.88 (0.70) | 67.00 (0.00) | 4.10 (0.68) | 62.90 (0.68) | |
CS+Lasso | 4.47 (3.28) | 2.53 (1.21) | 1.94 (2.42) | 7.88 (3.17) | 3.96 (0.69) | 3.92 (2.89) | |
CS+MCP | 3.97 (1.99) | 2.77 (0.89) | 1.20 (1.62) | 4.35 (1.07) | 3.86 (0.62) | 0.50 (0.87) | |
CS+SCAD | 4.60 (2.57) | 2.72 (1.05) | 1.87 (1.97) | 4.83 (1.28) | 3.87 (0.63) | 0.96 (1.13) |
We considered p = 1000 and two sample sizes, viz. n ∈ {200, 400}. The survival time was generated from a Cox model λ(t|X) = λ0(t)exp(β⊺X) with a Weibull baseline hazard. Specifically, λ0(t) = αγtγ−1, with α = 1 and γ = 1.5. We considered various models for X and different parameter configurations for β in the following examples. The censoring time was independently generated from a uniform distribution U(0, c). We varied c for each example in order to yield mild (around 25%) and heavy (around 50%) censoring proportions. For each configuration, a total of 500 simulated datasets were generated.
Example 1. We chose β = (1, 0.5, −1, 0, 1, 0p−5)⊺ and generated X from a multivariate normal distribution where the mean was 0, the variance 1, and cor(Xj, Xj′) = 0.5^|j−j′|.
Example 2. We set β = (1, 1, 1, 1, 1, −2.5, 0p−6)⊺ and generated X from a multivariate normal distribution with mean 0, variance 1, and cor(Xj, Xj′) = 0.5 for j ≠ j′. In this case, since cov(T, X6) = 0, X6 has a lower marginal utility than all of the noise variables, for which cov(T, Xj) = 1.25 for j > 6.
Example 3. We let β = β(ν) with ν = 0.5. We generated X from a multivariate normal distribution with mean 0, variance 1, and cor(Xj, Xj′) = 0.5^|j−j′|. In this case, since cov(T, X6) = 0, X6 is an active but hidden variable. Furthermore, the signals of the active variables are weak due to signal cancellation.
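For reference, the sketch below shows how such data can be generated for Example 1: survival times are drawn from the Cox model with Weibull baseline by inverse transform, and the censoring bound c is an assumed value to be tuned toward the target censoring rate.

```python
# Sketch of the simulation design (Example 1): survival times from a Cox
# model with Weibull baseline hazard alpha*gamma*t^(gamma-1) via inverse
# transform; uniform censoring U(0, c). The value of c is an assumption.
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha, gamma, c = 200, 1000, 1.0, 1.5, 5.0

beta = np.zeros(p)
beta[[0, 1, 2, 4]] = [1.0, 0.5, -1.0, 1.0]     # beta = (1, 0.5, -1, 0, 1, 0, ...)

# X ~ N(0, Sigma) with Sigma_{jk} = 0.5^{|j-k|}, via an AR(1) recursion
X = np.empty((n, p))
X[:, 0] = rng.standard_normal(n)
for j in range(1, p):
    X[:, j] = 0.5 * X[:, j - 1] + np.sqrt(0.75) * rng.standard_normal(n)

# Lambda0(t) = alpha*t^gamma, so T = (E*exp(-X beta)/alpha)^(1/gamma), E ~ Exp(1)
T = (rng.exponential(size=n) * np.exp(-X @ beta) / alpha) ** (1 / gamma)
C = rng.uniform(0, c, size=n)
Y, delta = np.minimum(T, C), (T <= C).astype(int)
print("censoring proportion:", 1 - delta.mean())
```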
Tables 1–2 report the average of the estimated probabilities of including the true models (PIT), the average numbers of true positives (TP) and false positives (FP), and their standard deviations in parentheses, under mild and heavy censoring, respectively. We use p0 to denote the number of true signals. The proposed method performed competitively, as detailed below.
Table 2:
Comparisons of methods under heavy censoring.
Example | Method | (n, p) = (200, 1000) | (n, p) = (400, 1000) | ||||
---|---|---|---|---|---|---|---|
PIT | TP | FP | PIT | TP | FP | ||
1 (p0 = 4) | FR | 3.33 (0.54) | 3.29 (0.49) | 0.04 (0.22) | 3.86 (0.39) | 3.85 (0.36) | 0.01 (0.13) |
FR+Lasso | 3.33 (0.54) | 3.00 (0.16) | 0.33 (0.48) | 3.86 (0.39) | 3.01 (0.10) | 0.85 (0.36) | |
FR+MCP | 3.33 (0.54) | 3.00 (0.16) | 0.33 (0.48) | 3.86 (0.39) | 3.01 (0.10) | 0.85 (0.36) | |
FR+SCAD | 3.33 (0.54) | 3.00 (0.16) | 0.33 (0.48) | 3.86 (0.39) | 3.01 (0.10) | 0.85 (0.36) | |
FRx1 | 3.33 (0.54) | 3.29 (0.49) | 0.04 (0.22) | 3.86 (0.39) | 3.85 (0.36) | 0.01 (0.13) | |
FRx10 | 4.35 (0.55) | 3.29 (0.49) | 1.05 (0.26) | 4.85 (0.39) | 3.84 (0.37) | 1.01 (0.13) | |
FR(η1) | 5.24 (1.98) | 3.62 (0.49) | 1.62 (1.88) | 4.71 (1.04) | 3.98 (0.15) | 0.74 (1.02) | |
FR(η2) | 3.64 (0.72) | 3.45 (0.51) | 0.19 (0.49) | 4.02 (0.42) | 3.93 (0.26) | 0.09 (0.33) | |
PSIS | 38.00 (0.00) | 2.93 (0.59) | 35.07 (0.59) | 67.00 (0.00) | 3.37 (0.49) | 63.63 (0.49) | |
PSIS+Lasso | 3.58 (2.57) | 2.02 (0.92) | 1.55 (1.98) | 5.92 (2.46) | 2.91 (0.58) | 3.01 (2.26) | |
PSIS+MCP | 3.56 (2.23) | 2.13 (0.87) | 1.43 (1.74) | 4.70 (1.60) | 2.92 (0.53) | 1.79 (1.49) | |
PSIS+SCAD | 3.66 (2.48) | 2.01 (0.86) | 1.65 (1.96) | 5.22 (2.17) | 2.85 (0.64) | 2.37 (1.92) | |
CS | 38.00 (0.00) | 2.90 (0.55) | 35.10 (0.55) | 67.00 (0.00) | 3.27 (0.45) | 63.73 (0.45) | |
CS+Lasso | 3.60 (2.29) | 2.09 (0.83) | 1.52 (1.81) | 5.69 (2.51) | 2.80 (0.56) | 2.89 (2.30) | |
CS+MCP | 3.36 (1.91) | 2.07 (0.79) | 1.28 (1.50) | 4.70 (1.66) | 2.83 (0.47) | 1.87 (1.51) | |
CS+SCAD | 3.67 (2.19) | 2.04 (0.78) | 1.63 (1.74) | 5.21 (2.29) | 2.74 (0.56) | 2.48 (2.03) | |
2 (p0 = 6) | FR | 6.22 (0.65) | 5.85 (0.61) | 0.36 (0.74) | 6.03 (0.18) | 6.00 (0.00) | 0.03 (0.18) |
FR+Lasso | 6.21 (0.65) | 5.95 (0.23) | 0.26 (0.60) | 6.03 (0.18) | 6.00 (0.00) | 0.03 (0.18) | |
FR+MCP | 6.21 (0.65) | 5.95 (0.23) | 0.26 (0.60) | 6.03 (0.18) | 6.00 (0.00) | 0.03 (0.18) | |
FR+SCAD | 6.07 (0.5) | 5.81 (0.45) | 0.26 (0.60) | 6.02 (0.17) | 6.00 (0.06) | 0.03 (0.18) | |
FRx1 | 6.10 (0.45) | 5.94 (0.33) | 0.16 (0.46) | 6.02 (0.17) | 6.00 (0.00) | 0.02 (0.17) | |
FRx10 | 7.14 (0.58) | 5.88 (0.49) | 1.25 (0.60) | 7.03 (0.17) | 6.00 (0.00) | 1.03 (0.17) | |
FR(η1) | 8.36 (2.55) | 5.97 (0.27) | 2.39 (2.56) | 6.98 (1.31) | 6.00 (0.00) | 0.98 (1.31) | |
FR(η2) | 6.5 (0.90) | 5.93 (0.44) | 0.57 (0.95) | 6.14 (0.43) | 6.00 (0.00) | 0.14 (0.43) | |
PSIS | 38.00 (0.00) | 3.13 (0.94) | 34.87 (0.94) | 67.00 (0.00) | 4.61 (0.67) | 62.39 (0.67) | |
PSIS+Lasso | 8.17 (4.84) | 3.38 (1.05) | 4.80 (4.46) | 8.98 (5.70) | 4.62 (0.70) | 4.36 (5.48) | |
PSIS+MCP | 5.34 (2.38) | 2.97 (1.09) | 2.37 (1.95) | 7.82 (2.80) | 4.47 (0.82) | 3.35 (2.41) | |
PSIS+SCAD | 6.52 (3.31) | 3.17 (1.06) | 3.36 (3.02) | 7.45 (4.02) | 4.48 (0.77) | 2.97 (3.73) | |
CS | 38.00 (0.00) | 4.04 (0.87) | 33.96 (0.87) | 67.00 (0.00) | 5.42 (0.64) | 61.58 (0.64) | |
CS+Lasso | 8.41 (4.48) | 4.13 (0.99) | 4.29 (4.07) | 10.64 (4.83) | 5.41 (0.66) | 5.23 (4.59) | |
CS+MCP | 5.81 (2.35) | 3.68 (1.13) | 2.12 (1.84) | 7.53 (2.29) | 5.33 (0.78) | 2.20 (2.44) | |
CS+SCAD | 6.75 (3.19) | 3.88 (1.07) | 2.87 (2.81) | 7.13 (3.07) | 5.32 (0.79) | 1.81 (3.17) | |
3 (p0 = 6) | FR | 4.45 (1.51) | 4.08 (1.64) | 0.37 (0.61) | 5.49 (0.54) | 5.46 (0.50) | 0.03 (0.21) |
FR+Lasso | 4.45 (1.51) | 4.40 (1.44) | 0.05 (0.24) | 5.49 (0.54) | 5.48 (0.50) | 0.01 (0.16) | |
FR+MCP | 4.45 (1.51) | 4.40 (1.44) | 0.05 (0.24) | 5.49 (0.54) | 5.48 (0.50) | 0.01 (0.16) | |
FR+SCAD | 4.44 (1.50) | 4.39 (1.43) | 0.05 (0.24) | 5.49 (0.54) | 5.48 (0.50) | 0.01 (0.16) | |
FRx1 | 4.57 (1.39) | 4.30 (1.44) | 0.26 (0.50) | 5.49 (0.54) | 5.46 (0.50) | 0.03 (0.21) | |
FRx10 | 4.97 (1.91) | 3.72 (1.88) | 1.24 (0.48) | 6.48 (0.53) | 5.45 (0.50) | 1.03 (0.21) | |
FR(η1) | 6.99 (2.59) | 4.72 (1.40) | 2.27 (2.29) | 6.55 (1.18) | 5.76 (0.43) | 0.80 (1.09) | |
FR(η2) | 4.93 (1.62) | 4.35 (1.56) | 0.58 (0.84) | 5.74 (0.63) | 5.60 (0.49) | 0.14 (0.42) | |
PSIS | 38.00 (0.00) | 2.32 (0.76) | 35.68 (0.76) | 67.00 (0.00) | 3.20 (0.61) | 63.8 (0.61) | |
PSIS+Lasso | 2.17 (2.01) | 1.46 (1.01) | 0.71 (1.28) | 5.24 (2.50) | 2.94 (0.73) | 2.30 (2.17) | |
PSIS+MCP | 2.66 (1.86) | 1.81 (0.98) | 0.86 (1.25) | 3.44 (1.16) | 2.88 (0.54) | 0.56 (1.02) | |
PSIS+SCAD | 2.78 (2.35) | 1.66 (1.08) | 1.11 (1.55) | 4.23 (1.53) | 2.92 (0.59) | 1.31 (1.37) | |
CS | 38.00 (0.00) | 2.87 (0.71) | 35.13 (0.71) | 67.00 (0.00) | 3.88 (0.69) | 63.12 (0.69) | |
CS+Lasso | 3.34 (2.81) | 2.03 (1.17) | 1.31 (1.92) | 7.16 (3.05) | 3.71 (0.75) | 3.45 (2.71) | |
CS+MCP | 3.45 (1.88) | 2.44 (0.98) | 1.00 (1.34) | 4.19 (1.15) | 3.62 (0.64) | 0.57 (0.92) | |
CS+SCAD | 4.13 (2.60) | 2.42 (1.14) | 1.71 (1.82) | 4.80 (1.36) | 3.66 (0.66) | 1.14 (1.24) |
First, Example 1 was designed in such a way that all of the true signals have nonzero marginal correlations with the outcome and are detectable by marginal screening methods. In particular, the dependence among the active variables in Example 1 can further strengthen the marginal correlations between them and the outcome. Even in these settings, FR performed better than the marginal screening methods, including the conditional screening method, with larger PIT and TP and smaller FP. When the sample size decreased to 200 or with more censored events, FR’s performance was still decent, with, for example, a PIT around 0.8. In contrast, the performances of PSIS and CS deteriorated quickly with smaller sample sizes or with more censoring.
Second, Examples 2 and 3 were designed so that X6, though active, has a zero marginal correlation with the outcome and, therefore, is not detectable by marginal screening methods. As a result, PSIS and CS failed in this challenging situation. In contrast, FR remained competitive and was able to detect the hidden variable; it even outperformed the conditional screening that used the prior information.
Third, even with Lasso, SCAD, or MCP applied to reduce false positives, the screening methods still incurred many false positives; the final models selected under the three penalties differed only slightly and performed similarly. In contrast, with the EBIC-based stopping rule, the proposed FR produced far fewer false positives without the help of any penalty.
Finally, the simulation results suggest that the performance of FR is robust to the choice of the initial set. Even when the “wrong” set was employed as the initial set, the TP was almost the same as when starting from the null set.
We further explored the robustness of the method to violation of the independent censoring assumption. We considered Examples 1*–3*, which have the same setup as Examples 1–3, except that the underlying survival and censoring times share a common latent variable b, generated from the standard Gaussian distribution, and the censoring times Ci are covariate-dependent. That is,
where α = (0.75, 0.75, 0p−2)⊺ and c was chosen to censor approximately 25% of the observations. The results are documented in Table 3.
We found that, with dependent censoring, the performance of all of the methods deteriorated a bit across the board. However, our proposed method still outperformed all the other methods, hinting at the usability of the proposal under dependent censoring. More work is warranted.
5. Analysis of the Boston Lung Cancer Survival Cohort (BLCSC) study
Recent studies demonstrate that aberrant methylation may be the most common mechanism of inactivating cancer-related genes in lung cancer. It occurs in the smoking-damaged bronchial epithelium from cancer-free individuals, can be reversed in vitro by demethylating agents, and may be a useful biomarker for lung cancer risk assessment [40]. It is thus of substantial interest to identify the methylations that play an important role in the pathogenesis of lung cancer, which affects patients’ overall survival.
The motivating data represented a subset of the Boston Lung Cancer Survival Cohort (BLCSC) and included 124 samples, each with 442,613 methylations. The median follow-up time of the subjects was 6.2 years, during which 84 deaths were observed. Each methylation resides within a certain gene. Prior literature has suggested that the following genes are associated with the development of lung cancer: ROS1, RET, PIK3CA, NRAS, BRAF, ALK, AKT1, VGLL2, MET, KRAS, EGFR, KDM4, ST3GAL3, and CDH13. We used the array annotations from the Bioconductor package FDb.InfiniumMethylation.hg19 (version 2.2.0) to identify methylations that lie within these genes; a total of 589 methylations were identified. The other available environmental exposure and demographic variables in the data included lifetime tobacco exposure (SMOK), computed by multiplying the number of packs of cigarettes smoked per day by the number of years the person had smoked until the beginning of the study; AGE, the age at diagnosis in years; and SEX (1 = male; 0 = female).
Our analytical goal was to explore which methylations, and which of their interactions with demographic and environmental exposure variables, might be related to patients’ overall survival. Thus, the outcome was the time to death, while the candidate predictors included the aforementioned demographic information, environmental exposures, and methylations and their interactions, for a total of 2,359 variables. We applied FR to the dataset and identified cg04187088×SEX and cg14363146×SMOK.
To check the model adequacy for the final model obtained by FR, we plotted the Cox-Snell residuals based on the final model that includes two predictors, cg04187088×SEX and cg14363146×SMOK. Figure 1 shows that the final model fits the data reasonably well.
Figure 1:
Cox–Snell residual plot.
Furthermore, we conducted the score test for the scaled Schoenfeld residuals to test the proportional hazards assumption on each included predictor in the final model. We obtained p-values of 0.542 and 0.670 for cg04187088 × SEX and cg14363146 × SMOK, respectively. It appears that the proportional hazards assumption is not rejected for either of them.
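These checks can be reproduced along the following lines. The sketch assumes a data frame df with columns time, status, and the two selected interaction terms, and uses the lifelines package; the Cox–Snell residuals are derived from martingale residuals, since lifelines does not report them directly.

```python
# Sketch of the model checks, assuming a data frame `df` with columns
# "time", "status", and the two selected interaction terms; uses lifelines.
from lifelines import CoxPHFitter, NelsonAalenFitter
from lifelines.statistics import proportional_hazard_test

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="status")

# Score tests based on the scaled Schoenfeld residuals (one per covariate).
print(proportional_hazard_test(cph, df, time_transform="rank").summary)

# Cox-Snell residuals from martingale residuals: r_i = delta_i - M_i.
# If the model fits, the r_i behave like censored unit-exponential data, so
# their Nelson-Aalen cumulative hazard should track the 45-degree line.
mart = cph.compute_residuals(df, kind="martingale")["martingale"]
cox_snell = df["status"] - mart
NelsonAalenFitter().fit(cox_snell, event_observed=df["status"]).plot()
```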
We also applied competing methods to the BLCSC dataset, including PSIS, CS, and the survival impact index (SII) [21]. For each competing method, we selected the top ⌊n/ln(n)⌋ = 25 genes and compared them with the genes selected by FR.
In terms of computing time, PSIS, CS, SII, and FR took 2.47, 2.52, 1473.65, and 20.89 seconds, respectively. Due to the sequential nature of the proposed procedure, FR is understandably more computationally intensive than the marginal screening approaches such as PSIS and CS. However, FR appears to run faster than SII, a nonparametric approach.
Table 4 lists the overlapping genes across the four methods. It appears that two genes selected by FR did not overlap with any genes selected by PSIS, CS and SII.
Table 4:
The number of overlapped genes chosen by PSIS, CS, SII and FR
PSIS | CS | SII | FR | |
---|---|---|---|---|
PSIS | 25 | 1 | 3 | 0 |
CS | 1 | 25 | 0 | 0 |
SII | 3 | 0 | 25 | 0 |
FR | 0 | 0 | 0 | 2 |
In addition, using SMOK, AGE, and SEX as the initial set, FR further selected cg11704212, on top of the identified interactions. Our PubMed review did not find prior literature discussing these methylation markers, which may indicate the ability of the proposed FR to identify novel biomarkers not detected by the existing methods. Future studies are warranted to confirm and study the functionality of these detected biomarkers.
To elucidate the identified effects, we further conducted Kaplan–Meier analyses. We first dichotomized each methylation at its median value, labeling a subject “+” or “−” according to whether the methylation value is above or below the median. Figure 2 compares survival curves across the resulting subgroups. Figure 2(a) clearly indicates that female patients with low cg04187088 had a higher survival probability than the other comparison groups, while Figure 2(b) reveals that heavy smokers with high cg14363146 had a much lower survival probability than the other comparison groups. Figure 2(c) also shows that patients with high cg11704212 had a higher survival probability than those with low cg11704212.
Figure 2:
Kaplan–Meier plots to illustrate the main and interaction effects identified by the proposed method.
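A sketch of the median-split Kaplan–Meier comparison underlying Figure 2(a) is given below; column names are illustrative.

```python
# Sketch of the median-split Kaplan-Meier comparison in Figure 2(a);
# column names are illustrative.
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

high = df["cg04187088"] > df["cg04187088"].median()   # "+" vs "-" groups
ax = plt.subplot(111)
for sex, sex_lab in [(0, "female"), (1, "male")]:
    for hi, m_lab in [(False, "cg04187088-"), (True, "cg04187088+")]:
        grp = df[(df["SEX"] == sex) & (high == hi)]
        KaplanMeierFitter().fit(grp["time"], grp["status"],
                                label=f"{sex_lab}, {m_lab}").plot_survival_function(ax=ax)
plt.show()
```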
6. Concluding remarks
This article has proposed forward regression with partial likelihood for high-dimensional survival data and has obtained computationally and theoretically useful results. We envision that the established theoretical framework will facilitate the extension of the procedure to other general settings, such as generalized linear models.
To further improve the computational efficiency, we can consider a natural extension of the proposed forward selection in the spirit of boosting [11]. That is, at each step, we use the selected variables and the obtained coefficient estimates to construct an offset, and search for a variable that maximizes the partial likelihood with such an offset. The advantage is that we need to maximize the partial likelihood with respect to only one covariate at each step, which may enhance computational efficiency. Specifically, we denote the log partial likelihood with an offset term O and a single covariate j in the model by

ℓO,j(βj) = (1/n) ∑i=1n ∫0τ [Oi + βjXij − ln{(1/n) ∑ℓ=1n Yℓ(t) exp(Oℓ + βjXℓj)}] dNi(t).
Here Oi refers to the offset term O evaluated at the ith subject. We let O(k) be the offset term evaluated at kth step and Sk be the set of indices of covariates selected up to the kth step.
For FR, we initialize S0 = Ø and set O(0) = 0. For j ∈ {1, … , p}, compute ℓO(0),j = maxβj ℓO(0),j(βj), with β̂j the maximizer. Then j1 = arg maxj∈{1,…,p} ℓO(0),j. Now set Oi(1) = Oi(0) + β̂j1Xij1 and S1 = {j1}. Given O(k) and Sk, compute ℓO(k),j = maxβj ℓO(k),j(βj) for j ∈ SkC. Then jk+1 = arg maxj∈SkC ℓO(k),j. Now set Oi(k+1) = Oi(k) + β̂jk+1Xijk+1 and Sk+1 = Sk ∪ {jk+1}. We note this proposal does not require re-estimation of the coefficients of the covariates selected in the previous steps, which expedites computation; we will explore this further. A sketch follows below.
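The following is a minimal sketch of one offset-based forward step, implementing the one-dimensional partial likelihood maximization directly in NumPy/SciPy (Breslow-type handling of ties); all function names are ours.

```python
# Sketch of one offset-based forward step: maximize the log partial
# likelihood in a single coefficient b for covariate x, holding the current
# fit as an offset O. Plain NumPy/SciPy; Breslow-type handling of ties.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_loglik_offset(b, x, offset, time, event):
    eta = offset + b * x
    order = np.argsort(-time)                    # decreasing time
    eta_s = eta[order]
    ev = np.asarray(event)[order].astype(bool)
    # log of the running risk-set sums: log sum_{l: Y_l >= Y_i} exp(eta_l)
    log_risk = np.logaddexp.accumulate(eta_s)
    return -(eta_s[ev] - log_risk[ev]).sum()

def offset_forward_step(X, offset, time, event, active):
    best = (np.inf, None, 0.0)
    for j in range(X.shape[1]):
        if j in active:
            continue
        res = minimize_scalar(neg_loglik_offset,
                              args=(X[:, j], offset, time, event))
        if res.fun < best[0]:
            best = (res.fun, j, res.x)
    _, j_new, b_new = best
    # Only the new one-dimensional term is folded into the offset; earlier
    # coefficients are not re-estimated.
    return j_new, offset + b_new * X[:, j_new]
```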
We employed a modified EBIC to select the final models. Although it worked well in our simulations, it tends to be conservative in real data analysis and recruits too few variables, whereas BIC recruits too many variables. It would be interesting to investigate the optimal η in the EBIC penalty term to strike a balance between EBIC and BIC.
Appendix
A. Preliminary lemmas
We present some preliminary lemmas in this section. Given an index set S ⊂ {1,…,p}, we use S+r to denote S ∪ {r} for some r ∈ SC.
Lemma A. Condition (G) is satisfied if .
Without loss of generality, we assume that Xr is the last element of XS+r, with β*S+r,r being the corresponding least false coefficient under the model S+r.
Lemma B. Given S ⊂ {1,…,p} satisfying |S| < ρ and r ∈ SC, the following statements hold under Conditions (B) and (G).
(i) If r ∉ S⋆ and the {Xj : j ∉ S⋆} are independent of {Xj : j ∈ S⋆}, then β*S+r,r = 0, i.e., the least false coefficient for Xr under the model S+r is 0.
(ii) If r ∈ S⋆ ∖ S, then β*S+r,r ≠ 0.
(iii) If r ∈ S⋆ ∖ S and Conditions (E) and (F) are satisfied, then |β*S+r,r| ≥ c1n^{−α}.
Lemma B quantifies the behavior of the least false coefficient β*S+r,r when a noise variable or a signal variable is added to the model.
Lemma C. Under Conditions (B) and (D), if ρ^3 ln p/n → 0, then for each S ⊂ {1,…, p} satisfying |S| ≤ ρ, we can find a neighborhood of β*S of radius c, for some constant c, such that, for k ∈ {0, 1, 2},
Define
Lemma D. Under the same conditions as in Lemma C,
where
Define
Lemma E. Under Conditions (A), (B), and (D), we have
for some constant A10 that does not depend on n.
Lemma F. Under Conditions (B), (D), and (F), given an index set S satisfying |S| ≤ ρ, for any βS with ‖βS − β*S‖ ≤ ζ,
Lemma G. Under Conditions (A)–(F), if ρ^4 ln p/n → 0, there exist two constants A11 and A12 that do not depend on n such that
(i)
and
(ii)
Lemma H. Under Conditions (B) and (D), there exists some constant A14, which does not depend on n, such that
B. Proofs of main theoretical results in Section 3.2
In the following, we provide the proofs for the theoretical results in Section 3.2.
Proof of Theorem 1. We prove the theorem for a generic index set S satisfying S ⊂ {1,…, p}, and |S| < ρ. The change of log-likelihood by adding a variable Xr, where r ∈ SC, can be decomposed as
We restrict our attention to the intersection of the following two events, where
and
According to Lemmas G and H, this intersection holds with probability at least 1 – 8 exp(−3ρ ln p).
If r ∈ S⋆ ∖ S, then by Lemma B (iii), |β*S+r,r| ≥ c1n^{−α}. For any βS+r such that ‖βS+r − β*S+r‖ ≤ ζ, noting that β*S+r is the solution to (4) under model S+r, we apply Taylor’s expansion to obtain
where β̃S+r lies between βS+r and β*S+r, and the last inequality follows from Condition (F). By the convexity of ΓS+r(βS+r), we thus have
Consequently,
for some constant c2 that does not depend on n. Then, we obtain
Withdrawing this restriction completes the proof of Theorem 1. □
Proof of Corollary 1. As shown in Theorem 1, for any S such that S⋆ ⊄ S and |S| < ρ, with probability at least 1 – 8 exp(−3ρ ln p),
If ln p = o(n^{1−2α}), then ln d + 2η ln p = o(n^{1−2α}), and consequently,
Therefore, our forward selection does not stop when S⋆ ⊄ Sk and |Sk| < ρ, with probability at least 1 – 8 exp(−3ρ ln p). Noting that S0 = Ø, the cumulative increment after k steps is at least c2kn^{−2α} when n is sufficiently large.
As shown in the proof of Lemma H,
with probability at least 1 – 3 exp(−3ρ ln p). If S⋆ ⊄ Sk for all k ≤ M, then
which contradicts the definition of M. Hence, S⋆ ⊆ Sk for some k ≤ M. This completes the proof of Corollary 1. □
Proof of Corollary 2. By Lemma B (i), if r ∉ S⋆ and the {Xj : j ∉ S⋆} are independent of {Xj : j ∈ S⋆}, then β*S+r,r = 0. Thus, on the event considered in the proof of Theorem 1,
If , we have for any S such that |S| < ρ and , when n is sufficiently large.
Withdrawing the restriction, we obtain that, at each step, the probability of selecting a noise variable is at most 8 exp(−3ρ ln p). Since |Sk| > c3|S⋆| implies that a noise variable is selected at more than (c3 − 1)|S⋆| steps, then for k ≤ c3|S⋆|,
Therefore, S⋆ ⊆ Sk for some k ≤ c3|S⋆|, with probability at least 1 – 8 exp(−2ρ ln p). This completes the proof of Corollary 2. □
Proof of Corollary 3. By Corollary 2, we know that S⋆ ⊆ Sk for some k ≤ c3|S⋆|. Thus, both of the following hold, where
(i): It can be shown that EBIC(Sk+1) < EBIC(Sk) if and only if ℓSk+1(β̂Sk+1) − ℓSk(β̂Sk) > (ln d + 2η ln p)/(2n). Following the same arguments used to show Eq. (14) in [24], we can show that, with probability tending to 1,
for all η > 0. Therefore, with probability tending to 1, the procedure stops at the kth step and S⋆ ⊆ Sk.
(ii): From Corollary 2 and Part (i), we have S⋆ ⊆ Sk as well as |Sk| ≤ c3|S⋆| with probability going to 1. Hence, the stated result follows. □
Proof of Lemma A. We first show that , has the same sign across t > 0. Let .
where U = β0rXr. Noting that S0(t) ≤ 1 and , we have . Therefore, given , is monotone increasing with respect to U.
Then by [25],
for all t > 0, and hence
has the same sign as −1/β0r, for all t > 0.
Next, by the same argument used above, has the same sign as −1/β0r across t > 0. This completes the proof of Lemma A. □
Proof of Lemma B. We first note that β*S+r is the root of (4) under model S+r.
(A.1) |
Part (i): Let and . By the condition that the is independent of , the is independent of . Thus, by Condition (B),
where the last equality follows from Condition (B). Combining the result and (A.1), it can be shown that
Therefore, β*S+r,r = 0.
If , then by the same arguments, we can show that .
Part (ii): Suppose, to the contrary, that β*S+r,r = 0. By the martingale property, β*S+r is also the solution to the following equation,
Furthermore, it is straightforward to see that
which has the opposite sign. Besides,
which has the same sign as by Condition (G) and the fact that .
Thus, the two quantities above have opposite signs, which is a contradiction. Therefore, β*S+r,r ≠ 0.
Part (iii): Without loss of generality, we assume that Xr is the last element of XS+r. Let er be a vector of length |S| + 1 with the rth element 1 and all other elements 0. By definition, . Then by the Mean Value Theorem,
which reduces to
where is between and . Thus,
where the first inequality follows from the proof of part (ii) that and
have the opposite signs, and the second inequality follows from Condition (F). Then, by Condition (E)
This completes the proof of Lemma B. □
Proof of Lemma C. We only prove the case k = 1, as the cases k = 0 and k = 2 can be proved similarly. Given an index set S of size |S| = s ≤ ρ, let . By Conditions (B) and (D), it can be shown that for any π ∈ ℝs satisfying ||π|| = 1,
Define . Then by (B) and (C), h(βS, π, u) is bounded between −1 and 1 uniformly over and u ∈ [0, τ]. Define the function class
Following the arguments used for Lemma 11 in [1] and Lemma 2.6.17 in [31], we can show that there exists some universal constant A1 such that the class of functions has a VC index bounded by A1s; for the definition of the VC index, see p. 85 in [31]. By Theorem 2.6.7 in [31], there exists some universal constant A2 such that the covering number, uniformly over probability measures Q, is bounded by (A2/ϵ)^{2A1s} for any ϵ > 0; for the definition of covering numbers, see p. 83 in [31].
Thus, by Theorem 1.1 in [27], there exists some constant A3 that depends on A2 only, such that for all ϵ > 0,
By choosing for some universal constant A4, we obtain
Consequently,
where the inequality follows from the combinatorial fact that the number of index sets S of size s is at most p^s.
Let A5 = A4K exp(c + KL). Then,
when n is sufficiently large. Thus, if ρ^2 ln p/n → 0,
Similarly, we can show that
and
for some constants A6 and A7 that do not depend on n. This completes the proof of Lemma C.
Proof of Lemma D. Given an index set S such that |S| < ρ, it is easy to see that ZS(βS) ≤ I + II, where
For term I, by Conditions (B) and (C), it can be shown that , and , for any . Let ϵ1,…,ϵn be a Rademacher sequence. Then we have
where the first inequality follows from Lemma 2.3.1 in [31], the second inequality is trivial, and the third inequality follows from Condition (B) and Lemma A.1 in [30]. Applying Bousquet’s concentration theorem [2] yields that, for any r > 0,
Choose . As , we obtain that, when n is sufficiently large,
(A.2) |
For term II, let Rs(c) denote a ball with dimensionality s and radius c. Let be a collection of cubes that cover the ball Rs(c), where is a cube containing ξℓ with sides of length . Then and . For any , ||ξ − ξℓ|| ≤ c/(Kn²) ≡ ζn. Let . By the Mean Value Theorem,
where is between ξι and 0. Applying Bernstein’s inequality yields that for any r > 0,
and consequently, by choosing ,
(A.3) |
where n is sufficiently large.
Given any , we can similarly show that
Therefore,
(A.4) |
Combining (A.3) and (A.4) implies that
(A.5) |
By the same combinatorial inequality, we obtain that
This completes the proof of Lemma D. □
Proof of Lemma E. We first define the following events:
By Lemma C, we obtain that Pr(Ω1) ≤ exp(−3ρ ln p) and Pr(Ω2) ≤ exp(−3ρ ln p). In the rest of the proof, we restrict our attention to the complement of Ω1 ∪ Ω2. By the arguments in [20], we can show that
We first consider I. By Condition (B),
for any . We obtain that
Then,
for some constant A8 that does not depend on n.
We now bound II. By Condition (B),
Thus,
for some constant A9 that is free of n.
Withdrawing the restriction, the above results indicate that
for some constant A10. This completes the proof of Lemma E. □
Proof of Lemma F. Given any index set S satisfying |S| ≤ ρ and any βS with ‖βS − β*S‖ ≤ ζ, by Taylor’s expansion,
where β̃S lies between βS and β*S. Noting that
by Condition (F), we have . Similarly,
where the last inequality follows from Condition (F). This completes the proof of Lemma F. □
Proof of Lemma G. Let
We consider the event on which the conclusions of Lemmas D and E hold, which has probability at least 1 – 5 exp(−3ρ ln p). In the rest of the proof, we restrict our attention to this event.
Given an index set S such that |S| ≤ ρ, for any βS with for some constant A11 defined later, we have when n is sufficiently large such that , since ρ^4 ln p/n → 0.
Therefore, , where denotes the boundary of .
Noting that , we observe that
and that the right-hand term can be bounded from below by
with some constant A11 satisfying . Therefore,
By the concavity of ℓS(βS), the maximizer must lie inside the ball. Withdrawing the restriction, we have
Furthermore,
for some constant A12. Similarly, we obtain that
This completes the proof of Lemma G. □

Proof of Lemma H. Given any S ⊂ {1,…,p} and r such that |S| < ρ and r ∈ SC,
where
Noting that , it is easy to check that
We restrict our attention to the complement of Ω1, where Ω1 is defined in the proof of Lemma E. We consider I1 first. We have
where is between and . By the same argument, as well. Therefore, . By Conditions (B) and (D), . Applying Bernstein’s inequality, we get
Thus,
Withdrawing the restriction to , we obtain that
Let A14 = 2A13 + 12KL; this completes the proof of Lemma H. □
References
- [1].Belloni A, Chernozhukov V, ℓ1-penalized quantile regression in high-dimensional sparse models, Ann. Statist. 39 (2011) 82–130.
- [2].Bousquet O, A Bennett concentration inequality and its application to suprema of empirical processes, C. R. Math. 334 (2002) 495–500.
- [3].Bradic J, Fan J, Jiang J, Regularization for Cox’s proportional hazards model with NP-dimensionality, Ann. Statist. 39 (2011) 3092–3120.
- [4].Chen J, Chen Z, Extended Bayesian information criteria for model selection with large model spaces, Biometrika 95 (2008) 759–771.
- [5].Cheng M-Y, Honda T, Zhang J-T, Forward variable selection for sparse ultra-high dimensional varying coefficient models, J. Amer. Statist. Assoc. 111 (2016) 1209–1221.
- [6].Fan J, Li R, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360.
- [7].Fan J, Lv J, Sure independence screening for ultrahigh dimensional feature space (with discussion), J. R. Stat. Soc. Ser. B 70 (2008) 849–911.
- [8].Fan J, Samworth R, Wu Y, Ultrahigh dimensional feature selection: Beyond the linear model, J. Mach. Learn. Res. 10 (2009) 2013–2038.
- [9].Fan J, Song R, Sure independence screening in generalized linear models with NP-dimensionality, Ann. Statist. 38 (2010) 3567–3604.
- [10].Fine J, Comparing nonnested Cox models, Biometrika 89 (2002) 635–648.
- [11].Friedman J, Greedy function approximation: A gradient boosting machine, Ann. Statist. 29 (2001) 1189–1232.
- [12].Gorst-Rasmussen A, Scheike T, Independent screening for single-index hazard rate models with ultrahigh dimensional features, J. R. Stat. Soc. Ser. B 75 (2013) 217–245.
- [13].Hao N, Zhang HH, Interaction screening for ultrahigh-dimensional data, J. Amer. Statist. Assoc. 109 (2014) 1285–1301.
- [14].He X, Wang L, Hong HG, Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data, Ann. Statist. 41 (2013) 342–369.
- [15].Hong HG, Chen X, Christiani DC, Li Y, Integrated powered density: Screening ultrahigh dimensional covariates with survival outcomes, Biometrics 74 (2017) 421–429.
- [16].Hong HG, Kang J, Li Y, Conditional screening for ultra-high dimensional covariates with survival outcomes, Lifetime Data Anal. 24 (2018) 45–71.
- [17].Hong HG, Li Y, Feature selection of ultrahigh-dimensional covariates with survival outcomes: A selective review, Appl. Math. Ser. B 32 (2017) 379–396.
- [18].Huang J, Sun T, Ying Z, Yu Y, Zhang C-H, Oracle inequalities for the Lasso in the Cox model, Ann. Statist. 41 (2013) 1142–1165.
- [19].Ing C-K, Lai TL, A stepwise regression method and consistent model selection for high-dimensional sparse linear models, Statist. Sinica 21 (2011) 1473–1513.
- [20].Kong S, Nan B, Non-asymptotic oracle inequalities for the high-dimensional Cox regression via Lasso, Statist. Sinica 24 (2014) 25–42.
- [21].Li J, Zheng Q, Peng L, Huang Z, Survival impact index and ultrahigh-dimensional model-free screening with survival outcomes, Biometrics 72 (2016) 1145–1154.
- [22].Lin DY, Wei L-J, The robust inference for the Cox proportional hazards model, J. Amer. Statist. Assoc. 84 (1989) 1074–1078.
- [23].Luo S, Chen Z, Sequential Lasso cum EBIC for feature selection with ultra-high dimensional feature space, J. Amer. Statist. Assoc. 109 (2014) 1229–1240.
- [24].Luo S, Xu J, Chen Z, Extended Bayesian information criterion in the Cox model with a high-dimensional feature space, Ann. Inst. Statist. Math. 67 (2015) 287–311.
- [25].Schmidt KD, On the Covariance of Monotone Functions of a Random Variable, Professoren des Inst. für Math. Stochastik, 2003.
- [26].Song R, Lu W, Ma S, Jeng XJ, Censored rank independence screening for high-dimensional survival data, Biometrika 101 (2014) 799–814.
- [27].Talagrand M, Sharper bounds for Gaussian and empirical processes, Ann. Probab. 22 (1994) 28–76.
- [28].Tibshirani RJ, Regression shrinkage and selection via the Lasso, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 58 (1996) 267–288.
- [29].Tibshirani RJ, The Lasso method for variable selection in the Cox model, Statist. Medicine 16 (1997) 385–395.
- [30].van de Geer SA, High-dimensional generalized linear models and the Lasso, Ann. Statist. 36 (2008) 614–645.
- [31].van der Vaart AW, Wellner JA, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer, New York, 1996.
- [32].Vittinghoff E, McCulloch CE, Relaxing the rule of ten events per variable in logistic and Cox regression, Amer. J. Epidemiology 165 (2007) 710–718.
- [33].Volinsky CT, Raftery AE, Bayesian information criterion for censored survival models, Biometrics 56 (2000) 256–262.
- [34].Wang H, Forward regression for ultra-high dimensional variable screening, J. Amer. Statist. Assoc. 104 (2009) 1512–1524.
- [35].Xu R, Vaida F, Harrington DP, Using profile likelihood for semiparametric model selection with application to proportional hazards mixed models, Statist. Sinica 19 (2009) 819–842.
- [36].Zhang C-H, Nearly unbiased variable selection under minimax concave penalty, Ann. Statist. 38 (2010) 894–942.
- [37].Zhao SD, Li Y, Principled sure independence screening for Cox models with ultra-high-dimensional covariates, J. Multivariate Anal. 105 (2012) 397–411.
- [38].Zheng Q, Peng L, He X, Globally adaptive quantile regression with ultra-high dimensional data, Ann. Statist. 43 (2015) 2225.
- [39].Zhong W, Zhang T, Zhu Y, Liu JS, Correlation pursuit: forward stepwise variable selection for index models, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 74 (2012) 849–870.
- [40].Zöchbauer-Müller S, Minna JD, Gazdar AF, Aberrant DNA methylation in lung cancer: Biological and clinical implications, The Oncologist 7 (2002) 451–457.
- [41].Zou H, A note on path-based variable selection in the penalized proportional hazards model, Biometrika 95 (2008) 241–247.