Feature selection of ultrahigh-dimensional covariates with survival outcomes: a selective review

HONG Hyokyoung Grace; Yi LI

doi:10.1007/s11766-017-3547-8

. Author manuscript; available in PMC: 2018 Apr 18.

Published in final edited form as: Appl Math. 2017 Dec 29;32(4):379–396. doi: 10.1007/s11766-017-3547-8

Feature selection of ultrahigh-dimensional covariates with survival outcomes: a selective review

HONG Hyokyoung Grace ¹, Yi LI ²

PMCID: PMC5906071 NIHMSID: NIHMS951968 PMID: 29683128

Abstract

Many modern biomedical studies have yielded survival data with high-throughput predictors. The goals of scientific research often lie in identifying predictive biomarkers, understanding biological mechanisms and making accurate and precise predictions. Variable screening is a crucial first step in achieving these goals. This work conducts a selective review of feature screening procedures for survival data with ultrahigh dimensional covariates. We present the main methodologies, along with the key conditions that ensure sure screening properties. The practical utility of these methods is examined via extensive simulations. We conclude the review with some future opportunities in this field.

MR Subject Classification: 97K80

Keywords: survival analysis, ultrahigh dimensional predictors, variable screening, sure screening property

§1 Introduction

Modern biomedical studies have generated abundant survival data with high dimensional biomarkers for various scientific purposes. For instance, identifying genomic profiles that are associated with cancer patients’ survival may help with understanding disease progression processes and designing more effective gene therapies. With the advent of new biotechnologies, the emergence of high-throughput data, such as gene expressions, SNPs, methylation and next-generation RNA sequencing, has pushed the dimensionality of data to a larger scale. In these cases, the dimensionality of covariates may grow exponentially with the sample size and such data has been commonly referred to as ultrahigh dimensional data ([5]).

When the number of covariates (p) is less than the sample size (n), the parametric regression, such as Weibull models, and the semiparametric regression, such as the Cox proportional hazards model and the Accelerated Failure Time (AFT) model, have been routinely used for modeling censored outcome data in many practical settings. When p > n, penalized likelihood methods have been proposed by various authors ([20], [4], [25], [29]) and the oracle properties and statistical error bounds of estimation have been established ([13], [15]). However, when p ≫ n, computational issues inherent in these methods makes them non-applicable to ultrahigh-dimensional statistical learning problems because of serious challenges in “computational expediency, statistical accuracy, and algorithmic stability” ([6]). A recent work by [2] did establish the oracle properties of the regularized partial likelihood estimates under an ultrahigh dimensional setting. The results, however, required the optimizers to the penalized partial likelihood function to be unique and global, which is, in general, difficult to verify, especially when the dimension of covariates is exceedingly high.

A seminal paper by [5] has demonstrated a simple but useful way to deal with ultrahigh dimensional regression. First, a variable screening procedure is used as a fast and crude tool for reducing the dimensionality to a moderate size (usually below the sample size). In the second step, a more sophisticated technique, such as penalized likelihood methods, can be further applied to perform the final feature selection and parameter estimation simultaneously.

In the framework of linear regression with normal errors, [5] showed that sure independence screening (SIS), which recruits features with the largest marginal correlations with the response, has the desirable sure screening property. That is, with probability converging to 1, the screening procedure retains all of the important features in the model. While screening approaches have been actively pursued for completely observed outcome data, the development of ultrahigh dimensional screening tools with survival outcomes, however, has been less fruitful. Several adhoc solutions are available from [21] and [3], though detailed accounts of practical utility or theoretical support are still elusive. Recent years have seen a rapid surge in variable screening methods for survival data, but to our knowledge, no systematic reviews and comparisons are available. To fill the gap, we will review and compare several representative works in this field. Specifically, we first review model-motivated screening methods, including the principled sure screening by [27], the feature aberration at survival times screening by [8] and the conditional screening by [11]. We then review several model-free methods, including the quantile adaptive sure independence screening by [9], the censored rank independence screening procedure by [19], the survival impact index screening by [17] and the integrated powered density screening by [10]. We introduce the motivation of each method, describe the main conditions that lead to the sure screening property and numerically compare all the methods under the same settings.

Below we describe notation and terminologies that will be used throughout this paper. Suppose we have n observations with p covariates, where p ≫ n. Denote by X_ij the jth covariate for subject i, and write X_i = (X_i₁, …, X_ip)^T. Let T_i and C_i be the underlying survival and censoring times, respectively. We, however, only observe Y_i = min{T_i, C_i}, and the event indicator δ_i = I(T_i ≤ C_i), where I(·) is the indicator function. In general, we assume random censoring such that C_i and T_i are independent given X_i. Let N_i(t) = I(Y_i ≤ t, δ_i = 1) be the observed failure process and Ỹ_i(t) = I(Y_i ≥ t) be the at-risk process. We assume (Y_i, δ_i, X_i) are independently and identically distributed (i.i.d). In particular, we assume (T_i, X_ij), i = 1, …, n, are i.i.d copies of (T, X_j), the random variables that underlie the survival time and covariates. Denote by S(·) the marginal survival function of T and by S(t|X) the conditional survival function of T given X, where X = (X₁, …, X_p). Let G(t) = pr(C_i > t) be the survival function of C_i and Ĝ(t) be the Kaplan-Meier estimator of G(t) based on {Y_i, δ_i}, i = 1, …, n. Suppose that S(t|X) only depends on a subset of covariates, denoted by ℳ. The overarching goal of variable screening is to estimate ℳ, which we let be ℳ̂. We note, however, the specific definitions of ℳ and ℳ̂ may differ in the reviewed papers and will be defined under the specific contexts.

§2 Screening methods

2.1 Principled Cox sure independence screening

[27] generalized the sure independence screening ([5]) to the Cox proportional hazards model, which stipulates that the hazard at time t > 0 for a subject i with the vector of covariates X_i is

h (t ∣ X_{i}) = lim_{Δ t \to 0 +} \frac{1}{Δ t} P (t \leq T \leq t + Δ t ∣ T \geq t, X_{i}) = h_{0} (t) exp (\sum_{j = 1}^{p} X_{i j} β_{j}),

(1)

where h₀(t) is an unspecified baseline hazard function.

Under (1), [27] defined ℳ = {j: β_j ≠ 0} and proposed to estimate it as follows. Assuming a (working and possibly misspecified) marginal Cox model on each X_j, namely, $h_{0, j}^{*} (t) exp (X_{i j} β_{j}^{*})$ , they obtained the maximum partial likelihood estimate of $β_{j}^{*}$ , denoted by β̂_j. Then, the importance of X_j was measured by a Wald type statistic for testing $β_{j}^{*} = 0$ . As a result, the estimated ℳ was given by

\hat{M} = {j : I_{j} {({\hat{β}}_{j})}^{\frac{1}{2}} ∣ {\hat{β}}_{j} ∣ \geq λ_{n}},

where β̂_j solves the partial likelihood score equation

U_{j} (β) = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{ν} {X_{i j} - \frac{\sum_{i = 1}^{n} X_{i j} exp (X_{i j} β) {\tilde{Y}}_{i} (t)}{\sum_{i = 1}^{n} exp (X_{i j} β) {\tilde{Y}}_{i} (t)}} d N_{i} (t) = 0,

(2)

$I_{j} ({\hat{β}}_{j}) = - {\frac{\partial U_{j} (β)}{\partial β} ∣}_{β = {\hat{β}}_{j}}$ is the observed information at β̂_j, and λ_n is a pre-specified cut-off that depends on n. Here, ν > 0 in (2) is the study duration, which is assumed to be long enough to ensure that ample events are observed during the interval [0, ν].

As there were no principled ways of choosing λ_n, the work of [27] provided a new perspective of choosing λ_n such that one can achieve the sure screening property while controlling the false positive rate, or the proportion of unimportant covariates incorrectly included in ℳ̂. The rationale is as follows.

If the true model size |ℳ| = s, then the expected false positive rate can be written as

E (\frac{∣ M^{c} \cap \hat{M} ∣}{∣ M^{c} ∣}) = \frac{1}{p - s} \sum_{j \in M^{c}} P {I_{j} {({\hat{β}}_{j})}^{\frac{1}{2}} ∣ {\hat{β}}_{j} ∣ \geq λ_{n}} .

Moreover, when β_j = 0 or j ∈ ℳ^c, $I_{j} {({\hat{β}}_{j})}^{\frac{1}{2}} {\hat{β}}_{j}$ converges in distribution to a standard normal variable. Hence, by setting λ_n = Φ⁻¹(1 − f/(2p)), the expected false positive rate is 2(1 − Φ(λ_n)) = f/p, which is approximately equal to the desirable false positive rate, f/(p − s). Here, Φ(·) is the standard normal cumulative distribution function and f is the number of false positives that one is willing to tolerate. The screening procedure is commonly referred to as the principled Cox sure independence screening (PSIS).

To study the sure screening property, they first established the following “β-min” condition (that is, the true signals have enough marginal strengths): there exist constants c₁ > 0 and 0 < κ < 1/2 such that min_j_∈ℳ |cov[X_ij, E{F_T (C_i|X_i)|X_i}]| ≥ c₁n⁻^κ, where F_T(·|X_i) is the cumulative distribution function of T_i given X_i. Then [27] proved that

min_{j \in M} ∣ β_{j} ∣ \geq c_{2} n^{- κ},

(3)

where c₂ is a positive constant. This result led to the sure screening property

pr (M \subset \hat{M}) \geq 1 - s exp (- c_{3} n^{1 - 2 κ}),

(4)

where c₃ is a positive constant. As n → ∞, pr(ℳ ⊂ ℳ̂) → 1 for the nonpolynomial (NP)-dimensionality problem log(p) = O(n¹⁻²^κ).

[27] demonstrated that, numerically, PSIS is stable and computationally efficient in selecting true signals. However, given that PSIS stems from a Wald test based on a Cox model, its performance is unclear when the underlying assumption of a Cox model fails.

2.2 Feature aberration at survival times screening

With the goal of making the screening procedure less model-centric, [8] proposed a ‘feature aberration at survival times’ (FAST) statistic that measures the aberration of each covariate relative to its at-risk average. Specifically, for covariate X_j, the FAST statistic is defined as

d_{j} = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{ν} {X_{i j} - \frac{\sum_{i = 1}^{n} X_{i j} {\tilde{Y}}_{i} (t)}{\sum_{i = 1}^{n} {\tilde{Y}}_{i} (t)}} d N_{i} (t),

(5)

where t ∈ [0, ν].

The statistic can be justified under an administrative censoring scheme (C_i ≡ ν). With standardized covariates such that E(X_j) = 0 and var(X_j) = 1 for j = 1, …, p, first note that the population version of d_j is

{\tilde{d}}_{j} = E (d_{j}) = cov {X_{j}, F_{T} (ν ∣ X_{i})} + \int_{0}^{ν} cov {X_{j}, F_{T} (t ∣ X_{i})} K (t) d t,

where F_T (t|X_i) = P(T_i ≤ t|X_i) and K(·) is a strictly positive function. Thus, d̃_j is large if cov{X_j, F_T (t|X_i)} has a constant sign throughout t ∈ [0, ν]. Thus, it is reasonable to consider the magnitude of d_j as a marginal utility to rank the importance of X_j.

In addition, by taking β to be 0 in (2), we note that (2) reduces to (5). Therefore, FAST can also be viewed as a score test statistic based on a Cox model, which is a special case of the score-based screening proposed by [28].

To study the sure screening property, [8] assumed that the true hazard function is of the single-index form

h_{i} (t) = h (t, X_{i}^{T} β), i = 1, \dots, n,

(6)

and required the resulting survival function $exp {\int_{0}^{t} h (s, \cdot) d s}$ to be strictly monotonic for each t ≥ 0. They argued that such an assumption holds true for a variety of models, including the additive model ([18]), the Cox model and the AFT model. Under (6), [8] defined the true set as

M = {j : β_{j} \neq 0},

and proposed to estimate it by

\hat{M} = {j : ∣ d_{j} ∣ > λ_{n}},

for a given λ_n.

Assuming various regularity conditions, [8] showed that there exists a threshold ζ_n > 0 such that min_j_∈ℳ |d̃_j| ≥ ζ_n and max_j_∉ℳ |d̃_j| = 0. Thus, the signals d̃_j when j ∈ ℳ are stronger than those when j ∉ ℳ. They further assumed that |cov(X_j,X^Tβ)| ≥ c₁n⁻^κ, j ∈ ℳ, for some c₁ > 0 and κ < 1/2. Then they showed that, by taking λ_n = c₂n⁻^κ for some constant 0 < c₂ ≤ c₁/2, the sure screening property holds even when p grows exponentially fast with n, or,

pr (M \subset \hat{M}) \to 1 when n \to \infty .

As FAST is essentially a score test that requires fitting only the null model, its computation is simpler than the Wald test based screening methods, such as PSIS ([28]). Like the sure independence screening (SIS) of [5], FAST assumes that the covariates present in the true model ℳ are independent of the irrelevant covariates. This assumption is often violated in practice. To account for possible correlations between variables, [5] proposed an iterative SIS procedure (ISIS): after applying the SIS procedure, the relevance of the unselected covariates is reassessed given the selected covariates and a small number of the most relevant features among them can be added to the selected set. These iterative steps are repeated until some stopping criterion is reached. [5] and [3] showed that an iterative screening procedure may perform better than a non-iterative procedure. A similar iterated FAST procedure was proposed by [8]; however, like the general ISIS procedure, its theoretical support is still an open problem.

2.3 Conditional screening

Intensive biomedical research has generated a large body of biological knowledge. Incorporating such knowledge may lead to improved accuracy in modeling. However, the marginal screening approaches, such as PSIS and FAST, were not designed to integrate prior knowledge into variable screening. Recent years have seen an advent of conditional sure independence screening methods that use a priori information; see [1], [12]. By including such important predictors as the conditioning variables, conditional screening ranks the marginal utility of each variable after adjusting for these conditioning variables.

[11] proposed a conditional screening method under the Cox proportional hazards model (1) by computing the marginal contribution of each covariate given priorly known information. This method is referred to as the Cox conditional screening (CoxCS).

Denote by 𝒞 the indices of the set of covariates that are known a priori to be related to the number of covariates. Let ℳ_−𝒞 = {j ∉ 𝒞, β_j ≠ 0}, q = |𝒞|, and X_𝒞 = (X_j, j ∈ 𝒞)^T. [11] proposed to fit the marginal Cox regression by including the known covariates in X_𝒞. Specifically, for each X_j ∉ X_𝒞, they considered the following Cox regression model

h_{j} (t, X_{i}) = h_{j, 0} (t) exp (β_{C}^{T} X_{i C} + β X_{i j}) .

The maximum partial likelihood estimates ${({\hat{β}}_{C}^{T}, {\hat{β}}_{j})}^{T}$ can be obtained by solving the following equations:

V_{j} (β_{C}, β) = {[V_{j, k} (β_{C}, β)]}^{T} = 0_{q + 1},

with

V_{j, k} (β_{C}, β) = \sum_{i = 1}^{n} \int_{0}^{ν} {X_{i k} - \frac{\sum_{i = 1}^{n} X_{i k} {\tilde{Y}}_{i} (t) exp (β_{C}^{T} X_{i C} + β X_{i j})}{\sum_{i = 1}^{n} {\tilde{Y}}_{i} (t) exp (β_{C}^{T} X_{i C} + β X_{i j})}},

for k ∈ 𝒞 ∪ {j}. The key is to recruit variables with large additional contributions given X_𝒞. If the conditioning set X_𝒞 is empty, the method reduces to PSIS. Using the magnitude of β̂_j as a marginal utility to rank the importance of X_j, the set of selected variables among j ∉ 𝒞 is given by

{\hat{M}}_{- C} = {j \notin C : ∣ {\hat{β}}_{j} ∣ > λ_{n}},

for a pre-defined λ_n.

To study the asymptotic behavior of the proposed procedure, [11] assumed the following “beta-min” condition: for constants c₁ > 0 and 0 < κ < 1/2,

min_{j \in M_{- C}} ∣ E [{cov}^{*} {X_{j}, pr (δ = 1 ∣ X) ∣ X_{C}}] ∣ \geq c_{1} n^{- κ},

(7)

where cov^*(ζ₁, ζ₂|ξ) indicates the conditional linear covariance between ζ₁ and ζ₂ given ξ. [11] introduced this new concept in order to approximate the ordinary conditional variance and facilitate the proof of the sure screening property.

Under (7) and some other conditions, [11] proved that the estimated β̂_j converges uniformly to β_j in probability. That is, for any given ε₁ > 0 and ε₂ > 0,

pr (max_{j \in M_{- C}} ∣ {\hat{β}}_{j} - β_{j} ∣ > \frac{c_{2}}{2} (n^{- κ} - ε_{1})) \leq 2 s (q + 1) exp (- c_{3} n^{1 - 2 κ}) + ε_{2},

where c₂ and c₃ are some positive constants, q is the size of 𝒞, and s is the size of ℳ_−𝒞. In addition, they established the following “β-min” condition:

min_{j \in M_{- C}} ∣ β_{j} ∣ \geq c_{2} n^{- κ},

where 0 < κ < 1/2.

Taking λ_n = c₄n⁻^κ with c₄ > 0, this “β-min” condition leads to the following sure screening property:

pr (M_{- C} \subset {\hat{M}}_{- C}) \to 1, as n \to \infty .

The conditional screening approach can identify “marginally unimportant but jointly important” variables, commonly referred to as hidden variables. Failing to include hidden variables in the screening stage may cause undesirable consequences. For example, important features may be missed in model selection or biased inference may occur in the subsequent analysis.

Since CoxCS requires the prior information to be known and informative, it remains statistically challenging to develop efficient screening methods in the absence of such information. In the context of generalized linear models, [12] proposed a data-driven alternative in the absence of prior knowledge. The question of how to extend the data-driven approach to a survival setting, however, would require more intensive theoretical and empirical work.

2.4 Quantile adaptive nonparametric screening

The validity of model-based screening methods, such as PSIS and CoxCS, often hinges upon the assumptions of the underlying models. For example, when the proportional hazards assumption fails, the model-based approaches may incur a large number of false negatives and lead to an invalid result. To develop a model-free framework that can be applicable to a more general class of survival models, [9] proposed the quantile adaptive sure independence screening (QaSIS). This approach performs screening based on the disparity between unconditional and conditional quantiles given each covariate. They defined the true set at a given quantile level α ∈ (0, 1) as

M_{α} = {j : Q_{α} (T ∣ X) functionally depends on X_{j}},

where Q_α(T|X) is the α-th conditional quantile of T given X. That is, Q_α(T|X) = inf{t: pr(T ≤ t|X) ≥ α}.

Now let q_αj = E{Q_α(T|X_j) − Q_α(T)}², where Q_α(T|X_j) is the α-th conditional quantile of T given X_j and Q_α(T) is the marginal α-th quantile of T. Then, q_αj gauges the magnitude of the association of T with X_j, and T and X_j are independent if and only if q_αj = 0 for every α ∈ (0, 1).

To estimate Q_α(T|X_j), [9] proposed a spline estimator, Q̂_α(T|X_j) = π(X_j)^T β̂_j. Here, π(x) = {B₁(x), …, B_N(x)}^T is the normalized B-spline basis functions and β̂_j = (β_j₁, …, β_N)^T is obtained via the inverse probability weighted marginal quantile regression estimator, i.e.,

{\hat{β}}_{j} = \underset{π \in ℝ^{N}}{argmin} \sum_{i = 1}^{n} \frac{δ_{i}}{\hat{G} (Y_{i})} ρ_{α} (Y_{i} - π {(X_{i j})}^{T} β),

where ρ_α(x) = x[α − I(x < 0)] is the quantile loss function. On the other hand, the marginal quantile Q_α(T) can be simply estimated by the inverse function of the Kaplan-Meier estimate of the α-th conditional quantile of T, $F_{KM, n}^{- 1} (α)$ . Hence, q_αj can empirically be estimated by ${\hat{q}}_{α j} = n^{- 1} \sum_{i = 1}^{n} {π {(X_{i j})}^{T} {\hat{β}}_{j} - F_{KM, n}^{- 1} (α)}^{2}$ . Then, ℳ_α is estimated as

{\hat{M}}_{α} = {j : {\hat{q}}_{α_{j}} \geq λ_{n}}

for some λ_n > 0. The method is deemed a model-free approach as it does not resort to a specific model structure.

To guarantee sure screening and control the false selection rate, [9] assumed that 0 < κ < 1/4 and N³n²^κ⁻¹ = o(1), where N is the number of basis functions. By taking λ_n = c^*n⁻^κ with c^*≤ c₁/16 and c₁ > 0 the sure screening property holds. That is,

pr (M_{α} \subset {\hat{M}}_{α}) \geq 1 - s_{α} {17 exp (- c_{3} n^{1 - 4 κ}) + 12 N^{2} exp (- c_{4} N^{- 3} n^{1 - 2 κ})} \to 1, as n \to \infty,

where c₃ and c₄ are some positive constants and s_α is the size of ℳ_α.

Overall, QaSIS is flexible as it allows the set of active variables to vary across quantiles, which is appealing to the analysis of heterogeneous data. However, since not every quantile is estimable under censoring, its performance under heavy censoring is unclear. The question of how to choose the quantile level α in practice to optimize the procedure seems intriguing as well.

2.5 Censored rank independence screening

In biomedical studies, variables are often transformed to achieve uniformity across different platforms, and from time to time, outliers are observed in predictors. It is desirable for screening tools to possess invariance properties under transformations of variables (X_i or T_i) and robustness against outliers. As Kendall’s τ, a widely used measure of correlation, is robust against heavy tailed distributions and invariant under monotonic transformations, researchers proposed Kendall’s τ based screening methods ([16]). To accommodate survival data, [19] considered the concordance between failure time T and covariate X_j in the presence of censoring.

Define τ_j = pr(X_ji > X_ji_′, T_i > T_i_′) − 1/4, (j = 1, …, p, i = 1, …, n). [19] advocated using τ_j as the marginal utility measure for ranking predictors for several reasons. First, τ_j measures the association between T and X_j and τ_j is 0 if T and X_j are independent. Second, τ_j is invariant to any monotonic transformations on T and X_j. Third, this rank based measure is robust against outliers in the data.

Let ψ_j = δ_i_′I(X_ij > X_i_′_j, Y_i > Y_i_′)/G²(Y_i_′). It can be easily shown that E(ψ_j) = pr(X_ij > X_i_′_j, T_i > T_i_′). Thus, a natural estimate of τ_j is:

{\hat{τ}}_{j} = {(\begin{matrix} n \\ 2 \end{matrix})}^{- 1} \sum_{i < i^{'}} \frac{δ_{i^{'}}}{{\hat{G}}^{2} (Y_{i^{'}})} I (X_{i j} > X_{i^{'} j}, Y_{i} > Y_{i^{'}}) - 1 / 4,

where Ĝ(·) is the Kaplan–Meier estimator of G(t) = pr(C_i ≥ t).

Define the true set as

M = {j : pr (T > t ∣ X) functionally depends on X_{j}} .

Then, it is estimated by a set of important predictors with large τ̂_j:

\hat{M} = {j : ∣ {\hat{τ}}_{j} ∣ > λ_{n}},

where λ_n is a predefined threshold value. This procedure is called the censored rank independence screening (CRIS). [19] showed that τ̂_j is a consistent estimator for τ_j. That is,

pr (max_{j} ∣ {\hat{τ}}_{j} - τ ∣ < c_{6} n^{- κ}) \leq p {2.5 n exp (- c_{1} n) + 4 exp (- c_{4} n^{1 - 2 κ}) + 2.5 n exp (- c_{2} n^{1 - 2 κ})},

for some positive constants c₁, c₂, c₄, and c₆. Moreover, when min_j_∈ℳ |pr(X₁_j > X₂_j, T₁ > T₂) − 1/4| ≥ c₀n⁻^κ for some 0 < κ < 1/2 and c₀ > 0, taking λ_n = c₇n⁻^κ with c₇ ≤ c₀/2 leads to

pr (M \subset \hat{M}) \geq 1 - s {2.5 n exp (- c_{1} n) + 4 exp (- c_{4} n^{1 - 2 k}) + 2.5 n exp (- c_{2} n^{1 - 2 κ})} \to 1, as n \to \infty,

where s is the size of ℳ.

CRIS is a model-free approach that enjoys the sure screening property. Moreover, the screening statistic is a U-statistic with a bounded kernel function. The large sample results hold even without the tail probability conditions. However, the computation of τ̂_j requires the comparison of all possible pairs of samples. This exceedingly heavy computational burden may hamper its applicability when the sample size is large.

2.6 Survival impact index screening

QaSIS and CRIS are model-free screeners, but both employ the inverse probability of censoring weighting (IPCW), which may be unstable, especially when evaluated at large observed survival times. In addition, they may not capture the full-range impact of covariates on the overall survival since QaSIS focuses on a specific quantile level and CRIS relies on a summarized value of association. To more fully capture the overall influence of a covariate on the outcome distribution, [17] proposed a new metric called the survival impact index (SII), which evaluates the absolute deviation of the covariate-stratified survival distribution from the unstratified survival distribution.

Specifically, for each X_j, j = 1, …, p, SII is defined as

ξ_{j} = \int_{t \in T, x \in X} W_{ξ} (t, x) ∣ S (t ∣ X_{j} > x) - S (t) ∣ dxdt,

where W_ξ(t, x) is a pre-determined weight function introduced to capture the covariate impact on either early or late survival. The authors argued that if, for at least one t and one x, the survival function stratified on X_j > x differs from the unstratified survival function at t, then ξ_j will be non-zero under mild conditions. On the other hand, if T and X_j are independent, then ξ_j = 0. Hence, ξ_j serves as a sensible index for characterizing the importance of X_j in influencing the distribution of T.

To estimate ξ_j, [17] proposed to use

{\hat{ξ}}_{j} = \int_{t \in T, x \in X} W_{ξ} (t, x) ∣ \hat{S} (t ∣ X_{j} > x) - \hat{S} (t) ∣ dxdt,

(8)

where Ŝ (t|X_j > x) is the Kaplan-Meier estimator based on sub-sample X_j > x and Ŝ(t) is the Kaplan-Meier estimator for the survival function of T. Note that (8) indicates SII does not need to adopt IPCW to handle the random censoring.

The set of important predictors with large ξ̂_j is defined by ℳ̂ = {j: ξ̂_j > λ_n}. Under some regularity conditions, given 24/(nμγ⁴) < ε < 1, [17] proved that the estimated survival impact index ξ̂_j is uniformly consistent to ξ_j, or

pr (max_{j} ∣ {\hat{ξ}}_{j} - ξ_{j} ∣ > ε) \leq c_{3} exp (- n c_{4} ε^{2} - c_{5} log ε),

(9)

where c₃, c₄, c₅, μ, and γ are some positive constants and n is sufficiently large. With the true set being

M = {j : ξ_{j} > 0},

if p = O(exp(n^c)) for some 0 < c < 1 and min_j_∈ℳξ_j > c₀n⁻^α for some constant c₀ > 0, 0 ≤ α < (1 − c)/2, [17] proved that

pr (M \subset \hat{M}) \geq 1 - c_{3} s exp {- n c_{4} {(c_{0} - b)}^{2} n^{- 2 α} - c_{5} log ((c_{0} - b) n^{- α})},

where s is the size of ℳ. It follows that the sure screening property holds by taking λ_n = bn⁻^α with b ≤ c₀/2. That is, pr(ℳ ⊂ ℳ̂) → 1 as n→∞.

One potential challenge with SII is that (8) needs to be computed over a range of values for both T and X_j, which is often associated with high computation cost. The problem of how to select the weight function may warrant more research.

2.7 Integrated powered density screening

In a survival setting, nonparametric variable screeners have focused on discerning how each candidate variable influences survival functions. One way of detecting such influence is by studying the variability of survival functions for the subpopulations or strata defined by each variable. The difference patterns, however, may vary across covariates. Specifically, the differences may occur either during the early or late period in the follow-up due to disease-related characteristics. For example, survival differences between chemotherapy versus both chemotherapy and radiation treatment groups among childhood cancer patients may be more pronounced right after treatment, while survival differences between the EGFR mutation status among nonsmall cell lung cancer patients may be more obvious long after the onset of cancer. Therefore, screening approaches that rely on a single screening criterion may not be able to capture the complex difference patterns and may lead to false non-discovery. [10] proposed to consider the following integrated powered density (IPOD):

\int_{0}^{t} f^{γ} (s) d s,

where a power index γ(> 0) inflates either early (γ > 1) or late differences (γ < 1) during the life span and thus gives more flexibility in detecting distributional differences. IPOD resembles the cumulative distribution function (CDF) and satisfies the basic properties of CDFs, except that it does not necessarily approach one when t→∞.

To derive the screening criterion, first consider a discrete X_j with R_j categories. The unique property of IPOD motivates the following marginal utility to detect distributional differences:

I_{j}^{(γ)} = max_{r_{1}, r_{2} \in {1, \dots, R_{j}}} sup_{t \in [0, ν]} | \int_{0}^{t} f_{T ∣ X_{j}}^{γ} (s ∣ X_{j} = r_{1}) d s - \int_{0}^{t} f_{T ∣ X_{j}}^{γ} (s ∣ X_{j} = r_{2}) d s |,

(10)

where f_{T|X_j}(s|X_j = r) denotes the conditional density function of T given X_j = r. Since $I_{j}^{(γ)} = 0$ if and only if T and X_j are independent, it serves as a measure of marginal utility for each X_j. The framework of IPOD is general by accommodating different γ’s. For example, when γ = 1, (10) is simply the classical Kolmogorov difference: max_{r₁,r₂∈{1,…,R_j}} sup_t_∈[0_,ν_] |F_{T|X_j}(t|X_j = r₁) − F_{T|X_j}(t|X_j = r₂)|.

Denote by h_n > 0 the bandwidth of a kernel function K(·). Then, (10) can be estimated by

{\hat{I}}_{j}^{(γ)} = max_{r_{1}, r_{2} \in {1, \dots, R_{j}}} sup_{t \in [0, ν]} | \int_{0}^{t} {\hat{f}}_{T ∣ X_{j}}^{γ} (s ∣ X_{j} = r_{1}) d s - \int_{0}^{t} {\hat{f}}_{T ∣ X_{j}}^{γ} (s ∣ X_{j} = r_{2}) d s |,

where Ŝ(t) is the Kaplan–Meier estimator for the survival function of T,

{\hat{f}}_{T} (t) = \sum_{i} K ((t - t_{i}) / h_{n}) ({\hat{S}}_{T} (t_{i - 1}) - {\hat{S}}_{T} (t_{i})),

and the conditional density estimator f̂_{T|X_j}(t|X_j = r) can be obtained similarly as f̂_T (t) by restricting samples to X_j = r.

[10] defined the true important feature set as

M = {j : S (t ∣ X) functionally depends on X_{j} for some t \in (0, \infty)},

which is estimated by ${\hat{M}}_{1} = {j : {\hat{I}}_{j}^{(γ)} > λ_{n}}$ where λ_n > 0. This procedure is referred to as the IPOD screening.

When a covariate X_j is continuous, it can be discretized into R_j slices by using the percentiles of the empirical distribution of X_j. Suppose there are N different ways of slicing a continuous covariate X_j, denoted by Λ_ju, u = 1, …, N with each slicing Λ_ju containing R_ju intervals. To ensure there are enough sample within each slice for all slicing schemes, one may take R_ju = 3, …, [log(n)], which gives N = [log(n)−2] slicing schemes. Denoting ${\hat{I}}_{j, Λ_{j, u}}^{(γ)}$ as the IPOD screening statistic corresponding to the slicing scheme of Λ_ju, they proposed the following fused IPOD screening statistic that collects all information from each slice:

{\tilde{I}}_{j}^{(γ)} = \sum_{u = 1}^{N} {\hat{I}}_{j, Λ_{j u}}^{(γ)},

which leads to the following screening criterion:

{\hat{M}}_{2} = {j : {\tilde{I}}_{j}^{(γ)} > λ_{n}},

where λ_n > 0 is a pre-specified constant.

For large sample results, [10] stipulated the following assumptions. Let Λ_juo be the partition based on the theoretical quantiles q_ju₍_r₎, r = 0, …, R_ju of X_j and $I_{j o}^{(γ)} = \sum_{u = 1}^{N} I_{j, Λ_{juo}}^{(γ)}$ . Assume that there exist a c > 0 and 0 < v < 1/2 such that ${min}_{j \in M} I_{j o}^{(γ)} \geq 2 c n^{- v}$ for a specific γ.

When covariates include both continuous and discrete values, under the above conditions, [10] proved that

pr (M \subset {\hat{M}}_{2}) \geq 1 - O (Np exp (- b_{2} n^{1 - 3 κ - 2 v - 2 μ - 2 ρ} + κ log n)),

where b₂ is a positive constant. For 0 < α < 1 − 3κ − 2v − 2μ − 2ρ, if N = O(log n) and log p = O(n^α), the fused IPOD has the sure screening property.

IPOD enjoys the invariance property like other nonparametric screeners such as SII and CRIS, but it is more computationally efficient with increasing n. The performance of the method depends on how well the distribution function can be estimated on each covariate-defined stratum. Hence, it may not work well for small sample sizes. Moreover, since smoothing techniques are used to evaluate the density functions, the choice of bandwidth affects the accuracy of results.

§3 Numerical comparisons

All the reviewed papers were exemplified with intensive simulations. As the simulation setup differs across different papers, to make fair comparisons and give general guidelines for practical utility, we considered various simulation settings, under which the methods reviewed in Section 2 are all compared. For each configuration, a total of 100 simulated datasets were generated. We considered n = 100 and n = 300 and explored how the performance of the methods improved with sample size.

Example 1 (Nonlinear covariate-response relationship)

The survival time was generated from log(T) = g₁(X₁)+g₂(X₂)+g₃(X₃)+g₄(X₄)+ε, where g₁(x) = 5x, g₂(x) = −4x(1−x), g₃(x) = 10[exp{−3(x−1)²}+exp{−4(x−3)²}]−1.5, and g₄(x) = 4sin(2πx). The vector of covariates X was generated from a multivariate normal distribution with mean 0, variance 1, and a first order autoregressive correlation structure, i.e., cor(X_j,X_j_′) = ρ^|^j⁻^j^′| (j, j′ = 1, …, p). The error ε ~ N(0, 1) is independent of X. The censoring time C was generated from a 3-component normal mixture distribution N(0, 4) − N(5, 1) + 0.5N(25, 1). This example was adopted from [17].

Example 2 (Rayleigh model)

The survival time was generated from a hazard function h(t|X) = 2t(|X₁|+|X₂|). All covariates were generated from the multivariate normal distribution with mean 0, variance 1, and an exchangeable correlation structure with the true parameter value of ρ. In this case, the marginal correlation between each of the active variables, X₁ and X₂, and the survival time is 0. The censoring time C was generated from a uniform distribution on (0, 3). This example was adopted from [10].

Example 3 (Linear transformation models)

The survival time was generated from the class of linear transformation models

log {0.5 (e^{2 T} - 1)} = - β^{T} X + ε,

where $β = {(- 1, - 0.9, 0_{6}^{T}, 0.8, 1.0, 0_{p - 10}^{T})}^{T}$ and X were generated from a multivariate normal distribution with mean 0, variance 1, and a first order autoregressive correlation structure, i.e., cor(X_j,X_j_′) = ρ^|^j⁻^j^′| (j, j′ = 1, …, p). The error was generated from the standard normal distribution and the censoring time was generated from a uniform distribution on (0, 3). This example was adopted from [19].

Example 4 (Discrete covariates)

The survival time was generated from the proportional hazards model,

h (t ∣ X) = 0.1 exp {\sum_{j = 1}^{p} β_{j} I (X_{j} \in {2, 3})},

where $β = {(- 1_{5}, 0_{p - 5}^{T})}^{T}$ . The covariates x^★ underlying these discrete variables were generated from a multivariate normal distribution with mean 0 and a covariance matrix Σ = (σ_jj_′)_p×p, σ_jj = 1 and σ_jj_′ = 0.5 for j ≠ j′. For each j, $x_{j}^{*}$ was further quarterized by its quartile values: the obtained quarterly variable X_j = 1 if $x_{j}^{*}$ is less than the lower quartile, 2 if between the lower quartile and the median, 3 if between the median and the upper quartile, and 4 if else. The censoring time was generated from a uniform distribution on (0, 40). This example was adopted and modified from [10].

Example 5 (Hidden variables)

The survival time was generated from a Cox model:

h (t ∣ X) = exp (β^{T} X),

where X were generated from a multivariate normal distribution with mean 0, variance 1, and an equal correlation of 0.5 and $β = (1_{5}^{T}, - 2.5, 0_{p - 6}^{T})$ . In this case, X₆ has a lower marginal utility than all the noise variables. The censoring time was generated from a uniform distribution on (0, 10). This example was adopted from [11].

Following the literature, we used three metrics as the criteria for comparisons, the minimum model size (MMS; the minimum number of variables that need to be selected in order to include all active variables), the true positive rate (TPR; the proportion of active variables selected in the first [n/log n] variables), and the probability of inclusion of the true model (PIT; the probability of all active variables selected in the first [n/log n] variables). MMS was reported as the median and TPR and PIT as the averages over 100 repetitions in Tables 1 and 2. Three conditioning sets 𝒞₁ = {X₁}, 𝒞₂ = {X₁,X₂}, and 𝒞₃ = {X₁, an inactive variable} were used for the conditional screening method ([11]), whenever appropriate. Many screeners assumed that the active variables are independent of the inactive variables, which is difficult to satisfy in practice. To explore how sensitive the competing methods are toward this assumption, we set the correlation coefficient ρ to be 0, 0.5, and 0.8 in each example, except in Example 5 wherein ρ = 0.5 was carefully chosen to generate a hidden variable.

Table 1.

Average runtime (seconds) of nonparametric screeners for Example 1 on a CPU with 2.9 GHz Intel Core i5 and 8GB of memory

(n, p)	CRIS	SII	IPOD
(100, 1000)	10.873	53.307	12.030
(100, 10000)	83.384	524.126	119.233
(300, 1000)	77.244	659.949	18.139
(300, 10000)	625.341	5607.038	175.598

Open in a new tab

Table 2.

Comparisons of competing methods when (n, p) = (100, 1000)

	MMS	TPR	PIT	MMS	TPR	PIT	MMS	TPR	PIT
Example 1	ρ = 0			ρ = 0.5			ρ = 0.8

PSIS	660 (391)	0.46	0.00	744 (344)	0.27	0.01	580 (505)	0.22	0.01
FAST	601 (372)	0.18	0.00	765 (262)	0.17	0.00	672 (413)	0.08	0.00
CoxCS (x₁)	668 (361)	0.45	0.00	621 (518)	0.50	0.02	333 (618)	0.59	0.10
CoxCS (x₁, x₂)	677 (425)	0.54	0.00	488 (522)	0.57	0.02	390 (551)	0.59	0.06
CoxCS (x₁, an inactive)	694 (379)	0.44	0.00	604 (503)	0.51	0.02	391 (588)	0.60	0.11
QaSIS	702 (434)	0.22	0.00	618 (386)	0.22	0.00	381 (394)	0.31	0.00
CRIS	722 (344)	0.52	0.00	746 (399)	0.27	0.00	696 (424)	0.22	0.00
SII	509 (375)	0.48	0.00	560 (377)	0.41	0.00	231 (332)	0.56	0.08
IPOD (γ = 0.8)	388 (420)	0.63	0.02	441 (517)	0.56	0.00	160 (221)	0.62	0.08
IPOD (γ = 1)	402 (378)	0.61	0.03	457 (496)	0.52	0.01	176 (217)	0.59	0.06
IPOD (γ = 1.2)	456 (415)	0.57	0.02	495 (462)	0.45	0.00	216 (284)	0.52	0.05

Example 2	ρ = 0			ρ = 0.5			ρ = 0.8

PSIS	764 (272)	0.01	0.00	733 (423)	0.02	0.00	699 (291)	0.01	0.00
FAST	757 (278)	0.01	0.00	719 (435)	0.02	0.00	694 (306)	0.01	0.00
CoxCS (x₁)	433 (539)	0.52	0.03	454 (492)	0.50	0.01	444 (505)	0.51	0.02
CoxCS (x₁, an inactive)	445 (491)	0.52	0.03	477 (540)	0.50	0.01	415 (531)	0.50	0.01
QaSIS	514 (372)	0.01	0.00	472 (489)	0.03	0.01	417 (450)	0.03	0.00
CRIS	705 (278)	0.01	0.00	710 (401)	0.00	0.00	658 (390)	0.02	0.00
SII	352 (307)	0.03	0.01	316 (315)	0.02	0.00	237 (239)	0.07	0.02
IPOD (γ = 0.8)	133 (265)	0.41	0.14	130 (237)	0.40	0.13	58 (172)	0.49	0.23
IPOD (γ = 1)	125 (279)	0.42	0.17	132 (238)	0.38	0.15	65 (142)	0.50	0.21
IPOD (γ = 1.2)	148 (270)	0.40	0.15	147 (232)	0.36	0.11	79 (148)	0.52	0.23

Example 3	ρ = 0			ρ = 0.5			ρ = 0.8

PSIS	7 (21)	0.93	0.71	4 (0)	1.00	1.00	5 (2)	1.00	0.99
FAST	7 (28)	0.92	0.70	4 (0)	1.00	1.00	5 (2)	1.00	0.99
CoxCS (x₁)	5 (6)	0.97	0.88	4 (2)	0.98	0.92	27 (120)	0.87	0.48
CoxCS (x₁, x₂)	4 (1)	0.99	0.95	4 (0)	1.00	1.00	4 (0)	1.00	1.00
CoxCS (x₁, an inactive)	6 (6)	0.97	0.88	5 (2)	0.98	0.92	8 (11)	0.94	0.78
QaSIS	267 (261)	0.17	0.00	111 (139)	0.30	0.01	126 (110)	0.30	0.01
CRIS	7 (9)	0.95	0.79	4 (0)	1.00	1.00	5 (2)	1.00	0.99
SII	113 (220)	0.56	0.08	18 (33)	0.86	0.54	23 (50)	0.82	0.50
IPOD (γ = 0.8)	121 (194)	0.61	0.05	24 (58)	0.86	0.49	25 (52)	0.83	0.45
IPOD (γ = 1)	88 (136)	0.70	0.21	10 (20)	0.92	0.70	13 (17)	0.92	0.73
IPOD (γ = 1.2)	68 (112)	0.73	0.24	7 (9)	0.95	0.81	9 (11)	0.96	0.83

Example 4	ρ = 0			ρ = 0.5			ρ = 0.8

PSIS	863 (194)	0.01	0.00	876 (156)	0.03	0.00	855 (209)	0.03	0.00
FAST	845 (211)	0.03	0.00	857 (166)	0.05	0.00	820 (234)	0.05	0.00
CoxCS (x₁)	851 (237)	0.22	0.00	833 (195)	0.23	0.00	783 (244)	0.23	0.00
CoxCS (x₁, x₂)	820 (244)	0.41	0.00	733 (303)	0.42	0.00	737 (316)	0.41	0.00
CoxCS (x₁, an inactive)	827 (245)	0.21	0.00	830 (240)	0.22	0.00	819 (244)	0.22	0.00
QaSIS	770 (330)	0.09	0.00	785 (304)	0.10	0.00	747 (297)	0.11	0.00
CRIS	872 (176)	0.01	0.00	863 (159)	0.03	0.00	879 (206)	0.01	0.00
SII	489 (393)	0.10	0.00	435 (292)	0.18	0.00	349 (258)	0.23	0.00
IPOD(γ = 0.8)	549 (382)	0.15	0.00	457 (296)	0.20	0.00	401 (315)	0.21	0.00
IPOD (γ = 1)	411 (412)	0.27	0.00	363 (316)	0.32	0.00	286 (312)	0.33	0.00
IPOD (γ = 1.2)	398 (429)	0.33	0.01	294 (279)	0.35	0.00	242 (273)	0.41	0.00

Example 5	ρ = 0			ρ = 0.5			ρ = 0.8

PSIS	-	-	-	1000 (0)	0.53	0.00	-	-	-
FAST	-	-	-	1000 (3)	0.25	0.00	-	-	-
CoxCS (x₁)	-	-	-	131 (199)	0.71	0.13	-	-	-
CoxCS (x₁, x₂)	-	-	-	93 (157)	0.80	0.20	-	-	-
CoxCS (x₁, an inactive)	-	-	-	108 (131)	0.74	0.14	-	-	-
QaSIS	-	-	-	895 (254)	0.08	0.00	-	-	-
CRIS	-	-	-	1000 (0)	0.54	0.00	-	-	-
SII	-	-	-	992 (42)	0.31	0.00	-	-	-
IPOD (γ = 0.8)	-	-	-	782 (351)	0.11	0.00	-	-	-
IPOD (γ = 1)	-	-	-	945 (122)	0.22	0.00	-	-	-
IPOD (γ = 1.2)	-	-	-	985 (57)	0.26	0.00	-	-	-

Open in a new tab

Table 1 documents a comparison of computation time of nonparametric screening methods with various n and p. It shows that the computation time of CRIS increases non-proportionally with the sample size. SII is also computationally expensive as it needs to exhaustively search the ranges of both T and X_j. On the other hand, IPOD seems to be less impacted by the increasing n and p.

Tables 2 and 3 demonstrate that when the covariates and the response have a nonlinear relationship (Example 1), IPOD, SII, and CoxCS outperformed the other methods. In Example 2 where the marginal correlation between each active variable and the survival time is 0, PSIS, FAST, and CRIS performed poorly. With the linear transformation model in Example 3, all methods easily identified active variables, especially when the sample size was large. When the covariates are discrete (Example 4), IPOD, which is designed to accommodate discrete variables, outperformed all the other methods. In Example 5, all marginal screeners had difficulties recruiting a “marginally unimportant but jointly important” variable X₆. Only the conditional screening approach, CoxCS, was able to detect the hidden variable. In addition, the performance of IPOD varied with different γ. The conditional approach might fail when the dependence among covariates is too strong, as shown in Example 3 with ρ = 0.8.

Table 3.

Comparisons of competing methods when (n, p) = (300, 1000)

	MMS	TPR	PIT	MMS	TPR	PIT	MMS	TPR	PIT
Example 1	ρ = 0			ρ = 0.5			ρ = 0.8

PSIS	571 (407)	0.61	0.01	745 (329)	0.35	0.00	448 (586)	0.46	0.11
FAST	538 (443)	0.56	0.04	803 (285)	0.29	0.00	569 (473)	0.32	0.00
CoxCS (x₁)	598 (508)	0.64	0.02	439 (525)	0.64	0.09	50 (187)	0.84	0.52
CoxCS (x₁, x₂)	488 (435)	0.67	0.03	454 (600)	0.66	0.09	350 (568)	0.67	0.12
CoxCS(x₁, an inactive)	602 (491)	0.64	0.02	487 (524)	0.63	0.10	71 (195)	0.81	0.45
QaSIS	645 (453)	0.50	0.00	590 (434)	0.52	0.02	290 (454)	0.66	0.10
CRIS	673 (374)	0.57	0.01	743 (282)	0.40	0.00	385 (599)	0.51	0.14
SII	423 (442)	0.68	0.07	333 (420)	0.65	0.01	36 (120)	0.87	0.56
IPOD (γ = 0.8)	76 (142)	0.85	0.39	42 (97)	0.90	0.59	5 (8)	0.99	0.95
IPOD (γ = 1)	115 (152)	0.81	0.25	57 (133)	0.87	0.47	9 (12)	0.97	0.89
IPOD (γ = 1.2)	161 (186)	0.80	0.20	82 (143)	0.84	0.37	12 (23)	0.96	0.84

Example 2	ρ = 0			ρ = 0.5			ρ = 0.8

PSIS	739 (361)	0.01	0.00	719 (315)	0.04	0.00	737 (421)	0.06	0.01
FAST	725 (374)	0.03	0.00	705 (324)	0.06	0.00	731 (441)	0.07	0.01
CoxCS (x₁)	484 (548)	0.52	0.04	537 (504)	0.56	0.11	404 (466)	0.54	0.09
CoxCS (x₁, an inactive)	486 (557)	0.52	0.04	465 (428)	0.54	0.08	421 (499)	0.52	0.05
QaSIS	120 (163)	0.35	0.14	83 (134)	0.58	0.33	48 (102)	0.70	0.51
CRIS	734 (354)	0.02	0.00	712 (329)	0.05	0.00	701 (316)	0.04	0.00
SII	45 (53)	0.76	0.58	28 (37)	0.85	0.76	12 (34)	0.91	0.85
IPOD (γ = 0.8)	5 (9)	0.96	0.92	2 (2)	0.97	0.96	2 (4)	0.98	0.97
IPOD (γ = 1)	5 (12)	0.96	0.93	2 (2)	0.97	0.96	2 (3)	0.99	0.98
IPOD (γ = 1.2)	5 (12)	0.94	0.89	3 (3)	0.97	0.96	3 (4)	0.98	0.97

Example 3	ρ = 0			ρ = 0.5			ρ = 0.8

PSIS	4 (0)	1.00	0.99	4 (0)	1.00	1.00	4 (0)	1.00	1.00
FAST	4 (0)	1.00	0.99	4 (0)	1.00	1.00	4 (1)	1.00	1.00
CoxCS (x₁)	4 (0)	1.00	0.99	4 (1)	1.00	1.00	12 (29)	0.95	0.81
CoxCS (x₁, x₂)	4 (0)	1.00	1.00	4 (0)	1.00	1.00	4 (0)	1.00	1.00
CoxCS (x₁, an inactive)	5 (0)	1.00	0.99	5 (1)	1.00	1.00	7 (1)	1.00	1.00
QaSIS	23 (21)	0.95	0.84	9 (13)	0.99	0.97	9 (10)	0.98	0.96
CRIS	4 (0)	1.00	1.00	4 (0)	1.00	1.00	4 (0)	1.00	1.00
SII	6 (6)	0.98	0.94	4 (0)	1.00	1.00	5 (2)	1.00	1.00
IPOD (γ = 0.8)	4 (1)	0.99	0.96	4 (0)	1.00	1.00	4 (1)	1.00	1.00
IPOD (γ = 1)	4 (1)	0.99	0.96	4 (0)	1.00	1.00	4 (1)	1.00	1.00
IPOD (γ = 2.3)	4 (0)	0.99	0.98	4 (0)	1.00	1.00	4 (1)	1.00	1.00

Example 4	ρ = 0			ρ = 0.5			ρ = 0.8

PSIS	870 (163)	0.05	0.00	842 (212)	0.05	0.00	889 (209)	0.06	0.00
FAST	849 (191)	0.10	0.00	825 (241)	0.10	0.00	872 (236)	0.09	0.00
CoxCS (x₁)	831 (201)	0.27	0.00	807 (258)	0.26	0.00	812 (256)	0.24	0.00
CoxCS (x₁, x₂)	807 (228)	0.45	0.00	781 (346)	0.44	0.00	834 (261)	0.43	0.00
CoxCS (x₁, inactive)	836 (197)	0.27	0.00	799 (292)	0.25	0.00	842 (241)	0.24	0.00
QaSIS	636 (422)	0.36	0.00	488 (545)	0.40	0.02	531 (499)	0.41	0.00
CRIS	849 (192)	0.05	0.00	875 (144)	0.03	0.00	913 (109)	0.04	0.00
SII	103 (110)	0.73	0.17	42 (80)	0.90	0.57	36 (55)	0.91	0.64
IPOD (γ = 0.8)	94 (147)	0.77	0.26	81 (116)	0.83	0.34	50 (90)	0.89	0.56
IPOD (γ = 1)	39 (65)	0.90	0.60	28 (44)	0.94	0.72	22 (41)	0.95	0.74
IPOD (γ = 1.2)	28 (47)	0.94	0.71	18 (26)	0.96	0.84	19 (32)	0.96	0.81

Example 5	ρ = 0			ρ = 0.5			ρ = 0.8

PSIS	-	-	-	1000 (0)	0.83	0.00	-	-	-
FAST	-	-	-	1000 (0)	0.71	0.00	-	-	-
CoxCS (x₁)	-	-	-	8 (11)	0.98	0.91	-	-	-
CoxCS (x₁, x₂)	-	-	-	6 (2)	1.00	0.98	-	-	-
CoxCS (x₁, inactive)	-	-	-	8 (3)	0.99	0.97	-	-	-
QaSIS	-	-	-	990 (49)	0.27	0.00	-	-	-
CRIS	-	-	-	1000 (0)	0.82	0.00	-	-	-
SII	-	-	-	1000 (0)	0.75	0.00	-	-	-
IPOD (γ = 0.8)	-	-	-	990 (59)	0.57	0.00	-	-	-
IPOD (γ = 1)	-	-	-	1000 (1)	0.69	0.00	-	-	-
IPOD (γ = 1.2)	-	-	-	1000 (0)	0.71	0.00	-	-	-

Open in a new tab

§4 Discussion

The aim of this paper is to provide a selective overview on feature screening for ultrahigh-dimensional survival data. We study the justifications of some commonly used methods, paying attention to the motivation and rationale of the screening statistics. Under the same simulation settings, we have numerically compared all the reviewed methods and commented on their practical utility.

With the advent of the biomedical big data era, variable screening for ultrahigh dimensional survival data has been rapidly evolving; see some recent works in [26], [22], [24], and [23]. Variable screening is becoming a standard and indispensable analytical tool for ultrahigh dimensional data analysis, especially when outcomes are subject to censoring.

Even with the discussed progress, from our perspectives, there are several open problems that may attract researchers’ attention. First, as the majority of works in this field have focused on right-censored survival times, it would be of interest to extend the works to accommodate more complex censoring mechanisms, including left censoring, interval censoring and left/right truncation.

Second, most authors explicitly or implicitly assumed partial orthogonality ([7]). That is, they assumed the active variables and inactive variables are independent. This assumption is often violated in practice, as variables tend to be dependent, no matter whether they are active or inactive. It is worth investigating how to relax this condition in the theoretical development or to design a more efficient screening method for dependent predictors.

Third, as shown in the numerical studies, variable screening methods do not necessarily agree with each other as they tend to select different variables. Hence, a natural question is how to integrate the results from these different screening methods, while controlling false positives as well as false negatives.

Fourth, most screening methods developed so far focus on independent survival times. Generalizing the results to accommodate more complex survival settings, including multi-state models and competing risks, semi-competing risks, multivariate survival, etc., remains an open problem.

Finally, existing variable screening methods have often overlooked the useful information on covariates with similar functionality or spatial proximity. Partitioning biomarkers into smaller groups according to biological knowledge or other useful information may facilitate feature selection. Recently, a partition-based screening method, which accounts for such grouping, has been proposed in a generalized linear regression setting by [14]. It is worth investigating its extension to the survival setting.

Acknowledgments

Supported by the National Natural Science Foundation of China (11528102) and the National Institutes of Health (U01CA209414).

We thank Dr. Jialiang Li for providing the code for the survival impact index screening and Ms. Martina Fu for proofreading the manuscript.

References

1.Barut E, Fan JQ, Verhasselt A. Conditional sure independence screening. J Amer Statist Assoc. 2016;111(515):1266–1277. doi: 10.1080/01621459.2015.1092974. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Bradic J, Fan JQ, Jiang JC. Regularization for Cox’s proportional hazards model with NP-dimensionality. Ann Statist. 2011;39(6):3092–3120. doi: 10.1214/11-AOS911. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Fan JQ, Feng Y, Wu YC. High-dimensional variable selection for Cox’s proportional hazards model. IMS Collections 6, Borrowing Strength: Theory Powering Applications - A Festschrift for Lawrence D. Brown. 2010:70–86. [Google Scholar]
4.Fan JQ, Li RZ. Variable selection for Cox’s proportional hazards model and frailty model. Ann Statist. 2000;30(1):74–99. [Google Scholar]
5.Fan JQ, Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion) J Roy Statist Soc B. 2008;70(5):849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Fan JQ, Samworth R, Wu YC. Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res. 2009;10:2013–2038. [PMC free article] [PubMed] [Google Scholar]
7.Fan JQ, Song R. Sure independence screening in generalized linear models with NP-dimensionality. Ann Statist. 2010;38(6):3567–3604. [Google Scholar]
8.Gorst-Rasmussen A, Scheike T. Independent screening for single-index hazard rate models with ultrahigh dimensional features. J Roy Statist Soc B. 2013;75(2):217–245. [Google Scholar]
9.He XM, Wang L, Hong HG. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Statist. 2013;41(1):342–369. [Google Scholar]
10.Hong HG, Chen XR, Christiani DC, Li Y. Integrated powered density: screening ultrahigh dimensional covariates with survival outcomes. Biometrics. doi: 10.1111/biom.12820. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Hong HG, Kang J, Li Y. Conditional screening for ultra-high dimensional covariates with survival outcomes. Lifetime Data Anal. 2016 doi: 10.1007/s10985-016-9387-7. https://doi.org/10.1007/s10985-016-9387-7. [DOI] [PMC free article] [PubMed]
12.Hong HG, Wang L, He XM. A data-driven approach to conditional screening of high dimensional variables. Stat. 2016;5(1):200–212. [Google Scholar]
13.Huang J, Sun TN, Ying ZL, Yu Y, Zhang C-H. Oracle inequalities for the Lasso in the Cox model. Ann Statist. 2013;41(3):1142–1165. doi: 10.1214/13-AOS1098. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Kang J, Hong HG, Li Y. Partition-based ultrahigh-dimensional variable screening. Biometrika. 2017 doi: 10.1093/biomet/asx052. https://doi.org/10.1093/biomet/asx052. [DOI] [PMC free article] [PubMed]
15.Kong SC, Nan B. Non-asymptotic oracle inequalities for the high-dimensional Cox regression via Lasso. Statist Sinica. 2014;24:25–42. doi: 10.5705/ss.2012.240. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Li GR, Peng H, Zhang J, Zhu LX. Robust rank correlation based screening. Ann Statist. 2012;40:1846–1877. [Google Scholar]
17.Li JL, Zheng Q, Peng LM, Huang ZP. Survival impact index and ultrahigh-dimensional model-free screening with survival outcomes. Biometrics. 2016;72(4):1145–1154. doi: 10.1111/biom.12499. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Lin DY, Ying ZL. Semiparametric analysis of the additive risk model. Biometrika. 1994;81(1):61–71. [Google Scholar]
19.Song R, Lu WB, Ma SG, Jeng XJ. Censored rank independence screening for high-dimensional survival data. Biometrika. 2014;101(4):799–814. doi: 10.1093/biomet/asu047. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Tibshirani RJ. The lasso method for variable selection in the Cox model. Stat Med. 1997;16(4):385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3. [DOI] [PubMed] [Google Scholar]
21.Tibshirani RJ. Univariate shrinkage in the Cox model for high dimensional data. Stat Appl Genet Mol Biol. 2009;8(1):3498–3528. doi: 10.2202/1544-6115.1438. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Yan XD, Tang NS, Zhao XQ. The Spearman rank correlation screening for ultrahigh dimensional censored data. eprint arXiv:1702.02708. [Google Scholar]
23.Yang GR, Yu Y, Li RZ, Buu A. Feature screening in ultrahigh dimensional Cox’s model. Statist Sinica. 2016;26:881–901. doi: 10.5705/ss.2014.171. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Yue M, Li JL. Improvement screening for ultra-high dimensional data with censored survival outcomes and varying coefficients. Int J Biostat. 2017;13(1) doi: 10.1515/ijb-2017-0024. https://doi.org/10.1515/ijb-2017-0024. [DOI] [PubMed] [Google Scholar]
25.Zhang HH, Lu WB. Adaptive Lasso for Cox’s proportional hazards model. Biometrika. 2007;94(3):691–703. [Google Scholar]
26.Zhang J, Yin GS, Liu YY, Wu YS. Censored cumulative residual independent screening for ultrahigh-dimensional survival data. Lifetime Data Anal. 2017 doi: 10.1007/s10985-017-9395-2. https://doi.org/10.1007/s10985-017-9395-2. [DOI] [PubMed]
27.Zhao SD, Li Y. Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivariate Anal. 2012;105(1):397–411. doi: 10.1016/j.jmva.2011.08.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Zhao SD, Li Y. Score test variable screening. Biometrics. 2014;70(4):862–871. doi: 10.1111/biom.12209. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Zou H. A note on path-based variable selection in the penalized proportional hazards model. Biometrika. 2008;95:241–247. [Google Scholar]

[R1] 1.Barut E, Fan JQ, Verhasselt A. Conditional sure independence screening. J Amer Statist Assoc. 2016;111(515):1266–1277. doi: 10.1080/01621459.2015.1092974. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.Bradic J, Fan JQ, Jiang JC. Regularization for Cox’s proportional hazards model with NP-dimensionality. Ann Statist. 2011;39(6):3092–3120. doi: 10.1214/11-AOS911. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Fan JQ, Feng Y, Wu YC. High-dimensional variable selection for Cox’s proportional hazards model. IMS Collections 6, Borrowing Strength: Theory Powering Applications - A Festschrift for Lawrence D. Brown. 2010:70–86. [Google Scholar]

[R4] 4.Fan JQ, Li RZ. Variable selection for Cox’s proportional hazards model and frailty model. Ann Statist. 2000;30(1):74–99. [Google Scholar]

[R5] 5.Fan JQ, Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion) J Roy Statist Soc B. 2008;70(5):849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Fan JQ, Samworth R, Wu YC. Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res. 2009;10:2013–2038. [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Fan JQ, Song R. Sure independence screening in generalized linear models with NP-dimensionality. Ann Statist. 2010;38(6):3567–3604. [Google Scholar]

[R8] 8.Gorst-Rasmussen A, Scheike T. Independent screening for single-index hazard rate models with ultrahigh dimensional features. J Roy Statist Soc B. 2013;75(2):217–245. [Google Scholar]

[R9] 9.He XM, Wang L, Hong HG. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Statist. 2013;41(1):342–369. [Google Scholar]

[R10] 10.Hong HG, Chen XR, Christiani DC, Li Y. Integrated powered density: screening ultrahigh dimensional covariates with survival outcomes. Biometrics. doi: 10.1111/biom.12820. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Hong HG, Kang J, Li Y. Conditional screening for ultra-high dimensional covariates with survival outcomes. Lifetime Data Anal. 2016 doi: 10.1007/s10985-016-9387-7. https://doi.org/10.1007/s10985-016-9387-7. [DOI] [PMC free article] [PubMed]

[R12] 12.Hong HG, Wang L, He XM. A data-driven approach to conditional screening of high dimensional variables. Stat. 2016;5(1):200–212. [Google Scholar]

[R13] 13.Huang J, Sun TN, Ying ZL, Yu Y, Zhang C-H. Oracle inequalities for the Lasso in the Cox model. Ann Statist. 2013;41(3):1142–1165. doi: 10.1214/13-AOS1098. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Kang J, Hong HG, Li Y. Partition-based ultrahigh-dimensional variable screening. Biometrika. 2017 doi: 10.1093/biomet/asx052. https://doi.org/10.1093/biomet/asx052. [DOI] [PMC free article] [PubMed]

[R15] 15.Kong SC, Nan B. Non-asymptotic oracle inequalities for the high-dimensional Cox regression via Lasso. Statist Sinica. 2014;24:25–42. doi: 10.5705/ss.2012.240. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Li GR, Peng H, Zhang J, Zhu LX. Robust rank correlation based screening. Ann Statist. 2012;40:1846–1877. [Google Scholar]

[R17] 17.Li JL, Zheng Q, Peng LM, Huang ZP. Survival impact index and ultrahigh-dimensional model-free screening with survival outcomes. Biometrics. 2016;72(4):1145–1154. doi: 10.1111/biom.12499. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.Lin DY, Ying ZL. Semiparametric analysis of the additive risk model. Biometrika. 1994;81(1):61–71. [Google Scholar]

[R19] 19.Song R, Lu WB, Ma SG, Jeng XJ. Censored rank independence screening for high-dimensional survival data. Biometrika. 2014;101(4):799–814. doi: 10.1093/biomet/asu047. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] 20.Tibshirani RJ. The lasso method for variable selection in the Cox model. Stat Med. 1997;16(4):385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3. [DOI] [PubMed] [Google Scholar]

[R21] 21.Tibshirani RJ. Univariate shrinkage in the Cox model for high dimensional data. Stat Appl Genet Mol Biol. 2009;8(1):3498–3528. doi: 10.2202/1544-6115.1438. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Yan XD, Tang NS, Zhao XQ. The Spearman rank correlation screening for ultrahigh dimensional censored data. eprint arXiv:1702.02708. [Google Scholar]

[R23] 23.Yang GR, Yu Y, Li RZ, Buu A. Feature screening in ultrahigh dimensional Cox’s model. Statist Sinica. 2016;26:881–901. doi: 10.5705/ss.2014.171. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] 24.Yue M, Li JL. Improvement screening for ultra-high dimensional data with censored survival outcomes and varying coefficients. Int J Biostat. 2017;13(1) doi: 10.1515/ijb-2017-0024. https://doi.org/10.1515/ijb-2017-0024. [DOI] [PubMed] [Google Scholar]

[R25] 25.Zhang HH, Lu WB. Adaptive Lasso for Cox’s proportional hazards model. Biometrika. 2007;94(3):691–703. [Google Scholar]

[R26] 26.Zhang J, Yin GS, Liu YY, Wu YS. Censored cumulative residual independent screening for ultrahigh-dimensional survival data. Lifetime Data Anal. 2017 doi: 10.1007/s10985-017-9395-2. https://doi.org/10.1007/s10985-017-9395-2. [DOI] [PubMed]

[R27] 27.Zhao SD, Li Y. Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivariate Anal. 2012;105(1):397–411. doi: 10.1016/j.jmva.2011.08.002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Zhao SD, Li Y. Score test variable screening. Biometrics. 2014;70(4):862–871. doi: 10.1111/biom.12209. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] 29.Zou H. A note on path-based variable selection in the penalized proportional hazards model. Biometrika. 2008;95:241–247. [Google Scholar]

PERMALINK

Feature selection of ultrahigh-dimensional covariates with survival outcomes: a selective review

HONG Hyokyoung Grace

Yi LI

Abstract

§1 Introduction

§2 Screening methods

2.1 Principled Cox sure independence screening

2.2 Feature aberration at survival times screening

2.3 Conditional screening

2.4 Quantile adaptive nonparametric screening

2.5 Censored rank independence screening

2.6 Survival impact index screening

2.7 Integrated powered density screening

§3 Numerical comparisons

Example 1 (Nonlinear covariate-response relationship)

Example 2 (Rayleigh model)

Example 3 (Linear transformation models)

Example 4 (Discrete covariates)

Example 5 (Hidden variables)

Table 1.

Table 2.

Table 3.

§4 Discussion

Acknowledgments

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Feature selection of ultrahigh-dimensional covariates with survival outcomes: a selective review

HONG Hyokyoung Grace

Yi LI

Abstract

§1 Introduction

§2 Screening methods

2.1 Principled Cox sure independence screening

2.2 Feature aberration at survival times screening

2.3 Conditional screening

2.4 Quantile adaptive nonparametric screening

2.5 Censored rank independence screening

2.6 Survival impact index screening

2.7 Integrated powered density screening

§3 Numerical comparisons

Example 1 (Nonlinear covariate-response relationship)

Example 2 (Rayleigh model)

Example 3 (Linear transformation models)

Example 4 (Discrete covariates)

Example 5 (Hidden variables)

Table 1.

Table 2.

Table 3.

§4 Discussion

Acknowledgments

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases