Summary
In modern statistical applications, the dimension of covariates can be much larger than the sample size. In the context of linear models, correlation screening (Fan and Lv, 2008) has been shown to reduce the dimension of such data effectively while achieving the sure screening property, i.e., all of the active variables can be retained with high probability. However, screening based on the Pearson correlation does not perform well when applied to contaminated covariates and/or censored outcomes. In this paper, we study censored rank independence screening of high-dimensional survival data. The proposed method is robust to predictors that contain outliers, works for a general class of survival models, and enjoys the sure screening property. Simulations and an analysis of real data demonstrate that the proposed method performs competitively on survival data sets of moderate size and high-dimensional predictors, even when these are contaminated.
Some key words: High-dimensional survival data, Rank independence screening, Sure screening property
1. Introduction
Our study was motivated by a breast cancer data set (van Houwelingen et al., 2006) that contains expression profiles of 24885 candidate genes for 295 patients with breast cancer. The primary interest was to find genes that are predictive of the overall survival time of breast cancer patients. In addition to the dimensionality being large, some predictors are not normally distributed and contain outliers; see the Supplementary Material. These phenomena are common in microarray data. High dimensionality and the existence of outliers make variable selection for censored survival data challenging.
There are numerous studies in the literature regarding variable selection for regression problems with and without censoring. Recently, many studies have focused on penalized methods, such as the lasso (Tibshirani, 1996), the smoothly clipped absolute deviation (Fan & Li, 2001), the Dantzig selector (Candes & Tao, 2007), and their variants. These methods have been thoroughly studied for variable selection with high-dimensional data (e.g., Bickel et al., 2009; Meinshausen & Yu, 2009; van de Geer, 2008). Studies of variable selection for survival outcomes include penalized partial likelihood (Fan & Li, 2002; Tibshirani, 1997; Zhang & Lu, 2007), penalized estimating equations (Johnson, 2008; Johnson et al., 2008), and other approaches that can be used for simultaneous variable selection and estimation. Generally, the associated optimization problems may be solved quickly for moderate to large p, such as p being hundreds or thousands. However, for very large p, such as is encountered in microarray data, these methods remain computationally demanding.
A computationally simple method for very high-dimensional data that can work well in practice is sure independence screening, which was demonstrated in the classical regression context in Fan & Lv (2008). In this method, the outcome variable is regressed on each covariate separately. Sure independence screening recruits the features that have the best marginal utility. In the context of least-squares regression for a linear model, this corresponds to the largest marginal absolute Pearson correlation between the response and the predictor. Fan & Lv (2008) showed that this method has a sure screening property: with probability very close to 1, the method can retain all of the important features in the model. It can also be derived from an empirical likelihood point of view (Hall et al., 2009). Correlation screening is a crude yet effective way to decrease the dimensionality of data. However, the Pearson correlation might not work well for censored survival data because it cannot be estimated reliably, especially when the censoring rate is high. In addition, its performance can be ruined by outliers in predictors because correlation is not a robust measure for association. Such outliers cause trouble for theoretical studies of screening methods, most of which require tail probability conditions for the covariates.
Variable screening methods for high-dimensional survival data are mostly based on the partial-likelihood of the Cox model. For example, Tibshirani (2009) used a lasso penalization approach for pre-screening. Zhao & Li (2012) proposed a screening method based on standardized marginal maximum partial likelihood estimators. However, in practice, the true models often remain unknown, and it is unclear if these methods will work well under model misspecification. Gorst-Rasmussen & Scheike (2013) proposed a model-free screening statistic: the feature aberration at survival times. For each covariate, this new statistic is equivalent to the numerator of the marginal log-rank test. These screening methods might be influenced by outliers in predictors.
In this paper, we propose a censored rank independence screening method for high-dimensional survival data. The rank statistic we consider can be viewed as an inverse probability-of-censoring weighted Kendall’s τ (Kendall, 1962). Our proposed method has several advantages. First, it is robust against the existence of outliers. This robustness is inherited from Kendall’s τ coefficient (Sen, 1968). Second, it is not model-based, so it works for a wide class of survival models. In contrast to Pearson’s correlation, Kendall’s τ is invariant under monotonic transformations of responses and predictors. This invariance allows our method to detect nonlinear monotone relationships between the response and predictors. Third, the proposed method has technical improvements over some other high-dimensional methods, as the proposed screening utility is a U-statistic with a bounded kernel function, which enables us to obtain the sure screening property without requiring tail probability conditions.
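The invariance and robustness properties described above can be illustrated numerically. The following is a minimal sketch on uncensored toy data (the data values and function names are illustrative assumptions, not taken from the paper): Kendall’s τ is unchanged by a monotone transformation of the predictor or by turning its largest value into a gross outlier, whereas the Pearson correlation changes.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def kendall_tau(x, y):
    """Kendall's tau for data without ties: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

# Toy data (hypothetical): y follows x up to a few local swaps.
x = [float(i) for i in range(1, 11)]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0]

x_transformed = [math.exp(a) for a in x]  # monotone transformation of the predictor
x_outlier = x[:-1] + [1000.0]             # largest value becomes a gross outlier; ranks unchanged

print("tau    :", kendall_tau(x, y), kendall_tau(x_transformed, y), kendall_tau(x_outlier, y))
print("pearson:", pearson(x, y), pearson(x_transformed, y), pearson(x_outlier, y))
```

Because τ depends on the data only through the ordering of pairs, the three τ values coincide, while the Pearson values differ.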
2. Censored Rank Independence Screening
Let T denote the event time of interest, C denote the censoring time, and X = (X(1), …, X(p))′ denote the p-dimensional vector of covariates. Further, define V = min(T, C) and Δ = I(T ≤ C), where I(·) denotes the indicator function. The observed data are independent and identically distributed copies of W = (X, V, Δ) and are denoted by Wi = (Xi, Vi, Δi) for i = 1, …, n, where Xi = (X1i, …, Xpi)′. Throughout the paper, it is assumed that the censoring time, C, is independent of the event time, T, and the covariates, X. Let 𝒜 = {1, …, p} and Xℬ = {X(j) : j ∈ ℬ} for a set ℬ ⊂ 𝒜. Let ℳ⋆ denote the index set of the active variables:
Our goal is to select the set of active variables, Xℳ⋆, where ℳ⋆ ⊂ 𝒜.
We consider the following inverse probability-of-censoring weighted marginal rank correlation utility,
τ̂k = {n(n − 1)}^{−1} ∑i≠j Δj I(Xki > Xkj, Vi > Vj)/Ŝ(Vj)^2 − 1/4 (k = 1, …, p),
where Ŝ(·) is the Kaplan–Meier estimator of S(t) = pr(C ≥ t). We define 0/0 = 0 to make τ̂k well-defined. For a prespecified γn, we select the set
ℳ̂ = {k : |τ̂k| ≥ γn}
as active variables. In this way, the dimension of the covariates used in the model can be reduced to a value much smaller than n.
Let τk = pr(Xki > Xkj, Ti > Tj) − 1/4. It can be shown that
and it follows that τ̂k provides a consistent estimate of τk. Without assuming any particular model structure, such as the proportional hazards model, the set selected by the proposed censored rank independence screening comprises the variables that have strong marginal rank correlation with the failure time. In the next section we show that the proposed method enjoys the sure screening property under general conditions.
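The screening step can be sketched in code. The example below assumes that the utility has the form τ̂k = {n(n − 1)}^{−1} ∑i≠j Δj I(Xki > Xkj, Vi > Vj)/Ŝ(Vj)^2 − 1/4, which is inferred from the U-statistic representation and the expectation used in the Appendix, and that variables with |τ̂k| ≥ γn are retained. The function names are hypothetical, ties are handled naively, and the Kaplan–Meier estimator is written out by hand rather than taken from a survival library.

```python
from bisect import bisect_left

def km_censoring_survival(v, delta):
    """Kaplan-Meier estimate of S(t) = pr(C >= t), treating censoring (delta == 0) as the event."""
    times = sorted(set(t for t, d in zip(v, delta) if d == 0))
    surv_after = []  # estimate of pr(C > s) just after each observed censoring time s
    s = 1.0
    for t in times:
        at_risk = sum(1 for x in v if x >= t)
        censored = sum(1 for x, d in zip(v, delta) if x == t and d == 0)
        s *= 1.0 - censored / at_risk
        surv_after.append(s)

    def s_hat(t):
        idx = bisect_left(times, t)  # number of observed censoring times strictly before t
        return 1.0 if idx == 0 else surv_after[idx - 1]

    return s_hat

def cris_statistic(x, v, delta, s_hat):
    """Inverse probability-of-censoring weighted rank utility for one covariate (assumed form)."""
    n = len(v)
    total = 0.0
    for j in range(n):
        if delta[j] == 0:
            continue  # only uncensored subjects j contribute
        weight = s_hat(v[j]) ** 2
        if weight == 0.0:
            continue  # convention 0/0 = 0
        count = sum(1 for i in range(n) if i != j and x[i] > x[j] and v[i] > v[j])
        total += count / weight
    return total / (n * (n - 1)) - 0.25

def screen(columns, v, delta, gamma):
    """Return all utilities and the indices k with |tau_k| >= gamma."""
    s_hat = km_censoring_survival(v, delta)
    tau = [cris_statistic(x, v, delta, s_hat) for x in columns]
    return tau, [k for k, t in enumerate(tau) if abs(t) >= gamma]
```

Each utility costs O(n^2) operations in this direct form, so screening p covariates costs O(pn^2); the Kaplan–Meier estimate is computed once and shared across covariates.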
3. Sure Screening Property
Let {Ci, Vi, Xki} (i = 1, 2) be independent and identically distributed copies of {C, V, X(k)}, for k = 1, …, p. The following conditions are required:
Condition A. There exists a ν > 0 such that pr(C = ν) > 0 and pr(C > ν) = 0.
Condition B. min over k ∈ ℳ⋆ of |pr(Xk1 > Xk2, T1 > T2) − 1/4| ≥ c0 n^{−κ} for some 0 < κ < 1/2 and c0 > 0.
Condition A, adopted from Peng & Fine (2009), is a technical condition that simplifies the derivation of asymptotic properties. Because Condition A is satisfied in many clinical settings, it is widely used in the literature. Condition B is a key assumption to ensure the sure screening property, even without assuming specific model forms. This indicates that to ensure the sure screening property, the minimal marginal rank correlation between the active variables and the response variable should exceed a certain threshold.
Theorem 1. Under Condition A, for any positive constants c5 ≤ c6, when , there exist constants c1, c2, and c4 such that
where ‖ · ‖∞ is the L∞ norm, and D is a constant introduced in Lemma 1 of the Appendix. Moreover, when Condition B holds, taking γn = c7 n^{−κ} with c7 ≤ c0/2 leads to
where s is the number of variables in ℳ⋆.
The first result of Theorem 1 leads to the conditions whereby the sure screening property of our method is ensured. Specifically, as n goes to infinity, the maximum dimensionality is p = o{exp(n^{1−2κ})}, for κ ∈ (0, 1/2). This limit is of the same order as that obtained in Fan & Lv (2008) for correlation learning in the linear model set-up, and it is stronger than the result obtained in Fan et al. (2010). Because no tail probability conditions for covariates are needed, the conditions for the sure screening property are more relaxed than those in Fan & Lv (2008) and Fan et al. (2010). Therefore, our method generally allows heavy-tailed covariates. Moreover, the method is robust to model misspecification because no model assumptions are required for the sure screening property to hold. In the next section, we apply the proposed method to a general class of transformation models, under which a set of sufficient conditions for showing Condition B will be provided and the size of the set ℳ̂ can be controlled.
The threshold, γn, controls how many covariates pass screening. To ensure the sure screening property, γn can be taken as any value that is smaller than the minimum signal, provided the minimum signal is distinguishable from the estimation noise. Model selection consistency, i.e., pr(ℳ̂ = ℳ⋆) = 1 − o(1), can be achieved if there is a gap between signal variables and noise variables. In our case, a sufficient condition for model selection consistency is that Xℳ⋆ and the inactive covariates are independent. Then
4. Selection of the important set
In applications, it is common practice to select a prefixed number of top-ranked variables for follow-up study. The prefixed number may reflect researchers’ prior knowledge of the number of susceptible predictors, or budget limitations (Kuo & Zaykin, 2011; Skol et al., 2006). Another commonly used procedure is to set the size of ℳ̂ to a number less than the sample size, so that follow-up regression analysis can be performed in a p < n scenario (Fan & Lv, 2008). Data-driven procedures for selecting the size of the important set based on screening statistics are appealing but relatively limited. Zhao & Li (2012) proposed a principled selection method based on controlling the false positive rate, but it can be conservative for screening purposes because controlling the false positive rate at a low level can lead to a large false negative error.
We propose to estimate the size of the important set using a technique developed in the multiple testing literature. Specifically, consider the hypotheses H0k : τk = 0 and Hak : τk ≠ 0 (k = 1, …, p). Under H0k, we can show that n^{1/2}τ̂k converges in distribution to a mean-zero normal random variable, and its asymptotic variance can be consistently estimated using U-statistic techniques similar to those studied in Fine et al. (1998). Let σ̂k^2 denote the estimated asymptotic variance of τ̂k. Then, the p-value for testing H0k can be computed as qk = 2{1 − Φ(|τ̂k|/σ̂k)}, where Φ(·) is the standard normal cumulative distribution function. Order the p-values as q(1) ≤ ⋯ ≤ q(p). Let |𝒜| denote the size of a set 𝒜. The proportion of true signals is π = |ℳ⋆|/p. For a large number of independently tested hypotheses, Meinshausen et al. (2006) showed that π can be consistently estimated by
(1) |
However, for general dependent test statistics, such as our proposed censored rank screening statistics τ̂k’s, the consistency of π̂ is usually unclear. In this paper, we use π̂ as an estimator of π and set |ℳ̂| = π̂p. We study the empirical performance of this estimator in Section 6.
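The p-value computation and the final selection step can be sketched as follows. This is a minimal illustration under assumptions: the standard errors σ̂k are taken as given (their U-statistic estimator is not reproduced here), the proportion estimate π̂ is passed in as a number rather than computed from estimator (1), and rounding π̂p to the nearest integer is a choice made for this sketch. Function names are hypothetical.

```python
from statistics import NormalDist

def screening_p_values(tau_hat, sigma_hat):
    """Two-sided normal p-values q_k = 2{1 - Phi(|tau_k| / sigma_k)}."""
    phi = NormalDist().cdf
    return [2.0 * (1.0 - phi(abs(t) / s)) for t, s in zip(tau_hat, sigma_hat)]

def select_by_proportion(p_values, pi_hat):
    """Keep the round(pi_hat * p) variables with the smallest p-values."""
    size = int(round(pi_hat * len(p_values)))
    order = sorted(range(len(p_values)), key=lambda k: p_values[k])
    return order[:size]
```

For instance, `select_by_proportion(screening_p_values(tau_hat, sigma_hat), pi_hat)` returns the indices of the retained variables, ordered from the smallest p-value upwards.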
5. Application to a general class of transformation models
Although the sure screening property of our method does not depend on the specific modeling form, the active set, ℳ⋆, is not easily specified without assuming a model structure. To benefit from aspects of both the model-based and the model-free approaches, it is helpful to consider a wide class of models that contains the underlying true model. Here, we consider a general class of transformation models, under which the active set, ℳ⋆, can be easily specified, and the sure screening property will hold.
Specifically, the general class of transformation models is given by
H(T) = m(X) + ε, (2)
where H(·) is an increasing transformation function, m(·) is monotone in each element of X, and ε is independent of X and has a continuous distribution function. Under model (2), the conditional survival function takes the form
pr(T > t | X) = Sε{H(t) − m(X)}, (3)
where Sε(·) is the survival function of ε. This class of transformation models includes many popular survival models as special cases. For example, when H(·) is unknown, Sε(·) is specified, and m(X) = β′X, model (2) becomes the linear transformation model (Clayton & Cuzick, 1985). This model includes the proportional hazards and proportional odds models as special cases. When H is the log transformation, Sε(·) is unspecified, and m(X) = β′X, model (2) becomes the accelerated failure time model (Kalbfleisch & Prentice, 2002). Other examples of transformation models include the odds-rate, inverse Gaussian and log-normal families (Kosorok et al., 2004; Scharfstein et al., 1998).
For transformation models with m(X) = β′X, it is clear that ℳ⋆ = {j ∈ 𝒜 : βj ≠ 0}, where β = (β1, …, βp)′. In general, ℳ⋆ can be defined as the smallest subset of 𝒜 such that m(·) is only a function of covariates in ℳ⋆, i.e., the transformation models can be equivalently written as H(T) = m(Xℳ⋆) + ε. Define sn = |ℳ⋆|, the number of active variables in ℳ⋆, and .
For k ∈ ℳ⋆, define mk(x) = E{m(x, Xℳ⋆/k)}, where the expectation is taken with respect to the joint distribution of covariates in ℳ⋆/k with X(k) fixed at x. Without loss of generality, X(k) is assumed to have mean 0 and variance 1. The following conditions are sufficient to show Condition B for a general class of transformation models.
C1. For any k ∈ ℳ⋆, the conditional density function of H(T1) − H(T2) − {mk(Xk1) − mk(Xk2)} given mk(Xk1) − mk(Xk2) is unimodal and symmetric around zero.
C2. For any k ∈ ℳ⋆, there exist positive constants σ1 and σ2 such that the variance of mk(Xk1) is uniformly bounded above by σ1^2, and the conditional variance of H(T1) − H(T2) − {mk(Xk1) − mk(Xk2)} given mk(Xk1) − mk(Xk2) is uniformly bounded above by σ2^2.
C3. For any k ∈ ℳ⋆, there exists a positive constant, d0, that is independent of p such that mink E{|mk(X(k)) − Emk(X(k))|} ≥ d0n−κ/2 for 0 < κ < 1/2.
Proposition 1. If Conditions C1–C3 hold, then Condition B holds for some c0 > 0.
As m(·) is monotone in each covariate in ℳ⋆, mk(x) is monotone in x. As a marginal projection of m(·) onto the univariate dimension of Xk, mk(x) is utilized as a parsimonious way to pass along the monotonicity from the joint model to the marginal model. As technical conditions, C1 and C2 can be checked empirically. Condition C3 states that the smallest mean absolute deviation of mk(X(k)) over the active set, ℳ⋆, can serve as the measurement of detectable signals for transformation models.
Next, we show that the size of set ℳ̂ can be controlled for linear transformation models with m(X) = β′X. This result is similar in rationale to the result of Theorem 5 in Fan and Song (2010). The following condition, C4, is needed.
C4. For k ∈ ℳ⋆, the conditional density function of H(T1) − H(T2) − (Xk1 − Xk2)EX(k)Y given Xk1 − Xk2 is unimodal and symmetric around zero.
Theorem 2. Under Conditions A and C1–C4, when var{H(T)} = O(1), for γn = c7 n^{−κ}, there exist positive constants c1, c2, and c4 such that
where Σ is the covariance matrix of X and λmax(Σ) is its largest eigenvalue.
Taking γn as in Theorem 1, if the largest eigenvalue of the covariance matrix of X is of polynomial order, i.e., λmax(Σ) = O(n^τ) for some τ > 0, then the size of ℳ̂ is also of polynomial order, O(n^{2κ+τ}), according to Theorem 2. This indicates that the size of the selected set can indeed be effectively controlled.
6. Simulation Studies
We conducted simulations to evaluate the empirical performance of the proposed censored rank independence screening method. For comparison, we considered three alternative methods: feature aberration at survival times screening (Gorst-Rasmussen & Scheike, 2013), partial likelihood ratio screening, and correlation screening. For partial likelihood ratio screening, we fit a marginal Cox model for each covariate and constructed the corresponding partial likelihood ratio statistic versus the no covariates model. This method is asymptotically equivalent to the screening method proposed by Zhao & Li (2012). For correlation screening, we used uncensored data to compute the marginal correlation between event time and covariate using an inverse probability-of-censoring weighted method. This generalizes standard sure independence screening for linear regression to survival data.
The failure time, Ti, was generated from the class of linear transformation models
where H(t) = log{0.5(e^{2t} − 1)}, and Xi is a p-dimensional vector of covariates. We set n = 100, 300 and p = 5000, 10000. The covariates, Xi, were generated from a multivariate normal distribution with mean 0, variance 1, and a first order autoregressive structure, i.e., corr(Xij, Xik) = 0.5^{|j−k|} (j, k = 1, …, p). We considered two scenarios for the true regression coefficients: Scenario 1, and Scenario 2,
Scenario 2 is more challenging than Scenario 1 because there are several active variables with relatively small effects. We considered three error distributions: the standard extreme value distribution, which corresponds to a proportional hazards model; the standard logistic distribution, which corresponds to a proportional odds model; and the standard normal distribution, which corresponds to a normal transformation model. The censoring time was generated from a uniform distribution on [0, c], where the constant, c, was chosen to achieve censoring proportions of 15% and 40%. For each setting, we conducted 100 simulation runs. The simulation results for normal covariates are given in the Supplementary Material. Based on the results, the selection performances of the proposed method and partial likelihood ratio screening were comparable for the Scenario 1 settings and for the 15% censoring rate setting in Scenario 2. The performance of partial likelihood ratio screening was slightly better than the proposed method for Scenario 2 when the censoring rate was high, i.e., 40%. Generally, correlation screening performed poorly relative to both the proposed method and partial likelihood ratio screening. Correlation screening became very poor for Scenario 2 when the censoring rate was high.
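The data-generating mechanism described above can be sketched as follows. Several details are assumptions made for illustration only: the model is taken to be H(Ti) = −β′Xi + εi (the sign convention is not recoverable from the text), the coefficient values are hypothetical placeholders, and the censoring constant c is passed in rather than tuned to a target censoring proportion. Only the Python standard library is used.

```python
import math
import random

def h_inverse(u):
    """Inverse of H(t) = log{0.5(exp(2t) - 1)}, i.e. t = 0.5 log(2 exp(u) + 1)."""
    return 0.5 * math.log(2.0 * math.exp(u) + 1.0)

def ar1_covariates(p, rho, rng):
    """Normal vector with unit variances and corr(X_j, X_k) = rho^|j-k|."""
    x = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(p - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x

def draw_error(kind, rng):
    """Standard extreme value, logistic, or normal error."""
    u = rng.uniform(1e-12, 1.0 - 1e-12)  # kept away from 0 and 1 so the logs are finite
    if kind == "extreme_value":
        return math.log(-math.log(u))    # log of a unit exponential: proportional hazards
    if kind == "logistic":
        return math.log(u / (1.0 - u))   # proportional odds
    if kind == "normal":
        return rng.gauss(0.0, 1.0)
    raise ValueError(f"unknown error distribution: {kind}")

def simulate(n, p, beta, error, c, seed=0):
    """Return a list of (x, v, delta); beta maps covariate index -> coefficient (hypothetical values)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = ar1_covariates(p, 0.5, rng)
        linear = sum(b * x[k] for k, b in beta.items())
        t = h_inverse(-linear + draw_error(error, rng))  # assumed sign convention
        cens = rng.uniform(0.0, c)                       # uniform censoring on [0, c]
        data.append((x, min(t, cens), int(t <= cens)))
    return data
```

A call such as `simulate(n=100, p=5000, beta={0: 1.0}, error="extreme_value", c=3.0)` produces one data set; in practice c would be adjusted until the empirical censoring proportion matches the target.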
To study the performances when the covariates might be contaminated by outliers, we added outliers to the covariates. All other settings were unchanged. Specifically, with a probability of 0.1, each covariate was replaced by a random variable generated from a t distribution. Again, we conducted 100 simulation runs. For Scenario 1, we report the average number of active variables contained in the top 4, 10, 20, 30, 40, and 50 selected variables, denoted by true positive; the true number was 4. For Scenario 2, we report the corresponding number in the top 15, 30, 45, 60, 75, 90, 120, and 150 selected variables, and the true number was 15. The results for the proportional hazards model under Scenarios 1 and 2 are summarized in Figures 1 and 2, respectively. Results for the proportional odds model are similar, and are provided in the Supplementary Material.
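The contamination step can be sketched as below. The degrees of freedom of the t distribution are not stated in the text, so `df` is a hypothetical parameter with an arbitrary default, and the function names are illustrative.

```python
import math
import random

def student_t(df, rng):
    """Draw from a t distribution as Z / sqrt(chi2_df / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-squared with df degrees of freedom
    return z / math.sqrt(chi2 / df)

def contaminate(x, prob=0.1, df=1.0, seed=0):
    """Replace each entry of x, independently with probability prob, by a t-distributed draw."""
    rng = random.Random(seed)
    return [student_t(df, rng) if rng.random() < prob else value for value in x]
```

Applying `contaminate` to each subject's covariate vector leaves the failure and censoring times untouched, so only the predictors carry outliers.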
Under all settings, the selection performance of the proposed method was very similar to that when the covariates did not contain outliers, so it is robust to outliers in covariates. In addition, the selection performance improved when the sample size increased and the censoring rate decreased, though the performances when p = 5000 and p = 10000 were similar. For most settings, the performance of the proposed method was superior to both partial likelihood ratio screening and the feature aberration at survival times statistic. However, these three methods had comparable performances when n = 100 and the censoring rate was 40% in Scenario 2. Generally, partial likelihood ratio screening and feature aberration at survival times screening had comparable performances. In addition, the censoring proportion had less effect on their performances than on the performance of the proposed method. This behavior is expected because the proposed method uses the inverse probability-of-censoring weighted technique to address censoring, which might lose some efficiency when the censoring proportion is high. As was the case for normal covariates, correlation screening had the poorest selection performance.
Next, we examine the performance of the method proposed in Section 4 for selecting the number of important predictors. We considered n = 100 and 300, and p = 5000. The average numbers of selected predictors over 100 simulations are given in Table 1. The average numbers of selected predictors are much larger for Scenario 2 and a censoring rate of 40%, which is expected because the signal is much weaker under Scenario 2 and the censoring rate is higher. In addition, with n = 300, the selected sets cover all the true signals in almost all the simulation runs for Scenario 1 under both models and censoring rates, while for Scenario 2, they cover almost half of the true signals on average. For the smaller sample size of 100, the average numbers of selected true signals decrease, and the magnitude of the decrease is relatively large for the 40% censoring rate and Scenario 2.
Table 1.
Model | Scenario | N.sel (SD), n = 100, CP = 15% | N.sel (SD), n = 100, CP = 40% | N.sel (SD), n = 300, CP = 15% | N.sel (SD), n = 300, CP = 40% |
---|---|---|---|---|---|
PH | 1 | 13.8 (4.8) | 28.1 (8.9) | 7.4 (3.0) | 28.7 (10.2) |
PH | 2 | 31.3 (16.4) | 116.7 (80.5) | 9.6 (6.5) | 101.6 (79.2) |
PO | 1 | 11.6 (4.7) | 26.1 (7.8) | 7.5 (3.3) | 23.1 (10.6) |
PO | 2 | 45.0 (15.6) | 125.4 (87.3) | 25.0 (8.0) | 129.8 (100.4) |
CP denotes censoring proportion; N.sel denotes the average number of selected important predictors over 100 simulations with the number in parenthesis being the standard deviation over 100 simulations; PH denotes the proportional hazards model; PO denotes the proportional odds model.
Although the independent censoring condition was imposed for theoretical development, it can be relaxed in practice. We conducted simulations using a censoring distribution that depended on covariates. Specifically, the censoring times were generated from an exponential distribution with mean c exp(X1 − X8), with c chosen to achieve censoring rates of 15% and 40%. All other settings were unchanged from the previous simulations. Here, we only considered the proportional hazards model under Scenario 2 with n = 100, 300 and p = 5000. The simulation results are given in the Supplementary Material. Although the Kaplan–Meier estimator is not consistent for the survival distribution of the censoring time, the proposed method continued to perform competitively in this limited simulation study. In addition, we conducted simulations to examine the performance of the proposed and competing methods under a censoring rate of 70%. The simulation results are given in the Supplementary Material. In summary, the proposed method showed comparable performance under the heavy censoring case. The performance of the proposed method became slightly worse than partial likelihood ratio screening and the feature aberration at survival times screening. This behavior is expected because the proposed method uses the inverse probability-of-censoring weighted technique to address censoring, which might lose some efficiency when the censoring proportion is high.
7. Application to breast cancer data
We applied our proposed rank independence screening method to the analysis of survival from a breast cancer study (van Houwelingen et al., 2006), with 295 female patients with primary invasive breast carcinoma. For each patient, the expressions of 24885 genes were profiled on cDNA arrays from all tumors. A set of 4919 candidate genes were selected after initial screening using the Rosetta error model (van’t Veer et al., 2002). The primary endpoint of interest was the overall survival time. Of the 295 patients, 216 had censored responses, giving a 73% censoring rate. A main goal of the study was to identify genes that are associated with the overall survival of breast cancer patients. As discussed in §1, gene expression profiles commonly contain outliers. For the breast cancer data, we identified potential outliers in the gene expressions. Specifically, for the jth gene of subject i, we calculated a modified z-statistic: zij = 0.6745|Xji − mj|/νj, where mj and νj are the median and median absolute deviation of the jth gene expression profiles over the 295 subjects. If zij > 3.5, the data point was claimed to be an outlier. This criterion was suggested by Iglewicz & Hoaglin (1993) for outlier detection and has been widely used in the literature. Based on this rule, of the 4919 genes, 3488 contained at least 1 outlier, 582 contained at least 10 outliers, and 58 contained at least 30 outliers.
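The outlier rule described above can be sketched for a single gene as follows. The function names are hypothetical, and the handling of a zero median absolute deviation is a choice made for this sketch rather than something specified in the text.

```python
import math
from statistics import median

def modified_z_scores(values):
    """Modified z-statistics 0.6745 |x - median| / MAD for one gene across subjects."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    if mad == 0:
        # Degenerate case (assumption): points equal to the median score 0, all others infinity.
        return [0.0 if v == m else math.inf for v in values]
    return [0.6745 * abs(v - m) / mad for v in values]

def flag_outliers(values, cutoff=3.5):
    """Flag observations whose modified z-statistic exceeds the cutoff."""
    return [z > cutoff for z in modified_z_scores(values)]
```

Summing the flags per gene gives the number of outlying subjects for that gene, which is the quantity tabulated in the counts above.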
To check the covariate-independent censoring assumption, we fit a marginal Cox proportional hazards model for each individual predictor. Out of 4919 genes, 368 genes have significant coefficients, with p-values less than 0.05. After Bonferroni correction, only one gene is significantly related to censoring times. For the 368 significant genes, we replaced the Kaplan–Meier estimator of the censoring survival function by the estimated conditional survival function from the fitted proportional hazards model and recalculated the corresponding screening statistics. We found that the rankings of the screening statistics are nearly unchanged. Therefore, in our data application, we used the Kaplan–Meier estimator for the censoring survival function.
We used the proposed method to analyze both the original data and the data with outliers removed. For comparison, we also include the selection results obtained using partial likelihood ratio screening and correlation screening. We used our method to estimate the number of important predictors based on the original data. The estimated size of the important set is 492. We report the symbols of the top 20 selected genes in Table 2. For the original data, none of the top 20 genes selected by correlation screening were selected by either partial likelihood ratio screening or the proposed method. However, several genes were selected by both partial likelihood ratio screening and the proposed method: 3 of the top 10 genes and 13 of the top 20 genes. For the data with outliers removed, only 1 of the top 20 genes selected by correlation screening was also selected by partial likelihood ratio screening. For each method, we compared the top 20 genes selected from the original data to the top 20 selected genes obtained from the data with outliers removed. The two sets of genes selected using partial likelihood ratio screening were completely different. However, the two sets selected by the proposed method had 18 genes in common, with a similar order. These results imply that partial likelihood ratio screening and the proposed method might give more reliable selection results than correlation screening; and the proposed method is robust to outliers in covariates, but partial likelihood ratio screening is not.
Table 2.
CS denotes correlation screening; PLRS denotes partial likelihood ratio screening; CRIS denotes proposed censored rank independence screening; W denotes the original data; and WO denotes the data with outliers removed.
8. Discussion
In our method, censoring times are assumed to be independent of failure times and predictors, which may be too restrictive in some applications. This assumption can be relaxed to a certain extent. For example, as considered in He et al. (2013) for quantile-adaptive variable screening, we may assume that Ti and Ci are conditionally independent given a single predictor. Then, the survival function, S(t), of censoring times can be replaced by the conditional survival function S(t | Xij) = pr(Ci ≥ t | Xij), which can be consistently estimated by the local Kaplan–Meier estimator (Gonzalez-Manteiga & Cadarso-Suarez, 1994). Alternatively, we may build a semi-parametric survival model for censoring times, for example, a proportional hazards model with lasso selection of important predictors, and compute the model-based conditional survival function for censoring times. The sure screening property of the associated statistics needs to be further investigated.
Supplementary Material
Acknowledgement
We would like to thank the editor, an associate editor and three referees for very insightful comments. We also thank Shannon Holloway for careful reading of the manuscript. R. S.’s research was supported by the National Science Foundation. W. L.’s research was supported by the National Cancer Institute. S. M.’s research was supported by the National Institutes of Health.
Appendix
Proof of the sure screening property
The following Lemmas are used to prove the sure screening property of τ̂k.
Lemma 1. (Bitouze et al., 1999, Theorem 1) Let and be independent sequences of independently identically distributed nonnegative random variables with distribution functions F and G, respectively. Let F̂n be the Kaplan–Meier estimator of the distribution function F. There exists a positive constant, D, such that for any positive constant λ,
Lemma 2. (Hoeffding, 1963) Let g = g(x1, …, xm) be a symmetric kernel of the U-statistic, U, with a ≤ g(x1, …, xm) ≤ b. For any t > 0 and m ≤ n, we have
Lemma 3. For any c > 0, when , where c1 = (1 − δ)^2 {(2c + 1)^{1/2}(c + 1)^{−1/2} − 1}^2,
(A1) |
Moreover, for any 0 < l < 1.12, when ,
(A2) |
where .
Proof of Lemma 3. To show Lemma 3, we claim the following result, giving its proof at the end: for t ∈ (0, 2^{1/2} − 1), |Ŝ^{−2}(Vi) − S^{−2}(Vi)| ≥ cS^{−2}(Vi) implies |Ŝ(Vi) − S(Vi)| ≥ tS(Vi), where c = {(t + 1)^2 − 1}/{2 − (t + 1)^2}. This claim further implies that ‖Ŝ − S‖∞ ≥ t‖S‖∞. Therefore,
(A3) |
Let G(·) denote the cumulative distribution function of T, i.e., G(t) = pr(T ≤ t). By Condition A, 1 − δ < |1 − G(Vi)| < 1, for i = 1, …, n. Therefore, by Lemma 1, we have
(A4) |
When , (A4) is further bounded by 2.5 .
For 0 < c < 1, i.e., 0 < t < (3/2)^{1/2} − 1, because
from the above calculations, for ,
The desired result (A2) now follows from the union bound of probability and by setting l = 5t.
We now show the result given at the beginning of the proof. Take A = Ŝ(Vi) and B = S(Vi). Let a = (t + 1)^2 − 1. Therefore, c = 1/(1 − a) − 1. Because
we have A^{−2} − B^{−2} ≤ −{1/(1 − a) − 1}B^{−2}, or ≥ {1/(1 − a) − 1}B^{−2}. For t ∈ (0, 2^{1/2} − 1), i.e., a ∈ (0, 1), we have 1 − 1/(1 + a) < 1/(1 − a) − 1. It follows that A^{−2} − B^{−2} ≤ −{1 − 1/(1 + a)}B^{−2}, or ≥ {1/(1 − a) − 1}B^{−2}, which is equivalent to |A − B| ≥ tB.
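The claim used in the proof of Lemma 3 can be checked numerically. The sketch below is a sanity check on a finite grid, not a proof: it assumes A and B take values in (0, 1], as survival probabilities do, and uses a small numerical tolerance chosen for this illustration. The function name is hypothetical.

```python
import math

def claim_holds(t, grid=200, tol=1e-12):
    """Check on a grid of (A, B) in (0, 1]^2 that
    |A^-2 - B^-2| >= c B^-2 implies |A - B| >= t B,
    where c = {(t + 1)^2 - 1} / {2 - (t + 1)^2}."""
    c = ((t + 1) ** 2 - 1) / (2 - (t + 1) ** 2)
    for i in range(1, grid + 1):
        a_val = i / grid
        for j in range(1, grid + 1):
            b_val = j / grid
            premise = abs(a_val ** -2 - b_val ** -2) >= c * b_val ** -2
            if premise and abs(a_val - b_val) < t * b_val - tol:
                return False  # counterexample found on the grid
    return True

# Values of t spread over the open interval (0, 2^{1/2} - 1).
t_values = [k * (math.sqrt(2) - 1) / 20 for k in range(1, 20)]
all_ok = all(claim_holds(t) for t in t_values)
print("claim holds on the grid:", all_ok)
```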
Proof of Theorem 1. Rewrite τ̂k = 2{n(n − 1)}^{−1} ∑i<j g(Wi, Wj) − 1/4, where
is the symmetric kernel of τ̂k. Therefore, τ̂k is a U-statistic.
Let Unf = 2{n(n − 1)}−1 ∑i<j f(Xi, Xj) denote the empirical function for U-statistics. After some algebra, we have τ̂k − τk = Ik1 + Ik2, where
and
We bound Ik1 and Ik2 piece by piece. In particular, Ik1 can be bounded from above as
(A5) |
By the triangle inequality, (A5) can be further bounded above as
To bound Ik11, recall that
By Condition A, we have
For any c3 > 0, by Lemma 2, there exist c3 and such that
(A6)
By Lemma 3, letting c = 1 in (A1), when ,
(A7)
It follows from (A6) and (A7) that for any c3 > 0, when , there exist c1 and c4 such that
(A8)
Because |E{ΔjS(Vj)^{−2}I(Xki > Xkj, Vi > Vj)}| ≤ 1, it follows from (A2) that for any c5 > 0 and c5n^{−κ} < 1.12, when , there exists c2 > 0 such that
(A9)
By the triangle inequality, it now follows from (A6), (A8), and (A9) that for any c3, c5 > 0, when , there exist c1, c2, and c4 such that
The first result follows by letting c6 = 2c3 + c5.
For the second part, note that on the event
by Condition B, we have |τ̂k| ≥ c0n^{−κ}/2, for all k ∈ ℳ⋆. Therefore, by the choice of νn, we have ℳ⋆ ⊂ ℳ̂νn. The result now follows from a simple union bound:
This completes the proof.
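The screening step used in this argument can be sketched in a few lines. The sketch assumes that the retained set is ℳ̂νn = {k : |τ̂k| ≥ νn}, which is how the inclusion ℳ⋆ ⊂ ℳ̂νn reads here; the function name `screen` and the example values are hypothetical.

```python
import numpy as np


def screen(tau, nu):
    """Indices k whose marginal statistic satisfies |tau_k| >= nu (assumed screening rule)."""
    tau = np.asarray(tau, dtype=float)
    return np.flatnonzero(np.abs(tau) >= nu)
```

For example, `screen([0.2, -0.01, -0.15, 0.03], 0.1)` keeps the first and third covariates, since only those statistics reach the threshold in absolute value.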
Verification of Condition B for a general class of transformation models
Proof of Proposition 1. Recall that τk = pr(Xk1 > Xk2, T1 > T2) − 1/4. Next we will show that |τk| ≥ c0n^{−κ} for some c0 > 0, if k ∈ ℳ⋆. For k ∈ ℳ⋆, we have
where FΔεk|Δmk (·) is the conditional cumulative distribution function of Δεk = H(T1) − H(T2) − {mk(Xk1) − mk(Xk2)} given Δmk = mk(Xk1) − mk(Xk2). Because mk(·) is a monotone function, mk(Xk2) − mk(Xk1) is either greater than or less than zero for all Xk1 > Xk2. This implies that 1 − FΔεk|Δmk{mk(Xk2) − mk(Xk1)} is either greater or less than 1/2 due to Condition C1. Therefore, τk is either greater or less than zero for k ∈ ℳ⋆. In the following, we further establish the lower bound of |τk|.
Without loss of generality, assume mk(·) is monotone increasing. Note that τk can be equivalently written as
According to Corollary 3 in Sellke & Sellke (1997), for a random variable X with mean zero, variance σ^2, and unimodal symmetric distribution, pr(|X| ≥ t) ≤ 3^{1/2}σ/(t + 3^{1/2}σ). By Condition C1, we have
This expression leads to
for M > 0. By Chebyshev’s inequality and C2, the second term can be further bounded below as
Let . Then, M ≥ 3^{1/2}σ2 when n is large. We have τk ≥ c0n^{−κ}, where . Similarly, if mk(·) is monotone decreasing, we can show that τk ≤ −c0n^{−κ}. Therefore, |τk| ≥ c0n^{−κ} for any k ∈ ℳ⋆. Condition B is then proved.
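The tail bound quoted from Sellke & Sellke (1997) in the proof above, pr(|X| ≥ t) ≤ 3^{1/2}σ/(t + 3^{1/2}σ), can be illustrated on two symmetric unimodal distributions. The sketch below checks it on a grid for the normal and uniform cases; the function names, the grid, and the small floating-point tolerance are illustrative assumptions, and a grid check is not a proof.

```python
import math


def bound(t, sigma):
    """Right-hand side 3^{1/2} sigma / (t + 3^{1/2} sigma)."""
    r3 = math.sqrt(3.0) * sigma
    return r3 / (t + r3)


def normal_tail(t, sigma):
    """pr(|X| >= t) for X ~ N(0, sigma^2)."""
    return math.erfc(t / (sigma * math.sqrt(2.0)))


def uniform_tail(t, sigma):
    """pr(|X| >= t) for X uniform on [-a, a] with a = 3^{1/2} sigma, so var(X) = sigma^2."""
    a = math.sqrt(3.0) * sigma
    return max(0.0, 1.0 - t / a)


def bound_holds(tail, sigma=1.0, n=2000, t_max=10.0):
    """Check tail(t) <= bound(t) for t on an even grid over [0, t_max]."""
    grid = (t_max * i / n for i in range(n + 1))
    return all(tail(t, sigma) <= bound(t, sigma) + 1e-12 for t in grid)
```

At t = 0 both sides equal one, which is why a small tolerance is added to the comparison.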
Proof of Theorem 2. To show Theorem 2, we note that for any c8 > 0, if |EX(k)Y| > c8n^{−κ}, then |τk| > c0n^{−κ} for some c0 > 0. The proof of this statement is similar to that of Proposition 1, so we omit the details.
Because var{H(T)} = O(1), we have that . Therefore, the number of {k : τk > c0n^{−κ}} = {k : |EX(k)H(T)| > c8n^{−κ}} = O{n^{2κ}λmax(Σ)}. The number of {k : |τ̂k| > 2c0n^{−κ}} is no bigger than the number of {k : |τk| > c0n^{−κ}} on the set {max1≤k≤p |τ̂k − τk| ≤ c0n^{−κ}}. By taking c8 = c7/2,
The conclusion follows from the tail probability in Theorem 1. This completes the proof.
Contributor Information
Rui Song, Email: rsong@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA.
Wenbin Lu, Email: lu@stat.ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA.
Shuangge Ma, Email: shuangge.ma@yale.edu, Division of Biostatistics, School of Public Health, Yale University, New Haven, Connecticut 06510, USA.
X. Jessie Jeng, Email: xjjeng@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA.
References
- Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics. 2009;37:1705–1732.
- Bitouzé D, Laurent B, Massart P. A Dvoretzky–Kiefer–Wolfowitz type inequality for the Kaplan–Meier estimator. Annales de l’Institut Henri Poincaré (B) Probability and Statistics. 1999;35:735–763.
- Candes E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n (with discussion). The Annals of Statistics. 2007;35:2313–2404.
- Clayton D, Cuzick J. Multivariate generalizations of the proportional hazards model (with discussion). Journal of the Royal Statistical Society, Series A. 1985;148:82–117.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. The Annals of Statistics. 2002;30:74–99.
- Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
- Fan J, Song R, et al. Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics. 2010;38:3567–3604.
- Gonzalez-Manteiga W, Cadarso-Suarez C. Asymptotic properties of a generalized Kaplan–Meier estimator with some applications. Journal of Nonparametric Statistics. 1994;4:65–78.
- Gorst-Rasmussen A, Scheike T. Independent screening for single-index hazard rate models with ultrahigh dimensional features. Journal of the Royal Statistical Society, Series B. 2013;75:217–245.
- Hall P, Titterington DM, Xue J. Tilting methods for assessing the influence of components in a classifier. Journal of the Royal Statistical Society, Series B. 2009;71:783–803.
- He X, Wang L, Hong HG. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. The Annals of Statistics. 2013;41:342–369.
- Hoeffding W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association. 1963;58:13–30.
- Iglewicz B, Hoaglin DC. How to Detect and Handle Outliers. Milwaukee, WI: American Society for Quality Control; 1993.
- Johnson BA. Variable selection in semiparametric linear regression with censored data. Journal of the Royal Statistical Society, Series B. 2008;70:351–370.
- Johnson BA, Lin DY, Zeng D. Penalized estimating functions and variable selection in semi-parametric regression models. Journal of the American Statistical Association. 2008;103:672–680. doi: 10.1198/016214508000000184.
- Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd ed. New York: Wiley; 2011.
- Kendall MG. Rank Correlation Methods. 3rd ed. London: Griffin & Co; 1962.
- Kosorok MR, Lee BL, Fine JP. Robust inference for univariate proportional hazards frailty regression models. The Annals of Statistics. 2004;32:1448–1491.
- Kuo C-L, Zaykin DV. Novel rank-based approaches for discovery and replication in genome-wide association studies. Genetics. 2011;189:329–340. doi: 10.1534/genetics.111.130542.
- Meinshausen N, Rice J. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. The Annals of Statistics. 2006;34:373–393.
- Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics. 2009;37:246–270.
- Peng L, Fine J. Competing risks quantile regression. Journal of the American Statistical Association. 2009;104:1440–1453.
- Scharfstein DO, Tsiatis AA, Gilbert PB. Semiparametric efficient estimation in the generalized odds-rate class of regression models for right-censored time-to-event data. Lifetime Data Analysis. 1998;4:355–391. doi: 10.1023/a:1009634103154.
- Sellke TM, Sellke SH. Chebyshev inequalities for unimodal distributions. The American Statistician. 1997;51:34–39.
- Sen PK. Estimates of the regression coefficient based on Kendall’s tau. Journal of the American Statistical Association. 1968;63:1379–1389.
- Skol D, Scott L, Abecasis G, Boehnke M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nature Genetics. 2006;38:209–213. doi: 10.1038/ng1706.
- Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Tibshirani RJ. The lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
- Tibshirani RJ. Univariate shrinkage in the Cox model for high dimensional data. Statistical Applications in Genetics and Molecular Biology. 2009;8:1–18. doi: 10.2202/1544-6115.1438.
- van de Geer S. High-dimensional generalized linear models and the lasso. The Annals of Statistics. 2008;36:614–645.
- van Houwelingen HC, Bruinsma T, Hart AAM, Van ’t Veer LJ, Wessels LFA. Cross-validated Cox regression on microarray gene expression data. Statistics in Medicine. 2006;25:3201–3216. doi: 10.1002/sim.2353.
- van’t Veer L, Dai H, van de Vijver MJ, He Y, Hart A, Mao M, Peterse H, van der Kooy K, Marton MJ, Witteveen AT, Schreiber G, Kerkhoven R, Roberts C, Linsley P, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536. doi: 10.1038/415530a.
- Zhang HH, Lu W. Adaptive lasso for Cox’s proportional hazards model. Biometrika. 2007;94:691–703.
- Zhao DS, Li Y. Principled sure independence screening for Cox models with ultra-high-dimensional covariates. Journal of Multivariate Analysis. 2012;105:397–411. doi: 10.1016/j.jmva.2011.08.002.