Abstract
Feature screening plays an important role in dimension reduction for ultrahigh-dimensional data. In this paper, we introduce a new feature screening method and establish its sure independence screening property under the ultrahigh-dimensional setting. The proposed method is based on the nonparanormal transformation and Henze-Zirkler's test: it first transforms the response variable and features to Gaussian random variables using the nonparanormal transformation and then tests the dependence between the response variable and each feature using Henze-Zirkler's test. The proposed method enjoys at least two merits. First, it is model-free, avoiding the specification of a particular model structure. Second, it is condition-free, requiring no extra conditions beyond some regularity conditions common to high-dimensional feature screening. The numerical results indicate that, compared to the existing methods, the proposed method is more robust to data generated from heavy-tailed distributions and/or complex models with interaction variables. The proposed method is applied to screening of anticancer drug response genes.
Keywords: Gene Screening, Henze-Zirkler’s Test, Nonparanormal Transformation, Sure Independence Screening, Precision Medicine
1. Introduction
Variable selection plays an important role in high-dimensional data analysis. However, under the ultrahigh-dimensional setting, where the number of covariates may grow at an exponential rate of the sample size, the current variable selection methods may not work well due to the simultaneous challenges of computational expediency, statistical accuracy, and algorithmic stability (Fan et al., 2009). A practical approach is to use a screening procedure to reduce the dimension of the feature space to a moderate scale, and then implement a variable selection method on the reduced dataset. To pursue this approach, Fan and Lv (2008) proposed the sure independence screening (SIS) method for linear regression, where the features are screened based on their respective Pearson correlation coefficients with the response variable. They established the sure screening property, i.e., all active predictors are selected with probability approaching one as the sample size increases to infinity. Fan and Song (2010) extended SIS to generalized linear models, where the features are screened based on the estimates of the regression coefficients in their respective marginal models. For nonlinear models, Hall and Miller (2009) suggested polynomial transformations of the predictors, and Fan et al. (2011) suggested estimating the nonparametric component of each feature using B-splines and then screening the features based on the magnitudes of their respective nonparametric components.
All the above methods require the specification of a particular model structure. If the underlying model is correctly specified, these methods can perform well. However, if the underlying model is misspecified, their performance may deteriorate. Under the ultrahigh-dimensional setting, specifying a correct model is usually an impossible task, and thus model-free feature screening methods are appealing. Toward this direction, Zhu et al. (2011) proposed a sure independence ranking and screening (SIRS) method to screen significant features for multi-index models. Li et al. (2012) proposed a distance correlation sure independence screening (DC-SIS) method, where the features are screened based on their distance correlation (Székely et al., 2007) with the response variable. It is known that for two random variables, a zero distance correlation coefficient implies independence. He et al. (2013) proposed a quantile-adaptive sure independence screening (Qa-SIS) method, which employs spline approximation to model the marginal effect of each predictor at a given quantile level and then screens the predictors accordingly. A particular strength of this method is that it can handle the censored data arising in survival analysis. Recently, Cui et al. (2015) proposed a mean variance sure independence screening (MV-SIS) method, where the dependence of two random variables is measured using the mean variance of the conditional distribution function. This method was originally proposed for categorical response variables, but can be extended to problems with a continuous response variable via discretization.
Although the model-free variable screening methods avoid the specification of a particular model structure, they still rest, more or less, on assumptions about the predictors and the response variable. For example, DC-SIS requires both the predictors and the response variable to satisfy the sub-exponential tail probability condition uniformly; that is, in practice, the response variable and predictors should be uniformly bounded or follow a multivariate Gaussian distribution. Qa-SIS requires the derivative of the conditional quantile function to satisfy a Lipschitz condition and the conditional density function to be uniformly bounded for each feature.
In this article, we propose a new model-free feature screening method and establish its sure independence screening property under very weak conditions. The proposed method builds on the nonparanormal transformation (Liu et al., 2009) and Henze-Zirkler's test (Henze and Zirkler, 1990): it first transforms the response variable and each of the predictors to Gaussian random variables using the nonparanormal transformation, and then tests the dependence between the response variable and the predictors using Henze-Zirkler's test. Compared to the existing methods, the proposed method requires fewer assumptions to ensure its sure independence screening property and thus performs more robustly. Our numerical studies indicate that the new method can achieve better performance when the covariates follow a heavy-tailed distribution and when the underlying true model is complex with interaction variables.
The rest of this article is organized as follows. In Section 2, we describe the proposed method and establish its sure independence screening property. In Section 3, we conduct simulation studies to evaluate the finite sample performance of the proposed method along with comparisons with the existing methods. In Section 4, we apply the proposed method to screening of anticancer drug response genes. In Section 5, we conclude the article with a brief discussion. Technical proofs are given in the Appendix.
2. Robust Feature Screening
2.1. Henze-Zirkler Sure Independence Screening
Let Y denote a continuous response variable, let X = (X1,…, Xp) denote p continuous covariates, let n denote the sample size, and let f(y|x) denote the conditional density function of Y given X. Under the ultrahigh-dimensional setting, where p = O(exp(nτ)) for some τ > 0, a sparsity condition is generally needed. For example, we may assume that there are only a small number of predictors relevant to the response variable, although p can be much greater than n. Without specifying a parametric form for the regression model, we define the set of active predictors as D = {k : f(y|x) functionally depends on Xk, 1 ≤ k ≤ p} and the set of inactive predictors as I = {1,…, p} \ D.
A direct identification of the active predictor set D is usually difficult or even impossible under the ultrahigh-dimensional setting. Therefore, a common strategy is to first identify a moderate-size set of variables that includes all the elements of D, and then apply a variable selection method to this moderate-size set to accurately identify D. Note that if f(y|x) functionally depends on Xk, then Y and Xk are usually marginally dependent as well. Hence, the moderate-size set of variables can be constructed by selecting only the predictors that are marginally dependent with Y; this procedure is usually referred to as independence screening. In fact, under the partial orthogonality condition (Fan and Song, 2010; Huang et al., 2008), i.e., {Xi, i ∈ D} is independent of {Xj, j ∈ I}, it can be shown that f(y|x) functionally depends on Xk if and only if Y and Xk are marginally dependent.
To implement independence screening, we need to find a metric to measure the marginal dependence between each predictor Xk and the response variable Y . Several metrics have already been proposed, see e.g., Zhu et al. (2011), Cui et al. (2015), Li et al. (2012), and He et al. (2013). In this paper, we propose a new one with the basic idea described as follows. Let Fy(·) denote the CDF of the response variable Y , and let Fk(·) denote the CDF of the predictor Xk. Consider the nonparanormal transformation (Liu et al., 2009)
Ty(Y) = Φ−1(Fy(Y)),  Tk(Xk) = Φ−1(Fk(Xk)),  (1)
where Φ(·) denotes the CDF of the standard Gaussian distribution. Liu et al. (2009) applied the nonparanormal transformation to nonparanormal random variables; here it is applied to general continuous random variables. Then it is easy to see that Y is independent of Xk if and only if (Ty(Y), Tk(Xk)) follows the bivariate normal distribution N2(0, I2), where I2 denotes the 2 × 2 identity matrix. The latter can be tested using a multivariate normality test, e.g., Henze-Zirkler's test (Henze and Zirkler, 1990), with the known covariance structure. If (Ty(Y), Tk(Xk)) does not follow the distribution N2(0, I2), then the Henze-Zirkler test statistic tends to take a larger value. In practice, since Fy and the Fk's are usually unknown, we use the estimated nonparanormal transformation of Liu et al. (2009), which has been implemented in the R package huge.
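For concreteness, the estimated transformation can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the huge implementation; in particular, the truncation level delta_n below is an illustrative choice, not the paper's exact default.

```python
import bisect
import math
from statistics import NormalDist

def truncated_ecdf(sample):
    """Winsorized (truncated) empirical CDF of one variable.

    delta_n here is an illustrative truncation level; the paper's
    default depends on its truncation parameter m (m = 1/2 there).
    """
    n = len(sample)
    delta_n = 1.0 / (4.0 * math.sqrt(n))  # assumption: illustrative rate only
    sorted_x = sorted(sample)

    def F(x):
        # fraction of observations <= x, truncated away from 0 and 1
        # so the Gaussian quantile below stays finite
        f = bisect.bisect_right(sorted_x, x) / n
        return min(max(f, delta_n), 1.0 - delta_n)

    return F

def nonparanormal_transform(sample):
    """Estimated nonparanormal transform: x -> Phi^{-1}(F_tilde(x))."""
    F = truncated_ecdf(sample)
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf(F(x)) for x in sample]
```

The truncation keeps the empirical CDF strictly inside (0, 1), so the Gaussian quantile never diverges at the sample extremes.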
In summary, the proposed method consists of the following steps:
- Transform all variables, including the response variable and predictors, to standard Gaussian random variables by the estimated nonparanormal transformation. Taking the response variable as an example, let

  T̃y(yi) = Φ−1(F̃y(yi)),

  where yi denotes the ith observation of Y, F̃y is the truncated empirical distribution of Y given by

  F̃y(y) = δn if F̂y(y) < δn;  F̂y(y) if δn ≤ F̂y(y) ≤ 1 − δn;  1 − δn if F̂y(y) > 1 − δn,

  F̂y(y) = (1/n) Σ_{i=1}^n 1(yi ≤ y) is the empirical distribution of Y, and δn is the default truncation parameter.

- For each predictor Xk, calculate the Henze-Zirkler test statistic

  ω̂k = (1/n²) Σ_{i=1}^n Σ_{j=1}^n exp{−β²‖zi − zj‖²/2} − (2/n)(1 + β²)^{−1} Σ_{i=1}^n exp{−β²‖zi‖²/(2(1 + β²))} + (1 + 2β²)^{−1},  (2)

  where zi = (T̃y(yi), T̃k(xki)) denotes the ith realization of the transformed pair, and β is the smoothing parameter, whose optimal value is (1.25n)^{1/6}/√2, corresponding to the optimal bandwidth for a nonparametric kernel density estimator with Gaussian kernel (Henze and Zirkler, 1990).

- Select the set of predictors with large values of ω̂k, i.e., set

  D̂ = {k : ω̂k ≥ cn−κ, 1 ≤ k ≤ p},

  where c and κ are predetermined threshold values.
Since c and κ are usually difficult to determine, we follow other feature screening methods and set the size of the selected set to [n/ log(n)], where [z] denotes the integer part of z. Since the proposed method employs the Henze-Zirkler test statistic to measure the dependence between the transformed response variable and predictors, we call it the Henze-Zirkler sure independence screening method, or HZ-SIS for short.
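The screening steps above can be sketched as follows. This minimal Python sketch assumes the variables have already been passed through the estimated nonparanormal transformation; the statistic implements the standard Henze-Zirkler form for dimension d = 2 with known identity covariance, and the function names are ours.

```python
import math

def hz_statistic(z, beta):
    """Henze-Zirkler statistic for bivariate data z (list of (u, v) pairs),
    tested against N(0, I_2), i.e. with known identity covariance."""
    n = len(z)
    d = 2
    t1 = sum(
        math.exp(-beta**2 * ((a[0] - b[0])**2 + (a[1] - b[1])**2) / 2.0)
        for a in z for b in z
    ) / n**2
    t2 = sum(
        math.exp(-beta**2 * (a[0]**2 + a[1]**2) / (2.0 * (1 + beta**2)))
        for a in z
    ) * 2.0 * (1 + beta**2) ** (-d / 2) / n
    t3 = (1 + 2 * beta**2) ** (-d / 2)
    return t1 - t2 + t3

def hz_sis(ty, tx_cols, top):
    """Rank predictors by their HZ statistic against the transformed response
    and keep the `top` highest-scoring ones (e.g. top = [n / log(n)])."""
    n = len(ty)
    beta = (1.25 * n) ** (1.0 / 6.0) / math.sqrt(2.0)  # optimal beta for d = 2
    scores = []
    for k, col in enumerate(tx_cols):
        z = list(zip(ty, col))
        scores.append((hz_statistic(z, beta), k))
    scores.sort(reverse=True)
    return [k for _, k in scores[:top]]
```

A strongly dependent predictor makes the transformed pair deviate from N(0, I2), inflating the statistic, so ranking by ω̂k pushes active predictors to the front.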
2.2. Theoretical properties
To study the theoretical properties of the HZ-SIS method, we first describe how the HZ-test statistic in (2) is derived. Define

ωk = ∫ |ϕk(t) − exp(−‖t‖²/2)|² φβ(t) dt,

where ϕk(t) is the characteristic function of (Ty(Y), Tk(Xk)) and φβ(t) is the density function of N(0, β²I2). Recall that exp(−‖t‖²/2) is the characteristic function of N(0, I2). Therefore, ωk can be viewed as a weighted average of the squared difference between the characteristic function of the transformed variables and the characteristic function of N(0, I2). It is easy to verify that ωk equals zero if and only if Xk and Y are marginally independent.
Given observations {(x1, y1),…, (xn, yn)}, where xi = (x1i,…, xpi) denotes the predictor variables in the ith observation, we first use the truncated empirical distribution to estimate the CDF of each variable. In order to estimate ωk, we re-express it in the following form (Henze and Zirkler, 1990) by some algebra:

ωk = E exp{−β²‖Z − Z′‖²/2} − 2(1 + β²)^{−1} E exp{−β²‖Z‖²/(2(1 + β²))} + (1 + 2β²)^{−1},

where Z = (Ty(Y), Tk(Xk)) and Z′ is an independent copy of Z. With this representation, ωk can be estimated using a V-statistic, which leads to the HZ-test statistic used in (2).
Next, we study the sure screening property of the HZ-SIS method. As mentioned previously, compared to the existing methods, HZ-SIS requires fewer assumptions for establishing its sure screening property. The assumptions are given as follows.
(C1) There exist constants c > 0 and 0 ≤ κ ≤ 1/4 such that min_{k∈D} ωk ≥ 2cn−κ.
(C2) The dimension p = O(exp(nτ)) for some constant 0 ≤ τ < .
Assumption (C1) can be viewed as a regularity condition for sure screening methods, which assumes that the minimum true signal cannot be too weak to be detectable for a given sample size, although it can gradually diminish to zero as the sample size increases to infinity. A similar assumption has been used in other methods, see e.g., Li et al. (2012) and Cui et al. (2015). Assumption (C2) allows an exponential growth of the dimension p as a function of the sample size. It is also standard for ultrahigh-dimensional methods. To establish the sure screening property for HZ-SIS, the key step is to establish an exponential probability bound for |ω̂k − ωk|. The following lemma presents such an exponential probability bound, with the proof given in the Appendix.
Lemma 1.
If the truncation parameter , where , then there exists a positive constant c1 > 0 such that
Here we set m = 1/2 as the default value for the HZ-SIS method, and this default value is used in all examples of this paper. Based on this lemma, we establish the sure screening property of HZ-SIS in the following theorem.
Theorem 1.
Under conditions (C1) and (C2), we have
Proof. If D ⊄ D̂, there must exist some k ∈ D such that ω̂k < cn−κ. Recall that ωk ≥ 2cn−κ for all k ∈ D by condition (C1). Therefore, we have |ω̂k − ωk| > cn−κ. This indicates that there exists some k ∈ D such that |ω̂k − ωk| > cn−κ. Consequently, the union bound together with the exponential probability bound of Lemma 1 yields the stated result, which concludes the proof.
3. Simulation Studies
In this section, we used three simulated examples to assess the finite sample performance of HZ-SIS along with comparisons with SIS (Fan and Lv, 2008), DC-SIS (Li et al., 2012), Qa-SIS (He et al., 2013) and MV-SIS (Cui et al., 2015). In addition, NIS (Fan et al., 2011) was implemented for an additive model (Example 3.1), and the sliced inverse regression for interaction detection (SIRI) method (Jiang and Liu, 2014) was implemented for models with interaction terms (Examples 3.2 and 3.3). For each example, we generated 100 independent datasets and summarized the performance of the methods in a few statistics. These statistics include the minimum size of the selected set needed to cover all active variables, denoted by MSD for short; and, for the given size νn = [n/ log(n)] of the selected set, the proportion of datasets in which a single active predictor Xk is covered (denoted by Pk) and the proportion in which all active variables are covered (denoted by Pa). The reason for choosing these statistics is that, in practice, we usually specify the size νn of the selected set instead of the thresholding value cn−κ for feature screening.
3.1. An Additive Model Example
This example is adopted from Cui et al. (2015). Let
and consider the additive model
where the error term ε follows the t(1) distribution, i.e., the Cauchy distribution. For the predictors, we consider two different distributions:
Xk’s, k = 1,…, p, are generated independently from the distribution t(4);
Xk’s, k = 1,…, p, are generated independently from the Uniform[−2.5,2.5].
We set (n, p) = (200, 2000) and repeated each case 100 times. In Qa-SIS, we set τ = 0.5 and the number of basis functions dn = 3. In NIS, we took the number of basis functions dn = 5. In MV-SIS, we discretized each predictor into a four-category variable using the first, second and third quartiles as knots. The same discretization method is used for MV-SIS in all examples of this paper. The results are summarized in Table 1.
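One replicate of this design can be sketched as below. The four additive components used here are generic stand-ins, not the exact g1,…, g4 of Cui et al. (2015), so the sketch illustrates the data-generating mechanism (t(4) or uniform predictors, Cauchy error) rather than the precise example.

```python
import math
import random

def rand_t(df, rng):
    """One Student-t(df) draw: standard normal over sqrt(chi-square / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

def simulate_additive(n, p, case, seed=0):
    """One replicate of the additive-model design.

    The component functions below are hypothetical stand-ins for
    g1,...,g4 of Cui et al. (2015); the error is Cauchy = t(1).
    """
    rng = random.Random(seed)
    draw = (lambda: rand_t(4, rng)) if case == 1 else (lambda: rng.uniform(-2.5, 2.5))
    X = [[draw() for _ in range(p)] for _ in range(n)]
    g = [lambda x: x,
         lambda x: x * x,
         lambda x: math.sin(x),
         lambda x: math.exp(-abs(x))]
    y = []
    for row in X:
        eps = math.tan(math.pi * (rng.random() - 0.5))  # Cauchy error draw
        y.append(sum(g[j](row[j]) for j in range(4)) + eps)
    return X, y
```

Only the first four predictors enter the model; the remaining p − 4 columns are inactive noise, which is what the screening methods must discover.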
Table 1:
Results for the additive model example. For MSD, we report the median with the associated interquartile range (IQR) in the parentheses. Cases 1 and 2 refer to predictors generated from the t(4) and Uniform[−2.5, 2.5] distributions, respectively.
| Case | Method | MSD | P1 | P2 | P3 | P4 | Pa |
|---|---|---|---|---|---|---|---|
| Case 1 | SIS | 976.50 (1023.00) | 0.07 | 0.08 | 0.96 | 0.98 | 0.01 |
| Case 1 | NIS | 1342.50 (704.75) | 0.01 | 0.20 | 0.05 | 0.76 | 0.00 |
| Case 1 | DC-SIS | 279.50 (656.75) | 0.27 | 0.46 | 0.57 | 0.95 | 0.16 |
| Case 1 | MV-SIS | 24.00 (118.00) | 0.83 | 0.73 | 0.97 | 0.95 | 0.58 |
| Case 1 | Qa-SIS | 347.50 (653.50) | 0.02 | 0.81 | 0.22 | 0.98 | 0.00 |
| Case 1 | HZ-SIS | 11.50 (22.00) | 0.98 | 0.90 | 0.97 | 0.94 | 0.80 |
| Case 2 | SIS | 1216.50 (964.75) | 0.10 | 0.02 | 1.00 | 1.00 | 0.00 |
| Case 2 | NIS | 924.00 (1257.25) | 0.16 | 0.17 | 0.30 | 0.33 | 0.06 |
| Case 2 | DC-SIS | 197.00 (339.00) | 0.20 | 0.31 | 0.98 | 0.98 | 0.06 |
| Case 2 | MV-SIS | 11.00 (28.00) | 0.94 | 0.83 | 1.00 | 1.00 | 0.78 |
| Case 2 | Qa-SIS | 8.00 (15.50) | 0.91 | 0.91 | 1.00 | 1.00 | 0.82 |
| Case 2 | HZ-SIS | 24.50 (52.25) | 0.71 | 0.88 | 0.99 | 1.00 | 0.64 |
Table 1 shows that when the predictors are generated from t(4), a heavy-tailed distribution, HZ-SIS performs best, followed by MV-SIS, DC-SIS and Qa-SIS. This result, combined with the fact that HZ-SIS requires fewer assumptions for the sure screening property, indicates that HZ-SIS is a more robust feature screening method than the existing ones. When the predictors are generated from the uniform distribution, for which the support is bounded, HZ-SIS still performs reasonably well. In this case, it is comparable with MV-SIS and Qa-SIS, but much better than DC-SIS, NIS and SIS.
For the case where the predictors are generated from t(4) distribution, we also plotted histograms of the screening indices for each method. Specifically, for each method, we first combined the screening indices calculated for the 100 datasets. Then we drew a histogram using all 400 indices of active predictors and a histogram using 600 indices of inactive predictors. The latter were randomly selected from a total of 199,600 (100 × 1996) indices. Finally, the two histograms were merged into the same plot and differentiated by color. As shown in Figure 1, the two histograms produced by HZ-SIS have the smallest overlapping area. This again indicates the superiority of HZ-SIS in identifying active features.
Figure 1:
Histograms of screening indices produced by different methods for the additive model example with the predictors generated from the distribution t(4).
3.2. A Model with Interaction Variables
This example illustrates the performance of HZ-SIS for the models with interaction variables. Let
The covariates X = (X1,…, Xp)T are generated from a multivariate normal distribution with mean 0 and covariance matrix Σ = (σij)p×p with σij = 0.5|i−j|. For the error term ε, we considered two cases: (i) ε follows the N(0, 1) distribution; (ii) ε follows the Cauchy distribution. We set (n, p) = (200, 1000) and generated 100 datasets for each case.
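The covariance structure σij = 0.5^{|i−j|} can be sampled without forming Σ explicitly, since it is exactly the stationary AR(1) autocorrelation; a minimal sketch (function name ours):

```python
import math
import random

def ar1_normal_row(p, rho, rng):
    """One draw from N(0, Sigma) with Sigma_ij = rho^{|i-j|}.

    Uses the AR(1) recursion X_k = rho * X_{k-1} + sqrt(1 - rho^2) * e_k,
    which reproduces the rho^{|i-j|} covariance exactly.
    """
    x = [rng.gauss(0.0, 1.0)]
    s = math.sqrt(1.0 - rho * rho)
    for _ in range(p - 1):
        x.append(rho * x[-1] + s * rng.gauss(0.0, 1.0))
    return x
```

This avoids the O(p³) Cholesky factorization of a 1000 × 1000 matrix, generating each row in O(p) time.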
Jiang and Liu (2014) recently proposed a procedure, sliced inverse regression for interaction detection (SIRI), to conduct high-dimensional variable selection for models with interaction terms. Instead of building a predictive model for the response given combinations of predictors, this procedure works by modeling the conditional distribution of predictors given the response. Since this procedure includes a screening step, we also implemented it and denote it as SIRI-SIS. In SIRI-SIS, we used a fixed slicing scheme with 10 slices of equal size (H = 10). In the Qa-SIS procedure, we set τ = 0.4 and the number of basis functions dn = 3. The results are summarized in Table 2.
Table 2:
Results for the model with interaction variables. For MSD, we report the median with the associated interquartile range (IQR) given in the parentheses. Cases 1 and 2 refer to random errors generated from the N(0,1) and Cauchy distributions, respectively.
| Case | Method | MSD | P1 | P50 | Pa |
|---|---|---|---|---|---|
| Case 1 | SIS | 686.00 (321.25) | 1.00 | 0.00 | 0.00 |
| Case 1 | SIRI-SIS | 3.00 (1.00) | 1.00 | 0.99 | 0.99 |
| Case 1 | DC-SIS | 50.50 (52.25) | 1.00 | 0.39 | 0.39 |
| Case 1 | MV-SIS | 34.00 (73.75) | 1.00 | 0.52 | 0.52 |
| Case 1 | Qa-SIS | 426.00 (478.00) | 1.00 | 0.02 | 0.02 |
| Case 1 | HZ-SIS | 3.00 (1.00) | 1.00 | 1.00 | 1.00 |
| Case 2 | SIS | 575.50 (387.00) | 0.78 | 0.03 | 0.01 |
| Case 2 | SIRI-SIS | 11.00 (52.00) | 1.00 | 0.69 | 0.69 |
| Case 2 | DC-SIS | 167.00 (251.25) | 1.00 | 0.12 | 0.12 |
| Case 2 | MV-SIS | 97.50 (158.75) | 1.00 | 0.23 | 0.23 |
| Case 2 | Qa-SIS | 414.50 (403.75) | 1.00 | 0.02 | 0.02 |
| Case 2 | HZ-SIS | 9.50 (22.00) | 1.00 | 0.86 | 0.86 |
Table 2 indicates that in the case where the random error is generated from the normal distribution, all methods can detect X1 with ease. However, when it comes to detecting X50, HZ-SIS and SIRI-SIS substantially outperform the other methods. For the case where the random error follows the Cauchy distribution, we have a similar conclusion. In addition, our method performs slightly better than SIRI-SIS in this case.
To understand the performance of these methods, we show in Figure 2 the scatter plots of the transformed predictors T˜1(X1) and T˜50(X50) versus the transformed response variable T˜y(Y) for a dataset generated in case 1. Given the reference scatter plot of (T˜y(Y), T˜100(X100)), for which the theoretical joint distribution is N(0, I2), we can see that the joint distributions of (T˜y(Y), T˜1(X1)) and (T˜y(Y), T˜50(X50)) substantially deviate from N(0, I2), and thereby the HZ-test is powerful in detecting the dependence of Y on X1 and X50. However, not all other methods work well for this example. As indicated by the values of P50 reported in Table 2, SIS and Qa-SIS essentially fail to detect the dependence of Y on X50, and DC-SIS and MV-SIS have only a limited success probability in detecting this dependence.
Figure 2:
Scatter plots of the transformed response variable T˜y(Y) versus the transformed predictors T˜1(X1), T˜50(X50) and T˜100(X100).
3.3. A Complex Model with More Interaction Variables
This example illustrates the performance of HZ-SIS for more complex models. Let
where A is generated from the set {−1, 1} with equal probability, and the Xk's are independently generated from the t(4) distribution. For the error term ε, we considered two cases: (i) ε follows the N(0, 1) distribution; (ii) ε follows the Cauchy distribution. This model is complex, containing more interaction variables than the previous model. We set (n, p) = (400, 1000) and repeated each case 100 times.
In SIRI-SIS, we used a fixed slicing scheme with 10 slices of equal size (H = 10). In the Qa-SIS procedure, we set τ = 0.4 and the number of basis functions dn = 3. The results are summarized in Table 3. In both cases, HZ-SIS has an overall better performance than the other methods.
Table 3:
Results for the complex model with more interaction variables. For MSD, we report the median with the associated interquartile range (IQR) given in the parentheses. Cases 1 and 2 refer to random errors generated from the N(0,1) and Cauchy distributions, respectively.
| Case | Method | MSD | P1 | P2 | P3 | P5 | P6 | P7 | P8 | Pa |
|---|---|---|---|---|---|---|---|---|---|---|
| Case 1 | SIS | 841.50 (196.25) | 0.07 | 0.30 | 0.21 | 0.78 | 0.06 | 0.08 | 0.06 | 0.00 |
| Case 1 | SIRI-SIS | 470.00 (377.25) | 0.73 | 0.99 | 1.00 | 0.34 | 0.05 | 0.48 | 0.53 | 0.00 |
| Case 1 | DC-SIS | 911.00 (117.00) | 0.04 | 0.94 | 0.13 | 0.04 | 0.07 | 0.04 | 0.07 | 0.00 |
| Case 1 | MV-SIS | 511.50 (322.00) | 0.28 | 0.63 | 1.00 | 0.88 | 0.09 | 0.38 | 0.33 | 0.00 |
| Case 1 | Qa-SIS | 420.50 (305.25) | 0.02 | 0.97 | 0.98 | 0.09 | 0.05 | 0.05 | 0.08 | 0.00 |
| Case 1 | HZ-SIS | 245.50 (341.75) | 0.90 | 1.00 | 1.00 | 0.93 | 0.30 | 0.76 | 0.53 | 0.12 |
| Case 2 | SIS | 882.50 (183.50) | 0.10 | 0.29 | 0.16 | 0.84 | 0.07 | 0.09 | 0.11 | 0.00 |
| Case 2 | SIRI-SIS | 432.00 (378.25) | 0.56 | 0.99 | 1.00 | 0.17 | 0.06 | 0.45 | 0.40 | 0.00 |
| Case 2 | DC-SIS | 899.50 (134.25) | 0.10 | 0.97 | 0.11 | 0.10 | 0.09 | 0.03 | 0.08 | 0.00 |
| Case 2 | MV-SIS | 484.00 (325.00) | 0.17 | 0.63 | 1.00 | 0.85 | 0.11 | 0.39 | 0.36 | 0.00 |
| Case 2 | Qa-SIS | 416.00 (453.25) | 0.04 | 0.97 | 0.99 | 0.06 | 0.08 | 0.03 | 0.08 | 0.00 |
| Case 2 | HZ-SIS | 200.50 (309.25) | 0.79 | 1.00 | 1.00 | 0.89 | 0.26 | 0.71 | 0.40 | 0.10 |
4. Screening of Anticancer Drug Response Genes
Recent advances in high-throughput biotechnologies, such as microarray, sequencing technologies and mass spectrometry, have provided an unprecedented opportunity for biomarker discovery. Molecular biomarkers can not only facilitate disease diagnosis, but also reveal underlying, biologically distinct, patient subgroups with different sensitivities to a specific therapy. The latter is known as disease heterogeneity, which is often observed in complex diseases such as cancer. For example, molecularly targeted cancer drugs are only effective for patients with tumors expressing targets (Grünwald and Hidalgo, 2003; Buzdar, 2009). The disease heterogeneity has directly motivated the development of precision medicine, which aims to improve patient care by tailoring optimal therapies to an individual patient according to his/her molecular profile and other clinical characteristics.
Toward the ultimate goal of precision medicine, i.e., selecting the right drugs for individual patients, a recent large-scale pharmacogenomics study, namely the cancer cell line encyclopedia (CCLE), has screened multiple anticancer drugs over hundreds of cell lines in order to elucidate the response mechanism of anticancer drugs. The dataset consists of the dose-response data for 24 chemical compounds across 479 cell lines. For each cell line, it also contains the expression data of 18,988 genes. The dataset is publicly available at www.broadinstitute.org/ccle. Our goal is to screen the genes that respond to each chemical compound, which facilitates the subsequent analysis for identification of anticancer drug response genes. In our analysis, we used the area under the dose-response curve, termed the activity area in Barretina et al. (2012), to measure the sensitivity of a given cell line to a drug. Compared to other measurements, such as IC50 and EC50, the activity area captures the efficacy and potency of a drug simultaneously.
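For intuition, an area under a dose-response curve can be approximated by simple trapezoidal integration; this sketch is illustrative only and does not reproduce the exact activity-area definition of Barretina et al. (2012), which is computed over the fixed CCLE dose points.

```python
def activity_area(log_doses, inhibition):
    """Trapezoidal area under an inhibition-vs-log-dose curve.

    Illustrative stand-in for the CCLE activity area: it simply
    integrates whatever response curve is supplied.
    """
    area = 0.0
    for i in range(len(log_doses) - 1):
        width = log_doses[i + 1] - log_doses[i]
        area += 0.5 * (inhibition[i] + inhibition[i + 1]) * width
    return area
```

A drug that both acts at low doses and achieves strong inhibition accumulates a large area, which is why such a summary reflects potency and efficacy at once.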
The drug topotecan (trade name Hycamtin) is a chemotherapeutic agent that is a topoisomerase inhibitor. It is a synthetic, water-soluble analog of the natural chemical compound camptothecin and has been used to treat ovarian cancer, lung cancer and other cancer types. After GlaxoSmithKline received final FDA approval for Hycamtin Capsules in 2007, topotecan became the first topoisomerase I inhibitor for oral use. Table 4 lists the top 10 important genes selected for topotecan by HZ-SIS. For comparison, the table also includes the top 10 genes selected by SIS, DC-SIS, MV-SIS and Qa-SIS. In the Qa-SIS procedure, we set τ = 0.5 and the number of basis functions dn = 3. For topotecan, the gene SLFN11 has been recognized as a very important predictor of the sensitivity of topotecan (Barretina et al., 2012; Zoppoli et al., 2012). HZ-SIS ranks it No. 10. In addition to SLFN11, Wang et al. (2014) found the strong relevance of HMGB2 and BCLAF1 to topotecan. HZ-SIS ranks these two genes No. 2 and No. 4, respectively. DC-SIS has a similar performance to HZ-SIS for the drug topotecan, while the other methods do not.
Table 4:
Top 10 genes selected for the drug topotecan by different methods.
| Rank | SIS | DC-SIS | MV-SIS | Qa-SIS | HZ-SIS |
|---|---|---|---|---|---|
| 1 | CADM2 | ITGB5 | HMGB2 | FLJ35816 | RFXAP |
| 2 | MMP27 | PPIC | KIF15 | GATS | HMGB2 |
| 3 | WNT5B | HMGB2 | RFXAP | HS3ST3A1 | ITGB5 |
| 4 | CDX4 | ARHGAP19 | ARHGAP19 | ASIC4 | BCLAF1 |
| 5 | ELF4 | RFXAP | CD63 | TRAV26–2 | CPSF6 |
| 6 | ECI2 | CPSF6 | TAF5 | LOC100128239 | HAUS1 |
| 7 | GLIPR1 | CD63 | CPSF6 | LOC100507300 | ILF3 |
| 8 | ABCC9 | TAF5 | ELAVL1 | VPS72 | ELAVL1 |
| 9 | ADCY5 | CNTRL | ILF3 | PIK3IP1 | TAF5 |
| 10 | PPIC | SLFN11 | S100A10 | CRSF6 | SLFN11 |
The drug 17-AAG is a derivative of the antibiotic geldanamycin that is being studied for the treatment of cancer, specifically in young patients with certain types of leukemia or solid tumors, especially kidney tumors. 17-AAG works by inhibiting HSP90, which is expressed in those tumors, and belongs to the family of drugs called antitumor antibiotics. Table 5 reports the top 10 genes ranked by different methods for 17-AAG. According to Hadley and Hendricks (2014) and Barretina et al. (2012), the gene NQO1 is the top predictive biomarker for 17-AAG. HZ-SIS ranks it first among all genes. DC-SIS and MV-SIS also rank it first.
Table 5:
Top 10 genes selected for the drug 17-AAG by different methods.
| Rank | SIS | DC-SIS | MV-SIS | Qa-SIS | HZ-SIS |
|---|---|---|---|---|---|
| 1 | UXT | NQO1 | NQO1 | MMP24 | NQO1 |
| 2 | IGFN1 | MMP24 | INO80 | ATP6V0E1 | OGDHL |
| 3 | MSH2 | ZNF610 | MMP24 | ZFP30 | TMEM198 |
| 4 | ROCK1 | ZFP30 | ZNF610 | GPR35 | ZBTB7A |
| 5 | DDA1 | NFKB1 | ZFP30 | SLC1A5 | GYG2 |
| 6 | SCEL | CDH6 | RPUSD4 | GPX2 | CDH6 |
| 7 | ST5 | OGDHL | LOC100507372 | CNTRL | ZNF610 |
| 8 | THUMPD3 | LOC100507372 | PCSK1N | VPS72 | RPUSD4 |
| 9 | ITGA9 | RPUSD4 | NFKB1 | LOC100507373 | CSK |
| 10 | C20orf141 | INO80 | ZBTB7A | ZNF610 | CTCF |
In the supplementary material, we give the top 10 genes selected by HZ-SIS for each of the remaining 22 drugs. The results are promising. For example, Barretina et al. (2012) and Zoppoli et al. (2012) reported that the gene SLFN11 is predictive of treatment response for both topotecan and irinotecan. As shown in the supplementary material, SLFN11 is also among the top 10 response genes for the drug irinotecan.
5. Discussion
This paper has proposed HZ-SIS as a new model-free feature screening method, and established its sure screening property under very weak conditions. The HZ-SIS method contains two components, nonparanormal transformation and HZ-test. The numerical examples indicate that, compared to the existing methods, HZ-SIS can achieve better performance when the covariates follow a heavy-tailed distribution and when the underlying true model is complex with interaction variables. The reason why HZ-SIS can achieve such a robust performance can be understood from two aspects. First, HZ-SIS does not require any extra conditions except for two regularity conditions that are generally required for high-dimensional feature screening. Second, the truncated empirical CDF estimator used in the estimated nonparanormal transformation helps to reduce the effect of extreme observations.
In HZ-SIS, the HZ-test is employed to test the normality of the nonparanormally transformed data. Besides the HZ-test, other multivariate normality tests, such as Székely-Rizzo's goodness-of-fit test (Székely and Rizzo, 2005) and Mardia's skewness and kurtosis tests (Mardia, 1970), can also be applied here. Since none of the tests is universally superior, a combination of different tests might produce a higher power. For example, if we view the feature screening problem as a clustering problem (i.e., separating the active and inactive predictors), the ensemble averaging clustering method of Liang (2008) can be used to aggregate the results from different tests.
Henze and Zirkler (1990) showed that under the null hypothesis that the testing data are drawn from a multivariate Gaussian distribution, the HZ-test statistic follows a log-normal distribution. This implies that the statistic ω̂k approximately follows a log-normal distribution if Y and Xk are independent, although a rigorous theoretical justification is still needed to account for the effect caused by the estimation error of the nonparanormal transformation. Hence, compared to the existing variable screening methods, HZ-SIS has the added advantage that the relevance of an individual predictor to the response variable can be measured with a p-value, and thus many of the existing multiple hypothesis testing procedures can be applied to assist feature screening. In particular, we can use the multiple hypothesis testing approach to determine the size of the selected set D̂ by controlling the false discovery rate.
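As an alternative to the log-normal calibration, the p-value of any marginal dependence index, including ω̂k, can be approximated by permutation; a minimal sketch (the absolute-inner-product score used in the test below is only a stand-in for the HZ statistic, and the function name is ours):

```python
import random

def permutation_pvalue(score, y, x, n_perm=200, seed=0):
    """Permutation p-value for a marginal dependence index.

    `score(y, x)` can be any index for which larger values indicate
    stronger dependence (e.g. the HZ statistic of the paper).
    """
    rng = random.Random(seed)
    observed = score(y, x)
    hits = 0
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the pairing between y and x
        if score(y_perm, x) >= observed:
            hits += 1
    # add-one correction keeps the p-value strictly above zero
    return (hits + 1) / (n_perm + 1)
```

Permutation calibration avoids distributional approximations entirely, at the cost of n_perm evaluations of the statistic per predictor.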
Finally, we note that in practice we often encounter the case where there are strong correlations among the predictors. To handle this problem, Fan and Lv (2008) and Zhu et al. (2011) proposed iterative versions of the SIS and SIRS algorithms, respectively. SIS works on linear regression, so the residual of the response can be used to rule out certain strongly correlated predictors via iterations. SIRS works on multi-index models, where the predictors act through a linear combination, so the residual of the predictors, i.e., the projection of the remaining predictors onto the orthogonal complement of the space spanned by the already selected predictors, can be used for iterations. Similarly, for HZ-SIS, the correlation among predictors can be accounted for by the following procedure:
(1) Apply HZ-SIS to the data and select p(1) predictors, where p(1) < [n/log(n)]. Denote the set of selected predictors by A(1).
(2) Treat each predictor Xk ∉ A(1) as a response variable and conduct a regression analysis on the predictors included in the set A(1) using a regularization method, e.g., Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001) or MCP (Zhang, 2010). Calculate the residual of the regression.
(3) Apply HZ-SIS to the residuals calculated in step (2). Suppose that p(2) residuals are selected, and denote the corresponding set of variables by A(2). Update the set of selected predictors to A(1) ∪ A(2).
(4) Repeat steps (2) and (3) m − 1 times until the total number of selected predictors p(1) + ··· + p(m) exceeds the pre-specified number [n/log(n)]. The finally selected predictor set is A(1) ∪ ··· ∪ A(m).
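The iteration above can be sketched in code as follows. This is a minimal illustration, not the published implementation: a marginal absolute-correlation score stands in for the HZ-test statistic on the transformed pair, and ordinary least squares stands in for the penalized regression of step (2); the function names and the budget value are assumptions for the example.

```python
import numpy as np

def marginal_score(x, y):
    # Stand-in for the HZ-test statistic on the nonparanormally
    # transformed pair (illustration only): absolute Pearson correlation.
    return abs(np.corrcoef(x, y)[0, 1])

def iterative_screen(X, y, step_size, budget):
    """Iterative screening in the spirit of steps (1)-(4): screen, regress
    the remaining predictors on the selected ones, screen the residuals,
    and repeat until the budget (e.g., [n/log(n)]) is reached."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    Z = X.copy()                                   # working (residualized) predictors
    while remaining and len(selected) < budget:
        scores = [marginal_score(Z[:, k], y) for k in remaining]
        top = [remaining[i] for i in np.argsort(scores)[::-1][:step_size]]
        selected.extend(top)
        remaining = [k for k in remaining if k not in top]
        # Step (2): regress each remaining (original) predictor on the
        # selected ones (OLS here; the paper suggests Lasso/SCAD/MCP)
        # and keep the residual for the next screening round.
        A = X[:, selected]
        for k in remaining:
            beta, *_ = np.linalg.lstsq(A, X[:, k], rcond=None)
            Z[:, k] = X[:, k] - A @ beta
    return selected

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(n)    # X1 strongly correlated with X0
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.standard_normal(n)

picked = iterative_screen(X, y, step_size=2, budget=4)
```

In this toy example, X1 is nearly collinear with X0; after X0 and X1 are selected in the first round, residualization lets the weaker true predictor X3 surface in the second round rather than being crowded out.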
It is obvious that if a predictor is strongly correlated with some already selected predictors, it will not be selected by this procedure. This increases the chance of selecting true predictors that are only weakly dependent on the response. In general, we suggest working on the original predictors in step (2). However, for some problems, e.g., when the predictors are bounded, we might choose to work on the nonparanormally transformed predictors.
Supplementary Material
Acknowledgments
Liang’s research was partially supported by grants DMS-1545202, DMS-1612924 and R01-GM117597. The authors thank the editor, associate editor and two referees for their constructive comments, which have led to significant improvement of this paper.
Appendix A Proof of Lemma 1
Define
where
For any ε > 0, we have
For simplicity, in what follows we let and . For the first term, we have
Note that we only deal with because can be calculated in a similar way. First, we calculate
Among the ten terms, and are of a higher order, and the other terms share the same order. Hence, we only consider the probability
Define the event as
Since for the standard Gaussian random variable Z,
| (3) |
we have
Therefore,
For simplicity, henceforth, we let
Set the truncation parameter and split the interval into
and
Therefore,
We now analyze these two terms separately.
From Lemma 12.3 of Abramovich et al. (2006), if we let Φ−1(η) denote the upper ηth percentile of the standard Gaussian distribution, for η ≥ 0.99 we have
Based on this lemma, we can show
for any , provided that n is sufficiently large.
Then we can bound Δij under :
if n is sufficiently large. Therefore,
where the last inequality follows from the fact that
Recall and assume we have
if n is sufficiently large. Further, we have
where the last inequality follows from Hoeffding’s inequality.
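For the reader's convenience, the form of Hoeffding's inequality invoked here is the classical one: for independent random variables Z1, …, Zn with ai ≤ Zi ≤ bi almost surely,

```latex
P\left( \left| \frac{1}{n}\sum_{i=1}^{n} \bigl(Z_i - \mathbb{E} Z_i\bigr) \right| \ge t \right)
\;\le\; 2\exp\left( - \frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right),
\qquad t > 0.
```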
Now we turn to Define the event as
Following from (3), we have
where the last inequality follows from Hoeffding’s inequality. Therefore,
Recall that under , for , we have . So we can rewrite as . By the mean value theorem, we further have
where s is between . From Lemma 12.3 of Abramovich et al. (2006), we know . Also, recall that under , for t ∈ Mn, both are bounded by . Therefore,
Combining them together, we are able to show
Using the Dvoretzky-Kiefer-Wolfowitz inequality, we have
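For the reader's convenience, the Dvoretzky-Kiefer-Wolfowitz inequality (with the tight constant due to Massart) states that for the empirical distribution function F̂n of n i.i.d. observations from a distribution F,

```latex
P\left( \sup_{x \in \mathbb{R}} \bigl| \hat{F}_n(x) - F(x) \bigr| > \varepsilon \right)
\;\le\; 2 e^{-2 n \varepsilon^2},
\qquad \varepsilon > 0.
```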
Now it remains to deal with Recall that is a V-statistic, and the corresponding U-statistic is given by
By noting that when i = j, and < 1 for other cases, it is easy to show
Recall that ε is of a lower order of ; therefore, we can consider instead.
The kernel h(xki; xkj ; yi; yj) of is bounded,
Therefore, it follows from Theorem 5.6.1.A of Serfling (1980) that
where denotes the integer part of.
In summary, by letting ε = cn−κ, we have
with the additional constraint . To optimize the convergence rate, we should let and , then we can obtain
Hence,
which completes the proof.
Contributor Information
Jingnan Xue, Department of Statistics, Texas A&M University, College Station, TX 77843.
Faming Liang, Department of Biostatistics, University of Florida, Gainesville, FL 32611.
References
- Abramovich F, Benjamini Y, Donoho DL, and Johnstone IM (2006), “Adapting to unknown sparsity by controlling the false discovery rate,” The Annals of Statistics, 34, 584–653.
- Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin A, Kim S, Wilson C, Lehar J, Kryukov G, Sonkin D, Reddy A, Liu M, Murray L, Berger M, Monahan J, Morais P, Meltzer J, Korejwa A, Jane-Valbuena J, Mapa F, Thibault J, Bric-Furlong E, Raman P, Shipway A, Engels I, et al. (2012), “The Cancer Cell Line Encyclopedia enables predictive modeling of anticancer drug sensitivity,” Nature, 483, 603–607.
- Buzdar A (2009), “Role of biologic therapy and chemotherapy in hormone receptor- and HER2-positive breast cancer,” Annals of Oncology, 20, 993–999.
- Cui H, Li R, and Zhong W (2015), “Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis,” Journal of the American Statistical Association, 110, 630–641.
- Fan J, Feng Y, and Song R (2011), “Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models,” Journal of the American Statistical Association, 106, 544–557.
- Fan J and Li R (2001), “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, 96, 1348–1360.
- Fan J and Lv J (2008), “Sure independence screening for ultrahigh dimensional feature space,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 849–911.
- Fan J, Samworth R, and Wu Y (2009), “Ultrahigh Dimensional Feature Selection: Beyond The Linear Model,” Journal of Machine Learning Research, 10, 2013–2038.
- Fan J and Song R (2010), “Sure independence screening in generalized linear models with NP-dimensionality,” The Annals of Statistics, 38, 3567–3604.
- Grünwald V and Hidalgo M (2003), “Developing inhibitors of the epidermal growth factor receptor for cancer treatment,” Journal of the National Cancer Institute, 95, 851–867.
- Hadley KE and Hendricks DT (2014), “Use of NQO1 status as a selective biomarker for oesophageal squamous cell carcinomas with greater sensitivity to 17-AAG,” BMC Cancer, 14, 1–8.
- Hall P and Miller H (2009), “Using generalized correlation to effect variable selection in very high dimensional problems,” Journal of Computational and Graphical Statistics, 18, 533–550.
- He X, Wang L, and Hong HG (2013), “Correction: Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data,” The Annals of Statistics, 41, 2699.
- Henze N and Zirkler B (1990), “A class of invariant consistent tests for multivariate normality,” Communications in Statistics - Theory and Methods, 19, 3595–3617.
- Huang J, Horowitz JL, and Ma S (2008), “Asymptotic properties of bridge estimators in sparse high-dimensional regression models,” The Annals of Statistics, 36, 587–613.
- Jiang B and Liu J (2014), “Sliced Inverse Regression with Variable Selection and Interaction Detection,” arXiv e-prints.
- Li R, Zhong W, and Zhu L (2012), “Feature Screening via Distance Correlation Learning,” Journal of the American Statistical Association, 107, 1129–1139.
- Liang F (2008), “Clustering gene expression profiles using mixture model ensemble averaging approach,” JP Journal of Biostatistics, 2, 57–80.
- Liu H, Lafferty J, and Wasserman L (2009), “The nonparanormal: Semiparametric estimation of high dimensional undirected graphs,” Journal of Machine Learning Research, 10, 2295–2328.
- Mardia K (1970), “Measures of multivariate skewness and kurtosis with applications,” Biometrika, 57, 519–530.
- Serfling R (1980), Approximation Theorems of Mathematical Statistics, Wiley Series in Probability and Statistics, Wiley.
- Székely G and Rizzo M (2005), “A new test for multivariate normality,” Journal of Multivariate Analysis, 93, 58–80.
- Székely G, Rizzo M, and Bakirov N (2007), “Measuring and testing dependence by correlation of distances,” The Annals of Statistics, 35, 2769–2794.
- Tibshirani R (1996), “Regression shrinkage and selection via the LASSO,” Journal of the Royal Statistical Society, Series B, 58, 267–288.
- Wang K, Shrestha R, Wyatt AW, Reddy A, Lehr J, Wang Y, Lapuk A, and Collins CC (2014), “A Meta-Analysis Approach for Characterizing Pan-Cancer Mechanisms of Drug Sensitivity in Cell Lines,” PLoS One, 9, 1–16.
- Zhang C-H (2010), “Nearly unbiased variable selection under minimax concave penalty,” The Annals of Statistics, 38, 894–942.
- Zhu L-P, Li L, Li R, and Zhu L-X (2011), “Model-Free Feature Screening for Ultrahigh-Dimensional Data,” Journal of the American Statistical Association, 106, 1464–1475.
- Zoppoli G, Regairaz M, Leo E, Reinhold W, Varma S, Ballestrero A, Doroshow J, and Pommier Y (2012), “Putative DNA/RNA helicase Schlafen-11 (SLFN11) sensitizes cancer cells to DNA-damaging agents,” Proceedings of the National Academy of Sciences USA, 109, 15030–15035.