Abstract
Feature screening plays an important role in dimension reduction for ultrahigh-dimensional data. In this paper, we introduce a new feature screening method and establish its sure independence screening property under the ultrahigh-dimensional setting. The proposed method is based on the nonparanormal transformation and Henze-Zirkler's test: it first transforms the response variable and features to Gaussian random variables using the nonparanormal transformation and then tests the dependence between the response variable and each feature using Henze-Zirkler's test. The proposed method enjoys at least two merits. First, it is model-free, avoiding the specification of a particular model structure. Second, it is condition-free, requiring no extra conditions beyond some regularity conditions common to high-dimensional feature screening. The numerical results indicate that, compared to the existing methods, the proposed method is more robust to data generated from heavy-tailed distributions and/or complex models with interaction variables. The proposed method is applied to screening of anticancer drug response genes.
Keywords: Gene Screening, Henze-Zirkler’s Test, Nonparanormal Transformation, Sure Independence Screening, Precision Medicine
1. Introduction
Variable selection plays an important role in high-dimensional data analysis. However, under the ultrahigh-dimensional setting, where the number of covariates may grow at an exponential rate of the sample size, the current variable selection methods may not work well due to the simultaneous challenges of computational expediency, statistical accuracy, and algorithmic stability (Fan et al., 2009). A practical approach is to use a screening procedure to reduce the dimension of the feature space to a moderate scale, and then implement a variable selection method on the reduced dataset. To pursue this approach, Fan and Lv (2008) proposed the sure independence screening (SIS) method for linear regression, where the features are screened based on their respective Pearson correlation coefficients with the response variable. They established the sure screening property, i.e., all active predictors are selected with probability approaching one as the sample size increases to infinity. Fan and Song (2010) extended SIS to generalized linear models, where the features are screened based on the estimates of the regression coefficients in their respective marginal models. For nonlinear models, Hall and Miller (2009) suggested polynomial transformations of the predictors, and Fan et al. (2011) suggested estimating the nonparametric component of each feature using B-splines and then screening the features based on the magnitudes of their respective nonparametric components.
All the above methods require the specification of a particular model structure. If the underlying model is correctly specified, these methods can perform well. However, if the underlying model is misspecified, their performance may deteriorate. Under the ultrahigh-dimensional setting, specifying a correct model is usually an impossible task, and thus model-free feature screening methods are appealing. Toward this direction, Zhu et al. (2011) proposed a sure independence ranking and screening (SIRS) method to screen significant features for multi-index models. Li et al. (2012) proposed a distance correlation sure independence screening (DC-SIS) method, where the features are screened based on their distance correlation (Székely et al., 2007) with the response variable. It is known that for two random variables, a zero distance correlation coefficient implies independence. He et al. (2013) proposed a quantile-adaptive sure independence screening (Qa-SIS) method, which employs spline approximation to model the marginal effect of each predictor at a given quantile level and then screens the predictors accordingly. A particular strength of this method is that it can handle the censored data arising in survival analysis. Recently, Cui et al. (2015) proposed a mean variance sure independence screening (MV-SIS) method, where the dependence of two random variables is measured using the mean variance of the conditional distribution function. This method was originally proposed for categorical response variables, but can be extended to problems with a continuous response variable via discretization.
Although the model-free variable screening methods avoid the specification of a particular model structure, they still rest, more or less, on assumptions about the predictors and the response variable. For example, DC-SIS requires both the predictors and the response variable to satisfy the sub-exponential tail probability condition uniformly; that is, in practice, the response variable and predictors should be uniformly bounded or follow a multivariate Gaussian distribution. Qa-SIS requires the derivative of the conditional quantile function to satisfy a Lipschitz condition and the conditional density function to be uniformly bounded for each feature.
In this article, we propose a new model-free feature screening method and establish its sure independence screening property under very weak conditions. The proposed method builds on the nonparanormal transformation (Liu et al., 2009) and Henze-Zirkler's test (Henze and Zirkler, 1990): it first transforms the response variable and each of the predictors to Gaussian random variables using the nonparanormal transformation, and then tests the dependence between the response variable and the predictors using Henze-Zirkler's test. Compared to the existing methods, the proposed method requires fewer assumptions to ensure its sure independence screening property and thus performs more robustly. Our numerical studies indicate that the new method can achieve better performance when the covariates follow a heavy-tailed distribution and when the underlying true model is complex with interaction variables.
The rest of this article is organized as follows. In Section 2, we describe the proposed method and establish its sure independence screening property. In Section 3, we conduct simulation studies to evaluate the finite sample performance of the proposed method along with comparisons with the existing methods. In Section 4, we apply the proposed method to screening of anticancer drug response genes. In Section 5, we conclude the article with a brief discussion. Technical proofs are given in the Appendix.
2. Robust Feature Screening
2.1. Henze-Zirkler Sure Independence Screening
Let Y denote a continuous response variable, let X = (X1,…, Xp) denote p continuous covariates, let n denote the sample size, and let f(y|x) denote the conditional density function of Y given X. Under the ultrahigh-dimensional setting, where p = O(exp(nτ)) for some τ > 0, a sparsity condition is generally needed. For example, we may assume that there are only a small number of predictors relevant to the response variable, although p can be much greater than n. Without specifying a parametric form for the regression model, we define the set of active predictors as D = {k : f(y|x) functionally depends on Xk, 1 ≤ k ≤ p} and the set of inactive predictors as I = {1,…, p} \ D.
A direct identification of the active predictor set D is usually difficult or even impossible under the ultrahigh-dimensional setting. Therefore, a common strategy is to first identify a moderate-size set of variables that includes all the elements of D, and then apply a variable selection method to this moderate-size set to accurately identify D. Note that if f(y|x) functionally depends on Xk, then Y and Xk are usually marginally dependent as well. Hence, the moderate-size set of variables can be constructed by selecting only the predictors that are marginally dependent with Y; this procedure is usually referred to as independence screening. In fact, under the partial orthogonality condition (Fan and Song, 2010; Huang et al., 2008), i.e., {Xi, i ∈ D} is independent of {Xj, j ∈ I}, it can be shown that f(y|x) functionally depends on Xk if and only if Y and Xk are marginally dependent.
To implement independence screening, we need to find a metric to measure the marginal dependence between each predictor Xk and the response variable Y . Several metrics have already been proposed, see e.g., Zhu et al. (2011), Cui et al. (2015), Li et al. (2012), and He et al. (2013). In this paper, we propose a new one with the basic idea described as follows. Let Fy(·) denote the CDF of the response variable Y , and let Fk(·) denote the CDF of the predictor Xk. Consider the nonparanormal transformation (Liu et al., 2009)
Ty(Y) = Φ−1(Fy(Y)),  Tk(Xk) = Φ−1(Fk(Xk)),  (1)
where Φ(·) denotes the CDF of the standard Gaussian distribution. Liu et al. (2009) applied the nonparanormal transformation to nonparanormal random variables; here it is applied to general continuous random variables. Then it is easy to see that Y is independent of Xk if and only if (Ty(Y), Tk(Xk)) follows the bivariate normal distribution N2(0, I2), where I2 denotes the 2 × 2 identity matrix. The latter can be tested using a multivariate normality test, e.g., Henze-Zirkler's test (Henze and Zirkler, 1990), with the known covariance structure. If (Ty(Y), Tk(Xk)) does not follow the distribution N2(0, I2), then the Henze-Zirkler test statistic tends to take a larger value. In practice, since Fy and the Fk's are usually unknown, we use the estimated nonparanormal transformation of Liu et al. (2009), which has been implemented in the R package huge.
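For concreteness, the estimated transformation can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the huge implementation; in particular, the truncation level delta_n below is an illustrative choice, not the paper's exact default.

```python
import bisect
import math
from statistics import NormalDist

def truncated_ecdf(sample):
    """Winsorized (truncated) empirical CDF of one variable.

    delta_n here is an illustrative truncation level; the paper's
    default depends on its truncation parameter m (m = 1/2 there).
    """
    n = len(sample)
    delta_n = 1.0 / (4.0 * math.sqrt(n))  # assumption: illustrative rate only
    sorted_x = sorted(sample)

    def F(x):
        # fraction of observations <= x, truncated away from 0 and 1
        # so the Gaussian quantile below stays finite
        f = bisect.bisect_right(sorted_x, x) / n
        return min(max(f, delta_n), 1.0 - delta_n)

    return F

def nonparanormal_transform(sample):
    """Estimated nonparanormal transform: x -> Phi^{-1}(F_tilde(x))."""
    F = truncated_ecdf(sample)
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf(F(x)) for x in sample]
```

The truncation keeps the empirical CDF strictly inside (0, 1), so the Gaussian quantile never diverges at the sample extremes.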
In summary, the proposed method consists of the following steps:
- Transform all variables, including the response variable and predictors, to standard Gaussian random variables by the estimated nonparanormal transformation. Taking the response variable as an example, let

  T̃y(yi) = Φ−1(F̃y(yi)),

  where yi denotes the ith observation of Y, F̃y is the truncated empirical distribution of Y given by

  F̃y(y) = δn if F̂y(y) < δn;  F̂y(y) if δn ≤ F̂y(y) ≤ 1 − δn;  1 − δn if F̂y(y) > 1 − δn,

  F̂y(y) = (1/n) Σ_{i=1}^n 1(yi ≤ y) is the empirical distribution of Y, and δn is the default truncation parameter.

- For each predictor Xk, calculate the Henze-Zirkler test statistic

  ω̂k = (1/n²) Σ_{i=1}^n Σ_{j=1}^n exp{−β²‖zi − zj‖²/2} − (2/n)(1 + β²)^{−1} Σ_{i=1}^n exp{−β²‖zi‖²/(2(1 + β²))} + (1 + 2β²)^{−1},  (2)

  where zi = (T̃y(yi), T̃k(xki)) denotes the ith realization of the transformed pair, and β is the smoothing parameter, whose optimal value is (1.25n)^{1/6}/√2, corresponding to the optimal bandwidth for a nonparametric kernel density estimator with Gaussian kernel (Henze and Zirkler, 1990).

- Select the set of predictors with large values of ω̂k, i.e., set

  D̂ = {k : ω̂k ≥ cn−κ, 1 ≤ k ≤ p},

  where c and κ are predetermined threshold values.
Since c and κ are usually difficult to determine, we follow other feature screening methods and set the size of the selected set to [n/ log(n)], where [z] denotes the integer part of z. Since the proposed method employs the Henze-Zirkler test statistic to measure the dependence between the transformed response variable and predictors, we call it the Henze-Zirkler sure independence screening method, or HZ-SIS for short.
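The screening steps above can be sketched as follows. This minimal Python sketch assumes the variables have already been passed through the estimated nonparanormal transformation; the statistic implements the standard Henze-Zirkler form for dimension d = 2 with known identity covariance, and the function names are ours.

```python
import math

def hz_statistic(z, beta):
    """Henze-Zirkler statistic for bivariate data z (list of (u, v) pairs),
    tested against N(0, I_2), i.e. with known identity covariance."""
    n = len(z)
    d = 2
    t1 = sum(
        math.exp(-beta**2 * ((a[0] - b[0])**2 + (a[1] - b[1])**2) / 2.0)
        for a in z for b in z
    ) / n**2
    t2 = sum(
        math.exp(-beta**2 * (a[0]**2 + a[1]**2) / (2.0 * (1 + beta**2)))
        for a in z
    ) * 2.0 * (1 + beta**2) ** (-d / 2) / n
    t3 = (1 + 2 * beta**2) ** (-d / 2)
    return t1 - t2 + t3

def hz_sis(ty, tx_cols, top):
    """Rank predictors by their HZ statistic against the transformed response
    and keep the `top` highest-scoring ones (e.g. top = [n / log(n)])."""
    n = len(ty)
    beta = (1.25 * n) ** (1.0 / 6.0) / math.sqrt(2.0)  # optimal beta for d = 2
    scores = []
    for k, col in enumerate(tx_cols):
        z = list(zip(ty, col))
        scores.append((hz_statistic(z, beta), k))
    scores.sort(reverse=True)
    return [k for _, k in scores[:top]]
```

A strongly dependent predictor makes the transformed pair deviate from N(0, I2), inflating the statistic, so ranking by ω̂k pushes active predictors to the front.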
2.2. Theoretical properties
To study the theoretical properties of the HZ-SIS method, we first describe how the HZ-test statistic in (2) is derived. Define

ωk = ∫ |ϕk(t) − exp(−‖t‖²/2)|² φβ(t) dt,

where ϕk(t) is the characteristic function of (Ty(Y), Tk(Xk)) and φβ(t) is the density function of N(0, β²I2). Recall that exp(−‖t‖²/2) is the characteristic function of N(0, I2). Therefore, ωk can be viewed as a weighted average of the squared difference between the characteristic function of the transformed variables and the characteristic function of N(0, I2). It is easy to verify that ωk equals zero if and only if Xk and Y are marginally independent.
Given observations {(x1, y1),…, (xn, yn)}, where xi = (x1i,…, xpi) denotes the predictor variables in the ith observation, we first use the truncated empirical distribution to estimate the CDF of each variable. In order to estimate ωk, we re-express it in the following form (Henze and Zirkler, 1990) by some algebra:

ωk = E exp{−β²‖Z − Z′‖²/2} − 2(1 + β²)^{−1} E exp{−β²‖Z‖²/(2(1 + β²))} + (1 + 2β²)^{−1},

where Z = (Ty(Y), Tk(Xk)) and Z′ is an independent copy of Z. With this representation, ωk can be estimated using a V-statistic, which leads to the HZ-test statistic used in (2).
Next, we study the sure screening property of the HZ-SIS method. As mentioned previously, compared to the existing methods, HZ-SIS requires fewer assumptions for establishing its sure screening property. The assumptions are given as follows.
(C1) There exist constants c > 0 and 0 ≤ κ ≤ 1/4 such that min_{k∈D} ωk ≥ 2cn−κ.
(C2) The dimension p = O(exp(nτ)) for some constant 0 ≤ τ < .
Assumption (C1) can be viewed as a regularity condition for sure screening methods, which assumes that the minimum true signal cannot be too weak to be detectable for a given sample size, although it can gradually diminish to zero as the sample size increases to infinity. A similar assumption has been used in other methods, see e.g., Li et al. (2012) and Cui et al. (2015). Assumption (C2) allows an exponential growth of the dimension p as a function of the sample size. It is also standard for ultrahigh-dimensional methods. To establish the sure screening property for HZ-SIS, the key step is to establish an exponential probability bound for |ω̂k − ωk|. The following lemma presents such an exponential probability bound, with the proof given in the Appendix.
Lemma 1.
If the truncation parameter , where , then there exists a positive constant c1 > 0 such that
Here we set m = 1/2 as the default value for the HZ-SIS method, and this default value is used in all examples of this paper. Based on this lemma, we establish the sure screening property of HZ-SIS in the following theorem.
Theorem 1.
Under conditions (C1) and (C2), we have
Proof. If D ⊄ D̂, there must exist some k ∈ D such that ω̂k < cn−κ. Recall that ωk ≥ 2cn−κ for all k ∈ D by condition (C1). Therefore, we have |ω̂k − ωk| > cn−κ. This indicates that there exists some k ∈ D such that |ω̂k − ωk| > cn−κ. Consequently, the union bound together with the exponential probability bound of Lemma 1 yields the stated result, which concludes the proof.
3. Simulation Studies
In this section, we used three simulated examples to assess the finite sample performance of HZ-SIS along with comparisons with SIS (Fan and Lv, 2008), DC-SIS (Li et al., 2012), Qa-SIS (He et al., 2013) and MV-SIS (Cui et al., 2015). In addition, NIS (Fan et al., 2011) was implemented for an additive model (Example 3.1), and the sliced inverse regression for interaction detection (SIRI) method (Jiang and Liu, 2014) was implemented for models with interaction terms (Examples 3.2 and 3.3). For each example, we generated 100 independent datasets and summarized the performance of the methods in a few statistics. These statistics include the minimum size of the selected set needed to cover all active variables, denoted by MSD for short; and, for the given size νn = [n/ log(n)] of the selected set, the proportion of datasets in which a single active predictor Xk is covered (denoted by Pk) and the proportion in which all active variables are covered (denoted by Pa). The reason for choosing these statistics is that, in practice, we usually specify the size νn of the selected set instead of the thresholding value cn−κ for feature screening.
3.1. An Additive Model Example
This example is adopted from Cui et al. (2015). Let
and consider the additive model
where the error term ε follows the t(1) distribution, i.e., the Cauchy distribution. For the predictors, we consider two different distributions:
Xk’s, k = 1,…, p, are generated independently from the distribution t(4);
Xk’s, k = 1,…, p, are generated independently from the Uniform[−2.5,2.5].
We set (n, p) = (200, 2000) and repeated each case 100 times. In Qa-SIS, we set τ = 0.5 and the number of basis functions dn = 3. In NIS, we took the number of basis functions dn = 5. In MV-SIS, we discretized each predictor into a four-category variable using the first, second and third quartiles as knots. The same discretization method is used for MV-SIS in all examples of this paper. The results are summarized in Table 1.
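One replicate of this design can be sketched as below. The four additive components used here are generic stand-ins, not the exact g1,…, g4 of Cui et al. (2015), so the sketch illustrates the data-generating mechanism (t(4) or uniform predictors, Cauchy error) rather than the precise example.

```python
import math
import random

def rand_t(df, rng):
    """One Student-t(df) draw: standard normal over sqrt(chi-square / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

def simulate_additive(n, p, case, seed=0):
    """One replicate of the additive-model design.

    The component functions below are hypothetical stand-ins for
    g1,...,g4 of Cui et al. (2015); the error is Cauchy = t(1).
    """
    rng = random.Random(seed)
    draw = (lambda: rand_t(4, rng)) if case == 1 else (lambda: rng.uniform(-2.5, 2.5))
    X = [[draw() for _ in range(p)] for _ in range(n)]
    g = [lambda x: x,
         lambda x: x * x,
         lambda x: math.sin(x),
         lambda x: math.exp(-abs(x))]
    y = []
    for row in X:
        eps = math.tan(math.pi * (rng.random() - 0.5))  # Cauchy error draw
        y.append(sum(g[j](row[j]) for j in range(4)) + eps)
    return X, y
```

Only the first four predictors enter the model; the remaining p − 4 columns are inactive noise, which is what the screening methods must discover.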
Table 1:
Results for the additive model example. For MSD, we report the median with the associated interquartile range (IQR) in the parentheses. Cases 1 and 2 refer to predictors generated from the t(4) and Uniform[−2.5, 2.5] distributions, respectively.
| Case | Method | MSD | P1 | P2 | P3 | P4 | Pa |
|---|---|---|---|---|---|---|---|
| Case 1 | SIS | 976.50 (1023.00) | 0.07 | 0.08 | 0.96 | 0.98 | 0.01 |
| Case 1 | NIS | 1342.50 (704.75) | 0.01 | 0.20 | 0.05 | 0.76 | 0.00 |
| Case 1 | DC-SIS | 279.50 (656.75) | 0.27 | 0.46 | 0.57 | 0.95 | 0.16 |
| Case 1 | MV-SIS | 24.00 (118.00) | 0.83 | 0.73 | 0.97 | 0.95 | 0.58 |
| Case 1 | Qa-SIS | 347.50 (653.50) | 0.02 | 0.81 | 0.22 | 0.98 | 0.00 |
| Case 1 | HZ-SIS | 11.50 (22.00) | 0.98 | 0.90 | 0.97 | 0.94 | 0.80 |
| Case 2 | SIS | 1216.50 (964.75) | 0.10 | 0.02 | 1.00 | 1.00 | 0.00 |
| Case 2 | NIS | 924.00 (1257.25) | 0.16 | 0.17 | 0.30 | 0.33 | 0.06 |
| Case 2 | DC-SIS | 197.00 (339.00) | 0.20 | 0.31 | 0.98 | 0.98 | 0.06 |
| Case 2 | MV-SIS | 11.00 (28.00) | 0.94 | 0.83 | 1.00 | 1.00 | 0.78 |
| Case 2 | Qa-SIS | 8.00 (15.50) | 0.91 | 0.91 | 1.00 | 1.00 | 0.82 |
| Case 2 | HZ-SIS | 24.50 (52.25) | 0.71 | 0.88 | 0.99 | 1.00 | 0.64 |
Table 1 shows that when the predictors are generated from t(4), a heavy-tailed distribution, HZ-SIS performs best, followed by MV-SIS, DC-SIS and Qa-SIS. This result, combined with the fact that HZ-SIS requires fewer assumptions for the sure screening property, indicates that HZ-SIS is a more robust feature screening method than the existing ones. When the predictors are generated from the uniform distribution, for which the support is bounded, HZ-SIS still performs reasonably well. In this case, it is comparable with MV-SIS and Qa-SIS, but much better than DC-SIS, NIS and SIS.
For the case where the predictors are generated from t(4) distribution, we also plotted histograms of the screening indices for each method. Specifically, for each method, we first combined the screening indices calculated for the 100 datasets. Then we drew a histogram using all 400 indices of active predictors and a histogram using 600 indices of inactive predictors. The latter were randomly selected from a total of 199,600 (100 × 1996) indices. Finally, the two histograms were merged into the same plot and differentiated by color. As shown in Figure 1, the two histograms produced by HZ-SIS have the smallest overlapping area. This again indicates the superiority of HZ-SIS in identifying active features.
Figure 1:
Histograms of screening indices produced by different methods for the additive model example with the predictors generated from the distribution t(4).
3.2. A Model with Interaction Variables
This example illustrates the performance of HZ-SIS for the models with interaction variables. Let
The covariates X = (X1,…, Xp)T are generated from a multivariate normal distribution with mean 0 and covariance matrix Σ = (σij)p×p with σij = 0.5|i−j|. For the error term ε, we considered two cases: (i) ε follows the N(0, 1) distribution; (ii) ε follows the Cauchy distribution. We set (n, p) = (200, 1000) and generated 100 datasets for each case.
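The covariance structure σij = 0.5^{|i−j|} can be sampled without forming Σ explicitly, since it is exactly the stationary AR(1) autocorrelation; a minimal sketch (function name ours):

```python
import math
import random

def ar1_normal_row(p, rho, rng):
    """One draw from N(0, Sigma) with Sigma_ij = rho^{|i-j|}.

    Uses the AR(1) recursion X_k = rho * X_{k-1} + sqrt(1 - rho^2) * e_k,
    which reproduces the rho^{|i-j|} covariance exactly.
    """
    x = [rng.gauss(0.0, 1.0)]
    s = math.sqrt(1.0 - rho * rho)
    for _ in range(p - 1):
        x.append(rho * x[-1] + s * rng.gauss(0.0, 1.0))
    return x
```

This avoids the O(p³) Cholesky factorization of a 1000 × 1000 matrix, generating each row in O(p) time.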
Jiang and Liu (2014) recently proposed a procedure, sliced inverse regression for interaction detection (SIRI), to conduct high-dimensional variable selection for models with interaction terms. Instead of building a predictive model for the response given combinations of predictors, this procedure works by modeling the conditional distribution of predictors given the response. Since this procedure includes a screening step, we also implemented it and denote it as SIRI-SIS. In SIRI-SIS, we used a fixed slicing scheme with 10 slices of equal size (H = 10). In the Qa-SIS procedure, we set τ = 0.4 and the number of basis functions dn = 3. The results are summarized in Table 2.
Table 2:
Results for the model with interaction variables. For MSD, we report the median with the associated interquartile range (IQR) given in the parentheses. Cases 1 and 2 refer to random errors generated from the N(0,1) and Cauchy distributions, respectively.
| Case | Method | MSD | P1 | P50 | Pa |
|---|---|---|---|---|---|
| Case 1 | SIS | 686.00 (321.25) | 1.00 | 0.00 | 0.00 |
| Case 1 | SIRI-SIS | 3.00 (1.00) | 1.00 | 0.99 | 0.99 |
| Case 1 | DC-SIS | 50.50 (52.25) | 1.00 | 0.39 | 0.39 |
| Case 1 | MV-SIS | 34.00 (73.75) | 1.00 | 0.52 | 0.52 |
| Case 1 | Qa-SIS | 426.00 (478.00) | 1.00 | 0.02 | 0.02 |
| Case 1 | HZ-SIS | 3.00 (1.00) | 1.00 | 1.00 | 1.00 |
| Case 2 | SIS | 575.50 (387.00) | 0.78 | 0.03 | 0.01 |
| Case 2 | SIRI-SIS | 11.00 (52.00) | 1.00 | 0.69 | 0.69 |
| Case 2 | DC-SIS | 167.00 (251.25) | 1.00 | 0.12 | 0.12 |
| Case 2 | MV-SIS | 97.50 (158.75) | 1.00 | 0.23 | 0.23 |
| Case 2 | Qa-SIS | 414.50 (403.75) | 1.00 | 0.02 | 0.02 |
| Case 2 | HZ-SIS | 9.50 (22.00) | 1.00 | 0.86 | 0.86 |
Table 2 indicates that in the case where the random error is generated from the normal distribution, all methods can detect X1 with ease. However, when it comes to detecting X50, HZ-SIS and SIRI-SIS substantially outperform the other methods. For the case where the random error follows the Cauchy distribution, we have a similar conclusion. In addition, our method performs slightly better than SIRI-SIS in this case.
To understand the performance of these methods, we show in Figure 2 the scatter plots of the transformed predictors T˜1(X1) and T˜50(X50) versus the transformed response variable T˜y(Y) for a dataset generated in case 1. Given the reference scatter plot of (T˜y(Y), T˜100(X100)), for which the theoretical joint distribution is N(0, I2), we can see that the joint distributions of (T˜y(Y), T˜1(X1)) and (T˜y(Y), T˜50(X50)) substantially deviate from N(0, I2), and thereby the HZ-test is powerful in detecting the dependence of Y on X1 and X50. However, not all other methods work well for this example. As indicated by the values of P50 reported in Table 2, SIS and Qa-SIS essentially fail to detect the dependence of Y on X50, and DC-SIS and MV-SIS have only a limited success probability in detecting this dependence.
Figure 2:
Scatter plots of the transformed response variable T˜y(Y) versus the transformed predictors T˜1(X1), T˜50(X50) and T˜100(X100).
3.3. A Complex Model with More Interaction Variables
This example illustrates the performance of HZ-SIS for more complex models. Let
where A is generated from the set {−1, 1} with equal probability, and the Xk's are independently generated from the t(4) distribution. For the error term ε, we considered two cases: (i) ε follows the N(0, 1) distribution; (ii) ε follows the Cauchy distribution. This model is complex, containing more interaction variables than the previous model. We set (n, p) = (400, 1000) and repeated each case 100 times.
In SIRI-SIS, we used a fixed slicing scheme with 10 slices of equal size (H = 10). In the Qa-SIS procedure, we set τ = 0.4 and the number of basis functions dn = 3. The results are summarized in Table 3. In both cases, HZ-SIS has an overall better performance than the other methods.
Table 3:
Results for the complex model with more interaction variables. For MSD, we report the median with the associated interquartile range (IQR) given in the parentheses. Cases 1 and 2 refer to random errors generated from the N(0,1) and Cauchy distributions, respectively.
| Case | Method | MSD | P1 | P2 | P3 | P5 | P6 | P7 | P8 | Pa |
|---|---|---|---|---|---|---|---|---|---|---|
| Case 1 | SIS | 841.50 (196.25) | 0.07 | 0.30 | 0.21 | 0.78 | 0.06 | 0.08 | 0.06 | 0.00 |
| Case 1 | SIRI-SIS | 470.00 (377.25) | 0.73 | 0.99 | 1.00 | 0.34 | 0.05 | 0.48 | 0.53 | 0.00 |
| Case 1 | DC-SIS | 911.00 (117.00) | 0.04 | 0.94 | 0.13 | 0.04 | 0.07 | 0.04 | 0.07 | 0.00 |
| Case 1 | MV-SIS | 511.50 (322.00) | 0.28 | 0.63 | 1.00 | 0.88 | 0.09 | 0.38 | 0.33 | 0.00 |
| Case 1 | Qa-SIS | 420.50 (305.25) | 0.02 | 0.97 | 0.98 | 0.09 | 0.05 | 0.05 | 0.08 | 0.00 |
| Case 1 | HZ-SIS | 245.50 (341.75) | 0.90 | 1.00 | 1.00 | 0.93 | 0.30 | 0.76 | 0.53 | 0.12 |
| Case 2 | SIS | 882.50 (183.50) | 0.10 | 0.29 | 0.16 | 0.84 | 0.07 | 0.09 | 0.11 | 0.00 |
| Case 2 | SIRI-SIS | 432.00 (378.25) | 0.56 | 0.99 | 1.00 | 0.17 | 0.06 | 0.45 | 0.40 | 0.00 |
| Case 2 | DC-SIS | 899.50 (134.25) | 0.10 | 0.97 | 0.11 | 0.10 | 0.09 | 0.03 | 0.08 | 0.00 |
| Case 2 | MV-SIS | 484.00 (325.00) | 0.17 | 0.63 | 1.00 | 0.85 | 0.11 | 0.39 | 0.36 | 0.00 |
| Case 2 | Qa-SIS | 416.00 (453.25) | 0.04 | 0.97 | 0.99 | 0.06 | 0.08 | 0.03 | 0.08 | 0.00 |
| Case 2 | HZ-SIS | 200.50 (309.25) | 0.79 | 1.00 | 1.00 | 0.89 | 0.26 | 0.71 | 0.40 | 0.10 |
4. Screening of Anticancer Drug Response Genes
Recent advances in high-throughput biotechnologies, such as microarray, sequencing technologies and mass spectrometry, have provided an unprecedented opportunity for biomarker discovery. Molecular biomarkers can not only facilitate disease diagnosis, but also reveal underlying, biologically distinct, patient subgroups with different sensitivities to a specific therapy. The latter is known as disease heterogeneity, which is often observed in complex diseases such as cancer. For example, molecularly targeted cancer drugs are only effective for patients with tumors expressing targets (Grünwald and Hidalgo, 2003; Buzdar, 2009). The disease heterogeneity has directly motivated the development of precision medicine, which aims to improve patient care by tailoring optimal therapies to an individual patient according to his/her molecular profile and other clinical characteristics.
Toward the ultimate goal of precision medicine, i.e., selecting the right drugs for individual patients, a recent large-scale pharmacogenomics study, namely the cancer cell line encyclopedia (CCLE), has screened multiple anticancer drugs over hundreds of cell lines in order to elucidate the response mechanism of anticancer drugs. The dataset consists of the dose-response data for 24 chemical compounds across 479 cell lines. For each cell line, it also contains the expression data of 18,988 genes. The dataset is publicly available at www.broadinstitute.org/ccle. Our goal is to screen the genes that respond to each chemical compound, which facilitates the subsequent analysis for identification of anticancer drug response genes. In our analysis, we used the area under the dose-response curve, termed the activity area in Barretina et al. (2012), to measure the sensitivity of a given cell line to a drug. Compared to other measurements, such as IC50 and EC50, the activity area captures the efficacy and potency of a drug simultaneously.
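For intuition, an area under a dose-response curve can be approximated by simple trapezoidal integration; this sketch is illustrative only and does not reproduce the exact activity-area definition of Barretina et al. (2012), which is computed over the fixed CCLE dose points.

```python
def activity_area(log_doses, inhibition):
    """Trapezoidal area under an inhibition-vs-log-dose curve.

    Illustrative stand-in for the CCLE activity area: it simply
    integrates whatever response curve is supplied.
    """
    area = 0.0
    for i in range(len(log_doses) - 1):
        width = log_doses[i + 1] - log_doses[i]
        area += 0.5 * (inhibition[i] + inhibition[i + 1]) * width
    return area
```

A drug that both acts at low doses and achieves strong inhibition accumulates a large area, which is why such a summary reflects potency and efficacy at once.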
The drug topotecan (trade name Hycamtin) is a chemotherapeutic agent that is a topoisomerase inhibitor. It is a synthetic, water-soluble analog of the natural chemical compound camptothecin and has been used to treat ovarian cancer, lung cancer and other cancer types. After GlaxoSmithKline received final FDA approval for Hycamtin Capsules in 2007, topotecan became the first topoisomerase I inhibitor for oral use. Table 4 lists the top 10 important genes selected for topotecan by HZ-SIS. For comparison, the table also includes the top 10 genes selected by SIS, DC-SIS, MV-SIS and Qa-SIS. In the Qa-SIS procedure, we set τ = 0.5 and the number of basis functions dn = 3. For topotecan, the gene SLFN11 has been recognized as a very important predictor of the sensitivity of topotecan (Barretina et al., 2012; Zoppoli et al., 2012). HZ-SIS ranks it No. 10. In addition to SLFN11, Wang et al. (2014) found the strong relevance of HMGB2 and BCLAF1 to topotecan. HZ-SIS ranks these two genes No. 2 and No. 4, respectively. DC-SIS has a similar performance to HZ-SIS for the drug topotecan, while the other methods do not.
Table 4:
Top 10 genes selected for the drug topotecan by different methods.
| Rank | SIS | DC-SIS | MV-SIS | Qa-SIS | HZ-SIS |
|---|---|---|---|---|---|
| 1 | CADM2 | ITGB5 | HMGB2 | FLJ35816 | RFXAP |
| 2 | MMP27 | PPIC | KIF15 | GATS | HMGB2 |
| 3 | WNT5B | HMGB2 | RFXAP | HS3ST3A1 | ITGB5 |
| 4 | CDX4 | ARHGAP19 | ARHGAP19 | ASIC4 | BCLAF1 |
| 5 | ELF4 | RFXAP | CD63 | TRAV26–2 | CPSF6 |
| 6 | ECI2 | CPSF6 | TAF5 | LOC100128239 | HAUS1 |
| 7 | GLIPR1 | CD63 | CPSF6 | LOC100507300 | ILF3 |
| 8 | ABCC9 | TAF5 | ELAVL1 | VPS72 | ELAVL1 |
| 9 | ADCY5 | CNTRL | ILF3 | PIK3IP1 | TAF5 |
| 10 | PPIC | SLFN11 | S100A10 | CRSF6 | SLFN11 |
The drug 17-AAG is a derivative of the antibiotic geldanamycin that is being studied for the treatment of cancer, specifically in young patients with certain types of leukemia or solid tumors, especially kidney tumors. 17-AAG works by inhibiting HSP90, which is expressed in those tumors, and belongs to the family of drugs called antitumor antibiotics. Table 5 reports the top 10 genes ranked by different methods for 17-AAG. According to Hadley and Hendricks (2014) and Barretina et al. (2012), the gene NQO1 is the top predictive biomarker for 17-AAG. HZ-SIS ranks it first among all genes. DC-SIS and MV-SIS also rank it first.
Table 5:
Top 10 genes selected for the drug 17-AAG by different methods.
| Rank | SIS | DC-SIS | MV-SIS | Qa-SIS | HZ-SIS |
|---|---|---|---|---|---|
| 1 | UXT | NQO1 | NQO1 | MMP24 | NQO1 |
| 2 | IGFN1 | MMP24 | INO80 | ATP6V0E1 | OGDHL |
| 3 | MSH2 | ZNF610 | MMP24 | ZFP30 | TMEM198 |
| 4 | ROCK1 | ZFP30 | ZNF610 | GPR35 | ZBTB7A |
| 5 | DDA1 | NFKB1 | ZFP30 | SLC1A5 | GYG2 |
| 6 | SCEL | CDH6 | RPUSD4 | GPX2 | CDH6 |
| 7 | ST5 | OGDHL | LOC100507372 | CNTRL | ZNF610 |
| 8 | THUMPD3 | LOC100507372 | PCSK1N | VPS72 | RPUSD4 |
| 9 | ITGA9 | RPUSD4 | NFKB1 | LOC100507373 | CSK |
| 10 | C20orf141 | INO80 | ZBTB7A | ZNF610 | CTCF |
In the supplementary material, we give the top 10 genes selected by HZ-SIS for each of the remaining 22 drugs. The results are promising. For example, Barretina et al. (2012) and Zoppoli et al. (2012) reported that the gene SLFN11 is predictive of treatment response for both topotecan and irinotecan. As shown in the supplementary material, SLFN11 is also among the top 10 response genes for the drug irinotecan.
5. Discussion
This paper has proposed HZ-SIS as a new model-free feature screening method, and established its sure screening property under very weak conditions. The HZ-SIS method contains two components, nonparanormal transformation and HZ-test. The numerical examples indicate that, compared to the existing methods, HZ-SIS can achieve better performance when the covariates follow a heavy-tailed distribution and when the underlying true model is complex with interaction variables. The reason why HZ-SIS can achieve such a robust performance can be understood from two aspects. First, HZ-SIS does not require any extra conditions except for two regularity conditions that are generally required for high-dimensional feature screening. Second, the truncated empirical CDF estimator used in the estimated nonparanormal transformation helps to reduce the effect of extreme observations.
In HZ-SIS, the HZ-test is employed to test the normality of the nonparanormally transformed data. Besides the HZ-test, other multivariate normality tests, such as Székely-Rizzo's goodness-of-fit test (Székely and Rizzo, 2005) and Mardia's skewness and kurtosis tests (Mardia, 1970), can also be applied here. Since none of the tests is universally superior, a combination of different tests might produce a higher power. For example, if we view the feature screening problem as a clustering problem (i.e., separating the active and inactive predictors), the ensemble averaging clustering method of Liang (2008) can be used to aggregate the results from different tests.
Henze and Zirkler (1990) showed that under the null hypothesis that the testing data are drawn from a multivariate Gaussian distribution, the HZ-test statistic follows a log-normal distribution. This implies that the statistic ω̂k approximately follows a log-normal distribution if Y and Xk are independent, although a rigorous theoretical justification is still needed to account for the effect caused by the estimation error of the nonparanormal transformation. Hence, compared to the existing variable screening methods, HZ-SIS has the added advantage that the relevance of an individual predictor to the response variable can be measured with a p-value, and thus many of the existing multiple hypothesis testing procedures can be applied to assist feature screening. In particular, we can use the multiple hypothesis testing approach to determine the size of the selected set D̂ by controlling the false discovery rate.
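As an alternative to the log-normal calibration, the p-value of any marginal dependence index, including ω̂k, can be approximated by permutation; a minimal sketch (the absolute-inner-product score used in the test below is only a stand-in for the HZ statistic, and the function name is ours):

```python
import random

def permutation_pvalue(score, y, x, n_perm=200, seed=0):
    """Permutation p-value for a marginal dependence index.

    `score(y, x)` can be any index for which larger values indicate
    stronger dependence (e.g. the HZ statistic of the paper).
    """
    rng = random.Random(seed)
    observed = score(y, x)
    hits = 0
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the pairing between y and x
        if score(y_perm, x) >= observed:
            hits += 1
    # add-one correction keeps the p-value strictly above zero
    return (hits + 1) / (n_perm + 1)
```

Permutation calibration avoids distributional approximations entirely, at the cost of n_perm evaluations of the statistic per predictor.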
Finally, we note that in practice we often encounter the case where there are strong correlations among the predictors. To handle this problem, Fan and Lv (2008) and Zhu et al. (2011) proposed iterative versions of the SIS and SIRS algorithms, respectively. SIS works on linear regression, so the residual of the response can be used to rule out certain strongly correlated predictors via iterations. SIRS works on multi-index models, where the predictors act through a linear combination, so the residual of the predictors, i.e., the projection of the remaining predictors onto the orthogonal complement of the space spanned by the already selected predictors, can be used for iterations. Similarly, for HZ-SIS, the correlation among predictors can be accounted for by the following procedure:
(1) Apply HZ-SIS to the data and select p(1) predictors, where p(1) < [n/log(n)]. Denote the set of selected predictors by A(1).
(2) Treat each predictor Xk ∉ A(1) as a response variable and conduct a regression analysis on the predictors included in the set A(1) using a regularization method, e.g., Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001) or MCP (Zhang, 2010). Calculate the residual of the regression.
(3) Apply HZ-SIS to the residuals calculated in step (2). Suppose that p(2) residuals are selected, and denote the corresponding set of variables by A(2). Update the set of selected predictors to A(1) ∪ A(2).
(4) Repeat steps (2) and (3) m − 1 times until the total number of selected predictors p(1) + ··· + p(m) exceeds the pre-specified number [n/log(n)]. The finally selected predictor set is A(1) ∪ ··· ∪ A(m).
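The iteration above can be sketched in code as follows. This is a minimal illustration, not the published implementation: a marginal absolute-correlation score stands in for the HZ-test statistic on the transformed pair, and ordinary least squares stands in for the penalized regression of step (2); the function names and the budget value are assumptions for the example.

```python
import numpy as np

def marginal_score(x, y):
    # Stand-in for the HZ-test statistic on the nonparanormally
    # transformed pair (illustration only): absolute Pearson correlation.
    return abs(np.corrcoef(x, y)[0, 1])

def iterative_screen(X, y, step_size, budget):
    """Iterative screening in the spirit of steps (1)-(4): screen, regress
    the remaining predictors on the selected ones, screen the residuals,
    and repeat until the budget (e.g., [n/log(n)]) is reached."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    Z = X.copy()                                   # working (residualized) predictors
    while remaining and len(selected) < budget:
        scores = [marginal_score(Z[:, k], y) for k in remaining]
        top = [remaining[i] for i in np.argsort(scores)[::-1][:step_size]]
        selected.extend(top)
        remaining = [k for k in remaining if k not in top]
        # Step (2): regress each remaining (original) predictor on the
        # selected ones (OLS here; the paper suggests Lasso/SCAD/MCP)
        # and keep the residual for the next screening round.
        A = X[:, selected]
        for k in remaining:
            beta, *_ = np.linalg.lstsq(A, X[:, k], rcond=None)
            Z[:, k] = X[:, k] - A @ beta
    return selected

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(n)    # X1 strongly correlated with X0
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.standard_normal(n)

picked = iterative_screen(X, y, step_size=2, budget=4)
```

In this toy example, X1 is nearly collinear with X0; after X0 and X1 are selected in the first round, residualization lets the weaker true predictor X3 surface in the second round rather than being crowded out.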
It is obvious that if a predictor is strongly correlated with some already selected predictors, it will not be selected by this procedure. This increases the chance of selecting true predictors that are only weakly dependent on the response. In general, we suggest working on the original predictors in step (2). However, for some problems, e.g., when the predictors are bounded, we might choose to work on the nonparanormally transformed predictors.
Supplementary Material
Acknowledgments
Liang’s research was partially supported by grants DMS-1545202, DMS-1612924 and R01-GM117597. The authors thank the editor, associate editor and two referees for their constructive comments, which have led to significant improvement of this paper.
Appendix A Proof of Lemma 1
Define
where
For any ε > 0, we have
For simplicity, in what follows we let and . For the first term, we have
Note that we only deal with because can be calculated in a similar way. First, we calculate
Among the ten terms, and are of a higher order, and the other terms share the same order. Hence, we only consider the probability
Define the event as
Since for the standard Gaussian random variable Z,
| (3) |
we have
Therefore,
For simplicity, henceforth, we let
Set the truncation parameter and split the interval into
and
Therefore,
We now analyze these two terms separately.
From Lemma 12.3 of Abramovich et al. (2006), if we let Φ−1(η) denote the upper ηth percentile of the standard Gaussian distribution, for η ≥ 0.99 we have
Based on this lemma, we can show
for any , provided that n is sufficiently large.
Then we can bound Δij under :
if n is sufficiently large. Therefore,
where the last inequality follows from the fact that
Recall and assume we have
if n is sufficiently large. Further, we have
where the last inequality follows from Hoeffding’s inequality.
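For the reader's convenience, the form of Hoeffding's inequality invoked here is the classical one: for independent random variables Z1, …, Zn with ai ≤ Zi ≤ bi almost surely,

```latex
P\left( \left| \frac{1}{n}\sum_{i=1}^{n} \bigl(Z_i - \mathbb{E} Z_i\bigr) \right| \ge t \right)
\;\le\; 2\exp\left( - \frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right),
\qquad t > 0.
```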
Now we turn to Define the event as
Following from (3), we have
where the last inequality follows from Hoeffding’s inequality. Therefore,
Recall that under , for , we have . So we can rewrite as . By the mean value theorem, we further have
where s is between . From Lemma 12.3 of Abramovich et al. (2006), we know . Also, recall that under , for t ∈ Mn, both are bounded by . Therefore,
Combining them together, we are able to show
Using the Dvoretzky-Kiefer-Wolfowitz inequality, we have
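For the reader's convenience, the Dvoretzky-Kiefer-Wolfowitz inequality (with the tight constant due to Massart) states that for the empirical distribution function F̂n of n i.i.d. observations from a distribution F,

```latex
P\left( \sup_{x \in \mathbb{R}} \bigl| \hat{F}_n(x) - F(x) \bigr| > \varepsilon \right)
\;\le\; 2 e^{-2 n \varepsilon^2},
\qquad \varepsilon > 0.
```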
Now it remains to deal with Recall that is a V-statistic, and the corresponding U-statistic is given by
By noting that when i = j, and < 1 for other cases, it is easy to show
Recall that ε is of a lower order of ; therefore, we can consider instead.
The kernel h(xki; xkj ; yi; yj) of is bounded,
Therefore, it follows from Theorem 5.6.1.A of Serfling (1980) that
where denotes the integer part of.
In summary, by letting ε = cn−κ, we have
with the additional constraint . To optimize the convergence rate, we should let and , then we can obtain
Hence,
which completes the proof.
Contributor Information
Jingnan Xue, Department of Statistics, Texas A&M University, College Station, TX 77843.
Faming Liang, Department of Biostatistics, University of Florida, Gainesville, FL 32611.
References
- Abramovich F, Benjamini Y, Donoho DL, and Johnstone IM (2006), “Adapting to unknown sparsity by controlling the false discovery rate,” The Annals of Statistics, 34, 584–653.
- Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin A, Kim S, Wilson C, Lehar J, Kryukov G, Sonkin D, Reddy A, Liu M, Murray L, Berger M, Monahan J, Morais P, Meltzer J, Korejwa A, Jane-Valbuena J, Mapa F, Thibault J, Bric-Furlong E, Raman P, Shipway A, Engels I, et al. (2012), “The Cancer Cell Line Encyclopedia enables predictive modeling of anticancer drug sensitivity,” Nature, 483, 603–607.
- Buzdar A (2009), “Role of biologic therapy and chemotherapy in hormone receptor- and HER2-positive breast cancer,” Annals of Oncology, 20, 993–999.
- Cui H, Li R, and Zhong W (2015), “Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis,” Journal of the American Statistical Association, 110, 630–641.
- Fan J, Feng Y, and Song R (2011), “Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models,” Journal of the American Statistical Association, 106, 544–557.
- Fan J and Li R (2001), “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, 96, 1348–1360.
- Fan J and Lv J (2008), “Sure independence screening for ultrahigh dimensional feature space,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 849–911.
- Fan J, Samworth R, and Wu Y (2009), “Ultrahigh Dimensional Feature Selection: Beyond The Linear Model,” Journal of Machine Learning Research, 10, 2013–2038.
- Fan J and Song R (2010), “Sure independence screening in generalized linear models with NP-dimensionality,” The Annals of Statistics, 38, 3567–3604.
- Grünwald V and Hidalgo M (2003), “Developing inhibitors of the epidermal growth factor receptor for cancer treatment,” Journal of the National Cancer Institute, 95, 851–867.
- Hadley KE and Hendricks DT (2014), “Use of NQO1 status as a selective biomarker for oesophageal squamous cell carcinomas with greater sensitivity to 17-AAG,” BMC Cancer, 14, 1–8.
- Hall P and Miller H (2009), “Using generalized correlation to effect variable selection in very high dimensional problems,” Journal of Computational and Graphical Statistics, 18, 533–550.
- He X, Wang L, and Hong HG (2013), “Correction: Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data,” The Annals of Statistics, 41, 2699.
- Henze N and Zirkler B (1990), “A class of invariant consistent tests for multivariate normality,” Communications in Statistics - Theory and Methods, 19, 3595–3617.
- Huang J, Horowitz JL, and Ma S (2008), “Asymptotic properties of bridge estimators in sparse high-dimensional regression models,” The Annals of Statistics, 36, 587–613.
- Jiang B and Liu J (2014), “Sliced Inverse Regression with Variable Selection and Interaction Detection,” arXiv e-prints.
- Li R, Zhong W, and Zhu L (2012), “Feature Screening via Distance Correlation Learning,” Journal of the American Statistical Association, 107, 1129–1139.
- Liang F (2008), “Clustering gene expression profiles using mixture model ensemble averaging approach,” JP Journal of Biostatistics, 2, 57–80.
- Liu H, Lafferty J, and Wasserman L (2009), “The nonparanormal: Semiparametric estimation of high dimensional undirected graphs,” Journal of Machine Learning Research, 10, 2295–2328.
- Mardia K (1970), “Measures of multivariate skewness and kurtosis with applications,” Biometrika, 57, 519–530.
- Serfling R (1980), Approximation Theorems of Mathematical Statistics, Wiley Series in Probability and Statistics, Wiley.
- Székely G and Rizzo M (2005), “A new test for multivariate normality,” Journal of Multivariate Analysis, 93, 58–80.
- Székely G, Rizzo M, and Bakirov N (2007), “Measuring and testing dependence by correlation of distances,” The Annals of Statistics, 35, 2769–2794.
- Tibshirani R (1996), “Regression shrinkage and selection via the LASSO,” Journal of the Royal Statistical Society, Series B, 58, 267–288.
- Wang K, Shrestha R, Wyatt AW, Reddy A, Lehr J, Wang Y, Lapuk A, and Collins CC (2014), “A Meta-Analysis Approach for Characterizing Pan-Cancer Mechanisms of Drug Sensitivity in Cell Lines,” PLoS One, 9, 1–16.
- Zhang C-H (2010), “Nearly unbiased variable selection under minimax concave penalty,” The Annals of Statistics, 38, 894–942.
- Zhu L-P, Li L, Li R, and Zhu L-X (2011), “Model-Free Feature Screening for Ultrahigh-Dimensional Data,” Journal of the American Statistical Association, 106, 1464–1475.
- Zoppoli G, Regairaz M, Leo E, Reinhold W, Varma S, Ballestrero A, Doroshow J, and Pommier Y (2012), “Putative DNA/RNA helicase Schlafen-11 (SLFN11) sensitizes cancer cells to DNA-damaging agents,” Proceedings of the National Academy of Sciences USA, 109, 15030–15035.