Published in final edited form as: J Am Stat Assoc. 2014 May 13;110(510):630–641. doi: 10.1080/01621459.2014.920256

Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis

Hengjian Cui, Runze Li, Wei Zhong
PMCID: PMC4574103  NIHMSID: NIHMS590994  PMID: 26392643

Abstract

This work is concerned with marginal sure independence feature screening for ultrahigh dimensional discriminant analysis. In discriminant analysis the response variable is categorical, which enables us to use the conditional distribution function to construct a new index for feature screening. In this paper, we propose a marginal feature screening procedure based on the empirical conditional distribution function. We establish the sure screening and ranking consistency properties for the proposed procedure without assuming any moment condition on the predictors. The proposed procedure enjoys several appealing merits. First, it is model-free in that its implementation does not require specification of a regression model. Second, it is robust to heavy-tailed distributions of predictors and the presence of potential outliers. Third, it allows the categorical response to have a diverging number of classes of order O(n^κ) with some κ ≥ 0. We assess the finite sample properties of the proposed procedure by Monte Carlo simulation studies and numerical comparisons, and further illustrate the proposed methodology by empirical analyses of two real-life data sets.

Keywords: Feature screening, consistency in ranking, sure screening property, ultrahigh dimensional data analysis

1. INTRODUCTION

Variable selection plays an important role in high dimensional data analysis, and marginal feature screening has become indispensable for ultrahigh dimensional data, receiving much attention in the recent literature. Various feature screening procedures have been proposed for linear models, generalized linear models and robust linear models (Fan and Lv, 2008; Wang, 2009; Fan, Samworth and Wu, 2009; Li et al., 2012), and these authors demonstrated that their procedures enjoy the sure screening property in the terminology of Fan and Lv (2008). Feature screening procedures have also been developed for nonparametric regression models. Fan, Feng and Song (2011) proposed a nonparametric marginal screening procedure for additive models based on B-spline expansions. Fan, Ma and Dai (2014) extended the nonparametric B-spline method to varying coefficient models and proposed a marginal sure screening procedure. Liu, Li and Wu (2014) proposed a local kernel-based marginal sure screening procedure for varying coefficient models and established its sure screening property. The aforementioned model-based screening procedures perform well when the underlying models are correctly specified, but their performance may be quite poor in the presence of model mis-specification, and specifying a correct model for ultrahigh dimensional data can be challenging. Thus, model-free sure screening procedures are appealing and have been developed by several authors (Zhu et al., 2011; Li, Zhong and Zhu, 2012; He, Wang and Hong, 2013). Li, Zhong and Zhu (2012) developed a model-free sure independence screening procedure based on the distance correlation; its sure screening property requires subexponential tail probability conditions on the predictors and response, and it is not robust to very heavy-tailed data with extreme values. Mai and Zou (2013) developed a sure feature screening procedure for ultrahigh dimensional predictors based on the Kolmogorov distance, but it was studied only for binary classification problems. Pan, Wang and Li (2013) proposed a pairwise sure screening procedure for linear discriminant analysis with a diverging number of classes and ultrahigh dimensional predictors; however, it is based on mean differences and may not perform well for heavy-tailed data. This work aims to develop an effective model-free and robust feature screening procedure for ultrahigh dimensional discriminant analysis with a possibly diverging number of classes.

In this paper, we propose an effective sure screening procedure for discriminant analysis. We study its theoretical properties and establish the sure screening and ranking consistency properties, without assuming any moment condition on the predictors, under ultrahigh dimensional discriminant analysis settings with a diverging number of response classes. Our numerical studies show that the proposed procedure has excellent performance. It enjoys several appealing properties: it is model-free, since its implementation does not require specification of a regression model, and its marginal utility can be evaluated easily without numerical optimization.

Due to its nature, the proposed procedure can be directly applied to a continuous response with categorical predictors. This is particularly useful in genome-wide association studies (GWAS), in which the phenotypes (i.e., the responses) are continuous while the single-nucleotide polymorphisms (SNPs) serving as predictors are categorical. Thus, it is also of interest to develop an effective feature screening procedure for settings in which the response is continuous while the predictors of interest are categorical. In this paper, we further extend our procedure to such settings. Some further extensions are discussed in Section 4.

The rest of this paper is organized as follows. In Section 2, we propose a new marginal utility for feature screening and further study its theoretical properties. In Section 3, we conduct Monte Carlo simulation studies to examine the finite sample performance of the proposed procedure. We further illustrate the proposed methodology by empirical analyses of real data examples. Section 4 presents some extensions of the proposed methodology. Technical proofs are given in the Appendix.

2. A NEW FEATURE SCREENING PROCEDURE

2.1. A New Index based on Conditional Distribution Function

Let Y be a categorical response with R classes {y1, y2, …, yR}, and let X be a continuous covariate with support ℝX. To investigate the dependence between X and Y, we naturally consider the conditional distribution function of X given Y, denoted by F(x|Y) = ℙ(X ≤ x|Y). Denote by F(x) = ℙ(X ≤ x) the unconditional distribution function of X and by Fr(x) = ℙ(X ≤ x|Y = yr) the conditional distribution function of X given Y = yr. If Fr(x) = F(x) for any x ∈ ℝX and r = 1, 2, …, R, then X and Y are independent. This motivates us to consider the following index

MV(X|Y) = EX[VarY(F(X|Y))]  (2.1)

to measure the dependence between X and Y. The following proposition provides the key properties of MV(X|Y).

Proposition 2.1

Let Y be a categorical random variable with R classes {y1, y2, …, yR} and pr = ℙ(Y = yr) > 0 for all r = 1, …, R. Let X be a continuous random variable with support ℝX. Denote F(x) = ℙ(X ≤ x) and Fr(x) = ℙ(X ≤ x|Y = yr). Then

  1. MV(X|Y) = ∑_{r=1}^R pr ∫ [Fr(x) − F(x)]² dF(x).

  2. MV (X|Y) = 0 if and only if X and Y are statistically independent.

The proof of this proposition is given in the Appendix. The result in part (1) implies that MV(X|Y) can be represented as a weighted average of Cramér–von Mises distances between the conditional distribution function of X given Y = yr and the unconditional distribution function of X. The second, remarkable property motivates us to utilize MV(X|Y) as a marginal utility for feature screening that can characterize both linear and nonlinear relationships in ultrahigh dimensional discriminant analysis.

Let {(Xi, Yi) : 1 ≤ i ≤ n} be a random sample of size n from the population (X, Y). Define p̂r = (1/n) ∑_{i=1}^n I{Yi = yr} with I{·} being the indicator function, F̂(x) = (1/n) ∑_{i=1}^n I{Xi ≤ x}, and F̂r(x) = (1/n) ∑_{i=1}^n I{Xi ≤ x, Yi = yr}/p̂r. It is natural to estimate MV(X|Y) by its sample counterpart:

MV̂(X|Y) = (1/n) ∑_{r=1}^R ∑_{j=1}^n p̂r [F̂r(Xj) − F̂(Xj)]².  (2.2)
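For concreteness, the estimator in (2.2) involves only empirical distribution functions and class proportions. The following minimal R sketch computes it (the function name mv_hat and its interface are our own, not from the paper); x is a numeric predictor vector and y a categorical response of the same length.

    # Sample estimator of MV(X|Y) in (2.2); x numeric, y categorical.
    mv_hat <- function(x, y) {
      Fhat <- ecdf(x)(x)                  # unconditional ECDF at each X_j
      mv <- 0
      for (yr in unique(y)) {
        idx <- (y == yr)
        p_r <- mean(idx)                  # class proportion p_r-hat
        Fr  <- ecdf(x[idx])(x)            # ECDF of X given Y = y_r, at each X_j
        mv  <- mv + p_r * mean((Fr - Fhat)^2)
      }
      mv
    }

Note that ecdf(x[idx])(x) equals (1/(n p̂r)) ∑_i I{Xi ≤ x, Yi = yr} evaluated at each sample point, so the loop reproduces (2.2) exactly.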

To gain insight into MV(X|Y), let us consider a simple example. Let X be a univariate standard normal random variable and generate random variables Zk, k = 1, 2, by Z1 = cX + ε and Z2 = cX² + ε, where ε ~ N(0, 1) and c is a constant controlling the signal-to-noise ratio. Then, we discretize each Zk into a categorical variable Yk with four equally likely classes. That is, Yk = I(Zk ≤ qk1) + 2I(qk1 < Zk ≤ qk2) + 3I(qk2 < Zk ≤ qk3) + 4I(Zk > qk3), k = 1, 2, where {qk1, qk2, qk3} are the first, second and third quartiles of Zk, respectively. Thus, the response Y1 depends on X through the linear term cX, while Y2 depends on X through the quadratic term cX². We set the sample size n = 200 and c = 0, 0.5, 1 and 2; note that Yk and X are independent for each k = 1, 2 when c = 0. We then compute the variance of the conditional distribution function of X given Yk, i.e., VarYk[F(x|Yk)], for x ∈ [−2, 2] and each c. Panels (a) and (c) in Figure 1 are boxplots of VarYk[F(x|Yk)] against different c values for k = 1, 2, respectively, where the star indicates MV̂(X|Yk). Panels (b) and (d) in Figure 1 show how VarYk[F(x|Yk)], k = 1, 2, varies across x ∈ [−2, 2] for different c values. As the signal-to-noise ratio increases, MV̂(X|Yk) increases: when c = 0, i.e., X and Yk are independent, the MV̂(X|Yk) values are close to zero, whereas when c > 0, they are clearly bounded away from zero. Consequently, MV(X|Y) should be an effective measure of the strength of both linear and nonlinear dependence between a continuous covariate and a categorical response.

Figure 1. (a) Boxplot of VarY1[F(x|Y1)] against c with the star indicating the mean; (b) Plot of VarY1[F(x|Y1)] against x for different c values; (c) Boxplot of VarY2[F(x|Y2)] against c with the star indicating the mean; (d) Plot of VarY2[F(x|Y2)] against x for different c values.
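The data-generating step of this illustration is easy to reproduce. The following R sketch (the seed and output format are our own choices) reuses the hypothetical mv_hat above, discretizing each Zk at its sample quartiles and printing the estimated index for each signal level.

    # Illustrative example: MV-hat under linear (Y1) and quadratic (Y2) dependence.
    set.seed(1)
    n <- 200
    x <- rnorm(n)
    for (cval in c(0, 0.5, 1, 2)) {
      z1 <- cval * x   + rnorm(n)         # linear signal
      z2 <- cval * x^2 + rnorm(n)         # quadratic signal
      y1 <- cut(z1, c(-Inf, quantile(z1, c(.25, .5, .75)), Inf))
      y2 <- cut(z2, c(-Inf, quantile(z2, c(.25, .5, .75)), Inf))
      cat(sprintf("c = %.1f: MV(X|Y1) = %.4f, MV(X|Y2) = %.4f\n",
                  cval, mv_hat(x, y1), mv_hat(x, y2)))
    }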

2.2. Sure Independence Screening Using MV(X|Y)

We now propose a new model-free sure independence screening procedure using MV(X|Y) for ultrahigh dimensional discriminant analysis. Let Y be the response with discrete support {y1, y2, …, yR}, R ≥ 2, and let x = (X1, …, Xp)ᵀ be the predictor vector, where p ≫ n and n is the sample size. Without specifying a regression model, define the active predictor subset by

𝒟 = {k : F(y|x) functionally depends on Xk for some y = yr},

and denote by ℐ = {1, 2, …, p} \ 𝒟 the inactive predictor subset.

The goal is to select a reduced model of moderate scale which can almost fully contain 𝒟, using an independence screening method for ultrahigh dimensional discriminant analysis. To this end, we apply the MV index to each pair (Xk, Y):

ωk = MV(Xk|Y)

as a marginal utility to measure the importance of Xk for the response, where k = 1, 2, …, p. Note that ωk = 0 if and only if Xk and Y are statistically independent. As motivation, observe that if the partial orthogonality condition (Huang, Horowitz and Ma, 2008; Fan and Song, 2010) holds, i.e., {Xk : k ∈ 𝒟} are statistically independent of {Xk : k ∈ ℐ}, then ωk is a naturally effective measure to separate the active and inactive predictor subsets, because ωk > 0 for k ∈ 𝒟 and ωk = 0 for k ∈ ℐ. This also shows that MV-based variable screening is model-free: it is defined through conditional and unconditional distribution functions and is able to characterize both linear and nonlinear relationships between the response and predictors.

For a random sample {(xi, Yi) : 1 ≤ i ≤ n}, we can easily estimate ωk by ω̂k = MV̂(Xk|Y) according to equation (2.2). We then propose to utilize ω̂k to choose a submodel

𝒟̂ = {k : ω̂k ≥ cn^{−τ}, for 1 ≤ k ≤ p},

where c and τ are pre-determined thresholding values defined in Condition (C2) below. In practice, for a given size d < n, one can select a reduced model:

𝒟̂ = {k : ω̂k is among the top d largest of all}.

We refer to this procedure as MV-based sure independence screening, MV-SIS for short.
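In code, MV-SIS reduces to ranking the p marginal utilities and keeping the top d. A minimal R sketch follows (the function name mv_sis is our own), reusing the hypothetical mv_hat from Section 2.1; the default d = [n/log n] mirrors the size used in the simulations of Section 3.

    # MV-SIS: rank predictors by omega-hat_k = MV-hat(X_k|Y), keep the top d.
    mv_sis <- function(X, y, d = floor(nrow(X) / log(nrow(X)))) {
      omega <- apply(X, 2, mv_hat, y = y)          # marginal utilities, all p columns
      order(omega, decreasing = TRUE)[seq_len(d)]  # indices of selected predictors
    }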

Next, we study the theoretical properties of the proposed MV-SIS. Fan and Lv (2008) and Ji and Jin (2012) demonstrated that a two-stage procedure combining independence screening and penalized estimation can outperform a one-step penalized estimation approach, such as the LASSO. The effectiveness of the two-stage procedure is guaranteed by the sure screening property, namely that all active predictors are included in the reduced model with high probability. Thus, we first establish the sure screening property for MV-SIS under the following conditions.

  • (C1)

    There exist two positive constants c1 and c2 such that c1/Rn ≤ min_{1≤r≤Rn} pr ≤ max_{1≤r≤Rn} pr ≤ c2/Rn. Assume that Rn = O(n^κ) for κ ≥ 0.

  • (C2)

    There exist positive constants c > 0 and 0 ≤ τ < 1/2 such that min_{k∈𝒟} ωk ≥ 2cn^{−τ}.

Condition (C1) requires that the proportion of each class of the response be neither too small nor too large. The assumption Rn = O(n^κ) in Condition (C1) allows a diverging number of response classes, where the subscript n in Rn emphasizes that Rn is allowed to diverge with the sample size n. Condition (C2) assumes that the minimum true signal cannot be too small; it is of order n^{−τ}, which allows the minimum true signal to vanish as the sample size n approaches infinity. Such an assumption is typical in the feature screening literature (e.g., Condition 3 in Fan and Lv (2008), Condition (C3) in Wang (2009), and Condition (C2) in both Li, Zhong and Zhu (2012) and He, Wang and Hong (2013)). The following theorem presents the sure screening property of MV-SIS; its proof is provided in the Appendix.

Theorem 2.1

[Sure Screening Property] Under Condition (C1) and for any 0 ≤ κ < 1 − 2τ, there exists a positive constant b depending on c, c1 and c2, such that

ℙ(max_{1≤k≤p} |ω̂k − ωk| > cn^{−τ}) ≤ O(p exp{−b n^{1−(2τ+κ)} + (1+κ) log n}).  (2.3)

Under Conditions (C1) and (C2), we have that

ℙ(𝒟 ⊆ 𝒟̂) ≥ 1 − O(sn exp{−b n^{1−(2τ+κ)} + (1+κ) log n}),  (2.4)

where sn is the cardinality of 𝒟.

The sure screening property holds for MV-SIS under milder conditions than those for SIS (Fan and Lv, 2008) and DC-SIS (Li, Zhong and Zhu, 2012), in that we neither require the regression function of Y on x to be linear nor impose moment assumptions on the predictors. It is worth noting that MV-SIS is robust to heavy-tailed distributions of predictors and the presence of potential outliers, because MV(Xk|Y) inherits the robustness of the conditional distribution function. Furthermore, the sure screening property also holds for a categorical response with a diverging number of classes. Thus, MV-SIS provides a unified alternative to existing model-based sure screening procedures for ultrahigh dimensional discriminant analysis.

According to Theorem 2.1, MV-SIS can handle NP-dimensionality log p = O(n^α), where α < 1 − 2τ − κ with 0 ≤ τ < 1/2 and 0 ≤ κ < 1 − 2τ; this rate depends on the minimum true signal strength and the number of response classes. If Rn is fixed, i.e., κ = 0, then the result of Theorem 2.1 improves and its first part can be rewritten as

ℙ{max_{1≤k≤p} |ω̂k − ωk| > cn^{−τ}} ≤ O(p exp{−b n^{1−2τ} + log n}),

for some constant b > 0. In this case, we can handle an even larger NP-dimensionality log p = O(n^α), where α < 1 − 2τ with 0 ≤ τ < 1/2.

Remark

Condition (C1) can be relaxed by allowing c1 to tend to zero at a certain rate. To be specific, assume that c1 = O(n^{−η}) with 0 < η < 1 − (2τ + κ). Under this relaxed condition, the sure screening property remains essentially the same as before, but the convergence rate becomes relatively slower. That is,

ℙ(max_{1≤k≤p} |ω̂k − ωk| > cn^{−τ}) ≤ O(p exp{−b n^{1−(2τ+κ+η)} + (1+κ) log n}).

Then, a smaller NP-dimensionality log p = O(n^α) with α < 1 − 2τ − κ − η is allowed. For the proof, refer to Appendix A in the Supplement.

Another interesting property of independence screening is the ranking consistency property, in the terminology of Zhu et al. (2011). To investigate the ranking consistency property of MV-SIS, we additionally assume the following condition.

  • (C3)

    lim inf_{p→∞} {min_{k∈𝒟} ωk − max_{k∈ℐ} ωk} ≥ c3, where c3 > 0 is a constant.

It is easily shown that under the partial orthogonality condition (Huang, Horowitz and Ma, 2008), under which ωk > 0 for k ∈ 𝒟 and ωk = 0 for k ∈ ℐ, Condition (C3) naturally holds. Thus, Condition (C3) is a relatively weaker assumption than the partial orthogonality condition. It requires that the MV index be able to separate active and inactive predictors well at the population level. The following theorem establishes the ranking consistency property of MV-SIS.

Theorem 2.2

[Ranking Consistency Property] If Conditions (C1) and (C3) hold with Rn log(n)/n = o(1) and Rn log(p)/n = o(1), then lim inf_{n→∞} {min_{k∈𝒟} ω̂k − max_{k∈ℐ} ω̂k} > 0, almost surely.

Although it requires a more restrictive condition on the gap between active and inactive signals, Theorem 2.2 gives a stronger result than the sure screening property: the sample MV(Xk|Y) values of the active predictors are ranked above those of the inactive ones with high probability. Thus, with an ideal thresholding value, one can separate the active and inactive predictors.

3. NUMERICAL STUDIES

In this section, we first assess the finite sample performance of the proposed MV-SIS by Monte Carlo simulation studies. Then, we conduct empirical analyses of two real data examples to illustrate the proposed MV-SIS procedure. Some additional numerical results are given in the Supplement.

3.1. Monte Carlo Simulations

We use the minimum model size (MMS) required to include all active predictors to measure the effectiveness of each screening approach. In addition, the proportion of replications in which a single active predictor Xj is included, denoted by Pjs, and the proportion in which all active predictors are included, denoted by Pa, are computed for a given model size d = [n/log n], where n is the sample size and [x] denotes the integer part of x. All numerical studies are conducted using R code.

Example 1

(Ultrahigh Dimensional Linear Discriminant Analysis) In this example, we consider a linear discriminant analysis problem with ultrahigh dimensional predictors, following settings similar to Pan, Wang and Li (2013). For the ith observation, the categorical response Yi is generated from one of two distributions: (i) balanced, a discrete uniform distribution with R categories, where ℙ(Yi = r) = 1/R for r = 1, …, R; (ii) unbalanced, where the probabilities pr = ℙ(Yi = r) = 2[1 + (r − 1)/(R − 1)]/(3R) form an arithmetic progression with max_{1≤r≤R} pr = 2 min_{1≤r≤R} pr. For instance, when Y is binary, p1 = 1/3 and p2 = 2/3. Given Yi = r, the predictor vector xi is generated as xi = μr + εi, where the mean term μr = (μr1, …, μrp) ∈ ℝ^p is a p-dimensional vector whose rth component is μrr = 3 and whose other components are all zero, and εi = (εi1, …, εip) is a p-dimensional error term. We consider two cases for the error term: (1) εij ~ N(0, 1); (2) εij ~ t(2), independently for each j = 1, …, p. Note that Case (2) makes each predictor heavy-tailed, which is designed to examine the robustness of an independence screening method. To systematically examine MV-SIS and its competitors, we consider 2000 predictors with a binary response and n = 40, and with a 10-categorical response and n = 200, for each case; that is, (R, n, p) = (2, 40, 2000) and (10, 200, 2000).
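For reference, the balanced data-generating process of this example can be sketched in R as follows (the function name gen_lda_data and its arguments are our own; the unbalanced case would replace the uniform draw with the class probabilities pr).

    # Example 1 generator (balanced case): given Y = r, x = mu_r + eps with
    # mu_r having 3 in coordinate r and zeros elsewhere.
    gen_lda_data <- function(n, p, R, heavy_tailed = FALSE) {
      y <- sample.int(R, n, replace = TRUE)   # balanced classes
      X <- if (heavy_tailed) matrix(rt(n * p, df = 2), n, p)
           else matrix(rnorm(n * p), n, p)
      X[cbind(seq_len(n), y)] <- X[cbind(seq_len(n), y)] + 3  # add the mean shift
      list(X = X, y = y)
    }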

First, we compare the performance of MV-SIS with SIS (Fan and Lv, 2008), SIRS (Zhu et al., 2011), DC-SIS (Li, Zhong and Zhu, 2012), the Kolmogorov filter (KF; Mai and Zou, 2013) and PSIS (Pan, Wang and Li, 2013) for the binary response, where X1 and X2 are the active predictors. Table 1 summarizes, for each method, the median MMS with its associated robust estimate of the standard deviation (RSD = IQR/1.34) in parentheses, together with Pjs for j = 1, 2 and Pa for the given model size d = [n/log n], based on 500 simulations.

Table 1.

Simulation Results for Linear Discriminant Analysis with Binary Response

                        Case (1): εij ~ N(0, 1)            Case (2): εij ~ t(2)
pr          Method      MMS        P1s    P2s    Pa        MMS          P1s    P2s    Pa
Balanced    SIS         2.0(0.0)   1.00   1.00   1.00      2.5(9.1)     0.79   0.88   0.71
            SIRS        2.0(0.0)   1.00   1.00   1.00      8.0(20.5)    0.71   0.76   0.55
            DC-SIS      2.0(0.0)   1.00   1.00   1.00      2.0(0.0)     0.99   0.98   0.97
            KF          2.0(0.0)   1.00   1.00   1.00      2.0(0.0)     0.99   0.99   0.98
            PSIS        2.0(0.0)   1.00   1.00   1.00      2.5(9.1)     0.79   0.88   0.71
            MV-SIS      2.0(0.0)   1.00   1.00   1.00      2.0(0.0)     1.00   0.99   0.99
Unbalanced  SIS         2.0(0.0)   1.00   1.00   1.00      5.5(48.8)    0.75   0.75   0.55
            SIRS        2.0(0.0)   1.00   0.99   0.99      17.0(123.3)  0.67   0.64   0.44
            DC-SIS      2.0(0.0)   1.00   1.00   1.00      2.0(1.1)     0.95   0.96   0.92
            KF          2.0(0.0)   1.00   1.00   1.00      2.0(0.7)     0.96   0.99   0.95
            PSIS        2.0(0.0)   1.00   1.00   1.00      5.5(48.8)    0.75   0.75   0.55
            MV-SIS      2.0(0.0)   1.00   1.00   1.00      2.0(0.7)     0.96   0.99   0.95

Next, we consider the response with 10 categories, where X1, X2, …, X10 are active. Note that the value of the response Y is a nominal label, which makes SIS, SIRS and the Kolmogorov filter inapplicable, whereas MV-SIS is designed for variable screening with a multi-category response. To make DC-SIS applicable to this problem, we transform the 10-categorical response into 9 dummy binary variables, which together are treated as a new multivariate response; note that Li, Zhong and Zhu (2012) claimed that DC-SIS can be applied to a multivariate response. Pan, Wang and Li (2013) proposed pairwise sure independence screening (PSIS) to deal with a categorical response: PSIS utilizes |μ̂r1j − μ̂r2j| as the marginal signal of predictor Xj for each pair of classes (r1, r2) at a time, where μ̂rj denotes the sample average of Xij over {i : Yi = r}. Here, we take max_{r1≠r2} |μ̂r1j − μ̂r2j| over r1, r2 = 1, 2, …, 10 as the marginal signal of predictor Xj, and denote this variant by PSIS*. Table 2 summarizes the median MMS with its associated robust standard deviation in parentheses, Pjs for j = 1, 2, …, 10, and Pa for the given model size d = [n/log n], based on 500 simulations.

Table 2.

Simulation Results for Linear Discriminant Analysis with 10-Categorical Response

Method    MMS             P1s   P2s   P3s   P4s   P5s   P6s   P7s   P8s   P9s   P10s  Pa

(i) Balanced Probabilities and Case (1): εij ~ N(0, 1)
DC-SIS    10.0(0.0)       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.99  0.99
PSIS*     10.0(0.0)       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
MV-SIS    10.0(0.0)       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

(i) Balanced Probabilities and Case (2): εij ~ t(2)
DC-SIS    15.0(21.8)      0.86  0.99  0.99  0.99  0.97  0.98  0.99  0.99  0.99  0.98  0.74
PSIS*     362.5(563.6)    0.73  0.75  0.76  0.73  0.75  0.75  0.75  0.73  0.76  0.79  0.05
MV-SIS    11.0(3.7)       1.00  1.00  1.00  0.99  0.99  1.00  1.00  0.99  0.99  0.99  0.95

(ii) Unbalanced Probabilities and Case (1): εij ~ N(0, 1)
DC-SIS    13.0(14.9)      0.82  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.82
PSIS*     10.0(0.0)       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
MV-SIS    10.0(0.0)       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

(ii) Unbalanced Probabilities and Case (2): εij ~ t(2)
DC-SIS    126.5(248.3)    0.35  0.90  0.93  0.93  0.96  1.00  0.99  1.00  1.00  1.00  0.22
PSIS*     343.5(444.9)    0.68  0.66  0.56  0.58  0.64  0.63  0.60  0.73  0.61  0.67  0.05
MV-SIS    13.0(9.8)       0.93  0.98  0.98  0.98  0.98  1.00  1.00  1.00  1.00  1.00  0.85

In addition, we compare the post-screening estimation and prediction performance of PSIS and MV-SIS for the binary response generated from a discrete uniform distribution. Here, we generate p = 2000 predictors and a binary response with sample sizes n = 40 and n = 80, and replicate each simulation experiment 500 times. For the rth replication, we follow Pan, Wang and Li (2013) and choose the model size using the BIC criterion, which exploits the equivalence between the LDA problem and a least squares problem established in Mai, Zou and Yuan (2012). We then define the model size (MS), the percentage of correct zeros (CZ), the percentage of incorrect zeros (IZ), the coverage probability (CP), and the root sum of squared errors (RSSE) as follows:

MSr = |𝒟̂r|,  CZr = (p − |𝒟 ∪ 𝒟̂r|)/(p − sn),  IZr = |𝒟̂rᶜ ∩ 𝒟|/|𝒟|,  CPr = I(𝒟 ⊆ 𝒟̂r),  RSSEr = ‖γ̂r − γ0‖,

where 𝒟̂r is the selected model in the rth replication, sn = 2 is the cardinality of 𝒟, γ0 = μ1 − μ2 = (3, −3, 0, …, 0) is the true difference between the two class means, and γ̂r = μ̂1r − μ̂2r is the post-screening estimator of γ0 in the rth replication based on the selected model. Furthermore, to assess prediction performance, an independent testing data set of the same sample size is generated in each replication. The classification accuracy (CA) of the post-screening estimator is computed in each replication; the classification accuracy based on the true means, denoted by CA0, is also computed, and the ratio RCA = CA/CA0 is evaluated for comparison. We report the median MS with its robust standard deviation in parentheses, and the averages of the other performance measures over all 500 replications, in Table 3.
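These performance measures translate directly into code. A hedged R sketch follows (the function name and argument conventions are our own; Dhat and D are integer index sets of the selected and true models).

    # Post-screening performance measures for one replication.
    eval_measures <- function(Dhat, D, p, gamma_hat, gamma0) {
      sn <- length(D)
      c(MS   = length(Dhat),
        CZ   = (p - length(union(D, Dhat))) / (p - sn),              # correct zeros
        IZ   = length(intersect(setdiff(seq_len(p), Dhat), D)) / sn, # incorrect zeros
        CP   = as.numeric(all(D %in% Dhat)),                         # coverage
        RSSE = sqrt(sum((gamma_hat - gamma0)^2)))
    }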

Table 3.

Simulation Results for Estimation and Prediction Performance in Linear Discriminant Analysis with Binary Response with 500 Simulations.

n Method MS(RSD) CZ(%) IZ(%) CP(%) RSSE CA(%) CA0(%) RCA
Case (1): εij ~ N (0, 1)

40 PSIS 3.0(2.9) 99.89 0.00 100.00 1.31 95.20 98.41 96.76
MV-SIS 3.0(2.2) 99.91 0.00 100.00 1.16 95.34 98.41 96.90

80 PSIS 2.0(1.5) 99.94 0.00 100.00 0.70 97.31 98.31 98.98
MV-SIS 2.0(0.8) 99.95 0.00 100.00 0.62 97.47 98.31 99.15

Case (2): εij ~ t(2)

40 PSIS 6.0(2.9) 99.76 19.50 65.00 3.65 73.42 89.91 81.81
MV-SIS 5.0(3.1) 99.83 3.00 94.00 2.74 78.92 89.91 87.87

80 PSIS 7.0(4.4) 99.71 7.00 86.40 2.56 79.17 89.95 88.04
MV-SIS 3.0(2.9) 99.87 0.00 100.00 1.56 84.80 89.95 94.30

Tables 1 and 2 indicate that the proposed MV-SIS is superior to its competitors for variable screening in linear discriminant analysis. When the error term is heavy-tailed and the number of response categories increases, MV-SIS attains much smaller minimum model sizes (MMS) and markedly higher probabilities of including all active predictors in the selected model than the other independence screening methods. The robustness of MV-SIS is thus an important feature that makes it more useful in practice. The same pattern can be observed in Table 3: MV-SIS has estimation and prediction performance very close to that of PSIS when the error term is normal, but when the error deviates from normality, PSIS deteriorates while MV-SIS still performs reasonably well.

3.2. Real Data Examples

Example 2

Lung cancer data were previously analyzed for classification between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung in Gordon et al. (2002) and Fan and Fan (2008). There are 12533 genes and 181 tissue samples from two classes: 31 in class MPM and 150 in class ADCA. The training dataset contains 32 of them (16 MPM and 16 ADCA), while the remaining 149 samples (15 MPM and 134 ADCA) are used for testing.

Before classification, we first standardize the data to zero mean and unit variance. Fan and Fan (2008) showed that their features annealed independence rules (FAIR) selected 31 important genes and made no training errors and 7 testing errors, while the nearest shrunken centroids (NSC) method proposed by Tibshirani et al. (2002) chose 26 genes and resulted in no training errors and 11 testing errors. We then consider DC-SIS, PSIS and our MV-SIS approach (denoted by MV-SIS1), each followed by LDA, for this ultrahigh dimensional classification problem. Note that FAIR uses diagonal linear discriminant analysis after t-test screening; to make a fair comparison, we also add a procedure combining t-test screening with LDA, denoted by FAIR*. Furthermore, the penalized LDA method (denoted by PenLDA) proposed by Witten and Tibshirani (2011) and the sparse discriminant analysis (denoted by SDA) of Clemmensen et al. (2011) are implemented in this example for comparison. In addition, we combine our MV-SIS with SDA and consider this two-stage method as another potential approach, denoted by MV-SIS2. As in Example 1, the BIC criterion is applied to determine the model size for all competing methods in this binary classification problem. We summarize the classification results in Table 4. MV-SIS followed by LDA (i.e., MV-SIS1) makes 0 training errors and 5 testing errors using only the top 5 genes, and MV-SIS with SDA (i.e., MV-SIS2) performs even better than MV-SIS1 and SDA, achieving the smallest number of testing errors using only 7 genes. Thus, the two-stage approaches combining MV-SIS with LDA or SDA are superior to the other competitors in terms of classification errors and selected model size for these ultrahigh dimensional lung cancer data.
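For clarity, the two-stage MV-SIS1 pipeline (screening followed by LDA) can be sketched as below, assuming standardized training data Xtr, ytr and testing data Xte, and reusing the hypothetical mv_sis from Section 2.2; the model size is fixed at 5 here for illustration rather than chosen by BIC.

    # Two-stage pipeline: MV-SIS screening, then LDA on the selected genes.
    library(MASS)
    sel  <- mv_sis(Xtr, ytr, d = 5)                       # top 5 genes by MV-hat
    fit  <- lda(Xtr[, sel, drop = FALSE], grouping = ytr) # LDA on reduced model
    pred <- predict(fit, Xte[, sel, drop = FALSE])$class  # testing-set labels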

Table 4.

Classification Errors for Lung Cancer Data in Example 2

Method Training Error Testing Error No. of Selected Genes
NSC 0/32 11/149 26
FAIR 0/32 7/149 31
FAIR* 0/32 7/149 14
PenLDA 0/32 9/149 8
SDA 0/32 6/149 17
PSIS 1/32 34/149 4
DC-SIS 0/32 6/149 7
MV-SIS1 0/32 5/149 5
MV-SIS2 0/32 3/149 7

To further evaluate prediction performance, we randomly partition all 181 tissue samples into two parts: a training set of 100 samples and a testing set of the remaining 81 samples. The above procedures are applied to the training data, and their performance is evaluated by the classification errors on both the training and testing sets. For a fair comparison, we choose the best model sizes for all methods using the same BIC criterion. We repeat the experiment 100 times, summarize the means with associated standard deviations (in parentheses) of the training and testing classification errors and the numbers of selected genes in Table 5, and display their distributions in Figure 2. MV-SIS with LDA (i.e., MV-SIS1) performs reasonably well, with small training and testing errors using around 12 genes on average. Among all the methods, SDA classifies the training samples perfectly and achieves a small testing error rate; however, SDA tends to select a considerably larger number of genes and thus may lose some model interpretability. It is worth noting that MV-SIS with SDA (i.e., MV-SIS2) achieves the smallest testing error rate with a much smaller number of genes. This further demonstrates the merit of the two-stage approach combining MV-SIS with SDA.

Table 5.

Performance Evaluation for Lung Cancer Data in Example 2

Method      Training Error(%)   Testing Error(%)   No. of Selected Genes
NSC         0.87(0.90)          1.86(1.91)         17.52(11.36)
FAIR        3.07(1.32)          3.51(1.93)         13.72(7.37)
PenLDA      0.88(0.92)          1.95(1.97)         18.95(18.14)
SDA         0.00(0.00)          1.42(1.21)         39.83(2.84)
PSIS        0.06(0.24)          2.14(1.57)         26.49(6.85)
DC-SIS      0.08(0.27)          2.63(2.30)         15.54(12.53)
MV-SIS1     0.15(0.44)          1.77(1.91)         11.99(9.53)
MV-SIS2     0.20(0.40)          1.41(1.10)         11.74(6.71)
Figure 2. Lung Cancer Data in Example 2. (a) Boxplots of classification errors in the training sets over 100 random partitions of 181 samples; (b) Boxplots of classification errors in the testing sets; (c) Boxplots of numbers of selected genes.

Example 3

These human lung carcinoma data were analyzed using mRNA expression profiling (Bhattacharjee et al., 2001). There are 12600 mRNA expression levels in a total of 203 snap-frozen lung tumors and normal lungs. The 203 specimens are classified into five subclasses: 139 lung adenocarcinomas (ADEN), 21 squamous cell lung carcinomas (SQUA), 6 small cell lung carcinomas (SCLC), 20 pulmonary carcinoid tumors (COID) and the remaining 17 normal lung samples (NORMAL). Before classification, we first standardize the data to zero mean and unit variance. To evaluate the prediction performance of the proposed method, we randomly select approximately 100τ% of the observations from each subclass as training samples and the remaining 100(1 − τ)% observations as testing samples, where τ ∈ (0, 1).

Note that the aforementioned NSC and FAIR are proposed only for binary classification problems, so they are not applicable to this multi-class discriminant analysis. PSIS, DC-SIS and MV-SIS with LDA are applied to the training set, and their performance is evaluated on the testing samples. For the DC-SIS and MV-SIS (denoted by MV-SIS1) with LDA procedures, leave-one-out cross-validation is applied to choose the optimal model size for the training data. In addition, we consider penalized LDA (denoted by PenLDA) and MV-SIS followed by SDA (denoted by MV-SIS2) for comparison, and use 10-fold cross-validation rather than leave-one-out cross-validation to choose the best model size in order to reduce the computation time. Although SDA can be directly applied to multi-class discriminant analysis for a given model size, searching for the best model size for SDA is remarkably computationally expensive for multi-class ultrahigh dimensional data. Thus, we use MV-SIS to reduce the dimensionality and then apply SDA (i.e., MV-SIS2) instead of SDA alone in this example.

Next, we choose τ = 0.9, 0.8 and repeat the experiment 100 times for each. Following Example 2, the means of the training and testing classification errors and the corresponding numbers of selected genes, with their associated standard deviations (in parentheses), are reported in Table 6. We can clearly observe that, although all methods perform reasonably well in the tumor classification, the MV-SIS procedures with LDA or SDA are significantly better than the other methods in terms of both training and testing classification errors and the number of selected genes. In particular, the MV-SIS+SDA (i.e., MV-SIS2) procedure achieves the best performance using a small number of top genes. Furthermore, we find that the top genes selected by MV-SIS are not normally distributed and contain potential outliers. This observation explains why the other methods perform relatively worse and confirms the robustness of the proposed MV-SIS. This example further demonstrates that the two-stage approach combining MV-SIS with a discriminant analysis method is favorable for ultrahigh dimensional data in practice.

Table 6.

Classification Errors for Lung Carcinomas Data with 5 Classes in Example 3.

τ Method Training Error(%) Testing Error(%) No. of Selected Genes
0.9 PenLDA 21.88(2.24) 21.71(3.87) 25.76(21.04)
PSIS 3.54(0.79) 9.43(5.65) 107.54(15.71)
DC-SIS 6.85(1.35) 11.81(6.40) 32.08(3.85)
MV-SIS1 3.65(1.15) 7.71(4.99) 20.56(8.02)
MV-SIS2 3.65(1.15) 7.62(5.09) 31.76(10.24)

0.8 PenLDA 22.12(2.10) 22.40(4.37) 25.04(21.81)
PSIS 3.08(1.11) 7.90(3.89) 101.88(15.72)
DC-SIS 6.33(2.16) 13.15(5.32) 32.18(5.39)
MV-SIS1 3.74(1.09) 8.35(4.12) 21.34(7.42)
MV-SIS2 3.74(1.09) 6.70(4.24) 27.20(9.11)

4. SOME EXTENSIONS

The MV-SIS approach is proposed to screen important predictors for ultrahigh dimensional discriminant analysis, where the response is categorical, but it can be easily extended to other settings. In this section, we discuss two natural extensions of MV-SIS and use simulation studies to demonstrate their excellent performance.

4.1. Genome-Wide Association Studies

First, we can apply MV-SIS to ultrahigh dimensional problems with categorical predictors. In such situations, feature screening can be performed using MV(Y|Xk), where Xk is categorical for k = 1, 2, …, p. Under Conditions (C1) and (C2), we can establish the sure screening and ranking consistency properties for ωk = MV(Y|Xk) by imposing Condition (C1) on each categorical SNP instead of the response. In genome-wide association studies (GWAS), modern genotyping techniques allow researchers to collect genetic data that usually contain an extremely large number of single-nucleotide polymorphisms (SNPs). In general, the SNPs serving as predictors are categorical with three classes, denoted by {AA, Aa, aa}. In Example 4, we apply the proposed MV-SIS to an ultrahigh dimensional GWAS problem to identify important SNPs, and compare its performance with other independence screening approaches.

Example 4

(Genome-Wide Association Studies) To mimic SNPs with equal allele frequencies, we denote by Zij the indicator of the dominant effect of the jth SNP for the ith subject and generate it as follows:

Zij = 1 if Xij < q1;  Zij = 0 if q1 ≤ Xij < q3;  Zij = −1 if Xij ≥ q3,

where xi = (Xi1, …, Xip) ~ N(0, Σ) for i = 1, …, n, with Σ = (ρjk)_{p×p} and ρjk = 0.5^{|j−k|} for j, k = 1, …, p, and q1 and q3 are the first and third quartiles of the standard normal distribution, respectively. Then, we generate the response (some trait or disease) by

Y = β1Z1 + β2Z2 + 2β3Z10 + 2β4Z20 − 2β5|Z100| + ε,

where βj = (−1)^U (a + |Z|) for j = 1, …, 5, with a = 2 log n/√n, U ~ Bernoulli(0.4) and Z ~ N(0, 1), and the error term ε follows N(0, 1) or t(1). There are 5 active SNPs for the response, namely Z1, Z2, Z10, Z20 and Z100. The first four active SNPs are linearly correlated with the response Y, while Z100 and Y are nonlinearly correlated; it is interesting to note that the absolute value of the dominant effect, |Z100|, is the corresponding additive effect in genetics. Here, we consider five independence screening approaches: SIS, DC-SIS, SIRS, RRCS (Li et al., 2012) and MV-SIS, set n = 200 and p = 2000, and repeat each experiment 500 times. We summarize the simulation results for d = [n/log(n)] in Table 7.
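The SNP-generating step can be sketched in R as follows (the function name gen_snps is our own); latent Gaussian vectors with the AR(1)-type covariance above are trichotomized at the standard normal quartiles.

    # Generate an n x p SNP matrix: 1 / 0 / -1 by cutting latent normals at q1, q3.
    library(MASS)  # for mvrnorm
    gen_snps <- function(n, p, rho = 0.5) {
      Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
      X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
      q <- qnorm(c(0.25, 0.75))          # first and third quartiles of N(0, 1)
      Z <- matrix(0L, n, p)
      Z[X < q[1]]  <- 1L
      Z[X >= q[2]] <- -1L
      Z
    }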

Table 7.

Simulation Results for Example 4 - GWAS Model.

ε        Method   MMS             P1s    P2s    P10s   P20s   P100s  Pa
N(0, 1)  SIS      1058.0(786.9)   0.96   0.97   1.00   0.99   0.02   0.02
         DC-SIS   10.0(40.1)      0.96   0.95   1.00   0.99   0.79   0.72
         SIRS     1074.0(834.8)   0.94   0.95   1.00   0.98   0.03   0.02
         RRCS     1031.0(801.6)   0.96   0.96   1.00   0.99   0.03   0.03
         MV-SIS   8.0(34.3)       0.96   0.94   0.99   0.98   0.89   0.78
t(1)     SIS      1427.0(530.4)   0.26   0.28   0.42   0.42   0.02   0.00
         DC-SIS   124.0(284.8)    0.78   0.75   0.92   0.91   0.53   0.32
         SIRS     1050.0(672.5)   0.86   0.84   0.97   0.96   0.02   0.01
         RRCS     993.0(725.5)    0.87   0.84   0.98   0.96   0.02   0.01
         MV-SIS   46.0(139.1)     0.79   0.79   0.94   0.94   0.79   0.46

According to Table 7, when the error follows a normal distribution, all five independence screening methods are able to select the first four active SNPs effectively, because these are linearly correlated with the response; however, only DC-SIS and MV-SIS can choose Z100, which contributes to Y nonlinearly. When the error is generated from t(1), which is very heavy-tailed, it is not surprising that all independence screening methods perform worse than before, but MV-SIS still performs best. Thus, we conclude that MV-SIS can effectively select active categorical SNPs that are linearly or nonlinearly correlated with the response.

4.2. Nonparametric Additive Models

In this subsection, we further consider the application of MV-SIS to an ultrahigh dimensional nonparametric additive model. Although both the response and the predictors are generally continuous, we can discretize each predictor Xj into a categorical variable to make MV-SIS applicable. To be specific, we can define the discretized predictor X̃j using percentiles {τ1, …, τKn} of Xj by X̃ij = ∑_k k·I(τk ≤ Xij < τk+1), where I(·) is an indicator function, i = 1, …, n, j = 1, …, p, k = 1, …, Kn, with Kn = O(n^{1/5}). Then, we can apply MV-SIS to the discretized predictors and use MV(Y|X̃j) as the marginal screening utility to measure the importance of Xj. In practice, the sample size in each discretized class cannot be too small, in order to ensure accurate estimation of the conditional distribution function; on the other hand, the number of classes cannot be too small either, in order to retain as much information about the continuous variable as possible. Based on our empirical experience, we suggest that the number of samples in each class be greater than 20 to obtain a decent estimator of the MV index. One can also treat the number of classes as a tuning parameter and apply cross-validation to choose it. The following simulation example numerically examines the performance of this proposal, using the strategy sketched in the code below.
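A minimal R sketch of this discretize-then-screen strategy follows (function names are our own), cutting each predictor at its empirical quartiles as in Example 5 and reusing the hypothetical mv_hat from Section 2.1 with the roles of its arguments swapped, so that the discretized predictor conditions and the continuous response plays the other part.

    # Discretize a continuous predictor into K classes at empirical quantiles.
    discretize <- function(x, K = 4) {
      brks <- unique(quantile(x, probs = seq(0, 1, length.out = K + 1)))
      cut(x, breaks = brks, include.lowest = TRUE)
    }
    # Screen continuous-response data: rank predictors by MV-hat(Y | X_j-tilde).
    mv_screen_y <- function(X, y, d) {
      omega <- apply(X, 2, function(xj) mv_hat(y, discretize(xj)))
      order(omega, decreasing = TRUE)[seq_len(d)]
    }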

Example 5

(Nonparametric Additive Model) Following Meier, van de Geer and Bühlmann (2009), we define the following four functions:

f1(x) = −sin(2x),  f2(x) = x² − 25/12,  f3(x) = x,  f4(x) = e^{−x} − (2/5)·sinh(5/2).

Then we consider the following additive model

Y = 3f1(X1) + f2(X2) − 1.5f3(X3) + f4(X4) + ε,

where the predictors are generated independently from Uniform[−2.5, 2.5]. To examine the robustness of each independence screening approach, we consider two cases for the error term ε: (1) εi ~ N(0, 1); (2) εi ~ t(1), for i = 1, 2, …, n. In this example, besides the five approaches in Example 4, we further consider the nonparametric independence screening (NIS) proposed for sparse ultrahigh dimensional additive models by Fan, Feng and Song (2011), and the quantile-adaptive sure independence screening (QaSIS) with quantile τ = 0.5 proposed by He, Wang and Hong (2013). We set n = 200 and p = 2000 and repeat each experiment 500 times for each error case. In our simulations, we discretize each predictor into a 4-category variable using its first, second and third quartiles as knots for MV-SIS. Simulation results for the given model size d = [n/log(n)] are reported in Table 8.
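For reference, this data-generating process can be sketched in R as follows (the seed is our choice):

    # Example 5: sparse additive model with four active predictors.
    f1 <- function(x) -sin(2 * x)
    f2 <- function(x) x^2 - 25 / 12
    f3 <- function(x) x
    f4 <- function(x) exp(-x) - 2 / 5 * sinh(5 / 2)
    set.seed(1)
    n <- 200; p <- 2000
    X <- matrix(runif(n * p, -2.5, 2.5), n, p)
    y <- 3 * f1(X[, 1]) + f2(X[, 2]) - 1.5 * f3(X[, 3]) + f4(X[, 4]) +
         rnorm(n)                        # Case (2): replace rnorm(n) by rt(n, df = 1)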

Table 8.

Simulation Results for Example 5 - Nonparametric Additive Model

ε        Method   MMS             P1s    P2s    P3s    P4s    Pa
N(0, 1)  SIS      1084.5(690.3)   0.17   0.02   1.00   1.00   0.00
         NIS      4.0(0.0)        1.00   0.99   1.00   1.00   0.99
         DC-SIS   50.5(55.2)      0.47   0.79   1.00   1.00   0.37
         SIRS     1178.0(668.6)   0.15   0.01   1.00   1.00   0.00
         QaSIS    5.0(4.5)        0.99   0.93   0.99   1.00   0.91
         RRCS     1112.5(673.9)   0.16   0.03   1.00   1.00   0.00
         MV-SIS   4.0(1.5)        0.99   0.95   1.00   1.00   0.95
t(1)     SIS      1508.0(538.1)   0.04   0.01   0.44   0.51   0.00
         NIS      1056.5(932.2)   0.25   0.15   0.22   0.37   0.08
         DC-SIS   205.0(280.1)    0.20   0.33   0.96   0.96   0.07
         SIRS     1222.5(645.5)   0.12   0.01   1.00   1.00   0.00
         QaSIS    16.0(37.7)      0.93   0.79   0.93   1.00   0.69
         RRCS     1212.0(688.1)   0.14   0.01   0.99   1.00   0.00
         MV-SIS   11.0(24.8)      0.93   0.81   0.99   1.00   0.75

Table 8 indicates that MV-SIS performs very well after discretizing each predictor. When the error term is normal, NIS performs best, followed by MV-SIS and QaSIS. Although DC-SIS can detect nonlinearity, it occasionally misses X1 and X2; a probable reason is that the distance correlations between Y and the first two predictors are relatively weak. When the error term follows the Cauchy distribution, which makes the data heavy-tailed and generates extreme points, NIS deteriorates quickly while QaSIS still detects the true signals well. MV-SIS, on the other hand, still effectively selects the active predictors and performs even better than QaSIS, again demonstrating its robustness.

5. DISCUSSION

In this paper, we have developed a new sure screening procedure for ultrahigh dimensional discriminant analysis, in which the response is allowed to have a diverging number of categories. We established the sure screening property and the ranking consistency property of the proposed procedure without assuming any moment condition on the predictors. The proposed procedure has several appealing properties: it is easily implemented, it is robust to model mis-specification (i.e., model-free), and it is robust to outliers and heavy tails in the predictors. The proposed procedure is also highly useful for analyzing data collected in GWAS, in which the phenotype may be continuous, possibly multivariate, while the predictors are categorical SNPs.

In the numerical studies, we applied linear discriminant analysis to the model selected by MV-SIS in the second stage. Linear discriminant analysis methods are widely used in practice and performed reasonably well in our real data analyses. It would also be interesting to develop a model-free and robust discriminant analysis to follow a model-free variable screening approach; this is beyond the scope of this work but is an interesting topic for future research. Some work has been done on robust discriminant analysis; related references include regularized discriminant analysis by Friedman (1989), robust LDA based on S-estimators by He and Fung (2000), penalized linear discriminant analysis by Witten and Tibshirani (2011), and semiparametric sparse discriminant analysis by Mai and Zou (2014), among others.

Supplementary Material

The online Supplement contains additional numerical results and the proof for the relaxed version of Condition (C1) discussed in the Remark of Section 2.2.

Acknowledgments

The authors thank the Editor, the AE and reviewers for their constructive comments, which have greatly improved the earlier version of this paper.

Biographies

Hengjian Cui is Professor, Department of Statistics, Capital Normal University, China. hjcui@bnu.edu.cn. His research was supported by National Natural Science Foundation of China (NNSFC) grants 11071022, 11028103, 11231010 and Key project of Beijing Municipal Educational Commission and Beijing Center for Mathematics and Information Interdisciplinary Sciences.

Runze Li is Distinguished Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111. rzli@psu.edu. His research was supported by National Institute on Drug Abuse (NIDA) grants P50-DA10075 and P50 DA036107, and NNSFC grant 11028103.

Wei Zhong is Assistant Professor, Wang Yanan Institute for Studies in Economics (WISE), Department of Statistics and Fujian Key Laboratory of Statistical Science, Xiamen University, China. wzhong@xmu.edu.cn. His research was supported by NNSFC grants 11301435 and 71131008.

APPENDIX

Proof of Proposition 2.1

Note that F(x|Y) = ℙ(X ≤ x|Y) is a random variable as a function of Y. We have

EY[F(x|Y)] = ∑_{r=1}^R ℙ(X ≤ x|Y = yr) ℙ(Y = yr) = ∑_{r=1}^R ℙ(X ≤ x, Y = yr) = ℙ(X ≤ x) = F(x),
VarY[F(x|Y)] = ∑_{r=1}^R [ℙ(X ≤ x|Y = yr) − F(x)]² ℙ(Y = yr) = ∑_{r=1}^R pr [Fr(x) − F(x)]²,

where pr = ℙ(Y = yr). Then

MV(X|Y) = EX[VarY(F(X|Y))] = ∑_{r=1}^R pr ∫ [Fr(x) − F(x)]² dF(x).

The second property follows directly from the first: X and Y being statistically independent is equivalent to Fr(x) = F(x) for any x ∈ ℝX and r = 1, 2, …, R, which in turn is equivalent to ∑_{r=1}^R pr ∫ [Fr(x) − F(x)]² dF(x) = 0, given that pr > 0 and F(x + δ) − F(x − δ) > 0 for any δ > 0 and x ∈ ℝX. This completes the proof.

To prove Theorems 2.1 and 2.2, we need the following lemmas.

Lemma A.1

[Hoeffding’s Inequality] Let X1, …, Xn be independent random variables. Assume that ℙ(Xi ∈ [ai, bi]) = 1 for 1 ≤ i ≤ n, where ai and bi are constants. Let X̄ = n^{−1} ∑_{i=1}^n Xi. Then the following inequality holds:

ℙ(|X̄ − E(X̄)| ≥ t) ≤ 2 exp{−2n²t² / ∑_{i=1}^n (bi − ai)²},  (A.1)

where t is a positive constant and E(X̄) is the expected value of X̄.

Lemma A.2

[Bernstein’s Inequality] (van der Vaart and Wellner, 1996, Lemma 2.2.9) Let X1, …, Xn be independent random variables with bounded support [−M, M] and zero means, then the following inequality holds

ℙ(|X1 + ··· + Xn| > t) ≤ 2 exp{−t² / (2(ν + Mt/3))},  (A.2)

for ν ≥ var(X1 + ··· + Xn).

We need the following notation for the next lemma. Let Fk,r(x) = ℙ(Xk ≤ x|Y = yr) and Fk(x) = ℙ(Xk ≤ x), for 1 ≤ k ≤ p, r = 1, …, R and x ∈ ℝX. Denote

f0 = f0(Xk, Y) = ∑_{r=1}^R I{Y = yr} ∫ [Fk,r(x) − Fk(x)]² dFk(x);
f̄r = f̄r(Xk, Y) = [Fk,r(Xk) − Fk(Xk)]²;
fr = fr(Xk, Y) = I{Y = yr};
f0,x = f0,x(Xk, Y) = I{Xk ≤ x};
fr,x = fr,x(Xk, Y) = I{Xk ≤ x, Y = yr}.

Let {(Xki, Yi) : 1 ≤ i ≤ n} be a random sample from the population (Xk, Y). Define f̄r(i) = f̄r(Xki, Yi), f0(i) = f0(Xki, Yi), fr(i) = I{Yi = yr}, f0,x(i) = I{Xki ≤ x}, and fr,x(i) = I{Xki ≤ x, Yi = yr} for i = 1, …, n.

Lemma A.3

For any ε ∈ (0, 1) and 1 ≤ r ≤ R, the following inequalities are valid for univariate Xk:

ℙ{|(1/n) ∑_{i=1}^n f̄r(i) − Ef̄r| ≥ ε} ≤ 2 exp{−2nε²};  (A.3)
ℙ{|(1/n) ∑_{i=1}^n f0(i) − Ef0| ≥ ε} ≤ 2 exp{−2nε²};  (A.4)
ℙ{|(1/n) ∑_{i=1}^n fr(i) − Efr| ≥ ε} ≤ 2 exp{−nε² / (2(pr + ε/3))};  (A.5)
ℙ{sup_{x∈ℝX} |(1/n) ∑_{i=1}^n f0,x(i) − Ef0,x| ≥ ε} ≤ 2(n+1) exp{−2nε²};  (A.6)
ℙ{sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| ≥ ε} ≤ 2(n+1) exp{−nε² / (2(pr + ε/3))},  (A.7)

where Eh stands for Eh(Xk, Y) for a function h(Xk, Y) with finite expected value.

Proof

Since |f̄r(Xk, Y)| = [Fk,r(Xk) − Fk(Xk)]² ≤ 1 and f0(Xk, Y) = ∑_{r=1}^R I{Y = yr} ∫ [Fk,r(x) − Fk(x)]² dFk(x) ≤ 1, we apply Hoeffding’s inequality to obtain inequalities (A.3) and (A.4).

Since fr(i) = I{Yi = yr} for i = 1, …, n, we have fr(i) ~ Bernoulli(pr) with Efr(i) = pr and fr(1) + ··· + fr(n) ~ Binomial(n, pr), which implies Var(fr(1) + ··· + fr(n)) = npr(1 − pr) ≤ npr and |fr(i) − pr| ≤ 1. Thus, by Bernstein’s inequality, we have

ℙ{|(1/n) ∑_{i=1}^n fr(i) − Efr| ≥ ε} = ℙ{|∑_{i=1}^n (fr(i) − pr)| ≥ nε} ≤ 2 exp{−n²ε² / (2(npr + nε/3))} = 2 exp{−nε² / (2(pr + ε/3))}.

The inequality (A.5) is proved.

Note that |f0,x(i) − Ef0,x| = |I{Xki ≤ x} − Fk(x)| ≤ 1; we then apply Hoeffding’s inequality and empirical process theory (Pollard, 1984) to obtain (A.6). Note that |fr,x(i) − Efr,x| = |I{Xki ≤ x, Yi = yr} − Fk,r(x)pr| ≤ 1; we then apply Bernstein’s inequality and empirical process theory (Pollard, 1984) to obtain (A.7). This completes the proof of Lemma A.3.

Lemma A.4

Under Condition (C1), for any ε ∈ (0, 1/2) and 1 ≤ k ≤ p, we have

ℙ{|ω̂k − ωk| ≥ ε} ≤ O(n) Rn exp{−c4 nε² / Rn}  (A.8)

for some constant c4 > 0.

Proof

According to the definitions of ωk and ω̂k, we have

ω̂k − ωk = (1/n) ∑_{j=1}^n ∑_{r=1}^R p̂r [F̂kr(Xj) − F̂k(Xj)]² − ∑_{r=1}^R pr ∫ [Fkr(x) − Fk(x)]² dFk(x)
= ∑_{r=1}^R p̂r (∫ [F̂kr(x) − F̂k(x)]² dF̂k(x) − ∫ [Fkr(x) − Fk(x)]² dFk(x)) + ∑_{r=1}^R (p̂r − pr) ∫ [Fkr(x) − Fk(x)]² dFk(x)
= ∑_{r=1}^R p̂r ∫ ([F̂kr(x) − F̂k(x)]² − [Fkr(x) − Fk(x)]²) dF̂k(x) + ∑_{r=1}^R p̂r ∫ [Fkr(x) − Fk(x)]² d[F̂k(x) − Fk(x)] + ∑_{r=1}^R (p̂r − pr) ∫ [Fkr(x) − Fk(x)]² dFk(x)
=: Ik1 + Ik2 + Ik3.

We first deal with the term Ik1.

Ik1 ≤ 2 max_r ∫ |[F̂kr(x) − Fkr(x)] − [F̂k(x) − Fk(x)]| dF̂k(x) ≤ 2 max_r sup_{x∈ℝX} (|F̂kr(x) − Fkr(x)| + |F̂k(x) − Fk(x)|) =: 2(Jk1 + Jk2),

where the first inequality holds by ∑_{r=1}^R p̂r = 1 and |[F̂kr(x) − F̂k(x)] + [Fkr(x) − Fk(x)]| ≤ |F̂kr(x) − F̂k(x)| + |Fkr(x) − Fk(x)| ≤ 1 + 1 = 2, and the second inequality is implied by ∫ dF̂k(x) = 1. We first deal with the term Jk1:

Jk1 = max_r sup_{x∈ℝX} |F̂kr(x) − Fkr(x)| = max_r sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i)/p̂r − Efr,x/pr|
≤ max_r sup_{x∈ℝX} (|(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| / p̂r + Efr,x |p̂r − pr| / (p̂r pr))
≤ max_r sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| / p̂r + max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| / p̂r,

where the last inequality holds because sup_{x∈ℝX} Efr,x = sup_{x∈ℝX} ℙ(Xk ≤ x, Y = yr) = pr. Thus, under Condition (C1), for any 0 < ε < 1/2,

ℙ{Jk1 ≥ ε} ≤ ℙ{max_r sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| / p̂r + max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| / p̂r ≥ ε, min_r p̂r ≥ c1/(2Rn)} + ℙ{min_r p̂r < c1/(2Rn)}
≤ ℙ{max_r sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| + max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| ≥ c1ε/(2Rn)} + ℙ{max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| ≥ c1/(2Rn)}
≤ ℙ{max_r sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| ≥ c1ε/(4Rn)} + 2ℙ{max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| ≥ c1ε/(4Rn)}
≤ 2(n+1)Rn exp{−n(c1ε/(4Rn))² / (2(pr + c1ε/(12Rn)))} + 2Rn exp{−n(c1ε/(4Rn))² / (2(pr + c1ε/(12Rn)))}
≤ 2(n+3)Rn exp{−(c1²/32) nε² / (Rn(c2 + c1ε/12))}
≤ 2(n+3)Rn exp{−c5 nε² / Rn},  (A.9)

for some constant c5 > 0, where the second inequality holds because min_r p̂r < c1/(2Rn) implies max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| = max_r |p̂r − pr| ≥ pr − p̂r ≥ c1/Rn − c1/(2Rn) = c1/(2Rn), using c1/Rn ≤ min_{1≤r≤Rn} pr in Condition (C1); the fourth inequality is due to Lemma A.3; and the fifth inequality follows from max_{1≤r≤Rn} pr ≤ c2/Rn in Condition (C1). We then apply inequalities (A.6), (A.3) and (A.4) in Lemma A.3 to obtain the following three results, respectively:

ℙ{Jk2 ≥ ε} = ℙ{sup_{x∈ℝX} |F̂k(x) − Fk(x)| ≥ ε} ≤ 2(n+1) exp{−2nε²},  (A.10)
ℙ{|Ik2| ≥ ε} = ℙ{|∑_{r=1}^R p̂r ((1/n) ∑_{i=1}^n f̄r(i) − Ef̄r)| ≥ ε} ≤ ℙ{max_r |(1/n) ∑_{i=1}^n f̄r(i) − Ef̄r| ≥ ε} ≤ 2Rn exp{−2nε²},  (A.11)
ℙ{|Ik3| ≥ ε} = ℙ{|(1/n) ∑_{i=1}^n f0(i) − Ef0| ≥ ε} ≤ 2 exp{−2nε²}.  (A.12)

Inequalities (A.9)–(A.12) together imply the result of Lemma A.4.

Proof of Theorem 2.1

For the first part of Theorem 2.1, by Lemma A.4 and Rn = O(n^κ), we have

ℙ{max_{1≤k≤p} |ω̂k − ωk| ≥ cn^{−τ}} ≤ O(n) p Rn exp{−c4c² n^{1−2τ} / Rn} ≤ O(p n Rn exp{−b n^{1−(2τ+κ)}}) ≤ O(p exp{−b n^{1−(2τ+κ)} + (1+κ) log n}),

where b > 0 is a constant depending on c, c1 and c2.

Next, we deal with the second part of Theorem 2.1. If 𝒟 ⊄ 𝒟̂, then there must exist some k ∈ 𝒟 such that ω̂k < cn^{−τ}. It then follows from Condition (C2) that |ω̂k − ωk| > cn^{−τ} for some k ∈ 𝒟, indicating that {𝒟 ⊄ 𝒟̂} ⊆ {|ω̂k − ωk| > cn^{−τ} for some k ∈ 𝒟}, and hence Dn := {max_{k∈𝒟} |ω̂k − ωk| ≤ cn^{−τ}} ⊆ {𝒟 ⊆ 𝒟̂}. Consequently,

ℙ{𝒟 ⊆ 𝒟̂} ≥ ℙ{Dn} = 1 − ℙ{Dnᶜ} = 1 − ℙ{max_{k∈𝒟} |ω̂k − ωk| > cn^{−τ}} ≥ 1 − sn ℙ{|ω̂k − ωk| > cn^{−τ}} ≥ 1 − O(sn exp{−b n^{1−(2τ+κ)} + (1+κ) log n}),

where sn is the cardinality of 𝒟. This completes the proof of the second part.

Proof of Theorem 2.2

ℙ{(min_{k∈𝒟} ω̂k − max_{k∈ℐ} ω̂k) < c3/2} ≤ ℙ{(min_{k∈𝒟} ω̂k − max_{k∈ℐ} ω̂k) − (min_{k∈𝒟} ωk − max_{k∈ℐ} ωk) < −c3/2}
≤ ℙ{|(min_{k∈𝒟} ω̂k − max_{k∈ℐ} ω̂k) − (min_{k∈𝒟} ωk − max_{k∈ℐ} ωk)| > c3/2}
≤ ℙ{2 max_{1≤k≤p} |ω̂k − ωk| > c3/2}
≤ O(n) p Rn exp{−c6 n / Rn}

for some constant c6 > 0, where the first inequality follows from Condition (C3) and the last inequality is implied by Lemma A.4. Because Rn log(p)/n = o(1) and Rn log(n)/n = o(1) imply that p ≤ exp{(c6/2) n/Rn}, that (c6/2) n/Rn ≥ 4 log(n), and that log(nRn) ≤ 2 log(n) for large n, we have, for some n0, ∑_{n=n0}^∞ n p Rn exp{−c6 n/Rn} ≤ ∑_{n=n0}^∞ exp{log(nRn) + (c6/2) n/Rn − c6 n/Rn} ≤ ∑_{n=n0}^∞ exp{log(nRn) − 4 log(n)} ≤ ∑_{n=n0}^∞ n^{−2} < +∞. Therefore, by the Borel–Cantelli Lemma, we obtain lim inf_{n→∞} {min_{k∈𝒟} ω̂k − max_{k∈ℐ} ω̂k} ≥ c3/2 > 0 a.s.

Footnotes

The content is solely the responsibility of the authors and does not necessarily represent the official views of the NNSFC or NIDA.

References

  1. Bhattacharjee A, Richards W, Staunton J, Li C, Monti S, Vasal P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark E, Lander E, Wong W, Johnson B, Golub T, Sugarbaker D, Meyerson M. Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Subclasses. PNAS. 2001;98:13790–13795. doi: 10.1073/pnas.191502998.
  2. Clemmensen L, Hastie T, Witten D, Ersboll B. Sparse Discriminant Analysis. Technometrics. 2011;53:406–415.
  3. Fan J, Fan Y. High-Dimensional Classification Using Features Annealed Independence Rules. The Annals of Statistics. 2008;36:2605–2637. doi: 10.1214/07-AOS504.
  4. Fan J, Feng Y, Song R. Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models. Journal of the American Statistical Association. 2011;106:544–557. doi: 10.1198/jasa.2011.tm09779.
  5. Fan J, Ma Y, Dai W. Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying Coefficient Models. Journal of the American Statistical Association. 2014; in press. doi: 10.1080/01621459.2013.879828.
  6. Fan J, Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space (with Discussion). Journal of the Royal Statistical Society, Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
  7. Fan J, Samworth R, Wu Y. Ultrahigh Dimensional Feature Selection: Beyond the Linear Model. Journal of Machine Learning Research. 2009;10:1829–1853.
  8. Fan J, Song R. Sure Independence Screening in Generalized Linear Models with NP-Dimensionality. The Annals of Statistics. 2010;38:3567–3604.
  9. Friedman J. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989;84:165–175.
  10. Gordon G, Jensen R, Hsiao L, Gullans S, Blumenstock J, Ramaswamy S, Richards W, Sugarbaker D, Bueno R. Translation of Microarray Data Into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma. Cancer Research. 2002;62:4963–4967.
  11. He X, Fung WK. High Breakdown Estimation for Multiple Populations with Applications to Discriminant Analysis. Journal of Multivariate Analysis. 2000;72:151–162.
  12. He X, Wang L, Hong H. Quantile-Adaptive Model-Free Variable Screening for High-Dimensional Heterogeneous Data. The Annals of Statistics. 2013;41:342–369.
  13. Huang J, Horowitz J, Ma S. Asymptotic Properties of Bridge Estimators in Sparse High-Dimensional Regression Models. The Annals of Statistics. 2008;36:587–613.
  14. Ji P, Jin J. UPS Delivers Optimal Phase Diagram in High Dimensional Variable Selection. The Annals of Statistics. 2012;40:73–103.
  15. Li G, Peng H, Zhang J, Zhu L. Robust Rank Correlation Based Screening. The Annals of Statistics. 2012;40:1846–1877.
  16. Li R, Zhong W, Zhu L. Feature Screening via Distance Correlation Learning. Journal of the American Statistical Association. 2012;107:1129–1139. doi: 10.1080/01621459.2012.695654.
  17. Liu J, Li R, Wu R. Feature Selection for Varying Coefficient Models with Ultrahigh Dimensional Covariates. Journal of the American Statistical Association. 2014;109:266–274. doi: 10.1080/01621459.2013.850086.
  18. Mai Q, Zou H. The Kolmogorov Filter for Variable Screening in High-Dimensional Binary Classification. Biometrika. 2013;100:229–234.
  19. Mai Q, Zou H. Semiparametric Sparse Discriminant Analysis in Ultra-High Dimensions. Manuscript; 2014. arXiv:1304.4983.
  20. Mai Q, Zou H, Yuan M. A Direct Approach to Sparse Discriminant Analysis in Ultra-High Dimensions. Biometrika. 2012;99:29–42.
  21. Meier L, van de Geer S, Bühlmann P. High-Dimensional Additive Modeling. The Annals of Statistics. 2009;37:3779–3821.
  22. Pan R, Wang H, Li R. On the Ultrahigh Dimensional Linear Discriminant Analysis Problem with a Diverging Number of Classes. Manuscript; 2013.
  23. Pollard D. Convergence of Stochastic Processes. New York: Springer-Verlag; 1984.
  24. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression. Proceedings of the National Academy of Sciences. 2002;99:6567–6572. doi: 10.1073/pnas.082099299.
  25. van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. New York: Springer; 1996.
  26. Wang H. Forward Regression for Ultra-High Dimensional Variable Screening. Journal of the American Statistical Association. 2009;104:1512–1524.
  27. Witten D, Tibshirani R. Penalized Classification Using Fisher’s Linear Discriminant. Journal of the Royal Statistical Society, Series B. 2011;73:753–772. doi: 10.1111/j.1467-9868.2011.00783.x.
  28. Zhu LP, Li L, Li R, Zhu LX. Model-Free Feature Screening for Ultrahigh Dimensional Data. Journal of the American Statistical Association. 2011;106:1464–1475. doi: 10.1198/jasa.2011.tm10563.
