Published in final edited form as: J Am Stat Assoc. 2014 May 13;110(510):630–641. doi: 10.1080/01621459.2014.920256

Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis

Hengjian Cui, Runze Li, Wei Zhong
PMCID: PMC4574103  NIHMSID: NIHMS590994  PMID: 26392643

Abstract

This work is concerned with marginal sure independence feature screening for ultrahigh dimensional discriminant analysis. In discriminant analysis the response variable is categorical, which enables us to use the conditional distribution function to construct a new index for feature screening. In this paper, we propose a marginal feature screening procedure based on the empirical conditional distribution function. We establish the sure screening and ranking consistency properties for the proposed procedure without assuming any moment condition on the predictors. The proposed procedure enjoys several appealing merits. First, it is model-free in that its implementation does not require specification of a regression model. Second, it is robust to heavy-tailed distributions of predictors and the presence of potential outliers. Third, it allows the categorical response to have a diverging number of classes of order O(n^κ) with some κ ≥ 0. We assess the finite sample properties of the proposed procedure by Monte Carlo simulation studies and numerical comparisons, and further illustrate the proposed methodology by empirical analyses of two real-life data sets.

Keywords: Feature screening, consistency in ranking, sure screening property, ultrahigh dimensional data analysis

1. INTRODUCTION

Variable selection plays an important role in high dimensional data analysis, and marginal feature screening has become indispensable for ultrahigh dimensional data, receiving much attention in the recent literature. Various feature screening procedures have been proposed for linear models, generalized linear models and robust linear models (Fan and Lv, 2008; Wang, 2009; Fan, Samworth and Wu, 2009; Li et al., 2012), and these authors demonstrated that their procedures enjoy the sure screening property in the terminology of Fan and Lv (2008). Feature screening procedures have also been developed for nonparametric regression models. Fan, Feng and Song (2011) proposed a nonparametric marginal screening procedure for additive models based on B-spline expansions. Fan, Ma and Dai (2014) extended the nonparametric B-spline method to varying coefficient models and proposed a marginal sure screening procedure. Liu, Li and Wu (2014) proposed a local kernel-based marginal sure screening procedure for varying coefficient models and established its sure screening property. The aforementioned model-based screening procedures perform well when the underlying models are correctly specified, but their performance may be quite poor in the presence of model mis-specification, and specifying a correct model for ultrahigh dimensional data can be challenging. Thus, model-free sure screening procedures are appealing and have been developed by several authors (Zhu et al., 2011; Li, Zhong and Zhu, 2012; He, Wang and Hong, 2013). Li, Zhong and Zhu (2012) developed a model-free sure independence screening procedure based on the distance correlation; its sure screening property requires subexponential tail probability conditions on the predictors and response, and it is not robust to very heavy-tailed data with extreme values. Mai and Zou (2013) developed a sure feature screening procedure for ultrahigh dimensional predictors based on the Kolmogorov distance, but it was studied only for binary classification problems. Pan, Wang and Li (2013) proposed a pairwise sure screening procedure for linear discriminant analysis with a diverging number of classes and ultrahigh dimensional predictors; however, it is based on mean differences and may not perform well for heavy-tailed data. This work aims to develop an effective model-free and robust feature screening procedure for ultrahigh dimensional discriminant analysis with a possibly diverging number of classes.

In this paper, we propose an effective sure screening procedure for discriminant analysis. We study its theoretical properties and establish the sure screening and ranking consistency properties, without assuming any moment condition on the predictors, under ultrahigh dimensional discriminant analysis settings with a diverging number of response classes. Our numerical studies show that the proposed procedure has excellent performance. It enjoys several appealing properties: it is model-free, since its implementation does not require specification of a regression model, and its marginal utility can be evaluated easily without numerical optimization.

Due to its nature, the proposed procedure can be directly applied to a continuous response with categorical predictors. This is particularly useful in genome-wide association studies (GWAS), in which the phenotypes (i.e., the responses) are continuous while the single-nucleotide polymorphisms (SNPs) serving as predictors are categorical. Thus, it is also of interest to develop an effective feature screening procedure for settings in which the response is continuous while the predictors of interest are categorical. In this paper, we further extend our procedure to such settings. Some further extensions are discussed in Section 4.

The rest of this paper is organized as follows. In Section 2, we propose a new marginal utility for feature screening and further study its theoretical properties. In Section 3, we conduct Monte Carlo simulation studies to examine the finite sample performance of the proposed procedure. We further illustrate the proposed methodology by empirical analyses of real data examples. Section 4 presents some extensions of the proposed methodology. Technical proofs are given in the Appendix.

2. A NEW FEATURE SCREENING PROCEDURE

2.1. A New Index based on Conditional Distribution Function

Let Y be a categorical response with R classes {y1, y2, …, yR}, and let X be a continuous covariate with support ℝX. To investigate the dependence between X and Y, we naturally consider the conditional distribution function of X given Y, denoted by F(x|Y) = ℙ(X ≤ x|Y). Denote by F(x) = ℙ(X ≤ x) the unconditional distribution function of X and by Fr(x) = ℙ(X ≤ x|Y = yr) the conditional distribution function of X given Y = yr. If Fr(x) = F(x) for any x ∈ ℝX and r = 1, 2, …, R, then X and Y are independent. This motivates us to consider the following index

MV(X|Y) = EX[VarY(F(X|Y))]  (2.1)

to measure the dependence between X and Y. The following proposition provides the key properties of MV(X|Y).

Proposition 2.1

Let Y be a categorical random variable with R classes {y1, y2, …, yR} and pr = ℙ(Y = yr) > 0 for all r = 1, …, R. Let X be a continuous random variable with support ℝX. Denote F(x) = ℙ(X ≤ x) and Fr(x) = ℙ(X ≤ x|Y = yr). Then

  1. MV(X|Y) = ∑_{r=1}^R pr ∫ [Fr(x) − F(x)]² dF(x).

  2. MV (X|Y) = 0 if and only if X and Y are statistically independent.

The proof of this proposition is given in the Appendix. The result in part (1) implies that MV(X|Y) can be represented as a weighted average of Cramér–von Mises distances between the conditional distribution function of X given Y = yr and the unconditional distribution function of X. The second, remarkable property motivates us to utilize MV(X|Y) as a marginal utility for feature screening that can characterize both linear and nonlinear relationships in ultrahigh dimensional discriminant analysis.

Let {(Xi, Yi) : 1 ≤ i ≤ n} be a random sample of size n from the population (X, Y). Define p̂r = (1/n) ∑_{i=1}^n I{Yi = yr} with I{·} being the indicator function, F̂(x) = (1/n) ∑_{i=1}^n I{Xi ≤ x}, and F̂r(x) = (1/n) ∑_{i=1}^n I{Xi ≤ x, Yi = yr}/p̂r. It is natural to estimate MV(X|Y) by its sample counterpart:

MV̂(X|Y) = (1/n) ∑_{r=1}^R ∑_{j=1}^n p̂r [F̂r(Xj) − F̂(Xj)]².  (2.2)
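For concreteness, the estimator in (2.2) involves only empirical distribution functions and class proportions. The following minimal R sketch computes it (the function name mv_hat and its interface are our own, not from the paper); x is a numeric predictor vector and y a categorical response of the same length.

    # Sample estimator of MV(X|Y) in (2.2); x numeric, y categorical.
    mv_hat <- function(x, y) {
      Fhat <- ecdf(x)(x)                  # unconditional ECDF at each X_j
      mv <- 0
      for (yr in unique(y)) {
        idx <- (y == yr)
        p_r <- mean(idx)                  # class proportion p_r-hat
        Fr  <- ecdf(x[idx])(x)            # ECDF of X given Y = y_r, at each X_j
        mv  <- mv + p_r * mean((Fr - Fhat)^2)
      }
      mv
    }

Note that ecdf(x[idx])(x) equals (1/(n p̂r)) ∑_i I{Xi ≤ x, Yi = yr} evaluated at each sample point, so the loop reproduces (2.2) exactly.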

To gain insight into MV(X|Y), let us consider a simple example. Let X be a univariate standard normal random variable and generate random variables Zk, k = 1, 2, by Z1 = cX + ε and Z2 = cX² + ε, where ε ~ N(0, 1) and c is a constant controlling the signal-to-noise ratio. Then, we discretize each Zk into a categorical variable Yk with four equally likely classes. That is, Yk = I(Zk ≤ qk1) + 2I(qk1 < Zk ≤ qk2) + 3I(qk2 < Zk ≤ qk3) + 4I(Zk > qk3), k = 1, 2, where {qk1, qk2, qk3} are the first, second and third quartiles of Zk, respectively. Thus, the response Y1 depends on X through the linear term cX, while Y2 depends on X through the quadratic term cX². We set the sample size n = 200 and c = 0, 0.5, 1 and 2; note that Yk and X are independent for each k = 1, 2 when c = 0. We then compute the variance of the conditional distribution function of X given Yk, i.e., VarYk[F(x|Yk)], for x ∈ [−2, 2] and each c. Panels (a) and (c) in Figure 1 are boxplots of VarYk[F(x|Yk)] against different c values for k = 1, 2, respectively, where the star indicates MV̂(X|Yk). Panels (b) and (d) in Figure 1 show how VarYk[F(x|Yk)], k = 1, 2, varies across x ∈ [−2, 2] for different c values. As the signal-to-noise ratio increases, MV̂(X|Yk) increases: when c = 0, i.e., X and Yk are independent, the MV̂(X|Yk) values are close to zero, whereas when c > 0, they are clearly bounded away from zero. Consequently, MV(X|Y) should be an effective measure of the strength of both linear and nonlinear dependence between a continuous covariate and a categorical response.

Figure 1. (a) Boxplot of VarY1[F(x|Y1)] against c with the star indicating the mean; (b) Plot of VarY1[F(x|Y1)] against x for different c values; (c) Boxplot of VarY2[F(x|Y2)] against c with the star indicating the mean; (d) Plot of VarY2[F(x|Y2)] against x for different c values.
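The data-generating step of this illustration is easy to reproduce. The following R sketch (the seed and output format are our own choices) reuses the hypothetical mv_hat above, discretizing each Zk at its sample quartiles and printing the estimated index for each signal level.

    # Illustrative example: MV-hat under linear (Y1) and quadratic (Y2) dependence.
    set.seed(1)
    n <- 200
    x <- rnorm(n)
    for (cval in c(0, 0.5, 1, 2)) {
      z1 <- cval * x   + rnorm(n)         # linear signal
      z2 <- cval * x^2 + rnorm(n)         # quadratic signal
      y1 <- cut(z1, c(-Inf, quantile(z1, c(.25, .5, .75)), Inf))
      y2 <- cut(z2, c(-Inf, quantile(z2, c(.25, .5, .75)), Inf))
      cat(sprintf("c = %.1f: MV(X|Y1) = %.4f, MV(X|Y2) = %.4f\n",
                  cval, mv_hat(x, y1), mv_hat(x, y2)))
    }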

2.2. Sure Independence Screening Using MV(X|Y)

We now propose a new model-free sure independence screening procedure using MV(X|Y) for ultrahigh dimensional discriminant analysis. Let Y be the response with discrete support {y1, y2, …, yR}, R ≥ 2, and let x = (X1, …, Xp)ᵀ be the predictor vector, where p ≫ n and n is the sample size. Without specifying a regression model, define the active predictor subset by

𝒟 = {k : F(y|x) functionally depends on Xk for some y = yr},

and denote by ℐ = {1, 2, …, p} \ 𝒟 the inactive predictor subset.

The goal is to select a reduced model of moderate scale which can almost fully contain 𝒟, using an independence screening method for ultrahigh dimensional discriminant analysis. To this end, we apply the MV index to each pair (Xk, Y):

ωk = MV(Xk|Y)

as a marginal utility to measure the importance of Xk for the response, where k = 1, 2, …, p. Note that ωk = 0 if and only if Xk and Y are statistically independent. As motivation, observe that if the partial orthogonality condition (Huang, Horowitz and Ma, 2008; Fan and Song, 2010) holds, i.e., {Xk : k ∈ 𝒟} are statistically independent of {Xk : k ∈ ℐ}, then ωk is a naturally effective measure to separate the active and inactive predictor subsets, because ωk > 0 for k ∈ 𝒟 and ωk = 0 for k ∈ ℐ. This also shows that MV-based variable screening is model-free: it is defined through conditional and unconditional distribution functions and is able to characterize both linear and nonlinear relationships between the response and predictors.

For a random sample {(xi, Yi) : 1 ≤ i ≤ n}, we can easily estimate ωk by ω̂k = MV̂(Xk|Y) according to equation (2.2). We then propose to utilize ω̂k to choose a submodel

𝒟̂ = {k : ω̂k ≥ cn^{−τ}, for 1 ≤ k ≤ p},

where c and τ are pre-determined thresholding values defined in Condition (C2) below. In practice, for a given size d < n, one can select a reduced model:

𝒟̂ = {k : ω̂k is among the top d largest of all}.

We refer to this procedure as MV-based sure independence screening, MV-SIS for short.
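In code, MV-SIS reduces to ranking the p marginal utilities and keeping the top d. A minimal R sketch follows (the function name mv_sis is our own), reusing the hypothetical mv_hat from Section 2.1; the default d = [n/log n] mirrors the size used in the simulations of Section 3.

    # MV-SIS: rank predictors by omega-hat_k = MV-hat(X_k|Y), keep the top d.
    mv_sis <- function(X, y, d = floor(nrow(X) / log(nrow(X)))) {
      omega <- apply(X, 2, mv_hat, y = y)          # marginal utilities, all p columns
      order(omega, decreasing = TRUE)[seq_len(d)]  # indices of selected predictors
    }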

Next, we study the theoretical properties of the proposed MV-SIS. Fan and Lv (2008) and Ji and Jin (2012) demonstrated that a two-stage procedure combining independence screening and penalized estimation can outperform a one-step penalized estimation approach, such as the LASSO. The effectiveness of the two-stage procedure is guaranteed by the sure screening property, namely that all active predictors are included in the reduced model with high probability. Thus, we first establish the sure screening property for MV-SIS under the following conditions.

  • (C1)

    There exist two positive constants c1 and c2 such that c1/Rn ≤ min_{1≤r≤Rn} pr ≤ max_{1≤r≤Rn} pr ≤ c2/Rn. Assume that Rn = O(n^κ) for κ ≥ 0.

  • (C2)

    There exist positive constants c > 0 and 0 ≤ τ < 1/2 such that min_{k∈𝒟} ωk ≥ 2cn^{−τ}.

Condition (C1) requires that the proportion of each class of the response be neither too small nor too large. The assumption Rn = O(n^κ) in Condition (C1) allows a diverging number of response classes, where the subscript n in Rn emphasizes that Rn is allowed to diverge with the sample size n. Condition (C2) assumes that the minimum true signal cannot be too small; it is of order n^{−τ}, which allows the minimum true signal to vanish as the sample size n approaches infinity. Such an assumption is typical in the feature screening literature (e.g., Condition 3 in Fan and Lv (2008), Condition (C3) in Wang (2009), and Condition (C2) in both Li, Zhong and Zhu (2012) and He, Wang and Hong (2013)). The following theorem presents the sure screening property of MV-SIS; its proof is provided in the Appendix.

Theorem 2.1

[Sure Screening Property] Under Condition (C1) and for any 0 ≤ κ < 1 − 2τ, there exists a positive constant b depending on c, c1 and c2, such that

ℙ(max_{1≤k≤p} |ω̂k − ωk| > cn^{−τ}) ≤ O(p exp{−b n^{1−(2τ+κ)} + (1+κ) log n}).  (2.3)

Under Conditions (C1) and (C2), we have that

ℙ(𝒟 ⊆ 𝒟̂) ≥ 1 − O(sn exp{−b n^{1−(2τ+κ)} + (1+κ) log n}),  (2.4)

where sn is the cardinality of 𝒟.

The sure screening property holds for MV-SIS under milder conditions than those for SIS (Fan and Lv, 2008) and DC-SIS (Li, Zhong and Zhu, 2012), in that we neither require the regression function of Y on x to be linear nor impose moment assumptions on the predictors. It is worth noting that MV-SIS is robust to heavy-tailed distributions of predictors and the presence of potential outliers, because MV(Xk|Y) inherits the robustness of the conditional distribution function. Furthermore, the sure screening property also holds for a categorical response with a diverging number of classes. Thus, MV-SIS provides a unified alternative to existing model-based sure screening procedures for ultrahigh dimensional discriminant analysis.

According to Theorem 2.1, MV-SIS can handle NP-dimensionality log p = O(n^α), where α < 1 − 2τ − κ with 0 ≤ τ < 1/2 and 0 ≤ κ < 1 − 2τ; this rate depends on the minimum true signal strength and the number of response classes. If Rn is fixed, i.e., κ = 0, then the result of Theorem 2.1 improves and its first part can be rewritten as

ℙ{max_{1≤k≤p} |ω̂k − ωk| > cn^{−τ}} ≤ O(p exp{−b n^{1−2τ} + log n}),

for some constant b > 0. In this case, we can handle an even larger NP-dimensionality log p = O(n^α), where α < 1 − 2τ with 0 ≤ τ < 1/2.

Remark

Condition (C1) can be relaxed by allowing c1 to tend to zero at a certain rate. To be specific, assume that c1 = O(n^{−η}) with 0 < η < 1 − (2τ + κ). Under this relaxed condition, the sure screening property remains essentially the same as before, but the convergence rate becomes relatively slower. That is,

ℙ(max_{1≤k≤p} |ω̂k − ωk| > cn^{−τ}) ≤ O(p exp{−b n^{1−(2τ+κ+η)} + (1+κ) log n}).

Then, a smaller NP-dimensionality log p = O(n^α) with α < 1 − 2τ − κ − η is allowed. For the proof, refer to Appendix A in the Supplement.

Another interesting property of independence screening is the ranking consistency property, in the terminology of Zhu et al. (2011). To investigate the ranking consistency property of MV-SIS, we additionally assume the following condition.

  • (C3)

    lim inf_{p→∞} {min_{k∈𝒟} ωk − max_{k∈ℐ} ωk} ≥ c3, where c3 > 0 is a constant.

It is easily shown that under the partial orthogonality condition (Huang, Horowitz and Ma, 2008), under which ωk > 0 for k ∈ 𝒟 and ωk = 0 for k ∈ ℐ, Condition (C3) naturally holds. Thus, Condition (C3) is a relatively weaker assumption than the partial orthogonality condition. It requires that the MV index be able to separate active and inactive predictors well at the population level. The following theorem establishes the ranking consistency property of MV-SIS.

Theorem 2.2

[Ranking Consistency Property] If Conditions (C1) and (C3) hold with Rn log(n)/n = o(1) and Rn log(p)/n = o(1), then lim inf_{n→∞} {min_{k∈𝒟} ω̂k − max_{k∈ℐ} ω̂k} > 0, almost surely.

Although it requires a more restrictive condition on the gap between active and inactive signals, Theorem 2.2 gives a stronger result than the sure screening property: the sample MV(Xk|Y) values of the active predictors are ranked above those of the inactive ones with high probability. Thus, with an ideal thresholding value, one can separate the active and inactive predictors.

3. NUMERICAL STUDIES

In this section, we first assess the finite sample performance of the proposed MV-SIS by Monte Carlo simulation studies. Then, we conduct empirical analyses of two real data examples to illustrate the proposed MV-SIS procedure. Some additional numerical results are given in the Supplement.

3.1. Monte Carlo Simulations

We use the minimum model size (MMS) required to include all active predictors to measure the effectiveness of each screening approach. In addition, the proportion of replications in which a single active predictor Xj is included, denoted by Pjs, and the proportion in which all active predictors are included, denoted by Pa, are computed for a given model size d = [n/log n], where n is the sample size and [x] denotes the integer part of x. All numerical studies are conducted using R code.

Example 1

(Ultrahigh Dimensional Linear Discriminant Analysis) In this example, we consider a linear discriminant analysis problem with ultrahigh dimensional predictors, following settings similar to Pan, Wang and Li (2013). For the ith observation, the categorical response Yi is generated from one of two distributions: (i) balanced, a discrete uniform distribution with R categories, where ℙ(Yi = r) = 1/R for r = 1, …, R; (ii) unbalanced, where the probabilities pr = ℙ(Yi = r) = 2[1 + (r − 1)/(R − 1)]/(3R) form an arithmetic progression with max_{1≤r≤R} pr = 2 min_{1≤r≤R} pr. For instance, when Y is binary, p1 = 1/3 and p2 = 2/3. Given Yi = r, the predictor vector xi is generated as xi = μr + εi, where the mean term μr = (μr1, …, μrp) ∈ ℝ^p is a p-dimensional vector whose rth component is μrr = 3 and whose other components are all zero, and εi = (εi1, …, εip) is a p-dimensional error term. We consider two cases for the error term: (1) εij ~ N(0, 1); (2) εij ~ t(2), independently for each j = 1, …, p. Note that Case (2) makes each predictor heavy-tailed, which is designed to examine the robustness of an independence screening method. To systematically examine MV-SIS and its competitors, we consider 2000 predictors with a binary response and n = 40, and with a 10-categorical response and n = 200, for each case; that is, (R, n, p) = (2, 40, 2000) and (10, 200, 2000).
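For reference, the balanced data-generating process of this example can be sketched in R as follows (the function name gen_lda_data and its arguments are our own; the unbalanced case would replace the uniform draw with the class probabilities pr).

    # Example 1 generator (balanced case): given Y = r, x = mu_r + eps with
    # mu_r having 3 in coordinate r and zeros elsewhere.
    gen_lda_data <- function(n, p, R, heavy_tailed = FALSE) {
      y <- sample.int(R, n, replace = TRUE)   # balanced classes
      X <- if (heavy_tailed) matrix(rt(n * p, df = 2), n, p)
           else matrix(rnorm(n * p), n, p)
      X[cbind(seq_len(n), y)] <- X[cbind(seq_len(n), y)] + 3  # add the mean shift
      list(X = X, y = y)
    }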

First, we compare the performance of MV-SIS with SIS (Fan and Lv, 2008), SIRS (Zhu et al., 2011), DC-SIS (Li, Zhong and Zhu, 2012), the Kolmogorov filter (KF; Mai and Zou, 2013) and PSIS (Pan, Wang and Li, 2013) for the binary response, where X1 and X2 are the active predictors. Table 1 summarizes, for each method, the median MMS with its associated robust estimate of the standard deviation (RSD = IQR/1.34) in parentheses, together with Pjs for j = 1, 2 and Pa for the given model size d = [n/log n], based on 500 simulations.

Table 1.

Simulation Results for Linear Discriminant Analysis with Binary Response

                        Case (1): εij ~ N(0, 1)            Case (2): εij ~ t(2)
pr          Method      MMS        P1s    P2s    Pa        MMS          P1s    P2s    Pa
Balanced    SIS         2.0(0.0)   1.00   1.00   1.00      2.5(9.1)     0.79   0.88   0.71
            SIRS        2.0(0.0)   1.00   1.00   1.00      8.0(20.5)    0.71   0.76   0.55
            DC-SIS      2.0(0.0)   1.00   1.00   1.00      2.0(0.0)     0.99   0.98   0.97
            KF          2.0(0.0)   1.00   1.00   1.00      2.0(0.0)     0.99   0.99   0.98
            PSIS        2.0(0.0)   1.00   1.00   1.00      2.5(9.1)     0.79   0.88   0.71
            MV-SIS      2.0(0.0)   1.00   1.00   1.00      2.0(0.0)     1.00   0.99   0.99
Unbalanced  SIS         2.0(0.0)   1.00   1.00   1.00      5.5(48.8)    0.75   0.75   0.55
            SIRS        2.0(0.0)   1.00   0.99   0.99      17.0(123.3)  0.67   0.64   0.44
            DC-SIS      2.0(0.0)   1.00   1.00   1.00      2.0(1.1)     0.95   0.96   0.92
            KF          2.0(0.0)   1.00   1.00   1.00      2.0(0.7)     0.96   0.99   0.95
            PSIS        2.0(0.0)   1.00   1.00   1.00      5.5(48.8)    0.75   0.75   0.55
            MV-SIS      2.0(0.0)   1.00   1.00   1.00      2.0(0.7)     0.96   0.99   0.95

Next, we consider the response with 10 categories, where X1, X2, …, X10 are active. Note that the value of the response Y is a nominal label, which makes SIS, SIRS and the Kolmogorov filter inapplicable, whereas MV-SIS is designed for variable screening with a multi-category response. To make DC-SIS applicable to this problem, we transform the 10-categorical response into 9 dummy binary variables, which together are treated as a new multivariate response; note that Li, Zhong and Zhu (2012) claimed that DC-SIS can be applied to a multivariate response. Pan, Wang and Li (2013) proposed pairwise sure independence screening (PSIS) to deal with a categorical response: PSIS utilizes |μ̂r1j − μ̂r2j| as the marginal signal of predictor Xj for each pair of classes (r1, r2) at a time, where μ̂rj denotes the sample average of Xij over {i : Yi = r}. Here, we take max_{r1≠r2} |μ̂r1j − μ̂r2j| over r1, r2 = 1, 2, …, 10 as the marginal signal of predictor Xj, and denote this variant by PSIS*. Table 2 summarizes the median MMS with its associated robust standard deviation in parentheses, Pjs for j = 1, 2, …, 10, and Pa for the given model size d = [n/log n], based on 500 simulations.

Table 2.

Simulation Results for Linear Discriminant Analysis with 10-Categorical Response

Method    MMS             P1s   P2s   P3s   P4s   P5s   P6s   P7s   P8s   P9s   P10s  Pa

(i) Balanced Probabilities and Case (1): εij ~ N(0, 1)
DC-SIS    10.0(0.0)       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.99  0.99
PSIS*     10.0(0.0)       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
MV-SIS    10.0(0.0)       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

(i) Balanced Probabilities and Case (2): εij ~ t(2)
DC-SIS    15.0(21.8)      0.86  0.99  0.99  0.99  0.97  0.98  0.99  0.99  0.99  0.98  0.74
PSIS*     362.5(563.6)    0.73  0.75  0.76  0.73  0.75  0.75  0.75  0.73  0.76  0.79  0.05
MV-SIS    11.0(3.7)       1.00  1.00  1.00  0.99  0.99  1.00  1.00  0.99  0.99  0.99  0.95

(ii) Unbalanced Probabilities and Case (1): εij ~ N(0, 1)
DC-SIS    13.0(14.9)      0.82  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.82
PSIS*     10.0(0.0)       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
MV-SIS    10.0(0.0)       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

(ii) Unbalanced Probabilities and Case (2): εij ~ t(2)
DC-SIS    126.5(248.3)    0.35  0.90  0.93  0.93  0.96  1.00  0.99  1.00  1.00  1.00  0.22
PSIS*     343.5(444.9)    0.68  0.66  0.56  0.58  0.64  0.63  0.60  0.73  0.61  0.67  0.05
MV-SIS    13.0(9.8)       0.93  0.98  0.98  0.98  0.98  1.00  1.00  1.00  1.00  1.00  0.85

In addition, we compare the post-screening estimation and prediction performance of PSIS and MV-SIS for the binary response generated from a discrete uniform distribution. Here, we generate p = 2000 predictors and a binary response with sample sizes n = 40 and n = 80, and replicate each simulation experiment 500 times. For the rth replication, we follow Pan, Wang and Li (2013) and choose the model size using the BIC criterion, which exploits the equivalence between the LDA problem and a least squares problem established in Mai, Zou and Yuan (2012). We then define the model size (MS), the percentage of correct zeros (CZ), the percentage of incorrect zeros (IZ), the coverage probability (CP), and the root sum of squared errors (RSSE) as follows:

MSr = |𝒟̂r|,  CZr = (p − |𝒟 ∪ 𝒟̂r|)/(p − sn),  IZr = |𝒟̂rᶜ ∩ 𝒟|/|𝒟|,  CPr = I(𝒟 ⊆ 𝒟̂r),  RSSEr = ‖γ̂r − γ0‖,

where 𝒟̂r is the selected model in the rth replication, sn = 2 is the cardinality of 𝒟, γ0 = μ1 − μ2 = (3, −3, 0, …, 0) is the true difference between the two class means, and γ̂r = μ̂1r − μ̂2r is the post-screening estimator of γ0 in the rth replication based on the selected model. Furthermore, to assess prediction performance, an independent testing data set of the same sample size is generated in each replication. The classification accuracy (CA) of the post-screening estimator is computed in each replication; the classification accuracy based on the true means, denoted by CA0, is also computed, and the ratio RCA = CA/CA0 is evaluated for comparison. We report the median MS with its robust standard deviation in parentheses, and the averages of the other performance measures over all 500 replications, in Table 3.
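These performance measures translate directly into code. A hedged R sketch follows (the function name and argument conventions are our own; Dhat and D are integer index sets of the selected and true models).

    # Post-screening performance measures for one replication.
    eval_measures <- function(Dhat, D, p, gamma_hat, gamma0) {
      sn <- length(D)
      c(MS   = length(Dhat),
        CZ   = (p - length(union(D, Dhat))) / (p - sn),              # correct zeros
        IZ   = length(intersect(setdiff(seq_len(p), Dhat), D)) / sn, # incorrect zeros
        CP   = as.numeric(all(D %in% Dhat)),                         # coverage
        RSSE = sqrt(sum((gamma_hat - gamma0)^2)))
    }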

Table 3.

Simulation Results for Estimation and Prediction Performance in Linear Discriminant Analysis with Binary Response with 500 Simulations.

n Method MS(RSD) CZ(%) IZ(%) CP(%) RSSE CA(%) CA0(%) RCA
Case (1): εij ~ N (0, 1)

40 PSIS 3.0(2.9) 99.89 0.00 100.00 1.31 95.20 98.41 96.76
MV-SIS 3.0(2.2) 99.91 0.00 100.00 1.16 95.34 98.41 96.90

80 PSIS 2.0(1.5) 99.94 0.00 100.00 0.70 97.31 98.31 98.98
MV-SIS 2.0(0.8) 99.95 0.00 100.00 0.62 97.47 98.31 99.15

Case (2): εij ~ t(2)

40 PSIS 6.0(2.9) 99.76 19.50 65.00 3.65 73.42 89.91 81.81
MV-SIS 5.0(3.1) 99.83 3.00 94.00 2.74 78.92 89.91 87.87

80 PSIS 7.0(4.4) 99.71 7.00 86.40 2.56 79.17 89.95 88.04
MV-SIS 3.0(2.9) 99.87 0.00 100.00 1.56 84.80 89.95 94.30

Tables 1 and 2 indicate that the proposed MV-SIS is superior to its competitors for variable screening in linear discriminant analysis. When the error term is heavy-tailed and the number of response categories increases, MV-SIS attains much smaller minimum model sizes (MMS) and markedly higher probabilities of including all active predictors in the selected model than the other independence screening methods. The robustness of MV-SIS is thus an important feature that makes it more useful in practice. The same pattern can be observed in Table 3: MV-SIS has estimation and prediction performance very close to that of PSIS when the error term is normal, but when the error deviates from normality, PSIS deteriorates while MV-SIS still performs reasonably well.

3.2. Real Data Examples

Example 2

Lung cancer data were previously analyzed for classification between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung in Gordon et al. (2002) and Fan and Fan (2008). There are 12533 genes and 181 tissue samples from two classes: 31 in class MPM and 150 in class ADCA. The training dataset contains 32 of them (16 MPM and 16 ADCA), while the remaining 149 samples (15 MPM and 134 ADCA) are used for testing.

Before classification, we first standardize the data to zero mean and unit variance. Fan and Fan (2008) showed that their features annealed independence rules (FAIR) selected 31 important genes and made no training errors and 7 testing errors, while the nearest shrunken centroids (NSC) method proposed by Tibshirani et al. (2002) chose 26 genes and resulted in no training errors and 11 testing errors. We then consider DC-SIS, PSIS and our MV-SIS approach (denoted by MV-SIS1), each followed by LDA, for this ultrahigh dimensional classification problem. Note that FAIR uses diagonal linear discriminant analysis after t-test screening; to make a fair comparison, we also add a procedure combining t-test screening with LDA, denoted by FAIR*. Furthermore, the penalized LDA method (denoted by PenLDA) proposed by Witten and Tibshirani (2011) and the sparse discriminant analysis (denoted by SDA) of Clemmensen et al. (2011) are implemented in this example for comparison. In addition, we combine our MV-SIS with SDA and consider this two-stage method as another potential approach, denoted by MV-SIS2. As in Example 1, the BIC criterion is applied to determine the model size for all competing methods in this binary classification problem. We summarize the classification results in Table 4. MV-SIS followed by LDA (i.e., MV-SIS1) makes 0 training errors and 5 testing errors using only the top 5 genes, and MV-SIS with SDA (i.e., MV-SIS2) performs even better than MV-SIS1 and SDA, achieving the smallest number of testing errors using only 7 genes. Thus, the two-stage approaches combining MV-SIS with LDA or SDA are superior to the other competitors in terms of classification errors and selected model size for these ultrahigh dimensional lung cancer data.
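For clarity, the two-stage MV-SIS1 pipeline (screening followed by LDA) can be sketched as below, assuming standardized training data Xtr, ytr and testing data Xte, and reusing the hypothetical mv_sis from Section 2.2; the model size is fixed at 5 here for illustration rather than chosen by BIC.

    # Two-stage pipeline: MV-SIS screening, then LDA on the selected genes.
    library(MASS)
    sel  <- mv_sis(Xtr, ytr, d = 5)                       # top 5 genes by MV-hat
    fit  <- lda(Xtr[, sel, drop = FALSE], grouping = ytr) # LDA on reduced model
    pred <- predict(fit, Xte[, sel, drop = FALSE])$class  # testing-set labels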

Table 4.

Classification Errors for Lung Cancer Data in Example 2

Method Training Error Testing Error No. of Selected Genes
NSC 0/32 11/149 26
FAIR 0/32 7/149 31
FAIR* 0/32 7/149 14
PenLDA 0/32 9/149 8
SDA 0/32 6/149 17
PSIS 1/32 34/149 4
DC-SIS 0/32 6/149 7
MV-SIS1 0/32 5/149 5
MV-SIS2 0/32 3/149 7

To further evaluate prediction performance, we randomly partition all 181 tissue samples into two parts: a training set of 100 samples and a testing set of the remaining 81 samples. The above procedures are applied to the training data, and their performance is evaluated by the classification errors on both the training and testing sets. For a fair comparison, we choose the best model sizes for all methods using the same BIC criterion. We repeat the experiment 100 times, summarize the means with associated standard deviations (in parentheses) of the training and testing classification errors and the numbers of selected genes in Table 5, and display their distributions in Figure 2. MV-SIS with LDA (i.e., MV-SIS1) performs reasonably well, with small training and testing errors using around 12 genes on average. Among all the methods, SDA classifies the training samples perfectly and achieves a small testing error rate; however, SDA tends to select a considerably larger number of genes and thus may lose some model interpretability. It is worth noting that MV-SIS with SDA (i.e., MV-SIS2) achieves the smallest testing error rate with a much smaller number of genes. This further demonstrates the merit of the two-stage approach combining MV-SIS with SDA.

Table 5.

Performance Evaluation for Lung Cancer Data in Example 2

Method      Training Error(%)   Testing Error(%)   No. of Selected Genes
NSC         0.87(0.90)          1.86(1.91)         17.52(11.36)
FAIR        3.07(1.32)          3.51(1.93)         13.72(7.37)
PenLDA      0.88(0.92)          1.95(1.97)         18.95(18.14)
SDA         0.00(0.00)          1.42(1.21)         39.83(2.84)
PSIS        0.06(0.24)          2.14(1.57)         26.49(6.85)
DC-SIS      0.08(0.27)          2.63(2.30)         15.54(12.53)
MV-SIS1     0.15(0.44)          1.77(1.91)         11.99(9.53)
MV-SIS2     0.20(0.40)          1.41(1.10)         11.74(6.71)
Figure 2. Lung Cancer Data in Example 2. (a) Boxplots of classification errors in the training sets over 100 random partitions of 181 samples; (b) Boxplots of classification errors in the testing sets; (c) Boxplots of numbers of selected genes.

Example 3

These human lung carcinoma data were analyzed using mRNA expression profiling (Bhattacharjee et al., 2001). There are 12600 mRNA expression levels in a total of 203 snap-frozen lung tumors and normal lungs. The 203 specimens are classified into five subclasses: 139 lung adenocarcinomas (ADEN), 21 squamous cell lung carcinomas (SQUA), 6 small cell lung carcinomas (SCLC), 20 pulmonary carcinoid tumors (COID) and the remaining 17 normal lung samples (NORMAL). Before classification, we first standardize the data to zero mean and unit variance. To evaluate the prediction performance of the proposed method, we randomly select approximately 100τ% of the observations from each subclass as training samples and the remaining 100(1 − τ)% observations as testing samples, where τ ∈ (0, 1).

Note that the aforementioned NSC and FAIR are proposed only for binary classification problems, so they are not applicable to this multi-class discriminant analysis. PSIS, DC-SIS and MV-SIS with LDA are applied to the training set, and their performance is evaluated on the testing samples. For the DC-SIS and MV-SIS (denoted by MV-SIS1) with LDA procedures, leave-one-out cross-validation is applied to choose the optimal model size for the training data. In addition, we consider penalized LDA (denoted by PenLDA) and MV-SIS followed by SDA (denoted by MV-SIS2) for comparison, and use 10-fold cross-validation rather than leave-one-out cross-validation to choose the best model size in order to reduce the computation time. Although SDA can be directly applied to multi-class discriminant analysis for a given model size, searching for the best model size for SDA is remarkably computationally expensive for multi-class ultrahigh dimensional data. Thus, we use MV-SIS to reduce the dimensionality and then apply SDA (i.e., MV-SIS2) instead of SDA alone in this example.

Next, we choose τ = 0.9, 0.8 and repeat the experiment 100 times for each. Following Example 2, the means of the training and testing classification errors and the corresponding numbers of selected genes, with their associated standard deviations (in parentheses), are reported in Table 6. We can clearly observe that, although all methods perform reasonably well in the tumor classification, the MV-SIS procedures with LDA or SDA are significantly better than the other methods in terms of both training and testing classification errors and the number of selected genes. In particular, the MV-SIS+SDA (i.e., MV-SIS2) procedure achieves the best performance using a small number of top genes. Furthermore, we find that the top genes selected by MV-SIS are not normally distributed and contain potential outliers. This observation explains why the other methods perform relatively worse and confirms the robustness of the proposed MV-SIS. This example further demonstrates that the two-stage approach combining MV-SIS with a discriminant analysis method is favorable for ultrahigh dimensional data in practice.

Table 6.

Classification Errors for Lung Carcinomas Data with 5 Classes in Example 3.

τ Method Training Error(%) Testing Error(%) No. of Selected Genes
0.9 PenLDA 21.88(2.24) 21.71(3.87) 25.76(21.04)
PSIS 3.54(0.79) 9.43(5.65) 107.54(15.71)
DC-SIS 6.85(1.35) 11.81(6.40) 32.08(3.85)
MV-SIS1 3.65(1.15) 7.71(4.99) 20.56(8.02)
MV-SIS2 3.65(1.15) 7.62(5.09) 31.76(10.24)

0.8 PenLDA 22.12(2.10) 22.40(4.37) 25.04(21.81)
PSIS 3.08(1.11) 7.90(3.89) 101.88(15.72)
DC-SIS 6.33(2.16) 13.15(5.32) 32.18(5.39)
MV-SIS1 3.74(1.09) 8.35(4.12) 21.34(7.42)
MV-SIS2 3.74(1.09) 6.70(4.24) 27.20(9.11)

4. SOME EXTENSIONS

The MV-SIS approach is proposed to screen important predictors for ultrahigh dimensional discriminant analysis, where the response is categorical, but it can be easily extended to other settings. In this section, we discuss two natural extensions of MV-SIS and use simulation studies to demonstrate their excellent performance.

4.1. Genome-Wide Association Studies

First, we can apply MV-SIS to ultrahigh dimensional problems with categorical predictors. In such situations, feature screening can be performed using MV(Y|Xk), where Xk is categorical for k = 1, 2, …, p. Under Conditions (C1) and (C2), we can establish the sure screening and ranking consistency properties for ωk = MV(Y|Xk) by imposing Condition (C1) on each categorical SNP instead of the response. In genome-wide association studies (GWAS), modern genotyping techniques allow researchers to collect genetic data that usually contain an extremely large number of single-nucleotide polymorphisms (SNPs). In general, the SNPs serving as predictors are categorical with three classes, denoted by {AA, Aa, aa}. In Example 4, we apply the proposed MV-SIS to an ultrahigh dimensional GWAS problem to identify important SNPs, and compare its performance with other independence screening approaches.

Example 4

(Genome-Wide Association Studies) To mimic SNPs with equal allele frequencies, we denote by Zij the indicator of the dominant effect of the jth SNP for the ith subject and generate it as follows:

Zij = 1 if Xij < q1;  Zij = 0 if q1 ≤ Xij < q3;  Zij = −1 if Xij ≥ q3,

where xi = (Xi1, …, Xip) ~ N(0, Σ) for i = 1, …, n, with Σ = (ρjk)_{p×p} and ρjk = 0.5^{|j−k|} for j, k = 1, …, p, and q1 and q3 are the first and third quartiles of the standard normal distribution, respectively. Then, we generate the response (some trait or disease) by

Y = β1Z1 + β2Z2 + 2β3Z10 + 2β4Z20 − 2β5|Z100| + ε,

where βj = (−1)^U (a + |Z|) for j = 1, …, 5, with a = 2 log n/√n, U ~ Bernoulli(0.4) and Z ~ N(0, 1), and the error term ε follows N(0, 1) or t(1). There are 5 active SNPs for the response, namely Z1, Z2, Z10, Z20 and Z100. The first four active SNPs are linearly correlated with the response Y, while Z100 and Y are nonlinearly correlated; it is interesting to note that the absolute value of the dominant effect, |Z100|, is the corresponding additive effect in genetics. Here, we consider five independence screening approaches: SIS, DC-SIS, SIRS, RRCS (Li et al., 2012) and MV-SIS, set n = 200 and p = 2000, and repeat each experiment 500 times. We summarize the simulation results for d = [n/log(n)] in Table 7.
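The SNP-generating step can be sketched in R as follows (the function name gen_snps is our own); latent Gaussian vectors with the AR(1)-type covariance above are trichotomized at the standard normal quartiles.

    # Generate an n x p SNP matrix: 1 / 0 / -1 by cutting latent normals at q1, q3.
    library(MASS)  # for mvrnorm
    gen_snps <- function(n, p, rho = 0.5) {
      Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
      X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
      q <- qnorm(c(0.25, 0.75))          # first and third quartiles of N(0, 1)
      Z <- matrix(0L, n, p)
      Z[X < q[1]]  <- 1L
      Z[X >= q[2]] <- -1L
      Z
    }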

Table 7.

Simulation Results for Example 4 - GWAS Model.

ε        Method   MMS             P1s    P2s    P10s   P20s   P100s  Pa
N(0, 1)  SIS      1058.0(786.9)   0.96   0.97   1.00   0.99   0.02   0.02
         DC-SIS   10.0(40.1)      0.96   0.95   1.00   0.99   0.79   0.72
         SIRS     1074.0(834.8)   0.94   0.95   1.00   0.98   0.03   0.02
         RRCS     1031.0(801.6)   0.96   0.96   1.00   0.99   0.03   0.03
         MV-SIS   8.0(34.3)       0.96   0.94   0.99   0.98   0.89   0.78
t(1)     SIS      1427.0(530.4)   0.26   0.28   0.42   0.42   0.02   0.00
         DC-SIS   124.0(284.8)    0.78   0.75   0.92   0.91   0.53   0.32
         SIRS     1050.0(672.5)   0.86   0.84   0.97   0.96   0.02   0.01
         RRCS     993.0(725.5)    0.87   0.84   0.98   0.96   0.02   0.01
         MV-SIS   46.0(139.1)     0.79   0.79   0.94   0.94   0.79   0.46

According to Table 7, when the error follows a normal distribution, all five independence screening methods are able to select the first four active SNPs effectively, because these are linearly correlated with the response; however, only DC-SIS and MV-SIS can choose Z100, which contributes to Y nonlinearly. When the error is generated from t(1), which is very heavy-tailed, it is not surprising that all independence screening methods perform worse than before, but MV-SIS still performs best. Thus, we conclude that MV-SIS can effectively select active categorical SNPs that are linearly or nonlinearly correlated with the response.

4.2. Nonparametric Additive Models

In this subsection, we further consider the application of MV-SIS to an ultrahigh dimensional nonparametric additive model. Although both the response and the predictors are generally continuous, we can discretize each predictor Xj into a categorical variable to make MV-SIS applicable. To be specific, we can define the discretized predictor X̃j using percentiles {τ1, …, τKn} of Xj by X̃ij = ∑_k k·I(τk ≤ Xij < τk+1), where I(·) is an indicator function, i = 1, …, n, j = 1, …, p, k = 1, …, Kn, with Kn = O(n^{1/5}). Then, we can apply MV-SIS to the discretized predictors and use MV(Y|X̃j) as the marginal screening utility to measure the importance of Xj. In practice, the sample size in each discretized class cannot be too small, in order to ensure accurate estimation of the conditional distribution function; on the other hand, the number of classes cannot be too small either, in order to retain as much information about the continuous variable as possible. Based on our empirical experience, we suggest that the number of samples in each class be greater than 20 to obtain a decent estimator of the MV index. One can also treat the number of classes as a tuning parameter and apply cross-validation to choose it. The following simulation example numerically examines the performance of this proposal, using the strategy sketched in the code below.
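A minimal R sketch of this discretize-then-screen strategy follows (function names are our own), cutting each predictor at its empirical quartiles as in Example 5 and reusing the hypothetical mv_hat from Section 2.1 with the roles of its arguments swapped, so that the discretized predictor conditions and the continuous response plays the other part.

    # Discretize a continuous predictor into K classes at empirical quantiles.
    discretize <- function(x, K = 4) {
      brks <- unique(quantile(x, probs = seq(0, 1, length.out = K + 1)))
      cut(x, breaks = brks, include.lowest = TRUE)
    }
    # Screen continuous-response data: rank predictors by MV-hat(Y | X_j-tilde).
    mv_screen_y <- function(X, y, d) {
      omega <- apply(X, 2, function(xj) mv_hat(y, discretize(xj)))
      order(omega, decreasing = TRUE)[seq_len(d)]
    }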

Example 5

(Nonparametric Additive Model) Following Meier, van de Geer and Bühlmann (2009), we define the following four functions:

f1(x) = −sin(2x),  f2(x) = x² − 25/12,  f3(x) = x,  f4(x) = e^{−x} − (2/5)·sinh(5/2).

Then we consider the following additive model

Y = 3f1(X1) + f2(X2) − 1.5f3(X3) + f4(X4) + ε,

where the predictors are generated independently from Uniform[−2.5, 2.5]. To examine the robustness of each independence screening approach, we consider two cases for the error term ε: (1) εi ~ N(0, 1); (2) εi ~ t(1), for i = 1, 2, …, n. In this example, besides the five approaches in Example 4, we further consider the nonparametric independence screening (NIS) proposed for sparse ultrahigh dimensional additive models by Fan, Feng and Song (2011), and the quantile-adaptive sure independence screening (QaSIS) with quantile τ = 0.5 proposed by He, Wang and Hong (2013). We set n = 200 and p = 2000 and repeat each experiment 500 times for each error case. In our simulations, we discretize each predictor into a 4-category variable using its first, second and third quartiles as knots for MV-SIS. Simulation results for the given model size d = [n/log(n)] are reported in Table 8.
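For reference, this data-generating process can be sketched in R as follows (the seed is our choice):

    # Example 5: sparse additive model with four active predictors.
    f1 <- function(x) -sin(2 * x)
    f2 <- function(x) x^2 - 25 / 12
    f3 <- function(x) x
    f4 <- function(x) exp(-x) - 2 / 5 * sinh(5 / 2)
    set.seed(1)
    n <- 200; p <- 2000
    X <- matrix(runif(n * p, -2.5, 2.5), n, p)
    y <- 3 * f1(X[, 1]) + f2(X[, 2]) - 1.5 * f3(X[, 3]) + f4(X[, 4]) +
         rnorm(n)                        # Case (2): replace rnorm(n) by rt(n, df = 1)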

Table 8.

Simulation Results for Example 5 - Nonparametric Additive Model

ε        Method   MMS             P1s    P2s    P3s    P4s    Pa
N(0, 1)  SIS      1084.5(690.3)   0.17   0.02   1.00   1.00   0.00
         NIS      4.0(0.0)        1.00   0.99   1.00   1.00   0.99
         DC-SIS   50.5(55.2)      0.47   0.79   1.00   1.00   0.37
         SIRS     1178.0(668.6)   0.15   0.01   1.00   1.00   0.00
         QaSIS    5.0(4.5)        0.99   0.93   0.99   1.00   0.91
         RRCS     1112.5(673.9)   0.16   0.03   1.00   1.00   0.00
         MV-SIS   4.0(1.5)        0.99   0.95   1.00   1.00   0.95
t(1)     SIS      1508.0(538.1)   0.04   0.01   0.44   0.51   0.00
         NIS      1056.5(932.2)   0.25   0.15   0.22   0.37   0.08
         DC-SIS   205.0(280.1)    0.20   0.33   0.96   0.96   0.07
         SIRS     1222.5(645.5)   0.12   0.01   1.00   1.00   0.00
         QaSIS    16.0(37.7)      0.93   0.79   0.93   1.00   0.69
         RRCS     1212.0(688.1)   0.14   0.01   0.99   1.00   0.00
         MV-SIS   11.0(24.8)      0.93   0.81   0.99   1.00   0.75

Table 8 indicates that MV-SIS performs very well after discretizing each predictor. When the error term is normal, NIS performs best, followed by MV-SIS and QaSIS. Although DC-SIS can detect nonlinearity, it occasionally misses X1 and X2; a probable reason is that the distance correlations between Y and the first two predictors are relatively weak. When the error term follows the Cauchy distribution, which makes the data heavy-tailed and generates extreme points, NIS deteriorates quickly while QaSIS still detects the true signals well. MV-SIS, on the other hand, still effectively selects the active predictors and performs even better than QaSIS, again demonstrating its robustness.

5. DISCUSSION

In this paper, we have developed a new sure screening procedure for ultrahigh dimensional discriminant analysis, in which the response is allowed to have a diverging number of categories. We established the sure screening property and the ranking consistency property of the proposed procedure without assuming any moment condition on the predictors. The proposed procedure has several appealing properties: it is easily implemented, it is robust to model mis-specification (i.e., model-free), and it is robust to outliers and heavy tails in the predictors. The proposed procedure is also highly useful for analyzing data collected in GWAS, in which the phenotype may be continuous, possibly multivariate, while the predictors are categorical SNPs.

In the numerical studies, we applied linear discriminant analysis to the model selected by MV-SIS in the second stage. Linear discriminant analysis methods are widely used in practice and performed reasonably well in our real data analyses. It would also be interesting to develop a model-free and robust discriminant analysis to follow a model-free variable screening approach; this is beyond the scope of this work but is an interesting topic for future research. Some work has been done on robust discriminant analysis; related references include regularized discriminant analysis by Friedman (1989), robust LDA based on S-estimators by He and Fung (2000), penalized linear discriminant analysis by Witten and Tibshirani (2011), and semiparametric sparse discriminant analysis by Mai and Zou (2014), among others.

Supplementary Material

The online Supplement contains additional numerical results and the proof for the relaxed version of Condition (C1) discussed in the Remark of Section 2.2.

Acknowledgments

The authors thank the Editor, the AE and reviewers for their constructive comments, which have greatly improved the earlier version of this paper.

Biographies

Hengjian Cui is Professor, Department of Statistics, Capital Normal University, China. hjcui@bnu.edu.cn. His research was supported by National Natural Science Foundation of China (NNSFC) grants 11071022, 11028103, 11231010 and Key project of Beijing Municipal Educational Commission and Beijing Center for Mathematics and Information Interdisciplinary Sciences.

Runze Li is Distinguished Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111. rzli@psu.edu. His research was supported by National Institute on Drug Abuse (NIDA) grants P50-DA10075 and P50 DA036107, and NNSFC grant 11028103.

Wei Zhong is Assistant Professor, Wang Yanan Institute for Studies in Economics (WISE), Department of Statistics and Fujian Key Laboratory of Statistical Science, Xiamen University, China. wzhong@xmu.edu.cn. His research was supported by NNSFC grants 11301435 and 71131008.

APPENDIX

Proof of Proposition 2.1

Note that F(x|Y) = ℙ(X ≤ x|Y) is a random variable as a function of Y. We have

EY[F(x|Y)] = ∑_{r=1}^R ℙ(X ≤ x|Y = yr) ℙ(Y = yr) = ∑_{r=1}^R ℙ(X ≤ x, Y = yr) = ℙ(X ≤ x) = F(x),
VarY[F(x|Y)] = ∑_{r=1}^R [ℙ(X ≤ x|Y = yr) − F(x)]² ℙ(Y = yr) = ∑_{r=1}^R pr [Fr(x) − F(x)]²,

where pr = ℙ(Y = yr). Then

MV(X|Y) = EX[VarY(F(X|Y))] = ∑_{r=1}^R pr ∫ [Fr(x) − F(x)]² dF(x).

The second property follows directly from the first: X and Y being statistically independent is equivalent to Fr(x) = F(x) for any x ∈ ℝX and r = 1, 2, …, R, which in turn is equivalent to ∑_{r=1}^R pr ∫ [Fr(x) − F(x)]² dF(x) = 0, given that pr > 0 and F(x + δ) − F(x − δ) > 0 for any δ > 0 and x ∈ ℝX. This completes the proof.

To prove Theorems 2.1 and 2.2, we need the following lemmas.

Lemma A.1

[Hoeffding’s Inequality] Let X1, …, Xn be independent random variables. Assume that ℙ(Xi ∈ [ai, bi]) = 1 for 1 ≤ i ≤ n, where ai and bi are constants. Let X̄ = n^{−1} ∑_{i=1}^n Xi. Then the following inequality holds:

ℙ(|X̄ − E(X̄)| ≥ t) ≤ 2 exp{−2n²t² / ∑_{i=1}^n (bi − ai)²},  (A.1)

where t is a positive constant and E(X̄) is the expected value of X̄.

Lemma A.2

[Bernstein’s Inequality] (van der Vaart and Wellner, 1996, Lemma 2.2.9) Let X1, …, Xn be independent random variables with bounded support [−M, M] and zero means, then the following inequality holds

ℙ(|X1 + ··· + Xn| > t) ≤ 2 exp{−t² / (2(ν + Mt/3))},  (A.2)

for ν ≥ var(X1 + ··· + Xn).

We need the following notation for the next lemma. Let Fk,r(x) = ℙ(Xk ≤ x|Y = yr) and Fk(x) = ℙ(Xk ≤ x), for 1 ≤ k ≤ p, r = 1, …, R and x ∈ ℝX. Denote

f0 = f0(Xk, Y) = ∑_{r=1}^R I{Y = yr} ∫ [Fk,r(x) − Fk(x)]² dFk(x);
f̄r = f̄r(Xk, Y) = [Fk,r(Xk) − Fk(Xk)]²;
fr = fr(Xk, Y) = I{Y = yr};
f0,x = f0,x(Xk, Y) = I{Xk ≤ x};
fr,x = fr,x(Xk, Y) = I{Xk ≤ x, Y = yr}.

Let {(Xki, Yi) : 1 ≤ i ≤ n} be a random sample from the population (Xk, Y). Define f̄r(i) = f̄r(Xki, Yi), f0(i) = f0(Xki, Yi), fr(i) = I{Yi = yr}, f0,x(i) = I{Xki ≤ x}, and fr,x(i) = I{Xki ≤ x, Yi = yr} for i = 1, …, n.

Lemma A.3

For any ε ∈ (0, 1) and 1 ≤ r ≤ R, the following inequalities are valid for univariate Xk:

ℙ{|(1/n) ∑_{i=1}^n f̄r(i) − Ef̄r| ≥ ε} ≤ 2 exp{−2nε²};  (A.3)
ℙ{|(1/n) ∑_{i=1}^n f0(i) − Ef0| ≥ ε} ≤ 2 exp{−2nε²};  (A.4)
ℙ{|(1/n) ∑_{i=1}^n fr(i) − Efr| ≥ ε} ≤ 2 exp{−nε² / (2(pr + ε/3))};  (A.5)
ℙ{sup_{x∈ℝX} |(1/n) ∑_{i=1}^n f0,x(i) − Ef0,x| ≥ ε} ≤ 2(n+1) exp{−2nε²};  (A.6)
ℙ{sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| ≥ ε} ≤ 2(n+1) exp{−nε² / (2(pr + ε/3))},  (A.7)

where Eh stands for Eh(Xk, Y) for a function h(Xk, Y) with finite expected value.

Proof

Since |f̄r(Xk, Y)| = [Fk,r(Xk) − Fk(Xk)]² ≤ 1 and f0(Xk, Y) = ∑_{r=1}^R I{Y = yr} ∫ [Fk,r(x) − Fk(x)]² dFk(x) ≤ 1, we apply Hoeffding’s inequality to obtain inequalities (A.3) and (A.4).

Since fr(i) = I{Yi = yr} for i = 1, …, n, we have fr(i) ~ Bernoulli(pr) with Efr(i) = pr and fr(1) + ··· + fr(n) ~ Binomial(n, pr), which implies Var(fr(1) + ··· + fr(n)) = npr(1 − pr) ≤ npr and |fr(i) − pr| ≤ 1. Thus, by Bernstein’s inequality, we have

ℙ{|(1/n) ∑_{i=1}^n fr(i) − Efr| ≥ ε} = ℙ{|∑_{i=1}^n (fr(i) − pr)| ≥ nε} ≤ 2 exp{−n²ε² / (2(npr + nε/3))} = 2 exp{−nε² / (2(pr + ε/3))}.

The inequality (A.5) is proved.

Note that |f0,x(i) − Ef0,x| = |I{Xki ≤ x} − Fk(x)| ≤ 1; we then apply Hoeffding’s inequality and empirical process theory (Pollard, 1984) to obtain (A.6). Note that |fr,x(i) − Efr,x| = |I{Xki ≤ x, Yi = yr} − Fk,r(x)pr| ≤ 1; we then apply Bernstein’s inequality and empirical process theory (Pollard, 1984) to obtain (A.7). This completes the proof of Lemma A.3.

Lemma A.4

Under Condition (C1), for any ε ∈ (0, 1/2) and 1 ≤ k ≤ p, we have

ℙ{|ω̂k − ωk| ≥ ε} ≤ O(n) Rn exp{−c4 nε² / Rn}  (A.8)

for some constant c4 > 0.

Proof

According to the definitions of ωk and ω̂k, we have

ω̂k − ωk = (1/n) ∑_{j=1}^n ∑_{r=1}^R p̂r [F̂kr(Xj) − F̂k(Xj)]² − ∑_{r=1}^R pr ∫ [Fkr(x) − Fk(x)]² dFk(x)
= ∑_{r=1}^R p̂r (∫ [F̂kr(x) − F̂k(x)]² dF̂k(x) − ∫ [Fkr(x) − Fk(x)]² dFk(x)) + ∑_{r=1}^R (p̂r − pr) ∫ [Fkr(x) − Fk(x)]² dFk(x)
= ∑_{r=1}^R p̂r ∫ ([F̂kr(x) − F̂k(x)]² − [Fkr(x) − Fk(x)]²) dF̂k(x) + ∑_{r=1}^R p̂r ∫ [Fkr(x) − Fk(x)]² d[F̂k(x) − Fk(x)] + ∑_{r=1}^R (p̂r − pr) ∫ [Fkr(x) − Fk(x)]² dFk(x)
=: Ik1 + Ik2 + Ik3.

We first deal with the term Ik1.

Ik1 ≤ 2 max_r ∫ |[F̂kr(x) − Fkr(x)] − [F̂k(x) − Fk(x)]| dF̂k(x) ≤ 2 max_r sup_{x∈ℝX} (|F̂kr(x) − Fkr(x)| + |F̂k(x) − Fk(x)|) =: 2(Jk1 + Jk2),

where the first inequality holds by ∑_{r=1}^R p̂r = 1 and |[F̂kr(x) − F̂k(x)] + [Fkr(x) − Fk(x)]| ≤ |F̂kr(x) − F̂k(x)| + |Fkr(x) − Fk(x)| ≤ 1 + 1 = 2, and the second inequality is implied by ∫ dF̂k(x) = 1. We first deal with the term Jk1:

Jk1 = max_r sup_{x∈ℝX} |F̂kr(x) − Fkr(x)| = max_r sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i)/p̂r − Efr,x/pr|
≤ max_r sup_{x∈ℝX} (|(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| / p̂r + Efr,x |p̂r − pr| / (p̂r pr))
≤ max_r sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| / p̂r + max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| / p̂r,

where the last inequality holds because sup_{x∈ℝX} Efr,x = sup_{x∈ℝX} ℙ(Xk ≤ x, Y = yr) = pr. Thus, under Condition (C1), for any 0 < ε < 1/2,

ℙ{Jk1 ≥ ε} ≤ ℙ{max_r sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| / p̂r + max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| / p̂r ≥ ε, min_r p̂r ≥ c1/(2Rn)} + ℙ{min_r p̂r < c1/(2Rn)}
≤ ℙ{max_r sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| + max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| ≥ c1ε/(2Rn)} + ℙ{max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| ≥ c1/(2Rn)}
≤ ℙ{max_r sup_{x∈ℝX} |(1/n) ∑_{i=1}^n fr,x(i) − Efr,x| ≥ c1ε/(4Rn)} + 2ℙ{max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| ≥ c1ε/(4Rn)}
≤ 2(n+1)Rn exp{−n(c1ε/(4Rn))² / (2(pr + c1ε/(12Rn)))} + 2Rn exp{−n(c1ε/(4Rn))² / (2(pr + c1ε/(12Rn)))}
≤ 2(n+3)Rn exp{−(c1²/32) nε² / (Rn(c2 + c1ε/12))}
≤ 2(n+3)Rn exp{−c5 nε² / Rn},  (A.9)

for some constant c5 > 0, where the second inequality holds because min_r p̂r < c1/(2Rn) implies max_r |(1/n) ∑_{i=1}^n fr(i) − Efr| = max_r |p̂r − pr| ≥ pr − p̂r ≥ c1/Rn − c1/(2Rn) = c1/(2Rn), using c1/Rn ≤ min_{1≤r≤Rn} pr in Condition (C1); the fourth inequality is due to Lemma A.3; and the fifth inequality follows from max_{1≤r≤Rn} pr ≤ c2/Rn in Condition (C1). We then apply inequalities (A.6), (A.3) and (A.4) in Lemma A.3 to obtain the following three results, respectively:

ℙ{Jk2 ≥ ε} = ℙ{sup_{x∈ℝX} |F̂k(x) − Fk(x)| ≥ ε} ≤ 2(n+1) exp{−2nε²},  (A.10)
ℙ{|Ik2| ≥ ε} = ℙ{|∑_{r=1}^R p̂r ((1/n) ∑_{i=1}^n f̄r(i) − Ef̄r)| ≥ ε} ≤ ℙ{max_r |(1/n) ∑_{i=1}^n f̄r(i) − Ef̄r| ≥ ε} ≤ 2Rn exp{−2nε²},  (A.11)
ℙ{|Ik3| ≥ ε} = ℙ{|(1/n) ∑_{i=1}^n f0(i) − Ef0| ≥ ε} ≤ 2 exp{−2nε²}.  (A.12)

Inequalities (A.9)–(A.12) together imply the result of Lemma A.4.

Proof of Theorem 2.1

For the first part of Theorem 2.1, by Lemma A.4 and Rn = O(n^κ), we have

ℙ{max_{1≤k≤p} |ω̂k − ωk| ≥ cn^{−τ}} ≤ O(n) p Rn exp{−c4c² n^{1−2τ} / Rn} ≤ O(p n Rn exp{−b n^{1−(2τ+κ)}}) ≤ O(p exp{−b n^{1−(2τ+κ)} + (1+κ) log n}),

where b > 0 is a constant depending on c, c1 and c2.

Next, we deal with the second part of Theorem 2.1. If 𝒟 ⊄ 𝒟̂, then there must exist some k ∈ 𝒟 such that ω̂k < cn^{−τ}. It then follows from Condition (C2) that |ω̂k − ωk| > cn^{−τ} for some k ∈ 𝒟, indicating that {𝒟 ⊄ 𝒟̂} ⊆ {|ω̂k − ωk| > cn^{−τ} for some k ∈ 𝒟}, and hence Dn := {max_{k∈𝒟} |ω̂k − ωk| ≤ cn^{−τ}} ⊆ {𝒟 ⊆ 𝒟̂}. Consequently,

ℙ{𝒟 ⊆ 𝒟̂} ≥ ℙ{Dn} = 1 − ℙ{Dnᶜ} = 1 − ℙ{max_{k∈𝒟} |ω̂k − ωk| > cn^{−τ}} ≥ 1 − sn ℙ{|ω̂k − ωk| > cn^{−τ}} ≥ 1 − O(sn exp{−b n^{1−(2τ+κ)} + (1+κ) log n}),

where sn is the cardinality of 𝒟. This completes the proof of the second part.

Proof of Theorem 2.2

ℙ{(min_{k∈𝒟} ω̂k − max_{k∈ℐ} ω̂k) < c3/2} ≤ ℙ{(min_{k∈𝒟} ω̂k − max_{k∈ℐ} ω̂k) − (min_{k∈𝒟} ωk − max_{k∈ℐ} ωk) < −c3/2}
≤ ℙ{|(min_{k∈𝒟} ω̂k − max_{k∈ℐ} ω̂k) − (min_{k∈𝒟} ωk − max_{k∈ℐ} ωk)| > c3/2}
≤ ℙ{2 max_{1≤k≤p} |ω̂k − ωk| > c3/2}
≤ O(n) p Rn exp{−c6 n / Rn}

for some constant c6 > 0, where the first inequality follows from Condition (C3) and the last inequality is implied by Lemma A.4. Because Rn log(p)/n = o(1) and Rn log(n)/n = o(1) imply that p ≤ exp{(c6/2) n/Rn}, that (c6/2) n/Rn ≥ 4 log(n), and that log(nRn) ≤ 2 log(n) for large n, we have, for some n0, ∑_{n=n0}^∞ n p Rn exp{−c6 n/Rn} ≤ ∑_{n=n0}^∞ exp{log(nRn) + (c6/2) n/Rn − c6 n/Rn} ≤ ∑_{n=n0}^∞ exp{log(nRn) − 4 log(n)} ≤ ∑_{n=n0}^∞ n^{−2} < +∞. Therefore, by the Borel–Cantelli Lemma, we obtain lim inf_{n→∞} {min_{k∈𝒟} ω̂k − max_{k∈ℐ} ω̂k} ≥ c3/2 > 0 a.s.

Footnotes

The content is solely the responsibility of the authors and does not necessarily represent the official views of the NNSFC or NIDA.

References

  1. Bhattacharjee A, Richards W, Staunton J, Li C, Monti S, Vasal P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark E, Lander E, Wong W, Johnson B, Golub T, Sugarbaker D, Meyerson M. Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Subclasses. PNAS. 2001;98:13790–13795. doi: 10.1073/pnas.191502998.
  2. Clemmensen L, Hastie T, Witten D, Ersboll B. Sparse Discriminant Analysis. Technometrics. 2011;53:406–415.
  3. Fan J, Fan Y. High-Dimensional Classification Using Features Annealed Independence Rules. The Annals of Statistics. 2008;36:2605–2637. doi: 10.1214/07-AOS504.
  4. Fan J, Feng Y, Song R. Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models. Journal of the American Statistical Association. 2011;106:544–557. doi: 10.1198/jasa.2011.tm09779.
  5. Fan J, Ma Y, Dai W. Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying Coefficient Models. Journal of the American Statistical Association. 2014; in press. doi: 10.1080/01621459.2013.879828.
  6. Fan J, Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space (with Discussion). Journal of the Royal Statistical Society, Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
  7. Fan J, Samworth R, Wu Y. Ultrahigh Dimensional Feature Selection: Beyond the Linear Model. Journal of Machine Learning Research. 2009;10:1829–1853.
  8. Fan J, Song R. Sure Independence Screening in Generalized Linear Models with NP-Dimensionality. The Annals of Statistics. 2010;38:3567–3604.
  9. Friedman J. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989;84:165–175.
  10. Gordon G, Jensen R, Hsiao L, Gullans S, Blumenstock J, Ramaswamy S, Richards W, Sugarbaker D, Bueno R. Translation of Microarray Data Into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma. Cancer Research. 2002;62:4963–4967.
  11. He X, Fung WK. High Breakdown Estimation for Multiple Populations with Applications to Discriminant Analysis. Journal of Multivariate Analysis. 2000;72:151–162.
  12. He X, Wang L, Hong H. Quantile-Adaptive Model-Free Variable Screening for High-Dimensional Heterogeneous Data. The Annals of Statistics. 2013;41:342–369.
  13. Huang J, Horowitz J, Ma S. Asymptotic Properties of Bridge Estimators in Sparse High-Dimensional Regression Models. The Annals of Statistics. 2008;36:587–613.
  14. Ji P, Jin J. UPS Delivers Optimal Phase Diagram in High Dimensional Variable Selection. The Annals of Statistics. 2012;40:73–103.
  15. Li G, Peng H, Zhang J, Zhu L. Robust Rank Correlation Based Screening. The Annals of Statistics. 2012;40:1846–1877.
  16. Li R, Zhong W, Zhu L. Feature Screening via Distance Correlation Learning. Journal of the American Statistical Association. 2012;107:1129–1139. doi: 10.1080/01621459.2012.695654.
  17. Liu J, Li R, Wu R. Feature Selection for Varying Coefficient Models with Ultrahigh Dimensional Covariates. Journal of the American Statistical Association. 2014;109:266–274. doi: 10.1080/01621459.2013.850086.
  18. Mai Q, Zou H. The Kolmogorov Filter for Variable Screening in High-Dimensional Binary Classification. Biometrika. 2013;100:229–234.
  19. Mai Q, Zou H. Semiparametric Sparse Discriminant Analysis in Ultra-High Dimensions. Manuscript; 2014. arXiv:1304.4983.
  20. Mai Q, Zou H, Yuan M. A Direct Approach to Sparse Discriminant Analysis in Ultra-High Dimensions. Biometrika. 2012;99:29–42.
  21. Meier L, van de Geer S, Bühlmann P. High-Dimensional Additive Modeling. The Annals of Statistics. 2009;37:3779–3821.
  22. Pan R, Wang H, Li R. On the Ultrahigh Dimensional Linear Discriminant Analysis Problem with a Diverging Number of Classes. Manuscript; 2013.
  23. Pollard D. Convergence of Stochastic Processes. New York: Springer-Verlag; 1984.
  24. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression. Proceedings of the National Academy of Sciences. 2002;99:6567–6572. doi: 10.1073/pnas.082099299.
  25. van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. New York: Springer; 1996.
  26. Wang H. Forward Regression for Ultra-High Dimensional Variable Screening. Journal of the American Statistical Association. 2009;104:1512–1524.
  27. Witten D, Tibshirani R. Penalized Classification Using Fisher’s Linear Discriminant. Journal of the Royal Statistical Society, Series B. 2011;73:753–772. doi: 10.1111/j.1467-9868.2011.00783.x.
  28. Zhu LP, Li L, Li R, Zhu LX. Model-Free Feature Screening for Ultrahigh Dimensional Data. Journal of the American Statistical Association. 2011;106:1464–1475. doi: 10.1198/jasa.2011.tm10563.
