Abstract
This work is concerned with marginal sure independence feature screening for ultrahigh dimensional discriminant analysis. In discriminant analysis the response variable is categorical, which enables us to use the conditional distribution function to construct a new index for feature screening. In this paper, we propose a marginal feature screening procedure based on the empirical conditional distribution function. We establish the sure screening and ranking consistency properties for the proposed procedure without assuming any moment condition on the predictors. The proposed procedure enjoys several appealing merits. First, it is model-free in that its implementation does not require specification of a regression model. Second, it is robust to heavy-tailed distributions of predictors and the presence of potential outliers. Third, it allows the categorical response to have a diverging number of classes of order O(n^κ) for some κ ≥ 0. We assess the finite-sample performance of the proposed procedure by Monte Carlo simulation studies and numerical comparisons. We further illustrate the proposed methodology by empirical analyses of two real-life data sets.
Keywords: Feature screening, consistency in ranking, sure screening property, ultrahigh dimensional data analysis
1. INTRODUCTION
Variable selection plays an important role in high dimensional data analysis. Marginal feature screening becomes indispensable for ultrahigh dimensional data and has received much attention in the recent literature. Various feature screening procedures have been proposed for linear models, generalized linear models and robust linear models (Fan and Lv, 2008; Wang, 2009; Fan, Samworth and Wu, 2009; Li et al., 2012). These authors demonstrated that their procedures enjoy the sure screening property in the terminology of Fan and Lv (2008). Feature screening procedures have also been proposed for nonparametric regression models. Fan, Feng and Song (2011) proposed a nonparametric marginal screening procedure for additive models based on B-spline expansion. Fan, Ma and Dai (2014) further extended the nonparametric B-spline method to varying coefficient models and proposed a marginal sure screening procedure. Liu, Li and Wu (2014) proposed a local kernel-based marginal sure screening procedure for varying coefficient models and established its sure screening property. The aforementioned model-based screening procedures perform well when the underlying models are correctly specified, but their performance may be quite poor in the presence of model mis-specification. Specifying a correct model for ultrahigh dimensional data may be challenging. Thus, model-free sure screening procedures are appealing and have been developed by several authors (Zhu et al., 2011; Li, Zhong and Zhu, 2012; He, Wang and Hong, 2013). Li, Zhong and Zhu (2012) developed a sure independence screening procedure based on the distance correlation, which is model-free; however, its sure screening property requires sub-exponential tail probability conditions on the predictors and response, and it is not robust to very heavy-tailed data with extreme values. Mai and Zou (2013) developed a sure feature screening procedure for ultrahigh dimensional predictors based on the Kolmogorov distance, but it is studied only for binary classification problems. Pan, Wang and Li (2013) proposed a pairwise sure screening procedure for linear discriminant analysis with a diverging number of classes and ultrahigh dimensional predictors; however, it is based on mean differences and does not perform well for heavy-tailed data. This work aims to develop an effective model-free and robust feature screening procedure for ultrahigh dimensional discriminant analysis with a possibly diverging number of classes.
In this paper, we propose an effective sure screening procedure for discriminant analysis. We study its theoretical properties and establish the sure screening and ranking consistency properties without assuming any moment conditions on the predictors, under the setting of ultrahigh dimensional discriminant analysis with a diverging number of response classes. Our numerical studies show that the proposed procedure has excellent performance. It enjoys several appealing properties: it is model-free, since its implementation does not require specification of a regression model, and its marginal utility can be easily evaluated without numerical optimization.
Due to its nature, the proposed procedure can be directly applied to a continuous response with categorical predictors. This is also very useful in genome-wide association studies (GWAS), in which the phenotypes (i.e., the responses) are continuous, and the single-nucleotide polymorphisms (SNPs) serving as predictors are categorical. Thus, it is also of interest to develop an effective feature screening procedure for settings in which the response is continuous while the predictors of interest are categorical. In this paper, we further extend our procedure to such settings. Some further extensions are discussed in Section 4.
The rest of this paper is organized as follows. In Section 2, we propose a new marginal utility for feature screening and further study its theoretical properties. In Section 3, we conduct Monte Carlo simulation studies to examine the finite sample performance of the proposed procedure. We further illustrate the proposed methodology by empirical analyses of real data examples. Section 4 presents some extensions of the proposed methodology. Technical proofs are given in the Appendix.
2. A NEW FEATURE SCREENING PROCEDURE
2.1. A New Index based on Conditional Distribution Function
Let Y be a categorical response with R classes {y1, y2, …, yR}, and let X be a continuous covariate with support ℝX. To investigate the dependence between X and Y, we naturally consider the conditional distribution function of X given Y, denoted by F(x|Y) = ℙ(X ≤ x|Y). Denote by F(x) = ℙ(X ≤ x) the unconditional distribution function of X and by Fr(x) = ℙ(X ≤ x|Y = yr) the conditional distribution function of X given Y = yr. If Fr(x) = F(x) for any x ∈ ℝX and r = 1, 2, …, R, then X and Y are independent. This motivates us to consider the following index,

MV(X|Y) = E_X{Var_Y[F(X|Y)]},   (2.1)

to measure the dependence between X and Y. The following proposition provides the properties of MV(X|Y).
Proposition 2.1
Let Y be a categorical random variable with R classes {y1, y2, …, yR} and pr = ℙ(Y = yr) > 0 for all r = 1, …, R. Let X be a continuous random variable with support ℝX. Denote F(x) = ℙ(X ≤ x) and Fr(x) = ℙ(X ≤ x|Y = yr). Then

(1) MV(X|Y) = Σ_{r=1}^{R} pr ∫ [Fr(x) − F(x)]^2 dF(x);

(2) MV(X|Y) = 0 if and only if X and Y are statistically independent.
The proof of this proposition is given in the Appendix. The result in (1) implies that MV(X|Y) can be represented as a weighted average of Cramér–von Mises distances between the conditional distribution function of X given Y = yr and the unconditional distribution function of X. The second remarkable property motivates us to utilize MV(X|Y) as a marginal utility for feature screening that characterizes both linear and nonlinear relationships in ultrahigh dimensional discriminant analysis.
Let {(Xi, Yi) : 1 ≤ i ≤ n} be a random sample of size n from the population (X, Y). Define p̂r = n^{−1} Σ_{i=1}^{n} I{Yi = yr}, F̂(x) = n^{−1} Σ_{i=1}^{n} I{Xi ≤ x} and F̂r(x) = (np̂r)^{−1} Σ_{i=1}^{n} I{Xi ≤ x, Yi = yr}, where I{·} is the indicator function. It is natural to estimate MV(X|Y) by its sample counterpart,

MV̂(X|Y) = n^{−1} Σ_{j=1}^{n} Σ_{r=1}^{R} p̂r [F̂r(Xj) − F̂(Xj)]^2.   (2.2)
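To make the estimator in (2.2) concrete, here is a minimal R sketch of the empirical MV index for one continuous predictor and a categorical response; the function name mv_index and its arguments are our own illustrative choices rather than code from the paper.

```r
# Empirical MV index of (2.2): an illustrative sketch, not the authors' code.
mv_index <- function(x, y) {
  # x: numeric predictor; y: categorical response (factor or vector) of the same length
  y <- as.factor(y)
  F_all <- ecdf(x)(x)                      # \hat F(X_i), i = 1, ..., n
  mv <- 0
  for (yr in levels(y)) {
    in_r <- (y == yr)
    p_r  <- mean(in_r)                     # \hat p_r
    if (p_r == 0) next                     # skip empty classes
    F_r  <- vapply(x, function(t) mean(x[in_r] <= t), numeric(1))  # \hat F_r(X_i)
    mv   <- mv + p_r * mean((F_r - F_all)^2)
  }
  mv
}
```

Under independence the statistic concentrates near zero, while larger values indicate stronger dependence.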
To gain insight into MV(X|Y), let us consider a simple example. Let X be a univariate standard normal random variable and generate random variables Zk, k = 1, 2, by Z1 = cX + ε and Z2 = cX^2 + ε, where ε ~ N(0, 1) and c is a constant controlling the signal-to-noise ratio. We then discretize each Zk into a categorical variable Yk with four equally likely classes. That is, Yk = I(Zk ≤ qk1) + 2I(qk1 < Zk ≤ qk2) + 3I(qk2 < Zk ≤ qk3) + 4I(Zk > qk3), k = 1, 2, where qk1, qk2 and qk3 are the first, second and third quartiles of Zk, respectively. Thus, the response Y1 depends on X through the linear term cX, while Y2 depends on X through the quadratic term cX^2. We set the sample size n = 200 and c = 0, 0.5, 1 and 2. Note that Yk and X are independent for each k = 1, 2 when c = 0. We then compute the variance of the conditional distribution function of X given Yk, i.e. VarYk[F(x|Yk)], for x ∈ [−2, 2] and each c. Panels (a) and (c) in Figure 1 are boxplots of VarYk[F(x|Yk)] against different c values for k = 1, 2, respectively, where the star indicates the mean over x, i.e. an estimate of MV(X|Yk). Panels (b) and (d) in Figure 1 show how VarYk[F(x|Yk)], k = 1, 2, varies across x ∈ [−2, 2] for different c values. As the signal-to-noise ratio increases, MV(X|Yk) increases. When c = 0, i.e. when X and Yk are independent, the values are nearly zero; when c > 0, they are clearly bounded away from zero. Consequently, MV(X|Y) should be an effective measure of the strength of both linear and nonlinear dependence between a continuous covariate and a categorical response.
Figure 1.
(a) Boxplot of VarY1 [F(x|Y1)] against c with the star indicating the mean; (b) Plot of VarY1 [F(x|Y1)] against x for different c values; (c) Boxplot of VarY2 [F(x|Y2)] against c with the star indicating the mean; (d) Plot of VarY2 [F(x|Y2)] against x for different c values.
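The toy example above can be reproduced along the following lines, reusing the mv_index sketch from Section 2.1; the quartile cut points and the sample size follow the description in the text, while the plotting steps behind Figure 1 are omitted.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
for (cval in c(0, 0.5, 1, 2)) {
  z1 <- cval * x   + rnorm(n)              # Y1 depends on X linearly
  z2 <- cval * x^2 + rnorm(n)              # Y2 depends on X quadratically
  # discretize each Z_k into four classes at its sample quartiles
  y1 <- cut(z1, c(-Inf, quantile(z1, c(.25, .5, .75)), Inf), labels = 1:4)
  y2 <- cut(z2, c(-Inf, quantile(z2, c(.25, .5, .75)), Inf), labels = 1:4)
  cat(sprintf("c = %.1f: MV(X|Y1) = %.4f  MV(X|Y2) = %.4f\n",
              cval, mv_index(x, y1), mv_index(x, y2)))
}
```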
2.2. Sure Independence Screening Using MV(X|Y)
We now propose a new model-free sure independence screening procedure using MV(X|Y) for ultrahigh dimensional discriminant analysis. Let Y be the response with discrete support {y1, y2, …, yR}, R ≥ 2, and let x = (X1, …, Xp)^T be the predictor vector, where p ≥ n and n is the sample size. Without specifying a regression model, define the active predictor subset by

𝒜 = {k : F(y|x) functionally depends on Xk for some y ∈ {y1, …, yR}},

and denote by 𝒜^c = {1, 2, …, p} \ 𝒜 the inactive predictor subset.
The goal is to select a reduced model of moderate size which contains 𝒜 with high probability, using an independence screening method for ultrahigh dimensional discriminant analysis. To this end, we apply the MV index to each pair (Xk, Y),

ωk = MV(Xk|Y),   k = 1, 2, …, p,

as a marginal utility to measure the importance of Xk for the response. Note that ωk = 0 if and only if Xk and Y are statistically independent. As a motivation, observe that if the partial orthogonality condition (Huang, Horowitz and Ma, 2008; Fan and Song, 2010) holds, i.e. {Xk : k ∈ 𝒜} are statistically independent of {Xk : k ∈ 𝒜^c}, then ωk is a natural and effective measure to separate the active and inactive predictor subsets, because ωk > 0 for k ∈ 𝒜 and ωk = 0 for k ∈ 𝒜^c. It also implies that MV-based variable screening is model-free, in that it is defined through conditional and unconditional distribution functions and is able to characterize both linear and nonlinear relationships between the response and the predictors.
For a random sample {(xi, Yi) : 1 ≤ i ≤ n}, we can easily estimate ωk by ω̂k = MV̂(Xk|Y) according to equation (2.2). We then propose to utilize ω̂k to choose a submodel

𝒜̂ = {k : ω̂k ≥ cn^{−τ}, 1 ≤ k ≤ p},

where c and τ are pre-determined thresholding values defined in Condition (C2) below. In practice, for a given size d < n, one can select the reduced model consisting of the d predictors with the largest ω̂k values. We refer to this procedure as MV-based sure independence screening, MV-SIS for short.
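As an illustration of the screening step, the following sketch ranks all p predictors by ω̂k and keeps the top d = [n/log n] of them; the function mv_sis and the inputs X (an n × p matrix) and y are hypothetical, and mv_index is the sketch from Section 2.1.

```r
# MV-SIS screening step: rank predictors by the empirical MV index (illustrative sketch).
mv_sis <- function(X, y, d = floor(nrow(X) / log(nrow(X)))) {
  omega_hat <- apply(X, 2, mv_index, y = y)        # \hat\omega_k, k = 1, ..., p
  selected  <- order(omega_hat, decreasing = TRUE)[seq_len(d)]
  list(omega = omega_hat, selected = sort(selected))
}
```

The retained submodel can then be passed to a second-stage method such as penalized estimation or discriminant analysis.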
Next, we study the theoretical properties of the proposed MV-SIS. Fan and Lv (2008) and Ji and Jin (2012) demonstrated that a two-stage procedure combining independence screening and penalized estimation can outperform a one-step penalized estimation approach, such as the LASSO. The effectiveness of the two-stage procedure is guaranteed by the sure screening property, namely that all active predictors are included in the reduced model with high probability. Thus, we first establish the sure screening property for MV-SIS under the following conditions.
- (C1) There exist two positive constants c1 and c2 such that c1/Rn ≤ min_{1 ≤ r ≤ Rn} pr ≤ max_{1 ≤ r ≤ Rn} pr ≤ c2/Rn. Assume that Rn = O(n^κ) for some κ ≥ 0.
- (C2) There exist constants c > 0 and 0 ≤ τ < 1/2 such that min_{k ∈ 𝒜} ωk ≥ 2cn^{−τ}.
Condition (C1) requires that the proportion of each class of the response be neither too small nor too large. The assumption Rn = O(n^κ) in Condition (C1) allows a diverging number of response classes, where the subscript n in Rn emphasizes that Rn may diverge with the sample size n. Condition (C2) assumes that the minimum true signal cannot be too small; it is of order n^{−τ}, which allows the minimum true signal to vanish as the sample size n approaches infinity. Such an assumption is typical in the feature screening literature (e.g., Condition 3 in Fan and Lv (2008), Condition (C3) in Wang (2009), and Condition (C2) in both Li, Zhong and Zhu (2012) and He, Wang and Hong (2013)). The following theorem presents the sure screening property of MV-SIS; its proof is provided in the Appendix.
Theorem 2.1
[Sure Screening Property] Under Condition (C1), for any 0 ≤ κ < 1 − 2τ, there exists a positive constant b depending on c, c1 and c2, such that
| (2.3) |
Under Conditions (C1) and (C2), we have that
| (2.4) |
where sn is the cardinality of 𝒜.
The sure screening property holds for MV-SIS under milder conditions than those for SIS (Fan and Lv, 2008) and DC-SIS (Li, Zhong and Zhu, 2012), in that we do not require the regression function of Y on x to be linear and we impose essentially no moment assumptions on the predictors. It is worth noting that MV-SIS is robust to heavy-tailed distributions of predictors and the presence of potential outliers because MV(Xk|Y) inherits the robustness of the conditional distribution function. Furthermore, the sure screening property also holds for a categorical response with a diverging number of classes. Thus, MV-SIS provides a unified alternative to existing model-based sure screening procedures for ultrahigh dimensional discriminant analysis.
According to Theorem 2.1, MV-SIS can handle the NP-dimensionality log p = O(n^α) with α < 1 − 2τ − κ, where 0 ≤ τ < 1/2 and 0 ≤ κ < 1 − 2τ; this rate depends on the minimum true signal strength and the number of response classes. If Rn is fixed, i.e. κ = 0, then the result of Theorem 2.1 improves: the probability bound in its first part decays at the rate exp(−bn^{1−2τ}) for some constant b > 0. In this case, we can handle an even larger NP-dimensionality log p = O(n^α), where α < 1 − 2τ with 0 ≤ τ < 1/2.
Remark
Condition (C1) can be relaxed by allowing c1 to tend to zero at a certain rate. To be specific, assume that c1 = O(n^{−η}) with 0 < η < 1 − 2τ − κ. Under this relaxed condition, the sure screening property remains essentially the same as before, but the convergence rate becomes slower; a smaller NP-dimensionality log p = O(n^α) with α < 1 − 2τ − κ − η is then allowed. For the proof, refer to Appendix A in the Supplement.
Another interesting property of independence screening is the ranking consistency property, in the terminology of Zhu et al. (2011). To investigate the ranking consistency property of MV-SIS, we additionally assume the following condition.
- (C3) lim inf_{n→∞} { min_{k ∈ 𝒜} ωk − max_{k ∈ 𝒜^c} ωk } ≥ 2c3, where c3 > 0 is a constant.
It is easily shown that under the partial orthogonality condition (Huang, Horowitz and Ma, 2008), which ensures ωk > 0 for k ∈ 𝒜 and ωk = 0 for k ∈ 𝒜^c, Condition (C3) naturally holds. Thus, Condition (C3) is a relatively weaker assumption than the partial orthogonality condition. It requires that the MV index is able to separate the active and inactive predictors well at the population level. The following theorem establishes the ranking consistency property of MV-SIS.
Theorem 2.2
[Ranking Consistency Property] If Conditions (C1) and (C3) hold, and Rn log(n)/n = o(1) and Rn log(p)/n = o(1), then lim inf_{n→∞} { min_{k ∈ 𝒜} ω̂k − max_{k ∈ 𝒜^c} ω̂k } > 0, a.s.
Although it requires a more restrictive condition on the difference between active and inactive signals, Theorem 2.2 delivers a stronger theoretical result than the sure screening property: the sample MV(Xk|Y) values of the active predictors are ranked above those of the inactive ones with high probability. Thus, with an ideal thresholding value, one can separate the active predictors from the inactive ones.
3. NUMERICAL STUDIES
In this section, we first assess the finite sample performance of the proposed MV-SIS by Monte Carlo simulation studies. Then, we conduct empirical analyses of two real data examples to illustrate the proposed MV-SIS procedure. Some additional numerical results are given in the Supplement.
3.1. Monte Carlo Simulations
We use the minimum model size (MMS) required to include all active predictors to measure the effectiveness of each screening approach. In addition, the proportion of replications in which an individual active predictor Xj is included, denoted by 𝒫j, and the proportion in which all active predictors are included, denoted by 𝒫a, are computed for a given model size d = [n/log n], where n is the sample size and [x] denotes the integer part of x. All numerical studies are conducted using R code.
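As a small illustration, a helper along the following lines can compute the MMS and the inclusion indicators from a vector of screening scores; the function and argument names are our own and not part of any package.

```r
# Given screening scores and the indices of truly active predictors,
# compute the minimum model size and top-d inclusion indicators (illustrative sketch).
mms_and_coverage <- function(omega, active, d) {
  rk  <- rank(-omega, ties.method = "max")   # rank 1 = largest score
  mms <- max(rk[active])                     # smallest model size covering all actives
  single <- rk[active] <= d                  # is each active predictor in the top d?
  list(MMS = mms, P_single = single, P_all = all(single))
}
```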
Example 1
(Ultrahigh Dimensional Linear Discriminant Analysis) In this example, we consider a linear discriminant analysis problem with ultrahigh dimensional predictors, following settings similar to Pan, Wang and Li (2013). For the ith observation, the categorical response Yi is generated from one of two distributions: (i) balanced, a discrete uniform distribution with R categories, where ℙ(Yi = r) = 1/R for r = 1, …, R; (ii) unbalanced, where the probabilities pr = ℙ(Yi = r) = 2[1 + (r − 1)/(R − 1)]/(3R) form an arithmetic progression with Σ_{r=1}^{R} pr = 1. For instance, when Y is binary, p1 = 1/3 and p2 = 2/3. Given Yi = r, the predictor vector Xi is generated as Xi = μr + εi, where the mean vector μr = (μr1, …, μrp) ∈ ℝ^p has rth component μrr = 3 and all other components zero, and εi = (εi1, …, εip) is a p-dimensional error term. We consider two cases for the error term: (1) εij ~ N(0, 1); (2) εij ~ t(2), independently for each j = 1, …, p. Note that Case (2) makes each predictor heavy-tailed, which is designed to examine the robustness of an independence screening method. To examine MV-SIS and its competitors systematically, we consider 2000 predictors with a binary response and n = 40, and with a 10-category response and n = 200, respectively; that is, (R, n, p) = (2, 40, 2000) and (10, 200, 2000).
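A data-generating sketch for this example is given below; it follows the description above (rth class mean with 3 in its rth coordinate, N(0, 1) or t(2) errors), and all function and variable names are our own.

```r
# Simulate one data set for Example 1 (our own illustrative implementation).
sim_lda_data <- function(n, p, R, balanced = TRUE, error = c("normal", "t2")) {
  error <- match.arg(error)
  pr <- if (balanced) rep(1 / R, R) else 2 * (1 + (seq_len(R) - 1) / (R - 1)) / (3 * R)
  y  <- sample(seq_len(R), n, replace = TRUE, prob = pr)
  X  <- if (error == "normal") matrix(rnorm(n * p), n, p) else matrix(rt(n * p, df = 2), n, p)
  for (r in seq_len(R)) X[y == r, r] <- X[y == r, r] + 3   # mu_{rr} = 3, other components 0
  list(X = X, y = y)
}

# e.g. dat <- sim_lda_data(n = 40, p = 2000, R = 2, balanced = TRUE, error = "normal")
```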
First, we compare the performance of MV-SIS with SIS (Fan and Lv, 2008), SIRS (Zhu et al., 2011), DC-SIS (Li, Zhong and Zhu, 2012), the Kolmogorov filter (Mai and Zou, 2013) and PSIS (Pan, Wang and Li, 2013) for the binary response, where X1 and X2 are the active predictors. Table 1 summarizes the median MMS with its associated robust estimate of the standard deviation (RSD = IQR/1.34) in parentheses, 𝒫j for j = 1, 2, and 𝒫a for the given model size d = [n/log n], for each method based on 500 simulations.
Table 1.
Simulation Results for Linear Discriminant Analysis with Binary Response
The first four result columns are for Case (1): εij ~ N(0, 1); the last four are for Case (2): εij ~ t(2).

| pr | Method | MMS | 𝒫1 | 𝒫2 | 𝒫a | MMS | 𝒫1 | 𝒫2 | 𝒫a |
|---|---|---|---|---|---|---|---|---|---|
| Balanced | SIS | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 2.5(9.1) | 0.79 | 0.88 | 0.71 |
| | SIRS | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 8.0(20.5) | 0.71 | 0.76 | 0.55 |
| | DC-SIS | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 2.0(0.0) | 0.99 | 0.98 | 0.97 |
| | KF | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 2.0(0.0) | 0.99 | 0.99 | 0.98 |
| | PSIS | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 2.5(9.1) | 0.79 | 0.88 | 0.71 |
| | MV-SIS | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 2.0(0.0) | 1.00 | 0.99 | 0.99 |
| Unbalanced | SIS | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 5.5(48.8) | 0.75 | 0.75 | 0.55 |
| | SIRS | 2.0(0.0) | 1.00 | 0.99 | 0.99 | 17.0(123.3) | 0.67 | 0.64 | 0.44 |
| | DC-SIS | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 2.0(1.1) | 0.95 | 0.96 | 0.92 |
| | KF | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 2.0(0.7) | 0.96 | 0.99 | 0.95 |
| | PSIS | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 5.5(48.8) | 0.75 | 0.75 | 0.55 |
| | MV-SIS | 2.0(0.0) | 1.00 | 1.00 | 1.00 | 2.0(0.7) | 0.96 | 0.99 | 0.95 |
Next, we consider the response with 10 categories, where X1, X2, …, X10 are active. Since the value of the response Y is a nominal label, SIS, SIRS and the Kolmogorov filter are not applicable, whereas MV-SIS is designed for variable screening with a multi-category response. To make DC-SIS applicable to this problem, we transform the 10-category response into 9 dummy binary variables, which together are treated as a multivariate response; note that Li, Zhong and Zhu (2012) claimed that DC-SIS can be applied to a multivariate response. Pan, Wang and Li (2013) proposed a pairwise sure independence screening (PSIS) to deal with a categorical response; PSIS utilizes |μ̂r1j − μ̂r2j| as the marginal signal of predictor Xj for each pair of classes (r1, r2), where μ̂rj denotes the sample average of Xij over {i : Yi = r}. Here we use max_{r1 ≠ r2} |μ̂r1j − μ̂r2j|, with r1, r2 = 1, 2, …, 10, as the marginal signal of predictor Xj, and denote this variant by PSIS*; a small sketch of this marginal signal is given after Table 2. Table 2 summarizes the median MMS with its associated robust standard deviation in parentheses, 𝒫j for j = 1, 2, …, 10, and 𝒫a for the given model size d = [n/log n] based on 500 simulations.
Table 2.
Simulation Results for Linear Discriminant Analysis with 10-Categorical Response
| Setting | Method | MMS | 𝒫1 | 𝒫2 | 𝒫3 | 𝒫4 | 𝒫5 | 𝒫6 | 𝒫7 | 𝒫8 | 𝒫9 | 𝒫10 | 𝒫a |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (i) Balanced, Case (1): εij ~ N(0, 1) | DC-SIS | 10.0(0.0) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 |
| | PSIS* | 10.0(0.0) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| | MV-SIS | 10.0(0.0) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| (i) Balanced, Case (2): εij ~ t(2) | DC-SIS | 15.0(21.8) | 0.86 | 0.99 | 0.99 | 0.99 | 0.97 | 0.98 | 0.99 | 0.99 | 0.99 | 0.98 | 0.74 |
| | PSIS* | 362.5(563.6) | 0.73 | 0.75 | 0.76 | 0.73 | 0.75 | 0.75 | 0.75 | 0.73 | 0.76 | 0.79 | 0.05 |
| | MV-SIS | 11.0(3.7) | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.95 |
| (ii) Unbalanced, Case (1): εij ~ N(0, 1) | DC-SIS | 13.0(14.9) | 0.82 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.82 |
| | PSIS* | 10.0(0.0) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| | MV-SIS | 10.0(0.0) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| (ii) Unbalanced, Case (2): εij ~ t(2) | DC-SIS | 126.5(248.3) | 0.35 | 0.90 | 0.93 | 0.93 | 0.96 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 0.22 |
| | PSIS* | 343.5(444.9) | 0.68 | 0.66 | 0.56 | 0.58 | 0.64 | 0.63 | 0.60 | 0.73 | 0.61 | 0.67 | 0.05 |
| | MV-SIS | 13.0(9.8) | 0.93 | 0.98 | 0.98 | 0.98 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.85 |
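For reference, the PSIS* marginal signal max_{r1 ≠ r2} |μ̂r1j − μ̂r2j| used in Table 2 can be computed as in the following sketch; the function name psis_star_signal is ours.

```r
# PSIS*: maximal pairwise difference of class-wise sample means for each predictor.
psis_star_signal <- function(X, y) {
  class_means <- apply(X, 2, function(xj) tapply(xj, y, mean))  # R x p matrix of \hat mu_{rj}
  apply(class_means, 2, function(m) max(dist(m)))               # max_{r1 != r2} |mu_r1 - mu_r2|
}
```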
In addition, we compare the post-screening estimation and prediction performance of PSIS and MV-SIS for the binary response drawn from a discrete uniform distribution. Here, we generate p = 2000 predictors and a binary response with sample sizes n = 40 and n = 80, and replicate each simulation experiment 500 times. For each replication, we follow Pan, Wang and Li (2013) and choose the model size using the BIC criterion, which exploits the equivalence between the LDA problem and a least squares problem established in Mai, Zou and Yuan (2012). We then record the model size (MS), the percentage of correct zeros (CZ), the percentage of incorrect zeros (IZ), the coverage probability (CP), and the root of the sum of squared errors (RSSE), computed from the selected model 𝒜̂r in the rth replication and the true active set 𝒜 with cardinality sn = 2, where γ0 = μ1 − μ2 = (3, −3, 0, …, 0) is the true difference between the two class means and γ̂r = μ̂1r − μ̂2r is the post-screening estimator of γ0 in the rth replication based on the selected model. Furthermore, to assess prediction performance, an independent testing data set of the same sample size is generated in each simulation. The classification accuracy (CA) of the post-screening estimator is computed in each simulation, together with the classification accuracy based on the true means, denoted by CA0, and the ratio RCA = CA/CA0. We report the median MS with its robust standard deviation in parentheses, and the averages of the other performance measures over all 500 simulations, in Table 3.
Table 3.
Simulation Results for Estimation and Prediction Performance in Linear Discriminant Analysis with Binary Response with 500 Simulations.
| n | Method | MS(RSD) | CZ(%) | IZ(%) | CP(%) | RSSE | CA(%) | CA0(%) | RCA |
|---|---|---|---|---|---|---|---|---|---|
| Case (1): εij ~ N(0, 1) | | | | | | | | | |
| 40 | PSIS | 3.0(2.9) | 99.89 | 0.00 | 100.00 | 1.31 | 95.20 | 98.41 | 96.76 |
| 40 | MV-SIS | 3.0(2.2) | 99.91 | 0.00 | 100.00 | 1.16 | 95.34 | 98.41 | 96.90 |
| 80 | PSIS | 2.0(1.5) | 99.94 | 0.00 | 100.00 | 0.70 | 97.31 | 98.31 | 98.98 |
| 80 | MV-SIS | 2.0(0.8) | 99.95 | 0.00 | 100.00 | 0.62 | 97.47 | 98.31 | 99.15 |
| Case (2): εij ~ t(2) | | | | | | | | | |
| 40 | PSIS | 6.0(2.9) | 99.76 | 19.50 | 65.00 | 3.65 | 73.42 | 89.91 | 81.81 |
| 40 | MV-SIS | 5.0(3.1) | 99.83 | 3.00 | 94.00 | 2.74 | 78.92 | 89.91 | 87.87 |
| 80 | PSIS | 7.0(4.4) | 99.71 | 7.00 | 86.40 | 2.56 | 79.17 | 89.95 | 88.04 |
| 80 | MV-SIS | 3.0(2.9) | 99.87 | 0.00 | 100.00 | 1.56 | 84.80 | 89.95 | 94.30 |
Tables 1 and 2 indicate that the proposed MV-SIS is superior to the other competitors for variable screening in linear discriminant analysis. When the error term is heavy-tailed and the number of response categories increases, MV-SIS attains much smaller minimum model sizes (MMS) and markedly higher probabilities of including all active predictors in the selected model than the other independence screening methods. The robustness of MV-SIS is thus an important feature that makes it more useful in practice. The same pattern can be observed in Table 3: MV-SIS has estimation and prediction performance very close to that of PSIS when the error term is normal, but when the error deviates from a normal distribution, PSIS deteriorates while MV-SIS still performs reasonably well.
3.2. Real Data Examples
Example 2
Lung cancer data were previously analyzed for classification between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung in Gordon et al. (2002) and Fan and Fan (2008). There are 12533 genes and 181 tissue samples from two classes: 31 in class MPM and 150 in class ADCA. The training dataset contains 32 of them (16 MPM and 16 ADCA), while the remaining 149 samples (15 MPM and 134 ADCA) are used for testing.
Before classification, we first standardize the data to zero mean and unit variance. Fan and Fan (2008) showed that their features annealed independence rules (FAIR) selected 31 important genes and made no training error and 7 testing errors, while the nearest shrunken centroids (NSC) method proposed by Tibshirani et al. (2002) chose 26 genes and resulted in no training error and 11 testing errors. We then consider DC-SIS, PSIS and our MV-SIS approach (denoted by MV-SIS1), each followed by LDA, for this ultrahigh dimensional classification problem. Note that FAIR uses diagonal linear discriminant analysis after t-test screening; to make a fair comparison, we also add a procedure combining t-test screening with LDA, denoted by FAIR*. Furthermore, the penalized LDA method (denoted by PenLDA) proposed by Witten and Tibshirani (2011) and the sparse discriminant analysis (denoted by SDA) of Clemmensen et al. (2011) are also implemented in this example for comparison. In addition, we combine our MV-SIS with SDA and consider this two-stage method as another potential approach, denoted by MV-SIS2. Similar to Example 1, the BIC criterion is applied to determine the model size for all competing methods in this binary classification problem. We summarize the classification results in Table 4. MV-SIS followed by LDA (i.e. MV-SIS1) makes 0 training errors and 5 testing errors using only the top 5 genes, and MV-SIS with SDA (i.e. MV-SIS2) performs even better than MV-SIS1 and SDA, achieving the smallest number of testing errors using only 7 genes. Thus, the two-stage approaches combining MV-SIS with LDA or SDA are superior to the other competitors in terms of classification errors and selected model size for this ultrahigh dimensional lung cancer data set.
Table 4.
Classification Errors for Lung Cancer Data in Example 2
| Method | Training Error | Testing Error | No. of Selected Genes |
|---|---|---|---|
| NSC | 0/32 | 11/149 | 26 |
| FAIR | 0/32 | 7/149 | 31 |
| FAIR* | 0/32 | 7/149 | 14 |
| PenLDA | 0/32 | 9/149 | 8 |
| SDA | 0/32 | 6/149 | 17 |
| PSIS | 1/32 | 34/149 | 4 |
| DC-SIS | 0/32 | 6/149 | 7 |
| MV-SIS1 | 0/32 | 5/149 | 5 |
| MV-SIS2 | 0/32 | 3/149 | 7 |
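The MV-SIS1 pipeline compared in Table 4, MV-SIS screening followed by LDA, can be sketched as below with MASS::lda as the second-stage classifier; the data objects, the mv_sis sketch from Section 2.2, and the fixed number of retained genes (standing in for the BIC-based size selection actually used) are all our own illustrative assumptions.

```r
library(MASS)

# Two-stage pipeline: MV-SIS screening followed by LDA (illustrative sketch).
# X_train, y_train, X_test, y_test are assumed to be standardized expression
# matrices and class labels; 'top' is a user-chosen number of retained genes.
mv_sis_lda <- function(X_train, y_train, X_test, y_test, top = 5) {
  keep <- mv_sis(X_train, y_train, d = top)$selected
  fit  <- lda(X_train[, keep, drop = FALSE], grouping = y_train)
  pred <- predict(fit, X_test[, keep, drop = FALSE])$class
  list(genes = keep, test_error = mean(pred != y_test))
}
```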
To further evaluate the prediction performance, we randomly partition all 181 tissue samples into two parts: a training set of 100 samples and a testing set of the remaining 81 samples. The above procedures are applied to the training data, and their performance is evaluated by the classification errors in both the training and testing sets. For a fair comparison, we choose the best model sizes for all methods using the same BIC criterion. We repeat the experiment 100 times, summarize the means and associated standard deviations (in parentheses) of the training and testing classification errors and the numbers of selected genes in Table 5, and display their distributions in Figure 2. MV-SIS with LDA (i.e. MV-SIS1) performs reasonably well, with both small training and testing errors using around 12 genes on average. Among all the methods, SDA classifies the training samples perfectly and achieves a small testing error rate; however, SDA tends to select a considerably larger number of genes and thus may lose some model interpretability. It is worth noting that MV-SIS with SDA (i.e. MV-SIS2) achieves the smallest testing error rate with a much smaller number of genes. This further demonstrates the merit of the two-stage approach combining MV-SIS with SDA.
Table 5.
Performance Evaluation for Lung Cancer Data in Example 2
| Method | Training Error(%) | Testing Error(%) | No. of Selected Genes |
|---|---|---|---|
| NSC | 0.87(0.90) | 1.86(1.91) | 17.52(11.36) |
| FAIR | 3.07(1.32) | 3.51(1.93) | 13.72(7.37) |
| PenLDA | 0.88(0.92) | 1.95(1.97) | 18.95(18.14) |
| SDA | 0.00(0.00) | 1.42(1.21) | 39.83(2.84) |
| PSIS | 0.06(0.24) | 2.14(1.57) | 26.49(6.85) |
| DCSIS | 0.08(0.27) | 2.63(2.30) | 15.54(12.53) |
| MVSIS1 | 0.15(0.44) | 1.77(1.91) | 11.99(9.53) |
| MVSIS2 | 0.20(0.40) | 1.41(1.10) | 11.74(6.71) |
Figure 2.
Lung Cancer Data in Example 2. (a) Boxplots of classification errors in the training sets over 100 random partitions of 181 samples; (b) Boxplots of classification errors in the testing sets; (c) Boxplots of numbers of selected genes.
Example 3
This human lung carcinoma data set was analyzed using mRNA expression profiling (Bhattacharjee et al., 2001). There are 12600 mRNA expression levels in a total of 203 snap-frozen lung tumors and normal lungs. The 203 specimens are classified into five subclasses: 139 lung adenocarcinomas (ADEN), 21 squamous cell lung carcinomas (SQUA), 6 small cell lung carcinomas (SCLC), 20 pulmonary carcinoid tumors (COID) and the remaining 17 normal lung samples (NORMAL). Before classification, we first standardize the data to zero mean and unit variance. To evaluate the prediction performance of the proposed method, we randomly select approximately 100τ% of the observations from each subclass as the training samples and the remaining 100(1 − τ)% as the testing samples, where τ ∈ (0, 1).
Note that the aforementioned NSC and FAIR are proposed only for binary classification problems, so they are not applicable to this multi-class discriminant analysis. PSIS, DC-SIS and MV-SIS, each followed by LDA, are applied to the training set, and their performance is evaluated on the testing samples. For the DC-SIS and MV-SIS (denoted by MV-SIS1) with LDA procedures, leave-one-out cross validation is applied to choose the optimal model size on the training data. We also consider the penalized LDA (denoted by PenLDA) and MV-SIS followed by SDA (denoted by MV-SIS2) for comparison, and use 10-fold cross validation rather than leave-one-out cross validation to choose the best model size in order to reduce the computation time. Although SDA can be directly applied to multi-class discriminant analysis for a given model size, searching for the best model size for SDA is computationally expensive for multi-class ultrahigh dimensional data. Thus, we use MV-SIS to reduce the dimensionality first and then apply SDA (i.e. MV-SIS2), instead of SDA alone, in this example.
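Model-size selection by leave-one-out cross validation, as used for MV-SIS1 above, can be sketched with the built-in CV option of MASS::lda; the candidate sizes and helper names are our own assumptions.

```r
library(MASS)

# Choose the number of retained genes by leave-one-out CV on the training data.
choose_size_loocv <- function(X_train, y_train, sizes = 1:40) {
  omega  <- apply(X_train, 2, mv_index, y = y_train)
  ranked <- order(omega, decreasing = TRUE)
  cv_err <- sapply(sizes, function(d) {
    keep <- ranked[seq_len(d)]
    fit  <- lda(X_train[, keep, drop = FALSE], grouping = y_train, CV = TRUE)
    mean(fit$class != y_train)             # leave-one-out classification error
  })
  sizes[which.min(cv_err)]
}
```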
Next, we choose τ = 0.9, 0.8 and repeat the experiment 100 times for each choice. Following Example 2, the means of the training and testing classification errors and the corresponding numbers of selected genes, with their associated standard deviations (in parentheses), are reported in Table 6. We can clearly observe that, although all methods perform reasonably well in this tumor classification, the MV-SIS procedures with LDA or SDA are significantly better than the other methods in terms of both training and testing classification errors and the number of selected genes. Specifically, the MV-SIS+SDA (i.e. MV-SIS2) procedure achieves the best performance using a small number of top genes. Furthermore, we find that the top genes selected by MV-SIS are not normally distributed and contain potential outliers. This observation explains why the other methods perform relatively worse and confirms the robustness of the proposed MV-SIS. This example further demonstrates that the two-stage approach combining MV-SIS with a discriminant analysis is favorable for ultrahigh dimensional data in practice.
Table 6.
Classification Errors for Lung Carcinomas Data with 5 Classes in Example 3.
| τ | Method | Training Error(%) | Testing Error(%) | No. of Selected Genes |
|---|---|---|---|---|
| 0.9 | PenLDA | 21.88(2.24) | 21.71(3.87) | 25.76(21.04) |
| | PSIS | 3.54(0.79) | 9.43(5.65) | 107.54(15.71) |
| | DC-SIS | 6.85(1.35) | 11.81(6.40) | 32.08(3.85) |
| | MV-SIS1 | 3.65(1.15) | 7.71(4.99) | 20.56(8.02) |
| | MV-SIS2 | 3.65(1.15) | 7.62(5.09) | 31.76(10.24) |
| 0.8 | PenLDA | 22.12(2.10) | 22.40(4.37) | 25.04(21.81) |
| | PSIS | 3.08(1.11) | 7.90(3.89) | 101.88(15.72) |
| | DC-SIS | 6.33(2.16) | 13.15(5.32) | 32.18(5.39) |
| | MV-SIS1 | 3.74(1.09) | 8.35(4.12) | 21.34(7.42) |
| | MV-SIS2 | 3.74(1.09) | 6.70(4.24) | 27.20(9.11) |
4. SOME EXTENSIONS
The MV-SIS approach is proposed to screen important predictors for the ultrahigh dimensional discriminant analysis where the response is categorical, but its applications can be easily extended to some other settings. In this section, we discuss two natural extensions of MV-SIS and use simulation studies to show their excellent performances.
4.1. Genome-Wide Association Studies
First, we can apply MV-SIS to ultrahigh dimensional problems with categorical predictors. In such situations, feature screening can be carried out using MV(Y|Xk), where Xk is categorical for k = 1, 2, …, p. Under Conditions (C1) and (C2), we can establish the sure screening and ranking consistency properties for ωk = MV(Y|Xk) by imposing Condition (C1) on each categorical SNP instead of on the response. In genome-wide association studies (GWAS), modern genotyping techniques allow researchers to collect genetic data that usually contain an extremely large number of single-nucleotide polymorphisms (SNPs). In general, the SNPs as predictors are categorical with three classes, denoted by {AA, Aa, aa}. In Example 4, we apply the proposed MV-SIS to the ultrahigh dimensional GWAS problem to identify important SNPs and compare its performance with other independence screening approaches.
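With a continuous response and categorical predictors, the roles in the MV index are simply swapped: the marginal utility MV(Y|Xk) can be estimated by calling the mv_index sketch from Section 2.1 with the response as the continuous argument and the SNP as the categorical one. A minimal sketch, with our own names:

```r
# Screening utility for a continuous trait y and categorical SNP columns of G.
# Each column of G holds genotype codes such as AA / Aa / aa.
snp_screen <- function(G, y) {
  apply(G, 2, function(snp) mv_index(y, snp))   # \hat MV(Y | X_k), k = 1, ..., p
}
```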
Example 4
(Genome-Wide Association Studies) To mimic SNPs with equal allele frequencies, we let Zij denote the indicator of the dominant effect of the jth SNP for the ith subject, generated by discretizing Xij at the first and third quartiles, q1 and q3, of the standard normal distribution, where Xi = (Xi1, …, Xip) ~ N(0, Σ) for i = 1, …, n, with Σ = (ρjk)p×p and ρjk = 0.5^{|j−k|} for j, k = 1, …, p. Then, we generate the response (some trait or disease) from a model with coefficients βj = (−1)^U (a + |Z|) for j = 1, …, 5, where U ~ Bernoulli(0.4) and Z ~ N(0, 1), and the error term ε follows N(0, 1) or t(1). There are 5 active SNPs, namely Z1, Z2, Z10, Z20 and Z100. The first four active SNPs are linearly correlated with the response Y, while Z100 and Y are nonlinearly correlated. It is interesting to note that the absolute value of the dominant effect, |Z100|, is the corresponding additive effect in genetics. Here, we consider five different independence screening approaches: SIS, DC-SIS, SIRS, RRCS (Li et al., 2012) and MV-SIS, set n = 200 and p = 2000, and repeat each experiment 500 times. The simulation results for d = [n/log(n)] are summarized in Table 7.
Table 7.
Simulation Results for Example 4 - GWAS Model.
| ε | Method | MMS | 𝒫1 | 𝒫2 | 𝒫10 | 𝒫20 | 𝒫100 | 𝒫a |
|---|---|---|---|---|---|---|---|---|
| N(0, 1) | SIS | 1058.0(786.9) | 0.96 | 0.97 | 1.00 | 0.99 | 0.02 | 0.02 |
| | DC-SIS | 10.0(40.1) | 0.96 | 0.95 | 1.00 | 0.99 | 0.79 | 0.72 |
| | SIRS | 1074.0(834.8) | 0.94 | 0.95 | 1.00 | 0.98 | 0.03 | 0.02 |
| | RRCS | 1031.0(801.6) | 0.96 | 0.96 | 1.00 | 0.99 | 0.03 | 0.03 |
| | MV-SIS | 8.0(34.3) | 0.96 | 0.94 | 0.99 | 0.98 | 0.89 | 0.78 |
| t(1) | SIS | 1427.0(530.4) | 0.26 | 0.28 | 0.42 | 0.42 | 0.02 | 0.00 |
| | DC-SIS | 124.0(284.8) | 0.78 | 0.75 | 0.92 | 0.91 | 0.53 | 0.32 |
| | SIRS | 1050.0(672.5) | 0.86 | 0.84 | 0.97 | 0.96 | 0.02 | 0.01 |
| | RRCS | 993.0(725.5) | 0.87 | 0.84 | 0.98 | 0.96 | 0.02 | 0.01 |
| | MV-SIS | 46.0(139.1) | 0.79 | 0.79 | 0.94 | 0.94 | 0.79 | 0.46 |
According to Table 7, when the error follows a normal distribution, all five independence screening methods are able to select the first four active SNPs effectively because these SNPs are linearly correlated with the response. However, only DC-SIS and MV-SIS can detect Z100, which contributes to Y nonlinearly. When the error is generated from t(1), which is very heavy-tailed, it is not surprising that all independence screening methods perform worse than before; however, MV-SIS still performs best. Thus, we conclude that MV-SIS can effectively select active categorical SNPs that are linearly or nonlinearly correlated with the response.
4.2. Nonparametric Additive Models
In this subsection, we further consider applying MV-SIS to an ultrahigh dimensional nonparametric additive model. Although both the response and the predictors are continuous in this setting, we can discretize each predictor Xj into a categorical variable to make MV-SIS applicable. To be specific, we can discretize Xj at its sample percentiles {τ1, …, τKn} into Kn classes, with Kn = O(n^{1/5}), and then apply MV-SIS to the discretized predictors, using the estimated MV index of the response given the discretized Xj as the marginal screening utility to measure the importance of Xj. In practice, the sample size in each discretized class cannot be too small, in order to ensure accurate estimation of the conditional distribution function; on the other hand, the number of classes cannot be too small, in order to retain as much information about the continuous variable as possible. Based on our empirical experience, we suggest that the number of samples in each class be at least 20 to obtain a decent estimate of the MV index. One can also treat the number of classes as a tuning parameter and apply cross validation to choose it. A sketch of this discretize-then-screen step is given below, and the following simulation example numerically examines the performance of the proposal.
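A sketch of the discretize-then-screen step described above is given here; the quartile-based four-class discretization mirrors Example 5, and the helper names are ours.

```r
# Discretize a continuous predictor at sample quantiles, then screen via MV(Y | discretized X_j).
discretize <- function(xj, K = 4) {
  cuts <- quantile(xj, probs = seq_len(K - 1) / K)
  cut(xj, breaks = c(-Inf, cuts, Inf), labels = seq_len(K))
}

additive_screen <- function(X, y, K = 4) {
  apply(X, 2, function(xj) mv_index(y, discretize(xj, K)))
}
```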
Example 5
(Nonparametric Additive Model) Following Meier, van de Geer and Bühlmann (2009), we define the following four functions
Then we consider the following additive model
where the predictors are generated independently from Uniform[−2.5, 2.5]. To examine the robustness of each independence screening approach, we consider two cases for the error term ε = (ε1, …, εn): (1) εi ~ N(0, 1); (2) εi ~ t(1), for i = 1, 2, …, n. In this example, besides the five approaches in Example 4, we also consider the nonparametric independence screening (NIS) proposed for sparse ultrahigh dimensional additive models by Fan, Feng and Song (2011), and the quantile-adaptive sure independence screening (QaSIS) with quantile τ = 0.5 proposed by He, Wang and Hong (2013). We set n = 200 and p = 2000 and repeat each experiment 500 times for each error case. In our simulations, we discretize each predictor into a 4-category variable using its first, second and third quartiles as cut points for MV-SIS. Simulation results for the given model size d = [n/log(n)] are reported in Table 8.
Table 8.
Simulation Results for Example 5 - Nonparametric Additive Model
| ε | Method | MMS | 𝒫1 | 𝒫2 | 𝒫3 | 𝒫4 | 𝒫a |
|---|---|---|---|---|---|---|---|
| N(0, 1) | SIS | 1084.5(690.3) | 0.17 | 0.02 | 1.00 | 1.00 | 0.00 |
| | NIS | 4.0(0) | 1.00 | 0.99 | 1.00 | 1.00 | 0.99 |
| | DC-SIS | 50.5(55.2) | 0.47 | 0.79 | 1.00 | 1.00 | 0.37 |
| | SIRS | 1178.0(668.6) | 0.15 | 0.01 | 1.00 | 1.00 | 0.00 |
| | QaSIS | 5.0(4.5) | 0.99 | 0.93 | 0.99 | 1.00 | 0.91 |
| | RRCS | 1112.5(673.9) | 0.16 | 0.03 | 1.00 | 1.00 | 0.00 |
| | MV-SIS | 4.0(1.5) | 0.99 | 0.95 | 1.00 | 1.00 | 0.95 |
| t(1) | SIS | 1508.0(538.1) | 0.04 | 0.01 | 0.44 | 0.51 | 0.00 |
| | NIS | 1056.5(932.2) | 0.25 | 0.15 | 0.22 | 0.37 | 0.08 |
| | DC-SIS | 205.0(280.1) | 0.20 | 0.33 | 0.96 | 0.96 | 0.07 |
| | SIRS | 1222.5(645.5) | 0.12 | 0.01 | 1.00 | 1.00 | 0.00 |
| | QaSIS | 16.0(37.7) | 0.93 | 0.79 | 0.93 | 1.00 | 0.69 |
| | RRCS | 1212.0(688.1) | 0.14 | 0.01 | 0.99 | 1.00 | 0.00 |
| | MV-SIS | 11.0(24.8) | 0.93 | 0.81 | 0.99 | 1.00 | 0.75 |
Table 8 indicates that MV-SIS performs very well after discretizing each predictor. When the error term is normal, NIS performs best, followed by MV-SIS and QaSIS. Although DC-SIS may detect the nonlinearity, it occasionally misses X1 and X2; a probable reason is that the distance correlation between Y and the first two predictors is relatively weak. When the error term follows a Cauchy distribution, which makes the data heavy-tailed and generates extreme points, NIS deteriorates quickly, whereas QaSIS still detects the true signals well. MV-SIS can still effectively select the active predictors and performs even better than QaSIS, which again demonstrates its robustness.
5. DISCUSSION
In this paper, we have developed a new sure screening procedure for ultrahigh dimensional discriminant analysis, in which the response is allowed to have a diverging number of categories. We further established the sure screening property and the ranking consistency property of the proposed procedure without assuming any moment condition on the predictors. The proposed procedure has several appealing properties: it is easily implemented, it is robust to model mis-specification (i.e., model-free), and it is robust to outliers and heavy tails in the predictors. The proposed procedure is also highly useful for the analysis of data collected in GWAS, in which the phenotype may be multivariate continuous while the predictors are categorical SNPs.
In the numerical studies, we applied linear discriminant analysis to the model selected by MV-SIS in the second stage. Linear discriminant analysis methods are widely used in practice and performed reasonably well in our real data analyses. However, it would also be interesting to develop a model-free and robust discriminant analysis to follow a model-free variable screening approach. This is beyond the scope of this work, but is an interesting topic for future research. Some work has been done on robust discriminant analysis; related references include regularized discriminant analysis by Friedman (1989), robust LDA based on S-estimators by He and Fung (2000), penalized linear discriminant analysis by Witten and Tibshirani (2011), and semiparametric sparse discriminant analysis by Mai and Zou (2014), among others.
Supplementary Material
Acknowledgments
The authors thank the Editor, the AE and reviewers for their constructive comments, which have greatly improved the earlier version of this paper.
Biographies
Hengjian Cui is Professor, Department of Statistics, Capital Normal University, China. hjcui@bnu.edu.cn. His research was supported by National Natural Science Foundation of China (NNSFC) grants 11071022, 11028103, 11231010 and Key project of Beijing Municipal Educational Commission and Beijing Center for Mathematics and Information Interdisciplinary Sciences.
Runze Li is Distinguished Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111. rzli@psu.edu. His research was supported by National Institute on Drug Abuse (NIDA) grants P50-DA10075 and P50 DA036107, and NNSFC grant 11028103. Wei Zhong is Assistant Professor of Wang Yanan Institute for Studies in Economics (WISE), Department of Statistics and Fujian Key Laboratory of Statistical Science, Xiamen University, China. wzhong@xmu.edu.cn. His research was supported by NNSFC grants 11301435 and 71131008.
APPENDIX
Proof of Proposition 2.1
Note that F(x|Y) = ℙ(X ≤ x|Y) is a random variable as a function of Y, which takes the value Fr(x) with probability pr = ℙ(Y = yr). Then, for any fixed x,

E_Y[F(x|Y)] = Σ_{r=1}^{R} pr Fr(x) = F(x)  and  Var_Y[F(x|Y)] = Σ_{r=1}^{R} pr [Fr(x) − F(x)]^2.

Taking the expectation with respect to X yields

MV(X|Y) = E_X{Var_Y[F(X|Y)]} = Σ_{r=1}^{R} pr ∫ [Fr(x) − F(x)]^2 dF(x),

which proves the first result. The second property follows directly from the first one: X and Y being statistically independent is equivalent to Fr(x) = F(x) for any x ∈ ℝX and r = 1, 2, …, R, which in turn is equivalent to MV(X|Y) = 0, given pr > 0 and F(x + δ) − F(x − δ) > 0 for any δ > 0 and x ∈ ℝX. This completes the proof.
To prove Theorems 2.1 and 2.2, we need the following lemmas.
Lemma A.1
[Hoeffding’s Inequality] Let X1, …, Xn be independent random variables. Assume that ℙ(Xi ∈ [ai, bi]) = 1 for 1 ≤ i ≤ n, where ai and bi are constants. Let X̄ = n^{−1} Σ_{i=1}^{n} Xi. Then the following inequality holds:

ℙ(|X̄ − E(X̄)| ≥ t) ≤ 2 exp{ −2n^2 t^2 / Σ_{i=1}^{n} (bi − ai)^2 },   (A.1)

where t is a positive constant and E(X̄) is the expected value of X̄.
Lemma A.2
[Bernstein’s Inequality] (van der Vaart and Wellner, 1996, Lemma 2.2.9) Let X1, …, Xn be independent random variables with bounded support [−M, M] and zero means. Then the following inequality holds:

ℙ(|X1 + ··· + Xn| > x) ≤ 2 exp{ −x^2 / [2(ν + Mx/3)] },   (A.2)

for ν ≥ var(X1 + ··· + Xn).
We need the following notation for the next lemma. Let Fk,r(x) = ℙ(Xk ≤ x|Y = yr) and Fk(x) = ℙ(Xk ≤ x), for 1 ≤ k ≤ p, r = 1, …, R and x ∈ ℝX. Denote f̄r(Xk, Y) = [Fk,r(Xk) − Fk(Xk)]^2. Let {(Xki, Yi) : 1 ≤ i ≤ n} be a random sample from the population (Xk, Y), and define the corresponding sample quantities for i = 1, …, n.
Lemma A.3
For any ε ∈ (0, 1) and 1 ≤ r ≤ R, the following inequalities are valid for univariate Xk
| (A.3) |
| (A.4) |
| (A.5) |
| (A.6) |
| (A.7) |
where Eh stands for Eh(Xk, Y) for a function h(Xk, Y) with finite expected value.
Proof
Since |f̄r(Xk, Y)| = [Fk,r(Xk) − Fk(Xk)]^2 ≤ 1, we can apply Hoeffding’s inequality to obtain inequalities (A.3) and (A.4).
Since for i = 1, …, n, then with and , which implies and . Thus, by Bernstein’s inequality, we have
The inequality (A.5) is proved.
Note that , then we apply Hoeffding’s inequality and empirical process theory (Pollard, 1984) to obtain (A.6). Note that , then we apply Bernstein’s inequality and empirical process theory (Pollard, 1984) to obtain (A.7). This completes the proof of Lemma A.3.
Lemma A.4
Under Condition (C1), for any ε ∈ (0, 1/2) and 1 ≤ k ≤ p, we have
| (A.8) |
for some constant c4 > 0.
Proof
According to the definitions of ωk and ω̂k, we have
We first deal with the term Ik1.
where the first inequality holds by and
|[F̂kr(x) − Fkr(x)] + [F̂k(x) − Fk(x)]| ≤ |F̂kr(x) − Fkr(x)| + |F̂k(x) − Fk(x)| ≤ 1 + 1 = 2, and the second inequality is implied by ∫dF̂k(x) = 1. Then, we first deal with the term Jk1,
where the equality holds due to . Thus, under Condition (C1), for any 0 < ε < 1/2,
| (A.9) |
for some constant c5 > 0, where the second inequality holds because minr p̂r < c1/(2Rn) together with Condition (C1) implies the stated bound, the fourth inequality is due to Lemma A.3, and the fifth inequality follows from Condition (C1). Then, we apply inequalities (A.6), (A.3) and (A.4) in Lemma A.3 to obtain the following three results, respectively,
| (A.10) |
| (A.11) |
| (A.12) |
Inequalities (A.9)–(A.12) together imply the result of Lemma A.4.
Proof of Theorem 2.1
For the first part of Theorem 2.1, by Lemma A.4 and Rn = O(n^κ), we have

where b > 0 is a constant depending on c, c1 and c2.
Next, we deal with the second part of Theorem 2.1. If 𝒜 ⊈ 𝒜̂, then there must exist some k ∈ 𝒜 such that ω̂k < cn^{−τ}. It follows from Condition (C2) that |ω̂k − ωk| > cn^{−τ} for some k ∈ 𝒜, indicating that {𝒜 ⊈ 𝒜̂} ⊆ {|ω̂k − ωk| > cn^{−τ} for some k ∈ 𝒜}, and hence Dn = {max_{k ∈ 𝒜} |ω̂k − ωk| ≤ cn^{−τ}} ⊆ {𝒜 ⊆ 𝒜̂}. Consequently, ℙ(𝒜 ⊆ 𝒜̂) ≥ ℙ(Dn) ≥ 1 − sn max_{k ∈ 𝒜} ℙ(|ω̂k − ωk| > cn^{−τ}), where sn is the cardinality of 𝒜, and the bound (2.4) then follows from Lemma A.4. This completes the proof of the second part.
Proof of Theorem 2.2
for some constant c6 > 0, where the first inequality follows from Condition (C3) and the last inequality is implied by Lemma A.4. Because Rn log(p)/n = o(1) and Rn log(n)/n = o(1), the resulting probability bounds are summable over n. Therefore, by the Borel–Cantelli lemma, we obtain that lim inf_{n→∞} { min_{k ∈ 𝒜} ω̂k − max_{k ∈ 𝒜^c} ω̂k } > 0 a.s.
Footnotes
The content is solely the responsibility of the authors and does not necessarily represent the official views of the NNSFC or NIDA.
References
- Bhattacharjee A, Richards W, Staunton J, Li C, Monti S, Vasal P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark E, Lander E, Wong W, Johnson B, Golub T, Sugarbaker D, Meyerson M. Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Subclasses. PNAS. 2001;98:13790–13795. doi: 10.1073/pnas.191502998.
- Clemmensen L, Hastie T, Witten D, Ersboll B. Sparse Discriminant Analysis. Technometrics. 2011;53:406–415.
- Fan J, Fan Y. High-Dimensional Classification Using Features Annealed Independence Rules. The Annals of Statistics. 2008;36:2605–2637. doi: 10.1214/07-AOS504.
- Fan J, Feng Y, Song R. Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models. Journal of the American Statistical Association. 2011;106:544–557. doi: 10.1198/jasa.2011.tm09779.
- Fan J, Ma Y, Dai W. Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying Coefficient Models. Journal of the American Statistical Association. 2014, in press. doi: 10.1080/01621459.2013.879828.
- Fan J, Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space (with Discussion). Journal of the Royal Statistical Society, Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
- Fan J, Samworth R, Wu Y. Ultrahigh Dimensional Feature Selection: beyond the Linear Model. Journal of Machine Learning Research. 2009;10:1829–1853.
- Fan J, Song R. Sure Independence Screening in Generalized Linear Models with NP-Dimensionality. The Annals of Statistics. 2010;38:3567–3604.
- Friedman J. Regularized Discriminant Analysis. Journal of the American Statistical Association. 1989;84:165–175.
- Gordon G, Jensen R, Hsiao L, Gullans S, Blumenstock J, Ramaswamy S, Richards W, Sugarbaker D, Bueno R. Translation of Microarray Data Into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma. Cancer Research. 2002;62:4963–4967.
- He X, Fung WK. High Breakdown Estimation for Multiple Populations with Applications to Discriminant Analysis. Journal of Multivariate Analysis. 2000;72:151–162.
- He X, Wang L, Hong H. Quantile-Adaptive Model-Free Variable Screening for High-Dimensional Heterogeneous Data. The Annals of Statistics. 2013;41:342–369.
- Huang J, Horowitz J, Ma S. Asymptotic Properties of Bridge Estimators in Sparse High-Dimensional Regression Models. The Annals of Statistics. 2008;36:587–613.
- Ji P, Jin J. UPS Delivers Optimal Phase Diagram in High Dimensional Variable Selection. The Annals of Statistics. 2012;40:73–103.
- Li G, Peng H, Zhang J, Zhu L. Robust Rank Correlation Based Screening. The Annals of Statistics. 2012;40:1846–1877.
- Li R, Zhong W, Zhu L. Feature Screening via Distance Correlation Learning. Journal of the American Statistical Association. 2012;107:1129–1139. doi: 10.1080/01621459.2012.695654.
- Liu J, Li R, Wu R. Feature Selection for Varying Coefficient Models with Ultrahigh Dimensional Covariates. Journal of the American Statistical Association. 2014;109:266–274. doi: 10.1080/01621459.2013.850086.
- Mai Q, Zou H. The Kolmogorov Filter for Variable Screening in High-Dimensional Binary Classification. Biometrika. 2013;100:229–234.
- Mai Q, Zou H. Semiparametric Sparse Discriminant Analysis in Ultra-High Dimensions. 2014. Manuscript, arXiv:1304.4983.
- Mai Q, Zou H, Yuan M. A Direct Approach to Sparse Discriminant Analysis in Ultra-High Dimensions. Biometrika. 2012;99:29–42.
- Meier L, van de Geer S, Bühlmann P. High-Dimensional Additive Modeling. The Annals of Statistics. 2009;37:3779–3821.
- Pan R, Wang H, Li R. On the Ultrahigh Dimensional Linear Discriminant Analysis Problem with A Diverging Number of Classes. 2013. Manuscript.
- Pollard D. Convergence of Stochastic Processes. Springer-Verlag New York Inc; 1984.
- Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression. Proceedings of the National Academy of Sciences. 2002;99:6567–6572. doi: 10.1073/pnas.082099299.
- van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. New York: Springer; 1996.
- Wang H. Forward Regression for Ultra-High Dimensional Variable Screening. Journal of the American Statistical Association. 2009;104:1512–1524.
- Witten D, Tibshirani R. Penalized Classification Using Fisher’s Linear Discriminant. Journal of the Royal Statistical Society: Series B. 2011;73:753–772. doi: 10.1111/j.1467-9868.2011.00783.x.
- Zhu LP, Li L, Li R, Zhu LX. Model-Free Feature Screening for Ultrahigh Dimensional Data. Journal of the American Statistical Association. 2011;106:1464–1475. doi: 10.1198/jasa.2011.tm10563.