High Dimensional Classification Using Features Annealed Independence Rules

Jianqing Fan; Yingying Fan

doi:10.1214/07-AOS504

. Author manuscript; available in PMC: 2009 Jan 23.

Published in final edited form as: Ann Stat. 2008;36(6):2605–2637. doi: 10.1214/07-AOS504

High Dimensional Classification Using Features Annealed Independence Rules

Jianqing Fan ¹, Yingying Fan ^2,^✉

PMCID: PMC2630123 NIHMSID: NIHMS72255 PMID: 19169416

Abstract

Classification using high-dimensional features arises frequently in many contemporary statistical studies such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classifications is largely poorly understood. In a seminal paper, Bickel and Levina (2004) show that the Fisher discriminant performs poorly due to diverging spectra and they propose to use the independence rule to overcome the problem. We first demonstrate that even for the independence classification rule, classification using all the features can be as bad as the random guessing due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as bad as the random guessing. Thus, it is paramountly important to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample t-statistic are established. The choice of the optimal number of features, or equivalently, the threshold value of the test statistics are proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.

Keywords: Classification, feature extraction, high dimensionality, independence rule, misclassification rates

1 Introduction

With rapid advance of imaging technology, high-throughput data such as microarray and proteomics data are frequently seen in many contemporary statistical studies. For instance, in the analysis of Microarray data, the dimensionality is frequently thousands or more, while the sample size is typically in the order of tens (West et al., 2001; Dudoit et al., 2002). See Fan and Ren (2006) for an overview. The large number of features presents an intrinsic challenge to classification problems. For an overview of statistical challenges associated with high dimensionality, see Fan and Li (2006).

Classical methods of classification break down when the dimensionality is extremely large. For example, even when the covariance matrix is known, Bickel and Levina (2004) demonstrate convincingly that the Fisher discriminant analysis performs poorly in a minimax sense due to the diverging spectra (e.g., the condition number goes to infinity as dimensionality diverges) frequently encountered in the high-dimensional covariance matrices. Even if the true covariance matrix is not ill conditioned, the singularity of the sample covariance matrix will make the Fisher discrimination rule inapplicable when the dimensionality is larger than sample size. Bickel and Levina (2004) show that the independence rule overcomes the above two problems. However, in tumor classification using microarray data, we hope to find tens of genes that have high discriminative power. The independence rule, studied by Bickel and Levina (2004), does not possess this kind of properties.

The diffculty of high-dimensional classification is intrinsically caused by the existence of many noise features that do not contribute to the reduction of misclassification rate. Though the importance of dimension reduction and feature selection has been stressed and many methods have been proposed in the literature, very little research has been done on theoretical analysis of the impacts of high dimensionality on classification. For example, using most discrimination rules such as the linear discriminants, we need to estimate the population mean vectors from the sample. When the dimensionality is high, even though each component of the population mean vectors can be estimated with accuracy, the aggregated estimation error can be very large and this has adverse effects on the misclassification rate. Therefore, when there is only a fraction of features that account for most of the variation in the data such as tumor classification using gene expression data, using all features will increase the misclassification rate.

To illustrate the idea, we study independence classification rule. Specifically, we give an explicit formula on how the signal and noise affect the misclassification rates. We show formally how large the signal to noise ratio can be such that the effect of noise accumulation can be ignored, and how small this ratio can be before the independence classifier performs as bad as the random guessing. Indeed, as demonstrated in Section 2, the impact of the dimensionality can be very drastic. For the independence rule, the misclassification rate can be as high as the random guessing even when the problem is perfectly classifiable. In fact, we demonstrate that almost all linear discriminants can not perform any better than random guessing, due to the noise accumulation in the estimation of the population mean vectors, unless the signals are very strong, namely the population mean vectors are very far apart.

The above discussion reveals that feature selection is necessary for high-dimensional classification problems. When the independence rule is applied to selected features, the resulting Feature Annealed Independent Rules (FAIR) overcome both the issues of interpretability and the noise accumulation. One can extract the important features via variable selection techniques such as the penalized quasi-likelihood function. See Fan and Li (2006) for an overview. One can also employ a simple two-sample t-test as in Tibshirani et al.(2002) to identify important genes for the tumor classification, resulting in the nearest shrunken centroids method. Such a simple method corresponds to a componentwise regression method or a ridge regression method with ridge parameters tending to ∞ (Fan and Lv, 2007). Hence, it is a specific and useful example of the penalized quasi-likelihood method for feature selection. It is surprising that such a simple proposal can indeed extract all important features. Indeed, we demonstrate that under suitable conditions, the two sample t-statistic can identify all the features that efficiently characterize both classes.

Another popular class of the dimension reduction methods is projection. They have been widely applied to the classification based on the gene expression data. See, for example, principal component analysis in Ghosh (2002), Zou et al. (2004), and Bair et al.(2004); partial least squares in Nguyen and Rocke (2002), Huang and Pan (2003), and Boulesteix (2004); and sliced inverse regression in Chiaromonte and Martinelli (2002), Antoniadis et al.(2003), and Bura and Pfeiffer (2003). These projection methods attempt to find directions that can result in small classification errors. In fact, the directions found by these methods usually put much more weights on features that have large classification power. In general, however, linear projection methods are likely to perform poorly unless the projection vector is sparse, namely, the effective number of selected features is small. This is due to the aforementioned noise accumulation prominently featured in high-dimensional problems, recalling discrimination based on linear projections onto almost all directions can perform as bad as the random guessing.

As direct application of the independence rule is not efficient, we propose a specific form of FAIR. Our FAIR selects the statistically most significant m features according to the componentwise two-sample t-statistics between two classes, and applies the independence classifiers to these m features. Interesting questions include how to choose the optimal m, or equivalently, the threshold value of t-statistic, such that the classification error is minimized, and how this classifier performs compared with the independence rule without feature selection and the oracle-assisted FAIR. All these questions will be formally answered in this paper. Surprisingly, these results are similar to those for the adaptive Neyman test in Fan (1996). The theoretical results also indicate that FAIR without oracle information performs worse than the one with oracle information, and the difference of classification error depends on the threshold value, which is consistent with the common sense.

There is a huge literature on classification. To name a few in addition to those mention before, Bai and Saranadasa (1996) dealt with the effect of high dimensionality in a two-sample problem from a hypothesis testing viewpoint; Friedman (1989) proposed a regularized discriminant analysis to deal with the problems associated with high dimension while performing computations in the regular way; Dettling and Bühlmann (2003) and Bühlmann and Yu (2003) study boosting with logit loss and L₂ loss, respectively, and demonstrate the good performances of these methods in high-dimensional setting; Greenshtein and Ritov (2004), Greenshtein (2006) and Meinshausen (2005) introduced and studied the concept of persistence, which places more emphasis on misclassification rates or expected loss rather than the accuracy of estimated parameters.

This article is organized as follows. In Section 2, we demonstrate the impact of dimensionality on the independence classification rule, and show that discrimination based on projecting observations onto almost all linear directions is nearly the same as random guessing. We establish, in Section 3, the conditions under which two sample t-test can identify all the important features with probability tending to 1. In Section 4, we propose FAIR and give an upper bound of its classification error. Simulation studies and real data analyses are conducted in Section 5. The conclusion of our study is summarized in Section 6. All proofs are given in the Appendix.

2 Impact of High Dimensionality

Consider the p-dimensional classification problem between two classes C₁ and C₂. Suppose that from class C_k, we have n_k observations Y _k1, ⋯, Y _{kn_k} in $R^{p}$ . The j-th feature of the i-th sample from class C_k satisfies the model

Y_{k i j} = μ_{k j} + ϵ_{k i j}, k = 1, 2, i = 1, \dots, n_{k}, j = 1, \dots, p,

(2.1)

where μ_kj is the mean effect of the j-th feature in class C_k and ϵ_kij is the corresponding Gaussian random noise for i-th observation. In matrix notation, the above model can be written as

Y_{k i} = μ_{k} + ϵ_{k i}, k = 1, 2, i = 1, \dots, n_{k},

where μ_k = (μ_k1, ⋯, μ_kp)′ is the mean vector of class C_k and ϵ_ki = (ϵ_ki1, ⋯, ϵ_kip)′ has the distribution N (0, Σ_k). We assume that all observations are independent across samples and in addition, within class C_k, observations Y_k1, ⋯, Y_knk are also identically distributed. Throughout this paper, we make the assumption that the two classes have compatible sample sizes, i.e., c₁ ≤ n₁/n₂ ≤ c₂ with c₁ and c₂ some positive constants.

We first investigate the impact of high dimensionality on classification. For simplicity, we temporarily assume that the two classes C₁ and C₂ have the same covariance matrix Σ. To illustrate our idea, we consider the independence classification rule, which classifies the new feature vector x into class C₁ if

δ (x) = (x - μ)' D^{- 1} (μ_{1} - μ_{2}) > 0,

where μ = (μ₁ + μ₂)/2 and D = diag(Σ). This classifier has been thoroughly studied in Bickel and Levina (2004). They showed that in the classification of two normal populations, this independence rule greatly outperforms the Fisher linear discriminant rule under broad conditions when the number of variables is large.

The independence rule depends on the marginal parameters μ₁, μ₂ and $D = diag {σ_{1}^{2}, \dots, σ_{p}^{2}}$ . They can easily be estimated from the samples:

{\hat{μ}}_{k} = \sum_{i = 1}^{n_{k}} Y_{k i} ∕ n_{k}, k = 1, 2, \hat{μ} = ({\hat{μ}}_{1} + {\hat{μ}}_{2}) ∕ 2

and

\hat{D} = diag {(S_{1 j}^{2} + S_{2 j}^{2}) ∕ 2, j = 1, \dots, p},

where $S_{k j}^{2} = \sum_{i = 1}^{n_{k}} {(Y_{k i j} - {\overset{‒}{Y}}_{k j})}^{2} ∕ (n_{k} - 1)$ is the sample variance of the j-th feature in class k and ${\overset{‒}{Y}}_{k j} = \sum_{i = 1}^{n_{k}} Y_{k i} ∕ n_{k}$ . Hence, the plug-in discrimination function is

\hat{δ} (x) = (x - \hat{μ})' {\hat{D}}^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2}) .

Denote the parameter by θ = (μ₁, μ₂, Σ). If we have a new observation X from class C₁, then the misclassification rate of $\hat{δ}$ is

W (\hat{δ}, θ) = P (\hat{δ} (X) \leq 0 ∣ Y_{k i}, i = 1, \dots, n_{k}, k = 1, 2) = 1 - Φ (Ψ),

(2.2)

where

Ψ = \frac{(μ_{1} - \hat{μ})' {\hat{D}}^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2})}{\sqrt{({\hat{μ}}_{1} - {\hat{μ}}_{2})' {\hat{D}}^{- 1} Σ {\hat{D}}^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2})}},

and Φ(·) is the standard Gaussian distribution function. The worst case classification error is

W (\hat{δ}) = \max_{θ \in Γ} W (\hat{δ}, θ),

where Γ is some parameter space to be defined. Let n = n₁ + n₂. In our asymptotic analysis, we always consider the misclassification rate of observations from C₁, since the misclassification rate of observations from C₂ can be easily obtained by interchanging n₁ with n₂ and μ₁ with μ₂. The high dimensionality is modeled through its dependence on n, namely p_n → ∞. However, we will suppress its dependence on n whenever there is no confusion.

Let R = D^-1/2 ΣD^-1/2 be the correlation matrix, and λ_max(R) be its largest eigenvalue, and α ≡ (α₁, ⋯, α_p)′ = μ₁ - μ₂. Consider the parameter space

Γ = {(α, Σ) : α' D^{- 1} α \geq C_{p}, λ_{\max} (R) \leq b_{0}, \min_{1 \leq j \leq p, k = 1, 2} σ_{k j}^{2} > 0}

where C_p is a deterministic positive sequence that depends only on the dimensionality p, and b₀ is a positive constant. Note that α′D^-1α corresponds to the overall strength of signals, and the first condition α′D^-1α ≥ C_p imposes a lower bound on the strength of signals. The second condition λ_max(R) ≤ b₀ requires that the maximum eigenvalue of R should not exceed a positive constant. But since there are no restrictions on the smallest eigenvalue of R, the condition number can still diverge. The third condition $\min_{1 \leq j \leq p, k = 1, 2} σ_{k j}^{2} > 0$ ensures that there are no deterministic features that make classification trivial and the diagonal matrix D is always invertible. We will consider the asymptotic behavior of $W (\hat{δ}, θ)$ and $W (\hat{δ})$ .

Theorem 1

Suppose that log p = o(n), n = o(p) and nC_p → ∞. Then (i) The classification error W (δ, θ) with θ ϵ Γ is bounded from above as

W (\hat{δ}, θ) \leq 1 - Φ (\frac{{[n_{1} n_{2} ∕ (p n)]}^{1 ∕ 2} α' D^{- 1} α (1 + o_{P} (1)) + \sqrt{p ∕ (n n_{1} n_{2})} (n_{1} - n_{2})}{2 \sqrt{λ_{\max} (R)} {1 + n_{1} n_{2} ∕ (p n) α' D^{- 1} α (1 + o_{P} (1))}^{1 ∕ 2}}) .

(ii) Suppose p/(nC_p) → 0. For the worst case classification error W (δ), we have

W (\hat{δ}) = 1 - Φ (\frac{1}{2} {[n_{1} n_{2} ∕ (p n b_{0})]}^{1 ∕ 2} C_{p} {1 + o_{P} (1)}) .

Specifically, when ${\frac{n_{1} n_{2}}{p n}}^{1 ∕ 2} C_{p} \to C_{0}$ with C₀ a nonnegative constant, then

W (\hat{δ}) \overset{P}{\to} 1 - Φ (C_{0} ∕ (2 \sqrt{b_{0}})) .

In particular, if C₀ = 0, then $W (\hat{δ}) \overset{P}{\to} \frac{1}{2}$ .

Theorem 1 reveals the trade-off between the signal strength C_p and the dimensionality, reflected in the term $C_{p} ∕ \sqrt{p}$ when all features are used for classification. It states that the independence rule $\hat{δ}$ would be no better than the random guessing due to noise accumulation, unless the signal levels are extremely high, say, ${\frac{n}{p}}^{1 ∕ 2} C_{p} \geq B$ for some B > 0. Indeed, discrimination based on linear projections to almost all directions performs nearly the same as random guessing, as shown in the theorem below. The poor performance is caused by noise accumulation in the estimation of μ₁ and μ₂.

Theorem 2

Suppose that a is a p-dimensional uniformly distributed unit random vector on a (p - 1)-dimensional sphere. Let λ₁, ⋯, λ_p be the eigenvalues of the covariance matrix Σ. Suppose $\lim_{p} \frac{1}{p^{2}} \sum_{j = 1}^{p} λ_{j}^{2} < \infty$ and $\lim_{p} \frac{1}{p} \sum_{j = 1}^{p} λ_{j} = τ$ with τ a positive constant. Moveover, assume that p^-1α′α → 0. Then if we project all the observations onto the vector a and use the classifier

{\hat{δ}}_{a} (x) = (a' x - a' \hat{μ}) (a' {\hat{μ}}_{1} - a' {\hat{μ}}_{2}),

(2.3)

the misclassification rate of $\hat{δ} a$ satisfies

P ({\hat{δ}}_{a} (X) \leq 0 ∣ Y_{k i}, i = 1, \dots, n_{k}, k = 1, 2) \overset{P}{\to} \frac{1}{2},

where the probability is taken with respect to a and X ϵ C₁.

3 Feature Selection by Two-Sample t-Test

To extract salient features, we appeal to the two sample t-test statistics. Other componentwise tests such as the rank sum test can also be used, but we do not pursue those in detail. The two-sample t-statistic for feature j is defined as

T_{j} = \frac{{\overset{‒}{Y}}_{1 j} - {\overset{‒}{Y}}_{2 j}}{\sqrt{S_{1 j}^{2} ∕ n_{1} + S_{2 j}^{2} ∕ n_{2}}}, j = 1, \dots, p,

(3.1)

where ${\overset{‒}{Y}}_{k j}$ and $S_{k j}^{2}$ are the same as those defined in Section 1. We work under more relaxed technical conditions: the normality assumption is not needed. Instead, we assume merely that the noise vectors ϵ_ki,i = 1, ⋯, n_k are i.i.d. within class C_k with mean 0 and covariance matrix Σ_k, and are independent between classes. The covariance matrix Σ₁ can also differ from Σ₂.

To show that the t-statistic can select all the important features with probability 1, we need the following condition.

Condition 1

Assume that the vector α = μ₁ - μ₂ is sparse and without loss of generality, only the first s entries are nonzero.
Suppose that ϵ_kij and $ϵ_{k i j}^{2} - 1$ satisfy the Cramér's condition, i.e., there exist constants ν₁, ν₂, M₁ and M₂, such that $E {∣ ϵ_{k i j} ∣}^{m} \leq m! M_{1}^{m - 2} ν_{1} ∕ 2$ and $E {∣ ϵ_{k i j}^{2} - σ_{k j}^{2} ∣}^{m} \leq m! M_{2}^{m - 2} ν_{2} ∕ 2$ for all m = 1, 2, ⋯.
Assume that the diagonal elements of both Σ₁ and Σ₂ are bounded away from 0.

The following theorem describes the situation under which the two sample t-test can pick up all important features by choosing an appropriate critical value. Recall that c₁ ≤ n₁/n₂ ≤ c₂ and n = n₁ + n₂.

Theorem 3

Let s be a sequence such that log(p - s) = o(n^γ) and $\log s = o (n^{\frac{1}{2} - γ} β_{n})$ for someβ_n → ∞ and $0 < γ < \frac{1}{3}$ . Suppose that $\min_{1 \leq j \leq s} \frac{∣ α_{j} ∣}{\sqrt{σ_{1 j}^{2} + σ_{2 j}^{2}}} = n^{- γ} β_{n}$ . Then under Condition 1, for x ~ cn^γ/2 with c some positive constant, we have

P (\min_{j \leq s} ∣ T_{j} ∣ \geq x a n d \max_{j > s} ∣ T_{j} ∣ < x) \to 1 .

In the proof of Theorem 3, we used the moderate deviation results of the two-sample t-statistic (see Cao, 2007 or Shao, 2005). Theorem 3 allows the lowest signal level to decay with sample size n. As long as the rate of decay is not too fast and the sample size is not too small, the two sample t-test can pick up all the important features with probability tending to 1.

4 Features Annealed Independence Rules

We apply the independence classifier to the selected features, resulting in a Features Annealed Independence Rule (FAIR). In many applications such as tumor classification using gene expression data, we would expect that elements in the population mean difference vector α are sparse: most entries are small. Thus, even if we could use t-test to correctly extract out all these features, the resulting choice is not necessarily optimal, since the noise accumulation can even exceed the signal accumulation for faint features. This can be seen from Theorem 1. Therefore, it is necessary to further single out the most important features that help reduce misclassification rate.

To help us select the number of features, or the critical value of the test statistic, we first consider the ideal situation that the important features are located at the first m coordinates and our task is to merely select m to minimize the misclassification rate. This is the case when we have the ideal information about the relative importance of features, as measured by |α_j|/σ_j, say. When such an oracle information is unavailable, we will learn it from the data. In the situation that we have vague knowledge about the importance of features such as tumor classification using gene expression data, we can give high ranks to features with large |α_j|/σ_j.

In the presentation below, unless otherwise specified, we assume that the two classes C₁ and C₂ are both from Gaussian distributions and the common covariance matrix is the identity, i.e., Σ₁ = Σ₂ = I. If this common covariance matrix is known, the independence classifier $\hat{δ}$ becomes the nearest centroids classifier

{\hat{δ}}_{NC} (x) = (x - \hat{μ})' ({\hat{μ}}_{1} - {\hat{μ}}_{2}) .

If only the first m dimensions are used in the classification, the corresponding features annealed independence classifier becomes

{\hat{δ}}_{NC}^{m} (x) = (x^{m} - {\hat{μ}}^{m})' ({\hat{μ}}_{1}^{m} - {\hat{μ}}_{2}^{m}),

where the superscript m means that the vector is truncated after the first m entries. This is indeed the same as the nearest shrunken centroids method of Tibshirani et al. (2002).

Theorem 4

Consider the truncated classifier ${\hat{δ}}_{N C}^{m_{n}}$ for a given sequence m_n. Suppose that $\frac{n}{\sqrt{m_{n}}} \sum_{j = 1}^{m_{n}} α_{j}^{2} \to \infty$ as m_n → ∞. Then the classification error of ${\hat{δ}}_{N C}^{m_{n}}$ is

W ({\hat{δ}}_{N C}^{m_{n}}, θ) = 1 - Φ (\frac{(1 + o_{P} (1)) \sum_{j = 1}^{m_{n}} α_{j}^{2} + m_{n} (n_{1} - n_{2}) ∕ (n_{1} n_{2})}{2 {(1 + o_{P} (1)) \sum_{j = 1}^{m_{n}} α_{j}^{2} + n m_{n} ∕ n_{1} n_{2}}^{1 ∕ 2}}),

where n = n₁ + n₂ as defined in Section 2.

In the following, we suppress the dependence of m on n when there is no confusion. The above theorem reveals that the ideal choice on the number of features is

m_{0} = {argmax}_{1 \leq m \leq p} \frac{{[\sum_{j = 1}^{m} α_{j}^{2} + m (n_{1} - n_{2}) ∕ (n_{1} n_{2})]}^{2}}{n m ∕ (n_{1} n_{2}) + \sum_{j = 1}^{m} α_{j}^{2}} .

It can be estimated as

{\hat{m}}_{0} = {argmax}_{1 \leq m \leq p} \frac{{[\sum_{j = 1}^{m} {\hat{α}}_{j}^{2} + m (n_{1} - n_{2}) ∕ (n_{1} n_{2})]}^{2}}{n m ∕ (n_{1} n_{2}) + \sum_{j = 1}^{m} {\hat{α}}_{j}^{2}},

where ${\hat{α}}_{j} = {\hat{μ}}_{1 j} - {\hat{μ}}_{2 j}$ . The expression for m₀ quantifies how the signal and the noise affect the misclassification rates as the dimensionality m increases. In particular, when n₁ = n₂, the express reduces to $m_{0} = {argmax}_{1 \leq m \leq p} \frac{{[m^{- 1 ∕ 2} \sum_{j = 1}^{m} α_{j}^{2}]}^{2}}{2 ∕ n + \sum_{j = 1}^{m} α_{j}^{2} ∕ m}$ . The term $m^{- 1 ∕ 2} \sum_{j = 1}^{m} α_{j}^{2}$ reflects the trade-off between the signal and noise as dimensionality m increases.

The good performance of the classifier ${\hat{δ}}_{NC}^{m}$ depends on the assumption that the largest entries of α cluster at the first m dimensions. An ideal version of the classifier $\hat{δ} NC$ is to select a subset $A = {j : ∣ α_{j} ∣ > a}$ and use this subset to construct independence classifier. Let m be the number of elements in $A$ . The oracle classifier can be written as

{\hat{δ}}_{orc} (x) = \sum_{j = 1}^{p} {\hat{α}}_{j} (x_{j} - {\hat{μ}}_{j}) 1_{{∣ α_{j} ∣ > a}} .

The misclassification rate is approximately

1 - Φ (\frac{\sum_{j \in A} α_{j}^{2} + m (n_{1} - n_{2}) ∕ (n_{1} n_{2})}{2 {n m ∕ (n_{1} n_{2}) + \sum_{j \in A} α_{j}^{2}}^{1 ∕ 2}}),

(4.1)

when $\frac{n}{\sqrt{m}} \sum_{j \in A} α_{j}^{2} \to \infty$ and m → ∞. This is straightforward from Theorem 4. In practice, we do not have such an oracle, and selecting the subset $A$ is difficult. A simple procedure is to use the feature annealed independence rule based on the hard thresholding:

{\hat{δ}}_{FAIR}^{b} (x) = \sum_{j = 1}^{p} {\hat{α}}_{j} (x_{j} - {\hat{μ}}_{j}) 1_{{∣ {\hat{α}}_{j} ∣ > b}} .

We study the classification error of FAIR and the impact of the threshold b on the classification result in the following theorem.

Theorem 5

Suppose that $\max_{j \in A^{c}} ∣ α_{j} ∣ < b_{n}$ and $\log (p - m) ∕ [n {(b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣)}^{2}] \to 0$ with $m = ∣ A ∣$ . Moreover, assume that $\frac{n}{\sqrt{m}} \sum_{j \in A} α_{j}^{2} \to \infty$ and $\sum_{j \in A} ∣ α_{j} ∣ ∕ [\sqrt{n} \sum_{j \in A} α_{j}^{2}] \to 0$ . Then

W ({\hat{δ}}_{F A I R}^{b_{n}}, θ) \leq 1 - Φ (\frac{(1 + o_{P} (1)) \sum_{j \in A} α_{j}^{2} + n m {(n_{1} n_{2})}^{- 1} - m b_{n}^{2}}{2 {(1 + o_{P} (1)) \sum_{j \in A} α_{j}^{2} + n m {(n_{1} n_{2})}^{- 1}}^{1 ∕ 2}}) .

Notice that the upper bound of $W ({\hat{δ}}_{FAIR}^{b_{n}}, θ)$ in Theorem 5 is greater than the classification error in Theorem 4, and the magnitude of difference depends on $m b_{n}^{2}$ . This is expected as estimating the set $A$ increases the classification error. These results are similar to those in Fan (1996) for high-dimensional hypothesis testing.

When the common covariance matrix is different from the identity, FAIR takes a slightly different form to adapt to the unknown componentwise variance:

{\hat{δ}}_{FAIR} (x) = \sum_{j = 1}^{p} {\hat{α}}_{j} (x_{j} - {\hat{μ}}_{j}) ∕ {\hat{σ}}_{j}^{2} 1_{{\sqrt{n ∕ (n_{1} n_{2})} ∣ T_{j} ∣ > b}},

(4.2)

where T_j is the two sample t-statistic. It is clear from (4.2) that FAIR works the same way as that we first sort the features by the absolute values of their t-statistics in the descending order, and then take out the first m features to classify the data. The number of features can be selected by minimizing the upper bound of the classification error given in Theorem 1. The optimal m in this sense is

m_{1} = {argmax}_{1 \leq m \leq p} \frac{1}{λ_{\max}^{m}} \frac{{[\sum_{j = 1}^{m} α_{j}^{2} ∕ σ_{j}^{2} + m (1 ∕ n_{2} - 1 ∕ n_{1})]}^{2}}{n m ∕ (n_{1} n_{2}) + \sum_{j = 1}^{m} α_{j}^{2} ∕ σ_{j}^{2}},

where $λ_{\max}^{m}$ is the largest eigenvalue of the correlation matrix R^m of the truncated observations. It can be estimated from the samples:

\begin{matrix} {\hat{m}}_{1} & = {argmax}_{1 \leq m \leq p} \frac{1}{{\hat{λ}}_{\max}^{m}} \frac{{[\sum_{j = 1}^{m} {\hat{α}}_{j}^{2} ∕ {\hat{σ}}_{j}^{2} + m (1 ∕ n_{2} - 1 ∕ n_{1})]}^{2}}{n m ∕ (n_{1} n_{2}) + \sum_{j = 1}^{m} {\hat{α}}_{j}^{2} ∕ {\hat{σ}}_{j}^{2}} \\ = {argmax}_{1 \leq m \leq p} \frac{1}{{\hat{λ}}_{\max}^{m}} \frac{n {[\sum_{j = 1}^{m} T_{j}^{2} + m (n_{1} - n_{2}) ∕ n]}^{2}}{m n_{1} n_{2} + n_{1} n_{2} \sum_{j = 1}^{m} T_{j}^{2}} . \end{matrix}

(4.3)

Note that the factor $λ_{\max}^{m}$ in (4.3) increases with m, which makes ${\hat{m}}_{1}$ usually smaller than ${\hat{m}}_{0}$ .

5 Numerical Studies

In this section we use a simulation study and three real data analyses to illustrate our theoretical results and to verify the performance of our newly proposed classifier FAIR.

5.1 Simulation Study

We first introduce the model. The covariance matrices Σ₁ and Σ₂ for the two classes are chosen to be the same. For the distribution of the error ϵ_ij in (2.1), we use the same model as that in Fan, Hall and Yao (2006). Specifically, features are divided into three groups. Within each group, features share one unobservable common factor with different factor loadings. In addition, there is an unobservable common factor among all the features across three groups. For simplicity, we assume that the number of features p is a multiple of 3. Let Z_ij be a sequence of independent N(0, 1) random variables, and $χ_{i j}^{2}$ be a sequence of independent random variables of the same distribution as $(χ_{d}^{2} - d) ∕ \sqrt{2 d}$ with $χ_{d}^{2}$ the Chi-square distribution with degrees of freedom d. In the simulation we set d = 6.

Let {a_j} and {b_j} be factor loading coefficients. Then the error in (2.1) is defined as

ϵ_{i j} = \frac{Z_{i j} + a_{1 j} χ_{1 i} + a_{2 j} χ_{2 i} + a_{3 j} χ_{3 i} + b_{j} χ_{4 i}}{{(1 + a_{1 j}^{2} + a_{2 j}^{2} + a_{3 j}^{2} + b_{j}^{2})}^{1 ∕ 2}}, i = 1, \dots, n_{k}, j = 1, \dots, p,

where a_ij = 0 except that a_1j = a_j for j = 1, ⋯, p/3, a_2j = a_j for j = (p/3) + 1, ⋯, 2p/3, and a_3j = a_j for j = (2p/3)+1, ⋯, p. Therefore, ϵ_ij = 0 and var(ϵ_ij) = 1, and in general, within group correlation is greater than the between group correlation. The factor loadings a_j and b_j are independently generated from uniform distributions U(0, 0.4) and U(0, 0.2). The mean vector μ₁ for class C₁ is taken from a realization of the mixture of a point mass at 0 and a double-exponential distribution:

(1 - c) δ_{0} + \frac{1}{2} c \exp (- 2 ∣ x ∣),

where c ϵ (0, 1) is a constant. In the simulation, we set p = 4, 500 and c = 0.02. In other words, there are around 90 signal features on an average, many of which are weak signals. Without loss of generality, μ₂ is set to be 0. Figure 1 shows the true mean difference vector α, which is fixed across all simulations. It is clear that there are only very few features with signal levels exceeding 1 standard deviation of the noise.

True mean difference vector α. x-axis represents the dimensionality, and y-axis shows the values of corresponding entries of α.

With the parameters and model above, for each simulation, we generate n₁ = 30 training data from class C₁ and n₂ = 30 training data from C₂. In addition, separate 200 samples are generated from each of the two classes in each simulation, and these 400 vectors are used as test samples. We apply our newly proposed classifier FAIR to the simulated data. Specifically, for each feature, the t-test statistic in (3.1) is calculated using the training sample. Then the features are sorted in the decreasing order of the absolute values of their t-statistics. We then examine the impact of the number of features m on the misclassification rate. In each simulation, with m ranging from 1 to 4500, we construct the feature annealed independence classifiers using the training samples, and then apply these classifiers to the 400 test samples. The classification errors are compared to those of the independence rule with the oracle ordering information, which is constructed by repeating the above procedure except that in the first step the features are ordered by their true signal levels, |α|, instead of by their t-statistics

The above procedure is repeated 100 times, and averages and standard errors of the misclassification rates (based on 400 test samples in each simulation) are calculated across the 100 simulations. Note that the average of the 100 misclassification rates is indeed computed based on 100 × 400 testing samples.

Figure 2 depicts the misclassification rate as a function of the number of features m. The solid curves represent the average of classification rates across the 100 simulations, and the corresponding dashed curves are 2 standard errors (i.e. the standard deviation of 100 misclassification rates divided by 10) away from the solid one. The misclassification rates using the first 80 features in Figure 2(a) are zoomed in Figure 2(b). Figures 2(c) and 2(d) are the same as 2(a) and 2(b) except that the features are arranged in the decreasing order of |α|, i.e., the results are based on the oracle-assisted feature annealed independence classifier. From these plots we see that the classification results of FAIR are close to those of the oracle-assisted independence classifier. Moreover, as the dimensionality m grows, the misclassification rate increases steadily due to the noise accumulation. When all the features are included, i.e. m = 4500, the misclassification rate is 0.2522, whereas the minimum classification errors are 0.0128 in plot 2(b) and 0.0020 in plot 2(d). These results are consistent with Theorem 1. We also tried to decrease the signal levels, i.e., the mean of the double exponential distribution, or to increase the dimensionality p, and found that the classification error tend to 0.5 when all the dimensions are included. Comparing Figures 2(a) and 2(b) to Figures 2(c) and 2(d), we see that the features ordered by t-statistics has higher misclassification rates than those ordered by the oracle. Also, using t-statistics results in larger minimum classification errors (see plots 2(b) and 2(d)), but the differences are not very large.

Number of features versus misclassification rates. The solid curves represent the averages of classification errors across 100 simulations. The dashed curves are 2 standard errors away from the solid curves. The x-axis represents the number of features used in the classification, and the y-axis shows the misclassification rates. (a) The features are ordered in a way such that the corresponding t-statistics are decreasing in absolute values. (b) The amplified plot of the first 80 values of x-axis in plot (a). (c) The same as (a) except that the features are arranged in a way such that the corresponding true mean differences are decreasing in absolute values. (d) The amplified plot of the first 80 values of x-axis in plot (c).

Figure 3 shows the classification errors of the independence rule based on projected samples onto randomly chosen directions across 100 simulations. Specifically, in each of the simulations in Figure 2, we generate a direction vector a randomly from the (p - 1)-dimensional unit sphere, then project all the data in that simulation onto the direction a, and finally apply the Fisher discriminant to the projected data (see (2.3)). The average of these misclassification rates is 0.4986 and the corresponding standard deviation is 0.0318. These results are consistent with our Theorem 2.

Classification errors of the independence rule based on projected samples onto randomly chosen directions over 100 simulations.

Finally, we examine the effectiveness of our proposed method (4.3) for selecting features in FAIR. In each of the 100 simulations, we apply (4.3) to choose the number of features and compute the resulting misclassification rate based on 400 test samples. We also use the nearest shrunken centroids of Tibshirani et al. (2003) to select the important features. Figure 4 summarizes these results. The thin curves correspond to the nearest shrunken centroids method, and the thick curves correspond to FAIR. Figure 4(a) presents the number of features calculated from these two methods, and Figure 4(b) shows the corresponding misclassification rates. For our newly proposed classifier FAIR, the average of the optimal number of features over 100 simulations is 29.71, which is very close to the smallest number of features with the minimum misclassification rate in Figure 2(d). The misclassification rates of FAIR in Figure 4(b) have average 0.0154 and standard deviation 0.0085, indicating the outstanding performance of FAIR. Nearest shrunken centroids method is unstable in selecting features. Over the 100 simulations, there are several realizations in which it chooses plenty of features. We truncated Figure 4 to make it easier to view. The average number of features chosen by the nearest shrunken centroids is 28.43, and the average classification error is 0.0216, with corresponding standard deviation 0.0179. It is clear that nearest shrunken centroids method tends to choose less features than FAIR, but the misclassification rates are larger.

The thick curves correspond to FAIR, while the thin curves correspond to the nearest shrunken centroids method. (a) The numbers of features chosen by (4.3) and by the nearest shrunken centroids method over 100 simulations. (b) Corresponding classification errors based on the optimal number of features chosen in (a) over 100 simulations.

5.2 Real Data Analysis

5.2.1 Leukemia Data

Leukemia data from high-density Affymetrix oligonucleotide arrays were previously analyzed in Golub et al.(1999), and are available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. There are 7129 genes and 72 samples coming from two classes: 47 in class ALL (acute lymphocytic leukemia) and 25 in class AML (acute mylogenous leukemia). Among these 72 samples, 38 (27 in class ALL and 11 in class AML) are set to be training samples and 34 (20 in class ALL and 14 in class AML) are set as test samples.

Before classification, we standardize each sample to zero mean and unit variance as done by Dudoit et al. (2002). The classification results from the nearest shrunken centroids (NSC hereafter) method and FAIR are shown in Table 1. The nearest shrunken centroids method picks up 21 genes and makes 1 training error and 3 test errors, while our method chooses 11 genes and makes 1 training error and 1 test error. Tibshirani et al.(2002) proposed and applied the nearest shrunken centroids method to the unstandardized Leukemia dataset. They chose 21 genes and made 1 training error and 2 test errors. Our results are still superior to theirs.

Table 1.

Classification errors of Leukemia data set

Method	Training error	Test error	No. of selected genes
Nearest shrunken centroids	1/38	3/34	21
FAIR	1/38	1/34	11

Open in a new tab

To further evaluate the performance of the two classifiers, we randomly split the 72 samples into training and test sets. Specifically, we set approximately 100γ% of the observations from class ALL and 100γ% of the observations from class AML as training samples, and the rest as test samples. FAIR and NSC are applied to the training data, and their performances are evaluated by the test samples. The above procedure is repeated 100 times for γ = 0.4, 0.5 and 0.6, respectively, and the distributions of test errors of FAIR, NSC and the independence rule without feature selection are summarized in Figure 5. In each of the splits, we also calculated the difference of test errors between NSC and FAIR, i.e., the test error of FAIR minus that of NSC, and the distribution is summarized in Figure 5. The top panel of Figure 6 shows the number of features selected by FAIR and NSC for γ = 0.4. The results for the other two values of γ are similar so we do not present here to save the space. From these figures we can see that the performance of independence rule improves significantly after feature selection. The classification errors of NSC and FAIR are approximately the same. As we have already noticed in the simulation study, NSC is not good with feature selection, that is, the number of features selected by NSC is very large and unstable, while the number of features selected by FAIR is quite reasonable and stable over different random splits. Clearly, the independent rule without feature selection performs poorly.

Leukemia data. Boxplots of test errors of FAIR, NSC and the independence rule without feature selection over 100 random splits of 72 samples, where 100γ% of the samples from both classes are set as training samples. The three plots from left to right correspond to γ = 0.4, 0.5 and 0.6, respectively. In each boxplot above, “FAIR” refers to the test errors of the feature annealed independent rule; “NSC” corresponds to the test errors of nearest shrunken centroids method; “diff.” means the difference of the test errors of FAIR and those of NSC; and “IR” corresponds the test errors of independence rule without feature selection.

Leukemia, Lung Cancer, and Prostate data sets. The number of features selected by FAIR and NSC over 100 random splits of the total samples. In each split, 100γ% of the samples from both class are set as training samples, and the rest are used as test samples. The three plots from top to bottom correspond to the Leukemia data with γ = 0.4, the Lung Cancer data with γ = 0.5 and the Prostate cancer data with γ = 0.6, respectively. The thin curves show the results from NSC, and the thick curves correspond to FAIR. The plots are truncated to make them easy to view.

5.2.2 Lung Cancer Data

We evaluate our method by classifying between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. Lung cancer data were analyzed by Gordon et al.(2002) and are available at http://www.chestsurg.org. There are 181 tissue samples (31 MPM and 150 ADCA). The training set contains 32 of them, with 16 from MPM and 16 from ADCA. The rest 149 samples are used for testing (15 from MPM and 134 from ADCA). Each sample is described by 12533 genes.

As in the Leukemia data set, we first standardize the data to zero mean and unit variance, and then apply the two classification methods to the standardized data set. Classification results are summarized in Table 2. Although FAIR uses 5 more genes than the nearest shrunken centroids method, it has better classification results: both methods perfectly classify the training samples, while our classification procedure has smaller test error.

Table 2.

Classification errors of Lung Cancer data

Method	Training error	Test error	No. of selected genes
Nearest shrunken centroids	0/32	11/149	26
FAIR	0/32	7/149	31

Open in a new tab

We follow the same procedure as that in Leukemia example to randomly split the 181 samples into training and test sets. FAIR and NSC are applied to the training data, and the test errors are calculated using the test data. The procedure is repeated 100 times with γ = 0.4, 0.5 and 0.6, respectively, and the test error distributions of FAIR, NSC and the independence rule without feature selection can be found in Figure 7. We also present the difference of the test errors between FAIR and NSC in Figure 7. The numbers of features used by FAIR and NSC with γ = 0.5 are shown in the middle panel of Figure 6. Figure 7 shows again that feature selection is very important in high dimensional classification. The performance of FAIR is close to NSC in terms of classification error (Figure 7), but FAIR is stable in feature selection, as shown in the middle panel of Figure 6. One possible reason of Figure 7 might be that the signal strength in this Lung Cancer dataset is relatively weak, and more features are needed to obtain the optimal performance. However, the estimate of the largest eigenvalue is not accurate anymore when the number of features is large, which results in inaccurate estimates of m₁ in (4.3).

Lung cancer data. The same as Figure 5 except that the data set is different.

5.2.3 Prostate Cancer Data

The last example uses the prostate cancer data studied in Singh et al.(2002). The data set is available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. The training data set contains 102 patient samples, 52 of which (labeled as “tumor”) are prostate tumor samples and 50 of which (labeled as “Normal”) are prostate samples. There are around 12600 genes. An independent set of test samples is from a different experiment and has 25 tumor and 9 normal samples.

We preprocess the data by standardizing the gene expression data as before. The classification results are summarized in Table 3. We make the same test error as and a bit larger training error than the nearest shrunken centroids method, but the number of selected genes we use is much less.

Table 3.

Classification errors of Prostate Cancer data set

Method	Training error	Test error	No. of selected genes
Nearest shrunken centroids	8/102	9/34	6
FAIR	10/102	9/34	2

Open in a new tab

The samples are randomly split into training and test sets in the same way as before, the test errors are calculated, and the number of features used by these two methods are recorded. Figure 8 shows the test errors of FAIR, NSC and the independence rule without feature selection, and the difference of the test errors of FAIR and NSC. The bottom panel of Figure 6 presents the numbers of features used by FAIR and NSC in each random split for γ = 0.6. As we mentioned before, the plots for γ = 0.4 and 0.5 are similar so we omit them in the paper. The performance of FAIR is better than that of NSC both in terms of classification error and in terms of the selection of features. The good performance of FAIR might be caused by the strong signal level of few features in this data set. Due to the strong signal level, FAIR can attain the optimal performance with small number of features. Thus, the estimate of m₁ in (4.3) is accurate and hence the actual performance of FAIR is good.

Prostate cancer data. The same as Figure 5 except that the data set is different.

6 Conclusion

This paper studies the impact of high dimensionality on classifications. To illustrate the idea, we have considered the independence classification rule, which avoids the difficulty of estimating large covariance matrix and the diverging condition number frequently associated with the large covariance matrix. When only a subset of the features capture the characteristics of two groups, classification using all dimensions would intrinsically classify the noises. We prove that classification based on linear projections onto almost all directions performs nearly the same as random guessing. Hence, it is necessary to choose direction vectors which put more weights on important features.

The two-sample t-test can be used to choose the important features. We have shown that under mild conditions, the two sample t-test can select all the important features with probability one. The features annealed independence rule using hard thresholding, FAIR, is proposed, with the number of features selected by a data-driven rule. An upper bound of the classification error of FAIR is explicitly given. We also give suggestions on the optimal number of features used in classification. Simulation studies and real data analysis support our theoretical results convincingly.

Acknowledgments

Financial support from the NSF grants DMS-0354223 and DMS-0704337 and NIH grant R01-GM072611 is gratefully acknowledged. The authors acknowledge gratefully the helpful comments of referees that led to the improvement of the presentation and the results of the paper.

7 Appendix

Proof of Theorem 1

For θ ∈ Γ, Ψ defined in (2.2) can be bounded as

Ψ \geq \frac{(μ_{1} - \hat{μ})' {\hat{D}}^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2})}{\sqrt{b_{0} ({\hat{μ}}_{1} - {\hat{μ}}_{2})' {\hat{D}}^{- 1} D {\hat{D}}^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2})}},

(7.1)

where we have used the assumption that λ_max(R) ≤ b₀. Denote by

\tilde{Ψ} = \frac{(μ_{1} - \hat{μ})' {\hat{D}}^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2})}{\sqrt{({\hat{μ}}_{1} - {\hat{μ}}_{2})' {\hat{D}}^{- 1} {D \hat{D}}^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2})}} .

We next study the asymptotic behavior of $\tilde{Ψ}$ .

Since Condition 1(b) in Section 3 is satisfied automatically for normal distribution, by Lemma 2 below we have $\hat{D} = D (1 + o_{P} (1))$ , where o_P(1) holds uniformly across all diagonal elements. Thus, the right hand side of (7.1) can be written as

\frac{1}{\sqrt{b_{0}}} \frac{(μ_{1} - \hat{μ})' {\hat{D}}^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2})}{\sqrt{({\hat{μ}}_{1} - {\hat{μ}}_{2})' D^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2})}} (1 + o_{P} (1)) .

We first consider the denominator. Notice that it can be decomposed as

\begin{matrix} ({\hat{μ}}_{1} - {\hat{μ}}_{2})' D^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2}) & = α' D^{- 1} α + 2 \sum α_{j} \frac{{\hat{ϵ}}_{1 j} - {\hat{ϵ}}_{2 j}}{σ_{j}^{2}} + \sum \frac{{({\hat{ϵ}}_{1 j} - {\hat{ϵ}}_{2 j})}^{2}}{σ_{j}^{2}} \\ = α' D^{- 1} α + 2 \sum \frac{α_{j}}{σ_{j}^{2}} ({\hat{ϵ}}_{1 j} - {\hat{ϵ}}_{2 j}) + \sum \frac{{({\hat{ϵ}}_{1 j} - {\hat{ϵ}}_{2 j})}^{2}}{σ_{j}^{2}} \\ \equiv α' D^{- 1} α + 2 (1 + o_{P} (1)) I_{1} + I_{2}, \end{matrix}

(7.2)

where $σ_{j}^{2}$ is the j-th diagonal entry of D, ${\hat{σ}}_{j}^{2}$ , is the j-th diagonal entry of $\hat{D}$ , and ${\hat{ϵ}}_{k j} = \sum_{i = 1}^{n_{k}} ϵ_{k i j} ∕ n_{k}, k = 1, 2$ . Notice that ${\hat{ϵ}}_{1} - {\hat{ϵ}}_{2} ~ N (0, \frac{n}{n_{1} n_{2}} Σ)$ . By singular value decomposition we have

R = Q_{R} V_{R} Q_{R}^{'},

where Q_R is orthogonal matrix and $V_{R} = diag {λ_{R, 1}, \dots, λ_{R, p}}$ be the eigenvalues of the correlation matrix R. Define $\tilde{ϵ} = \sqrt{n_{1} n_{2} ∕ n} V_{R}^{- 1 ∕ 2} Q_{R}^{'} D^{- 1 ∕ 2} ({\hat{ϵ}}_{1} - {\hat{ϵ}}_{2})$ , then $\tilde{ϵ} ~ N (0, I)$ . Hence,

I_{2} = ({\hat{ϵ}}_{1} - {\hat{ϵ}}_{2})' D^{- 1} ({\hat{ϵ}}_{1} - {\hat{ϵ}}_{2}) = \frac{n}{n_{1} n_{2}} \tilde{ϵ}' V_{R} \tilde{ϵ} .

Since $\sum_{i = 1}^{p} λ_{R, i} = p$ and $λ_{R, i} \geq 0$ for all i = 1, ⋯, p, we have $\frac{1}{p^{2}} \sum_{i = 1}^{p} λ_{R, i}^{2} < \infty$ . By weak law of large number we have

n_{1} n_{2} I_{2} ∕ [p n] \overset{P}{\to} 1 as n \to \infty, p \to \infty .

(7.3)

Next, we consider I₁. Note that I₁ has the distribution $I_{1} \sim N (0, \frac{n}{n_{1} n_{2}} {α' D}^{- 1} {Σ D}^{- 1} α)$ . Since $λ_{\max} \leq b_{0}, n {α' D}^{- 1} α \geq n C_{p} \to \infty$ and

α' D^{- 1} {Σ D}^{- 1} α = α' D^{- 1 ∕ 2} {RD}^{- 1 ∕ 2} α \leq λ_{\max} (R) α' D^{- 1} α,

we have $I_{1} = {α' D}^{- 1} α o_{P} (1)$ . This together with (7.2) and (7.3) yields

\frac{n_{1} n_{2}}{p n} ({\hat{μ}}_{1} - {\hat{μ}}_{2})' {\hat{D}}^{1} ({\hat{μ}}_{1} - {\hat{μ}}_{2}) = 1 + \frac{n_{1} n_{2}}{p n} \sum_{j = 1}^{p} \frac{α_{j}^{2}}{σ_{j}^{2}} (1 + o_{P} (1)) .

(7.4)

Now, we consider the numerator. It can be decomposed as

\begin{matrix} (μ_{1} - \hat{μ})' {\hat{D}}^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2}) & = \frac{1}{2} α' {\hat{D}}^{- 1} α - \sum \frac{α_{j}}{{\hat{σ}}_{j}^{2}} ({\hat{ϵ}}_{2 j}) - \frac{1}{2} (1 + o_{P} (1)) \sum {\hat{ϵ}}_{1 j}^{2} ∕ σ_{j}^{2} + \frac{1}{2} (1 + o_{P} (1)) \sum {\hat{ϵ}}_{2 j}^{2}) ∕ σ_{j}^{2} \\ \equiv \frac{1}{2} α' D^{- 1} α (1 + o_{P} (1)) - I_{3} - \frac{1}{2} (1 + o_{P} (1)) I_{4} + \frac{1}{2} (1 + o_{P} (1)) I_{5} . \end{matrix}

Denote by ${\tilde{I}}_{3} = \sum \frac{α_{j}}{σ_{j}^{2}} ({\hat{ϵ}}_{2 j})$ . Note that

\max ∣ \frac{α_{j}}{σ_{j}^{2}} ({\hat{ϵ}}_{2 j}) - \frac{α_{j}}{{\hat{σ}}_{j}^{2}} ({\hat{ϵ}}_{2 j}) ∣ \leq \max ∣ \frac{σ_{j}^{2}}{{\hat{σ}}_{j}^{2}} - 1 ∣ \max ∣ \frac{α_{j}}{σ_{j}^{2}} {\hat{ϵ}}_{2 j} ∣ = o_{P} (1) \max ∣ \frac{α_{j}}{σ_{j}^{2}} {\hat{ϵ}}_{2 j} ∣ .

(7.5)

Define $F_{j} = \sqrt{n_{2}} \frac{α_{j}}{σ_{j}^{2}} {\hat{ϵ}}_{2 j} ∕ {α' D}^{- 1} α$ , then $σ_{F_{j}}^{2} \equiv var (F_{j}) \leq 1$ for all j. For the normal distribution, we have the following tail probability inequality

1 - Φ (x) \leq \frac{1}{\sqrt{2 π}} \frac{1}{x} e^{- x^{2} ∕ 2} .

Since $F_{j} \sim N (0, σ_{F_{j}}^{2})$ , by the above inequality we have

P (∣ F_{j} ∣ \geq x) \leq 2 \exp {- \frac{x^{2}}{2 C}} .

with C some positive constant, for all x > 0 and j = 1,⋯, p. By Lemma 2.2.10 of van de Vaart and Wellner (1996, P102), we have

{(α' D^{- 1} α)}^{- 1} E \max ∣ \frac{α_{j}}{σ_{j}^{2}} {\hat{ϵ}}_{2 j} ∣ = n_{2}^{- 1} E \max_{j \leq p} ∣ F_{j} ∣ \leq K \sqrt{C \log (p + 1) ∕ n_{2}} \overset{P}{\to} 0,

where K is some universal constant. This together with (7.5) ensures that

{(α' D^{- 1} α)}^{- 1} \max ∣ \frac{α_{j}}{σ_{j}^{2}} ({\hat{ϵ}}_{2 j}) - \frac{α_{j}}{{\hat{σ}}_{j}^{2}} ({\hat{ϵ}}_{2 j}) ∣ = o_{P} (1)

Hence,

I_{3} = {\tilde{I}}_{3} + α' D^{- 1} α o_{P} (1) .

(7.6)

Now we only need to consider ${\tilde{I}}_{3}$ . Note that ${\tilde{I}}_{3} = \sum \frac{α_{j}}{σ_{j}^{2}} {\hat{ϵ}}_{2 j} \sim N (0, \frac{1}{n_{2}} {α' D}^{- 1} {Σ D}^{- 1} α)$ . Since the variance term can be bounded as

α' D^{- 1} {Σ D}^{- 1} α \leq λ_{\max} (R) α' D^{- 1} α,

By the assumption that $n {α' D}^{- 1} α \to \infty$ and λ_max(R) is bounded, we have ${\tilde{I}}_{3} = \frac{1}{2} {α' D}^{- 1} α o_{P} (1)$ . Combining this with (7.6) leads to

I_{3} = \frac{1}{2} α' D^{- 1} α o_{P} (1)

We now examine I₄ and I₅. By the similar proof to (7.3) above we have

I_{4} = p ∕ n_{1} + o_{P} (\sqrt{n p ∕ (n_{1} n_{2})}) and I_{5} = p ∕ n_{2} + o_{P} (\sqrt{n p ∕ (n_{1} n_{2})}) .

Thus the numerator can be written as

(μ_{1} - \hat{μ})' {\hat{D}}^{- 1} ({\hat{μ}}_{1} - {\hat{μ}}_{2}) = (1 + o_{P} (1)) \frac{1}{2} \sum α_{j}^{2} ∕ σ_{j}^{2} - (p ∕ n_{1} - p ∕ n_{2}) ∕ 2 + o_{P} (\sqrt{n p ∕ (n_{1} n_{2})}) .

and by (7.4)

\tilde{Ψ} = \frac{\sqrt{\frac{n_{1} n_{2}}{p n}} \sum α_{j}^{2} ∕ σ_{j}^{2} (1 + o_{P} (1)) + \sqrt{p ∕ (n n_{1} n_{2})} (n_{1} - n_{2})}{2 {1 + \frac{n_{1} n_{2}}{p n} \sum α_{j}^{2} ∕ σ_{j}^{2} (1 + o_{P} (1))}^{1 ∕ 2}} .

Since $\frac{a x}{\sqrt{1 + a^{2} x}}$ is an increasing function of x and $\sum \frac{α_{j}^{2}}{σ_{j}^{2}} \geq C_{p}$ , in view of (7.1) and the definition of the parameter space Γ, we have

W (\hat{δ}) = 1 - Φ (\frac{{[n_{1} n_{2} ∕ (p n)]}^{1 ∕ 2} C_{p} {1 + o_{P} (1) + \frac{p (n_{1} - n_{2})}{n_{1} n_{2} C_{p}}}}{2 \sqrt{b_{0}} {1 + n_{1} n_{2} ∕ (p n) C_{p} (1 + o_{P} (1))}^{1 ∕ 2}}) .

If p/(nC_p) → 0, then $W (\hat{δ}) = 1 - Φ (\frac{1}{2} {[n_{1} n_{2} ∕ (p n b_{0})]}^{1 ∕ 2} C_{p} {1 + o_{P} (1)})$ . Furthermore, if ${\frac{n_{1} n_{2}}{p n}}^{1 ∕ 2} C_{p} \to C_{0}$ with C₀ some constant, then

W (\hat{δ}) \overset{P}{\to} 1 - Φ (\frac{C_{0}}{2 \sqrt{b_{0}}});

This completes the proof.

Proof of Theorem 2

suppose we have a new observation X from class C₁. Then the posterior classification error of using ${\hat{δ}}_{a} (\cdot)$ is

\begin{matrix} W (δ_{a}, θ) & = E^{a} [P (δ_{a} (X) < 0 ∣ Y_{k i}, i = 1, \dots, n_{k}, k = 1, 2, a)] \\ = 1 - E^{a} Φ (Ψ_{a} sign (a' {\hat{μ}}_{1} - a' {\hat{μ}}_{2})), \end{matrix}

Where $Ψ_{a} = \frac{a' μ_{1} - a' \hat{μ}}{\sqrt{a' Σ a}}, Φ (\cdot)$ is the standard Gaussian distribution function, and E^a means expectation taken with respect to a. We are going to show that

Ψ_{a} \overset{P}{\to} 0,

(7.7)

which together with the continuity of ·(.) and the dominated convergence theorem gives

\lim_{p} E^{a} Φ (Ψ_{a} sign (a' {\hat{μ}}_{1} - a' {\hat{μ}}_{2})) = 1 ∕ 2 .

Therefore, the posterior error $W ({\hat{δ}}_{a}, θ)$ is no better than the random guessing.

Now, let us prove (7.7). Note that the random vector a can be written as

a = Z ∕ ∥ Z ∥,

where Z is a p-dimensional standard Gaussian distributed random vector, independent of all the observations Y _ki and X. Therefore,

Ψ_{a} = \frac{a' μ_{1} - a' \hat{μ}}{\sqrt{a' Σ a}} = \frac{Z' α ∕ \sqrt{p} - \sqrt{\frac{n}{n_{1} n_{2} p}} Z' [\sqrt{\frac{n_{1} n_{2}}{n}} ({\hat{ϵ}}_{1} + {\hat{ϵ}}_{2})]}{2 \sqrt{Z' Σ Z ∕ p}},

(7.8)

where $α = μ_{1} - μ_{2}$ and ${\hat{ϵ}}_{k} = \frac{1}{n_{k}} \sum_{i = 1}^{n_{k}} ϵ_{k i}, k = 1, 2$ . By the singular value decomposition we have

Σ = Q' VQ,

where Q is an orthogonal matrix and $V = diag {λ_{1}, \dots, λ_{p}}$ is a diagonal matrix. Let $\tilde{Z} = QZ$ , then $\tilde{Z}$ is also a p-dimensional standard Gaussian random vector. Hence the denominator of Ψ_a can be written as

2 \sqrt{Z' Σ Z ∕ p} = 2 {(\frac{1}{p} \sum_{j = 1}^{p} λ_{j} {\tilde{Z}}_{j}^{2})}^{1 ∕ 2},

where ${\tilde{Z}}_{j}$ is the j-th entry of $\tilde{Z}$ . Since it is assumed that $\lim_{p} \frac{1}{p^{2}} \sum_{j = 1}^{p} λ_{j}^{2} < \infty$ and $\lim_{p} \frac{1}{p} \sum_{j = 1}^{p} λ_{j} = τ$ for some positive constant τ, by the weak law of large numbers, we have

\frac{1}{p} \sum_{j = 1}^{G} λ_{j} {\tilde{Z}}_{j}^{2} \overset{P}{\to} τ .

(7.9)

Next, we study the numerator of Ψ_a in (7.8). Since $\frac{1}{p} \sum_{j = 1}^{p} α_{j}^{2} \to 0$ , the first term of the numerator converges to 0 in probability, i.e.,

Z' α ∕ \sqrt{p} \overset{P}{\to} 0 .

(7.10)

Let $ε = \sqrt{\frac{n_{1} n_{2}}{n}} ({\hat{ϵ}}_{1} + {\hat{ϵ}}_{2})$ and $\tilde{ε} = V^{- 1 ∕ 2} Q ε$ , then $\tilde{ε}$ has distribution N(0, I) and is independent of $\tilde{Z}$ . The second term of the numerator can be written as

Z' [\sqrt{n_{1} n_{2} ∕ n} ({\hat{ϵ}}_{1} + {\hat{ϵ}}_{2})] = {\tilde{Z}' V}^{1 ∕ 2} \tilde{ε} = \sum_{j = 1}^{p} \sqrt{λ_{j}} {\tilde{Z}}_{j} {\tilde{ε}}_{j} .

Since $\frac{n}{n_{1} n_{2} p} \sum_{j = 1}^{p} λ_{j} \to 0 < \infty$ , it follows from the weak law of large number that

\sqrt{\frac{n}{n_{1} n_{2} p}} \sum_{j = 1}^{p} \sqrt{λ_{j}} {\tilde{Z}}_{j} {\tilde{ε}}_{j} \overset{P}{\to} 0 .

This together with (7.8), (7.9), and (7.10) completes the proof.

We need the the following two lemmas to prove Theorem 3.

Lemma 1

[Cao (2005)] Let n = n₁ + n₂. Assume that there exist 0 < c₁ ≤ c₂ < 1 such that c₁ ≤ n₁/n₂ ≤ c₂. Let ${\tilde{T}}_{j} = T_{j} - \frac{μ_{j 1} - μ_{j 2}}{\sqrt{S_{1 j}^{2} ∕ n_{1} + S_{1 j}^{2} ∕ n_{2}}}$ . Then for any $x \equiv x (n_{1}, n_{2})$ satisfying x → ∞ and x = o(n^1/2),

\log P ({\tilde{T}}_{j} \geq x) \sim - x^{2} ∕ 2, a s n_{1}, n_{2} \to \infty .

If in addition, if we have only $E {∣ Y_{1 i j} ∣}^{3} < \infty$ and $E {∣ Y_{2 i j} ∣}^{3} < \infty$ , then

\frac{P ({\tilde{T}}_{j} \geq x)}{1 - Φ (x)} = 1 + O {(1) (1 + x)}^{3} n^{- 1 ∕ 2} d^{3}, f o r 0 \leq x \leq n^{1 ∕ 6} ∕ d,

where $d = (E {∣ Y_{1 i j} ∣}^{3} + E {∣ Y_{2 i j} ∣}^{3}) ∕ {(var (Y_{1 i j}) + var (Y_{2 i j}))}^{3 ∕ 2}$ and O(1) is a finite constant depending only on c₁ and c₂. In particular,

\frac{P ({\tilde{T}}_{j} \geq x)}{1 - Φ (x)} \to 1

uniformly in $x \in (0, o (n^{1 ∕ 6}))$ .

Lemma 2

Suppose Condition 1(b) holds and log p = o(n). Let $S_{k j}^{2}$ be the sample variance defined in Section 1, and $σ_{k j}^{2}$ be the variance of the j-th feature in class C_k. Suppose min $σ_{k j}^{2}$ is bounded away from 0. Then we have the following uniform convergence result

\max_{k = 1, 2, j = 1, \dots, p} ∣ S_{k j}^{2} - σ_{k j}^{2} ∣ \overset{P}{\to} 0 .

Proof of Lemma 2

For any $ε > 0$ , we know when n_k is very large,

P (\max_{k = 1, 2, j = 1, \dots, p} ∣ S_{k j}^{2} - σ_{k j}^{2} ∣ > ε) \leq \sum_{k = 1, 2} \sum_{j = 1}^{p} P (∣ S_{k j}^{2} - σ_{k j}^{2} ∣ > ε) \leq \sum_{k = 1, 2} \sum_{j = 1}^{s} P (∣ \sum_{i = 1}^{n_{k}} (ϵ_{k i j}^{2} - σ_{k j}^{2}) ∣ > n_{k} ε ∕ 3) + \sum_{k = 1, 2} \sum_{j = 1}^{s} P (∣ \sum_{i = 1}^{n_{k}} ϵ_{k i j} ∣ > n_{k} \sqrt{ε} ∕ 2) \equiv I_{1} + I_{2} .

(7.11)

It follows from Bernstein's inequality that

P (∣ \sum_{i = 1}^{n_{k}} (ϵ_{k i j}^{2} - σ_{k j}^{2}) ∣ > n_{k} ε ∕ 3) \leq 2 \exp {- \frac{1}{2} \frac{n_{k}^{2} ε^{2}}{9 ν 1 + 3 M_{1} n_{k} ε}},

and

P (∣ \sum_{i = 1}^{n_{k}} ϵ_{k i j} ∣ > n_{k} \sqrt{ε} ∕ 2) \leq 2 \exp {- \frac{1}{2} \frac{n_{k}^{2} ε}{4 ν_{2} + 2 M_{2} n_{k} \sqrt{ε}}},

Since log p = o(n), we have I₁ = o(1) and I₂ = o(1). These together with (7.11) completes the proof of Lemma 2.

Proof of Theorem 3

We devidte the proof into two parts. (a) Let us first look at the probability $P (\max_{j > s} ∣ T_{j} ∣ > x)$ . Clearly,

P (\max_{j > s} ∣ T_{j} ∣ > x) \leq \sum_{j = s + 1}^{p} P (∣ T_{j} ∣ \geq x) .

(7.12)

Note that for all j > s, α_j = μ_j1 - μ_j2 = 0. By Condition 1(b) and Lemma 1, the following inequality holds for 0 ≤ x ≤ n^1/6/d,

P (T_{j} \geq x) = (1 - Φ (x)) (1 + C {(1 + x)}^{3} n^{- 1 ∕ 2} d^{3}),

where C is a constant that only depends on c₁ and c₂, and

d = (E {∣ Y_{1 i j} ∣}^{3} + E {∣ Y_{2 i j} ∣}^{3}) ∕ {(σ_{1 j}^{2} + σ_{2 j}^{2})}^{3 ∕ 2}

with $σ_{k j}^{2}$ the j-th diagonal element of $Σ_{k}$ . For the normal distribution, we have the following tail probability inequality

1 - Φ (x) \leq \frac{1}{\sqrt{2 π}} \frac{1}{x} e^{- x^{2} ∕ 2} .

This together with the symmetry of T_j gives

P (∣ T_{j} ∣ \geq x) \leq 2 \frac{1}{\sqrt{2 π}} \frac{1}{x} e^{- x^{2} ∕ 2} (1 + C {(1 + x)}^{3} n^{- 1 ∕ 2} d^{3}) .

Combining the above inequality with (7.12), we have

\sum_{j > s} P (∣ T_{j} ∣ \geq x) \leq (p - s) \frac{2}{\sqrt{2 π}} \frac{1}{x} e^{- x^{2} ∕ 2} (1 + C {(1 + x)}^{3} n^{- 1 ∕ 2} d^{3}) .

Since log(p - s) = o(n^γ) with $0 < γ < \frac{1}{3}$ , if we let $x \sim c n^{γ ∕ 2}$ , then

\sum_{j > s} P (∣ T_{j} ∣ \geq x) \to 0,

which along with (7.12) yields

P (\max_{j > s} ∣ T_{j} ∣ > x) \to 0 .

(b) Next, we consider P (min_j≤s |T_j| ≤ x). Notice that for j ≤ s, α_j = μ_1j - μ_2j ≠ 0. Let $η_{j} = \frac{α_{j}}{\sqrt{S_{1 j}^{2} ∕ n_{1} + S_{1 j}^{2} ∕ n_{2}}}$ and define

{\tilde{T}}_{j} = T_{j} - η_{j} .

Then following the same lines as those in (a), we have

\sum_{j \leq s} P (∣ {\tilde{T}}_{j} ∣ \geq x) \leq s \frac{2}{\sqrt{2 π}} \frac{1}{x} e^{- x^{2} ∕ 2} (1 + C {(1 + x)}^{3} n^{- 1 ∕ 2} d^{3}) \to 0 .

It follows from Lemma 2 that,

\max_{j \leq s} ∣ S_{k j}^{2} - σ_{k j}^{2} ∣ \overset{P}{\to} 0, k = 1, 2 .

Hence, uniformly over j = 1, ⋯, s, we have

η_{j} = \frac{∣ α_{j} ∣}{\sqrt{σ_{1 j}^{2} ∕ n_{1} + σ_{2 j}^{2} ∕ n_{2}}} (1 + o_{P} (1)) .

Therefore,

\min_{j \leq s} ∣ η_{j} ∣ = \min_{j \leq s} \frac{\sqrt{n_{1}} ∣ α_{j} ∣}{\sqrt{σ_{1 j}^{2} + σ_{2 j}^{2} n_{1} ∕ n_{2}}} (1 + o_{P} (1)) \geq \min_{j \leq s} \frac{\sqrt{n_{1}} ∣ α_{j} ∣}{\sqrt{σ_{1 j}^{2} + c_{2} σ_{2 j}^{2}}} (1 + o_{P} (1))

with c₂ defined in Theorem 3. Let $α_{0} = \min_{j \leq s} ∣ μ_{j 1} - μ_{j 2} ∣ ∕ \sqrt{σ_{1 j}^{2} + c_{2} σ_{2 j}^{2}}$ . Then it follows that

P (\min_{j \leq s} ∣ T_{j} ∣ \leq x) \leq P (\max_{j \leq s} ∣ {\tilde{T}}_{j} ∣ \geq \min_{j \leq s} ∣ η_{j} ∣ - x) \leq P (\max_{j \leq s} ∣ {\tilde{T}}_{j} ∣ \geq \sqrt{n_{1}} α_{0} (1 + o_{P} (1)) - x) .

By part (a), we know that x ~ cn^γ/2 and log(p-s) = o(n^γ). Thus if $α_{0} \sim \min_{j \leq s} \frac{∣ μ_{j 1} - μ_{j 2} ∣}{\sqrt{σ_{j 1}^{2} + σ_{j 2}^{2}}} = n^{- γ} β_{n}$ for some $β_{n} \to \infty$ , then similarly to part (a), we have

P (\min_{j \leq s} ∣ T_{j} ∣ \leq x) \to 0 .

Combination of part (a) and part (b) completes the proof.

Proof of Theorem 4

The classification error of the truncated classifier ${\hat{δ}}_{NC}^{m}$ is

W ({\hat{δ}}_{NC}^{m}, θ) = 1 - Φ (\frac{\sum_{j = 1}^{m} {\hat{α}}_{j} (μ_{1 j} - {\hat{μ}}_{j})}{\sum_{j = 1}^{m} {\hat{α}}_{j}^{2}}) .

We first consider the denominator. Note that ${\hat{α}}_{j} \sim N (α_{j}, \frac{n}{n_{1} n_{2}})$ . It can be shown that

{(\frac{4 n}{n_{1} n_{2}} \sum_{j = 1}^{m} α_{j}^{2} + \frac{2 m n^{2}}{n_{1}^{2} n_{2}^{2}})}^{- 1 ∕ 2} \sum_{j = 1}^{m} ({\hat{α}}_{j}^{2} - α_{j}^{2} - \frac{n}{n_{1} n_{2}}) \overset{D}{\to} N (0, 1),

which together with the assumptions $\frac{n}{\sqrt{m}} \sum_{j = 1}^{m} α_{j}^{2} \to \infty$ gives

\begin{matrix} \sum_{j = 1}^{m} {\hat{α}}_{j}^{2} & = \sum_{j = 1}^{m} α_{j}^{2} + \frac{m n}{n_{1} n_{2}} + {\frac{4 n}{n_{1} n_{2}} \sum_{j = 1}^{m} α_{j}^{2} + \frac{2 m n^{2}}{n_{1}^{2} n_{2}^{2}}}^{1 ∕ 2} O_{p} (1) \\ = (1 + o_{P} (1)) \sum_{j = 1}^{m} α_{j}^{2} + \frac{m n}{n_{1} n_{2}} . \end{matrix}

Next, let us look at the numerator. We decompose it as

\sum_{j = 1}^{m} {\hat{α}}_{j} (μ_{1 j} - {\hat{μ}}_{j}) = \frac{1}{2} \sum_{j = 1}^{m} α_{j}^{2} - \sum_{j = 1}^{m} α_{j} {\hat{ϵ}}_{2 j} - \frac{1}{2} \sum_{j = 1}^{m} ({\hat{ϵ}}_{1 j}^{2} - {\hat{ϵ}}_{2 j}^{2}) .

(7.13)

Since the second term above has the distribution $N (0, \sum_{j = 1}^{m} α_{j}^{2} ∕ n_{2})$ , it follows from the assumption $n \sum_{j = 1}^{m} α_{j}^{2} \to \infty$ that

\sum_{j = 1}^{m} α_{j} {\hat{ϵ}}_{2_{j}} = o_{P} (1) \sum_{j = 1}^{m} α_{j}^{2} .

The third term in (7.13) can be written as

\sum_{j = 1}^{m} ({\hat{ϵ}}_{1}^{2}_{j} - {\hat{ϵ}}_{2}^{2}_{j}) = \frac{m}{n_{1}} - \frac{m}{n_{2}} + O_{p} (\frac{n m}{n_{1} n_{2}}) = \frac{m (n_{2} - n_{1})}{n_{1} n_{2}} = o_{P} (1) \sum_{j = 1}^{m} α_{j}^{2} .

Hence the numerator is

\sum_{j = 1}^{m} {\hat{α}}_{j} (μ_{1 j} - {\hat{μ}}_{j}) = \frac{m (n_{2} - n_{1})}{n_{1} n_{2}} + (1 + o_{P} (1)) \sum_{j = 1}^{m} α_{j}^{2} .

Therefore, the classification error is

W ({\hat{δ}}_{NC}^{m}, θ) = 1 - Φ (\frac{(1 + o_{P} (1)) \sum_{j = 1}^{m} α_{j}^{2} + m (n_{1} - n_{2}) ∕ (n_{1} n_{2})}{2 {(1 + o_{P} (1)) \sum_{j = 1}^{m} α_{j}^{2} + m n ∕ (n_{1} n_{2})}^{1 ∕ 2}}) .

This concludes the proof.

Proof of Theorem 5

Note that the classification error of ${\hat{δ}}_{FAIR}^{b_{n}}$ is

W ({\hat{δ}}_{FAIR}^{b_{n}} (x), θ) = 1 - Φ (\frac{\sum_{j} (μ_{1 j} - {\hat{μ}}_{j}) {\hat{α}}_{j} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}}}{\sum_{j} {\hat{α}}_{j}^{2} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}}}) \equiv 1 - Φ (Ψ^{H}) .

We divide the proof into two parts: the numerator andthe denominator.

(a) First, we study the numerator of Ψ^H. It can be decomposed as

\sum_{j} (μ_{1 j} - {\hat{μ}}_{j}) {\hat{α}}_{j} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} = I_{1} + I_{2},

where $I_{1} = \sum_{j \in A} (μ_{1 j} - {\hat{μ}}_{j}) {\hat{α}}_{j} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}}$ and $I_{2} = \sum_{j \in A^{c}} (μ_{1 j} - {\hat{μ}}_{j}) {\hat{α}}_{j} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}}$ with $A^{c}$ the complementary of the set $A$ . Note that

\begin{matrix} I_{2} & = \frac{1}{2} \sum_{j \in A^{c}} α_{j}^{2} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} - \sum_{j \in A^{c}} α_{j} {\hat{ϵ}}_{2 j} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} - \frac{1}{2} \sum_{j \in A^{c}} ({\hat{ϵ}}_{1 j}^{2} - {\hat{ϵ}}_{2 j}^{2}) 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} \\ \equiv \frac{1}{2} I_{2, 1} - I_{2, 2} - \frac{1}{2} I_{2, 3} . \end{matrix}

Since ${\hat{α}}_{j} \sim N (α_{j}, \frac{n}{n_{1} n_{2}})$ , it follows from the normal tail probability inequality that for every $j \in A^{c}$ and $b_{n} > \max_{j \in A^{c}} ∣ α_{j} ∣$ ,

\begin{matrix} P (∣ {\hat{α}}_{j} ∣ \geq b_{n}) & \leq P (∣ {\hat{α}}_{j} - α_{j} ∣ \geq b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣) \\ \leq M \frac{\exp {- n_{1} n_{2} {(b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣)}^{2} ∕ (2 n)}}{\sqrt{n_{1} n_{2} n^{- 1}} (b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣)}, \end{matrix}

(7.14)

where M is a generic constant. Thus for every ε > 0, if $\log (p - m) ∕ [n {(b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣)}^{2}] \to 0$ and $\max_{j \in A^{c}} ∣ α_{j} ∣ < b_{n}$ , we have

\begin{matrix} P (∣ I_{2, 1} ∣ \geq ε) & \leq ε^{- 1} \sum_{j \in A^{c}} α_{j}^{2} P (∣ {\hat{α}}_{j} ∣ \geq b_{n}) \\ \leq M \max_{j \in A^{c}} α_{j}^{2} \frac{(p - m)}{ε} \frac{\exp {- n_{1} n_{2} {(b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣)}^{2} ∕ (2 n)}}{\sqrt{n_{1} n_{2} n^{- 1}} (b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣)}, \end{matrix}

which tends to zero. Hence,

I_{2, 1} \overset{P}{\to} 0

(7.15)

We next consider I_2,2. Since $E {({\hat{ϵ}}_{2 j})}^{2} = \frac{1}{n_{2}}, \log (p - m) ∕ [n {(b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣)}^{2}] \to 0$ , and $\max_{j \in A^{c}} ∣ α_{j} ∣ < b_{n}$ , we have

\begin{matrix} P (∣ I_{2, 2} ∣ \geq ε) & \leq ε^{- 1} \sum_{j \in A^{c}} E ∣ {\hat{ϵ}}_{2 j} α_{j} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} ∣ \\ \leq ε^{- 1} \sum_{j \in A^{c}} {E {({\hat{ϵ}}_{2 j})}^{2}}^{1 ∕ 2} {E α_{j}^{2} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} ∣}^{1 ∕ 2} \\ \leq M \frac{(p - m) \max_{j \in A^{c}} ∣ α_{j} ∣}{ε \sqrt{n_{2}}} \frac{\exp {- n_{1} n_{2} {(b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣)}^{2} ∕ (4 n)}}{\sqrt{n_{1} n_{2} n^{- 1}} (b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣)}, \end{matrix}

which converges to 0. Therefore,

I_{2, 2} \overset{P}{\to} 0 .

(7.16)

Then, we consider I_2,3. Since c₁ ≤ n₁/n₂ ≤ c₂ and $E {({\hat{ϵ}}_{1 j}^{2} - {\hat{ϵ}}_{2 j}^{2})}^{2} = \frac{3 n_{1}^{2} + 3 n_{2}^{2} - 2 n_{1} n_{2}}{n_{1}^{2} n_{2}^{2}} \leq \frac{3 c_{2} + 3 - 2 c_{1}}{c_{1} n_{2}^{2}}$ , by (7.14) we have for every ε > 0,

\begin{matrix} P (∣ I_{2, 3} ∣ \geq ε) & \leq ε^{- 1} E ∣ \sum_{j \in A^{c}} ({\hat{ϵ}}_{1 j}^{2} - {\hat{ϵ}}_{2 j}^{2}) 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} ∣ \\ \leq ε^{- 1} \sum_{j \in A^{c}} {E {({\hat{ϵ}}_{1 j}^{2} - {\hat{ϵ}}_{2 j}^{2})}^{2}}^{1 ∕ 2} P {(∣ {\hat{α}}_{j} ∣ \geq b_{n})}^{1 ∕ 2} \\ \leq M \frac{\sum_{j \in A^{c}} P (∣ {\hat{α}}_{j} ∣ \geq b_{n})}{n_{2} ε} \to 0, \end{matrix}

where M is some generic constant. Thus, $I_{2, 3} \overset{P}{\to} 0$ . Combination of this with (7.15) and (7.16) entails

I_{2} = o_{P} (1) .

We now deal with I₁. Decompose I₁ similarly as

\begin{matrix} I_{1} & = \sum_{j \in A} (μ_{1 j} - {\hat{μ}}_{1 j}) {\hat{α}}_{j} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} + \frac{1}{2} \sum_{j \in A} {\hat{α}}_{j}^{2} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} \\ \geq \sum_{j \in A} (μ_{1 j} - {\hat{μ}}_{1 j}) {\hat{α}}_{j} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} + \frac{1}{2} \sum_{j \in A} ({\hat{α}}_{j}^{2} - b_{n}^{2}) \\ \equiv I_{1, 1} + \frac{1}{2} I_{1, 2} . \end{matrix}

We first study I_1,2. By using ${\hat{α}}_{j} \sim N (α_{j}, \frac{n}{n_{1} n_{2}})$ , it can be shown that

{(\frac{4 n}{n_{1} n_{2}} \sum_{j \in A} α_{j}^{2} + \frac{2 m n^{2}}{n_{1}^{2} n_{2}^{2}})}^{- 1 ∕ 2} \sum_{j \in A} ({\hat{α}}_{j}^{2} - α_{j}^{2} - \frac{n}{n_{1} n_{2}}) \overset{D}{\to} N (0, 1) .

(7.17)

since $\frac{n}{\sqrt{m}} \sum_{j = 1}^{m} α_{j}^{2} \to \infty$ , we have ${(\frac{4 n}{n_{1} n_{2}} \sum_{j \in A} α_{j}^{2} + \frac{2 m n^{2}}{n_{1}^{2} n_{2}^{2}})}^{1 ∕ 2} ∕ \sum_{j \in A} α_{j}^{2} \to 0$ . Therefore,

\begin{matrix} I_{1, 2} & = \sum_{j \in A} (α_{j}^{2} - b_{n}^{2}) + \frac{n m}{n_{1} n_{2}} + {(\frac{4 n}{n_{1} n_{2}} \sum_{j \in A} α_{j}^{2} + \frac{2 m n^{2}}{n_{1}^{2} n_{2}^{2}})}^{1 ∕ 2} O_{p} (1) \\ = (1 + o_{P} (1)) \sum_{j \in A} α_{j}^{2} + \frac{n m}{n_{1} n_{2}} - m b^{2} . \end{matrix}

Next, we look at I_1,1. For any ε > 0,

P (∣ I_{1, 1} ∣ \geq ε) \leq \frac{1}{ε} E ∣ I_{1, 1} ∣ \leq \frac{1}{ε} \sum_{j \in A} {E {∣ μ_{1 j} - {\hat{μ}}_{1 j} ∣}^{2} E {∣ {\hat{α}}_{j} ∣}^{2}}^{1 ∕ 2} = \frac{1}{\sqrt{n_{1}} ε} \sum_{j \in A} \sqrt{α_{j}^{2} + n ∕ (n_{1} n_{2})} .

When n is large enough, the above probability can be bounded by

P (∣ I_{1, 1} ∣ \geq ε) \leq \sqrt{2 ∕ (n_{1} ε^{2})} \sum_{j \in A} ∣ α_{j} ∣,

which along with the assumption $\sum_{j \in A} ∣ α_{j} ∣ ∕ [\sqrt{n} \sum_{j \in A} α_{j}^{2}] \to 0$ gives

I_{1, 1} = o_{P} (1) \sum_{j \in A} α_{j}^{2} .

It follows that the numerator is bounded from below by

(1 + o_{P} (1)) \frac{1}{2} \sum_{j \in A} α_{j}^{2} + \frac{m n}{2 n_{1} n_{2}} - \frac{1}{2} m b^{2} .

(b) Now, we study the denominator of Ψ. Let

\sum_{j} {\hat{α}}_{j}^{2} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} = \sum_{j \in A} {\hat{α}}_{j}^{2} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} + \sum_{j \in A^{c}} {\hat{α}}_{j}^{2} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} \equiv J_{1} + J_{2} .

We first show that $J_{2} \overset{P}{\to} 0$ . Note that $E {\hat{α}}_{j}^{4} = α_{j}^{4} + 6 n {(n_{1} n_{2})}^{- 1} α_{j}^{2} + 3 n^{2} {(n_{1} n_{2})}^{- 2}$ . Thus,

\begin{matrix} P (∣ J_{2} ∣ \geq ε) & \leq \frac{1}{ε} E ∣ J_{2} ∣ = \sum_{j \in A^{c}} E {\hat{α}}_{j}^{2} 1 {∣ {\hat{α}}_{j} ∣ \geq b_{n}} ∕ ε \leq \frac{1}{ε} \sum_{j \in A^{c}} {E {\hat{α}}_{j}^{4} P (∣ {\hat{α}}_{j} ∣ \geq b_{n})}^{1 ∕ 2} \\ \leq \frac{1}{ε} \sum_{j \in A^{c}} {(α_{j}^{4} + 6 n {(n_{1} n_{2})}^{- 1} α_{j}^{2} + 3 n^{2} {(n_{1} n_{2})}^{- 2}) P (∣ {\hat{α}}_{j} ∣ \geq b_{n})}^{1 ∕ 2} . \end{matrix}

This together with (7.14) and the assumption that $\log (p - m) ∕ [n {(b_{n} - \max_{j \in A^{c}} ∣ α_{j} ∣)}^{2}] \to 0$ yields $J_{2} \overset{P}{\to} 0$ as $n \to \infty, p \to \infty$ . Now we study term J₁. By (7.17), we have

J_{1} \leq \sum_{j \in A} {\hat{α}}_{j}^{2} = (1 + o_{P} (1)) \sum_{j \in A} α_{j}^{2} + \frac{m n}{n_{1} n_{2}} .

Hence the denominator is bounded from a above by $(1 + o_{P} (1)) \sum_{j \in A} α_{j}^{2} + \frac{m n}{n_{1} n_{2}}$ . Therefore,

Ψ^{H} \geq \frac{(1 + o_{P} (1)) \sum_{j \in A} α_{j}^{2} + \frac{m n}{n_{1} n_{2}} - m b^{2}}{2 \sqrt{(1 + o_{P} (1)) \sum_{j \in A} α_{j}^{2} + \frac{m n}{n_{1} n_{2}}}} .

It follows that the classification error is bounded from above by

1 - Φ (\frac{(1 + o_{P} (1)) \sum_{j \in A} α_{j}^{2} + \frac{m n}{n_{1} n_{2}} - m b^{2}}{2 \sqrt{(1 + o_{P} (1)) \sum_{j \in A} α_{j}^{2} + \frac{m n}{n_{1} n_{2}}}}) .

This completes the proof.

Contributor Information

Jianqing Fan, Princeton University.

Yingying Fan, Harvard University.

REFERENCES

ANTONIADIS A, LAMBERT-LACROIX S, LEBLANC F. Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics. 2003;19:563–570. doi: 10.1093/bioinformatics/btg062. [DOI] [PubMed] [Google Scholar]
BAI Z, SARANADASA H. Effect of high dimension : by an example of a two sample problem. Statistica Sinica. 1996;6:311–329. [Google Scholar]
BAIR E, HASTIE T, DEBASHIS P, TIBSHIRANI R. Prediction by supervised principal components. The Annals of Statistics. 2007 to appear. [Google Scholar]
BICKEL PJ, LEVINA E. Some theory for Fisher's linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli. 2004;10:989–1010. [Google Scholar]
BOULESTEIX A. PLS Dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology. 2004;3:1–33. doi: 10.2202/1544-6115.1075. [DOI] [PubMed] [Google Scholar]
BÜHLMANN P, YU B. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association. 2003;98:324–339. [Google Scholar]
BURA E, PFEIFFER RM. Graphical methods for class prediction using dimension reduction techniques on DNA microarray data. Bioinformatics. 2003;19:1252–1258. doi: 10.1093/bioinformatics/btg150. [DOI] [PubMed] [Google Scholar]
CAO HY. Moderate deviations for two sample t-statistics. Probability and Statistics. 2007 Forthcoming. [Google Scholar]
CHIAROMONTE F, MARTINELLI J. Dimension reduction strategies for analyzing global gene expression data with a response. Mathematical Biosciences. 2002;176:123–144. doi: 10.1016/s0025-5564(01)00106-7. [DOI] [PubMed] [Google Scholar]
DETTLING M, BÜHLMANN P. Boosting for tumor classification with gene expression data. Bioinformatics. 2003;19(9):1061–1069. doi: 10.1093/bioinformatics/btf867. [DOI] [PubMed] [Google Scholar]
DUDOIT S, FRIDLYARD J, SPEED TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002;97:77–87. [Google Scholar]
FAN J. Test of significance based on wavelet thresholding and Neyman's truncation. Journal of the American Statistical Association. 1996;91:674–688. [Google Scholar]
FAN J, HALL P, YAO Q. To how many simultaneous hypothesis tests can normal, student's t or Bootstrap calibration be applied? Manuscript. 2006 [Google Scholar]
FAN J, LI R. Statistical challenges with high dimensionality: feature selection in knowledge discovery. In: Sanz-Sole M, Soria J, Varona JL, Verdera J, editors. Proceedings of the International Congress of Mathematicians. Vol. III. 2006. pp. 595–622. [Google Scholar]
FAN J, REN Y. Statistical analysis of DNA microarray data. Clinical Cancer Research. 2006;12:4469–4473. doi: 10.1158/1078-0432.CCR-06-1033. [DOI] [PubMed] [Google Scholar]
FAN J, LV J. Sure independence screening for ultra-high dimensional feature space. Manuscript. 2007 doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
FRIEDMAN JH. Regularized discriminant analysis. Journal of the American Statistical Association. 1989;84:165–175. [Google Scholar]
GHOSH D. Singular value decomposition regression modeling for classification of tumors from microarray experiments. Proceedings of the Pacific Symposium on Biocomputing. 2002:11462–11467. [PubMed] [Google Scholar]
GREENSHTEIN E. Best subset selection, persistence in high dimensional statistical learning and optimization under l1 constraint. Ann. Statist. 2006 to appear. [Google Scholar]
GREENSHTEIN E, RITOV Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988. [Google Scholar]
HUANG X, PAN W. Linear regression and two-class classification with gene expression data. Bioinformatics. 2003;19:2072–2978. doi: 10.1093/bioinformatics/btg283. [DOI] [PubMed] [Google Scholar]
LIN Z, LU C. Limit Theory for Mixing Dependent Random Variables. Kluwer Academic Publishers; Dordrecht: 1996. [Google Scholar]
MEINSHAUSEN N. Relaxed Lasso. Computational Statistics and Data Analysis. 2007 to appear. [Google Scholar]
NGUYEN DV, ROCKE DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002;18:39–50. doi: 10.1093/bioinformatics/18.1.39. [DOI] [PubMed] [Google Scholar]
SHAO QM. Self-normalized Limit Theorems in Probability and Statistics. Manuscript. 2005 [Google Scholar]
TIBSHIRANI R, HASTIE T, NARASIMHAN B, CHU G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 2002;99:6567–6572. doi: 10.1073/pnas.082099299. [DOI] [PMC free article] [PubMed] [Google Scholar]
VAN DER VAART AW, WELLNER JA. Weak Convergence and Empirical Processes. Springer-Verlag; New York: 1996. [Google Scholar]
WEST M, BLANCHETTE C, FRESSMAN H, HUANG E, ISHIDA S, SPANG R, ZUAN H, MARKS JR, NEVINS JR. Predicting the clinical status of human breast cancer using gene expression profiles. Proc. Natl. Acad. Sci. 2001;98:11462–11467. doi: 10.1073/pnas.201162998. [DOI] [PMC free article] [PubMed] [Google Scholar]
ZOU H, HASTIE T, TIBSHIRANI R. Sparse principal component analysis. Technical report. 2004 [Google Scholar]

[R1] ANTONIADIS A, LAMBERT-LACROIX S, LEBLANC F. Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics. 2003;19:563–570. doi: 10.1093/bioinformatics/btg062. [DOI] [PubMed] [Google Scholar]

[R2] BAI Z, SARANADASA H. Effect of high dimension : by an example of a two sample problem. Statistica Sinica. 1996;6:311–329. [Google Scholar]

[R3] BAIR E, HASTIE T, DEBASHIS P, TIBSHIRANI R. Prediction by supervised principal components. The Annals of Statistics. 2007 to appear. [Google Scholar]

[R4] BICKEL PJ, LEVINA E. Some theory for Fisher's linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli. 2004;10:989–1010. [Google Scholar]

[R5] BOULESTEIX A. PLS Dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology. 2004;3:1–33. doi: 10.2202/1544-6115.1075. [DOI] [PubMed] [Google Scholar]

[R6] BÜHLMANN P, YU B. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association. 2003;98:324–339. [Google Scholar]

[R7] BURA E, PFEIFFER RM. Graphical methods for class prediction using dimension reduction techniques on DNA microarray data. Bioinformatics. 2003;19:1252–1258. doi: 10.1093/bioinformatics/btg150. [DOI] [PubMed] [Google Scholar]

[R8] CAO HY. Moderate deviations for two sample t-statistics. Probability and Statistics. 2007 Forthcoming. [Google Scholar]

[R9] CHIAROMONTE F, MARTINELLI J. Dimension reduction strategies for analyzing global gene expression data with a response. Mathematical Biosciences. 2002;176:123–144. doi: 10.1016/s0025-5564(01)00106-7. [DOI] [PubMed] [Google Scholar]

[R10] DETTLING M, BÜHLMANN P. Boosting for tumor classification with gene expression data. Bioinformatics. 2003;19(9):1061–1069. doi: 10.1093/bioinformatics/btf867. [DOI] [PubMed] [Google Scholar]

[R11] DUDOIT S, FRIDLYARD J, SPEED TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002;97:77–87. [Google Scholar]

[R12] FAN J. Test of significance based on wavelet thresholding and Neyman's truncation. Journal of the American Statistical Association. 1996;91:674–688. [Google Scholar]

[R13] FAN J, HALL P, YAO Q. To how many simultaneous hypothesis tests can normal, student's t or Bootstrap calibration be applied? Manuscript. 2006 [Google Scholar]

[R14] FAN J, LI R. Statistical challenges with high dimensionality: feature selection in knowledge discovery. In: Sanz-Sole M, Soria J, Varona JL, Verdera J, editors. Proceedings of the International Congress of Mathematicians. Vol. III. 2006. pp. 595–622. [Google Scholar]

[R15] FAN J, REN Y. Statistical analysis of DNA microarray data. Clinical Cancer Research. 2006;12:4469–4473. doi: 10.1158/1078-0432.CCR-06-1033. [DOI] [PubMed] [Google Scholar]

[R16] FAN J, LV J. Sure independence screening for ultra-high dimensional feature space. Manuscript. 2007 doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] FRIEDMAN JH. Regularized discriminant analysis. Journal of the American Statistical Association. 1989;84:165–175. [Google Scholar]

[R18] GHOSH D. Singular value decomposition regression modeling for classification of tumors from microarray experiments. Proceedings of the Pacific Symposium on Biocomputing. 2002:11462–11467. [PubMed] [Google Scholar]

[R19] GREENSHTEIN E. Best subset selection, persistence in high dimensional statistical learning and optimization under l1 constraint. Ann. Statist. 2006 to appear. [Google Scholar]

[R20] GREENSHTEIN E, RITOV Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988. [Google Scholar]

[R21] HUANG X, PAN W. Linear regression and two-class classification with gene expression data. Bioinformatics. 2003;19:2072–2978. doi: 10.1093/bioinformatics/btg283. [DOI] [PubMed] [Google Scholar]

[R22] LIN Z, LU C. Limit Theory for Mixing Dependent Random Variables. Kluwer Academic Publishers; Dordrecht: 1996. [Google Scholar]

[R23] MEINSHAUSEN N. Relaxed Lasso. Computational Statistics and Data Analysis. 2007 to appear. [Google Scholar]

[R24] NGUYEN DV, ROCKE DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002;18:39–50. doi: 10.1093/bioinformatics/18.1.39. [DOI] [PubMed] [Google Scholar]

[R25] SHAO QM. Self-normalized Limit Theorems in Probability and Statistics. Manuscript. 2005 [Google Scholar]

[R26] TIBSHIRANI R, HASTIE T, NARASIMHAN B, CHU G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 2002;99:6567–6572. doi: 10.1073/pnas.082099299. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] VAN DER VAART AW, WELLNER JA. Weak Convergence and Empirical Processes. Springer-Verlag; New York: 1996. [Google Scholar]

[R28] WEST M, BLANCHETTE C, FRESSMAN H, HUANG E, ISHIDA S, SPANG R, ZUAN H, MARKS JR, NEVINS JR. Predicting the clinical status of human breast cancer using gene expression profiles. Proc. Natl. Acad. Sci. 2001;98:11462–11467. doi: 10.1073/pnas.201162998. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] ZOU H, HASTIE T, TIBSHIRANI R. Sparse principal component analysis. Technical report. 2004 [Google Scholar]

PERMALINK

High Dimensional Classification Using Features Annealed Independence Rules

Jianqing Fan

Yingying Fan

Abstract

1 Introduction

2 Impact of High Dimensionality

Theorem 1

Theorem 2

3 Feature Selection by Two-Sample t-Test

Condition 1

Theorem 3

4 Features Annealed Independence Rules

Theorem 4

Theorem 5

5 Numerical Studies

5.1 Simulation Study

Figure 1.

Figure 2.

Figure 3.

Figure 4.

5.2 Real Data Analysis

5.2.1 Leukemia Data

Table 1.

Figure 5.

Figure 6.

5.2.2 Lung Cancer Data

Table 2.

Figure 7.

5.2.3 Prostate Cancer Data

Table 3.

Figure 8.

6 Conclusion

Acknowledgments

7 Appendix

Proof of Theorem 1

Proof of Theorem 2

Lemma 1

Lemma 2

Proof of Lemma 2

Proof of Theorem 3

Proof of Theorem 4

Proof of Theorem 5

Contributor Information

REFERENCES

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases