Author manuscript; available in PMC: 2013 Nov 1.
Published in final edited form as: Pattern Recognit. 2013 Nov;46(11):3017–3029. doi: 10.1016/j.patcog.2013.04.002

Analytical Study of Performance of Linear Discriminant Analysis in Stochastic Settings

Amin Zollanvari a,b, Jianping Hua c, Edward R. Dougherty a,c
PMCID: PMC3769149  NIHMSID: NIHMS501827  PMID: 24039299

Abstract

This paper provides exact analytical expressions for the first and second moments of the true error for linear discriminant analysis (LDA) when the data are univariate and taken from two stochastic Gaussian processes. The key point is that we assume a general setting in which the sample data from each class do not need to be identically distributed or independent within or between classes. We compare the true errors of designed classifiers under the typical i.i.d. model and when the data are correlated, providing exact expressions and demonstrating that, depending on the covariance structure, correlated data can result in classifiers with either greater error or less error than when training with uncorrelated data. The general theory is applied to autoregressive and moving-average models of the first order, and it is demonstrated using real genomic data.

Keywords: Linear discriminant analysis, Stochastic settings, Correlated data, Non-i.i.d data, Expected error, Gaussian processes, Auto-regressive models, Moving-average models

1. Introduction

It is common in practice to assume that the training data used to construct a classifier are independent and identically distributed (i.i.d.). When the data are dependent or not identically distributed, classifier performance is affected. This paper presents a mathematical framework for analytically studying classifiers in such situations in general, and the univariate LDA (linear discriminant analysis) classifier in particular. We pay particular attention to the univariate LDA model because it is possible to obtain closed-form (not asymptotic) results for moments of the error, in analogy to moments of the error [1, 2] and error estimates [1, 3] for univariate LDA with i.i.d. sampling. The desired framework is achieved by placing classifier performance in a stochastic setting where the training data are univariate, dependent, and not necessarily identically distributed.

Motivation for this line of research goes back to the early 1970’s when Basu and Odell observed in remote sensing applications that the conditional expected true error of LDA is commonly higher than what is expected from a theoretical analysis [4]. They associated this observation with violation of the independence assumption on the training data.

To study the effect of correlated training data on the performance of LDA, Basu and Odell [4] used numerical examples under an equicorrelated structure of samples (see Appendix for definitions of various correlation structures). They showed that misclassification probabilities change under such structures. Afterwards, McLachlan [5] used asymptotic analysis to show that even under a simple equicorrelated structure the probability of misclassification changes. Later, Tubbs [6] used a similar asymptotic analysis but with a serially correlated structure among training data. He considered further simplifying assumptions to show that the asymptotic error rate changes with serially correlated data having positive correlations. Lawoko and McLachlan [7] used the same serially correlated structure and obtained a different asymptotic expansion of the LDA true error from the one that Tubbs previously achieved in [6]. This type of asymptotic analysis was later used in [7, 8] to characterize the asymptotic expected true error of univariate LDA and Z-statistics assuming an autoregressive process of order p.

Typically, large-sample asymptotic results are not helpful in small-sample situations. Going back to 1925, R. A. Fisher wrote, “Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data” [9, 10]. This understanding led us to study the distribution and exact moments of the LDA true error and common estimators [11, 3, 12, 13].

Having laid the groundwork for analyzing LDA-related statistics in small-sample situations, in this work we establish a framework for studying LDA in stochastic settings, thereby allowing us to obtain the exact first and second moments of the univariate LDA true error in a general stochastic setting. We neither impose a specific correlation structure on the training data, nor do we assume the training data necessarily have the same mean or variance. For example, the basic assumption in [4, 5, 6, 7, 8] is that the training data of the two classes are taken separately from two class conditional densities, Π0 for class 0 and Π1 for class 1. This assumption immediately imposes several restrictions on the problem: the training data from each class have the same mean and variance (because they come from the same distribution) and, furthermore, only intraclass correlations exist. The stochastic setting permits us to generalize such assumptions to training data being correlated across classes or the samples from each class being differently distributed. To model such data we employ Gaussian processes and assume the samples are taken from class conditional processes rather than class conditional densities.

Another related line of research is the work on classification of stationary time series data [14, 15, 16]. The main focus in this work is to construct linear discriminant rules with the knowledge of having stationary data. In this framework the discriminant function is commonly the one which maximizes some measure of disparity between two multivariate densities, e.g. the Kullback-Leibler information measure. This means that the linear discriminant rules constructed there are no longer what is commonly known as LDA. Therefore, the main difference between the aforementioned results on studying the performance of LDA under correlated training data and the body of work on classification of stationary time series is that the former focuses on the analysis of the effect of correlated training data (which may have a stationary structure) on the performance of LDA, while the latter focuses on the synthesis of new classification rules with the knowledge of having stationary time series. Our work is of the first type. We study the effect of training data that can be dependent, and not necessarily identically distributed or stationary, on the performance of LDA.

As an application of these results, we consider two commonly used models, the first-order autoregressive and first-order moving-average models. We further study the exact effect of the autoregressive or moving-average model coefficients on the expected true error of LDA. Finally, we present numerical experiments to study several specific settings using the theory.

Before proceeding we note that univariate classification has played a major role in the history of pattern recognition, in part because of the ability to obtain closed-form solutions for error moments [1, 2, 3]; however, we should not overlook practical application. Indeed, most common tests for the diagnosis and prognosis of cancer are univariate: PSA for prostate cancer [21], AFP for liver cancer [22], CA 125 for ovarian cancer [23], and CA 19.9 for colorectal cancer [24] are major protein markers. In addition to these protein biomarkers, there are major genomic markers such as BRCA1 for breast cancer [25], BRCA2 for male breast cancer [26], and APC for pancreatic cancer [27].

2. Linear Discriminant Analysis and Error Estimation: Independent Sampling

In this section, we present the traditional sampling scenario in which LDA is employed in a univariate setting. Consider a set of n = n0 + n1 independent sample points in ℝ, where X1, X2, …, Xn0 come from population Π0 and Xn0+1, Xn0+2, …, Xn0+n1 come from population Π1. Population Πi is assumed to follow a univariate Gaussian distribution N(μi,σi2), for i = 0, 1. Linear Discriminant Analysis (LDA) utilizes the Anderson W statistic, which in the univariate case is presented as

\[
W(\bar X_0,\bar X_1,X)=\frac{1}{\hat\sigma^2}\Bigl(X-\frac{\bar X_0+\bar X_1}{2}\Bigr)(\bar X_0-\bar X_1), \tag{1}
\]

where \(\bar X_0=\frac{1}{n_0}\sum_{i=1}^{n_0}X_i\) and \(\bar X_1=\frac{1}{n_1}\sum_{i=n_0+1}^{n_0+n_1}X_i\) are the sample means for each class and \(\hat\sigma^2\) is the pooled estimate of the class variance, which is assumed to be common in the LDA discriminant. Given \(\bar X_0\) and \(\bar X_1\), the designed LDA classifier is given by

\[
\psi(X)=\begin{cases}1, & \text{if } W(\bar X_0,\bar X_1,X)\le c\\ 0, & \text{if } W(\bar X_0,\bar X_1,X)> c,\end{cases} \tag{2}
\]

with c being a constant. It is commonly assumed that c is zero [17], which is the assumption we also make throughout this paper. Therefore, the sign of W determines the classification of the sample point X and since σ̂2 > 0, (1) reduces to

\[
W(\bar X_0,\bar X_1,X)=(X-\bar X)(\bar X_0-\bar X_1), \tag{3}
\]

where \(\bar X=\frac{\bar X_0+\bar X_1}{2}\). Given the training data \(S_n\) (and thus \(\bar X_0\) and \(\bar X_1\)), the classification error, also known as the true error, is given by

\[
\varepsilon=P\bigl(W(\bar X_0,\bar X_1,X)\le 0,\ X\in\Pi_0\mid \bar X_0,\bar X_1\bigr)+P\bigl(W(\bar X_0,\bar X_1,X)> 0,\ X\in\Pi_1\mid \bar X_0,\bar X_1\bigr)=\alpha_0\varepsilon_0+\alpha_1\varepsilon_1, \tag{4}
\]

where αi = P (X ∈ Πi) is the a priori mixing probability for population Πi and εi is the error rate specific to population Πi, with

\[
\varepsilon_i=P\bigl((-1)^i\,W(\bar X_0,\bar X_1,X)\le 0\mid X\in\Pi_i,\bar X_0,\bar X_1\bigr). \tag{5}
\]

The first and second moments of the classification error are given by

\[
E[\varepsilon]=\sum_{i=0}^{1}\alpha_i\,P\bigl((-1)^i\,W(\bar X_0,\bar X_1,X)\le 0\mid X\in\Pi_i\bigr), \tag{6}
\]

and

\[
E[\varepsilon^2]=E\bigl[(\alpha_0\varepsilon_0+\alpha_1\varepsilon_1)^2\bigr]=2\alpha_0\alpha_1 E[\varepsilon_0\varepsilon_1]+\sum_{i=0}^{1}\alpha_i^2\,E[\varepsilon_i^2]. \tag{7}
\]
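As a hypothetical numerical illustration (ours, not part of the paper), the moments in (6) and (7) can be estimated by Monte Carlo: in the univariate i.i.d. Gaussian case with c = 0, the conditional errors in (5) have closed forms in terms of the standard normal CDF once the sample means are given, so only the sample means need to be simulated. The parameter values below are our own choices.

```python
# Sketch: Monte Carlo estimate of E[eps] and E[eps^2] for i.i.d. univariate LDA.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def conditional_errors(xbar0, xbar1, mu0, mu1, sigma):
    """Exact eps_0, eps_1 of (5) given the sample means (c = 0)."""
    xbar = 0.5 * (xbar0 + xbar1)
    if xbar0 > xbar1:            # W <= 0  <=>  X <= xbar
        eps0 = norm.cdf((xbar - mu0) / sigma)         # class 0 labeled as 1
        eps1 = 1.0 - norm.cdf((xbar - mu1) / sigma)   # class 1 labeled as 0
    else:                        # W <= 0  <=>  X >= xbar
        eps0 = 1.0 - norm.cdf((xbar - mu0) / sigma)
        eps1 = norm.cdf((xbar - mu1) / sigma)
    return eps0, eps1

def mc_moments(mu0, mu1, sigma, n0, n1, reps=100_000):
    """Average the exact conditional error over random training sample means."""
    errs = np.empty(reps)
    for r in range(reps):
        xbar0 = rng.normal(mu0, sigma / np.sqrt(n0))
        xbar1 = rng.normal(mu1, sigma / np.sqrt(n1))
        e0, e1 = conditional_errors(xbar0, xbar1, mu0, mu1, sigma)
        errs[r] = 0.5 * e0 + 0.5 * e1          # alpha_0 = alpha_1 = 0.5
    return errs.mean(), (errs ** 2).mean()

m1, m2 = mc_moments(-1.0, 1.0, 1.0, 5, 5)
```

Since only the sign of W matters, the pooled variance estimate never needs to be drawn, which keeps the simulation exact for the designed rule.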

3. Performance of LDA classifier in Univariate Gaussian Dependent Sampling (UGDS) Model of Binary Classification

We now provide the mathematical framework to study LDA performance in a stochastic setting.

Definition 1

A process \(\mathbf X_t=\{X_t : t\in T\}\), with \(T\) being an ordered set, is called a Gaussian process if any finite-dimensional vector \([X_{t_1},X_{t_2},\dots,X_{t_n}]^T\) has the multivariate normal distribution \(N(\mu_T,\Sigma_T)\), where

\[
\mu_T=[E(X_{t_1}),E(X_{t_2}),\dots,E(X_{t_n})]^T=[\mu_1,\mu_2,\dots,\mu_n]^T
\]

and \(\Sigma_T\) is the covariance matrix, which depends on \(T=[t_1,t_2,\dots,t_n]\).

Definition 2

We refer to the following sampling procedure as the Univariate Gaussian Dependent Sampling (UGDS) Model of Binary Classification: \(\mathbf X^i_t=\{X^i_{t_i}: t_i\in T^i\}\), with \(T^i\) being two ordered sets for \(i=0,1\), are two Gaussian processes such that any finite-dimensional vector constructed by stacking the random variables of \(\mathbf X^0_t\) and \(\mathbf X^1_t\) as \([X^0_{t^0_1},X^0_{t^0_2},\dots,X^0_{t^0_{n_0}},X^1_{t^1_1},X^1_{t^1_2},\dots,X^1_{t^1_{n_1}}]^T\) possesses a multivariate normal distribution \(N(\mu_T,\Sigma_T)\), where \(\mu_T=[\mu^0_1,\mu^0_2,\dots,\mu^0_{n_0},\mu^1_1,\mu^1_2,\dots,\mu^1_{n_1}]^T\) and

\[
\Sigma_T=\begin{bmatrix}\Sigma^{00}_{n_0\times n_0} & \Sigma^{01}_{n_0\times n_1}\\ \Sigma^{10}_{n_1\times n_0} & \Sigma^{11}_{n_1\times n_1}\end{bmatrix} \tag{8}
\]

is a positive definite covariance matrix.

This model is univariate because both processes, \(\mathbf X^0_t\) and \(\mathbf X^1_t\), are collections of univariate random variables, not necessarily with the same means or variances. \(\mathbf X^0_t\) and \(\mathbf X^1_t\) are called class conditional processes. For ease of notation and without loss of mathematical generality, we assume that \(T^0\) and \(T^1\) are the same set and, therefore, we omit the superscript \(i\) from \(t_i\). Thus, henceforth we denote \(X^i_{t_i}\) by \(X^i_t\) and the stacked vector by \([X^0_{t_1},X^0_{t_2},\dots,X^0_{t_{n_0}},X^1_{t_1},X^1_{t_2},\dots,X^1_{t_{n_1}}]^T\).

Remark 1

If we assume \(\mu_T=[\mu^{0T},\mu^{1T}]^T\) with \(\mu^i=[\mu^i,\mu^i,\dots,\mu^i]^T_{1\times n_i}\); \(\Sigma^{ii}_{jj}=(\sigma^i)^2\) for \(i=0,1\), \(j=1,2,\dots,n_i\), where \((\sigma^i)^2\) is the variance of the class conditional distributions and \(\Sigma^{ii}_{jj}\) indicates the diagonal elements of the matrix \(\Sigma^{ii}\); \(\Sigma^{ii}_{jk}=0\) for \(i=0,1\), \(j,k=1,\dots,n_i\), \(j\ne k\); \(\Sigma^{01}=\mathbf 0_{n_0\times n_1}=\Sigma^{10T}\); and any future sample is independent of the training data and distributed either as \(N(\mu^0,(\sigma^0)^2)\) or \(N(\mu^1,(\sigma^1)^2)\), depending on its class, then the UGDS model reduces to the traditional i.i.d. sampling scenario defined in Section 2. Because we will want to compare classifier errors in the dependent and independent scenarios, we will sometimes use \(\varepsilon^D\) and \(\varepsilon^I\) to denote errors in the respective settings.
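To make the UGDS sampling concrete, here is a small sketch (our own illustration; the equicorrelated blocks, zero cross-class covariance, and all parameter values are assumptions, not taken from the paper) that draws stacked training vectors from the joint Gaussian of Definition 2:

```python
# Sketch: sampling one or more UGDS training vectors with a block covariance.
import numpy as np

rng = np.random.default_rng(1)

def ugds_sample(mu0, mu1, Sigma00, Sigma01, Sigma11, size=1):
    """Draw `size` stacked vectors [X^0_1..X^0_{n0}, X^1_1..X^1_{n1}]."""
    mu = np.concatenate([mu0, mu1])
    Sigma = np.block([[Sigma00, Sigma01],
                      [Sigma01.T, Sigma11]])
    # Definition 2 requires a positive definite covariance matrix
    assert np.all(np.linalg.eigvalsh(Sigma) > 0)
    return rng.multivariate_normal(mu, Sigma, size=size)

# equicorrelated classes (rho = 0.3), independent across classes (Sigma01 = 0)
n = 5
S = 0.3 * np.ones((n, n)) + 0.7 * np.eye(n)
X = ugds_sample(-np.ones(n), np.ones(n), S, np.zeros((n, n)), S, size=200_000)
xbar0, xbar1 = X[:, :n].mean(axis=1), X[:, n:].mean(axis=1)
```

With this covariance, the variance of each class sample mean is \((1+(n-1)\rho)/n\), which the simulated `xbar0` should reproduce.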

Similar to (3), employing LDA with the UGDS model instead of traditional independent sampling in order to classify a sample point taken at t, denoted by Xt, results in the following W statistic for the univariate case

\[
W(\bar X^0_T,\bar X^1_T,X_t)=(X_t-\bar X_T)(\bar X^0_T-\bar X^1_T), \tag{9}
\]

where \(\bar X^0_T=\frac{1}{n_0}\sum_{i=1}^{n_0}X^0_{t_i}\) and \(\bar X^1_T=\frac{1}{n_1}\sum_{i=1}^{n_1}X^1_{t_i}\) are the sample means for each class and \(\bar X_T=\frac{\bar X^0_T+\bar X^1_T}{2}\). The designed LDA classifier is given by

\[
\psi(X_t)=\begin{cases}1, & \text{if } W(\bar X^0_T,\bar X^1_T,X_t)\le 0\\ 0, & \text{if } W(\bar X^0_T,\bar X^1_T,X_t)>0.\end{cases} \tag{10}
\]

For the ease of notation, hereafter, we omit the subscript T from μT and ΣT.

3.1. Stochastic true error and its moments

Let \(X^i_{t_s}\) denote a test sample point, where \(i\) indicates the class conditional process from which the sample is coming, i.e. either \(\mathbf X^0_t\) or \(\mathbf X^1_t\). The auto-covariance sequence of \(X^i_{t_s}\) with the training data is defined as

\[
\rho^{ik}_s(j)=E\bigl[(X^i_{t_s}-\mu^i_s)(X^k_{t_j}-\mu^k_j)\bigr],\quad i,k=0,1,\ j=1,2,\dots,n_k, \tag{11}
\]

where \(\rho^{ik}_s(j)\) is the \(j\)th element of the sequence \(\rho^{ik}_s\). Since \(X^i_{t_s}\) is a future sample point, we assume \(2\le\max\{n_0,n_1\}<s\), unless otherwise stated. Throughout the paper, we use \(S_A\) to denote the sum of all elements of a matrix or vector \(A\). For instance, \(S_{\rho^{ik}_s}=\sum_{j=1}^{n_k}\rho^{ik}_s(j)\).

The true classifier error under the UGDS model is a function of \(t_s\). A sample point at \(t_s\) can come from either process and the classifier may misclassify it. Hence,

\[
\varepsilon_{t_s}=\alpha^0_{t_s}\varepsilon^0_{t_s}+\alpha^1_{t_s}\varepsilon^1_{t_s}, \tag{12}
\]

where \(\alpha^i_{t_s}=P(X_{t_s}\in\mathbf X^i_t)\), \(i=0,1\), is the a priori mixing probability of the two processes \(\mathbf X^0_t\) and \(\mathbf X^1_t\) at \(t_s\), and \(\varepsilon^i_{t_s}\) is the error rate specific to each process, with

\[
\varepsilon^i_{t_s}=P\bigl((-1)^i\,W(\bar X^0_T,\bar X^1_T,X_{t_s})\le 0\mid \bar X^0_T,\bar X^1_T,\ X_{t_s}\in\mathbf X^i_t\bigr). \tag{13}
\]

By replacing \(W(\bar X^0_T,\bar X^1_T,X_{t_s})\) with any proper statistic used in other classifiers, this stochastic definition of true error applies to other rules. The expected true error is also specific to \(t_s\):

\[
E[\varepsilon_{t_s}]=\sum_{i=0}^{1}\alpha^i_{t_s}\,P\bigl((-1)^i\,W(\bar X^0_T,\bar X^1_T,X_{t_s})\le 0\mid X_{t_s}\in\mathbf X^i_t\bigr). \tag{14}
\]

In (12), the true error is indexed. One could, if desired, define the true error of a classifier to be the average error the classifier induces over an index set of interest, namely, \(\varepsilon_{t_{s_1\text{-}s_2}}=\frac{1}{s_2-s_1}\sum_{s=s_1}^{s_2}\varepsilon_{t_s}\). Since characterizing \(\varepsilon_{t_s}\) yields a characterization of \(\varepsilon_{t_{s_1\text{-}s_2}}\), no generality is gained by averaging and we restrict our attention to \(\varepsilon_{t_s}\). The second moment is also a function of \(t_s\) and from (12) we get

\[
E[\varepsilon^2_{t_s}]=2\alpha^0_{t_s}\alpha^1_{t_s}E[\varepsilon^0_{t_s}\varepsilon^1_{t_s}]+\sum_{i=0}^{1}(\alpha^i_{t_s})^2E[(\varepsilon^i_{t_s})^2]. \tag{15}
\]

First focusing on \(E[(\varepsilon^0_{t_s})^2]\), the square of the probability defining \(\varepsilon^0_{t_s}\) can be factored by introducing a random variable \(X'_{t_s}\in\mathbf X^0_t\) identically distributed with \(X_{t_s}\). Writing the probabilities as integrals of indicator functions allows us to apply Fubini’s theorem, which shows that \(X_{t_s}\) and \(X'_{t_s}\) may be taken to be independent (denoted \(X_{t_s}\perp X'_{t_s}\)). The expectation can then be applied. Altogether,

\[
\begin{aligned}
E[(\varepsilon^0_{t_s})^2]&=E\Bigl[P\bigl(W(\bar X^0_T,\bar X^1_T,X_{t_s})\le 0\mid \bar X^0_T,\bar X^1_T,\ X_{t_s}\in\mathbf X^0_t\bigr)^2\Bigr]\\
&=E\Bigl[P\bigl(W(\bar X^0_T,\bar X^1_T,X_{t_s})\le 0\mid \bar X^0_T,\bar X^1_T,\ X_{t_s}\in\mathbf X^0_t\bigr)\,P\bigl(W(\bar X^0_T,\bar X^1_T,X'_{t_s})\le 0\mid \bar X^0_T,\bar X^1_T,\ X'_{t_s}\in\mathbf X^0_t\bigr)\Bigr]\\
&=E\Bigl[P\bigl(W(\bar X^0_T,\bar X^1_T,X_{t_s})\le 0,\ W(\bar X^0_T,\bar X^1_T,X'_{t_s})\le 0\mid \bar X^0_T,\bar X^1_T,\ X_{t_s},X'_{t_s}\in\mathbf X^0_t,\ X_{t_s}\perp X'_{t_s}\bigr)\Bigr]\\
&=P\bigl(W(\bar X^0_T,\bar X^1_T,X_{t_s})\le 0,\ W(\bar X^0_T,\bar X^1_T,X'_{t_s})\le 0\mid X_{t_s},X'_{t_s}\in\mathbf X^0_t,\ X_{t_s}\perp X'_{t_s}\bigr).
\end{aligned}\tag{16}
\]

\(E[(\varepsilon^1_{t_s})^2]\) and \(E[\varepsilon^0_{t_s}\varepsilon^1_{t_s}]\) can be expressed via similar factorizations. Hence,

\[
\begin{aligned}
E[\varepsilon^2_{t_s}]&=(\alpha^0_{t_s})^2\,P\bigl(W(\bar X^0_T,\bar X^1_T,X_{t_s})\le 0,\ W(\bar X^0_T,\bar X^1_T,X'_{t_s})\le 0\mid X_{t_s},X'_{t_s}\in\mathbf X^0_t,\ X_{t_s}\perp X'_{t_s}\bigr)\\
&\quad+2\alpha^0_{t_s}\alpha^1_{t_s}\,P\bigl(W(\bar X^0_T,\bar X^1_T,X_{t_s})\le 0,\ W(\bar X^0_T,\bar X^1_T,X'_{t_s})>0\mid X_{t_s}\in\mathbf X^0_t,\ X'_{t_s}\in\mathbf X^1_t,\ X_{t_s}\perp X'_{t_s}\bigr)\\
&\quad+(\alpha^1_{t_s})^2\,P\bigl(W(\bar X^0_T,\bar X^1_T,X_{t_s})>0,\ W(\bar X^0_T,\bar X^1_T,X'_{t_s})>0\mid X_{t_s},X'_{t_s}\in\mathbf X^1_t,\ X_{t_s}\perp X'_{t_s}\bigr).
\end{aligned}\tag{17}
\]

To facilitate the subsequent discussion, we will explicitly denote the dependency of the true error on the number of samples. Therefore, hereafter we use \(\varepsilon_{t_s,n_0+n_1}\) and \(\varepsilon_{t_s}\), or \(\varepsilon^2_{t_s,n_0+n_1}\) and \(\varepsilon^2_{t_s}\), interchangeably.

Throughout the paper, we use the notations \(Z<0\) or \(Z\ge 0\) to indicate componentwise inequalities, e.g. \(Z=(z_1,z_2)^T<0\) means \(z_1<0\), \(z_2<0\).

3.2. Expected performance of LDA in the UGDS model

The first moment of the classification error for LDA under the UGDS model is expressed exactly according to the following theorem.

Theorem 1

Under the UGDS model, the expected true error of LDA at ts is

\[
E[\varepsilon^D_{t_s,n_0+n_1}]=\alpha^0_{t_s}\bigl[P(Z^{I}_s<0)+P(Z^{I}_s\ge 0)\bigr]+\alpha^1_{t_s}\bigl[P(Z^{II}_s<0)+P(Z^{II}_s\ge 0)\bigr], \tag{18}
\]

where \(Z^{I}_s\) and \(Z^{II}_s\) are Gaussian bivariate vectors with

\[
\mu_{Z^{I}_s}=\Bigl[\mu^0_s-\frac{\bar\mu}{2},\ -\mu'\Bigr]^T,\qquad \mu_{Z^{II}_s}=\Bigl[\mu^1_s-\frac{\bar\mu}{2},\ \mu'\Bigr]^T,
\]
\[
\Sigma_{Z^{I}_s}=\begin{bmatrix}(\sigma^0_s)^2-\frac{S_{\rho^{00}_s}}{n_0}-\frac{S_{\rho^{01}_s}}{n_1}+\frac{S_{\Sigma^{00}}}{4n_0^2}+\frac{S_{\Sigma^{11}}}{4n_1^2}+\frac{S_{\Sigma^{01}}}{2n_0n_1} & -\frac{S_{\rho^{00}_s}}{n_0}+\frac{S_{\rho^{01}_s}}{n_1}+\frac{S_{\Sigma^{00}}}{2n_0^2}-\frac{S_{\Sigma^{11}}}{2n_1^2}\\ \cdot & \frac{S_{\Sigma^{00}}}{n_0^2}+\frac{S_{\Sigma^{11}}}{n_1^2}-\frac{2S_{\Sigma^{01}}}{n_0n_1}\end{bmatrix},
\]
\[
\Sigma_{Z^{II}_s}=\begin{bmatrix}(\sigma^1_s)^2-\frac{S_{\rho^{11}_s}}{n_1}-\frac{S_{\rho^{10}_s}}{n_0}+\frac{S_{\Sigma^{00}}}{4n_0^2}+\frac{S_{\Sigma^{11}}}{4n_1^2}+\frac{S_{\Sigma^{01}}}{2n_0n_1} & -\frac{S_{\rho^{11}_s}}{n_1}+\frac{S_{\rho^{10}_s}}{n_0}-\frac{S_{\Sigma^{00}}}{2n_0^2}+\frac{S_{\Sigma^{11}}}{2n_1^2}\\ \cdot & \frac{S_{\Sigma^{00}}}{n_0^2}+\frac{S_{\Sigma^{11}}}{n_1^2}-\frac{2S_{\Sigma^{01}}}{n_0n_1}\end{bmatrix}, \tag{19}
\]

where \(\cdot\) denotes the symmetric entry, \(\bar\mu=\sum_{i=1}^{n_0}\frac{\mu^0_i}{n_0}+\sum_{i=1}^{n_1}\frac{\mu^1_i}{n_1}\), \(\mu'=\sum_{i=1}^{n_0}\frac{\mu^0_i}{n_0}-\sum_{i=1}^{n_1}\frac{\mu^1_i}{n_1}\), and \(\mu^i_s\) and \((\sigma^i_s)^2\) are the mean and variance of random variables at \(t_s\) from class \(i\), \(i=0,1\), with the auto-covariance \(\rho^{ik}_s\) defined as in (11).

Proof

See the Appendix.

We note that under conditions stated in Remark 1, Theorem 1 reduces to Theorem 1 in [3].
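As a sanity check on this reduction (our own sketch, not code from the paper), the i.i.d. specialization of (18)–(19) can be evaluated with bivariate normal orthant probabilities and compared against a Monte Carlo estimate of the expected error:

```python
# Sketch: Theorem 1 under the i.i.d. conditions of Remark 1, checked by Monte Carlo.
import numpy as np
from scipy.stats import multivariate_normal, norm

def expected_error_iid(mu0, mu1, sigma, n0, n1, alpha0=0.5):
    """E[eps] from (18)-(19) with Sigma^{ii} diagonal, Sigma^{01} = 0, rho_s = 0."""
    mu_p = mu0 - mu1                       # mu'
    s2 = sigma ** 2
    off = s2 * (1 / (2 * n0) - 1 / (2 * n1))
    v1 = s2 * (1 + 1 / (4 * n0) + 1 / (4 * n1))
    v2 = s2 * (1 / n0 + 1 / n1)
    def orthant_sum(mean, off_sign):       # P(Z < 0) + P(Z >= 0)
        cov = np.array([[v1, off_sign * off], [off_sign * off, v2]])
        m = np.asarray(mean, dtype=float)
        return (multivariate_normal(mean=m, cov=cov).cdf([0.0, 0.0])
                + multivariate_normal(mean=-m, cov=cov).cdf([0.0, 0.0]))
    return (alpha0 * orthant_sum([mu_p / 2, -mu_p], +1)       # Z^I term
            + (1 - alpha0) * orthant_sum([-mu_p / 2, mu_p], -1))  # Z^II term

def mc_expected_error(mu0, mu1, sigma, n0, n1, reps=200_000, seed=2):
    """Monte Carlo over training sample means, conditional errors in closed form."""
    rng = np.random.default_rng(seed)
    xb0 = rng.normal(mu0, sigma / np.sqrt(n0), reps)
    xb1 = rng.normal(mu1, sigma / np.sqrt(n1), reps)
    xb = 0.5 * (xb0 + xb1)
    sgn = np.where(xb0 > xb1, 1.0, -1.0)   # orientation of the designed rule
    e0 = norm.cdf(sgn * (xb - mu0) / sigma)         # class-0 conditional error
    e1 = 1.0 - norm.cdf(sgn * (xb - mu1) / sigma)   # class-1 conditional error
    return float(np.mean(0.5 * e0 + 0.5 * e1))

analytic = expected_error_iid(-1.0, 1.0, 1.0, 5, 5)
mc = mc_expected_error(-1.0, 1.0, 1.0, 5, 5)
```

The event \(W\le 0\) equals "both components of \(Z^I_s\) share a sign", which is why each term is a sum of two orthant probabilities.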

Let Φ(x, y; ρ) be the cumulative bivariate normal distribution defined as:

\[
\Phi(x,y;\rho)=\int_{-\infty}^{x}\int_{-\infty}^{y}\psi(u,v;\rho)\,du\,dv,\qquad \psi(u,v;\rho)=\frac{1}{2\pi\sqrt{1-\rho^2}}\exp\Bigl\{-\frac{u^2+v^2-2\rho uv}{2(1-\rho^2)}\Bigr\}. \tag{20}
\]

We have the following Corollary.

Corollary 2

In the model considered in Theorem 1, let the training samples from each class have the same mean, that is, \(\mu=[\mu^{0T},\mu^{1T}]^T\) in which \(\mu^i=[\mu^i,\mu^i,\dots,\mu^i]^T_{1\times n_i}\); let \(\mu^i_s=\mu^i\) and \(\sigma^i_s=\sigma\), \(i=0,1\), meaning the test data at \(t_s\) have equal variances across classes; and let \(\alpha^0_{t_s}=\alpha^1_{t_s}=0.5\). Furthermore, let \(S_{\rho^{ik}_s}=0\), \(i,k=0,1\). Then

\[
E[\varepsilon^D_{t_s,n_0+n_1}]=\frac12-\frac{L(h,k;\rho)}{2}, \tag{21}
\]

where

\[
L(x,y;\rho)=\int_{-x}^{x}\int_{-y}^{y}\psi(u,v;\rho)\,du\,dv, \tag{22}
\]
\[
h=\frac{\mu'}{2\sqrt a},\quad k=\frac{\mu'}{\sqrt b},\quad \mu'=\mu^0-\mu^1,\quad \rho=\frac{\dfrac{S_{\Sigma^{00}}}{2n_0^2}-\dfrac{S_{\Sigma^{11}}}{2n_1^2}}{\sqrt{ab}},\quad a=\sigma^2+\frac{S_{\Sigma^{00}}}{4n_0^2}+\frac{S_{\Sigma^{11}}}{4n_1^2}+\frac{S_{\Sigma^{01}}}{2n_0n_1},\quad b=\frac{S_{\Sigma^{00}}}{n_0^2}+\frac{S_{\Sigma^{11}}}{n_1^2}-\frac{2S_{\Sigma^{01}}}{n_0n_1}. \tag{23}
\]
Proof

See the Appendix.

To further proceed we present the following lemma, in which ∧ denotes conjunction.

Lemma 3

Let \(\Phi(x,y;\rho)\) be the cumulative bivariate normal distribution defined in (20) and define \(F(x,y;\rho)\) and \(G(x,y;\rho)\) as follows:

\[
F(x,y;\rho)=\Phi(x,y;\rho)+\Phi(-x,-y;\rho),\qquad G(x,y;\rho)=F(x,y;\rho)+F(x,y;-\rho), \tag{24}
\]

where \(x\) and \(y\) are two constants such that \(xy<0\). Then, for \(0\le\lambda_x<1\), \(0\le\lambda_y<1\),

\[
(\rho_1\ge\rho_0)\wedge(x_1=\lambda_x x_0)\wedge(y_1=\lambda_y y_0)\ \Rightarrow\ G(x_1,y_1;\rho_1)>G(x_0,y_0;\rho_0). \tag{25}
\]
Proof

See the Appendix.

Using this Lemma, we compare the expected true error of the UGDS model with the independent sampling model.

Corollary 4

In the model considered in Corollary 2, let \((\sigma^i_j)^2\triangleq\Sigma^{ii}_{jj}\), \(i=0,1\), \(j=1,2,\dots,n_i\), and let

\[
\rho_D=\frac{\dfrac{S_{\Sigma^{00}}}{n_0^2}-\dfrac{S_{\Sigma^{11}}}{n_1^2}}{\sqrt{\Bigl(\sigma^2+\dfrac{S_{\Sigma^{00}}}{4n_0^2}+\dfrac{S_{\Sigma^{11}}}{4n_1^2}+\dfrac{S_{\Sigma^{01}}}{2n_0n_1}\Bigr)\Bigl(\dfrac{S_{\Sigma^{00}}}{n_0^2}+\dfrac{S_{\Sigma^{11}}}{n_1^2}-\dfrac{2S_{\Sigma^{01}}}{n_0n_1}\Bigr)}},\qquad
\rho_I=\frac{\sum_{j=1}^{n_0}\dfrac{(\sigma^0_j)^2}{n_0^2}-\sum_{j=1}^{n_1}\dfrac{(\sigma^1_j)^2}{n_1^2}}{\sqrt{\Bigl(\sigma^2+\sum_{j=1}^{n_0}\dfrac{(\sigma^0_j)^2}{4n_0^2}+\sum_{j=1}^{n_1}\dfrac{(\sigma^1_j)^2}{4n_1^2}\Bigr)\Bigl(\sum_{j=1}^{n_0}\dfrac{(\sigma^0_j)^2}{n_0^2}+\sum_{j=1}^{n_1}\dfrac{(\sigma^1_j)^2}{n_1^2}\Bigr)}}. \tag{26}
\]

Let \(E[\varepsilon^I_{t_s,n_0+n_1}]\) be the expectation of the true error of the classifier in (10) specific to \(t_s\) and constructed as if all \(n_0+n_1\) training samples were i.i.d. (same mean and variance). Then

\[
(\rho_D\ge\rho_I)\wedge\Bigl(\frac{\tilde S_{\Sigma^{00}}}{n_0^2}+\frac{\tilde S_{\Sigma^{11}}}{n_1^2}\ge\max\Bigl\{\frac{2\tilde S_{\Sigma^{01}}}{n_0n_1},-\frac{2\tilde S_{\Sigma^{01}}}{n_0n_1}\Bigr\}\Bigr)\ \Rightarrow\ E[\varepsilon^D_{t_s,n_0+n_1}]\ge E[\varepsilon^I_{t_s,n_0+n_1}], \tag{27}
\]
\[
(\rho_D\le\rho_I)\wedge\Bigl(\frac{\tilde S_{\Sigma^{00}}}{n_0^2}+\frac{\tilde S_{\Sigma^{11}}}{n_1^2}\le\min\Bigl\{\frac{2\tilde S_{\Sigma^{01}}}{n_0n_1},-\frac{2\tilde S_{\Sigma^{01}}}{n_0n_1}\Bigr\}\Bigr)\ \Rightarrow\ E[\varepsilon^D_{t_s,n_0+n_1}]\le E[\varepsilon^I_{t_s,n_0+n_1}], \tag{28}
\]

where \(\tilde S_A\) is the sum of the off-diagonal elements of a matrix \(A\), defined as \(\tilde S_A=\sum_{i,j,\,i\ne j}a_{ij}\).

Proof

Find the expected true error in Theorem 1 using the conditions of the corollary and compare it to the expected true error determined by setting all off-diagonal elements of \(\Sigma^{ii}\), \(i=0,1\), to zero and \(\Sigma^{01}=\mathbf 0_{n_0\times n_1}\). The proof follows by using the results of Lemma 3 in Theorem 1.

A more restricted set of sufficient conditions than those presented in Corollary 4 follows.

Corollary 5

In the model considered in Corollary 2, let \((\sigma^i_j)^2\triangleq\Sigma^{ii}_{jj}\), \(i=0,1\), \(j=1,2,\dots,n_i\), and

\[
\frac{S_{\Sigma^{00}}}{n_0^2}-\frac{S_{\Sigma^{11}}}{n_1^2}=\sum_{j=1}^{n_0}\frac{(\sigma^0_j)^2}{n_0^2}-\sum_{j=1}^{n_1}\frac{(\sigma^1_j)^2}{n_1^2}. \tag{29}
\]

Then

\[
\frac{\tilde S_{\Sigma^{00}}}{n_0^2}+\frac{\tilde S_{\Sigma^{11}}}{n_1^2}\ge\max\Bigl\{\frac{2\tilde S_{\Sigma^{01}}}{n_0n_1},-\frac{2\tilde S_{\Sigma^{01}}}{n_0n_1}\Bigr\}\ \Rightarrow\ E[\varepsilon^D_{t_s,n_0+n_1}]\ge E[\varepsilon^I_{t_s,n_0+n_1}], \tag{30}
\]
\[
\frac{\tilde S_{\Sigma^{00}}}{n_0^2}+\frac{\tilde S_{\Sigma^{11}}}{n_1^2}\le\min\Bigl\{\frac{2\tilde S_{\Sigma^{01}}}{n_0n_1},-\frac{2\tilde S_{\Sigma^{01}}}{n_0n_1}\Bigr\}\ \Rightarrow\ E[\varepsilon^D_{t_s,n_0+n_1}]\le E[\varepsilon^I_{t_s,n_0+n_1}], \tag{31}
\]

where \(\tilde S_A\) is the sum of the off-diagonal elements of a matrix \(A\), defined as \(\tilde S_A=\sum_{i,j,\,i\ne j}a_{ij}\).

Proof

The proof is similar to that of Corollary 4.

To get a sense of the conditions stated in Corollary 5, consider a scenario in which \(n_0=n_1\), the sample points in each class are equi-correlated with correlation \(\rho\), and there is independent sampling across classes. This satisfies (29). If \(\rho>0\), then (30) holds and \(E[\varepsilon^D_{t_s,n_0+n_1}]\ge E[\varepsilon^I_{t_s,n_0+n_1}]\). If \(\rho<0\) and the class covariance matrices are positive definite, then (31) holds and \(E[\varepsilon^D_{t_s,n_0+n_1}]<E[\varepsilon^I_{t_s,n_0+n_1}]\).

Let us reflect on Corollaries 4 and 5. A correlated set of \(n\) sample points can be considered as a set in which the points convey some information about each other. Therefore, they are often considered to be as informative as \(n'\) independent samples with \(n'<n\), thereby producing a poorer classifier. This intuition aligns with the simple situation in which the sample points in each class are equi-correlated with \(\rho>0\) and the sample points across the two classes are uncorrelated. This scenario is a special case of (30) and \(E[\varepsilon^D_{t_s,n_0+n_1}]\ge E[\varepsilon^I_{t_s,n_0+n_1}]\). However, (31) shows that there are correlation patterns that result in an expected true error smaller than it would be were there independent sampling, which means that sampling satisfying (31) is like having a larger sample size than if sampling were independent.

To illustrate, in the UGDS model suppose the training sample points are identically distributed according to two Gaussian distributions, \(N(-1,1)\) for class 0 and \(N(1,1)\) for class 1. Let \(n_0=n_1=3\) and assume that any future test point is also distributed identically to the training data of its class. Furthermore, assume the data are generated via two different scenarios, a and b, such that \(\Sigma^{01}=\mathbf 0_{3\times 3}\) and, for \(i=0,1\),

\[
\Sigma^{ii}_a=\begin{bmatrix}1 & -1/4 & -1/4\\ -1/4 & 1 & \rho\\ -1/4 & \rho & 1\end{bmatrix},\qquad \Sigma^{ii}_b=\begin{bmatrix}1 & 1/4 & 1/4\\ 1/4 & 1 & \rho\\ 1/4 & \rho & 1\end{bmatrix}. \tag{32}
\]

Figure 1(a) shows the expected true error of the classifier designed in scenario a as a function of ρ. It demonstrates that for some dependency patterns, as defined by the covariance matrix, the classifier has better performance than if the sampling were independent. Note that in Fig. 1(a) the curves meet at ρ = 0.5, the point of equality for the inequalities (30) and (31). Note also that for ρ = −0.499, E[ε6D]=E[ε18I]=0.165. Hence, for the sampling covariance matrix (32), 3 points have the effect of 9 independent points. In general, better classification accuracy may be achieved if the sample points are collected according to specific schemes. Equations (28) and (31) provide sufficient sets of conditions that result in such schemes.
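The value quoted above can be checked directly. The following sketch is ours (not code from the paper): it evaluates Corollary 2 for scenario (a) of (32), using the rectangle-probability identity \(L(x,y;\rho)=\Phi(x,y;\rho)-\Phi(-x,y;\rho)-\Phi(x,-y;\rho)+\Phi(-x,-y;\rho)\), which also holds for negative integration limits.

```python
# Sketch: E[eps] of Corollary 2 for scenario (a) at rho = -0.499.
import numpy as np
from scipy.stats import multivariate_normal

def expected_error_cor2(mu_p, sigma, S00, S11, S01, n0, n1):
    """E[eps] = 1/2 - L(h, k; rho)/2 with h, k, rho, a, b as in (23)."""
    a = sigma**2 + S00 / (4 * n0**2) + S11 / (4 * n1**2) + S01 / (2 * n0 * n1)
    b = S00 / n0**2 + S11 / n1**2 - 2 * S01 / (n0 * n1)
    rho = (S00 / (2 * n0**2) - S11 / (2 * n1**2)) / np.sqrt(a * b)
    h, k = mu_p / (2 * np.sqrt(a)), mu_p / np.sqrt(b)
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    L = (mvn.cdf([h, k]) - mvn.cdf([-h, k])
         - mvn.cdf([h, -k]) + mvn.cdf([-h, -k]))
    return 0.5 - 0.5 * L

# scenario (a) of (32): rho = -0.499, n0 = n1 = 3, mu' = -2, sigma = 1
r = -0.499
Sig_a = np.array([[1.0, -0.25, -0.25], [-0.25, 1.0, r], [-0.25, r, 1.0]])
S = Sig_a.sum()                    # S_{Sigma^{ii}}, identical for both classes
e_dep = expected_error_cor2(-2.0, 1.0, S, S, 0.0, 3, 3)
# i.i.d. sampling with 9 points per class: S_{Sigma^{ii}} = 9 * sigma^2
e_ind9 = expected_error_cor2(-2.0, 1.0, 9.0, 9.0, 0.0, 9, 9)
```

`e_dep` should land at roughly 0.165 and essentially coincide with `e_ind9`, reproducing the "3 dependent points behave like 9 independent points" observation.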

Figure 1. (a) Expected true error of constructed classifiers in scenario a as a function of ρ; (b) expected true error of constructed classifiers in scenario b as a function of ρ. Solid lines: dependent samples; dashed lines: independent samples. The horizontal line shows the performance of the constructed classifier as if the samples were independent.

Figure 1(b) shows the expected true error of a classifier constructed in scenario b by varying ρ over the same range as in scenario a. The only difference between scenarios a and b is that the covariances between the first sample point and the other sample points are changed to positive values. This results in the curve for dependent sampling in Figure 1(b) lying substantially above the curve for independent sampling.

3.3. Second moment of LDA true error in the UGDS model

Next we obtain the second moment of true error of LDA at ts as defined in (17).

Theorem 6

Under the UGDS model, the second moment of the LDA true error at \(t_s\) is

\[
E[(\varepsilon^D_{t_s,n_0+n_1})^2]=(\alpha^0_{t_s})^2\bigl[P(Z^{I}_s<0)+P(Z^{I}_s\ge 0)\bigr]+2\alpha^0_{t_s}\alpha^1_{t_s}\bigl[P(Z^{II}_s<0)+P(Z^{II}_s\ge 0)\bigr]+(\alpha^1_{t_s})^2\bigl[P(Z^{III}_s<0)+P(Z^{III}_s\ge 0)\bigr], \tag{33}
\]

where \(Z^{j}_s\), \(j=\mathrm{I},\mathrm{II},\mathrm{III}\), is a 3-variate Gaussian random vector with mean and covariance matrices as follows:

\[
\mu_{Z^{I}_s}=\Bigl[\mu^0_s-\frac{\bar\mu}{2},\ -\mu',\ \mu^0_s-\frac{\bar\mu}{2}\Bigr]^T,\qquad \mu_{Z^{II}_s}=\Bigl[\mu^0_s-\frac{\bar\mu}{2},\ -\mu',\ -\mu^1_s+\frac{\bar\mu}{2}\Bigr]^T, \tag{34}
\]

and, for \(i,j=0,1\), \(i\ne j\), letting

\[
z^i_s=(\sigma^i_s)^2-\frac{S_{\rho^{ii}_s}}{n_i}-\frac{S_{\rho^{ij}_s}}{n_j}+\frac{S_{\Sigma^{00}}}{4n_0^2}+\frac{S_{\Sigma^{11}}}{4n_1^2}+\frac{S_{\Sigma^{01}}}{2n_0n_1}, \tag{35}
\]

we have

\[
\Sigma_{Z^{I}_s}=\begin{bmatrix}z^0_s & -\frac{S_{\rho^{00}_s}}{n_0}+\frac{S_{\rho^{01}_s}}{n_1}+\frac{S_{\Sigma^{00}}}{2n_0^2}-\frac{S_{\Sigma^{11}}}{2n_1^2} & z^0_s-(\sigma^0_s)^2\\ \cdot & \frac{S_{\Sigma^{00}}}{n_0^2}+\frac{S_{\Sigma^{11}}}{n_1^2}-\frac{2S_{\Sigma^{01}}}{n_0n_1} & -\frac{S_{\rho^{00}_s}}{n_0}+\frac{S_{\rho^{01}_s}}{n_1}+\frac{S_{\Sigma^{00}}}{2n_0^2}-\frac{S_{\Sigma^{11}}}{2n_1^2}\\ \cdot & \cdot & z^0_s\end{bmatrix},
\]
\[
\Sigma_{Z^{II}_s}=\begin{bmatrix}z^0_s & -\frac{S_{\rho^{00}_s}}{n_0}+\frac{S_{\rho^{01}_s}}{n_1}+\frac{S_{\Sigma^{00}}}{2n_0^2}-\frac{S_{\Sigma^{11}}}{2n_1^2} & -z^0_s+(\sigma^0_s)^2+\frac{S_{\rho^{11}_s}-S_{\rho^{01}_s}}{2n_1}+\frac{S_{\rho^{10}_s}-S_{\rho^{00}_s}}{2n_0}\\ \cdot & \frac{S_{\Sigma^{00}}}{n_0^2}+\frac{S_{\Sigma^{11}}}{n_1^2}-\frac{2S_{\Sigma^{01}}}{n_0n_1} & -\frac{S_{\rho^{11}_s}}{n_1}+\frac{S_{\rho^{10}_s}}{n_0}-\frac{S_{\Sigma^{00}}}{2n_0^2}+\frac{S_{\Sigma^{11}}}{2n_1^2}\\ \cdot & \cdot & z^1_s\end{bmatrix}, \tag{36}
\]

with \(\cdot\) denoting the symmetric entries, \(\bar\mu=\sum_{i=1}^{n_0}\frac{\mu^0_i}{n_0}+\sum_{i=1}^{n_1}\frac{\mu^1_i}{n_1}\), \(\mu'=\sum_{i=1}^{n_0}\frac{\mu^0_i}{n_0}-\sum_{i=1}^{n_1}\frac{\mu^1_i}{n_1}\), \(S_A\) the sum of all elements of a matrix \(A\), defined as \(S_A=\sum_{i,j}a_{ij}\), and \(\mu^i_s\) and \((\sigma^i_s)^2\) the mean and variance of random variables at \(t_s\) from class \(i\), \(i=0,1\). Furthermore, \(\mu_{Z^{III}_s}\) is obtained from \(\mu_{Z^{I}_s}\) by replacing \(-\mu'\) with \(\mu'\) and \(\mu^0_s\) with \(\mu^1_s\), and \(\Sigma_{Z^{III}_s}\) is obtained from \(\Sigma_{Z^{I}_s}\) by exchanging \(n_0\) and \(n_1\), \((\sigma^0_s)^2\) and \((\sigma^1_s)^2\), \(S_{\Sigma^{00}}\) and \(S_{\Sigma^{11}}\), \(S_{\rho^{00}_s}\) and \(S_{\rho^{11}_s}\), and \(S_{\rho^{01}_s}\) and \(S_{\rho^{10}_s}\).

Proof

See the Appendix.

Let \(E[\varepsilon^I_{t_s,n_0+n_1}]\) and \(E[(\varepsilon^I_{t_s,n_0+n_1})^2]\) be the first and second moments of the true error of the classifier in (10) specific to \(t_s\) and constructed from \(n_0+n_1\) independent training sample points having common mean and variance. Then we have the following corollary.

Corollary 7

In the model considered in Corollary 5, further assume \(n_0=n_1=n\), \(\Sigma^{01}=\mathbf 0_{n\times n}\), and, for \(k,j=1,2,\dots,n\),

\[
\Sigma^{00}_{kj}=\Sigma^{11}_{kj}=\begin{cases}\sigma^2, & k=j\\ \rho>0, & \text{otherwise},\end{cases} \tag{37}
\]

where \(\sigma^2\) is the common variance of the test sample points across classes at \(t_s\). Let \(m\) be the number of additional dependent training points in each class with the same class conditional means and dependency structure, meaning \(\Sigma^{ii}_{kj}\) as in (37) for \(k,j=1,2,\dots,n+m\) and \(\Sigma^{01}=\mathbf 0_{(n+m)\times(n+m)}\), that are required to make \(E[\varepsilon^D_{t_s,2n+2m}]=E[\varepsilon^I_{t_s,2n}]\). This number also makes \(E[(\varepsilon^D_{t_s,2n+2m})^2]=E[(\varepsilon^I_{t_s,2n})^2]\) and is given by

\[
m=\frac{n-1}{\dfrac{\sigma^2}{n\rho}-1}. \tag{38}
\]
Proof

The proof of E[εts,2n+2mD]=E[εts,2nI] follows by equating elements of covariance matrices obtained for the dependent model in (19) with the covariance matrices for the independent sampling model. Under the conditions of the corollary, these matrices in the independent sampling scenario (given by Theorem 1 in [3]) are

\[
\Sigma_{Z^{I}_s}=\Sigma_{Z^{II}_s}=\Sigma_{Z^{III}_s}=\begin{bmatrix}\sigma^2+\dfrac{\sigma^2}{2n} & 0\\ 0 & \dfrac{2\sigma^2}{n}\end{bmatrix}. \tag{39}
\]

Furthermore, we note that the conditions stated in this corollary satisfy the condition stated in (30), and hence \(E[\varepsilon^D_{t_s,2n}]\ge E[\varepsilon^I_{t_s,2n}]\). The proof of \(E[(\varepsilon^D_{t_s,2n+2m})^2]=E[(\varepsilon^I_{t_s,2n})^2]\) follows similarly by equating the covariance matrices presented in (36) with those presented in Theorem 2 in [3].

In (38), if \(\rho>\frac{\sigma^2}{n}\), then \(m<0\), meaning that adding any number of additional points under the dependency model in the corollary does not lower \(E[\varepsilon^D_{t_s,2n+2m}]\) and \(E[(\varepsilon^D_{t_s,2n+2m})^2]\) to the level of the first and second moments of the true error of the LDA classifier constructed as if the original \(2n\) training samples were independent.
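A quick numerical sketch (ours) of (38) and of the matching argument behind it; the function name `extra_points` and the parameter values are our own assumptions:

```python
# Sketch: m of (38) and a check that the per-class mean-difference variance
# of the dependent model with n + m points matches the independent model with n.
def extra_points(n, sigma, rho):
    """m such that 2(n + m) equicorrelated points match 2n independent ones."""
    return (n - 1) / (sigma**2 / (n * rho) - 1)

n, sigma, rho = 5, 1.0, 0.05
m = extra_points(n, sigma, rho)              # fractional in general
# per-class contribution to the (2,2) entry, dependent model with n + m points:
var_dep = (sigma**2 + (n + m - 1) * rho) / (n + m)
# per-class contribution to the (2,2) entry in (39): sigma^2 / n
var_ind = sigma**2 / n
```

With these values m is positive (since ρ < σ²/n) and the two variances agree exactly, which is precisely how (38) is derived.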

4. Applications

In this section we study applications to two common models used in signal processing, the first-order autoregressive model, AR(1), and the first-order moving-average model, MA(1), by assuming the training data are generated by the output processes of the two models. \(Z^0_t\) and \(Z^1_t\) are two independent white-noise processes and \(\mathbf X^0_t\) and \(\mathbf X^1_t\) are the processes producing the system output. The goal is to characterize the performance of the LDA classifier as a function of the sample size, the parameters of the white-noise processes, and the autoregressive coefficients.

4.1. First-order autoregressive model AR(1)

We consider two AR(1) models:

\[
X^i_t=c_i+\psi_i X^i_{t-1}+Z^i_t,\qquad i=0,1, \tag{40}
\]

where \(\psi_i\) is a constant such that \(0<|\psi_i|<1\), \(i=0,1\), and \(Z^0_t\sim N(0,\sigma_0^2)\) and \(Z^1_t\sim N(0,\sigma_1^2)\), for all \(t\), are independent from each other. Then \(\mathbf X^0_t=\{X^0_t:0<t<\infty\}\) and \(\mathbf X^1_t=\{X^1_t:0<t<\infty\}\) are two independent covariance-stationary processes and we have the following theorem.

Theorem 8

Let Xt0,Xt1 in the UGDS model be defined by the two independent covariance-stationary AR(1) processes as defined in (40). Then, at ts, where max{n0, n1} < s, the expected true error of LDA constructed using the training samples [Xt10,Xt20,,Xtn00] and [Xt11,Xt21,,Xtn11] is

\[
E[\varepsilon^{AR(1)}_{t_s,n_0+n_1}]=\alpha^0_{t_s}\bigl[P(Z^{I}_s<0)+P(Z^{I}_s\ge 0)\bigr]+\alpha^1_{t_s}\bigl[P(Z^{II}_s<0)+P(Z^{II}_s\ge 0)\bigr], \tag{41}
\]

where \(Z^{I}_s\) and \(Z^{II}_s\) are Gaussian bivariate vectors with

\[
\mu_{Z^{I}_s}=\Bigl[\frac{\mu'}{2},\ -\mu'\Bigr]^T,\qquad \mu_{Z^{II}_s}=\Bigl[-\frac{\mu'}{2},\ \mu'\Bigr]^T,
\]
\[
\Sigma_{Z^{I}_s}=\begin{bmatrix}\frac{\sigma_0^2}{1-\psi_0^2}-\frac{S_{\rho^{00}_s}}{n_0}+\frac{S_{\Sigma^{00}}}{4n_0^2}+\frac{S_{\Sigma^{11}}}{4n_1^2} & -\frac{S_{\rho^{00}_s}}{n_0}+\frac{S_{\Sigma^{00}}}{2n_0^2}-\frac{S_{\Sigma^{11}}}{2n_1^2}\\ \cdot & \frac{S_{\Sigma^{00}}}{n_0^2}+\frac{S_{\Sigma^{11}}}{n_1^2}\end{bmatrix},\qquad
\Sigma_{Z^{II}_s}=\begin{bmatrix}\frac{\sigma_1^2}{1-\psi_1^2}-\frac{S_{\rho^{11}_s}}{n_1}+\frac{S_{\Sigma^{00}}}{4n_0^2}+\frac{S_{\Sigma^{11}}}{4n_1^2} & -\frac{S_{\rho^{11}_s}}{n_1}-\frac{S_{\Sigma^{00}}}{2n_0^2}+\frac{S_{\Sigma^{11}}}{2n_1^2}\\ \cdot & \frac{S_{\Sigma^{00}}}{n_0^2}+\frac{S_{\Sigma^{11}}}{n_1^2}\end{bmatrix}, \tag{42}
\]

where

\[
\mu'=\frac{c_0}{1-\psi_0}-\frac{c_1}{1-\psi_1},\qquad
S_{\rho^{ii}_s}=\psi_i^{(s-n_i)}\frac{\sigma_i^2}{1-\psi_i^2}\cdot\frac{1-\psi_i^{n_i}}{1-\psi_i},\qquad
S_{\Sigma^{ii}}=\frac{\sigma_i^2}{(1-\psi_i^2)(1-\psi_i)}\Bigl[n_i(1+\psi_i)-2\psi_i\frac{1-\psi_i^{n_i}}{1-\psi_i}\Bigr]. \tag{43}
\]
Proof

See the Appendix.
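The closed forms in (43) can be checked numerically (our own sketch, not from the paper) by brute-force summation of the stationary AR(1) autocovariance \(\gamma(h)=\psi^{|h|}\sigma^2/(1-\psi^2)\):

```python
# Sketch: verify the AR(1) closed forms of (43) against direct summation.
import numpy as np

def ar1_sums_numeric(n, s, sigma, psi):
    """Brute-force S_{rho_s^{ii}} and S_{Sigma^{ii}} from gamma(h)."""
    gamma = lambda h: sigma**2 * psi**abs(h) / (1.0 - psi**2)
    S_rho = sum(gamma(s - j) for j in range(1, n + 1))
    S_Sig = sum(gamma(j - k) for j in range(1, n + 1) for k in range(1, n + 1))
    return S_rho, S_Sig

def ar1_sums_closed(n, s, sigma, psi):
    """The closed forms stated in (43)."""
    g = sigma**2 / (1.0 - psi**2)
    S_rho = psi**(s - n) * g * (1.0 - psi**n) / (1.0 - psi)
    S_Sig = (sigma**2 / ((1.0 - psi**2) * (1.0 - psi))
             * (n * (1.0 + psi) - 2.0 * psi * (1.0 - psi**n) / (1.0 - psi)))
    return S_rho, S_Sig

num = ar1_sums_numeric(6, 10, 1.3, 0.4)
clo = ar1_sums_closed(6, 10, 1.3, 0.4)
```

Both summations are geometric, which is why the closed forms exist; the identity holds for negative ψ as well.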

Corollary 9

In the model considered in Theorem 8, let \(\psi_0=\psi_1=\psi\), \(\sigma_0=\sigma_1=\sigma\), and \(\alpha^0_{t_s}=\alpha^1_{t_s}\). Let \(E[\varepsilon^{AR(1)\mid\psi}_{t_s,n_0+n_1}]\) denote the expected true error of an AR(1) model with AR coefficient \(\psi\) specific to \(t_s\). Then

\[
\lim_{s\to\infty}E[\varepsilon^{AR(1)\mid\psi}_{t_s,n_0+n_1}]=\frac12-\frac{L(h,k;\rho)}{2}, \tag{44}
\]

where \(L(h,k;\rho)\) is defined in (22) and

\[
h=\frac{\mu'}{2\sqrt a},\quad k=\frac{\mu'}{\sqrt b},\quad \mu'=\frac{c_0-c_1}{1-\psi},\quad a=\frac{\sigma^2}{1-\psi^2}+\frac b4,
\]
\[
b=\frac{\sigma^2\Bigl[(1+\psi)\bigl(\frac{1}{n_0}+\frac{1}{n_1}\bigr)-\frac{2\psi}{1-\psi}\bigl(\frac{1-\psi^{n_0}}{n_0^2}+\frac{1-\psi^{n_1}}{n_1^2}\bigr)\Bigr]}{(1-\psi^2)(1-\psi)},\qquad
\rho=\frac{\sigma^2\Bigl[(1+\psi)\bigl(\frac{1}{n_0}-\frac{1}{n_1}\bigr)-\frac{2\psi}{1-\psi}\bigl(\frac{1-\psi^{n_0}}{n_0^2}-\frac{1-\psi^{n_1}}{n_1^2}\bigr)\Bigr]}{2(1-\psi^2)(1-\psi)\sqrt{ab}}. \tag{45}
\]
Proof

See the Appendix.

We consider \(E[\varepsilon^{AR(1)}_{t_s,2n}]\) as a function of \(\psi\) and compare it to the case where \(\psi=0\), which corresponds to the stochastic i.i.d. setting.

Corollary 10

In the model considered in Corollary 9, let n0 = n1 = n. Furthermore, let E[εts,2nI] be the expected true error of the LDA classifier with ψ = 0 in (40). Let ψ′ and ψ″ be two arbitrary values of the AR coefficient ψ. Then

\[
\psi''>\psi'\ \Rightarrow\ \lim_{s\to\infty}E[\varepsilon^{AR(1)\mid\psi''}_{t_s,2n}]<\lim_{s\to\infty}E[\varepsilon^{AR(1)\mid\psi'}_{t_s,2n}]. \tag{46}
\]

Hence,

\[
0<\psi<1\ \Rightarrow\ \lim_{s\to\infty}E[\varepsilon^{AR(1)\mid\psi}_{t_s,2n}]<\lim_{s\to\infty}E[\varepsilon^{I}_{t_s,2n}],\qquad
-1<\psi<0\ \Rightarrow\ \lim_{s\to\infty}E[\varepsilon^{AR(1)\mid\psi}_{t_s,2n}]>\lim_{s\to\infty}E[\varepsilon^{I}_{t_s,2n}]. \tag{47}
\]
Proof

See the Appendix.

Corollary 10 shows that, if \(\psi\in(0,1)\), then under the conditions of the corollary, constructing an LDA classifier to differentiate between AR processes is beneficial in terms of the expected true error tested on sufficiently lagged data; for \(\psi\in(-1,0)\), however, the expected true error is larger.

4.2. First-order moving-average model MA(1)

We consider the MA(1) models

\[
X^i_t=c_i+Z^i_t+\theta_i Z^i_{t-1},\qquad i=0,1, \tag{48}
\]

where \(\theta_i\in\mathbb R\) and \(Z^0_t\sim N(0,\sigma_0^2)\) and \(Z^1_t\sim N(0,\sigma_1^2)\), for all \(t\), are independent from each other. Then \(\mathbf X^0_t=\{X^0_t:0<t<\infty\}\) and \(\mathbf X^1_t=\{X^1_t:0<t<\infty\}\) are two independent, covariance-stationary processes regardless of the values of \(\theta_i\) [18].

Theorem 11

Let Xt0,Xt1 in the UGDS model be defined by the two independent covariance-stationary MA(1) processes defined in (48). Then, at ts, where max{n0, n1} + 1 < s, the expected true error of an LDA classifier constructed using the training samples [Xt10,Xt20,,Xtn00] and [Xt11,Xt21,,Xtn11] is

\[
E[\varepsilon^{MA(1)}_{t_s,n_0+n_1}]=\alpha^0_{t_s}\bigl[P(Z^{I}_s<0)+P(Z^{I}_s\ge 0)\bigr]+\alpha^1_{t_s}\bigl[P(Z^{II}_s<0)+P(Z^{II}_s\ge 0)\bigr], \tag{49}
\]

where \(Z^{I}_s\) and \(Z^{II}_s\) are Gaussian bivariate vectors with

\[
\mu_{Z^{I}_s}=\Bigl[\frac{\mu'}{2},\ -\mu'\Bigr]^T,\qquad \mu_{Z^{II}_s}=\Bigl[-\frac{\mu'}{2},\ \mu'\Bigr]^T,
\]
\[
\Sigma_{Z^{I}_s}=\begin{bmatrix}\sigma_0^2(1+\theta_0^2)+\frac{S_{\Sigma^{00}}}{4n_0^2}+\frac{S_{\Sigma^{11}}}{4n_1^2} & \frac{S_{\Sigma^{00}}}{2n_0^2}-\frac{S_{\Sigma^{11}}}{2n_1^2}\\ \cdot & \frac{S_{\Sigma^{00}}}{n_0^2}+\frac{S_{\Sigma^{11}}}{n_1^2}\end{bmatrix},\qquad
\Sigma_{Z^{II}_s}=\begin{bmatrix}\sigma_1^2(1+\theta_1^2)+\frac{S_{\Sigma^{00}}}{4n_0^2}+\frac{S_{\Sigma^{11}}}{4n_1^2} & -\frac{S_{\Sigma^{00}}}{2n_0^2}+\frac{S_{\Sigma^{11}}}{2n_1^2}\\ \cdot & \frac{S_{\Sigma^{00}}}{n_0^2}+\frac{S_{\Sigma^{11}}}{n_1^2}\end{bmatrix}, \tag{50}
\]

where, for \(i=0,1\),

\[
\mu'=c_0-c_1,\qquad S_{\Sigma^{ii}}=\sigma_i^2\bigl[n_i(1+\theta_i^2)+2(n_i-1)\theta_i\bigr]. \tag{51}
\]
Proof

See the Appendix.
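Formula (51) is easy to verify numerically (our sketch): the MA(1) covariance matrix is tridiagonal with \(\gamma(0)=\sigma^2(1+\theta^2)\) and \(\gamma(\pm 1)=\sigma^2\theta\), so summing it directly should match the closed form.

```python
# Sketch: check S_{Sigma^{ii}} of (51) against the summed MA(1) covariance matrix.
import numpy as np

def ma1_S_numeric(n, sigma, theta):
    """Sum of all elements of the n x n MA(1) covariance matrix."""
    G = sigma**2 * (1.0 + theta**2) * np.eye(n)
    for j in range(n - 1):                 # lag-1 covariances
        G[j, j + 1] = G[j + 1, j] = sigma**2 * theta
    return G.sum()

def ma1_S_closed(n, sigma, theta):
    """The closed form S_{Sigma^{ii}} of (51)."""
    return sigma**2 * (n * (1.0 + theta**2) + 2.0 * (n - 1) * theta)
```

The 2(n − 1) factor simply counts the lag-1 entries on the two off-diagonals.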

Corollary 12

For the model in Theorem 11, let \(\theta_0=\theta_1=\theta\), \(\sigma_0=\sigma_1=\sigma\), and \(\alpha^0_{t_s}=\alpha^1_{t_s}\). Let \(E[\varepsilon^{MA(1)\mid\theta}_{t_s,n_0+n_1}]\) denote the expected true error of an MA(1) model with MA coefficient \(\theta\) specific to \(t_s\). Then

\[
E[\varepsilon^{MA(1)\mid\theta}_{t_s,n_0+n_1}]=\frac12-\frac{L(h,k;\rho)}{2}, \tag{52}
\]

where \(L(h,k;\rho)\) is defined in (22) and

\[
h=\frac{\mu'}{2\sqrt a},\quad k=\frac{\mu'}{\sqrt b},\quad \mu'=c_0-c_1,\quad a=\sigma^2(1+\theta^2)+\frac b4,
\]
\[
b=\sigma^2\Bigl[(1+\theta)^2\Bigl(\frac{1}{n_0}+\frac{1}{n_1}\Bigr)-2\theta\Bigl(\frac{1}{n_0^2}+\frac{1}{n_1^2}\Bigr)\Bigr],\qquad
\rho=\frac{\sigma^2\Bigl[(1+\theta)^2\bigl(\frac{1}{n_0}-\frac{1}{n_1}\bigr)-2\theta\bigl(\frac{1}{n_0^2}-\frac{1}{n_1^2}\bigr)\Bigr]}{2\sqrt{ab}}. \tag{53}
\]
Proof

The result follows by applying the assumptions of the corollary to Theorem 11 and then following steps similar to those in the proof of Corollary 2.

Corollary 13

For the model in Corollary 12, let n0 = n1 = n. Furthermore, let E[εts,2nI] be the expected true error of the LDA classifier specific to ts with θ = 0 in (48). Let θ′ and θ″ be two arbitrary values of the MA coefficient θ. Then

\[
(\theta''>\theta')\wedge\Bigl(\theta'',\theta'\in\Bigl(-\infty,\frac1n-1\Bigr)\Bigr)\ \Rightarrow\ E[\varepsilon^{MA(1)\mid\theta''}_{t_s,2n}]<E[\varepsilon^{MA(1)\mid\theta'}_{t_s,2n}],\qquad
(\theta''>\theta')\wedge\Bigl(\theta'',\theta'\in\Bigl[\tfrac{\frac1n-1}{\frac2n+1},\infty\Bigr)\Bigr)\ \Rightarrow\ E[\varepsilon^{MA(1)\mid\theta''}_{t_s,2n}]>E[\varepsilon^{MA(1)\mid\theta'}_{t_s,2n}], \tag{54}
\]

and, therefore,

\[
\theta\in(0,\infty)\ \Rightarrow\ E[\varepsilon^{MA(1)\mid\theta}_{t_s,2n}]>E[\varepsilon^{I}_{t_s,2n}],\qquad
\theta\in\Bigl[\tfrac{\frac1n-1}{\frac2n+1},0\Bigr)\ \Rightarrow\ E[\varepsilon^{MA(1)\mid\theta}_{t_s,2n}]<E[\varepsilon^{I}_{t_s,2n}]. \tag{55}
\]
Proof

See the Appendix.

Corollary 13 shows that there exists a range of moving-average coefficients, namely \(\bigl[\frac{\frac1n-1}{\frac2n+1},0\bigr)\), that is beneficial in terms of expected classification error, i.e., yields a smaller expected true error than the stochastic i.i.d. model. For positive values of the coefficient, the expected true error of LDA increases.

5. Numerical Examples

We now illustrate the results obtained in previous sections under several specific settings.

Experiment 1

First, we consider scenarios in which the sample points taken from each class conditional process are identically distributed. They have the same mean, \(\mu^0\) for class 0 and \(\mu^1\) for class 1, and we set \(\mu^0=-\mu^1\) with \(\mu^0=0.5,0.75,1,1.5\). We assume that the observations have variance 1 and are equally correlated with correlation \(\rho_{\mathrm{with}}\in[\rho_l,0.95]\), where \(\rho_l\) is determined so that the covariance matrix defined in (8) is positive definite. In each case we consider three settings for the correlation \(\rho_{\mathrm{bet}}\) across classes: (1) independent, \(\rho_{\mathrm{bet}}=0\); (2) \(\rho_{\mathrm{bet}}=\rho_{\mathrm{with}}\); and (3) \(\rho_{\mathrm{bet}}=-\rho_{\mathrm{with}}\). For each setting we consider two sample sizes, \(n_0=n_1=n=5\) and \(n_0=n_1=n=25\). We assume any future observation from each class conditional process has a distribution identical to those of the training data from that class, and \(\alpha^0_{t_s}=\alpha^1_{t_s}\).
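For the within-class part of this construction with ρ_bet = 0, the lower limit ρ_l has a simple closed form, since the eigenvalues of an n×n unit-variance equicorrelated matrix are 1 + (n − 1)ρ (once) and 1 − ρ (with multiplicity n − 1). The check below is our own sketch; it ignores the extra constraint that a nonzero ρ_bet would add to (8).

```python
# Sketch: positive-definiteness boundary of an equicorrelated covariance block.
import numpy as np

def equicorrelated(n, rho):
    """n x n matrix with unit variances and constant correlation rho."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

n = 5
rho_l = -1.0 / (n - 1)                       # boundary: 1 + (n-1) rho = 0
ok = np.linalg.eigvalsh(equicorrelated(n, rho_l + 1e-3)).min() > 0
bad = np.linalg.eigvalsh(equicorrelated(n, rho_l - 1e-3)).min() > 0
```

Just inside the boundary the block is positive definite; just outside, its smallest eigenvalue turns negative.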

Figures 2(a)–2(d) show the exact expectation and standard deviation (SD) of the LDA true error for this experiment as a function of \(\rho_{\mathrm{with}}\). The results are calculated from Theorems 1 and 6. Parts (a) and (b) of the figure show that increasing \(\rho_{\mathrm{with}}\) increases \(E[\varepsilon^D_{t_s,2n}]\). Since future observations are identically distributed, \(E[\varepsilon^D_{t_s,2n}]\) is the same for all values of \(t_s\). Theoretically, for \(\rho_{\mathrm{bet}}=0\), we can verify the graphical behavior by using Lemma 3 in Theorem 1. To see analytically the effect of \(\rho_{\mathrm{with}}\) on \(E[\varepsilon^D_{t_s,2n}]\) once \(\rho_{\mathrm{bet}}=0\), let \(\rho_1,\rho_2\) be two arbitrary values of \(\rho_{\mathrm{with}}\) such that \(\rho_1<\rho_2\), and denote all distributional parameters used in Corollary 2 corresponding to \(\rho_k\), \(k=1,2\), with a superscript \(\rho_k\). Under the aforementioned conditions of the experiment, we have \(\frac{S^{\rho_1}_{\Sigma^{00}}}{n_0^2}-\frac{S^{\rho_1}_{\Sigma^{11}}}{n_1^2}=\frac{S^{\rho_2}_{\Sigma^{00}}}{n_0^2}-\frac{S^{\rho_2}_{\Sigma^{11}}}{n_1^2}=0\) and \(\frac{S^{\rho_k}_{\Sigma^{00}}}{n_0^2}+\frac{S^{\rho_k}_{\Sigma^{11}}}{n_1^2}=\frac{2\bigl(1+(n-1)\rho_k\bigr)}{n}\), \(k=1,2\). Therefore, \(a^{\rho_1}<a^{\rho_2}\) and \(b^{\rho_1}<b^{\rho_2}\). The results then follow from Lemma 3. For the other cases, where \(\rho_{\mathrm{bet}}\ne 0\), one may analytically study the effect of changing \(\rho_{\mathrm{bet}}\) on \(E[\varepsilon^D_{t_s,2n}]\) using the results of Theorem 1 and an argument similar to the proof of Lemma 3.

Figure 2.

Figure 2

Figures (a)–(d) show the exact expectation and standard deviation of LDA true error in Experiment 1 as a function of ρwith. (a) Expectation for n0 = n1 = 5; (b) Expectation for n0 = n1 = 25; (c) Standard deviation for n0 = n1 = 5; (d) Standard deviation for n0 = n1 = 25; (a)–(d) plot keys: ○ := ρbet = 0; ×:= ρbet = ρwith; △ = ρbet = −ρwith; solid := μ0 = 1.5; dash := μ0 = 1; dot := μ0 = 0.75; dash-dot := μ0 = 0.5. The cross section of each curve with the vertical solid line in (a)–(d) plots shows the magnitude of the expectation/variance for i.i.d sampling situation for the corresponding scenario. The small horizontal solid lines in Figures (b) and (d) show the magnitude of expectation/variance of i.i.d situation in Figures (a) and (c), respectively. Figures (e)–(f) show the exact expectation of LDA true error of the first-order autoregressive model in Experiment 2 as a function of ψ:= ψ0 = ψ1. (a) Case of n0 = n1 = 5; (b) Case of n0 = n1 = 25; (e)–(f) plot keys: ○ := sn0 = 2; × : sn0 = 10; solid := c0 = 1.5; dash := c0 = 1; dot := c0 = 0.75; dash-dot := c0 = 0.5. The cross section of each curve with the vertical solid line in (e)–(f) plots shows the magnitude of the expectation for i.i.d sampling situation for the corresponding scenario. The small horizontal solid lines in Figure (f) show the magnitude of expectation of i.i.d situation in Figure (e).

Figures 2(a) and 2(b) show that increasing d = |μ0μ1| has an incremental effect on E[εts,2nD]. This effect can also be seen from Lemma 3 and Corollary 2. Therefore, we call classification scenarios with a larger d, “easier” scenarios, and those with smaller d, “harder” scenarios. In this sense, d is an indicator of classification difficulty in our experiment. The figures suggest that having a between-class correlation of ρbet = ρwith > 0 helps in classification performance in “harder” classification situations (i.e., compared with ρbet = 0) and has a detrimental effect on classification performance in “easier” settings. However, having ρbet = ρwith < 0, helps to have a better classification performance in “easier” settings and results in a worse performance in “harder” settings. This is observed by the fact that curves for ρbet = ρwith are above (below) the curves for ρbet = 0 for d = 3 (d = 1).

The standard deviation is more complicated to interpret. The trends seen in Figures 2(c) and 2(d) suggest that increasing ρwith generally increases the standard deviation of the LDA true error in cases where ρbet = 0. Furthermore, it suggests that once ρbet = −ρwith, the standard deviation generally increases as ρwith grows, but once ρbet = ρwith, increasing ρwith in small sample sizes may increase or decrease the standard deviation depending on classification difficulty, and as the sample size gets larger, increasing ρwith generally increases the standard deviation. Furthermore, the figures suggest that increasing the classification difficulty may first increase the standard deviation and then decreases it.

Comparing Figure 2(a) with 2(b) and Figure 2(c) with 2(d) shows that increasing sample sizes lower the magnitude of the expectation and standard deviation regardless of classification difficulty or magnitude of ρwith.

Experiment 2

In this experiment, we use the first order autoregressive model defined in (40). We assume αts0=αts1, n0 = n1 = n, σ0 = σ1 = 1, and ψ0 = ψ1 = ψ ∈ [−0.95, 0.95]. We consider various cases where c0 = 0.5, 0.75, 1, 1.5 with c0 = −c1. Figure 2(e) and 2(f) show the exact expectation of LDA true error for this experiment. These results are exact and are calculated from Theorem 8. These figures suggest that increasing ψ decreases E[εts,2nAR(1)ψ]. According to Corollary 10, for a sufficiently lagged ts, E[εts,2nAR(1)ψ] is a decreasing function of ψ and, furthermore, E[εts,2nAR(1)ψ]<E[εts,2nI] for 0 < ψ < 1 and E[εts,2nAR(1)ψ]>E[εts,2nI] for 0 < ψ < 1. Here the same behavior is observed even for small lags of 2 and 10. Furthermore, decreasing the sample size and increasing the classification difficulty have an incremental effect on the expected true error.

Experiment 3

In this experiment, we use the AR(1) model in (48). We assume αts0=αts1, n0 = n1 = n, σ0 = σ1 = 1, and θ0 = θ1 = θ ∈ [−10, 10]. We consider c0 = 0.5, 0.75, 1, 1.5 with c0 = −c1. Figure 3 shows the exact expectation of the LDA true error for this experiment. These results are exact and are calculated from Theorem 11.

Figure 3.

Figure 3

Exact expectation of LDA true error of the first-order moving average model in Experiment 3 as a function of θ := θ0 = θ1. (a) Expectation for n = 5; (b) Expectation for n = 25; (c) Magnification of region [−1, 0.1] in Figure (a); (d) Magnification of region [−1, 0.1] in Figure (b). Plot keys: ○:= c0 = 1.5; △ : c0 = 1; ▽ := c0 = 0.75; × := c0 = 0.5; The cross section of each curve with the vertical solid line in each plot shows the magnitude of the expectation for i.i.d sampling situation for the corresponding scenario. The small horizontal solid lines in Figure (b) are drawn to facilitate comparing this figure with Figure (a) at these cross sections. The left side of blue dotted line is (−∞, 1n-1] region, which in (54) we proved that the expectation of true error is a decreasing function of θ. The right hand side of the red dashed line is [ 1n-12n+1, ∞) region, which the expectation is an increasing function as seen from (54).

Figure 3 shows that the expected true error of LDA under the MA(1) model has an inverted bell shape with a negatively biased center, and the bias decreases as the sample size increases. The results of Corollary 13 are clear in this figure: for θ(-,1n-1],E[εts,2nMA(1)θ] is a decreasing function of θ. This region is on the left-hand side of the vertical blue dotted lines in Figure 3. For θ[1n-12n+1),E[εts,2nMA(1)θ] is an increasing function of θ. This region is on the right-hand side of the vertical red dashed line in the figure. As proved in Corollary 13, we observe in Figure 3(c) and 3(d) that, for θ[1n-12n+1,0),E[εts,2nMA(1)θ]<E[εts,2nI]. This is the region between red dashed line and the vertical black line of each plot. For θ ∈ (0, ∞), E[εts,2nMA(1)θ]>E[εts,2nI]. This is the region on the right-hand side of the vertical black solid lines.

Experiment 4

This experiment is an example derived from gene-expression data used in studying the prognosis of breast-cancer using 70 genes with high prognostic ability [19]. Following [20], we divide the 307 individuals used in this study into 64 “poor” prognosis (class 0) versus 243 “good” prognosis (class 1) patients. A poor prognosis is defined to be a distant metastasis within 5 years of initial diagnosis. The gene expression data used in this study have been collected by triplicating each gene on each microarray and then duplicating each measurement by dye-swaping. Therefore, for each patient, each gene, we have six measurements, three of which are positively correlated with themselves and negatively correlated with others. Using this dataset we consider a scenario in which the experimenter is only given six measurements taken from one patient from class 0 and six measurements from another patient from class 1, and a univariate LDA classifier is desired to differentiate the two groups. We assume the single variate used in this classifier is the ALDH4 gene, which has the highest correlation with prognosis of breast cancer in [20]. Therefore, in this scenario, the experimenter is given 12 “technical” replicates in total, which are now treated as our “sample points”. This is an example of the UGDS model in genomic applications in which our classification is defined by two Gaussian processes, Xt0 and Xt1, which are assumed to be independent processes. We note that the expected performance of a classifier depends on ts, i.e. E[εts,12D] in Theorem 1, which is a function of the distribution of the future data as well as the distribution of the training data and their correlation structure. We verify the Gaussianity of each of the 12 random variables, Xt10,Xt20,,Xt60,Xt11,Xt21,,Xt61, used for characterizing the two Gaussian processes of this example via a Shapiro-Wilk test (using the R statistical software) on the full dataset corresponding to each random variable. 
This test does not reject Gaussianity of the random variables over either of the classes at a 95% significance level after employing the Bonferroni correction of multi-hypothesis tests.

Unfortunately, taken together, the 12 random variables do not pass the Shapiro-Wilk test for multivariate Gaussianity. Nonetheless, we will proceed and demonstrate that, even with this lack of multivariate Gaussianity, Theorem 1 is much more accurate than its counterpart in [3], which assumes i.i.d. data from each distribution.

Sample means, variances, and correlation, computed on the full dataset, were used as estimates of the unknown true means, variances, and the correlation structure between samples needed in Theorem 1. Using Theorem 1, the expected performance of a classifier, E[εts,12D], to differentiate samples distributed as Xt50 from samples distributed as Xt11 is 0.475. To further verify this expected performance we construct a classifier on each possible combination among 243 × 64 = 15552 combinations of 6 samples from either classes and each time we test the accuracy of the designed classifier on the 64 − 1 = 63 remaining realizations of Xt50 and 243 − 1 = 242 realizations of Xt11. The accuracy computed in this way is 0.479, which is almost the same as what is computed from Theorem 1. It is interesting to compare this accuracy to the case in which one designs a classifier without paying attention to the correlation structure between samples and various distributions governing the data (considering the data being i.i.d.). In this scenario one (incorrectly) considers the data from each class coming from a single distribution and the expected performance of a classifier can be therefore evaluated from Theorem 1 that we presented in [3]. Again we use the sample means and variances, computed on the full dataset, as estimates of the unknown true means and variances. In this case the expected performance of LDA is estimated to be 0.374, which is very far from 0.479.

6. Conclusion

In many applications, the assumption of having i.i.d. training samples is violated. This paper characterizes the performance of univariate LDA classification in stochastic settings by assuming the samples are taken from two class conditional Gaussian processes, which are not necessarily independent. Linear classification has been considered owing to its long history in pattern recognition and its suitability for small-sample classification. We do not impose a specific correlation structure on the training data. We have presented conditions in which the correlation structure can be either beneficial or detrimental in terms of classification performance. As an application we have obtained exact expressions for the performance of LDA in situations that the data are produced through auto-regressive (AR) or moving-average (MA) models of the first order. We have found ranges of AR or MA multiplicative coefficients having incremental or decremental effect on classification performance. Having characterized univariate LDA performance in closed form, we aim to follow our work in [3] and characterize the effect of non-i.i.d. samples on training-data-based error estimators.

Acknowledgments

This work was partially supported by the NIH grants 2R25CA090301 (Nutrition, Biostatistics, and Bioinformatics) from the National Cancer Institute.

Appendix. Various Correlation Structures

Let p-dimenional sample points of each class, X1, X2, …, Xni, with Xj being a column vector, be separately taken from two p-variate normal distributions, Π1 and Π2, with the distribution N(μi, Σ). Furthermore, let Vi be the dispersion matrix of the nip×1 vector X=(X1T,X2T,,XniT)T, i = 0, 1, defined as Vi = E[(XE(X))(XE(X))T]. We define three correlation structures in regard to the data: (1) equicorrelated if Vi = Ini ⊗ (ΣR)+ EniR, with R being a symmetric matrix, In the n × n identity matrix, and En the n × n matrix with all elements being 1; (2) simply equicorrelated if Vi = Ini ⊗ (1 − ρ)Σ + EniρΣ, where ρ is a nonzero scalar constant where |ρ| < 1; and (3) serially correlated if Vi = IniΣ + Eniρτ Σ, where τ = |kl|, k, l = 1, 2,, ni, |ρτ| < 1, τ = 1, 2,, ni, ρ0 = 0. Note that univariate sample points, equicorrelation and simple-equicorrelation structures are essentially the same.

Proof of Theorem 1

From (9), it follows that

E[εts0]=P(W(X¯T0,X¯T1,Xts)0XtsXt0)=P(Xts-X¯T<0,X¯T0-X¯T1>0)+P(Xts-X¯T0,X¯T0-X¯T1<0),

where X¯T=X¯T0+X¯T12. Expanding X¯T0 and X¯T1 as 1n0i=1n0Xti0 and 1n1i=1n1Xti1 results in

E[εts0]=P(ZsI<0)+P(ZsI0), (56)

where ZsI=AYs0 and Ys0=[Xts0,,Xtn00,Xt11,,Xtn11]T, where the super index 0 in Xts0 is to denote explicitly XtsXt0, and

A=[1-12n0-12n0-12n0-12n1-12n10-1n0-1n0-1n01n11n1]. (57)

Therefore, ZsI is a Gaussian random vector with mean AμYs0 and covariance matrix AYs0AT. Plugging in the values of μYs0=[μs0,μ10,μ20,,μn00,μ11,,μn11] and noting the fact that the jth element of vector ρsik is defined as ρsik(j)=E[(Xtsi-μsi)(Xtjk-μjk)], i, k = 0, 1, j = 1, 2,, nk, we have

Ys0=[(σs0)2ρs00ρs01(ρs00)T0001(ρs01)T(01)T11], (58)

which leads to the expression stated in Theorem 1. Evaluating the mean and covariance matrix of vector ZsII, which is the counterpart for E[εts1], is entirely similar, by considering P(W(X¯T0,X¯T1,Xts)>0X¯T0,X¯T1,XtsXt1).

Proof of Corollary 2

Note that for Φ(x, y; ρ) defined in (20),

Φ(-x,-y;ρ)=xy12π1-ρ2exp{-(u2+v2-2ρuv)2(1-ρ2)}dudv. (59)

By considering the assumption of the corollary for Theorem 1, and using (20) and (59) in (18), we get

E[εts,n0+n1D]=12(Φ(μ2a,-μb,ρ)+Φ(-μ2a,μb,ρ))+12(Φ(-μ2a,μb,-ρ)+Φ(μ2a,-μb,-ρ)), (60)

with a, b, and ρ defined in the corollary. Using the identity [28]

2[Φ(x,y;ρ)+Φ(x,y;-ρ)-Φ(x)-Φ(y)]+1=L(x,y;ρ), (61)

where Φ(.) is the standard normal cumulative function, completes the proof.

Proof of Lemma 3

Here, we first provide a way to intuitively understand the Lemma and then we provide a rigorous proof. We have

G(x,y;ρ)=F(x,y;ρ)+F(x,y;-ρ)=1+L(x,y,ρ)=1-L(x,y,ρ), (62)

where the last equality is due to xy < 0 stated as an assumption to the lemma. Intuitively, the lemma makes sense because smaller values of |x|, |y|, and |ρ| imply not only a smaller integration region in (22), but also less mass in that region. Next we provide a rigorous proof. It is straightforward to show

G(x,y;ρ)x=2e-x222π[Φ(y-ρx1-ρ2)+Φ(y+ρx1-ρ2)-1],G(x,y;ρ)y=2e-y222π[Φ(x-ρy1-ρ2)+Φ(x+ρy1-ρ2)-1],G(x,y;ρ)ρ=2ψ(x,y;ρ)-2ψ(x,y;-ρ), (63)

where the last equality comes from well known results of Gaussian distribution, where Φ(x,y;ρ)ρ=Φ(x,y;ρ)xy=ψ(x,y;ρ). Without loss of generality, we assume x ≥ 0 and y ≤ 0. The results for x ≤ 0 and y ≥ 0 are entirely similar after exchanging x and y in the following proof. We have

ρ0y-ρx0,y-ρxy+ρx-y+ρxΦ(y-ρx1-ρ2)+Φ(y+ρx1-ρ2)1Gx0,ρ<0y+ρx0,y+ρxy-ρx-y-ρxΦ(y-ρx1-ρ2)+Φ(y+ρx1-ρ2)1Gx0. (64)

Hence, Gx0. Similarly, Gy0. Furthermore,

ρ0G(x,y;ρ)ρ0,ρ<0G(x,y;ρ)ρ>0.

For 0 ≤ λ ≤ 1, we set

γx=λx1+(1-λ)x0,γy=λy1+(1-λ)y0,γρ=λρ1+(1-λ)ρ0. (65)

Then λ, γx ≥ 0, γy ≤ 0, ρi ≥ 0 ⇒ γρ ≥ 0 (i = 0, 1), and ρi < 0 ⇒ γρ < 0 (i = 0, 1). Thus, Gγx0,Gγy0 and

γρ0G(x,y;ρ)γρ0,γρ<0G(x,y;ρ)γρ>0. (66)

Then

dGdλ=Gγxdγxdλ+Gγydγydλ+Gγρdγρdλ=Gγx(x1-x0)+Gγy(y1-y0)+Gγρ(ρ1-ρ0). (67)

First assume ρi ≥ 0, i = 0, 1, so that γρ ≥ 0, G(x,y;ρ)γρ0. Since Gγx0, x1x0, Gγy0, y1y0, and ρ1ρ0, we have dGdλ0. Therefore,

01dGdλdλ=G(1)-G(0)=G(x1,y1,ρ1)-G(x0,y0,ρ0)0. (68)

Next assume ρi ≤ 0, i = 0, 1, so that γρ ≤ 0, G(x,y;ρ)γρ0. Since Gγx0, x1x0, Gγy0, y1y0, and ρ1ρ0, we have dGdλ0. Therefore,

01dGdλdλ=G(1)-G(0)=G(x1,y1,ρ1)-G(x0,y0,ρ0)0.

Lastly, assume the ρi’s have opposite signs. Without loss of generality, assume ρ0 < 0, ρ1 ≥ 0, and |ρ1| ≤ |ρ0|. Then

01dGdλdλ=0ρ0-ρ1ρ1-ρ0dG+ρ0-ρ1ρ1-ρ01dG=0ρ0-ρ1ρ1-ρ0dG+G(xm,ym,-ρ1)-G(x1,y1,ρ1), (69)

where xm=ρ0-ρ1ρ1-ρ0x1+(1-ρ0-ρ1ρ1-ρ0)x0,ym=ρ0-ρ1ρ1-ρ0y1+(1-ρ0-ρ1ρ1-ρ0)y0, x1 < xm < x0, and y0 < ym < y1. From the definition of G(x, y, ρ) it is easy to see that G(x, y, ρ) = G(x, y,ρ) and then, from the conditions that result in (68), we have G(xm, ym,ρ1) − G(x1, y1, ρ1) = G(xm, ym, ρ1) − G(x1, y1, ρ1) ≥ 0. Hence, in order to show G(x1, y1, ρ1) − G(x0, y0, ρ0) ≥ 0 it is sufficient to show that 0ρ0-ρ1ρ1-ρ0dGdλdλ0. For λ[0,ρ0-ρ1ρ1-ρ0], we have γρ ≤ 0, G(x,y;ρ)γρ0. Therefore, from (67), G(x,y;ρ)γρ0 and, furthermore, dGdλ0. Thus, 0ρ0-ρ1ρ1-ρ0dGdλdλ0 and the result follows.

Proof of Theorem 6

From (16) and (9), it follows that

E[(εts0)2]=P(Xts-X¯T<0,X¯T0-X¯T1>0,Xts-X¯T<0)+P(Xts-X¯T0,X¯T0-X¯T1<0,Xts-X¯T0), (70)

where X¯T=X¯T0+X¯T12. Expanding X¯T0 and X¯T1 as 1n0i=1n0Xti0 and 1n1i=1n1Xti1 results in

E[(εts0)2]=P(ZsI<0)+P(ZsI0), (71)

where ZsI=AYs0, in which Ys0=[Xts0,Xts0,Xt10,,Xtn00,Xt11,,Xtn11]T, and the super index 0 in Xts0 and Xts0 is to denote explicitly Xts, XtsXt0, and

A=[10-12n0-12n0-12n0-12n1-12n100-1n0-1n0-1n01n11n101-12n0-12n0-12n0-12n1-12n1].

Therefore, ZsI is a Gaussian random vector with mean AμYs0 and covariance matrix AYs0AT. Plugging in the values of μYs0=[μs0,μs0,μ10,μ20,,μn00,μ11,μ21,,μn11] and noting the fact that the jth element of vector ρsik(j) is defined as ρsik(j)=E[(Xtsi-μsi)(Xtjk-μjk)], i, k = 0, 1, j = 1, 2,, nk, and from the definition of Xts0 it holds that E[(Xts0-μsi)(Xtjk-μjk)]=E[(Xts0-μsi)(Xtjk-μjk)], we have:

Ys0=[(σs0)20ρs00ρs010(σs0)2ρs00ρs01(ρs00)T(ρs00)T0001(ρs01)T(ρs01)T(01)T11], (72)

which leads to the expression stated in Theorem 6. Evaluating the mean and covariance matrices of vector ZsII and ZsIII is entirely similar.

Proof of Theorem 8

Since the Zti’s are Gaussian, Xt0 and Xt1 are covariance-stationary [18] and the vectors Xn00=[Xt10,Xt20,,Xtn0]T and Xn00=[Xt11,Xt21,,Xtn11]T are distributed normally as Xnii~N(μi,i), i = 0, 1, where

μi=[μi,μi,,μi]1×niT,i(k,l)=ψik-l1-ψi2σi2,k,l=1,2,,ni,ρsii(k)=ψis-k1-ψi2σi2,k=1,2,,ni,ρs01=01×n1,ρs10=01×n0, (73)

μi=ci1-ψi, and Σi(k, l) denotes the entry in the kth row and lth column of matrix Σi. The result follows by replacing (73) in Theorem 1.

Proof of Corollary 9

Using the corollary assumptions in Theorem 8, we get

E[εts,n0+n1AR(1)ψ]=12i=01(Φ(hsi,-k;ρsi)+Φ(-hsi,k;ρsi)), (74)
hsi=μ2asi,ρsi=-ψ(s-ni)σ2ni(1-ψ2)(1-ψni1-ψ)+ρ,asi=a-ψ(s-ni)σ2ni(1-ψ2)(1-ψni1-ψ), (75)

with k, a, b, ρ, and μ defined in (45). Let F(x, y; ρ) = Φ(x, y; ρ) + Φ(−x,y; ρ), with Φ(x, y; ρ) defined in (20). Then using Scheffe’s Lemma [29] we have

2limsE[εts,n0+n1AR(1)ψ]=F(limshs0,-k;limsρs0)+F(limshs1,-k;limsρs1). (76)

Note that by taking the limit, the term ψ(s-ni)σ2ni(1-ψ2)(1-ψni1-ψ) in hsi and asi converges exponentially to 0 and we have as0=as1=a,hs0=hs1=h,ρs0=ρs1=ρ. The result follows similarly to proof of Corollary 2.

Proof of Corollary 10

With n0 = n1 = n, we have ρ = 0 in (45). From (76) we get 2limsE[εts,2nAR(1)ψ]=G(hψ,lψ;0), with G(hψ,kψ; 0) defined as in (24) and lψ= −kψ, where we use a subscript ψ to explicitly denote dependence of l and h on ψ. Since hψlψ< 0, we can use a proof similar to that of Lemma 3 to compare different AR models. Specifically, suppose we prove that

ψ>ψλh[0,1)λl[0,1):hψ=λhhψlψ=λllψ (77)

Then, similar to proof of Lemma 3, we can prove G(hψ″, lψ″; 0) < G(hψ′, lψ′; 0), so that

2limsE[εts,2nAR(1)ψ]=G(hψ,lψ;0)<limsE[εts,2nAR(1)ψ]=G(hψ,lψ;0), (78)

thereby proving the basic inequality in the corrollary. We first demonstrate (77). Assume c0 > c1. We first prove that for ψ ∈ (−1, 1), we have dlψdψ<0 and dhψdψ>0. It is easy to see that:

dlψdψ=-n2(c0-c1)σd(fψ)dψ=-n2fψ(c0-c1)gψσ(dψ)2,

where

gψ=2n(1+ψ2-(n+1)ψn+(n-1)ψ(n+2)),dψ=n-2ψ-nψ2+2ψn+1,fψ=n-nψ2dψ. (79)

From Descartes’ Rule of Signs [30], gψ has either zero or two positive roots. For n ≥ 2,

gψ=4n(ψ-1)2((n-1)2ψn+j=1n-1jψj+12). (80)

Therefore, for all n, gψ has two roots at 1 and these are the only positive roots. Similarly we observe that if n is even, then gψ has only two negative roots at −1. If n is odd, again from Descartes’ Rule of Signs [30], gψ has only one negative root, denoted by ψ−. We show that ψ ∈ (−∞, −1). Let ψ = −ψ+, ψ+ > 0. Since n is odd, we need to have

1+ψ+2+(n+1)ψ+n-(n-1)ψ+(n+2)=0. (81)

Were ψ+ ∈ (0, 1), this would imply (n+1)ψ+n>(n-1)ψ+(n+2). Hence, (81) is not possible and ψ ∈ (−∞, −1). Summarizing this result, we see that ψ ∈ (−1, 1) ⇒ gψ > 0 and therefore, dlψdψ<0. It is straightforward to show

dhψdψ=n2(c0-c1)σd(rψ)dψ=n2rψ(c0-c1)(4n3(1-ψ)2+gψ)σ(2n2(1-ψ)2+dψ)2,

where rψ=n-nψ22n2(1-ψ)2+dψ. Since for ψ ∈ (−1, 1) we have gψ > 0, then dhψdψ>0. We set γψ = λψ1 + (1 − λ)ψ0, where 0 ≤ λ ≤ 1. Now we check that (78) holds. Denoting G(hγψ,lγψ; 0) by G, we have

dGdλ=Ghγψdhγψdγψdγψdλ+Glγψdlγψdγψdγψdλ=Ghγψdhγψdγψ(ψ-ψ)+Glγψdlγψdγψ(ψ-ψ). (82)

Since ψ ∈ (−1, 1), 0 ≤ λ ≤ 1, and hψlψ< 0, we can see that γψ ∈ (−1, 1), hγψlγψ< 0, dhγψdγψ>0,dlγψdγψ<0, and from Proof of Lemma 3 in the appendix, Ghγψ0 and Glγψ0. Since ψ″ > ψ′, we see that dGdλ<0. Similar to the proof of Lemma 3, integrating over λ results in G(hψ″, lψ″; 0) < G(hψ′, lψ′; 0). The same basic argument goes through for c0 < c1 and we have dhγψdγψ<0,dlγψdγψ>0. The remaining results follow from the definition of E[εts,2nI], where we have E[εts,2nAR(1)ψ=0]=E[εts,2nI].

Proof of Theorem 11

Since the Zti’s are Gaussian, Xt0 and Xt1 are covariance-stationary and the vectors Xn00=[Xt10,Xt20,,Xtn0]T and Xn00=[Xt11,Xt21,,Xtn11]T are distributed normally as Xnii~N(μi,i), i = 0, 1, [18], where for k = 1, 2, …, ni,

μi=[μi,μi,,μi]1×niT,ρsii(k)=0,ρs01=01×n1,ρs10=01×n0,i(k,l)={σi2(1+θi2),k=lσi2θi,k-l=10othewise, (83)

where μi = ci and Σi(k, l) denotes the entry in the kth row and lth column of the matrix Σi. The result follows by replacing (83) in Theorem 1.

Proof of Corollary 13

From Theorem 11, since αts0=αts1 and max{n0, n1} + 1 < s, we have 2E[εts,2nMA(1)θ]=G(hθ,lθ;0), for any s, with lθ = −kθ, hθ and kθ defined in Corollary 12, and G(hθ, −kθ; 0) defined as in (24). Similar to proof of Corollary 10, the present corollary follows by setting n0 = n1 = n and using

dlθdθ=b-32(c0-c1)σ2n(2θ+2-2n),dhθdθ=-a-32(c0-c1)σ24n(2nθ+θ+1-1n), (84)

where a and b are obtained from (53).

Contributor Information

Amin Zollanvari, Email: amin_zoll@neo.tamu.edu.

Jianping Hua, Email: jhua@tgen.org.

Edward R. Dougherty, Email: edward@ece.tamu.edu.

References

  • 1.Hills M. Allocation rules and their error rates. J Royal Statist Soc Ser B (Methodological) 1966;28(1):1–31. [Google Scholar]
  • 2.Sorum MJ. Estimating the expected probability of misclassification for a rule based on the linear discriminant function: Univariate normal case. Technometrics. 1973;15:329–339. [Google Scholar]
  • 3.Zollanvari A, Braga-Neto UM, Dougherty ER. Exact representation of the second-order moments for resubstitution and leave-one-out error estimation for linear discriminant analysis in the univariate heteroskedastic gaussian model. Pattern Recogn. 2012;45:908–917. [Google Scholar]
  • 4.Basu JP, Odell PL. Effect of intraclass correlation among training samples on the misclassification probabilities of bayes’ procedure. Pattern Recogn. 1974;6:13–16. [Google Scholar]
  • 5.McLachlan GJ. Further results on the effect of interclass correlation among training samples in discriminant analysis. Pattern Recogn. 1976;8:273–275. [Google Scholar]
  • 6.Tubbs JD. Effect of autocorrelated training samples on bayes’ probability of mis-classlficatlon. Pattern Recogn. 1980;12:351–354. [Google Scholar]
  • 7.Lawoko CRO, McLachlan GJ. Discrimination with autocorrelated observations. Pattern Recogn. 1985;18:145–149. [Google Scholar]
  • 8.Lawoko CRO, McLachlan GJ. Asymptotic error rates of the w and z statistics when the training observations are dependent. Pattern Recogn. 1986;19:467–471. [Google Scholar]
  • 9.Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver &Boyd; 1925. [Google Scholar]
  • 10.Martin JK, Hirschberg DS. Small sample statistics for classification error rates ii: Confidence intervals and significance tests. 1996 [Google Scholar]
  • 11.Zollanvari A, Braga-Neto UM, Dougherty ER. On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers. Pattern Recogn. 2009;42(11):2705–2723. [Google Scholar]
  • 12.Zollanvari A, Braga-Neto UM, Dougherty ER. Joint sampling distribution between actual and estimated classification errors for linear discriminant analysis. IEEE Trans Inf Theory. 2010;56(2):784–804. [Google Scholar]
  • 13.Zollanvari A, Braga-Neto UM, Dougherty ER. Analytic study of performance of error estimators for linear discriminant analysis. IEEE Trans Sig Proc. 2011;59(9):4238–4255. [Google Scholar]
  • 14.Shumway RH, Unger AN. Linear discriminant functions for stationary time series. J Am Statist Assoc. 1974;69:948–956. [Google Scholar]
  • 15.Kakizawa Y, Shumway R, Taniguchi M. Discrimination and clustering for multivariate time series. J Am Statist Assoc. 1998;93:328–340. [Google Scholar]
  • 16.Kazakos D, Papantoni-Kazakos P. Spectral distance measuring between gaussian processes. IEEE Trans Autom Control. 1980;25:950–959. [Google Scholar]
  • 17.McLachlan GJ. The asymptotic distributions of the conditional error rate and risk in discriminant analysis. Biometrika. 1974;61:131–135. [Google Scholar]
  • 18.Hamilton JD. Time Series Analysis. NJ: Princeton University Press; 1994. [Google Scholar]
  • 19.Buyse M, Loi S, et al. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. Journal of the National Cancer Institute. 2006;98:1183–1192. doi: 10.1093/jnci/djj329. [DOI] [PubMed] [Google Scholar]
  • 20.vanÕt Veer L, Dai H, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–6. doi: 10.1038/415530a. [DOI] [PubMed] [Google Scholar]
  • 21.Schršder FH, Hugosson J, et al. Screening and prostate-cancer mortality in a randomized european study. New Eng J Med. 2009;360:1320–1328. doi: 10.1056/NEJMoa0810084. [DOI] [PubMed] [Google Scholar]
  • 22.Koelinka CJL, van Hasseltb P, et al. Tyrosinemia type i treated by ntbc: How does afp predict liver cancer? Mol Genet Metab. 2006;89:310–315. doi: 10.1016/j.ymgme.2006.07.009. [DOI] [PubMed] [Google Scholar]
  • 23.Bast RC, Xu FJ, et al. Ca 125: the past and the future. Int J Biol Markers. 1998;13:179–187. doi: 10.1177/172460089801300402. [DOI] [PubMed] [Google Scholar]
  • 24.Filella X, Molina R, et al. Prognostic value of ca 19.9 levels in colorectal cancer. Mol Genet Metab. 1992;216:55–59. doi: 10.1097/00000658-199207000-00008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Frank TS, Deffenbaugh AM, et al. Clinical characteristics of individuals with germline mutations in brca1 and brca2: Analysis of 10,000 individuals. J Clin Oncol. 2002;20:1480–1490. doi: 10.1200/JCO.2002.20.6.1480. [DOI] [PubMed] [Google Scholar]
  • 26.Syrjakoski K, Kuukasjarvi T, et al. Brca2 mutations in 154 finnish male breast cancer patients. Neoplasia. 2004;6:541–545. doi: 10.1593/neo.04193. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Horii A, Nakatsuru S, et al. Frequent somatic mutations of the apc gene in human pancreatic cancer. Cancer Res. 1992;52:6696–6698. [PubMed] [Google Scholar]
  • 28.Abramowitz M, Stegun IA. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover Publications; 1972. [Google Scholar]
  • 29.Scheffe H. A useful convergence theorem for probability distributions. Ann Math Statist. 1947;18:434–438. [Google Scholar]
  • 30.Anderson B, Jackson J, Sitharam M. A useful convergence theorem for probability distributions. Amer Math Monthly. 1998;105:447–451. [Google Scholar]

RESOURCES