Abstract
This paper provides exact analytical expressions for the first and second moments of the true error for linear discriminant analysis (LDA) when the data are univariate and taken from two stochastic Gaussian processes. The key point is that we assume a general setting in which the sample data from each class do not need to be identically distributed or independent within or between classes. We compare the true errors of designed classifiers under the typical i.i.d. model and when the data are correlated, providing exact expressions and demonstrating that, depending on the covariance structure, correlated data can result in classifiers with either greater error or less error than when training with uncorrelated data. The general theory is applied to autoregressive and moving-average models of the first order, and it is demonstrated using real genomic data.
Keywords: Linear discriminant analysis, Stochastic settings, Correlated data, Non-i.i.d data, Expected error, Gaussian processes, Auto-regressive models, Moving-average models
1. Introduction
It is common in practice to assume that the training data used to construct a classifier are independent and identically distributed (i.i.d.). If the data are dependent or not identically distributed, classifier performance is affected. This paper presents a mathematical framework for analytically studying classifiers in such situations in general, and the univariate LDA (linear discriminant analysis) classifier in particular. We pay particular attention to the univariate LDA model because it is possible to obtain closed-form (not asymptotic) results for moments of the error, in analogy to moments for the error [1, 2] and error estimates [1, 3] for univariate LDA with i.i.d. sampling. The desired framework is achieved by placing classifier performance in a stochastic setting where the training data are univariate, dependent, and not necessarily identically distributed.
Motivation for this line of research goes back to the early 1970’s when Basu and Odell observed in remote sensing applications that the conditional expected true error of LDA is commonly higher than what is expected from a theoretical analysis [4]. They associated this observation with violation of the independence assumption on the training data.
To study the effect of correlated training data on the performance of LDA, Basu and Odell [4] used numerical examples under an equicorrelated structure of samples (see Appendix for definition of various correlation structures). They showed that misclassification probabilities change under such structures. Afterwards, McLachlan [5] used asymptotic analysis to show that even under a simple-equicorrelated structure the probability of misclassification changes. Later, Tubbs [6] used a similar asymptotic analysis but with a serially correlated structure among training data. He considered further simplifying assumptions to show that the asymptotic error rate changes with serially correlated data having positive correlations. Lawoko and McLachlan [7] used the same serially correlated structure and obtained a different asymptotic expansion of LDA true error from the one that Tubbs previously achieved in [6]. This type of asymptotic analysis was later used in [7, 8] to characterize the asymptotic expected true error of univariate LDA and Z-statistics assuming an autoregressive process of order p.
Typically, large-sample asymptotic results are not helpful in small-sample situations. Going back to 1925, R. A. Fisher wrote, “Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data” [9, 10]. This understanding led us to study the distribution and exact moments of LDA true error and common estimators [11, 3, 12, 13].
Having laid the groundwork for analyzing LDA related statistics in small-sample situations, in this work we establish a framework for studying LDA in stochastic settings, thereby allowing us to obtain the exact first and second moments of univariate LDA true error in a general stochastic setting. We neither impose a specific correlation structure on the training data, nor do we assume the training data have necessarily the same mean or variance. For example the basic assumption in [4, 5, 6, 7, 8] is that the training data of the two classes are taken separately from two class conditional densities Π0, for class 0, and Π1, for class 1. This assumption immediately imposes several restrictions on the problem: the training data from each class have the same mean and variance (because they are coming from the same distribution) and, furthermore, only intraclass correlations exist. The stochastic setting permits us to generalize such assumptions to training data being correlated across classes or the samples from each class being differently distributed. To model such data we employ Gaussian processes and we assume the samples are taken from class conditional processes rather than class conditional densities.
Another related line of research is the work on classification of stationary time series data [14, 15, 16]. The main focus in this work is to construct linear discriminant rules with the knowledge of having stationary data. In this framework the discriminant function is commonly the one which maximizes some measure of disparity between two multivariate densities, e.g. the Kullback-Leibler information measure. This means that the linear discriminant rules constructed here are no longer what is commonly known as LDA. Therefore, the main difference between the aforementioned results on studying the performance of LDA under correlated training data and the body of work on classification of stationary time series is that the former focuses on the analysis of the effect of correlated training data (which may have a stationary structure) on the performance of LDA, and the latter focuses on the synthesis of new classification rules with the knowledge of having stationary time series. Our work is of the first type. We study the effect of training data that can be dependent and not necessarily identically distributed or stationary on the performance of LDA.
As an application of these results, we consider two commonly used models, the first-order autoregressive and first-order moving-average models. We further study the exact effect of the autoregressive or moving-average model coefficients on the expected true error of LDA. Finally, we present numerical experiments to study several specific settings using the theory.
Before proceeding we note that univariate classification has played a major role in the history of pattern recognition, in part because of the ability to obtain closed-form solutions for error moments [1, 2, 3]; however, we should not overlook practical application. Indeed, most common tests for diagnosis and prognosis of cancer are univariate: PSA for prostate cancer [21], AFP for liver cancer [22], CA 125 for ovarian cancer [23], and CA 19.9 for colorectal cancer [24] are major protein markers. In addition to these protein biomarkers, there are major genomic markers such as BRCA1 for breast cancer [25], BRCA2 for male breast cancer [26], and APC for pancreatic cancer [27].
2. Linear Discriminant Analysis and Error Estimation: Independent Sampling
In this section, we present the traditional sampling scenario in which LDA is employed in a univariate setting. Consider a set of n = n0 + n1 independent sample points in ℝ, where X1, X2, …, Xn0 come from population Π0 and Xn0+1, Xn0+2, …, Xn0+n1 come from population Π1. Population Πi is assumed to follow a univariate Gaussian distribution , for i = 0, 1. Linear Discriminant Analysis (LDA) utilizes the Anderson W statistic, which in the univariate case is presented as
| (1) |
where X̄0 and X̄1 are the sample means for each class and σ̂2 is the pooled estimate of the variance of classes, which is assumed to be common in the LDA discriminant. Given X̄0 and X̄1, the designed LDA classifier is given by
| (2) |
with c being a constant. It is commonly assumed that c is zero [17], which is the assumption we also make throughout this paper. Therefore, the sign of W determines the classification of the sample point X and since σ̂2 > 0, (1) reduces to
| (3) |
where . Given the training data Sn (and thus X̄0 and X̄1), the classification error, also known as true error, is given by
| $\varepsilon = \alpha_0 \varepsilon_0 + \alpha_1 \varepsilon_1$ | (4) |
where αi = P (X ∈ Πi) is the a priori mixing probability for population Πi and εi is the error rate specific to population Πi, with
| (5) |
The first and second moments of the classification error are given by
| (6) |
and
| (7) |
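The quantities in this section can be illustrated numerically. The following is a minimal sketch, not the paper's analytical expressions: it evaluates the conditional true error for given sample means and estimates the first and second moments by simulation under i.i.d. sampling. The function names, the equal-prior default, and the convention that a point is assigned to class 0 when W > 0 (with c = 0) are assumptions made for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def lda_true_error(xbar0, xbar1, mu0, mu1, sigma0=1.0, sigma1=1.0, alpha0=0.5):
    """Conditional error of the sign rule W = (x - xbar)(xbar0 - xbar1), c = 0.

    Assumed convention: x goes to class 0 when W > 0 and to class 1 otherwise.
    """
    if xbar0 == xbar1:
        # W is identically zero, so every point is assigned to class 1.
        return alpha0
    xbar = 0.5 * (xbar0 + xbar1)
    # P(X <= xbar) under each class distribution.
    p0 = STD_NORMAL.cdf((xbar - mu0) / sigma0)
    p1 = STD_NORMAL.cdf((xbar - mu1) / sigma1)
    if xbar0 > xbar1:
        eps0, eps1 = p0, 1.0 - p1      # class 0 region is x > xbar
    else:
        eps0, eps1 = 1.0 - p0, p1      # class 0 region is x < xbar
    return alpha0 * eps0 + (1.0 - alpha0) * eps1

def iid_error_moments(n0, n1, mu0, mu1, sigma=1.0, trials=20000, seed=0):
    """Monte Carlo estimates of E[eps] and E[eps^2] under i.i.d. sampling."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(trials):
        xbar0 = sum(rng.gauss(mu0, sigma) for _ in range(n0)) / n0
        xbar1 = sum(rng.gauss(mu1, sigma) for _ in range(n1)) / n1
        err = lda_true_error(xbar0, xbar1, mu0, mu1, sigma, sigma)
        total += err
        total_sq += err * err
    return total / trials, total_sq / trials
```

For example, `iid_error_moments(5, 5, -1.0, 1.0)` returns simulation estimates of the two moments for five training points per class; the estimates carry Monte Carlo error and are only a check on intuition, not a substitute for the exact results.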
3. Performance of LDA classifier in Univariate Gaussian Dependent Sampling (UGDS) Model of Binary Classification
We now provide the mathematical framework to study LDA performance in a stochastic setting.
Definition 1
A process Xt = {Xt : t ∈ T} with T being an ordered set, is called a Gaussian process if any finite-dimensional vector [Xt1, Xt2, …, Xtn]T has the multivariate normal distribution N(μT, ΣT), where
and ΣT is the covariance matrix dependent on T = [t1, t2, …, tn].
Definition 2
We refer to the following sampling procedure as the Univariate Gaussian Dependent Sampling (UGDS) Model of Binary Classification: , with Ti being two ordered sets for i = 0, 1, are two Gaussian processes such that any finite-dimensional vector constructed by stacking the random variables of and as possesses a multivariate normal distribution N(μT, ΣT), where , and
| (8) |
is a positive definite covariance matrix.
This model is univariate because both processes, and , are collections of univariate random variables, not necessarily with the same means or variances. and are called class conditional processes. For ease of notation and without loss of mathematical generality, we assume that T0 and T1 are the same set and, therefore, we omit the superscript i from ti. Thus, henceforth we denote by and the stacked vector by .
Remark 1
If we assume μT = [μ0T, μ1T]T, with , i = 0, 1, j = 1, 2, …, ni, where (σi)2 is the variance of class conditional distributions and indicates the diagonal elements of matrix Σii, , i = 0, 1, j, k = 1, …, ni, j ≠ k, Σ01 = 0n0×n1 = Σ10T, and any future sample is independent from the training data and distributed either as N(μ0, (σ0)2) or N(μ1, (σ1)2), depending on its class, then the UGDS model reduces to the traditional i.i.d. sampling scenario defined in section 2. Because we will want to compare classifier errors in the dependent and independent scenarios, we will sometimes use εD and εI to denote errors in the respective settings.
Similar to (3), employing LDA with the UGDS model instead of traditional independent sampling in order to classify a sample point taken at t, denoted by Xt, results in the following W statistic for the univariate case
| (9) |
where and are the sample means for each class and . The designed LDA classifier is given by
| (10) |
For ease of notation, we hereafter omit the subscript T from μT and ΣT.
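To make the sampling model concrete, the sketch below assembles a stacked covariance matrix from the within-class blocks Σ00 and Σ11 and the cross-class block Σ01, checks positive definiteness through a Cholesky factorization, and draws one stacked training vector. It is only an illustration under stated assumptions: the helper names `cholesky`, `stack_covariance`, and `sample_ugds` are hypothetical, and the block layout [[Σ00, Σ01], [Σ01ᵀ, Σ11]] is assumed from the description of (8).

```python
import math
import random

def cholesky(a):
    """Return lower-triangular L with L L^T = a.

    Raises ValueError when a is not (numerically) positive definite.
    """
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low

def stack_covariance(s00, s01, s11):
    """Assemble [[S00, S01], [S01^T, S11]] from nested lists."""
    n0, n1 = len(s00), len(s11)
    top = [s00[i] + s01[i] for i in range(n0)]
    bottom = [[s01[i][j] for i in range(n0)] + s11[j] for j in range(n1)]
    return top + bottom

def sample_ugds(mu, sigma, rng):
    """Draw one stacked vector from N(mu, sigma) as mu + L z, z standard normal."""
    low = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(low[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mu)]
```

The first n0 entries of the returned vector play the role of the class 0 training points and the remaining n1 entries those of class 1; the entries of `mu` need not be equal within a class.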
3.1. Stochastic true error and its moments
Let denote a test sample point, where i indicates the class conditional process from which the sample comes, i.e. either or . The auto-covariance sequence of with the training data is defined as
| (11) |
where is the jth element of the sequence . Since is a future sample point, we assume 2 ≤ max{n0, n1} < s, unless otherwise stated. Throughout the paper, we use SA to denote the sum of all elements of a matrix or vector A. For instance, .
The true classifier error under the UGDS model is a function of ts. Sample points at ts can come from either process, and the classifier may misclassify any of these. Hence,
| (12) |
where , i = 0, 1, is the a priori mixing probability of the two processes and at ts and is the error rate specific to each process, with
| (13) |
By replacing with any proper statistic used in other classifiers, this stochastic definition of true error applies to other rules. The expected performance of true error is also specific to ts:
| (14) |
In (12), the true error is indexed. One could, if desired, define the true error of a classifier to be the average error the classifier induces over an index set of interest, namely, . Since characterizing εts yields a characterization of εts1−s2, no generality is gained by averaging and we restrict our attention to εts. The second moment is also a function of ts and from (12) we get
| (15) |
First focusing on , the square of the probability defining can be factored by introducing the random variable . Writing the probabilities as integrals of indicator functions allows us to apply Fubini’s theorem, which shows Xts and to be independent (denoted ). The expectation can then be applied. Altogether,
| (16) |
and
can be expressed via similar factorizations. Hence,
| (17) |
To facilitate the subsequent discussion, we will explicitly denote the dependency of true error on the number of samples. Therefore, hereafter we use εts,n0+n1 and εts, or and , interchangeably.
Throughout the paper, we use the notations Z < 0 or Z ≥ 0 to indicate componentwise inequalities, e.g. Z = (z1, z2)T < 0 means z1 < 0, z2 < 0.
3.2. Expected performance of LDA in the UGDS model
The first moment of the classification error for LDA under the UGDS model is expressed exactly according to the following theorem.
Theorem 1
Under the UGDS model, the expected true error of LDA at ts is
| (18) |
where and are Gaussian bivariate vectors with
| (19) |
where , and and are the mean and variance of random variables at ts from class i, i = 0, 1, with the auto-covariance defined as in (11).
Proof
See the Appendix.
We note that under conditions stated in Remark 1, Theorem 1 reduces to Theorem 1 in [3].
Let Φ(x, y; ρ) be the cumulative bivariate normal distribution defined as:
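| $\Phi(x, y; \rho) = \int_{-\infty}^{x} \int_{-\infty}^{y} \frac{1}{2\pi\sqrt{1-\rho^{2}}} \exp\!\left(-\frac{u^{2} - 2\rho u v + v^{2}}{2(1-\rho^{2})}\right) dv\, du$ | (20) |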
We have the following Corollary.
Corollary 2
In the model considered in Theorem 1, let the training samples from each class have the same mean, that is μ = [μ0T, μ1T]T in which , i = 0, 1, meaning the test data at ts has equal variances across classes, and . Furthermore, let , i, k = 0, 1. Then
| (21) |
where
| (22) |
| (23) |
Proof
See the Appendix.
To further proceed we present the following lemma, in which ∧ denotes conjunction.
Lemma 3
Let Φ(x, y; ρ) be the cumulative bivariate normal distribution defined in (20) and define F(x, y; ρ) and G(x, y; ρ) as follows:
| (24) |
where x and y are two constants such that xy < 0. Then, for 0 ≤ λx < 1, 0 ≤ λy < 1,
| (25) |
Proof
See the Appendix.
Using this Lemma, we compare the expected true error of the UGDS model with the independent sampling model.
Corollary 4
In the model considered in Corollary 2, let , i = 0, 1, j = 1, 2, …, ni, and let
| (26) |
Let be the expectation of the true error of the classifier in (10) specific to ts and constructed as if all n0 + n1 training samples are i.i.d. (same mean and variance). Then
| (27) |
| (28) |
where is the sum of the off diagonal elements of matrix A, defined as
Proof
Find the expected true error in Theorem 1 using the conditions in the corollary and compare it to the expected true error determined by setting all off diagonal elements of Σii to zero, i = 0, 1 and Σ01 = 0n0×n1. The proof follows by using the results of Lemma 3 in Theorem 1.
A more restricted set of sufficient conditions than those presented in Corollary 4 follows.
Corollary 5
In the model considered in Corollary 2, let , i = 0, 1, j = 1, 2, …, ni, and
| (29) |
Then
| (30) |
| (31) |
where is the sum of off diagonal elements of matrix A, defined as
Proof
The proof is similar to Corollary 4.
To have a sense of the conditions stated in Corollary 5, consider a scenario in which n0 = n1, the sample points in each class are equi-correlated with correlation ρ, and there is independent sampling across classes. This satisfies (29). If ρ > 0, then (30) holds and . If ρ < 0 and the class covariance matrices are positive definite, then (31) holds and .
Let us reflect on Corollaries 4 and 5. A correlated set of n sample points can be considered as a set in which the points convey some information about each other. Therefore, they are often considered to be as informative as n′ independent samples with n′ < n, thereby producing a poorer classifier. This intuition aligns with the simple situation in which the sample points in each class are equi-correlated with ρ > 0 and the sample points across the two classes are uncorrelated. This scenario is a special case of (30) and . However, (31) shows that there are correlation patterns that result in an expected true error smaller than it would be were there independent sampling, which means that sampling satisfying (31) is like having a larger sample size than if sampling were independent.
To illustrate, in the UGDS model suppose the training sample points are identically distributed according to two Gaussian distributions, N(−1, 1) for class 0 and N(1, 1) for class 1. Let n0 = n1 = 3 and assume that any future test point is also distributed identically to the training data of its class. Furthermore, assume the data are generated via two different scenarios, a and b, such that Σ01 = 03×3 and, for i = 0, 1,
| (32) |
Figure 1(a) shows the expected true error of the classifier designed in scenario a as a function of ρ. It demonstrates that for some dependency patterns, as defined by the covariance matrix, the classifier has better performance than if the sampling were independent. Note that in Fig. 1(a) the curves meet at ρ = 0.5, the point of equality for the inequalities (30) and (31). Note also that for ρ = −0.499, . Hence, for the sampling covariance matrix (32), 3 points have the effect of 9 independent points. In general, better classification accuracy may be achieved if the sample points are collected according to specific schemes. Equations (28) and (31) provide sufficient sets of conditions that result in such schemes.
Figure 1.

(a) Expected true error of constructed classifiers in scenario a as a function of ρ, (b) Expected true error of constructed classifiers in scenario b as a function of ρ. The horizontal line shows the performance of the constructed classifier as if the samples were independent. Solid lines: dependent samples; Dashed lines: independent samples.
Figure 1(b) shows the expected true error of a classifier constructed in scenario b by varying ρ in the same range as in scenario a. The only difference between scenarios a and b is changing the covariances between the first sample point and other sample points to positive values. It results in the curve for dependent sampling in Figure 1(b) being substantially above the curve for independent sampling.
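The equi-correlated case discussed after Corollary 5 (independent sampling across classes, common within-class correlation ρ > 0) can be checked by simulation. The sketch below is an illustration rather than the exact computation behind Figure 1: it uses N(−1, 1) and N(1, 1) with n0 = n1 = 3 as in the example above, equal priors, a test point independent of the training data, and a shared-factor construction that only covers ρ ≥ 0. The function names and the class-assignment convention (class 0 when W > 0) are assumptions.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()

def conditional_error(xbar0, xbar1, mu0=-1.0, mu1=1.0):
    """True error given the sample means (unit variances, equal priors).

    Assumed rule: assign class 0 when (x - xbar)(xbar0 - xbar1) > 0.
    """
    xbar = 0.5 * (xbar0 + xbar1)
    p0 = _STD.cdf(xbar - mu0)          # P(X <= xbar | class 0)
    p1 = _STD.cdf(xbar - mu1)          # P(X <= xbar | class 1)
    if xbar0 > xbar1:
        return 0.5 * (p0 + 1.0 - p1)
    return 0.5 * (1.0 - p0 + p1)

def equicorrelated_mean(mu, n, rho, rng):
    """Sample mean of n unit-variance Gaussians with common correlation rho >= 0."""
    shared = rng.gauss(0.0, 1.0)
    points = [mu + math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
              for _ in range(n)]
    return sum(points) / n

def expected_error(rho, n=3, trials=20000, seed=1):
    """Monte Carlo estimate of the expected true error for a given rho >= 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xbar0 = equicorrelated_mean(-1.0, n, rho, rng)
        xbar1 = equicorrelated_mean(1.0, n, rho, rng)
        total += conditional_error(xbar0, xbar1)
    return total / trials
```

Comparing `expected_error(0.5)` with `expected_error(0.0)` shows the estimate for positively equi-correlated training data sitting above the independent-sampling estimate, in line with (30). Negative correlations require a general covariance factorization and are not handled by this construction.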
3.3. Second moment of LDA true error in the UGDS model
Next we obtain the second moment of true error of LDA at ts as defined in (17).
Theorem 6
Under the UGDS model, the second moment of LDA at ts is
| (33) |
where is a 3-variate Gaussian random vector with mean and covariance matrices as follows:
| (34) |
and, for i, j = 0, 1, i ≠ j, letting
| (35) |
we have
| (36) |
with , SA is the sum of all elements of matrix A, defined as SA = Σi,j aij, and and are the mean and variance of random variables at ts from class i, i = 0, 1. Furthermore, μZIII is obtained from μZI by replacing −μ′ with μ′ and with , and is obtained from by exchanging n0 and n1, and , SΣ00 and SΣ11, and , and and .
Proof
See the Appendix.
Let and be the first and second moments of true error of the classifier in (10) specific to ts and constructed by n0 + n1 independent training sample points distributed according to the same mean and variance. Then we have the following corollary.
Corollary 7
In the model considered in Corollary 5, further assume n0 = n1 = n, Σ01 = 0n×n, and, for k, j = 1, 2, …, n,
| (37) |
where σ is the common variance of test sample points across classes at ts. Let m be the number of additional dependent training points in each class with the same class conditional means and dependency structure, meaning as in (37) for k, j = 1, 2, …, n + m and Σ01 =0(n+m)×(n+m), that are required to make . This number also makes and is given by
| (38) |
Proof
The proof of follows by equating elements of covariance matrices obtained for the dependent model in (19) with the covariance matrices for the independent sampling model. Under the conditions of the corollary, these matrices in the independent sampling scenario (given by Theorem 1 in [3]) are
| (39) |
Furthermore, we note that the conditions stated in this corollary satisfy the condition stated in (30), and hence . The proof of follows similarly by equating covariance matrices presented in (36) with those presented in Theorem 2 in [3].
In (38), if , then m < 0, meaning that adding any additional points under the dependency model in the corollary does not lower and to the level of the first and second moments of true error of the constructed LDA classifier as if the original 2n training samples were independent.
4. Applications
In this section we study applications to common models used in signal processing, the first-order autoregressive model, AR(1), and the first-order moving average model, MA(1), by assuming the training data are generated by the output processes of two models. and are two independent white noise processes and and are the processes producing the system output. The goal is to characterize the performance of the LDA classifier as a function of sample size, the parameters of the white noise processes, and the autoregressive coefficients.
4.1. First-order autoregressive model AR(1)
We consider two AR(1) models:
| (40) |
where ψi is a constant such that 0 < |ψi| < 1, i = 0, 1, and , for all t, are independent from each other. Then and are two independent covariance-stationary processes and we have the following theorem.
Theorem 8
Let in the UGDS model be defined by the two independent covariance-stationary AR(1) processes as defined in (40). Then, at ts, where max{n0, n1} < s, the expected true error of LDA constructed using the training samples and is
| (41) |
where and are Gaussian bivariate vectors with
| (42) |
where
| (43) |
Proof
See the Appendix.
Corollary 9
In the model considered in Theorem 8, let ψ0 = ψ1 = ψ, σ0 = σ1, . Let denote the expected true error of an AR(1) model with AR coefficient ψ specific to ts. Then
| (44) |
where L(h, k; ρ) is defined in (22) and
| (45) |
Proof
See the Appendix.
We consider as a function of ψ and compare it to the case where ψ = 0, which corresponds to the stochastic i.i.d. setting.
Corollary 10
In the model considered in Corollary 9, let n0 = n1 = n. Furthermore, let be the expected true error of the LDA classifier with ψ = 0 in (40). Let ψ′ and ψ″ be two arbitrary values of the AR coefficient ψ. Then
| (46) |
Hence,
| (47) |
Proof
See the Appendix.
Corollary 10 shows that, if ψ ∈ (0, 1), then under the conditions of the Corollary, constructing an LDA classifier to differentiate between AR processes is beneficial in terms of the expected true error tested on sufficiently lagged data; however, for ψ ∈ (−1, 0), we expect larger expected true error.
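The covariance quantities that an AR(1) process feeds into the general theory are easy to tabulate. The sketch below assumes the common parameterization X_t = c + ψ X_{t−1} + Z_t with white-noise variance σ² and |ψ| < 1; the paper's own form in (40) may differ in notation, and all function names are illustrative. Under this assumption the stationary mean is c/(1 − ψ) and the autocovariance at lag h is σ² ψ^|h| / (1 − ψ²).

```python
def ar1_mean(c, psi):
    """Stationary mean of X_t = c + psi * X_{t-1} + Z_t (assumed form)."""
    return c / (1.0 - psi)

def ar1_autocov(lag, psi, sigma=1.0):
    """Stationary autocovariance at the given lag for white-noise std sigma."""
    if not -1.0 < psi < 1.0:
        raise ValueError("psi must satisfy |psi| < 1 for stationarity")
    return sigma ** 2 * psi ** abs(lag) / (1.0 - psi ** 2)

def ar1_training_cov(n, psi, sigma=1.0):
    """Covariance matrix of n consecutive observations X_1, ..., X_n."""
    return [[ar1_autocov(j - k, psi, sigma) for k in range(n)] for j in range(n)]

def ar1_test_cov(n, s, psi, sigma=1.0):
    """Covariances between a test point at time s > n and X_1, ..., X_n."""
    return [ar1_autocov(s - j, psi, sigma) for j in range(1, n + 1)]

def sum_elements(a):
    """Sum of all entries of a matrix given as nested lists."""
    return sum(sum(row) for row in a)
```

For instance, `ar1_test_cov(5, 15, 0.5)` gives covariances that are already small at a lag of ten, which is one way to read the phrase "sufficiently lagged" in the discussion of Corollary 10.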
4.2. First-order moving-average model MA(1)
We consider the MA(1) models
| (48) |
where θi ∈ ℝ and , for all t, are independent from each other. Then and are two independent and covariance-stationary processes regardless of the values of θi [18].
Theorem 11
Let in the UGDS model be defined by the two independent covariance-stationary MA(1) processes defined in (48). Then, at ts, where max{n0, n1} + 1 < s, the expected true error of an LDA classifier constructed using the training samples and is
| (49) |
where and are Gaussian bivariate vectors with:
| (50) |
where i = 0, 1, and
| (51) |
Proof
See the Appendix.
Corollary 12
For the model in Theorem 11, let θ0 = θ1, σ0 = σ1, . Let denote the expected true error of an MA(1) model with MA coefficient θ specific to ts. Then
| (52) |
where L(h, k; ρ) is defined in (22) and
| (53) |
Proof
The result follows by applying the assumptions of the corollary in Theorem 11 and then following steps similar to those in the proof of Corollary 2.
Corollary 13
For the model in Corollary 12, let n0 = n1 = n. Furthermore, let be the expected true error of the LDA classifier specific to ts with θ = 0 in (48). Let θ′ and θ″ be two arbitrary values of the MA coefficient θ. Then
| (54) |
and, therefore,
| (55) |
Proof
See the Appendix.
Corollary 13 shows that there exists a range of moving-average coefficients, i.e. [ , 0), that is beneficial in terms of expected classification error, i.e. has a smaller expected true error than the stochastic i.i.d model. For positive values of the coefficient, the expected true error of LDA increases.
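Some intuition for why negative moving-average coefficients can help comes from the variance of the sample mean. The sketch below assumes the common parameterization X_t = c + Z_t + θ Z_{t−1} with white-noise variance σ²; the paper's own form in (48) may differ in notation, and the function names are illustrative. Under this assumption the autocovariance is σ²(1 + θ²) at lag 0, θσ² at lag ±1, and zero beyond, so a test point more than one step past the training data is uncorrelated with it.

```python
def ma1_autocov(lag, theta, sigma=1.0):
    """Autocovariance of X_t = c + Z_t + theta * Z_{t-1} (assumed form)."""
    lag = abs(lag)
    if lag == 0:
        return sigma ** 2 * (1.0 + theta ** 2)
    if lag == 1:
        return sigma ** 2 * theta
    return 0.0

def ma1_training_cov(n, theta, sigma=1.0):
    """Covariance matrix of n consecutive observations."""
    return [[ma1_autocov(j - k, theta, sigma) for k in range(n)] for j in range(n)]

def ma1_sample_mean_variance(n, theta, sigma=1.0):
    """Variance of the mean of n consecutive observations.

    Equals the sum of all covariance-matrix entries divided by n^2,
    i.e. (n * gamma(0) + 2 * (n - 1) * gamma(1)) / n^2.
    """
    cov = ma1_training_cov(n, theta, sigma)
    return sum(sum(row) for row in cov) / n ** 2
```

With n = 5 and σ = 1, the sample-mean variance is 0.09 at θ = −0.5, 0.2 at θ = 0, and 0.41 at θ = 0.5. This is only a heuristic companion to Corollary 13, since the expected true error depends on more than this single variance.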
5. Numerical Examples
We now illustrate the results obtained in previous sections under several specific settings.
Experiment 1
First, we consider scenarios in which the sample points taken from each class conditional process are identically distributed. They have the same mean, μ0 for class 0 and μ1 for class 1, and we set μ0 = −μ1 and μ0 = 0.5, 0.75, 1, 1.5. We assume that the observations have variance 1 and are equally correlated with ρwith ∈ [ρl, 0.95]. The value of ρl is determined so that the covariance matrix defined in (8) is positive definite. In each case we consider three settings for the correlation, ρbet, across classes: (1) independent, ρbet = 0, (2) ρbet = ρwith, and (3) ρbet = −ρwith. For each setting we consider two sample sizes, n0 = n1 = n = 5 and n0 = n1 = n = 25. We assume any future observation from each class conditional process has a distribution similar to those of the training data from that class and .
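The lower limit ρl can be worked out in closed form for the three correlation settings. Assuming unit variances, n0 = n1 = n, within-class correlation ρwith for every pair in a class, and cross-class correlation ρbet for every pair across classes, the stacked covariance matrix has the block form [[A, B], [B, A]] with A = (1 − ρwith)I + ρwith J and B = ρbet J, where J is the all-ones matrix. Its distinct eigenvalues are 1 − ρwith and 1 + (n − 1)ρwith ± nρbet. The sketch below encodes this derivation; it is offered as a check on the experimental setup rather than as the authors' procedure, and the function names are illustrative.

```python
def block_eigenvalues(n, rho_with, rho_bet):
    """Distinct eigenvalues of [[A, B], [B, A]] with
    A = (1 - rho_with) I + rho_with J and B = rho_bet J (J all ones, n x n)."""
    return (
        1.0 - rho_with,
        1.0 + (n - 1) * rho_with + n * rho_bet,
        1.0 + (n - 1) * rho_with - n * rho_bet,
    )

def is_positive_definite(n, rho_with, rho_bet):
    """True when every eigenvalue of the stacked covariance matrix is positive."""
    return min(block_eigenvalues(n, rho_with, rho_bet)) > 0.0

def rho_lower_limit(n, setting):
    """Infimum of admissible rho_with for each cross-class setting.

    'independent': rho_bet = 0; 'equal': rho_bet = rho_with;
    'opposite': rho_bet = -rho_with. Requires n >= 2.
    """
    if setting == "independent":
        return -1.0 / (n - 1)
    if setting in ("equal", "opposite"):
        return -1.0 / (2 * n - 1)
    raise ValueError("unknown setting")
```

For n = 5 this gives a limit of −0.25 when ρbet = 0 and −1/9 when ρbet = ±ρwith; for n = 25 the limits are −1/24 and −1/49. The upper end ρwith = 0.95 is admissible in all three settings.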
Figures 2(a)–2(d) show the exact expectation and standard deviation (SD) of the LDA true error for this experiment as a function of ρwith. The results are calculated from Theorems 1 and 6. Parts (a) and (b) of the figure show that increasing ρwith has an incremental effect on . Since future observations are identically distributed, is the same for all values of ts. Theoretically, for ρbet = 0, we can easily verify the graphical behavior by using Lemma 3 in Theorem 1. To see analytically the effect of ρwith on when ρbet = 0, let ρ1, ρ2 be two arbitrary values of ρwith such that ρ1 < ρ2 and denote all distributional parameters used in Corollary 2 corresponding to ρk, k = 1, 2, with a superscript ρk. With the aforementioned conditions of the experiment, we have and , k = 1, 2. Therefore, aρ1 < aρ2 and bρ1 < bρ2. The results then follow from Lemma 3. For other cases where ρbet ≠ 0, one may analytically study the effect of changing ρbet on using the results of Theorem 1 and studying the change in a manner similar to the proof of Lemma 3.
Figure 2.
Figures (a)–(d) show the exact expectation and standard deviation of LDA true error in Experiment 1 as a function of ρwith. (a) Expectation for n0 = n1 = 5; (b) Expectation for n0 = n1 = 25; (c) Standard deviation for n0 = n1 = 5; (d) Standard deviation for n0 = n1 = 25; (a)–(d) plot keys: ○ := ρbet = 0; × := ρbet = ρwith; △ := ρbet = −ρwith; solid := μ0 = 1.5; dash := μ0 = 1; dot := μ0 = 0.75; dash-dot := μ0 = 0.5. The cross section of each curve with the vertical solid line in (a)–(d) plots shows the magnitude of the expectation/variance for i.i.d. sampling situation for the corresponding scenario. The small horizontal solid lines in Figures (b) and (d) show the magnitude of expectation/variance of i.i.d. situation in Figures (a) and (c), respectively. Figures (e)–(f) show the exact expectation of LDA true error of the first-order autoregressive model in Experiment 2 as a function of ψ := ψ0 = ψ1. (e) Case of n0 = n1 = 5; (f) Case of n0 = n1 = 25; (e)–(f) plot keys: ○ := s − n0 = 2; × := s − n0 = 10; solid := c0 = 1.5; dash := c0 = 1; dot := c0 = 0.75; dash-dot := c0 = 0.5. The cross section of each curve with the vertical solid line in (e)–(f) plots shows the magnitude of the expectation for i.i.d. sampling situation for the corresponding scenario. The small horizontal solid lines in Figure (f) show the magnitude of expectation of i.i.d. situation in Figure (e).
Figures 2(a) and 2(b) show that increasing d = |μ0 − μ1| has an incremental effect on . This effect can also be seen from Lemma 3 and Corollary 2. Therefore, we call classification scenarios with a larger d, “easier” scenarios, and those with smaller d, “harder” scenarios. In this sense, d is an indicator of classification difficulty in our experiment. The figures suggest that having a between-class correlation of ρbet = ρwith > 0 helps in classification performance in “harder” classification situations (i.e., compared with ρbet = 0) and has a detrimental effect on classification performance in “easier” settings. However, having ρbet = ρwith < 0, helps to have a better classification performance in “easier” settings and results in a worse performance in “harder” settings. This is observed by the fact that curves for ρbet = ρwith are above (below) the curves for ρbet = 0 for d = 3 (d = 1).
The standard deviation is more complicated to interpret. The trends seen in Figures 2(c) and 2(d) suggest that increasing ρwith generally increases the standard deviation of the LDA true error in cases where ρbet = 0. They also suggest that when ρbet = −ρwith, the standard deviation generally increases as ρwith grows, but when ρbet = ρwith, increasing ρwith in small sample sizes may increase or decrease the standard deviation depending on classification difficulty, and as the sample size gets larger, increasing ρwith generally increases the standard deviation. Furthermore, the figures suggest that increasing the classification difficulty may first increase the standard deviation and then decrease it.
Comparing Figure 2(a) with 2(b) and Figure 2(c) with 2(d) shows that increasing the sample size lowers the magnitude of the expectation and standard deviation regardless of classification difficulty or magnitude of ρwith.
Experiment 2
In this experiment, we use the first-order autoregressive model defined in (40). We assume , n0 = n1 = n, σ0 = σ1 = 1, and ψ0 = ψ1 = ψ ∈ [−0.95, 0.95]. We consider various cases where c0 = 0.5, 0.75, 1, 1.5 with c0 = −c1. Figures 2(e) and 2(f) show the exact expectation of LDA true error for this experiment, calculated from Theorem 8. These figures suggest that increasing ψ decreases . According to Corollary 10, for a sufficiently lagged ts, is a decreasing function of ψ and, furthermore, for 0 < ψ < 1 and for −1 < ψ < 0. Here the same behavior is observed even for small lags of 2 and 10. Furthermore, decreasing the sample size and increasing the classification difficulty have an incremental effect on the expected true error.
Experiment 3
In this experiment, we use the MA(1) model in (48). We assume , n0 = n1 = n, σ0 = σ1 = 1, and θ0 = θ1 = θ ∈ [−10, 10]. We consider c0 = 0.5, 0.75, 1, 1.5 with c0 = −c1. Figure 3 shows the exact expectation of the LDA true error for this experiment, calculated from Theorem 11.
Figure 3.
Exact expectation of LDA true error of the first-order moving average model in Experiment 3 as a function of θ := θ0 = θ1. (a) Expectation for n = 5; (b) Expectation for n = 25; (c) Magnification of region [−1, 0.1] in Figure (a); (d) Magnification of region [−1, 0.1] in Figure (b). Plot keys: ○ := c0 = 1.5; △ := c0 = 1; ▽ := c0 = 0.75; × := c0 = 0.5. The cross section of each curve with the vertical solid line in each plot shows the magnitude of the expectation for i.i.d. sampling situation for the corresponding scenario. The small horizontal solid lines in Figure (b) are drawn to facilitate comparing this figure with Figure (a) at these cross sections. The region to the left of the blue dotted line is (−∞, ], where by (54) the expectation of the true error is a decreasing function of θ. The region to the right of the red dashed line is [ , ∞), where by (54) the expectation is an increasing function of θ.
Figure 3 shows that the expected true error of LDA under the MA(1) model has an inverted bell shape with a negatively biased center, and the bias decreases as the sample size increases. The results of Corollary 13 are clear in this figure: for is a decreasing function of θ. This region is on the left-hand side of the vertical blue dotted lines in Figure 3. For is an increasing function of θ. This region is on the right-hand side of the vertical red dashed line in the figure. As proved in Corollary 13, we observe in Figure 3(c) and 3(d) that, for . This is the region between red dashed line and the vertical black line of each plot. For θ ∈ (0, ∞), . This is the region on the right-hand side of the vertical black solid lines.
Experiment 4
This experiment is an example derived from gene-expression data used in studying the prognosis of breast cancer using 70 genes with high prognostic ability [19]. Following [20], we divide the 307 individuals used in this study into 64 “poor” prognosis (class 0) versus 243 “good” prognosis (class 1) patients. A poor prognosis is defined to be a distant metastasis within 5 years of initial diagnosis. The gene expression data used in this study were collected by triplicating each gene on each microarray and then duplicating each measurement by dye-swapping. Therefore, for each patient and each gene, we have six measurements, three of which are positively correlated with one another and negatively correlated with the others. Using this dataset, we consider a scenario in which the experimenter is given only six measurements taken from one patient from class 0 and six measurements from another patient from class 1, and a univariate LDA classifier is desired to differentiate the two groups. We assume the single variate used in this classifier is the ALDH4 gene, which has the highest correlation with prognosis of breast cancer in [20]. Therefore, in this scenario, the experimenter is given 12 “technical” replicates in total, which are now treated as our “sample points”. This is an example of the UGDS model in genomic applications in which our classification is defined by two Gaussian processes, and , which are assumed to be independent processes. We note that the expected performance of a classifier depends on ts, i.e. in Theorem 1, which is a function of the distribution of the future data as well as the distribution of the training data and their correlation structure. We verify the Gaussianity of each of the 12 random variables, , used for characterizing the two Gaussian processes of this example via a Shapiro-Wilk test (using the R statistical software) on the full dataset corresponding to each random variable.
This test does not reject Gaussianity of the random variables over either of the classes at the 5% significance level after applying the Bonferroni correction for multiple hypothesis testing.
Unfortunately, taken together, the 12 random variables do not pass the Shapiro-Wilk test for multivariate Gaussianity. Nonetheless, we will proceed and demonstrate that, even with this lack of multivariate Gaussianity, Theorem 1 is much more accurate than its counterpart in [3], which assumes i.i.d. data from each distribution.
Sample means, variances, and correlations, computed on the full dataset, were used as estimates of the unknown true means, variances, and the correlation structure between samples needed in Theorem 1. Using Theorem 1, the expected performance of a classifier, , to differentiate samples distributed as from samples distributed as is 0.475. To further verify this expected performance, we construct a classifier on each of the 243 × 64 = 15552 possible combinations of 6 samples from each class, and each time we test the accuracy of the designed classifier on the 64 − 1 = 63 remaining realizations of and the 243 − 1 = 242 remaining realizations of . The accuracy computed in this way is 0.479, which is almost the same as the value computed from Theorem 1. It is interesting to compare this accuracy to the case in which one designs a classifier without paying attention to the correlation structure between samples and the various distributions governing the data (that is, treating the data as i.i.d.). In this scenario, one (incorrectly) considers the data from each class as coming from a single distribution, and the expected performance of a classifier can therefore be evaluated from Theorem 1 of [3]. Again we use the sample means and variances, computed on the full dataset, as estimates of the unknown true means and variances. In this case the expected performance of LDA is estimated to be 0.374, which is very far from 0.479.
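The empirical check described above can be sketched as follows. This is an illustration on synthetic numbers, not the study data: it assumes equal priors, that the held-out test value is one fixed replicate per remaining patient (index `test_index`), and that the two class error rates are averaged with equal weight. All function names and the synthetic generator are hypothetical.

```python
import random


def lda_label(x, xbar0, xbar1):
    """Univariate LDA with equal priors: 0 if x is on class 0's side of the midpoint."""
    return 0 if (x - 0.5 * (xbar0 + xbar1)) * (xbar0 - xbar1) >= 0 else 1


def pair_holdout_error(class0, class1, test_index=0):
    """Average held-out error over all (class-0 patient, class-1 patient) pairs.

    class0 and class1 are lists of patients; each patient is a list of
    replicate measurements. Each pair supplies the training data, and the
    remaining patients of both classes supply the test points.
    """
    errors = []
    for i, train0 in enumerate(class0):
        xbar0 = sum(train0) / len(train0)
        for j, train1 in enumerate(class1):
            xbar1 = sum(train1) / len(train1)
            miss0 = sum(lda_label(p[test_index], xbar0, xbar1) != 0
                        for k, p in enumerate(class0) if k != i)
            miss1 = sum(lda_label(p[test_index], xbar0, xbar1) != 1
                        for k, p in enumerate(class1) if k != j)
            errors.append(0.5 * (miss0 / (len(class0) - 1)
                                 + miss1 / (len(class1) - 1)))
    return sum(errors) / len(errors)


def synthetic_class(n_patients, mean, rng, n_rep=6):
    """Synthetic stand-in data: a shared patient effect makes replicates correlated."""
    patients = []
    for _ in range(n_patients):
        effect = rng.gauss(0.0, 1.0)
        patients.append([mean + effect + rng.gauss(0.0, 0.3) for _ in range(n_rep)])
    return patients


if __name__ == "__main__":
    rng = random.Random(0)
    c0 = synthetic_class(8, 0.3, rng)
    c1 = synthetic_class(10, -0.3, rng)
    print(pair_holdout_error(c0, c1))
```

With 64 and 243 patients the outer loops visit 64 × 243 = 15552 training pairs, matching the count in the text.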
6. Conclusion
In many applications, the assumption of having i.i.d. training samples is violated. This paper characterizes the performance of univariate LDA classification in stochastic settings by assuming the samples are taken from two class-conditional Gaussian processes, which are not necessarily independent. Linear classification has been considered owing to its long history in pattern recognition and its suitability for small-sample classification. We do not impose a specific correlation structure on the training data. We have presented conditions under which the correlation structure can be either beneficial or detrimental in terms of classification performance. As an application, we have obtained exact expressions for the performance of LDA in situations where the data are produced through auto-regressive (AR) or moving-average (MA) models of the first order. We have found ranges of the AR or MA multiplicative coefficients over which the expected error increases or decreases. Having characterized univariate LDA performance in closed form, we aim to follow our work in [3] and characterize the effect of non-i.i.d. samples on training-data-based error estimators.
Acknowledgments
This work was partially supported by NIH grant 2R25CA090301 (Nutrition, Biostatistics, and Bioinformatics) from the National Cancer Institute.
Appendix. Various Correlation Structures
Let p-dimensional sample points of each class, X1, X2, …, Xni, with Xj being a column vector, be separately taken from two p-variate normal distributions, Π1 and Π2, with the distribution N(μi, Σ). Furthermore, let Vi be the dispersion matrix of the nip × 1 vector , i = 0, 1, defined as Vi = E[(X − E(X))(X − E(X))T]. We define three correlation structures in regard to the data: (1) equicorrelated if Vi = Ini ⊗ (Σ − R) + Eni ⊗ R, with R being a symmetric matrix, In the n × n identity matrix, and En the n × n matrix with all elements being 1; (2) simply equicorrelated if Vi = Ini ⊗ (1 − ρ)Σ + Eni ⊗ ρΣ, where ρ is a nonzero scalar constant with |ρ| < 1; and (3) serially correlated if Vi = Ini ⊗ Σ + Eni ⊗ ρτ Σ, where τ = |k − l|, k, l = 1, 2, …, ni, |ρτ| < 1, τ = 1, 2, …, ni, and ρ0 = 0. Note that for univariate sample points, the equicorrelation and simple-equicorrelation structures are essentially the same.
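For the univariate case (p = 1, Σ = σ², R = r), the three structures reduce to ordinary n × n matrices, which the following sketch builds with plain lists. It reads the serially correlated definition as "entry (k, l) equals ρ_τ σ² for lag τ = |k − l| ≥ 1, and σ² on the diagonal", which is an interpretation of the notation above. The function names are hypothetical.

```python
def equicorrelated(n, var, r):
    """I_n * (var - r) + E_n * r: var on the diagonal, r elsewhere."""
    return [[var if k == l else r for l in range(n)] for k in range(n)]


def simply_equicorrelated(n, var, rho):
    """I_n * (1 - rho) * var + E_n * rho * var."""
    return equicorrelated(n, var, rho * var)


def serially_correlated(n, var, rhos):
    """var on the diagonal; rhos[tau - 1] * var at lag tau = |k - l| >= 1."""
    return [[var if k == l else rhos[abs(k - l) - 1] * var for l in range(n)]
            for k in range(n)]


if __name__ == "__main__":
    # With p = 1, choosing rho = r / var makes structures (1) and (2) coincide.
    print(equicorrelated(3, 2.0, 0.5) == simply_equicorrelated(3, 2.0, 0.25))
    for row in serially_correlated(3, 1.0, [0.5, 0.25]):
        print(row)
```

The first printed comparison illustrates the closing remark of the paragraph: in one dimension a symmetric R is just a scalar, so it can always be written as ρσ².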
Proof of Theorem 1
From (9), it follows that
where . Expanding and as and results in
| (56) |
where and , where the super index 0 in is to denote explicitly , and
| (57) |
Therefore, is a Gaussian random vector with mean and covariance matrix . Plugging in the values of and noting the fact that the jth element of vector is defined as , i, k = 0, 1, j = 1, 2, …, nk, we have
| (58) |
which leads to the expression stated in Theorem 1. Evaluating the mean and covariance matrix of vector , which is the counterpart for , is entirely similar, by considering .
Proof of Corollary 2
Note that for Φ(x, y; ρ) defined in (20),
| (59) |
By considering the assumption of the corollary for Theorem 1, and using (20) and (59) in (18), we get
| (60) |
with a, b, and ρ defined in the corollary. Using the identity [28]
| (61) |
where Φ(.) is the standard normal cumulative function, completes the proof.
Proof of Lemma 3
Here, we first provide a way to intuitively understand the Lemma and then we provide a rigorous proof. We have
| (62) |
where the last equality is due to xy < 0 stated as an assumption to the lemma. Intuitively, the lemma makes sense because smaller values of |x|, |y|, and |ρ| imply not only a smaller integration region in (22), but also less mass in that region. Next we provide a rigorous proof. It is straightforward to show
| (63) |
where the last equality comes from well known results of Gaussian distribution, where . Without loss of generality, we assume x ≥ 0 and y ≤ 0. The results for x ≤ 0 and y ≥ 0 are entirely similar after exchanging x and y in the following proof. We have
| (64) |
Hence, . Similarly, . Furthermore,
For 0 ≤ λ ≤ 1, we set
| (65) |
Then λ, γx ≥ 0, γy ≤ 0, ρi ≥ 0 ⇒ γρ ≥ 0 (i = 0, 1), and ρi < 0 ⇒ γρ < 0 (i = 0, 1). Thus, and
| (66) |
Then
| (67) |
First assume ρi ≥ 0, i = 0, 1, so that γρ ≥ 0, . Since , x1 ≤ x0, , y1 ≥ y0, and ρ1 ≤ ρ0, we have . Therefore,
| (68) |
Next assume ρi ≤ 0, i = 0, 1, so that γρ ≤ 0, . Since , x1 ≤ x0, , y1 ≥ y0, and ρ1 ≥ ρ0, we have . Therefore,
Lastly, assume the ρi’s have opposite signs. Without loss of generality, assume ρ0 < 0, ρ1 ≥ 0, and |ρ1| ≤ |ρ0|. Then
| (69) |
where , x1 < xm < x0, and y0 < ym < y1. From the definition of G(x, y, ρ) it is easy to see that G(x, y, ρ) = G(x, y, −ρ) and then, from the conditions that result in (68), we have G(xm, ym, −ρ1) − G(x1, y1, ρ1) = G(xm, ym, ρ1) − G(x1, y1, ρ1) ≥ 0. Hence, in order to show G(x1, y1, ρ1) − G(x0, y0, ρ0) ≥ 0 it is sufficient to show that . For , we have γρ ≤ 0, . Therefore, from (67), and, furthermore, . Thus, and the result follows.
Proof of Theorem 6
From (16) and (9), it follows that
| (70) |
where . Expanding and as and results in
| (71) |
where , in which , and the super index 0 in and is to denote explicitly Xts, , and
Therefore, is a Gaussian random vector with mean and covariance matrix . Plugging in the values of and noting the fact that the jth element of vector is defined as , i, k = 0, 1, j = 1, 2, …, nk, and from the definition of it holds that , we have:
| (72) |
which leads to the expression stated in Theorem 6. Evaluating the mean and covariance matrices of vector and is entirely similar.
Proof of Theorem 8
Since the ’s are Gaussian, and are covariance-stationary [18] and the vectors and are distributed normally as , i = 0, 1, where
| (73) |
, and Σi(k, l) denotes the entry in the kth row and lth column of matrix Σi. The result follows by replacing (73) in Theorem 1.
Proof of Corollary 9
Using the corollary assumptions in Theorem 8, we get
| (74) |
| (75) |
with k, a, b, ρ, and μ defined in (45). Let F(x, y; ρ) = Φ(x, y; ρ) + Φ(−x, −y; ρ), with Φ(x, y; ρ) defined in (20). Then using Scheffe’s Lemma [29] we have
| (76) |
Note that by taking the limit, the term in and converges exponentially to 0 and we have . The result follows similarly to the proof of Corollary 2.
Proof of Corollary 10
With n0 = n1 = n, we have ρ = 0 in (45). From (76) we get , with G(hψ, −kψ; 0) defined as in (24) and lψ= −kψ, where we use a subscript ψ to explicitly denote dependence of l and h on ψ. Since hψlψ< 0, we can use a proof similar to that of Lemma 3 to compare different AR models. Specifically, suppose we prove that
| (77) |
Then, as in the proof of Lemma 3, we can prove G(hψ″, lψ″; 0) < G(hψ′, lψ′; 0), so that
| (78) |
thereby proving the basic inequality in the corollary. We first demonstrate (77). Assume c0 > c1. We first prove that for ψ ∈ (−1, 1), we have and . It is easy to see that:
where
| (79) |
From Descartes’ Rule of Signs [30], gψ has either zero or two positive roots. For n ≥ 2,
| (80) |
Therefore, for all n, gψ has two roots at 1 and these are the only positive roots. Similarly we observe that if n is even, then gψ has only two negative roots at −1. If n is odd, again from Descartes’ Rule of Signs [30], gψ has only one negative root, denoted by ψ−. We show that ψ− ∈ (−∞, −1). Let ψ− = −ψ+, ψ+ > 0. Since n is odd, we need to have
| (81) |
Were ψ+ ∈ (0, 1), this would imply . Hence, (81) is not possible and ψ− ∈ (−∞, −1). Summarizing this result, we see that ψ ∈ (−1, 1) ⇒ gψ > 0 and therefore, . It is straightforward to show
where . Since for ψ ∈ (−1, 1) we have gψ > 0, then . We set γψ = λψ1 + (1 − λ)ψ0, where 0 ≤ λ ≤ 1. Now we check that (78) holds. Denoting G(hγψ,lγψ; 0) by G, we have
| (82) |
Since ψ ∈ (−1, 1), 0 ≤ λ ≤ 1, and hψlψ< 0, we can see that γψ ∈ (−1, 1), hγψlγψ< 0, , and from Proof of Lemma 3 in the appendix, and . Since ψ″ > ψ′, we see that . Similar to the proof of Lemma 3, integrating over λ results in G(hψ″, lψ″; 0) < G(hψ′, lψ′; 0). The same basic argument goes through for c0 < c1 and we have . The remaining results follow from the definition of , where we have .
Proof of Theorem 11
Since the ’s are Gaussian, and are covariance-stationary and the vectors and are distributed normally as , i = 0, 1, [18], where for k = 1, 2, …, ni,
| (83) |
where μi = ci and Σi(k, l) denotes the entry in the kth row and lth column of the matrix Σi. The result follows by replacing (83) in Theorem 1.
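Since the displayed covariance is not reproduced here, the following sketch uses the textbook MA(1) autocovariances from [18], assuming (48) has the standard form X_t = c + Z_t + θ Z_{t−1} with Gaussian white noise Z_t of variance σ²: the variance is (1 + θ²)σ², the lag-1 covariance is θσ², and all higher lags are zero. It also compares the implied variance of the sample mean with a simulation. The function names are hypothetical.

```python
import random


def ma1_cov(n, theta, sigma=1.0):
    """n x n covariance matrix of a Gaussian MA(1) series (textbook form)."""
    g0 = (1.0 + theta * theta) * sigma * sigma  # variance
    g1 = theta * sigma * sigma                  # lag-1 autocovariance
    return [[g0 if k == l else (g1 if abs(k - l) == 1 else 0.0)
             for l in range(n)] for k in range(n)]


def ma1_series(n, c, theta, sigma, rng):
    """n consecutive points of X_t = c + Z_t + theta * Z_{t-1}."""
    z = [rng.gauss(0.0, sigma) for _ in range(n + 1)]
    return [c + z[t + 1] + theta * z[t] for t in range(n)]


def sample_mean_variance(cov):
    """Variance of the sample mean: (1' V 1) / n^2."""
    n = len(cov)
    return sum(sum(row) for row in cov) / (n * n)


if __name__ == "__main__":
    rng = random.Random(0)
    n, theta = 5, -0.5
    means = [sum(ma1_series(n, 0.0, theta, 1.0, rng)) / n for _ in range(20000)]
    empirical = sum(m * m for m in means) / len(means)
    print(sample_mean_variance(ma1_cov(n, theta)), empirical)
```

Under this form a negative θ lowers the variance of the sample means, which may help explain why the minimum of the error curve in Figure 3 sits at a slightly negative θ; that reading is an interpretation, not a statement from the proofs.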
Proof of Corollary 13
From Theorem 11, since and max{n0, n1} + 1 < s, we have , for any s, with lθ = −kθ, hθ and kθ defined in Corollary 12, and G(hθ, −kθ; 0) defined as in (24). As in the proof of Corollary 10, the present corollary follows by setting n0 = n1 = n and using
| (84) |
where a and b are obtained from (53).
Contributor Information
Amin Zollanvari, Email: amin_zoll@neo.tamu.edu.
Jianping Hua, Email: jhua@tgen.org.
Edward R. Dougherty, Email: edward@ece.tamu.edu.
References
- 1. Hills M. Allocation rules and their error rates. J Royal Statist Soc Ser B (Methodological) 1966;28(1):1–31.
- 2. Sorum MJ. Estimating the expected probability of misclassification for a rule based on the linear discriminant function: Univariate normal case. Technometrics. 1973;15:329–339.
- 3. Zollanvari A, Braga-Neto UM, Dougherty ER. Exact representation of the second-order moments for resubstitution and leave-one-out error estimation for linear discriminant analysis in the univariate heteroskedastic gaussian model. Pattern Recogn. 2012;45:908–917.
- 4. Basu JP, Odell PL. Effect of intraclass correlation among training samples on the misclassification probabilities of bayes’ procedure. Pattern Recogn. 1974;6:13–16.
- 5. McLachlan GJ. Further results on the effect of interclass correlation among training samples in discriminant analysis. Pattern Recogn. 1976;8:273–275.
- 6. Tubbs JD. Effect of autocorrelated training samples on bayes’ probability of misclassification. Pattern Recogn. 1980;12:351–354.
- 7. Lawoko CRO, McLachlan GJ. Discrimination with autocorrelated observations. Pattern Recogn. 1985;18:145–149.
- 8. Lawoko CRO, McLachlan GJ. Asymptotic error rates of the w and z statistics when the training observations are dependent. Pattern Recogn. 1986;19:467–471.
- 9. Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd; 1925.
- 10. Martin JK, Hirschberg DS. Small sample statistics for classification error rates ii: Confidence intervals and significance tests. 1996
- 11. Zollanvari A, Braga-Neto UM, Dougherty ER. On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers. Pattern Recogn. 2009;42(11):2705–2723.
- 12. Zollanvari A, Braga-Neto UM, Dougherty ER. Joint sampling distribution between actual and estimated classification errors for linear discriminant analysis. IEEE Trans Inf Theory. 2010;56(2):784–804.
- 13. Zollanvari A, Braga-Neto UM, Dougherty ER. Analytic study of performance of error estimators for linear discriminant analysis. IEEE Trans Sig Proc. 2011;59(9):4238–4255.
- 14. Shumway RH, Unger AN. Linear discriminant functions for stationary time series. J Am Statist Assoc. 1974;69:948–956.
- 15. Kakizawa Y, Shumway R, Taniguchi M. Discrimination and clustering for multivariate time series. J Am Statist Assoc. 1998;93:328–340.
- 16. Kazakos D, Papantoni-Kazakos P. Spectral distance measuring between gaussian processes. IEEE Trans Autom Control. 1980;25:950–959.
- 17. McLachlan GJ. The asymptotic distributions of the conditional error rate and risk in discriminant analysis. Biometrika. 1974;61:131–135.
- 18. Hamilton JD. Time Series Analysis. NJ: Princeton University Press; 1994.
- 19. Buyse M, Loi S, et al. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. Journal of the National Cancer Institute. 2006;98:1183–1192. doi: 10.1093/jnci/djj329.
- 20. van’t Veer L, Dai H, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–6. doi: 10.1038/415530a.
- 21. Schröder FH, Hugosson J, et al. Screening and prostate-cancer mortality in a randomized european study. New Eng J Med. 2009;360:1320–1328. doi: 10.1056/NEJMoa0810084.
- 22. Koelink CJL, van Hasselt P, et al. Tyrosinemia type i treated by ntbc: How does afp predict liver cancer? Mol Genet Metab. 2006;89:310–315. doi: 10.1016/j.ymgme.2006.07.009.
- 23. Bast RC, Xu FJ, et al. Ca 125: the past and the future. Int J Biol Markers. 1998;13:179–187. doi: 10.1177/172460089801300402.
- 24. Filella X, Molina R, et al. Prognostic value of ca 19.9 levels in colorectal cancer. Ann Surg. 1992;216:55–59. doi: 10.1097/00000658-199207000-00008.
- 25. Frank TS, Deffenbaugh AM, et al. Clinical characteristics of individuals with germline mutations in brca1 and brca2: Analysis of 10,000 individuals. J Clin Oncol. 2002;20:1480–1490. doi: 10.1200/JCO.2002.20.6.1480.
- 26. Syrjakoski K, Kuukasjarvi T, et al. Brca2 mutations in 154 finnish male breast cancer patients. Neoplasia. 2004;6:541–545. doi: 10.1593/neo.04193.
- 27. Horii A, Nakatsuru S, et al. Frequent somatic mutations of the apc gene in human pancreatic cancer. Cancer Res. 1992;52:6696–6698.
- 28. Abramowitz M, Stegun IA. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover Publications; 1972.
- 29. Scheffe H. A useful convergence theorem for probability distributions. Ann Math Statist. 1947;18:434–438.
- 30. Anderson B, Jackson J, Sitharam M. Descartes’ rule of signs revisited. Amer Math Monthly. 1998;105:447–451.


