Analytical Study of Performance of Linear Discriminant Analysis in Stochastic Settings

Amin Zollanvari; Jianping Hua; Edward R Dougherty

doi:10.1016/j.patcog.2013.04.002

Author manuscript; available in PMC: 2013 Nov 1.

Published in final edited form as: Pattern Recognit. 2013 Nov;46(11):3017–3029. doi: 10.1016/j.patcog.2013.04.002


PMCID: PMC3769149 NIHMSID: NIHMS501827 PMID: 24039299

Abstract

This paper provides exact analytical expressions for the first and second moments of the true error for linear discriminant analysis (LDA) when the data are univariate and taken from two stochastic Gaussian processes. The key point is that we assume a general setting in which the sample data from each class do not need to be identically distributed or independent within or between classes. We compare the true errors of designed classifiers under the typical i.i.d. model and when the data are correlated, providing exact expressions and demonstrating that, depending on the covariance structure, correlated data can result in classifiers with either greater error or less error than when training with uncorrelated data. The general theory is applied to autoregressive and moving-average models of the first order, and it is demonstrated using real genomic data.

Keywords: Linear discriminant analysis, Stochastic settings, Correlated data, Non-i.i.d. data, Expected error, Gaussian processes, Auto-regressive models, Moving-average models

1. Introduction

It is common in practice to assume that the training data used to construct a classifier are independent and identically distributed (i.i.d.). Should the data be dependent or not identically distributed, the classifier performance is affected. This paper presents a mathematical framework for analytically studying classifiers in such situations in general, and the univariate LDA (linear discriminant analysis) classifier in particular. We pay particular attention to the univariate LDA model because it is possible to obtain closed-form (not asymptotic) results for moments of the error, in analogy to moments for the error [1, 2] and error estimates [1, 3] for univariate LDA with i.i.d. sampling. The desired framework is achieved by placing classifier performance in a stochastic setting where the training data are univariate, possibly dependent, and not necessarily identically distributed.

Motivation for this line of research goes back to the early 1970s, when Basu and Odell observed in remote sensing applications that the conditional expected true error of LDA is commonly higher than what is expected from a theoretical analysis [4]. They associated this observation with violation of the independence assumption on the training data.

To study the effect of correlated training data on the performance of LDA, Basu and Odell [4] used numerical examples under an equicorrelated structure of samples (see the Appendix for definitions of the various correlation structures). They showed that misclassification probabilities change under such structures. Afterwards, McLachlan [5] used asymptotic analysis to show that even under a simple-equicorrelated structure the probability of misclassification changes. Later, Tubbs [6] used a similar asymptotic analysis but with a serially correlated structure among training data. He considered further simplifying assumptions to show that the asymptotic error rate changes with serially correlated data having positive correlations. Lawoko and McLachlan [7] used the same serially correlated structure and obtained a different asymptotic expansion of LDA true error from the one that Tubbs previously achieved in [6]. This type of asymptotic analysis was later used in [7, 8] to characterize the asymptotic expected true error of univariate LDA and Z-statistics assuming an autoregressive process of order p.

Typically, large-sample asymptotic results are not helpful in small-sample situations. Going back to 1925, R. A. Fisher wrote, “Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data” [9, 10]. This understanding led us to study the distribution and exact moments of LDA true error and common estimators [11, 3, 12, 13].

Having laid the groundwork for analyzing LDA related statistics in small-sample situations, in this work we establish a framework for studying LDA in stochastic settings, thereby allowing us to obtain the exact first and second moments of univariate LDA true error in a general stochastic setting. We neither impose a specific correlation structure on the training data, nor do we assume the training data have necessarily the same mean or variance. For example the basic assumption in [4, 5, 6, 7, 8] is that the training data of the two classes are taken separately from two class conditional densities Π₀, for class 0, and Π₁, for class 1. This assumption immediately imposes several restrictions on the problem: the training data from each class have the same mean and variance (because they are coming from the same distribution) and, furthermore, only intraclass correlations exist. The stochastic setting permits us to generalize such assumptions to training data being correlated across classes or the samples from each class being differently distributed. To model such data we employ Gaussian processes and we assume the samples are taken from class conditional processes rather than class conditional densities.

Another related line of research is the work on classification of stationary time series data [14, 15, 16]. The main focus in this work is to construct linear discriminant rules with the knowledge of having stationary data. In this framework the discriminant function is commonly the one which maximizes some measure of disparity between two multivariate densities, e.g. the Kullback-Leibler information measure. This means that the linear discriminant rules constructed here are no longer what is commonly known as LDA. Therefore, the main difference between the aforementioned results on studying the performance of LDA under correlated training data and the body of work on classification of stationary time series is that the former focuses on the analysis of the effect of correlated training data (which may have a stationary structure) on the performance of LDA, and the latter focuses on the synthesis of new classification rules with the knowledge of having stationary time series. Our work is of the first type. We study the effect of training data that can be dependent and not necessarily identically distributed or stationary on the performance of LDA.

As an application of these results, we consider two commonly used models, the first-order autoregressive and first-order moving-average models. We further study the exact effect of the autoregressive or moving-average model coefficients on the expected true error of LDA. Finally, we present numerical experiments to study several specific settings using the theory.

Before proceeding, we note that univariate classification has played a major role in the history of pattern recognition, in part because of the ability to obtain closed-form solutions for error moments [1, 2, 3]; however, we should not overlook practical application. Indeed, most common tests for diagnosis and prognosis of cancer are univariate: PSA for prostate cancer [21], AFP for liver cancer [22], CA 125 for ovarian cancer [23], and CA 19.9 for colorectal cancer [24] are major protein markers. In addition to these protein biomarkers, there are major genomic markers such as BRCA1 for breast cancer [25], BRCA2 for male breast cancer [26], and APC for pancreatic cancer [27].

2. Linear Discriminant Analysis and Error Estimation: Independent Sampling

In this section, we present the traditional sampling scenario in which LDA is employed in a univariate setting. Consider a set of n = n₀ + n₁ independent sample points in ℝ, where X₁, X₂, …, X_n₀ come from population Π₀ and X_n₀+1, X_n₀+2, …, X_n₀+n₁ come from population Π₁. Population Π_i is assumed to follow a univariate Gaussian distribution $N (μ_{i}, σ_{i}^{2})$ , for i = 0, 1. Linear Discriminant Analysis (LDA) utilizes the Anderson W statistic, which in the univariate case is presented as

W ({\bar{X}}^{0}, {\bar{X}}^{1}, X) = \frac{1}{{\hat{σ}}^{2}} {(X - \frac{{\bar{X}}^{0} + {\bar{X}}^{1}}{2})}^{T} ({\bar{X}}^{0} - {\bar{X}}^{1}),

(1)

where ${\bar{X}}^{0} = \frac{1}{n_{0}} \sum_{i = 1}^{n_{0}} X_{i}$ and ${\bar{X}}^{1} = \frac{1}{n_{1}} \sum_{i = n_{0} + 1}^{n_{0} + n_{1}} X_{i}$ are the sample means for each class and σ̂² is the pooled estimate of the variance of classes, which is assumed to be common in the LDA discriminant. Given X̄⁰ and X̄¹, the designed LDA classifier is given by

ψ (X) = \begin{cases} 1, & \text{if } W ({\bar{X}}^{0}, {\bar{X}}^{1}, X) \leq c \\ 0, & \text{if } W ({\bar{X}}^{0}, {\bar{X}}^{1}, X) > c \end{cases},

(2)

with c being a constant. It is commonly assumed that c is zero [17], which is the assumption we also make throughout this paper. Therefore, the sign of W determines the classification of the sample point X and since σ̂² > 0, (1) reduces to

W ({\bar{X}}^{0}, {\bar{X}}^{1}, X) = (X - \bar{X}) ({\bar{X}}^{0} - {\bar{X}}^{1})

(3)

where $\bar{X} = \frac{{\bar{X}}^{0} + {\bar{X}}^{1}}{2}$ . Given the training data S_n (and thus X̄⁰ and X̄¹), the classification error, also known as the true error, is given by

ε = P (W ({\bar{X}}^{0}, {\bar{X}}^{1}, X) \leq 0, X \in Π_{0} ∣ {\bar{X}}^{0}, {\bar{X}}^{1}) + P (W ({\bar{X}}^{0}, {\bar{X}}^{1}, X) > 0, X \in Π_{1} ∣ {\bar{X}}^{0}, {\bar{X}}^{1}) = α_{0} ε^{0} + α_{1} ε^{1},

(4)

where α_i = P (X ∈ Π_i) is the a priori mixing probability for population Π_i and εⁱ is the error rate specific to population Π_i, with

ε^{i} = P ({(- 1)}^{i} W ({\bar{X}}^{0}, {\bar{X}}^{1}, X) \leq 0 ∣ X \in Π_{i}, {\bar{X}}^{0}, {\bar{X}}^{1}) .

(5)
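The decision rule in (2) and (3) and the conditional error in (4) and (5) can be sketched in a few lines. The following is a minimal illustration, assuming c = 0 and Gaussian populations with known parameters; the function names `w_statistic`, `classify`, and `true_error` are our own and are used only for this sketch.

```python
from statistics import NormalDist

def w_statistic(xbar0, xbar1, x):
    """Reduced W statistic of equation (3)."""
    return (x - 0.5 * (xbar0 + xbar1)) * (xbar0 - xbar1)

def classify(xbar0, xbar1, x):
    """Designed classifier of equation (2) with c = 0."""
    return 1 if w_statistic(xbar0, xbar1, x) <= 0 else 0

def true_error(xbar0, xbar1, mu0, sigma0, mu1, sigma1, alpha0=0.5):
    """Conditional true error of equation (4) for Gaussian populations."""
    threshold = 0.5 * (xbar0 + xbar1)
    below0 = NormalDist(mu0, sigma0).cdf(threshold)  # P(X <= threshold | class 0)
    below1 = NormalDist(mu1, sigma1).cdf(threshold)  # P(X <= threshold | class 1)
    if xbar0 > xbar1:
        # W <= 0 iff X <= threshold, so points below the threshold get label 1.
        eps0, eps1 = below0, 1.0 - below1
    elif xbar0 < xbar1:
        # W <= 0 iff X >= threshold, so points above the threshold get label 1.
        eps0, eps1 = 1.0 - below0, below1
    else:
        # W is identically zero, so every point gets label 1.
        eps0, eps1 = 1.0, 0.0
    return alpha0 * eps0 + (1.0 - alpha0) * eps1
```

Because c = 0, the classifier reduces to a threshold at the midpoint of the two sample means, and the orientation of the rule depends only on the sign of X̄⁰ − X̄¹.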

The first and second moments of the classification error are given by

E [ε] = \sum_{i = 0}^{1} α_{i} P ({(- 1)}^{i} W ({\bar{X}}^{0}, {\bar{X}}^{1}, X) \leq 0 ∣ X \in Π_{i}),

(6)

and

E [ε^{2}] = E [{(α_{0} ε^{0} + α_{1} ε^{1})}^{2}] = 2 α_{0} α_{1} E [ε^{0} ε^{1}] + \sum_{i = 0}^{1} α_{i}^{2} E [ε^{i} ε^{i}] .

(7)
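The moments in (6) and (7) can be approximated by simulation, which is useful as a sanity check on closed-form results. The sketch below assumes i.i.d. Gaussian training data with a common variance and equal mixing probabilities; `conditional_error` and `simulate_moments` are illustrative names, and the settings (three points per class, means −1 and 1, unit variance) are chosen only for the example.

```python
import random
from statistics import NormalDist

def conditional_error(xbar0, xbar1, mu0, mu1, sigma, alpha0=0.5):
    """True error (4) given the two sample means, common variance sigma^2."""
    thr = 0.5 * (xbar0 + xbar1)
    p0 = NormalDist(mu0, sigma).cdf(thr)
    p1 = NormalDist(mu1, sigma).cdf(thr)
    if xbar0 > xbar1:
        eps0, eps1 = p0, 1.0 - p1   # label 1 is assigned below the threshold
    else:
        eps0, eps1 = 1.0 - p0, p1   # label 1 is assigned above the threshold
    return alpha0 * eps0 + (1.0 - alpha0) * eps1

def simulate_moments(n0, n1, mu0, mu1, sigma, trials=20000, seed=0):
    """Monte Carlo estimates of E[eps] and E[eps^2] under i.i.d. sampling."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(trials):
        xbar0 = sum(rng.gauss(mu0, sigma) for _ in range(n0)) / n0
        xbar1 = sum(rng.gauss(mu1, sigma) for _ in range(n1)) / n1
        eps = conditional_error(xbar0, xbar1, mu0, mu1, sigma)
        total += eps
        total_sq += eps * eps
    return total / trials, total_sq / trials

first, second = simulate_moments(3, 3, -1.0, 1.0, 1.0)
```

The estimate of the second moment always lies between the squared first moment and the first moment itself, since the true error takes values in [0, 1].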

3. Performance of LDA classifier in Univariate Gaussian Dependent Sampling (UGDS) Model of Binary Classification

We now provide the mathematical framework to study LDA performance in a stochastic setting.

Definition 1

A process X_t = {X_t : t ∈ T} with T being an ordered set, is called a Gaussian process if any finite-dimensional vector [X_t₁, X_t₂, …, X_{t_n}]^T has the multivariate normal distribution N(μ_T, Σ_T), where

μ_{T} = {[E (X_{t_{1}}), E (X_{t_{2}}), \dots, E (X_{t_{n}})]}^{T} = {[μ_{1}, μ_{2}, \dots, μ_{n}]}^{T}

and Σ_T is the covariance matrix dependent on T = [t₁, t₂, …, t_n].

Definition 2

We refer to the following sampling procedure as the Univariate Gaussian Dependent Sampling (UGDS) Model of Binary Classification: $X_{t}^{i} = \{X_{t^{i}}^{i} : t^{i} \in T^{i}\}$ , with Tⁱ being two ordered sets for i = 0, 1, are two Gaussian processes such that any finite-dimensional vector constructed by stacking the random variables of $X_{t^{0}}^{0}$ and $X_{t^{1}}^{1}$ as ${[X_{t_{1}^{0}}^{0}, X_{t_{2}^{0}}^{0}, \dots, X_{t_{n_{0}}^{0}}^{0}, X_{t_{1}^{1}}^{1}, X_{t_{2}^{1}}^{1}, \dots, X_{t_{n_{1}}^{1}}^{1}]}^{T}$ possesses a multivariate normal distribution N(μ_T, Σ_T), where $μ_{T} = {[μ_{1}^{0}, μ_{2}^{0}, \dots, μ_{n_{0}}^{0}, μ_{1}^{1}, μ_{2}^{1}, \dots, μ_{n_{1}}^{1}]}^{T}$ , and

Σ_{T} = [\begin{matrix} Σ_{n_{0} \times n_{0}}^{00} & Σ_{n_{0} \times n_{1}}^{01} \\ Σ_{n_{1} \times n_{0}}^{10} & Σ_{n_{1} \times n_{1}}^{11} \end{matrix}]

(8)

is a positive definite covariance matrix.

This model is univariate because both processes, $X_{t^{0}}^{0}$ and $X_{t^{1}}^{1}$ , are collections of univariate random variables, not necessarily with the same means or variances. $X_{t^{0}}^{0}$ and $X_{t^{1}}^{1}$ are called class conditional processes. For ease of notation and without loss of mathematical generality, we assume that T⁰ and T¹ are the same set and, therefore, we omit the superscript i from tⁱ. Thus, henceforth we denote $X_{t^{i}}^{i}$ by $X_{t}^{i}$ and the stacked vector ${[X_{t_{1}^{0}}^{0}, X_{t_{2}^{0}}^{0}, \dots, X_{t_{n_{0}}^{0}}^{0}, X_{t_{1}^{1}}^{1}, X_{t_{2}^{1}}^{1}, \dots, X_{t_{n_{1}}^{1}}^{1}]}^{T}$ by ${[X_{t_{1}}^{0}, X_{t_{2}}^{0}, \dots, X_{t_{n_{0}}}^{0}, X_{t_{1}}^{1}, X_{t_{2}}^{1}, \dots, X_{t_{n_{1}}}^{1}]}^{T}$ .

Remark 1

If we assume μ_T = [μ^{0^T}, μ^{1^T}]^T, with $μ^{i} = {[μ^{i}, μ^{i}, \dots, μ^{i}]}_{1 \times n_{i}}^{T}$ and $Σ_{j j}^{i i} = {(σ^{i})}^{2}$ , i = 0, 1, j = 1, 2, …, n_i, where (σⁱ)² is the variance of the class conditional distribution of class i and $Σ_{j j}^{i i}$ denotes the diagonal elements of matrix Σⁱⁱ, $Σ_{j k}^{i i} = 0$ , i = 0, 1, j, k = 1, …, n_i, j ≠ k, Σ⁰¹ = 0_{n₀×n₁} = Σ^{10^T}, and any future sample is independent from the training data and distributed either as N(μ⁰, (σ⁰)²) or N(μ¹, (σ¹)²), depending on its class, then the UGDS model reduces to the traditional i.i.d. sampling scenario defined in Section 2. Because we will want to compare classifier errors in the dependent and independent scenarios, we will sometimes use ε^D and ε^I to denote errors in the respective settings.
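To make the block structure of (8) and the special case of Remark 1 concrete, the following sketch assembles Σ_T from its blocks and checks the conditions of Remark 1 that concern the training data. It is an illustration only: matrices are plain nested lists, and the helper names `stack_covariance` and `is_iid_training_structure` are hypothetical.

```python
def stack_covariance(s00, s01, s11):
    """Assemble Sigma_T of (8) from its blocks; Sigma^10 is the transpose of Sigma^01."""
    n0, n1 = len(s00), len(s11)
    top = [list(s00[r]) + list(s01[r]) for r in range(n0)]
    bottom = [[s01[r][c] for r in range(n0)] + list(s11[c]) for c in range(n1)]
    return top + bottom

def is_iid_training_structure(mu0, mu1, s00, s01, s11):
    """Check the Remark 1 conditions that concern the training data."""
    def constant(v):
        return all(x == v[0] for x in v)

    def scaled_identity(m):
        # Equal diagonal entries and zero off-diagonal entries.
        return all(m[j][k] == (m[0][0] if j == k else 0.0)
                   for j in range(len(m)) for k in range(len(m)))

    no_cross = all(x == 0.0 for row in s01 for x in row)
    return (constant(mu0) and constant(mu1) and scaled_identity(s00)
            and scaled_identity(s11) and no_cross)
```

The check does not cover the remaining requirement of Remark 1, namely that future samples are independent of the training data; that condition concerns the auto-covariance sequences introduced in Section 3.1.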

Similar to (3), employing LDA with the UGDS model instead of traditional independent sampling in order to classify a sample point taken at t, denoted by X_t, results in the following W statistic for the univariate case

W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t}) = (X_{t} - {\bar{X}}_{T}) ({\bar{X}}_{T}^{0} - {\bar{X}}_{T}^{1}),

(9)

where ${\bar{X}}_{T}^{0} = \frac{1}{n_{0}} \sum_{i = 1}^{n_{0}} X_{t_{i}}^{0}$ and ${\bar{X}}_{T}^{1} = \frac{1}{n_{1}} \sum_{i = 1}^{n_{1}} X_{t_{i}}^{1}$ are the sample means for each class and ${\bar{X}}_{T} = \frac{{\bar{X}}_{T}^{0} + {\bar{X}}_{T}^{1}}{2}$ . The designed LDA classifier is given by

ψ (X_{t}) = \begin{cases} 1, & \text{if } W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t}) \leq 0 \\ 0, & \text{if } W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t}) > 0 \end{cases} .

(10)

For ease of notation, hereafter we omit the subscript T from μ_T and Σ_T.

3.1. Stochastic true error and its moments

Let $X_{t_{s}}^{i}$ denote a test sample point, where i indicates the class conditional process from which the sample comes, i.e. either $X_{t}^{0}$ or $X_{t}^{1}$ . The auto-covariance sequence of $X_{t_{s}}^{i}$ with the training data is defined as

ρ_{s}^{i k} (j) = E [(X_{t_{s}}^{i} - μ_{s}^{i}) (X_{t_{j}}^{k} - μ_{j}^{k})], i, k = 0, 1, j = 1, 2, \dots, n_{k},

(11)

where $ρ_{s}^{i k} (j)$ is the j^th element of the sequence $ρ_{s}^{i k}$ . Since $X_{t_{s}}^{i}$ is a future sample point, we assume 2 ≤ max{n₀, n₁} < s, unless otherwise stated. Throughout the paper, we use S_A to denote the sum of all elements of a matrix or vector A. For instance, $S_{ρ_{s}^{i k}} = \sum_{j = 1}^{n_{k}} ρ_{s}^{i k} (j)$ .

The true classifier error under the UGDS model is a function of t_s. Sample points at t_s can come from either process and the classifier may misclassify any of these. Hence,

ε_{t_{s}} = α_{t_{s}}^{0} ε_{t_{s}}^{0} + α_{t_{s}}^{1} ε_{t_{s}}^{1},

(12)

where $α_{t_{s}}^{i} = P (X_{t_{s}} \in X_{t}^{i})$ , i = 0, 1, is the a priori mixing probability of the two processes $X_{t}^{0}$ and $X_{t}^{1}$ at t_s and $ε_{t_{s}}^{i}$ is the error rate specific to each process, with

ε_{t_{s}}^{i} = P ({(- 1)}^{i} W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) \leq 0 ∣ {\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}} \in X_{t}^{i}) .

(13)

By replacing $W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}})$ with any proper statistic used in other classifiers, this stochastic definition of true error applies to other rules. The expected true error is also specific to t_s:

E [ε_{t_{s}}] = \sum_{i = 0}^{1} α_{t_{s}}^{i} P ({(- 1)}^{i} W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) \leq 0 ∣ X_{t_{s}} \in X_{t}^{i}) .

(14)

In (12), the true error is indexed. One could, if desired, define the true error of a classifier to be the average error the classifier induces over an index set of interest, namely, $ε_{t_{s_{1} - s_{2}}} = \frac{1}{s_{2} - s_{1} + 1} \sum_{s = s_{1}}^{s_{2}} ε_{t_{s}}$ . Since characterizing $ε_{t_{s}}$ yields a characterization of $ε_{t_{s_{1} - s_{2}}}$ , no generality is gained by averaging and we restrict our attention to $ε_{t_{s}}$ . The second moment is also a function of t_s and from (12) we get

E [ε_{t_{s}}^{2}] = 2 α_{t_{s}}^{0} α_{t_{s}}^{1} E [ε_{t_{s}}^{0} ε_{t_{s}}^{1}] + \sum_{i = 0}^{1} {(α_{t_{s}}^{i})}^{2} E [{(ε_{t_{s}}^{i})}^{2}] .

(15)

First focusing on $E [{(ε_{t_{s}}^{0})}^{2}]$ , the square of the probability defining ${(ε_{t_{s}}^{0})}^{2}$ can be factored by introducing the random variable $X_{t_{s}}^{'} \in X_{t}^{0}$ . Writing the probabilities as integrals of indicator functions allows us to apply Fubini’s theorem, which shows X_{t_s} and $X_{t_{s}}^{'}$ to be independent (denoted $X_{t_{s}} ⊥ X_{t_{s}}^{'}$ ). The expectation can then be applied. Altogether,

\begin{array}{l} E [{(ε_{t_{s}}^{0})}^{2}] = E [P {(W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) \leq 0 ∣ {\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}} \in X_{t}^{0})}^{2}] \\ = E [P (W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) \leq 0 ∣ {\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}} \in X_{t}^{0}) \times P (W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}^{'}) \leq 0 ∣ {\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}^{'} \in X_{t}^{0})] \\ = E [P (W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) \leq 0, W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}^{'}) \leq 0 ∣ {\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}, X_{t_{s}}^{'} \in X_{t}^{0}, X_{t_{s}} ⊥ X_{t_{s}}^{'})] \\ = P (W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) \leq 0, W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}^{'}) \leq 0 ∣ X_{t_{s}}, X_{t_{s}}^{'} \in X_{t}^{0}, X_{t_{s}} ⊥ X_{t_{s}}^{'}) . \end{array}

(16)

$E [{(ε_{t_{s}}^{1})}^{2}]$ and $E [ε_{t_{s}}^{0} ε_{t_{s}}^{1}]$ can be expressed via similar factorizations. Hence,

\begin{array}{l} E [ε_{t_{s}}^{2}] = {(α_{t_{s}}^{0})}^{2} P (W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) \leq 0, W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}^{'}) \leq 0 ∣ X_{t_{s}}, X_{t_{s}}^{'} \in X_{t}^{0}, X_{t_{s}} ⊥ X_{t_{s}}^{'}) \\ + 2 α_{t_{s}}^{0} α_{t_{s}}^{1} P (W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) \leq 0, W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}^{'}) > 0 ∣ X_{t_{s}} \in X_{t}^{0}, X_{t_{s}}^{'} \in X_{t}^{1}, X_{t_{s}} ⊥ X_{t_{s}}^{'}) \\ + {(α_{t_{s}}^{1})}^{2} P (W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) > 0, W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}^{'}) > 0 ∣ X_{t_{s}}, X_{t_{s}}^{'} \in X_{t}^{1}, X_{t_{s}} ⊥ X_{t_{s}}^{'}) . \end{array}

(17)
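The factorization in (16) can be checked by simulation: the average of the squared conditional error should agree with the frequency with which two independent test points from the same class are both misclassified. The sketch below does this for class 0 under simplifying assumptions that are ours and not part of the general model, namely i.i.d. Gaussian training data and test points independent of the training data; the function name `second_moment_two_ways` is illustrative.

```python
import random
from statistics import NormalDist

def second_moment_two_ways(n0=3, n1=3, mu0=-1.0, mu1=1.0, sigma=1.0,
                           trials=20000, seed=1):
    """Estimate E[(eps^0)^2] via the first and the last line of (16)."""
    rng = random.Random(seed)
    class0 = NormalDist(mu0, sigma)
    sum_sq = 0.0      # accumulates (eps^0)^2, first line of (16)
    both_wrong = 0    # counts joint misclassification, last line of (16)
    for _ in range(trials):
        xbar0 = sum(rng.gauss(mu0, sigma) for _ in range(n0)) / n0
        xbar1 = sum(rng.gauss(mu1, sigma) for _ in range(n1)) / n1
        thr, diff = 0.5 * (xbar0 + xbar1), xbar0 - xbar1
        # eps^0 = P(W <= 0 | training data, X from class 0)
        eps0 = class0.cdf(thr) if diff > 0 else 1.0 - class0.cdf(thr)
        sum_sq += eps0 * eps0
        # Two independent test points from class 0, classified by the same rule.
        x, x_prime = rng.gauss(mu0, sigma), rng.gauss(mu0, sigma)
        if (x - thr) * diff <= 0 and (x_prime - thr) * diff <= 0:
            both_wrong += 1
    return sum_sq / trials, both_wrong / trials
```

The two returned estimates differ only by Monte Carlo noise, which illustrates why the second moment can be written as a single joint probability involving the extra test point.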

To facilitate the subsequent discussion, we will explicitly denote the dependency of true error on the number of samples. Therefore, hereafter we use ε_{t_s,n₀+n₁} and ε_{t_s}, or $ε_{t_{s}, n_{0} + n_{1}}^{2}$ and $ε_{t_{s}}^{2}$ , interchangeably.

Throughout the paper, we use the notations Z < 0 or Z ≥ 0 to indicate componentwise inequalities, e.g. Z = (z₁, z₂)^T < 0 means z₁ < 0, z₂ < 0.

3.2. Expected performance of LDA in the UGDS model

The first moment of the classification error for LDA under the UGDS model is expressed exactly according to the following theorem.

Theorem 1

Under the UGDS model, the expected true error of LDA at t_s is

E [ε_{t_{s}, n_{0} + n_{1}}^{D}] = α_{t_{s}}^{0} [P (Z_{s}^{I} < 0) + P (Z_{s}^{I} \geq 0)] + α_{t_{s}}^{1} [P (Z_{s}^{I I} < 0) + P (Z_{s}^{I I} \geq 0)],

(18)

where $Z_{s}^{I}$ and $Z_{s}^{I I}$ are bivariate Gaussian random vectors with

\begin{array}{l} μ_{Z_{s}^{I}} = {[\begin{matrix} μ_{s}^{0} - \frac{\bar{μ}}{2} & - μ^{'} \end{matrix}]}^{T}, μ_{Z_{s}^{I I}} = {[\begin{matrix} μ_{s}^{1} - \frac{\bar{μ}}{2} & μ^{'} \end{matrix}]}^{T}, \\ Σ_{Z_{s}^{I}} = [\begin{matrix} {(σ_{s}^{0})}^{2} - \frac{S_{ρ_{s}^{00}}}{n_{0}} - \frac{S_{ρ_{s}^{01}}}{n_{1}} + \frac{S_{Σ^{00}}}{4 n_{0}^{2}} + \frac{S_{Σ^{11}}}{4 n_{1}^{2}} + \frac{S_{Σ^{01}}}{2 n_{0} n_{1}} & \frac{- S_{ρ_{s}^{00}}}{n_{0}} + \frac{S_{ρ_{s}^{01}}}{n_{1}} + \frac{S_{Σ^{00}}}{2 n_{0}^{2}} - \frac{S_{Σ^{11}}}{2 n_{1}^{2}} \\ \cdot & \frac{S_{Σ^{00}}}{n_{0}^{2}} + \frac{S_{Σ^{11}}}{n_{1}^{2}} - \frac{2 S_{Σ^{01}}}{n_{0} n_{1}} \end{matrix}], \\ Σ_{Z_{s}^{I I}} = [\begin{matrix} {(σ_{s}^{1})}^{2} - \frac{S_{ρ_{s}^{11}}}{n_{1}} - \frac{S_{ρ_{s}^{10}}}{n_{0}} + \frac{S_{Σ^{00}}}{4 n_{0}^{2}} + \frac{S_{Σ^{11}}}{4 n_{1}^{2}} + \frac{S_{Σ^{01}}}{2 n_{0} n_{1}} & \frac{- S_{ρ_{s}^{11}}}{n_{1}} + \frac{S_{ρ_{s}^{10}}}{n_{0}} - \frac{S_{Σ^{00}}}{2 n_{0}^{2}} + \frac{S_{Σ^{11}}}{2 n_{1}^{2}} \\ \cdot & \frac{S_{Σ^{00}}}{n_{0}^{2}} + \frac{S_{Σ^{11}}}{n_{1}^{2}} - \frac{2 S_{Σ^{01}}}{n_{0} n_{1}} \end{matrix}], \end{array}

(19)

where $\bar{μ} = \frac{\sum_{i = 1}^{n_{0}} μ_{i}^{0}}{n_{0}} + \frac{\sum_{i = 1}^{n_{1}} μ_{i}^{1}}{n_{1}}, μ^{'} = \frac{\sum_{i = 1}^{n_{0}} μ_{i}^{0}}{n_{0}} - \frac{\sum_{i = 1}^{n_{1}} μ_{i}^{1}}{n_{1}}$ , and $μ_{s}^{i}$ and ${(σ_{s}^{i})}^{2}$ are the mean and variance of random variables at t_s from class i, i = 0, 1, with the auto-covariance $ρ_{s}^{i k}$ defined as in (11).

Proof

See the Appendix.

We note that under conditions stated in Remark 1, Theorem 1 reduces to Theorem 1 in [3].
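The parameters in (19) are simple functions of the sums S_A, so they are easy to compute once μ, Σ, and the auto-covariance sequences are available. The following sketch evaluates them with plain nested lists; the names `total` and `theorem1_parameters` and the argument layout are assumptions made for this illustration.

```python
def total(a):
    """S_A: sum of all elements of a vector or matrix given as (nested) lists."""
    return sum(total(x) if isinstance(x, (list, tuple)) else x for x in a)

def theorem1_parameters(mu0, mu1, s00, s01, s11, mu_s, var_s, rho):
    """Means and covariance matrices of Z_s^I and Z_s^II in (19).

    mu_s[i], var_s[i]: mean and variance of the test point from class i.
    rho[i][k]: auto-covariance sequence rho_s^{ik} of (11), a list of length n_k.
    """
    n0, n1 = len(mu0), len(mu1)
    m0, m1 = sum(mu0) / n0, sum(mu1) / n1
    mu_bar, mu_prime = m0 + m1, m0 - m1
    v0, v1, c01 = total(s00) / n0**2, total(s11) / n1**2, total(s01) / (n0 * n1)
    var_mid = v0 / 4 + v1 / 4 + c01 / 2   # variance of (Xbar^0 + Xbar^1) / 2
    var_diff = v0 + v1 - 2 * c01          # variance of Xbar^0 - Xbar^1
    r00, r01 = total(rho[0][0]) / n0, total(rho[0][1]) / n1
    r10, r11 = total(rho[1][0]) / n0, total(rho[1][1]) / n1
    mean_I = [mu_s[0] - mu_bar / 2, -mu_prime]
    mean_II = [mu_s[1] - mu_bar / 2, mu_prime]
    off_I = -r00 + r01 + v0 / 2 - v1 / 2
    off_II = -r11 + r10 - v0 / 2 + v1 / 2
    cov_I = [[var_s[0] - r00 - r01 + var_mid, off_I], [off_I, var_diff]]
    cov_II = [[var_s[1] - r11 - r10 + var_mid, off_II], [off_II, var_diff]]
    return (mean_I, cov_I), (mean_II, cov_II)
```

Evaluating the probabilities in (18) additionally requires a bivariate normal distribution function, which is not part of this sketch.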

Let Φ(x, y; ρ) be the cumulative bivariate normal distribution defined as:

\begin{array}{l} Φ (x, y; ρ) = \int_{- \infty}^{x} \int_{- \infty}^{y} ψ (u, v; ρ) d u d v, \\ ψ (u, v; ρ) = \frac{1}{2 π \sqrt{1 - ρ^{2}}} exp {\frac{- (u^{2} + v^{2} - 2 ρ u v)}{2 (1 - ρ^{2})}} . \end{array}

(20)

We have the following Corollary.

Corollary 2

In the model considered in Theorem 1, let the training samples from each class have the same mean, that is, μ = [μ^{0^T}, μ^{1^T}]^T in which $μ^{i} = {[μ^{i}, μ^{i}, \dots, μ^{i}]}_{1 \times n_{i}}^{T}$ , and let $μ_{s}^{i} = μ^{i}$ and $σ_{s}^{i} = σ$ , i = 0, 1, meaning the test data at t_s have equal variances across classes, and $α_{t_{s}}^{0} = α_{t_{s}}^{1} = 0.5$ . Furthermore, let $S_{ρ_{s}^{i k}} = 0$ , i, k = 0, 1. Then

E [ε_{t_{s}, n_{0} + n_{1}}^{D}] = \frac{1}{2} - \frac{L (h, k; ρ)}{2},

(21)

where

L (x, y; ρ) = \int_{- x}^{x} \int_{- y}^{y} ψ (u, v; ρ) d u d v,

(22)

\begin{array}{l} h = \frac{μ}{2 \sqrt{a}}, k = \frac{μ}{\sqrt{b}}, μ = μ^{0} - μ^{1}, ρ = \frac{\frac{S_{Σ^{00}}}{2 n_{0}^{2}} - \frac{S_{Σ^{11}}}{2 n_{1}^{2}}}{\sqrt{a} \sqrt{b}}, \\ a = σ^{2} + \frac{S_{Σ^{00}}}{4 n_{0}^{2}} + \frac{S_{Σ^{11}}}{4 n_{1}^{2}} + \frac{S_{Σ^{01}}}{2 n_{0} n_{1}}, b = \frac{S_{Σ^{00}}}{n_{0}^{2}} + \frac{S_{Σ^{11}}}{n_{1}^{2}} - \frac{2 S_{Σ^{01}}}{n_{0} n_{1}} . \end{array}

(23)

Proof

See the Appendix.
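Equations (21)-(23) can be evaluated numerically with a one-dimensional quadrature, because the inner integral of (22) has a closed form once we condition on the first coordinate. The sketch below uses Simpson's rule for the outer integral; it assumes |ρ| < 1 and b > 0, and the names `rect_prob` and `expected_error` are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

_STD = NormalDist()

def rect_prob(x, y, rho, steps=2000):
    """L(x, y; rho) of (22) for x, y >= 0 and |rho| < 1, by Simpson's rule."""
    if x == 0 or y == 0:
        return 0.0
    s = sqrt(1.0 - rho * rho)

    def inner(u):
        # Standard normal density at u times P(-y <= V <= y | U = u).
        dens = exp(-0.5 * u * u) / sqrt(2.0 * pi)
        return dens * (_STD.cdf((y - rho * u) / s) - _STD.cdf((-y - rho * u) / s))

    width = 2.0 * x / steps          # steps must be even for Simpson's rule
    acc = inner(-x) + inner(x)
    for i in range(1, steps):
        acc += (4 if i % 2 else 2) * inner(-x + i * width)
    return acc * width / 3.0

def expected_error(mu0, mu1, sigma2, s00, s11, s01, n0, n1):
    """E[eps] of (21); s00, s11, s01 are the sums S of the covariance blocks."""
    a = sigma2 + s00 / (4 * n0**2) + s11 / (4 * n1**2) + s01 / (2 * n0 * n1)
    b = s00 / n0**2 + s11 / n1**2 - 2 * s01 / (n0 * n1)
    rho = (s00 / (2 * n0**2) - s11 / (2 * n1**2)) / sqrt(a * b)
    h, k = (mu0 - mu1) / (2 * sqrt(a)), (mu0 - mu1) / sqrt(b)
    # h and k share the sign of mu0 - mu1 and L(-x, -y; rho) = L(x, y; rho),
    # so absolute values can be used.
    return 0.5 - 0.5 * rect_prob(abs(h), abs(k), rho)
```

For example, `expected_error(-1.0, 1.0, 1.0, 3.0, 3.0, 0.0, 3, 3)` corresponds to independent sampling with three unit-variance points per class and class means −1 and 1.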

To further proceed we present the following lemma, in which ∧ denotes conjunction.

Lemma 3

Let Φ(x, y; ρ) be the cumulative bivariate normal distribution defined in (20) and define F(x, y; ρ) and G(x, y; ρ) as follows:

\begin{array}{l} F (x, y; ρ) = Φ (x, y; ρ) + Φ (- x, - y; ρ), \\ G (x, y; ρ) = F (x, y; ρ) + F (x, y; - ρ), \end{array}

(24)

where x and y are two constants such that xy < 0. Then, for 0 ≤ λ_x < 1, 0 ≤ λ_y < 1,

(∣ ρ_{1} ∣ \leq ∣ ρ_{0} ∣) \land (x_{1} = λ_{x} x_{0}) \land (y_{1} = λ_{y} y_{0}) \Rightarrow G (x_{1}, y_{1}; ρ_{1}) > G (x_{0}, y_{0}; ρ_{0}) .

(25)

Proof

See the Appendix.

Using this Lemma, we compare the expected true error of the UGDS model with the independent sampling model.

Corollary 4

In the model considered in Corollary 2, let ${(σ_{j}^{i})}^{2} ≜ Σ_{j j}^{i i}$ , i = 0, 1, j = 1, 2, …, n_i, and let

\begin{array}{l} ρ_{D} = \frac{\frac{S_{Σ^{00}}}{n_{0}^{2}} - \frac{S_{Σ^{11}}}{n_{1}^{2}}}{\sqrt{(σ^{2} + \frac{S_{Σ^{00}}}{4 n_{0}^{2}} + \frac{S_{Σ^{11}}}{4 n_{1}^{2}} + \frac{S_{Σ^{01}}}{2 n_{0} n_{1}}) (\frac{S_{Σ^{00}}}{n_{0}^{2}} + \frac{S_{Σ^{11}}}{n_{1}^{2}} - \frac{2 S_{Σ^{01}}}{n_{0} n_{1}})}}, \\ ρ_{I} = \frac{\frac{\sum_{j = 1}^{n_{0}} {(σ_{j}^{0})}^{2}}{n_{0}^{2}} - \frac{\sum_{j = 1}^{n_{1}} {(σ_{j}^{1})}^{2}}{n_{1}^{2}}}{\sqrt{(σ^{2} + \frac{\sum_{j = 1}^{n_{0}} {(σ_{j}^{0})}^{2}}{4 n_{0}^{2}} + \frac{\sum_{j = 1}^{n_{1}} {(σ_{j}^{1})}^{2}}{4 n_{1}^{2}}) (\frac{\sum_{j = 1}^{n_{0}} {(σ_{j}^{0})}^{2}}{n_{0}^{2}} + \frac{\sum_{j = 1}^{n_{1}} {(σ_{j}^{1})}^{2}}{n_{1}^{2}})}} . \end{array}

(26)

Let $E [ε_{t_{s}, n_{0} + n_{1}}^{I}]$ be the expectation of the true error of the classifier in (10) specific to t_s and constructed as if all n₀ + n₁ training samples are i.i.d. (same mean and variance). Then

(∣ ρ_{D} ∣ \leq ∣ ρ_{I} ∣) \land (\frac{S_{Σ^{00}}^{'}}{n_{0}^{2}} + \frac{S_{Σ^{11}}^{'}}{n_{1}^{2}} \geq \max \{\frac{2 S_{Σ^{01}}}{n_{0} n_{1}}, \frac{- 2 S_{Σ^{01}}}{n_{0} n_{1}}\}) \Rightarrow E [ε_{t_{s}, n_{0} + n_{1}}^{D}] \geq E [ε_{t_{s}, n_{0} + n_{1}}^{I}],

(27)

(∣ ρ_{D} ∣ \geq ∣ ρ_{I} ∣) \land (\frac{S_{Σ^{00}}^{'}}{n_{0}^{2}} + \frac{S_{Σ^{11}}^{'}}{n_{1}^{2}} \leq \min \{\frac{2 S_{Σ^{01}}}{n_{0} n_{1}}, \frac{- 2 S_{Σ^{01}}}{n_{0} n_{1}}\}) \Rightarrow E [ε_{t_{s}, n_{0} + n_{1}}^{D}] \leq E [ε_{t_{s}, n_{0} + n_{1}}^{I}],

(28)

where $S_{A}^{'}$ is the sum of the off-diagonal elements of matrix A, defined as $S_{A}^{'} = \sum_{i, j, i \neq j} a_{i j}$ .

Proof

Find the expected true error in Theorem 1 using the conditions in the corollary and compare it to the expected true error determined by setting all off diagonal elements of Σⁱⁱ to zero, i = 0, 1 and Σ⁰¹ = 0_n₀×n₁. The proof follows by using the results of Lemma 3 in Theorem 1.

A more restricted set of sufficient conditions than those presented in Corollary 4 follows.

Corollary 5

In the model considered in Corollary 2, let ${(σ_{j}^{i})}^{2} ≜ Σ_{j j}^{i i}$ , i = 0, 1, j = 1, 2, …, n_i, and

\frac{S_{Σ^{00}}}{n_{0}^{2}} - \frac{S_{Σ^{11}}}{n_{1}^{2}} = \frac{\sum_{j = 1}^{n_{0}} {(σ_{j}^{0})}^{2}}{n_{0}^{2}} - \frac{\sum_{j = 1}^{n_{1}} {(σ_{j}^{1})}^{2}}{n_{1}^{2}} .

(29)

Then

\frac{S_{Σ^{00}}^{'}}{n_{0}^{2}} + \frac{S_{Σ^{11}}^{'}}{n_{1}^{2}} \geq \max \{\frac{2 S_{Σ^{01}}}{n_{0} n_{1}}, \frac{- 2 S_{Σ^{01}}}{n_{0} n_{1}}\} \Rightarrow E [ε_{t_{s}, n_{0} + n_{1}}^{D}] \geq E [ε_{t_{s}, n_{0} + n_{1}}^{I}],

(30)

\frac{S_{Σ^{00}}^{'}}{n_{0}^{2}} + \frac{S_{Σ^{11}}^{'}}{n_{1}^{2}} \leq \min \{\frac{2 S_{Σ^{01}}}{n_{0} n_{1}}, \frac{- 2 S_{Σ^{01}}}{n_{0} n_{1}}\} \Rightarrow E [ε_{t_{s}, n_{0} + n_{1}}^{D}] \leq E [ε_{t_{s}, n_{0} + n_{1}}^{I}],

(31)

where $S_{A}^{'}$ is the sum of the off-diagonal elements of matrix A, defined as $S_{A}^{'} = \sum_{i, j, i \neq j} a_{i j}$ .

Proof

The proof is similar to Corollary 4.

To have a sense of the conditions stated in Corollary 5, consider a scenario in which n₀ = n₁, the sample points in each class are equi-correlated with correlation ρ, and there is independent sampling across classes. This satisfies (29). If ρ > 0, then (30) holds and $E [ε_{t_{s}, n_{0} + n_{1}}^{D}] \geq E [ε_{t_{s}, n_{0} + n_{1}}^{I}]$ . If ρ < 0 and the class covariance matrices are positive definite, then (31) holds and $E [ε_{t_{s}, n_{0} + n_{1}}^{D}] < E [ε_{t_{s}, n_{0} + n_{1}}^{I}]$ .

Let us reflect on Corollaries 4 and 5. A correlated set of n sample points can be considered as a set in which the points convey some information about each other. Therefore, they are often considered to be as informative as n′ independent samples with n′ < n, thereby producing a poorer classifier. This intuition aligns with the simple situation in which the sample points in each class are equi-correlated with ρ > 0 and the sample points across the two classes are uncorrelated. This scenario is a special case of (30) and $E [ε_{t_{s}, n_{0} + n_{1}}^{D}] \geq E [ε_{t_{s}, n_{0} + n_{1}}^{I}]$ . However, (31) shows that there are correlation patterns that result in an expected true error smaller than it would be were there independent sampling, which means that sampling satisfying (31) is like having a larger sample size than if sampling were independent.
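The sufficient conditions of Corollary 5 are straightforward to test for a given covariance structure. The sketch below reads (31) in the direction used in the discussion above, that is, a left-hand side no larger than the minimum implies that dependent sampling gives an expected error no larger than independent sampling. It assumes the remaining premises of Corollary 5, including (29), already hold; the helper names are hypothetical.

```python
def off_diagonal_sum(m):
    """S'_A: sum of the off-diagonal elements of a square matrix."""
    return sum(m[j][k] for j in range(len(m)) for k in range(len(m)) if j != k)

def corollary5_direction(s00, s11, s01):
    """Return '>=' if condition (30) applies, '<=' if (31) applies, else None.

    '>=' means E[eps^D] >= E[eps^I]; '<=' means E[eps^D] <= E[eps^I].
    When both conditions hold with equality, '>=' is returned.
    """
    n0, n1 = len(s00), len(s11)
    lhs = off_diagonal_sum(s00) / n0**2 + off_diagonal_sum(s11) / n1**2
    cross = 2.0 * sum(map(sum, s01)) / (n0 * n1)
    if lhs >= max(cross, -cross):
        return ">="
    if lhs <= min(cross, -cross):
        return "<="
    return None

def equicorrelated(n, rho, var=1.0):
    """n-by-n covariance block with variance var and common covariance rho."""
    return [[var if j == k else rho for k in range(n)] for j in range(n)]
```

With three equi-correlated points per class and no correlation across classes, a positive ρ triggers the first branch and a negative ρ (within the positive definite range) triggers the second, in line with the scenario discussed above.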

To illustrate, in the UGDS model suppose the training sample points are identically distributed according to two Gaussian distributions, N(−1, 1) for class 0 and N(1, 1) for class 1. Let n₀ = n₁ = 3 and assume that any future test point is also distributed identically to the training data of its class. Furthermore, assume the data are generated via two different scenarios, a and b, such that Σ⁰¹ = 0_{3×3} and, for i = 0, 1,

Σ_{a}^{i i} = [\begin{matrix} 1 & - 1 / 4 & - 1 / 4 \\ - 1 / 4 & 1 & ρ \\ - 1 / 4 & ρ & 1 \end{matrix}], Σ_{b}^{i i} = [\begin{matrix} 1 & 1 / 4 & 1 / 4 \\ 1 / 4 & 1 & ρ \\ 1 / 4 & ρ & 1 \end{matrix}] .

(32)

Figure 1(a) shows the expected true error of the classifier designed in scenario a as a function of ρ. It demonstrates that for some dependency patterns, as defined by the covariance matrix, the classifier has better performance than if the sampling were independent. Note that in Fig. 1(a) the curves meet at ρ = 0.5, the point of equality for the inequalities (30) and (31). Note also that for ρ = −0.499, $E [ε_{6}^{D}] = E [ε_{18}^{I}] = 0.165$ . Hence, for the sampling covariance matrix (32), 3 points have the effect of 9 independent points. In general, better classification accuracy may be achieved if the sample points are collected according to specific schemes. Equations (28) and (31) provide sufficient sets of conditions that result in such schemes.
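The numbers quoted for scenario a can be reproduced from (21)-(23). In this example both classes have the same covariance block and Σ⁰¹ = 0, so ρ in (23) is zero and L factorizes into a product of two univariate normal probabilities. The sketch below relies on that simplification and assumes the test points have unit variance and zero auto-covariance with the training data; the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def expected_error_equal_blocks(block_sum, n, mu0=-1.0, mu1=1.0, sigma2=1.0):
    """E[eps] from (21)-(23) when both classes share the same block sum S and
    Sigma^01 = 0, so that rho in (23) is zero and L factorizes."""
    v = block_sum / n**2                 # variance of each class sample mean
    a, b = sigma2 + v / 2.0, 2.0 * v
    phi = NormalDist().cdf
    h, k = abs(mu0 - mu1) / (2.0 * sqrt(a)), abs(mu0 - mu1) / sqrt(b)
    rect = (2.0 * phi(h) - 1.0) * (2.0 * phi(k) - 1.0)   # L(h, k; 0)
    return 0.5 - 0.5 * rect

def scenario_a_sum(rho):
    """Sum of all entries of the scenario a covariance block in (32)."""
    return 3.0 + 2.0 * (-0.25 - 0.25 + rho)

dependent_6 = expected_error_equal_blocks(scenario_a_sum(-0.499), n=3)
independent_18 = expected_error_equal_blocks(9.0, n=9)
independent_6 = expected_error_equal_blocks(3.0, n=3)
print(round(dependent_6, 3), round(independent_18, 3), round(independent_6, 3))
```

Under these assumptions the dependent design with 6 points and the independent design with 18 points both evaluate to approximately 0.165, while the independent design with 6 points gives a larger expected error. Setting ρ = 0.5 makes the block sum equal to that of the identity matrix, which is where the two curves meet.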

Figure 1. (a) Expected true error of constructed classifiers in scenario a as a function of ρ. (b) Expected true error of constructed classifiers in scenario b as a function of ρ. The horizontal line shows the performance of the constructed classifier as if the samples were independent. Solid lines: dependent samples; dashed lines: independent samples.

Figure 1(b) shows the expected true error of a classifier constructed in scenario b by varying ρ in the same range as in scenario a. The only difference between scenarios a and b is changing the covariances between the first sample point and other sample points to positive values. It results in the curve for dependent sampling in Figure 1(b) being substantially above the curve for independent sampling.

3.3. Second moment of LDA true error in the UGDS model

Next we obtain the second moment of true error of LDA at t_s as defined in (17).

Theorem 6

Under the UGDS model, the second moment of the true error of LDA at t_s is

E [{(ε_{t_{s}, n_{0} + n_{1}}^{D})}^{2}] = {(α_{t_{s}}^{0})}^{2} [P (Z_{s}^{I} < 0) + P (Z_{s}^{I} \geq 0)] + 2 α_{t_{s}}^{0} α_{t_{s}}^{1} [P (Z_{s}^{I I} < 0) + P (Z_{s}^{I I} \geq 0)] + {(α_{t_{s}}^{1})}^{2} [P (Z_{s}^{III} < 0) + P (Z_{s}^{III} \geq 0)],

(33)

where $Z_{s}^{j}$ , j = I, II, III, are 3-variate Gaussian random vectors with mean vectors and covariance matrices as follows:

μ_{Z_{s}^{I}} = {[\begin{matrix} μ_{s}^{0} - \frac{\bar{μ}}{2} & - μ^{'} & μ_{s}^{0} - \frac{\bar{μ}}{2} \end{matrix}]}^{T}, μ_{Z_{s}^{I I}} = {[\begin{matrix} μ_{s}^{0} - \frac{\bar{μ}}{2} & - μ^{'} & - μ_{s}^{1} + \frac{\bar{μ}}{2} \end{matrix}]}^{T}

(34)

and, for i, j = 0, 1, i ≠ j, letting

z_{s}^{i} = {(σ_{s}^{i})}^{2} - \frac{S_{ρ_{s}^{i i}}}{n_{i}} - \frac{S_{ρ_{s}^{i j}}}{n_{j}} + \frac{S_{Σ^{00}}}{4 n_{0}^{2}} + \frac{S_{Σ^{11}}}{4 n_{1}^{2}} + \frac{S_{Σ^{01}}}{2 n_{0} n_{1}},

(35)

we have

\begin{array}{l} Σ_{Z_{s}^{I}} = [\begin{matrix} z_{s}^{0} & \frac{- S_{ρ_{s}^{00}}}{n_{0}} + \frac{S_{ρ_{s}^{01}}}{n_{1}} + \frac{S_{Σ^{00}}}{2 n_{0}^{2}} - \frac{S_{Σ^{11}}}{2 n_{1}^{2}} & z_{s}^{0} - {(σ_{s}^{0})}^{2} \\ \cdot & \frac{S_{Σ^{00}}}{n_{0}^{2}} + \frac{S_{Σ^{11}}}{n_{1}^{2}} - \frac{2 S_{Σ^{01}}}{n_{0} n_{1}} & \frac{- S_{ρ_{s}^{00}}}{n_{0}} + \frac{S_{ρ_{s}^{01}}}{n_{1}} + \frac{S_{Σ^{00}}}{2 n_{0}^{2}} - \frac{S_{Σ^{11}}}{2 n_{1}^{2}} \\ \cdot & \cdot & z_{s}^{0} \end{matrix}], \\ Σ_{Z_{s}^{I I}} = [\begin{matrix} z_{s}^{0} & \frac{- S_{ρ_{s}^{00}}}{n_{0}} + \frac{S_{ρ_{s}^{01}}}{n_{1}} + \frac{S_{Σ^{00}}}{2 n_{0}^{2}} - \frac{S_{Σ^{11}}}{2 n_{1}^{2}} & - z_{s}^{0} + {(σ_{s}^{0})}^{2} + \frac{S_{ρ_{s}^{11}} - S_{ρ_{s}^{01}}}{2 n_{1}} + \frac{S_{ρ_{s}^{10}} - S_{ρ_{s}^{00}}}{2 n_{0}} \\ \cdot & \frac{S_{Σ^{00}}}{n_{0}^{2}} + \frac{S_{Σ^{11}}}{n_{1}^{2}} - \frac{2 S_{Σ^{01}}}{n_{0} n_{1}} & \frac{- S_{ρ_{s}^{11}}}{n_{1}} + \frac{S_{ρ_{s}^{10}}}{n_{0}} - \frac{S_{Σ^{00}}}{2 n_{0}^{2}} + \frac{S_{Σ^{11}}}{2 n_{1}^{2}} \\ \cdot & \cdot & z_{s}^{1} \end{matrix}], \end{array}

(36)

with $\bar{μ} = \frac{\sum_{i = 1}^{n_{0}} μ_{i}^{0}}{n_{0}} + \frac{\sum_{i = 1}^{n_{1}} μ_{i}^{1}}{n_{1}}, μ^{'} = \frac{\sum_{i = 1}^{n_{0}} μ_{i}^{0}}{n_{0}} - \frac{\sum_{i = 1}^{n_{1}} μ_{i}^{1}}{n_{1}}$ , S_A is the sum of all elements of matrix A, defined as S_A = Σ_i,j a_ij, and $μ_{s}^{i}$ and ${(σ_{s}^{i})}^{2}$ are the mean and variance of random variables at t_s from class i, i = 0, 1. Furthermore, μ_Z^III is obtained from μ_Z^I by replacing −μ′ with μ′ and $μ_{s}^{0}$ with $μ_{s}^{1}$ , and $\sum_{Z_{s}^{III}}$ is obtained from $\sum_{Z_{s}^{I}}$ by exchanging n₀ and n₁, ${(σ_{s}^{0})}^{2}$ and ${(σ_{s}^{1})}^{2}$ , S_Σ⁰⁰ and S_Σ¹¹, $S_{ρ_{s}^{00}}$ and $S_{ρ_{s}^{11}}$ , and $S_{ρ_{s}^{01}}$ and $S_{ρ_{s}^{10}}$ .

Proof

See the Appendix.

Let $E [ε_{t_{s}, n_{0} + n_{1}}^{I}]$ and $E [{(ε_{t_{s}, n_{0} + n_{1}}^{I})}^{2}]$ be the first and second moments of true error of the classifier in (10) specific to t_s and constructed by n₀ + n₁ independent training sample points distributed according to the same mean and variance. Then we have the following corollary.

Corollary 7

In the model considered in Corollary 5, further assume n₀ = n₁ = n, Σ⁰¹ = 0_{n×n}, and, for k, j = 1, 2, …, n,

\sum_{k j}^{00} = \sum_{k j}^{11} = \begin{cases} σ^{2} & k = j \\ ρ > 0 & \text{otherwise} \end{cases},

(37)

where σ² is the common variance of test sample points across classes at t_s. Let m be the number of additional dependent training points in each class with the same class conditional means and dependency structure, meaning $\sum_{k j}^{i i}$ as in (37) for k, j = 1, 2, …, n + m and Σ⁰¹ = 0_{(n+m)×(n+m)}, that are required to make $E [ε_{t_{s}, 2 n + 2 m}^{D}] = E [ε_{t_{s}, 2 n}^{I}]$ . This number also makes $E [{(ε_{t_{s}, 2 n + 2 m}^{D})}^{2}] = E [{(ε_{t_{s}, 2 n}^{I})}^{2}]$ and is given by

m = \frac{n - 1}{\frac{σ^{2}}{n ρ} - 1}

(38)

Proof

The proof of $E [ε_{t_{s}, 2 n + 2 m}^{D}] = E [ε_{t_{s}, 2 n}^{I}]$ follows by equating elements of covariance matrices obtained for the dependent model in (19) with the covariance matrices for the independent sampling model. Under the conditions of the corollary, these matrices in the independent sampling scenario (given by Theorem 1 in [3]) are

\sum_{Z_{s}^{I}} = \sum_{Z_{s}^{II}} = \sum_{Z_{s}^{III}} = (\begin{matrix} σ^{2} + \frac{σ^{2}}{2 n} & 0 \\ 0 & \frac{2 σ^{2}}{n} \end{matrix})

(39)

Furthermore, we note that the conditions stated in this corollary satisfy the condition stated in (30), and hence $E [ε_{t_{s}, 2 n}^{D}] \geq E [ε_{t_{s}, 2 n}^{I}]$ . The proof of $E [{(ε_{t_{s}, 2 n + 2 m}^{D})}^{2}] = E [{(ε_{t_{s}, 2 n}^{I})}^{2}]$ follows similarly by equating covariance matrices presented in (36) with those presented in Theorem 2 in [3].

In (38), if $ρ > \frac{σ^{2}}{n}$ , then m < 0, meaning that no number of additional points under the dependency model of the corollary lowers $E [ε_{t_{s}, 2 n + 2 m}^{D}]$ and $E [{(ε_{t_{s}, 2 n + 2 m}^{D})}^{2}]$ to the first and second moments of the true error that the constructed LDA classifier would have if the original 2n training samples were independent.
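As a numerical illustration of (38), the following Python sketch computes m and checks it against the variance of a class sample mean. The function names are hypothetical, and the check assumes that matching the dependent and independent models amounts to matching S_Σ/k² (the variance of the mean of k equicorrelated points) with σ²/n, which is one reading of "equating elements of covariance matrices" in the proof above.

```python
def extra_points_needed(n, sigma2, rho):
    """Return m from (38).

    n      : original number of training points per class
    sigma2 : common variance, the diagonal entry in (37)
    rho    : common covariance between distinct training points (rho > 0)
    A negative value means that no number of added points suffices.
    """
    if rho <= 0:
        raise ValueError("Corollary 7 assumes rho > 0")
    denom = sigma2 / (n * rho) - 1.0
    if denom == 0:
        raise ValueError("m is undefined when rho equals sigma2 / n")
    return (n - 1) / denom


def variance_of_mean(k, sigma2, rho):
    """Variance of the mean of k equicorrelated points, S_Sigma / k**2."""
    s_sigma = k * sigma2 + k * (k - 1) * rho  # sum of all covariance entries
    return s_sigma / k**2


n, sigma2, rho = 10, 1.0, 0.05
m = extra_points_needed(n, sigma2, rho)
matched = variance_of_mean(n + round(m), sigma2, rho)
print(m, matched, sigma2 / n)
```

With these illustrative values, the n + m dependent points per class give the same variance of the sample mean as n independent points. Choosing ρ above σ²/n (for example ρ = 0.2 here) returns a negative m, which is the case discussed in the preceding paragraph.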

4. Applications

In this section we study applications to common models used in signal processing, the first-order autoregressive model, AR(1), and the first-order moving average model, MA(1), by assuming the training data are generated by the output processes of two models. $Z_{t}^{0}$ and $Z_{t}^{1}$ are two independent white noise processes and $X_{t}^{0}$ and $X_{t}^{1}$ are the processes producing the system output. The goal is to characterize the performance of the LDA classifier as a function of sample size, the parameters of the white noise processes, and the autoregressive coefficients.

4.1. First-order autoregressive model AR(1)

We consider two AR(1) models:

X_{t}^{i} = c_{i} + ψ_{i} X_{t - 1}^{i} + Z_{t}^{i}, i = 0, 1,

(40)

where ψ_i is a constant such that 0 < |ψ_i| < 1, i = 0, 1, and $Z_{t}^{0} \sim N (0, σ_{0}^{2}), Z_{t}^{1} \sim N (0, σ_{1}^{2})$ , for all t, are independent of each other. Then $X_{t}^{0} = \{X_{t}^{0} : 0 < t < \infty\}$ and $X_{t}^{1} = \{X_{t}^{1} : 0 < t < \infty\}$ are two independent covariance-stationary processes and we have the following theorem.

Theorem 8

Let $X_{t}^{0}, X_{t}^{1}$ in the UGDS model be defined by the two independent covariance-stationary AR(1) processes as defined in (40). Then, at t_s, where max{n₀, n₁} < s, the expected true error of LDA constructed using the training samples $[X_{t_{1}}^{0}, X_{t_{2}}^{0}, \dots, X_{t_{n_{0}}}^{0}]$ and $[X_{t_{1}}^{1}, X_{t_{2}}^{1}, \dots, X_{t_{n_{1}}}^{1}]$ is

E [ε_{t_{s}, n_{0} + n_{1}}^{A R (1)}] = α_{t_{s}}^{0} [P (Z_{s}^{I} < 0) + P (Z_{s}^{I} \geq 0)] + α_{t_{s}}^{1} [P (Z_{s}^{I I} < 0) + P (Z_{s}^{I I} \geq 0)],

(41)

where $Z_{t_{s}}^{I}$ and $Z_{t_{s}}^{I I}$ are Gaussian bivariate vectors with

\begin{array}{l} μ_{Z_{s}^{I}} = {[\begin{matrix} \frac{μ}{2} & - μ \end{matrix}]}^{T}, μ_{Z_{s}^{I I}} = {[\begin{matrix} \frac{- μ}{2} & μ \end{matrix}]}^{T}, \\ \sum_{Z_{s}^{I}} = [\begin{matrix} \frac{σ_{0}^{2}}{1 - ψ_{0}^{2}} - \frac{S_{ρ_{s}^{00}}}{n_{0}} + \frac{S_{\sum^{00}}}{4 n_{0}^{2}} + \frac{S_{\sum^{11}}}{4 n_{1}^{2}} & \frac{- S_{ρ_{s}^{00}}}{n_{0}} + \frac{S_{\sum^{00}}}{2 n_{0}^{2}} - \frac{S_{\sum^{11}}}{2 n_{1}^{2}} \\ \cdot & \frac{S_{\sum^{00}}}{n_{0}^{2}} + \frac{S_{\sum^{11}}}{n_{1}^{2}} \end{matrix}], \\ \sum_{Z_{s}^{I I}} = [\begin{matrix} \frac{σ_{1}^{2}}{1 - ψ_{1}^{2}} - \frac{S_{ρ_{s}^{11}}}{n_{1}} + \frac{S_{\sum^{00}}}{4 n_{0}^{2}} + \frac{S_{\sum^{11}}}{4 n_{1}^{2}} & \frac{- S_{ρ_{s}^{11}}}{n_{1}} - \frac{S_{\sum^{00}}}{2 n_{0}^{2}} + \frac{S_{\sum^{11}}}{2 n_{1}^{2}} \\ \cdot & \frac{S_{\sum^{00}}}{n_{0}^{2}} + \frac{S_{\sum^{11}}}{n_{1}^{2}} \end{matrix}], \end{array}

(42)

where

\begin{array}{l} μ = \frac{c_{0}}{1 - ψ_{0}} - \frac{c_{1}}{1 - ψ_{1}}, S_{ρ_{s}^{i i}} = \frac{ψ_{i}^{(s - n_{i})} σ_{i}^{2}}{1 - ψ_{i}^{2}} (\frac{1 - ψ_{i}^{n_{i}}}{1 - ψ_{i}}), \\ S_{\sum^{i i}} = \frac{σ_{i}^{2}}{(1 - ψ_{i}^{2}) (1 - ψ_{i})} [n_{i} (1 + ψ_{i}) - 2 ψ_{i} (\frac{1 - ψ_{i}^{n_{i}}}{1 - ψ_{i}})] . \end{array}

(43)

Proof

See the Appendix.
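The closed forms in (43) can be checked numerically. The sketch below is an illustration only: it assumes training times t_k = k for k = 1, …, n_i, a test time s > n_i, and the standard stationary AR(1) autocovariance σ_i² ψ_i^{|j − k|} / (1 − ψ_i²); the function names are hypothetical.

```python
def ar1_sums(n, s, psi, sigma2):
    """Closed forms of S_rho and S_Sigma in (43) for one class."""
    gamma0 = sigma2 / (1.0 - psi**2)        # stationary variance
    geom = (1.0 - psi**n) / (1.0 - psi)     # 1 + psi + ... + psi**(n - 1)
    s_rho = psi**(s - n) * gamma0 * geom
    s_sigma = gamma0 / (1.0 - psi) * (n * (1.0 + psi) - 2.0 * psi * geom)
    return s_rho, s_sigma


def ar1_sums_brute(n, s, psi, sigma2):
    """Same sums obtained by adding autocovariances entry by entry."""
    gamma0 = sigma2 / (1.0 - psi**2)

    def cov(j, k):
        return gamma0 * psi**abs(j - k)

    s_rho = sum(cov(s, k) for k in range(1, n + 1))
    s_sigma = sum(cov(j, k) for j in range(1, n + 1) for k in range(1, n + 1))
    return s_rho, s_sigma


for psi in (-0.7, 0.3, 0.9):
    closed = ar1_sums(n=6, s=9, psi=psi, sigma2=2.0)
    brute = ar1_sums_brute(n=6, s=9, psi=psi, sigma2=2.0)
    print(psi, closed, brute)
```

For each value of ψ the two computations agree up to floating-point rounding, which is consistent with S_ρ being the sum of covariances between the test point and the training points, and S_Σ being the sum of all entries of the training covariance matrix.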

Corollary 9

In the model considered in Theorem 8, let ψ₀ = ψ₁ = ψ, σ₀ = σ₁, $α_{t_{s}}^{0} = α_{t_{s}}^{1}$ . Let $E [ε_{t_{s}, n_{0} + n_{1}}^{A R {(1)}_{ψ}}]$ denote the expected true error of an AR(1) model with AR coefficient ψ specific to t_s. Then

lim_{s \to \infty} E [ε_{t_{s}, n_{0} + n_{1}}^{A R {(1)}_{ψ}}] = \frac{1}{2} - \frac{L (h, k; ρ)}{2},

(44)

where L(h, k; ρ) is defined in (22) and

\begin{array}{l} h = \frac{μ}{2 \sqrt{a}}, k = \frac{μ}{\sqrt{b}}, μ = \frac{c_{0} - c_{1}}{1 - ψ}, a = \frac{σ^{2}}{1 - ψ^{2}} + \frac{b}{4}, \\ b = \frac{σ^{2} [(1 + ψ) (\frac{1}{n_{0}} + \frac{1}{n_{1}}) - 2 \frac{ψ}{1 - ψ} (\frac{1 - ψ^{n_{0}}}{n_{0}^{2}} + \frac{1 - ψ^{n_{1}}}{n_{1}^{2}})]}{(1 - ψ^{2}) (1 - ψ)}, \\ ρ = \frac{σ^{2} [(1 + ψ) (\frac{1}{n_{0}} - \frac{1}{n_{1}}) - \frac{2 ψ}{1 - ψ} (\frac{1 - ψ^{n_{0}}}{n_{0}^{2}} - \frac{1 - ψ^{n_{1}}}{n_{1}^{2}})]}{2 (1 - ψ^{2}) (1 - ψ) \sqrt{a} \sqrt{b}} . \end{array}

(45)

Proof

See the Appendix.

We consider $E [ε_{t_{s}, 2 n}^{A R (1)}]$ as a function of ψ and compare it to the case where ψ = 0, which corresponds to the stochastic i.i.d. setting.

Corollary 10

In the model considered in Corollary 9, let n₀ = n₁ = n. Furthermore, let $E [ε_{t_{s}, 2 n}^{I}]$ be the expected true error of the LDA classifier with ψ = 0 in (40). Let ψ′ and ψ″ be two arbitrary values of the AR coefficient ψ. Then

ψ^{″} > ψ^{'} \Rightarrow lim_{s \to \infty} E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ^{″}}}] < lim_{s \to \infty} E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ^{'}}}] .

(46)

Hence,

\begin{array}{l} 0 < ψ < 1 \Rightarrow lim_{s \to \infty} E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ}}] < lim_{s \to \infty} E [ε_{t_{s}, 2 n}^{I}], \\ - 1 < ψ < 0 \Rightarrow lim_{s \to \infty} E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ}}] > lim_{s \to \infty} E [ε_{t_{s}, 2 n}^{I}] . \end{array}

(47)

Proof

See the Appendix.

Corollary 10 shows that, if ψ ∈ (0, 1), then under the conditions of the Corollary, constructing an LDA classifier to differentiate between AR processes is beneficial in terms of the expected true error tested on sufficiently lagged data; however, for ψ ∈ (−1, 0), we expect larger expected true error.
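The ordering in (47) can be illustrated numerically. The sketch below evaluates the limiting expected error directly from (41) with the parameters in (45), under two assumptions that should be checked against the full text: P(Z < 0) denotes the probability that both components of Z are negative (as in the proof of Theorem 1), and n₀ = n₁ makes ρ in (45) zero, so the two components are independent and each orthant probability factors into a product of univariate normal probabilities. The function names and parameter values are hypothetical.

```python
from math import erf, sqrt


def std_normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def ar1_limit_error(n, psi, c0, c1, sigma2=1.0):
    """Limiting (s -> infinity) expected error for n0 = n1 = n and
    equal class weights, using h, k, a, b from (45) with rho = 0."""
    mu = (c0 - c1) / (1.0 - psi)
    geom = (1.0 - psi**n) / (1.0 - psi)
    b = sigma2 * ((1.0 + psi) * 2.0 / n - 4.0 * psi * geom / n**2)
    b /= (1.0 - psi**2) * (1.0 - psi)
    a = sigma2 / (1.0 - psi**2) + b / 4.0
    h = mu / (2.0 * sqrt(a))
    k = mu / sqrt(b)
    # P(Z1 < 0, Z2 < 0) + P(Z1 >= 0, Z2 >= 0) with independent components
    return std_normal_cdf(-h) * std_normal_cdf(k) + std_normal_cdf(h) * std_normal_cdf(-k)


err_iid = ar1_limit_error(n=5, psi=0.0, c0=0.5, c1=-0.5)
err_pos = ar1_limit_error(n=5, psi=0.5, c0=0.5, c1=-0.5)
err_neg = ar1_limit_error(n=5, psi=-0.5, c0=0.5, c1=-0.5)
print(err_neg, err_iid, err_pos)
```

For these illustrative values the error at ψ = 0.5 is below the ψ = 0 error and the error at ψ = −0.5 is above it, in line with (47).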

4.2. First-order moving-average model MA(1)

We consider the MA(1) models

X_{t}^{i} = c_{i} + Z_{t}^{i} + θ_{i} Z_{t - 1}^{i}, i = 0, 1,

(48)

where θ_i ∈ ℝ and $Z_{t}^{0} \sim N (0, σ_{0}^{2}), Z_{t}^{1} \sim N (0, σ_{1}^{2})$ , for all t, are independent of each other. Then $X_{t}^{0} = \{X_{t}^{0} : 0 < t < \infty\}$ and $X_{t}^{1} = \{X_{t}^{1} : 0 < t < \infty\}$ are two independent and covariance-stationary processes regardless of the values of θ_i [18].

Theorem 11

Let $X_{t}^{0}, X_{t}^{1}$ in the UGDS model be defined by the two independent covariance-stationary MA(1) processes defined in (48). Then, at t_s, where max{n₀, n₁} + 1 < s, the expected true error of an LDA classifier constructed using the training samples $[X_{t_{1}}^{0}, X_{t_{2}}^{0}, \dots, X_{t_{n_{0}}}^{0}]$ and $[X_{t_{1}}^{1}, X_{t_{2}}^{1}, \dots, X_{t_{n_{1}}}^{1}]$ is

E [ε_{t_{s}, n_{0} + n_{1}}^{M A (1)}] = α_{t_{s}}^{0} [P (Z_{s}^{I} < 0) + P (Z_{s}^{I} \geq 0)] + α_{t_{s}}^{1} [P (Z_{s}^{I I} < 0) + P (Z_{s}^{I I} \geq 0)],

(49)

where $Z_{t_{s}}^{I}$ and $Z_{t_{s}}^{I I}$ are Gaussian bivariate vectors with:

\begin{array}{l} μ_{Z_{s}^{I}} = {[\begin{matrix} \frac{μ}{2} & - μ \end{matrix}]}^{T}, μ_{Z_{s}^{I I}} = {[\begin{matrix} \frac{- μ}{2} & μ \end{matrix}]}^{T}, \\ \sum_{Z_{s}^{I}} = [\begin{matrix} σ_{0}^{2} (1 + θ_{0}^{2}) + \frac{S_{\sum^{00}}}{4 n_{0}^{2}} + \frac{S_{\sum^{11}}}{4 n_{1}^{2}} & \frac{S_{\sum^{00}}}{2 n_{0}^{2}} - \frac{S_{\sum^{11}}}{2 n_{1}^{2}} \\ \cdot & \frac{S_{\sum^{00}}}{n_{0}^{2}} + \frac{S_{\sum^{11}}}{n_{1}^{2}} \end{matrix}], \\ \sum_{Z_{s}^{I I}} = [\begin{matrix} σ_{1}^{2} (1 + θ_{1}^{2}) + \frac{S_{\sum^{00}}}{4 n_{0}^{2}} + \frac{S_{\sum^{11}}}{4 n_{1}^{2}} & \frac{- S_{\sum^{00}}}{2 n_{0}^{2}} + \frac{S_{\sum^{11}}}{2 n_{1}^{2}} \\ \cdot & \frac{S_{\sum^{00}}}{n_{0}^{2}} + \frac{S_{\sum^{11}}}{n_{1}^{2}} \end{matrix}], \end{array}

(50)

where i = 0, 1, and

μ = c_{0} - c_{1}, S_{\sum^{i i}} = σ_{i}^{2} [n_{i} (1 + θ_{i}^{2}) + 2 (n_{i} - 1) θ_{i}] .

(51)

Proof

See the Appendix.

Corollary 12

For the model in Theorem 11, let θ₀ = θ₁, σ₀ = σ₁, $α_{t_{s}}^{0} = α_{t_{s}}^{1}$ . Let $E [ε_{t_{s}, n_{0} + n_{1}}^{M A {(1)}_{θ}}]$ denote the expected true error of an MA(1) model with MA coefficient θ specific to t_s. Then

E [ε_{t_{s}, n_{0} + n_{1}}^{M A {(1)}_{θ}}] = \frac{1}{2} - \frac{L (h, k; ρ)}{2},

(52)

where L(h, k; ρ) is defined in (22) and

\begin{array}{l} h = \frac{μ}{2 \sqrt{a}}, k = \frac{μ}{\sqrt{b}}, μ = c_{0} - c_{1}, a = σ^{2} (1 + θ^{2}) + \frac{b}{4}, \\ b = σ^{2} [{(1 + θ)}^{2} (\frac{1}{n_{0}} + \frac{1}{n_{1}}) - 2 θ (\frac{1}{n_{0}^{2}} + \frac{1}{n_{1}^{2}})], \\ ρ = \frac{σ^{2} [{(1 + θ)}^{2} (\frac{1}{n_{0}} - \frac{1}{n_{1}}) - 2 θ (\frac{1}{n_{0}^{2}} - \frac{1}{n_{1}^{2}})]}{2 \sqrt{a} \sqrt{b}} . \end{array}

(53)

Proof

The result follows by applying the assumptions of the corollary to Theorem 11 and then following steps similar to those in the proof of Corollary 2.

Corollary 13

For the model in Corollary 12, let n₀ = n₁ = n. Furthermore, let $E [ε_{t_{s}, 2 n}^{I}]$ be the expected true error of the LDA classifier specific to t_s with θ = 0 in (48). Let θ′ and θ″ be two arbitrary values of the MA coefficient θ. Then

\begin{array}{l} (θ^{″} > θ^{'}) \land (θ^{'}, θ^{″} \in (- \infty, \frac{1}{n} - 1)) \Rightarrow E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ^{″}}}] < E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ^{'}}}], \\ (θ^{″} > θ^{'}) \land (θ^{'}, θ^{″} \in [\frac{\frac{1}{n} - 1}{2 n + 1}, \infty)) \Rightarrow E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ^{″}}}] > E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ^{'}}}], \end{array}

(54)

and, therefore,

\begin{array}{l} θ \in (0, \infty) \Rightarrow E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ}}] > E [ε_{t_{s}, 2 n}^{I}], \\ θ \in [\frac{\frac{1}{n} - 1}{2 n + 1}, 0) \Rightarrow E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ}}] < E [ε_{t_{s}, 2 n}^{I}] . \end{array}

(55)

Proof

See the Appendix.

Corollary 13 shows that there exists a range of moving-average coefficients, i.e. [ $\frac{\frac{1}{n} - 1}{2 n + 1}$ , 0), that is beneficial in terms of expected classification error, i.e. has a smaller expected true error than the stochastic i.i.d. model. For positive values of the coefficient, the expected true error of LDA increases.
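The behavior described in Corollary 13 can be reproduced with a few lines of Python. This is a sketch under stated assumptions rather than the computation used for the figures: it takes n₀ = n₁ = n and equal class weights, so that ρ in (53) is zero and the two components of each Gaussian vector in (49) are independent, and it reads P(Z < 0) as the probability that both components are negative. The function names and parameter values are hypothetical.

```python
from math import erf, sqrt


def std_normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def ma1_error(n, theta, c0, c1, sigma2=1.0):
    """Expected true error under MA(1) for n0 = n1 = n, using (53) with rho = 0."""
    mu = c0 - c1
    b = sigma2 * ((1.0 + theta) ** 2 * 2.0 / n - 4.0 * theta / n**2)
    a = sigma2 * (1.0 + theta**2) + b / 4.0
    h = mu / (2.0 * sqrt(a))
    k = mu / sqrt(b)
    # P(Z1 < 0, Z2 < 0) + P(Z1 >= 0, Z2 >= 0) with independent components
    return std_normal_cdf(-h) * std_normal_cdf(k) + std_normal_cdf(h) * std_normal_cdf(-k)


n = 5
lower = (1.0 / n - 1.0) / (2 * n + 1)      # left end of the beneficial range
err_iid = ma1_error(n, 0.0, 0.5, -0.5)
err_pos = ma1_error(n, 0.5, 0.5, -0.5)
err_neg = ma1_error(n, -0.05, 0.5, -0.5)   # inside [lower, 0)
err_low = ma1_error(n, lower, 0.5, -0.5)
print(lower, err_low, err_neg, err_iid, err_pos)
```

For these values, coefficients inside the range [ (1/n − 1)/(2n + 1), 0 ) give a smaller expected error than θ = 0, and the positive coefficient gives a larger one, matching (55).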

5. Numerical Examples

We now illustrate the results obtained in previous sections under several specific settings.

Experiment 1

First, we consider scenarios in which the sample points taken from each class conditional process are identically distributed. They have the same mean, μ₀ for class 0 and μ₁ for class 1, and we set μ₀ = −μ₁ and μ₀ = 0.5, 0.75, 1, 1.5. We assume that the observations have variance 1 and are equally correlated with ρ_with ∈ [ρ_l, 0.95]. The value of ρ_l is determined so that the covariance matrix defined in (8) is positive definite. In each case we consider three settings for the correlation, ρ_bet, across classes: (1) independent, ρ_bet = 0, (2) ρ_bet = ρ_with, and (3) ρ_bet = −ρ_with. For each setting we consider two sample sizes, n₀ = n₁ = n = 5 and n₀ = n₁ = n = 25. We assume any future observation from each class conditional process has a distribution similar to those of the training data from that class and $α_{t_{s}}^{0} = α_{t_{s}}^{1}$ .

Figures 2(a)–2(d) show the exact expectation and standard deviation (SD) of the LDA true error for this experiment as a function of ρ_with. The results are calculated from Theorems 1 and 6. Parts (a) and (b) of the figure show that increasing ρ_with has an incremental effect on $E [ε_{t_{s}, 2 n}^{D}]$ . Since future observations are identically distributed, $E [ε_{t_{s}, 2 n}^{D}]$ is the same for all values of t_s. Theoretically, for ρ_bet = 0, we can easily verify the graphical behavior by applying Lemma 3 to the expression in Theorem 1. To see analytically the effect of ρ_with on $E [ε_{t_{s}, 2 n}^{D}]$ when ρ_bet = 0, let ρ₁, ρ₂ be two arbitrary values of ρ_with such that ρ₁ < ρ₂ and denote all distributional parameters used in Corollary 2 corresponding to ρ_k, k = 1, 2, with a superscript ρ_k. Under the conditions of the experiment, we have $\frac{S_{\sum^{00}}^{ρ_{1}}}{n_{0}^{2}} - \frac{S_{\sum^{11}}^{ρ_{1}}}{n_{1}^{2}} = \frac{S_{\sum^{00}}^{ρ_{2}}}{n_{0}^{2}} - \frac{S_{\sum^{11}}^{ρ_{2}}}{n_{1}^{2}} = 0$ and $\frac{S_{\sum^{00}}^{ρ_{k}}}{n_{0}^{2}} + \frac{S_{\sum^{11}}^{ρ_{k}}}{n_{1}^{2}} = \frac{1 + 2 (n - 1) ρ_{k}}{2 n}$ , k = 1, 2. Therefore, a^ρ₁ < a^ρ₂ and b^ρ₁ < b^ρ₂. The results then follow from Lemma 3. For other cases where ρ_bet ≠ 0 one may analytically study the effect of changing ρ_bet on $E [ε_{t_{s}, 2 n}^{D}]$ using the results of Theorem 1 and studying the change in a manner similar to the proof of Lemma 3.

Figure 2. Panels (a)–(d) show the exact expectation and standard deviation of the LDA true error in Experiment 1 as a function of ρ_with. (a) Expectation for n₀ = n₁ = 5; (b) Expectation for n₀ = n₁ = 25; (c) Standard deviation for n₀ = n₁ = 5; (d) Standard deviation for n₀ = n₁ = 25; (a)–(d) plot keys: ○ := ρ_bet = 0; × := ρ_bet = ρ_with; △ := ρ_bet = −ρ_with; solid := μ₀ = 1.5; dash := μ₀ = 1; dot := μ₀ = 0.75; dash-dot := μ₀ = 0.5. The intersection of each curve with the vertical solid line in plots (a)–(d) shows the magnitude of the expectation/variance for the i.i.d. sampling situation for the corresponding scenario. The small horizontal solid lines in panels (b) and (d) show the magnitude of the expectation/variance of the i.i.d. situation in panels (a) and (c), respectively. Panels (e)–(f) show the exact expectation of the LDA true error of the first-order autoregressive model in Experiment 2 as a function of ψ := ψ₀ = ψ₁. (e) Case of n₀ = n₁ = 5; (f) Case of n₀ = n₁ = 25; (e)–(f) plot keys: ○ := s − n₀ = 2; × := s − n₀ = 10; solid := c₀ = 1.5; dash := c₀ = 1; dot := c₀ = 0.75; dash-dot := c₀ = 0.5. The intersection of each curve with the vertical solid line in plots (e)–(f) shows the magnitude of the expectation for the i.i.d. sampling situation for the corresponding scenario. The small horizontal solid lines in panel (f) show the magnitude of the expectation of the i.i.d. situation in panel (e).

Figures 2(a) and 2(b) show that increasing d = |μ₀ − μ₁| has a decremental effect on $E [ε_{t_{s}, 2 n}^{D}]$ . This effect can also be seen from Lemma 3 and Corollary 2. Therefore, we call classification scenarios with a larger d “easier” scenarios, and those with a smaller d “harder” scenarios. In this sense, d is an indicator of classification difficulty in our experiment. The figures suggest that having a between-class correlation of ρ_bet = ρ_with > 0 helps classification performance in “harder” classification situations (i.e., compared with ρ_bet = 0) and has a detrimental effect on classification performance in “easier” settings. However, having ρ_bet = ρ_with < 0 helps classification performance in “easier” settings and results in worse performance in “harder” settings. This is observed from the fact that the curves for ρ_bet = ρ_with are above (below) the curves for ρ_bet = 0 for d = 3 (d = 1).

The standard deviation is more complicated to interpret. The trends seen in Figures 2(c) and 2(d) suggest that increasing ρ_with generally increases the standard deviation of the LDA true error in cases where ρ_bet = 0. Furthermore, they suggest that when ρ_bet = −ρ_with, the standard deviation generally increases as ρ_with grows, but when ρ_bet = ρ_with, increasing ρ_with in small sample sizes may increase or decrease the standard deviation depending on classification difficulty, and as the sample size gets larger, increasing ρ_with generally increases the standard deviation. Finally, the figures suggest that increasing the classification difficulty may first increase the standard deviation and then decrease it.

Comparing Figure 2(a) with 2(b) and Figure 2(c) with 2(d) shows that increasing the sample size lowers the magnitude of the expectation and standard deviation regardless of classification difficulty or magnitude of ρ_with.

Experiment 2

In this experiment, we use the first-order autoregressive model defined in (40). We assume $α_{t_{s}}^{0} = α_{t_{s}}^{1}$ , n₀ = n₁ = n, σ₀ = σ₁ = 1, and ψ₀ = ψ₁ = ψ ∈ [−0.95, 0.95]. We consider various cases where c₀ = 0.5, 0.75, 1, 1.5 with c₀ = −c₁. Figures 2(e) and 2(f) show the exact expectation of the LDA true error for this experiment, calculated from Theorem 8. These figures suggest that increasing ψ decreases $E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ}}]$ . According to Corollary 10, for a sufficiently lagged t_s, $E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ}}]$ is a decreasing function of ψ and, furthermore, $E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ}}] < E [ε_{t_{s}, 2 n}^{I}]$ for 0 < ψ < 1 and $E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ}}] > E [ε_{t_{s}, 2 n}^{I}]$ for −1 < ψ < 0. Here the same behavior is observed even for small lags of 2 and 10. Furthermore, decreasing the sample size and increasing the classification difficulty have an incremental effect on the expected true error.

Experiment 3

In this experiment, we use the MA(1) model in (48). We assume $α_{t_{s}}^{0} = α_{t_{s}}^{1}$ , n₀ = n₁ = n, σ₀ = σ₁ = 1, and θ₀ = θ₁ = θ ∈ [−10, 10]. We consider c₀ = 0.5, 0.75, 1, 1.5 with c₀ = −c₁. Figure 3 shows the exact expectation of the LDA true error for this experiment, calculated from Theorem 11.

Figure 3 shows that the expected true error of LDA under the MA(1) model has an inverted bell shape with a negatively biased center, and the bias decreases as the sample size increases. The results of Corollary 13 are clear in this figure: for $θ \in (- \infty, \frac{1}{n} - 1], E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ}}]$ is a decreasing function of θ. This region is on the left-hand side of the vertical blue dotted lines in Figure 3. For $θ \in [\frac{\frac{1}{n} - 1}{2 n + 1}, \infty), E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ}}]$ is an increasing function of θ. This region is on the right-hand side of the vertical red dashed line in the figure. As proved in Corollary 13, we observe in Figures 3(c) and 3(d) that, for $θ \in [\frac{\frac{1}{n} - 1}{2 n + 1}, 0), E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ}}] < E [ε_{t_{s}, 2 n}^{I}]$ . This is the region between the red dashed line and the vertical black line of each plot. For θ ∈ (0, ∞), $E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ}}] > E [ε_{t_{s}, 2 n}^{I}]$ . This is the region on the right-hand side of the vertical black solid lines.

Experiment 4

This experiment is an example derived from gene-expression data used in studying the prognosis of breast cancer using 70 genes with high prognostic ability [19]. Following [20], we divide the 307 individuals used in this study into 64 “poor” prognosis (class 0) versus 243 “good” prognosis (class 1) patients. A poor prognosis is defined to be a distant metastasis within 5 years of initial diagnosis. The gene expression data used in this study were collected by triplicating each gene on each microarray and then duplicating each measurement by dye-swapping. Therefore, for each patient and each gene, we have six measurements, three of which are positively correlated with one another and negatively correlated with the others. Using this dataset we consider a scenario in which the experimenter is given only six measurements taken from one patient from class 0 and six measurements from another patient from class 1, and a univariate LDA classifier is desired to differentiate the two groups. We assume the single variate used in this classifier is the ALDH4 gene, which has the highest correlation with prognosis of breast cancer in [20]. Therefore, in this scenario, the experimenter is given 12 “technical” replicates in total, which are now treated as our “sample points”. This is an example of the UGDS model in genomic applications in which our classification is defined by two Gaussian processes, $X_{t}^{0}$ and $X_{t}^{1}$ , which are assumed to be independent processes. We note that the expected performance of a classifier depends on t_s, i.e. $E [ε_{t_{s}, 12}^{D}]$ in Theorem 1, which is a function of the distribution of the future data as well as the distribution of the training data and their correlation structure.
We verify the Gaussianity of each of the 12 random variables, $X_{t_{1}}^{0}, X_{t_{2}}^{0}, \dots, X_{t_{6}}^{0}, X_{t_{1}}^{1}, X_{t_{2}}^{1}, \dots, X_{t_{6}}^{1}$ , used for characterizing the two Gaussian processes of this example via a Shapiro-Wilk test (using the R statistical software) on the full dataset corresponding to each random variable. This test does not reject Gaussianity of the random variables over either of the classes at the 95% confidence level after employing the Bonferroni correction for multiple hypothesis tests.

Unfortunately, taken together, the 12 random variables do not pass the Shapiro-Wilk test for multivariate Gaussianity. Nonetheless, we will proceed and demonstrate that, even with this lack of multivariate Gaussianity, Theorem 1 is much more accurate than its counterpart in [3], which assumes i.i.d. data from each distribution.

Sample means, variances, and correlations, computed on the full dataset, were used as estimates of the unknown true means, variances, and the correlation structure between samples needed in Theorem 1. Using Theorem 1, the expected performance of a classifier, $E [ε_{t_{s}, 12}^{D}]$ , to differentiate samples distributed as $X_{t_{5}}^{0}$ from samples distributed as $X_{t_{1}}^{1}$ is 0.475. To further verify this expected performance we construct a classifier on each possible combination among 243 × 64 = 15552 combinations of 6 samples from either class and each time we test the accuracy of the designed classifier on the 64 − 1 = 63 remaining realizations of $X_{t_{5}}^{0}$ and 243 − 1 = 242 realizations of $X_{t_{1}}^{1}$ . The accuracy computed in this way is 0.479, which is almost the same as what is computed from Theorem 1. It is interesting to compare this accuracy to the case in which one designs a classifier without paying attention to the correlation structure between samples and the various distributions governing the data (that is, treating the data as i.i.d.). In this scenario one (incorrectly) considers the data from each class as coming from a single distribution, and the expected performance of a classifier can therefore be evaluated from Theorem 1 that we presented in [3]. Again we use the sample means and variances, computed on the full dataset, as estimates of the unknown true means and variances. In this case the expected performance of LDA is estimated to be 0.374, which is very far from 0.479.

6. Conclusion

In many applications, the assumption of having i.i.d. training samples is violated. This paper characterizes the performance of univariate LDA classification in stochastic settings by assuming the samples are taken from two class conditional Gaussian processes, which are not necessarily independent. Linear classification has been considered owing to its long history in pattern recognition and its suitability for small-sample classification. We do not impose a specific correlation structure on the training data. We have presented conditions in which the correlation structure can be either beneficial or detrimental in terms of classification performance. As an application we have obtained exact expressions for the performance of LDA in situations in which the data are produced through auto-regressive (AR) or moving-average (MA) models of the first order. We have found ranges of AR or MA multiplicative coefficients having an incremental or decremental effect on classification performance. Having characterized univariate LDA performance in closed form, we aim to follow our work in [3] and characterize the effect of non-i.i.d. samples on training-data-based error estimators.

Acknowledgments

This work was partially supported by NIH grant 2R25CA090301 (Nutrition, Biostatistics, and Bioinformatics) from the National Cancer Institute.

Appendix. Various Correlation Structures

Let p-dimensional sample points of each class, X₁, X₂, …, X_{n_i}, with X_j being a column vector, be separately taken from two p-variate normal distributions, Π₁ and Π₂, with the distribution N(μ_i, Σ). Furthermore, let V_i be the dispersion matrix of the n_ip×1 vector $X = {(X_{1}^{T}, X_{2}^{T}, \dots, X_{n_{i}}^{T})}^{T}$ , i = 0, 1, defined as V_i = E[(X − E(X))(X − E(X))^T]. We define three correlation structures in regard to the data: (1) equicorrelated if V_i = I_{n_i} ⊗ (Σ − R) + E_{n_i} ⊗ R, with R being a symmetric matrix, I_n the n × n identity matrix, and E_n the n × n matrix with all elements being 1; (2) simply equicorrelated if V_i = I_{n_i} ⊗ (1 − ρ)Σ + E_{n_i} ⊗ ρΣ, where ρ is a nonzero scalar constant with |ρ| < 1; and (3) serially correlated if V_i = I_{n_i} ⊗ Σ + E_{n_i} ⊗ ρ_τ Σ, where τ = |k − l|, k, l = 1, 2, …, n_i, |ρ_τ| < 1, τ = 1, 2, …, n_i, ρ₀ = 0. Note that, for univariate sample points, the equicorrelation and simple-equicorrelation structures are essentially the same.

Proof of Theorem 1

From (9), it follows that

\begin{array}{l} E [ε_{t_{s}}^{0}] = P (W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) \leq 0 ∣ X_{t_{s}} \in X_{t}^{0}) = \\ P (X_{t_{s}} - {\bar{X}}_{T} < 0, {\bar{X}}_{T}^{0} - {\bar{X}}_{T}^{1} > 0) + P (X_{t_{s}} - {\bar{X}}_{T} \geq 0, {\bar{X}}_{T}^{0} - {\bar{X}}_{T}^{1} < 0), \end{array}

where ${\bar{X}}_{T} = \frac{{\bar{X}}_{T}^{0} + {\bar{X}}_{T}^{1}}{2}$ . Expanding ${\bar{X}}_{T}^{0}$ and ${\bar{X}}_{T}^{1}$ as $\frac{1}{n_{0}} \sum_{i = 1}^{n_{0}} X_{t_{i}}^{0}$ and $\frac{1}{n_{1}} \sum_{i = 1}^{n_{1}} X_{t_{i}}^{1}$ results in

E [ε_{t_{s}}^{0}] = P (Z_{s}^{I} < 0) + P (Z_{s}^{I} \geq 0),

(56)

where $Z_{s}^{I} = A Y_{s}^{0}$ and $Y_{s}^{0} = {[X_{t_{s}}^{0}, X_{t_{1}}^{0}, \dots, X_{t_{n_{0}}}^{0}, X_{t_{1}}^{1}, \dots, X_{t_{n_{1}}}^{1}]}^{T}$ , where the superscript 0 in $X_{t_{s}}^{0}$ denotes explicitly that $X_{t_{s}} \in X_{t}^{0}$ , and

A = [\begin{matrix} 1 & - \frac{1}{2 n_{0}} & \frac{- 1}{2 n_{0}} & \dots & \frac{- 1}{2 n_{0}} & \frac{- 1}{2 n_{1}} & \dots & \frac{- 1}{2 n_{1}} \\ 0 & \frac{- 1}{n_{0}} & \frac{- 1}{n_{0}} & \dots & \frac{- 1}{n_{0}} & \frac{1}{n_{1}} & \dots & \frac{1}{n_{1}} \end{matrix}] .

(57)

Since $Y_{s}^{0}$ is a Gaussian random vector, $Z_{s}^{I}$ is Gaussian with mean $A μ_{Y_{s}^{0}}$ and covariance matrix $A \sum_{Y_{s}^{0}} A^{T}$ , where the covariance matrix of $Y_{s}^{0}$ is

\sum_{Y_{s}^{0}} = [\begin{matrix} {(σ_{s}^{0})}^{2} & ρ_{s}^{00} & ρ_{s}^{01} \\ {(ρ_{s}^{00})}^{T} & \sum^{00} & \sum^{01} \\ {(ρ_{s}^{01})}^{T} & {(\sum^{01})}^{T} & \sum^{11} \end{matrix}],

(58)

which leads to the expression stated in Theorem 1. Evaluating the mean and covariance matrix of vector $Z_{s}^{II}$ , which is the counterpart for $E [ε_{t_{s}}^{1}]$ , is entirely similar, by considering $P (W ({\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}}) > 0 ∣ {\bar{X}}_{T}^{0}, {\bar{X}}_{T}^{1}, X_{t_{s}} \in X_{t}^{1})$ .
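The linear map in (57) can be checked on a small case. The sketch below builds A in plain Python, applies it to an i.i.d. configuration (identity covariance, n₀ = n₁ = 4, class means ±0.5, test point from class 0), and compares the result with the independent-sampling matrix in (39). The helper names and the numerical values are assumptions chosen for illustration.

```python
def build_A(n0, n1):
    """The 2 x (1 + n0 + n1) matrix A of (57)."""
    row1 = [1.0] + [-1.0 / (2 * n0)] * n0 + [-1.0 / (2 * n1)] * n1
    row2 = [0.0] + [-1.0 / n0] * n0 + [1.0 / n1] * n1
    return [row1, row2]


def transform(A, mean_y, cov_y):
    """Mean A m and covariance A C A^T of Z = A Y for Gaussian Y."""
    dim = len(mean_y)
    rows = len(A)
    mean_z = [sum(A[r][i] * mean_y[i] for i in range(dim)) for r in range(rows)]
    cov_z = [
        [
            sum(A[r][i] * cov_y[i][j] * A[c][j] for i in range(dim) for j in range(dim))
            for c in range(rows)
        ]
        for r in range(rows)
    ]
    return mean_z, cov_z


n0 = n1 = 4
mu0, mu1, sigma2 = 0.5, -0.5, 1.0
dim = 1 + n0 + n1
mean_y = [mu0] + [mu0] * n0 + [mu1] * n1           # test point from class 0
cov_y = [[sigma2 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

mean_z, cov_z = transform(build_A(n0, n1), mean_y, cov_y)
print(mean_z)   # compare with [mu/2, -mu], where mu = mu0 - mu1
print(cov_z)    # compare with (39)
```

In this configuration the mean is [μ/2, −μ] with μ = μ₀ − μ₁ = 1, the diagonal entries are σ² + σ²/(2n) and 2σ²/n, and the off-diagonal entry vanishes, as in (39). Replacing `cov_y` with a non-diagonal matrix of the block form (58) gives the dependent-sampling covariance in the same way.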

Proof of Corollary 2

Note that for Φ(x, y; ρ) defined in (20),

Φ (- x, - y; ρ) = \int_{x}^{\infty} \int_{y}^{\infty} \frac{1}{2 π \sqrt{1 - ρ^{2}}} exp {\frac{- (u^{2} + v^{2} - 2 ρ u v)}{2 (1 - ρ^{2})}} dudv .

(59)

By considering the assumption of the corollary for Theorem 1, and using (20) and (59) in (18), we get

E [ε_{t_{s}, n_{0} + n_{1}}^{D}] = \frac{1}{2} (Φ (\frac{μ}{2 \sqrt{a}}, \frac{- μ}{\sqrt{b}}, ρ) + Φ (\frac{- μ}{2 \sqrt{a}}, \frac{μ}{\sqrt{b}}, ρ)) + \frac{1}{2} (Φ (\frac{- μ}{2 \sqrt{a}}, \frac{μ}{\sqrt{b}}, - ρ) + Φ (\frac{μ}{2 \sqrt{a}}, \frac{- μ}{\sqrt{b}}, - ρ)),

(60)

with a, b, and ρ defined in the corollary. Using the identity [28]

2 [Φ (x, y; ρ) + Φ (x, y; - ρ) - Φ (x) - Φ (y)] + 1 = L (x, y; ρ),

(61)

where Φ(.) is the standard normal cumulative function, completes the proof.

Proof of Lemma 3

Here, we first provide a way to intuitively understand the Lemma and then we provide a rigorous proof. We have

G (x, y; ρ) = F (x, y; ρ) + F (x, y; - ρ) = 1 + L (x, y, ρ) = 1 - L (∣ x ∣, ∣ y ∣, ρ),

(62)

where the last equality is due to xy < 0 stated as an assumption to the lemma. Intuitively, the lemma makes sense because smaller values of |x|, |y|, and |ρ| imply not only a smaller integration region in (22), but also less mass in that region. Next we provide a rigorous proof. It is straightforward to show

\begin{array}{l} \frac{\partial G (x, y; ρ)}{\partial x} = \frac{2 e^{- \frac{x^{2}}{2}}}{\sqrt{2 π}} [Φ (\frac{y - ρ x}{\sqrt{1 - ρ^{2}}}) + Φ (\frac{y + ρ x}{\sqrt{1 - ρ^{2}}}) - 1], \\ \frac{\partial G (x, y; ρ)}{\partial y} = \frac{2 e^{- \frac{y^{2}}{2}}}{\sqrt{2 π}} [Φ (\frac{x - ρ y}{\sqrt{1 - ρ^{2}}}) + Φ (\frac{x + ρ y}{\sqrt{1 - ρ^{2}}}) - 1], \\ \frac{\partial G (x, y; ρ)}{\partial ρ} = 2 ψ (x, y; ρ) - 2 ψ (x, y; - ρ), \end{array}

(63)

where the last equality comes from a well-known property of the bivariate Gaussian distribution, namely $\frac{\partial Φ (x, y; ρ)}{\partial ρ} = \frac{\partial^{2} Φ (x, y; ρ)}{\partial x \partial y} = ψ (x, y; ρ)$ . Without loss of generality, we assume x ≥ 0 and y ≤ 0. The results for x ≤ 0 and y ≥ 0 are entirely similar after exchanging x and y in the following proof. We have

\begin{array}{l} ρ \geq 0 \Rightarrow y - ρ x \leq 0, y - ρ x \leq y + ρ x \leq - y + ρ x \\ \Rightarrow Φ (\frac{y - ρ x}{\sqrt{1 - ρ^{2}}}) + Φ (\frac{y + ρ x}{\sqrt{1 - ρ^{2}}}) \leq 1 \Rightarrow \frac{\partial G}{\partial x} \leq 0, \\ ρ < 0 \Rightarrow y + ρ x \leq 0, y + ρ x \leq y - ρ x \leq - y - ρ x \\ \Rightarrow Φ (\frac{y - ρ x}{\sqrt{1 - ρ^{2}}}) + Φ (\frac{y + ρ x}{\sqrt{1 - ρ^{2}}}) \leq 1 \Rightarrow \frac{\partial G}{\partial x} \leq 0. \end{array}

(64)

Hence, $\frac{\partial G}{\partial x} \leq 0$ . Similarly, $\frac{\partial G}{\partial y} \geq 0$ . Furthermore,

ρ \geq 0 \Rightarrow \frac{\partial G (x, y; ρ)}{\partial ρ} \leq 0, ρ < 0 \Rightarrow \frac{\partial G (x, y; ρ)}{\partial ρ} > 0.

For 0 ≤ λ ≤ 1, we set

γ_{x} = λ x_{1} + (1 - λ) x_{0}, γ_{y} = λ y_{1} + (1 - λ) y_{0}, γ_{ρ} = λ ρ_{1} + (1 - λ) ρ_{0} .

(65)

Then λ, γ_x ≥ 0, γ_y ≤ 0, ρ_i ≥ 0 ⇒ γ_ρ ≥ 0 (i = 0, 1), and ρ_i < 0 ⇒ γ_ρ < 0 (i = 0, 1). Thus, $\frac{\partial G}{\partial γ_{x}} \leq 0, \frac{\partial G}{\partial γ_{y}} \geq 0$ and

γ_{ρ} \geq 0 \Rightarrow \frac{\partial G (x, y; ρ)}{\partial γ_{ρ}} \leq 0, γ_{ρ} < 0 \Rightarrow \frac{\partial G (x, y; ρ)}{\partial γ_{ρ}} > 0.

(66)

Then

\begin{array}{l} \frac{d G}{d λ} = \frac{\partial G}{\partial γ_{x}} \frac{d γ_{x}}{d λ} + \frac{\partial G}{\partial γ_{y}} \frac{d γ_{y}}{d λ} + \frac{\partial G}{\partial γ_{ρ}} \frac{d γ_{ρ}}{d λ} \\ = \frac{\partial G}{\partial γ_{x}} (x_{1} - x_{0}) + \frac{\partial G}{\partial γ_{y}} (y_{1} - y_{0}) + \frac{\partial G}{\partial γ_{ρ}} (ρ_{1} - ρ_{0}) . \end{array}

(67)

First assume ρ_i ≥ 0, i = 0, 1, so that γ_ρ ≥ 0, $\frac{\partial G (x, y; ρ)}{\partial γ_{ρ}} \leq 0$ . Since $\frac{\partial G}{\partial γ_{x}} \leq 0$ , x₁ ≤ x₀, $\frac{\partial G}{\partial γ_{y}} \geq 0$ , y₁ ≥ y₀, and ρ₁ ≤ ρ₀, we have $\frac{d G}{d λ} \geq 0$ . Therefore,

\int_{0}^{1} \frac{d G}{d λ} d λ = G (1) - G (0) = G (x_{1}, y_{1}, ρ_{1}) - G (x_{0}, y_{0}, ρ_{0}) \geq 0.

(68)

Next assume ρ_i ≤ 0, i = 0, 1, so that γ_ρ ≤ 0, $\frac{\partial G (x, y; ρ)}{\partial γ_{ρ}} \geq 0$ . Since $\frac{\partial G}{\partial γ_{x}} \leq 0$ , x₁ ≤ x₀, $\frac{\partial G}{\partial γ_{y}} \geq 0$ , y₁ ≥ y₀, and ρ₁ ≥ ρ₀, we have $\frac{d G}{d λ} \geq 0$ . Therefore,

\int_{0}^{1} \frac{d G}{d λ} d λ = G (1) - G (0) = G (x_{1}, y_{1}, ρ_{1}) - G (x_{0}, y_{0}, ρ_{0}) \geq 0.

Lastly, assume the ρ_i’s have opposite signs. Without loss of generality, assume ρ₀ < 0, ρ₁ ≥ 0, and |ρ₁| ≤ |ρ₀|. Then

\begin{array}{l} \int_{0}^{1} \frac{d G}{d λ} d λ = \int_{0}^{\frac{∣ ρ_{0} ∣ - ρ_{1}}{ρ_{1} - ρ_{0}}} d G + \int_{\frac{∣ ρ_{0} ∣ - ρ_{1}}{ρ_{1} - ρ_{0}}}^{1} d G \\ = \int_{0}^{\frac{∣ ρ_{0} ∣ - ρ_{1}}{ρ_{1} - ρ_{0}}} d G + G (x_{1}, y_{1}, ρ_{1}) - G (x_{m}, y_{m}, - ρ_{1}), \end{array}

(69)

where $x_{m} = \frac{∣ ρ_{0} ∣ - ρ_{1}}{ρ_{1} - ρ_{0}} x_{1} + (1 - \frac{∣ ρ_{0} ∣ - ρ_{1}}{ρ_{1} - ρ_{0}}) x_{0}, y_{m} = \frac{∣ ρ_{0} ∣ - ρ_{1}}{ρ_{1} - ρ_{0}} y_{1} + (1 - \frac{∣ ρ_{0} ∣ - ρ_{1}}{ρ_{1} - ρ_{0}}) y_{0}$ , x₁ < x_m < x₀, and y₀ < y_m < y₁. From the definition of G(x, y, ρ) it is easy to see that G(x, y, ρ) = G(x, y, −ρ) and then, from the conditions that result in (68), we have G(x₁, y₁, ρ₁) − G(x_m, y_m, −ρ₁) = G(x₁, y₁, ρ₁) − G(x_m, y_m, ρ₁) ≥ 0. Hence, in order to show G(x₁, y₁, ρ₁) − G(x₀, y₀, ρ₀) ≥ 0 it is sufficient to show that $\int_{0}^{\frac{∣ ρ_{0} ∣ - ρ_{1}}{ρ_{1} - ρ_{0}}} \frac{d G}{d λ} d λ \geq 0$ . For $λ \in [0, \frac{∣ ρ_{0} ∣ - ρ_{1}}{ρ_{1} - ρ_{0}}]$ , we have γ_ρ ≤ 0 and therefore $\frac{\partial G (x, y; ρ)}{\partial γ_{ρ}} \geq 0$ . Since ρ₁ − ρ₀ > 0, x₁ ≤ x₀, and y₁ ≥ y₀, every term on the right-hand side of (67) is nonnegative, so $\frac{d G}{d λ} \geq 0$ . Thus, $\int_{0}^{\frac{∣ ρ_{0} ∣ - ρ_{1}}{ρ_{1} - ρ_{0}}} \frac{d G}{d λ} d λ \geq 0$ and the result follows.
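The split point used in (69) is the value of λ at which the interpolated correlation γ_ρ reaches −ρ₁. The short derivation below spells this out; the symbol λ_m is introduced here only as shorthand for that split point and does not appear in the original text.

```latex
\begin{aligned}
\lambda_m &:= \frac{|\rho_0| - \rho_1}{\rho_1 - \rho_0}, \\
\gamma_\rho(\lambda_m) &= \rho_0 + \lambda_m (\rho_1 - \rho_0)
  = \rho_0 + |\rho_0| - \rho_1 = -\rho_1
  \quad (\text{since } \rho_0 < 0), \\
0 \le \lambda_m &\le 1
  \quad (\text{since } 0 \le \rho_1 \le |\rho_0| \text{ and } \rho_1 - \rho_0 = \rho_1 + |\rho_0| > 0).
\end{aligned}
```

On [0, λ_m] the interpolated correlation therefore runs from ρ₀ to −ρ₁ and stays nonpositive, while on [λ_m, 1] it runs from −ρ₁ to ρ₁, which is why the symmetry of G in ρ can be applied to the second piece.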

Proof of Theorem 6

From (16) and (9), it follows that

E [{(ε_{t_{s}}^{0})}^{2}] = P (X_{t_{s}} - {\bar{X}}_{T} < 0, {\bar{X}}_{T}^{0} - {\bar{X}}_{T}^{1} > 0, X_{t_{s}}^{'} - {\bar{X}}_{T} < 0) + P (X_{t_{s}} - {\bar{X}}_{T} \geq 0, {\bar{X}}_{T}^{0} - {\bar{X}}_{T}^{1} < 0, X_{t_{s}}^{'} - {\bar{X}}_{T} \geq 0),

(70)

E [{(ε_{t_{s}}^{0})}^{2}] = P (Z_{s}^{I} < 0) + P (Z_{s}^{I} \geq 0),

(71)

where $Z_{s}^{I} = A Y_{s}^{0}$ , in which $Y_{s}^{0} = {[X_{t_{s}}^{0}, X_{t_{s}}^{' 0}, X_{t_{1}}^{0}, \dots, X_{t_{n_{0}}}^{0}, X_{t_{1}}^{1}, \dots, X_{t_{n_{1}}}^{1}]}^{T}$ , the superscript 0 in $X_{t_{s}}^{0}$ and $X_{t_{s}}^{' 0}$ denotes explicitly that $X_{t_{s}}, X_{t_{s}}^{'} \in X_{t}^{0}$ , and

A = [\begin{matrix} 1 & 0 & \frac{- 1}{2 n_{0}} & \frac{- 1}{2 n_{0}} & \dots & \frac{- 1}{2 n_{0}} & \frac{- 1}{2 n_{1}} & \dots & \frac{- 1}{2 n_{1}} \\ 0 & 0 & \frac{- 1}{n_{0}} & \frac{- 1}{n_{0}} & \dots & \frac{- 1}{n_{0}} & \frac{1}{n_{1}} & \dots & \frac{1}{n_{1}} \\ 0 & 1 & \frac{- 1}{2 n_{0}} & \frac{- 1}{2 n_{0}} & \dots & \frac{- 1}{2 n_{0}} & \frac{- 1}{2 n_{1}} & \dots & \frac{- 1}{2 n_{1}} \end{matrix}] .

Therefore, $Z_{s}^{I}$ is a Gaussian random vector with mean $A μ_{Y_{s}^{0}}$ and covariance matrix $A Σ_{Y_{s}^{0}} A^{T}$ . Plugging in the values of $μ_{Y_{s}^{0}} = [μ_{s}^{0}, μ_{s}^{0}, μ_{1}^{0}, μ_{2}^{0}, \dots, μ_{n_{0}}^{0}, μ_{1}^{1}, μ_{2}^{1}, \dots, μ_{n_{1}}^{1}]$ and noting that the j^th element of the vector $ρ_{s}^{i k}$ is defined as $ρ_{s}^{i k} (j) = E [(X_{t_{s}}^{i} - μ_{s}^{i}) (X_{t_{j}}^{k} - μ_{j}^{k})]$ , i, k = 0, 1, j = 1, 2, …, n_k, and that from the definition of $X_{t_{s}}^{' 0}$ it holds that $E [(X_{t_{s}}^{0} - μ_{s}^{0}) (X_{t_{j}}^{k} - μ_{j}^{k})] = E [(X_{t_{s}}^{' 0} - μ_{s}^{0}) (X_{t_{j}}^{k} - μ_{j}^{k})]$ , we have:

Σ_{Y_{s}^{0}} = [\begin{matrix} {(σ_{s}^{0})}^{2} & 0 & ρ_{s}^{00} & ρ_{s}^{01} \\ 0 & {(σ_{s}^{0})}^{2} & ρ_{s}^{00} & ρ_{s}^{01} \\ {(ρ_{s}^{00})}^{T} & {(ρ_{s}^{00})}^{T} & Σ^{00} & Σ^{01} \\ {(ρ_{s}^{01})}^{T} & {(ρ_{s}^{01})}^{T} & {(Σ^{01})}^{T} & Σ^{11} \end{matrix}],

(72)

which leads to the expression stated in Theorem 6. Evaluating the mean and covariance matrices of vector $Z_{s}^{II}$ and $Z_{s}^{III}$ is entirely similar.
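The construction in this proof is a linear map of a Gaussian vector, so its moments can be computed mechanically. The following is a minimal sketch, assuming NumPy, that builds the matrix A and the block covariance in (72) and returns the mean and covariance of $Z_{s}^{I}$. The function names (`build_A`, `build_sigma_Y`, `z_moments`) and the argument layout are hypothetical and chosen only for illustration.

```python
import numpy as np

def build_A(n0, n1):
    """3 x (2 + n0 + n1) matrix A mapping Y_s^0 to Z_s^I."""
    A = np.zeros((3, 2 + n0 + n1))
    A[0, 0] = 1.0                           # picks X_{t_s}
    A[2, 1] = 1.0                           # picks X'_{t_s}
    A[[0, 2], 2:2 + n0] = -1.0 / (2 * n0)   # minus half the class-0 sample mean
    A[[0, 2], 2 + n0:] = -1.0 / (2 * n1)    # minus half the class-1 sample mean
    A[1, 2:2 + n0] = -1.0 / n0              # minus the class-0 sample mean
    A[1, 2 + n0:] = 1.0 / n1                # plus the class-1 sample mean
    return A

def build_sigma_Y(var_s, rho00, rho01, S00, S01, S11):
    """Block covariance of Y_s^0 laid out as in (72)."""
    r00 = np.asarray(rho00, dtype=float).reshape(1, -1)
    r01 = np.asarray(rho01, dtype=float).reshape(1, -1)
    return np.block([
        [np.diag([var_s, var_s]), np.vstack([r00, r00]), np.vstack([r01, r01])],
        [np.hstack([r00.T, r00.T]), S00, S01],
        [np.hstack([r01.T, r01.T]), S01.T, S11],
    ])

def z_moments(A, mu_Y, sigma_Y):
    """Mean A mu and covariance A Sigma A^T of the Gaussian vector Z = A Y."""
    return A @ mu_Y, A @ sigma_Y @ A.T

# Illustrative i.i.d. special case: unit variances, no correlation.
n0, n1 = 3, 4
m0, m1 = 1.0, -1.0
A = build_A(n0, n1)
mu_Y = np.concatenate([[m0, m0], np.full(n0, m0), np.full(n1, m1)])
sigma_Y = build_sigma_Y(1.0, np.zeros(n0), np.zeros(n1),
                        np.eye(n0), np.zeros((n0, n1)), np.eye(n1))
mean_Z, cov_Z = z_moments(A, mu_Y, sigma_Y)
```

The orthant probabilities in (71) would then be evaluated from a trivariate Gaussian with these moments; that step is left out here because it depends on the numerical integrator one chooses.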

Proof of Theorem 8

Since the $Z_{t}^{i}$ ’s are Gaussian, $X_{t}^{0}$ and $X_{t}^{1}$ are covariance-stationary [18] and the vectors $X_{n_{0}}^{0} = {[X_{t_{1}}^{0}, X_{t_{2}}^{0}, \dots, X_{t_{n_{0}}}^{0}]}^{T}$ and $X_{n_{1}}^{1} = {[X_{t_{1}}^{1}, X_{t_{2}}^{1}, \dots, X_{t_{n_{1}}}^{1}]}^{T}$ are distributed normally as $X_{n_{i}}^{i} \sim N (μ^{i}, Σ^{i})$ , i = 0, 1, where

\begin{array}{l} μ^{i} = {[μ_{i}, μ_{i}, \dots, μ_{i}]}_{1 \times n_{i}}^{T}, Σ^{i} (k, l) = \frac{ψ_{i}^{∣ k - l ∣}}{1 - ψ_{i}^{2}} σ_{i}^{2}, k, l = 1, 2, \dots, n_{i}, \\ ρ_{s}^{i i} (k) = \frac{ψ_{i}^{s - k}}{1 - ψ_{i}^{2}} σ_{i}^{2}, k = 1, 2, \dots, n_{i}, ρ_{s}^{01} = 0_{1 \times n_{1}}, ρ_{s}^{10} = 0_{1 \times n_{0}}, \end{array}

(73)

$μ_{i} = \frac{c_{i}}{1 - ψ_{i}}$ , and Σⁱ(k, l) denotes the entry in the k^th row and l^th column of matrix Σⁱ. The result follows by replacing (73) in Theorem 1.
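The quantities in (73) are straightforward to tabulate for a given class. The sketch below, assuming NumPy, returns the mean vector, the covariance matrix Σⁱ, and the cross-covariance vector ρ_s for one AR(1) class. The function name `ar1_moments` and its argument names are hypothetical, and the sketch assumes s ≥ n so that the exponent s − k is nonnegative.

```python
import numpy as np

def ar1_moments(c, psi, sigma2, n, s):
    """Moments in (73) for one AR(1) class.

    c      : intercept of the AR(1) recursion
    psi    : autoregressive coefficient, |psi| < 1
    sigma2 : innovation variance
    n      : number of training points for this class
    s      : index of the future time point (assumed s >= n)
    """
    if not -1.0 < psi < 1.0:
        raise ValueError("covariance-stationarity requires |psi| < 1")
    if s < n:
        raise ValueError("this sketch assumes s >= n")
    k = np.arange(1, n + 1)
    gamma0 = sigma2 / (1.0 - psi ** 2)           # stationary variance
    mu = np.full(n, c / (1.0 - psi))             # constant mean vector
    Sigma = gamma0 * psi ** np.abs(k[:, None] - k[None, :])
    rho_s = gamma0 * psi ** (s - k)              # Cov(X_{t_s}, X_{t_k})
    return mu, Sigma, rho_s

mu, Sigma, rho_s = ar1_moments(c=1.0, psi=0.5, sigma2=3.0, n=4, s=6)
```

With these illustrative inputs the stationary variance is 3 / (1 − 0.25) = 4, so the diagonal of `Sigma` is 4 and the entries decay by a factor of 0.5 per step away from the diagonal.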

Proof of Corollary 9

Using the corollary assumptions in Theorem 8, we get

E [ε_{t_{s}, n_{0} + n_{1}}^{A R {(1)}_{ψ}}] = \frac{1}{2} \sum_{i = 0}^{1} (Φ (h_{s}^{i}, - k; ρ_{s}^{i}) + Φ (- h_{s}^{i}, k; ρ_{s}^{i})),

(74)

\begin{array}{l} h_{s}^{i} = \frac{μ}{2 \sqrt{a_{s}^{i}}}, ρ_{s}^{i} = - \frac{ψ^{(s - n_{i})} σ^{2}}{n_{i} (1 - ψ^{2})} (\frac{1 - ψ^{n_{i}}}{1 - ψ}) + ρ, \\ a_{s}^{i} = a - \frac{ψ^{(s - n_{i})} σ^{2}}{n_{i} (1 - ψ^{2})} (\frac{1 - ψ^{n_{i}}}{1 - ψ}), \end{array}

(75)

with k, a, b, ρ, and μ defined in (45). Let F(x, y; ρ) = Φ(x, y; ρ) + Φ(−x, −y; ρ), with Φ(x, y; ρ) defined in (20). Then using Scheffe’s Lemma [29] we have

2 \lim_{s \to \infty} E [ε_{t_{s}, n_{0} + n_{1}}^{A R {(1)}_{ψ}}] = F (\lim_{s \to \infty} h_{s}^{0}, - k; \lim_{s \to \infty} ρ_{s}^{0}) + F (\lim_{s \to \infty} h_{s}^{1}, - k; \lim_{s \to \infty} ρ_{s}^{1}) .

(76)

Note that, as s → ∞, the term $\frac{ψ^{(s - n_{i})} σ^{2}}{n_{i} (1 - ψ^{2})} (\frac{1 - ψ^{n_{i}}}{1 - ψ})$ in $ρ_{s}^{i}$ and $a_{s}^{i}$ (and hence in $h_{s}^{i}$ ) converges exponentially to 0, so that in the limit $a_{s}^{0} = a_{s}^{1} = a, h_{s}^{0} = h_{s}^{1} = h, ρ_{s}^{0} = ρ_{s}^{1} = ρ$ . The result follows similarly to the proof of Corollary 2.
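The vanishing term in (75) is the average of the cross-covariances ρ_s(k) from (73), summed as a geometric series. The sketch below, assuming NumPy, checks that identity and the geometric decay in s; the function names and the numerical values are illustrative only.

```python
import numpy as np

def correction_term(psi, sigma2, n, s):
    """Closed-form term appearing in (75)."""
    return psi ** (s - n) * sigma2 / (n * (1.0 - psi ** 2)) * (1.0 - psi ** n) / (1.0 - psi)

def mean_cross_cov(psi, sigma2, n, s):
    """Average over k = 1..n of rho_s(k) = psi^(s-k) sigma^2 / (1 - psi^2)."""
    k = np.arange(1, n + 1)
    return float(np.mean(sigma2 / (1.0 - psi ** 2) * psi ** (s - k)))

psi, sigma2, n = 0.6, 2.0, 5
for s in (10, 20, 40):
    # Each extra step in s multiplies the term by psi, so it decays geometrically.
    print(s, correction_term(psi, sigma2, n, s), mean_cross_cov(psi, sigma2, n, s))
```

Because consecutive values differ by the factor ψ with |ψ| < 1, the term shrinks at a geometric rate, which is the exponential convergence used in the proof.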

Proof of Corollary 10

With n₀ = n₁ = n, we have ρ = 0 in (45). From (76) we get $2 \lim_{s \to \infty} E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ}}] = G (h_{ψ}, l_{ψ}; 0)$ , with G(h_ψ, −k_ψ; 0) defined as in (24) and l_ψ = −k_ψ, where we use a subscript ψ to explicitly denote dependence of l and h on ψ. Since h_ψ l_ψ < 0, we can use a proof similar to that of Lemma 3 to compare different AR models. Specifically, suppose we prove that

ψ^{″} > ψ^{'} \Rightarrow \exists λ_{h} \in [0, 1) \land \exists λ_{l} \in [0, 1) : h_{ψ^{″}} = λ_{h} h_{ψ^{'}} \land l_{ψ^{″}} = λ_{l} l_{ψ^{'}}

(77)

Then, similar to proof of Lemma 3, we can prove G(h_ψ″, l_ψ″; 0) < G(h_ψ′, l_ψ′; 0), so that

2 \lim_{s \to \infty} E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ^{″}}}] = G (h_{ψ^{″}}, l_{ψ^{″}}; 0) < 2 \lim_{s \to \infty} E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ^{'}}}] = G (h_{ψ^{'}}, l_{ψ^{'}}; 0),

(78)

thereby proving the basic inequality in the corollary. We first demonstrate (77). Assume c₀ > c₁. We first prove that for ψ ∈ (−1, 1), we have $\frac{d l_{ψ}}{d ψ} < 0$ and $\frac{d h_{ψ}}{d ψ} > 0$ . It is easy to see that:

\frac{d l_{ψ}}{d ψ} = - \sqrt{\frac{n}{2}} \frac{(c_{0} - c_{1})}{σ} \frac{d (\sqrt{f_{ψ}})}{d ψ} = - \sqrt{\frac{n}{2 f_{ψ}}} \frac{(c_{0} - c_{1}) g_{ψ}}{σ {(d_{ψ})}^{2}},

where

\begin{array}{l} g_{ψ} = 2 n (1 + ψ^{2} - (n + 1) ψ^{n} + (n - 1) ψ^{(n + 2)}), \\ d_{ψ} = n - 2 ψ - n ψ^{2} + 2 ψ^{n + 1}, f_{ψ} = \frac{n - n ψ^{2}}{d_{ψ}} . \end{array}

(79)

From Descartes’ Rule of Signs [30], g_ψ has either zero or two positive roots. For n ≥ 2,

g_{ψ} = 4 n {(ψ - 1)}^{2} (\frac{(n - 1)}{2} ψ^{n} + \sum_{j = 1}^{n - 1} j ψ^{j} + \frac{1}{2}) .

(80)

Therefore, for all n, g_ψ has a double root at 1 and this is its only positive root. Similarly, we observe that if n is even, then the only negative root of g_ψ is a double root at −1. If n is odd, again from Descartes’ Rule of Signs [30], g_ψ has only one negative root, denoted by ψ₋. We show that ψ₋ ∈ (−∞, −1). Let ψ₋ = −ψ₊, ψ₊ > 0. Since n is odd, we need to have

1 + ψ_{+}^{2} + (n + 1) ψ_{+}^{n} - (n - 1) ψ_{+}^{(n + 2)} = 0.

(81)

Were ψ₊ ∈ (0, 1), this would imply $(n + 1) ψ_{+}^{n} > (n - 1) ψ_{+}^{(n + 2)}$ . Hence, (81) is not possible and ψ₋ ∈ (−∞, −1). Summarizing this result, we see that ψ ∈ (−1, 1) ⇒ g_ψ > 0 and therefore, $\frac{d l_{ψ}}{d ψ} < 0$ . It is straightforward to show

\frac{d h_{ψ}}{d ψ} = \sqrt{\frac{n}{2}} \frac{(c_{0} - c_{1})}{σ} \frac{d (\sqrt{r_{ψ}})}{d ψ} = \sqrt{\frac{n}{2 r_{ψ}}} \frac{(c_{0} - c_{1}) (4 n^{3} {(1 - ψ)}^{2} + g_{ψ})}{σ {(2 n^{2} {(1 - ψ)}^{2} + d_{ψ})}^{2}},

where $r_{ψ} = \frac{n - n ψ^{2}}{2 n^{2} {(1 - ψ)}^{2} + d_{ψ}}$ . Since for ψ ∈ (−1, 1) we have g_ψ > 0, then $\frac{d h_{ψ}}{d ψ} > 0$ . We set γ_ψ = λψ″ + (1 − λ)ψ′, where 0 ≤ λ ≤ 1. Now we check that (78) holds. Denoting G(h_{γ_ψ}, l_{γ_ψ}; 0) by G, we have

\begin{array}{l} \frac{d G}{d λ} = \frac{\partial G}{\partial h_{γ_{ψ}}} \frac{d h_{γ_{ψ}}}{d γ_{ψ}} \frac{d γ_{ψ}}{d λ} + \frac{\partial G}{\partial l_{γ_{ψ}}} \frac{d l_{γ_{ψ}}}{d γ_{ψ}} \frac{d γ_{ψ}}{d λ} \\ = \frac{\partial G}{\partial h_{γ_{ψ}}} \frac{d h_{γ_{ψ}}}{d γ_{ψ}} (ψ^{″} - ψ^{'}) + \frac{\partial G}{\partial l_{γ_{ψ}}} \frac{d l_{γ_{ψ}}}{d γ_{ψ}} (ψ^{″} - ψ^{'}) . \end{array}

(82)

Since ψ ∈ (−1, 1), 0 ≤ λ ≤ 1, and h_ψ l_ψ < 0, we can see that γ_ψ ∈ (−1, 1), h_{γ_ψ} l_{γ_ψ} < 0, $\frac{d h_{γ_{ψ}}}{d γ_{ψ}} > 0, \frac{d l_{γ_{ψ}}}{d γ_{ψ}} < 0$ , and from the proof of Lemma 3 in the appendix, $\frac{\partial G}{\partial h_{γ_{ψ}}} \leq 0$ and $\frac{\partial G}{\partial l_{γ_{ψ}}} \geq 0$ . Since ψ″ > ψ′, we see that $\frac{d G}{d λ} < 0$ . Similar to the proof of Lemma 3, integrating over λ results in G(h_ψ″, l_ψ″; 0) < G(h_ψ′, l_ψ′; 0). The same basic argument goes through for c₀ < c₁, in which case we have $\frac{d h_{γ_{ψ}}}{d γ_{ψ}} < 0, \frac{d l_{γ_{ψ}}}{d γ_{ψ}} > 0$ . The remaining results follow from the definition of $E [ε_{t_{s}, 2 n}^{I}]$ , where we have $E [ε_{t_{s}, 2 n}^{A R {(1)}_{ψ = 0}}] = E [ε_{t_{s}, 2 n}^{I}]$ .
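The polynomial facts used in this proof can be spot-checked numerically. The sketch below, assuming NumPy, compares the direct form of g_ψ in (79) with the factored form in (80), checks that g_ψ > 0 on a grid inside (−1, 1), and compares a central finite difference of f_ψ with g_ψ / d_ψ². The function names, the grid, and the step size are illustrative assumptions; the check concerns only the sign-determining polynomial, not the constant factors in the derivative of l_ψ.

```python
import numpy as np

def g_direct(psi, n):
    """g_psi as written in (79)."""
    return 2 * n * (1 + psi ** 2 - (n + 1) * psi ** n + (n - 1) * psi ** (n + 2))

def g_factored(psi, n):
    """g_psi as factored in (80), for n >= 2."""
    j = np.arange(1, n)
    inner = 0.5 * (n - 1) * psi ** n + np.sum(j * psi ** j) + 0.5
    return 4 * n * (psi - 1) ** 2 * inner

def d_poly(psi, n):
    """d_psi from (79)."""
    return n - 2 * psi - n * psi ** 2 + 2 * psi ** (n + 1)

def f_ratio(psi, n):
    """f_psi from (79)."""
    return (n - n * psi ** 2) / d_poly(psi, n)

def check(n, h=1e-6):
    """Return True if all three checks pass on a grid inside (-1, 1)."""
    for psi in np.linspace(-0.9, 0.9, 37):
        psi = float(psi)
        g = g_direct(psi, n)
        if not np.isclose(g, g_factored(psi, n), rtol=1e-9, atol=1e-9):
            return False
        if g <= 0:
            return False
        slope = (f_ratio(psi + h, n) - f_ratio(psi - h, n)) / (2 * h)
        if not np.isclose(slope, g / d_poly(psi, n) ** 2, rtol=1e-5, atol=1e-7):
            return False
    return True

print([check(n) for n in range(2, 9)])
```

Since f_ψ is increasing wherever g_ψ > 0, the sign of the derivative of l_ψ on (−1, 1) is determined by the sign of c₀ − c₁, as the proof states.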

Proof of Theorem 11

Since the $Z_{t}^{i}$ ’s are Gaussian, $X_{t}^{0}$ and $X_{t}^{1}$ are covariance-stationary and the vectors $X_{n_{0}}^{0} = {[X_{t_{1}}^{0}, X_{t_{2}}^{0}, \dots, X_{t_{n_{0}}}^{0}]}^{T}$ and $X_{n_{1}}^{1} = {[X_{t_{1}}^{1}, X_{t_{2}}^{1}, \dots, X_{t_{n_{1}}}^{1}]}^{T}$ are distributed normally as $X_{n_{i}}^{i} \sim N (μ^{i}, Σ^{i})$ , i = 0, 1, [18], where for k = 1, 2, …, n_i,

\begin{array}{l} μ^{i} = {[μ_{i}, μ_{i}, \dots, μ_{i}]}_{1 \times n_{i}}^{T}, ρ_{s}^{i i} (k) = 0, ρ_{s}^{01} = 0_{1 \times n_{1}}, \\ ρ_{s}^{10} = 0_{1 \times n_{0}}, Σ^{i} (k, l) = \begin{cases} σ_{i}^{2} (1 + θ_{i}^{2}), & k = l \\ σ_{i}^{2} θ_{i}, & ∣ k - l ∣ = 1 \\ 0, & \text{otherwise} \end{cases} \end{array}

(83)

where μ_i = c_i and Σⁱ(k, l) denotes the entry in the k^th row and l^th column of the matrix Σⁱ. The result follows by substituting (83) into Theorem 1.
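The covariance in (83) is tridiagonal, which makes it easy to build and inspect. The sketch below, assuming NumPy, constructs Σⁱ for one MA(1) class; the function name `ma1_covariance` and the numerical values are hypothetical and serve only as an illustration.

```python
import numpy as np

def ma1_covariance(theta, sigma2, n):
    """Tridiagonal MA(1) covariance matrix of (83) for n consecutive time points."""
    main = sigma2 * (1.0 + theta ** 2) * np.eye(n)               # k = l
    off = sigma2 * theta * (np.eye(n, k=1) + np.eye(n, k=-1))    # |k - l| = 1
    return main + off                                            # zero otherwise

Sigma = ma1_covariance(theta=0.5, sigma2=2.0, n=5)
# Smallest eigenvalue; it should be nonnegative for a valid covariance matrix.
print(np.linalg.eigvalsh(Sigma).min())
```

Unlike the AR(1) case, observations more than one step apart are uncorrelated here, which is why ρ_s vanishes once s exceeds the last training index by more than one.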

Proof of Corollary 13

From Theorem 11, since $α_{t_{s}}^{0} = α_{t_{s}}^{1}$ and max{n₀, n₁} + 1 < s, we have $2 E [ε_{t_{s}, 2 n}^{M A {(1)}_{θ}}] = G (h_{θ}, l_{θ}; 0)$ , for any s, with l_θ = −k_θ, h_θ and k_θ defined in Corollary 12, and G(h_θ, −k_θ; 0) defined as in (24). Similar to proof of Corollary 10, the present corollary follows by setting n₀ = n₁ = n and using

\begin{array}{l} \frac{d l_{θ}}{d θ} = b^{- \frac{3}{2}} \frac{(c_{0} - c_{1}) σ^{2}}{n} (2 θ + 2 - \frac{2}{n}), \\ \frac{d h_{θ}}{d θ} = - a^{- \frac{3}{2}} \frac{(c_{0} - c_{1}) σ^{2}}{4 n} (2 n θ + θ + 1 - \frac{1}{n}), \end{array}

(84)

where a and b are obtained from (53).

Contributor Information

Amin Zollanvari, Email: amin_zoll@neo.tamu.edu.

Jianping Hua, Email: jhua@tgen.org.

Edward R. Dougherty, Email: edward@ece.tamu.edu.

References

1. Hills M. Allocation rules and their error rates. J Royal Statist Soc Ser B (Methodological) 1966;28(1):1–31.
2. Sorum MJ. Estimating the expected probability of misclassification for a rule based on the linear discriminant function: Univariate normal case. Technometrics. 1973;15:329–339.
3. Zollanvari A, Braga-Neto UM, Dougherty ER. Exact representation of the second-order moments for resubstitution and leave-one-out error estimation for linear discriminant analysis in the univariate heteroskedastic Gaussian model. Pattern Recogn. 2012;45:908–917.
4. Basu JP, Odell PL. Effect of intraclass correlation among training samples on the misclassification probabilities of Bayes’ procedure. Pattern Recogn. 1974;6:13–16.
5. McLachlan GJ. Further results on the effect of interclass correlation among training samples in discriminant analysis. Pattern Recogn. 1976;8:273–275.
6. Tubbs JD. Effect of autocorrelated training samples on Bayes’ probability of misclassification. Pattern Recogn. 1980;12:351–354.
7. Lawoko CRO, McLachlan GJ. Discrimination with autocorrelated observations. Pattern Recogn. 1985;18:145–149.
8. Lawoko CRO, McLachlan GJ. Asymptotic error rates of the W and Z statistics when the training observations are dependent. Pattern Recogn. 1986;19:467–471.
9. Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd; 1925.
10. Martin JK, Hirschberg DS. Small sample statistics for classification error rates II: Confidence intervals and significance tests. 1996.
11. Zollanvari A, Braga-Neto UM, Dougherty ER. On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers. Pattern Recogn. 2009;42(11):2705–2723.
12. Zollanvari A, Braga-Neto UM, Dougherty ER. Joint sampling distribution between actual and estimated classification errors for linear discriminant analysis. IEEE Trans Inf Theory. 2010;56(2):784–804.
13. Zollanvari A, Braga-Neto UM, Dougherty ER. Analytic study of performance of error estimators for linear discriminant analysis. IEEE Trans Sig Proc. 2011;59(9):4238–4255.
14. Shumway RH, Unger AN. Linear discriminant functions for stationary time series. J Am Statist Assoc. 1974;69:948–956.
15. Kakizawa Y, Shumway R, Taniguchi M. Discrimination and clustering for multivariate time series. J Am Statist Assoc. 1998;93:328–340.
16. Kazakos D, Papantoni-Kazakos P. Spectral distance measuring between Gaussian processes. IEEE Trans Autom Control. 1980;25:950–959.
17. McLachlan GJ. The asymptotic distributions of the conditional error rate and risk in discriminant analysis. Biometrika. 1974;61:131–135.
18. Hamilton JD. Time Series Analysis. NJ: Princeton University Press; 1994.
19. Buyse M, Loi S, et al. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. Journal of the National Cancer Institute. 2006;98:1183–1192. doi: 10.1093/jnci/djj329.
20. van’t Veer L, Dai H, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–6. doi: 10.1038/415530a.
21. Schröder FH, Hugosson J, et al. Screening and prostate-cancer mortality in a randomized European study. New Eng J Med. 2009;360:1320–1328. doi: 10.1056/NEJMoa0810084.
22. Koelink CJL, van Hasselt P, et al. Tyrosinemia type I treated by NTBC: How does AFP predict liver cancer? Mol Genet Metab. 2006;89:310–315. doi: 10.1016/j.ymgme.2006.07.009.
23. Bast RC, Xu FJ, et al. CA 125: the past and the future. Int J Biol Markers. 1998;13:179–187. doi: 10.1177/172460089801300402.
24. Filella X, Molina R, et al. Prognostic value of CA 19.9 levels in colorectal cancer. Mol Genet Metab. 1992;216:55–59. doi: 10.1097/00000658-199207000-00008.
25. Frank TS, Deffenbaugh AM, et al. Clinical characteristics of individuals with germline mutations in BRCA1 and BRCA2: Analysis of 10,000 individuals. J Clin Oncol. 2002;20:1480–1490. doi: 10.1200/JCO.2002.20.6.1480.
26. Syrjakoski K, Kuukasjarvi T, et al. BRCA2 mutations in 154 Finnish male breast cancer patients. Neoplasia. 2004;6:541–545. doi: 10.1593/neo.04193.
27. Horii A, Nakatsuru S, et al. Frequent somatic mutations of the APC gene in human pancreatic cancer. Cancer Res. 1992;52:6696–6698.
28. Abramowitz M, Stegun IA. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover Publications; 1972.
29. Scheffe H. A useful convergence theorem for probability distributions. Ann Math Statist. 1947;18:434–438.
30. Anderson B, Jackson J, Sitharam M. Descartes’ rule of signs revisited. Amer Math Monthly. 1998;105:447–451.

