UNEXPECTED PROPERTIES OF BANDWIDTH CHOICE WHEN SMOOTHING DISCRETE DATA FOR CONSTRUCTING A FUNCTIONAL DATA CLASSIFIER

Raymond J Carroll; Aurore Delaigle; Peter Hall

doi:10.1214/13-AOS1158

Author manuscript; available in PMC: 2014 Oct 9.

Published in final edited form as: Ann Stat. 2013 Dec 17;41(6):2739–2767. doi: 10.1214/13-AOS1158


PMCID: PMC4191932 NIHMSID: NIHMS564605 PMID: 25309640

Abstract

The data functions that are studied in the course of functional data analysis are assembled from discrete data, and the level of smoothing that is used is generally that which is appropriate for accurate approximation of the conceptually smooth functions that were not actually observed. Existing literature shows that this approach is effective, and even optimal, when using functional data methods for prediction or hypothesis testing. However, in the present paper we show that this approach is not effective in classification problems. There a useful rule of thumb is that undersmoothing is often desirable, but there are several surprising qualifications to that approach. First, the effect of smoothing the training data can be more significant than that of smoothing the new data set to be classified; second, undersmoothing is not always the right approach, and in fact in some cases using a relatively large bandwidth can be more effective; and third, these perverse results are the consequence of very unusual properties of error rates, expressed as functions of smoothing parameters. For example, the orders of magnitude of optimal smoothing parameter choices depend on the signs and sizes of terms in an expansion of error rate, and those signs and sizes can vary dramatically from one setting to another, even for the same classifier.

Keywords: Centroid method, discrimination, kernel smoothing, quadratic discrimination, smoothing parameter choice, training data

1. Introduction

All supposedly “functional” data are actually observed discretely, sometimes on a grid and on other occasions at randomly scattered points. For example, in longitudinal data analysis the observation points are often widely spaced and irregularly placed, and substantial smoothing is commonly used to convert discrete data like these to functions. The impact of such smoothing has been addressed in the context of prediction or hypothesis testing for functional data; see, for example, Hall and Van Keilegom (2007), Panaretos, Kraus and Maddocks (2010), Wu and Müller (2011), Benhennia and Degras (2011), Cardot and Josserand (2011) and Cardot, Degras and Josserand (2013). The main conclusion of these papers has been that conventional rules for smoothing discrete data typically apply, and that smoothing parameters of standard size generally are appropriate.

In contrast, the present paper was motivated by numerical work indicating that, in the context of classifying functional data, smoothing parameters of highly nonstandard sizes are appropriate, and more generally that, even for a relatively simple classifier, there is no simple precept (even an asymptotic prescription of size) that leads to minimisation of error rate. If one had to give a rule, valid in some but by no means all cases, it would be to undersmooth, but even there unexpected caveats must be addressed.

For example, it turns out that the impact of smoothing the training data can be more significant than that of smoothing the new data to be classified. Indeed, the effect of smoothing the new data is characteristic of a parametric problem, rather than a nonparametric one. There, asymptotic arguments indicate that (sample size)^−1/2 is an appropriate bandwidth size for reducing the impact of smoothing to parametric levels, whereas (sample size)^−1/3 is the nearest analogue for smoothing the training data.

However, both these recommendations are incorrect in many cases. Depending on the signs and sizes of certain functionals of the data distributions, it can be optimal to use smoothing parameters that are an order of magnitude smaller, or an order of magnitude larger, than these. From some viewpoints the need for a low level of smoothing is intuitively clear. Indeed, we expect that relatively minor features of a curve, of the sort that might disappear if we were too enthusiastic in the smoothing step, could have important information to convey in a classification analysis. On the other hand, our results show that very high levels of smoothing are sometimes advantageous.

We drew these conclusions after studying three different classifiers for functional data: the standard centroid-based method, the scale-adjusted form of that approach, and a version for functional data of quadratic discrimination. Our conclusions are valid for all three approaches, although they contradict conclusions which are well known for standard nonparametric approximations to the Bayes classifier in multivariate, rather than functional-data, settings. Specifically, for univariate and multivariate data, and nonparametric Bayes classifiers, conventional smoothing parameters, for example, those chosen using standard plug-in rules for function estimation, typically are of the correct order even though they do not quite minimise asymptotic classification error; see, for example, Hall and Kang (2005). Moreover, there does not exist a version of our results in univariate or multivariate settings, since there is no analogue in such cases of the “lattice effect,” represented by the m_kj's.

To these comments, we should add that in practice there is relatively little difficulty in choosing smoothing parameters to minimise error rate; cross-validation is usually effective. Our aim in this paper is therefore not to develop methods for choosing the bandwidth optimally, or nearly optimally, in classification problems, but to provide an understanding of the many aspects of those problems that conspire together to determine the optimal choice.

2. Model and methodology

2.1. Model

We consider n₀ (resp., n₁) unknown random functions {g_0j, 1 ≤ j ≤ n₀} (resp., {g_1j, 1 ≤ j ≤ n₁}) coming from two populations. We observe a training sample of the data pairs $D_{k j} = \{(X_{k j i}, Y_{k j i}), 1 \leq i \leq m_{k j}\}$, for 1 ≤ j ≤ n_k and k = 0, 1, corresponding to noisy versions of the g_kj's sampled at a discrete set of random points (i.e., the X_kji's) and generated by the model

Y_{k j i} = g_{k j} (X_{k j i}) + ε_{k j i},

(2.1)

where k indexes the population, Π_k, from which the data in $D_{k j}$ came, j denotes the index of an individual drawn from Π_k, and i is the index of a data pair (X_kji, Y_kji) for the jth individual from the kth population.

The g_kj are random functions defined on a compact interval $I$ , but observed only at m_kj points X_kj1, …, X_kjm_kj. These points may be fixed or random, and although we shall develop our arguments in the random case, they can easily be extended to the fixed case. We assume that each g_kj has two bounded derivatives on $I$ ; the respective sequences of X's and ε's are each identically distributed with distributions that do not depend on the g's; the g's, X's and ε's are all mutually independent; the X's are supported on $I$ ; and the ε's have zero mean and finite variance.

We also observe a new data set $D = \{(X_{i}, Y_{i}), 1 \leq i \leq m\}$, similar to the $D_{k j}$'s except that in this case we do not know which population the data come from. Here,

Y_{i} = g (X_{i}) + ε_{i},

(2.2)

where the function g, the X's and the ε's have the properties given in the previous paragraph. Using the training data, we wish to determine whether $D$ came from Π₀ or Π₁.

In the functional data literature [see, e.g., Ramsay and Silverman (2005)], when the data are noisy, it is common to preprocess them prior to further analysis. Typically, this is done by smoothing the data in some way, for example, through a spline or kernel smoother, thereby obtaining, from the data in $D_{k j}$ and $D$ , estimators ${\hat{g}}_{k j}$ and $\hat{g}$ of g_kj and g, respectively. In the classification context, once these estimators have been derived, they are plugged into functional data classifiers, replacing there the unobserved functions g and g_kj by their estimators $\hat{g}$ and ${\hat{g}}_{k j}$ . Our aim in this paper is to describe the application of estimators $\hat{g}$ and ${\hat{g}}_{k j}$ of g and g_kj, and in particular to describe the influence of tuning parameters used to construct them, when the aim is classification rather than just function estimation.

2.2. Estimating g, g_kj and their mean and covariance functions

There are several ways to obtain nonparametric estimators of the functions g and g_kj, but the most popular ones are spline and local linear methods. They have similar properties, but since local linear estimators are much more tractable theoretically, we shall use these in this work. For $x \in I$ , the local linear estimators of g and g_kj are defined by

\hat{g} (x) = \frac{U_{2} (x) V_{0} (x) - U_{1} (x) V_{1} (x)}{U_{2} (x) U_{0} (x) - U_{1}^{2} (x)},

{\hat{g}}_{k j} (x) = \frac{U_{k j 2} (x) V_{k j 0} (x) - U_{k j 1} (x) V_{k j 1} (x)}{U_{k j 2} (x) U_{k j 0} (x) - U_{k j 1}^{2} (x)},

(2.3)

where

U_{ℓ} (x) = \frac{1}{m} \sum_{i = 1}^{m} {(\frac{x - X_{i}}{h})}^{ℓ} K_{h} (x - X_{i}),

(2.4)

V_{ℓ} (x) = \frac{1}{m} \sum_{i = 1}^{m} Y_{i} {(\frac{x - X_{i}}{h})}^{ℓ} K_{h} (x - X_{i}),

(2.5)

U_{k j ℓ} (x) = \frac{1}{m_{k j}} \sum_{i = 1}^{m_{k j}} {(\frac{x - X_{k j i}}{h_{1}})}^{ℓ} K_{h_{1}} (x - X_{k j i}),

(2.6)

V_{k j ℓ} (x) = \frac{1}{m_{k j}} \sum_{i = 1}^{m_{k j}} Y_{k j i} {(\frac{x - X_{k j i}}{h_{1}})}^{ℓ} K_{h_{1}} (x - X_{k j i}),

K is a kernel function, h > 0 and h₁ > 0 are bandwidths, and K_h(x) = K(x/h)/h. See, for example, Fan and Gijbels (1996). For simplicity, throughout we use the same bandwidth h₁ for each population and each individual, but we could have replaced h₁ by bandwidths that depended on k and j, as we do in our numerical work.
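The estimator at (2.3) can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names `epanechnikov` and `local_linear` are hypothetical, and the Epanechnikov kernel is simply one convenient choice of compactly supported symmetric density. The same function serves for $\hat{g}$ (with bandwidth h) and for ${\hat{g}}_{k j}$ (with bandwidth h₁).

```python
import numpy as np

def epanechnikov(u):
    """Compactly supported, symmetric kernel density (an illustrative choice)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_linear(x_eval, X, Y, h, kernel=epanechnikov):
    """Local linear estimate at the points x_eval, following (2.3)-(2.5)."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    u = (x_eval[:, None] - X[None, :]) / h      # (x - X_i) / h, shape (n_eval, m)
    w = kernel(u) / h                            # K_h(x - X_i)
    U0, U1, U2 = (np.mean(u**l * w, axis=1) for l in range(3))
    V0, V1 = (np.mean(Y * u**l * w, axis=1) for l in range(2))
    denom = U2 * U0 - U1**2
    # Assumption: return NaN where too few design points fall within h of x,
    # since the ratio at (2.3) is then undefined.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, (U2 * V0 - U1 * V1) / denom, np.nan)
```

A useful sanity check is that a local linear fit reproduces noise-free linear data exactly wherever the denominator is positive, whatever the bandwidth.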

The classifiers we consider in this work require estimators of the population means and covariances. For k = 0, 1, let μ_k denote the mean function

μ_{k} = E_{k} (g) = E_{k} (g_{k j}),

(2.7)

where E_k represents expectation under the assumption that the data come from Π_k. Also, let G_k be the covariance function, defined by G_k(u, v) = cov_k{g(u), g(v)} = E_k{g(u)g(v)} − μ_k(u)μ_k(v), where cov_k denotes covariance when the data come from Π_k. Estimators ${\hat{μ}}_{k}$ and ${\hat{G}}_{k}$ of μ_k and G_k are defined in the standard way by the empirical mean and covariance functions, but replacing, in the definitions of these estimators, the unobserved g_kj by ${\hat{g}}_{k j}$ :

{\hat{μ}}_{k} = \frac{1}{n_{k}} \sum_{j = 1}^{n_{k}} {\hat{g}}_{k j},

(2.8)

{\hat{G}}_{k} (u, v) = \frac{1}{n_{k}} \sum_{j = 1}^{n_{k}} \{{\hat{g}}_{k j} (u) - {\hat{μ}}_{k} (u)\} \{{\hat{g}}_{k j} (v) - {\hat{μ}}_{k} (v)\} .

(2.9)

See, for example, Ramsay and Silverman (2005), Chapter 2.
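As a rough sketch of (2.8) and (2.9), suppose the smoothed curves ${\hat{g}}_{k j}$ have already been evaluated on a common grid of points in $I$ and stored row by row in an array. The function name `mean_and_covariance` and the gridded representation are assumptions made here for illustration.

```python
import numpy as np

def mean_and_covariance(G_hat):
    """Empirical mean and covariance of smoothed curves on a common grid.

    G_hat has shape (n_k, T); row j holds g_hat_kj at the T grid points.
    Returns mu_hat of shape (T,) as at (2.8) and the covariance surface of
    shape (T, T) as at (2.9), using the divisor n_k (not n_k - 1).
    """
    G_hat = np.asarray(G_hat, dtype=float)
    n_k = G_hat.shape[0]
    mu_hat = G_hat.mean(axis=0)
    centred = G_hat - mu_hat
    cov_hat = centred.T @ centred / n_k
    return mu_hat, cov_hat
```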

Consider the spectral decomposition of the covariance function

G_{k} (u, v) = \sum_{ℓ = 1}^{\infty} θ_{k ℓ} ψ_{k ℓ} (u) ψ_{k ℓ} (v),

(2.10)

where (θ_kℓ, ψ_kℓ) is an (eigenvalue, eigenfunction) pair for the linear operator G_k defined by G_k(ψ)(u) = ∫ G_k(u, v)ψ(v) dv, and where, following convention, we have used the notation G_k for both the operator and the covariance. The terms in (2.10) are ordered such that θ_k1 ≥ θ_k2 ≥ ⋯ ≥ 0. If g is drawn from Π_k then we can write

g (x) = μ_{k} (x) + \sum_{ℓ = 1}^{\infty} Z_{k ℓ} θ_{k ℓ}^{1 ∕ 2} ψ_{k ℓ} (x),

(2.11)

where μ_k = E_k(g) denotes the mean of the random process of which g is a realisation, $Z_{k ℓ} = θ_{k ℓ}^{- 1 ∕ 2} \int (g - μ_{k}) ψ_{k ℓ}$, and the Z_kℓ's (for ℓ = 1, 2, …) comprise a sequence of uncorrelated random variables with zero mean and unit variance. The quantities θ_kℓ and ψ_kℓ can be estimated consistently by the eigenvalues and eigenfunctions ${\hat{θ}}_{k ℓ}$ and ${\hat{ψ}}_{k ℓ}$ of the linear operator ${\hat{G}}_{k}$, defined by ${\hat{G}}_{k} (ψ) (u) = \int {\hat{G}}_{k} (u, v) ψ (v) d v$, with the covariance estimator ${\hat{G}}_{k}$ defined as at (2.9):

{\hat{G}}_{k} (u, v) = \sum_{ℓ = 1}^{\infty} {\hat{θ}}_{k ℓ} {\hat{ψ}}_{k ℓ} (u) {\hat{ψ}}_{k ℓ} (v),

(2.12)

where ${\hat{θ}}_{k 1} \geq {\hat{θ}}_{k 2} \geq \dots \geq 0$ , and, since ${\hat{θ}}_{k ℓ} = 0$ for all ℓ > n_k, all but the first n_k terms in the series at (2.12) vanish. See Hall and Hosseini-Nasab (2006, 2009) for properties of these estimators in the case where g and g_kj are observed; see also Li and Hsing (2010a, 2010b) for other cases.
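One common way to approximate the decomposition at (2.12) numerically is to discretise the integral operator on an equally spaced grid and solve a symmetric matrix eigenproblem. The sketch below assumes a grid spacing `dt`, a Riemann-sum approximation to the integral, and a hypothetical function name `functional_pca`; none of these choices is prescribed by the paper.

```python
import numpy as np

def functional_pca(cov_hat, dt):
    """Eigenvalues and eigenfunctions of a gridded covariance surface.

    cov_hat: symmetric array of shape (T, T) on an equally spaced grid.
    dt: grid spacing, so that an integral over I is approximated by sum(...) * dt.
    Returns theta (descending, clipped at zero) and psi with eigenfunctions in
    rows, normalised so that sum(psi[l]**2) * dt = 1.
    """
    cov_hat = np.asarray(cov_hat, dtype=float)
    evals, evecs = np.linalg.eigh(cov_hat * dt)   # discretised operator
    order = np.argsort(evals)[::-1]               # largest eigenvalue first
    theta = np.clip(evals[order], 0.0, None)      # remove tiny negative round-off
    psi = evecs[:, order].T / np.sqrt(dt)         # unit L2 norm on the grid
    return theta, psi
```

The sign of each eigenfunction is arbitrary, which does not matter for statistics that use only squared projections.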

2.3. Constructing classifiers

Classifiers for functional data have received a great deal of attention in the literature. See, for example, Vilar and Pértega (2004), Biau, Bunea and Wegkamp (2005), Fromont and Tuleau (2006), Leng and Müller (2006), López-Pintado and Romo (2006), Rossi and Villa (2006), Cuevas, Febrero and Fraiman (2007), Wang, Ray and Mallick (2007), Berlinet, Biau and Rouvière (2008), Epifanio (2008), Araki et al. (2009), Delaigle and Hall (2012) and Delaigle, Hall and Bathia (2012).

In those papers the authors suggest methods for constructing classifiers, but so far the theoretical impact of smoothing, that is, the impact of using $\hat{g}$ and ${\hat{g}}_{k j}$ instead of g and g_kj when constructing classifiers, has been largely ignored in the literature. In this paper, we study this impact of smoothing for three relatively simple functional classifiers: the centroid classifier, or Rocchio classifier [see, e.g., Manning, Raghavan and Schütze (2008)], commonly used for classifying high-dimensional data; a scaled version of this classifier, which we define below in a general way; and a version for functional data of Fisher's quadratic discriminant, studied, for example, by Leng and Müller (2006) and Delaigle and Hall (2012). These classifiers are usually defined in terms of the functions g and g_kj, and here we shall define them in terms of $\hat{g}$ and ${\hat{g}}_{k j}$. The standard versions of these classifiers are obtained by replacing $\hat{g}$ and ${\hat{g}}_{k j}$ by g and g_kj. The functions ${\hat{g}}_{k j}$ appear only implicitly through the estimated means and covariance functions constructed in Section 2.2.

In the present setting, the centroid-based classifier assigns the curve g, observed through $D$ , to Π₀ if the statistic

S (\hat{g}) = \int_{I} \{\hat{g} (t) - {\hat{μ}}_{0} (t)\}^{2} d t - \int_{I} \{\hat{g} (t) - {\hat{μ}}_{1} (t)\}^{2} d t

(2.13)

is negative, and to Π₁ if $S (\hat{g}) > 0$ .
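The rule based on (2.13) can be sketched on a grid as follows, assuming that $\hat{g}$, ${\hat{μ}}_{0}$ and ${\hat{μ}}_{1}$ have been evaluated at equally spaced points with spacing `dt` and that integrals are approximated by Riemann sums. The function names are hypothetical, and the treatment of the tie S = 0, which the text leaves open, is an arbitrary choice made here.

```python
import numpy as np

def centroid_statistic(g_hat, mu0_hat, mu1_hat, dt):
    """Riemann-sum approximation to S(g_hat) at (2.13)."""
    g_hat, mu0_hat, mu1_hat = (
        np.asarray(a, dtype=float) for a in (g_hat, mu0_hat, mu1_hat)
    )
    dist0 = np.sum((g_hat - mu0_hat) ** 2) * dt   # squared L2 distance to centroid 0
    dist1 = np.sum((g_hat - mu1_hat) ** 2) * dt   # squared L2 distance to centroid 1
    return dist0 - dist1

def centroid_classify(g_hat, mu0_hat, mu1_hat, dt):
    """Return 0 if S < 0 and 1 otherwise (ties go to 1 by assumption)."""
    return 0 if centroid_statistic(g_hat, mu0_hat, mu1_hat, dt) < 0 else 1
```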

A scaled version of the centroid classifier, which accommodates differences in scales between the two populations, can be defined by replacing S in (2.13) by

S_{scale} (\hat{g}) = \frac{1}{s_{0}^{2}} \int_{I} \{\hat{g} (t) - {\hat{μ}}_{0} (t)\}^{2} d t - \frac{1}{s_{1}^{2}} \int_{I} \{\hat{g} (t) - {\hat{μ}}_{1} (t)\}^{2} d t + \log (\frac{s_{0}^{2}}{s_{1}^{2}}),

(2.14)

where $s_{k}^{2}$ is an estimator of the scale of population Π_k. For example, we might take $s_{k}^{2}$ to equal ${n_{k}}^{- 1} \sum_{j = 1}^{n_{k}} \int_{I} {({\hat{g}}_{k j} - {\hat{μ}}_{k})}^{2}$ , the version we used in our numerical work, or $\int_{I} \int_{I} {\hat{G}}_{k} (u, v) ψ (u) ψ (v) d u d v$ where ψ is open to choice; or $s_{0}^{2}$ and $s_{1}^{2}$ could be selected empirically by minimising a cross-validation estimator of classification error. The definition at (2.14) should be compared with those at (2.15) and (2.16), below. The form of (2.14), and also of (2.15) and (2.16), is motivated by likelihood-ratio statistics for Gaussian data.
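A minimal sketch of (2.14), using the first choice of $s_{k}^{2}$ mentioned above (the average integrated squared deviation of the smoothed training curves from their mean), might look as follows. The gridded representation, the Riemann-sum integrals and the function names are assumptions made for illustration.

```python
import numpy as np

def scale_estimate(G_hat, dt):
    """s_k^2 = n_k^{-1} * sum_j int (g_hat_kj - mu_hat_k)^2, via a Riemann sum."""
    G_hat = np.asarray(G_hat, dtype=float)
    centred = G_hat - G_hat.mean(axis=0)
    return np.sum(centred**2) * dt / G_hat.shape[0]

def scaled_centroid_statistic(g_hat, G0_hat, G1_hat, dt):
    """Approximation to S_scale(g_hat) at (2.14).

    G0_hat, G1_hat: arrays of shape (n_k, T) holding the smoothed training
    curves of each population on the same grid as g_hat.
    """
    g_hat = np.asarray(g_hat, dtype=float)
    terms = []
    for G in (G0_hat, G1_hat):
        G = np.asarray(G, dtype=float)
        dist2 = np.sum((g_hat - G.mean(axis=0)) ** 2) * dt
        terms.append((dist2, scale_estimate(G, dt)))
    (d0, s0), (d1, s1) = terms
    return d0 / s0 - d1 / s1 + np.log(s0 / s1)
```

As with (2.13), a negative value points to Π₀ and a positive value to Π₁.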

A version for functional data of Fisher's quadratic discriminant is based on

T (\hat{g}) = \sum_{ℓ = 1}^{p} [\frac{1}{{\hat{θ}}_{0 ℓ}} \{\int_{I} (\hat{g} - {\hat{μ}}_{0}) {\hat{ψ}}_{0 ℓ}\}^{2} - \frac{1}{{\hat{θ}}_{1 ℓ}} \{\int_{I} (\hat{g} - {\hat{μ}}_{1}) {\hat{ψ}}_{1 ℓ}\}^{2} + \log (\frac{{\hat{θ}}_{0 ℓ}}{{\hat{θ}}_{1 ℓ}})],

(2.15)

where $\hat{g}$ and ${\hat{μ}}_{k}$ are as at (2.3) and (2.8), $({\hat{θ}}_{0 ℓ}, {\hat{θ}}_{1 ℓ})$ are as at (2.12) and p is a positive truncation parameter. (Here we assume, as is often the case in practice, that the prior probabilities of each population are unknown and estimated by 1/2. A more general version of the classifier can be used if these probabilities are estimated by other values, but this does not alter our main conclusions.) We assign the new data set $D$ to Π₀ if $T (\hat{g}) \leq 0$, and to Π₁ otherwise. Of course, the statistic $T (\hat{g})$, at (2.15), is just an empirical version of the quantity

T_{0} (g) = \sum_{ℓ = 1}^{p} [\frac{1}{θ_{0 ℓ}} \{\int_{I} (g - μ_{0}) ψ_{0 ℓ}\}^{2} - \frac{1}{θ_{1 ℓ}} \{\int_{I} (g - μ_{1}) ψ_{1 ℓ}\}^{2} + \log (\frac{θ_{0 ℓ}}{θ_{1 ℓ}})] .

(2.16)

If the functions g are Gaussian, and the first p eigenvalues, in versions of (2.10) and (2.12) for either population, are distinct and nonzero, and the remaining eigenvalues vanish, then the classifier based on T₀(g), at (2.16), is optimal in the sense of having least classification error among all classifiers, since it is, after all, just a likelihood ratio statistic. When the eigenvalues and eigenfunctions are estimated from data, as at (2.15), the classifier is asymptotically optimal. Bearing in mind the effectiveness of Fisher's discriminant analysis in the case of vector-valued data, even when the data are not normal, the classifier based on $T (\hat{g})$ is an attractive choice even in non-Gaussian cases.
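The statistic at (2.15) can be sketched on a grid in the same spirit. The code below assumes that the means, eigenvalues and eigenfunctions of each population are already available on an equally spaced grid with spacing `dt`, that the first p eigenvalues are strictly positive, and that integrals are approximated by Riemann sums; the function name `quadratic_statistic` is hypothetical.

```python
import numpy as np

def quadratic_statistic(g_hat, mu_hats, thetas, psis, p, dt):
    """Approximation to T(g_hat) at (2.15).

    mu_hats[k]: mean of population k on the grid, shape (T,).
    thetas[k]:  eigenvalues of population k, at least p of them, all positive.
    psis[k]:    eigenfunctions in rows, shape (>= p, T), with sum(psi**2) * dt = 1.
    """
    g_hat = np.asarray(g_hat, dtype=float)
    scaled_sq_scores = []
    for k in (0, 1):
        centred = g_hat - np.asarray(mu_hats[k], dtype=float)
        proj = (np.asarray(psis[k], dtype=float)[:p] @ centred) * dt   # int (g - mu_k) psi_kl
        scaled_sq_scores.append(proj**2 / np.asarray(thetas[k], dtype=float)[:p])
    log_ratio = np.log(
        np.asarray(thetas[0], dtype=float)[:p] / np.asarray(thetas[1], dtype=float)[:p]
    )
    return float(np.sum(scaled_sq_scores[0] - scaled_sq_scores[1] + log_ratio))
```

Following the rule stated above, a value of at most zero assigns the new curve to Π₀ and a positive value assigns it to Π₁.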

3. Theoretical properties

3.1. Standard centroid-based classifier

In this section, we derive properties of the centroid classifier based on the estimators $\hat{g}$ and ${\hat{g}}_{k j}$ , and in particular we examine the impact of smoothing. First, we introduce notation. Let n = n₀ + n₁ (hence n is a positive integer sequence diverging to infinity), let m = m(n) be of the same size as m_kj [see (3.7) below], and write $σ_{ε k}^{2}$ for the variance of the experimental errors ε_kji and ε_i, in (2.1) and (2.2), when the data come from Π_k. Let

\bar{g}_{k} = \frac{1}{n_{k}} \sum_{j = 1}^{n_{k}} g_{k j}, \quad ν_{k} = n_{k}^{2} (\sum_{j = 1}^{n_{k}} m_{k j}^{- 1})^{- 1}

(3.1)

and define

\begin{matrix} b_{k 0} & = \int_{I} (μ_{1} - μ_{0}) \{2 μ_{k} - (μ_{0} + μ_{1})\}, \\ β_{k 0} & = \int_{I} (\bar{g}_{1} - \bar{g}_{0}) \{2 μ_{k} - (\bar{g}_{0} + \bar{g}_{1})\}, \end{matrix}

(3.2)

σ_{k}^{2} = 4 κ_{2} \int_{I} \int_{I} (\bar{g}_{1} - \bar{g}_{0}) (x_{1}) (\bar{g}_{1} - \bar{g}_{0}) (x_{2}) G_{k} (x_{1}, x_{2}) d x_{1} d x_{2},

(3.3)

τ_{k}^{2} = 4 κ_{2} \int_{I} \int_{I} (μ_{1} - μ_{0}) (x_{1}) (μ_{1} - μ_{0}) (x_{2}) G_{k} (x_{1}, x_{2}) d x_{1} d x_{2},

(3.4)

where κ₂ = ∫ u²K(u)du. Finally, put κ = ∫ K², and let $I$ be the compact interval that equals the support of the density f_X of the design points X_i and X_kji at which the functions g and g_kj are observed.

We make the following assumptions:

(a) The distribution of the experimental error ε_kji and ε_i, in (2.1) and (2.2), has zero mean and all moments finite, may depend on k, and has variance $σ_{ε k}^{2}$; (b) the density f_X of the variables X_kji and X_i does not depend on i, j or k; (c) f_X has two bounded derivatives, f_X(x) ≥ C > 0 for all $x \in I$, and $f_{X}^{″}$ is Hölder continuous on the support $I$ of f_X.

(3.5)

(a) The functions g and g_kj associated with the populations Π_k, for k = 0, 1, are realisations of Gaussian processes, have uniformly bounded covariance functions G_k and mean functions μ_k, both depending only on k, and satisfy $τ_{k}^{2} > 0$ for k = 0, 1; and (b) with probability 1 the functions are uniformly bounded and have Hölder-continuous second derivatives with the property that, for a constant C > 0, all moments of $\sup_{x_{1}, x_{2}} |g^{″} (x_{1}) - g^{″} (x_{2})| / |x_{1} - x_{2}|^{C}$ are finite when g is sampled from either Π₀ or Π₁.

(3.6)

(a) For a constant C > 0, the results $h^{(1)} = O (n^{- C})$ and $n^{1 - C} h^{(1)} \to \infty$ hold for $h^{(1)} = h$ and $h^{(1)} = h_{1}$; (b) the kernel K is a symmetric, nonnegative, compactly supported and Hölder continuous probability density; and (c) for each k the values of $m^{- 1} \min_{j} m_{k j}$, $m^{- 1} \max_{j} m_{k j}$ and n₀/n₁ are bounded away from zero and infinity as n → ∞, and, for constants C₁ and C₂ satisfying 0 < C₁ < C₂ < ∞, m and n₀ lie between $n^{C_{1}}$ and $n^{C_{2}}$.

(3.7)

The assumption in (3.5)(c) that f_X is bounded away from zero on its support is only a technical requirement, and is unnecessary in practice. To make this clear, in our numerical work we shall take f_X to be a normal density, and show that the conclusions of Theorem 1 are nevertheless reflected clearly.

Let ${err}_{k} = P_{k} \{(- 1)^{k} S (\hat{g}) > 0\}$ denote the probability that the standard centroid-based classifier, based on the statistic $S (\hat{g})$ at (2.13), commits an error when the data set $D$ actually comes from Π_k. Theorem 1 below describes the asymptotic behaviour of err_k, and highlights the effect of the smoothing parameters h and h₁, used to construct the estimators $\hat{g}$ and ${\hat{g}}_{k j}$ of g and g_kj, on the classifier. A proof is given in Appendix A.1.

Theorem 1. Assume that (3.5)–(3.7) hold, and let Ψ₀ = 1 – Φ and Ψ₁ = Φ, where Φ denotes the c.d.f. of a standard normal random variable. Then

{err}_{k} = {err}_{k 0} + h^{2} c_{k} + h_{1}^{2} c_{k 1} + \frac{d_{k 0}}{ν_{0} h_{1}} + \frac{d_{k 1}}{ν_{1} h_{1}} + O \{m^{- 1} + (m h)^{- 2}\} + o (h^{2} + h_{1}^{2} + \frac{1}{ν_{0} h_{1}}),

(3.8)

where err_k0 = E_k{Ψ_k(−β_k0/σ_k)}, $c_{k} = κ_{2} α_{k} \int_{I} (μ_{1} - μ_{0}) μ_{k}^{″}$, $c_{k 1} = - κ_{2} α_{k} \int_{I} (μ_{1} - μ_{0}) μ_{1 - k}^{″}$, $d_{k j} = (- 1)^{j} α_{k} σ_{ε j}^{2} κ \int_{I} f_{X}^{- 1}$, with $α_{k} = (- 1)^{k} τ_{k}^{- 1} ϕ (b_{k 0} ∕ τ_{k})$, and where ϕ denotes the standard normal density function.

The leading term err_k0 on the right-hand side of (3.8) does not depend in any way on the bandwidths h and h₁. It does involve the training sample sizes n₀ and n₁, and in particular does not equal the asymptotic limit of err_k as n increases, since that limit is given by Ψ_k(−b_k0/τ_k), but the effects of the bandwidths are all confined to subsequent terms on the right-hand side of (3.8). The terms in h² and $h_{1}^{2}$ represent contributions to classification error arising from biases of the estimators $\hat{g}$ and ${\hat{g}}_{k j}$ , and the terms in (ν₀h₁)⁻¹ and (ν₁h₁)⁻¹ are contributions from the variances of the estimators ${\hat{g}}_{k j}$ .

A priori it might be thought that h₁ should be chosen smaller than h, since the total number of observations in the training sample, Σ_j m_kj for k = 0 and 1, is an order of magnitude larger than the number of observations, m, in the new data set $D$. However, Theorem 1 shows that the influence of bandwidths on error rate is much more complex than this.

For one thing, there are no terms in (mh)⁻¹ on the right-hand side of (3.8). (Section 3.1.2 will explain the reason for this.) As a result, the terms on the right-hand side of (3.8) that depend on h can be rendered equal to O(m⁻¹) simply by taking h equal to a constant multiple of m^−1/2. As noted in Remark 1, below, this level of contribution to the error rate is generally impossible to remove, even in simple parametric problems. Therefore the contribution of h to error rate cannot be rendered smaller than m⁻¹. However, in some instances choosing h to be an order of magnitude larger or smaller than m^−1/2 can be beneficial; see Section 3.1.1 below.
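The claim about h of size m^−1/2 follows from direct substitution into the h-dependent terms of (3.8). In the display below, C > 0 denotes an arbitrary constant introduced here only for illustration.

```latex
h = C m^{-1/2}
\quad\Longrightarrow\quad
h^{2} = C^{2} m^{-1},
\qquad
(m h)^{-2} = \bigl(C m^{1/2}\bigr)^{-2} = C^{-2} m^{-1},
\qquad\text{so that}\qquad
h^{2} c_{k} + O\{m^{-1} + (m h)^{-2}\} = O(m^{-1}).
```

Both h-dependent contributions are therefore balanced at the order m⁻¹ of the remainder that is present in any case.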

The terms in h₁ on the right-hand side of (3.8) are a different matter because each of c_k1 and $d_{k 0} ν_{0}^{- 1} + d_{k 1} ν_{1}^{- 1}$ can be either positive or negative. Depending on the signs and sizes of c_k1 and $d_{k 0} ν_{0}^{- 1} + d_{k 1} ν_{1}^{- 1}$ , it can be optimal to take h₁ to be of order $ν_{k}^{- 1 ∕ 3}$ , which achieves a trade-off between terms in $h_{1}^{2}$ and (ν_kh₁)⁻¹, or to take h₁ to decrease to zero more quickly or to converge to a positive constant, as n increases; see Section 3.1.1 below.

Therefore, the impact that smoothing has on classification performance is much more subtle than it might have appeared. We discuss these issues in more detail in the next sections.

3.1.1. Sizes of h and h₁ that optimise overall error rate

Using Theorem 1 we can deduce the orders of magnitudes of h and h₁ that minimise the error rate of the classifier, that is, that minimise the probability of misclassification,

err = π_{0} {err}_{0} + π_{1} {err}_{1},

(3.9)

where err₀ and err₁ are as in (3.8), π_k denotes the prior probability attached to population Π_k, and π₀ + π₁ = 1. Using (3.8) and (3.9), we can write

err = {err}^{0} + c^{0} h^{2} + c_{1}^{0} h_{1}^{2} + d^{0} (ν_{0} h_{1})^{- 1} + O \{m^{- 1} + (m h)^{- 2}\} + o \{h^{2} + h_{1}^{2} + (ν_{0} h_{1})^{- 1}\},

(3.10)

where err⁰ = π₀err₀₀ + π₁err₁₀ (recall that err_k0 does not depend on the bandwidths),

c^{0} = κ_{2} \int (μ_{1} - μ_{0}) \{π_{0} \frac{μ_{0}^{″}}{τ_{0}} ϕ (\frac{b_{00}}{τ_{0}}) - π_{1} \frac{μ_{1}^{″}}{τ_{1}} ϕ (\frac{b_{10}}{τ_{1}})\},

c_{1}^{0} = κ_{2} \int (μ_{0} - μ_{1}) \{π_{0} \frac{μ_{1}^{″}}{τ_{0}} ϕ (\frac{b_{00}}{τ_{0}}) - π_{1} \frac{μ_{0}^{″}}{τ_{1}} ϕ (\frac{b_{10}}{τ_{1}})\},

d^{0} = κ (\int_{I} f_{X}^{- 1}) \{\frac{π_{0}}{τ_{0}} ϕ (\frac{b_{00}}{τ_{0}}) - \frac{π_{1}}{τ_{1}} ϕ (\frac{b_{10}}{τ_{1}})\} (σ_{ε 0}^{2} - σ_{ε 1}^{2} \frac{ν_{0}}{ν_{1}}) .

Since the function ϕ is symmetric and b₁₀ = −b₀₀, b₁₀ can be replaced by b₀₀ in the formula for d⁰ without altering its validity.

To appreciate the very wide range of optimal bandwidth choices that can arise in the problem of minimising error rate, let us consider minimising err, at (3.10). To help remove ambiguities, let us assume that as n increases the value of $σ_{ε 0}^{2} - σ_{ε 1}^{2} ν_{0} ν_{1}^{- 1}$ is of the same sign for all sufficiently large n, and its absolute value is bounded away from zero; assumption (3.7)(c) ensures that it is uniformly bounded. In this instance, and focusing just on the terms in h₁, we see that four distinct cases can arise in practice:

(i) $c_{1}^{0}$ and d⁰ are both positive. In this case, to minimise the contribution from h₁, we should minimise $c_{1}^{0} h_{1}^{2} + d^{0} (ν_{0} h_{1})^{- 1}$, which is achieved by taking h₁ to be of size $ν_{0}^{- 1 ∕ 3}$.
(ii) $c_{1}^{0}$ and d⁰ are both negative. In this case, the contribution made by h₁ behaves like $- \{|c_{1}^{0}| h_{1}^{2} + |d^{0}| (ν_{0} h_{1})^{- 1}\}$ as sample size increases. The term within braces here is maximised by taking h₁ = 0, and analogously, in minimising err, it is optimal to take h₁ to be of strictly smaller order than $ν_{0}^{- 1 ∕ 3}$.
(iii) $c_{1}^{0} > 0$ and d⁰ ≤ 0. In this case, to minimise the error rate, we need to maximise the size of the negative term and minimise that of the positive term, which is achieved by taking h₁ to be of strictly smaller order than $ν_{0}^{- 1 ∕ 3}$ (the precise order depends on the magnitude of second order terms, but deriving the latter precisely would require a lot of additional computation).
(iv) $c_{1}^{0} < 0$ and d⁰ ≥ 0. Here, using arguments similar to those in case (iii), taking h₁ to be of strictly larger order than $ν_{0}^{- 1 ∕ 3}$ is optimal.
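The order stated in case (i) can be checked by elementary calculus applied to the two leading h₁-terms of (3.10). The display below treats $c_{1}^{0}$ and d⁰ as fixed positive constants, which is a simplification made here for illustration, and ignores the remainder terms.

```latex
\frac{\partial}{\partial h_{1}}
\Bigl\{ c_{1}^{0} h_{1}^{2} + \frac{d^{0}}{\nu_{0} h_{1}} \Bigr\}
= 2 c_{1}^{0} h_{1} - \frac{d^{0}}{\nu_{0} h_{1}^{2}} = 0
\quad\Longrightarrow\quad
h_{1} = \Bigl( \frac{d^{0}}{2 c_{1}^{0}} \Bigr)^{1/3} \nu_{0}^{-1/3},
\qquad
c_{1}^{0} h_{1}^{2} \asymp \frac{d^{0}}{\nu_{0} h_{1}} \asymp \nu_{0}^{-2/3}.
```

When the two coefficients are nonzero and of opposite signs, as in cases (iii) and (iv) with d⁰ ≠ 0, this derivative has a constant sign for all h₁ > 0, so the two leading terms alone have no interior minimiser. When both are negative, as in case (ii), the stationary point is a maximum of the contribution rather than a minimum.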

The case d⁰ = 0 occurs, for example, if the covariance G_k of the Gaussian process g, the experimental error variance $σ_{ε k}^{2}$ , and the values of m_kj and n_k do not depend on k. Equal values of m_kj commonly arise when the data are observed on a grid; see Remark 4.

A similar analysis can be carried out in the case of optimisation over h rather than h₁, although there the optimum is determined from a comparison of terms in h² and (mh)⁻², rather than $h_{1}^{2}$ and (ν₀h₁)⁻¹. [A tedious analysis of the term of size (mh)⁻², represented by the remainder O{(mh)⁻²} in (3.8), shows that it can be either positive or negative.] Depending on the relative signs of the terms in h² and (mh)⁻², it can be optimal to take h ≍ m^−1/2, or h of strictly larger, or strictly smaller, order than m^−1/2.

Similar results are obtained if we investigate properties of err_k, in (3.8), instead of the overall error rate, err, at (3.9).

These results explain the very diverse patterns of behaviour that are seen in numerical work, and that motivated our research; see Section 1. In summary, in apparently similar problems and using the same type of classifier, it can be optimal to use a very small bandwidth, or a very large bandwidth, or a bandwidth of only moderate size, depending on the signs of certain constants. Therein lies the contradictory nature of the smoothing parameter choice problem for classification of functional data.

3.1.2. Absence of terms in (mh)⁻¹

The centroid-based classifier statistic $S (\hat{g})$ , at (2.13), can be written equivalently as

S (\hat{g}) = \int_{I} ({\hat{μ}}_{1} - {\hat{μ}}_{0}) (2 \hat{g} - {\hat{μ}}_{0} - {\hat{μ}}_{1}) d t .

(3.11)

Importantly, there is no quadratic term in ${\hat{g}}^{2}$ in (3.11), and as a result the impact of the bandwidth h, although not h₁, on properties of the classifier is greatly reduced. This reduction is brought about by the smoothing effect of the integral in (3.11), which results in the elimination of terms in (mh)⁻¹.
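The equivalence of (2.13) and (3.11) is a difference-of-squares identity, written out here for completeness:

```latex
\begin{aligned}
S(\hat{g})
&= \int_{I} \bigl[ \{\hat{g} - \hat{\mu}_{0}\}^{2} - \{\hat{g} - \hat{\mu}_{1}\}^{2} \bigr] \, dt \\
&= \int_{I} \bigl\{ (\hat{g} - \hat{\mu}_{0}) - (\hat{g} - \hat{\mu}_{1}) \bigr\}
            \bigl\{ (\hat{g} - \hat{\mu}_{0}) + (\hat{g} - \hat{\mu}_{1}) \bigr\} \, dt \\
&= \int_{I} (\hat{\mu}_{1} - \hat{\mu}_{0}) (2 \hat{g} - \hat{\mu}_{0} - \hat{\mu}_{1}) \, dt .
\end{aligned}
```

The $\hat{g}^{2}$ terms cancel in the second line, so $\hat{g}$ enters the statistic only linearly.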

This property, to which we shall refer as the “integration effect,” is known in other settings, for example, when integrating a kernel density estimator, computed from a sample of size m, to produce a distribution estimator. Integration results in the variance reducing from order (mh)⁻¹, for the density estimator, to order m⁻¹, for the distribution function estimator, just as it does in the setting above.
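The density and distribution-function example can be explored with a small Monte Carlo experiment. The sketch below is an illustration under assumptions that are not taken from the paper: standard normal data, a Gaussian kernel, evaluation at the single point x = 0, and hypothetical function names. It estimates the variance of a kernel density estimate and of its integral (a smoothed c.d.f. estimate) for a given bandwidth.

```python
import math
import numpy as np

# Standard normal c.d.f., applied elementwise.
_PHI = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def kde_and_cdf_at(x, sample, h):
    """Gaussian-kernel density estimate at x, and its integral up to x."""
    z = (x - sample) / h
    density = np.mean(np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)) / h
    cdf = np.mean(_PHI(z))   # integral of the density estimate over (-inf, x]
    return density, cdf

def variance_by_bandwidth(h, m=200, reps=500, x=0.0, seed=0):
    """Monte Carlo variances of the density and c.d.f. estimates at x."""
    rng = np.random.default_rng(seed)
    est = np.array(
        [kde_and_cdf_at(x, rng.standard_normal(m), h) for _ in range(reps)]
    )
    return est.var(axis=0)   # (density variance, c.d.f. variance)
```

Comparing, say, `variance_by_bandwidth(0.05)` with `variance_by_bandwidth(0.4)` should show the density variance changing markedly with h while the c.d.f. variance changes comparatively little, in line with the orders (mh)⁻¹ and m⁻¹ quoted above.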

Remark 1 (Order m⁻¹ term in expansion of classification error). We assumed in (3.7)(c) that the values of m_kj, representing the number of pairs (X_kji, Y_kji) for a given population index k and given individual j, are all of roughly the same size. In this setting it is easy to see that, even in an elementary parametric setting, we must expect the operation of observing the functions g_kj at scattered points to affect error rate through a term of order m⁻¹, and no smaller. For example, consider the case where g_kj = ψ(· | ω_kj), with ψ(· | ω) being a known function completely determined by the parameter ω, and $ω_{k j} = \int_{I} g_{k j} w$ where the weight function w is known. Using the data $D_{k j}$ on g_kj we can estimate ω_kj root-m consistently, but no faster, and as a result we incur a classification error of size m⁻¹, and no smaller, from not knowing the values ω_kj. It is for this reason that, when developing expansions of classification error, we do not explore the remainder of size m⁻¹; it is stated simply as O(m⁻¹) on the right-hand side of (3.8).

3.1.3. Other remarks

We conclude our discussion of Theorem 1 with a number of remarks.

Remark 2 (Definition of ${\hat{μ}}_{k}$ ). The size of the fourth and fifth terms on the right-hand side of (3.8) is determined by the sizes of $ν_{0}^{- 1}$ and $ν_{1}^{- 1}$ , and those quantities can be made slightly smaller by using a slightly different definition of ${\hat{μ}}_{k}$ , at (2.8). In particular in (2.8), on account of the definition of ${\hat{g}}_{k j}$ at (2.3), ${\hat{μ}}_{k}$ is defined as an average of ratios of sums, whereas slightly better statistical performance is obtained by taking ${\hat{μ}}_{k}$ to be simply a ratio of sums:

{\hat{μ}}_{k} = \frac{Σ_{j} (U_{k j 2} V_{k j 0} - U_{k j 1} V_{k j 1})}{Σ_{j} (U_{k j 2} U_{k j 0} - U_{k j 1}^{2})},

compare (2.3). However, this approach departs from standard practice in working with functional data, and therefore, since convergence rates do not alter (only the constant multiples of rates are reduced), we have followed standard practice in the definition of ${\hat{μ}}_{k}$ .

Remark 3 (Gaussian assumption). Of course, if m is sufficiently large then $\hat{g}$ is itself approximately Gaussian, and so the assumption that g is a Gaussian process is reflected particularly well in properties of its estimator. More generally, our assumption that g is a Gaussian process is made for simplicity, and can be relaxed. For example, generalisations to chi-squared and other processes, where shape can be described in terms of a small number of fixed functions (mean and covariance in the Gaussian case), are straightforward.

More generally we would require a model which described the properties of random functions relatively simply. The Gaussian model fills this need ideally; shape is described by mean and variance functions, on which we have imposed only smoothness, rather than parametric, conditions. Moreover, in the Gaussian case all moments of g(x) are finite, for each x (we use this property repeatedly during our theoretical arguments), and the principal component scores are independent (this is used frequently during our proof of Theorem 2).

Remark 4 (Case of regularly spaced design). Theorem 1 continues to hold if the m_kj design variables X_kji are regularly spaced on $I$ for each k and j. The only change necessary is to replace $\int_{I} f_{X}^{- 1}$ , on the right-hand side of (3.8), by the square of the length of the interval $I$ .
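One way to see why the square of the interval length appears is to compare with a random design whose density is uniform on $I$. This comparison is only an illustration and is not part of the statement of Remark 4; |I| denotes the length of the interval.

```latex
f_X(x) = \frac{1}{|I|} \quad (x \in I)
\qquad\Longrightarrow\qquad
\int_I f_X^{-1} = \int_I |I| \, dx = |I|^2 .
```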

3.2. Scale-adjusted centroid-based classifier

Recall that the scale-adjusted centroid-based classifier is defined in terms of $S_{scale} (\hat{g})$ , at (2.14). A decomposition similar to that of Theorem 1 can be derived for this classifier, as we shall show in Theorem 2 below. For this classifier, it seems necessary to strengthen (3.7) by imposing conditions on the behaviour of the eigenvalues θ_kℓ as ℓ increases. However, since our aim in this section is only to corroborate the conclusions drawn in Section 3.1 for the standard centroid-based classifier, we shall simplify our account by assuming that g is finite dimensional, and in particular taking the covariance expansion at (2.10) to have just q terms:

For k = 0 and 1 : (a) the first q eigenvalues in the sequence θ_{k 1} \geq θ_{k 2} \geq \dots, arising in the covariance expansion (2.10) of g when the data come from Π_{k}, are distinct; (b) θ_{k ℓ} = 0 for ℓ > q; (c) for 1 \leq ℓ \leq q the eigenfunctions ψ_{k ℓ} in (2.10) have two Hölder continuous derivatives on I; (d) E_{k} (g) is a linear form in ψ_{k 1}, \dots, ψ_{k q}; and (e) in the definition of S_{scale} (\hat{g}), s_{0}^{2} \neq s_{1}^{2} .

(3.12)

Without (3.12)(a), separate conditions, valid uniformly in j = 1, 2, …, have to be imposed on remainders in Taylor expansions of “smoothed” versions of the eigenvalues θ_kj, depending on h.

The next theorem indicates that the results of Theorem 1 also apply for the scale-adjusted centroid-based classifier. Its proof is given in the supplementary material [Carroll, Delaigle and Hall (2013)].

Theorem 2. Assume that (3.5), (3.6) and (3.12) hold. Then the error rate of the scale-adjusted centroid-based classifier, when the data in $D$ are drawn from Π_k, admits the expansion at (3.8), but with different constants, where the various terms have the properties stated immediately below that formula.

The diversity of possible signs of c_k, c_k1 and $d_{k 0} ν_{0}^{- 1} + d_{k 1} ν_{1}^{- 1}$ in (3.8), discussed in Section 3.1.1, is also present in this case. Therefore the conclusions drawn in that section apply to the scale-adjusted centroid-based classifier. However, we have not derived explicitly the counterparts of the constants c_k, c_k1, d_k0 and d_k1 that appear in equation (3.8).

The integration effect discussed in Section 3.1.2 is also present here, although we had originally expected that the scale-adjusted centroid classifier would produce a term of size (mh)⁻¹ in an expansion of error rate. Indeed, the situation initially seems quite different in the case of the scale-adjusted version $S_{scale} (\hat{g})$ of $S (\hat{g})$ , at (2.14), when $s_{0}^{2} \neq s_{1}^{2}$ . There the quadratic term in $\hat{g}$ persists. The reason it still does not produce a term in (mh)⁻¹ is quite subtle. Define ⋈_k to be > or ≤ according as k = 0 or k = 1, respectively. The probability $P_{k} {S_{scale} (\hat{g}) ⋈_{k} 0}$ can be written as

P_{k} {\sum_{j = 1}^{\infty} w_{j} {(Z_{j} + V_{j})}^{2} ⋈_{k} W} + negligible terms,

where the Z_j's are independent N(0, 1) variables, conditional on the V_j's and W; the positive weights w_j are nonrandom; and critically, W does not involve the experimental errors ε_i in (2.2), from which any term in (mh)⁻¹ would arise. The terms V_j depend on the experimental errors only through integrals of the error process, and the integration effect at this point largely removes the impact of the error bandwidth h, with the result that there is no term of size (mh)⁻¹. However, terms in (ν₀h₁)⁻¹ remain; the integration effect only influences smoothing of the new data, not of the training data.

3.3. Quadratic discriminant

Finally, we show that similar smoothing effects are present in the case of the quadratic discriminant classifier defined through the statistic $T (\hat{g})$ at (2.15). Recall that, when the data in $D$ come from Π_k, the random function g has covariance function G_k. To derive the counterpart of Theorem 1 for this classifier, let r, r₁, r₂ take the values 0 and 1, let 1 ≤ ℓ, ℓ₁, ℓ₂ ≤ p, and define the covariances

{cov}_{k} [r_{1}, r_{2}; ℓ_{1}, ℓ_{2}] = \int_{I} \int_{I} G_{k} (x_{1}, x_{2}) ψ_{r_{1} ℓ_{1}} (x_{1}) ψ_{r_{2} ℓ_{2}} (x_{2}) d x_{1} d x_{2},

the variances var_k[r, ℓ] = cov_k[r, r; ℓ, ℓ], and the correlations

ρ_{k} [r_{1}, r_{2}; ℓ_{1}, ℓ_{2}] = \frac{{cov}_{k} [r_{1}, r_{2}; ℓ_{1}, ℓ_{2}]}{{({var}_{k} [r_{1}, ℓ_{1}] {var}_{k} [r_{2}, ℓ_{2}])}^{1 ∕ 2}} .

Let p ≥ 1, a fixed number, be the number of principal components used to construct the quadratic discriminant statistic $T (\hat{g})$ , defined at (2.15). Theorem 3 below addresses the error rate of the quadratic discriminant based on $T (\hat{g})$ , and there we shall assume that:

(a) For k = 0, 1 the eigenvalues θ_{k 1}, \dots, θ_{k, p + 1} are distinct; and (b) among the values taken by ρ_{k} [r_{1}, r_{2}; ℓ_{1}, ℓ_{2}] for k, r_{1}, r_{2} = 0, 1 and 1 \leq ℓ_{1}, ℓ_{2} \leq p, the absolute value of ρ_{k} [r_{1}, r_{2}; ℓ_{1}, ℓ_{2}] equals 1 only when r_{1} = r_{2} and ℓ_{1} = ℓ_{2} .

(3.13)

Condition (3.13)(a) ensures that the eigenfunctions ψ_kℓ are well defined for k = 0, 1 and ℓ = 1, …, p; and (3.13)(b) guarantees that the quantities $\int_{I} (g - μ_{r_{1}}) ψ_{r_{1} ℓ_{1}}$ and $\int_{I} (g - μ_{r_{2}}) ψ_{r_{2} ℓ_{2}}$ , which appear in the definition of T₀(g) at (2.16), cannot be identical, except for a difference in means, unless r₁ = r₂ and ℓ₁ = ℓ₂, thereby avoiding degeneracy.

The counterpart of Theorem 1 for the quadratic discriminant classifier is stated in the next theorem. Its proof is given in the supplementary material [Carroll, Delaigle and Hall (2013)].

Theorem 3. Assume that (3.5)–(3.7) and (3.13) hold. Then the error rate of the quadratic discriminant, when the data in $D$ come from Π_k, admits the expansion at (3.8), but with different constants, where the various terms have the properties stated immediately below that formula.

Again the signs of c_k, c_k1 and $d_{k 0} ν_{0}^{- 1} + d_{k 1} ν_{1}^{- 1}$ , in (3.8), are particularly diverse, and so the conclusions reached in Section 3.1.1 apply. Likewise, the integration effect discussed in Section 3.1.2 is also observed. Here, as can be seen directly from (2.15), the estimator $\hat{g}$ is integrated, and only the integral is squared, not $\hat{g}$ itself. The resulting integration effect eliminates any term in (mh)⁻¹ from the analogue of the expansion (3.8) in this setting, although again this influence does not carry over to the training data.

4. Numerical illustrations

4.1. Simulated data

To illustrate the impact of bandwidth on classification performance, we generated data from several instances of model (2.1), taking, in each case, m_kj = 50. Let ϕ_σ (x) denote the normal density function with mean zero and standard deviation σ. We considered the following cases, each with three different levels of errors, which we refer to as noise versions 1, 2 and 3:

(A):
g_kj (t) = μ_k(t) + (3t + 100)^{1/2} {cos(t/50)}^k Z_kj, where μ₀(t) = ϕ₁₀(t − 5), μ₁(t) = μ₀(t) + 0.3 cos(t/5) + 0.1, Z_kj ~ U[−1/(30 − 10k), 1/(30 − 10k)], and ε_kji ~ N(0, 1/(4 − 2k)²) (noise version 1), ε_kji ~ N(0, 2/(4 − 2k)²) (noise version 2) or ε_kji ~ N(0, 4/(4 − 2k)²) (noise version 3), and π₀ = 1/3, π₁ = 2/3. Moreover, X_kji = 2i − 1, for i = 1, …, 50.
(B):
g_kj (t) = μ_k(t) + (3t + 100)^{1/2} Z_kj, where μ₀(t) = 30{0.2ϕ₄(t − 5) + 0.1ϕ₄(t − 10) + 0.4ϕ₆(t − 20) + 0.4ϕ₆(t − 35) + 0.6ϕ₇(t − 55) + 0.6ϕ₇(t − 80)}, μ₁(t) = μ₀(t) + 4/{(t − 50)² + 10}, Z_kj ~ U[−1/(60 + 15k), 1/(60 + 15k)], ε_kji ~ {Exp(0.5) − 2}/(2 + 2k) (noise version 1), ε_kji ~ √2{Exp(0.5) − 2}/(2 + 2k) (noise version 2) or ε_kji ~ {Exp(0.5) − 2}/(1 + k) (noise version 3), and π₀ = 2/5 and π₁ = 3/5. Moreover, X_kji was as in (A).
(C):
g_0j (t) = μ₀(t) + (3t + 100)^{1/2} Z_0j, g_1j (t) = μ₁(t) + (t + 5)Z_1j, where μ₀(t) = 15ϕ₁₇(t − 65) cos(t/7), μ₁(t) = μ₀(t) + 5ϕ₂₀(t − 50), Z_kj ~ U[−1/(50 − 10k), 1/(50 − 10k)], ε_kji ~ N(0, (4 − k)²/100) (noise version 1), ε_kji ~ N(0, (4 − k)²/50) (noise version 2), ε_kji ~ N(0, (4 − k)²/25) (noise version 3), and π₀ = 2/3 and π₁ = 1/3. Moreover, X_kji was as in (A).
(D)–(F):
Same as (A) to (C) but with X_kji = 2i − 1 + T_kji, where T_kji ~ N(0, 0.25).
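To make the data-generating mechanism concrete, the following is a minimal sketch of how a sample from case (A), or case (D) when the design is perturbed, might be simulated. It rests on assumptions not stated explicitly above: that model (2.1) has the form Y_kji = g_kj(X_kji) + ε_kji, and that the second argument of N(·, ·) is a variance. The function names and the `variance_factor` parameter (1, 2 or 4 for the three noise versions) are hypothetical.

```python
import math
import random

def phi(x, sigma):
    """Normal density with mean zero and standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mu(k, t):
    """Mean curves of case (A): mu_0, and mu_1 = mu_0 + 0.3 cos(t/5) + 0.1."""
    mu0 = phi(t - 5.0, 10.0)
    return mu0 if k == 0 else mu0 + 0.3 * math.cos(t / 5.0) + 0.1

def simulate_curve_A(k, rng, variance_factor=1.0, jitter=False):
    """One noisy curve from population k; jitter=True gives the design of (D)."""
    half_width = 1.0 / (30 - 10 * k)
    z = rng.uniform(-half_width, half_width)
    # Assumption: N(0, v) denotes variance v, so the sd is sqrt(factor)/(4 - 2k).
    noise_sd = math.sqrt(variance_factor) / (4 - 2 * k)
    xs, ys = [], []
    for i in range(1, 51):
        # T_kji ~ N(0, 0.25) is read as variance 0.25, i.e. sd 0.5.
        x = 2 * i - 1 + (rng.gauss(0.0, 0.5) if jitter else 0.0)
        g = mu(k, x) + math.sqrt(3.0 * x + 100.0) * math.cos(x / 50.0) ** k * z
        xs.append(x)
        ys.append(g + rng.gauss(0.0, noise_sd))
    return xs, ys

def simulate_sample_A(n, rng, pi1=2.0 / 3.0, **kwargs):
    """n labelled curves, each drawn from population 1 with probability pi1."""
    sample = []
    for _ in range(n):
        k = 1 if rng.random() < pi1 else 0
        sample.append((k, *simulate_curve_A(k, rng, **kwargs)))
    return sample

rng = random.Random(1)
training = simulate_sample_A(50, rng, variance_factor=2.0)  # noise version 2
```

Cases (B) and (C) would follow the same pattern with their own mean functions, scale factors and error distributions.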

We chose these examples to illustrate various features of the problems, namely that the impact of smoothing may differ among classifiers, and that in some cases, some classifiers perform better with more smoothing and in other cases, they might perform better with less smoothing.

In each case, for k = 0, 1 and for several values of n_tr, we generated 100 (resp., n_tr) noisy test curves (resp., training curves) from model (2.1), each of which came with probability π_k from Π_k. We constructed each classifier from the training data, and applied it to the test data. To compute $\hat{g}$ and ${\hat{g}}_{k j}$ , we compared three approaches for selecting the bandwidths: no smoothing (NS), the standard plug-in (PI) bandwidths h_PI and h_PI,kj that estimate the optimal bandwidth for estimation of the regression functions g and g_kj, which we computed using the dpill function in the R package KernSmooth; see Ruppert, Sheather and Wand (1995); and the bandwidths γh_PI and γ₁h_PI,kj, where γ and γ₁ (and also the truncation parameter p in the case of the quadratic discriminant classifier) were chosen to minimise the following cross-validation (CV) estimator of classification error:

\hat{err} = \frac{{\hat{π}}_{0}}{n_{0}} \sum_{i = 1}^{n_{0}} I {{\hat{C}}_{i 0, - i} = 1} + \frac{{\hat{π}}_{1}}{n_{1}} \sum_{i = 1}^{n_{1}} I {{\hat{C}}_{i 1, - i} = 0}

with ${\hat{π}}_{0}$ and ${\hat{π}}_{1}$ denoting estimators of π₀ and π₁ (we took ${\hat{π}}_{k} = 1 ∕ 2$ ), and ${\hat{C}}_{i k, - i}$ being the estimator of the class label of the ith training observation from group k, obtained from the classifier constructed without using this observation.
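The leave-one-out estimator above can be sketched for the centroid classifier as follows. This is an illustration under stated assumptions rather than the procedure used in the paper: curves are taken to be already smoothed and evaluated on a common grid with spacing `dx`, the score is the form of $S (\hat{g})$ implied by (A.6), and a curve is assigned to Π₁ when the score is positive. In the paper the criterion is additionally minimised over γ and γ₁, which would wrap this function in a search over bandwidth multipliers.

```python
def mean_curve(curves):
    """Pointwise average of curves stored as equal-length lists."""
    n = len(curves)
    return [sum(vals) / n for vals in zip(*curves)]

def centroid_score(g, mu0, mu1, dx):
    """Riemann-sum version of S(g) = int (mu1 - mu0) {2 g - (mu0 + mu1)}."""
    return dx * sum((b - a) * (2.0 * v - (a + b)) for v, a, b in zip(g, mu0, mu1))

def cv_error(group0, group1, dx=1.0, pi_hat=(0.5, 0.5)):
    """Leave-one-out estimate of classification error for the centroid rule."""
    groups = (group0, group1)
    err = 0.0
    for k in (0, 1):
        wrong = 0
        for i, g in enumerate(groups[k]):
            # Rebuild the class-k mean without observation i.
            rest = groups[k][:i] + groups[k][i + 1:]
            means = [None, None]
            means[k] = mean_curve(rest)
            means[1 - k] = mean_curve(groups[1 - k])
            label = 1 if centroid_score(g, means[0], means[1], dx) > 0 else 0
            wrong += (label != k)
        err += pi_hat[k] * wrong / len(groups[k])
    return err

# Toy usage: the third curve of group 0 sits among the curves of group 1.
group0 = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0]]
group1 = [[1.0, 1.0], [1.1, 1.1], [0.9, 0.9]]
err = cv_error(group0, group1)
```

In the toy example only the outlying curve of group 0 is misclassified, so the estimate is 0.5 × 1/3.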

For each configuration, we generated B = 200 sets of training and test samples. In Tables 1 and 2, we report the percentage of correctly classified test curves, averaged over the B replicates. Depending on the model, the classifier, and the type of data (test or training), the cross-validation bandwidths were either smaller or larger than the PI regression bandwidths, illustrating the variety of settings already explained by our theory. See Table B.1 in Section B.3 in the supplementary material [Carroll, Delaigle and Hall (2013)], where we report the values of γ and γ₁ averaged over the B replicates. We can see from the table that in most cases, γ was smaller than γ₁, and both were usually smaller than 1, except in cases (C) and (F).

Table 1.

Percentage of correctly classified observations for the simulated data of Section 4.1, using plug-in (PI) regression bandwidths, bandwidths that minimise a crossvalidation (CV) estimate of classification error, or without smoothing the noisy data (NS). The three noise versions, in increasing order, are described in cases (A)–(C) in Section 4.1. Here “Cent” is the centroid classifier (2.13), “Cent sc.” is the scaled centroid classifier (2.14) and “QDA” is the quadratic discriminant classifier (2.15)

		Cent			Cent sc.			QDA
	n_tr	CV	PI	NS	CV	PI	NS	CV	PI	NS
Case (A)
Noise version 1	50	82.9	74.1	84.0	91.8	73.2	92.0	95.1	94.1	53.5
Noise version 1	100	84.4	74.9	84.8	92.6	74.1	92.6	97.6	94.8	67.6
Noise version 2	50	77.7	69.6	78.1	94.3	70.4	94.4	91.0	89.3	49.1
Noise version 2	100	79.9	70.7	80.2	95.1	71.2	95.1	94.3	89.7	61.7
Noise version 3	50	71.1	65.6	71.1	97.1	69.0	97.1	85.4	84.3	46.2
Noise version 3	100	73.7	66.8	74.1	97.9	69.4	97.9	89.9	84.1	58.4
Case (B)
Noise version 1	50	63.2	60.1	65.7	96.3	78.7	96.5	77.1	74.3	65.8
Noise version 1	100	65.5	61.5	66.8	96.8	80.0	96.8	81.8	76.3	73.0
Noise version 2	50	61.5	58.6	64.6	96.3	80.6	96.4	76.9	74.1	65.2
Noise version 2	100	62.6	58.7	64.4	96.7	81.3	96.7	81.3	75.0	72.4
Noise version 3	50	60.9	57.6	64.0	96.2	81.6	96.4	77.3	74.2	65.4
Noise version 3	100	60.7	56.8	63.3	96.7	82.3	96.7	81.6	75.2	72.3
Case (C)
Noise version 1	50	61.5	60.8	60.8	88.7	89.2	87.4	84.8	83.7	82.0
Noise version 1	100	59.4	58.4	58.2	90.0	90.3	88.5	86.9	85.7	79.0
Noise version 2	50	61.3	60.2	60.3	87.3	87.9	82.8	81.9	81.2	82.4
Noise version 2	100	58.9	57.9	57.6	88.8	89.0	85.2	84.6	83.1	80.4
Noise version 3	50	61.0	59.7	59.3	85.2	85.4	71.2	80.5	79.9	79.6
Noise version 3	100	58.5	57.4	57.0	87.2	86.6	74.9	82.6	81.1	79.7


Table 2.

Percentage of correctly classified observations for the simulated data of Section 4.1, using plug-in (PI) regression bandwidths, bandwidths that minimise a crossvalidation (CV) estimate of classification error, or without smoothing the noisy data (NS). The three noise versions, in increasing order, are described in cases (D)–(F) in Section 4.1. Here “Cent” is the centroid classifier (2.13), “Cent sc.” is the scaled centroid classifier (2.14) and “QDA” is the quadratic discriminant classifier (2.15)

		Cent			Cent sc.			QDA
	n_tr	CV	PI	NS	CV	PI	NS	CV	PI	NS
Case (D)
Noise version 1	50	80.2	69.5	80.6	85.7	68.5	86.3	93.9	92.7	69.2
Noise version 1	100	81.5	70.0	82.0	87.3	69.2	87.3	96.6	93.2	84.3
Noise version 2	50	74.7	65.9	75.6	90.0	65.6	90.3	88.5	86.7	60.9
Noise version 2	100	76.9	66.8	77.3	90.9	66.8	91.0	92.3	86.8	77.6
Noise version 3	50	69.2	62.3	69.6	94.2	65.4	94.4	82.9	80.5	55.6
Noise version 3	100	71.4	63.4	72.0	95.1	66.4	95.1	87.9	80.8	72.7
Case (E)
Noise version 1	50	65.0	61.9	67.3	94.8	79.0	95.0	77.0	73.1	71.2
Noise version 1	100	65.8	62.5	67.6	95.4	79.6	95.4	84.3	69.7	82.8
Noise version 2	50	63.0	60.2	65.4	94.7	80.6	95.0	77.8	74.2	70.5
Noise version 2	100	63.4	59.9	64.9	95.4	81.4	95.5	84.8	69.8	82.4
Noise version 3	50	61.3	59.2	64.3	94.6	81.4	94.9	77.9	74.4	69.8
Noise version 3	100	61.8	58.2	63.1	95.4	82.5	95.5	84.5	71.5	81.8
Case (F)
Noise version 1	50	60.2	59.1	59.4	88.0	88.7	87.9	83.5	82.6	80.4
Noise version 1	100	58.8	57.8	57.7	89.0	89.3	88.5	84.9	83.2	77.3
Noise version 2	50	59.8	58.7	59.0	86.5	87.2	84.6	80.8	80.2	80.0
Noise version 2	100	58.6	57.3	57.2	87.6	87.7	85.8	83.1	81.1	77.0
Noise version 3	50	59.2	58.3	58.2	84.5	84.1	76.7	79.4	78.5	78.9
Noise version 3	100	58.0	56.8	56.5	85.8	84.6	78.5	81.0	79.1	75.7


As expected, we conclude from Tables 1 and 2 that, depending on the model and the classifier, the negative impact of smoothing with the standard PI bandwidth can be quite significant, indeed sometimes reducing the percentage of correctly classified data by as much as 10%. In cases (A) and (D), it is the centroid classifier and its scaled version that are the most affected by this inappropriate level of smoothing, whereas the quadratic discriminant classifier is more robust against the level of smoothing. In cases (B) and (E), the scaled centroid classifier and the quadratic discriminant classifier are the most affected by inappropriate smoothing. Cases (C) and (F) are more robust against smoothing; there, all three versions (PI, CV and NS) of the data result in similar classification performance, although overall the data smoothed by CV result in slightly improved performance. Depending on the case, when the noise level increases the impact of inappropriate bandwidth choice can either increase or decrease.

4.2. Real data

We illustrate our findings on the ovarian cancer data set 8–7–02, which concerns 253 patients (91 controls and 162 with ovarian cancer). The data, which were produced to study the effect of robotic sample handling, are available from http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp. In this example, the functions X_i represent proteomic mass spectra and t ∈ [0, 20,000] is the mass over charge ratio, m/z. These raw curves are ideal for illustrating the negative impact that systematically smoothing by standard methods can have, because in some ranges of values of t, the spectra have considerable activity, and the impact of smoothing such data can be striking. We focus on one such range, namely t ∈ [200, 500].

To assess the performance of classifiers on this data set, we randomly and uniformly created B = 200 pairs of (training sample, test sample), where we took the training sample to be of size n_tr and the test sample of size 253 − n_tr, for n_tr = 50 and n_tr = 100. We also generated two more noise versions of the data, adding to the Y_kji's in both the test and training data, noise $ε_{k j i}^{'} ~ N (0, 0.04)$ (noise version 1) or $ε_{k j i}^{'} ~ N (0, 0.25)$ (noise version 2), where the $ε_{k j i}^{'}$ 's were totally independent.

For each version of the data (original data and noise versions 1 and 2), and for each pair of test and training sample, we constructed each classifier from the training sample, and applied the classifier to the test sample using either plug-in regression bandwidths to construct the estimators $\hat{g}$ and ${\hat{g}}_{k j}$ , or bandwidths obtained by minimising the CV estimator of classification error defined in Section 4.1, where we took ${\hat{π}}_{k} = 1 ∕ 2$ .

Table 3 reports the percentage, averaged over the B pairs of samples, of correctly classified observations from the test samples. The table indicates very clearly that smoothing the data using the plug-in regression bandwidths degraded the quality of the two versions of the centroid classifier by about 10%, and a similar phenomenon was observed for the quadratic discriminant classifier when the training sample was small and when the data were noisy.

Table 3.

Percentage of correctly classified observations for the ovarian cancer data, using plug-in (PI) regression bandwidths or bandwidths that minimise a crossvalidation (CV) estimate of classification error. Here “Cent” is the centroid classifier (2.13), “Cent sc.” is the scaled centroid classifier (2.14) and “QDA” is the quadratic discriminant classifier (2.15)

		Cent		Cent sc.		QDA
Data	n_tr	CV	PI	CV	PI	CV	PI
Original data	50	90.60	80.25	90.05	78.79	93.32	89.69
Original data	100	90.43	80.96	90.00	79.96	98.58	98.86
Noisy version 1	50	88.07	75.19	87.83	74.23	78.03	68.50
Noisy version 1	100	87.58	76.76	88.54	76.27	91.48	90.97
Noisy version 2	50	76.15	66.57	76.65	66.09	56.91	48.54
Noisy version 2	100	81.97	67.55	81.91	67.64	77.62	66.49


Supplementary Material

NIHMS564605-supplement-Supplementary_Material.pdf (97.6 KB, pdf)

Acknowledgments

Delaigle and Hall's research was supported by grants and fellowships from the Australian Research Council. Carroll's research was supported by a grant from the National Cancer Institute (R37-CA057030).

APPENDIX: PROOF OF THEOREM 1

A.1. Preliminary results

Define

Δ_{ℓ} (x) = \frac{1}{m h} \sum_{i = 1}^{m} ε_{i} {(\frac{x - X_{i}}{h})}^{ℓ} K (\frac{x - X_{i}}{h}),

W_{ℓ} (x) = \frac{1}{m h} \sum_{i = 1}^{m} [\int_{x}^{X_{i}} {g^{″} (t) - g^{″} (x)} (X_{i} - t) d t] {(\frac{x - X_{i}}{h})}^{ℓ} K (\frac{x - X_{i}}{h}) .

With U_ℓ and V_ℓ given by (2.4) and (2.5), and using the model at (2.2) and the exact form of the remainder in Taylor's theorem, we can write:

\begin{matrix} V_{ℓ} (x) & = \frac{1}{m h} \sum_{i = 1}^{m} \{g (X_{i}) + ε_{i}\} {(\frac{x - X_{i}}{h})}^{ℓ} K (\frac{x - X_{i}}{h}) \\ & = \frac{1}{m h} \sum_{i = 1}^{m} [g (x) + (X_{i} - x) g^{'} (x) + \frac{1}{2} {(X_{i} - x)}^{2} g^{″} (x) + ε_{i}] {(\frac{x - X_{i}}{h})}^{ℓ} K (\frac{x - X_{i}}{h}) + W_{ℓ} (x) \\ & = g (x) U_{ℓ} (x) - h g^{'} (x) U_{ℓ + 1} (x) + \frac{1}{2} h^{2} g^{″} (x) U_{ℓ + 2} (x) + Δ_{ℓ} (x) + W_{ℓ} (x) . \end{matrix}

Assuming, without loss of generality, that K is supported on [−1, 1],

∣ W_{ℓ} (x) ∣ \leq h^{2} {sup_{t ∊ I : ∣ t - x ∣ \leq h} ∣ g^{″} (t) - g^{″} (x) ∣} \frac{1}{m h} \sum_{i = 1}^{m} K (\frac{x - X_{i}}{h}) \leq h^{2} U_{0} (x) Q,

where $Q = {sup}_{s, t \in I : ∣ s - t ∣ \leq h} ∣ g^{''} (s) - g^{''} (t) ∣$ . Now,

\hat{g} = \frac{U_{2} V_{0} - U_{1} V_{1}}{U_{2} U_{0} - U_{1}^{2}} = g + \frac{1}{2} h^{2} g^{″} \frac{U_{2}^{2} - U_{1} U_{3}}{U_{2} U_{0} - U_{1}^{2}} + Δ + \frac{U_{2} W_{0} - U_{1} W_{1}}{U_{2} U_{0} - U_{1}^{2}},

where $Δ = (U_{2} Δ_{0} - U_{1} Δ_{1}) ∕ (U_{2} U_{0} - U_{1}^{2})$ . Therefore, since |U_ℓ| ≤ U₀ for each ℓ ≥ 0,

∣ \hat{g} - (g + \frac{1}{2} h^{2} g^{″} \frac{U_{2}^{2} - U_{1} U_{3}}{U_{2} U_{0} - U_{1}^{2}} + Δ) ∣ \leq \frac{2 Q h^{2} U_{0}^{2}}{U_{2} U_{0} - U_{1}^{2}},

(A.1)

uniformly on $I$
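The decomposition behind (A.1) can be checked numerically in a special case. When g is quadratic and there is no noise, both Δ and the W_ℓ terms vanish, so the local linear estimator should equal g plus the h² bias term exactly. The sketch below verifies this; it assumes that U_ℓ has the same kernel-weighted form as V_ℓ without the response, and the Epanechnikov kernel, the design and the test function are hypothetical choices.

```python
import random

def epanechnikov(u):
    """Symmetric kernel supported on [-1, 1] (an illustrative choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def U(l, x, xs, h):
    """U_l(x) = (mh)^{-1} sum_i u_i^l K(u_i), with u_i = (x - X_i) / h."""
    return sum(((x - xi) / h) ** l * epanechnikov((x - xi) / h)
               for xi in xs) / (len(xs) * h)

def V(l, x, xs, ys, h):
    """V_l(x): as U_l(x), with each summand multiplied by Y_i."""
    return sum(yi * ((x - xi) / h) ** l * epanechnikov((x - xi) / h)
               for xi, yi in zip(xs, ys)) / (len(xs) * h)

def g(t):
    """Quadratic test function, so that g'' = 6 everywhere."""
    return 1.0 + 2.0 * t + 3.0 * t * t

rng = random.Random(2)
xs = [rng.random() for _ in range(200)]
ys = [g(xi) for xi in xs]          # no experimental error, so Delta = 0
x, h = 0.5, 0.2

U0, U1, U2, U3 = (U(l, x, xs, h) for l in range(4))
den = U2 * U0 - U1 ** 2
g_hat = (U2 * V(0, x, xs, ys, h) - U1 * V(1, x, xs, ys, h)) / den
bias = 0.5 * h ** 2 * 6.0 * (U2 ** 2 - U1 * U3) / den
```

For a general smooth g the difference between `g_hat` and `g(x) + bias` is no longer zero but is controlled by the right-hand side of (A.1).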

Similarly, defining $Q_{k j} = {sup}_{s, t \in I : ∣ s - t ∣ \leq h} ∣ g_{k j}^{''} (s) - g_{k j}^{''} (t) ∣$ , and

Δ_{k j ℓ} (x) = \frac{1}{m_{k j} h_{1}} \sum_{i = 1}^{m_{k j}} ε_{k j i} {(\frac{x - X_{k j i}}{h_{1}})}^{ℓ} K (\frac{x - X_{k j i}}{h_{1}}),

Δ_{k j} = \frac{U_{k j 2} Δ_{k j 0} - U_{k j 1} Δ_{k j 1}}{U_{k j 2} U_{k j 0} - U_{k j 1}^{2}},

where U_kjℓ is as at (2.6), we have, uniformly on $I$

∣ {\hat{g}}_{k j} - (g_{k j} + \frac{1}{2} h_{1}^{2} g_{k j}^{″} \frac{U_{k j 2}^{2} - U_{k j 1} U_{k j 3}}{U_{k j 2} U_{k j 0} - U_{k j 1}^{2}} + Δ_{k j}) ∣ \leq \frac{2 Q_{k j} h_{1}^{2} U_{k j 0}^{2}}{U_{k j 2} U_{k j 0} - U_{k j 1}^{2}} .

(A.2)

Define

{\overset{‒}{Δ}}_{k} = \frac{1}{n_{k}} \sum_{j = 1}^{n_{k}} Δ_{k j}

(A.3)

and recall that κ₂ = ∫ u²K (u) du. We shall derive the following result in Section A.6:

Lemma 1. Under the conditions of Theorem 1, for some C₁ > 0, all C₂ > 0 and k = 0, 1,

P_{k} (sup_{I} ∣ \frac{U_{2}^{2} - U_{1} U_{3}}{U_{2} U_{0} - U_{1}^{2}} - κ_{2} ∣ > n^{- C_{1}}) = O (n^{- C_{2}}),

P_{k} (\max_{j = 1, \dots, n_{k}} sup_{I} ∣ \frac{U_{k j 2}^{2} - U_{k j 1} U_{k j 3}}{U_{k j 2} U_{k j 0} - U_{k j 1}^{2}} - κ_{2} ∣ > n^{- C_{1}}) = O (n^{- C_{2}})

as n → ∞, and for some C₃ > 0, all C₂ > 0 and k = 0, 1,

P_{k} (sup_{I} \frac{U_{0}^{2}}{U_{2} U_{0} - U_{1}^{2}} > C_{3}) = O (n^{- C_{2}}),

P_{k} (\max_{j = 1, \dots, n_{k}} sup_{I} \frac{U_{k j 0}^{2}}{U_{k j 2} U_{k j 0} - U_{k j 1}^{2}} > C_{3}) = O (n^{- C_{2}}) .

Furthermore, defining M_sum = min_k=0,1(Σ_j m_kj), we have for all C₂, C₄ > 0,

P_{k} {sup_{I} ∣ Δ ∣ > n^{C_{4}} {(m h)}^{- 1 ∕ 2}} + \max_{k = 0, 1} P_{k} {sup_{I} ∣ {\overset{‒}{Δ}}_{k} ∣ > n^{C_{4}} {(M_{sum} h)}^{- 1 ∕ 2}} = O (n^{- C_{2}}) .

(A.4)

A.2. Initial calculation of err_k

Let $G_{1}$ denote the sigma-field generated by the random variables introduced in Section 2, and the random functions g_kj, but excluding g. Specifically, $G_{1}$ is the sigma-field generated by g_kj, X_kji and ε_kji for 1 ≤ i ≤ m_kj, 1 ≤ j ≤ n_k and k = 0, 1, and by X_i and ε_i for 1 ≤ i ≤ m. Recall that ⋈_k is > or ≤ according as k = 0 or k = 1, respectively, and recall formula (3.11) for the statistic $S (\hat{g})$ .

Under the assumption that the new data set $D$ comes from Π_k, and conditional on $G_{1}$ , $\hat{g}$ is a Gaussian process with mean ${\hat{α}}_{k} = E_{k} (\hat{g} ∣ G_{1})$ and covariance function ${\hat{Γ}}_{k}$ , say. In this notation,

{err}_{k} \equiv E_{k} [P_{k} {S (\hat{g}) ⋈_{k} 0 ∣ G_{1}}] = E_{k} {Ψ_{k} (- {\hat{β}}_{k} ∕ {\hat{σ}}_{k})},

(A.5)

where, by (3.11),

{\hat{β}}_{k} = E_{k} {S (\hat{g}) ∣ G_{1}} = \int_{I} ({\hat{μ}}_{1} - {\hat{μ}}_{0}) {2 {\hat{α}}_{k} - ({\hat{μ}}_{0} + {\hat{μ}}_{1})},

(A.6)

\begin{matrix} {\hat{σ}}_{k}^{2} & = var \{S (\hat{g}) ∣ G_{1}\} \\ & = 4 \int_{I} \int_{I} \{{\hat{μ}}_{1} (x_{1}) - {\hat{μ}}_{0} (x_{1})\} \{{\hat{μ}}_{1} (x_{2}) - {\hat{μ}}_{0} (x_{2})\} {\hat{Γ}}_{k} (x_{1}, x_{2}) d x_{1} d x_{2} . \end{matrix}

(A.7)

The probability on the left-hand side of (A.5) equals the chance that, when $D$ comes from Π_k, the classifier based on $S (\hat{g})$ makes an error and assigns $D$ to the other population.
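Formulae (A.5)–(A.7) can be illustrated on a grid. The sketch below treats $\hat{g}$ as a finite-dimensional Gaussian curve with mean α and covariance Γ, computes β and σ² by Riemann sums, and compares the resulting normal-probability formula with a Monte Carlo estimate for k = 0. Everything specific here is hypothetical: the mean curves, eigenvalues and eigenfunctions are toy choices, and the identification Ψ₀ = 1 − Φ is an assumption inferred from the definition of ⋈_k, since Ψ_k is defined elsewhere in the paper.

```python
import math
import random

N = 40
dx = 1.0 / N
grid = [(i + 0.5) * dx for i in range(N)]      # midpoints of I = [0, 1]

# Toy two-term covariance expansion Gamma = sum_l theta_l psi_l psi_l.
theta = [0.2, 0.1]
psi = [lambda x: 1.0, lambda x: math.sqrt(2.0) * math.cos(math.pi * x)]

alpha = [0.1 * x for x in grid]                # conditional mean of g-hat
mu0_hat = [0.0 for _ in grid]                  # stand-ins for mu-hat_0, mu-hat_1
mu1_hat = [0.5 + 0.3 * math.cos(math.pi * x) for x in grid]
d = [b - a for a, b in zip(mu0_hat, mu1_hat)]

def S(g):
    """Riemann-sum version of int (mu1 - mu0) {2 g - (mu0 + mu1)}."""
    return dx * sum(di * (2.0 * gi - (a + b))
                    for di, gi, a, b in zip(d, g, mu0_hat, mu1_hat))

def Gamma(x1, x2):
    return sum(t * p(x1) * p(x2) for t, p in zip(theta, psi))

beta = S(alpha)                                # analogue of (A.6)
sigma2 = 4.0 * dx * dx * sum(d[i] * d[j] * Gamma(grid[i], grid[j])
                             for i in range(N) for j in range(N))  # (A.7)

def Phi(u):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

# Error when the curve comes from population 0: P(S > 0) = 1 - Phi(-beta/sigma).
err0_formula = 1.0 - Phi(-beta / math.sqrt(sigma2))

rng = random.Random(3)
draws, hits = 5000, 0
for _ in range(draws):
    xi = [rng.gauss(0.0, 1.0) for _ in theta]
    g = [a + sum(math.sqrt(t) * z * p(x) for t, z, p in zip(theta, xi, psi))
         for a, x in zip(alpha, grid)]
    hits += S(g) > 0
err0_mc = hits / draws
```

Because S is affine in the curve, its conditional distribution is exactly normal, which is why the formula and the simulation agree up to Monte Carlo error.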

A.3. Approximations to ${\hat{α}}_{k}$ , ${\hat{β}}_{k}$ and ${\hat{σ}}_{k}$

In view of (A.1),

∣ {\hat{α}}_{k} - (μ_{k} + \frac{1}{2} h^{2} μ_{k}^{″} \frac{U_{2}^{2} - U_{1} U_{3}}{U_{2} U_{0} - U_{1}^{2}} + Δ) ∣ \leq \frac{2 E_{k} (Q) h^{2} U_{0}^{2}}{U_{2} U_{0} - U_{1}^{2}} .

(A.8)

Noting that, for random variables A₁, A₂, B₁ and B₂, |cov(A₁ + B₁, A₂ + B₂) − cov(A₁, A₂)| ≤ |cov(B₁, B₂)| + |cov(A₁, B₂)| + |cov(B₁, A₂)|, which follows from the bilinearity of covariance and where the covariances are interpreted conditionally on $G_{1}$ , we deduce from (A.1) that for a constant C₄ > 0,

sup_{x_{1}, x_{2} \in I} ∣ {\hat{Γ}}_{k} (x_{1}, x_{2}) - {G_{k} (x_{1}, x_{2}) + \frac{1}{2} h^{2} G_{k}^{(0, 2)} (x_{1}, x_{2}) \frac{U_{2}^{2} - U_{1} U_{3}}{U_{2} U_{0} - U_{1}^{2}} (x_{2}) + \frac{1}{2} h^{2} G_{k}^{(2, 0)} (x_{1}, x_{2}) \frac{U_{2}^{2} - U_{1} U_{3}}{U_{2} U_{0} - U_{1}^{2}} (x_{1})} ∣ \leq C_{4} h^{2} {h^{2} + E_{k} (Q + Q^{2})} sup_{I} {(1 + \frac{U_{0}^{2}}{U_{2} U_{0} - U_{1}^{2}})}^{2},

(A.9)

where we define $G_{k}^{(j_{1}, j_{2})} (x_{1}, x_{2}) = \partial^{j_{1} + j_{2}} G_{k} (x_{1}, x_{2}) ∕ \partial x_{1}^{j_{1}} \partial x_{2}^{j_{2}}$ . (Recall that G_k denotes the covariance of the Gaussian process g when the data $D$ are drawn from Π_k.)

With ${\overset{‒}{g}}_{k}$ defined as at (3.1), and defining ${\overset{‒}{Δ}}_{k}$ as at (A.3), we have, in view of (A.2), Lemma 1 and (3.6)(b), the result

P_{k} {sup_{I} ∣ {\hat{μ}}_{k} - ({\overset{‒}{g}}_{k} + \frac{1}{2} h_{1}^{2} κ_{2} {\overset{‒}{g}}_{k}^{″} + {\overset{‒}{Δ}}_{k}) ∣ > n^{- C_{1}} h_{1}^{2}} = O (n^{- C_{2}})

(A.10)

for some C₁ > 0 and all C₂ > 0. Using Rosenthal's inequality, it can be proved from (3.6) and (3.7)(c) that, for some C₁ > 0 and all C₂ > 0,

P_{k} (sup_{I} ∣ {\overset{‒}{g}}_{k}^{″} - μ_{k}^{″} ∣ > n^{- C_{1}}) = O (n^{- C_{2}}) .

(A.11)

Together, (A.10) and (A.11) imply that

P_{k} {sup_{I} ∣ {\hat{μ}}_{k} - ({\overset{‒}{g}}_{k} + \frac{1}{2} h_{1}^{2} κ_{2} μ_{k}^{″} + {\overset{‒}{Δ}}_{k}) ∣ > n^{- C_{1}} h_{1}^{2}} = O (n^{- C_{2}}) .

(A.12)

Define $H^{2} = h^{2} + h_{1}^{2}$ ,

β_{k} = \int_{I} {{\overset{‒}{g}}_{1} - {\overset{‒}{g}}_{0} + \frac{1}{2} h_{1}^{2} κ_{2} (μ_{1}^{″} - μ_{0}^{″}) + {\overset{‒}{Δ}}_{1} - {\overset{‒}{Δ}}_{0}} \times {2 μ_{k} - ({\overset{‒}{g}}_{0} + {\overset{‒}{g}}_{1}) + h^{2} κ_{2} μ_{k}^{″} - \frac{1}{2} h_{1}^{2} κ_{2} (μ_{0}^{″} + μ_{1}^{″}) + 2 Δ - ({\overset{‒}{Δ}}_{0} + {\overset{‒}{Δ}}_{1})},

(A.13)

{\tilde{σ}}_{k}^{2} = 4 \int_{I} \int_{I} {{\overset{‒}{g}}_{1} - {\overset{‒}{g}}_{0} + \frac{1}{2} h_{1}^{2} κ_{2} (μ_{1}^{″} - μ_{0}^{″}) + {\overset{‒}{Δ}}_{1} - {\overset{‒}{Δ}}_{0}} (x_{1}) \times {{\overset{‒}{g}}_{1} - {\overset{‒}{g}}_{0} + \frac{1}{2} h_{1}^{2} κ_{2} (μ_{1}^{″} - μ_{0}^{″}) + {\overset{‒}{Δ}}_{1} - {\overset{‒}{Δ}}_{0}} (x_{2}) \times [G_{k} (x_{1}, x_{2}) + \frac{1}{2} h^{2} κ_{2} {G_{k}^{(2, 0)} (x_{1}, x_{2}) + G_{k}^{(0, 2)} (x_{1}, x_{2})}] d x_{1} d x_{2} .

Combining Lemma 1, (A.5)–(A.9) and (A.12), we deduce that, for some C₁ > 0 and all C₂ > 0,

\begin{matrix} P_{k} (∣ {\hat{β}}_{k} - β_{k} ∣ > n^{- C_{1}} H^{2}) = O (n^{- C_{2}}), \\ P_{k} (∣ {\hat{σ}}_{k}^{2} - {\tilde{σ}}_{k}^{2} ∣ > n^{- C_{1}} H^{2}) = O (n^{- C_{2}}) . \end{matrix}

(A.14)

Observe from (A.13) that $β_{k} = β_{k 0} + b_{k 1} + β_{k 1} + β_{k 2} + {\overset{‒}{Δ}}_{2}$ , where β_k0 is as at (3.2),

b_{k 1} = κ_{2} \int_{I} (μ_{1} - μ_{0}) (h^{2} μ_{k}^{″} - h_{1}^{2} μ_{1 - k}^{″}),

(A.15)

β_{k 1} = \int_{I} ({\overset{‒}{g}}_{1} - {\overset{‒}{g}}_{0}) {2 Δ - ({\overset{‒}{Δ}}_{0} + {\overset{‒}{Δ}}_{1})} + \int_{I} {2 μ_{k} - ({\overset{‒}{g}}_{0} + {\overset{‒}{g}}_{1})} ({\overset{‒}{Δ}}_{1} - {\overset{‒}{Δ}}_{0}),

β_{k 2} = \int_{I} {2 Δ - ({\overset{‒}{Δ}}_{0} + {\overset{‒}{Δ}}_{1})} ({\overset{‒}{Δ}}_{1} - {\overset{‒}{Δ}}_{0})

and ${\overset{‒}{Δ}}_{2} = β_{k} - (β_{k 0} + b_{k 1} + β_{k 1} + β_{k 2})$ . Using (A.4) it can be shown that, for some C₁ > 0 and all C₂ > 0, and when ℓ = 2,

P_{k} (∣ {\overset{‒}{Δ}}_{ℓ} ∣ > n^{- C_{1}} H^{2}) = O (n^{- C_{2}}) .

(A.16)
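The form of b_k1 at (A.15) can be recovered by collecting the nonrandom terms of order H² in (A.13). The worked step below is a sketch which assumes, as the definition of ${\overset{‒}{Δ}}_{2}$ suggests, that replacing ${\overset{‒}{g}}_{k}$ by μ_k in those terms contributes only to the remainder. It uses the identity 2μ_k − (μ₀ + μ₁) = (2k − 1)(μ₁ − μ₀) for k = 0, 1.

```latex
\begin{aligned}
b_{k1}
&= \tfrac{1}{2} h_1^2 \kappa_2 \int_I (\mu_1'' - \mu_0'') \{2\mu_k - (\mu_0 + \mu_1)\}
 + \kappa_2 \int_I (\mu_1 - \mu_0) \{h^2 \mu_k'' - \tfrac{1}{2} h_1^2 (\mu_0'' + \mu_1'')\} \\
&= \kappa_2 \int_I (\mu_1 - \mu_0) \bigl[ h^2 \mu_k''
 + \tfrac{1}{2} h_1^2 \{ (2k - 1)(\mu_1'' - \mu_0'') - (\mu_0'' + \mu_1'') \} \bigr] \\
&= \kappa_2 \int_I (\mu_1 - \mu_0) (h^2 \mu_k'' - h_1^2 \mu_{1-k}'') .
\end{aligned}
```

The last line follows because the quantity in braces on the second line equals −2μ₁″ when k = 0 and −2μ₀″ when k = 1.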

Hence, noting the first result in (A.14), we have:

P_{k} {∣ {\hat{β}}_{k} - (β_{k 0} + b_{k 1} + β_{k 1} + β_{k 2}) ∣ > n^{- C_{1}} H^{2}} = O (n^{- C_{2}}) .

(A.17)

Recall the definitions of $σ_{k}^{2}$ and $τ_{k}^{2}$ at (3.3) and (3.4), and put

σ_{k 0} = 2 h^{2} κ_{2} \int_{I} \int_{I} ({\overset{‒}{g}}_{1} - {\overset{‒}{g}}_{0}) (x_{1}) ({\overset{‒}{g}}_{1} - {\overset{‒}{g}}_{0}) (x_{2}) \times {G_{k}^{(2, 0)} (x_{1}, x_{2}) + G_{k}^{(0, 2)} (x_{1}, x_{2})} d x_{1} d x_{2},

(A.18)

σ_{k 1} = 4 h_{1}^{2} κ_{2} \int_{I} \int_{I} ({\overset{‒}{g}}_{1} - {\overset{‒}{g}}_{0}) (x_{1}) {(μ_{1} - μ_{0})}^{″} (x_{2}) G_{k} (x_{1}, x_{2}) d x_{1} d x_{2}

(A.19)

and ${\overset{‒}{Δ}}_{3} = {\tilde{σ}}_{k}^{2} - (σ_{k}^{2} + σ_{k 0} + σ_{k 1})$ . Thus, ${\overset{‒}{Δ}}_{3}$ is the term in ${\overset{‒}{Δ}}_{0}$ and ${\overset{‒}{Δ}}_{1}$ that arises when ${\tilde{σ}}_{k}^{2}$ is expanded. Using (A.4) it can be proved that (A.16) holds when ℓ = 3. Moreover, ${\hat{σ}}_{k}^{2}$ can be written as

{\hat{σ}}_{k}^{2} = σ_{k}^{2} + σ_{k 0} + σ_{k 1} + {\overset{‒}{Δ}}_{3} + {\overset{‒}{Δ}}_{4},

(A.20)

where, in view of the second part of (A.14), (A.16) holds in the case ℓ = 4 and for some C₁ > 0 and all C₂ > 0.

Define τ_kℓ to be equal to σ_kℓ, at (A.18) and (A.19), when ${\overset{‒}{g}}_{0}$ and ${\overset{‒}{g}}_{1}$ on the respective right-hand sides are replaced by μ₀ and μ₁. Then for k = 0, 1 and ℓ = 0, 1, noting property (3.7)(c) on the rates of increase of n₀ and n₁, it can be shown that for some C₁ > 0,

P_{k} (∣ σ_{k ℓ} - τ_{k ℓ} ∣ > n^{- C_{1}} h_{ℓ}^{2}) = O (n^{- C_{2}})

(A.21)

for all C₂ > 0, where we define h₀ = h. Therefore, if C₁ > 0 is sufficiently small,

\max_{k = 0, 1} \max_{ℓ = 0, 1} P_{k} (∣ σ_{k ℓ} ∣ > n^{- C_{1}}) = O (n^{- C_{2}})

(A.22)

for all C₂ > 0.

A.4. Approximation to ${\hat{σ}}_{k}^{- 1}$

In the notation at (A.20),

\frac{1}{{\hat{σ}}_{k}} = \frac{1}{τ_{k}} {(1 + \frac{σ_{k}^{2} - τ_{k}^{2}}{τ_{k}^{2}} + \frac{σ_{k 0} + σ_{k 1} + {\overset{‒}{Δ}}_{3} + {\overset{‒}{Δ}}_{4}}{τ_{k}^{2}})}^{- 1 ∕ 2} = s_{k} (\infty),

where, for 0 ≤ r ≤ ∞,

s_{k} (r) = \frac{1}{τ_{k}} \sum_{j = 0}^{r} \sum_{ℓ = 0}^{j} (\begin{matrix} - \frac{1}{2} \\ j \end{matrix}) (\begin{matrix} j \\ ℓ \end{matrix}) {(\frac{σ_{k}^{2} - τ_{k}^{2}}{τ_{k}^{2}})}^{j - ℓ} {(\frac{σ_{k 0} + σ_{k 1} + {\overset{‒}{Δ}}_{3} + {\overset{‒}{Δ}}_{4}}{τ_{k}^{2}})}^{ℓ} .
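The double series defining s_k(r) is the binomial expansion of (1 + a + b)^{−1/2}, with a and b standing for the two ratios inside the bracket. A minimal numerical sketch, with hypothetical values of a, b and τ_k, confirms that the partial sums converge to the closed form. It also shows the effect of keeping only the terms with ℓ ≤ 1, which is the truncation used later for t_k(r); there the second ratio is additionally replaced by its nonrandom counterpart, a step this sketch does not model.

```python
import math

def gen_binom(alpha, j):
    """Generalised binomial coefficient C(alpha, j) for real alpha."""
    out = 1.0
    for i in range(j):
        out *= (alpha - i) / (i + 1)
    return out

def s_partial(a, b, r, tau=1.0, max_l=None):
    """Partial sum over j <= r and l <= min(j, max_l) of the double series."""
    total = 0.0
    for j in range(r + 1):
        top = j if max_l is None else min(j, max_l)
        for l in range(top + 1):
            total += gen_binom(-0.5, j) * math.comb(j, l) * a ** (j - l) * b ** l
    return total / tau

a, b, tau = 0.05, -0.02, 2.0       # hypothetical small perturbations
full = s_partial(a, b, 30, tau)    # approximates s_k(infinity)
closed_form = (1.0 + a + b) ** -0.5 / tau
first_order = s_partial(a, b, 30, tau, max_l=1)
```

The difference between `full` and `first_order` is of order b², consistent with dropping the terms with ℓ ≥ 2 at the cost of a remainder that is quadratic in the second ratio.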

We claim that the infinite series defined by s_k(∞) converges with probability 1 − O(n^−C₂) for all C₂ > 0. To appreciate why, note that, by (3.6) and (3.7)(c), there exists C₁ > 0 such that

P_{k} (∣ σ_{k}^{2} - τ_{k}^{2} ∣ > n^{- C_{1}}) = O (n^{- C_{2}})

for all C₂ > 0. Combining this property, (A.16) for ℓ = 3 and 4, and (A.22), we deduce that, for some C₁ > 0 and all C₂ > 0,

P_{k} (∣ \frac{σ_{k}^{2} - τ_{k}^{2}}{τ_{k}^{2}} ∣ + ∣ \frac{σ_{k 0} + σ_{k 1} + {\overset{‒}{Δ}}_{3} + {\overset{‒}{Δ}}_{4}}{τ_{k}^{2}} ∣ \leq n^{- C_{1}}) = 1 - O (n^{- C_{2}}) .

Therefore, if C₃ > 0 is given then r₀ = r₀(C₃) ≥ 1 can be chosen so large that, whenever r₀ ≤ r ≤ ∞, $P_{k} {∣ {\hat{σ}}_{k}^{- 1} - s_{k} (r) ∣ > n^{- C_{3}}} = O (n^{- C_{2}})$ for all C₂ > 0. Using this property, (A.16) again for ℓ = 3 and 4, and (A.21), we see that for some C₁ > 0 and all C₂ > 0, if r₀ is chosen sufficiently large,

P_{k} {∣ {\hat{σ}}_{k}^{- 1} - t_{k} (r) ∣ > n^{- C_{1}} H^{2}} = O (n^{- C_{2}})

(A.23)

for r ≥ r₀, where

t_{k} (r) = \frac{1}{τ_{k}} \sum_{j = 0}^{r} \sum_{ℓ = 0}^{\min (j, 1)} (\begin{matrix} - \frac{1}{2} \\ j \end{matrix}) (\begin{matrix} j \\ ℓ \end{matrix}) {(\frac{σ_{k}^{2} - τ_{k}^{2}}{τ_{k}^{2}})}^{j - ℓ} {(\frac{τ_{k 0} + τ_{k 1}}{τ_{k}^{2}})}^{ℓ} .

(A.24)
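
The double sum defining s_k(r) is the generalised binomial series for (1 + u + v)^{−1/2}, where u = (σ_k² − τ_k²)/τ_k² and v is the second ratio in the display for 1/σ̂_k. The following is a minimal numerical sketch of that expansion, not part of the proof. The values of u and v, and the function names `gen_binom` and `s_trunc`, are illustrative assumptions; τ_k is set to 1.

```python
import math

def gen_binom(a, j):
    """Generalised binomial coefficient (a choose j) for real a, integer j >= 0."""
    out = 1.0
    for i in range(j):
        out *= (a - i) / (i + 1)
    return out

def s_trunc(u, v, r, tau=1.0):
    """Truncated double sum: (1/tau) * sum_{j<=r} sum_{l<=j} C(-1/2, j) C(j, l) u^(j-l) v^l."""
    total = 0.0
    for j in range(r + 1):
        for l in range(j + 1):
            total += gen_binom(-0.5, j) * math.comb(j, l) * u ** (j - l) * v ** l
    return total / tau

# Illustrative small perturbations (assumed values, not taken from the paper).
u, v = 0.05, -0.02
exact = (1.0 + u + v) ** -0.5

for r in (1, 3, 6):
    # The truncation error shrinks rapidly with r when |u| + |v| is small.
    print(r, abs(s_trunc(u, v, r) - exact))
```

The inner sum over ℓ collapses to (u + v)^j by the binomial theorem, so the series converges whenever |u + v| < 1; the probability bound preceding (A.23) ensures that |u| + |v| ≤ n^{−C₁} outside an event of negligible probability.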

A.5. Approximation to $E_{k} {Ψ_{k} (- {\hat{β}}_{k} ∕ {\hat{σ}}_{k})}$

Let C₁ > 0 and let ℓ₀ ≥ 0 be an integer. With U_kjℓ defined as at (2.6), let $E$ denote the event

E = E (C_{1}, ℓ_{0}) = {\max_{1 \leq ℓ \leq ℓ_{0}} \max_{j = 1, \dots, n_{k}} \sup_{x \in I} ∣ U_{k j ℓ} (x) - κ_{ℓ} f_{X} (x) ∣ \leq n^{- C_{1}}},

where κ_ℓ = ∫ u^ℓ K(u) du, which vanishes for odd ℓ since, by (3.7)(b), K is symmetric. It will be proved in Section A.6 that, for some C₁ > 0 and each ℓ₀ ≥ 0,

P_{k} {E (C_{1}, ℓ_{0})} = 1 - O (n^{- C_{2}}) \quad \text{for all } C_{2} > 0 .

(A.25)
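
As a small numerical illustration of the kernel moments κ_ℓ, the sketch below evaluates them by a midpoint rule. It assumes the Epanechnikov kernel purely as an example of a symmetric, compactly supported kernel; the paper's condition (3.7)(b) does not single out this choice, and the function names are hypothetical.

```python
def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1] (an assumed example kernel)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kappa(l, kernel=epanechnikov, a=-1.0, b=1.0, n=20000):
    """Midpoint-rule approximation to the integral of u^l K(u) over [a, b]."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        u = a + (i + 0.5) * h  # midpoint of the i-th subinterval
        total += (u ** l) * kernel(u)
    return h * total

# Odd moments cancel by symmetry; kappa(0) is the total mass of the kernel.
for l in range(4):
    print(l, kappa(l))
```

For this kernel κ₀ = 1 and κ₂ = 1/5, while κ₁ and κ₃ vanish up to rounding error, in line with the symmetry remark above.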

If $E (C_{1}, ℓ_{0})$ holds for an ℓ₀ ≥ 2 then, if $0 < C_{1}^{'} < C_{1}$ , there exists a nonrandom integer n₀ ≥ 1 such that the event $E_{1} = E_{1} (C_{1}^{'})$ , defined by

E_{1} = {\max_{j = 1, \dots, n_{k}} \sup_{x \in I} ∣ U_{k j 2} (x) U_{k j 0} (x) - U_{k j 1} {(x)}^{2} - κ_{2} f_{X} {(x)}^{2} ∣ \leq n^{- C_{1}^{'}}}

(A.26)

holds for all n ≥ n₀.

Let $I = I (E)$ denote the indicator of $E$ . In view of (A.25),

E_{k} {Ψ_{k} (- {\hat{β}}_{k} ∕ {\hat{σ}}_{k})} = E_{k} {I Ψ_{k} (- {\hat{β}}_{k} ∕ {\hat{σ}}_{k})} + O (n^{- C_{2}})

(A.27)

for all C₂ > 0, and so to approximate the term on the left-hand side of (A.27) we may develop an approximation to the first term on the right-hand side.

Let $G_{2}$ denote the sigma-field generated by the random variables X_i for 1 ≤ i ≤ m, and by X_kji and the functions g_kji for 1 ≤ i ≤ m_kj, 1 ≤ j ≤ n_k and k = 0, 1 (i.e., generated by everything except g and the experimental errors ε_i and ε_kji). The quantities I, t_k(r) at (A.24), β_k0 at (3.2), and b_k1 at (A.15) are all $G_{2}$ -measurable. Therefore, using (A.17) and (A.23), and noting that Ψ_k is an analytic function with all derivatives uniformly bounded, we obtain

\begin{matrix} E_{k} {I Ψ_{k} (- {\hat{β}}_{k} ∕ {\hat{σ}}_{k})} & = E_{k} (E_{k} [I Ψ_{k} {- (β_{k 0} + b_{k 1} + β_{k 1} + β_{k 2}) t_{k} (r)} ∣ G_{2}]) + o (H^{2}) \\ & = E_{k} [I Ψ_{k} {- β_{k 0} t_{k} (r)}] - b_{k 1} τ_{k}^{- 1} E_{k} [I Ψ_{k}^{'} {- β_{k 0} t_{k} (r)}] - τ_{k}^{- 1} E_{k} [E_{k} (β_{k 2} ∣ G_{2}) I Ψ_{k}^{'} {- β_{k 0} t_{k} (r)}] + \frac{1}{2} τ_{k}^{- 2} E_{k} [E_{k} (β_{k 1}^{2} ∣ G_{2}) I Ψ_{k}^{″} {- β_{k 0} t_{k} (r)}] \end{matrix}

(A.28)

+ O {{(m h)}^{- 2} + {(M_{sum} h_{1})}^{- 2}} + o (H^{2}) .

(A.29)

Here we have used the properties $E_{k} (β_{k 1} ∣ G_{2}) = 0$ , $E_{k} ∣ t_{k} (r) - τ_{k}^{- 1} ∣ = O (n^{- C})$ for some C > 0,

E_{k} [E_{k} (β_{k ℓ_{1}}^{ℓ_{2}} ∣ G_{2}) I Ψ_{k}^{(ℓ_{2})} {- β_{k 0} t_{k} (r)}] = O {{(m h)}^{- 2} + {(M_{sum} h_{1})}^{- 2}}

for ℓ₂ ≥ 3 if ℓ₁ = 1, and for ℓ₂ ≥ 2 if ℓ₁ = 2, and

∣ E_{k} [E_{k} (β_{k 1} β_{k 2} ∣ G_{2}) I Ψ_{k}^{″} {- β_{k 0} t_{k} (r)}] ∣ = O {{(m h)}^{- 2} + {(M_{sum} h_{1})}^{- 2}} .

Further, we have used the fact that the event $E_{1}$ , defined at (A.26), obtains whenever I ≠ 0.

In addition,

\begin{matrix} \frac{1}{4} E_{k} [E_{k} (β_{k 1}^{2} ∣ G_{2}) I] & = E_{k} {I \int_{I} ({\overset{‒}{g}}_{0} - {\overset{‒}{g}}_{1}) Δ}^{2} + E_{k} {I \int_{I} ({\overset{‒}{g}}_{0} - μ_{k}) {\overset{‒}{Δ}}_{0}}^{2} + E_{k} {I \int_{I} ({\overset{‒}{g}}_{1} - μ_{k}) {\overset{‒}{Δ}}_{1}}^{2} \\ & = O (m^{- 1}), \end{matrix}

(A.30)

that

\begin{matrix} E_{k} [E_{k} (β_{k 2} ∣ G_{2}) I Ψ_{k}^{'} {- β_{k 0} t_{k} (r)}] & = {(- 1)}^{k + 1} ϕ (b_{k 0} ∕ τ_{k}) \int_{I} E_{k} [I {E_{k} ({\overset{‒}{Δ}}_{0}^{2} ∣ G_{2}) - E_{k} ({\overset{‒}{Δ}}_{1}^{2} ∣ G_{2})}] + o {{(ν_{0} h_{1})}^{- 1}} \\ & = \frac{κ}{h_{1}} (σ_{ε 0}^{2} ν_{0}^{- 1} - σ_{ε 1}^{2} ν_{1}^{- 1}) {(- 1)}^{k + 1} ϕ (b_{k 0} ∕ τ_{k}) \int_{I} f_{X}^{- 1} + o {{(ν_{0} h_{1})}^{- 1}} \end{matrix}

(A.31)

and that

b_{k 1} τ_{k}^{- 1} E_{k} [I Ψ_{k}^{'} {- β_{k 0} t_{k} (r)}] = b_{k 1} τ_{k}^{- 1} {(- 1)}^{k + 1} ϕ (b_{k 0} ∕ τ_{k}) + o (H^{2}),

(A.32)

where b_k0 and b_k1 are as at (3.2) and (A.15), ϕ is the standard normal density, and we have used the fact that $Ψ_{k}^{'} = {(- 1)}^{k + 1} ϕ$ . Combining (A.25) and (A.27)–(A.32), and taking r sufficiently large (but fixed), we deduce that

\begin{matrix} E_{k} {Ψ_{k} (- {\hat{β}}_{k} ∕ {\hat{σ}}_{k})} & = E_{k} [Ψ_{k} {- β_{k 0} ∕ σ_{k}}] - b_{k 1} τ_{k}^{- 1} {(- 1)}^{k + 1} ϕ (b_{k 0} ∕ τ_{k}) \\ & - \frac{κ}{τ_{k} h_{1}} (σ_{ε 0}^{2} ν_{0}^{- 1} - σ_{ε 1}^{2} ν_{1}^{- 1}) {(- 1)}^{k + 1} ϕ (b_{k 0} ∕ τ_{k}) \int_{I} f_{X}^{- 1} \\ & + O {m^{- 1} + {(m h)}^{- 2}} + o {H^{2} + {(ν_{0} h_{1})}^{- 1}} . \end{matrix}

(A.33)

Result (3.8) follows from (A.5) and (A.33).

A.6. Proof of Lemma 1 and (A.25)

The results in Lemma 1, with the exception of (A.4), and also result (A.25), will follow if we show that for each ℓ ≥ 1, some C₁ > 0 and all C₂ > 0,

P_{k} {\sup_{x \in I} ∣ U_{ℓ} (x) - κ_{ℓ} f_{X} (x) ∣ > n^{- C_{1}}} = O (n^{- C_{2}}),

(A.34)

P_{k} {\max_{j = 1, \dots, n_{k}} \sup_{x \in I} ∣ U_{k j ℓ} (x) - κ_{ℓ} f_{X} (x) ∣ > n^{- C_{1}}} = O (n^{- C_{2}}) .

(A.35)

We shall derive (A.35); a proof of (A.34) is similar.

Markov's inequality can be used to prove that

\max_{j = 1, \dots, n_{k}} \sup_{x \in I} P_{k} {∣ U_{k j ℓ} (x) - κ_{ℓ} f_{X} (x) ∣ > n^{- C_{1}}} = O (n^{- C_{2}}) .

(A.36)

It follows from (3.7)(c) that each n_k is increasing no faster than polynomially in n, and therefore, if we confine attention to x in a subset $I_{n}$ , say, of $I$ that contains only O(n^C) points for some C > 0, we can place the maximum and supremum inside the probability statement at (A.36), provided that $I$ is replaced by $I_{n}$ : for some C₁ > 0 and all C₂ > 0,

P_{k} {\max_{j = 1, \dots, n_{k}} \sup_{x \in I_{n}} ∣ U_{k j ℓ} (x) - κ_{ℓ} f_{X} (x) ∣ > n^{- C_{1}}} = O (n^{- C_{2}}) .

(A.37)

The assumption, in (3.7)(b), that K is compactly supported and Hölder continuous, and the implication, in (3.5)(c), that f_X is also Hölder continuous, enable (A.35) to be derived directly from (A.37) by taking $I_{n}$ to be a sufficiently fine grid in $I$ .
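
The passage from (A.36) to (A.37) is a union bound. As a sketch of that step, write N_n for the number of points in $I_{n}$ and suppose that n_k = O(n^c) and N_n = O(n^C); the symbols N_n, c and C₂′ are notation introduced here for illustration only.

```latex
\begin{aligned}
P_k\Big\{\max_{j = 1, \dots, n_k} \sup_{x \in I_n}
   |U_{kj\ell}(x) - \kappa_\ell f_X(x)| > n^{-C_1}\Big\}
 &\le \sum_{j=1}^{n_k} \sum_{x \in I_n}
   P_k\big\{|U_{kj\ell}(x) - \kappa_\ell f_X(x)| > n^{-C_1}\big\} \\
 &\le n_k \, N_n \max_{j = 1, \dots, n_k} \sup_{x \in I}
   P_k\big\{|U_{kj\ell}(x) - \kappa_\ell f_X(x)| > n^{-C_1}\big\} \\
 &= O\big(n^{c + C - C_2'}\big) .
\end{aligned}
```

Because (A.36) holds for every C₂′ > 0, taking C₂′ = C₂ + c + C gives the bound O(n^{−C₂}) in (A.37) for arbitrary C₂ > 0.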

A proof of (A.4) in Lemma 1 is similar. To illustrate the argument, we derive the following part of (A.4): for all C₂, C₄ > 0,

P_{k} {\sup_{x \in I} ∣ Δ (x) ∣ > n^{C_{4}} {(m h)}^{- 1 ∕ 2}} = O (n^{- C_{2}}) .

(A.38)

Using Markov's and Rosenthal's inequalities, we first obtain the result when the supremum is outside the probability statement:

\sup_{x \in I} P_{k} {∣ Δ (x) ∣ > n^{C_{4}} {(m h)}^{- 1 ∕ 2}} = O (n^{- C_{2}}) .

Taking $I_{n}$ to contain only O(n^C) points, for any fixed C > 0, we deduce that

P_{k} {\sup_{x \in I_{n}} ∣ Δ (x) ∣ > n^{C_{4}} {(m h)}^{- 1 ∕ 2}} = O (n^{- C_{2}}),

and taking $I_{n}$ to be a sufficiently fine grid in $I$ we obtain (A.38).

Footnotes

SUPPLEMENTARY MATERIAL

Supplement to “Unexpected properties of bandwidth choice when smoothing discrete data for constructing a functional data classifier” (DOI: 10.1214/13-AOS1158SUPP; .pdf). The supplementary file contains the proof of Theorems 2 and 3, as well as additional simulation results.

REFERENCES

Araki Y, Konishi S, Kawano S, Matsui H. Functional logistic discrimination via regularized basis expansions. Comm. Statist. Theory Methods. 2009;38:2944–2957. MR2568196.
Benhennia K, Degras D. Local polynomial regression based on functional data. 2011 Unpublished manuscript. Available at http://arxiv.org/pdf/1107.4058v1.
Berlinet A, Biau G, Rouvière L. Functional supervised classification with wavelets. Ann. I.S.U.P. 2008;52:61–80. MR2435041.
Biau G, Bunea F, Wegkamp MH. Functional classification in Hilbert spaces. IEEE Trans. Inform. Theory. 2005;51:2163–2172. MR2235289.
Cardot H, Degras D, Josserand E. Confidence bands for Horvitz–Thompson estimators using sampled noisy functional data. Bernoulli. 2013;19:2067–2097.
Cardot H, Josserand E. Horvitz–Thompson estimators for functional data: Asymptotic confidence bands and optimal allocation for stratified sampling. Biometrika. 2011;98:107–118. MR2804213.
Carroll RJ, Delaigle A, Hall P. Supplement to “Unexpected properties of bandwidth choice when smoothing discrete data for constructing a functional data classifier.” 2013. DOI:10.1214/13-AOS1158SUPP.
Cuevas A, Febrero M, Fraiman R. Robust estimation and classification for functional data via projection-based depth notions. Comput. Statist. 2007;22:481–496. MR2336349.
Delaigle A, Hall P. Achieving near perfect classification for functional data. J. R. Stat. Soc. Ser. B Stat. Methodol. 2012;74:267–286. MR2899863.
Delaigle A, Hall P, Bathia N. Componentwise classification and clustering of functional data. Biometrika. 2012;99:299–313. MR2931255.
Epifanio I. Shape descriptors for classification of functional data. Technometrics. 2008;50:284–294. MR2528652.
Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability. Chapman & Hall; London: 1996. p. 66. MR1383587.
Fromont M, Tuleau C. Functional classification with margin conditions. In: Carbonell JG, Siekmann J, editors. Learning Theory: Proceedings of the 19th Annual Conference on Learning Theory, Pittsburgh; New York: Springer; 2006.
Hall P, Hosseini-Nasab M. On properties of functional principal components analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006;68:109–126. MR2212577.
Hall P, Hosseini-Nasab M. Theory for high-order bounds in functional principal components analysis. Math. Proc. Cambridge Philos. Soc. 2009;146:225–256. MR2461880.
Hall P, Kang K-H. Bandwidth choice for nonparametric classification. Ann. Statist. 2005;33:284–306. MR2157804.
Hall P, Van Keilegom I. Two-sample tests in functional data analysis starting from discrete data. Statist. Sinica. 2007;17:1511–1531. MR2413533.
Leng X, Müller H-G. Classification using functional data analysis for temporal gene expression data. Bioinformatics. 2006;22:68–76. doi: 10.1093/bioinformatics/bti742.
Li Y, Hsing T. Deciding the dimension of effective dimension reduction space for functional and high-dimensional data. Ann. Statist. 2010a;38:3028–3062. MR2722463.
Li Y, Hsing T. Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. Ann. Statist. 2010b;38:3321–3351. MR2766854.
López-Pintado S, Romo J. Depth-based classification for functional data. (DIMACS Series in Discrete Mathematics and Theoretical Computer Science). Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications. 2006;72:103–119. Amer. Math. Soc., Providence, RI. MR2343116.
Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge Univ. Press; Cambridge: 2008.
Panaretos VM, Kraus D, Maddocks JH. Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. J. Amer. Statist. Assoc. 2010;105:670–682. MR2724851.
Ramsay JO, Silverman BW. Functional Data Analysis. 2nd ed. Springer; New York: 2005. MR2168993.
Rossi F, Villa N. Support vector machine for functional data classification. Neurocomputing. 2006;69:730–742.
Ruppert D, Sheather SJ, Wand MP. An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc. 1995;90:1257–1270. MR1379468.
Vilar JA, Pértega S. Discriminant and cluster analysis for Gaussian stationary processes: Local linear fitting approach. J. Nonparametr. Stat. 2004;16:443–462. MR2073035.
Wang X, Ray S, Mallick BK. Bayesian curve classification using wavelets. J. Amer. Statist. Assoc. 2007;102:962–973. MR2354408.
Wu S, Müller H-G. Response-adaptive regression for longitudinal data. Biometrics. 2011;67:852–860. doi: 10.1111/j.1541-0420.2010.01518.x. MR2829259.


