Author manuscript; available in PMC: 2014 Oct 9.
Published in final edited form as: Ann Stat. 2013 Dec 17;41(6):2739–2767. doi: 10.1214/13-AOS1158

UNEXPECTED PROPERTIES OF BANDWIDTH CHOICE WHEN SMOOTHING DISCRETE DATA FOR CONSTRUCTING A FUNCTIONAL DATA CLASSIFIER

Raymond J Carroll, Aurore Delaigle, Peter Hall
PMCID: PMC4191932  NIHMSID: NIHMS564605  PMID: 25309640

Abstract

The data functions that are studied in the course of functional data analysis are assembled from discrete data, and the level of smoothing that is used is generally that which is appropriate for accurate approximation of the conceptually smooth functions that were not actually observed. Existing literature shows that this approach is effective, and even optimal, when using functional data methods for prediction or hypothesis testing. However, in the present paper we show that this approach is not effective in classification problems. There a useful rule of thumb is that undersmoothing is often desirable, but there are several surprising qualifications to that approach. First, the effect of smoothing the training data can be more significant than that of smoothing the new data set to be classified; second, undersmoothing is not always the right approach, and in fact in some cases using a relatively large bandwidth can be more effective; and third, these perverse results are the consequence of very unusual properties of error rates, expressed as functions of smoothing parameters. For example, the orders of magnitude of optimal smoothing parameter choices depend on the signs and sizes of terms in an expansion of error rate, and those signs and sizes can vary dramatically from one setting to another, even for the same classifier.

Keywords: Centroid method, discrimination, kernel smoothing, quadratic discrimination, smoothing parameter choice, training data

1. Introduction

All supposedly “functional” data are actually observed discretely, sometimes on a grid and on other occasions at randomly scattered points. For example, in longitudinal data analysis the observation points are often widely spaced and irregularly placed, and substantial smoothing is commonly used to convert discrete data like these to functions. The impact of such smoothing has been addressed in the context of prediction or hypothesis testing for functional data; see, for example, Hall and Van Keilegom (2007), Panaretos, Kraus and Maddocks (2010), Wu and Müller (2011), Benhennia and Degras (2011), Cardot and Josserand (2011) and Cardot, Degras and Josserand (2013). The main conclusion of these papers has been that conventional rules for smoothing discrete data typically apply, and that smoothing parameters of standard size generally are appropriate.

In contrast, the present paper was motivated by numerical work indicating that, in the context of classifying functional data, smoothing parameters of highly nonstandard sizes are appropriate, and more generally that, even for a relatively simple classifier, there is no simple precept (even an asymptotic prescription of size) that leads to minimisation of error rate. If one had to give a rule, valid in some but by no means all cases, it would be to undersmooth, but even there unexpected caveats must be addressed.

For example, it turns out that the impact of smoothing the training data can be more significant than that of smoothing the new data to be classified. Indeed, the effect of smoothing the new data is characteristic of a parametric problem, rather than a nonparametric one. There, asymptotic arguments indicate that $(\text{sample size})^{-1/2}$ is an appropriate bandwidth size for reducing the impact of smoothing to parametric levels, whereas $(\text{sample size})^{-1/3}$ is the nearest analogue for smoothing the training data.

However, both these recommendations are incorrect in many cases. Depending on the signs and sizes of certain functionals of the data distributions, it can be optimal to use smoothing parameters that are an order of magnitude smaller, or an order of magnitude larger, than these. From some viewpoints the need for a low level of smoothing is intuitively clear. Indeed, we expect that relatively minor features of a curve, of the sort that might disappear if we were too enthusiastic in the smoothing step, could have important information to convey in a classification analysis. On the other hand, our results show that very high levels of smoothing are sometimes advantageous.

We drew these conclusions after studying three different classifiers for functional data: the standard centroid-based method, the scale-adjusted form of that approach, and a version for functional data of quadratic discrimination. Our conclusions are valid for all three approaches, although they contradict conclusions which are well known for standard nonparametric approximations to the Bayes classifier in multivariate, rather than functional-data, settings. Specifically, for univariate and functional data, and nonparametric Bayes classifiers, conventional smoothing parameters, for example, those chosen using standard plug-in rules for function estimation, typically are of the correct order even though they do not quite minimise asymptotic classification error; see, for example, Hall and Kang (2005). Moreover, there does not exist a version of our results in univariate or multivariate settings, since there is no analogue in such cases of the “lattice effect,” represented by the $m_{kj}$'s.

To these comments, we should add that in practice there is relatively little difficulty in choosing smoothing parameters to minimise error rate; cross-validation is usually effective. Our aim in this paper is therefore not to develop methods for choosing the bandwidth optimally, or nearly optimally, in classification problems, but to provide an understanding of the many aspects of those problems that conspire together to determine the optimal choice.

2. Model and methodology

2.1. Model

We consider $n_0$ (resp., $n_1$) unknown random functions $\{g_{0j}, 1 \le j \le n_0\}$ (resp., $\{g_{1j}, 1 \le j \le n_1\}$) coming from two populations. We observe a training sample of data pairs $D_{kj} = \{(X_{kji}, Y_{kji}), 1 \le i \le m_{kj}\}$, for $1 \le j \le n_k$ and $k = 0, 1$, corresponding to noisy versions of the $g_{kj}$'s sampled at a discrete set of random points (i.e., the $X_{kji}$'s) and generated by the model

$$Y_{kji} = g_{kj}(X_{kji}) + \varepsilon_{kji}, \tag{2.1}$$

where $k$ indexes the population, $\Pi_k$, from which the data in $D_{kj}$ came, $j$ denotes the index of an individual drawn from $\Pi_k$, and $i$ is the index of a data pair $(X_{kji}, Y_{kji})$ for the $j$th individual from the $k$th population.

The $g_{kj}$ are random functions defined on a compact interval $I$, but observed only at $m_{kj}$ points $X_{kj1}, \ldots, X_{kjm_{kj}}$. These points may be fixed or random, and although we shall develop our arguments in the random case, they can easily be extended to the fixed case. We assume that each $g_{kj}$ has two bounded derivatives on $I$; the respective sequences of $X$'s and $\varepsilon$'s are each identically distributed with distributions that do not depend on the $g$'s; the $g$'s, $X$'s and $\varepsilon$'s are all mutually independent; the $X$'s are supported on $I$; and the $\varepsilon$'s have zero mean and finite variance.

We also observe a new data set $D = \{(X_i, Y_i), 1 \le i \le m\}$, similar to the $D_{kj}$'s except that in this case we do not know which population the data come from. Here,

$$Y_i = g(X_i) + \varepsilon_i, \tag{2.2}$$

where the function $g$, the $X$'s and the $\varepsilon$'s have the properties given in the previous paragraph. Using the training data, we wish to determine whether $D$ came from $\Pi_0$ or $\Pi_1$.
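As a rough illustration of the sampling model in (2.1) and (2.2), the following sketch draws one discretely observed, noisy curve. The function name `simulate_curve`, the uniform design on [0, 1], the Gaussian errors and the sine curve are all assumptions made for this example only; the paper requires just that the design points lie in a compact interval and that the errors have zero mean and finite variance.

```python
import math
import random

def simulate_curve(g, m, sigma, rng, a=0.0, b=1.0):
    """Return m noisy pairs (X_i, Y_i) with Y_i = g(X_i) + eps_i.

    Assumptions for illustration: X_i uniform on [a, b], eps_i ~ N(0, sigma^2).
    """
    xs = sorted(rng.uniform(a, b) for _ in range(m))   # random design points
    ys = [g(x) + rng.gauss(0.0, sigma) for x in xs]    # add experimental error
    return xs, ys

rng = random.Random(0)
xs, ys = simulate_curve(lambda x: math.sin(2 * math.pi * x), m=50, sigma=0.1, rng=rng)
```

A training sample would repeat this call for each individual $j$ in each population $k$, possibly with a different number of points $m_{kj}$ per curve.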

In the functional data literature [see, e.g., Ramsay and Silverman (2005)], when the data are noisy, it is common to preprocess them prior to further analysis. Typically, this is done by smoothing the data in some way, for example, through a spline or kernel smoother, thereby obtaining, from the data in $D_{kj}$ and $D$, estimators $\hat g_{kj}$ and $\hat g$ of $g_{kj}$ and $g$, respectively. In the classification context, once these estimators have been derived, they are plugged into functional data classifiers, replacing there the unobserved functions $g$ and $g_{kj}$ by their estimators $\hat g$ and $\hat g_{kj}$. Our aim in this paper is to describe the application of estimators $\hat g$ and $\hat g_{kj}$ of $g$ and $g_{kj}$, and in particular to describe the influence of tuning parameters used to construct them, when the aim is classification rather than just function estimation.

2.2. Estimating g, gkj and their mean and covariance functions

There are several ways to obtain nonparametric estimators of the functions $g$ and $g_{kj}$, but the most popular ones are spline and local linear methods. They have similar properties, but since local linear estimators are much more tractable theoretically, we shall use these in this work. For $x \in I$, the local linear estimators of $g$ and $g_{kj}$ are defined by

$$\hat g(x) = \frac{U_2(x)V_0(x) - U_1(x)V_1(x)}{U_2(x)U_0(x) - U_1^2(x)}, \qquad \hat g_{kj}(x) = \frac{U_{kj2}(x)V_{kj0}(x) - U_{kj1}(x)V_{kj1}(x)}{U_{kj2}(x)U_{kj0}(x) - U_{kj1}^2(x)}, \tag{2.3}$$

where

$$U_\ell(x) = \frac{1}{m}\sum_{i=1}^{m}\Big(\frac{x - X_i}{h}\Big)^{\ell} K_h(x - X_i), \tag{2.4}$$

$$V_\ell(x) = \frac{1}{m}\sum_{i=1}^{m} Y_i \Big(\frac{x - X_i}{h}\Big)^{\ell} K_h(x - X_i), \tag{2.5}$$

$$U_{kj\ell}(x) = \frac{1}{m_{kj}}\sum_{i=1}^{m_{kj}}\Big(\frac{x - X_{kji}}{h_1}\Big)^{\ell} K_{h_1}(x - X_{kji}), \tag{2.6}$$

$$V_{kj\ell}(x) = \frac{1}{m_{kj}}\sum_{i=1}^{m_{kj}} Y_{kji}\Big(\frac{x - X_{kji}}{h_1}\Big)^{\ell} K_{h_1}(x - X_{kji}),$$

$K$ is a kernel function, $h > 0$ and $h_1 > 0$ are bandwidths, and $K_h(x) = K(x/h)/h$. See, for example, Fan and Gijbels (1996). For simplicity, throughout we use the same bandwidth $h_1$ for each population and each individual, but we could have replaced $h_1$ by bandwidths that depended on $k$ and $j$, as we do in our numerical work.
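The local linear estimator at (2.3) can be sketched directly from the sums $U_\ell$ and $V_\ell$. This is a minimal illustration rather than the authors' implementation; the Epanechnikov kernel and the names `epanechnikov` and `local_linear` are choices made here for the example.

```python
def epanechnikov(u):
    """Symmetric, compactly supported kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def local_linear(x, xs, ys, h, kernel=epanechnikov):
    """Local linear estimate at x, built from U_l (l = 0, 1, 2) and V_l (l = 0, 1)."""
    m = len(xs)
    U = [0.0, 0.0, 0.0]
    V = [0.0, 0.0]
    for xi, yi in zip(xs, ys):
        t = (x - xi) / h
        w = kernel(t) / h                     # K_h(x - X_i) = K((x - X_i)/h) / h
        for l in range(3):
            U[l] += (t ** l) * w / m
        for l in range(2):
            V[l] += yi * (t ** l) * w / m
    denom = U[2] * U[0] - U[1] ** 2
    if denom <= 0.0:
        raise ValueError("too few design points within bandwidth h of x")
    return (U[2] * V[0] - U[1] * V[1]) / denom

# Noise-free linear data are reproduced exactly, whatever the bandwidth.
grid = [i / 20 for i in range(21)]
line = [2.0 + 3.0 * x for x in grid]
fitted = local_linear(0.5, grid, line, h=0.2)
```

The same function gives $\hat g_{kj}$ when called with the training pairs of curve $(k, j)$ and bandwidth $h_1$.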

The classifiers we consider in this work require estimators of the population means and covariances. For $k = 0, 1$, let $\mu_k$ denote the mean function

$$\mu_k = E_k(g) = E_k(g_{kj}), \tag{2.7}$$

where $E_k$ represents expectation under the assumption that the data come from $\Pi_k$. Also, let $G_k$ be the covariance function, defined by $G_k(u, v) = \operatorname{cov}_k\{g(u), g(v)\} = E_k\{g(u)g(v)\} - \mu_k(u)\mu_k(v)$, where $\operatorname{cov}_k$ denotes covariance when the data come from $\Pi_k$. Estimators $\hat\mu_k$ and $\hat G_k$ of $\mu_k$ and $G_k$ are defined in the standard way by the empirical mean and covariance functions, but replacing, in the definitions of these estimators, the unobserved $g_{kj}$ by $\hat g_{kj}$:

$$\hat\mu_k = \frac{1}{n_k}\sum_{j=1}^{n_k} \hat g_{kj}, \tag{2.8}$$

$$\hat G_k(u, v) = \frac{1}{n_k}\sum_{j=1}^{n_k} \{\hat g_{kj}(u) - \hat\mu_k(u)\}\{\hat g_{kj}(v) - \hat\mu_k(v)\}. \tag{2.9}$$

See, for example, Ramsay and Silverman (2005), Chapter 2.
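On a computer the estimators at (2.8) and (2.9) are typically evaluated on a grid. The sketch below assumes, purely for illustration, that each smoothed curve $\hat g_{kj}$ has already been evaluated on a common grid; the name `mean_and_cov` is hypothetical.

```python
def mean_and_cov(curves):
    """Empirical mean and covariance of curves evaluated on a common grid.

    curves: list of n_k lists, one per smoothed curve; divisor n_k as in (2.9).
    """
    n = len(curves)
    T = len(curves[0])
    mu = [sum(c[t] for c in curves) / n for t in range(T)]
    G = [[sum((c[u] - mu[u]) * (c[v] - mu[v]) for c in curves) / n
          for v in range(T)] for u in range(T)]
    return mu, G

mu_hat, G_hat = mean_and_cov([[0.0, 0.0], [2.0, 4.0]])
```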

Consider the spectral decomposition of the covariance function

$$G_k(u, v) = \sum_{\ell=1}^{\infty} \theta_{k\ell}\,\psi_{k\ell}(u)\,\psi_{k\ell}(v), \tag{2.10}$$

where $(\theta_{k\ell}, \psi_{k\ell})$ is an (eigenvalue, eigenfunction) pair for the linear operator $G_k$ defined by $G_k(\psi)(u) = \int_I G_k(u, v)\psi(v)\,dv$, and where, following convention, we have used the notation $G_k$ for both the operator and the covariance. The terms in (2.10) are ordered such that $\theta_{k1} \ge \theta_{k2} \ge \cdots \ge 0$. If $g$ is drawn from $\Pi_k$ then we can write

$$g(x) = \mu_k(x) + \sum_{\ell=1}^{\infty} Z_{k\ell}\,\theta_{k\ell}^{1/2}\,\psi_{k\ell}(x), \tag{2.11}$$

where $\mu_k = E_k(g)$ denotes the mean of the random process of which $g$ is a realisation, $Z_{k\ell} = \theta_{k\ell}^{-1/2}\int_I (g - \mu_k)\psi_{k\ell}$, and the $Z_{k\ell}$'s (for $\ell = 1, 2, \ldots$) comprise a sequence of uncorrelated random variables with zero mean and unit variance. The quantities $\theta_{k\ell}$ and $\psi_{k\ell}$ can be estimated consistently by the eigenvalues and eigenfunctions $\hat\theta_{k\ell}$ and $\hat\psi_{k\ell}$ of the linear operator $\hat G_k$, defined by $\hat G_k(\psi)(u) = \int_I \hat G_k(u, v)\psi(v)\,dv$, with the covariance estimator $\hat G_k$ defined as at (2.9):

$$\hat G_k(u, v) = \sum_{\ell=1}^{\infty} \hat\theta_{k\ell}\,\hat\psi_{k\ell}(u)\,\hat\psi_{k\ell}(v), \tag{2.12}$$

where $\hat\theta_{k1} \ge \hat\theta_{k2} \ge \cdots \ge 0$, and, since $\hat\theta_{k\ell} = 0$ for all $\ell > n_k$, all but the first $n_k$ terms in the series at (2.12) vanish. See Hall and Hosseini-Nasab (2006, 2009) for properties of these estimators in the case where $g$ and $g_{kj}$ are observed; see also Li and Hsing (2010a, 2010b) for other cases.
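To make the decomposition at (2.12) concrete, the leading pair $(\hat\theta_{k1}, \hat\psi_{k1})$ can be approximated from a gridded covariance by power iteration. Power iteration is a technique swapped in here for illustration, not one the paper prescribes; the grid spacing `dt`, the Riemann-sum approximation of the integral operator and the name `leading_eigenpair` are assumptions. The method can fail if the starting vector is orthogonal to the leading eigenfunction, and in practice a library eigensolver would normally be used.

```python
import math

def leading_eigenpair(G, dt, iters=200):
    """Approximate the largest eigenvalue and eigenfunction of the operator
    psi -> integral of G(u, v) psi(v) dv, discretised with grid spacing dt."""
    T = len(G)
    psi = [1.0] * T                                   # arbitrary starting function
    theta = 0.0
    for _ in range(iters):
        norm = math.sqrt(dt * sum(p * p for p in psi))
        psi = [p / norm for p in psi]                 # enforce integral of psi^2 = 1
        image = [dt * sum(G[u][v] * psi[v] for v in range(T)) for u in range(T)]
        theta = dt * sum(p * q for p, q in zip(psi, image))   # Rayleigh quotient
        if all(q == 0.0 for q in image):
            break
        psi = image
    norm = math.sqrt(dt * sum(p * p for p in psi))
    return theta, [p / norm for p in psi]

# Rank-one covariance G(u, v) = a(u) a(v) with a = (1, 2) on a two-point grid.
theta1, psi1 = leading_eigenpair([[1.0, 2.0], [2.0, 4.0]], dt=0.5)
```

For this rank-one example the single nonzero eigenvalue is $\int a^2 \approx 0.5 \times (1 + 4) = 2.5$.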

2.3. Constructing classifiers

Classifiers for functional data have received a great deal of attention in the literature. See, for example, Vilar and Pértega (2004), Biau, Bunea and Wegkamp (2005), Fromont and Tuleau (2006), Leng and Müller (2006), López-Pintado and Romo (2006), Rossi and Villa (2006), Cuevas, Febrero and Fraiman (2007), Wang, Ray and Mallick (2007), Berlinet, Biau and Rouvière (2008), Epifanio (2008), Araki et al. (2009), Delaigle and Hall (2012) and Delaigle, Hall and Bathia (2012).

In those papers the authors suggest methods for constructing classifiers, but so far the theoretical impact of smoothing (that is, the impact of using $\hat g$ and $\hat g_{kj}$ instead of $g$ and $g_{kj}$ when constructing classifiers) has been largely ignored in the literature. In this paper, we study this impact of smoothing for three relatively simple functional classifiers: the centroid classifier, or Rocchio classifier [see, e.g., Manning, Raghavan and Schütze (2008)], commonly used for classifying high-dimensional data; a scaled version of this classifier, which we define below in a general way; and a version for functional data of Fisher's quadratic discriminant, studied, for example, by Leng and Müller (2006) and Delaigle and Hall (2012). These classifiers are usually defined in terms of the functions $g$ and $g_{kj}$, and here we shall define them in terms of $\hat g$ and $\hat g_{kj}$. The standard versions of these classifiers are obtained by replacing $\hat g$ and $\hat g_{kj}$ by $g$ and $g_{kj}$. The functions $\hat g_{kj}$ appear only implicitly through the estimated means and covariance functions constructed in Section 2.2.

In the present setting, the centroid-based classifier assigns the curve $g$, observed through $D$, to $\Pi_0$ if the statistic

$$S(\hat g) = \int_I \{\hat g(t) - \hat\mu_0(t)\}^2\,dt - \int_I \{\hat g(t) - \hat\mu_1(t)\}^2\,dt \tag{2.13}$$

is negative, and to $\Pi_1$ if $S(\hat g) > 0$.
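A minimal numerical sketch of the rule based on (2.13) follows. It assumes the smoothed curve and the two estimated means have been evaluated on an equally spaced grid with spacing `dt`, and it approximates the integrals by the trapezoidal rule; both choices, the function names, and the convention of sending ties to $\Pi_0$ are assumptions of this example.

```python
def integrate(values, dt):
    """Trapezoidal rule on an equally spaced grid."""
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))

def centroid_statistic(g_hat, mu0_hat, mu1_hat, dt):
    """S(g_hat): squared L2 distance to mean 0 minus squared L2 distance to mean 1."""
    d0 = integrate([(g - a) ** 2 for g, a in zip(g_hat, mu0_hat)], dt)
    d1 = integrate([(g - b) ** 2 for g, b in zip(g_hat, mu1_hat)], dt)
    return d0 - d1

def classify_centroid(g_hat, mu0_hat, mu1_hat, dt):
    """Return 0 for population Pi_0 and 1 for Pi_1 (ties go to Pi_0 by assumption)."""
    return 0 if centroid_statistic(g_hat, mu0_hat, mu1_hat, dt) <= 0 else 1

mu0 = [0.0, 0.0, 0.0]
mu1 = [1.0, 1.0, 1.0]
label = classify_centroid([0.0, 0.0, 0.0], mu0, mu1, dt=0.5)
```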

A scaled version of the centroid classifier, which accommodates differences in scales between the two populations, can be defined by replacing $S$ in (2.13) by

$$S_{\mathrm{scale}}(\hat g) = \frac{1}{s_0^2}\int_I \{\hat g(t) - \hat\mu_0(t)\}^2\,dt - \frac{1}{s_1^2}\int_I \{\hat g(t) - \hat\mu_1(t)\}^2\,dt + \log\Big(\frac{s_0^2}{s_1^2}\Big), \tag{2.14}$$

where $s_k^2$ is an estimator of the scale of population $\Pi_k$. For example, we might take $s_k^2$ to equal $n_k^{-1}\sum_{j=1}^{n_k}\int_I (\hat g_{kj} - \hat\mu_k)^2$, the version we used in our numerical work, or $\int_I\int_I \hat G_k(u, v)\psi(u)\psi(v)\,du\,dv$ where $\psi$ is open to choice; or $s_0^2$ and $s_1^2$ could be selected empirically by minimising a cross-validation estimator of classification error. The definition at (2.14) should be compared with those at (2.15) and (2.16), below. The form of (2.14), and also of (2.15) and (2.16), is motivated by likelihood-ratio statistics for Gaussian data.
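The scale-adjusted statistic at (2.14) differs from (2.13) only through the two scale factors and the log term, as the following sketch shows. The gridded inputs, trapezoidal integration and function names are assumptions of the example; the scale estimates `s0_sq` and `s1_sq` are taken as given.

```python
import math

def integrate(values, dt):
    """Trapezoidal rule on an equally spaced grid."""
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))

def scaled_centroid_statistic(g_hat, mu0_hat, mu1_hat, s0_sq, s1_sq, dt):
    """S_scale(g_hat): scaled squared distances plus the log ratio of scales."""
    d0 = integrate([(g - a) ** 2 for g, a in zip(g_hat, mu0_hat)], dt)
    d1 = integrate([(g - b) ** 2 for g, b in zip(g_hat, mu1_hat)], dt)
    return d0 / s0_sq - d1 / s1_sq + math.log(s0_sq / s1_sq)

zero = [0.0, 0.0, 0.0]
one = [1.0, 1.0, 1.0]
s_equal = scaled_centroid_statistic(zero, zero, one, 1.0, 1.0, dt=0.5)
s_unequal = scaled_centroid_statistic(zero, zero, one, 1.0, math.e, dt=0.5)
```

With equal scales the log term vanishes and the statistic reduces to a multiple of $S(\hat g)$.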

A version for functional data of Fisher's quadratic discriminant is based on

$$T(\hat g) = \sum_{\ell=1}^{p}\bigg[\frac{1}{\hat\theta_{0\ell}}\Big\{\int_I (\hat g - \hat\mu_0)\hat\psi_{0\ell}\Big\}^2 - \frac{1}{\hat\theta_{1\ell}}\Big\{\int_I (\hat g - \hat\mu_1)\hat\psi_{1\ell}\Big\}^2 + \log\Big(\frac{\hat\theta_{0\ell}}{\hat\theta_{1\ell}}\Big)\bigg], \tag{2.15}$$

where $\hat g$ and $\hat\mu_k$ are as at (2.3) and (2.8), the pairs $(\hat\theta_{k\ell}, \hat\psi_{k\ell})$ are as at (2.12) and $p$ is a positive truncation parameter. (Here we assume, as is often the case in practice, that the prior probabilities of each population are unknown and estimated by 1/2. A more general version of the classifier can be used if these probabilities are estimated by other values, but this does not alter our main conclusions.) We assign the new data set $D$ to $\Pi_0$ if $T(\hat g) \le 0$, and to $\Pi_1$ otherwise. Of course, the statistic $T(\hat g)$, at (2.15), is just an empirical version of the quantity

$$T_0(g) = \sum_{\ell=1}^{p}\bigg[\frac{1}{\theta_{0\ell}}\Big\{\int_I (g - \mu_0)\psi_{0\ell}\Big\}^2 - \frac{1}{\theta_{1\ell}}\Big\{\int_I (g - \mu_1)\psi_{1\ell}\Big\}^2 + \log\Big(\frac{\theta_{0\ell}}{\theta_{1\ell}}\Big)\bigg]. \tag{2.16}$$

If the functions $g$ are Gaussian, and the first $p$ eigenvalues, in versions of (2.10) and (2.12) for either population, are distinct and nonzero, and the remaining eigenvalues vanish, then the classifier based on $T_0(g)$, at (2.16), is optimal in the sense of having least classification error among all classifiers, since it is, after all, just a likelihood ratio statistic. When the eigenvalues and eigenfunctions are estimated from data, as at (2.15), the classifier is asymptotically optimal. Bearing in mind the effectiveness of Fisher's discriminant analysis in the case of vector-valued data, even when the data are not normal, the classifier based on $T(\hat g)$ is an attractive choice even in non-Gaussian cases.
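The statistic at (2.15) can be sketched as follows, taking the estimated means, eigenvalues and eigenfunctions as given on an equally spaced grid. The argument layout (`mu_hat[k]`, `theta_hat[k][l]`, `psi_hat[k][l]` with zero-based component index `l`), the trapezoidal integration and the function names are assumptions of this example.

```python
import math

def integrate(values, dt):
    """Trapezoidal rule on an equally spaced grid."""
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))

def quadratic_statistic(g_hat, mu_hat, theta_hat, psi_hat, p, dt):
    """T(g_hat) of (2.15); only the integrated projections are squared."""
    def score(k, l):
        # projection of (g_hat - mu_hat_k) onto the l-th eigenfunction of population k
        return integrate([(g - m) * e
                          for g, m, e in zip(g_hat, mu_hat[k], psi_hat[k][l])], dt)
    total = 0.0
    for l in range(p):
        total += (score(0, l) ** 2 / theta_hat[0][l]
                  - score(1, l) ** 2 / theta_hat[1][l]
                  + math.log(theta_hat[0][l] / theta_hat[1][l]))
    return total

const = [1.0, 1.0, 1.0]                     # integrates to 1 when squared, dt = 0.5
T_val = quadratic_statistic(
    g_hat=[0.0, 0.0, 0.0],
    mu_hat=[[0.0, 0.0, 0.0], const],
    theta_hat=[[1.0], [1.0]],
    psi_hat=[[const], [const]],
    p=1, dt=0.5)
```

In this toy case the new curve coincides with the mean of $\Pi_0$, the statistic is negative, and the curve would be assigned to $\Pi_0$.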

3. Theoretical properties

3.1. Standard centroid-based classifier

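In this section, we derive properties of the centroid classifier based on the estimators $\hat g$ and $\hat g_{kj}$, and in particular we examine the impact of smoothing. First, we introduce notation. Let $n = n_0 + n_1$ (hence $n$ is a positive integer sequence diverging to infinity), let $m = m(n)$ be of the same size as $m_{kj}$ [see (3.7) below], and write $\sigma_{\varepsilon k}^2$ for the variance of the experimental errors $\varepsilon_{kji}$ and $\varepsilon_i$, in (2.1) and (2.2), when the data come from $\Pi_k$. Let

$$\bar g_k = \frac{1}{n_k}\sum_{j=1}^{n_k} g_{kj}, \qquad \nu_k = n_k^2\Big(\sum_{j=1}^{n_k} m_{kj}^{-1}\Big)^{-1} \tag{3.1}$$

and define

$$b_{k0} = \int_I (\mu_1 - \mu_0)\{2\mu_k - (\mu_0 + \mu_1)\}, \qquad \beta_{k0} = \int_I (\bar g_1 - \bar g_0)\{2\mu_k - (\bar g_0 + \bar g_1)\}, \tag{3.2}$$

$$\sigma_k^2 = 4\kappa_2\int_I\int_I (\bar g_1 - \bar g_0)(x_1)\,(\bar g_1 - \bar g_0)(x_2)\,G_k(x_1, x_2)\,dx_1\,dx_2, \tag{3.3}$$

$$\tau_k^2 = 4\kappa_2\int_I\int_I (\mu_1 - \mu_0)(x_1)\,(\mu_1 - \mu_0)(x_2)\,G_k(x_1, x_2)\,dx_1\,dx_2, \tag{3.4}$$

where $\kappa_2 = \int u^2 K(u)\,du$. Finally, put $\kappa = \int K^2$, and let $I$ be the compact interval that equals the support of the density $f_X$ of the $X_i$'s and $X_{kji}$'s, and on which the functions $g$ and $g_{kj}$ are defined.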
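The quantity $\nu_k$ in (3.1) is $n_k$ times the harmonic mean of the per-curve sample sizes $m_{kj}$, so it reduces to $n_k m$ when every curve is observed at the same number $m$ of points. A small sketch, with the hypothetical name `nu`:

```python
def nu(m_list):
    """nu_k = n_k^2 / sum_j (1 / m_kj), for the list of per-curve sample sizes."""
    n = len(m_list)
    return n * n / sum(1.0 / m for m in m_list)

balanced = nu([10, 10, 10, 10])      # four curves, ten points each
unbalanced = nu([5, 20])             # harmonic mean 8, so nu = 2 * 8
```

Because the harmonic mean is dominated by its smallest entries, a few sparsely observed curves pull $\nu_k$ down noticeably.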

We make the following assumptions:

(a) The distribution of the experimental errors $\varepsilon_{kji}$ and $\varepsilon_i$, in (2.1) and (2.2), has zero mean and all moments finite, may depend on $k$, and has variance $\sigma_{\varepsilon k}^2$; (b) the density $f_X$ of the variables $X_{kji}$ and $X_i$ does not depend on $i$, $j$ or $k$; (c) $f_X$ has two bounded derivatives, $f_X(x) \ge C > 0$ for all $x \in I$, and $f_X$ is Hölder continuous on the support $I$ of $f_X$. (3.5)

(a) The functions $g$ and $g_{kj}$ associated with the populations $\Pi_k$ for $k = 0, 1$, are realisations of Gaussian processes, have uniformly bounded covariance functions $G_k$ and mean functions $\mu_k$, both depending only on $k$, and satisfy $\tau_k^2 > 0$ for $k = 0, 1$; and (b) with probability 1 the functions are uniformly bounded and have Hölder-continuous second derivatives with the property that, for a constant $C > 0$, all moments of $\sup_{x_1, x_2 \in I} |g''(x_1) - g''(x_2)| / |x_1 - x_2|^{C}$ are finite when $g$ is sampled from either $\Pi_0$ or $\Pi_1$. (3.6)

(a) For a constant $C > 0$, the results $h_{(1)} = O(n^{-C})$ and $n^{1-C} h_{(1)} \to \infty$ hold for $h_{(1)} = h$ and $h_{(1)} = h_1$; (b) the kernel $K$ is a symmetric, nonnegative, compactly supported and Hölder continuous probability density; and (c) for each $k$ the values of $m^{-1}\min_j m_{kj}$, $m^{-1}\max_j m_{kj}$ and $n_0/n_1$ are bounded away from zero and infinity as $n \to \infty$, and, for constants $C_1$ and $C_2$ satisfying $0 < C_1 < C_2 < \infty$, $m$ and $n_0$ lie between $n^{C_1}$ and $n^{C_2}$. (3.7)

The assumption in (3.5)(c) that fX is bounded away from zero on its support is only a technical requirement, and is unnecessary in practice. To make this clear, in our numerical work we shall take fX to be a normal density, and show that the conclusions of Theorem 1 are nevertheless reflected clearly.

Let $\mathrm{err}_k = P_k\{(-1)^k S(\hat g) > 0\}$ denote the probability that the standard centroid-based classifier, based on the statistic $S(\hat g)$ at (2.13), commits an error when the data set $D$ actually comes from $\Pi_k$. Theorem 1 below describes the asymptotic behaviour of $\mathrm{err}_k$, and highlights the effect of the smoothing parameters $h$ and $h_1$, used to construct the estimators $\hat g$ and $\hat g_{kj}$ of $g$ and $g_{kj}$, on the classifier. A proof is given in Appendix A.1.

Theorem 1. Assume that (3.5)–(3.7) hold, and let $\Psi_0 = 1 - \Phi$ and $\Psi_1 = \Phi$, where $\Phi$ denotes the c.d.f. of a standard normal random variable. Then

$$\mathrm{err}_k = \mathrm{err}_{k0} + h^2 c_k + h_1^2 c_{k1} + \frac{d_{k0}}{\nu_0 h_1} + \frac{d_{k1}}{\nu_1 h_1} + O\{m^{-1} + (mh)^{-2}\} + o\Big(h^2 + h_1^2 + \frac{1}{\nu_0 h_1}\Big), \tag{3.8}$$

where $\mathrm{err}_{k0} = E_k[\Psi_k\{-\beta_{k0}/\sigma_k\}]$, $c_k = \kappa_2\alpha_k\int_I (\mu_1 - \mu_0)\mu_k''$, $c_{k1} = -\kappa_2\alpha_k\int_I (\mu_1 - \mu_0)\mu_{1-k}''$, $d_{kj} = (-1)^j\alpha_k\sigma_{\varepsilon j}^2\,\kappa\int_I f_X^{-1}$, with $\alpha_k = (-1)^k\tau_k^{-1}\phi(b_{k0}/\tau_k)$, and where $\phi$ denotes the standard normal density function.

The leading term $\mathrm{err}_{k0}$ on the right-hand side of (3.8) does not depend in any way on the bandwidths $h$ and $h_1$. It does involve the training sample sizes $n_0$ and $n_1$, and in particular does not equal the asymptotic limit of $\mathrm{err}_k$ as $n$ increases, since that limit is given by $\Psi_k(-b_{k0}/\tau_k)$, but the effects of the bandwidths are all confined to subsequent terms on the right-hand side of (3.8). The terms in $h^2$ and $h_1^2$ represent contributions to classification error arising from biases of the estimators $\hat g$ and $\hat g_{kj}$, and the terms in $(\nu_0 h_1)^{-1}$ and $(\nu_1 h_1)^{-1}$ are contributions from the variances of the estimators $\hat g_{kj}$.

While a priori it might be thought that, since the total number of observations in the training sample, $\sum_j m_{kj}$, for $k = 0$ and 1, is an order of magnitude larger than the number of observations, $m$, in the new data set $D$, then $h_1$ should be chosen smaller than $h$, Theorem 1 shows that the influence of bandwidths on error rate is much more complex than this.

For one thing, there are no terms in $(mh)^{-1}$ on the right-hand side of (3.8). (Section 3.1.2 will explain the reason for this.) As a result, the terms on the right-hand side of (3.8) that depend on $h$ can be rendered equal to $O(m^{-1})$ simply by taking $h$ equal to a constant multiple of $m^{-1/2}$. As noted in Remark 1, below, this level of contribution to the error rate is generally impossible to remove, even in simple parametric problems. Therefore the contribution of $h$ to error rate cannot be rendered smaller than $m^{-1}$. However, in some instances choosing $h$ to be an order of magnitude larger or smaller than $m^{-1/2}$ can be beneficial; see Section 3.1.1 below.
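The effect of the choice $h = Cm^{-1/2}$ can be checked by direct substitution; here $C > 0$ is a placeholder constant introduced only for this illustration:

$$h^2 = C^2 m^{-1}, \qquad (mh)^{-2} = (C m^{1/2})^{-2} = C^{-2} m^{-1}.$$

Both $h$-dependent contributions in (3.8), namely $h^2 c_k$ and the remainder $O\{(mh)^{-2}\}$, are then of size $m^{-1}$, the same order as the remainder term $O(m^{-1})$ that is present regardless of the bandwidth.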

The terms in $h_1$ on the right-hand side of (3.8) are a different matter because each of $c_{k1}$ and $d_{k0}\nu_0^{-1} + d_{k1}\nu_1^{-1}$ can be either positive or negative. Depending on the signs and sizes of $c_{k1}$ and $d_{k0}\nu_0^{-1} + d_{k1}\nu_1^{-1}$, it can be optimal to take $h_1$ to be of order $\nu_k^{-1/3}$, which achieves a trade-off between terms in $h_1^2$ and $(\nu_k h_1)^{-1}$, or to take $h_1$ to decrease to zero more quickly or to converge to a positive constant, as $n$ increases; see Section 3.1.1 below.

Therefore, the impact that smoothing has on classification performance is much more subtle than it might have appeared. We discuss these issues in more detail in the next sections.

3.1.1. Sizes of h and h1 that optimise overall error rate

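Using Theorem 1 we can deduce the orders of magnitude of $h$ and $h_1$ that minimise the error rate of the classifier, that is, that minimise the probability of misclassification,

$$\mathrm{err} = \pi_0\,\mathrm{err}_0 + \pi_1\,\mathrm{err}_1, \tag{3.9}$$

where $\mathrm{err}_0$ and $\mathrm{err}_1$ are as in (3.8), $\pi_k$ denotes the prior probability attached to population $\Pi_k$, and $\pi_0 + \pi_1 = 1$. Using (3.8) and (3.9), we can write

$$\mathrm{err} = \mathrm{err}^0 + c^0 h^2 + c_1^0 h_1^2 + d^0(\nu_0 h_1)^{-1} + O\{m^{-1} + (mh)^{-2}\} + o\{h^2 + h_1^2 + (\nu_0 h_1)^{-1}\}, \tag{3.10}$$

where $\mathrm{err}^0 = \pi_0\,\mathrm{err}_{00} + \pi_1\,\mathrm{err}_{10}$ (recall that $\mathrm{err}_{k0}$ does not depend on the bandwidths),

$$c^0 = \kappa_2\int_I (\mu_1 - \mu_0)\Big\{\frac{\pi_0\mu_0''}{\tau_0}\phi\Big(\frac{b_{00}}{\tau_0}\Big) - \frac{\pi_1\mu_1''}{\tau_1}\phi\Big(\frac{b_{10}}{\tau_1}\Big)\Big\},$$

$$c_1^0 = \kappa_2\int_I (\mu_0 - \mu_1)\Big\{\frac{\pi_0\mu_1''}{\tau_0}\phi\Big(\frac{b_{00}}{\tau_0}\Big) - \frac{\pi_1\mu_0''}{\tau_1}\phi\Big(\frac{b_{10}}{\tau_1}\Big)\Big\},$$

$$d^0 = \kappa\Big(\int_I f_X^{-1}\Big)\Big\{\frac{\pi_0}{\tau_0}\phi\Big(\frac{b_{00}}{\tau_0}\Big) - \frac{\pi_1}{\tau_1}\phi\Big(\frac{b_{10}}{\tau_1}\Big)\Big\}\Big(\sigma_{\varepsilon 0}^2 - \sigma_{\varepsilon 1}^2\frac{\nu_0}{\nu_1}\Big).$$

Since the function $\phi$ is symmetric, and $b_{10} = -b_{00}$, then $b_{10}$ can be replaced by $b_{00}$ in the formula for $d^0$ without altering its value.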
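The identity $b_{10} = -b_{00}$ follows in one line from the definition of $b_{k0}$ at (3.2):

$$b_{00} = \int_I (\mu_1 - \mu_0)(\mu_0 - \mu_1) = -\int_I (\mu_1 - \mu_0)^2, \qquad b_{10} = \int_I (\mu_1 - \mu_0)(\mu_1 - \mu_0) = \int_I (\mu_1 - \mu_0)^2.$$

In particular $b_{00} < 0 < b_{10}$ whenever the two mean functions differ on a set of positive measure.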

To appreciate the very wide range of optimal bandwidth choices that can arise in the problem of minimising error rate, let us consider minimising err, at (3.10). To help remove ambiguities, let us assume that as $n$ increases the value of $\sigma_{\varepsilon 0}^2 - \sigma_{\varepsilon 1}^2\nu_0\nu_1^{-1}$ is of the same sign for all sufficiently large $n$, and its absolute value is bounded away from zero; assumption (3.7)(c) ensures that it is uniformly bounded. In this instance, and focusing just on the terms in $h_1$, we see that four distinct cases can arise in practice:

  • (i) $c_1^0$ and $d^0$ are both positive. In this case, to minimise the contribution from $h_1$, we should minimise $c_1^0 h_1^2 + d^0(\nu_0 h_1)^{-1}$, which is achieved by taking $h_1$ to be of size $\nu_0^{-1/3}$.

  • (ii) $c_1^0$ and $d^0$ are both negative. In this case, the contribution made by $h_1$ behaves like $-\{|c_1^0| h_1^2 + |d^0|(\nu_0 h_1)^{-1}\}$ as sample size increases. The term within braces here is maximised by taking $h_1 = 0$, and analogously, in minimising err, it is optimal to take $h_1$ to be of strictly smaller order than $\nu_0^{-1/3}$.

  • (iii) $c_1^0 > 0$ and $d^0 \le 0$. In this case, to minimise the error rate, we need to maximise the size of the negative term and minimise that of the positive term, which is achieved by taking $h_1$ to be of strictly smaller order than $\nu_0^{-1/3}$ (the precise order depends on the magnitude of second order terms, but deriving the latter precisely would require a lot of additional computation).

  • (iv) $c_1^0 < 0$ and $d^0 \ge 0$. Here, using arguments similar to those in case (iii), taking $h_1$ to be of strictly larger order than $\nu_0^{-1/3}$ is optimal.
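Case (i) is the only one with an interior minimiser, and it can be worked out explicitly. Setting the derivative of $c h_1^2 + d(\nu_0 h_1)^{-1}$ to zero gives $h_1 = \{d/(2c\nu_0)\}^{1/3}$, which is of size $\nu_0^{-1/3}$. The sketch below checks this numerically; the symbols `c`, `d` and `nu0` stand in for the two leading constants and $\nu_0$, and the numerical values are arbitrary illustrations, not taken from the paper.

```python
def h1_terms(h1, c, d, nu0):
    """Leading h1-dependent terms of the error expansion: c*h1^2 + d/(nu0*h1)."""
    return c * h1 ** 2 + d / (nu0 * h1)

def h1_balanced(c, d, nu0):
    """Minimiser of h1_terms over h1 > 0; exists only when c > 0 and d > 0."""
    if c <= 0 or d <= 0:
        raise ValueError("an interior minimiser exists only in case (i)")
    return (d / (2.0 * c * nu0)) ** (1.0 / 3.0)

h_opt = h1_balanced(c=1.0, d=2.0, nu0=1000.0)   # (2 / 2000)^(1/3), about 0.1
at_opt = h1_terms(h_opt, 1.0, 2.0, 1000.0)
```

In cases (ii) to (iv) at least one of the two terms is negative (or zero), the objective is monotone in $h_1$ over the relevant range, and no such balance point exists; this is why the leading terms alone do not determine the optimal order there.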

The case $d^0 = 0$ occurs, for example, if the covariance $G_k$ of the Gaussian process $g$, the experimental error variance $\sigma_{\varepsilon k}^2$, and the values of $m_{kj}$ and $n_k$ do not depend on $k$. Equal values of $m_{kj}$ commonly arise when the data are observed on a grid; see Remark 4.

A similar analysis can be carried out in the case of optimisation over $h$ rather than $h_1$, although there the optimum is accessed from a comparison of terms in $h^2$ and $(mh)^{-2}$, rather than $h_1^2$ and $(\nu_0 h_1)^{-1}$. [A tedious analysis of the term of size $(mh)^{-2}$, represented by the remainder $O\{(mh)^{-2}\}$ in (3.8), shows that it can be either positive or negative.] Depending on the relative signs of the terms in $h^2$ and $(mh)^{-2}$, it can be optimal to take $h \asymp m^{-1/2}$, or $h$ of strictly larger, or strictly smaller order than $m^{-1/2}$.

Similar results are obtained if we investigate properties of errk, in (3.8), instead of the overall error rate, err, at (3.9).

These results explain the very diverse patterns of behaviour that are seen in numerical work, and that motivated our research; see Section 1. In summary, in apparently similar problems and using the same type of classifier, it can be optimal to use a very small bandwidth, or a very large bandwidth, or a bandwidth of only moderate size, depending on the signs of certain constants. Therein lies the contradictory nature of the smoothing parameter choice problem for classification of functional data.

3.1.2. Absence of terms in $(mh)^{-1}$

The centroid-based classifier statistic $S(\hat g)$, at (2.13), can be written equivalently as

$$S(\hat g) = \int_I (\hat\mu_1 - \hat\mu_0)(2\hat g - \hat\mu_0 - \hat\mu_1)\,dt. \tag{3.11}$$

Importantly, there is no quadratic term in $\hat g^2$ in (3.11), and as a result the impact of the bandwidth $h$, although not $h_1$, on properties of the classifier is greatly reduced. This reduction is brought about by the smoothing effect of the integral in (3.11), which results in the elimination of terms in $(mh)^{-1}$.
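The equivalence of (2.13) and (3.11) is a pointwise difference-of-squares identity:

$$(\hat g - \hat\mu_0)^2 - (\hat g - \hat\mu_1)^2 = 2\hat g(\hat\mu_1 - \hat\mu_0) - (\hat\mu_1^2 - \hat\mu_0^2) = (\hat\mu_1 - \hat\mu_0)(2\hat g - \hat\mu_0 - \hat\mu_1).$$

The $\hat g^2$ terms cancel before integration, so $\hat g$ enters $S(\hat g)$ only linearly, through the integral $\int_I (\hat\mu_1 - \hat\mu_0)\hat g$.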

This property, to which we shall refer as the “integration effect,” is known in other settings, for example, when integrating a kernel density estimator, computed from a sample of size $m$, to produce a distribution estimator. Integration results in the variance reducing from order $(mh)^{-1}$, for the density estimator, to order $m^{-1}$, for the distribution function estimator, just as it does in the setting above.

Remark 1 (Order $m^{-1}$ term in expansion of classification error). We assumed in (3.7)(c) that the values of $m_{kj}$, representing the number of pairs $(X_{kji}, Y_{kji})$ for a given population index $k$ and given individual $j$, are all of roughly the same size. In this setting it is easy to see that, even in an elementary parametric setting, we must expect the operation of observing the functions $g_{kj}$ at scattered points to affect error rate through a term of order $m^{-1}$, and no smaller. For example, consider the case where $g_{kj} = \psi(\cdot \mid \omega_{kj})$, with $\psi(\cdot \mid \omega)$ being a known function completely determined by the parameter $\omega$, and $\omega_{kj} = \int_I g_{kj} w$ where the weight function $w$ is known. Using the data $D_{kj}$ on $g_{kj}$ we can estimate $\omega_{kj}$ root-$m$ consistently, but no faster, and as a result we incur a classification error of size $m^{-1}$, and no smaller, from not knowing the values $\omega_{kj}$. It is for this reason that, when developing expansions of classification error, we do not explore the remainder of size $m^{-1}$; it is stated simply as $O(m^{-1})$ on the right-hand side of (3.8).

3.1.3. Other remarks

We conclude our discussion of Theorem 1 with a number of remarks.

Remark 2 (Definition of $\hat\mu_k$). The size of the fourth and fifth terms on the right-hand side of (3.8) is determined by the sizes of $\nu_0^{-1}$ and $\nu_1^{-1}$, and those quantities can be made slightly smaller by using a slightly different definition of $\hat\mu_k$, at (2.8). In particular in (2.8), on account of the definition of $\hat g_{kj}$ at (2.3), $\hat\mu_k$ is defined as an average of ratios of sums, whereas slightly better statistical performance is obtained by taking $\hat\mu_k$ to be simply a ratio of sums:

$$\hat\mu_k = \frac{\sum_j (U_{kj2}V_{kj0} - U_{kj1}V_{kj1})}{\sum_j (U_{kj2}U_{kj0} - U_{kj1}^2)},$$

compare (2.3). However, this approach departs from standard practice in working with functional data, and therefore, since convergence rates do not alter (only the constant multiples of rates are reduced), we have followed standard practice in the definition of $\hat\mu_k$.

Remark 3 (Gaussian assumption). Of course, if $m$ is sufficiently large then $\hat g$ is itself approximately Gaussian, and so the assumption that $g$ is a Gaussian process is reflected particularly well in properties of its estimator. More generally, our assumption that $g$ is a Gaussian process is made for simplicity, and can be relaxed. For example, generalisations to chi-squared and other processes, where shape can be described in terms of a small number of fixed functions (mean and covariance in the Gaussian case), are straightforward.

More generally we would require a model which described the properties of random functions relatively simply. The Gaussian model fills this need ideally; shape is described by mean and variance functions, on which we have imposed only smoothness, rather than parametric, conditions. Moreover, in the Gaussian case all moments of g(x) are finite, for each x (we use this property repeatedly during our theoretical arguments), and the principal component scores are independent (this is used frequently during our proof of Theorem 2).

Remark 4 (Case of regularly spaced design). Theorem 1 continues to hold if the $m_{kj}$ design variables $X_{kji}$ are regularly spaced on $I$ for each $k$ and $j$. The only change necessary is to replace $\int_I f_X^{-1}$, on the right-hand side of (3.8), by the square of the length of the interval $I$.
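One heuristic way to see where the squared length comes from, offered here as an informal check rather than as part of the proof, is to note that a uniform design density on $I$ gives the same value. Writing $|I|$ for the length of $I$ and taking $f_X \equiv 1/|I|$,

$$\int_I f_X^{-1} = \int_I |I|\,dx = |I|^2.$$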

3.2. Scale-adjusted centroid-based classifier

Recall that the scale-adjusted centroid-based classifier is defined in terms of $S_{\mathrm{scale}}(\hat g)$, at (2.14). A decomposition similar to that of Theorem 1 can be derived for this classifier, as we shall prove in Theorem 2 below. For this classifier, it seems necessary to strengthen (3.7) by imposing conditions on the behaviour of the eigenvalues $\theta_{k\ell}$ as $\ell$ increases. However, since our aim in this section is only to corroborate the conclusions in Section 3.1, drawn there in the case of the standard centroid-based classifier, then we shall simplify our account by assuming that $g$ is finite dimensional, and in particular taking the covariance expansion at (2.10) to have just $q$ terms:

For $k = 0$ and 1: (a) the first $q$ eigenvalues in the sequence $\theta_{k1} \ge \theta_{k2} \ge \cdots$, arising in the covariance expansion (2.10) of $g$ when the data come from $\Pi_k$, are distinct; (b) $\theta_{k\ell} = 0$ for $\ell > q$; (c) for $1 \le \ell \le q$ the eigenfunctions $\psi_{k\ell}$ in (2.10) have two Hölder continuous derivatives on $I$; (d) $E_k(g)$ is a linear form in $\psi_{k1}, \ldots, \psi_{kq}$; and (e) in the definition of $S_{\mathrm{scale}}(\hat g)$, $s_0^2 \ne s_1^2$. (3.12)

Without (3.12)(a), separate conditions, valid uniformly in $j = 1, 2, \ldots$, have to be imposed on remainders in Taylor expansions of “smoothed” versions of the eigenvalues $\theta_{kj}$, depending on $h$.

The next theorem indicates that the results of Theorem 1 also apply for the scale-adjusted centroid-based classifier. Its proof is given in the supplementary material [Carroll, Delaigle and Hall (2013)].

Theorem 2. Assume that (3.5), (3.6) and (3.12) hold. Then the error rate of the scale-adjusted centroid-based classifier, when the data in D are drawn from Πk, admits the expansion at (3.8), but with different constants, where the various terms have the properties stated immediately below that formula.

The diversity of possible signs of $c_k$, $c_{k1}$ and $d_{k0}\nu_0^{-1} + d_{k1}\nu_1^{-1}$ in (3.8), discussed in Section 3.1.1, is also present in this case. Therefore the conclusions drawn in that section apply to the scale-adjusted centroid-based classifier. However, we have not derived explicitly the counterparts of the constants $c_k$, $c_{k1}$, $d_{k0}$ and $d_{k1}$ that appear in equation (3.8).

The integration effect discussed in Section 3.1.2 is also present here, although we had originally expected that the scale-adjusted centroid classifier would produce a term of size $(mh)^{-1}$ in an expansion of error rate. Indeed, the situation initially seems quite different in the case of the scale-adjusted version $S_{\mathrm{scale}}(\hat g)$ of $S(\hat g)$, at (2.14), when $s_0^2 \ne s_1^2$. There the quadratic term in $\hat g$ persists. The reason it still does not produce a term in $(mh)^{-1}$ is quite subtle. Define $\bowtie_k$ to be $>$ or $\le$ according as $k = 0$ or $k = 1$, respectively. The probability $P_k\{S_{\mathrm{scale}}(\hat g) \bowtie_k 0\}$ can be written as

$$P_k\Big\{\sum_{j=1}^{\infty} w_j (Z_j + V_j)^2 \bowtie_k W\Big\} + \text{negligible terms},$$

where the $Z_j$'s are independent $N(0, 1)$ variables, conditional on the $V_j$'s and $W$; the positive weights $w_j$ are nonrandom; and critically, $W$ does not involve the experimental errors $\varepsilon_i$ in (2.2), from which any term in $(mh)^{-1}$ would arise. The terms $V_j$ depend on the experimental errors only through integrals of the error process, and the integration effect at this point largely removes the impact of the error bandwidth $h$, with the result that there is no term of size $(mh)^{-1}$. However, terms in $(\nu_0 h_1)^{-1}$ remain; the integration effect only influences smoothing of the new data, not of the training data.

3.3. Quadratic discriminant

Finally, we show that similar smoothing effects are present in the case of the quadratic discriminant classifier defined through the statistic $T(\hat g)$ at (2.15). Recall that, when the data in D come from Πk, the random function g has covariance function Gk. To derive the counterpart of Theorem 1 for this classifier, let r, r1, r2 take the values 0 and 1, let 1 ≤ ℓ, ℓ1, ℓ2 ≤ p, and define the covariances

\[ \mathrm{cov}_k[r_1, r_2; \ell_1, \ell_2] = \int_{\mathcal I}\int_{\mathcal I} G_k(x_1, x_2)\, \psi_{r_1\ell_1}(x_1)\, \psi_{r_2\ell_2}(x_2)\, dx_1\, dx_2, \]

the variances vark[r, ℓ] = covk[r, r; ℓ, ℓ], and the correlations

\[ \rho_k[r_1, r_2; \ell_1, \ell_2] = \frac{\mathrm{cov}_k[r_1, r_2; \ell_1, \ell_2]}{\big(\mathrm{var}_k[r_1, \ell_1]\, \mathrm{var}_k[r_2, \ell_2]\big)^{1/2}}. \]
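As a rough illustration of these definitions, the following sketch approximates the double integral by a midpoint rule on [0, 1]. The covariance function and eigenfunctions used here are assumptions chosen only because their answer is known in closed form (the Brownian-motion covariance min(x1, x2) with eigenfunctions √2 sin{(ℓ − 1/2)πx}); they are not taken from the paper, and all function names are hypothetical.

```python
import math


def cov_entry(G, psi1, psi2, n=200):
    """Midpoint-rule approximation of the double integral over [0, 1]^2
    of G(x1, x2) * psi1(x1) * psi2(x2)."""
    pts = [(i + 0.5) / n for i in range(n)]
    p1 = [psi1(x) for x in pts]
    p2 = [psi2(x) for x in pts]
    total = 0.0
    for a, x1 in enumerate(pts):
        row = sum(G(x1, x2) * p2[b] for b, x2 in enumerate(pts))
        total += p1[a] * row
    return total / (n * n)


def correlation(G, psi1, psi2, n=200):
    """Correlation built from cov_entry, mirroring the definition of rho_k."""
    c = cov_entry(G, psi1, psi2, n)
    v1 = cov_entry(G, psi1, psi1, n)
    v2 = cov_entry(G, psi2, psi2, n)
    return c / math.sqrt(v1 * v2)


def G_bm(x1, x2):
    """Illustrative covariance (assumption): standard Brownian motion on [0, 1]."""
    return min(x1, x2)


def psi(l):
    """Illustrative eigenfunctions (assumption) of the Brownian-motion covariance."""
    return lambda x: math.sqrt(2.0) * math.sin((l - 0.5) * math.pi * x)


if __name__ == "__main__":
    print(correlation(G_bm, psi(1), psi(1)))  # same index: correlation equal to 1
    print(correlation(G_bm, psi(1), psi(2)))  # distinct indices: close to 0
```

In this toy setting the correlation has absolute value 1 only when the two indices coincide, which is the kind of non-degeneracy that a condition on the correlations is meant to guarantee.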

Let p ≥ 1, a fixed number, be the number of principal components used to construct the quadratic discriminant statistic $T(\hat g)$, defined at (2.15). Theorem 3 below addresses the error rate of the quadratic discriminant based on $T(\hat g)$, and there we shall assume that:

(a) For k = 0, 1 the eigenvalues θk1, …, θk,p+1 are distinct; and (b) among the values taken by ρk[r1, r2; ℓ1, ℓ2] for k, r1, r2 = 0, 1 and 1 ≤ ℓ1, ℓ2 ≤ p, the absolute value of ρk[r1, r2; ℓ1, ℓ2] equals 1 only when r1 = r2 and ℓ1 = ℓ2. (3.13)

Condition (3.13)(a) ensures that the eigenfunctions ψkℓ are well defined for k = 0, 1 and ℓ = 1, …, p; and (3.13)(b) guarantees that the quantities $\int_{\mathcal I}(g - \mu_{r_1})\psi_{r_1\ell_1}$ and $\int_{\mathcal I}(g - \mu_{r_2})\psi_{r_2\ell_2}$, which appear in the definition of T0(g) at (2.16), cannot be identical, except for a difference in means, unless r1 = r2 and ℓ1 = ℓ2, thereby avoiding degeneracy.

The counterpart of Theorem 1 for the quadratic discriminant classifier is stated in the next theorem. Its proof is given in the supplementary material [Carroll, Delaigle and Hall (2013)].

Theorem 3. Assume that (3.5)–(3.7) and (3.13) hold. Then the error rate of the quadratic discriminant, when the data in D come from Πk, admits the expansion at (3.8), but with different constants, where the various terms have the properties stated immediately below that formula.

Again the signs of ck, ck1 and $d_{k0}\nu_0^{-1} + d_{k1}\nu_1^{-1}$, in (3.8), are particularly diverse, and so the conclusions reached in Section 3.1.1 apply. Likewise, the integration effect discussed in Section 3.1.2 is also observed. Here, as can be seen directly from (2.15), the estimator $\hat g$ is integrated, and only the integral is squared, not $\hat g$ itself. The resulting integration effect eliminates any term in (mh)−1 from the analogue of the expansion (3.8) in this setting, although again this influence does not carry over to the training data.

4. Numerical illustrations

4.1. Simulated data

To illustrate the impact of bandwidth on classification performance, we generated data from several instances of model (2.1), taking, in each case, mkj = 50. Let ϕσ (x) denote the normal density function with mean zero and standard deviation σ. We considered the following cases, each with three different levels of errors, which we refer to as noise versions 1, 2 and 3:

  • (A):

    $g_{kj}(t) = \mu_k(t) + (3t + 100)^{1/2}\{\cos(t/50)\}^k Z_{kj}$, where $\mu_0(t) = \phi_{10}(t - 5)$, $\mu_1(t) = \mu_0(t) + 0.3\cos(t/5) + 0.1$, $Z_{kj} \sim U[-1/(30 - 10k),\, 1/(30 - 10k)]$, and $\varepsilon_{kji} \sim N(0, 1/(4 - 2k)^2)$ (noise version 1), $\varepsilon_{kji} \sim N(0, 2/(4 - 2k)^2)$ (noise version 2) or $\varepsilon_{kji} \sim N(0, 4/(4 - 2k)^2)$ (noise version 3), and π0 = 1/3, π1 = 2/3. Moreover, $X_{kji} = 2i - 1$, for i = 1, …, 50.

  • (B):

    $g_{kj}(t) = \mu_k(t) + (3t + 100)^{1/2} Z_{kj}$, where $\mu_0(t) = 30\{0.2\phi_4(t - 5) + 0.1\phi_4(t - 10) + 0.4\phi_6(t - 20) + 0.4\phi_6(t - 35) + 0.6\phi_7(t - 55) + 0.6\phi_7(t - 80)\}$, $\mu_1(t) = \mu_0(t) + 4/\{(t - 50)^2 + 10\}$, $Z_{kj} \sim U[-1/(60 + 15k),\, 1/(60 + 15k)]$, $\varepsilon_{kji} \sim \{\mathrm{Exp}(0.5) - 2\}/(2 + 2k)$ (noise version 1), $\varepsilon_{kji} \sim \sqrt{2}\,\{\mathrm{Exp}(0.5) - 2\}/(2 + 2k)$ (noise version 2) or $\varepsilon_{kji} \sim \{\mathrm{Exp}(0.5) - 2\}/(1 + k)$ (noise version 3), and π0 = 2/5 and π1 = 3/5. Moreover, Xkji was as in (A).

  • (C):

    $g_{0j}(t) = \mu_0(t) + (3t + 100)^{1/2} Z_{0j}$, $g_{1j}(t) = \mu_1(t) + (t + 5) Z_{1j}$, where $\mu_0(t) = 15\phi_{17}(t - 65)\cos(t/7)$, $\mu_1(t) = \mu_0(t) + 5\phi_{20}(t - 50)$, $Z_{kj} \sim U[-1/(50 - 10k),\, 1/(50 - 10k)]$, $\varepsilon_{kji} \sim N(0, (4 - k)^2/100)$ (noise version 1), $\varepsilon_{kji} \sim N(0, (4 - k)^2/50)$ (noise version 2), $\varepsilon_{kji} \sim N(0, (4 - k)^2/25)$ (noise version 3), and π0 = 2/3 and π1 = 1/3. Moreover, Xkji was as in (A).

  • (D)–(F):

    Same as (A) to (C) but with $X_{kji} = 2i - 1 + T_{kji}$, where $T_{kji} \sim N(0, 0.25)$.
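For concreteness, one noisy curve from case (A) could be generated along the following lines. This is a minimal sketch under stated assumptions: the observations are taken to be Y = g(X) + ε, the second argument of N(0, ·) is read as a variance, and the function names are hypothetical rather than taken from the paper.

```python
import math
import random


def phi(x, sigma):
    """Normal density with mean zero and standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mu(k, t):
    """Mean function of population k in case (A)."""
    base = phi(t - 5.0, 10.0)
    return base if k == 0 else base + 0.3 * math.cos(t / 5.0) + 0.1


def simulate_case_a(k, noise_version=1, rng=random):
    """Return (X, Y) for one curve from population k, observed at X_i = 2i - 1."""
    half_width = 1.0 / (30 - 10 * k)
    z = rng.uniform(-half_width, half_width)
    # Assumption: N(0, v) is parametrised by its variance v.
    var = {1: 1.0, 2: 2.0, 3: 4.0}[noise_version] / (4 - 2 * k) ** 2
    xs = [2 * i - 1 for i in range(1, 51)]
    ys = []
    for t in xs:
        g = mu(k, t) + math.sqrt(3 * t + 100) * math.cos(t / 50.0) ** k * z
        ys.append(g + rng.gauss(0.0, math.sqrt(var)))
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(1)
    k = 0 if rng.random() < 1.0 / 3.0 else 1  # pi_0 = 1/3, pi_1 = 2/3
    xs, ys = simulate_case_a(k, noise_version=2, rng=rng)
    print(k, xs[:3], ys[:3])
```

Cases (D) to (F) would only change the design points, by adding independent N(0, 0.25) perturbations to each X value.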

We chose these examples to illustrate various features of the problems, namely that the impact of smoothing may differ among classifiers, and that in some cases, some classifiers perform better with more smoothing and in other cases, they might perform better with less smoothing.

In each case, for k = 0, 1 and for several values of ntr, we generated 100 (resp., ntr) noisy test curves (resp., training curves) from model (2.1), each of which came with probability πk from Πk. We constructed each classifier from the training data, and applied it to the test data. To compute $\hat g$ and $\hat g_{kj}$, we compared three approaches for selecting the bandwidths: no smoothing (NS); the standard plug-in (PI) bandwidths hPI and hPI,kj that estimate the optimal bandwidth for estimation of the regression functions g and gkj, which we computed using the dpill function in the R package KernSmooth [see Ruppert, Sheather and Wand (1995)]; and the bandwidths γhPI and γ1hPI,kj, where γ and γ1 (and also the truncation parameter p in the case of the quadratic discriminant classifier) were chosen to minimise the following cross-validation (CV) estimator of classification error:

\[ \widehat{\mathrm{err}} = \frac{\hat\pi_0}{n_0}\sum_{i=1}^{n_0} I\{\hat C_{i0,-i} = 1\} + \frac{\hat\pi_1}{n_1}\sum_{i=1}^{n_1} I\{\hat C_{i1,-i} = 0\}, \]

with $\hat\pi_0$ and $\hat\pi_1$ denoting estimators of π0 and π1 (we took $\hat\pi_k = 1/2$), and $\hat C_{ik,-i}$ being the estimator of the class label of the ith training observation from group k, obtained from the classifier constructed without using this observation.
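The leave-one-out estimator above can be sketched as follows. The sketch is generic and makes several assumptions: curves are represented as equal-length lists of already smoothed values, the classifier is a toy centroid rule standing in for the functional classifiers of Section 2, and the names cv_error and fit_centroid are hypothetical. In practice the function would be evaluated for each candidate pair of bandwidth multipliers, and the pair giving the smallest value retained.

```python
def cv_error(class0, class1, fit, pi_hat=(0.5, 0.5)):
    """Leave-one-out estimate of classification error.

    class0, class1: lists of training items from groups 0 and 1.
    fit(train0, train1) must return a function mapping an item to a label in {0, 1}.
    """
    groups = (class0, class1)
    err = 0.0
    for k in (0, 1):
        wrong = 0
        for i in range(len(groups[k])):
            held_out = groups[k][i]
            reduced = groups[k][:i] + groups[k][i + 1:]
            # Refit the classifier without the held-out observation.
            train = (reduced, groups[1]) if k == 0 else (groups[0], reduced)
            if fit(*train)(held_out) != k:
                wrong += 1
        err += pi_hat[k] * wrong / len(groups[k])
    return err


def fit_centroid(train0, train1):
    """Toy centroid rule on vectors (a stand-in, not the paper's classifier)."""
    def mean(curves):
        return [sum(col) / len(curves) for col in zip(*curves)]

    m0, m1 = mean(train0), mean(train1)

    def classify(y):
        d0 = sum((a - b) ** 2 for a, b in zip(y, m0))
        d1 = sum((a - b) ** 2 for a, b in zip(y, m1))
        return 0 if d0 <= d1 else 1

    return classify


if __name__ == "__main__":
    c0 = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0]]
    c1 = [[1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
    print(cv_error(c0, c1, fit_centroid))
```

Refitting the classifier for every held-out observation is what makes the estimate honest, at the cost of one fit per training curve and per candidate bandwidth.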

For each configuration, we generated B = 200 sets of training and test samples. In Tables 1 and 2, we report the percentage of correctly classified test curves, averaged over the B replicates. Depending on the model, the classifier, and the type of data (test or training), the cross-validation bandwidths were either smaller or larger than the PI regression bandwidths, illustrating the variety of settings already explained by our theory. See Table B.1 in Section B.3 in the supplementary material [Carroll, Delaigle and Hall (2013)], where we report the value of γ and γ1 averaged over the B replicates. We can see from the table that in most cases, γ was smaller than γ1, and both were usually smaller than 1, except in cases (C) and (F).

Table 1.

Percentage of correctly classified observations for the simulated data of Section 4.1, using plug-in (PI) regression bandwidths, bandwidths that minimise a crossvalidation (CV) estimate of classification error, or without smoothing the noisy data (NS). The three noise versions, in increasing order, are described in cases (A)–(C) in Section 4.1. Here “Cent” is the centroid classifier (2.13), “Cent sc.” is the scaled centroid classifier (2.14) and “QDA” is the quadratic discriminant classifier (2.15)

| Case | Noise version | ntr | Cent CV | Cent PI | Cent NS | Cent sc. CV | Cent sc. PI | Cent sc. NS | QDA CV | QDA PI | QDA NS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (A) | 1 | 50 | 82.9 | 74.1 | 84.0 | 91.8 | 73.2 | 92.0 | 95.1 | 94.1 | 53.5 |
| (A) | 1 | 100 | 84.4 | 74.9 | 84.8 | 92.6 | 74.1 | 92.6 | 97.6 | 94.8 | 67.6 |
| (A) | 2 | 50 | 77.7 | 69.6 | 78.1 | 94.3 | 70.4 | 94.4 | 91.0 | 89.3 | 49.1 |
| (A) | 2 | 100 | 79.9 | 70.7 | 80.2 | 95.1 | 71.2 | 95.1 | 94.3 | 89.7 | 61.7 |
| (A) | 3 | 50 | 71.1 | 65.6 | 71.1 | 97.1 | 69.0 | 97.1 | 85.4 | 84.3 | 46.2 |
| (A) | 3 | 100 | 73.7 | 66.8 | 74.1 | 97.9 | 69.4 | 97.9 | 89.9 | 84.1 | 58.4 |
| (B) | 1 | 50 | 63.2 | 60.1 | 65.7 | 96.3 | 78.7 | 96.5 | 77.1 | 74.3 | 65.8 |
| (B) | 1 | 100 | 65.5 | 61.5 | 66.8 | 96.8 | 80.0 | 96.8 | 81.8 | 76.3 | 73.0 |
| (B) | 2 | 50 | 61.5 | 58.6 | 64.6 | 96.3 | 80.6 | 96.4 | 76.9 | 74.1 | 65.2 |
| (B) | 2 | 100 | 62.6 | 58.7 | 64.4 | 96.7 | 81.3 | 96.7 | 81.3 | 75.0 | 72.4 |
| (B) | 3 | 50 | 60.9 | 57.6 | 64.0 | 96.2 | 81.6 | 96.4 | 77.3 | 74.2 | 65.4 |
| (B) | 3 | 100 | 60.7 | 56.8 | 63.3 | 96.7 | 82.3 | 96.7 | 81.6 | 75.2 | 72.3 |
| (C) | 1 | 50 | 61.5 | 60.8 | 60.8 | 88.7 | 89.2 | 87.4 | 84.8 | 83.7 | 82.0 |
| (C) | 1 | 100 | 59.4 | 58.4 | 58.2 | 90.0 | 90.3 | 88.5 | 86.9 | 85.7 | 79.0 |
| (C) | 2 | 50 | 61.3 | 60.2 | 60.3 | 87.3 | 87.9 | 82.8 | 81.9 | 81.2 | 82.4 |
| (C) | 2 | 100 | 58.9 | 57.9 | 57.6 | 88.8 | 89.0 | 85.2 | 84.6 | 83.1 | 80.4 |
| (C) | 3 | 50 | 61.0 | 59.7 | 59.3 | 85.2 | 85.4 | 71.2 | 80.5 | 79.9 | 79.6 |
| (C) | 3 | 100 | 58.5 | 57.4 | 57.0 | 87.2 | 86.6 | 74.9 | 82.6 | 81.1 | 79.7 |

Table 2.

Percentage of correctly classified observations for the simulated data of Section 4.1, using plug-in (PI) regression bandwidths, bandwidths that minimise a crossvalidation (CV) estimate of classification error, or without smoothing the noisy data (NS). The three noise versions, in increasing order, are described in cases (D)–(F) in Section 4.1. Here “Cent” is the centroid classifier (2.13), “Cent sc.” is the scaled centroid classifier (2.14) and “QDA” is the quadratic discriminant classifier (2.15)

| Case | Noise version | ntr | Cent CV | Cent PI | Cent NS | Cent sc. CV | Cent sc. PI | Cent sc. NS | QDA CV | QDA PI | QDA NS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (D) | 1 | 50 | 80.2 | 69.5 | 80.6 | 85.7 | 68.5 | 86.3 | 93.9 | 92.7 | 69.2 |
| (D) | 1 | 100 | 81.5 | 70.0 | 82.0 | 87.3 | 69.2 | 87.3 | 96.6 | 93.2 | 84.3 |
| (D) | 2 | 50 | 74.7 | 65.9 | 75.6 | 90.0 | 65.6 | 90.3 | 88.5 | 86.7 | 60.9 |
| (D) | 2 | 100 | 76.9 | 66.8 | 77.3 | 90.9 | 66.8 | 91.0 | 92.3 | 86.8 | 77.6 |
| (D) | 3 | 50 | 69.2 | 62.3 | 69.6 | 94.2 | 65.4 | 94.4 | 82.9 | 80.5 | 55.6 |
| (D) | 3 | 100 | 71.4 | 63.4 | 72.0 | 95.1 | 66.4 | 95.1 | 87.9 | 80.8 | 72.7 |
| (E) | 1 | 50 | 65.0 | 61.9 | 67.3 | 94.8 | 79.0 | 95.0 | 77.0 | 73.1 | 71.2 |
| (E) | 1 | 100 | 65.8 | 62.5 | 67.6 | 95.4 | 79.6 | 95.4 | 84.3 | 69.7 | 82.8 |
| (E) | 2 | 50 | 63.0 | 60.2 | 65.4 | 94.7 | 80.6 | 95.0 | 77.8 | 74.2 | 70.5 |
| (E) | 2 | 100 | 63.4 | 59.9 | 64.9 | 95.4 | 81.4 | 95.5 | 84.8 | 69.8 | 82.4 |
| (E) | 3 | 50 | 61.3 | 59.2 | 64.3 | 94.6 | 81.4 | 94.9 | 77.9 | 74.4 | 69.8 |
| (E) | 3 | 100 | 61.8 | 58.2 | 63.1 | 95.4 | 82.5 | 95.5 | 84.5 | 71.5 | 81.8 |
| (F) | 1 | 50 | 60.2 | 59.1 | 59.4 | 88.0 | 88.7 | 87.9 | 83.5 | 82.6 | 80.4 |
| (F) | 1 | 100 | 58.8 | 57.8 | 57.7 | 89.0 | 89.3 | 88.5 | 84.9 | 83.2 | 77.3 |
| (F) | 2 | 50 | 59.8 | 58.7 | 59.0 | 86.5 | 87.2 | 84.6 | 80.8 | 80.2 | 80.0 |
| (F) | 2 | 100 | 58.6 | 57.3 | 57.2 | 87.6 | 87.7 | 85.8 | 83.1 | 81.1 | 77.0 |
| (F) | 3 | 50 | 59.2 | 58.3 | 58.2 | 84.5 | 84.1 | 76.7 | 79.4 | 78.5 | 78.9 |
| (F) | 3 | 100 | 58.0 | 56.8 | 56.5 | 85.8 | 84.6 | 78.5 | 81.0 | 79.1 | 75.7 |

As expected, Tables 1 and 2 show that, depending on the model and the classifier, the negative impact of smoothing with the standard PI bandwidth can be quite significant, sometimes reducing the percentage of correctly classified data by as much as 10%. In cases (A) and (D), it is the centroid classifier and its scaled version that are the most affected by this inappropriate level of smoothing, whereas the quadratic discriminant classifier is more robust against the level of smoothing. In cases (B) and (E), the scaled centroid classifier and the quadratic discriminant classifier are the most affected by inappropriate smoothing. Cases (C) and (F) are more robust against smoothing; there, all three versions (PI, CV and NS) of the data result in similar classification performance, although overall the data smoothed by CV result in slightly improved performance. Depending on the case, when the noise level increases the impact of inappropriate bandwidth choice can either increase or decrease.

4.2. Real data

We illustrate our findings on the ovarian cancer data set 8–7–02, which concerns 253 patients (91 controls and 162 with ovarian cancer). The data, which were produced to study the effect of robotic sample handling, are available from http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp. In this example, the functions Xi represent proteomic mass spectra and t ∈ [0, 20,000] is the mass over charge ratio, m/z. These raw curves are ideal for illustrating the negative impact that systematically smoothing by standard methods can have, because in some ranges of values of t, the spectra have considerable activity, and the impact of smoothing such data can be striking. We focus on one such range, namely t ∈ [200, 500].

To assess the performance of classifiers on this data set, we randomly and uniformly created B = 200 pairs of (training sample, test sample), where we took the training sample to be of size ntr and the test sample of size 253 − ntr, for ntr = 50 and ntr = 100. We also generated two more noise versions of the data, adding to the Ykji's in both the test and training data, noise εkji~N(0,0.04) (noise version 1) or εkji~N(0,0.25) (noise version 2), where the εkji's were totally independent.

For each version of the data (original data and noise versions 1 and 2), and for each pair of test and training sample, we constructed each classifier from the training sample, and applied the classifier to the test sample using either plug-in regression bandwidths to construct the estimators $\hat g$ and $\hat g_{kj}$, or bandwidths obtained by minimising the CV estimator of classification error defined in Section 4.1, where we took $\hat\pi_k = 1/2$.

Table 3 reports the percentage, averaged over the B pairs of samples, of correctly classified observations from the test samples. The table indicates very clearly that smoothing the data using the plug-in regression bandwidths degraded the quality of the two versions of the centroid classifier by about 10%, and a similar phenomenon was observed for the quadratic discriminant classifier when the training sample was small and when the data were noisy.

Table 3.

Percentage of correctly classified observations for the ovarian cancer data, using plug-in (PI) regression bandwidths or bandwidths that minimise a crossvalidation (CV) estimate of classification error. Here “Cent” is the centroid classifier (2.13), “Cent sc.” is the scaled centroid classifier (2.14) and “QDA” is the quadratic discriminant classifier (2.15)

| Data | ntr | Cent CV | Cent PI | Cent sc. CV | Cent sc. PI | QDA CV | QDA PI |
|---|---|---|---|---|---|---|---|
| Original data | 50 | 90.60 | 80.25 | 90.05 | 78.79 | 93.32 | 89.69 |
| Original data | 100 | 90.43 | 80.96 | 90.00 | 79.96 | 98.58 | 98.86 |
| Noisy version 1 | 50 | 88.07 | 75.19 | 87.83 | 74.23 | 78.03 | 68.50 |
| Noisy version 1 | 100 | 87.58 | 76.76 | 88.54 | 76.27 | 91.48 | 90.97 |
| Noisy version 2 | 50 | 76.15 | 66.57 | 76.65 | 66.09 | 56.91 | 48.54 |
| Noisy version 2 | 100 | 81.97 | 67.55 | 81.91 | 67.64 | 77.62 | 66.49 |
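The size of the degradation can be read off Table 3 by subtracting the PI column from the CV column for each classifier. The short sketch below does this arithmetic; the numbers are copied from Table 3, while the dictionary layout and the function name gaps are only illustrative.

```python
# Rows of Table 3, keyed by (data version, n_tr); each value lists
# (Cent CV, Cent PI, Cent sc. CV, Cent sc. PI, QDA CV, QDA PI).
table3 = {
    ("original", 50): (90.60, 80.25, 90.05, 78.79, 93.32, 89.69),
    ("original", 100): (90.43, 80.96, 90.00, 79.96, 98.58, 98.86),
    ("noisy 1", 50): (88.07, 75.19, 87.83, 74.23, 78.03, 68.50),
    ("noisy 1", 100): (87.58, 76.76, 88.54, 76.27, 91.48, 90.97),
    ("noisy 2", 50): (76.15, 66.57, 76.65, 66.09, 56.91, 48.54),
    ("noisy 2", 100): (81.97, 67.55, 81.91, 67.64, 77.62, 66.49),
}


def gaps(row):
    """CV minus PI, in percentage points, for (Cent, Cent sc., QDA)."""
    return tuple(round(row[i] - row[i + 1], 2) for i in (0, 2, 4))


if __name__ == "__main__":
    for key, row in table3.items():
        print(key, gaps(row))
```

A positive gap means that the cross-validated bandwidths classified more test curves correctly than the plug-in bandwidths did.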

Acknowledgments

Delaigle and Hall's research was supported by grants and fellowships from the Australian Research Council. Carroll's research was supported by a grant from the National Cancer Institute (R37-CA057030).

APPENDIX: PROOF OF THEOREM 1

A.1. Preliminary results

Define

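\[ \Delta_\ell(x) = \frac{1}{mh}\sum_{i=1}^{m} \varepsilon_i \Big(\frac{x - X_i}{h}\Big)^{\ell} K\Big(\frac{x - X_i}{h}\Big), \]
\[ W_\ell(x) = \frac{1}{mh}\sum_{i=1}^{m} \Big[\int_x^{X_i} \{g''(t) - g''(x)\}(X_i - t)\, dt\Big] \Big(\frac{x - X_i}{h}\Big)^{\ell} K\Big(\frac{x - X_i}{h}\Big). \]

With Uℓ and Vℓ given by (2.4) and (2.5), and using the model at (2.2) and the exact form of the remainder in Taylor's theorem, we can write:

\[ \begin{aligned} V_\ell(x) &= \frac{1}{mh}\sum_{i=1}^{m} \{g(X_i) + \varepsilon_i\} \Big(\frac{x - X_i}{h}\Big)^{\ell} K\Big(\frac{x - X_i}{h}\Big) \\ &= \frac{1}{mh}\sum_{i=1}^{m} \Big[g(x) + (X_i - x) g'(x) + \tfrac12 (X_i - x)^2 g''(x) + \varepsilon_i\Big] \Big(\frac{x - X_i}{h}\Big)^{\ell} K\Big(\frac{x - X_i}{h}\Big) + W_\ell(x) \\ &= g(x) U_\ell(x) - h g'(x) U_{\ell+1}(x) + \tfrac12 h^2 g''(x) U_{\ell+2}(x) + \Delta_\ell(x) + W_\ell(x). \end{aligned} \]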

Assuming, without loss of generality, that K is supported on [−1, 1],

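\[ |W_\ell(x)| \le h^2 \Big\{\sup_{t \in \mathcal I:\, |t - x| \le h} |g''(t) - g''(x)|\Big\} \frac{1}{mh}\sum_{i=1}^{m} K\Big(\frac{x - X_i}{h}\Big) \le h^2 U_0(x)\, Q, \]

where $Q = \sup_{s,t \in \mathcal I:\, |s - t| \le h} |g''(s) - g''(t)|$. Now,

\[ \hat g = \frac{U_2 V_0 - U_1 V_1}{U_2 U_0 - U_1^2} = g + \tfrac12 h^2 g''\, \frac{U_2^2 - U_1 U_3}{U_2 U_0 - U_1^2} + \Delta + \frac{U_2 W_0 - U_1 W_1}{U_2 U_0 - U_1^2}, \]

where $\Delta = (U_2 \Delta_0 - U_1 \Delta_1)/(U_2 U_0 - U_1^2)$. Therefore, since $|U_\ell| \le U_0$ for each ℓ ≥ 0,

\[ \Big|\hat g - \Big(g + \tfrac12 h^2 g''\, \frac{U_2^2 - U_1 U_3}{U_2 U_0 - U_1^2} + \Delta\Big)\Big| \le \frac{2 Q h^2 U_0^2}{U_2 U_0 - U_1^2}, \tag{A.1} \]

uniformly on $\mathcal I$.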
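The ratio form of the local linear estimator used above translates directly into code. The following is a minimal sketch, assuming that Uℓ and Vℓ have the kernel-weighted form suggested by the displays in this section, namely U_ℓ(x) = (mh)^{-1} Σ {(x − X_i)/h}^ℓ K{(x − X_i)/h} and V_ℓ(x) defined in the same way with an extra factor Y_i; the Epanechnikov kernel and the function names are illustrative choices rather than the paper's.

```python
def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear(x, X, Y, h, kernel=epanechnikov):
    """Local linear estimate at x, computed as (U2*V0 - U1*V1) / (U2*U0 - U1^2)."""
    m = len(X)
    U = [0.0, 0.0, 0.0]
    V = [0.0, 0.0]
    for xi, yi in zip(X, Y):
        u = (x - xi) / h
        k = kernel(u) / (m * h)
        for ell in range(3):
            U[ell] += u ** ell * k
        for ell in range(2):
            V[ell] += yi * u ** ell * k
    return (U[2] * V[0] - U[1] * V[1]) / (U[2] * U[0] - U[1] ** 2)


if __name__ == "__main__":
    X = [2 * i - 1 for i in range(1, 51)]
    Y = [2.0 + 0.5 * t for t in X]
    print(local_linear(40.0, X, Y, 10.0))
```

For noiseless data on a straight line the estimate reproduces the line exactly, which matches the expansion above: when g'' vanishes and there are no errors, only the leading term g remains.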

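Similarly, defining $Q_{kj} = \sup_{s,t \in \mathcal I:\, |s - t| \le h} |g_{kj}''(s) - g_{kj}''(t)|$, and

\[ \Delta_{kj\ell}(x) = \frac{1}{m_{kj} h_1}\sum_{i=1}^{m_{kj}} \varepsilon_{kji} \Big(\frac{x - X_{kji}}{h_1}\Big)^{\ell} K\Big(\frac{x - X_{kji}}{h_1}\Big), \]
\[ \Delta_{kj} = \frac{U_{kj2} \Delta_{kj0} - U_{kj1} \Delta_{kj1}}{U_{kj2} U_{kj0} - U_{kj1}^2}, \]

where Ukjℓ is as at (2.6), we have, uniformly on $\mathcal I$,

\[ \Big|\hat g_{kj} - \Big(g_{kj} + \tfrac12 h_1^2 g_{kj}''\, \frac{U_{kj2}^2 - U_{kj1} U_{kj3}}{U_{kj2} U_{kj0} - U_{kj1}^2} + \Delta_{kj}\Big)\Big| \le \frac{2 Q_{kj} h_1^2 U_{kj0}^2}{U_{kj2} U_{kj0} - U_{kj1}^2}. \tag{A.2} \]

Define

\[ \Delta_k = \frac{1}{n_k}\sum_{j=1}^{n_k} \Delta_{kj} \tag{A.3} \]

and recall that $\kappa_2 = \int u^2 K(u)\, du$. We shall derive the following result in Section A.6: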

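Lemma 1. Under the conditions of Theorem 1, for some C1 > 0, all C2 > 0 and k = 0, 1,

\[ P_k\Big(\sup_{\mathcal I} \Big|\frac{U_2^2 - U_1 U_3}{U_2 U_0 - U_1^2} - \kappa_2\Big| > n^{-C_1}\Big) = O(n^{-C_2}), \]
\[ P_k\Big(\max_{j=1,\ldots,n_k} \sup_{\mathcal I} \Big|\frac{U_{kj2}^2 - U_{kj1} U_{kj3}}{U_{kj2} U_{kj0} - U_{kj1}^2} - \kappa_2\Big| > n^{-C_1}\Big) = O(n^{-C_2}) \]

as n → ∞, and for some C3 > 0, all C2 > 0 and k = 0, 1,

\[ P_k\Big(\sup_{\mathcal I} \frac{U_0^2}{U_2 U_0 - U_1^2} > C_3\Big) = O(n^{-C_2}), \]
\[ P_k\Big(\max_{j=1,\ldots,n_k} \sup_{\mathcal I} \frac{U_{kj0}^2}{U_{kj2} U_{kj0} - U_{kj1}^2} > C_3\Big) = O(n^{-C_2}). \]

Furthermore, defining $M_{\mathrm{sum}} = \min_{k=0,1}\big(\sum_j m_{kj}\big)$, we have for all C2, C4 > 0,

\[ P_k\Big\{\sup_{\mathcal I} |\Delta| > n^{C_4} (mh)^{-1/2}\Big\} + \max_{k=0,1} P_k\Big\{\sup_{\mathcal I} |\Delta_k| > n^{C_4} (M_{\mathrm{sum}} h)^{-1/2}\Big\} = O(n^{-C_2}). \tag{A.4} \]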

A.2. Initial calculation of errk

Let $\mathcal G_1$ denote the sigma-field generated by the random variables introduced in Section 2, and the random functions gkj, but excluding g. Specifically, $\mathcal G_1$ is the sigma-field generated by gkj, Xkji and εkji for 1 ≤ i ≤ mkj, 1 ≤ j ≤ nk and k = 0, 1, and by Xi and εi for 1 ≤ i ≤ m. Recall that ⋈k is > or ≤ according as k = 0 or k = 1, respectively, and recall formula (3.11) for the statistic $S(\hat g)$.

Under the assumption that the new data set D comes from Πk, and conditional on $\mathcal G_1$, $\hat g$ is a Gaussian process with mean $\hat\alpha_k = E_k(\hat g \mid \mathcal G_1)$ and covariance function $\hat\Gamma_k$, say. In this notation,

\[ \mathrm{err}_k \equiv E_k\big[P_k\{S(\hat g) \bowtie_k 0 \mid \mathcal G_1\}\big] = E_k\{\Psi_k(-\hat\beta_k/\hat\sigma_k)\}, \tag{A.5} \]

where, by (3.11),

\[ \hat\beta_k = E_k\{S(\hat g) \mid \mathcal G_1\} = \int_{\mathcal I} (\hat\mu_1 - \hat\mu_0)\{2\hat\alpha_k - (\hat\mu_0 + \hat\mu_1)\}, \tag{A.6} \]
\[ \hat\sigma_k^2 = \mathrm{var}\{S(\hat g) \mid \mathcal G_1\} = 4 \int_{\mathcal I}\int_{\mathcal I} \{\hat\mu_1(x_1) - \hat\mu_0(x_1)\}\{\hat\mu_1(x_2) - \hat\mu_0(x_2)\}\, \hat\Gamma_k(x_1, x_2)\, dx_1\, dx_2. \tag{A.7} \]

The probability on the left-hand side of (A.5) equals the chance that, when D comes from Πk, the classifier based on $S(\hat g)$ makes an error and assigns D to the other population.

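A.3. Approximations to $\hat\alpha_k$, $\hat\beta_k$ and $\hat\sigma_k$

In view of (A.1),

\[ \Big|\hat\alpha_k - \Big(\mu_k + \tfrac12 h^2 \mu_k''\, \frac{U_2^2 - U_1 U_3}{U_2 U_0 - U_1^2} + \Delta\Big)\Big| \le \frac{2 E_k(Q)\, h^2 U_0^2}{U_2 U_0 - U_1^2}. \tag{A.8} \]

Noting that, for random variables A1, A2, B1 and B2, |cov(A1 + B1, A2 + B2) − cov(A1, A2)| ≤ |cov(B1, B2)| + |cov(A1, B2)| + |cov(B1, A2)|, where the covariances are interpreted conditionally on $\mathcal G_1$, we deduce from (A.1) that for a constant C4 > 0,

\[ \begin{aligned} \sup_{x_1, x_2 \in \mathcal I} \Big|\hat\Gamma_k(x_1, x_2) - \Big\{ & G_k(x_1, x_2) + \tfrac12 h^2 G_k^{(0,2)}(x_1, x_2)\, \Big(\frac{U_2^2 - U_1 U_3}{U_2 U_0 - U_1^2}\Big)(x_2) \\ & + \tfrac12 h^2 G_k^{(2,0)}(x_1, x_2)\, \Big(\frac{U_2^2 - U_1 U_3}{U_2 U_0 - U_1^2}\Big)(x_1) \Big\}\Big| \\ & \le C_4 h^2 \{h^2 + E_k(Q + Q^2)\} \sup_{\mathcal I} \Big(1 + \frac{U_0^2}{U_2 U_0 - U_1^2}\Big)^2, \end{aligned} \tag{A.9} \]

where we define $G_k^{(j_1, j_2)}(x_1, x_2) = \partial^{j_1 + j_2} G_k(x_1, x_2) / \partial x_1^{j_1} \partial x_2^{j_2}$. (Recall that Gk denotes the covariance of the Gaussian process g when the data D are drawn from Πk.)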

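With gk defined as at (3.1), and defining Δk as at (A.3), we have, in view of (A.2), Lemma 1 and (3.6)(b), the result

\[ P_k\Big\{\sup_{\mathcal I} \big|\hat\mu_k - \big(g_k + \tfrac12 h_1^2 \kappa_2 g_k'' + \Delta_k\big)\big| > n^{-C_1} h_1^2\Big\} = O(n^{-C_2}) \tag{A.10} \]

for some C1 > 0 and all C2 > 0. Using Rosenthal's inequality, it can be proved from (3.6) and (3.7)(c) that, for some C1 > 0 and all C2 > 0,

\[ P_k\Big(\sup_{\mathcal I} |g_k'' - \mu_k''| > n^{-C_1}\Big) = O(n^{-C_2}). \tag{A.11} \]

Together, (A.10) and (A.11) imply that

\[ P_k\Big\{\sup_{\mathcal I} \big|\hat\mu_k - \big(g_k + \tfrac12 h_1^2 \kappa_2 \mu_k'' + \Delta_k\big)\big| > n^{-C_1} h_1^2\Big\} = O(n^{-C_2}). \tag{A.12} \]

Define $H^2 = h^2 + h_1^2$,

\[ \begin{aligned} \beta_k = \int_{\mathcal I} & \big\{g_1 - g_0 + \tfrac12 h_1^2 \kappa_2 (\mu_1'' - \mu_0'') + \Delta_1 - \Delta_0\big\} \\ & \times \big\{2\mu_k - (g_0 + g_1) + h^2 \kappa_2 \mu_k'' - \tfrac12 h_1^2 \kappa_2 (\mu_0'' + \mu_1'') + 2\Delta - (\Delta_0 + \Delta_1)\big\}, \end{aligned} \tag{A.13} \]
\[ \begin{aligned} \tilde\sigma_k^2 = 4 \int_{\mathcal I}\int_{\mathcal I} & \big\{g_1 - g_0 + \tfrac12 h_1^2 \kappa_2 (\mu_1'' - \mu_0'') + \Delta_1 - \Delta_0\big\}(x_1) \\ & \times \big\{g_1 - g_0 + \tfrac12 h_1^2 \kappa_2 (\mu_1'' - \mu_0'') + \Delta_1 - \Delta_0\big\}(x_2) \\ & \times \big[G_k(x_1, x_2) + \tfrac12 h^2 \kappa_2 \{G_k^{(2,0)}(x_1, x_2) + G_k^{(0,2)}(x_1, x_2)\}\big]\, dx_1\, dx_2. \end{aligned} \]

Combining Lemma 1, (A.5)–(A.9) and (A.12), we deduce that, for some C1 > 0 and all C2 > 0,

\[ P_k\big(|\hat\beta_k - \beta_k| > n^{-C_1} H^2\big) = O(n^{-C_2}), \qquad P_k\big(|\hat\sigma_k^2 - \tilde\sigma_k^2| > n^{-C_1} H^2\big) = O(n^{-C_2}). \tag{A.14} \]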

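Observe from (A.13) that $\beta_k = \beta_{k0} + b_{k1} + \beta_{k1} + \beta_{k2} + \Delta_2$, where βk0 is as at (3.2),

\[ b_{k1} = \kappa_2 \int_{\mathcal I} (\mu_1 - \mu_0)\big(h^2 \mu_k'' - h_1^2 \mu_{1-k}''\big), \tag{A.15} \]
\[ \beta_{k1} = \int_{\mathcal I} (g_1 - g_0)\{2\Delta - (\Delta_0 + \Delta_1)\} + \int_{\mathcal I} \{2\mu_k - (g_0 + g_1)\}(\Delta_1 - \Delta_0), \]
\[ \beta_{k2} = \int_{\mathcal I} \{2\Delta - (\Delta_0 + \Delta_1)\}(\Delta_1 - \Delta_0) \]

and $\Delta_2 = \beta_k - (\beta_{k0} + b_{k1} + \beta_{k1} + \beta_{k2})$. Using (A.4) it can be shown that, for some C1 > 0 and all C2 > 0, and when ℓ = 2,

\[ P_k\big(|\Delta_\ell| > n^{-C_1} H^2\big) = O(n^{-C_2}). \tag{A.16} \]

Hence, noting the first result in (A.14), we have:

\[ P_k\big\{|\hat\beta_k - (\beta_{k0} + b_{k1} + \beta_{k1} + \beta_{k2})| > n^{-C_1} H^2\big\} = O(n^{-C_2}). \tag{A.17} \]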

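Recall the definitions of $\sigma_k^2$ and $\tau_k^2$ at (3.3) and (3.4), and put

\[ \sigma_{k0} = 2 h^2 \kappa_2 \int_{\mathcal I}\int_{\mathcal I} (g_1 - g_0)(x_1)\, (g_1 - g_0)(x_2)\, \{G_k^{(2,0)}(x_1, x_2) + G_k^{(0,2)}(x_1, x_2)\}\, dx_1\, dx_2, \tag{A.18} \]
\[ \sigma_{k1} = 4 h_1^2 \kappa_2 \int_{\mathcal I}\int_{\mathcal I} (g_1 - g_0)(x_1)\, (\mu_1'' - \mu_0'')(x_2)\, G_k(x_1, x_2)\, dx_1\, dx_2 \tag{A.19} \]

and $\Delta_3 = \tilde\sigma_k^2 - (\sigma_k^2 + \sigma_{k0} + \sigma_{k1})$. Thus, Δ3 is the term in Δ0 and Δ1 that arises when $\tilde\sigma_k^2$ is expanded. Using (A.4) it can be proved that (A.16) holds when ℓ = 3. Moreover, $\hat\sigma_k^2$ can be written as

\[ \hat\sigma_k^2 = \sigma_k^2 + \sigma_{k0} + \sigma_{k1} + \Delta_3 + \Delta_4, \tag{A.20} \]

where, in view of the second part of (A.14), (A.16) holds in the case ℓ = 4 and for some C1 > 0 and all C2 > 0.

Define τkℓ to be equal to σkℓ, at (A.18) and (A.19), when g0 and g1 on the respective right-hand sides are replaced by μ0 and μ1. Then for k = 0, 1 and ℓ = 0, 1, noting property (3.7)(c) on the rates of increase of n0 and n1, it can be shown that for some C1 > 0,

\[ P_k\big(|\sigma_{k\ell} - \tau_{k\ell}| > n^{-C_1} h_\ell^2\big) = O(n^{-C_2}) \tag{A.21} \]

for all C2 > 0, where we define h0 = h. Therefore, if C1 > 0 is sufficiently small,

\[ \max_{k=0,1}\, \max_{\ell=0,1}\, P_k\big(|\sigma_{k\ell}| > n^{-C_1}\big) = O(n^{-C_2}) \tag{A.22} \]

for all C2 > 0.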

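A.4. Approximation to $\hat\sigma_k^{-1}$

In the notation at (A.20),

\[ \frac{1}{\hat\sigma_k} = \frac{1}{\tau_k}\Big(1 + \frac{\sigma_k^2 - \tau_k^2}{\tau_k^2} + \frac{\sigma_{k0} + \sigma_{k1} + \Delta_3 + \Delta_4}{\tau_k^2}\Big)^{-1/2} = s_k(\infty), \]

where, for 0 ≤ r ≤ ∞,

\[ s_k(r) = \frac{1}{\tau_k}\sum_{j=0}^{r}\sum_{\ell=0}^{j} \binom{-\frac12}{j}\binom{j}{\ell} \Big(\frac{\sigma_k^2 - \tau_k^2}{\tau_k^2}\Big)^{j-\ell} \Big(\frac{\sigma_{k0} + \sigma_{k1} + \Delta_3 + \Delta_4}{\tau_k^2}\Big)^{\ell}. \]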
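The double series is the binomial expansion of (1 + a + b)^{−1/2}, with a standing for the first ratio and b for the second. The sketch below, with hypothetical function names and arbitrary small values of a and b chosen only for illustration, checks numerically that the partial sums approach the exact value quickly when |a + b| is small.

```python
import math


def gen_binom(alpha, j):
    """Generalised binomial coefficient C(alpha, j) for real alpha and integer j >= 0."""
    out = 1.0
    for i in range(j):
        out *= (alpha - i) / (i + 1)
    return out


def s_partial(a, b, r, tau=1.0):
    """Partial sum, up to order r, of the double series for (1 + a + b)^{-1/2} / tau."""
    total = 0.0
    for j in range(r + 1):
        for ell in range(j + 1):
            total += (
                gen_binom(-0.5, j) * math.comb(j, ell) * a ** (j - ell) * b ** ell
            )
    return total / tau


if __name__ == "__main__":
    a, b = 0.05, 0.02
    for r in (1, 2, 4, 12):
        print(r, s_partial(a, b, r), (1.0 + a + b) ** -0.5)
```

The first-order partial sum is 1 − (a + b)/2, and each additional order shrinks the error by roughly a factor |a + b|, which is why a fixed, sufficiently large r suffices in the argument that follows.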

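We claim that the infinite series defined by sk(∞) converges with probability $1 - O(n^{-C_2})$ for all C2 > 0. To appreciate why, note that, by (3.6) and (3.7)(c), there exists C1 > 0 such that

\[ P_k\big(|\sigma_k^2 - \tau_k^2| > n^{-C_1}\big) = O(n^{-C_2}) \]

for all C2 > 0. Combining this property, (A.16) for ℓ = 3 and 4, and (A.22), we deduce that, for some C1 > 0 and all C2 > 0,

\[ P_k\Big(\Big|\frac{\sigma_k^2 - \tau_k^2}{\tau_k^2} + \frac{\sigma_{k0} + \sigma_{k1} + \Delta_3 + \Delta_4}{\tau_k^2}\Big| \le n^{-C_1}\Big) = 1 - O(n^{-C_2}). \]

Therefore, if C3 > 0 is given then r0 = r0(C3) ≥ 1 can be chosen so large that, whenever r0 ≤ r ≤ ∞, $P_k\{|\hat\sigma_k^{-1} - s_k(r)| > n^{-C_3}\} = O(n^{-C_2})$ for all C2 > 0. Using this property and (A.16), again for ℓ = 3 and 4, and employing also (A.21), we see that for some C1 > 0 and all C2 > 0, if r0 is chosen sufficiently large,

\[ P_k\big\{|\hat\sigma_k^{-1} - t_k(r)| > n^{-C_1} H^2\big\} = O(n^{-C_2}) \tag{A.23} \]

for r ≥ r0, where

\[ t_k(r) = \frac{1}{\tau_k}\sum_{j=0}^{r}\sum_{\ell=0}^{\min(j,1)} \binom{-\frac12}{j}\binom{j}{\ell} \Big(\frac{\sigma_k^2 - \tau_k^2}{\tau_k^2}\Big)^{j-\ell} \Big(\frac{\tau_{k0} + \tau_{k1}}{\tau_k^2}\Big)^{\ell}. \tag{A.24} \]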

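A.5. Approximation to $E_k\{\Psi_k(-\hat\beta_k/\hat\sigma_k)\}$

Let C1 > 0 and let ℓ0 ≥ 0 be an integer. With Ukjℓ defined as at (2.6), let $\mathcal E$ denote the event

\[ \mathcal E = \mathcal E(C_1, \ell_0) = \Big\{\max_{1 \le \ell \le \ell_0}\, \max_{j=1,\ldots,n_k}\, \sup_{x \in \mathcal I} |U_{kj\ell}(x) - \kappa_\ell f_X(x)| \le n^{-C_1}\Big\}, \]

where $\kappa_\ell = \int u^\ell K(u)\, du$ and hence vanishes for odd ℓ, since by (3.7)(b), K is symmetric. It will be proved in Section A.6 that, for some C1 > 0 and each ℓ0 ≥ 0,

\[ P_k\{\mathcal E(C_1, \ell_0)\} = 1 - O(n^{-C_2}) \quad \text{for all } C_2 > 0. \tag{A.25} \]

If $\mathcal E(C_1, \ell_0)$ holds for an ℓ0 ≥ 2 then, if $0 < C_1' < C_1$, there exists a nonrandom integer n0 ≥ 1 such that the event $\mathcal E_1 = \mathcal E_1(C_1')$, defined by

\[ \mathcal E_1 = \Big\{\max_{j=1,\ldots,n_k}\, \sup_{x \in \mathcal I} \big|U_{kj2}(x) U_{kj0}(x) - U_{kj1}(x)^2 - \kappa_2 f_X(x)^2\big| \le n^{-C_1'}\Big\} \tag{A.26} \]

holds for all n ≥ n0.

Let $I = I(\mathcal E)$ denote the indicator of $\mathcal E$. In view of (A.25),

\[ E_k\{\Psi_k(-\hat\beta_k/\hat\sigma_k)\} = E_k\{I\, \Psi_k(-\hat\beta_k/\hat\sigma_k)\} + O(n^{-C_2}) \tag{A.27} \]

for all C2 > 0, and so to approximate the term on the left-hand side of (A.27) we may develop an approximation to the first term on the right-hand side.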

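Let $\mathcal G_2$ denote the sigma-field generated by the random variables Xi for 1 ≤ i ≤ m, and by Xkji and the functions gkj for 1 ≤ i ≤ mkj, 1 ≤ j ≤ nk and k = 0, 1 (i.e., generated by everything except g and the experimental errors εi and εkji). The quantities I, tk(r) at (A.24), βk0 at (3.2), and bk1 at (A.15) are all $\mathcal G_2$-measurable. Therefore, using (A.17) and (A.23), and noting that Ψk is an analytic function with all derivatives uniformly bounded, we obtain

\[ \begin{aligned} E_k\{I\, \Psi_k(-\hat\beta_k/\hat\sigma_k)\} &= E_k\big(E_k\big[I\, \Psi_k\{-(\beta_{k0} + b_{k1} + \beta_{k1} + \beta_{k2})\, t_k(r)\} \mid \mathcal G_2\big]\big) + o(H^2) \\ &= E_k\big[I\, \Psi_k\{-\beta_{k0} t_k(r)\}\big] - b_{k1} \tau_k^{-1} E_k\big[I\, \Psi_k'\{-\beta_{k0} t_k(r)\}\big] \\ &\quad - \tau_k^{-1} E_k\big[E_k(\beta_{k2} \mid \mathcal G_2)\, I\, \Psi_k'\{-\beta_{k0} t_k(r)\}\big] + \tfrac12 \tau_k^{-2} E_k\big[E_k(\beta_{k1}^2 \mid \mathcal G_2)\, I\, \Psi_k''\{-\beta_{k0} t_k(r)\}\big] \qquad \text{(A.28)} \\ &\quad + O\{(mh)^{-2} + (M_{\mathrm{sum}} h_1)^{-2}\} + o(H^2). \qquad \text{(A.29)} \end{aligned} \]

Here we have used the properties $E_k(\beta_{k1} \mid \mathcal G_2) = 0$, $E_k|t_k(r) - \tau_k^{-1}| = O(n^{-C})$ for some C > 0,

\[ E_k\big[E_k(\beta_{k\ell_1}^{\ell_2} \mid \mathcal G_2)\, I\, \Psi_k^{(\ell_2)}\{-\beta_{k0} t_k(r)\}\big] = O\{(mh)^{-2} + (M_{\mathrm{sum}} h_1)^{-2}\} \]

for ℓ2 ≥ 3 if ℓ1 = 1, and for ℓ2 ≥ 2 if ℓ1 = 2, and

\[ E_k\big[E_k(\beta_{k1} \beta_{k2} \mid \mathcal G_2)\, I\, \Psi_k''\{-\beta_{k0} t_k(r)\}\big] = O\{(mh)^{-2} + (M_{\mathrm{sum}} h_1)^{-2}\}. \]

Further, we have used the fact that the event $\mathcal E_1$, defined at (A.26), obtains whenever I ≠ 0.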

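In addition, it can be shown that

\[ \tfrac14 E_k\big[E_k(\beta_{k1}^2 \mid \mathcal G_2)\, I\big] = E_k\Big\{I \int_{\mathcal I} (g_0 - g_1) \Delta\Big\}^2 + E_k\Big\{I \int_{\mathcal I} (g_0 - \mu_k) \Delta_0\Big\}^2 + E_k\Big\{I \int_{\mathcal I} (g_1 - \mu_k) \Delta_1\Big\}^2 = O(m^{-1}), \tag{A.30} \]

that

\[ \begin{aligned} E_k\big[E_k(\beta_{k2} \mid \mathcal G_2)\, I\, \Psi_k'\{-\beta_{k0} t_k(r)\}\big] &= (-1)^{k+1} \phi(b_{k0}/\tau_k) \int_{\mathcal I} E_k\big[I \{E_k(\Delta_0^2 \mid \mathcal G_2) - E_k(\Delta_1^2 \mid \mathcal G_2)\}\big] + o\{(\nu_0 h_1)^{-1}\} \\ &= \kappa h_1^{-1} \big(\sigma_{\varepsilon 0}^2 \nu_0^{-1} - \sigma_{\varepsilon 1}^2 \nu_1^{-1}\big) (-1)^{k+1} \phi(b_{k0}/\tau_k) \int_{\mathcal I} f_X^{-1} + o\{(\nu_0 h_1)^{-1}\} \end{aligned} \tag{A.31} \]

and that

\[ b_{k1} \tau_k^{-1} E_k\big[I\, \Psi_k'\{-\beta_{k0} t_k(r)\}\big] = b_{k1} \tau_k^{-1} (-1)^{k+1} \phi(b_{k0}/\tau_k) + o(H^2), \tag{A.32} \]

where bk0 and bk1 are as at (3.2) and (A.15), ϕ is the standard normal density, and we have used the fact that $\Psi_k' = (-1)^{k+1}\phi$. Combining (A.25) and (A.27)–(A.32), and taking r sufficiently large (but fixed), we deduce that

\[ \begin{aligned} E_k\{\Psi_k(-\hat\beta_k/\hat\sigma_k)\} &= E_k\big[\Psi_k\{-\beta_{k0}/\sigma_k\}\big] - b_{k1} \tau_k^{-1} (-1)^{k+1} \phi(b_{k0}/\tau_k) \\ &\quad - \frac{\kappa}{\tau_k} h_1^{-1} \big(\sigma_{\varepsilon 0}^2 \nu_0^{-1} - \sigma_{\varepsilon 1}^2 \nu_1^{-1}\big) (-1)^{k+1} \phi(b_{k0}/\tau_k) \int_{\mathcal I} f_X^{-1} \\ &\quad + O\{m^{-1} + (mh)^{-2}\} + o\{H^2 + (\nu_0 h_1)^{-1}\}. \end{aligned} \tag{A.33} \]

Result (3.8) follows from (A.5) and (A.33).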

A.6. Proof of Lemma 1 and (A.25)

The results in Lemma 1, with the exception of (A.4), and also result (A.25), will follow if we show that for each ℓ ≥ 1, some C1 > 0 and all C2 > 0,

\[ P_k\Big\{\sup_{x \in \mathcal I} |U_\ell(x) - \kappa_\ell f_X(x)| > n^{-C_1}\Big\} = O(n^{-C_2}), \tag{A.34} \]
\[ P_k\Big\{\max_{j=1,\ldots,n_k}\, \sup_{x \in \mathcal I} |U_{kj\ell}(x) - \kappa_\ell f_X(x)| > n^{-C_1}\Big\} = O(n^{-C_2}). \tag{A.35} \]

We shall derive (A.35); a proof of (A.34) is similar.

Markov's inequality can be used to prove that

\[ \max_{j=1,\ldots,n_k}\, \sup_{x \in \mathcal I} P_k\big\{|U_{kj\ell}(x) - \kappa_\ell f_X(x)| > n^{-C_1}\big\} = O(n^{-C_2}). \tag{A.36} \]

It follows from (3.7)(c) that each nk is increasing no faster than polynomially in n, and therefore, if we confine attention to x in a subset $\mathcal I_n$, say, of $\mathcal I$ that contains only $O(n^C)$ points for some C > 0, we can place the maximum and supremum inside the probability statement at (A.36), provided that $\mathcal I$ is replaced by $\mathcal I_n$: for some C1 > 0 and all C2 > 0,

\[ P_k\Big\{\max_{j=1,\ldots,n_k}\, \sup_{x \in \mathcal I_n} |U_{kj\ell}(x) - \kappa_\ell f_X(x)| > n^{-C_1}\Big\} = O(n^{-C_2}). \tag{A.37} \]

The assumption, in (3.7)(b), that K is compactly supported and Hölder continuous, and the implication, in (3.5)(c), that fX is also Hölder continuous, enable (A.35) to be derived directly from (A.37) by taking $\mathcal I_n$ to be a sufficiently fine grid in $\mathcal I$.

A proof of (A.4) in Lemma 1 is similar. To illustrate the argument, we derive the following part of (A.4): for all C2, C4 > 0,

\[ P_k\Big\{\sup_{x \in \mathcal I} |\Delta(x)| > n^{C_4} (mh)^{-1/2}\Big\} = O(n^{-C_2}). \tag{A.38} \]

Using Markov's and Rosenthal's inequalities, we first obtain the result when the supremum is outside the probability statement:

\[ \sup_{x \in \mathcal I} P_k\big\{|\Delta(x)| > n^{C_4} (mh)^{-1/2}\big\} = O(n^{-C_2}). \]

Taking $\mathcal I_n$ to contain only $O(n^C)$ points, for any fixed C > 0, we deduce that

\[ P_k\Big\{\sup_{x \in \mathcal I_n} |\Delta(x)| > n^{C_4} (mh)^{-1/2}\Big\} = O(n^{-C_2}), \]

and taking $\mathcal I_n$ to be a sufficiently fine grid in $\mathcal I$ we obtain (A.38).

Footnotes

SUPPLEMENTARY MATERIAL

Supplement to “Unexpected properties of bandwidth choice when smoothing discrete data for constructing a functional data classifier” (DOI: 10.1214/13-AOS1158SUPP; .pdf). The supplementary file contains the proof of Theorems 2 and 3, as well as additional simulation results.

REFERENCES

  1. Araki Y, Konishi S, Kawano S, Matsui H. Functional logistic discrimination via regularized basis expansions. Comm. Statist. Theory Methods. 2009;38:2944–2957. MR2568196.
  2. Benhennia K, Degras D. Local polynomial regression based on functional data. 2011. Unpublished manuscript. Available at http://arxiv.org/pdf/1107.4058v1.
  3. Berlinet A, Biau G, Rouvière L. Functional supervised classification with wavelets. Ann. I.S.U.P. 2008;52:61–80. MR2435041.
  4. Biau G, Bunea F, Wegkamp MH. Functional classification in Hilbert spaces. IEEE Trans. Inform. Theory. 2005;51:2163–2172. MR2235289.
  5. Cardot H, Degras D, Josserand E. Confidence bands for Horvitz–Thompson estimators using sampled noisy functional data. Bernoulli. 2013;19:2067–2097.
  6. Cardot H, Josserand E. Horvitz–Thompson estimators for functional data: Asymptotic confidence bands and optimal allocation for stratified sampling. Biometrika. 2011;98:107–118. MR2804213.
  7. Carroll RJ, Delaigle A, Hall P. Supplement to “Unexpected properties of bandwidth choice when smoothing discrete data for constructing a functional data classifier.” 2013. DOI:10.1214/13-AOS1158SUPP.
  8. Cuevas A, Febrero M, Fraiman R. Robust estimation and classification for functional data via projection-based depth notions. Comput. Statist. 2007;22:481–496. MR2336349.
  9. Delaigle A, Hall P. Achieving near perfect classification for functional data. J. R. Stat. Soc. Ser. B Stat. Methodol. 2012;74:267–286. MR2899863.
  10. Delaigle A, Hall P, Bathia N. Componentwise classification and clustering of functional data. Biometrika. 2012;99:299–313. MR2931255.
  11. Epifanio I. Shape descriptors for classification of functional data. Technometrics. 2008;50:284–294. MR2528652.
  12. Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability. Chapman & Hall; London: 1996. p. 66. MR1383587.
  13. Fromont M, Tuleau C. Functional classification with margin conditions. In: Carbonell JG, Siekmann J, editors. Learning Theory: Proceedings of the 19th Annual Conference on Learning Theory, Pittsburgh; New York: Springer; 2006.
  14. Hall P, Hosseini-Nasab M. On properties of functional principal components analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006;68:109–126. MR2212577.
  15. Hall P, Hosseini-Nasab M. Theory for high-order bounds in functional principal components analysis. Math. Proc. Cambridge Philos. Soc. 2009;146:225–256. MR2461880.
  16. Hall P, Kang K-H. Bandwidth choice for nonparametric classification. Ann. Statist. 2005;33:284–306. MR2157804.
  17. Hall P, Van Keilegom I. Two-sample tests in functional data analysis starting from discrete data. Statist. Sinica. 2007;17:1511–1531. MR2413533.
  18. Leng X, Müller H-G. Classification using functional data analysis for temporal gene expression data. Bioinformatics. 2006;22:68–76. doi: 10.1093/bioinformatics/bti742.
  19. Li Y, Hsing T. Deciding the dimension of effective dimension reduction space for functional and high-dimensional data. Ann. Statist. 2010a;38:3028–3062. MR2722463.
  20. Li Y, Hsing T. Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. Ann. Statist. 2010b;38:3321–3351. MR2766854.
  21. López-Pintado S, Romo J. Depth-based classification for functional data. In: Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. 2006;72:103–119. Amer. Math. Soc., Providence, RI. MR2343116.
  22. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge Univ. Press; Cambridge: 2008.
  23. Panaretos VM, Kraus D, Maddocks JH. Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. J. Amer. Statist. Assoc. 2010;105:670–682. MR2724851.
  24. Ramsay JO, Silverman BW. Functional Data Analysis. 2nd ed. Springer; New York: 2005. MR2168993.
  25. Rossi F, Villa N. Support vector machine for functional data classification. Neurocomputing. 2006;69:730–742.
  26. Ruppert D, Sheather SJ, Wand MP. An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc. 1995;90:1257–1270. MR1379468.
  27. Vilar JA, Pértega S. Discriminant and cluster analysis for Gaussian stationary processes: Local linear fitting approach. J. Nonparametr. Stat. 2004;16:443–462. MR2073035.
  28. Wang X, Ray S, Mallick BK. Bayesian curve classification using wavelets. J. Amer. Statist. Assoc. 2007;102:962–973. MR2354408.
  29. Wu S, Müller H-G. Response-adaptive regression for longitudinal data. Biometrics. 2011;67:852–860. doi: 10.1111/j.1541-0420.2010.01518.x. MR2829259.
