Abstract
The data functions that are studied in the course of functional data analysis are assembled from discrete data, and the level of smoothing that is used is generally that which is appropriate for accurate approximation of the conceptually smooth functions that were not actually observed. Existing literature shows that this approach is effective, and even optimal, when using functional data methods for prediction or hypothesis testing. However, in the present paper we show that this approach is not effective in classification problems. There a useful rule of thumb is that undersmoothing is often desirable, but there are several surprising qualifications to that approach. First, the effect of smoothing the training data can be more significant than that of smoothing the new data set to be classified; second, undersmoothing is not always the right approach, and in fact in some cases using a relatively large bandwidth can be more effective; and third, these perverse results are the consequence of very unusual properties of error rates, expressed as functions of smoothing parameters. For example, the orders of magnitude of optimal smoothing parameter choices depend on the signs and sizes of terms in an expansion of error rate, and those signs and sizes can vary dramatically from one setting to another, even for the same classifier.
Keywords: Centroid method, discrimination, kernel smoothing, quadratic discrimination, smoothing parameter choice, training data
1. Introduction
All supposedly “functional” data are actually observed discretely, sometimes on a grid and on other occasions at randomly scattered points. For example, in longitudinal data analysis the observation points are often widely spaced and irregularly placed, and substantial smoothing is commonly used to convert discrete data like these to functions. The impact of such smoothing has been addressed in the context of prediction or hypothesis testing for functional data; see, for example, Hall and Van Keilegom (2007), Panaretos, Kraus and Maddocks (2010), Wu and Müller (2011), Benhennia and Degras (2011), Cardot and Josserand (2011) and Cardot, Degras and Josserand (2013). The main conclusion of these papers has been that conventional rules for smoothing discrete data typically apply, and that smoothing parameters of standard size generally are appropriate.
In contrast, the present paper was motivated by numerical work indicating that, in the context of classifying functional data, smoothing parameters of highly nonstandard sizes are appropriate, and more generally that, even for a relatively simple classifier, there is no simple precept (even an asymptotic prescription of size) that leads to minimisation of error rate. If one had to give a rule, valid in some but by no means all cases, it would be to undersmooth, but even there unexpected caveats must be addressed.
For example, it turns out that the impact of smoothing the training data can be more significant than that of smoothing the new data to be classified. Indeed, the effect of smoothing the new data is characteristic of a parametric problem, rather than a nonparametric one. There, asymptotic arguments indicate that (sample size)^{−1/2} is an appropriate bandwidth size for reducing the impact of smoothing to parametric levels, whereas (sample size)^{−1/3} is the nearest analogue for smoothing the training data.
However, both these recommendations are incorrect in many cases. Depending on the signs and sizes of certain functionals of the data distributions, it can be optimal to use smoothing parameters that are an order of magnitude smaller, or an order of magnitude larger, than these. Using some viewpoints the need for a low level of smoothing is intuitively clear. Indeed, we expect that relatively minor features of a curve, of the sort that might disappear if we were too enthusiastic in the smoothing step, could have important information to convey in a classification analysis. On the other hand, our results show that very high levels of smoothing are sometimes advantageous.
We drew these conclusions after studying three different classifiers for functional data: the standard centroid-based method, the scale-adjusted form of that approach, and a version for functional data of quadratic discrimination. Our conclusions are valid for all three approaches, although they contradict conclusions which are well known for standard nonparametric approximations to the Bayes classifier in multivariate, rather than functional-data, settings. Specifically, for univariate and multivariate data, and nonparametric Bayes classifiers, conventional smoothing parameters, for example, those chosen using standard plug-in rules for function estimation, typically are of the correct order even though they do not quite minimise asymptotic classification error; see, for example, Hall and Kang (2005). Moreover, there does not exist a version of our results in univariate or multivariate settings, since there is no analogue in such cases of the “lattice effect,” represented by the mkj's.
To these comments, we should add that in practice there is relatively little difficulty in choosing smoothing parameters to minimise error rate; cross-validation is usually effective. Our aim in this paper is therefore not to develop methods for choosing the bandwidth optimally, or nearly optimally, in classification problems, but to provide an understanding of the many aspects of those problems that conspire together to determine the optimal choice.
2. Model and methodology
2.1. Model
We consider n0 (resp., n1) unknown random functions {g0j, 1 ≤ j ≤ n0} (resp., {g1j, 1 ≤ j ≤ n1}) coming from two populations. We observe a training sample of the data pairs (Xkji, Ykji), 1 ≤ i ≤ mkj, for 1 ≤ j ≤ nk and k = 0, 1, corresponding to noisy versions of the gkj's sampled at a discrete set of random points (i.e., the Xkji's) and generated by the model
$$Y_{kji} = g_{kj}(X_{kji}) + \varepsilon_{kji}, \tag{2.1}$$
where k indexes the population, Πk, from which the data in came, j denotes the index of an individual drawn from Πk, and i is the index of a data pair (Xkji, Ykji) for the jth individual from the kth population.
The gkj are random functions defined on a compact interval , but observed only at mkj points Xkj1, …, Xkjmkj. These points may be fixed or random, and although we shall develop our arguments in the random case, they can easily be extended to the fixed case. We assume that each gkj has two bounded derivatives on ; the respective sequences of X's and ε's are each identically distributed with distributions that do not depend on the g's; the g's, X's and ε's are all mutually independent; the X's are supported on ; and the ε's have zero mean and finite variance.
We also observe a new data set , similar to the 's except that in this case we do not know which population the data come from. Here,
$$Y_i = g(X_i) + \varepsilon_i, \quad 1 \le i \le m, \tag{2.2}$$
where the function g, the X's and the ε's have the properties given in the previous paragraph. Using the training data, we wish to determine whether came from Π0 or Π1.
In the functional data literature [see, e.g., Ramsay and Silverman (2005)], when the data are noisy, it is common to preprocess them prior to further analysis. Typically, this is done by smoothing the data in some way, for example, through a spline or kernel smoother, thereby obtaining, from the data in and , estimators and of gkj and g, respectively. In the classification context, once these estimators have been derived, they are plugged into functional data classifiers, replacing there the unobserved functions g and gkj by their estimators and . Our aim in this paper is to describe the application of estimators and of g and gkj, and in particular to describe the influence of tuning parameters used to construct them, when the aim is classification rather than just function estimation.
2.2. Estimating g, gkj and their mean and covariance functions
There are several ways to obtain nonparametric estimators of the functions g and gkj, but the most popular ones are spline and local linear methods. They have similar properties, but since local linear estimators are much more tractable theoretically, we shall use these in this work. For , the local linear estimators of g and gkj are defined by
| (2.3) |
where
| (2.4) |
| (2.5) |
| (2.6) |
K is a kernel function, h > 0 and h1 > 0 are bandwidths, and Kh(x) = K(x/h)/h. See, for example, Fan and Gijbels (1996). For simplicity, throughout we use the same bandwidth h1 for each population and each individual, but we could have replaced h1 by bandwidths that depended on k and j, as we do in our numerical work.
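The local linear estimator described above can be sketched as follows. This is a minimal illustration, assuming the textbook local linear form of Fan and Gijbels (1996) and a Gaussian kernel; the function names `gaussian_kernel` and `local_linear` are hypothetical, and the displays (2.3)–(2.6) may organise the same quantities differently.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_linear(x, xs, ys, h, kernel=gaussian_kernel):
    """Local linear estimate at x from pairs (xs[i], ys[i]) with bandwidth h."""
    # K_h(u) = K(u / h) / h, as in the text.
    k = [kernel((xi - x) / h) / h for xi in xs]
    # Weighted moments of the design points around x.
    s0 = sum(k)
    s1 = sum(ki * (xi - x) for ki, xi in zip(k, xs))
    s2 = sum(ki * (xi - x) ** 2 for ki, xi in zip(k, xs))
    # Weighted moments involving the responses.
    t0 = sum(ki * yi for ki, yi in zip(k, ys))
    t1 = sum(ki * (xi - x) * yi for ki, xi, yi in zip(k, xs, ys))
    denom = s0 * s2 - s1 * s1
    if denom == 0.0:
        raise ValueError("bandwidth too small for the available design points")
    # Intercept of the kernel-weighted least-squares line fitted around x.
    return (s2 * t0 - s1 * t1) / denom
```

A useful sanity check is that the local linear fit reproduces exactly linear data for any bandwidth, which is one reason it is theoretically convenient.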
The classifiers we consider in this work require estimators of the population means and covariances. For k = 0, 1, let μk denote the mean function
$$\mu_k(t) = E_k\{g(t)\}, \tag{2.7}$$
where Ek represents expectation under the assumption that the data come from Πk. Also, let Gk be the covariance function, defined by Gk(u, v) = covk{g(u), g(v)} = Ek{g(u)g(v)} − μk(u)μk(v), where covk denotes covariance when the data come from Πk. Estimators and of μk and Gk are defined in the standard way by the empirical mean and covariance functions, but replacing, in the definitions of these estimators, the unobserved gkj by :
| (2.8) |
| (2.9) |
See, for example, Ramsay and Silverman (2005), Chapter 2.
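The empirical mean and covariance functions can be sketched on a common evaluation grid. This is an illustrative sketch only: it assumes each smoothed curve has already been evaluated at the same grid points, and it assumes a divisor of n (rather than n − 1) in the covariance, which may differ from the normalisation in (2.9). The function names are hypothetical.

```python
def mean_function(curves):
    """Pointwise average of curves, each given as a list of grid values."""
    n = len(curves)
    grid_size = len(curves[0])
    return [sum(c[t] for c in curves) / n for t in range(grid_size)]


def covariance_function(curves):
    """Empirical covariance surface on the grid (divisor n is an assumption)."""
    n = len(curves)
    mu = mean_function(curves)
    grid_size = len(mu)
    return [
        [
            sum((c[u] - mu[u]) * (c[v] - mu[v]) for c in curves) / n
            for v in range(grid_size)
        ]
        for u in range(grid_size)
    ]
```

In the setting of the paper, the inputs would be the smoothed training curves from one population, so the bandwidth h1 enters these estimators only through the curves themselves.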
Consider the spectral decomposition of the covariance function
$$G_k(u, v) = \sum_{\ell=1}^{\infty} \theta_{k\ell}\, \psi_{k\ell}(u)\, \psi_{k\ell}(v), \tag{2.10}$$
where (θkℓ, ψkℓ) is an (eigenvalue, eigenfunction) pair for the linear operator Gk defined by Gk(ψ)(u) = ∫ Gk(u, v)ψ(v) dv, and where, following convention, we have used the notation Gk for both the operator and the covariance. The terms in (2.10) are ordered such that θk1 ≥ θk2 ≥ ⋯ ≥ 0. If g is drawn from Πk then we can write
$$g = \mu_k + \sum_{\ell=1}^{\infty} \theta_{k\ell}^{1/2}\, Z_{k\ell}\, \psi_{k\ell}, \tag{2.11}$$
where μk = Ek(g) denotes the mean of the random process of which g is a realisation, , and the Zkℓ's (for ℓ = 1, 2, …) comprise a sequence of uncorrelated random variables with zero mean and unit variance. The quantities θkℓ, and ψkℓ can be estimated consistently by the eigenvalues and eigenfunctions and of the linear operator , defined by , with the covariance estimator defined as at (2.9):
| (2.12) |
where , and, since for all ℓ > nk, all but the first nk terms in the series at (2.12) vanish. See Hall and Hosseini-Nasab (2006, 2009) for properties of these estimators in the case where g and gkj are observed; see also Li and Hsing (2010a, 2010b) for other cases.
2.3. Constructing classifiers
Classifiers for functional data have received a great deal of attention in the literature. See, for example, Vilar and Pértega (2004), Biau, Bunea and Wegkamp (2005), Fromont and Tuleau (2006), Leng and Müller (2006), López-Pintado and Romo (2006), Rossi and Villa (2006), Cuevas, Febrero and Fraiman (2007), Wang, Ray and Mallick (2007), Berlinet, Biau and Rouvière (2008), Epifanio (2008), Araki et al. (2009), Delaigle and Hall (2012) and Delaigle, Hall and Bathia (2012).
In those papers the authors suggest methods for constructing classifiers, but so far the theoretical impact of smoothing, that is, the impact of using ĝ and ĝkj instead of g and gkj when constructing classifiers, has been largely ignored in the literature. In this paper, we study this impact of smoothing for three relatively simple functional classifiers: the centroid classifier, or Rocchio classifier [see, e.g., Manning, Raghavan and Schütze (2008)], commonly used for classifying high-dimensional data; a scaled version of this classifier, which we define below in a general way; and a version for functional data of Fisher's quadratic discriminant, studied, for example, by Leng and Müller (2006) and Delaigle and Hall (2012). These classifiers are usually defined in terms of the functions g and gkj, and here we shall define them in terms of ĝ and ĝkj. The standard versions of these classifiers are obtained by replacing ĝ and ĝkj by g and gkj. The functions ĝkj appear only implicitly through the estimated means and covariance functions constructed in Section 2.2.
In the present setting, the centroid-based classifier assigns the curve g, observed through , to Π0 if the statistic
| (2.13) |
is negative, and to Π1 if .
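The centroid rule can be sketched numerically. This sketch assumes the statistic is the unweighted difference of squared L2 distances, ∫(ĝ − μ̂0)² − ∫(ĝ − μ̂1)², which is consistent with assigning to Π0 when the statistic is negative; the exact form of (2.13), including any weight function, may differ. Integrals are approximated by the trapezoid rule on a grid, ties are sent to Π1 by assumption, and all function names are hypothetical. The second function uses the algebraically identical form ∫(μ̂1 − μ̂0)(2ĝ − μ̂0 − μ̂1), which contains no quadratic term in the smoothed new curve.

```python
def trapezoid(values, grid):
    """Trapezoid-rule approximation of the integral of values over grid."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
        for i in range(len(grid) - 1)
    )


def centroid_statistic(g_hat, mu0_hat, mu1_hat, grid):
    """Squared distance to the Pi_0 mean minus squared distance to the Pi_1 mean."""
    d0 = trapezoid([(g - a) ** 2 for g, a in zip(g_hat, mu0_hat)], grid)
    d1 = trapezoid([(g - b) ** 2 for g, b in zip(g_hat, mu1_hat)], grid)
    return d0 - d1


def centroid_statistic_linear(g_hat, mu0_hat, mu1_hat, grid):
    """Same statistic written so that g_hat enters only linearly."""
    integrand = [
        (b - a) * (2.0 * g - a - b) for g, a, b in zip(g_hat, mu0_hat, mu1_hat)
    ]
    return trapezoid(integrand, grid)


def classify(g_hat, mu0_hat, mu1_hat, grid):
    """Return 0 for Pi_0 when the statistic is negative, otherwise 1."""
    return 0 if centroid_statistic(g_hat, mu0_hat, mu1_hat, grid) < 0 else 1
```

Because the trapezoid rule is linear in its integrand, the two forms agree on any grid up to rounding error.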
A scaled version of the centroid classifier, which accommodates differences in scales between the two populations, can be defined by replacing S in (2.13) by
| (2.14) |
where is an estimator of the scale of population Πk. For example, we might take to equal , the version we used in our numerical work, or where ψ is open to choice; or and could be selected empirically by minimising a cross-validation estimator of classification error. The definition at (2.14) should be compared with those at (2.15) and (2.16), below. The form of (2.14), and also of (2.15) and (2.16), is motivated by likelihood-ratio statistics for Gaussian data.
A version for functional data of Fisher's quadratic discriminant is based on
| (2.15) |
where ĝ and μ̂k are as at (2.3) and (2.8), (θ̂kℓ, ψ̂kℓ) are as at (2.12) and p is a positive truncation parameter. (Here we assume, as is often the case in practice, that the prior probabilities of each population are unknown and estimated by 1/2. A more general version of the classifier can be used if these probabilities are estimated by other values, but this does not alter our main conclusions.) We assign the new data set to Π0 if , and to Π1 otherwise. Of course, the statistic , at (2.15), is just an empirical version of the quantity
| (2.16) |
If the functions g are Gaussian, and the first p eigenvalues, in versions of (2.10) and (2.12) for either population, are distinct and nonzero, and the remaining eigenvalues vanish, then the classifier based on T0(g), at (2.16), is optimal in the sense of having least classification error among all classifiers, since it is, after all, just a likelihood ratio statistic. When the eigenvalues and eigenfunctions are estimated from data, as at (2.15), the classifier is asymptotically optimal. Bearing in mind the effectiveness of Fisher's discriminant analysis in the case of vector-valued data, even when the data are not normal, the classifier based on is an attractive choice even in non-Gaussian cases.
3. Theoretical properties
3.1. Standard centroid-based classifier
In this section, we derive properties of the centroid classifier based on the estimators and , and in particular we examine the impact of smoothing. First, we introduce notation. Let n = n0 + n1 (hence n is a positive integer sequence diverging to infinity), let m = m(n) be of the same size as mkj [see (3.7) below], and write for the variance of the experimental errors εkji and εi, in (2.1) and (2.2), when the data come from Πk. Let
| (3.1) |
and define
| (3.2) |
| (3.3) |
| (3.4) |
where κ2 = ∫ u^2 K(u) du. Finally, put κ = ∫ K^2, and let be the compact interval that equals the support of the density fX of the Xi's, and Xkji's of the functions g and gkj.
We make the following assumptions:
| (3.5) |
| (3.6) |
| (3.7) |
The assumption in (3.5)(c) that fX is bounded away from zero on its support is only a technical requirement, and is unnecessary in practice. To make this clear, in our numerical work we shall take fX to be a normal density, and show that the conclusions of Theorem 1 are nevertheless reflected clearly.
Let denote the probability that the standard centroid-based classifier, based on the statistic at (2.13), commits an error when the data set actually comes from Πk. Theorem 1 below describes the asymptotic behaviour of errk, and highlights the effect of the smoothing parameters h and h1, used to construct the estimators and of g and gkj, on the classifier. A proof is given in Appendix A.1.
Theorem 1. Assume that (3.5)–(3.7) hold, and let Ψ0 = 1 – Φ and Ψ1 = Φ, where Φ denotes the c.d.f. of a standard normal random variable. Then
| (3.8) |
where errk0 = Ek[Ψk {−βk0/σk}], , , , with , and where ϕ denotes standard normal density function.
The leading term errk0 on the right-hand side of (3.8) does not depend in any way on the bandwidths h and h1. It does involve the training sample sizes n0 and n1, and in particular does not equal the asymptotic limit of errk as n increases, since that limit is given by Ψk(−bk0/τk), but the effects of the bandwidths are all confined to subsequent terms on the right-hand side of (3.8). The terms in h^2 and h1^2 represent contributions to classification error arising from biases of the estimators ĝ and ĝkj, and the terms in (ν0h1)^{−1} and (ν1h1)^{−1} are contributions from the variances of the estimators ĝkj.
While a priori it might be thought that, since the total number of observations in the training sample, Σj mkj, for k = 0 and 1, is an order of magnitude larger than the number of observations, m, in the new data set , then h1 should be chosen smaller than h, Theorem 1 shows that the influence of bandwidths on error rate is much more complex than this.
For one thing, there are no terms in (mh)^{−1} on the right-hand side of (3.8). (Section 3.1.2 will explain the reason for this.) As a result, the terms on the right-hand side of (3.8) that depend on h can be rendered equal to O(m^{−1}) simply by taking h equal to a constant multiple of m^{−1/2}. As noted in Remark 1, below, this level of contribution to the error rate is generally impossible to remove, even in simple parametric problems. Therefore the contribution of h to error rate cannot be rendered smaller than m^{−1}. However, in some instances choosing h to be an order of magnitude larger or smaller than m^{−1/2} can be beneficial; see Section 3.1.1 below.
The terms in h1 on the right-hand side of (3.8) are a different matter because each of ck1 and can be either positive or negative. Depending on the signs and sizes of ck1 and , it can be optimal to take h1 to be of order , which achieves a trade-off between terms in h1^2 and (νkh1)^{−1}, or to take h1 to decrease to zero more quickly or to converge to a positive constant, as n increases; see Section 3.1.1 below.
Therefore, the impact that smoothing has on classification performance is much more subtle than it might have appeared. We discuss these issues in more detail in the next sections.
3.1.1. Sizes of h and h1 that optimise overall error rate
Using Theorem 1 we can deduce the orders of magnitudes of h and h1 that minimise the error rate of the classifier, that is, that minimise the probability of misclassification,
| (3.9) |
where err0 and err1 are as in (3.8), πk denotes the prior probability attached to population Πk, and π0 + π1 = 1. Using (3.8) and (3.9), we can write
| (3.10) |
where err0 = π0err00 + π1err10 (recall that errk0 does not depend on the bandwidths),
Since the function ϕ is symmetric, and b10 = −b00, then b10 can be replaced by b00 in the formula for d0 without altering its veracity.
To appreciate the very wide range of optimal bandwidth choices that can arise in the problem of minimising error rate, let us consider minimising err, at (3.10). To help remove ambiguities, let us assume that as n increases the value of is of the same sign for all sufficiently large n, and its absolute value is bounded away from zero; assumption (3.7)(c) ensures that it is uniformly bounded. In this instance, and focusing just on the terms in h1, we see that four distinct cases can arise in practice:
-
(i)
and d0 are both positive. In this case, to minimise the contribution from h1, we should minimise , which is achieved by taking h1 to be of size .
-
(ii)
and d0 are both negative. In this case, the contribution made by h1 behaves like as sample size increases. The term within braces here is maximised by taking h1 = 0, and analogously, in minimising err, it is optimal to take h1 to be of strictly smaller order than .
-
(iii)
and d0 ≤ 0. In this case, to minimise the error rate, we need to maximise the size of the negative term and minimise that of the positive term, which is achieved by taking h1 to be of strictly smaller order than (the precise order depends on the magnitude of second order terms, but deriving the latter precisely would require a lot of additional computation).
-
(iv)
and d0 ≥ 0. Here, using arguments similar to those in case (iii), taking h1 to be of strictly larger order than is optimal.
The case d0 = 0 occurs, for example, if the covariance Gk of the Gaussian process g, the experimental error variance , and the values of mkj and nk do not depend on k. Equal values of mkj commonly arise when the data are observed on a grid; see Remark 4.
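The trade-off in case (i) can be checked numerically. The sketch below assumes that the h1-dependent part of the error rate has the form c·h1² + d/(ν·h1) with c > 0 and d > 0; the names `c`, `d` and `nu` are hypothetical stand-ins for the constants and the effective sample size appearing in (3.10), not the paper's own notation.

```python
def h1_contribution(h1, c, d, nu):
    """Assumed bandwidth-dependent part of the error: bias term plus variance term."""
    return c * h1 ** 2 + d / (nu * h1)


def optimal_h1(c, d, nu):
    """Minimiser of h1_contribution when c > 0 and d > 0.

    Setting the derivative 2*c*h1 - d / (nu * h1**2) to zero gives
    h1**3 = d / (2 * c * nu), so the optimum is of order nu**(-1/3).
    """
    if c <= 0 or d <= 0:
        raise ValueError("an interior minimum exists only when c and d are positive")
    return (d / (2.0 * c * nu)) ** (1.0 / 3.0)
```

Multiplying `nu` by 8 halves the minimiser, which is the (sample size)^{−1/3} scaling. When `c` and `d` have other sign combinations there is no interior minimum of this form, in line with cases (ii)–(iv).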
A similar analysis can be carried out in the case of optimisation over h rather than h1, although there the optimum is assessed from a comparison of terms in h^2 and (mh)^{−2}, rather than h1^2 and (ν0h1)^{−1}. [A tedious analysis of the term of size (mh)^{−2}, represented by the remainder O{(mh)^{−2}} in (3.8), shows that it can be either positive or negative.] Depending on the relative signs of the terms in h^2 and (mh)^{−2}, it can be optimal to take h ≍ m^{−1/2}, or h of strictly larger, or strictly smaller order than m^{−1/2}.
Similar results are obtained if we investigate properties of errk, in (3.8), instead of the overall error rate, err, at (3.9).
These results explain the very diverse patterns of behaviour that are seen in numerical work, and that motivated our research; see Section 1. In summary, in apparently similar problems and using the same type of classifier, it can be optimal to use a very small bandwidth, or a very large bandwidth, or a bandwidth of only moderate size, depending on the signs of certain constants. Therein lies the contradictory nature of the smoothing parameter choice problem for classification of functional data.
3.1.2. Absence of terms in (mh)^{−1}
The centroid-based classifier statistic , at (2.13), can be written equivalently as
| (3.11) |
Importantly, there is no quadratic term in ĝ in (3.11), and as a result the impact of the bandwidth h, although not h1, on properties of the classifier is greatly reduced. This reduction is brought about by the smoothing effect of the integral in (3.11), which results in the elimination of terms in (mh)^{−1}.
This property, to which we shall refer as the “integration effect,” is known in other settings, for example, when integrating a kernel density estimator, computed from a sample of size m, to produce a distribution estimator. Integration results in the variance reducing from order (mh)^{−1}, for the density estimator, to order m^{−1}, for the distribution function estimator, just as it does in the setting above.
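The density-versus-distribution analogy can be illustrated by simulation. This is a small Monte Carlo sketch under assumptions not taken from the paper: standard normal data, a Gaussian kernel, evaluation at the point 0, and hypothetical function names. It compares the sampling variance of a kernel density estimator with that of its integral, the smoothed distribution function estimator, at two bandwidths.

```python
import math
import random
import statistics


def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def normal_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def density_estimate(sample, x, h):
    """Gaussian kernel density estimator at x."""
    return sum(normal_pdf((x - xi) / h) for xi in sample) / (len(sample) * h)


def distribution_estimate(sample, x, h):
    """Integral of the kernel density estimator from minus infinity to x."""
    return sum(normal_cdf((x - xi) / h) for xi in sample) / len(sample)


def monte_carlo_variances(m, h, reps, seed=0):
    """Sampling variances of both estimators at x = 0 over reps samples of size m."""
    rng = random.Random(seed)
    dens, dist = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(m)]
        dens.append(density_estimate(sample, 0.0, h))
        dist.append(distribution_estimate(sample, 0.0, h))
    return statistics.variance(dens), statistics.variance(dist)
```

Calling `monte_carlo_variances(100, 0.02, 400)` and `monte_carlo_variances(100, 0.2, 400)` should show the density estimator's variance growing sharply as h shrinks, while the variance of the integrated estimator changes comparatively little.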
Remark 1 (Order m^{−1} term in expansion of classification error). We assumed in (3.7)(c) that the values of mkj, representing the number of pairs (Xkji, Ykji) for a given population index k and given individual j, are all of roughly the same size. In this setting it is easy to see that, even in an elementary parametric setting, we must expect the operation of observing the functions gkj at scattered points to affect error rate through a term of order m^{−1}, and no smaller. For example, consider the case where gkj = ψ(· | ωkj), with ψ(· | ω) being a known function completely determined by the parameter ω, and where the weight function w is known. Using the data on gkj we can estimate ωkj root-m consistently, but no faster, and as a result we incur a classification error of size m^{−1}, and no smaller, from not knowing the values ωkj. It is for this reason that, when developing expansions of classification error, we do not explore the remainder of size m^{−1}; it is stated simply as O(m^{−1}) on the right-hand side of (3.8).
3.1.3. Other remarks
We conclude our discussion of Theorem 1 with a number of remarks.
Remark 2 (Definition of ). The size of the fourth and fifth terms on the right-hand side of (3.8) is determined by the sizes of and , and those quantities can be made slightly smaller by using a slightly different definition of , at (2.8). In particular in (2.8), on account of the definition of at (2.3), is defined as an average of ratios of sums, whereas slightly better statistical performance is obtained by taking to be simply a ratio of sums:
compare (2.3). However, this approach departs from standard practice in working with functional data, and therefore, since convergence rates do not alter (only the constant multiples of rates are reduced), we have followed standard practice in the definition of .
Remark 3 (Gaussian assumption). Of course, if m is sufficiently large then is itself approximately Gaussian, and so the assumption that g is a Gaussian process is reflected particularly well in properties of its estimator. More generally, our assumption that g is a Gaussian process is made for simplicity, and can be relaxed. For example, generalisations to chi-squared and other processes, where shape can be described in terms of a small number of fixed functions (mean and covariance in the Gaussian case), are straightforward.
More generally we would require a model which described the properties of random functions relatively simply. The Gaussian model fills this need ideally; shape is described by mean and covariance functions, on which we have imposed only smoothness, rather than parametric, conditions. Moreover, in the Gaussian case all moments of g(x) are finite, for each x (we use this property repeatedly during our theoretical arguments), and the principal component scores are independent (this is used frequently during our proof of Theorem 2).
Remark 4 (Case of regularly spaced design). Theorem 1 continues to hold if the mkj design variables Xkji are regularly spaced on for each k and j. The only change necessary is to replace , on the right-hand side of (3.8), by the square of the length of the interval .
3.2. Scale-adjusted centroid-based classifier
Recall that the scale-adjusted centroid-based classifier is defined in terms of , at (2.14). A decomposition similar to that of Theorem 1 can be derived for this classifier, as we shall prove in Theorem 2 below. For this classifier, it seems necessary to strengthen (3.7) by imposing conditions on the behaviour of the eigenvalues θkℓ as ℓ increases. However, since our aim in this section is only to corroborate the conclusions in Section 3.1, drawn there in the case of the standard centroid-based classifier, we shall simplify our account by assuming that g is finite dimensional, and in particular taking the covariance expansion at (2.10) to have just q terms:
| (3.12) |
Without (3.12)(a), separate conditions, valid uniformly in j = 1, 2, …, have to be imposed on remainders in Taylor expansions of “smoothed” versions of the eigenvalues θkj, depending on h.
The next theorem indicates that the results of Theorem 1 also apply for the scale-adjusted centroid-based classifier. Its proof is given in the supplementary material [Carroll, Delaigle and Hall (2013)].
Theorem 2. Assume that (3.5), (3.6) and (3.12) hold. Then the error rate of the scale-adjusted centroid-based classifier, when the data in are drawn from Πk, admits the expansion at (3.8), but with different constants, where the various terms have the properties stated immediately below that formula.
The diversity of possible signs of ck, ck1 and in (3.8), discussed in Section 3.1.1, is also present in this case. Therefore the conclusions drawn in that section apply to the scale-adjusted centroid-based classifier. However, we have not derived explicitly the counterparts of the constants ck, ck1, dk0 and dk1 that appear in equation (3.8).
The integration effect discussed in Section 3.1.2 is also present here, although we had originally expected that the scale-adjusted centroid classifier would produce a term of size (mh)^{−1} in an expansion of error rate. Indeed, the situation initially seems quite different in the case of the scale-adjusted version of , at (2.14), when . There the quadratic term in ĝ persists. The reason it still does not produce a term in (mh)^{−1} is quite subtle. Define ⋈k to be > or ≤ according as k = 0 or k = 1, respectively. The probability can be written as
where the Zj's are independent N(0, 1) variables, conditional on the Vj's and W; the positive weights wj are nonrandom; and critically, W does not involve the experimental errors εi in (2.2), from which any term in (mh)^{−1} would arise. The terms Vj depend on the experimental errors only through integrals of the error process, and the integration effect at this point largely removes the impact of the error bandwidth h, with the result that there is no term of size (mh)^{−1}. However, terms in (ν0h1)^{−1} remain; the integration effect only influences smoothing of the new data, not of the training data.
3.3. Quadratic discriminant
Finally, we show that similar smoothing effects are present in the case of the quadratic discriminant classifier defined through the statistic at (2.15). Recall that, when the data in come from Πk, the random function g has covariance function Gk. To derive the counterpart of Theorem 1 for this classifier, let r, r1, r2 take the values 0 and 1, let 1 ≤ ℓ, ℓ1, ℓ2 ≤ p, and define the covariances
the variances vark[r, ℓ] = covk[r, r; ℓ, ℓ], and the correlations
Let p ≥ 1, a fixed number, be the number of principal components used to construct the quadratic discriminant statistic , defined at (2.15). Theorem 3 below addresses the error rate of the quadratic discriminant based on , and there we shall assume that:
| (3.13) |
Condition (3.13)(a) ensures that the eigenfunctions ψkℓ are well defined for k = 0, 1 and ℓ = 1, …, p; and (3.13)(b) guarantees that the quantities and , which appear in the definition of T0(g) at (2.16), cannot be identical, except for a difference in means, unless r1 = r2 and ℓ1 = ℓ2, thereby avoiding degeneracy.
The counterpart of Theorem 1 for the quadratic discriminant classifier is stated in the next theorem. Its proof is given in the supplementary material [Carroll, Delaigle and Hall (2013)].
Theorem 3. Assume that (3.5)–(3.7) and (3.13) hold. Then the error rate of the quadratic discriminant, when the data in come from Πk, admits the expansion at (3.8), but with different constants, where the various terms have the properties stated immediately below that formula.
Again the signs of ck, ck1 and , in (3.8), are particularly diverse, and so the conclusions reached in Section 3.1.1 apply. Likewise, the integration effect discussed in Section 3.1.2 is also observed. Here, as can be seen directly from (2.15), the estimator ĝ is integrated, and only the integral is squared, not ĝ itself. The resulting integration effect eliminates any term in (mh)^{−1} from the analogue of the expansion (3.8) in this setting, although again this influence does not carry over to the training data.
4. Numerical illustrations
4.1. Simulated data
To illustrate the impact of bandwidth on classification performance, we generated data from several instances of model (2.1), taking, in each case, mkj = 50. Let ϕσ (x) denote the normal density function with mean zero and standard deviation σ. We considered the following cases, each with three different levels of errors, which we refer to as noise versions 1, 2 and 3:
(A): gkj(t) = μk(t) + (3t + 100)^{1/2}{cos(t/50)}^k Zkj, where μ0(t) = ϕ10(t − 5), μ1(t) = μ0(t) + 0.3 cos(t/5) + 0.1, Zkj ~ U[−1/(30 − 10k), 1/(30 − 10k)], and εkji ~ N(0, 1/(4 − 2k)^2) (noise version 1), εkji ~ N(0, 2/(4 − 2k)^2) (noise version 2) or εkji ~ N(0, 4/(4 − 2k)^2) (noise version 3), and π0 = 1/3, π1 = 2/3. Moreover, Xkji = 2i − 1, for i = 1, …, 50.
(B): gkj(t) = μk(t) + (3t + 100)^{1/2}Zkj, where μ0(t) = 30{0.2ϕ4(t − 5) + 0.1ϕ4(t − 10) + 0.4ϕ6(t − 20) + 0.4ϕ6(t − 35) + 0.6ϕ7(t − 55) + 0.6ϕ7(t − 80)}, μ1(t) = μ0(t) + 4/{(t − 50)^2 + 10}, Zkj ~ U[−1/(60 + 15k), 1/(60 + 15k)], εkji ~ {Exp(0.5) − 2}/(2 + 2k) (noise version 1), (noise version 2) or εkji ~ {Exp(0.5) − 2}/(1 + k) (noise version 3), and π0 = 2/5 and π1 = 3/5. Moreover, Xkji was as in (A).
(C): g0j(t) = μ0(t) + (3t + 100)^{1/2}Z0j, g1j(t) = μ1(t) + (t + 5)Z1j, where μ0(t) = 15ϕ17(t − 65) cos(t/7), μ1(t) = μ0(t) + 5ϕ20(t − 50), Zkj ~ U[−1/(50 − 10k), 1/(50 − 10k)], εkji ~ N(0, (4 − k)^2/100) (noise version 1), εkji ~ N(0, (4 − k)^2/50) (noise version 2), εkji ~ N(0, (4 − k)^2/25) (noise version 3), and π0 = 2/3 and π1 = 1/3. Moreover, Xkji was as in (A).
(D)–(F): Same as (A) to (C) but with Xkji = 2i − 1 + Tkji, where Tkji ~ N(0, 0.25).
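As an illustration of how one such data set could be generated, the following sketch simulates a single noisy curve from case (A). It rests on several assumptions: the second argument of N(·, ·) is read as a variance, ϕ10 is read as the normal density with standard deviation 10, and the exponents are read as (3t + 100)^{1/2} and {cos(t/50)}^k. The function names are hypothetical.

```python
import math
import random


def normal_density(x, sigma):
    """Normal density with mean zero and standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mu_a(k, t):
    """Mean function of population k (0 or 1) in case (A)."""
    base = normal_density(t - 5.0, 10.0)
    return base if k == 0 else base + 0.3 * math.cos(t / 5.0) + 0.1


def simulate_curve_model_a(k, rng, noise_version=1):
    """Return design points and noisy responses for one individual from Pi_k."""
    xs = [2 * i - 1 for i in range(1, 51)]  # fixed design X = 2i - 1
    bound = 1.0 / (30 - 10 * k)
    z = rng.uniform(-bound, bound)  # one random score per curve
    # Noise variance 1, 2 or 4 over (4 - 2k)^2 (variance reading is assumed).
    noise_var = {1: 1.0, 2: 2.0, 3: 4.0}[noise_version] / (4 - 2 * k) ** 2
    noise_sd = math.sqrt(noise_var)
    ys = [
        mu_a(k, t)
        + math.sqrt(3 * t + 100) * math.cos(t / 50.0) ** k * z
        + rng.gauss(0.0, noise_sd)
        for t in xs
    ]
    return xs, ys
```

For example, `simulate_curve_model_a(1, random.Random(2024), noise_version=3)` draws one curve from Π1 at the highest noise level; cases (D)–(F) would additionally perturb each design point.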
We chose these examples to illustrate various features of the problems, namely that the impact of smoothing may differ among classifiers, and that in some cases, some classifiers perform better with more smoothing and in other cases, they might perform better with less smoothing.
In each case, for k = 0, 1 and for several values of ntr, we generated 100 noisy test curves and ntr noisy training curves from model (2.1), each of which came with probability πk from Πk. We constructed each classifier from the training data, and applied it to the test data. To compute the estimators of g and gkj, we compared three approaches for selecting the bandwidths: (i) no smoothing (NS); (ii) the standard plug-in (PI) bandwidths hPI and hPI,kj, which estimate the optimal bandwidths for estimating the regression functions g and gkj, and which we computed using the dpill function in the R package KernSmooth [see Ruppert, Sheather and Wand (1995)]; and (iii) the bandwidths γhPI and γ1hPI,kj, where γ and γ1 (and also the truncation parameter p in the case of the quadratic discriminant classifier) were chosen to minimise the following cross-validation (CV) estimator of classification error:
with and denoting estimators of π0 and π1 (we took ), and being the estimator of the class label of the ith training observation from group k, obtained from the classifier constructed without using this observation.
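The leave-one-out idea described above can be sketched in Python as follows. This is not the authors' estimator: the displayed formula is not reproduced here, so the sketch assumes that per-class leave-one-out misclassification rates are weighted by estimated priors, and it takes class proportions as those priors by default (an assumption). A simple nearest-mean rule in squared L2 distance stands in for the classifiers (2.13)–(2.15), a Gaussian-kernel Nadaraya–Watson smoother stands in for the paper's smoother, and a single bandwidth `h` replaces the pair γhPI, γ1hPI,kj. The names `nw_smooth` and `loo_cv_error` are hypothetical.

```python
import numpy as np

def nw_smooth(x, y, h):
    """Nadaraya-Watson smooth of each row of y, observed at common points x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)  # w[a, b]: weight of obs b at point a
    w /= w.sum(axis=1, keepdims=True)
    return y @ w.T

def loo_cv_error(curves, labels, priors=None):
    """Prior-weighted leave-one-out misclassification rate (two classes, 0 and 1).

    Each curve is classified by a nearest-mean rule built without that curve.
    Each class needs at least two curves.
    """
    labels = np.asarray(labels)
    n = len(labels)
    if priors is None:
        # Assumption: priors estimated by class proportions.
        priors = np.array([(labels == k).mean() for k in (0, 1)])
    err = 0.0
    for k in (0, 1):
        idx = np.flatnonzero(labels == k)
        wrong = 0
        for i in idx:
            keep = np.arange(n) != i                      # leave curve i out
            means = [curves[keep & (labels == c)].mean(axis=0) for c in (0, 1)]
            dists = [np.sum((curves[i] - m) ** 2) for m in means]
            wrong += int(np.argmin(dists) != k)
        err += priors[k] * wrong / len(idx)
    return err
```

A bandwidth could then be chosen by evaluating `loo_cv_error(nw_smooth(x, y, h), labels)` over a grid of candidate values of `h` and keeping the minimiser.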
For each configuration, we generated B = 200 sets of training and test samples. In Tables 1 and 2, we report the percentage of correctly classified test curves, averaged over the B replicates. Depending on the model, the classifier, and the type of data (test or training), the cross-validation bandwidths were either smaller or larger than the PI regression bandwidths, illustrating the variety of settings already explained by our theory. See Table B.1 in Section B.3 in the supplementary material [Carroll, Delaigle and Hall (2013)], where we report the values of γ and γ1 averaged over the B replicates. We can see from that table that in most cases γ was smaller than γ1, and both were usually smaller than 1, except in cases (C) and (F).
Table 1.
Percentage of correctly classified observations for the simulated data of Section 4.1, using plug-in (PI) regression bandwidths, bandwidths that minimise a cross-validation (CV) estimate of classification error, or without smoothing the noisy data (NS). The three noise versions, in increasing order, are described in cases (A)–(C) in Section 4.1. Here “Cent” is the centroid classifier (2.13), “Cent sc.” is the scaled centroid classifier (2.14) and “QDA” is the quadratic discriminant classifier (2.15).
| Case / noise version | ntr | Cent CV | Cent PI | Cent NS | Cent sc. CV | Cent sc. PI | Cent sc. NS | QDA CV | QDA PI | QDA NS |
|---|---|---|---|---|---|---|---|---|---|---|
| Case (A) | ||||||||||
| Noise version 1 | 50 | 82.9 | 74.1 | 84.0 | 91.8 | 73.2 | 92.0 | 95.1 | 94.1 | 53.5 |
| Noise version 1 | 100 | 84.4 | 74.9 | 84.8 | 92.6 | 74.1 | 92.6 | 97.6 | 94.8 | 67.6 |
| Noise version 2 | 50 | 77.7 | 69.6 | 78.1 | 94.3 | 70.4 | 94.4 | 91.0 | 89.3 | 49.1 |
| Noise version 2 | 100 | 79.9 | 70.7 | 80.2 | 95.1 | 71.2 | 95.1 | 94.3 | 89.7 | 61.7 |
| Noise version 3 | 50 | 71.1 | 65.6 | 71.1 | 97.1 | 69.0 | 97.1 | 85.4 | 84.3 | 46.2 |
| Noise version 3 | 100 | 73.7 | 66.8 | 74.1 | 97.9 | 69.4 | 97.9 | 89.9 | 84.1 | 58.4 |
| Case (B) | ||||||||||
| Noise version 1 | 50 | 63.2 | 60.1 | 65.7 | 96.3 | 78.7 | 96.5 | 77.1 | 74.3 | 65.8 |
| Noise version 1 | 100 | 65.5 | 61.5 | 66.8 | 96.8 | 80.0 | 96.8 | 81.8 | 76.3 | 73.0 |
| Noise version 2 | 50 | 61.5 | 58.6 | 64.6 | 96.3 | 80.6 | 96.4 | 76.9 | 74.1 | 65.2 |
| Noise version 2 | 100 | 62.6 | 58.7 | 64.4 | 96.7 | 81.3 | 96.7 | 81.3 | 75.0 | 72.4 |
| Noise version 3 | 50 | 60.9 | 57.6 | 64.0 | 96.2 | 81.6 | 96.4 | 77.3 | 74.2 | 65.4 |
| Noise version 3 | 100 | 60.7 | 56.8 | 63.3 | 96.7 | 82.3 | 96.7 | 81.6 | 75.2 | 72.3 |
| Case (C) | ||||||||||
| Noise version 1 | 50 | 61.5 | 60.8 | 60.8 | 88.7 | 89.2 | 87.4 | 84.8 | 83.7 | 82.0 |
| Noise version 1 | 100 | 59.4 | 58.4 | 58.2 | 90.0 | 90.3 | 88.5 | 86.9 | 85.7 | 79.0 |
| Noise version 2 | 50 | 61.3 | 60.2 | 60.3 | 87.3 | 87.9 | 82.8 | 81.9 | 81.2 | 82.4 |
| Noise version 2 | 100 | 58.9 | 57.9 | 57.6 | 88.8 | 89.0 | 85.2 | 84.6 | 83.1 | 80.4 |
| Noise version 3 | 50 | 61.0 | 59.7 | 59.3 | 85.2 | 85.4 | 71.2 | 80.5 | 79.9 | 79.6 |
| Noise version 3 | 100 | 58.5 | 57.4 | 57.0 | 87.2 | 86.6 | 74.9 | 82.6 | 81.1 | 79.7 |
Table 2.
Percentage of correctly classified observations for the simulated data of Section 4.1, using plug-in (PI) regression bandwidths, bandwidths that minimise a cross-validation (CV) estimate of classification error, or without smoothing the noisy data (NS). The three noise versions, in increasing order, are described in cases (D)–(F) in Section 4.1. Here “Cent” is the centroid classifier (2.13), “Cent sc.” is the scaled centroid classifier (2.14) and “QDA” is the quadratic discriminant classifier (2.15).
| Case / noise version | ntr | Cent CV | Cent PI | Cent NS | Cent sc. CV | Cent sc. PI | Cent sc. NS | QDA CV | QDA PI | QDA NS |
|---|---|---|---|---|---|---|---|---|---|---|
| Case (D) | ||||||||||
| Noise version 1 | 50 | 80.2 | 69.5 | 80.6 | 85.7 | 68.5 | 86.3 | 93.9 | 92.7 | 69.2 |
| Noise version 1 | 100 | 81.5 | 70.0 | 82.0 | 87.3 | 69.2 | 87.3 | 96.6 | 93.2 | 84.3 |
| Noise version 2 | 50 | 74.7 | 65.9 | 75.6 | 90.0 | 65.6 | 90.3 | 88.5 | 86.7 | 60.9 |
| Noise version 2 | 100 | 76.9 | 66.8 | 77.3 | 90.9 | 66.8 | 91.0 | 92.3 | 86.8 | 77.6 |
| Noise version 3 | 50 | 69.2 | 62.3 | 69.6 | 94.2 | 65.4 | 94.4 | 82.9 | 80.5 | 55.6 |
| Noise version 3 | 100 | 71.4 | 63.4 | 72.0 | 95.1 | 66.4 | 95.1 | 87.9 | 80.8 | 72.7 |
| Case (E) | ||||||||||
| Noise version 1 | 50 | 65.0 | 61.9 | 67.3 | 94.8 | 79.0 | 95.0 | 77.0 | 73.1 | 71.2 |
| Noise version 1 | 100 | 65.8 | 62.5 | 67.6 | 95.4 | 79.6 | 95.4 | 84.3 | 69.7 | 82.8 |
| Noise version 2 | 50 | 63.0 | 60.2 | 65.4 | 94.7 | 80.6 | 95.0 | 77.8 | 74.2 | 70.5 |
| Noise version 2 | 100 | 63.4 | 59.9 | 64.9 | 95.4 | 81.4 | 95.5 | 84.8 | 69.8 | 82.4 |
| Noise version 3 | 50 | 61.3 | 59.2 | 64.3 | 94.6 | 81.4 | 94.9 | 77.9 | 74.4 | 69.8 |
| Noise version 3 | 100 | 61.8 | 58.2 | 63.1 | 95.4 | 82.5 | 95.5 | 84.5 | 71.5 | 81.8 |
| Case (F) | ||||||||||
| Noise version 1 | 50 | 60.2 | 59.1 | 59.4 | 88.0 | 88.7 | 87.9 | 83.5 | 82.6 | 80.4 |
| Noise version 1 | 100 | 58.8 | 57.8 | 57.7 | 89.0 | 89.3 | 88.5 | 84.9 | 83.2 | 77.3 |
| Noise version 2 | 50 | 59.8 | 58.7 | 59.0 | 86.5 | 87.2 | 84.6 | 80.8 | 80.2 | 80.0 |
| Noise version 2 | 100 | 58.6 | 57.3 | 57.2 | 87.6 | 87.7 | 85.8 | 83.1 | 81.1 | 77.0 |
| Noise version 3 | 50 | 59.2 | 58.3 | 58.2 | 84.5 | 84.1 | 76.7 | 79.4 | 78.5 | 78.9 |
| Noise version 3 | 100 | 58.0 | 56.8 | 56.5 | 85.8 | 84.6 | 78.5 | 81.0 | 79.1 | 75.7 |
As expected, we conclude from Tables 1 and 2 that, depending on the model and the classifier, the negative impact of smoothing with the standard PI bandwidth can be quite significant, indeed sometimes reducing the percentage of correctly classified data by as much as 10%. In cases (A) and (D), it is the centroid classifier and its scaled version that are the most affected by this inappropriate level of smoothing, whereas the quadratic discriminant classifier is more robust against the level of smoothing. In cases (B) and (E), the scaled centroid classifier and the quadratic discriminant classifier are the most affected by inappropriate smoothing. Cases (C) and (F) are more robust against smoothing; there, all three versions (PI, CV and NS) of the data result in similar classification performance, although overall the data smoothed by CV result in slightly improved performance. Depending on the case, the impact of inappropriate bandwidth choice can either increase or decrease as the noise level increases.
4.2. Real data
We illustrate our findings on the ovarian cancer data set 8–7–02, which concerns 253 patients (91 controls and 162 with ovarian cancer). The data, which were produced to study the effect of robotic sample handling, are available from http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp. In this example, the functions Xi represent proteomic mass spectra and t ∈ [0, 20,000] is the mass over charge ratio, m/z. These raw curves are ideal for illustrating the negative impact that systematically smoothing by standard methods can have, because in some ranges of values of t, the spectra have considerable activity, and the impact of smoothing such data can be striking. We focus on one such range, namely t ∈ [200, 500].
To assess the performance of classifiers on this data set, we randomly and uniformly created B = 200 pairs of (training sample, test sample), where we took the training sample to be of size ntr and the test sample of size 253 − ntr, for ntr = 50 and ntr = 100. We also generated two more noise versions of the data, adding to the Ykji's in both the test and training data, noise (noise version 1) or (noise version 2), where the 's were totally independent.
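The random splitting step can be sketched in Python as below. The sketch assumes that "randomly and uniformly" means sampling ntr indices without replacement from the 253 patients, with no stratification by class; the function name `random_splits` and the seed are hypothetical.

```python
import numpy as np

def random_splits(n_total, n_train, n_pairs, seed=0):
    """Yield (train_idx, test_idx) pairs drawn uniformly without replacement."""
    rng = np.random.default_rng(seed)
    for _ in range(n_pairs):
        perm = rng.permutation(n_total)
        yield perm[:n_train], perm[n_train:]

# B = 200 pairs with n_tr = 50, so each test sample has 253 - 50 = 203 curves.
splits = list(random_splits(n_total=253, n_train=50, n_pairs=200))
```

Each index array can then be used to select the rows of the data matrix that form the training and test samples for one replicate.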
For each version of the data (original data and noise versions 1 and 2), and for each pair of test and training sample, we constructed each classifier from the training sample, and applied the classifier to the test sample using either plug-in regression bandwidths to construct the estimators and , or bandwidths obtained by minimising the CV estimator of classification error defined in Section 4.1, where we took .
Table 3 reports the percentage, averaged over the B pairs of samples, of correctly classified observations from the test samples. The table indicates very clearly that smoothing the data using the plug-in regression bandwidths degraded the quality of the two versions of the centroid classifier by about 10%, and a similar phenomenon was observed for the quadratic discriminant classifier when the training sample was small and when the data were noisy.
Table 3.
Percentage of correctly classified observations for the ovarian cancer data, using plug-in (PI) regression bandwidths or bandwidths that minimise a cross-validation (CV) estimate of classification error. Here “Cent” is the centroid classifier (2.13), “Cent sc.” is the scaled centroid classifier (2.14) and “QDA” is the quadratic discriminant classifier (2.15).
| Data | ntr | Cent CV | Cent PI | Cent sc. CV | Cent sc. PI | QDA CV | QDA PI |
|---|---|---|---|---|---|---|---|
| Original data | 50 | 90.60 | 80.25 | 90.05 | 78.79 | 93.32 | 89.69 |
| Original data | 100 | 90.43 | 80.96 | 90.00 | 79.96 | 98.58 | 98.86 |
| Noisy version 1 | 50 | 88.07 | 75.19 | 87.83 | 74.23 | 78.03 | 68.50 |
| Noisy version 1 | 100 | 87.58 | 76.76 | 88.54 | 76.27 | 91.48 | 90.97 |
| Noisy version 2 | 50 | 76.15 | 66.57 | 76.65 | 66.09 | 56.91 | 48.54 |
| Noisy version 2 | 100 | 81.97 | 67.55 | 81.91 | 67.64 | 77.62 | 66.49 |
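For readers who want to check the size of the effect, the short Python sketch below recomputes the CV-minus-PI gap, in percentage points, for the two "Original data" rows of Table 3. The numbers are copied from the table; the variable names `table3` and `gaps` are hypothetical.

```python
# (CV, PI) percentages of correct classification, copied from Table 3.
table3 = {
    ("Original data", 50): {
        "Cent": (90.60, 80.25), "Cent sc.": (90.05, 78.79), "QDA": (93.32, 89.69),
    },
    ("Original data", 100): {
        "Cent": (90.43, 80.96), "Cent sc.": (90.00, 79.96), "QDA": (98.58, 98.86),
    },
}

# Gap in percentage points: positive values mean CV bandwidths did better than PI.
gaps = {
    key: {clf: round(cv - pi, 2) for clf, (cv, pi) in row.items()}
    for key, row in table3.items()
}

for key, row in gaps.items():
    print(key, row)
```

For the two centroid classifiers the gaps lie between roughly 9.5 and 11.3 percentage points, in line with the "about 10%" figure quoted in the text, whereas for QDA with ntr = 100 the gap is slightly negative.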
Acknowledgments
Delaigle and Hall's research was supported by grants and fellowships from the Australian Research Council. Carroll's research was supported by a grant from the National Cancer Institute (R37-CA057030).
APPENDIX: PROOF OF THEOREM 1
A.1. Preliminary results
Define
With Uℓ and Vℓ given by (2.4) and (2.5), and using the model at (2.2) and the exact form of the remainder in Taylor's theorem, we can write:
Assuming, without loss of generality, that K is supported on [−1, 1],
where . Now,
where . Therefore, since |Uℓ| ≤ U0 for each ℓ ≥ 0,
| (A.1) |
uniformly on
Similarly, defining , and
where Ukjℓ is as at (2.6), we have, uniformly on
| (A.2) |
Define
| (A.3) |
and recall that κ2 = ∫ u^2 K(u) du. We shall derive the following result in Section A.6:
Lemma 1. Under the conditions of Theorem 1, for some C1 > 0, all C2 > 0 and k = 0, 1,
as n → ∞, and for some C3 > 0, all C2 > 0 and k = 0, 1,
Furthermore, defining Msum = min_{k=1,2} (Σj mkj), we have for all C2, C4 > 0,
| (A.4) |
A.2. Initial calculation of errk
Let denote the sigma-field generated by the random variables introduced in Section 2, and the random functions gkj, but excluding g. Specifically, is the sigma-field generated by gkj, Xkji and εkji for 1 ≤ i ≤ mkj, 1 ≤ j ≤ nk and k = 0, 1, and by Xi and εi for 1 ≤ i ≤ m. Recall that ⋈k is > or ≤ according as k = 0 or k = 1, respectively, and recall formula (3.11) for the statistic .
Under the assumption that the new data set comes from Πk, and conditional on , is a Gaussian process with mean and covariance function , say. In this notation,
| (A.5) |
where, by (3.11),
| (A.6) |
| (A.7) |
The probability on the left-hand side of (A.5) equals the chance that, when comes from Πk, the classifier based on makes an error and assigns to the other population.
A.3. Approximations to , and
In view of (A.1),
| (A.8) |
Noting that, for random variables A1, A2, B1 and B2, |cov(A1 + B1, A2 + B2) − cov(A1, A2)| ≤ |cov(B1, B2)| + |cov(A1, B2)| + |cov(B1, A2)|, where the covariances are interpreted conditionally on , we deduce from (A.1) that for a constant C4 > 0,
| (A.9) |
where we define . (Recall that Gk denotes the covariance of the Gaussian process g when the data are drawn from Πk.)
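The covariance bound used in deriving (A.9) is a direct consequence of bilinearity. Written out for generic random variables, with B1 and B2 regarded as perturbations added to A1 and A2 respectively (the labelling is an assumption about how the bound is applied), the step is:

```latex
\begin{align*}
\operatorname{cov}(A_1 + B_1,\, A_2 + B_2) - \operatorname{cov}(A_1, A_2)
  &= \operatorname{cov}(A_1, B_2) + \operatorname{cov}(B_1, A_2) + \operatorname{cov}(B_1, B_2), \\
\bigl|\operatorname{cov}(A_1 + B_1,\, A_2 + B_2) - \operatorname{cov}(A_1, A_2)\bigr|
  &\le |\operatorname{cov}(B_1, B_2)| + |\operatorname{cov}(A_1, B_2)| + |\operatorname{cov}(B_1, A_2)|.
\end{align*}
```

The second line follows from the first by the triangle inequality, and the same argument holds verbatim for conditional covariances.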
With defined as at (3.1), and defining as at (A.3), we have, in view of (A.2), Lemma 1 and (3.6)(b), the result
| (A.10) |
for some C1 > 0 and all C2 > 0. Using Rosenthal's inequality, it can be proved from (3.6) and (3.7)(c) that, for some C1 > 0 and all C2 > 0,
| (A.11) |
Together, (A.10) and (A.11) imply that
| (A.12) |
Define ,
| (A.13) |
Combining Lemma 1, (A.5)–(A.9) and (A.12), we deduce that, for some C1 > 0 and all C2 > 0,
| (A.14) |
Observe from (A.13) that , where βk0 is as at (3.2),
| (A.15) |
and . Using (A.4) it can be shown that, for some C1 > 0 and all C2 > 0, and when ℓ = 2,
| (A.16) |
Hence, noting the first result in (A.14), we have:
| (A.17) |
Recall the definitions of and at (3.3) and (3.4), and put
| (A.18) |
| (A.19) |
and . Thus, is the term in and that arises when is expanded. Using (A.4) it can be proved that (A.16) holds when ℓ = 3. Moreover, can be written as
| (A.20) |
where, in view of the second part of (A.14), (A.16) holds in the case ℓ = 4 and for some C1 > 0 and all C2 > 0.
Define τkℓ to be equal to σkℓ, at (A.18) and (A.19), when and on the respective right-hand sides are replaced by μ0 and μ1. Then for k = 0, 1 and ℓ = 0, 1, noting property (3.7)(c) on the rates of increase of n0 and n1, it can be shown that for some C1 > 0,
| (A.21) |
for all C2 > 0, where we define h0 = h. Therefore, if C1 > 0 is sufficiently small,
| (A.22) |
for all C2 > 0.
A.4. Approximation to
In the notation at (A.20),
where, for 0 ≤ r ≤ ∞,
We claim that the infinite series defined by sk(∞) converges with probability 1 − O(n^{−C2}) for all C2 > 0. To appreciate why, note that, by (3.6) and (3.7)(c), there exists C1 > 0 such that
for all C2 > 0. Combining this property, (A.16) for ℓ = 3 and 4, and (A.22), we deduce that, for some C1 > 0 and all C2 > 0,
Therefore, if C3 > 0 is given then r0 = r0(C3) ≥ 1 can be chosen so large that, whenever r0 ≤ r ≤ ∞, for all C2 > 0. Using this property and (A.16), again for ℓ = 3 and 4, and also employing (A.21), we see that for some C1 > 0 and all C2 > 0, if r0 is chosen sufficiently large,
| (A.23) |
for r ≥ r0, where
| (A.24) |
A.5. Approximation to
Let C1 > 0 and let ℓ0 ≥ 0 be an integer. With Ukjℓ defined as at (2.6), let denote the event
where κℓ = ∫ u^ℓ K(u) du and hence vanishes for odd ℓ, since by (3.7)(b), K is symmetric. It will be proved in Section A.6 that, for some C1 > 0 and each ℓ0 ≥ 0,
| (A.25) |
If holds for an ℓ0 ≥ 2 then, if , there exists a nonrandom integer n0 ≥ 1 such that the event , defined by
| (A.26) |
holds for all n ≥ n0.
Let denote the indicator of . In view of (A.25),
| (A.27) |
for all C2 > 0, and so to approximate the term on the left-hand side of (A.27) we may develop an approximation to the first term on the right-hand side.
Let denote the sigma-field generated by the random variables Xi for 1 ≤ i ≤ m, and by Xkji and the functions gkji for 1 ≤ i ≤ mkj, 1 ≤ j ≤ nk and k = 0, 1 (i.e., generated by everything except g and the experimental errors εi and εkji). The quantities I, tk(r) at (A.24), βk0 at (3.2), and bk1 at (A.15) are all -measurable. Therefore, using (A.17) and (A.23), and noting that ψk is an analytic function with all derivatives uniformly bounded, we obtain
| (A.28) |
| (A.29) |
Here we have used the properties , for some C > 0,
for ℓ2 ≥ 3 if ℓ1 = 1, and for ℓ2 ≥ 2 if ℓ1 = 2, and
Further, we have used the fact that the event , defined at (A.26), obtains whenever I ≠ 0.
In addition,
| (A.30) |
that
| (A.31) |
and that
| (A.32) |
where bk0 and bk1 are as at (3.2) and (A.15), ϕ is the standard normal density, and we have used the fact that . Combining (A.25) and (A.27)–(A.32), and taking r sufficiently large (but fixed), we deduce that
| (A.33) |
A.6. Proof of Lemma 1 and (A.25)
The results in Lemma 1, with the exception of (A.4), and also result (A.25), will follow if we show that for each ℓ ≥ 1, some C1 > 0 and all C2 > 0,
| (A.34) |
| (A.35) |
We shall derive (A.35); a proof of (A.34) is similar.
Markov's inequality can be used to prove that
| (A.36) |
It follows from (3.7)(c) that each nk is increasing no faster than polynomially in n, and therefore, if we confine attention to x in a subset , say, of that contains only O(n^C) points for some C > 0, we can place the maximum and supremum inside the probability statement at (A.36), provided that is replaced by : for some C1 > 0 and all C2 > 0,
| (A.37) |
The assumption, in (3.7)(b), that K is compactly supported and Hölder continuous, and the implication, in (3.5)(c), that fX is also Hölder continuous, enable (A.35) to be derived directly from (A.37) by taking to be a sufficiently fine grid in .
A proof of (A.4) in Lemma 1 is similar. To illustrate the argument, we derive the following part of (A.4): for all C2, C4 > 0,
| (A.38) |
Using Markov's and Rosenthal's inequalities, we first obtain the result when the supremum is outside the probability statement:
Taking to contain only O(n^C) points, for any fixed C > 0, we deduce that
and taking to be a sufficiently fine grid in we obtain (A.38).
Footnotes
SUPPLEMENTARY MATERIAL
Supplement to “Unexpected properties of bandwidth choice when smoothing discrete data for constructing a functional data classifier” (DOI: 10.1214/13-AOS1158SUPP; .pdf). The supplementary file contains the proof of Theorems 2 and 3, as well as additional simulation results.
REFERENCES
- Araki Y, Konishi S, Kawano S, Matsui H. Functional logistic discrimination via regularized basis expansions. Comm. Statist. Theory Methods. 2009;38:2944–2957. MR2568196.
- Benhennia K, Degras D. Local polynomial regression based on functional data. 2011. Unpublished manuscript. Available at http://arxiv.org/pdf/1107.4058v1.
- Berlinet A, Biau G, Rouvière L. Functional supervised classification with wavelets. Ann. I.S.U.P. 2008;52:61–80. MR2435041.
- Biau G, Bunea F, Wegkamp MH. Functional classification in Hilbert spaces. IEEE Trans. Inform. Theory. 2005;51:2163–2172. MR2235289.
- Cardot H, Degras D, Josserand E. Confidence bands for Horvitz–Thompson estimators using sampled noisy functional data. Bernoulli. 2013;19:2067–2097.
- Cardot H, Josserand E. Horvitz–Thompson estimators for functional data: Asymptotic confidence bands and optimal allocation for stratified sampling. Biometrika. 2011;98:107–118. MR2804213.
- Carroll RJ, Delaigle A, Hall P. Supplement to “Unexpected properties of bandwidth choice when smoothing discrete data for constructing a functional data classifier.” 2013. DOI:10.1214/13-AOS1158SUPP.
- Cuevas A, Febrero M, Fraiman R. Robust estimation and classification for functional data via projection-based depth notions. Comput. Statist. 2007;22:481–496. MR2336349.
- Delaigle A, Hall P. Achieving near perfect classification for functional data. J. R. Stat. Soc. Ser. B Stat. Methodol. 2012;74:267–286. MR2899863.
- Delaigle A, Hall P, Bathia N. Componentwise classification and clustering of functional data. Biometrika. 2012;99:299–313. MR2931255.
- Epifanio I. Shape descriptors for classification of functional data. Technometrics. 2008;50:284–294. MR2528652.
- Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. Monographs on Statistics and Applied Probability. Chapman & Hall; London: 1996. p. 66. MR1383587.
- Fromont M, Tuleau C. Functional classification with margin conditions. In: Carbonell JG, Siekmann J, editors. Learning Theory: Proceedings of the 19th Annual Conference on Learning Theory, Pittsburgh; New York: Springer; 2006.
- Hall P, Hosseini-Nasab M. On properties of functional principal components analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006;68:109–126. MR2212577.
- Hall P, Hosseini-Nasab M. Theory for high-order bounds in functional principal components analysis. Math. Proc. Cambridge Philos. Soc. 2009;146:225–256. MR2461880.
- Hall P, Kang K-H. Bandwidth choice for nonparametric classification. Ann. Statist. 2005;33:284–306. MR2157804.
- Hall P, Van Keilegom I. Two-sample tests in functional data analysis starting from discrete data. Statist. Sinica. 2007;17:1511–1531. MR2413533.
- Leng X, Müller H-G. Classification using functional data analysis for temporal gene expression data. Bioinformatics. 2006;22:68–76. doi: 10.1093/bioinformatics/bti742.
- Li Y, Hsing T. Deciding the dimension of effective dimension reduction space for functional and high-dimensional data. Ann. Statist. 2010a;38:3028–3062. MR2722463.
- Li Y, Hsing T. Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. Ann. Statist. 2010b;38:3321–3351. MR2766854.
- López-Pintado S, Romo J. Depth-based classification for functional data. In: Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. 2006;72:103–119. Amer. Math. Soc., Providence, RI. MR2343116.
- Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge Univ. Press; Cambridge: 2008.
- Panaretos VM, Kraus D, Maddocks JH. Second-order comparison of Gaussian random functions and the geometry of DNA minicircles. J. Amer. Statist. Assoc. 2010;105:670–682. MR2724851.
- Ramsay JO, Silverman BW. Functional Data Analysis. 2nd ed. Springer; New York: 2005. MR2168993.
- Rossi F, Villa N. Support vector machine for functional data classification. Neurocomputing. 2006;69:730–742.
- Ruppert D, Sheather SJ, Wand MP. An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc. 1995;90:1257–1270. MR1379468.
- Vilar JA, Pértega S. Discriminant and cluster analysis for Gaussian stationary processes: Local linear fitting approach. J. Nonparametr. Stat. 2004;16:443–462. MR2073035.
- Wang X, Ray S, Mallick BK. Bayesian curve classification using wavelets. J. Amer. Statist. Assoc. 2007;102:962–973. MR2354408.
- Wu S, Müller H-G. Response-adaptive regression for longitudinal data. Biometrics. 2011;67:852–860. doi: 10.1111/j.1541-0420.2010.01518.x. MR2829259.
