Published in final edited form as: Comput Stat. 2014 Jun 4;29(6):1497–1513. doi: 10.1007/s00180-014-0503-4

Functional Data Classification: A Wavelet Approach

Chung Chang 1,*, R Todd Ogden 2, Yakuan Chen 2

Abstract

In recent years, several methods have been proposed to deal with functional data classification problems (e.g., one-dimensional curves or two- or three-dimensional images). One popular general approach is the kernel-based method proposed by Ferraty and Vieu (2003). The performance of this general method depends heavily on the choice of the semi-metric. Motivated by Fan and Lin (1998) and our image data, we propose a new semi-metric, based on wavelet thresholding, for classifying functional data. This wavelet-thresholding semi-metric is able to adapt to the smoothness of the data and provides for particularly good classification when data features are localized and/or sparse. We conduct simulation studies to compare our proposed method with several functional classification methods and study the relative performance of the methods for classifying positron emission tomography (PET) images.

Keywords: wavelet thresholding, semi-metric

1 Introduction

The classical problem in classification analysis (discriminant analysis) is to correctly classify a new observation based on a training data set consisting of multivariate data and class memberships for several observations. In recent years, data with complicated structure and high dimensionality have attracted much attention. In particular, in many situations the multivariate observations can be regarded as functional data, such as one-dimensional curves or two- or three-dimensional images (see Ramsay and Silverman, 2005; Ferraty and Vieu, 2006). The ultimate aim of functional data classification is to determine group membership for a newly observed function based on a training sample consisting of observed functions with their corresponding class memberships.

Curve classification was described by Hastie et al. (1995), who dealt with phonemes. Our motivating example arises from a study of depression using positron emission tomography (PET). Binding potential (BP; Gunn et al., 2001) of the serotonin 1A receptor was estimated throughout the brain for each of many subjects drawn from two groups: patients with major depressive disorder and normal controls. Figure 1 shows BP images for three normal controls (top row) and for three subjects with major depressive disorder (bottom row).

Figure 1. The 50th transaxial slices of the binding potential images in PET for normal control subjects (top) and patients with major depressive disorder (bottom).

Due to the high dimensionality of such functional data, traditional classification methods for multivariate data are not generally appropriate. Many methods for functional data classification have been developed in recent years; here we review the contributions most closely related to our proposed method. Hastie et al. (1995) set out the general idea of functional discriminant analysis. Hall et al. (2001) proposed a functional data-analytic method for dimension reduction, regarding signals as curves or functions, and performed quadratic discriminant analysis on the reduced space (FQDA). Cuevas et al. (2007) used the idea of depth to compute robust distances between curves. Berlinet et al. (2008) developed a supervised wavelet-based functional data classification method. Cao and Fan (2009) proposed a kernel-induced random forest method for classifying functional data by defining kernel functions of two curves. Other related functional classification methods include functional generalized linear models (FGLM; Müller and Stadtmüller 2005), functional kernel density estimation (FKDE; Zhu et al. 2012), and functional principal component regression (FPCR; Reiss and Ogden 2007).

Our approach for functional data classification is based on the kernel-based non-parametric approach proposed by Ferraty and Vieu (2003) and requires the choice of a “distance function” (more precisely, a semi-metric). The performance of this approach is greatly affected by the choice of the semi-metric. Ferraty and Vieu's proposed semi-metric is based on functional principal component analysis (FPCA). In this paper we propose an alternative semi-metric based on wavelets and through simulations and real data analysis, we examine whether our proposed semi-metric will allow for improved performance in some situations, such as the noisy image data shown in Figure 1.

In Section 2, we will first review Ferraty and Vieu's kernel approach and describe our motivation for choosing a wavelet-based semi-metric for functional classification. In addition, we will briefly introduce wavelet methods and then describe our approach. We provide simulation results comparing our proposed wavelet method with other functional classification approaches in Section 3 and a real image data application in Section 4. Some brief discussion is given in Section 5.

2 Methodology

The central idea of the kernel-based approach proposed by Ferraty and Vieu (2003) is to compute distances between a given curve to be classified and all curves in the training data. The classification of the new curve is based on the class membership of the curves that are "nearest" to it. In order to compute the distance between curves, some notion of distance must be defined. In fact, only the symmetry and non-negativity properties and the triangle inequality of a metric are needed (the coincidence axiom is not necessary), and therefore only a semi-metric is required. To be precise, a semi-metric d on a space S is a function that maps S × S to R and satisfies the symmetry, non-negativity, and triangle inequality axioms of a metric, but for which d(s1, s2) = 0 does not imply s1 = s2.

2.1 Notation

Before describing the method, we first introduce some notation. Let (X1, Y1), . . . , (Xn, Yn) be independently and identically distributed as (X, Y), where the Xi are random functions taking values in the semi-metric space (S, d) and the Yi are categorical variables. To avoid abstract notation, in this article we take S to be the Hilbert space L2([0, 1]p), where p is the dimension of the domain of the Xi. To further simplify notation, we will use C to denote the support [0, 1]p.

2.2 Review of the kernel method

Ferraty and Vieu (2003) proposed a kernel-based approach to classify curves (i.e., C = [0, 1]). The procedure is described as follows: For a given function x, for each group g, estimate the conditional probability that Y belongs to g:

$$p_g(x) = P\big(Y = g \mid X(t) = x(t),\ t \in C\big), \qquad g \in \{1, \ldots, G\}. \tag{1}$$

Then assign the function x to the group with highest conditional probability.

In order to estimate the conditional probability in (1), Ferraty and Vieu proposed a kernel estimator, defined by

$$\hat{p}_{g,h}(x) = \frac{\sum_{i=1}^{n} I\{Y_i = g\}\, K\big(h^{-1} d(x, X_i)\big)}{\sum_{i=1}^{n} K\big(h^{-1} d(x, X_i)\big)}, \tag{2}$$

where I is the indicator function, h is the bandwidth, and K is a non-negative kernel function with support [0, 1] satisfying $\int_0^1 K(x)\,dx = 1$. Note that in this paper we use the uniform kernel; we also tried the Epanechnikov kernel and the results were very similar.
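To make the estimator concrete, here is a minimal Python sketch of (2) with the uniform kernel. The function name, the dictionary return type, and the NumPy representation of curves are our own choices, and d stands for whichever semi-metric is in use.

```python
import numpy as np

def kernel_estimate(x, X_train, y_train, d, h, groups=(1, 2)):
    """Kernel estimate of P(Y = g | X = x) as in (2), using the uniform
    kernel K(u) = 1{0 <= u <= 1}; d is any semi-metric between curves."""
    dists = np.array([d(x, Xi) for Xi in X_train])
    w = (dists <= h).astype(float)   # uniform kernel: weight 1 within bandwidth h
    if w.sum() == 0:                 # no training curve within distance h of x
        return {g: 1.0 / len(groups) for g in groups}
    return {g: w[np.asarray(y_train) == g].sum() / w.sum() for g in groups}
```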

The semi-metric used in their paper is based on functional principal component analysis (FPCA; Dauxois et al. 1982), in which a random function X can be expressed as

$$X(t) = \sum_{k=1}^{\infty} \left( \int_C X(s)\, v_k(s)\, ds \right) v_k(t),$$

where vk is the kth orthonormal eigenfunction (corresponding to the kth largest eigenvalue) of the covariance function Γ(s, t) = E(X(s)X(t)). The corresponding semi-metric dFPCA is defined by

$$d_{\mathrm{FPCA}}^2(x_1, x_2) = \sum_{i=1}^{L} \left( \int_C \big(x_1(t) - x_2(t)\big)\, v_i(t)\, dt \right)^2,$$

which depends on L, the number of selected functional principal components; L can be determined by cross-validation. To simplify notation, we suppress the dependence of dFPCA on L.
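A sketch of how dFPCA can be computed from sampled curves, assuming the eigenfunctions vk are estimated by an SVD of the centered training matrix and the integrals are approximated by sums over the sampling grid (a constant factor that does not affect the ranking of distances); the function name is ours.

```python
import numpy as np

def fpca_semimetric(x1, x2, X_train, L):
    """d_FPCA(x1, x2): project the difference x1 - x2 onto the top-L empirical
    eigenfunctions of the training curves (X_train is an (n, N) array of
    curves on a common grid); integrals are approximated by grid sums."""
    Xc = X_train - X_train.mean(axis=0)                # center the training curves
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt: eigenvectors v_k
    scores = Vt[:L] @ (x1 - x2)                        # <x1 - x2, v_k>, k = 1, ..., L
    return float(np.sqrt(np.sum(scores ** 2)))
```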

2.3 Wavelets

Wavelet bases are commonly used for sparse representation of curves or images. Compared with traditional Fourier bases, wavelet bases provide a degree of localization in space as well as in frequency. A wavelet basis is constructed from a scaling function ("father wavelet") φ and a wavelet function ("mother wavelet") ψ. Any (one-dimensional) function in L2(R) can be approximated by shifted and dilated versions of the father wavelet φ; in fact, L2(R) can be approximated by the union of a nested sequence of subspaces

$$V_0 \subset V_1 \subset V_2 \subset \cdots \subset L^2(\mathbb{R}),$$

where Vj is the span of the functions

$$\left\{ \phi_{j,k} : \phi_{j,k}(x) = 2^{j/2}\, \phi(2^j x - k),\ k \in \mathbb{Z} \right\}.$$

The shifted and dilated versions of the mother wavelet,

$$\left\{ \psi_{j,k}(x) = 2^{j/2}\, \psi(2^j x - k),\ k \in \mathbb{Z} \right\},$$

form an orthonormal basis for a "detail space" Wj, which is the orthogonal complement of Vj in Vj+1. Consequently, L2(R) is the direct sum of V0, W0, W1, . . . , and

$$\left\{ \psi_{j,k},\ k \in \mathbb{Z},\ j = 0, 1, 2, \ldots \right\}$$

along with the functions {φ0,k, k ∈ Z} form an orthonormal basis for L2(R).

In this paper we consider only wavelets with support on a finite interval. Without loss of generality we consider L2([0, 1]). The wavelet bases of L2([0, 1]) can be easily adapted from those for L2(R). We will employ periodic boundary handling that gives 2j basis functions at each resolution level j. For simplicity, throughout this paper, we express the orthonormal basis of L2([0, 1]) by taking the union of

$$\left\{ \psi_{j,k},\ k = 0, \ldots, 2^j - 1,\ j = 0, 1, 2, \ldots \right\}$$

with the mean function, denoted φ−1,0.

In practice, we only observe x = (x(1/N), . . . , x(N/N))T, where we take N = 2J for some integer J. For such sampled data, we can apply the discrete wavelet transform (DWT) to obtain the wavelet coefficients. In the simulations and application described here, we used Daubechies' orthogonal wavelet basis db4, which has 4 vanishing moments. We have also repeated our analyses using other Daubechies basis sets (db5, db6, and db1, the Haar basis) and found that the results depend very little on the choice of basis. The DWT is a linear transform which, applied to a vector x of length N, results in N wavelet coefficients. In matrix form, the vector of wavelet coefficients can be written z = w(x) = Wx, where W is an N × N orthonormal matrix and z = (z1, . . . , zN)T represents the wavelet coefficients arranged in vector form. For convenience in notation, we use a single subscript for wavelet coefficients.
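As an illustration, the flattened coefficient vector z = Wx can be obtained as follows. The use of PyWavelets is our assumption, since the paper does not name its software, but its db4 filter and periodization mode match the construction described above.

```python
import numpy as np
import pywt  # PyWavelets; an assumed tool, not named by the paper

# DWT of a sampled curve with the db4 basis and periodic boundary handling,
# flattened into a single coefficient vector z = w(x) of the same length N.
x = np.random.randn(1024)                              # a sampled curve, N = 2^10
coeffs = pywt.wavedec(x, 'db4', mode='periodization')  # approximation + details by level
z = np.concatenate(coeffs)
assert z.size == x.size                                # orthonormal transform: N in, N out
```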

The extension of one-dimensional wavelet analysis to two or three dimensions may be accomplished by taking tensor products of the wavelets and scaling functions to create basis functions for L2(R2) or L2(R3) and this can also be adapted to the unit square or unit cube, i.e., for L2([0, 1]2) or for L2([0, 1]3) (Daubechies, 1992).
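A sketch of the same flattening for an image, using the tensor-product transform; again PyWavelets is an assumed tool, and the point is only that the downstream semi-metric code is unchanged once the coefficients are arranged as one vector.

```python
import numpy as np
import pywt  # assumed tooling, as above

# Tensor-product extension: the identical flattening works for images.
img = np.random.randn(128, 128)
coeffs2 = pywt.wavedecn(img, 'db4', mode='periodization')
arr, _ = pywt.coeffs_to_array(coeffs2)   # pack all subbands into one array
z2 = arr.ravel()                         # single coefficient vector of length 128 * 128
```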

2.4 Proposed class of semi-metrics

In describing the main idea of our proposed method, we consider only two groups in this paper (i.e., G = 2), noting that it can be easily generalized to more groups. We first divide our data into a training sequence (X1, Y1), . . . , (Xn1, Yn1) and a validation sequence (Xn1+1, Yn1+1), . . . , (Xn1+n2, Yn1+n2), where n1 + n2 = n. We arrange the training sample so that the first n11 samples (X1, Y1), . . . , (Xn11, Yn11) are from group 1 and the remaining samples are from group 2. Then for the training samples, our model can be written:

$$X_i(t_j) = f_1(t_j) + \varepsilon_{ij}, \qquad i = 1, \ldots, n_{11},\ j = 1, \ldots, N,$$
$$X_i(t_j) = f_2(t_j) + \varepsilon_{ij}, \qquad i = n_{11}+1, \ldots, n_1,\ j = 1, \ldots, N. \tag{3}$$

The functions f1 and f2 are the mean functions of groups 1 and 2, respectively; the errors εij, i = 1, . . . , n1, j = 1, . . . , N, are assumed to be i.i.d. random variables with mean zero and variance σ2. For simplicity of presentation, we describe our method in terms of one-dimensional curves, but the same procedure can be applied directly to functions of any dimensionality.

Defining f1 = (f1(1/N), f1(2/N), . . . , f1(N/N))T and f2 similarly, let θ = (θ1, . . . , θN)T denote the wavelet coefficients of f1 − f2, i.e., θ = w(f1 − f2). We can then define the index set

$$P_T = \{ j : |\theta_j| > T \}, \tag{4}$$

which identifies the indices of the wavelet coeffcients that are most different between f1 and f2. Of course when f1 and f2 are not observed (i.e., when only noisy observations are available), this cannot be determined, but the “oracle semi-metric” between two observed functions x1 and x2 may be defined as if PT were provided:

$$d_O(x_1, x_2) = \sqrt{ \sum_{j \in P_T} \big( w(x_1)_j - w(x_2)_j \big)^2 }, \tag{5}$$

where w(x1)j indicates the jth coefficient in the DWT of x1. For simplicity of notation, we suppress the dependence of dO on the selected threshold T. The traditional Euclidean metric can be seen to be a special case of the oracle semi-metric with T = −∞, so that all wavelet coefficients are included (by Parseval's theorem):

$$d_E(x_1, x_2) = \sqrt{ \sum_{j=1}^{N} \big( w(x_1)_j - w(x_2)_j \big)^2 } = \sqrt{ \big( w(x_1) - w(x_2) \big)^T \big( w(x_1) - w(x_2) \big) } = \sqrt{ (x_1 - x_2)^T (x_1 - x_2) },$$

which is based on all of the wavelet coefficients.

In practice, when PT is not provided, we can calculate an empirical version of the index set of coefficients to include in the semi-metric by defining

$$\hat{\theta} = w(\hat{f}_1 - \hat{f}_2), \tag{6}$$

where

$$\hat{f}_1 = \frac{1}{n_{11}} \sum_{i=1}^{n_{11}} X_i, \qquad \hat{f}_2 = \frac{1}{n_1 - n_{11}} \sum_{i=n_{11}+1}^{n_1} X_i \tag{7}$$

are the estimates based on the training data. Then the sample version of PT is simply $\hat{P}_T = \{ j : |\hat{\theta}_j| > T \}$ and the corresponding semi-metric is

$$d_W(x_1, x_2) = \sqrt{ \sum_{j \in \hat{P}_T} \big( w(x_1)_j - w(x_2)_j \big)^2 }.$$
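Putting (6), (7), and the thresholded index set together, a minimal sketch of constructing dW from training curves; the function names and array layout are ours.

```python
import numpy as np
import pywt  # assumed tooling, as above

def dwt_vec(x):
    """Flattened periodized db4 DWT: z = w(x)."""
    return np.concatenate(pywt.wavedec(x, 'db4', mode='periodization'))

def make_dW(X1, X2, T):
    """Build d_W from training curves: X1 and X2 are (n11, N) and (n1 - n11, N)
    arrays for groups 1 and 2. Implements (6), (7), and the index set P_T hat."""
    theta_hat = dwt_vec(X1.mean(axis=0) - X2.mean(axis=0))
    keep = np.abs(theta_hat) > T             # empirical index set
    def dW(x1, x2):
        diff = dwt_vec(x1) - dwt_vec(x2)
        return float(np.sqrt(np.sum(diff[keep] ** 2)))
    return dW
```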

In order to apply this method in a real-data situation, we must choose the threshold T as well as the bandwidth h. We propose to do this using a cross-validation algorithm:

For b = 1, . . . , B (resampling steps):

B1. Randomly permute the n samples to obtain $(X_1^b, Y_1^b), \ldots, (X_n^b, Y_n^b)$ and designate the first n1 samples in the permutation to be the training data. The remaining n − n1 samples comprise the validation data.

B2. For each h and T, and for each function $X_j^b$ in the validation group (j = n1 + 1, . . . , n), compute the estimated conditional probability that $X_j^b$ belongs to group g using the kernel estimator

$$\hat{p}_{g,h,T}^{\,b}(X_j^b) = \frac{\sum_{i=1}^{n_1} I\{ Y_i^b = g \}\, K\big( h^{-1} d_W(X_j^b, X_i^b) \big)}{\sum_{i=1}^{n_1} K\big( h^{-1} d_W(X_j^b, X_i^b) \big)}, \qquad g = 1, \ldots, G.$$

B3. For each h and T, assign $X_j^b$ to the group with the highest estimated conditional probability:

$$\hat{g}^b(h, T, X_j^b) = \arg\max_g\, \hat{p}_{g,h,T}^{\,b}(X_j^b).$$

Optimal h and T values are obtained by minimizing the misclassification rate in the validation sample:

$$(\hat{h}, \hat{T}) = \arg\min_{h,T}\, \frac{1}{n_2 B} \sum_{b=1}^{B} \sum_{j=n_1+1}^{n} I\big\{ Y_j^b \ne \hat{g}^b(h, T, X_j^b) \big\}.$$

Then for a new independent individual with observed X, our proposed classifier will assign this individual to the group

$$\hat{g}(X) = \arg\max_g\, \hat{p}_{g,\hat{h},\hat{T}}(X).$$
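The resampling scheme B1–B3 can be sketched as follows, reusing make_dW and kernel_estimate from the snippets above; the grids, function name, and group labels 1 and 2 are our assumptions. Here X is an (n, N) NumPy array of curves and y an integer array of labels.

```python
import numpy as np
from itertools import product

def choose_h_T(X, y, n1, h_grid, T_grid, B=100, seed=None):
    """Steps B1-B3: repeatedly permute, rebuild d_W on the training part, and
    count validation misclassifications; returns the (h, T) pair minimizing
    the misclassification rate (minimizing the count is equivalent)."""
    rng = np.random.default_rng(seed)
    errors = {(h, T): 0 for h, T in product(h_grid, T_grid)}
    for _ in range(B):
        perm = rng.permutation(len(y))
        Xtr, ytr = X[perm[:n1]], y[perm[:n1]]
        Xva, yva = X[perm[n1:]], y[perm[n1:]]
        for (h, T) in errors:
            dW = make_dW(Xtr[ytr == 1], Xtr[ytr == 2], T)
            for xj, yj in zip(Xva, yva):
                p = kernel_estimate(xj, Xtr, ytr, dW, h)
                errors[(h, T)] += int(max(p, key=p.get) != yj)
    return min(errors, key=errors.get)       # (h_hat, T_hat)
```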

3 Simulation Study

We performed a simulation study to compare the classification accuracy of the kernel method using four different semi-metrics (the FPCA-based dFPCA (FPCA), Euclidean dE (Euclidean), wavelet-based oracle dO (oracle), and empirical wavelet-based dW (wavelet) described in Section 2.4) as well as the four other functional classification algorithms mentioned in Section 1: FGLM, FKDE, FPCR, and FQDA.

We performed simulations for both one-dimensional curves and two-dimensional images. Also, in order to compare the empirical wavelet-based semi-metric with the oracle, we conducted a simulation to see how well the empirical version matched the oracle version.

For each of the four semi-metrics (Euclidean, wavelet, oracle, FPCA), the tuning parameters ĥ and T̂ were chosen based on 100 cross-validation samples. For both the training and validation samples, half of the curves belong to the first group and the other half to the second. To evaluate the performance of each classifier, 500 additional independent test samples were generated to estimate the misclassification rate.

For the one-dimensional curve simulations, we compared all eight functional classification methods, while for two-dimensional images we only compared the wavelet thresholding semi-metric (wavelet) with the Euclidean one (Euclidean), since the available software allows application only to one-dimensional functional data.

3.1 One-dimensional curves

To study the performance of the various semi-metrics in classifying one-dimensional curves, we set N = 1024 in each case for a variety of choices of n. In each case, we set n1 = n/2 to be the size of the training set in each resampling step and placed the remaining observations in the validation sample. We set the simulation model as follows, and consider nine different combinations:

$$f_1(t) = b(t) + s(t), \tag{8}$$
$$f_2(t) = b(t), \tag{9}$$

where t ∈ [0, 1] and b(t) is the “baseline” function and s(t) is the signal that allows discrimination between the two groups. We consider three choices for the baseline function and three choices for the signal, giving nine combinations in all. Our three baseline functions are: b1(t) ≡ 0; b2(t) = sin(4πt) − cos(6πt) (smooth); and

$$b_3(t) = \begin{cases} 1 & \text{if } 1 \le [1024\,t] \ (\mathrm{mod}\ 10) \le 3 \\ 0 & \text{otherwise,} \end{cases}$$

where [x] is the greatest integer less than or equal to x. Our three signal functions are defined:

$$s_1(t) = \begin{cases} 1, & 230/1024 < t \le 250/1024 \\ 0, & \text{otherwise} \end{cases}$$
$$s_2(t) = \begin{cases} 1, & 200/1024 < t \le 250/1024 \\ 0, & \text{otherwise} \end{cases}$$
$$s_3(t) = \begin{cases} 1, & 200/1024 < t \le 210/1024 \ \text{or}\ 300/1024 < t \le 310/1024 \\ 0, & \text{otherwise} \end{cases}$$

Figure 2 and Figure 3 illustrate the three choices of baseline functions and three choices of signal functions, respectively.

Figure 2. Three choices of baseline functions.

Figure 3. Three choices of signal functions.

Figure 4 and Figure 5 show 20 realizations (10 blue curves for group 1 and 10 red curves for group 2) for the simulation setting with smooth baseline and one-bump signal, with standard deviation of the noise equal to 0.9 and 1.8, respectively. The only difference between the two groups is on t ∈ [231/1024, 250/1024] (indicated in the figures by arrows).

Figure 4. Ten simulated curves for group 1 (blue) and ten simulated curves for group 2 (red) for the smooth baseline and one-bump signal. The standard deviation of the noise is 0.9. The arrows mark the beginning and ending locations of the bump.

Figure 5. Ten simulated curves for group 1 (blue) and ten simulated curves for group 2 (red) for the smooth baseline and one-bump signal. The standard deviation of the noise is 1.8. The arrows mark the beginning and ending locations of the bump.

Then the n curves for training and validation were simulated from the models as given in equation (3) with noise {εij, i = 1, . . . , n; j = 1, . . . , N} being iid normal random variables with mean 0 and variance σ2. For each baseline/signal combination, and for each choice of noise level/sample size, this entire procedure was repeated 10 times. We display the aggregate results.
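For concreteness, a sketch of the data-generating step for one of the nine settings (smooth baseline b2 with the one-bump signal s1); the function name and the equal group split are our own choices.

```python
import numpy as np

def simulate_setting(n, sigma, seed=None):
    """Generate n curves (half per group) from model (3) with the smooth
    baseline b2 and one-bump signal s1; other settings swap b and s."""
    rng = np.random.default_rng(seed)
    N = 1024
    t = np.arange(1, N + 1) / N
    b = np.sin(4 * np.pi * t) - np.cos(6 * np.pi * t)          # smooth baseline b2
    s = ((t > 230 / 1024) & (t <= 250 / 1024)).astype(float)   # one-bump signal s1
    y = np.repeat([1, 2], n // 2)
    f = np.where((y == 1)[:, None], b + s, b)                  # group means f1, f2
    return f + sigma * rng.standard_normal((n, N)), y
```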

3.1.1 Differing noise levels

Our first simulation is designed to compare the relative performance of classification among the eight classification algorithms for a range of noise levels. Here we present only the results for n = 100. Simulations for different n values were also performed but the relative performances were very similar.

The standard deviations of the noise were set to 10 levels spaced evenly between 0.9 and 1.8.

Figure 6 demonstrates the effect of the noise level on the performance of the 8 classification methods.

Figure 6. Estimated misclassification rates for 8 classifiers at different noise levels in 9 situations: zero (top row), smooth (middle row), and complicated (bottom row) baselines; one bump (left column), wide bump (middle column), and two bumps (right column).

As would be expected, as the standard deviation of the noise increases, the misclassification rate increases for all classifiers. As a general rule, for all the classifiers and all baseline functions, the misclassification rate for the wide bump signal function is the lowest, and that for the two bump signal function is the highest. Note that both the one bump and the two bumps signal functions have 20 nonzero points (out of 1024 total points).

Since the kernel-based methods based on the Euclidean, wavelet, and oracle semi-metrics depend only on the signal s(t) (or equivalently, only on the difference function f1 − f2), their results do not change across the different choices of baseline function. For the other methods (FPCA, FGLM, FPCR, FKDE, and FQDA), adding a smooth baseline has very little effect on the results, but adding a non-smooth, complicated baseline does result in higher misclassification rates.

As would be expected, the classifier based on the oracle semi-metric dO (oracle) performs better than its empirical counterpart (wavelet), but not much better. Both the wavelet and the oracle methods generally perform better than all of the other classifiers in all situations considered. It is interesting to note that for the wide bump signal with zero or smooth baselines, the FPCA method performs as well as the wavelet and oracle methods. However, when the baseline function is complicated, FPCA's performance suffers even with the wide bump, and it does not perform as well.

3.1.2 Differing sample sizes

Our second simulation compares the relative performance of classifiers based on the eight classification algorithms for different sample sizes ranging from 40 to 240. For each simulation the noise level was set to be σ = 1.5. Results are displayed in Figure 7.

Figure 7. Estimated misclassification rates for 8 classifiers at different sample sizes in 9 situations: zero (top row), smooth (middle row), and complicated (bottom row) baselines; one bump (left column), wide bump (middle column), and two bumps (right column).

As would be expected, as the sample size increases, the misclassification rates decrease in most situations. As seen in the previous section, since the kernel-based methods depending on the Euclidean, wavelet, and oracle semi-metrics depend only on the signal s(t) (or equivalently, the difference function f1 − f2), the results for these methods are not affected by the choice of baseline. For the other five methods (FPCA, FGLM, FPCR, FKDE, and FQDA), again, adding a smooth baseline has very little effect, but adding a non-smooth, complicated baseline does tend to worsen performance in most situations. Also as expected, the classifier based on the oracle semi-metric dO (oracle) performs better than its empirical counterpart (wavelet) but not much better, and the difference in performance becomes small as n increases. Also, as seen in the previous section, these two methods (wavelet and oracle) perform better than (or at least as well as) the other classifiers in all simulation scenarios.

3.1.3 Mimicking the oracle semi-metric

The oracle semi-metric represents an ideal not attainable in practice, but it is informative to investigate how well the empirical version resembles it in various settings. A third simulation study was designed to see how well the coefficients chosen by the empirical wavelet-based semi-metric match those chosen by the oracle semi-metric, for σ = 1.5 and three different sample sizes. For this simulation we use the one-bump signal (but supported on 200/1024 to 220/1024) with the zero baseline.

In Table 1, the magnitude for each wavelet coefficient is defined as the ratio of the squared wavelet coefficient θj² to the total of the squared coefficients. The wavelet basis functions in the oracle semi-metric are ordered by relative importance based on these magnitudes. The first listed coefficient accounts for about 27% of the difference, the second accounts for an additional 21%, and so on; the total magnitude for these 17 coefficients is 95%. In addition, the table lists the frequency (out of 100 simulations) with which each coefficient is selected by the empirical wavelet-based semi-metric. As would be expected, the most important coefficients are selected with high probability and the less important coefficients are selected less often. Furthermore, these frequencies tend to increase as n increases. The oracle semi-metric, depending on the choice of T, would use at most 71 coefficients, since there are only 71 nonzero wavelet coefficients for f1 − f2 (the remaining 953 coefficients are all zero). To gauge the false positive rate, Table 1 also lists the average percentage of these 953 coefficients that were included; it declines from 2.61% to 0.71% as n increases from 60 to 200.

Table 1. Frequency of selection of large-magnitude components using the wavelet-based semi-metric

magnitude    n = 60 (%)    n = 100 (%)    n = 200 (%)
0.2691          100           100            100
0.2090           97            99            100
0.1070           74            86             99
0.0540           67            84            100
0.0529           67            78             98
0.0435           43            56             81
0.0410           42            60             80
0.0281           23            36             49
0.0215           22            30             58
0.0195           22            29             42
0.0192           17            24             27
0.0192           20            27             40
0.0190           19            21             35
0.0189           15            25             35
0.0129           12            16             17
0.0098           10            10             13
0.0079           11            11             13
remaining         2.61          1.48           0.71

3.2 Two-dimensional images

We also conducted a simulation study to compare the relative classification performance of the semi-metrics for two-dimensional images. For the simulated images, we chose the square domain [0, 1]2, divided into a 128 × 128 grid. The simulation model is:

$$f_1(t_1, t_2) = \begin{cases} 1, & 30/128 < t_1 \le 36/128 \ \text{and}\ 30/128 < t_2 \le 36/128 \\ 0, & \text{otherwise} \end{cases}$$

and f2(t1, t2) ≡ 0.

The noise is generated by iid Gaussian random variables. Sample size n was set to 100 and σ ranged from 0.3 to 3.0. For each simulation, resampling was performed 100 times. Figure 8 illustrates the results of using the Euclidean semi-metric and the wavelet thresholding one. The plot shows that as noise level increases, the misclassification rate increases for both semi-metrics. Similarly to what was observed with the one-dimensional signals, the wavelet thresholding semi-metric tends to perform better than the Euclidean one for all the noise levels.

Figure 8. Comparison of classification performance between the wavelet thresholding and Euclidean semi-metrics for two-dimensional images. The x-axis represents noise levels ranging from σ = 0.3 to 3.0 and the y-axis represents the misclassification rate.

4 Application to PET imaging data

To examine the performance of the proposed method on real data, we applied the classification algorithms based on the wavelet thresholding semi-metric and the Euclidean metric to images from a depression study. Collected by Parsey et al. (2006) from 51 healthy controls and 69 subjects with major depressive disorder, these images are maps of the binding potential of 5HT1A receptors, which are believed to play an important role in the disorder. The binding potential is an index that measures how many receptors are available for binding. Images were registered to a common template, resulting in a set of 79 transaxial slices of dimension 91 × 109. We adapted our semi-metric to the wavelet domain in three-dimensional image space. In order to determine h and T, resampling was performed 100 times: the 120 images were randomly divided into two groups, and classification was run repeatedly. For each repetition, 80 images were used to find the optimal parameters and the remaining 40 images were used to test performance. The misclassification rate was estimated to be 0.27 for the wavelet-based semi-metric and 0.46 for the Euclidean metric. The former rate is quite good given the high noise level of such data and the considerable overlap in binding potential measures calculated for various anatomically defined regions of interest.

5 Discussion

We proposed a new semi-metric based on wavelet thresholding for functional data classification. The simulation results showed that when signals are sparse, the classifier based on the wavelet thresholding semi-metric tends to perform considerably better than (or at least comparably to) all the other functional classification methods considered, including FPCA, Euclidean, FGLM, FPCR, FKDE, and FQDA, for all considered noise levels, sample sizes, and simulated scenarios. Furthermore, we found that the wavelet thresholding semi-metric performs similarly to the oracle version, especially for moderate to large sample sizes; this is due to the ability of the empirical wavelet-based version to select the important coefficients with high probability. We also applied our method to classify two groups of 3D binding potential images, where the proposed wavelet thresholding semi-metric performed much better than the Euclidean one. In our experience, since our sample of images is not large (120 images), the proposed resampling scheme is necessary to obtain good classification performance.

One major advantage of taking a wavelet-based approach is that the extension of the semi-metric from one-dimensional signals to two-dimensional and three-dimensional images is quite straightforward; once a basis set is defined (regardless of dimensionality) and the calculated coefficients are arranged as a vector, the procedure is exactly the same. Extensions of other methods, though conceptually straightforward, involve some computational challenges.

Though our method is described for the situation of equal variance, it would be straightforward to extend it to handle unequal variances. If the variance of the noise in (3) is given by, say, Var(εij) = σ2(tj), then the wavelet thresholding semi-metric should be constructed taking this heterogeneity into account. Instead of using f̂1 − f̂2 (see (6) and (7)), we use the normalized difference. That is, for j = 1, . . . , N, we replace f̂1 − f̂2 in (6) by

$$\frac{\hat{f}_1 - \hat{f}_2}{\mathrm{se}(\hat{f}_1 - \hat{f}_2)},$$

where se(f̂1 − f̂2) is the standard error vector for f̂1 − f̂2 and may be calculated as

$$\mathrm{se}(\hat{f}_1 - \hat{f}_2) = \sqrt{ \frac{\sum_{i=1}^{n_{11}} (X_i - \hat{f}_1)^T (X_i - \hat{f}_1)}{n_{11} - 1} + \frac{\sum_{i=n_{11}+1}^{n_1} (X_i - \hat{f}_2)^T (X_i - \hat{f}_2)}{n_1 - n_{11} - 1} }.$$
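A sketch of this normalized difference, interpreting the standard error elementwise at each grid point tj (consistent with its description as a vector); any constant scaling of the se is absorbed when the threshold T is tuned, so normalization factors are not critical. The function name is ours.

```python
import numpy as np

def normalized_difference(X1, X2):
    """Pointwise-normalized mean difference for the unequal-variance case:
    (f1_hat - f2_hat) / se(f1_hat - f2_hat), computed elementwise. Assumes the
    se combines the two groups' sample variances at each grid point."""
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    se = np.sqrt(X1.var(axis=0, ddof=1) + X2.var(axis=0, ddof=1))
    return diff / se
```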

Acknowledgments

The research was supported in part by NIH grants (5 R01 EB009744-03 and 5 R01 MH099003-02) and a grant from the National Science Council of Taiwan (NSC 100-2118-M-110-004).

References

1. Berlinet A, Biau G, Rouvière L. Functional supervised classification with wavelets. Annales de l'ISUP. 2008;52:61–80.
2. Cao J, Fan G. Functional data classification with kernel-induced random forests. 2009. Preprint: http://people.stat.sfu.ca/~cao/Research/FunctionalDataClassification.pdf.
3. Cuevas A, Febrero M, Fraiman R. Robust estimation and classification for functional data via projection-based depth notions. Computational Statistics. 2007;22:481–496.
4. Daubechies I. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics; Philadelphia: 1992.
5. Dauxois J, Pousse A, Romain Y. Asymptotic theory for the principal component analysis of a vector random function: Some applications to statistical inference. Journal of Multivariate Analysis. 1982;12:136–154.
6. Donoho DL, Johnstone IM. Ideal spatial adaptation via wavelet shrinkage. Biometrika. 1994;81:425–455.
7. Fan J, Lin SK. Test of significance when data are curves. Journal of the American Statistical Association. 1998;93:1007–1021.
8. Ferraty F, Vieu P. Curves discrimination: A nonparametric functional approach. Computational Statistics & Data Analysis. 2003;44:161–173.
9. Ferraty F, Vieu P. Nonparametric Functional Data Analysis: Theory and Practice. Springer; New York: 2006.
10. Gunn RN, Gunn SR, Cunningham VJ. Positron emission tomography compartmental models. Journal of Cerebral Blood Flow and Metabolism. 2001;21:635–652. doi: 10.1097/00004647-200106000-00002.
11. Hall P, Poskitt DS, Presnell B. A functional data-analytic approach to signal discrimination. Technometrics. 2001;43:1–9.
12. Hastie T, Buja A, Tibshirani R. Penalized discriminant analysis. The Annals of Statistics. 1995;23:73–102.
13. Müller HG, Stadtmüller U. Generalized functional linear models. Annals of Statistics. 2005;33:774–805.
14. Parsey RV, Oquendo MA, Ogden RT, Olvet DM, Simpson N, Huang Y, Van Heertum RL, Arango V, Mann JJ. Altered serotonin 1A binding in major depression: A [carbonyl-C-11]WAY100635 positron emission tomography study. Biological Psychiatry. 2006;59:106–113. doi: 10.1016/j.biopsych.2005.06.016.
15. Ramsay J, Silverman BW. Functional Data Analysis. 2nd ed. Springer; New York: 2005.
16. Reiss PT, Ogden RT. Functional principal component regression and functional partial least squares. Journal of the American Statistical Association. 2007;102:984–996.
17. Zhu H, Brown PJ, Morris JS. Robust classification of functional and quantitative image data using functional mixed models. Biometrics. 2012;68:1260–1268. doi: 10.1111/j.1541-0420.2012.01765.x.
