Author manuscript; available in PMC: 2016 Apr 1.
Published in final edited form as: J Am Stat Assoc. 2015 Apr 1;110(511):1266–1275. doi: 10.1080/01621459.2015.1016225

Localized Functional Principal Component Analysis

Kehui Chen 1, Jing Lei 1
PMCID: PMC4721272  NIHMSID: NIHMS675394  PMID: 26806987

Abstract

We propose localized functional principal component analysis (LFPCA), looking for orthogonal basis functions with localized support regions that explain most of the variability of a random process. The LFPCA is formulated as a convex optimization problem through a novel Deflated Fantope Localization method and is implemented through an efficient algorithm to obtain the global optimum. We prove that the proposed LFPCA converges to the original FPCA when the tuning parameters are chosen appropriately. Simulation shows that the proposed LFPCA with tuning parameters chosen by cross-validation can almost perfectly recover the true eigenfunctions and significantly improve the estimation accuracy when the eigenfunctions are truly supported on some subdomains. When the original eigenfunctions are not localized, the proposed LFPCA also serves as a useful tool for finding orthogonal basis functions that balance interpretability and the capability of explaining variability of the data. The analysis of country mortality data reveals interesting features that cannot be found by standard FPCA methods.

Keywords: functional principal component analysis, domain selection, interpretability, orthogonality, deflation, convex optimization

1. INTRODUCTION

Functional principal component analysis has emerged as a major tool to explore the source of variability in a sample of random curves and has found wide applications in functional regression, curve classification, and clustering (Castro et al., 1986; Rice and Silverman, 1991; Cardot, 2000; Yao et al., 2005; Ramsay and Silverman, 2005; Hall and Hosseini-Nasab, 2006). In this paper, we consider functional principal component analysis with localized support regions. That is, for a smooth random function X, we look for orthogonal basis functions with localized support regions that explain most of the variance. The main motivation of the localized functional principal component analysis (LFPCA) is to find a parsimonious linear representation of the data that balances the interpretability and the capability of explaining variance of the stochastic process.

The proposed method outputs localized basis functions whose localization level is controlled by a localization tuning parameter. We propose two methods to select the localization parameter, corresponding to two useful applications of our proposed method. First, one can choose the localization parameter by maximizing the explained variance of the random process computed by V-fold cross-validation. Our simulation shows that when the eigenfunctions truly have localized support regions, the proposed LFPCA with a localization parameter chosen by cross-validation significantly improves the estimation accuracy of the eigenfunctions compared to standard FPCA methods. On the other hand, when the original eigenfunctions are not localized, the localization parameter chosen by cross-validation is expected to be very close to zero, and our numerical studies confirm that the performance of the proposed LFPCA is then almost identical to that of standard FPCA methods. The second method of choosing the localization parameter is to seek the most localized basis functions that explain a fixed level of variance. This method is particularly useful when the standard eigenfunctions are not localized, in which case it makes sense for the proposed LFPCA not to target the standard eigenfunctions but to balance between interpretability and the capability of explaining variance of the stochastic process. Details can be found in Section 3.2. We consider a country mortality data application to illustrate the second method of choosing the localization parameter. In the country mortality data, the mortality rates at age 60 were recorded from 1960 to 2006 for 27 countries around the world. The first three localized basis functions, explaining more than 85% of the variance, correspond to variational modes around the mid-1990s, the 1980s, and the 1960s, respectively. Another example concerning growth curve data is presented in the Online Supplementary File. The first two localized basis functions explain more than 85% of the variance and clearly indicate that the main variational modes of female height growth are around age 12 and around age 5, perfectly matching existing knowledge of growth spurts. These interesting features cannot be revealed by standard FPCA.

Domain localization has been studied by several authors in the functional regression model E(Y|X) = a + ∫_𝒯 X(t)β(t) dt, where the coefficient function β(t) is desired to be zero outside a subdomain of 𝒯 for the purpose of improved interpretability (James et al., 2009; Zhao et al., 2012; Zhou et al., 2013). Most of these methods turn the problem into a variable selection problem and use LASSO-type penalties. In a recent thesis work, Lin (2013) studied interpretable functional principal component analysis, which has a very similar flavor to our proposed LFPCA. In that work, an ℓ0 penalty is added on the eigenfunctions and a greedy algorithm based on basis expansion of curves is developed to solve the non-convex optimization problem. The formulation of localization or domain selection in the context of functional principal component analysis is quite challenging for at least two reasons. First, the eigen problem together with a localization penalty is usually not convex, and in general it is an NP-hard problem to find a global optimum. Second, in order to obtain a sequence of mutually orthogonal eigen-components, a commonly taken procedure is to deflate the empirical covariance operator at step j by removing the effect of the previous j − 1 components (White, 1958; Mackey, 2008). But with the localization penalty in the objective function, this procedure cannot guarantee the orthogonality of such sequentially obtained eigen-components. In sequential estimation of principal components, being orthogonal to the first component is a natural requirement when looking for the second component; otherwise the maximization over the second direction is not well-defined, since the solution would still be the first direction. From a dimension reduction perspective, orthogonality is also appealing, since the resulting k-dimensional orthogonal basis leads to very simple calculations for subsequent inferences.

The main contribution of this paper is three-fold. First, we formulate the LFPCA as a convex optimization problem with explicit constraint on the orthogonality of eigen-components. Second, we provide an efficient algorithm to obtain the global maximum of this convex problem. Third, we carefully investigate the estimation error from the discretized data version to the functional continuous version, as well as the complex interaction between the eigen problem and the localization penalty, and prove consistency of the estimated eigenfunctions. The starting point of our method is a sup-norm consistent estimator of the covariance operator, up to a constant shift on the diagonal. For dense and equally spaced observations with or without measurement error, the proposed method can be directly carried out on the sample covariance, i.e., without the need to perform basis expansion, smoothing of the individual curves, or smoothing of the estimated covariance operator. For other designs of functional data, the proposed method is still applicable when an appropriate covariance estimator is available.

Our formulation of LFPCA borrows ideas from recent developments in sparse principal component analysis. In Vu et al. (2013) and Lei and Vu (2015), a similar convex framework based on Fantope Projection and Selection has been proposed to estimate a k-dimensional sparse principal subspace of a high dimensional random vector (see also d’Aspremont et al. (2007) for k = 1). These sparse subspace methods are useful when the union of the support regions of several leading eigenvectors is sparse. In sparse PCA settings, the notion of sparsity requires the proportion of non-zero entries in the leading eigenvectors to vanish as the dimensionality increases, and therefore it makes sense to consider the union of the support regions of several leading eigenvectors to be sparse. However, in functional data settings, the length ratio of a support subdomain over the entire domain is determined by the random curve model and is usually a constant, and the union of several leading subdomains can be as large as the entire domain. This is also the reason that we use the notion “localized” instead of “sparse”. It remains challenging to obtain sparse eigenvectors sequentially such that each one is allowed to have a different support region; a particular challenge is the interaction between orthogonality and the sparse penalty. Besides the differences between functional PCA and sparse PCA, one main extension developed in our method is the construction of a deflated Fantope to estimate individual eigen-components sequentially, with possibly different support regions and guaranteed orthogonality. This deflated Fantope formulation is of independent interest in many other structured principal component analysis problems.

The rest of this paper is organized as follows. In Section 2, we introduce the formulation of localized functional principal component analysis. Section 3 derives the solution to the optimization problem and describes the algorithm as well as the selection of tuning parameters. Section 4 contains the consistency results. Section 5 and Section 6 present numerical experiments and data examples to illustrate our method. Section 7 contains some discussions and extensions. Technical details and additional materials are provided in the Online Supplementary File.

2. LFPCA THROUGH DEFLATED FANTOPE LOCALIZATION

We consider a square integrable random process X(t) : 𝒯 ↦ ℝ over a compact interval 𝒯 ⊂ ℝ, with mean and covariance functions μ(t) = EX(t) and Γ(s, t) = Cov(X(s), X(t)), and covariance operator (Γf)(t) = ∫_𝒯 f(s)Γ(t, s) ds. Under the minimal assumption that Γ(s, t) is continuous over (s, t), this operator Γ has orthonormal eigenfunctions ϕj(t), j = 1, 2, …, with nonincreasing eigenvalues λj, satisfying Γϕj = λjϕj. The well known Karhunen–Loève expansion then gives the representation

X(t) = \mu(t) + \sum_{j=1}^{\infty} \xi_j \phi_j(t),   (1)

where ξj, j ≥ 1, is a sequence of uncorrelated random variables satisfying E(ξj) = 0 and var(ξj) = λj, with the explicit representation ξj = ∫_𝒯 (X(t) − μ(t))ϕj(t) dt. A key inference task in functional principal component analysis (FPCA) is to estimate the leading eigenfunctions ϕj(t), 1 ≤ j ≤ k, from n independent sample curves.
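As a point of reference for the localized procedure developed below, the following is a minimal sketch (in Python with NumPy; an illustration, not the authors' code) of the standard discretized FPCA baseline: eigen-decompose the p × p sample covariance, rescale the eigenvectors by √p so that they approximate L2([0, 1])-normalized eigenfunctions, and approximate the score integrals by Riemann sums. The function name and interface are our own choices.

```python
import numpy as np

def standard_fpca(Y, k):
    """Y: (n, p) curves observed on an equally spaced grid on [0, 1]; returns
    the mean, k leading eigenfunctions, eigenvalues, and scores xi_ij."""
    n, p = Y.shape
    mu = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)              # p x p sample covariance
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1][:k]         # k leading eigen-pairs
    phi = vecs[:, idx].T * np.sqrt(p)        # rescale so that (1/p) * sum_l phi_j(t_l)^2 = 1
    lam = vals[idx] / p                      # eigenvalues on the functional scale
    scores = (Y - mu) @ phi.T / p            # xi_ij ~ Riemann sum of (X_i - mu) * phi_j
    return mu, phi, lam, scores
```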

In practice, the underlying sample curves Xi(t), 1 ≤ in, are usually recorded at a grid of points and may be contaminated with additive measurement errors. We start with dense and equally spaced observations

Y_{il} = X_i(t_l) + \varepsilon_{il} = \mu(t_l) + \sum_{j=1}^{\infty} \xi_{ij}\phi_j(t_l) + \varepsilon_{il}, \quad i = 1, \ldots, n, \; l = 1, \ldots, p,   (2)

where εil are independent noises with mean zero and variance σ², and tl, 1 ≤ l ≤ p, are grid points in 𝒯 at which observations are recorded. Starting from such discrete and possibly noisy data, there are different ways of introducing smoothness in the estimation of FPCA. Rice and Silverman (1991); Silverman (1996); Huang et al. (2008), among others, studied approaches where a smoothness penalty on eigenfunctions is integrated in the optimization step of the eigen-decomposition. Let S be the p × p sample covariance matrix of the observed vector Y, and v be a p dimensional vector. Rice and Silverman (1991) used a roughening matrix D = ΔᵀΔ, where Δ ∈ ℝ^{(p−2)×p} is a second-differencing operator:

\Delta_{ij} = \begin{cases} 1, & \text{if } j \in \{i, i+2\}, \\ -2, & \text{if } j = i+1, \\ 0, & \text{otherwise.} \end{cases}

A smoothed eigenfunction estimator is obtained by solving the following eigen problem:

\max_{v} \; v^T(S - \rho_1 D)v, \quad \text{s.t. } \|v\|_2 = 1,   (3)

where ρ1 is a smoothing parameter, and ||v||2 is the Euclidean norm of v.
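To make the roughening matrix concrete, here is a minimal sketch (Python with NumPy; an illustration, not the authors' code) that builds the second-differencing operator Δ, forms D = ΔᵀΔ, and solves (3) by taking the leading eigenvector of S − ρ1D.

```python
import numpy as np

def roughening_matrix(p):
    """Second-differencing operator Delta in R^{(p-2) x p} and D = Delta^T Delta."""
    Delta = np.zeros((p - 2, p))
    for i in range(p - 2):
        Delta[i, i] = 1.0        # j = i
        Delta[i, i + 1] = -2.0   # j = i + 1
        Delta[i, i + 2] = 1.0    # j = i + 2
    return Delta.T @ Delta

def smoothed_first_pc(S, rho1):
    """Solution of (3): the leading eigenvector of S - rho1 * D."""
    D = roughening_matrix(S.shape[0])
    vals, vecs = np.linalg.eigh(S - rho1 * D)
    return vecs[:, -1]           # eigenvector of the largest eigenvalue (unit Euclidean norm)
```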

A straightforward approach to localize the estimated eigenfunctions is to add another localization penalty:

\max_{v} \; v^T(S - \rho_1 D)v - \rho_2\|v\|_1, \quad \text{s.t. } \|v\|_2 = 1,   (4)

where ρ2 is a tuning parameter, and ||v||1 is the ℓ1 norm of v. However, this is not a convex problem and there are no known algorithms that can efficiently find a global optimum even for the first eigen-component.

Here we propose a novel sequential estimation procedure based on the idea of estimating the rank one projection matrix vvᵀ. Let 〈A, B〉 = trace(AᵀB) for matrices A, B of compatible dimensions. Denote by ||H||_{1,1} the matrix ℓ1 norm, which is the sum of the absolute values of all entries of H. Starting from Π̂0 = 0, for each j = 1, …, k, the jth localized eigen-component is estimated as follows.

\hat{H}_j = \arg\max \; \langle S - \rho_1 D, H\rangle - \rho_2\|H\|_{1,1}, \quad \text{s.t. } H \in \mathcal{D}_{\hat\Pi_{j-1}},
\hat{v}_j = \text{the first eigenvector of } \hat{H}_j,
\hat{\Pi}_j = \hat{\Pi}_{j-1} + \hat{v}_j\hat{v}_j^T,   (5)

where, for any p × p projection matrix Π,

\mathcal{D}_\Pi := \{H : 0 \preceq H \preceq I, \; \mathrm{trace}(H) = 1, \; \text{and } \langle H, \Pi\rangle = 0\},

and, for symmetric matrices A, B, “AB” means that BA is positive semidefinite.

Problem (5) is a convex relaxation of (4) with integrated orthogonality constraints on the estimated localized eigen-components. With the estimated Ĥj, we can easily obtain an estimate ϕ̂j(t) of the localized eigenfunction ϕj(t) by standard interpolation techniques such as linear interpolation, plus an optional final step of re-orthogonalization and re-normalization.
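As an illustration of this final step, here is a minimal sketch (Python with NumPy; the function name, the fine evaluation grid, and the assumption of an equally spaced grid on [0, 1] are our own, not the authors' code) that re-orthogonalizes the estimated vectors and linearly interpolates them to eigenfunction estimates.

```python
import numpy as np

def vectors_to_eigenfunctions(V, p_fine=500):
    """V: (p, k) matrix whose columns are the estimated v_j's (equally spaced grid on [0, 1]).
    Returns a (p_fine, k) matrix of discretized eigenfunction estimates phi_j(t)."""
    p, k = V.shape
    Q, _ = np.linalg.qr(V)                         # optional re-orthogonalization
    Q = Q * np.sign(np.sum(Q * V, axis=0))         # keep the sign of each original v_j
    grid = np.linspace(0.0, 1.0, p)
    fine = np.linspace(0.0, 1.0, p_fine)
    # rescale by sqrt(p) so that each phi_j has (approximately) unit L2([0, 1]) norm
    return np.column_stack([np.interp(fine, grid, Q[:, j] * np.sqrt(p)) for j in range(k)])
```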

With appropriately chosen tuning parameters, the performance of the proposed method is tied to the maximum entry-wise error of the discretized covariance estimator S (see Section 4 for details). Our presentation will focus on a sample covariance S that is computed from dense and equally spaced observations, but the proposed method is not restricted to a dense regular design as long as a reasonable covariance estimate can be obtained. More discussion can be found in Section 7.

To solve for Ĥj and ϕ̂j(t), the key step in (5) is to solve

\max_{H} \; \langle S - \rho_1 D, H\rangle - \rho_2\|H\|_{1,1}, \quad \text{s.t. } H \in \mathcal{D}_\Pi,   (6)

where Π = Π̂j−1 at step j. In the next section we present an algorithm that solves problem (6), with a discussion on the choice of the tuning parameters ρ1 and ρ2.

In the sparse PCA literature, d’Aspremont et al. (2007); Vu et al. (2013) have considered the following problem,

\max_{H} \; \langle S, H\rangle - \rho\|H\|_{1,1}, \quad \text{s.t. } H \in \mathcal{F}^d,   (7)

where the convex set ℱ^d := {H : 0 ⪯ H ⪯ I, trace(H) = d} is called the Fantope of degree d (Dattorro, 2005) and is the convex hull of all rank d projection matrices. While the convex relaxation given in (7) allows us to estimate d-dimensional sparse principal subspaces, it does not lead to a sequence of mutually orthogonal eigenvectors with different support regions.

To ensure orthogonality among the estimated eigenvectors, we consider 𝒟_Π := {H : H ∈ ℱ^1, and 〈H, Π〉 = 0}, which we call the deflated Fantope. It can be naturally generalized to 𝒟_Π^d := {H : H ∈ ℱ^d, and 〈H, Π〉 = 0} to estimate mutually orthogonal principal subspaces. Such a feasibility deflation technique is quite different from the commonly suggested matrix deflation techniques in sequential estimation of eigenvectors (see Mackey (2008) for example).

3. ALGORITHM

3.1 Deflated Fantope Localization using ADMM

The main difficulty in solving problem (6) is the complex interaction between the ℓ1 penalty and the deflated Fantope constraint. To overcome this difficulty, we write (6) in an equivalent form to separate the ℓ1 penalty and deflated Fantope constraint:

\min_{H, Z} \; I_{\mathcal{D}_\Pi}(H) - \langle S - \rho_1 D, H\rangle + \rho_2\|Z\|_{1,1}, \quad \text{s.t. } H - Z = 0,   (8)

where I_{𝒟_Π} is the convex indicator function of 𝒟_Π, which is ∞ outside 𝒟_Π and 0 inside 𝒟_Π. Problem (8) is a convex global variable consensus optimization, which can be solved using the alternating direction method of multipliers (ADMM, Boyd et al. (2011)). We describe in Algorithm 1 an ADMM algorithm that solves (8) and hence (6). It extends the FPS algorithm in Vu et al. (2013) to the deflated Fantope.

The two matrix operators used in the algorithm are defined as follows.

Algorithm 1.

Deflated Fantope Localization using ADMM

Require: S = Sᵀ, Π, D, ρ1, ρ2 ≥ 0, τ > 0, ε > 0
Z^(0) ← 0, W^(0) ← 0  ▷ Initialization
repeat for r = 1, 2, …
  H^(r) ← P_{𝒟_Π}[Z^(r−1) − W^(r−1) + (S − ρ1D)/τ]  ▷ Deflated Fantope projection
  Z^(r) ← S_{ρ2/τ}(H^(r) + W^(r−1))  ▷ Elementwise soft thresholding
  W^(r) ← W^(r−1) + H^(r) − Z^(r)  ▷ Dual variable update
until max(‖H^(r) − Z^(r)‖_F², τ²‖Z^(r) − Z^(r−1)‖_F²) ≤ ε²  ▷ Stopping criterion
return Z^(r)
  1. Soft-thresholding operator: for any a > 0,
     S_a(x) = \mathrm{sign}(x)\max(|x| - a, 0).
  2. Deflated-Fantope-projection operator: for any p × p symmetric matrix A and projection matrix Π,
     P_{\mathcal{D}_\Pi}(A) := \arg\min_{B \in \mathcal{D}_\Pi} \|A - B\|_F^2

     is the Frobenius norm projection of A onto the deflated Fantope 𝒟_Π.

A non-trivial subroutine in Algorithm 1 is to calculate the deflated-Fantope projection P_{𝒟_Π}(A) for a symmetric matrix A. The following lemma gives a closed-form characterization of the deflated-Fantope-projection operator.

Lemma 3.1

Let Π = VVT, where V is a p × d matrix with orthonormal columns. Let U be a p × (pd) matrix that forms an orthogonal complement basis of V. Then

P_{\mathcal{D}_\Pi}(A) = U\Big[\sum_{i=1}^{p-d} \gamma_i^+(\theta)\,\eta_i\eta_i^T\Big]U^T,

where (\gamma_i, \eta_i)_{i=1}^{p-d} are the eigenvalue–eigenvector pairs of U^TAU, i.e., U^TAU = \sum_{i=1}^{p-d}\gamma_i\eta_i\eta_i^T, and \gamma_i^+(\theta) = \min(\max(\gamma_i - \theta, 0), 1), with \theta chosen such that \sum_{i=1}^{p-d}\gamma_i^+(\theta) = 1.
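As a concrete illustration of Lemma 3.1, here is a minimal sketch (Python with NumPy/SciPy; an illustration under our own naming, not the authors' implementation) of the projection for the d = 1 case used in (5); θ is found by bisection on the constraint Σ_i γ_i^+(θ) = 1.

```python
import numpy as np
from scipy.linalg import null_space

def deflated_fantope_projection(A, V):
    """Project a symmetric matrix A onto D_Pi, where Pi = V V^T (V may have zero columns)."""
    p = A.shape[0]
    U = np.eye(p) if V.size == 0 else null_space(V.T)   # orthogonal complement basis of V
    gam, eta = np.linalg.eigh(U.T @ A @ U)

    def capped(theta):                                    # gamma_i^+(theta)
        return np.clip(gam - theta, 0.0, 1.0)

    lo, hi = gam.min() - 1.0, gam.max()                   # sum(capped(lo)) >= 1 >= sum(capped(hi))
    for _ in range(100):                                  # bisection for theta
        mid = 0.5 * (lo + hi)
        if capped(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    g = capped(0.5 * (lo + hi))
    B = (eta * g) @ eta.T                                 # sum_i gamma_i^+(theta) eta_i eta_i^T
    return U @ B @ U.T
```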

The next theorem ensures the convergence of our algorithm to a global optimum of problem (6). The proofs of Lemma 3.1 and Theorem 3.2 are given in the Online Supplementary File.

Theorem 3.2

In Algorithm 1, Z(r)H*, H(r)H* as r → ∞, where H* is a global optimum of problem (6).

In the proof of Theorem 3.2 we will see that the auxiliary number τ used in Algorithm 1 plays a role that is similar to the step size commonly seen in iterative convex optimization solvers. The particular choice of τ does not affect the theoretical convergence of the ADMM algorithm. There are some general guidelines on the practical choice of τ and we refer to Boyd et al. (2011) for further details.
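Putting the pieces together, the following is a minimal sketch (Python with NumPy; an illustration, not the authors' released code) of Algorithm 1 and of the sequential procedure (5), reusing the deflated_fantope_projection routine sketched after Lemma 3.1; the defaults for τ, ε, and the iteration cap are illustrative choices only.

```python
import numpy as np

def soft_threshold(X, a):
    """Elementwise S_a(x) = sign(x) * max(|x| - a, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - a, 0.0)

def admm_deflated_fantope(S, D, V, rho1, rho2, tau=1.0, eps=1e-5, max_iter=2000):
    """Solve problem (6) with Pi = V V^T by the ADMM iterations of Algorithm 1."""
    p = S.shape[0]
    Z = np.zeros((p, p))
    W = np.zeros((p, p))
    A = (S - rho1 * D) / tau
    for _ in range(max_iter):
        H = deflated_fantope_projection(Z - W + A, V)    # deflated Fantope projection
        Z_new = soft_threshold(H + W, rho2 / tau)        # elementwise soft thresholding
        W = W + H - Z_new                                # dual variable update
        primal = np.sum((H - Z_new) ** 2)
        dual = tau ** 2 * np.sum((Z_new - Z) ** 2)
        Z = Z_new
        if max(primal, dual) <= eps ** 2:                # stopping criterion of Algorithm 1
            break
    return Z

def lfpca(S, D, rho1, rho2_list):
    """Sequential procedure (5): one ADMM solve per component, deflating the Fantope."""
    p = S.shape[0]
    V = np.zeros((p, 0))
    for rho2 in rho2_list:
        H = admm_deflated_fantope(S, D, V, rho1, rho2)
        v = np.linalg.eigh(H)[1][:, -1]                  # first eigenvector of H_j
        V = np.hstack([V, v[:, None]])
    return V                                             # columns: v_1, ..., v_k
```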

3.2 Choice of Tuning Parameters

The optimization problem (5) involves two tuning parameters: ρ1 controls the roughness of the eigenfunctions, and ρ2 controls their localization. We present a two-step approach where ρ1 is chosen first and kept the same for all 1 ≤ j ≤ k, and then ρ2 is determined sequentially for each eigenfunction ϕj(t) and denoted by ρ2,j. The two parameters could be chosen jointly by a straightforward modification, but this would be computationally more intensive.

The choice of the smoothing parameter ρ1 has been discussed in Rice and Silverman (1991), who recommended cross-validation or manual selection. In our simulation and data analysis, we have used V-fold cross-validation. First the data are divided into V folds, denoted by 𝒱1, 𝒱2, …, 𝒱V. Let Ĥj^(−v)(ρ1, ρ2) be the estimate Ĥj in (5) using the data other than 𝒱v with tuning parameters ρ1 and ρ2. Let Sv be the discrete covariance estimated from the data in 𝒱v. The smoothing parameter is chosen by maximizing the cross-validated inner product of Ĥ1^(−v) and Sv:

\hat\rho_1 = \arg\max_{\rho \in \mathcal{A}_1} \sum_{v=1}^{V}\langle \hat{H}_1^{(-v)}(\rho, 0), S_v\rangle,   (9)

where 𝒜1 is a candidate set of ρ1, and empirically we found that a sequence between 0 and p times the largest eigenvalue of S works well.

In the following, we present two methods for the choice of ρ2 given a pre-chosen smoothing parameter ρ1. The first method is to choose ρ2,j by maximizing the cross-validated inner product of Ĥj^(−v) and Sv:

\hat\rho_{2,j} = \arg\max_{\rho \in \mathcal{A}_{2,j}} \sum_{v=1}^{V}\langle \hat{H}_j^{(-v)}(\rho_1, \rho), S_v\rangle, \quad j = 1, 2, \ldots, k,   (10)

where 𝒜2,j is a candidate set for ρ2,j, and we propose to use a sequence between 0 and the 95% quantile of the absolute values of the off-diagonal entries of Sj, with Sj = (I − Π̂j−1)S(I − Π̂j−1).

The V-fold cross-validation approach is expected to give a ρ̂2 that indicates the true localization level of the eigenfunctions. The criterion in our proposed cross-validation corresponds to maximizing 〈Σ, Ĥ(ρ2)〉, where Σ is the discretized population covariance and is substituted by the test sample covariance in practice. When the true eigenvector v is localized, 〈Σ, Ĥ(ρ2)〉 should be maximized at approximately the value ρ2* that corresponds to the ideal localization level of the eigenfunction, i.e., ‖Ĥ(ρ2*)‖_{1,1} ≈ ‖vvᵀ‖_{1,1}. We expect 〈Σ, Ĥ(ρ2)〉, as a function of ρ2, to be (i) monotonically increasing on [0, ρ2*], as the search area gradually expands to cover the true eigenvector, and (ii) monotonically decreasing on [ρ2*, ∞), as the search area becomes unnecessarily large so that the estimation becomes noisier. When the true eigenvector v is not localized, we expect 〈Σ, Ĥ(ρ2)〉 to be monotonically decreasing as ρ2 increases. The numerical study confirms the good performance of the cross-validation method; see Section 5 for more details.
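As an illustration of criteria (9) and (10), here is a minimal sketch (Python with NumPy; helper names are ours, not the authors' code) of the cross-validated inner product for one candidate tuning value, reusing the admm_deflated_fantope routine sketched in Section 3.1; as a simplification, the projector of the previously estimated components is passed in rather than re-estimated within each training fold.

```python
import numpy as np

def cv_inner_product(Y, D, V_prev, rho1, rho2, n_folds=5, seed=0):
    """Sum over folds of <H^(-v)(rho1, rho2), S_v>, the criterion in (9) and (10)."""
    n = Y.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    total = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        S_train = np.cov(Y[train_idx], rowvar=False)     # covariance from the training folds
        S_test = np.cov(Y[test_idx], rowvar=False)       # held-out covariance S_v
        H = admm_deflated_fantope(S_train, D, V_prev, rho1, rho2)
        total += np.sum(H * S_test)                      # <H^(-v), S_v>
    return total

# rho_1 is chosen by maximizing cv_inner_product over a grid with rho2 = 0 (and no
# previous components); rho_{2,j} is then chosen over its own grid with rho_1 fixed.
```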

In some applications, we may not want to target the standard eigenfunction, but instead we may want to find orthogonal linear expansions that balance the interpretability (localization) and the capability of explaining the variance of the process. We therefore propose a second method of choosing ρ2, which is based on the notion of fraction of variance explained (FVE). For a p-dimensional vector v of unit length and a sup-norm consistent estimator S of the covariance operator,

\mathrm{FVE}(v) = v^TSv / \mathrm{totV}(S),   (11)

where totV(S) is the sum of the positive eigenvalues of S. We note that for dense and equally spaced observations with measurement error, the FVE(v) defined above is not directly applicable to a sample covariance S, because the sample covariance is sup-norm consistent only up to a shift σ² on the diagonal, where σ² is the error variance. To avoid serious bias from the nugget effect on the diagonal, one may use the eigenvalues from the smoothed covariance. Another practical way is to approximate totV(S) by the sum of the first M leading eigenvalues of S, for a finite number M. For a reasonable error level, the nugget effect bias Mσ² is small compared to the sum of the first M eigenvalues of S, which is of order p (Kneip et al., 2011), while the remaining true eigenvalues beyond M are usually very small because the smoothness of X(t) ensures fast decay of the eigenvalues. In numerical experiments where FVE is needed for determining the number of principal components to be included, we use M = min(20, p − 2).

We now describe how to sequentially select the localization parameter ρ2,j for the jth eigenfunction. Suppose that we have estimated v̂i for 1 ≤ i ≤ j − 1. Let v̂j(ρ1, ρ) be the solution of (5) using a fixed ρ1 and ρ2 = ρ, with Π̂j−1 being the projector onto the subspace spanned by (v̂i : 1 ≤ i ≤ j − 1). We can define

\mathrm{rFVE}(\rho) = \frac{\mathrm{FVE}(\hat{v}_j(\rho_1, \rho))}{\mathrm{FVE}(\hat{v}_j(\rho_1, 0))} = \frac{\hat{v}_j^T(\rho_1, \rho)\,S\,\hat{v}_j(\rho_1, \rho)}{\hat{v}_j^T(\rho_1, 0)\,S\,\hat{v}_j(\rho_1, 0)},   (12)

and choose ρ2,j as

\max\{\rho \in \mathcal{A}_{2,j} : \mathrm{rFVE}(\rho) \ge 1 - a\},   (13)

where a ∈ [0, 1) is the proportion of FVE that one chooses to sacrifice in return for localization. For any a ∈ [0, 1), a ρ satisfying (13) always exists, because rFVE(ρ) = 1 for ρ = 0 and rFVE(ρ) ∈ [0, 1) for ρ > 0. Equation (12) also suggests that rFVE(ρ) can be calculated without computing totV(S). Although the first localized basis function explains less variance than the standard eigenfunction, the lost proportion is likely to be picked up by the second component, and we are still able to explain a large proportion of the total variance with a small number of components. We illustrate this method with real data analyses in Section 6.
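For concreteness, here is a minimal sketch (Python with NumPy; an illustration, not the authors' code) of the FVE in (11), with totV(S) approximated by the M leading eigenvalues as described above, and of the rFVE-based choice (13): scan the candidate grid and keep the largest ρ2 whose rFVE stays above 1 − a. It reuses the admm_deflated_fantope routine sketched in Section 3.1.

```python
import numpy as np

def fve(v, S, M=20):
    """FVE(v) in (11), approximating totV(S) by the sum of the M leading eigenvalues of S."""
    M = min(M, S.shape[0] - 2)
    top = np.sort(np.linalg.eigvalsh(S))[::-1][:M]
    return (v @ S @ v) / np.sum(top)

def choose_rho2_by_rfve(S, D, V_prev, rho1, rho2_grid, a=0.3):
    """Largest rho2 in the candidate grid with rFVE(rho2) >= 1 - a, as in (12)-(13)."""
    v0 = np.linalg.eigh(admm_deflated_fantope(S, D, V_prev, rho1, 0.0))[1][:, -1]
    fve0 = fve(v0, S)
    best = 0.0
    for rho2 in sorted(rho2_grid):
        v = np.linalg.eigh(admm_deflated_fantope(S, D, V_prev, rho1, rho2))[1][:, -1]
        if fve(v, S) / fve0 >= 1.0 - a:
            best = rho2
    return best
```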

4. ASYMPTOTIC PROPERTIES

In this section we establish the ℓ2 consistency of the proposed estimator in an asymptotic setting where both the sample size n and the number of grid points p increase. We will first provide sufficient conditions on the tuning parameters ρ1 and ρ2 such that the LFPCA estimate is consistent. Our second result provides further insights on how the localization penalty ρ2 affects the rate of convergence. We make the following assumptions.

  • A1
    The input matrix S in (5) satisfies sup-norm consistency up to a constant shift on the diagonal: for some constant α ≥ 0 and a sequence en = o(1),
    \max_{1 \le l, l' \le p}\big|S(l, l') - \Gamma(t_l, t_{l'}) - \alpha\,\mathbf{1}(l = l')\big| = O_P(e_n), \quad \text{as } (n, p) \to \infty.

Remark

Assumption (A1) puts a mild condition on the input matrix S that can be satisfied by many standard estimators. Consider functional data with dense and equally spaced observations. If S is the sample covariance estimator from the raw data, standard large deviation bounds such as Bernstein’s inequality (Van Der Vaart and Wellner (1996), Chapter II) imply that Assumption (A1) holds with e_n = √(log p/n) if log p/n → 0 and the random curve X(t), as well as the observation error in model (2), has sub-Gaussian tails; see also Kneip et al. (2011). In this case α = σ², the noise variance. If a smoothed covariance estimator S is used, the sup-norm rate can be √(log n/n) (Li et al., 2010). Convergence in other norms such as the Frobenius norm can be found in (Hall and Hosseini-Nasab, 2006; Hall et al., 2006; Bunea and Xiao, 2015). In general, the consistency result does not depend on the observational design as long as we can get an estimate of the covariance operator whose sup-norm error vanishes as n and p increase. More discussion about cases where a sample covariance is not feasible can be found in Section 7.

  • A2

    There is a positive integer k such that the eigenvalues of Γ satisfy λ1 > ⋯ > λk > λk+1 ≥ ⋯ ≥ 0, with positive eigen-gap δ := min_{1≤j≤k}(λj − λj+1) > 0.

  • A3
    Γ is Lipschitz continuous:
    |\Gamma(s, t) - \Gamma(s', t')| \le L\max(|s - s'|, |t - t'|), \quad \forall\, s, s', t, t' \in \mathcal{T}.
  • A4
    The k leading eigenfunctions of Γ have Lipschitz first derivatives:
    |\phi_j'(t) - \phi_j'(s)| \le L|t - s|, \quad 1 \le j \le k.

Theorem 4.1 (ℓ2 consistency)

Under assumptions (A1–A4), if ρ1/p⁵ → 0 and ρ2 → 0 as (n, p) → ∞, then for k as defined in A2, we have

\sup_{1\le j\le k}\|\hat\phi_j(t) - \phi_j(t)\|_2 \xrightarrow{P} 0.

The proof of Theorem 4.1 is given in the Online Supplementary File. Here we outline the proof, highlighting some key technical challenges.

Let uj be the unit vector obtained by discretizing and re-normalizing the eigenfunction ϕj(t). We will prove Theorem 4.1 by proving

\sup_{1\le j\le k}\|\hat{v}_j - u_j\|_2 \xrightarrow{P} 0.

Let Σ be the discretized covariance operator, with a possible constant shift α on the diagonal. Roughly speaking, v̂j and uj approximate the jth eigenvectors of S and Σ, respectively. We hope to establish the following inequality based on the standard Davis–Kahan sin Θ theorem (Bhatia (1997), Theorem VII.3.1),

\|\hat{v}_j\hat{v}_j^T - u_ju_j^T\|_F \le \frac{c}{\delta p}\|S - \Sigma\|_F \le \frac{c}{\delta}\|S - \Sigma\|_{\infty,\infty},   (14)

provided that S and Σ have an eigengap of order δp (Lemma C.2), where ||A||_F = 〈A, A〉^{1/2} is the Frobenius norm and ||A||_{∞,∞} is the maximum absolute value of all entries of A.

However, rigorously obtaining an approximate version of (14) is non-trivial. First, uj is not an eigenvector of Σ because of the discretization error. The discretization error will be explicitly tracked in all subsequent analysis when comparing uj with v̂j, for example, in the characterization of the population PCA problem (Lemma C.4). Second, v̂j is obtained by solving a penalized eigenvector problem over the deflated Fantope, and hence is not directly comparable to its ideal theoretical counterpart uj, which may not be in the feasible set of problem (5). To overcome this difficulty, we will consider a modified version of uj that is feasible for (5) but still possesses similar smoothness as well as proximity to the true eigenvector of Σ. Furthermore, the sequential estimation procedure (5) involves deflation based on the estimated projection matrix Π̂j−1, which carries over estimation error from previous steps. We will use an induction argument to control the sequential error accumulation.

Due to the sequential error accumulation, the convergence rate involves ρ1 and ρ2 in a complex way. To provide insights for our localized estimation procedure, the following theorem shows how the rate of convergence depends on ρ2 when the smoothing parameter ρ1 = 0. The proof is included in the Online Supplementary File.

Theorem 4.2 (Rate of convergence)

Under assumptions (A1–A4), if ρ1 = 0 and ρ2 → 0 as (n, p) → ∞, then for k as defined in A2, we have

\sup_{1\le j\le k}\|\hat\phi_j(t) - \phi_j(t)\|_2 = O_P(e_n + \rho_2 + p^{-1}).

The three parts in the rate of convergence correspond to the covariance estimation error, the bias caused by the localization penalty, and the discretization error, respectively. According to the discussion after Assumption (A1), the sup-norm covariance estimation error e_n can be made as small as √(log p/n) or √(log n/n), depending on the estimating method used and the observation scheme. Thus if p grows at the same or a higher order than n, and ρ2 = O(e_n), our LFPCA estimate achieves an error rate of e_n, within a logarithmic factor of the standard FPCA error rate.

5. NUMERICAL STUDY

To illustrate our methods for localized functional principal component analysis, we conduct simulations under two scenarios, Simulation I: localized case and Simulation II: non-localized case. For Simulation I, data {Yil, i = 1, …, n, l = 1, …, p} are generated according to model (2), where tl are equally spaced observational points on [0, 1]. We set μ(t) = 0, ξij ~ N(0, λj), independent, with λj taken from (4², 3², 2.5², 1.25², 1, 0.75², 0.5², 0.25²) and λj = 0 for j > 8, and the measurement errors εil ~ iid N(0, σ²). We generate the eigenfunctions as follows. Let ϕ̃1(t) = B3(t), ϕ̃2(t) = B6(t) and ϕ̃3(t) = B9(t), where Bb(t) is the bth cubic B-spline basis function on [0, 1] with 8 equally spaced interior knots. For j > 3, ϕ̃j(t) = √2 cos((j+1)πt) for odd values of j and ϕ̃j(t) = √2 sin(jπt) for even values of j, 0 ≤ t ≤ 1. Then ϕj(t), 1 ≤ j ≤ 8, are obtained by applying Gram–Schmidt orthonormalization to the set of ϕ̃j(t), 1 ≤ j ≤ 8. For Simulation II, we use ϕj(t) = √2 cos((j+1)πt) for odd values of j and ϕj(t) = √2 sin(jπt) for even values of j, 0 ≤ t ≤ 1, for 1 ≤ j ≤ 8. The rest is the same as Simulation I.
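As a concrete rendering of this setup, the following is a minimal sketch (Python with NumPy/SciPy; an illustration, not the authors' simulation code) of the Simulation I data-generating process: localized B-spline functions for the first three components, Fourier functions for the rest, Gram–Schmidt orthonormalization, Gaussian scores, and additive measurement error as in model (2).

```python
import numpy as np
from scipy.interpolate import BSpline

def simulate_localized(n=100, p=100, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, p)
    lam = np.array([4, 3, 2.5, 1.25, 1, 0.75, 0.5, 0.25]) ** 2   # eigenvalues

    # cubic B-spline basis on [0, 1] with 8 equally spaced interior knots (12 basis functions)
    knots = np.concatenate([np.zeros(4), np.linspace(0, 1, 10)[1:-1], np.ones(4)])
    def bspline(b):                                   # b-th basis function (1-indexed) on the grid
        c = np.zeros(12)
        c[b - 1] = 1.0
        return BSpline(knots, c, 3)(t)

    raw = [bspline(3), bspline(6), bspline(9)]        # localized components j = 1, 2, 3
    for j in range(4, 9):                             # Fourier functions for j = 4, ..., 8
        raw.append(np.sqrt(2) * (np.cos((j + 1) * np.pi * t) if j % 2 else np.sin(j * np.pi * t)))

    # Gram-Schmidt orthonormalization with respect to the L2([0, 1]) inner product (1/p weights)
    phi = []
    for f in raw:
        for g in phi:
            f = f - (f @ g / p) * g
        phi.append(f / np.sqrt(f @ f / p))
    phi = np.vstack(phi)                              # 8 x p matrix of eigenfunctions

    xi = rng.normal(0.0, np.sqrt(lam), size=(n, 8))   # scores xi_ij ~ N(0, lambda_j)
    Y = xi @ phi + rng.normal(0.0, sigma, size=(n, p))  # model (2): curves plus noise
    return t, Y, phi
```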

We investigate the performance of the proposed LFPCA under varying combinations of the sample size n, the number of observations per curve p, and the noise level σ². More importantly, we compare the estimates given by four different methods: (i) (ρ1, ρ2) = (0, 0) corresponds to the ordinary PCA estimation directly obtained from the sample covariance; (ii) (ρ1, ρ2) = (ρ̂1, 0) corresponds to the smoothed eigenfunction estimation without localization, where ρ̂1 was chosen by 5-fold cross-validation as discussed in Section 3.2; empirically, we found these estimated eigenfunctions almost identical to those estimated from a smoothed covariance function or pre-smoothed individual curves; (iii) NonSeq corresponds to the subspace method developed in Vu et al. (2013); for comparison purposes, we incorporate the roughening matrix, i.e., use S − ρ1D as the input matrix in (7), with ρ̂1 chosen by 5-fold cross-validation, the same as used in (ii) and (iv), and the sparse tuning parameter chosen by 5-fold cross-validation as ρ̂2 = argmax_{ρ∈𝒜2} Σ_{v=1}^V 〈Ĥ^(−v)(ρ1, ρ), Sv〉; we also note that their proposed method only outputs k basis vectors of the k-dimensional subspace, and one needs to rotate that basis to obtain eigenvectors; (iv) (ρ1, ρ2) = (ρ̂1, ρ̂2) corresponds to the proposed LFPCA with tuning parameters selected by 5-fold cross-validation as detailed in (9) and (10). Each setting is repeated 200 times to assess the average performance. The number of included components k is chosen to account for at least 85% of the total variance, i.e., Σ_{j=1}^k FVE(v̂j) ≥ 85%, where FVE is defined in (11). The selected k is quite robust among all simulation settings, and the average number over 200 simulations is 3.01.

The estimated eigenfunctions ϕj(t), j = 1, 2, 3, from a typical run of Simulation I with p = 100, n = 100, and σ = 1 are visualized in Figure 1. The four columns correspond to the results given by the four methods described above. One can clearly see the improvement from adding the smoothing and localization penalties. The 5-fold cross-validation choice (ρ̂1, ρ̂2) leads to almost perfect recovery of the true eigenfunctions. The fitted curves X̂i(t) = μ̂(t) + Σ_{j=1}^k ξ̂ij ϕ̂j(t) for nine randomly chosen subjects are shown in Figure 2, where ϕ̂j(t) are obtained using (ρ̂1, ρ̂2). It demonstrates accurate recovery of the true curves Xi(t).

Figure 1.


True (blue-solid) and estimated (red-dashed) eigenfunctions ϕj(t), j = 1, 2, 3, in one run of Simulation I, with n = 100, p = 100 and σ = 1, by four different methods as described in Section 5. Tuning parameters are chosen by 5-fold cross-validation.

Figure 2.


Noisy observations and recovered functions X̂i(t) (red-solid) for nine randomly selected subjects, as obtained in one run of Simulation I with n = 100, p = 100 and σ = 1, using (ρ̂1, ρ̂2) chosen by 5-fold cross-validation.

To better quantify the performance in estimating ϕj(t), we report the ℓ2 distance ||ϕj(t) − ϕ̂j(t)||2. The medians of the errors over 200 simulation runs are reported in Table 1. The results are quite similar for different levels of p and σ, so only results for p = 100 and σ = 1 are reported, with varying sample sizes n. The errors are found to decline with increasing sample size n, as expected. For Simulation I (localized case), the proposed LFPCA with ρ̂2 chosen by cross-validation significantly outperforms the other methods. The ρ̂2 chosen by cross-validation well approximates the true localization level of the eigenfunctions; see Figure 3 for an illustration, and detailed discussion can be found in Section 3.2. NonSeq, the subspace method proposed by Vu et al. (2013), does not perform well, since the union of the subdomains under consideration is the entire domain, not ‘sparse’ at all in their setting. The results demonstrate the advantage of our proposed sequential method. For Simulation II (non-localized case), the 5-fold cross-validation method combined with the proposed LFPCA chooses ρ̂2,j to be very close to 0, and, as expected, the ℓ2 errors of the four methods are almost identical.

Table 1.

Results for simulation: reporting the median of errors (||ϕjϕ̂j||2) for ϕj, j = 1, 2, 3, (with median absolute deviations in parentheses) over 200 simulation runs, with σ = 1, p = 100, and varying sample sizes n, where NonSeq is the subspace method developed in Vu et al. (2013).

                        Simulation I: Localized                    Simulation II: Non-localized
       Method           n = 50       n = 100      n = 200          n = 50       n = 100      n = 200
ϕ1     (0, 0)           0.22 (0.09)  0.18 (0.06)  0.13 (0.04)      0.22 (0.08)  0.15 (0.05)  0.12 (0.04)
       (ρ̂1, 0)          0.22 (0.09)  0.18 (0.07)  0.13 (0.04)      0.21 (0.08)  0.15 (0.06)  0.11 (0.04)
       NonSeq           0.22 (0.09)  0.18 (0.07)  0.13 (0.04)      0.21 (0.08)  0.15 (0.06)  0.11 (0.03)
       (ρ̂1, ρ̂2)         0.12 (0.05)  0.10 (0.03)  0.06 (0.02)      0.22 (0.08)  0.15 (0.05)  0.13 (0.03)

ϕ2     (0, 0)           0.42 (0.16)  0.30 (0.13)  0.20 (0.07)      0.42 (0.16)  0.29 (0.12)  0.19 (0.07)
       (ρ̂1, 0)          0.41 (0.16)  0.30 (0.13)  0.20 (0.07)      0.41 (0.15)  0.28 (0.11)  0.19 (0.06)
       NonSeq           0.40 (0.16)  0.30 (0.12)  0.20 (0.07)      0.41 (0.15)  0.28 (0.11)  0.19 (0.07)
       (ρ̂1, ρ̂2)         0.26 (0.17)  0.14 (0.07)  0.11 (0.05)      0.41 (0.17)  0.28 (0.11)  0.20 (0.07)

ϕ3     (0, 0)           0.37 (0.12)  0.27 (0.11)  0.18 (0.07)      0.31 (0.10)  0.26 (0.09)  0.18 (0.06)
       (ρ̂1, 0)          0.36 (0.12)  0.27 (0.11)  0.18 (0.08)      0.30 (0.10)  0.26 (0.09)  0.17 (0.06)
       NonSeq           0.33 (0.14)  0.27 (0.11)  0.18 (0.07)      0.30 (0.10)  0.26 (0.09)  0.18 (0.06)
       (ρ̂1, ρ̂2)         0.24 (0.15)  0.14 (0.08)  0.09 (0.04)      0.31 (0.11)  0.26 (0.09)  0.18 (0.06)

Figure 3.


Performance of the 5-fold cross-validation for choosing ρ2 for ϕ1. The top panel is for Simulation I (localized case): the peak of the cross-validated inner product 〈H, S〉 (dashed curve, left y-axis) corresponds to ρ̂2 = 3.8. The estimated ϕ1 becomes more localized as ρ2 increases, and an ideal ρ2 is where the ||·||_{1,1} norm of the estimated H (solid curve, right y-axis) meets that of the true discretized projection matrix corresponding to ϕ1 (indicated by the horizontal line). The bottom panel is for Simulation II (non-localized case): ρ̂2 = 1.2.

6. APPLICATION TO COUNTRY MORTALITY DATA

The analysis of human mortality is important in assessing the future demographic prospects of societies and in quantifying differences between countries with regard to overall public health. Functional data analysis approaches have been previously applied to study mortality data (Hyndman et al., 2007; Chiou and Müller, 2009; Chen and Müller, 2012). To study the variational modes of mortality rates across countries over years, we applied the proposed LFPCA method to period life tables for 27 countries, with rates of mortality at age 60 available for each of the calendar years from 1960 to 2006. The data were obtained from the Human Mortality Database (downloaded on March 1, 2011), maintained by the University of California, Berkeley (USA), and the Max Planck Institute for Demographic Research (Germany). The data are available at www.mortality.org or www.humanmortality.de, with a detailed description in Wilmoth et al. (2007).

Let Xi(t) denote the mortality rate in the ith country for subjects at age 60 during calendar year t, where 1960 ≤ t ≤ 2006. We directly compute the sample covariance matrix S from the observed data and apply the proposed algorithm to solve problem (5). The ρ̂1 chosen by 5-fold cross-validation (9) is always used to ensure relatively smooth estimates of the eigenfunctions, and the solution path along different levels of localization is investigated. The 5-fold cross-validation method gave ρ̂2 = 0, indicating that the eigenfunctions are not exactly localized. The estimated eigenfunctions without the localization penalty are given in the top row of Figure 4. We then choose ρ̂2 by the second method, as defined in (13), with a = 30%. The estimated localized basis functions, visualized in the bottom row of Figure 4, reveal several interesting features. The first localized basis function ϕ1(t), explaining 54% of the total variance, indicates that a big variation of the mortality functions Xi(t) around their mean function occurs around the mid-1990s. The second basis function ϕ2(t), with a mode around the 1980s, accounts for 21.2% of the total variation. The third basis function ϕ3(t) characterizes variation of mortality around the 1960s. Although the first localized basis function explains less variance than the first leading eigenfunction, retaining only 70% of its explanatory capability in return for localization, the lost proportion is picked up by the second component. The second component could have explained 30.3% of the variance without localization. Therefore, we only need three localized eigenfunctions to account for more than 85% of the total variance. Plots of estimated eigenvalues using other values of a are provided in the Online Supplementary File.

Figure 4.


Top Row: Estimated eigenfunctions for the mortality data, ρ̂1 is chosen by 5-fold cross validation; Bottom Row: Estimated orthogonal basis functions, ρ̂2 is chosen to maintain rFVE at 70% in (12), and the number of components k = 3 is chosen to explain at least 85% of the total variance.

7. DISCUSSION

In this paper, we propose a localized functional principal component analysis through a Deflated Fantope Localization method, where the sequentially obtained eigenfunctions have guaranteed orthogonality and are allowed to be supported on different localized subdomains. As mentioned in Section 2, the deflated Fantope 𝒟_Π can easily be generalized to a d-dimensional version 𝒟_Π^d. In some applications, one might be interested in estimating mutually orthogonal principal subspaces with dimensions d1, …, dk, where each principal subspace is supported only on a subdomain of 𝒯.

Throughout the paper, we mainly focus on a dense and regular functional data design where p equally spaced observations are recorded on each curve. Most commonly seen functional data have this design, and a sample covariance can be easily computed from the discrete and possibly noisy observations. Our proposed formulation (5) takes the sample covariance S and outputs smooth and localized estimates of the eigenfunctions. In fact, the proposed method puts rather minimal requirements on the input matrix S and does not rely on the design of observations. Consider the discretized version of Γ evaluated on a p × p grid of equally spaced points: the estimation error of the eigenfunctions is directly tied to the maximum entry-wise error of the input matrix S. Here we briefly discuss several scenarios where a sample covariance is not feasible. (i) For dense but irregularly observed functional data, one can simply smooth each curve (Ramsay and Silverman, 2005) or interpolate between points to get p equally spaced observations, and then a p × p covariance estimate S can be computed; this is what we have done for the Berkeley growth data given in the Online Supplementary File. (ii) For sparse functional data where the observations are recorded at random and sparse time points, individual smoothing or interpolation is impossible, but a uniformly consistent covariance estimate can be obtained by, for example, two-dimensional smoothing methods (Yao et al., 2005; Li et al., 2010); our proposed LFPCA can then be applied by taking S as the discretized version of the smooth covariance estimator. (iii) For ultra dense and noisy data, the independent measurement errors accumulate if one uses the sample covariance computed from the raw measurements, and using a large p × p matrix is not computationally efficient; we recommend performing pre-smoothing or pre-binning on individual curves and choosing a grid with a moderate size p.
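As an illustration of scenario (i), here is a minimal sketch (Python with NumPy; an illustration under our own naming, not the authors' code) that linearly interpolates densely but irregularly observed curves onto a common equally spaced grid and forms the sample covariance to be used as the input S in (5).

```python
import numpy as np

def covariance_from_irregular(times, values, p=100):
    """times, values: lists of 1-d arrays (one pair per curve, dense but irregular on [0, 1])."""
    grid = np.linspace(0.0, 1.0, p)
    curves = np.vstack([np.interp(grid, ti, yi) for ti, yi in zip(times, values)])
    return grid, np.cov(curves, rowvar=False)        # equally spaced grid and p x p input matrix S
```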

We proposed two methods for choosing the localization parameter ρ2. When the cross-validation method chooses ρ2 = 0, it roughly means that the true eigenfunctions are not localized. Then, for a fixed value of a, we find a set of orthogonal basis functions that are localized and still retain the ability to explain a fair amount of variance (at least a (1 − a) proportion). In this case, the outcome depends on the choice of a, and the resulting basis functions should not be interpreted as the true eigenfunctions. Rather, these localized basis functions and the corresponding projection scores have ready interpretations with domain knowledge.

Supplementary Material


Acknowledgments

Jing Lei’s research is partially supported by NSF Grant BCS-0941518, NSF Grant DMS-1407771 and NIH Grant MH057881.

References

  1. Bhatia R. Matrix Analysis. Vol. 169. Springer; 1997.
  2. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning. 2011;3:1–122.
  3. Bunea F, Xiao L. On the sample covariance matrix estimator of reduced effective rank population matrices, with applications to fPCA. Bernoulli. 2015, to appear.
  4. Cardot H. Nonparametric estimation of smoothed principal components analysis of sampled noisy functions. Journal of Nonparametric Statistics. 2000;12:503–538.
  5. Castro PE, Lawton WH, Sylvestre EA. Principal modes of variation for processes with continuous sample curves. Technometrics. 1986;28:329–337.
  6. Chen K, Müller H-G. Modeling repeated functional observations. Journal of the American Statistical Association. 2012;107:1599–1609.
  7. Chiou J-M, Müller H-G. Modeling hazard rates as functional data for the analysis of cohort lifetables and mortality forecasting. Journal of the American Statistical Association. 2009;104:572–585.
  8. d’Aspremont A, El Ghaoui L, Jordan M, Lanckriet G. A direct formulation of sparse PCA using semidefinite programming. SIAM Review. 2007;49.
  9. Dattorro J. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing, USA; 2005. v2012.01.28.
  10. Gasser T, Müller H-G, Köhler W, Prader A, Largo R, Molinari L. An analysis of the mid-growth and adolescent spurts of height based on acceleration. Annals of Human Biology. 1985;12:129–148. doi: 10.1080/03014468500007631.
  11. Hall P, Hosseini-Nasab M. On properties of functional principal components analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006;68:109–126.
  12. Hall P, Müller H-G, Wang J-L. Properties of principal component methods for functional and longitudinal data analysis. The Annals of Statistics. 2006:1493–1517.
  13. Huang JZ, Shen H, Buja A. Functional principal components analysis via penalized rank one approximation. Electronic Journal of Statistics. 2008;2:678–695.
  14. Hyndman RJ, Ullah S. Robust forecasting of mortality and fertility rates: a functional data approach. Computational Statistics & Data Analysis. 2007;51:4942–4956.
  15. James GM, Wang J, Zhu J. Functional linear regression that’s interpretable. The Annals of Statistics. 2009:2083–2108.
  16. Kneip A, Sarda P. Factor models and variable selection in high-dimensional regression analysis. The Annals of Statistics. 2011;39:2410–2447.
  17. Lei J, Vu VQ. Sparsistency and agnostic inference in sparse PCA. The Annals of Statistics. 2015, to appear.
  18. Li Y, Hsing T. Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. The Annals of Statistics. 2010;38:3321–3351.
  19. Lin Z. Some perspectives of smooth and locally sparse estimators. Master’s thesis, Simon Fraser University, Canada; 2013.
  20. Mackey L. Deflation methods for sparse PCA. Advances in Neural Information Processing Systems. 2008;21:1017–1024.
  21. Mühl A, Herkner K, Swoboda W. The mid-growth spurt – a pre-puberty growth spurt. Review of its significance and biological correlations. Padiatrie und Padologie. 1991;27:119–123.
  22. Ramsay JO, Silverman BW. Functional Data Analysis. 2nd ed. Springer Series in Statistics. New York: Springer; 2005.
  23. Rice JA, Silverman BW. Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society, Series B (Methodological). 1991:233–243.
  24. Sheehy A, Gasser T, Molinari L, Largo R. An analysis of variance of the pubertal and midgrowth spurts for length and width. Annals of Human Biology. 1999;26:309–331. doi: 10.1080/030144699282642.
  25. Silverman BW. Smoothed functional principal components analysis by choice of norm. The Annals of Statistics. 1996;24:1–24.
  26. Tuddenham R, Snyder M. Physical growth of California boys and girls from birth to age 18. Calif Publ Child Deve. 1954;1:183–364.
  27. Van Der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Springer; 1996.
  28. Vu VQ, Cho J, Lei J, Rohe K. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. Advances in Neural Information Processing Systems. 2013:2670–2678.
  29. Vu VQ, Lei J. Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics. 2013;41:2905–2947.
  30. White PA. The computation of eigenvalues and eigenvectors of a matrix. Journal of the Society for Industrial & Applied Mathematics. 1958;6:393–437.
  31. Wilmoth JR, Andreev K, Jdanov D, Glei DA, Boe C, Bubenheim M, Philipov D, Shkolnikov V, Vachon P. Methods protocol for the Human Mortality Database. University of California, Berkeley, and Max Planck Institute for Demographic Research, Rostock; 2007. http://mortality.org [version 31/05/2007].
  32. Yao F, Müller H-G, Wang J-L. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association. 2005;100:577–590.
  33. Zhao Y, Ogden RT, Reiss PT. Wavelet-based LASSO in functional linear regression. Journal of Computational and Graphical Statistics. 2012;21:600–617. doi: 10.1080/10618600.2012.679241.
  34. Zhou J, Wang N-Y, Wang N. Functional linear model with zero-value coefficient function at sub-regions. Statistica Sinica. 2013;23:25. doi: 10.5705/ss.2010.237.
