Abstract
Sufficient dimension reduction (SDR) embodies a family of methods that aim to reduce dimensionality without loss of information in a regression setting. In this article, we propose a new method for nonparametric function-on-function SDR, where both the response and the predictor are functions. We first develop the notions of functional central mean subspace and functional central subspace, which form the population targets of our functional SDR. We then introduce an average Fréchet derivative estimator, which extends the gradient of the regression function to the operator level and enables us to develop estimators for our functional dimension reduction spaces. We show that the resulting functional SDR estimators are unbiased and exhaustive, and, more importantly, that they achieve these properties without imposing any distributional assumptions, such as the linearity or the constant variance conditions that are commonly imposed by all existing functional SDR methods. We establish the uniform convergence of the estimators for the functional dimension reduction spaces, while allowing both the number of Karhunen–Loève expansions and the intrinsic dimension to diverge with the sample size. We demonstrate the efficacy of the proposed methods through both simulations and two real data examples.
Keywords: Functional central mean subspace, functional central subspace, function-on-function regression, unbiasedness, exhaustiveness, consistency, reproducing kernel Hilbert space
MSC2020 subject classifications: 62B05, 62G08, 62G20, 62R10
1. Introduction.
Sufficient dimension reduction (SDR) embodies a family of regression methods that seek a low-dimensional representation of high-dimensional data while minimizing the loss of regression information. Since the pioneering work of sliced inverse regression (Li (1991)), SDR has enjoyed rapid development over the past three decades, and has been widely used in a variety of applications, including biology, finance, and medical science, among others. In its full generality, SDR targets the entire conditional distribution of Y|X, where Y is a response variable, X is a p-dimensional predictor vector, and (X, Y) has a joint distribution. It seeks a lower-dimensional representation R(X) such that Y ╨ X | R(X). In many applications, the regression interest focuses on the conditional mean E(Y|X) only. In that case, Cook and Li (2002) developed the notion of sufficient mean dimension reduction, by seeking R(X) such that Y ╨ E(Y|X) | R(X), or equivalently, E(Y|X) = E{Y|R(X)}. The reduction R(X) usually takes the form of linear combinations of X, that is, R(X) = β⊤X, where β is a p × q matrix with q ≤ p. Sufficient reduction or sufficient mean reduction then pursues the minimum subspace spanned by β, which uniquely exists under very mild conditions (Yin, Li and Cook (2008)). Such a space is called the central subspace or the central mean subspace, and its dimension q is called the intrinsic dimension. A large body of methods has been proposed for SDR, usually in a nonparametric fashion. Broadly speaking, depending on their estimation strategies, these methods can be grouped into two categories. One category utilizes the first and second moments of the inverse regression X|Y, which, under certain distributional assumptions, contain useful information about R(X); see, for example, Cook and Weisberg (1991), Li (1991), Li, Zha and Chiaromonte (2005), Li and Wang (2007), among others. The other category utilizes the gradient of the forward regression mean E(Y|X); see, for example, Härdle (1989), Xia (2007), Xia et al. (2002), Yin and Li (2011), Fukumizu and Leng (2014), among others. See Li (2018a) for a comprehensive review of SDR.
In this article, we target the problem of sufficient reduction and sufficient mean reduction when both the response and the predictor are functions. Function-on-function regression has received increasing attention in recent years, and is widely used in applications such as environmental science, neuroimaging analysis, and e-commerce (Kim et al. (2018), Luo and Qi (2017), Müller and Yao (2008), Reimherr, Sriperumbudur and Taoufik (2018), Sun et al. (2018), Luo and Qi (2019)). We aim to relax the parametric or semiparametric model assumptions, and propose a new method of nonparametric function-on-function SDR. We first develop the notions of functional central mean subspace and functional central subspace, which are the population targets of our functional SDR inquiries. Motivated by the fact that the gradient of the regression mean function E(Y|X) lies in the central mean subspace or the central subspace (Xia et al. (2002), Xia (2007)), we extend the idea to the functional setting, and show that the Riesz representation of the Fréchet derivative of the regression functional is located in the functional central mean subspace or the functional central subspace. We then propose the corresponding functional dimension reduction estimators based on the average Fréchet derivative, and show that the resulting estimators are both unbiased and exhaustive. Moreover, our proposal leads naturally to a procedure for predicting any functional of the response after dimension reduction. Theoretically, we establish the uniform convergence for the estimated bases of the dimension reduction spaces, while allowing both the number of Karhunen–Loève expansions and the intrinsic dimension to diverge with the sample size.
Our proposal is closely related to but also clearly different from a number of lines of research on sufficient dimension reduction and functional regression. We next review the relevant literature and discuss the connections and differences with our proposal.
First, our proposal extends the gradient-based dimension reduction methods such as Härdle (1989), Xia (2007), Xia et al. (2002), Yin and Li (2011), Fukumizu and Leng (2014) from the random variable setting to the random function setting. However, such an extension is far from routine, and our new solution is utterly different, in both computation and theory, from the existing ones. On the computation side, most existing gradient-based methods approximate the high-dimensional gradients using iterative least squares, and the resulting computation can be intensive (Xia (2007), Xia et al. (2002), Yin and Li (2011)). To address this issue, we establish the interchangeability between the Fréchet derivative and the reproducing kernel Hilbert space (RKHS), which in effect provides a closed form of the Fréchet derivative. Consequently, our algorithm requires only spectral decompositions of linear operators and no iterative optimization, and hence is computationally much simpler. On the theory side, in the random variable setting, the variable dimension p and the intrinsic dimension q after dimension reduction are usually treated as fixed. Only recently, Lin et al. (2017) and Lin, Zhao and Liu (2019) established the consistency of sliced inverse regression with a diverging intrinsic dimension q. By contrast, in the random function setting, both the response and the predictor are random elements in a potentially infinite-dimensional Hilbert space. We need to handle a diverging number of Karhunen–Loève (KL) expansions of the random functions, and this portion of the asymptotic analysis requires a more delicate treatment than in the classical setting. Besides, we allow the intrinsic dimension q to diverge with the sample size.
Second, our work is also related to a family of proposals that extend the inverse regression-based SDR methods to the functional setting. The first generalization was introduced by Ferré and Yao (2003, 2005), who extended the predictor space from a Euclidean space to a Hilbert space, then proposed a functional sliced inverse regression method. Hsing and Ren (2009), Jiang, Yu and Wang (2014), Wang, Lin and Zhang (2013), Wang et al. (2015), Yao, Lei and Wu (2015) further extended and developed a series of inverse regression-based estimators in the functional setting. All these methods involve a scalar response Y and a functional predictor X, target sufficient reduction of the full conditional distribution Y|X, and are based on some moments of the inverse regression X|Y. Our proposal, on the one hand, extends SDR to the setting where not only the predictor but also the response is a function. More importantly, our method is built on the derivative of the forward regression, instead of the inverse regression. A critical implication of this difference is that our functional SDR estimators achieve unbiasedness and exhaustiveness without imposing distributional assumptions such as the linearity condition or the constant variance condition that are commonly imposed by all existing inverse moment-based functional SDR methods. As pointed out by Ma and Zhu (2012, 2013) in the random variable setting, those distributional assumptions can be restrictive, and our proposal is the first unbiased and exhaustive solution that relaxes those conditions in the random function setting.
More recently, Li and Song (2017) proposed a nonlinear SDR method for functional data. Our proposal is similar, in that both target SDR for functional data, and both methods are built on linear operators. However, the two also differ considerably in terms of methodology and theory. On the methodology side, Li and Song (2017) focused on nonlinear reduction, rather than linear reduction. Although nonlinear reduction is more flexible in the sense that its reduced predictors do not have to be linear, the linear framework of our proposal is still pivotally important. Linear SDR is usually easier to interpret, and is better connected with other modeling techniques, since it preserves the original coordinates of the predictors; see Li ((2018a), Chapter 14) for more comparison between linear and nonlinear SDR. Moreover, our estimator cannot be directly induced from that of Li and Song (2017). On the theory side, Li and Song (2017) required the constant variance assumption to show that their generalized sliced average variance estimator is unbiased, which can be restrictive. By contrast, our estimator does not impose such a condition. Furthermore, our asymptotic analysis requires a different set of tools from that of Li and Song (2017). Specifically, we employ the leading KL coefficients to approximate the sample covariance operators to accelerate the computation, which was not considered in Li and Song (2017). As a result, we need to take into account this extra layer of approximation when proving the consistency. More importantly, Li and Song (2017) treated the intrinsic dimension q as fixed, while we allow q to diverge, which leads to a completely different regime of asymptotics.
In summary, our proposal is profoundly different from the existing sufficient dimension reduction methods. The extension from random variables to random functions, and the change from inverse regression-based to forward regression-based estimation, are both far from straightforward. Together they lead to a new set of methodological and theoretical results, and make a useful addition to the toolbox of dimension reduction and functional data analysis.
The rest of the article is organized as follows. Section 2 introduces the functional central mean and central subspaces. Section 3 develops the average Fréchet derivative estimator for functional SDR, and establishes the population properties including the unbiasedness and exhaustiveness. Section 4 develops the estimation procedure, both at the operator level and under a coordinate system. Section 5 derives the asymptotic properties. Section 6 presents the simulations and two real data examples. Section 7 concludes the paper with a discussion. The Supplementary Material (Lee and Li (2022)) collects additional results and proofs.
2. Functional dimension reduction subspaces.
In this section, we first formally define the functional central mean and central subspaces, which are the population targets of functional SDR. We then introduce a series of linear operators useful for SDR estimation.
2.1. Functional central mean subspace and central subspace.
Let (Ω, ℱ, P) be a probability space. Let X : Ω → ΩX, Y : Ω → ΩY be the ΩX-valued and ΩY-valued random elements, where ΩX and ΩY are Hilbert spaces of functions on an interval 𝒯, with inner products 〈·, ·〉ΩX and 〈·, ·〉ΩY, respectively. Let PX = P ∘ X−1 and PY = P ∘ Y−1 be the distributions of X and Y. We first seek sufficient reduction for the mean of the conditional distribution Y|X.
Definition 1. Suppose 𝒮 is a linear, closed subspace of ΩX, and P𝒮 : ΩX → 𝒮 is the projection onto 𝒮. If, for all ψ ∈ ΩY,
E(〈ψ, Y〉ΩY | X) = E(〈ψ, Y〉ΩY | P𝒮X),     (1)
we call 𝒮 a functional mean dimension reduction subspace. Let 𝒯 denote the collection of all 𝒮 satisfying (1). If 𝒮E(Y|X) = ∩{𝒮 : 𝒮 ∈ 𝒯} is in 𝒯, we call 𝒮E(Y|X) the functional central mean subspace.
We next seek sufficient reduction for the entire conditional distribution Y|X. The idea follows that of Xia (2007), Yin and Li (2011), in that one can extend the estimation of the central mean subspace to that of the central subspace through a class of functions that characterize the conditional distribution Y | X. More specifically, note that the collection of conditional means {E{g(Y)|X} : g ∈ ℋY} characterizes the full information of Y | X, as long as the class of functions ℋY is sufficiently rich. This implies that, if a subspace is a functional mean dimension reduction subspace for E{g(Y)|X} for all g ∈ ℋY, then it is also a functional dimension reduction subspace for Y | X. We formalize this idea below.
Definition 2. Let ℋY be a dense subset of L2(PY) modulo constants. Suppose 𝒮 is a linear, closed subspace of ΩX, and P𝒮 : ΩX → 𝒮 is the projection onto 𝒮. If, for all g ∈ ℋY,
E{g(Y) | X} = E{g(Y) | P𝒮X},     (2)
we call 𝒮 a functional dimension reduction subspace. Let 𝒯 denote the collection of all 𝒮 satisfying (2). If 𝒮Y|X = ∩{𝒮 : 𝒮 ∈ 𝒯} is in 𝒯, we call 𝒮Y|X the functional central subspace.
Here we say ℋY is dense in L2(PY) modulo constants if, for any g ∈ L2(PY), there exists a sequence {gk : k ∈ ℕ} in ℋY, such that E{gk(Y) − g(Y)}2 → 0 as k → ∞, where ℕ denotes the collection of natural numbers. We also note that, if ℋY is a dense subset of L2(PY), then ℋY is characteristic (Fukumizu, Bach and Jordan (2009)). Therefore, (2) is equivalent to Y ╨ X | P𝒮X. This justifies why (2) can characterize the conditional distribution Y | X. It also shows that 𝒮Y|X does not depend on the choice of ℋY.
In both definitions of 𝒮E(Y|X) and 𝒮Y|X, we are seeking the smallest subspace in 𝒯 that satisfies (1) or (2), so as to achieve maximal dimension reduction. Toward that goal, we take the intersection of all subspaces in 𝒯, and assume such an intersection still belongs to 𝒯. This actually holds under some mild conditions, as we show next. We first introduce the concept of an M-set in a product of Hilbert spaces.
Definition 3. Let Ω1 and Ω2 denote two generic Hilbert spaces, and M a subset of the product space Ω1 × Ω2. If, for any two pairs of points (ω1, ω2), (ω1′, ω2′) ∈ M, there exists a sequence of pairs of points (ω1(j), ω2(j)), j = 1, …, J, with (ω1(1), ω2(1)) = (ω1, ω2) and (ω1(J), ω2(J)) = (ω1′, ω2′), such that, (a) (ω1(j), ω2(j)) ∈ M, for j = 2, …, J − 1, and (b) ω1(j) = ω1(j+1) or ω2(j) = ω2(j+1), for j = 1, …, J − 1, then we call M an M-set in Ω1 × Ω2.
Yin, Li and Cook (2008) introduced the notion of the M-set for the random variable setting, and Definition 3 extends it to the random function setting. The conditions in Definition 3 require that any two points in an M-set can be connected via a “stairway”, and that only the corner points need to belong to the M-set. These are fairly mild conditions; for example, any convex and open subset of Ω1 × Ω2 is an M-set. Let supp(X) denote the support of X. For linear and closed subspaces 𝒮1 and 𝒮2 of ΩX, and g a member of ΩY ∪ ℋY, denote the associated subset of 𝒮1 × 𝒮2 by Ω(𝒮1, 𝒮2, g). Let 𝒯 be the collection of all subspaces satisfying (1) or (2). The next theorem justifies the existence of 𝒮E(Y|X) and 𝒮Y|X.
Theorem 1. Suppose, for any 𝒮1 and 𝒮2 in 𝒯, Ω(𝒮1, 𝒮2, g) is an M-set in 𝒮1 × 𝒮2, for all g in the corresponding class of functions in (1) or (2).
If the subspaces in 𝒯 satisfy (1), then there exists a unique intersection, 𝒮E(Y|X) = ∩{𝒮 : 𝒮 ∈ 𝒯} ⊆ ΩX, satisfying that 𝒮E(Y|X) is in 𝒯, and 𝒮E(Y|X) ⊆ 𝒮, for all 𝒮 ∈ 𝒯;
If the subspaces in 𝒯 satisfy (2), then there exists a unique intersection, 𝒮Y|X = ∩{𝒮 : 𝒮 ∈ 𝒯 } ⊆ ΩX, satisfying that 𝒮Y|X is in 𝒯, and 𝒮Y|X ⊆ 𝒮, for all 𝒮 ∈ 𝒯.
We also note that the sufficient predictors in Definitions 1 and 2 are linear mappings on ΩX. Consider the full reduction as an example, and let 𝒮 = Span{ϕi ∈ ΩX : i = 1, …, q}. Then P𝒮X is fully determined by the linear functionals 〈ϕ1, X〉ΩX, …, 〈ϕq, X〉ΩX of X.
For this reason, our functional SDR is a linear dimension reduction. By contrast, Li and Song (2017) studied nonlinear dimension reduction, by relaxing the linear constraint and allowing the sufficient predictors to be nonlinear mappings of the form f1(X), …, fq(X), where f1, …, fq are elements of the RKHS ℋX. This key difference leads to two sets of utterly different methodology and theory for our proposal and for Li and Song (2017), as detailed in Sections 1 and 7.
We next give some concrete examples of 𝒮E(Y|X) and 𝒮Y|X. For a positive definite kernel function κT : 𝒯 × 𝒯 → ℝ, let ΩT denote the RKHS induced by κT, which can be constructed as the closure of Span{κT(·, t) : t ∈ 𝒯}, that is, the closure of the space spanned by this set of functions. Next, consider the Brownian motion kernel, κT(t1, t2) = min(t1, t2), and let (αj, βj) denote the jth pair of eigenvalue and eigenfunction of κT, with αj = 1/{(j − 0.5)π}2 and βj(t) = sin{(j − 0.5)πt} (see, e.g., Amini and Wainwright (2012)). We then construct the predictor function X(t) and the error function ϵ(t) using the leading pairs of eigenvalues and eigenfunctions, where
E(aj) = E(bk) = 0, j = 1, …, J, k = 1, …, K, and (a1, …, aJ)⊤ and (b1, …, bK)⊤ are independent. By this construction, both ϵ(t) and X(t) are elements in ΩT. Besides, for any ϕ ∈ ΩT, . Moreover, because there are one-to-one correspondences between X(t) and (a1, …, aJ)⊤, and between ϵ(t) and (b1, …, bK)⊤, we have ϵ(t) ╨ X(t). Following this construction, we consider the following examples.
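To make this construction concrete, the following sketch numerically verifies that the stated pairs (αj, βj) act as eigenvalue/eigenfunction pairs of the Brownian motion kernel on a fine grid, and then assembles a curve from a few leading pairs. The truncation level and the αj-weighting of the coefficients aj are illustrative assumptions of this sketch, not the exact construction used above.

```python
import numpy as np

# Fine grid on [0, 1] and the Brownian motion kernel kappa_T(s, t) = min(s, t).
t = np.linspace(0, 1, 201)
K = np.minimum.outer(t, t)

# Stated eigen-pairs: alpha_j = 1 / {(j - 0.5) pi}^2, beta_j(t) = sin{(j - 0.5) pi t}.
j = np.arange(1, 6)
alpha = 1.0 / ((j - 0.5) * np.pi) ** 2
beta = np.sin(np.outer(t, (j - 0.5) * np.pi))        # columns are beta_1, ..., beta_5 on the grid

# Check that the integral operator f -> int kappa_T(., s) f(s) ds maps beta_j to alpha_j * beta_j,
# approximating the integral by a Riemann sum on the grid.
dt = t[1] - t[0]
print(np.max(np.abs(K @ beta * dt - beta * alpha)))  # small discretization error

# An illustrative element of Omega_T built from the leading pairs (the alpha_j weighting
# below is an assumption of this sketch).
rng = np.random.default_rng(0)
a = rng.standard_normal(len(j))
X_curve = beta @ (a * alpha)
```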
Example 1. Suppose there exist ϕ1, ϕ2 ∈ ΩT, such that X(t) and Y(t) satisfy that
where m1(·) and m2(·) are mappings from ℝ to ℝ, σ is a positive constant, and 〈ϕ, ψ〉 denotes the inner product 〈ϕ, ψ〉ΩT for any ϕ, ψ ∈ ΩT. Then Y is an ΩT-valued random element, and 𝒮E(Y|X) = Span{ϕ1, ϕ2}.
Example 2. Suppose there exist ϕ1, ϕ2, ϕ3 ∈ ΩT, such that X(t) and Y(t) satisfy that
where m1(·) to m3(·) are mappings from ℝ to ℝ, and 〈ϕ, ψ〉 denotes the inner product 〈ϕ, ψ〉ΩT for any ϕ, ψ ∈ ΩT. Then Y is an ΩT-valued random element, and 𝒮Y|X = Span{ϕ1, ϕ2, ϕ3}.
Finally, we note that it is straightforward to show 𝒮E(Y|X) ⊆ 𝒮Y|X. This echoes the classical result in the random variable setting. In Example 2, 𝒮E(Y|X) is a proper subset of 𝒮Y|X.
2.2. Functional regression operator.
By Definitions 1 and 2, the subspaces of interest are defined via the conditional expectations E(〈ψ, Y〉ΩY | X) or E{g(Y)|X}, that is, the regression functionals that link the predictor function X and the response function Y. In order to characterize such a regression relationship, we introduce the functional regression operator as an extension of the regression coefficient to both functional and nonlinear settings.
We first define the nested kernel and the associated RKHS. A positive definite kernel κX : ΩX × ΩX → ℝ can be constructed via the inner product 〈·, ·〉ΩX, if there exists a mapping ρ : ℝ3 → ℝ, such that
κX(x1, x2) = ρ(〈x1, x1〉ΩX, 〈x1, x2〉ΩX, 〈x2, x2〉ΩX), for any x1, x2 ∈ ΩX.     (3)
Because κX is uniquely characterized by ΩX, it is called a nested kernel. Examples of nested kernels include the radial basis function kernel κX(x1, x2) = exp(−γ‖x1 − x2‖2ΩX), the polynomial kernel κX(x1, x2) = (〈x1, x2〉ΩX + c)r, among others. Then, given κX, a nested RKHS ℋX can be induced by this kernel. Similarly, a positive definite kernel κY : ΩY × ΩY → ℝ and a nested RKHS ℋY can be constructed via the inner product 〈·, ·〉ΩY.
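Because a nested kernel depends on its arguments only through inner products in ΩX, it can be evaluated for discretized curves once those inner products are approximated. The sketch below does this for a radial-basis-type nested kernel, using a simple Riemann-sum approximation of an L2 inner product; the bandwidth gamma and this crude inner-product approximation are illustrative assumptions (Section 4.2 describes the coordinate-based estimates actually used).

```python
import numpy as np

def inner(f, g, t):
    """Riemann-sum approximation of the inner product of two discretized curves."""
    return float(np.sum(f * g) * (t[1] - t[0]))

def nested_rbf(x1, x2, t, gamma=1.0):
    """Radial-basis-type nested kernel evaluated through inner products only:
    kappa_X(x1, x2) = exp(-gamma * ||x1 - x2||^2), where
    ||x1 - x2||^2 = <x1, x1> - 2 <x1, x2> + <x2, x2>."""
    sq_dist = inner(x1, x1, t) - 2.0 * inner(x1, x2, t) + inner(x2, x2, t)
    return np.exp(-gamma * sq_dist)

# Example with two curves observed at m = 50 common time points.
t = np.linspace(0, 1, 50)
x1, x2 = np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)
print(nested_rbf(x1, x2, t, gamma=0.5))
```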
We next introduce some background and notation. Let ℋ, ℋ′ be two generic Hilbert spaces, and A a linear mapping from ℋ to ℋ′. Define the norm of A as ‖A‖ = sup{‖Af‖ℋ′ : f ∈ ℋ, ‖f‖ℋ = 1}, where ‖ · ‖ℋ and ‖ · ‖ℋ′ are the norms of ℋ and ℋ′, respectively. An operator is said to be bounded if its norm is finite. A linear operator A : ℋ → ℋ′ is said to be Hilbert-Schmidt if Σj‖Afj‖2ℋ′ < ∞ for an orthonormal basis {fj : j ∈ ℕ} of ℋ. Define the Hilbert-Schmidt norm of A as ‖A‖HS = (Σj‖Afj‖2ℋ′)1/2. Let B(ℋ, ℋ′) be the collection of all bounded linear operators from ℋ to ℋ′, and B2(ℋ, ℋ′) the collection of all Hilbert-Schmidt operators from ℋ to ℋ′. Note that B(ℋ, ℋ′) is a Banach space endowed with the operator norm ‖ · ‖, and B2(ℋ, ℋ′) is a Hilbert space with its inner product defined as 〈A1, A2〉HS = Σj〈A1fj, A2fj〉ℋ′, for any A1, A2 ∈ B2(ℋ, ℋ′). Because ‖A‖ ≤ ‖A‖HS, it holds that B2(ℋ, ℋ′) ⊆ B(ℋ, ℋ′) (Weidmann (1980)). We abbreviate B(ℋ) = B(ℋ, ℋ) and B2(ℋ) = B2(ℋ, ℋ) whenever appropriate. Furthermore, for an operator A, let ker(A), ran(A), and A* denote its null space, range, and adjoint, respectively, and let cl{ran(A)} denote the closure of its range.
We now develop a series of linear operators. We begin with a regularity condition. It ensures the square-integrability of the sample path of Y and that of every function in ℋX and ℋY. The first part is a standard moment condition; see, for example, Yao, Lei and Wu (2015). The second part holds for all bounded kernels. To avoid digression, we defer a detailed discussion of this condition, along with all other conditions in this article, to Section S1 of the Supplementary Material (Lee and Li (2022)).
Assumption 1. Suppose E‖Y‖2ΩY < ∞, E{κX(X, X)} < ∞, and E{κY(Y, Y)} < ∞. In addition, ℋX and ℋY are dense in L2(PX) and L2(PY) modulo constants, respectively.
Let mX denote the mean element of ℋX, such that 〈mX, f〉ℋX = E{f(X)} for any f ∈ ℋX. Similarly, let μY and mY denote the mean elements of ΩY and ℋY, respectively. We next define a number of covariance operators:
for any ψ, ψ′ ∈ ΩY, f, f′ ∈ ℋX, and g, g′ ∈ ℋY. Note that . The next lemma justifies the existence of the mean elements and covariance operators we define above.
Lemma 1. Suppose Assumption 1 holds. Then there exist the mean elements mX, μY, and mY, and the covariance operators ΛXY, ΛYY, ΣXX, ΣXY, ΣYY.
Next, we define two functional regression operators,
mE(Y|X) = ΣXX† ΛXY   and   mY|X = ΣXX† ΣXY,
where ΣXX† is the Moore–Penrose inverse of ΣXX. That is, for any f ∈ ran(ΣXX), there exist g ∈ ker(ΣXX) and h ∈ ker(ΣXX)⊥ such that f = ΣXX(g + h), and the Moore–Penrose inverse maps f to h (Li (2018b)). Because they resemble the notion of classical regression coefficients, we refer to mE(Y|X) and mY|X as the functional regression operators. We next introduce another regularity condition to ensure that these definitions are valid. See Section S1 of the Supplementary Material (Lee and Li (2022)) for more discussion on this condition.
Assumption 2. Suppose ker(ΣXX) = {0}, ran(ΛXY) ⊆ ran(ΣXX), and ran(ΣXY) ⊆ ran(ΣXX). In addition, mE(Y|X) ∈ B2(ΩY, ℋX), and mY|X ∈ B2(ℋY, ℋX).
Assumption 2 ensures that the ranges of ΛXY and ΣXY are both contained in that of ΣXX, which in turn ensures that the two operators mE(Y|X) and mY|X are well defined. Moreover, it also ensures that the Moore–Penrose inverse is well defined.
The next proposition shows that the regression functionals E(〈ψ, Y〉ΩY | X) and E{g(Y) | X} can be induced by the regression operators mE(Y|X) and mY|X. Its proof is similar to that of Li and Song ((2017), Proposition 1) and is omitted.
Proposition 1. Suppose Assumptions 1 and 2 hold. Then for any ψ ∈ ΩY and g ∈ ℋY, there exist constants cψ and cg, such that
E(〈ψ, Y〉ΩY | X) = (mE(Y|X)ψ)(X) + cψ,   and   E{g(Y) | X} = (mY|Xg)(X) + cg.
3. Average Fréchet derivative estimators.
In this section, we first establish the connection between the Fréchet derivative of the regression functional and the functional central mean or central subspace. Built on this connection, we introduce the average Fréchet derivative estimators, and develop their population-level properties.
3.1. Fréchet derivative of regression functionals.
We first derive a key property for our proposal: the Riesz representation of the Fréchet derivative of the regression functional is located within the dimension reduction spaces of interest. Let Mψ(·) = E(〈ψ, Y〉ΩY | X = ·) for any ψ ∈ ΩY, and Mg(·) = E{g(Y) | X = ·} for any g ∈ ℋY, both of which are mappings from ΩX to ℝ. Let ∂xMψ(x, ·) and ∂xMg(x, ·) denote the Fréchet derivatives of Mψ and Mg, respectively, at any x ∈ ΩX. If A : ℋ → ℝ is a bounded linear mapping on a Hilbert space ℋ, then, by the Riesz representation theorem, there exists a unique g ∈ ℋ, such that Af = 〈g, f〉ℋ for all f ∈ ℋ. In other words, g is the Riesz representation of A in ℋ (Conway (2010)).
Theorem 2. If Mψ and Mg are Fréchet differentiable for any ψ ∈ ΩY and g ∈ ℋY, then:
for any ψ ∈ ΩY, the Riesz representation of ∂xMψ(x, ·) is in 𝒮E(Y|X);
for any g ∈ ℋY, the Riesz representation of ∂xMg(x, ·) is in 𝒮Y|X.
Its proof is given in Section S4 of the Supplementary Material (Lee and Li (2022)). Theorem 2 shares the same spirit as the gradient-based SDR methods such as Xia et al. (2002), Yin and Li (2011), but is stated in a more general setting. It shows that the Riesz representations of the Fréchet derivatives ∂xMψ(x, ·) and ∂xMg(x, ·) are elements of the functional dimension reduction spaces. Next, we show that such representations have closed forms.
We introduce an additional assumption, which is mild and is satisfied for many standard kernels κX, with the ρ function in (3) properly chosen, for example, the radial basis kernel and the polynomial kernel. In Section S1 of the Supplementary Material (Lee and Li (2022)), we further give the sufficient condition under which this assumption holds, and the explicit form of the derivative of κX for two commonly used kernels, the radial basis kernel and the polynomial kernel.
Assumption 3. κX(·, ·) is continuously differentiable.
We next establish the interchangeability between the Fréchet derivative and the reproducing kernel Hilbert space for the general case. Its proof is given in Section S4 of the Supplementary Material (Lee and Li (2022)).
Proposition 2. Suppose Assumption 3 holds. Then, for any f ∈ ℋX and x ∈ ΩX, the Fréchet derivative of f at x, u ↦ ∂xf(x, u), is equal to u ↦ 〈f, ∂xκX(x, u)〉ℋX.
Steinwart and Christmann (2008) established the differentiability of functions in an RKHS in the random variable setting. Proposition 2 extends their result to the functional setting. See also Fukumizu and Leng (2014) for a simpler version of this result in the random variable setting. Note that the regression functionals Mψ(·) and Mg(·) are members of ℋX, and thus, by Proposition 2, their Fréchet derivatives ∂xMψ(x, ·) and ∂xMg(x, ·) can be calculated using the derivative of κX. By Proposition 1, this further implies that, via the mappings of the regression operators, we can explicitly compute ∂xMψ(x, ·) and ∂xMg(x, ·), as well as their Riesz representations. The next theorem summarizes the above statement. Hereafter, we write ∂xκX(x, u) as {∂xκX(x)} ∘ u, where ∂xκX(x) : ΩX → ℋX is a member of B(ΩX, ℋX).
Theorem 3. Suppose Assumptions 1 to 3 hold. Then, for any x ∈ ΩX,
ran[{∂xκX(x)}*mE(Y|X)] ⊆ 𝒮E(Y|X)   and   ran[{∂xκX(x)}*mY|X] ⊆ 𝒮Y|X.
Theorem 3 shows that ran[{∂xκX(x)}*mE(Y|X)] and ran[{∂xκX(x)}*mY|X] are subspaces of the functional central mean subspace and the functional central subspace, respectively, which provides the essential foundation for our functional SDR estimation. In particular, this result does not require any distributional or structural assumptions, such as the linearity condition or the constant variance condition, that are required in all existing inverse moment-based functional SDR methods (Ferré and Yao (2003), Hsing and Ren (2009), Jiang, Yu and Wang (2014), Wang, Lin and Zhang (2013), Yao, Lei and Wu (2015)). To gain further insight about Theorem 3, we observe that {∂xκX(x)}*mY|X has an intuitive explanation that can be linked to the classical linear regression coefficient. Consider the linear kernel κX(ϕ, ϕ′) = 〈ϕ, ϕ′〉ΩX for any ϕ, ϕ′ ∈ ΩX. Suppose ΩX = ℝp, ΩY = ℝ, and Y and X are centered and satisfy E(Y|X = x) = 〈β, x〉 = β⊤x with β ∈ ℝp. Then for any x′ ∈ ℝp, we have 〈{∂xκX(x)}*mE(Y|X), x′〉 = 〈β⊤x, {∂xκX(x)} ∘ x′〉, which is equal to β⊤x′. This implies that {∂xκX(x)}*mE(Y|X) = β.
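The linear-kernel intuition above can be checked numerically in finite dimensions: with ΩX = ℝp, a linear kernel, and a linear regression function, the regression operator reduces to the ordinary coefficient vector β, and the Riesz representation of the derivative of E(Y|X = ·) is β itself. The sketch below is only a sanity check under these simplifying assumptions, with empirical covariances standing in for the population operators.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5000, 5
beta = np.array([1.0, -2.0, 0.0, 0.5, 3.0])

X = rng.standard_normal((n, p))
Y = X @ beta + 0.1 * rng.standard_normal(n)

# With the linear kernel, Sigma_XX and Lambda_XY reduce to ordinary covariances, and the
# regression operator reduces to Sigma_XX^{-1} Sigma_XY, i.e., the coefficient vector beta.
Sxx = np.cov(X, rowvar=False)
Sxy = np.cov(X, Y, rowvar=False)[:p, p]
m_hat = np.linalg.solve(Sxx, Sxy)

# The Frechet derivative of x -> E(Y | X = x) is constant in x, with Riesz representation
# beta, so averaging over x also returns beta.
print(np.round(m_hat, 2))   # approximately equal to beta
```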
3.2. Average Fréchet derivative.
Built on Theorem 3, let B = {∂xκX(x)}*mE(Y|X) or {∂xκX(x)}*mY|X. Because B is a bounded linear operator mapping into ΩX, BB* is a bounded self-adjoint operator on ΩX, and we can define two mappings that map X in ΩX to the random operators FE(Y|X)(X) and FY|X(X) in B(ΩX), respectively, via
FE(Y|X)(X) = [{∂xκX(X)}*mE(Y|X)][{∂xκX(X)}*mE(Y|X)]*,   FY|X(X) = [{∂xκX(X)}*mY|X][{∂xκX(X)}*mY|X]*.
Note that the adjoints of mE(Y|X) and mY|X are defined on ran(ΣXX) instead of ℋX. Nonetheless, because mE(Y|X) and mY|X are both bounded, their domains can be extended to cl{ran(ΣXX)}, which, by Assumption 2, is equal to ℋX. Next, we establish the existence of the expectations of FE(Y|X)(X) and FY|X(X). We first need a condition on the moments of the Fréchet derivative of κX. In Section S1 of the Supplementary Material (Lee and Li (2022)), we give further results based on which we can show this moment condition holds. We also derive explicit bounds on E‖∂xκX(X)‖4 for the radial basis kernel and the polynomial kernel.
Assumption 4. E‖∂xκX(X)‖4 < ∞.
The next proposition establishes the existence of the expectations of FE(Y|X)(X) and FY|X(X). Its proof follows immediately from the Riesz representation theorem, and is omitted.
Proposition 3. Suppose Assumptions 1 to 4 hold. Then the mean elements of FE(Y|X)(X) and FY|X(X) in B(ΩX), denoted by E{FE(Y|X)} and E(FY|X), respectively, uniquely exist via the following relations: for any ϕ, ϕ′ ∈ ΩX,
〈ϕ, E{FE(Y|X)}ϕ′〉ΩX = E〈ϕ, FE(Y|X)(X)ϕ′〉ΩX,   and   〈ϕ, E(FY|X)ϕ′〉ΩX = E〈ϕ, FY|X(X)ϕ′〉ΩX.
The two operators, E{FE(Y|X)} and E(FY|X), share the same spirit as the average derivative estimator of Härdle (1989) in the classical random variable SDR setting, generalized here to the functional setting. For this reason, we refer to E{FE(Y|X)} and E(FY|X) as the average Fréchet derivatives (AFD).
In SDR for random variables, we say a basis matrix is an unbiased estimator for the central subspace if the spanning space of this basis matrix is located within the central subspace. In addition, we say the basis matrix is exhaustive if the spanning space recovers the entire central subspace (Li and Wang (2007)). The same definitions apply to the central mean subspace. We now extend these concepts to the functional setting.
Definition 4. Let 𝒮 be a subspace of ΩX.
We say 𝒮 is unbiased for 𝒮E(Y|X) if 𝒮 ⊆ 𝒮E(Y|X), and exhaustive for 𝒮E(Y|X) if 𝒮 = 𝒮E(Y|X);
We say 𝒮 is unbiased for 𝒮Y|X if 𝒮 ⊆ 𝒮Y|X, and exhaustive for 𝒮Y|X if 𝒮 = 𝒮Y|X.
Next, we show that E{FE(Y|X)} and E(FY|X) are unbiased and exhaustive estimators for 𝒮E(Y|X) and 𝒮Y|X, respectively. We add another condition, and discuss it in detail in Section S1 of the Supplementary Material (Lee and Li (2022)).
Assumption 5. The supports of X and Y are convex.
Theorem 4. Suppose Assumptions 1 to 4 hold. Then:
Unbiasedness: ran[E{FE(Y|X)}] ⊆ 𝒮E(Y|X), and ran[E(FY|X)] ⊆ 𝒮Y|X.
Exhaustiveness: ran[E{FE(Y|X)}] = 𝒮E(Y|X), and ran[E(FY|X)] = 𝒮Y|X, if Assumption 5 further holds.
Theorem 4 suggests that we can use the space spanned by E{FE(Y|X)} or E(FY|X) to estimate the functional central mean subspace or the functional central subspace. In other words, we can use the leading eigenfunctions of these operators to estimate the bases of 𝒮E(Y|X) and 𝒮Y|X. The next corollary summarizes this observation, which provides the estimators for our functional SDR spaces at the population level. To ensure that we can compute the two functional subspaces in Theorem 4, ran[E{FE(Y|X)}] and ran[E(FY|X)], at the sample level, we let the rank of E{FE(Y|X)} and E(FY|X) equal q, while we allow q to diverge.
Corollary 1. Suppose the conditions in Theorem 4 hold.
- If ϕ1, …, ϕq are the solutions to the sequential optimization problem ϕk = argmax{〈ϕ, E{FE(Y|X)}ϕ〉ΩX : ‖ϕ‖ΩX = 1, 〈ϕ, ϕj〉ΩX = 0, j = 1, …, k − 1}, for k = 1, …, q, then Span{ϕ1, …, ϕq} = 𝒮E(Y|X).
- If ϕ1, …, ϕq are the solutions to the sequential optimization problem ϕk = argmax{〈ϕ, E(FY|X)ϕ〉ΩX : ‖ϕ‖ΩX = 1, 〈ϕ, ϕj〉ΩX = 0, j = 1, …, k − 1}, for k = 1, …, q, then Span{ϕ1, …, ϕq} = 𝒮Y|X.
4. Sample estimation.
In this section, we develop the functional SDR estimation procedure, first at the operator level, then under a coordinate system. We also develop a postreduction prediction method based on the proposed dimension reduction framework.
4.1. Estimation at the operator level.
Let {(Xk, Yk) : k = 1, …, n} denote i.i.d. samples of (X, Y). We estimate the mean elements μY, mX and mY of ΩY, ℋX and ℋY by μ̂Y = En(Y), m̂X = En{κX(·, X)}, and m̂Y = En{κY(·, Y)}, respectively, where En(·) is the sample mean operator, such that En(ω) = n−1Σk=1,…,n ωk for the samples (ω1, …, ωn). We estimate the three key covariance operators by
Λ̂XY = En[{κX(·, X) − m̂X} ⊗ (Y − μ̂Y)],  Σ̂XX = En[{κX(·, X) − m̂X} ⊗ {κX(·, X) − m̂X}],  Σ̂XY = En[{κX(·, X) − m̂X} ⊗ {κY(·, Y) − m̂Y}],
where ⊗ represents the tensor product. Besides, we estimate ΛYY and ΣYY by Λ̂YY = En{(Y − μ̂Y) ⊗ (Y − μ̂Y)} and Σ̂YY = En[{κY(·, Y) − m̂Y} ⊗ {κY(·, Y) − m̂Y}], respectively. For any ψ, ψ′ ∈ ΩY, we have 〈ψ, Λ̂YYψ′〉ΩY = covn(〈ψ, Y〉ΩY, 〈ψ′, Y〉ΩY), and the rest of the sample covariance operators have similar properties, where covn(·, ·) denotes the sample covariance between the designated quantities.
In the next section, we show that these sample covariance operators can be represented as functions of Gram kernel matrices, which often have fast decaying eigenvalues and can be well approximated by the spectral decomposition associated with only the leading eigenvalues (Lee and Huang (2007)). This suggests that using a reduced set of bases of RKHS can improve the estimation efficiency both statistically and computationally. We adopt this idea in our functional SDR estimation. Let , , and denote the pairs of eigenvalues and eigenfunctions of ΛYY, ΣXX, and ΣYY, such that
Let , denote the Karhunen–Loève coefficients satisfying that,
That is, , and . By definition, the covariance operators ΛXY, ΣXX, and ΣXY can be written as
| (4) |
Similarly, at the sample level, let , and denote the pairs of eigenvalues and eigenfunctions of , and . Let denote the estimated KL coefficients, , and . To estimate the covariance operators in (4), we then truncate and focus on the leading d terms, and obtain the estimators via
| (5) |
where varn(ω) = covn(ω, ω).
Correspondingly, we estimate the functional regression operators mE(Y|X) and mY|X by
m̂E(Y|X) = (Σ̂dXX + ϵI)−1Λ̂dXY   and   m̂Y|X = (Σ̂dXX + ϵI)−1Σ̂dXY,     (6)
respectively, where Σ̂dXX, Λ̂dXY and Σ̂dXY denote the truncated estimators in (5), ϵ is a ridge parameter, and I : ℋX → ℋX is the identity mapping. Then, by Corollary 1, we solve the optimization problem
ϕ̂k = argmax{〈ϕ, En{F̂(X)}ϕ〉ΩX : ‖ϕ‖ΩX = 1, 〈ϕ, ϕ̂j〉ΩX = 0, j = 1, …, k − 1},   k = 1, …, q,     (7)
where F̂(X) = [{∂xκX(X)}*m̂E(Y|X)][{∂xκX(X)}*m̂E(Y|X)]* when estimating 𝒮E(Y|X), and F̂(X) = [{∂xκX(X)}*m̂Y|X][{∂xκX(X)}*m̂Y|X]* when estimating 𝒮Y|X. That is, we estimate 𝒮E(Y|X) or 𝒮Y|X by Span{ϕ̂1, …, ϕ̂q}, where ϕ̂k is the eigenfunction corresponding to the kth largest eigenvalue of En{F̂(X)} in (7). Lastly, we estimate the sufficient predictors by 〈ϕ̂k, X〉ΩX, for k = 1, …, q.
4.2. Estimation under a coordinate system.
In practice, a function can only be observed at a finite set of points. For simplicity, we focus on the balanced setting where all the subjects are observed under a common set of time points, and the observed data are {[Xk(t), Yk(t)]⊤ : t = t1, …, tm, k = 1, …, n}. Meanwhile, our method can be straightforwardly extended to the unbalanced setting with only minor modifications. We next develop an estimation procedure under a chosen coordinate system given the observed data.
We first introduce the notion of coordinate representation here, and give more details in Section S2.1 of the Supplementary Material (Lee and Li (2022)). Let ⌊f⌋𝓑 denote the coordinate of f in a generic Hilbert space ℋ with respect to its basis 𝓑. Moreover, let 𝓑′⌊A⌋𝓑 denote the coordinate of a linear operator A : ℋ → ℋ′ with respect to 𝓑 and 𝓑′, where ℋ′ is another Hilbert space spanned by 𝓑′. For convenience, we write ⌊f⌋𝓑 as ⌊f⌋, and 𝓑′⌊A⌋𝓑 as ⌊A⌋ when there is no confusion.
We divide our coordinate-level estimation procedure into four major steps.
In Step 1, we first construct the RKHSs ΩX and ΩY, which are Hilbert spaces of functions on the interval 𝒯. We begin with a positive definite kernel function κT : 𝒯 × 𝒯 → ℝ. There are many choices, for example, the Brownian motion kernel, or the radial basis kernel. Suppose ΩX = ΩY is the linear span of a sequence of basis functions induced by κT.
Next, we derive the coordinate representation of the observed data Xk(t) and Yk(t). Let Xk(τ) = {Xk(t1), …, Xk(tm)}⊤ and Yk(τ) = {Yk(t1), …, Yk(tm)}⊤ for the kth subject, k = 1, …, n, and let KT = {κT(ti, tj)}i,j=1,…,m be the m × m Gram matrix of κT at the observed time points, with the spectral decomposition KT = UTDTUT⊤. It is then straightforward to obtain that
𝓑T = {Σi=1,…,m [UT]ij κT(·, ti)/[DT]jj1/2 : j = 1, …, m}
is an orthonormal basis of ΩX and ΩY. Therefore, we can write Xk(τ) and Yk(τ) as linear combinations of the functions in 𝓑T evaluated at the observed time points, which further leads to their coordinate representations,
| (8) |
Note that if KT is not of full rank, we simply use its leading nonzero eigenvalues and the corresponding eigenvectors to construct 𝓑T.
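A sketch of Step 1, assuming the Brownian motion kernel for κT and a common grid of m time points: the Gram matrix KT is eigen-decomposed, near-zero eigenvalues are dropped as noted above, and each observed curve is represented by its coordinates with respect to the resulting basis. The specific normalization below is one standard choice; the paper's exact convention is given in its Supplementary Material.

```python
import numpy as np

def step1_coordinates(curves, t, tol=1e-10):
    """Step 1 sketch: orthonormal basis of Omega_X = Omega_Y built from the Gram matrix of
    kappa_T, and coordinates of the observed curves with respect to that basis.
    `curves` is an n x m matrix of function values at the common time points t."""
    KT = np.minimum.outer(t, t)                # m x m Gram matrix of the Brownian motion kernel
    evals, evecs = np.linalg.eigh(KT)
    order = np.argsort(evals)[::-1]
    keep = evals[order] > tol                  # drop null directions if K_T is rank deficient
    evals, evecs = evals[order][keep], evecs[:, order][:, keep]
    basis_vals = evecs * np.sqrt(evals)        # values of the orthonormal basis functions at t
    coords = curves @ evecs / np.sqrt(evals)   # coordinates of each curve w.r.t. that basis
    return basis_vals, coords

# For curves lying in the span of the kernel sections, coords @ basis_vals.T reproduces `curves`.
```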
In Step 2, we first construct the nested RKHSs ℋX and ℋY, which are Hilbert spaces of functions on ΩX and ΩY, respectively. We choose positive definite nested kernels κX and κY, using, for example, the radial basis function kernel, or the Laplacian kernel. We then follow the coordinates derived in (8), and estimate the inner products 〈Xk, Xl〉ΩX and 〈Yk, Yl〉ΩY by ⌊Xk⌋⊤⌊Xl⌋ and ⌊Yk⌋⊤⌊Yl⌋, respectively. Let ℋX and ℋY be the linear spans of {κX(·, Xk) : k = 1, …, n} and {κY(·, Yk) : k = 1, …, n}, respectively.
Next, we construct the orthonormal bases of ℋX and ℋY, following a similar way as in Step 1. That is, let KX = {κX(Xk, Xl)}k,l=1,…,n and KY = {κY(Yk, Yl)}k,l=1,…,n be the n × n Gram matrices of X and Y, respectively, and GX = QnKXQn and GY = QnKYQn be their centered versions, with Qn = In − 1n1n⊤/n and 1n being the n-dimensional vector of ones. We then perform the spectral decompositions GX = VXDXVX⊤ and GY = VYDYVY⊤, where the first d columns of VX and VY and the first d diagonal elements of DX and DY correspond to the first d eigenvalues, and the remaining ones correspond to the last n − d eigenvalues. Therefore, we construct the orthonormal bases of ℋX and ℋY, respectively, via
| (9) |
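A sketch of Step 2: the nested Gram matrices are computed from the Step 1 coordinates (whose Euclidean inner products estimate the ΩX and ΩY inner products), centered by Qn, and eigen-decomposed; the leading d eigen-pairs then yield the bases 𝒞X and 𝒞Y as in (9). The radial basis nested kernel and its bandwidth are illustrative choices.

```python
import numpy as np

def nested_gram_rbf(coords, gamma=1.0):
    """n x n Gram matrix of a radial-basis nested kernel, with the Omega_X (or Omega_Y)
    inner products estimated by Euclidean inner products of the Step 1 coordinates."""
    sq_norms = np.sum(coords ** 2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * coords @ coords.T
    return np.exp(-gamma * sq_dist)

def step2_leading_pairs(coords_X, coords_Y, d, gamma=1.0):
    """Centered Gram matrices G_X, G_Y and their leading d eigenvalues/eigenvectors,
    the ingredients of the orthonormal bases C_X and C_Y in (9)."""
    n = coords_X.shape[0]
    Qn = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    GX = Qn @ nested_gram_rbf(coords_X, gamma) @ Qn
    GY = Qn @ nested_gram_rbf(coords_Y, gamma) @ Qn
    eX, vX = np.linalg.eigh(GX)                   # eigenvalues in ascending order
    eY, vY = np.linalg.eigh(GY)
    return (eX[::-1][:d], vX[:, ::-1][:, :d]), (eY[::-1][:d], vY[:, ::-1][:, :d])
```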
In Step 3, we derive the coordinate representation of the truncated sample covariance operators in (5), given the constructed ΩX, ΩY, ℋX and ℋY, and their orthonormal bases. Specifically, we perform the spectral decomposition on , and denote its first d eigenvalues and eigenfunctions as . Let . We then compute the leading sample KL coefficients by, for k = 1, …, d, ℓ = 1, …, n,
where ek is the d-dimensional vector with its kth position being one and the rest being zero. Stack the sample KL coefficients θk,ℓ, , and vertically, and form the n-dimensional vectors , , and , for k = 1, …, d. Next stack , and horizontally, and form the n × d matrices , and . Then we obtain the coordinate representation of the truncated sample covariance operators in (5) as
| (10) |
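A sketch of Step 3 with hypothetical argument names: given the n × d matrices of leading sample KL coefficients for the X feature map, for Y in ΩY, and for the Y feature map, the truncated covariance operators are represented in coordinates by sample (cross-)covariances of these coefficient matrices; the exact coordinate formulas in (10) are given in the paper.

```python
import numpy as np

def step3_truncated_covariances(TX, TY_omega, TY_h):
    """Step 3 sketch (hypothetical names).
    TX       : n x d leading sample KL coefficients for the X feature map,
    TY_omega : n x d leading sample KL coefficients of Y in Omega_Y,
    TY_h     : n x d leading sample KL coefficients for the Y feature map.
    Returns coordinate matrices of the truncated Lambda_XY, Sigma_XX, and Sigma_XY."""
    n = TX.shape[0]
    center = lambda M: M - M.mean(axis=0, keepdims=True)
    TXc, TYoc, TYhc = center(TX), center(TY_omega), center(TY_h)
    Lambda_XY_d = TXc.T @ TYoc / n
    Sigma_XX_d = TXc.T @ TXc / n
    Sigma_XY_d = TXc.T @ TYhc / n
    return Lambda_XY_d, Sigma_XX_d, Sigma_XY_d
```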

In Step 4, we derive the coordinate representation ⌊F̂(X)⌋ of the operator F̂(X) in (7) as
| (11) |
where ⌊∂xκX(X)⌋ is the coordinate of the Fréchet derivative, which we derive explicitly for the radial basis kernel and the polynomial kernel in Section S2.2 of the Supplementary Material (Lee and Li (2022)). We take the value of ϵ to be 10−4 × [DX]1,1. Finally, we compute the sample average En{⌊F̂(X)⌋}, perform its spectral decomposition, and obtain the eigenvectors associated with the leading q eigenvalues. We estimate 𝒮E(Y|X) or 𝒮Y|X by Span{ϕ̂1, …, ϕ̂q}, where ⌊ϕ̂i⌋ is the eigenvector corresponding to the ith largest eigenvalue, i = 1, …, q.
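A sketch of Step 4 under stated assumptions: `Dx` collects the coordinate matrices of the Fréchet derivative ∂xκX(Xk) (their explicit form, derived in Section S2.2, is assumed given here), `Sigma_XY_d` and `Sigma_XX_d` are the Step 3 coordinate matrices, and `dX1` is the leading eigenvalue [DX]1,1 used to set the ridge parameter. Each sample contributes an operator coordinate of the form BkBk⊤ with Bk the coordinate of {∂xκX(Xk)}*m̂, and the estimated basis coordinates are the leading q eigenvectors of the sample average.

```python
import numpy as np

def step4_afd_basis(Dx, Sigma_XY_d, Sigma_XX_d, dX1, q):
    """Step 4 sketch. Dx is a list of d x m coordinate matrices of the Frechet derivative
    of kappa_X at each sample point (assumed given); returns the m x q matrix of
    coordinates (w.r.t. B_T) of the leading q eigenvectors of the averaged operator."""
    d = Sigma_XX_d.shape[0]
    eps = 1e-4 * dX1                                    # ridge parameter, 1e-4 x [D_X]_{1,1}
    M = np.linalg.solve(Sigma_XX_d + eps * np.eye(d),   # coordinate of the ridged regression operator
                        Sigma_XY_d)
    m_dim = Dx[0].shape[1]
    F_bar = np.zeros((m_dim, m_dim))
    for Dxk in Dx:                                      # average of B_k B_k^T, with B_k = Dxk^T M
        Bk = Dxk.T @ M
        F_bar += Bk @ Bk.T
    F_bar /= len(Dx)
    evals, evecs = np.linalg.eigh(F_bar)
    return evecs[:, ::-1][:, :q]                        # coordinates of the estimated basis functions
```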
We summarize the above estimation procedure in Algorithm 1.
Our estimation method requires specifying kernel functions for κT, κX, and κY. We study the effect of kernel choices in Section S3 of the Supplementary Material (Lee and Li (2022)). In general, we have found that the estimated sufficient predictors are not overly sensitive to the choices of kernel functions.
In addition, we need to determine the number of leading KL coefficients d, and the intrinsic dimension q of the functional central subspace or central mean subspace. We adopt the commonly used strategy in principal component analysis (PCA) to select d and q, such that a certain proportion of total variation has been accounted for. Specifically, let tr(·) and λk(·) denote the trace and the kth largest eigenvalue of a designated symmetric matrix, and pd and pq denote some prespecified proportion, say, 90% or 95%. We then determine d and q by
| (12) |
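A sketch of the proportion-of-variation rule in (12): pick the smallest number of leading eigenvalues whose cumulative share of the trace reaches the prespecified proportion pd or pq. The rule below is generic; which symmetric matrices it is applied to (for d and for q) follows the text, and the names in the usage comment refer to the hypothetical quantities in the earlier sketches.

```python
import numpy as np

def select_by_variation(sym_matrix, proportion=0.90):
    """Smallest k such that the top-k eigenvalues account for at least `proportion`
    of the trace of a symmetric positive semidefinite matrix."""
    evals = np.sort(np.linalg.eigvalsh(sym_matrix))[::-1]
    evals = np.clip(evals, 0.0, None)            # guard against tiny negative eigenvalues
    share = np.cumsum(evals) / np.sum(evals)
    return int(np.argmax(share >= proportion) + 1)

# e.g., d = select_by_variation(GX, 0.90); q = select_by_variation(F_bar, 0.90)
```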
4.3. Postreduction prediction.
In function-on-function regression, in addition to finding the reduced-dimensional representation, it is of equal interest to predict the response function Y, or any of its functionals g(Y) with g in ℋY, given the predictor function X. Our proposed dimension reduction framework leads naturally to a procedure for such prediction tasks.
Let ϕ̂i, i = 1, …, q, denote the estimates from Step 5 of Algorithm 1. Then, for any x ∈ ΩX, its projection onto Span{ϕ̂1, …, ϕ̂q} is xq = Σi=1,…,q 〈ϕ̂i, x〉ΩX ϕ̂i. Therefore, for any x, x′ ∈ ΩX, we can define a new kernel for the reduced predictors xq and x′q by applying the ρ function from (3) to their inner products. Based on this reduced kernel, we calculate the corresponding n × n Gram matrix. Suppose we aim to estimate the conditional mean of g(Y) given X = x, that is, E{g(Y) | X = x}, for a given g in ℋY. We first compute the coordinates of g with respect to 𝒞Y. We then estimate E{g(Y) | X = x} by
where and are from the spectral decomposition corresponding to the leading d eigenvalues, and is the basis with respect to the new kernel , that is .
A special case that is of particular interest is to predict Y given X. Consider the mapping gs : ΩY → ℝ that maps y to ⌊y⌋s, that is, the sth coordinate of y, for each s = 1, …, m. We estimate E{gs(Y) | X = x} as above for each s, and then assemble these estimates across s = 1, …, m into an estimate of E{Y(t) | X = x} through the basis 𝓑T.
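A sketch of the postreduction prediction idea with hypothetical argument names: project each predictor curve onto the estimated basis, build a Gram matrix on the reduced predictors, and predict E{g(Y) | X = x} by a standard kernel-ridge-type formula. This is a generic stand-in for the estimator described above, not the paper's exact expression.

```python
import numpy as np

def predict_after_reduction(coords_X, coords_x_new, basis_coords, g_values,
                            gamma=1.0, eps=1e-4):
    """Kernel-ridge-type prediction of E{g(Y) | X = x} from the reduced predictors.
    coords_X     : n x m coordinates of the training predictor curves (w.r.t. B_T),
    coords_x_new : length-m coordinate vector of a new predictor curve,
    basis_coords : m x q coordinates of the estimated basis of the reduction subspace,
    g_values     : length-n vector of g(Y_k) on the training sample."""
    Z = coords_X @ basis_coords                  # reduced training predictors (n x q)
    z_new = coords_x_new @ basis_coords          # reduced new predictor (q,)
    sq_train = np.sum(Z ** 2, axis=1)
    Gq = np.exp(-gamma * (sq_train[:, None] + sq_train[None, :] - 2.0 * Z @ Z.T))
    k_new = np.exp(-gamma * np.sum((Z - z_new) ** 2, axis=1))
    n = len(g_values)
    alpha = np.linalg.solve(Gq + n * eps * np.eye(n), g_values)
    return float(k_new @ alpha)
```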
5. Asymptotic properties.
In this section, we establish the consistency and convergence rate of our functional SDR estimator for the functional central subspace 𝒮Y|X. The result for the functional central mean subspace 𝒮E(Y|X) can be obtained in a similar fashion, and is thus omitted. We first show the convergence of the truncated sample covariance operators in (5). We next show the convergence of the sample functional regression operator in (6). We then establish the convergence of the sample average Fréchet derivative of the regression function, which leads to the uniform convergence of our functional SDR estimator for 𝒮Y|X. We assume the trajectory of the random functions {X(t), Y(t)}⊤ is fully observed on t ∈ 𝒯, and briefly comment on the setting when {X(t), Y(t)}⊤ is partially observed toward the end. We also derive the consistency and convergence rate of the estimator based on the untruncated sample covariance operators in Section S2.3 of the Supplementary Material (Lee and Li (2022)). Unless otherwise stated, all proofs are relegated to Section S4 of the Supplementary Material (Lee and Li (2022)).
We begin with some supporting lemmas. The first lemma is about the perturbation of self-adjoint operators. Given an integer d ≥ 1, let λ1 ≥ ⋯ ≥ λd+1 denote the top d + 1 eigenvalues of a self-adjoint operator Σ, and νd(Σ) denote the minimum distance between these eigenvalues, νd(Σ) = min{λk − λk+1 : k = 1, …, d}.
Lemma 2. Let Σ and be self-adjoint operators in B(ℋ), and λ1 > λ2 > ⋯ and be their sequences of eigenvalues, respectively. If , then .
The next lemma provides the convergence rates of the relevant sample operators.
Lemma 3. Suppose Assumption 1 holds. Then the orders of magnitude of the terms,
are OP (n−1/2).
Next, we establish the convergence of the truncated sample covariance operators and in (5). Since our interest is on 𝒮Y|X, not 𝒮E(Y|X), we do not discuss the property of here. We introduce two intermediate operators, and , which are defined as
Note that we allow d to grow with the sample size, but denote it as d instead of dn to simplify the notation. Moreover, for three positive sequences {an}, {bn} and {cn}, denote an ≺ bn if an = o(bn), an ⪯ bn if an = O(bn), and an ∧ bn = bn, or an ∨ bn = an, if bn ⪯ an. Denote an ≍ bn if both an ⪯ bn and bn ⪯ an. Similarly, denote an ∧ bn ∧ cn = (an ∧ bn) ∧ cn, and an ∨ bn ∨ cn = (an ∨ bn) ∨ cn.
Lemma 4. Suppose Assumption 1 holds, and ξ ≡ d2n−1/2{νd(ΣXX)∧νd(ΣYY)}−1 ⪯ 1. Then , and .
Next, we establish the convergence of the sample functional regression operator . We require two additional smoothness assumptions. We discuss the two assumptions in detail in Section S1 of the Supplementary Material (Lee and Li (2022)).
Assumption 6. There exist β1 > 0 and m0 ∈ B2(ℋY, ℋX), such that mY|X = (ΣXX)β1 m0.
Assumption 7. Let ak and bk denote the kth eigenvalues of ΣXX and ΣYY, respectively. There exists β2 > 0, such that , and .
Theorem 5. Suppose Assumptions 1, 2, 6, 7 hold, and that ϵ ≺ 1. Then,
Next, we present a lemma regarding the convergence of linear operators in the form of B*B. We then establish the convergence of the sample average Fréchet derivative estimator.
Lemma 5. Let {Bn}n≥1 and B be random elements in B2(ℋ, 𝒦). Suppose ∥Bn − B∥HS = OP(an). Then ∥Bn*Bn − B*B∥HS = OP(an).
Theorem 6. Suppose Assumptions 1 to 7 hold, and that ϵ ≺ 1. Then,
We are now ready to establish the uniform convergence of the estimated bases for 𝒮Y|X. The proof follows Lemma 2 and Theorem 6, and is omitted.
Theorem 7. Let {(λk, ϕk) : k = 1, …, q} denote the leading eigenvalues and eigenfunctions of E(FY|X), with λ1 > λ2 > ⋯ > λq > 0. Suppose the same conditions as in Theorem 6 hold. Then,
We make some remarks about Theorems 5 to 7. First, we note that Li and Song (2017) also studied the convergence of the functional regression operator. Our Theorem 5 further extends their result to the setting where the sample covariance operators are based on the truncated bases. In addition, Li and Song (2017) obtained the consistency of their nonlinear basis functions, whereas our Theorem 7 studied the linear basis functions ϕk, which involves a different asymptotic analysis.
Second, and more importantly, in all our theorems, we allow the intrinsic dimension q to diverge, while Li and Song (2017) only considered the setting of a fixed q. This difference has profound implications, and makes our asymptotic analysis far from a straightforward extension of Li and Song (2017). Actually, Theorem 7 suggests that q affects the convergence rate of the basis estimates through νq {E(FY|X)}, that is, the minimal gap between the eigenvalues of E(FY|X). To gain further insight, we note that, by definition, the squared Hilbert-Schmidt norm of E(FY|X) is
where {ej : j ∈ ℕ} is an orthonormal basis of ΩX. Following the proof of Theorem 6, the right-hand side of the above equation is bounded. This implies that E(FY|X) is Hilbert-Schmidt, which further indicates that its eigenvalues decay to zero. Therefore, Theorem 7 suggests that we need to restrict this decaying rate. We also comment that Lin, Zhao and Liu (2019) have recently shown that the convergence rate of sliced inverse regression (SIR) relies on the smallest eigenvalue of the SIR matrix. Although we consider a very different problem than Lin, Zhao and Liu (2019), both works suggest that the eigen-structure of the key quantity of interest, the SIR matrix in their work and the average Fréchet derivative in ours, plays an important role in the consistency of the estimated basis.
Third, we examine more closely the convergence rate in Theorem 7. For a given d, we denote the rate of the regularization parameter ϵ in (6) as ϵ(d), and the resulting convergence rate in Theorem 7 as ζ(d). We study the rate of the ϵ(d) under which the best convergence rate ζ(d) is achieved. We have the following result.
Corollary 2. Suppose the same conditions in Theorem 7 hold. Then
Its proof is by direct calculation and is omitted. To better understand the rates in Corollary 2, suppose d = na, νd(ΣXX) ∧ νd(ΣYY) = n−b, and νq{E(FY|X)} = n−c, for some constants a, b, c > 0. The value of a reflects the growing rate of the number of KL expansions, b restricts the decaying rate of the eigenvalues of ΣXX and ΣYY, and c controls the decaying rate of the eigenvalues of E(FY|X). Combined with Corollary 2, the optimal ϵ(d) and the resulting ζ(d) are then powers of n whose exponents are determined by r(a, b, β1, β2) = {(1/2 − 2a − b)∧(aβ2)}/(1 + β1). This further implies that, when (2 + β2)a + b = 1/2, we obtain the best rates.
There is a clear interpretation regarding how these best rates are affected by β1, β2 as defined in Assumptions 6 and 7, and also the value of b. Specifically, β1 controls the smoothness of the functional regression operator, and a larger value of β1 encourages a smoother relation between X and Y, which leads to a larger penalty from ϵ, and a faster convergence rate of ζ. Meanwhile, β2 regulates the behavior of the tail eigenvalues of ΣXX and ΣYY, and a larger β2 encourages a faster decaying of those tail eigenvalues, which leads to a smaller penalty from ϵ, and a faster rate of ζ. The quantity b restricts the decaying rate of the gaps from the leading eigenvalues of ΣXX and ΣYY, and a smaller b implies a slower decaying rate, which means a smaller ϵ is required to stabilize the inversion of , and a faster rate of ζ.
Finally, we have so far studied the asymptotics under the setting where the random functions are fully observed. We remark that all the results can be extended to the setting where the random functions are only partially observed. In this case, the sample mean and covariance operators in Lemma 3 have a slower convergence rate than n−1/2. Suppose the rate is n−1/2+δ, with δ ∈ [0, 1/2); the denser the observed time points are for the functions, the closer δ is to 0. The specific value of δ depends on the actual sampling schedule of how the functional data are collected; see Wang, Chiou and Müller (2016) for more details. Based on the rate of n−1/2+δ, we can extend the results in Theorems 5 to 7 to the setting of partially observed functions. We skip the details, since this extension is relatively straightforward.
6. Numerical studies.
In this section, we first carry out simulations to examine the empirical performance of our proposed functional SDR estimators. We then illustrate our method with two real data examples, the classical Canadian weather dataset, and a more recent bike sharing dataset. We report the additional simulation results that compare our method with Li and Song (2017), and study the effect of diverging intrinsic dimensionality in Section S3 of the Supplementary Material (Lee and Li (2022)).
6.1. Simulations.
We consider two sets of simulation models: in the first set, the response function is associated with the predictor function via the conditional mean only, and in the second set, with not only the conditional mean but also the conditional variance. Let αj = 1/{(j − 0.5)π}2 and βj(t) = sin{(j − 0.5)πt} denote the jth eigenvalue and eigenfunction of the Brownian motion kernel, for j = 1, …, 100. We independently sample the predictor function X(t) and the error function ε(t) as linear combinations of the leading eigenfunctions βj(t), where the aj's are i.i.d. standard normal coefficients. Let ϕ1(t) = β4(t), and ϕ2(t) = β5(t). We then generate the response function Y(t) as
I-1:
I-2:
II-1:
II-2:
By construction, for models I-1 and I-2, 𝒮E(Y|X) = 𝒮Y|X = Span{ϕ1(t), ϕ2(t)}, while for models II-1 and II-2, 𝒮E(Y|X) = Span{ϕ1(t)} and 𝒮Y|X = Span{ϕ1(t), ϕ2(t)}. We then take the same 50 equally spaced points {tk1, …, tk50} in [0, 1] as the observed time points, for all k = 1, …, n. We consider different values of the sample size, n = 100, 250, 500. In our implementation of the functional SDR, we use a Brownian motion kernel for κT, and a radial basis kernel for κX and κY.
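For reference, the sketch below sets up the common ingredients of the simulation design: the eigen-pairs of the Brownian motion kernel, predictor curves sampled on the 50 common time points, and the index functions ϕ1 = β4 and ϕ2 = β5. The αj-weighting of the coefficients and the response-generating step are placeholders for illustration only; they are not models I-1 to II-2.

```python
import numpy as np

rng = np.random.default_rng(2)
n, J = 500, 100
t = np.linspace(0, 1, 50)                       # 50 equally spaced observation points

j = np.arange(1, J + 1)
alpha = 1.0 / ((j - 0.5) * np.pi) ** 2          # eigenvalues of the Brownian motion kernel
beta = np.sin(np.outer(t, (j - 0.5) * np.pi))   # beta_j(t) on the grid, columns j = 1, ..., J

a = rng.standard_normal((n, J))                 # i.i.d. standard normal coefficients
X = (a * alpha) @ beta.T                        # predictor curves; the alpha_j weighting is
                                                # an assumption of this sketch
phi1, phi2 = beta[:, 3], beta[:, 4]             # phi_1 = beta_4, phi_2 = beta_5

# Placeholder single-index response, NOT one of models I-1 to II-2:
index1 = (X @ phi1) * (t[1] - t[0])             # crude numerical <phi_1, X>
Y = np.outer(np.sin(index1), phi2) + 0.1 * rng.standard_normal((n, t.size))
```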
We first evaluate the accuracy of recovering 𝒮E(Y|X) and 𝒮Y|X. Toward that end, we generate n i.i.d. pairs of random functions {X(t), Y(t)} for each model, then divide them into a training set and a testing set, each with n/2 observations. We then apply our methods to the training data to estimate the basis functions for the functional dimension reduction subspaces. We next compute the top q sufficient predictors using the testing data. We first use the true value of q, then examine the estimation of q using (12) later. To evaluate how effectively the estimated sufficient predictors approximate the true ones, we compute the trace of the average squared multivariate correlation coefficient matrix (MCC, Ferré and Yao (2005)) between the true and estimated sufficient predictors, evaluated on the testing samples. The value of MCC is between 0 and 1, with 1 indicating a perfect recovery. Besides, since it is calculated based on the testing samples, it avoids the overfitting issue. Figure 1 reports the box plots of this criterion for the four models based on 80 data replications, for both the functional central mean subspace (fCMS) and the functional central subspace (fCS) estimation. It is seen that, for models I-1 and I-2, both estimators perform well, since the conditional mean contains all the relevant regression information. For models II-1 and II-2, the functional central subspace estimator performs consistently better than the functional central mean subspace estimator, because the conditional variance contains additional relevant regression information that the functional central mean subspace misses. Moreover, the performance improves as the sample size n increases, which agrees with our asymptotic theory.
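The MCC criterion can be computed as a trace correlation between the true and estimated sufficient predictors. The exact matrix form follows Ferré and Yao (2005); the sketch below assumes the standard squared trace-correlation definition, normalized to lie in [0, 1].

```python
import numpy as np

def mcc(Z_true, Z_hat):
    """Squared trace correlation between the true and estimated sufficient predictors
    (both n x q matrices, evaluated on the testing samples). Returns a value in [0, 1],
    with 1 indicating perfect recovery; this assumes the standard trace-correlation form."""
    q = Z_true.shape[1]
    Zc = Z_true - Z_true.mean(axis=0)
    Hc = Z_hat - Z_hat.mean(axis=0)
    S_tt, S_hh, S_th = Zc.T @ Zc, Hc.T @ Hc, Zc.T @ Hc
    R = np.linalg.solve(S_hh, S_th.T) @ np.linalg.solve(S_tt, S_th)
    return float(np.trace(R)) / q
```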
Fig. 1. Box plots of the average squared multivariate correlation coefficients between the true and estimated sufficient predictors, for the functional central mean subspace estimator (fCMS), and the functional central subspace estimator (fCS), and the four models, respectively. Each panel consists of 3 boxes for three sample sizes n = 100, 250, 500.
Next, we evaluate the performance of selecting the intrinsic dimension q using (12). We choose the top q eigenfunctions that account for 90% total variation in the estimated average Fréchet derivative. For models I-1 and I-2, both 𝒮E(Y|X) and 𝒮Y|X have the intrinsic dimension q = 2. For models II-1 and II-2, 𝒮E(Y|X) has the intrinsic dimension q = 1, and 𝒮Y|X has the intrinsic dimension q = 2. Table 1 reports the estimated q averaged over 80 data replications. It is seen that, the estimated q is slightly larger than the true intrinsic dimension, especially for models II-1 and II-2. This is acceptable from a dimension reduction perspective, as a slightly larger estimated q implies that the method recovers some additional redundant information, while a smaller estimated q means the method misses some important information. Moreover, the estimation of q becomes more accurate as the sample size n increases.
Table 1.
The estimated intrinsic dimension q
| Model | | fCMS, n = 100 | fCMS, n = 250 | fCMS, n = 500 | fCS, n = 100 | fCS, n = 250 | fCS, n = 500 |
|---|---|---|---|---|---|---|---|
| I-1 | mean | 3.00 | 2.48 | 2.01 | 3.99 | 3.01 | 2.14 |
| | s.d. | 0.32 | 0.50 | 0.11 | 0.11 | 0.11 | 0.35 |
| I-2 | mean | 3.58 | 3.00 | 2.58 | 4.03 | 3.92 | 3.02 |
| | s.d. | 0.50 | 0.23 | 0.50 | 0.16 | 0.27 | 0.16 |
| II-1 | mean | 4.00 | 3.86 | 3.27 | 4.08 | 3.94 | 3.01 |
| | s.d. | 0.00 | 0.35 | 0.45 | 0.27 | 0.24 | 0.11 |
| II-2 | mean | 4.00 | 3.95 | 3.39 | 4.41 | 4.01 | 3.88 |
| | s.d. | 0.00 | 0.22 | 0.49 | 0.50 | 0.11 | 0.33 |
Finally, we study the performance of postreduction prediction. We compare our method with three state-of-the-art parametric function-on-function regression methods: the functional linear regression (LIN, Yao, Müller and Wang (2005)), the functional linear regression by signal compression (LQ, Luo and Qi (2017)), and the functional additive regression (ADD, Müller and Yao (2008)). Both LIN and ADD are implemented in the R package PACE, and LQ is implemented in the R package FRegSigCom. Since the competing methods were designed to predict the mean function of the response function, it is more meaningful to focus our comparison on the conditional mean models I-1 and I-2. To evaluate the predictive performance, we fit the model using the training samples, then obtain the predicted response function using the testing samples. We then compute the mean squared prediction error (MSPE),
where the average is taken over the testing samples. Table 2 reports the results based on 80 data replications. It is seen that LQ has the best predictive performance for model I-1, while our prediction method based on the functional central subspace 𝒮Y|X is the second best. This is not surprising, since their method was specifically designed for a linear function-on-function regression model. Meanwhile, our method based on 𝒮Y|X performs the best for model I-2, where there is a clear nonlinear association between the predictor and response functions.
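The MSPE is computed on the testing set by averaging the squared difference between observed and predicted response curves over both subjects and observation points; whether a plain average or a Riemann-sum integral over t is used is a normalization detail of the display above.

```python
import numpy as np

def mspe(Y_test, Y_pred):
    """Mean squared prediction error: squared difference between observed and predicted
    testing response curves, averaged over subjects and observation points."""
    Y_test, Y_pred = np.asarray(Y_test), np.asarray(Y_pred)
    return float(np.mean((Y_test - Y_pred) ** 2))
```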
Table 2.
The mean squared prediction error for the prediction methods based on the functional central mean subspace (fCMS), the functional central subspace (fCS), the functional linear regression (LIN), the functional linear regression by signal compression (LQ), and the functional additive regression (ADD)
| Model | | fCMS | fCS | LIN | LQ | ADD |
|---|---|---|---|---|---|---|
| I-1 | mean | 0.715 | 0.695 | 2.584 | 0.540 | 2.618 |
| | s.d. | 0.159 | 0.121 | 0.151 | 0.042 | 0.147 |
| I-2 | mean | 0.610 | 0.603 | 1.396 | 0.666 | 1.411 |
| | s.d. | 0.063 | 0.058 | 0.081 | 0.057 | 0.088 |
6.2. Canadian weather data.
We first illustrate our methods with the classical Canadian weather data, which is often used as a benchmark for function-on-function regression analysis (Ramsay and Silverman (2005)). The dataset consists of the average daily temperature and logarithm of the daily precipitation over 35 years, from 1960 to 1994, for 35 different locations in Canada. The data is available from the R package fda. Figure 2(a)–(b) show the observed time series for the daily temperature and precipitation in the logarithmic scale. The analysis goal is to study the association between the temperature and precipitation, and to use the temperature to predict the precipitation.
Fig. 2. Canadian weather data.
We apply the proposed functional SDR methods to this data. For 𝒮E(Y|X), the top 3 sufficient predictors explain 45.1%, 33.6%, and 15.2% of the total variation, respectively, and 93.9% cumulatively. For 𝒮Y|X, the top 3 sufficient predictors explain 48.5%, 37.3%, and 9.2% of the total variation, respectively, and 94.9% cumulatively. Figure 2(c)–(d) show these estimated bases, which look similar for 𝒮E(Y|X) and 𝒮Y|X in this example. From the plots, it is seen that the first basis function weighs more at the beginning and toward the end of the year, while the second basis function weighs more in the middle of the year.
Next, we apply the postreduction prediction method to estimate the conditional mean of the precipitation function in logarithm given the top 3 estimated bases from the temperature function. Figure 3(a) shows the scatterplot of the first sample KL coefficient of the logarithmic daily precipitation, as well as its estimated regression function, that is, the conditional mean, versus the first and second estimated sufficient predictors. We also interpolate the regression surface for a clearer visualization. From the plot, we see that the sample KL coefficient is well approximated by its regression function. We also observe an increasing trend of the conditional mean as the sufficient predictors increase.
Fig. 3. The first sample KL coefficient of the response function (black dots) and its regression function (colored mesh) versus the first two sufficient predictors.
Finally, we perform an out-of-sample prediction analysis. That is, we randomly draw 30 out of the 35 observations as the training samples, and use the rest as the testing samples. We record the mean squared prediction error on the testing data, and compare the five methods studied in Section 6.1. We repeat this process 100 times. Table 3 reports the results averaged over the 100 replications. It is seen that our method based on the functional central subspace achieves the best prediction accuracy.
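A minimal sketch of this repeated random-split evaluation is given below; the fit and predict callables are placeholders for any of the five methods, the grid-based squared error is a simple proxy for the integrated MSPE, and all names are hypothetical.

```python
import numpy as np

def split_mspe(X, Y, fit, predict, n_train, n_rep=100, seed=0):
    """Repeated random-split evaluation of a function-on-function method.

    X, Y    : (n, m_x), (n, m_y) arrays of predictor and response curves
    fit     : callable(X_tr, Y_tr) -> fitted model (placeholder)
    predict : callable(model, X_te) -> (n_te, m_y) predicted curves
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    errors = []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        tr, te = idx[:n_train], idx[n_train:]
        model = fit(X[tr], Y[tr])
        Y_hat = predict(model, X[te])
        errors.append(np.mean((Y_hat - Y[te]) ** 2))   # grid-based MSPE proxy
    return np.mean(errors), np.std(errors)
```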
Table 3.
The mean squared prediction error for the prediction methods based on the functional central mean subspace (fCMS), the functional central subspace (fCS), the functional linear regression (LIN), the functional linear regression by signal compression (LQ), and the functional additive regression (ADD)
| Dataset | | fCMS | fCS | LIN | LQ | ADD |
|---|---|---|---|---|---|---|
| Canadian weather | mean | 0.0865 | 0.0861 | 0.1134 | 0.1230 | 0.1192 |
| | s.d. | 0.0301 | 0.0300 | 0.0401 | 0.0589 | 0.0492 |
| Bike sharing | mean | 0.1357 | 0.1361 | 0.1995 | 0.1775 | 0.1458 |
| | s.d. | 0.0421 | 0.0415 | 0.0732 | 0.0630 | 0.0545 |
6.3. Bike sharing data.
We next illustrate our methods with a more recent business application, the bike sharing data (Fanaee-T and Gama (2014)). The dataset consists of the hourly counts of total rental bikes for both casual and registered users, along with weather related information, such as the hourly temperature, precipitation, wind speed, and humidity. The observations were recorded from the Capital Bikeshare system in Washington, DC, every day from January 1, 2011 to December 31, 2012. The data is available from https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. One of the analysis goals is to understand how the bike rentals are affected by the temperature on Saturdays. Therefore, we treat the hourly bike rentals as the response function and the hourly temperature as the predictor function. After removing 3 weeks of observations with many missing values, we obtain pairs of curves for 102 Saturdays, which are shown in Figure 4(a)–(b).
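As a rough sketch of how the Saturday curves can be assembled from the UCI file hour.csv (the column names follow the UCI documentation as we understand it, and the rule for discarding days with missing hours is an illustrative stand-in for the authors' removal of 3 weeks):

```python
import numpy as np
import pandas as pd

# Assemble one 24-point response curve (hourly rentals) and one 24-point
# predictor curve (hourly temperature) per Saturday.
hour = pd.read_csv("hour.csv", parse_dates=["dteday"])
sat = hour[hour["dteday"].dt.dayofweek == 5]        # pandas: Saturday == 5

curves_y, curves_x = [], []
for day, grp in sat.groupby("dteday"):
    grp = grp.set_index("hr").reindex(range(24))    # one row per hour
    if grp["cnt"].isna().any():                     # drop days with missing hours
        continue
    curves_y.append(grp["cnt"].to_numpy(float))     # hourly rental counts
    curves_x.append(grp["temp"].to_numpy(float))    # normalized hourly temperature

Y = np.array(curves_y)    # response curves, one per Saturday
X = np.array(curves_x)    # predictor curves, one per Saturday
```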
Fig. 4. Bike sharing data for 102 Saturdays.
We apply the proposed functional SDR methods to this data. For the functional central mean subspace 𝒮E(Y|X), the top 3 sufficient predictors explain 62.8%, 18.4%, and 14.6% of the total variation, respectively, and 95.9% cumulatively. For the functional central subspace 𝒮Y|X, the top 3 sufficient predictors explain 69.3%, 16.2%, and 10.9% of the total variation, respectively, and 96.4% cumulatively. Figure 4(c)–(d) show the estimated bases. The first basis suggests that the temperature between 10 a.m. and midnight has more impact on the bike rentals than the rest of the day, while the second and third bases suggest that the early morning temperature has a larger effect. Moreover, although the estimated bases of 𝒮E(Y|X) and 𝒮Y|X are not numerically identical, they capture very similar patterns, which suggests that the leading estimated bases are most likely related to the change of the conditional mean of the hourly rental counts. In addition, the second and third estimated bases from the two estimations roughly span the same space, with the second estimated basis for 𝒮E(Y|X) corresponding to the sign-flipped third estimated basis for 𝒮Y|X, and the third estimated basis for 𝒮E(Y|X) corresponding to the second estimated basis for 𝒮Y|X. This order switching is likely due to the closeness of the second and third largest eigenvalues.
Next, we apply the postreduction prediction method to estimate the conditional mean of the bike rental function given the top 3 estimated sufficient predictors obtained from the temperature function. Figure 3(b) plots the first sample KL coefficient of the hourly rentals, together with its estimated regression function, against the first and second estimated sufficient predictors. We again interpolate the regression surface. We see that the estimated regression function approximates the sample KL coefficient well. In addition, as the first two sufficient predictors increase, the conditional mean of the first KL coefficient of the bike rentals decreases.
Finally, we perform the out-of-sample prediction analysis by randomly drawing 90 out of the 102 observations as the training samples and using the rest as the testing samples. Table 3 reports the results averaged over 100 replications. For this example, our method based on the functional central mean subspace achieves the best prediction accuracy.
7. Discussion.
In this section, we reiterate and further clarify some key differences between our proposal and the existing gradient-based SDR and functional SDR solutions. We first compare to the gradient-based SDR methods for the random variable case.
The gradient-based linear SDR methods such as Härdle and Stoker (1989), Xia (2007), Xia et al. (2002), Yin and Li (2011) targeted the central mean or central subspaces, which are finite-dimensional subspaces of ℝ^p. In comparison, our functional central mean and central subspaces are defined as projections of Hilbert space-valued random predictors, and can themselves be Hilbert space-valued and infinite-dimensional.
A major step in the gradient-based methods is to estimate the gradient itself. This step, however, poses considerable challenges in the functional setting. Härdle and Stoker (1989) estimated the gradient of the density function; however, the density function does not always exist for random functions. Alternatively, Xia (2007), Xia et al. (2002), Yin and Li (2011) estimated the gradient of the regression function, which requires solving a high-dimensional local linear regression. Extending local linear regression to the functional setting is feasible, but analytically complicated. In contrast, we build our estimator on the Fréchet derivative of the regression functional, which has a closed form. To achieve this, we rely on a key property of the RKHS, namely the interchangeability between the nested kernel function and the Fréchet derivative established in Proposition 2. This simplification is useful for estimating the Fréchet derivative at both the population and sample levels, as well as for deriving the consistency results.
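As a concrete illustration of such a closed form, consider the Gaussian radial basis kernel (a common choice; whether this particular kernel is exactly the one adopted in the paper is an assumption here). For $\kappa(x, x') = \exp(-\gamma\|x - x'\|^2)$ on the predictor Hilbert space, the Fréchet derivative of $x \mapsto \kappa(x, x')$ in a direction $h$ is
$$D_x\kappa(x, x')[h] = -2\gamma\,\langle x - x', h\rangle\,\kappa(x, x'),$$
so, by the Riesz representation, its gradient is the explicit element $-2\gamma\,(x - x')\,\kappa(x, x')$. Consequently, for a regression functional represented as $f(x) = \sum_{i=1}^{n} c_i\,\kappa(x, X_i)$, the gradient is $\nabla f(x) = -2\gamma \sum_{i=1}^{n} c_i\,(x - X_i)\,\kappa(x, X_i)$, with no local smoothing step required.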
We have developed a number of useful properties regarding the Fréchet differentiability of the nested RKHS. For instance, we derive the conditions under which the nested kernel function is continuously Fréchet differentiable in Proposition S1. To show that the Fréchet derivatives of the regression functionals have tractable forms, we illustrate with some popular kernel functions and derive their Fréchet derivatives in Proposition S2 at the population level, and in Proposition S5 at the coordinate level. Moreover, in Lemma S3 we derive a property on the interchangeability of partial Fréchet differentiation on the nested RKHS, which allows us to compute the moments of the Fréchet derivative; see Proposition S4. These results are sufficiently general to apply to problems beyond the SDR framework, and are useful for functional data analysis involving the Fréchet derivative.
We remark that a possible alternative to our functional SDR is to discretize the response function as Y(t1), …, Y(tm) at the observed time points t1, …, tm, then perform dimension reduction by treating Y as an m-dimensional random vector (Hsing (1999), Li, Wen and Zhu (2008)). This solution does not require specifying a working RKHS. However, a major advantage of our approach is that we borrow information across the entire function to compute the KL coefficients, and then recover the entire function using only a few leading coefficients. This greatly alleviates the curse of dimensionality, as the working space has the same dimension as the number of extracted KL coefficients d, rather than the number of observed time points m, and m is usually much larger than d. This is also reflected in our asymptotic convergence rates, which depend on d, not m.
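To make the role of the KL coefficients concrete, the following is a minimal sketch of extracting the d leading sample KL coefficients from curves observed on a common, equally spaced grid; the function name and the simple quadrature are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def kl_coefficients(Y, d, dt=1.0):
    """Approximate the d leading Karhunen-Loeve coefficients of curves
    observed on a common, equally spaced grid.

    Y  : (n, m) array, n curves observed at m time points
    d  : number of leading KL components to keep
    dt : grid spacing, used to approximate the L2 inner product
    """
    n, m = Y.shape
    Y_c = Y - Y.mean(axis=0)                 # center the curves
    cov_op = (Y_c.T @ Y_c) / n * dt          # discretized covariance operator
    evals, evecs = np.linalg.eigh(cov_op)    # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:d]        # d leading components
    phi = evecs[:, idx] / np.sqrt(dt)        # eigenfunctions, L2-normalized
    scores = Y_c @ phi * dt                  # KL coefficients <Y_i - mean, phi_k>
    return scores, phi, evals[idx]

# Example: 102 curves on a 24-point hourly grid, keeping d = 3 components.
rng = np.random.default_rng(0)
scores, phi, lam = kl_coefficients(rng.standard_normal((102, 24)), d=3)
print(scores.shape)   # (102, 3)
```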
We next compare with the existing functional SDR methods, in particular Li and Song (2017).
Although Li and Song (2017) has shown the existence of the nonlinear functional subspace, our Theorem 1 on the existence of the functional central mean and central subspaces cannot be directly inferred from their result. To establish the existence, we introduce the concept of the M-set in the functional space, a construction that is not available in their work.
A major difference between our work and Li and Song (2017) is that, their estimator was based on the inverse regression of X | Y, whereas our estimator is built on the forward regression of Y | X. The inverse regression approach has been taken by many other functional SDR estimators, for example, Hsing and Ren (2009), Jiang, Yu and Wang (2014), Wang, Lin and Zhang (2013), Wang et al. (2015), Yao, Lei and Wu (2015). Such a difference leads to completely different estimation approaches, theoretical analyses, as well as required conditions. For instance, the generalized sliced average variance estimator of Li and Song (2017) and all other inverse regression based functional SDR methods required some version of the linearity or constant variance conditions to establish the unbiasedness or exhaustiveness. In contrast, our method achieves the unbiasedness and exhaustiveness without imposing such distributional assumptions.
Another major difference is that Li and Song (2017) targeted nonlinear SDR, whereas we aim at linear SDR, as we have clarified after Theorem 1. Each has its own strengths and limitations; see Li ((2018a), Chapter 14) for more comparison. However, simply using a linear kernel in Li and Song (2017) would not produce the same result as our method, because the two methods adopt different approaches, namely inverse versus forward regression.
Compared to the estimator of Li and Song (2017), which was built solely on the functional regression operator, our estimator requires both the functional regression operator and another key element, the Fréchet derivative. Consequently, we have developed a different set of tools regarding the Fréchet differentiability of the nested RKHS; none of those results are available in Li and Song (2017).
A key novelty of our asymptotic analysis is that we allow the number of sufficient predictors to diverge with the sample size. Li and Song ((2017), Corollary 3) showed pointwise consistency, but assumed a fixed number of sufficient predictors. In comparison, in Theorem 7 we establish uniform consistency with a diverging number of sufficient predictors, which is considerably more challenging than the analysis in Li and Song (2017). We require additional technical tools, such as Lemma 2, to deal with the perturbation of the sample linear operators. Moreover, our asymptotic framework needs to handle the consistency of the Fréchet derivative estimation; for example, in the proof of Theorem 6, we derive the rate of convergence for the Fréchet derivative, based on which we establish the consistency of our average Fréchet derivative estimator.
In summary, our proposal is far from an incremental extension of the existing SDR methods. We believe it is important and useful for both sufficient dimension reduction and functional data analysis in general.
Acknowledgments.
The authors would like to thank the three anonymous referees, the Associate Editor and the Editor for their constructive comments that improved the quality of this paper.
Funding.
Lee’s research was partially supported by the NSF Grant CIF-2102243, and the Seed Funding grant from Fox School of Business, Temple University.
Li’s research was partially supported by the NSF Grant CIF-2102227, and the NIH Grants R01AG061303, R01AG062542 and R01AG034570.
SUPPLEMENTARY MATERIAL
Supplementary appendix for “Functional sufficient dimension reduction through average Fréchet derivatives” (DOI: 10.1214/21-AOS2131SUPP; .pdf). The supplementary appendix collects detailed discussions of the regularity assumptions, some additional results regarding the estimation, asymptotic theory and simulations, and all the technical proofs.
REFERENCES
- Amini AA and Wainwright MJ (2012). Sampled forms of functional PCA in reproducing kernel Hilbert spaces. Ann. Statist. 40 2483–2510. MR3097610 10.1214/12-AOS1033
- Conway JB (2010). A Course in Functional Analysis, 2nd ed. Springer, New York.
- Cook RD and Li B (2002). Dimension reduction for conditional mean in regression. Ann. Statist. 30 455–474. MR1902895 10.1214/aos/1021379861
- Cook RD and Weisberg S (1991). Discussion of “Sliced inverse regression for dimension reduction”. J. Amer. Statist. Assoc. 86 328–332.
- Fanaee-T H and Gama J (2014). Event labeling combining ensemble detectors and background knowledge. Prog. Artif. Intell. 2 113–127. 10.1007/s13748-013-0040-3
- Ferré L and Yao AF (2003). Functional sliced inverse regression analysis. Statistics 37 475–488. MR2022235 10.1080/0233188031000112845
- Ferré L and Yao A-F (2005). Smoothed functional inverse regression. Statist. Sinica 15 665–683. MR2233905
- Fukumizu K, Bach FR and Jordan MI (2009). Kernel dimension reduction in regression. Ann. Statist. 37 1871–1905. MR2533474 10.1214/08-AOS637
- Fukumizu K and Leng C (2014). Gradient-based kernel dimension reduction for regression. J. Amer. Statist. Assoc. 109 359–370. MR3180569 10.1080/01621459.2013.838167
- Hardle W (1989). Investigating smooth multiple regression by the method of average derivatives. J. Amer. Statist. Assoc. 84 986–995.
- Härdle W and Stoker TM (1989). Investigating smooth multiple regression by the method of average derivatives. J. Amer. Statist. Assoc. 84 986–995. MR1134488
- Hsing T (1999). Nearest neighbor inverse regression. Ann. Statist. 27 697–731. MR1714711 10.1214/aos/1018031213
- Hsing T and Ren H (2009). An RKHS formulation of the inverse regression dimension-reduction problem. Ann. Statist. 37 726–755. MR2502649 10.1214/07-AOS589
- Jiang C-R, Yu W and Wang J-L (2014). Inverse regression for longitudinal data. Ann. Statist. 42 563–591. MR3210979 10.1214/13-AOS1193
- Kim JS, Staicu A-M, Maity A, Carroll RJ and Ruppert D (2018). Additive function-on-function regression. J. Comput. Graph. Statist. 27 234–244. MR3788315 10.1080/10618600.2017.1356730
- Lee Y-J and Huang S-Y (2007). Reduced support vector machines: A statistical theory. IEEE Trans. Neural Netw. 18 1–13. 10.1109/TNN.2006.883722
- Lee K-Y and Li L (2022). Supplement to “Functional sufficient dimension reduction through average Fréchet derivatives.” 10.1214/21-AOS2131SUPP
- Li K-C (1991). Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 86 316–342. MR1137117
- Li B (2018a). Sufficient Dimension Reduction: Methods and Applications with R. Monographs on Statistics and Applied Probability 161. CRC Press, Boca Raton, FL. MR3838449 10.1201/9781315119427
- Li B (2018b). Linear operator-based statistical analysis: A useful paradigm for big data. Canad. J. Statist. 46 79–103. MR3767167 10.1002/cjs.11329
- Li B and Song J (2017). Nonlinear sufficient dimension reduction for functional data. Ann. Statist. 45 1059–1095. MR3662448 10.1214/16-AOS1475
- Li B and Wang S (2007). On directional regression for dimension reduction. J. Amer. Statist. Assoc. 102 997–1008. MR2354409 10.1198/016214507000000536
- Li B, Wen S and Zhu L (2008). On a projective resampling method for dimension reduction with multivariate responses. J. Amer. Statist. Assoc. 103 1177–1186. MR2462891 10.1198/016214508000000445
- Li B, Zha H and Chiaromonte F (2005). Contour regression: A general approach to dimension reduction. Ann. Statist. 33 1580–1616. MR2166556 10.1214/009053605000000192
- Lin Q, Zhao Z and Liu JS (2019). Sparse sliced inverse regression via Lasso. J. Amer. Statist. Assoc. 114 1726–1739. MR4047295 10.1080/01621459.2018.1520115
- Lin Q, Li X, Huang D and Liu JS (2017). On the optimality of sliced inverse regression in high dimensions.
- Luo R and Qi X (2017). Function-on-function linear regression by signal compression. J. Amer. Statist. Assoc. 112 690–705. MR3671763 10.1080/01621459.2016.1164053
- Luo R and Qi X (2019). Interaction model and model selection for function-on-function regression. J. Comput. Graph. Statist. 28 309–322. MR3974882 10.1080/10618600.2018.1514310
- Ma Y and Zhu L (2012). A semiparametric approach to dimension reduction. J. Amer. Statist. Assoc. 107 168–179. MR2949349 10.1080/01621459.2011.646925
- Ma Y and Zhu L (2013). Efficient estimation in sufficient dimension reduction. Ann. Statist. 41 250–268. MR3059417 10.1214/12-AOS1072
- Müller H-G and Yao F (2008). Functional additive models. J. Amer. Statist. Assoc. 103 1534–1544. MR2504202 10.1198/016214508000000751
- Ramsay JO and Silverman BW (2005). Functional Data Analysis, 2nd ed. Springer Series in Statistics. Springer, New York. MR2168993
- Reimherr M, Sriperumbudur B and Taoufik B (2018). Optimal prediction for additive function-on-function regression. Electron. J. Stat. 12 4571–4601. MR3893421 10.1214/18-EJS1505
- Steinwart I and Christmann A (2008). Support Vector Machines. Springer, New York.
- Sun X, Du P, Wang X and Ma P (2018). Optimal penalized function-on-function regression under a reproducing kernel Hilbert space framework. J. Amer. Statist. Assoc. 113 1601–1611. MR3902232 10.1080/01621459.2017.1356320
- Wang J-L, Chiou J-M and Müller H-G (2016). Functional data analysis. Annu. Rev. Stat. Appl. 3 257–295.
- Wang G, Lin N and Zhang B (2013). Functional contour regression. J. Multivariate Anal. 116 1–13. MR3049886 10.1016/j.jmva.2012.11.005
- Wang G, Zhou Y, Feng X-N and Zhang B (2015). The hybrid method of FSIR and FSAVE for functional effective dimension reduction. Comput. Statist. Data Anal. 91 64–77. MR3368006 10.1016/j.csda.2015.05.011
- Weidmann J (1980). Linear Operators in Hilbert Spaces. Graduate Texts in Mathematics 68. Springer, New York–Berlin. MR0566954
- Xia Y (2007). A constructive approach to the estimation of dimension reduction directions. Ann. Statist. 35 2654–2690. MR2382662 10.1214/009053607000000352
- Xia Y, Tong H, Li WK and Zhu L-X (2002). An adaptive estimation of dimension reduction space. J. R. Stat. Soc. Ser. B. Stat. Methodol. 64 363–410. MR1924297 10.1111/1467-9868.03411
- Yao F, Lei E and Wu Y (2015). Effective dimension reduction for sparse functional data. Biometrika 102 421–437. MR3371014 10.1093/biomet/asv006
- Yao F, Müller H-G and Wang J-L (2005). Functional linear regression analysis for longitudinal data. Ann. Statist. 33 2873–2903. MR2253106 10.1214/009053605000000660
- Yin X and Li B (2011). Sufficient dimension reduction based on an ensemble of minimum average variance estimators. Ann. Statist. 39 3392–3416. MR3012413 10.1214/11-AOS950
- Yin X, Li B and Cook RD (2008). Successive direction extraction for estimating the central subspace in a multiple-index regression. J. Multivariate Anal. 99 1733–1757. MR2444817 10.1016/j.jmva.2008.01.006