Published in final edited form as: J Multivar Anal. 2024 Feb 27;202:105302. doi: 10.1016/j.jmva.2024.105302

Nonlinear sufficient dimension reduction for distribution-on-distribution regression

Qi Zhang a, Bing Li a, Lingzhou Xue a,*

Abstract

We introduce a new approach to nonlinear sufficient dimension reduction in cases where both the predictor and the response are distributional data, modeled as members of a metric space. Our key step is to build universal (cc-universal) kernels on the metric spaces, which results in reproducing kernel Hilbert spaces for the predictor and response that are rich enough to characterize the conditional independence that determines sufficient dimension reduction. For univariate distributions, we construct the universal kernel using the Wasserstein distance, while for multivariate distributions, we resort to the sliced Wasserstein distance. The sliced Wasserstein distance ensures that the metric space possesses topological properties similar to those of the Wasserstein space, while also offering significant computational benefits. Numerical results based on synthetic data show that our method outperforms possible competing methods. The method is also applied to several data sets, including fertility and mortality data and Calgary temperature data.

Keywords: Distributional data, RKHS, Sliced Wasserstein distance, Universal kernel, Wasserstein distance, 62G08, 62H12

1. Introduction

Complex data objects such as random elements in general metric spaces are commonly encountered in modern statistical applications. However, these data objects do not conform to the operation rules of Hilbert spaces and lack important properties such as inner products and orthogonality, making them difficult to analyze using traditional multivariate and functional data analysis methods. An important example of metric space-valued data is distributional data, which can be modeled as random probability measures satisfying specific regularity conditions. Recently, there has been increasing interest in this type of data. Petersen and Müller [43] extended classical regression to Fréchet regression, making it possible to regress univariate distributions on scalar or vector predictors. Fan and Müller [14] extended the Fréchet regression framework to the case of multivariate response distributions. Besides scalar- or vector-valued predictors, the relationship between two distributions is also becoming increasingly important. Petersen and Müller [42] proposed the log quantile density (LQD) transformation to transform the densities of these distributions to unconstrained functions in the Hilbert space L_2. Chen et al. [9] further applied function-to-function linear regression to the LQD transformations of distributions and mapped the fitted responses back to the Wasserstein space through the inverse LQD transformation. Chen et al. [8] proposed a distribution-on-distribution regression model by adopting the Wasserstein metric and showed that it works better than the transformation method in Chen et al. [9]. Recently, Bhattacharjee et al. [4] proposed a global nonlinear Fréchet regression model for random objects via weak conditional expectation. In practical applications, distribution-on-distribution regression has been utilized for analyzing mortality distributions across different countries or regions [20], distributions of fMRI brain imaging signals [42], and distributions of daily temperature and humidity [40], among others.

Distribution-on-distribution regression encounters challenges similar to those of classical regression, including the need for exploratory data analysis, data visualization, and improved estimation accuracy through dimension reduction. In classical regression, sufficient dimension reduction (SDR) has proven to be an effective tool for addressing these challenges. To set the stage, we outline the classical SDR framework. Let X be a p-dimensional random vector in ℝ^p and Y a random variable in ℝ. Linear SDR aims to find a subspace 𝒮 of ℝ^p such that Y ⊥⊥ X | P_𝒮 X, where P_𝒮 is the projection onto 𝒮 with respect to the usual inner product in ℝ^p. As an extension of linear SDR, [28] and [25] proposed the general theory of nonlinear sufficient dimension reduction, which seeks a set of nonlinear functions f_1(X), …, f_d(X) in a Hilbert space such that Y ⊥⊥ X | f_1(X), …, f_d(X).

In the last two decades, the SDR framework has undergone constant evolution to adapt to increasingly complex data structures. Researchers have extended SDR to functional data [16, 22, 30, 31], tensorial data [12, 29], and forecasting with large panel data [15, 35, 53]. Most recently, Ying and Yu [52], Zhang et al. [54], and Dong and Wu [13] have developed SDR methods for cases where the response takes values in a metric space while the predictor lies in Euclidean space.

Let X and Y be random distributions defined on M ⊆ ℝ^r, with finite p-th moments (p ≥ 1). We do allow X and Y to be random vectors, but our focus will be on the case where they are distributions. Modelling X and Y as random elements in metric spaces (Ω_X, d_X) and (Ω_Y, d_Y), we seek nonlinear functions f_1, …, f_d defined on Ω_X such that the random measures Y and X are conditionally independent given f_1(X), …, f_d(X). To guarantee the theoretical properties of the nonlinear SDR methods and to facilitate the estimation procedure, we assume f_1, …, f_d reside in a reproducing kernel Hilbert space (RKHS). While the nonlinear SDR problem can be formulated in much the same way as that for multivariate and functional data, the main new element in this theory that still requires substantial effort is the construction of positive definite and universal kernels on Ω_X and Ω_Y. These are needed for constructing unbiased and exhaustive estimators for the dimension reduction problem [27]. We achieve this purpose with specific choices of metric, namely the Wasserstein distance and the sliced Wasserstein distance: we will show how to construct positive definite and universal kernels, and the RKHS generated from them, to achieve nonlinear SDR for distributional data.

While acknowledging the recent independent work of Virta et al. [50], who proposed a nonlinear SDR method for metric space-valued data, our work makes several novel contributions. First, we focus on distributional data and consider a practical setting where only discrete samples from each distribution are available instead of the distributions themselves, while Virta et al. [50] only illustrated the method with torus data, positive definite matrices, and compositional data. Second, we explicitly construct universal kernels over the space of distributions, which results in an RKHS that is rich enough to characterize the conditional independence. In contrast, Virta et al. [50] only assumed that the RKHS is dense in the L_2 space, without verifying this assumption.

The rest of the paper is organized as follows. Section 2 defines the general framework of nonlinear sufficient dimension reduction for distributional data. Section 3 shows how to construct RKHSs on the spaces of univariate and multivariate distributions, respectively. Section 4 proposes the generalized sliced inverse regression methods for distributional data. Section 5 establishes the convergence rates of the proposed methods for both the fully observed setting and the discretely observed setting. Simulation results are presented in Section 6 to show the numerical performance of the proposed methods. In Section 7, we analyze two real applications to human mortality and fertility data and Calgary extreme temperature data, demonstrating the usefulness of our methods. All proofs are presented in Section 9.

2. Nonlinear SDR for Distributional Data

We consider the setting of distribution-on-distribution regression. Let (Ω, ℱ, P) be a probability space. Let M be a subset of ℝ^r and ℬ(M) the Borel σ-field on M. Let 𝒫_p(M) be the set of Borel probability measures on (M, ℬ(M)) that have finite p-th moment and that are dominated by the Lebesgue measure on ℝ^r. We let Ω_X and Ω_Y be nonempty subsets of 𝒫_p(M) equipped with metrics d_X and d_Y, respectively. We let ℬ_X and ℬ_Y be the Borel σ-fields generated by the open sets in the metric spaces (Ω_X, d_X) and (Ω_Y, d_Y). Let (X, Y) be a random element mapping from Ω to Ω_X × Ω_Y, measurable with respect to the product σ-field ℬ_X × ℬ_Y. We denote the marginal distributions of X and Y by P_X and P_Y, respectively, and the conditional distributions of Y | X and X | Y by P_{Y|X} and P_{X|Y}.

Let σ(X) be the sub σ-field of ℱ generated by X, that is, σ(X) = X^{-1}(ℬ_X). Following the terminology in [27], a sub σ-field 𝒢 of σ(X) is called a sufficient dimension reduction σ-field, or simply a sufficient σ-field, if Y ⊥⊥ X | 𝒢. In other words, 𝒢 captures all the regression information of Y on X. As shown in Lee et al. [25], if the family of conditional probability measures {P_{X|Y}(· | y) : y ∈ Ω_Y} is dominated by a σ-finite measure, then the intersection of all sufficient σ-fields is still a sufficient σ-field. This minimal sufficient σ-field is called the central σ-field for Y versus X, denoted by 𝒢_{Y|X}. By definition, the central σ-field captures all the regression information of Y on X and is the target that we aim to estimate.

Let ℋ_X be a Hilbert space of real-valued functions defined on Ω_X. We convert estimating the central σ-field into estimating a subspace of ℋ_X. Specifically, we assume that the central σ-field is generated by a finite set of functions f_1, …, f_d in ℋ_X, which can be expressed as

Y \perp\!\!\!\perp X \mid f_1(X),\ldots,f_d(X). \quad (1)

For any sub-σ-field 𝒢 of σ(X), let ℋ_X(𝒢) denote the subspace of ℋ_X spanned by the functions f such that f(X) is 𝒢-measurable, that is,

\mathcal{H}_X(\mathcal{G}) = \overline{\operatorname{span}}\{f\in\mathcal{H}_X : f(X)\ \text{is measurable with respect to}\ \mathcal{G}\}. \quad (2)

We define the central class as 𝔖_{Y|X} = ℋ_X(𝒢_{Y|X}) following (2). We say that a subspace 𝔖 of ℋ_X is unbiased if it is contained in 𝔖_{Y|X} and consistent if it is equal to 𝔖_{Y|X}. To recover the central class 𝔖_{Y|X} consistently by an extension of Sliced Inverse Regression [32], we need to assume the central σ-field is complete [25].

Definition 1. A sub σ-field 𝒢 of σ(X) is complete if, for each function f such that f(X) is 𝒢-measurable and E[f(X) | Y] = 0 almost surely P_Y, we have f(X) = 0 almost surely P_X. We say that ℋ_X(𝒢) is a complete class for Y versus X if 𝒢 is a complete σ-field for Y versus X.

Although our theoretical analysis so far does not require ℋ_X and ℋ_Y to be RKHSs, using an RKHS provides a concrete framework for establishing an unbiased and consistent estimator. It also builds a connection between classical linear SDR and nonlinear SDR in the sense that f(x) can be expressed as the inner product ⟨f, κ(·, x)⟩, where κ: Ω_X × Ω_X → ℝ is the reproducing kernel. This inner product is a nonlinear extension of β^⊤X in linear SDR. In the next section, we will describe how to construct RKHSs for univariate and multivariate distributions.

3. Construction of RKHS

A common approach to constructing a reproducing kernel is to use a classical radial basis function φ(‖x − c‖) (such as the Gaussian radial basis kernel) and substitute the Euclidean distance with the distance in the metric space. However, not every metric can be used in this way to produce positive definite kernels. We show that metric spaces of negative type can yield positive definite kernels of the form φ(d(x, c)). Moreover, as will be seen in Proposition 3 and the discussion following it, in order to achieve an unbiased and consistent estimation of the central class 𝔖_{Y|X}, we need the kernels for X and Y to be cc-universal (Micchelli et al. [37]). For ease of reference, we use the term "universal" to refer to cc-universal kernels. We select the Wasserstein metric and the sliced Wasserstein metric for our work, as they possess the desired properties for constructing universal kernels.

3.1. Wasserstein kernel for univariate distributions

For probability measures μ1 and μ2 in 𝒫p(M), the p-Wasserstein distance between μ1 and μ2 is defined as the solution of the Kantorovich transportation problem [49]:

W_p(\mu_1,\mu_2) = \Big(\inf_{\gamma\in\Gamma(\mu_1,\mu_2)}\int_{M\times M}\|x-y\|^p\,d\gamma(x,y)\Big)^{1/p},

where ‖·‖ is the Euclidean norm and Γ(μ_1, μ_2) is the space of joint probability measures on (M × M, ℬ(M) × ℬ(M)) with marginals μ_1 and μ_2. When M ⊆ ℝ, the p-Wasserstein distance has the following explicit quantile representation:

W_p(\mu_1,\mu_2) = \Big(\int_0^1\big|F_{\mu_1}^{-1}(s)-F_{\mu_2}^{-1}(s)\big|^p\,ds\Big)^{1/p},

where F_{μ_1}^{-1} and F_{μ_2}^{-1} denote the quantile functions of μ_1 and μ_2, respectively. The set 𝒫_p(M) endowed with the Wasserstein metric W_p is called the Wasserstein metric space and is denoted by 𝒲_p(M). Kolouri et al. [23, Theorem 4] show that the Wasserstein space of absolutely continuous univariate distributions can be isometrically embedded in a Hilbert space, and thus the Gaussian RBF kernel is positive definite.
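As a concrete illustration of the quantile representation, the following minimal Python sketch computes the empirical 2-Wasserstein distance between two univariate samples of equal size via order statistics and evaluates the induced Gaussian-type kernel. The function names and the equal-sample-size assumption are illustrative conveniences, not part of the paper.

```python
import numpy as np

def w2_univariate(x, y):
    """Empirical 2-Wasserstein distance between two equal-size univariate
    samples, using the order-statistic (quantile) representation."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    if x.size != y.size:
        raise ValueError("this sketch assumes equal sample sizes")
    return np.sqrt(np.mean((x - y) ** 2))

def wasserstein_gaussian_kernel(x, y, gamma=1.0):
    """Gaussian-type kernel kappa_G(x, y) = exp(-gamma * W2(x, y)^2)."""
    return np.exp(-gamma * w2_univariate(x, y) ** 2)

# toy usage: two samples representing two univariate distributions
rng = np.random.default_rng(0)
mu1 = rng.beta(2, 1, size=100)
mu2 = rng.beta(2, 3, size=100)
print(w2_univariate(mu1, mu2), wasserstein_gaussian_kernel(mu1, mu2, gamma=0.5))
```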

We now turn to universality. Christmann and Steinwart [10, Theorem 3] showed that if Ω_X is compact and can be continuously embedded in a Hilbert space ℋ by a mapping ρ, then for any analytic function A: ℝ → ℝ whose Taylor series at zero has strictly positive coefficients, the function κ(x, x′) = A(⟨ρ(x), ρ(x′)⟩_ℋ) defines a c-universal kernel on Ω_X. To accommodate the scenarios M = ℝ and M = ℝ^r, we need to go beyond compact metric spaces. For this reason, we use a more general definition of universality that does not require the support of the kernel to be compact, called cc-universality [37, 46, 47]. Let κ_X: Ω_X × Ω_X → ℝ be a positive definite kernel and ℋ_X the RKHS generated by κ_X. For any compact set K ⊆ Ω_X, let ℋ_X(K) be the RKHS generated by {κ_X(·, x) : x ∈ K}. Let C(K) be the class of all continuous functions with respect to the topology of (Ω_X, d_X) restricted to K.

Definition 2. [37] We say that κ_X is universal (cc-universal) if, for any compact set K ⊆ Ω_X, any member f of C(K), and any ϵ > 0, there is an h ∈ ℋ_X(K) such that sup_{x∈K} |f(x) − h(x)| < ϵ.

Let κ_G(x, x′) = exp(−γ W_2²(x, x′)) and κ_L(x, x′) = exp(−γ W_2(x, x′)). The subscripts G and L here refer to "Gaussian" and "Laplacian", respectively. [54] showed that both κ_G and κ_L on a complete and separable metric space that can be isometrically embedded into a Hilbert space are universal. We note that if M is separable and complete, then so is 𝒲_2(M) [41, Proposition 2.2.8, Theorem 2.2.7]. Therefore, we have the following proposition that guarantees the construction of universal kernels on the (possibly non-compact) 𝒲_2(M).

Proposition 1. If M is complete, then κ_G(x, x′) and κ_L(x, x′) are universal kernels on 𝒲_2(M).

By Proposition 1, we construct the Hilbert spaces ℋ_X and ℋ_Y as the RKHSs generated by the Gaussian-type kernel κ_G or the Laplacian-type kernel κ_L. Let L_2(P_X) be the class of square-integrable functions of X under P_X. Let 𝔅 be the set of measurable indicator functions on 𝒲_2(M), that is,

\mathfrak{B} = \{I_B : B\subseteq\mathcal{W}_2(M)\ \text{is measurable}\}.

Recall that a measure P_X on (Ω_X, d) is regular if, for any Borel subset B ⊆ Ω_X and any ε > 0, there are a compact set K ⊆ B and an open set G ⊇ B such that P_X(G ∖ K) < ε. By Zhang et al. [54, Theorem 1], if P_X is a regular measure, ℋ_X is dense in 𝔅, and hence dense in span{𝔅}, which is the space of simple functions. Since span{𝔅} is dense in L_2(P_X), ℋ_X is dense in L_2(P_X).

3.2. Sliced-Wasserstein kernel for multivariate distributions

For multivariate distributions (M ⊆ ℝ^r), the sliced p-Wasserstein distance is obtained by averaging the Wasserstein distances of the projected univariate distributions along randomly picked directions. Let μ_1 and μ_2 be two measures in 𝒫_p(M), where M ⊆ ℝ^r, r > 1. Let 𝕊^{r−1} be the unit sphere in ℝ^r. For θ ∈ 𝕊^{r−1}, let T_θ: ℝ^r → ℝ be the linear transformation x ↦ ⟨θ, x⟩, where ⟨·,·⟩ is the Euclidean inner product. Let μ_1 ∘ T_θ^{-1} and μ_2 ∘ T_θ^{-1} be the measures induced by the mapping T_θ. The sliced p-Wasserstein distance between μ_1 and μ_2 is defined by

SW_p(\mu_1,\mu_2) = \Big(\int_{\mathbb{S}^{r-1}}W_p^p(\mu_1\circ T_\theta^{-1},\ \mu_2\circ T_\theta^{-1})\,d\theta\Big)^{1/p}.

It can be verified that SW_p is indeed a metric. We denote the metric space (𝒫_p(M), SW_p) by 𝒯𝒲_p(M) and call it the sliced Wasserstein space. It has been shown (for example, Bayraktar and Guo [2]) that the sliced Wasserstein metric is weaker than the Wasserstein metric, that is, for all μ_1, μ_2 ∈ 𝒫_p(M) with M ⊆ ℝ^r, SW_p(μ_1, μ_2) ≤ W_p(μ_1, μ_2). This relation implies two topological properties of the sliced Wasserstein space that are useful to us, which can be derived from the topological properties of the p-Wasserstein space established in Ambrosio et al. [1, Proposition 7.1.5] and Panaretos and Zemel [41, Chapter 2.2].

Proposition 2. If M is a subset of ℝ^r, then 𝒯𝒲_p(M) is complete and separable. Furthermore, if M ⊆ ℝ^r is compact, then 𝒯𝒲_p(M) is compact.

With p = 2, [23] show that the square of the sliced Wasserstein distance is conditionally negative definite, and hence that the Gaussian RBF kernel exp(−γ SW_2²(x, x′)) is a positive definite kernel. The next lemma shows that the Gaussian RBF kernel and the Laplacian RBF kernel based on the sliced Wasserstein distance are, in fact, universal kernels.

Lemma 1. If M ⊆ ℝ^r (r > 1) is complete, then both κ_G(x, x′) = exp(−γ SW_2²(x, x′)) and κ_L(x, x′) = exp(−γ SW_2(x, x′)) are universal kernels on 𝒯𝒲_2(M). Furthermore, if P_X and P_Y are regular measures, ℋ_X and ℋ_Y are dense in L_2(P_X) and L_2(P_Y), respectively.

It is worth mentioning that in a recent study, Meunier et al. [36] demonstrated the universality of the sliced Wasserstein kernel. However, our findings extend beyond the scope of their work. Specifically, our results apply to scenarios where M is non-compact, such as M = ℝ^d, by introducing cc-universality as defined in Definition 2.

4. Generalized Sliced Inverse Regression for Distributional Data

This section extends the generalized sliced inverse regression (GSIR) [25] to distributional data. We call the extension to the univariate distribution setting Wasserstein GSIR, or W-GSIR, and the extension to the multivariate distribution setting sliced-Wasserstein GSIR, or SW-GSIR.

4.1. Distributional GSIR and the role of universal kernel

To model the nonlinear relationships between random elements, we introduce the covariance operator in the RKHS, a concept similar to the constructions in [19, 25], [27, Chapter 12.2] and [30]. Let ℋ_1 and ℋ_2 be two arbitrary Hilbert spaces, and let ℬ(ℋ_1, ℋ_2) denote the class of bounded linear operators from ℋ_1 to ℋ_2. If ℋ_1 = ℋ_2 = ℋ, we use ℬ(ℋ) to denote ℬ(ℋ, ℋ). For any operator T ∈ ℬ(ℋ_1, ℋ_2), we use T* to denote the adjoint operator of T, ker(T) to denote the kernel of T, ran(T) to denote the range of T, and ran̄(T) to denote the closure of the range of T. Given two members f and g of ℋ, the tensor product f ⊗ g is the operator on ℋ such that (f ⊗ g)h = f⟨g, h⟩ for all h ∈ ℋ. It is important to note that the adjoint operator of f ⊗ g is g ⊗ f.

We define E[κ(·, X)], the mean element of X in ℋ_X, as the unique element of ℋ_X such that

\langle f, E[\kappa(\cdot,X)]\rangle_{\mathcal{H}_X} = E\langle f,\kappa(\cdot,X)\rangle_{\mathcal{H}_X} \quad (3)

for all f ∈ ℋ_X. Define the bounded linear operator E[κ(·, X) ⊗ κ(·, X)], the second-moment operator of X in ℋ_X, as the unique element of ℬ(ℋ_X) such that, for all f and g in ℋ_X,

\langle f, E[\kappa(\cdot,X)\otimes\kappa(\cdot,X)]g\rangle_{\mathcal{H}_X} = E\langle f,(\kappa(\cdot,X)\otimes\kappa(\cdot,X))g\rangle_{\mathcal{H}_X}. \quad (4)

We write μ_X = E[κ(·, X)] and M_XX = E[κ(·, X) ⊗ κ(·, X)]. For the Gaussian RBF kernel and the Laplacian RBF kernel based on the Wasserstein or sliced Wasserstein distance, κ(X, X) is bounded and E[κ(X, X)] is finite. By the Cauchy–Schwarz inequality and Jensen's inequality, the quantities on the right-hand sides of (3) and (4) are well defined. The existence and uniqueness of μ_X and M_XX are guaranteed by the Riesz representation theorem. We then define the covariance operator Σ_XX as M_XX − μ_X ⊗ μ_X. Then, for all f, g ∈ ℋ_X, we have cov(f(X), g(X)) = ⟨f, Σ_XX g⟩_{ℋ_X}. Similarly, we can define μ_Y ∈ ℋ_Y, Σ_YY ∈ ℬ(ℋ_Y), Σ_XY ∈ ℬ(ℋ_Y, ℋ_X), and Σ_YX ∈ ℬ(ℋ_X, ℋ_Y). By definition, both Σ_XX and Σ_YY are self-adjoint, and Σ_XY* = Σ_YX.

To define the regression operators ΣXX1ΣXY and ΣYY1ΣYX, we make the following assumptions. Similar regularity conditions are assumed in [25, 27, 28].

Assumption 1.

  (i) ker Σ_XX = {0} and ker Σ_YY = {0}.

  (ii) ran Σ_XY ⊆ ran Σ_XX and ran Σ_YX ⊆ ran Σ_YY.

  (iii) The operators Σ_XX^{-1}Σ_XY and Σ_YY^{-1}Σ_YX are compact.

Condition (i) amounts to resetting the domains of Σ_XX and Σ_YY to (ker Σ_XX)^⊥ and (ker Σ_YY)^⊥, respectively. This is motivated by the fact that members of ker Σ_XX and ker Σ_YY are constants almost surely, which are irrelevant when we consider independence. Since Σ_XX and Σ_YY are self-adjoint operators, this assumption is equivalent to resetting ℋ_X to the closure of ran Σ_XX and ℋ_Y to the closure of ran Σ_YY, respectively. Condition (i) also implies that the mappings Σ_XX and Σ_YY are invertible, although, as we will see, Σ_XX^{-1} and Σ_YY^{-1} are unbounded operators.

Condition (ii) guarantees that ran Σ_XY ⊆ dom Σ_XX^{-1} = ran Σ_XX and ran Σ_YX ⊆ dom Σ_YY^{-1} = ran Σ_YY, which is necessary to define the regression operators Σ_XX^{-1}Σ_XY and Σ_YY^{-1}Σ_YX. By Proposition 12.5 of [27], ran Σ_YX is contained in the closure of ran Σ_YY and ran Σ_XY is contained in the closure of ran Σ_XX. Thus, the above assumption is not very strong.

As interpreted in Section 13.1 of [27], Condition (iii) in Assumption 1 is akin to a smoothness condition. Even though the inverse mappings Σ_XX^{-1} and Σ_YY^{-1} are well defined, since Σ_XX and Σ_YY are Hilbert–Schmidt operators [18], these inverses are unbounded operators. However, these unbounded operators never appear by themselves but are always accompanied by operators multiplied from the right. Condition (iii) assumes that the composite operators Σ_XX^{-1}Σ_XY and Σ_YY^{-1}Σ_YX are compact. This requires, for example, that Σ_YX send all incoming functions into the low-frequency range of the eigenspaces of Σ_YY, those with relatively large eigenvalues. That is, Σ_YX and Σ_XY are smooth in the sense that their outputs are low-frequency components of Σ_YY or Σ_XX.

With Assumption 1 and universal kernels κ_X and κ_Y, the range of the regression operator Σ_XX^{-1}Σ_XY is contained in the central class 𝔖_{Y|X}. Furthermore, if the central class 𝔖_{Y|X} is also complete, it is fully covered by the range of Σ_XX^{-1}Σ_XY. The next proposition adapts the main result of Chapter 13 of [27] to the current context.

Proposition 3. If Assumption 1 holds, ℋ_X is dense in L_2(P_X), and ℋ_Y is dense in L_2(P_Y), then ran(Σ_XX^{-1}Σ_XY) ⊆ 𝔖_{Y|X}. If, furthermore, 𝔖_{Y|X} is complete, then ran(Σ_XX^{-1}Σ_XY) = 𝔖_{Y|X}.

The universal kernels κ_X and κ_Y proposed in Section 3 guarantee that ℋ_X and ℋ_Y are dense in L_2(P_X) and L_2(P_Y), respectively.

4.2. Estimation for distributional GSIR

By Proposition 3, for any invertible operator A, the closure of ran(Σ_XX^{-1}Σ_XY A Σ_YX Σ_XX^{-1}) is contained in 𝔖_{Y|X}. Two common choices are A = I and A = Σ_YY^{-1}. When we take A = Σ_YY^{-1}, the procedure is a nonlinear parallel of SIR in the sense that we replace the inner product in the Euclidean space by the inner product in the RKHS ℋ_X. For ease of reference, we refer to the method using A = I as W-GSIR1 or SW-GSIR1, and the method using A = Σ_YY^{-1} as W-GSIR2 or SW-GSIR2. To estimate the closure of ran(Σ_XX^{-1}Σ_XY A Σ_YX Σ_XX^{-1}), we successively solve the following generalized eigenvalue problem:

\text{maximize}\ \langle f, \Sigma_{XY}A\Sigma_{YX}f\rangle_{\mathcal{H}_X}\quad\text{subject to}\quad \langle f, \Sigma_{XX}f\rangle_{\mathcal{H}_X}=1,\ f\perp\operatorname{span}\{f_1,\ldots,f_{k-1}\},\qquad k\in\{1,2,\ldots,d\},

where f_1, …, f_{k−1} are the solutions to this constrained optimization problem from the previous k − 1 steps.

At the sample level, we estimate Σ_XX, Σ_YY, Σ_XY and Σ_YX by replacing the expectations E(·) with sample moments E_n(·) whenever possible. For example, suppose we are given an i.i.d. sample (X_1, Y_1), …, (X_n, Y_n) of (X, Y). We estimate Σ_XX by

\hat\Sigma_{XX} = E_n[\kappa(\cdot,X)\otimes\kappa(\cdot,X)]-E_n[\kappa(\cdot,X)]\otimes E_n[\kappa(\cdot,X)].

The sample estimates Σ̂_YY, Σ̂_XY and Σ̂_YX of Σ_YY, Σ_XY and Σ_YX are similarly defined. The subspaces ran̄ Σ̂_XX and ran̄ Σ̂_YY are spanned by the sets 𝔅_X = {κ(·, X_i) − E_n κ(·, X) : i = 1, …, n} and 𝔅_Y = {κ(·, Y_i) − E_n κ(·, Y) : i ∈ {1, …, n}}, respectively. Let K_X and K_Y denote the n × n matrices whose (i, j)-th entries are κ(X_i, X_j) and κ(Y_i, Y_j), respectively, and let Q denote the projection matrix I_n − 1_n 1_n^⊤/n. For two Hilbert spaces ℋ_1, ℋ_2 with spanning systems 𝔅_1 and 𝔅_2, and a linear operator A: ℋ_1 → ℋ_2, we use the notation _{𝔅_2}[A]_{𝔅_1} to represent the coordinate representation of A relative to the spanning systems 𝔅_1 and 𝔅_2. We then have the following coordinate representations of the covariance operators:

{}_{\mathfrak{B}_X}[\hat\Sigma_{XX}]_{\mathfrak{B}_X}=n^{-1}G_X,\qquad {}_{\mathfrak{B}_Y}[\hat\Sigma_{YX}]_{\mathfrak{B}_X}=n^{-1}G_X,\qquad {}_{\mathfrak{B}_X}[\hat\Sigma_{XY}]_{\mathfrak{B}_Y}=n^{-1}G_Y,\qquad {}_{\mathfrak{B}_Y}[\hat\Sigma_{YY}]_{\mathfrak{B}_Y}=n^{-1}G_Y,

where G_X = QK_XQ and G_Y = QK_YQ. For details, see Section 12.4 of [27].

When A=In, the generalized eigenvalue problem becomes

\max_{f}\ [f]_{\mathfrak{B}_X}^{\top}G_X G_Y G_X[f]_{\mathfrak{B}_X}\quad\text{subject to}\quad [f]_{\mathfrak{B}_X}^{\top}G_X^2[f]_{\mathfrak{B}_X}=1.

Let v = G_X[f]_{𝔅_X}. To avoid overfitting, we solve this equation for [f]_{𝔅_X} via Tychonoff regularization, that is, [f]_{𝔅_X} = (G_X + η_X I_n)^{-1} v, where η_X is a tuning constant. The problem is then transformed into finding the eigenvectors v_1, …, v_d of the matrix

\Lambda_{\mathrm{GSIR}}^{(1)} = (G_X+\eta_X I_n)^{-1}G_X G_Y G_X(G_X+\eta_X I_n)^{-1},

and then setting [f_j]_{𝔅_X} = (G_X + η_X I_n)^{-1} v_j, j ∈ {1, …, d}. In practice, we use η_X = ε_X λ_max(G_X), where λ_max(G_X) is the largest eigenvalue of G_X and ε_X is a tuning parameter.

For the second choice A = Σ̂_YY^{-1}, we also use the regularized inverse (G_Y + η_Y I_n)^{-1}, leading to the following generalized eigenvalue problem:

\max_{f}\ [f]_{\mathfrak{B}_X}^{\top}G_X G_Y(G_Y+\eta_Y I_n)^{-1}G_X[f]_{\mathfrak{B}_X}\quad\text{subject to}\quad [f]_{\mathfrak{B}_X}^{\top}G_X^2[f]_{\mathfrak{B}_X}=1.

To solve this problem, we first compute the eigenvectors v1,,vd of the matrix

\Lambda_{\mathrm{GSIR}}^{(2)} = (G_X+\eta_X I_n)^{-1}G_X G_Y(G_Y+\eta_Y I_n)^{-1}G_X(G_X+\eta_X I_n)^{-1},

and then set [f_j]_{𝔅_X} = (G_X + η_X I_n)^{-1} v_j for j ∈ {1, …, d}.
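The sample-level procedure can be summarized in a few lines of linear algebra. The following Python sketch assumes the kernel Gram matrices K_X and K_Y have already been computed from pairwise Wasserstein or sliced Wasserstein distances; the function and variable names are illustrative and not the authors' implementation.

```python
import numpy as np

def gsir(KX, KY, d, eps_x=1e-3, eps_y=1e-3, method=1):
    """Sample-level W-/SW-GSIR sketch. KX, KY: n x n kernel Gram matrices;
    eps_x, eps_y: ridge constants epsilon_X, epsilon_Y; method 1 uses A = I,
    method 2 uses the regularized A = Sigma_YY^{-1}."""
    n = KX.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n            # centering projection Q = I - 11'/n
    GX, GY = Q @ KX @ Q, Q @ KY @ Q                # centered Gram matrices
    RX = np.linalg.inv(GX + eps_x * np.linalg.eigvalsh(GX).max() * np.eye(n))
    if method == 1:
        M = RX @ GX @ GY @ GX @ RX                 # Lambda_GSIR^(1)
    else:
        RY = np.linalg.inv(GY + eps_y * np.linalg.eigvalsh(GY).max() * np.eye(n))
        M = RX @ GX @ GY @ RY @ GX @ RX            # Lambda_GSIR^(2)
    vals, vecs = np.linalg.eigh((M + M.T) / 2)     # symmetric eigendecomposition
    V = vecs[:, ::-1][:, :d]                       # leading eigenvectors v_1, ..., v_d
    coefs = RX @ V                                 # coordinates [f_j] = (GX + eta_X I)^{-1} v_j
    scores = GX @ coefs                            # centered in-sample values of f_j(X_i)
    return coefs, scores, vals[::-1][:d]
```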

Choice of tuning parameters:

We use the generalized cross-validation (GCV) criterion [21] to determine the tuning constant ε_X:

\mathrm{GCV}_X(\varepsilon_X) = \frac{\|K_Y - K_X(K_X+\varepsilon_X\lambda_{\max}(K_X)I_n)^{-1}K_Y\|_F^2}{\big\{\operatorname{tr}\big[I_n - K_X(K_X+\varepsilon_X\lambda_{\max}(K_X)I_n)^{-1}\big]\big\}^2}.

The numerator of this criterion is the prediction error, and the denominator controls the degree of overfitting. Similarly, the GCV criterion for ε_Y is defined as

\mathrm{GCV}_Y(\varepsilon_Y) = \frac{\|K_X - K_Y(K_Y+\varepsilon_Y\lambda_{\max}(K_Y)I_n)^{-1}K_X\|_F^2}{\big\{\operatorname{tr}\big[I_n - K_Y(K_Y+\varepsilon_Y\lambda_{\max}(K_Y)I_n)^{-1}\big]\big\}^2}.

We minimize the criteria over the grid {10^{-6}, 10^{-5}, …, 10^{-1}, 1} to find the optimal tuning constants. We choose the parameters γ_X and γ_Y in the reproducing kernels κ_X and κ_Y as the fixed quantities γ_X = 1/(2σ_X²) and γ_Y = 1/(2σ_Y²), where σ_X² = \binom{n}{2}^{-1}∑_{i<j} d²(X_i, X_j), σ_Y² = \binom{n}{2}^{-1}∑_{i<j} d²(Y_i, Y_j), and the metric d(·,·) is W_2(·,·) for univariate distributional data and SW_2(·,·) for multivariate distributional data.
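A minimal sketch of this grid search is given below, assuming the Gram matrices have been precomputed; the function name gcv and its arguments are illustrative only.

```python
import numpy as np

def gcv(K_pred, K_resp, eps_grid=(1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0)):
    """Generalized cross-validation for the ridge constant, following the
    GCV_X criterion above; K_pred and K_resp are the predictor and response
    kernel Gram matrices."""
    n = K_pred.shape[0]
    lam_max = np.linalg.eigvalsh(K_pred).max()
    best_eps, best_val = None, np.inf
    for eps in eps_grid:
        S = K_pred @ np.linalg.inv(K_pred + eps * lam_max * np.eye(n))  # smoother matrix
        num = np.linalg.norm(K_resp - S @ K_resp, ord="fro") ** 2       # prediction error
        den = np.trace(np.eye(n) - S) ** 2                              # overfitting control
        if num / den < best_val:
            best_eps, best_val = eps, num / den
    return best_eps
```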

Order Determination:

To determine the dimension d in (1), we use the BIC-type criterion in [28] and [30]. Let G_n(k) = ∑_{i=1}^k λ̂_i − c_0 λ̂_1 n^{-1/2} log(n) k, where the λ̂_i's are the eigenvalues of the matrix Λ_GSIR in descending order and c_0 is taken to be 2 when A = I and 4 when A = Σ_YY^{-1}. Then we estimate d by

\hat d = \arg\max\{G_n(k) : k \in \{0, 1, \ldots, n\}\}.
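The criterion is easy to evaluate once the eigenvalues of the estimated GSIR matrix are available; the short sketch below implements it, with the function name order_bic chosen only for illustration.

```python
import numpy as np

def order_bic(eigvals, n, c0=2.0):
    """BIC-type order determination: maximize G_n(k) = sum_{i<=k} lambda_i
    - c0 * lambda_1 * n^{-1/2} * log(n) * k over k; eigvals are the eigenvalues
    of the estimated GSIR matrix."""
    lam = np.sort(np.asarray(eigvals, float))[::-1]   # descending eigenvalues
    ks = np.arange(len(lam) + 1)                      # k = 0, 1, ..., n
    gains = np.concatenate(([0.0], np.cumsum(lam)))   # sum of the first k eigenvalues
    penalty = c0 * lam[0] * np.log(n) / np.sqrt(n) * ks
    return int(np.argmax(gains - penalty))
```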

Recently developed order-determination methods, such as the ladle estimator [34], can also be directly used to estimate d.

5. Asymptotic Analysis

In this section, we establish the consistency and convergence rates of W-GSIR and SW-GSIR. We focus on the analysis of Type-I GSIR, where the operator A is chosen as the identity map I. The techniques we use are also applicable to the analysis of Type-II GSIR. To simplify the exposition, we define Λ = Σ_XX^{-1}Σ_XYΣ_YXΣ_XX^{-1} and Λ̂ = (Σ̂_XX + η_n I)^{-1}Σ̂_XYΣ̂_YX(Σ̂_XX + η_n I)^{-1}.

5.1. Convergence rate for fully observed distribution

If we assume that the data (X_i, Y_i), i = 1, …, n, are fully observed, we can establish the consistency and convergence rates of W-GSIR and SW-GSIR without fundamental differences from [30]. To make the paper self-contained, we present the results here without proof.

Proposition 4. Suppose Σ_XY = Σ_XX^β S_XY for some linear operator S_XY: ℋ_Y → ℋ_X, where 0 < β ≤ 1. Also, suppose η_n → 0 and n^{-1/2}/η_n → 0. Then

  1. If S_XY is bounded, then ‖Λ̂ − Λ‖_OP = 𝒪_p(η_n^β + η_n^{-1} n^{-1/2}).

  2. If S_XY is Hilbert–Schmidt, then ‖Λ̂ − Λ‖_HS = 𝒪_p(η_n^β + η_n^{-1} n^{-1/2}).

The condition Σ_XY = Σ_XX^β S_XY is a smoothness condition, which requires the range of Σ_XY to be sufficiently concentrated on the eigenspaces of Σ_XX associated with its large eigenvalues. The parameter β characterizes the degree of "smoothness" of the relation between X and Y, with a larger β indicating a smoother relation.

By a perturbation theory result in Lemma 5.2 of Koltchinskii and Giné [24], the eigenspaces of Λˆ converge to those of Λ at the same rate if the nonzero eigenvalues of Λ are distinct. Therefore, as a corollary of Proposition 4, the W-GSIR and SW-GSIR estimators are consistent with the same convergence rates.

5.2. Convergence rate for discretely observed distribution

In practice, additional challenges arise when the distributions are not fully observed. Instead, we observe i.i.d. samples from each (X_i, Y_i), i ∈ {1, …, n}, which we call the discretely observed scenario. Suppose we observe {X_{1j}}_{j=1}^{r_1}, {Y_{1k}}_{k=1}^{s_1}, …, {X_{nj}}_{j=1}^{r_n}, {Y_{nk}}_{k=1}^{s_n}, where {X_{ij}}_{j=1}^{r_i} and {Y_{ik}}_{k=1}^{s_i} are independent samples from X_i and Y_i, respectively. Let X̂_i and Ŷ_i be the empirical measures r_i^{-1}∑_{j=1}^{r_i} δ_{X_{ij}} and s_i^{-1}∑_{j=1}^{s_i} δ_{Y_{ij}}, where δ_a is the Dirac measure at a. Then we estimate d(X_i, X_k) and d(Y_i, Y_k) by d(X̂_i, X̂_k) and d(Ŷ_i, Ŷ_k), respectively. For convenience of analysis, we assume the sample sizes are equal, that is, r_1 = ⋯ = r_n = s_1 = ⋯ = s_n = m. It is important to note that there are two layers of randomness in this situation: the first generates the independent pairs of distributions (X_i, Y_i) for i ∈ {1, …, n}, and the second generates independent samples given each pair of distributions (X_i, Y_i).

To guarantee the consistency of W-GSIR or SW-GSIR, we need to quantify the discrepancy between the estimated and true distributions by the following assumption.

Assumption 2. For i ∈ {1, …, n}, E[d(X̂_i, X_i)] = 𝒪(δ_m) and E[d(Ŷ_i, Y_i)] = 𝒪(δ_m), where δ_m → 0 as m → ∞.

Let μ be X_i or Y_i for i = 1, …, n, and let μ̂ be the empirical measure of μ based on m i.i.d. samples. The convergence rate of empirical measures in the Wasserstein distance on Euclidean spaces has been studied in several works, including [7, 11, 17, 26, 51]. When M is compact, Fournier and Guillin [17] showed that E[W_2(μ̂, μ)] ≲ m^{-1/4}. However, when M is unbounded, such as M = ℝ, we need concentration or moment assumptions on the measure μ to establish the convergence rate. Let m_q(μ) := ∫_M |x|^q dμ be the q-th moment of μ. If m_q(μ) < ∞ for some q > 2, the result of [17] implies that E[W_2(μ̂, μ)] = 𝒪(m^{-1/4} + m^{-(q-2)/(2q)}). If q > 4, then the term m^{-(q-2)/(2q)} is dominated by m^{-1/4} and can be removed. If μ is a log-concave measure, then Bobkov and Ledoux [6] showed the sharper rate E[W_2(μ̂, μ)] ≲ √(log m / m).
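The one-dimensional rates are easy to check numerically. The following small sketch (an illustration, not part of the paper) approximates W_2 between the empirical measure of m standard normal draws and N(0, 1) itself via a midpoint quantile grid, and compares the averages with the log-concave rate √(log m / m).

```python
import numpy as np
from scipy.stats import norm

def w2_empirical_vs_normal(m, rng):
    """Approximate W2 between the empirical measure of m N(0,1) draws and
    N(0,1), using a midpoint quantile-grid approximation."""
    x = np.sort(rng.standard_normal(m))
    s = (np.arange(1, m + 1) - 0.5) / m
    return np.sqrt(np.mean((x - norm.ppf(s)) ** 2))

rng = np.random.default_rng(1)
for m in (100, 1000, 10000):
    est = np.mean([w2_empirical_vs_normal(m, rng) for _ in range(50)])
    print(m, est, np.sqrt(np.log(m) / m))   # empirical average vs. sqrt(log m / m)
```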

The convergence rate of empirical measures in the sliced Wasserstein distance has been investigated by Lin et al. [33], Niles-Weed and Rigollet [39], and Nietert et al. [38]. When M is compact, the result of Lin et al. [33] indicates that E[SW_2(μ̂, μ)] ≲ m^{-1/4}. When M = ℝ^r and m_q(μ) < ∞ for some q > 2, Lin et al. [33] established the rate E[SW_2(μ̂, μ)] = 𝒪(m^{-1/4} + m^{-(q-2)/(2q)}). A sharper rate is shown in Nietert et al. [38] under a log-concavity assumption on μ.

To ensure notation consistency, we define

\hat\Sigma_{XY} = E_n[\kappa(\cdot,\hat X)\otimes\kappa(\cdot,\hat Y)]-E_n[\kappa(\cdot,\hat X)]\otimes E_n[\kappa(\cdot,\hat Y)],\qquad \tilde\Sigma_{XY} = E_n[\kappa(\cdot,X)\otimes\kappa(\cdot,Y)]-E_n[\kappa(\cdot,X)]\otimes E_n[\kappa(\cdot,Y)].

We note that X̂_1, …, X̂_n are independent but not necessarily identically distributed. Despite this, we still write the sample average as E_n(·). Similarly, we define Σ̂_XX and Σ̂_YY as the sample covariance operators based on the estimated distributions X̂_1, …, X̂_n and Ŷ_1, …, Ŷ_n. Under Assumption 2, we have the following lemma showing the convergence rates of the covariance operators.

Lemma 2. Under Assumption 2, if the kernel κ(z, z′) is Lipschitz continuous, that is, sup_z |κ(z_1, z) − κ(z_2, z)| < C d(z_1, z_2) for some C > 0, then Σ_XX, Σ_YY and Σ_YX are Hilbert–Schmidt operators, and we have ‖Σ̂_XX − Σ_XX‖_HS = 𝒪_p(δ_m + n^{-1/2}), ‖Σ̂_YY − Σ_YY‖_HS = 𝒪_p(δ_m + n^{-1/2}), and ‖Σ̂_XY − Σ_XY‖_HS = 𝒪_p(δ_m + n^{-1/2}).

Based on Lemma 2, we establish the convergence rate of W-GSIR in the following theorem.

Theorem 1. Suppose Σ_XY = Σ_XX^{1+β} S_XY for some linear operator S_XY: ℋ_Y → ℋ_X, where 0 < β ≤ 1. Suppose η_n → 0 and (δ_m + n^{-1/2})/η_n → 0. Then

  1. If S_XY is bounded, then ‖Λ̂ − Λ‖_OP = 𝒪_p(η_n^β + η_n^{-1}(δ_m + n^{-1/2})).

  2. If S_XY is Hilbert–Schmidt, then ‖Λ̂ − Λ‖_HS = 𝒪_p(η_n^β + η_n^{-1}(δ_m + n^{-1/2})).

The proof is provided in Section 9. The same convergence rate can be established for SW-GSIR.

6. Simulation

In this section, we evaluate the numerical performances of W-GSIR and SW-GSIR. We consider two scenarios: univariate distribution on univariate distribution regression and multivariate distribution on multivariate distribution regression. In Section 6.4, we compare the performance of W-GSIR and SW-GSIR with the result using functional-GSIR [30]. The code to reproduce the simulation results can be found at https://github.com/bideliunian/SDR4D2DReg.

6.1. Computational details

We use the Gaussian RBF kernel to generate the RKHS. We consider the discretely observed situation described in Section 5.2. Specifically, let X̂_i = m^{-1}∑_{j=1}^m δ_{X_{ij}} be the empirical distributions for i ∈ {1, …, n}. When X is a univariate distribution, for i, k ∈ {1, …, n}, we estimate W_2(X_i, X_k) and W_2(Y_i, Y_k) by

W_2(\hat X_i,\hat X_k) = \Big(\frac{1}{m}\sum_{j=1}^m (X_{i(j)}-X_{k(j)})^2\Big)^{1/2},\qquad W_2(\hat Y_i,\hat Y_k) = \Big(\frac{1}{m}\sum_{j=1}^m (Y_{i(j)}-Y_{k(j)})^2\Big)^{1/2},

respectively, where X_{i(j)} denotes the j-th order statistic of {X_{ij}}_{j=1}^m (and similarly for Y_{i(j)}).

When X is a multivariate distribution supported on M ⊆ ℝ^r, we estimate the sliced Wasserstein distance by a standard Monte Carlo method, that is,

SW_2(\hat X_i,\hat X_k) \approx \Big(\frac{1}{L}\sum_{l=1}^L W_2^2(\hat X_i\circ T_{\theta_l}^{-1},\ \hat X_k\circ T_{\theta_l}^{-1})\Big)^{1/2} = \Big[\frac{1}{L}\sum_{l=1}^L W_2^2\Big(\frac{1}{m}\sum_{j=1}^m\delta_{\langle\theta_l,X_{ij}\rangle},\ \frac{1}{m}\sum_{j=1}^m\delta_{\langle\theta_l,X_{kj}\rangle}\Big)\Big]^{1/2},

where {θ_l}_{l=1}^L are i.i.d. samples drawn from the uniform distribution on 𝕊^{r−1}. The number of projections L controls the approximation error: a larger L gives a more accurate approximation but increases the computational cost. In our simulation settings, we set L = 50.
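A minimal sketch of this Monte Carlo approximation, for two empirical measures given as (m, r) arrays of sample points, is shown below; the function name and the equal-sample-size assumption are illustrative.

```python
import numpy as np

def sliced_w2(x, y, L=50, rng=None):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance between two
    empirical measures given as (m, r) sample arrays of equal size."""
    rng = np.random.default_rng() if rng is None else rng
    m, r = x.shape
    theta = rng.standard_normal((L, r))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # uniform directions on S^{r-1}
    px = np.sort(x @ theta.T, axis=0)                        # projected order statistics
    py = np.sort(y @ theta.T, axis=0)
    w2_sq = np.mean((px - py) ** 2, axis=0)                  # W2^2 along each direction
    return np.sqrt(np.mean(w2_sq))
```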

We consider two measures to evaluate the agreement between the estimated and true predictors. The first is the RV Coefficient of Multivariate Ranks (RVMR) defined below, which generalizes Spearman's correlation to the multivariate case. For two samples of random vectors U_1, …, U_n ∈ ℝ^r and V_1, …, V_n ∈ ℝ^s, let Ũ_i and Ṽ_i be their multivariate ranks, that is,

\tilde U_i = \frac{1}{n}\sum_{\ell=1}^n\frac{U_\ell-U_i}{\|U_\ell-U_i\|},\qquad \tilde V_i = \frac{1}{n}\sum_{\ell=1}^n\frac{V_\ell-V_i}{\|V_\ell-V_i\|}.

Then the RVMR between U1,,Un and V1,,Vn is defined as the RV coefficient between U˜1,,U˜n and V˜1,,V˜n:

\mathrm{RVMR}_n(U,V) = \frac{\operatorname{tr}\big(\operatorname{cov}_n(\tilde U,\tilde V)\operatorname{cov}_n(\tilde V,\tilde U)\big)}{\sqrt{\operatorname{tr}\big(\operatorname{var}_n(\tilde U)^2\big)\operatorname{tr}\big(\operatorname{var}_n(\tilde V)^2\big)}}.
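For completeness, a short sketch of this accuracy measure is given below (spatial multivariate ranks followed by the RV coefficient); the helper names multivariate_rank and rvmr are illustrative.

```python
import numpy as np

def multivariate_rank(U):
    """Spatial multivariate ranks: tilde U_i = n^{-1} sum_l (U_l - U_i)/||U_l - U_i||."""
    diff = U[None, :, :] - U[:, None, :]              # diff[i, l] = U_l - U_i
    norms = np.linalg.norm(diff, axis=2, keepdims=True)
    norms[norms == 0] = 1.0                           # the l = i term contributes zero
    return (diff / norms).mean(axis=1)

def rvmr(U, V):
    """RV coefficient of multivariate ranks (RVMR)."""
    n = len(U)
    Uc = multivariate_rank(U); Uc -= Uc.mean(0)
    Vc = multivariate_rank(V); Vc -= Vc.mean(0)
    Suv = Uc.T @ Vc / n
    Suu, Svv = Uc.T @ Uc / n, Vc.T @ Vc / n
    return np.trace(Suv @ Suv.T) / np.sqrt(np.trace(Suu @ Suu) * np.trace(Svv @ Svv))
```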

The second one is the distance correlation [48], a well-known measure of dependence between two random vectors of arbitrary dimension.

6.2. Univariate distribution-on-distribution regression

We generate a normal distribution Y whose mean and variance parameters are random variables depending on X, that is,

Y=N(μY,σY2), (5)

where μY and σY>0 are random variables generated according to the following models:

Model I-1 : μ_Y | X ∼ N(exp(W_2²(X, μ_1)) + exp(W_2²(X, μ_2)), 0.2²); σ_Y = 1;

Model I-2 : μ_Y | X ∼ N(exp(W_2²(X, μ_1)), 0.2²); σ_Y ∼ Gamma(W_2²(X, μ_2), W_2(X, μ_2));

Model I-3 : μ_Y | X ∼ N(exp(H(X, μ_1)), 0.2²); σ_Y = exp(H(X, μ_2));

Model I-4 : μ_Y | X ∼ N(E(X), 0.2²); σ_Y ∼ Gamma(Var(X), Var(X)).

We let μ_1 = Beta(2, 1) and μ_2 = Beta(2, 3) and generate discrete observations from the distributional predictors by X_{ij} ∼ i.i.d. Beta(a_i, b_i), j = 1, …, m, where a_i ∼ i.i.d. Gamma(2, rate = 1) and b_i ∼ i.i.d. Gamma(2, rate = 3). We note that the Hellinger distance between two Beta distributions μ = Beta(a_1, b_1) and ν = Beta(a_2, b_2) can be represented explicitly as

H(\mu,\nu) = \sqrt{1-\int\sqrt{f_\mu(t)f_\nu(t)}\,dt} = \sqrt{1-\frac{B\big((a_1+a_2)/2,\ (b_1+b_2)/2\big)}{\sqrt{B(a_1,b_1)B(a_2,b_2)}}},

where B(α,β) is the Beta function.
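Assuming the convention above (the square root of one minus the Bhattacharyya coefficient), this closed form can be evaluated directly with the Beta function; the sketch below is illustrative, with a hypothetical function name.

```python
import numpy as np
from scipy.special import beta as beta_fn

def hellinger_beta(a1, b1, a2, b2):
    """Hellinger distance between Beta(a1, b1) and Beta(a2, b2) via the
    closed form based on the Beta function."""
    bc = beta_fn((a1 + a2) / 2, (b1 + b2) / 2) / np.sqrt(beta_fn(a1, b1) * beta_fn(a2, b2))
    return np.sqrt(max(1.0 - bc, 0.0))

print(hellinger_beta(2, 1, 2, 3))   # e.g., H(Beta(2,1), Beta(2,3))
```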

We compute the distances W_2(X, μ_1) and W_2(X, μ_2) as the L_2-distance between the quantile functions. We set n ∈ {100, 200}, m ∈ {50, 100} and generate 2n samples {({X_{ij}}_{j=1}^m, {Y_{ij}}_{j=1}^m)}_{i=1}^{2n}. We use half of them to train the nonlinear sufficient predictors via W-GSIR, and then evaluate the RVMR and distance correlation between the estimated and true predictors on the rest of the data. The tuning parameters and the dimensions are determined by the methods described in Section 4.2. The experiment is repeated 100 times, and the averages and standard errors (in parentheses) of the RVMR and Dcor are summarized in Table 1. The following are the identified true predictors for each model: Model I-1 uses W_2(X, μ_1); Model I-2 uses (W_2(X, μ_1), W_2(X, μ_2)); Model I-3 uses (H(X, μ_1), H(X, μ_2)); and Model I-4 uses (E(X), Var(X)).

Table 1:

RVMR and Distance Correlation between the estimated predictors and the true predictors of models in Section 6.2, with their Monte Carlo standard errors in parentheses.

Models  n  W-GSIR1 (m=50)  W-GSIR1 (m=100)  W-GSIR2 (m=50)  W-GSIR2 (m=100)
RVMR

I-1 100 0.791 (0.128) 0.839 (0.115) 0.776 (0.124) 0.812 (0.159)
200 0.832 (0.091) 0.864 (0.087) 0.808 (0.114) 0.842 (0.129)
I-2 100 0.597 (0.187) 0.607 (0.206) 0.555 (0.236) 0.548 (0.235)
200 0.694 (0.141) 0.681 (0.172) 0.709 (0.177) 0.688 (0.190)
I-3 100 0.846 (0.037) 0.880 (0.037) 0.836 (0.045) 0.859 (0.049)
200 0.864 (0.021) 0.896 (0.025) 0.797 (0.088) 0.696 (0.046)
I-4 100 0.558 (0.242) 0.652 (0.253) 0.729 (0.196) 0.790 (0.215)
200 0.643 (0.221) 0.732 (0.183) 0.767 (0.169) 0.847 (0.145)
Dcor

I-1 100 0.958 (0.024) 0.969 (0.022) 0.952 (0.029) 0.964 (0.034)
200 0.967 (0.011) 0.974 (0.013) 0.963 (0.017) 0.970 (0.020)
I-2 100 0.932 (0.037) 0.935 (0.041) 0.896 (0.071) 0.898 (0.066)
200 0.952 (0.026) 0.948 (0.032) 0.934 (0.054) 0.932 (0.048)
I-3 100 0.971 (0.008) 0.978 (0.005) 0.968 (0.010) 0.974 (0.007)
200 0.974 (0.004) 0.980 (0.004) 0.970 (0.007) 0.971 (0.008)
I-4 100 0.921 (0.042) 0.936 (0.042) 0.937 (0.036) 0.947 (0.038)
200 0.937 (0.037) 0.950 (0.027) 0.951 (0.023) 0.962 (0.025)

Fig. 1(a) displays a scatter plot of the true predictor versus the first estimated sufficient predictor for Model I-1. Fig. 1(b) and (c) show scatter plots of the first two sufficient predictors for Model I-2, with the color indicating the values of the true predictors. These figures demonstrate the method's ability to capture nonlinear patterns among predictor random elements.

Fig. 1: Visualization of the W-GSIR1 estimator for (a) Model I-1 and (b), (c) Model I-2, with n = 200 and m = 100. The sufficient predictors are computed via W-GSIR1.

6.3. Multivariate distribution-on-distribution regression

We now consider the scenario where both X and Y are two-dimensional random Gaussian distributions. We generate Y = N(μ_Y, Σ_Y), where μ_Y ∈ ℝ² and Σ_Y ∈ ℝ^{2×2} are randomly generated according to the following models:

II-1: μ_Y | X ∼ N(W_2(X, μ_1)(1, 1)^⊤, I_2), Σ_Y = diag(1, 1).

II-2: μ_Y | X = W_2(X, μ_1)(1, 1)^⊤ and Σ_Y = ΓΛΓ^⊤, where Γ = \frac{\sqrt{2}}{2}\begin{pmatrix}1 & -1\\ 1 & 1\end{pmatrix}, Λ = diag(|λ_1|, |λ_2|), and (λ_1, λ_2) | X ∼ N(W_2(X, μ_2)(1, 1)^⊤, 0.25 I_2).

II-3: μ_Y | X ∼ N(W_2(X, μ_1)(1, 1)^⊤, I_2) and Σ_Y = ΓΛΓ^⊤, where Γ = \frac{\sqrt{2}}{2}\begin{pmatrix}1 & -1\\ 1 & 1\end{pmatrix}, Λ = diag(λ_1, λ_2), and λ_1, λ_2 | X ∼ i.i.d. tGamma(W_2²(X, μ_2), W_2(X, μ_2), (0.2, 2)).

II-4: μ_Y | X ∼ N(H²(X, μ_1)(1, 1)^⊤, I_2) and Σ_Y = ΓΛΓ^⊤, where Γ = \frac{\sqrt{2}}{2}\begin{pmatrix}1 & -1\\ 1 & 1\end{pmatrix}, Λ = diag(λ_1, λ_2), and (λ_1, λ_2) | X ∼ tGamma(H²(X, μ_2), H(X, μ_2), (0.2, 2)),

where μ1 and μ2 are two fixed measures defined by

\mu_1 = N\big((1,0)^\top,\ \mathrm{diag}(1,0.5)\big),\qquad \mu_2 = N\big((0,1)^\top,\ \mathrm{diag}(0.5,1)\big),

and tGamma(α, β, (r_1, r_2)) is the Gamma distribution with shape parameter α and rate parameter β truncated to the interval (r_1, r_2). We generate discrete observations of X_i, i ∈ {1, …, n}, by X_{ij} ∼ i.i.d. N(a_i(1, 1)^⊤, b_i I_2), j = 1, …, m, where a_i ∼ i.i.d. N(0.5, 0.5²) and b_i ∼ i.i.d. Beta(2, 3). When computing W_2(X, μ_1) and W_2(X, μ_2), we use the following explicit representation of the Wasserstein distance between two Gaussian distributions:

W_2^2\big(N(m_1,\Sigma_1),N(m_2,\Sigma_2)\big) = \|m_1-m_2\|^2+\operatorname{tr}\Sigma_1+\operatorname{tr}\Sigma_2-2\operatorname{tr}\big[(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2}\big].
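This closed form is straightforward to evaluate numerically; the sketch below uses a matrix square root from SciPy, with an illustrative function name.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, S1, m2, S2):
    """2-Wasserstein distance between N(m1, S1) and N(m2, S2) via the closed form."""
    S2_half = sqrtm(S2)
    cross = sqrtm(S2_half @ S1 @ S2_half)            # (Sigma_2^{1/2} Sigma_1 Sigma_2^{1/2})^{1/2}
    w2_sq = np.sum((np.asarray(m1) - np.asarray(m2)) ** 2) \
        + np.trace(S1) + np.trace(S2) - 2 * np.trace(cross.real)
    return np.sqrt(max(w2_sq, 0.0))

# e.g., W2 between the two reference measures mu_1 and mu_2 above
print(w2_gaussian([1, 0], np.diag([1.0, 0.5]), [0, 1], np.diag([0.5, 1.0])))
```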

The following are the identified true predictors for each model: Model II-1 uses W_2(X, μ_1); Models II-2 and II-3 use (W_2(X, μ_1), W_2(X, μ_2)); and Model II-4 uses (H(X, μ_1), H(X, μ_2)).

Using the true dimensions and the same choices of n, m, and tuning parameters, we repeat the experiment 100 times and summarize the averages and standard errors of the RVMR and distance correlation between the estimated and true predictors in Table 2. In Fig. 2, we plot the two-dimensional response densities associated with the 10%, 30%, 50%, 70%, and 90% quantiles of the estimated predictor (first row) and the true predictor (second row) for Model II-2. Comparing the plots, we see that the two-dimensional response distributions show a similar variation pattern, which indicates that the method successfully captured the nonlinear predictor in the responses. We also see that the first estimated sufficient predictor captures both the location and the scale of the response distribution. As the estimated sufficient predictor increases, the location of the response distribution moves slightly rightward and upward, while the variance of the response distribution first decreases and then increases.

Table 2:

RVMR and Distance Correlation between the estimated predictors and the true predictors of models in Section 6.3, with their Monte Carlo standard errors in parentheses.

Models  n  SW-GSIR1 (m=50)  SW-GSIR1 (m=100)  SW-GSIR2 (m=50)  SW-GSIR2 (m=100)
RVMR

II-1 100 0.948 (0.063) 0.957 (0.049) 0.915 (0.130) 0.910 (0.150)
200 0.958 (0.041) 0.970 (0.022) 0.921 (0.087) 0.934 (0.084)
II-2 100 0.784 (0.036) 0.791 (0.033) 0.820 (0.038) 0.822 (0.036)
200 0.783 (0.023) 0.791 (0.023) 0.834 (0.033) 0.824 (0.034)
II-3 100 0.744 (0.061) 0.755 (0.059) 0.806 (0.067) 0.812 (0.065)
200 0.747 (0.040) 0.753 (0.043) 0.835 (0.069) 0.841 (0.059)
II-4 100 0.499 (0.166) 0.500 (0.144) 0.570 (0.170) 0.567 (0.155)
200 0.512 (0.156) 0.477 (0.152) 0.532 (0.157) 0.501 (0.159)
Dcor

II-1 100 0.962 (0.024) 0.963 (0.025) 0.977 (0.018) 0.977 (0.021)
200 0.963 (0.017) 0.964 (0.018) 0.973 (0.023) 0.970 (0.025)
II-2 100 0.967 (0.013) 0.967 (0.013) 0.973 (0.010) 0.975 (0.010)
200 0.965 (0.011) 0.966 (0.011) 0.975 (0.008) 0.975 (0.010)
II-3 100 0.980 (0.009) 0.981 (0.008) 0.983 (0.007) 0.984 (0.006)
200 0.979 (0.007) 0.979 (0.009) 0.982 (0.007) 0.983 (0.008)
II-4 100 0.889 (0.031) 0.886 (0.036) 0.886 (0.033) 0.892 (0.030)
200 0.893 (0.033) 0.886 (0.033) 0.887 (0.034) 0.889 (0.037)

Fig. 2: Densities associated with the 10%, 30%, 50%, 70%, and 90% quantiles (left to right) of the estimated predictor (first row) and the true predictor (second row) for Model II-2.

6.4. Comparison with functional-GSIR

Next, we compare the performance of W-GSIR with two methods that use the GSIR framework but replace the Wasserstein distance with the L_1 or L_2 distance. We call them L1-GSIR and L2-GSIR, respectively. Note that L2-GSIR is the same as functional GSIR (f-GSIR) proposed in Li and Song [30]. Theoretically, L2-GSIR is an inadequate estimator, since an L_2 function need not be a density and vice versa. Nevertheless, we naively implement L2-GSIR, treating density curves as L_2 functions. To make a fair comparison, we first use a Gaussian kernel smoother to estimate the densities from the discrete observations and then evaluate the L_r distances by numerical integration. For Lr-GSIR (r = 1, 2), we take the Gaussian-type kernel κ(z, z′) = exp(−γ‖z − z′‖²_{L_r}), with the same choice of the tuning parameter γ as described in Subsection 4.2. We use n = 100, m = 100, and repeat the experiment 100 times with A = I. The results are summarized in Table 3. We see that W-GSIR provides more accurate estimation than both L1-GSIR and L2-GSIR.

Table 3:

RVMR and Distance Correlation between the estimated predictors and the true predictors of models in Section 6.2, with their Monte Carlo standard errors in parentheses, computed using L1-GSIR, L2-GSIR, and W-GSIR.

Models L1-GSIR1 L2-GSIR1 W-GSIR1

RVMR

I-1 0.258 (0.233) 0.356 (0.276) 0.839 (0.115)
I-2 0.322 (0.236) 0.433 (0.244) 0.607 (0.206)
I-3 0.307 (0.242) 0.359 (0.205) 0.880 (0.037)
I-4 0.313 (0.252) 0.441 (0.278) 0.652 (0.253)
Dcor

I-1 0.773 (0.129) 0.731 (0.171) 0.969 (0.022)
I-2 0.778 (0.173) 0.690 (0.203) 0.935 (0.041)
I-3 0.779 (0.169) 0.688 (0.196) 0.978 (0.005)
I-4 0.779 (0.129) 0.740 (0.176) 0.936 (0.042)

7. Applications

7.1. Application to human mortality data

In this application, we explore the relationship between the distribution of age at death and the distribution of the mother's age at birth. We obtained our data from the UN World Population Prospects 2019 Databases (https://population.un.org), focusing on the years 2015-2020. For each country, we compiled the number of deaths in five-year age groups from ages 0-100 and the number of births categorized by the mother's age in five-year groups from ages 15-50. We represented these data as histograms with bin widths equal to 5 years. To obtain smooth probability density functions for each country, we used the R package 'frechet' to perform smoothing. We then calculated the pairwise Wasserstein distances among the predictor densities and among the response densities. The predictor and response densities are visualized in Fig. 3.

Fig. 3: Density of (a) age at death and (b) mother's age at birth for 194 countries, obtained using data from the UN World Population Prospects 2019 Databases (https://population.un.org).

We apply the proposed W-GSIR algorithm to the fertility and mortality data. The dimension d of the central class is determined to be one by the BIC-type procedure described in Subsection 4.2. We plot the age-at-death distributions versus the nonlinear sufficient predictor obtained by W-GSIR2 in Fig. 4. In Fig. 5, we present summary statistics of the age-at-death distributions plotted against the sufficient predictor.

Fig. 4: Densities of age at death for 194 countries, in random order in (a) and (c), and versus the first nonlinear sufficient predictor obtained by W-GSIR2 in (b) and (d).

Fig. 5: Summary statistics (mean, mode, standard deviation, and skewness) of the mortality distributions for 194 countries versus the nonlinear sufficient predictor obtained by W-GSIR2.

Upon examining these plots, we obtain the following insights. The first nonlinear sufficient predictor effectively captures the location and variation of the mortality distributions. Specifically, as the first sufficient predictor increases, the means of the mortality distributions decrease while the standard deviations increase. This suggests that the age at death tends to concentrate between 70 and 80 for large sufficient predictor values. Additionally, for densities with small sufficient predictors, there is an uptick near age 0, which indicates higher infant mortality rates among the countries with such densities.

7.2. Application to Calgary temperature data

In this application, we are interested in the relationship between the distributions of extreme daily temperatures in spring (March, April, May) and summer (June, July, August) in Calgary, Alberta. We obtained the data set from https://calgary.weatherstats.ca/, which contains the minimum and maximum temperatures for each day from 1884 to 2020. These data were previously analyzed in Fan and Müller [14]. We focused on the joint distribution of the minimum daily temperature and the difference between the maximum and minimum daily temperatures, which ensures that the distributions have common support. Each pair of daily values was treated as one observation from a two-dimensional distribution, resulting in one realization of the joint distribution for spring and one for summer each year. We then employed the spring extreme temperature distribution to predict the summer extreme temperature distribution. The data set had n = 136 observations, with m = 92 discrete values for each joint distribution. We applied the SW-GSIR method to the data, taking 50 random projections, with ρ_X = ρ_Y = 1. The sufficient dimension was determined to be 2 by the BIC-type procedure. We illustrate the response summer extreme temperature distributions associated with five percentiles of the first estimated sufficient predictor in Fig. 6. We observe from Fig. 6 that as the estimated sufficient predictor value increases, the minimum daily temperature in summer rises slightly while the daily temperature range decreases.

Fig. 6: Joint distribution of the temperature range and minimum temperature in summer associated with the 10%, 30%, 50%, 70%, and 90% quantiles (from left to right) of the SW-GSIR2 predictor.

8. Discussion

This paper introduces a framework of nonlinear sufficient dimension reduction for distribution-on-distribution regression. The key strength of the proposed approach is its ability to handle distributional data without a linear structure. After explicitly constructing universal kernels on the space of distributions, the proposed SDR method effectively reduces the complexity of the distributional predictors while retaining the essential information in them.

Several related open problems remain. First, a more systematic approach to selecting the kernel is needed, particularly when multiple universal kernels are available. Second, while the paper offers an adaptive method for choosing the bandwidth of the universal kernel in Section 4, the theoretical analysis of this bandwidth selection remains an area for future research. Additionally, more appropriate methods for determining the order in nonlinear SDR need to be developed, along with the corresponding consistency results.

9. Technical Proofs

The section contains essential proof details to make the paper self-contained.

Geometry of Wasserstein space

We present some basic results that characterize 𝒲_2(M) when M ⊆ ℝ (i.e., the distributions involved are univariate). Their proofs can be found, for example, in [1] and [5]. In this case, 𝒲_2(M) is a metric space with a formal Riemannian structure [1]. Let μ_0 ∈ 𝒲_2(M) be a reference measure with a continuous distribution function F_{μ_0}. The tangent space at μ_0 is

T_{\mu_0} = \mathrm{cl}_{L_2(\mu_0)}\big\{\lambda\,(F_\mu^{-1}\circ F_{\mu_0}-\mathrm{id}) : \mu\in\mathcal{W}_2(M),\ \lambda>0\big\},

where, for a set A ⊆ L_2(μ_0), cl_{L_2(μ_0)}(A) denotes the L_2(μ_0)-closure of A, and id is the identity map. The exponential map exp_{μ_0} from T_{μ_0} to 𝒲_2(M) is defined by exp_{μ_0}(r) = μ_0 ∘ (r + id)^{-1}, where the right-hand side is the element of 𝒲_2(M) induced from μ_0 by the mapping r + id. The logarithmic map log_{μ_0} from 𝒲_2(M) to T_{μ_0} is defined by log_{μ_0}(μ) = F_μ^{-1} ∘ F_{μ_0} − id. It is known that the exponential map restricted to the image of the log map, denoted exp_{μ_0}|_{log_{μ_0}(𝒲_2(M))}, is an isometric homeomorphism with inverse log_{μ_0} [5]. Therefore, log_{μ_0} is a continuous injection from 𝒲_2(M) to L_2(μ_0). This embedding guarantees that we can replace the Euclidean distance by the 𝒲_2(M) metric in a radial basis kernel to construct a positive definite kernel.

Proof of Proposition 2: Recall that Γ(μ_1, μ_2) is the space of joint probability measures on M × M with marginals μ_1 and μ_2. Let T_θ × T_θ be the mapping from M × M to ℝ × ℝ defined by (T_θ × T_θ)(x, y) = (T_θ(x), T_θ(y)). We first show that, if γ ∈ Γ(μ_1, μ_2), then γ ∘ (T_θ × T_θ)^{-1} ∈ Γ(μ_1 ∘ T_θ^{-1}, μ_2 ∘ T_θ^{-1}). This is true because, for any Borel set A ⊆ ℝ, we have

[\gamma\circ(T_\theta\times T_\theta)^{-1}](A\times T_\theta(M)) = \gamma\big((T_\theta\times T_\theta)^{-1}(A\times T_\theta(M))\big) = \gamma\big(T_\theta^{-1}(A)\times M\big) = \mu_1\big(T_\theta^{-1}(A)\big) = \mu_1\circ T_\theta^{-1}(A),

and similarly [γ ∘ (T_θ × T_θ)^{-1}](T_θ(M) × A) = μ_2 ∘ T_θ^{-1}(A). Hence, for any γ ∈ Γ(μ_1, μ_2), we have

W_p^p(\mu_1\circ T_\theta^{-1},\mu_2\circ T_\theta^{-1}) \le \int_{T_\theta(M)\times T_\theta(M)}|u-v|^p\,d\big[\gamma\circ(T_\theta\times T_\theta)^{-1}\big](u,v) = \int_{M\times M}|T_\theta(x)-T_\theta(y)|^p\,d\gamma(x,y) \le \int_{M\times M}\|x-y\|_2^p\,d\gamma(x,y),

where the last inequality follows from the Cauchy–Schwarz inequality. Therefore,

W_p^p(\mu_1\circ T_\theta^{-1},\mu_2\circ T_\theta^{-1}) \le \inf_{\gamma\in\Gamma(\mu_1,\mu_2)}\int_{M\times M}\|x-y\|^p\,d\gamma(x,y) = W_p^p(\mu_1,\mu_2).

Integrating the left-hand side with respect to θ, we obtain SW_p(μ_1, μ_2) ≤ W_p(μ_1, μ_2). Therefore, the SW_p distance is a weaker metric than the W_p distance, which implies that every open set in 𝒯𝒲_p(M) is open in 𝒲_p(M). In other words, 𝒯𝒲_p(M) has a coarser topology than 𝒲_p(M). Since M ⊆ ℝ^r is separable, so is 𝒲_p(M) [1, Remark 7.1.7]. Therefore, a countable dense subset of 𝒲_p(M) is also a countable dense subset of 𝒯𝒲_p(M), implying that 𝒯𝒲_p(M) is separable. Furthermore, if M is a compact subset of ℝ^r, then 𝒲_p(M) is compact [1, Proposition 7.1.5], implying that 𝒯𝒲_p(M) is compact. This completes the proof of Proposition 2. □

Proof of Lemma 1: By Theorem 3.2.2 of [3], the kernel exp(−γ SW_2²(x, x′)) is positive definite for all γ > 0 if and only if SW_2²(·, ·) is conditionally negative definite. That is, for any c_1, …, c_m ∈ ℝ with ∑_{i=1}^m c_i = 0 and any x_1, …, x_m ∈ Ω_X, ∑_{i=1}^m ∑_{j=1}^m c_i c_j SW_2²(x_i, x_j) ≤ 0. Kolouri et al. [23, Theorem 5] showed the conditional negative definiteness of the sliced Wasserstein distance, which is implied by the negative type of the Wasserstein distance. By [44, 45], a metric being of negative type is equivalent to the statement that there exist a Hilbert space ℋ and a map φ: 𝒯𝒲_2(M) → ℋ such that, for all x, x′ ∈ 𝒯𝒲_2(M), SW_2²(x, x′) = ‖φ(x) − φ(x′)‖²_ℋ. By Proposition 2, 𝒯𝒲_2(M) is a complete and separable space. Then, by the construction of the Hilbert space, ℋ is complete and separable. Therefore, there exists a continuous mapping from the metric space 𝒯𝒲_2(M) to a complete and separable Hilbert space ℋ. Then, by Zhang et al. [54, Theorem 1], the Gaussian-type kernel exp(−γ SW_2²(x, x′)) is universal. Hence, ℋ_X and ℋ_Y are dense in L_2(P_X) and L_2(P_Y), respectively. The same proof applies to the Laplacian-type kernel exp(−γ SW_2(x, x′)). This completes the proof of Lemma 1. □

Proof of Lemma 2. We only show the details of the proof for the convergence rate of ‖Σ̂_XY − Σ_XY‖_HS. By the triangle inequality,

\|\hat\Sigma_{XY}-\Sigma_{XY}\|_{\mathrm{HS}} \le \|\hat\Sigma_{XY}-\tilde\Sigma_{XY}\|_{\mathrm{HS}}+\|\tilde\Sigma_{XY}-\Sigma_{XY}\|_{\mathrm{HS}},

where

\tilde\Sigma_{XY} = E_n[\kappa(\cdot,X)\otimes\kappa(\cdot,Y)]-E_n[\kappa(\cdot,X)]\otimes E_n[\kappa(\cdot,Y)].

By Lemma 5 of [18], under the assumption that E[κ(X,X)]< and E[κ(Y,Y)]<, we have

E\|\tilde\Sigma_{XY}-\Sigma_{XY}\|_{\mathrm{HS}} = \mathcal{O}(n^{-1/2}). \quad (S.1)

Now, we derive a convergence rate for ‖Σ̂_XY − Σ̃_XY‖_HS. For simplicity, let F̂_i = κ(·, X̂_i), F̃_i = κ(·, X_i), Ĝ_i = κ(·, Ŷ_i), and G̃_i = κ(·, Y_i). Then

\begin{aligned}
\|\hat\Sigma_{XY}-\tilde\Sigma_{XY}\|_{\mathrm{HS}}
&= \Big\|\frac{1}{n}\sum_{i=1}^n\Big(\hat F_i-\frac{1}{n}\sum_{j=1}^n\hat F_j\Big)\otimes\Big(\hat G_i-\frac{1}{n}\sum_{j=1}^n\hat G_j\Big)-\frac{1}{n}\sum_{i=1}^n\Big(\tilde F_i-\frac{1}{n}\sum_{j=1}^n\tilde F_j\Big)\otimes\Big(\tilde G_i-\frac{1}{n}\sum_{j=1}^n\tilde G_j\Big)\Big\|_{\mathrm{HS}}\\
&= \Big\|\frac{1}{n}\sum_{i=1}^n\Big((\hat F_i-\tilde F_i)-\frac{1}{n}\sum_{j=1}^n(\hat F_j-\tilde F_j)\Big)\otimes\Big((\hat G_i-\tilde G_i)-\frac{1}{n}\sum_{j=1}^n(\hat G_j-\tilde G_j)\Big)\Big\|_{\mathrm{HS}}\\
&\le \Big\|\frac{1}{n}\sum_{i=1}^n(\hat F_i-\tilde F_i)\otimes(\hat G_i-\tilde G_i)\Big\|_{\mathrm{HS}}+2\Big\|\Big(\frac{1}{n}\sum_{j=1}^n(\hat F_j-\tilde F_j)\Big)\otimes\Big(\frac{1}{n}\sum_{j=1}^n(\hat G_j-\tilde G_j)\Big)\Big\|_{\mathrm{HS}}.
\end{aligned} \quad (S.2)

Consider the expectation of the first term on the right-hand side. Here, the expectation involves two layers of randomness: that in {X_{ij}}_{j=1}^m, {Y_{ik}}_{k=1}^m and that in (X_i, Y_i). Taking the expectation first with respect to {X_{ij}}_{j=1}^m, {Y_{ik}}_{k=1}^m and then with respect to (X_i, Y_i), we have

\begin{aligned}
E\Big[\Big\|\frac{1}{n}\sum_{i=1}^n(\hat F_i-\tilde F_i)\otimes(\hat G_i-\tilde G_i)\Big\|_{\mathrm{HS}}\Big]
&\le \frac{1}{n}\sum_{i=1}^n E\big[\|\hat F_i-\tilde F_i\|_{\mathcal{H}_X}\,\|\hat G_i-\tilde G_i\|_{\mathcal{H}_Y}\big]\\
&\le \frac{1}{n}\sum_{i=1}^n\big(E\|\hat F_i-\tilde F_i\|_{\mathcal{H}_X}^2\big)^{1/2}\big(E\|\hat G_i-\tilde G_i\|_{\mathcal{H}_Y}^2\big)^{1/2}.
\end{aligned}

Invoking the Lipschitz continuity condition on κ(z, z′), we have

E\|\hat F_i-\tilde F_i\|_{\mathcal{H}_X}^2 = E\langle\hat F_i-\tilde F_i,\ \hat F_i-\tilde F_i\rangle_{\mathcal{H}_X} = E\big[\kappa(\hat X_i,\hat X_i)-2\kappa(\hat X_i,X_i)+\kappa(X_i,X_i)\big] \le 2C\,E\big[d(X_i,\hat X_i)\big] = \mathcal{O}\big(E_{X_i}E_{\hat X_i}[d(X_i,\hat X_i)]\big).

By Assumption 2, E_{X̂_i}[d(X̂_i, X_i)] = 𝒪(δ_m) for i ∈ {1, …, n}. We then have E‖F̂_i − F̃_i‖²_{ℋ_X} = 𝒪(δ_m) for i ∈ {1, …, n}. Similarly, we have E‖Ĝ_i − G̃_i‖²_{ℋ_Y} = 𝒪(δ_m) for i ∈ {1, …, n}. Therefore,

E\Big[\Big\|\frac{1}{n}\sum_{i=1}^n(\hat F_i-\tilde F_i)\otimes(\hat G_i-\tilde G_i)\Big\|_{\mathrm{HS}}\Big] = \mathcal{O}(\delta_m). \quad (S.3)

For the expectation of the second term on the right-hand side of equation (S.2), we have

\begin{aligned}
2E\Big\|\Big(\frac{1}{n}\sum_{j=1}^n(\hat F_j-\tilde F_j)\Big)\otimes\Big(\frac{1}{n}\sum_{j=1}^n(\hat G_j-\tilde G_j)\Big)\Big\|_{\mathrm{HS}}
&= 2E\Big[\Big\|\frac{1}{n}\sum_{j=1}^n(\hat F_j-\tilde F_j)\Big\|_{\mathcal{H}_X}\Big\|\frac{1}{n}\sum_{j=1}^n(\hat G_j-\tilde G_j)\Big\|_{\mathcal{H}_Y}\Big]\\
&\le 2\Big(E\Big\|\frac{1}{n}\sum_{j=1}^n(\hat F_j-\tilde F_j)\Big\|_{\mathcal{H}_X}^2\Big)^{1/2}\Big(E\Big\|\frac{1}{n}\sum_{j=1}^n(\hat G_j-\tilde G_j)\Big\|_{\mathcal{H}_Y}^2\Big)^{1/2}\\
&\le 2\Big(\frac{1}{n}\sup_{1\le i\le n}E\|\hat F_i-\tilde F_i\|_{\mathcal{H}_X}^2\Big)^{1/2}\Big(\frac{1}{n}\sup_{1\le i\le n}E\|\hat G_i-\tilde G_i\|_{\mathcal{H}_Y}^2\Big)^{1/2} = \mathcal{O}(\delta_m/n).
\end{aligned} \quad (S.4)

Combining results (S.1), (S.3), and (S.4), we have

E\|\hat\Sigma_{XY}-\Sigma_{XY}\|_{\mathrm{HS}} = \mathcal{O}\big(\delta_m(1+1/n)+n^{-1/2}\big) = \mathcal{O}(\delta_m+n^{-1/2}).

Then by Chebyshev’s inequality, we have

\|\hat\Sigma_{XY}-\Sigma_{XY}\|_{\mathrm{HS}} = \mathcal{O}_p(\delta_m+n^{-1/2}),

as desired. This completes the proof of Lemma 2.

Proof of Theorem 1. Let

\hat A = (\hat\Sigma_{XX}+\eta_n I)^{-1},\quad A_n = (\Sigma_{XX}+\eta_n I)^{-1},\quad A = \Sigma_{XX}^{-1};\qquad \hat B = \hat\Sigma_{XY},\quad B = \Sigma_{XY}.

Then the quantity of interest, Λ̂ − Λ, can be written as

\hat\Lambda-\Lambda = \hat A\hat B\hat B^*\hat A^*-ABB^*A^* = \hat A\hat B(\hat B^*\hat A^*-B^*A^*)+(\hat A\hat B-AB)B^*A^*.

Thus, we have

\begin{aligned}
\|\hat\Lambda-\Lambda\|_{\mathrm{OP}}
&\le \|\hat A\hat B(\hat B^*\hat A^*-B^*A^*)\|_{\mathrm{OP}}+\|(\hat A\hat B-AB)B^*A^*\|_{\mathrm{OP}}\\
&= \|(AB-\hat A\hat B)\hat B^*\hat A^*\|_{\mathrm{OP}}+\|(\hat A\hat B-AB)B^*A^*\|_{\mathrm{OP}}\\
&\le \|AB-\hat A\hat B\|_{\mathrm{OP}}\big(\|\hat A\hat B\|_{\mathrm{OP}}+\|AB\|_{\mathrm{OP}}\big).
\end{aligned}

Since both AB and AˆBˆ are compact operators, it suffices to show that

\|AB-\hat A\hat B\|_{\mathrm{OP}} = \mathcal{O}_p(\eta_n^{\beta}+\eta_n^{-1}\varepsilon_{n,m}),

where εn,m=δm+n1/2. Writing AˆBˆ as

\hat A\hat B = \hat A(\hat B-B)+(\hat A-A_n)B+(A_n-A)B+AB,

we obtain

\|AB-\hat A\hat B\|_{\mathrm{OP}} \le \|\hat A(\hat B-B)\|_{\mathrm{OP}}+\|(\hat A-A_n)B\|_{\mathrm{OP}}+\|(A_n-A)B\|_{\mathrm{OP}}. \quad (S.5)

For the first term on the right-hand side, we have

\begin{aligned}
\|\hat A(\hat B-B)\|_{\mathrm{OP}}
&= \|(\hat\Sigma_{XX}+\eta_n I)^{-1}(\hat\Sigma_{XY}-\Sigma_{XY})\|_{\mathrm{OP}}
\le \|(\hat\Sigma_{XX}+\eta_n I)^{-1}\|_{\mathrm{OP}}\,\|\hat\Sigma_{XY}-\Sigma_{XY}\|_{\mathrm{HS}}\\
&\le \eta_n^{-1}\|(\hat\Sigma_{XX}+\eta_n I)(\hat\Sigma_{XX}+\eta_n I)^{-1}\|_{\mathrm{OP}}\,\|\hat\Sigma_{XY}-\Sigma_{XY}\|_{\mathrm{HS}}
= \mathcal{O}_p(\eta_n^{-1}\varepsilon_{n,m}),
\end{aligned} \quad (S.6)

where the last inequality follows from Lemma 2. For the second term on the right-hand side of (S.5), we write it as

\begin{aligned}
(\hat A-A_n)B &= \big((\hat\Sigma_{XX}+\eta_n I)^{-1}-(\Sigma_{XX}+\eta_n I)^{-1}\big)\Sigma_{XY}\\
&= -(\hat\Sigma_{XX}+\eta_n I)^{-1}\big((\hat\Sigma_{XX}+\eta_n I)-(\Sigma_{XX}+\eta_n I)\big)(\Sigma_{XX}+\eta_n I)^{-1}\Sigma_{XY}\\
&= -(\hat\Sigma_{XX}+\eta_n I)^{-1}(\hat\Sigma_{XX}-\Sigma_{XX})(\Sigma_{XX}+\eta_n I)^{-1}\Sigma_{XX}\,\Sigma_{XX}^{-1}\Sigma_{XY}.
\end{aligned}

Thus, we have

\|(\hat A-A_n)B\|_{\mathrm{OP}} \le \|(\hat\Sigma_{XX}+\eta_n I)^{-1}\|_{\mathrm{OP}}\,\|\hat\Sigma_{XX}-\Sigma_{XX}\|_{\mathrm{OP}}\,\|(\Sigma_{XX}+\eta_n I)^{-1}\Sigma_{XX}\|_{\mathrm{OP}}\,\|\Sigma_{XX}^{-1}\Sigma_{XY}\|_{\mathrm{OP}}.

By the above derivations, we have ‖(Σ̂_XX + η_n I)^{-1}‖_OP = 𝒪_p(η_n^{-1}) and ‖Σ̂_XX − Σ_XX‖_OP = 𝒪_p(ε_{n,m}). Also, we have

\|(\Sigma_{XX}+\eta_n I)^{-1}\Sigma_{XX}\|_{\mathrm{OP}} \le \|(\Sigma_{XX}+\eta_n I)^{-1}(\Sigma_{XX}+\eta_n I)\|_{\mathrm{OP}} = 1,

and ‖Σ_XX^{-1}Σ_XY‖_OP < ∞ by Assumption 1. Therefore, we have

\|(\hat A-A_n)B\|_{\mathrm{OP}} = \mathcal{O}_p(\eta_n^{-1}\varepsilon_{n,m}). \quad (S.7)

Finally, letting R_XY = Σ_XX^β S_XY and rewriting the third term on the right-hand side of (S.5) as

(A_n-A)B = \big((\Sigma_{XX}+\eta_n I)^{-1}-\Sigma_{XX}^{-1}\big)\Sigma_{XY} = (\Sigma_{XX}+\eta_n I)^{-1}\Sigma_{XX}R_{XY}-R_{XY} = -\eta_n(\Sigma_{XX}+\eta_n I)^{-1}R_{XY},

we see that

\|(A_n-A)B\|_{\mathrm{OP}} \le \eta_n\,\|(\Sigma_{XX}+\eta_n I)^{-1}\Sigma_{XX}^{\beta}\|_{\mathrm{OP}}\,\|S_{XY}\|_{\mathrm{OP}} \le \eta_n\,\eta_n^{\beta-1}\,\|S_{XY}\|_{\mathrm{OP}} = \eta_n^{\beta}\|S_{XY}\|_{\mathrm{OP}} = \mathcal{O}_p(\eta_n^{\beta}). \quad (S.8)

Combining (S.6), (S.7), and (S.8), we prove the first assertion of the theorem.

The second assertion can then be proved by following roughly the same path and using the following facts:

  1. if A is a bounded operator, B is a Hilbert–Schmidt operator, and ran(A) ⊆ dom(B), then AB is a Hilbert–Schmidt operator with
    \|AB\|_{\mathrm{HS}} \le \|A\|_{\mathrm{OP}}\|B\|_{\mathrm{HS}};
  2. if A is Hilbert–Schmidt, then so is A^*, and
    \|A\|_{\mathrm{HS}} = \|A^*\|_{\mathrm{HS}}.

Using the same decomposition as (S.5), we have

\|AB-\hat A\hat B\|_{\mathrm{HS}} \le \|\hat A(\hat B-B)\|_{\mathrm{HS}}+\|(\hat A-A_n)B\|_{\mathrm{HS}}+\|(A_n-A)B\|_{\mathrm{HS}}. \quad (S.9)

For the first term on the right-hand side of (S.9):

\|\hat A(\hat B-B)\|_{\mathrm{HS}} \le \|\hat A\|_{\mathrm{OP}}\,\|\hat B-B\|_{\mathrm{HS}} = \mathcal{O}_p(\eta_n^{-1}\varepsilon_{n,m}). \quad (S.10)

For the second term on the right-hand side of (S.9):

\begin{aligned}
\|(\hat A-A_n)B\|_{\mathrm{HS}}
&\le \|(\hat\Sigma_{XX}+\eta_n I)^{-1}\|_{\mathrm{OP}}\,\|\hat\Sigma_{XX}-\Sigma_{XX}\|_{\mathrm{OP}}\,\|(\Sigma_{XX}+\eta_n I)^{-1}\Sigma_{XX}\|_{\mathrm{OP}}\,\|\Sigma_{XX}^{-1}\Sigma_{XY}\|_{\mathrm{HS}}\\
&\le \|(\hat\Sigma_{XX}+\eta_n I)^{-1}\|_{\mathrm{OP}}\,\|\hat\Sigma_{XX}-\Sigma_{XX}\|_{\mathrm{OP}}\,\|(\Sigma_{XX}+\eta_n I)^{-1}\Sigma_{XX}\|_{\mathrm{OP}}\,\|\Sigma_{XX}^{\beta}S_{XY}\|_{\mathrm{HS}}
= \mathcal{O}_p(\eta_n^{-1}\varepsilon_{n,m}).
\end{aligned} \quad (S.11)

For the third term on the right-hand side of (S.9):

\|(A_n-A)B\|_{\mathrm{HS}} \le \eta_n\,\|(\Sigma_{XX}+\eta_n I)^{-1}\Sigma_{XX}^{\beta}\|_{\mathrm{OP}}\,\|S_{XY}\|_{\mathrm{HS}} \le \eta_n^{\beta}\|S_{XY}\|_{\mathrm{HS}} = \mathcal{O}_p(\eta_n^{\beta}). \quad (S.12)

Combining the results (S.10), (S.11), and (S.12), we have

\|AB-\hat A\hat B\|_{\mathrm{HS}} = \mathcal{O}_p(\eta_n^{\beta}+\eta_n^{-1}\varepsilon_{n,m}).

This completes the proof of Theorem 1.

Acknowledgments

We thank the Editor, Associate Editor and referees for their helpful comments. This research is partly supported by the NIH grant 1R01GM152812 and the NSF grants DMS-1953189, CCF-2007823 and DMS-2210775.


