Nonlinear sufficient dimension reduction for distribution-on-distribution regression

Qi Zhang; Bing Li; Lingzhou Xue

doi:10.1016/j.jmva.2024.105302

. Author manuscript; available in PMC: 2025 Jul 1.

Published in final edited form as: J Multivar Anal. 2024 Feb 27;202:105302. doi: 10.1016/j.jmva.2024.105302

Nonlinear sufficient dimension reduction for distribution-on-distribution regression

Qi Zhang ^a, Bing Li ^a, Lingzhou Xue ^a,^*

PMCID: PMC10956811 NIHMSID: NIHMS1972219 PMID: 38525479

Abstract

We introduce a new approach to nonlinear sufficient dimension reduction in cases where both the predictor and the response are distributional data, modeled as members of a metric space. Our key step is to build universal kernels (cc-universal) on the metric spaces, which results in reproducing kernel Hilbert spaces for the predictor and response that are rich enough to characterize the conditional independence that determines sufficient dimension reduction. For univariate distributions, we construct the universal kernel using the Wasserstein distance, while for multivariate distributions, we resort to the sliced Wasserstein distance. The sliced Wasserstein distance ensures that the metric space possesses similar topological properties to the Wasserstein space, while also offering significant computation benefits. Numerical results based on synthetic data show that our method outperforms possible competing methods. The method is also applied to several data sets, including fertility and mortality data and Calgary temperature data.

Keywords: Distributional data, RKHS, Sliced Wasserstein distance, Universal kernel, Wasserstein distance, 62G08, 62H12

1. Introduction

Complex data objects such as random elements in general metric spaces are commonly encountered in modern statistical applications. However, these data objects do not conform to the operation rules of Hilbert spaces and lack important properties such as inner products and orthogonality, making them difficult to analyze using traditional multivariate and functional data analysis methods. An important example of the metric space-valued data objects is the distributional data, which can be modeled as random probability measures that satisfy specific regularity conditions. Recently, there has been an increasing interest in this type of data. Petersen and Müller [43] extended the classical regression to Fréchet regression, making it possible to handle univariate distribution on scalar or vector regression. Fan and Müller [14] extended the Fréchet regression framework to the case of multivariate response distributions. Besides scalar or vector-valued predictors, the relationship between two distributions is also becoming increasingly important. Petersen and Müller [42] proposed the log quantile density (LQD) transformation to transform the densities of these distributions to unconstrained functions in the Hilbert space $L_{2}$ . Chen et al. [9] further applied function-to-function linear regression to the LQD transformations of distributions and mapped the fitted responses back to the Wasserstein space through the inverse LQD transformation. Chen et al. [8] proposed a distribution-on-distribution regression model by adopting the Wasserstein metric and shows that it works better than the transformation methods in Chen et al. [9]. Recently, Bhattacharjee et al. [4] proposed a global nonlinear Fréchet regression model for random objects via weak conditional expectation. In practical applications, distribution-on-distribution regression has been utilized for analyzing mortality distributions across different countries or regions [20], distributions of fMRI brain imaging signals [42], and distributions of daily temperature and humidity [40], among others.

Distribution-on-distribution regression encounters similar challenges to classical regression, including the need for exploratory data analysis, data visualization, and improved estimation accuracy through data dimension reduction. In classical regression, sufficient dimension reduction (SDR) has proven to be an effective tool for addressing these challenges. To set the stage, we outline the classical sufficient dimension reduction (SDR) framework. Let $X$ be a $p$ -dimensional random vector in $ℝ^{p}$ and $Y$ a random variable in $ℝ$ . Linear SDR aims to find a subspace $𝓢$ of $ℝ^{p}$ such that $Y ⫫ X ∣ P_{𝒮} X$ , where $P_{𝒮}$ is the projection on to $𝓢$ with respect to the usual inner product in $ℝ^{p}$ . As an extension of linear SDR, [28] and [25] propose the general theory of nonlinear sufficient dimension reduction, which seeks a set of nonlinear functions $f_{1} (X), \dots, f_{d} (X)$ in a Hilbert space such that $Y ⫫ X ∣ f_{1} (X), \dots, f_{d} (X)$ .

In the last two decades, the SDR framework has undergone constant evolution to adapt to increasingly complex data structures. Researchers have extended SDR to functional data [16, 22, 30, 31], tensorial data [12, 29], and forecasting with large panel data [15, 35, 53] Most recently, Ying and Yu [52], Zhang et al. [54], and Dong and Wu [13] have developed SDR methods for cases where the response takes values in a metric space while the predictor lies in Euclidean space.

Let $X$ and $Y$ be random distributions defined on $M \subseteq ℝ^{r}$ , with finite $p$ -th moments $(p \geq 1)$ . We do allow $X$ and $Y$ to be random vectors, but our focus will be on the case where they are distributions. Modelling $X$ and $Y$ as random elements in metric spaces $(Ω_{X}, d_{X})$ and $(Ω_{Y}, d_{Y})$ , we seek nonlinear functions $f_{1}, \dots, f_{d}$ defined on $Ω_{X}$ such that the random measures $Y$ and $X$ are conditionally independent given $f_{1} (X), \dots, f_{d} (X)$ . In order to guarantee the theoretical properties of the nonlinear SDR methods and facilitate the estimation procedure, we assume $f_{1}, \dots, f_{d}$ to reside in a reproducing kernel Hilbert space (RKHS). While the nonlinear SDR problem can be formulated in much the same way as that for multivariate and functional data, the main new element in this theory that still requires substantial effort is the construction of positive definite and universal kernels on $Ω_{X}$ and $Ω_{Y}$ . These are needed for constructing unbiased and exhaustive estimators for the dimension reduction problem [27]. We achieve this purpose with specific choices of the metrics of Wasserstein distance and sliced Wasserstein distance: we will show how to construct positive definite and universal kernels and the RKHS generated from them to achieve nonlinear SDR for distributional data.

While acknowledging the recent independent work of Virta et al. [50], who proposed a nonlinear SDR method for metric space-valued data, our work has some novel contributions. First, we focus on distributional data and consider a practical setting where only discrete samples from each distribution are available instead of the distributions themselves, while Virta et al. [50] only illustrated the method with torus data, positive definite matrices, and compositional data. Second, we explicitly construct universal kernels over the space of distributions, which results in an RKHS that is rich enough to characterize the conditional independence. In contrast, Virta et al. [50] only assumed that the RKHS is dense in $L^{2}$ space but missed verifications.

The rest of the paper is organized as follows. Section 2 defines the general framework of nonlinear sufficient dimension reduction for distributional data. Section 3 shows how to construct RKHS on the space of univariate distributions and multivariate distributions, respectively. Section 4 proposes the generalized sliced inverse regression methods for distribution data. Section 5 establishes the convergence rate of the proposed methods for both the fully observed setting and the discretely observed setting. Simulation results are presented in Section 6 to show the numerical performances of proposed methods. In Section 7, we analyze two real applications to human mortality & fertility data and Calgary extreme temperature data, demonstrating the usefulness of our methods. All proofs are presented in Section 9.

2. Nonlinear SDR for Distributional Data

We consider the setting of distribution-on-distribution regression. Let $(Ω, ℱ, P)$ be a probability space. Let $M$ be a subset of $ℝ^{r}$ and $ℬ (M)$ the Borel $σ$ -field on $M$ . Let $𝒫_{p} (M)$ be the set of Borel probability measures on $(M, ℬ (M)$ ) that have finite $p$ -th moment and that is dominated by the Lebesgue measure on $ℝ^{r}$ . We let $Ω_{X}$ and $Ω_{Y}$ be nonempty subsets of $𝒫_{p} (M)$ equipped with metrics $d_{X}$ and $d_{Y}$ , respectively. We let $ℬ_{X}$ and $ℬ_{Y}$ be the Borel $σ$ -fields generated by the open sets in the metric spaces $(Ω_{X}, d_{X})$ and $(Ω_{Y}, d_{Y})$ . Let $(X, Y)$ be a random element mapping from $Ω_{X}$ to $Ω_{X} \times Ω_{Y}$ , measurable with respect to the product $σ$ -field $ℬ_{X} \times ℬ_{Y}$ . We denote the marginal distributions of $X$ and $Y$ by $P_{X}$ and $P_{Y}$ , respectively, and the conditional distributions of $Y ∣ X$ and $X ∣ Y$ by $P_{Y ∣ X}$ and $P_{X ∣ Y}$ .

Let $σ (X)$ be the sub $σ$ -field in $ℱ$ generated by $X$ , that is, $σ (X) = X^{- 1} ℬ_{X}$ . Following the terminology in [27], a sub $σ$ -field $𝒢$ of $σ (X)$ is called a sufficient dimension reduction $σ$ -field, or simply a sufficient $σ$ -field, if $Y ⫫ X ∣ 𝒢$ . In other words, $𝒢$ captures all the regression information of $Y$ on $X$ . As shown in Lee et al. [25], if the family of conditional probability measures $\{P_{X ∣ Y} (\cdot ∣ y) : y \in Ω_{Y}\}$ is dominated by a $σ$ -finite measure, then the intersection of all sufficient $σ$ -field is still a sufficient $σ$ -field. This minimal sufficient $σ$ -field is called the central $σ$ -field for $Y$ versus $X$ , denoted by $𝒢_{Y ∣ X}$ . By definition, the central $σ$ -field captures all the regression information of $Y$ on $X$ and is the target that we aim to estimate.

Let $ℋ_{X}$ be a Hilbert space of real-valued functions defined on $Ω_{X}$ . We convert estimating the central $σ$ field into estimating a subspace of $ℋ_{X}$ . Specifically, we assume that the central $σ$ -field is generated by a finite set of functions $f_{1}, \dots, f_{d}$ in $ℋ_{X}$ , which can be expressed as

Y ⫫ X | f_{1} (X), \dots, f_{d} (X) .

(1)

For any sub- $σ$ -field $𝒢$ of $σ (X)$ , let $ℋ_{X} (𝒢)$ denote the subspace of $ℋ_{X}$ spanned by the function $f$ such that $f (X)$ is $𝒢$ -measurable, that is,

ℋ_{X} (𝒢) = \bar{span} {f \in ℋ_{X}, f (X) is measurable 𝒢} .

(2)

We define the central class as $𝔖_{Y ∣ X} = ℋ_{X} (𝒢_{Y ∣ X})$ following (2). We say that a subspace $𝔖$ of $ℋ_{X}$ is unbiased if it is contained in $𝔖_{Y ∣ X}$ and consistent if it is equal to $𝔖_{Y ∣ X}$ . To recover the central class $𝔖_{Y ∣ X}$ consistently by an extension of Sliced Inverse Regression [32], we need to assume the central $σ$ -field is complete [25].

Definition 1. A sub $σ$ -field $𝒢$ of $σ (X)$ is complete if, for each function $f$ such that $f (X)$ is $𝒢$ measurable and $E [f (X) ∣ Y] = 0$ almost surely $P_{Y}$ , we have $f (X) = 0$ almost surely $P_{X}$ . We say that $ℋ_{X} (𝒢)$ is a complete class for $Y$ versus $X$ if $𝒢$ is complete $σ$ -field for $Y$ versus $X$ .

Although our theoretical analysis so far does not require $ℋ_{X}$ and $ℋ_{Y}$ to be RKHS, using an RKHS provides a concrete framework for establishing an unbiased and consistent estimator. It also builds a connection between the classical linear SDR and nonlinear SDR in the sense that $f (x)$ can be expressed as the inner product $⟨ f, κ (\cdot, x) ⟩$ , where $κ : Ω_{X} \times Ω_{X} \to ℝ$ is the reproducing kernel. This inner product is a nonlinear extension of $β^{⊤} X$ in linear SDR. In the next section, We will describe how to construct RKHS for univariate and multivariate distributions.

3. Construction of RKHS

A common approach to constructing a reproducing kernel is to use a classical radial basis function $φ (| x - c |)$ (such as the Gaussian radial basis kernel) and substitute the Euclidean distance with the distance in the metric space. However, not every metric can be used in such a way to produce positive definite kernels. We show that metric spaces that are of negative type can yield positive definite kernels with form $φ (‖ x - c ‖)$ . Moreover, as will be seen in Proposition 3 and the discussion following it, in order to achieve an unbiased and consistent estimation of the central class $𝒢_{Y ∣ X}$ , we need the kernels for $ℋ_{X}$ and $ℋ_{Y}$ to be cc-universal (Micchelli et al. 37). For ease of reference, we use the term ”universal” to refer to cc-universal kernels. We select the Wasserstein metric and sliced Wasserstein metric for our work, as they possess the desired properties for constructing universal kernels.

3.1. Wasserstein kernel for univariate distributions

For probability measures $μ_{1}$ and $μ_{2}$ in $𝒫_{p} (M)$ , the $p$ -Wasserstein distance between $μ_{1}$ and $μ_{2}$ is defined as the solution of the Kantorovich transportation problem [49]:

W_{p} (μ_{1}, μ_{2}) = {(inf_{γ \in Γ (μ_{1}, μ_{2})} \int_{M \times M} {‖ x - y ‖}^{p} d γ (x, y))}^{1 / p},

where $∥ \cdot ∥$ is the Euclidean metric, and $Γ (μ_{1}, μ_{2})$ is the space of joint probability measures on $(M \times M, ℬ (M) \times ℬ (M))$ with marginals $μ_{1}$ and $μ_{2}$ . When $M \subseteq ℝ$ , the $p$ -Wasserstein distance has the following explicit quantile representation:

W_{p} (μ_{1}, μ_{2}) = {(\int_{0}^{1} {[F_{μ_{1}}^{- 1} (s) - F_{μ_{2}}^{- 1} (s)]}^{2} d s)}^{1 / p},

where $F_{μ_{1}}^{- 1}$ and $F_{μ_{2}}^{- 1}$ denote the quantile functions of $μ_{1}$ and $μ_{2}$ , respectively. The set $𝒫_{p} (M)$ endowed with the Wasserstein metric $W_{p}$ is called the Wasserstein metric space and is denoted by $𝒲_{p} (M)$ . Kolouri et al. [23, Theorem 4] show that Wasserstein space of absolutely continuous univariate distributions can be isometrically embedded in a Hilbert space, and thus the Gaussian RBF kernel is positive definite.

We now turn to universality. Christmann and Steinwart [10, Theorem 3] showed that if $Ω_{X}$ is compact and can be continuously embedded in a Hilbert space $ℋ$ by a mapping $ρ$ , then for any analytic function $A : ℝ \to ℝ$ whose Taylor series at zero has strictly positive coefficients, the function $κ (x, x^{'}) = A ({⟨ρ (x), ρ (x^{'})⟩}_{ℋ})$ defines a c-universal kernel on $Ω_{X}$ . To accommodate the scenarios of $M = ℝ$ and $M = ℝ^{r}$ , we need to go beyond compact metric spaces. For this reason, we use a more general definition of universality that does not require the support of the kernel to be compact, called cc-universality [37, 46, 47]. Let $κ_{X} : Ω_{X} \times Ω_{X} \to ℝ$ be a positive definite kernel and $ℋ_{X}$ the RKHS generated by $κ_{X}$ . For any compact set $K$ , let $ℋ_{X} (K)$ be the RKHS generated by $\{κ_{X} (\cdot, x) : x \in K\}$ . Let $C (K)$ be the class of all continuous functions with respect to the topology in $(Ω_{X}, d_{X})$ restricted on $K$ .

Definition 2. [37] We say that $κ_{X}$ is universal(cc-universal) if, for any compact set $K \subseteq Ω_{X}$ , any member $f$ of $C (K)$ , and any $ϵ > 0$ , there is an $h \in ℋ_{X} (K)$ such that ${s u p}_{x \in K} | f (x) - h (x) | < ϵ$ .

Let $κ_{G} (x, x^{'}) = e x p (- γ W_{2}^{2} (x, x^{'}))$ and $κ_{L} L (x, x^{'}) = e x p (- γ W_{2} (x, x^{'}))$ . The subscripts $G$ and $L$ here refer to “Gaussian” and “Laplacian”, respectively. [54] showed that both $κ_{G}$ and $κ_{L}$ on a complete and separable metric space that can be isometrically embedded into a Hilbert space are universal. We note that if $M$ is separable and complete, then so is $𝒲_{2} (M)$ [41, Proposition 2.2.8, Theorem 2.2.7]. Therefore, We have the following proposition that guarantees the construction of universal kernels on (possibly non-compact) $𝒲_{2} (M)$ .

Proposition 1. If $M \subseteq ℝ$ is complete, then $κ_{G} (x, x^{'})$ and $κ_{L} (x, x^{'})$ are universal kernels on $𝒲_{2} (M)$ .

By Proposition 1, we construct the Hilbert spaces $ℋ_{X}$ and $ℋ_{Y}$ as RKHS generated by Gaussian type kernel $κ_{G}$ or Laplacian type kernel $κ_{L}$ . Let $L_{2} (P_{X})$ be the class of square-integrable functions of $X$ under $P_{X}$ . Let $𝔅$ be the set of measurable indicator functions on $𝒲_{2} (M)$ , that is,

𝔅 = {I_{B} : B \subseteq 𝒲_{2} (M) is measurable} .

Recall that a measure $P_{X}$ on $(Ω_{X}, d)$ is regular if, for any Borel subset $B \subseteq Ω_{X}$ and any $ε > 0$ , there is a compact set $K \subseteq B$ and an open set $G \supseteq B$ , such that $P_{X} (G ∖ K) < ε$ . By Zhang et al. [54, Theorem 1], if $P_{X}$ is a regular measure, $ℋ_{X}$ is dense in $𝔅$ , and hence dense in $s p a n {𝔅}$ , which is the space of simple functions. Since $span {𝔅}$ is dense in $L_{2} (P_{X})$ , $ℋ_{X}$ is dense in $L_{2} (P_{X})$ .

3.2. Sliced-Wasserstein kernel for multivariate distributions

For multivariate distributions $(M \subseteq ℝ^{r})$ , the sliced $p$ -Wasserstein distance is obtained by computing the average Wasserstein distance of the projected univariate distributions along randomly picked directions. Let $μ_{1}$ and $μ_{2}$ be two measures in $𝒫_{p} (M)$ , where $M \subseteq ℝ^{r}$ , $r > 1$ . Let $𝕊^{r - 1}$ be the unit sphere in $ℝ^{r}$ . For $θ \in 𝕊^{r - 1}$ , let $T_{θ} : ℝ^{r} \to ℝ$ be the linear transformation $x \to 〈 θ, x 〉$ , where $〈 \cdot, \cdot 〉$ is the Euclidean inner-product. Let $μ_{1} \circ T_{θ}^{- 1}$ and $μ_{2} \circ T_{θ}^{- 1}$ be the induced measures by the mapping $T_{θ}$ . The sliced $p$ -Wasserstein distance between $μ_{1}$ and $μ_{2}$ is defined by

{SW}_{p} (μ_{1}, μ_{2}) = {(\int_{𝕊^{r - 1}} W_{p}^{p} (μ_{1} \circ T_{θ}^{- 1}, μ_{2} \circ T_{θ}^{- 1}) d θ)}^{1 / p} .

It can be verified that ${S W}_{p}$ is indeed a metric. We denote the metric space $(𝒫_{p} (M), {S W}_{p})$ by $𝒯 𝒲_{p} (M)$ and call it the sliced Wasserstein space. It has been shown (for example, Bayraktar and Guo [2]) that the sliced Wasserstein metric is a weaker metric than the Wasserstein metric, that is $\forall μ_{1}, μ_{2} \in P_{p} (M)$ with $M \subseteq ℝ^{r}$ , ${S W}_{p} (μ_{1}, μ_{2}) \leq W_{p} (μ_{1}, μ_{2})$ . This relation implies two topological properties of the sliced Wasserstein space that are useful to us, which can be derived from the topological properties of $p$ -Wasserstein space established in Ambrosio et al. [1, Proposition 7.1.5], and Panaretos and Zemel [41, Chapter 2.2].

Proposition 2. If $M$ is a subset of $ℝ^{r}$ , then $𝒯 𝒲_{p} (M)$ is complete and separable. Furthermore, if $M \subseteq ℝ^{r}$ is compact, then $𝒯 𝒲_{p} (M)$ is compact.

With $p = 2$ , [23] show that the square of sliced Wasserstein distance is conditionally negative definite and hence that the Gaussian RBF kernel $e x p (- γ {S W}_{2}^{2} (x, x^{'}))$ is a positive definite kernel. The next lemma shows that the Gaussian RBF kernel and Laplacian RBF kernel based on the sliced Wasserstein distance are, in fact, universal kernels.

Lemma 1. If $M \subseteq ℝ^{r} (r > 1)$ is complete, then both $κ_{G} (x, x^{'}) = e x p (- γ {S W}_{2}^{2} (x, x^{'}))$ and $κ_{L} (x, x^{'}) = e x p (- γ {S W}_{2} (x, x^{'}))$ are universal kernels on $𝒯 𝒲_{2} (M)$ . Furthermore, if $P_{X}$ and $P_{Y}$ are regular measures, $ℋ_{X}$ and $ℋ_{Y}$ are dense in $L_{2} (P_{X})$ and $L_{2} (P_{Y})$ , respectively.

It’s worth mentioning that in a recent study, Meunier et al. [36] demonstrated the universality of the Sliced Wasserstein kernel. However, our findings extend beyond the scope of Meunier et al. [36]’s work. Specifically, our results apply to scenarios where $M$ is non-compact, such as $M = ℝ^{d}$ , by introducing cc-universality as defined in Definition 2.

4. Generalized Sliced Inverse Regression for Distributional Data

This section extends the generalized sliced inverse regression (GSIR) [25] for distributional data. We call this extension to univariate distribution settings as Wasserstein GSIR, or W-GSIR, and to multivariate distribution settings as Sliced-Wasserstein GSIR, or SW-GSIR.

4.1. Distributional GSIR and the role of universal kernel

To model the nonlinear relationships between random elements, we introduce the covariance operator in the RKHS, a concept similar to the constructions in [19, 25], [27, Chapter 12.2] and [30]. Let $ℋ_{1}$ and $ℋ_{2}$ be two arbitrary Hilbert spaces, and let $ℬ (ℋ_{1}, ℋ_{2})$ denote the class of bounded linear operators from $ℋ_{1}$ to $ℋ_{2}$ . If $ℋ_{1} = ℋ_{2} = ℋ$ , we use $ℬ (ℋ)$ to denote $ℬ (ℋ, ℋ)$ . For any operator $T \in ℬ (ℋ_{1}, ℋ_{2})$ , we use $T^{*}$ to denote the adjoint operator of $T$ , $k e r (T)$ to denote the kernel of $T, r a n (T)$ to denote the range of $T$ , and $\bar{r a n} (T)$ to denote the closure of the range of $T$ . Given two members $f$ and $g$ of $ℋ$ , the tensor product $f \otimes g$ is the operator on $ℋ$ such that $(f \otimes g) h = f {〈 g, h 〉}_{ℋ}$ for all $h \in ℋ$ . It is important to note that the adjoint operator of $f \otimes g$ is $g \otimes f$ .

We define $E [κ (\cdot, X)]$ , the mean element of $X$ in $ℋ_{X}$ , as the unique element in $ℋ_{X}$ such that

{〈 f, E [κ (\cdot, X)] 〉}_{ℋ_{X}} = E {〈 f, κ (\cdot, X) 〉}_{ℋ_{X}}

(3)

for all $f \in ℋ_{X}$ . Define the bounded linear operator $E [κ (\cdot, X) \otimes κ (\cdot, X)]$ , the second-moment operator of $X$ in $ℋ_{X}$ , as the unique element in $ℬ (ℋ_{X})$ such that, for all $f$ and $g$ in $ℋ_{X}$ ,

{〈 f, E [κ (\cdot, X) \otimes κ (\cdot, X)] g 〉}_{ℋ_{X}} = E {〈 f, (κ (\cdot, X) \otimes κ (\cdot, x)) g 〉}_{ℋ_{X}} .

(4)

We write $μ_{X} = E [κ (\cdot, X)]$ , $M_{X X} = E [κ (\cdot, X) \otimes k (\cdot, X)]$ . For Gaussian RBF kernel and Laplacian RBF kernel based on Wasserstein distance or sliced-Wasserstein distance, $κ (X, X)$ is bounded and $E [κ (X, X)]$ is finite. By Cauchy-Schwartz inequality and Jensen’s inequality, it is guaranteed that items on the right-hand side of (3) and (4) are well-defined. The existence and uniqueness of $μ_{X}$ and $M_{X X}$ is guaranteed by Riesz’s representation theorem. We then define the covariance operator $Σ_{X X}$ as $M_{X X} - μ_{X} \otimes μ_{X}$ . Then, for all $f$ , $g \in ℋ_{X}$ , we have $c o v (f (X), g (X)) = {〈 f, Σ_{X X} g 〉}_{ℋ_{X}}$ . Similarly, we can define $μ_{Y} \in ℋ_{Y}$ , $Σ_{Y Y} \in ℬ (ℋ_{Y})$ , $Σ_{X Y} \in ℬ (ℋ_{X}, ℋ_{Y})$ and $Σ_{Y X} \in ℬ (ℋ_{Y}, ℋ_{X})$ . By definition, both $Σ_{X X}$ and $Σ_{Y Y}$ are self-adjoint, and $Σ_{X Y}^{*} = Σ_{Y X}$ .

To define the regression operators $Σ_{X X}^{- 1} Σ_{X Y}$ and $Σ_{Y Y}^{- 1} Σ_{Y X}$ , we make the following assumptions. Similar regularity conditions are assumed in [25, 27, 28].

Assumption 1.

$k e r (Σ_{X X}) = {0}$ and $k e r (Σ_{Y Y}) = {0}$ .
$r a n (Σ_{X Y}) \subseteq r a n (Σ_{X X})$ and $r a n (Σ_{Y X}) \subseteq r a n (Σ_{Y Y})$ .
The operators $Σ_{X X}^{- 1} Σ_{X Y}$ and $Σ_{Y Y}^{- 1} Σ_{Y X}$ are compact.

Condition (i) amounts to resetting the domains of $Σ_{X X}$ and $Σ_{Y Y}$ to $k e r {(Σ_{X X})}^{⊥}$ and $k e r {(Σ_{Y Y})}^{⊥}$ , respectively. This is motivated by the fact that members of $k e r (Σ_{X X})$ and $k e r (Σ_{Y Y})$ are constants almost surely, which are irrelevant when we consider independence. Since $Σ_{X X}$ and $Σ_{Y Y}$ are self-adjoint operators, this assumption is equivalent to resetting $ℋ_{X}$ to $\bar{r a n} (Σ_{X X})$ and $ℋ_{Y}$ to $\bar{r a n} (Σ_{Y Y})$ , respectively. Condition (i) also implies that the mappings $Σ_{X X}$ and $Σ_{Y Y}$ are invertible, though, as we will see, $Σ_{X X}^{- 1}$ and $Σ_{Y Y}^{- 1}$ are unbounded operators.

Condition (ii) guarantees that $r a n (Σ_{X Y}) \subseteq d o m (Σ_{X X}^{- 1}) = r a n (Σ_{X X})$ and $r a n (Σ_{Y X}) \subseteq d o m (Σ_{Y Y}^{- 1}) = r a n (Σ_{Y Y})$ , which is necessary to define the regression operators $Σ_{X X}^{- 1} Σ_{X Y}$ and $Σ_{Y Y}^{- 1} Σ_{Y X}$ . By Proposition 12.5 of [27], $r a n (Σ_{Y X}) \subseteq \bar{r a n} (Σ_{Y Y})$ and $r a n (Σ_{X Y}) \subseteq \bar{r a n} (Σ_{X X})$ . Thus, the above assumption is not very strong.

As interpreted in Section 13.1 of [27], Condition (iii) in Assumption 1 is akin to a smoothness condition. Even though the inverse mappings $Σ_{X X}^{- 1}$ and $Σ_{Y Y}^{- 1}$ are well defined, since $Σ_{X X}$ and $Σ_{Y Y}$ are Hilbert Schmidt operators ([18]), these inverses are unbounded operators. However, these unbounded operators never appear by themselves but are always accompanied by operators multiplied from the right. Condition (iii) assumes that the composite operators $Σ_{X X}^{- 1} Σ_{X Y}$ and $Σ_{Y Y}^{- 1} Σ_{Y X}$ are compact. This requires, for example, that $Σ_{Y Y}^{- 1} Σ_{Y X}$ must send all incoming functions into the low-frequency range of the eigenspaces of $Σ_{Y Y}$ with relatively large eigenvalues. That is, $Σ_{Y X}$ and $Σ_{X Y}$ are smooth in the sense that their outputs are low-frequency components of $Σ_{Y Y}$ or $Σ_{X X}$ .

With Assumption 1 and universal kernels $κ_{X}$ and $κ_{Y}$ , we then have that the range of the regression operator $Σ_{X X}^{- 1} Σ_{X Y}$ is contained in central class $𝔖_{Y ∣ X}$ . Furthermore, if the central class $𝔖_{Y ∣ X}$ is also complete, it can be fully covered by the range of $Σ_{X X}^{- 1} Σ_{X Y}$ . The next proposition adapts the main result of Chapter 13 of [27] to the current context.

Proposition 3. If Assumption 1 holds, $ℋ_{X}$ is dense in $L_{2} (P_{X})$ and $ℋ_{Y}$ is dense in $L_{2} (P_{Y})$ , then we have $r a n (Σ_{X X}^{- 1} Σ_{X Y}) \subseteq 𝔖_{Y ∣ X}$ . If, furthermore, $𝔖_{Y ∣ X}$ is complete, then we have $r a n (Σ_{X X}^{- 1} Σ_{X Y}) = 𝔖_{Y ∣ X}$ .

The universal kernels $κ_{X}$ and $κ_{Y}$ proposed in Section 3 guarantees that $ℋ_{X}$ is dense in $L_{2} (P_{X})$ and $L_{2} (P_{Y})$ , respectively.

4.2. Estimation for distributional GSIR

By Proposition 3, for any invertible operator $A$ , we have $\bar{r a n} (Σ_{X X}^{- 1} Σ_{X Y} A Σ_{Y X} Σ_{X X}^{- 1}) \subseteq 𝔖_{Y ∣ X}$ . Two common choices are $A = I$ and $A = Σ_{Y Y}^{- 1}$ . When we take $A = Σ_{Y Y}^{- 1}$ , the procedure is a nonlinear parallel of SIR in the sense that we replace the inner product in the Euclidean space by the inner product in the RKHS $ℋ_{X}$ . For easy reference, we refer to the method using $A = I$ as W-GSIR1 or SW-GSIR1 and $A = Σ_{X Y}^{- 1}$ as W-GSIR2 or SW-GSIR2. To estimate the space $\bar{r a n} (Σ_{X X}^{- 1} Σ_{X Y} A Σ_{Y X} Σ_{X X}^{- 1})$ , we successively solve the following generalized eigenvalue problem:

maximize {〈 f; Σ_{X Y} A Σ_{Y X} f 〉}_{ℋ_{X}} subject to {〈 f; Σ_{X X} f 〉}_{ℋ_{X}} = 1; f ⊥ span {f_{1}, \dots, f_{k - 1}}, k \in {1, 2, \dots, d},

where $f_{1}, \dots, f_{k}$ are the solutions to this constrained optimization problem in the first $k$ steps.

At the sample level, we estimate $Σ_{X X}$ , $Σ_{Y Y}$ , $Σ_{X Y}$ and $Σ_{Y X}$ by replacing the expectations $E (\cdot)$ with sample moments $E_{n} (\cdot)$ whenever possible. For example, suppose we are given i.i.d. sample $(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})$ of $(X, Y)$ . We estimate $Σ_{X X}$ by

{\hat{Σ}}_{X X} = E_{n} [κ (\cdot, X) \otimes κ (\cdot, X)] - E_{n} [κ (\cdot, X)] \otimes E_{n} [κ (\cdot, X)] .

The sample estimates ${\hat{Σ}}_{Y Y}, {\hat{Σ}}_{X Y}$ and ${\hat{Σ}}_{Y X}$ for $Σ_{Y Y}, Σ_{X Y}$ and $Σ_{Y X}$ are similarly defined. The subspace $\bar{r a n} ({\hat{Σ}}_{X X})$ and $\bar{r a n} ({\hat{Σ}}_{Y Y})$ are spanned by the sets $ℬ_{X} = \{κ (\cdot, X_{i}) - E_{n} κ (\cdot, X) : i = 1, \dots, n\}$ , and $ℬ_{Y} = \{κ (\cdot, Y_{i}) - E_{n} κ (\cdot, Y) : i \in {1, \dots, n}\}$ , respectively. Let $K_{X}, K_{Y}$ denote the $n \times n$ matrix whose $(i, j)$ -th entry is $κ (X_{i}, X_{j}), κ (Y_{i}, Y_{j})$ respectively, and let $Q$ denote the projection matrix $I_{n} - 1_{n} 1_{n}^{T} / n$ . For two Hilbert spaces $ℋ_{1}, ℋ_{2}$ with spanning systems $ℬ_{1}$ and $ℬ_{2}$ , and a linear operator $A : ℋ_{1} \to ℋ_{2}$ , we use the notation $ℬ_{2} [A]_{ℬ_{1}}$ to represent the coordinate representation of $A$ relative to spanning systems $ℬ_{1}$ and $ℬ_{2}$ . We then have the following coordinate representations of covariance operators:

ℬ_{X} {[{\hat{Σ}}_{X X}]}_{ℬ_{X}} = n^{- 1} G_{X}, ℬ_{Y} {[{\hat{Σ}}_{Y X}]}_{ℬ_{X}} = n^{- 1} G_{X}, ℬ_{X} {[{\hat{Σ}}_{X Y}]}_{ℬ_{Y}} = n^{- 1} G_{Y}, ℬ_{Y} {[{\hat{Σ}}_{Y Y}]}_{ℬ_{Y}} = n^{- 1} G_{Y},

where $G_{X} = Q K_{X} Q$ and $G_{Y} = Q K_{Y} Q$ . The details are referred to Section 12.4 of [27].

When $A = I_{n}$ , the generalized eigenvalue problem becomes

max {[f]}_{ℬ_{X}}^{T} G_{X} G_{Y} G_{X} {[f]}_{ℬ_{X}} subject to {[f]}_{ℬ_{X}} G_{X}^{2} {[f]}_{ℬ_{X}} = 1 .

Let $v = G_{X} [f]_{ℬ_{X}}$ . To avoid overfitting, we solve this equation for $[f]_{ℬ_{X}}$ via Tychonoff regularization, that is, $[f]_{ℬ_{X}} = {(G_{X} + η_{X} I_{n})}^{- 1} v$ , where $η_{X}$ is a tuning constant. The problem is then transformed into finding eigenvector $v_{1}, \dots, v_{d}$ of the following matrix

Λ_{GSIR}^{(1)} = {(G_{X} + η_{X} I_{n})}^{- 1} G_{X} G_{Y} G_{X} {(G_{X} + η_{X} I_{n})}^{- 1},

and then set ${[f_{j}]}_{ℬ_{X}} = {(G_{X} + η_{X} I_{n})}^{- 1} v_{j}$ , $j \in {1, \dots, d}$ . In practice, we use $η_{X} = ε_{X} λ_{m a x} (G_{X})$ , where $λ_{m a x} (G_{X})$ is the maximum eigenvalue of $G_{X}$ and $ε_{X}$ is a tuning parameter.

For the second choice $A = {\hat{Σ}}_{Y Y}^{- 1}$ , we also use the regularized inverse ${(G_{Y} + η_{Y} I_{n})}^{- 1}$ , leading to the following generalized eigenvalue problem:

max {[f]}_{ℬ_{X}}^{T} G_{X} G_{Y} {(G_{Y} + η_{Y} I_{n})}^{- 1} G_{X} {[f]}_{ℬ_{X}} subject to {[f]}_{ℬ_{X}} G_{X}^{2} {[f]}_{ℬ_{X}} = 1 .

To solve this problem, we first compute the eigenvectors $v_{1}, \dots, v_{d}$ of the matrix

Λ_{GSIR}^{(2)} = {(G_{X} + η_{X} I_{n})}^{- 1} G_{X} G_{Y} {(G_{Y} + η_{Y} I_{n})}^{- 1} G_{X} {(G_{X} + η_{X} I_{n})}^{- 1},

and then set ${[f_{j}]}_{ℬ_{X}} = {(G_{X} + η_{X} I_{n})}^{- 1} v_{j}$ for $j \in {1, \dots, d}$ .

Choice of tuning parameters:

We use the general cross validation criterion [21] to determine the tuning constant $ε_{X}$ :

{GCV}_{X} (ε_{X}) = \frac{{‖ K_{Y} - K_{X} {(K_{X} + ε_{X} λ_{max} (K_{X}) I_{n})}^{- 1} K_{Y} ‖}_{F}^{2}}{{tr [I_{n} - K_{X} {(K_{X} + ε_{X} λ_{max} (K_{X}) I_{n})}^{- 1}]}^{2}} .

The numerator of this criterion is the prediction error, and the denominator is to control the degree of overfitting. Similarly, the GCV criterion for $ε_{Y}$ is defined as

{GCV}_{Y} (ε_{Y}) = \frac{{‖ K_{X} - K_{Y} {(K_{Y} + ε_{Y} λ_{max} (K_{Y}) I_{n})}^{- 1} K_{X} ‖}_{F}^{2}}{{tr [I_{n} - K_{Y} {(K_{Y} + ε_{Y} λ_{max} (K_{Y}) I_{n})}^{- 1}]}^{2}} .

We minimize the criteria over grid $\{10^{- 6}, \dots, 10^{- 1}, 1\}$ to find the optimal tuning constants. We choose the parameters $γ_{X}$ and $γ_{Y}$ in the reproducing kernels $κ_{X}$ and $κ_{Y}$ as the fixed quantities $γ_{X} = 1 / (2 σ_{X}^{2})$ and $γ_{Y} = 1 / (2 σ_{Y}^{2})$ , where $σ_{X}^{2} = {(\begin{matrix} n \\ 2 \end{matrix})}^{- 1} \sum_{i < j} d {(X_{i}, X_{j})}^{2}$ , $σ_{Y}^{2} = {(\begin{array}{l} n \\ 2 \end{array})}^{- 1} \sum_{i < j} d {(Y_{i}, Y_{j})}^{2}$ and metric $d (\cdot, \cdot)$ is $W_{2} (\cdot, \cdot)$ for univariate distributional data and $S W_{2} (\cdot, \cdot)$ for multivariate distributional data.

Order Determination:

To determine the dimension $d$ in (1), we use the BIC type criterion in [28] and [30]. Let $G_{n} (k) = \sum_{i = 1}^{k} {\hat{λ}}_{i} - c_{0} λ_{1} n^{- 1 / 2} l o g (n) k$ , where $λ_{i}$ ’s are the eigenvalues of the matrix $Λ_{GSIR}$ and $c_{0}$ is taken to be 2 when $A = I_{p}$ and 4 when $A = Σ_{Y Y}^{- 1}$ . Then we estimate $d$ by

\hat{d} = arg max {G_{n} (k) : k \in {0, 1, \dots, n}} .

Recently developed order-determination methods, such as the ladle estimator [34], can also be directly used to estimate $d$ .

5. Asymptotic Analysis

In this section, we establish the consistency and convergence rates of W-GSIR and SW-GSIR. We focus on the analysis of Type-I GSIR, where the operator $A$ is chosen as the identity map $I$ . The techniques we use are also applicable to the analysis of Type-II GSIR. To simplify the exposition, we define $Λ = Σ_{X X}^{- 1} Σ_{X Y} Σ_{Y X} Σ_{X X}^{- 1}$ and $\hat{Λ} = {({\hat{Σ}}_{X X} + η_{n} I_{n})}^{- 1} {\hat{Σ}}_{X Y} {\hat{Σ}}_{Y X} {({\hat{Σ}}_{X X} + η_{n} I_{n})}^{- 1}$ .

5.1. Convergence rate for fully observed distribution

If we assume that the data ${(X_{i}, Y_{i})}_{i = 1}^{n}$ are fully observed, we can establish the consistency and convergence rates of W-GSIR and SW-GSIR without fundamental differences from [30]. To make the paper self-contained, we present the results here without proof.

Proposition 4. Suppose $Σ_{X Y} = Σ_{X X}^{β} S_{X Y}$ far some linear operator $S_{X Y} : ℋ_{X} \to ℋ_{Y}$ where $0 < β \leq 1$ . Also, suppose $n^{- 1 / 2} \underline{≺} η_{n} ≺ 0$ . Then

If $S_{X Y}$ is bounded, then ${‖ \hat{Λ} - Λ ‖}_{OP} = 𝒪_{p} (η_{n}^{β} + η_{n}^{- 1} n^{- 1 / 2})$ .
If $S_{X Y}$ is Hilbert-Schmidt, then ${‖ \hat{Λ} - Λ ‖}_{HS} = 𝒪_{p} (η_{n}^{β} + η_{n}^{- 1} n^{- 1 / 2})$ .

The condition $Σ_{X Y} = Σ_{X X}^{β} S_{X Y}$ is a smoothness condition, which implies the range space of $Σ_{X Y}$ be sufficiently focused on the eigenspaces of the large eigenvalues of $Σ_{X X}$ . The parameter $β$ characterizes the degree of ”smoothness” in the relation between $X$ and $Y$ , with a larger $β$ indicating a stronger smoothness relation.

By a perturbation theory result in Lemma 5.2 of Koltchinskii and Giné [24], the eigenspaces of $\hat{Λ}$ converge to those of $Λ$ at the same rate if the nonzero eigenvalues of $Λ$ are distinct. Therefore, as a corollary of Proposition 4, the W-GSIR and SW-GSIR estimators are consistent with the same convergence rates.

5.2. Convergence rate for discretely observed distribution

In practice, additional challenges arise when the distributions are not fully observed. Instead, we observe i.i.d. samples for each $(X_{i}, Y_{i})$ , where $i \in {1, \dots, n}$ , which is called the discretely observed scenario. Suppose we observe $({\{X_{1 j}\}}_{j = 1}^{r_{1}}, {\{Y_{1 k}\}}_{k = 1}^{s_{1}}), \dots, ({\{X_{n j}\}}_{j = 1}^{r_{1}}, {\{Y_{n k}\}}_{k = 1}^{s_{n}})$ , where ${\{X_{i j}\}}_{j = 1}^{r_{i}}$ and ${\{Y_{i k}\}}_{k = 1}^{s_{i}}$ are independent samples from $X_{i}$ and $Y_{i}$ , respectively. Let ${\hat{X}}_{i}$ , ${\hat{Y}}_{i}$ be the empirical measures $r_{i}^{- 1} \sum_{j = 1}^{r_{i}} δ_{X_{i j}} s_{i}^{- 1} \sum_{j = 1}^{s_{i}} δ_{Y_{i j}}$ , where $δ_{a}$ is the Dirac measure at $a$ . Then we estimate $d (X_{i}, X_{k})$ and $d (Y_{i}, Y_{k})$ by $d ({\hat{X}}_{i}, {\hat{X}}_{k})$ and $d ({\hat{Y}}_{i}, {\hat{Y}}_{k})$ , respectively. For the convenience of analysis, we assume the sample sizes are the same, that is, $r_{1} = \dots = r_{n} = s_{1} = \dots = s_{n} = m$ . It is important to note that there are two layers of randomness in this situation: the first generates independent samples of distributions $(X_{i}, Y_{i})$ for $i \in {1, \dots, n}$ , and the second generates independent samples given each pair of distributions $(X_{i}, Y_{i})$ .

To guarantee the consistency of W-GSIR or SW-GSIR, we need to quantify the discrepancy between the estimated and true distributions by the following assumption.

Assumption 2. For $i \in {1, \dots, n}$ , $E [d ({\hat{X}}_{i}, X_{i})] = 𝒪 (δ_{m})$ and $E [d ({\hat{Y}}_{i}, Y_{i})] = 𝒪 (δ_{m})$ , where $δ_{m} \to 0$ as $m \to \infty$ .

Let $μ$ be $X_{i}$ or $Y_{i}$ for $i = 1, \dots, n$ and $\hat{μ}$ be the empirical measure of $μ$ based on $m$ i.i.d samples. The convergence rate of empirical measures in Wasserstein distance on Euclidean spaces has been studied in several works, including [7, 11, 17, 26, 51]. When $M \subset ℝ$ is compact, Fournier and Guillin [17] showed that $E [W_{2} (\hat{μ}, μ)] ≲ m^{- 1 / 4}$ . However, when $M$ is unbounded, such as $M = ℝ$ , we need concentration assumptions or moment assumptions on the measure $μ$ to establish the convergence rate. Let $m_{q} (μ) : = \int_{M} | x |^{q} d μ$ be the $q$ -th moment of $μ$ . If $m_{q} (μ) < \infty$ for some $q > 2$ , the result of [17] implies that $E [W_{2} (\hat{μ}, μ)] = 𝒪 (m^{- 1 / 4} + m^{- (q - 2) / (2 q)})$ . If $q > 4$ , then the term $m^{- (q - 2) / (2 q)}$ is dominated by $m^{- 1 / 4}$ and can be removed. If $μ$ is a log-concave measure, then Bobkov and Ledoux [6] showed a sharper rate that $E [W_{2} (\hat{μ}, μ)] ≲ \sqrt{log m / m}$ .

The convergence rate of empirical measures in sliced Wasserstein distance has been investigated by Lin et al. [33], Niles-Weed and Rigollet [39], and Nietert et al. [38]. Lin et al. [33]. When $M$ is compact, the result of Lin et al. [33] indicates that $E [{S W}_{2} (\hat{μ}, μ)] ≲ m^{- 1 / 4}$ . When $M = ℝ^{r}$ and $m_{q} (μ) < \infty$ for some $q > 2$ , Lin et al. [33] established the rate $E [{S W}_{2} (\hat{μ}, μ)] = 𝒪 (m^{- 1 / 4} + m^{- (q - 2) / (2 q)})$ . A sharper rate is shown in Nietert et al. [38] under the log-concave assumption on $μ$ .

To ensure notation consistency, we define

{\hat{Σ}}_{X Y} = E_{n} [κ (\cdot, \hat{X}) \otimes κ (\cdot, \hat{Y})] - E_{n} [κ (\cdot, \hat{X})] \otimes E_{n} [κ (\cdot, \hat{Y})], {\tilde{Σ}}_{X Y} = E_{n} [κ (\cdot, X) \otimes κ (\cdot, Y) - E_{n} [κ (\cdot, X)] \otimes E_{n} [κ (\cdot, Y)] .

We note that ${\hat{X}}_{1}, \dots, {\hat{X}}_{n}$ are independent but not necessarily identically distributed. Despite this, we still write the sample average as $E_{n} (\cdot)$ . Similarly, we define ${\hat{Σ}}_{X X}$ and ${\hat{Σ}}_{Y Y}$ as the sample covariance operators based on the estimated distribution ${\hat{X}}_{n}, \dots, {\hat{X}}_{n}$ and ${\hat{Y}}_{1}, \dots, {\hat{Y}}_{n}$ . Under Assumption 2, we have the following lemma showing the convergence rates of covariance operators.

Lemma 2. Under Assumption 2, if the kernel $κ (z, z^{'})$ is Lipschitz continuous, that is, ${sup}_{z^{'}} | κ (z_{1}, z^{'}) - κ (z_{2}, z^{'}) | < C d (z_{1}, z_{2})$ , for some $C > 0$ , then $Σ_{X X}$ , $Σ_{Y Y}$ and $Σ_{Y X}$ are Hilbert-Schmidt operators, and we have ${‖ {\hat{Σ}}_{X X} - Σ_{X X} ‖}_{H S} = 𝒪_{p} (δ_{m} + n^{- 1 / 2})$ , ${‖ {\hat{Σ}}_{Y Y} - Σ_{Y Y} ‖}_{H S} = 𝒪_{p} (δ_{m} + n^{- 1 / 2})$ , and ${‖ {\hat{Σ}}_{X Y} - Σ_{X Y} ‖}_{H S} = 𝒪_{p} (δ_{m} + n^{- 1 / 2})$ .

Based on Lemma 2, we establish the convergence rate of W-GSIR in the following theorem.

Theorem 1. Suppose $Σ_{X Y} = Σ_{X X}^{1 + β} S_{X Y}$ for some linear operator $S_{X Y} : ℋ_{X} \to ℋ_{Y}$ , where $0 < β \leq 1$ . Suppose $δ_{m} + n^{- 1 / 2} \underline{≺} η_{n} ≺ 0$ , then

If $S_{X Y}$ is bounded, then ${‖ \hat{Λ} - Λ ‖}_{OP} = 𝒪_{p} (η_{n}^{β} + η_{n}^{- 1} (δ_{m} + n^{- 1 / 2}))$ .
If $S_{X Y}$ is Hilbert-Schmidt, then ${‖ \hat{Λ} - Λ ‖}_{HS} = 𝒪_{p} (η_{n}^{β} + η_{n}^{- 1} (δ_{m} + n^{- 1 / 2}))$ .

The proof is provided in Section 9. The same convergence rate can be established for SW-GSIR.

6. Simulation

In this section, we evaluate the numerical performances of W-GSIR and SW-GSIR. We consider two scenarios: univariate distribution on univariate distribution regression and multivariate distribution on multivariate distribution regression. In Section 6.4, we compare the performance of W-GSIR and SW-GSIR with the result using functional-GSIR [30]. The code to reproduce the simulation results can be found at https://github.com/bideliunian/SDR4D2DReg.

6.1. Computational details

We use the Gaussian RBF kernel to generate the RKHS. We consider the discretely observed situation described in Section 5.2. Specifically, let ${\hat{X}}_{i} = m^{- 1} \sum_{j = 1}^{m} δ_{X_{i j}}$ be the empirical distributions for $i \in {1, \dots, n}$ . When $X$ is univariate distributions, for $i$ , $k \in {1, \dots, n}$ , we estimate $W_{2} (X_{i}, X_{k})$ and $W_{2} (Y_{i}, Y_{k})$ by

W_{2} ({\hat{X}}_{i}, {\hat{X}}_{k}) = {(\frac{1}{m} \sum_{j = 1}^{m} {(X_{i (j)} - X_{k (j)})}^{2})}^{1 / 2}, W_{2} ({\hat{Y}}_{i}, {\hat{Y}}_{k}) = {(\frac{1}{m} \sum_{j = 1}^{m} {(Y_{i (j)} - Y_{k (j)})}^{2})}^{1 / 2},

respectively, where $X_{i (j)}$ are the $j$ -th order statistics of ${\{X_{i j}\}}_{j = 1}^{m}$ .

When $X$ is multivariate distribution supported on $M \subseteq ℝ^{r}$ , we estimate the sliced Wasserstein distance using a standard Monte Carlo method, that is,

S W_{p} ({\hat{X}}_{i}, {\hat{X}}_{k}) \approx {(\frac{1}{L} \sum_{l = 1}^{L} W_{2}^{2} ({\hat{X}}_{i} \circ T_{θ_{l}}^{- 1}, {\hat{X}}_{k} \circ T_{θ_{l}}^{- 1}))}^{1 / 2} = {[\frac{1}{L} \sum_{l = 1}^{L} W_{2}^{2} (\frac{1}{m} \sum_{j = 1}^{m} δ_{〈 θ_{l}, X_{i j} 〉}, \frac{1}{m} \sum_{j = 1}^{m} δ_{〈 θ_{l}, X_{k j} 〉})]}^{1 / 2},

where ${\{θ_{l}\}}_{l = 1}^{L}$ are i.i.d. samples drawn from the uniform distribution on $𝕊^{r - 1}$ . The number of samples $L$ controls the approximation error: a larger $L$ gives a more accurate approximation but increases the computation cost. In our simulation settings, we set $L = 50$ .

We consider two measures to evaluate the difference between estimated and true predictors. The first one is the RV Coefficient of Multivariate Rank (RVMR) defined below, which is a generalization of Spearman’s correlation in the multivariate case. For two samples of random vectors $U_{1}, \dots, U_{n} \in ℝ^{r}$ and $V_{1}, \dots, V_{n} \in ℝ^{s}$ , let ${\tilde{U}}_{i}$ , ${\tilde{V}}_{i}$ be their multivariate ranks, that is,

{\tilde{U}}_{i} = \frac{1}{n} \sum_{ℓ = 1}^{n} \frac{U_{ℓ} - U_{i}}{‖ U_{ℓ} - U_{i} ‖}, {\tilde{V}}_{i} = \frac{1}{n} \sum_{ℓ = 1}^{n} \frac{V_{ℓ} - V_{i}}{‖ V_{ℓ} - V_{i} ‖} .

Then the RVMR between $\{U_{1}, \dots, U_{n}\}$ and $\{V_{1}, \dots, V_{n}\}$ is defined as the RV coefficient between $\{{\tilde{U}}_{1}, \dots, {\tilde{U}}_{n}\}$ and $\{{\tilde{V}}_{1}, \dots, {\tilde{V}}_{n}\}$ :

{RVMR}_{n} (U, V) = \frac{tr ({cov}_{n} (\tilde{U}, \tilde{V}) {cov}_{n} (\tilde{V}, \tilde{U}))}{\sqrt{tr ({var}_{n} {(\tilde{U})}^{2}) tr ({var}_{n} {(\tilde{V})}^{2})}} .

The second one is the distance correlation [48], a well-known measure of dependence between two random vectors of arbitrary dimension.

6.2. Univariate distribution-on-distribution regression

We generate normal distribution $Y$ with mean and variance parameters being random variables dependent on $X$ , that is,

Y = N (μ_{Y}, σ_{Y}^{2}),

(5)

where $μ_{Y}$ and $σ_{Y} > 0$ are random variables generated according to the following models:

Model I-1 : $μ_{Y} | X \sim N (exp (W_{2}^{2} (X, μ_{1})) + exp (W_{2}^{2} (X, μ_{2})), {0.2}^{2})$ ; $σ_{Y} = 1$ ;

Model I-2 : $μ_{Y} | X \sim N (exp (W_{2}^{2} (X, μ_{1})), {0.2}^{2})$ ; $σ_{Y} = Gamma (W_{2}^{2} (X, μ_{2}), W_{2} (X, μ_{2}))$ ;

Model I-3 : $μ_{Y} | X \sim N (exp (H (X, μ_{1})), {0.2}^{2})$ ; $σ_{Y} = exp (H (X, μ_{2}))$ ;

Model I-4 : $μ_{Y} | X \sim N (E (X), {0.2}^{2})$ ; $σ_{Y} = Gamma (Var (X), \sqrt{Var (X)})$ .

We let $μ_{1} = B e t a (2,1)$ and $μ_{2} = B e t a (2,3)$ and generate discrete observations from distributional predictors by ${\{X_{i j}\}}_{j = 1}^{m} \overset{i i d}{\sim} B e t a (a_{i}, b_{i})$ where $a_{i} \overset{i i d}{\sim} G a m m a (2, r a t e = 1)$ and $b_{i} \overset{iid}{\sim} G a m m a (2, r a t e = 3)$ . We note that the Hellinger distance between two Beta distributions $μ = B e t a (a_{1}, b_{1})$ and $v = B e t a (a_{2}, b_{2})$ can be represented explicitly as

H (μ, v) = 1 - \int \sqrt{f_{μ} (t) f_{v} (t)} d t = 1 - \frac{B ((a_{1} + a_{2}) / 2, (b_{1} + b_{2}) / 2)}{\sqrt{B (a_{1}, b_{1}) B (a_{2}, b_{2})}},

where $B (α, β)$ is the Beta function.

We compute the distances $W_{2} (X, μ_{1})$ and $W_{2} (X, μ_{2})$ by the $L_{2}$ -distance between the quantile functions. We set $n \in {100, 200}, m \in {50, 100}$ and generate $2 n$ samples ${({\{X_{i j}\}}_{j = 1}^{m}, {\{Y_{i j}\}}_{j = 1}^{m})}_{i = 1}^{2 n}$ . We use half of them to train the nonlinear sufficient predictors via W-GSIR, and then evaluate the RVMR and distance correlation between the estimated and true predictors using the rest of the data set. The tuning parameters and the dimensions are determined by the methods described in Section 4.2. The experiment is repeated 100 times, and averages and standard errors (in parentheses) of the RVMR and Dcor are summarized in Table 1. The following are the identified true predictors for each model: Model I-1 uses $W_{2} (X, μ_{1})$ , Model I-2 uses $(W_{2} (X, μ_{1}), W_{2} (X, μ_{2}))$ , Model I-3 uses $(H (X, μ_{1}), H (X, μ_{2}))$ , and Model I-4 uses $E (X)$ and $v a r (X)$ .

Table 1:

RVMR and Distance Correlation between the estimated predictors and the true predictors of models in Section 6.2, with their Monte Carlo standard errors in parentheses.

Models	$n ∖ m$	W-GSIR1		W-GSIR2

		50	100	50	100
				RVMR

I-1	100	0.791 (0.128)	0.839 (0.115)	0.776 (0.124)	0.812 (0.159)
I-1	200	0.832 (0.091)	0.864 (0.087)	0.808 (0.114)	0.842 (0.129)
I-2	100	0.597 (0.187)	0.607 (0.206)	0.555 (0.236)	0.548 (0.235)
I-2	200	0.694 (0.141)	0.681 (0.172)	0.709 (0.177)	0.688 (0.190)
I-3	100	0.846 (0.037)	0.880 (0.037)	0.836 (0.045)	0.859 (0.049)
I-3	200	0.864 (0.021)	0.896 (0.025)	0.797 (0.088)	0.696 (0.046)
I-4	100	0.558 (0.242)	0.652 (0.253)	0.729 (0.196)	0.790 (0.215)
	200	0.643 (0.221)	0.732 (0.183)	0.767 (0.169)	0.847 (0.145)
				Dcor

I-1	100	0.958 (0.024)	0.969 (0.022)	0.952 (0.029)	0.964 (0.034)
I-1	200	0.967 (0.011)	0.974 (0.013)	0.963 (0.017)	0.970 (0.020)
I-2	100	0.932 (0.037)	0.935 (0.041)	0.896 (0.071)	0.898 (0.066)
I-2	200	0.952 (0.026)	0.948 (0.032)	0.934 (0.054)	0.932 (0.048)
I-3	100	0.971 (0.008)	0.978 (0.005)	0.968 (0.010)	0.974 (0.007)
I-3	200	0.974 (0.004)	0.980 (0.004)	0.970 (0.007)	0.971 (0.008)
I-4	100	0.921 (0.042)	0.936 (0.042)	0.937 (0.036)	0.947 (0.038)
I-4	200	0.937 (0.037)	0.950 (0.027)	0.951 (0.023)	0.962 (0.025)

Open in a new tab

Fig 1 (a) displays a scatter plot of the true predictor versus the first estimated sufficient predictor for Model I-1. Fig 1 (b) and (c) show the scatter plots of the first two sufficient predictors for Model I-2, with the color indicating the values of the true predictor. These figures demonstrate the method’s ability to capture nonlinear patterns among predictor random elements.

6.3. Multivariate distribution-on-distribution regression

We now consider the scenario where both $X$ and $Y$ are two-dimensional random Gaussian distributions. We generate $Y = N (μ_{Y}, Σ_{Y})$ , where $μ_{Y} \in ℝ^{2}$ and $Σ_{Y} \in ℝ^{2 \times 2}$ are randomly generated according to the following models:

II-1: $μ_{Y} | X = N (W_{2} (X, μ_{1}) {(1, 1)}^{⊤}, I_{2})$ , $Σ_{Y} = diag (1, 1)$ .

II-2: $μ_{Y} | X = \sqrt{(W_{2} (X, μ_{1})} {(1, 1)}^{⊤}$ and $Σ_{Y} = {ΓΛΓ}^{⊤}$ , where $Γ = \frac{\sqrt{2}}{2} (\begin{matrix} 1 & 1 \\ - 1 & 1 \end{matrix})$ , $Λ = diag (| λ_{1} |, | λ_{2} |)$ , and $(λ_{1}, λ_{2}) | X \sim N (W_{2} (X, μ_{2}) {(1, 1)}^{⊤}, 0.25 I_{2})$ .

II-3: $μ_{Y} | X = N (W_{2} (X, μ_{1}) {(1, 1)}^{⊤}, I_{2})$ and $Σ_{Y} = {ΓΛΓ}^{⊤}$ , where $Γ = \frac{\sqrt{2}}{2} (\begin{matrix} 1 & 1 \\ - 1 & 1 \end{matrix})$ , $Λ = diag (λ_{1}, λ_{2})$ , and $λ_{1}, λ_{2} | X \overset{i . i . d}{\sim} tGamma (W_{2}^{2} (X, μ_{2}), W_{2} (X, μ_{2}), (0.2, 2))$ .

II-4: $μ_{Y} | X = N (H_{2}^{2} (X, μ_{1}) {(1, 1)}^{⊤}, I_{2})$ and $Σ_{Y} = {ΓΛΓ}^{⊤}$ , where $Γ = \frac{\sqrt{2}}{2} (\begin{matrix} 1 & 1 \\ 1 & - 1 \end{matrix})$ , $Λ = diag (λ_{1}, λ_{2})$ , and $(λ_{1}, λ_{2}) | X \sim tGamma (H^{2} (X, μ_{2}), H (X, μ_{2}), (0.2, 2))$ .

where $μ_{1}$ and $μ_{2}$ are two fixed measures defined by

μ_{1} = N ({(- 1, 0)}^{⊤}, diag (1, 0.5)) μ_{2} = N ({(0, 1)}^{⊤}, diag (0.5, 1)),

and $t G a m m a (α, β, (r_{1}, r_{2}))$ is the truncated gamma distribution on range $(r_{1}, r_{2})$ with shape parameter $α$ and rate parameter $β$ . We generate discrete observations of $X_{i}$ , $i \in {1 \dots, n}$ by ${\{X_{i j}\}}_{j = 1}^{m} \overset{i i d}{\sim} N (a_{i} (1,1)^{⊤}, b_{i} I_{2})$ where $a_{i} \overset{i i d}{\sim} N (0.5, {0.5}^{2})$ and $b_{i} \overset{i i d}{\sim} B e t a (2, 3)$ . When computing $W_{2} (X, μ_{1})$ and $W_{2} (X, μ_{2})$ , we use the following explicit representations of the Wasserstein distance between two Gaussian distributions:

W_{2}^{2} (N (m_{1}, Σ_{1}), N (m_{2}, Σ_{2})) = {‖ m_{1} - m_{2} ‖}^{2} + {trΣ}_{1} + {trΣ}_{2} - 2 tr \sqrt{Σ_{2}^{1 / 2} Σ_{1} Σ_{2}^{1 / 2}} .

The following are the identified true predictors for each model: Model II-1 uses $W_{2} (X, μ_{1})$ , Models II-2 and II-3 uses $(W_{2} (X, μ_{1}), W_{2} (X, μ_{2}))$ , Model II-4 uses $(H (X, μ_{1}), H (X, μ_{2}))$ .

Using the true dimensions and the same choices for $n$ , $m$ , and the tuning parameters, we repeat the experiment 100 times and summarize the average and standard errors of RVMR and distance correlation between the estimated and true predictors in Table 2. In Fig. 2, we plot the 2-dimensional response densities associated with the 10%, 30%, 50%, 70%, and 90% quantiles of estimated predictor (first row) the true predictor (second row) for Model II-2. Comparing the plots, we can see that the two-dimensional response distributions show a similar variation pattern, which indicates the method successfully captured the nonlinear predictor in the responses. We also see that the first estimated sufficient predictor captures both the location and scale of the response distribution. With the increase of the estimated sufficient predictor, the location of the response distribution moves slightly rightward and upward, while the variance of the response distribution decreases at first and then increases.

Table 2:

RVMR and Distance Correlation between the estimated predictors and the true predictors of models in Section 6.3, with their Monte Carlo standard errors in parentheses.

Models	$n ∖ m$	SWGSIR1		SWGSIR2

		50	100	50	100
				RVMR

II-1	100	0.948 (0.063)	0.957 (0.049)	0.915 (0.130)	0.910 (0.150)
II-1	200	0.958 (0.041)	0.970 (0.022)	0.921 (0.087)	0.934 (0.084)
II-2	100	0.784 (0.036)	0.791 (0.033)	0.820 (0.038)	0.822 (0.036)
II-2	200	0.783 (0.023)	0.791 (0.023)	0.834 (0.033)	0.824 (0.034)
II-3	100	0.744 (0.061)	0.755 (0.059)	0.806 (0.067)	0.812 (0.065)
II-3	200	0.747 (0.040)	0.753 (0.043)	0.835 (0.069)	0.841 (0.059)
II-4	100	0.499 (0.166)	0.500 (0.144)	0.570 (0.170)	0.567 (0.155)
II-4	200	0.512 (0.156)	0.477 (0.152)	0.532 (0.157)	0.501 (0.159)
				Dcor

II-1	100	0.962 (0.024)	0.963 (0.025)	0.977 (0.018)	0.977 (0.021)
II-1	200	0.963 (0.017)	0.964 (0.018)	0.973 (0.023)	0.970 (0.025)
II-2	100	0.967 (0.013)	0.967 (0.013)	0.973 (0.010)	0.975 (0.010)
II-2	200	0.965 (0.011)	0.966 (0.011)	0.975 (0.008)	0.975 (0.010)
II-3	100	0.980 (0.009)	0.981 (0.008)	0.983 (0.007)	0.984 (0.006)
II-3	200	0.979 (0.007)	0.979 (0.009)	0.982 (0.007)	0.983 (0.008)
II-4	100	0.889 (0.031)	0.886 (0.036)	0.886 (0.033)	0.892 (0.030)
II-4	200	0.893 (0.033)	0.886 (0.033)	0.887 (0.034)	0.889 (0.037)

Open in a new tab

Fig. 2: — Densities associated with the 10%, 30%, 50%, 70%, and 90% quantiles (left to right) of estimated predictor (first row) the true predictor (second row) for Model II-2 .

6.4. Comparison with functional-GSIR

Next, we compare the performance of W-GSIR with two methods using the GSIR framework but replacing Wasserstein distance by L₁ or L₂ distances. We call them L₁-GSIR and L₂-GSIR, respectively. Note that L₂-GSIR is the same as functional-GSIR (f-GSIR) proposed in Li and Song [30]. Theoretically, L₂-GSIR is an inadequate estimate since an L₂ function need not be a density and vice versa. Nevertheless, we still naively implement L₂-GSIR, treating density curves as L₂ functions. To make a fair comparison, we first use the Gaussian kernel smoother to estimate the densities based on discrete observations and then evaluate the $L_{r}$ distances by numerical integration. For $L_{r}$ -GSIR $(r = 1, 2)$ , we take the Gaussian type kernel $κ (z, z^{'}) = exp (- γ {‖ z - z^{'} ‖}_{L_{r}}^{2})$ , with the same choice of tuning parameters $γ$ as described in Subsection 4.2. We use $n = 100$ , $m = 100$ , and repeat the experiment 100 times with $A = I_{n}$ . The results are summarized in Table 3. We see that W-GSIR provides more accurate estimation than both L₁-GSIR and L₂-GSIR.

Table 3:

RVMR and Distance Correlation between the estimated predictors and the true predictors of models in Section 6.2, with their Monte Carlo standard errors in parentheses, computed using $L_{1}$ -GSIR, $L_{2}$ -GSIR, and W-GSIR.

Models	$L_{1}$ -GSIR1	$L_{2}$ -GSIR1	W-GSIR1

		RVMR

I-1	0.258 (0.233)	0.356 (0.276)	0.839 (0.115)
I-2	0.322 (0.236)	0.433 (0.244)	0.607 (0.206)
I-3	0.307 (0.242)	0.359 (0.205)	0.880 (0.037)
I-4	0.313 (0.252)	0.441 (0.278)	0.652 (0.253)
		Dcor

I-1	0.773 (0.129)	0.731 (0.171)	0.969 (0.022)
I-2	0.778 (0.173)	0.690 (0.203)	0.935 (0.041)
I-3	0.779 (0.169)	0.688 (0.196)	0.978 (0.005)
I-4	0.779 (0.129)	0.740 (0.176)	0.936 (0.042)

Open in a new tab

7. Applications

7.1. Application to human mortality data

In this application, we explore the relationship between the distribution of age at death and the distribution of the mother’s age at birth. We obtained our data from the UN World Population Prospects 2019 Databases (https://population.un.org), specifically focusing on the years 2015-2020. For each country, we compiled the number of deaths every five years from ages 0-100 and the number of births categorized by mother’s age every five years from ages 15-50. We represented this data as histograms with bin widths equal to 5 years. To obtain smooth probability density functions for each country, we used the R package ‘frechet’ to perform smoothing. We then calculated the relative Wasserstein distance between the predictor and response densities. The predictor and response densities are visualized in Fig. 3.

Fig. 3: — Density of (a) age at death and (b) mother’s age at birth for 194 countries, obtained using data from the UN World Population Prospects 2019 Databases (https://population.un.org)

We apply the proposed W-GSIR algorithm to the fertility and mortality data. The dimension $d$ of the central class is determined as one by the BIC-type procedure described in Subsection 4.2. We plot the age-at-death distributions versus the nonlinear sufficient predictors obtained by W-GSIR2 in Fig. 4. In Fig. 5, we present the summary statistics of the age-at-death distributions plotted against the sufficient predictors.

Fig. 4: — Densities of age at death for 194 countries in random order in (a) and (c), and versus the first nonlinear sufficient predictors obtained by W-GSIR2 in (b) and (d).

Fig. 5: — Summary statistics (mean, mode, standard deviation, and skewness) of the mortality distributions for 194 countries versus nonlinear sufficient predictors obtained by W-GSIR2.

Upon examining these plots, we obtained the following insights. The first nonlinear sufficient predictor effectively captures the location and variation information of the mortality distributions. Specifically, as the first sufficient predictor increases, the means of the mortality distributions decrease while the standard deviations increase. This suggests that the population’s death age tends to concentrate between 70 and 80 for large sufficient predictor values. Additionally, for densities with small sufficient predictors, there is an uptick at the ends of the 0-age side, which indicates higher infant mortality rates among the countries with such densities.

7.2. Application to Calgary temperature data

In this application, we are interested in the relationship between the extreme daily temperatures in spring (Mar, Apr, and May) and summer (Jun, Jul, and Aug) in Calgary, Alberta. We obtained the dataset from https://calgary.weatherstats.ca/, which contains the minimum and maximum temperatures for each day from 1884 to 2020. These data were previously analyzed in Fan and Müller [14]. We focused on the joint distribution of the minimum daily temperature and the difference between the maximum and minimum daily temperatures, which ensures that the distributions have common support. Each pair of daily values was treated as one observation from a two-dimensional distribution, resulting in one realization of the joint distribution for spring and one for summer each year. We then employed the spring extreme temperature distribution to predict the summer extreme temperature distribution. The dataset had $n = 136$ observations, with $m = 92$ discrete values for each joint distribution. We utilized the SW-GSIR method on the data, taking 50 random projections with $ρ_{X} = ρ_{Y} = 1$ . The sufficient dimension was determined as 2 using a BIC-type procedure. We illustrated the response summer extreme temperature distributions associated with the five percentiles of the first estimated sufficient predictors in Fig. 6. It is observed from Fig. 6 that as the estimated sufficient predictor value increases, the minimum daily temperature for summer rises slightly while the daily temperature range decreases.

Fig. 6: — Joint distribution of temperature range and minimum temperature in summer associated with the 10%, 30%, 50%, 70%, and 90% quantiles (from left to right) of SWGSIR2 predictor.

8. Discussion

The paper introduces a framework of nonlinear sufficient dimension reduction for distribution-on-distribution regression. The key strength of this paper is its ability to handle distributional data without linear structure. After explicitly building the universal kernels on the space of distributions, the proposed SDR method effectively reduces the complexity of the distributional predictors while retaining essential information from them.

Several related open problems persist in this topic. Firstly, a more systematic approach to selecting the kernel becomes essential, particularly when multiple universal kernels are available. Secondly, while the paper offers an adaptive method for choosing the bandwidth of the universal kernel in Section 4, the theoretical analysis of this bandwidth selection remains a potential area for future research. Additionally, more appropriate methods for determining the order in nonlinear SDR need to be developed, alongside establishing corresponding consistency results.

9. Technical Proofs

The section contains essential proof details to make the paper self-contained.

Geometry of Wasserstein space

We present some basic results that characterize $𝒲_{2} (M)$ when $M \subseteq ℝ$ (i.e., the distributions involved are univariate). Their proofs can be found, for example, in [1] and [5]. In this case, $𝒲_{2} (M)$ is a metric space with a formal Riemannian structure [1]. Let $μ_{0} \in 𝒲_{2} (M)$ be a reference measure with a continuous $F_{μ_{0}}$ . The tangent space at $μ_{0}$ is

T_{μ_{0}} = {cl}_{L_{2} (μ_{0})} {λ (F_{μ}^{- 1} \circ F_{μ_{0}} - id) : μ \in 𝒲_{2} (M), λ > 0},

where, for a set $A \subseteq L_{2} (μ_{0})$ , ${c l}_{L_{2} (μ_{0})} (A)$ denotes the $L_{2} (μ_{0})$ -closure of $A$ , and id is the identity map. The exponential map ${e x p}_{μ_{0}}$ from $T_{μ_{0}}$ to $𝒲_{2} (M)$ is defined by ${e x p}_{μ_{0}} (r) = μ_{0^{\circ}} (r + i d)^{- 1}$ , where the right-hand side is the measure on $𝒲_{2} (M)$ induced by the mapping mapping $r + i d$ . The logarithmic map ${l o g}_{μ_{0}}$ from $𝒲_{2} (M)$ to $T_{μ_{0}}$ is defined by ${l o g}_{μ_{0}} (μ) = F_{μ}^{- 1} \circ F_{μ_{0}} - i d$ . It is known that the exponential map restricted to the image of log map, denoted as ${e x p}_{μ_{0}} | {l o g}_{μ_{0} (μ) (𝒲_{2} (M))}$ , is an isometric homeomorphism with inverse ${l o g}_{μ_{0}}$ [5]. Therefore, ${l o g}_{μ_{0}}$ is a continuous injection from $𝒲_{2} (M)$ to $L_{2} (μ_{0})$ . This embedding guarantees that we can replace the Euclidean distance by the $𝒲_{2} (M)$ -metric in a radial basis kernel to construct a positive definite kernel.

Proof of Proposition 2: Recall that $Γ (μ_{1}, μ_{2})$ is the space of joint probability measures on $M \times M$ with marginals $μ_{1}$ and $μ_{2}$ . Let $T_{θ} \times T_{θ}$ be the mapping from $M \times M$ to $ℝ \times ℝ$ defined by $(T_{θ} \times T_{θ}) (x, y) = (T_{θ} (x), T_{θ} (y))$ . We first show that, if $γ \in Γ (μ_{1}, μ_{2})$ , then $γ \circ {(T_{θ} \times T_{θ})}^{- 1} \in Γ (μ_{1} \circ T_{θ}^{- 1}, μ_{2} \circ T_{θ}^{- 1})$ . This is true because, for any Borel set $A \subseteq ℝ$ , we have

[γ \circ {(T_{θ} \times T_{θ})}^{- 1}] (A \times T_{θ} (M)) = γ ({(T_{θ} \times T_{θ})}^{- 1} (A \times T_{θ} (M)) = γ (T_{θ}^{- 1} (A) \times M) = μ_{1} (T_{θ}^{- 1} A) = μ_{1} \circ T_{θ} (A),

and similarly $[γ \circ {(T_{θ} \times T_{θ})}^{- 1}] (T_{θ} (M) \times A) = μ_{2} \circ T_{θ} (A)$ . Hence, for any $γ \in Γ (μ_{1}, μ_{2})$ , we have

W_{p}^{p} (μ_{1} \circ T_{θ}^{- 1}, μ_{2} \circ T_{θ}^{- 1}) \leq \int_{T_{θ} (M) \times T_{θ} (M)} | u - v |^{p} d γ \circ {(T_{θ} \times T_{θ})}^{- 1} (u, v) = \int_{M \times M} {| T_{θ} (x) - T_{θ} (y) |}^{p} d γ (x, y) \leq \int_{M \times M} ∥ x - y ∥_{2}^{p} d γ (x, y),

where the last inequality is from the Cauchy-Schwartz inequality. Therefore,

W_{p}^{p} (μ_{1} \circ T_{θ}^{- 1}, μ_{2} \circ T_{θ}^{- 1}) \leq inf_{γ \in Γ (μ_{1}, μ_{2})} \int_{M \times M} ∥ x - y ∥^{p} d γ (x, y) = W_{p}^{p} (μ_{1}, μ_{2}) .

Integrate the left-hand side with respect to $θ$ and obtain ${S W}_{p} (μ_{1}, μ_{2}) \leq W_{p} (μ_{1}, μ_{2})$ . Therefore, the ${S W}_{p}$ distance is a weaker metric than $W_{p}$ distance, which implies every open set in $𝒯 𝒲_{p} (M)$ is open in $𝒲_{p} (M)$ . In other words, $𝒯 𝒲_{p} (M)$ has a coarser topology than $𝒲_{p} (M)$ . Since $M \subseteq ℝ^{r}$ is separable, so is $𝒲_{p} (M)$ [1, Remark 7.1.7]. Therefore, a countable dense subset of $𝒲_{p} (M)$ is also a countable dense subset of $𝒯 𝒲_{p} (M)$ , implying $𝒯 𝒲_{p} (M)$ is separable. Furthermore, if $M$ is a compact set in $ℝ^{r}$ , then $𝒲_{p} (M)$ is compact [1, Proposition 7.1.5], implying $𝒯 𝒲_{p} (M)$ is compact. This completes the proof of Proposition 2.

Proof of Lemma 1: By Theorem 3.2.2 of [3], the kernel $exp (- γ {SW}_{2}^{2} (x, x^{'}))$ is positive definite for all $γ > 0$ if and only if ${S W}_{2}^{2} (\cdot, \cdot)$ is conditionally negative definite. That is, for any $c_{1}, \dots, c_{m} \in ℝ$ with $\sum_{i = 1}^{m} c_{i} = 0$ , and $x_{1}, \dots, x_{m} \in Ω_{X}$ , $\sum_{i = 1}^{m} \sum_{j = 1}^{m} c_{i} c_{j} {S W}_{2}^{2} (x_{i}, x_{j}) \leq 0$ . Kolouri et al. [23, Theorem 5] showed the conditional negativity of the sliced Wasserstein distance, which is implied by the negative type of the Wasserstein distance. By [44, 45], a metric is of negative type is equivalent to the statement that there is a Hilbert space $ℋ$ and a map $ϕ : 𝒯 𝒲_{2} (M) \to ℋ$ such that $\forall x$ , $x^{'} \in 𝒯 𝒲_{2} (M)$ , ${SW}_{2}^{2} (x, x^{'}) = {‖ ϕ (x) - ϕ (x^{'}) ‖}^{2}$ . By Proposition 2, $𝒯 𝒲_{2} (M)$ is a complete and separable space. Then by the construction of the Hilbert space, $ℋ$ is complete and separable. Therefore, there exists a continuous mapping from metric space $𝒯 𝒲_{2} (M)$ to a complete and separable Hilbert space $ℋ$ . Then by Zhang et al. [54, Theorem 1], the Gaussian type kernel $e x p (- γ {S W}_{2}^{2} (x, x^{'}))$ is universal. Hence, $ℋ_{X}$ and $ℋ_{Y}$ are dense in $L_{2} (P_{X})$ and $L_{2} (P_{Y})$ , respectively. Same proof applies to the Laplacian-type kernel $e x p (- γ {S W}_{2} (x, x^{'}))$ . This completes the proof of Lemma 1. □

Proof of Lemma 2. We will only show the details of the proof for the convergence rate of ${‖ {\hat{Σ}}_{X Y} - Σ_{X Y} ‖}_{HS}$ . By the triangular inequality,

{‖ {\hat{Σ}}_{X Y} - Σ_{X Y} ‖}_{HS} \leq {‖ {\hat{Σ}}_{X Y} - {\tilde{Σ}}_{X Y} ‖}_{HS} + {‖ {\tilde{Σ}}_{X Y} - Σ_{X Y} ‖}_{HS},

where

{\tilde{Σ}}_{X Y} = E_{n} [κ (\cdot, X) \otimes κ (\cdot, Y) - E_{n} [κ (\cdot, X)] \otimes E_{n} [κ (\cdot, Y)] .

By Lemma 5 of [18], under the assumption that $E [κ (X, X)] < \infty$ and $E [κ (Y, Y)] < \infty$ , we have

E {‖ {\tilde{Σ}}_{X Y} - Σ_{X Y} ‖}_{HS} = 𝒪 (n^{- 1 / 2}) .

(S.1)

Now, we derive a convergence rate for ${‖ {\hat{Σ}}_{X Y} - {\tilde{Σ}}_{X Y} ‖}_{HS}$ . For simplicity, let ${\hat{F}}_{i} = κ (\cdot, {\hat{X}}_{i})$ , ${\tilde{F}}_{i} = κ (\cdot, X_{i})$ , ${\hat{G}}_{i} = κ (\cdot, {\hat{Y}}_{i})$ , and ${\tilde{G}}_{i} = κ (\cdot, Y_{i})$ . Then

{‖ {\hat{Σ}}_{X Y} - {\tilde{Σ}}_{X Y} ‖}_{HS} = {‖ \frac{1}{n} \sum_{i = 1}^{n} ({\hat{F}}_{i} - \frac{1}{n} \sum_{j = 1}^{n} {\hat{F}}_{j}) \otimes ({\hat{G}}_{i} - \frac{1}{n} \sum_{j = 1}^{n} {\hat{G}}_{j}) - \frac{1}{n} \sum_{i = 1}^{n} ({\hat{F}}_{i} - \frac{1}{n} \sum_{j = 1}^{n} {\hat{F}}_{j}) \otimes ({\hat{G}}_{i} - \frac{1}{n} \sum_{j = 1}^{n} {\tilde{G}}_{j}) ‖}_{HS} = {‖ \frac{1}{n} \sum_{i = 1}^{n} (({\hat{F}}_{i} - {\tilde{F}}_{i}) - \frac{1}{n} \sum_{j = 1}^{n} ({\hat{F}}_{j} - {\tilde{F}}_{j})) \otimes (({\hat{G}}_{i} - {\tilde{G}}_{i}) - \frac{1}{n} \sum_{j = 1}^{n} ({\hat{G}}_{j} - {\tilde{G}}_{j})) ‖}_{HS} \leq {‖ \frac{1}{n} \sum_{i = 1}^{n} ({\hat{F}}_{i} - {\tilde{F}}_{i}) \otimes ({\hat{G}}_{i} - {\tilde{G}}_{i}) - (2 - \frac{1}{n}) (\frac{1}{n} \sum_{j = 1}^{n} ({\hat{F}}_{j} - {\tilde{F}}_{j})) \otimes (\frac{1}{n} \sum_{j = 1}^{n} ({\hat{G}}_{j} - {\tilde{G}}_{j})) ‖}_{HS} \leq \frac{1}{n} \sum_{i = 1}^{n} {‖ ({\hat{F}}_{i} - {\tilde{F}}_{i}) \otimes {({\hat{G}}_{i} - {\tilde{G}}_{i}) ‖}_{HS} + 2 ‖ (\frac{1}{n} \sum_{j = 1}^{n} ({\hat{F}}_{j} - {\tilde{F}}_{j})) \otimes (\frac{1}{n} \sum_{j = 1}^{n} ({\hat{G}}_{j} - {\tilde{G}}_{j})) ‖}_{HS} .

(S.2)

Consider the expectation of the first term on the right-hand side. Here, the expectation involves two layers of randomness: that in $({\{X_{i j}\}}_{j = 1}^{m}, {\{Y_{i k}\}}_{k = 1}^{m})$ and that in $X_{i}$ . Taking expectation with respect to $({\{X_{i j}\}}_{j = 1}^{m}, {\{Y_{i k}\}}_{k = 1}^{m})$ and then $X_{i}$ , we have

E [\frac{1}{n} \sum_{i = 1}^{n} {‖ ({\hat{F}}_{i} - {\tilde{F}}_{i}) \otimes ({\hat{G}}_{i} - {\tilde{G}}_{i}) ‖}_{HS}] = \frac{1}{n} \sum_{i = 1}^{n} E [{‖ ({\hat{F}}_{i} - {\tilde{F}}_{i}) ‖}_{ℋ_{X}} {‖ ({\hat{G}}_{i} - {\tilde{G}}_{i}) ‖}_{ℋ_{Y}}] \leq \frac{1}{n} \sum_{i = 1}^{n} {(E {‖ ({\hat{F}}_{i} - {\tilde{F}}_{i}) ‖}_{ℋ_{X}}^{2})}^{1 / 2} {(E {‖ ({\hat{G}}_{i} - {\tilde{G}}_{i}) ‖}_{ℋ_{Y}}^{2})}^{1 / 2},

Evoking the Lipschitz continuity condition on $κ (z, z^{'})$ , we have

E {‖ ({\hat{F}}_{i} - {\tilde{F}}_{i}) ‖}_{ℋ_{X}}^{2} = E {〈 ({\hat{F}}_{i} - {\tilde{F}}_{i}, {\hat{F}}_{i} - {\tilde{F}}_{i} 〉}_{ℋ_{X}} = E [κ ({\hat{X}}_{i}, {\hat{X}}_{i}) - 2 κ ({\hat{X}}_{i}, X_{i}) + κ (X_{i}, X_{i})] \leq 2 C E [d (X_{i}, {\hat{X}}_{i})] \leq 𝒪 (E_{X_{i}} E_{{\hat{X}}_{i}} [d (X_{i}, {\hat{X}}_{i})]) .

By Assumption 2, $E_{\hat{X}} [d_{W} ({\hat{X}}_{i}, X_{i})] = 𝒪 (δ_{m})$ for $i = {1, \dots, n}$ . We then have $E {‖ ({\hat{F}}_{i} - {\tilde{F}}_{i}) ‖}_{ℋ_{X}}^{2} = 𝒪 (δ_{m})$ for $i \in {1, \dots, n}$ . Similarly, we have $E {‖ ({\hat{G}}_{i} - {\tilde{G}}_{i}) ‖}_{ℋ_{Y}}^{2} = 𝒪 (δ_{m})$ for $i \in {1, \dots, n}$ . Therefore,

E [\frac{1}{n} \sum_{i = 1}^{n} {‖ ({\hat{F}}_{i} - {\tilde{F}}_{i}) \otimes ({\hat{G}}_{i} - {\tilde{G}}_{i}) ‖}_{HS}] = 𝒪 (δ_{m}) .

(S.3)

For the expectation of the second term on the right-hand side of equation (S.2), we have

2 E {‖ (\frac{1}{n} \sum_{j = 1}^{n} ({\hat{F}}_{j} - {\tilde{F}}_{j})) \otimes (\frac{1}{n} \sum_{j = 1}^{n} ({\hat{G}}_{j} - {\tilde{G}}_{j})) ‖}_{HS} = 2 E [{‖ (\frac{1}{n} \sum_{j = 1}^{n} ({\hat{F}}_{j} - {\tilde{F}}_{j})) ‖}_{ℋ_{X}} {‖ (\frac{1}{n} \sum_{j = 1}^{n} ({\hat{G}}_{j} - {\tilde{G}}_{j})) ‖}_{ℋ_{Y}}] \leq 2 {(E {‖ \frac{1}{n} \sum_{j = 1}^{n} ({\hat{F}}_{j} - {\tilde{F}}_{j}) ‖}_{ℋ_{X}}^{2})}^{1 / 2} {(E {‖ \frac{1}{n} \sum_{j = 1}^{n} ({\hat{G}}_{j} - {\tilde{G}}_{j}) ‖}_{ℋ_{Y}}^{2})}^{1 / 2} \leq 2 {(\frac{1}{n} sup_{1 \leq i \leq n} E {‖ ({\hat{F}}_{i} - {\tilde{F}}_{i}) ‖}_{ℋ_{X}}^{2})}^{1 / 2} {(\frac{1}{n} sup_{1 \leq i \leq n} E {‖ ({\hat{G}}_{i} - {\tilde{G}}_{i}) ‖}_{ℋ_{Y}}^{2})}^{1 / 2} \leq 𝒪 (δ_{m} / n) .

(S.4)

Combine result (S.1)(S.3) and (S.4), we have

E {‖ {\hat{Σ}}_{X Y} - Σ_{X Y} ‖}_{HS} \leq 𝒪 (δ_{m} (1 + 1 / n) + n^{- 1 / 2}) = 𝒪 (δ_{m} + n^{- 1 / 2}) .

Then by Chebyshev’s inequality, we have

{‖ {\hat{Σ}}_{X Y} - Σ_{X Y} ‖}_{HS} \leq 𝒪_{p} (δ_{m} + n^{- 1 / 2}),

as desired. This completes the proof of Lemma 2.

Proof of Theorem 1. Let

\hat{A} = {({\hat{Σ}}_{X X} + η_{n} I)}^{- 1}, A_{n} = {(Σ_{X X} + η_{n} I)}^{- 1}, A = Σ_{X X}^{- 1}; \hat{B} = {\hat{Σ}}_{X Y}, B = Σ_{X Y} .

Then the element of interest $\hat{Λ} - Λ$ can be written as

\hat{Λ} - Λ = \hat{A} \hat{B} {\hat{B}}^{*} {\hat{A}}^{*} - A B B^{*} A^{*} = \hat{A} \hat{B} ({\hat{B}}^{*} {\hat{A}}^{*} - B^{*} A^{*}) + (\hat{A} \hat{B} - A B) B^{*} A^{*};

Thus, we have

{‖ \hat{Λ} - Λ ‖}_{OP} \leq {‖ \hat{A} \hat{B} ({\hat{B}}^{*} {\hat{A}}^{*} - B^{*} A^{*}) ‖}_{OP} + {‖ (\hat{A} \hat{B} - A B) B^{*} A^{*} ‖}_{OP} = {‖ (A B - \hat{A} \hat{B}) {\hat{B}}^{*} {\hat{A}}^{*} ‖}_{OP} + {‖ (\hat{A} \hat{B} - A B) B^{*} A^{*} ‖}_{OP} \leq {‖ (A B - \hat{A} \hat{B}) ‖}_{OP} ({‖ \hat{A} \hat{B} ‖}_{OP} + ∥ A B ∥_{OP}) .

Since both $A B$ and $\hat{A} \hat{B}$ are compact operators, it suffices to show that

{‖ (A B - \hat{A} \hat{B}) ‖}_{OP} = 𝒪_{p} (η_{n}^{β} + η_{n}^{- 1} ε_{n, m}),

where $ε_{n, m} = δ_{m} + n^{- 1 / 2}$ . Writing $\hat{A} \hat{B}$ as

\hat{A} \hat{B} = \hat{A} (\hat{B} - B) + (\hat{A} - A_{n}) B + (A_{n} - A) B + A B,

we obtain

{‖ (A B - \hat{A} \hat{B}) ‖}_{OP} \leq {‖ \hat{A} (\hat{B} - B) ‖}_{OP} + {‖ (\hat{A} - A_{n}) B ‖}_{OP} + {‖ (A_{n} - A) B ‖}_{OP} .

(S.5)

For the first term on the right-hand side, we have

{‖ \hat{A} (\hat{B} - B) ‖}_{OP} = {‖ {({\hat{Σ}}_{X X} + η_{n} I)}^{- 1} ({\hat{Σ}}_{X Y} - Σ_{X Y}) ‖}_{OP} \leq {‖ {({\hat{Σ}}_{X X} + η_{n} I)}^{- 1} ‖}_{OP} {‖ ({\hat{Σ}}_{X Y} - Σ_{X Y}) ‖}_{HS} \leq η_{n}^{- 1} {‖ ({\hat{Σ}}_{X X} + η_{n} I) {({\hat{Σ}}_{X X} + η_{n} I)}^{- 1} ‖}_{OP} {‖ ({\hat{Σ}}_{X Y} - Σ_{X Y}) ‖}_{HS} \leq 𝒪_{p} (η_{n}^{- 1} ε_{n, m}),

(S.6)

where the last inequality follows from Lemma 2. For the second term on the right-hand side of (S.5), we write it as

(\hat{A} - A_{n}) B = ({({\hat{Σ}}_{X X} + η_{n} I)}^{- 1} - {(Σ_{X X} + η_{n} I)}^{- 1}) Σ_{X Y} = {({\hat{Σ}}_{X X} + η_{n} I)}^{- 1} (({\hat{Σ}}_{X X} + η_{n} I) - (Σ_{X X} + η_{n} I)) {(Σ_{X X} + η_{n} I)}^{- 1} Σ_{X Y} = {({\hat{Σ}}_{X X} + η_{n} I)}^{- 1} ({\hat{Σ}}_{X X} - Σ_{X X}) {(Σ_{X X} + η_{n} I)}^{- 1} Σ_{X X} Σ_{X X}^{- 1} Σ_{X Y} .

Thus, we have

{‖ (\hat{A} - A_{n}) B ‖}_{OP} \leq {‖ {({\hat{Σ}}_{X X} + η_{n} I)}^{- 1} ‖}_{OP} {‖ {\hat{Σ}}_{X X} - Σ_{X X} ‖}_{OP} {‖ {(Σ_{X X} + η_{n} I)}^{- 1} Σ_{X X} ‖}_{OP} {‖ Σ_{X X}^{- 1} Σ_{X Y} ‖}_{OP} .

By the above derivations, we have ${‖ {({\hat{Σ}}_{X X} + η_{n} I)}^{- 1} ‖}_{OP} = 𝒪_{p} (η_{n}^{- 1})$ and ${‖ {\hat{Σ}}_{X X} - Σ_{X X} ‖}_{OP} = 𝒪_{p} (ε_{n, m})$ . Also, we have

{‖ {(Σ_{X X} + η_{n} I)}^{- 1} Σ_{X X} ‖}_{OP} \leq {‖ {(Σ_{X X} + η_{n} I)}^{- 1} (Σ_{X X} + η_{n} I) ‖}_{OP} = 1,

and ${‖ Σ_{X X}^{- 1} Σ_{X Y} ‖}_{OP} \leq \infty$ by Assumption 1. Therefore, we have

{‖ (\hat{A} - A_{n}) B ‖}_{OP} = 𝒪_{p} (η_{n}^{- 1} ε_{n, m}) .

(S.7)

Finally, letting $R_{X Y} = Σ_{X X}^{β} S_{X Y}$ and rewriting the third term on the right-hand side of (S.5) as

(A_{n} - A) B = ({(Σ_{X X} + η_{n} I)}^{- 1} - Σ_{X X}^{- 1}) Σ_{X Y} = {(Σ_{X X} + η_{n} I)}^{- 1} Σ_{X X} R_{X Y} - R_{X Y} = - η_{n} {(Σ_{X X} + η_{n} I)}^{- 1} R_{X Y},

we see that

{‖ (A_{n} - A) B ‖}_{OP} \leq η_{n} {‖ {(Σ_{X X} + η_{n} I)}^{- 1 + β} ‖}_{OP} {‖ S_{X Y} ‖}_{OP} \leq η_{n} η_{n}^{β - 1} {‖ {(Σ_{X X} + η_{n} I)}^{- 1 + β} {(Σ_{X X} + η_{n} I)}^{1 - β} ‖}_{OP} {‖ S_{X Y} ‖}_{OP} \leq η_{n}^{β} {‖ S_{X Y} ‖}_{OP} = 𝒪_{p} (η_{n}^{β}) .

(S.8)

Combining (S.6), (S.7), and (S.8), we prove the first assertion of the theorem.

The second assertion can then be proved by following roughly the same path and using the following facts:

if $A$ is a bounded operator and $B$ is Hilbert-Schmidt operator and $r a n (A) \subseteq d o m (B)$ , then $A B$ is a Hilbert Schmidt operator with
${‖ A B ‖}_{HS} \leq {‖ A ‖}_{OP} {‖ B ‖}_{HS};$
if $A$ is Hilbert-Schmidt then so is $A^{*}$ and
${‖ A ‖}_{HS} = {‖ A^{*} ‖}_{HS} .$

Using the same decomposition as (S.5), we have

{‖ (A B - \hat{A} \hat{B}) ‖}_{HS} \leq {‖ \hat{A} (\hat{B} - B) ‖}_{HS} + {‖ (\hat{A} - A_{n}) B ‖}_{HS} + {‖ (A_{n} - A) B ‖}_{HS} .

(S.9)

For the first term on the right-hand side of (S.9):

{‖ \hat{A} (\hat{B} - B) ‖}_{HS} \leq {‖ \hat{A} ‖}_{OP} {‖ \hat{B} - B ‖}_{HS} = 𝒪 (η_{n}^{- 1} ε_{n, m}) .

(S.10)

For the second term on the right-hand side of (S.9):

{‖ (\hat{A} - A_{n}) B ‖}_{HS} \leq {‖ {({\hat{Σ}}_{X X} + η_{n} I)}^{- 1} ‖}_{OP} {‖ {\hat{Σ}}_{X X} - Σ_{X X} ‖}_{OP} {‖ {(Σ_{X X} + η_{n} I)}^{- 1} Σ_{X X} ‖}_{OP} {‖ Σ_{X X}^{- 1} Σ_{X Y} ‖}_{HS} \leq {‖ {({\hat{Σ}}_{X X} + η_{n} I)}^{- 1} ‖}_{OP} {‖ {\hat{Σ}}_{X X} - Σ_{X X} ‖}_{OP} {‖ {(Σ_{X X} + η_{n} I)}^{- 1} Σ_{X X} ‖}_{OP} {‖ Σ_{X X}^{β} S_{X Y} ‖}_{HS} = 𝒪_{p} (η_{n}^{- 1} ε_{n, m}) .

(S.11)

For the third term on the right-hand side of (S.9):

{‖ (A_{n} - A) B ‖}_{HS} \leq η_{n} {‖ {(Σ_{X X} + η_{n} I)}^{- 1 + β} ‖}_{OP} {‖ S_{X Y} ‖}_{HS} \leq η_{n}^{β} {‖ S_{X Y} ‖}_{HS} = 𝒪_{p} (η_{n}^{β}) .

(S.12)

Combining the results (S.10), (S.11), and (S.12), we have

{‖ (A B - \hat{A} \hat{B}) ‖}_{HS} \leq 𝒪_{p} (η_{n}^{β} + η_{n}^{- 1} ε_{n, m}) .

This completes the proof of Theorem 1.

Acknowledgments

We thank the Editor, Associate Editor and referees for their helpful comments. This research is partly supported by the NIH grant 1R01GM152812 and the NSF grants DMS-1953189, CCF-2007823 and DMS-2210775.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

[1].Ambrosio L, Gigli N, Savaré G, Gradient flows with metric and differentiable structures, and applications to the wasserstein space, Atti della Accademia Nazionale dei Lincei. Classe di Scienze Fisiche, Matematiche e Naturali. Rendiconti Lincei. Matematica e Applicazioni 15 (2004) 327–343. [Google Scholar]
[2].Bayraktar E, Guo G, Strong equivalence between metrics of Wasserstein type, Electronic Communications in Probability 26 (2021) 1–13. [Google Scholar]
[3].Berg C, Christensen JPR, Ressel P, Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions, Springer, 1984. [Google Scholar]
[4].Bhattacharjee S, Li B, Xue L, Nonlinear global Fréchet regression for random objects via weak conditional expectation, arXiv preprint arXiv:2310.07817 (2023). [Google Scholar]
[5].Bigot J, Gouet R, Klein T, López A, Geodesic pca in the wasserstein space by convex pca, in: Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 53, Institut Henri Poincaré, pp. 1–26. [Google Scholar]
[6].Bobkov S, Ledoux M, One-Dimensional Empirical Measures, Order Statistics, and Kantorovich Transport Distances, American Mathematical Society, 2019. [Google Scholar]
[7].Boissard E, Le Gouic T, On the mean speed of convergence of empirical and occupation measures in Wasserstein distance, Annales de l’IHP Probabilités et Statistiques 50 (2014) 539–563. [Google Scholar]
[8].Chen Y, Lin Z, Müller H-G, Wasserstein regression, Journal of the American Statistical Association, in press (2021) 1–14.35757777 [Google Scholar]
[9].Chen Z, Bao Y, Li H, Spencer BF Jr, Lqd-rkhs-based distribution-to-distribution regression methodology for restoring the probability distributions of missing shm data, Mechanical Systems and Signal Processing 121 (2019) 655–674. [Google Scholar]
[10].Christmann A, Steinwart I, Universal kernels on non-standard input spaces, in: in Advances in Neural Information Processing Systems, pp. 406–414. [Google Scholar]
[11].Dereich S, Scheutzow M, Schottstedt R, Constructive quantization: Approximation by empirical measures, Annales de l’IHP Probabilités et Statistiques 49 (2013) 1183–1203. [Google Scholar]
[12].Ding S, Cook RD, Tensor sliced inverse regression, Journal of Multivariate Analysis 133 (2015) 216–231. [Google Scholar]
[13].Dong Y, Wu Y, Fréchet kernel sliced inverse regression, Journal of Multivariate Analysis 191 (2022) 105032 [Google Scholar]
[14].Fan J, Müller H-G, Conditional wasserstein barycenters and interpolation/extrapolation of distributions, arXiv preprint arXiv:2107.09218 (2021). [Google Scholar]
[15].Fan J, Xue L, Yao J, Sufficient forecasting using factor models, Journal of Econometrics 201 (2017) 292–306. [DOI] [PMC free article] [PubMed] [Google Scholar]
[16].Ferré L, Yao A-F, Functional sliced inverse regression analysis, Statistics 37 (2003) 475–488. [Google Scholar]
[17].Fournier N, Guillin A, On the rate of convergence in wasserstein distance of the empirical measure, Probability Theory and Related Fields 162 (2015) 707–738. [Google Scholar]
[18].Fukumizu K, Bach FR, Gretton A, Statistical consistency of kernel canonical correlation analysis., Journal of Machine Learning Research 8 (2007) 361–383. [Google Scholar]
[19].Fukumizu K, Bach FR, Jordan MI, Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces, Journal of Machine Learning Research 5 (2004) 73–99 [Google Scholar]
[20].Ghodrati L, Panaretos VM, Distribution-on-distribution regression via optimal transport maps, Biometrika 109 (2022) 957–974. [Google Scholar]
[21].Golub GH, Heath M, Wahba G, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (1979) 215–223. [Google Scholar]
[22].Hsing T, Ren H, An rkhs formulation of the inverse regression dimension-reduction problem, The Annals of Statistics 37 (2009) 726–755. [Google Scholar]
[23].Kolouri S, Zou Y, Rohde GK, Sliced wasserstein kernels for probability distributions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5258–5267. [Google Scholar]
[24].Koltchinskii V, Giné E, Random matrix approximation of spectra of integral operators, Bernoulli 6 (2000) 113–167. [Google Scholar]
[25].Lee K-Y, Li B, Chiaromonte F, A general theory for nonlinear sufficient dimension reduction: Formulation and estimation, The Annals of Statistics 41 (2013) 221–249. [Google Scholar]
[26].Lei J, Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces, Bernoulli 26 (2020) 767–798. [Google Scholar]
[27].Li B, Sufficient Dimension Reduction: Methods and Applications with R, CRC Press, 2018. [Google Scholar]
[28].Li B, Artemiou A, Li L, Principal support vector machines for linear and nonlinear sufficient dimension reduction, The Annals of Statistics 39 (2011) 3182–3210. [Google Scholar]
[29].Li B, Kim MK, Altman N, On dimension folding of matrix-or array-valued statistical objects, The Annals of Statistics 38 (2010) 1094–1121. [Google Scholar]
[30].Li B, Song J, Nonlinear sufficient dimension reduction for functional data, The Annals of Statistics 45 (2017) 1059–1095. [Google Scholar]
[31].Li B, Song J, Dimension reduction for functional data based on weak conditional moments, The Annals of Statistics 50 (2022) 107–128. [Google Scholar]
[32].Li K-C, Sliced inverse regression for dimension reduction, Journal of the American Statistical Association 86 (1991) 316–327. [Google Scholar]
[33].Lin T, Fan C, Ho N, Cuturi M, Jordan MI, Projection robust wasserstein distance and riemannian optimization, arXiv preprint arXiv:2006.07458 (2020). [Google Scholar]
[34].Luo W, Li B, Combining eigenvalues and variation of eigenvectors for order determination, Biometrika 103 (2016) 875–887. [Google Scholar]
[35].Luo W, Xue L, Yao J, Yu X, Inverse moment methods for sufficient forecasting using high-dimensional predictors, Biometrika 109 (2022) 473–487. [Google Scholar]
[36].Meunier D, Pontil M, Ciliberto C, Distribution regression with sliced wasserstein kernels, in: International Conference on Machine Learning, PMLR, pp. 15501–15523. [Google Scholar]
[37].Micchelli CA, Xu Y, Zhang H, Universal kernels., Journal of Machine Learning Research 7 (2006). [Google Scholar]
[38].Nietert S, Goldfeld Z, Sadhu R, Kato K, Statistical, robustness, and computational guarantees for sliced Wasserstein distances, Advances in Neural Information Processing Systems 35 (2022) 28179–28193. [Google Scholar]
[39].Niles-Weed J, Rigollet P, Estimation of wasserstein distances in the spiked transport model, Bernoulli 28 (2022) 2663–2688. [Google Scholar]
[40].Okano R, Imaizumi M, Distribution-on-distribution regression with wasserstein metric: Multivariate gaussian case, arXiv preprint arXiv:2307.06137 (2023). [Google Scholar]
[41].Panaretos V, Zemel Y, An Invitation to Statistics in Wasserstein Space, Springer International Publishing, 2020. [Google Scholar]
[42].Petersen A, Müller H-G, Functional data analysis for density functions by transformation to a hilbert space, The Annals of Statistics 44 (2016) 183–218. [Google Scholar]
[43].Petersen A, Müller H-G, Fréchet regression for random objects with euclidean predictors, The Annals of Statistics 47 (2019) 691–719. [Google Scholar]
[44].Schoenberg IJ, On certain metric spaces arising from euclidean spaces by a change of metric and their imbedding in hilbert space, The Annals of Mathematics 38 (1937) 787–793 [Google Scholar]
[45].Schoenberg IJ, Metric spaces and positive definite functions, Transactions of the American Mathematical Society 44 (1938) 522–536. [Google Scholar]
[46].Sriperumbudur B, Fukumizu K, Lanckriet G, On the relation between universality, characteristic kernels and rkhs embedding of measures, in: Proceedings of the thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, pp. 773–780. [Google Scholar]
[47].Sriperumbudur BK, Fukumizu K, Lanckriet GR, Universality, characteristic kernels and rkhs embedding of measures., Journal of Machine Learning Research 12 (2011) 2389–2410. [Google Scholar]
[48].Székely GJ, Rizzo ML, Bakirov NK, Measuring and testing dependence by correlation of distances, The Annals of Statistics 35 (2007) 2769–2794. [Google Scholar]
[49].Villani C, Optimal Transport: Old and New, volume 338, Springer, 2009 [Google Scholar]
[50].Virta J, Lee K-Y, Li L, Sliced inverse regression in metric spaces, arXiv preprint arXiv:2206.11511 (2022). [Google Scholar]
[51].Weed J, Bach F, Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance, Bernoulli 25 (2019) 2620–2648. [Google Scholar]
[52].Ying C, Yu Z, Fréchet sufficient dimension reduction for random objects, Biometrika, in press (2022). [Google Scholar]
[53].Yu X, Yao J, Xue L, Nonparametric estimation and conformal inference of the sufficient forecasting with a diverging number of factors, Journal of Business & Economic Statistics 40 (2022) 342–354. [Google Scholar]
[54].Zhang Q, Xue L, Li B, Dimension reduction for fréchet regression, Journal of the American Statistical Association, in press (2023). [Google Scholar]

[R1] [1].Ambrosio L, Gigli N, Savaré G, Gradient flows with metric and differentiable structures, and applications to the wasserstein space, Atti della Accademia Nazionale dei Lincei. Classe di Scienze Fisiche, Matematiche e Naturali. Rendiconti Lincei. Matematica e Applicazioni 15 (2004) 327–343. [Google Scholar]

[R2] [2].Bayraktar E, Guo G, Strong equivalence between metrics of Wasserstein type, Electronic Communications in Probability 26 (2021) 1–13. [Google Scholar]

[R3] [3].Berg C, Christensen JPR, Ressel P, Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions, Springer, 1984. [Google Scholar]

[R4] [4].Bhattacharjee S, Li B, Xue L, Nonlinear global Fréchet regression for random objects via weak conditional expectation, arXiv preprint arXiv:2310.07817 (2023). [Google Scholar]

[R5] [5].Bigot J, Gouet R, Klein T, López A, Geodesic pca in the wasserstein space by convex pca, in: Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 53, Institut Henri Poincaré, pp. 1–26. [Google Scholar]

[R6] [6].Bobkov S, Ledoux M, One-Dimensional Empirical Measures, Order Statistics, and Kantorovich Transport Distances, American Mathematical Society, 2019. [Google Scholar]

[R7] [7].Boissard E, Le Gouic T, On the mean speed of convergence of empirical and occupation measures in Wasserstein distance, Annales de l’IHP Probabilités et Statistiques 50 (2014) 539–563. [Google Scholar]

[R8] [8].Chen Y, Lin Z, Müller H-G, Wasserstein regression, Journal of the American Statistical Association, in press (2021) 1–14.35757777 [Google Scholar]

[R9] [9].Chen Z, Bao Y, Li H, Spencer BF Jr, Lqd-rkhs-based distribution-to-distribution regression methodology for restoring the probability distributions of missing shm data, Mechanical Systems and Signal Processing 121 (2019) 655–674. [Google Scholar]

[R10] [10].Christmann A, Steinwart I, Universal kernels on non-standard input spaces, in: in Advances in Neural Information Processing Systems, pp. 406–414. [Google Scholar]

[R11] [11].Dereich S, Scheutzow M, Schottstedt R, Constructive quantization: Approximation by empirical measures, Annales de l’IHP Probabilités et Statistiques 49 (2013) 1183–1203. [Google Scholar]

[R12] [12].Ding S, Cook RD, Tensor sliced inverse regression, Journal of Multivariate Analysis 133 (2015) 216–231. [Google Scholar]

[R13] [13].Dong Y, Wu Y, Fréchet kernel sliced inverse regression, Journal of Multivariate Analysis 191 (2022) 105032 [Google Scholar]

[R14] [14].Fan J, Müller H-G, Conditional wasserstein barycenters and interpolation/extrapolation of distributions, arXiv preprint arXiv:2107.09218 (2021). [Google Scholar]

[R15] [15].Fan J, Xue L, Yao J, Sufficient forecasting using factor models, Journal of Econometrics 201 (2017) 292–306. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] [16].Ferré L, Yao A-F, Functional sliced inverse regression analysis, Statistics 37 (2003) 475–488. [Google Scholar]

[R17] [17].Fournier N, Guillin A, On the rate of convergence in wasserstein distance of the empirical measure, Probability Theory and Related Fields 162 (2015) 707–738. [Google Scholar]

[R18] [18].Fukumizu K, Bach FR, Gretton A, Statistical consistency of kernel canonical correlation analysis., Journal of Machine Learning Research 8 (2007) 361–383. [Google Scholar]

[R19] [19].Fukumizu K, Bach FR, Jordan MI, Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces, Journal of Machine Learning Research 5 (2004) 73–99 [Google Scholar]

[R20] [20].Ghodrati L, Panaretos VM, Distribution-on-distribution regression via optimal transport maps, Biometrika 109 (2022) 957–974. [Google Scholar]

[R21] [21].Golub GH, Heath M, Wahba G, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (1979) 215–223. [Google Scholar]

[R22] [22].Hsing T, Ren H, An rkhs formulation of the inverse regression dimension-reduction problem, The Annals of Statistics 37 (2009) 726–755. [Google Scholar]

[R23] [23].Kolouri S, Zou Y, Rohde GK, Sliced wasserstein kernels for probability distributions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5258–5267. [Google Scholar]

[R24] [24].Koltchinskii V, Giné E, Random matrix approximation of spectra of integral operators, Bernoulli 6 (2000) 113–167. [Google Scholar]

[R25] [25].Lee K-Y, Li B, Chiaromonte F, A general theory for nonlinear sufficient dimension reduction: Formulation and estimation, The Annals of Statistics 41 (2013) 221–249. [Google Scholar]

[R26] [26].Lei J, Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces, Bernoulli 26 (2020) 767–798. [Google Scholar]

[R27] [27].Li B, Sufficient Dimension Reduction: Methods and Applications with R, CRC Press, 2018. [Google Scholar]

[R28] [28].Li B, Artemiou A, Li L, Principal support vector machines for linear and nonlinear sufficient dimension reduction, The Annals of Statistics 39 (2011) 3182–3210. [Google Scholar]

[R29] [29].Li B, Kim MK, Altman N, On dimension folding of matrix-or array-valued statistical objects, The Annals of Statistics 38 (2010) 1094–1121. [Google Scholar]

[R30] [30].Li B, Song J, Nonlinear sufficient dimension reduction for functional data, The Annals of Statistics 45 (2017) 1059–1095. [Google Scholar]

[R31] [31].Li B, Song J, Dimension reduction for functional data based on weak conditional moments, The Annals of Statistics 50 (2022) 107–128. [Google Scholar]

[R32] [32].Li K-C, Sliced inverse regression for dimension reduction, Journal of the American Statistical Association 86 (1991) 316–327. [Google Scholar]

[R33] [33].Lin T, Fan C, Ho N, Cuturi M, Jordan MI, Projection robust wasserstein distance and riemannian optimization, arXiv preprint arXiv:2006.07458 (2020). [Google Scholar]

[R34] [34].Luo W, Li B, Combining eigenvalues and variation of eigenvectors for order determination, Biometrika 103 (2016) 875–887. [Google Scholar]

[R35] [35].Luo W, Xue L, Yao J, Yu X, Inverse moment methods for sufficient forecasting using high-dimensional predictors, Biometrika 109 (2022) 473–487. [Google Scholar]

[R36] [36].Meunier D, Pontil M, Ciliberto C, Distribution regression with sliced wasserstein kernels, in: International Conference on Machine Learning, PMLR, pp. 15501–15523. [Google Scholar]

[R37] [37].Micchelli CA, Xu Y, Zhang H, Universal kernels., Journal of Machine Learning Research 7 (2006). [Google Scholar]

[R38] [38].Nietert S, Goldfeld Z, Sadhu R, Kato K, Statistical, robustness, and computational guarantees for sliced Wasserstein distances, Advances in Neural Information Processing Systems 35 (2022) 28179–28193. [Google Scholar]

[R39] [39].Niles-Weed J, Rigollet P, Estimation of wasserstein distances in the spiked transport model, Bernoulli 28 (2022) 2663–2688. [Google Scholar]

[R40] [40].Okano R, Imaizumi M, Distribution-on-distribution regression with wasserstein metric: Multivariate gaussian case, arXiv preprint arXiv:2307.06137 (2023). [Google Scholar]

[R41] [41].Panaretos V, Zemel Y, An Invitation to Statistics in Wasserstein Space, Springer International Publishing, 2020. [Google Scholar]

[R42] [42].Petersen A, Müller H-G, Functional data analysis for density functions by transformation to a hilbert space, The Annals of Statistics 44 (2016) 183–218. [Google Scholar]

[R43] [43].Petersen A, Müller H-G, Fréchet regression for random objects with euclidean predictors, The Annals of Statistics 47 (2019) 691–719. [Google Scholar]

[R44] [44].Schoenberg IJ, On certain metric spaces arising from euclidean spaces by a change of metric and their imbedding in hilbert space, The Annals of Mathematics 38 (1937) 787–793 [Google Scholar]

[R45] [45].Schoenberg IJ, Metric spaces and positive definite functions, Transactions of the American Mathematical Society 44 (1938) 522–536. [Google Scholar]

[R46] [46].Sriperumbudur B, Fukumizu K, Lanckriet G, On the relation between universality, characteristic kernels and rkhs embedding of measures, in: Proceedings of the thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, pp. 773–780. [Google Scholar]

[R47] [47].Sriperumbudur BK, Fukumizu K, Lanckriet GR, Universality, characteristic kernels and rkhs embedding of measures., Journal of Machine Learning Research 12 (2011) 2389–2410. [Google Scholar]

[R48] [48].Székely GJ, Rizzo ML, Bakirov NK, Measuring and testing dependence by correlation of distances, The Annals of Statistics 35 (2007) 2769–2794. [Google Scholar]

[R49] [49].Villani C, Optimal Transport: Old and New, volume 338, Springer, 2009 [Google Scholar]

[R50] [50].Virta J, Lee K-Y, Li L, Sliced inverse regression in metric spaces, arXiv preprint arXiv:2206.11511 (2022). [Google Scholar]

[R51] [51].Weed J, Bach F, Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance, Bernoulli 25 (2019) 2620–2648. [Google Scholar]

[R52] [52].Ying C, Yu Z, Fréchet sufficient dimension reduction for random objects, Biometrika, in press (2022). [Google Scholar]

[R53] [53].Yu X, Yao J, Xue L, Nonparametric estimation and conformal inference of the sufficient forecasting with a diverging number of factors, Journal of Business & Economic Statistics 40 (2022) 342–354. [Google Scholar]

[R54] [54].Zhang Q, Xue L, Li B, Dimension reduction for fréchet regression, Journal of the American Statistical Association, in press (2023). [Google Scholar]

PERMALINK

Nonlinear sufficient dimension reduction for distribution-on-distribution regression

Qi Zhang

Bing Li

Lingzhou Xue

Abstract

1. Introduction

2. Nonlinear SDR for Distributional Data

3. Construction of RKHS

3.1. Wasserstein kernel for univariate distributions

3.2. Sliced-Wasserstein kernel for multivariate distributions

4. Generalized Sliced Inverse Regression for Distributional Data

4.1. Distributional GSIR and the role of universal kernel

4.2. Estimation for distributional GSIR

Choice of tuning parameters:

Order Determination:

5. Asymptotic Analysis

5.1. Convergence rate for fully observed distribution

5.2. Convergence rate for discretely observed distribution

6. Simulation

6.1. Computational details

6.2. Univariate distribution-on-distribution regression

Table 1:

Fig. 1:

6.3. Multivariate distribution-on-distribution regression

Table 2:

Fig. 2:

6.4. Comparison with functional-GSIR

Table 3:

7. Applications

7.1. Application to human mortality data

Fig. 3:

Fig. 4:

Fig. 5:

7.2. Application to Calgary temperature data

Fig. 6:

8. Discussion

9. Technical Proofs

Geometry of Wasserstein space

Acknowledgments

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases