Published in final edited form as: J Am Stat Assoc. 2023 Dec 26;119(548):2733–2747. doi: 10.1080/01621459.2023.2277406

Dimension Reduction for Fréchet Regression

Qi Zhang 1, Lingzhou Xue 1,*, Bing Li 1

Abstract

With the rapid development of data collection techniques, complex data objects that are not in the Euclidean space are frequently encountered in new statistical applications. The Fréchet regression model (Petersen and Müller, 2019) provides a promising framework for regression analysis with metric space-valued responses. In this paper, we introduce a flexible sufficient dimension reduction (SDR) method for Fréchet regression to achieve two purposes: to mitigate the curse of dimensionality caused by high-dimensional predictors, and to provide a visual inspection tool for Fréchet regression. Our approach is flexible enough to turn any existing SDR method for Euclidean (X, Y) into one for Euclidean X and metric space-valued Y. The basic idea is to first map the metric space-valued random object Y to a real-valued random variable f(Y) using a class of functions, and then perform classical SDR on the transformed data. If the class of functions is sufficiently rich, then we are guaranteed to recover the Fréchet SDR space. We show that such a class, which we call an ensemble, can be generated by a universal kernel (cc-universal kernel). We establish the consistency and asymptotic convergence rates of the proposed methods. The finite-sample performance of the proposed methods is illustrated through simulation studies for several commonly encountered metric spaces, including the Wasserstein space, the space of symmetric positive definite matrices, and the sphere. We illustrate the data visualization aspect of our method using the human mortality distribution data from the United Nations Databases.

Keywords: Ensembled sufficient dimension reduction, Inverse regression, Statistical objects, Universal kernel, Wasserstein space

1. Introduction

With the rapid development of data collection techniques, complex data objects that are not in the Euclidean space are frequently encountered in new statistical applications, such as the graph Laplacians of networks, the covariance or correlation matrices for the brain functional connectivity in neuroscience (Ferreira and Busatto, 2013), and probability distributions in CT hematoma density data (Petersen and Müller, 2019). These data objects, also known as random objects, do not obey the operation rules of a vector space with an inner product or a norm, but instead reside in a general metric space. In a prescient paper, Fréchet (1948) proposed the Fréchet mean as a natural generalization of the expectation of a random vector. By extending the Fréchet mean to the conditional Fréchet mean, Petersen and Müller (2019) introduced the Fréchet regression model with random objects as the response and Euclidean vectors as predictors, which provides a promising framework for regression analysis with metric space-valued responses. Dubey and Müller (2019) showed the consistency of the sample Fréchet mean using the results of Petersen and Müller (2019), derived a central limit theorem for the sample Fréchet variance that quantifies the variation around the Fréchet mean, and further developed the Fréchet analysis of variance for random objects. Dubey and Müller (2020a) designed a method for change-point detection and inference in a sequence of metric-space-valued data objects.

Fréchet regression employs global least squares and local linear or polynomial regression to fit the conditional Fréchet mean. It is well known that global least squares rests on a restrictive assumption about the regression relation. Although local regression is more flexible, it is effective only when the dimension of the predictor is relatively low. As this dimension gets higher, its accuracy drops significantly, a phenomenon known as the curse of dimensionality. To address this issue, it is essential to reduce the dimension of the predictor without losing information about the response. For classical regression, this task is performed by sufficient dimension reduction (SDR; see Li 1991, Cook 1996, and Li 2018, among others). SDR projects the high-dimensional predictor onto a low-dimensional subspace that preserves the information about the response through the notion of sufficiency.

Besides assisting regression in overcoming the curse of dimensionality, another important function of SDR for classical regression is to provide a data visualization tool to gain insights into how the regression surface looks in high-dimensional space before even fitting a model. By inspecting the sufficient plots of the response objects against the sufficient predictors, we can gain insights into the general trends of the response as the most informative part of the predictor varies, whether there are outlying observations, and whether there are subjects with high leverage that have undue influence on the regression estimates: the usual information a statistician looks for in the exploratory and model-checking stages of a regression analysis. This function is also needed in Fréchet regression. In fact, it can be argued that data visualization is even more important for the regression of random objects, as the regression relation may be even more difficult to discern among the complex details of the objects.

To fulfill these demands, we systematically develop the theory and methodology of sufficient dimension reduction for Fréchet regression in this paper. To set the stage, we first outline SDR for classical regression. Let $X$ be a $p$-dimensional random vector in $\mathbb{R}^p$ and $Y$ a random variable in $\mathbb{R}$. Classical SDR aims to find a dimension reduction subspace $\mathcal{S}$ of $\mathbb{R}^p$ such that $Y$ and $X$ are independent conditioning on $P_{\mathcal{S}}X$, that is, $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}}X$, where $P_{\mathcal{S}}$ is the projection onto $\mathcal{S}$ with respect to the usual inner product in $\mathbb{R}^p$. This way, $P_{\mathcal{S}}X$ can be used as the synthetic predictor without losing regression information about the response $Y$. Under mild conditions, the intersection of all such dimension reduction subspaces is also a dimension reduction subspace, and this intersection is called the central subspace, denoted by $\mathcal{S}_{Y|X}$ (Cook, 1996; Yin et al., 2008). For the situation where the primary interest is in estimating the regression function, Cook and Li (2002) introduced a weaker form of SDR, the mean dimension reduction subspace. A subspace $\mathcal{S}$ of $\mathbb{R}^p$ is a mean SDR subspace if it satisfies $E(Y \mid X) = E(Y \mid P_{\mathcal{S}}X)$, and the intersection of all such spaces, if it is still a mean SDR subspace, is the central mean subspace, denoted by $\mathcal{S}_{E(Y|X)}$. The central mean subspace $\mathcal{S}_{E(Y|X)}$ is always contained in the central subspace $\mathcal{S}_{Y|X}$ when they both exist. Many methods for estimating the central subspace and the central mean subspace have been developed over the past decades. For example, for the central subspace, we have the sliced inverse regression (SIR; Li 1991), the sliced average variance estimate (SAVE; Cook and Weisberg 1991), the contour regression (CR; Li et al. 2005), and the directional regression (DR; Li and Wang 2007). For the central mean subspace, we have the ordinary least squares (OLS; Li and Duan 1989), the principal Hessian directions (PHD; Li 1992), the iterative Hessian transformation (IHT; Cook and Li 2002), and the outer product of gradients (OPG) and the minimum average variance estimator (MAVE) of Xia et al. (2002).

SDR has been extended to accommodate some complex data structures in the past, for example, to functional data (Ferré and Yao 2003; Hsing and Ren 2009; Li and Song 2017), to tensorial data (Li et al. 2010; Ding and Cook 2015), and to panel data (Fan et al., 2017; Yu et al., 2020; Luo et al., 2021). Most recently, Ying and Yu (2022) extended SIR to the case where the response takes values in a metric space, and Zhang et al. (2022) extended the generalized SIR (Lee et al., 2013) to the case where the response and predictors are distributional data. Taking a substantial step forward, in this paper, we introduce a comprehensive and flexible method that can adapt any existing SDR estimators to metric space-valued responses.

The basic idea of our method stems from the ensemble SDR for Euclidean $X$ and $Y$ of Yin and Li (2011), which recovers the central subspace $\mathcal{S}_{Y|X}$ by repeatedly estimating the central mean subspace $\mathcal{S}_{E[f(Y)|X]}$ for a family $\mathfrak{F}$ of functions $f$ that is rich enough to determine the conditional distribution of $Y \mid X$. Such a family $\mathfrak{F}$ is called an ensemble and satisfies $\mathcal{S}_{Y|X} = \mathrm{span}\{\mathcal{S}_{E[f(Y)|X]} : f \in \mathfrak{F}\}$. Using this relation, we can turn any method for estimating the central mean subspace into one that estimates the central subspace.

While borrowing the idea of the ensemble, our goal is different from Yin and Li (2011): we are not interested in turning an estimator of the central mean subspace into one for the central subspace. Instead, we are interested in turning any existing SDR method for Euclidean $(X, Y)$ into one for Euclidean $X$ and metric space-valued $Y$. Let $X$ be a random vector in $\mathbb{R}^p$ and $Y$ a random object taking values in a metric space $(\Omega_Y, d)$. We still use the symbol $\mathcal{S}_{Y|X}$ to represent the intersection of all subspaces $\mathcal{S}$ of $\mathbb{R}^p$ satisfying $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}}X$. We call $\mathcal{S}_{Y|X}$ the central subspace for Fréchet SDR, or simply the Fréchet central subspace. Let $\mathfrak{F}$ be a family of functions $f: \Omega_Y \to \mathbb{R}$ that are measurable with respect to the Borel σ-field on the metric space. We use two types of ensembles to connect classical SDR with Fréchet SDR:

  • A Central Mean Space ensemble (CMS-ensemble) is a family $\mathfrak{F}$ that is rich enough so that $\mathcal{S}_{Y|X} = \mathrm{span}\{\mathcal{S}_{E[f(Y)|X]} : f \in \mathfrak{F}\}$. Note that we know how to estimate the spaces $\mathcal{S}_{E[f(Y)|X]}$ using existing SDR methods, since $f(Y)$ is a real number. We use this ensemble to turn an SDR method that targets the central mean subspace into one that targets the Fréchet central subspace. We will focus on two forward regression methods, OPG and MAVE, and three moment estimators of the CMS.

  • A Central Space ensemble (CS-ensemble) is a family $\mathfrak{F}$ that is rich enough so that $\mathcal{S}_{Y|X} = \mathrm{span}\{\mathcal{S}_{f(Y)|X} : f \in \mathfrak{F}\}$. We use this ensemble to turn an SDR method that targets the central subspace for a real-valued response into one that targets the Fréchet central subspace. We will focus on three inverse regression methods: SIR, SAVE, and DR.

A key step in implementing both of the above schemes is to construct an ensemble $\mathfrak{F}$ in each case. For this purpose, we assume that the metric space $(\Omega_Y, d)$ is continuously embeddable into a Hilbert space. Under this assumption, one can construct a universal reproducing kernel, which leads to an $\mathfrak{F}$ that satisfies the required characterizing property.

As with classical SDR, the Fréchet SDR can also be used to assist data visualization. To illustrate this aspect, we consider an application involving factors that influence the mortality distributions of 162 countries (see Section 7 for details). For each country, the response is a histogram with the numbers of deaths for each five-year period from age 0 to age 100, which is smoothed to produce a density estimate, as shown in panel (a) of Figure 1. We considered nine predictors characterizing each country’s demography, economy, labor market, health care, and environment. Using our ensemble method, we obtained a set of sufficient predictors. In panel (b) of Figure 1, we show the mortality densities plotted against the first sufficient predictor. A clear pattern is shown in the plot: for countries with low values of the first sufficient predictor, the modes of the mortality distributions are at lower ages, and there are upticks at age 0, indicating high infant mortality rates; for countries with high values of the first sufficient predictor, the modes of the mortality distributions are significantly higher, and there are no upticks at age 0, indicating very low infant mortality rates. The information provided by the plot is useful, and many further insights can be gained about what affects the mortality distribution by taking a careful look at the loadings of the first sufficient predictor, as will be detailed in Section 7.

Figure 1:

Data visualization in Fréchet regression for mortality distributions of 162 countries. Panel (a) plots mortality densities that are placed in random order, and Panel (b) plots mortality densities versus the first sufficient predictor estimated by our ensemble method.

The rest of this paper is organized as follows. Section 2 defines the Fréchet SDR problem and provides sufficient conditions for a family $\mathfrak{F}$ to characterize the central subspace. Section 3 then constructs ensembles $\mathfrak{F}$ for the Wasserstein space of univariate distributions, the space of symmetric positive definite matrices, and a special Riemannian manifold, the sphere. Section 4 proposes the CMS-ensembles by extending five SDR methods that target the central mean subspace for a real-valued response, namely OLS, PHD, IHT, OPG, and MAVE, and the CS-ensembles by extending three SDR methods that target the central subspace for a real-valued response, namely SIR, SAVE, and DR. Section 5 establishes the convergence rates of the proposed methods. Section 6 uses simulation studies to examine the numerical performance of the different ensemble estimators in different settings, including distributional responses and covariance matrix responses. In Section 7, we analyze the mortality distribution data to demonstrate the usefulness of our methods. Section 8 includes a few concluding remarks and discussion. All proofs, additional simulation studies, and real applications are presented in the Supplementary Material.

2. Characterization of the Fréchet Central Subspace

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $(\Omega_Y, d)$ be a metric space with metric $d$ and $\mathcal{B}_Y$ the Borel σ-field generated by the open sets in $\Omega_Y$. Let $\Omega_X$ be a subset of $\mathbb{R}^p$ and $\mathcal{B}_X$ the Borel σ-field generated by the open sets in $\Omega_X$. Let $(X, Y)$ be a random element mapping from $\Omega$ to $\Omega_X \times \Omega_Y$, measurable with respect to the product σ-field $\mathcal{B}_X \times \mathcal{B}_Y$. We denote the marginal distributions of $X$ and $Y$ by $P_X$ and $P_Y$, and the conditional distributions of $Y \mid X$ and $X \mid Y$ by $P_{Y|X}$ and $P_{X|Y}$, respectively. We formulate the Fréchet SDR problem as finding a subspace $\mathcal{S}$ of $\mathbb{R}^p$ such that $Y$ and $X$ are independent conditioning on $P_{\mathcal{S}}X$:

$Y \perp\!\!\!\perp X \mid P_{\mathcal{S}}X$,    (1)

where $P_{\mathcal{S}}$ is the projection onto $\mathcal{S}$ with respect to the inner product in $\mathbb{R}^p$. As in classical SDR, the intersection of all such subspaces $\mathcal{S}$ still satisfies (1) under mild conditions (Cook and Li, 2002); indeed, this does not require any structure on the space $\Omega_Y$. A sufficient condition given in Yin et al. (2008) is that $X$ is supported on a matching set; for example, if the support of $X$ is convex, then this sufficient condition is satisfied. We call this intersection the Fréchet central subspace and denote it by $\mathcal{S}_{Y|X}$. Similar to Cook (1996), it can be shown that if the support of $X$ is open and convex, the Fréchet central subspace $\mathcal{S}_{Y|X}$ satisfies (1).

2.1. Two types of ensembles and their sufficient conditions

Let $\mathfrak{F}$ be a family of measurable functions $f: \Omega_Y \to \mathbb{R}$, and for an $f \in \mathfrak{F}$, let $\mathcal{S}_{E[f(Y)|X]}$ be the central mean subspace of $f(Y)$ versus $X$. As mentioned in Section 1, we use two types of ensembles to recover the Fréchet central subspace. The first type is any $\mathfrak{F}$ that satisfies

$\mathrm{span}\{\mathcal{S}_{E[f(Y)|X]} : f \in \mathfrak{F}\} = \mathcal{S}_{Y|X}$.    (2)

This is the same ensemble as that in Yin and Li (2011), except that, here, the right-hand side is the Fréchet central subspace. The relation (2) allows us to recover the Fréchet central subspace $\mathcal{S}_{Y|X}$ from a collection of classical central mean subspaces. We call a class $\mathfrak{F}$ that satisfies (2) a CMS-ensemble. The second type of ensemble is any family $\mathfrak{F}$ that satisfies

$\mathrm{span}\{\mathcal{S}_{f(Y)|X} : f \in \mathfrak{F}\} = \mathcal{S}_{Y|X}$,    (3)

which we call a CS-ensemble. Proposition 1 shows that a CMS-ensemble is also a CS-ensemble.

PROPOSITION 1.

If F is a CMS-ensemble, then it is a CS-ensemble.

We next develop a sufficient condition for an $\mathfrak{F}$ to be a CMS-ensemble and hence also a CS-ensemble. Let $\mathfrak{B} = \{I_B : B \text{ is a Borel set in } \Omega_Y\}$ be the family of measurable indicator functions on $\Omega_Y$, and let $\mathrm{span}(\mathfrak{F}) = \{\sum_{i=1}^{k} \alpha_i f_i : k \in \mathbb{N},\ \alpha_1, \dots, \alpha_k \in \mathbb{R},\ f_1, \dots, f_k \in \mathfrak{F}\}$ be the linear span of $\mathfrak{F}$, where $\mathbb{N} = \{1, 2, \dots\}$. Yin and Li (2011) showed that if $\mathfrak{F}$ is a subset of $L_2(P_Y)$ that is dense in $\mathfrak{B}$, then (2) holds for the classical $\mathcal{S}_{Y|X}$. Here, we generalize that result to our setting by requiring only $\mathrm{span}(\mathfrak{F})$ to be dense in $\mathfrak{B}$.

LEMMA 1.

If $\mathfrak{F}$ is a subset of $L_2(P_Y)$ and $\mathrm{span}(\mathfrak{F})$ is dense in $\mathfrak{B}$ with respect to the $L_2(P_Y)$-metric, then $\mathfrak{F}$ is a CMS-ensemble and hence also a CS-ensemble.

2.2. Construction of the CMS-ensemble

To construct a CMS-ensemble, we resort to the notion of the universal kernel. Let $C(\Omega_Y)$ be the family of continuous real-valued functions on $\Omega_Y$. When $\Omega_Y$ is compact, Steinwart (2001) defined a continuous kernel $\kappa$ as universal (we refer to it as c-universal) if its associated RKHS $\mathcal{H}_Y$ is dense in $C(\Omega_Y)$ under the uniform norm. To relax the compactness assumption, Micchelli et al. (2006) proposed the following notion of universality, which is referred to as cc-universal in Sriperumbudur et al. (2011). For any compact set $K \subseteq \Omega_Y$, let $\mathcal{H}_Y(K)$ be the RKHS generated by $\{\kappa(\cdot, y) : y \in K\}$. We should note that a member $f$ of $\mathcal{H}_Y(K)$ is supported on $\Omega_Y$, rather than on $K$. Let $f|_K$ denote the restriction of $f$ to $K$, and $C(K)$ the class of all continuous functions on $K$ with respect to the topology of $(\Omega_Y, d)$ restricted to $K$.

DEFINITION 1.

(Micchelli et al., 2006) We say that $\kappa$ is cc-universal if, for any compact set $K \subseteq \Omega_Y$, any member $f$ of $C(K)$, and any $\epsilon > 0$, there is an $h \in \mathcal{H}_Y(K)$ such that $\|f - h|_K\|_\infty = \sup_{y \in K} |f(y) - h(y)| < \epsilon$.

When $\Omega_Y$ is compact, Sriperumbudur et al. (2011) showed that the two notions of universality are equivalent. In the following, we investigate the conditions under which a metric space has a cc-universal kernel and how to construct such a kernel when it does.

Micchelli et al. (2006) showed that when $\Omega_Y = \mathbb{R}^d$, many standard kernels, including Laplacian kernels and Gaussian RBF kernels, are cc-universal. Unfortunately, when $\Omega_Y$ is a general metric space, direct extensions of these types of kernels, for example, $\kappa(y, y') = \exp(-\gamma d(y, y')^2)$, are no longer guaranteed to be cc-universal. Christmann and Steinwart (2010) showed that for a compact $\Omega_Y$, if there exist a separable Hilbert space $\mathcal{H}$ and a continuous injection $\rho: \Omega_Y \to \mathcal{H}$, then for any analytic function $F: \mathbb{R} \to \mathbb{R}$ whose Taylor series at zero has strictly positive coefficients, the function $\kappa(y, y') = F(\langle \rho(y), \rho(y') \rangle_{\mathcal{H}})$ defines a c-universal kernel on $\Omega_Y$. They also provided an analogous construction of the Gaussian-type kernel in this setting. We extend this result to construct cc-universal kernels on non-compact metric spaces. The proof is given in the Supplementary Material.

PROPOSITION 2.

Suppose $(\Omega_Y, d)$ is a complete and separable metric space, and there exist a separable Hilbert space $\mathcal{H}$ and a continuous injection $\rho: \Omega_Y \to \mathcal{H}$. If $F: \mathbb{R} \to \mathbb{R}$ is an analytic function of the form $F(t) = \sum_{n=0}^{\infty} a_n t^n$ with $a_n \ge 0$ for all $n \ge 1$, then the function $\kappa: \Omega_Y \times \Omega_Y \to \mathbb{R}$ defined by $\kappa(y, y') = F(\langle \rho(y), \rho(y') \rangle_{\mathcal{H}})$ is a positive definite kernel. Furthermore, if $a_n > 0$ for all $n \ge 1$, then $\kappa$ is a cc-universal kernel on $\Omega_Y$.

As an example, Corollary 1 shows that the Gaussian-type kernel is cc-universal on $\Omega_Y$.

COROLLARY 1.

Suppose the conditions in Proposition 2 are satisfied. Then the Gaussian-type kernel $\kappa_\gamma(y, y') = \exp(-\gamma \|\rho(y) - \rho(y')\|_{\mathcal{H}}^2)$, where $\gamma > 0$, is cc-universal. Furthermore, if the continuous injection $\rho: \Omega_Y \to \mathcal{H}$ is isometric, that is, $d(y, y') = \|\rho(y) - \rho(y')\|_{\mathcal{H}}$, then the Gaussian-type kernel $\kappa_\gamma(y, y') = \exp(-\gamma d^2(y, y'))$ is cc-universal.

The second part of Corollary 1 is straightforward since an isometry is an injection. Similar results can be established for the Laplacian-type kernel $\kappa_\gamma(y, y') = \exp(-\gamma \|\rho(y) - \rho(y')\|_{\mathcal{H}})$.
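To make the embedding-based construction concrete, the following sketch (our illustration, not code from the paper) evaluates a Gaussian- or Laplacian-type Gram matrix from an arbitrary metric; the callable `dist` and the sample `Y` are placeholders for whichever metric space is in use.

```python
import numpy as np

def gram_matrix(dist, Y, gamma, kernel="gaussian"):
    """Gram matrix K[i, j] = kappa(Y[i], Y[j]) for a metric `dist`.

    When the metric embeds isometrically into a Hilbert space
    (Corollary 1), exp(-gamma * d^2) is cc-universal; the Laplacian
    form exp(-gamma * d) is the analogue discussed above.
    """
    n = len(Y)
    D = np.array([[dist(Y[i], Y[j]) for j in range(n)] for i in range(n)])
    return np.exp(-gamma * D**2) if kernel == "gaussian" else np.exp(-gamma * D)
```

The columns of this matrix are the ensemble functions $\kappa(\cdot, y_j)$ evaluated on the sample, which is all that the downstream ensemble estimators require.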

As far as we know, the idea of embedding a (semi-)metric space into a Hilbert space was first proposed in Berg et al. (1984) and has been revisited by Sejdinovic et al. (2013) and Dubey and Müller (2020b). By Berg et al. (1984, Theorem 2.2), $\exp(-\gamma d^2(\cdot, \cdot))$ is positive definite for all $\gamma > 0$ if and only if $d^2(\cdot, \cdot)$ is negative definite, which is guaranteed when the metric space can be isometrically embedded into a Hilbert space.

The continuous embedding condition in Proposition 2 covers several metric spaces often encountered in statistical applications. Section 3 employs it to construct cc-universal kernels on the space of univariate distributions endowed with the Wasserstein-2 distance, the space of symmetric positive definite matrices endowed with the Frobenius distance, and spheres endowed with the geodesic distance.

By using the notion of a regular probability measure, we connect the cc-universal kernel on $(\Omega_Y, d)$ with the CMS-ensemble, which is the theoretical foundation of our method. Recall that a measure $P_Y$ on $(\Omega_Y, d)$ is regular if, for any Borel subset $B \subseteq \Omega_Y$ and any $\varepsilon > 0$, there are a compact set $K \subseteq B$ and an open set $G \supseteq B$ such that $P(G \setminus K) < \varepsilon$.

THEOREM 1.

Suppose, on the metric space $(\Omega_Y, d)$, (1) $\kappa$ is a bounded cc-universal kernel and (2) $P_Y$ is a regular probability measure. Then the family $\mathfrak{F} = \{\kappa(\cdot, y) : y \in \Omega_Y\}$ is a CMS-ensemble.

The proof of Theorem 1 is given in the Supplementary Material. Condition (2), which requires $P_Y$ to be regular, is quite mild: it is known that any Borel measure on a complete and separable metric space is regular (see Granirer (1970, Chapter 2: Theorem 1.2, Theorem 3.2)). Thus, a sufficient condition for Condition (2) is that $(\Omega_Y, d)$ is complete and separable, which is satisfied by all the metric spaces we consider. Specifically, if $M$ is separable and complete, then so is the Wasserstein-2 space $\mathcal{W}_2(M)$ (Panaretos and Zemel, 2020, Proposition 2.2.8, Theorem 2.2.7); therefore, $\mathcal{W}_2(\mathbb{R})$ is complete and separable. Similarly, the SPD matrix space endowed with the Frobenius distance and the sphere endowed with the geodesic distance are complete and separable metric spaces. Furthermore, the Gaussian and Laplacian kernels we consider satisfy Condition (1) in Theorem 1.

Thus, Proposition 2 and Theorem 1 provide a general mechanism for constructing a CMS-ensemble over any separable and complete metric space without a linear structure, provided it can be continuously embedded in a separable Hilbert space. For the case where multiple cc-universal kernels exist, we design a cross-validation framework in Section 6 to choose the kernel type and the parameter $\gamma$.

3. Important Metric Spaces and their CMS Ensembles

This section gives the construction of CMS-ensembles for three commonly used metric spaces.

3.1. Wasserstein space

Let $I$ be $\mathbb{R}$ or a closed interval of $\mathbb{R}$, $\mathcal{B}(I)$ the σ-field of Borel subsets of $I$, and $\mathcal{P}(I)$ the collection of all probability measures on $(I, \mathcal{B}(I))$. The Wasserstein space $\mathcal{W}_2(I)$ is defined as the subset of $\mathcal{P}(I)$ with finite second moments, that is, $\mathcal{W}_2(I) = \{\mu \in \mathcal{P}(I) : \int_I t^2 \, d\mu(t) < \infty\}$, endowed with the quadratic Wasserstein distance $d_W(\mu_1, \mu_2) = \{\int_0^1 (F_{\mu_1}^{-1}(s) - F_{\mu_2}^{-1}(s))^2 \, ds\}^{1/2}$, where $\mu_1$ and $\mu_2$ are members of $\mathcal{W}_2(I)$ and $F_{\mu_1}^{-1}$ and $F_{\mu_2}^{-1}$ are the quantile functions of $\mu_1$ and $\mu_2$, which we assume to be well defined. This distance can be equivalently written as $d_W(\mu_1, \mu_2) = \{\int_I (F_{\mu_1}^{-1}(F_{\mu_2}(t)) - t)^2 \, d\mu_2(t)\}^{1/2}$. The set $\mathcal{W}_2(I)$ endowed with $d_W$ is a metric space with a formal Riemannian structure (Ambrosio et al., 2004).

Here, we present some basic results that characterize $\mathcal{W}_2(I)$, whose proofs can be found, for example, in Ambrosio et al. (2004) and Bigot et al. (2017). For $\mu_1, \mu_2 \in \mathcal{W}_2(I)$, we say that a $\mathcal{B}(I)$-measurable map $r: I \to I$ transports $\mu_1$ to $\mu_2$ if $\mu_2 = \mu_1 \circ r^{-1}$. This relation is often written as $\mu_2 = r \# \mu_1$. Let $\mu_0 \in \mathcal{W}_2(I)$ be a reference measure with a continuous $F_{\mu_0}$. The tangent space at $\mu_0$ is $T_{\mu_0} = \mathrm{cl}_{L_2(\mu_0)}\{\lambda (F_\mu^{-1} \circ F_{\mu_0} - \mathrm{id}) : \mu \in \mathcal{W}_2(I), \lambda > 0\}$, where, for a set $A \subseteq L_2(\mu_0)$, $\mathrm{cl}_{L_2(\mu_0)}(A)$ denotes the $L_2(\mu_0)$-closure of $A$. The exponential map $\exp_{\mu_0}$ from $T_{\mu_0}$ to $\mathcal{W}_2(I)$, defined by $\exp_{\mu_0}(r) = (r + \mathrm{id}) \# \mu_0$, is surjective. Therefore its inverse, $\log_{\mu_0}: \mathcal{W}_2(I) \to T_{\mu_0}$, defined by $\log_{\mu_0}(\mu) = F_\mu^{-1} \circ F_{\mu_0} - \mathrm{id}$, is well defined on $\mathcal{W}_2(I)$. It is well known that the exponential map, restricted to the image $\log_{\mu_0}(\mathcal{W}_2(I))$ of the log map, is an isometric homeomorphism (Bigot et al., 2017). Therefore, $\log_{\mu_0}$ is a continuous injection from $\mathcal{W}_2(I)$ to $L_2(\mu_0)$. We can then construct CMS-ensembles using the general constructive method provided by Theorem 1 and Proposition 2. The next proposition gives two such constructions, where the subscripts “G” and “L” for the two kernels refer to “Gaussian” and “Laplacian”, respectively.

PROPOSITION 3.

For $I \subseteq \mathbb{R}$, $\kappa_G(y, y') = \exp(-\gamma \|\log_{\mu_0}(y) - \log_{\mu_0}(y')\|_{L_2(\mu_0)}^2) = \exp(-\gamma d_W^2(y, y'))$ and $\kappa_L(y, y') = \exp(-\gamma \|\log_{\mu_0}(y) - \log_{\mu_0}(y')\|_{L_2(\mu_0)}) = \exp(-\gamma d_W(y, y'))$ are both cc-universal kernels on $\mathcal{W}_2(I)$. Consequently, the families $\mathfrak{F}_G = \{\exp(-\gamma d_W^2(\cdot, t)) : t \in \mathcal{W}_2(I)\}$ and $\mathfrak{F}_L = \{\exp(-\gamma d_W(\cdot, t)) : t \in \mathcal{W}_2(I)\}$ are CMS-ensembles.

3.2. Space of symmetric positive definite matrices

We first introduce some notation. Let $\mathrm{Sym}(r)$ be the set of $r \times r$ symmetric matrices with real entries and $\mathrm{Sym}^+(r)$ the set of $r \times r$ symmetric positive definite (SPD) matrices. For any $Y \in \mathbb{R}^{r \times r}$, the matrix exponential of $Y$ is defined by the infinite power series $\exp(Y) = \sum_{k=0}^{\infty} Y^k / k!$. For any $X \in \mathrm{Sym}^+(r)$, the matrix logarithm of $X$ is defined as any $r \times r$ matrix $Y$ such that $\exp(Y) = X$, and is denoted by $\log(X)$.

Let $d_F$ be the Frobenius metric. Then $(\mathrm{Sym}^+(r), d_F)$ is a metric space continuously embedded by the identity mapping in $\mathrm{Sym}(r)$, which is a Hilbert space with the Frobenius inner product $\langle A, B \rangle = \mathrm{tr}(AB)$. Moreover, the identity mapping $\mathrm{id}: \mathrm{Sym}^+(r) \to \mathrm{Sym}(r)$ is obviously isometric. Therefore, by Corollary 1, the two types of radial basis function kernels for the Wasserstein space extend directly to $\mathrm{Sym}^+(r)$. That is, let $\kappa_G(y, y') = \exp(-\gamma d_F^2(y, y'))$ and $\kappa_L(y, y') = \exp(-\gamma d_F(y, y'))$; then $\mathfrak{F}_G = \{\kappa_G(\cdot, y) : y \in \mathrm{Sym}^+(r)\}$ and $\mathfrak{F}_L = \{\kappa_L(\cdot, y) : y \in \mathrm{Sym}^+(r)\}$ are CMS-ensembles.

Another widely used metric on $\mathrm{Sym}^+(r)$ is the log-Euclidean distance, defined as $d_{\log}(Y_1, Y_2) = \|\log(Y_1) - \log(Y_2)\|_F$. It pulls the Frobenius metric on $\mathrm{Sym}(r)$ back to $\mathrm{Sym}^+(r)$ through the matrix logarithm map. The matrix logarithm $\log(\cdot)$ is a continuous injection into the Hilbert space $\mathrm{Sym}(r)$. By Corollary 1, the two radial basis function kernels $\kappa_{G,\log}(y, y') = \exp(-\gamma d_{\log}^2(y, y'))$ and $\kappa_{L,\log}(y, y') = \exp(-\gamma d_{\log}(y, y'))$ are cc-universal. Then, $\mathfrak{F}_{G,\log} = \{\kappa_{G,\log}(\cdot, y) : y \in \mathrm{Sym}^+(r)\}$ and $\mathfrak{F}_{L,\log} = \{\kappa_{L,\log}(\cdot, y) : y \in \mathrm{Sym}^+(r)\}$ are CMS-ensembles.
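As an illustration (a sketch we add here, with the eigendecomposition-based matrix logarithm as an implementation choice), the two distances on $\mathrm{Sym}^+(r)$ can be computed as follows.

```python
import numpy as np

def spd_logm(A):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.log(w)) @ V.T

def frobenius_dist(A, B):
    """d_F: Frobenius distance, using the identity embedding into Sym(r)."""
    return np.linalg.norm(A - B, "fro")

def log_euclidean_dist(A, B):
    """d_log: pulls the Frobenius metric back to Sym+(r) via log()."""
    return np.linalg.norm(spd_logm(A) - spd_logm(B), "fro")
```

Feeding either distance into a Gaussian- or Laplacian-type kernel then yields the CMS-ensembles above.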

3.3. The sphere

Consider a random vector taking values in the sphere $S^n = \{x \in \mathbb{R}^{n+1} : \|x\| = 1\}$. To respect the nonzero curvature of $S^n$, the geodesic distance $d_g(Y_1, Y_2) = \arccos(Y_1^\top Y_2)$, which is derived from its Riemannian geometry, is often used rather than the Euclidean distance. However, the popular Gaussian-type RBF kernel $\kappa_G(y, y') = \exp(-\gamma d_g^2(y, y'))$ is not positive definite on $S^n$ (Jayasumana et al., 2013). In fact, Feragen et al. (2015) proved that for a complete Riemannian manifold $M$ with its associated geodesic distance $d_g$, the kernel $\kappa_G(y, y') = \exp(-\gamma d_g^2(y, y'))$ is positive semidefinite only if $M$ is isometric to a Euclidean space. Honeine and Richard (2010) and Jayasumana et al. (2013) proved that the Laplacian-type kernel $\kappa_L(y, y') = \exp(-\gamma d_g(y, y'))$ is positive definite on the sphere $S^n$. We show in the following proposition that $\kappa_L(y, y')$ is cc-universal.

PROPOSITION 4.

The Laplacian-type kernel $\kappa_L: S^n \times S^n \to \mathbb{R}$, defined by $\kappa_L(y, y') = \exp(-\gamma d_g(y, y'))$ where $d_g$ is the geodesic distance on $S^n$, is a cc-universal kernel for any $\gamma > 0$. Consequently, $\mathfrak{F}_L = \{\exp(-\gamma d_g(\cdot, t)) : t \in S^n\}$ is a CMS-ensemble.
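A minimal sketch (our illustration) of the geodesic distance and the cc-universal Laplacian-type kernel on the sphere; the clipping guards against floating-point round-off:

```python
import numpy as np

def geodesic_dist(y1, y2):
    """Geodesic distance d_g(y1, y2) = arccos(y1' y2) on the unit sphere."""
    return np.arccos(np.clip(np.dot(y1, y2), -1.0, 1.0))

def laplacian_kernel_sphere(y1, y2, gamma=1.0):
    """exp(-gamma * d_g): positive definite and cc-universal (Proposition 4).
    The Gaussian-type analogue exp(-gamma * d_g^2) is NOT positive definite
    on the sphere, so it must not be used here."""
    return np.exp(-gamma * geodesic_dist(y1, y2))
```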

We note that the scope of Proposition 2 goes beyond Riemannian manifolds. For example, the Gaussian-type kernel on the space of Borel probability measures on a compact metric space with the Prohorov metric is universal (Christmann and Steinwart, 2010). We construct an explicit embedding and a universal kernel for any metric space of negative type in Theorem 4 in the Supplementary Material.

4. Fréchet Sufficient Dimension Reduction

In this section, we develop the Fréchet SDR estimators based on the CMS-ensembles and CS-ensembles and establish their Fisher consistency.

4.1. Ensembled moment estimators via CMS ensembles

We first develop a general class of Fréchet SDR estimators based on the ensembled moment estimators of the CMS, such as OLS, PHD, and IHT. Let $\mathcal{P}_{XY}$ be the collection of all distributions of $(X, Y)$, and let $M: \mathcal{P}_{XY} \to \mathbb{R}^{p \times p}$ be a measurable function to be used as an estimator of the Fréchet central subspace $\mathcal{S}_{Y|X}$. A function defined on $\mathcal{P}_{XY}$ is called a statistical functional; see, for example, Chapter 9 of Li (2018). In the SDR literature, such a function is also called a candidate matrix (Ye and Weiss, 2003). Let $F_{XY}$ be a generic member of $\mathcal{P}_{XY}$, $F_{XY}^{(0)}$ the true distribution of $(X, Y)$, and $\hat{F}_{XY}^{(n)}$ the empirical distribution of $(X, Y)$ based on an i.i.d. sample $(X_1, Y_1), \dots, (X_n, Y_n)$. Extending the terminology of classical SDR (see, for example, Li 2018, Chapter 2), we say that the estimate $M(\hat{F}_{XY}^{(n)})$ is unbiased if $\mathrm{span}\{M(F_{XY}^{(0)})\} \subseteq \mathcal{S}_{Y|X}$, exhaustive if $\mathrm{span}\{M(F_{XY}^{(0)})\} \supseteq \mathcal{S}_{Y|X}$, and Fisher consistent if $\mathrm{span}\{M(F_{XY}^{(0)})\} = \mathcal{S}_{Y|X}$. We refer to $M$ as the Fréchet candidate matrix.

Suppose we are given a CMS-ensemble $\mathfrak{F}$. Let $M_0: \mathcal{P}_{XY} \times \mathfrak{F} \to \mathbb{R}^{p \times p}$ be a function to be used as an estimator of $\mathcal{S}_{E[f(Y)|X]}$ for each $f$. In the classical sense, this is not a statistical functional, as it involves the additional set $\mathfrak{F}$. So, we redefine unbiasedness, exhaustiveness, and Fisher consistency for this type of augmented statistical functional.

DEFINITION 2.

We say that $M_0$ is unbiased for estimating $\{\mathcal{S}_{E[f(Y)|X]} : f \in \mathfrak{F}\}$ if, for each $f \in \mathfrak{F}$, $\mathrm{span}\{M_0(F_{XY}^{(0)}, f)\} \subseteq \mathcal{S}_{E[f(Y)|X]}$. Exhaustiveness and Fisher consistency of $M_0$ are defined by replacing $\subseteq$ in the above by $\supseteq$ and $=$, respectively.

Note that $M_0(\cdot, f)$ is an estimator of the classical central mean subspace $\mathcal{S}_{E[f(Y)|X]}$, as $f(Y)$ is a random number rather than a random object. We refer to $M_0$ as the ensemble candidate matrix, or, when confusion is possible, the CMS-ensemble candidate matrix. Our goal is to construct a Fréchet candidate matrix $M: \mathcal{P}_{XY} \to \mathbb{R}^{p \times p}$ from the ensemble candidate matrix $M_0: \mathcal{P}_{XY} \times \mathfrak{F} \to \mathbb{R}^{p \times p}$. To do so, we assume $\mathfrak{F}$ is of the form $\{\kappa(\cdot, y) : y \in \Omega_Y\}$, where $\kappa: \Omega_Y \times \Omega_Y \to \mathbb{R}$ is a cc-universal kernel. Given such an $\mathfrak{F}$ and $M_0$, we define $M$ as follows:

$M(F_{XY}) = \int_{\Omega_Y} M_0(F_{XY}, \kappa(\cdot, y)) \, dF_Y(y)$,

where $F_Y$ is the marginal distribution of $Y$ derived from $F_{XY}$.

We now adapt several estimates for the classical central mean subspace to the estimation of Fréchet SDR: the ordinary least squares (OLS; Li and Duan 1989), the principal Hessian directions (PHD; Li 1992), and the Iterative Hessian Transformation (IHT; Cook and Li 2002). These estimates are based on sample moments and require additional conditions on the predictor X for their unbiasedness. Specifically, we make the following assumptions:

ASSUMPTION 1.

  1. Linear Conditional Mean (LCM): $E(X \mid \beta^\top X)$ is a linear function of $\beta^\top X$, where $\beta$ is a basis matrix of the Fréchet central subspace $\mathcal{S}_{Y|X}$;

  2. Constant Conditional Variance (CCV): $\mathrm{var}(X \mid \beta^\top X)$ is a nonrandom matrix.

Under the LCM assumption, the ensembled OLS and IHT are unbiased for estimating the Fréchet central subspace; under both assumptions, the ensembled PHD is unbiased for estimating $\mathcal{S}_{Y|X}$. More detailed discussions of the unbiasedness and Fisher consistency of the ensemble estimators are presented in Section 4.4. In practice, the two assumptions above cannot be checked directly since we do not know $\beta$. However, as shown by Eaton (1986), the LCM assumption holds for all $\beta$ if and only if the distribution of $X$ is elliptical. If, in addition, $X$ is multivariate normal, then the CCV assumption is satisfied. This means that once the marginal distribution of the predictor $X$ is regular, Assumption 1 holds without being affected by the nonlinear nature of the response. Currently, the scatter plot matrix is the most commonly used empirical method to check the elliptical distribution assumption. If non-elliptical features are observed, one can use marginal transformations of the predictors, such as the Box-Cox transformation, to mitigate the non-ellipticity problem. Furthermore, in practice, SDR methods that require ellipticity usually still work reasonably well even when the elliptical distribution assumption is violated, particularly when the dimension $p$ of $X$ is high. See Hall and Li (1993) and Li and Yin (2007) for the theoretical support. Our simulation results in Section 6 corroborate this phenomenon.

It is most convenient to construct these ensemble estimators using standardized predictors. As stated in the next proposition, the theoretical basis for doing so is an equivariant property of the Fréchet central subspace.

PROPOSITION 5.

If $\mathcal{S}_{Y|X}$ is the Fréchet central subspace, $A \in \mathbb{R}^{p \times p}$ is a non-singular matrix, and $b$ is a vector in $\mathbb{R}^p$, then $\mathcal{S}_{Y|AX+b} = (A^\top)^{-1} \mathcal{S}_{Y|X}$.

The proof is essentially the same as that for the classical central subspace (see, for example, Li, 2018, page 24) and is omitted. Using this property, we first transform $X$ to $Z = \mathrm{var}(X)^{-1/2}(X - E(X))$, estimate the Fréchet central subspace $\mathcal{S}_{Y|Z}$, and then transform it back by $\mathrm{var}(X)^{-1/2} \mathcal{S}_{Y|Z}$, which is the same as $\mathcal{S}_{Y|X}$. The candidate matrices $M_0$ and $M$ for the ensembled OLS, PHD, and IHT are formulated in Remark 1; detailed motivation for each can be found in Li (2018, Chapter 8). The sample estimates can then be constructed by replacing the expectations in $M_0$ and $M$ with sample moments whenever possible. Algorithm 1 summarizes the steps to implement an ensembled moment estimator, where $\kappa_c(y, y')$ stands for the centered kernel $\kappa(y, y') - E_n[\kappa(Y, y')]$.

Algorithm 1.

Fréchet OLS, PHD, IHT, SIR, SAVE, DR

Step 1. Standardize the predictors: compute the sample mean $\hat\mu = E_n(X)$ and sample variance $\hat\Sigma = \mathrm{var}_n(X)$, and let $Z_i = \hat\Sigma^{-1/2}(X_i - \hat\mu)$.
Step 2. Compute $\hat M_0(y)$ for $y = Y_1, \dots, Y_n$ according to Remarks 1 and 2.
Step 3. Compute $\hat M = n^{-1} \sum_{i=1}^{n} \hat M_0(Y_i)$.
Step 4. Let $\hat v_1, \dots, \hat v_{d_0}$ be the leading $d_0$ eigenvectors of $\hat M$, and let $\hat u_k = \hat\Sigma^{-1/2} \hat v_k$ for $k = 1, \dots, d_0$. Use $\hat u_1, \dots, \hat u_{d_0}$ as an estimated basis of $\mathcal{S}_{Y|X}$.

REMARK 1.

The candidate matrices $M_0(y)$ for Fréchet OLS, PHD, and IHT are, respectively, $C(y) C(y)^\top$ with $C(y) = \mathrm{cov}[Z, \kappa(Y, y)]$; $\{E[Z Z^\top \kappa_c(Y, y)]\}^2$; and $W(y) W(y)^\top$ with $W(y) = (C(y), H(y) C(y), \dots, H(y)^r C(y))$ and $H(y) = E[Z Z^\top \kappa_c(Y, y)]$.
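To illustrate how Algorithm 1 and Remark 1 fit together, here is a hedged sketch (our illustration, not the authors' code) of the Fréchet OLS estimator, the simplest candidate matrix; it assumes a precomputed Gram matrix `K` with `K[i, j]` $= \kappa(Y_i, Y_j)$ and a nonsingular sample covariance.

```python
import numpy as np

def frechet_ols(X, K, d0):
    """Sketch of Algorithm 1 with the Frechet OLS candidate matrix.

    X  : (n, p) predictors; K : (n, n) Gram matrix kappa(Y_i, Y_j);
    d0 : dimension of the Frechet central subspace.
    Returns a (p, d0) estimated basis of S_{Y|X}.
    """
    n, p = X.shape
    # Step 1: standardize predictors, Z_i = Sigma^{-1/2} (X_i - mu)
    mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = V @ np.diag(w**-0.5) @ V.T
    Z = (X - mu) @ Sigma_inv_sqrt
    # Steps 2-3: M = n^{-1} sum_j C(Y_j) C(Y_j)^T, C(y) = cov[Z, kappa(Y, y)]
    Kc = K - K.mean(axis=0)        # center kappa(Y_i, y) over the sample
    C = Z.T @ Kc / n               # column j holds C(Y_j)
    M = C @ C.T / n
    # Step 4: leading eigenvectors, mapped back to the original X scale
    vals, vecs = np.linalg.eigh(M)
    return Sigma_inv_sqrt @ vecs[:, np.argsort(vals)[::-1][:d0]]
```

Swapping the body of Steps 2-3 for the PHD or IHT candidate matrices of Remark 1 gives the other two moment estimators.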

4.2. Ensembled forward regression estimators via CMS ensembles

We adapt the OPG (Xia et al. 2002), a popular method for estimating the classical CMS based on nonparametric forward regression, to the estimation of the Fréchet central subspace; it does not require the LCM and CCV conditions. The adaptation of another forward regression method, MAVE, is similar and is presented in Section S.3.2 of the Supplementary Material. The framework of the statistical functional $M_0(F_{XY}, f)$ is no longer sufficient to cover this case because we now have a tuning parameter. So, we adopt the notion of a tuned statistical functional in Section 11.2 of Li (2018) to accommodate a tuning parameter.

Let $\mathcal{P}_{XY}$, $F_{XY}$, $F_{XY}^{(0)}$, and $\hat F_{XY}^{(n)}$ be as defined in Section 4.1. For simplicity, we assume the tuning parameter $h$ to be a scalar, but it could also be a vector. Given a CMS-ensemble $\mathfrak{F}$, let $T_0: \mathcal{P}_{XY} \times \mathfrak{F} \times \mathbb{R} \to \mathbb{R}^{p \times p}$ be a tuned functional to be used as an estimator of $\mathcal{S}_{E[f(Y)|X]}$ for each $f$. We refer to $T_0$ as the ensemble-tuned candidate matrix. The unbiasedness, exhaustiveness, and Fisher consistency of $T_0$ are defined as follows.

DEFINITION 3.

We say that $T_0$ is unbiased for estimating $\{\mathcal{S}_{E[f(Y)|X]} : f \in \mathfrak{F}\}$ if, for each $f \in \mathfrak{F}$, $\mathrm{span}\{\lim_{h \to 0} T_0(F_{XY}^{(0)}, f, h)\} \subseteq \mathcal{S}_{E[f(Y)|X]}$. Exhaustiveness and Fisher consistency of $T_0$ are defined by replacing $\subseteq$ in the above by $\supseteq$ and $=$, respectively.

Given $\mathfrak{F} = \{\kappa(\cdot, y) : y \in \Omega_Y\}$ and $T_0$, we define the tuned Fréchet candidate matrix $T: \mathcal{P}_{XY} \times \mathbb{R} \to \mathbb{R}^{p \times p}$ as $T(F_{XY}, h) = \int_{\Omega_Y} T_0(F_{XY}, \kappa(\cdot, y), h) \, dF_Y(y)$. We say that the estimate $T(\hat F_{XY}^{(n)}, h)$ is unbiased if $\mathrm{span}(\lim_{h \to 0} T(F_{XY}^{(0)}, h)) \subseteq \mathcal{S}_{Y|X}$, exhaustive if $\mathrm{span}(\lim_{h \to 0} T(F_{XY}^{(0)}, h)) \supseteq \mathcal{S}_{Y|X}$, and Fisher consistent if $\mathrm{span}(\lim_{h \to 0} T(F_{XY}^{(0)}, h)) = \mathcal{S}_{Y|X}$.

In the following, for a function $h(x)$, we use $\partial h(X) / \partial X$ to denote $\partial h(x) / \partial x$ evaluated at $x = X$. The OPG aims to estimate the central mean subspace $\mathcal{S}_{E[\kappa(Y,y)|X]}$ by $E\left[\frac{\partial E(\kappa(Y,y) \mid X)}{\partial X} \frac{\partial E(\kappa(Y,y) \mid X)}{\partial X^\top}\right]$, where the gradient $\partial E(\kappa(Y,y) \mid X) / \partial X$ is estimated by local linear approximation as follows. Let $K_0: \mathbb{R} \to [0, \infty)$ be a kernel function as used in kernel estimation. For any $v \in \mathbb{R}^p$ and bandwidth $h > 0$, let $K_h(v) = h^{-p} K_0(\|v\| / h)$. At the population level, for fixed $x \in \Omega_X$ and $y \in \Omega_Y$, we minimize the objective function

$E\{[\kappa(Y, y) - a - b^\top (X - x)]^2 K_h(X - x)\} / E[K_h(X - x)]$    (4)

over all $a \in \mathbb{R}$ and $b \in \mathbb{R}^p$. The minimizer depends on $x$ and $y$, and we write it as $(a_h(x, y), b_h(x, y))$. The ensemble-tuned candidate matrix for estimating the central mean subspace $\mathcal{S}_{E[\kappa(Y,y)|X]}$ is $T_0(F_{XY}, \kappa(\cdot, y), h) = E[b_h(X, y) b_h(X, y)^\top]$, and the tuned Fréchet candidate matrix is $T(F_{XY}, h) = E[b_h(X, Y) b_h(X, Y)^\top]$.

At the sample level, we minimize, for each j,k=1,,n, the empirical objective function

$\sum_{i=1}^{n} w_h(X_i, X_j) [\kappa_\gamma(Y_i, Y_k) - a_{jk} - b_{jk}^\top (X_i - X_j)]^2$    (5)

over $a_{jk} \in \mathbb{R}$ and $b_{jk} \in \mathbb{R}^p$, where $w_h(X_i, X_j) = K_h(X_i - X_j) / \sum_{l=1}^{n} K_h(X_l - X_j)$. Following Xia et al. (2002), we take the bandwidth to be $h = c_0 n^{-1/(p_0 + 6)}$, where $p_0 = \max\{p, 3\}$ and $c_0 = 2.34$; this is slightly larger than the bandwidth $n^{-1/(p+4)}$ that is optimal in terms of the mean integrated squared error. As proposed in Li (2018, Lemma 11.6), instead of solving (5) for $b_{jk}$ $n^2$ times, we solve a multivariate weighted least squares problem to obtain $b_{j1}, \dots, b_{jn}$ simultaneously. The tuned Fréchet candidate matrix is then estimated by $\hat T(\hat F_{XY}^{(n)}, h) = n^{-2} \sum_{j,k=1}^{n} \hat b_{jk} \hat b_{jk}^\top$. The first $d$ eigenvectors of $\hat T(\hat F_{XY}^{(n)}, h)$ form an estimate of the Fréchet central subspace.

We can further enhance the performance by projecting the original predictors onto the directions produced by $\hat T(\hat F_{XY}^{(n)}, h)$ to re-estimate $\mathcal{S}_{Y|X}$. Specifically, after computing $\hat T(\hat F_{XY}^{(n)}, h)$, we form the matrix $\hat B = (\hat v_1, \dots, \hat v_d)$ containing its first $d$ eigenvectors. We then replace the kernel $K_h(X_j - X_i)$ by $K_h(\hat B^\top (X_j - X_i))$ with an updated bandwidth $h$, and compute $\hat b_{jk}$ from (5) again, which leads to an updated $\hat B$. We iterate this process until convergence. In this way, we reduce the dimension of the kernel from $p$ to $d_0$ and mitigate the “curse of dimensionality.” For classical SDR problems, this procedure is called the refined OPG; see Li (2018, Chapter 11.4). We call this refined estimator the Fréchet OPG, or FOPG. The algorithm for FOPG is summarized as Algorithm 2 in the Supplementary Material.
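The following sketch implements a single (unrefined) OPG pass under illustrative choices we make here: a Gaussian weight for $K_h$ and a small ridge term to stabilize the local least squares; the refinement loop that replaces $K_h(X_i - X_j)$ with $K_h(\hat B^\top (X_i - X_j))$ is omitted.

```python
import numpy as np

def fopg_candidate(X, K, h):
    """One OPG pass for the tuned Frechet candidate matrix.

    X : (n, p) predictors; K : (n, n) Gram matrix K[i, k] = kappa(Y_i, Y_k);
    h : bandwidth. Returns n^{-2} * sum_{j,k} b_jk b_jk^T.
    """
    n, p = X.shape
    T = np.zeros((p, p))
    for j in range(n):
        D = X - X[j]                                    # rows are X_i - X_j
        w = np.exp(-0.5 * np.sum(D**2, axis=1) / h**2)  # Gaussian K_h weights
        A = np.hstack([np.ones((n, 1)), D])             # local-linear design
        G = A.T @ (A * w[:, None])
        # weighted normal equations solved for all k = 1..n at once (cf. (5))
        coef = np.linalg.solve(G + 1e-10 * np.eye(p + 1), A.T @ (K * w[:, None]))
        B = coef[1:, :]                                 # column k holds b_jk
        T += B @ B.T
    return T / n**2
```

The leading $d$ eigenvectors of the returned matrix then estimate the Fréchet central subspace.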

4.3. Ensembled inverse regression estimators via CS ensembles

In this subsection, we adapt several well-known estimators of the classical central subspace to Fréchet SDR, including SIR (Li, 1991), SAVE (Cook and Weisberg, 1991), and DR (Li and Wang, 2007). We use the CS-ensemble to combine these classical estimates through (3). Let $\mathfrak{F} = \{\kappa(\cdot, y) : y \in \Omega_Y\}$ be a CS-ensemble, where $\kappa$ is a cc-universal kernel. Let $M_0: \mathcal{P}_{XY} \times \mathfrak{F} \to \mathbb{R}^{p \times p}$ be a CS-ensemble candidate matrix, and let $M(F_{XY}) = \int_{\Omega_Y} M_0(F_{XY}, \kappa(\cdot, y)) \, dF_Y(y)$ be the Fréchet candidate matrix.

Again, we work with the standardized predictor $Z$. The candidate matrices $M_0(y)$ for the ensembled SIR, SAVE, and DR are formulated in Remark 2; detailed motivation for each can be found in Li (2018, Chapters 3, 5, 6). At the sample level, we replace any unconditional moment $E$ by the sample average $E_n$, and replace any conditional moment, such as $E(Z \mid \kappa(Y, y))$, by the corresponding slice mean. The algorithms are also included in Algorithm 1.

REMARK 2.

The candidate matrices $M_0(y)$ for Fréchet SIR, SAVE, and DR are, respectively, $\mathrm{var}\{E[Z \mid \kappa(Y, y)]\}$; $E\{[I_p - \mathrm{var}(Z \mid \kappa(Y, y))]^2\}$; and $2 E\{E[Z Z^\top \mid \kappa(Y, y)]^2\} + 2 E^2\{E[Z \mid \kappa(Y, y)] E[Z^\top \mid \kappa(Y, y)]\} + 2 E\{E[Z^\top \mid \kappa(Y, y)] E[Z \mid \kappa(Y, y)]\} E\{E[Z \mid \kappa(Y, y)] E[Z^\top \mid \kappa(Y, y)]\} - 2 I_p$.
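As one concrete instance, here is a sketch of the Fréchet SIR candidate matrix with slicing on the ensemble member $\kappa(Y, y_k)$ (our illustration; the number of slices is a placeholder choice).

```python
import numpy as np

def fsir_candidate(Z, K, n_slices=5):
    """Average the SIR candidate matrix over the n ensemble members.

    Z : (n, p) standardized predictors (mean zero, identity covariance);
    K : (n, n) Gram matrix kappa(Y_i, y_k).
    M0(y_k) = var{E[Z | kappa(Y, y_k)]}, estimated by slice means.
    """
    n, p = Z.shape
    M = np.zeros((p, p))
    for k in range(n):
        order = np.argsort(K[:, k])          # sort cases by kappa(Y_i, y_k)
        for s in np.array_split(order, n_slices):
            zbar = Z[s].mean(axis=0)         # slice mean of Z
            M += (len(s) / n) * np.outer(zbar, zbar)
    return M / n
```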

REMARK 3.

Regarding the time complexity of the Fréchet SDR methods: by construction, the ensemble estimator requires $n$ times the computing time of the original estimator, because it reapplies the original estimator for each $\kappa(\cdot, Y_i)$, $i = 1, \dots, n$. For example, if SAVE is used as the original estimator, then the largest matrix multiplication is $A_{p \times n} B_{n \times p}$, which requires $O(np^2)$ basic computation units; the largest matrix to invert or eigendecompose is $p \times p$, which requires $O(p^3)$ basic computation units. So the net computational complexity is $n \times \max\{O(np^2), O(p^3)\}$.

4.4. Fisher consistency

In this subsection, we establish the unbiasedness and Fisher consistency of the tuned Fréchet candidate matrix. As a special case, the Fréchet candidate matrix constructed by any of the moment-based methods in Section 4.1 can be viewed as a tuned Fréchet candidate matrix with the tuning parameter $h$ taken to be 0. The next theorem shows that if $T_0$ is unbiased (or Fisher consistent), then $T$ is unbiased (or Fisher consistent). In the following, we say that a measure $\mu$ on $\Omega_Y$ is strictly positive if and only if $\mu(U) > 0$ for any nonempty open set $U \subseteq \Omega_Y$. For a matrix $A$, $\|A\|$ represents the operator norm.

THEOREM 2.

Suppose $\mathfrak{F} = \{\kappa(\cdot, y) : y \in \Omega_Y\}$ is a CMS-ensemble, where $\kappa$ is a cc-universal kernel. We have the following results regarding the unbiasedness and Fisher consistency of $T$.

  1. If $T_0$ is unbiased for $\{\mathcal{S}_{E[f(Y)|X]} : f \in \mathfrak{F}\}$ and $\|T_0(F_{XY}^{(0)}, \kappa(\cdot, Y), h)\| \le G(Y)$, where $G(Y)$ is a real-valued function with $E[G(Y)] < \infty$, then $T$ is unbiased for $\mathcal{S}_{Y|X}$;

  2. If (a) $T_0$ is Fisher consistent for $\{\mathcal{S}_{E[f(Y)|X]} : f \in \mathfrak{F}\}$, (b) $T_0(F_{XY}, \kappa(\cdot, y), h)$ is positive semidefinite for each $y \in \Omega_Y$, $h \in \mathbb{R}$, and $F_{XY} \in \mathcal{P}_{XY}$, (c) $\limsup_{h \to 0} \|T_0(F_{XY}^{(0)}, \kappa(\cdot, Y), h)\| \le G(Y)$ with $E[G(Y)] < \infty$, (d) $F_Y$ is strictly positive on $\Omega_Y$, and (e) the mapping $y \mapsto \lim_{h \to 0} T_0(F_{XY}, \kappa(\cdot, y), h)$ is continuous, then $T$ is Fisher consistent for $\mathcal{S}_{Y|X}$.

We similarly develop Fisher consistency for Fréchet SDR based on the CS-ensemble, including the methods in Section 4.3. The next corollary says that if $M_0$ is Fisher consistent for $\{\mathcal{S}_{\kappa(Y,y)|X} : y \in \Omega_Y\}$, then $M$ is Fisher consistent for $\mathcal{S}_{Y|X}$. The proof is similar to that of Theorem 2 and is omitted.

COROLLARY 2.

Suppose $\mathfrak{F} = \{\kappa(\cdot, y) : y \in \Omega_Y\}$ is a CS-ensemble, where $\kappa$ is a cc-universal kernel. We have the following results regarding the unbiasedness and Fisher consistency of $M$.

  1. If $M_0$ is unbiased for $\{\mathcal{S}_{\kappa(Y,y)|X} : y \in \Omega_Y\}$, then $M$ is unbiased for $\mathcal{S}_{Y|X}$;

  2. If $M_0$ is Fisher consistent for $\{\mathcal{S}_{\kappa(Y,y)|X} : y \in \Omega_Y\}$, $M_0(F_{XY}, \kappa(\cdot, y))$ is positive semidefinite for each $y \in \Omega_Y$ and $F_{XY} \in \mathcal{P}_{XY}$, $F_Y$ is strictly positive, and the mapping $y \mapsto M_0(F_{XY}, \kappa(\cdot, y))$ is continuous, then $M$ is Fisher consistent for $\mathcal{S}_{Y|X}$.

Unbiasedness and Fisher consistency of T0 or M0 are satisfied by different sets of sufficient conditions for the moment-based or forward-regression-based estimators. We outline these conditions below.

  1. For the ensembled moment estimators in Section 4.1 and the ensembled inverse regression estimators in Section 4.3, most are unbiased under either the LCM assumption or both the LCM and CCV assumptions. For example, the unbiasedness of SIR, OLS, and IHT requires the LCM assumption, whereas the unbiasedness of SAVE, DR, and PHD requires both the LCM and CCV assumptions. The estimators SIR, OLS, IHT, and PHD are generally not exhaustive (recall that unbiasedness together with exhaustiveness is equivalent to Fisher consistency), but sufficient conditions for SAVE and DR to be exhaustive are reasonably mild (see Li and Wang (2007) and Li (2018, Chapter 6)).

  2. Sufficient conditions for the Fisher consistency of OPG are given in Li (2018, Section 11.2). Specifically, it requires: (a) the smoothing kernel $K_0$ is a spherically contoured p.d.f. with finite fourth moments; (b) the p.d.f. of $X$ is supported on $\mathbb{R}^p$ and has continuous bounded second derivatives. Note that neither the LCM nor the CCV assumption is needed for the OPG estimator.

In practice, when we observe a severe violation of the elliptical assumption among the predictors, for example, via exploratory data analysis tools such as the scatter plot matrix, it is preferable to use forward regression methods such as FOPG. Otherwise, moment-based methods are recommended, since they are faster to compute and have a parametric ($n^{-1/2}$) convergence rate. Our experience also indicates that the performance of the moment-based methods is relatively robust against violations of ellipticity as long as they are not very severe.

5. Convergence Rates of the Ensemble Estimates

In this section, we develop the convergence rates of the ensemble estimates for Fréchet SDR. To save space, we only consider the CMS-ensemble; the results for the CS-ensemble are largely parallel. To simplify the asymptotic development, we slightly modify the ensemble estimator; this does not result in any significant numerical difference from the original ensembles developed in the previous sections. For each $i = 1, \dots, n$, let $\hat F_{XY}^{(i)}$ be the empirical distribution based on the sample with the $i$th subject removed: $\{(X_1, Y_1), \dots, (X_n, Y_n)\} \setminus \{(X_i, Y_i)\}$. Our modified ensemble estimate is of the form

$T(\hat F_{XY}^{(n)}, h_n) = n^{-1} \sum_{i=1}^{n} T_0(\hat F_{XY}^{(i)}, \kappa(\cdot, Y_i), h_n)$.

The purpose of this modification is to break the dependence between the ensemble member $\kappa(\cdot, Y_i)$ and the CMS estimate, which substantially simplifies the asymptotic argument. Here, we let the tuning parameter $h_n$ depend on $n$. Again, the Fréchet candidate matrix constructed by the moment-based methods can be considered a special case with $h_n = 0$.

Rather than deriving the convergence rate of each individual ensemble estimate case by case, we will show that, under some mild conditions, the ensemble convergence rate is the same as that of the corresponding CMS estimate. Since the convergence rates of many CMS estimates are well established, including all the forward regression and sample moment-based estimates mentioned earlier, our general result covers all the CMS-ensemble estimates.

In the following, for a matrix $A$, $\|A\|$ represents the operator norm and $\|A\|_F$ the Frobenius norm. If $a_n$ and $b_n$ are sequences of positive numbers, we write $a_n \prec b_n$ if $\lim_{n \to \infty} a_n / b_n = 0$, and $a_n \preceq b_n$ if $a_n / b_n$ is a bounded sequence. We write $b_n \succ a_n$ (or $b_n \succeq a_n$) if $a_n \prec b_n$ (or $a_n \preceq b_n$). We write $a_n \asymp b_n$ if $a_n \preceq b_n$ and $b_n \preceq a_n$. Let $T_0^*(F_{XY}^{(0)}, \kappa(\cdot, y)) = \lim_{h \to 0} T_0(F_{XY}^{(0)}, \kappa(\cdot, y), h)$ and $T^*(F_{XY}^{(0)}) = \lim_{h \to 0} T(F_{XY}^{(0)}, h)$.

THEOREM 3.

Let $C_n(y) = E\|T_0(\hat F_{XY}^{(n)}, \kappa(\cdot, y), h_n) - T_0^*(F_{XY}^{(0)}, \kappa(\cdot, y))\|$, and let $a_n$ be a positive sequence of numbers satisfying $a_{n+1}/a_n \to 1$ and $a_n \succeq n^{-1/2}$. Suppose the entries of $T_0^*(F_{XY}^{(0)}, \kappa(\cdot, Y))$ have finite variances. If $E[C_n(Y)] = O(a_n)$, then $\|T(\hat F_{XY}^{(n)}, h_n) - T^*(F_{XY}^{(0)})\| = O_P(a_n)$.

The above theorem says that, under some conditions, the convergence rate of an ensemble Fréchet SDR estimator is the same as the corresponding CMS estimator. This covers all the estimators developed in Section 4. Specifically:

  1. For all the moment-based ensemble methods, such as OLS, PHD, IHT, SIR, SAVE, and DR, the ensemble candidate matrices can be written in the form $T_0(\hat F_{XY}^{(n)}, \kappa(\cdot, y)) = \hat\Lambda(y) \hat\Lambda(y)^\top$, where $\hat\Lambda(y)$ is a matrix possessing a second-order von Mises expansion, implying $E[C_n(Y)] = O(n^{-1/2})$. See, for example, Li (2018).

  2. For the nonparametric forward regression ensemble method OPG, the convergence rate of $C_n(y)$ was reported in Xia et al. (2002) as $O(h_n^2 + h_n^{-1} \delta_n^2)$, where $\delta_n = \{(\log n) / (n h_n^p)\}^{1/2}$. Although the convergence was established in terms of convergence in probability, under mild conditions such as uniform integrability, we obtain the same rate for $E[C_n(Y)]$.

6. Simulations

We evaluate the performance of the proposed Fréchet SDR methods with distributions and symmetric positive definite matrices as responses. For space considerations, an additional simulation with spherical responses is presented in the Supplementary Material.

6.1. Computational details

Choice of tuning parameters and kernel types.

We first implement a unified cross-validation procedure to select the kernel type and the parameter $\gamma$ in the kernel. For both the distributional response and the symmetric positive definite matrix response, we consider the Gaussian radial basis kernel $\kappa_G(y, y') = \exp(-\gamma d(y, y')^2)$ and the Laplacian radial basis kernel $\kappa_L(y, y') = \exp(-\gamma d(y, y'))$ as candidates for constructing the ensembles. For the parameter $\gamma$, we set the default value as

$\gamma_G = \rho_Y^2 / \sigma_G^2$, where $\sigma_G^2 = \binom{n}{2}^{-1} \sum_{i<j} d(Y_i, Y_j)^2$ and $\rho_Y = 1$,    (6)

in the Gaussian radial basis kernel, and

$\gamma_L = \rho_Y^2 / \sigma_L$, where $\sigma_L = \binom{n}{2}^{-1} \sum_{i<j} d(Y_i, Y_j)$ and $\rho_Y = 1$,

in the Laplacian radial basis kernel. The same choices were used in Lee et al. (2013) and Li and Song (2017). We then tune $\rho_Y$ and the kernel type jointly via $k$-fold cross-validation as follows. Randomly split the whole sample into $k$ subsets of roughly equal size, say $D_1, \dots, D_k$. For each $i = 1, \dots, k$, use $D_i$ as the test set and its complement as the training set. We first use the training set to implement Fréchet SDR with an initial dimension $d$, say 5; this choice of a relatively large dimension helps to guarantee the unbiasedness of the estimated Fréchet central subspace. We then apply the estimated $\hat\beta$ to the test set to produce the sufficient predictor $\hat\beta^\top X$, and fit a global Fréchet regression model (Petersen and Müller, 2019) to predict the response in the test set. We compute the prediction error for each $i$ and aggregate the errors over all rotations $i = 1, \dots, k$, which yields an overall cross-validation error. This overall error depends on the tuning parameter $\rho_Y$ and the kernel type, and is minimized over the grid $\{10^{-2}, 10^{-1}, 1, 10\} \times \{\kappa_G, \kappa_L\}$ to obtain the optimal combination.
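A small sketch of the default choices in (6), as we have reconstructed them, computed from the pairwise distance matrix of the responses:

```python
import numpy as np

def default_gammas(D, rho=1.0):
    """Default kernel parameters from an (n, n) distance matrix D.

    gamma_G = rho^2 / sigma_G^2 and gamma_L = rho^2 / sigma_L, with
    sigma_G^2 (sigma_L) the average squared (plain) pairwise distance;
    rho plays the role of rho_Y and is tuned over {1e-2, 1e-1, 1, 10}.
    """
    iu = np.triu_indices_from(D, k=1)
    sigma_G2 = np.mean(D[iu] ** 2)
    sigma_L = np.mean(D[iu])
    return rho**2 / sigma_G2, rho**2 / sigma_L
```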

Estimation of the dimensions.

For the ensemble estimators that possess a candidate matrix (such as the ensembled moment estimators in Section 4.1), recently developed order-determination methods, such as the ladle estimate (Luo and Li, 2016) and the predictor augmentation estimator (Luo and Li, 2021), can be directly applied to estimate $d_0$. In addition, the BIC criterion introduced by Zhu et al. (2006) can also be used for this purpose.

In this paper, we adapt the predictor augmentation estimator to the current setting. A detailed introduction to the predictor augmentation method and more simulation results are included in the Supplementary Material. For the predictor augmentation estimator, we take the number of augmentations to be $s = 10$ and the dimension of the augmented predictors to be $r = p/2$, where $p$ is the original dimension of the predictors.

Estimation error assessment.

We used the error measure for subspace estimation in Li et al. (2005): if $\mathcal{S}_1$ and $\mathcal{S}_2$ are two subspaces of $\mathbb{R}^p$ of the same dimension, then their distance is defined as $d(\mathcal{S}_1, \mathcal{S}_2) = \|P_{\mathcal{S}_1} - P_{\mathcal{S}_2}\|_F$, where $P_{\mathcal{S}}$ is the projection onto $\mathcal{S}$ and $\|\cdot\|_F$ is the Frobenius matrix norm. If $B_1$ and $B_2$ are two matrices whose columns form bases of $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively, this distance can be equivalently written as $\|B_1 (B_1^\top B_1)^{-1} B_1^\top - B_2 (B_2^\top B_2)^{-1} B_2^\top\|_F$. This distance is coordinate-free, as it is invariant to the choice of basis matrices.
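This coordinate-free distance is straightforward to compute from any pair of basis matrices, as in the short sketch below (our illustration).

```python
import numpy as np

def subspace_dist(B1, B2):
    """||P_S1 - P_S2||_F for the column spaces of basis matrices B1, B2."""
    proj = lambda B: B @ np.linalg.solve(B.T @ B, B.T)  # projection onto span(B)
    return np.linalg.norm(proj(B1) - proj(B2), "fro")
```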

To facilitate the comparison, we also include the benchmark error, which is set as the expectation of the above distance when $B_1$ is taken as any basis matrix of the true central subspace and the entries of $B_2$ are generated as i.i.d. $N(0, 1)$. This expectation is computed by Monte Carlo with 1000 repetitions.

6.2. Scenario I: Fréchet SDR for distributions

Let $(\Omega_Y, d_W)$ be the metric space of univariate distributions endowed with the Wasserstein metric $d_W$, as described in Section 3. The construction of the ensembles requires computing the Wasserstein distances $d_W(Y_i, Y_j)$ for $i, j = 1, \dots, n$. However, the distributions $Y_1, \dots, Y_n$ are usually not fully observed in practice, which means we need to estimate them when implementing the proposed methods. There are multiple ways to do so, such as estimating the c.d.f.'s, the quantile functions (Parzen, 1979), or the p.d.f.'s (Petersen and Müller, 2016; Chen et al., 2021). For computational simplicity, we use the Wasserstein distances between the empirical measures. Specifically, suppose we observe $(X_1, \{W_{1j}\}_{j=1}^{m_1}), \dots, (X_n, \{W_{nj}\}_{j=1}^{m_n})$, where $\{W_{ij}\}_{j=1}^{m_i}$ are independent samples from the distribution $Y_i$. Let $\hat Y_i$ be the empirical measure $m_i^{-1} \sum_{j=1}^{m_i} \delta_{W_{ij}}$, where $\delta_a$ is the Dirac measure at $a$; then we estimate $d_W(Y_i, Y_k)$ by $d_W(\hat Y_i, \hat Y_k)$. For the theoretical justification, see Fournier and Guillin (2015) and Lei (2020). For simplicity, we assume the sample sizes $m_1, \dots, m_n$ to be the same and denote the common sample size by $m$. Then the distance between the empirical measures $\hat Y_i$ and $\hat Y_k$ is a simple function of the order statistics: $d_W(\hat Y_i, \hat Y_k) = \{m^{-1} \sum_{j=1}^{m} (W_{i(j)} - W_{k(j)})^2\}^{1/2}$, where $W_{i(j)}$ is the $j$th order statistic of the sample $W_{i1}, \dots, W_{im}$.
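The order-statistics formula makes the empirical distance a one-liner, as in this sketch (assuming, as above, equal within-sample sizes $m$):

```python
import numpy as np

def wasserstein2_empirical(Wi, Wk):
    """W2 distance between two empirical measures with the same number of
    atoms: {m^{-1} sum_j (W_{i(j)} - W_{k(j)})^2}^{1/2} via sorted samples."""
    return np.sqrt(np.mean((np.sort(Wi) - np.sort(Wk)) ** 2))
```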

Let $\beta_1 = (1, 1, 0, \dots, 0)^\top$, $\beta_2 = (0, \dots, 0, 1, 1)^\top$, $\beta_3 = (1, 2, 0, \dots, 0, 2)^\top$, and $\beta_4 = (0, 0, 1, 2, 2, \dots, 0)^\top$. To generate the univariate distributional response $Y$, we let $Y = N(\mu_Y, \sigma_Y^2)$, where $\mu_Y$ and $\sigma_Y^2$ are random variables dependent on $X$, with $\sigma_Y > 0$ almost surely, defined by the following models:

I-1: $\mu_Y \mid X \sim N(\exp(\beta_1^\top X), 1)$ and $\sigma_Y = 1$.

I-2: $\mu_Y \mid X \sim N(\exp(\beta_1^\top X), 1)$ and $\sigma_Y = 10^{-1} \, 1\{\varsigma_X < 10^{-1}\} + \varsigma_X \, 1\{10^{-1} \le \varsigma_X \le 10\} + 10 \cdot 1\{\varsigma_X > 10\}$, where $\varsigma_X = \exp(\beta_2^\top X)$.

To generate the predictor X, we consider both scenarios where Assumption 1 is satisfied and violated. Specifically, for Model I-1 and I-2, X is generated by the following two scenarios:

  1. $X \sim N(0, I_p)$; in this case both LCM and CCV in Assumption 1 are satisfied;

  2. we first generate $(U_1, \dots, U_p)$ from the AR(1) model with mean 0 and covariance matrix $\Sigma = (0.5^{|i-j|})_{i,j}$, and then generate $X$ as $(\sin(U_1), U_2, U_3, \dots, U_p)^\top$. For this model, both LCM and CCV are violated.

Ying and Yu (2022) considered models similar to Models I-1 and I-2. For Model I-1, $B_0 = \beta_1$ and $d_0 = 1$; for Model I-2, $B_0 = (\beta_1, \beta_2)$ and $d_0 = 2$. In the simulation, we first generate $X_1, \dots, X_n$, then generate $(\mu_{Y_1}, \sigma_{Y_1}), \dots, (\mu_{Y_n}, \sigma_{Y_n})$. For each $i = 1, \dots, n$, we then generate $W_{i1}, \dots, W_{im}$ independently from $N(\mu_{Y_i}, \sigma_{Y_i}^2)$. We take $(n, p) = (200, 10), (400, 20)$ and $m = 100$.

We compare the performances of the CMS-ensemble and CS-ensemble methods described in Section 4, including FOLS, FPHD, FIHT, FSIR, FSAVE, FDR, and FOPG (with refinement). We first implement the predictor augmentation (PA) method to estimate the dimension of the Fréchet central subspace. Then, with the estimated $\hat d$, we evaluate the estimation error. For FOPG, the number of iterations is set to 5, which is large enough to guarantee numerical convergence. For FSIR and FSAVE, the number of slices is chosen as $n/(2p)$; for FDR, the number of slices is chosen as $n/(6p)$. We also implement the weighted inverse regression ensemble (WIRE) method proposed by Ying and Yu (2022) for comparison. We repeat the experiments 500 times and summarize the proportion of correct order identification and the mean and standard deviation of the estimation error in Table 1. A smaller distance indicates a more accurate estimate, and the estimate with the smallest distance is shown in boldface. The benchmark distances are shown at the bottom of the table.

Table 2:

Percentage of correct order determination and mean (± standard deviation) of the estimation error, measured by $\|P_{B_0} - P_{\hat B}\|_F$, for different methods in Scenario II. The benchmarks for Model II-1 with $p = 10, 20$ are 1.334 and 1.373, respectively; for Model II-2 with $p = 10, 20$, they are 1.785 and 1.893, respectively. The bold-faced number indicates the best performer.

Model, (p,n)        Row      FOLS     FPHD     FIHT     FSIR     FSAVE    FDR      FOPG     WIRE
II-1-(a), (10,200)  order%   100%     71%      100%     99%      87%      98%      100%     100%
                    error    0.157    0.865    0.157    0.170    0.299    0.171    0.154    0.152
                    (sd)     (0.055)  (0.288)  (0.055)  (0.093)  (0.285)  (0.138)  (0.041)  (0.038)
II-1-(a), (20,400)  order%   100%     68%      100%     100%     92%      99%      100%     100%
                    error    0.162    0.921    0.162    0.167    0.258    0.165    0.153    0.160
                    (sd)     (0.029)  (0.262)  (0.029)  (0.031)  (0.221)  (0.089)  (0.027)  (0.028)
II-1-(b), (10,200)  order%   99%      58%      99%      92%      51%      67%      97%      98%
                    error    0.236    1.044    0.236    0.288    0.735    0.506    0.224    0.220
                    (sd)     (0.115)  (0.271)  (0.115)  (0.219)  (0.348)  (0.376)  (0.162)  (0.133)
II-1-(b), (20,400)  order%   100%     52%      100%     96%      51%      70%      96%      99%
                    error    0.235    1.126    0.235    0.260    0.726    0.472    0.233    0.222
                    (sd)     (0.047)  (0.245)  (0.047)  (0.155)  (0.336)  (0.364)  (0.157)  (0.09)
II-2-(a), (10,200)  order%   100%     20%      100%     100%     78%      99%      100%     100%
                    error    0.292    1.20     0.292    0.306    0.615    0.358    0.151    0.290
                    (sd)     (0.078)  (0.155)  (0.078)  (0.088)  (0.285)  (0.122)  (0.05)   (0.078)
II-2-(a), (20,400)  order%   100%     20%      100%     100%     85%      100%     100%     100%
                    error    0.308    1.218    0.308    0.318    0.618    0.375    0.179    0.307
                    (sd)     (0.057)  (0.15)   (0.057)  (0.058)  (0.218)  (0.075)  (0.029)  (0.057)
II-2-(b), (10,200)  order%   99%      35%      98%      100%     58%      80%      98%      100%
                    error    0.680    1.462    0.682    0.707    1.182    0.941    0.275    0.675
                    (sd)     (0.184)  (0.19)   (0.186)  (0.185)  (0.233)  (0.245)  (0.137)  (0.181)
II-2-(b), (20,400)  order%   100%     42%      100%     100%     50%      92%      100%     100%
                    error    0.694    1.505    0.694    0.710    1.228    0.933    0.336    0.691
                    (sd)     (0.128)  (0.187)  (0.128)  (0.126)  (0.189)  (0.194)  (0.079)  (0.128)

For Models I-1 and I-2, the best performer, FOPG, achieves a 100% correct order determination rate and enjoys the smallest estimation error. The moment-based ensemble methods are slightly less accurate than FOPG. Compared with the benchmark, all methods successfully estimate the true central subspace except FPHD. Comparing the results under predictor settings (a) and (b), we see that most moment-based and inverse-regression-based methods have larger estimation errors and a lower percentage of correct order determination under setting (b), but FOPG, which is free of the elliptical assumption on the predictors, still gives the most precise estimates. Overall, the correlation between predictors and the non-ellipticity do not affect the results much compared with the benchmark error.

6.3. Scenario II: Fréchet SDR for positive definite matrices

Let $\Omega_Y$ be the space $\mathrm{Sym}^+(r)$ endowed with the Frobenius distance $d_F(Y_1, Y_2) = \|Y_1 - Y_2\|_F$. To accommodate anatomical intersubject variability, Schwartzman (2006) introduced the symmetric matrix variate normal distributions. We adopt this distribution to construct the regression model with a correlation-type matrix response. We say that $Z \in \mathrm{Sym}(r)$ has the standard symmetric matrix variate normal distribution $N_{rr}(0; I_r)$ if it has density $\varphi(Z) = (2\pi)^{-q/2} \exp\{-\mathrm{tr}(Z^2)/2\}$, with $q = r(r+1)/2$, with respect to the Lebesgue measure on $\mathbb{R}^{r(r+1)/2}$. As pointed out in Schwartzman (2006), this definition is equivalent to a symmetric matrix with independent $N(0, 1)$ diagonal elements and $N(0, 1/2)$ off-diagonal elements. We say $Y \in \mathrm{Sym}(r)$ has the symmetric matrix variate normal distribution $N_{rr}(M; \Sigma)$ if $Y = G Z G^\top + M$, where $M \in \mathrm{Sym}(r)$, $G \in \mathbb{R}^{r \times r}$ is a non-singular matrix, and $\Sigma = G^\top G$. As a special case, we say $Y \sim N_{rr}(M; \sigma^2)$ if $Y = \sigma Z + M$.
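A sketch of the corresponding sampler (our illustration), drawing $Y = \sigma Z + M$ with $Z$ standard symmetric matrix variate normal:

```python
import numpy as np

def sym_matrix_normal(M, sigma, rng=None):
    """Draw Y ~ N_rr(M; sigma^2): Z has independent N(0, 1) diagonal and
    N(0, 1/2) off-diagonal entries (Schwartzman, 2006); Y = sigma * Z + M."""
    rng = rng or np.random.default_rng()
    r = M.shape[0]
    Z = np.zeros((r, r))
    iu = np.triu_indices(r, k=1)
    Z[iu] = rng.normal(scale=np.sqrt(0.5), size=iu[0].size)
    Z += Z.T                                   # symmetric off-diagonal part
    Z[np.diag_indices(r)] = rng.normal(size=r)
    return sigma * Z + M
```

In Scenario II one would apply this on the log scale, drawing $\log(Y)$ around $\log\{D(X)\}$ with $\sigma = 0.5$ and mapping back by the matrix exponential.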

We generate the predictors $X$ as in settings (a) and (b) of Scenario I. We generate $\log(Y)$ following $N_{rr}(\log\{D(X)\}, 0.25)$, where $\log(\cdot)$ is the matrix logarithm defined in Section 3 and $D(X)$ is specified by the following models:

II-1: $D(X) = \begin{pmatrix} 1 & \rho(X) \\ \rho(X) & 1 \end{pmatrix}$, where $\rho(X) = \{\exp(\beta_1^\top X) - 1\} / \{\exp(\beta_1^\top X) + 1\}$.

II-2: $D(X) = \begin{pmatrix} 1 & \rho_1(X) & \rho_2(X) \\ \rho_1(X) & 1 & \rho_1(X) \\ \rho_2(X) & \rho_1(X) & 1 \end{pmatrix}$, where $\rho_1(X) = 0.4 \{\exp(\beta_1^\top X) - 1\} / \{\exp(\beta_1^\top X) + 1\}$ and $\rho_2(X) = 0.4 \sin(\beta_3^\top X)$.

In Model II-1, $B_0 = \beta_1$ and $d_0 = 1$; in Model II-2, $B_0 = (\beta_1, \beta_3)$ and $d_0 = 2$. We note that $D(X)$ is not necessarily the conditional Fréchet mean of $Y$ given $X$, but it still measures the central tendency of the conditional distribution of $Y \mid X$. We again compare the performances of the CMS-ensemble and CS-ensemble methods, with $(n, p) = (200, 10), (400, 20)$. The experiments are repeated 500 times. The proportion of correct order identification and the means and standard deviations of the estimation errors are summarized in Table 2.

Table 3:

10-fold cross-validation prediction errors of GFR/LFR for mortality data

GFR LFR
9-dim full predictor 30.475 28.745
2-dim sufficient predictor 27.200 23.852

We conclude that all ensemble methods except FPHD give reasonable estimates. FOPG performs best in all settings except Model II-1-(b). To illustrate the relation between the response and the estimated sufficient predictor $\hat\beta_1^{\top}X$, we adopt the ellipsoidal representation of SPD matrices: each $A \in \mathrm{Sym}^+(d)$ can be associated with the ellipsoid centered at the origin, $\mathcal{E}_A = \{x : x^{\top}A^{-1}x \le 1\}$. For Model II-1-(a), Figure 2 plots the response ellipsoids against the estimated sufficient predictor in panel (a) and against the predictor $X_{10}$ in panel (b). A clear pattern of change in the shape and rotation of the response ellipsoids emerges as $\hat\beta_1^{\top}X$ varies.
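The ellipsoidal representation is straightforward to draw: the boundary of $\{x : x^{\top}A^{-1}x \le 1\}$ is the image of the unit circle under the symmetric square root $A^{1/2}$. A minimal sketch (illustrative names; not the paper's plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

def ellipse_boundary(A, n_points=200):
    """Boundary of {x : x' A^{-1} x <= 1}: the unit circle mapped by A^{1/2}."""
    w, V = np.linalg.eigh(A)
    sqrtA = V @ np.diag(np.sqrt(w)) @ V.T        # symmetric square root of A
    t = np.linspace(0.0, 2.0 * np.pi, n_points)
    return sqrtA @ np.vstack([np.cos(t), np.sin(t)])

A = np.array([[1.0, 0.6], [0.6, 1.0]])           # an SPD correlation matrix
xs, ys = ellipse_boundary(A)
plt.plot(xs, ys)
plt.axis("equal")
plt.show()
```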

Figure 2:

Ellipsoidal plots of the SPD matrix response versus the FOPG predictor $\hat\beta_1^{\top}X$ and versus $X_{10}$, for Model II-1-(a) with $(n, p) = (200, 10)$. Each horizontal ellipse is the ellipsoidal representation of an SPD matrix, and the vertical axis gives the value of (a) $\hat\beta_1^{\top}X$; (b) $X_{10}$; the ellipses are colored according to the vertical-axis values.

7. Application to the Human Mortality Data

This section presents an application concerning human life spans. Another application, concerning intracerebral hemorrhage, is presented in Section S.6 of the Supplementary Material.

Compared with summary statistics such as the crude death rate, viewing the entire age-at-death distribution as a data object gives a more comprehensive picture of human longevity and health conditions. Mortality distributions are affected by many factors, such as economics, the health care system, and social and environmental conditions. To investigate potential factors related to the mortality distributions across countries, we collect the nine predictors listed below, covering demographic, economic, labor-market, nutritional, health, and environmental factors in 2015: (1) Population Density: population per square kilometer; (2) Sex Ratio: number of males per 100 females in the population; (3) Mean Childbearing Age: the average age of mothers at the birth of their children; (4) Gross Domestic Product (GDP) per Capita; (5) Gross Value Added (GVA) by Agriculture: the percentage of gross value added from agriculture, hunting, forestry, and fishing activities; (6) Consumer Price Index (CPI), with 2010 as the base year; (7) Unemployment Rate; (8) Expenditure on Health (percentage of GDP); (9) Arable Land (percentage of total land area). The data are collected from the United Nations Databases (http://data.un.org/) and the UN World Population Prospects 2019 Databases (https://population.un.org/wpp/Download). For each country, the life table records the number of deaths d(x,n) aggregated over five-year age intervals. We treat these data as histograms of age at death with bin widths equal to five years. We smooth the histograms using the 'frechet' R package (https://cran.r-project.org/web/packages/frechet/index.html) to obtain probability density functions, and then compute the Wasserstein distances between them. After removing 10 countries with extreme values in Population Density, Sex Ratio, CPI, or Expenditure on Health, we retain data for 152 countries in 2015.
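For readers who want to reproduce this preprocessing without the 'frechet' package, a minimal sketch is given below: it converts binned death counts into quantile functions by interpolating the piecewise-linear CDF, and computes the 2-Wasserstein distance as the $L_2$ distance between quantile functions. The bin edges and Poisson counts are illustrative placeholders, not real life-table values.

```python
import numpy as np

edges = np.arange(0, 105, 5)                      # 5-year age bins, ages 0-100

def quantile_fn(counts, edges, grid):
    """Quantile function on `grid`, by inverting the piecewise-linear CDF
    defined over the bin edges of a histogram with the given counts."""
    cdf = np.concatenate([[0.0], np.cumsum(counts) / np.sum(counts)])
    return np.interp(grid, cdf, edges)

u = np.linspace(0.005, 0.995, 200)                # interior quantile grid on (0, 1)
counts_a = np.random.default_rng(2).poisson(50, size=len(edges) - 1)
counts_b = np.random.default_rng(3).poisson(50, size=len(edges) - 1)
Qa = quantile_fn(counts_a, edges, u)
Qb = quantile_fn(counts_b, edges, u)

# W_2 distance = L2 distance between quantile functions; the grid is
# uniform on (0, 1), so a mean over the grid approximates the integral.
w2 = np.sqrt(np.mean((Qa - Qb) ** 2))
```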

An inspection of the scatter plot matrix reveals non-ellipticity in the predictors, so we choose FOPG, which does not rely on Assumption 1, to analyze the data. We use the Gaussian kernel $\kappa(y, y') = \exp\{-\gamma d_W^2(y, y')\}$ for the ensemble, where $\gamma$ is chosen according to (6) in Section 6.1. We standardize all covariates separately, and then use the predictor augmentation method combined with FOPG to estimate the dimension of the Fréchet central subspace, which is estimated to be 2. The first two directions obtained by FOPG are

$\hat\beta_1 = (0.841, 0.155, 0.100, 0.885, 0.361, 0.075, 0.108, 0.214, 0.055)^{\top}$,
$\hat\beta_2 = (0.838, 0.706, 0.395, 0.424, 0.758, 0.005, 0.218, 0.102, 0.034)^{\top}$.
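The following sketch illustrates the ensemble construction behind these estimates, with assumptions flagged: the kernel-transformed responses $f_j(Y) = \kappa(y_j, Y)$ are fed to a simple Euclidean SDR method, here sliced inverse regression for brevity rather than the FOPG estimator actually used, and the resulting candidate matrices are summed before extracting the leading directions.

```python
import numpy as np

def sir_candidate(Z, f, n_slices=5):
    """SIR candidate matrix for a scalar response f and whitened predictors Z:
    the weighted covariance of the slice means of Z."""
    order = np.argsort(f)
    slices = np.array_split(order, n_slices)
    means = np.stack([Z[idx].mean(axis=0) for idx in slices])
    weights = np.array([len(idx) for idx in slices]) / len(f)
    return (means.T * weights) @ means

def ensemble_sdr(X, dist2, gamma, d):
    """dist2[i, j] = squared metric distance between responses i and j."""
    Xc = X - X.mean(axis=0)
    w, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Sinv_half = V @ np.diag(w ** -0.5) @ V.T     # inverse square root of Cov(X)
    Z = Xc @ Sinv_half                           # whitened predictors
    K = np.exp(-gamma * dist2)                   # Gram matrix of the ensemble
    M = sum(sir_candidate(Z, K[:, j]) for j in range(K.shape[1]))
    evals, evecs = np.linalg.eigh(M)
    return Sinv_half @ evecs[:, ::-1][:, :d]     # top-d directions on the X scale
```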

A plot of the mortality densities versus the first sufficient predictor $\hat\beta_1^{\top}X$ is shown in Figure 3(a). Clear and useful patterns emerge: the mode of the mortality distribution shifts from right to left (with left indicating a longer life span) as the value of the first sufficient predictor increases. Moreover, there is a significant uptick at the right-most end as the first sufficient predictor decreases, indicating high infant mortality. Meanwhile, the loadings of the first sufficient predictor are strongly positive for GDP per capita, which reflects a country's level of economic development and health care, with larger values associated with more developed countries and smaller values with less developed ones. From Figure 3(c), we see that the mean of the mortality distribution increases, and the standard deviation decreases, with the value of the first sufficient predictor. This also makes sense: the mean life span increases with the level of development, consistent with Figure 3(a). The standard deviation decreases with the first predictor because, as the development level increases, life spans become increasingly concentrated on high values. Moreover, the high mortality in the lower region of the first sufficient predictor also contributes to the larger standard deviation in that region.

Figure 3:

(a) Mortality distributions versus the first sufficient predictor; (b) cross-validation-predicted mortality distributions using the sufficient predictors; (c) mean and standard deviation of the responses versus the first sufficient predictor.

We fit global and local Fréchet regressions (GFR/LFR) with all nine predictors and with the two sufficient predictors, respectively. The 10-fold cross-validation prediction errors are collected in Table 3. Using the sufficient predictors yields more accurate predictions, especially for the local Fréchet regression model. The LFR gives more accurate predictions than the GFR, indicating that the LFR is more flexible when a nonlinear regression pattern exists; this result is consistent with the recent findings of Bhattacharjee and Müller (2021). The predicted mortality densities versus the first sufficient predictor are shown in Figure 3(b). We also compare the cross-validation prediction errors of LFR using the sufficient predictors from FOPG with those using the sufficient predictors from FSIR and from WIRE of Ying and Yu (2022), both of which require the linear conditional mean assumption. The 10-fold cross-validation prediction errors using WIRE and FSIR are 24.765 and 24.342, respectively. Compared with 23.852 for FOPG, we see that FOPG performs better than FSIR and WIRE, while the latter two perform similarly.
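A minimal sketch of this comparison, under assumptions: responses represented as quantile functions on a common grid, the global (rather than local) Fréchet regression weights of Petersen and Müller (2019), and a crude cumulative-maximum fix in place of the exact isotonic projection onto monotone quantile functions.

```python
import numpy as np

def gfr_predict(X_tr, Q_tr, x_new):
    """Global Frechet regression in Wasserstein space: a weighted mean of
    the training quantile functions, with weights 1 + (X_i - mu)' S^{-1} (x - mu)."""
    mu = X_tr.mean(axis=0)
    Sinv = np.linalg.pinv(np.cov(X_tr, rowvar=False))
    w = 1.0 + (X_tr - mu) @ Sinv @ (x_new - mu)
    q = (w[:, None] * Q_tr).mean(axis=0)
    return np.maximum.accumulate(q)              # crude projection to monotone

def cv_error(X, Q, n_folds=10, seed=0):
    """Mean squared quantile-function discrepancy (a squared-W2 proxy)
    over 10-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(X))
    err = []
    for fold in np.array_split(idx, n_folds):
        tr = np.setdiff1d(idx, fold)
        for i in fold:
            q_hat = gfr_predict(X[tr], Q[tr], X[i])
            err.append(np.mean((q_hat - Q[i]) ** 2))
    return float(np.mean(err))

# cv_error(X_full, Q) vs. cv_error(X_full @ B_hat, Q) would reproduce the
# spirit of the full- vs. sufficient-predictor comparison in Table 3.
```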

8. Discussion

In the classical regression setting, sufficient dimension reduction has been used as a tool for exploratory data analysis and regression diagnostics, and as a mechanism to overcome the curse of dimensionality in regression. As a regression tool, it can help us treat collinearity in the predictors effectively, detect heteroscedasticity in the response, find the most important linear combinations of the predictors, and understand the general shape of the regression surface without fitting an elaborate regression model. Although regression with a metric-space-valued random object is a new problem, as a regression problem it shares the same set of issues, such as the need for exploratory analysis before regression, for model diagnostics after regression, and for mitigating the curse of dimensionality. As shown in Figure 1 of the paper, the first sufficient predictor clearly reveals useful information about the general trend of mortality distributions across countries.

The proposed methodology is flexible and versatile: it can turn any existing SDR method into one that handles metric-space-valued responses. Furthermore, it applies to any separable and complete metric space of negative type with an explicit CMS ensemble. It significantly broadens the current field of sufficient dimension reduction and provides a useful set of tools for Fréchet regression.

The proposed method also has its limitations, one of which is that it applies only to metric spaces that permit the construction of a universal kernel. Another possible criticism is that an ensemble constructed from the metric of the embedded Hilbert space is extrinsic to the original metric space. However, we do not regard this as a serious drawback, for two reasons. First, the role played by the ensemble family is rather like that played by the characteristic function, which need not be of the same nature as the original random variable. Second, in some important special cases (e.g., the Wasserstein space of univariate distributions and the space of SPD matrices), the embedding is isometric, so we are building the kernel from the original metric even though we work in the embedded Hilbert space. Nevertheless, when it is possible to use the original metric (as in the isometric embedding case), it seems intuitively appealing to take it as our first choice, as we have done in all three examples.

Supplementary Material

Supp 1

Table 1:

The percentages of correct order determination and the mean (standard deviation) of the estimation error, measured by $\|P_{B_0} - P_{\hat B}\|_F$, for Models I-1 and I-2 with settings (a) and (b). The benchmarks for Model I-1 with p = 10, 20 are 1.334 and 1.373, respectively; those for Model I-2 with p = 10, 20 are , respectively. The bold-faced number indicates the best performer.

Model (p,n) FOLS FPHD FIHT FSIR FSAVE FDR FOPG WIRE
I-1-(a) 100% 97% 100% 100% 95% 97% 100% 100%
(10,200) 0.334 0.593 0.341 0.260 0.437 0.336 0.167 0.236
(0.088) (0.158) (0.09) (0.081) (0.199) (0.144) (0.054) (0.057)
100% 97% 100% 100% 97% 98% 100% 100%
(20,400) 0.365 0.634 0.371 0.263 0.433 0.342 0.227 0.251
(0.075) (0.136) (0.075) (0.046) (0.149) (0.115) (0.05) (0.041)
I-1-(b) 99% 99% 99% 97% 95% 98% 100% 97%
(10,200) 0.380 0.638 0.399 0.239 0.361 0.280 0.136 0.204
(0.122) (0.122) (0.126) (0.145) (0.187) (0.12) (0.039) (0.147)
99% 99% 99% 98% 95% 99% 100% 97%
(20,400) 0.387 0.648 0.404 0.237 0.365 0.275 0.194 0.211
(0.098) (0.096) (0.094) (0.123) (0.176) (0.092) (0.053) (0.151)
I-2-(a) 100% 91% 100% 100% 99% 100% 100% 100%
(10,200) 0.409 1.032 0.412 0.370 0.528 0.413 0.267 0.304
(0.11) (0.254) (0.109) (0.082) (0.134) (0.09) (0.112) (0.061)
100% 90% 100% 100% 100% 100% 100% 100%
(20,400) 0.431 1.157 0.435 0.371 0.548 0.434 0.298 0.320
(0.069) (0.23) (0.069) (0.049) (0.086) (0.059) (0.072) (0.038)
I-2-(b) 100% 91% 100% 99% 100% 100% 100% 100%
(10,200) 0.551 1.122 0.557 0.464 0.630 0.507 0.290 0.370
(0.111) (0.203) (0.11) (0.119) (0.133) (0.102) (0.086) (0.084)
100% 91% 100% 100% 100% 100% 100% 100%
(20,400) 0.561 1.179 0.567 0.458 0.645 0.521 0.330 0.381
(0.081) (0.164) (0.08) (0.072) (0.089) (0.07) (0.051) (0.058)

Acknowledgments

The authors thank the Co-Editor, Associate Editor, and anonymous reviewers for their helpful comments, which significantly improved the quality of the paper.

Footnotes

Disclosure

The authors report there are no competing interests to declare.

References

  1. Ambrosio L, Gigli N and Savaré G (2004), 'Gradient flows with metric and differentiable structures, and applications to the Wasserstein space', Atti Accad. Naz. Lincei Cl. Sci. Fis. Mat. Natur. Rend. Lincei (9) Mat. Appl. 15(3–4), 327–343.
  2. Berg C, Christensen JPR and Ressel P (1984), Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions, Vol. 100, Springer.
  3. Bhattacharjee S and Müller H-G (2021), 'Single index Fréchet regression', arXiv preprint arXiv:2108.05437.
  4. Bigot J, Gouet R, Klein T and López A (2017), 'Geodesic PCA in the Wasserstein space by convex PCA', Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 53, 1–26.
  5. Chen Y, Lin Z and Müller H-G (2021), 'Wasserstein regression', J. Amer. Statist. Assoc.
  6. Christmann A and Steinwart I (2010), Universal kernels on non-standard input spaces, in 'Advances in Neural Information Processing Systems', pp. 406–414.
  7. Cook RD (1996), 'Graphics for regressions with a binary response', J. Amer. Statist. Assoc. 91, 983–992.
  8. Cook RD and Li B (2002), 'Dimension reduction for conditional mean in regression', The Annals of Statistics 30(2), 455–474.
  9. Cook RD and Weisberg S (1991), 'Sliced inverse regression for dimension reduction: Comment', Journal of the American Statistical Association 86(414), 328–332.
  10. Ding S and Cook RD (2015), 'Tensor sliced inverse regression', J. Multivar. Anal. 133, 216–231.
  11. Dubey P and Müller H-G (2019), 'Fréchet analysis of variance for random objects', Biometrika 106(4), 803–821.
  12. Dubey P and Müller H-G (2020a), 'Fréchet change-point detection', Ann. Stat. 48(6), 3312–3335.
  13. Dubey P and Müller H-G (2020b), 'Functional models for time-varying random objects', Journal of the Royal Statistical Society Series B: Statistical Methodology 82(2), 275–327.
  14. Eaton ML (1986), 'A characterization of spherical distributions', J. Multivar. Anal. 20(2), 272–276.
  15. Fan J, Xue L and Yao J (2017), 'Sufficient forecasting using factor models', Journal of Econometrics 201(2), 292–306.
  16. Feragen A, Lauze F and Hauberg S (2015), Geodesic exponential kernels: when curvature and linearity conflict, in 'Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition', pp. 3032–3042.
  17. Ferré L and Yao A-F (2003), 'Functional sliced inverse regression analysis', Statistics 37(6), 475–488.
  18. Ferreira LK and Busatto GF (2013), 'Resting-state functional connectivity in normal brain aging', Neuroscience & Biobehavioral Reviews 37(3), 384–400.
  19. Fournier N and Guillin A (2015), 'On the rate of convergence in Wasserstein distance of the empirical measure', Probability Theory and Related Fields 162(3–4), 707–738.
  20. Fréchet M (1948), 'Les éléments aléatoires de nature quelconque dans un espace distancié', Annales de l'Institut Henri Poincaré, pp. 215–310.
  21. Granirer EE (1970), 'Review of Probability Measures on Metric Spaces by K. R. Parthasarathy (Academic Press, New York and London, 1967)', Canadian Mathematical Bulletin 13(2), 290–291.
  22. Hall P and Li K-C (1993), 'On almost linearity of low dimensional projections from high dimensional data', The Annals of Statistics, pp. 867–889.
  23. Honeine P and Richard C (2010), The angular kernel in machine learning for hyperspectral data classification, in '2010 2nd Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing', IEEE, pp. 1–4.
  24. Hsing T and Ren H (2009), 'An RKHS formulation of the inverse regression dimension-reduction problem', The Annals of Statistics 37(2), 726–755.
  25. Jayasumana S, Hartley R, Salzmann M, Li H and Harandi M (2013), Combining multiple manifold-valued descriptors for improved object recognition, in '2013 International Conference on Digital Image Computing: Techniques and Applications (DICTA)', IEEE, pp. 1–6.
  26. Lee K-Y, Li B and Chiaromonte F (2013), 'A general theory for nonlinear sufficient dimension reduction: Formulation and estimation', The Annals of Statistics 41(1), 221–249.
  27. Lei J (2020), 'Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces', Bernoulli 26(1), 767–798.
  28. Li B (2018), Sufficient Dimension Reduction: Methods and Applications with R, CRC Press.
  29. Li B, Kim MK and Altman N (2010), 'On dimension folding of matrix- or array-valued statistical objects', The Annals of Statistics 38(2), 1094–1121.
  30. Li B and Song J (2017), 'Nonlinear sufficient dimension reduction for functional data', The Annals of Statistics 45(3), 1059–1095.
  31. Li B and Wang S (2007), 'On directional regression for dimension reduction', Journal of the American Statistical Association 102(479), 997–1008.
  32. Li B and Yin X (2007), 'On surrogate dimension reduction for measurement error regression: an invariance law', The Annals of Statistics 35(5), 2143–2172.
  33. Li B, Zha H and Chiaromonte F (2005), 'Contour regression: a general approach to dimension reduction', The Annals of Statistics 33(4), 1580–1616.
  34. Li K-C (1991), 'Sliced inverse regression for dimension reduction', J. Amer. Statist. Assoc. 86, 316–327.
  35. Li K-C (1992), 'On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma', Journal of the American Statistical Association 87(420), 1025–1039.
  36. Li K-C and Duan N (1989), 'Regression analysis under link violation', Ann. Stat. 17(3), 1009–1052.
  37. Luo W and Li B (2016), 'Combining eigenvalues and variation of eigenvectors for order determination', Biometrika 103(4), 875–887.
  38. Luo W and Li B (2021), 'On order determination by predictor augmentation', Biometrika 108(3), 557–574.
  39. Luo W, Xue L, Yao J and Yu X (2021), 'Inverse moment methods for sufficient forecasting using high-dimensional predictors', Biometrika 109(2), 473–487.
  40. Micchelli CA, Xu Y and Zhang H (2006), 'Universal kernels', J. Mach. Learn. Res. 7(12), 2651–2667.
  41. Panaretos VM and Zemel Y (2020), An Invitation to Statistics in Wasserstein Space, Springer Nature.
  42. Parzen E (1979), 'Nonparametric statistical data modeling', J. Amer. Statist. Assoc. 74(365), 105–121.
  43. Petersen A and Müller H-G (2016), 'Functional data analysis for density functions by transformation to a Hilbert space', The Annals of Statistics 44(1), 183–218.
  44. Petersen A and Müller H-G (2019), 'Fréchet regression for random objects with Euclidean predictors', The Annals of Statistics 47(2), 691–719.
  45. Schwartzman A (2006), Random Ellipsoids and False Discovery Rates: Statistics for Diffusion Tensor Imaging Data, PhD thesis, Stanford University.
  46. Sejdinovic D, Sriperumbudur B, Gretton A and Fukumizu K (2013), 'Equivalence of distance-based and RKHS-based statistics in hypothesis testing', The Annals of Statistics, pp. 2263–2291.
  47. Sriperumbudur BK, Fukumizu K and Lanckriet GR (2011), 'Universality, characteristic kernels and RKHS embedding of measures', Journal of Machine Learning Research 12(7), 2389–2410.
  48. Steinwart I (2001), 'On the influence of the kernel on the consistency of support vector machines', Journal of Machine Learning Research 2(Nov), 67–93.
  49. Xia Y, Tong H, Li WK and Zhu L-X (2002), 'An adaptive estimation of dimension reduction space (with discussion)', Journal of the Royal Statistical Society, Series B 64(3), 363–410.
  50. Ye Z and Weiss RE (2003), 'Using the bootstrap to select one of a new class of dimension reduction methods', Journal of the American Statistical Association 98(464), 968–979.
  51. Yin X and Li B (2011), 'Sufficient dimension reduction based on an ensemble of minimum average variance estimators', The Annals of Statistics 39(6), 3392–3416.
  52. Yin X, Li B and Cook RD (2008), 'Successive direction extraction for estimating the central subspace in a multiple-index regression', Journal of Multivariate Analysis 99(8), 1733–1757.
  53. Ying C and Yu Z (2022), 'Fréchet sufficient dimension reduction for random objects', Biometrika.
  54. Yu X, Yao J and Xue L (2020), 'Nonparametric estimation and conformal inference of the sufficient forecasting with a diverging number of factors', J. Bus. Econ. Stat. 40(1), 342–354.
  55. Zhang Q, Li B and Xue L (2022), 'Nonlinear sufficient dimension reduction for distribution-on-distribution regression', arXiv preprint arXiv:2207.04613.
  56. Zhu L, Miao B and Peng H (2006), 'On sliced inverse regression with high-dimensional covariates', Journal of the American Statistical Association 101(474), 630–643.
