Abstract
The distance and divergence of probability measures play a central role in statistics, machine learning, and many other related fields. The Wasserstein distance has received much attention in recent years because of its distinctive properties compared with other distances and divergences. Although computing the Wasserstein distance is costly, entropy-regularized optimal transport was proposed as a computationally efficient approximation of the Wasserstein distance. The purpose of this study is to understand the theoretical aspects of entropy-regularized optimal transport. In this paper, we focus on entropy-regularized optimal transport on multivariate normal distributions and q-normal distributions. We obtain the explicit form of the entropy-regularized optimal transport cost on multivariate normal and q-normal distributions; this provides a perspective for understanding the effect of entropy regularization, which was previously known only experimentally. Furthermore, we obtain the entropy-regularized Kantorovich estimator for probability measures that satisfy certain conditions. We also demonstrate in experiments how the Wasserstein distance, the optimal coupling, the geometric structure, and the statistical efficiency are affected by entropy regularization. In particular, our results on the explicit form of the optimal coupling of the Tsallis entropy-regularized optimal transport on multivariate q-normal distributions and on the entropy-regularized Kantorovich estimator are novel and constitute a first step towards the understanding of a more general setting.
Keywords: optimal transportation, entropy regularization, Wasserstein distance, Tsallis entropy, q-normal distribution
1. Introduction
Comparing probability measures is a fundamental problem in statistics and machine learning. A classical way to compare probability measures is the Kullback–Leibler divergence. Let M be a measurable space and let μ and ν be probability measures on M with μ absolutely continuous with respect to ν; then, the Kullback–Leibler divergence is defined as:

$$\mathrm{KL}(\mu \,\|\, \nu) = \int_M \log\!\left(\frac{d\mu}{d\nu}\right) d\mu. \qquad (1)$$
The Wasserstein distance [1], also known as the earth mover's distance [2], is another way of comparing probability measures. It is a metric on the space of probability measures that arises from the theory of mass transportation between two probability measures. Informally, optimal transport theory considers an optimal transport plan between two probability measures under a cost function, and the Wasserstein distance is defined by the minimum total transport cost. A significant difference between the Wasserstein distance and the Kullback–Leibler divergence is that the former can reflect the metric structure of the underlying space, whereas the latter cannot. The Wasserstein distance can be written as:
$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\pi(x, y) \right)^{1/p}, \qquad (2)$$

where d is a distance function on a measurable metric space M and Π(μ, ν) denotes the set of probability measures on M × M whose marginal measures correspond to μ and ν. In recent years, the application of optimal transport and the Wasserstein distance has been studied in many fields such as statistics, machine learning, and image processing. For example, Reference [3] generated interpolations of various three-dimensional (3D) objects using the Wasserstein barycenter. In the field of word embedding in natural language processing, Reference [4] embedded each word as an elliptical distribution and applied the Wasserstein distance between the elliptical distributions. There are many studies on the applications of optimal transport to deep learning, including [5,6,7]. Moreover, Reference [8] analyzed the denoising autoencoder [9] with gradient flow in the Wasserstein space.
In applications of the Wasserstein distance, one often considers a discrete setting where μ and ν are discrete probability measures. Obtaining the Wasserstein distance between μ and ν can then be formulated as a linear programming problem. In general, however, it is computationally intensive to solve such linear programs and obtain the optimal coupling of two probability measures. For this situation, a novel numerical method, entropy regularization, was proposed by [10]:

$$\mathrm{OT}_\lambda(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int c(x, y) \, d\pi(x, y) - \lambda H(\pi). \qquad (3)$$

This is a relaxed formulation of the original optimal transport problem with a cost function c, in which the negative Shannon entropy is used as a regularizer. For a small regularization parameter λ, the regularized cost can approximate the p-th power of the Wasserstein distance between two discrete probability measures (taking c = d^p), and it can be computed efficiently by Sinkhorn's algorithm [11].
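To make the iteration concrete, the following is a minimal sketch of Sinkhorn's algorithm for discrete measures in Python; the function name, grid, and parameter values are illustrative rather than taken from the paper, and a log-domain implementation would be needed for very small λ.

```python
import numpy as np

def sinkhorn(a, b, C, lam, n_iter=1000, tol=1e-9):
    """Sinkhorn iterations for entropy-regularized OT with discrete marginals a, b
    (probability vectors), cost matrix C, and regularization lam.
    Returns the transport cost <P, C> at the regularized optimal coupling P."""
    K = np.exp(-C / lam)                      # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        v = b / (K.T @ u)                     # scale columns to match b
        u_new = a / (K @ v)                   # scale rows to match a
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    P = u[:, None] * K * v[None, :]           # approximate optimal coupling
    return np.sum(P * C), P

# Example: two discretized one-dimensional Gaussians on a common grid.
x = np.linspace(-5, 5, 200)
a = np.exp(-0.5 * x**2); a /= a.sum()                 # N(0, 1)
b = np.exp(-0.5 * (x - 1)**2 / 2); b /= b.sum()       # N(1, 2)
C = (x[:, None] - x[None, :])**2                      # squared Euclidean cost
cost, P = sinkhorn(a, b, C, lam=1.0)
```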
More recently, many studies have been published on improving the computational efficiency. According to [12], the most computationally efficient algorithm at this moment to solve the linear program for the Wasserstein distance is the Lee–Sidford linear programming solver [13]. Reference [14] proved a complexity bound for the Sinkhorn algorithm in terms of the desired absolute accuracy. After [10] appeared, various algorithms were proposed. For example, Reference [15] adopted stochastic optimization schemes for solving the optimal transport problem. The Greenkhorn algorithm [16] is a greedy variant of the Sinkhorn algorithm, and Reference [12] proposed its acceleration. Many other approaches, such as adapting a variety of standard optimization algorithms to approximate the optimal transport problem, can be found in [12,17,18,19]. Several specialized Newton-type algorithms [20,21] achieve the best known complexity bounds at the present moment [22,23].
Moreover, entropy-regularized optimal transport has another advantage. Because of the differentiability of the entropy-regularized optimal transport cost and the simple structure of Sinkhorn's algorithm, we can easily compute the gradient of the entropy-regularized optimal transport cost and optimize the parameters of a parametrized probability distribution by using numerical or automatic differentiation. We can then define a differentiable loss function that can be applied to various supervised learning methods [24]. Entropy-regularized optimal transport can be used to approximate not only the Wasserstein distance, but also its optimal coupling as a mapping function. Reference [25] adopted the optimal coupling of the entropy-regularized optimal transport as a mapping function from one domain to another.
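As an illustration of this differentiability, the following sketch differentiates a Sinkhorn-type loss through the iterations with PyTorch's automatic differentiation and uses the gradient to fit the mean of a discretized Gaussian; the setup, parameter names, and step sizes are illustrative assumptions rather than the authors' procedure.

```python
import torch

def sinkhorn_cost(a, b, C, lam, n_iter=200):
    """Transport cost <P, C> at the entropy-regularized coupling, differentiable in a, b, C."""
    K = torch.exp(-C / lam)
    u = torch.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return (P * C).sum()

# Fit the mean of a discretized Gaussian to a target by gradient descent.
x = torch.linspace(-5.0, 5.0, 100)
C = (x[:, None] - x[None, :]) ** 2
b = torch.softmax(-0.5 * (x - 2.0) ** 2, dim=0)        # target: N(2, 1) on the grid
theta = torch.tensor(0.0, requires_grad=True)           # model mean
opt = torch.optim.SGD([theta], lr=0.1)
for _ in range(50):
    a = torch.softmax(-0.5 * (x - theta) ** 2, dim=0)   # model: N(theta, 1) on the grid
    loss = sinkhorn_cost(a, b, C, lam=1.0)
    opt.zero_grad()
    loss.backward()                                     # gradient flows through the iterations
    opt.step()
```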
Despite the empirical success of the entropy-regularized optimal transport, its theoretical aspect is less understood. Reference [26] studied the expected Wasserstein distance between a probability measure and its empirical version. Similarly, Reference [27] showed the consistency of the entropy-regularized optimal transport cost between two empirical distributions. Reference [28] showed that minimizing the entropy-regularized optimal transport cost between empirical distributions is equivalent to a type of maximum likelihood estimator. Reference [29] considered Wasserstein generative adversarial networks with an entropy regularization. Reference [30] constructed information geometry from the convexity of the entropy-regularized optimal transport cost.
Our motivation in this study is to derive an analytical solution of the entropy-regularized optimal transport problem between continuous probability measures so that we can gain insight into the effects of entropy regularization in a theoretical as well as an experimental way. In our study, we generalize the Wasserstein distance between two multivariate normal distributions by entropy regularization. We derive the explicit form of the entropy-regularized optimal transport cost and its optimal coupling, which can be used to analyze the effect of entropy regularization directly. In general, the nonregularized Wasserstein distance between two probability measures and its optimal coupling cannot be expressed in a closed form; however, Reference [31] proved an explicit formula for multivariate normal distributions. Theorem 1 is a generalized form of [31]: we obtain an explicit form of the entropy-regularized optimal transport between two multivariate normal distributions. Furthermore, by adopting the Tsallis entropy [32] as the regularizer instead of the Shannon entropy, our theorem can be generalized to multivariate q-normal distributions.
Some readers may find it strange to study the entropy-regularized optimal transport for multivariate normal distributions, where the exact (nonregularized) optimal transport has been obtained explicitly. However, we think it is worth studying from several perspectives:
Normal distributions are the simplest and best-studied probability distributions, and thus, it is useful to examine the regularization theoretically in order to infer results for other distributions. In particular, we will partly answer the questions “How much do entropy constraints affect the results?” and “What does it mean to constrain by the entropy?’’ for the simplest cases. Furthermore, as a first step in constructing a theory for more general probability distributions, in Section 4, we propose a generalization to multivariate q-normal distributions.
Because normal distributions are the limit distributions in asymptotic theories using the central limit theorem, studying normal distributions is necessary for the asymptotic theory of regularized Wasserstein distances and estimators computed by them. Moreover, it was proposed to use the entropy-regularized Wasserstein distance to compute a lower bound of the generalization error for a variational autoencoder [29]. The study of the asymptotic behavior of such bounds is one of the expected applications of our results.
Though this has not yet been proven theoretically, we suspect that entropy regularization is efficient not only for computational reasons, such as the use of the Sinkhorn algorithm, but also in the sense of efficiency in statistical inference. Such a phenomenon can be found in some existing studies, including [33]. Such statistical efficiency is confirmed by some experiments in Section 6.
The remainder of this paper is organized as follows. First, we review some definitions of optimal transport and entropy regularization in Section 2. Then, in Section 3, we provide an explicit form of the entropy-regularized optimal transport cost and its optimal coupling between two multivariate normal distributions. We extend this result to q-normal distributions with Tsallis entropy regularization in Section 4. In Section 5, we obtain, in Theorem 3, the entropy-regularized Kantorovich estimator for probability measures on ℝⁿ with a finite second moment that are absolutely continuous with respect to the Lebesgue measure. We emphasize that Theorem 3 is not limited to the case of multivariate normal distributions, but can handle a wider range of probability measures. We also analyze experimentally how entropy regularization affects the optimal result in the corresponding sections.
We note that after publishing the preprint version of this paper, we found closely related results [34,35] reported within half a year. Janati et al. [34] proved the same result as Theorem 1 by solving the fixed-point equation behind Sinkhorn's algorithm. Their results include the unbalanced optimal transport between unbalanced multivariate normal distributions. They also studied the convexity and differentiability of the objective function of the entropy-regularized optimal transport. In [35], the same closed form as Theorem 1 was proven by ingeniously using the Schrödinger system. Although there are some overlaps, our paper has significant novelty in the following respects. Our proof is more direct than theirs and extends directly to the proof for the Tsallis entropy-regularized optimal transport between multivariate q-normal distributions provided in Section 4. Furthermore, Corollaries 1 and 2 are novel and important results for evaluating how much the entropy regularization affects the estimation results. We also obtain the entropy-regularized Kantorovich estimator in Theorem 3.
2. Preliminary
In this section, we review some definitions of optimal transport and entropy-regularized optimal transport. These definitions follow [1,36]. In this section, we use a tuple (M, 𝒜) for a set M and a σ-algebra 𝒜 on M, and 𝒫(X) for the set of all probability measures on a measurable space X.
Definition 1
(Pushforward measure). Given measurable spaces (X, 𝒜) and (Y, ℬ), a measure μ ∈ 𝒫(X), and a measurable mapping φ: X → Y, the pushforward measure of μ by φ is defined by:

$$\varphi_{\#}\mu(B) = \mu\!\left(\varphi^{-1}(B)\right) \quad \text{for all } B \in \mathcal{B}. \qquad (4)$$
Definition 2
(Optimal transport map). Consider a measurable space (M, 𝒜), and let c: M × M → [0, ∞) denote a cost function. Given μ, ν ∈ 𝒫(M), we call φ the optimal transport map if φ realizes the infimum of:

$$\inf_{\varphi:\, \varphi_{\#}\mu = \nu} \int_M c(x, \varphi(x)) \, d\mu(x). \qquad (5)$$
This problem was originally formalized by [37]. However, the optimal transport map does not always exist. Then, Kantorovich considered a relaxation of this problem in [38].
Definition 3
(Coupling). Given μ, ν ∈ 𝒫(M), a coupling of μ and ν is a probability measure π on M × M that satisfies:

$$\pi(A \times M) = \mu(A), \quad \pi(M \times B) = \nu(B) \quad \text{for all measurable } A, B \subseteq M. \qquad (6)$$
Definition 4
(Kantorovich problem). The Kantorovich problem is defined as finding a coupling π of μ and ν that realizes the infimum of:

$$\inf_{\pi} \int_{M \times M} c(x, y) \, d\pi(x, y). \qquad (7)$$
Hereafter, let Π(μ, ν) be the set of all couplings of μ and ν. When we adopt a distance function as the cost function, we can define the Wasserstein distance.
Definition 5
(Wasserstein distance). Given p ≥ 1, a measurable metric space (M, d), and μ, ν ∈ 𝒫(M) with a finite p-th moment, the p-Wasserstein distance between μ and ν is defined as:

$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\pi(x, y) \right)^{1/p}. \qquad (8)$$
Now, we review the definition of entropy-regularized optimal transport on ℝⁿ.
Definition 6
(Entropy-regularized optimal transport). Let λ > 0, μ, ν ∈ 𝒫(ℝⁿ), and let p(x, y) be the density function of a coupling π of μ and ν, whose reference measure is the Lebesgue measure. We define the entropy-regularized optimal transport cost as:

$$\mathrm{OT}_\lambda(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int c(x, y) \, p(x, y) \, dx \, dy \; - \; \lambda H(\pi), \qquad (9)$$

where H(π) denotes the Shannon (differential) entropy of a probability measure:

$$H(\pi) = -\int p(x, y) \log p(x, y) \, dx \, dy. \qquad (10)$$
There is another variant of entropy-regularized optimal transport, defined by the relative entropy instead of the Shannon entropy:

$$\widetilde{\mathrm{OT}}_\lambda(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int c(x, y) \, d\pi(x, y) + \lambda \, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu). \qquad (11)$$

This version is definable even when Π(μ, ν) includes a coupling that is not absolutely continuous with respect to the Lebesgue measure. We note that when both μ and ν are absolutely continuous, the infimum is attained by the same π for (9) and (11), because the two objectives differ only by a term that depends only on μ and ν. In the following part of the paper, we assume the absolute continuity of μ, ν, and π with respect to the Lebesgue measure so that the entropy regularization is well defined.
3. Entropy-Regularized Optimal Transport between Multivariate Normal Distributions
In this section, we provide a rigorous solution of entropy-regularized optimal transport between two multivariate normal distributions. Throughout this section, we adopt the squared Euclidean distance as the cost function. To prove our theorem, we start by expressing the transport cost using mean vectors and covariance matrices. The following lemma is a known result; see, for example, [31].
Lemma 1.
Let X ∼ P and Y ∼ Q be two random variables on ℝⁿ with means m₁, m₂ and covariance matrices Σ₁, Σ₂, respectively. If π is a coupling of P and Q with cross-covariance matrix C = E[(X − m₁)(Y − m₂)ᵀ], we have:

$$\mathbb{E}_\pi\!\left[\|X - Y\|^2\right] = \|m_1 - m_2\|^2 + \mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) - 2\,\mathrm{tr}(C). \qquad (12)$$
Proof.
Without loss of generality, we can assume X and Y are centralized, because:

$$\mathbb{E}\!\left[\|X - Y\|^2\right] = \|m_1 - m_2\|^2 + \mathbb{E}\!\left[\|(X - m_1) - (Y - m_2)\|^2\right]. \qquad (13)$$

Therefore, for centered X and Y, we have:

$$\mathbb{E}\!\left[\|X - Y\|^2\right] = \mathbb{E}\!\left[\|X\|^2\right] + \mathbb{E}\!\left[\|Y\|^2\right] - 2\,\mathbb{E}\!\left[\langle X, Y\rangle\right] = \mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) - 2\,\mathrm{tr}(C). \qquad (14)$$

By adding ‖m₁ − m₂‖², we obtain (12). □
Lemma 1 shows that the transport cost can be parameterized by the covariance matrices Σ₁, Σ₂ and the cross-covariance matrix C. Because Σ₁ and Σ₂ are fixed, the infinite-dimensional optimization over the coupling reduces to a finite-dimensional optimization over the cross-covariance matrix C.
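As a quick sanity check of (12), the following sketch builds one feasible Gaussian coupling with a prescribed cross-covariance and compares a Monte Carlo estimate of E‖X − Y‖² with the right-hand side; the particular choice of cross-covariance is only an illustrative assumption that keeps the joint covariance positive semi-definite.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
n = 3
m1, m2 = rng.normal(size=n), rng.normal(size=n)
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
S1, S2 = A @ A.T + np.eye(n), B @ B.T + np.eye(n)

# One feasible cross-covariance: C = t * S1^{1/2} S2^{1/2} with |t| <= 1
# keeps the joint covariance positive semi-definite.
C = 0.5 * np.real(sqrtm(S1)) @ np.real(sqrtm(S2))
J = np.block([[S1, C], [C.T, S2]])
J = 0.5 * (J + J.T)                                   # symmetrize numerically

Z = rng.multivariate_normal(np.concatenate([m1, m2]), J, size=200_000)
X, Y = Z[:, :n], Z[:, n:]
lhs = np.mean(np.sum((X - Y) ** 2, axis=1))
rhs = np.sum((m1 - m2) ** 2) + np.trace(S1) + np.trace(S2) - 2 * np.trace(C)
print(lhs, rhs)   # the two values agree up to Monte Carlo error
```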
We prepare the following lemma to prove Theorem 1.
Lemma 2.
Under a fixed mean and covariance matrix, the probability measure that maximizes the entropy is a multivariate normal distribution.
Lemma 2 is a particular case of the principle of maximum entropy [39], and the proof can be found in [40] Theorem 3.1.
Theorem 1.
Let P = N(m₁, Σ₁) and Q = N(m₂, Σ₂) be two multivariate normal distributions on ℝⁿ. The optimal coupling π of P and Q for the entropy-regularized optimal transport:
(15) is expressed as:
(16) where:
(17) Furthermore, the entropy-regularized optimal transport cost can be written as:
(18) and the relative entropy version can be written as:
(19)
We note that we use the regularization parameter in for the sake of simplicity.
Proof.
Although the first half of the proof can be derived directly from Lemma 2, we provide a proof of this theorem by Lagrange calculus, which will be used later for the extension to q-normal distributions. Now, we define an optimization problem that is equivalent to the entropy-regularized optimal transport as follows:
(20)
(21) Here, the constraint functions are the probability density functions of P and Q, respectively. We introduce Lagrange multiplier functions corresponding to the above two constraints. The Lagrangian function of (20) is defined as:
(22) Taking the functional derivative of (22) with respect to , we obtain:
(23) By the fundamental lemma of the calculus of variations, we have:
(24) Here, the Lagrange multipliers are determined from the constraints (21). We can assume that the optimal coupling is a 2n-variate normal distribution, because for a fixed covariance matrix, the objective attains its infimum when the coupling is a multivariate normal distribution by Lemma 2. Therefore, we can express the coupling density by using the mean vectors and a covariance matrix as:
(25) Putting:
(26) we write:
(27)
(28) According to block matrix inversion formula [41], holds, where is positive definite. Then, comparing the term between (24) and (28), we obtain and:
(29) Here, holds, because A is a symmetric matrix, and thus, we obtain:
(30) Completing the square of the above equation, we obtain:
(31) Let Q be an orthogonal matrix; then, (31) can be solved as:
(32) We rearrange the above equation as follows:
(33) Because the left terms and are all symmetric positive definite, we can conclude that Q is the identity matrix by the uniqueness of the polar decomposition. Finally, we obtain:
(34) We obtain (18) by the direct calculation of using Lemma 1 with this . □
The following corollary helps us to understand the properties of the optimal coupling.
Corollary 1.
Let be the eigenvalues of ; then, monotonically decreases with λ for any .
Proof.
Because has the same eigenvalues as , if we let be the eigenvalues of , , which is a monotonically decreasing function of the regularization parameter . □
By the proof, for large λ, a diagonalization argument shows that each element of the cross-covariance matrix of the optimal coupling converges to zero as λ → ∞.
We show how entropy regularization behaves in two simple experiments. In Figure 1, we calculate the entropy-regularized optimal transport cost in both the original version and the relative entropy version. We separate the entropy-regularized optimal transport cost into the transport cost term and the regularization term and display both of them.
Figure 1.
Graph of the entropy-regularized optimal transport cost between two normal distributions with respect to the regularization parameter λ from zero to 10.
It is reasonable that, as λ → 0, the optimal coupling converges to the original optimal coupling of nonregularized optimal transport and that, as λ → ∞, the cross-covariance converges to 0; this is a special case of Corollary 1. The larger λ becomes, the less correlated the optimal coupling is. We visualize this behavior by computing the optimal couplings of two one-dimensional normal distributions in Figure 2.
Figure 2.
Contours of the density functions of the entropy-regularized optimal coupling of two one-dimensional normal distributions for three different regularization parameters λ. All of the optimal couplings are two-variate normal distributions.
The left panel shows the original version. The transport cost is always positive, and the entropy regularization term can take both signs in general; the sign of the total cost therefore depends on their balance. We note that the transport cost as a function of λ is bounded, whereas the entropy regularization term is not. The boundedness of the optimal cost is deduced from (1) and Corollary 1, and the unboundedness of the entropy regularization term is due to the regularization parameter multiplied by the entropy. The right panel shows the relative entropy version, which always takes a non-negative value. Furthermore, because the total cost is bounded by the value for the independent joint distribution (which is always a feasible coupling), both the transport cost and the relative entropy regularization term are also bounded. Nevertheless, the larger the regularization parameter λ, the greater the influence of entropy regularization on the total cost.
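The decorrelation effect can also be checked numerically on a discretization of two one-dimensional normal distributions, reusing the sinkhorn() helper sketched in the Introduction; the grid and the values of λ below are illustrative choices.

```python
import numpy as np

x = np.linspace(-4, 4, 150)
a = np.exp(-0.5 * x**2); a /= a.sum()              # N(0, 1)
b = np.exp(-0.5 * (x - 1)**2); b /= b.sum()        # N(1, 1)
C = (x[:, None] - x[None, :])**2

for lam in [0.5, 2.0, 8.0]:
    _, P = sinkhorn(a, b, C, lam)                  # coupling on the grid
    mx = (P.sum(axis=1) * x).sum()
    my = (P.sum(axis=0) * x).sum()
    cov = ((x[:, None] - mx) * (x[None, :] - my) * P).sum()
    sx = np.sqrt((P.sum(axis=1) * (x - mx) ** 2).sum())
    sy = np.sqrt((P.sum(axis=0) * (x - my) ** 2).sum())
    print(lam, cov / (sx * sy))  # the correlation of the coupling shrinks as lam grows
```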
It is known that a specific Riemannian metric can be defined on the space of multivariate normal distributions, which induces the Wasserstein distance [42]. To understand the effect of entropy regularization, we illustrate how entropy regularization deforms this geometric structure in Figure 3. Here, we generate 100 two-variate normal distributions, whose covariance matrices are defined as:
| (35) |
Figure 3.
Multidimensional scaling of two-variate normal distributions. The pairwise dissimilarities are given by the square root of the entropy-regularized optimal transport cost for three different regularization parameters λ. Each ellipse in the figure represents a contour of the density function of the corresponding distribution.
To visualize the geometric structure of these two-variate normal distributions, we compute the relative entropy-regularized optimal transport cost between each pair of two-variate normal distributions. Then, we apply multidimensional scaling [43] to embed them into a plane (see Figure 3). We can see that entropy regularization deforms the geometric structure of the space of multivariate normal distributions. The deformation for distributions close to the isotropic normal distribution is more sensitive to the change in λ.
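The embedding step can be reproduced with a standard multidimensional scaling routine applied to a precomputed dissimilarity matrix; the following is a minimal sketch in which the matrix of pairwise regularized transport costs is assumed to be computed beforehand (for example, with the closed form of Theorem 1).

```python
import numpy as np
from sklearn.manifold import MDS

def embed_pairwise_costs(dists, seed=0):
    """dists[i, j]: square root of the regularized OT cost between the i-th and
    j-th distributions (a symmetric, zero-diagonal dissimilarity matrix)."""
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(dists)   # (N, 2) planar coordinates for plotting
```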
The following corollary states that if we allow orthogonal transformations of two multivariate normal distributions with fixed covariance matrices, then the minimum and maximum of the cost are attained when the two covariance matrices are diagonalizable by the same orthogonal matrix or, equivalently, when the ellipsoidal contours of the two density functions are aligned along the same orthogonal axes.
Corollary 2.
With the same settings as in Theorem 1, fix , , , and all eigenvalues of . When is diagonalized as , where is the diagonal matrix of the eigenvalues of in descending order and Γ is an orthogonal matrix,
- (i)
is minimized by and
- (ii)
is maximized by ,
where and are the diagonal matrices of the eigenvalues of in descending and ascending order, respectively. Therefore, neither the minimizer, nor the maximizer depend on the choice of λ.
Proof.
Because , , , and all eigenvalues of are fixed,
(36)
(37)
(38) where are the eigenvalues of and:
(39) Note that is a concave function, because:
(40) Let and be the eigenvalues of and , respectively. By Exercise 6.5.3 of [44] or Theorem 6.13 and Corollary 6.14 of [45],
(41) Here, for such that and , means:
(42) and is said to be majorized by . Because is concave,
(43) where represents weak supermajorization, i.e., means:
(44) (see Theorem 5.A.1 of [46], for example). Therefore,
(45) As in Case (i) (or (ii)), the eigenvalues of correspond to the eigenvalues of (or , respectively), the corollary follows. □
Note that a special case of Corollary 2 for the ordinary Wasserstein metric (λ = 0) has been studied in the context of fidelity and the Bures distance in quantum information theory; see Lemma 3 of [47]. Their proof is not directly applicable to our generalized result; thus, we used another approach to prove it.
4. Extension to Tsallis Entropy Regularization
In this section, we consider a generalization of entropy-regularized optimal transport. We now focus on the Tsallis entropy [32], which is a generalization of the Shannon entropy and appears in nonequilibrium statistical mechanics. We show that the optimal coupling of Tsallis entropy-regularized optimal transport between two q-normal distributions is also a q-normal distribution. We start by recalling the definition of the q-exponential function and q-logarithmic function based on [32].
Definition 7.
Let q be a real parameter with q ≠ 1, and let x > 0. The q-logarithmic function is defined as:

$$\ln_q(x) = \frac{x^{1-q} - 1}{1 - q}, \qquad (46)$$

and the q-exponential function is defined as:

$$\exp_q(x) = \left[\,1 + (1 - q)x\,\right]_+^{\frac{1}{1-q}}, \qquad (47)$$

where [·]₊ = max(·, 0).
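A small sketch of these two functions in code, with a numerical check that they are mutually inverse and recover log and exp as q → 1; the function names are illustrative.

```python
import numpy as np

def q_log(x, q):
    """Tsallis q-logarithm; recovers log(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (np.asarray(x, dtype=float) ** (1.0 - q) - 1.0) / (1.0 - q)

def q_exp(x, q):
    """Tsallis q-exponential; recovers exp(x) as q -> 1.
    The positive-part cutoff yields compact support for q < 1."""
    if np.isclose(q, 1.0):
        return np.exp(x)
    base = np.maximum(1.0 + (1.0 - q) * np.asarray(x, dtype=float), 0.0)
    return base ** (1.0 / (1.0 - q))

# sanity checks
assert np.allclose(q_exp(q_log(2.0, q=1.5), q=1.5), 2.0)
assert np.allclose(q_log(np.e, q=1.0 + 1e-4), 1.0, atol=1e-3)
```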
Definition 8.
Let q be a real parameter in an appropriate range; an n-variate q-normal distribution is defined by two parameters, a location vector μ ∈ ℝⁿ and a positive definite matrix Σ, and its density function is:

(48)

where the prefactor is a normalizing constant. μ and Σ are called the location vector and scale matrix, respectively.
In the following, we write the multivariate q-normal distribution as N_q(μ, Σ). We note that the properties of the q-normal distribution change in accordance with q. The q-normal distribution has an unbounded support for q ≥ 1 and a bounded support for q < 1. The second moment exists for q below a dimension-dependent threshold, and the covariance is then a constant multiple of Σ. We remark that each n-variate q-normal distribution with q > 1 is equivalent to an n-variate t-distribution whose degrees of freedom are determined by q and n,

(49)

and to an n-variate normal distribution for q = 1.
Definition 9.
Let p be a probability density function. The Tsallis entropy is defined as:

$$S_q(p) = \frac{1}{q - 1}\left(1 - \int p(x)^q \, dx\right). \qquad (50)$$
Then, the Tsallis entropy-regularized optimal transport is defined as:
| (51) |
| (52) |
The following lemma is a generalization of the maximum entropy principle for the Shannon entropy shown in Section 2 of [48].
Lemma 3.
Let P be a centered n-dimensional probability measure with a fixed covariance matrix Σ; the maximizer of the Rényi α-entropy:

$$H_\alpha(p) = \frac{1}{1 - \alpha} \log \int p(x)^\alpha \, dx \qquad (53)$$

under this covariance constraint is a multivariate q-normal distribution (with q corresponding to α) for α in an appropriate range.
We note that the maximizers of the Rényi α-entropy and the Tsallis entropy with the same index coincide; thus, the above lemma also holds for the Tsallis entropy. This is mentioned, for example, in Section 9 of [49].
To prove Theorem 2, we use the following property of multivariate t-distributions, which is summarized in Chapter 1 of [50].
Lemma 4.
Let X be a random vector following an n-variate t-distribution with degrees of freedom ν, mean vector μ, and scale matrix Σ. Consider the partition:

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. \qquad (54)$$

Then, X₁ follows a p-variate t-distribution with degrees of freedom ν, mean vector μ₁, and scale matrix Σ₁₁, where p is the dimension of X₁.
Recalling the correspondence between the parameter q of the multivariate q-normal distribution and the degrees of freedom of the multivariate t-distribution, we can obtain the following corollary.
Corollary 3.
Let X be a random vector following an n-variate q-normal distribution with q > 1. Consider a partition of the mean vector μ and scale matrix Σ in the same way as in (54). Then, X₁ follows a p-variate q′-normal distribution whose mean vector and scale matrix are given by the corresponding blocks μ₁ and Σ₁₁, where p is the dimension of X₁ and q′ is determined by the common degrees of freedom.
Theorem 2.
Let P and Q be n-variate q-normal distributions, with q and the regularization parameter in their admissible ranges; consider the Tsallis entropy-regularized optimal transport:
(55) Then, there exists a unique constant c such that the optimal coupling π of the entropy-regularized optimal transport is expressed as:
(56) where:
(57)
Proof.
The proof proceeds in a similar way as in Theorem 1. Let and be the Lagrangian multipliers. Then, the Lagrangian function of (52) is defined as:
(58) and the extremum of the Tsallis entropy-regularized optimal transport is obtained by the functional derivative with respect to ,
(59) Here, and are quadratic polynomials by Lemma 3. To separate the normalizing constant, we introduce a constant , and can be written as:
(60) with quadratic functions and .
Let . Then, by the same argument as in the proof of Theorem 1 and using Corollary 3, we obtain the scale matrix of as:
(61) where:
(62) Let and ; can be written as:
(63) The constant c is determined by:
(64) We will show that the above equation has a unique solution. Let be the eigenvalues of ; can be expressed as . We consider:
(65)
(66) Because , is a monotonic decreasing function, and , , (64) has a unique positive solution, and is determined uniquely. □
5. Entropy-Regularized Kantorovich Estimator
Many estimators are defined by minimizing a divergence or distance between probability measures, that is, by minimizing D(P, Q) over Q for a fixed P. When D is the Kullback–Leibler divergence, the estimator corresponds to the maximum likelihood estimator. When D is the Wasserstein distance, the resulting estimator is called the minimum Kantorovich estimator, according to [36]. In this section, we consider a probability measure that minimizes the entropy-regularized optimal transport cost to a fixed P over 𝒫₂(ℝⁿ), the set of all probability measures on ℝⁿ with a finite second moment that are absolutely continuous with respect to the Lebesgue measure; we call this minimizer the entropy-regularized Kantorovich estimator. The entropy-regularized Kantorovich estimator for discrete probability measures was studied in [33], Theorem 2. We obtain the entropy-regularized Kantorovich estimator for continuous probability measures in the following theorem:
Theorem 3.
For a fixed P ∈ 𝒫₂(ℝⁿ), the minimizer

(67)

exists, and its density function can be written as:

(68)

where p is the density function of P, and ⋆ denotes the convolution operator.
We use the dual problem of the entropy-regularized optimal transport to prove Theorem 3 (for details, see Proposition 2.1 of [15] or Section 3 of [51]).
Lemma 5.
The dual problem of entropy-regularized optimal transport can be written as:
(69) Moreover, strong duality holds.
Now, we prove Theorem 3.
Proof.
Let be the minimizer of . Applying Lemma 5, there exist and such that:
(70) Now, is the minimum value of , such that the variation is always zero. Then,
(71) holds, and the optimal coupling of can be written as:
(72)
(73) Moreover, we can obtain a closed-form of as follows from the equation :
(74) Then, by calculating the marginal distribution of with respect to x, we can obtain:
(75) Therefore, we conclude that the probability measure Q that minimizes the entropy-regularized optimal transport cost is expressed as (75). □
It should be noted that when P in Theorem 3 is a multivariate normal distribution, the estimator and P are simultaneously diagonalizable as a direct consequence of the theorem. This is consistent with the result of Corollary 2(i) for minimization when all eigenvalues are fixed.
We can see that the entropy-regularized Kantorovich estimator is the given measure convolved with an isotropic multivariate normal distribution whose scale is governed by the regularization parameter λ. This is similar to the idea of prior distributions in the context of Bayesian inference. Applying Theorem 3, the entropy-regularized Kantorovich estimator of a multivariate normal distribution is again a multivariate normal distribution whose covariance matrix is inflated by an isotropic term depending on λ.
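A one-dimensional numerical sketch of this convolution structure is given below; the width of the Gaussian kernel is an illustrative assumption and not the exact constant appearing in (68).

```python
import numpy as np

x = np.linspace(-8, 8, 1601)
dx = x[1] - x[0]
p = np.exp(-0.5 * (x - 1.0) ** 2)
p /= p.sum() * dx                                 # density of P = N(1, 1) on the grid

lam = 0.5                                         # regularization parameter
kernel = np.exp(-0.5 * x ** 2 / lam)              # isotropic Gaussian kernel (assumed width ~ lam)
kernel /= kernel.sum() * dx
q_hat = np.convolve(p, kernel, mode="same") * dx  # estimator density: p convolved with the kernel
print(q_hat.sum() * dx)                           # ~1: still a probability density
```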
6. Numerical Experiments
In this section, we introduce experiments that show the statistical efficiency of entropy regularization in Gaussian settings. We consider two different setups, estimating covariance matrices (Section 6.1) and the entropy-regularized Wasserstein barycenter (Section 6.2). To obtain the entropy-regularized Wasserstein barycenter, we adopt the Newton–Schulz method and a manifold optimization method, which are explained in Section 6.3 and Section 6.4, respectively.
6.1. Estimation of Covariance Matrices
We provide a covariance estimation method based on entropy-regularized optimal transport. Let P be an n-variate normal distribution. We define an entropy-regularized Kantorovich estimator of its covariance matrix, that is,
| (76) |
We generate samples from P and estimate the mean and covariance matrix. We compare the maximum likelihood estimator and the entropy-regularized Kantorovich estimator with respect to the prediction error:
| (77) |
In our experiment, the dimension n takes three values, and the sample size is set to 60 or 120. The experiment proceeds as follows; a code sketch of the procedure appears after the list.
1. Obtain a random sample of size 60 (or 120) from P and compute its sample covariance matrix.
2. Obtain the entropy-regularized minimum Kantorovich estimator from the sample covariance matrix obtained in Step 1.
3. Compute the prediction error between the true covariance matrix and the entropy-regularized minimum Kantorovich estimator of Step 2.
4. Repeat the above steps 1000 times and obtain a confidence interval of the prediction error.
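The following is a minimal sketch of this Monte Carlo loop. The true covariance, the regularized estimator (a placeholder standing in for the closed form of (76)), and the Frobenius-norm error standing in for the prediction error (77) are all illustrative assumptions.

```python
import numpy as np

def run_trial(Sigma, n_samples, estimator, rng):
    n = Sigma.shape[0]
    X = rng.multivariate_normal(np.zeros(n), Sigma, size=n_samples)
    Xc = X - X.mean(axis=0)
    S_mle = Xc.T @ Xc / n_samples                  # maximum likelihood covariance
    S_hat = estimator(S_mle)                       # e.g., the regularized Kantorovich estimator
    return np.linalg.norm(S_hat - Sigma, ord="fro") ** 2

rng = np.random.default_rng(0)
n, n_samples, n_rep = 20, 60, 1000
Sigma = np.eye(n)                                  # illustrative true covariance
mle = lambda S: S                                  # lambda = 0 corresponds to the plain MLE
errs = [run_trial(Sigma, n_samples, mle, rng) for _ in range(n_rep)]
mean = np.mean(errs)
half = 1.96 * np.std(errs, ddof=1) / np.sqrt(n_rep)  # 95% confidence interval half-width
print(f"{mean:.3f} +/- {half:.3f}")
```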
Table 1 shows the average prediction error of the MLE and the entropy-regularized Kantorovich estimator of covariance matrices from 60 samples from an n-variate normal distribution, with the 95% confidence interval. We can see that the prediction error becomes smaller than that of the maximum likelihood estimator when λ is chosen adequately small, whereas a poorly chosen λ can increase it. Moreover, the decrease in the prediction error is larger in higher dimensions than in lower ones, which indicates that entropy regularization is effective in high dimensions. On the other hand, Table 2 shows that, in all cases, the decreases in the prediction error are more moderate than in Table 1. We consider that this is due to the increase in the sample size. We can therefore conclude that entropy regularization is effective in a high-dimensional setting with a small sample size.
Table 1.
Average prediction error of the MLE and entropy-regularized Kantorovich estimator of covariance matrices from 60 samples from an n-variate normal distribution, with the 95% confidence interval.
| λ | n (small) | n (medium) | n (large) |
|---|---|---|---|
| 0 (MLE) | 0.062 | 1.346 | 10.69 |
| 0.01 | 0.051 | 1.242 | 8.973 |
| 0.1 | 0.104 | 0.841 | 4.180 |
| 0.5 | 0.647 | 0.931 | 3.093 |
| 1.0 | 1.166 | 1.670 | 5.075 |
Table 2.
Average prediction error of the MLE and entropy-regularized Kantorovich estimator of covariance matrices from 120 samples from an n-variate normal distribution, with the 95% confidence interval.
| λ | n (small) | n (medium) | n (large) |
|---|---|---|---|
| 0 (MLE) | 0.024 | 0.490 | 2.810 |
| 0.01 | 0.020 | 0.459 | 2.528 |
| 0.1 | 0.101 | 0.397 | 1.700 |
| 0.5 | 0.659 | 0.875 | 2.833 |
| 1.0 | 1.180 | 1.730 | 5.124 |
6.2. Estimation of the Wasserstein Barycenter
A barycenter with respect to the Wasserstein distance is definable [52] and is widely used for image interpolation and 3D object interpolation tasks with entropy regularization [3,33].
Definition 10.
Let P₁, …, P_m be a set of probability measures in 𝒫₂(ℝⁿ). The barycenter with respect to the entropy-regularized optimal transport cost (entropy-regularized Wasserstein barycenter) is defined as:

(78)
Now, we restrict the Pᵢ and the barycenter to be multivariate normal distributions and apply our theorem to illustrate the effect of entropy regularization.
The experiment proceeds as follows. The dimensionality and the sample size were set the same as in the experiments in Section 6.1.
1. Obtain a random sample of size 60 (or 120) from P and compute its sample covariance matrix.
2. Repeat Step 1 three times to obtain three sample covariance matrices.
3. Obtain the entropy-regularized barycenter of the three estimates.
4. Compute the prediction error between the true covariance matrix and the barycenter obtained in Step 3.
5. Repeat the above steps 100 times and obtain a confidence interval of the prediction error.
We show the results for several values of the regularization parameter λ in Table 3 and Table 4. A decrease in the prediction error can be seen in Table 3 for small λ, as in Table 1 and Table 2. However, because the computation of the entropy-regularized Wasserstein barycenter uses more data than that of the minimum Kantorovich estimator, the decrease in the prediction error is mild. The entropy-regularized Kantorovich estimator is a special case of the entropy-regularized Wasserstein barycenter (78) with m = 1. Our experiments show that the appropriate range of λ for decreasing the prediction error depends on m and becomes narrow as m increases. In addition, we note that the decrease in the prediction error in Table 4 is small.
Table 3.
Average prediction error of the entropy-regularized barycenter with the 95% confidence interval (random sample of size 60).
| λ | n (small) | n (medium) | n (large) |
|---|---|---|---|
| 0 | 0.455 | 1.318 | 4.875 |
| 0.001 | 0.429 | 1.318 | 4.887 |
| 0.01 | 0.434 | 1.344 | 4.551 |
| 0.025 | 0.780 | 1.456 | 5.710 |
| 0.005 | 1.047 | 1.537 | 7.570 |
Table 4.
Average prediction error of the entropy-regularized barycenter with the 95% confidence interval (random sample of size 120).
| λ | n (small) | n (medium) | n (large) |
|---|---|---|---|
| 0 | 0.154 | 1.303 | 5.091 |
| 0.001 | 0.212 | 1.305 | 5.072 |
| 0.01 | 0.306 | 1.328 | 5.274 |
| 0.025 | 0.671 | 1.337 | 5.851 |
| 0.005 | 1.109 | 1.603 | 8.072 |
6.3. Gradient Descent on the Manifold of Positive Definite Matrices
We use a gradient descent method to compute the entropy-regularized barycenter. Applying the gradient descent method to a loss function defined by the Wasserstein distance was proposed in [4]. This idea is extendable to entropy-regularized optimal transport; the detailed algorithm is shown below. Because the objective is a function of a positive definite matrix, we use a manifold gradient descent algorithm on the manifold of positive definite matrices.
We review the manifold gradient descent algorithm used in our numerical experiments. Let 𝒮ⁿ₊₊ be the manifold of n-dimensional positive definite matrices. We require a formula for the gradient operator and the inner product on 𝒮ⁿ₊₊ in the gradient descent algorithm. In this paper, we use the following inner product from [44], Chapter 6. For a fixed X ∈ 𝒮ⁿ₊₊, we define an inner product of tangent vectors as:
| (79) |
Equation (79) is the best choice in terms of the convergence speed according to [53]. Let f be a differentiable matrix function. Then, the induced gradient of f under (79) is:
| (80) |
We consider the updating step after obtaining the gradient of f. The gradient is an element of the tangent space, and we have to map it back to 𝒮ⁿ₊₊. This map is called a retraction. It is known that the Riemannian metric leads to a retraction expressed through the matrix exponential. The corresponding gradient descent method is shown in Algorithm 1.
6.4. Approximating the Matrix Square Root
To compute the gradient of the square root of a matrix in the objective function, we approximate it using the Newton–Schulz method [54], which can be implemented with only matrix multiplications, as shown in Algorithm 2. It is amenable to automatic differentiation, so we can easily apply the gradient descent method to our algorithm.
Algorithm 1: Gradient descent on the manifold of positive definite matrices.
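Since the listing of Algorithm 1 is not reproduced here, the following is a sketch of manifold gradient descent on positive definite matrices under the affine-invariant metric, an assumption consistent with the references above but not necessarily the authors' exact formulation; the objective, step size, and iteration count are illustrative.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def spd_gradient_descent(f_grad, X0, lr=0.03, n_iter=300):
    """Gradient descent on SPD matrices: Riemannian gradient X G X under the
    affine-invariant metric and the exponential-map retraction."""
    X = X0.copy()
    for _ in range(n_iter):
        G = f_grad(X)
        G = 0.5 * (G + G.T)                       # keep the Euclidean gradient symmetric
        Xh = np.real(sqrtm(X))
        X = Xh @ expm(-lr * Xh @ G @ Xh) @ Xh     # move along -grad while staying SPD
    return X

# Example objective: f(X) = ||X - A||_F^2, whose Euclidean gradient is 2(X - A).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A = Q @ np.diag([1.0, 2.0, 3.0, 4.0]) @ Q.T
X = spd_gradient_descent(lambda X: 2.0 * (X - A), np.eye(4))
print(np.linalg.norm(X - A))   # close to zero
```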
Algorithm 2: Newton–Schulz method.
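As with Algorithm 1, the listing is not reproduced here; below is a standard formulation of the coupled Newton–Schulz iteration for the matrix square root, using only matrix products so that automatic differentiation applies. The normalization and iteration count are illustrative and may differ from the authors' Algorithm 2.

```python
import numpy as np

def newton_schulz_sqrt(A, n_iter=30):
    """Approximate the square root of an SPD matrix A by the coupled
    Newton-Schulz iteration (Y_k -> A^{1/2}, Z_k -> A^{-1/2} after rescaling)."""
    n = A.shape[0]
    norm = np.linalg.norm(A)          # Frobenius norm; scaling ensures convergence
    Y = A / norm
    Z = np.eye(n)
    for _ in range(n_iter):
        T = 0.5 * (3.0 * np.eye(n) - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    return Y * np.sqrt(norm)

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B @ B.T + np.eye(5)
S = newton_schulz_sqrt(A)
print(np.linalg.norm(S @ S - A))      # small residual
```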
7. Conclusions and Future Work
In this paper, we studied entropy-regularized optimal transport and derived several results. We summarize them as follows and add notes on future work.
We obtained the explicit form of entropy-regularized optimal transport between two multivariate normal distributions and derived Corollaries 1 and 2, which clarify the properties of the optimal coupling. Furthermore, we demonstrated experimentally how entropy regularization affects the Wasserstein distance, the optimal coupling, and the geometric structure of multivariate normal distributions. Overall, the properties of the optimal coupling were revealed both theoretically and experimentally. We expect that the explicit formula can replace the existing methodology using the (nonregularized) Wasserstein distance between normal distributions (for example, [4,5]).
Theorem 2 derives the explicit form of the optimal coupling of the Tsallis entropy-regularized optimal transport between multivariate q-normal distributions. This optimal coupling is itself a multivariate q-normal distribution, and the result is analogous to that for the normal distribution. We believe that this result can be extended to other elliptical distribution families.
The entropy-regularized Kantorovich estimator of a probability measure in 𝒫₂(ℝⁿ) is the convolution of its own density function with an isotropic multivariate normal distribution. Our experiments show that both the entropy-regularized Kantorovich estimator and the entropy-regularized Wasserstein barycenter of multivariate normal distributions outperform the maximum likelihood estimator in terms of the prediction error for an adequately selected λ in a high-dimensional, small-sample setting. As future work, we want to show the efficiency of entropy regularization using real data.
Author Contributions
Conceptualization, Q.T.; methodology, Q.T.; software, Q.T.; writing—original draft preparation, Q.T.; writing—review and editing, K.K.; supervision, K.K. Both authors read and agreed to the published version of the manuscript.
Funding
This work was supported by RIKEN AIP and JSPS KAKENHI (JP19K03642, JP19K00912).
Data Availability Statement
All the data used are artificial and generated by pseudo-random numbers.
Conflicts of Interest
The authors declare no conflict of interest.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Villani, C. Optimal Transport: Old and New; Volume 338; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008.
- 2. Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover's Distance as a metric for image retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. doi:10.1023/A:1026543900054.
- 3. Solomon, J.; De Goes, F.; Peyré, G.; Cuturi, M.; Butscher, A.; Nguyen, A.; Du, T.; Guibas, L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph. 2015, 34, 1–11. doi:10.1145/2766963.
- 4. Muzellec, B.; Cuturi, M. Generalizing point embeddings using the Wasserstein space of elliptical distributions. In Advances in Neural Information Processing Systems; 2018; pp. 10237–10248.
- 5. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017.
- 6. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 214–223.
- 7. Nitanda, A.; Suzuki, T. Gradient layer: Enhancing the convergence of adversarial training for generative models. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Playa Blanca, Spain, 9–11 April 2018; pp. 1008–1016.
- 8. Sonoda, S.; Murata, N. Transportation analysis of denoising autoencoders: A novel method for analyzing deep neural networks. arXiv 2017, arXiv:1712.04145.
- 9. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103.
- 10. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems; 2013; pp. 2292–2300.
- 11. Sinkhorn, R.; Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 1967, 21, 343–348. doi:10.2140/pjm.1967.21.343.
- 12. Lin, T.; Ho, N.; Jordan, M.I. On the efficiency of the Sinkhorn and Greenkhorn algorithms and their acceleration for optimal transport. arXiv 2019, arXiv:1906.01437.
- 13. Lee, Y.T.; Sidford, A. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, Philadelphia, PA, USA, 18–21 October 2014; pp. 424–433.
- 14. Dvurechensky, P.; Gasnikov, A.; Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1367–1376.
- 15. Aude, G.; Cuturi, M.; Peyré, G.; Bach, F. Stochastic optimization for large-scale optimal transport. arXiv 2016, arXiv:1605.08527.
- 16. Altschuler, J.; Weed, J.; Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. arXiv 2017, arXiv:1705.09634.
- 17. Blondel, M.; Seguy, V.; Rolet, A. Smooth and sparse optimal transport. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Playa Blanca, Spain, 9–11 April 2018; pp. 880–889.
- 18. Cuturi, M.; Peyré, G. A smoothed dual approach for variational Wasserstein problems. SIAM J. Imaging Sci. 2016, 9, 320–343. doi:10.1137/15M1032600.
- 19. Lin, T.; Ho, N.; Jordan, M. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 3982–3991.
- 20. Allen-Zhu, Z.; Li, Y.; Oliveira, R.; Wigderson, A. Much faster algorithms for matrix scaling. In Proceedings of the 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, USA, 15–17 October 2017; pp. 890–901.
- 21. Cohen, M.B.; Madry, A.; Tsipras, D.; Vladu, A. Matrix scaling and balancing via box constrained Newton's method and interior point methods. In Proceedings of the 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, USA, 15–17 October 2017; pp. 902–913.
- 22. Blanchet, J.; Jambulapati, A.; Kent, C.; Sidford, A. Towards optimal running times for optimal transport. arXiv 2018, arXiv:1810.07717.
- 23. Quanrud, K. Approximating optimal transport with linear programs. arXiv 2018, arXiv:1810.05957.
- 24. Frogner, C.; Zhang, C.; Mobahi, H.; Araya, M.; Poggio, T.A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems; 2015; pp. 2053–2061.
- 25. Courty, N.; Flamary, R.; Tuia, D.; Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1853–1865. doi:10.1109/TPAMI.2016.2615921.
- 26. Lei, J. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli 2020, 26, 767–798. doi:10.3150/19-BEJ1151.
- 27. Mena, G.; Niles-Weed, J. Statistical bounds for entropic optimal transport: Sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems; 2019; pp. 4543–4553.
- 28. Rigollet, P.; Weed, J. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Math. 2018, 356, 1228–1235. doi:10.1016/j.crma.2018.10.010.
- 29. Balaji, Y.; Hassani, H.; Chellappa, R.; Feizi, S. Entropic GANs meet VAEs: A statistical approach to compute sample likelihoods in GANs. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 414–423.
- 30. Amari, S.I.; Karakida, R.; Oizumi, M. Information geometry connecting Wasserstein distance and Kullback–Leibler divergence via the entropy-relaxed transportation problem. Inf. Geom. 2018, 1, 13–37. doi:10.1007/s41884-018-0002-8.
- 31. Dowson, D.; Landau, B. The Fréchet distance between multivariate normal distributions. J. Multivar. Anal. 1982, 12, 450–455. doi:10.1016/0047-259X(82)90077-X.
- 32. Tsallis, C. Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. doi:10.1007/BF01016429.
- 33. Amari, S.I.; Karakida, R.; Oizumi, M.; Cuturi, M. Information geometry for regularized optimal transport and barycenters of patterns. Neural Comput. 2019, 31, 827–848. doi:10.1162/neco_a_01178.
- 34. Janati, H.; Muzellec, B.; Peyré, G.; Cuturi, M. Entropic optimal transport between (unbalanced) Gaussian measures has a closed form. In Advances in Neural Information Processing Systems; 2020.
- 35. Mallasto, A.; Gerolin, A.; Minh, H.Q. Entropy-regularized 2-Wasserstein distance between Gaussian measures. arXiv 2020, arXiv:2006.03416.
- 36. Peyré, G.; Cuturi, M. Computational optimal transport. Found. Trends Mach. Learn. 2019, 11, 355–607. doi:10.1561/2200000073.
- 37. Monge, G. Mémoire sur la Théorie des Déblais et des Remblais; Histoire de l'Académie Royale des Sciences de Paris: Paris, France, 1781.
- 38. Kantorovich, L.V. On the translocation of masses. Proc. USSR Acad. Sci. 1942, 37, 199–201. doi:10.1007/s10958-006-0049-2.
- 39. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620. doi:10.1103/PhysRev.106.620.
- 40. Mardia, K.V. Characterizations of directional distributions. In A Modern Course on Statistical Distributions in Scientific Work; Springer: Berlin/Heidelberg, Germany, 1975; pp. 365–385.
- 41. Petersen, K.; Pedersen, M. The Matrix Cookbook; Volume 15; Technical University of Denmark: Lyngby, Denmark, 2008.
- 42. Takatsu, A. Wasserstein geometry of Gaussian measures. Osaka J. Math. 2011, 48, 1005–1026.
- 43. Kruskal, J.B. Nonmetric multidimensional scaling: A numerical method. Psychometrika 1964, 29, 115–129. doi:10.1007/BF02289694.
- 44. Bhatia, R. Positive Definite Matrices; Volume 24; Princeton University Press: Princeton, NJ, USA, 2009.
- 45. Hiai, F.; Petz, D. Introduction to Matrix Analysis and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2014.
- 46. Marshall, A.W.; Olkin, I.; Arnold, B.C. Inequalities: Theory of Majorization and Its Applications; Volume 143; Springer: Berlin/Heidelberg, Germany, 1979.
- 47. Markham, D.; Miszczak, J.A.; Puchała, Z.; Życzkowski, K. Quantum state discrimination: A geometric approach. Phys. Rev. A 2008, 77, 042111. doi:10.1103/PhysRevA.77.042111.
- 48. Costa, J.; Hero, A.; Vignat, C. On solutions to multivariate maximum α-entropy problems. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2003; pp. 211–226.
- 49. Naudts, J. Generalised Thermostatistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
- 50. Kotz, S.; Nadarajah, S. Multivariate t-Distributions and Their Applications; Cambridge University Press: Cambridge, UK, 2004.
- 51. Clason, C.; Lorenz, D.A.; Mahler, H.; Wirth, B. Entropic regularization of continuous optimal transport problems. J. Math. Anal. Appl. 2021, 494, 124432. doi:10.1016/j.jmaa.2020.124432.
- 52. Agueh, M.; Carlier, G. Barycenters in the Wasserstein space. SIAM J. Math. Anal. 2011, 43, 904–924. doi:10.1137/100805741.
- 53. Jeuris, B.; Vandebril, R.; Vandereycken, B. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electron. Trans. Numer. Anal. 2012, 39, 379–402.
- 54. Higham, N.J. Newton's method for the matrix square root. Math. Comput. 1986, 46, 537–549.