Abstract
The distance and divergence of probability measures play a central role in statistics, machine learning, and many other related fields. The Wasserstein distance has received much attention in recent years because of its distinctive properties compared with other distances and divergences. Although computing the Wasserstein distance is costly, entropy-regularized optimal transport was proposed as a computationally efficient approximation of the Wasserstein distance. The purpose of this study is to understand the theoretical aspects of entropy-regularized optimal transport. In this paper, we focus on entropy-regularized optimal transport on multivariate normal distributions and q-normal distributions. We obtain the explicit form of the entropy-regularized optimal transport cost on multivariate normal and q-normal distributions; this provides a perspective for understanding the effect of entropy regularization, which was previously known only experimentally. Furthermore, we obtain the entropy-regularized Kantorovich estimator for probability measures that satisfy certain conditions. We also demonstrate in experiments how the Wasserstein distance, the optimal coupling, the geometric structure, and the statistical efficiency are affected by entropy regularization. In particular, our results on the explicit form of the optimal coupling of the Tsallis entropy-regularized optimal transport on multivariate q-normal distributions and on the entropy-regularized Kantorovich estimator are novel and constitute a first step towards the understanding of a more general setting.
Keywords: optimal transportation, entropy regularization, Wasserstein distance, Tsallis entropy, q-normal distribution
1. Introduction
Comparing probability measures is a fundamental problem in statistics and machine learning. A classical way to compare probability measures is the Kullback–Leibler divergence. Let M be a measurable space and let μ and ν be probability measures on M with μ absolutely continuous with respect to ν; then, the Kullback–Leibler divergence is defined as:

$$\mathrm{KL}(\mu \,\|\, \nu) = \int_M \log\!\left(\frac{d\mu}{d\nu}\right) d\mu. \qquad (1)$$
The Wasserstein distance [1], also known as the earth mover's distance [2], is another way of comparing probability measures. It is a metric on the space of probability measures that arises from the theory of mass transportation between two probability measures. Informally, optimal transport theory considers an optimal transport plan between two probability measures under a cost function, and the Wasserstein distance is defined by the minimum total transport cost. A significant difference between the Wasserstein distance and the Kullback–Leibler divergence is that the former can reflect the metric structure of the underlying space, whereas the latter cannot. The Wasserstein distance can be written as:
$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\pi(x, y) \right)^{1/p}, \qquad (2)$$

where d is a distance function on a measurable metric space M and Π(μ, ν) denotes the set of probability measures on M × M whose marginal measures correspond to μ and ν. In recent years, the application of optimal transport and the Wasserstein distance has been studied in many fields such as statistics, machine learning, and image processing. For example, Reference [3] generated interpolations of various three-dimensional (3D) objects using the Wasserstein barycenter. In the field of word embedding in natural language processing, Reference [4] embedded each word as an elliptical distribution and applied the Wasserstein distance between the elliptical distributions. There are many studies on the applications of optimal transport to deep learning, including [5,6,7]. Moreover, Reference [8] analyzed the denoising autoencoder [9] with gradient flow in the Wasserstein space.
In applications of the Wasserstein distance, one often considers a discrete setting where μ and ν are discrete probability measures. Obtaining the Wasserstein distance between μ and ν can then be formulated as a linear programming problem. In general, however, it is computationally intensive to solve such linear programs and obtain the optimal coupling of two probability measures. For this situation, a novel numerical method, entropy regularization, was proposed by [10]:

$$\mathrm{OT}_\lambda(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int c(x, y) \, d\pi(x, y) - \lambda H(\pi). \qquad (3)$$

This is a relaxed formulation of the original optimal transport problem with a cost function c, in which the negative Shannon entropy is used as a regularizer. For a small regularization parameter λ, the regularized cost can approximate the p-th power of the Wasserstein distance between two discrete probability measures (taking c = d^p), and it can be computed efficiently by Sinkhorn's algorithm [11].
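To make the iteration concrete, the following is a minimal sketch of Sinkhorn's algorithm for discrete measures in Python; the function name, grid, and parameter values are illustrative rather than taken from the paper, and a log-domain implementation would be needed for very small λ.

```python
import numpy as np

def sinkhorn(a, b, C, lam, n_iter=1000, tol=1e-9):
    """Sinkhorn iterations for entropy-regularized OT with discrete marginals a, b
    (probability vectors), cost matrix C, and regularization lam.
    Returns the transport cost <P, C> at the regularized optimal coupling P."""
    K = np.exp(-C / lam)                      # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        v = b / (K.T @ u)                     # scale columns to match b
        u_new = a / (K @ v)                   # scale rows to match a
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    P = u[:, None] * K * v[None, :]           # approximate optimal coupling
    return np.sum(P * C), P

# Example: two discretized one-dimensional Gaussians on a common grid.
x = np.linspace(-5, 5, 200)
a = np.exp(-0.5 * x**2); a /= a.sum()                 # N(0, 1)
b = np.exp(-0.5 * (x - 1)**2 / 2); b /= b.sum()       # N(1, 2)
C = (x[:, None] - x[None, :])**2                      # squared Euclidean cost
cost, P = sinkhorn(a, b, C, lam=1.0)
```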
More recently, many studies have been published on improving the computational efficiency. According to [12], the most computationally efficient algorithm at this moment to solve the linear program for the Wasserstein distance is the Lee–Sidford linear programming solver [13]. Reference [14] proved a complexity bound for the Sinkhorn algorithm in terms of the desired absolute accuracy. After [10] appeared, various algorithms were proposed. For example, Reference [15] adopted stochastic optimization schemes for solving the optimal transport problem. The Greenkhorn algorithm [16] is a greedy variant of the Sinkhorn algorithm, and Reference [12] proposed its acceleration. Many other approaches, such as adapting a variety of standard optimization algorithms to approximate the optimal transport problem, can be found in [12,17,18,19]. Several specialized Newton-type algorithms [20,21] achieve the best known complexity bounds at the present moment [22,23].
Moreover, entropy-regularized optimal transport has another advantage. Because of the differentiability of the entropy-regularized optimal transport cost and the simple structure of Sinkhorn's algorithm, we can easily compute the gradient of the entropy-regularized optimal transport cost and optimize the parameters of a parametrized probability distribution by using numerical or automatic differentiation. We can then define a differentiable loss function that can be applied to various supervised learning methods [24]. Entropy-regularized optimal transport can be used to approximate not only the Wasserstein distance, but also its optimal coupling as a mapping function. Reference [25] adopted the optimal coupling of the entropy-regularized optimal transport as a mapping function from one domain to another.
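As an illustration of this differentiability, the following sketch differentiates a Sinkhorn-type loss through the iterations with PyTorch's automatic differentiation and uses the gradient to fit the mean of a discretized Gaussian; the setup, parameter names, and step sizes are illustrative assumptions rather than the authors' procedure.

```python
import torch

def sinkhorn_cost(a, b, C, lam, n_iter=200):
    """Transport cost <P, C> at the entropy-regularized coupling, differentiable in a, b, C."""
    K = torch.exp(-C / lam)
    u = torch.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return (P * C).sum()

# Fit the mean of a discretized Gaussian to a target by gradient descent.
x = torch.linspace(-5.0, 5.0, 100)
C = (x[:, None] - x[None, :]) ** 2
b = torch.softmax(-0.5 * (x - 2.0) ** 2, dim=0)        # target: N(2, 1) on the grid
theta = torch.tensor(0.0, requires_grad=True)           # model mean
opt = torch.optim.SGD([theta], lr=0.1)
for _ in range(50):
    a = torch.softmax(-0.5 * (x - theta) ** 2, dim=0)   # model: N(theta, 1) on the grid
    loss = sinkhorn_cost(a, b, C, lam=1.0)
    opt.zero_grad()
    loss.backward()                                     # gradient flows through the iterations
    opt.step()
```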
Despite the empirical success of the entropy-regularized optimal transport, its theoretical aspect is less understood. Reference [26] studied the expected Wasserstein distance between a probability measure and its empirical version. Similarly, Reference [27] showed the consistency of the entropy-regularized optimal transport cost between two empirical distributions. Reference [28] showed that minimizing the entropy-regularized optimal transport cost between empirical distributions is equivalent to a type of maximum likelihood estimator. Reference [29] considered Wasserstein generative adversarial networks with an entropy regularization. Reference [30] constructed information geometry from the convexity of the entropy-regularized optimal transport cost.
Our motivation in this study is to derive an analytical solution of the entropy-regularized optimal transport problem between continuous probability measures so that we can gain insight into the effects of entropy regularization in a theoretical as well as an experimental way. In our study, we generalize the Wasserstein distance between two multivariate normal distributions by entropy regularization. We derive the explicit form of the entropy-regularized optimal transport cost and its optimal coupling, which can be used to analyze the effect of entropy regularization directly. In general, the nonregularized Wasserstein distance between two probability measures and its optimal coupling cannot be expressed in a closed form; however, Reference [31] proved an explicit formula for multivariate normal distributions. Theorem 1 is a generalized form of [31]: we obtain an explicit form of the entropy-regularized optimal transport between two multivariate normal distributions. Furthermore, by adopting the Tsallis entropy [32] as the regularizer instead of the Shannon entropy, our theorem can be generalized to multivariate q-normal distributions.
Some readers may find it strange to study the entropy-regularized optimal transport for multivariate normal distributions, where the exact (nonregularized) optimal transport has been obtained explicitly. However, we think it is worth studying from several perspectives:
Normal distributions are the simplest and best-studied probability distributions, and thus, it is useful to examine the regularization theoretically in order to infer results for other distributions. In particular, we will partly answer the questions “How much do entropy constraints affect the results?” and “What does it mean to constrain by the entropy?’’ for the simplest cases. Furthermore, as a first step in constructing a theory for more general probability distributions, in Section 4, we propose a generalization to multivariate q-normal distributions.
Because normal distributions are the limit distributions in asymptotic theories using the central limit theorem, studying normal distributions is necessary for the asymptotic theory of regularized Wasserstein distances and estimators computed by them. Moreover, it was proposed to use the entropy-regularized Wasserstein distance to compute a lower bound of the generalization error for a variational autoencoder [29]. The study of the asymptotic behavior of such bounds is one of the expected applications of our results.
Though this has not yet been proven theoretically, we suspect that entropy regularization is efficient not only for computational reasons, such as the use of the Sinkhorn algorithm, but also in the sense of efficiency in statistical inference. Such a phenomenon can be found in some existing studies, including [33]. Such statistical efficiency is confirmed by some experiments in Section 6.
The remainder of this paper is organized as follows. First, we review some definitions of optimal transport and entropy regularization in Section 2. Then, in Section 3, we provide an explicit form of the entropy-regularized optimal transport cost and its optimal coupling between two multivariate normal distributions. We extend this result to q-normal distributions with Tsallis entropy regularization in Section 4. In Section 5, we obtain, in Theorem 3, the entropy-regularized Kantorovich estimator for probability measures on ℝⁿ with a finite second moment that are absolutely continuous with respect to the Lebesgue measure. We emphasize that Theorem 3 is not limited to the case of multivariate normal distributions, but can handle a wider range of probability measures. We also analyze experimentally how entropy regularization affects the optimal result in the corresponding sections.
We note that after publishing the preprint version of this paper, we found closely related results [34,35] reported within half a year. Janati et al. [34] proved the same result as Theorem 1 by solving the fixed-point equation behind Sinkhorn's algorithm. Their results include the unbalanced optimal transport between unbalanced multivariate normal distributions. They also studied the convexity and differentiability of the objective function of the entropy-regularized optimal transport. In [35], the same closed form as Theorem 1 was proven by ingeniously using the Schrödinger system. Although there are some overlaps, our paper has significant novelty in the following respects. Our proof is more direct than theirs and extends directly to the proof for the Tsallis entropy-regularized optimal transport between multivariate q-normal distributions provided in Section 4. Furthermore, Corollaries 1 and 2 are novel and important results for evaluating how much the entropy regularization affects the estimation results. We also obtain the entropy-regularized Kantorovich estimator in Theorem 3.
2. Preliminary
In this section, we review some definitions of optimal transport and entropy-regularized optimal transport. These definitions follow [1,36]. In this section, we use a tuple (M, 𝒜) for a set M and a σ-algebra 𝒜 on M, and 𝒫(X) for the set of all probability measures on a measurable space X.
Definition 1
(Pushforward measure). Given measurable spaces (X, 𝒜) and (Y, ℬ), a measure μ ∈ 𝒫(X), and a measurable mapping φ: X → Y, the pushforward measure of μ by φ is defined by:

$$\varphi_{\#}\mu(B) = \mu\!\left(\varphi^{-1}(B)\right) \quad \text{for all } B \in \mathcal{B}. \qquad (4)$$
Definition 2
(Optimal transport map). Consider a measurable space (M, 𝒜), and let c: M × M → [0, ∞) denote a cost function. Given μ, ν ∈ 𝒫(M), we call φ the optimal transport map if φ realizes the infimum of:

$$\inf_{\varphi:\, \varphi_{\#}\mu = \nu} \int_M c(x, \varphi(x)) \, d\mu(x). \qquad (5)$$
This problem was originally formalized by [37]. However, the optimal transport map does not always exist. Then, Kantorovich considered a relaxation of this problem in [38].
Definition 3
(Coupling). Given μ, ν ∈ 𝒫(M), a coupling of μ and ν is a probability measure π on M × M that satisfies:

$$\pi(A \times M) = \mu(A), \quad \pi(M \times B) = \nu(B) \quad \text{for all measurable } A, B \subseteq M. \qquad (6)$$
Definition 4
(Kantorovich problem). The Kantorovich problem is defined as finding a coupling π of μ and ν that realizes the infimum of:

$$\inf_{\pi} \int_{M \times M} c(x, y) \, d\pi(x, y). \qquad (7)$$
Hereafter, let Π(μ, ν) be the set of all couplings of μ and ν. When we adopt a distance function as the cost function, we can define the Wasserstein distance.
Definition 5
(Wasserstein distance). Given p ≥ 1, a measurable metric space (M, d), and μ, ν ∈ 𝒫(M) with a finite p-th moment, the p-Wasserstein distance between μ and ν is defined as:

$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\pi(x, y) \right)^{1/p}. \qquad (8)$$
Now, we review the definition of entropy-regularized optimal transport on ℝⁿ.
Definition 6
(Entropy-regularized optimal transport). Let λ > 0, μ, ν ∈ 𝒫(ℝⁿ), and let p(x, y) be the density function of a coupling π of μ and ν, whose reference measure is the Lebesgue measure. We define the entropy-regularized optimal transport cost as:

$$\mathrm{OT}_\lambda(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int c(x, y) \, p(x, y) \, dx \, dy \; - \; \lambda H(\pi), \qquad (9)$$

where H(π) denotes the Shannon (differential) entropy of a probability measure:

$$H(\pi) = -\int p(x, y) \log p(x, y) \, dx \, dy. \qquad (10)$$
There is another variant of entropy-regularized optimal transport, defined by the relative entropy instead of the Shannon entropy:

$$\widetilde{\mathrm{OT}}_\lambda(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int c(x, y) \, d\pi(x, y) + \lambda \, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu). \qquad (11)$$

This version is definable even when Π(μ, ν) includes a coupling that is not absolutely continuous with respect to the Lebesgue measure. We note that when both μ and ν are absolutely continuous, the infimum is attained by the same π for (9) and (11), because the two objectives differ only by a term that depends only on μ and ν. In the following part of the paper, we assume the absolute continuity of μ, ν, and π with respect to the Lebesgue measure so that the entropy regularization is well defined.
3. Entropy-Regularized Optimal Transport between Multivariate Normal Distributions
In this section, we provide a rigorous solution of entropy-regularized optimal transport between two multivariate normal distributions. Throughout this section, we adopt the squared Euclidean distance as the cost function. To prove our theorem, we start by expressing the transport cost using mean vectors and covariance matrices. The following lemma is a known result; see, for example, [31].
Lemma 1.
Let X ∼ P and Y ∼ Q be two random variables on ℝⁿ with means m₁, m₂ and covariance matrices Σ₁, Σ₂, respectively. If π is a coupling of P and Q with cross-covariance matrix C = E[(X − m₁)(Y − m₂)ᵀ], we have:

$$\mathbb{E}_\pi\!\left[\|X - Y\|^2\right] = \|m_1 - m_2\|^2 + \mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) - 2\,\mathrm{tr}(C). \qquad (12)$$
Proof.
Without loss of generality, we can assume X and Y are centralized, because:

$$\mathbb{E}\!\left[\|X - Y\|^2\right] = \|m_1 - m_2\|^2 + \mathbb{E}\!\left[\|(X - m_1) - (Y - m_2)\|^2\right]. \qquad (13)$$

Therefore, for centered X and Y, we have:

$$\mathbb{E}\!\left[\|X - Y\|^2\right] = \mathbb{E}\!\left[\|X\|^2\right] + \mathbb{E}\!\left[\|Y\|^2\right] - 2\,\mathbb{E}\!\left[\langle X, Y\rangle\right] = \mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) - 2\,\mathrm{tr}(C). \qquad (14)$$

By adding ‖m₁ − m₂‖², we obtain (12). □
Lemma 1 shows that the transport cost can be parameterized by the covariance matrices Σ₁, Σ₂ and the cross-covariance matrix C. Because Σ₁ and Σ₂ are fixed, the infinite-dimensional optimization over the coupling reduces to a finite-dimensional optimization over the cross-covariance matrix C.
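As a quick sanity check of (12), the following sketch builds one feasible Gaussian coupling with a prescribed cross-covariance and compares a Monte Carlo estimate of E‖X − Y‖² with the right-hand side; the particular choice of cross-covariance is only an illustrative assumption that keeps the joint covariance positive semi-definite.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
n = 3
m1, m2 = rng.normal(size=n), rng.normal(size=n)
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
S1, S2 = A @ A.T + np.eye(n), B @ B.T + np.eye(n)

# One feasible cross-covariance: C = t * S1^{1/2} S2^{1/2} with |t| <= 1
# keeps the joint covariance positive semi-definite.
C = 0.5 * np.real(sqrtm(S1)) @ np.real(sqrtm(S2))
J = np.block([[S1, C], [C.T, S2]])
J = 0.5 * (J + J.T)                                   # symmetrize numerically

Z = rng.multivariate_normal(np.concatenate([m1, m2]), J, size=200_000)
X, Y = Z[:, :n], Z[:, n:]
lhs = np.mean(np.sum((X - Y) ** 2, axis=1))
rhs = np.sum((m1 - m2) ** 2) + np.trace(S1) + np.trace(S2) - 2 * np.trace(C)
print(lhs, rhs)   # the two values agree up to Monte Carlo error
```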
We prepare the following lemma to prove Theorem 1.
Lemma 2.
Under a fixed mean and covariance matrix, the probability measure that maximizes the entropy is a multivariate normal distribution.
Lemma 2 is a particular case of the principle of maximum entropy [39], and the proof can be found in [40] Theorem 3.1.
Theorem 1.
Let P = N(m₁, Σ₁) and Q = N(m₂, Σ₂) be two multivariate normal distributions on ℝⁿ. The optimal coupling π of P and Q for the entropy-regularized optimal transport:
(15) is expressed as:
(16) where:
(17) Furthermore, the entropy-regularized optimal transport cost can be written as:
(18) and the relative entropy version can be written as:
(19)
We note that we use the regularization parameter in for the sake of simplicity.
Proof.
Although the first half of the proof can be derived directly from Lemma 2, we provide a proof of this theorem by Lagrange calculus, which will be used later for the extension to q-normal distributions. Now, we define an optimization problem that is equivalent to the entropy-regularized optimal transport as follows:
(20)
(21) Here, the constraint functions are the probability density functions of P and Q, respectively. We introduce Lagrange multiplier functions corresponding to the above two constraints. The Lagrangian function of (20) is defined as:
(22) Taking the functional derivative of (22) with respect to , we obtain:
(23) By the fundamental lemma of the calculus of variations, we have:
(24) Here, the Lagrange multipliers are determined from the constraints (21). We can assume that the optimal coupling is a 2n-variate normal distribution, because for a fixed covariance matrix, the objective attains its infimum when the coupling is a multivariate normal distribution by Lemma 2. Therefore, we can express the coupling density by using the mean vectors and a covariance matrix as:
(25) Putting:
(26) we write:
(27)
(28) According to block matrix inversion formula [41], holds, where is positive definite. Then, comparing the term between (24) and (28), we obtain and:
(29) Here, holds, because A is a symmetric matrix, and thus, we obtain:
(30) Completing the square of the above equation, we obtain:
(31) Let Q be an orthogonal matrix; then, (31) can be solved as:
(32) We rearrange the above equation as follows:
(33) Because the left terms and are all symmetric positive definite, we can conclude that Q is the identity matrix by the uniqueness of the polar decomposition. Finally, we obtain:
(34) We obtain (18) by the direct calculation of using Lemma 1 with this . □
The following corollary helps us to understand the properties of the optimal coupling.
Corollary 1.
Let be the eigenvalues of ; then, monotonically decreases with λ for any .
Proof.
Because has the same eigenvalues as , if we let be the eigenvalues of , , which is a monotonically decreasing function of the regularization parameter . □
By the proof, for large λ, a diagonalization argument shows that each element of the cross-covariance matrix of the optimal coupling converges to zero as λ → ∞.
We show how entropy regularization behaves in two simple experiments. In Figure 1, we calculate the entropy-regularized optimal transport cost in both the original version and the relative entropy version. We separate the entropy-regularized optimal transport cost into the transport cost term and the regularization term and display both of them.
Figure 1.
Graph of the entropy-regularized optimal transport cost between two normal distributions with respect to the regularization parameter λ from zero to 10.
It is reasonable that, as λ → 0, the optimal coupling converges to the original optimal coupling of nonregularized optimal transport and that, as λ → ∞, the cross-covariance converges to 0; this is a special case of Corollary 1. The larger λ becomes, the less correlated the optimal coupling is. We visualize this behavior by computing the optimal couplings of two one-dimensional normal distributions in Figure 2.
Figure 2.
Contours of the density functions of the entropy-regularized optimal coupling of two one-dimensional normal distributions for three different regularization parameters λ. All of the optimal couplings are two-variate normal distributions.
The left panel shows the original version. The transport cost is always positive, and the entropy regularization term can take both signs in general; the sign of the total cost therefore depends on their balance. We note that the transport cost as a function of λ is bounded, whereas the entropy regularization term is not. The boundedness of the optimal cost is deduced from (1) and Corollary 1, and the unboundedness of the entropy regularization term is due to the regularization parameter multiplied by the entropy. The right panel shows the relative entropy version, which always takes a non-negative value. Furthermore, because the total cost is bounded by the value for the independent joint distribution (which is always a feasible coupling), both the transport cost and the relative entropy regularization term are also bounded. Nevertheless, the larger the regularization parameter λ, the greater the influence of entropy regularization on the total cost.
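The decorrelation effect can also be checked numerically on a discretization of two one-dimensional normal distributions, reusing the sinkhorn() helper sketched in the Introduction; the grid and the values of λ below are illustrative choices.

```python
import numpy as np

x = np.linspace(-4, 4, 150)
a = np.exp(-0.5 * x**2); a /= a.sum()              # N(0, 1)
b = np.exp(-0.5 * (x - 1)**2); b /= b.sum()        # N(1, 1)
C = (x[:, None] - x[None, :])**2

for lam in [0.5, 2.0, 8.0]:
    _, P = sinkhorn(a, b, C, lam)                  # coupling on the grid
    mx = (P.sum(axis=1) * x).sum()
    my = (P.sum(axis=0) * x).sum()
    cov = ((x[:, None] - mx) * (x[None, :] - my) * P).sum()
    sx = np.sqrt((P.sum(axis=1) * (x - mx) ** 2).sum())
    sy = np.sqrt((P.sum(axis=0) * (x - my) ** 2).sum())
    print(lam, cov / (sx * sy))  # the correlation of the coupling shrinks as lam grows
```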
It is known that a specific Riemannian metric can be defined on the space of multivariate normal distributions, which induces the Wasserstein distance [42]. To understand the effect of entropy regularization, we illustrate how entropy regularization deforms this geometric structure in Figure 3. Here, we generate 100 two-variate normal distributions, whose covariance matrices are defined as:
| (35) |
Figure 3.
Multidimensional scaling of two-variate normal distributions. The pairwise dissimilarities are given by the square root of the entropy-regularized optimal transport cost for three different regularization parameters λ. Each ellipse in the figure represents a contour of the density function of the corresponding distribution.
To visualize the geometric structure of these two-variate normal distributions, we compute the relative entropy-regularized optimal transport cost between each pair of two-variate normal distributions. Then, we apply multidimensional scaling [43] to embed them into a plane (see Figure 3). We can see that entropy regularization deforms the geometric structure of the space of multivariate normal distributions. The deformation for distributions close to the isotropic normal distribution is more sensitive to the change in λ.
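The embedding step can be reproduced with a standard multidimensional scaling routine applied to a precomputed dissimilarity matrix; the following is a minimal sketch in which the matrix of pairwise regularized transport costs is assumed to be computed beforehand (for example, with the closed form of Theorem 1).

```python
import numpy as np
from sklearn.manifold import MDS

def embed_pairwise_costs(dists, seed=0):
    """dists[i, j]: square root of the regularized OT cost between the i-th and
    j-th distributions (a symmetric, zero-diagonal dissimilarity matrix)."""
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(dists)   # (N, 2) planar coordinates for plotting
```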
The following corollary states that if we allow orthogonal transformations of two multivariate normal distributions with fixed covariance matrices, then the minimum and maximum of the cost are attained when the two covariance matrices are diagonalizable by the same orthogonal matrix or, equivalently, when the ellipsoidal contours of the two density functions are aligned along the same orthogonal axes.
Corollary 2.
With the same settings as in Theorem 1, fix , , , and all eigenvalues of . When is diagonalized as , where is the diagonal matrix of the eigenvalues of in descending order and Γ is an orthogonal matrix,
- (i)
is minimized by and
- (ii)
is maximized by ,
where and are the diagonal matrices of the eigenvalues of in descending and ascending order, respectively. Therefore, neither the minimizer, nor the maximizer depend on the choice of λ.
Proof.
Because , , , and all eigenvalues of are fixed,
(36)
(37)
(38) where are the eigenvalues of and:
(39) Note that is a concave function, because:
(40) Let and be the eigenvalues of and , respectively. By Exercise 6.5.3 of [44] or Theorem 6.13 and Corollary 6.14 of [45],
(41) Here, for such that and , means:
(42) and is said to be majorized by . Because is concave,
(43) where represents weak supermajorization, i.e., means:
(44) (see Theorem 5.A.1 of [46], for example). Therefore,
(45) As in Case (i) (or (ii)), the eigenvalues of correspond to the eigenvalues of (or , respectively), the corollary follows. □
Note that a special case of Corollary 2 for the ordinary Wasserstein metric (λ = 0) has been studied in the context of fidelity and the Bures distance in quantum information theory; see Lemma 3 of [47]. Their proof is not directly applicable to our generalized result; thus, we used another approach to prove it.
4. Extension to Tsallis Entropy Regularization
In this section, we consider a generalization of entropy-regularized optimal transport. We now focus on the Tsallis entropy [32], which is a generalization of the Shannon entropy and appears in nonequilibrium statistical mechanics. We show that the optimal coupling of Tsallis entropy-regularized optimal transport between two q-normal distributions is also a q-normal distribution. We start by recalling the definition of the q-exponential function and q-logarithmic function based on [32].
Definition 7.
Let q be a real parameter with q ≠ 1, and let x > 0. The q-logarithmic function is defined as:

$$\ln_q(x) = \frac{x^{1-q} - 1}{1 - q}, \qquad (46)$$

and the q-exponential function is defined as:

$$\exp_q(x) = \left[\,1 + (1 - q)x\,\right]_+^{\frac{1}{1-q}}, \qquad (47)$$

where [·]₊ = max(·, 0).
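A small sketch of these two functions in code, with a numerical check that they are mutually inverse and recover log and exp as q → 1; the function names are illustrative.

```python
import numpy as np

def q_log(x, q):
    """Tsallis q-logarithm; recovers log(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (np.asarray(x, dtype=float) ** (1.0 - q) - 1.0) / (1.0 - q)

def q_exp(x, q):
    """Tsallis q-exponential; recovers exp(x) as q -> 1.
    The positive-part cutoff yields compact support for q < 1."""
    if np.isclose(q, 1.0):
        return np.exp(x)
    base = np.maximum(1.0 + (1.0 - q) * np.asarray(x, dtype=float), 0.0)
    return base ** (1.0 / (1.0 - q))

# sanity checks
assert np.allclose(q_exp(q_log(2.0, q=1.5), q=1.5), 2.0)
assert np.allclose(q_log(np.e, q=1.0 + 1e-4), 1.0, atol=1e-3)
```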
Definition 8.
Let q be a real parameter in an appropriate range; an n-variate q-normal distribution is defined by two parameters, a location vector μ ∈ ℝⁿ and a positive definite matrix Σ, and its density function is:

(48)

where the prefactor is a normalizing constant. μ and Σ are called the location vector and scale matrix, respectively.
In the following, we write the multivariate q-normal distribution as N_q(μ, Σ). We note that the properties of the q-normal distribution change in accordance with q. The q-normal distribution has an unbounded support for q ≥ 1 and a bounded support for q < 1. The second moment exists for q below a dimension-dependent threshold, and the covariance is then a constant multiple of Σ. We remark that each n-variate q-normal distribution with q > 1 is equivalent to an n-variate t-distribution whose degrees of freedom are determined by q and n,

(49)

and to an n-variate normal distribution for q = 1.
Definition 9.
Let p be a probability density function. The Tsallis entropy is defined as:

$$S_q(p) = \frac{1}{q - 1}\left(1 - \int p(x)^q \, dx\right). \qquad (50)$$
Then, the Tsallis entropy-regularized optimal transport is defined as:
| (51) |
| (52) |
The following lemma is a generalization of the maximum entropy principle for the Shannon entropy shown in Section 2 of [48].
Lemma 3.
Let P be a centered n-dimensional probability measure with a fixed covariance matrix Σ; the maximizer of the Rényi α-entropy:

$$H_\alpha(p) = \frac{1}{1 - \alpha} \log \int p(x)^\alpha \, dx \qquad (53)$$

under this covariance constraint is a multivariate q-normal distribution (with q corresponding to α) for α in an appropriate range.
We note that the maximizers of the Rényi α-entropy and the Tsallis entropy with the same index coincide; thus, the above lemma also holds for the Tsallis entropy. This is mentioned, for example, in Section 9 of [49].
To prove Theorem 2, we use the following property of multivariate t-distributions, which is summarized in Chapter 1 of [50].
Lemma 4.
Let X be a random vector following an n-variate t-distribution with degrees of freedom ν, mean vector μ, and scale matrix Σ. Consider the partition:

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. \qquad (54)$$

Then, X₁ follows a p-variate t-distribution with degrees of freedom ν, mean vector μ₁, and scale matrix Σ₁₁, where p is the dimension of X₁.
Recalling the correspondence between the parameter q of the multivariate q-normal distribution and the degrees of freedom of the multivariate t-distribution, we can obtain the following corollary.
Corollary 3.
Let X be a random vector following an n-variate q-normal distribution with q > 1. Consider a partition of the mean vector μ and scale matrix Σ in the same way as in (54). Then, X₁ follows a p-variate q′-normal distribution whose mean vector and scale matrix are given by the corresponding blocks μ₁ and Σ₁₁, where p is the dimension of X₁ and q′ is determined by the common degrees of freedom.
Theorem 2.
Let P and Q be n-variate q-normal distributions, with q and the regularization parameter in their admissible ranges; consider the Tsallis entropy-regularized optimal transport:
(55) Then, there exists a unique constant c such that the optimal coupling π of the entropy-regularized optimal transport is expressed as:
(56) where:
(57)
Proof.
The proof proceeds in a similar way as in Theorem 1. Let and be the Lagrangian multipliers. Then, the Lagrangian function of (52) is defined as:
(58) and the extremum of the Tsallis entropy-regularized optimal transport is obtained by the functional derivative with respect to ,
(59) Here, and are quadratic polynomials by Lemma 3. To separate the normalizing constant, we introduce a constant , and can be written as:
(60) with quadratic functions and .
Let . Then, by the same argument as in the proof of Theorem 1 and using Corollary 3, we obtain the scale matrix of as:
(61) where:
(62) Let and ; can be written as:
(63) The constant c is determined by:
(64) We will show that the above equation has a unique solution. Let be the eigenvalues of ; can be expressed as . We consider:
(65)
(66) Because , is a monotonic decreasing function, and , , (64) has a unique positive solution, and is determined uniquely. □
5. Entropy-Regularized Kantorovich Estimator
Many estimators are defined by minimizing a divergence or distance between probability measures, that is, by minimizing D(P, Q) over Q for a fixed P. When D is the Kullback–Leibler divergence, the estimator corresponds to the maximum likelihood estimator. When D is the Wasserstein distance, the resulting estimator is called the minimum Kantorovich estimator, according to [36]. In this section, we consider a probability measure that minimizes the entropy-regularized optimal transport cost to a fixed P over 𝒫₂(ℝⁿ), the set of all probability measures on ℝⁿ with a finite second moment that are absolutely continuous with respect to the Lebesgue measure; we call this minimizer the entropy-regularized Kantorovich estimator. The entropy-regularized Kantorovich estimator for discrete probability measures was studied in [33], Theorem 2. We obtain the entropy-regularized Kantorovich estimator for continuous probability measures in the following theorem:
Theorem 3.
For a fixed P ∈ 𝒫₂(ℝⁿ), the minimizer

(67)

exists, and its density function can be written as:

(68)

where p is the density function of P, and ⋆ denotes the convolution operator.
We use the dual problem of the entropy-regularized optimal transport to prove Theorem 3 (for details, see Proposition 2.1 of [15] or Section 3 of [51]).
Lemma 5.
The dual problem of entropy-regularized optimal transport can be written as:
(69) Moreover, strong duality holds.
Now, we prove Theorem 3.
Proof.
Let be the minimizer of . Applying Lemma 5, there exist and such that:
(70) Now, is the minimum value of , such that the variation is always zero. Then,
(71) holds, and the optimal coupling of can be written as:
(72)
(73) Moreover, we can obtain a closed-form of as follows from the equation :
(74) Then, by calculating the marginal distribution of with respect to x, we can obtain:
(75) Therefore, we conclude that the probability measure Q that minimizes the entropy-regularized optimal transport cost is expressed as (75). □
It should be noted that when P in Theorem 3 is a multivariate normal distribution, the estimator and P are simultaneously diagonalizable as a direct consequence of the theorem. This is consistent with the result of Corollary 2(i) for minimization when all eigenvalues are fixed.
We can see that the entropy-regularized Kantorovich estimator is the given measure convolved with an isotropic multivariate normal distribution whose scale is governed by the regularization parameter λ. This is similar to the idea of prior distributions in the context of Bayesian inference. Applying Theorem 3, the entropy-regularized Kantorovich estimator of a multivariate normal distribution is again a multivariate normal distribution whose covariance matrix is inflated by an isotropic term depending on λ.
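A one-dimensional numerical sketch of this convolution structure is given below; the width of the Gaussian kernel is an illustrative assumption and not the exact constant appearing in (68).

```python
import numpy as np

x = np.linspace(-8, 8, 1601)
dx = x[1] - x[0]
p = np.exp(-0.5 * (x - 1.0) ** 2)
p /= p.sum() * dx                                 # density of P = N(1, 1) on the grid

lam = 0.5                                         # regularization parameter
kernel = np.exp(-0.5 * x ** 2 / lam)              # isotropic Gaussian kernel (assumed width ~ lam)
kernel /= kernel.sum() * dx
q_hat = np.convolve(p, kernel, mode="same") * dx  # estimator density: p convolved with the kernel
print(q_hat.sum() * dx)                           # ~1: still a probability density
```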
6. Numerical Experiments
In this section, we introduce experiments that show the statistical efficiency of entropy regularization in Gaussian settings. We consider two different setups, estimating covariance matrices (Section 6.1) and the entropy-regularized Wasserstein barycenter (Section 6.2). To obtain the entropy-regularized Wasserstein barycenter, we adopt the Newton–Schulz method and a manifold optimization method, which are explained in Section 6.3 and Section 6.4, respectively.
6.1. Estimation of Covariance Matrices
We provide a covariance estimation method based on entropy-regularized optimal transport. Let P be an n-variate normal distribution. We define an entropy-regularized Kantorovich estimator of its covariance matrix, that is,
| (76) |
We generate samples from P and estimate the mean and covariance matrix. We compare the maximum likelihood estimator and the entropy-regularized Kantorovich estimator with respect to the prediction error:
| (77) |
In our experiment, the dimension n takes three values, and the sample size is set to 60 or 120. The experiment proceeds as follows; a code sketch of the procedure appears after the list.
1. Obtain a random sample of size 60 (or 120) from P and compute its sample covariance matrix.
2. Obtain the entropy-regularized minimum Kantorovich estimator from the sample covariance matrix obtained in Step 1.
3. Compute the prediction error between the true covariance matrix and the entropy-regularized minimum Kantorovich estimator of Step 2.
4. Repeat the above steps 1000 times and obtain a confidence interval of the prediction error.
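The following is a minimal sketch of this Monte Carlo loop. The true covariance, the regularized estimator (a placeholder standing in for the closed form of (76)), and the Frobenius-norm error standing in for the prediction error (77) are all illustrative assumptions.

```python
import numpy as np

def run_trial(Sigma, n_samples, estimator, rng):
    n = Sigma.shape[0]
    X = rng.multivariate_normal(np.zeros(n), Sigma, size=n_samples)
    Xc = X - X.mean(axis=0)
    S_mle = Xc.T @ Xc / n_samples                  # maximum likelihood covariance
    S_hat = estimator(S_mle)                       # e.g., the regularized Kantorovich estimator
    return np.linalg.norm(S_hat - Sigma, ord="fro") ** 2

rng = np.random.default_rng(0)
n, n_samples, n_rep = 20, 60, 1000
Sigma = np.eye(n)                                  # illustrative true covariance
mle = lambda S: S                                  # lambda = 0 corresponds to the plain MLE
errs = [run_trial(Sigma, n_samples, mle, rng) for _ in range(n_rep)]
mean = np.mean(errs)
half = 1.96 * np.std(errs, ddof=1) / np.sqrt(n_rep)  # 95% confidence interval half-width
print(f"{mean:.3f} +/- {half:.3f}")
```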
Table 1 shows the average prediction error of the MLE and the entropy-regularized Kantorovich estimator of covariance matrices from 60 samples from an n-variate normal distribution, with the 95% confidence interval. We can see that the prediction error becomes smaller than that of the maximum likelihood estimator when λ is chosen adequately small, whereas a poorly chosen λ can increase it. Moreover, the decrease in the prediction error is larger in higher dimensions than in lower ones, which indicates that entropy regularization is effective in high dimensions. On the other hand, Table 2 shows that, in all cases, the decreases in the prediction error are more moderate than in Table 1. We consider that this is due to the increase in the sample size. We can therefore conclude that entropy regularization is effective in a high-dimensional setting with a small sample size.
Table 1.
Average prediction error of the MLE and entropy-regularized Kantorovich estimator of covariance matrices from 60 samples from an n-variate normal distribution, with the 95% confidence interval.
| λ | n (small) | n (medium) | n (large) |
|---|---|---|---|
| 0 (MLE) | 0.062 | 1.346 | 10.69 |
| 0.01 | 0.051 | 1.242 | 8.973 |
| 0.1 | 0.104 | 0.841 | 4.180 |
| 0.5 | 0.647 | 0.931 | 3.093 |
| 1.0 | 1.166 | 1.670 | 5.075 |
Table 2.
Average prediction error of the MLE and entropy-regularized Kantorovich estimator of covariance matrices from 120 samples from an n-variate normal distribution, with the 95% confidence interval.
| λ | n (small) | n (medium) | n (large) |
|---|---|---|---|
| 0 (MLE) | 0.024 | 0.490 | 2.810 |
| 0.01 | 0.020 | 0.459 | 2.528 |
| 0.1 | 0.101 | 0.397 | 1.700 |
| 0.5 | 0.659 | 0.875 | 2.833 |
| 1.0 | 1.180 | 1.730 | 5.124 |
6.2. Estimation of the Wasserstein Barycenter
A barycenter with respect to the Wasserstein distance is definable [52] and is widely used for image interpolation and 3D object interpolation tasks with entropy regularization [3,33].
Definition 10.
Let P₁, …, P_m be a set of probability measures in 𝒫₂(ℝⁿ). The barycenter with respect to the entropy-regularized optimal transport cost (entropy-regularized Wasserstein barycenter) is defined as:

(78)
Now, we restrict the Pᵢ and the barycenter to be multivariate normal distributions and apply our theorem to illustrate the effect of entropy regularization.
The experiment proceeds as follows. The dimensionality and the sample size were set the same as in the experiments in Section 6.1.
1. Obtain a random sample of size 60 (or 120) from P and compute its sample covariance matrix.
2. Repeat Step 1 three times to obtain three sample covariance matrices.
3. Obtain the entropy-regularized barycenter of the three estimates.
4. Compute the prediction error between the true covariance matrix and the barycenter obtained in Step 3.
5. Repeat the above steps 100 times and obtain a confidence interval of the prediction error.
We show the results for several values of the regularization parameter λ in Table 3 and Table 4. A decrease in the prediction error can be seen in Table 3 for small λ, as in Table 1 and Table 2. However, because the computation of the entropy-regularized Wasserstein barycenter uses more data than that of the minimum Kantorovich estimator, the decrease in the prediction error is mild. The entropy-regularized Kantorovich estimator is a special case of the entropy-regularized Wasserstein barycenter (78) with m = 1. Our experiments show that the appropriate range of λ for decreasing the prediction error depends on m and becomes narrow as m increases. In addition, we note that the decrease in the prediction error in Table 4 is small.
Table 3.
Average prediction error of the entropy-regularized barycenter with the 95% confidence interval (random sample of size 60).
| λ | n (small) | n (medium) | n (large) |
|---|---|---|---|
| 0 | 0.455 | 1.318 | 4.875 |
| 0.001 | 0.429 | 1.318 | 4.887 |
| 0.01 | 0.434 | 1.344 | 4.551 |
| 0.025 | 0.780 | 1.456 | 5.710 |
| 0.005 | 1.047 | 1.537 | 7.570 |
Table 4.
Average prediction error of the entropy-regularized barycenter with the 95% confidence interval (random sample of size 120).
| λ | n (small) | n (medium) | n (large) |
|---|---|---|---|
| 0 | 0.154 | 1.303 | 5.091 |
| 0.001 | 0.212 | 1.305 | 5.072 |
| 0.01 | 0.306 | 1.328 | 5.274 |
| 0.025 | 0.671 | 1.337 | 5.851 |
| 0.005 | 1.109 | 1.603 | 8.072 |
6.3. Gradient Descent on the Manifold of Positive Definite Matrices
We use a gradient descent method to compute the entropy-regularized barycenter. Applying the gradient descent method to a loss function defined by the Wasserstein distance was proposed in [4]. This idea is extendable to entropy-regularized optimal transport; the detailed algorithm is shown below. Because the objective is a function of a positive definite matrix, we use a manifold gradient descent algorithm on the manifold of positive definite matrices.
We review the manifold gradient descent algorithm used in our numerical experiments. Let 𝒮ⁿ₊₊ be the manifold of n-dimensional positive definite matrices. We require a formula for the gradient operator and the inner product on 𝒮ⁿ₊₊ in the gradient descent algorithm. In this paper, we use the following inner product from [44], Chapter 6. For a fixed X ∈ 𝒮ⁿ₊₊, we define an inner product of tangent vectors as:
| (79) |
Equation (79) is the best choice in terms of the convergence speed according to [53]. Let f be a differentiable matrix function. Then, the induced gradient of f under (79) is:
| (80) |
We consider the updating step after obtaining the gradient of f. The gradient is an element of the tangent space, and we have to map it back to 𝒮ⁿ₊₊. This map is called a retraction. It is known that the Riemannian metric leads to a retraction expressed through the matrix exponential. The corresponding gradient descent method is shown in Algorithm 1.
6.4. Approximating the Matrix Square Root
To compute the gradient of the square root of a matrix in the objective function, we approximate it using the Newton–Schulz method [54], which can be implemented with only matrix multiplications, as shown in Algorithm 2. It is amenable to automatic differentiation, so we can easily apply the gradient descent method to our algorithm.
Algorithm 1: Gradient descent on the manifold of positive definite matrices.
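Since the listing of Algorithm 1 is not reproduced here, the following is a sketch of manifold gradient descent on positive definite matrices under the affine-invariant metric, an assumption consistent with the references above but not necessarily the authors' exact formulation; the objective, step size, and iteration count are illustrative.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def spd_gradient_descent(f_grad, X0, lr=0.03, n_iter=300):
    """Gradient descent on SPD matrices: Riemannian gradient X G X under the
    affine-invariant metric and the exponential-map retraction."""
    X = X0.copy()
    for _ in range(n_iter):
        G = f_grad(X)
        G = 0.5 * (G + G.T)                       # keep the Euclidean gradient symmetric
        Xh = np.real(sqrtm(X))
        X = Xh @ expm(-lr * Xh @ G @ Xh) @ Xh     # move along -grad while staying SPD
    return X

# Example objective: f(X) = ||X - A||_F^2, whose Euclidean gradient is 2(X - A).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
A = Q @ np.diag([1.0, 2.0, 3.0, 4.0]) @ Q.T
X = spd_gradient_descent(lambda X: 2.0 * (X - A), np.eye(4))
print(np.linalg.norm(X - A))   # close to zero
```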
Algorithm 2: Newton–Schulz method.
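As with Algorithm 1, the listing is not reproduced here; below is a standard formulation of the coupled Newton–Schulz iteration for the matrix square root, using only matrix products so that automatic differentiation applies. The normalization and iteration count are illustrative and may differ from the authors' Algorithm 2.

```python
import numpy as np

def newton_schulz_sqrt(A, n_iter=30):
    """Approximate the square root of an SPD matrix A by the coupled
    Newton-Schulz iteration (Y_k -> A^{1/2}, Z_k -> A^{-1/2} after rescaling)."""
    n = A.shape[0]
    norm = np.linalg.norm(A)          # Frobenius norm; scaling ensures convergence
    Y = A / norm
    Z = np.eye(n)
    for _ in range(n_iter):
        T = 0.5 * (3.0 * np.eye(n) - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    return Y * np.sqrt(norm)

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B @ B.T + np.eye(5)
S = newton_schulz_sqrt(A)
print(np.linalg.norm(S @ S - A))      # small residual
```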
7. Conclusions and Future Work
In this paper, we studied entropy-regularized optimal transport and derived several results. We summarize them as follows and add notes on future work.
We obtained the explicit form of entropy-regularized optimal transport between two multivariate normal distributions and derived Corollaries 1 and 2, which clarify the properties of the optimal coupling. Furthermore, we demonstrated experimentally how entropy regularization affects the Wasserstein distance, the optimal coupling, and the geometric structure of multivariate normal distributions. Overall, the properties of the optimal coupling were revealed both theoretically and experimentally. We expect that the explicit formula can replace the existing methodology using the (nonregularized) Wasserstein distance between normal distributions (for example, [4,5]).
Theorem 2 derives the explicit form of the optimal coupling of the Tsallis entropy-regularized optimal transport between multivariate q-normal distributions. This optimal coupling is itself a multivariate q-normal distribution, and the result is analogous to that for the normal distribution. We believe that this result can be extended to other elliptical distribution families.
The entropy-regularized Kantorovich estimator of a probability measure in 𝒫₂(ℝⁿ) is the convolution of its own density function with an isotropic multivariate normal distribution. Our experiments show that both the entropy-regularized Kantorovich estimator and the entropy-regularized Wasserstein barycenter of multivariate normal distributions outperform the maximum likelihood estimator in terms of the prediction error for an adequately selected λ in a high-dimensional, small-sample setting. As future work, we want to show the efficiency of entropy regularization using real data.
Author Contributions
Conceptualization, Q.T.; methodology, Q.T.; software, Q.T.; writing—original draft preparation, Q.T.; writing—review and editing, K.K.; supervision, K.K. Both authors read and agreed to the published version of the manuscript.
Funding
This work was supported by RIKEN AIP and JSPS KAKENHI (JP19K03642, JP19K00912).
Data Availability Statement
All the data used are artificial and generated by pseudo-random numbers.
Conflicts of Interest
The authors declare no conflict of interest.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Villani, C. Optimal Transport: Old and New; Volume 338; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008.
- 2. Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover's Distance as a metric for image retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. doi:10.1023/A:1026543900054.
- 3. Solomon, J.; De Goes, F.; Peyré, G.; Cuturi, M.; Butscher, A.; Nguyen, A.; Du, T.; Guibas, L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph. 2015, 34, 1–11. doi:10.1145/2766963.
- 4. Muzellec, B.; Cuturi, M. Generalizing point embeddings using the Wasserstein space of elliptical distributions. In Advances in Neural Information Processing Systems; 2018; pp. 10237–10248.
- 5. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017.
- 6. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 214–223.
- 7. Nitanda, A.; Suzuki, T. Gradient layer: Enhancing the convergence of adversarial training for generative models. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Playa Blanca, Spain, 9–11 April 2018; pp. 1008–1016.
- 8. Sonoda, S.; Murata, N. Transportation analysis of denoising autoencoders: A novel method for analyzing deep neural networks. arXiv 2017, arXiv:1712.04145.
- 9. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103.
- 10. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems; 2013; pp. 2292–2300.
- 11. Sinkhorn, R.; Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 1967, 21, 343–348. doi:10.2140/pjm.1967.21.343.
- 12. Lin, T.; Ho, N.; Jordan, M.I. On the efficiency of the Sinkhorn and Greenkhorn algorithms and their acceleration for optimal transport. arXiv 2019, arXiv:1906.01437.
- 13. Lee, Y.T.; Sidford, A. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, Philadelphia, PA, USA, 18–21 October 2014; pp. 424–433.
- 14. Dvurechensky, P.; Gasnikov, A.; Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1367–1376.
- 15. Aude, G.; Cuturi, M.; Peyré, G.; Bach, F. Stochastic optimization for large-scale optimal transport. arXiv 2016, arXiv:1605.08527.
- 16. Altschuler, J.; Weed, J.; Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. arXiv 2017, arXiv:1705.09634.
- 17. Blondel, M.; Seguy, V.; Rolet, A. Smooth and sparse optimal transport. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Playa Blanca, Spain, 9–11 April 2018; pp. 880–889.
- 18. Cuturi, M.; Peyré, G. A smoothed dual approach for variational Wasserstein problems. SIAM J. Imaging Sci. 2016, 9, 320–343. doi:10.1137/15M1032600.
- 19. Lin, T.; Ho, N.; Jordan, M. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 3982–3991.
- 20. Allen-Zhu, Z.; Li, Y.; Oliveira, R.; Wigderson, A. Much faster algorithms for matrix scaling. In Proceedings of the 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, USA, 15–17 October 2017; pp. 890–901.
- 21. Cohen, M.B.; Madry, A.; Tsipras, D.; Vladu, A. Matrix scaling and balancing via box constrained Newton's method and interior point methods. In Proceedings of the 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, USA, 15–17 October 2017; pp. 902–913.
- 22. Blanchet, J.; Jambulapati, A.; Kent, C.; Sidford, A. Towards optimal running times for optimal transport. arXiv 2018, arXiv:1810.07717.
- 23. Quanrud, K. Approximating optimal transport with linear programs. arXiv 2018, arXiv:1810.05957.
- 24. Frogner, C.; Zhang, C.; Mobahi, H.; Araya, M.; Poggio, T.A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems; 2015; pp. 2053–2061.
- 25. Courty, N.; Flamary, R.; Tuia, D.; Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1853–1865. doi:10.1109/TPAMI.2016.2615921.
- 26. Lei, J. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli 2020, 26, 767–798. doi:10.3150/19-BEJ1151.
- 27. Mena, G.; Niles-Weed, J. Statistical bounds for entropic optimal transport: Sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems; 2019; pp. 4543–4553.
- 28. Rigollet, P.; Weed, J. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Math. 2018, 356, 1228–1235. doi:10.1016/j.crma.2018.10.010.
- 29. Balaji, Y.; Hassani, H.; Chellappa, R.; Feizi, S. Entropic GANs meet VAEs: A statistical approach to compute sample likelihoods in GANs. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 414–423.
- 30. Amari, S.I.; Karakida, R.; Oizumi, M. Information geometry connecting Wasserstein distance and Kullback–Leibler divergence via the entropy-relaxed transportation problem. Inf. Geom. 2018, 1, 13–37. doi:10.1007/s41884-018-0002-8.
- 31. Dowson, D.; Landau, B. The Fréchet distance between multivariate normal distributions. J. Multivar. Anal. 1982, 12, 450–455. doi:10.1016/0047-259X(82)90077-X.
- 32. Tsallis, C. Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. doi:10.1007/BF01016429.
- 33. Amari, S.I.; Karakida, R.; Oizumi, M.; Cuturi, M. Information geometry for regularized optimal transport and barycenters of patterns. Neural Comput. 2019, 31, 827–848. doi:10.1162/neco_a_01178.
- 34. Janati, H.; Muzellec, B.; Peyré, G.; Cuturi, M. Entropic optimal transport between (unbalanced) Gaussian measures has a closed form. In Advances in Neural Information Processing Systems; 2020.
- 35. Mallasto, A.; Gerolin, A.; Minh, H.Q. Entropy-regularized 2-Wasserstein distance between Gaussian measures. arXiv 2020, arXiv:2006.03416.
- 36. Peyré, G.; Cuturi, M. Computational optimal transport. Found. Trends Mach. Learn. 2019, 11, 355–607. doi:10.1561/2200000073.
- 37. Monge, G. Mémoire sur la Théorie des Déblais et des Remblais; Histoire de l'Académie Royale des Sciences de Paris: Paris, France, 1781.
- 38. Kantorovich, L.V. On the translocation of masses. Proc. USSR Acad. Sci. 1942, 37, 199–201. doi:10.1007/s10958-006-0049-2.
- 39. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620. doi:10.1103/PhysRev.106.620.
- 40. Mardia, K.V. Characterizations of directional distributions. In A Modern Course on Statistical Distributions in Scientific Work; Springer: Berlin/Heidelberg, Germany, 1975; pp. 365–385.
- 41. Petersen, K.; Pedersen, M. The Matrix Cookbook; Volume 15; Technical University of Denmark: Lyngby, Denmark, 2008.
- 42. Takatsu, A. Wasserstein geometry of Gaussian measures. Osaka J. Math. 2011, 48, 1005–1026.
- 43. Kruskal, J.B. Nonmetric multidimensional scaling: A numerical method. Psychometrika 1964, 29, 115–129. doi:10.1007/BF02289694.
- 44. Bhatia, R. Positive Definite Matrices; Volume 24; Princeton University Press: Princeton, NJ, USA, 2009.
- 45. Hiai, F.; Petz, D. Introduction to Matrix Analysis and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2014.
- 46. Marshall, A.W.; Olkin, I.; Arnold, B.C. Inequalities: Theory of Majorization and Its Applications; Volume 143; Springer: Berlin/Heidelberg, Germany, 1979.
- 47. Markham, D.; Miszczak, J.A.; Puchała, Z.; Życzkowski, K. Quantum state discrimination: A geometric approach. Phys. Rev. A 2008, 77, 042111. doi:10.1103/PhysRevA.77.042111.
- 48. Costa, J.; Hero, A.; Vignat, C. On solutions to multivariate maximum α-entropy problems. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2003; pp. 211–226.
- 49. Naudts, J. Generalised Thermostatistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
- 50. Kotz, S.; Nadarajah, S. Multivariate t-Distributions and Their Applications; Cambridge University Press: Cambridge, UK, 2004.
- 51. Clason, C.; Lorenz, D.A.; Mahler, H.; Wirth, B. Entropic regularization of continuous optimal transport problems. J. Math. Anal. Appl. 2021, 494, 124432. doi:10.1016/j.jmaa.2020.124432.
- 52. Agueh, M.; Carlier, G. Barycenters in the Wasserstein space. SIAM J. Math. Anal. 2011, 43, 904–924. doi:10.1137/100805741.
- 53. Jeuris, B.; Vandebril, R.; Vandereycken, B. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electron. Trans. Numer. Anal. 2012, 39, 379–402.
- 54. Higham, N.J. Newton's method for the matrix square root. Math. Comput. 1986, 46, 537–549.