Efficient Discrete Optimal Transport Algorithm by Accelerated Gradient Descent

Dongsheng An; Na Lei; Xiaoyin Xu; Xianfeng Gu

doi:10.1609/aaai.v36i9.21251

. Author manuscript; available in PMC: 2023 Nov 15.

Published in final edited form as: Proc AAAI Conf Artif Intell. 2022 Jun;36(9):10119–10128. doi: 10.1609/aaai.v36i9.21251

Efficient Discrete Optimal Transport Algorithm by Accelerated Gradient Descent

Dongsheng An ¹, Na Lei ^2,^*, Xiaoyin Xu ³, Xianfeng Gu ¹

PMCID: PMC10652357 NIHMSID: NIHMS1943443 PMID: 37974660

Abstract

Optimal transport (OT) plays an essential role in various areas like machine learning and deep learning. However, computing discrete OT for large scale problems with adequate accuracy and efficiency is highly challenging. Recently, methods based on the Sinkhorn algorithm add an entropy regularizer to the prime problem and obtain a trade off between efficiency and accuracy. In this paper, we propose a novel algorithm based on Nesterov’s smoothing technique to further improve the efficiency and accuracy in computing OT. Basically, the non-smooth c-transform of the Kantorovich potential is approximated by the smooth Log-Sum-Exp function, which smooths the original non-smooth Kantorovich dual functional. The smooth Kantorovich functional can be efficiently optimized by a fast proximal gradient method, the fast iterative shrinkage thresholding algorithm (FISTA). Theoretically, the computational complexity of the proposed method is given by $O (n^{\frac{5}{2}} \sqrt{\log n} ∕ ϵ)$ , which is lower than current estimation of the Sinkhorn algorithm. Experimentally, compared with the Sinkhorn algorithm, our results demonstrate that the proposed method achieves faster convergence and better accuracy with the same parameter.

Introduction

Optimal transport (OT) is a powerful tool to compute the Wasserstein distance between probability measures and widely used to model various natural and social phenomena, including economics (Galichon 2016), optics (Glimm and Oliker 2003), biology (Schiebinger et al. 2019), physics (Jordan, Kinderlehrer, and Otto 1998) and in other scientific fields. Recently, OT has been successfully applied in machine learning and statistics, such as parameter estimation in Bayesian non-parametric models (Nguyen 2013), computer vision (Arjovsky, Chintala, and Bottou 2017; Courty et al. 2017; Tolstikhin et al. 2018; An et al. 2020; Lei et al. 2020), and natural language processing (Kusner et al. 2015; Yurochkin et al. 2019). In these areas, the complex probability measures are approximated by summations of Dirac measures supported on the samples. To obtain the Wasserstein distance between the empirical distributions, we then solve the discrete OT problems.

Discrete Optimal Transport

In discrete OT problem, where both the source and target measures are discrete, the Kantorovich functional becomes a convex function defined on a convex domain. Due to the lack of smoothness, conventional gradient descend method can not be applied directly. Instead, it can be optimized with the sub-differential method (Nesterov 2005), in which the gradient is replaced by the sub-differential. To achieve an approximation error less than $ε$ , the sub-differential method requires $O (1 ∕ ε^{2})$ iterations. Recently, several approximation methods have been proposed to improve the computational efficiency. In these methods (Cuturi 2013; Benamou et al. 2015; Altschuler, Niles-Weed, and Rigollet 2017), a strongly convex entropy function is added to the prime Kantorovich problem and thus the regularized problem can be efficiently solved by the Sinkhorn algorithm. More detailed analysis shows that the computational complexity of the Sinkhorn algorithm is $\tilde{O} (n^{2} ∕ ε^{2})$ (Dvurechensky, Gasnikov, and Kroshnin 2018) by setting $λ = ϵ ∕ 4 \log n$ . Also, a series of primal-dual algorithms are proposed, including the APDAGD (adaptive primal-dual accelerated gradient descent) algorithm (Dvurechensky, Gasnikov, and Kroshnin 2018) with computational complexity $\tilde{O} (n^{2.5} ∕ ϵ)$ , the APDAMD (adaptive primal-dual accelerated mirror descent) algorithm (Lin, Ho, and Jordan 2019) with $\tilde{O} (n^{2} \sqrt{r} ∕ ϵ)$ where $r$ is a complex constant of the Bregman divergence, and the APDRCD (accelerated primal-dual randomized coordinate descent) algorithm (Guo, Ho, and Jordan 2020) with $\tilde{O} (n^{2.5} ∕ ϵ)$ . But all of the three methods need to build a matrix with space complexity $O (n^{3})$ , making them difficult to compute when $n$ is large.

Our Method

In this work, instead of starting from the prime Kantorovich problem like the Sinkhorn based methods, we directly deal with the dual Kantorovich problem. The key idea is to approximate the original non-smooth c-transform of the Kantorovich potential by Nesterov’s smoothing technique. Specifically, we approximate the max function by the Log-Sum-Exp function, which has also been used in (Schmitzer 2019; Peyré and Cuturi 2018), such that the original non-smooth Kantorovich functional is converted to an unconstrained ( $n - 1$ )-dimensional smooth convex energy. By using the Fast Proximal Gradient Method named FISTA (Beck and Teboulle 2009), we can quickly optimize the smoothed energy to get a precise estimate of the OT cost. In theory, the method can achieve the approximate error $ε$ with the space complexity $O (n^{2})$ and computational complexity $O (n^{2.5} \sqrt{\log n} ∕ ε)$ . Additionally, we show that the induced approximate OT plan by our algorithm is equivalent to that of the Sinkhorn algorithm. The contributions of our work are as follows.

We convert the dual Kantorovich problem to an unconstrained smooth convex optimization problem by approximating the non-smooth c-transform of the Kantorovich potential with Nesterov’s smoothing idea.
The smoothed Kantorovich functional can be efficiently solved by the FISTA algorithm with computational complexity $\tilde{O} (n^{2.5} ∕ \sqrt{ε})$ . At the same time, the computational complexity of the Kantorovich functional itself is given by $\tilde{O} (n^{2.5} ∕ ε)$ .
The experiments demonstrate that compared with the Sinkhorn algorithm, the proposed method achieves faster convergence and better accuracy with the same parameter $λ$ .

Notation

In this work, $R_{\geq 0}$ represents the non negative real numbers, $0$ and $1$ represents the all-zeros and all-ones vectors of appropriate dimension. The set of integers ${1, 2, \dots, n}$ is denoted as $[n]$ . And ${∣ \cdot ∣}_{1}$ and $‖ \cdot ‖$ are the $ℓ_{1}$ and $ℓ_{2}$ norms, ${∣ v ∣}_{1} = \sum_{i} ∣ v_{i} ∣$ and $‖ v ‖ = \sqrt{\sum_{i} v_{i}^{2}}$ , respectively. $R (C)$ is the range of the cost matrix $C = (c_{i j})$ , namely $C_{\max} - C_{min}$ , where $C_{\max}$ and $C_{min}$ represent the maximum and minimum of the elements of $C$ with $c_{i j} > 0$ . We use $ν_{min}$ to denote the minimal element of $ν$ and $⊘$ to denote element wise division.

Related Work

Optimal transport plays an important role in various kinds of fields, and there is a huge literature in this area. Here we mainly focus on the most related works. For detailed overview, we refer readers to (Peyré and Cuturi 2018).

When both the source and target measures are discrete, the OT problem can be treated as a standard linear programming (LP) task and solved by interior-point method with computational complexity $\tilde{O} (n^{5 ∕ 2})$ (Lee and Sidford 2014). But this method requires a practical solver of the Laplacian linear system, which is not currently available for large dataset. Another interior-point based method to solve the OT problem is proposed by Pele and Werman (Pele and Werman 2009) with complexity $\tilde{O} (n^{3})$ . Generally speaking, it is unrealistic to solve the large scale OT problem with the traditional LP solvers.

The prevalent way to compute the OT cost between two discrete measures involves adding a strongly convex entropy function to the prime Kantorovich problem (Cuturi 2013; Benamou et al. 2015). Most of the current solutions for the discrete OT problem follow this strategy. Genevay et al. (Genevay et al. 2016) extend the algorithm in its dual form and solve it by stochastic average gradient method. The Greenkhorn algorithm (Altschuler, Niles-Weed, and Rigollet 2017; Abid and Gower 2018; Chakrabarty and Khanna 2021) is a greedy version of the Sinkhorn algorithm. Specifically, Altschuler et al. (Altschuler, Niles-Weed, and Rigollet 2017) show that the complexity of their algorithm is $\tilde{O} (\frac{n^{2}}{ϵ^{3}})$ . Later, Dvurechensky et al. (Dvurechensky, Gasnikov, and Kroshnin 2018) improve the complexity bound of the Sinkhorn algorithm to $\tilde{O} (\frac{n^{2}}{ϵ^{2}})$ , and propose an APDAGD method with complexity $\tilde{O} (min {\frac{n^{9 ∕ 4}}{ϵ}, \frac{n^{2}}{ϵ^{2}}})$ . Jambulapati et al. (Jambulapati, Sidford, and Tian 2019) introduce a parallelelizable algorithm to compute the OT problem with complexity $\tilde{O} (\frac{n^{2} {‖ C ‖}_{\max}}{ϵ})$ . Through screening the negligible components by directly setting them at that value before entering the Sinkhorn problem, the screenkhorn (Alaya et al. 2019) method solves a smaller Sinkhorn problem and improves the computation efficiency. Based on a primal-dual formulation and a tight upper bound for the dual solution, Lin et al. (Lin, Ho, and Jordan 2019) improve the complexity bound of the Greenkhorn algorithm to $\tilde{O} (\frac{n^{2}}{ϵ^{2}})$ , and propose the APDAMD algorithm, whose complexity bound is proven to be $\tilde{O} (\frac{n^{2} \sqrt{r}}{ϵ})$ , where $r \in (0, n]$ refers to some constants in the Bregman divergence. Recently, a practically more efficient method called APDRCD (Guo, Ho, and Jordan 2020) is proposed with complexity $\tilde{O} (n^{2.5} ∕ ϵ)$ . But all these three primal-dual based methods need to build a matrix with space complexity $O (n^{3})$ , which makes them impractical when $n$ is large. By utilizing Newton-type information, Blanchet et al. (Blanchet et al. 2018) and Quanrud (Quanrud 2018) propose algorithms with complexity $\tilde{O} (\frac{n^{2}}{ϵ})$ . However, the Newton-based methods only give the theoretical upper bound and provide no practical algorithms.

Besides the entropy regularizer based methods, Blondel et al. (Blondel, Seguy, and Rolet 2018) use the squared 2-norm and group LASSO (least absolute shrinkage and selection operator) to regularize the prime Kantorovich problem and then use the quasi-Newton method to accelerate the algorithm. Xie et al. (Xie et al. 2019b) develop an Inexact Proximal point method for exact optimal transport. By utilizing the structure of the cost function, Gerber and Maggioni (Gerber and Maggioni 2017) optimize the transport plan from coarse to fine. Meng et al. (Meng et al. 2019) propose the projection pursuit Monge map, which accelerates the computation of the original sliced OT problem. Xie et al. (Xie et al. 2019a) also use the generative learning based method to model the optimal transport. But the theoretical analysis of these algorithms is still nascent.

In this work, we introduce a method based on Nesterov’s smoothing technique, which is applied to the dual Kantorovich problem with computational complexity $O (n^{2.5} \sqrt{\log n} ∕ ε)$ (or equivalently $\tilde{O} (n^{2.5} ∕ ε)$ and approximation error bound $2 λ \log n$ .

Optimal Transport Theory

In this section, we introduce some basic concepts and theorems in the classical optimal transport theory, focusing on Kantorovich’s approach and its generalization to the discrete settings via c-transform. The details can be found in Villani’s book (Villani 2008).

Optimal Transport Problem

Suppose $X \subset R^{d}$ , $Y \subset R^{d}$ are two subsets of the Euclidean space $R^{d}, μ, ν$ are two probability measures defined on $X$ and $Y$ with equal total measure, $μ (X) = ν (Y)$ .

Kantorovich’s Approach

Depending on the cost functions and the measures, the OT map between ( $X, μ$ ) and ( $Y, ν$ ) may not exist. Thus, Kantorovich relaxed the transport maps to transport plans, and defined joint probability measure $π : X \times Y \to R_{\geq 0}$ , such that the marginal probability of $π$ equals to $μ$ and $ν$ , respectively. Formally, let the projection maps be $ρ_{x} (x, y) = x$ , $ρ_{y} (x, y) = y$ , then we define

π (μ, ν) ≔ {P : X \times Y \to R_{\geq 0} : {(ρ_{x})}_{#} P = μ, {(ρ_{y})}_{#} P = ν} .

(1)

Problem 1 (Kantorovich Problem). Given the transport cost function $X \times Y \to R$ , find the joint probability measure $P : X \times Y \to R$ that minimizes the total transport cost

M_{c} (μ, ν) = min_{P \in π (μ, ν)} \int_{X \times Y} c (x, y) d P (x, y)

(2)

Problem 2 (Dual Kantorovich Problem). Given two probability measures $μ$ and $ν$ supported on $X$ and $Y$ , respectively, and the transport cost function $c : X \times Y \to R$ , the Kantorovich problem is equivalent to maximizing the following Kantorovich functional:

M_{c} (μ, ν) = \max {- \int_{X} ϕ d μ + \int_{Y} ψ d ν}

(3)

where $ϕ \in L^{1} (X, μ)$ and $ψ \in L^{1} (Y, ν)$ are called Kantorovich potentials and $- ϕ (x) + ψ (y) \leq c (x, y)$ . The above problem can be reformulated as the following minimization form with the same constraints:

M_{c} (μ, ν) = - min {\int_{X} ϕ d μ - \int_{Y} ψ d ν}

(4)

Definition 3 (c-transform). Let $ϕ \in L^{1} (X, μ)$ and $ψ \in L^{1} (Y, ν)$ , we define

ϕ (x) = ψ^{c} (x) = sup_{y \in Y} ψ (y) - c (x, y) .

With c-transform, Eqn. (4) is equivalent to solving the following optimization problem:

M_{c} (μ, ν) = - min {\int_{X} ψ^{c} (x) d μ (x) - \int_{Y} ψ (y) d ν (y)}

(5)

where $ψ \in L^{1} (Y, ν)$ . When $μ = \sum_{i = 1}^{m} μ_{i} δ (x - x_{i})$ and $ν = \sum_{j = 1}^{n} ν_{j} δ (y - y_{j})$ , $ψ = {(ψ_{1}, ψ_{2}, \dots, ψ_{n})}^{T}$ , Eqn. (5) gives the unconstrained convex optimization problem:

M_{c} (μ, ν) = - min_{ψ} E (ψ) = - min_{ψ} {\sum_{i = 1}^{m} μ_{i} ψ^{c} (x_{i}) - \sum_{j = 1}^{n} ν_{j} ψ_{j}}

(6)

where the c-transform of $ψ$ is given by:

ψ^{c} (x_{i}) = \max_{j} {ψ_{j} - c_{i j}}

(7)

where $c_{i j} = c (x_{i}, y_{j})$ . Suppose $ψ^{*}$ is the solution to Eqn. (6), then it has the following properties:

If the cost function is $\hat{c} (x, y) = c (x, y) - k$ , where $k$ is a constant, the corresponding optimal solution is ${\hat{ψ}}^{*}$ , then ${\hat{ψ}}^{*} = ψ^{*}$ . At the same time, we have $M_{c} (μ, ν) = M_{\hat{c}} (μ, ν) + k$ .
$ψ^{*} + k 1$ is also an optimal solution for Eqn. (6).

In order to make the solution unique, we add a constraint $ψ \in H$ using the indicator function $I_{H}$ , where $H = {ψ ∣ \sum_{j = 1}^{n} ψ_{j} = 0}$ , and modify the Kantorovich functional $E (ψ)$ in Eqn. (6) as:

\tilde{E} (ψ) = E (ψ) + I_{H} (ψ), I_{H} (ψ) = {\begin{matrix} 0 & ψ \in H \\ \infty & ψ \notin H \end{matrix}

(8)

Then solving Eqn. (6) is equivalent to finding the solution to:

M_{c} (μ, ν) = - min_{ψ} \tilde{E} (ψ) = - min_{ψ} {\sum_{i = 1}^{m} μ_{i} ψ^{c} (x_{i}) - \sum_{j = 1}^{n} ν_{j} ψ_{j} + I_{H} (ψ)}

(9)

which is essentially an ( $n - 1$ )-dimensional unconstrained convex problem. According to the definition of $c$ -transform in 7), $ψ^{c}$ is non-smooth with respect to $ψ$ .

Nesterov’s Smoothing of Kantorovich functional

Following Nesterov’s original strategy (Nesterov 2005), which has also been applied in the OT field (Peyré and Cuturi 2018; Schmitzer 2019), we smooth the non-smooth discrete Kantorovich functional $E (ψ)$ . We approximate $ψ^{c} (x)$ with the Log-Sum-Exp function to get the smooth Kantorovich functional $E_{λ} (ψ)$ . Then through the FISTA algorithm (Beck and Teboulle 2009), we can easily induce that the computation complexity of our algorithm is $O (n^{2.5} \sqrt{\log n} ∕ ε)$ , with $\tilde{E} (ψ^{t}) - \tilde{E} (ψ^{*}) \leq ϵ$ . By abuse of notation, in the following we call both $E (ψ)$ and $\tilde{E} (ψ)$ the Kantorovich functional and both $E_{λ} (ψ)$ and ${\tilde{E}}_{λ} (ψ)$ the smooth Kantorovich functional.

Definition 4 ( $(α, β)$ -smoothable). A convex function $f$ is called ( $α, β$ )-smoothable if, for any $λ > 0$ , $\exists a$ a convex function $f_{λ}$ such that

f_{λ} (x) \leq f (x) \leq f_{λ} (x) + β λ f_{λ} (y) \leq f_{λ} (x) + 〈 \nabla f_{λ} (x), y - x 〉 + \frac{α}{2 λ} {(y - x)}^{T} \nabla^{2} f_{λ} (x) (y - x)

Here $f_{λ}$ is called a $\frac{1}{λ}$ -smooth approximation of $f$ with parameters ( $α, β$ ).

In the above definition, the parameter $λ$ defines a trade-off between the approximation accuracy and the smoothness, where the smaller the $λ$ , the better approximation and the less smoothness we obtain.

Lemma 5 (Nesterov’s Smoothing). Given $f : R^{n} \to R$ , $f (x) = \max {x_{j} : j = 1, \dots, n}$ , for any $λ > 0$ , we have its $\frac{1}{λ}$ -smooth approximation with parameters ( $1, \log n$ )

f_{λ} (x) = λ \log (\sum_{j = 1}^{n} e^{x_{j} ∕ λ}) - λ \log n,

(10)

Proof. We have $\forall x \in R^{n}$ ,

f_{λ} (x) \leq λ \log (n \max_{j} e^{x_{j} ∕ λ}) - λ \log (n) = f (x) f (x) = λ \log \max_{j} e^{x_{j} ∕ λ} < λ \log \sum_{j = 1}^{n} e^{x_{j} ∕ λ} = f_{λ} (x) + λ \log n

Furthermore, it is easy to prove that $f_{λ} (x)$ is $\frac{1}{λ}$ -smooth. Therefore, $f_{λ} (x)$ is an approximation of $f (x)$ with parameters ( $1, \log n$ ).

Recalling the definition of c-transform of the Kantorovich potential in Eqn. (7), we obtain the Nesterov’s smoothing of $ψ^{c}$ by applying Eqn. (10)

ψ_{λ}^{c} = λ \log (\sum_{j = 1}^{n} e^{(ψ_{j} - c_{i j}) ∕ λ}) - λ \log n .

(11)

We use $ψ_{λ}^{c}$ to replace $ψ^{c}$ in Eqn. (9) to approximate the Kantorovich functional. Then the Nesterov’s smoothing of the Kantorovich functional becomes

E_{λ} (ψ) = λ \sum_{i = 1}^{m} μ_{i} \log (\sum_{j = 1}^{n} e^{(ψ_{j} - c_{i j}) ∕ λ}) - \sum_{j = 1}^{n} ν_{j} ψ_{j} - λ \log n

(12)

and its gradient is given by

\frac{\partial E_{λ}}{\partial ψ_{j}} = \sum_{i = 1}^{m} μ_{i} \frac{e^{(ψ_{j} - c_{i j}) ∕ λ}}{\sum_{k = 1}^{n} e^{(ψ_{k} - c_{i k}) ∕ λ}} - ν_{j}, \forall j \in [n]

(13)

Furthermore, we can directly compute the Hessian matrix of $E_{λ} (ψ)$ . Let $K_{i j} = e^{- c_{i j} ∕ λ}$ and $v_{j} = e^{ψ_{j} ∕ λ}$ , and set $E_{λ}^{i} ≔ λ \log \sum_{j = 1}^{n} K_{i j} v_{j}$ , $\forall i \in [m]$ . Direct computation gives the following gradient and Hessian matrix:

\nabla E_{λ} = d i a g (v) K^{T} (μ ⊘ K v) - ν \nabla^{2} E_{λ} = \sum_{i = 1}^{m} μ_{i} \nabla^{2} E_{λ}^{i}, \nabla^{2} E_{λ}^{i} = \frac{1}{λ} (\frac{1}{1^{T} V_{i}} Λ_{i} - \frac{1}{{(1^{T} V_{i})}^{2}} V_{i} V_{i}^{T})

(14)

where $V_{i} = {(K_{i 1} v_{1}, K_{i 2} v_{2}, \dots, K_{i n} v_{n})}^{T}$ , and $Λ_{i} = diag (K_{i 1} v_{1}, K_{i 2} v_{2}, \dots, K_{i n} v_{n})$ . By the Hessian matrix, we can show that $E_{λ}$ is a smooth approximation of $E$ .

Lemma 6. $E_{λ} (ψ)$ is a $\frac{1}{λ}$ -smooth approximation of $E (ψ)$ with parameters ( $1, \log n$ ).

Proof. From Eqn. (14), we see $\nabla^{2} E_{λ}^{i}$ has $K = {k 1 : k \in R}$ as its null space. In the orthogonal complementary space of $K$ , $\nabla^{2} E_{λ}^{i}$ is diagonal dominant, therefore strictly positive definite.

Weyl’s inequality (Horn and Johnson 1991) states that the eigen value of $A = B + C$ is no greater than the maximal eigenvalue of $B$ minus the minimal eigenvalue of $C$ , where $B$ is an exact matrix and $C$ is a perturbation matrix. Hence the maximal eigenvalue of $\nabla^{2} E_{λ}^{i}$ , denoted as $σ_{i}$ , has an upper bound,

0 \leq σ_{i} \leq \frac{1}{λ} \frac{1}{1^{T} V_{i}} \max_{j} {K_{i j} v_{j}} \leq \frac{1}{λ} .

Thus the maximal eigenvalue of $\nabla^{2} E_{λ} (ψ)$ is no greater than $\sum_{i = 1}^{n} μ_{i} σ_{i} \leq \frac{1}{λ}$ . It is easy to find that $E_{λ} (ψ) \leq E (ψ) \leq E_{λ} (ψ) + λ \log n$ . Thus, $E_{λ} (ψ)$ is a $\frac{1}{λ}$ -smooth approximation of $E (ψ)$ with parameters ( $1, \log n$ ).

Lemma 7. Suppose $E_{λ} (ψ)$ is the $\frac{1}{λ}$ -smooth approximation of $E (ψ)$ with parameters $1, \log n$ , $ψ_{λ}^{*}$ is the optimizer of $E_{λ} (ψ)$ , then the approximate OT plan is unique and given by

{(P_{λ}^{*})}_{i j} = μ_{i} \frac{e^{({(ψ_{λ}^{*})}_{j} - c_{i j}) ∕ λ}}{\sum_{k = 1}^{n} e^{({(ψ_{λ}^{*})}_{k} - c_{i k}) ∕ λ}} = \frac{μ_{i} K_{i j} v_{j}^{*}}{K_{i} v^{*}}

(15)

where $K_{i}$ is the $i t h$ row of $K$ and $v^{*} = e^{ψ_{λ}^{*} ∕ λ}$ .

Proof. By the gradient formula Eqn. (13) and the optimizer $ψ_{λ}^{*}$ , we have

\frac{\partial E_{λ} (ψ^{*})}{\partial ψ_{j}} = \sum_{i = 1}^{m} {(P_{λ}^{*})}_{i j} - ν_{j} = 0 \forall j = 1, \dots, n .

On the other hand, by the definition of $P_{λ}^{*}$ , we have

\sum_{j = 1}^{n} {(P_{λ}^{*})}_{i j} = μ_{i}, \forall i = 1, \dots, m,

Combing the above two equations, we obtain that $P_{λ}^{*} \in π (μ, ν)$ and it is the approximate OT plan.

Similar to the discrete Kantorovich functional in Eqn. (6), the optimizer of the smooth Kantorovich functional in Eqn. (12) is also not unique: given an optimizer $ψ_{λ}^{*}$ , then $ψ_{λ}^{*} + k 1$ , $k \in R$ is also an optimizer. We can eliminate the ambiguity by adding the indicator function as Eqn. (8), ${\tilde{E}}_{λ} (ψ) = E_{λ} (ψ) + I_{H} (ψ)$ ,

{\tilde{E}}_{λ} (ψ) = λ \sum_{i = 1}^{m} μ_{i} \log (\sum_{j = 1}^{n} e^{(ψ_{j} - c_{i j}) ∕ λ}) - \sum_{j = 1}^{n} ν_{j} ψ_{j} - λ \log n + I_{H} (ψ)

(16)

This energy can be optimized effectively through the following FISTA iterations (Beck and Teboulle 2009).

z^{t + 1} = Π_{η_{t} I_{H}} (ψ^{t} - η_{t} \nabla E_{λ} (ψ^{t})) ψ^{t + 1} = z^{t + 1} + \frac{θ_{t} - 1}{θ_{t + 1}} (z^{t + 1} - z^{t})

(17)

with initial conditions $ψ^{0} = v^{0} = 0$ , $θ_{0} = 1$ , $η_{t} = λ$ and $θ_{t + 1} = \frac{1}{2} (1 + \sqrt{1 + 4 θ_{t}^{2}})$ . Here $Π_{η_{t} I_{H}} (z) = z - \frac{1}{n} \sum_{j = 1}^{n} z_{j}$ is the projection of $z$ to $H$ (the proximal function of $I_{H} (x)$ (Parikh and Boyd 2014). Similar to the Sinkhorn’s algorithm, this algorithm can be parallelized, since all the operations are row based.

Theorem 8. Given the cost matrix $C = (c_{i j})$ , the source measure $μ \in R_{+}^{m}$ and target measure $ν \in R_{+}^{n}$ with $\sum_{i = 1}^{m} μ_{i} = \sum_{j = 1}^{n} ν_{j} = 1$ , $ψ^{*}$ is the optimizer of the discrete dual Kantorovich functional $\tilde{E} (ψ)$ , and $ψ_{λ}^{*}$ is the optimizer of the smooth Kantorovich functional ${\tilde{E}}_{λ} (ψ)$ . Then the approximation error is

∣ \tilde{E} (ψ^{*}) - {\tilde{E}}_{λ} (ψ_{λ}^{*}) ∣ \leq 2 λ \log n

Proof. Assume $ψ^{*}$ and $ψ_{λ}^{*}$ are the minimizers of $E (ψ)$ and $E_{λ} (ψ)$ respectively. Then by the inequality in Eqn. (10)

E_{λ} (ψ^{*}) \leq E (ψ^{*}) \leq E (ψ_{λ}^{*}) \leq E_{λ} (ψ_{λ}^{*}) + λ \log n E_{λ} (ψ_{λ}^{*}) \leq E_{λ} (ψ^{*}) \leq E (ψ^{*}) \leq E_{λ} (ψ^{*}) + λ \log n

This shows $∣ E_{λ} (ψ^{*}) - E_{λ} (ψ_{λ}^{*}) ∣ \leq λ \log n$ . Removing the indicator functions, we can get

∣ \tilde{E} (ψ^{*}) - {\tilde{E}}_{λ} (ψ_{λ}^{*}) ∣ = ∣ E (ψ^{*}) - E_{λ} (ψ_{λ}^{*}) ∣ \leq ∣ E (ψ^{*}) - E_{λ} (ψ^{*}) ∣ + ∣ E_{λ} (ψ^{*}) - E_{λ} (ψ_{λ}^{*}) ∣ \leq 2 λ \log n

This also shows that $E (ψ_{λ}^{*})$ converges quickly to $E (ψ^{*})$ as the decrease of $λ$ . The convergence analysis of FISTA is given as follows:

Theorem 9 (Thm 4.4 of (Beck and Teboulle 2009)). Assume $(1) g (x)$ is convex and differentiable with $d o m (g) = R^{n}$ , $\nabla g$ is Lipschitz continuous with Lipschitz constant $L > 0$ ; and $(2) h (x)$ is convex and its proximal function can be evaluated. Then from the minimization of $f (x) = g (x) + h (x)$ by FISTA with fixed step size $η_{t} = \frac{1}{L}$ , we can get

f (x^{t}) - f (x^{*}) \leq \frac{2 L}{{(t + 1)}^{2}} {‖ x^{0} - x^{*} ‖}^{2}

(18)

Corollary 10. Suppose $λ$ is fixed and $ψ^{0} = 0$ , then for any $t \geq \sqrt{\frac{2 {‖ ψ_{λ}^{*} ‖}^{2}}{λ ε}}$ , we have

{\tilde{E}}_{λ} (ψ^{t}) - {\tilde{E}}_{λ} (ψ_{λ}^{*}) \leq ε,

(19)

where $ψ_{λ}^{*}$ is the optimizer of ${\tilde{E}}_{λ} (ψ)$ .

Proof. Under the settings of the smoothed Kantorovich problem Eqn. (12), $E_{λ} (ψ)$ is convex and differentiable with $\nabla^{2} E_{λ} (ψ) \underline{≺} \frac{1}{λ} I$ , $I_{H} (ψ)$ is convex and its proximal function is given by $Π_{H} (v)$ . Thus, directly applying Thm. 9 and setting $L = \frac{1}{λ}$ , we can get ${\tilde{E}}_{λ} (ψ^{t}) - {\tilde{E}}_{λ} (ψ_{λ}^{*}) \leq \frac{2}{λ {(t + 1)}^{2}} ‖ ψ_{λ}^{*} - ψ^{0} ‖^{2}$ . Set $ψ^{0} = 0$ and $\frac{2}{λ {(t + 1)}^{2}} ‖ ψ_{λ}^{*} ‖^{2} \leq ε$ , then we get that, when $t \geq \sqrt{\frac{2 ‖ ψ_{λ}^{*} ‖^{2}}{λ ε}}$ , we have ${\tilde{E}}_{λ} (ψ^{t}) - {\tilde{E}}_{λ} (ψ_{λ}^{*}) \leq ε$ .

With the above analysis of the convergence of the smooth Kantorovich functional ${\tilde{E}}_{λ} (ψ^{t})$ , in the following we give the convergence analysis of the original Kantorovich functional $\tilde{E} (ψ^{t})$ in Eqn. (9), where $ψ^{t}$ is obtained by FISTA.

Theorem 11. If $λ = \frac{ε}{2 \log n}$ , then for any $t \geq \frac{\sqrt{8 ‖ \bar{C} ‖^{2} n \log n}}{ϵ}$ , with $\bar{C} = C_{\max} - λ \log ν_{m i n}$ , we have

\tilde{E} (ψ^{t}) - \tilde{E} (ψ^{*}) < ε,

(20)

where $ψ^{t}$ is the solver of ${\tilde{E}}_{λ} (ψ)$ after $t$ steps in the iterations in Alg. 1, and $ψ^{*}$ is the optimizer of $\tilde{E} (ψ)$ . Then the total computational complexity is $O (\frac{n^{2.5} \sqrt{\log n}}{ε})$ .

Algorithm 1: Accelerated gradient descent for OT

\begin{matrix} 1 : Input : The cost matrix C = (c_{i j}), the corresponding source \\ weights μ and target weights ν, the approximate parameter λ, \\ and the step length η . \\ 2 : Output : The smoothed Kantorovich functional ψ_{λ} . \\ 3 : Initialize ψ = (ψ_{1}, ψ_{2}, \dots, ψ_{n}) \leftarrow (0, 0, \dots, 0) . \\ 4 : Initialize z \leftarrow (0, 0, \dots, 0) . \\ 5 : Initialize K = e^{- C ∕ λ}, θ_{0} = 1 \\ 6 : repeat \\ 7 : v^{t} = e^{ψ^{t} ∕ λ} . \\ 8 : \nabla E_{λ} (ψ^{t}) = d i a g (v) K^{T} (μ ∣ ⊘ K v) - ν . \\ 9 : z^{t + 1} = ψ^{t} - η \nabla E_{λ} (ψ^{t}) \\ 10 : z^{t + 1} = z^{t + 1} - m e a n (z^{t + 1}) . \\ 11 : ψ^{t + 1} = z^{t + 1} + \frac{θ_{t} - 1}{θ_{t + 1}} (z^{t + 1} - z^{t}) . \\ 12 : θ_{t + 1} = \frac{1}{2} (1 + \sqrt{1 + 4 θ_{t}^{2}}) . \\ 13 : t = t + 1 \\ 14 : until Converge \\ 15 : The OT cost E (ψ^{t}) = \sum_{i + 1}^{m} μ_{i} ψ^{c} (x_{i}) - \sum_{j = 1}^{n} ψ_{j}^{t} ν_{j} . \end{matrix}

Open in a new tab

Proof. We set the initial condition $ψ^{0} = 0$ . For any given $ε > 0$ , we choose iteration step $t$ , such that $\frac{2}{λ {(t + 1)}^{2}} ‖ ψ_{λ}^{*} ‖^{2} \leq \frac{ε}{2}$ , $t \geq \frac{\sqrt{8 ‖ ψ_{λ}^{*} ‖^{2} \log n}}{ϵ}$ , where $ψ_{λ}^{*}$ is the optimizer of ${\tilde{E}}_{λ} (ψ)$ . By theorem 9, we have

{\tilde{E}}_{λ} (ψ^{t}) - \tilde{E} (ψ^{*}) \leq {\tilde{E}}_{λ} (ψ^{t}) - {\tilde{E}}_{λ} (ψ^{*}) \leq {\tilde{E}}_{λ} (ψ^{t}) - {\tilde{E}}_{λ} (ψ_{λ}^{*}) \leq \frac{2}{{λ (t + 1)}^{2}} {‖ ψ_{λ}^{*} ‖}^{2} \leq \frac{ε}{2}

By Eqn. (16), we have

\tilde{E} (ψ^{t}) - \tilde{E} (ψ^{*}) = E (ψ^{t}) + I_{H} (ψ^{t}) - E (ψ^{*}) - I_{H} (ψ^{*}) = (E (ψ^{t}) - E_{λ} (ψ^{t})) + (E_{λ} (ψ^{t}) + I_{H} (ψ^{t})) - (E (ψ^{*}) + I_{H} (ψ^{*})) \leq ∣ E (ψ^{t}) - E_{λ} (ψ^{t}) ∣ + ({\tilde{E}}_{λ} (ψ^{t}) - \tilde{E} (ψ^{*})) \leq λ \log n + \frac{ε}{2} = ε

Next we show that $‖ ψ_{λ}^{*} ‖^{2} \leq n ‖ \bar{C} ‖^{2}$ by proving $∣ {(ψ_{λ}^{*})}_{j} ∣ \leq \bar{C} \forall j \in [n]$ . According to Eq. (15),

ν_{j} = \sum_{i = 1}^{m} μ_{i} \frac{e^{({(ψ_{λ}^{*})}_{j} - c_{i j}) ∕ λ}}{\sum_{k = 1}^{n} e^{({(ψ_{λ}^{*})}_{k} - c_{i k}) ∕ λ}} = \sum_{i = 1}^{m} μ_{i} \frac{e^{{(ψ_{λ}^{*})}_{j} ∕ λ}}{\sum_{k = 1}^{n} e^{{(ψ_{λ}^{*})}_{k} ∕ λ} e^{(c_{i j} - c_{i k}) ∕ λ}}

(21)

Assume ${(ψ_{λ}^{*})}_{\max}$ is the maximal element of $ψ_{λ}^{*}$ , we have $\sum_{k = 1}^{n} e^{{(ψ_{λ}^{*})}_{k} ∕ λ} e^{(c_{i j} - c_{i k}) ∕ λ} \geq e^{{(ψ_{λ}^{*})}_{\max} ∕ λ} e^{- C_{\max} ∕ λ}$ , where $C_{\max}$ is the maximal element of the matrix $C$ . Thus,

ν_{j} \leq \sum_{i = 1}^{n} μ_{i} \frac{e^{{(ψ_{λ}^{*})}_{j} ∕ λ}}{e^{{(ψ_{λ}^{*})}_{\max} ∕ λ} e^{- C_{\max} ∕ λ}} = \frac{e^{{(ψ_{λ}^{*})}_{j} ∕ λ}}{e^{{(ψ_{λ}^{*})}_{\max} ∕ λ} e^{- C_{\max} ∕ λ}}

Then, ${(ψ_{λ}^{*})}_{\max} \leq ({ψ_{λ}^{*})}_{j} + C_{\max} - λ \log v_{j}$ and

{(ψ_{λ}^{*})}_{\max} \leq \frac{1}{n} \sum_{j = 1}^{n} {{(ψ_{λ}^{*})}_{j} + C_{\max} - λ \log ν_{j}} \leq C_{\max} - λ \log μ_{m i n}

(22)

According to the inequality of arithmetic and geometric means, we have $\sum_{k = 1}^{n} e^{({ψ_{λ}^{*})}_{k} ∕ λ} \geq {n e}^{\frac{1}{n} (\sum_{k = 1}^{n} (ψ_{λ}^{*})_{k} ∕ λ)} = n$ . Thus, $ν_{j} \leq \frac{e^{({ψ_{λ}^{*})}_{j} ∕ λ}}{n e^{- C_{\max} ∕ λ}}$ .

{(ψ_{λ}^{*})}_{j} \geq λ \log n - C_{\max} + λ \log ν_{j} \geq λ \log ν_{m i n} - C_{\max}

(23)

Combine Eqn. (22) and (23), we have $∣ {(ψ_{λ}^{*})}_{j} ∣ \leq C_{\max} - λ \log ν_{m i n} = \bar{C}$ . Hence, we obtain that when $t \geq \frac{\sqrt{8 {‖ \bar{C} ‖}^{2} n \log n}}{ϵ}, \tilde{E} (ψ^{t}) - \tilde{E} (ψ^{*}) < ε$ .

For each iteration in Eqn. (17), we need $O (n^{2})$ times of operations, thus total the computational complexity of the proposed method is $O (\frac{n^{2.5} \sqrt{\log n}}{ε})$ .

Relationship with Softmax

If there exists an OT map from $μ$ to $ν$ , then each sample $x_{i}$ of the source distribution is classified into the corresponding $y_{j} = T (x_{i})$ . If there does not exist an OT map, we can only get the OT plan, which can be treated as a soft classification problem: each weighted sample $x_{i}$ with weight $μ_{i}$ will be sent to the corresponding $y_{j} s$ with weight $μ_{i} \frac{P_{i j}}{\sum_{k = 1}^{n} P_{i k}}$ where $P_{i j} > 0$ . Here $P_{i j} = μ_{i} \frac{P_{i j}}{\sum_{k = 1}^{n} P_{i k}}$ gives the OT plan from the source to the target distribution. The smoothed OT plan given by minimizing the smooth Kantorovich functional can be further treated as a relaxed OT plan. Instead of sending the weights of a specific sample to several target samples, the smooth solver tends to send each source sample to all of the target samples weighted by $\frac{e^{(ψ_{j}^{*} - c_{i j}) ∕ λ}}{\sum_{k = 1}^{n} e^{(ψ_{k}^{*} - c_{i k}) ∕ λ}}$ . Sample $x_{j}$ weighted by $μ_{i}$ will be sent to $y_{j}$ with weight $μ_{i} \frac{e^{(ψ_{j}^{*} - c_{i j}) ∕ λ}}{\sum_{k = 1}^{n} e^{(ψ_{k}^{*} - c_{i k}) ∕ λ}}$ .

Relationship with entropy regularized OT problem

The Sinkhorn algorithm is deduced from minimizing the entropy regularized OT problem (Cuturi 2013): $〈 P, C 〉 + λ K L (P ∣ μ \otimes ν)$ with $P \in π (μ, ν)$ . Its dual is given by (Genevay et al. 2016):

W_{λ} (μ, ν) = - min_{ψ} {λ \sum_{i = 1}^{m} μ_{i} \log (\sum_{j = 1}^{n} ν_{j} e^{(ψ_{j} - c_{i j}) λ}) - \sum_{j = 1}^{n} ν_{j} ψ_{j} + λ}

(24)

with gradient $\frac{\partial W_{λ}}{\partial ψ_{j}} = \sum_{i = 1}^{m} μ_{i} \frac{ν_{j} e^{(ψ_{j} - c_{i j}) ∕ λ}}{\sum_{k = 1}^{n} ν_{k} e^{(ψ_{k} - c_{i k}) ∕ λ}} - ν_{j}$ . With the optimal solver $ψ^{*}$ , the approximate OT plan is given by $P_{i j} = μ_{i} \frac{ν_{j} e^{(ψ_{j} - c_{i j}) ∕ λ}}{\sum_{k = 1}^{n} ν_{k} e^{(ψ_{k} - c_{i k}) ∕ λ}}$ . We can compare them with our gradient Eqn. (13) and approximated OT plan Eqn. (15) to see the subtle differences. Actually, by setting $ψ ≔ ψ - \log ν$ , the minimizing problem in Eqn. (24) is equivalent to our smoothed semi-discrete problem of Eqn. (16) with a different constant term.

Furthermore, if we set $u = {(u_{1}, u_{2}, \dots, u_{m})}^{T}$ with $u_{i} = \frac{μ_{i}}{K_{i} v}$ in Eqn. (15), the computed approximate OT plan can be rewritten as $P_{λ} = d i a g (u) K d i a g (v)$ , which is the same as the form of the Sinkhorn solution (Cuturi 2013). Since the solution of the Sinkhorn algorithm is unique, we conclude that the induced approximate optimal transport plan Eqn. (15) by our algorithm is equivalent to that of the Sinkhorn.

Experiments

In this section, we investigate the performance of the proposed algorithm under different parameters, and then compare it with the Sinkhorn algorithm (Cuturi 2013). In the following, we first introduce the various settings of the experiments including the parameters, the cost matrix and the evaluation metrics. Then we show the experimental results. All of the codes were written in MATLAB with GPU acceleration, including the proposed method and the Sinkhorn algorithm (Cuturi 2013). The experiments are also conducted on a Windows laptop with Intel Core i7-7700HQ CPU, 16 GB memory and NVIDIA GTX 1060Ti GPU.

Parameters

There are two parameters involved in the proposed algorithm, $λ$ and $η_{t}$ . The former is used to control the approximate accuracy between the Log-Sum-Exp function and the Kantorovich potential $ψ^{c}$ in Eqn. (11), and the latter controls the step size of the FISTA algorithm in Eqn. (17). Basically, smaller $λ$ gives better approximation.

In our experiments, to get $λ$ as small as possible, based on the Property 1 of the Eqn. (6), we set the median of the cost matrix $C$ equal to zero, so that the full range of the exponential of the floating-point numbers can be used, instead of only the negative part¹. Thus we set $C = C - \frac{C_{\max} + C_{min}}{2}$ and call it the translation trick. If the range of $C$ is denoted as $R$ , then the accuracy parameter is set to be $λ = \frac{R}{T}$ , where $T$ is a positive constant. For the FISTA algorithm, the ideal step size should be $η_{t} = \frac{1}{σ_{\max}}$ , where $σ_{\max}$ is the maximal eigenvalue of the Hessian matrix $\nabla^{2} {\tilde{E}}_{λ} (ψ)$ in Eqn. (14). By Nesterov smoothing, we know $σ_{\max} \leq \frac{1}{λ}$ , so we set the step length $η_{t} = η λ$ , where $η$ is a constant². In practice we use ( $T, η$ ) as control parameters instead of ( $λ, η_{t}$ ).

Cost Matrix

In the following experiments, we test the performance of the algorithm with different parameters under different metrics. Specifically, we set $μ = \sum_{i = 1}^{m} μ_{i} δ (x - x_{i})$ , $ν = \sum_{j = 1}^{n} ν_{j} δ (y - y_{j})$ . Note that after the settings of $μ_{i} s$ and $ν_{j} s$ , they are normalized by $μ_{i} = \frac{μ_{i}}{\sum_{k = 1}^{m} μ_{k}}$ and $ν_{j} = \frac{ν_{j}}{\sum_{k = 1}^{n} ν_{k}}$ . To build the cost matrix, we use the Euclidean distance, squared Euclidean distance, spherical distance, and random cost matrix.

For Euclidean distance (ED) and the squared Euclidean distance (SED) experiments, in experiment 1, $x_{i} ’ s$ are randomly generated from the Gaussian distribution $𝒩 (3 1_{d}, I_{d})$ and $y_{j} ’ s$ are randomly sampled from the uniform distribution $U n i ({[0, 1]}^{d}) - 5$ . Both $μ_{i}$ and $ν_{j}$ are randomly generated from the uniform distribution $U n i ([0, 1])$ . Experiment 3 also uses a similar sampling strategy to build the discrete source and target measures. In experiment 2, like (Altschuler, Niles-Weed, and Rigollet 2017), we randomly choose one pair of images from the MNIST dataset (LeCun and Cortes 2010), and then add negligible noise 0.01 to each background pixel with intensity 0. $μ_{i}$ and $x_{i}$ ( $ν_{j}$ and $y_{j}$ ) are set to be the value and the coordinate of each pixel in the source (target) image. Then the Euclidean distance and squared Euclidean distance between $x_{i}$ and $y_{j}$ are given by $c (x_{i}, y_{j}) = ‖ x_{i} - y_{j} ‖$ and $c (x_{i}, y_{j}) = {‖ x_{i} - y_{j} ‖}^{2}$ , respectively.
For the spherical distance (SD) experiment, both $μ_{j}$ $ν_{j}$ are randomly generated from the uniform distribution $U n i ([0, 1])$ . $x_{i} ’ s$ are randomly generated from the Gaussian distribution $𝒩 (3 1_{d}, I_{d})$ and $y_{j} ’ s$ are randomly generated from the uniform distribution $U n i ({[0, 1]}^{d})$ . Then we normalize $x_{i}$ and $y_{j}$ by $x_{i} = \frac{x_{i}}{{‖ x_{i} ‖}_{2}}$ and $y_{j} = \frac{y_{j}}{{‖ y_{j} ‖}_{2}}$ . As a result, both $x_{i} ’ s$ and $y_{j} ’ s$ are located on the sphere. The spherical distance is given by $c (x_{i}, y_{j}) = arccos (〈 x_{i}, y_{j} 〉)$ .
For the random distance (RD) matrix experiment, both $μ_{i}$ and $ν_{j}$ are randomly generated from the uniform distribution $U n i ([0, 1])$ . Also, to build $C$ , we randomly sample $c_{i j}$ from the Gaussian distribution $𝒩 (0, 1)$ , then $C$ is defined as $C = C - C_{min} + 1.0$ .

Evaluation Metrics

We use two metrics to evaluate the proposed method: the first one is the transport cost, which is defined by Eqn. (6) and is given by $- E (ψ)$ ; and the second is the $L_{1}$ distance from the computed transport plan $P_{λ}$ to the admissible distribution space $π (μ, ν)$ defined in Eqn. (1), and the distance is defined as $D (P_{λ}) = {‖ P_{λ} 1 - μ ‖}_{1} + {‖ P_{λ}^{T} 1 - ν ‖}_{1}$ .

Experiment 1: The influence of different parameters

We test the performance of the proposed algorithm with different parameters under the SED and SD with $m = n = 100$ and $d = 5$ , as shown in Fig. 1. The left column shows the results for SED and the right column is the result for SD. The top row illustrates the transport costs over iterations, and the bottom row is the distance $D (P_{λ})$ .

Figure 1: — The performance of the proposed algorithm with different parameters.

In the top row of Fig. 1, the black lines give the groundtruth transport costs, which are computed by linear programming. It is obvious that for the same $η$ , by increasing $T$ (decreasing $λ$ , see the different types of the lines with the same color), the approximate accuracy is improved, and the convergence rate is increased; if $T$ (equivalently $λ$ ) is fixed, by increasing $η$ (see the different colors of the lines with the same type), we increase the convergence speed.

Experiment 2: Faster Convergence

For the experiments with ED and SED, the distributions come from the MNIST dataset (LeCun and Cortes 2010), as illustrated in the Cost Matrix. For the the experiments with SD and RD, we set $m = n = 500$ and $d = 5$ . Then we compare with the Sinkhorn algorithm (Cuturi 2013) with respect to both the convergence rate and the approximation accuracy. We manually set $T = 500$ to get a good estimate of the OT cost, and then adjust $η$ to get the best convergence speed of the proposed algorithm. For the purpose of fair comparison, we use the same cost matrix with the same translation trick and the same $λ$ for the Sinkhorn algorithm, where we treat each update of $v$ as one step. We summarize the results in Fig. 2, where the green curves represent the groundtruth computed by linear programming, the blue curves are for the Sinkhorn algorithm, and the red curves give the results of our method. It is obvious that in all of the four experiments, our method achieves faster convergence than the Sinkhorn algorithm. Note that the computed approximate transport plan of the Sinkhorn algorithm is intrinsically equivalent to our induced transport plan in Eqn. (15).

Figure 2: — Comparison with the Sinkhorn algorithm (Cuturi 2013) under different cost matrix.

In Tab. 1, we report the running time of our method, Sinkhorn (Cuturi 2013), its variants algorithms, including Greenkhorn (Altschuler, Niles-Weed, and Rigollet 2017) and Screenkhorn (Alaya et al. 2019), and APDAMD (Lin, Ho, and Jordan 2019) for the four experiments shown in Fig. 2 with $T = 700$ . The stop condition is set to be $∣ E (ψ^{t + 1}) - E (ψ^{t}) ∣ ∕ ∣ E (ψ^{t}) ∣ \leq 10^{- 3}$ . For all of the experiments, we can see that our proposed method achieves the fastest convergence.

Table 1:

Running time (s) of our method, Sinkhorn (Sink) (Cuturi 2013), Greenkhorn (Green) (Altschuler, Niles-Weed, and Rigollet 2017), Screenkhorn (Screen) (Alaya et al. 2019) and APDAMD (Lin, Ho, and Jordan 2019).

	Sink	Green	Screen	APDAMD	Ours
ED	0.0596	0.0923	0.0541	3.76	0.0404
SED	0.0431	0.0870	0.0328	3.21	0.0197
SD	0.0564	0.0862	0.0400	2.29	0.0142
RD	0.0374	0.0726	0.0313	2.88	0.0227

Open in a new tab

Experiment 3: Better Accuracy

From Fig. 2, we can observe that $- E (ψ_{λ}^{*})$ gives a comparable or better approximate of the OT cost than $〈 P_{λ}^{*}, C 〉$ with the same small $λ$ , especially for the $L_{p}$ cost function $c (x, y) = {‖ x - y ‖}^{p}$ with $p > 1$ , see the second column of Fig. 2 for an example of $p = 2$ . To achieve $ϵ$ precision, $〈 P_{λ}^{*}, C 〉$ (equivalent to the Sinkhorn result) needs to set $λ = \frac{ϵ}{4 \log n}$ (Dvurechensky, Gasnikov, and Kroshnin 2018), which is smaller than our requirement of $λ = \frac{ϵ}{2 \log n}$ according to Thm. 11. Thus, with the same $λ$ , the results of our algorithm should be more accurate than the Sinkhorn solutions. To verify this point, we give more examples in Tab. 2 with $p = 1.5$ , 2, 3 and 4. Here we use the discrete measures similar to the squared Euclidean distance as stated in the Cost Matrix part, and set $m = n = 500$ , $d = 5$ . From the table, we can see that our method obtains more accurate results than Sinkhorn.

Table 2:

Comparison among the OT cost (GT) by linear programming, the Sinkhorn results (Cuturi 2013) denoted as ’Sink’ and the results of the proposed method denoted as ’Ours’ with $T = 500$ and different $p$ .

p	GT	Sink	Ours	$∣ Sink-GT ∣$	$∣ Ours-GT ∣$
1.5	103.33	103.51	103.27	0.18	0.06
2	281.7	282.5	281.6	0.8	0.1
3	2189.8	2197.1	2187.5	7.3	2.3
4	16951.4	17038.5	16932.0	87.1	19.4

Open in a new tab

Conclusion

In this paper, we propose a novel algorithm based on Nesterov’s smoothing technique to improve the accuracy for solving the discrete OT problem. The c-transform of the Kantorovich potential is approximated by the smooth Log-Sum-Exp function, and the smoothed Kantorovich functional can be solved by FISTA efficiently. Theoretically, the computational complexity of the proposed method is given by $O (n^{2.5} \sqrt{\log n} ∕ ε)$ , which is lower than current estimation of the Sinkhorn method. Experimentally, our results demonstrate that the proposed method achieves faster convergence and better accuracy than the Sinkhorn algorithm.

Acknowledgement

An and Gu have been supported by NSF CMMI-1762287, NSF DMS-1737812, and NSF FAIN-2115095; Xu by NIH R21EB029733 and NIH R01LM012434; and Lei by NSFC 61936002, NSFC 61772105 and NSFC 61720106005.

Footnotes

For example, if double-precision floating-point format is used in 64-bit processors, the range of the number is about $2.2251 e^{- 308} \sim 1.7977 e^{+ 308}$ when using MATLAB.

For one thing, if $λ$ is relatively large, only with small step size, the algorithm may run out of the precision range of the processor and thus get ’Inf’ or ’NAN’. Thus, $η$ may be far less that 1. For the other thing, we have $H \leq \frac{1}{λ} \max_{i} (\frac{\max_{i} K_{i j} v_{j}}{K_{1} v}) \leq \frac{1}{λ}$ , we may also choose $η > 1$ when $λ$ itself is small.

References

Abid BK; and Gower RM 2018. Greedy stochastic algorithms for entropy-regularized optimal transport problems. In AISTATS. [Google Scholar]
Alaya MZ; Berar M; Gasso G; and Rakotomamonjy A 2019. Screening Sinkhorn Algorithm for Regularized Optimal Transport. In Advances in Neural Information Processing Systems 32. [Google Scholar]
Altschuler J; Niles-Weed J; and Rigollet P 2017. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30. [Google Scholar]
An D; Guo Y; Lei N; Luo Z; Yau S-T; and Gu X 2020. AE-OT: A new Generative Model based on extended semi-discrete optimal transport. In International Conference on Learning Representations. [Google Scholar]
Arjovsky M; Chintala S; and Bottou L 2017. Wasserstein generative adversarial networks. In ICML, 214–223. [Google Scholar]
Beck A; and Teboulle M 2009. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1): 183–202. [Google Scholar]
Benamou J-D; Carlier G; Cuturi M; Nenna L; and Peyré G 2015. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2): A1111–A1138. [Google Scholar]
Blanchet J; Jambulapati A; Kent C; and Sidford A 2018. Towards Optimal Running Times for Optimal Transport. In arxiv:1810.07717. [Google Scholar]
Blondel M; Seguy V; and Rolet A 2018. Smooth and Sparse Optimal Transport. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 880–889. [Google Scholar]
Chakrabarty D; and Khanna S 2021. Better and Simpler Error Analysis of the Sinkhorn-Knopp Algorithm for Matrix Scaling. Mathematical Programming, 188(1): 395–407. [Google Scholar]
Courty N; Flamary R; Tuia D; and Rakotomamonjy A 2017. Optimal Transport for Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9): 1853–1865. [DOI] [PubMed] [Google Scholar]
Cuturi M 2013. Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. In International Conference on Neural Information Processing Systems, volume 26, 2292–2300. [Google Scholar]
Dvurechensky P; Gasnikov A; and Kroshnin A 2018. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In International Conference on Machine Learning, 1367–1376. [Google Scholar]
Galichon A. 2016. Optimal Transport Methods in Economics. Princeton University Press. [Google Scholar]
Genevay A; Cuturi M; Peyré G; and Bach F 2016. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, 3440–3448. [Google Scholar]
Gerber S; and Maggioni M 2017. Multiscale Strategies for Computing Optimal Transport. Journal of Machine Learning Research. [Google Scholar]
Glimm T; and Oliker V 2003. Optical design of single reflector systems and the Monge–Kantorovich mass transfer problem. Journal of Mathematical Sciences, 117(3): 4096–4108. [Google Scholar]
Guo W; Ho N; and Jordan M 2020. Fast Algorithms for Computational Optimal Transport and Wasserstein Barycenter. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2088–2097. [Google Scholar]
Horn TA; and Johnson CR 1991. Topics in Matrix Analysis. Cambridge. [Google Scholar]
Jambulapati A; Sidford A; and Tian K 2019. A Direct $\tilde{O} (1 ∕ ϵ)$ Iteration Parallel Algorithm for Optimal Transport. In International Conference on Neural Information Processing System. [Google Scholar]
Jordan R; Kinderlehrer D; and Otto F 1998. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1): 1–17. [Google Scholar]
Kusner M; Sun Y; Kolkin N; and Weinberger K 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, 957–966. [Google Scholar]
LeCun Y; and Cortes C 2010. MNIST handwritten digit database. [Google Scholar]
Lee YT; and Sidford A 2014. Path finding methods for linear programming: Solving linear programs in $\tilde{O} (sqrt (rank)) iterations$ and faster algorithms for maximum flow. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, 424–433. [Google Scholar]
Lei N; An D; Guo Y; Su K; Liu S; Luo Z; Yau S-T; and Gu X 2020. A Geometric Understanding of Deep Learning. Engineering, 6(3): 361–374. [Google Scholar]
Lin T; Ho N; and Jordan M 2019. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In International Conference on Machine Learning, 3982–3991. [Google Scholar]
Meng C; Ke Y; Zhang J; Zhang M; Zhong W; and Ma P 2019. Large-scale optimal transport map estimation using projection pursuit. In Advances in Neural Information Processing Systems 32. [Google Scholar]
Nesterov Y 2005. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1): 127–152. [Google Scholar]
Nguyen X 2013. Convergence of latent mixing measures in finite and infinite mixture models. Ann. Statist, 41: 370–400. [Google Scholar]
Parikh N; and Boyd S 2014. Proximal Algorithms. Foundations and Trends in Optimization. [Google Scholar]
Pele O; and Werman M 2009. Fast and robust earth mover’s distances. In 2009 IEEE 12th International Conference on Computer Vision (ICCV), 460–467. IEEE. [Google Scholar]
Peyré G; and Cuturi M 2018. Computational Optimal Transport. https://arxiv.org/abs/1803.00567. [Google Scholar]
Quanrud K. 2018. Approximating optimal transport with linear programs. In arXiv:1810.05957. [Google Scholar]
Schiebinger G; Shu J; Tabaka M; Cleary B; Subramanian V; Solomon A; Gould J; Liu S; Lin S; Berube P; Lee L; Chen J; Brumbaugh J; Rigollet P; Hochedlinger K; Jaenisch R; Regev A; and Lander E 2019. Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell, 176(4): 928–943. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schmitzer B 2019. Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems. SIAM Journal on Scientific Computing, 41(3): A1443–A1481. [Google Scholar]
Tolstikhin I; Bousquet O; Gelly S; and Schoelkopf B 2018. Wasserstein Auto-Encoders. In ICLR. [Google Scholar]
Villani C 2008. Optimal transport: old and new, volume 338. Springer Science & Business Media. [Google Scholar]
Xie Y; Chen M; Jiang H; Zhao T; and Zha H 2019a. On Scalable and Efficient Computation of Large Scale Optimal Transport. In Proceedings of the 36th International Conference on Machine Learning, 6882–6892. [Google Scholar]
Xie Y; Wang X; Wang R; and Zha H 2019b. A Fast Proximal Point Method for Computing Wasserstein Distance. In Conference on Uncertainty in Artificial Intelligence, 433–453. [Google Scholar]
Yurochkin M; Claici S; Chien E; Mirzazadeh F; and Solomon JM 2019. Hierarchical Optimal Transport for Document Representation. In Advances in Neural Information Processing Systems 32. [Google Scholar]

[R1] Abid BK; and Gower RM 2018. Greedy stochastic algorithms for entropy-regularized optimal transport problems. In AISTATS. [Google Scholar]

[R2] Alaya MZ; Berar M; Gasso G; and Rakotomamonjy A 2019. Screening Sinkhorn Algorithm for Regularized Optimal Transport. In Advances in Neural Information Processing Systems 32. [Google Scholar]

[R3] Altschuler J; Niles-Weed J; and Rigollet P 2017. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30. [Google Scholar]

[R4] An D; Guo Y; Lei N; Luo Z; Yau S-T; and Gu X 2020. AE-OT: A new Generative Model based on extended semi-discrete optimal transport. In International Conference on Learning Representations. [Google Scholar]

[R5] Arjovsky M; Chintala S; and Bottou L 2017. Wasserstein generative adversarial networks. In ICML, 214–223. [Google Scholar]

[R6] Beck A; and Teboulle M 2009. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1): 183–202. [Google Scholar]

[R7] Benamou J-D; Carlier G; Cuturi M; Nenna L; and Peyré G 2015. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2): A1111–A1138. [Google Scholar]

[R8] Blanchet J; Jambulapati A; Kent C; and Sidford A 2018. Towards Optimal Running Times for Optimal Transport. In arxiv:1810.07717. [Google Scholar]

[R9] Blondel M; Seguy V; and Rolet A 2018. Smooth and Sparse Optimal Transport. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 880–889. [Google Scholar]

[R10] Chakrabarty D; and Khanna S 2021. Better and Simpler Error Analysis of the Sinkhorn-Knopp Algorithm for Matrix Scaling. Mathematical Programming, 188(1): 395–407. [Google Scholar]

[R11] Courty N; Flamary R; Tuia D; and Rakotomamonjy A 2017. Optimal Transport for Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9): 1853–1865. [DOI] [PubMed] [Google Scholar]

[R12] Cuturi M 2013. Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. In International Conference on Neural Information Processing Systems, volume 26, 2292–2300. [Google Scholar]

[R13] Dvurechensky P; Gasnikov A; and Kroshnin A 2018. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In International Conference on Machine Learning, 1367–1376. [Google Scholar]

[R14] Galichon A. 2016. Optimal Transport Methods in Economics. Princeton University Press. [Google Scholar]

[R15] Genevay A; Cuturi M; Peyré G; and Bach F 2016. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, 3440–3448. [Google Scholar]

[R16] Gerber S; and Maggioni M 2017. Multiscale Strategies for Computing Optimal Transport. Journal of Machine Learning Research. [Google Scholar]

[R17] Glimm T; and Oliker V 2003. Optical design of single reflector systems and the Monge–Kantorovich mass transfer problem. Journal of Mathematical Sciences, 117(3): 4096–4108. [Google Scholar]

[R18] Guo W; Ho N; and Jordan M 2020. Fast Algorithms for Computational Optimal Transport and Wasserstein Barycenter. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2088–2097. [Google Scholar]

[R19] Horn TA; and Johnson CR 1991. Topics in Matrix Analysis. Cambridge. [Google Scholar]

[R20] Jambulapati A; Sidford A; and Tian K 2019. A Direct $\tilde{O} (1 ∕ ϵ)$ Iteration Parallel Algorithm for Optimal Transport. In International Conference on Neural Information Processing System. [Google Scholar]

[R21] Jordan R; Kinderlehrer D; and Otto F 1998. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1): 1–17. [Google Scholar]

[R22] Kusner M; Sun Y; Kolkin N; and Weinberger K 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, 957–966. [Google Scholar]

[R23] LeCun Y; and Cortes C 2010. MNIST handwritten digit database. [Google Scholar]

[R24] Lee YT; and Sidford A 2014. Path finding methods for linear programming: Solving linear programs in $\tilde{O} (sqrt (rank)) iterations$ and faster algorithms for maximum flow. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, 424–433. [Google Scholar]

[R25] Lei N; An D; Guo Y; Su K; Liu S; Luo Z; Yau S-T; and Gu X 2020. A Geometric Understanding of Deep Learning. Engineering, 6(3): 361–374. [Google Scholar]

[R26] Lin T; Ho N; and Jordan M 2019. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In International Conference on Machine Learning, 3982–3991. [Google Scholar]

[R27] Meng C; Ke Y; Zhang J; Zhang M; Zhong W; and Ma P 2019. Large-scale optimal transport map estimation using projection pursuit. In Advances in Neural Information Processing Systems 32. [Google Scholar]

[R28] Nesterov Y 2005. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1): 127–152. [Google Scholar]

[R29] Nguyen X 2013. Convergence of latent mixing measures in finite and infinite mixture models. Ann. Statist, 41: 370–400. [Google Scholar]

[R30] Parikh N; and Boyd S 2014. Proximal Algorithms. Foundations and Trends in Optimization. [Google Scholar]

[R31] Pele O; and Werman M 2009. Fast and robust earth mover’s distances. In 2009 IEEE 12th International Conference on Computer Vision (ICCV), 460–467. IEEE. [Google Scholar]

[R32] Peyré G; and Cuturi M 2018. Computational Optimal Transport. https://arxiv.org/abs/1803.00567. [Google Scholar]

[R33] Quanrud K. 2018. Approximating optimal transport with linear programs. In arXiv:1810.05957. [Google Scholar]

[R34] Schiebinger G; Shu J; Tabaka M; Cleary B; Subramanian V; Solomon A; Gould J; Liu S; Lin S; Berube P; Lee L; Chen J; Brumbaugh J; Rigollet P; Hochedlinger K; Jaenisch R; Regev A; and Lander E 2019. Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell, 176(4): 928–943. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] Schmitzer B 2019. Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems. SIAM Journal on Scientific Computing, 41(3): A1443–A1481. [Google Scholar]

[R36] Tolstikhin I; Bousquet O; Gelly S; and Schoelkopf B 2018. Wasserstein Auto-Encoders. In ICLR. [Google Scholar]

[R37] Villani C 2008. Optimal transport: old and new, volume 338. Springer Science & Business Media. [Google Scholar]

[R38] Xie Y; Chen M; Jiang H; Zhao T; and Zha H 2019a. On Scalable and Efficient Computation of Large Scale Optimal Transport. In Proceedings of the 36th International Conference on Machine Learning, 6882–6892. [Google Scholar]

[R39] Xie Y; Wang X; Wang R; and Zha H 2019b. A Fast Proximal Point Method for Computing Wasserstein Distance. In Conference on Uncertainty in Artificial Intelligence, 433–453. [Google Scholar]

[R40] Yurochkin M; Claici S; Chien E; Mirzazadeh F; and Solomon JM 2019. Hierarchical Optimal Transport for Document Representation. In Advances in Neural Information Processing Systems 32. [Google Scholar]

PERMALINK

Efficient Discrete Optimal Transport Algorithm by Accelerated Gradient Descent

Dongsheng An

Na Lei

Xiaoyin Xu

Xianfeng Gu

Abstract

Introduction

Discrete Optimal Transport

Our Method

Notation

Related Work

Optimal Transport Theory

Optimal Transport Problem

Kantorovich’s Approach

Nesterov’s Smoothing of Kantorovich functional

Relationship with Softmax

Relationship with entropy regularized OT problem

Experiments

Parameters

Cost Matrix

Evaluation Metrics

Experiment 1: The influence of different parameters

Figure 1:

Experiment 2: Faster Convergence

Figure 2:

Table 1:

Experiment 3: Better Accuracy

Table 2:

Conclusion

Acknowledgement

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Efficient Discrete Optimal Transport Algorithm by Accelerated Gradient Descent

Dongsheng An

Na Lei

Xiaoyin Xu

Xianfeng Gu

Abstract

Introduction

Discrete Optimal Transport

Our Method

Notation

Related Work

Optimal Transport Theory

Optimal Transport Problem

Kantorovich’s Approach

Nesterov’s Smoothing of Kantorovich functional

Relationship with Softmax

Relationship with entropy regularized OT problem

Experiments

Parameters

Cost Matrix

Evaluation Metrics

Experiment 1: The influence of different parameters

Figure 1:

Experiment 2: Faster Convergence

Figure 2:

Table 1:

Experiment 3: Better Accuracy

Table 2:

Conclusion

Acknowledgement

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases