Published in final edited form as: Proc AAAI Conf Artif Intell. 2020 Apr 3;34(4):5166–5173. doi: 10.1609/aaai.v34i04.5960

Regularized Wasserstein Means for Aligning Distributional Data

Liang Mi 1, Wen Zhang 1, Yalin Wang 1

Abstract

We propose to align distributional data from the perspective of Wasserstein means. We raise the problem of regularizing Wasserstein means and propose several terms tailored to tackle different problems. Our formulation is based on the variational transportation to distribute a sparse discrete measure into the target domain. The resulting sparse representation well captures the desired property of the domain while reducing the mapping cost. We demonstrate the scalability and robustness of our method with examples in domain adaptation, point set registration, and skeleton layout.

Introduction

Aligning distributional data is fundamental to many problems in machine learning. From the early work on histogram manipulation, e.g. (Stark 2000), to the recent work on generative modeling, e.g. (Beecks and others 2011), researchers have proposed various alignment techniques that benefit numerous fields including domain adaptation, e.g. (Sun and others 2016), and shape registration, e.g. (Ma and others 2016). A universal approach to aligning distributional data is to optimize an objective function that measures the loss of the map between them. Regarding one distribution as the fixed target and the other as the source, the alignment process in general follows an iterative manner where we alternately update their correspondence and transform the source. When the source has far fewer samples or lies in a lower dimension, the process essentially finds a sparse representation (Bengio and others 2013).

The optimal transportation (OT) loss, or the Wasserstein distance, has proved itself superior in many aspects to several other distances (Gibbs and others 2002; Arjovsky and others 2017), benefiting various learning algorithms. By regarding the Wasserstein distance as a metric, researchers have been able to compute a sparse mean (Ho and others 2017) of a distribution, which is a special case of the Wasserstein barycenter problem (Agueh and others 2011) when there is only one target distribution. While OT algorithms find the correspondence between the distributions, updating the mean can simply follow the rule that each source sample is mapped to the weighted mean of its corresponding target sample(s) (Ye and others 2017).

In this paper, we raise the problem of regularizing the Wasserstein means. In addition to finding a mean that yields the minimum transportation cost, in many cases we also want the mean to carry certain properties so that it satisfies other criteria. A common technique is adding regularization terms to the objective function. While most of the existing work, e.g. (Cuturi 2013; Courty and others 2017b), focuses on regularizing the optimal transportation itself, we address the mean update rule and show the benefit of regularizing it. We introduce a new framework to compute OT-based sparse representations with regularization. We base our method on variational transportation (Mi and others 2018), which produces a map between the source and the target distributions in a many-to-one fashion. Different from directly mapping the source into the weighted average of its correspondence (Ye and others 2017; Courty and others 2017b; Mi and others 2018), we propose to regularize the mapping to cope with specific problems: domain adaptation, point set registration, and skeleton layout. The resulting mean, or centroid, can well represent the key property of the distribution while maintaining a small reconstruction error. Code is available at https://github.com/icemiliang/pyvot.

Related Work

Optimal Transportation

The optimal transportation (OT) problem was introduced by Monge (Monge 1781) in the 18th century, who sought a mass-preserving transportation map between distributional data with the minimum cost. It resurfaced in the 1940s when Kantorovich (Kantorovich 1942) introduced a relaxed version where mass can be split and provided the classic linear programming solution. A breakthrough for the mass-preserving, or non-mass-splitting, OT happened in the early 1990s when Brenier (Brenier 1991) proved its existence under the quadratic Euclidean cost. In more recent years, fast algorithms for computing, or approximating, OT have been proposed in both lines of research – non-mass-preserving, e.g. (Rabin and others 2011; Cuturi 2013; Solomon and others 2015), and mass-preserving, e.g. (Mérigot 2011; Lévy 2015; Kolouri and others 2016; Chen and others 2019).

We follow Monge’s mass-preserving formulation. Specifically, we adopt (Mi and others 2018) with improvements to compute the OT because it gives us a clear path for each sample, not a spread-out map. Thus, we can directly regularize the support instead of the mapping.

Wasserstein Barycenters and Means

The Wasserstein distance is the minimum cost induced by OT. In many cases, the map itself is more desired than the cost, but the cost satisfies all metric axioms (Villani 2003) and thus often serves as the loss for matching distributions, e.g. (Ling and Okada 2007; Arjovsky and others 2017). Moreover, given multiple distributions, one can find their weighted average with respect to the Wasserstein metric. This problem was studied in (McCann 1997; Ambrosio and others 2008) for averaging two distributions and generalized to multiple distributions in (Agueh and others 2011), which coined the term Wasserstein barycenter.

A special case of the barycenter problem is when there is only one distribution and we want to find its sparse discrete barycenter. Because computationally it is equivalent to the k-means problem, (Ho and others 2017) defines it as the Wasserstein means problem. Before that, Cuturi and Doucet had discussed it in (Cuturi and others 2014) along with the connection of their algorithm to Lloyd’s algorithm in that case. (Mi and others 2018) proposes an OT-based clustering method which is very close to the Wasserstein means problem. (Kolouri and others 2018) also made a contribution by discussing the sliced Wasserstein Means problem.

Our work focuses on regularizing the Wasserstein means. We obtain the mean by mapping the sparse points into the target domain according to the OT correspondence. We insert regularization into the mapping process so that the sparse points not only have a small OT loss but also carry certain properties induced by the regularization terms.

Our work should not be confused with other work on regularizing OT. For example, (Cuturi 2013) introduces entropy-regularized OT, where the entropy term controls the sparsity of the map; it was later used in (Cuturi and others 2014) to compute Wasserstein barycenters. (Courty and others 2017b) also leveraged class labels to regularize OT for domain adaptation. (Ferradans and others 2014) proposed Sobolev norm-based regularized OT and further a regularized barycenter, yet the regularization is still added to the OT, not to the barycenter. These works regularize only the OT and then directly update the support to the average of its correspondence. In this paper, we regularize the update.

Preliminaries

We begin with some basics on optimal transportation (OT). Suppose $M$ is a compact metric space, $P(M)$ is the space of all Borel probability measures on $M$, and $\mu, \nu \in P(M)$ are two such measures. A measure in the product space, $\pi(\cdot,\cdot) \in P(M \times M)$, serves as a mapping between any two measures on $M$, i.e. $\pi : M \to M$. We define the cost function of the mapping as the geodesic distance $c(\cdot,\cdot) : M \times M \to \mathbb{R}^{+}$.

Optimal Transportation

For a mapping $\pi(\mu,\nu)$ to be legitimate, the push-forward measure of one measure has to be the other one, i.e. $\pi_{\#}\mu = \nu$. Thus, for any measurable subsets $B, B' \subseteq M$ we have $\pi(B \times M) = \mu(B)$ and $\pi(M \times B') = \nu(B')$. We denote the space of all legitimate product measures by $\Pi(\mu,\nu) = \{\pi \in P(M \times M) \mid \pi(\cdot, M) = \mu,\ \pi(M, \cdot) = \nu\}$.

Optimal transportation seeks a solution $\pi \in \Pi(\mu,\nu)$ that produces the minimum total cost:

$$W_p(\mu,\nu) \overset{\mathrm{def}}{=} \left( \inf_{\pi \in \Pi(\mu,\nu)} \int_{M \times M} \big(c(x,y)\big)^p \, d\pi(x,y) \right)^{\frac{1}{p}}, \quad (1)$$

where p indicates the finite moment of the cost function. The minimum cost is the p-Wasserstein distance. In this paper, we only consider the 2-Wasserstein distance, W2.

Monge’s formulation restricts OT to preserve measures, that is, mass cannot be split during the mapping. Letting $T$ denote such a mapping, $T : x \to y$, we have $d\pi(x,y) = d\mu(x)\,\delta(y - T(x))$. Therefore, we formally define $T$ as

$$T_{\mathrm{opt}} = \arg\min_{T} \int_{M} c(x, T(x))^p \, d\mu(x). \quad (2)$$

In this paper, we follow (2). The details of the optimal transportation problem and the properties of the Wasserstein distance can be found in (Villani 2003; Gibbs and others 2002). With a slight abuse of notation, we use $\pi(\mu,\nu)$ to denote Monge’s OT map between $\mu$ and $\nu$, and since the map is applied to their supports $x$ and $y$, we also write $\pi : x \to y$ and $y = \pi(x)$ to denote the map.

Variational Optimal Transportation

Suppose $\mu$ is continuous and $\nu$ is a set of Dirac measures in $M = \mathbb{R}^n$, supported on $\Omega_\mu = \{x \in M \mid \mu(x) > 0\}$ and $\Omega_\nu = \{y_j \in M \mid \nu_j > 0\}$, $j = 1,\dots,k$, respectively, and their total measures are equal: $\mathrm{vol}(\Omega) = \int_{\Omega} d\mu(x) = \sum_{j=1}^{k} \nu_j$. (Gu and others 2013) proposed a variational solution to this semi-discrete OT on $\mathbb{R}^n$. It starts from a vector $h = (h_1,\dots,h_k)^T$ and a piece-wise linear function $\theta_h(x) = \max_j \{\langle x, y_j\rangle + h_j\}$, $j = 1,\dots,k$. Alexandrov proved in (Alexandrov 2005) that there exists a unique $h$ that satisfies the following

$$\mathrm{vol}\big(\{x \in \Omega \mid \nabla\theta_h(x) = y_j\}\big) = \nu_j. \quad (3)$$

Furthermore, Brenier proved in (Brenier 1991) that $\nabla\theta_h : x \to y$ is Monge’s OT map if the transportation cost is the quadratic Euclidean distance $\|x - \nabla\theta_h(x)\|_2^2$.

Suppose $S_j(h) = \{x \in M \mid \nabla\theta_h(x) = y_j\}$ is the projection of $\theta_h$ onto $\Omega$. Variational OT (VOT) solves

$$E(h) \overset{\mathrm{def}}{=} \int_{\Omega} \theta_h \, d\mu - \sum_{j=1}^{k}\nu_j h_j = \int_{0}^{h}\Big(\sum_{j=1}^{k}\int_{\Omega \cap S_j(h)} d\mu\Big)\, dh - \sum_{j=1}^{k}\nu_j h_j, \quad (4)$$

and thus converts the OT problem into a search in the vector space $H = \{h \in \mathbb{R}^k \mid \int_{\Omega \cap S_j(h)} d\mu > 0 \text{ for all } j\}$. As proved in (Gu and others 2013), the energy $E$ in (4) is convex on $H$ when $\sum_{j=1}^{k} h_j = 0$. The gradient of (4) with respect to $h_j$ is $\int_{\Omega \cap S_j(h)} d\mu - \nu_j$, which vanishes exactly when (3) holds. Thus, minimizing (4) until its gradient approaches 0 gives us the desired $h$ and the map $\nabla\theta_h$.
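For concreteness, the following is a minimal NumPy sketch of this gradient descent under the stated setup, assuming $\mu$ is given as N empirical samples x (an N×d array) with weights mu and $\nu$ as a weight vector nu over K centroids y; the function names are illustrative and are not the pyvot API.

```python
import numpy as np

def vot_assign(x, y, h):
    """Assign each empirical sample to the cell S_j(h) whose plane <x, y_j> + h_j
    attains the maximum of theta_h."""
    return (x @ y.T + h).argmax(axis=1)

def solve_vot(x, mu, y, nu, lr=0.2, n_iter=1000, tol=1e-6):
    """Gradient descent on E(h) in (4); the j-th partial derivative is
    (mass currently captured by S_j) minus nu_j, so the fixed point satisfies (3)."""
    h = np.zeros(len(nu))
    for _ in range(n_iter):
        idx = vot_assign(x, y, h)
        w = np.bincount(idx, weights=mu, minlength=len(nu))  # current cell masses
        grad = w - nu
        if np.abs(grad).max() < tol:
            break
        h -= lr * grad
        h -= h.mean()  # keep sum(h) = 0, where the energy is convex
    return h, vot_assign(x, y, h)
```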

Wasserstein Barycenters

Given a collection of measures and weights $\{\mu_i, \lambda_i\}_{i=1}^{N}$, there exists a measure $\nu$ such that the weighted average of the Wasserstein distances between $\nu$ and all the $\mu_i$'s reaches the minimum. As presented in (Agueh and others 2011), Agueh and Carlier defined such a problem as finding a barycenter in the measure space with respect to the Wasserstein distance:

$$\nu = \arg\min_{\nu \in P_2(M)} \sum_{i=1}^{N} \lambda_i W_2^2(\nu, \mu_i).$$

Wasserstein barycenters of discrete measures exist for mass-splitting OT but may not for non-mass-splitting, or measure-preserving, OT. Yet, as proved in (Anderes and others 2016), when the weights are uniform and all measures have a finite number of supports, there still exists a barycenter $\nu$ that preserves the measure, whose number of supports $|\Omega_\nu|$ has a tight upper bound $|\Omega_\nu| \le \sum_{i=1}^{N}|\Omega_{\mu_i}| - N + 1$, and the OT from every $\mu_i$ to $\nu$ preserves the measure.

Wasserstein Means via Variational OT

A special case of the Wasserstein barycenter problem is when N = 1. In that case, we are computing a barycenter of a single probability measure. We call it the Wasserstein mean (WM). Beyond being a special case, the barycenters and the means have the following connection.

Proposition 1.

Given a compact metric space $M$, a transportation cost $c(\cdot,\cdot): M \times M \to \mathbb{R}^{+}$, and a collection of Borel probability measures $\mu_i \in P(M)$ with weights $\lambda_i$, $i = 1,\dots,N$, the Wasserstein mean $\nu_m$ of their average measure induces a lower bound of the average Wasserstein distance from the barycenter $\nu_b$ to them, provided that $|\Omega_{\nu_b}|, |\Omega_{\nu_m}| \le k$ for some finite $k$.

Proof.

Since $W_2^2(\nu_b, \cdot)$ is convex by its metric property, according to Jensen’s inequality we have

$$W_2^2\Big(\nu_b, \sum_{i=1}^{N}\lambda_i\mu_i\Big) \le \sum_{i=1}^{N}\lambda_i W_2^2(\nu_b, \mu_i).$$
Algorithm 1: Wasserstein Means

[Algorithm pseudocode figure omitted.]

Then, according to the definition of the Wasserstein mean,

$$W_2^2\Big(\nu_m, \sum_{i=1}^{N}\lambda_i\mu_i\Big) \le W_2^2\Big(\nu_b, \sum_{i=1}^{N}\lambda_i\mu_i\Big), \quad \forall \nu_b.$$

The result follows. The equality holds when N = 1. □

We should point out that if $\{\mu_i\}$ are discrete measures, then for the barycenter to exist we need to add the condition from (Anderes and others 2016) that $|\Omega_{\nu_b}| \le \sum_{i=1}^{N}|\Omega_{\mu_i}| - N + 1$, which also bounds $|\Omega_{\nu_m}|$ through $|\Omega_{\nu_m}| \le \sum_{i=1}^{N}|\Omega_{\mu_i}|$.

Now, approaching Wasserstein means is essentially through optimizing the following objective function:

$$\min f(\pi, y, \nu) \overset{\mathrm{def}}{=} \min_{\pi, y_j, \nu_j} \sum_{j=1}^{k} \sum_{y_j = \pi(x)} \mu(x)\,\|y_j - x\|_2^2, \quad \text{s.t. } \nu_j = \sum_{y_j = \pi(x)} \mu(x). \quad (5)$$

Compared to OT, solving WM w.r.t. (5) introduces two additional parameters: the measure ν and its support y. When y and ν are fixed, (5) becomes a classic optimal transportation problem, and we adopt variational optimal transportation (VOT) (Mi and others 2018) to solve it. Thus, (5) minimizes a lower bound of the OT cost.

Then, it boils down to solving for y and ν. Clearly, (5) is differentiable at all $y \in \mathbb{R}^{n \times k}$ and is convex. Its optimum w.r.t. y is achieved at

$$\tilde{y}_j = \frac{\int_{\Omega_\mu \cap S_j} x\, d\mu(x)}{\int_{\Omega_\mu \cap S_j} d\mu(x)}. \quad (6)$$

This essentially updates each mean to the centroid of its corresponding measures, as adopted in, for example, (Cuturi and others 2014; Ye and others 2017; Courty and others 2017b). The slight difference in our method is that VOT is non-mass-splitting, and thus the centroid in our case has a definite position without the need for weighting.

As discussed in (Cuturi and others 2014), (5) is not differentiable w.r.t. ν. However, we can still get its optimum through the following observation.

Observation 1.

The critical point of the function $\nu \mapsto f(\pi; \nu)$ is where ν induces π to be the map given by the unweighted Voronoi diagram formed by ν’s support y. In that case, every empirical sample μ(x) at x is mapped to its nearest $y_j$, which coincides with Lloyd’s algorithm.

Proof.

Suppose ν induces the OT map π from every x to its nearest $y_j$. Then, a map $\pi' : x \to y_{j'}$ satisfying any other $\nu'_j = \int_{\Omega \cap S'_j} d\mu(x)$ will yield an equal or larger cost: $\int_{\Omega} \|y_j - x\|_2^2\, d\mu(x) \le \int_{\Omega} \|y_{j'} - x\|_2^2\, d\mu(x)$.

Thus, we can write the update rule for ν as

$$\tilde{\nu}(y_j) = \int_{\Omega \cap S_j} d\mu(x), \quad (7)$$

s.t. $S_j = \{x \in M \mid \|x - y_j\|_2 \le \|x - y_i\|_2,\ \forall i \ne j\}$.

Updating the three parameters π, y, and ν can follow the block coordinate descent method. Since at each iteration we have closed-form solutions in the y and ν directions, there is no need for a line search there. We wrap up our algorithm for computing the Wasserstein means in Alg. 1.
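As a rough illustration of this loop (not pyvot's released implementation), the sketch below alternates the VOT step from the solve_vot sketch above with the closed-form updates (6) and (7); all names are illustrative.

```python
import numpy as np

def wasserstein_means(x, mu, y0, nu=None, n_outer=10, free_weights=False):
    """Block coordinate descent for the (unregularized) Wasserstein mean:
    alternate the VOT step with the support update (6) and, optionally,
    the weight update (7). Reuses solve_vot from the VOT sketch above."""
    y = y0.copy()
    K = len(y)
    if nu is None:
        nu = np.full(K, mu.sum() / K)        # uniform centroid weights by default
    for _ in range(n_outer):
        _, idx = solve_vot(x, mu, y, nu)     # OT step with y, nu fixed
        for j in range(K):                   # Eq. (6): move y_j to the centroid of its cell
            mask = idx == j
            if mask.any():
                y[j] = np.average(x[mask], axis=0, weights=mu[mask])
        if free_weights:                     # Eq. (7): nearest-centroid mass, Lloyd-style
            nearest = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1).argmin(axis=1)
            nu = np.bincount(nearest, weights=mu, minlength=K)
    return y, nu
```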

As discussed in (Cuturi and others 2014), when N = 1 and p = 2, computing the Wasserstein barycenter (in this case the Wasserstein mean) is equivalent to Lloyd’s k-means algorithm. A difference arises when we have a constraint on the weight $\nu_j(y)$. Ng (Ng 2000) considered a uniform weight for all $S_j$. Our algorithm can adapt to any constraint $\nu_j \ge 0$. In this case, our algorithm is equivalent to (Cuturi and others 2014), where the update of the support amounts to re-centering it by our (6).

Complexity

In practice, we use the total mass of the discrete measures inside each $S_j$. We vectorize the computation with PyTorch because the parameters of VOT, $h$ in (4), can be optimized individually and thus in parallel. Given N empirical samples and K centroids, our implementation of OT runs in O(KN) on a CPU and theoretically O(N) on a GPU. Figure 1 shows the timing over K ∈ [20, 1000] with N = 10,000. The boxes along the plots come from 10 runs of 300 iterations for each K. The dimension of the data is 3, and the y axis is in seconds per iteration. The plot shows that increasing K adds little burden to RWM. The complexity added by regularization is as follows: the class-label (triplet) term is O(K); the geometric-transformation term is O(K′³), mainly from solving the SVD, but in practice we choose a small or constant number K′ ≪ K for the SVD; the length-and-curvature term is O(K) for computing the curvature. Thus, the total computational complexity of RWM is O(N) + O(K³), depending on the regularization term. We also compute the pairwise distances between empirical samples and centroids beforehand, as in (Cuturi 2013), making the memory consumption on the level of O(KN).
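As a hedged illustration of this vectorization (the data sizes and variable names below are made up and are not pyvot's API):

```python
import torch

# Illustrative sizes: N = 10,000 samples, K = 1,000 centroids, dimension 3.
x = torch.rand(10000, 3)
y = torch.rand(1000, 3)
h = torch.zeros(1000)                      # one VOT variable per centroid

# Pairwise squared distances computed once per support update: O(KN) memory.
d2 = torch.cdist(x, y).pow(2)              # (N, K)

# All N cell assignments for any h in one vectorized (GPU-friendly) operation.
# The power-cell form argmin_j d2/2 - h_j matches the hyperplane form
# argmax_j <x, y_j> + h_j after absorbing ||y_j||^2/2 into h_j.
idx = (0.5 * d2 - h).argmin(dim=1)
mass = torch.bincount(idx, minlength=h.numel()).float() / x.shape[0]  # current cell masses
```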

Figure 1: Comparison of time over the number of centroids.

Regularized Wasserstein Means

In many machine learning problems, the solution that comes purely from minimizing the mapping cost may not best represent the connection between origins and their images, not to mention overfitting. Regularization is a common technique to introduce desired properties into the solution. In the previous section, we discussed the Wasserstein means problem and its optimizers: the OT π(ν, μ), the support y, and the measure ν(y). In this section, we detail our strategies to regularize y, along with several regularization terms that we propose to penalize the Wasserstein means cost. For simplicity, we fix the given ν(y) in the following arguments and only consider π and y in the regularized Wasserstein means (RWM) problem.

We start with a general loss function:

$$L(\pi, y) = L_{ot}(\pi, y) + \lambda L_{reg}(y), \qquad L_{ot}(\pi, y) = \int_{\Omega} \|y - x\|_2^2 \, d\mu(x), \ \text{where } y = \pi(x). \quad (8)$$

We call the first term the OT loss or data loss. Our goal here is to explore $L_{reg}(y)$ and its use. Optimizing (8) can also follow the block coordinate descent method. First, we fix the mean and compute the OT. Then, unlike in Alg. 1, where we directly update the mean to the average of its correspondences, we regularize the mean to satisfy certain properties through local minimization of (8).

Minimizing the OT loss $L_{ot}(\pi, y)$ w.r.t. y can be simplified to minimizing the quadratic loss for each support, i.e. $L_{\tilde{y}} = \sum_j \|y_j - \tilde{y}_j\|_2^2$, since they are equivalent:

$$\int_{S_j} \|y_j - x\|_2^2 \, d\mu(x) = \Big(\|y_j\|_2^2 - 2\Big\langle y_j, \int_{S_j} x \, d\mu(x)\Big\rangle + C_1\Big) = \Big\| y_j - \int_{S_j} x \, d\mu(x) \Big\|_2^2 + C_2 = \|y_j - \tilde{y}_j\|_2^2 + C_2. \quad (9)$$

$C_1$, $C_2$ are constants. $\tilde{y}_j$ is from (6), and $S_j$ is the set in which x is mapped to $y_j$; it is defined by VOT as $S_j = \{x \in M \mid \langle y_j, x\rangle + h_j \ge \langle y_i, x\rangle + h_i,\ \forall i \ne j\}$ (see the Variational Optimal Transportation section). Thus, we re-write (8) as

$$L(\pi, y) = \sum_j \|y_j - \tilde{y}_j\|_2^2 + \lambda L_{reg}(y). \quad (10)$$

Note that $L_{reg}$ undermines the metric properties of the Wasserstein distance; however, the distance itself is not our concern, only the data term of the loss that we design for a broad range of applications. We provide the general algorithm to compute regularized Wasserstein means in Alg. 2.
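A minimal PyTorch sketch of the regularized support update inside Alg. 2 follows, assuming the VOT correspondence has already produced $\tilde{y}$ via (6); reg_fn stands for any of the convex regularizers discussed below, and the names are illustrative rather than pyvot's API.

```python
import torch

def rwm_support_update(y, y_tilde, reg_fn, lam=1.0, lr=0.05, n_steps=200):
    """One regularized support update of Alg. 2: descend on
    sum_j ||y_j - y_tilde_j||^2 + lam * reg_fn(y)    (Eq. 10),
    where y_tilde comes from the VOT correspondence via (6)."""
    y = y.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = (y - y_tilde).pow(2).sum() + lam * reg_fn(y)
        loss.backward()
        opt.step()
    return y.detach()
```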

Citing the convergence proof from (Grippo and others 2000): as long as we add a convex regularization term, because π : x → y is compact and convex, our 2-block coordinate descent-based algorithm will indeed converge.

Algorithm 2: Regularized Wasserstein Means

[Algorithm pseudocode figure omitted.]

In the rest of this section, we discuss in detail several regularization terms based on class labels, geometric transformations, and length and curvature, all of which are convex.

Triplets Empowered by Class Labels

We begin with a fair assumption that samples of the same class reside close to each other while samples that belong to different classes are relatively far away from each other. This behavior can be expressed by signed distances between samples. Given that, we propose to regularize the mean update process by adding a triplet loss, promoting intra-class connections and discouraging inter-class connections.

The triplet loss was proposed in (Schroff and others 2015), inspired by (Weinberger and others 2009). It targets the metric learning problem, which is finding an embedding space where samples of the same desired property reside close to each other and vice versa. In triplets, samples are categorized into three types: anchor, positive, and negative, denoted as $y^a$, $y^p$, and $y^n$. The motivation is that the anchor is closer by a margin of α to a positive than it is to a negative:

$$L_{reg}(y) = \sum_{j}^{K} \left[\, \|y_j^a - y_j^p\|_2^2 - \|y_j^a - y_j^n\|_2^2 + \alpha \,\right]_{+}.$$

The overall RWM loss w.r.t. y (10) becomes

$$L(y) = \sum_j \|y_j - \tilde{y}_j\|_2^2 + \lambda L_{triplet}(y). \quad (11)$$
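A sketch of one possible triplet term over labeled centroids is given below; the exact triplet selection used in the paper may differ, and the function can be plugged into the regularized support update sketched earlier as reg_fn.

```python
import torch

def triplet_reg(y, labels, margin=1.0):
    """Triplet term over the centroids y (K, d) with class labels (K,) as a tensor:
    for each anchor, take the farthest same-class centroid as the positive and the
    closest different-class centroid as the negative. Only a sketch."""
    d2 = torch.cdist(y, y).pow(2)                                   # pairwise squared distances
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(y), dtype=torch.bool, device=y.device)
    pos = d2.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    neg = d2.masked_fill(same, float('inf')).min(dim=1).values
    return torch.clamp(pos - neg + margin, min=0.0).sum()
```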

Fig. 2 shows an example of aligning Gaussian mixtures by (11). Suppose a mixture has three components with different parameters, each belonging to a different class shown in three colors. We rotate the mixture by a certain degree to emulate an unknown shift and apply our method to recover the shift.

Figure 2: Regularizing the WM by the intra-class triplets can adapt it to domains that suffer unknown rotations.

We sample the source domain 50 times and the target domain 5,000 times, at 22.5° and 45°. Fig. 2, 1st column, shows the setups. The 2nd column shows the result of computing the WM without regularization as in (Mi and others 2018). The 3rd column shows our result. Our method drives the source samples well into the correct target domain. The lighter colors on the target samples in the 2nd column indicate the class predicted by using the OT correspondence. Since our OT preserves the measure during the mapping, we can deterministically label each unknown sample by querying its own centroid’s class. Note that this is equivalent to the 1NN classification algorithm based on the power Euclidean distance (Mi and others 2018). Only when all centroid weights are equal does the power distance coincide with the Euclidean distance. In the last column, we show the result from (Courty and others 2017a), which learns an RBF SVM classifier on the target samples.
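A small sketch of this labeling rule follows, assuming the fitted centroids y, potentials h, and centroid labels are available; the exact form of the power distance follows (Mi and others 2018) and is an assumption here.

```python
import torch

def label_targets(x, y, h, centroid_labels):
    """Label each target sample by the class of the centroid that owns it under the
    VOT map, i.e. 1-NN under a power distance of the form ||x - y_j||^2/2 - h_j."""
    power = 0.5 * torch.cdist(x, y).pow(2) - h        # (N, K) power distances
    return centroid_labels[power.argmin(dim=1)]
```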

Geometric Transformations

While OT recovers a transformation between two domains that induces the lowest cost, it does not consider the structure within the domains. Pre-assuming a type of transformation and then estimating its parameters is one of the popular approaches to solving domain alignment-related problems, for example in (Gopalan and others 2011; Courty and others 2017b). In this way, the structure of the domain can be preserved to some extent. We follow this trend and assume that two domains can be matched by a geometric transformation with modifications, that is, any transformation between the domains is a combination of a parametric geometric transformation and an arbitrary transformation. This leads to the following strategy: on the one hand, we regularize the mean to be roughly a geometric transformation in order to preserve the structure of the source domain during the mapping; on the other hand, we allow OT to adjust the mapping so that it can recover irregular transformations.

We follow Alg. 2. First, we compute the OT to obtain the target mean positions $\tilde{y} = \pi(x)$, and then use the paired means $\{y, \tilde{y}\}$ to determine the parameters of a geometric transformation T subject to $\tilde{y} = Ty$ through a least squares estimate. Suppose $y_j^T = Ty_j$ is the estimate purely based on the affine transformation; then, we have the RWM loss

$$L(\pi, y) = \sum_j \|y_j - \tilde{y}_j\|_2^2 + \lambda \sum_j \|y_j - y_j^T\|_2^2. \quad (12)$$

Candidates for the geometric transformation include, but are not limited to, perspective, affine, and rigid transformations.
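As one example of estimating such a prior, the sketch below fits a rigid transformation to the paired means by the SVD-based orthogonal Procrustes solution (consistent with the SVD cost noted in the Complexity section); it is an illustration, not the released implementation.

```python
import numpy as np

def fit_rigid(y, y_tilde):
    """Least-squares rigid transform (R, t) with y_tilde ≈ y @ R.T + t, via the
    SVD-based orthogonal Procrustes solution; one possible choice of the prior T."""
    cy, ct = y.mean(axis=0), y_tilde.mean(axis=0)
    H = (y - cy).T @ (y_tilde - ct)                   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against reflections
    R = Vt.T @ np.diag([1.0] * (y.shape[1] - 1) + [d]) @ U.T
    t = ct - R @ cy
    return R, t

# y_T = y @ R.T + t is then the purely geometric estimate penalized in Eq. (12).
```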

We demonstrate (12) with two moons in Fig. 3. The known domain contains 200 samples in blue and red. The unknown domain is the known domain after a rotation, sampled 10,000 times, shown in grey. We assume the prior is a rigid transformation. The top row shows the result on the 45° case after several iterations; in the end, RWM almost recovers the transformation, with a small error. The top right shows accuracy over iterations under different degrees. The 2nd row shows the results under different degrees of rotation. We weigh in OTDA-GL’s result (Courty and others 2017b) in the 3rd row, showing RWM’s superiority over OTDA under large transformations and its inferiority under small transformations. We also notice that RWM maps the samples into the domain in cases where OTDA fails to.

Figure 3: RWM adapting shifted two moons: 1st row, performance over iterations under 45°; 2nd and 3rd rows, performances of RWM and OTDA under different degrees.

Topology Represented by Length and Curvature

The many-to-one nature of the mapping in the WM problem makes it suitable for skeleton layout. Consider a 3D thin, elongated point cloud. Our goal is to find a 3D curve consisting of sparse points to represent the shape of the cloud. The problem with directly using WM for skeleton layout is that the support is unstructured. Therefore, we propose to pre-define the topology of the curve and add its length and curvature to regularize its geometry, both intrinsically (length) and extrinsically (curvature).

We give an order to the supports so that they form a piecewise linear curve. For every three adjacent supports, $y_{j-1}, y_j, y_{j+1}$, we fit a quadratic spline curve $\gamma(t)$ of 100 points. Its length is approximated by summing the segment lengths, $\int_0^{\mathrm{length}} ds = \int_0^1 \|\gamma'(t)\|\, dt$, and its curvature at the middle point $y_j$ can be approximated by the total curvature $\int_0^{\mathrm{length}} \kappa^2(t)\, ds$, $\kappa(t) = \frac{\|\gamma'(t) \times \gamma''(t)\|}{\|\gamma'(t)\|^3}$, as in (Ulen and others 2015). Thus, the regularization on the length and curvature can be expressed as follows:

$$\lambda L_{reg} = \lambda_1 \sum_{1 \le i < k} g(\gamma(y_i)) + \lambda_2 \sum_{1 < i < k} l(\gamma(y_i)), \quad (13)$$

where $g(\cdot)$ and $l(\cdot)$ are functions computed from the length and curvature based on y, both of which are convex, making (13) convex. We could go further and include torsion in the term, but since we do not pursue a perfectly smooth curve but rather a reasonable embedding of the supports in the interior of the point cloud, we omit torsion.

In case the shape has branches, we can easily extend (13) by considering the skeleton as a whole when computing the OT and regularizing each branch separately. Suppose now that the skeleton $\Gamma = \{\gamma_j\}$ is a set of 1-D curves. Finally, we propose the following loss for skeleton layout:

$$L(\pi, y) = \sum_j \|y_j - \tilde{y}_j\|_2^2 + \sum_{\gamma \in \Gamma}\Big(\lambda_1 \sum_{1 \le i < k} g(\gamma(y_i)) + \lambda_2 \sum_{1 < i < k} l(\gamma(y_i))\Big). \quad (14)$$
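A simplified, finite-difference stand-in for the per-branch length and curvature penalties is sketched below; the paper fits a quadratic spline through each triple of nodes, so this discrete version is only an approximation of $g(\cdot)$ and $l(\cdot)$ for illustration.

```python
import torch

def length_curvature_reg(y_branch, lam1=1.0, lam2=1.0):
    """Discrete stand-in for the per-branch term in (13)/(14) on one ordered branch
    y_branch (k, 3): penalize segment lengths (intrinsic) and bending at interior
    nodes (extrinsic) via finite differences."""
    seg = y_branch[1:] - y_branch[:-1]                           # k-1 segment vectors
    length_term = seg.pow(2).sum()                               # sum of squared segment lengths
    bend = y_branch[2:] - 2.0 * y_branch[1:-1] + y_branch[:-2]   # second differences at interior nodes
    curvature_term = bend.pow(2).sum()
    return lam1 * length_term + lam2 * curvature_term
```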

Applications

We demonstrate the use of RWM in domain adaptation (class label), point set registration (geometric transformation), and skeleton layout (topology).

Domain Adaptation

We evaluate our method on the Office-31 dataset (Saenko and others 2010). Office-31 includes two subsets, Amazon and Webcam, and we adapt from Webcam to Amazon (W → A). The Amazon set contains 2,848 images from 31 categories, each category with between 36 and 100 samples. The Webcam set archives 826 images from the same 31 categories, each having between 11 and 43 samples.

We use the Decaf-fc6 and Decaf-fc7 features provided with the dataset. Each sample is thus encoded into a vector of 4,096 dimensions. The setup is similar to OTDA (Courty and others 2017b). We randomly select 20 samples per class from Amazon and 10 samples per class from Webcam, because the ‘ruler’ category of Webcam only has 11 samples and we want each class to have an equal number of samples. Then, we normalize the sample weights so that the total weights from Amazon and from Webcam are both one; each sample has equal weight, 1/620 for an Amazon sample and 1/310 for a Webcam sample.

We compare RWM with OTDA and also include 1NN and the original WM as baselines. The experiments are repeated 10 times, and Tab. 1 summarizes the averaged results. RWM outperforms the other methods by a large margin. We also show the resulting t-SNE embeddings in Fig. 4. From left to right are the original embeddings, the embeddings after OTDA, and the embeddings after RWM. Blue dots represent Amazon samples and red dots Webcam samples; numbers indicate classes. RWM successfully groups samples from the same class into distinguishable clusters, while OTDA integrates the source domain into the target domain very well but with larger errors. Zoom in on the pictures to see the samples of class 1, ‘bike’, and class 11, ‘keyboard’. The regularization weight of the OTDA Laplacian is 0.3, from a search in {1, 0.3, 0.1, 0.03, 0.01}. The weight of RWM is 1, from a search in {3, 1, 0.3, 0.1, 0.03, 0.01}.

Table 1:

Classification Accuracy (%) on Office-31 W → A

Feature 1NN WM OTDA RWM
Decaf-fc6 30.2±1.3 32.7±2.3 33.9±2.1 36.4±2.7
Decaf-fc7 31.3±1.9 34.6±2.2 35.8±1.5 43.2±2.6

Figure 4: t-SNE plots of Office samples after OTDA and RWM.

Point Set Registration

Registering point sets is key to many downstream applications such as surface reconstruction and stereo matching. Point set registration algorithms aim to assign correspondences between two sets of points and to recover the transformation between them (Myronenko and Song 2010). Figure 5, left, shows a Stanford Bunny as a grey point set and its shifted version as a colored point set after a random noisy translation and rotation. We apply (12) to recover the transformation. With this example, we also test our algorithm under the extreme condition where we have the same number of empirical samples and centroids. RWM still produces a one-to-one map between the two point sets, and the transformation then perfectly aligns them, while the traditional iterative closest point (ICP) algorithm fails to recover the transformation. The reason is that ICP assigns correspondences based on nearest neighbors, while RWM uses OT, which considers the point set as a whole when computing the correspondence. Note that by pre-defining the regularization as a rigid transformation and adjusting its weight, we can perform both rigid and non-rigid registration. In the above example, the regularization weight is λ = 10. Our alignment technique might be further incorporated into, e.g., (Yang and others 2016) for globally optimal alignment.

Figure 5: Alignment of translationally and rotationally shifted bunnies after RWM and ICP. t indicates the number of iterations.

Skeleton Layout

Suppose we have a point cloud $\mu \in P(\mathbb{R}^3)$ and a graph G = (V, E) representing the topology of the shape. Then, the problem is to find particular embeddings of the nodes, $y(v) : v \to \mathbb{R}^3$, that relate the graph to the geometry of the point cloud.

Now, consider the human shape point cloud in Fig. 6, top left. We initialize a rough embedding of the graph by fixing its end nodes $V_0 \subset V$ to certain known positions $y_{v \in V_0}$, which are the head, hands, and feet in this example, and letting the rest of the nodes distribute evenly along their branches. Our goal is to embed the nodes $v \in V \setminus V_0$ in $\mathbb{R}^3$ by applying (14). Because the weight of each centroid determines its boundaries with other centroids, it has to be adjusted to the local density of the cloud so that all centroids lie roughly evenly along the skeleton. Thus, we relax the restriction on the weights and reinstate (7). We update the weights by momentum gradient descent, $\nu(y_j)^{(t+1)} \leftarrow \lambda\, \nu(y_j)^{(t)} + (1-\lambda) \int_{\Omega \cap S_j} d\mu(x)$, to prevent them from quickly getting trapped in a local minimum, as in k-means.
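A minimal sketch of this momentum update follows, assuming idx[i] gives the centroid owning empirical sample i under the current VOT map; the function name is illustrative.

```python
import torch

def momentum_weight_update(nu_prev, idx, mu, lam=0.5):
    """Momentum update of the centroid weights used for skeleton layout:
    nu^(t+1) = lam * nu^(t) + (1 - lam) * (mass currently captured by each centroid)."""
    mass = torch.zeros_like(nu_prev).index_add_(0, idx, mu)   # per-centroid captured mass
    return lam * nu_prev + (1.0 - lam) * mass
```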

Figure 6: Skeleton layout. RWM embeds a pre-defined graph that relates to the shape of the cloud. Numbers indicate MSE, showing that RWM balances MSE and topology.

The top right of Fig. 6 shows our result. The skeleton successfully captures the shape of the point cloud. Colors of the skeleton nodes, based on their positions in the graph, are transferred to the surface according to their OT correspondences. We compare the results from Lloyd’s k-means algorithm and from RWM in the 2nd and 3rd columns; an equal regularization weight is added to Lloyd’s algorithm to make it a fair comparison. We also test our method under an extreme initial condition. As shown in (b), our algorithm eventually recovers a coherent, correct shape, but without the regularization we could end up with “ill-posed” embeddings. The figure also reports the mean squared errors (MSE); our method achieves small MSEs while maintaining the topology. In the bottom left, we show the result on the Stanford Armadillo. In the bottom right, we show the result from (Solomon and others 2015) as the ground truth, which regards the problem as a Wasserstein propagation problem and adopts Wasserstein barycenter techniques to relate the samples of the cloud to the graph, which is computationally much heavier. The average time of 5 trials by (Solomon and others 2015) was 1,200 seconds, while ours took 15 seconds (CPU: Intel i5-7640x, 4.0 GHz).

Conclusion

We have discussed the Wasserstein means problem and our method to regularize it. The results have shown that our method can adapt well to different problems by adopting different regularization terms. This work opens up a new perspective on the Wasserstein means problem, or the k-means problem, as well as on regularizing them.

We expect further use of regularized optimal transportation techniques on aligning distributions in high-dimensional spaces. Future work in our line of research could also include regularizing the barycenters.

Acknowledgements

This work was supported in part by NIH (RF1AG051710 and R01EB025032). Liang Mi is supported in part by ASU Completion Fellowship.

References

  1. Agueh M, et al. 2011. Barycenters in the Wasserstein space. SIAM J. on Mathematical Analysis 43(2):904–924. [Google Scholar]
  2. Alexandrov AD 2005. Convex polyhedra. Springer Science & Business Media. [Google Scholar]
  3. Ambrosio L, et al. 2008. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media. [Google Scholar]
  4. Anderes E, et al. 2016. Discrete Wasserstein barycenters: Optimal transport for discrete data. Mathematical Methods of Operations Research 84(2):389–409. [Google Scholar]
  5. Arjovsky M, et al. 2017. Wasserstein generative adversarial networks. In ICML. [Google Scholar]
  6. Beecks C, et al. 2011. Modeling image similarity by Gaussian mixture models and the signature quadratic form distance. In ICCV. [Google Scholar]
  7. Bengio Y, et al. 2013. Representation learning: A review and new perspectives. IEEE TPAMI 35(8):1798–1828. [DOI] [PubMed] [Google Scholar]
  8. Brenier Y. 1991. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics 44(4):375–417. [Google Scholar]
  9. Chen Y, et al. 2019. A gradual, semi-discrete approach to generative network training via explicit wasserstein minimization. In ICML. [Google Scholar]
  10. Courty N, et al. 2017a. Joint distribution optimal transportation for domain adaptation. In NeurIPS. [Google Scholar]
  11. Courty N, et al. 2017b. Optimal transport for domain adaptation. IEEE TPAMI 39(9):1853–1865. [DOI] [PubMed] [Google Scholar]
  12. Cuturi M, et al. 2014. Fast computation of Wasserstein barycenters. In ICML. [Google Scholar]
  13. Cuturi M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS. [Google Scholar]
  14. Ferradans S, et al. 2014. Regularized discrete optimal transport. SIAM J. on Imaging Sciences 7(3):1853–1882. [Google Scholar]
  15. Gibbs AL, et al. 2002. On choosing and bounding probability metrics. Intl. statistical review 70(3):419–435. [Google Scholar]
  16. Gopalan R, et al. 2011. Domain adaptation for object recognition: An unsupervised approach. In ICCV. [Google Scholar]
  17. Grippo L, et al. 2000. On the convergence of the block non linear Gauss–Seidel method under convex constraints. Operations research letters 26(3):127–136. [Google Scholar]
  18. Gu X, et al. 2013. Variational principles for Minkowski type problems, discrete optimal transport, and discrete Monge-Ampere equations. arXiv preprint arXiv:1302.5472. [Google Scholar]
  19. Ho N, et al. 2017. Multilevel clustering via Wasserstein means. In ICML. [Google Scholar]
  20. Kantorovich LV 1942. On the translocation of masses. In Dokl. Akad. Nauk SSSR, volume 37, 199–201. [Google Scholar]
  21. Kolouri S, et al. 2016. Sliced wasserstein kernels for probability distributions. In CVPR. [Google Scholar]
  22. Kolouri S, et al. 2018. Sliced wasserstein distance for learning gaussian mixture models. In CVPR, 3427–3436. [Google Scholar]
  23. Lévy B. 2015. A numerical algorithm for L2 semi-discrete optimal transport in 3d. ESAIM: Mathematical Modelling and Numerical Analysis 49(6):1693–1715. [Google Scholar]
  24. Ling H, and Okada K. 2007. An efficient earth mover’s distance algorithm for robust histogram comparison. IEEE TPAMI 29(5):840–853. [DOI] [PubMed] [Google Scholar]
  25. Ma J, et al. 2016. Non-rigid point set registration by preserving global and local structures. IEEE Trans. on image Processing 25(1):53–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. McCann RJ 1997. A convexity principle for interacting gases. Advances in mathematics 128(1):153–179. [Google Scholar]
  27. Mérigot Q. 2011. A multiscale approach to optimal transport. In Computer Graphics Forum, volume 30, 1583–1592. [Google Scholar]
  28. Mi L, et al. 2018. Variational Wasserstein clustering. In ECCV. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Monge G. 1781. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris. [Google Scholar]
  30. Myronenko A, and Song X. 2010. Point set registration: Coherent point drift. IEEE TPAMI. [DOI] [PubMed] [Google Scholar]
  31. Ng MK 2000. A note on constrained k-means algorithms. Pattern Recognition 33(3):515–519. [Google Scholar]
  32. Rabin J, et al. 2011. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision. [Google Scholar]
  33. Saenko K, et al. 2010. Adapting visual category models to new domains. In ECCV. [Google Scholar]
  34. Schroff F, et al. 2015. Facenet: A unified embedding for face recognition and clustering. In CVPR. [Google Scholar]
  35. Solomon J, et al. 2015. Convolutional wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph. 34(4):66. [Google Scholar]
  36. Stark JA 2000. Adaptive image contrast enhancement using generalizations of histogram equalization. IEEE TIP. [DOI] [PubMed] [Google Scholar]
  37. Sun B, et al. 2016. Deep coral: Correlation alignment for deep domain adaptation. In ECCV. [Google Scholar]
  38. Ulen J, et al. 2015. Shortest paths with higher-order regularization. IEEE TPAMI 37(12):2588–2600. [DOI] [PubMed] [Google Scholar]
  39. Villani C. 2003. Topics in optimal transportation. Number 58. American Mathematical Soc. [Google Scholar]
  40. Weinberger KQ, et al. 2009. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10(Feb):207–244. [Google Scholar]
  41. Yang J, et al. 2016. Go-icp: A globally optimal solution to 3d icp point-set registration. IEEE TPAMI (11):2241–2254. [DOI] [PubMed] [Google Scholar]
  42. Ye J, et al. 2017. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Trans. on Signal Processing 65(9):2317–2332. [Google Scholar]
