GEODESIC SINKHORN FOR FAST AND ACCURATE OPTIMAL TRANSPORT ON MANIFOLDS

Guillaume Huguet; Alexander Tong; María Ramos Zapatero; Christopher J Tape; Guy Wolf; Smita Krishnaswamy

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Sep 26:arXiv:2211.00805v2. [Version 2]

GEODESIC SINKHORN FOR FAST AND ACCURATE OPTIMAL TRANSPORT ON MANIFOLDS

Guillaume Huguet ^1,^*, Alexander Tong ^1,^*, María Ramos Zapatero ², Christopher J Tape ², Guy Wolf ¹, Smita Krishnaswamy ³

PMCID: PMC10557786 PMID: 37808090

Abstract

Efficient computation of optimal transport distance between distributions is of growing importance in data science. Sinkhorn-based methods are currently the state-of-the-art for such computations, but require $O (n^{2})$ computations. In addition, Sinkhorn-based methods commonly use an Euclidean ground distance between datapoints. However, with the prevalence of manifold structured scientific data, it is often desirable to consider geodesic ground distance. Here, we tackle both issues by proposing Geodesic Sinkhorn—based on diffusing a heat kernel on a manifold graph. Notably, Geodesic Sinkhorn requires only $O (n l o g n)$ computation, as we approximate the heat kernel with Chebyshev polynomials based on the sparse graph Laplacian. We apply our method to the computation of barycenters of several distributions of high dimensional single cell data from patient samples undergoing chemotherapy. In particular, we define the barycentric distance as the distance between two such barycenters. Using this definition, we identify an optimal transport distance and path associated with the effect of treatment on cellular data.

1. INTRODUCTION

Optimal Transport (OT) distances or Wasserstein distances are computed by lifting ground distances between points to distances between measures. This distance is computed relative to a ground distance on the support of the distributions, making it more informative than distances based only on a pointwise comparison of the densities. However, to compute the Wasserstein, one needs to find the optimal transport plan from the source distribution to a target distribution; this is a linear programming problem requiring $O (n^{3} l o g n)$ for discrete distributions of size $n$ [1].

An efficient modification of the optimal transport problem is to consider entropy-regularized transportation. This formulation is solved with the Sinkhorn algorithm [2] by iteratively rescaling a Gaussian kernel based on the distance matrix. It is equivalent to the Schrödinger Bridge problem, for which similar algorithms were developed [3]–[5]. In the discrete case, it requires $O (n^{2})$ for distributions of size $n$ , since it relies on matrix-vector products. Furthermore, this formulation allows for fast computation of the discrete barycenter with fixed support (the average distributions w.r.t. the Sinkhorn distance). An important drawback of the Sinkhorn algorithm is the necessity of storing and multiplying the pairwise distance matrix with a vector.

Additionally, the ground distance is commonly chosen as the Euclidean distance. The Euclidean distance is often sub-optimal for high-dimensional datasets over larger distances according to the manifold hypothesis, which says observations lie near a low dimensional (curved) manifold in high dimensional space [6]. For higher dimensional datasets assumed to be sampled from a lower dimensional manifold, using a distance closer to the manifold for OT has shown promising results [7]–[10].

In this work, we present Geodesic Sinkhorn¹ a Sinkhorn-based method for fast optimal transport with a heat-geodesic ground distance. Our method is based on the geometry of the dataset constructed from a common graph and uses the heat kernel on the graph to defined a heat-geodesic distance. Key to this approach, we will never need to construct or operate on an $n \times n$ distance matrix, and we will only use the sparse Laplacian matrix and sparse matrix-vector products. For sparse graphs, this can be used for $O (n l o g n)$ computation of the Sinkhorn distance with a manifold ground distance, improving on the $O (n^{2})$ implementations based on dense matrices.

Increasing the state-of-the art efficiency in Sinkhorn computation opens us up to being able to perform complex operations on large groups of datasets. In particular, we consider interpolating between datasets and show that using our heat-geodesic distance improves the interpolation accuracy compared to OT with Euclidean distance. The barycenter corresponds to the average distribution of a set of distributions. Our method allows for finer-grained barycenters on a data manifold, which motivates us to define a novel notion of dissimilarity between families of distributions called barycentric distance.

We apply the barycentric distance to single cell data from patient-derived cancer organoids (PDOs) to assess the effect of treatments (such as drugs and chemotherapy). Here we have one set of PDOs from control conditions, and another set that are treated. The treatment effect is thus the distance between these barycenters. In addition, we use Geodesic Sinkhorn’s barycenter to compare the effect from one family of distributions to another.

Our main contributions include: (1) A new method for computing optimal transport distances on a manifold called Geodesic Sinkhorn, which is highly efficient in time and memory. (2) Defining the barycentric distance; a novel distance between families of distributions, and showing its utility in deriving treatment effect from control and treated patient samples.

2. RELATED WORK

Geodesic Sinkhorn is related to prior work linking the entropy-regularized optimal transport problem triangular mesh with the heat operator [8], [11], but using different graph filtering techniques. These approaches approximate the application of the heat kernel to a vector by discretizing the heat equation and solving systems of linear equations. This technique was used in different contexts, either with the cotangent Laplacian [8] or to learn a ground metric [12]. Solving these systems for each Sinkhorn iteration can be done efficiently with a sparse Cholesky decomposition. However, this method’s efficiency depends mainly on the efficiency of the Cholesky decomposition which can be slow depending on the sparsity pattern is $O (n^{3})$ for an $n \times n$ matrix, and necessitates solving $2 K$ systems of linear equations per Sinkhorn iteration, where $K$ is the number of sub-steps in the backward Euler discretization.

3. PRELIMINARIES

In this section, we start by reviewing the basics of OT and the Wasserstein distance, as well the Sinkhorn distance. Then we review two notions fundamental to our method; the heat equation on a graph and the Chebyshev approximation of the heat kernel.

3.1. Wasserstein Distance

In the following, we assume that all distributions admit a density or a probability mass function, and we use the same notation for both. Let $μ, ν$ be two probability distributions on a measurable space $𝒳 \subseteq R^{d}$ with metric $d (\cdot, \cdot)$ , let $Π (μ, ν)$ be the set of joint probability distributions $π$ on the space $𝒳 \times 𝒳$ where, for any measurable subset $ω \subset 𝒳, π (ω \times 𝒳) = μ (ω)$ and $π (𝒳 \times ω) = ν (ω)$ . The $p$ -Wasserstein distance is defined as:

W_{p} (μ, ν) ≔ {(\underset{π \in Π (μ, ν)}{i n f} \int_{𝒳^{2}} d (x, y)^{p} d π (x, y))}^{1 / p} .

(1)

In the following, we consider $p = 2$ . An exact algorithm based on linear programming can solve this problem in $O (n^{3} l o g n)$ time for discrete distributions of size $n$ .

3.2. Sinkhorn Distances

The Kullback-Leibler (KL) divergence between $π$ and some strictly positive $K$ on $𝒳 \times 𝒳$ is defined as

D_{K L} (π∣ K) ≔ \int_{𝒳^{2}} (ln \frac{π (x, y)}{K (x, y)} - 1) d π (x, y) .

(2)

The Sinkhorn distance² is a relaxation of equation 1 where the infimum is over all coupling in $\{π \in Π (μ, ν) ∣ D_{K L} (π ∣ μ \times ν) \leq ξ}$ for $ξ > - 1$ . Introduced in [13], the optimization of this distance can be solved by considering the entropyregularized transport

W_{d, λ}^{2} (μ, ν) ≔ {(\underset{π \in Π (μ, ν)}{i n f} \int_{𝒳^{2}} d (x, y)^{2} d π (x, y) - λ H (π))}^{1 / 2},

(3)

where we define the entropy of a coupling $π$ as $H (π) : = - \int l n π (x, y) d π (x, y)$ , and $λ > 0$ . This formulation converges to the Wasserstein distance as $λ \to 0$ , and can be solved with the Sinkhorn algorithm with complexity of the order $O (n^{2} / ϵ)$ for discrete distributions of size $n$ [13]. In the discrete case, the transport matrix $π$ admits the form $d i a g (v) K_{λ} d i a g (w)$ , where $v, w$ are vectors of size $n$ . The Sinkhorn algorithm iteratively updates the vectors as $(v, w) \leftarrow (μ . / K_{λ} w, ν . / K_{λ}^{'} v)$ , where $K_{λ} : = e^{- d (x, y)^{2} / λ}$ .

Following [8], using the kernel $K_{λ}$ gives an alternative interpretation of the Sinkhorn distance as

W_{d, λ}^{2} (μ, ν) = λ^{1 / 2} {(1 + \underset{π \in Π (μ, ν)}{m i n} D_{K L} (π ∣ K_{λ}))}^{1 / 2} .

(4)

The problem in equation 3 is strictly convex and continuous yielding a unique minimizer. In the discrete case, this leads to an algorithm for the entropy-regularized Wasserstein distance based on the Sinkhorn algorithm enforcing the marginal constraints on the kernel $K_{λ}$ while minimizing the distance as quantified by $D_{K L}$ .

The underlying metric $d (\cdot, \cdot)$ is generally unknown, thus the kernel $K_{λ}$ cannot be evaluated. The authors of [8] proposed to approximate $K_{λ}$ with the heat kernel $ℋ_{t} (x, y)$ on $𝒳$ . According to Varadhan’s formula [14], the geodesic distance on a manifold can be recovered from the heat transfer at small timescales as

d (x, y)^{2} = \underset{t \to 0^{+}}{l i m} - 4 t ln ℋ_{t} (x, y) .

(5)

Hence, motivating the use of the heat-geodesic distance $d_{ℋ}^{2} (x, y) : = - 4 t l n ℋ_{t} (x, y)$ , with associated kernel $K_{λ} (x, y) =$ $ℋ_{λ / 4} (x, y)$ . Interestingly, Sinkhorn-based methods admit an efficient algorithm to solve the barycenter problem which we present next.

3.3. Interpolation with discrete support

By constraining the support to a set $𝒳$ (or a graph), we can efficiently interpolate between more than two distributions. The barycenter problem [1], [8], [15] generalizes the notion of average between points to an average between distributions. For a set of $m$ distributions $\{μ_{1}, \dots, μ_{m}\}$ supported on $𝒳$ , the objective is to find a distribution minimizing the average distance

μ^{*} ≔ \underset{μ \in 𝒫 (𝒳)}{a r g m i n} \sum_{i = 1}^{m} α_{i} W_{d}^{p} {(μ, μ_{i})}^{p},

where $𝒫 (𝒳)$ denotes the space of probability distributions supported on $𝒳$ , and $\{α_{1}, \dots, α_{m}\}$ are non-negative weights. Finding the barycenter is a challenging optimization problem, however the barycenter for Sinkhorn-based methods admits an efficient computation. It involves updating $m$ vectors $v_{i}, w_{i}$ , which define a transport plan from $μ_{i}$ to the barycenter $μ^{*}$ . The support of the barycenter is constrained to $𝒳$ , for most Sinkhorn-based methods the size of $𝒳$ needs to be small for computational reason. Our method does not suffer from such a limitation. Hence, we can consider barycenter with greater expressivity, and interpolate between large sets of distributions.

3.4. Heat Diffusion on a Graph

Consider an undirected graph $𝒢 = (V, E)$ with a set $V$ of $n$ vertices and a set of edges $E$ , and its weighted adjacency matrix $A$ with non-negative edge weights, and the diagonal degree matrix $D$ , where $D_{i i} : = \sum_{k} A_{i k}$ . We define the combinatorial Laplacian as $L : = D - A$ , for any function $f : V \to R$ we have $(L f) (v) = \sum_{u} a_{u, v} (f (v) - f (u))$ . The combinatorial Laplacian is a symmetric positive semi-definite matrix, and has an eigendecomposition $L = Ψ Λ Ψ^{T}$ with orthonormal eigenvectors $Ψ$ and diagonal eigenvalue matrix $Λ = d i a g (λ_{1}, λ_{2}, \dots, λ_{n})$ , such that $0 \leq λ_{1} \leq λ_{2} \leq \dots \leq λ_{n}$ . The combinatorial Laplacian is a natural extension of the negative of the Laplacian operator to a graph. For a signal $f_{0} \in R^{n}$ on $𝒢$ , the diffusion of $f_{0}$ on the graph evolves according to the heat equation

\frac{d}{d t} f (t) + L f (t) = 0, s.t. f (0) = f_{0} t \in R^{+} .

The heat kernel solves this ODE, it is defined by the matrix exponential $H_{t} : = e^{- t L}$ . By orthogonality of the eigenvectors of $L$ , we can write $H_{t} = Ψ e^{- t Λ} Ψ^{T}$ and $f (t) = H_{t} f_{0}$ . Computing $H_{t}$ by eigendecomposition would require $O (n^{3})$ operations. Recall that, for the Sinkhorn algorithm, we are only concerned with the application of the heat operator $H_{t}$ on a signal $f \in R^{n}$ . For larger diffusion time, the heat kernel converges to its eigenvector associated to the lowest eigenvalues of the Laplacian, hence, intuitively, the heat kernel corresponds to a low-pass filter. In Geodesic Sinkhorn, we use Chebyshev polynomials [16], [17] to approximate the application of the heat operator to a signal. For a short timescale $t$ , using the heat kernel accounts for using the geodesic distance as ground distance in the entropy-regularized OT formulation equation 3.

3.5. Chebyshev Polynomials

Polynomial sequences are often used to approximate functions or operator. With Chebyshev polynomials, we can approximate the application of the matrix exponential $H_{t} = e^{- t L}$ to a signal $f$ on the graph. An attractive property of Chebyshev polynomials is that the approximation error decays exponentially with the maximum degree $K$ . They are defined by the recursive relation ${\{T_{k}\}}_{k \in N}$ with $T_{0} (y) = 0, T_{1} (y) = y$ and $T_{k} (y) = 2 y T_{k - 1} (y) - T_{k - 2} (y)$ for $k \geq 2$ . On [−1, 1] these polynomials are orthogonal w.r.t. the weight $(1 - y)^{- 1 / 2}$ , and can be used to express the operator $H_{t}$ . Assuming the largest eigenvalue $λ_{n} \leq 2$ , we can write

H_{t} = \frac{b_{0}}{2} + \sum_{k = 1}^{\infty} b_{k} T_{k} (L - I d),

where the $K + 1$ scalar coefficient $\{b_{k}\}$ depend on time and can be evaluated with the Bessel function. The approximation of $H_{t}$ is based on the first $K$ term of the series which we note $p_{K} (L, t)$ . It results in $K$ matrix-vector products which can be efficient since, in general, $L$ is a sparse matrix. On a $m$ -nearest neighbor graph, this can be $O (K m n / λ)$ , where $λ$ is a regularization parameter. Chebyshev polynomials admits interesting theoretical properties and are known to converge faster than other polynomials [17], [18]. The choice of the parameter $K$ is related to the number of neighbors or the connectivity of the graph. For small diffusion time, hence only diffusing in a local neighborhood, the approximation is accurate even with a small $K$ . As the diffusion time increases, $K$ has to increase in order to consider a larger neighborhood around a node. For OT, we consider small diffusion time, and we found that our results were stable for all $K$ greater than 10.

4. GEODESIC SINKHORN DISTANCES

We define the Geodesic Sinkhorn distance between any signals or distributions on a graph $𝒢$ by the entropy-regularized OT with the heat kernel $H_{t}$ on the graph. This construction is also valid between any point cloud datasets. In that case, for $m$ datasets $\{X_{1}, \dots, X_{m}\}$ sampled from a set of distributions $\{μ_{1}, \dots, μ_{m}\}$ , we construct a common graph using an affinity kernel on the $m$ datasets and compare two samples by taking the distance between two indicator functions on the graph. We approximate the heat kernel $H_{t}$ with Chebyshev polynomials $p_{K} (L, l)$ of order $K$ . In Algorithm 1, we present the main steps to evaluate the Geodesic Sinkhorn. It is based upon Sinkhorn iterations [2], [13], where ⊘ and ⊙ denote respectively the elementwise division and multiplication. Note that, as opposed to the usual Sinkhorn algorithm, we never have to store a dense $n \times n$ distance matrix, but only the usually sparse graph Laplacian.

graphic file with name nihpp-2211.00805v2-f0001.jpg

Definition 4.1. The Geodesic Sinkhorn distance between two distributions $μ, ν$ on a graph $𝒢$ is

W_{H_{t}} (μ, ν) ≔ 4 t^{1 / 2} {(1 + \underset{π \in Π (μ, ν)}{m i n} D_{K L} (π ∣ H_{t}))}^{1 / 2} .

In the following proposition, we find the ground distance implicitly used in the optimal transport defined by Geodesic Sinkhorn. We use ≃ for the equivalence relation between distances.

Proposition 4.2. There exists a maximum Chebyshev polynomial degree $K$ such that the ground distance in Geodesic Sinkhorn is equivalent to the one based $H_{t}$

- 4 t l o g {(p_{K} (L, t))}_{i j} ≃ - 4 t l o g {(H_{t})}_{i j}

In particular, the Wasserstein distances with these ground distances are equivalent.

Proof. Because the approximation error decreases exponentially in $K$ [17], we have that for any $ϵ > 0$ sufficiently small there exist $K_{i}$ such that ${(H_{t})}_{i j} - ϵ < {(p_{K})}_{i j} < {(H_{t})}_{i j} +$ $ϵ$ . Choose $K$ such that this is true for all vertices $K : = m a x \{K_{1} \dots, K_{n}\}$ . We define

c ≔ \underset{i j}{m i n} \frac{{(H_{t})}_{i j} - ϵ}{{(H_{t})}_{i j}} and C ≔ \underset{i j}{m a x} \frac{{(H_{t})}_{i j} + ϵ}{{(H_{t})}_{i j}}

and we have the equivalence between the distances since

c {(H_{t})}_{i j} \leq {(p_{K} (L, t))}_{i j} \leq C {(H_{t})}_{i j} for all i j \in [n],

and since the logarithm is a monotonic function. □

In [8], [12], using the Euler implicit discretization results in a ground cost of the form $- ϵ l n ({(I d - \frac{ϵ}{4 K} L)}^{- K})$ , where Id is the identity matrix, and can be seen as another approximation for the matrix exponential.

The efficiency of Geodesic Sinkhorn improves the notion of barycenter as it is possible to consider much larger graph $𝒢$ , thus a finer grained support of the barycenter. This leads us to define a novel distance between families of distributions.

Definition 4.3. For two finite families of distributions $𝒯$ and $𝒞$ supported on $𝒢$ , we define the barycentric distance between the families $𝒯, 𝒞$ as

γ (𝒯, 𝒞) ≔ W_{H_{t}} (μ_{𝒯}^{*}, μ_{𝒞}^{*})

where $μ_{𝒯}^{*}, μ_{𝒞}^{*}$ are respectively the barycenters of $𝒯$ and $𝒞$ .

The previous definition is valid for any distances between distributions or barycenters. However, OT barycenters are known to be more informative than others [15]. We will further explore this comparison in our experiments. We use it to distinguish between two groups in a medical setting where a set of patients received a treatment (defining the family $𝒯$ ), and another set acts as a control family $𝒞$ . Following this idea, we define a notion of effect between two families.

Definition 4.4. For two family of distributions $𝒯$ and $𝒞$ supported on $𝒢$ , define the Expected Barycenter Effect of $𝒯$ as

τ (𝒯) ≔ E_{μ_{𝒯}^{*}} (Y_{t}) - E_{μ_{𝒞}^{*}} (Y_{c}),

where $μ_{𝒯}^{*}, μ_{𝒞}^{*}$ are respectively the barycenters of $𝒯$ and $𝒞$ , and the features $Y_{c} ~ μ_{𝒞}^{*}$ and $Y_{t} ~ μ_{𝒯}^{*}$ .

Note that we compute the average on the family of distributions instead of the average on their support, hence we evaluate their expectations in a closed form. This definition also extends to a conditional equivalent where families of distribution can be subdivided with discrete covariate variables. When the barycenters are computed with the total variation, this definition is equivalent to the naive Average Treatment Effect(ATE) [19]; i.e. difference of empirical means.

5. RESULTS

We demonstrate the accuracy and efficiency of the Geodesic Sinkhorn distance on two tasks: (1) Nearest-Wasserstein-neighbor calculation on simulated data with manifold structure similar to the setup of [10]; (2) A newly defined Barycentric distance between families of distributions computed to quantify the effect of a treatment on patient-derived organoids. In Appendix A.1, we present additional results on time series interpolation.

5.1. Nearest-Wasserstein-neighbor distributions

In this experiment, we compare our method with Sinkhorn [13], and LR Sinkhorn [20], both algorithms with Euclidean and squared Euclidean ground distance, with DiffusionEMD [21], and Sinkorn with Euler approximation of the heat filter. We created 15 Gaussian distributions sampled randomly on a swiss roll dataset, and sampled 10k observations from each distribution. We rotated the observations in 10 dimensions. We consider a k-nearest neighbors task on these distributions. We evaluate the methods with the ground truth, since we know the exact geodesic distance on the manifold. In Tab. 1, we report the average and standard deviation over 10 seeds of the Spearman and Pearson correlations to the ground truth, and the runtime in seconds with and without the computation of the graph. Our method is the most accurate while being much faster than other Sinkhorn-based methods.

Table 1.

KNN task for 15 distributions, best score highlighted is bold. Geodesic Sinkhorn is the most accurate, while being faster than other Sinkhorn-based methods.

Method	SpearmanR	PearsonR	P@5	Time(s) no graph	Time(s)
Diffusion EMD	0.62±0.097	0.736±0.023	0.66±0.072	2.845±0.135	7.877±0.531
Sinkhorn $W 1$	0.387±0.044	0.523±0.036	0.471±0.028	112.406±0.206	112.406±0.206
Sinkhorn $W 2$	0.411±0.036	0.485±0.027	0.492±0.053	133.686±5.234	133.686±5.234
LR Sinkhorn $W 1$	−0.31±0.07	−0.131±0.086	0.237±0.037	578.631±107.82	578.631±107.82
LR Sinkhorn $W 2$	0.366±0.048	0.379±0.051	0.447±0.023	204.191±3.656	204.191±3.656
Euler Sinkhorn	0.776±0.061	0.718±0.009	0.728±0.072	449.752±42.985	455.059±43.083
Geodesic Sinkhorn	0.847±0.023	0.754±0.016	0.833±0.034	10.176±1.249	16.682±1.705

Open in a new tab

5.2. Barycentric distance

We test if we can identify a linear treatment effect with the Expected Barycenter Effect (EBE). In this experiment, we create a control family of distributions $𝒞$ of ten standard Gaussian distributions. The treatment group consists of nine Gaussian distributions $𝒩 (5,1)$ , and one outlier centered at different means. For each distribution, we sample 500 observations, and reproduce the experiment over ten seeds. In Tab. 2, we report the EBE and its standard deviation with the Geodesic Sinkhorn, the Total Variation (TV) distance, and Sinkhorn. Since the TV only compares the mean, it is sensitive to the outlier, whereas our method can identify the true treatment effect.

Table 2.

Expected Barycenter Effect (EBE) with one outlier distribution centered at −60,−30,0, or 5. Comparison using the barycenter from Sinkhorn, total variation, or Geodesic Sinkhorn. Values closer to the real treatment effect of 5 are better.

Outlier	EBE Geo Sinkhorn	EBE Sinkhorn	EBE TV
−60	5.016±0.226	−0.103±0.005	−1.429±0.144
−30	5.053±0.196	0.355±0.049	1.571±0.144
0	4.917±0.315	4.954±0.157	4.571±0.144
No outlier	5.059±0.159	5.054±0.16	5.071±0.144

Open in a new tab

5.3. Single-cell signaling data

We use single-cell signaling data produced by mass cytometry (MC) for a screening study to compare the treatment effect of different chemotherapies on 10 colorectal cancer (CRC) patient-derived organoids (PDOs) [22]. These PDOs can be grouped into chemoresistant PDOs, that show little-to-no effect when treated with chemotherapies; and chemosensitive PDOs, that present strong shifts in their phenotypes upon treatment. The observations include single-cell data information on the cell cycle and signaling changes upon treatment of PDOs with different CRC treatments at a range of concentrations. In Fig. 1, we present the barycentric distances matrices between treatments a) and between four concentrations of treatment SN-38 (S) b). In both cases, the control groups corresponds to AH and DMSO, the two rightmost columns. We compare the distance matrices between Sinkhorn (left) and our method (right). Our method provide a finer distinction between treatments (Fig. 1 top) and concentrations (Fig. 1 bottom), especially for the chemosensitive group. As observed in [22], chemosensitive PDOs show little-to-no response to lower concentrations of SN-38 (S1), but their phenotype shifts very strongly upon treatment with higher concentrations (S2, S3, and S4) (Fig. 1 b). When comparing combinations of different treatments (Fig. 1 a), Geodesic Sinkhorn better resolves the difference between SN-38 (S) alone and in combination with Cetuximab (C), showing that S is the main agent creating the treatment effect and the combination with C does not resolve in a synergistic effect [22]. Note that we only consider the relative magnitude of the distances, since the two algorithm use different ground distances.

Fig. 1. — a) Barycentric distances matrices for the Sinkhorn algorithm (left) and our method Geodesic Sinkhorn (right). b) Barycentric distances matrices between doses of treatment SN-38, for four concentrations S1 S2 S3 S4. Control groups correspond to AH and DSMO. Geodesic Sinkhorn provides a clearer distinction between treatments, and doses.

6. CONCLUSION

In this work, we considered the use of OT for graphs and large datasets in high dimensions potentially sampled from a lower dimensional manifold. We proposed Geodesic Sinkhorn, a fast implementation of the Sinkhorn algorithm using the graph Laplacian and Chebyshev polynomials. Our method is well adapted for large and high dimensional datasets as it is defined with a geodesic ground distance, which takes into account the underlying geometry of the data, and requires less computation time and less memory. On a synthetic dataset, we showed that Geodesic Sinkhorn is much faster than other Sinkhorn-based methods while being more accurate. With the Wasserstein barycenter, we defined the barycentric distance to compare entire families of distributions, and the expected barycenter effect, then applied both methods to a large PDO drug screen dataset.

Acknowledgments

This research was partially funded by ESP Mérite [G.H.], NSERC Discovery grant 03267 [G.W.], Canada CIFAR AI Chair [G.W.], NIH grant R01GM135929 [G.W., S.K.], Cancer Research UK (C60693 / A23783) [C.J.T.], the Rosetrees Trust (M872 / A2292) [C.J.T.], and the Yale-UCL Collaborative Student Exchange Programme.

A. APPENDIX

A.1. Time series Interpolation

To evaluate Geodesic Sinkhorn’s performance on inferring dynamics, we test its performance on a task for time series interpolation. In this setting the most used datasets are Embryoid Body [1], and WOT [2]. We curated ten scRNA-seq datasets; WOT [2], Clark [3], Embryoid Body [1], and seven more from the 2022 multimodel single-cell integration challeng³ to test our method. The observations are the gene expression of single cells from a distribution evolving through time. The Waddington-OT dataset (WOT) has 38 timpoints of a developing stem cell population over 18 days collected roughly every 6–12 hours. This is the most densely sampled dataset in time. The Embryoid Body dataset is a single-cell RNA seq dataset of developing embryoid bodies from 0–30 days with 5 datasets collected over time. The Clark dataset contains 12 samples over 9 unique timepoints of a developing mouse retina. Finally, the NeurIPS 2022 data contains four donors with single-cell transcriptomic data collected over 4 timepoints for 3 donors with 10X-multiome and 4 combined cite-seq (in the publically released training data) leading to 7 additional time series with 4 timepoints.

The goal is to interpolate the distribution between two timepoints. The number of timepoints in each dataset range from 4 to 40. Here for a dataset with $T$ single-cell distributions $\{μ_{1}, μ_{2}, \dots, μ_{T}\}$ over time for $t \in [2 \dots T - 1]$ we compute the exact Euclidean 2-Wasserstein distance between the interpolated distribution ${\hat{μ}}_{t}$ at time $t$ and the ground truth distribution $μ_{t}, W_{2} ({\hat{μ}}_{t}, μ_{t})$ . Since we interpolate between two distributions we used the McCann interpolant as its support is $R^{d}$ . We compare our Geodesic Sinkhorn interpolation with either the Sinkhorn Mccann interpolant with Euclidean ground distance ( $L^{2}$ Sinkhorn), Sinkhorn Euler Mccann with the Euler heat approximation in Tab. 3 Sinkhorn with Euler approximation ran out of memory on the Clark and WOT datasets. We see that across all 10 the Geodesic Sinkhorn interpolation with the Mccann interpolant outperforms all other methods, hence showcasing the importance of the heat-geodesic distance and our kernel approximation. We also compare Sinkhron Euler and Geodesic Sinkhorn for different nearest neighbors graphs in Tab. 4 and Tab. 5, where Geodesic Sinkhorn outperforms Sinkhorn Euler on most datasets.

Table 3.

Time series interpolation task comparing mean and standard deviation $(μ \pm σ)$ across 5 seeds the 2-Wasserstein metric averaged across time $(1 / T - 2) \sum_{t \in [2 \dots T - 1]} W_{2} ({\hat{μ}}_{t}, μ_{t})$ for 10 single-cell timeseries datasets. Sinkhron Euler ran out of memory on two datasets. Lower is better, best performance on each dataset is bold.

Dataset	$L^{2}$ Sinkhorn	Sinkhron Euler	Geo Sinkhorn (ours)
Cite Donor0	48.545 ± 0.057	46.254 ± 3.192	44.440 ± 0.108
Cite Donor1	48.220 ± 0.055	45.897 ± 3.254	44.165 ± 0.103
Cite Donor2	50.281 ± 0.016	47.773 ± 3.958	45.673 ± 0.092
Cite Donor3	49.339 ± 0.081	46.565 ± 3.553	45.022 ± 0.146
Clark	13.500 ± 0.003	–	13.288 ± 0.008
EB	12.415 ± 0.008	12.298 ± 0.140	12.133 ± 0.011
Multiome Donor0	56.648 ± 0.048	55.373 ± 7.234	53.431 ± 0.077
Multiome Donor1	54.028 ± 0.126	52.396 ± 4.394	50.238 ± 0.022
Multiome Donor2	58.798 ± 0.155	57.182 ± 5.511	55.041 ± 0.058
WOT	8.096 ± 0.003	–	7.397 ± 0.106

Open in a new tab

Table 4.

Time series interpolation task comparing mean of the 2-Wasserstein for a KNN graph with 50 neighbors. Lower is better, best score is bold.

Dataset	Sinkhorn Euler	Geodesic Sinkhorn
Cite Donor0	48.507	46.850
Cite Donor1	48.207	46.883
Cite Donor2	50.383	49.176
Cite Donor3	49.210	47.646
Clark	13.506	13.378
EB	12.409	12.394
Multiome Donor0	56.676	55.095
Multiome Donor1	54.028	53.952
Multiome Donor2	58.821	57.187
WOT	8.070	8.279

Open in a new tab

Table 5.

Time series interpolation task comparing mean of the 2-Wasserstein for a KNN graph with 100 neighbors. Lower is better, best score is bold.

Dataset	Sinkhorn Euler	Geodesic Sinkhorn
Cite Donor0	48.573	47.102
Cite Donor1	48.197	47.373
Cite Donor2	50.298	49.517
Cite Donor3	49.290	48.433
Clark	13.500	13.396
EB	12.415	12.416
Multiome Donor0	56.708	55.562
Multiome Donor1	54.063	54.016
Multiome Donor2	58.802	57.732
WOT	8.096	8.253

Open in a new tab

Footnotes

https://github.com/KrishnaswamyLab/GeoSinkhorn

With a slight abuse of language we use the term distance, although the entropy-regularized formulation does not respect the identity of indiscernibles.

https://www.kaggle.com/competitions/open-problems-multimodal/

References

[1].Peyré G. and Cuturi M., “Computational Optimal Transport,” arXiv, 2020. [Google Scholar]
[2].Sinkhorn R. and Knopp P., “Concerning nonnegative matrices and doubly stochastic matrices,” Pacific Journal of Mathematics, 1967. [Google Scholar]
[3].Fortet R., “Résolution d’un systeme d’équations de m. schrödinger,” J. Math. Pure Appl. IX, 1940. [Google Scholar]
[4].Kullback S., “Probability densities with given marginals,” The Annals of Mathematical Statistics, 1968. [Google Scholar]
[5].Knight P. A. and Ruiz D., “A fast algorithm for matrix balancing,” IMA Journal of Numerical Analysis, 2013. [Google Scholar]
[6].Moon K. R., Stanley J. S., Burkhardt D., van Dijk D., Wolf G., and Krishnaswamy S., “Manifold learning-based methods for analyzing single-cell RNA-sequencing data,” Current Opinion in Systems Biology, 2018. [Google Scholar]
[7].Huguet G., Magruder D. S., Tong A., et al. , “Manifold interpolating optimal-transport flows for trajectory inference,” NeurIPS, 2022. [PMC free article] [PubMed] [Google Scholar]
[8].Solomon J., De Goes F., Peyré G., et al. , “Convolutional wasserstein distances: Efficient optimal transportation on geometric domains,” ACM Transactions on Graphics (ToG), 2015. [Google Scholar]
[9].Tong A., Huguet G., Shung D., et al. , “Embedding signals on graphs with unbalanced diffusion earth mover’s distance,” in ICASSP, IEEE, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
[10].Tong A., Huguet G., Natik A., et al. , “Diffusion earth mover’s distance and distribution embeddings,” in ICML, 2021. [Google Scholar]
[11].Crane K., Weischedel C., and Wardetzky M., “Geodesics in heat: A new approach to computing distance based on heat flow,” ACM Transactions on Graphics, 2013. [Google Scholar]
[12].Heitz M., Bonneel N., Coeurjolly D., Cuturi M., and Peyré G., “Ground metric learning on graphs,” J. of Math. Imaging and Vision, 2021. [Google Scholar]
[13].Cuturi M., “Sinkhorn distances: Lightspeed computation of optimal transport,” NeurIPS, 2013. [Google Scholar]
[14].Varadhan S. R. S., “On the behavior of the fundamental solution of the heat equation with variable coefficients,” Communications on Pure and Applied Mathematics, 1967. [Google Scholar]
[15].Cuturi M. and Doucet A., “Fast computation of wasserstein barycenters,” in ICML, 2014. [Google Scholar]
[16].Shuman D. I., Vandergheynst P., and Frossard P., “Chebyshev polynomial approximation for distributed signal processing,” in DOCSS, IEEE, 2011. [Google Scholar]
[17].Marcotte S., Barbe A., Gribonval R., et al. , “Fast multi-scale diffusion on graphs,” in ICASSP, IEEE, 2022. [Google Scholar]
[18].Huang S.-G., Lyu I., Qiu A., and Chung M. K., “Fast polynomial approximation of heat kernel convolution on manifolds and its application to brain sulcal and gyral graph pattern analysis,” IEEE transactions on medical imaging, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
[19].Imbens G. W. and Rubin D. B., Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015. [Google Scholar]
[20].Scetbon M., Cuturi M., and Peyré G., “Low-rank sinkhorn factorization,” in ICML, 2021. [Google Scholar]
[21].Tong A., Huang J., Wolf G., Dijk D. V., and Krishnaswamy S., “TrajectoryNet: A Dynamic Optimal Transport Network for Modeling Cellular Dynamics,” in ICML, 2020. [PMC free article] [PubMed] [Google Scholar]
[22].Zapatero M. R., Tong A., Sufi J., et al. , “Trellis single-cell screening reveals stromal regulation of patient-derived organoid drug responses,” bioRxiv, 2023. [DOI] [PubMed] [Google Scholar]

References

[1].Moon K. R., van Dijk D., Wang Z., et al. , “Visualizing structure and transitions in high-dimensional biological data,” Nat Biotechnol, vol. 37, no. 12, pp. 1482–1492, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
[2].Schiebinger G., Shu J., Tabaka M., et al. , “Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming,” en, Cell, vol. 176, no. 4, 928–943.e22, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
[3].Clark B. S., Stein-O’Brien G. L., Shiau F., et al. , “Single-Cell RNA-Seq Analysis of Retinal Development Identifies NFI Factors as Regulating Mitotic Exit and Late-Born Cell Specification,” Neuron, vol. 102, no. 6, 1111–1126.e5, Jun. 2019, ISSN: 08966273. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R1] [1].Peyré G. and Cuturi M., “Computational Optimal Transport,” arXiv, 2020. [Google Scholar]

[R2] [2].Sinkhorn R. and Knopp P., “Concerning nonnegative matrices and doubly stochastic matrices,” Pacific Journal of Mathematics, 1967. [Google Scholar]

[R3] [3].Fortet R., “Résolution d’un systeme d’équations de m. schrödinger,” J. Math. Pure Appl. IX, 1940. [Google Scholar]

[R4] [4].Kullback S., “Probability densities with given marginals,” The Annals of Mathematical Statistics, 1968. [Google Scholar]

[R5] [5].Knight P. A. and Ruiz D., “A fast algorithm for matrix balancing,” IMA Journal of Numerical Analysis, 2013. [Google Scholar]

[R6] [6].Moon K. R., Stanley J. S., Burkhardt D., van Dijk D., Wolf G., and Krishnaswamy S., “Manifold learning-based methods for analyzing single-cell RNA-sequencing data,” Current Opinion in Systems Biology, 2018. [Google Scholar]

[R7] [7].Huguet G., Magruder D. S., Tong A., et al. , “Manifold interpolating optimal-transport flows for trajectory inference,” NeurIPS, 2022. [PMC free article] [PubMed] [Google Scholar]

[R8] [8].Solomon J., De Goes F., Peyré G., et al. , “Convolutional wasserstein distances: Efficient optimal transportation on geometric domains,” ACM Transactions on Graphics (ToG), 2015. [Google Scholar]

[R9] [9].Tong A., Huguet G., Shung D., et al. , “Embedding signals on graphs with unbalanced diffusion earth mover’s distance,” in ICASSP, IEEE, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] [10].Tong A., Huguet G., Natik A., et al. , “Diffusion earth mover’s distance and distribution embeddings,” in ICML, 2021. [Google Scholar]

[R11] [11].Crane K., Weischedel C., and Wardetzky M., “Geodesics in heat: A new approach to computing distance based on heat flow,” ACM Transactions on Graphics, 2013. [Google Scholar]

[R12] [12].Heitz M., Bonneel N., Coeurjolly D., Cuturi M., and Peyré G., “Ground metric learning on graphs,” J. of Math. Imaging and Vision, 2021. [Google Scholar]

[R13] [13].Cuturi M., “Sinkhorn distances: Lightspeed computation of optimal transport,” NeurIPS, 2013. [Google Scholar]

[R14] [14].Varadhan S. R. S., “On the behavior of the fundamental solution of the heat equation with variable coefficients,” Communications on Pure and Applied Mathematics, 1967. [Google Scholar]

[R15] [15].Cuturi M. and Doucet A., “Fast computation of wasserstein barycenters,” in ICML, 2014. [Google Scholar]

[R16] [16].Shuman D. I., Vandergheynst P., and Frossard P., “Chebyshev polynomial approximation for distributed signal processing,” in DOCSS, IEEE, 2011. [Google Scholar]

[R17] [17].Marcotte S., Barbe A., Gribonval R., et al. , “Fast multi-scale diffusion on graphs,” in ICASSP, IEEE, 2022. [Google Scholar]

[R18] [18].Huang S.-G., Lyu I., Qiu A., and Chung M. K., “Fast polynomial approximation of heat kernel convolution on manifolds and its application to brain sulcal and gyral graph pattern analysis,” IEEE transactions on medical imaging, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] [19].Imbens G. W. and Rubin D. B., Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015. [Google Scholar]

[R20] [20].Scetbon M., Cuturi M., and Peyré G., “Low-rank sinkhorn factorization,” in ICML, 2021. [Google Scholar]

[R21] [21].Tong A., Huang J., Wolf G., Dijk D. V., and Krishnaswamy S., “TrajectoryNet: A Dynamic Optimal Transport Network for Modeling Cellular Dynamics,” in ICML, 2020. [PMC free article] [PubMed] [Google Scholar]

[R22] [22].Zapatero M. R., Tong A., Sufi J., et al. , “Trellis single-cell screening reveals stromal regulation of patient-derived organoid drug responses,” bioRxiv, 2023. [DOI] [PubMed] [Google Scholar]

PERMALINK

This is a preprint.

GEODESIC SINKHORN FOR FAST AND ACCURATE OPTIMAL TRANSPORT ON MANIFOLDS

Guillaume Huguet

Alexander Tong

María Ramos Zapatero

Christopher J Tape

Guy Wolf

Smita Krishnaswamy

Abstract

1. INTRODUCTION

2. RELATED WORK

3. PRELIMINARIES

3.1. Wasserstein Distance

3.2. Sinkhorn Distances

3.3. Interpolation with discrete support

3.4. Heat Diffusion on a Graph

3.5. Chebyshev Polynomials

4. GEODESIC SINKHORN DISTANCES

5. RESULTS

5.1. Nearest-Wasserstein-neighbor distributions

Table 1.

5.2. Barycentric distance

Table 2.

5.3. Single-cell signaling data

Fig. 1.

6. CONCLUSION

Acknowledgments

A. APPENDIX

A.1. Time series Interpolation

Table 3.

Table 4.

Table 5.

Footnotes

References

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases