Published in final edited form as: Proc IEEE Int Symp Biomed Imaging, 2020:1715–1719. doi: 10.1109/ISBI45749.2020.9098723

EARTHMOVER-BASED MANIFOLD LEARNING FOR ANALYZING MOLECULAR CONFORMATION SPACES

Nathan Zelesko, Amit Moscovich, Joe Kileel, Amit Singer

Abstract

In this paper, we propose a novel approach for manifold learning that combines the Earthmover’s distance (EMD) with the diffusion maps method for dimensionality reduction. We demonstrate the potential benefits of this approach for learning shape spaces of proteins and other flexible macromolecules using a simulated dataset of 3-D density maps that mimic the non-uniform rotary motion of ATP synthase. Our results show that EMD-based diffusion maps require far fewer samples to recover the intrinsic geometry than the standard diffusion maps algorithm that is based on the Euclidean distance. To reduce the computational burden of calculating the EMD for all volume pairs, we employ a wavelet-based approximation to the EMD which reduces the computation of the pairwise EMDs to a computation of pairwise weighted-$\ell_1$ distances between wavelet coefficient vectors.

Keywords: dimensionality reduction, Wasserstein metric, computational optimal transport, diffusion maps, Laplacian eigenmaps, shape space, cryo-electron microscopy

1. INTRODUCTION

Proteins and other macromolecules are elastic structures that may deform in various ways. Since the spatial conformation of an organic molecule is known to play a key role in its biological function, the complete description of a molecule must include more than just a single static structure (as is traditionally produced by X-ray crystallography). Ideally, we would like to map the entire space of molecular conformations. However, understanding the topology and geometry of these conformation spaces remains one of the grand challenges in the field of structural biology [1].

One promising approach is to employ cryo-electron microscopy (cryo-EM) as a tool for structure determination in the presence of conformational heterogeneity [2]. In cryo-EM, multiple images of a particular macromolecule are taken by a transmission electron microscope and then processed using specialized algorithms. Traditionally, these algorithms construct an estimate of the mean molecular volume, in the form of a 3-D electrostatic density map. In particular, this process averages out any variability in the spatial conformations of the molecules in the sample. Recent works have applied techniques from the field of manifold learning to cryo-EM data sets, obtaining a low-dimensional representation of the molecular conformation space [3, 4]. Specifically, these works build affinity graphs based on the Euclidean distances between molecular volumes (or projection images) and then compute diffusion map embeddings [5, 6].

However, the Euclidean distance is suboptimal for capturing the distance between geometric conformations. Consider, for example, two conformations of a molecule that has only a single moving part. If the two conformations are distant, the support of the moving part in the two volumes may not intersect, rendering the Euclidean distance independent of the conformational distance. See Fig. 1. In such cases, in order to apply manifold learning based on a Euclidean metric, one needs a dense cover of the conformation space by the molecules in the sample. Since the number of points in such a cover scales exponentially in the dimension, it may be infeasible to apply these methods, even using the largest existing experimental datasets, which consist of about $n \approx 10^6$ samples.

Fig. 1.

EMD vs. Euclidean distance for 1-D motion. The Euclidean distance $\|f-g\|_2 := \left(\int (f-g)(x)^2\, dx\right)^{1/2}$ between probability densities is uninformative for large displacements. In particular, $\|f_a - f_c\|_2 = \|f_b - f_c\|_2$. In contrast, for any displacement, the EMD is simply its magnitude. In particular, $\mathrm{EMD}(f_a, f_c) = |a - c|$ and $\mathrm{EMD}(f_a, f_b) = |a - b|$.

In this paper, we propose to use the Earthmover’s distance (EMD), also known as the Wasserstein metric, instead of the commonly used Euclidean distance as input to manifold embedding algorithms. EMD has an intuitive geometric meaning: it measures the minimal amount of “work” needed to transform one pile of mass into another pile of equal mass, where “work” is defined as the amount of mass moved times the distance by which it is moved. In particular, EMD provides a distance metric that is meaningful even between spatial conformations that are far from each other. Following the discussion above, this property should reduce the number of samples needed to learn the intrinsic manifold.

Methods for computing the EMD, based on off-the-shelf linear programming solvers, are expensive when the number of voxels is large. Therefore we used a fast approximation to the EMD, based on a wavelet representation [7].

To test our proposal, we compared the standard $\ell_2$-based diffusion maps to EMD-based diffusion maps on a synthetic dataset mimicking the motion of ATP synthase (Fig. 2). This dataset samples the underlying manifold in a non-uniform manner since ATP synthase has three dominant conformations that are 120° apart [8]. The approximate EMD-based approach yields a marked improvement in the number of samples required for learning the conformational manifold, while still offering a computationally feasible algorithm.

Fig. 2.

ATP synthase. (left) F0 and axle subunits. They rotate together in the presence of hydrogen ions, forming a tiny electric motor; (middle) the F1 subunit (in cyan) envelops the axle. As the axle rotates, this subunit assembles ATP; (right) sample slice of the rotated F0 and axle subunits with additive Gaussian noise.

2. METHODS

In this section, we review the basic techniques that underlie Earthmover-based manifold learning. Our current focus is on learning shape spaces of 3-D volumes, but the same techniques may also be applied to analyze other types of datasets, such as 2-D image sets, 1-D histograms, etc. To start, let $X = \{x_1, \ldots, x_n\}$ be a set of 3-D voxel arrays in $\mathbb{R}^{L^3}$. We assume that $X$ obeys the manifold hypothesis [2, 9, 10], i.e., $x_1, \ldots, x_n$ form a (noisy) sample of a low-dimensional manifold $\mathcal{M} \subseteq \mathbb{R}^{L^3}$. Our task is to reorganize the data to better reflect the intrinsic geometry of $\mathcal{M}$.

For Riemannian manifolds, eigenfunctions of the Laplace-Beltrami operator provide an intrinsic coordinate system [11, 12]. Accordingly, several popular methods for dimensionality reduction and data representation are based on mapping input points using empirical estimates of Laplacian eigenfunctions [6, 5]. Under the manifold hypothesis, these estimates converge to eigenfunctions of the Laplace-Beltrami operator, or more generally to eigenfunctions of a weighted Laplacian, depending on the construction [13].

We now describe the diffusion maps method [6]. Let $w : \mathbb{R}^{L^3} \times \mathbb{R}^{L^3} \to \mathbb{R}$ denote a symmetric non-negative function that gives an affinity score for each pair of volumes. One common way of constructing affinities is to take a distance metric $d : X \times X \to \mathbb{R}$ and apply a Gaussian kernel with a suitably chosen width $\sigma$ to form the affinity matrix $W \in \mathbb{R}^{n \times n}$

$W_{ij} = w(x_i, x_j) = \exp\!\big(-d(x_i, x_j)^2 / (2\sigma^2)\big).$   (1)

The degree matrix $D \in \mathbb{R}^{n \times n}$ is defined to be the diagonal matrix that satisfies $D_{ii} = \sum_{j=1}^{n} W_{ij}$. We use the Coifman-Lafon normalized graph Laplacian [14], which converges to the Laplace-Beltrami operator regardless of the sampling density. To compute this, one first performs a two-sided normalization of the affinity matrix, $\tilde{W} = D^{-1} W D^{-1}$, and then computes the random-walk Laplacian, $\mathcal{L} = I - \tilde{D}^{-1} \tilde{W}$, where $\tilde{D}$ is the degree matrix of $\tilde{W}$. The random-walk Laplacian is similar to a positive semi-definite symmetric matrix and hence its eigenvectors are real and its eigenvalues are non-negative. The all-ones vector is an eigenvector of $\mathcal{L}$ with eigenvalue zero [15]. Let $\phi_0, \phi_1, \ldots, \phi_{n-1} \in \mathbb{R}^n$ be eigenvectors of $\mathcal{L}$ with corresponding eigenvalues $0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1}$. We think of the eigenvectors $\phi_\ell$ as real-valued functions on $X$, by identifying $\phi_\ell(x_i) = (\phi_\ell)_i$. The $k$-dimensional diffusion map $\Psi_t^{(k)} : X \to \mathbb{R}^k$ is defined by:

$x_i \mapsto \left(\lambda_1^t\, \phi_1(x_i), \ldots, \lambda_k^t\, \phi_k(x_i)\right).$

The mapping $\Psi_t^{(k)}$ gives a system of $k$ coordinates on $X$, which captures the intrinsic geometry of $\mathcal{M}$. In our simulations, we used $t = 0$, in which case diffusion maps coincide with Laplacian eigenmaps [5].
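For concreteness, the construction above can be summarized in a short NumPy sketch. This is an illustrative reading of the steps described here, not the released implementation (see the Footnotes for the code repository); the function name diffusion_map and the dense eigendecomposition are our own choices.

```python
import numpy as np

def diffusion_map(dist, sigma, k, t=0):
    """Diffusion-map coordinates from a precomputed matrix of pairwise
    distances (Euclidean or WEMD), following the construction above.
    Returns an n x k array; for t = 0 this reduces to Laplacian eigenmaps."""
    W = np.exp(-dist ** 2 / (2.0 * sigma ** 2))        # Gaussian affinities, Eq. (1)
    d = W.sum(axis=1)
    W_tilde = W / np.outer(d, d)                       # two-sided normalization D^-1 W D^-1
    d_tilde = W_tilde.sum(axis=1)
    # Symmetric conjugate of the random-walk Laplacian L = I - D~^-1 W~.
    S = W_tilde / np.sqrt(np.outer(d_tilde, d_tilde))
    mu, V = np.linalg.eigh(S)                          # eigenvalues of L are 1 - mu
    order = np.argsort(1.0 - mu)                       # Laplacian eigenvalues, ascending
    lam = (1.0 - mu)[order]
    phi = (V / np.sqrt(d_tilde)[:, None])[:, order]    # eigenvectors of L itself
    # Skip the trivial constant eigenvector (lambda_0 = 0), keep the next k.
    return phi[:, 1:k + 1] * lam[1:k + 1] ** t
```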

The diffusion map depends on the choice of affinity. The typical choice is a Gaussian kernel as defined in Eq. (1) that is based on a Euclidean (or $\ell_2$) distance function,

$d_{\ell_2}(x_i, x_j) = \|x_i - x_j\|_2.$

We propose instead to base the Gaussian kernel of Eq. (1) on the Earthmover’s distance (EMD), also known as the Wasserstein metric [16]. EMD is popular in various applications, e.g., image retrieval [17]; however, to the best of our knowledge, it has never been used to define affinities for manifold learning algorithms. To define this distance, consider two 3-D density maps $x_i, x_j \in \mathbb{R}^{L^3}$ that are non-negative and normalized to unit mass. These densities define probability measures on the set of voxels, $[L]^3$, where $[L] = \{1, \ldots, L\}$. We set:

$d_{\mathrm{EMD}}(x_i, x_j) = \min_{\pi \in \Pi(x_i, x_j)} \sum_{u \in [L]^3} \sum_{v \in [L]^3} \pi(u, v)\, \|u - v\|_2,$

where $\Pi(x_i, x_j)$ is the set of joint probability measures on $[L]^3 \times [L]^3$ with marginals $x_i$ and $x_j$, respectively.
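As described in the next paragraph, this minimization is a linear program over transport plans. Purely to illustrate the definition (and not the method used in this paper), the following sketch solves the Kantorovich linear program with SciPy for distributions supported on a small set of voxels; the helper name exact_emd and the use of scipy.optimize.linprog are our own choices.

```python
import numpy as np
from scipy.optimize import linprog

def exact_emd(p, q, coords):
    """Exact Earthmover's distance between two discrete probability vectors
    p and q supported on the points in `coords` (an m x 3 array of voxel
    coordinates), solved as the Kantorovich linear program over transport
    plans pi with marginals p and q."""
    m = len(p)
    # Ground cost: Euclidean distance between every pair of support points.
    C = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Equality constraints: sum_v pi(u, v) = p(u) and sum_u pi(u, v) = q(v),
    # with pi flattened row-major so that pi(u, v) sits at index u * m + v.
    A_eq = np.zeros((2 * m, m * m))
    for u in range(m):
        A_eq[u, u * m:(u + 1) * m] = 1.0      # row sums -> marginal p
    for v in range(m):
        A_eq[m + v, v::m] = 1.0               # column sums -> marginal q
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun
```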

Algorithmically, EMD amounts to a linear program in $\mathcal{O}(L^6)$ variables subject to $\mathcal{O}(L^3)$ constraints, i.e., a significant computation. However, in the wavelet domain [18], EMD enjoys a fast (weighted-$\ell_1$) wavelet approximation [7], which we refer to as WEMD:

$d_{\mathrm{WEMD}}(x_i, x_j) = \sum_{\lambda} 2^{-5s/2} \left| \mathcal{W} x_i(\lambda) - \mathcal{W} x_j(\lambda) \right|.$   (2)

Here, $\mathcal{W} x$ denotes the 3-D wavelet transform of $x$, and the index $\lambda$ contains the shifts $(m_1, m_2, m_3) \in \mathbb{Z}^3$ and the scale $s \ge 0$. More explicitly, $\mathcal{W}$ decomposes $x = x[u_1, u_2, u_3]$ with respect to an orthonormal basis of functions,

$2^{3s/2}\, f(2^s u_1 - m_1)\, g(2^s u_2 - m_2)\, h(2^s u_3 - m_3),$

for varying $s \ge 0$, varying $(m_1, m_2, m_3) \in \mathbb{Z}^3$, and $(f, g, h)$ ranging over $\{\psi, \omega\}^3 \setminus \{(\omega, \omega, \omega)\}$, where $\psi, \omega$ are certain 1-D functions called the mother and father wavelet [18]. Formula (2) approximates EMD in the sense that $d_{\mathrm{EMD}}$ and $d_{\mathrm{WEMD}}$ are strongly equivalent metrics, i.e., there exist constants $C \ge c > 0$ such that for all $x, y \in \mathbb{R}^{L^3}$, we have:

$c\, d_{\mathrm{WEMD}}(x, y) \le d_{\mathrm{EMD}}(x, y) \le C\, d_{\mathrm{WEMD}}(x, y).$

Moreover, there are known bounds on the ratio $C/c$, depending on the type of wavelet used. We chose the Coiflet 3 wavelet since it gives a small ratio [7]. Wavelet transforms are computed in linear time, thus the same holds for the EMD approximation. We implemented the approximation (2) using the PyWavelets package [19].
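The following sketch shows how the approximation (2) can be assembled from PyWavelets. It is an illustrative reading of Eq. (2), not the released implementation; in particular, the mapping between pywt decomposition levels and the scale index $s$, and the omission of the pure father-wavelet (approximation) coefficients, are our assumptions.

```python
import numpy as np
import pywt

def wemd_vector(volume, wavelet="coif3", level=5):
    """Weighted wavelet-coefficient vector of a 3-D density map, so that the
    plain l1 distance between two such vectors gives d_WEMD of Eq. (2).
    Assumptions (ours): the coarsest detail level returned by pywt.wavedecn
    is assigned scale s = 0 and each finer level increments s by one; the
    pure father-wavelet (approximation) coefficients are dropped, matching
    the basis in Eq. (2)."""
    coeffs = pywt.wavedecn(volume, wavelet=wavelet, level=level)
    pieces = []
    for s, detail in enumerate(coeffs[1:]):       # coarsest -> finest detail levels
        weight = 2.0 ** (-2.5 * s)                # the 2^{-5s/2} factor of Eq. (2)
        pieces.extend(weight * d.ravel() for d in detail.values())
    return np.concatenate(pieces)

def wemd(vol_i, vol_j, **kwargs):
    """Wavelet approximation of the Earthmover's distance, Eq. (2)."""
    return np.abs(wemd_vector(vol_i, **kwargs) - wemd_vector(vol_j, **kwargs)).sum()
```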

3. RESULTS

To test our methods, we generated two synthetic datasets of 3-D density maps that are simplified models of the conformation space of ATP synthase [8]. This enzyme is a molecular stepper motor with a central asymmetric axle that rotates in steps of 120° relative to the F1 subunit, with short transient motion in-between the three dominant conformations. Here, the intrinsic geometry is a circle, with a sampling density concentrated around three equispaced angles. We simulated this motion by generating 3-D density maps in which the F1 subunit is held in place while the F0 and axle subunits are rotated together by a random angle. The angles were drawn i.i.d. according to the following mixture model:

$\tfrac{2}{5}\, U[0, 360] + \tfrac{1}{5}\, \mathcal{N}(0, 1) + \tfrac{1}{5}\, \mathcal{N}(120, 1) + \tfrac{1}{5}\, \mathcal{N}(240, 1),$

where U and 𝓝 denote uniform and Gaussian distributions, respectively. To form our datasets, we downloaded entry 1QO1 [20] from the Protein Data Bank [21], produced 3-D density maps at a 6Å resolution with array dimensions 47×47×107 using the molmap command in UCSF Chimera [22], and then took random rotations of the F0 and axle subunits. From this, we generated a clean dataset and a noisy dataset. For the latter, i.i.d. Gaussian noise was added with mean zero and standard deviation equal to one-tenth of the maximum voxel value.
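For illustration, here is a minimal NumPy sketch of the angle-sampling step and the noise model; the rotation of the atomic model and the density-map generation in Chimera are not reproduced, and the function names are ours.

```python
import numpy as np

def sample_rotor_angles(n, rng=None):
    """Draw n rotor angles (in degrees) from the mixture model above:
    2/5 uniform on [0, 360] and 1/5 Gaussian (std 1 degree) around each
    of 0, 120 and 240 degrees."""
    rng = np.random.default_rng() if rng is None else rng
    comp = rng.choice(4, size=n, p=[2/5, 1/5, 1/5, 1/5])
    uniform = rng.uniform(0.0, 360.0, size=n)
    gaussian = rng.normal(120.0 * (comp - 1), 1.0)   # means 0, 120, 240 for comp = 1, 2, 3
    return np.where(comp == 0, uniform, gaussian) % 360.0

def add_noise(volume, rng=None):
    """i.i.d. Gaussian noise with std equal to one tenth of the maximum voxel value."""
    rng = np.random.default_rng() if rng is None else rng
    return volume + rng.normal(0.0, 0.1 * volume.max(), size=volume.shape)
```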

We first tested the plausibility of our proposal by comparing the EMD approximation to the Euclidean distance for a range of angular differences using the noiseless dataset (Fig. 3). We then performed 2-dimensional diffusion maps for various sample sizes, using both the Euclidean distance and the wavelet-based approximation to the EMD, as described in the previous section. We computed the wavelet transform up to scale s = 5 for accurate truncation of Eq. (2). This overparameterizes the volumes by a factor of ≈ 3. The resulting embeddings are shown in Fig. 5. The value of the width parameter σ in the Gaussian kernel (1) was handpicked to yield the best results. We note that for the Euclidean diffusion maps, careful tuning of σ was required. However, this was not necessary for the EMD approximation, where a wide range of σ values gave excellent results. Running times (on a 2.8 GHz Intel Core i7) for the computation of EMD and Euclidean-based diffusion maps are listed in Fig. 4.
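Putting the pieces together, a hypothetical end-to-end usage of the sketches above might look as follows. Here `volumes` stands for the list of simulated 3-D density maps, and the median-distance heuristic for σ is merely a placeholder for the handpicked values described above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical end-to-end usage of the sketches above.
vecs = np.stack([wemd_vector(v) for v in volumes])     # weighted wavelet vectors
dist = squareform(pdist(vecs, metric="cityblock"))     # pairwise WEMD (weighted l1)
sigma = np.median(dist)                                # placeholder; sigma was handpicked
coords = diffusion_map(dist, sigma=sigma, k=2)         # 2-D embedding as in Fig. 5
```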

Fig. 3.

Euclidean distance vs. WEMD as a function of the angular difference between two rotor positions of ATP synthase (see Fig. 2). The $\ell_2$ distances are scaled to be comparable to the WEMD. The WEMD is monotone in the magnitude of the angular difference over almost the entire range, whereas the Euclidean distance exhibits this behavior only up to about ±19°.

Fig. 5.

Main results. Euclidean vs. EMD-based diffusion mappings on the clean and noisy ATP synthase datasets for sample sizes n = 25, 50, 100, 200, 400, 800. The Euclidean diffusion maps need more than 400 samples to capture the intrinsic geometry whereas WEMD manages to do so with merely n = 25 samples. The colors encode the (ground truth) angle.

Fig. 4.

Running times [sec] for computing the wavelet transform, all pairwise wavelet Earthmover approximations, and all pairwise $\ell_2$ distances.

4. CONCLUSION

In this paper, we proposed to use Earthmover-based affinities in the diffusion maps framework to analyze molecular conformation spaces. We showed that this results in a marked decrease in the number of samples needed to capture the intrinsic conformation space of ATP synthase. The method is computationally tractable, thanks to a fast wavelet approximation, and robust to noise. Our results show promise, particularly for the analysis of cryo-EM datasets with continuous heterogeneity. More broadly, EMD-based manifold learning could be applied to analyze the variability of other collections of 3-D shapes [23], 2-D images [17], videos and other signals, e.g., to better model animal motion [24]. Our work also raises several interesting theoretical questions: in which cases can one prove that EMD-based manifold learning has a lower sample complexity than manifold learning based on the Euclidean distance? More ambitiously, are there reasonable generative models for variability where EMD is the optimal distance metric?

5. ACKNOWLEDGMENTS

The authors thank Ariel Goldstein, William Leeb, Nicholas Marshall and Stefan Steinerberger for interesting discussions. This work was supported in part by AFOSR FA9550-17-1-0291, ARO W911NF-17-1-0512, the Simons Collaboration in Algorithms and Geometry, the Simons Investigator Award, the Moore Foundation Data-Driven Discovery Investigator Award and NSF BIGDATA Award IIS-1837992.

Footnotes

Reproducibility: Code for generating the results in this paper is available at http://github.com/nathanzelesko/earthmover

6. REFERENCES

[1] Frank J, “New opportunities created by single-particle cryo-EM: the mapping of conformational space,” Biochemistry, vol. 57, no. 6, pp. 888, 2018. doi: 10.1021/acs.biochem.8b00064
[2] Sorzano COS et al., “Survey of the analysis of continuous conformational variability of biological macromolecules by electron microscopy,” Acta Crystallogr. Sect. F Struct. Biol. Commun., vol. 75, no. 1, pp. 19–32, 2019. doi: 10.1107/S2053230X18015108
[3] Schwander P, Fung R, and Ourmazd A, “Conformations of macromolecules and their complexes from heterogeneous datasets,” Philos. Trans. R. Soc. B Biol. Sci., vol. 369, no. 1647, pp. 1–8, 2014. doi: 10.1098/rstb.2013.0567
[4] Moscovich A, Halevi A, Andén J, and Singer A, “Cryo-EM reconstruction of continuous heterogeneity by Laplacian spectral volumes,” Inverse Probl., accepted, 2019. doi: 10.1088/1361-6420/ab4f55
[5] Belkin M and Niyogi P, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Comput., vol. 15, no. 6, pp. 1373–1396, 2003. doi: 10.1162/089976603321780317
[6] Coifman RR et al., “Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps,” PNAS, vol. 102, no. 21, pp. 7426–7431, 2005. doi: 10.1073/pnas.0500334102
[7] Shirdhonkar S and Jacobs DW, “Approximate Earthmover’s distance in linear time,” in CVPR, 2008. doi: 10.1109/CVPR.2008.4587662
[8] Yoshida M, Muneyuki E, and Hisabori T, “ATP synthase – a marvellous rotary engine of the cell,” Nat. Rev. Mol. Cell Biol., vol. 2, no. 9, pp. 669–677, 2001. doi: 10.1038/35089509
[9] Moscovich A, Jaffe A, and Nadler B, “Minimax-optimal semi-supervised regression on unknown manifolds,” in AISTATS, 2017. http://proceedings.mlr.press/v54/moscovich17a.html
[10] Lee AB, Pedersen KS, and Mumford D, “The nonlinear statistics of high-contrast patches in natural images,” Int. J. Comput. Vis., vol. 54, no. 1–3, pp. 83–103, 2003. doi: 10.1023/A:1023705401078
[11] Bérard P, Besson G, and Gallot S, “Embedding Riemannian manifolds by their heat kernel,” Geom. Funct. Anal., vol. 4, no. 4, pp. 373–398, 1994. doi: 10.1007/BF01896401
[12] Jones PW, Maggioni M, and Schul R, “Manifold parametrizations by eigenfunctions of the Laplacian and heat kernels,” PNAS, vol. 105, no. 6, pp. 1803–1808, 2008. doi: 10.1073/pnas.0710175104
[13] Ting D, Huang L, and Jordan M, “An analysis of the convergence of graph Laplacians,” in ICML, 2010. https://icml.cc/Conferences/2010/papers/554.pdf
[14] Coifman RR and Lafon S, “Diffusion maps,” Appl. Comput. Harmon. Anal., vol. 21, no. 1, pp. 5–30, 2006. doi: 10.1016/j.acha.2006.04.006
[15] von Luxburg U, “A tutorial on spectral clustering,” Stat. Comput., vol. 17, no. 4, pp. 395–416, 2007. doi: 10.1007/s11222-007-9033-z
[16] Villani C, Optimal Transport: Old and New, Springer Berlin Heidelberg, 2009. doi: 10.1007/978-3-540-71050-9
[17] Rubner Y, Tomasi C, and Guibas LJ, “The Earthmover’s distance as a metric for image retrieval,” Int. J. Comput. Vis., vol. 40, no. 2, pp. 99–121, 2000. doi: 10.1023/A:1026543900054
[18] Mallat S, A Wavelet Tour of Signal Processing, 3rd edition, Elsevier, 2009. doi: 10.1016/B978-0-12-374370-1.X0001-8
[19] Lee G, Gommers R, Waselewski F, Wohlfahrt K, and O’Leary A, “PyWavelets: a Python package for wavelet analysis,” J. Open Source Softw., vol. 4, no. 36, pp. 1237, 2019. doi: 10.21105/joss.01237
[20] Stock D, Leslie AG, and Walker JE, “Molecular architecture of the rotary motor in ATP synthase,” Science, vol. 286, no. 5445, pp. 1700–1705, 1999. doi: 10.1126/science.286.5445.1700
[21] Rose PW et al., “The RCSB Protein Data Bank: integrative view of protein, gene and 3D structural information,” Nucleic Acids Res., vol. 45, no. D1, pp. D271–D281, 2017. doi: 10.1093/nar/gkw1000
[22] Pettersen EF et al., “UCSF Chimera – a visualization system for exploratory research and analysis,” J. Comput. Chem., vol. 25, no. 13, pp. 1605–1612, 2004. doi: 10.1002/jcc.20084
[23] Ovsjanikov M, Li W, Guibas LJ, and Mitra NJ, “Exploration of continuous variability in collections of 3D shapes,” ACM Trans. Graph., vol. 30, no. 4, pp. 1–10, 2011. doi: 10.1145/2010324.1964928
[24] Hu DL, Nirody J, Scott T, and Shelley MJ, “The mechanics of slithering locomotion,” PNAS, vol. 106, no. 25, pp. 10081–10085, 2009. doi: 10.1073/pnas.0812533106
