Riemannian geometry for efficient analysis of protein dynamics data

Willem Diepeveen; Carlos Esteve-Yagüe; Jan Lellmann; Ozan Öktem; Carola-Bibiane Schönlieb

doi:10.1073/pnas.2318951121

. 2024 Aug 9;121(33):e2318951121. doi: 10.1073/pnas.2318951121

Riemannian geometry for efficient analysis of protein dynamics data

Willem Diepeveen ^a,¹, Carlos Esteve-Yagüe ^a, Jan Lellmann ^b, Ozan Öktem ^c, Carola-Bibiane Schönlieb ^a

PMCID: PMC11331106 PMID: 39121160

Significance

Data tend to live in low-dimensional nonlinear subspaces. The framework of Riemannian geometry can take such nonlinearity into account: If set up properly, data seem to lie on low-dimensional linear spaces, similar data live close by and dissimilar data live far away. At the moment, using custom Riemannian geometry is far from mainstream due to computational and modeling challenges. We construct a mathematical framework in which the computational challenges can be provably overcome and show how this makes modeling easier for protein dynamics data. Within the broader scope of data processing, this mathematical framework has important implications on how to construct Riemannian geometry and the success of the protein geometry has important implications for handling protein dynamics data.

Keywords: protein dynamics, manifold-valued data, Riemannian manifold, interpolation, dimension reduction

Abstract

An increasingly common viewpoint is that protein dynamics datasets reside in a nonlinear subspace of low conformational energy. Ideal data analysis tools should therefore account for such nonlinear geometry. The Riemannian geometry setting can be suitable for a variety of reasons. First, it comes with a rich mathematical structure to account for a wide range of geometries that can be modeled after an energy landscape. Second, many standard data analysis tools developed for data in Euclidean space can be generalized to Riemannian manifolds. In the context of protein dynamics, a conceptual challenge comes from the lack of guidelines for constructing a smooth Riemannian structure based on an energy landscape. In addition, computational feasibility in computing geodesics and related mappings poses a major challenge. This work considers these challenges. The first part of the paper develops a local approximation technique for computing geodesics and related mappings on Riemannian manifolds in a computationally feasible manner. The second part constructs a smooth manifold and a Riemannian structure that is based on an energy landscape for protein conformations. The resulting Riemannian geometry is tested on several data analysis tasks relevant for protein dynamics data. In particular, the geodesics with given start- and end-points approximately recover corresponding molecular dynamics trajectories for proteins that undergo relatively ordered transitions with medium-sized deformations. The Riemannian protein geometry also gives physically realistic summary statistics and retrieves the underlying dimension even for large-sized deformations within seconds on a laptop.

1. Introduction

Protein dynamics data are becoming available at an ever-increasing rate due to developments in molecular dynamics simulation (1, 2)—generating more trajectories of large molecular assemblies across longer time scales—and also due to developments on the experimental side through recent advances in the processing of heterogeneous cryogenic electron microscopy data (3, 4)—enabling reconstruction of several discrete conformations or continuous large-scale motions of proteins and complexes. With this data, proper analysis tools are becoming increasingly important to extract information, e.g., interpolation, extrapolation, computing meaningful summary statistics, and dimension reduction. Consequently, the development of such tools has been an active area of research for several decades. In particular, suitable interpolation between discrete conformations has given insight into conformational transitions without running full-scale molecular dynamics simulations (5). Extrapolation has enabled one to explore different regions of conformation space (6). Finally, a good notion of mean conformation and of low-rank approximation has retrieved the most important large-scale motions (7) in continuous trajectory datasets.

The construction of such tools relies strongly on the protein energy landscape, and even more so on normal mode analysis (NMA), which is a technique to describe the flexible states accessible to a protein from an equilibrium conformation through linearization of the energy landscape. This idea was explored in the seminal works by Go et al. (8) and by Brooks and Karplus (9, 10) in the early 1980s. The idea was taken further by Tirion (11), who approximates the physical energy by a Hookean potential to resolve the issue of possible negative eigenvalues of the Hessian of the physical energy. The latter approximation has many extensions (12), and has been highly successful in structural biology (13, 14).

Despite the success of the NMA framework, there is a major conceptual disadvantage when it is used for analysis of protein conformations: NMA is local and linear by construction, and it has been observed numerous times that the quality of the approximations degrades for large-scale deformations (15). To resolve the problem of locality and linearity, ideal energy landscape-based data analysis tools should be able to interpolate and extrapolate over energy-minimizing nonlinear paths, compute nonlinear means on such energy-minimizing paths and compute low-rank approximations over curved subspaces spanned by energy-minimizing paths. Deep learning has recently emerged as one way of modeling with nonlinearity under physics-based constraints; see, for example, ref. 16. However, in addition to requiring substantial training time, such methods focus on physically correct interpolation within the—protein-specific—training dataset. Therefore, they cannot be expected to generalize well to proteins that are structurally very different.

Instead, the framework of Riemannian geometry (17, 18) could be a more suitable choice. Here, interpolation can be performed over nonlinear geodesics or related higher-order interpolation schemes (19), extrapolation can be done using the Riemannian exponential mapping, the data mean is naturally generalized to the Riemannian barycenter (20), and low rank approximation knows several extensions that find the most important geodesics through a dataset using the Riemannian logarithmic mapping (21–23).^* Riemannian geometry also allows to specify the type of nonlinearity, i.e., there is a choice which curves are length-minimizing—which then naturally implies the distance between any two points. In particular, there is the possibility to take physics into account in building a custom geometry for proteins through modeling the Riemannian manifold structure after the energy landscape. Then, ideally, length-minimizing curves over which we perform data analysis are energy-minimizing, and the distance between two conformations corresponds to the energy needed for the transition of one into the other.

In this work, we set out to construct such a geometry for proteins that will enable more natural data analysis of protein dynamics data as manifold-valued data. In particular, our goal is threefold: i) construct a smooth manifold of protein conformations ii) with a suitable Riemannian structure that models the conformation energy landscape, iii) which allows computationally feasible analysis of protein dynamics data.

1.1. Related Work.

As a geometry for proteins comes in three parts, we will discuss the related work in a similar fashion.

1.1.1. Smooth manifolds of protein conformations.

Before constructing a suitable and computationally feasible Riemannian manifold, one needs a smooth manifold (25, 26) whose elements represent various conformations of a fixed protein. The type of Riemannian structure we aim for—one that is based on the energy landscape—comes with constraints. In particular, since the energy boils down to a sum of interaction potentials that only depend on the Euclidean distance between pairs of atoms in the protein, two conformations can only be distinguished up to rigid-body transformations. This observation leaves two choices for constructing a smooth manifold that allows for such a symmetry. We either choose an already invariant parametrization that is also a smooth manifold or choose to identify each protein conformation—in a suitable subset of protein conformations—with all of its rigid-body transformations, i.e., we construct a quotient manifold. As we will see below, several challenges need to be overcome before using either approach.

A classical parametrization of a protein is to model the macromolecule as a chain of peptides, each one living in a plane, where each plane is determined by two conformation angles per peptide, e.g., see ref. 27 in which the extension to so-called fat graphs to encode hydrogen bonds is also discussed. This parametrization is a smooth manifold and is invariant under rotations and translations, but not reflections. In other words, it is not a proper invariant parametrization manifold. In addition, to the best of our knowledge, another invariant parametrization manifold has not arisen in the literature so far.

Constructing quotient manifolds has been explored extensively in the shape manifold literature (see ref. 28 for recent accounts of the theory). In particular, the modeling of shapes in a given embedding space can roughly be broken down into point clouds, curves, diffeomorphisms, and surfaces (29)—with suitable symmetries quotiented out. Out of these four options, the first three—point clouds, curves, diffeomorphisms—seem most suitable for modeling protein conformations. However, as we will see, each of them comes with a challenge. Without going into details, point clouds—being the classical choice (8–10) for protein modeling—modulo rigid body motions are typically no longer smooth manifolds in dimension 3 or higher (30), curves are additionally bound to have issues with the Riemannian constraint as it is not trivial how to extend the energy landscape to a continuum of interacting parts of the curve, and any Riemannian structure on diffeomorphisms will by construction also be independent of the conformation it acts on, which is not what we aim to model.

We note that out of the above, the challenge underlying the point cloud-based quotient manifold representation is easiest to alleviate.

1.1.2. Riemannian geometry for protein conformations.

Having an invariant parametrization manifold or quotient manifold, the approaches one can follow to construct Riemannian geometry are similar. For point clouds and curves, the canonical construction is the norm (31), i.e., constructing a metric tensor field (17, 18), which would also be the default in the conformation angle framework. For diffeomorphisms, there are more options, e.g., considering different reproducing kernel Hilbert spaces for the diffeomorphic flow vector fields (32) or considering extended frameworks such as metamorphosis (33).

In any case, the focus of construction has historically been on attaining well-posedness rather than on imposing physical laws or biological mechanisms. To the best of our knowledge, there is no readily available energy landscape-based Riemannian geometry for protein conformations. More recently, there has been a surge in work on bringing physics and biology into the diffeomorphism framework (34). In particular, through adding a regularizer within a hybrid approach (35), or through growth models (36) where external actions can also be taken into account. Although the diffeomorphism framework still suffers from the same conceptual problem as discussed above, it should be mentioned that the ideas underlying these approaches—inheriting the mathematical rigor from established theory, but also encoding physics or biology—are something to strive for.

1.1.3. Computational feasibility for data analysis.

When analyzing data on a Riemannian manifold, we need access to several standard manifold mappings. In particular, from the discussion above we have seen, that for basic data processing tasks, we will at least need the ability to compute geodesics, the exponential mapping, and the logarithmic mapping—or good approximations thereof.

In the canonical form of Riemannian geometry, i.e., having just a metric tensor field, closed-form expressions for the above-mentioned mappings can realistically only be expected when having special structure from the specific choice of metric (37). In other words, the necessary manifold mappings are not readily available, but need to be approximated—typically by solving the geodesic equation under specific boundary conditions (38). Such an approach to accessing these mappings is very slow in practice and can be numerically unstable—as in the conformation angle framework (39)—which renders it unfeasible for the analysis of large high-dimensional datasets.

Computationally feasible approximations for all of these mappings, i.e., without solving the geodesic equation, require extra structure as well (40) or can to a certain extent be learned by neural networks (41). However, the proposed additional structure from ref. 40 is not very suitable for Riemannian manifolds with a nontrivial topology that we can expect for quotient manifolds or invariant representation manifolds, and in the case of learning the mappings, we will need to be very careful with taking the appropriate symmetries into account and with the curation of the training data to ensure generalization—although it should be said that this could be justified if we need more than the basic mappings, e.g., their differentials.

1.2. Contributions.

Among the manifolds outlined above, a point cloud-based quotient manifold can be suitable to build geometry upon and is additionally most in line with what is being used in practice. However, regarding computational feasibility and Riemannian geometry for proteins, we have seen that there are neither a readily available additional structure which we can impose upon such quotient manifolds for efficient approximations of manifold mappings nor suitable Riemannian structures we could even aim to make computationally feasible. To address these two challenges, we will have to address three questions: i) what additional structure would be more suitable for efficient Riemannian geometry on point cloud quotient manifolds, ii) how do we construct Riemannian geometry on such spaces within this framework, and iii) what would this structure concretely look like for proteins? Naturally, our contribution of this work is also threefold:

1.2.1. Computationally feasible Riemannian geometry.

For attaining computational feasibility we propose the general notion of separation, as a relaxation of the Riemannian distance, for additional structure. We show that separations are local approximations of the Riemannian distance (Theorem 3.2), that yield a closed-form approximation of the logarithmic mapping (Corollary 3.3), and that enable—under mild conditions—the construction of well-posed optimization problems to approximate geodesics and the exponential mapping in a provably efficient fashion (Theorems 3.4 and 3.5).

1.2.2. Constructing Riemannian geometry for point cloud conformations.

We resolve the challenge of constructing a quotient manifold out of point clouds by passing to an appropriate subset before taking the quotient (Theorem 4.2) and provide guidelines for constructing a metric tensor field and a separation from a suitable family of metrics (Theorem 4.3). We note that these results are still generally applicable for designing geometry on point clouds, so can also be used beyond the protein dynamics data application.

1.2.3. Riemannian geometry for efficient analysis of protein dynamics data.

Finally, we propose a specific metric and show in numerical results that the Riemannian manifold constructed from this metric is suitable for protein dynamics data processing through several examples with molecular dynamics trajectories of the adenylate kinase protein and the SARS-CoV-2 helicase nsp 13 protein. Our most encouraging findings are that on a coarse-grained level, the Riemannian manifold structure approximately recovers complete adenylate kinase trajectories from just a start- and end-points (Fig. 1), gives physically realistic summary statistics for both proteins and retrieves the underlying dimension despite large conformational changes—all within mere seconds on a laptop. In particular, we demonstrate that our approach can break down the former 636-dimensional conformations to just a nonlinear 1-dimensional space and break down the latter 1,764-dimensional conformations to a nonlinear 7-dimensional space without a significant loss in approximation accuracy.

Fig. 1. — Several snapshots of the $C_{α}$ atoms of the adenylate kinase protein under a closed ( $t = 0$ ) to open ( $t = 1$ ) transition. The *Top* row is generated from a molecular dynamics simulation, the *Middle* row is a $w^{δ}$ -geodesic between the end points and the *Bottom* row a $w^{δ}$ -geodesic obtained from rank 1 approximation of the data at the $w^{δ}$ -barycenter. The begin-to-end $w^{δ}$ -geodesic captures the large-scale motion well, having only small errors at the *Lower* green and yellow part of the protein, and the $w^{δ}$ -geodesic obtained from the rank 1 approximation captures the data almost perfectly. A detailed exposition of the overall errors and atom-wise errors are shown in Fig. 3.

1.3. Outline.

This article is structured as follows. Section 2 covers basic notation from differential and Riemannian geometry. In Section 3, we define the notion of separation on general Riemannian manifolds and show how this object enables the construction of provably computationally feasible approximations of geodesics, the exponential mapping, and the logarithmic mapping. Section 4 proposes a point cloud-based differential geometry of protein conformations and general guidelines for additionally constructing protein geometry within the proposed separation-based framework. Section 5 uses these developed guidelines and considers a concrete example with clear ties to normal mode analysis, from which separation-based computationally feasible Riemannian geometry for proteins is constructed. In Section 6, we argue for deviating from established standards for more advanced data analysis tools and provide guidelines for doing so in the case of low rank approximation. The proposed Riemannian protein geometry is then tested numerically in Section 7 through several data analysis tasks with molecular dynamics simulation data of the adenylate kinase protein and the SARS-CoV-2 helicase nsp 13 protein.^† Finally, we summarize our findings in Section 8. The main article describes the general mathematical ideas and major results.^‡ All the proofs are carried out in the appendix.

2. Notation

Here, we present some basic notations from differential and Riemannian geometry, see refs. 17, 18, 25, and 26 for details.

Let $M$ be a smooth manifold. We write $C^{\infty} (M)$ for the space of smooth functions over $M$ . The tangent space at $p \in M$ , which is defined as the space of all derivations at $p$ , is denoted by $T_{p} M$ , and for tangent vectors, we write $Ξ_{p} \in T_{p} M$ . For the tangent bundle, we write $T M$ and smooth vector fields, which are defined as smooth sections of the tangent bundle, are written as $X (M) \subset T M$ .

A smooth manifold $M$ becomes a Riemannian manifold if it is equipped with a smoothly varying metric tensor field $(\cdot, \cdot) : X (M) \times X (M) \to C^{\infty} (M)$ . This tensor field induces a (Riemannian) metric $d_{M} : M \times M \to R$ . The metric tensor can also be used to construct the Levi-Civita connection that is denoted by $\nabla_{(\cdot)} (\cdot) : X (M) \times X (M) \to X (M)$ . This connection is in turn the cornerstone of a myriad of manifold mappings. One is the notion of a geodesic, which for two points $p, q \in M$ is defined as a curve $γ_{p, q} : [0, 1] \to M$ with minimal length that connects $p$ with $q$ . Another closely related notion is the curve $t \mapsto γ_{p, Ξ_{p}} (t)$ for a geodesic starting from $p \in M$ with velocity ${\dot{γ}}_{p, Ξ_{p}} (0) = Ξ_{p} \in T_{p} M$ . This can be used to define the exponential map ${exp}_{p} : D_{p} \to M$ as ${exp}_{p} (Ξ_{p}) : = γ_{p, Ξ_{p}} (1)$ , where $D_{p} \subset T_{p} M$ is the set on which $γ_{p, Ξ_{p}} (1)$ is defined. Furthermore, the logarithmic map ${log}_{p} : exp (D_{p}^{'}) \to D_{p}^{'}$ is defined as the inverse of ${exp}_{p}$ , so it is well defined on $D_{p}^{'} \subset D_{p}$ , where ${exp}_{p}$ is a diffeomorphism. Moreover, the Riemannian gradient of a smooth function $F \in C^{\infty} (M)$ denotes the unique vector field $grad F \in X (M)$ such that ${(grad F, Ξ)}_{p} : = Ξ_{p} F : = D_{p} F (\cdot) [Ξ_{p}]$ holds for any $Ξ \in X (M)$ and $p \in M$ , where $D_{p} F (\cdot) : T_{p} M \to R$ denotes the differential of $F$ .

Finally, if $M$ is a quotient manifold, i.e., $M : = N / G$ for some Riemannian manifold $(N, (\cdot, \cdot))$ and Lie group $G$ , we denote its elements with square brackets $[p] \in N / G$ . Besides a tangent space, quotient manifolds are equipped with a vertical and horizontal space for $p \in [p]$ . The former is defined as $V_{p} N : = ker (D_{p} π)$ , where $D_{p} π : T_{p} N \to T_{π (p)} N / G$ is the differential of the canonical projection mapping $π : N \to N / G$ , and the latter is the linear subspace $H_{p} N \subset T_{p} N$ such that $H_{p} N \cap V_{p} N = {0}$ and $H_{p} N \oplus V_{p} N = T_{p} N$ . To distinguish horizontal vectors and tangent vectors, we write $Ξ_{⋄ p} \in H_{p} N$ and $Ξ_{[p]} \in T_{[p]} N / G$ .

3. Computationally Feasible Riemannian geometry

In this section, we focus on constructing efficient approximations of several manifold mappings. We restrict ourselves to the main results and refer the reader to SI Appendix, section 2 for auxiliary lemmas and the proofs.

Realistically, efficient approximation of manifold mappings requires additional structure as argued in Section 1.1. Our approach is similar to the variational time discretization approach by Rumpf and Wirth (40), which has shown to be successful for several applications (42–44). The key difference is that we use fully intrinsic notions that do not rely on Euclidean Taylor approximations or ambient Banach space structure. This allows for several natural approximation schemes that are distinct from the ones proposed in ref. 40 and are more general.^§ Our theoretical framework relies on the notion of separation.

Definition 3.1

(separation). Let $(M, (\cdot, \cdot))$ be a Riemannian manifold. A mapping $w : M \times M \to R$ is called a separation with respect to the Riemannian metric tensor field $(\cdot, \cdot)$ if it satisfies the following properties:

(i)
$w$ is a metric on $M$ ,

(ii)
for any $p \in M$ , there exists a neighborhood of $p$ in which the mapping $w {(p, \cdot)}^{2}$ is smooth,

(iii)
for any $Ξ_{p} \in T_{p} M$ at any $p \in M$ the identity $\nabla_{Ξ_{p}} grad w {(p, \cdot)}^{2} = 2 Ξ_{p} \in T_{p} M$ holds.

As we will see below, separations can be used to approximate basic Riemannian manifold mappings. This works primarily because they are designed to approximate the Riemannian distance, as shown in the following theorem.

Theorem 3.2

(Relation between a separation and $d_{M}$ ). Let $(M, (\cdot, \cdot))$ be a Riemannian manifold and $d_{M} : M \times M \to R$ be the distance on $M$ generated by $(\cdot, \cdot)$ . Furthermore, let $w : M \times M \to R$ be a separation on $M$ with respect to $(\cdot, \cdot)$ .

Then, for $p \in M$ ,

$\begin{matrix} w {(p, q)}^{2} = d_{M} {(p, q)}^{2} + O (d_{M} {(p, q)}^{3}), as q \to p, \end{matrix}$ [1]

and the approximation Eq. 1 becomes exact, i.e., for any $q$ in a neighborhood of $p \in M$ ,

$\begin{matrix} w {(p, q)}^{2} = d_{M} {(p, q)}^{2} \end{matrix}$ [2]

if and only if additionally $grad w {(p, \cdot)}^{2} ∣_{r} = - 2 {log}_{r} p$ for any $r \in γ_{q, p}$ .

We note that property Eq. 1 is very useful in practice and also underlies the convergence of the variational time discretization approximations in ref. 40. However, in approximating the manifold mappings we take a different route. For one, a separation directly gives us a first-order approximation of the Riemannian logarithm—rather than solving a discrete geodesic problem first. That is, defining the $w$ -logarithmic map ${log}_{p}^{w} : M \to T_{p} M$ as

\begin{matrix} {log}_{p}^{w} (q) : = - \frac{1}{2} grad w {(q, \cdot)}^{2} ∣_{p} \in T_{p} M, \end{matrix}

[3]

we have the following result.

Corollary 3.3.

Under the assumptions in Theorem 3.2, the $w$ -logarithmic map satisfies for any $p \in M$ ,

$\begin{matrix} ‖ {log}_{p}^{w} q - {log}_{p} {q ‖}_{p} \in O (d_{M} {(p, q)}^{2}), as q \to p . \end{matrix}$ [4]

For the geodesics, we will also resort to a different approximation scheme. In particular, we construct the $w$ -geodesic $γ_{p, q}^{w} : [0, 1] \to M$ as

\begin{matrix} γ_{p, q}^{w} (t) \in \underset{r \in M}{argmin} \frac{1 - t}{2} w {(p, r)}^{2} + \frac{t}{2} w {(r, q)}^{2} . \end{matrix}

[5]

Note that even if we could choose $w$ to be the exact Riemannian metric $d_{M}$ , the optimization problem Eq. 5 does not necessarily have minimizers. So we have to be careful here. On the flip side, if $w : = d_{M}$ and a solution exists, the problem is locally strongly geodesically convex, i.e., strongly convex along geodesics (45), and geodesics $γ_{p, q} (t)$ at time $t$ are minimizers, which at least gives consistency.

Ideally, a separation would inherit such properties. As it turns out, existence is guaranteed under an additional metric completeness assumption, whereas geodesic convexity is inherited directly.

Theorem 3.4.

Let $(M, (\cdot, \cdot))$ be a Riemannian manifold, and consider any metric $w : M \times M \to R$ on $M$ and the minimization problem Eq. 5 for arbitrarily fixed $t \in [0, 1]$ .

(i)
If $w$ is a complete metric, a minimizer exists.

(ii)
If $w$ is a separation with respect to $(\cdot, \cdot)$ , the problem is locally strongly geodesically convex for $p$ and $q$ close enough.

Although property i) in Theorem 3.4 is important from a mathematical point of view, property ii) has more practical consequences. In particular, as a consequence, we enjoy linear convergence when solving the problem with simple Riemannian gradient descent (46).^¶ Since the exponential mapping is not available, we assume a retraction, i.e., a first-order approximation of the exponential mapping, instead, without affecting the convergence rate (47). In other words, having a complete separation yields a more convenient and lower-dimensional optimization problem than computing discrete geodesics (40).

Finally, for approximating the exponential mapping beyond first-order we are using a slight modification of the recursive scheme used in ref. 40. Assuming a retraction ${retr}_{p} : T_{p} M \to M$ , we construct the $w$ -exponential mapping ${exp}_{p}^{w} : T_{p} M \to M$ as

\begin{matrix} {exp}_{p}^{w} (Ξ_{p}) : = p^{M} \in M, \end{matrix}

[6]

where

\begin{matrix} p^{k} : = γ_{p^{k - 2}, p^{k - 1}}^{w} (2), p^{0} = p, and p^{1} = {retr}_{p} (\frac{1}{M} Ξ_{p}), \end{matrix}

[7]

for some $M \in N$ . Similarly to $w$ -geodesics, the $w$ -exponential mapping is consistent with the exponential mapping if $w : = d_{M}$ and $p^{1} = {exp}_{p} (\frac{1}{M} Ξ_{p})$ .

Note that $t = 2$ in Eq. 7 lies outside of the admissible region in Theorem 3.4. However, this is amendable and we obtain a similar result. So once again, we obtain a provably efficient approximation.

Theorem 3.5.

Let $(M, (\cdot, \cdot))$ be a Riemannian manifold, and consider any metric $w : M \times M \to R$ on $M$ and the minimization problem Eq. 5 for $t = 2$ .

If $w$ is a complete metric, a minimizer exists.

If $w$ is a separation with respect to $(\cdot, \cdot)$ , the problem is locally strongly geodesically convex for $p$ and $q$ close enough.

Remark 3.6:

We note that a stronger version of Theorem 3.5, i.e., whether there is a local or uniform $M$ such that strong geodesic convexity is true for each of the minimization problems, remains open and is beyond the scope of this work.

Besides approximations for the $exp$ , $log$ , and geodesics, the proposed separation-based framework and the above-mentioned results can also be used to construct approximations for parallel transport and the curvature operator. We refer the reader to SI Appendix, section 2.E for details.

4. Constructing Riemannian Geometry for Point Cloud Conformations

The goal in this section is to set up guidelines for constructing Riemannian geometry for proteins that comes with a separation on this space, which then enables access to all computationally feasible manifold mapping approximations proposed in Section 3. The key idea is to construct a metric first and construct a metric tensor field on which the original metric is a separation. For full detail, we refer the reader to the proofs in SI Appendix, section 3.

Before considering Riemannian geometry, we first need a smooth manifold. As motivated in Section 1.1, we will start from a point cloud-based manifold and take a suitable quotient. We start with the sets

\begin{matrix} R_{∙}^{d \times n} : = {X \in R^{d \times n} ∣ X = (x_{1}, \dots, x_{n}) s.t. x_{i} \neq x_{j} \in R^{d}}, \end{matrix}

[8]

and

\begin{matrix} R_{d, ⋆}^{d \times n} : = {X \in R^{d \times n} ∣ X - \frac{1}{n} X 1_{n} 1_{n}^{⊤} \in R_{d}^{d \times n}}, \end{matrix}

[9]

where $1_{n} : = {(1, \dots, 1)}^{⊤} \in R^{n}$ and $R_{d}^{d \times n}$ is the set of matrices of size $d \times n$ and rank $d$ . We note that $R_{∙}^{d \times n}$ models the physical constraint that two atoms or coarse-grained pseudoatoms cannot overlap, whereas $R_{d, ⋆}^{d \times n}$ models that point clouds do not live in an affine subspace and is chosen so that the quotient space defined in Theorem 4.2 will actually be a smooth manifold.

Our first step is combining the two sets through intersection, which gives us a smooth manifold.

Proposition 4.1.

The set

$\begin{matrix} P (d, n) : = R_{d, ⋆}^{d \times n} \cap R_{∙}^{d \times n} \end{matrix}$ [10]

is a smooth $(n \cdot d)$ -dimensional manifold if $n \geq d + 1$ .

Next, we consider left group actions on $P (d, n)$ . More generally, if $G$ is a Lie group and $M$ is a smooth manifold, a left group action of $G$ on $M$ is a map $θ : G \times M \to M$ that satisfies identity and compatibility

\begin{matrix} \begin{matrix} θ (e, p) = p, & for all p \in M, \\ θ (g_{1}, θ (g_{2}, p)) = θ (g_{1} \cdot g_{2}, p), & for all g_{1}, g_{2} \in G, p \in M . \end{matrix} \end{matrix}

[11]

In the following, we will consider the Euclidean group $E (d)$ . We represent elements of this Lie group in the canonical way as $(O, t) \in E (d)$ , where $O \in O (d)$ is a $d \times d$ orthogonal matrix and $t \in R^{d}$ is a (translation) vector. We can extend the left group action of $E (d)$ on $R^{d}$ to an action on $P (d, n)$ . That is, we define $θ : E (d) \times P (d, n) \to P (d, n)$ as

\begin{matrix} θ ((O, t), X) : = O X + t 1_{n}^{⊤}, (O, t) \in E (d), X \in P (d, n) . \end{matrix}

[12]

The left group action Eq. 12 on $P (d, n)$ allows us to construct the quotient space $P (d, n) / E (d)$ . By construction, this quotient space is also a smooth manifold.

Theorem 4.2.

The quotient space $P (d, n) / E (d)$ is a smooth ( $n \cdot d - (d + 1) \cdot d / 2$ )-dimensional manifold if $n \geq d + 1$ .

Now that we have a smooth quotient manifold, the next step is to construct a Riemannian structure that is computationally feasible, yet adapted for protein conformations. The following result gives clear guidelines for doing that, i.e., it shows that having a suitable metric on the quotient space $P (d, n) / E (d)$ always allows us to find a Riemannian manifold on which that metric is a separation, which gives cheap manifold mappings through the results from Section 3.

Theorem 4.3.

Consider a metric $w : P (d, n) / E (d) \times P (d, n) / E (d) \to R$ on $P (d, n) / E (d)$ for $n \geq d + 1$ of the form

$\begin{matrix} w ([X], [Y]) = \tilde{w} (X, Y), X \in [X], Y \in [Y], \end{matrix}$ [13]

where $\tilde{w} : P (d, n) \times P (d, n) \to R$ is a mapping that is invariant under $E (d)$ action in both arguments. Additionally, assume that $\tilde{w} {(X, \cdot)}^{2}$ is smooth in a neighborhood of $X$ for any $X \in P (d, n)$ and that the Euclidean Hessian $D_{X} \nabla \tilde{w} {(X, \cdot)}^{2} : R^{d \times n} \to R^{d \times n}$ is positive definite when restricted to the horizontal space $H_{X} P (d, n)$ generated by $(P (d, n), {(\cdot, \cdot)}_{2})$ and has kernel $ker (D_{X} \nabla \tilde{w} {(X, \cdot)}^{2}) = V_{X} P (d, n)$ for all $X \in P (d, n)$ , where ${(\cdot, \cdot)}_{2}$ is the $ℓ^{2}$ inner product.

Then, Euclidean Hessian $X \mapsto D_{X} \nabla \tilde{w} {(X, \cdot)}^{2}$ generates a Riemannian structure on $P (d, n) / E (d)$ . In particular, the bilinear form $(\cdot, \cdot) : X (P (d, n) / E (d)) \times X (P (d, n) / E (d)) \to C^{\infty} (P (d, n) / E (d))$ given by

$\begin{matrix} {(Ξ, Ψ)}_{[X]} : = \frac{1}{2} (D_{X} \nabla \tilde{w} {(X, \cdot)}^{2} [Ξ_{⋄ X}], Ψ_{⋄ X})_{2}, X \in [X], \end{matrix}$ [14]

defines a metric tensor field on $P (d, n) / E (d)$ , and $w$ is a separation under $(\cdot, \cdot)$ .

Remark 4.4:

Even though this work focuses on constructing computationally efficient Riemannian geometry on $P (d, n) / E (d)$ , we would like to emphasize that the ideas behind Theorem 4.3 can be used more generally to show a similar result on other manifolds. We refer the interested reader to SI Appendix, section 3, where the $R^{d}$ case is discussed.

5. A Riemannian Geometry and Separation for Protein Dynamics Data

Now we are ready to choose a Riemannian geometry for efficient analysis of protein conformations through Theorem 4.3. This amounts to choosing a metric suited for quantifying similarity between protein conformations and constructing a metric tensor field and a separation. The metric will have strong similarities to normal mode analysis discussed in Section 1, but is arguably less likely to suffer from locality and linearity issues discussed in Section 1.1. We restrict ourselves once again to stating the main results and refer the reader to SI Appendix, section 4 for the details. Furthermore, all statements will be given for general $d$ -dimensional point clouds—even though we only need the case $d = 3$ for proteins.

Invoking Theorem 4.3 requires selecting a metric of the form Eq. 13. Passing to a Euclidean distance matrix (48) representation of point clouds allows us to construct a whole family of $E (d)$ invariant metrics.

Proposition 5.1.

Let $d_{> 0} : R_{> 0} \times R_{> 0} \to R$ be any metric over the positive real numbers.

Then, the function $w : P (d, n) / E (d) \times P (d, n) / E (d) \to R$ of the form Eq. 13 with $\tilde{w} : P (d, n) \times P (d, n) \to R$ given by

$\begin{matrix} \tilde{w} (X, Y) : = \sqrt{\sum_{i} \sum_{j > i} d_{> 0} (‖ x_{i} - x_{j} ‖_{2}, ‖ y_{i} - y_{j} {‖_{2})}^{2}}, \end{matrix}$ [15]

defines a metric over $P (d, n) / E (d)$ if $n \geq d + 1$ .

The expression in Eq. 15 has a similar form to the classical quadratic—Hookean—potential proposed in ref. 11 when interpreting the $[Y]$ entry as a reference protein conformation. However, there are two important differences: In ref. 11, i) the interactions are restricted to atoms within some cutoff radius—violating the metric axioms—and ii) the quadratic potential stipulates that the distance—or energy needed—to transition into a state $Z \in R^{d \times n} ∖ R_{∙}^{d \times n}$ is finite and small, i.e., to a state with $z_{i} = z_{j}$ for some $i \neq j$ . The latter difference tells us that there are many nonphysical states close by, which is arguably a key source of the locality problem. In reality, it would take infinite energy to move two atoms to the same location, which means that these states should be infinitely far away.^#

Instead, in the following, we will make another choice for $d_{> 0}$ . That is, we choose the complete metric $d_{> 0} : R_{> 0} \times R_{> 0} \to R$ over the positive real numbers given by

\begin{matrix} d_{> 0} (a, b) : = | log (\frac{a}{b}) | . \end{matrix}

[16]

Intuitively, this metric inserted into Eq. 15 tells us that two point-clouds are close together if their respective pair-wise distances are in the same order of magnitude. Such a metric directly solves the locality problem, since local interactions are fairly strong—and blow up when moving toward a state $Z$ with $z_{i} = z_{j}$ —but does not need an explicit cutoff as long-range interactions are relatively weak. We note that such a metric is truly more in line with well-known interaction potentials such as the famous Lennard-Jones potential than the quadratic potential from ref. 11, which is additional biological and physical motivation for the proposed metric. With nonlocality accounted for, nonlinearity is naturally taken care of through availability of $w$ -geodesics, the $w$ -exponential mapping and the $w$ -logarithmic mapping, if we can ensure completeness of the metric.

Remark 5.2:

Note that we also expect from a Riemannian manifold with the metric from Proposition 5.1 under Eq. 16 as separation to preserve small invariant distances, e.g., protein bond lengths, over $w$ -geodesics. We expect the same from $w$ -based Riemannian barycenters (20).

Despite the metric Eq. 16 being complete on the positive real numbers, the metric from Proposition 5.1 under Eq. 16 is not complete on $P (d, n) / E (d)$ yet as the distance—or energy needed—to transition into a state $Z \in R^{d \times n} ∖ R_{d, ⋆}^{d \times n}$ is again finite and small, i.e., to a state where all points lie on an affine subspace. A simple way to resolve this is to add an extra term that blows up toward such states. In particular, we propose the full metric $w^{δ} : P (d, n) / E (d) \times P (d, n) / E (d) \to R$ of the form Eq. 13 with ${\tilde{w}}^{δ}$ defined in Eq. 17,

\begin{matrix} {\tilde{w}}^{δ} (X, Y) : = \sqrt{\sum_{i} \sum_{j > i} (log (\frac{‖ x_{i} - x_{j} ‖_{2}}{‖ y_{i} - y_{j} ‖_{2}}))^{2} + δ (log (\frac{det (\sum_{i} (x_{i} - \frac{1}{n} X 1_{n}) \otimes (x_{i} - \frac{1}{n} X 1_{n}))}{det (\sum_{i} (y_{i} - \frac{1}{n} Y 1_{n}) \otimes (y_{i} - \frac{1}{n} Y 1_{n}))}))^{2}} . \end{matrix}

[17]

where $δ > 0$ and $a \otimes b : = a b^{⊤} \in R^{d \times d}$ for $a, b \in R^{d}$ . The above expression still gives us a metric.

Corollary 5.3.

The mapping $w^{δ} : P (d, n) / E (d) \times P (d, n) / E (d) \to R$ for $δ > 0$ and $n \geq d + 1$ is a complete metric on $P (d, n) / E (d)$ .

Intuitively, the additional term tells us that two conformations are close if their distribution of mass is similar. Indeed, this can easily be seen through realizing that $tr (\sum_{i} (x_{i} - \frac{1}{n} X 1_{n}) \otimes (x_{i} - \frac{1}{n} X 1_{n}))$ is the square of the radius of gyration—a well-known quantity for describing whether proteins are in an open or closed state. In other words, this additional term is unlikely to jeopardize the constructed protein geometry—especially for $δ$ small so that the expected properties discussed in Remark 5.2 are not disturbed too much.

Finally, we have a metric and we are ready to construct a Riemannian manifold through invoking Theorem 4.3. The following result states that our candidate metric $w^{δ}$ is a separation to a particular Riemannian manifold.

Theorem 5.4.

Let $w^{δ} : P (d, n) / E (d) \times P (d, n) / E (d) \to R$ for $δ > 0$ and $n \geq d + 1$ be the metric on $P (d, n) / E (d)$ of the form Eq. 13 with ${\tilde{w}}^{δ} : P (d, n) \times P (d, n) \to R$ defined in Eq. 17.

Then, the Euclidean Hessian $X \mapsto D_{X} \nabla {\tilde{w}}^{δ} {(X, \cdot)}^{2}$ generates a Riemannian structure on $P (d, n) / E (d)$ . In particular, the bilinear form ${(\cdot, \cdot)}^{δ} : X (P (d, n) / E (d)) \times X (P (d, n) / E (d)) \to C^{\infty} (P (d, n) / E (d)) \to R$ given by

$\begin{matrix} {(Ξ, Ψ)}_{[X]}^{δ} : = \frac{1}{2} (D_{X} \nabla {\tilde{w}}^{δ} {(X, \cdot)}^{2} [Ξ_{⋄ X}], Ψ_{⋄ X})_{2}, X \in [X], \end{matrix}$ [18]

defines a metric tensor field on $P (d, n) / E (d)$ and the mapping $w^{δ} : P (d, n) / E (d) \times P (d, n) / E (d) \to R$ is a separation under ${(\cdot, \cdot)}^{δ}$ .

Remark 5.5:

The evaluated Euclidean Hessian is provided in SI Appendix, section 4.C and consists of two terms. The leading first term is similar to the Hessian of the Hookean potential (11) with the difference that the pair-wise distances in the denominators are squared. This effectively gives us smooth decay of the interaction strength rather than having to add an abrupt cutoff and is expected to give normal mode-like behavior.

Overall, Theorem 5.4 and Remark 5.5 tell us that the Riemannian manifold $(P (d, n) / E (d), {(\cdot, \cdot)}^{δ})$ enables computationally feasible data analysis of protein dynamics data along normal mode-like proxy energy-minimizing nonlinear paths without suffering from locality through the mappings from Section 3 .

6. Riemannian Geometry and Analysis of Protein Dynamics Data

Before moving on to numerical experiments, we discuss why we should deviate from established standards for some more advanced data analysis tools. We do this in a case study considering principal component analysis through low rank approximation. First, let us restate our complete approach using the results above. We represent protein conformations as elements of the quotient manifold $P (d, n) / E (d)$ (Theorem 4.2). On this manifold, we construct a physically motivated metric $w^{δ}$ using Eq. 17 that we can efficiently evaluate, and show that this metric is a separation if we equip our manifold with the Riemannian structure in Theorem 5.4. Using this Riemannian structure, we can use the separation $w^{δ}$ to efficiently interpolate (Theorem 3.4) and extrapolate (Theorem 3.5) along nonlinear proxy energy landscape-minimizing curves on the manifold of protein conformations.

On a conceptual level, the goal for Riemannian principal component analysis for protein dynamics data is to decompose the dataset into nonlinear large-scale biologically relevant conformational changes. However, when trying to do this through low-rank approximation, we will find nonlinear geodesic subspaces that capture protein dynamics data best. To see the discrepancy, we will unpack what “best” means. In works such as refs. 22 and 23, the best rank- $r$ approximation is defined as having minimal error in the Riemannian distance. In our case, the metric is modeled after the energy landscape for the various conformations of a given protein. Hence, in our Riemannian geometry, there could be an energetically small conformational change that corresponds to a large-scale conformational change in terms of Euclidean distance, which is deemed less significant than an energetically large conformational change that corresponds to a small-scale conformational change in the Euclidean sense, e.g., due to noise in the data. Since only the large-scale conformational changes in the Euclidean sense are the types of motion we want to decompose our dataset into, we might want to move toward a low rank approximation that gives the lowest error in the Euclidean distance instead—that is, after recentering and registering both conformations in a least-squares sense (49, 50)—which is also more in line with current practices of protein conformation comparison using the RMSD.

So in particular, we aim to find tangent vectors ${{(Ξ_{[X^{0}]}^{r})}_{k}}_{k = 1}^{N} \subset T_{[X^{0}]} P (d, n) / E (d)$ at a reference conformation $X^{0} \in P (d, n)$ that span an $r$ -dimensional tangent subspace such that the loss

\begin{matrix} \sum_{k = 1}^{N} & RMSD {([X_{k}], {exp}_{\bar{[X]}}^{w^{δ}} ({(Ξ_{[\bar{X}]}^{r})}_{k}))}^{2} \\ : = \frac{1}{n} \sum_{k = 1}^{N} {‖ {[X_{k}]}_{⋄ X^{0}} - {exp}_{\bar{[X]}}^{w^{δ}} {({(Ξ_{[\bar{X}]}^{r})}_{k})}_{⋄ X^{0}} ‖}_{2}^{2}, \end{matrix}

[19]

is small, where ${[Y]}_{⋄ X^{0}}$ is the element $Y \in [Y]$ closest to $X^{0}$ in a least-squares sense (50).

Low RMSD can be achieved through a change of inner product. First, we need—similarly to refs. 22 and 23—a point from which we compute the $w^{δ}$ -logarithmic mappings to all points in a dataset ${[X_{k}]}_{k = 1}^{N} \subset P (d, n) / E (d)$ of $N \in N$ points. For that, we will use—in line with previous work—the $w^{δ}$ -barycenter, which is defined as^‖

\begin{matrix} [\bar{X}] \in \underset{[X] \in P (d, n) / E (d)}{argmin} B_{w^{δ}} ([X]), \end{matrix}

[20]

where $B_{w^{δ}} : P (d, n) / E (d) \to R$ is given by

\begin{matrix} B_{w^{δ}} ([X]) : = \frac{1}{2 N} \sum_{k = 1}^{N} w^{δ} {([X], [X_{k}])}^{2} . \end{matrix}

[21]

Having the $w^{δ}$ -barycenter, we collect—again in line with previous work—the tangent vectors ${{log}_{[\bar{X}]}^{w^{δ}} [X_{k}]}_{k = 1}^{N} \subset T_{[\bar{X}]} P (d, n) / E (d)$ , but compute the Gram matrix $Γ \in R^{N \times N}$ with respect to the $ℓ^{2}$ -inner product on any horizontal space $H_{\bar{X}} P (d, n)$ for $\bar{X} \in [\bar{X}]$

\begin{matrix} Γ_{k, l} : = {({({log}_{[\bar{X}]}^{w^{δ}} [X_{k}])}_{⋄ \bar{X}}, {({log}_{[\bar{X}]}^{w^{δ}} [X_{l}])}_{⋄ \bar{X}})}_{2} . \end{matrix}

[22]

The eigendecomposition of the Gram matrix $Γ = U Λ U^{⊤}$ , where $U : = (u_{1}, \dots, u_{N}) \in R^{N \times N}$ orthonormal and $Λ : = diag (λ_{1}, \dots, λ_{N}) \in R^{N \times N}$ with $λ_{1} \geq \dots \geq λ_{N} \geq 0$ , then allows us to compute a rank- $r$ approximation ${(Ξ_{[\bar{X}]}^{r})}_{k}$ of ${log}_{[\bar{X}]}^{w^{δ}} [X_{k}]$ given by

\begin{matrix} {({(Ξ_{[\bar{X}]}^{r})}_{k})}_{⋄ \bar{X}} : = \sum_{k^{'} = 1}^{r} {(u_{k^{'}}, {(\log_{[\bar{X}]}^{w^{δ}} [X_{k}])}_{⋄ \bar{X}})}_{2} u_{k^{'}} . \end{matrix}

[23]

The main rationale for this choice of inner product is that we expect the low rank approximation to achieve a low RMSD. To see this, first note that we expect from ref. 22 that

\begin{matrix} RMSD ([X_{k}], {exp}_{\bar{[X]}}^{w^{δ}} ({(Ξ_{[\bar{X}]}^{r})}_{k})) \\ \leq \frac{C_{k}}{\sqrt{n}} {‖ {({log}_{[\bar{X}]}^{w^{δ}} [X_{k}])}_{⋄ \bar{X}} - {({(Ξ_{[\bar{X}]}^{r})}_{k})}_{⋄ \bar{X}} ‖}_{2}, for C_{k} \geq 1, \end{matrix}

[24]

and since Eq. 23 is now an optimal approximation of the right hand side, we would have that the left hand side error is small as well if $C_{k} \approx 1$ .

Remark 6.1:

The careful reader does not directly conclude from the above discussion that the left hand side error in Eq. 24 will always be small, because of instabilities due to curvature effects that amount to large values for the $C_{k}$ in the case of strong negative curvature (22, Theorem 3.4].

Remark 6.2:

Similarly, for geodesic interpolation, curvature can also cause instabilities with respect to the end points in the case of strong positive curvature (51, Lemma 1].

7. Numerics

In this section, we test the suitability of proposed Riemannian protein geometry for basic data analysis tasks. Remember from Section 1 that ideal energy landscape-based data analysis tools should be able to interpolate and extrapolate over energy-minimizing nonlinear paths, compute nonlinear means on such energy-minimizing paths and compute low-rank approximations over curved subspaces spanned by energy-minimizing paths. So naturally, we test how well the separation-based Riemannian interpretation of these tools— $w^{δ}$ -geodesic interpolation, the $w^{δ}$ -barycenter as a nonlinear mean, and tangent space low-rank approximation from the $w^{δ}$ -barycenter using the $w^{δ}$ -logarithm—capture the energy landscape.

7.1. Data.

For the numerical evaluation, we include protein dynamics datasets with varying levels of (dis)order and varying deformation size. For that, we consider two molecular dynamics simulations: i) the adenylate kinase protein under a (famously) ordered and medium-size closed-to-open transition (52) and ii) the SARS-CoV-2 helicase nsp 13 protein under a more disordered and large-deformation motion (53).

7.2. Outline of Experiments.

Considering the trajectory datasets on a $C_{α}$ -coarse-grained level, we will showcase in Section 7.4 to what extent the $w^{δ}$ -geodesics approximate the trajectories. Subsequently, in Section 7.5, we will compute a low rank approximation at the $w^{δ}$ -barycenter and infer dimensionality. In Section 1, we also argued that suitable data analysis tools need to be computationally feasible. Consequently, we will also show that our separation-based approximations are very fast due to linear convergence of Riemannian gradient descent predicted by theory in Section 3.

Besides these main experiments showcasing the practical use of Riemannian geometry for protein conformation analysis, we will provide several more technical sanity checks for the interested reader. In particular, we check the prediction of Remark 5.2 that small distances—i.e., adjacent $C_{α}$ distances—stay more or less invariant throughout all data analysis tasks (SI Appendix, section 5.A), and check that the curvature effects discussed in Remarks 6.1 and 6.2 are negligible for both datasets (SI Appendix, section 5.B), indicating stability.

7.3. General Experimental Settings.

As mentioned above, we will consider a coarse-grained representation of the adenylate kinase protein and the SARS-CoV-2 helicase nsp 13 protein. In particular, we construct point clouds where the $i$ th point corresponds to the $C_{α}$ position of the $i$ th peptide. As adenylate kinase has 214 peptides and the dataset has 102 frames, we obtain ${[X_{k}^{AK}]}_{k = 1}^{102} \subset P (3, 214) / E (3)$ . As SARS-CoV-2 helicase nsp 13 has 590 peptides and the dataset has 200 frames, we obtain ${[X_{k}^{SC}]}_{k = 1}^{200} \subset P (3, 590) / E (3)$ . Several snapshots are shown in the Top row of Figs. 1 and 2, in which we have normalized time so that frame $1$ corresponds to $t = 0$ and the last frame to $t = 1$ . Subsequently, we construct the Riemannian manifolds $(P (3, 214) / E (3), {(\cdot, \cdot)}^{δ})$ and $(P (3, 590) / E (3), {(\cdot, \cdot)}^{δ})$ with $δ = 1$ and note that the dimensions of these spaces are 636 and 1,764.

Fig. 2. — Several snapshots of the $C_{α}$ atoms of the SARS-CoV-2 helicase nsp 13 protein under disordered motion. The *Top* row is generated from a molecular dynamics simulation, the *Middle* row is a $w^{δ}$ -geodesic between the end points and the *Bottom* row a rank 7 approximation of the data at the $w^{δ}$ -barycenter. The begin-to-end $w^{δ}$ -geodesic does not capture the large-scale motion well due to significant errors at the *Top Left* dark blue part of the protein—indicating that the motion is not 1-dimensional—but rank 7 approximation captures the data almost perfectly. A detailed exposition of the overall errors and atom-wise errors are shown in Fig. 4.

For computing $w^{δ}$ -geodesics and $w^{δ}$ -exponential mappings, we solve the corresponding optimization problems Eq. 5—for their respective begin and end points $[Y]$ and $[Z]$ —using Riemannian gradient descent with unit step size under addition as a retraction (see Section 3) and initialized at the end point $[Z]$ . The parameter $M$ for the $w^{δ}$ -exponential map is chosen as the smallest integer larger than a quarter of the tangent vector norm. Riemannian gradients are linear combinations of the $w^{δ}$ -logarithm, which is provided in SI Appendix, section 5.C. As both the exponential mapping and geodesics solve the same problem, we stop the solver at iteration $ℓ$ if

\begin{matrix} \frac{‖ t {log}_{[X^{ℓ}]}^{w^{δ}} ([Y]) + (1 - t) {log}_{[X^{ℓ}]}^{w^{δ}} {([Z]) ‖}_{[X^{ℓ}]}^{δ}}{w^{δ} ([Y], [Z])} < 10^{- 4}, \end{matrix}

[25]

where $[X^{ℓ}]$ is the $ℓ$ th iterate.

The barycenter problem Eq. 20 is also solved using Riemannian gradient descent with unit step size under addition as a retraction and initialized the trajectory midpoints $[X_{51}^{AK}]$ and $[X_{100}^{SC}]$ respectively. Here too, Riemannian gradients are linear combinations of the $w^{δ}$ -logarithm. The optimization scheme is terminated at iteration $ℓ$ if

\begin{matrix} \frac{‖ grad B_{w^{δ}} {(\cdot) ‖}_{[X^{ℓ}]}^{δ}}{B_{w^{δ}} ([X^{0}])} < 10^{- 4}, \end{matrix}

[26]

where $[X^{ℓ}]$ is the $ℓ$ th iterate.

For visualization of the resulting interpolates and exterpolates and of the dataset itself, the point clouds are registered to the recentered first frame, i.e., $X^{A K, 0} : = X_{1}^{AK} - X_{1}^{AK} 1_{214} 1_{214}^{⊤} \in P (3, 214)$ and $X^{S C, 0} : = X_{1}^{SC} - X_{1}^{SC} 1_{590} 1_{590}^{⊤} \in P (3, 590)$ , in a least-squares sense (50).

Finally, all of the experiments are implemented using PyTorch in Python 3.8 and run on a 2 GHz Quad-Core Intel Core i5 with 16GB RAM. All reported time measurements have been made using the %timeit magic command.

7.4. w^δ-Geodesic Interpolation.

7.4.1. Adenylate kinase.

Several snapshots of $w^{δ}$ -geodesic interpolation between the closed and open state— $[X_{1}^{AK}]$ and $[X_{102}^{AK}]$ respectively—are shown in the Middle row of Fig. 1. Computing the interpolates is very cheap because of the linear convergence of Riemannian gradient descent due to local strong geodesic convexity (Theorem 3.4). For example, computing the mid-point at $t = 0.5$ converges after 4 iterations and takes 245 ms $\pm$ 11.4 ms, which is much faster than it would have taken to compute the trajectory using molecular dynamics simulation. On that note, there are visible differences—although strikingly minor—between the Top and Middle rows of Fig. 1. These differences are more conveniently quantified by considering the deviation between corresponding atoms and their RMSD, which are shown in the Top row of Fig. 3. From the Left-most plot in the figure, it is clear that the error increases the further away we are from either end point but is much smaller than the distance between $C_{α}$ atoms—which is approximately 3.85 Å in this dataset. Unsurprisingly, from the three scatter plots on the right side we see that the atoms undergoing larger deformation are hardest to predict, and that largest errors come from the lower part of the protein—indicated by the yellow and green atoms. The latter observation suggests that there is a discrepancy between the actual energy landscape and the Riemannian structure. This is to be expected as potentials like Lennard-Jones are stronger at short-range than for example $w^{δ} ([X^{A K, 0}], \cdot)$ , but weaker at long-range.

Fig. 3. — Progessions of the RMSD from the adenylate kinase molecular dynamics simulation for the close to open *w^δ*-geodesic and the rank 1 approximation, and several snapshots displaying the atom-specific displacement as a function of the total displacement. Both schemes show only minimal deviation from the ground truth as only the *Lower*—green and yellow—part of the protein are not predicted within a deviation smaller than the 3.85 Å distance between adjacent C_α atoms.

Overall, this experiment suggests that the proposed Riemannian geometry is well suited for interpolating in terms of efficiency and ability to approximate energy-minimizing paths, when the protein is undergoing a reasonably ordered transition with medium-size deformations.

7.4.2. SARS-CoV-2 helicase nsp 13.

Once again, computing the $w^{δ}$ -geodesic is cheap, e.g., the mid-point converges after 14 iterations and takes 10.1 s $\pm$ 43.3 ms. Contrary to the adenylate kinase results, the Middle row of Fig. 2 suggests that we are unable to approximate this protein’s trajectory with just one $w^{δ}$ -geodesic between the first and last frame, i.e., $[X_{1}^{SC}]$ and $[X_{200}^{SC}]$ . Realistically, the higher level of disorder is the most likely cause. However, on Top of that it is very likely that the just mentioned modeling discrepancy between the underlying energy landscape from the simulation and the imposed Riemannian structure—becoming inevitably more noticeable the larger the deformation is—is an additional factor. This hypothesis is backed up by the experiment. That is, upon closer inspection and considering the Top row of Fig. 4, the major source of the error is the dark blue lobe on the left side of the protein—once again being the part of the protein with the largest deformation. The remaining $C_{α}$ atoms have a deviation that is around the acceptable inter- $C_{α}$ -distance of 3.85 Å.

Fig. 4. — Progessions of the RMSD from the SARS-CoV-2 helicase nsp 13 molecular dynamics simulation for the begin to end frame *w^δ*-geodesic and the rank 7 approximation, and several snapshots displaying the atom-specific displacement as a function of the total displacement. One *w^δ*-geodesic is clearly unable to capture the full range of motions, but a rank 7 approximation shows only minimal deviation from the ground truth as here most deviation is smaller than the 3.85 Å distance between adjacent C_α atoms.

Overall, this experiment showcases some limitations of the proposed Riemannian geometry for more disordered proteins and proteins undergoing very large-scale deformations. That is, our method is capable of quickly simulating realistic states, but we just cannot only use geodesics in such cases.

7.5. Low Rank Approximation at the w_δ-Barycenter.

7.5.1. Adenylate kinase.

Next, we will attempt to find a better geodesic that captures the large scale motion through rank 1 approximation of the dataset at the $w^{δ}$ -barycenter tangent space. Solving the barycenter problem Eq. 20 is once again cheap due to linear convergence of Riemannian gradient descent under local strong geodesic convexity. In particular, the solver converges after 3 iterations and takes 425 ms $\pm$ 13.8 ms. Then, for the actual low rank approximation, we follow the procedure in Section 6. We only consider a rank 1 approximation^** and use the exponential mapping to retrieve the corresponding protein conformations, which can be done cheaply (Theorem 3.5), e.g., for the approximation of $[X_{102}^{AK}]$ this takes 657 ms $\pm$ 19.6 ms. The approximations are shown in the Bottom row of Fig. 1, in which we see that this approximation is almost a perfect reconstruction of the original dataset. This is backed up by the RMSD and the per-residue-deviation shown in the Bottom row of Fig. 3. Here, we see that the error is about 1 Å. We remind the reader again that the distance between adjacent $C_{α}$ atoms is approximately 3.85 Å. From the plot, we also observe that the barycenter does not go through the data—otherwise, there would at least be one point with zero error—but is close to it.

Overall, this experiment confirms that the proposed Riemannian geometry enables us to find nonlinear means close to energy-minimizing paths for medium-sized and relatively ordered transitions and enables us to compute low-rank approximations over curved subspaces spanned by energy-minimizing paths that actually capture the intrinsic data dimensionality in a numerically highly efficient manner.

7.5.2. SARS-CoV-2 helicase nsp 13.

For the disordered case, we will try to find a geodesic subspace that will capture the full motion range. In the above case, a rank 1 approximation covers about 85 % of the variance. Using this threshold for SARS-CoV-2 tells us that we need at least a rank 7 approximation. Similarly to above, we compute the $w^{δ}$ -barycenter, which converges quickly after 4 iterations and takes 7.08 s $\pm$ 289 ms. After computing a rank 7 approximation^†† in the tangent space, the corresponding protein conformation approximations are computed using the $w^{δ}$ -exponential mapping, which can be done reasonably cheaply, e.g., in 25.2 s $\pm$ 164 ms for the approximation of $[X_{200}^{SC}]$ . We obtain a significantly better approximation than with a single $w^{δ}$ -geodesic:the Bottom row of Fig. 2 gives us a near-perfect reconstruction and the averaged deviations and atom-specific individual deviations in the Bottom row of Fig. 4 confirm this.

Overall, this experiment showcases that the proposed Riemannian geometry still enables us to find low rank behavior in seemingly disordered large-size deformations within a reasonable time frame.

8. Conclusions

With this investigation, we hope to contribute to the development of an energy landscape-based Riemannian geometry for efficient analysis of protein dynamics. The primary challenges were that we needed i) additional structure for efficient Riemannian geometry on point cloud quotient manifolds, ii) guidelines for constructing Riemannian geometry on such spaces within this framework, and iii) a concrete model geometry for protein dynamics data that fits into this framework.

8.1. Computationally Feasible Riemannian Geometry.

To overcome the computational feasibility issue, we have proposed to consider a relaxation of the Riemannian distance—called a separation—from which approximations of all essential manifold mappings for data analysis can be obtained in closed form (Corollary 3.3) or constructed in a provably efficient way (Theorems 3.4 and 3.5).

8.2. Constructing Riemannian Geometry for Point Cloud Conformations.

Subsequently, for an energy landscape-based Riemannian structure we have proposed a smooth manifold of point cloud conformations modulo rigid body motion to model (coarse-grained) protein data (Theorem 4.2), and have proposed guidelines for constructing Riemannian metric tensor fields with an accompanying separation (Theorem 4.3).

8.3. Riemannian Geometry for Efficient Analysis of Protein Dynamics Data.

Within this framework, we have used several best practices from normal mode analysis and proposed an energy landscape-based Riemannian protein geometry (Theorem 5.4) that arguably resolves basic locality and linearity problems in the normal modes framework. In numerical experiments with molecular dynamics simulations undergoing transitions, we observed that the proposed geometry is useful for data analysis, but could be improved on—within the proposed framework—to cater for even larger-scale deformations. In particular, approximate geodesics with respect to the separation approximate the molecular dynamics simulation data very well for low-disorder transitions with medium-sized deformations, separation-based barycenters give a meaningful representation of the data for both ordered and disordered transitions under any deformation size, and low rank approximation can recover the underlying dimensionality of such datasets. In particular, our approach broke down 636-dimensional conformations to just a nonlinear 1-dimensional space and broke down 1,764-dimensional conformations to a nonlinear 7-dimensional space without a significant loss in approximation accuracy. In addition, the approximations are very fast and can be computed in seconds on a laptop.

Supplementary Material

Appendix 01 (PDF)

pnas.2318951121.sapp.pdf^{(1.6MB, pdf)}

Acknowledgments

C.-B.S. acknowledges support from the Philip Leverhulme Prize, the Royal Society Wolfson Fellowship, the Engineering and Physical Sciences Research Council (EPSRC) advanced career fellowship EP/V029428/1, EPSRC grants EP/S026045/1 and EP/T003553/1, EP/N014588/1, EP/T017961/1, the Wellcome Innovator Awards 215733/Z/19/Z and 221633/Z/20/Z, the European Union Horizon 2020 research and innovation programme under the Marie Skodowska-Curie grant agreement No. 777826 NoMADS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute. O.Ö. acknowledges support from the Swedish Foundation for Strategic Research grant AM13-0049 and the Swedish Research Council grant 2020-03107.

Author contributions

W.D., C.E.-Y., J.L., O.Ö., and C.-B.S. designed research; W.D. performed research; W.D. analyzed data; W.D. developed experiments developed theory; W.D., C.E.-Y., O.Ö., and C.-B.S. developed experiments; W.D. and J.L. developed theory; and W.D. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

^*On top of that, having Riemannian geometry opens more doors. For one, because many popular methods rely solely on a notion of distance (24)—and thus can be generalized to Riemannian manifolds.

^†Our code is available at https://github.com/wdiepeveen/Riemannian-geometry-for-efficient-analysis-of-protein-dynamics-data.

^‡For a less rigorous breakdown of the ideas presented, we refer the reader to SI Appendix, section 1.

^§For a more detailed discussion on the differences between our approach and the approach by Rumpf and Wirth, and on why a more general framework is necessary, we refer to SI Appendix, section 2.D.

^¶Unit step size is sufficient due to local Lipschitz gradients by assumption iii) in Definition 3.1.

^#Note that we are asking here for a complete metric, which already gives us existence of separation-geodesics Eq. 5 and also has physical meaning now.

^‖Existence and local strong geodesic convexity can be shown in a similar fashion as in Theorems 3.4 and 3.5.

^**The tangent vectors corresponding to the largest three eigenvalues are shown in SI Appendix, section 5.D.

^††The tangent vectors corresponding to the largest nine eigenvalues are shown in SI Appendix, section 5.D.

Data, Materials, and Software Availability

Code data have been deposited in Github (https://github.com/wdiepeveen/Riemannian-geometry-for-efficient-analysis-of-protein-dynamics-data) (54).

Supporting Information

References

1.C. Kolloff, S. Olsson, Machine learning in molecular dynamics simulations of biomolecular systems. arXiv [Preprint] (2022). http://arxiv.org/abs/2205.03135 (Accessed 1 August 2023).
2.J. Rydzewski, M. Chen, O. Valsson, Manifold learning in atomistic simulations: A conceptual review. arXiv [Preprint] (2023). http://arxiv.org/abs/2303.08486 (Accessed 1 August 2023).
3.K. Jamali, D. Kimanius, S. H. Scheres, “A graph neural network approach to automated model building in cryo-EM maps” in The Eleventh International Conference on Learning Representations (2023).
4.Zhong E. D., Bepler T., Berger B., Davis J. H., Cryodrgn: Reconstruction of heterogeneous cryo-EM structures using neural networks. Nat. Methods 18, 176–185 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Zheng W., Wen H., A survey of coarse-grained methods for modeling protein conformational transitions. Curr. Opin. Struct. Biol. 42, 24–30 (2017). [DOI] [PubMed] [Google Scholar]
6.Kurkcuoglu Z., Bahar I., Doruker P., Clustenm: ENM-based sampling of essential conformational space at full atomic resolution. J. Chem. Theory Comput. 12, 4549–4562 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Amadei A., Linssen A. B., Berendsen H. J., Essential dynamics of proteins. Proteins: Struct. Funct. Bioinforma. 17, 412–425 (1993). [DOI] [PubMed] [Google Scholar]
8.Go N., Noguti T., Nishikawa T., Dynamics of a small globular protein in terms of low-frequency vibrational modes. Proc. Natl. Acad. Sci. U.S.A. 80, 3696–3700 (1983). [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Brooks B., Karplus M., Harmonic dynamics of proteins: Normal modes and fluctuations in bovine pancreatic trypsin inhibitor. Proc. Natl. Acad. Sci. U.S.A. 80, 6571–6575 (1983). [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Brooks B., Karplus M., Normal modes for specific motions of macromolecules: Application to the hinge-bending mode of lysozyme. Proc. Natl. Acad. Sci. U.S.A. 82, 4995–4999 (1985). [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Tirion M. M., Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys. Rev. Lett. 77, 1905 (1996). [DOI] [PubMed] [Google Scholar]
12.Bastolla U., Computing protein dynamics from protein structure with elastic network models. Wiley Interdiscip. Rev. Comput. Mol. Sci. 4, 488–503 (2014). [Google Scholar]
13.López-Blanco J. R., Chacón P., New generation of elastic network models. Curr. Opin. Struct. Biol. 37, 46–53 (2016). [DOI] [PubMed] [Google Scholar]
14.Sorzano C. O. S., et al. , Survey of the analysis of continuous conformational variability of biological macromolecules by electron microscopy. Acta Crystall. Sect. F Struct. Biol. Commun. 75, 19–32 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Mahajan S., Sanejouand Y. H., Jumping between protein conformers using normal modes. J. Comput. Chem. 38, 1622–1630 (2017). [DOI] [PubMed] [Google Scholar]
16.Ramaswamy V. K., Musson S. C., Willcocks C. G., Degiacomi M. T., Deep learning protein conformational space with convolutions and latent interpolations. Phys. Rev. X 11, 011052 (2021). [Google Scholar]
17.de Carmo M. P., Riemannian Geometry (Birkhäuser, 1992). [Google Scholar]
18.Sakai T., Riemannian Geometry (American Mathematical Soc., 1996), vol. 149. [Google Scholar]
19.Bergmann R., Gousenbourger P. Y., A variational model for data fitting on manifolds by minimizing the acceleration of a Bézier curve. Front. Appl. Math. Stat. 4 (2018). [Google Scholar]
20.Karcher H., Riemannian center of mass and mollifier smoothing. Commun. Pure Appl. Math. 30, 509–541 (1977). [Google Scholar]
21.Donoho D. L., Grimes C., Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. U.S.A. 100, 5591–5596 (2003). [DOI] [PMC free article] [PubMed] [Google Scholar]
22.W. Diepeveen, J. Chew, D. Needell, Curvature corrected tangent space-based approximation of manifold-valued data. arXiv [Preprint] (2023). http://arxiv.org/abs/2306.00507 (Accessed 20 June 2023).
23.Fletcher P. T., Lu C., Pizer S. M., Joshi S., Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Trans. Med. Imaging 23, 995–1005 (2004). [DOI] [PubMed] [Google Scholar]
24.Glielmo A., et al. , Unsupervised learning methods for molecular simulation data. Chem. Rev. 121, 9722–9758 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
25.W. M. Boothby, An Introduction to Differentiable Manifolds and Riemannian Geometry Revised (Gulf Professional Publishing, 2003), vol. 120.
26.Lee J. M., Smooth Manifolds in Introduction to Smooth Manifolds (Springer, 2013), pp. 1–31. [Google Scholar]
27.Penner R. C., Moduli spaces and macromolecules. Bull. Amer. Math. Soc. 53, 217–268 (2016). [Google Scholar]
28.Younes L., Shapes and Diffeomorphisms (Springer, 2010), vol. 171. [Google Scholar]
29.Younes L., Spaces and manifolds of shapes in computer vision: An overview. Image Vis. Comput. 30, 389–397 (2012). [Google Scholar]
30.Kendall D. G., Shape manifolds, procrustean metrics, and complex projective spaces. Bull. Lond. Math. Soc. 16, 81–121 (1984). [Google Scholar]
31.Laga H., A survey on nonrigid 3D shape analysis. Acad. Press Libr. Signal Process. 6, 261–304 (2018). [Google Scholar]
32.M. Niethammer, R. Kwitt, F. X. Vialard, “Metric learning for image registration” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 8463–8472. [DOI] [PMC free article] [PubMed]
33.Trouvé A., Younes L., Metamorphoses through lie group action. Found. Comput. Math. 5, 173–198 (2005). [Google Scholar]
34.N. Charon, L. Younes, in K. Chen, C. B. Schönlieb, X. C. Tai, L. Younes, Eds. (Springer International Publishing, Cham, 2023), pp. 1929–1958.
35.Younes L., Hybrid Riemannian metrics for diffeomorphic shape registration. Ann. Math. Sci. Appl. 3, 189–210 (2018). [Google Scholar]
36.Goriely A., The Mathematics and Mechanics of Biological Growth (Springer, 2017), vol. 45. [Google Scholar]
37.Younes L., Michor P. W., Shah J. M., Mumford D. B., A metric on shape space with explicit geodesics. Rend. Lincei-Mat. Appl. 19, 25–57 (2008). [Google Scholar]
38.Herzog R., Loayza-Romero E., A manifold of planar triangular meshes with complete Riemannian metric. Math. Comput. 92, 1–50 (2023). [Google Scholar]
39.Bastolla U., Dehouck Y., Can conformational changes of proteins be represented in torsion angle space? A study with rescaled ridge regression J. Chem. Inform. Model. 59, 4929–4941 (2019). [DOI] [PubMed] [Google Scholar]
40.Rumpf M., Wirth B., Variational time discretization of geodesic calculus. IMA J. Numer. Anal. 35, 1011–1046 (2015). [Google Scholar]
41.Yang X., Kwitt R., Styner M., Niethammer M., Quicksilver: Fast predictive image registration-A deep learning approach. NeuroImage 158, 378–396 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
42.B. Heeren, M. Rumpf, P. Schröder, M. Wardetzky, B. Wirth, “Splines in the space of shells” in Computer Graphics Forum, M. Ovsjanikov, D. Panozzo, Eds. (Wiley Online Library, 2016), vol. 35, pp. 111–120.
43.B. Heeren, M. Rumpf, M. Wardetzky, B. Wirth, "Time-discrete geodesics in the space of shells" in Computer Graphics Forum, E. Grinspun, N. Mitra, Eds. (Wiley Online Library, 2012), vol. 31, pp. 1755–1764.
44.Wirth B., Bar L., Rumpf M., Sapiro G., A continuum mechanical approach to geodesics in shape space. Int. J. Comput. Vis. 93, 293–318 (2011). [Google Scholar]
45.Boumal N., An Introduction to Optimization on Smooth Manifolds (Cambridge University Press, 2023). [Google Scholar]
46.Zhang H., Sra S., “First-order methods for geodesically convex optimization” in Conference on Learning Theory, (PMLR, 2016), pp. 1617–1638. [Google Scholar]
47.Absil P. A., Mahony R., Sepulchre R., Optimization Algorithms on Matrix Manifolds in Optimization Algorithms on Matrix Manifolds (Princeton University Press, 2009). [Google Scholar]
48.Dokmanic I., Parhizkar R., Ranieri J., Vetterli M., Euclidean distance matrices: Essential theory, algorithms, and applications. IEEE Signal Process. Mag. 32, 12–30 (2015). [Google Scholar]
49.Schönemann P. H., A generalized solution of the orthogonal procrustes problem. Psychometrika 31, 1–10 (1966). [Google Scholar]
50.Arun K. S., Huang T. S., Blostein S. D., Least-squares fitting of two 3-D point sets. IEEE Trans. Pattern Anal. Mach. Intell., 698–700 (1987). [DOI] [PubMed] [Google Scholar]
51.Bergmann R., Laus F., Persch J., Steidl G., Recent advances in denoising of manifold-valued images. Handb. Numer. Anal. 20, 553–578 (2019). [Google Scholar]
52.O. Beckstein, S. L. Seyler, A. Kumar, Simulated trajectory ensembles for the closed-to-open transition of adenylate kinase from DIMS MD and FRODA. Figshare. https://figshare.com/articles/dataset/Simulated_trajectory_ensembles_for_the_closed-to-open_transition_of_adenylate_kinase_from_DIMS_MD_and_FRODA/7165306. Accessed 30 January 2023.
53.D. Shaw, Molecular dynamics simulations related to SARS-CoV-2. D E Shaw Research. https://www.deshawresearch.com/downloads/download_trajectory_sarscov2.cgi/. Accessed 30 January 2023.
54.W. Diepeveen, C. Esteve-Yagüe, J. Lellmann, O. Öktem, C.-B. Schönlieb, Riemannian geometry for efficient analysis of protein dynamics data. GitHub. https://github.com/wdiepeveen/Riemannian-geometry-for-efficient-analysis-of-protein-dynamics-data. Deposited 15 August 2023. [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Appendix 01 (PDF)

pnas.2318951121.sapp.pdf^{(1.6MB, pdf)}

Data Availability Statement

Code data have been deposited in Github (https://github.com/wdiepeveen/Riemannian-geometry-for-efficient-analysis-of-protein-dynamics-data) (54).

[r1] 1.C. Kolloff, S. Olsson, Machine learning in molecular dynamics simulations of biomolecular systems. arXiv [Preprint] (2022). http://arxiv.org/abs/2205.03135 (Accessed 1 August 2023).

[r2] 2.J. Rydzewski, M. Chen, O. Valsson, Manifold learning in atomistic simulations: A conceptual review. arXiv [Preprint] (2023). http://arxiv.org/abs/2303.08486 (Accessed 1 August 2023).

[r3] 3.K. Jamali, D. Kimanius, S. H. Scheres, “A graph neural network approach to automated model building in cryo-EM maps” in The Eleventh International Conference on Learning Representations (2023).

[r4] 4.Zhong E. D., Bepler T., Berger B., Davis J. H., Cryodrgn: Reconstruction of heterogeneous cryo-EM structures using neural networks. Nat. Methods 18, 176–185 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r5] 5.Zheng W., Wen H., A survey of coarse-grained methods for modeling protein conformational transitions. Curr. Opin. Struct. Biol. 42, 24–30 (2017). [DOI] [PubMed] [Google Scholar]

[r6] 6.Kurkcuoglu Z., Bahar I., Doruker P., Clustenm: ENM-based sampling of essential conformational space at full atomic resolution. J. Chem. Theory Comput. 12, 4549–4562 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r7] 7.Amadei A., Linssen A. B., Berendsen H. J., Essential dynamics of proteins. Proteins: Struct. Funct. Bioinforma. 17, 412–425 (1993). [DOI] [PubMed] [Google Scholar]

[r8] 8.Go N., Noguti T., Nishikawa T., Dynamics of a small globular protein in terms of low-frequency vibrational modes. Proc. Natl. Acad. Sci. U.S.A. 80, 3696–3700 (1983). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r9] 9.Brooks B., Karplus M., Harmonic dynamics of proteins: Normal modes and fluctuations in bovine pancreatic trypsin inhibitor. Proc. Natl. Acad. Sci. U.S.A. 80, 6571–6575 (1983). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r10] 10.Brooks B., Karplus M., Normal modes for specific motions of macromolecules: Application to the hinge-bending mode of lysozyme. Proc. Natl. Acad. Sci. U.S.A. 82, 4995–4999 (1985). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r11] 11.Tirion M. M., Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys. Rev. Lett. 77, 1905 (1996). [DOI] [PubMed] [Google Scholar]

[r12] 12.Bastolla U., Computing protein dynamics from protein structure with elastic network models. Wiley Interdiscip. Rev. Comput. Mol. Sci. 4, 488–503 (2014). [Google Scholar]

[r13] 13.López-Blanco J. R., Chacón P., New generation of elastic network models. Curr. Opin. Struct. Biol. 37, 46–53 (2016). [DOI] [PubMed] [Google Scholar]

[r14] 14.Sorzano C. O. S., et al. , Survey of the analysis of continuous conformational variability of biological macromolecules by electron microscopy. Acta Crystall. Sect. F Struct. Biol. Commun. 75, 19–32 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r15] 15.Mahajan S., Sanejouand Y. H., Jumping between protein conformers using normal modes. J. Comput. Chem. 38, 1622–1630 (2017). [DOI] [PubMed] [Google Scholar]

[r16] 16.Ramaswamy V. K., Musson S. C., Willcocks C. G., Degiacomi M. T., Deep learning protein conformational space with convolutions and latent interpolations. Phys. Rev. X 11, 011052 (2021). [Google Scholar]

[r17] 17.de Carmo M. P., Riemannian Geometry (Birkhäuser, 1992). [Google Scholar]

[r18] 18.Sakai T., Riemannian Geometry (American Mathematical Soc., 1996), vol. 149. [Google Scholar]

[r19] 19.Bergmann R., Gousenbourger P. Y., A variational model for data fitting on manifolds by minimizing the acceleration of a Bézier curve. Front. Appl. Math. Stat. 4 (2018). [Google Scholar]

[r20] 20.Karcher H., Riemannian center of mass and mollifier smoothing. Commun. Pure Appl. Math. 30, 509–541 (1977). [Google Scholar]

[r21] 21.Donoho D. L., Grimes C., Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. U.S.A. 100, 5591–5596 (2003). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r22] 22.W. Diepeveen, J. Chew, D. Needell, Curvature corrected tangent space-based approximation of manifold-valued data. arXiv [Preprint] (2023). http://arxiv.org/abs/2306.00507 (Accessed 20 June 2023).

[r23] 23.Fletcher P. T., Lu C., Pizer S. M., Joshi S., Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Trans. Med. Imaging 23, 995–1005 (2004). [DOI] [PubMed] [Google Scholar]

[r24] 24.Glielmo A., et al. , Unsupervised learning methods for molecular simulation data. Chem. Rev. 121, 9722–9758 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r25] 25.W. M. Boothby, An Introduction to Differentiable Manifolds and Riemannian Geometry Revised (Gulf Professional Publishing, 2003), vol. 120.

[r26] 26.Lee J. M., Smooth Manifolds in Introduction to Smooth Manifolds (Springer, 2013), pp. 1–31. [Google Scholar]

[r27] 27.Penner R. C., Moduli spaces and macromolecules. Bull. Amer. Math. Soc. 53, 217–268 (2016). [Google Scholar]

[r28] 28.Younes L., Shapes and Diffeomorphisms (Springer, 2010), vol. 171. [Google Scholar]

[r29] 29.Younes L., Spaces and manifolds of shapes in computer vision: An overview. Image Vis. Comput. 30, 389–397 (2012). [Google Scholar]

[r30] 30.Kendall D. G., Shape manifolds, procrustean metrics, and complex projective spaces. Bull. Lond. Math. Soc. 16, 81–121 (1984). [Google Scholar]

[r31] 31.Laga H., A survey on nonrigid 3D shape analysis. Acad. Press Libr. Signal Process. 6, 261–304 (2018). [Google Scholar]

[r32] 32.M. Niethammer, R. Kwitt, F. X. Vialard, “Metric learning for image registration” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 8463–8472. [DOI] [PMC free article] [PubMed]

[r33] 33.Trouvé A., Younes L., Metamorphoses through lie group action. Found. Comput. Math. 5, 173–198 (2005). [Google Scholar]

[r34] 34.N. Charon, L. Younes, in K. Chen, C. B. Schönlieb, X. C. Tai, L. Younes, Eds. (Springer International Publishing, Cham, 2023), pp. 1929–1958.

[r35] 35.Younes L., Hybrid Riemannian metrics for diffeomorphic shape registration. Ann. Math. Sci. Appl. 3, 189–210 (2018). [Google Scholar]

[r36] 36.Goriely A., The Mathematics and Mechanics of Biological Growth (Springer, 2017), vol. 45. [Google Scholar]

[r37] 37.Younes L., Michor P. W., Shah J. M., Mumford D. B., A metric on shape space with explicit geodesics. Rend. Lincei-Mat. Appl. 19, 25–57 (2008). [Google Scholar]

[r38] 38.Herzog R., Loayza-Romero E., A manifold of planar triangular meshes with complete Riemannian metric. Math. Comput. 92, 1–50 (2023). [Google Scholar]

[r39] 39.Bastolla U., Dehouck Y., Can conformational changes of proteins be represented in torsion angle space? A study with rescaled ridge regression J. Chem. Inform. Model. 59, 4929–4941 (2019). [DOI] [PubMed] [Google Scholar]

[r40] 40.Rumpf M., Wirth B., Variational time discretization of geodesic calculus. IMA J. Numer. Anal. 35, 1011–1046 (2015). [Google Scholar]

[r41] 41.Yang X., Kwitt R., Styner M., Niethammer M., Quicksilver: Fast predictive image registration-A deep learning approach. NeuroImage 158, 378–396 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r42] 42.B. Heeren, M. Rumpf, P. Schröder, M. Wardetzky, B. Wirth, “Splines in the space of shells” in Computer Graphics Forum, M. Ovsjanikov, D. Panozzo, Eds. (Wiley Online Library, 2016), vol. 35, pp. 111–120.

[r43] 43.B. Heeren, M. Rumpf, M. Wardetzky, B. Wirth, "Time-discrete geodesics in the space of shells" in Computer Graphics Forum, E. Grinspun, N. Mitra, Eds. (Wiley Online Library, 2012), vol. 31, pp. 1755–1764.

[r44] 44.Wirth B., Bar L., Rumpf M., Sapiro G., A continuum mechanical approach to geodesics in shape space. Int. J. Comput. Vis. 93, 293–318 (2011). [Google Scholar]

[r45] 45.Boumal N., An Introduction to Optimization on Smooth Manifolds (Cambridge University Press, 2023). [Google Scholar]

[r46] 46.Zhang H., Sra S., “First-order methods for geodesically convex optimization” in Conference on Learning Theory, (PMLR, 2016), pp. 1617–1638. [Google Scholar]

[r47] 47.Absil P. A., Mahony R., Sepulchre R., Optimization Algorithms on Matrix Manifolds in Optimization Algorithms on Matrix Manifolds (Princeton University Press, 2009). [Google Scholar]

[r48] 48.Dokmanic I., Parhizkar R., Ranieri J., Vetterli M., Euclidean distance matrices: Essential theory, algorithms, and applications. IEEE Signal Process. Mag. 32, 12–30 (2015). [Google Scholar]

[r49] 49.Schönemann P. H., A generalized solution of the orthogonal procrustes problem. Psychometrika 31, 1–10 (1966). [Google Scholar]

[r50] 50.Arun K. S., Huang T. S., Blostein S. D., Least-squares fitting of two 3-D point sets. IEEE Trans. Pattern Anal. Mach. Intell., 698–700 (1987). [DOI] [PubMed] [Google Scholar]

[r51] 51.Bergmann R., Laus F., Persch J., Steidl G., Recent advances in denoising of manifold-valued images. Handb. Numer. Anal. 20, 553–578 (2019). [Google Scholar]

[r52] 52.O. Beckstein, S. L. Seyler, A. Kumar, Simulated trajectory ensembles for the closed-to-open transition of adenylate kinase from DIMS MD and FRODA. Figshare. https://figshare.com/articles/dataset/Simulated_trajectory_ensembles_for_the_closed-to-open_transition_of_adenylate_kinase_from_DIMS_MD_and_FRODA/7165306. Accessed 30 January 2023.

[r53] 53.D. Shaw, Molecular dynamics simulations related to SARS-CoV-2. D E Shaw Research. https://www.deshawresearch.com/downloads/download_trajectory_sarscov2.cgi/. Accessed 30 January 2023.

[r54] 54.W. Diepeveen, C. Esteve-Yagüe, J. Lellmann, O. Öktem, C.-B. Schönlieb, Riemannian geometry for efficient analysis of protein dynamics data. GitHub. https://github.com/wdiepeveen/Riemannian-geometry-for-efficient-analysis-of-protein-dynamics-data. Deposited 15 August 2023. [DOI] [PMC free article] [PubMed]

PERMALINK

Riemannian geometry for efficient analysis of protein dynamics data

Willem Diepeveen

Carlos Esteve-Yagüe

Jan Lellmann

Ozan Öktem

Carola-Bibiane Schönlieb

Significance

Abstract

1. Introduction

1.1. Related Work.

1.1.1. Smooth manifolds of protein conformations.

1.1.2. Riemannian geometry for protein conformations.

1.1.3. Computational feasibility for data analysis.

1.2. Contributions.

1.2.1. Computationally feasible Riemannian geometry.

1.2.2. Constructing Riemannian geometry for point cloud conformations.

1.2.3. Riemannian geometry for efficient analysis of protein dynamics data.

Fig. 1.

1.3. Outline.

2. Notation

3. Computationally Feasible Riemannian geometry

Definition 3.1

Theorem 3.2

Corollary 3.3.

Theorem 3.4.

Theorem 3.5.

Remark 3.6:

4. Constructing Riemannian Geometry for Point Cloud Conformations

Proposition 4.1.

Theorem 4.2.

Theorem 4.3.

Remark 4.4:

5. A Riemannian Geometry and Separation for Protein Dynamics Data

Proposition 5.1.

Remark 5.2:

Corollary 5.3.

Theorem 5.4.

Remark 5.5:

6. Riemannian Geometry and Analysis of Protein Dynamics Data

Remark 6.1:

Remark 6.2:

7. Numerics

7.1. Data.

7.2. Outline of Experiments.

7.3. General Experimental Settings.

Fig. 2.

7.4. wδ-Geodesic Interpolation.

7.4.1. Adenylate kinase.

Fig. 3.

7.4.2. SARS-CoV-2 helicase nsp 13.

Fig. 4.

7.5. Low Rank Approximation at the wδ-Barycenter.

7.5.1. Adenylate kinase.

7.5.2. SARS-CoV-2 helicase nsp 13.

8. Conclusions

8.1. Computationally Feasible Riemannian Geometry.

8.2. Constructing Riemannian Geometry for Point Cloud Conformations.

8.3. Riemannian Geometry for Efficient Analysis of Protein Dynamics Data.

Supplementary Material

Acknowledgments

Author contributions

Competing interests

Footnotes

Data, Materials, and Software Availability

Supporting Information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

7.4. w^δ-Geodesic Interpolation.

7.5. Low Rank Approximation at the w_δ-Barycenter.