Exact Topology of Dynamic Probability Surface of an Activated Process by Persistent Homology

Farid Manuchehrfar; Huiyu Li; Wei Tian; Ao Ma; Jie Liang

doi:10.1021/acs.jpcb.1c00904

Author manuscript; available in PMC: 2022 Mar 26.

Published in final edited form as: J Phys Chem B. 2021 May 3;125(18):4667–4680. doi: 10.1021/acs.jpcb.1c00904


PMCID: PMC8957293 NIHMSID: NIHMS1788560 PMID: 33938737

Abstract

To gain insight into the reaction mechanisms of activated processes, we introduce an exact approach for quantifying the topology of high-dimensional probability surfaces of the underlying dynamic processes. Instead of Morse indexes, we study the homology groups of a sequence of superlevel sets of the probability surface over high-dimensional configuration spaces using persistent homology. For alanine-dipeptide isomerization, a prototype of activated processes, we identify locations of probability peaks and connecting ridges, along with measures of their global prominence. Instead of a saddle point, the transition state ensemble (TSE) of conformations is located at the most prominent probability peak after reactants/products, when proper reaction coordinates are included. Intuition-based models, even those exhibiting a double well, fail to capture the dynamics of the activated process. Peak occurrence, prominence, and locations can be distorted upon subspace projection. While principal component analysis accounts for conformational variance, it inflates the complexity of the surface topology and destroys dynamic properties of the topological features. In contrast, the TSE emerges naturally as the most prominent peak beyond the reactant/product basins when projected to a subspace of minimum dimension containing the reaction coordinates. Our approach is general and can be applied to investigate the topology of high-dimensional probability surfaces of other activated processes.

Keywords: surface topology, landscape analysis, persistent homology, active process, transition state, energy flow

1. Introduction

Activated processes are ubiquitous in molecular systems, ranging from chemical reactions of small molecules to dynamic conformational changes and enzymatic reactions of proteins. In proteins, all functionally important processes are activated processes, which provide well-defined rates essential for proteins to carry out their roles in the cellular context, as proper timing is required for proper function.

The prevalent picture describing an activated process is that of a transition between two metastable basins on the free energy landscape separated by a barrier, whose height is large compared to the thermal energy [1]. The slow time scale of activation arises from the fact that the molecular system can rarely accumulate sufficient energy in the relevant degrees of freedom (DoFs) to surpass the transition barrier. This simple and elegant picture originates from reaction rate theories, such as the well-known transition state theory and Kramers' theory [1–7], developed in studies of the dynamics of chemical reactions of small molecules.

A key concept in reaction rate theories is that of reaction coordinates: a few special coordinates exist that can fully determine the progress of a reaction process [8–10]. A requirement for reaction coordinates is that they must accurately locate the transition barrier. Accordingly, the numerous degrees of freedom (DoFs) in a complex molecular system (e.g., a protein molecule, a system of solute and solvent) can be divided into reaction coordinates and heat bath. Reaction coordinates play central roles as they determine both the mechanism and the rate of activation. For example, to modify the activity of an enzyme, one should modify residues involved in the reaction coordinates of the enzyme activities [11,12], as this will modify both the reaction pathway and the barrier height for activation. In contrast, modifying residues that belong to the heat bath will not alter the enzymatic activity, as the role of the heat bath is to provide energy to the reaction coordinates to cross the activation barrier during rare fluctuations, which is largely a non-specific process.

Given such significance, it is important to develop a rigorous and quantitative criterion for determining the correct reaction coordinates. This task was accomplished with the development of the committor test, which is characterized by the committor value p_B [8,9,13,14]: the probability that a dynamic trajectory initiated from a given configuration reaches the product basin before visiting the reactant basin. By definition, the reactant and product states have committor values of 0 and 1, respectively, whereas the optimal transition state coincides with p_B = 0.5. The committor value p_B therefore provides a rigorous parameterization of the reaction process. Thus, the intuitive, albeit qualitative, notion of reaction coordinates translates into a rigorous definition: the few coordinates that are sufficient for determining the committor value of any given configuration. Du et al. first adopted this rigorous definition of reaction coordinates in the context of protein folding [9]; Chandler and co-workers established its usage as a standard practice in the general context of activated processes [8].
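As a purely illustrative sketch of the committor definition above, the following toy model estimates p_B by launching trial trajectories from a starting site and counting the fraction that reach the product before the reactant. The model (an unbiased random walk on sites 0..n_sites, with site 0 as the reactant and site n_sites as the product) and the function name `estimate_committor` are assumptions made for this example; they are not the molecular system or the sampling procedure of this work.

```python
import random


def estimate_committor(start, n_sites, n_trials=4000, seed=0):
    """Fraction of trial walks from `start` that hit site n_sites (product)
    before site 0 (reactant). Toy model, not a molecular simulation."""
    rng = random.Random(seed)
    hits_product = 0
    for _ in range(n_trials):
        x = start
        while 0 < x < n_sites:          # walk until one basin is reached
            x += rng.choice((-1, 1))    # unbiased step left or right
        if x == n_sites:
            hits_product += 1
    return hits_product / n_trials


# For this toy walk the exact committor is start / n_sites,
# so the midpoint plays the role of the p_B = 0.5 transition state.
for site in (2, 5, 8):
    print(site, estimate_committor(site, n_sites=10))
```

In this toy setting the estimate at the reactant and product sites is exactly 0 and 1, matching the definition in the text.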

While this rigorous criterion has been well accepted, identifying the correct reaction coordinates turns out to be rather difficult, even for systems of modest complexity. One example is the C_7eq → C_7ax isomerization reaction of the alanine dipeptide in vacuum, a prototype for studying biomolecular conformational transitions. Alanine dipeptide is the smallest molecule that satisfies the criterion that distinguishes complex molecules from small molecules: the non-reaction coordinates in the system constitute a large enough thermal bath to provide the reaction coordinates with adequate energy to cross the activation barrier. As a result, the C_7eq → C_7ax transition displays features of activated dynamics that are unique to complex molecules but absent in small molecules. It was first found by Bolhuis et al. [15] that the conventional Ramachandran torsional angles ϕ and ψ, while sufficient for distinguishing the two stable basins, are inadequate for locating the transition state. Instead, another torsional angle θ₁ was found to be an essential reaction coordinate, a rather counter-intuitive finding [16,17]. The counter-intuitive nature of reaction coordinates turned out to be the norm rather than the exception in complex systems [15,18,19], posing a formidable challenge: without intuition, there appears to be no guidance in sight.

The challenge in identifying reaction coordinates has motivated efforts to develop rigorous methods for their identification in complex systems since the early 2000s [8,9,15,18,20,21]. Beyond unreliable intuition and trial-and-error, the first systematic method was based on machine learning, in which a neural network was used to automatically identify the optimal reaction coordinates from a prepared pool of candidates [21]. This method was used to successfully identify the key solvent coordinate that controls the isomerization dynamics of an alanine dipeptide in solution, which had defied prior intuition-based trial-and-error efforts. The success of this machine-learning-based approach led to a series of further developments along similar lines [10,18,19,22–28].

However, a major deficiency of machine-learning-based methods is that they cannot answer the real question concerning reaction coordinates: why are some coordinates more important for activation than others? Instead, these methods can only inform us empirically which coordinates appear to be important based on well-defined criteria. Recently, Ma and co-workers developed a rigorous theory for mapping out the flow of potential energy through individual coordinates [17,29]. It was found that the reaction coordinates are the coordinates that carry high energy flows during the activation process. This result suggested an appealing physical picture: energy flows from fast coordinates into slow coordinates during activation, so that adequate energy can accumulate in the slow coordinates, enabling them to cross the activation barrier along the path of these slow coordinates. This physical picture also suggested that reaction coordinates are preferred channels of energy flows and are encoded in the protein structure. Through analysis of energy flow, one can obtain a prioritized list of coordinates that likely play the most significant roles in the activation process.

The most celebrated concept in reaction rate theories is that of the transition state (TS), which is the dynamic bottleneck of an activated process. The conventional view is that the TS is located at a critical point with a Morse index of 1 on the high-dimensional potential energy landscape of the molecule. If all the reaction coordinates of an activated process are known, the TS will be an index-1 critical point on the free energy surface of the reaction coordinates, with the sole direction of negative Hessian curvature pointing along the ideal one-dimensional reaction coordinate valid at the top of the activation barrier. Based on this picture, the surface of the multi-dimensional probability distribution of reaction coordinates constructed from an ensemble of reactive trajectories should be highly structured, with the important dynamic states (e.g., the TS) manifesting as critical topological features.

To gain insights into reaction mechanisms, a common practice is to study features of the free energy surface. However, all relevant information of an activated process is contained in the reactive trajectories. Instead of the free energy surface, one can construct the dynamic probability surface of the transition state. For certain systems, this can be achieved using the transition path sampling (TPS) method. Once a large ensemble of reactive trajectories is generated, an ensemble of configurations that the system sampled during the transition process can be harvested. From this ensemble, one can generate a dynamic probability landscape or transition state surface, which is usually high-dimensional.

The focus of this work is the analysis of the exact topology of the high-dimensional dynamic probability landscape, and the establishment of its relationship with the transition state of the active process. Instead of Morse index, we characterize the high-dimensional transition state surface by its topological structures in homology groups. We analyze topological changes in the superlevel set of the probability surface at different probability levels. Using the technique of persistent homology, we identify the locations of probability peaks in the high-dimensional configuration space and ridges connecting them, along with measures of their global prominence.

We apply this approach to study the active process of C_7eq → C_7ax isomerization of the alanine dipeptide, a well-characterized model system for studying protein conformational changes [15,17,29]. After the exact topological structures of the dynamic probability surface are constructed, we identify the location of the ensemble of transition state conformations. Instead of a saddle point with a Morse index of 1, the transition state ensemble (TSE) is found to be located at the top of the most prominent peak after those of the reactant and product. In addition, the dynamically important topological structures are retained when the surface is projected onto the 2-dimensional plane of (ϕ − θ₁), which are known to be the reaction coordinates [17]. In contrast, when projected to the intuition-derived (ϕ − ψ)-plane, the topological features of the probability surface no longer contain dynamic information on the transition state. Furthermore, we find that PCA dimension reduction distorts surface topologies, such that the transition state ensemble cannot be recovered from PCA-derived topological features: instead of simplifying it, PCA destroys the dynamic properties of topological features of the original transition state surface.

Overall, we introduce a novel approach for quantifying the exact topology of high-dimensional surfaces. With this approach, we have characterized the precise topology of the transition state surface of the active process of alanine dipeptide isomerization. We have also established that the TSE is located at the most prominent probability peak beyond those of the reactant and the product, when the subspace of projection contains the proper reaction coordinates. The new language of homology groups and the technique of persistent homology on superlevel sets of high-dimensional probability surfaces introduced here are general and can be applied to investigations of the topology of high-dimensional surfaces over configuration space encountered in other problems of activated processes.

2. Theory, Models and Methods

We briefly discuss how the dynamic probability surface can be constructed for the molecular system of interest. We then discuss the problem of understanding the topological structures of the probability surface over the relevant configuration space. Our focus will be the introduction of the method of homology groups for analyzing the exact topological structures of the dynamic probability surface. This approach is based on the homological structures of the sequence of superlevel sets of a probability surface, and differs from previous efforts that are based on critical points, Morse theory, and Euler characteristics.

2.1. Constructing dynamic probability surface: transition state ensemble and path sampling

We use the transition path sampling (TPS) method [8] to generate a sufficiently large ensemble of reactive trajectories for the active process we study. To ensure the transition process is fully covered without bias, the duration of the trajectories is much longer than that of the transition process. Along each trajectory, we further harvest a number of system configurations at a fixed time interval to generate an ensemble of the relevant configurations that the system samples during the transition process. From this ensemble, we construct a dynamic probability landscape over a d-dimensional subspace, namely, the joint probability distribution over the d coordinates.
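The last step above, turning harvested configurations into a joint probability distribution over d coordinates, can be sketched as a simple grid histogram. This is a minimal illustration under stated assumptions: the function name `probability_surface`, the bin width, and the five sample points are hypothetical, and the binning scheme actually used in this work is not specified here.

```python
import math
from collections import Counter


def probability_surface(configs, bin_width):
    """Bin d-dimensional configurations on a regular grid and return
    a dict mapping each occupied grid cell to its probability."""
    counts = Counter(
        tuple(math.floor(x / bin_width) for x in config) for config in configs
    )
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


# Hypothetical 2-D configurations (for instance, two torsional angles in degrees).
samples = [(-80.0, 10.0), (-75.0, 12.0), (-78.0, 15.0), (60.0, -5.0), (0.0, 20.0)]
surface = probability_surface(samples, bin_width=30.0)
print(surface)
```

Each key of `surface` is the integer index of a grid cell, which is also a natural starting point for the cubic-complex representation discussed later in this section.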

2.2. Configuration space and probability surface.

We now discuss the general problem of analyzing a probability surface over a configuration space of arbitrary dimension. We first introduce a few relevant concepts, and then review current approaches. This is followed by an exposition of the basics of homology groups and persistent homology. We focus on developing these concepts in the setting of cubic complexes and superlevel sets of the probability surface. We will also discuss the key concept of filtration and its construction, and how it can be used to uncover the exact high-dimensional topological features.

Configuration space.

We begin our discussion in a general setting. We use d features to describe the configurations of a molecule. These can be bond lengths, bond angles, and torsional angles describing the structure of the molecule. For alanine dipeptide, there are a total of 60 features that fully describe the configuration of the molecule. For a molecule of finite size, the configuration space $M \subset ℝ^{d}$ is compact. It is likely that $M$ lies in a subspace of the Euclidean space $ℝ^{d}$ , as there is coupling between different degrees of freedom.

Probability surface.

Each configuration $x = (x_{1}, x_{2}, \dots, x_{d}) \in M$ has a probability f(x) ∈ [0, 1] associated with it. Namely, we have a function $f : M \to ℝ_{[0, 1]}$ that assigns the probability value f(x) to each configuration x.

Superlevel and sublevel sets.

For a value 0 ≤ a ≤ 1, we can identify all x whose probability values satisfy f(x) ≥ a. This set is called the superlevel set $M_{f \geq a}$ :

M_{f \geq a} \equiv \{x \in M \mid f(x) \geq a\} = f^{-1}([a, 1]) .

Similarly, we have the sublevel set $M_{f \leq a}$ :

M_{f \leq a} \equiv \{x \in M \mid f(x) \leq a\} = f^{-1}([0, a]) .
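The two definitions above translate directly into set comprehensions once the surface is stored on a grid. The sketch below assumes, purely for illustration, a discretized surface held as a dict from grid cell to probability; the five values are made up.

```python
def superlevel_set(f, a):
    """Grid cells x with f(x) >= a."""
    return {x for x, p in f.items() if p >= a}


def sublevel_set(f, a):
    """Grid cells x with f(x) <= a."""
    return {x for x, p in f.items() if p <= a}


# Hypothetical 1-D probability surface on five grid cells.
f = {0: 0.05, 1: 0.40, 2: 0.10, 3: 0.30, 4: 0.15}

high = superlevel_set(f, 0.3)   # {1, 3}: two separate peaks
low = superlevel_set(f, 0.1)    # {1, 2, 3, 4}: the peaks are now joined
print(high, low, high <= low)
```

Lowering the threshold can only enlarge a superlevel set, so the sets obtained at decreasing thresholds are nested. This nesting is what the filtration described later in this section relies on.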

2.3. Topology of probability surface: Critical points and Morse indices.

It is of great importance to understand the topological structures of the probability surface f(x). The approach of analyzing the critical points is well-practiced for exploring the topological structures of high-dimensional surfaces.

Critical Points.

The critical points of $f (M)$ are where all first derivatives of f vanish:

\frac{\partial f (x)}{\partial x_{1}} = 0, \frac{\partial f (x)}{\partial x_{2}} = 0, \dots, \frac{\partial f (x)}{\partial x_{d}} = 0.

Critical points are coordinate-independent and can be further classified into different types by the second derivatives. We can organize the second derivatives into a d × d Hessian matrix Hf(x), whose entries are:

{(H f)}_{i, j} = \frac{\partial^{2} f}{\partial x_{i} \partial x_{j}} .

At non-degenerate critical points, where the Hessian matrices are non-singular, the Hessian will in general have a mixture of positive and negative eigenvalues. The number of negative eigenvalues η is the Morse index of the critical point.
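To make the definition of the Morse index concrete, the sketch below counts the negative Hessian eigenvalues at a critical point of a two-variable function. It is a minimal illustration restricted to d = 2, where the eigenvalues of the symmetric Hessian have a closed form; the finite-difference step `h` and the three test functions are assumptions of this example.

```python
import math


def hessian_2d(f, x, y, h=1e-3):
    """Central finite-difference estimate of the second derivatives of f at (x, y)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy


def morse_index_2d(f, x, y, h=1e-3):
    """Number of negative eigenvalues of the 2 x 2 Hessian at (x, y)."""
    fxx, fxy, fyy = hessian_2d(f, x, y, h)
    mean = 0.5 * (fxx + fyy)
    radius = math.hypot(0.5 * (fxx - fyy), fxy)
    eigenvalues = (mean - radius, mean + radius)
    return sum(1 for lam in eigenvalues if lam < 0)


def well(x, y):
    return x * x + y * y        # minimum at the origin


def saddle(x, y):
    return x * x - y * y        # saddle at the origin


def peak(x, y):
    return -(x * x + y * y)     # maximum at the origin


print(morse_index_2d(well, 0.0, 0.0),
      morse_index_2d(saddle, 0.0, 0.0),
      morse_index_2d(peak, 0.0, 0.0))
```

With the convention used in the text, a minimum has index 0, a saddle has index 1, and a peak of a two-dimensional surface has index 2.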

Topology of surface by critical points and Morse theory.

At a critical point, the topology of sublevel sets changes. For a critical point x with f(x) = a, consider the sublevel set $M_{f < a}$ and the sublevel set $M_{f < (a + ϵ)}$ slightly above it by a small amount ϵ > 0. The topologies of $M_{f < a}$ and $M_{f < (a + ϵ)}$ are different, as one cannot be deformed into the other. However, once an η-dimensional handle is attached to $M_{f < a}$ at the critical point x, it has the same homotopy type as $M_{f < (a + ϵ)}$ [30], namely, these two can be deformed into one another. In fact, the homotopy class of $M$ can be characterized by topological changes at critical points. The problem of determining the topology of the surface f(x) on $M$ then becomes the problem of determining the critical points of f(x) on $M$ and their Morse indices.

Euler characteristics.

A special case of the celebrated Morse inequalities relates the number of critical points of different indices to the Euler characteristic, which is

χ (M) = \sum_{k = 0}^{d} {(- 1)}^{k} m_{k},

where m_k is the number of critical points of index k. The Euler characteristic provides information on the number of holes of various dimensions, and can also be written as:

χ (M) = \sum_{k = 0}^{d} {(- 1)}^{k} β_{k},

where β_k is the k-th Betti number, which counts the number of k-dimensional holes. For example, in $ℝ^{3}$ , β₀ counts the number of connected components, β₁ counts the number of tunnels, and β₂ counts the number of voids. Voids and tunnels have been extensively examined in the studies of protein structures and functions [31–40].
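As a small numerical check of the relation between the Euler characteristic and Betti numbers, the sketch below evaluates χ for a planar region made of unit pixels. Note the swapped-in technique: instead of counting critical points, it uses the Euler-Poincaré formula, χ = (number of vertices) − (number of edges) + (number of squares), which gives the same alternating sum. The pixel sets and the function name are hypothetical.

```python
def euler_characteristic(pixels):
    """chi = #vertices - #edges + #squares for a set of unit pixels,
    each given by the integer coordinates (i, j) of its lower-left corner."""
    squares = set(pixels)
    vertices, edges = set(), set()
    for i, j in squares:
        vertices.update([(i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1)])
        edges.update([
            ((i, j), (i + 1, j)),          # bottom edge
            ((i, j + 1), (i + 1, j + 1)),  # top edge
            ((i, j), (i, j + 1)),          # left edge
            ((i + 1, j), (i + 1, j + 1)),  # right edge
        ])
    return len(vertices) - len(edges) + len(squares)


# A 3 x 3 block of pixels, and the same block with the centre pixel removed.
block = [(i, j) for i in range(3) for j in range(3)]
ring = [p for p in block if p != (1, 1)]

print(euler_characteristic(block))  # one component, no hole:  1 - 0 = 1
print(euler_characteristic(ring))   # one component, one hole: 1 - 1 = 0
```

The ring has β₀ = 1 and β₁ = 1, so the alternating sum of Betti numbers is 0, in agreement with the cell count 16 − 24 + 8.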

Prior work on characterizing topological surface of molecular system.

The topology of surfaces over configuration space has been the subject of many investigations [41–45]. For example, thermodynamic phase transitions have been explored from the viewpoint of topological changes of submanifolds of the configuration space. Changes in Euler characteristics have been found to signal the occurrence of phase transitions for certain systems [42,43].

However, it is difficult in practice to understand the global surface topology using critical points and Morse indices. There are only a few model systems where all critical points are known analytically [46]. Numerical computation faces numerous challenges. First, it is difficult to identify all critical points. Methods based on Newton-Raphson and other techniques require initial guesses and do not guarantee identification of all critical points. Furthermore, many initial guesses fall into the same large basins of attraction and yield no new information. Second, as the probability surface can only be constructed approximately from points sampled in the high-dimensional configuration space, the degree of sampling may not be sufficiently detailed to capture the original topology of the configuration space. Third, sampling has to be sufficiently detailed to accurately measure the first and second derivatives along each coordinate direction at all locations. Fourth, as derivatives reflect local properties of the probability surface, there can be numerous critical points that may be trivial and of little importance if the probability surface is rugged. It is difficult to distinguish those reflecting the global features of the surface from those reflecting local dimples or even those that are due to noise in sampling. While there have been efforts in visualizations of potential surfaces of molecules and other heuristics (see [44] and references within), to our knowledge, it is not yet possible to characterize all critical points and the homotopy type of the probability or potential surface of a molecule in three-dimensional space.

2.4. Topology of probability surface: Homology group and persistent homology

Background and overview.

Instead of analyzing the critical points, we adopt a different approach. We are interested in global features such as the occurrence of different peaks, and how they are connected. This can be achieved by examining globally the structures of holes of various dimensions and how such structures change at different sublevel or superlevel sets of the probability landscape. We adopt an approach based on the theory of homology groups and persistent homology [47–49]. Homology theory studies holes in topological spaces and is a classic topic in algebraic topology [50,51]. Persistent homology computes these holes and measures their scales at different spatial resolutions [48,49]. Compared to homotopy, homology groups are more amenable to computation. Our approach is only feasible due to recent progress in computational topology and topological data analysis [47–49,52,53].

To provide an intuitive picture illustrating this approach, we envision a sea level on top of the probability landscape over the configuration space (Fig 1). We are interested in how mountain peaks emerge from the sea when the sea level is lowered gradually, and how independent mountain peaks become connected by land-ridges when the sea level is lowered further. These are related to 0-dimensional holes, which are connected components.

Figure 1: Sea levels of the probability landscape f(x) on the configuration space $M$ . The superlevel sets $M_{i} = M_{f \geq a_{i}} = f^{-1}([a_{i}, 1])$ at different sea levels (white regions) can have different topologies, with different numbers of components shown in white.

Complex and chain.

We first discuss how to represent the d-dimensional configuration space $M$ [54]. In this study, we use cubic complexes [53,54]. A d-dimensional cubic complex K is constructed from a union of points, line segments, squares, cubes, and their k-dimensional counterparts glued together properly, where k ≤ d and all have unit length (except points, which have no length). We call each of these a k-cell or a k-cube (see Fig 2a for a 3-cell). While the topology of $M$ is invariant whether it is represented by cubic complexes or other complexes such as simplicial complexes, the nature of grid representation of the molecular configurations makes this choice convenient [53].

Figure 2: An illustration of a cubic complex. a) A 3-cubic cell, with the orientation of its 2-faces shown. b) Two 3-cubes are summed to form a 3-chain. The internal square is contributed twice from the two cubes. As each surface is oriented (i.e., counterclockwise about the outward surface normal), these two squares have opposite orientations and cancel each other when summed. c) An example of a 2-chain formed by 2-cells and its boundary.

We can build up our cubic complex K from cubes to represent the configuration space. Given a set of k-cells, we can sum them up. We call the sum of a set of k-cells a k-chain. Fig 2b shows an example where two 3-cells are summed up to form a 3-chain. Fig 2c shows how nine 2-cells are summed up to form a 2-chain. Here the binary operation of summing over two k-cells is orientation sensitive: two k-cells of the same underlying space but opposite orientation cancel each other out when summed up. If a set is equipped with a binary operation satisfying certain requirements, it is called a group mathematically. The set of k-chains from the complex K with our binary operation of summation therefore forms a chain group C_k(K).

Boundaries.

We now define boundaries. The boundary of an individual k-cell is the set of its (k − 1)-dimensional faces, which by definition forms a (k − 1)-chain. The boundary of a 3-cubic cell is shown in Fig 2a, which is the set of the six oriented squares. The boundary of a k-chain is the sum of the boundaries of its element k-cells. Because of the nature of our sum operation, internal structures cancel out. Consider the boundary of the two neighboring three-dimensional cubes in Fig 2b. The interfacial square is contributed twice, once from each cube, but with opposite orientations, as both are counterclockwise about their outward normals. When these two cubes are glued together, these two boundary squares are summed up, and they cancel each other out. The overall outcome of this summation is indeed the outer boundary of the union of the two neighboring cubes. Fig 2c shows a 2-chain and its boundary. This holds true in other dimensions as well, namely, a (k − 1)-dimensional face shared by two neighboring k-cells appears with opposite orientations and cancels out upon summation.

With this summation, we obtain the boundary of a k-chain from K by applying the boundary operator ∂_k:

\partial_{k} : C_{k} (K) \to C_{k - 1} (K) .

(1)
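The cancellation of shared faces described above can be seen in a two-dimensional analogue of Fig 2b, with unit squares in place of cubes. The sketch below is an illustration under stated assumptions: the edge labelling `(i, j, axis)` and the function names are conventions invented for this example, and each square is oriented counterclockwise.

```python
from collections import defaultdict


def boundary_of_square(i, j):
    """Oriented boundary of the unit square with lower-left corner (i, j).

    An edge (i, j, 'x') runs from (i, j) to (i + 1, j);
    an edge (i, j, 'y') runs from (i, j) to (i, j + 1).
    A coefficient of -1 means the edge is traversed against that direction.
    """
    return {
        (i, j, "x"): +1,       # bottom, left to right
        (i + 1, j, "y"): +1,   # right, bottom to top
        (i, j + 1, "x"): -1,   # top, right to left
        (i, j, "y"): -1,       # left, top to bottom
    }


def boundary_of_chain(squares):
    """Sum the boundaries of all squares; edges with net coefficient 0 drop out."""
    total = defaultdict(int)
    for i, j in squares:
        for edge, coeff in boundary_of_square(i, j).items():
            total[edge] += coeff
    return {edge: c for edge, c in total.items() if c != 0}


two_squares = boundary_of_chain([(0, 0), (1, 0)])
print(len(two_squares), (1, 0, "y") in two_squares)
```

The shared edge `(1, 0, 'y')` receives +1 from the left square and −1 from the right square, so only the six outer edges of the 2 × 1 rectangle remain in the boundary.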

Cycles.

There are certain k-chains that have no boundaries. They are called k-dimensional cycles or k-cycles. With the binary operation of summation discussed earlier, the set of k-cycles from K form the cycle group Z_k(K):

Z_{k} (K) \equiv \{c \in C_{k} (K) \mid \partial_{k} c = \emptyset\} .

As an example, we consider the three-dimensional cube again (Fig 2a). We take its six surface squares that fully enclose the solid cube. These squares form a 2-chain. As a whole, this 2-chain itself does not have a boundary, as the six squares are glued together along their borders and there are no openings.

Kernel and image.

Analogous to the null space or kernel in linear algebra, the cycle group Z_k(K) is the kernel of the operator ∂_k, as each member k-chain has no boundary by definition and ∂_k sends it to null:

Z_{k} (K) = \ker \partial_{k} \equiv \{c \in C_{k} (K) \mid \partial_{k} c = \emptyset\} .

We now move one dimension up and consider boundaries of (k + 1)-chains in K. Boundaries are one dimension lower and therefore the boundaries of (k + 1)-chains are k-chains. They are called the k-boundaries of K and form the k-boundary group B_k(K). As each k-boundary is obtained when ∂_{k+1} is applied to a (k + 1)-chain, collectively they are the image of ∂_{k+1}:

B_{k} (K) = \operatorname{im} \partial_{k + 1} \equiv \{c \in C_{k} (K) \mid \partial_{k + 1} (c^{'}) = c, \; c^{'} \in C_{k + 1} (K)\} .

It turns out that all k-boundaries themselves have no (k − 1)-boundaries. This is due to a fundamental property of the boundary homomorphisms in topology, which states that for any k ≥ 1 [50],

\partial_{k - 1} \circ \partial_{k} = \emptyset .

(2)

From our previous example in Fig 2a, the 2-chain of the six squares that enclose the solid cube is the boundary of a 3-chain (a lone 3-cell in this case). They themselves do not have any opening, and hence the boundary of this 2-chain is ∅. A consequence of this general property is that we have B_k(K) ⊆ Z_k(K) ⊆ C_k(K).
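The property in Eqn (2) can be verified by hand in the simplest non-trivial case, k = 2, for a single unit square. The notation below is introduced only for this worked example: σ is a counterclockwise-oriented square with corners v_{00}, v_{10}, v_{11}, v_{01}; the edges e_b, e_r, e_t, e_l (bottom, right, top, left) are oriented along increasing coordinates; and 0 denotes the empty chain, written ∅ in the text.

```latex
\begin{align*}
\partial_2 \sigma &= e_b + e_r - e_t - e_l, \\
\partial_1 e_b &= v_{10} - v_{00}, \quad
\partial_1 e_r = v_{11} - v_{10}, \quad
\partial_1 e_t = v_{11} - v_{01}, \quad
\partial_1 e_l = v_{01} - v_{00}, \\
\partial_1 \partial_2 \sigma
  &= (v_{10} - v_{00}) + (v_{11} - v_{10}) - (v_{11} - v_{01}) - (v_{01} - v_{00}) = 0 .
\end{align*}
```

Every corner appears once with each sign, so the boundary of the boundary vanishes. The same bookkeeping, applied face by face, gives the result for the six squares enclosing the cube in Fig 2a.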

Homology group and Betti number.

There are two types of k-cycles: those that enclose (k + 1)-bodies and those that enclose (k + 1)-holes (Fig 3a). The former are boundaries of the enclosed bodies of (k + 1)-chains and can be collapsed into a point (Fig 3a, k-cycles in dark brown/brown enclosing (k + 1)-chains shaded in light brown). The latter are not boundaries of (k + 1)-bodies and cannot be collapsed into a point (Fig 3a, other cycles). We will first distinguish these two types of cycles. Furthermore, cycles that do not enclose a body may differ from one another, as the holes they contain may be different (Fig 3a). We will distinguish these different situations as well.

Figure 3: An illustration of homology classes of k-cycles in a (k + 1)-manifold $M_{k + 1}$ . a) The two k-cycles in brown and dark brown enclose a (k + 1)-body ((k + 1)-chains in light brown). They contain no holes and are boundary cycles. They belong to the same equivalence class of [∅]. The k-cycles in green/light green, purple/light purple, and blue/light blue contain the holes h_α, h_β and h_γ, respectively, and are part of the equivalence classes [h_α], [h_β] and [h_γ], respectively. b) The k-cycle l₁ contains a hole. The boundary cycle l₂ contains no hole but a (k + 1)-body. Note that l₁ and l₂ share a common piece of boundary but in opposite orientations. c) When l₁ and l₂ are summed up, we obtain the k-cycle l₃, which contains the same hole as l₁. Both l₁ in b) and l₃ in c) belong to the same equivalence class of k-cycles containing this hole.

We consider all cycles containing the same hole essentially the same and group them into one equivalence class of cycles. As an illustration, green/light green k-cycles in Fig 3a form an equivalence class [h_α] as they all contain hole h_α. So do the purple/light purple k-cycles (class [h_β] containing hole h_β), and the blue/light blue k-cycles (class [h_γ] containing hole h_γ). A special equivalence class consists of the cycles containing the ∅ hole, or no hole (Fig 3a, brown/dark brown cycles, class [∅]). We call cycles in each equivalence class homologous to each other. If they encircle different holes, they belong to different equivalence classes.

We elaborate on this. Among all elements of Z_k(K), which are k-cycles, we identify all k-boundaries, which contain (k + 1)-bodies and are elements of B_k(K). Because of Eqn (2), they have no (k + 1)-holes. We put them into a class denoted as [∅] as they contain no holes (or the ∅-hole) (Fig 3a, brown/dark brown cycles ∈ [∅]). The remaining k-cycles of Z_k(K) are not in [∅] but may contain different holes. We identify those that contain hole h_α, and put them into the equivalence class denoted as [h_α] (Fig 3a, green/light green cycles ∈ [h_α]). Remaining k-cycles that contain hole h_β are put into the class [h_β], and so on (Fig 3a, purple/light purple cycles ∈ [h_β], and blue/light blue cycles ∈ [h_γ]). Each element of the set {[∅], [h_α], [h_β], ⋯} is an equivalence class.

As these equivalence classes themselves form a set, and the outcome of the binary operation of summation on elements of Z_k is preserved, this set forms a new group. This new group is called a quotient group, as it is obtained from Z_k(K) after factoring out the boundaries B_k(K). The k-th homology group H_k(K) is this quotient group:

H_{k} (K) \equiv Z_{k} (K) / B_{k} (K) .

The elements of H_k(K) are equivalence classes of homologous cycles representing the holes (or lack thereof) they enclose.

Two k-cycles are homologous to each other if they contain the same hole, or equivalently, if one can be obtained from the other by adding a k-boundary (Fig 3b and Fig 3c). To illustrate this, note that the cycle labeled l₃ in Fig 3c can be obtained by adding the k-boundary of a (k + 1)-body labeled l₂ to the k-cycle labeled l₁ enclosing a hole (Fig 3b). Due to the cancellation property of our summation, the commonly shared piece of the boundaries is cancelled out and we have the larger k-cycle l₃ in Fig 3c enclosing the same hole. Up to the difference of a boundary of a solid (k + 1)-body, these two k-cycles are the same and belong to the same equivalence class. It is not difficult to see that we can repeat this operation of adding certain k-boundaries to convert any two homologous k-cycles into one another.

The number of independent k-dimensional holes is counted by the dimension of the homology group. It is called the k-th Betti number β_k(K):

β_{k} (K) = dim (H_{k} (K))
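As a small illustration of the quotient construction above, the following sketch computes β₀ and β₁ of a tiny complex with coefficients in Z₂, using β_k = dim Z_k − dim B_k, where dim B_k is the rank of the boundary matrix ∂_{k+1}. The function names and the input encoding (vertex pairs for edges, tuples of edge indices for faces) are assumptions made for this example only; they are not taken from the algorithm used in this study.

```python
def rank_gf2(rows):
    """Rank over Z_2 of a matrix given as a list of 0/1 rows."""
    rows = [r[:] for r in rows]
    ncols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                # Addition over Z_2 is XOR: shared pieces cancel out.
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def betti_numbers(n_vertices, edges, faces):
    """Return (beta_0, beta_1) for a complex of vertices, edges, and 2-d faces.

    edges: list of (u, v) vertex pairs.
    faces: list of tuples of edge indices forming the boundary of each face.
    """
    d1 = [[1 if v in e else 0 for e in edges] for v in range(n_vertices)]
    d2 = [[1 if i in f else 0 for f in faces] for i in range(len(edges))]
    r1, r2 = rank_gf2(d1), rank_gf2(d2)
    beta0 = n_vertices - r1           # every vertex is a 0-cycle; B_0 has rank r1
    beta1 = (len(edges) - r1) - r2    # dim Z_1 minus dim B_1
    return beta0, beta1


square = [(0, 1), (1, 2), (2, 3), (3, 0)]
hollow = betti_numbers(4, square, [])              # one component, one hole
filled = betti_numbers(4, square, [(0, 1, 2, 3)])  # the 2-cell fills the hole
```

For the hollow square the 1-cycle around the boundary is not a boundary of anything, so it represents a hole; once the 2-d face is added, the same cycle becomes a 1-boundary and falls into the class [∅].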

Filtration.

We now examine the topological structures of holes in the probability landscape on the configuration space, when we restrict to configurations whose probabilities are all above a certain value. By gradually adjusting this value, we will be able to trace out the details of topological changes. For an illustration, envision a sea on top of the probability landscape (Fig 4). At the level of f(x) = 1, it covers the whole landscape. The domain of the part of the landscape above the sea level is ∅. We gradually lower the sea level to value b₁, when the first peak emerges from the sea (birth of the first peak). At this time, we have the superlevel set $M_{f \geq b_{1}}$ , which is the set of points $\{x \in M \mid f (x) \geq b_{1}\}$ . They form the white region(s) in Fig 4. We further lower the sea level to b₂ when another peak emerges (birth of the second peak), at which time we have the superlevel set $M_{f \geq b_{2}}$ . Suppose we continue this process until the sea level reaches d₂, where the two peaks are merged together (death place of the second peak) by a land ridge that has just emerged above the sea level. At this sea level, we have $M_{f \geq d_{2}}$ . At each of these levels, the topology of the superlevel set changes, namely, one component, two components, and then one component again. These changes are captured by the changing homology groups and the Betti numbers.

Figure 4: The probability landscape $f (M)$ and the topology of its superlevel set $M_{f \geq}$ . a) The landscape and a sea level. The superlevel sets $M_{f \geq}$ are the regions of the domain $M \subset ℝ$ (shown as a plane) whose landscape value is above the sea. b) At f(x) = 1, all is below the sea level and $M_{f \geq 1} = \emptyset$ . At f(x) = b₁, b₂, and d₂, the topology of $M_{f \geq}$ (shown in white) changes. At f(x) = 0, all is above sea level and we have $M_{f \geq 0} = M$ . c) The persistent diagram of the birth and death value of f(x) for the 0-th homology group representing the two peaks. The sublevel sets below the sea $M_{f <}$ are shown in blue.

We now generalize. We have a descending sequence of probability values corresponding to the lowering sea level:

1 = a_{0} > a_{1} > a_{2} > \dots > a_{n} = 0,

and the corresponding superlevel sets, or the domains of the part of the landscape above the sea level, which are subspaces of $M$ :

\emptyset = M_{0} \subset M_{1} \subset M_{2} \dots \subset M_{n} = M .

Recall we have the full configuration space $M$ represented by a cubic complex K. Each superlevel set $M_{i}$ is represented by a subcomplex K_i ⊂ K, which can be derived from the original full complex K. We then have the corresponding sequence of subcomplexes:

\emptyset = K_{0} \subset K_{1} \subset K_{2} \dots \subset K_{n} = K .

This sequence of subcomplexes is called a filtration.
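The nested structure of a filtration can be sketched on a one-dimensional grid, where each grid cell stands in for a cube of the complex. The probability values and the sea levels below are hypothetical and chosen only to mimic the two-peak picture of Fig 4 (b₁ = 0.40, b₂ = 0.25, d₂ = 0.10); the helper names are likewise assumptions for this example.

```python
def superlevel_cells(f, level):
    """Indices of grid cells whose value is at or above the sea level."""
    return {i for i, v in enumerate(f) if v >= level}


def count_components(cells):
    """Number of maximal runs of adjacent indices (components in 1-D)."""
    return sum(1 for i in cells if i - 1 not in cells)


f = [0.05, 0.40, 0.10, 0.25, 0.02]      # hypothetical probabilities on 5 cells
levels = [1.0, 0.40, 0.25, 0.10, 0.0]   # descending sea levels a_0 > ... > a_n

filtration = [superlevel_cells(f, a) for a in levels]
components = [count_components(K) for K in filtration]

for a, K, c in zip(levels, filtration, components):
    print(f"sea level {a:.2f}: cells {sorted(K)}, components {c}")
```

Each set in `filtration` contains the previous one, and the component count goes from zero to one, to two, and back to one as the sea level passes the two births and the merge.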

We are interested in how the topology of $M_{f \geq a_{i}}$ evolves at different a_i, i = 0, ⋯, n. This is represented by the corresponding sequence of homology groups connected by linear maps:

0 = H_{k} (K_{0}) \to H_{k} (K_{1}) \to \dots \to H_{k} (K_{n}) = H_{k} (K) .

Persistence and Persistent diagram.

As we move from K_{i−1} to K_i, we may gain a new equivalence class (e.g., a new peak for 0-th homology as in our example), or we may lose one (e.g., when a peak is merged with another one). We say that an equivalence class of a k-cycle [α_i] is born at a_i if it is present in K_i but absent in K_{i−1} for any value a_{i−1} > a_i. The class dies at a_i if it is present in K_{i−1} for any value a_{i−1} > a_i but not in K_i. We record the location and the value of a_i, namely, the corresponding k-cube and its probability value whose inclusion leads to the birth and death events.

The prominence of the topological feature of a k-cycle is encoded in its life-time or persistence. Denote the birth value and the death value of class [α_i] as b_i and d_i, respectively. The persistence of class [α_i] is then b_i − d_i.

In the example shown in Fig 4, the equivalence class of 0-cycles (components) associated with the first peak is born at f(x) = b₁. The equivalence class associated with the second peak is born at f(x) = b₂. At f(x) = d₂, these two components merge together. We say that the second peak dies at d₂, and its persistence is b₂ − d₂. The first peak dies at f(x) = 0, and its persistence is b₁ − 0 = b₁.

We record the birth and death events of homology classes in a two-dimensional plot, which is called the persistent diagram. Each homology class is represented by a point in this diagram, where the birth value b_i and the death value d_i are taken as its coordinates (b_i, d_i). Fig 4c shows the persistent diagram of our illustrative example.
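The birth and death bookkeeping described above can be sketched for the 0-th homology of a one-dimensional array with a union-find structure: cells are added from the highest value downward, and when two components meet, the one with the lower peak dies. This is a minimal illustration under simplifying assumptions (1-D grid, adjacency between neighboring cells only), not the cubic-complex implementation used in this study; the function name and output layout are hypothetical.

```python
def peak_persistence(f):
    """0-th persistence of the superlevel-set filtration of a 1-D array.

    Returns tuples (birth_value, death_value, birth_index, death_index),
    sorted by decreasing persistence. The global maximum is assigned a
    death value of 0, following the convention used in the text.
    """
    parent = {}   # union-find forest over cells already above the sea
    peak = {}     # root -> index of the highest cell in its component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    # Lower the sea level: visit cells from highest to lowest value.
    for i in sorted(range(len(f)), key=lambda j: -f[j]):
        parent[i] = i
        peak[i] = i
        for nb in (i - 1, i + 1):
            if nb not in parent:
                continue
            a, b = find(i), find(nb)
            if a == b:
                continue
            # Elder rule: the component with the lower peak dies at f[i].
            young, old = (a, b) if f[peak[a]] < f[peak[b]] else (b, a)
            if f[peak[young]] > f[i]:  # skip zero-persistence merges
                pairs.append((f[peak[young]], f[i], peak[young], i))
            parent[young] = old
    for root in {find(i) for i in parent}:
        pairs.append((f[peak[root]], 0.0, peak[root], None))
    return sorted(pairs, key=lambda p: p[1] - p[0])


diagram = peak_persistence([0.05, 0.40, 0.10, 0.25, 0.02])
```

For this hypothetical input the first peak has coordinates (0.40, 0) and the second peak (0.25, 0.10) in the persistent diagram; the death index of the second peak is the cell acting as the connecting ridge.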

In general, we have the k-th persistent diagram $P_{k} (f)$ of k-cycles for our probability function $f : M \to ℝ_{[0, 1]}$ . It is the set of points such that each point (x₁, x₂) represents a distinct topological feature of k-cycle, which is present in $H_{k} (M_{f \geq a}) = H_{k} (f^{- 1} ([a, 1]))$ for a ∈ [x₁, x₂).

Computation.

The key to studying homology groups in high-dimensional space is the construction of the complex K to represent the configuration space $M$ . In this study, we use the cubic algorithm of [53] with modifications, so it can be applied to higher dimensions. We consider only the 0-th persistent homology groups, which record the birth and death of probability peaks. The locations x where birth and death events occur, namely, the corresponding k-cubes, are also computed.

3. Results

3.1. Model system and computation

Ensemble of reactive trajectories and conformations.

The isomerization of alanine dipeptide in vacuum provides a tractable system for understanding the process of activation in detail, and has been well studied as a model for understanding protein conformational changes [21,29,55].

Using transition path sampling [8], we harvested 6 million reactive trajectories. Each trajectory is of 2.5 ps duration. We further collect conformations every 50 steps at 1 fs/step along each trajectory. Altogether, we have a total of 1.5 × 10¹¹ conformations. All simulations are conducted using the molecular dynamics software suite GROMACS 4.5.4 [56]. The Amber94 force field was used to facilitate the comparison with previous results. The simulation was performed with constant total energy of 36 kJ/mol, such that the averaged temperature is 300 K for the transition path ensemble. Note that the transition portion of each reactive trajectory is around 0.2 ps, thus the majority of our 2.5 ps trajectories are within the two stable basins. Here the reactant basin is defined in radians as (ϕ, ψ) ∈ [(−3.49, −0.96) × (−1.57, 3.32)] and the product basin is defined as (ϕ, ψ) ∈ [(0.87, 1.74) × (−1.39, 0)].

Constructing dynamic probability surface of transition state.

We then construct the dynamic probability surface of the isomerization reaction from the sampled 1.5 × 10¹¹ conformations. With a balanced consideration of the available MD simulation trajectories and the dimensionality, we construct a 5-dimensional configuration space for this study. Based on previous analysis using the energy flow theory [17], we selected the top 5 coordinates (ϕ, θ₁, ψ, α, β) that contribute most to the activation dynamics. The original 60-dimensional space is then projected onto this 5-dimensional space, where each dimension is divided into 15 bins. This leads to 15⁵ = 759,375 5-dimensional hypercubes.
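The binning step can be sketched as follows. The sketch assumes that all five coordinates are angles treated as periodic on [−π, π) and that the bins are of equal width; neither the bin ranges nor the storage scheme is specified above, so both are assumptions, as are the function names. A sparse dictionary is used because most of the 759,375 hypercubes are expected to be empty.

```python
import math
from collections import Counter

N_BINS = 15
TWO_PI = 2.0 * math.pi


def bin_index(angle, n_bins=N_BINS):
    """Map an angle in radians to a bin in [0, n_bins), periodic on [-pi, pi)."""
    frac = ((angle + math.pi) % TWO_PI) / TWO_PI
    return min(int(frac * n_bins), n_bins - 1)  # guard against rounding to n_bins


def probability_surface(samples, n_bins=N_BINS):
    """Sparse histogram: maps each occupied hypercube (tuple of bin indices)
    to the fraction of conformations that fall into it."""
    counts = Counter(tuple(bin_index(a, n_bins) for a in s) for s in samples)
    total = sum(counts.values())
    return {cube: c / total for cube, c in counts.items()}


# Three hypothetical conformations given as (phi, theta1, psi, alpha, beta).
demo = [
    (-1.68, -0.18, 2.0, 0.1, -0.3),
    (-1.70, -0.20, 2.1, 0.1, -0.3),
    (1.25, 0.00, -0.7, 0.2, 0.4),
]
surface = probability_surface(demo)
```

Each key of `surface` identifies one 5-dimensional hypercube, and the values sum to one.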

Computing topological structures of the dynamic probability landscape.

We then carry out persistent homology analysis. Computations are conducted on a machine with a 20-core Xeon E5-2670 CPU at 2.5 GHz, with a cache size of 20 MB and 128 GB of RAM. The computing time for finding the significant peaks and the ridges connecting them is ≈ 30 seconds.

Committor test for conformations at selected locations of the configuration space.

We carry out committor tests for configurations of the dipeptide identified by persistent homology. The committor value of a configuration is defined as the probability that a dynamic trajectory initiated from this configuration, with initial momenta drawn from the Boltzmann distribution, reaches the product basin before the reactant basin. A configuration with committor value p_B = 0.5 is regarded as a member of the transition state ensemble.
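The definition of the committor translates directly into a counting estimate: launch many independent trajectories from the same configuration and record the fraction that reach the product basin first. The sketch below uses an unbiased random walk on integers as a toy stand-in for molecular dynamics with randomized momenta; the trajectory function, the basin boundaries, and the number of shots are all hypothetical and serve only to show the estimator.

```python
import random


def estimate_committor(start, run_trajectory, n_shots, rng):
    """Fraction of trajectories from `start` that reach basin B before basin A."""
    hits_b = sum(run_trajectory(start, rng) == "B" for _ in range(n_shots))
    return hits_b / n_shots


def random_walk_trajectory(start, rng, a_edge=0, b_edge=20):
    """Toy dynamics: unbiased walk on integers.

    Basin A (reactant) is x <= a_edge and basin B (product) is x >= b_edge.
    For this walk the exact committor is start / b_edge.
    """
    x = start
    while a_edge < x < b_edge:
        x += rng.choice((-1, 1))
    return "B" if x >= b_edge else "A"


rng = random.Random(1)
p_mid = estimate_committor(10, random_walk_trajectory, 1000, rng)   # near 0.5
p_near_a = estimate_committor(2, random_walk_trajectory, 1000, rng) # near 0.1
```

In this toy model only the midpoint behaves like a transition state; points closer to either basin commit predominantly to that basin, which is the one-sided behavior a committor histogram would show.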

In a committor test [8,9], we need to generate an ensemble of tentative transition state conformations from locations in configuration space where probability peaks and ridges identified by persistent homology are located. These conformations all share the same target values for the selected coordinates (e.g., ϕ, θ₁) that correspond to the location of the selected peak, with the other coordinates sampling the equilibrium distribution. For this, we add harmonic restraint potentials on the selected coordinates to the system potential energy function. The minima of the harmonic restraints are at the target values. Equilibrium MD simulations are then carried out. Conformations harvested from such simulations are filtered to generate an ensemble of conformations that all share the same target values for the selected coordinates. The restraint potential is used to enrich conformations that satisfy this criterion.
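A minimal sketch of the restraint and the subsequent filtering is given below, assuming that the selected coordinates are angles in radians and that differences are wrapped onto the circle. The force constant, the tolerance, and the function names are hypothetical; the values actually used in the simulations are not stated here.

```python
import math


def wrapped_difference(angle, target):
    """Signed difference angle - target on the circle, in [-pi, pi)."""
    return (angle - target + math.pi) % (2.0 * math.pi) - math.pi


def restraint_energy(coords, targets, k):
    """Sum of harmonic restraints 0.5 * k * d**2 over the selected coordinates.

    The minimum (zero energy) is at the target values.
    """
    return sum(0.5 * k * wrapped_difference(q, q0) ** 2
               for q, q0 in zip(coords, targets))


def near_target(coords, targets, tol):
    """Filter criterion: every selected coordinate lies within tol of its target."""
    return all(abs(wrapped_difference(q, q0)) <= tol
               for q, q0 in zip(coords, targets))


target = (0.30, -1.00)   # hypothetical (phi, theta1) location of a peak
harvested = [(0.31, -1.02), (0.90, -1.00), (0.29, -0.99)]
ensemble = [c for c in harvested if near_target(c, target, tol=0.05)]
```

The restraint only biases sampling toward the target; the filter is what guarantees that the retained conformations share the target values to within the chosen tolerance.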

3.2. Topology and dynamic properties of 5-d dynamic probability surfaces

Dynamic probability surface on configuration space of (ϕ, θ₁, ψ, α, β).

We examine the topological structures of this 5-d dynamic probability surface. There are four peaks, each located in a 5-d cube (Fig. 6a, red dots for birth locations of peaks and blue dots for their ridges or death locations, and values are listed in SI Table 1). The most prominent peak with the largest persistence is peak b₁ (see persistent diagram in Fig. 6b), which corresponds to the product basin. The second most prominent peak b₂ is the reactant basin. The third most prominent peak b₃ (Fig 9a) has roughly the same probability as the reactant basin b₂, but a shorter persistence; that is, it does not stand out from the surrounding landscape as much. It subsequently merges with the peak at the product basin b₁. As expected, peaks have all negative eigenvalues for their Hessian matrices, and ridges have one positive and four negative eigenvalues (see Supplementary Info).

Figure 6: — The 5-d dynamic probability surface on the (ϕ − θ₁) plane, its topological structures, and the committor values. (a) The 5-d dynamic probability surface p(ϕ, ψ, θ₁, α, β) shown on the (ϕ − θ₁) plane. Red and blue dots are locations of probability peaks and ridges (see also SI Table 1). (b) The persistent diagram recording the birth and death probabilities p(b_i) and p(d_i) of the peaks in y and x, respectively. (c-e) Distributions of committor values p_B for trajectories from locations of b₃, d₂, and d₃, respectively. (c) The transition state ensemble is located at b₃.

Figure 9: — The 5-d probability surface projected onto two different 2-d planes. (a) The probability surface projected onto the (ϕ − θ₁) plane. The red dot is the location of the third probability peak after reactant and product peaks, which is where the transition state conformations are located. (b) The probability surface projected onto the (ϕ − ψ) plane. Conformations with the correct transition state value of ϕ are at a non-peak location, which is on a slope below the new peak shown in blue. (c) The process of projecting the 5-d probability surface onto the 2-d (ϕ − ψ) plane for the blue and red dots shown in (b). While the probability at the location of the red dot is larger on the 5-d probability surface (left bars), after projection onto the (ϕ − ψ) plane, the probability at the red dot is smaller than that at the blue dot (right bar plots). As a consequence, the projected probability surface on the (ϕ − ψ) plane does not capture the actual location of the peak in the 5-d surface.

We then take conformations from the four peaks and the three ridges (SI Table 1), and carry out committor tests. We find that the most prominent peak b₃ beyond the reactant and product basins fully captures the transition state ensemble (p_B centered at 0.5, Fig 6c): Trajectories initiated from configurations at this location have equal probability towards the reactant or the product basin.

In contrast, all committor values p_B for locations of b₂, d₂, and d₄ are found to be 0. Reaction trajectories starting from conformations at these locations fall back to the reactant basin. The committor values for peak b₁ are all 1.0: Trajectories from this location all go to the product basin. The committor values for conformations at the ridges d₂ and d₃ follow one-sided distributions (Fig 6d–6e, respectively). Only a negligible number of conformations have p_B = 0.5.

These results demonstrate that the dynamic properties of the transition state ensemble are captured by the topological features of this dynamic probability surface. Furthermore, the transition state ensemble is located at the most prominent peak outside of the reactant and product basins, instead of a saddle point.

Dynamic probability surface on configuration space (ϕ, ψ, α, β, θ₂).

To examine the importance of proper choice of the coordinates, we construct another 5-d probability surface by omitting the coordinate θ₁ and replacing it with θ₂, which is on the other end of the molecule in a position symmetric to θ₁. Fig. 7a shows the projection of the 5-d surface in −ln p(x) on the (ϕ − θ₂) plane.

There are four significant peaks (Fig. 7a, red and blue dots), which are also shown in the persistent diagram (Figure 7b), each located in a 5-d cube (see SI Table 1). Peaks b₁ and b₂ are the most and second most prominent peaks, corresponding to the product and the reactant basins, respectively. In this projection, peaks are separated only in ϕ and they have almost identical θ₂ values. This differs from the 5-d surface containing θ₁ (Fig 6).

We then sample conformations from the location of each of the four peaks and the three ridges (SI Table 1) and perform committor tests. The committor values for peak b₁ are all 1.0, where trajectories starting here all go to the product basin. All committor values for conformations at d₂, b₄, and d₄ are 0.0. Reaction trajectories from these locations all go to the reactant basin. The committor values for d₂ and d₃ follow one-sided distributions (Fig. 7d–7e). Only a tiny number of the conformations have p_B = 0.5, indicating that the transition state ensemble is located elsewhere.

The committor values for b₃ have a flat distribution. While there are conformations with p_B values close to 0.5, their frequency is similar to that of any other p_B value. This indicates that the 5-d cube where b₃ is located contains some transition state conformations as its ϕ value is correct, but also a mixture of other conformations with diverse dynamic properties. Overall, without the reaction coordinate θ₁, this 5-d dynamic probability surface does not describe the activated process adequately, and cannot be used to identify the transition state ensemble.

3.3. Topology and dynamic properties of projected 2-d dynamic probability surfaces.

As it is difficult to directly study the topology of a high-dimensional surface, a common practice is to project the surface to a lower dimensional subspace and analyze the topology of the projected surface instead. The caveat of this practice is that the original topological features may be lost, and new features that are artifacts may arise due to the marginalization of the probability distributions.

To assess how well the dynamic properties are retained in a subspace, we project the 5-d surface on (ϕ, θ₁, ψ, α, β) to 2-d planes. We then analyze the topological structures of the projected surfaces, and assess the dynamic behavior of the identified topological features. We carried out this analysis using the 2-d planes of (ϕ − θ₁) and (ϕ − ψ).

Projecting 5-d dynamic probability surface to the (ϕ − θ₁) plane.

After projection, there are three significant probability peaks (Fig. 8a, Fig 9a, red and blue dots, and SI Table 2). The most and the next prominent peaks b₁ and b₂ shown in the persistent diagram of Fig. 8b are the product and reactant basins, respectively.

Figure 8: — The (ϕ-θ₁)-projection of the 5-d dynamic probability surface, its topological structure, and distributions of committor values. (a) The 5-d dynamic probability surface projected onto the (ϕ − θ₁) plane. Red and blue dots are locations of probability peaks and ridges, respectively. (b) The persistent diagram recording the birth and death probabilities p(b_i) and p(d_i) of the peaks in y and x, respectively. (c-e) Distributions of committor values p_B for trajectories from peaks and ridges b₃, d₂, and d₃, respectively. (c) Transition state conformations are at b₃.

It is informative to compare peak locations on the 5-d surface to those on the projected surface (SI Table 1 and SI Table 2). The ϕ coordinate for the product basin b₁ is altered from 1.25 to 0.84 after projection, while the θ₁ coordinate is unchanged. The (ϕ, θ₁) coordinates of the reactant basin b₂ are also changed from (−1.68, −0.18) to (−1.25, 0.0). θ₁ of the ridge d₃ is changed from −0.18 to 0.0, while ϕ is unchanged. The fourth most prominent peak becomes undetectable after projection. The persistent diagram (Fig. 8b) is also significantly different: The third peak is much less prominent with reduced persistence compared to that in Fig. 6b (see also SI Table 2).

We then carry out committor test on conformations from locations of these topological features. The distribution of p_B sampled from trajectories from b₃ is centered around 0.5 (Fig 8c) and exhibits significant enrichment of transition state conformations. This is similar to peak b₃ on the original surface over the 5-d configuration space (Fig 6c), although the distribution has a broader width.

The committor values of trajectories from the product basin b₁ are all 1.0, as they all fall back to the product basin. Similarly, trajectories from b₂ all fall back to the reactant basin with a p_B value of 0.0. The committor values for the ridges at d₂ and d₃ follow one-sided distributions, but only a small number of conformations have p_B = 0.5 (Fig. 8d and 8e).

Overall, these results demonstrate that when projected to the 2-d plane of (ϕ-θ₁), which is formed by the two dominant reaction coordinates, the dynamic probability surface retains essential dynamic properties of the transition state surface, and contains rich information such that the transition state conformations can be recovered.

Projecting 5-d dynamic probability surface to the (ϕ − ψ) plane.

ϕ and ψ angles are the standard parameters to describe protein secondary structures. After projection to the (ϕ − ψ) plane, there are three significant probability peaks (Fig. 10a, red and blue dots, and SI Table 2). The most and the next most prominent peaks b₁ and b₂ as shown in the persistent diagram (Fig. 10b) are the product and reactant basins, respectively. Similar to projection to the (ϕ-θ₁) plane, locations of both basins are altered upon projection (SI Table 2). Peak 3 becomes very minor after projection to (ϕ − ψ) plane, and its location is also altered from that of the 5-d surface. This can be seen in Fig. 9b, where the location of peak 3 on the 5-d surface projected onto (ϕ−ψ) plane (red point), and the changed location of the new peak after projecting onto the (ϕ − ψ) plane (blue dot) are shown.

Figure 10: — The (ϕ-ψ)-projection of the 5-d dynamic probability surface, its topological structure, and distribution of committor values. (a) The 5-d dynamic probability surface projected onto the (ϕ-ψ) plane. Red and blue dots are locations of probability peaks and ridges, respectively. (b) The persistent diagram recording the birth and death probabilities p(b_i) and p(d_i) of the peaks in y and x, respectively. (c) Distribution of committor values p_B for peak b₂.

These observations demonstrate that with projection, locations of topological features of probability peaks may change, and their prominence as measured by persistence may also change dramatically. Fig. 9c explains why the projection to (ϕ-ψ) results in the peak on the 5-d surface (red dot in Fig. 9a) changing location (Fig. 9b, peak location shown as a blue dot, red dot no longer at the peak). First, as shown in Fig. 9c (beginning bar plot), the probability at the location of the red dot is larger than that at the blue dot on the 5-d surface. The probability p(ϕ, ψ) at each point on the (ϕ-ψ) plane (Fig. 9b) is the sum of all points on the 5-d surface with the same ϕ and ψ but with different values in any of the other three coordinates (θ₁, α and β). Thus, the probability of each point on the (ϕ−ψ) plane of Fig. 9b is a 3-d hyper-surface of p(θ₁, α, β). When we sum up the 3-d hyper-surface along one direction (e.g., α), we obtain a 2-d probability surface (e.g. p(θ₁, β)). The 2-d surfaces for the red and blue dots are shown in the middle panel of Fig. 9c. Reducing the dimension further, we sum up the 2-d probability surface over β. The resulting distributions are 1-d probability distributions along the θ₁ direction, shown in Fig. 9c for the blue and red dots. From these 1-d distributions, we can see that when details in θ₁ direction are retained, the probability at the red dot (actual peak in 5-d) is still higher than the probability at the blue dot. We next sum up the probability along the θ₁ direction (Fig. 9c, final bar plots). The summed values are the probability values over the 2-d (ϕ − ψ) plane shown in Fig. 9b. As shown by the final bar plot in Fig. 9c, the total probability mass for the red dot is now less than that for the blue dot, even though the red dot has a higher probability on the original 5-d surface. Hence, the location of peak 3 changes when projecting onto (ϕ − ψ) plane.
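The effect described above can be reproduced with a two-location toy example in which a single hidden coordinate plays the role of (θ₁, α, β). The numbers below are hypothetical and chosen only so that the "red" location holds the highest single-cell probability while the "blue" location holds more total mass once the hidden coordinate is summed out.

```python
# Joint probability at two (phi, psi) locations, resolved along a hidden
# coordinate (three bins). Values are hypothetical and sum to 1.
joint = {
    "red":  [0.30, 0.02, 0.02],   # sharp peak: mass concentrated in one bin
    "blue": [0.22, 0.22, 0.22],   # broad plateau: mass spread over all bins
}


def joint_peak(joint):
    """Location and hidden-bin index of the largest joint probability."""
    cells = ((loc, i) for loc, row in joint.items() for i in range(len(row)))
    return max(cells, key=lambda cell: joint[cell[0]][cell[1]])


def marginal(joint):
    """Sum out the hidden coordinate at each location."""
    return {loc: sum(row) for loc, row in joint.items()}


peak_before = joint_peak(joint)                 # peak on the full surface
projected = marginal(joint)
peak_after = max(projected, key=projected.get)  # peak after projection
```

Before projection the peak sits at the red location; after summing over the hidden coordinate the maximum moves to the blue location, because marginalization rewards broad distributions over sharp ones.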

We then carried out committor tests (Fig. 10c). None of the topological features are where the transition state ensemble is located: All trajectories starting from b₂, b₃, and d₃ go to the reactant basin (p_B = 0), and all trajectories starting at b₁ go to the product basin (p_B = 1.0). The committor values for ridge b₂ follow a one-sided distribution around p_B = 0 (Fig. 10c). Trajectories starting there mostly fall back to the reactant basin. Overall, none of the topological features after projection to the (ϕ-ψ) plane retain the dynamic properties of the transition state conformations as the original 5-d surface does.

3.4. Dimension reduction by PCA destroys dynamic properties inherent in surface topology

Principal Component Analysis (PCA) is a widely used technique for dimension reduction. It has found broad applications in molecular simulations [57–61]. However, whether such reduction retains the essential dynamics of the activation process and whether the surface topology on the PCA space can uncover the transition state conformations are not known. Here we assess the dynamic properties of probability surfaces after PCA dimension reduction.

Projection of p(ϕ, ψ, θ₁, α, β) onto (PC₁, PC₂) by dPCA.

We first applied dihedral principal component analysis (dPCA) to the 5-d probability surface [60,61]. dPCA is widely used for dimension reduction on periodic dimensions. It first maps each periodic dimension of circular angle to two new dimensions using the sin and cos functions. Regular PCA is then applied for dimension reduction. We use the dPCA procedure and obtain the first two principal components from the variance matrix of the 1.5 × 10¹¹ conformations. Collectively, they account for 80.4% of the variance.
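The two steps of dPCA described above (mapping each angle to its cosine and sine, then applying regular PCA) can be sketched with NumPy as follows. This is a generic illustration on synthetic angles, not the analysis pipeline used in this study; the function name, the synthetic data, and the choice to diagonalize the covariance matrix of the centered features are assumptions.

```python
import numpy as np


def dpca(angles, n_components=2):
    """Dihedral PCA sketch.

    angles: array of shape (n_samples, n_angles), in radians.
    Returns the projection onto the leading components and the fraction of
    variance they explain.
    """
    # Step 1: map each periodic angle to two non-periodic features.
    feats = np.concatenate([np.cos(angles), np.sin(angles)], axis=1)
    feats = feats - feats.mean(axis=0)
    # Step 2: regular PCA on the (cos, sin) features.
    cov = np.cov(feats, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    projected = feats @ evecs[:, :n_components]
    explained = evals[:n_components].sum() / evals.sum()
    return projected, explained


rng = np.random.default_rng(0)
synthetic = rng.uniform(-np.pi, np.pi, size=(500, 5))  # placeholder angles
pcs, frac = dpca(synthetic)
```

Because the features are built from sines and cosines, shifting any angle by 2π leaves the result unchanged, which is the reason for using this mapping on periodic coordinates.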

The dynamic probability surface after projection to the (PC1-PC2) plane is dramatically more complex (Fig. 11a and SI Table 3) than the surfaces when projected to either the (ϕ-θ₁) plane (Fig. 8) or the (ϕ-ψ) plane (Fig. 10). The persistent diagram (Fig. 11b) is also very different from that of the original 5-d probability surface (Fig. 6b).

Figure 11: — The dynamic probability surface on PCA space, its topological structure, and committor values on principal components space. (a) The projection of the 5-d dynamic probability surface p(ϕ, ψ, θ₁, α, β) to the plane of (PC₁-PC₂). (b) The persistent diagram exhibits six probability peaks. (c-e) Distribution of committor values p_B for trajectories from bridge d₂, peak b₆, and bridge d₆, respectively.

The committor tests show that all committor values for conformations at b₁ are 1.0 with trajectories going to the product basin. All committor values for conformations at b_2–5 and d_3–5 are 0.0: trajectories from there all go to the reactant basin. Committor values at ridge d₂ and peak b₆ follow a one-sided distribution at p_B = 0 (Fig. 11c). Committor values at ridge d₆ have higher frequencies at both p_B = 0.0 and p_B = 1.0. Conformations at this location are a mixture of those close to the reactant basin and those close to the product basin. There are few conformations from the transition state ensemble.

Overall, our results show that the dPCA procedure for dimension reduction removes dynamics-relevant information from the topological features of the probability surface. No conformations in the transition state of this activated process are captured by the topological features of the PCA surface.

Projection from 39-d angular space to 5-d dPCA subspace.

We also applied dPCA to the original full-dimensional dynamic probability surface. After removal of the 21 bond lengths, we apply dPCA to reduce the remaining 39-dimensional configuration space of angles to 5 principal components. The contour plot of the 5-d PCA surface projected onto the first two principal components is shown in Fig. 12a. Persistent homology analysis identifies 16 peaks (Fig. 12b), each located in a 5-d cube. The location and probability of the first 6 most dominant peaks and ridges connecting them are listed in SI Table 3. This 5-d persistent diagram is very different from that shown in Fig. 6b, exhibiting a significantly more complex surface topology.

Figure 12: — The dynamic probability surface on 5-d principal components space reduced by dPCA from the 39-d configuration space, its topological structure, and committor values. (a) the 5-d dynamic probability surface p(PC₁, ⋯, PC₅) shown on to the (PC₁, PC₂) plane. (b) The persistent diagram exhibits 16 peaks. (c-d) Distribution of committor p_B values for trajectories started at ridges d₉ and d₁₀, respectively.

The committor tests show that all committor values for conformations located at the peaks and the ridges are either 0 or 1.0. The two exceptions are ridges d₉ and d₁₀ (Fig. 12c–d). None of the topological features in the dPCA-reduced 5-d dynamic probability surface retain the essential dynamics of the transition state ensemble. Overall, our results demonstrate that dimension reduction by dPCA destroys dynamic properties inherent in the original surface topology of the dynamic probability surface.

Direct PCA projections.

Results using direct PCA to project p(ϕ, ψ, θ₁, α, β) onto (PC1, PC2), as well as to project from the 39-d space to the 5-d direct PCA subspace, are similar, and none retain the dynamic properties in their topological features (see SI for details).

4. Discussion

In this study, we have introduced a novel approach for characterizing the exact topological features of dynamic probability surfaces. Instead of examining critical points and Morse indexes, ours is based on homology groups of a series of superlevel sets of the probability surface. With quantification of the scales of these topological features by persistent homology, we are able to uncover the relationship between the topology of the dynamic probability surface and the dynamics of the activation process of the alanine-dipeptide isomerization reaction.

This approach allows us to define the topological properties of the high-dimensional dynamic probability surface that is associated with the transition state conformations. The probability surface over the transition state region is the most prominent peak after the reactant and product basins. Instead of having a Morse index of 1 as conventionally thought [5,62], the transition state ensemble is on the top of a dynamic probability peak and goes downhill in all directions. As seen when projected to the (ϕ, θ₁)-plane (Fig 9a), it appears as a small peak rather than a saddle point commonly associated with the transition state on a multi-dimensional free energy surface. This is because the system undergoes a certain amount of correlated wandering motions at the barrier top, before it goes down towards the product basin. Our finding is against the conventional wisdom that the C_7eq → C_7ax transition is a ballistic process, as it is a small peptide and the transition occurs in vacuum.

The dynamic probability surface was constructed from naturally occurring reactive trajectories connecting the reactant and product basins. These trajectories are unbiased and faithfully reflect how the C_7eq → C_7ax transition occurs. They contain all the relevant information about the dynamic process of the activation. Unlike the free energy surface commonly used in examining the mechanism of an activated process, this probability surface contains additional information that reflects the non-equilibrium nature of the transition dynamics.

A common practice in the studies of protein conformational dynamics is to extract mechanistic insights from the geometry of a two-dimensional free energy surface of a double-well along certain collective variables, which are often chosen based on heuristics or by intuition. For example, one would associate a saddle region with the transition states. Our results illustrate the caveats of such procedures. All three probability surfaces shown in Figs. 6, 8, and 10 exhibit the canonical double-well feature. In Fig. 7, the 5-d surface on the (ϕ, θ₂)-plane has both the product and reactant peaks and the third peak lying along the ϕ-direction alone, indicating that θ₂ is not a reaction coordinate. This is indeed verified by the committor test: configurations corresponding to the peak in the saddle region all share the correct value of ϕ but sample randomly along the θ₁ direction, as illustrated by the flat committor distribution in Fig. 7c. Detailed examination shows that the 5-d cube containing the transition state conformations (red dot in Fig. 13) also contains conformations with other θ₁ values, which are not at the transition state: some go to the product basin (e.g., green dot, Fig. 13b), and others to the reactant basin (e.g., blue dot, Fig. 13b).

Figure 13: — The horizontal band of the probability surface of Fig. 7a centered at θ₂ = 0 (−0.08 < θ₂ < 0.08) expanded to show distribution in θ₁. (a) This band containing all peaks in Fig. 7a is expanded in the θ₁ direction. The (ϕ-θ₂)-square containing peak b₃ in Fig. 7a is expanded in θ₁ and is shown as a vertical strip between the two dashed lines. This strip contains a mixture of conformations. The red dot shows the location of the transition state conformations. (b) The distributions of committor values for conformations at the blue, red, and green dots in (a), respectively. Trajectories from the blue and green dots fall back to the reactant and product basins, respectively.

In contrast, projection of the 5-d surface on (ϕ, θ₁)-plane (Fig 6) has the two basins along the ϕ-direction alone, but the transition region aligned along both ϕ and θ₁. This second feature is consistent with the importance of θ₁ in determining the barrier crossing dynamics [1–4,15,17,21].

Interestingly, the (ϕ, ψ)-surface (Fig 10) has basins and the saddle region arranged along both directions. Conventional wisdom would have led to the conclusion that ψ is important in defining both reactant and product basins as well as the barrier crossing process. However, the double-well structure exhibited on the (ϕ-ψ) plane is profoundly misleading. None of the topological features on the surface over the (ϕ-ψ) plane retain the dynamic properties of the original 5-d surface, as the committor test showed that configurations corresponding to the peak at the transition region completely fall into the reactant basin. This demonstrates that the correlation between ϕ and ψ, due to minor roles of ψ to the transition process [1,2], distorted the probability distribution along ϕ, such that the ridge/saddle extended into the reactant peak, leading to incorrect ϕ value that marks the location of the peak in the transition region. In contrast, θ₂ did not impact the distribution of ϕ, so the peak in the transition region on the (ϕ-θ₂) plane still bears the correct ϕ value for the transition states.

In general, the dynamic properties of topological features of the probability surface are very sensitive to the subspace of projection. This is illustrated by the different locations of the transition state conformations, which are at a peak on the (ϕ-θ₁) plane (Fig 9a, red dot), but lie on a slope below a new peak when projected onto the (ϕ-ψ) plane (Fig 9b, blue dot). While the probability at the correct (ϕ-θ₁) square for the transition state conformations (Fig 9c, left, red) is at a peak, a different location in the (ϕ-ψ) plane (Fig. 9b, blue dot) has higher probability. In this projection, the probability mass distributed along the θ₁ dimension is integrated over all θ₁ values (Fig. 9c, higher bar, right), which concentrates probability into a peak at this new location.
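The effect of integrating over θ₁ can be illustrated with a toy calculation. The sketch below does not use the data of this study: the grid size, bin indices, and probability values are hypothetical, and a 2-d (ϕ, θ₁) table projected onto ϕ stands in for the higher-dimensional projection discussed above. It shows how a cell that is the highest peak of the joint surface can lose that status once a lower but broader ridge is summed along θ₁.

```python
import numpy as np

# Hypothetical joint probability table: rows are phi bins, columns are theta1 bins.
p = np.zeros((5, 6))
p[1, 2] = 0.30   # sharp peak confined to a single theta1 bin
p[3, :] = 0.10   # lower ridge spread over every theta1 bin at phi bin 3
p /= p.sum()     # normalize so the table sums to 1

# Highest cell of the joint surface.
joint_peak = tuple(int(i) for i in np.unravel_index(np.argmax(p), p.shape))

# Projection onto phi: integrate (sum) over all theta1 bins.
p_phi = p.sum(axis=1)
marginal_peak = int(np.argmax(p_phi))

print("joint peak (phi bin, theta1 bin):", joint_peak)
print("marginal over theta1:", np.round(p_phi, 3))
print("peak after projection (phi bin):", marginal_peak)
```

In this toy table the joint peak sits at ϕ bin 1, whereas the projected distribution peaks at ϕ bin 3, because the ridge accumulates mass from all six θ₁ bins.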

Our results show that without the inclusion of the correct reaction coordinates, a probability surface of the same dimension no longer correctly characterizes the dynamic properties of the activated process. Without θ₁, the 5-d dynamic probability surface over (ϕ, θ₂, ψ, α, β) fails to capture the dynamics of this activated process.

Together, our results show that intuition-based projections (such as ϕ-ψ) or other arbitrary projections cannot be relied upon for understanding the dynamic properties of activated processes. Without rigorous examination such as the committor test, directly assigning mechanistic significance to features of a free energy surface is prone to mistakes and misinterpretation.
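The logic of a committor test can be sketched on a toy model. The example below is not the procedure used in this study, which relies on molecular dynamics trajectories of alanine dipeptide. As a stand-in, it assumes overdamped Langevin dynamics on a hypothetical 1-d double-well potential V(x) = (x² − 1)², with illustrative basin boundaries at x = ±0.8 and an illustrative temperature kT = 0.3. The function name `committor_estimate` and all parameters are assumptions made for illustration.

```python
import numpy as np

def committor_estimate(x0, n_traj=200, kT=0.3, dt=1e-3, max_steps=50_000, seed=0):
    """Fraction of trajectories started at x0 that reach the product basin first.

    Toy model: overdamped Langevin dynamics on V(x) = (x^2 - 1)^2, integrated
    with the Euler-Maruyama scheme. Reactant basin: x <= -0.8; product: x >= 0.8.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_traj, float(x0))
    outcome = np.zeros(n_traj, dtype=int)   # 0 undecided, -1 reactant, +1 product
    noise_scale = np.sqrt(2.0 * kT * dt)
    for _ in range(max_steps):
        active = outcome == 0
        if not active.any():
            break
        xa = x[active]
        force = -4.0 * xa * (xa ** 2 - 1.0)  # -dV/dx
        x[active] = xa + force * dt + noise_scale * rng.standard_normal(xa.size)
        outcome[(outcome == 0) & (x <= -0.8)] = -1
        outcome[(outcome == 0) & (x >= 0.8)] = 1
    decided = outcome != 0
    # Only trajectories that committed to a basin are counted.
    return float((outcome[decided] == 1).mean())

for start in (-0.6, 0.0, 0.6):
    print(f"x0 = {start:+.1f}  committor ~ {committor_estimate(start):.2f}")
```

A starting point behaves like a transition state only if its estimate is close to 0.5. Points whose trajectories fall almost entirely into one basin, as reported above for the peak on the (ϕ-ψ) plane, fail the test regardless of how the projected surface looks.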

Finally, our results show that the topological properties of the probability surface change dramatically after dimension reduction by techniques such as dPCA. While the simple probability surface on the properly constructed 2-d (ϕ − θ₁) plane contains rich dynamic information and is sufficient to uncover the transition state conformations, the topological features on PCA-reduced surfaces can become more complex, no longer reflect the essential dynamics, and cannot be used to identify the transition state conformations.

The homology-group approach and the technique of analyzing the persistent homology of the filtration of superlevel sets of high-dimensional probability surfaces introduced here are general. We envision that they can be applied to investigate the topology of high-dimensional probability surfaces encountered in other physical problems involving activated processes.
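To make the idea of a superlevel-set filtration concrete, the following minimal sketch computes only the 0-dimensional persistence (peaks and their prominence) of a small 2-d array. It is not the cubic complex algorithm used in this study, which handles higher dimensions and higher-order homology groups. The function name `superlevel_peaks`, the 4-neighbor connectivity, and the example values are assumptions made for illustration. Cells are added from the highest probability downward; a peak is born when a cell has no previously added neighbor and dies when its component merges with one that has a higher peak.

```python
import numpy as np

def superlevel_peaks(p):
    """Return (birth, death, cell) for peaks of a 2-d array, most prominent first."""
    p = np.asarray(p, dtype=float)
    rows, cols = p.shape
    flat = p.ravel()
    parent = {}  # union-find forest; each root is the highest cell of its component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    pairs = []
    # Lower the threshold: visit cells in order of decreasing value.
    for idx in np.argsort(-flat, kind="stable"):
        idx = int(idx)
        level = flat[idx]
        parent[idx] = idx
        r, c = divmod(idx, cols)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if not (0 <= rr < rows and 0 <= cc < cols):
                continue
            nb = rr * cols + cc
            if nb not in parent:
                continue  # neighbor not yet in the superlevel set
            a, b = find(idx), find(nb)
            if a == b:
                continue
            # Elder rule: the component with the lower peak dies at this level.
            hi, lo = (a, b) if flat[a] >= flat[b] else (b, a)
            if flat[lo] > level:  # skip zero-persistence pairs
                pairs.append((float(flat[lo]), float(level), divmod(lo, cols)))
            parent[lo] = hi
    root = find(0)  # the global maximum never dies
    pairs.append((float(flat[root]), float("-inf"), divmod(root, cols)))
    return sorted(pairs, key=lambda t: t[0] - t[1], reverse=True)

surface = np.array([[0.0, 0.0, 0.0, 0.0, 0.0],
                    [0.0, 0.9, 0.2, 0.5, 0.0],
                    [0.0, 0.0, 0.0, 0.0, 0.0]])
peaks = superlevel_peaks(surface)
for birth, death, cell in peaks:
    print(f"peak at {cell}: birth {birth}, death {death}")
```

In this example the lower peak (value 0.5) merges with the higher one at the saddle value 0.2, so its prominence is 0.3, while the global maximum persists through the whole filtration.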

Supplementary Material

TablesAndAdditionalMaterial

NIHMS1788560-supplement-TablesAndAdditionalMaterial.pdf (804.4 KB, pdf)

Figure 5: — The isomerization reaction of alanine dipeptide. (a) Conformations from the reactant and product basins before and after the isomerization. (b) The six reaction coordinates of the isomerization process of alanine dipeptide examined in this study.

Acknowledgement

We thank Drs. Herbert Edelsbrunner and Hubert Wagner for discussion and for generous help in extending the cubic complex algorithm. This work is supported by grants NIH R35 GM127084 and NSF CHE-1665104.

Footnotes

Conflict of Interest Statement

There are no conflicts of interest.

References

[1] Chandler David. Statistical mechanics of isomerization dynamics in liquids and the transition state approximation. J. Chem. Phys, 68(6):2959–2970, 1978.
[2] Kramers HA. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284–304, 1940.
[3] Pechukas Philip. Statistical approximations in collision theory, pages 269–322. Springer US, Boston, MA, 1976.
[4] Wigner E. The transition state method. Trans. Faraday Soc, 34:29–41, 1938.
[5] Hänggi Peter, Talkner Peter, and Borkovec Michal. Reaction-rate theory: fifty years after Kramers. Rev. Mod. Phys, 62:251–341, Apr 1990.
[6] Berne Bruce J., Borkovec Michal, and Straub John E. Classical and modern methods in reaction rate theory. J. Phys. Chem, 92(13):3711–3725, 1988.
[7] Pollak Eli and Talkner Peter. Reaction rate theory: what it was, where is it today, and where is it going? Chaos, 15(2):026116, 2005.
[8] Bolhuis Peter G., Chandler David, Dellago Christoph, and Geissler Phillip L. Transition path sampling: throwing ropes over rough mountain passes, in the dark. Annu. Rev. Phys. Chem, 53(1):291–318, 2002.
[9] Du Rose, Pande Vijay S., Grosberg Alexander Yu., Tanaka Toyoichi, and Shakhnovich Eugene S. On the transition coordinate for protein folding. J. Chem. Phys, 108(1):334–350, 1998.
[10] Li Wenjin and Ma Ao. Recent developments in methods for identifying reaction coordinates. Mol. Simul, 40(10–11):784–793, 2014.
[11] Schramm Vern L. and Schwartz Steven D. Promoting vibrations and the function of enzymes. Emerging theoretical and experimental convergence. Biochem., 57(24):3299–3308, 2018.
[12] Schwartz Steven D and Schramm Vern L. Enzymatic transition states and dynamic motion in barrier crossing. Nat. Chem. Biol, 5(8):551–558, 2009.
[13] Ryter D. On the eigenfunctions of the Fokker-Planck operator and of its adjoint. Physica A, 142(1):103–121, 1987.
[14] Onsager L. Initial recombination of ions. Phys. Rev, 54:554–557, Oct 1938.
[15] Bolhuis Peter G., Dellago Christoph, and Chandler David. Reaction coordinates of biomolecular isomerization. Proc. Natl. Acad. Sci, 97(11):5877–5882, 2000.
[16] Li Huiyu and Ma Ao. Kinetic energy flows in activated dynamics of biomolecules. J. Chem. Phys, 153(9):094109, 2020.
[17] Li Wenjin and Ma Ao. Reaction mechanism and reaction coordinates from the viewpoint of energy flow. J. Chem. Phys, 144(11):114103, 2016.
[18] Best Robert B. and Hummer Gerhard. Reaction coordinates and rates from transition paths. Proc. Natl. Acad. Sci, 102(19):6732–6737, 2005.
[19] Antoniou Dimitri and Schwartz Steven D. Toward identification of the reaction coordinate directly from the transition state ensemble using the kernel PCA method. J. Phys. Chem. B, 115(10):2465–2469, 2011.
[20] Hu Jie, Ma Ao, and Dinner Aaron R. A two-step nucleotide-flipping mechanism enables kinetic discrimination of DNA lesions by AGT. Proc. Natl. Acad. Sci, 105(12):4615–4620, 2008.
[21] Ma Ao and Dinner Aaron R. Automatic method for identifying reaction coordinates in complex systems. J. Phys. Chem. B, 109(14):6769–6779, 2005.
[22] Peters Baron and Trout Bernhardt L. Obtaining reaction coordinates by likelihood maximization. J. Chem. Phys, 125(5):054108, 2006.
[23] Antoniou Dimitri and Schwartz Steven D. The stochastic separatrix and the reaction coordinate for complex systems. J. Chem. Phys, 130(15):151103, 2009.
[24] Jung Hendrik, Covino Roberto, and Hummer Gerhard. Artificial intelligence assists discovery of reaction coordinates and mechanisms from molecular dynamics simulations, 2019.
[25] Sidky Hythem, Chen Wei, and Ferguson Andrew L. Machine learning for collective variable discovery and enhanced sampling in biomolecular simulation. Mol. Phys, 118(5):e1737742, 2020.
[26] Bonati Luigi, Zhang Yue-Yu, and Parrinello Michele. Neural networks-based variationally enhanced sampling. Proc. Natl. Acad. Sci, 116(36):17641–17647, 2019.
[27] Wang Yihang, Ribeiro João Marcelo Lamim, and Tiwary Pratyush. Past–future information bottleneck for sampling molecular reaction coordinate simultaneously with thermodynamics and kinetics. Nat. Commun, 10(1):1–8, 2019.
[28] Wang Yihang, Ribeiro João Marcelo Lamim, and Tiwary Pratyush. Machine learning approaches for analyzing and enhancing molecular dynamics simulations. Curr. Opin. Struct. Biol, 61:139–145, 2020.
[29] Li Wenjin and Ma Ao. A benchmark for reaction coordinates in the transition path ensemble. J. Chem. Phys, 144(13):134104, 2016.
[30] Matsumoto Yukio. An introduction to Morse theory, volume 208. American Mathematical Soc., 2002.
[31] Liang Jie, Woodward Clare, and Edelsbrunner Herbert. Anatomy of protein pockets and cavities: Measurement of binding site geometry and implications for ligand design. Protein Sci, 7(9):1884–1897, 1998.
[32] Liang Jie, Edelsbrunner Herbert, Fu Ping, Sudhakar Pamidighantam V., and Subramaniam Shankar. Analytical shape computation of macromolecules: I. Molecular area and volume through alpha shape. Proteins, 33(1):1–17, 1998.
[33] Edelsbrunner Herbert, Facello Michael, and Liang Jie. On the definition and the construction of pockets in macromolecules. Discret. Appl. Math, 88(1):83–102, 1998. Computational Molecular Biology DAM - CMB Series.
[34] Liang Jie, Edelsbrunner Herbert, Fu Ping, Sudhakar Pamidighantam V., and Subramaniam Shankar. Analytical shape computation of macromolecules: II. Inaccessible cavities in proteins. Proteins, 33(1):18–29, 1998.
[35] Binkowski T Andrew, Adamian Larisa, and Liang Jie. Inferring functional relationships of proteins from local sequence and spatial surface patterns. J. Mol. Biol, 332(2):505–526, 2003.
[36] Binkowski T Andrew, Joachimiak Andrzej, and Liang Jie. Protein surface analysis for function annotation in high-throughput structural genomics pipeline. Protein Sci, 14(12):2972–2981, 2005.
[37] Tseng Yan Y and Liang Jie. Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: a Bayesian Monte Carlo approach. Mol. Biol. Evol, 23(2):421–436, 2006.
[38] Tseng Yan Yuan, Dundas Joseph, and Liang Jie. Predicting protein function and binding profile via matching of local evolutionary and geometric surface patterns. J. Mol. Biol, 387(2):451–464, 2009.
[39] Perez-Rathke Alan, Fahie Monifa A, Chisholm Christina, Liang Jie, and Chen Min. Mechanism of OmpG pH-dependent gating from loop ensemble and single channel studies. J. Am. Chem. Soc, 140(3):1105–1115, 2018.
[40] Tian Wei, Chen Chang, Lei Xue, Zhao Jieling, and Liang Jie. CASTp 3.0: computed atlas of surface topography of proteins. Nucleic Acids Res, 46(W1):W363–W367, 2018.
[41] Caiani Lando, Casetti Lapo, Clementi Cecilia, and Pettini Marco. Geometry of dynamics, Lyapunov exponents, and phase transitions. Phys. Rev. Lett, 79(22):4361, 1997.
[42] Angelani L, Ruocco G, and Zamponi F. Relationship between phase transitions and topological changes in one-dimensional models. Phys. Rev. E, 72(1):016122, 2005.
[43] Kastner Michael. Phase transitions and configuration space topology. Rev. Mod. Phys, 80(1):167, 2008.
[44] Wales David J. Exploring energy landscapes. Annu. Rev. Phys. Chem, 69:401–425, 2018.
[45] Cimasoni David and Delabays Robin. The topological hypothesis for discrete spin models. J. Stat. Mech. Theory Exp, 2019(3):033216, 2019.
[46] Kastner Michael and Mehta Dhagash. Phase transitions detached from stationary points of the energy landscape. Phys. Rev. Lett, 107(16):160602, 2011.
[47] Edelsbrunner Herbert and Harer John L. Computational topology: an introduction. American Mathematical Society, Providence, RI, 2010.
[48] Edelsbrunner Herbert, Letscher David, and Zomorodian Afra. Topological persistence and simplification. Discrete Comput. Geom, 2002.
[49] Carlsson Gunnar. Topology and data. Bull. Am. Math. Soc, 46(2):255–308, 2009.
[50] Munkres JR. Elements of algebraic topology. CRC Press, 2018.
[51] Hatcher A. Algebraic topology. Tsinghua University Press, 2005.
[52] Cohen-Steiner David, Edelsbrunner Herbert, Harer John, and Morozov Dmitriy. Persistent homology for kernels, images, and cokernels, pages 1011–1020.
[53] Wagner Hubert, Chen Chao, and Vuçini Erald. Efficient computation of persistent homology for cubical data. In Topological methods in data analysis and visualization II, pages 91–106. Springer, 2012.
[54] Kaczynski Tomasz, Mischaikow Konstantin, and Mrozek Marian. Computational homology, volume 157. Springer Science & Business Media, 2006.
[55] Bolhuis Peter G., Dellago Christoph, and Chandler David. Reaction coordinates of biomolecular isomerization. Proc. Natl. Acad. Sci, 97(11):5877–5882, 2000.
[56] Hess Berk, Kutzner Carsten, van der Spoel David, and Lindahl Erik. GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput, 4(3):435–447, 2008.
[57] Levy RM, Srinivasan AR, Olson WK, and McCammon JA. Quasi-harmonic method for studying very low frequency modes in proteins. Biopolymers, 23(6):1099–1112, 1984.
[58] García Angel E. Large-amplitude nonlinear motions in proteins. Phys. Rev. Lett, 68(17):2696, 1992.
[59] Riccardi Laura, Nguyen Phuong H., and Stock Gerhard. Free-energy landscape of RNA hairpins constructed via dihedral angle principal component analysis. J. Phys. Chem. B, 113(52):16660–16668, 2009.
[60] Mu Yuguang, Nguyen Phuong H., and Stock Gerhard. Energy landscape of a small peptide revealed by dihedral angle principal component analysis. Proteins, 58(1):45–52, 2005.
[61] Altis Alexandros, Nguyen Phuong H., Hegger Rainer, and Stock Gerhard. Dihedral angle principal component analysis of molecular dynamics simulations. J. Chem. Phys, 126(24):244111, 2007.
[62] Peters Baron, Heyden Andreas, Bell Alexis T., and Chakraborty Arup. A growing string method for determining transition states: Comparison to the nudged elastic band and string methods. J. Chem. Phys, 120(17):7877–7886, 2004.


