Abstract
To gain insight into the reaction mechanisms of activated processes, we introduce an exact approach for quantifying the topology of high-dimensional probability surfaces of the underlying dynamic processes. Instead of Morse indices, we study the homology groups of a sequence of superlevel sets of the probability surface over high-dimensional configuration spaces using persistent homology. For alanine-dipeptide isomerization, a prototype of activated processes, we identify locations of probability peaks and connecting ridges, along with measures of their global prominence. Instead of lying at a saddle point, the transition state ensemble (TSE) of conformations is at the most prominent probability peak after those of the reactant and product, when proper reaction coordinates are included. Intuition-based models, even those exhibiting a double well, fail to capture the dynamics of the activated process. Peak occurrence, prominence, and locations can be distorted upon subspace projection. While principal component analysis accounts for conformational variance, it inflates the complexity of the surface topology and destroys dynamic properties of the topological features. In contrast, the TSE emerges naturally as the most prominent peak beyond the reactant/product basins, when projected to a subspace of minimum dimension containing the reaction coordinates. Our approach is general and can be applied to investigate the topology of high-dimensional probability surfaces of other activated processes.
Keywords: surface topology, landscape analysis, persistent homology, active process, transition state, energy flow
1. Introduction
Activated processes are ubiquitous in molecular systems, ranging from chemical reactions of small molecules to dynamic conformational changes and enzymatic reactions of proteins. In proteins, all functionally important processes are activated processes, which provide well-defined rates essential for proteins to carry out their roles in the cellular context, as proper timing is required for proper function.
The prevalent picture describing an activated process is that of a transition between two metastable basins on the free energy landscape separated by a barrier, whose height is large compared to thermal energy [1]. The slow time scale of activation arises from the fact that the molecular system can rarely accumulate sufficient energy in the relevant degrees of freedom (DoFs) to surmount the transition barrier. This simple and elegant picture originates from reaction rate theories, such as the well-known transition state theory and Kramers' theory [1–7], developed in studies of the dynamics of chemical reactions of small molecules.
A key concept in reaction rate theories is that of reaction coordinates: a few special coordinates exist that can fully determine the progress of a reaction process [8–10]. A requirement for reaction coordinates is that they must accurately locate the transition barrier. Accordingly, the numerous degrees of freedom (DoFs) in a complex molecular system (e.g., a protein molecule, a system of solute and solvent) can be divided into reaction coordinates and a heat bath. Reaction coordinates play central roles as they determine both the mechanism and the rate of activation. For example, to modify the activity of an enzyme, one should modify residues involved in the reaction coordinates of the enzyme activities [11,12], as this will modify both the reaction pathway and the barrier height for activation. In contrast, modifying residues that belong to the heat bath will not alter the enzymatic activity, as the role of the heat bath is to provide energy to the reaction coordinates to cross the activation barrier during rare fluctuations, which is largely a non-specific process.
Given such significance, it is important to develop a rigorous and quantitative criterion for determining the correct reaction coordinates. This task was accomplished with the development of the committor test, which is characterized by the committor value pB [8,9,13,14]: the probability that a dynamic trajectory initiated from a given configuration reaches the product basin before visiting the reactant basin. By definition, the reactant and product states have committor values of 0 and 1, respectively, whereas the optimal transition state coincides with pB = 0.5. The committor value pB therefore provides a rigorous parameterization of the reaction process. Thus, the intuitive, albeit qualitative, notion of reaction coordinates translates into a rigorous definition: they are the few coordinates that are sufficient for determining the committor value of any given configuration. Du et al. first adopted this rigorous definition of reaction coordinates in the context of protein folding [9]; Chandler and co-workers established its usage as a standard practice in the general context of activated processes [8].
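As a minimal illustration of the committor definition, the sketch below estimates pB by repeatedly launching trajectories from a chosen starting point. It assumes a hypothetical toy model, a symmetric random walk on sites 0 to 10 with site 0 as the reactant basin and site 10 as the product basin; the function name and parameters are illustrative only and are unrelated to the molecular simulations discussed here.

```python
import random

def estimate_committor(start, n_sites=10, n_trials=2000, seed=0):
    """Estimate p_B for a symmetric random walk on sites 0..n_sites.

    Site 0 plays the role of the reactant basin and site n_sites the
    product basin (a hypothetical toy model, not alanine dipeptide).
    """
    rng = random.Random(seed)
    reached_product = 0
    for _ in range(n_trials):
        x = start
        # Propagate until the walker is absorbed in either basin.
        while 0 < x < n_sites:
            x += 1 if rng.random() < 0.5 else -1
        if x == n_sites:
            reached_product += 1
    return reached_product / n_trials

# For this toy model the exact committor from site i is i / n_sites.
p_b_mid = estimate_committor(start=5)             # exact value: 0.5
p_b_near_reactant = estimate_committor(start=1)   # exact value: 0.1
```

In this toy model the midpoint site has pB close to 0.5 and therefore plays the role of the transition state, while sites near the reactant basin have pB close to 0.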
While this rigorous criterion has been well accepted, identifying the correct reaction coordinates turns out to be rather difficult, even for systems of modest complexity. One example is the C7eq → C7ax isomerization reaction of the alanine dipeptide in vacuum, a prototype for studying biomolecular conformational transitions. Alanine dipeptide is the smallest molecule that satisfies the criterion that distinguishes complex molecules from small molecules: the non-reaction coordinates in the system constitute a large enough thermal bath to provide the reaction coordinates with adequate energy to cross the activation barrier. As a result, the C7eq → C7ax transition displays features of activated dynamics that are unique to complex molecules but absent in small molecules. It was first found by Bolhuis et al. [15] that the conventional Ramachandran torsional angles ϕ and ψ, while sufficient for distinguishing the two stable basins, are inadequate for locating the transition state. Instead, another torsional angle θ1 was found to be an essential reaction coordinate, a rather counter-intuitive finding [16,17]. The counter-intuitive nature of reaction coordinates has turned out to be the norm rather than the exception in complex systems [15,18,19], posing a formidable challenge: without intuition, there appears to be no guidance in sight.
The challenge in identifying reaction coordinates has motivated efforts to develop rigorous methods for their identification in complex systems since the early 2000s [8,9,15,18,20,21]. Beyond unreliable intuition and trial-and-error, the first systematic method was based on machine learning, in which a neural network was used to automatically identify the optimal reaction coordinates from a prepared pool of candidates [21]. This method was used to successfully identify the key solvent coordinate that controls the isomerization dynamics of an alanine dipeptide in solution, which had defied prior intuition-based trial-and-error efforts. The success of this machine learning-based approach led to a series of further developments along similar lines [10,18,19,22–28].
However, a major deficiency of machine learning-based methods is that they cannot answer the real question concerning reaction coordinates: why are some coordinates more important for activation than others? Instead, these methods can only inform us empirically which coordinates appear to be important based on well-defined criteria. Recently, Ma and co-workers developed a rigorous theory for mapping out the flow of potential energy through individual coordinates [17,29]. It was found that the reaction coordinates are the coordinates that carry high energy flows during the activation process. This result suggested an appealing physical picture: energy flows from fast coordinates into slow coordinates during activation, so that adequate energy can accumulate in the slow coordinates, enabling the system to cross the activation barrier along these slow coordinates. This physical picture also suggested that reaction coordinates are preferred channels of energy flow and are encoded in the protein structure. Through analysis of energy flow, one can obtain a prioritized list of coordinates that likely play the most significant roles in the activation process.
The most celebrated concept in reaction rate theories is that of the transition state (TS), which is the dynamic bottleneck of an activated process. The conventional view is that the TS is located at a critical point with a Morse index of 1 on the high-dimensional potential energy landscape of the molecule. If all the reaction coordinates of an activated process are known, the TS will be an index-1 critical point on the free energy surface of the reaction coordinates, with the sole direction of negative Hessian curvature pointing along the ideal one-dimensional reaction coordinate valid at the top of the activation barrier. Based on this picture, the surface of the multi-dimensional probability distribution of reaction coordinates constructed from an ensemble of reactive trajectories should be highly structured, with the important dynamic states (e.g., the TS) manifesting as critical topological features.
To gain insights into reaction mechanisms, a common practice is to study features of the free energy surface. However, all relevant information about an activated process is contained in the reactive trajectories. Instead of the free energy surface, one can construct the dynamic probability surface of the transition state. For certain systems, this can be achieved using the transition path sampling (TPS) method. Once a large ensemble of reactive trajectories is generated, an ensemble of configurations that the system sampled during the transition process can be harvested. From this ensemble, one can generate a dynamic probability landscape or transition state surface, which is usually high-dimensional.
The focus of this work is the analysis of the exact topology of the high-dimensional dynamic probability landscape, and the establishment of its relationship with the transition state of the activated process. Instead of the Morse index, we characterize the high-dimensional transition state surface by its topological structures in homology groups. We analyze topological changes in the superlevel set of the probability surface at different probability levels. Using the technique of persistent homology, we identify the locations of probability peaks in the high-dimensional configuration space and the ridges connecting them, along with measures of their global prominence.
We apply this approach to study the activated process of C7eq → C7ax isomerization of the alanine dipeptide, a well-characterized model system for studying protein conformational changes [15,17,29]. After the exact topological structures of the dynamic probability surface are constructed, we identify the location of the ensemble of transition state conformations. Instead of a saddle point with a Morse index of 1, the transition state ensemble (TSE) is found to be located at the top of the most prominent peak after those of the reactant and product. In addition, the dynamically important topological structures are retained when the surface is projected onto the 2-dimensional plane of (ϕ − θ1), which are known to be the reaction coordinates [17]. In contrast, when projected to the intuition-derived (ϕ − ψ)-plane, the topological features of the probability surface no longer contain dynamic information on the transition state. Furthermore, we find that PCA dimension reduction distorts surface topologies, such that the transition state ensemble cannot be recovered from PCA-derived topological features: instead of simplifying the surface, PCA destroys the dynamic properties of topological features of the original transition state surface.
Overall, we introduce a novel approach for quantifying the exact topology of high-dimensional surfaces. With this approach, we have characterized the precise topology of the transition state surface of the activated process of alanine dipeptide isomerization. We have also established that the TSE is located at the most prominent probability peak beyond those of the reactant and the product, when the subspace of projection contains the proper reaction coordinates. The new language of homology groups and the technique of persistent homology on superlevel sets of high-dimensional probability surfaces introduced here are general and can be applied to investigations of the topology of high-dimensional surfaces over configuration space encountered in other problems of activated processes.
2. Theory, Models and Methods
We briefly discuss how the dynamic probability surface can be constructed for the molecular system of interest. We then discuss the problem of understanding the topological structures of the probability surface over the relevant configuration space. Our focus will be the introduction of the method of homology groups for analyzing the exact topological structures of the dynamic probability surface. This approach is based on the homological structures of the sequence of the superlevel sets of a probability surface, and differs from previous efforts that are based on critical points, Morse theory, and Euler characteristics.
2.1. Constructing dynamic probability surface: transition state ensemble and path sampling
We use the transition path sampling (TPS) method [8] to generate a sufficiently large ensemble of reactive trajectories for the activated process we study. To ensure the transition process is fully covered without bias, the duration of the trajectories is much longer than that of the transition process. Along each trajectory, we further harvest a number of system configurations at a specific time interval to generate an ensemble of the relevant configurations that the system samples during the transition process. From this ensemble, we construct a dynamic probability landscape over a d-dimensional subspace, namely, the joint probability distribution over the d coordinates.
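The construction of a joint probability distribution from harvested configurations can be sketched as a d-dimensional histogram. The example below is a minimal sketch, assuming a regular grid and using NumPy's `histogramdd`; the function name `probability_surface`, the bin count, and the synthetic two-cluster data are illustrative assumptions and stand in for configurations harvested from TPS trajectories.

```python
import numpy as np

def probability_surface(configs, bins=20, ranges=None):
    """Joint probability over a d-dimensional subspace from sampled configurations.

    configs : (n_samples, d) array of harvested coordinates (e.g. torsion angles).
    Returns the per-cell probabilities (summing to 1) and the bin edges.
    """
    counts, edges = np.histogramdd(configs, bins=bins, range=ranges)
    return counts / counts.sum(), edges

# Synthetic stand-in for harvested configurations: two Gaussian clusters in 2D.
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=(-1.0, 0.0), scale=0.3, size=(5000, 2)),
    rng.normal(loc=(1.0, 0.5), scale=0.3, size=(5000, 2)),
])
prob, edges = probability_surface(samples, bins=30, ranges=[(-3, 3), (-3, 3)])
```

Each grid cell of `prob` then carries the probability value assigned to the corresponding cube of the configuration subspace, which is the form of input needed for a cubic-complex representation.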
2.2. Configuration space and probability surface.
We now discuss the general problem of analyzing a probability surface over a configuration space of arbitrary dimension. We first introduce a few relevant concepts, and then review current approaches. This is followed by an exposition of the basics of homology groups and persistent homology. We focus on developing these concepts in the setting of cubic complexes and superlevel sets of the probability surface. We will also discuss the key concept of filtration and its construction, and how it can be used to uncover the exact high-dimensional topological features.
Configuration space.
We begin our discussion in a general setting. We use d features to describe the configurations of a molecule. These can be bond lengths, bond angles, and torsional angles describing the structure of the molecule. For alanine dipeptide, there are a total of 60 features that fully describe the configuration of the molecule. For a molecule of finite size, the configuration space $\mathbb{X}$ is compact. It is likely that $\mathbb{X}$ lies in a subspace of the Euclidean space $\mathbb{R}^d$, as there are couplings between different degrees of freedom.
Probability surface.
Each configuration $x \in \mathbb{X}$ has a probability f(x) ∈ [0, 1] associated with it. Namely, we have a function $f: \mathbb{X} \to [0, 1]$ that assigns the probability value f(x) to each configuration x.
Superlevel and sublevel sets.
For a value 0 ≤ a ≤ 1, we can identify all x whose probability values satisfy f(x) ≥ a, which form the superlevel set $\mathbb{X}_{\ge a}$:
$$\mathbb{X}_{\ge a} = \{x \in \mathbb{X} \mid f(x) \ge a\}.$$
Similarly, we can have the sublevel set $\mathbb{X}_{\le a}$:
$$\mathbb{X}_{\le a} = \{x \in \mathbb{X} \mid f(x) \le a\}.$$
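On a gridded probability surface, superlevel and sublevel sets reduce to boolean masks over grid cells. The following is a small sketch under that assumption; the function names and the toy one-dimensional values are hypothetical and chosen only to show two peaks.

```python
import numpy as np

def superlevel_set(f, a):
    """Boolean mask of grid cells with f(x) >= a."""
    return f >= a

def sublevel_set(f, a):
    """Boolean mask of grid cells with f(x) <= a."""
    return f <= a

# Toy probability values on a 1D grid with two peaks (0.5 and 0.3).
f = np.array([0.0, 0.1, 0.5, 0.2, 0.05, 0.3, 0.1, 0.0])
mask = superlevel_set(f, 0.25)   # cells above a "sea level" of 0.25
```

At the level 0.25 the mask selects two separate cells, one under each peak, so the superlevel set has two components. Lowering the level can only enlarge the superlevel set.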
2.3. Topology of probability surface: Critical points and Morse indices.
It is of great importance to understand the topological structures of the probability surface f(x). The approach of analyzing the critical points is well-practiced for exploring the topological structures of high-dimensional surfaces.
Critical Points.
The critical points of f are where all first derivatives of f vanish:
$$\frac{\partial f}{\partial x_i} = 0, \quad i = 1, \ldots, d.$$
Critical points are coordinate-independent and can be further classified into different types by the second derivatives. We can organize the second derivatives into a d × d Hessian matrix Hf(x), whose entries are:
$$[H_f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}.$$
At non-degenerate critical points, where the Hessian matrices are non-singular, all eigenvalues of the Hessian are nonzero and can be a mixture of positive and negative values. The number of negative eigenvalues η is the Morse index of the critical point.
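The Morse index can be computed directly from the eigenvalues of the Hessian. The sketch below assumes the Hessian at a critical point is already available as a symmetric matrix; the function name, the tolerance, and the two example Hessians are illustrative assumptions.

```python
import numpy as np

def morse_index(hessian, tol=1e-10):
    """Number of negative Hessian eigenvalues at a non-degenerate critical point."""
    eigvals = np.linalg.eigvalsh(hessian)
    if np.any(np.abs(eigvals) <= tol):
        raise ValueError("degenerate critical point: singular Hessian")
    return int(np.sum(eigvals < 0))

# Hypothetical example: f(x, y) = x**2 - y**2 has a critical point at the origin
# with constant Hessian diag(2, -2), i.e. a saddle of index 1.
saddle = np.array([[2.0, 0.0], [0.0, -2.0]])
# Hessian of -(x**2) - 0.5*y**2 at the origin: a maximum of index 2.
maximum = np.array([[-2.0, 0.0], [0.0, -1.0]])
index_saddle = morse_index(saddle)
index_maximum = morse_index(maximum)
```

In two dimensions a saddle has index 1 and a peak (local maximum) has index 2; more generally, a peak of a surface over d coordinates has index d.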
Topology of surface by critical points and Morse theory.
At a critical point, the topology of sublevel sets changes. For a critical point x with f(x) = a, consider the sublevel set $\mathbb{X}_{\le a-\epsilon}$ just below the critical value and the sublevel set $\mathbb{X}_{\le a+\epsilon}$ just above it, for a small ϵ > 0. The topologies of $\mathbb{X}_{\le a-\epsilon}$ and $\mathbb{X}_{\le a+\epsilon}$ are different, as one cannot be deformed into the other. However, once an η-dimensional handle is attached to $\mathbb{X}_{\le a-\epsilon}$ at the critical point x, it has the same homotopy type as $\mathbb{X}_{\le a+\epsilon}$ [30], namely, these two can be deformed into one another. In fact, the homotopy type of the configuration space can be characterized by topological changes at critical points. The problem of determining the topology of the surface f(x) on the configuration space then becomes the problem of determining the critical points of f(x) and their Morse indices.
Euler characteristics.
A special case of the celebrated Morse inequalities relates the number of critical points of different indices to the Euler characteristic χ, which is
$$\chi = \sum_{k=0}^{d} (-1)^k m_k,$$
where mk is the number of critical points of index k. The Euler characteristic provides information on the number of holes of various dimensions, and can also be written as:
$$\chi = \sum_{k=0}^{d} (-1)^k \beta_k,$$
where βk is the k-th Betti number, which counts the number of k-dimensional holes. For example, in $\mathbb{R}^3$, β0 counts the number of connected components, β1 counts the number of tunnels, and β2 counts the number of voids. Voids and tunnels have been extensively examined in studies of protein structures and functions [31–40].
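The two expressions for the Euler characteristic can be checked on textbook surfaces. The sketch below uses well-known critical-point counts and Betti numbers for the 2-sphere and the torus; the function name is illustrative, and the examples are not taken from the molecular system studied here.

```python
def euler_characteristic(counts):
    """Alternating sum of counts[k] over k = 0, 1, 2, ...

    counts[k] may be the number m_k of index-k critical points
    or the k-th Betti number beta_k; both give the same value.
    """
    return sum((-1) ** k * c for k, c in enumerate(counts))

# Height function on a 2-sphere: one minimum (index 0), one maximum (index 2).
chi_sphere_morse = euler_characteristic([1, 0, 1])
# A sphere with an extra bump: one minimum, one saddle, two maxima.
chi_bumpy_morse = euler_characteristic([1, 1, 2])
# Betti numbers of the 2-sphere: one component, no tunnels, one void.
chi_sphere_betti = euler_characteristic([1, 0, 1])
# Torus: Betti numbers (1, 2, 1).
chi_torus_betti = euler_characteristic([1, 2, 1])
```

The bumpy sphere shows that the individual counts mk need not equal the Betti numbers, while the alternating sums still agree.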
Prior work on characterizing topological surface of molecular system.
The topology of surfaces over configuration space has been the subject of many investigations [41–45]. For example, thermodynamic phase transitions have been explored from the viewpoint of topological changes of submanifolds of the configuration space. Changes in Euler characteristics have been found to signal the occurrence of phase transitions for certain systems [42,43].
However, it is difficult in practice to understand the global surface topology using critical points and Morse indices. There are only a few model systems where all critical points are known analytically [46]. Numerical computation faces numerous challenges. First, it is difficult to identify all critical points. Methods based on Newton-Raphson and other techniques require initial guesses and do not guarantee identification of all critical points. Furthermore, many initial guesses fall into the same large basins of attraction and yield no new information. Second, as the probability surface can only be constructed approximately from points sampled in the high-dimensional configuration space, the degree of sampling may not be sufficiently detailed to capture the original topology of the configuration space. Third, sampling has to be sufficiently detailed to accurately measure the first and second derivatives along each coordinate direction at all locations. Fourth, as derivatives reflect local properties of the probability surface, there can be numerous critical points that are trivial and of little importance if the probability surface is rugged. It is difficult to distinguish those reflecting the global features of the surface from those reflecting local dimples or even those that are due to noise in sampling. While there have been efforts in visualizations of potential surfaces of molecules and other heuristics (see [44] and references within), to our knowledge, it is not yet possible to characterize all critical points and the homotopy type of the probability or potential surface of a molecule in three-dimensional space.
2.4. Topology of probability surface: Homology group and persistent homology
Background and overview.
Instead of analyzing the critical points, we adopt a different approach. We are interested in global features such as the occurrence of different peaks, and how they are connected. This can be achieved by examining globally the structures of holes of various dimensions and how such structures change at different sublevel or superlevel sets of the probability landscape. We adopt an approach based on the theory of homology groups and persistent homology [47–49]. Homology groups describe holes in topological spaces and are a classic topic in algebraic topology [50,51]. Persistent homology computes these holes and measures their scales at different spatial resolutions [48,49]. Compared to homotopy, homology groups are more amenable to computation. Our approach is only feasible due to recent progress in computational topology and topological data analysis [47–49,52,53].
To provide an intuitive picture illustrating this approach, we envision a sea level on top of the probability landscape over the configuration space (Fig 1). We are interested in how mountain peaks emerge from the sea when the sea level is lowered gradually, and how independent mountain peaks become connected by land-ridges when the sea level is lowered further. These are related to 0-dimensional holes, which are connected components.
Figure 1:

Sea levels of the probability landscape f(x) on the configuration space. The superlevel sets at different sea levels (white regions) can have different topology, with different numbers of components shown in white.
Complex and chain.
We first discuss how to represent the d-dimensional configuration space [54]. In this study, we use cubic complexes [53,54]. A d-dimensional cubic complex K is constructed from a union of points, line segments, squares, cubes, and their k-dimensional counterparts glued together properly, where k ≤ d and all have unit length (except points, which have no length). We call each of these a k-cell or a k-cube (see Fig 2a for a 3-cell). While the topology of the configuration space is invariant whether it is represented by cubic complexes or other complexes such as simplicial complexes, the grid representation of the molecular configurations makes this choice convenient [53].
Figure 2:

An illustration of a cubic complex. a). A 3-cubic cell, with the orientation of its 2-faces shown. b). Two 3-cubes are summed to form a 3-chain. The internal square is contributed twice, once from each of the two cubes. As each surface is oriented (i.e., counterclockwise around the outward surface normal), these two squares have opposite orientations and cancel each other when summed. c). An example of a 2-chain formed by 2-cells and its boundary.
We can build up our cubic complex K from cubes to represent the configuration space. Given a set of k-cells, we can sum them up. We call the sum of a set of k-cells a k-chain. Fig 2b shows an example where two 3-cells are summed up to form a 3-chain. Fig 2c shows how nine 2-cells are summed up to form a 2-chain. Here the binary operation of summing two k-cells is orientation sensitive: two k-cells of the same underlying space but opposite orientation cancel each other out when summed up. If a set is equipped with a binary operation satisfying certain requirements, it is called a group mathematically. The set of k-chains from the complex K with our binary operation of summation therefore forms a chain group Ck(K).
Boundaries.
We now define boundaries. The boundary of an individual k-cell is the set of its (k − 1)-dimensional faces, which by definition forms a (k − 1)-chain. The boundary of a 3-cubic cell is shown in Fig 2a, which is the set of the six oriented squares. The boundary of a k-chain is the sum of the boundaries of its element k-cells. Because of the nature of our sum operation, internal structures cancel out. Consider the boundary of the two neighboring three-dimensional cubes in Fig 2b. The interfacial square is contributed twice, once from each cube, but with opposite orientations, as both are counterclockwise around their outward normals. When these two cubes are glued together, these two boundary squares are summed up, and they cancel each other out. The overall outcome of this summation is indeed the outer boundary of the union of the two neighboring cubes. Fig 2c shows a 2-chain and its boundary. This holds true in other dimensions as well: a (k − 1)-dimensional face shared by two neighboring k-cells appears with opposite orientations and cancels out upon summation.
With this summation, we obtain the boundary of a k-chain from K by applying the boundary operator ∂k:
$$\partial_k : C_k(K) \to C_{k-1}(K). \tag{1}$$
Cycles.
There are certain k-chains that have no boundaries. They are called k-dimensional cycles or k-cycles. With the binary operation of summation discussed earlier, the set of k-cycles from K forms the cycle group Zk(K):
$$Z_k(K) = \{c \in C_k(K) \mid \partial_k c = \emptyset\}.$$
As an example, we consider the three-dimensional cube again (Fig 2a). We take its six surface squares that fully enclose the solid cube. These squares form a 2-chain. As a whole, this 2-chain itself does not have boundaries, as the six squares are glued together along the borders and there are no openings.
Kernel and image.
Analogous to the null space or kernel in linear algebra, the cycle group Zk(K) is the kernel of the operator ∂k, as each of its member k-chains has no boundary by definition and ∂k will send it to null:
$$Z_k(K) = \ker \partial_k.$$
We now move one dimension up and consider boundaries of (k + 1)-chains in K. Boundaries are one dimension lower and therefore the boundaries of (k + 1)-chains are k-chains. They are called the k-boundaries of K and form the k-boundary group Bk(K). As each k-boundary is obtained when ∂k+1 is applied to a (k + 1)-chain, collectively they are the image of ∂k+1:
$$B_k(K) = \operatorname{im} \partial_{k+1}.$$
It turns out that all k-boundaries themselves have no (k − 1)-boundaries. This is due to a fundamental property of the boundary homomorphisms in topology, which states that for any k ≥ 1 and any (k + 1)-chain c [50],
$$\partial_k \, \partial_{k+1} \, c = \emptyset. \tag{2}$$
From our previous example in Fig 2a, the 2-chain of the six squares that enclose the solid cube is the boundary of a 3-chain (a lone 3-cell in this case). They themselves do not have any opening, and hence the boundary of this 2-chain is ∅. A consequence of this general property is that we have Bk(K) ⊆ Zk(K) ⊆ Ck(K).
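The cancellation of shared faces and the property in Eqn (2) can be verified numerically on a small example. The sketch below is a two-dimensional analogue of Fig 2b, with two unit squares glued along one edge; the vertex and edge numbering, orientation conventions, and variable names are assumptions made for this illustration.

```python
import numpy as np

# Two unit squares glued along a shared edge (a 2D analogue of Fig 2b).
# Vertices: 0=(0,0) 1=(1,0) 2=(2,0) 3=(0,1) 4=(1,1) 5=(2,1)
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]  # (tail, head)
# Each square lists its boundary edges with sign, traversed counterclockwise.
squares = [
    {0: +1, 5: +1, 2: -1, 4: -1},   # left square
    {1: +1, 6: +1, 3: -1, 5: -1},   # right square
]

# Boundary matrix d1: column j is head(e_j) - tail(e_j).
d1 = np.zeros((6, len(edges)), dtype=int)
for j, (tail, head) in enumerate(edges):
    d1[tail, j] -= 1
    d1[head, j] += 1

# Boundary matrix d2: column j is the signed sum of edges around square j.
d2 = np.zeros((len(edges), len(squares)), dtype=int)
for j, signed_edges in enumerate(squares):
    for e, sign in signed_edges.items():
        d2[e, j] = sign

chain = np.array([1, 1])          # the 2-chain: left square + right square
boundary = d2 @ chain             # its boundary as a 1-chain
```

The shared edge (index 5) receives coefficient +1 from the left square and −1 from the right square, so it drops out of `boundary`, leaving only the outer perimeter. The product `d1 @ d2` is the zero matrix, which is the matrix form of Eqn (2) for this complex.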
Homology group and Betti number.
There are two types of k-cycles: those that enclose (k + 1)-bodies and those that enclose (k + 1)-holes (Fig 3a). The former are boundaries of the enclosed bodies of (k + 1)-chains and can be collapsed into a point (Fig 3a, k-cycles in dark brown/brown enclosing (k + 1)-chains shaded in light brown). The latter are not boundaries of (k + 1)-bodies and cannot be collapsed into a point (Fig 3a, other cycles). We will first distinguish these two types of cycles. Furthermore, cycles that do not enclose a body may differ from one another, as the holes they contain may be different (Fig 3a). We will distinguish these different situations as well.
Figure 3:

An illustration of homology classes of k-cycles in a (k + 1)-manifold. a). The two k-cycles in brown and dark brown enclose a (k + 1)-body ((k + 1)-chains in light brown). They contain no holes and are boundary cycles. They belong to the same equivalence class of [∅]. The k-cycles in green/light green, purple/light purple, and blue/light blue each contain holes hα, hβ and hγ, respectively, and are part of equivalence classes of [hα], [hβ] and [hγ], respectively. b). The k-cycle l1 contains a hole. The boundary cycle l2 contains no hole but a (k + 1)-body. Note that l1 and l2 share a common piece of boundary but in opposite orientations. c). When l1 and l2 are summed up, we obtain the k-cycle l3, which contains the same hole as l1. Both l1 in b) and l3 in c) belong to the same equivalence class of k-cycles containing this hole.
We consider all cycles containing the same hole essentially the same and group them into one equivalence class of cycles. As an illustration, green/light green k-cycles in Fig 3a form an equivalence class [hα] as they all contain hole hα. So do the purple/light purple k-cycles (class [hβ] containing hole hβ), and the blue/light blue k-cycles (class [hγ] containing hole hγ). A special equivalence class consists of cycles containing the ∅ hole or no hole (Fig 3a, brown/dark brown cycles, class [∅]). We call cycles in each equivalence class homologous to each other. If they encircle different holes, they belong to different equivalence classes.
We elaborate on this. Among all elements of Zk(K), which are k-cycles, we identify all k-boundaries, which contain (k + 1)-bodies and are elements of Bk(K). Because of Eqn (2), they have no (k + 1)-holes. We put them into a class denoted as [∅] as they contain no holes (or ∅-hole) (Fig 3a, brown/dark brown cycles ∈ [∅]). The remaining k-cycles of Zk(K) are not in [∅] but may contain different holes. We identify those that contain hole hα, and put them into the equivalence class denoted as [hα] (Fig 3a, green/light green cycles ∈ [hα]). Remaining k-cycles that contain hole hβ are put into the class [hβ], and so on (Fig 3a, purple/light purple cycles ∈ [hβ], and blue/light blue cycles ∈ [hγ]). Each element of the set {[∅], [hα], [hβ], ⋯} is an equivalence class.
As these equivalence classes themselves form a set, and the outcome of the binary operation of summation on elements of Zk is preserved, this set forms a new group. This new group is called a quotient group, as it is obtained from Zk(K) after factoring out the boundaries Bk(K). The k-th homology group Hk(K) is this quotient group:
$$H_k(K) = Z_k(K) / B_k(K) = \ker \partial_k / \operatorname{im} \partial_{k+1}.$$
The elements of Hk(K) are equivalence classes of homologous cycles representing the holes (or lack thereof) they enclose.
Two k-cycles are homologous to each other if they contain the same hole, or equivalently, if one can be obtained from the other by adding a k-boundary (Fig 3b and Fig 3c). To illustrate this, note that the cycle labeled l3 in Fig 3c can be obtained by adding the k-boundary of a (k + 1)-body labeled l2 to the k-cycle labeled l1 enclosing a hole (Fig 3b). Due to the cancellation nature of our summation, the commonly shared piece of the boundaries is cancelled out and we have the larger k-cycle l3 in Fig 3c enclosing the same hole. Up to the difference of a boundary of a solid (k + 1)-body, these two k-cycles are the same and belong to the same equivalence class. It is not difficult to see that we can repeat this operation of adding certain k-boundaries to convert any homologous k-cycles into one another.
The number of independent equivalence classes, or the number of independent k-dimensional holes, is given by the dimension of the homology group. It is called the k-th Betti number βk(K):
$$\beta_k(K) = \dim H_k(K).$$
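Since the homology group is the quotient of the kernel of ∂k by the image of ∂k+1, its dimension can be obtained from matrix ranks as βk = (number of k-cells − rank ∂k) − rank ∂k+1. The sketch below is a minimal illustration under the assumption of rational coefficients, using `numpy.linalg.matrix_rank`; the function name `betti_numbers` and the unit-square example are hypothetical and not the implementation used in this study.

```python
import numpy as np

def betti_numbers(boundary_maps, n_cells):
    """Betti numbers over the rationals.

    boundary_maps[k] is the matrix of the boundary operator from k-chains to
    (k-1)-chains for k >= 1; n_cells[k] is the number of k-cells.
    beta_k = (n_cells[k] - rank d_k) - rank d_{k+1}, with d_0 = 0.
    """
    top = len(n_cells) - 1
    rank = {k: (int(np.linalg.matrix_rank(boundary_maps[k])) if k in boundary_maps else 0)
            for k in range(top + 2)}
    return [int(n_cells[k] - rank[k] - rank[k + 1]) for k in range(top + 1)]

# Unit square: vertices 0=(0,0) 1=(1,0) 2=(1,1) 3=(0,1); edges as (tail, head).
edges = [(0, 1), (1, 2), (3, 2), (0, 3)]
d1 = np.zeros((4, 4), dtype=int)
for j, (tail, head) in enumerate(edges):
    d1[tail, j], d1[head, j] = -1, 1
d2 = np.array([[1], [1], [-1], [-1]])   # counterclockwise boundary of the filled square

hollow = betti_numbers({1: d1}, n_cells=[4, 4])            # perimeter only
filled = betti_numbers({1: d1, 2: d2}, n_cells=[4, 4, 1])  # perimeter plus the 2-cell
```

The perimeter alone has one component and one 1-dimensional hole. Adding the 2-cell turns the perimeter cycle into a boundary, and the hole disappears.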
Filtration.
We now examine the topological structures of holes in the probability landscape on the configuration space, when we restrict to configurations with probabilities above a certain value. By gradually adjusting this value, we will be able to trace out the details of topological changes. For an illustration, envision a sea on top of the probability landscape (Fig 4). At the level of f(x) = 1, it covers the whole landscape. The domain of the part of the landscape above the sea level is ∅. We gradually lower the sea level to value b1, when the first peak emerges from the sea (birth of the first peak). At this time, we have the superlevel set at level b1, which is the set of points {x : f(x) ≥ b1}. They form the white region(s) in Fig 4. We further lower the sea level to b2 when another peak emerges (birth of the second peak), at which time we have the superlevel set at level b2. Suppose we continue this process until the sea level reaches d2, where the two peaks are merged together (death place of the second peak) by a land ridge that has just emerged above the sea level. At this sea level, we have the superlevel set at level d2. At each of these levels, the topology of the superlevel set changes, namely, one component, two components, and then one component again. These changes are captured by the changing homology groups and the Betti numbers.
Figure 4:

The probability landscape and the topology of its superlevel sets. a) The landscape and a sea level. The superlevel sets are the regions of the domain (shown as a plane) whose landscape value is above the sea. b) At f(x) = 1, all is below the sea level and the superlevel set is ∅. At f(x) = b1, b2 and d2, the topology of the superlevel set (shown in white) changes. At f(x) = 0, all is above sea level and the superlevel set is the full configuration space. c) The persistent diagram of the birth and death values of f(x) for the 0-th homology group representing the two peaks. The sublevel sets below the sea are shown in blue.
We now generalize. We have a descending sequence of probability values corresponding to the lowering sea level:
a0 > a1 > ⋯ > an,
and the corresponding superlevel sets {x : f(x) ≥ ai}, or the domains of the part of the landscape above the sea level, which are nested subspaces of the configuration space:
{x : f(x) ≥ a0} ⊆ {x : f(x) ≥ a1} ⊆ ⋯ ⊆ {x : f(x) ≥ an}.
Recall we have the full configuration space represented by a cubic complex K. Each superlevel set is represented by a subcomplex Ki ⊂ K, which can be derived from the original full complex K. We then have the corresponding sequence of subcomplexes:
K0 ⊆ K1 ⊆ ⋯ ⊆ Kn.
This sequence of subcomplexes is called a filtration.
We are interested in how the topology of the superlevel set evolves at different ai, i = 0, ⋯, n. This is represented by the corresponding sequence of homology groups connected by linear maps:
Hk(K0) → Hk(K1) → ⋯ → Hk(Kn),
where Hk(Ki) denotes the k-th homology group of Ki.
Persistence and Persistent diagram.
As we move from Ki−1 to Ki, we may gain a new equivalence class (e.g., a new peak for 0-th homology as in our example), or we may lose one (e.g., when a peak is merged with another one). We say that an equivalence class of a k-cycle [αi] is born at ai if it is present in Ki but absent in Ki−1 for any value of ai−1 > ai. The class dies at ai if it is present in Ki−1 for any value of ai−1 > ai but not in Ki. We record the location and the value of ai, namely, the corresponding k-cube and its probability value, whose inclusion leads to the birth and death events.
The prominence of the topological feature of a k-cycle is encoded in its life-time or persistence. Denote the birth value and the death value of class [αi] as bi and di, respectively. The persistence of class [αi] is then bi − di.
In the example shown in Fig 4, the equivalence class of 0-cycles (components) associated with the first peak is born at f(x) = b1. The equivalence class associated with the second peak is born at f(x) = b2. At f(x) = d2, these two components merge together. We say that the second peak dies at d2, and its persistence is b2 − d2. The first peak dies at f(x) = 0, and its persistence is b1 − 0 = b1.
We record the birth and death events of homology classes in a two-dimensional plot, which is called the persistent diagram. Each homology class is represented by a point in this diagram, where the birth value bi and the death value di are taken as its coordinates (bi, di). Fig 4c shows the persistent diagram of our illustrative example.
In general, we have the k-th persistent diagram of k-cycles for our probability function f. It is the set of points such that each point (x1, x2) represents a distinct topological feature of k-cycle, which is present in the superlevel set at level a for every a from the birth value x1 down to, but not including, the death value x2.
Computation.
The key to studying homology groups in high-dimensional space is the construction of the complex K to represent the configuration space. In this study, we use the cubic algorithm of [53] with modifications, so it can be applied to higher dimensions. We consider only the 0-th persistent homology groups, which record the birth and death of probability peaks. The locations x where birth and death events occur, namely, the corresponding k-cubes, are also computed.
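For readers who wish to experiment, 0-th persistent homology of superlevel sets on a regular grid can be sketched with a union-find structure. The code below is not the modified cubic algorithm of [53]; it is a simplified stand-in under stated assumptions: the function name `peak_persistence` is hypothetical, two grid cells are treated as adjacent when they differ by one step along exactly one axis, and there is no periodic wrap-around. It returns the birth value, death value, birth location, and death location of each peak.

```python
import numpy as np


def peak_persistence(p):
    """0-th persistence of the superlevel sets of a gridded function p.

    Returns (birth_value, death_value, birth_index, death_index) tuples,
    sorted by decreasing persistence. Assumption: cells are adjacent when
    they differ by one step along a single axis (no periodic wrap-around).
    """
    p = np.asarray(p, dtype=float)
    # Visit cells from the highest value down: this is the lowering sea level.
    order = sorted(np.ndindex(p.shape), key=lambda idx: -p[idx])
    parent = {}  # union-find forest over cells already above the sea
    top = {}     # component root -> index of that component's highest cell

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for idx in order:
        parent[idx] = idx
        top[idx] = idx
        for axis in range(p.ndim):
            for step in (-1, 1):
                nb = idx[:axis] + (idx[axis] + step,) + idx[axis + 1:]
                if nb not in parent:
                    continue  # outside the grid, or still below the sea
                keep, drop = find(idx), find(nb)
                if keep == drop:
                    continue
                # Elder rule: the component with the lower peak dies here.
                if p[top[keep]] < p[top[drop]]:
                    keep, drop = drop, keep
                if p[top[drop]] > p[idx]:  # ignore zero-persistence events
                    pairs.append((float(p[top[drop]]), float(p[idx]),
                                  top[drop], idx))
                parent[drop] = keep
    # The highest peak never merges into another; it dies at level 0.
    pairs.append((float(p[order[0]]), 0.0, order[0], None))
    return sorted(pairs, key=lambda t: t[1] - t[0])


# Hypothetical 2-d landscape: peaks 0.9 and 0.6 joined by a ridge at 0.3.
grid = np.array([
    [0.1, 0.2, 0.1, 0.1, 0.1],
    [0.2, 0.9, 0.3, 0.6, 0.1],
    [0.1, 0.2, 0.1, 0.1, 0.1],
])
peaks = peak_persistence(grid)
print(peaks)
```

Because the neighbour loop runs over `p.ndim` axes, the same function accepts a 5-d array. The choice of adjacency affects which cells count as connected, so results from this sketch need not coincide with those of a full cubical-complex construction.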
3. Results
3.1. Model system and computation
Ensemble of reactive trajectories and conformations.
The isomerization of alanine dipeptide in vacuum provides a tractable system for understanding the process of activation in detail, and has been well studied as a model for understanding protein conformational changes [21,29,55].
Using transition path sampling [8], we harvested 6 million reactive trajectories. Each trajectory is of 2.5 ps duration. We further collect conformations every 50 steps at 1 fs/step along each trajectory. Altogether, we have a total of 1.5 × 10^11 conformations. All simulations are conducted using the molecular dynamics software suite GROMACS 4.5.4 [56]. The Amber94 force field was used to facilitate the comparison with previous results. The simulation was performed with constant total energy of 36 kJ/mol, such that the averaged temperature is 300 K for the transition path ensemble. Note that the transition portion of each reactive trajectory is around 0.2 ps, thus the majority of our 2.5 ps trajectories are within the two stable basins. Here the reactant basin is defined in radians as (ϕ, ψ) ∈ [(−3.49, −0.96) × (−1.57, 3.32)] and the product basin is defined as (ϕ, ψ) ∈ [(0.87, 1.74) × (−1.39, 0)].
Constructing dynamic probability surface of transition state.
We then construct the dynamic probability surface of the isomerization reaction from the sampled 1.5 × 10^11 conformations. With a balanced consideration of the available MD simulation trajectories and the dimensionality, we construct a 5-dimensional configuration space for this study. Based on previous analysis using the energy flow theory [17], we selected the top 5 coordinates (ϕ, θ1, ψ, α, β) that contribute most to the activation dynamics. The original 60-dimensional space is then projected onto this 5-dimensional space, where each dimension is divided into 15 bins. This leads to 15^5 = 759,375 5-dimensional hypercubes.
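The binning step can be illustrated as follows. This is a minimal sketch, not the code used in this study: the samples are random stand-in data, and it is an assumption made here that each of the five coordinates is stored in radians within [−π, π).

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 10,000 hypothetical conformations, each described by
# 5 coordinates assumed to lie in [-pi, pi).
samples = rng.uniform(-np.pi, np.pi, size=(10_000, 5))

n_bins = 15
counts, edges = np.histogramdd(
    samples, bins=n_bins, range=[(-np.pi, np.pi)] * 5
)
prob = counts / counts.sum()  # normalised occupancy of each 5-d hypercube

print(prob.shape, prob.size)  # (15, 15, 15, 15, 15) 759375
```

The resulting array has one entry per hypercube and can be passed directly to a grid-based persistence routine.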
Computing topological structures of the dynamic probability landscape.
We then carry out persistent homology analysis. Computations are conducted on a machine with a 20-core Xeon E5-2670 CPU at 2.5 GHz, with a cache size of 20 MB and 128 GB of RAM. The computing time for finding the significant peaks and the ridges connecting them is ≈ 30 seconds.
Committor test for conformations at selected locations of the configuration space.
We carry out committor tests for configurations of the dipeptide identified by persistent homology. The committor value of a configuration is defined as the probability that a dynamic trajectory initiated from this configuration, with initial momenta drawn from the Boltzmann distribution, reaches the product basin before the reactant basin. A configuration with committor value pB = 0.5 is regarded as a member of the transition state ensemble.
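In practice, the committor of a configuration is estimated as the fraction of trial trajectories that reach the product basin first. The sketch below shows only this counting step, with a binomial standard error. The function name, the "A"/"B" labels for reactant and product, and the 100 trial outcomes are hypothetical and are not data from this study.

```python
import math


def committor_estimate(outcomes):
    """Estimate pB from trial outcomes.

    outcomes: iterable of "A" (reactant basin reached first) or
    "B" (product basin reached first), one entry per trial trajectory.
    Returns (pB, binomial standard error).
    """
    outcomes = list(outcomes)
    n = len(outcomes)
    n_b = sum(1 for o in outcomes if o == "B")
    p_b = n_b / n
    std_err = math.sqrt(p_b * (1.0 - p_b) / n)
    return p_b, std_err


# Hypothetical example: 48 of 100 trial trajectories reach the product first.
p_b, err = committor_estimate(["B"] * 48 + ["A"] * 52)
print(round(p_b, 2), round(err, 3))  # 0.48 0.05
```

The standard error indicates how many trial trajectories are needed before a value near 0.5 can be distinguished from a one-sided one.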
In a committor test [8,9], we need to generate an ensemble of tentative transition state conformations from locations in configuration space where probability peaks and ridges identified by persistent homology are located. These conformations all share the same target values for the selected coordinates (e.g., ϕ, θ1) that correspond to the location of the selected peak, with the other coordinates sampling the equilibrium distribution. For this, we add harmonic restraint potentials on the selected coordinates to the system potential energy function. The minima of the harmonic restraints are at the target values. Equilibrium MD simulations are then carried out. Conformations harvested from such simulations are filtered to generate an ensemble of conformations that all share the same target values for the selected coordinates. The restraint potential is used to enrich conformations that satisfy this criterion.
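One common way to write such a restrained potential is given below. The symbols are assumptions introduced here for illustration: S is the set of selected coordinates, q_i(x) is the value of coordinate i in conformation x, q_i^* is its target value, and κ_i is a force constant whose value is not specified in the text. For periodic dihedral angles the difference would be taken modulo 2π.

```latex
U_{\mathrm{restr}}(\mathbf{x})
  = U(\mathbf{x})
  + \sum_{i \in S} \frac{\kappa_i}{2}
    \bigl( q_i(\mathbf{x}) - q_i^{*} \bigr)^{2}
```

Each added term vanishes at the target value, so the restraint biases sampling toward the selected location without altering the potential there.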
3.2. Topology and dynamic properties of 5-d dynamic probability surfaces
Dynamic probability surface on configuration space of (ϕ, θ1, ψ, α, β).
We examine the topological structures of this 5-d dynamic probability surface. There are four peaks, each located in a 5-d cube (Fig. 6a, red dots for birth locations of peaks and blue dots for their ridges or death locations, and values are listed in SI Table 1). The most prominent peak with the largest persistence is peak b1 (see persistent diagram in Fig. 6b), which corresponds to the product basin. The second most prominent b2 is the reactant basin. The third prominent peak at b3 (Fig 9a) has roughly the same probability as the reactant basin b2, but a shorter persistence, namely, it does not stand out from the surrounding landscape as much. It subsequently merges with the peak at the product basin b1. As expected, peaks have all negative eigenvalues for their Hessian matrices, and ridges have one positive and four negative eigenvalues (see Supplementary Info).
Figure 6:

The 5-d dynamic probability surface on the (ϕ − θ1) plane, its topological structures, and the committor values. (a) The 5-d dynamic probability surface p(ϕ, ψ, θ1, α, β) shown on the (ϕ − θ1) plane. Red and blue dots are locations of probability peaks and ridges (see also SI Table 1). (b) The persistent diagram recording the birth and death probabilities p(bi) and p(di) of the peaks in y and x, respectively. (c-e) Distributions of committor values pB for trajectories from locations of b3, d2, and d3, respectively. (c) The transition state ensemble is located at b3.
Figure 9:

The 5-d probability surface projected onto two different 2-d planes. (a) The probability surface projected onto the (ϕ − θ1) plane. The red dot is the location of the third probability peak after reactant and product peaks, which is where the transition state conformations are located. (b) The probability surface projected onto the (ϕ − ψ) plane. Conformations with the correct transition state value of ϕ are at a non-peak location, which is on a slope below the new peak shown in blue. (c) The process of projecting the 5-d probability surface onto the 2-d (ϕ − ψ) plane for the blue and red dots shown in (b). While the probability at the location of the red dot is larger on the 5-d probability surface (left bars), after projection onto the (ϕ − ψ) plane, the probability at the red dot is smaller than that at the blue dot (right bar plots). As a consequence, the projected probability surface on the (ϕ − ψ) plane does not capture the actual location of the peak in the 5-d surface.
We then take conformations from the four peaks and the three ridges (SI Table 1), and carry out committor tests. We find that the most prominent peak b3 beyond the reactant and product basins fully captures the transition state ensemble (pB centered at 0.5, Fig 6c): Trajectories initiated from configurations at this location have equal probability towards the reactant or the product basin.
In contrast, all committor values pB for locations of b2, b4, and d4 are found to be 0. Reaction trajectories starting from conformations at these locations fall back to the reactant basin. The committor values for peak b1 are all 1.0: Trajectories from this location all go to the product basin. The committor values for conformations at the ridges d2 and d3 follow one-sided distributions (Fig 6d–6e, respectively). Only a negligible number of conformations have pB = 0.5.
These results demonstrate that the dynamic properties of the transition state ensemble are captured by the topological features of this dynamic probability surface. Furthermore, the transition state ensemble is located at the most prominent peak outside of the reactant and product basins, instead of a saddle point.
Dynamic probability surface on configuration space (ϕ, ψ, α, β, θ2).
To examine the importance of proper choice of the coordinates, we construct another 5-d probability surface by omitting the coordinate θ1 and replacing it with θ2, which is on the other end of the molecule in a position symmetric to θ1. Fig. 7a shows the projection of the 5-d surface in −ln p(x) on the (ϕ − θ2) plane.
Figure 7:

A different 5-d dynamic probability surface with θ2 replacing θ1 on the (ϕ − θ2) plane, its topological structure represented in persistent diagram, and distributions of committor values. (a) The 5-d dynamic probability surface p(ϕ, ψ, α, β, θ2) projected onto the (ϕ − θ2) plane. Red and blue dots are locations of probability peaks and ridges. (b) The persistent diagram recording the birth and death probabilities p(bi) and p(di) of the peaks in y and x, respectively. (c-e) Distribution of the committor values pB for trajectories from locations of b3, d2, and d3.
There are four significant peaks (Fig. 7a, red and blue dots), which are also shown in the persistent diagram (Figure 7b), each located in a 5-d cube (see SI Table 1). Peaks b1 and b2 are the most and second most prominent peaks, corresponding to the product and the reactant basins, respectively. In this projection, peaks are separated only in ϕ and they have almost identical θ2 values. This differs from the 5-d surface containing θ1 (Fig 6).
We then sample conformations from the location of each of the four peaks and the three ridges (SI Table 1) and perform committor tests. The committor values for peak b1 are all 1.0, where trajectories starting here all go to the product basin. All committor values for conformations at b2, b4, and d4 are 0.0. Reaction trajectories from these locations all go to the reactant basin. The committor values for d2 and d3 follow one-sided distributions (Fig. 7d–7e). Only a tiny number of the conformations have pB = 0.5, indicating that the transition state ensemble is located elsewhere.
The committor values for b3 have a flat distribution. While there are conformations with pB values close to 0.5, their frequency is similar to that of any other pB value. This indicates that the 5-d cube where b3 is located contains some transition state conformations as its ϕ value is correct, but also a mixture of other conformations with diverse dynamic properties. Overall, without the reaction coordinate θ1, this 5-d dynamic probability surface does not describe the activated process adequately, and cannot be used to identify the transition state ensemble.
3.3. Topology and dynamic properties of projected 2-d dynamic probability surfaces.
As it is difficult to directly study the topology of a high-dimensional surface, a common practice is to project the surface to a lower dimensional subspace and analyze the topology of the projected surface instead. The caveat of this practice is that the original topological features may be lost, and new features that are artifacts may arise due to the marginalization of the probability distributions.
To assess how well the dynamic properties are retained in a subspace, we project the 5-d surface on (ϕ, θ1, ψ, α, β) to 2-d planes. We then analyze the topological structures of the projected surfaces, and assess the dynamic behavior of the identified topological features. We carry out this analysis using the 2-d planes of (ϕ − θ1) and (ϕ − ψ).
Projecting 5-d dynamic probability surface to the (ϕ − θ1) plane.
After projection, there are three significant probability peaks (Fig. 8a, Fig 9a, red and blue dots, and SI Table 2). The most and the next prominent peaks b1 and b2 shown in the persistent diagram of Fig. 8b are the product and reactant basins, respectively.
Figure 8:

The (ϕ-θ1)-projection of the 5-d dynamic probability surface, its topological structure, and distributions of committor values. (a) The 5-d dynamic probability surface projected onto the (ϕ − θ1) plane. Red and blue dots are locations of probability peaks and ridges, respectively. (b) The persistent diagram recording the birth and death probabilities p(bi) and p(di) of the peaks in y and x, respectively. (c-e) Distributions of committor values pB for trajectories from peaks and ridges b3, d2, and d3, respectively. (c) Transition state conformations are at b3.
It is informative to compare peak locations on the 5-d surface to those on the projected surface (SI Table 1 and SI Table 2). The ϕ coordinate for the product basin b1 is altered from 1.25 to 0.84 after projection, while the θ1 coordinate is unchanged. The (ϕ, θ1) coordinates of the reactant basin b2 are also changed from (−1.68, −0.18) to (−1.25, 0.0). θ1 of the ridge d3 is changed from −0.18 to 0.0, while ϕ is unchanged. The fourth prominent peak becomes undetectable after projection. The persistent diagram (Fig. 8b) is also significantly different: The third peak is much less prominent with reduced persistence compared to that in Fig. 6b (see also SI Table 2).
We then carry out committor test on conformations from locations of these topological features. The distribution of pB sampled from trajectories from b3 is centered around 0.5 (Fig 8c) and exhibits significant enrichment of transition state conformations. This is similar to peak b3 on the original surface over the 5-d configuration space (Fig 6c), although the distribution has a broader width.
The committor values of trajectories from the product basin b1 are all 1.0, as they all fall back to the product basin. Similarly, trajectories from b2 all fall back to the reactant basin with a pB value of 0.0. The committor values for the ridges at d2 and d3 follow one-sided distributions, and only a small number of conformations have pB = 0.5 (Fig. 8d and 8e).
Overall, these results demonstrate that when projected to the 2-d plane of (ϕ-θ1), which is formed by the two dominant reaction coordinates, the dynamic probability surface retains essential dynamic properties of the transition state surface, and contains rich information such that the transition state conformations can be recovered.
Projecting 5-d dynamic probability surface to the (ϕ − ψ) plane.
ϕ and ψ angles are the standard parameters to describe protein secondary structures. After projection to the (ϕ − ψ) plane, there are three significant probability peaks (Fig. 10a, red and blue dots, and SI Table 2). The most and the next most prominent peaks b1 and b2 as shown in the persistent diagram (Fig. 10b) are the product and reactant basins, respectively. Similar to projection to the (ϕ-θ1) plane, locations of both basins are altered upon projection (SI Table 2). Peak 3 becomes very minor after projection to (ϕ − ψ) plane, and its location is also altered from that of the 5-d surface. This can be seen in Fig. 9b, where the location of peak 3 on the 5-d surface projected onto (ϕ−ψ) plane (red point), and the changed location of the new peak after projecting onto the (ϕ − ψ) plane (blue dot) are shown.
Figure 10:

The (ϕ-ψ)-projection of the 5-d dynamic probability surface, its topological structure, and distribution of committor values. (a) The 5-d dynamic probability surface projected onto the (ϕ-ψ) plane. Red and blue dots are locations of probability peaks and ridges, respectively. (b) The persistent diagram recording the birth and death probabilities p(bi) and p(di) of the peaks in y and x, respectively. (c) Distribution of committor values pB for ridge d2.
These observations demonstrate that with projection, locations of topological features of probability peaks may change, and their prominence as measured by persistence may also change dramatically. Fig. 9c explains why the projection to (ϕ-ψ) results in the peak on the 5-d surface (red dot in Fig. 9a) changing location (Fig. 9b, peak location shown as a blue dot, red dot no longer at the peak). First, as shown in Fig. 9c (beginning bar plot), the probability at the location of the red dot is larger than that at the blue dot on the 5-d surface. The probability p(ϕ, ψ) at each point on the (ϕ-ψ) plane (Fig. 9b) is the sum of all points on the 5-d surface with the same ϕ and ψ but with different values in any of the other three coordinates (θ1, α and β). Thus, the probability of each point on the (ϕ−ψ) plane of Fig. 9b is a 3-d hyper-surface of p(θ1, α, β). When we sum up the 3-d hyper-surface along one direction (e.g., α), we obtain a 2-d probability surface (e.g. p(θ1, β)). The 2-d surfaces for the red and blue dots are shown in the middle panel of Fig. 9c. Reducing the dimension further, we sum up the 2-d probability surface over β. The resulting distributions are 1-d probability distributions along the θ1 direction, shown in Fig. 9c for the blue and red dots. From these 1-d distributions, we can see that when details in θ1 direction are retained, the probability at the red dot (actual peak in 5-d) is still higher than the probability at the blue dot. We next sum up the probability along the θ1 direction (Fig. 9c, final bar plots). The summed values are the probability values over the 2-d (ϕ − ψ) plane shown in Fig. 9b. As shown by the final bar plot in Fig. 9c, the total probability mass for the red dot is now less than that for the blue dot, even though the red dot has a higher probability on the original 5-d surface. Hence, the location of peak 3 changes when projecting onto (ϕ − ψ) plane.
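The effect of marginalization on peak location can be reproduced with a toy joint distribution. The numbers below are hypothetical and chosen only to illustrate the mechanism. The retained coordinate u plays the role of a location on the (ϕ-ψ) plane, and the summed-out coordinate v plays the role of (θ1, α, β).

```python
import numpy as np

# Hypothetical joint probability p(u, v): rows are values of the retained
# coordinate u, columns are values of the coordinate v that is summed out.
joint = np.array([
    [0.40, 0.02, 0.02, 0.01],  # u = 0: one sharp, high peak
    [0.15, 0.14, 0.13, 0.13],  # u = 1: broad, flat distribution
])

# Location of the highest peak on the full surface.
peak_joint = tuple(int(i) for i in np.unravel_index(np.argmax(joint), joint.shape))

# Projection: p(u) is the sum of p(u, v) over v.
marginal = joint.sum(axis=1)
peak_marginal = int(np.argmax(marginal))

print(peak_joint)      # (0, 0)
print(marginal)        # approximately [0.45 0.55]
print(peak_marginal)   # 1
```

The sharp peak at u = 0 dominates the full surface, yet after summation the broad distribution at u = 1 carries more total probability mass, so the peak of the projected surface moves to a different location.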
We then carried out committor tests (Fig. 10c). None of the topological features are where the transition state ensemble is located: All trajectories starting from b2, b3, and d3 go to the reactant basin (pB = 0), and all trajectories starting at b1 go to the product basin (pB = 1.0). The committor values for ridge d2 follow a one-sided distribution around pB = 0 (Fig. 10c). Trajectories starting there mostly fall back to the reactant basin. Overall, none of the topological features after projection to the (ϕ-ψ) plane retain the dynamic properties of the transition state conformations as the original 5-d surface does.
3.4. Dimension reduction by PCA destroys dynamic properties inherent in surface topology
Principal Component Analysis (PCA) is a widely used technique for dimension reduction. It has found broad applications in molecular simulations [57–61]. However, whether such reduction retains the essential dynamics of the activation process and whether the surface topology on the PCA space can uncover the transition state conformations are not known. Here we assess the dynamic properties of probability surfaces after PCA dimension reduction.
Projection of p(ϕ, ψ, θ1, α, β) onto (PC1, PC2) by dPCA.
We first applied dihedral principal component analysis (dPCA) to the 5-d probability surface [60,61]. dPCA is widely used for dimension reduction on periodic dimensions. It first maps each periodic dimension of circular angle to two new dimensions using the sin and cos functions. Regular PCA is then applied for dimension reduction. We use the dPCA procedure and obtain the first two principal components from the covariance matrix of the 1.5 × 10^11 conformations. Collectively, they account for 80.4% of the variance.
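The two steps of dPCA described above can be sketched in a few lines. This is a generic illustration on random stand-in angles, not the implementation or data used in this study; the function name `dpca` is hypothetical.

```python
import numpy as np


def dpca(angles, n_components=2):
    """Dihedral PCA sketch.

    angles: (n_samples, n_angles) array in radians.
    Returns the projected coordinates and the fraction of variance
    captured by the retained components.
    """
    angles = np.asarray(angles, dtype=float)
    # Step 1: map each periodic angle to a (cos, sin) pair.
    features = np.concatenate([np.cos(angles), np.sin(angles)], axis=1)
    # Step 2: ordinary PCA on the transformed features.
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    projected = centered @ eigvecs[:, order]
    explained = eigvals[order].sum() / eigvals.sum()
    return projected, explained


rng = np.random.default_rng(1)
toy_angles = rng.uniform(-np.pi, np.pi, size=(500, 5))  # stand-in data
pcs, frac = dpca(toy_angles)
print(pcs.shape)  # (500, 2)
```

Note that five angles become ten features after the (cos, sin) mapping, so the principal components are linear combinations of trigonometric functions of the angles rather than of the angles themselves.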
The dynamic probability surface after projection to the (PC1-PC2) plane is dramatically more complex (Fig. 11a and SI Table 3) than the surfaces when projected to either the (ϕ-θ1) plane (Fig. 8) or the (ϕ-ψ) plane (Fig. 10). The persistent diagram (Fig. 11b) is also very different from that of the original 5-d probability surface (Fig. 6b).
Figure 11:

The dynamic probability surface on PCA space, its topological structure, and committor values on principal components space. (a) The projection of the 5-d dynamic probability surface p(ϕ, ψ, θ1, α, β) to the plane of (PC1-PC2). (b) The persistent diagram exhibits six probability peaks. (c-e) Distribution of committor values pB for trajectories from ridge d2, peak b6, and ridge d6, respectively.
The committor tests show that all committor values for conformations at b1 are 1.0 with trajectories going to the product basin. All committor values for conformations at b2–5 and d3–5 are 0.0: trajectories from there all go to the reactant basin. Committor values at ridge d2 and peak b6 follow a one-sided distribution at pB = 0 (Fig. 11c). Committor values at ridge d6 are concentrated at both pB = 0.0 and pB = 1.0. Conformations at this location are a mixture of those close to the reactant basin and those close to the product basin. There are few conformations from the transition state ensemble.
Overall, our results show that the dPCA procedure for dimension reduction removes dynamics-relevant information from the topological features of the probability surface. No conformations in the transition state of this activated process are captured by the topological features of the PCA surface.
Projection from 39-d angular space to 5-d dPCA subspace.
We also applied dPCA to the original full-dimensional dynamic probability surface. After removal of the 21 bond lengths, we apply dPCA to reduce the remaining 39-dimensional configuration space of angles to 5 principal components. The contour plot of the 5-d PCA surface projected onto the first two principal components is shown in Fig. 12a. Persistent homology analysis identifies 16 peaks (Fig. 12b), each located in a 5-d cube. The location and probability of the first 6 most dominant peaks and ridges connecting them are listed in SI Table 3. This 5-d persistent diagram is very different from that shown in Fig. 6b, exhibiting a significantly more complex surface topology.
Figure 12:

The dynamic probability surface on 5-d principal components space reduced by dPCA from the 39-d configuration space, its topological structure, and committor values. (a) The 5-d dynamic probability surface p(PC1, ⋯, PC5) projected onto the (PC1, PC2) plane. (b) The persistent diagram exhibits 16 peaks. (c-d) Distribution of committor pB values for trajectories started at ridges d9 and d10, respectively.
The committor tests show that all committor values for conformations located at the peaks and the ridges are either 0 or 1.0. The two exceptions are ridges d9 and d10 (Fig. 12c–d). None of the topological features in the dPCA-reduced 5-d dynamic probability surface retains the essential dynamics of the transition state ensemble. Overall, our results demonstrate that dimension reduction by dPCA destroys dynamic properties inherent in the original surface topology of the dynamic probability surface.
Direct PCA projections.
Results using direct PCA to project p(ϕ, ψ, θ1, α, β) onto (PC1, PC2), as well as projection from the 39-d space to the 5-d direct PCA subspace are similar and none retain the dynamic properties in their topological features (see SI for details).
4. Discussion
In this study, we have introduced a novel approach for characterizing the exact topological features of dynamic probability surfaces. Instead of examining critical points and Morse indexes, ours is based on homology groups of a series of superlevel sets of the probability surface. With quantification of the scales of these topological features by persistent homology, we are able to uncover the relationship between the topology of the dynamic probability surface and the dynamics of the activation process of the alanine-dipeptide isomerization reaction.
This approach allows us to define the topological properties of the high-dimensional dynamic probability surface that is associated with the transition state conformations. The probability surface over the transition state region is the most prominent peak after the reactant and product basins. Instead of a Morse index of 1 as conventionally thought [5,62], the transition state ensemble sits on top of a dynamic probability peak, from which the surface goes downhill in all directions. As seen when projected to the (ϕ, θ1)-plane (Fig 9a), it appears as a small peak rather than a saddle point commonly associated with the transition state on a multi-dimensional free energy surface. This is because the system undergoes a certain amount of correlated wandering motions at the barrier top, before it goes down towards the product basin. Our finding is against the conventional wisdom that the C7eq → C7ax transition is a ballistic process, as it is a small peptide and the transition occurs in vacuum.
The dynamic probability surface was constructed from naturally occurring reactive trajectories connecting the reactant and product basins. These trajectories are unbiased and faithfully reflect how the C7eq → C7ax transition occurs. They contain all the relevant information about the dynamic process of the activation. Unlike the free energy surface commonly used in examining the mechanism of an activated process, this probability surface contains additional information that reflects the non-equilibrium nature of the transition dynamics.
A common practice in the studies of protein conformational dynamics is to extract mechanistic insights from the geometry of a two-dimensional free energy surface of a double-well along certain collective variables, which are often chosen based on heuristics or by intuition. For example, one would associate a saddle region with the transition states. Our results illustrate the caveats of such procedures. All three probability surfaces shown in Figs. 6, 8, and 10 exhibit the canonical double-well feature. In Fig. 7, the 5-d surface on the (ϕ, θ2)-plane has the product and reactant peaks and the third peak all lying along the ϕ-direction alone, indicating that θ2 is not a reaction coordinate. This is indeed verified by the committor test: configurations corresponding to the peak in the saddle region all share the correct value of ϕ but sample randomly along the θ1 direction, as illustrated by the flat committor distribution in Fig. 7c. Detailed examination shows that the 5-d cube containing the transition state conformations (red dot in Fig. 13) also contains conformations with other θ1 values, which are not at the transition state: some go to the product basin (e.g., green dot, Fig. 13b), and others to the reactant basin (e.g., blue dot, Fig. 13b).
Figure 13:

The horizontal band of the probability surface of Fig. 7a centered at θ2 = 0 (−0.08 < θ2 < 0.08) expanded to show distribution in θ1. (a) This band containing all peaks in Fig. 7a is expanded in the θ1 direction. The (ϕ-θ2)-square containing peak b3 in Fig. 7a is expanded in θ1 and is shown as a vertical strip between the two dashed lines. This strip contains a mixture of conformations. The red dot shows the location of the transition state conformations. (b) The distributions of committor values for conformations at the blue, red, and green dots in (a), respectively. Trajectories from the blue and green dots fall back to the reactant and product basins, respectively.
In contrast, projection of the 5-d surface on (ϕ, θ1)-plane (Fig 6) has the two basins along the ϕ-direction alone, but the transition region aligned along both ϕ and θ1. This second feature is consistent with the importance of θ1 in determining the barrier crossing dynamics [1–4,15,17,21].
Interestingly, the (ϕ, ψ)-surface (Fig 10) has basins and the saddle region arranged along both directions. Conventional wisdom would have led to the conclusion that ψ is important in defining both reactant and product basins as well as the barrier crossing process. However, the double-well structure exhibited on the (ϕ-ψ) plane is profoundly misleading. None of the topological features on the surface over the (ϕ-ψ) plane retain the dynamic properties of the original 5-d surface, as the committor test showed that trajectories from configurations corresponding to the peak at the transition region all fall into the reactant basin. This demonstrates that the correlation between ϕ and ψ, due to the minor role of ψ in the transition process [1,2], distorted the probability distribution along ϕ, such that the ridge/saddle extended into the reactant peak, leading to an incorrect ϕ value that marks the location of the peak in the transition region. In contrast, θ2 did not impact the distribution of ϕ, so the peak in the transition region on the (ϕ-θ2) plane still bears the correct ϕ value for the transition states.
In general, the dynamic properties of topological features of the probability surface are very sensitive to the subspace of projection. This is illustrated by the different locations of the transition state conformations, which are at a peak location on the (ϕ-θ1) plane (Fig 9a, red dot), but are at a slope location below a new peak when projected to the (ϕ-ψ) plane (Fig 9b, blue dot). While the probability at the correct (ϕ-θ1) square for the transition state conformations (Fig 9c, left, red) is at a peak, a different location in the (ϕ-ψ) plane (Fig. 9b, blue dot) has higher probability, as in this projection the probability mass distributed along the dimension of θ1 is integrated over all θ1 values (Fig. 9c, higher bar, right), resulting in the concentration of probability peak at this new location.
Our results show that without the inclusion of the correct reaction coordinates, a probability surface of the same dimension no longer correctly characterizes the dynamic properties of the activated process. Without θ1, the 5-d dynamic probability surface over (ϕ, θ2, ψ, α, β) fails to capture the dynamics of this activated process.
Together, our results show that intuition-based projection (such as ϕ-ψ) or other arbitrary projection cannot be relied upon for understanding the dynamic properties of activated processes. Without rigorous examination such as the committor test, directly assigning mechanistic significance to features of a free energy surface is prone to mistakes, misinterpretations, and misunderstanding.
Finally, our results show that the topological properties of the probability surface change dramatically after dimension reduction with techniques such as dPCA. While the simple probability surface on the properly constructed 2-d (ϕ-θ1) plane contains rich dynamic information and is sufficient to uncover the transition state conformations, the topological features on PCA-reduced surfaces can become more complex, no longer reflect the essential dynamics, and cannot be used to identify the transition state conformations.
The use of homology groups and the technique of analyzing the persistent homology of the filtration of superlevel sets of high-dimensional probability surfaces introduced here are general. We envision that they can be applied to investigate the topology of high-dimensional probability surfaces encountered in other physical problems involving activated processes.
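As a minimal sketch of the idea, the code below computes only the 0-dimensional persistence of a superlevel-set filtration on a small 2-d grid, which pairs each probability peak with the level at which it merges into a higher peak. It is a simplification under stated assumptions: it uses a union-find structure with 4-neighbour connectivity rather than the cubical complex algorithm used in this work, it does not compute the higher-dimensional classes associated with ridges, and the function name `peak_persistence` and the example grid are hypothetical.

```python
def peak_persistence(grid):
    """Return (birth, death, cell) for every peak, most prominent first."""
    rows, cols = len(grid), len(grid[0])
    # Sweep the level from high to low: cells enter in decreasing order of value.
    cells = sorted(((grid[r][c], r, c) for r in range(rows) for c in range(cols)),
                   reverse=True)
    parent = {}   # union-find forest over cells already in the superlevel set
    peak = {}     # component root -> (height, cell) of its highest point

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    pairs = []
    for value, r, c in cells:
        parent[(r, c)] = (r, c)
        peak[(r, c)] = (value, (r, c))
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nb not in parent:
                continue
            a, b = find((r, c)), find(nb)
            if a == b:
                continue
            # Elder rule: the component with the lower peak dies at this level.
            if peak[a] < peak[b]:
                a, b = b, a
            if peak[b][0] > value:          # skip zero-persistence pairs
                pairs.append((peak[b][0], value, peak[b][1]))
            parent[b] = a
    # The component of the global maximum never dies.
    root = find(cells[0][1:])
    pairs.append((peak[root][0], float("-inf"), peak[root][1]))
    return sorted(pairs, key=lambda p: p[0] - p[1], reverse=True)

# Hypothetical surface: peaks of height 9 and 6 joined by a ridge of height 3.
surface = [
    [1, 2, 1, 1, 1],
    [2, 9, 3, 6, 1],
    [1, 2, 1, 1, 1],
]
for birth, death, cell in peak_persistence(surface):
    print(f"peak at {cell}: birth={birth}, death={death}")
```

The difference between birth and death is the persistence of a peak, which plays the role of the global prominence measure discussed in the text; the most prominent peak has infinite persistence in this convention.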
Supplementary Material
Figure 5: The isomerization reaction of alanine dipeptide. (a) Conformations from the reactant and product basins before and after the isomerization. (b) The six reaction coordinates of the isomerization process of alanine dipeptide examined in this study.
Acknowledgement
We thank Drs. Herbert Edelsbrunner and Hubert Wagner for discussion and for generous help in extending the cubic complex algorithm. This work is supported by grants NIH R35 GM127084 and NSF CHE-1665104.
Footnotes
Conflict of Interest Statement
There are no conflicts of interest.
References
- [1] Chandler David. Statistical mechanics of isomerization dynamics in liquids and the transition state approximation. J. Chem. Phys., 68(6):2959–2970, 1978.
- [2] Kramers HA. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284–304, 1940.
- [3] Pechukas Philip. Statistical approximations in collision theory, pages 269–322. Springer US, Boston, MA, 1976.
- [4] Wigner E. The transition state method. Trans. Faraday Soc., 34:29–41, 1938.
- [5] Hänggi Peter, Talkner Peter, and Borkovec Michal. Reaction-rate theory: fifty years after Kramers. Rev. Mod. Phys., 62:251–341, Apr 1990.
- [6] Berne Bruce J., Borkovec Michal, and Straub John E. Classical and modern methods in reaction rate theory. J. Phys. Chem., 92(13):3711–3725, 1988.
- [7] Pollak Eli and Talkner Peter. Reaction rate theory: what it was, where is it today, and where is it going? Chaos, 15(2):026116, 2005.
- [8] Bolhuis Peter G., Chandler David, Dellago Christoph, and Geissler Phillip L. Transition path sampling: throwing ropes over rough mountain passes, in the dark. Annu. Rev. Phys. Chem., 53(1):291–318, 2002.
- [9] Du Rose, Pande Vijay S., Grosberg Alexander Yu., Tanaka Toyoichi, and Shakhnovich Eugene S. On the transition coordinate for protein folding. J. Chem. Phys., 108(1):334–350, 1998.
- [10] Li Wenjin and Ma Ao. Recent developments in methods for identifying reaction coordinates. Mol. Simul., 40(10–11):784–793, 2014.
- [11] Schramm Vern L. and Schwartz Steven D. Promoting vibrations and the function of enzymes. Emerging theoretical and experimental convergence. Biochem., 57(24):3299–3308, 2018.
- [12] Schwartz Steven D and Schramm Vern L. Enzymatic transition states and dynamic motion in barrier crossing. Nat. Chem. Biol., 5(8):551–558, 2009.
- [13] Ryter D. On the eigenfunctions of the Fokker-Planck operator and of its adjoint. Physica A, 142(1):103–121, 1987.
- [14] Onsager L. Initial recombination of ions. Phys. Rev., 54:554–557, Oct 1938.
- [15] Bolhuis Peter G., Dellago Christoph, and Chandler David. Reaction coordinates of biomolecular isomerization. Proc. Natl. Acad. Sci., 97(11):5877–5882, 2000.
- [16] Li Huiyu and Ma Ao. Kinetic energy flows in activated dynamics of biomolecules. J. Chem. Phys., 153(9):094109, 2020.
- [17] Li Wenjin and Ma Ao. Reaction mechanism and reaction coordinates from the viewpoint of energy flow. J. Chem. Phys., 144(11):114103, 2016.
- [18] Best Robert B. and Hummer Gerhard. Reaction coordinates and rates from transition paths. Proc. Natl. Acad. Sci., 102(19):6732–6737, 2005.
- [19] Antoniou Dimitri and Schwartz Steven D. Toward identification of the reaction coordinate directly from the transition state ensemble using the kernel PCA method. J. Phys. Chem. B, 115(10):2465–2469, 2011.
- [20] Hu Jie, Ma Ao, and Dinner Aaron R. A two-step nucleotide-flipping mechanism enables kinetic discrimination of DNA lesions by AGT. Proc. Natl. Acad. Sci., 105(12):4615–4620, 2008.
- [21] Ma Ao and Dinner Aaron R. Automatic method for identifying reaction coordinates in complex systems. J. Phys. Chem. B, 109(14):6769–6779, 2005.
- [22] Peters Baron and Trout Bernhardt L. Obtaining reaction coordinates by likelihood maximization. J. Chem. Phys., 125(5):054108, 2006.
- [23] Antoniou Dimitri and Schwartz Steven D. The stochastic separatrix and the reaction coordinate for complex systems. J. Chem. Phys., 130(15):151103, 2009.
- [24] Jung Hendrik, Covino Roberto, and Hummer Gerhard. Artificial intelligence assists discovery of reaction coordinates and mechanisms from molecular dynamics simulations, 2019.
- [25] Sidky Hythem, Chen Wei, and Ferguson Andrew L. Machine learning for collective variable discovery and enhanced sampling in biomolecular simulation. Mol. Phys., 118(5):e1737742, 2020.
- [26] Bonati Luigi, Zhang Yue-Yu, and Parrinello Michele. Neural networks-based variationally enhanced sampling. Proc. Natl. Acad. Sci., 116(36):17641–17647, 2019.
- [27] Wang Yihang, Ribeiro João Marcelo Lamim, and Tiwary Pratyush. Past–future information bottleneck for sampling molecular reaction coordinate simultaneously with thermodynamics and kinetics. Nat. Commun., 10(1):1–8, 2019.
- [28] Wang Yihang, Ribeiro João Marcelo Lamim, and Tiwary Pratyush. Machine learning approaches for analyzing and enhancing molecular dynamics simulations. Curr. Opin. Struct. Biol., 61:139–145, 2020.
- [29] Li Wenjin and Ma Ao. A benchmark for reaction coordinates in the transition path ensemble. J. Chem. Phys., 144(13):134104, 2016.
- [30] Matsumoto Yukio. An introduction to Morse theory, volume 208. American Mathematical Soc., 2002.
- [31] Liang Jie, Woodward Clare, and Edelsbrunner Herbert. Anatomy of protein pockets and cavities: Measurement of binding site geometry and implications for ligand design. Protein Sci., 7(9):1884–1897, 1998.
- [32] Liang Jie, Edelsbrunner Herbert, Fu Ping, Sudhakar Pamidighantam V., and Subramaniam Shankar. Analytical shape computation of macromolecules: I. Molecular area and volume through alpha shape. Proteins, 33(1):1–17, 1998.
- [33] Edelsbrunner Herbert, Facello Michael, and Liang Jie. On the definition and the construction of pockets in macromolecules. Discret. Appl. Math., 88(1):83–102, 1998. Computational Molecular Biology DAM - CMB Series.
- [34] Liang Jie, Edelsbrunner Herbert, Fu Ping, Sudhakar Pamidighantam V., and Subramaniam Shankar. Analytical shape computation of macromolecules: II. Inaccessible cavities in proteins. Proteins, 33(1):18–29, 1998.
- [35] Binkowski T Andrew, Adamian Larisa, and Liang Jie. Inferring functional relationships of proteins from local sequence and spatial surface patterns. J. Mol. Biol., 332(2):505–526, 2003.
- [36] Binkowski T Andrew, Joachimiak Andrzej, and Liang Jie. Protein surface analysis for function annotation in high-throughput structural genomics pipeline. Protein Sci., 14(12):2972–2981, 2005.
- [37] Tseng Yan Y and Liang Jie. Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: a Bayesian Monte Carlo approach. Mol. Biol. Evol., 23(2):421–436, 2006.
- [38] Tseng Yan Yuan, Dundas Joseph, and Liang Jie. Predicting protein function and binding profile via matching of local evolutionary and geometric surface patterns. J. Mol. Biol., 387(2):451–464, 2009.
- [39] Perez-Rathke Alan, Fahie Monifa A, Chisholm Christina, Liang Jie, and Chen Min. Mechanism of OmpG pH-dependent gating from loop ensemble and single channel studies. J. Am. Chem. Soc., 140(3):1105–1115, 2018.
- [40] Tian Wei, Chen Chang, Lei Xue, Zhao Jieling, and Liang Jie. CASTp 3.0: computed atlas of surface topography of proteins. Nucleic Acids Res., 46(W1):W363–W367, 2018.
- [41] Caiani Lando, Casetti Lapo, Clementi Cecilia, and Pettini Marco. Geometry of dynamics, Lyapunov exponents, and phase transitions. Phys. Rev. Lett., 79(22):4361, 1997.
- [42] Angelani L, Ruocco G, and Zamponi F. Relationship between phase transitions and topological changes in one-dimensional models. Phys. Rev. E, 72(1):016122, 2005.
- [43] Kastner Michael. Phase transitions and configuration space topology. Rev. Mod. Phys., 80(1):167, 2008.
- [44] Wales David J. Exploring energy landscapes. Annu. Rev. Phys. Chem., 69:401–425, 2018.
- [45] Cimasoni David and Delabays Robin. The topological hypothesis for discrete spin models. J. Stat. Mech. Theory Exp., 2019(3):033216, 2019.
- [46] Kastner Michael and Mehta Dhagash. Phase transitions detached from stationary points of the energy landscape. Phys. Rev. Lett., 107(16):160602, 2011.
- [47] Edelsbrunner Herbert and Harer John L. Computational topology: an introduction. American Mathematical Society, Providence, RI, 2010.
- [48] Edelsbrunner Herbert, Letscher David, and Zomorodian Afra. Topological persistence and simplification. Discrete Comput. Geom., 2002.
- [49] Carlsson Gunnar. Topology and data. Bull. Am. Math. Soc., 46(2):255–308, 2009.
- [50] Munkres JR. Elements of algebraic topology. CRC Press, 2018.
- [51] Hatcher A. Algebraic topology. Tsinghua University Press, 2005.
- [52] Cohen-Steiner David, Edelsbrunner Herbert, Harer John, and Morozov Dmitriy. Persistent homology for kernels, images, and cokernels, pages 1011–1020.
- [53] Wagner Hubert, Chen Chao, and Vuçini Erald. Efficient computation of persistent homology for cubical data. In Topological methods in data analysis and visualization II, pages 91–106. Springer, 2012.
- [54] Kaczynski Tomasz, Mischaikow Konstantin, and Mrozek Marian. Computational homology, volume 157. Springer Science & Business Media, 2006.
- [55] Bolhuis Peter G., Dellago Christoph, and Chandler David. Reaction coordinates of biomolecular isomerization. Proc. Natl. Acad. Sci., 97(11):5877–5882, 2000.
- [56] Hess Berk, Kutzner Carsten, van der Spoel David, and Lindahl Erik. GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput., 4(3):435–447, 2008.
- [57] Levy RM, Srinivasan AR, Olson WK, and McCammon JA. Quasi-harmonic method for studying very low frequency modes in proteins. Biopolymers, 23(6):1099–1112, 1984.
- [58] García Angel E. Large-amplitude nonlinear motions in proteins. Phys. Rev. Lett., 68(17):2696, 1992.
- [59] Riccardi Laura, Nguyen Phuong H., and Stock Gerhard. Free-energy landscape of RNA hairpins constructed via dihedral angle principal component analysis. J. Phys. Chem. B, 113(52):16660–16668, 2009.
- [60] Mu Yuguang, Nguyen Phuong H., and Stock Gerhard. Energy landscape of a small peptide revealed by dihedral angle principal component analysis. Proteins, 58(1):45–52, 2005.
- [61] Altis Alexandros, Nguyen Phuong H., Hegger Rainer, and Stock Gerhard. Dihedral angle principal component analysis of molecular dynamics simulations. J. Chem. Phys., 126(24):244111, 2007.
- [62] Peters Baron, Heyden Andreas, Bell Alexis T., and Chakraborty Arup. A growing string method for determining transition states: Comparison to the nudged elastic band and string methods. J. Chem. Phys., 120(17):7877–7886, 2004.
