Significance
Developing a mechanistic understanding of complex biomolecular processes occurring over long timescales presents a formidable challenge. While state-of-the-art techniques like Markov state models are a vital tool in decoding these processes, they require a substantial amount of simulation data to construct an accurate model. Here, we introduce an approach that goes beyond previous Markovian (memoryless) theories and dramatically reduces the amount of simulation data required to construct a simple and interpretable model of biomolecular processes based on physically transparent time-dependent rates. By deriving a rigorous bound for the simulation times required to construct non-Markovian models of these processes, we show that such models provide a much more data-efficient approach to understanding the dynamics of complex biomolecular systems.
Keywords: Markov state models, generalized master equations, biomolecular dynamics, protein folding, memory effects
Abstract
The ability to predict and understand complex molecular motions occurring over diverse timescales ranging from picoseconds to seconds and even hours in biological systems remains one of the largest challenges to chemical theory. Markov state models (MSMs), which provide a memoryless description of the transitions between different states of a biochemical system, have provided numerous important physically transparent insights into biological function. However, constructing these models often necessitates performing extremely long molecular simulations to converge the rates. Here, we show that by incorporating memory via the time-convolutionless generalized master equation (TCL-GME) one can build a theoretically transparent and physically intuitive memory-enriched model of biochemical processes with up to a three-orders-of-magnitude reduction in the simulation data required while also providing a higher temporal resolution. We derive the conditions under which the TCL-GME provides a more efficient means to capture slow dynamics than MSMs and rigorously prove when the two provide equally valid and efficient descriptions of the slow configurational dynamics. We further introduce a simple averaging procedure that enables our TCL-GME approach to quickly converge and accurately predict long-time dynamics even when parameterized with noisy reference data arising from short trajectories. We illustrate the advantages of the TCL-GME using alanine dipeptide, the human argonaute complex, and the FiP35 WW domain.
Biomolecules, such as proteins, dynamically change conformations to perform their functions and this plays a critical role in processes such as protein misfolding and aggregation and protein–ligand recognition. Therefore, investigating biomolecular dynamics is essential for discovering next-generation therapeutics, developing novel antibiotic targets, and elucidating protein-folding mechanisms that underlie diseases such as Alzheimer’s, Parkinson’s, and Cystic Fibrosis (1). Indeed, all-atom molecular dynamics (MD) computer simulations can offer insight at resolutions beyond standard experimental setups. However, since small atomic motions such as vibrations occur on the order of femtoseconds, whereas the complex motions at the heart of large conformational changes that drive processes such as protein folding and allostery span timescales from microseconds to seconds, a direct atomistic simulation of such long-timescale motions is only feasible for relatively small biological systems.
Markov state models (MSMs) are a powerful approach that has emerged to tackle this grand challenge (2–12). Currently, widely used open-source libraries offer robust implementations for constructing MSMs (13–15). MSMs benefit from massive parallelism by exploiting many short molecular dynamics simulations to capture the long-time configurational dynamics that reveal the mechanisms of biomolecular processes (16). This is accomplished by partitioning configuration space into a set of states: distinct structures whose component configurations interconvert on a faster timescale than with those belonging to different structures. Identifying the slowest interconverting structures, however, remains a formidable problem (17–25). This difficulty arises from the fact that, to perform a perfect partitioning, one needs detailed knowledge of the full free energy landscape of a complex condensed phase system. Instead, one is generally limited to a set of states that evolve on slow timescales but are not optimally partitioned (16, 26). With such a set of configurations, an MSM then provides a discrete-time kinetic description of the interstate conversion, enforcing an effective separation of timescales by requiring that transitions between states have no dependence on the history of the system. In this memoryless, or Markovian, limit the rate constants in the kinetic scheme are time-independent. This kinetic description provides an approximation to the true dynamics, and its accuracy depends on the extent of timescale separation.
For a sufficiently accurate (“valid”) MSM, the maximum resolution in time (minimum time step) allowed by the approximate description is termed the “Markovian lag time”. Formally, the intrastate relaxation establishes a lower bound to the lag time (16), which is the minimum simulation time required for MD data to parameterize the model.
Ultimately, what one would want is a handful of states that provide chemical interpretability for understanding complex biomolecular mechanisms. However, algorithms designed to maximize this timescale separation usually produce many physically obscure states. This is because downfolding to a biologically intuitive space subsumes slower interstate dynamics of the many-state space into the intrastate dynamics of the reduced space (27), increasing the lag time. For example, to model the millisecond folding of the NTL9 peptide using the available simulation data, Pande and coworkers required an MSM containing 2,000 states (with a lag time of 12 ns) (28), while recent work on RNA polymerase II (RNAP II) backtracking necessitated MSMs consisting of 800 states to reach Markovianity within the affordable trajectory length (29). Therefore, there is a balance to be struck: One wants to coarse-grain aggressively to facilitate interpretability, yet this generally leads to long lag times, which result in both poor temporal resolution and the need to perform longer MD simulations.
Recent work has demonstrated that one can employ non-Markovian theories to resolve the tensions at the heart of the MSM, increasing the resolution to be equal to the MD time step (30–34), while simultaneously using only a fraction of the data in the models’ construction (35). Of these, the GME, recently used in its time-convolution form as a quasi-MSM (qMSM), provides a particularly useful tool. Indeed, qMSMs have proven useful in tackling important problems such as the gate opening motion of a bacterial RNAP (36) and the mechanism of messenger RNA recognition and inhibition via the RNA-induced silencing complex (37). Like MSMs, GMEs are most efficient when there is a separation of timescales between intrastate and interstate dynamics. Unlike MSMs, GMEs encode the intrastate dynamics into a time-dependent friction function—a memory kernel—removing the approximation of perfect timescale separation. It is this explicit description of the non-Markovianity that allows the improved resolution in time. Yet, the time-nonlocal GME formulation precludes simple interpretation of the dynamics in terms of “states and rates,” which are typically used to describe the mechanisms of biological processes. This motivates the question: Is it possible to combine the interpretability of the MSM with the improved accuracy, resolution, and efficiency of GMEs?
In this work, we employ a time-convolutionless (TCL) GME approach that, like the qMSM, encodes the non-Markovian dynamics associated with intrastate motions but, unlike the qMSM, conserves the chemically intuitive nature of MSMs through the action of a generalized non-Markovian rate matrix. We show this easy, accurate, and efficient GME-based approach can capture the biomolecular dynamics of systems of varying complexity, with the resulting dynamics constituting an improvement that combines the advantages, while removing the limitations, of both qMSM and MSM approaches. Indeed, not only does the TCL-GME approach perform just as well as the qMSM on systems that can be exhaustively sampled, but in more difficult cases where all methods struggle to treat statistically underconverged MD data, the TCL-GME can be systematically improved in a manner that has no apparent analog in the qMSM (or MSM) case. We achieve this through a simple averaging protocol that leverages the onset of Markovian behavior to tame the deleterious effect of noise. Upon reformulating the TCL-GME in discrete-time (38), our averaging procedure provides a simple and robust scheme to capture the complex dynamics of biomolecular motions, even in cases that suffer from poor temporal resolution. Finally, in the extreme case where our averaging procedure includes the entire non-Markovian region, our TCL-GME reduces to a high-resolution version of the analogous MSM, recapitulating its identity as the non-Markovian generalization of the conventional MSM and fully elucidating the source of improvement over the traditional time-local approach. We demonstrate that our discrete-time method remains robust even when benchmarked against MD data that extend into the microsecond regime: two orders of magnitude longer than the time required to parameterize the model in question. 
The strict improvement of our time-local approach is epitomized by its ability to converge a computationally sensitive experimental observable (the folding time) using less than half of the data required by the traditional MSM.
Connecting Markovian and Non-Markovian Evolution
Whether one wants to directly use a long MD trajectory or many short MD simulations to elucidate complex biomolecular processes, the first task is to find the states that will provide one with the basis of a mechanistic interpretation. The second task is to construct an accurate and efficient description of the dynamics of such configurations. As we mentioned in the Introduction, below we do not consider how one identifies these configurational basins (the interested reader can see, for instance, refs. 18–21, 23–25, 39), but rather focus on the second problem: Given a set of configurations whose dynamics one can afford to simulate for only short times, how does one construct a dynamical framework to accurately and efficiently capture the dynamics of these configurations over all time?
To characterize the time-dependent transitions connecting states, it is natural to focus on their equilibrium time-correlation functions,
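$$\mathcal{C}_{jk}(t) = \frac{\langle \chi_j(0)\,\chi_k(t) \rangle}{\langle \chi_j \rangle} = \frac{1}{P_j}\int d\Gamma\, \rho_{\mathrm{eq}}(\Gamma)\, \chi_j(\Gamma)\, e^{\mathcal{L}t}\chi_k(\Gamma), \tag{1}$$
where $\rho_{\mathrm{eq}}(\Gamma) = e^{-\beta H(\Gamma)}/Z$ is invariant to time evolution, the MD Hamiltonian $H(\Gamma)$ is dependent on the coordinates ($\mathbf{q}_t$) and momenta ($\mathbf{p}_t$) of all atoms in the system at time t, with $\Gamma = \{\mathbf{q}, \mathbf{p}\}$, and $\mathcal{L} \equiv \{\cdot, H\}$ is the Poisson bracket that generates the evolution of the system. Here, {χk} are mutually orthogonal indicator functions that define the continuous sets of configurations that compose each state, and $P_j = \langle \chi_j \rangle = Z^{-1}\int d\Gamma\, e^{-\beta H(\Gamma)}\chi_j(\Gamma)$ is the equilibrium probability of state j with Z being the canonical partition function of the system. Since the states are mutually disjoint, $\mathcal{C}_{jk}(0) = \delta_{jk}$. These correlation functions, which together form the transition probability matrix (TPM), correspond to the conditional probability of finding the biomolecular complex in configuration k at time t given that it started in configuration j at t = 0. We now turn to both Markovian (MSMs) and non-Markovian (GMEs) descriptions of the dynamics of the TPM, $\mathcal{C}(t)$.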
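As a minimal illustration of how such a conditional-probability matrix can be estimated in practice, the sketch below counts transitions in a discretized trajectory. It assumes the MD frames have already been assigned integer state labels; the function name `estimate_tpm` and the sliding-window counting are illustrative choices, not the construction used in this work.

```python
import numpy as np

def estimate_tpm(dtraj, n_states, lag):
    """Estimate C[j, k] ~ P(state k at time t + lag | state j at time t).

    dtraj is a 1D sequence of integer state labels (one per saved MD frame)
    and lag is given in units of frames. This is a plain sliding-window
    count estimator (an assumption of this sketch).
    """
    dtraj = np.asarray(dtraj, dtype=int)
    start = dtraj[: len(dtraj) - lag]   # state at time t
    end = dtraj[lag:]                   # state at time t + lag
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (start, end), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows that were never visited are left as zeros instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

At zero lag the estimator returns the identity matrix for every visited state, consistent with the states being mutually disjoint.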
MSMs and qMSMs.
After configuration space has been partitioned into nonoverlapping states (22), to obtain a valid Markovian description of the TPM dynamics, the MSM framework requires one to identify the smallest timescale τL such that the TPM satisfies the Chapman–Kolmogorov condition (18, 40),
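$$\mathcal{C}(n\tau_L) = [\mathcal{C}(\tau_L)]^{n} = e^{\mathcal{R}_{\mathrm{MSM}}\, n\tau_L}. \tag{2}$$
Here, $\mathcal{R}_{\mathrm{MSM}}$ is a time-independent rate matrix, and τL is defined to be the Markovian lag time. In practice, Eq. 2 is rearranged such that τL is found by identifying the onset of a plateau in the implied timescale (ITS), defined as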
$$t_n(\tau) = -\frac{\tau}{\ln \lambda_n(\tau)}, \tag{3}$$
where $\lambda_n(\tau)$ is the nth eigenvalue of the TPM evaluated at time τ. The lag time τL is associated with the time taken for degrees of freedom within the aggregated states to achieve equilibrium and thus for the system to become memoryless, or Markovian. Once τL is identified, the configurational dynamics can be predicted at integer multiples of τL. In other words, τL defines the interval at which a given (non-Markovian) biomolecular process can be viewed as Markovian. Consequently, the resulting dynamics are discontinuous (40), obscuring dynamical processes that may occur within the interval [nτL, (n + 1)τL]. Furthermore, Eq. 2 implies that τL sets the lower bound on the MD simulation time required to parameterize the MSM that describes the TPM, $\mathcal{C}(t)$ (20). There is, however, no guarantee that intrastate equilibration will occur within a timescale that is affordable for MD (16).
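The two MSM operations described above, reading implied timescales from the eigenvalues of the TPM and propagating at integer multiples of the lag time, can be sketched as follows. The sketch assumes the standard ITS definition, t_n = -τ / ln λ_n(τ), with λ_n the eigenvalues of the TPM at lag τ; the function names are hypothetical.

```python
import numpy as np

def implied_timescales(tpm, lag):
    """Implied timescales -lag / ln|lambda_n| for the non-unit eigenvalues of C(lag)."""
    eigvals = np.linalg.eigvals(tpm)
    # Sort by magnitude (largest first) and drop the stationary eigenvalue, lambda = 1.
    mags = np.sort(np.abs(eigvals))[::-1][1:]
    return -lag / np.log(mags)

def chapman_kolmogorov(tpm_at_lag, n):
    """Markovian prediction of C(n * lag) as the n-th matrix power of C(lag)."""
    return np.linalg.matrix_power(tpm_at_lag, n)
```

In practice one would evaluate `implied_timescales` over a range of lag times and look for the onset of a plateau; the matrix-power prediction only yields the dynamics at multiples of the chosen lag.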
Recent work has shown that it is possible to employ a GME approach to account for the effect of memory (non-Markovian) behavior at early times, allowing one to construct a quasi-Markov state model (qMSM), given by
$$\dot{\mathcal{C}}(t) = \dot{\mathcal{C}}(0)\,\mathcal{C}(t) - \int_0^{t} ds\, \mathcal{K}(s)\,\mathcal{C}(t-s). \tag{4}$$
We note Eq. 4 does not contain an “inhomogeneous term” analogous to the random force in the language of the generalized Langevin equation (41, 42) because the GME is parameterized with equilibrium MD simulations, which is consistent with the correlation functions of interest given by Eq. 1 (35, 43, 44). In Eq. 4, the potentially complex intrastate dynamics are encoded into the time-dependent memory kernel $\mathcal{K}(t)$ (35). Crucially, $\mathcal{K}(t)$ decays to zero on a characteristic timescale τK, termed the kernel cutoff time, enabling one to approximate the upper limit of the integral in Eq. 4 as min {τK, t}. It has been further shown that τK ≤ τL, illustrating that the qMSM approach strictly improves upon the MSM. It does this by reducing the amount of simulation time needed to capture the exact dynamics, while simultaneously giving access to the dynamical events occurring between multiples of τL. Indeed, the qMSM offers remarkable accuracy and temporal resolution, and often requires much less MD simulation time to fully construct the generator of the dynamics, i.e., the memory kernel (35). The qMSM has been profitably applied to, for example, understand the significance of the β-lobe of RNA polymerase during transcriptional initiation (36), and elucidate the mechanisms used by the RNA-induced silencing complex to recognize and target mRNA molecules in a sequence-specific manner (45).
Unfortunately, the qMSM is not without its problems. First, evaluation of a convolution integral becomes computationally cumbersome as the dimension of the TPM increases. Second, constructing the memory kernel $\mathcal{K}(t)$ requires the first and second derivatives of the TPM $\mathcal{C}(t)$ (46), giving rise to numerical instabilities, which we will analyze in a later section. Third, from a qualitative perspective, the qMSM approach obfuscates the physical interpretation of the MSM in terms of “states and rates”. Specifically, the MSM provides a physically intuitive rate matrix, $\mathcal{R}_{\mathrm{MSM}}$, whose diagonals can be interpreted as the likelihood of remaining in a particular state, and whose off-diagonals describe the probability of making a transition from one state to another. In contrast, the memory kernel appears under a convolution integral in the equation of motion for the TPM, Eq. 4, and therefore cannot be understood separately from its cumulative effect over the history of the TPM. Hence, the qMSM does not appear to offer a simple way to interpret the memory kernel matrix elements in terms of instantaneous transition rates, e.g., where a number twice as large can be immediately identified as taking half as long to move between two states in a given chemical scheme. These complications motivate the search for an alternative method that captures the exact dynamics in a robust, accurate, efficient, and intuitive manner.
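To make the cost of the convolution concrete, the following sketch integrates a time-nonlocal GME with a forward-Euler step and a rectangle-rule memory sum. It assumes the equation of motion has the form dC/dt = dC/dt(0) C(t) − ∫ K(s) C(t − s) ds, with the kernel set to zero beyond its cutoff; the sign and ordering conventions, the integrator, and the name `propagate_gme` are all assumptions of this illustration rather than the qMSM implementation of refs. 35 and 45.

```python
import numpy as np

def propagate_gme(c_dot0, kernel, dt, n_steps):
    """Forward-Euler integration of a time-nonlocal GME on a uniform grid.

    c_dot0 : (n, n) array, time derivative of the TPM at t = 0.
    kernel : (n_kernel, n, n) array, memory kernel sampled every dt and
             assumed to vanish beyond its last sample (the cutoff).
    Returns an array of shape (n_steps + 1, n, n) starting from C(0) = identity.
    """
    c_dot0 = np.asarray(c_dot0, dtype=float)
    tpm = [np.eye(c_dot0.shape[0])]
    for step in range(n_steps):
        memory = np.zeros_like(c_dot0)
        # The history sum grows with time until it saturates at the kernel cutoff.
        for l in range(min(step + 1, len(kernel))):
            memory += kernel[l] @ tpm[step - l] * dt
        tpm.append(tpm[step] + dt * (c_dot0 @ tpm[step] - memory))
    return np.array(tpm)
```

Every step requires a sum over the stored history up to the cutoff, which is the overhead that a time-local formulation avoids.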
The TCL-GME.
For a non-Markovian theory, such as the qMSM, to be interpreted in terms of rates, one would want to write it in a time-local form, comparable to Eq. 2. For this reason, we perform the formally exact rewriting of Eq. 4 as a time-convolutionless (TCL) GME (47–49),
$$\dot{\mathcal{C}}(t) = \mathcal{R}(t)\,\mathcal{C}(t), \tag{5}$$
where $\mathcal{R}(t)$ is the time-local generator that encodes the non-Markovian dynamics arising from imperfect timescale separation between intrastate and interstate dynamics, and can be understood as a generalized time-dependent rate matrix. Furthermore, the matrix elements of the time-local generator plateau at a characteristic timescale, τR (38, 49), allowing one to separate the time over which non-Markovian evolution takes place (0 ≤ t < τR) from the time at which Markovian evolution begins,
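$$\mathcal{C}(t \geq \tau_R) = e^{\mathcal{R}(\tau_R)\,(t - \tau_R)}\,\mathcal{C}(\tau_R), \tag{6}$$
where $\mathcal{C}(\tau_R) = \mathcal{T}\exp\left[\int_0^{\tau_R} ds\, \mathcal{R}(s)\right]\mathcal{C}(0)$ is the value of the TPM at τR given by the action of the time-ordered propagator on the initial condition, $\mathcal{C}(0) = \mathbf{1}$, and $\mathcal{R}(\tau_R)$ is the long-time limit of the time-local generator. Thus, $\mathcal{R}(\tau_R)$ is the time-independent rate matrix that encodes the true Markovian evolution of $\mathcal{C}(t)$ beyond τR and elucidates the connection with Eq. 2.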
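A minimal numerical sketch of this time-local picture is given below: the generator is extracted from TPM snapshots by finite differences and, once it has plateaued, a single matrix exponential carries the dynamics forward. The sketch assumes the ordering R(t) = [dC/dt] C(t)^(-1), a uniform time grid, and a diagonalizable generator (so that the exponential can be built from an eigendecomposition); all function names are hypothetical.

```python
import numpy as np

def tcl_generator(tpms, dt):
    """R(t_i) = dC/dt(t_i) @ inv(C(t_i)) from snapshots of shape (n_times, n, n)."""
    tpms = np.asarray(tpms, dtype=float)
    d_tpms = np.gradient(tpms, dt, axis=0)  # central differences at interior points
    return d_tpms @ np.linalg.inv(tpms)     # matmul and inv broadcast over the time axis

def expm_eig(a):
    """Matrix exponential via eigendecomposition (assumes `a` is diagonalizable)."""
    w, v = np.linalg.eig(a)
    return ((v * np.exp(w)) @ np.linalg.inv(v)).real

def propagate_beyond(tpm_cut, rate_cut, elapsed):
    """C(tau_R + elapsed) = exp(R(tau_R) * elapsed) @ C(tau_R)."""
    return expm_eig(rate_cut * elapsed) @ tpm_cut
```

Because the finite difference acts directly on the sampled TPM, any statistical noise in the snapshots enters the generator and is then exponentiated by `propagate_beyond`.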
Since the two timescales, τL and τR, determine the minimal amount of simulation data required to fully construct the MSM and TCL-GME, respectively, it would be profitable to derive a relationship connecting the two quantities. In Rigorous Connection of MSM with TCL-GME, we analytically demonstrate that
[7] |
where Λ quantifies the deviation that intrastate motions cause on otherwise Markovian interstate transition rates. Comparing this to Eq. 3 allows us to state that
$$\tau_R \leq \tau_L. \tag{8}$$
Importantly, Eq. 8 demonstrates that the only case in which an MSM can be as data-efficient as the TCL-GME, albeit at the cost of a lower temporal resolution, is when Λ = 0. This inequality thus enforces a new lower bound on the amount of required simulation time and is one of the central results of the paper, demonstrating that the TCL-GME always provides a description that is more data-efficient than, or at worst as data-efficient as, the MSM while retaining a high temporal resolution. What remains to be shown is the relative accuracy and efficiency of the TCL-GME approach in comparison with the qMSM. We will achieve this by comparing the performance of each dynamical approach on three different protein systems of varying levels of complexity: alanine dipeptide (35), the human argonaute complex (45), and the FiP35 WW domain (35, 50).
All-Atom Protein Systems
In what follows, we apply the TCL-GME to three systems of varying complexity—alanine dipeptide, argonaute, and FiP35 WW domain—and compare these predicted dynamics with those calculated by both the MSM and qMSM. Here, as previously stated, we do not consider the specifics of how to construct the reduced space but rather restrict our attention to their dynamics. First, for alanine dipeptide, we consider a 4-state model with metastable states corresponding to the molecule’s free energy projected onto the backbone torsional angles {ψ, ϕ}, as constructed in ref. 35. Second, for argonaute, we use another 4-state model from structures corresponding to local minima in the free energy landscape of the first two slowest modes, as constructed in ref. 45. Finally, for FiP35 WW domain, we use two reduced models: The first contains 3 states, and its construction is detailed in TPM Construction; the second contains 4 states corresponding to a folded state composed of two β-hairpins, an unfolded state, and structures corresponding to both on- and off-pathways, and its construction is outlined in ref. 35. To clearly benchmark each method while illustrating its advantages and disadvantages, we show only one of the time-dependent conditional probabilities for each protein system. The full time-dependent conditional probability matrices are available in SI Appendix.
Alanine Dipeptide.
We begin our analysis of the TCL-GME and illustrate the utility of the inequality in Eq. 8 using a simple test system, alanine dipeptide. After obtaining TPM dynamics from MD simulation, we construct an MSM as discussed in ref. 35, and we construct both the qMSM and TCL-GME as described in Materials and Methods. In Fig. 1A, we identify the values of τR, τK, and τL using a root mean square error (RMSE) analysis (RMSE Analysis) that quantifies the deviation of the dynamics predicted as a function of τL, τR, and τK from the reference dynamics. We use a convergence threshold of 5% of the final value in the RMSE, which leads to graphical accuracy in the resulting dynamics. This corresponds to quantitative agreement between the predicted dynamics and the MD data (open circles) as shown in Fig. 1 B and C. For the qMSM and the TCL-GME, this leads to τR = τK = 1.5 ps, while for the MSM the lag time at the same error is τL = 10 ps. The results in Fig. 1B show the dynamics that would result if one could only use TPM data, obtained from the MD, for the first 1.5 ps; such a choice of τL leads the MSM to severely overestimate the equilibration rate. In contrast, Fig. 1C shows how a valid MSM is able to capture the exact dynamics, albeit with severely reduced temporal resolution. The drawback of the finite resolution is visible at earlier times, where the (negative) curvature of the MD data is neglected by the MSM but captured by the GMEs. Together, the results of Fig. 1 show that the TCL-GME suffers no loss of performance with respect to the qMSM, with both GMEs being able to make accurate high-resolution predictions using only 15% of the MD data required to construct a valid MSM.
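The cutoff-selection logic described above can be sketched as follows. The RMSE here is taken over all time points and matrix elements, and the convergence rule returns the smallest cutoff beyond which the error stays at or below a user-supplied threshold; both choices are assumptions of this sketch (the definition used in this work is given in RMSE Analysis), and the function names are hypothetical.

```python
import numpy as np

def rmse(predicted, reference):
    """Root mean square deviation between predicted and reference TPM dynamics."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))

def first_converged_cutoff(cutoffs, errors, threshold):
    """Smallest cutoff after which every later error is <= threshold, or None."""
    errors = np.asarray(errors, dtype=float)
    above = np.nonzero(errors > threshold)[0]
    if above.size == 0:
        return cutoffs[0]
    if above[-1] == len(errors) - 1:
        return None  # the error never settles below the threshold
    return cutoffs[above[-1] + 1]
```

With the values quoted above, the GME cutoff of 1.5 ps relative to the MSM lag time of 10 ps corresponds to 1.5 / 10 = 15% of the MD data.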
Fig. 1.
Application of the TCL-GME to alanine dipeptide with comparisons to the MSM and qMSM. (A) Root mean square error (RMSE) curves for the MSM, qMSM, and TCL-GME quantifying the deviation from the MD data (open circles) as the model is parameterized with increasing amounts of data (RMSE Analysis). Vertical lines show the errors associated with cutoffs (τ) of 1.5 ps and 10 ps. Alanine dipeptide is shown (2 residues). (B) State 1 TPM dynamics, $\mathcal{C}_{11}(t)$, computed with MSM, qMSM, and TCL-GME approaches parameterized with 1.5 ps of MD data, i.e., τL = τK = τR = 1.5 ps. (C) State 1 TPM dynamics computed with τL = τK = τR = 10 ps. The 4-state TPMs parameterized with τK = τR = 1.5 ps and τL = 10 ps are shown in SI Appendix, Fig. S1. MD error bars were obtained using a bootstrapping approach as discussed in ref. 35.
Argonaute.
Will the simplistic form of Eq. 6 maintain a comparable level of performance to the qMSM for a much more complicated system? To address this, we consider the target mRNA recognition of human argonaute 2 complex (37, 51). It is challenging to obtain sufficient MD sampling to model the dynamics of this complex process, which involves coupled conformational changes of messenger RNA, microRNA, and the Argonaute protein. In fact, the ITS curves shown in Fig. 2 do not plateau over the available time window, demonstrating that the available TPM time is not sufficient to construct a valid MSM. That is to say, constructing an MSM is unaffordable at the same level of dimensionality reduction as the faithful qMSM (45).
Fig. 2.
Demonstration that the massive spatial and temporal scales of the argonaute protein present a challenge to MSMs. Left: Implied timescales (ITS) plot of Eq. 3, for the three nonunitary eigenvalues, whose plateau time corresponds to the Markovian lag time, τL. Diamonds show the choice of τL in Fig. 3, but one can appreciate that no choice for this window of MD data would be satisfactory. Using the -GME approach (discussed in this section), Markovianity is found to require ∼1,200 times as much simulation data. Right: Rendering of the argonaute protein containing the mRNA strand used to obtain the MD data. The protein itself is composed of 831 residues.
Owing to the statistical noise that arises from averaging over limited MD data to construct the TPM (45), the numerically extracted memory kernel $\mathcal{K}(t)$ and time-local generator $\mathcal{R}(t)$ in Fig. 3 A and C also display noise that makes it difficult to graphically identify their cutoff times, τK and τR, respectively. To illustrate how both GMEs behave as the cutoff time τ is increased, we display the dynamics predicted from each method using representative cutoff choices of τR, τK ∈ {25, 35, 45, 55} ns in Fig. 3 B and D. Interestingly, the qMSM and TCL-GME perform similarly, with both $\mathcal{K}(t)$ and $\mathcal{R}(t)$ predicting dynamics within the MD error bars using cutoff times of 35 ns. We emphasize that the ultimate goal is to identify the value of τR or τK such that the predicted dynamics match the MD reference data (open circles) exactly. Disappointingly, neither GME exhibits stability with respect to increasing τR or τK, and the resulting RMSE curves do not monotonically converge toward zero (SI Appendix, Fig. S2). For example, when we parameterize either model with the longer value of τK = τR = 45 ns, the resulting dynamics do not lead to an equilibrium value. This suggests that this truncation of the memory kernel or time-local generator fails to recover detailed balance.
Fig. 3.
Instability of the qMSM and TCL-GME in the case of the argonaute protein and demonstration of the robustness of our -GME approach. (A) The transparent line shows the state 2 memory kernel, $\mathcal{K}_{22}(t)$, as a function of time. From the RMSE (SI Appendix, Fig. S2A), we observe that the memory kernel converges by 35 ns. The solid line shows the replacement of $\mathcal{K}_{22}(t)$ with zero after this time. (B) Time-dependent conditional probability of starting in state 2 and remaining in state 2 (state 2 dynamics) predicted using the qMSM with τK ∈ {25, 35, 45, 55} ns, where increasing transparency corresponds to decreasing values of τK. (C) Similar to (A), the transparent line shows the state 2 time-local generator, $\mathcal{R}_{22}(t)$, as a function of time, and the solid line shows the replacement of $\mathcal{R}_{22}(t)$ with $\mathcal{R}_{22}(\tau_R)$ after τR = 30 ns. (D) State 2 dynamics predicted using the TCL-GME with τR ∈ {25, 35, 45, 55} ns, where increasing transparency corresponds to decreasing values of τR. (E) Like (C), the transparent line shows $\mathcal{R}_{22}(t)$ as a function of time. Here, the solid line instead illustrates the replacement of $\mathcal{R}_{22}(t)$ with its time average over the window [20, 30] ns after τR = 30 ns, i.e., (tr, τR) = (20, 30) ns. (F) Dynamics predicted using the -GME. (G) The transparent line shows the propagator $\mathcal{U}_{22}(t)$ as a function of time, and the solid line shows the replacement of $\mathcal{U}_{22}(t)$ with its average over the window [20, 30] ns after τR = 30 ns. (H) Dynamics predicted using the -GME. In (B), (D), (F), and (H), we show an MSM parameterized with τL = 50 ns. The MD data and error bars were computed using the bootstrapping approach (see ref. 45 for details).
This lack of controlled convergence can be rationalized by recalling that constructing the GME requires time derivatives of the MD data (Materials and Methods, Eq. 14). This is true for both the memory kernel $\mathcal{K}(t)$ and the time-local generator $\mathcal{R}(t)$. One might hypothesize that the noise in these underconverged MD data is sufficient to compromise the stability of both GME approaches for argonaute. Since TPMs at longer times, like other equilibrium time correlation functions (6, 52), are constructed from averaging over less MD data, they are beset by worse statistical errors. Hence, working with the hypothesis that the fluctuations at later times correspond to noise from statistically underconverged dynamics, we posit a method that averages the noise in $\mathcal{R}(t)$ at long times. In fact, in the qMSM approach, truncation at τK equates to replacing $\mathcal{K}(t)$ with its long-time average. However, while $\mathcal{K}(t \to \infty) = 0$ for dissipative problems that equilibrate, we can only estimate the long-time value of $\mathcal{R}(t)$.
Visually, Fig. 3C suggests that $\mathcal{R}(t)$ starts to oscillate around its long-time limit at approximately t = 10 ns. Thus, we introduce an averaging scheme where at τR we replace $\mathcal{R}(t)$ with $\bar{\mathcal{R}}$, its time average over the interval [tr, τR]. Here, we choose tr to be the time where the time-local generator appears to have plateaued (TCL-GME Construction). We identify tr = 10 ns and show the corresponding matrix element for τR = 30 ns in Fig. 3E. As Fig. 3F shows, with this simple adjustment the TCL-GME converges to the reference dynamics within 55 ns, which strictly improves upon both the MSM and qMSM constructed from the same data. Moreover, the error of the TCL-GME decreases monotonically with increasing values of τR (SI Appendix, Fig. S3).
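The averaging step itself is simple, as the sketch below illustrates: the noisy generator samples inside the window [tr, τR] are replaced by their arithmetic mean. The uniform, unweighted average over an inclusive window and the name `averaged_generator` are assumptions of this illustration.

```python
import numpy as np

def averaged_generator(rates, times, t_start, t_cut):
    """Mean of the time-local generator samples with t_start <= t <= t_cut.

    rates : (n_times, n, n) array of generator matrices R(t_i).
    times : (n_times,) array of the corresponding times t_i.
    """
    times = np.asarray(times)
    window = (times >= t_start) & (times <= t_cut)
    return np.asarray(rates, dtype=float)[window].mean(axis=0)
```

If the fluctuations inside the window are zero-mean noise around a plateau, the average recovers the plateau value far more reliably than any single noisy sample would.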
A closer look at Fig. 3F reveals that the averaging scheme approaches the reference dynamics from below, but does not actually attain perfect agreement within these 150 ns. To remedy this, one could average for longer to get a better estimate for $\mathcal{R}(t \to \infty)$. However, this would run counter to our objective of working with the minimal possible MD data. Additionally, as one can appreciate from Eq. 6, error in the estimation of $\mathcal{R}(\tau_R)$ is exponentiated when predicting the GME dynamics. To address both issues, we propose an alternative route to employ the TCL-GME formalism without requiring any time derivatives or exponentiation of noisy data (38). This simply requires recasting Eq. 6 as
$$\mathcal{C}(\tau_R + n\,\delta t) = [\mathcal{U}(\tau_R)]^{n}\,\mathcal{C}(\tau_R). \tag{9}$$
That is, we now work directly with the time-dependent propagator (53), $\mathcal{U}(t) \equiv \mathcal{C}(t + \delta t)\,[\mathcal{C}(t)]^{-1}$, with δt the time step of the TPM data, whose construction is detailed in -GME Construction. This obviates integration of Eq. 5, and so the noise in the data is never exponentiated during our calculations. Moreover, this method has been shown to be robust with respect to low-resolution dynamical data in quantum dynamical problems (38). Importantly for the protein-folding problem, both the time-local interpretability and frugality that result from the plateau at τR are unaffected by this manipulation.
Here, we extend the protocol proposed in ref. 38 by combining the direct calculation of the propagator $\mathcal{U}(t)$ with the aforementioned averaging scheme. This results in our most direct and noise-resilient TCL-GME formulation. We identify tr to be 10 ns and, in Fig. 3 G and H, we show the results of this -GME. Here, with only minimal adjustments to the original formulation, the -GME monotonically converges to the MD data within 55 ns, maintaining the strict improvements of the TCL-GME over both MSM and qMSM approaches.
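The combination of a directly computed propagator with window averaging can be sketched in a few lines. The sketch assumes the propagator is defined as U(t) = C(t + δt) C(t)^(-1) on a uniform grid of TPM snapshots and that the average is an unweighted mean over snapshot indices; the function names and index-based interface are hypothetical.

```python
import numpy as np

def propagators(tpms):
    """U(t_i) = C(t_{i+1}) @ inv(C(t_i)) for consecutive TPM snapshots."""
    tpms = np.asarray(tpms, dtype=float)
    return tpms[1:] @ np.linalg.inv(tpms[:-1])

def extrapolate(tpms, i_start, i_cut, n_steps):
    """Average U over snapshots i_start..i_cut (inclusive) and iterate from C(t_{i_cut}).

    Note that U(t_{i_cut}) needs the snapshot at index i_cut + 1, so the input
    must extend one time step beyond the cutoff.
    """
    tpms = np.asarray(tpms, dtype=float)
    u_bar = propagators(tpms)[i_start : i_cut + 1].mean(axis=0)
    out = [tpms[i_cut]]
    for _ in range(n_steps):
        out.append(u_bar @ out[-1])  # no derivatives or matrix exponentials needed
    return np.array(out)
```

Only matrix inversion, averaging, and repeated multiplication appear, so noise in the snapshots is neither differentiated nor exponentiated.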
With the convergent and stable -GME dynamics obtained above, we can now determine the true lag time required for a valid MSM description of the dynamics of the 4 states used to elucidate mRNA recognition in the argonaute complex in ref. 45. To do this, we employ the -GME to predict the TPM dynamics at long times and use Eq. 3 to obtain the ITS plot (SI Appendix, Fig. S4). We observe that the ITS curves only plateau by t ∼ 60 μs, indicating that τL is 1,200 times larger than the lag time used for the MSM constructed in ref. 45. By comparison, the time-local generator cutoff used in our -GME, τR ∼ 50 ns, is more than 3 orders of magnitude smaller, demonstrating that our approach provides a highly compact and efficient means to fully capture the short- as well as long-time dynamics of complex biomolecular systems.
FiP35 WW-Domain.
The -GME method requires two convergence parameters: tr, the beginning of the averaging window, and τR, the total amount of MD simulation time required to parameterize the model (-GME Construction). This raises an important practical question: How does one choose tr when the onset of the plateau in the propagator $\mathcal{U}(t)$ is hidden under the noise? After all, one might expect to observe a lack of convergence when tr is chosen to be too early. However, by considering a 4-state model of the FiP35 WW domain, we find that this is not the case. In this system, where the plateau is not visually obvious (Fig. 4C), we observe that for every choice of tr, there is a value of τR capable of accurately capturing the reference dynamics. In Fig. 4A we demonstrate that the τR required for the -GME to provide accurate dynamics merely increases as tr is reduced to zero. Indeed, since we know from Eq. 8 that τR is bounded above by τL, if tr is given the extreme value of zero, then the -GME reduces to the MSM, with the important distinction that it is able to capture the dynamics between MSM points (SI Appendix, Fig. S5); we also give the mathematical justification for this result in -GME Construction. In this sense, the -GME parameterized with tr = 0 constitutes a higher-resolution MSM. The practical implication of this is that while one may make a poor choice of tr to begin averaging from, one will only pay for this in the length of MD data required to construct the model, τR, and not in the final accuracy of the -GME dynamics.
Fig. 4.
Ability of our -GME to accurately predict the dynamics of the FiP35 WW domain. (A) RMSE curves for the MSM and the -GME as a function of τL and τR, while varying choices of tr to illustrate convergence. The structure of the FiP35 WW domain is shown (35 residues). (B) TPM dynamics () computed using -GME and MSM approaches with τR = 25 ns (ℓ = 5 ns) and τL = 25 ns. (C) The propagator as a function of time, showing that has been replaced with its average at 25 ns.
The best, earliest choice of τR is therefore parametrically dependent on tr, but well defined. Since all choices of tr converge to the same RMSE value, τR is robustly identified by a common convergence threshold. To identify the optimal (tr, τR) pair, we simply find the minimum of the plot of τR as a function of tr. Choosing a value of 5% error as converging to the MD dynamics within visual accuracy (RMSE), for these FiP35 WW domain data we identify tr = 20 ns, τR = 25 ns, and τL = 200 ns, as shown in Fig. 4A. For comparison, we display the dynamics predicted by both the MSM and -GME when parameterized using only these 25 ns of MD data in Fig. 4B. In Fig. 4C, we show the replacement of with its average (obtained over the averaging interval of [20, 25] ns). We observe that the MSM dynamics predicted using only 25 ns of the MD dataset overestimate the equilibration rate, as was the case with alanine dipeptide and the argonaute complex, whereas the -GME parameterized with the same amount of reference data accurately captures the MD data until ∼375 ns. The small deviation that starts at ∼375 ns disappears at longer times, where the -GME correctly captures the long-time limit (SI Appendix, Fig. S6). Thus, our analysis shows that accurate predictions of the dynamics from the -GME require only 12.5% of the MD data needed to construct a valid MSM (τR = 25 ns versus τL = 200 ns).
We now consider the ability of the -GME to capture the long-time dynamics through a different, experimentally accessible measure: the folding time of the protein. For this, we consider a 3-state model of FiP35 WW domain (for construction details, see TPM Construction) with states one, two, and three corresponding to misfolded, unfolded, and folded structures of the protein, respectively (50). Here, we compute the folding time using the mean first passage time (MFPT) procedure outlined in MFPT Method. First, we use the reference dynamics (SI Appendix, Fig. S7) to compute the folding time, τref = 18.65 μs (SI Appendix, Fig. S8), which we take to be the exact result for this model and which is in reasonable agreement with the experimentally measured value of 14 ± 1.5 μs (54). Part of the remaining difference may stem from the state definitions: if the clustering algorithm does not correctly assign configurations to the folded, unfolded, and misfolded states, the folding time may appear artificially long. Hence, we focus not on the deviation from the experimental value but rather on the internal consistency between the reference dynamics and the predictions from the -GME and the MSM approaches. To obtain the -GME predictions of the MFPT, we first identify tr = 50 ns. As described in MFPT Method, we compute the MFPTs corresponding to increasing values of τR and τL and observe that both the -GME and MSM approaches converge to the reference result at long times (SI Appendix, Fig. S8). We also find that the MSM continuously underestimates τref and appears to continue increasing at times beyond 1,000 ns (SI Appendix, Fig. S8). In contrast, the -GME remains within 8% of the reference value for the duration of available MD data.
Indeed, to converge within 5% error, the -GME requires data up to 168 ns, whereas the MSM does not reach this threshold until 452 ns, suggesting that the -GME provides, even in the estimation of folding times, a more efficient means to capture the long-time dynamics of complex biomolecular systems. This is in agreement with previous works demonstrating that a purely Markovian process fails to faithfully capture barrier crossing phenomena (31, 55).
Conclusion
In this work, we have developed and applied the -GME and demonstrated that it is an accurate, chemically intuitive, and systematically improvable approach to modeling non-Markovian biomolecular dynamics. While previous work had exploited the memory of the MSM’s intrastate motions to construct an exact qMSM that could significantly reduce the computational cost required to efficiently predict protein dynamics at long times, it eluded a simple and intuitive chemical interpretation and, as we show here, is highly sensitive to statistical noise in the reference TPM dynamics from which it must be constructed. Here, we have abandoned the time-nonlocal qMSM by moving to a time-convolutionless formulation, which admits a simple formal integration, elucidates the analytical connection between GMEs and MSMs, and permits a simple interpretation. In particular, not only does this allow the time-local generator to be interpreted as a time-dependent rate matrix, it also allows for systematic improvement in regimes of noisy data. Specifically, we have identified that for cases where the reference TPM suffers from statistical noise (e.g., the argonaute system), a straightforward averaging scheme allows our time-convolutionless approaches (both and ) to uniformly converge to the reference dynamics. In contrast, the time-nonlocal approach displays instabilities with increasing simulation time that have no comparable solution without resorting to manipulations of the qMSM formalism such as the introduction of an integrative form of the GME (56). Furthermore, using alanine dipeptide, FiP35 WW domain, and argonaute, we have demonstrated that the time-local GME can accurately and efficiently capture short-, intermediate-, and long-time dynamics with no loss of performance.
Not only does this approach require an amount of data equivalent to that of the qMSM, but the -GME also keeps numerical and physical complexity to a minimum by eliminating the need for both time-convolution integrals and numerical time derivatives of potentially noisy data. By providing a rigorous and physically transparent method to capture the non-Markovian dynamics of a given set of states, we expect the -GME to provide a robust scaffold to construct methods to find optimal configuration clusters and offer a framework to investigate the mechanisms of complex biomolecular conformational changes.
Materials and Methods
A. Rigorous Connection of MSM with TCL-GME.
Here, we derive Eqs. 7 and 8 from the main text, which rigorously connect the MSM to the TCL-GME. We begin by considering some time t that is strictly greater than τR and rewriting as
[10] |
where we have used the fact that the initial condition is the identity matrix, and have introduced UnM(τR, 0) as the propagator over the non-Markovian region. This is equivalent to Eq. 6 in the main text. We insert the above result into the implied timescale equation, defined in Eq. 3, to obtain the result in the main text,
where
[11] |
Eq. 11 is exact and easy to calculate given the framework presented here for obtaining the non-Markovian propagator; it can be interpreted as the total deviation in the propagation due to non-Markovian behavior. Keeping in mind that the MSM lag time is taken to be the minimum timescale associated with the onset of a plateau in an ITS plot, we see that the right-hand side of Eq. 7 does not necessarily stabilize for times immediately after τR. This allows us to conclude the inequality presented in the main text (Eq. 8), namely that τR ≤ τL.
To further simplify its interpretation, one can neglect the effect of time-ordering in the definition of the non-Markovian propagator, which yields the following, modified expression for Λ,
[12] |
Here, it is clear that Λ approximately corresponds to the integrated deviation of the time-local generator, over the period of its non-Markovian variation, from its long-time limit.
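As a minimal sketch of how an ITS curve of the kind referred to in this section can be evaluated, the snippet below assumes that Eq. 3 takes the standard MSM form, t_k(τ) = −τ / ln|λ_k(τ)|, with λ_k the eigenvalues of the TPM at lag time τ. The function name `implied_timescales` and the toy matrix are illustrative assumptions and do not come from the original work.

```python
import numpy as np

def implied_timescales(tpm, lag_time):
    """Implied timescales from one TPM, assuming t_k = -lag_time / ln|lambda_k|."""
    magnitudes = np.sort(np.abs(np.linalg.eigvals(tpm)))[::-1]
    # The leading eigenvalue (magnitude 1) is the stationary mode; it has no finite timescale.
    return -lag_time / np.log(magnitudes[1:])

# Toy column-stochastic 2-state TPM at a lag time of 1 (arbitrary time units).
toy_tpm = np.array([[0.9, 0.1],
                    [0.1, 0.9]])
toy_its = implied_timescales(toy_tpm, lag_time=1.0)
```

Repeating this calculation over a range of lag times and locating the onset of the plateau gives the MSM lag time τL discussed above.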
B. TPM Construction.
The transition probability matrix (TPM) is computed from the transition count matrix (TCM), which we first computed from the MD trajectories. For each lag time τ, the raw TCMs (Traw) were counted from transition pairs between frames at t and t + τ: Tijraw(τ) = ⟨χi(t + τ)χj(t)⟩, where χi(t) is the indicator function that equals one if the frame at time t is in state i and zero otherwise. Here, t = 0, Δt, 2Δt, ..., (Ntraj − 1)Δt − τ, where Δt is the saving interval of the trajectories and Ntraj is the length of the trajectories in frames. Detailed balance requires that the TCM be symmetric, i.e., Tij = Tji. However, since the raw TCMs are generally not symmetric, we further symmetrize them to enforce detailed balance: Tij(τ) = (Tijraw(τ) + Tjiraw(τ))/2 (57). Finally, we calculated the TPMs by column-normalizing the TCM, i.e., by dividing each element Tij(τ) by the sum of its column, Σk Tkj(τ).
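The counting, symmetrization, and column normalization described above can be sketched as follows. This is an illustrative NumPy version for a single discretized trajectory, not the authors' code; the function name `transition_probability_matrix`, the argument names, and the toy trajectory are assumptions.

```python
import numpy as np

def transition_probability_matrix(dtraj, n_states, lag):
    """Estimate a column-normalized TPM at one lag time from a state trajectory.

    dtraj: 1-D integer array of state indices, one per saved frame.
    lag:   lag time in units of the saving interval (a positive number of frames).
    """
    dtraj = np.asarray(dtraj)
    if lag < 1 or lag >= len(dtraj):
        raise ValueError("lag must satisfy 1 <= lag < len(dtraj)")
    counts = np.zeros((n_states, n_states))
    # counts[i, j] tallies pairs with state j at frame t and state i at frame t + lag.
    np.add.at(counts, (dtraj[lag:], dtraj[:-lag]), 1.0)
    # Symmetrize to enforce detailed balance: T_ij = (T_ij^raw + T_ji^raw) / 2.
    counts = 0.5 * (counts + counts.T)
    # Column-normalize so that every column sums to one.
    col_sums = counts.sum(axis=0, keepdims=True)
    if np.any(col_sums == 0):
        raise ValueError("a state has no counts at this lag time")
    return counts / col_sums

# Toy trajectory over three states, saved every frame.
dtraj = np.array([0, 0, 1, 1, 0, 2, 2, 1, 0, 0])
tpm = transition_probability_matrix(dtraj, n_states=3, lag=1)
```

With several trajectories, one would accumulate the raw counts over all of them before symmetrizing and normalizing.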
In our TPM construction, the raw TCM was directly counted from the macrostate models. The 4-state model of alanine dipeptide was constructed with a splitting-and-lumping approach. We first split all the available MD conformations into 1,000 microstates using the K-Centers clustering algorithm (58–60). Then we lumped the 1,000 microstates into 4 macrostates via the PCCA+ (Perron cluster–cluster analysis) (61, 62), with the lag time of 2 ps.
We constructed the 3-state model of the FiP35 WW domain using tICA (time-lagged independent component analysis) (57, 63), K-Centers clustering (58–60), and PCCA+ (Perron cluster–cluster analysis) (61, 62) lumping from MD trajectories provided by D. E. Shaw Research. We first performed tICA with pairwise distances between all α carbon atoms of the peptide with a lag time of 10 ns. Then we used the K-Centers algorithm to generate a 1,000-state model based on the top three tICs (time-lagged independent components) from tICA. Finally, we performed PCCA+ lumping to generate the 3-state model based on the 1,000-state TPM computed at the lag time of 10 ns.
We constructed the 4-state model of the argonaute complex using spectral oASIS (64), tICA, APLoD clustering, and PCCA+ (61, 65). We employed spectral oASIS to reduce the number of input features, followed by tICA for dimensionality reduction. Then we grouped the conformations into 81 clusters using the APLoD clustering algorithm, based on the top 4 tICs from tICA. Finally, we used the PCCA+ algorithm to group the 81 microstates into 4 macrostates.
C. qMSM Construction.
To solve the integro-differential equation in Eq. 4, we must first construct the memory kernel, , as a function of time directly from the TPM data. We follow ref. 46 and derive the classical analog of the self-consistent expansion of the memory kernel
[13] |
where
[14] |
are the projection-free auxiliary kernels.
To compute both and , and to thus compute and , we numerically differentiate the TPM data, . With these auxiliary kernels, we compute according to Eq. 13 using the discretization procedure in ref. 66. For completeness, we summarize the algorithm. At the initial time and first timestep, and
[15] |
For all subsequent times (n ≥ 2),
[16] |
Here, 1 is the identity matrix, and we employ equally spaced time intervals, such that Δt ≡ t_{j+1} − t_j.
Once we construct , we employ Heun’s method (second-order accurate with respect to Δt) to integrate Eq. 4 and obtain . Then we identify an appropriate memory kernel cutoff time, τK, by applying the analysis described in RMSE Analysis. We approximate the upper limit of the integral in Eq. 4 with τK, enabling us to predict the dynamics for times beyond the duration of the MD simulation.
D. TCL-GME Construction.
To reap the benefits of the time-local formalism, we first calculate from the TPMs obtained from MD simulation. We do this by rearranging Eq. 5 via matrix inversion to obtain
[17] |
where we calculate by numerically differentiating the TPM data. As we have discussed, the matrix elements of plateau on a timescale, τR, associated with the conclusion of non-Markovian evolution, allowing us to set . With this definition, we can describe the dynamics after the onset of Markovian evolution, as shown in Eq. 6.
Once we find , we employ Heun’s method to integrate Eq. 5 and obtain . Similar to the discussion in qMSM Construction, we identify an appropriate generator cutoff time, τR, using the analysis described in RMSE Analysis.
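The two steps of this section (extracting the time-local generator by matrix inversion, then integrating with Heun's method) can be illustrated with the sketch below. It assumes that Eq. 5 has the form dC/dt = R(t) C(t), where C(t) denotes the TPM and R(t) the time-local generator; the symbols `C`, `R`, and `Q`, the function names, the use of `np.gradient` for the numerical derivative, and the synthetic Markovian test data are all assumptions made for illustration rather than details taken from the original work.

```python
import numpy as np

def time_local_generator(C, dt):
    """R(t_n) = dC/dt(t_n) @ inv(C(t_n)) for TPMs C[n] sampled every dt."""
    dC = np.gradient(C, dt, axis=0, edge_order=2)  # finite-difference time derivative
    return np.stack([dC[n] @ np.linalg.inv(C[n]) for n in range(len(C))])

def propagate_tcl(R, dt, n_steps, cutoff):
    """Heun integration of dC/dt = R(t) C(t), holding R at R[cutoff] for later times."""
    n = R.shape[1]
    C = np.empty((n_steps + 1, n, n))
    C[0] = np.eye(n)  # the initial condition is the identity matrix
    for k in range(n_steps):
        R0, R1 = R[min(k, cutoff)], R[min(k + 1, cutoff)]
        predictor = C[k] + dt * (R0 @ C[k])
        C[k + 1] = C[k] + 0.5 * dt * (R0 @ C[k] + R1 @ predictor)
    return C

# Synthetic, purely Markovian reference data from a toy rate matrix (columns sum to zero).
Q = np.array([[-0.3, 0.1],
              [0.3, -0.1]])
dt = 0.01
w, V = np.linalg.eig(Q)
V_inv = np.linalg.inv(V)
C_ref = np.stack([(V @ np.diag(np.exp(w * n * dt)) @ V_inv).real for n in range(501)])

R = time_local_generator(C_ref[:61], dt)  # only the first 60 time steps are used
C_pred = propagate_tcl(R, dt, n_steps=500, cutoff=50)
```

For this Markovian toy system the generator is constant from the start, so any cutoff works; for MD-derived TPMs the cutoff index would instead be chosen with the RMSE analysis, and the numerical derivative would amplify statistical noise.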
E. -GME Construction.
We first formally integrate the TCL-GME in Eq. 5 to obtain
[18] |
where we have defined with the “→” subscript denoting the chronological time-ordering of the exponential, as above. We then compute the value of through direct matrix inversion
[19] |
as introduced in ref. 38. Because becomes constant at τR, the propagator also becomes a constant. Hence, we define . We compute the dynamics beyond τR according to
[20] |
As discussed in our analysis of the argonaute complex, we developed and implemented a simple averaging scheme capable of taming noise arising from statistically underconverged MD estimates of the TPM. We begin by applying the stability analysis described in RMSE Analysis to determine a valid generator cutoff time; here, we denote this cutoff by tr. We then introduce another parameter ℓ that represents the number of high-quality TPMs after tr and denote the corresponding time as tr + ℓ. This number is, of course, limited by data availability. To predict the dynamics beyond tr + ℓ, we compute the time average of on the time interval [tr, tr + ℓ] using
[21] |
Because our -GME requires at least r + ℓ data points to circumvent the instabilities imposed by noise in biomolecular systems, we generalize our definition of the generator cutoff time to be τR = tr + ℓ, representing not the generator cutoff but rather the minimum amount of data needed to accurately predict the true TPM dynamics. Ultimately, we recommend that the user perform a rigorous stability analysis with respect to the choices of r and ℓ.
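A minimal sketch of the averaging scheme is given below. It assumes that the propagator obtained "through direct matrix inversion" is the one-step map U(t) = C(t + Δt)[C(t)]⁻¹ between consecutive TPMs C(t), and that the average is an arithmetic mean over the ℓ propagators that follow index r; both are assumptions about the missing equations, and the names `averaged_propagator_dynamics`, `C`, `U`, `r`, and `ell` are hypothetical.

```python
import numpy as np

def averaged_propagator_dynamics(C, r, ell, n_steps):
    """Extend TPM dynamics with a one-step propagator averaged over ell steps after index r.

    C holds reference TPMs at t_n = n * dt, with shape (n_t, n, n) and n_t >= r + ell + 1.
    """
    if ell < 1 or len(C) < r + ell + 1 or n_steps < r + ell:
        raise ValueError("need ell >= 1, len(C) >= r + ell + 1, and n_steps >= r + ell")
    # Assumed one-step propagators: U[k] maps C[r + k] onto C[r + k + 1].
    U = np.stack([C[n + 1] @ np.linalg.inv(C[n]) for n in range(r, r + ell)])
    U_avg = U.mean(axis=0)  # time average over the window that starts at t_r
    pred = np.empty((n_steps + 1,) + C.shape[1:])
    pred[: r + ell + 1] = C[: r + ell + 1]  # reference data are reused up to tau_R
    for n in range(r + ell, n_steps):
        pred[n + 1] = U_avg @ pred[n]  # propagate with the averaged propagator
    return pred

# Noise-free Markovian toy data: C[n] = P^n for a column-stochastic matrix P.
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])
C_ref = np.stack([np.linalg.matrix_power(P, n) for n in range(11)])
C_pred = averaged_propagator_dynamics(C_ref, r=5, ell=5, n_steps=200)
```

In this noise-free example every one-step propagator equals P, so the averaged model reproduces the exact dynamics; with noisy MD data, scanning `r` and `ell` and monitoring the RMSE is the stability analysis recommended above.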
Equating Eqs. 6 and 2 for the same first time step (τ = τL = τR) gives
[22] |
where the right-hand side of Eq. 22 uses the explicit form of the propagator (see Rigorous Connection of MSM with TCL-GME for details). If this time-ordering of the exponential can be neglected, then we can identify . The practical implication of this is that, if we can replace with in the -GME, we will obtain exact agreement with the MSM parameterized by the same τ (at integer multiples of τ). The requirement for this to be true is that , which we show to be satisfied by Fig. 4C. Since and therefore are formally exact before the cutoff (by construction, they return the reference dynamics) (38), the dynamics between these MSM points are also accessible to the -GME. This explains why Fig. 4A shows that τR → τL in the limit tr → 0.
F. RMSE Analysis.
To determine values of τx ∈ {τL, τR, τK}, we find the earliest time by which the time-averaged root mean squared error (RMSE), given by
[23] |
becomes and stays sufficiently small. We identify τx to be this minimum amount of time. The RMSE quantifies the error associated with the dynamics predicted by a method (i.e., MSM, qMSM, TCL-GME, and -GME) as a function of τx by comparing it pointwise with the reference dynamics obtained from MD over the length of the trajectory, Nt, which we take to be the ‘exact’ result. Here, n corresponds to the total number of macrostates, i.e., we sum over all elements of the matrix, not just the representative elements we display. In the absence of noise, these error curves are expected to monotonically tend toward zero. In practice, however, this is not the case (SI Appendix, Fig. S2A). Therefore, the user must determine an acceptable threshold for a particular application. In our results, we choose an RMSE threshold of ∼5%, which results in graphical agreement between the reference and GME or MSM dynamics.
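The selection rule described here ("becomes and stays sufficiently small") can be sketched as follows. The normalization of the RMSE, a mean over all Nt time points and all n × n matrix elements, is an assumption about the form of Eq. 23, and the function names and toy error values are illustrative only.

```python
import numpy as np

def time_averaged_rmse(predicted, reference):
    """RMSE over all time points and all matrix elements of two TPM time series."""
    predicted, reference = np.asarray(predicted), np.asarray(reference)
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))

def earliest_converged(taus, errors, threshold):
    """Smallest tau whose error is below the threshold and stays below it afterwards."""
    below = np.asarray(errors) <= threshold
    # stays_below[k] is True when every error from index k onward is under the threshold.
    stays_below = np.logical_and.accumulate(below[::-1])[::-1]
    if not stays_below.any():
        return None
    return taus[int(np.argmax(stays_below))]

# Toy error curve that dips below 5% at tau = 20 but only stays below from tau = 40.
taus = [10, 20, 30, 40, 50]
errors = [0.20, 0.04, 0.08, 0.03, 0.01]
tau_x = earliest_converged(taus, errors, threshold=0.05)
```

For these toy numbers `tau_x` is 40 rather than 20, because the error rises above the threshold again at 30; this is the non-monotonic behavior that noisy reference data can produce.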
G. MFPT Method.
We apply the -GME to compute folding times for FiP35 WW domain. To do so, we consider a 3-state model where states one, two, and three are defined to be the misfolded, unfolded, and folded structures, respectively. To employ Meyer’s mean first passage time (MFPT) method (67, 68), we construct the time-dependent MFPT matrix, M, as
[24] |
The element M32 then corresponds to the folding time in this problem.
Practically, one solves Eq. 24 as a system of linear equations (69). To solve for the MFPT corresponding to passage to state 3, the folded state, we consider the row 3 MFPT matrix elements and obtain the following system of equations
[25] |
We recast the system in terms of matrices and obtain the final form by matrix inversion,
[26] |
As the dynamics approach equilibrium, the inverse matrix on the right-hand-side of Eq. 26 becomes constant. In practice, we define the folding time to be when M32/τ is within 5% of M32(τfinal)/τfinal for the rest of time.
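For orientation, the linear-system route to an MFPT can be sketched with standard first-step analysis, m_i = τ + Σ_k P(i → k) m_k summed over non-target states k. This is a textbook formulation that stands in for Eqs. 24 to 26, whose exact form is not reproduced here; the function name `mfpt_to_target` and the toy TPM are assumptions, and the TPM is taken to be column-stochastic as in TPM Construction.

```python
import numpy as np

def mfpt_to_target(tpm, lag_time, target):
    """MFPTs into `target` from every other state of a column-stochastic TPM."""
    n = tpm.shape[0]
    others = [s for s in range(n) if s != target]
    # P[i, k] is the probability of a transition i -> k among the non-target states.
    P = tpm.T[np.ix_(others, others)]
    # First-step analysis: (1 - P) m = lag_time * (vector of ones).
    m = np.linalg.solve(np.eye(n - 1) - P, np.full(n - 1, lag_time))
    return dict(zip(others, m))

# Toy 3-state TPM; indices 0, 1, 2 play the roles of states one, two, and three.
toy_tpm = np.array([[0.90, 0.05, 0.00],
                    [0.10, 0.90, 0.02],
                    [0.00, 0.05, 0.98]])
# MFPT from state two (index 1) into state three (index 2), the analog of M32.
folding_time = mfpt_to_target(toy_tpm, lag_time=1.0, target=2)[1]
```

Evaluating such an MFPT from TPMs predicted at increasing lag times, and tracking when the result stops changing, mirrors the convergence criterion stated above.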
Supplementary Material
Appendix 01 (PDF)
Acknowledgments
A.M.-C. acknowledges start-up funds from the University of Colorado, Boulder. X.H. acknowledges the Hirschfelder Professorship Fund. This work was supported by NSF (Grant No. CHE-2154291 to T.E.M.).
Author contributions
A.M.-C. designed research; A.J.D. performed research; A.J.D., T.S., and A.M.-C. analyzed data; S.C. provided previously published data; and A.J.D., T.S., S.C., T.E.M., X.H., and A.M.-C. wrote the paper.
Competing interests
The authors declare no competing interest.
Footnotes
This article is a PNAS Direct Submission.
Data, Materials, and Software Availability
Previously published data were used for this work (35, 45, 50).
References
- 1. Chaudhuri T. K., Paul S., Protein-misfolding diseases and chaperone-based therapeutic approaches. FEBS J. 273, 1331 (2006).
- 2. Schwantes C. R., McGibbon R. T., Pande V. S., Perspective: Markov models for long-timescale biomolecular dynamics. J. Chem. Phys. 141, 090902 (2014).
- 3. Wang W., Cao S., Zhu L., Huang X., Constructing Markov state models to elucidate the functional conformational changes of complex biomolecules. WIREs Comput. Mol. Sci. 8 (2018).
- 4. Pande V. S., Beauchamp K., Bowman G. R., Everything you wanted to know about Markov State Models but were afraid to ask. Methods 52, 99 (2010).
- 5. Husic B. E., Pande V. S., Markov state models: From an art to a science. J. Am. Chem. Soc. 140, 2386 (2018).
- 6. G. R. Bowman, V. S. Pande, F. Noé, An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation (2013), vol. 797.
- 7. Buchete N. V., Hummer G., Coarse master equations for peptide folding dynamics. J. Phys. Chem. B 112, 6057 (2008).
- 8. Malmstrom R. D., Lee C. T., Van Wart A. T., Amaro R. E., Application of molecular-dynamics based Markov state models to functional proteins. J. Chem. Theory Comput. 10, 2648 (2014).
- 9. Huang X., Bowman G. R., Bacallado S., Pande V. S., Rapid equilibrium sampling initiated from nonequilibrium data. Proc. Natl. Acad. Sci. U.S.A. 106, 19765 (2009).
- 10. Morcos F., et al., Modeling conformational ensembles of slow functional motions in pin1-WW. PLoS Comput. Biol. 6, e1001015 (2010).
- 11. Zhang B. W., et al., Simulating replica exchange: Markov state models, proposal schemes, and the infinite swapping limit. J. Phys. Chem. B 120, 8289 (2016).
- 12. Pan A. C., Roux B., Building Markov state models along pathways to determine free energies and rates of transitions. J. Chem. Phys. 129, 064107 (2008).
- 13. Doerr S., Harvey M. J., Noé F., De Fabritiis G., HTMD: High-throughput molecular dynamics for molecular discovery. J. Chem. Theory Comput. 12, 1845 (2016).
- 14. Scherer M. K., et al., PyEMMA 2: A software package for estimation, validation, and analysis of Markov models. J. Chem. Theory Comput. 11, 5525 (2015).
- 15. Harrigan M. P., et al., MSMBuilder: Statistical models for biomolecular dynamics. Biophys. J. 112, 10 (2017).
- 16. W. C. Swope, J. W. Pitera, F. Suits, Describing protein folding kinetics by molecular dynamics simulations. 1. Theory. J. Phys. Chem. B 108, 6571 (2004).
- 17. K. Röder, D. J. Wales, The energy landscape perspective: Encoding structure and function for biomolecules. Front. Mol. Biosci. 9 (2022).
- 18. Prinz J. H., et al., Markov models of molecular kinetics: Generation and validation. J. Chem. Phys. 134, 174105 (2011).
- 19. Nüske F., Keller B. G., Pérez-Hernández G., Mey A. S., Noé F., Variational approach to molecular kinetics. J. Chem. Theory Comput. 10, 1739 (2014).
- 20. Konovalov K. A., Unarta I. C., Cao S., Goonetilleke E. C., Huang X., Markov state models to study the functional dynamics of proteins in the wake of machine learning. JACS Au 1, 1330 (2021).
- 21. Lu J., Vanden-Eijnden E., Exact dynamical coarse-graining without time-scale separation. J. Chem. Phys. 141, 044109 (2014).
- 22. A. Kai-Hei Yik, Y. Qiu, I. C. Unarta, S. Cao, X. Huang, A Step-by-step Guide on How to Construct quasi-Markov State Models to Study Functional Conformational Changes of Biological Macromolecules. ChemRxiv (2022).
- 23. Brotzakis Z. F., Parrinello M., Enhanced sampling of protein conformational transitions via dynamically optimized collective variables. J. Chem. Theory Comput. 15, 1393 (2019).
- 24. Rogal J., Schneider E., Tuckerman M. E., Neural-network-based path collective variables for enhanced sampling of phase transformations. Phys. Rev. Lett. 123, 245701 (2019).
- 25. Klem H., Hocky G. M., McCullagh M., Size-and-shape space Gaussian mixture models for structural clustering of molecular dynamics trajectories. J. Chem. Theory Comput. 18, 3218 (2022).
- 26. Singhal N., Pande V. S., Error analysis and efficient sampling in Markovian state models for molecular dynamics. J. Chem. Phys. 123, 204909 (2005).
- 27. Hummer G., Szabo A., Optimal dimensionality reduction of multistate kinetic and Markov-state models. J. Phys. Chem. B 119, 9029 (2015).
- 28. Voelz V. A., Bowman G. R., Beauchamp K., Pande V. S., Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1–39). J. Am. Chem. Soc. 132, 1526 (2010).
- 29. Da L. T., et al., Bridge helix bending promotes RNA polymerase II backtracking through a critical and conserved threonine residue. Nat. Commun. 7, 11244 (2016).
- 30. Lange O. F., Grubmüller H., Collective Langevin dynamics of conformational motions in proteins. J. Chem. Phys. 124, 214903 (2006).
- 31. Ayaz C., et al., Non-Markovian modeling of protein folding. Proc. Natl. Acad. Sci. U.S.A. 118, e2023856118 (2021).
- 32. Vroylandt H., Goudenège L., Monmarché P., Pietrucci F., Rotenberg B., Likelihood-based non-Markovian models from molecular dynamics. Proc. Natl. Acad. Sci. U.S.A. 119, e2117586119 (2022).
- 33. Ayaz C., Scalfi L., Dalton B. A., Netz R. R., Generalized Langevin equation with a nonlinear potential of mean force and nonlinear memory friction from a hybrid projection scheme. Phys. Rev. E 105, 054138 (2022).
- 34. Horenko I., Hartmann C., Schütte C., Noe F., Data-based parameter estimation of generalized multidimensional Langevin processes. Phys. Rev. E 76, 016706 (2007).
- 35. Cao S., Montoya-Castillo A., Wang W., Markland T. E., Huang X., On the advantages of exploiting memory in Markov state models for biomolecular dynamics. J. Chem. Phys. 153, 014105 (2020).
- 36. I. Christy Unarta et al., Role of bacterial RNA polymerase gate opening dynamics in DNA loading and antibiotics inhibition elucidated by quasi-Markov State Model. Proc. Natl. Acad. Sci. U.S.A. 118, e2024324118 (2021).
- 37. Meister G., Argonaute proteins: Functional insights and emerging roles. Nat. Rev. Genet. 14, 447 (2013).
- 38. Sayer T., Montoya-Castillo A., Compact and complete description of non-Markovian dynamics. J. Chem. Phys. 158, 014105 (2023).
- 39. Mardt A., Pasquali L., Wu H., Noé F., VAMPnets for deep learning of molecular kinetics. Nat. Commun. 9, 5 (2018).
- 40. Chodera J. D., Singhal N., Pande V. S., Dill K. A., Swope W. C., Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J. Chem. Phys. 126, 155101 (2007).
- 41. R. Zwanzig, Nonequilibrium Statistical Mechanics (Oxford University Press, 2001).
- 42. W. Coffey, Y. P. Kalmykov, J. T. Waldron, The Langevin Equation: With Applications in Physics, Chemistry and Electrical Engineering (World Scientific, ed. 2, 2004), vol. 14.
- 43. A. Montoya-Castillo, D. R. Reichman, Approximate but accurate quantum dynamics from the Mori formalism. II. Equilibrium time correlation functions. J. Chem. Phys. 146, 084110 (2017).
- 44. A. Kelly, A. Montoya-Castillo, L. Wang, T. E. Markland, Generalized quantum master equations in and out of equilibrium: When can one win? J. Chem. Phys. 144, 184105 (2016).
- 45. Zhu L., et al., Critical role of backbone coordination in the mRNA recognition by RNA induced silencing complex. Commun. Biol. 4, 1345 (2021).
- 46. Montoya-Castillo A., Reichman D. R., Approximate but accurate quantum dynamics from the Mori formalism: I Nonequilibrium dynamics. J. Chem. Phys. 144, 084110 (2016).
- 47. Ama M. T., Mori H., Statistical-mechanical theory of the Boltzmann equation and fluctuations in μ space. Prog. Theor. Phys. 56, 1073 (1976).
- 48. S. Chaturvedi, F. Shibata, Time-convolutionless projection operator formalism for elimination of fast variables. Applications to Brownian motion. Z. Physik B 35, 297 (1979).
- 49. H. P. Breuer, F. Petruccione, The Theory of Open Quantum Systems (Oxford University Press, 2002), pp. 444–447.
- 50. Shaw D. E., et al., Atomic-level characterization of the structural dynamics of proteins. Science 330, 341 (2010).
- 51. Elkayam E., et al., The structure of human argonaute-2 in complex with miR-20a. Cell 150, 100 (2012).
- 52. M. P. Allen, D. J. Tildesley, Computer Simulation of Liquids (Oxford University Press, New York, ed. 2, 2017).
- 53. A. L. Fetter, J. D. Walecka, Quantum Theory of Many-Particle Systems (McGraw-Hill, 1971), pp. 53–56.
- 54. Liu F., et al., An experimental survey of the transition between two-state and downhill protein folding scenarios. Proc. Natl. Acad. Sci. U.S.A. 105, 2369 (2008).
- 55. J. Kappler, J. O. Daldrop, F. N. Brünig, M. D. Boehle, R. R. Netz, Memory-induced acceleration and slowdown of barrier crossing. J. Chem. Phys. 148 (2018).
- 56. S. Cao, Y. Qiu, M. Kalin, X. Huang, Integrative Generalized Master Equation: A Theory to Study Long-timescale Biomolecular Dynamics via the Integrals of Memory Kernels. ChemRxiv (2022). 10.26434/chemrxiv-2022-0n9ld.
- 57. Naritomi Y., Fuchigami S., Slow dynamics of a protein backbone in molecular dynamics simulation revealed by time-structure based independent component analysis. J. Chem. Phys. 139, 215102 (2013).
- 58. Wang W., Liang T., Sheong F. K., Fan X., Huang X., An efficient Bayesian kinetic lumping algorithm to identify metastable conformational states via Gibbs sampling. J. Chem. Phys. 149, 072337 (2018).
- 59. Hochbaum D. S., Shmoys D. B., A best possible heuristic for the k-center problem. Math. Operat. Res. 10, 180 (1985).
- 60. Peng J. H., Wang W., Yu Y. Q., Gu H. L., Huang X., Clustering algorithms to analyze molecular dynamics simulation trajectories for complex chemical and biological systems. Chinese J. Chem. Phys. 31, 404 (2018).
- 61. Deuflhard P., Weber M., Robust Perron cluster analysis in conformation dynamics. Linear Algebra Appl. 398, 161 (2005).
- 62. Röblitz S., Weber M., Fuzzy spectral clustering by PCCA+: Application to Markov state models and data classification. Adv. Data Anal. Class. 7, 147 (2013).
- 63. Schwantes C. R., Pande V. S., Improvements in Markov State Model construction reveal many non-native interactions in the folding of NTL9. J. Chem. Theory Comput. 9, 2000 (2013).
- 64. Litzinger F., et al., Rapid calculation of molecular kinetics using compressed sensing. J. Chem. Theory Comput. 14, 2771 (2018).
- 65. Liu S., Zhu L., Sheong F. K., Wang W., Huang X., Adaptive partitioning by local density-peaks: An efficient density-based clustering algorithm for analyzing molecular dynamics trajectories. J. Comput. Chem. 38, 152 (2017).
- 66. Pfalzgraff W. C., Montoya-Castillo A., Kelly A., Markland T. E., Efficient construction of generalized master equation memory kernels for multi-state systems from nonadiabatic quantum-classical dynamics. J. Chem. Phys. 150, 244109 (2019).
- 67. C. D. Meyer, An alternative expression for the mean first passage matrix. Linear Algebra Appl. 22, 41–47 (1978).
- 68. Kells A., Mihálka Z., Annibale A., Rosta E., Mean first passage times in variational coarse graining using Markov state models. J. Chem. Phys. 150, 134107 (2019).
- 69. Jensen C. H., Nerukh D., Glen R. C., Calculating mean first passage times from Markov models of proteins. AIP Conf. Proc. 940, 150 (2007).