Tutorial on how to build non-Markovian dynamic models from molecular dynamics simulations for studying protein conformational changes

Yue Wu; Siqin Cao; Yunrui Qiu; Xuhui Huang

doi:10.1063/5.0189429

. 2024 Mar 22;160(12):121501. doi: 10.1063/5.0189429

Tutorial on how to build non-Markovian dynamic models from molecular dynamics simulations for studying protein conformational changes

Yue Wu ¹, Siqin Cao ¹, Yunrui Qiu ¹, Xuhui Huang ^1,2,^1,2,^a)

PMCID: PMC10964226 PMID: 38516972

Abstract

Protein conformational changes play crucial roles in their biological functions. In recent years, the Markov State Model (MSM) constructed from extensive Molecular Dynamics (MD) simulations has emerged as a powerful tool for modeling complex protein conformational changes. In MSMs, dynamics are modeled as a sequence of Markovian transitions among metastable conformational states at discrete time intervals (called lag time). A major challenge for MSMs is that the lag time must be long enough to allow transitions among states to become memoryless (or Markovian). However, this lag time is constrained by the length of individual MD simulations available to track these transitions. To address this challenge, we have recently developed Generalized Master Equation (GME)-based approaches, encoding non-Markovian dynamics using a time-dependent memory kernel. In this Tutorial, we introduce the theory behind two recently developed GME-based non-Markovian dynamic models: the quasi-Markov State Model (qMSM) and the Integrative Generalized Master Equation (IGME). We subsequently outline the procedures for constructing these models and provide a step-by-step tutorial on applying qMSM and IGME to study two peptide systems: alanine dipeptide and villin headpiece. This Tutorial is available at https://github.com/xuhuihuang/GME_tutorials. The protocols detailed in this Tutorial aim to be accessible for non-experts interested in studying the biomolecular dynamics using these non-Markovian dynamic models.

I. INTRODUCTION

Protein functions heavily rely on dynamic transitions between conformational states or conformational changes.^1,2 For example, RNA polymerase II needs to repeatedly oscillate between the pre-translocation state and post-translocation state to translocate along the double-stranded DNA during transcription elongation.^3,4 Therefore, studying conformational changes is crucial for understanding the molecular mechanisms underlying many biological processes⁴ and facilitating drug designs targeting these conformational changes.^5,6

Molecular dynamics (MD) simulations have emerged as a valuable tool that can complement experimental approaches by providing atomistic details of protein dynamics. However, the timescales of conformational changes for complex biological molecules often exceed the feasible simulation length. For example, the opening of the DNA loading gate of RNA polymerase occurs at millisecond timescales, while it still remains challenging for all-atom MD simulations of RNA polymerase (the simulation box contains around half a million atoms) to reach a millisecond.⁷ The Markov State Model (MSM)^8–19 has become a popular approach to bridge this timescale gap by modeling the long timescale dynamics based on many short MD simulations.

In MSMs, the high-dimensional conformational space is partitioned into a set of discrete and coarse-grained metastable states. Simultaneously, through the coarse-graining of time (using discrete time intervals or lag time), transitions among these states can be modeled as Markovian jumps. The transition probabilities after a lag time between pairs of states can then be estimated from short MD simulations. As a result, MSMs can predict long timescale dynamics based on a large ensemble of short MD simulations. In recent years, MSMs have been widely applied to study protein folding,^18,20–24 protein–ligand binding,^25–28 and functional conformational changes of biomolecules.^17,29–46

One critical condition for MSMs to have predictive power is that the lag time must be long enough to allow transitions among states to become memoryless (or Markovian), and the memory of these transitions is mainly determined by dynamic relaxations within each metastable state. This imposes a major challenge for MSM studies of protein dynamics as the lag time is bound by the length of individual MD simulations available to estimate transition probabilities. To achieve Markovian transitions, one often needs to construct MSMs containing a large number of states so that each state is sufficiently small and has relatively fast relaxation dynamics to allow affordable lag times. For example, Voelz et al. showed that they need an MSM containing 2000 states (with a lag time of 12 ns) to model the millisecond folding of the NTL9 peptide.⁴⁷ Our previous work suggested that 10 000 states are needed for an MSM to be Markovian (with a lag time of 5 ns) for a 37-residue intrinsically disordered peptide.²⁰ More recently, our work on the RNA polymerase II backtracking also showed that MSMs consisting of 800 states are needed to reach Markovian.³² MSMs containing hundreds of states are useful to make quantitative predictions to be tested by experiments but often hinder the comprehension of biological mechanisms. In recent years, various methods, such as non-Markovian dynamic approaches⁴⁸ based on the Generalized Master Equation (GME)^49–51 or Generalized Langevin Equation (GLE),^52–54 and Hidden Markov State Models,⁵⁵ have been developed to address this challenge.

The GME has particularly emerged as a promising approach to address the aforementioned challenge of studying protein dynamics.⁴⁸ The GME method (also called quasi-MSM or qMSM) is a non-Markovian dynamic model and explicitly considers the memory kernel and propagates dynamics using a discretized GME.⁴⁹ The qMSM method is shown to greatly improve upon MSMs by accurately predicting long-timescale dynamics while being built from significantly shorter MD simulations.^7,49,56 The Transfer Tensor Method (TTM) provides an analogous approach to qMSM that utilizes a discretization of the integrated GME using the time-dependent transfer tensor.⁵⁷ In addition, Presse et al. introduced the non-Markov memory kernel (NMMK) method, in which they applied the maximum entropy principle to obtain memory kernels from experimental data.⁵⁸

The GME represents a new and promising approach to studying biomolecular dynamics. However, a major challenge persists in the robustness of the computed time-dependent memory kernels when applying GME to investigate complex conformational changes.⁴⁸ This challenge arises because memory kernels are estimated based on probabilities of transitions among states at a series of time points, and the fluctuations encountered when extracting these time-dependent transition probabilities from MD trajectories could induce numerical instability in complex systems. To address this issue, we have recently developed the Integrative GME (IGME) method.⁵⁰ In this method, we first derived an analytical solution for the GME under the condition that the memory kernels have fully decayed. Subsequently, we determine the hyper-parameters in this analytical solution by fitting them to MD simulations. As IGME deals with the condition that the memory kernels have already decayed, it employs only the time integrations of memory kernels, thereby avoiding the numerical instability associated with the explicit computation of time-dependent memory kernels in qMSM. When applied to the study of peptide dynamics, the IGME models demonstrate significantly reduced fluctuations in both memory kernels and predicted long-term dynamics compared to qMSM.⁴⁹ In addition to IGME, the time-convolutionless GME (TCL-GME) provides another noise-resilient GME-based approach, where the non-Markovian time evolutions of dynamics are modeled by a time-dependent rate matrix.⁵¹

In this Tutorial, we present a step-by-step guide on how to build qMSM and IGME models to study protein dynamics. This Tutorial is accompanied by our most recent implementation of these two GME-based methods, available on GitHub (https://github.com/xuhuihuang/GME_tutorials). We hope this provides a detailed tutorial for researchers in the computational chemistry and biophysics community to learn how to build GME methods for studying conformational dynamics of proteins and other macromolecules. Our paper is organized as follows: we first introduce the GME (for the qMSM method) and IGME theories, followed by outlining detailed and step-by-step protocols for building these models from MD simulations. Finally, we will present two detailed examples (alanine dipeptide and villin headpiece) along with the associated Python code (presented as Schemes) to demonstrate how to build qMSM and IGME models from MD simulations.

II. THEORIES OF NON-MARKOVIAN DYNAMIC MODELS FOR PROTEIN DYNAMICS

A. Liouville equation and dynamic operators

The dynamics in the phase space follows the Liouville equation,

\partial ρ (t, Γ) / \partial t = L ρ (t, Γ),

(1)

where the Liouville operator $L$ encapsulates all pertinent information of the dynamic system and $ρ (t, Γ)$ represents the probability distribution function across the entire phase space Γ at time t.

Based on the Liouville equation, the evolution of the probability density after the time interval τ follows $ρ (t + τ, Γ) = e^{L τ} ρ (t, Γ)$ , regardless of the starting time t or any traveling history. This property is called Markovian, or memoryless. When studying protein dynamics in reversible systems, several dynamic operators have been used to describe the propagation of the distribution function, as listed in the following:

•
Propagator $P (τ)$ : $ρ (t + τ, Γ) = P (τ) ρ (t, Γ)$ .
•
Transfer operator $T (τ)$ :⁹ $ρ (t + τ, Γ) / ρ_{e q} = T (τ) (ρ (t, Γ) / ρ_{e q})$ , where ρ_eq represents the equilibrium probability distribution function. Transfer operator can only be applied in reversible systems.

B. Markov state model (MSM) theory

In MSMs, the configurational space is partitioned into n states, represented as ${\{X_{i}\}}_{i = 1}^{n}$ , and simultaneously, the continuous time is coarse-grained into discrete time intervals (τ, called lag times). The system’s dynamics is then modeled as Markovian transitions among these states. Consequently, the probability density function $ρ (t)$ can be expressed as a vector containing n elements: $p (t) = {[\begin{matrix} p_{1} (t) & \dots & p_{n} (t) \end{matrix}]}^{tr}$ , where $p_{i} (t)$ represents the population in state X_i at time t and tr denotes the transpose operation. The Markovian property requires that the propagation of p(t) after the time interval τ can be seen as under the operation of the transition probability independent of t: $p_{j} (t + τ) = \sum_{i = 1}^{n} p_{i} (t) P (x (t + τ) \in X_{j} | x (t) \in X_{i})$ . Here, x represents the configuration and the transition probability is described as the conditional probability of jumping to state j after a lag time of τ, given the initial state i. By defining the transition probability matrix (TPM), $T (τ)$ , as $T_{i j} = P (x (t + τ) \in X_{j} | x (t) \in X_{i})$ , the time propagation of p(t) can be rewritten as

p {(t + τ)}^{tr} = p {(t)}^{tr} T (τ) .

(2)

We can then define the equilibrium density $π = {[π_{1} \dots π_{n}]}^{tr}$ , where $π_{i} = \int_{x \in X_{i}} ρ_{e q} (x) d x$ . Here, ρ^eq is the equilibrium density in the continuous configurational space and π_i is the stationary population of state X_i. For equilibrium dynamics, the detailed balance condition imposes the following relationship: π_iT_ij = π_jT_ji. If we further define $u (t) = {[p_{1} (t) / π_{1} \dots p_{n} (t) / π_{n}]}^{tr}$ , which is a vector containing state populations at time t normalized by their stationary populations. Under the detailed balance condition, we can obtain⁹

u (t + τ) = T (τ) u (t),

(3)

where $T (τ)$ represents an approximation of the transfer operator $T (τ)$ at reduced dimensions (i.e., transitions among n discrete states). For the row-normalized TPM, the leading left eigenvectors provide information on the population flux of the slowest dynamic processes, which correspond to transitions between the metastable regions of the conformational space. The timescales of these dynamic processes are related to the eigenvalues of the TPM, as represented by their implied timescales (ITS),⁵⁹

IT S_{i} (τ) = - \frac{τ}{\ln λ_{i} (τ)},

(4)

where i = 1, 2, 3…. When the reduced dynamics in the state space with the lag time of τ are Markovian, we can use the first-order master equation to propagate the dynamics,

T (n τ) = T {(τ)}^{n},

(5)

where $π^{tr} = π^{tr} T (τ)$ and $1 = T (τ) 1$ . The first eigenvalue of $T (τ)$ is always equal to 1, and its corresponding eigenvector corresponds to the equilibrium state populations. The lag time (τ) must be long enough to allow the dynamics in the reduced state space to become Markovian or memoryless; otherwise, the memory of these transitions needs to be considered.

C. Generalized master equation (GME) theory

A major challenge in constructing MSMs for protein dynamics is building a Markovian model of state dynamics. To achieve a Markovian model, the lag time must be long enough to allow full relaxation of the memory effects in dynamic transitions among states. As the lag time is bound by the length of individual short MD simulations, a large number of states are often needed for MSMs to achieve a Markovian behavior. However, MSMs containing hundreds or even thousands of states can impede the comprehension of biological mechanisms.

Recently developed non-Markovian dynamic models aim to address the aforementioned challenge in the MSM. These models go beyond the Markovian assumption for interstate transitions and utilize the GME framework to explicitly account for the memory of protein dynamics. The GME is derived from the Liouville equation Eq. (1). In Liouville’s equation, the dynamics in the high-dimensional phase space Γ is Markovian. In reduced dimensionality [e.g., the collective variable (CV) space discussed in Sec. III C or the state space discussed in Secs. III D and III E], the dynamics are projections of the phase-space dynamics given by Eq. (1) onto these reduced dimensions. For a state model, the projection from phase space to the state space satisfies the following generalized Hummer–Szabo projection operator:

P = \sum_{i} |χ_{i} (x) ρ_{e q} (x)⟩ π_{i}^{- 1} ⟨χ_{i} (x)| .

(6)

Here, χ is an indicator function: $χ_{i} (x) = 1$ when x ∈ X_i, and $χ_{i} (x) = 0$ , otherwise. In addition, $ρ_{e q} (x)$ represents the equilibrium probability density and π_i represents the equilibrium population of the state i.

Upon the projection, the dynamics follow the Nakajima–Zwanzig equation,

\frac{\partial}{\partial t} P ρ (t) = P L P ρ (t) + P L e^{Q L t} Q ρ (0) + \int_{0}^{t} P L e^{Q L (t - s)} Q L P ρ (s) d s

(7)

with $Q = I - P$ (I is an identity matrix). In Eq. (7), the second term on the right-hand-side vanishes when $ρ (0)$ is initiated from an equilibrium distribution. Thus, the Nakajima–Zwanzig equation can be rewritten as the following Generalized Master Equation (GME):^49,50

\dot{T} (t) = T (t) \dot{T} (0) - \int_{0}^{t} T (t - τ) K (τ) d τ .

(8)

Here, $T (t)$ is the row-normalized transition probability matrix (TPM) with lag time t, following the same convention in Sec. II B. Each element of the TPM [ $T_{i j} (t) = ⟨ χ_{j} (x) |e^{L t}| χ_{i} (x) ρ_{e q} (x) π_{i}^{- 1}$ ] corresponds to the conditional probability for the system visit state X_j after a lag time of t, given its initial state at X_i. In addition, $K (t)$ with each element $K_{i j} (t) = - ⟨ χ_{j} (x) |L e^{Q L t} Q L| χ_{i} (x) ρ_{e q} (x) π_{i}^{- 1}$ is referred to as the memory kernel matrix.

An important feature of the memory kernel is the timescale τ_K, namely, the memory kernel relaxation time, when the memory kernel $K (τ)$ relaxes, $K (t \geq τ_{K}) \approx 0$ . In biomolecular systems, in which the separation of timescales often occurs, we have demonstrated that memory kernel relaxation time (τ_K) is often significantly shorter than the Markovian lag time (τ_M).⁴⁹ Under this condition, the GME can be rewritten as

\dot{T} (t) = T (t) \dot{T} (0) - \int_{0}^{\min [τ_{K}, t]} T (t - s) K (s) d s,

(9)

where the convolution term of the right-hand side only needs to be computed up to τ_K when predicting dynamics at a long lag time (t ≥ τ_K). This provides us with the opportunity to build GMEs with short MD trajectories. In the GME-based method, we can use the short MD trajectories to compute the time-dependent TPMs, $T (t)$ . These short-time $T (t)$ can be employed to compute the memory kernels $K (s)$ with Eq. (9). Subsequently, the long-time dynamics $T (t)$ at any lag time longer than τ_K can be computed from the GME, as given in Eq. (9). In Sec. III F, we demonstrate a brute-force method utilizing Eq. (9) to construct GME, namely, the quasi-MSM (qMSM). The qMSM is theoretically rigorous, but it also involves the computation of the time-dependent memory kernel tensor $K (s)$ , which is challenging due to the numerical instability induced by the fluctuations of MD simulations, especially for complex systems. In Sec. II D below, we will introduce the Integrative-GME (or IGME) to improve the numerical instability of the qMSM by analytically solving the GME at t ≥ τ_K.

D. Integrative GME (IGME) theory

To enhance numerical stability, the IGME theory⁵⁰ adopts the time integration of memory kernels $M_{n} (t)$ instead of the memory kernel tensor $K (t)$ . When t ≥ τ_K, the GME can be rewritten as an ordinary differential equation after applying the Taylor series of $T (t - s)$ in the convolution term of Eq. (7),

\begin{gathered} T {(t)}^{- 1} \frac{d}{d t} T (t) = \dot{T} (0) - M_{0} - \sum_{n = 1}^{\infty} [\frac{{(- 1)}^{n}}{n!} T {(t)}^{- 1} \frac{d^{n}}{d t^{n}} T (t)] M_{n}, \\ M_{n} = \int_{0}^{τ_{K}} K (s) s^{n} d s . \end{gathered}

(10)

Here, M_n are the time integrals of memory kernels at order n. At the zeroth order, M₀ corresponds to the time-integrated memory kernel matrix. Equation (10) contains M_n instead of the memory kernel tensor $K (s)$ , which avoids the numerical computation of $K (s)$ . In the IGME theory, we have obtained the analytical solution to the above-mentioned ordinary differential equation for $T (t)$ as

\begin{gathered} T (t \geq τ_{K}) = A {\hat{T}}^{t}, \\ \ln \hat{T} = \dot{T} (0) - M_{0} - \sum_{n = 1} \frac{{(- 1)}^{n}}{n!} {(\ln \hat{T})}^{n} M_{n} . \end{gathered}

(11)

Here, $\hat{T}$ and A are two constant matrices. $\hat{T}$ describes the dynamics at an infinitely long lag time, which can be used to estimate the timescales of the slowest dynamical modes and transition rates between pairs of states. $\hat{T}$ is obtained by fitting $A {\hat{T}}^{t}$ with short-time MD simulation trajectories $T (t)$ (see Sec. III F and the Appendix for details).

The second equation of Eq. (11) can also be utilized to compute the time-integrated memory kernel matrix M₀. According to Eq. (11), the rigorous computation of M₀ involves M_n at higher orders. However, for biomolecular systems where there are separations of timescales [i.e., $- t / \ln λ_{i} (t) ≫ Δ t$ , $λ (t)$ are eigenvalues of $T (t)$ , and Δt is the saving interval of the input $T (t)$ in IGME⁵⁰], we have shown that

M_{0} \approx \dot{T} (0) - \ln \hat{T} .

(12)

Therefore, M₀ can be obtained from the IGME.

III. PROTOCOL FOR CONSTRUCTING GME MODELS TO STUDY PROTEIN DYNAMICS

In this section, we will introduce a detailed protocol (Fig. 1) to construct qMSM and IGME models to study protein dynamics based on MD simulation trajectories.

A. Feature selection

The input to our protocol is an MD simulation dataset containing an ensemble of MD trajectories that sample protein conformational changes of interest [Fig. 1(a)]. An MD trajectory consists of the time evolution of MD snapshots, each containing the positions of all the atoms in the simulation box. However, these Cartesian coordinates are often unsuitable for the analysis of protein conformational changes because they are typically of high dimensions (e.g., a typical protein system contains tens of thousands of Cartesian coordinates). In contrast, internal coordinates (e.g., inter-residue distances or backbone torsional angles) offer a better description of protein conformational changes and are often used as input features to construct MSM or GME models.

Many biologically relevant conformational changes are localized, involving structural transitions in only a subset of the system, such as translocation of a motor protein on dsDNA⁶⁰ and the gate opening of RNA polymerase.⁷ Even for global dynamic processes such as protein folding, one could identify a subset of structural features sufficient to describe the slowest dynamics of these conformational changes. Therefore, the first step in our protocol is to select a subset of structural features that can describe the slowest dynamics of the system [Fig. 1(b)]. In this section, we will discuss several algorithms for automatic feature selection, including Accelerated Sequential Incoherent Selection (oASIS),⁶¹ spectral oASIS,⁶² Force Distribution Analysis (FDA),⁶³ and molecular systems automated identification of correlation (MoSAIC).⁶⁴

Both oASIS⁶¹ and spectral oASIS⁶² aim to find a subset of features to reconstruct the original feature space based on the Nyström method.⁶⁵ The primary focus of these two methods is to minimize the errors of the diagonal elements between the original correlation matrix C = X^trX (X_ij is the value of feature j at the simulation frame i) and reconstructed correlation matrix, $\tilde{C} = C_{k} W_{k}^{†} C_{k}^{tr}$ (C_k is subset columns of C, and W_k is the correlation matrix of the subset features), reconstructed based on the Nyström method. The oASIS method employs a selection strategy that targets the column indexed with the highest diagonal error, which is less effective when selecting more than one column in each iteration. The spectral oASIS, a modified version of oASIS, maintains the effectiveness of batch selection by considering both reconstruction errors and eigenvector differences. The FDA method involves the analysis of pairwise forces and the mechanical strain distribution alongside the MD simulations. This method focuses on residue pairs that exhibit significant force variations in the simulations. Finally, MoSAIC⁶⁴ is a correlation-based feature selection method based on the physical insights that crucial dynamic processes involve many features changing in a concerted manner. MoSAIC makes use of the Leiden community detection algorithm⁶⁶ to block-diagonalize the correlation matrix, thereby creating coherent and distinct feature clusters. These clusters are subsequently ranked according to their size. The larger clusters are presumed to be linked to significant dynamic processes, while the smaller clusters are eliminated as noise during the feature selection process. In this Tutorial, we choose the spectral oASIS⁶² as the feature selection method.

B. Dimensionality reduction

The feature selection in the previous step provides a representative subset of internal coordinates, but their number is large and often at the order of hundreds to thousands of features. In our protocol, we will further perform a dimensionality reduction based on these features to identify several collective variables (CVs) that describe the slowest dynamics of the system [Fig. 1(c)]. In this step, various dimensionality reduction algorithms can be utilized, e.g., Principal Component Analysis (PCA),⁶⁷ time-lagged Independent Components Analysis (tICA),^68–70 Variational Approach for Markovian Process (VAMP) based neural networks (VAMPnets)⁷¹ or its combination with graph neural networks (GraphVAMPnets),^43,72 and State-free Reversible VAMPnets (SRVs).⁷³ In this Tutorial, both tICA and SRVs will be applied.

PCA⁶⁷ can find a small number of principal components by maximizing the variance across the spatial scale of the principal components. Alternatively, tICA^68–70 can find a set of collective variables representing the slowest dynamics of biomolecules by maximizing the time-lagged autocorrelation of the transformed components. In tICA, the self-correlation matrix C₀₀ and time-lagged autocorrelation matrix C₀₁ are constructed from the high-dimensional feature space $x (t) = {[x_{1} (t), \dots, x_{d} (t)]}^{tr}$ , where $x_{1} (t), \dots x_{d} (t)$ are d features at simulation time t,

\begin{gathered} C_{00} = E_{t} [x (t) x {(t)}^{tr}], \\ C_{01} = E_{t} [x (t) x {(t + τ)}^{tr}] . \end{gathered}

(13)

Here τ is the chosen lag time and the d features can be picked by intuition or generated using feature selection methods (e.g., spectral oASIS⁶²). The CVs corresponding to the slowest dynamic modes can then be obtained by solving the generalized eigenvalue problem,

C_{01} U = C_{00} U Λ .

(14)

The slowest CVs, or called time-lagged Independent Components (tICs), can then be constructed from dimension reduction of the input features that utilize a sub-matrix of U consisting of the top columns corresponding to the largest values in the diagonal matrix Λ. Both PCA and tICA can generate uncorrelated collective variables by using linear combinations of input features. PCA tends to assign importance to large-amplitude motions, even if they are typically irrelevant to the actual function of a protein. A similar challenge can be encountered when using tICA, especially when the coordinates involve slow, but unimportant, motions. For example, when tICA is applied to dihedral angles (ϕ, ψ) to investigate the folding of a helical protein, HP35, it identifies transitions between right- and left-handed helices as the slowest processes.^64,74,75

The VAMPnets⁷¹ was developed based on the variational approach of the Markov processes (VAMP) theorem. VAMPnets allows for the independent training of two parallel encoders (E₀ and E₁) to find the low-dimensional latent space representation of the collective variables, $y_{0} (t) = E_{0} (x (t))$ and $y_{1} (t + τ) = E_{1} (x (t + τ))$ . Minimizing the loss function in VAMPNets is equivalent to maximizing the VAMP-2 score, which is defined as

\begin{gathered} C_{00} = E_{t} [y_{0} (t) y_{0} {(t)}^{tr}], \\ C_{01} = E_{t} [y_{0} (t) y_{1} {(t + τ)}^{tr}], \\ C_{11} = E_{t} [y_{1} (t + τ) y_{1} {(t + τ)}^{tr}], \\ R_{VAMP 2} = {‖C_{00}^{- 1 / 2} C_{01} C_{11}^{- 1 / 2}‖}_{F}^{2}, \end{gathered}

(15)

where the subscript F represents the Frobenius norm. The VAMPnets can work with the input of molecular coordinates and, finally, yield a state model.⁷¹ State-free Reversible VAMPnets⁷⁶ (SRVs) method is largely similar to VAMPnets, while the major difference lies in the reversible-dynamics assumption of the SRVs. Unlike VAMPnets that adopts two independent neural networks to train $y_{0} (t)$ and $y_{1} (t + τ)$ , the SRVs method uses the shared neural network for $y (t) = E (x (t))$ and $y (t + τ) = E (x (t + τ))$ . In the training, the SRVs method utilizes a slightly different loss function (VAMP-2-like loss function),

\begin{gathered} C_{01} s_{i} = {\tilde{λ}}_{i} C_{00} s_{i}, \\ L = - \sum_{i} {({\tilde{λ}}_{i})}^{2}, \end{gathered}

(16)

where C₀₀ and C₀₁ are computed in the same way as Eq. (15) and ${\tilde{λ}}_{i}$ are the generalized eigenvalues. The SRVs can achieve a higher success rate for training in numerical experiments.⁷⁶ In addition, the SRVs only yields collective variables, unlike the VAMPnets that yields few-state kinetic models. It has been demonstrated that both VAMPnets and SRVs can generate models that reach Markovian behavior at shorter lag times compared to previous methods such as tICA. For example, in the study of folding kinetics of the Trp-cage mini protein, VAMPnets and SRVs-MSM exhibit faster convergence of implied timescales at shorter lag times compared to tICA-MSM.⁷³

In addition to the above-mentioned methods, many other algorithms are also available for dimension reduction, such as kernel-tICA,⁷⁷ deep-tICA,⁷⁸ time-lagged autoencoder,⁷⁹ variational dynamics encoder,⁸⁰ past–future information bottleneck,⁸¹ state predictive information bottleneck,⁸² transition manifold methods,⁸³ reaction coordinate flow,⁸⁴ and relaxation mode analysis⁸⁵.

C. Geometric clustering to generate microstates

Geometric clustering [Fig. 1(d)] involves partitioning the reduced-dimensional conformational space spanned by the CVs into a large number of discrete clusters (called microstates). The most widely used clustering methods in the MSM/GME construction are the centroid-based algorithms,⁸⁶ such as K-Means,⁸⁷ K-Centers,^88,89 and K-Medoids.⁹⁰ In K-Means clustering, the primary objective is to minimize the sum of squared distances between data points and the centroid of the cluster that the data point belongs to, which is calculated as the mean of the data points assigned to that cluster. The K-Centers clustering aims to minimize the maximum distance or dissimilarity between a data point and the nearest cluster center. The cluster centers are evenly distributed. The K-Medoids clustering also seeks to minimize the sum of distances like K-Means, but it uses actual data points (medoids) as representatives of the clusters. In addition to the centroid-based algorithms, other clustering methods are also available, such as automatic state partitioning for multibody systems (APM),⁹¹ shape-Gaussian mixture models (shape-GMM),⁹² Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and adaptive partitioning by local density-peaks (APLoD).⁹³

D. Kinetic lumping to produce metastable macrostates

In this step, we will further lump microstates that can interconvert quickly into a small set of metastable microstates [Fig. 1(e)]. The most widely used kinetic lumping methods^94–98 are the Perron-Cluster Cluster Analysis (PCCA)⁹⁴ and Robust Perron Cluster Analysis (PCCA+).^95,96 The PCCA uses the structure of the eigenvectors of microstate TPMs to find the metastable states. In PCCA, the sign pattern of the eigenvectors corresponding to the largest eigenvalues of TPM is used to find the metastable states that can form a nearly uncoupled Markov chain. Unlike the PCCA that uses a crisp assignment of state boundary, the PCCA+⁹⁶ utilizes almost invariant sets by almost characteristic functions $\tilde{χ}$ , which is a soft assignment for the microstates. In addition, $\tilde{χ}$ is obtained by maximizing the metastability of microstate TPM, $trace (\tilde{W}) = \sum_{i} {\tilde{w}}_{i i}$ , where $\tilde{W} = {(diag [π])}^{- 2} ⟨\tilde{χ} |T^{micro}| {\tilde{χ}}_{π}⟩$ (π are the stationary populations and T^micro is the microstate TPM). In practice, the PCCA+ employs perturbation of characteristic functions to optimize the invariant sets $\tilde{χ}$ . In our Tutorial, we utilize the PCCA+ algorithm for kinetic lumping.

E. Constructing Markov state models

With a set of micro- and macro-states, we can then estimate transition probabilities between pairs of states after lag time τ from MD simulations,

T_{i j} (τ) = p [x (t + τ) \in j | x (t) \in i] = \frac{C_{i j} (τ)}{\sum_{j} C_{i j} (τ)},

(17)

where C is the transition count matrix (TCM) and $C_{i j} (τ)$ corresponds to the number of transitions that begin from state i and end at state j after lag time τ. For equilibrium sampling, the TCM should be theoretically symmetric to satisfy the detailed balance, $C_{i j} (τ) = C_{j i} (τ)$ . However, for realistic applications, one often needs to symmetrize the TCM using the following equation:

C^{sym} (τ) = \frac{C (τ) + C {(τ)}^{T}}{2} .

(18)

Alternatively, when there are large differences between $C_{i j} (τ)$ and $C_{j i} (τ)$ , the maximum likelihood estimator (MLE)⁹⁹ can be employed to enforce the detailed balance condition using the following likelihood function:⁹

p (T | C^{obs}) \propto \prod_{i, j = 1}^{n} T_{i j}^{C_{i j}^{prior} + C_{i j}^{obs}} = \prod_{i, j = 1}^{n} T_{i j}^{C_{i j}} .

(19)

The resulting MLE algorithm can be written as^9,99

π_{i} = \sum_{j} \frac{c_{i j} + c_{j i}}{\frac{N_{i}}{π_{i}} + \frac{N_{j}}{π_{j}}}, T_{i j} = \frac{(c_{i j} + c_{j i}) π_{j}}{N_{j} π_{i} + N_{i} π_{j}},

(20)

where N_i represents the total number of transition counts starting from state i. In an MSM, the eigenvectors and eigenvalues of TPMs [ $T (τ)$ ] represent the slowest dynamic modes. Starting from the second eigenmode, the $(i + 1)$ th eigenvector corresponds to the ith slowest dynamic mode, while the corresponding eigenvalue λ_i+1 describes the fraction of molecules that have not undergone the transition after lag time τ. The first eigenvalue is always 1, and the first eigenvector corresponds to the stationary populations of states.

In various steps of the MSM construction [Figs. 1(c) and 1(d)], we need to determine several hyperparameters, including the optimal number of CVs, lag time in tICA or SRVs, and number of microstates. To choose the optimal values for these hyperparameters, we recommend using the generalized matrix Rayleigh quotient (GMRQ)¹⁰⁰ in our protocol. The GMRQ employed cross-validation to avoid possible overfitting induced by the violation of variational bounds due to statistical uncertainty.¹⁰⁰ In GMRQ, the whole dataset is divided into a training set and a test set, and the Rayleigh quotient is computed from the eigenvectors and correlation matrices of these two parts,

R = Tr (V^{tr} C V {(V^{tr} S V)}^{- 1}),

(21)

where V is the right eigenvector of the TPM computed from the training set data and S and C are the diagonal matrix of stationary population and TCM computed from the test set data, respectively. In practice, the best model should have the highest GMRQ score.

F. Constructing qMSM and IGME models

Using TPMs and their time derivatives at different lag times as input, one could also construct qMSM, a non-Markovian GME-based dynamic model [Fig. 1(f)]. Given a series of short-time $T (t)$ , qMSM employs a brute force approach to numerically compute the memory kernel tensor $K (t)$ and predict the long-time dynamics based on Eq. (9).

In qMSM, a discrete-time GME at t = nΔt is employed as follows:

\dot{T} (n Δ t) = T (n Δ t) \dot{T} (0) - Δ t \sum_{m = 1}^{\min [n, τ_{K} / Δ t]} T ((n - m) Δ t) K (m Δ t),

(22)

where τ_K is the memory relaxation time when the memory kernel $K (τ)$ decays to zero, $K (t \geq τ_{K}) = 0$ . Therefore, a straightforward method to compute the memory kernel can be derived from the above-mentioned time-discrete GME,

\begin{align} K (n Δ t) = - \frac{\dot{T} (n Δ t) - T (n Δ t) \dot{T} (0)}{Δ t} + \sum_{m = 1}^{n} T ((n - m) Δ t) \\ \times K (m Δ t) (n Δ t \leq τ_{K}) . \end{align}

(23)

To find τ_K, the qMSM employs the mean integral kernel (MIK) to visualize the relaxation of memory kernel tensor,

M I K (t) = \frac{1}{N} \sqrt{\sum_{i, j = 1}^{N} {(\int_{0}^{t} K_{i j} (τ) d τ)}^{2}} .

(24)

When $K (t)$ fully relaxes, the MIK will become independent of time. Therefore, the MIK can act as an indicator for τ_K in qMSM. Finally, the long-time dynamics can be predicted from Eq. (22) with the memory kernels computed from Eq. (23) and τ_K obtained from Eq. (24). In the applications of qMSM, we observe that the fluctuations encountered when obtaining $T (t)$ and $\dot{T} (t)$ from MD trajectories can induce numerical instability in $K (t)$ , especially for complex systems. To address this challenge, we have recently developed the IGME method.⁵⁰

The IGME method utilizes $T (t)$ at t ≥ τ_K to compute the two matrices A and $\hat{T}$ in Eq. (11). In this Tutorial, we introduce a least-squares fitting (LSF) method to compute A and $\hat{T}$ with the row-sum restriction of $T (t)$ , $\sum_{j} T_{i j}^{MD} (t) \equiv 1$ . The least-squares fitting method in this Tutorial adopts the following Lagrangian (see the Appendix for details):

L = \frac{1}{2} \sum_{t} {|\ln T^{MD} (t) - \ln A - t \ln \hat{T}|}_{F}^{2} + \sum_{i} (γ_{i} \sum_{j} {[\ln \hat{T}]}_{i j}),

(25)

where γ_i is the Lagrange multiplier to guarantee the row-sum rule of $T^{MD} (t)$ . With the Lagrange method of the above-mentioned Lagrangian, we can derive a straightforward least-squares fitting method to fit A and $\hat{T}$ [see Eq. (A6) for details].

In practice, the LSF fitting is performed on a subset of the input TPMs, $[T^{MD} (τ_{K}^{trial}), T^{MD} (τ_{K}^{trial} + Δ t), \dots, T^{MD} (τ_{K}^{trial} + L)]$ , where $τ_{K}^{trial}$ is the time of the first frame used in LSF and L is the length of input data used in LSF. For each fitting, we can compute the resulting $A = A (τ_{K}^{trial}, L)$ and $\hat{T} = \hat{T} (τ_{K}^{trial}, L)$ . A single run of LSF fitting may be susceptible to numerical fluctuations of simulation data. Therefore, we adopted a systematic search for $τ_{K}^{trial}$ and L to obtain the best IGME models that match the MD simulation data. To quantify the errors of IGME models in reproducing the simulation data, we have followed our previous work to utilize the time-averaged root mean squared error (RMSE),⁵⁰

RMSE = \sqrt{\frac{\sum_{n = 1}^{L_{x}} \sum_{i, j = 1}^{N} {[π_{i} T_{i j}^{MD} - π_{i} T_{i j}^{IGME} (t)]}^{2} d t}{N^{2} L_{x}}},

(26)

where T^IGME corresponds to the TPMs predicted by IGME, represented as $T^{IGME} = A \hat{T}$ . Additionally, L_x denotes the maximum lag time of $T^{IGME} (t_{x})$ used for computing the RMSE (t_x = L_xΔt, where Δt is the saving interval of MD simulations). By substituting T^IGME with T^qMSM or T^MSM in Eq. (26) the same RMSE metric can be employed to evaluate the accuracy of qMSM or MSM,⁴⁹ respectively. Specifically, for qMSMs, L_x = t_x/Δt, while for MSMs, $L_{x} = t_{x} / τ_{M}^{trial}$ (where $τ_{M}^{trial}$ represents the lag time of an MSM).

In our implementation of IGME, we perform a systematic search for possible values of two hyperparameters, $τ_{K}^{trial}$ and L, $τ_{K}^{trial} = Δ t, 2 Δ t, \dots, L_{0}$ , and $L = Δ t, 2 Δ t, \dots, L_{0} - τ_{K}^{trial}$ (L₀ is a pre-determined scanning range). Finally, the IGME models with the smallest RMSE will be used.

IV. TUTORIAL EXAMPLES

In this section, we will provide a detailed tutorial on how to construct non-Markovian dynamic models (i.e., qMSM and IGME) from MD simulation datasets using two examples: alanine dipeptide and villin headpiece. To perform the tasks in this Tutorial, we will utilize both MSMBuilder^101,102 and PyEMMA¹⁰³ software. All the relevant Python codes for this Tutorial are available on GitHub: https://github.com/xuhuihuang/GME_tutorials. This Tutorial employs MSMBuilder, version 2022 (https://github.com/msmbuilder/msmbuilder2022), and the PyEMMA, version 2.5.12 (https://github.com/markovmodel/PyEMMA/).

A. Alanine dipeptide

The first example is the conformational dynamics of alanine dipeptide in explicit solvent [Fig. 2(a)]. The MD simulation dataset of alanine dipeptide consists of 100 MD trajectories, and each trajectory is performed for 10 ns under the NVT ensemble at 310 K. The initial conformation of each trajectory is evenly extracted from a 20 ns NVT trajectory with an interval of 0.2 ns. The AMBER99SB force field¹⁰⁴ is used for alanine dipeptide and the TIP3P model¹⁰⁵ for water. The snapshots of these MD trajectories are stored every 0.1 ps.

As shown in Scheme 1, we utilize all 45 pairwise distances among ten heavy atoms of alanine dipeptide as the input features in the featurization step. Given the relatively small size of this system, there is no need for feature selection.

With the input features, the dimensionality reduction was performed with tICA. Specifically, the original 45 distance features are transformed into a reduced set of time-lagged Independent Components (tICs). In our Tutorial, the tICA model is built with three tICs at the tICA lag time of 0.2 ps. A sample Python code for performing the tICA analysis is shown in Scheme 2. Next, we apply the K-Centers algorithm to generate a microstate model with 800 states based on the top three tICs (see Scheme 3).

SCHEME 2. — Dimensionality reduction using tICA.

In the dimensionality reduction [Fig. 1(c)] and geometric clustering step [Fig. 1(d)], we have applied the cross-validation tool, GMRQ, to select the optimal values for several hyperparameters, including the lag time of tICA (τ_tICA), the number of tICs (n_tICs), and the number of geometric clusters (n_microstates). In the GMRQ analysis, we perform a fivefold cross-validation (i.e., 100 MD trajectories are randomly divided into five folds, and we use four folds for training and the remaining one fold for validation) and repeat it ten times by changing the random number when splitting the dataset (see Scheme 4 for the Python code). Furthermore, we include the three slowest dynamic modes in our GMRQ calculations (resulting in a maximum GMRQ score of 4), as the persistent and substantial gap is observed between the third and fourth slowest ITS [see Fig. 3(a)]. Based on these runs, we report GMRQ scores as box plots [Fig. 2(b)]. We then determine the optimal values (those with the highest median GMRQ score in the box plots) of the following three hyperparameters sequentially: τ_tICA = 0.2 ps, n_tICs = 3, n_microstates = 800. As shown in Fig. 2(b), we search for the optimal values of τ_tICA, n_tICs, and n_microstates in a sequential manner. It is noteworthy that one could also conduct a thorough and three-dimensional search of the optimal values for these three hyperparameters. As shown in Fig. 2(c), the outcomes of the three-dimensional search are consistent with those of the sequential scanning, revealing the same optimal combination of the three hyperparameters. While the exhaustive search of all hyperparameters could potentially yield the most accurate results, such an approach often involves substantial computational costs. For practical reasons, we still recommend searching for the optimal hyperparameter values in a sequential manner in our tutorials.

SCHEME 4. — GMRQ cross-validation based on the tICA analysis.

FIG. 3. — Construction and validation of microstate-MSM of alanine dipeptide. (a) Implied timescales (or ITS) for the first ten dynamic modes as a function of lag time. (b) Chapman–Kolmogorov test for the eight most populated microstates. We choose the lag time to be 10 ps. The error bars in the implied timescale and the residence probability plots are calculated by bootstrapping with replacement 100 trajectories for 50 times.

Next, we construct the microstate MSM and calculated the error bar of both ITS and residence probabilities of the most populated eight microstates following the code in Scheme 5. As shown in Fig. 3(a), the slowest three ITS curves are all within the plateau region at the lag time of 10 ps, indicating that the microstate MSM reaches the Markovian state at τ_M ≥ 10 ps. To further validate the microstate MSM, we perform the Chapman–Kolmogorov test [according to Eq. (5)] and show that an MSM with a lag time of 10 ps can predict dynamics in reasonable agreement with the MD simulations [Fig. 3(b)].

With a validated microstate-MSM, we next apply PCCA+ to lump 800 microstates into four metastable states [see Fig. 1(e) and the sample code in Scheme 6]. The number of macrostates is an input parameter for PCCA+, and we determine its values as 4 because there exists a persistent and major gap between the third and fourth eigenmodes in the ITS plots, as shown in Fig. 3(a). The four macrostates can be visualized by the projections of their MD snapshots onto two backbone torsion angles of alanine dipeptide [see Fig. 4(a)].

FIG. 4. — The calculation of memory kernels, K(t). (a) The projections of MD snapshots onto two backbone dihedral angles (ψ, ϕ) of alanine dipeptide. The four macrostates (1–4) are color-coded as orange, cyan, blue, and red, respectively. (b) The mean integral kernel (or MIK) calculated from qMSM (blue) and IGME (red) with τ_K = 1.5 ps and L = 0.1 ps. (c) The full (4 × 4) memory kernel matrix [K(t)] shown as a function of lag time.

Based on these four macrostates, we next compute the memory kernels [ $K (t)$ ] and build a qMSM, as illustrated in Fig. 1(f) and the accompanying sample code in Scheme 7. We need to establish a threshold for considering the relaxation of the memory kernels. However, directly defining such a threshold based on the value of $K (t)$ may not be plausible as various elements of $K (t)$ exhibit noticeable fluctuations [see Fig. 4(c)]. Therefore, to determine the memory kernel relaxation time (τ_K) for propagating GME [Eqs. (9) and (22)], we follow our previous work⁴⁹ to identify the time at which the mean integral of memory kernel elements [or MIK, see Eq. (24)] reaches a plateau. The MIK plot, as shown in Fig. 4(b), reaches a plateau at ∼1.5 ps, leading us to select τ_K = 1.5 ps for building the qMSM. The RMSE [Eq. (26)] between our qMSM and the original MD dataset is very small [only 1.5 × 10⁻³, see Fig. 5(a)], and the qMSM model has also successfully passed the Chapman–Kolmogorov test [see the blue curves in Fig. 5(b)]. It is worth noting that for any choice of τ greater than τ_K = 1.5 ps, the memory kernel has fully decayed theoretically, and the propagation of dynamics using the GME [Eq. (9)] should yield the same accuracy. However, due to numerical fluctuations, qMSMs constructed at different lag times, when τ exceeds 1.5 ps, still exhibit varying errors. We will further discuss this issue in Subsection IV B when constructing IGME models. Finally, using GME in our qMSM, we could obtain TPMs at any lag times. When the lag time is sufficiently long for the model to reach Markovian, the GME will be reduced to an MSM. Indeed, when τ = 10 ps, the TPM from our qMSM yields the same slowest ITS of 1.14 ns as an MSM for alanine dipeptide.

SCHEME 7. — Constructing qMSMs for alanine dipeptide.

FIG. 5. — Building qMSM and IGME models for alanine dipeptide. (a) The RMSE map of the IGME, qMSM, and MSM. We applied Eq. (26) to compute RMSE and chose L_x = 500 for IGME and qMSM, while L_x = 50 ns/τ for MSM with a lag time of τ. The saving internal of MD simulations is Δt = 0.1 ps. (b) Chapman–Kolmogorov test on the four macrostates for the selected models. MSM has been tested when lag time τ = 1.5 ps and τ = 10 ps, and qMSM has been tested with τ_K = 1.5 ps. IGME has been tested when τ_K = 1.5 ps and L = 0.1 ps. The error bars in the residence probability plots of MD data are calculated by bootstrapping 100 lumped trajectories 50 times, with repeated trajectories allowed.

In this section, we illustrate the process of constructing IGME models [see Fig. 1(f) and Scheme 8]. As discussed in Sec. III F, we need to determine two hyperparameters, τ_K and L, when constructing IGME models. In this system, we conducted a systematic scan to identify their optimal values, resulting in IGME models with the minimized RMSE [Eq. (26)]. Figure 5(a) illustrates that the RMSE of the optimal IGME reaches 1.4 × 10⁻³ at τ_K = 1.5 ps and L = 0.1 ps. The Chapman–Kolmogorov test for this IGME model also demonstrates a strong consistency between IGME predictions and MD simulations [Fig. 5(b)]. In addition, the slowest dynamics can be directly derived from the $\hat{T}$ matrix [Eq. (11)]. Based on $\hat{T}$ , we identify that the slowest ITS for alanine dipeptide is ∼1.15 ns, consistent with the predictions from qMSM and MSM. Moreover, the MIK computed from the optimal IGME model [see Eq. (12)] aligns well with that obtained from qMSM [Fig. 4(b)].

For the relatively simple alanine dipeptide system with sufficient sampling, all three methods, MSM, qMSM, and IGME, yield consistent results. However, MSM requires a significantly longer lag time (τ_M = 10 ps) to achieve a similar RMSE compared to qMSM (τ_K = 1.5 ps) and IGME (τ_K = 1.5 ps and L = 0.1 ps). Consequently, qMSM and IGME can be constructed with shorter MD trajectories than MSM. Both qMSM and IGME consistently perform well for alanine dipeptide. In the subsequent example of villin headpiece, we demonstrate that IGME outperforms qMSM by substantially reducing the numerical instability, providing a more robust approach to constructing non-Markovian dynamic models for studying protein dynamics.

B. Villin headpiece (HP35)

HP35 is a 35-residue peptide that exhibits ultrafast folding,^106,107 making it a suitable benchmark system for MD simulations of protein folding.^92,108–110 The HP35 simulation dataset provided in the research by Piana et al.¹¹¹ consists of a single $\sim 300 μ s$ all-atom MD trajectory of the Nle/Nle mutant of HP35 (PDB ID: 2f4k) saved at a 0.2 ns interval.

In the featurization step [Fig. 1(b)], we first employ all 528 pairwise distances between C-alpha atoms with a minimum separation of three residues as raw input features [Fig. 6(a)]. This step is conducted using the “ContactFeaturizer” function in MSMBuilder¹⁰¹ (see the sample code in Scheme 9). Next, we select 400 features from these 528 pairwise distances that can best describe the slowest dynamics of protein folding using spectral oASIS⁶² in the PyEMMA package¹⁰³ (see the sample code in Scheme 10). In the feature selection step, we employ the time-lagged autocorrelation matrix⁶² with a lag time of 20 ns rather than the self-covariance matrix to achieve a better performance. As shown in Fig. 6(b), the ITS plot demonstrates that the 400 selected features are sufficient to capture the slowest dynamics of HP35.

FIG. 6. — Feature selection of villin headpiece. (a) Structure of villin headpiece. There are 35 residues and 528 residue–residue distances based on the distances between their alpha carbon atoms with a minimum separation of three residues. We use all these 528 pairwise distances as raw input features. (b) Feature selection using spectral oASIS. 400 out of 528 features are selected.

SCHEME 9. — Featurization of villin headpiece.

SCHEME 10. — Feature selection using spectral oASIS.

Following the feature selection, we proceed with the dimensionality reduction and geometric clustering [Figs. 1(c) and 1(d)]. We use two different approaches for the dimensionality reduction: tICA and SRVs. Following the sample code in presented in Schemes 2 and 11, we performed tICA and SRVs analysis to transform the 400 features into a specific number of CVs (or tICs for tICA) with a designated lag time, respectively. Utilizing the obtained CVs, we apply the K-Centers algorithm (Scheme 3) to partition the MD dataset into a given number of microstates. Several hyperparameters associated with this step, including the lag time, number of CVs (or tICs), and number of microstates, need to be determined. We employ the GMRQ cross-validation tool for both tICA (Scheme 4) and SRVs (Scheme 12) to sequentially choose their optimal values, starting with the lag time of 40 ns (left panels in Fig. 7), followed by the selection of four tICs for tICA as well as three CVs for SRVs (middle panels in Fig. 7), and concluding with 200 microstates (right panels in Fig. 7).

SCHEME 11. — SRVs for dimensionality reduction.

SCHEME 12. — GMRQ cross-validation based on the SRVs analysis.

FIG. 7. — Cross-validation of villin headpiece with different methods for dimensionality reduction. (a) Cross-validation has been conducted based on GMRQ scores to select the optimal parameters for lag time in tICA, the number of tICs, and the number of microstates (or clusters) for K-Centers clustering. The optimal parameters chosen are a 40 ns tICA lag time, four tICs, and 200 microstates. (b) Cross-validation has been performed based on GMRQ scores to select the optimal parameters for the lag time in SRVs, the number of CVs, and the number of microstates (or clusters) for K-Centers clustering. The optimal parameters chosen are a 40 ns SRV lag time, three CVs, and 200 microstates.

Next, we validate the microstate MSMs constructed based on both the tICA (denoted as tICA-MSM) and SRVs (denoted as SRVs-MSM) approaches. Specifically, we followed Scheme 5 to perform the ITS and Chapman–Kolmogorov test. For both tICA-MSM and SRVs-MSM, the ITS plots reach a plateau at ∼100 ns [Figs. 8(a) and 8(b)]. Furthermore, both tICA-MSM and SRVs-MSM constructed at this lag time successfully pass the Chapman–Kolmogorov test [Figs. 8(c) and 8(d)].

FIG. 8. — Construction and validation of microstate-MSMs for villin headpiece. (a) and (b) Implied timescales (or ITS) for the ten slowest dynamic modes as a function of lag time. (c) and (d) Chapman–Kolmogorov test for the eight most populated microstates. We choose the lag time to be 100 ns. The error bars in the implied timescale and the residence probability plots are calculated by bootstrapping with replacement 150 trajectories for 50 times. The results from tICA-MSM and SRV-MSM are displayed in the left and right panels, respectively.

With validated microstate-MSMs, we next perform kinetic lumping [Fig. 1(e)] to group 200 microstates into four metastable macrostates states using PCCA+ (Scheme 6). We chose four macrostates as there exists a clear separation between the third and fourth eigenmodes based on the ITS plots, indicating that there are four dominant metastable dynamic processes [see Figs. 8(a) and 8(b)]. As shown in Fig. 9, state 1 and state 3 are partially folded states. State 2 corresponds to the unfolded state, while the most populated state 4 is the folded state. These representative conformations shown in Fig. 9 are chosen from the kinetic lumping results based on the microstate tICA-MSM, and similar state decomposition can be obtained using the microstate SRVs-MSM.

FIG. 9. — Representative conformations from the four macrostates. These conformations were chosen from the kinetic lumped model from the microstate tICA-MSM.

We proceed by computing memory kernels and compare the qMSMs constructed using the four macrostates based on tICA (referred to as tICA-qMSM) and SRV (referred to SRVs-qMSM) approaches [see Fig. 1(f) and the sample code in Scheme 13]. For tICA-qMSM, we examine the MIK plots [computed according to Eq. (24)] to determine the value of τ_K = 30 ns, where the integral of the memory kernels has already reached a plateau [Fig. 10(a)]. The resulting tICA-qMSM exhibits a small deviation in reproducing the original MD simulation dataset [with the RMSE as low as 6 × 10⁻⁴, see Fig. 10(c)]. Utilizing this four-state tICA-qMSM, we also predict the slowest ITS to be ∼1.87 µs, consistent with the value obtained from the validated 200-microstate MSMs [Fig. 8(a)]. To achieve this, we use our tICA-qMSM to obtain the TPM [T(nΔt)] at nΔt = 500 ns [Eq. (22)] to compute the slowest ITS. For the SRVs-qMSM, it takes a slightly shorter lag time (τ_K = 25 ns) for the MIK plot to reach a plateau [Fig. 10(b)]. The RMSE of this model is 1.0 × 10⁻³, and the predicted slowest ITS, as computed from T(nΔt) at nΔt = 500 ns, is ∼1.64 µs.

SCHEME 13. — Constructing qMSMs for villin headpiece.

FIG. 10. — Building qMSM and IGME models for villin headpiece. (a) The mean integral kernel (MIK) calculated from qMSM (blue) and IGME (red) for the tICA-qMSM and tICA-IGME models. The red dashed line is the mean MIK of the top 5% IGME models, and the shaded area indicates their standard deviations. (b) The same as (a) except that the results from SRVs-qMSM and SRVs-IGME models are shown. (c) The RMSE map of the tICA-IGME, tICA-qMSM, and macrostate tICA-MSM. We applied Eq. (26) to compute RMSE and chose L_x = 300 for IGME and qMSM, while L_x = 300 ns/τ for MSM with a lag time of τ. The saving internal of MD simulations is Δt = 1 ns. (d) The same as (c) except that the results from SRVs-IGME, SRVs-qMSM, and macrostate SRV-MSM are shown. (e) Chapman–Kolmogorov test on the tICA-based macrostates models. Specifically, the lag time of τ_M = 150 ns, τ_K = 30 ns, and τ_K = 31 ns, L = 1 ns are used for the MSM, qMSM, and IGME models, respectively. (f) Chapman–Kolmogorov test on the SRVs-based macrostates models. Specifically, the lag time of τ_M = 150 ns, τ_K = 25 ns, and τ_K = 22 ns, L = 1 ns are used for the MSM, qMSM, and IGME models, respectively. In (e) and (f), the error bars are calculated by bootstrapping 150 MD trajectories 50 times with replacement.

In the final step of this Tutorial, we constructed tICA-IGME and SRVs-IGME models for HP35 [see Fig. 1(f) and the sample code in Scheme 14]. To determine the values of the two hyperparameters (τ_K and L), we conduct a systematical scan within the range of 1–50 ns [Figs. 10(c) and 10(d)]. For tICA-IGME and SRVs-IGME, we identify the best models with the smallest RMSE at $\{τ_{K} = 31 ns, L = 1 ns\}$ and $\{τ_{K} = 22 ns, L = 1 ns\}$ , respectively. In the Chapman–Kolmogorov test, both models accurately predict the time evolutions of state residence probabilities, aligning well with the original MD simulations [Figs. 10(e) and 10(f)]. Furthermore, we notice that all top 5% IGME models exhibit RMSE below 1.0 × 10⁻³, indicating high accuracy. Thus, we select IGME models above this top 5% threshold for subsequent analysis. Based on these models, we calculate the average value and standard deviations of the slowest ITS [based on $\hat{T}$ , see Eq. (11)] to be 1.96 ± 0.30 and 1.87 ± 0.42 µs for tICA-IGME and SRVs-IGME, respectively. As expected, the MIK obtained from IGME [Eq. (12)] is consistent with that from qMSM [Figs. 10(a) and 10(b)]. It is also noteworthy that the IGME models based on tICA and SRVs display a comparable performance for the HP35 system, i.e., RMSE = 7.9 × 10⁻⁴ ± 9 × 10⁻⁵ (tICA-IGME) vs 8.0 × 10⁻⁴ ± 9 × 10⁻⁵ (SRVs-IGME) for top 5% of the IGME models. For comparisons, we also constructed macrostate MSMs for HP35. For both macrostate tICA-MSM and SRVs-MSM, the lag time needs to be as long as τ = 150 ns for the models to achieve a Markovian behavior and pass the Chapman-Kolmogorov test [Figs. 10(e) and 10(f)]. This Markovian lag time (τ_M = 150 ns) is several times longer than τ_K for qMSM and IGME models. Furthermore, MSMs consistently exhibit larger RMSEs compared to qMSM and IGME models [Figs. 10(c) and 10(d)].

The datasets of alanine dipeptide and villin headpiece are sufficiently sampled; thus, the RMSEs of both systems are small, and the predictions of dynamics from all top 5% models shown in Figs. 5(a), 10(c), and 10(d) are consistent to each other. However, if the sampling is insufficient, even the top 5% models will contain large RMSEs with MD data, and the predicted dynamics of top 5% IGME models may contain large fluctuations. In this case, it is suggested to enhance the MD sampling to reduce the fluctuation of dynamics rather than build an IGME model based on insufficiently sampled data.

V. CONCLUSIONS

In this Tutorial, we offer a comprehensive, step-by-step guide on constructing non-Markovian dynamics models, specifically qMSM and IGME, for investigating protein dynamics. Using two MD simulation datasets—alanine dipeptide and villin headpiece—we provide detailed instructions along with associated sample codes covering the entire model construction protocol (see Fig. 1). This protocol includes feature selection, dimensionality reduction, geometric clustering, kinetic lumping, and the creation of qMSM and IGME models. We believe that this Tutorial will prove valuable to a broad audience in computational biophysics who are interested in exploring the dynamics of proteins and other biological macromolecules.

ACKNOWLEDGMENTS

X.H. acknowledges the support from NIH/NIGMS under Award No. R01GM147652-01A1. X.H. also acknowledges the support from the Hirschfelder Professorship Fund.

APPENDIX: THE LEAST-SQUARES FITTING METHOD TO FIT HYPERPARAMETERS IN IGME

In our implementation, we utilized the following form of Eq. (11):

\ln T (t) \approx \ln A + t \ln \hat{T} .

(A1)

A simple Lagrangian to minimize the error in the above-mentioned equation can be defined as

L_{u} = \frac{1}{2} \sum_{t} {|\ln T (t) - \ln A - t \ln \hat{T}|}_{F}^{2},

(A2)

where the subscript “F” represents the Frobenius norm. Next, we introduced an additional constraint to the above-mentioned Lagrangian. The TPM, $T (t)$ , shall satisfy the row-sum rule, i.e., $\sum_{j} T_{i j} (t) = 1$ . To consider this constraint, we utilized the logarithm form of the row-sum rule derived as follows:

\begin{align} \sum_{j} T_{i j} (t) = 1 & \Rightarrow T (t) [\begin{matrix} \begin{matrix} 1 \\ 1 \end{matrix} \\ \begin{matrix} \dots \\ 1 \end{matrix} \end{matrix}] = [\begin{matrix} \begin{matrix} 1 \\ 1 \end{matrix} \\ \begin{matrix} \dots \\ 1 \end{matrix} \end{matrix}] \\ \Rightarrow λ_{1} (T (t)) = 1, v_{1} (T (t)) = {[1, 1, \dots, 1]}^{T} \\ \Rightarrow λ_{1} (\ln T (t)) = 0, v_{1} (\ln T (t)) = {[1, 1, \dots, 1]}^{T} \\ \Rightarrow [\ln T (t)] [\begin{matrix} \begin{matrix} 1 \\ 1 \end{matrix} \\ \begin{matrix} \dots \\ 1 \end{matrix} \end{matrix}] = 0 \Rightarrow \sum_{j} {[\ln T (t)]}_{i j} = 0, \end{align}

(A3)

where λ₁ and v₁ refer to the first eigenvalue and eigenvector, respectively. In IGME, the input $T (t)$ [i.e., $T (τ_{K}), T (τ_{K} + Δ t), \dots, T (τ_{K} + L)$ ] generated by MSMBuilder already satisfies the row-sum rule. However, the predicted TPMs, $T (t > τ_{K} + L)$ , may not satisfy the row-sum rule if this constraint is not explicitly included in the least-squares fitting. In order to guarantee that all predicted TPMs satisfy the row-sum rule, we need to ensure that both A and $\hat{T}$ satisfy the row-sum rule. However, the row-sum rule of A is automatically satisfied when the inputs $T (t)$ and $\hat{T}$ satisfy the row-sum rule [since $\sum_{j} {[\ln A]}_{i j} = \sum_{j} {[\bar{\ln T (t)} - \bar{t} \ln \hat{T}]}_{i j}$ , see Eq. (A5)]; thus, in the least-squares fitting, we only need to ensure that $\hat{T}$ follows the row-sum rule: $\sum_{j} {\hat{T}}_{i j} = 1$ or $\sum_{j} {[\ln \hat{T}]}_{i j} = 0$ . Therefore, the complete Lagrangian with the row-sum constraint of $T (t)$ can be written as follows:

L = \frac{1}{2} \sum_{t} {|\ln T (t) - \ln A - t \ln \hat{T}|}_{F}^{2} + \sum_{i} (γ_{i} \sum_{j} {[\ln \hat{T}]}_{i j}) .

(A4)

Here, γ_i are Lagrange multipliers for each row of $\ln \hat{T}$ . Taking $\partial / \partial_{{\hat{T}}_{i j}}$ , $\partial / \partial_{A_{i j}}$ , and $\partial / \partial_{γ_{i}}$ to L of Eq. (A4), the solution of these Lagrange equations are

\begin{gathered} (\bar{t} {[\ln A]}_{i j} + \bar{t^{2}} {[\ln \hat{T}]}_{i j} + γ_{i} - \bar{t {[\ln T (t)]}_{i j}}) {[\ln \hat{T}]}_{i j} = 0, \\ ({[\ln A]}_{i j} + \bar{t} {[\ln \hat{T}]}_{i j} - \bar{{[\ln T (t)]}_{i j}}) {[\ln A]}_{i j} = 0, \\ \sum_{j} {[\ln \hat{T}]}_{i j} = 0 . \end{gathered}

(A5)

The above-mentioned solution can be rewritten into a matrix form for the least-squares fitting. For each row i, the least-squares fitting is done by solving the following matrix equation to get ln A, $\ln \hat{T}$ , and γ:

\begin{align} [\begin{matrix} \begin{matrix} \bar{t} \\ \bar{t} \\ \dots \end{matrix} & \begin{matrix} \bar{t^{2}} \\ \bar{t^{2}} \\ \dots \end{matrix} & \begin{matrix} 1 \\ 1 \\ \dots \end{matrix} \\ \begin{matrix} 1 \\ 1 \\ \dots \end{matrix} & \begin{matrix} \bar{t} \\ \bar{t} \\ \dots \end{matrix} & \begin{matrix} 0 \\ 0 \\ \dots \end{matrix} \\ \begin{matrix} 0 & 0 & \dots \end{matrix} & \begin{matrix} 1 & 1 & \dots \end{matrix} & 0 \end{matrix}] [\begin{matrix} \begin{matrix} {[\ln A]}_{i 1} \\ {[\ln A]}_{i 2} \\ \dots \end{matrix} \\ \begin{matrix} {[\ln \hat{T}]}_{i 1} \\ {[\ln \hat{T}]}_{i 2} \\ \dots \end{matrix} \\ γ_{i} \end{matrix}] \\ = [\begin{matrix} \begin{matrix} \bar{t {[\ln T (t)]}_{i 1}} \\ \bar{t {[\ln T (t)]}_{i 2}} \\ \dots \end{matrix} \\ \begin{matrix} \bar{{[\ln T (t)]}_{i 1}} \\ \bar{{[\ln T (t)]}_{i 2}} \\ \dots \end{matrix} \\ 0 \end{matrix}] . \end{align}

(A6)

AUTHOR DECLARATIONS

Conflict of Interest

The authors have no conflicts to disclose.

Author Contributions

Yue Wu: Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (supporting); Software (lead); Validation (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (equal). Siqin Cao: Conceptualization (lead); Methodology (lead); Software (equal); Writing – review & editing (equal). Yunrui Qiu: Methodology (supporting); Software (equal); Writing – original draft (supporting). Xuhui Huang: Conceptualization (lead); Funding acquisition (lead); Investigation (equal); Methodology (supporting); Project administration (lead); Resources (lead); Supervision (lead); Writing – review & editing (lead).

DATA AVAILABILITY

All the steps and Python codes of the tutorials can be accessed on our GitHub repository (https://github.com/xuhuihuang/GME_tutorials). Colab notebooks are also available for two systems in our Tutorial: alanine dipeptide (https://colab.research.google.com/github/xuhuihuang/GME_tutorials/blob/main/tutorials/alanine_dipeptide_tutorial.ipynb) and villin headpiece (https://colab.research.google.com/github/xuhuihuang/GME_tutorials/blob/main/tutorials/villin_headpiece_tutorial.ipynb). Other simulation datasets are available from the corresponding author upon reasonable request.

REFERENCES

1.Henzler-Wildman K. and Kern D., Nature 450(7172), 964–972 (2007). 10.1038/nature06522 [DOI] [PubMed] [Google Scholar]
2.Bahar I., Lezon T. R., Yang L. W., and Eyal E., Annu. Rev. Biophys. 39, 23–42 (2010). 10.1146/annurev.biophys.093008.131258 [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Brueckner F., Ortiz J., and Cramer P., Curr. Opin. Struct. Biol. 19(3), 294–299 (2009). 10.1016/j.sbi.2009.04.005 [DOI] [PubMed] [Google Scholar]
4.Zhang L., Pardo-Avila F., Unarta I. C., Cheung P. P., Wang G., Wang D., and Huang X., Acc. Chem. Res. 49(4), 687–694 (2016). 10.1021/acs.accounts.5b00536 [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Bowman G. R., Bolin E. R., Hart K. M., Maguire B. C., and Marqusee S., Proc. Natl. Acad. Sci. U. S. A. 112(9), 2734–2739 (2015). 10.1073/pnas.1417811112 [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Wagner J. R., Lee C. T., Durrant J. D., Malmstrom R. D., Feher V. A., and Amaro R. E., Chem. Rev. 116(11), 6370–6390 (2016). 10.1021/acs.chemrev.5b00631 [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Unarta I. C., Cao S., Kubo S., Wang W., Cheung P. P., Gao X., Takada S., and Huang X., Proc. Natl. Acad. Sci. U. S. A. 118(17), e2024324118 (2021). 10.1073/pnas.2024324118 [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Chodera J. D., Singhal N., Pande V. S., Dill K. A., and Swope W. C., J. Chem. Phys. 126(15), 155101 (2007). 10.1063/1.2714538 [DOI] [PubMed] [Google Scholar]
9.Prinz J. H., Wu H., Sarich M., Keller B., Senne M., Held M., Chodera J. D., Schutte C., and Noe F., J. Chem. Phys. 134(17), 174105 (2011). 10.1063/1.3565032 [DOI] [PubMed] [Google Scholar]
10.Konovalov K. A., Unarta I. C., Cao S., Goonetilleke E. C., and Huang X., JACS Au 1(9), 1330–1341 (2021). 10.1021/jacsau.1c00254 [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Zhang L., Jiang H., Sheong F. K., Pardo-Avila F., Cheung P. P.-H., and Huang X., Methods Enzymol. 578, 343–371 (2016). 10.1016/bs.mie.2016.05.026 [DOI] [PubMed] [Google Scholar]
12.Wang W., Cao S., Zhu L., and Huang X., Wiley Interdiscip. Rev.: Comput. Mol. Sci. 8, e1343 (2018). 10.1002/wcms.1343 [DOI] [Google Scholar]
13.Pan A. C. and Roux B., J. Chem. Phys. 129(6), 064107 (2008). 10.1063/1.2959573 [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Zhang B. W., Dai W., Gallicchio E., He P., Xia J. C., Tan Z. Q., and Levy R. M., J. Phys. Chem. B 120(33), 8289–8301 (2016). 10.1021/acs.jpcb.6b02015 [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Morcos F., Chatterjee S., McClendon C. L., Brenner P. R., López-Rendón R., Zintsmaster J., Ercsey-Ravasz M., Sweet C. R., Jacobson M. P., Peng J. W., and Izaguirre J. A., PLoS Comput. Biol. 6(12), e1001015 (2010). 10.1371/journal.pcbi.1001015 [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Huang X. H., Bowman G. R., Bacallado S., and Pande V. S., Proc. Natl. Acad. Sci. U. S. A. 106(47), 19765–19769 (2009). 10.1073/pnas.0909088106 [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Malmstrom R. D., Lee C. T., Van Wart A. T., and Amaro R. E., J. Chem. Theory Comput. 10(7), 2648–2657 (2014). 10.1021/ct5002363 [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Buchete N. V. and Hummer G., J. Phys. Chem. B 112(19), 6057–6069 (2008). 10.1021/jp0761665 [DOI] [PubMed] [Google Scholar]
19.Lorpaiboon C., Thiede E. H., Webber R. J., Weare J., and Dinner A. R., J. Phys. Chem. B 124(42), 9354–9364 (2020). 10.1021/acs.jpcb.0c06477 [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Qiao Q., Bowman G. R., and Huang X. H., J. Am. Chem. Soc. 135(43), 16092–16101 (2013). 10.1021/ja403147m [DOI] [PubMed] [Google Scholar]
21.Noé F., Schütte C., Vanden-Eijnden E., Reich L., and Weikl T. R., Proc. Natl. Acad. Sci. U. S. A. 106(45), 19011–19016 (2009). 10.1073/pnas.0905466106 [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Bowman G. R., Voelz V. A., and Pande V. S., Curr. Opin. Struct. Biol. 21(1), 4–11 (2011). 10.1016/j.sbi.2010.10.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Deng N. J., Dai W., and Levy R. M., J. Phys. Chem. B 117(42), 12787–12799 (2013). 10.1021/jp401962k [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Wan H. B., Ge Y. H., Razavi A., and Voelz V. A., J. Chem. Theory Comput. 16(2), 1333–1348 (2020). 10.1021/acs.jctc.9b01240 [DOI] [PubMed] [Google Scholar]
25.Buch I., Giorgino T., and De Fabritiis G., Proc. Natl. Acad. Sci. U. S. A. 108(25), 10184–10189 (2011). 10.1073/pnas.1103547108 [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Lawrenz M., Shukla D., and Pande V. S., Sci. Rep. 5, 7918 (2015). 10.1038/srep07918 [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Silva D. A., Bowman G. R., Sosa-Peinado A., and Huang X. H., PLoS Comput. Biol. 7(5), e1002054 (2011). 10.1371/journal.pcbi.1002054 [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Plattner N. and Noé F., Nat. Commun. 6, 7653 (2015). 10.1038/ncomms8653 [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Jiang H., Sheong F. K., Zhu L., Gao X., Bernauer J., and Huang X., PLoS Comput. Biol. 11(7), e1004404 (2015). 10.1371/journal.pcbi.1004404 [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Silva D. A., Weiss D. R., Pardo Avila F., Da L. T., Levitt M., Wang D., and Huang X., Proc. Natl. Acad. Sci. U. S. A. 111(21), 7665–7670 (2014). 10.1073/pnas.1315751111 [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Kohlhoff K. J., Shukla D., Lawrenz M., Bowman G. R., Konerding D. E., Belov D., Altman R. B., and Pande V. S., Nat. Chem. 6(1), 15–21 (2014). 10.1038/nchem.1821 [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Da L. T., Pardo-Avila F., Xu L., Silva D. A., Zhang L., Gao X., Wang D., and Huang X., Nat. Commun. 7, 11244 (2016). 10.1038/ncomms11244 [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Da L. T., Chao E., Duan B. G., Zhang C. B., Zhou X., and Yu J., PLoS Comput. Biol. 11(11), e1004624 (2015). 10.1371/journal.pcbi.1004624 [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Da L. T., Wang D., and Huang X. H., J. Am. Chem. Soc. 134(4), 2399–2406 (2012). 10.1021/ja210656k [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Malmstrom R. D., Kornev A. P., Taylor S. S., and Amaro R. E., Nat. Commun. 6, 7588 (2015). 10.1038/ncomms8588 [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Wang B. B., Sexton R. E., and Feig M., Biochim. Biophys. Acta, Gene Regul. Mech. 1860(4), 482–490 (2017). 10.1016/j.bbagrm.2017.02.008 [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Khaled M., Gorfe A., and Sayyed-Ahmad A., J. Phys. Chem. B 123(36), 7667–7675 (2019). 10.1021/acs.jpcb.9b05768 [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Barros E. P., Demir Ö., Soto J., Cocco M. J., and Amaro R. E., Chem. Sci. 12(5), 1891–1900 (2021). 10.1039/d0sc05053a [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Feng J. Y., Selvam B., and Shukla D., Structure 29(8), 922–933.e3 (2021). 10.1016/j.str.2021.03.014 [DOI] [PubMed] [Google Scholar]
40.Son C. Y., Yethiraj A., and Cui Q., Proc. Natl. Acad. Sci. U. S. A. 114(42), E8830–E8836 (2017). 10.1073/pnas.1707922114 [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Da L. T., Pardo Avila F., Wang D., and Huang X., PLoS Comput. Biol. 9(4), e1003020 (2013). 10.1371/journal.pcbi.1003020 [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Qiu Y. R., O’Connor M. S., Xue M. Y., Liu B. J., and Huang X. H., J. Chem. Theory Comput. 19(14), 4728–4742 (2023). 10.1021/acs.jctc.3c00318 [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Liu B. J., Xue M. Y., Qiu Y. R., Konovalov K. A., O’Connor M. S., and Huang X. H., J. Chem. Phys. 159(9), 094901 (2023). 10.1063/5.0158903 [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Yik A. K.-H., Qiu Y., Unarta I. C., Cao S., and Huang X., in A Practical Guide to Recent Advances in Multiscale Modeling and Simulation of Biomolecules, edited by Wang Y. and Zhou R. (AIP Publishing LLC, 2023). [Google Scholar]
45.Liu B., Qiu Y., Goonetilleke E. C., and Huang X., MRS Bull. 47(9), 958–966 (2022). 10.1557/s43577-022-00415-1 [DOI] [Google Scholar]
46.Zimmerman M. I., Porter J. R., Ward M. D., Singh S., Vithani N., Meller A., Mallimadugula U. L., Kuhn C. E., Borowsky J. H., Wiewiora R. P., Hurley M. F. D., Harbison A. M., Fogarty C. A., Coffland J. E., Fadda E., Voelz V. A., Chodera J. D., and Bowman G. R., Nat. Chem. 13(7), 651–659 (2021). 10.1038/s41557-021-00707-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Voelz V. A., Bowman G. R., Beauchamp K., and Pande V. S., J. Am. Chem. Soc. 132(5), 1526–1528 (2010). 10.1021/ja9090353 [DOI] [PMC free article] [PubMed] [Google Scholar]
48.Dominic A. J. III, Cao S., Montoya-Castillo A., and Huang X., J. Am. Chem. Soc. 145(18), 9916–9927 (2023). 10.1021/jacs.3c01095 [DOI] [PubMed] [Google Scholar]
49.Cao S., Montoya-Castillo A., Wang W., Markland T. E., and Huang X., J. Chem. Phys. 153(1), 014105 (2020). 10.1063/5.0010787 [DOI] [PubMed] [Google Scholar]
50.Cao S., Qiu Y., Kalin M. L., and Huang X., J. Chem. Phys. 159(13), 134106 (2023). 10.1063/5.0167287 [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Dominic A. J. III, Sayer T., Cao S., Markland T. E., Huang X., and Montoya-Castillo A., Proc. Natl. Acad. Sci. U. S. A. 120(12), e2221048120 (2023). 10.1073/pnas.2221048120 [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Hegger R. and Stock G., J. Chem. Phys. 130(3), 034106 (2009). 10.1063/1.3058436 [DOI] [PubMed] [Google Scholar]
53.Ayaz C., Tepper L., Brunig F. N., Kappler J., Daldrop J. O., and Netz R. R., Proc. Natl. Acad. Sci. U. S. A. 118(31), e2023856118 (2021). 10.1073/pnas.2023856118 [DOI] [PMC free article] [PubMed] [Google Scholar]
54.Ayaz C., Scalfi L., Dalton B. A., and Netz R. R., Phys. Rev. E 105(5), 054138 (2022). 10.1103/physreve.105.054138 [DOI] [PubMed] [Google Scholar]
55.Noé F., Wu H., Prinz J.-H., and Plattner N., J. Chem. Phys. 139(18), 184114 (2013). 10.1063/1.4828816 [DOI] [PubMed] [Google Scholar]
56.Zhu L., Jiang H., Cao S., Unarta I. C., Gao X., and Huang X., Commun. Biol. 4(1), 1345 (2021). 10.1038/s42003-021-02822-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
57.Cerrillo J. and Cao J. S., Phys. Rev. Lett. 112(11), 110401 (2014). 10.1103/physrevlett.112.110401 [DOI] [PubMed] [Google Scholar]
58.Presse S., Lee J., and Dill K. A., J. Phys. Chem. B 117(2), 495–502 (2013). 10.1021/jp309420u [DOI] [PMC free article] [PubMed] [Google Scholar]
59.Noé F. and Fischer S., Curr. Opin. Struct. Biol. 18(2), 154–162 (2008). 10.1016/j.sbi.2008.01.008 [DOI] [PubMed] [Google Scholar]
60.Peng S. J., Wang X. W., Zhang L., He S. S., Zhao X. S., Huang X. H., and Chen C. L., Proc. Natl. Acad. Sci. U. S. A. 117(36), 21889–21895 (2020). 10.1073/pnas.2002971117 [DOI] [PMC free article] [PubMed] [Google Scholar]
61.Patel R., Goldstein T. A., Dyer E. L., Mirhoseini A., and Baraniuk R. G., arXiv:1505.05208 (2015).
62.Litzinger F., Boninsegna L., Wu H., Nüske F., Patel R., Baraniuk R., Noé F., and Clementi C., J. Chem. Theory Comput. 14(5), 2771–2783 (2018). 10.1021/acs.jctc.8b00089 [DOI] [PubMed] [Google Scholar]
63.Stacklies W., Seifert C., and Graeter F., BMC Bioinf. 12, 101 (2011). 10.1186/1471-2105-12-101 [DOI] [PMC free article] [PubMed] [Google Scholar]
64.Diez G., Nagel D., and Stock G., J. Chem. Theory Comput. 18(8), 5079–5088 (2022). 10.1021/acs.jctc.2c00337 [DOI] [PubMed] [Google Scholar]
65.Drineas P. and Mahoney M. W., J. Mach. Learn. Res. 6, 2153–2175 (2005), https://www.jmlr.org/papers/volume6/drineas05a/drineas05a.pdf. [Google Scholar]
66.Traag V. A., Waltman L., and van Eck N. J., Sci. Rep. 9(1), 5233 (2019). 10.1038/s41598-019-41695-z [DOI] [PMC free article] [PubMed] [Google Scholar]
67.Amadei A., Linssen A. B. M., and Berendsen H. J. C., Proteins: Struct., Funct., Bioinf. 17(4), 412–425 (1993). 10.1002/prot.340170408 [DOI] [PubMed] [Google Scholar]
68.Schwantes C. R. and Pande V. S., J. Chem. Theory Comput. 9(4), 2000–2009 (2013). 10.1021/ct300878a [DOI] [PMC free article] [PubMed] [Google Scholar]
69.Pérez-Hernández G., Paul F., Giorgino T., De Fabritiis G., and Noé F., J. Chem. Phys. 139(1), 015102 (2013). 10.1063/1.4811489 [DOI] [PubMed] [Google Scholar]
70.Naritomi Y. and Fuchigami S., J. Chem. Phys. 139(21), 215102 (2013). 10.1063/1.4834695 [DOI] [PubMed] [Google Scholar]
71.Mardt A., Pasquali L., Wu H., and Noé F., Nat. Commun. 9, 5 (2018). 10.1038/s41467-017-02388-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
72.Ghorbani M., Prasad S., Klauda J. B., and Brooks B. R., J. Chem. Phys. 156(18), 184103 (2022). 10.1063/5.0085607 [DOI] [PMC free article] [PubMed] [Google Scholar]
73.Sidky H., Chen W., and Ferguson A. L., J. Phys. Chem. B 123(38), 7999–8009 (2019). 10.1021/acs.jpcb.9b05578 [DOI] [PubMed] [Google Scholar]
74.Husic B. E. and Noé F., J. Chem. Phys. 151(5), 054103 (2019). 10.1063/1.5099194 [DOI] [Google Scholar]
75.Nagel D., Weber A., and Stock G., J. Chem. Theory Comput. 16(12), 7874–7882 (2020). 10.1021/acs.jctc.0c00774 [DOI] [PubMed] [Google Scholar]
76.Chen W., Sidky H., and Ferguson A. L., J. Chem. Phys. 150(21), 214114 (2019). 10.1063/1.5092521 [DOI] [PubMed] [Google Scholar]
77.Schwantes C. R. and Pande V. S., J. Chem. Theory Comput. 11(2), 600–608 (2015). 10.1021/ct5007357 [DOI] [PMC free article] [PubMed] [Google Scholar]
78.Bonati L., Piccini G., and Parrinello M., Proc. Natl. Acad. Sci. U. S. A. 118(44), e2113533118 (2021). 10.1073/pnas.2113533118 [DOI] [PMC free article] [PubMed] [Google Scholar]
79.Wehmeyer C. and Noé F., J. Chem. Phys. 148(24), 241703 (2018). 10.1063/1.5011399 [DOI] [PubMed] [Google Scholar]
80.Hernández C. X., Wayment-Steele H. K., Sultan M. M., Husic B. E., and Pande V. S., Phys. Rev. E 97(6), 062412 (2018). 10.1103/physreve.97.062412 [DOI] [PMC free article] [PubMed] [Google Scholar]
81.Wang Y. H., Ribeiro J. M. L., and Tiwary P., Nat. Commun. 10, 3573 (2019). 10.1038/s41467-019-11405-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
82.Wang D. D. and Tiwary P., J. Chem. Phys. 154(13), 134111 (2021). 10.1063/5.0038198 [DOI] [PubMed] [Google Scholar]
83.Ferguson A. L., Panagiotopoulos A. Z., Kevrekidis I. G., and Debenedetti P. G., Chem. Phys. Lett. 509(1–3), 1–11 (2011). 10.1016/j.cplett.2011.04.066 [DOI] [Google Scholar]
84.Wu H. and Noé F., J. Chem. Phys. 160, 044109 (2024). 10.1063/5.0176078 [DOI] [PubMed] [Google Scholar]
85.Mitsutake A., Iijima H., and Takano H., J. Chem. Phys. 135(16), 164102 (2011). 10.1063/1.3652959 [DOI] [PubMed] [Google Scholar]
86.Peng J.-h., Wang W., Yu Y.-q., Gu H.-l., and Huang X., Chin. J. Chem. Phys. 31(4), 404–420 (2018). 10.1063/1674-0068/31/cjcp1806147 [DOI] [Google Scholar]
87.Lloyd S. P., IEEE Trans. Inf. Theory 28(2), 129–137 (1982). 10.1109/tit.1982.1056489 [DOI] [Google Scholar]
88.Hochbaum D. S. and Shmoys D. B., Math. Oper. Res. 10(2), 180–184 (1985). 10.1287/moor.10.2.180 [DOI] [Google Scholar]
89.Zhao Y., Sheong F. K., Sun J., Sander P., and Huang X., J. Comput. Chem. 34(2), 95–104 (2013). 10.1002/jcc.23110 [DOI] [PubMed] [Google Scholar]
90.Park H. S. and Jun C. H., Expert Syst. Appl. 36(2), 3336–3341 (2009). 10.1016/j.eswa.2008.01.039 [DOI] [PMC free article] [PubMed] [Google Scholar]
91.Sheong F. K., Silva D. A., Meng L. M., Zhao Y. T., and Huang X. H., J. Chem. Theory Comput. 11(1), 17–27 (2015). 10.1021/ct5007168 [DOI] [PubMed] [Google Scholar]
92.Klem H., Hocky G. M., and McCullagh M., J. Chem. Theory Comput. 18(5), 3218–3230 (2022). 10.1021/acs.jctc.1c01290 [DOI] [PMC free article] [PubMed] [Google Scholar]
93.Liu S., Zhu L. Z., Sheong F. K., Wang W., and Huang X. H., J. Comput. Chem. 38(3), 152–160 (2017). 10.1002/jcc.24664 [DOI] [PubMed] [Google Scholar]
94.Deuflhard P., Huisinga W., Fischer A., and Schütte C., Linear Algebra Appl. 315(1–3), 39–59 (2000). 10.1016/s0024-3795(00)00095-1 [DOI] [Google Scholar]
95.Deuflhard P. and Weber M., Linear Algebra Appl. 398, 161–184 (2005). 10.1016/j.laa.2004.10.026 [DOI] [Google Scholar]
96.Röblitz S. and Weber M., Adv. Data Anal. Classif. 7(2), 147–179 (2013). 10.1007/s11634-013-0134-6 [DOI] [Google Scholar]
97.Yao Y., Cui R. Z., Bowman G. R., Silva D. A., Sun J., and Huang X. H., J. Chem. Phys. 138(17), 174106 (2013). 10.1063/1.4802007 [DOI] [PubMed] [Google Scholar]
98.Wang W., Liang T., Sheong F. K., Fan X., and Huang X., J. Chem. Phys. 149(7), 072337 (2018). 10.1063/1.5027001 [DOI] [PubMed] [Google Scholar]
99.Wan H. B. and Voelz V. A., J. Chem. Phys. 152(2), 024103 (2020). 10.1063/1.5142457 [DOI] [PMC free article] [PubMed] [Google Scholar]
100.McGibbon R. T. and Pande V. S., J. Chem. Phys. 142(12), 124105 (2015). 10.1063/1.4916292 [DOI] [PMC free article] [PubMed] [Google Scholar]
101.Harrigan M. P., Sultan M. M., Hernandez C. X., Husic B. E., Eastman P., Schwantes C. R., Beauchamp K. A., McGibbon R. T., and Pande V. S., Biophys. J. 112(1), 10–15 (2017). 10.1016/j.bpj.2016.10.042 [DOI] [PMC free article] [PubMed] [Google Scholar]
102.Bowman G. R., Huang X., and Pande V. S., Methods 49(2), 197–201 (2009). 10.1016/j.ymeth.2009.04.013 [DOI] [PMC free article] [PubMed] [Google Scholar]
103.Scherer M. K., Trendelkamp-Schroer B., Paul F., Perez-Hernandez G., Hoffmann M., Plattner N., Wehmeyer C., Prinz J. H., and Noe F., J. Chem. Theory Comput. 11(11), 5525–5542 (2015). 10.1021/acs.jctc.5b00743 [DOI] [PubMed] [Google Scholar]
104.Hornak V., Abel R., Okur A., Strockbine B., Roitberg A., and Simmerling C., Proteins 65(3), 712–725 (2006). 10.1002/prot.21123 [DOI] [PMC free article] [PubMed] [Google Scholar]
105.Jorgensen W. L., Chandrasekhar J., Madura J. D., Impey R. W., and Klein M. L., J. Chem. Phys. 79(2), 926–935 (1983). 10.1063/1.445869 [DOI] [Google Scholar]
106.Lindorff-Larsen K., Piana S., Dror R. O., and Shaw D. E., Science 334(6055), 517–520 (2011). 10.1126/science.1208351 [DOI] [PubMed] [Google Scholar]
107.Kubelka J., Henry E. R., Cellmer T., Hofrichter J., and Eaton W. A., Proc. Natl. Acad. Sci. U. S. A. 105(48), 18655–18662 (2008). 10.1073/pnas.0808600105 [DOI] [PMC free article] [PubMed] [Google Scholar]
108.Nagel D., Sartore S., and Stock G., J. Phys. Chem. Lett. 14(31), 6956–6967 (2023). 10.1021/acs.jpclett.3c01561 [DOI] [PubMed] [Google Scholar]
109.Banushkina P. V. and Krivov S. V., J. Chem. Theory Comput. 9(12), 5257–5266 (2013). 10.1021/ct400651z [DOI] [PMC free article] [PubMed] [Google Scholar]
110.Nagel D., Sartore S., and Stock G., J. Chem. Theory Comput. 19(11), 3391–3405 (2023). 10.1021/acs.jctc.3c00240 [DOI] [PubMed] [Google Scholar]
111.Piana S., Lindorff-Larsen K., and Shaw D. E., Proc. Natl. Acad. Sci. U. S. A. 109(44), 17845–17850 (2012). 10.1073/pnas.1201811109 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

[c1] 1.Henzler-Wildman K. and Kern D., Nature 450(7172), 964–972 (2007). 10.1038/nature06522 [DOI] [PubMed] [Google Scholar]

[c2] 2.Bahar I., Lezon T. R., Yang L. W., and Eyal E., Annu. Rev. Biophys. 39, 23–42 (2010). 10.1146/annurev.biophys.093008.131258 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c3] 3.Brueckner F., Ortiz J., and Cramer P., Curr. Opin. Struct. Biol. 19(3), 294–299 (2009). 10.1016/j.sbi.2009.04.005 [DOI] [PubMed] [Google Scholar]

[c4] 4.Zhang L., Pardo-Avila F., Unarta I. C., Cheung P. P., Wang G., Wang D., and Huang X., Acc. Chem. Res. 49(4), 687–694 (2016). 10.1021/acs.accounts.5b00536 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c5] 5.Bowman G. R., Bolin E. R., Hart K. M., Maguire B. C., and Marqusee S., Proc. Natl. Acad. Sci. U. S. A. 112(9), 2734–2739 (2015). 10.1073/pnas.1417811112 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c6] 6.Wagner J. R., Lee C. T., Durrant J. D., Malmstrom R. D., Feher V. A., and Amaro R. E., Chem. Rev. 116(11), 6370–6390 (2016). 10.1021/acs.chemrev.5b00631 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c7] 7.Unarta I. C., Cao S., Kubo S., Wang W., Cheung P. P., Gao X., Takada S., and Huang X., Proc. Natl. Acad. Sci. U. S. A. 118(17), e2024324118 (2021). 10.1073/pnas.2024324118 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c8] 8.Chodera J. D., Singhal N., Pande V. S., Dill K. A., and Swope W. C., J. Chem. Phys. 126(15), 155101 (2007). 10.1063/1.2714538 [DOI] [PubMed] [Google Scholar]

[c9] 9.Prinz J. H., Wu H., Sarich M., Keller B., Senne M., Held M., Chodera J. D., Schutte C., and Noe F., J. Chem. Phys. 134(17), 174105 (2011). 10.1063/1.3565032 [DOI] [PubMed] [Google Scholar]

[c10] 10.Konovalov K. A., Unarta I. C., Cao S., Goonetilleke E. C., and Huang X., JACS Au 1(9), 1330–1341 (2021). 10.1021/jacsau.1c00254 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c11] 11.Zhang L., Jiang H., Sheong F. K., Pardo-Avila F., Cheung P. P.-H., and Huang X., Methods Enzymol. 578, 343–371 (2016). 10.1016/bs.mie.2016.05.026 [DOI] [PubMed] [Google Scholar]

[c12] 12.Wang W., Cao S., Zhu L., and Huang X., Wiley Interdiscip. Rev.: Comput. Mol. Sci. 8, e1343 (2018). 10.1002/wcms.1343 [DOI] [Google Scholar]

[c13] 13.Pan A. C. and Roux B., J. Chem. Phys. 129(6), 064107 (2008). 10.1063/1.2959573 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c14] 14.Zhang B. W., Dai W., Gallicchio E., He P., Xia J. C., Tan Z. Q., and Levy R. M., J. Phys. Chem. B 120(33), 8289–8301 (2016). 10.1021/acs.jpcb.6b02015 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c15] 15.Morcos F., Chatterjee S., McClendon C. L., Brenner P. R., López-Rendón R., Zintsmaster J., Ercsey-Ravasz M., Sweet C. R., Jacobson M. P., Peng J. W., and Izaguirre J. A., PLoS Comput. Biol. 6(12), e1001015 (2010). 10.1371/journal.pcbi.1001015 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c16] 16.Huang X. H., Bowman G. R., Bacallado S., and Pande V. S., Proc. Natl. Acad. Sci. U. S. A. 106(47), 19765–19769 (2009). 10.1073/pnas.0909088106 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c17] 17.Malmstrom R. D., Lee C. T., Van Wart A. T., and Amaro R. E., J. Chem. Theory Comput. 10(7), 2648–2657 (2014). 10.1021/ct5002363 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c18] 18.Buchete N. V. and Hummer G., J. Phys. Chem. B 112(19), 6057–6069 (2008). 10.1021/jp0761665 [DOI] [PubMed] [Google Scholar]

[c19] 19.Lorpaiboon C., Thiede E. H., Webber R. J., Weare J., and Dinner A. R., J. Phys. Chem. B 124(42), 9354–9364 (2020). 10.1021/acs.jpcb.0c06477 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c20] 20.Qiao Q., Bowman G. R., and Huang X. H., J. Am. Chem. Soc. 135(43), 16092–16101 (2013). 10.1021/ja403147m [DOI] [PubMed] [Google Scholar]

[c21] 21.Noé F., Schütte C., Vanden-Eijnden E., Reich L., and Weikl T. R., Proc. Natl. Acad. Sci. U. S. A. 106(45), 19011–19016 (2009). 10.1073/pnas.0905466106 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c22] 22.Bowman G. R., Voelz V. A., and Pande V. S., Curr. Opin. Struct. Biol. 21(1), 4–11 (2011). 10.1016/j.sbi.2010.10.006 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c23] 23.Deng N. J., Dai W., and Levy R. M., J. Phys. Chem. B 117(42), 12787–12799 (2013). 10.1021/jp401962k [DOI] [PMC free article] [PubMed] [Google Scholar]

[c24] 24.Wan H. B., Ge Y. H., Razavi A., and Voelz V. A., J. Chem. Theory Comput. 16(2), 1333–1348 (2020). 10.1021/acs.jctc.9b01240 [DOI] [PubMed] [Google Scholar]

[c25] 25.Buch I., Giorgino T., and De Fabritiis G., Proc. Natl. Acad. Sci. U. S. A. 108(25), 10184–10189 (2011). 10.1073/pnas.1103547108 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c26] 26.Lawrenz M., Shukla D., and Pande V. S., Sci. Rep. 5, 7918 (2015). 10.1038/srep07918 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c27] 27.Silva D. A., Bowman G. R., Sosa-Peinado A., and Huang X. H., PLoS Comput. Biol. 7(5), e1002054 (2011). 10.1371/journal.pcbi.1002054 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c28] 28.Plattner N. and Noé F., Nat. Commun. 6, 7653 (2015). 10.1038/ncomms8653 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c29] 29.Jiang H., Sheong F. K., Zhu L., Gao X., Bernauer J., and Huang X., PLoS Comput. Biol. 11(7), e1004404 (2015). 10.1371/journal.pcbi.1004404 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c30] 30.Silva D. A., Weiss D. R., Pardo Avila F., Da L. T., Levitt M., Wang D., and Huang X., Proc. Natl. Acad. Sci. U. S. A. 111(21), 7665–7670 (2014). 10.1073/pnas.1315751111 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c31] 31.Kohlhoff K. J., Shukla D., Lawrenz M., Bowman G. R., Konerding D. E., Belov D., Altman R. B., and Pande V. S., Nat. Chem. 6(1), 15–21 (2014). 10.1038/nchem.1821 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c32] 32.Da L. T., Pardo-Avila F., Xu L., Silva D. A., Zhang L., Gao X., Wang D., and Huang X., Nat. Commun. 7, 11244 (2016). 10.1038/ncomms11244 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c33] 33.Da L. T., Chao E., Duan B. G., Zhang C. B., Zhou X., and Yu J., PLoS Comput. Biol. 11(11), e1004624 (2015). 10.1371/journal.pcbi.1004624 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c34] 34.Da L. T., Wang D., and Huang X. H., J. Am. Chem. Soc. 134(4), 2399–2406 (2012). 10.1021/ja210656k [DOI] [PMC free article] [PubMed] [Google Scholar]

[c35] 35.Malmstrom R. D., Kornev A. P., Taylor S. S., and Amaro R. E., Nat. Commun. 6, 7588 (2015). 10.1038/ncomms8588 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c36] 36.Wang B. B., Sexton R. E., and Feig M., Biochim. Biophys. Acta, Gene Regul. Mech. 1860(4), 482–490 (2017). 10.1016/j.bbagrm.2017.02.008 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c37] 37.Khaled M., Gorfe A., and Sayyed-Ahmad A., J. Phys. Chem. B 123(36), 7667–7675 (2019). 10.1021/acs.jpcb.9b05768 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c38] 38.Barros E. P., Demir Ö., Soto J., Cocco M. J., and Amaro R. E., Chem. Sci. 12(5), 1891–1900 (2021). 10.1039/d0sc05053a [DOI] [PMC free article] [PubMed] [Google Scholar]

[c39] 39.Feng J. Y., Selvam B., and Shukla D., Structure 29(8), 922–933.e3 (2021). 10.1016/j.str.2021.03.014 [DOI] [PubMed] [Google Scholar]

[c40] 40.Son C. Y., Yethiraj A., and Cui Q., Proc. Natl. Acad. Sci. U. S. A. 114(42), E8830–E8836 (2017). 10.1073/pnas.1707922114 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c41] 41.Da L. T., Pardo Avila F., Wang D., and Huang X., PLoS Comput. Biol. 9(4), e1003020 (2013). 10.1371/journal.pcbi.1003020 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c42] 42.Qiu Y. R., O’Connor M. S., Xue M. Y., Liu B. J., and Huang X. H., J. Chem. Theory Comput. 19(14), 4728–4742 (2023). 10.1021/acs.jctc.3c00318 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c43] 43.Liu B. J., Xue M. Y., Qiu Y. R., Konovalov K. A., O’Connor M. S., and Huang X. H., J. Chem. Phys. 159(9), 094901 (2023). 10.1063/5.0158903 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c44] 44.Yik A. K.-H., Qiu Y., Unarta I. C., Cao S., and Huang X., in A Practical Guide to Recent Advances in Multiscale Modeling and Simulation of Biomolecules, edited by Wang Y. and Zhou R. (AIP Publishing LLC, 2023). [Google Scholar]

[c45] 45.Liu B., Qiu Y., Goonetilleke E. C., and Huang X., MRS Bull. 47(9), 958–966 (2022). 10.1557/s43577-022-00415-1 [DOI] [Google Scholar]

[c46] 46.Zimmerman M. I., Porter J. R., Ward M. D., Singh S., Vithani N., Meller A., Mallimadugula U. L., Kuhn C. E., Borowsky J. H., Wiewiora R. P., Hurley M. F. D., Harbison A. M., Fogarty C. A., Coffland J. E., Fadda E., Voelz V. A., Chodera J. D., and Bowman G. R., Nat. Chem. 13(7), 651–659 (2021). 10.1038/s41557-021-00707-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c47] 47.Voelz V. A., Bowman G. R., Beauchamp K., and Pande V. S., J. Am. Chem. Soc. 132(5), 1526–1528 (2010). 10.1021/ja9090353 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c48] 48.Dominic A. J. III, Cao S., Montoya-Castillo A., and Huang X., J. Am. Chem. Soc. 145(18), 9916–9927 (2023). 10.1021/jacs.3c01095 [DOI] [PubMed] [Google Scholar]

[c49] 49.Cao S., Montoya-Castillo A., Wang W., Markland T. E., and Huang X., J. Chem. Phys. 153(1), 014105 (2020). 10.1063/5.0010787 [DOI] [PubMed] [Google Scholar]

[c50] 50.Cao S., Qiu Y., Kalin M. L., and Huang X., J. Chem. Phys. 159(13), 134106 (2023). 10.1063/5.0167287 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c51] 51.Dominic A. J. III, Sayer T., Cao S., Markland T. E., Huang X., and Montoya-Castillo A., Proc. Natl. Acad. Sci. U. S. A. 120(12), e2221048120 (2023). 10.1073/pnas.2221048120 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c52] 52.Hegger R. and Stock G., J. Chem. Phys. 130(3), 034106 (2009). 10.1063/1.3058436 [DOI] [PubMed] [Google Scholar]

[c53] 53.Ayaz C., Tepper L., Brunig F. N., Kappler J., Daldrop J. O., and Netz R. R., Proc. Natl. Acad. Sci. U. S. A. 118(31), e2023856118 (2021). 10.1073/pnas.2023856118 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c54] 54.Ayaz C., Scalfi L., Dalton B. A., and Netz R. R., Phys. Rev. E 105(5), 054138 (2022). 10.1103/physreve.105.054138 [DOI] [PubMed] [Google Scholar]

[c55] 55.Noé F., Wu H., Prinz J.-H., and Plattner N., J. Chem. Phys. 139(18), 184114 (2013). 10.1063/1.4828816 [DOI] [PubMed] [Google Scholar]

[c56] 56.Zhu L., Jiang H., Cao S., Unarta I. C., Gao X., and Huang X., Commun. Biol. 4(1), 1345 (2021). 10.1038/s42003-021-02822-7 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c57] 57.Cerrillo J. and Cao J. S., Phys. Rev. Lett. 112(11), 110401 (2014). 10.1103/physrevlett.112.110401 [DOI] [PubMed] [Google Scholar]

[c58] 58.Presse S., Lee J., and Dill K. A., J. Phys. Chem. B 117(2), 495–502 (2013). 10.1021/jp309420u [DOI] [PMC free article] [PubMed] [Google Scholar]

[c59] 59.Noé F. and Fischer S., Curr. Opin. Struct. Biol. 18(2), 154–162 (2008). 10.1016/j.sbi.2008.01.008 [DOI] [PubMed] [Google Scholar]

[c60] 60.Peng S. J., Wang X. W., Zhang L., He S. S., Zhao X. S., Huang X. H., and Chen C. L., Proc. Natl. Acad. Sci. U. S. A. 117(36), 21889–21895 (2020). 10.1073/pnas.2002971117 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c61] 61.Patel R., Goldstein T. A., Dyer E. L., Mirhoseini A., and Baraniuk R. G., arXiv:1505.05208 (2015).

[c62] 62.Litzinger F., Boninsegna L., Wu H., Nüske F., Patel R., Baraniuk R., Noé F., and Clementi C., J. Chem. Theory Comput. 14(5), 2771–2783 (2018). 10.1021/acs.jctc.8b00089 [DOI] [PubMed] [Google Scholar]

[c63] 63.Stacklies W., Seifert C., and Graeter F., BMC Bioinf. 12, 101 (2011). 10.1186/1471-2105-12-101 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c64] 64.Diez G., Nagel D., and Stock G., J. Chem. Theory Comput. 18(8), 5079–5088 (2022). 10.1021/acs.jctc.2c00337 [DOI] [PubMed] [Google Scholar]

[c65] 65.Drineas P. and Mahoney M. W., J. Mach. Learn. Res. 6, 2153–2175 (2005), https://www.jmlr.org/papers/volume6/drineas05a/drineas05a.pdf. [Google Scholar]

[c66] 66.Traag V. A., Waltman L., and van Eck N. J., Sci. Rep. 9(1), 5233 (2019). 10.1038/s41598-019-41695-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[c67] 67.Amadei A., Linssen A. B. M., and Berendsen H. J. C., Proteins: Struct., Funct., Bioinf. 17(4), 412–425 (1993). 10.1002/prot.340170408 [DOI] [PubMed] [Google Scholar]

[c68] 68.Schwantes C. R. and Pande V. S., J. Chem. Theory Comput. 9(4), 2000–2009 (2013). 10.1021/ct300878a [DOI] [PMC free article] [PubMed] [Google Scholar]

[c69] 69.Pérez-Hernández G., Paul F., Giorgino T., De Fabritiis G., and Noé F., J. Chem. Phys. 139(1), 015102 (2013). 10.1063/1.4811489 [DOI] [PubMed] [Google Scholar]

[c70] 70.Naritomi Y. and Fuchigami S., J. Chem. Phys. 139(21), 215102 (2013). 10.1063/1.4834695 [DOI] [PubMed] [Google Scholar]

[c71] 71.Mardt A., Pasquali L., Wu H., and Noé F., Nat. Commun. 9, 5 (2018). 10.1038/s41467-017-02388-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c72] 72.Ghorbani M., Prasad S., Klauda J. B., and Brooks B. R., J. Chem. Phys. 156(18), 184103 (2022). 10.1063/5.0085607 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c73] 73.Sidky H., Chen W., and Ferguson A. L., J. Phys. Chem. B 123(38), 7999–8009 (2019). 10.1021/acs.jpcb.9b05578 [DOI] [PubMed] [Google Scholar]

[c74] 74.Husic B. E. and Noé F., J. Chem. Phys. 151(5), 054103 (2019). 10.1063/1.5099194 [DOI] [Google Scholar]

[c75] 75.Nagel D., Weber A., and Stock G., J. Chem. Theory Comput. 16(12), 7874–7882 (2020). 10.1021/acs.jctc.0c00774 [DOI] [PubMed] [Google Scholar]

[c76] 76.Chen W., Sidky H., and Ferguson A. L., J. Chem. Phys. 150(21), 214114 (2019). 10.1063/1.5092521 [DOI] [PubMed] [Google Scholar]

[c77] 77.Schwantes C. R. and Pande V. S., J. Chem. Theory Comput. 11(2), 600–608 (2015). 10.1021/ct5007357 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c78] 78.Bonati L., Piccini G., and Parrinello M., Proc. Natl. Acad. Sci. U. S. A. 118(44), e2113533118 (2021). 10.1073/pnas.2113533118 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c79] 79.Wehmeyer C. and Noé F., J. Chem. Phys. 148(24), 241703 (2018). 10.1063/1.5011399 [DOI] [PubMed] [Google Scholar]

[c80] 80.Hernández C. X., Wayment-Steele H. K., Sultan M. M., Husic B. E., and Pande V. S., Phys. Rev. E 97(6), 062412 (2018). 10.1103/physreve.97.062412 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c81] 81.Wang Y. H., Ribeiro J. M. L., and Tiwary P., Nat. Commun. 10, 3573 (2019). 10.1038/s41467-019-11405-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c82] 82.Wang D. D. and Tiwary P., J. Chem. Phys. 154(13), 134111 (2021). 10.1063/5.0038198 [DOI] [PubMed] [Google Scholar]

[c83] 83.Ferguson A. L., Panagiotopoulos A. Z., Kevrekidis I. G., and Debenedetti P. G., Chem. Phys. Lett. 509(1–3), 1–11 (2011). 10.1016/j.cplett.2011.04.066 [DOI] [Google Scholar]

[c84] 84.Wu H. and Noé F., J. Chem. Phys. 160, 044109 (2024). 10.1063/5.0176078 [DOI] [PubMed] [Google Scholar]

[c85] 85.Mitsutake A., Iijima H., and Takano H., J. Chem. Phys. 135(16), 164102 (2011). 10.1063/1.3652959 [DOI] [PubMed] [Google Scholar]

[c86] 86.Peng J.-h., Wang W., Yu Y.-q., Gu H.-l., and Huang X., Chin. J. Chem. Phys. 31(4), 404–420 (2018). 10.1063/1674-0068/31/cjcp1806147 [DOI] [Google Scholar]

[c87] 87.Lloyd S. P., IEEE Trans. Inf. Theory 28(2), 129–137 (1982). 10.1109/tit.1982.1056489 [DOI] [Google Scholar]

[c88] 88.Hochbaum D. S. and Shmoys D. B., Math. Oper. Res. 10(2), 180–184 (1985). 10.1287/moor.10.2.180 [DOI] [Google Scholar]

[c89] 89.Zhao Y., Sheong F. K., Sun J., Sander P., and Huang X., J. Comput. Chem. 34(2), 95–104 (2013). 10.1002/jcc.23110 [DOI] [PubMed] [Google Scholar]

[c90] 90.Park H. S. and Jun C. H., Expert Syst. Appl. 36(2), 3336–3341 (2009). 10.1016/j.eswa.2008.01.039 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c91] 91.Sheong F. K., Silva D. A., Meng L. M., Zhao Y. T., and Huang X. H., J. Chem. Theory Comput. 11(1), 17–27 (2015). 10.1021/ct5007168 [DOI] [PubMed] [Google Scholar]

[c92] 92.Klem H., Hocky G. M., and McCullagh M., J. Chem. Theory Comput. 18(5), 3218–3230 (2022). 10.1021/acs.jctc.1c01290 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c93] 93.Liu S., Zhu L. Z., Sheong F. K., Wang W., and Huang X. H., J. Comput. Chem. 38(3), 152–160 (2017). 10.1002/jcc.24664 [DOI] [PubMed] [Google Scholar]

[c94] 94.Deuflhard P., Huisinga W., Fischer A., and Schütte C., Linear Algebra Appl. 315(1–3), 39–59 (2000). 10.1016/s0024-3795(00)00095-1 [DOI] [Google Scholar]

[c95] 95.Deuflhard P. and Weber M., Linear Algebra Appl. 398, 161–184 (2005). 10.1016/j.laa.2004.10.026 [DOI] [Google Scholar]

[c96] 96.Röblitz S. and Weber M., Adv. Data Anal. Classif. 7(2), 147–179 (2013). 10.1007/s11634-013-0134-6 [DOI] [Google Scholar]

[c97] 97.Yao Y., Cui R. Z., Bowman G. R., Silva D. A., Sun J., and Huang X. H., J. Chem. Phys. 138(17), 174106 (2013). 10.1063/1.4802007 [DOI] [PubMed] [Google Scholar]

[c98] 98.Wang W., Liang T., Sheong F. K., Fan X., and Huang X., J. Chem. Phys. 149(7), 072337 (2018). 10.1063/1.5027001 [DOI] [PubMed] [Google Scholar]

[c99] 99.Wan H. B. and Voelz V. A., J. Chem. Phys. 152(2), 024103 (2020). 10.1063/1.5142457 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c100] 100.McGibbon R. T. and Pande V. S., J. Chem. Phys. 142(12), 124105 (2015). 10.1063/1.4916292 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c101] 101.Harrigan M. P., Sultan M. M., Hernandez C. X., Husic B. E., Eastman P., Schwantes C. R., Beauchamp K. A., McGibbon R. T., and Pande V. S., Biophys. J. 112(1), 10–15 (2017). 10.1016/j.bpj.2016.10.042 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c102] 102.Bowman G. R., Huang X., and Pande V. S., Methods 49(2), 197–201 (2009). 10.1016/j.ymeth.2009.04.013 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c103] 103.Scherer M. K., Trendelkamp-Schroer B., Paul F., Perez-Hernandez G., Hoffmann M., Plattner N., Wehmeyer C., Prinz J. H., and Noe F., J. Chem. Theory Comput. 11(11), 5525–5542 (2015). 10.1021/acs.jctc.5b00743 [DOI] [PubMed] [Google Scholar]

[c104] 104.Hornak V., Abel R., Okur A., Strockbine B., Roitberg A., and Simmerling C., Proteins 65(3), 712–725 (2006). 10.1002/prot.21123 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c105] 105.Jorgensen W. L., Chandrasekhar J., Madura J. D., Impey R. W., and Klein M. L., J. Chem. Phys. 79(2), 926–935 (1983). 10.1063/1.445869 [DOI] [Google Scholar]

[c106] 106.Lindorff-Larsen K., Piana S., Dror R. O., and Shaw D. E., Science 334(6055), 517–520 (2011). 10.1126/science.1208351 [DOI] [PubMed] [Google Scholar]

[c107] 107.Kubelka J., Henry E. R., Cellmer T., Hofrichter J., and Eaton W. A., Proc. Natl. Acad. Sci. U. S. A. 105(48), 18655–18662 (2008). 10.1073/pnas.0808600105 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c108] 108.Nagel D., Sartore S., and Stock G., J. Phys. Chem. Lett. 14(31), 6956–6967 (2023). 10.1021/acs.jpclett.3c01561 [DOI] [PubMed] [Google Scholar]

[c109] 109.Banushkina P. V. and Krivov S. V., J. Chem. Theory Comput. 9(12), 5257–5266 (2013). 10.1021/ct400651z [DOI] [PMC free article] [PubMed] [Google Scholar]

[c110] 110.Nagel D., Sartore S., and Stock G., J. Chem. Theory Comput. 19(11), 3391–3405 (2023). 10.1021/acs.jctc.3c00240 [DOI] [PubMed] [Google Scholar]

[c111] 111.Piana S., Lindorff-Larsen K., and Shaw D. E., Proc. Natl. Acad. Sci. U. S. A. 109(44), 17845–17850 (2012). 10.1073/pnas.1201811109 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Tutorial on how to build non-Markovian dynamic models from molecular dynamics simulations for studying protein conformational changes

Yue Wu

Siqin Cao

Yunrui Qiu

Xuhui Huang

Abstract

I. INTRODUCTION

II. THEORIES OF NON-MARKOVIAN DYNAMIC MODELS FOR PROTEIN DYNAMICS

A. Liouville equation and dynamic operators

B. Markov state model (MSM) theory

C. Generalized master equation (GME) theory

D. Integrative GME (IGME) theory

III. PROTOCOL FOR CONSTRUCTING GME MODELS TO STUDY PROTEIN DYNAMICS

FIG. 1.

A. Feature selection

B. Dimensionality reduction

C. Geometric clustering to generate microstates

D. Kinetic lumping to produce metastable macrostates

E. Constructing Markov state models

F. Constructing qMSM and IGME models

IV. TUTORIAL EXAMPLES

A. Alanine dipeptide

FIG. 2.

SCHEME 1.

SCHEME 2.

SCHEME 3.

SCHEME 4.

FIG. 3.

SCHEME 5.

SCHEME 6.

FIG. 4.

SCHEME 7.

FIG. 5.

SCHEME 8.

B. Villin headpiece (HP35)

FIG. 6.

SCHEME 9.

SCHEME 10.

SCHEME 11.

SCHEME 12.

FIG. 7.

FIG. 8.

FIG. 9.

SCHEME 13.

FIG. 10.

SCHEME 14.

V. CONCLUSIONS

ACKNOWLEDGMENTS

APPENDIX: THE LEAST-SQUARES FITTING METHOD TO FIT HYPERPARAMETERS IN IGME

AUTHOR DECLARATIONS

Conflict of Interest

Author Contributions

DATA AVAILABILITY

REFERENCES

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases