Author manuscript; available in PMC: 2011 Sep 1.
Published in final edited form as: Methods. 2010 Jun 4;52(1):99–105. doi: 10.1016/j.ymeth.2010.06.002

Everything you wanted to know about Markov State Models but were afraid to ask

Vijay S Pande 1,2, Kyle Beauchamp 1, Gregory R Bowman 1
PMCID: PMC2933958  NIHMSID: NIHMS218238  PMID: 20570730

Abstract

Simulating protein folding has been a challenging problem for decades due to the long timescales involved (compared with what is possible to simulate) and the challenges of gaining insight from the complex nature of the resulting simulation data. Markov State Models (MSMs) present a means to tackle both of these challenges, yielding simulations on experimentally relevant timescales, statistical significance, and coarse grained representations that are readily humanly understandable. Here, we review this method with the intended audience of non-experts, in order to introduce the method to a broader audience. We review the motivations, methods, and caveats of MSMs, as well as some recent highlights of applications of the method. We conclude by discussing how this approach is part of a paradigm shift in how one uses simulations, away from anecdotal single-trajectory approaches to a more comprehensive statistical approach.

Introduction

Studying protein folding, either by experiment or simulation, is fraught with many challenges. In conjunction with its biological significance, these challenges make protein folding an important problem to study from a methodological perspective [1, 2]. Moreover, approaches which have proven their utility in addressing folding related questions have often found broad applicability to a wide range of other problems as well [3–5].

From the computational point of view, one of the primary challenges of protein folding simulation is the ability to reach experimentally relevant timescales, such as the millisecond to second timescale, with sufficiently detailed simulations in order to make quantitative predictions of experiment. However, often overlooked is the additional challenge that even if such simulations could be performed, one would need to have some means to analyze the resulting flood of data in a methodical and unbiased fashion. Finally, it is important to note that in many ways, these challenges are not unique to simulation, as single molecule experiments have similar challenges: one would like to ideally use as little data as possible to build models, long trajectories can at times be challenging (due to photobleaching or other technical challenges), and the analysis of the resulting data to gain insight is itself often a challenge.

Markov State Models (MSMs), kinetic models of the process under study, typically constructed from detailed simulations such as Molecular Dynamics, have been proposed as a scheme to address these challenges. Moreover, this approach represents a paradigm shift in how one uses simulations, away from anecdotal single-trajectory approaches to a more comprehensive statistical approach. There have been many reviews of MSM methodology (eg see [6–8]), but these reviews have focused on theoretical and computational details, and are intended for theorists and practitioners of these methods. Here, our intention is to describe MSMs primarily for an experimentalist audience, with the primary goal of explaining in detail how MSMs work such that their strengths and weaknesses as applied to computer simulations of folding (and their predictions of experiment) can be understood. We stress that this is not meant to be a thorough review of the entire MSM field, but rather a basic “how to” guide to MSM construction for non-experts.

How does one build an MSM?

Goals

Before diving into the details of how one constructs an MSM, it is useful to remind the reader of the goals of MSM building. Here, we concentrate on three primary goals:

  1. The ability to quantitatively predict a broad array of experimental data

  2. To use input data (either from simulations or experiment) as parsimoniously as possible

  3. To build simplified models that are readily understood by human beings such that new insight can be gained; these models are not “cartoons” but rather coarse grained representations of the more detailed models employed for quantitative comparisons

Overall framework

The overall framework of an MSM (solving the Master equation) is, at its heart, similar to methods already familiar to biochemists and structural biologists for describing things like chemical reactions. Specifically, we wish to build a model with a series of N states and to parameterize the model with the rates between these states. However, unlike simple biochemical models, which typically have just a handful of states, MSMs often have many states, i.e. thousands to potentially millions. The rationale for having many states is that this allows one to construct a very high resolution model of the intrinsic dynamics as well as to more easily parameterize this model from relatively short molecular dynamics trajectories. That is, because the kinetic distance between adjacent states is small, short simulations are sufficient to observe transitions between them.

The specific challenges for building an MSM can be broken down into 1) how does one define states in a kinetically meaningful scheme and 2) how can one use this state decomposition in order to build a transition matrix in an efficient manner. Once this is performed, then the model is ready for both the goals of quantitative prediction of experiment as well as yielding qualitative insight into the mechanism at hand.

Initial data set

One must start with some initial data set. Below, we will concentrate on how MSMs are created from molecular dynamics simulation, but we stress that in principle, these methods are sufficiently general that they could be more broadly applied, for example also to other simulation methods interested in kinetics or thermodynamics. For molecular dynamics simulation, the initial data set could take several forms. In some cases, we are in an extremely data rich regime (eg many long trajectories that start unfolded and end folded [9]). In other cases, we are in a data poor regime, where there are a few trajectories that start unfolded (and perhaps some that start folded), but there are no (or few) single trajectories that traverse from the unfolded to folded states.

In the data rich regime, MSMs can help analyze the data set in a kinetically meaningful way. In the data poor regime, MSMs can also be used to direct future data collection (eg select the starting points of new MD simulations) in order to improve a model as efficiently as possible. Such methods have collectively been referred to as “adaptive sampling” methods. Since one must build an MSM before performing any additional adaptive sampling we will first describe more details of MSM building and then return to the subject of adaptive sampling.

Finally, we stress that many different types of simulations could be useful in creating the initial data set. One scheme is to “seed” MD simulations, i.e. start them in potentially relevant states a priori. For example, while thermodynamic sampling methods, such as Replica Exchange or Simulated Tempering [10–13], do not follow physical kinetics, they could be used to initially sample space for seeding (or in some cases to directly build an MSM [14]). Analogously, simplified force fields (eg coarse grained, implicit solvent, etc) could be used to generate seeds, which would be followed by full-force field MD simulation.

One may be concerned that the errors in simpler methods (or the non-kinetically relevant aspects of thermodynamic sampling methods) would lead to seeds that are not useful. However, even a few useful seeds can be a significant speed up in the convergence of adaptive sampling (see below). Moreover, it is important to stress that the worst case scenario for “bad” seeds (i.e. seeds that do not lead to productive improvement of the MSM transition matrix) is that trajectories from these seeds do not make transitions between important states; while this uses computer time (and thus may affect the efficiency of the MSM creation process), this will not taint the calculation with any inaccurate data [15, 16]. Finally, this seeding approach could also be used as a means to test the kinetic fidelity of simpler methods, so even results which lead to seeds which do not play a role could have scientifically important implications.

Building microstates

In order to build an MSM, i.e. to find a series of kinetically relevant states and the transition rates between them, we need to have a means to group structures in a kinetically meaningful manner. While structural clustering has been performed on simulation data for decades, previous methods have not explicitly considered kinetic properties [17, 18].

MSM building techniques include kinetic information but begin with a traditional clustering method (eg k-means or k-centers) using a structural metric [8]. Considering the emphasis on kinetic clustering, this may sound strange; however, even though we want to define states by kinetic criteria, it is important to keep in mind that one cannot define kinetics without some sense of geometric boundaries, i.e. one cannot define a rate between two objects without delineating where each starts and ends. Also, the kinetic relevance of this geometrical clustering can be tested (see the “Validating the self-consistency of an MSM” section below). In fact, the structural resolution of this clustering can even make it appropriate for use in making quantitative connections with experiments [8, 19, 20].

This initial structural clustering is done to create many (eg 10,000 to 100,000 in protein examples so far) so-called “microstates” (where the initial criterion for clustering is the number of states). Due to the large number of microstates, conformations within the same microstate typically have RMSDs of no more than 2Å to 3Å [8, 21]. This high degree of structural similarity implies a kinetic similarity, allowing for subsequent kinetic clustering of microstates into larger macrostates. Identifying kinetic relationships between microstates requires constructing a transition matrix.
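To make the structural clustering step concrete, the following is a minimal sketch of a greedy k-centers procedure. It assumes toy two-dimensional points and a Euclidean distance as hypothetical stand-ins for protein conformations and an RMSD metric; the function names are illustrative and are not taken from any MSM package.

```python
import math

def distance(a, b):
    """Euclidean distance; a hypothetical stand-in for an RMSD metric."""
    return math.dist(a, b)

def k_centers(points, k):
    """Greedy k-centers: repeatedly promote the point that is farthest
    from all existing centers to be the next center."""
    centers = [0]  # arbitrary choice: the first point is the first center
    # dist_to_center[i] = distance from point i to its nearest center so far
    dist_to_center = [distance(p, points[0]) for p in points]
    while len(centers) < k:
        new_center = max(range(len(points)), key=lambda i: dist_to_center[i])
        centers.append(new_center)
        for i, p in enumerate(points):
            d = distance(p, points[new_center])
            dist_to_center[i] = min(dist_to_center[i], d)
    # The largest remaining distance is the worst-case cluster radius.
    return centers, max(dist_to_center)

# Toy data: three tight groups of "conformations" (made-up coordinates).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.2)]
centers, radius = k_centers(points, 3)
print(centers, radius)
```

The returned radius plays the role of the maximum within-state RMSD quoted above: adding more centers shrinks it, which is why a large number of microstates yields structurally tight states.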

One may be curious how the number of microstates would scale with the system size. This is currently known only for a range of system sizes from ~10 to ~100 residues (including unpublished results at the large scale). We have found so far that the number of states depends directly on the complexity of the state space, but that the length of the protein is only a small part of this complexity. For example, comparing villin (36 residue alpha helical bundle protein), NTL9-39 (39 residue mini beta barrel), and lambda repressor (80 residue alpha helical protein), the number of microstates does not monotonically increase with protein length, with NTL9 having a much more complex space, likely due to its beta sheet nature (requiring many non-local contacts). It remains to be seen how the number of microstates will scale with increasing chain length; this is an important issue for future research.

Building a transition matrix

With the set of microstates, one can construct a microstate transition matrix. To do this, we take the MD data available (either from the initial data set or possibly also including data from adaptive sampling rounds) and assign each structure in the MD trajectory to a microstate. This “classification” step consists of comparing structures in MD trajectories to microstates, one by one, to find out which microstate is closest and then assigning the structure to that microstate. The result is a translation of the MD trajectory from a series of structures over time to a series of microstates over time.
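The classification step can be sketched as a nearest-center lookup. The example below assumes that each microstate is represented by a single center and, as a hypothetical simplification, uses Euclidean distance between toy 2D points in place of a structural metric such as RMSD.

```python
import math

def assign_to_microstates(trajectory, centers):
    """Map each frame to the index of its nearest microstate center."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(frame, centers[c]))
        for frame in trajectory
    ]

# Hypothetical microstate centers and a short made-up trajectory.
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
trajectory = [(0.2, 0.1), (0.4, -0.3), (4.6, 4.9), (5.3, 5.2), (9.7, 0.4), (5.1, 4.8)]

microstate_traj = assign_to_microstates(trajectory, centers)
print(microstate_traj)  # a series of microstate indices over time
```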

Next, we use these microstate trajectories to count how many transitions are seen between each pair of microstates i and j at some lag time τ, i.e. if a trajectory is at microstate i at time t, then how many times did the simulation go to state j at time t + τ? We call this set of quantities the count matrix C_ij(τ). Finally, one can estimate the probability of going from i to j in time τ (written as P_ij(τ)) from the fraction of counts that started in i and went to j, compared to other possible states, i.e. P_ij(τ) = C_ij(τ) / Σ_k C_ik(τ).
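The counting and normalization described here can be sketched as follows. The example assumes that the lag time is given as an integer number of trajectory frames and that every pair of frames separated by the lag is counted (a sliding window, which is one common choice but an assumption here); the trajectories are made-up microstate sequences.

```python
def count_matrix(trajectories, n_states, lag):
    """Count transitions i -> j observed `lag` frames apart."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for t in range(len(traj) - lag):
            counts[traj[t]][traj[t + lag]] += 1
    return counts

def transition_matrix(counts):
    """Row-normalize counts: P[i][j] = C[i][j] / sum_k C[i][k]."""
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs

# Two short, made-up microstate trajectories over three states.
trajs = [[0, 0, 1, 1, 0, 0, 2, 2, 2, 0], [1, 1, 1, 0, 2]]
C = count_matrix(trajs, n_states=3, lag=1)
P = transition_matrix(C)
print(C)
print(P)
```

Note that transitions are never counted across the boundary between two separate trajectories, which is what allows many short, independent simulations to be pooled into one matrix.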

Given sufficient data, microstate transition matrices can be used to predict experimental observables or identify kinetically related microstates. However, by estimating the transition probability matrix from the counts, one can encounter problems when there are only a few counts. Discreteness effects and shot noise will lead to noise in the transition probabilities. At times, this does not matter, since some transitions are less important than others. However, one can improve on the estimate of the transition matrix by appealing to Bayesian techniques and including well-chosen prior probabilities [22–24]. For example, a prior (i.e. the imposition of a set of assumptions) that includes the effect of detailed balance can greatly enhance the effectiveness of data in the small count limit [23].
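As a rough illustration of how prior assumptions can regularize a sparse count matrix, the sketch below combines two simple devices: a uniform pseudocount and symmetrization of the counts, which enforces detailed balance. This is a deliberately crude estimator chosen for brevity and is an assumption of this example; it is not the Bayesian scheme of the cited works.

```python
def estimate_with_prior(counts, pseudocount=0.5):
    """Symmetrize counts (imposing detailed balance) and add a uniform
    pseudocount, then row-normalize. Returns (P, pi), where pi is the
    stationary distribution implied by the symmetrized counts."""
    n = len(counts)
    sym = [
        [0.5 * (counts[i][j] + counts[j][i]) + pseudocount for j in range(n)]
        for i in range(n)
    ]
    row_sums = [sum(row) for row in sym]
    total = sum(row_sums)
    P = [[sym[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
    pi = [r / total for r in row_sums]
    return P, pi

# Made-up counts; note that the 1 -> 2 transition was never observed.
C = [[2, 1, 2], [2, 3, 0], [1, 0, 2]]
P, pi = estimate_with_prior(C)
print(P[1][2])  # small but nonzero thanks to the pseudocount
```

Because the regularized counts are symmetric, pi[i] * P[i][j] equals pi[j] * P[j][i] for every pair of states, which is exactly the detailed balance condition.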

How much sampling does one need to build a reasonably converged MSM? One way to estimate this is to consider the number of transitions to calculate and the amount of simulation per transition. For a system with N = 3 × 10⁴ states (corresponding to a typical MSM for protein folding discussed herein), it is tempting to think that one would need to calculate N², or roughly 10⁹, transitions. However, it is important to note that not every microstate is connected to every other. Indeed, the matrix is very sparse and the connectivity does not appear to scale with N², but rather as something on the order of N ln N or N. Thus, for 3 × 10⁴ states, one may expect to have to calculate on the order of 10⁵ transitions. Doing this with simple sampling would require on the order of 10⁵ τ, where τ is the lag time of the MSM (on the order of 1 to 10 ns), yielding a total aggregate simulation time on the 0.1 to 1 millisecond timescale; while this remains a challenging sampling problem, this degree of sampling has been demonstrated in recent cases for aggregate sampling from Folding@home.
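The arithmetic behind this estimate can be checked in a few lines. The sketch assumes, as the text does, that each transition costs roughly one lag time of simulation; the helper name is hypothetical.

```python
import math

n_states = 3e4

dense_transitions = n_states ** 2                   # every pair connected: 9e8, i.e. ~1e9
sparse_transitions = n_states * math.log(n_states)  # N ln N scaling: ~3e5

def aggregate_time_ms(n_transitions, lag_ns):
    """Aggregate simulation time in milliseconds, assuming each
    transition costs about one lag time (1 ms = 1e6 ns)."""
    return n_transitions * lag_ns * 1e-6

# Using the order-of-magnitude figure of 1e5 transitions from the text:
low = aggregate_time_ms(1e5, lag_ns=1.0)    # about 0.1 ms
high = aggregate_time_ms(1e5, lag_ns=10.0)  # about 1 ms
print(dense_transitions, sparse_transitions, low, high)
```

The N ln N count comes out near 3 × 10⁵, the same order of magnitude as the 10⁵ figure used for the time estimate.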

However, we note that the estimate above is a bit simplistic. In particular, not all transitions need to be well-sampled, but rather only those which are uncertainty limiting. Indeed, adaptive methods take advantage of this fact and have estimated an increase of efficiency of 100× to 1000×, suggesting that 10s to 100s of microseconds of aggregate dynamics may be sufficient.

Coarse graining MSMs to gain human intuition

With a well-sampled microstate transition matrix, one has all the elements necessary for a functioning MSM. Indeed, with microstates (which are high resolution, well-defined states), the lag time (τ) is typically fairly small, eg on the ~10 nanosecond timescale for MD simulations of protein folding. This results in a very high resolution model, which is especially useful for making quantitative comparisons to experiment (see below). However, in order to effectively use the resulting model (and especially to gain insight from it in a humanly understandable format), it is often useful to construct a coarse grained MSM.

This process consists of simplifying the microstate transition matrix into fewer states. This simplification can be done in a physically meaningful way by looking at timescales longer than the microstate lag time, eg 100ns instead of 10ns. At this slightly longer timescale, fewer states are kinetically relevant. Indeed, defining states in a “kinetically relevant” way requires that structures within a state can interconvert (i.e. kinetically reach each other) on timescales faster than the lag time. So, increasing the lag time means that states can get larger and more coarse grained. This makes it easier for the MSM to be humanly understandable, especially since the number of states can be arbitrarily small (and thus easier to comprehend), as long as one increases the lag time to match. We stress that this criterion is necessary, but not sufficient, and thus we suggest that coarse grained models be tested (see below) or used primarily for visualization of the model.

But how can one perform this coarse graining (or “lumping”) of states? While microstates are defined by a structural metric, this is where kinetic information must play a role. But where can we get kinetic information at this stage? The natural place is the microstate transition matrix, which directly encodes the kinetics between microstates. Typically, this is done via some sort of spectral clustering [25–27] of the microstate transition matrix, i.e. clustering methods which look at the eigenvalues and eigenvectors of the microstate transition matrix to identify kinetically similar states. These methods are well-developed and we will not go into further details, other than to stress that these methods allow one to define a coarse grained model at arbitrary resolution (high or low) depending on the goals for the model by lumping together kinetically related microstates.
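To give a flavor of how eigenvectors reveal kinetically related states, the sketch below splits a toy four-state model into two macrostates using the sign structure of its slowest-relaxing eigenvector. It assumes a reversible transition matrix whose second-largest eigenvalue is positive, and it uses plain power iteration; it is a simplified illustration in the spirit of spectral lumping, not the specific algorithms of the cited works.

```python
def stationary_distribution(T, n_iter=2000):
    """Left eigenvector for eigenvalue 1, found by repeatedly applying p <- p T."""
    n = len(T)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        p = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
    return p

def slowest_eigenvector(T, pi, n_iter=2000):
    """Right eigenvector for the second-largest eigenvalue via power
    iteration, projecting out the constant eigenvector at each step."""
    n = len(T)
    v = [float(i) for i in range(n)]  # arbitrary non-constant start
    for _ in range(n_iter):
        v = [sum(T[i][j] * v[j] for j in range(n)) for i in range(n)]
        mean = sum(pi[i] * v[i] for i in range(n))  # pi-weighted projection
        v = [x - mean for x in v]
        norm = max(abs(x) for x in v)
        v = [x / norm for x in v]
    return v

# Made-up model: fast exchange within {0, 1} and {2, 3}, slow exchange between.
T = [
    [0.89, 0.10, 0.01, 0.00],
    [0.10, 0.89, 0.00, 0.01],
    [0.01, 0.00, 0.89, 0.10],
    [0.00, 0.01, 0.10, 0.89],
]
pi = stationary_distribution(T)
v = slowest_eigenvector(T, pi)
macrostate = [1 if x > 0 else 0 for x in v]  # lump microstates by sign
print(macrostate)
```

Microstates that share a sign in this eigenvector interconvert quickly with one another and only slowly with the rest, so lumping by sign recovers the two metastable groups.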

Improving on the initial model: adaptive sampling

In the data rich regime, the previous steps can be sufficient for building an MSM. However, typically one is not in this regime and the previous steps result in an MSM which may qualitatively reflect the original dynamics, but many quantitative details could be improved. The natural way to improve the MSM is to perform more MD simulations to get better statistics, but how should one do this? One could choose to just continue the existing MD trajectories to make them longer, but this is not the most efficient scheme. Trajectories which have reached stable states (such as the native state or traps) will get stuck there for a very long time and thus further sampling these states will not greatly enhance the MSM.

Instead, it is natural to assign starting points for new simulations to optimize the ability of the MD data to improve the MSM. But how should one choose starting points? Adaptive sampling methods seek to answer this question by using the existing macrostate transition matrix (and more specifically the statistical uncertainty in its elements) to determine the ideal states to start additional simulations from. While we refer the reader to recent works for details [16, 22, 28, 29], we will present the spirit of how this works below.

Recall that the count matrix (C_ij(τ)) can tell us not just information about the transition probabilities (P_ij(τ)), but also the uncertainty in these probabilities. Seeing 10⁶ out of 2 × 10⁶ counts reach a state is not the same as 1 out of 2. The low count regime will have a great deal of shot noise and will suffer from discreteness effects. Thus, the count matrix can also be used to predict the statistical error in the transition probability matrix, not just its values. By looking at which transitions contribute most to the statistical error of properties of interest (eg the folding rate), one can “on the fly” identify states which are limiting the accuracy of the model and start additional calculations from these states in order to improve the model. This approach is called adaptive sampling [15, 22] since one adaptively modifies which simulations are run in order to optimize the data set used to build MSMs.
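The difference between 10⁶ out of 2 × 10⁶ counts and 1 out of 2 can be quantified with a simple Bayesian estimate. The sketch below assumes a uniform prior on each transition probability, so that the posterior is a Beta distribution, and then ranks states by their summed posterior variance. This ranking is a crude proxy introduced only for illustration; the cited adaptive sampling methods instead propagate uncertainty to an observable of interest.

```python
def beta_posterior_std(k, n):
    """Posterior standard deviation of a probability after seeing k
    successes in n trials, assuming a uniform Beta(1, 1) prior."""
    a, b = k + 1.0, (n - k) + 1.0
    return (a * b / ((a + b) ** 2 * (a + b + 1.0))) ** 0.5

std_many = beta_posterior_std(10**6, 2 * 10**6)  # well-determined estimate
std_few = beta_posterior_std(1, 2)               # dominated by shot noise

def most_uncertain_state(counts):
    """Pick the state whose outgoing probabilities are least certain
    (a hypothetical selection rule for seeding new simulations)."""
    def row_uncertainty(row):
        n = sum(row)
        return sum(beta_posterior_std(k, n) ** 2 for k in row)
    return max(range(len(counts)), key=lambda i: row_uncertainty(counts[i]))

# Made-up counts: state 1 has been visited only three times.
counts = [[500, 480, 20], [2, 1, 0], [40, 0, 960]]
next_start = most_uncertain_state(counts)
print(std_many, std_few, next_start)
```

Both estimates are centered on a probability of one half, yet their uncertainties differ by more than two orders of magnitude, which is the information adaptive sampling exploits.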

How well does this work? Previous tests [15, 22] of adaptive sampling methods have shown a dramatic increase in efficiency, i.e. much less simulation data is needed when adaptive sampling is used. Indeed, recent work has quantitatively shown that adaptive sampling can reduce the time it takes to build a model by a factor of N, where N is the number of parallel simulations run during each round of sampling [30]. This makes physical sense, when one considers that traditional MD will get stuck in traps and over sample some areas and vastly under-sample others. Adaptive sampling opens the door for MSMs to be more than merely a way to build a kinetic model from data, but rather to push the idea further, using the initial model (“knowing what you do not know”) to speed convergence even further.

Validating the self-consistency of an MSM

Considering the challenges involved in constructing an MSM, it is important to first test that the MSM is self-consistent, i.e. agrees with the data used in its construction. These tests can be performed on both the fine-grained (microstate-based) and coarse-grained (lumped) transition matrix. In particular, one challenge that often occurs is whether the lag time is sufficiently long to make the chosen state decomposition Markovian (i.e. do conformations within a state kinetically interconvert on timescales faster than the lag time and only make transitions to other states on slower timescales). There are several means to test this, including the Swope-Pitera eigenvalue test [31], information theoretic approaches [32], Chapman-Kolmogorov tests [19], and Bayesian Model selection approaches [23].
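The idea behind a Chapman-Kolmogorov style check can be sketched briefly: if the dynamics are Markovian at lag time τ, the transition matrix estimated directly at lag kτ should match the k-th power of the matrix estimated at lag τ. The matrices below are made-up numbers chosen for illustration, and the maximum element-wise deviation is just one simple choice of error measure.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def chapman_kolmogorov_error(T_tau, T_ktau, k):
    """Largest deviation between T(tau)^k and the directly estimated T(k*tau)."""
    power = T_tau
    for _ in range(k - 1):
        power = matmul(power, T_tau)
    n = len(T_tau)
    return max(abs(power[i][j] - T_ktau[i][j]) for i in range(n) for j in range(n))

T_tau = [[0.9, 0.1], [0.2, 0.8]]

# Hypothetical estimates at twice the lag time:
T_2tau_consistent = [[0.83, 0.17], [0.34, 0.66]]    # equals T_tau squared
T_2tau_inconsistent = [[0.88, 0.12], [0.25, 0.75]]  # slower than predicted

error_ok = chapman_kolmogorov_error(T_tau, T_2tau_consistent, 2)
error_bad = chapman_kolmogorov_error(T_tau, T_2tau_inconsistent, 2)
print(error_ok, error_bad)
```

In practice the deviation would be judged against the statistical error of the estimates rather than against zero.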

What can be done if the MSM fails the test? One natural approach is to run additional simulations in an adaptive manner in order to improve the model and pass the tests; this can be very efficient, but involves additional simulation which is not always practical. If one chooses not to use any additional simulation, the natural approaches are to either examine longer lag times (at the potential cost of some loss in temporal and/or spatial resolution) or construct a more fine-grained state decomposition (at the potential cost of some loss in statistical precision). We refer the reader to the works cited above for more details, but stress that such tests exist and are a critical step in MSM construction.

Connecting to experimental data

The quantitative prediction of experimental data is a primary goal of MSM creation. How is this done? Like any kinetic model involving states and rates, MSMs can take some initial conditions and report the state probabilities vs time. As in any simulation, in order to connect to experiment, one must be able to connect properties of the state to experimental observables. MSMs facilitate this by only requiring one to relate properties of the state to experiment (eg what is the IR spectrum of the structures in this state?), then use the MSM to calculate the probability that each state is found at a given time t, and then perform a weighted average.
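This recipe amounts to propagating the state populations one lag time at a time and averaging a per-state signal with those populations as weights. The following sketch assumes a made-up two-state model (unfolded and folded) and a hypothetical observable that takes the value 0 in the unfolded state and 1 in the folded state.

```python
def propagate(p, T):
    """Advance state populations by one lag time: p(t + tau) = p(t) T."""
    n = len(T)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

def observable_timecourse(p0, T, state_values, n_steps):
    """Population-weighted average of a per-state observable at times
    0, tau, 2*tau, ..., n_steps*tau."""
    p, signal = list(p0), []
    for _ in range(n_steps + 1):
        signal.append(sum(p_i * a_i for p_i, a_i in zip(p, state_values)))
        p = propagate(p, T)
    return signal

T = [[0.9, 0.1], [0.2, 0.8]]  # made-up transition matrix: unfolded, folded
state_values = [0.0, 1.0]     # hypothetical signal for each state
signal = observable_timecourse([1.0, 0.0], T, state_values, n_steps=50)
print(signal[:3], signal[-1])
```

Starting from a fully unfolded ensemble, the signal relaxes toward the equilibrium folded population (one third for these numbers), mimicking a bulk relaxation experiment.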

Visualization schemes

Finally, once one has done the above, it is natural to employ various visualization schemes to gain insight from the model. The most natural scheme is to examine the states and transitions in the coarse grained MSM for mechanistic properties. More recently, Transition Path Theory (TPT) [19, 33, 34] has been used to calculate the flux of relevant pathways, which aids in visualizing key mechanistic steps.

Examples of recent results

The past five years have seen a flurry of work on Markov State Models. Here we summarize some of the more recent advances. The discussion focuses on work from our own lab and collaborators, and so is by no means a comprehensive review. In particular, it is worth mentioning that works by Noe [7, 19, 24, 35], Hummer [14, 36], Roux [37–39], and Swope and Pitera [31, 40] have also used MSMs or similar paradigms to study protein folding and dynamics.

Early attempts at MSM construction typically required one to build MSMs by hand, i.e. developing schemes to construct a state decomposition for a particular system [41–44]. The development of a fully-automated method [45] for MSM construction thus signaled a significant achievement. The automated algorithm described uses k-medoid clustering and iterative refinement to maximize a measure of state metastability. The success of this method is evident from several examples, including alanine dipeptide, Fs-peptide, and trpzip. In a particularly illustrative example, those authors performed a careful analysis of the alanine dipeptide results. Because the torsional preferences of this system are well understood, the authors could manually construct MSMs based on the Ramachandran plot for the peptide. A key result was that MSMs constructed by their automated algorithm performed as well as the best hand-built models.

The past year has seen new MSM techniques come to fruition. The current generation of tools has proven capable of constructing accurate MSMs from some of the most extensive protein folding datasets available. In one case, using previously performed [9] simulations (545 trajectories of length up to 2 μs) of the double-norleucine mutant of the villin headpiece [46], the MSMBuilder package was used to construct a 10,000 state MSM. While working with such extensive datasets can be challenging (due to the nature of clustering many conformations as well as the rare states which appear), the simple, automated algorithm in MSMBuilder produced a model that could quantitatively reproduce the original dynamics [8]. To demonstrate this, the authors compared several mock-observables (surrogate experimental quantities) as calculated two ways. First, they were calculated directly from the MD simulations, without using the MSM. Second, MSM dynamics were used to calculate the same observables, by propagating the initial state populations through time. State populations and ensemble average RMSD were among the observables computed. As an example, the folded microstate population as a function of time is shown [Figure 1]. In addition to reproducing the raw MD data, this model also showed success in making comparisons to experiment. First, the most populated MSM microstate was found to correspond to the experimentally determined crystal structure, with an RMSD of approximately 2Å; thus, the combination of MD and MSMs can correctly predict the folded structure of villin [Figure 2]. Likewise, microsecond folding kinetics match the known experimentally observed timescales.

Figure 1.

The population of the folded state was monitored throughout the ensemble of 545 trajectories (black). The same population can be calculated by propagating the initial populations using MSM dynamics (blue). That both agree to within error suggests that the MSM accurately models the folding kinetics in this molecular dynamics dataset [8]. The model was constructed from simulations of a fast-folding villin headpiece mutant [9]. This figure was reproduced from Ref. [8].

Figure 2.

The MSM-estimated equilibrium populations can be used to calculate the free energies of microstates. Here the lowest free energy states (most populated) are structurally similar to the experimentally determined crystal structure. The model was constructed from simulations of a fast-folding villin headpiece mutant [8]. This figure was reproduced from Ref. [8].

The past year has seen not just improvements in MSM construction, but also novel methods for interpreting MSMs. In a recent work [19], simulations of the PinWW protein were described using MSMs and Transition Path Theory (TPT). TPT provides a simple formalism for dissecting questions about reaction pathway and mechanism. TPT requires three things: an MSM, a starting state A, and an ending state B. A TPT analysis then decomposes the reaction A → B into individual pathways, along with the reactive flux for each pathway. The preferred mechanism of the reaction can be calculated by comparing reactive flux along pathways. An additional benefit of the TPT approach is that it allows representations that are both quantitatively accurate and visually compelling. The authors of that work used their TPT framework to quantitatively describe the folding pathways of the PinWW domain, a fast folding beta protein that had previously been studied experimentally [47].
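The core quantities of a TPT analysis can be sketched on a toy model. The example below computes the committor (the probability of reaching B before A, often called pfold) by simple iteration and then the net reactive flux along each edge. It assumes a reversible MSM, so that the backward committor is one minus the forward committor, and uses a made-up four-state matrix with two parallel routes from A (state 0) to B (state 3).

```python
def committor(T, source, sink, n_iter=2000):
    """Probability of reaching `sink` before `source` from each state,
    found by iterating q_i = sum_j T_ij q_j on the intermediate states."""
    n = len(T)
    q = [1.0 if i in sink else 0.0 for i in range(n)]
    for _ in range(n_iter):
        for i in range(n):
            if i not in source and i not in sink:
                q[i] = sum(T[i][j] * q[j] for j in range(n))
    return q

def net_flux(T, pi, q):
    """Net reactive flux, assuming reversibility (backward committor = 1 - q)."""
    n = len(T)
    gross = [[pi[i] * (1.0 - q[i]) * T[i][j] * q[j] if i != j else 0.0
              for j in range(n)] for i in range(n)]
    return [[max(0.0, gross[i][j] - gross[j][i]) for j in range(n)]
            for i in range(n)]

# Made-up symmetric (hence reversible) model; its stationary distribution is uniform.
T = [
    [0.80, 0.15, 0.05, 0.00],
    [0.15, 0.70, 0.00, 0.15],
    [0.05, 0.00, 0.90, 0.05],
    [0.00, 0.15, 0.05, 0.80],
]
pi = [0.25, 0.25, 0.25, 0.25]
q = committor(T, source={0}, sink={3})
flux = net_flux(T, pi, q)
print(q, flux[0][1], flux[0][2])
```

For these numbers, three quarters of the reactive flux leaves A through state 1 and one quarter through state 2, so the route through state 1 would be drawn as the dominant pathway.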

Combining the MSMBuilder approach to MSM construction with TPT methods, Voelz et al sought to apply these methods to the NTL9 protein [21]. Compared to previously simulated proteins, NTL9 folds in a millisecond, which is as much as 1000 times slower. Its mixed alpha-beta topology makes it structurally more interesting than other protein folding model systems. Despite these challenges, Voelz et al successfully built an MSM from implicit solvent simulations of NTL9. The model was able to recapitulate both the correct structure and the experimentally measured folding rate. TPT analysis [Figure 3] revealed a heterogeneous folding mechanism, with a variety of misfolded and intermediate states. However, the native pairing of beta strands 1 and 2 occurred only for those states with pfold greater than 0.5, suggesting a rate-limiting role for the formation of this structural element. It is also interesting to note that the millisecond timescale of NTL9 is significantly longer than the individual simulation trajectories (each ~10 μs): this demonstrates that an MSM framework can help reach timescales longer than the individual MD trajectories.

Figure 3.

The top 15 pathways of NTL9 folding were calculated using TPT. Each node represents a macrostate and is sized by the equilibrium free energy −kT log(P). The edges are sized by the folding flux through each segment, so the dominant pathways are depicted by larger arrows. This figure was reproduced from Ref. [21].

Caveats of the MSM approach

As with any method, there are caveats to consider with taking an MSM approach. We detail the key caveats below, both to inform the reader interested in applying or evaluating these methods, as well as for those interested in advancing the existing methodology.

Sufficient sampling

While one of the primary goals of MSMs is to address the challenges involved with sampling, i.e. reaching sufficiently long timescale phenomena with statistical significance, sampling is still a challenge. Indeed, while MSMs can greatly push the limits of what one can do with sampling, if an event occurs on very long timescales (beyond the longest timescales accessible by the MSM), then it will not be seen with simple approaches (although the combination of a state decomposition which can identify the relevant substates and adaptive sampling approaches could handle such problems efficiently). Moreover, we stress that the fact that MSMs seek to reflect true physical dynamics allows one to use physical reasoning to understand the limits of a given MSM.

With that said, it is important to stress that sampling with MSMs is of course considerably easier and more efficient than other methods. We consider it “easier” since MSMs are naturally amenable to parallel computation, so one could construct the necessary trajectories to parameterize an MSM on a simple cluster, rather than requiring an expensive supercomputer. Moreover, recent advances in using novel hardware, such as GPUs, open the door to very long timescales, such as approximately 500ns/day for a 36 residue protein (villin headpiece) with implicit solvent using OpenMM [48]. Thus, with even a medium sized cluster of 100 GPUs, one could simulate 50μs/day of aggregate sampling or 5ms of aggregate sampling over 100 days (a typical simulation run); thus, the combination of GPUs (eg running OpenMM) and MSM methods (eg using MSMBuilder) should allow many labs to study millisecond-scale protein folding phenomena. Finally, we stress that the discussion above has not included the effects of adaptive sampling, which would further increase the timescales one could reach, and thus greatly increase the efficiency and power of the calculation.
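The throughput figures quoted above follow from simple unit conversion, reproduced here as a quick check. The variable names are illustrative, and the per-GPU rate is the approximate value given in the text.

```python
ns_per_day_per_gpu = 500  # approximate rate for villin, implicit solvent
n_gpus = 100              # a medium sized cluster
run_days = 100            # a typical simulation run

aggregate_us_per_day = ns_per_day_per_gpu * n_gpus / 1000  # ns -> microseconds
aggregate_ms_total = aggregate_us_per_day * run_days / 1000  # microseconds -> ms

print(aggregate_us_per_day, "microseconds per day")
print(aggregate_ms_total, "milliseconds in total")
```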

Accurate force fields

A general question for simulation is the accuracy of force fields. This is not an issue for MSMs in particular, although it is worth mentioning in any discussion of simulation caveats. Based on previous work cited above, it appears that the force fields have been sufficiently accurate for the systems examined so far, allowing for quantitative connection to experiment in many cases. Nevertheless, as one pushes these methods further and examines more complex systems, we may find new challenges and limitations of force fields.

However, in our opinion, the best way to tackle such problems is to ensure that we can precisely understand the limits of the force fields; this comes from having sufficient sampling and statistical significance, which are the hallmarks of the MSM approach. Thus, we expect that even in situations where force fields fail, MSM approaches may prove valuable in understanding these limitations and potentially in improving the force fields themselves.

Connections to experiment

Once one has constructed an MSM, it is natural to test the whole model (i.e. the level of sampling combined with the force field accuracy) against experiment. However, a new challenge arises: how does one make such quantitative connections? A detailed discussion of this area is beyond the scope of this work, but we mention the topic here nonetheless in order to highlight its significance. We stress that the ideal connection to experiment computes something as close to the experimental observable as possible, rather than trying to connect to the interpretation of those observables. While this direct connection is challenging, it avoids additional layers which could lead to erroneous connections. For example, simulations of villin agree quantitatively with analogs of experimental observables (e.g. analogs to fluorescence), but show differences with the experimental interpretation of the folding rate when more detailed observables (such as RMSD to the native state) are examined [9].

Moreover, an MSM approach should also prove valuable in making such direct connections, since the process is now simplified: one need only connect the structures in each state to a given experimental observable, and those state properties can then be propagated to yield thermodynamic, bulk time-resolved, or even time-resolved single-molecule-like distributions analogous to experimental observables.
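As a minimal sketch of what propagating state properties can look like, assume a row-stochastic transition matrix T at a fixed lag time and a single per-state value of some observable (for instance, a fluorescence analog averaged over the structures in each state). The three-state matrix, the per-state signal values, and the function names below are invented for illustration and do not come from any published model.

```python
def propagate(populations, transition_matrix):
    """One lag-time step: p(t + tau)_j = sum_i p(t)_i * T[i][j]."""
    n = len(populations)
    return [
        sum(populations[i] * transition_matrix[i][j] for i in range(n))
        for j in range(n)
    ]


def observable_trace(p0, transition_matrix, state_values, n_steps):
    """Ensemble-averaged observable <A>(t) = sum_i p_i(t) * a_i at each step."""
    p = list(p0)
    trace = []
    for _ in range(n_steps + 1):
        trace.append(sum(pi * ai for pi, ai in zip(p, state_values)))
        p = propagate(p, transition_matrix)
    return trace


# Hypothetical 3-state model: unfolded, intermediate, folded.
T = [
    [0.90, 0.10, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.10, 0.90],
]
signal = [0.0, 0.5, 1.0]   # assumed per-state observable values
p0 = [1.0, 0.0, 0.0]       # all population starts in the unfolded state

trace = observable_trace(p0, T, signal, n_steps=200)
print(trace[0], trace[1], trace[-1])
```

The resulting trace is the analog of a bulk relaxation experiment: it starts at the value of the initial state and relaxes toward the equilibrium average. Single-molecule-like traces could instead be generated by sampling individual state-to-state jumps from the same matrix.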

Well-constructed state decomposition

Finally, the greatest methodological challenge and area for future work with MSM construction methods is refining and improving our ability to construct a kinetically relevant state decomposition. Recent advances in automated methods (e.g. see [45] and [7]) have made great strides in this area. However, we suspect that as one applies MSMs to more complex systems, new challenges will arise. For example, for very long proteins, one may need to consider the role of knots and topological effects, which may be challenging (although not impossible) to handle with RMSD-metric based approaches. Also, applying these methods outside of the protein realm may require new metrics for structural similarity, especially when there are fluid degrees of freedom whose specific labels do not matter, such as structurally relevant water or lipid molecules. Luckily, the methods established for checking the consistency of an MSM (described above) would reveal such problems and signal the need for further development.
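To illustrate what a metric that ignores labels might look like, the sketch below compares two configurations of interchangeable water molecules through the sorted list of their distances to a fixed solute site, so that swapping the labels of two waters changes nothing. This is a hypothetical toy descriptor chosen for brevity, not a method proposed in the works cited here, and it is only a pseudo-metric: distinct arrangements can share the same sorted distances.

```python
import math


def sorted_distance_descriptor(site, waters):
    """Distances from a solute site to each water, sorted so labels drop out."""
    return sorted(math.dist(site, w) for w in waters)


def label_invariant_distance(site, waters_a, waters_b):
    """RMS difference between the sorted-distance descriptors of two frames."""
    da = sorted_distance_descriptor(site, waters_a)
    db = sorted_distance_descriptor(site, waters_b)
    if len(da) != len(db):
        raise ValueError("both frames must contain the same number of waters")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


site = (0.0, 0.0, 0.0)
frame_a = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
frame_b = [(0.0, 2.0, 0.0), (1.0, 0.0, 0.0)]   # same waters, labels swapped
frame_c = [(3.0, 0.0, 0.0), (0.0, 2.0, 0.0)]   # one water moved outward

print(label_invariant_distance(site, frame_a, frame_b))
print(label_invariant_distance(site, frame_a, frame_c))
```

A plain RMSD over labeled atoms would treat frame_a and frame_b as different structures, whereas this descriptor treats them as identical, which is the behavior one wants for interchangeable solvent or lipid molecules.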

Conclusions

We have walked the reader through the fundamentals of MSM construction, with an emphasis on physical intuition over mathematical formalism. For additional details, we recommend the recent reviews and research papers cited above. In a nutshell, MSMs represent a shift in how one thinks of computer simulation. Instead of creating a toy system, running one or a few long trajectories, and then reporting the (likely anecdotal) results, MSMs take a statistical approach. Indeed, the goal of simulation here is not to create trajectories from which mechanistic accounts can be read off, but first and foremost to build models, using statistical methods to construct the best possible model of the kinetics given the data at hand. We expect that, in the end, the most significant contribution of the MSM approach may not be the details of model building, but rather the shift in how one uses and conceptualizes simulations.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • 1. Dill KA, et al. The protein folding problem. Annu Rev Biophys. 2008;37:289–316. doi: 10.1146/annurev.biophys.37.092707.153558.
  • 2. Snow CD, et al. How well can simulation predict protein folding kinetics and thermodynamics? Annu Rev Biophys Biomol Struct. 2005;34:43–69. doi: 10.1146/annurev.biophys.34.040204.144447.
  • 3. Kasson PM, et al. Ensemble molecular dynamics yields submillisecond kinetics and intermediates of membrane fusion. Proc Natl Acad Sci USA. 2006;103(32):11916–21. doi: 10.1073/pnas.0601597103.
  • 4. Kelley NW, et al. The predicted structure of the headpiece of the Huntingtin protein and its implications on Huntingtin aggregation. J Mol Biol. 2009;388(5):919–27. doi: 10.1016/j.jmb.2009.01.032.
  • 5. Kelley NW, et al. Simulating oligomerization at experimental concentrations and long timescales: A Markov state model approach. J Chem Phys. 2008;129(21):214707. doi: 10.1063/1.3010881.
  • 6. Bowman GR, Huang X, Pande VS. Using generalized ensemble simulations and Markov state models to identify conformational states. Methods. 2009;49(2):197–201. doi: 10.1016/j.ymeth.2009.04.013.
  • 7. Noe F, Fischer S. Transition networks for modeling the kinetics of conformational change in macromolecules. Curr Opin Struct Biol. 2008;18(2):154–62. doi: 10.1016/j.sbi.2008.01.008.
  • 8. Bowman GR, et al. Progress and challenges in the automated construction of Markov state models for full protein systems. J Chem Phys. 2009;131(12):124101. doi: 10.1063/1.3216567.
  • 9. Ensign DL, Kasson PM, Pande VS. Heterogeneity Even at the Speed Limit of Folding: Large-scale Molecular Dynamics Study of a Fast-folding Variant of the Villin Headpiece. J Mol Biol. 2007;374(3):806–16. doi: 10.1016/j.jmb.2007.09.069.
  • 10. Gnanakaran S, et al. Peptide folding simulations. Curr Opin Struct Biol. 2003;13(2):168–74. doi: 10.1016/s0959-440x(03)00040-x.
  • 11. Mitsutake A, Okamoto Y. Replica-exchange extensions of simulated tempering method. J Chem Phys. 2004;121(6):2491–504. doi: 10.1063/1.1766015.
  • 12. Mitsutake A, Sugita Y, Okamoto Y. Generalized-ensemble algorithms for molecular simulations of biopolymers. Biopolymers. 2001;60(2):96–123. doi: 10.1002/1097-0282(2001)60:2<96::AID-BIP1007>3.0.CO;2-F.
  • 13. Okamoto Y. Generalized-ensemble algorithms: enhanced sampling techniques for Monte Carlo and molecular dynamics simulations. J Mol Graph Model. 2004;22(5):425–39. doi: 10.1016/j.jmgm.2003.12.009.
  • 14. Buchete NV, Hummer G. Peptide folding kinetics from replica exchange molecular dynamics. Phys Rev E Stat Nonlin Soft Matter Phys. 2008;77(3 Pt 1):030902. doi: 10.1103/PhysRevE.77.030902.
  • 15. Huang X, et al. Rapid equilibrium sampling initiated from nonequilibrium data. Proc Natl Acad Sci USA. 2009;106(47):19765–9. doi: 10.1073/pnas.0909088106.
  • 16. Huang X, et al. Rapid equilibrium sampling initiated from nonequilibrium data. Proc Natl Acad Sci USA. 2009. doi: 10.1073/pnas.0909088106.
  • 17. Karpen ME, Tobias DJ, Brooks CL 3rd. Statistical clustering techniques for the analysis of long molecular dynamics trajectories: analysis of 2.2-ns trajectories of YPGDV. Biochemistry. 1993;32(2):412–20. doi: 10.1021/bi00053a005.
  • 18. Shao J, et al. Clustering molecular dynamics trajectories: 1. Characterizing the performance of different clustering algorithms. J Chem Theory Comput. 2007;3(6):2312–2334. doi: 10.1021/ct700119m.
  • 19. Noe F, et al. Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proc Natl Acad Sci USA. 2009;106(45):19011–6. doi: 10.1073/pnas.0905466106.
  • 20. Sarich M, Noe F, Schutte C. On the approximation quality of Markov state models. SIAM Multiscale Model Simul. 2010 in press.
  • 21. Voelz VA, et al. Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1–39). J Am Chem Soc. 2010;132(5):1526–8. doi: 10.1021/ja9090353.
  • 22. Singhal N, Pande VS. Error analysis and efficient sampling in Markovian state models for molecular dynamics. J Chem Phys. 2005;123(20):204909. doi: 10.1063/1.2116947.
  • 23. Bacallado S, Chodera JD, Pande V. Bayesian comparison of Markov models of molecular dynamics with detailed balance constraint. J Chem Phys. 2009;131(4):045106. doi: 10.1063/1.3192309.
  • 24. Noe F. Probability distributions of molecular observables computed from Markov models. J Chem Phys. 2008;128(24):244103. doi: 10.1063/1.2916718.
  • 25. Schütte C, et al. A direct approach to conformational dynamics based on hybrid Monte Carlo. J Comput Phys. 1999;151:146–168.
  • 26. Deuflhard P, et al. Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains. Lin Alg Appl. 2000;315:39–59.
  • 27. Deuflhard P, Weber M. Robust Perron cluster analysis in conformation dynamics. Lin Alg Appl. 2005;398:161–184.
  • 28. Singhal N, Pande VS. Calculation of the distribution of eigenvalues and eigenvectors in Markovian state models for molecular dynamics. J Chem Phys. 2007;127:244101. doi: 10.1063/1.2740261.
  • 29. Röblitz S. Fachbereich Mathematik und Informatik. Freien Universität Berlin; Berlin: 2008. Statistical Error Estimation and Grid-free Hierarchical Refinement in Conformation Dynamics.
  • 30. Bowman GR, Ensign DL, Pande V. Enhanced modeling via network theory: adaptive sampling of Markov state models. Journal of Theoretical and Computational Chemistry. 2010 in press. doi: 10.1021/ct900620b.
  • 31. Swope WC, Pitera JW, Suits F. Describing Protein Folding Kinetics by Molecular Dynamics Simulations. 1. Theory. Journal of Physical Chemistry B. 2004;108(21):6571–6581.
  • 32. Park S, Pande VS. Validation of Markov state models using Shannon’s entropy. J Chem Phys. 2006;124(5):054118. doi: 10.1063/1.2166393.
  • 33. Metzner P, Schutte C, Vanden-Eijnden E. Illustration of transition path theory on a collection of simple examples. J Chem Phys. 2006;125(8):084110. doi: 10.1063/1.2335447.
  • 34. Berezhkovskii A, Hummer G, Szabo A. Reactive flux and folding pathways in network models of coarse-grained protein dynamics. J Chem Phys. 2009;130(20):205102. doi: 10.1063/1.3139063.
  • 35. Noe F, et al. Hierarchical analysis of conformational dynamics in biomolecules: transition networks of metastable states. J Chem Phys. 2007;126(15):155102. doi: 10.1063/1.2714539.
  • 36. Sriraman S, Kevrekidis IG, Hummer G. Coarse master equation from Bayesian analysis of replica molecular dynamics simulations. J Phys Chem B. 2005;109(14):6479–84. doi: 10.1021/jp046448u.
  • 37. Pan AC, Roux B. Building Markov state models along pathways to determine free energies and rates of transitions. J Chem Phys. 2008;129(6):064107. doi: 10.1063/1.2959573.
  • 38. Sezer D, Freed JH, Roux B. Using Markov models to simulate electron spin resonance spectra from molecular dynamics trajectories. J Phys Chem B. 2008;112(35):11014–27. doi: 10.1021/jp801608v.
  • 39. Yang S, Roux B. Src kinase conformational activation: thermodynamics, pathways, and mechanisms. PLoS Comput Biol. 2008;4(3):e1000047. doi: 10.1371/journal.pcbi.1000047.
  • 40. Yang WY, et al. Heterogeneous folding of the trpzip hairpin: full atom simulation and experiment. J Mol Biol. 2004;336(1):241–51. doi: 10.1016/j.jmb.2003.11.033.
  • 41. Elmer SP, Park S, Pande VS. Foldamer dynamics expressed via Markov state models. II. State space decomposition. J Chem Phys. 2005;123(11):114903. doi: 10.1063/1.2008230.
  • 42. Elmer SP, Park S, Pande VS. Foldamer dynamics expressed via Markov state models. I. Explicit solvent molecular-dynamics simulations in acetonitrile, chloroform, methanol, and water. J Chem Phys. 2005;123(11):114902. doi: 10.1063/1.2001648.
  • 43. Jayachandran G, Vishal V, Pande VS. Using massively parallel simulation and Markovian models to study protein folding: examining the dynamics of the villin headpiece. J Chem Phys. 2006;124(16):164902. doi: 10.1063/1.2186317.
  • 44. Singhal N, Snow CD, Pande VS. Using path sampling to build better Markovian state models: predicting the folding rate and mechanism of a tryptophan zipper beta hairpin. J Chem Phys. 2004;121(1):415–25. doi: 10.1063/1.1738647.
  • 45. Chodera JD, et al. Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J Chem Phys. 2007;126(15):155101. doi: 10.1063/1.2714538.
  • 46. Kubelka J, Eaton WA, Hofrichter J. Experimental tests of villin subdomain folding simulations. J Mol Biol. 2003;329(4):625–30. doi: 10.1016/s0022-2836(03)00519-9.
  • 47. Jager M, et al. The folding mechanism of a beta-sheet: the WW domain. J Mol Biol. 2001;311(2):373–93. doi: 10.1006/jmbi.2001.4873.
  • 48. Friedrichs MS, et al. Accelerating molecular dynamic simulation on graphics processing units. J Comput Chem. 2009;30(6):864–72. doi: 10.1002/jcc.21209.
