Author manuscript; available in PMC: 2010 Jan 1.
Published in final edited form as: J Chem Theory Comput. 2009 Jan 1;5(1):47–58. doi: 10.1021/ct800282a

Extracting Kinetic and Stationary Distribution Information from Short MD Trajectories via a Collection of Surrogate Diffusion Models

Christopher P Calderon †,*, Karunesh Arora
PMCID: PMC2739417  NIHMSID: NIHMS86729  PMID: 20046947

Abstract

Low-dimensional stochastic models can summarize dynamical information and make long time predictions associated with observables of complex atomistic systems. Maximum likelihood based techniques for estimating low-dimensional surrogate diffusion models from relatively short time series are presented. It is found that a heterogeneous population of slowly evolving conformational degrees of freedom modulates the dynamics. This underlying heterogeneity results in a collection of estimated low-dimensional diffusion models. Numerical techniques for exploiting this finding to approximate skewed histograms associated with the simulation are presented. In addition, statistical tests are also used to assess the validity of the models and determine physically relevant sampling information, e.g. the maximum sampling frequency at which one can discretely sample from an atomistic time series and have a surrogate diffusion model pass goodness-of-fit tests. The information extracted from such analyses can possibly be used to assist umbrella sampling computations as well as help in approximating effective diffusion coefficients. The techniques are demonstrated on simulations of Adenylate Kinase.

1 Introduction

A significant understanding of complex biomolecules like proteins and nucleic acids has been obtained through the use of low-dimensional equations approximating the dynamics of these systems.1,2 Single-molecule experiments and computer simulations are allowing researchers to better understand the physics governing complex biomolecular systems at small length and time scales.1–16 Advances in nanotechnology are starting to demand higher accuracy from stochastic dynamical approximations of small biomolecular systems. Fortunately, the time series associated with current experiments and simulations contain a rich amount of information related to molecular motion occurring over a broad range of time scales.17 However, the presence of a wide range of relevant time scales significantly complicates determining reliable low-dimensional models associated with small scale, but highly complex systems.18–21 We refer to such approximate models as “surrogate” models; some authors use the term “effective model” in a similar context.20–22

In single-molecule experiments, researchers can usually only monitor and/or manipulate a small number of observable quantities that describe the system. This is because simultaneously tracking the position and velocity of many atoms in a system is very challenging experimentally. Experiments usually do have the luxury of being able to directly measure quantities associated with longer time scales of physical interest, e.g. it is sometimes possible to monitor a protein unfold and refold nearly quasi-statically.23 On the other hand, atomistic simulations provide detailed descriptions of the dynamics (i.e. monitoring the position and velocity of every particle is possible), but encounter computational limitations, perhaps the most serious being the small time step size enforced by numerical stability considerations. Current simulations can only reasonably explore O(ns)–O(μs) length trajectories in all-atom MD simulations of biomolecules17,24 due to the time step constraint. Rapid advances in experiments and simulations are likely to further facilitate making comparisons between these two information sources and using both to refine/construct surrogate models.25–28

We present surrogate models which can approximate both simulation29,30 and experimental time series,31,32 but the focus here is exclusively on simulation data. Throughout, we refer to quantities monitored in our time series as “system observables” (SOs). “Pathwise” statistical methods are used to estimate diffusion models from observed time series. We use the term “pathwise” simply to refer to situations where all statistical inference procedures (estimation, hypothesis tests, etc.) are applied to a single time series. We use all-atom MD simulations to generate multiple trajectories (i.e. a batch of time series). We estimate the parameters of a new surrogate model for each observed time series as opposed to aggregating the time series together to estimate the parameters of a single model. We demonstrate surrogate models which can account for state-dependent noise. This has relevance to several systems because it is known that when a high-dimensional system is summarized with a small number of SOs, the noise magnitude often varies as a function of the SO.3,18,29,30,33–36 The information in the surrogate diffusion models can also be related to the effective friction experienced by a particular SO in different portions of phase space.30,32,37–39 The methods presented make heavy use of recent developments in statistical testing and modeling29,30,34,40–44 to assess the validity of the estimated models and quantitatively learn about the various multi-scale noise sources. For example, we analyze histograms of a root-mean-square-displacement (rmsd) type SO and determine how much variability is due to traditional thermal noise and how much is introduced by conformational heterogeneity. The former is usually associated with fast-scale motions whose details are not of interest and the latter with slow-scale collective motions.

Besides quantifying the contributions of various noise sources to the observed variability of the SO, the collection of estimated models can also be used to help traditional physical chemistry computations. For example, they can be used to reduce the variance of equilibrium statistics and can also be used to quantify how certain factors influence the dynamics and stationary distributions of SOs. This is relevant because it is known that conformational degrees of freedom can often cause skewed distributions of low-dimensional SOs. Correlating information accessible in the laboratory with that in computer simulations has the potential to help in various computational chemistry tasks.15,31,45–49

Our techniques are demonstrated on short (525 ps) constrained umbrella sampling trajectories of the enzyme Adenylate Kinase.50,51 It is shown that surrogate models, calibrated from observational data, can be used to predict and/or refine approximations of stationary distributions associated with a selected SO. The particular SO studied is related to known “active” and “inactive” crystal structures of the enzyme.51 We demonstrate that the diffusion coefficient52 can be approximated using the surrogate models and that confidence bands for this quantity can be constructed using a single short time trajectory. We demonstrate that this estimate (and the confidence band associated with the estimate) can be obtained with much less data than traditional approaches used in MD.24,52,53 We also show that an experimentally accessible49 slowly evolving conformational coordinate correlates with the surrogate model parameters. In this system, taking the state-dependence of the noise magnitude into account,29,30,33,35–37 as well as the underlying structural inhomogeneity, is shown to be important to faithfully approximate the complex high-dimensional atomistic simulation using low-dimensional surrogate diffusion models. The Supp. Info. provides results demonstrating that the same ideas can be applied to approximate the SO associated with longer (10-50 ns length) unconstrained molecular dynamics of the so-called “Engrailed Homeodomain”.54

The article is organized as follows: Section 2 provides a theoretical background reviewing our basic motivation and some established results from statistical physics. This background helps in physically interpreting the information contained in the collection of surrogate diffusion models we attempt to fit from observations. This section also presents the statistical methods we employ and contains a discussion illustrating how the information extracted from the methods can help computational physical chemistry. Section 3 provides the MD simulation details. Section 4 presents the numerical results and Section 5 concludes.

2 Theoretical Background

2.1 Data-driven Multiscale Stochastic Modeling

The basic idea of using “short bursts” of simulation time series to estimate effective dynamical models motivated the types of methods we propose.20,55,56 We obtain the parameters of surrogate diffusion models29,30 from observed time series using maximum likelihood type (ML) techniques. The type of modeling we propose would fall under the label of a “data-driven” modeling procedure. Several other researchers are developing data-driven methods for describing various complex systems ranging from molecular dynamics to weather forecasting.20,29,30,33,56–59 The basic idea behind a data-driven description is to assimilate information contained in empirical observations, either simulation or experiment, coming from a complex high-dimensional system into a surrogate model.

If accurate surrogate models can be calibrated from short time series then these models can be used for a variety of purposes, e.g. they can be used to simulate sample paths for longer time intervals than those accessible to the MD simulation. Other applications are discussed in Section 4. These basic ideas are not new to chemical physics and it is well-known that the multiple time-scales associated with the underlying complex process significantly complicate these types of tasks.18,21,60 Our main contributions to this type of endeavor are associated with showing how modern time series analysis tools can be used to help in quantitatively determining some “coarse-graining” parameters and also determine the goodness-of-fit of surrogate models in a pathwise fashion. We also demonstrate how a collection of low-dimensional models can be used to study various all-atom simulations and single-molecule experiments where certain collective degrees of freedom are associated with slow time-scales and influence the estimated surrogate models.30,31,61,62 The motivation for using a collection of simple low-dimensional models as opposed to more complicated high-dimensional models as surrogates is discussed in detail in Ref. [61] and in the Conclusions.

2.2 Generalized Langevin Equations

The generalized Langevin equation (GLE) description has been used to describe the evolution of certain MD trajectories.21 Although we do not utilize this structure in our models, we introduce it here because several analogies can be drawn to the modeling methods we introduce and can aid in the physical interpretation of our procedure. A generic GLE typically takes the form:

$$\dot{\Phi} = F(\Phi) + \int_0^t K(\Phi(t-s), s)\,ds + \sqrt{2 k_B T}\,N(t), \tag{1}$$

where kBT is Boltzmann's constant multiplied by the system temperature, Φ is the vector of SOs the modeler wishes to dynamically track, a dot above a variable denotes the time derivative, F(·) represents the Markovian contribution to the dynamics,19,21 N(·) represents the “orthogonal noise” coming from degrees of freedom not explicitly resolved in the model,19 and K(·, ·) the so-called “memory kernel”. K(·, ·) is usually interpreted as a non-Markovian contribution to the dynamics. The specific functional form of the memory kernel and the orthogonal noise can, in principle, be determined once the full system dynamics are specified using the Mori-Zwanzig projection operator formalism.19,21 The orthogonal noise is typically constructed to have a zero mean over ensembles and its value depends on the initial conditions of all degrees of freedom in the system. For a single realization, the orthogonal dynamics can slowly evolve making the “noise” term appear to be a systematic bias when viewed over short time-scales of a single trajectory. Observations of this nature motivated the authors in Ref. [63] to use a long-memory (fractional Gaussian) process to describe the orthogonal noise in experimental data tracking a protein's slow conformational dynamics.

The AdK system studied is also associated with slow conformational dynamics. However, the surrogate models we propose attempt to approximate $F(\Phi) + \int_0^t K(\Phi(t-s), s)\,ds$ using a Markovian term μ(Φ) and use a fairly simple noise process (standard Brownian motion). One of the surrogate models attempts to utilize the so-called “overdamped approximation”.38 The term “overdamped” is meant to refer to the fact that a particle has a position and velocity, but knowledge of the velocity is not needed to accurately approximate the statistical properties of the particle position. This is a temporal coarse-graining procedure commonly used in statistical physics.18,38 It has been labeled as somewhat ad hoc because it requires quantitative knowledge of the time that one needs to wait between adjacent observations for such an approximation to be valid and this selection is usually based on intuitive physical arguments as opposed to precise mathematical criteria.21 We demonstrate that down-sampling (or sub-sampling30,60) ideas along with statistical hypothesis tests42,44,64 can be used to help put a quantitative handle on an overdamped approximation. We refer to fast-scale noise which may induce short time memory as “fast-scale memory” throughout the text. Velocity is one possible source, but others like vibrational motion would contribute to this type of fast-scale memory. The modeling of slow conformational degrees of freedom is more subtle. This article focuses on modeling the output of MD simulations, so producing time series where a long-memory process of the type given in Ref. [63] can be estimated is somewhat problematic. This is because in simulations it is difficult to sample over a long time span, so the fitted parameters of a long-memory process would likely contain substantial uncertainty. To approximate the variability induced by slowly evolving conformational degrees of freedom in relatively short time series we use a collection of surrogate models. We expand on this point throughout the text.

2.3 Time-scale Separation in System Observables

If the K(·, ·) in Eq. 1 is zero everywhere and the noise process is a standard Gaussian white-noise process with “Dirac-delta time correlation”, then Eq. 1 is commonly written as a diffusion type stochastic differential equation (SDE).65 SDEs have a rich history66 in the physical sciences; early studies focused primarily on analyzing properties of the Fokker-Planck type partial differential equation (PDE) associated with the stochastic process as opposed to the SDE itself. We utilize the SDE view, sometimes called the pathwise view,21 because we feel it facilitates connecting the physics to the estimated surrogate models.

The data-driven modeling methods we propose assume that the effective dynamics21 of the underlying high-dimensional atomistic simulations can be accurately captured by a small set of SOs whose dynamics are governed by the following system of SDEs:

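$$\begin{aligned} d\Phi &= \alpha(\Phi, C)\,dt + \sqrt{2}\,\sigma(\Phi, C)\,dW_t^1 \\ dC &= \beta(\Phi, C)\,dt + \sqrt{2}\,\kappa(\Phi, C)\,dW_t^2, \end{aligned} \tag{2}$$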

where the $W_t^i$ represent standard Brownian motions,65 Φ represents a “fast-scale” coordinate, and C a “slow-scale” conformational coordinate. We assume that observations are made on a time-scale shorter than Φ's characteristic relaxation time and that the dynamics of C are associated with much “slower” time-scales than those of Φ. The functions α(·, ·) & β(·, ·) are referred to as the drift functions and σ(·, ·) & κ(·, ·) are related to the diffusion matrix.65

We do not assume that we have the system of SDEs in Eq. 2 available in closed-form. We only assume that the stochastic dynamics of the higher-dimensional atomistic system can be accurately approximated by this system of SDEs (i.e. evolution rules are more complicated than Eq. 2). Recall that the data-driven approach we use attempts to estimate effective dynamical equations from time series. If a scale separation exists, the dimension of the SDE system can often be reduced.21,22,55,67 This can significantly facilitate estimation of a surrogate model. Traditional SDE model reduction techniques often ignore the details of the “fast” component and focus on describing the details of the “slow” component's evolution.21,55,67,68 In our notation, the modeler would treat Φ as a “noise” and focus on stochastic dynamical models that explicitly model C. We do the opposite, namely we fit a scalar SDE of the form:

$$d\Phi = \mu_C(\Phi)\,dt + \sqrt{2}\,\sigma_C(\Phi)\,dW_t^1, \tag{3}$$

to the SO time series. Recall we assume that our time series observations are spaced by a time shorter than the relaxation time of Φ (and hence much shorter than that of C). If we additionally assume that the local noise, κ, of the slow coordinate C is small, then this coordinate will not have time to appreciably change in a short time simulation. C can then be treated as an effective constant that modulates the drift and noise functions observed, e.g. $\mu_C(\cdot) \equiv \mu(\cdot, C)$, $\sigma_C(\cdot) \equiv \sigma(\cdot, C)$.
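
The effect of a slowly evolving C on a short burst can be illustrated with a toy simulation. The following is a minimal sketch, assuming an Euler-Maruyama discretization of Eq. 2 with hypothetical linear drifts and constant noise magnitudes. The function name, the coefficient values and the step size are illustrative choices, not quantities taken from the AdK simulations.

```python
import math
import random


def simulate_two_scale(n_steps, dt, phi_init, c_init,
                       sigma=1.0, kappa=0.01, seed=0):
    """Euler-Maruyama integration of a toy version of Eq. 2.

    All coefficient choices below are hypothetical and only illustrate
    a fast coordinate (phi) modulated by a slow coordinate (c).
    """
    rng = random.Random(seed)
    phi, c = phi_init, c_init
    phis, cs = [phi], [c]
    noise_scale = math.sqrt(2.0 * dt)  # sqrt(2) from Eq. 2 times sqrt(dt)
    for _ in range(n_steps):
        alpha = -5.0 * (phi - c)  # assumed fast drift: phi relaxes toward c
        beta = -0.01 * c          # assumed slow drift of c
        phi_new = phi + alpha * dt + noise_scale * sigma * rng.gauss(0.0, 1.0)
        c_new = c + beta * dt + noise_scale * kappa * rng.gauss(0.0, 1.0)
        phi, c = phi_new, c_new
        phis.append(phi)
        cs.append(c)
    return phis, cs


phis, cs = simulate_two_scale(n_steps=500, dt=0.01, phi_init=2.0, c_init=2.0)
print("range explored by phi:", max(phis) - min(phis))
print("range explored by C:  ", max(cs) - min(cs))
```

With these assumed coefficients, Φ fluctuates around the current value of C while C explores a much narrower range over the burst, which is the regime in which treating C as an effective constant is plausible.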

However, a wide range of C values is explored, albeit slowly, at “thermodynamic” equilibrium 1. Ideally, the computational cost associated with a standard long time integration would be moderate so that observing substantial changes in both effective fast and slow components would be possible. In this case, we would attempt to either estimate Eq. 2 directly or use more traditional time-scale separation techniques.21,22,55,67 With longer time series, we could also entertain using a single more complicated model63 attempting to capture slow fluctuations and relate this to a memory kernel. Unfortunately, due to computational limitations, generating long time trajectories is problematic in many complex all-atom systems. We propose methods that estimate a collection of SDEs of the form given in Eq. 3. The initial conformation of each MD simulation is drawn randomly from an equilibrium ensemble, so each different time series trajectory is associated with a different C value. We demonstrate how to use this collection to assist in computations commonly encountered in computational physical chemistry. Note carefully that the system of SDEs approximating the dynamics in Eq. 2 is usually much smaller in dimension than the full atomistic system. By only modeling one component (Φ), we are applying an additional reduction to the system of SDEs. Reduction of this sort introduces the collection of surrogate model phenomenon we discuss throughout.

Before providing specific details of the assumed surrogate models, we would like to comment on the process we used for assigning “fast” and “slow” variables in AdK. In Ref. [51], there was interest in generating stationary histograms associated with a coordinate quantifying the difference in rmsd of intermediate conformations with respect to the open and closed enzyme states (in what follows we simply call this distance “rmsd type”). Dynamically monitoring this rmsd type quantity in the laboratory is not currently feasible. There were FRET measurements available providing information about a distance between dye labeled residues Lys145 and Ile52.49 These particular residues were chosen since they lie in two domains of AdK that undergo the largest conformational change between the open and closed states of AdK and help detect functionally relevant motion.51 In our simulations we do not directly attempt to manipulate this residue distance despite its physical relevance; it has been shown that this residue distance explores a wide range of values, but does so slowly with respect to the time-scale of simulations.51 As a result we used the rmsd type distance as Φ and the distance between the centers of mass of residues Lys145 and Ile52 as C. In order to enhance sampling of Φ, a harmonic biasing potential was introduced (see Section 3). The biasing potential also made a linear effective force approximation seem more plausible in a surrogate model (later we quantitatively tested this assumption). This biasing potential altered the effective underlying energy landscape and also made the dynamics associated with Φ “faster” in relation to C. Recall we also assumed that the local fluctuation magnitude associated with C was small. This assumption was due to the fact that collective conformational degrees of freedom do not typically fluctuate wildly in an enzyme.

The above considerations clearly utilized our physical intuition about the system. In general, fast Φ type coordinates are difficult to accurately monitor and/or control in the laboratory. However, several slowly evolving conformational degrees of freedom can readily be monitored and/or manipulated by novel single-molecule techniques.3,5–16 We demonstrate that a measurable correlation exists between the selected SOs. The time-scale separation between the correlated SOs is exploited in the methods we report. This is one way in which simulation predictions can be compared to experimental observations.51

However, we would like to note that selecting coordinates having a large time scale separation can be difficult if physical intuition alone is used. The problem is even harder if the interest is in analyzing unconstrained simulations. Developing general, data-driven procedures for identifying “good” variables where a significant time-scale separation exists is challenging, but such procedures would be of great help in studying systems where physical intuition is lacking.69,70 In addition, other more sophisticated types of multiscale approximations can be applied to SDEs like those in Eq. 2 using less restrictive assumptions about the dynamics.21,56,68,71

2.4 Proposed Functional Form of Surrogate SDE Models

For every observed time series, we proposed two model structures to use along with Eq. 3, namely:

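$$\text{Model 1:}\quad \mu_C(\Phi) = \big(A + B(\Phi - \Phi_0)\big), \quad \sigma_C(\Phi) = C \tag{4}$$
$$\text{Model 2:}\quad \mu_C(\Phi) = \frac{\sigma_C(\Phi)^2}{k_B T}\big(A + B(\Phi - \Phi_0)\big), \quad \sigma_C(\Phi) = \big(C + D(\Phi - \Phi_0)\big) \tag{5}$$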

where Φ0 corresponds to a user specified point 2 and θ ≡ (A, B, C, D) are parameters (D is only used in Model 2). Model 1 is known as the Ornstein-Uhlenbeck (OU) model. The drift function can readily be interpreted as coming from a harmonic potential connected to a heat bath whose fluctuations are independent of the state. Model 2 explicitly utilizes the overdamped (OD) Langevin approximation18,29,30,38 and also takes the noise magnitude's dependence on the current state into account using a relatively simple model. The estimated parameters can be interpreted as local approximations of the effective force and effective local diffusion coefficient 3.
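
To make the two parameterizations concrete, the sketch below evaluates the drift and noise functions of Eqs. 4 and 5 at a given Φ. The names `Theta`, `model1_ou` and `model2_od` are hypothetical helpers introduced here for illustration, and the value of kBT used in the usage lines assumes energies in kcal/mol at roughly 300 K.

```python
from typing import NamedTuple, Tuple


class Theta(NamedTuple):
    """Parameter vector (A, B, C, D) of Eqs. 4-5; D is unused by Model 1."""
    A: float
    B: float
    C: float
    D: float = 0.0


def model1_ou(phi: float, theta: Theta, phi0: float) -> Tuple[float, float]:
    """Return (drift, noise magnitude) of the OU model, Eq. 4."""
    drift = theta.A + theta.B * (phi - phi0)
    return drift, theta.C


def model2_od(phi: float, theta: Theta, phi0: float, kbt: float) -> Tuple[float, float]:
    """Return (drift, noise magnitude) of the overdamped model, Eq. 5."""
    sigma = theta.C + theta.D * (phi - phi0)   # state-dependent noise
    force = theta.A + theta.B * (phi - phi0)   # linear effective force
    drift = sigma ** 2 / kbt * force
    return drift, sigma


theta = Theta(A=0.5, B=-2.0, C=0.3, D=0.1)   # illustrative values only
print(model1_ou(1.0, theta, phi0=0.0))
print(model2_od(1.0, theta, phi0=0.0, kbt=0.596))  # assumed kcal/mol near 300 K
```

In Model 2 the same linear term plays the role of an effective force, so the drift inherits the state dependence of the noise through the factor σ_C(Φ)²/kBT.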

2.5 Fitting and Testing the Surrogate Models

For discretely sampled time series, the maximum likelihood estimator attempts to find the parameter vector maximizing the logarithm of the joint density associated with the observations, $\log p(\Phi_0, \Phi_1, \ldots, \Phi_N; \theta)$, where subscripts denote the time index of the observations. The parameter yielding this maximum is denoted by $\hat{\theta}$. For general SDEs, estimating $\hat{\theta}$ analytically is problematic because the joint density cannot usually be expressed in closed-form. The OU model is appealing because it does admit a closed-form expression for the ML parameter estimate and also yields some other useful diagnostic information. For example, an asymptotic expression for the parameter covariance can be obtained.43 For the case where we cannot obtain the ML estimate in closed form, we appeal to approximate likelihood methods.40,41 Both ML approximation methods previously cited yielded similar $\hat{\theta}$ values in the cases explored. A URL link to MATLAB scripts illustrating how to obtain $\hat{\theta}$ for both the OU and OD models given a time series is provided in the Supp. Info.
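
The closed-form OU estimate can be sketched as follows. This is not the authors' MATLAB code; it is a minimal illustration assuming uniformly spaced observations, B < 0, and likelihood conditioning on the first observation. Under Model 1 with the noise convention of Eq. 3, Φ at the next observation time is Gaussian with mean m + (Φ_n − m)e^{BΔt} and variance C²(e^{2BΔt} − 1)/B, where m = Φ0 − A/B, so the estimate reduces to a linear regression of each observation on its predecessor. The function names and the synthetic test path are hypothetical.

```python
import math
import random


def simulate_ou(n_obs, dt, A, B, C, phi0=0.0, seed=0):
    """Draw an exact, discretely sampled path of Model 1 (requires B < 0)."""
    rng = random.Random(seed)
    mean = phi0 - A / B                          # stationary mean
    b = math.exp(B * dt)                         # one-step regression slope
    sd = math.sqrt(C * C * (b * b - 1.0) / B)    # one-step conditional std. dev.
    path = [mean]
    for _ in range(n_obs - 1):
        path.append(mean + (path[-1] - mean) * b + sd * rng.gauss(0.0, 1.0))
    return path


def fit_ou(series, dt, phi0=0.0):
    """Conditional ML estimate of (A, B, C) via the equivalent AR(1) regression."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                                # estimate of exp(B * dt)
    a = my - b * mx                              # regression intercept
    if not 0.0 < b < 1.0:
        raise ValueError("regression slope outside (0, 1); OU fit undefined")
    resid_var = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / n
    B = math.log(b) / dt
    mean = a / (1.0 - b)
    A = B * (phi0 - mean)
    C = math.sqrt(resid_var * B / (b * b - 1.0))
    return A, B, C


path = simulate_ou(n_obs=20000, dt=0.15, A=1.0, B=-2.0, C=0.5)
print(fit_ou(path, dt=0.15))
```

On short series (a few hundred points, as used in the text) the same formulas apply, but the estimates of A and B in particular carry much larger uncertainty than in this long synthetic example.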

The “Q test-statistic” developed in Ref. [42] is used to check the validity of the assumed surrogate model. This test is designed to test for temporal dependencies which are atypical for an assumed model. We demonstrate that it can be used to detect if fast-scale memory effects are statistically significant. The test is also appealing because it is applicable to both stationary and nonstationary signals. This test also provides us with physically relevant coarse-graining information. We demonstrate how we can use this test to determine the appropriate frequency at which data can be discretely sampled from a simulation and provide a diffusion model which is not rejected by a hypothesis test. If one is willing to make a stationarity assumption about the time series, more powerful tests can be used.44,64 We demonstrate that the “T3” test statistic of Ref. [64] is useful in assessing the accuracy of a stationary density predicted using a short simulation burst. This test is shown to have better power than the Q-test.
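
The general idea of checking a fitted diffusion model for leftover temporal dependence can be illustrated with transformed one-step residuals. The sketch below does not implement the Q statistic of Ref. [42] or the T3 statistic of Ref. [64]. It assumes the OU transition density of Model 1, maps each observation through the model's conditional CDF given its predecessor, and reports a crude lag-1 autocorrelation of the result. If the model is adequate, the transformed values behave like independent uniform draws on [0, 1]. All function names and the synthetic data are hypothetical.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ou_pit_residuals(series, dt, A, B, C, phi0=0.0):
    """Conditional-CDF transform of each observation under Model 1 (B < 0)."""
    mean = phi0 - A / B
    b = math.exp(B * dt)
    sd = math.sqrt(C * C * (b * b - 1.0) / B)
    return [normal_cdf((y - (mean + (x - mean) * b)) / sd)
            for x, y in zip(series[:-1], series[1:])]


def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation, a crude indicator of temporal dependence."""
    n = len(values)
    m = sum(values) / n
    num = sum((values[i] - m) * (values[i + 1] - m) for i in range(n - 1))
    den = sum((v - m) ** 2 for v in values)
    return num / den


# Synthetic data generated from the same OU model that is being checked.
rng = random.Random(0)
A, B, C, dt = 1.0, -2.0, 0.5, 0.15
mean, b = -A / B, math.exp(B * dt)        # phi0 = 0, so mean = -A / B
sd = math.sqrt(C * C * (b * b - 1.0) / B)
path = [mean]
for _ in range(20000):
    path.append(mean + (path[-1] - mean) * b + sd * rng.gauss(0.0, 1.0))

u = ou_pit_residuals(path, dt, A, B, C)
print("mean of residuals:", sum(u) / len(u))
print("lag-1 autocorrelation:", lag1_autocorrelation(u))
```

For atomistic data sampled too frequently, fast-scale memory would be expected to show up as residual dependence of this kind; repeating the check at several down-sampling levels is one simple way to explore that.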

2.6 Computing the Stationary Density Associated with Scalar SDEs

Under mild regularity conditions, the stationary density, denoted by pEQ(Φ;C), associated with a scalar diffusion model given in Eq. 3 can be expressed in closed-form using only information contained in the estimated SDE coefficient functions via the relation72,73 4:

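$$p^{EQ}(\Phi; C) = \frac{Z}{(\sigma_C(\Phi))^2} \exp\left(\int_{\Phi_{REF}}^{\Phi} \frac{\mu_C(\Phi')}{(\sigma_C(\Phi'))^2}\,d\Phi'\right) \tag{6}$$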

where Z is a constant used to ensure that the density integrates to unity and ΦREF is a specified constant used as a “reference point”. It is assumed that the diffusion process obtained by the ML estimate admits a well-behaved stationary density 5. Recall that for each time series, we estimate a new set of parameters and hence a new SDE of the form given in Eq. 3. The different trajectories each have unique conformational state initial conditions (i.e. different C values) in the underlying detailed atomistic simulations. Each of these estimated models can be used to compute a “stationary” density resulting in a collection of “stationary” densities. Quotes are used in the previous sentence because in each short time series burst the value of C is effectively fixed and the Φ coordinate fluctuates about a fixed point. C determines this fixed point as well as the shape of the “stationary” density of Φ. The thermodynamic stationary distribution (≡ ΠEQ) needs to account for the variability inherent in C. Due to the slow time-scale associated with this coordinate, it is difficult to exhaustively sample phase space in a single simulation trajectory. If we somehow had access to a closed-form expression describing the (thermodynamic) probability density of C, denoted by f(·) 6, we could integrate this quantity out using:
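
Eq. 6 can also be evaluated numerically for any estimated drift and noise function. The following is a minimal sketch, assuming a uniform or non-uniform increasing grid whose first point plays the role of ΦREF and simple trapezoid quadrature; the helper names and the grid are illustrative choices. For Model 1 with B < 0, Eq. 6 reduces to a Gaussian with mean Φ0 − A/B and variance −C²/B, which the usage lines use as a check.

```python
import math


def trapezoid(values, grid):
    """Trapezoid-rule integral of values sampled on grid."""
    return sum(0.5 * (grid[k] - grid[k - 1]) * (values[k] + values[k - 1])
               for k in range(1, len(grid)))


def stationary_density(mu, sigma, grid):
    """Evaluate Eq. 6 on an increasing grid; grid[0] acts as Phi_REF."""
    ratio = [mu(p) / sigma(p) ** 2 for p in grid]
    # Cumulative trapezoid rule for the integral inside the exponential.
    integral = [0.0]
    for k in range(1, len(grid)):
        step = grid[k] - grid[k - 1]
        integral.append(integral[-1] + 0.5 * step * (ratio[k] + ratio[k - 1]))
    shift = max(integral)  # subtract the maximum to avoid overflow in exp
    unnorm = [math.exp(i - shift) / sigma(p) ** 2
              for i, p in zip(integral, grid)]
    total = trapezoid(unnorm, grid)  # plays the role of 1 / Z
    return [u / total for u in unnorm]


# Model 1 with illustrative parameters; expected variance is -C^2 / B.
A, B, C, phi0 = 0.0, -2.0, 0.5, 0.0
grid = [-3.0 + 0.01 * k for k in range(601)]
density = stationary_density(lambda p: A + B * (p - phi0), lambda p: C, grid)
variance = trapezoid([p * p * d for p, d in zip(grid, density)], grid)
print("numerical variance:", variance, "closed form:", -C * C / B)
```

The same function accepts the Model 2 drift and noise functions unchanged, provided the grid stays inside the region where the fitted σ_C(Φ) remains nonzero.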

$$\Pi^{EQ}(\Phi) := \int p^{EQ}(\Phi; C)\,f(C)\,dC \approx \frac{1}{N}\sum_{i=1}^{N} p_i^{EQ}(\Phi; C_i), \tag{7}$$

The right-hand-side of the above represents a Monte Carlo approximation of the continuous integral to the left. N represents the number of time series batches used to calibrate N different surrogate models describing Φ's dynamics. $C_i$ denotes the temporal average value of C observed in time series batch i. The subscript i on $p_i^{EQ}$ denotes using the invariant density obtained for Φ associated with $C_i$ (using Eq. 6 for each estimated SDE). To more systematically overcome the conformational sampling barrier, one could attempt to generate initial conformations utilizing more sophisticated equilibrium sampling methods.24,53 We demonstrate that this is not necessary to obtain accurate results in the systems studied here, although it may be useful in other applications.

Effectively we are modeling the more complex distribution ΠEQ(Φ) using a mixture of simpler densities. This mixture modeling can also be given a physical interpretation. The thermal noise for a fixed value of C induces a certain amount of variability in the SO of interest; for each single SDE “i” this can be quantified using $p_i^{EQ}(\Phi; C_i)$. The variability induced by conformational heterogeneity can be quantified by looking at how disjoint the members of the collection $\{p_i^{EQ}(\Phi; C_i)\}_{i=1}^{N}$ are relative to the average quantity 7.
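
The mixture in Eq. 7 and the split between thermal and conformational variability can be sketched for the special case where every surrogate model is an OU model, so that each “stationary” density is Gaussian. That restriction, the helper names, and the two-component example are assumptions made for illustration. For an equal-weight mixture, the total variance is the average within-component variance plus the variance of the component means, which gives one simple way to quantify the two contributions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, components):
    """Right-hand side of Eq. 7 for Gaussian components given as (mean, var)."""
    return sum(gaussian_pdf(x, m, v) for m, v in components) / len(components)


def variance_split(components):
    """Return (within, between) contributions to the mixture variance."""
    n = len(components)
    grand_mean = sum(m for m, _ in components) / n
    within = sum(v for _, v in components) / n                      # thermal part
    between = sum((m - grand_mean) ** 2 for m, _ in components) / n  # heterogeneity
    return within, between


# Two hypothetical surrogate models with means -1 and +1 and variance 0.25 each.
components = [(-1.0, 0.25), (1.0, 0.25)]
within, between = variance_split(components)
step = 0.01
grid = [-6.0 + step * k for k in range(1201)]
mass = sum(mixture_density(x, components) for x in grid) * step
total_var = sum(x * x * mixture_density(x, components) for x in grid) * step
print("within:", within, "between:", between, "total:", total_var, "mass:", mass)
```

In this example the between-component term dominates, which corresponds to a histogram whose width is set mainly by conformational heterogeneity rather than by thermal noise at fixed C.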

2.7 Relation to Computational Chemistry

One interest in this paper is in approximating the global stationary histogram associated with a Φ type coordinate using a small amount of MD time series. Two complications are commonly encountered: (1) time correlation in MD trajectories can complicate constructing reliable estimates due to statistical dependence,74 and (2) dynamics induced by “orthogonal coordinates” can induce skewed (non-Gaussian) distributions in the stationary histogram associated with the fast SO of interest. Such skewed distributions are commonly encountered in both experiments and simulations when a low-dimensional SO is modulated by a diverse population of conformational degrees of freedom not explicitly included in the model.46–48

Knowledge of the shape of such global stationary distributions is important in a variety of applications, e.g. in umbrella sampling type applications one needs to ensure a high degree of overlap between adjacent sampling windows51,74 and knowledge of the skewed histogram shape can help one in refining the grids used in such computations. The non-Gaussian shape of a work histogram is also important in nonequilibrium free energy computations.30,47 We demonstrate that our modeling procedures, utilizing a collection of SDE models, can help in predicting such shapes and address the two issues listed in the above paragraph. We also demonstrate that the time-scale separation between the fast-time scale coordinate Φ and the slow conformational coordinate C influences kinetic quantities of interest such as the diffusion coefficient.33,52

3 Simulation Details

We assign Φ ≡ ΔDrmsd (difference in rmsd of the instantaneous structures from the reference open and closed crystal structure of the enzyme) and C the distance between mass centers of residues Ile-52 and Lys-145 characterizing the dynamics of large-scale conformational transitions in AdK (see Ref. [51] for details). This distance type SO has also been measured in solution using single molecule FRET experiments by Henzler-Wildman et al.49

As detailed in Ref. [51], the initial path between the open and closed conformations of AdK was generated using the Nudged Elastic Band (NEB) method.75 Subsequently, 81 configurations obtained from NEB path optimization, separated by an interval of 0.2 Å in ΔDrmsd space, were subjected to US simulations. During these US simulations, production dynamics of 525 ps at 300 K was performed from each configuration with a weak restraint of 10 kcal/mol/Å² in ΔDrmsd (the specified target SO value in each window is denoted by ΔDrmsd0). No restraints were applied along the conformational SO (C). Solvent effects were modeled implicitly using the GBMV approximation76 in CHARMM.77

For further statistical analysis, 50-100 restrained trajectories of 525 ps in length each were performed from the eight starting conformations along the path corresponding to the ΔDrmsd0 values (measured in Å) of −5.79, −3.67, −0.01, 1.38, 3.30, 5.34 and 7.02. All trajectories were subjected to a similar restraint of 10 kcal/mol/Å² along ΔDrmsd, but were started with different, randomly assigned initial velocities. The time series of ΔDrmsd and distance C were extracted from the trajectories (sampled every 0.15 ps) and used in the analysis below.

4 Results

4.1 Parameter Estimates and Goodness-of-Fit Tests

The ML parameters of the OU model were obtained at each of the 81 different US windows. The measured noise magnitude (C) depends significantly on the value of Φ(≡ ΔDrmsd). This is demonstrated in Fig. 1. Parameter estimates were obtained using three different down-sampling (or sub-sampling) parameters.60 The down-sampling parameter is an integer represented by “ds”. Knowledge of this parameter is related to temporal coarse-graining; it determines the amount of time used to “average out” certain fast-scale non-Markovian memory effects.18,21,34,56,68,71 To get a better physical understanding of this quantity, suppose one is numerically integrating a high-dimensional chaotic deterministic Hamiltonian system using a constant time-step size δt. The output of a discrete error-free integration would be the sequence $\{p_i\}$ where $p_i \equiv \int_{i\delta t}^{(i+1)\delta t} -\partial_q H(t)\,dt$ (using notation from classical mechanics). If we attempted to fit a diffusion approximation directly to the sequence $\{p_i\}$, it would likely fail because the fast-scale chaotic motion has not had sufficient time to “mix” and the noise is not a “white noise process” (i.e. temporal correlations exist in the fast-scale noise67). Alternatively, if we used the sequence $\{p_i^{ds}\}$ where $p_i^{ds} \equiv \int_{i(ds\times\delta t)}^{(i+1)(ds\times\delta t)} -\partial_q H(t)\,dt$, the chaotic motion would have more time to “mix” and would make a diffusion model more plausible. The surrogate models estimated from our MD simulations, although inherently stochastic due to the Langevin thermostat, still exhibit dependence on the down-sampling because systematic forces associated with fast-scale memory still need time to average out.

Figure 1.


The C parameter of the Ornstein-Uhlenbeck process was estimated using short time series from 81 different (independent) US windows. The values estimated are denoted by symbols and the purpose of the line connecting the points is only to guide the eye. Each parameter estimate came from a time series containing 350 uniformly spaced entries. Three different ds values were used. The corresponding time Δt between observations is reported in the legend.

Next we demonstrate how surrogate models calibrated from short simulations can be tested for goodness-of-fit. An US point (ΔDrmsd = 7.0) exhibiting significant state dependence in the noise was analyzed in detail. At this point, 75 MD trajectories were simulated and Φ was observed every 0.15 ps (this interval is much larger than the integration step size of 1.5 fs). For every proposed ds, we estimated 75 surrogate model parameter vectors (for both the OU and OD models). A total of three ds values were tested (ds = 1, 2, 3). The number of time series observations used to estimate each surrogate parameter vector was fixed at 350 in each case, so the terminal length of the time series depended on ds. However, each time series (regardless of ds) started with the same initial observations to maximize the degree of temporal overlap in the Φ time series used for parameter estimation.
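The down-sampling scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `downsample` and the synthetic input array are introduced here for the example. The sketch keeps every ds-th observation, always starts from the first recorded observation, and fixes the number of retained observations at 350, so the time span covered grows with ds.

```python
import numpy as np

def downsample(series, ds, n_obs=350):
    """Keep every ds-th entry of `series`, starting at index 0, until n_obs entries are retained."""
    series = np.asarray(series)
    needed = (n_obs - 1) * ds + 1  # raw samples required to reach n_obs retained entries
    if ds < 1 or series.size < needed:
        raise ValueError("series too short for the requested ds and n_obs")
    return series[:needed:ds]

# Hypothetical raw record: one observation every 0.15 ps over 525 ps (3500 samples).
raw_dt = 0.15
raw = np.arange(3500) * raw_dt  # stand-in data (here simply the time stamps in ps)
for ds in (1, 2, 3):
    sub = downsample(raw, ds)
    span = sub[-1] - sub[0]
    print(f"ds={ds}: dt={ds * raw_dt:.2f} ps, n={sub.size}, span={span:.2f} ps")
```

Because every sub-sampled series starts at the same index, the series for different ds values overlap in time as much as possible, which mirrors the choice described in the text.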

We utilized both the “Q-test statistic”42 and the “T3 test statistic”.64 Both prove useful in our applications. Ideal finite-sample null distributions associated with a time series of size 350 were obtained using Monte Carlo simulation to generate 1 × 10⁴ samples (see footnote 8). This sample size was assumed large enough to obtain an accurate continuous cumulative distribution function (CDF) approximation of the null. The relatively small batch size of MD simulation samples led us to treat the distributions associated with the test statistics as empirical distribution functions (EDFs). The reference null distribution and various EDFs are plotted in Fig. 2. We shade the plot to highlight the critical region associated with a significance level (α) of 10%. The x-intercept of the shaded region is the critical value associated with this level, and the percentage of rejected models can be obtained by evaluating the EDF at this value and subtracting the result from unity. Although we shade for α = 10%, the plot can be used to assess any α of interest.
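The rejection percentage described above (one minus the EDF evaluated at the critical value of the simulated null) can be computed as in the following sketch. It is a hedged illustration only: the function names are hypothetical, the test is assumed to reject in the upper tail, and the synthetic normal samples merely stand in for the Monte Carlo null and for the 75 test statistics obtained from MD paths.

```python
import numpy as np

def critical_value(null_samples, alpha=0.10):
    """Upper-tail critical value: the (1 - alpha) quantile of the simulated null."""
    return float(np.quantile(np.asarray(null_samples), 1.0 - alpha))

def edf(sample, x):
    """Empirical distribution function of `sample` evaluated at x."""
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, x, side="right") / sample.size

def rejection_fraction(test_stats, null_samples, alpha=0.10):
    """Fraction of test statistics above the critical value, i.e. 1 - EDF(x_crit)."""
    x_crit = critical_value(null_samples, alpha)
    return 1.0 - edf(test_stats, x_crit)

# Synthetic illustration only (not data from the article).
rng = np.random.default_rng(0)
null = rng.standard_normal(10_000)      # stand-in for 1e4 Monte Carlo draws from the null
batch = rng.standard_normal(75) + 1.0   # stand-in for 75 test statistics from MD paths
print(rejection_fraction(batch, null))
```

Changing `alpha` reproduces the remark that the same plot can be read at any significance level of interest.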

Figure 2.


Hypothesis test results. In each panel, the staircase plots correspond to the empirical distribution function (EDF) of the test statistics obtained from batches of 75 time series (each using a different ds value; the corresponding time between observations, Δt, is reported in the legend) and the solid curve corresponds to the distribution of the null computed for a finite sample size of 350, which was the length of each time series analyzed in this plot. The shaded region is used to show the α = 0.10 critical value. The percentage of models rejected at this level can be found by noting the point on the x-axis where a color change occurs (denoted by xcrit) and then evaluating 1 − EDF(xcrit). Panel (a): The Q-test statistic given in Ref. [42] was applied to determine the time needed to wait between observations before an overdamped diffusion model could be applied to simulation data. The surrogate model parameters were estimated for each path and then the Q-test statistic was computed using the data and the estimated model. Panel (b): The T3 test statistic64 computed using the same estimated parameters and data.

Panel (a) of Fig. 2 plots the Q-test results for both the OD and OU surrogate models using various ds parameters. The percentage of test statistics rejected for ds = 1, 2, and 3 was roughly 90%, 15% and 5%, respectively, for the OD model and 95%, 10% and 7.5% for the OU model. This suggests that when simulation data of the AdK system is discretely observed, the time between adjacent observations should be ≥ 0.30 ps before a “statistically acceptable” diffusion model can be used. Artifacts of fast-scale non-Markovian effects can readily be detected by the Q-test when one samples more frequently than this, even with fairly small time series (here 350 observations per trajectory). The other US windows analyzed (not reported) also indicated that ds = 2 was the appropriate coarse-graining parameter to use; this corresponds to 0.30 ps between observations, and all subsequent results used this spacing between time series observations. The main utility of such a pathwise goodness-of-fit analysis is that a very small number of short sample paths can be used to determine the time one needs to wait to let fast-scale non-Markovian effects “average out”.21,71

One of the appealing features of the Q-test is that the underlying nature of the signal is not important (stationary or non-stationary cases can be treated), but for this generality one pays a price in statistical power. The Q-test performs similarly for the OD and OU models despite there being fairly large state dependence at this point. Later we demonstrate that ignoring this dependence causes poor predictions related to stationary histogram estimation. It would be useful to apply a more powerful pathwise test in order to see how well the two different models perform. The T3 statistic64 makes use of a stationarity assumption, and panel (b) of Fig. 2 reports the results of applying this test. The T3 test demonstrates more clearly that the OD model fits the observations better (using ds = 2, roughly 5% of the OD models were rejected whereas about 20% of the OU models were). Results reported in the Supp. Info. show that increasing the time series sample size to 700 makes rejection easier in both cases, but the OU model is still more strongly rejected at the α = 10% significance level.

4.2 Diffusion Coefficient Approximation

Approximating the dynamics with a simple process like the OU model is appealing because the ML parameter estimates can be obtained directly (a numerical parameter search is not necessary) and the limiting large-sample distribution of the parameter estimates is also available in closed form.43 In Table 1 we report the mean and standard deviation of C² estimated with the OU model. The OD and OU model predictions for this quantity were nearly identical due to the low state dependence of the noise, so we focus on the latter (see footnote 9). Also, if the OU model accurately captures the data, we can exploit several analytical results for statistical inference purposes. For example, the large-sample asymptotic standard deviation of C² is computed analytically by exploiting the Gaussian property of this process. Table 1 demonstrates that our surrogate models can approximate such quantities. In the atomistic simulation community, the diffusion coefficient is typically determined by using an empirically measured autocorrelation function to determine the τ where correlations are small, and then C² is computed by ensemble averaging over temporal blocks.24,33,35,36 The time one needs to wait between observations can be fairly large when one uses this approach or variants of it.52
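To illustrate why no numerical parameter search is needed for an OU model, the sketch below computes a closed-form estimate from a discretely sampled path. Several things here are assumptions made for the example rather than details taken from this section: the parameterization dΦ = B(A − Φ)dt + C dW, the use of the likelihood conditioned on the first observation (which reduces to an AR(1) least-squares regression), the function names, and the numerical values used to generate the synthetic path.

```python
import numpy as np

def fit_ou(x, dt):
    """Conditional ML estimate of (A, B, C) for dPhi = B*(A - Phi)*dt + C*dW.

    The exact Gaussian transition density turns the problem into the AR(1)
    regression x[i+1] = a + b*x[i] + noise, with b = exp(-B*dt).
    """
    x = np.asarray(x, dtype=float)
    x0, x1 = x[:-1], x[1:]
    b, a = np.polyfit(x0, x1, 1)                   # least-squares slope and intercept
    if not 0.0 < b < 1.0:
        raise ValueError("AR(1) slope outside (0, 1); OU fit is not defined")
    resid_var = np.mean((x1 - (a + b * x0)) ** 2)  # ML (biased) residual variance
    B = -np.log(b) / dt
    A = a / (1.0 - b)
    C = np.sqrt(2.0 * B * resid_var / (1.0 - b * b))
    return A, B, C

def simulate_ou(A, B, C, dt, n, rng):
    """Draw an OU path of length n using the exact transition density."""
    b = np.exp(-B * dt)
    sd = C * np.sqrt((1.0 - b * b) / (2.0 * B))
    x = np.empty(n)
    x[0] = A
    noise = rng.standard_normal(n - 1)
    for i in range(n - 1):
        x[i + 1] = A + (x[i] - A) * b + sd * noise[i]
    return x

# Illustrative values only; they are not estimates reported in the article.
rng = np.random.default_rng(1)
path = simulate_ou(A=1.38, B=1.0, C=0.05, dt=0.30, n=350, rng=rng)
A_hat, B_hat, C_hat = fit_ou(path, dt=0.30)
print(A_hat, B_hat, C_hat)
```

Repeating the fit over a batch of independent short paths gives one parameter vector per path, whose empirical spread can be compared with an asymptotic standard deviation of the kind summarized in Table 1.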

Table 1.

Diffusion coefficient estimation. The effective diffusion coefficient (in the asymptotic mean square displacement sense) was computed from the MD data at the point ΔDrmsd0 = 1.38. The value obtained was 1.27 × 10⁻³ Å²/ps (see text). The diffusion coefficient estimated by the OU models is reported using three different down-sampling rates. Results using time series of length N = 350 and N = 700 are reported. In each case, results from analyzing 50 batches of time series are summarized by the mean and empirically measured standard deviation (“Emp Std. Dev.”) of the diffusion coefficient estimated by the surrogate models (each time series gave one estimate). We also report the large-sample uncertainty estimate (“Asymp. Std. Dev.”) of the maximum likelihood estimate. An analytic expression for this quantity is reported in Theorem 3.1.1 of Ref. 43.

| Quantity | Δt = 0.15 ps | Δt = 0.30 ps | Δt = 0.45 ps |
|---|---|---|---|
| Mean (N = 350) | 1.0350 × 10⁻³ | 1.4387 × 10⁻³ | 1.5989 × 10⁻³ |
| Emp Std. Dev. (N = 350) | 1.1912 × 10⁻⁴ | 1.6584 × 10⁻⁴ | 1.7161 × 10⁻⁴ |
| Asymp. Std. Dev. (N = 350) | 1.1136 × 10⁻⁴ | 1.5481 × 10⁻⁴ | 1.7190 × 10⁻⁴ |
| Mean (N = 700) | 1.0237 × 10⁻³ | 1.4012 × 10⁻³ | 1.5564 × 10⁻³ |
| Emp Std. Dev. (N = 700) | 9.7924 × 10⁻⁵ | 1.2283 × 10⁻⁴ | 1.3652 × 10⁻⁴ |
| Asymp. Std. Dev. (N = 700) | 7.7736 × 10⁻⁵ | 1.0632 × 10⁻⁴ | 1.1810 × 10⁻⁴ |

Figure 3 plots the autocorrelation (AC) of the detailed MD samples analyzed in Table 1. The AC was estimated from 50 batches of 525 ps MD data. The plot also contains the average autocorrelation function, i.e. the function obtained by averaging over the 50 autocorrelations. The 95% confidence bands for zero correlation at this time series sample size are plotted as dotted lines. The observed autocorrelation was used to find a relaxation time (τ) by fitting the early portion to a single exponential via least squares (resulting in τ ≈ 15 ps). The diffusion coefficient was then computed using33,52 〈(Φ(t+τ) − Φ(t))²〉/(2τ) = 1.27 × 10⁻³ Å²/ps, where brackets denote ensemble averaging over non-overlapping temporal blocks. This number was compared to the effective diffusion coefficient predicted by the OU model for various ds values (results are reported in Table 1). Our models actually exploit the temporal correlation to get better estimates of such quantities and do not necessarily require long time series. Recall that a value of ds = 2 (corresponding to 0.30 ps between observations) performed fairly well in the Q-test on all of the AdK data we observed; coincidentally, this parameter also appears to most closely capture the diffusion coefficient computed via traditional ensemble MD methods.52 We also report the mean, predicted uncertainty and measured uncertainty of the estimated C² for various ds values (see footnote 10).
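The traditional route summarized above (measure an autocorrelation, fit its early portion to a single exponential, then average squared displacements over non-overlapping blocks) can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the exponential is fitted by linear least squares on the logarithm of the AC (which requires positive AC values in the fitting window) rather than by whatever least-squares variant was actually used, and the synthetic inputs are a free random walk and an exact exponential rather than MD data.

```python
import numpy as np

def autocorr(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag (normalized so lag 0 equals 1)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[: x.size - k], x[k:]) / var for k in range(max_lag + 1)])

def relaxation_time(ac, dt, n_fit):
    """Fit ac[k] ~ exp(-k*dt/tau) over the first n_fit lags via least squares on log(ac)."""
    lags = np.arange(n_fit) * dt
    slope = np.polyfit(lags, np.log(ac[:n_fit]), 1)[0]
    return -1.0 / slope

def block_diffusion(x, dt, lag):
    """Estimate <(x(t+tau) - x(t))^2> / (2*tau), tau = lag*dt, from non-overlapping blocks."""
    x = np.asarray(x, dtype=float)
    increments = np.diff(x[::lag])  # displacements over consecutive, non-overlapping blocks
    return np.mean(increments ** 2) / (2.0 * lag * dt)

# Synthetic illustration only: free diffusion with a known coefficient.
rng = np.random.default_rng(2)
dt, D_true = 0.30, 1.0e-3
walk = np.cumsum(np.sqrt(2.0 * D_true * dt) * rng.standard_normal(50_000))
print(block_diffusion(walk, dt, lag=50))        # expected to be close to D_true

ac_model = np.exp(-np.arange(40) * dt / 15.0)   # exact single-exponential AC with tau = 15 ps
print(relaxation_time(ac_model, dt, n_fit=20))  # recovers tau up to round-off
```

Note that only one squared displacement per block of length τ enters the average, which is why this estimator needs comparatively long records, in contrast to a likelihood-based fit that uses every retained observation.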

Figure 3.


Autocorrelation (AC) measured from MD data taken from US point ΔDrmsd0=7.03. The thick line represents the mean AC function obtained by estimating the AC of each full 525 ps sample path and then averaging the results. The thin ACs labeled “Path i” are some representative ACs. The thick dotted horizontal lines correspond to the 95% confidence intervals.

Note how in Fig. 3 one initially observes a roughly single-exponential decay. This feature allowed us to approximate the effective diffusion coefficient using a single time series of length ≈ 100 ps. With this small amount of data, we were even able to compute fairly accurate confidence bands. However, closer inspection of the AC signal at longer times reveals it may be more complex than an exponential decay. Complex ACs are common in single-molecule experiments where conformational fluctuations persist for a relatively long time.15,16,63 Such artifacts may limit the predictive power of a diffusion coefficient calibrated from a scalar SO observed over fairly short time scales. The diffusion coefficient information reported in Table 1 does not attempt to account for complex long-time behavior (see footnote 11).

The bottom panel of Fig. 4 demonstrates that the estimated noise magnitude (Ci) of surrogate model “i” correlates with the temporal average of the distance coordinate C over path i. The figure also contains some sample trajectories demonstrating the time-scale separation existing between Φ and the distance C. Panel A of Fig. 4 uses three distinct (color-coded) symbols to identify the three sample trajectories plotted in panels B and C. The fast SO Φ oscillates (or rapidly reverts) about the observed temporal mean associated with path i, whereas the slow distance SO exhibits a slow random walk (i.e. it does not oscillate about a mean). Because panel A demonstrates that the intensity of the effective thermal noise varies with the distance C, and this quantity performs a random walk through phase space, the distribution of the SO of interest (Φ) is affected. This plot provides one fairly simple illustration of how a time-scale separation can affect histograms relevant to thermodynamic applications.

Figure 4.


(a) Scatter plot of the estimated noise parameter C of the Ornstein-Uhlenbeck (OU) model vs. the distance coordinate C. The OU noise parameter was estimated from time series of uniformly sampled observations spaced by Δt = 0.30 ps. It was estimated for 50 batches of short time series, all taken from US simulations at the constraint point ΔDrmsd0 = 1.38, and is plotted against the (temporal) average value of the distance C over the corresponding ΔDrmsd time series used to estimate the OU parameters. The linear correlation (r) between the estimated noise parameter and the average distance was found to be 0.34 and the associated p-value was 1.0 × 10⁻³. (b) Representative time series of the “fast” ΔDrmsd coordinate and (c) the “slow” distance coordinate C. The three color-coded trajectories in (b) and (c) correspond to the three color-coded symbols in (a).

The time-scale separation can be explored more quantitatively by estimating different stochastic models and analyzing the results. For example, if one estimates the effective force using the OD model for the distance coordinate C and then compares the result with the same model estimated for Φ, the characteristic time-scale associated with each effective force is roughly quantified by looking directly at the B parameter. In the SDE considered, the parameter B corresponds to the linear sensitivity of the effective force. For the data observed, the characteristic time-scale associated with the distance C is ≈ 40-100 times as long as that associated with Φ if the ratio of the linear sensitivities is used to quantify the time-scale gap (see footnote 12).

4.3 Stationary Histogram Approximation

Finally we present results illustrating how a collection of simple models can be used to approximate the stationary histogram information usually sought in umbrella sampling type applications. We demonstrate that a collection of surrogate models can account for variability associated with the slow time-scale of the distance coordinate C. Figure 5 reveals that the collection of OU invariant densities seems to accurately capture the general shape of ΠEQ. However, the cases at the edges of the US simulation points do not approximate the shape of the histograms as well. Using a mixture of OD models (Equation 6) remedies the situation in both cases (see footnote 13). Figure 6 plots the resulting density prediction (ΠEQ) along with the measured Φ histogram obtained directly from the MD data. The inset plots some sample pEQ(·) functions measured from these models (these are used to construct ΠEQ). The accuracy obtained by this “mixture of densities” gives further evidence that a diverse ensemble of C values modulates the dynamics of Φ. For points near the boundaries the correlation between C and Φ is stronger than that shown in Fig. 4 (see Supp. Mat. Fig. 2), and accounting for this variability is important if one demands high accuracy in the surrogate model density estimates (referring to both pEQ(·) and ΠEQ).

Figure 5.


The histograms obtained from running MD simulations at 7 different constraint points are reported. Each histogram contains the results from 50 independent simulations run for 105 ps (again uniform sampling with Δt = 0.30 ps). The prediction of the simple OU model which accounts for the conformational heterogeneity (see text for details) is shown as a solid line. In most cases this crude approximation is accurate; the largest discrepancies are in the leftmost and rightmost distributions.

Figure 6.


Stationary density estimate focusing on the rightmost density shown in Fig. 5 (corresponding to ΔDrmsd0=7.03). The result obtained using the Ornstein-Uhlenbeck surrogate (solid red line) was poor. A batch of overdamped models was estimated (from the same time series used to fit the Ornstein-Uhlenbeck models in Fig. 5). The solid line denotes the invariant density obtained by appealing to Equation 7 and the dotted lines represent the invariant density prediction obtained by using 〈θ〉. The collection of thin blue lines displays some representative invariant density predictions (i.e. “piEQ” in Equation 7). The histogram of Φ coming from an ensemble of 75 genuine MD time series of length 525 ps is represented as the jagged line.

Before concluding, we describe why a “mixture of densities” can help in approximating the ΠEQ of a SO associated with a complex molecule. Accurately approximating ΠEQ usually requires one to exhaustively sample phase space, not just the small region explored in a single short trajectory. The variability induced by “very fast” degrees of freedom (e.g. vibrational degrees of freedom and solvent bombardment) is modeled in each single surrogate diffusion using Brownian motion. This, along with the drift parameters, determines pEQ(·)i, which provides quantitative information about how fast-scale fluctuations cause variability in Φ for a relatively frozen value of the distance C. Slow time-scale conformational variability (e.g. that introduced by the distribution of C) is accounted for using a collection of surrogate models. In Ref. [51] it was shown that the free energy profile of C was effectively flat. The distance C effectively performs a random walk in phase space when viewed over long time-scales, whereas Φ is constrained with a harmonic potential. The skewed histogram of Φ observed in the single 525 ps run shown in Fig. 6 is an artifact of this modulating effect.
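The “mixture of densities” idea can be illustrated with a short numerical sketch. Everything specific in it is an assumption made for the example: the equations referenced in the text are not reproduced here, so the mixture is taken to be an equal-weight average of the individual invariant densities; the OU parameterization dΦ = B(A − Φ)dt + C dW (whose invariant density is Gaussian with mean A and variance C²/(2B)) stands in for the overdamped models with state-dependent noise; and the parameter vectors and function names are hypothetical.

```python
import numpy as np

def ou_stationary_pdf(phi, A, B, C):
    """Gaussian invariant density of dPhi = B*(A - Phi)*dt + C*dW: mean A, variance C^2/(2B)."""
    var = C * C / (2.0 * B)
    return np.exp(-0.5 * (phi - A) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture_density(phi, params):
    """Equal-weight average of the invariant densities of the individual surrogate models."""
    return np.mean([ou_stationary_pdf(phi, A, B, C) for (A, B, C) in params], axis=0)

# Hypothetical parameter vectors (A_i, B_i, C_i), one per short trajectory.
params = [(6.9, 1.0, 0.04), (7.0, 1.0, 0.06), (7.1, 1.0, 0.10)]
phi = np.linspace(6.0, 8.0, 2001)
pi_eq = mixture_density(phi, params)
print(np.sum(pi_eq) * (phi[1] - phi[0]))  # numerical check that the mixture integrates to about 1
```

Because the component widths and centers differ from path to path, the averaged density can be asymmetric even though each component is symmetric, which is the qualitative mechanism the text invokes for the skewed histograms.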

5 Conclusions

We have demonstrated that a collection of fairly simple surrogate diffusion models estimated from time series data can accurately capture dynamical features of short constrained AdK simulations. The techniques presented should be thought of as a “post-processing” analysis in which statistical summaries (such as correlation and the invariant distribution of SOs) are obtained by time series techniques. In most cases, the parameters of the diffusion models were modulated by degrees of freedom associated with large-scale conformational changes. The slow SO monitored in AdK is experimentally accessible in solution via single molecule FRET.49 We also demonstrated that pathwise statistical inference could be used to obtain efficient parameter estimates from temporally correlated MD observations. Application of goodness-of-fit tests helped identify the time needed to wait (a coarse-graining parameter) before fast-scale non-Markovian artifacts “averaged out”.

Information extracted from a collection of surrogate diffusion models can be used to assist free energy computations as well as to obtain kinetic information in the form of effective diffusion coefficients using relatively short detailed simulation trajectories in certain situations. Confidence bands and goodness-of-fit tests can be used to check the quality of the approximation without requiring a large number of expensive simulation results. This shows promise in reducing the computational load needed to obtain kinetic and thermodynamic properties of complex biomolecules. These types of methods may also possibly be used to assist established sampling techniques like WHAM, parallel tempering, or meta-dynamics24,74,78 where many histograms need to be approximated. If this can be done with shorter time series, computational resources will be freed to explore other portions of phase space believed to be physically relevant. The findings are not limited to very short constrained simulations; the Supporting Information reports results from longer unconstrained simulations of an explicitly solvated protein, using a trajectory obtained from the dynameomics.org library.26 We also see the statistical analysis tools presented here as being useful in data-mining applications.

The surrogate models we appealed to in this article did not explicitly exploit the structure of any underlying governing equations. The proposed models had a phenomenological motivation. The collection of estimated surrogate models did give dynamic and static information about a Φ type coordinate over a broad range of phase space not typically explored in a single simulation, and did so using SDE models which we could efficiently estimate, whose uncertainty we could quantify, and which we could readily interpret in terms of established statistical physics. We could also assess the goodness-of-fit of the estimated models in an a posteriori fashion. If simplified models coming from mathematical model reduction techniques are available, e.g.,19,71 the parameters of reduced models could possibly be estimated from observations. One could also consider attempting to model the dynamics of more SOs and/or utilize the structure of a generalized Langevin equation, resulting in more complicated surrogate models. The statistical analysis of such models (estimation and inference) is fairly involved and introduces many new mathematical challenges, but interesting results are being obtained in that direction.63 Data-driven modeling is particularly attractive because recent advances in single-molecule manipulation methods1–16 are making a variety of low-dimensional SOs available to dynamically analyze. Synergistically combining data-driven modeling techniques with new and established simulation methods as well as mathematical multiscale analysis shows great promise in providing new insights into complex biological systems.


6 Acknowledgments

We thank two anonymous referees for helpful comments which improved the quality of the manuscript. In addition CPC was supported by NSF grants #s DMS 0240058 & ACI-0325081, NIH grant T90 DK070121-04 and obtained partial computational support from the Rice Computational Research Cluster funded by NSF under Grant CNS-0421109, and a partnership between Rice University, AMD and Cray.

Footnotes

1

By this we mean the stationary distribution associated with all time-scales. Note that the underlying system may be biased, as in umbrella sampling simulations, but we still refer to the stationary state as “thermodynamic” equilibrium.

2

For example, this could coincide with the constraint point of an US simulation51 or the mean value of the observed time series. We use the latter in this article.

3

Note that if the dynamics can be approximated by the OD model above and the noise magnitude is truly constant, then the parameters of the two different estimated surrogate models can be directly compared and the difference should only be due to the different sampling uncertainty magnitudes associated with the parameterization used.

4

Note the print version contains a typesetting error corrected here.

5

Evaluating pEQ(Φ;C) can encounter technical difficulties if one allows the diffusion coefficient to take a zero value (especially relevant to the OD model). Careful selection of ΦREF along with using a finite support can help in numerically dealing with this issue. However, one must be careful to ensure that all regions of non-negligible probability are accounted for in the finite support. SDE simulation can be used to assist in this type of task. Alternatively, one can modify σC to smoothly approach a minimum value > 0 and use an infinite support for pEQ(Φ;C).

6

We simply assume that this density exists and is statistically independent of the Φ variable. Including dependence on Φ is in principle possible, but the time-scale separation is large enough to make this coupling fairly weak in the particular systems studied.

7

To do so one must have an estimate of the inherent uncertainty associated with a finite length discrete time series used to estimate the parameters.

8

For the T3 test statistic, we used a bootstrap scheme whereby we used the ensemble average of θ^ to generate paths and used this single model to create the null. For large sample sizes, under mild assumptions, this test statistic can be shown to be independent of the underlying data generating mechanism, making this test appealing in situations where conformational heterogeneity exists.

9

We studied a case where the state-dependence on the noise is mild to facilitate physical interpretation and to compare to established diffusion coefficient methods used in MD simulation analysis.24,33,35,36

10

Note that as one waits a longer time, fast-scale non-Markovian effects become less important. However in longer time series, artifacts of the evolution on C type coordinates can more readily be measured and in this case it appears to increase the effective diffusion coefficient.

11

A single OU model predicts an AC with a single exponential rate of decay.

12

It should also be noted that when a coupled 2-d model was estimated, the eigenvalues of the effective force (of the coupled system) indicated a similar separation in time scales in the effective forces.

13

Results for the case near Φ ≈ −6 are given in Supp. Mat. Fig. 1.

Supporting Information: A PDF containing supplemental information related to Figs. 4 and 6 of the main text as well as a text file containing a url link to MATLAB scripts demonstrating the parameter estimation procedure. This information is available free of charge via the Internet at http://pubs.acs.org.

References

  • 1. Bustamante C, Bryant Z, Smith S. Nature. 2003;421:423. doi: 10.1038/nature01405.
  • 2. Carrion-Vazquez M, Oberhauser A, Fisher T, Marszalek P, Li H, Fernandez J. Prog. Biophys. Mol. Bio. 2000;74:63. doi: 10.1016/s0079-6107(00)00017-1.
  • 3. Stock G, Ghosh K, Dill K. J. Chem. Phys. 2008;128:194102. doi: 10.1063/1.2918345.
  • 4. Collin D, Ritort F, Jarzynski C, Smith S, Tinoco I, Jr., Bustamante C. Nature. 2005;437:231. doi: 10.1038/nature04061.
  • 5. Min W, Gopich I, English B, Kou S, Xie X, Szabo A. J. Phys. Chem. B. 2006;110:20093. doi: 10.1021/jp065187g.
  • 6. Smith S, Cui Y, Bustamante C. Science. 1996;271:795. doi: 10.1126/science.271.5250.795.
  • 7. Rief M, Clausen-Schaumann H, Gaub H. Nat. Struct. Biol. 1999;6:346. doi: 10.1038/7582.
  • 8. Clausen-Schaumann H, Rief M, Tolksdorf C, Gaub H. Biophys. J. 2000;78:1997. doi: 10.1016/S0006-3495(00)76747-6.
  • 9. Albrecht C, Neuert G, Lugmaier R, Gaub H. Biophys. J. 2008;94:4766. doi: 10.1529/biophysj.107.125427.
  • 10. Lee G, Rabbi M, Marszalek RCP. Small. 2007;5:809. doi: 10.1002/smll.200600592.
  • 11. Ke C, Humeniuk M, S-Gracz H, Marszalek P. Phys. Rev. Lett. 2007;99:018302. doi: 10.1103/PhysRevLett.99.018302.
  • 12. Harris NC, Song Y, Kiang C-H. Phys. Rev. Lett. 2007;99:068101. doi: 10.1103/PhysRevLett.99.068101.
  • 13. Dixit S, Singh-Zocchi M, Hanne J, Zocchi G. Phys. Rev. Lett. 2005;94:118101. doi: 10.1103/PhysRevLett.94.118101.
  • 14. Vendruscolo M, Dobson C. Science. 2006;313:1586. doi: 10.1126/science.1132851.
  • 15. Liu S, Bokinsky G, Walter N, Zhuang X. Proc. Natl. Acad. Sci. USA. 2007;104:12634. doi: 10.1073/pnas.0610597104.
  • 16. Greenleaf W, Frieda K, Foster D, Woodside M, Block S. Science. 2008;319:630. doi: 10.1126/science.1151298.
  • 17. Schlick T, Skeel RD, Brunger AT, Kale LV, Board JA, Hermans J, Schulten K. J. Comp. Phys. 1999;151:9.
  • 18. Zwanzig R. Nonequilibrium Statistical Mechanics. 1st ed. Oxford University Press; New York: 2001. Brownian motion and Langevin equations. pp. 3–24, 143.
  • 19. A. J. Chorin AK, Kupferman R. Proc. Natl. Acad. Sci. USA. 1998;95:4094. doi: 10.1073/pnas.95.8.4094.
  • 20. Kopelevich D, Panagiotopoulos A, Kevrekidis I. J. Chem. Phys. 2005;122:044908. doi: 10.1063/1.1839174.
  • 21. Givon D, Kupferman R, Stuart A. Nonlinearity. 2004;17:R55.
  • 22. E W, Liu D, Vanden-Eijnden E. Commun. Pur. Appl. Math. 2005;58:1544.
  • 23. Borgia A, Williams P, Clarke J. Annu. Rev. Biochem. 2008;77:101. doi: 10.1146/annurev.biochem.77.060706.093102.
  • 24. Schlick T. Molecular dynamics: Basics. In: Marsden J, Sirovich L, Wiggins S, Antman S, editors. Molecular Modeling and Simulation: An Interdisciplinary Guide. 2nd ed. Springer-Verlag; New York: 2002. pp. 394–406.
  • 25. Sotomayor M, Schulten K. Science. 2007;316:1144. doi: 10.1126/science.1137591.
  • 26. Simms A, Toofanny R, Kehl C, Benson N, Daggett V. Protein. Eng. Des. Sel. 2008;21:369. doi: 10.1093/protein/gzn012.
  • 27. Maragakis P, Lindorff-Larsen K, Eastwood M, Dror R, Klepeis J, Arkin I, Jensen M, Xu H, Trbovic N, Friesner R, Palmer A, Shaw D. J. Phys. Chem. B. 2008;112:6155. doi: 10.1021/jp077018h.
  • 28. Moffitt J, Chemla Y, Smith S, Bustamante C. Annu. Rev. Biochem. 2008;77:19.1. doi: 10.1146/annurev.biochem.77.043007.090225.
  • 29. Calderon C. J. Chem. Phys. 2007;126:084106. doi: 10.1063/1.2567098.
  • 30. Calderon C, Chelli R. J. Chem. Phys. 2008;128:145103. doi: 10.1063/1.2903439.
  • 31. Calderon C, Chen W, Harris N, Lin K, Kiang C. J. Phys.: Condensed Matter. 2008; in press. doi: 10.1088/0953-8984/21/3/034114.
  • 32. Calderon C, Harris N, Kiang C-H, Cox D. J. Phys. Chem. B. 2008; in press. doi: 10.1021/jp807908c.
  • 33. Hummer G. New J. Phys. 2005;7:1.
  • 34. Calderon C. Multiscale Mod Sim. 2007;6:656.
  • 35. Chahine J, Oliveira R, Leite V, Wang J. Proc. Natl. Acad. Sci. USA. 2007;104:14646. doi: 10.1073/pnas.0606506104.
  • 36. Snow C, Rhee Y, Pande VS. Biophys. J. 2006;95:078102. doi: 10.1529/biophysj.105.075689.
  • 37. Sigg D, Qian H, Bezanilla F. Biophys. J. 1999;76:782. doi: 10.1016/S0006-3495(99)77243-7.
  • 38. Park S, Schulten K. J. Chem. Phys. 2004;120:5946. doi: 10.1063/1.1651473.
  • 39. Khatri BS, Byrne K, Kawakami M, Brockwell D, Smith D, Radford S, McLeish T. Faraday Discuss. 2008;139:35. doi: 10.1039/b716418c.
  • 40. Aït-Sahalia Y. Econometrica. 2002;70:223.
  • 41. Jimenez J, Ozaki T. J Time Ser. Anal. 2005;27:77.
  • 42. Hong Y, Li H. Rev Financ Stud. 2005;18:37–84.
  • 43. Chen S, C.Y. T. [Oct 1, 2008]; J. Econometrics. http://www.stat.iastate.edu/preprint/articles/2006-21.pdf submitted.
  • 44. Chen S, Tang C. Ann. Stat. 2008;36:167.
  • 45. Walther K, Brujic J, Li H, Fernandez J. Biophys. J. 2006;90:3806. doi: 10.1529/biophysj.105.076224.
  • 46. Minh D, Bui J, Chang C, Jain T, Swanson J, McCammon J. Biophys. J. Letters. 2005;72(89):L25. doi: 10.1529/biophysj.105.069336.
  • 47. Procacci P, S. M, Barducci A, Signorini G, Chelli R. J. Chem. Phys. 2006;125:164101. doi: 10.1063/1.2360273.
  • 48. Paramore S, Ayton G, Voth G. J. Chem. Phys. 2007;14:105105. doi: 10.1063/1.2764487.
  • 49. Henzler-Wildman K, et al. Nature. 2007;450:06410. doi: 10.1038/nature06522.
  • 50. Muller CW, Schulz GE. J. Mol. Biol. 1992;224(1):159. doi: 10.1016/0022-2836(92)90582-5.
  • 51. Arora K, Brooks C., III Proc. Natl. Acad. Sci. USA. 2007;104:18496. doi: 10.1073/pnas.0706443104.
  • 52.Socci ND, Onuchic JN, Wolynes PG. J. Chem. Phys. 1996;104:5860. [Google Scholar]
  • 53.Frenkel D, Smit B. Understanding Molecular Simulation: From Algorithms to Applications. 1st ed. Academic Press; San Diego, CA: 1996. Molecular dynamics simulations. pp. 75–88, 377. [Google Scholar]
  • 54.Beck D, Daggett V. Biophys. J. 2007;93:3382. doi: 10.1529/biophysj.106.100149.
  • 55.Kevrekidis I, Gear C, Hummer G. AIChE J. 2004;50:474.
  • 56.Givon D, Kevrekidis I, Kupferman R. Commun. Math. Sci. 2006;4:707.
  • 57.Horenko I, Hartmann C, Schütte C. Phys. Rev. E. 2007;76:016706. doi: 10.1103/PhysRevE.76.016706.
  • 58.Evensen G, van Leeuwen P. Mon. Weather. Rev. 2000;128:1852.
  • 59.Vanden-Eijnden E. Commun. Pur. Appl. Math. 2003;1:385.
  • 60.Zhang L, Mykland P, Ait-Sahalia Y. J. Am. Stat. Assoc. 2005;100:1394.
  • 61.Calderon C, Martinez J, Carroll R, Sorensen D. 2008. submitted.
  • 62.Calderon C, Janosi L, Kosztin I. Technical Report, TR08-24. CAAM Dept.: Rice University; 2008.
  • 63.Kou S, Xie X. Phys. Rev. Lett. 2004;93:18. doi: 10.1103/PhysRevLett.93.180603.
  • 64.Ait-Sahalia Y, Fan J, Peng H. [Oct 1, 2008];Social Science Research Network. http://ssrn.com/abstract=955820
  • 65.Kloeden P, Platen E. Numerical Solution of Stochastic Differential Equations. 1st ed. Springer-Verlag; Berlin: 1999. Introduction. p. 37.
  • 66.Chandrasekhar S. Rev. Mod. Phys. 1943;15:1.
  • 67.El-Ansary M, Khalil H. SIAM J. Control Optim. 1986;24:83.
  • 68.Skorokhod AV. Asymptotic Methods in the Theory of Stochastic Differential Equations. 1st ed. Amer Mathematical Society; Providence, RI: 1989. Asymptotic behavior of systems of stochastic equations containing a small parameter. p. 77.
  • 69.Krishnan J, Runborg O, Kevrekidis I. Comp. Chem. Eng. 2004;28:557.
  • 70.Erban R, Frewen T, Wang X, Elston T, Coifman R, Nadler B, Kevrekidis I. J. Chem. Phys. 2007;126:155103. doi: 10.1063/1.2718529.
  • 71.Pavliotis GA, Stuart AM. J. Stat. Phys. 2007;127:741.
  • 72.Kutoyants Y. Statistical Inference for Ergodic Diffusion Processes. 1st ed. Springer; New York: 2004. Diffusion processes and statistical problems. p. 50.
  • 73.Risken H. The Fokker-Planck Equation. 2nd ed. Springer-Verlag; Berlin: 1996. Fokker-Planck equation for one variable. p. 98.
  • 74.Chodera J, Swope W, Pitera J, Seok C, Dill K. J. Chem. Theory and Comput. 2007;3:26. doi: 10.1021/ct0502864.
  • 75.Chu J-W, Trout BL, Brooks BR. J. Chem. Phys. 2003;119(24):12708.
  • 76.Michael S, Salsbury F, Jr., Brooks C., III J. Chem. Phys. 2002;116(24):10606.
  • 77.Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M. J. Comp. Chem. 1983;4:187.
  • 78.Marsili S, Barducci A, Chelli R, Procacci P, Schettino V. J. Phys. Chem. B. 2006;110:14011. doi: 10.1021/jp062755j.
