Abstract
Accurate and efficient simulation of the thermodynamics and kinetics of protein–ligand interactions is crucial for computational drug discovery. Multiensemble Markov Model (MEMM) estimators can provide estimates of both binding rates and affinities from collections of short trajectories but have not been systematically explored for situations when a ligand is decoupled through scaling of non-bonded interactions. In this work, we compare the performance of two MEMM approaches for estimating ligand binding affinities and rates: (1) the transition-based reweighting analysis method (TRAM) and (2) a Maximum Caliber (MaxCal) based method. As a test system, we construct a small host–guest system where the ligand is a single uncharged Lennard-Jones (LJ) particle, and the receptor is an 11-particle icosahedral pocket made from the same atom type. To realistically mimic a protein–ligand binding system, the LJ ϵ parameter was tuned, and the system was placed in a periodic box with 860 TIP3P water molecules. A benchmark was performed using over 80 µs of unbiased simulation, and an 18-state Markov state model was used to estimate reference binding affinities and rates. We then tested the performance of TRAM and MaxCal when challenged with limited data. Both TRAM and MaxCal approaches perform better than conventional Markov state models, with TRAM showing better convergence and accuracy. We find that subsampling of trajectories to remove time correlation improves the accuracy of both TRAM and MaxCal and that in most cases, only a single biased ensemble to enhance sampled transitions is required to make accurate estimates.
I. INTRODUCTION
Estimating both the thermodynamics and kinetics of protein–ligand binding is essential for understanding biological function and for the rational design of therapeutics. In the last decade, alchemical free energy perturbation—a technique that relies on sampling from multiple thermodynamic ensembles—has emerged as the main tool to accurately estimate binding thermodynamics from molecular simulations.1–8 Along with the growing recognition of the importance of binding rates in drug discovery (in particular, the residence time9–11), there has also been increased interest in estimating the kinetics of protein–ligand binding using molecular simulation. Although different methods to estimate on- and off-rates of ligand-binding have been developed,12–21 most of them estimate rates separately from (or at the expense of) thermodynamics.
Markov State Models (MSMs) of molecular kinetics, which describe conformational dynamics as a network of transitions between metastable states,22–26 can provide combined estimates of the thermodynamics and kinetics of ligand binding from large ensembles of short trajectories.27–31 In practice, however, estimates of slow dissociation rates are limited by the sampling of rare events because a key assumption is that the trajectories are sampled at equilibrium.32,33 Although rare events can be sampled in ultra-long trajectories using specialized hardware,34 a more general approach would be desired.
A great improvement to this situation has come from the introduction of the so-called multi-ensemble Markov Model (MEMM) estimators, which use trajectory data collected from multiple thermodynamic ensembles to make MSM estimates of thermodynamics and kinetics.35–39 The essential idea, as applied to ligand binding, is to collect energy snapshots and metastable state transitions in biased thermodynamic ensembles that are not limited by rare event sampling, in order to make more statistically significant estimates of rates and affinities for the unbiased ensemble. One of these estimators, the transition-based reweighting analysis method (TRAM) of Wu et al.,37 has been used to estimate the slow dissociation of small-molecule and peptide ligands using harmonic bias potentials.40,41 Less explored has been the use of TRAM with ensembles of scaled non-bonded protein–ligand interactions, as is common in alchemical free energy methods.
Another MSM approach that utilizes multiple thermodynamic ensembles to infer binding affinities and rates is the Maximum Caliber (MaxCal) based method described in Wan et al.42 In this approach, the principle of maximum path entropy (caliber) is used to infer changes in inter-state transition rates from changes in the equilibrium populations.43–49 Like TRAM, the MaxCal approach relies on the ability to collect many transitions in a biased thermodynamic ensemble, enabling more statistically significant estimation of transition rates for an unbiased ensemble. As with TRAM, the use of MaxCal with ensembles where non-bonded protein–ligand interactions are scaled has not been systematically explored.
In this article, our goal is to test the performance of TRAM and MaxCal vs conventional MSM approaches in estimating ligand binding rates and affinities, for a set of thermodynamic ensembles with decoupled ligand interactions, when challenged with limited sampling. To do this, we first constructed a toy host–guest system that realistically mimics protein–ligand association and performed thorough sampling to obtain a reference benchmark for comparison. We then test the performance of TRAM and MaxCal estimators using as input a collection of short trajectories sampled in each thermodynamic ensemble.
II. METHODS
A. Simulation system
In our toy model of ligand–receptor binding, the receptor is an 11-particle icosahedral binding pocket composed of atoms with the same Lennard-Jones (LJ) parameters as a CT (sp3 carbon) atom type in the AMBER force fields: σ = 0.339 967 nm and ϵ = 0.457 730 kJ mol−1 (Fig. 1). Equilibrium bond lengths, bond force constants, and dihedral potentials were kept the same as a CT–CT AMBER bond type, with a soft harmonic angle potential (θ0 = 180.0° and kθ = 150.0 kJ mol−1 rad−2) to enforce the icosahedral shape. The ligand is a single uncharged LJ particle of the same type, where we varied the ϵ parameter from 0.0 to 10.0 kJ mol−1 in increments of 0.5 kJ mol−1.
FIG. 1.
A toy host–guest system that mimics protein–ligand binding. The receptor is an icosahedral binding pocket of Lennard-Jones (LJ) particles (beige) that binds a single LJ particle ligand (cyan) in its central cavity. This system is solvated in a periodic box with counterions. Particles are labeled with their atom indices.
All simulations used the GROMACS 5.1.4 molecular dynamics package.50 The system was solvated in a cubic periodic box with 860 TIP3P waters, two Na+ ions, and two Cl− ions and equilibrated in the NPT ensemble at 300 K and 1 atm using a Berendsen thermostat to determine the box volume, (3.008 59 nm)³, used for production runs in the NVT ensemble. For the production runs, the stochastic dynamics thermostat was used to control the temperature (300 K), with a 2 fs time step, hydrogen bonds constrained using the LINear Constraint Solver (LINCS), and Particle Mesh Ewald (PME) electrostatics. System topologies and simulation parameters can be found at https://github.com/yunhuige/toy_model_paper.
To choose the value of ϵ used in all subsequent tests, we simulated ten trajectories of length 1 µs for each of the 21 different values of ϵ, resulting in 210 µs of aggregate trajectory data. Snapshots were recorded every 100 ps. After inspecting the apparent residence times of the ligand for each of these simulations (Fig. 2), we chose an ϵ value of 2.5 kJ mol−1 for our reference system. The 1 µs trajectory data suggest an apparent bound-state population of πbound = 81.9% and residence time in the range of tens of nanoseconds.
FIG. 2.
Trajectory traces of ligand distance from the center of the binding pocket, shown for selected ϵ values: (a) 0.0 kJ mol−1, (b) 1.5 kJ mol−1, (c) 2.5 kJ mol−1, and (d) 4.0 kJ mol−1, labeled with the apparent bound-state population πbound in each 1 µs trajectory. The value ϵ = 2.5 kJ mol−1 was chosen for all subsequent tests.
After establishing the value of ϵ to use in our reference system, we performed further sampling to generate an exhaustive set of trajectory data to be used for making high-quality reference estimates of binding rates and affinities. Twenty trajectories were generated, each of length 4 µs, for an aggregate of 80 µs. Coordinates were saved at a frequency of 1 ps to preserve kinetic information at high temporal resolution.
1. Ligand-decoupling simulations
The free energy perturbation (FEP) functionality of GROMACS was used to perform simulations in which the ligand was decoupled by scaling non-bonded interactions (in this case, van der Waals interactions because the ligand is uncharged). Separate simulations were performed for scaling constants λ = 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0, where λ = 1.0 represents the unbiased thermodynamic ensemble with full van der Waals (vdW) interactions. Two starting configurations were used to initialize the simulations: one where the ligand is in the bound state and another where the ligand is unbound in solution. For each of the two starting configurations, trajectories of length 400 ns were generated, with snapshots recorded every 1 ps. Soft-core interactions51 were used for non-bonded interactions (sc-alpha = 0.5, sc-power = 1, and sc-sigma = 0.3). In total, 4.8 µs (6 × 2 × 400 ns) of trajectory data were collected, i.e., 800 ns of data for each λ value. Velocity Verlet integration was performed using a velocity-rescaling thermostat.
B. Multiensemble dynamical estimators
1. Conventional Markov state models
The PyEMMA software package52 was used to construct conventional MSMs. The first step in this process is to assign trajectory snapshots to discrete metastable states. Here, the toy binding system is simple enough that we can manually assign metastable states (Fig. 3): a bound state (0) where the ligand is strongly bound inside the receptor, encounter-complex states (1–15) where the ligand is weakly bound to one of the faces of the icosahedron, a state where the ligand associates with the entrance of the binding site (16), and an unbound state (17).
FIG. 3.
Metastable state indices used in the construction of an 18-state MSM.
State indices i are assigned based on the ligand position vector $\mathbf{r}$, where the origin is the center of the binding site (computed as the mean position of the receptor atoms, excluding atom 9 in Fig. 1), and its distance from the origin, $r = |\mathbf{r}|$. If $r > 0.85$ nm, the ligand is considered unbound (state 17); if $r < 0.2$ nm, it is considered bound (state 0). Otherwise, it is assigned to the state i having the smallest angular distance to the unit vectors $\hat{\mathbf{n}}_i$ normal to each icosahedral face (the entrance is also considered a face): $i = \arg\min_i \arccos(\hat{\mathbf{r}} \cdot \hat{\mathbf{n}}_i)$, with $\hat{\mathbf{r}} = \mathbf{r}/r$ (Fig. 3).
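The assignment rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function name `assign_state` and the six axis-aligned normals in the usage lines are hypothetical stand-ins for the 16 face normals (15 icosahedral faces plus the entrance) of the real system, and the handling of distances exactly at a cutoff is an assumption.

```python
import math

R_BOUND = 0.2     # nm, bound-state cutoff quoted in the text
R_UNBOUND = 0.85  # nm, unbound-state cutoff quoted in the text

def assign_state(ligand_xyz, center_xyz, face_normals):
    """Return 0 (bound), 1..len(face_normals) (nearest face), or the last index (unbound)."""
    r_vec = [l - c for l, c in zip(ligand_xyz, center_xyz)]
    r = math.sqrt(sum(x * x for x in r_vec))
    if r < R_BOUND:
        return 0
    if r > R_UNBOUND:
        return len(face_normals) + 1
    # The smallest angular distance is the largest cosine between r and a face normal.
    best_index, best_cos = 0, -2.0
    for k, normal in enumerate(face_normals):
        n_len = math.sqrt(sum(x * x for x in normal))
        cos_angle = sum(a * b for a, b in zip(r_vec, normal)) / (r * n_len)
        if cos_angle > best_cos:
            best_index, best_cos = k, cos_angle
    return best_index + 1

# Hypothetical normals for illustration only (not the icosahedral geometry).
demo_normals = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
demo_state = assign_state((0.0, 0.5, 0.1), (0.0, 0.0, 0.0), demo_normals)
```

With 16 normals supplied, the unbound index returned by this sketch is 17, matching the state numbering in Fig. 3.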
An analysis of the distribution of sampled conformations shows that the 18 metastable states straightforwardly correspond to the minima of the radial potential of mean force U(r) (Fig. 4). The potential of mean force was corrected for finite box effects by numerically estimating the radial dependence of configuration space volume in a cubic box.
FIG. 4.
The potential of mean force U(r) as a function of the distance r between the ligand and the center of mass of the binding pocket. Minima of U(r) are annotated with corresponding system configurations and metastable state assignments.
We also examined a two-state decomposition into bound and unbound states, defined by a radial cutoff of 0.2 nm (see Sec. III). This coarse-grained state decomposition provided an opportunity to test the effect of state decomposition on kinetic and thermodynamic predictions.
Using the state assignments described earlier, a conventional MSM is constructed by first compiling the matrix of observed transition counts $c_{ij}$ (the number of times the system is in state i at time t and in state j at time t + τ) for a suitable lag time τ. To infer the matrix of transition probabilities pij, we use the standard maximum-likelihood estimator (MLE) enforcing a constraint on detailed balance. The transition matrix contains full information about the thermodynamics and kinetics. Equilibrium populations πi are obtained as the stationary eigenvector of the transition matrix. Inverse mean first passage times (1/MFPTs) between the bound and unbound states are used to compute binding (kon) and unbinding (koff) rates. MFPTs are calculated using a method implemented in PyEMMA in which a set of self-consistent equations involving elements of the transition matrix is solved to obtain the expectation value of the first passage time between two subsets of states.53
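The workflow in this paragraph (counts, transition matrix, stationary distribution, MFPT) can be illustrated with a small standard-library sketch. It is not the PyEMMA implementation: all function names are hypothetical, the transition matrix is obtained by simple row normalization rather than the reversible MLE used in this work, and the stationary distribution and MFPT equations are solved by plain fixed-point iteration, which assumes every state is visited and the target set is reachable.

```python
def count_matrix(dtraj, n_states, lag):
    """Sliding-window transition counts c_ij at a lag given in frames."""
    counts = [[0] * n_states for _ in range(n_states)]
    for t in range(len(dtraj) - lag):
        counts[dtraj[t]][dtraj[t + lag]] += 1
    return counts

def transition_matrix(counts):
    """Row-normalize counts (non-reversible estimate; a simplification)."""
    return [[c / sum(row) for c in row] for row in counts]

def stationary_distribution(p, n_iter=10000):
    """Power iteration for the left eigenvector with eigenvalue 1."""
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi

def mfpt(p, target, lag_time, n_iter=10000):
    """Iterate m_i = lag_time + sum_{j not in target} p_ij m_j, with m_i = 0 on the target."""
    n = len(p)
    m = [0.0] * n
    for _ in range(n_iter):
        m = [0.0 if i in target else
             lag_time + sum(p[i][j] * m[j] for j in range(n) if j not in target)
             for i in range(n)]
    return m

# Toy two-state example (hypothetical numbers), lag time of 1 ns.
p_demo = [[0.9, 0.1], [0.3, 0.7]]
pi_demo = stationary_distribution(p_demo)
mfpt_to_state1 = mfpt(p_demo, {1}, 1.0)
```

The inverse of an MFPT computed this way gives a first-order rate in units of the inverse lag-time unit.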
2. Transition-based reweighting analysis method (TRAM)
The TRAM estimator,37 as implemented in the PyEMMA software package,52 was used to construct multiensemble Markov models from the simulation trajectory data. TRAM attempts to simultaneously optimize (1) the transition probabilities $p_{ij}^{(k)}$ between states i and j, for given free energies $f_i^{(k)}$ of all conformational states i in each thermodynamic ensemble k, and (2) the free energies $f_i^{(k)}$ for all thermodynamic ensembles at each conformational state i. This is achieved by maximizing a joint likelihood function $L_{\mathrm{TRAM}}$ that is the product of a reversible MSM estimator likelihood function and a free energy estimator likelihood function,

$$
L_{\mathrm{TRAM}} = \prod_{k} \underbrace{\left[ \prod_{i,j} \left( p_{ij}^{(k)} \right)^{c_{ij}^{(k)}} \right]}_{\text{reversible MSM}} \; \underbrace{\left[ \prod_{i} \prod_{x \in X_i^{(k)}} \mu(x)\, e^{f_i^{(k)} - b^{(k)}(x)} \right]}_{\text{free energy estimator}}, \tag{1}
$$

where $p_{ij}^{(k)}$ is the transition probability between states i and j given observed transition counts $c_{ij}^{(k)}$ in ensemble k, $\mu(x)$ is the normalized equilibrium distribution [such that $\sum_x \mu(x) = 1$] of samples x, $X_i^{(k)}$ is the set of samples assigned to state i in the kth ensemble, $f_i^{(k)}$ is the local free energy of state i in ensemble k, and $b^{(k)}(x)$ is the bias potential. The TRAM solution is identical to the reversible MSM estimator when only a single thermodynamic ensemble is available. In the limit of infinite sampling, TRAM is equivalent to the multistate Bennett acceptance ratio (MBAR) estimator.37,54
3. Maximum-caliber (MaxCal) method
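In the Maximum-caliber (MaxCal) method of Wan et al.,42 a conventional MSM is first constructed from trajectory data sampled in thermodynamic ensemble k, to obtain equilibrium populations $\pi_i^{(k)}$ and transition probabilities $p_{ij}^{(k)}$. Given estimates of perturbed equilibrium populations $\pi_i^{(l)}$ in thermodynamic ensemble l, the MaxCal method infers transition probabilities $p_{ij}^{(l)}$ by maximizing the path entropy, or caliber $\mathcal{C}$,

$$
\mathcal{C} = -\sum_{i,j} \pi_i^{(l)} p_{ij}^{(l)} \ln \frac{p_{ij}^{(l)}}{p_{ij}^{(k)}}, \tag{2}
$$

with a restraint on detailed balance. The solution to this maximization problem is the self-consistent iteration of two equations involving a set of weights $w_i$,

$$
p_{ij}^{(l)} = w_i w_j \sqrt{\frac{\pi_j^{(l)}}{\pi_i^{(l)}}} \sqrt{p_{ij}^{(k)} p_{ji}^{(k)}}, \tag{3}
$$

$$
w_i = \left[ \sum_j w_j \sqrt{\frac{\pi_j^{(l)}}{\pi_i^{(l)}}} \sqrt{p_{ij}^{(k)} p_{ji}^{(k)}} \right]^{-1} \tag{4}
$$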
(note that the solution of Wan et al.42 applies to discrete-time kinetic networks; a more general solution for continuous rate equations is provided by Dixit et al.55).
In our context, thermodynamic ensemble k is one in which the ligand is sufficiently decoupled so as to observe many transitions, and ensemble l is the unbiased thermodynamic ensemble. To obtain estimates of the perturbed equilibrium populations $\pi_i^{(l)}$ in ensemble l, we use the MBAR free energy estimator54 as implemented in the PyEMMA software package.52
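A self-consistent MaxCal update of this kind can be sketched as follows. The sketch assumes the solution takes the form p_ij(l) = w_i w_j sqrt(π_j(l)/π_i(l)) sqrt(p_ij(k) p_ji(k)), with the weights w_i fixed by requiring each row to sum to one; that form follows from maximizing the relative path entropy under normalization and detailed balance, but it may differ in notation from the published equations. The function name `maxcal_update` and the damped (geometric-mean) weight update are choices made here for numerical stability, not details taken from Wan et al.

```python
import math

def maxcal_update(p_prior, pi_new, tol=1e-12, max_iter=100000):
    """Infer transition probabilities consistent with new populations pi_new.

    p_prior: reversible transition matrix from the sampled ensemble (list of lists).
    pi_new:  target equilibrium populations in the ensemble of interest.
    """
    n = len(p_prior)
    # a_ij = sqrt(pi_j / pi_i) * sqrt(p_ij * p_ji), built from the prior model.
    a = [[math.sqrt(pi_new[j] / pi_new[i]) * math.sqrt(p_prior[i][j] * p_prior[j][i])
          for j in range(n)] for i in range(n)]
    w = [1.0] * n
    for _ in range(max_iter):
        row = [sum(a[i][j] * w[j] for j in range(n)) for i in range(n)]
        # Row sums of w_i * a_ij * w_j must equal 1 at convergence.
        if max(abs(w[i] * row[i] - 1.0) for i in range(n)) < tol:
            break
        # Damped step toward the fixed point w_i = 1 / sum_j a_ij w_j.
        w = [math.sqrt(w[i] / row[i]) for i in range(n)]
    return [[w[i] * a[i][j] * w[j] for j in range(n)] for i in range(n)]

# Toy example (hypothetical numbers): prior populations are (0.75, 0.25).
p_k = [[0.9, 0.1], [0.3, 0.7]]
p_l = maxcal_update(p_k, [0.5, 0.5])
```

By construction, the returned matrix satisfies detailed balance with respect to the target populations for any choice of weights; the iteration only enforces row normalization. Passing the prior populations as the target leaves the prior transition matrix unchanged.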
III. RESULTS
A. Determination of reference rates and affinities for the toy ligand binding system
After choosing the value ϵ = 2.5 kJ mol−1 to be used in our tests (see Sec. II), we performed a series of long simulations (aggregate trajectory data of 20 × 4 μs = 80 µs) to precisely determine reference values for the bound-state equilibrium population πbound, kon, and koff rates, via the construction of a conventional MSM.
To determine a suitable lag time τ for the construction of the MSM, a series of MSMs were constructed for lag times up to 4 ns, and the slowest implied timescale t2 = −τ/(ln μ2) computed, where μ2 < 1 is the second-largest eigenvalue of the transition probability matrix. This timescale corresponds to the slowest dynamical motion observed in the simulations, which in this case corresponds to ligand binding and unbinding. Thus, t2 very closely approximates $(k_{\mathrm{on}}[\mathrm{L}] + k_{\mathrm{off}})^{-1}$, the relaxation time of two-state binding at ligand concentration [L].
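As a small worked example of the implied-timescale formula, the sketch below evaluates t2 = −τ/ln μ2 for a toy two-state transition matrix and for the same model propagated to twice the lag time. The matrix entries and helper names are hypothetical; for a 2 × 2 row-stochastic matrix the eigenvalues are 1 and (trace − 1), so no eigensolver is needed.

```python
import math

def second_eigenvalue_2state(p):
    """Eigenvalues of a 2x2 row-stochastic matrix are 1 and trace - 1."""
    return p[0][0] + p[1][1] - 1.0

def implied_timescale(mu2, lag_time):
    """t2 = -tau / ln(mu2), in the same units as lag_time."""
    return -lag_time / math.log(mu2)

def matmul(a, b):
    """Plain matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

p_tau = [[0.9, 0.1], [0.3, 0.7]]   # hypothetical model at lag tau = 1 ns
p_2tau = matmul(p_tau, p_tau)      # the same Markov model at lag 2 ns

t2_at_tau = implied_timescale(second_eigenvalue_2state(p_tau), 1.0)
t2_at_2tau = implied_timescale(second_eigenvalue_2state(p_2tau), 2.0)
```

For an exactly Markovian model the two values coincide, which is the plateau behavior that the implied-timescale test looks for in estimated MSMs.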
The implied timescale plot of t2 vs the lag time τ plateaus after a few hundred picoseconds, indicating that the dynamics is sufficiently Markovian [Fig. 5(a)]. Based on this result, we chose a lag time of τ = 1 ns to construct an 18-state MSM to compute reference rates and affinities (Table I). Association and dissociation rates kon and koff were computed from inverse mean first passage times, and bound-state populations πbound were estimated from the equilibrium populations of the MSM, with uncertainties estimated using the standard deviation of the results across 20 independent trajectories. The dissociation constant KD was estimated from the computed rates as koff/kon, with uncertainty estimates from 10 000 trials with normally distributed error. We also tried constructing a two-state MSM [bound and unbound states, Table I and Fig. 5(b)] and found that both kon and koff were larger than the 18-state MSM estimates, likely due to discretization error.24
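The propagation of rate uncertainties into KD can be sketched as a simple Monte Carlo calculation. This is one plausible reading of "10 000 trials with normally distributed error", assuming independent normal draws for kon and koff centered on the 18-state values in Table I; the function name and the fixed random seed are illustrative choices.

```python
import random
import statistics

def kd_with_uncertainty(kon, kon_err, koff, koff_err, n_trials=10000, seed=0):
    """Sample koff/kon with independent, normally distributed errors on each rate."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_trials):
        kon_sample = rng.gauss(kon, kon_err)
        koff_sample = rng.gauss(koff, koff_err)
        samples.append(koff_sample / kon_sample)
    return statistics.mean(samples), statistics.stdev(samples)

# 18-state reference rates from Table I: kon in M^-1 s^-1, koff in s^-1.
kd_mean, kd_std = kd_with_uncertainty(3.165e9, 0.488e9, 3.294e7, 0.588e7)
print(f"KD = {kd_mean * 1e3:.2f} +/- {kd_std * 1e3:.2f} mM")
```

Because the ratio of two normal variables is skewed, the sampled mean sits slightly above the point estimate koff/kon.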
FIG. 5.
Slowest MSM implied timescale t2 vs MSM lag time for (a) an 18-state MSM, and (b) a 2-state MSM.
TABLE I.
Reference values of ligand binding rates and affinities estimated from conventional MSMs.
| | kon (×10⁹ M⁻¹ s⁻¹) | koff (×10⁷ s⁻¹) | πbound (%) | KD (mM) |
|---|---|---|---|---|
| 18-state | 3.165 ± 0.488 | 3.294 ± 0.588 | 84.4 ± 0.4 | 10.66 ± 2.61 |
| 2-state | 3.269 ± 0.516 | 3.672 ± 0.640 | 84.4 ± 0.4 | 11.54 ± 2.87 |
To explore how these (18-state MSM) estimates depend on the amount of trajectory data sampled, we divided the trajectories into 13 blocks and constructed MSMs using cumulatively increasing amounts of data (e.g., 1%, 2%, 5%, 10%, 20%, 30%, 40%, …, 100%). The estimated bound state populations vs increasing data in Fig. 6 show that about 20% of the data (800 ns for each independent run) is necessary for reasonably accurate estimates, with converged values obtained after about 60% of the trajectory. The estimated binding and unbinding rates, however, converge faster, requiring only about 25% of the trajectory data. (Similar results were found for the two-state MSM, data not shown.)
FIG. 6.
Estimated bound-state populations (a) and rates (b) as a function of the amount of input data.
These results illustrate the difficulty of estimating both thermodynamics and kinetics of ligand binding. Even in this simple model with a residence time of 1/koff = 30.3 ns, a large amount of trajectory data (about 60% of 4 µs = 2.4 µs per trajectory) is required for converged estimation using MSMs. For actual ligand binding to protein receptors, residence times often exceed 1 s.56 Thus, realistic all-atom simulations of these processes would surely require an amount of total data exceeding available computing time.
B. A single additional λ-scaled ensemble can efficiently estimate binding affinities and rates
With converged estimates of ligand binding rates and affinities in hand for our reference system, we now can assess the performance of multiensemble estimators. First, we examine the accuracy of TRAM and MaxCal estimators in the case where we are allowed to include one additional biased thermodynamic ensemble. Whereas our reference calculation used 80 µs of aggregate trajectory data (20 trajectories of 4 µs each), in this test, we limit the trajectory data from the unbiased ensemble (i.e., λ = 1.0) to only 1% of that used for the reference calculations (two trajectories of 400 ns each). Combined with the one additional biased thermodynamic ensemble, the total data we used here (2 × 2 × 400 ns = 1.6 µs) are only 2% of the reference calculations (80 µs) and 10% of the data required for converged results from the reference calculations.
We used data from the λ = 1.0 simulations as the unbiased ensemble and included one more ensemble with a λ value taken from the set (0.5, 0.6, 0.7, 0.8, and 0.9). For comparison, we also built MEMMs using trajectory data from all λ values. For the MaxCal estimator, our protocol was to first build a conventional MSM using only unbiased (λ = 1.0) trajectories to estimate equilibrium state populations and transition rates. Then, we used the same data as used in MEMM construction (unbiased + one biased ensemble, or all ensembles) as input to MBAR to make improved estimates of equilibrium state populations in the λ = 1.0 ensemble. Using these new populations, MaxCal was used to infer improved estimates of the transition rates.
It is well-known that subsampling of trajectory data to remove time-correlated data is important to accurately estimate the uncertainty of calculated free energies.8,54,57,58 We suspected that such considerations are similarly important for MEMM estimators TRAM and MaxCal. To remove time-correlation in our trajectory data, we subsampled trajectories generated in each thermodynamic ensemble at a time interval of (2τc + 1) steps, where τc is an estimate of the integrated correlation time calculated from the distance of the ligand from the origin (the center of the pocket, see Sec. II) over time.
After subsampling, we find that the number of effectively independent samples decreases dramatically (Table II). For thermodynamic ensembles λ ≥ 0.7, correlation times are so long that subsampling would be at intervals longer than the MSM lag time of 1 ns. Therefore, we limit subsampling of trajectories from these ensembles to a maximum interval of 1 ns.
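The subsampling rule can be illustrated with a short standard-library sketch. It assumes τc is estimated by summing the normalized autocorrelation function of the ligand distance until it first becomes non-positive, which is one common convention and not necessarily the estimator used here; the function names, the rounding up of the stride, and the synthetic test series are all hypothetical.

```python
import math
import random

def integrated_correlation_time(x):
    """Sum the normalized autocorrelation C(t) over lags until it first drops to <= 0."""
    n = len(x)
    mean = sum(x) / n
    dx = [v - mean for v in x]
    var = sum(d * d for d in dx) / n  # assumes the series is not constant
    tau = 0.0
    for t in range(1, n):
        c = sum(dx[i] * dx[i + t] for i in range(n - t)) / ((n - t) * var)
        if c <= 0.0:
            break
        tau += c
    return tau

def subsample_stride(x, max_stride):
    """Stride of (2 * tau_c + 1) frames, rounded up and capped at max_stride."""
    tau_c = integrated_correlation_time(x)
    return min(max_stride, max(1, math.ceil(2.0 * tau_c + 1.0)))

# Synthetic series for illustration: white noise and a correlated AR(1) process.
rng = random.Random(1)
white = [rng.gauss(0.0, 1.0) for _ in range(2000)]
correlated = [0.0]
for _ in range(1999):
    correlated.append(0.9 * correlated[-1] + rng.gauss(0.0, 1.0))

stride_white = subsample_stride(white, max_stride=1000)
stride_correlated = subsample_stride(correlated, max_stride=1000)
stride_capped = subsample_stride(correlated, max_stride=4)
```

The `max_stride` argument plays the role of the cap described above: with 1 ps frames and a 1 ns lag time, it would be set to 1000 frames.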
TABLE II.
Number of samples used in the calculation before and after subsampling. The maximum amount of subsampling is for a stride equal to the lag time, resulting in 802 subsamples.
| | λ = 0.5 | λ = 0.6 | λ = 0.7 | λ = 0.8 | λ = 0.9 | λ = 1.0 |
|---|---|---|---|---|---|---|
| All | 400 001 | 400 001 | 400 001 | 400 001 | 400 001 | 400 001 |
| Subsampled | 3 602 | 1 602 | 802 | 802 | 802 | 802 |
The performance of different estimators using subsampled and non-subsampled data is shown in Fig. 7. These results clearly show the advantages of using simulations performed in λ-scaled thermodynamic ensembles to make multiensemble estimates of binding affinities and rates. Whereas conventional MSM estimates (using only the unbiased λ = 1 ensemble) underestimate bound-state populations and off-rates, multiensemble estimates using only one additional thermodynamic ensemble more accurately predict the reference values.
FIG. 7.
Estimated rates and affinities from conventional (MSM) vs multiensemble (TRAM and MaxCal) estimators applied to various trajectory datasets. Each dataset consists of two 400 ns trajectories biased using a particular value of λ (e.g., λ = 0.5) and two 400 ns unbiased trajectories (λ = 1.0). Also considered were datasets in which trajectories from all λ values were included. Shown are estimated bound-state populations from (a) subsampled and (b) non-subsampled data, estimated values of kon from (c) subsampled and (d) non-subsampled data, and estimated values of koff from (e) subsampled and (f) non-subsampled data. The red line and shaded regions represent the values computed from our reference simulations and their uncertainties, respectively.
In particular, thermodynamic ensembles with the most decoupling (λ = 0.5 and λ = 0.6) are best at reproducing populations and dissociation rates. It is in these ensembles that more frequent binding and unbinding transitions are observed, contributing most favorably to the statistical precision of affinity and rate estimates. Specifically, the use of trajectory data from λ = 0.6 ensembles appears to give the best results, which, apart from finite sampling issues, may be a consequence of fewer re-binding events in the λ = 0.5 ensemble. Indeed, simulations at small λ values spend the majority of time in the unbound state, which is less useful for rate estimation (data not shown).
In these tests, conventional MSMs still provide more accurate estimates of on-rates than the multiensemble methods. Based on the high correlation between kon and equilibrium population estimates, the decrease in accuracy likely arises as a consequence of detailed balance. Nevertheless, when subsampling is used, the multiensemble estimates of association rates are reasonable: within or just outside the uncertainty of the reference simulations.
In all categories (populations, kon and koff), TRAM estimates moderately outperform MaxCal estimates when compared to the reference values. Although we did not estimate the uncertainties in Fig. 7 (all data were used in our estimations), we did such analysis in Fig. 8 (more details below). Based on our results in Fig. 8, the overall uncertainty of these results is likely comparable to the uncertainty calculated from our reference simulations (red shaded area in Fig. 7). If we use the reference simulation uncertainty as a proxy, we would conclude that the differences between TRAM and MaxCal estimates of kon are not significant, while differences between TRAM and MaxCal estimates of koff are significant. Objectively better estimation of koff is a noteworthy advantage of TRAM. To rationalize this finding, one may consider that free energy estimates from TRAM are informed by both sampled energies and transition counts, while MaxCal free energy estimates (calculated by MBAR) are only informed by sampled energies.
FIG. 8.
Estimated bound-state populations (a), kon (c), and koff (e) as a function of increasing amounts of subsampled input data for ensembles (λ = 0.6, 1.0). Uncertainties in bound-state populations (b), kon (d), and koff (f) [the error bars shown in panels (a), (c), and (e)], estimated by computing standard deviations from ten independent trials.
For both TRAM and MaxCal estimates, inclusion of trajectory data from all thermodynamic ensembles results in only modest improvement in estimated quantities, rivaled by the results when only trajectory data from λ = 0.6 and λ = 1.0 ensembles are used. This suggests that bias from finite sampling is quite important; including more trajectory data from ensembles where few transitions are observed is less effective than including data where many transitions are observed. A similar conclusion can be made from the comparison of subsampled vs non-subsampled data. Despite the severe reduction in the number of samples used in the calculation, predictions using subsampled trajectory data [Figs. 7(a), 7(c), and 7(e)] are more accurate than those made using non-subsampled data [Figs. 7(b), 7(d), and 7(f)].
C. Convergence of multiensemble estimates with increasing amounts of trajectory data
To assess the robustness of conventional and multiensemble estimators in dealing with limited input data, we performed tests in which different subsets of data (after subsampling) are used in the estimations. In these tests, we randomly select a given extent of λ = 0.6 and λ = 1.0 trajectory data (e.g., 10%, 20%, …, 90% of the full trajectory) as a block segment, to be used as input to conventional and multiensemble estimators. This operation is repeated ten times, resulting in ten independent trials for statistical analysis.
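One way to set up such trials is sketched below, assuming each "block segment" is a randomly placed contiguous window covering the requested fraction of a trajectory. That interpretation, the function names, and the fixed seed are assumptions made for illustration; the estimator itself (MSM, TRAM, or MaxCal) would then be run on the frames inside each window.

```python
import random

def random_block(n_frames, fraction, rng):
    """Pick a random contiguous window covering `fraction` of a trajectory."""
    length = max(1, int(round(fraction * n_frames)))
    start = rng.randint(0, n_frames - length)  # inclusive bounds
    return start, start + length

def block_trials(n_frames, fraction, n_trials=10, seed=0):
    """Return (start, stop) frame windows for independent trials."""
    rng = random.Random(seed)
    return [random_block(n_frames, fraction, rng) for _ in range(n_trials)]

# Ten trials using 10% of a hypothetical 400-frame subsampled trajectory.
trial_windows = block_trials(400, 0.10)
```

The standard deviation of an estimate across the ten windows then serves as its uncertainty.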
As stated in the original TRAM paper,37 unlike MBAR, a global equilibrium sampling is not required for TRAM estimation because of its efficient use of simulation data. Our results, which suggest that multiensemble estimates from TRAM converge faster than the other estimators we evaluated, corroborate this statement. Using only 20% of the trajectory from ensembles λ = 0.6 and λ = 1.0, TRAM can accurately estimate bound-state populations, with faster convergence than MaxCal (Fig. 8).
To highlight the reduction of uncertainty as the amount of trajectory data is increased, the magnitudes of the uncertainty estimates [error bars in Figs. 8(a), 8(c), and 8(e)] are plotted separately [Figs. 8(b), 8(d), and 8(f)]. It is clear that TRAM estimates have the smallest uncertainties among all estimators studied, and the fastest convergence with increasing amounts of trajectory data. Using only 10% of the trajectory data, the uncertainties of MaxCal estimates are large, becoming comparable with TRAM uncertainties when at least 30% of the trajectory data is used.
IV. DISCUSSION
We have constructed a novel toy binding system that has several advantages. The system is small enough (2596 atoms) that it can be used to generate large amounts of trajectory data. Despite its simplicity, the system realistically mimics protein–ligand binding, with a well-defined binding site, encounter-complex conformations, and realistic solvation through the use of explicit water with counterions. It is straightforward to apply any number of enhanced sampling techniques to this system (e.g., umbrella simulations, free energy perturbation, metadynamics, etc.) and very easy to tune the LJ ϵ parameter to increase or decrease the ligand affinity to suit particular problems of interest. The metastable state decomposition is well-defined and does not require any specialized dimensionality reduction or featurization. Thus, we expect this toy binding system to find many uses in future studies.
In this work, we specifically address how multiensemble estimators of ligand binding rates and affinities perform when used with thermodynamic ensembles where the ligand is decoupled by scaling its non-bonded interactions. To our knowledge, this study is the first application of TRAM and MaxCal to this problem and the first to directly compare TRAM with MaxCal. The use of ligand decoupling as a bias potential has several advantages over umbrella sampling, a biasing technique that has been used in many of the first applications of TRAM.37,40,41 For one, ligand decoupling can take advantage of a vast array of existing simulation tools to perform these calculations (for example, soft-core potentials). Moreover, whereas umbrella sampling is good at focusing sampling on a certain region, ligand decoupling in effect serves to “de-focus” sampling away from the bound state, preserving the ability to sample transitions between many metastable states, a key requirement for the success of multiensemble estimators.
In our tests, we find that multi-ensemble estimators, TRAM and MaxCal, are superior to conventional MSM estimators and that TRAM yields the most accurate estimates when faced with limited trajectory data. MaxCal takes a close second place to TRAM, achieving comparable results (Fig. 7). In some applications, however, a MaxCal approach might be preferred to TRAM. The TRAM estimator is more computationally demanding than MaxCal + MBAR and requires samples from the unbiased thermodynamic state.
As is known to be the case for free energy estimators, we find that TRAM and MaxCal estimates are sensitive to the presence of time-correlated input data and that these estimates can significantly improve when trajectory data are subsampled to remove correlated samples. This is not surprising, since TRAM is itself a free energy estimator, and the MaxCal approach we use relies on free energy estimates from MBAR. Still, this finding underscores the need for careful curation of trajectory data to achieve accurate results.
The toy binding system we have examined here is admittedly very simple, with no rotational degrees of freedom of the ligand, and only one main binding route. The binding mechanism has no intermediates, as evidenced by the fact that even a 2-state multiensemble MSM is able to make good predictions of the association and dissociation rates and bound-state population. For more realistic systems, however, more sophisticated decoupling strategies may be required, especially if many binding pathways are possible or intermediates are required. In theory, both TRAM and MaxCal should be able to recapitulate binding rates and pathways in the unbiased ensembles, but ligand decoupling alone may not be enough to efficiently sample the most important binding and unbinding pathways.
V. CONCLUSION
In this work, we have introduced a novel toy model that realistically mimics a protein–ligand binding system. After tuning and thoroughly benchmarking the affinities and binding rates in this model, we used it to study the performance of multiensemble MSM estimators when used with ligand decoupling. The key idea in this work is that many more binding and unbinding events can be observed in decoupled thermodynamic ensembles, enabling better statistical estimation of rates and affinities. Indeed, we find that TRAM and MaxCal estimators outperform conventional MSM estimators at this task and that TRAM is more accurate overall at estimating slow dissociation rates. We also find that accuracy improves when trajectories are subsampled to remove time correlation.
These results suggest that multiensemble methods used with ligand decoupling simulations may be highly valuable for simultaneously obtaining accurate predictions of ligand binding affinities and rates. In the future, we anticipate the growing use of these methods in computational drug discovery.
ACKNOWLEDGMENTS
This research includes calculations carried out on HPC resources supported, in part, by the National Science Foundation through major research instrumentation (Grant No. 1625061), the U.S. Army Research Laboratory under Contract No. W911NF-16-2-0189, and the NIH Research Resource Computer Cluster (Grant No. S10-OD020095). V.A.V. and Y.G. were supported by the National Institutes of Health (Grant No. 1R01GM123296). Y.G. thanks Fabian Paul for his help in clarifying our understanding of the TRAM algorithm.
Contributor Information
Yunhui Ge, Email: yunhuig2@uci.edu.
Vincent A. Voelz, Email: voelz@temple.edu.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request. Files and scripts to prepare the simulations and analyze the results can be found at https://github.com/yunhuige/toy_model_paper.
REFERENCES
- 1. Chodera J. D., Mobley D. L., Shirts M. R., Dixon R. W., Branson K., and Pande V. S., Curr. Opin. Struct. Biol. 21, 150 (2011). doi:10.1016/j.sbi.2011.01.011
- 2. Mobley D. L. and Klimovich P. V., J. Chem. Phys. 137, 230901 (2012). doi:10.1063/1.4769292
- 3. Wang L., Wu Y., Deng Y., Kim B., Pierce L., Krilov G., Lupyan D., Robinson S., Dahlgren M. K., Greenwood J. et al., J. Am. Chem. Soc. 137, 2695 (2015). doi:10.1021/ja512751q
- 4. Aldeghi M., Heifetz A., Bodkin M. J., Knapp S., and Biggin P. C., Chem. Sci. 7, 207 (2016). doi:10.1039/c5sc02678d
- 5. Cournia Z., Allen B., and Sherman W., J. Chem. Inf. Model. 57, 2911 (2017). doi:10.1021/acs.jcim.7b00564
- 6. Cournia Z., Allen B. K., Beuming T., Pearlman D. A., Radak B. K., and Sherman W., J. Chem. Inf. Model. 60, 4153 (2020). doi:10.1021/acs.jcim.0c00116
- 7. Song L. F. and Merz K. M., Jr., J. Chem. Inf. Model. 60, 5308 (2020). doi:10.1021/acs.jcim.0c00547
- 8. Mey A. S. J. S., Allen B. K., Bruce Macdonald H. E., Chodera J. D., Hahn D. F., Kuhn M., Michel J., Mobley D. L., Naden L. N., Prasad S. et al., Living J. Comput. Mol. Sci. 2, 18378 (2020). doi:10.33011/livecoms.2.1.18378
- 9. Copeland R. A., Pompliano D. L., and Meek T. D., Nat. Rev. Drug Discovery 5, 730 (2006). doi:10.1038/nrd2082
- 10. Copeland R. A., Nat. Rev. Drug Discovery 15, 87 (2016). doi:10.1038/nrd.2015.18
- 11. Schuetz D. A., de Witte W. E. A., Wong Y. C., Knasmueller B., Richter L., Kokh D. B., Sadiq S. K., Bosma R., Nederpelt I., Heitman L. H., Segala E., Amaral M., Guo D., Andres D., Georgi V., Stoddart L. A., Hill S., Cooke R. M., De Graaf C., Leurs R., Frech M., Wade R. C., de Lange E. C. M., IJzerman A. P., Müller-Fahrnow A., and Ecker G. F., Drug Discovery Today 22, 896 (2017). doi:10.1016/j.drudis.2017.02.002
- 12. Mollica L., Theret I., Antoine M., Perron-Sierra F., Charton Y., Fourquez J.-M., Wierzbicki M., Boutin J. A., Ferry G., Decherchi S. et al., J. Med. Chem. 59, 7167 (2016). doi:10.1021/acs.jmedchem.6b00632
- 13. Dickson A., Tiwary P., and Vashisth H., Curr. Top. Med. Chem. 17, 2626 (2017). doi:10.2174/1568026617666170414142908
- 14. Kokh D. B., Amaral M., Bomke J., Grädler U., Musil D., Buchstaller H.-P., Dreyer M. K., Frech M., Lowinski M., Vallee F. et al., J. Chem. Theory Comput. 14, 3859 (2018). doi:10.1021/acs.jctc.8b00230
- 15. Bernetti M., Masetti M., Rocchia W., and Cavalli A., Annu. Rev. Phys. Chem. 70, 143 (2019). doi:10.1146/annurev-physchem-042018-052340
- 16. Nunes-Alves A., Kokh D. B., and Wade R. C., Curr. Opin. Struct. Biol. 64, 126 (2020). doi:10.1016/j.sbi.2020.06.022
- 17. Decherchi S. and Cavalli A., Chem. Rev. 120, 12788 (2020). doi:10.1021/acs.chemrev.0c00534
- 18. Lotz S. D. and Dickson A., J. Am. Chem. Soc. 140, 618 (2018). doi:10.1021/jacs.7b08572
- 19. Hall R., Dixon T., and Dickson A., Front. Mol. Biosci. 7, 106 (2020). doi:10.3389/fmolb.2020.00106
- 20. Tiwary P., Limongelli V., Salvalaglio M., and Parrinello M., Proc. Natl. Acad. Sci. U. S. A. 112, E386 (2015). doi:10.1073/pnas.1424461112
- 21. Jagger B. R., Lee C. T., and Amaro R. E., J. Phys. Chem. Lett. 9, 4941 (2018). doi:10.1021/acs.jpclett.8b02047
- 22. Noé F., Schütte C., Vanden-Eijnden E., Reich L., and Weikl T. R., Proc. Natl. Acad. Sci. U. S. A. 106, 19011 (2009). doi:10.1073/pnas.0905466106
- 23. Voelz V. A., Bowman G. R., Beauchamp K., and Pande V. S., J. Am. Chem. Soc. 132, 1526 (2010). doi:10.1021/ja9090353
- 24. Prinz J.-H., Wu H., Sarich M., Keller B., Senne M., Held M., Chodera J. D., Schütte C., and Noé F., J. Chem. Phys. 134, 174105 (2011). doi:10.1063/1.3565032
- 25. Zhou G. and Voelz V. A., J. Phys. Chem. B 120, 926 (2016). doi:10.1021/acs.jpcb.5b11767
- 26. Husic B. E. and Pande V. S., J. Am. Chem. Soc. 140, 2386 (2018). doi:10.1021/jacs.7b12191
- 27. Plattner N. and Noé F., Nat. Commun. 6, 7653 (2015). doi:10.1038/ncomms8653
- 28. Zhou G., Pantelopulos G. A., Mukherjee S., and Voelz V. A., Biophys. J. 113, 785 (2017). doi:10.1016/j.bpj.2017.07.009
- 29. Plattner N., Doerr S., De Fabritiis G., and Noé F., Nat. Chem. 9, 1005 (2017). doi:10.1038/nchem.2785
- 30. Ge Y., Borne E., Stewart S., Hansen M. R., Arturo E. C., Jaffe E. K., and Voelz V. A., J. Biol. Chem. 293, 19532 (2018). doi:10.1074/jbc.ra118.004909
- 31. Ge Y. and Voelz V. A., “Markov state models to elucidate ligand binding mechanism,” in Protein-Ligand Interactions and Drug Design, edited by Ballante F. (Springer, New York, 2021), pp. 239–259.
- 32. Wan H. and Voelz V. A., J. Chem. Phys. 152, 024103 (2020). doi:10.1063/1.5142457
- 33. Suárez E., Wiewiora R. P., Wehmeyer C., Noé F., Chodera J. D., and Zuckerman D. M., J. Chem. Theory Comput. 17, 3119 (2021). doi:10.1021/acs.jctc.0c01154
- 34. Pan A. C., Xu H., Palpant T., and Shaw D. E., J. Chem. Theory Comput. 13, 3372 (2017). doi:10.1021/acs.jctc.7b00172
- 35. Wu H., Mey A. S. J. S., Rosta E., and Noé F., J. Chem. Phys. 141, 214106 (2014). doi:10.1063/1.4902240
- 36. Rosta E. and Hummer G., J. Chem. Theory Comput. 11, 276 (2015). doi:10.1021/ct500719p
- 37. Wu H., Paul F., Wehmeyer C., and Noé F., Proc. Natl. Acad. Sci. U. S. A. 113, E3221 (2016). doi:10.1073/pnas.1525092113
- 38. Stelzl L. S., Kells A., Rosta E., and Hummer G., J. Chem. Theory Comput. 13, 6328 (2017). doi:10.1021/acs.jctc.7b00373
- 39. Kieninger S., Donati L., and Keller B. G., Curr. Opin. Struct. Biol. 61, 124 (2020). doi:10.1016/j.sbi.2019.12.018
- 40. Paul F., Wehmeyer C., Abualrous E. T., Wu H., Crabtree M. D., Schöneberg J., Clarke J., Freund C., Weikl T. R., and Noé F., Nat. Commun. 8, 1095 (2017). doi:10.1038/s41467-017-01163-6
- 41. Ge Y., Zhang S., Erdelyi M., and Voelz V. A., J. Chem. Inf. Model. 61, 2353 (2021). doi:10.1021/acs.jcim.1c00029
- 42. Wan H., Zhou G., and Voelz V. A., J. Chem. Theory Comput. 12, 5768 (2016). doi:10.1021/acs.jctc.6b00938
- 43. Dixit P. D., Jain A., Stock G., and Dill K. A., J. Chem. Theory Comput. 11, 5464 (2015). doi:10.1021/acs.jctc.5b00537
- 44. Dixit P. D., Wagoner J., Weistuch C., Pressé S., Ghosh K., and Dill K. A., J. Chem. Phys. 148, 010901 (2018). doi:10.1063/1.5012990
- 45. Dixit P. D. and Dill K. A., J. Chem. Theory Comput. 14, 1111 (2018). doi:10.1021/acs.jctc.7b01126
- 46. Meral D., Provasi D., and Filizola M., J. Chem. Phys. 149, 224101 (2018). doi:10.1063/1.5060960
- 47. Meral D., Provasi D., Prada-Gracia D., Möller J., Marino K., Lohse M. J., and Filizola M., Sci. Rep. 8, 7705 (2018). doi:10.1038/s41598-018-26070-8
- 48. Hu X., Wang Y., Hunkele A., Provasi D., Pasternak G. W., and Filizola M., PLoS Comput. Biol. 15, e1006689 (2019). doi:10.1371/journal.pcbi.1006689
- 49. Ghosh K., Dixit P. D., Agozzino L., and Dill K. A., Annu. Rev. Phys. Chem. 71, 213 (2020). doi:10.1146/annurev-physchem-071119-040206
- 50. Abraham M. J., Murtola T., Schulz R., Páll S., Smith J. C., Hess B., and Lindahl E., SoftwareX 1–2, 19 (2015). doi:10.1016/j.softx.2015.06.001
- 51. Beutler T. C., Mark A. E., van Schaik R. C., Gerber P. R., and van Gunsteren W. F., Chem. Phys. Lett. 222, 529 (1994). doi:10.1016/0009-2614(94)00397-1
- 52. Scherer M. K., Trendelkamp-Schroer B., Paul F., Pérez-Hernández G., Hoffmann M., Plattner N., Wehmeyer C., Prinz J.-H., and Noé F., J. Chem. Theory Comput. 11, 5525 (2015). doi:10.1021/acs.jctc.5b00743
- 53. Hoel P. G., Port S. C., and Stone C. J., Introduction to Stochastic Theory (Houghton Mifflin, Boston, MA, 1972).
- 54. Shirts M. R. and Chodera J. D., J. Chem. Phys. 129, 124105 (2008). doi:10.1063/1.2978177
- 55. Dixit P. D., J. Chem. Phys. 148, 091101 (2018). doi:10.1063/1.5023232
- 56. Bruce N. J., Ganotra G. K., Kokh D. B., Sadiq S. K., and Wade R. C., Curr. Opin. Struct. Biol. 49, 1 (2018). doi:10.1016/j.sbi.2017.10.001
- 57. Chodera J. D., Swope W. C., Pitera J. W., Seok C., and Dill K. A., J. Chem. Theory Comput. 3, 26 (2007). doi:10.1021/ct0502864
- 58. Grossfield A., Patrone P. N., Roe D. R., Schultz A. J., Siderius D. W., and Zuckerman D. M., Living J. Comput. Mol. Sci. 1, 5067 (2018). doi:10.33011/livecoms.1.1.5067