Abstract
In this work, we demonstrate that Linear Discriminant Analysis (LDA) applied to atomic positions in two different states of a biomolecule produces a good reaction coordinate between those two states. Atomic coordinates of a macromolecule are a direct representation of a macromolecular configuration, and yet, they are not used in enhanced sampling studies due to a lack of rotational and translational invariance. We resolve this issue using the technique of our prior work, whereby a molecular configuration is considered a member of an equivalence class in size-and-shape space, which is the set of all configurations that can be translated and rotated to a single point within a reference multivariate Gaussian distribution characterizing a single molecular state. The reaction coordinates produced by LDA applied to positions are shown to be good reaction coordinates both in terms of characterizing the transition between two states of a system within a long molecular dynamics (MD) simulation and also ones that allow us to readily produce free energy estimates along that reaction coordinate using enhanced sampling MD techniques.
1. Introduction
Many enhanced sampling techniques work by biasing a system to explore along a low dimensional set of collective variables (CVs).1 These methods allow us, in principle, to use the known applied bias to reconstruct the free energy landscape in that low dimensional space. In practice, the choice of the CVs is crucial, with an ideal set of CVs allowing the system to explore all relevant states within available simulation time.1 Recently, extensive effort has been invested in using a variety of machine learning approaches, from very simple to very sophisticated, to determine optimal coordinates for sampling from molecular dynamics (MD) simulation data (refs (2−21) provide a representative but not exhaustive sample).
One commonly encountered challenge is to compute the free energy path of a transition between two states along a linear dimension that chemists term a reaction coordinate (RC). For a macromolecule such as a protein, the two states could be configurations for which we have known structures (e.g., the PDB structure of a protein solved with and without a bound ligand) or processes for which one state is known and the other state can be at least qualitatively defined (e.g., folding/unfolding or binding/unbinding). If a long MD trajectory containing multiple transitions between these states is available, then reaction coordinates could be trained based on the idea that we want to enhance sampling along the slowest modes in the system.4,10,13,14,22,23 However, having this data is rare, in which case one can try iterating sampling and learning reaction coordinates with the goal of maximizing the number of transitions between the two states in a fixed amount of simulation time.4,5,11,13,15,24
An alternative approach which has shown some success is to train reaction coordinates based on short simulations within the two states and use a method that produces a coordinate representing the difference between the two sets of data. Linear dimensionality reduction techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are the simplest approaches for combining a large set of variables that describe a system of interest to produce a small set of CVs that characterize the available data. While PCA, which produces coordinates that capture the most variance in the data, has been used to promote exploration in enhanced sampling simulations, LDA seems to hold more promise as an RC since it is a supervised approach designed to maximally separate different labeled classes of data (i.e., reactants and products). We describe LDA in full detail in the next section. In one study, Mendels et al.6 produced a modified approach to LDA termed harmonic LDA (HLDA, because the covariance matrices in the two different states of interest are combined by a harmonic average rather than a simple sum) and, in that work and subsequent ones,7,9 combined it with Metadynamics (MetaD) to effectively enhance sampling between two states for several different systems. Later, a neural network was used to combine features before training LDA vectors to produce the reaction coordinate.16
In the prior examples of reaction coordinate design for free energy sampling of biomolecules that we are aware of, the input features to the method were internal coordinates, or a function of internal coordinates, for the molecule(s) of interest—for example, distances, angles, and dihedrals. Often, these could be CVs based not on atomic positions directly but on coarse-grained (CG) representations of the biomolecule, such as the distance between the centers of masses (COMs) of two different domains or the distance between the COM of a ligand and certain atoms in its binding pocket. This is not surprising, because these often correspond to our physical intuition about the biomolecular reaction coordinate. Moreover, internal coordinates are invariant to translation and rotation of the molecule, and thus bias forces applied to these coordinates do not depend on the position or orientation of the molecule.
Recently, we presented atomic coordinates as an alternative set of features to use in the context of clustering biomolecular data.25 Atomic coordinates of a subset of atoms, or of beads corresponding to a CG representation of a molecule, offer an alternative to internal coordinates with the advantage that there is little choice in selecting the features to use. Using a protein as an example, we need only make the standard choice between Cα atoms, backbone, all heavy atoms, and so on. Moreover, only 3N – 6 atomic coordinates essentially describe the state of a biomolecular system with N important atoms (but ignoring contributions of solvent, salt, etc.), whereas use of internal coordinates often results in an overdetermined set of features, such as all O(N²) pairs of distances. In ref (25), we developed a procedure for clustering molecular configurations into a Gaussian mixture model (GMM) using atomic positions that overcomes challenges of orientational dependence that prevented their use earlier, as described below. Because a Gaussian mixture model in positions is a natural way to coarse-grain a free energy landscape,25−28 with locally harmonic bins around metastable states, the resulting clustering is a physically appealing definition of the “states” a molecule can adopt.
However, our Gaussian mixture model still relies on a very high (3N – 6) dimensional representation of our molecule. Given that the output of our clustering algorithm is a set of states each defined by a multivariate Gaussian distribution, LDA is a natural approach to produce a low dimensional representation of our data with large separation between states. In this work, we first apply LDA to the folded and unfolded states determined from shapeGMM clustering of a long unbiased MD trajectory of a fast-folding protein and demonstrate that it produces a physically reasonable ordering of states from folded to unfolded. We then show that this coordinate is a “good” reaction coordinate because the position of the barrier separating folded and unfolded is very close to the location where the system is equally likely to proceed to folded or unfolded (in terms of a committor function to be defined below). We implement this position LDA coordinate in the PLUMED sampling library and demonstrate that biased sampling along this coordinate can accelerate transitions between the folded and unfolded states and produce a qualitatively similar free energy surface as compared to the unbiased trajectory in 3% of the simulation time, without any additional tuning of the CV. Finally, we train a position LDA coordinate on an achiral helical system where data is only available in the left- and right-handed states and show that this coordinate also allows us to readily sample between the two states, despite there being no information about the transition provided during training.
2. Theory and Methods
2.1. Molecules in Size-and-Shape Space
Consistent with our previous work on structural alignment and clustering,25 we consider structures from an MD simulation to be associated with Gaussian distributions in atomic positions. Structures are represented by N particles (a subset of atoms) using a vector x of dimensions N × 3 which is a member of an equivalence class
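$$[\mathbf{x}_i] = \left\{ \mathbf{x}_i \mathbf{R}_i + \mathbf{1}_N \vec{\xi}_i^{\,T} : \vec{\xi}_i \in \mathbb{R}^3,\ \mathbf{R}_i \in SO(3) \right\} \tag{1}$$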
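where $\vec{\xi}_i$ is a translation in $\mathbb{R}^3$, $\mathbf{R}_i$ is a rotation in $SO(3)$, and $\mathbf{1}_N$ is the N × 1 vector of ones. $[\mathbf{x}_i]$ is a point in size-and-shape space29 which has dimension 3N – 6 and is defined as $S\Sigma_3^N = \mathbb{R}^{N \times 3}/G$, where $G = \mathbb{R}^3 \times SO(3)$ is the group of all rigid-body transformations for each frame with elements $g = (\vec{\xi}, \mathbf{R})$.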
Within the shapeGMM framework, the probability density of particle positions is assumed to be a Gaussian mixture
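$$P(\mathbf{x}_i) = \sum_{j=1}^{K} \phi_j \, N(\mathbf{x}_i g_{i,j} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \tag{2}$$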
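where $N(\mathbf{x}_i g_{i,j} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ is the jth normalized, multivariate Gaussian with mean $\boldsymbol{\mu}_j$, covariance matrix $\boldsymbol{\Sigma}_j$, and weight $\phi_j$ (the weights are normalized such that $\sum_{j=1}^{K} \phi_j = 1$). $g_{i,j}$ is the element of G that minimizes the Mahalanobis distance between $\mathbf{x}_i$ and $\boldsymbol{\mu}_j$. Iterative determination of $g_{i,j}$ and $\boldsymbol{\mu}_j$ is performed in a Maximum Likelihood procedure.25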
In the current work, we will consider LDA coordinates learned using data from only two states. Additionally, we will only consider “weighted” alignment of particle positions, which equates to using a Kronecker product covariance (where Σj = ΣN ⊗ I3, for ΣN the N × N covariance of particle positions) in defining the Mahalanobis distance between frame and average structure as described in detail in ref (25).
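The idea of mapping each frame to a representative of its equivalence class can be illustrated with a minimal sketch. The code below is an assumption-laden simplification: it uses the standard Kabsch algorithm with uniform weights, not the covariance-weighted (Mahalanobis) alignment of ref (25), and the function name `align_uniform` is hypothetical. It follows the right-multiplication convention for rotations of an N × 3 frame.

```python
import numpy as np


def align_uniform(x, mu):
    """Translate and rotate frame x (N x 3) onto reference mu (N x 3).

    Simplified stand-in for the weighted alignment: all particles are
    weighted equally (Kabsch algorithm). Returns the centered, rotated
    frame and the 3 x 3 rotation matrix applied from the right.
    """
    x_c = x - x.mean(axis=0)      # remove translation of the frame
    mu_c = mu - mu.mean(axis=0)   # center the reference as well
    # 3 x 3 cross-covariance between frame and reference
    h = x_c.T @ mu_c
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so that the result is a proper rotation
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return x_c @ rot, rot
```

Any rigidly translated and rotated copy of the reference is mapped back onto the (centered) reference by this procedure, which is the property that makes aligned positions usable as features.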
2.2. Dimensionality Reduction Using Linear Discriminant Analysis on Particle Positions
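We propose to use LDA directly on aligned particle positions as a reaction coordinate. LDA for two states produces the linear model with the maximal interaverage variance while minimizing intracluster variance.30 For K different clusters, this is achieved by first computing the within-cluster scatter matrix
$$\mathbf{S}_w = \sum_{i=1}^{K} \sum_{\mathbf{x} \in C_i} (\mathbf{x} - \boldsymbol{\mu}_i)(\mathbf{x} - \boldsymbol{\mu}_i)^T \tag{3}$$
and the between-cluster scatter matrix
$$\mathbf{S}_b = \sum_{i=1}^{K} n_i \, (\boldsymbol{\mu}_i - \boldsymbol{\mu})(\boldsymbol{\mu}_i - \boldsymbol{\mu})^T \tag{4}$$
where $C_i$ is the set of $n_i$ frames assigned to cluster i, $\boldsymbol{\mu}_i$ is the average structure of cluster i, and $\boldsymbol{\mu}$ is the global average. The simultaneous minimization of within-cluster scatter and maximization of between-cluster scatter can be achieved by finding the transformation $\mathbf{G}$ that maximizes the quantity
$$J(\mathbf{G}) = \mathrm{tr}\left[ \left(\mathbf{G}^T \mathbf{S}_w \mathbf{G}\right)^{-1} \left(\mathbf{G}^T \mathbf{S}_b \mathbf{G}\right) \right] \tag{5}$$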
This maximization can be achieved through an eigenvalue/eigenvector decomposition, but such a procedure is only applicable when Sw is nonsingular. The LDA method was reformulated in terms of the generalized singular value decomposition (SVD)31 extending the applicability of the method to singular Sw matrices such as those encountered when using particle positions.
In addition to employing the SVD solution to the LDA approach, care must be taken in how particle positions are aligned when performing LDA. This is evident when one considers the scatter matrices in eq 3 and eq 4. The values and null spaces of these scatter matrices will depend on the specific alignment procedure chosen. There are three obvious choices for structural alignment prior to LDA: (1) alignment of each frame to its respective cluster mean/covariance, (2) alignment to one cluster or another, and (3) alignment to a global average. The first choice will lead to scatter matrices with different null spaces for each cluster, making their addition in eq 3 unsatisfactory. Alignment to a cluster mean will yield consistent null spaces for each cluster but requires distinct alignment reference and global average structures. Additionally, aligning to a cluster mean yields an undesirable ambiguity (and asymmetry) in the choice of cluster. Alignment to a single global average overcomes all of these issues and, as we show in the Supporting Information (Sec. S6), yields a sampling coordinate that is at least as good as alignment to a cluster mean for the systems tested here.
The result of an LDA procedure on two labeled states will be a vector, v, of coefficients that best separate the two states. These vectors are similar in nature to the eigenvectors from PCA, a procedure more familiar to the biosimulation field.
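As a rough illustration of training a two-state LDA vector on flattened positions, the sketch below uses the SVD solver of scikit-learn's `LinearDiscriminantAnalysis`, the same class named in Sec. 2.5. The data are synthetic and assumed to be already aligned to a common reference; the array names and sizes are hypothetical and do not correspond to any system in this work.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_atoms, n_frames = 5, 200

# Synthetic, pre-aligned frames for two states (hypothetical data):
# Gaussian fluctuations around two different average structures.
mu_a = rng.normal(size=(n_atoms, 3))
mu_b = mu_a + rng.normal(scale=0.5, size=(n_atoms, 3))
frames_a = mu_a + rng.normal(scale=0.1, size=(n_frames, n_atoms, 3))
frames_b = mu_b + rng.normal(scale=0.1, size=(n_frames, n_atoms, 3))

# Flatten each N x 3 frame to a 3N feature vector and label it by state
X = np.concatenate([frames_a, frames_b]).reshape(2 * n_frames, -1)
y = np.repeat([0, 1], n_frames)

# The SVD solver does not require an invertible within-cluster scatter matrix
lda = LinearDiscriminantAnalysis(solver="svd", n_components=1)
l_values = lda.fit_transform(X, y)[:, 0]

# Coefficient vector v (length 3N); projecting the globally centered
# positions onto v reproduces the LDA coordinate for each frame.
v = lda.scalings_[:, 0]
```

With real trajectories, the rotational and translational null space of the aligned positions makes the within-cluster scatter singular, which is the situation the SVD formulation is meant to handle.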
2.3. Biasing a Linear Combination of Positions
The value of the LDA coordinate after this procedure is a dot product of the vector v with the atomic coordinates x – μ. When computing this value on the fly within an MD simulation, we need to consider the value of [x(t)], the equivalence class of the position at time t, translated and rotated to a reference {μ, Σ}.
Therefore, to compute the value of the LDA coordinate l, we first translate x(t) by ξ⃗(t), the difference between the geometric center of the current frame and that of the reference configuration. Then, we compute R(t), the rotation matrix which minimizes the Mahalanobis distance between x(t) – ξ⃗ and μ, for a given Σ, as described in ref (25). Finally, we compute
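$$l(\mathbf{x}(t)) = \mathbf{v} \cdot \left[ \left(\mathbf{x}(t) - \vec{\xi}(t)\right) \mathbf{R}(t) - \boldsymbol{\mu} \right] \tag{6}$$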
By definition, l(μ) = 0.
To apply bias forces to this coordinate, we must be able to compute ∇l(x(t)). Because of the inclusion of the optimal rotation process by SVD, it is nontrivial to compute this analytically, and we instead compute derivatives numerically.
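A minimal sketch of evaluating the coordinate and its numerical gradient is given below. It is not the PLUMED implementation: the alignment uses uniform weights (Kabsch) in place of the Mahalanobis-minimizing rotation, the reference is centered at the origin, the gradient uses central finite differences with an arbitrarily chosen step, and all function names are hypothetical.

```python
import numpy as np


def kabsch_rotation(x_c, mu_c):
    """Proper rotation (applied from the right) that best maps x_c onto mu_c."""
    u, _, vt = np.linalg.svd(x_c.T @ mu_c)
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt


def lda_coordinate(x, mu, v):
    """l(x) = v . (aligned(x) - mu), using uniform-weight alignment (assumption)."""
    mu_c = mu - mu.mean(axis=0)
    x_c = x - x.mean(axis=0)                    # translation step
    aligned = x_c @ kabsch_rotation(x_c, mu_c)  # rotation step
    return float(v @ (aligned - mu_c).ravel())  # linear combination


def numerical_gradient(x, mu, v, h=1e-5):
    """Central finite-difference gradient of l with respect to every coordinate."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = h
        plus = lda_coordinate(x + step, mu, v)
        minus = lda_coordinate(x - step, mu, v)
        grad[idx] = (plus - minus) / (2.0 * h)
    return grad
```

Because l is invariant to rigid translation of the frame, the gradient summed over particles vanishes, so a bias force derived from it exerts no net force on the molecule. This provides a simple consistency check on any implementation.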
2.4. Enhanced Sampling with OPES-MetaD
Enhanced sampling simulations on LDA coordinates were performed using Well-tempered Metadynamics (WT-MetaD) and On the Fly Probability Enhanced Sampling-Metadynamics (OPES-MetaD) as implemented in PLUMED.32−35
WT-MetaD works by adding a bias formed from a history dependent sum of progressively shrinking Gaussian hills.36,37 The bias at time t for CV value Qi is given by the expression
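$$V(Q_i, t) = \sum_{\tau < t} h \, \exp\left(-\frac{V(Q_\tau, \tau)}{k_B \Delta T}\right) \exp\left(-\frac{(Q_i - Q_\tau)^2}{2\sigma^2}\right) \tag{7}$$
where $Q_\tau$ is the CV value at the earlier time τ at which a hill was deposited, h is the initial hill height, σ sets the width of the Gaussians, and ΔT is an effective sampling temperature for the CVs. Rather than setting ΔT, one typically chooses the bias factor γ = (T + ΔT)/T, which sets the smoothness of the sampled distribution.36,37 Asymptotically, a free energy surface (FES) can be estimated from the applied bias by $F(Q) = -\frac{\gamma}{\gamma - 1} V(Q, t \to \infty)$ (37,38) or using a reweighting scheme.37,39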
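To make the hill-tempering rule concrete, the following sketch accumulates well-tempered hills on a one-dimensional CV and converts the bias to a free energy estimate. It is an illustration only, not PLUMED code: the function names are hypothetical, the hills are deposited at an arbitrary fixed CV value instead of along a simulated trajectory, and the numerical parameters are borrowed from the (Aib)9 settings in Sec. 5 purely as example inputs.

```python
import numpy as np

KB = 0.0019872041  # Boltzmann constant in kcal/mol/K


def bias(q, centers, heights, sigma):
    """Sum of Gaussian hills evaluated at q (scalar or array)."""
    q = np.atleast_1d(np.asarray(q, dtype=float))
    if len(centers) == 0:
        return np.zeros_like(q)
    c = np.asarray(centers)[None, :]
    w = np.asarray(heights)[None, :]
    return np.sum(w * np.exp(-((q[:, None] - c) ** 2) / (2.0 * sigma**2)), axis=1)


def deposit_hill(q_now, centers, heights, h0, sigma, kb_delta_t):
    """Add a hill at q_now, scaled down by the bias already present there."""
    v_now = bias(q_now, centers, heights, sigma)[0]
    centers.append(float(q_now))
    heights.append(h0 * np.exp(-v_now / kb_delta_t))


# Example inputs: T = 400 K and gamma = 2 imply Delta T = (gamma - 1) * T
temperature, gamma, h0, sigma = 400.0, 2.0, 0.005, 0.43
kb_delta_t = KB * (gamma - 1.0) * temperature

centers, heights = [], []
for _ in range(3):
    deposit_hill(0.0, centers, heights, h0, sigma, kb_delta_t)

# Free energy estimate from the accumulated bias, F = -gamma / (gamma - 1) * V
grid = np.linspace(-2.0, 2.0, 5)
fes_estimate = -gamma / (gamma - 1.0) * bias(grid, centers, heights, sigma)
```

Each successive hill placed at the same CV value is slightly smaller than the previous one, which is the mechanism that lets the bias converge instead of growing without bound.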
In contrast to the use of sum of Gaussians in traditional MetaD, OPES-MetaD applies a bias that is based on a kernel density estimate of the probability distribution over the whole space, which is iteratively updated.34,35 The bias at time t for CV value Qi is given by the expression
$$V(Q_i, t) = \left(1 - \frac{1}{\gamma}\right) k_B T \, \ln\left(\frac{P_t(Q_i)}{Z_t} + \epsilon\right) \tag{8}$$
Here in the prefactor, T is the temperature, kB is Boltzmann’s constant, and γ is the bias factor. Pt(Q) is the current estimate of the probability distribution, and Zt is a normalization factor that comes from integrating over sampled Q space. Finally, $\epsilon = \exp\left(-\frac{\Delta E}{k_B T (1 - 1/\gamma)}\right)$ is a regularization constant that ensures the maximum bias that can be applied is ΔE. For one of our systems, we found that limiting the maximum bias using OPES-MetaD helped prevent unphysical exploration along our LDA coordinate (this is also possible using other approaches such as Metabasin Metadynamics40). Even with this limitation, we apply additional wall potentials to prevent exploration well beyond the LDA values for each of our two states. As in WT-MetaD, F(Q) can be directly estimated from V(Q) by $F(Q) = -\frac{\gamma}{\gamma - 1} V(Q)$ or through a reweighting scheme.35 Details of the sampling parameters used for each system are given in Sec. 5.
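The role of the regularization constant can be checked numerically with a short sketch. The functions below simply evaluate the OPES-MetaD bias expression for a given normalized probability estimate; they are hypothetical helpers, not part of PLUMED, and the example inputs (γ = 8, ΔE = 10 kcal/mol, T = 360 K) are taken from the HP35 settings in Sec. 5.

```python
import numpy as np

KB = 0.0019872041  # Boltzmann constant in kcal/mol/K


def opes_epsilon(delta_e, temperature, gamma):
    """Regularization constant that caps the range of the bias at delta_e."""
    return np.exp(-delta_e / (KB * temperature * (1.0 - 1.0 / gamma)))


def opes_bias(p_over_z, delta_e, temperature, gamma):
    """Bias for a given value of P_t(Q) / Z_t (scalar or array)."""
    prefactor = (1.0 - 1.0 / gamma) * KB * temperature
    eps = opes_epsilon(delta_e, temperature, gamma)
    return prefactor * np.log(np.asarray(p_over_z, dtype=float) + eps)


# Regions that have never been visited (P_t / Z_t = 0) receive the lowest
# possible bias, which is -delta_e by construction of epsilon.
v_unvisited = opes_bias(0.0, delta_e=10.0, temperature=360.0, gamma=8.0)
v_typical = opes_bias(1.0, delta_e=10.0, temperature=360.0, gamma=8.0)
```

Here the bias in unvisited regions equals −ΔE, while regions where the normalized probability is of order one sit near zero bias, so the difference in bias between the two is approximately ΔE.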
2.5. Implementation
Clustering and iterative alignment of trajectory frames prior to learning LDA vectors is performed using our shapeGMMTorch package, which is a high performance version of the methods from ref (25), implemented with pyTorch(41) for accelerated computation on GPUs. shapeGMMTorch is available from https://github.com/mccullaghlab/shapeGMMTorch and can easily be installed in python using the command pip install shapeGMMTorch. We have also created a wrapper library for the training of LDA vectors directly from positional data, which is available from https://github.com/mccullaghlab/pLDA and which can be easily installed with pip install posLDA (although this wrapper was not used in the analysis performed in this paper as it was not yet available). Within posLDA, vectors are learned using the SVD implementation of the scikit-learn LinearDiscriminantAnalysis package.42
In order to compute and bias these vectors on the fly within MD simulations, the optimal alignment and linear combination procedure has been implemented in the PLUMED open source library.32,33 All procedures, analysis for every case studied in this work, and PLUMED code are made available at https://github.com/hocky-research-group/posLDA_paper_2023, and the code for computing LDA coordinates and Mahalanobis distances on positions will be contributed as a module to PLUMED shortly.
3. Results and Discussion
3.1. LDA Is a Good Reaction Coordinate for HP35 Folding
In previous work, we applied our shapeGMM clustering approach to a 305 μs trajectory of a 35-amino acid fast-folding mutant of the Villin headpiece domain (HP35), obtained from the D.E. Shaw Research Group.43 From our data, we choose to study a six-state representation of the data, whose states produce an interpretable representation of folding and unfolding, and which is found not to be overfit by a cross-validation approach. Details of the clustering and cross-validation are provided in ref (25). The definition of this six-state model, $\{\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\}_{i=1}^{K=6}$, was trained from 25,000 frames out of ∼1.5 million, and then each frame was assigned to a cluster based on which center was closest in terms of Mahalanobis distance on positions.
A single folding/unfolding coordinate was constructed by performing LDA on frames assigned to the folded and unfolded states. The folded and unfolded states were assigned based on the RMSD to folded helix 1 and RMSD to folded helix 2 2D map shown in Figure 1A for this long trajectory with points colored by the assigned states. From this figure, we can assign state 0 as the folded state because it is the state with lowest RMSDs (it also has the largest population) and state 4 as the most unfolded state because it is the state with the largest RMSDs. LDA is performed on these two states to produce a single LD vector, denoted l, after an iterative alignment of the amalgamated two-state trajectory to the global mean and covariance, as described above. The magnitudes of the coefficients in this vector are illustrated as particle displacement vectors in the porcupine plot in Figure 1B. The histogram in Figure 1C shows the l values adopted in each state. We see from these data that this coordinate separates state 0 (l ≈ – 3) and state 4 (l ≈ 12). To our surprise, this single coordinate, which was trained only on data from state 0 and state 4, separates the other four states as well, which suggests that it might be sufficient to produce transitions between folded and unfolded through physically meaningful configurations.
Figure 2A shows the variation of l versus time for this long trajectory and exhibits many transitions between the folded (l ≈ −3) and unfolded (l ≈ 12) states (for comparison, ref (44) found that this long trajectory contains 61 folding transitions with their definition of folding). In order to assess the quality of this CV, we compute the committor of each frame in the trajectory c(xt),2,45,46 which for time t is 1 if the system reaches a folded state before reaching an unfolded state in the times following t, and 0 otherwise.
To assess the quality of a reaction coordinate, we can compute the committor probability for each value of l on a grid of size δl.
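$$P_c(l) = \frac{\sum_t c(\mathbf{x}_t)\, \chi_l(\mathbf{x}_t)}{\sum_t \chi_l(\mathbf{x}_t)} \tag{9}$$
where the indicator function selects the frames that fall in the grid bin centered at l,
$$\chi_l(\mathbf{x}_t) = \begin{cases} 1, & |l(\mathbf{x}_t) - l| \le \delta l / 2 \\ 0, & \text{otherwise} \end{cases} \tag{10}$$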
In Figure 2B, we show the approximate FES along l computed as F(l) = −kBT ln P(l) for the long unbiased trajectory, colored by the value of Pc(l). The FES shows a stable well at a value of l = −3 corresponding to the highest population state, the folded one, and very shallow minima for each of the other states. The value of Pc varies continuously from 1 to 0 along this coordinate, reaching a value of 0.5 at l = 1, just outside the folded basin. By this metric, our very simple coordinate is a good CV for characterizing the transition between folded and unfolded states, although the lack of a high barrier separating the two states (due to the system being near its melting temperature) makes it more ambiguous how close the point of Pc = 0.5 is to a classic transition state. The coincidence of Pc = 0.5 with a clear barrier is observed in Figure S1 where we train using all 6 states, but for this paper, we chose to focus only on one-dimensional LDA spaces. In Figure S2, we show the FES projected between the folded states and all other states, with each possible choice of alignment.
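The committor analysis described in this section can be sketched for a one-dimensional time series of l. The code below is an illustrative simplification and not the analysis code used for Figure 2: the folded and unfolded basins are defined here by hypothetical thresholds on l itself, frames after the last basin visit are left undefined, and the toy trajectory is invented for the example.

```python
import numpy as np


def committor_labels(l_traj, folded_max, unfolded_min):
    """c[t] = 1 if the folded basin is reached next (at or after t),
    0 if the unfolded basin is reached next, NaN if neither is reached."""
    c = np.full(len(l_traj), np.nan)
    next_state = np.nan
    # Walk backward so each frame inherits the next basin visited in time
    for t in range(len(l_traj) - 1, -1, -1):
        if l_traj[t] <= folded_max:
            next_state = 1.0
        elif l_traj[t] >= unfolded_min:
            next_state = 0.0
        c[t] = next_state
    return c


def committor_profile(l_traj, c, edges):
    """Average of c over the frames in each bin of l."""
    bins = np.digitize(l_traj, edges) - 1
    p_c = np.full(len(edges) - 1, np.nan)
    for b in range(len(edges) - 1):
        mask = (bins == b) & ~np.isnan(c)
        if mask.any():
            p_c[b] = c[mask].mean()
    return p_c


# Toy trajectory (hypothetical values), folded near l = -3, unfolded near l = 12
l_traj = np.array([-3.0, 0.0, 5.0, 12.0, 6.0, 1.0, -3.0, 2.0])
c = committor_labels(l_traj, folded_max=-2.0, unfolded_min=10.0)
p_c = committor_profile(l_traj, c, edges=np.array([-4.0, 4.0, 8.0, 13.0]))
```

For a good reaction coordinate, the resulting profile decreases monotonically from 1 in the folded basin to 0 in the unfolded basin, and the bin where it crosses 0.5 lies close to the free energy barrier.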
3.2. LDA Is a Reasonable Sampling Coordinate for HP35 Folding
To assess the ability to sample along an LDA coordinate, we perform OPES-MetaD to bias the system to explore l (Figure 3). For the MetaD parameters listed in Sec. 5, we see in Figure 3A that transitions between the folded and unfolded state are accelerated. This corresponds to an estimated FES that is in fair agreement with that obtained from the long unbiased trajectory considering it is obtained in only 3% of the MD time (Figure 3B). Undersampling of the large unfolded region (l > 5) is a reflection of the usual problem of sampling slow orthogonal degrees of freedom. Despite this, when we look at the FES projected on natural folding coordinates in Figure S3, we see that our sampling does a good job capturing the main features of the long unbiased trajectory, including the presence of intermediates along the x- and y-axes, and the high energy unfolded state located in the upper right. As inferred from the 1d FES, the most unfolded regions are unexplored, and the statistical weight of the central intermediate basin is incorrect. Shorter replicates of simulations starting from different initial structures (Figure S4) show the variance in FES estimates that could arise if one is not careful to converge sampling. On the whole, our results are evidence that our simple LDA coordinate is a promising first step for sampling between two states of a complex biomolecule.
3.3. Accurate Sampling Using LDA for a Bistable Helix
The LDA procedure can be applied to determine a reaction coordinate separating two states even without sampling the actual transition (analogous to ref (6)). To assess this behavior, we investigate the right- to left-handed helix transition of (Aib)9, a nine residue peptide formed from the achiral α-aminoisobutyryl amino acid.47 The helical states of achiral molecules must by symmetry have equal free energy, and we previously took advantage of this property in benchmarking sampling and clustering methods.25,48 The properties of (Aib)9 have been characterized in simulation including recently as a tool to benchmark advanced methods for RC optimization.24,49,50
We performed 20 ns simulations starting from the left- and right-handed states of (Aib)9 using inputs from ref (24) (see Sec. 5 for details). We performed a three-state clustering of the combined MD data (total 40 ns, sampled every ps) and verified that the two most populated clusters are the left- and right-handed states. The coordinates of backbone atoms only were used for the clustering procedure. We then performed an iterative alignment of the combined data to compute a global (μ, Σ) and then computed a single LDA vector separating the frames of the globally aligned trajectory assigned to the left- and right-handed states. Figure 4A shows that this coordinate separates the training data with l ∼ 50 indicating a right-handed helix and l ∼ −50 indicating a left-handed helix. The left-handed helix is the starting point for further runs (shown in Figure 4B, along with LDA coefficient magnitudes).
Having trained l, we next performed conventional and WT-MetaD simulations starting from the structure in Figure 4B. Figure 5A shows that MetaD (right) substantially increases the rate of transition between the left- and right-handed states as compared to conventional MD (left).
A more chemically motivated way of computing the helicity of (Aib)9 is the parameter $\zeta' = -\sum_{n=3}^{7} \phi_n$, the negative sum over the central backbone ϕ dihedral angles.24 This quantity takes on values of approximately 5 for right-handed and −5 for left-handed helices.24 Figure 5B shows qualitatively similar behavior for ζ′ as l.
Figure 5C shows the FES computed for these two quantities. The sampled l has a nearly perfectly symmetrical FES, and in particular the free energy difference between the left- and right-handed states is negligible. For the FES of the nonbiased ζ′ computed by reweighting, the result is nearly as symmetrical, and the offset in free energy between the left- and right-handed states is visible but minuscule. This result appears to be as good as that obtained in ref (24), which uses a very sophisticated iterative process and 900 ns of unbiased and biased simulation data to obtain an optimized sampling coordinate as compared to our 40 ns of input data; however, their optimized coordinate appears to perform better in terms of transitions per unit time generated with their choice of MetaD parameters. As detailed in Sec. 5, the parameters used in our WT-MetaD simulation are very gentle; their magnitude was limited by “crashing”, which typically occurs due to inaccurate numerical integration. To check this, we demonstrate in Figure S5 that use of a 1 fs integration time step allows us to use much more aggressive MetaD parameters, which results in much more frequent transitions, as well as convergence accelerated enough to justify the use of a smaller time step (Figure S6). It is possible that implementation of analytical derivatives for our procedure may further mitigate this issue if they can be properly derived, and we will pursue this going forward.
4. Conclusions and Outlook
In this work, we demonstrated that LDA on positions computed from two states of a system produces a good reaction coordinate, both in terms of state transition kinetics and our ability to bias that coordinate to assess the FES along that coordinate. This was true for (Aib)9 even though the RC was trained only using short simulations starting in each state, making this a promising approach even when only structures of end points of a process are available. In contrast to ref (6) where input features were internal coordinates, we were able to use standard LDA rather than HLDA in this case and achieve good performance.
We note that LDA on positions would not apply directly to problems such as molecular dissociation since the dissociated states cannot be aligned to a single average structure; however, we do think this coordinate would work well for apo-holo transitions of a biomolecule and could easily be combined with a ligand-distance coordinate to overcome sampling challenges, e.g., as observed in ref (51). There are, of course, difficulties in resolving structural states of globular proteins that could make application of shapeGMM and subsequent LDA challenging. Namely, structural states of globular proteins can differ in only a small fraction of the total degrees of freedom. We feel that the heterogeneous nature of allowed covariance in the Kronecker form of shapeGMM will allow us to resolve these states with adequate sampling. Once the clusters are resolved, the LDA procedure described in the current manuscript will highlight the coordinates relevant to separate the clusters.
For HP35, multidimensional LDA by construction better separates all of the states of the molecule and may also provide an even better reaction coordinate for kinetics (Figure S1). It is not yet clear if this result is general or specific to the HP35 system. Regardless, the use of multidimensional LDA as an RC is intriguing, and we are currently investigating the advantages and limitations of these coordinates. However, this is not an option when information about multiple states is unavailable a priori (such as in the case of (Aib)9) which is why we did not include it here. For cases like that, it would be intriguing to first sample along the 1-dimensional reaction coordinate, then train a GMM with a higher number of states, and continue iterating this approach.
The use of states defined from our GMM clustering approach presents both an advantage and disadvantage as illustrated in the case of HP35. Our approach allowed us to explore the folding/unfolding process and most of the conformational landscape (Figure S3), but we were not able to fully sample the FES around the unfolded state. For sampling a broad and entropy dominated state, combining CV based sampling on position LDA coordinates with tempering or temperature accelerated methods should provide more accurate information in this region as in many past studies.52−56
In both the case of HP35 and (Aib)9, we were able to accelerate transitions between two states using MetaD or OPES-MetaD. In our hands, the biased simulations were sensitive to sampling protocol in terms of being able to run microseconds or longer without “crashing”. HP35 was less sensitive to this issue using OPES-MetaD, while (Aib)9 performed better with standard WT-MetaD. For this reason, we initially used small bias factors and hill heights/barrier heights, which resulted in fewer transitions and presumably worse convergence in fixed simulation time. We speculated that some of this sensitivity may come from our choice of the global trajectory mean and covariance as the reference state when computing our LDA vectors; however, subsequent tests using alignment to left- or right-handed helices for (Aib)9 showed that these alignments were more sensitive to crashing and had worse convergence performance, supporting our initial choice of global alignment (Figures S7,S8). A compelling option is presented in the ATLAS method of ref (28), where bias is computed along vectors to multiple reference states, weighted by distance from that reference state, and we are beginning to assess that approach.
5. Simulation Details
All simulations were performed using GROMACS 2019.657 with PLUMED 2.9.0-dev.32,33 GROMACS “mdp” parameter files and PLUMED input files are available in our paper’s github repository for complete details.
5.1. HP35 Simulations
A 305 μs all-atom simulation of Nle/Nle HP35 at T = 360 K from Piana et al.43 was analyzed. The simulation was performed using the Amber ff99SB*-ILDN force field and TIP3P water model. In that simulation, protein configurations were saved every 200 ps, for a total of ∼1.5 M frames. For our simulations, we solvate and equilibrate a fresh system using the same force field at 40 mM NaCl. Minimization and equilibration are performed using a standard protocol (http://www.mdtutorials.com/gmx/lysozyme/index.html), at which point NPT simulations are initiated at T = 360 K. mdp files for all steps of this procedure and the topology files are all available from the paper’s github page (https://github.com/hocky-research-group/posLDA_paper_2023).
OPES-MetaD simulations are performed with γ = 8, ΔE = 10 kcal/mol, pace of 500 steps, and a multiple time step58 stride of 2. Quadratic walls are applied at l = 5 and l = −15 with a bias coefficient of 125 kcal/mol/Å².
5.2. (Aib)9 Simulations
Equilibrated inputs for (Aib)9 were provided by the authors of ref (24). In brief, simulations used the CHARMM36m force field and TIP3P water.59 MD simulations are performed in NPT with a 2 fs time step at T = 400 K.
WT-MetaD simulations are performed with h = 0.005 kcal/mol, σ = 0.43, γ = 2, and a multiple time step58 stride of 2. Quadratic walls are applied at l = 70 and l = −60 with a bias coefficient of 125 kcal/mol/Å². σ was chosen as σ_l/3, where σ_l is the standard deviation in l over the 20 ns simulation starting from the left helical state used in the training of the CV.
Acknowledgments
We thank D.E. Shaw Research for providing simulation data on the HP35 protein, and we thank the Tiwary lab for providing their input files for (Aib)9. SS and GMH were supported by the National Institutes of Health through the award R35GM138312. SS was also partially supported by a graduate fellowship from the Simons Center for Computational Physical Chemistry (SCCPC) at NYU (SF Grant No. 839534). MM would like to acknowledge funding from the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under award number R01AI166050. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise, and simulations were partially executed on resources supported by the SCCPC at NYU.
Supporting Information Available
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jctc.3c00051.
Comparison of FE profiles along two state and six state LD1 coordinates; Comparison of different state pairs and alignment choices; Comparison of FES in natural coordinates computed from unbiased and biased MD; Comparison of independent runs with less sampling; Comparison of 1 and 2 fs time steps for (Aib)9 simulations; Comparison of sampling efficiency for different alignments in (Aib)9 (PDF)
The authors declare no competing financial interest.
References
- Hénin J.; Lelièvre T.; Shirts M.; Valsson O.; Delemotte L. Enhanced Sampling Methods for Molecular Dynamics Simulations [Article v1.0]. LiveCoMS 2022, 4, 1583. 10.33011/livecoms.4.1.1583.
- Ma A.; Dinner A. R. Automatic method for identifying reaction coordinates in complex systems. J. Phys. Chem. B 2005, 109, 6769–6779. 10.1021/jp045546c.
- Hashemian B.; Millán D.; Arroyo M. Modeling and enhanced sampling of molecular systems with smooth and nonlinear data-driven collective variables. J. Chem. Phys. 2013, 139, 214101. 10.1063/1.4830403.
- Tiwary P.; Berne B. Spectral gap optimization of order parameters for sampling complex molecular systems. Proc. Natl. Acad. Sci. U. S. A. 2016, 113, 2839–2844. 10.1073/pnas.1600917113.
- Chen W.; Ferguson A. L. Molecular enhanced sampling with autoencoders: On-the-fly collective variable discovery and accelerated free energy landscape exploration. J. Comput. Chem. 2018, 39, 2079–2102. 10.1002/jcc.25520.
- Mendels D.; Piccini G.; Parrinello M. Collective variables from local fluctuations. J. Phys. Chem. Lett. 2018, 9, 2776–2781. 10.1021/acs.jpclett.8b00733.
- Mendels D.; Piccini G.; Brotzakis Z. F.; Yang Y. I.; Parrinello M. Folding a small protein using harmonic linear discriminant analysis. J. Chem. Phys. 2018, 149, 194113. 10.1063/1.5053566.
- Wehmeyer C.; Noé F. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. J. Chem. Phys. 2018, 148, 241703. 10.1063/1.5011399.
- Piccini G.; Mendels D.; Parrinello M. Metadynamics with discriminants: A tool for understanding chemistry. J. Chem. Theory Comput. 2018, 14, 5040–5044. 10.1021/acs.jctc.8b00634.
- Sultan M. M.; Pande V. S. Automated design of collective variables using supervised machine learning. J. Chem. Phys. 2018, 149, 094106. 10.1063/1.5029972.
- Ribeiro J. M. L.; Bravo P.; Wang Y.; Tiwary P. Reweighted autoencoded variational Bayes for enhanced sampling (RAVE). J. Chem. Phys. 2018, 149, 072301. 10.1063/1.5025487.
- Zhang Y.-Y.; Niu H.; Piccini G.; Mendels D.; Parrinello M. Improving collective variables: The case of crystallization. J. Chem. Phys. 2019, 150, 094509. 10.1063/1.5081040.
- Wang Y.; Ribeiro J. M. L.; Tiwary P. Machine learning approaches for analyzing and enhancing molecular dynamics simulations. Curr. Opin. Struct. Biol. 2020, 61, 139–145. 10.1016/j.sbi.2019.12.016.
- Noé F.; Tkatchenko A.; Müller K.-R.; Clementi C. Machine learning for molecular simulation. Annu. Rev. Phys. Chem. 2020, 71, 361–390. 10.1146/annurev-physchem-042018-052331.
- Sidky H.; Chen W.; Ferguson A. L. Machine learning for collective variable discovery and enhanced sampling in biomolecular simulation. Mol. Phys. 2020, 118, e1737742. 10.1080/00268976.2020.1737742.
- Bonati L.; Rizzi V.; Parrinello M. Data-driven collective variables for enhanced sampling. J. Phys. Chem. Lett. 2020, 11, 2998–3004. 10.1021/acs.jpclett.0c00535.
- Karmakar T.; Invernizzi M.; Rizzi V.; Parrinello M. Collective variables for the study of crystallisation. Mol. Phys. 2021, 119, e1893848. 10.1080/00268976.2021.1893848.
- Tsai S.-T.; Smith Z.; Tiwary P. SGOOP-d: Estimating kinetic distances and reaction coordinate dimensionality for rare event systems from biased/unbiased simulations. J. Chem. Theory Comput. 2021, 17, 6757–6765. 10.1021/acs.jctc.1c00431.
- Hooft F.; Perez de Alba Ortiz A.; Ensing B. Discovering collective variables of molecular transitions via genetic algorithms and neural networks. J. Chem. Theory Comput. 2021, 17, 2294–2306. 10.1021/acs.jctc.0c00981.
- Sun L.; Vandermause J.; Batzner S.; Xie Y.; Clark D.; Chen W.; Kozinsky B. Multitask Machine Learning of Collective Variables for Enhanced Sampling of Rare Events. J. Chem. Theory Comput. 2022, 18, 2341–2353. 10.1021/acs.jctc.1c00143.
- Rydzewski J.; Chen M.; Ghosh T. K.; Valsson O. Reweighted Manifold Learning of Collective Variables from Enhanced Sampling Simulations. J. Chem. Theory Comput. 2022, 18, 7179–7192. 10.1021/acs.jctc.2c00873.
- Chen W.; Sidky H.; Ferguson A. L. Nonlinear discovery of slow molecular modes using state-free reversible VAMPnets. J. Chem. Phys. 2019, 150, 214114. 10.1063/1.5092521.
- Bonati L.; Piccini G.; Parrinello M. Deep learning the slow modes for rare events sampling. Proc. Natl. Acad. Sci. U. S. A. 2021, 118, e2113533118. 10.1073/pnas.2113533118.
- Mehdi S.; Wang D.; Pant S.; Tiwary P. Accelerating all-atom simulations and gaining mechanistic understanding of biophysical systems through state predictive information bottleneck. J. Chem. Theory Comput. 2022, 18, 3231–3238. 10.1021/acs.jctc.2c00058.
- Klem H.; Hocky G. M.; McCullagh M. Size-and-Shape Space Gaussian Mixture Models for Structural Clustering of Molecular Dynamics Trajectories. J. Chem. Theory Comput. 2022, 18, 3218–3230. 10.1021/acs.jctc.1c01290.
- Tribello G. A.; Ceriotti M.; Parrinello M. A self-learning algorithm for biased molecular dynamics. Proc. Natl. Acad. Sci. U. S. A. 2010, 107, 17509–17514. 10.1073/pnas.1011511107.
- Westerlund A. M.; Delemotte L. InfleCS: Clustering Free Energy Landscapes with Gaussian Mixtures. J. Chem. Theory Comput. 2019, 15, 6752–6759. 10.1021/acs.jctc.9b00454.
- Giberti F.; Tribello G.; Ceriotti M. Global free-energy landscapes as a smoothly joined collection of local maps. J. Chem. Theory Comput. 2021, 17, 3292–3308. 10.1021/acs.jctc.0c01177.
- Dryden I. L.; Mardia K. V. Statistical Shape Analysis; John Wiley & Sons: Chichester, 1998.
- Bishop C. M.; Nasrabadi N. M. Pattern recognition and machine learning; Springer: 2006; Vol. 4.
- Howland P.; Jeon M.; Park H. Structure Preserving Dimension Reduction for Clustered Text Data Based on the Generalized Singular Value Decomposition. SIAM J. Matrix Anal. Appl. 2003, 25, 165–179. 10.1137/S0895479801393666.
- Tribello G. A.; Bonomi M.; Branduardi D.; Camilloni C.; Bussi G. PLUMED 2: New feathers for an old bird. Comput. Phys. Commun. 2014, 185, 604–613. 10.1016/j.cpc.2013.09.018.
- Bonomi M.; Bussi G.; Camilloni C.; Tribello G. A.; et al. Promoting transparency and reproducibility in enhanced molecular simulations. Nat. Methods 2019, 16, 670–673. 10.1038/s41592-019-0506-8.
- Invernizzi M.; Piaggi P. M.; Parrinello M. Unified approach to enhanced sampling. Phys. Rev. X 2020, 10, 041034. 10.1103/PhysRevX.10.041034.
- Invernizzi M.; Parrinello M. Rethinking metadynamics: from bias potentials to probability distributions. J. Phys. Chem. Lett. 2020, 11, 2731–2736. 10.1021/acs.jpclett.0c00497.
- Barducci A.; Bussi G.; Parrinello M. Well-tempered metadynamics: a smoothly converging and tunable free-energy method. Phys. Rev. Lett. 2008, 100, 020603. 10.1103/PhysRevLett.100.020603.
- Bussi G.; Laio A. Using metadynamics to explore complex free-energy landscapes. Nat. Rev. Phys. 2020, 2, 200–212. 10.1038/s42254-020-0153-0.
- Dama J. F.; Parrinello M.; Voth G. A. Well-tempered metadynamics converges asymptotically. Phys. Rev. Lett. 2014, 112, 240602. 10.1103/PhysRevLett.112.240602.
- Tiwary P.; Parrinello M. A time-independent free energy estimator for metadynamics. J. Phys. Chem. B 2015, 119, 736–742. 10.1021/jp504920s.
- Dama J. F.; Hocky G. M.; Sun R.; Voth G. A. Exploring valleys without climbing every peak: more efficient and forgiving metabasin metadynamics via robust on-the-fly bias domain restriction. J. Chem. Theory Comput. 2015, 11, 5638–5650. 10.1021/acs.jctc.5b00907.
- Paszke A.; Gross S.; Massa F.; Lerer A.; Bradbury J.; Chanan G.; Killeen T.; Lin Z.; Gimelshein N.; Antiga L. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (NeurIPS 2019); 2019; Vol. 32.
- Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Piana S.; Lindorff-Larsen K.; Shaw D. E. Protein folding kinetics and thermodynamics from atomistic simulation. Proc. Natl. Acad. Sci. U. S. A. 2012, 109, 17845–17850. 10.1073/pnas.1201811109.
- Piana S.; Lindorff-Larsen K.; Shaw D. E. How robust are protein folding simulations with respect to force field parameterization? Biophys. J. 2011, 100, L47–L49. 10.1016/j.bpj.2011.03.051.
- Du R.; Pande V. S.; Grosberg A. Y.; Tanaka T.; Shakhnovich E. S. On the transition coordinate for protein folding. J. Chem. Phys. 1998, 108, 334–350. 10.1063/1.475393.
- Bolhuis P. G.; Chandler D.; Dellago C.; Geissler P. L. Transition Path Sampling: Throwing Ropes. Annu. Rev. Phys. Chem. 2002, 53, 291–318. 10.1146/annurev.physchem.53.082301.113146.
- Karle I. L.; Balaram P. Structural characteristics of α-helical peptide molecules containing Aib residues. Biochemistry 1990, 29, 6747–6756. 10.1021/bi00481a001.
- Hartmann M. J.; Singh Y.; Vanden-Eijnden E.; Hocky G. M. Infinite switch simulated tempering in force (FISST). J. Chem. Phys. 2020, 152, 244120. 10.1063/5.0009280.
- Buchenberg S.; Schaudinnus N.; Stock G. Hierarchical biomolecular dynamics: Picosecond hydrogen bonding regulates microsecond conformational transitions. J. Chem. Theory Comput. 2015, 11, 1330–1336. 10.1021/ct501156t.
- Biswas M.; Lickert B.; Stock G. Metadynamics enhanced Markov modeling of protein dynamics. J. Phys. Chem. B 2018, 122, 5508–5514. 10.1021/acs.jpcb.7b11800.
- Peña Ccoa W. J.; Hocky G. M. Assessing models of force-dependent unbinding rates via infrequent metadynamics. J. Chem. Phys. 2022, 156, 125102. 10.1063/5.0081078.
- Bussi G.; Gervasio F. L.; Laio A.; Parrinello M. Free-energy landscape for β hairpin folding from combined parallel tempering and metadynamics. J. Am. Chem. Soc. 2006, 128, 13435–13441. 10.1021/ja062463w.
- Camilloni C.; Provasi D.; Tiana G.; Broglia R. A. Exploring the protein G helix free-energy surface by solute tempering metadynamics. Proteins 2008, 71, 1647–1654. 10.1002/prot.21852.
- Abrams C.; Bussi G. Enhanced sampling in molecular dynamics using metadynamics, replica-exchange, and temperature-acceleration. Entropy 2014, 16, 163–199. 10.3390/e16010163.
- Gil-Ley A.; Bussi G. Enhanced conformational sampling using replica exchange with collective-variable tempering. J. Chem. Theory Comput. 2015, 11, 1077–1085. 10.1021/ct5009087.
- Awasthi S.; Nair N. N. Exploring high dimensional free energy landscapes: Temperature accelerated sliced sampling. J. Chem. Phys. 2017, 146, 094108. 10.1063/1.4977704.
- Abraham M. J.; Murtola T.; Schulz R.; Páll S.; Smith J. C.; Hess B.; Lindahl E. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 2015, 1, 19–25. 10.1016/j.softx.2015.06.001.
- Ferrarotti M. J.; Bottaro S.; Pérez-Villa A.; Bussi G. Accurate multiple time step in biased molecular simulations. J. Chem. Theory Comput. 2015, 11, 139–146. 10.1021/ct5007086.
- Huang J.; Rauscher S.; Nawrocki G.; Ran T.; Feig M.; De Groot B. L.; Grubmüller H.; MacKerell A. D. CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat. Methods 2017, 14, 71–73. 10.1038/nmeth.4067.