Abstract
The successful recent application of machine learning methods to scientific problems includes the learning of flexible and accurate atomic-level force-fields for materials and biomolecules from quantum chemical data. In parallel, the machine learning of force-fields at coarser resolutions is rapidly gaining relevance, as an efficient way to represent the higher-body interactions needed in coarse-grained force-fields to compensate for the omitted degrees of freedom. Coarse-grained models are important for the study of systems at time and length scales exceeding those of atomistic simulations. However, the development of transferable coarse-grained models via machine learning still presents significant challenges. Here we discuss recent developments in this field and current efforts to address the remaining challenges.
Keywords: Protein Dynamics, Machine Learning, Coarse-Graining
1. Introduction
The definition of simplified models is central to physical sciences; proteins are no exception [1, 2]. Statistical mechanical approaches to describe protein folding and dynamics [3, 4, 5], as well as the analysis of long molecular dynamics (MD) trajectories [6, 7, 8], have demonstrated that slow processes in large biomolecular systems can be described by a reduced number of variables despite hundreds of thousands of atoms comprising the full system. In this spirit, many coarse-grained (CG) models have been proposed to study proteins through MD and energy minimization. These CG models have been used to investigate the principles underlying protein folding [9, 10, 11, 12], intermolecular binding/interactions [13, 14], protein-mediated membrane phenomena [15, 16], and to make predictions about novel biological systems of immediate medical interest [17, 18].
Despite their successes, CG models of proteins have not yet achieved the predictive performance of their atomistic counterparts. CG models are primarily designed by specifying their resolution, which defines the coarse degrees of freedom (referred to as “sites” or “beads”, see Fig. 1), and by their effective energy function, which dictates how these beads interact. Traditionally, the resolution is first chosen using either chemical intuition or through optimization designed to reproduce chosen properties (e.g., [19]). The model’s effective energy function is then parameterized to reproduce experimental or simulation data. The fundamental goal of the transferable CG models discussed in this article is to predict the conformational landscape of proteins not used for their parameterization, ideally using only the primary structure of the proteins of interest. Atomistic models have been able to explore the relevant landscape of small globular proteins [20, 21, 22]; however, it is still an open question whether there exists a resolution at which a chemically transferable CG model can quantitatively describe the configurational landscape of arbitrary proteins.
Figure 1:
Sequential reduction in resolution of a variant of the miniprotein Chignolin (CLN025) from a solvated all-atom representation containing many thousands of atoms, to an implicit solvent representation, to a heavy-backbone representation with Cβ beads, and finally to a Cα CG representation containing 10 beads.
A transferable CG protein model would have significant consequences. By employing special-purpose supercomputers [23] or distributed simulation combined with Markov State Models (MSMs) [24, 21, 25], the dynamics of small solvated proteins can be simulated over millisecond timescales [20]. However, biological phenomena routinely involve larger complexes and span longer timescales (seconds or more). CG models promise to reach such scales by reducing the computational cost via decreasing the number of degrees of freedom and increasing the effective simulation timestep. This increased efficiency would vastly improve the use of MD for both fundamental research and applications, e.g., in protein design.
There has been a surge of interest in using machine learning (ML) methods for molecular simulation [26], including learning of CG models from large amounts of data. In a sense, the development of ML CG models can be seen as an extension of ongoing research on the design of accurate atomistic force-fields from quantum mechanical calculations. In this area, ML has already produced highly accurate force-fields which have facilitated groundbreaking computational studies [27, 28, 29]. When combined with the field of bottom-up CG [30, 31], these approaches provide a seemingly clear strategy to leverage ML to learn a CG force-field from existing atomistic MD trajectories. Indeed, thanks to the flexibility of ML algorithms, some frameworks developed for the atomistic resolution [32, 33] have been transferred to the CG resolution [34, 35, 36, 37].
Despite these advances, a completely bottom-up transferable CG model still does not exist for proteins or other biopolymers. This limited progress is due to multiple outstanding challenges, which together firmly differentiate the creation of ML bottom-up CG force-fields from their atomistic counterparts. We here discuss these difficulties and current efforts to overcome them.
2. Thermodynamic Consistency: Why is it difficult?
Bottom-up coarse-graining typically models the following free-energy surface (U) [30, 26, 31] referred to as the effective CG (free) energy:
U(R) = −(1/β) ln ∫ dr exp(−βu(r)) δ(ℳ(r) − R)    (1)
where ℳ maps all-atom configurations r to their CG counterparts R = ℳ(r), u is the reference all-atom energy, and β is the inverse temperature. Intuitively, ℳ defines the CG resolution and U defines how particles at this resolution interact; the design of a bottom-up CG model then entails defining ℳ and approximating U, and the two tasks are interdependent. A CG energy that, up to a constant, equals U is said to be thermodynamically consistent with the atomistic counterpart. Such a U produces free energy landscapes identical to the reference in any reaction coordinates that are a function of the CG coordinates. We note that the phrase thermodynamically consistent here does not refer to thermodynamic observables (e.g., pressure), but instead considers the configurational distributions of the CG and atomistic force-fields. For information on thermodynamic properties in CG models we defer to recent articles [38, 39, 40, 31].
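The mapping ℳ is typically a simple linear operator on atomistic coordinates. As a minimal illustration (a sketch with hypothetical atom indices, assuming a Cα resolution), a slicing map can be written as:

```python
import numpy as np

def ca_map(r, ca_indices):
    """Slicing map M: retain only the C-alpha atom of each residue.

    r:          (n_atoms, 3) array of all-atom positions
    ca_indices: index of the C-alpha atom of each residue (hypothetical here)
    """
    return r[ca_indices]

# Toy configuration: 12 atoms, with made-up C-alpha indices 1, 5, 9.
rng = np.random.default_rng(0)
r = rng.normal(size=(12, 3))
R = ca_map(r, [1, 5, 9])
```

More general linear maps (e.g., center-of-mass beads) replace the slicing with a weighted average over the atoms assigned to each bead; a compatible operator is also needed to map atomistic forces when force-matching.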
Although the thermodynamically consistent CG energy is uniquely defined by Eq. (1), the integral cannot be solved for non-trivial systems [30]. As a result, multiple strategies [41, 30, 31] to approximate U have been proposed. Traditionally, the functional forms for CG (free) energy functions have been low body-order with physically motivated terms [30, 31]. However, recent studies have successfully employed higher body-order terms parameterized by neural networks [42, 43, 44, 36, 35]. Kernel methods have also been proposed (e.g., [34]) but have not been applied to proteins. While exceptions exist [42, 45, 46, 47], for reasons of computational efficiency existing ML CG models [44, 36, 26] have primarily been based on the Multiscale Coarse-Graining [48] (“force-matching”) variational framework, largely because it does not require the CG model to be simulated during its parameterization.
In principle, Eq. (1) suggests that once the CG resolution has been chosen, the creation of the CG model should be straightforward. However, designing an accurate ML CG model is not trivial (see Fig. 2). The choices of the reference atomistic system (u), of the resolution (ℳ), and of the different terms of U all entail challenges unique to models designed at a CG resolution. These challenges are compounded by difficulties with validation, which are also present in the development of ML atomistic force-fields from quantum chemical data. We address these challenges in detail: beginning with the data used for training, continuing with the design of U and the resolution of the CG model, and finishing with validation and robustness. For brevity we only discuss algorithms which are applicable to ML CG force-fields; for a more comprehensive introduction to bottom-up coarse-graining we refer readers to recent perspective articles [30, 31].
Figure 2:
A pipeline for creating and using ML CG models from atomistic simulation data and experimental measurements. A chosen CG mapping can reduce reference information into a CG dataset that can be used to train ML CG models. This training can rely on both simulation and experimental observables in order to reduce the complexity of the learning task and respect physical constraints. A trained ML CG model can then be validated through CG MD and used for general property predictions.
2.1. The difficulty in training CG force-fields
The principal challenge in bottom-up coarse-graining with machine-learned force-fields lies in finding a suitable ML formulation that directly or indirectly estimates the intractable integral in Eq. (1). The situation is more difficult than learning atomistic potential energy surfaces from quantum mechanical data, where reference energies and forces are known: when learning a CG free energy, neither U nor its gradient at a given CG structure R is known, because the integral in Eq. (1) is intractable.
The most straightforward approach to Eq. (1) is to directly estimate the behavior of the probability density proportional to exp(−βU(R)) from simulation data. This requires an equilibrium sample of atomistic conformations r, e.g., obtained by MD simulations. After mapping them to the CG resolution, a ML model is then trained to approximate U(R), by minimizing the Kullback–Leibler divergence between the coarse-grained and atomistic probability densities. This is called relative entropy minimization in the coarse-graining literature [49, 50, 47] and maximum likelihood estimation in the ML energy-based model community [51]. Similar approaches [42, 45, 46] which estimate and reduce the difference between a CG force-field and U or optimize selected observables in turn expand on other approaches from the ML community (e.g., [52]).
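For a one-dimensional toy system, the relative entropy gradient with respect to the CG parameters θ takes the form β(⟨∂U/∂θ⟩_atomistic − ⟨∂U/∂θ⟩_CG). A minimal sketch (with a made-up harmonic energy U(x; θ) = θx², not any published model) shows why samples from the current CG model are needed at every step:

```python
import numpy as np

beta = 1.0

def dU_dtheta(x):
    # For the toy energy U(x; theta) = theta * x**2, dU/dtheta = x**2.
    return x ** 2

def relative_entropy_grad(ref_samples, model_samples):
    """Gradient of S_rel: beta * (<dU/dtheta>_ref - <dU/dtheta>_model).

    The second average requires samples from the *current* CG model,
    which is why these methods must re-simulate (or reweight) during
    training.
    """
    return beta * (dU_dtheta(ref_samples).mean()
                   - dU_dtheta(model_samples).mean())

# Reference distribution: Gaussian with variance 1/(2*beta*theta_ref).
rng = np.random.default_rng(1)
theta_ref, theta = 2.0, 1.0
ref = rng.normal(scale=np.sqrt(1 / (2 * beta * theta_ref)), size=100_000)
mod = rng.normal(scale=np.sqrt(1 / (2 * beta * theta)), size=100_000)
g = relative_entropy_grad(ref, mod)
# theta < theta_ref makes the model too broad, <x^2>_model > <x^2>_ref,
# so the gradient is negative and descent increases theta toward theta_ref.
```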
The difficulty with most of these approaches is that the CG model must be periodically re-simulated during training in order to evaluate the equilibrium density generated by the current model of U(R). While this is feasible for quickly equilibrating CG models, such as those of liquids [47], it is extremely challenging for models that exhibit rare events, such as realistic CG models of protein folding. This limitation is even more problematic for complex parameterizations of U (e.g., neural networks) and significantly impedes simultaneously training over multiple molecules when creating transferable models. Approaches have attempted to reduce this burden by, for example, reweighting the density of previous iterations [50, 45, 53] or by modifying the sampling of the atomistic system [46]. However, these approaches have not yet reached the simplicity and applicability of non-iterative approaches.
The most common bottom-up approach for approximating U is force-matching [48], which fits a CG free energy such that its negative gradient matches the projected instantaneous atomistic forces on average. Critically, this does not require simulations of the CG model during training, and was proposed for ML CG protein force-fields in [44]. As many atomistic coordinates r map to the same CG coordinate R, the instantaneous force is noisy, and the signal-to-noise ratio becomes smaller the more degrees of freedom are “CGed away”; thus CG force-matching requires more data than atomistic force-matching. A second difficulty comes from the fact that U is obtained by implicitly integrating the mean force; as a result, the estimated free energy difference between minima depends on estimates of the forces along the transition path, where the uncertainties are largest.
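The force-matching objective and its noise floor can be sketched with toy numbers (not real protein data): the loss compares CG model forces to instantaneous mapped atomistic forces, and is minimized by the mean force rather than by any single noisy sample, leaving an irreducible variance term.

```python
import numpy as np

def force_matching_loss(pred_forces, mapped_forces):
    """Mean squared deviation between CG model forces and atomistic
    forces mapped (projected) onto the CG beads."""
    return np.mean(np.sum((pred_forces - mapped_forces) ** 2, axis=-1))

rng = np.random.default_rng(2)
mean_force = np.array([1.0, 0.0, 0.0])
# Instantaneous mapped forces scatter widely around the mean force:
inst = mean_force + rng.normal(scale=3.0, size=(50_000, 3))
# Predicting the mean force minimizes the loss; any offset does worse.
loss_mean = force_matching_loss(np.broadcast_to(mean_force, inst.shape), inst)
loss_off = force_matching_loss(np.broadcast_to(mean_force + 0.5, inst.shape), inst)
```

Note that even the optimal predictor retains a large loss here (the variance of the noise), which is why noisier mappings require more data.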
The recently proposed flow-matching method [54] combines relative entropy estimation and force-matching by employing generative deep learning: the CG density is estimated by a normalizing flow, a neural network that can generate one-shot samples of equilibrium CG conformations. This flow can then generate samples to train a downstream ML CG force-field by force-matching. The limitation of this approach still lies in finding flow architectures that can scale to large macromolecules.
The distribution of atomistic configurations is fundamental to the discussed algorithms. Rare events are important but infrequently sampled in the canonical distribution; directing atomistic sampling towards barriers and areas of “high uncertainty” may be beneficial. While ML models are more expressive than, e.g., pair potentials, they require more data. For example, ML CG force-matching may use upwards of one million canonically distributed samples covering the configurational space for small proteins [36], in contrast to harmonic models parameterized using short trajectories in the folded state [55]. Modifying the distribution of samples may reduce data requirements [56], but it is unclear how such approaches scale with system complexity.
Concurrently, iterative methods may overcome their computational barriers if non-canonical sampling is used; expanding discriminative training may remove the need for repeated training simulations [46], and biasing potentials may promote diversity and produce more accurate parameters [57]. However, these approaches often require data to be drawn from a modified distribution, which impedes the use of preexisting atomistic trajectories. Nevertheless, these approaches will be critical to expanding current ML CG success to multi-domain proteins.
For a transferable ML CG model, additional requirements for the training dataset arise. It is straightforward, and important, to simultaneously force-match a model using reference data from multiple proteins, as evidence has shown that extended ensembles can act as regularization [58]. Previous pioneering work developing bottom-up transferable CG models used this approach, but fell short of unassisted folding and relied on artificially lowering simulation temperatures to stabilize states of interest [59]; we attribute these inaccuracies to limitations of the force-field basis and training set. We anticipate that the proportion of structural motifs in the dataset plays an important role. In the ideal case, a general CG model of proteins would likely include globular, fibrous, and intrinsically disordered proteins in its training procedure. Such a transferable training setup naturally expands the amount of atomistic data available to train a given model; whether this will improve predictions on individual proteins remains to be seen.
2.2. Choice of the coarse-grained representation
In the design of an atomistic force-field, the Born-Oppenheimer approximation justifies the separation between electronic and nuclear degrees of freedom and provides the framework for effective nuclear potential energy surfaces. However, the separation of scales is less clear for CG models. Consequently, the selection of the CG resolution (ℳ from Eq. (1)) is non-trivial and influences the free energy surface that must be learned. The fundamental questions in this area are which resolutions are “easy to learn” and which are conducive to creating transferable models. These points highlight the challenges of validating ML CG models that are capable of extrapolating to unseen systems. For certain resolutions, it may be easier to learn an effective CG energy and extrapolate into unknown regions of phase space. On the other hand, certain resolutions may be conducive to accurate ML CG models but may be difficult to interpret.
Current successful CG ML protein applications [42, 44, 36, 46, 54] typically focus on a single-site-per-residue resolution (usually Cα); however, this choice appears to be driven mostly by simplicity rather than systematic validation. “Optimal” resolutions have been studied [60, 61], but it is unclear how they impact ML CG models. Back-mapping, i.e., reconstructing atomistic details from CG configurations, is a current area of investigation [62] and may alleviate interpretability constraints on the CG resolution. Views which link back-mapping with potential optimization can facilitate a joint optimization of the representation alongside the CG energy model [63]; however, these approaches do not yet in themselves search for transferable resolutions.
2.3. Functional Form of the Many-body Effective CG (Free) Energy
In practice, training ML CG models via force-matching from equilibrium data requires a baseline (or “prior”) potential to reduce catastrophically incorrect extrapolation in unphysical regions of phase space [44, 36, 64]. Ultimately, a good prior potential incorporates physical principles, reduces learning complexity, and allows for stable simulation. Similar to Δ-learning [65] for atomistic force-fields [66, 67], the CG energy is usually decomposed into:
U(R; θ) = Unet(R; θ) + Uprior(R)    (2)
where Unet(R; θ) is a trainable multibody potential expressed by an ML model with parameters θ and Uprior(R) is the prior energy.
Designing the priors is non-trivial as it depends on an interplay between the CG mapping, the ML architecture, and the training data. Poor choices of priors can significantly reduce the performance of an ML force-field [47, 68]. Currently, the prevailing strategy involves proposing a prior inspired by the low body-order terms from classical force-fields, and then iteratively developing an ML CG model over both the prior terms and hyperparameters [44, 36, 35, 37, 47]. Systematic strategies have yet to be developed to design prior energies that are transferable to different molecules.
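As an illustration of Eq. (2), a minimal Cα prior might combine harmonic pseudo-bonds with an excluded-volume repulsion. This is a sketch with made-up constants (k, r0, eps, sigma are illustrative, not validated parameters), and `unet` stands in for the trainable network:

```python
import numpy as np

def harmonic_bond_prior(R, k=100.0, r0=0.38):
    """Chain bond prior between consecutive C-alpha beads (r0 ~ 0.38 nm)."""
    d = np.linalg.norm(np.diff(R, axis=0), axis=1)
    return 0.5 * k * np.sum((d - r0) ** 2)

def repulsion_prior(R, eps=1.0, sigma=0.4):
    """Pairwise excluded-volume term that penalizes bead overlap."""
    diff = R[:, None, :] - R[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(R), k=1)  # each pair counted once
    return eps * np.sum((sigma / d[iu]) ** 12)

def total_energy(R, unet, theta):
    # Eq. (2): U = U_net + U_prior; the network learns only the correction
    # on top of the physically motivated baseline.
    return unet(R, theta) + harmonic_bond_prior(R) + repulsion_prior(R)
```

The repulsion keeps extrapolation from collapsing beads onto each other, while the bond term keeps the chain connected, two failure modes the text above associates with training from equilibrium data alone.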
While priors help enforce important physical asymptotic interactions, the ML model architecture itself should respect basic physical constraints. These include invariance with respect to permutations of particles of the same type, invariance to translations and rotations of the reference frame, and curl-free force predictions [69, 44]. A way to allow the learnable energy Unet from Eq. (2) to be transferable [36] is to decompose it bead-wise such that:
Unet(R; θ) = Σi unet(R − Ri, ai; θ)    (3)
where Ri is the ith bead (with type ai) in the configuration R so that R−Ri are the relative displacements of all beads around bead Ri, and unet is the bead-wise contribution to the potential.
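A minimal sketch of the bead-wise decomposition in Eq. (3), with a toy type-dependent function standing in for the neural network, shows how summing contributions built from invariant quantities (here, neighbor distances) yields an energy invariant to translations, rotations, and consistent permutations:

```python
import numpy as np

def unet_bead(displacements, bead_type, theta):
    """Toy bead-wise contribution: a type-dependent function of the
    invariant neighbor distances (a stand-in for a neural network)."""
    d = np.linalg.norm(displacements, axis=-1)
    d = d[d > 0]  # drop the bead's own zero displacement
    return theta[bead_type] * np.sum(np.exp(-d))

def Unet(R, types, theta):
    # Eq. (3): sum over beads of u_net(R - R_i, a_i; theta).
    return sum(unet_bead(R - R[i], types[i], theta) for i in range(len(R)))
```

Because each contribution depends only on distances, rotating and translating all beads leaves the energy unchanged, and forces obtained as the negative gradient of this scalar are curl-free by construction.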
On top of these constraints, bottom-up coarse-graining involves additional architectural challenges. Coarse-graining a variety of different groups of atoms leads to a large number of CG bead types, e.g., at least 20 types (one for each amino acid) for proteins at the Cα resolution. As mentioned above, training with noisy forces requires a large number of training configurations. As a result, the ML approach must accommodate large training sets (indicating that neural networks may be preferable over kernel methods), and its evaluation cost should not scale with the number of bead types, so that evaluation times do not increase when considering transferable models. This constraint favors the use of deep learning architectures like SchNet [33] over models based on fixed representations, e.g., symmetry functions [70, 71].
2.4. Validation and Robustness
While atomistic ML force-field development has matured, there exists no appropriate set of best practices for probing stability and robustness. It is common to assess atomistic model accuracy with pointwise metrics, such as the mean force error, over fixed test datasets [27]. However, without an understanding of how models extrapolate into data-poor regions, these metrics cannot be used as indicators of simulation stability or accuracy [72, 64], as simulations may explore uncovered configurations. For ML CG models, even with the use of prior energy terms, a low force error does not guarantee a stable model [68].
Due to the difficulty in constructing comprehensive test sets, the robustness and accuracy of a trained ML model can only be ascertained through extensive sampling, e.g., by using the model to run long MD simulations. Recent investigations into ML architectures have revealed the need for such metrics for both atomistic and CG ML models [64, 68]. Unfortunately, obtaining a converged CG MD simulation can require several million force evaluations; for large systems and complex architectures this may present a computational bottleneck [73]. Validation difficulties impede hyperparameter optimization (e.g., regularization strength, cutoff, or prior potential), as searches may become prohibitively expensive. We note, however, that existing applications provide suitable initial choices of hyperparameters for select architectures and resolutions (see [36]), but that the introduction of novel architectures naturally requires substantial effort for the initial hyperparameter search.
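Because validation requires actually simulating the model, a simple integrator is often sufficient to probe stability. The following is a minimal sketch under strong assumptions (overdamped dynamics, unit friction, reduced units), not a production MD engine:

```python
import numpy as np

def overdamped_langevin(force_fn, R0, n_steps, dt=1e-4, beta=1.0,
                        gamma=1.0, seed=0):
    """Brownian-dynamics (overdamped Langevin) integrator commonly used
    to simulate CG models: no solvent, no inertia."""
    rng = np.random.default_rng(seed)
    R = R0.astype(float).copy()
    traj = np.empty((n_steps,) + R.shape)
    noise_scale = np.sqrt(2 * dt / (beta * gamma))
    for t in range(n_steps):
        R += dt / gamma * force_fn(R) + noise_scale * rng.normal(size=R.shape)
        traj[t] = R
    return traj

# Demo on a 1D harmonic force F = -x (toy stand-in for a CG force-field).
traj = overdamped_langevin(lambda R: -R, np.zeros(1), 50_000, dt=0.01)
```

Checking that the sampled variance matches the analytic equilibrium value (1/(βk) for a harmonic well) is exactly the kind of sampled-observable validation the text argues is unavoidable for CG models.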
Even once MD has been used to characterize a ML CG model, validation still poses difficulties. When characterizing atomistic force-fields on selected configurations it is typically possible to compare the model’s energy and force predictions to reference values; unfortunately, these are not available at the CG resolution (Eq. (1)). Instead, analysis typically projects CG configurations onto low-dimensional collective variables (e.g., Fig. 3). However, as ML CG models are now able to reproduce such collective variable surfaces, the need for more rigorous validation is emerging. Recent work [75] has proposed classification as an approach to generate energy-like errors for CG models and may provide an avenue for connecting atomistic and CG force-field validation.
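Such a projection is straightforward to compute from a CG trajectory. A minimal histogram-based estimator (a sketch that ignores reweighting and error bars) recovers the free energy surface along a collective variable as F(s) = −(1/β) ln p(s):

```python
import numpy as np

def free_energy_profile(cv_samples, bins=50, beta=1.0):
    """Estimate F(s) = -(1/beta) ln p(s) along a collective variable s,
    shifted so the minimum of the profile is zero."""
    hist, edges = np.histogram(cv_samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    with np.errstate(divide="ignore"):  # empty bins give F = +inf
        F = -np.log(hist) / beta
    F = F - F[np.isfinite(F)].min()
    return centers, F

# Demo: Gaussian samples should give a profile with its minimum near 0.
rng = np.random.default_rng(4)
centers, F = free_energy_profile(rng.normal(size=100_000))
```

Comparing such profiles between a CG model and the mapped atomistic reference (as in Fig. 3A) is the standard, if low-dimensional, validation currently in use.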
Figure 3:
State-of-the-art performance for a Cα CG ML model on the benchmark protein CLN025. A) Comparison of the CG free energy landscape of CLN025 (produced using MD) for a learned CG ML model with the corresponding free energy for the reference all-atom dataset projected onto slow degrees of freedom (TICA) [74]. B) Ensembles of structures sampled from the CG ML model MD simulation (in red) are superimposed onto all-atom reference structure counterparts (in blue). Basin 1 represents the unfolded state, basin 2 the misfolded state, and basin 3 the folded state.
A related challenge is presented by model uncertainty: How robust is a ML model to different training seeds or data partitioning strategies? For neural networks, these can be expensive questions to answer. However, recent advances have started to enable estimates of uncertainty [76, 77]. A promising strategy involves estimating the uncertainty of predictions and minimizing it either before or during model deployment, either through iterative training or through “on-the-fly” frameworks [78] where data is added to the training set based on such estimates.
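A simple realization of such an uncertainty estimate is a deep-ensemble-style sketch: train several models from different seeds and use the spread of their predictions. Here toy linear "models" stand in for independently trained networks (the linear form and parameters are illustrative only):

```python
import numpy as np

def ensemble_force_uncertainty(models, R):
    """Mean and standard deviation of force predictions across an
    ensemble of independently trained models."""
    preds = np.stack([m(R) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)

# Toy "models": same functional form, slightly different parameters,
# mimicking different training seeds or data splits.
models = [lambda R, k=k: -k * R for k in (0.9, 1.0, 1.1)]
R = np.array([[1.0, 0.0, 0.0]])
mean_f, std_f = ensemble_force_uncertainty(models, R)
```

Configurations where the ensemble disagrees most are natural candidates for adding reference data in the iterative or "on-the-fly" schemes mentioned above.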
3. Conclusion
At the moment of writing, state-of-the-art ML CG models can quantitatively reproduce the behavior of small proteins, as shown in Fig. 3 for Chignolin and in Ref. [54] for Trp-cage, BBA, and Villin. Currently, the largest barrier to describing larger proteins is gathering sufficient training data. To what extent such an approach can be extended to define transferable CG models remains an open question; it may be possible only for a class of proteins, or at particular resolutions. Before the advent of ML methods, these questions remained challenging to answer, as thermodynamic consistency between an atomistic and a CG model (Eq. (1)) could only be approximately enforced; it was not clear whether the limitations of transferable models [59, 79, 80, 81, 39] were due to the limited expressivity of the CG energy and limited reference data, or to more fundamental problems with transferability. Now that ML CG models can quantitatively enforce thermodynamic consistency for single proteins (as shown in Fig. 3), we have the tools to address these questions and explore the trade-off between accuracy and transferability. Here, we have discussed the practical challenges towards this goal, and we remain optimistic that this line of research can be pursued.
Even if a transferable bottom-up ML CG model can eventually be defined, the success of a computational model ultimately relies on its comparison to experiments. Bottom-up CG models rely on reference atomistic models and necessarily inherit their inaccuracies and flaws. As atomistic force-fields improve, we expect CG models to become more accurate as well. However, even small inconsistencies between the CG and atomistic models may compound into significant deviations from experimental data. We believe that bottom-up ML CG models will ultimately need to be merged with top-down approaches to become useful and predictive in applications.
Highlights.
Coarse-graining is a powerful tool for modeling complex macromolecular systems
Machine learning is now enabling the definition of accurate bottom-up coarse-grained protein models
Bottom-up protein coarse-grained models that are fully transferable in sequence space still do not exist
We discuss the outstanding challenges towards the design of transferable coarse-grained protein models and possible ways forward
Acknowledgments
Molecular visualizations for the graphical abstract and all in-text figures were produced using UCSF ChimeraX, a software developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco [82]. The free energy surfaces in Fig. 3 were produced using the Python package Matplotlib [83]. We acknowledge the financial support of Deutsche Forschungsgemeinschaft DFG (SFB/TRR 186, Project A12; SFB 1114, Projects B03 and A04; SFB 1078, Project C7; and RTG 2433, Project Q04 and Q05), the National Science Foundation NSF (CHE-1900374 and PHY-2019745), the NLM Training Program in Biomedical Informatics and Data Science (Grant No. 5T15LM007093-27), and the Einstein Foundation Berlin (Project 0420815101). We also thank the members of the Clementi and Noé groups for their helpful discussions.
F.M. acknowledges support from the SNSF under the Postdoc.Mobility fellowship P500PT_203124.
Conflict of interest statement
Nothing declared.
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- [1].Levitt M, Warshel A, Computer simulation of protein folding, Nature 253 (5494) (1975) 694–698. [DOI] [PubMed] [Google Scholar]
- [2].Clementi C, Coarse-grained models of protein folding: toy models or predictive tools?, Curr. Opin. Struct. Biol 18 (1) (2008) 10–15. [DOI] [PubMed] [Google Scholar]
- [3].Bryngelson JD, Wolynes PG, Spin glasses and the statistical mechanics of protein folding., Proc. Natl. Acad. Sci. USA 84 (21) (1987) 7524–7528. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [4].Onuchic JN, Luthey-Schulten Z, Wolynes PG, Theory of Protein Folding: The energy landscape perspective, Annu. Rev. Phys. Chem 48 (1) (1997) 545–600. [DOI] [PubMed] [Google Scholar]
- [5].Dill KA, Bromberg S, Yue K, Chan HS, Ftebig KM, Yee DP, Thomas PD, Principles of protein folding — a perspective from simple exact models, Protein Science 4 (4) (1995) 561–602. doi: 10.1002/pro.5560040401. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [6].Best RB, Hummer G, Reaction coordinates and rates from transition paths, Proc. Natl. Acad. Sci. USA 102 (19) (2005) 6732–6737. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [7].Chodera JD, Singhal N, Pande VS, Dill KA, Swope WC, Automatic discovery of metastable states for the construction of markov models of macromolecular conformational dynamics, J. Chem. Phys 126 (15) (2007) 155101. [DOI] [PubMed] [Google Scholar]
- [8].Noé F, Clementi C, Collective variables for the study of long-time kinetics from molecular trajectories: theory and methods, Curr. Opin. Struct. Biol 43 (2017) 141–147. [DOI] [PubMed] [Google Scholar]
- [9].Clementi C, Nymeyer H, Onuchic JN, Topological and energetic factors: what determines the structural details of the transition state ensemble and “en-route” intermediates for protein folding? an investigation for small globular proteins, J. Mol. Biol 298 (5) (2000) 937–953. [DOI] [PubMed] [Google Scholar]
- [10].Liwo A, Khalili M, Scheraga HA, Ab initio simulations of protein-folding pathways by molecular dynamics with the united-residue model of polypeptide chains, Proc. Natl. Acad. Sci. USA 102 (7) (2005) 2362–2367. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Davtyan A, Schafer NP, Zheng W, Clementi C, Wolynes PG, Papoian GA, Awsem-md: Protein structure prediction using coarse-grained physical potentials and bioinformatically based local structure biasing, The Journal of Physical Chemistry B 116 (29) (2012) 8494–8503, doi: 10.1021/jp212541y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Bereau T, Deserno M, Generic coarse-grained model for protein folding and aggregation, The Journal of chemical physics 130 (23) (2009) 06B621. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [13].Souza PCT, Thallmair S, Conflitti P, Ramírez-Palacios C, Alessandri R, Raniolo S, Limongelli V, Marrink SJ, Protein–ligand binding with the coarse-grained martini model, Nature Communications 11 (11) (2020) 3714. doi: 10.1038/s41467-020-17437-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Roel-Touris J, Don CG, Honorato RV, Rodrigues JPGLM, Bonvin AMJJ, Less is more: Coarse-grained integrative modeling of large biomolecular assemblies with HADDOCK, J. Chem. Theory Comput 15 (11) (2019) 6358–6367. doi: 10.1021/acs.jctc.9b00310. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Louhivuori M, Risselada HJ, van der Giessen E, Marrink SJ, Release of content through mechano-sensitive gates in pressurized liposomes, Proc. Natl. Acad. Sci. USA 107 (46) (2010) 19856–19860. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [16].Davies KM, Anselmi C, Wittig I, Faraldo-Gómez JD, Kühlbrandt W, Structure of the yeast F1Fo-ATP synthase dimer and its role in shaping the mitochondrial cristae, Proc. Natl. Acad. Sci. USA 109 (34) (2012) 13602–13607. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [17].Zheng W, Tsai M-Y, Chen M, Wolynes PG, Exploring the aggregation free energy landscape of the amyloid-β protein (1–40), Proc. Natl. Acad. Sci. USA 113 (42) (2016) 11835–11840. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [18].Pak AJ, Yu A, Ke Z, Briggs JAG, Voth GA, Cooperative multivalent receptor binding promotes exposure of the SARS-CoV-2 fusion machinery core, Nat. Commun 13 (1) (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Giulini M, Menichetti R, Shell MS, Potestio R, An information-theory-based approach for optimal model reduction of biomolecules, J. Chem. Theory. Comput 16 (11) (2020) 6795–6813. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [20].Lindorff-Larsen K, Piana S, Dror RO, Shaw DE, How fast-folding proteins fold, Science 334 (6055) (2011) 517–520. [DOI] [PubMed] [Google Scholar]
- [21].Plattner N, Doerr S, De Fabritiis G, Noé F, Complete protein–protein association kinetics in atomic detail revealed by molecular dynamics simulations and markov modelling, Nat. Chem 9 (10) (2017) 1005–1011. [DOI] [PubMed] [Google Scholar]
- [22] Bottaro S, Lindorff-Larsen K, Biophysical experiments and biomolecular simulations: A perfect match?, Science 361 (6400) (2018) 355–360.
- [23] Shaw DE, Grossman J, Bank JA, Batson B, Butts JA, Chao JC, Deneroff MM, Dror RO, Even A, Fenton CH, et al., Anton 2: raising the bar for performance and programmability in a special-purpose molecular dynamics supercomputer, in: SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2014, pp. 41–53.
- [24] Prinz J-H, Wu H, Sarich M, Keller B, Senne M, Held M, Chodera JD, Schütte C, Noé F, Markov models of molecular kinetics: Generation and validation, J. Chem. Phys 134 (17) (2011) 174105.
- [25] Husic BE, Pande VS, Markov state models: From an art to a science, J. Am. Chem. Soc 140 (7) (2018) 2386–2396.
- [26] Noé F, Tkatchenko A, Müller K-R, Clementi C, Machine learning for molecular simulation, Annu. Rev. Phys. Chem 71 (2020) 361–390. ⋆ A review of the application of machine learning to different aspects of molecular simulation.
- [27] Unke OT, Chmiela S, Sauceda HE, Gastegger M, Poltavsky I, Schütt KT, Tkatchenko A, Müller K-R, Machine Learning Force Fields, Chem. Rev 121 (16) (2021) 10142–10186.
- [28] Kapil V, Schran C, Zen A, Chen J, Pickard CJ, Michaelides A, The first-principles phase diagram of monolayer nanoconfined water, Nature 609 (7927) (2022) 512–516.
- [29] Gigli L, Veit M, Kotiuga M, Pizzi G, Marzari N, Ceriotti M, Thermodynamics and dielectric response of BaTiO3 by data-driven modeling, npj Comput. Mater 8 (1) (2022) 1–17.
- [30] Noid WG, Perspective: Coarse-grained models for biomolecular systems, J. Chem. Phys 139 (9) (2013) 090901.
- [31] Jin J, Pak AJ, Durumeric AE, Loose TD, Voth GA, Bottom-up Coarse-Graining: Principles and Perspectives, J. Chem. Theory Comput (2022). ⋆ A review of the theory, implementation, and open challenges of bottom-up coarse-graining.
- [32] Bartók AP, Payne MC, Kondor R, Csányi G, Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons, Phys. Rev. Lett 104 (13) (2010) 136403.
- [33] Schütt KT, Sauceda HE, Kindermans PJ, Tkatchenko A, Müller KR, SchNet - A deep learning architecture for molecules and materials, J. Chem. Phys 148 (24) (2018) 241722.
- [34] John ST, Csányi G, Many-Body Coarse-Grained Interactions Using Gaussian Approximation Potentials, J. Phys. Chem. B 121 (48) (2017) 10934–10949.
- [35] Wang J, Charron N, Husic B, Olsson S, Noé F, Clementi C, Multibody effects in a coarse-grained protein force field, J. Chem. Phys 154 (16) (2021) 164113.
- [36] Husic BE, Charron NE, Lemm D, Wang J, Pérez A, Majewski M, Krämer A, Chen Y, Olsson S, de Fabritiis G, Noé F, Clementi C, Coarse graining molecular dynamics with graph neural networks, J. Chem. Phys 153 (2020) 194101. ⋆ A graph neural network is used to model the effective energy of an ML CG model. The network architecture makes the model in principle transferable across chemical space.
- [37] Chen Y, Krämer A, Charron NE, Husic BE, Clementi C, Noé F, Machine learning implicit solvation for molecular dynamics, J. Chem. Phys 155 (8) (2021) 084101.
- [38] Wagner JW, Dama JF, Durumeric AE, Voth GA, On the representability problem and the physical meaning of coarse-grained models, J. Chem. Phys 145 (4) (2016) 044108.
- [39] Dunn NJ, Foley TT, Noid WG, Van der Waals perspective on coarse-graining: Progress toward Solving Representability and Transferability Problems, Acc. Chem. Res 49 (12) (2016) 2832–2840.
- [40] Jin J, Pak AJ, Voth GA, Understanding missing entropy in coarse-grained systems: Addressing issues of representability and transferability, J. Phys. Chem. Lett 10 (16) (2019) 4549–4557. ⋆⋆ A comprehensive discussion of the challenges of representability and transferability in coarse-grained systems.
- [41] Tóth G, Interactions from diffraction data: historical and comprehensive overview of simulation assisted methods, J. Phys.: Condens. Matter 19 (33) (2007) 335220.
- [42] Lemke T, Peter C, Neural network based prediction of conformational free energies - a new route toward coarse-grained simulation models, J. Chem. Theory Comput 13 (12) (2017) 6213–6221.
- [43] Zhang L, Han J, Wang H, Car R, E W, DeePCG: Constructing coarse-grained models via deep neural networks, J. Chem. Phys 149 (3) (2018) 034101.
- [44] Wang J, Olsson S, Wehmeyer C, Pérez A, Charron NE, de Fabritiis G, Noé F, Clementi C, Machine learning of coarse-grained molecular dynamics force fields, ACS Cent. Sci (2019). ⋆⋆ Introduces machine learning into protein coarse-graining by training a neural network with a physical prior potential via variational force-matching.
- [45] Thaler S, Zavadlav J, Learning neural network potentials from experimental data via Differentiable Trajectory Reweighting, Nat. Commun 12 (1) (2021) 6884. ⋆ Coarse-graining with machine learning methods; in particular, the parameters of a neural network are tuned to reproduce selected observables using iterative simulations.
- [46] Ding X, Zhang B, Contrastive learning of coarse-grained force fields, J. Chem. Theory Comput 18 (10) (2022) 6334–6344. ⋆ An alternative approach to bottom-up coarse-graining of molecular systems that requires neither atomistic forces nor expensive sampling iterations.
- [47] Thaler S, Stupp M, Zavadlav J, Deep Coarse-grained Potentials via Relative Entropy Minimization, arXiv:2208.10330 (2022).
- [48] Noid WG, Chu J-W, Ayton GS, Krishna V, Izvekov S, Voth GA, Das A, Andersen HC, The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models, J. Chem. Phys 128 (24) (2008) 244114.
- [49] Shell MS, The relative entropy is fundamental to multiscale and inverse thermodynamic problems, J. Chem. Phys 129 (14) (2008) 144108.
- [50] Carmichael SP, Shell MS, A new multiscale algorithm and its application to coarse-grained peptide models for self-assembly, J. Phys. Chem. B 116 (29) (2012) 8383–8393.
- [51] Hinton GE, Training products of experts by minimizing contrastive divergence, Neural Comput. 14 (8) (2002) 1771–1800.
- [52] Gutmann M, Hyvärinen A, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 297–304.
- [53] Wieder M, Fass J, Chodera JD, Teaching free energy calculations to learn from experimental data, bioRxiv 2021.08.24.457513 (2021).
- [54] Köhler J, Chen Y, Krämer A, Clementi C, Noé F, Flow-matching – efficient coarse-graining molecular dynamics without forces, arXiv:2203.11167 (2022). ⋆⋆ Data-efficient method for learning coarse-grained force-fields with generative deep neural networks, shown to work well on a set of fast-folding proteins.
- [55] Lyman E, Pfaendtner J, Voth GA, Systematic multiscale parameterization of heterogeneous elastic network models of proteins, Biophys. J 95 (9) (2008) 4183–4192.
- [56] Podryabinkin EV, Tikhonov EV, Shapeev AV, Oganov AR, Accelerating crystal structure prediction by machine-learning interatomic potentials with active learning, Phys. Rev. B 99 (6) (2019) 064114.
- [57] Shen K, Sherck N, Nguyen M, Yoo B, Köhler S, Speros J, Delaney KT, Fredrickson GH, Shell MS, Learning composition-transferable coarse-grained models: Designing external potential ensembles to maximize thermodynamic information, J. Chem. Phys 153 (15) (2020) 154116.
- [58] Kanekal KH, Rudzinski JF, Bereau T, Broad chemical transferability in structure-based coarse-graining, J. Chem. Phys 157 (2022) 104102.
- [59] Hills RD, Lu L, Voth GA, Multiscale coarse-graining of the protein energy landscape, PLoS Comput. Biol 6 (6) (2010) e1000827.
- [60] Wang W, Gómez-Bombarelli R, Coarse-graining auto-encoders for molecular dynamics, npj Comput. Mater 5 (1) (2019) 125.
- [61] Foley TT, Kidder KM, Shell MS, Noid W, Exploring the landscape of model representations, Proc. Natl. Acad. Sci. USA 117 (39) (2020) 24061–24068. ⋆ Examines the effect of different choices of resolution and mapping on coarse-grained model systems.
- [62] Wang W, Xu M, Cai C, Miller BK, Smidt T, Wang Y, Tang J, Gómez-Bombarelli R, Generative coarse-graining of molecular conformations, arXiv:2201.12176 (2022).
- [63] Chennakesavalu S, Toomer DJ, Rotskoff GM, Ensuring thermodynamic consistency with invertible coarse-graining, arXiv:2210.07882 (2022).
- [64] Fu X, Wu Z, Wang W, Xie T, Keten S, Gomez-Bombarelli R, Jaakkola T, Forces are not enough: Benchmark and critical evaluation for machine learning force fields with molecular simulations, arXiv:2210.07237 (2022). ⋆ The authors show that the force error is not a good metric for evaluating machine-learned force-fields and introduce a new benchmark suite for ML MD simulation.
- [65] Ramakrishnan R, Dral PO, Rupp M, von Lilienfeld OA, Big data meets quantum chemistry approximations: The Δ-machine learning approach, J. Chem. Theory Comput 11 (5) (2015) 2087–2096.
- [66] Dolgirev PE, Kruglov IA, Oganov AR, Machine learning scheme for fast extraction of chemically interpretable interatomic potentials, AIP Advances 6 (8) (2016) 085318.
- [67] Deringer VL, Csányi G, Machine learning based interatomic potential for amorphous carbon, Phys. Rev. B 95 (2017) 094203.
- [68] Ricci E, Giannakopoulos G, Karkaletsis V, Theodorou DN, Vergadou N, Developing Machine-Learned Potentials for Coarse-Grained Molecular Simulations: Challenges and Pitfalls, in: Proceedings of the 12th Hellenic Conference on Artificial Intelligence, 2022, pp. 1–6.
- [69] Musil F, Grisafi A, Bartók AP, Ortner C, Csányi G, Ceriotti M, Physics-Inspired Structural Representations for Molecules and Materials, Chem. Rev 121 (16) (2021) 9759–9815. ⋆ An extensive review of different representations and their properties for the modeling of molecules and materials with machine learning methods.
- [70] Behler J, Parrinello M, Generalized neural-network representation of high-dimensional potential-energy surfaces, Phys. Rev. Lett 98 (14) (2007).
- [71] Smith JS, Isayev O, Roitberg AE, ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost, Chem. Sci 8 (4) (2017) 3192–3203.
- [72] Stocker S, Gasteiger J, Becker F, Günnemann S, Margraf JT, How robust are modern graph neural network potentials in long and hot molecular dynamics simulations?, ChemRxiv (2022).
- [73] Unke OT, Stöhr M, Ganscha S, Unterthiner T, Maennel H, Kashubin S, Ahlin D, Gastegger M, Sandonas LM, Tkatchenko A, Müller K-R, Accurate machine learned quantum-mechanical force fields for biomolecular simulations, arXiv:2205.08306 (2022).
- [74] Pérez-Hernández G, Paul F, Giorgino T, De Fabritiis G, Noé F, Identification of slow molecular order parameters for Markov model construction, J. Chem. Phys 139 (1) (2013) 015102.
- [75] Durumeric AEP, Voth GA, Explaining classifiers to understand coarse-grained models, arXiv:2109.07337 (2021).
- [76] Gal Y, Ghahramani Z, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, in: Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 1050–1059.
- [77] Lakshminarayanan B, Pritzel A, Blundell C, Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6405–6416.
- [78] Vandermause J, Torrisi SB, Batzner S, Xie Y, Sun L, Kolpak AM, Kozinsky B, On-the-fly active learning of interpretable Bayesian force fields for atomistic rare events, npj Comput. Mater 6 (1) (2020) 1–11.
- [79] Sanyal T, Mittal J, Shell MS, A hybrid, bottom-up, structurally accurate, Gō-like coarse-grained protein model, J. Chem. Phys 151 (4) (2019) 044111.
- [80] Potter TD, Tasche J, Wilson MR, Assessing the transferability of common top-down and bottom-up coarse-grained molecular models for molecular mixtures, Phys. Chem. Chem. Phys 21 (2019) 1912–1927.
- [81] Rosenberger D, Van Der Vegt NF, Addressing the temperature transferability of structure based coarse graining models, Phys. Chem. Chem. Phys 20 (9) (2018) 6617–6628.
- [82] Goddard TD, Huang CC, Meng EC, Pettersen EF, Couch GS, Morris JH, Ferrin TE, UCSF ChimeraX: Meeting modern challenges in visualization and analysis, Protein Sci. 27 (1) (2018) 14–25.
- [83] Hunter JD, Matplotlib: A 2D graphics environment, Comput. Sci. Eng 9 (3) (2007) 90–95.