Abstract
Monte Carlo (MC) methods are important computational tools for molecular structure optimization and prediction. When solvent effects are explicitly considered, MC methods become very expensive due to the large number of degrees of freedom associated with the water molecules and mobile ions. Alternatively, implicit-solvent MC can greatly reduce the computational cost by applying a mean-field approximation to solvent effects while maintaining the atomic detail of the target molecule. The two most popular implicit-solvent models are the Poisson-Boltzmann (PB) model and the Generalized Born (GB) model; the GB model is an approximation to the PB model but is much faster in simulation time. In this work, we develop a machine learning-based implicit-solvent Monte Carlo (MLIMC) method by combining the advantages of both implicit solvent models in accuracy and efficiency. Specifically, the MLIMC method uses a fast and accurate PB-based machine learning (PBML) scheme to compute the electrostatic solvation free energy at each step. We validate our MLIMC method by using a benzene-water system and a protein-water system. We show that the proposed MLIMC method has great advantages in speed and accuracy for molecular structure optimization and prediction.
Keywords: Machine learning, Implicit-solvent Monte Carlo simulation, Poisson-Boltzmann equation, Electrostatics
I. INTRODUCTION
The determination of protein structures is of paramount importance for structural biology and macromolecular study. However, not all protein structures can be determined with available experimental techniques due to various limitations. Computational methods offer important alternative approaches for structural determination and optimization [1]. Indeed, molecular force field models and molecular dynamics [2-4] can generate time-resolved trajectories of protein folding and protein-ligand binding predictions as well as structural ensemble simulations [5]. In these simulations, mathematical models and numerical algorithms are imperative for achieving computational accuracy and efficiency. A large number of advanced algorithms have been developed to reduce the computational cost and improve the accuracy for biomolecular simulations [6-9]. A major difficulty of molecular dynamics is the long timescales associated with real molecular processes taking place in nature. Therefore, ignoring the requirement of having time-resolved trajectories of the molecular processes will immediately remove the difficulty. Indeed, it is sufficient for most studies to have a predicted representative ensemble of structures for a given process. This representative prediction can be generated by Monte Carlo sampling [10].
The Monte Carlo method is one of the most popular approaches for biomolecular systems. Under physiological conditions, biomolecules are immersed in and interact with surrounding water molecules and other possible co-factors. As such, Monte Carlo simulations of a biomolecule have to deal with a large number of solvent water molecules, which makes the simulations very expensive and sometimes intractable. Additionally, in Monte Carlo simulations, the biomolecular conformation is subject to random perturbations [11]. These perturbations will inevitably result in overlaps between the biomolecule and explicit solvent molecules, which lead to unfavorable and non-representative structures. Implicit solvent models, such as Poisson-Boltzmann (PB) [12, 13], polarizable continuum [14, 15], and Generalized Born (GB) methods [16-19], were developed to overcome this challenge by taking a mean-field approximation of the water molecules, resulting in a dielectric continuum. The GB method is faster than PB methods, but it only provides an approximation of the electrostatic energies. PB methods, derived from fundamental physical theories [20, 21], offer more accurate electrostatic analysis. The PB model has been applied to the calculation of protein-protein and protein-ligand binding energies [22], the pH-dependent prediction of protonation and/or deprotonation states of titration sites [23], and drug design [24]. In the search for accurate, efficient, and robust numerical solvers, a large number of numerical methods have been developed for the PB model, including the finite difference method (FDM) [25], the finite element method (FEM) [26], and the boundary element method (BEM) [7, 27]. Among these approaches, the FDM is the most widely adopted, with implementations such as Amber PBSA [28], DelPhi [29], APBS [23, 26], MIBPB [6, 30-33], and CHARMM PBEQ [25]. Among them, MIBPB is the only available second-order accurate method and has been used to calibrate the GB method in Amber [34]; PB methods, however, are generally very expensive. In addition, all the aforementioned methods involve the molecular surface, for which dedicated software has been developed, such as ESES [35], Nanoshaper [36], and MSMS [37].
Over the past few years, machine learning, including deep learning, has had tremendous success in science and engineering. In particular, convolutional neural networks have proved their ability to automatically extract features and recognize patterns from relatively simple but large datasets. Deep learning has a growing dominance in important applications such as handwriting recognition, speech recognition, and drug discovery [38-40]. Aided by the availability of quality databases, new algorithms, graphics processing units (GPUs), and high-performance computers, various machine learning approaches have been established for many classical computational problems such as solvation free energies, protein-ligand binding affinities, mutation impacts, toxicity, partition coefficients, protein B-factors, etc. [41-50]. Additionally, deep neural networks have also been applied to computational protein design [51], stability changes of proteins induced by mutations [52, 53], and calculations of protein energy [54, 55].
Recently, we developed a Poisson-Boltzmann based machine learning (PBML) model, which can compute the solvation free energy of macromolecules in solvent with GB speed and PB accuracy [56]. We assume that all of the macromolecular electrostatic solvation free energies follow a probability distribution, which can be sampled by the PB model. Our idea is based on a representability hypothesis and a learning hypothesis. The representability hypothesis states that the solvation free energy of a molecule can be described by the features of atomic interactions and their geometric relations in the solvent. Thus, we can construct feature vectors to characterize the molecular electrostatic distribution. In our learning hypothesis, we assume that a machine learning model can be trained based on training labels and corresponding features for a sufficiently large training set of molecules. Additionally, advanced machine learning algorithms can give accurate predictions of the electrostatic potential for a new molecule that follows the same probability distribution as the training set. In our approach, training labels are computed from MIBPB and features are generated using multiscale weighted colored subgraphs [47].
In the present work, we apply our newly developed PBML model to compute molecular solvation free energies in the implicit-solvent Monte Carlo simulations, which typically require millions of samplings. The new machine learning-based implicit-solvent Monte Carlo model can guarantee the accuracy of the implicit-solvent Monte Carlo model while dramatically speeding up existing implicit-solvent Monte Carlo algorithms.
This manuscript is organized as follows. Section II gives a brief introduction of molecular force fields, Monte Carlo methods, and implicit solvent models. The PBML model is introduced in this section as well, which includes the Poisson-Boltzmann equation, Generalized Born model, and multiscale weighted colored subgraphs. Section III presents the results of structural predictions of benzene and the human hyperplastic discs protein (PDB: 1i2t) [57] in water. We demonstrate that the PBML model is more accurate and faster than commonly used PB solvers and thus, can significantly reduce the computational time of implicit-solvent Monte Carlo simulations. A summary is given in Section IV.
II. METHODS AND ALGORITHMS
In this section, we briefly review biomolecular force fields, the Monte Carlo methods, and implicit solvent models, followed by the Poisson-Boltzmann based machine learning model.
A. Biomolecular force fields
The quality of molecular simulations depends crucially on molecular force fields to offer a physical representation of molecular interactions and energy distributions. Molecular force fields typically describe molecular interactions in terms of classical molecular mechanics of atoms. The potential energies of atomic interactions are approximated by a set of mathematical functions, modeling the bonded and non-bonded components. These functions contain a set of free coefficients, which are obtained by fitting either the results of elaborate quantum mechanical calculations or experimental data. One of the advantages of the biomolecular force field approach is its computational efficiency. The potential energy can be computed efficiently at the molecular level compared with other methods, such as quantum mechanical approaches, which deal with electrons [58, 59]. Additionally, the forces in molecular dynamics can be evaluated analytically from molecular force fields.
A variety of molecular force fields have been developed for various purposes. In this work, we adopt the popular and simple Amber ff99SB force field [59]. The Amber potential energy consists of the following terms,
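$$U = \sum_{\text{bonds}} k_b (r - r_0)^2 + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2 + \sum_{\text{dihedrals}} \frac{V_n}{2}\left[1 + \cos(n\phi - \gamma)\right] + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left[\frac{A_{ij}}{R_{ij}^{12}} - \frac{B_{ij}}{R_{ij}^{6}} + \frac{q_i q_j}{\epsilon_1 R_{ij}}\right] \tag{1}$$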
where kb, kθ, and Vn are force constants. Here, r, θ, and ϕ are bond length, angle, and dihedral angle with r0, θ0, and γ being optimal bond length, optimal angle, and proper dihedral angle, respectively. The first three terms in the energy expression describe the bonded energy of the molecular system. The last term represents the Lennard-Jones interactions and electrostatic interactions, where N is the number of atoms in the molecular system, Rij is the distance between ith and jth atoms, Aij and Bij are Lennard-Jones parameters, qi is the atom charge, and ϵ1 is the dielectric constant.
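The energy terms described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the standard Amber functional form (harmonic bonds and angles, a cosine dihedral term with periodicity `n`, 12-6 Lennard-Jones, and Coulomb interactions), the function names are hypothetical, and unit conversion factors and bonded-pair exclusions are omitted.

```python
import math


def bond_energy(k_b, r, r0):
    """Harmonic bond stretching term k_b (r - r0)^2."""
    return k_b * (r - r0) ** 2


def angle_energy(k_theta, theta, theta0):
    """Harmonic angle bending term k_theta (theta - theta0)^2."""
    return k_theta * (theta - theta0) ** 2


def dihedral_energy(v_n, n, phi, gamma):
    """Cosine torsion term (V_n / 2) [1 + cos(n phi - gamma)]."""
    return 0.5 * v_n * (1.0 + math.cos(n * phi - gamma))


def nonbonded_energy(coords, charges, A, B, eps1=1.0):
    """Lennard-Jones plus Coulomb energy summed over all pairs i < j.

    A and B are N x N tables of Lennard-Jones parameters (assumed inputs).
    """
    total = 0.0
    n_atoms = len(coords)
    for i in range(n_atoms - 1):
        for j in range(i + 1, n_atoms):
            r_ij = math.dist(coords[i], coords[j])
            total += A[i][j] / r_ij**12 - B[i][j] / r_ij**6
            total += charges[i] * charges[j] / (eps1 * r_ij)
    return total
```

In a Monte Carlo setting only the energy is needed, so no force evaluation is included in this sketch.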
B. Monte Carlo methods
In this section, we provide a brief introduction to molecular dynamics and the Monte Carlo method. We start from statistical mechanics and show that the calculation of the physical properties of a solute-solvent system using molecular dynamics is computationally expensive or even intractable [10]. Then, we introduce the Metropolis Monte Carlo method for biomolecular simulations [11].
The classical expression for the partition function Q of a solute-solvent system is
$$Q = c \int\!\!\int d\mathbf{p}\, d\mathbf{r}\, \exp\left[-\frac{H(\mathbf{r}, \mathbf{p})}{k_B T}\right] \tag{2}$$
where r={X, Y} stands for the atomic coordinates of a solute X and solvent Y, p stands for the corresponding momenta, c is a physical constant as specified below, kB is the Boltzmann constant and T is the temperature of the system. The function H(r, p) is the Hamiltonian of the system. It describes the total energy of an individual system as the sum of the kinetic energy K(p) and the potential energy U(r), where K(p) is a quadratic function of the momenta. For a system of N identical atoms, one has c=1/(h^{3N} N!) using the Planck constant h. Under the assumption that all of the other physical observables A of interest depend only on the positions, i.e., A=A(r), the integration over the momenta can be carried out analytically in a classical mechanical treatment. As a result, the expected value of a physical observable of interest is given by
$$\langle A \rangle = \frac{\int d\mathbf{r}\, A(\mathbf{r}) \exp\left[-\beta U(\mathbf{r})\right]}{\int d\mathbf{r}\, \exp\left[-\beta U(\mathbf{r})\right]} \tag{3}$$
where β=1/(kBT). Evaluating ⟨A⟩ requires numerical techniques, such as quadrature rules for the integration. Since each particle moves in a three dimensional (3D) space, the total number of degrees of freedom is 3N for a system of N atoms. If each dimension is integrated with a mesh of m points, the total number of points for the integration is m^{3N}, which is computationally prohibitive.
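To give a sense of scale for this count, the short sketch below evaluates m^{3N} for a benzene-sized system (N=12 atoms) with an illustrative mesh of m=10 points per dimension; the choice m=10 is an assumption made only for this example.

```python
def quadrature_points(m, n_atoms):
    """Number of grid points needed to integrate over 3N coordinates."""
    return m ** (3 * n_atoms)


# Benzene has 12 atoms; with 10 points per dimension the grid has 10**36 points.
benzene_points = quadrature_points(10, 12)
```

Even at one billion integrand evaluations per second, 10^36 points would be far out of reach, which is the motivation for sampling instead of quadrature.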
The complexity in evaluating Eq.(3) can be significantly reduced by using the Monte Carlo sampling. Indeed, Metropolis et al. [11] suggested an efficient Monte Carlo scheme to approximate the ratio in Eq.(3). Let us denote the probability density function in finding a microstate in the canonical ensemble in a configuration r by
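$$P(\mathbf{r}) = \frac{\exp\left[-\beta U(\mathbf{r})\right]}{\int d\mathbf{r}'\, \exp\left[-\beta U(\mathbf{r}')\right]} \tag{4}$$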
According to this probability function, we can perturb randomly selected points in the configuration. Hence, the number of points ni generated per unit volume in the neighborhood of r is equal to NMC×P(r), and the average of A(r) is approximated by
$$\langle A \rangle \approx \frac{1}{N_{MC}} \sum_{i=1}^{N_{MC}} A(\mathbf{r}_i) \tag{5}$$
where NMC is the total number of configurations sampled in the Monte Carlo simulation. Eq.(5) shows that all sampled states of the ensemble contribute to the average equally. Therefore, the Metropolis Monte Carlo method starts at a given configuration r0={X0, Y0} and then perturbs the configuration by a defined transformation to obtain a new configuration r1={X1, Y1}. The probability to accept the new configuration is
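$$P_{\text{acc}}(\mathbf{r}_0 \to \mathbf{r}_1) = \min\left\{1, \exp\left(-\beta\left[U(\mathbf{r}_1) - U(\mathbf{r}_0)\right]\right)\right\} \tag{6}$$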
If the new configuration is rejected, the previous configuration is retained and the method tries another random perturbation. This process iterates until the number of iterations reaches a fixed value. It has been shown that the structures in the system will approach the Boltzmann distribution if the perturbations satisfy the condition
$$\pi(\mathbf{r}_i)\, p_{ij} = \pi(\mathbf{r}_j)\, p_{ji} \tag{7}$$
where π(ri) is the probability of the system in configuration ri and pij is the probability to perturb the configuration from state ri to state rj [11].
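The Metropolis procedure described in this section can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function name `metropolis`, the single-coordinate uniform trial move, and the `energy` callback are assumptions chosen for brevity. The trial move is symmetric, so accepting with probability min{1, exp(−βΔU)} samples the Boltzmann distribution.

```python
import math
import random


def metropolis(energy, x0, n_steps, beta, step_size, rng):
    """Sample configurations with the Metropolis acceptance rule.

    energy: callable returning the potential energy of a configuration (a list).
    rng:    a random.Random instance, passed in for reproducibility.
    """
    x = list(x0)
    e = energy(x)
    samples = []
    for _ in range(n_steps):
        # Symmetric trial move: displace one randomly chosen coordinate.
        trial = list(x)
        k = rng.randrange(len(trial))
        trial[k] += rng.uniform(-step_size, step_size)
        e_trial = energy(trial)
        delta = e_trial - e
        # Downhill moves are always accepted; uphill moves are accepted
        # with probability exp(-beta * delta).
        if delta <= 0.0 or rng.random() < math.exp(-beta * delta):
            x, e = trial, e_trial
        # A rejected move keeps (and re-counts) the previous configuration.
        samples.append(list(x))
    return samples
```

For a one-dimensional harmonic potential U(x)=x²/2 with β=1, the sample average of x² should approach 1, which is a convenient sanity check of the acceptance rule.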
C. Implicit solvent models
Implicit solvent models are a class of multiscale techniques for reducing the dimensionality of a solvent-solute system. They retain the crucial electrostatic interactions between a biomolecule and its solvent environment without modeling solvent molecules explicitly. A variety of two-scale implicit solvent models have been developed, such as the Poisson-Boltzmann (PB) model [13] and the generalized Born (GB) model [16-19]. One desirable application of implicit solvent models is the Monte Carlo simulation of biomolecules in solvent, which is relatively easy to implement. The basic derivation of molecular implicit solvent models relies on statistical mechanics. For more detail, the reader is referred to the literature [60]. Essentially, the molecular solvation free energy can be given by
$$\Delta G = \Delta G_{\text{elec}} + \Delta G_{\text{nonpol}} \tag{8}$$
where ΔGelec represents the electrostatic contribution of the solvent-solute interaction, and ΔGnonpol denotes the nonpolar energy in the reversible work needed to insert a fixed configuration molecule into the solvent with all solute charges set to zero. Here ΔGnonpol is proportional to the solvent accessible surface area. The molecular solvation free energy is used in our implicit-solvent Monte Carlo method to represent solvent-solute interactions.
D. Poisson-Boltzmann based machine learning (PBML) model
In this section, we briefly discuss the Poisson-Boltzmann based machine learning (PBML) model [56], which is applied to compute ΔGelec in Eq.(8). Our PBML model involves three major components, i.e., training labels, molecular features, and learning algorithms. Our training labels for a large training set of molecules are generated from solving the Poisson-Boltzmann (PB) equation. Our molecular features for both the training set and the test set constitute two parts, a GB part and a correction part. The latter is computed from multiscale weighted colored subgraphs [56].
1. The Poisson-Boltzmann (PB) model
The PB model considers the solute biomolecule with Nc fixed charges as the interior domain Ω1, and the solvent, including free ions, as the exterior domain Ω2. The interface Γ separates these two domains. The PB model is given as
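$$-\nabla \cdot \left(\epsilon(\mathbf{r}) \nabla \phi(\mathbf{r})\right) + \bar{\kappa}^2(\mathbf{r})\, \phi(\mathbf{r}) = \sum_{k=1}^{N_c} q_k\, \delta(\mathbf{r} - \mathbf{r}_k) \tag{9}$$
Here ϕ(r) is the electrostatic potential at position r, and the dielectric constant ϵ(r) is given by
$$\epsilon(\mathbf{r}) = \begin{cases} \epsilon_1, & \mathbf{r} \in \Omega_1, \\ \epsilon_2, & \mathbf{r} \in \Omega_2. \end{cases} \tag{10}$$
In the PB model, κ̄ is the screening parameter with the relation κ̄²=ϵ2κ², where κ is the inverse Debye length measuring the ionic effective length. To ensure the continuity of the electrostatic potential and flux density across the interface Γ, the PB equation is associated with the following interface conditions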
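$$\phi_1(\mathbf{r}) = \phi_2(\mathbf{r}), \qquad \epsilon_1 \frac{\partial \phi_1(\mathbf{r})}{\partial \mathbf{n}} = \epsilon_2 \frac{\partial \phi_2(\mathbf{r})}{\partial \mathbf{n}}, \qquad \mathbf{r} \in \Gamma \tag{11}$$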
where ϕ1 and ϕ2 are electrostatic potential from the solute domain Ω1 and the solvent domain Ω2, and n is the outward unit normal vector on Γ.
The solvation free energy can be obtained from the PB model by
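$$\Delta G_{\text{elec}} = \frac{1}{2} \sum_{k=1}^{N_c} q_k \left(\phi(\mathbf{r}_k) - \phi_0(\mathbf{r}_k)\right) \tag{12}$$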
where ϕ0(rk) is the free space solution to the PB equation assuming no solvent-solute interface. To solve the PB equation, we apply the accurate and robust 2nd order MIBPB solver [6, 32] developed in our group, which applies rigorous treatment on geometric complexity, interface condition, and charge singularity. The results generated by MIBPB solver for a set of macromolecules are used as the training labels in the representability hypothesis.
2. The Generalized Born (GB) model
Having described the labels for our machine learning training, we discuss the molecular feature construction for both machine learning training and test, which involves the GB model. As a fast approximation to the PB model, the GB model computes the electrostatic solvation free energy by
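$$\Delta G_{\text{elec}}^{\text{GB}} \approx -\frac{1}{2}\left(\frac{1}{\epsilon_1} - \frac{1}{\epsilon_2}\right) \frac{1}{1 + \alpha\beta} \sum_{i,j} q_i q_j \left(\frac{1}{f_{ij}} + \frac{\alpha\beta}{B}\right) \tag{13}$$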
where Ri is the effective Born radius for i-th atom, rij is the distance between atoms i and j, β=ϵ1/ϵ2, α=0.571412, and B is the electrostatic size of the molecule. The function fij is given as
$$f_{ij} = \sqrt{r_{ij}^2 + R_i R_j \exp\left(-\frac{r_{ij}^2}{4 R_i R_j}\right)} \tag{14}$$
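A GB energy of this kind can be sketched as below. The sketch assumes the analytical linearized PB (ALPB) form of the GB sum, which is consistent with the symbols α, β, and B defined above, together with the usual Still-type interpolation for fij; the effective Born radii are taken as given inputs, unit conversion factors are omitted, and the function names are hypothetical.

```python
import math

ALPHA = 0.571412  # value of alpha quoted in the text


def f_gb(r_ij, R_i, R_j):
    """Still-type interpolation; reduces to R_i for the self term (r_ij = 0)."""
    RiRj = R_i * R_j
    return math.sqrt(r_ij**2 + RiRj * math.exp(-(r_ij**2) / (4.0 * RiRj)))


def gb_energy(coords, charges, born_radii, B, eps1=1.0, eps2=80.0):
    """GB electrostatic solvation energy (ALPB form, assumed).

    B is the electrostatic size of the molecule; the double sum runs over
    all ordered pairs, including the self terms i == j.
    """
    beta = eps1 / eps2
    prefactor = -0.5 * (1.0 / eps1 - 1.0 / eps2) / (1.0 + ALPHA * beta)
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(n):
            r_ij = math.dist(coords[i], coords[j])
            f = f_gb(r_ij, born_radii[i], born_radii[j])
            total += charges[i] * charges[j] * (1.0 / f + ALPHA * beta / B)
    return prefactor * total
```

As a consistency check, for a single ion of radius R with B=R the expression reduces to the Born formula −(1/2)(1/ϵ1−1/ϵ2)q²/R.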
The effective Born radius Ri is calculated by the following boundary integral
(15) |
In Eq.(15), the MSMS package [61] is used to generate the triangulation discretization of the molecular surface for the numerical surface integral on Γ.
3. Multiscale weighted colored subgraphs
The weighted colored subgraph (WCS) uses the notion of a graph G(V, E) with vertices V and edges E to describe the atomic interactions in a protein of N atoms. The vertices are defined as
$$V = \left\{(\mathbf{r}_i, \alpha_i) \mid \mathbf{r}_i \in \mathbb{R}^3;\ \alpha_i \in \mathcal{C};\ i = 1, 2, \ldots, N\right\} \tag{16}$$
where 𝒞={C, N, O, S, H} contains all the commonly occurring element types in a protein. Each vertex is an atom labeled by both its position ri and its element type αi, for i=1, ⋯, N.
The edge set E relates the pairwise interactions, which are defined as a colored set 𝒫={αβ} with α, β∈𝒞. For 𝒞 defined above, 𝒫={CC, CN, CO, CS, CH, NN, NO, NS, NH, OO, OS, OH, SS, SH, HH}, and we define the partition of 𝒫 as 𝒫k, k=1, 2,…, 15 such that 𝒫1={CC}, 𝒫2={CN}, and so on. The set of involved vertices V𝒫k is a subset of V containing all atoms involved in forming the pairs in 𝒫k. For instance, 𝒫2 contains all carbon-nitrogen atom pairs and V𝒫2 contains all carbon and nitrogen atom vertices in the protein. Based on this configuration, all the edges for pairwise atomic interactions in the WCS description are defined by
$$E^{\sigma,\tau,\zeta}_{\mathcal{P}_k} = \left\{\Phi^{\sigma}_{\tau,\zeta}\left(\|\mathbf{r}_i - \mathbf{r}_j\|\right) \mid \alpha_i \beta_j \in \mathcal{P}_k;\ i = 1, \ldots, N_\alpha;\ j = 1, \ldots, N_\beta\right\} \tag{17}$$
where ∥ri – rj∥ defines the Euclidean distance between the ith and jth atoms, Nα and Nβ are the numbers of type α and β atoms, σ indicates the type of radial basis function (e.g., σ=L for the Lorentz kernel, σ=E for the exponential kernel), τ is a scale distance factor between two atoms, and ζ is the power parameter in the kernel (i.e., ζ=κ for σ=E, ζ=ν for σ=L). In this model, we use generalized exponential functions
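$$\Phi^{E}_{\tau,\kappa}\left(\|\mathbf{r}_i - \mathbf{r}_j\|\right) = \exp\left[-\left(\frac{\|\mathbf{r}_i - \mathbf{r}_j\|}{\tau (r_i + r_j)}\right)^{\kappa}\right], \quad \kappa > 0 \tag{18}$$
and generalized Lorentz functions
$$\Phi^{L}_{\tau,\nu}\left(\|\mathbf{r}_i - \mathbf{r}_j\|\right) = \frac{1}{1 + \left(\dfrac{\|\mathbf{r}_i - \mathbf{r}_j\|}{\tau (r_i + r_j)}\right)^{\nu}}, \quad \nu > 0 \tag{19}$$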
where ri and rj are, respectively, the van der Waals radius of the ith and jth atoms. Finally, the features for describing the electrostatics interactions and geometric properties are expressed as
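$$\mu^{k,\sigma,\tau,\zeta,w} = \sum_{i} \sum_{j} w_{ij}\, \Phi^{\sigma}_{\tau,\zeta}\left(\|\mathbf{r}_i - \mathbf{r}_j\|\right), \qquad \alpha_i \beta_j \in \mathcal{P}_k \tag{20}$$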
where wij is a weight function assigned to each atomic pair, with wij=1 for atomic rigidity or wij=qj for atomic charge. Since we have 15 options for the colored subsets (one for each element pair type), we obtain 15 corresponding subgraph centralities μk,σ,τ,ζ,w, for k=1,2,…, 15. By varying the kernel parameters (σ, τ, ζ, w), one can obtain multiscale centralities for the multiscale weighted colored subgraph (MWCS) [62], which serve as the features.
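The feature construction described in this subsection can be sketched as follows. This is a minimal illustration under several assumptions: the kernels are taken to be the usual generalized exponential and Lorentz forms scaled by τ times the sum of the two van der Waals radii, the pair type αβ is summed over atoms i of type α and atoms j of type β, and the data layout and function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement

ELEMENTS = ["C", "N", "O", "S", "H"]
# The 15 element pair types: CC, CN, CO, CS, CH, NN, ..., SH, HH.
PAIR_TYPES = [a + b for a, b in combinations_with_replacement(ELEMENTS, 2)]


def kernel(sigma, dist, eta, zeta):
    """Generalized exponential ("E") or Lorentz ("L") radial kernel."""
    x = (dist / eta) ** zeta
    if sigma == "E":
        return math.exp(-x)
    if sigma == "L":
        return 1.0 / (1.0 + x)
    raise ValueError("sigma must be 'E' or 'L'")


def mwcs_features(atoms, sigma, tau, zeta, use_charge=False):
    """Return one centrality-like sum per element pair type.

    atoms: list of (element, (x, y, z), vdw_radius, charge) tuples.
    use_charge: if True the weight is w_ij = q_j, otherwise w_ij = 1.
    """
    feats = {pair: 0.0 for pair in PAIR_TYPES}
    for i, (el_i, pos_i, rad_i, _q_i) in enumerate(atoms):
        for j, (el_j, pos_j, rad_j, q_j) in enumerate(atoms):
            pair = el_i + el_j
            if i == j or pair not in feats:
                continue  # skip self-pairs and reversed labels such as "NC"
            eta = tau * (rad_i + rad_j)  # scale set by the van der Waals radii
            w_ij = q_j if use_charge else 1.0
            feats[pair] += w_ij * kernel(sigma, math.dist(pos_i, pos_j), eta, zeta)
    return feats
```

Calling `mwcs_features` once per kernel setting (σ, τ, ζ, w) and concatenating the resulting 15-component vectors gives a multiscale feature vector of the kind described above.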
With labels and features described above, we can construct the machine learning model to predict the solvation free energy of new macromolecules. Specifically, using MIBPB results as labels, and GB and MWCS results as features, we train gradient boosting decision trees (GBDTs) for the solvation free energy prediction.
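The idea behind gradient boosting with decision trees can be illustrated with a toy version that uses depth-1 trees (stumps) and a squared-error loss. This is only a sketch of the general technique and not the model used in this work: a practical GBDT would rely on deeper trees and a tuned library implementation, and all names below are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for f in range(len(xs[0])):
        values = sorted(set(x[f] for x in xs))
        for lo, hi in zip(values, values[1:]):
            thr = 0.5 * (lo + hi)
            left = [r for x, r in zip(xs, residuals) if x[f] <= thr]
            right = [r for x, r in zip(xs, residuals) if x[f] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = sum((r - lmean) ** 2 for r in left)
            sse += sum((r - rmean) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lmean, rmean)
    if best is None:  # all feature values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return 0, float("inf"), mean, mean
    return best[1:]


def fit_gbdt(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit an additive model of stumps to (features, labels)."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        f, thr, lval, rval = fit_stump(xs, residuals)
        stumps.append((f, thr, lval, rval))
        preds = [p + learning_rate * (lval if x[f] <= thr else rval)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Evaluate the boosted model on one feature vector."""
    base, lr, stumps = model
    return base + lr * sum(l if x[f] <= t else r for f, t, l, r in stumps)
```

In the setting of this paper, `xs` would hold the GB and MWCS features of each training protein and `ys` the corresponding MIBPB solvation free energies.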
III. RESULTS
In this section, we demonstrate the performance of the proposed MLIMC method numerically. First, we describe the Poisson-Boltzmann based machine learning (PBML) model for computing protein electrostatic solvation energies, followed by an illustration of the accuracy and efficiency of the model. The use of the PBML model for electrostatic interactions in the MC simulations is then introduced. Our main idea is to replace time-consuming electrostatic calculations with our PBML model. The efficiency of our new MLIMC model is also examined. Finally, we validate the proposed MLIMC method with two cases. Case one is a small molecule, benzene, with initial atom positions randomly perturbed. Our MLIMC method is used to reconstruct the benzene molecule in solvent. Case two is a relatively larger molecule, a protein (PDB: 1i2t) with 61 amino acid residues. In this case, we stretch the last two residues of 1i2t using steered molecular dynamics and then try to restore the equilibrium configuration by using the proposed MLIMC method. Both simulations are carried out at a temperature of 27 °C, the dielectric constants are ϵ1=1 in the molecule and ϵ2=80 in the solvent, the MSMS [61] mesh density is set to 2, and the Debye-Hückel constant is set to κ=0.1257 Å⁻¹. Three kernels are used to generate features for machine learning, namely (E, 0.3, 2, 1), (E, 4.7, 2, qj), and (L, 4.2, 5, 1).
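For orientation, the thermal energy and the Debye length implied by these settings can be computed directly. The sketch below assumes energies in kcal/mol, so the Boltzmann constant is taken as 0.0019872041 kcal/(mol K); the function names are hypothetical.

```python
K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def thermal_energy(temp_celsius):
    """Return k_B T in kcal/mol for a temperature given in degrees Celsius."""
    return K_B * (temp_celsius + 273.15)


def debye_length(kappa):
    """Return the Debye length (in Å) for an inverse Debye length kappa (1/Å)."""
    return 1.0 / kappa


kT = thermal_energy(27.0)         # about 0.596 kcal/mol
lambda_d = debye_length(0.1257)   # about 7.96 Å
```

The value of kBT sets the energy scale of the Metropolis acceptance test, and the Debye length sets the range over which ionic screening acts.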
To measure the performance, we use the root-mean-square deviation (RMSD) of atomic positions in length units (Å), defined as
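$$\text{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \|\mathbf{v}_i - \mathbf{w}_i\|^2} \tag{21}$$
where v and w are vectors of positions of the N atoms at two different MC samplings. Moreover, we also present relative errors of the total energy, measured by comparing the energy for an MC sampling, EMC, with the energy for the equilibrium state, ESS, as
$$\text{Error} = \left|\frac{E_{MC} - E_{SS}}{E_{SS}}\right| \times 100\% \tag{22}$$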
We compute the RMSD and errors between Monte Carlo sampling results and the original molecular structure every 100 Monte Carlo steps for both cases. The core code is written in C/C++, with a Cython wrapper that calls the core code to provide add-on functions and applications. Our simulations are run on a desktop with an i5 7500 CPU and 16 GB of memory.
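The two performance measures can be sketched as follows, assuming that the energy error of Eq.(22) is the absolute relative difference expressed in percent and that no structural superposition is applied before computing the RMSD; the function names are hypothetical.

```python
import math


def rmsd(v, w):
    """Root-mean-square deviation between two lists of 3D atom positions."""
    n = len(v)
    return math.sqrt(sum(math.dist(a, b) ** 2 for a, b in zip(v, w)) / n)


def relative_error_percent(e_mc, e_ss):
    """Relative energy error of an MC sample with respect to equilibrium."""
    return abs(e_mc - e_ss) / abs(e_ss) * 100.0
```

For example, with a reference energy of 10.60 kcal/mol, a sampled energy of 11.66 kcal/mol corresponds to a relative error of 10%, which shows how a small absolute deviation translates into a large relative error when the reference energy is small.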
A. PBML model
The PBML model used in the Monte Carlo simulations is a pre-trained model. The training set includes 3706 protein structures from the PDBbind v2015 refined set [63]. This refined set was selected from a general set of 14,620 protein-ligand complexes. A data pre-processing step (i.e., adding force field parameters) is required before a PB solver can be used for electrostatics calculations. Though the PDBbind refined set consists of protein-ligand complexes, only the protein structures are used in the calculations. These protein structures are adjusted by the protein preparation wizard utility of the Schrödinger 2015-2 Suite [64] with default parameters unless filling the missing side chains is required.
The training set covers a wide range of proteins of different sizes, with atom numbers from 997 to 27,713. The current training set can be expanded to an even larger group of proteins. However, our tests indicate that expanding the training set does not significantly improve the trained model; thus, the size of the current training set is sufficiently large.
The purpose of PBML is to implement a machine learning predictor of PB electrostatic solvation free energies for various proteins efficiently and accurately without explicitly solving the PB equation. The gradient boosting decision tree method is selected for this supervised learning task because of its efficiency. The accuracy of the PBML model is maintained by using the accurate electrostatic solvation free energies calculated by the MIBPB solver as labels. Once a trained PBML model is obtained, the MIBPB solver is no longer called. Predicting the electrostatic solvation free energies of new compounds with the learned PBML model only requires calculating features, which is rapid.
B. Efficiency of the PBML model
FIG. 1 shows the results for computing solvation energies of 195 proteins from the PDBbind v2015 core set [63] using PBML, Amber, and DelPhi. The results are shown in terms of the average CPU time per protein versus the mean absolute percentage error. From FIG. 1(a), we can see that PBML is more accurate and much faster than standard PB solvers such as DelPhi and Amber PB. FIG. 1(b) gives more details by zooming into the region where the CPU time is small, in order to distinguish the CPU times used by PBML with different MSMS densities.
We add here a few notes on how we improve the PBML model beyond the machine learning component. We notice that in the energy and feature calculations, the cost of every term scales with the number of atoms N, except for the computation of the effective Born radii Ri in Eq.(15), which depends on the number of surface triangles M. Since M≫N, faster evaluation of Eq.(15) can significantly accelerate the entire Monte Carlo process. In our present implementation, instead of taking the integral in Eq.(15) on each triangle, we take the integral on a neighborhood of each vertex. This treatment nearly doubles the efficiency of the GB method, since the number of vertices is about half the number of triangles on the surface. In addition, applying a cut-off can further improve the efficiency of the GB method.
C. MLIMC model
The assembly of MLIMC includes the implementation of empirical potential energy functions (except electrostatics) and the prediction of electrostatics at each step of the Monte Carlo simulation. The conformation of the target protein is perturbed randomly at each step. The new conformation is accepted directly if it has a lower energy, or is accepted with a probability determined by the Boltzmann distribution if it has a higher energy. As the PBML model is pre-trained before the simulations, the Monte Carlo simulation does not include the time for solving the PB equation, resulting in a much reduced time for MLIMC simulations.
D. Efficiency of the MLIMC model
We show that the high efficiency of the PBML model significantly improves the efficiency of the MLIMC model.
Table I shows the mean CPU time of one Monte Carlo step and the mean absolute percentage errors of the Amber, DelPhi, and PBML predictions of the electrostatic solvation free energies of the 195 proteins. The mean CPU time for each protein includes the computation of the total energy, in which computing the electrostatics is the dominant component.
TABLE I.
PB solver | CPU time/s (h=0.2 Å) | CPU time/s (h=0.5 Å) | PB error/% (h=0.2 Å) | PB error/% (h=0.5 Å) |
---|---|---|---|---|
Amber | 6136 | 1177 | 0.618 | 1.271 |
DelPhi | 1621 | 214 | 0.819 | 1.552 |
PBML(a) | 25 | | 0.484 | |
(a) PBML uses a mesh density of 2; a single value is reported because no grid size h is involved.
Clearly, the machine learning method has the highest accuracy and the lowest CPU time. For the same accuracy level (<1%), the estimated mean CPU time for a 100,000-step Monte Carlo simulation is 6.136×10⁸ s, 1.621×10⁸ s, and 2.5×10⁶ s using Amber, DelPhi, and PBML, respectively. Even with compromised accuracy for DelPhi and Amber at a grid size of 0.5 Å, MLIMC with PBML is about 47 times faster than with Amber and 8 times faster than with DelPhi. Next, we show some MC simulation results using MLIMC on the benzene molecule and the human hyperplastic discs protein (PDB: 1i2t).
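The speed-up factors quoted above follow directly from the CPU times in Table I, as the short calculation below illustrates; the dictionary layout and function name are hypothetical.

```python
# Mean CPU time (s) per Monte Carlo step from Table I, keyed by grid size h (Å).
cpu_time = {
    "Amber": {0.2: 6136, 0.5: 1177},
    "DelPhi": {0.2: 1621, 0.5: 214},
}
pbml_time = 25  # PBML with MSMS mesh density 2


def speedup(solver, h):
    """Ratio of the PB solver CPU time to the PBML CPU time."""
    return cpu_time[solver][h] / pbml_time


amber_coarse = speedup("Amber", 0.5)    # 47.08
delphi_coarse = speedup("DelPhi", 0.5)  # 8.56
```

At the finer grid size of 0.2 Å, where the PB solvers reach sub-1% error, the same ratios are roughly 245 for Amber and 65 for DelPhi.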
E. Test case one: benzene molecule
Our first case is a benzene molecule with some atomic positions randomly perturbed. In detail, we fix three atoms at their equilibrium positions in order to keep the predicted and reference structures in the same plane, and perturb the coordinates of the remaining nine atoms in the (ρ, θ, ϕ) directions by uniformly distributed random numbers in ([0, 10], [0, 2π], [0, π]). The initial RMSD is 6.42 Å relative to the equilibrium positions. We perform an MC simulation on this perturbed molecule to see whether the original equilibrium state can be recovered. FIG. 2(a) shows the total energy and RMSD vs. MC steps, from which we can see that the total energy of benzene in solvent starts at 349123.61 kcal/mol and converges to the range of 5–15 kcal/mol after the first 20,000 MC steps. It stays in this range for the remaining MC steps. The RMSD is initially 6.42 Å and ends around 0.15 Å. It decreases rapidly, along with the total energy, during the first 20,000 steps. After 20,000 steps, the total energy converges with only slight oscillation, while the RMSD keeps decreasing until it reaches around 0.15 Å after about 70,000 MC steps.
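The perturbation of the initial structure can be sketched as below. The sketch assumes that (ρ, θ, ϕ) is a spherical displacement added to each equilibrium position, with θ the azimuthal angle and ϕ the polar angle (as suggested by their ranges) and ρ in Å; the function names and the choice of fixed atom indices are hypothetical.

```python
import math
import random


def random_displacement(rng, rho_max=10.0):
    """Draw (rho, theta, phi) uniformly and convert to a Cartesian offset."""
    rho = rng.uniform(0.0, rho_max)
    theta = rng.uniform(0.0, 2.0 * math.pi)  # azimuthal angle (assumed)
    phi = rng.uniform(0.0, math.pi)          # polar angle (assumed)
    return (rho * math.sin(phi) * math.cos(theta),
            rho * math.sin(phi) * math.sin(theta),
            rho * math.cos(phi))


def perturb(coords, fixed, rng, rho_max=10.0):
    """Displace every atom whose index is not in `fixed`."""
    out = []
    for i, (x, y, z) in enumerate(coords):
        if i in fixed:
            out.append((x, y, z))
        else:
            dx, dy, dz = random_displacement(rng, rho_max)
            out.append((x + dx, y + dy, z + dz))
    return out
```

For benzene, `coords` would hold the 12 equilibrium atom positions and `fixed` the indices of the three atoms kept in place.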
FIG. 2(b) shows errors and RMSD versus MC steps. Here we set ESS in Eq.(22) to be 10.60 kcal/mol as the steady state energy for reference. The plot shows that the errors of the total energy are very small for our MC simulation after 10,000 iterations. When the simulated structure is close to its equilibrium state, the RMSD is smaller than 1 Å and the errors stay between 1% and 100%. Note that since the total energy is a small number, a tiny perturbation causes a large change in the relative error.
Qualitatively, FIG. 3(a) shows the benzene molecule with its initial perturbed structure in blue and the equilibrium structure in green. After the MC simulation, we obtain the predicted structure in red, which is compared with the steady state structure in green in FIG. 3(b). The total CPU time for 100,000 Monte Carlo steps is 643 s.
F. Test case two: protein (PDB: 1i2t)
The second MC test is on the human hyperplastic discs protein (PDB: 1i2t) with 61 residues. We first stretch the last two residues of the original protein by steered molecular dynamics. As a result, the stretched molecule has an initial RMSD of 8.14 Å. We apply our MLIMC for 100,000 steps, which takes 16,684 s in CPU time. FIG. 4(a) shows that the total energy, initially 7260.90 kcal/mol, decays rapidly within the first 5000 Monte Carlo steps and then oscillates around −2070.00 kcal/mol. In the same plot, we can see that the RMSD drops quickly in the first 10,000 MC steps, then decays slowly with fluctuations for the next 40,000 MC steps, decays steadily after 45,000 MC steps, and finally oscillates slightly around 2.2 Å after 60,000 MC steps. For the energy errors shown in FIG. 4(b), relative to the total energy at equilibrium of −2068.13 kcal/mol, the errors decay rapidly in the first 20,000 MC steps and then oscillate within 10%.
Similar to the benzene case, FIG. 5(a) qualitatively shows the perturbed structure in blue against the steady state structure in green for the first two residues, and FIG. 5(b) shows the MLIMC structural prediction in red after the MC simulation, which is very close to the steady state structure in green.
IV. CONCLUSION
Monte Carlo simulations are widely used in science and engineering for molecular structure optimization and prediction. In many situations, particularly in biomolecular systems, the solute molecule is immersed in a water solvent and full-scale explicit-solvent Monte Carlo simulations are very expensive. Alternatively, implicit-solvent Monte Carlo methods using either the Poisson-Boltzmann (PB) model or the generalized Born (GB) model for computing electrostatics can greatly reduce the number of degrees of freedom. However, the reduced accuracy of the GB model and the efficiency concerns of the PB model hinder the wide application of implicit-solvent Monte Carlo simulation. In this work, we introduce a machine learning-based implicit-solvent Monte Carlo (MLIMC) method for molecular structure optimization and prediction. A vital component of our MLIMC is the newly developed Poisson-Boltzmann based machine learning (PBML) model, which maintains PB accuracy at GB cost. We validate the proposed MLIMC method by simulating two molecular systems, a randomly perturbed benzene structure and a protein (PDB: 1i2t) structure modified by steered molecular dynamics. Numerical experiments demonstrate that the proposed MLIMC is efficient in predicting molecular structures at equilibrium. In a comparative analysis, we show that the MLIMC model has a great advantage in CPU time and accuracy over DelPhi and Amber PB based Monte Carlo methods. We believe this innovative PBML method can also substantially change the current status of PB based molecular simulation involving molecular dynamics [66] and Monte Carlo. MLIMC provides accurate electrostatic solvation energies at each configuration of the target protein and thus can be helpful in searching for intermediate or final protein folding states using MC based simulation.
The resulting machine learning-based implicit molecular dynamics (MLIMD), together with the present MLIMC model, will have a vast variety of applications in molecular science, including drug design.
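To make the role of the energy model in the sampling loop concrete, the following is a minimal sketch of a Metropolis Monte Carlo search with a pluggable energy function. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the move set (a uniform displacement of one coordinate), the parameter values, and the quadratic `toy_energy` are all hypothetical. In an MLIMC-style simulation, the energy callback would be the place where a PBML-predicted electrostatic solvation free energy is added to the remaining energy terms.

```python
import math
import random


def metropolis_search(coords, energy_fn, n_steps=5000, step_size=0.2, kT=0.1, seed=0):
    """Metropolis MC over a flat list of coordinates.

    Returns the lowest-energy configuration visited and its energy.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    current = list(coords)
    e_current = energy_fn(current)
    best, e_best = list(current), e_current

    for _ in range(n_steps):
        # Propose a trial move: displace one randomly chosen coordinate.
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-step_size, step_size)

        # One energy evaluation per step; this is the expensive call
        # that a fast surrogate model is meant to replace.
        e_trial = energy_fn(trial)
        delta = e_trial - e_current

        # Metropolis criterion: always accept downhill moves,
        # accept uphill moves with probability exp(-delta / kT).
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current

    return best, e_best


def toy_energy(x):
    # Hypothetical stand-in for the total energy of a configuration;
    # a quadratic well with its minimum at x_i = 1 for every coordinate.
    return sum((xi - 1.0) ** 2 for xi in x)


start = [3.0, -2.0, 0.5]
best, e_best = metropolis_search(start, toy_energy)
print("lowest energy found:", e_best)
```

Because the energy function is called once per proposed move, the total cost of the search scales with the cost of a single energy evaluation, which is why the speed of the solvation energy model matters so much in this setting.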
V. ACKNOWLEDGMENT
This work was supported in part by NIH grant GM126189, NSF grants DMS-2052983, DMS-1761320, and IIS-1900473, NASA grant 80NSSC21M0023, Michigan Economic Development Corporation, MSU Foundation, Bristol-Myers Squibb 65109, and Pfizer. The work of WG was supported in part by NSF grants DMS-1819193 and DMS-2110922.
Footnotes
Part of Special Issue “John Z.H. Zhang Festschrift for celebrating his 60th birthday”.
References
- [1] Wei GW, Nature Machine Intelligence 1, 336 (2019).
- [2] Alder BJ and Wainwright TE, J. Chem. Phys. 31, 459 (1959).
- [3] Karplus M and Kuriyan J, Proc. Natl. Acad. Sci. USA 102, 6679 (2005).
- [4] Rahman A, Phys. Rev. 136, A405 (1964).
- [5] Scheraga HA, Khalili M, and Liwo A, Annu. Rev. Phys. Chem. 58, 57 (2007).
- [6] Chen D, Chen Z, Chen C, Geng WH, and Wei GW, J. Comput. Chem. 32, 657 (2011).
- [7] Geng WH and Krasny R, J. Comput. Phys. 247, 62 (2013).
- [8] Sagui C and Darden TA, Annu. Rev. Biophys. Biomol. Struct. 28, 155 (1999).
- [9] Sutmann G, Gibbon P, and Lippert T, Forschungszentrum Jülich (2011).
- [10] Frenkel D, NIC Series 23, 29 (2004).
- [11] Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, and Teller E, J. Chem. Phys. 21, 1087 (1953).
- [12] Davis ME and McCammon JA, Chem. Rev. 94, 509 (1990).
- [13] Fogolari F, Brigo A, and Molinari H, J. Mol. Recogn. 15, 377 (2002).
- [14] Cossi M, Barone V, Cammi R, and Tomasi J, Chem. Phys. Lett. 255, 327 (1996).
- [15] Tomasi J, Mennucci B, and Cammi R, Chem. Rev. 105, 2999 (2005).
- [16] Dominy BN and Brooks III CL, J. Phys. Chem. B 103, 3765 (1999).
- [17] Mongan J, Simmerling C, McCammon JA, Case DA, and Onufriev A, J. Chem. Theory Comput. 3, 159 (2007).
- [18] Onufriev A, Case DA, and Bashford D, J. Comput. Chem. 23, 1297 (2002).
- [19] Tjong H and Zhou HX, J. Chem. Phys. 126, 195102 (2007).
- [20] Beglov D and Roux B, J. Chem. Phys. 104, 8678 (1996).
- [21] Onufriev A, Bashford D, and Case DA, J. Phys. Chem. B 104, 3712 (2000).
- [22] Nguyen DD, Wang B, and Wei GW, J. Comput. Chem. 38, 94 (2017).
- [23] Jurrus E, Engel D, Star K, Monson K, Brandi J, Felberg L, Brookes D, Wilson L, Chen J, Liles K, Chun M, Li P, Gohara D, Dolinsky T, Konecny R, Koes D, Nielsen J, Head-Gordon T, Geng W, Krasny R, Wei GW, Holst M, McCammon J, and Baker N, Protein Sci. 27 (2018).
- [24] Wang E, Sun H, Wang J, Wang Z, Liu H, Zhang JZ, and Hou T, Chem. Rev. 119, 9478 (2019).
- [25] Jo S, Vargyas M, Vasko-Szedlar J, Roux B, and Im W, Nucleic Acids Res. 36, W270 (2008).
- [26] Baker NA, Sept D, Holst MJ, and McCammon JA, IBM J. Res. Develop. 45, 427 (2001).
- [27] Lu B, Cheng X, Huang J, and McCammon JA, Comput. Phys. Commun. 184, 2618 (2013).
- [28] Wang J, Tan CH, Tan YH, Lu Q, and Luo R, Commun. Comput. Phys. 3, 1010 (2008).
- [29] Li L, Li C, Sarkar S, Zhang J, Witham S, Zhang Z, Wang L, Smith N, Petukh M, and Alexov E, BMC Biophys. 5, 9 (2012).
- [30] Geng W, Yu S, and Wei GW, J. Chem. Phys. 127, 114106 (2007).
- [31] Geng W and Zhao S, J. Comput. Phys. 351, 25 (2017).
- [32] Nguyen DD, Wang B, and Wei GW, J. Comput. Chem. 38, 941 (2017).
- [33] Zhou YC, Feig M, and Wei GW, J. Comput. Chem. 29, 87 (2008).
- [34] Forouzesh N, Izadi S, and Onufriev AV, J. Chem. Inform. Model. 57, 2505 (2017).
- [35] Liu B, Wang B, Zhao R, Tong Y, and Wei GW, Eses: Software for Eulerian Solvent Excluded Surface (2017).
- [36] Decherchi S and Rocchia W, PLoS ONE 8, e59744 (2013).
- [37] Sanner MF, Olson AJ, and Spehner J-C, Biopolymers 38, 305 (1996).
- [38] Hughes TB, Miller GP, and Swamidass SJ (2015).
- [39] Lusci A, Pollastri G, and Baldi P, J. Chem. Inform. Model. 53, 1563 (2013).
- [40] Nguyen DD, Cang Z, Wu K, Wang M, Cao Y, and Wei GW, J. Computer-Aided Mol. Design 33, 71 (2019).
- [41] Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, et al., J. Royal Soc. Interface 15, 20170387 (2018).
- [42] Cang ZX, Mu L, and Wei GW, PLoS Comput. Biol. 14, e1005929 (2018).
- [43] Cang ZX and Wei GW, Int. J. Numer. Meth. Biomed. Eng. 34 (2018), DOI: 10.1002/cnm.2914.
- [44] Jiménez J, Skalic M, Martínez-Rosell G, and De Fabritiis G, J. Chem. Inform. Model. 58, 287 (2018).
- [45] Karimi M, Wu D, Wang Z, and Shen Y, arXiv:1806.07537 (2018).
- [46] Korotcov A, Tkachenko V, Russo DP, and Ekins S, Mol. Pharm. 14, 4462 (2017).
- [47] Nguyen DD, Xiao T, Wang ML, and Wei GW, J. Chem. Inform. Model. 57, 1715 (2017).
- [48] Wang C and Zhang Y, J. Comput. Chem. 38, 169 (2017).
- [49] Wu K and Wei GW, J. Chem. Inform. Model. 58, 520 (2018).
- [50] Wu K, Zhao Z, Wang R, and Wei GW, J. Comput. Chem. 39, 1444 (2018).
- [51] Wang J, Cao H, Zhang JZ, and Qi Y, Sci. Reports 8, 1 (2018).
- [52] Cang Z and Wei GW, Bioinformatics 33, 3549 (2017).
- [53] Cao H, Wang J, He L, Qi Y, and Zhang JZ, J. Chem. Inform. Model. 59, 1508 (2019).
- [54] Chen J, Xu X, Liu S, and Zhang DH, Phys. Chem. Chem. Phys. 20, 9090 (2018).
- [55] Wang Z, Han Y, Li J, and He X, J. Phys. Chem. B 124, 3027 (2020).
- [56] Chen J, Xu Y, Cang Z, Geng W, and Wei GW, Preprint (2021).
- [57] Deo RC, Sonenberg N, and Burley SK, Proc. Natl. Acad. Sci. USA 98, 4414 (2001).
- [58] Case DA, Cheatham TE, Darden T, Gohlke H, Luo R, Merz KM, Onufriev A, Simmerling C, Wang B, and Woods RJ, J. Comput. Chem. 26, 1668 (2005).
- [59] Lindorff-Larsen K, Piana S, Palmo K, Maragakis P, Klepeis JL, Dror RO, and Shaw DE, Proteins 78, 1950 (2010).
- [60] Roux B and Simonson T, Biophys. Chem. 28, 155 (1999).
- [61] MSMS, https://mgl.scripps.edu/people/sanner/html/msms_home.html.
- [62] Bramer D and Wei GW, J. Chem. Phys. 140, 054103 (2018).
- [63] Liu Z, Li Y, Han L, Liu J, Zhao Z, Nie W, Liu Y, and Wang R, Bioinformatics 31, 405 (2015).
- [64] Schrödinger LLC, Schrödinger Release 2015-2, New York (2015).
- [65] Humphrey W, Dalke A, and Schulten K, J. Mol. Graphics 14, 33 (1996).
- [66] Geng W and Wei GW, J. Comput. Phys. 230, 435 (2011).