Abstract
Electrostatics is of paramount importance to chemistry, physics, biology, and medicine. The Poisson-Boltzmann (PB) theory is a primary model for electrostatic analysis. However, it is highly challenging to compute accurate PB electrostatic solvation free energies for macromolecules due to the nonlinearity, dielectric jumps, charge singularity, and geometric complexity associated with the PB equation. The present work introduces a PB-based machine learning (PBML) model for biomolecular electrostatic analysis. Trained with the second-order accurate MIBPB solver, the proposed PBML model is found to be more accurate and faster than several eminent PB solvers in electrostatic analysis. The proposed PBML model can provide highly accurate PB electrostatic solvation free energy of new biomolecules or new conformations generated by molecular dynamics with much reduced computational cost.
Significance
This manuscript provides a Poisson-Boltzmann-based machine learning (PBML) model for biomolecular electrostatic analysis. The features as the input to the ML models are generated with mathematical algorithms using biomolecular structures and force fields. The learned model, which is trained using the most accurate PB solver, MIBPB, on more than 4000 biomolecules, shows improved efficiency and accuracy in electrostatic analysis compared with the popular PB solvers.
Introduction
Electrostatics is ubiquitous in the molecular world. The analysis of molecular electrostatics is of crucial importance to the bioscience research community. There are two significant types of electrostatic analyses, namely qualitative analysis for general electrostatic characteristics, such as visualization and electrostatic steering, and quantitative analysis for statistical, thermodynamic, and/or kinetic observables, such as solvation free energy, solubility, and partition coefficient.
Molecular electrostatics can be analyzed by explicit or implicit models. Explicit solvent models are more accurate because they treat water molecules individually, but they are expensive for large biomolecular systems. Implicit solvent models largely reduce the computational cost by describing the solvent as a dielectric continuum, while the solute molecule, which is usually the focus, is still modeled with an atomistic description (1). A wide variety of two-scale implicit solvent models have been developed for electrostatic analysis, including generalized Born (GB) (2), polarizable continuum (3), and Poisson-Boltzmann (PB) models (4).
PB models have been applied to calculate protein titration states (5), protein-protein and protein-ligand binding energetics (6), RNA nucleotide protonation (7), chromatin packing (8), etc. The PB theory has also been used for the evaluation of biomolecular electrostatic forces for molecular Langevin dynamics or Brownian dynamics (9). GB methods are faster than PB methods but provide only heuristic estimates for PB electrostatic energies.
Due to its success in describing biomolecular systems, the PB model has attracted wide attention in both mathematical and biophysical communities. In the past two decades, much effort has been devoted to the development of accurate, efficient, reliable, and robust PB solvers. A large number of methods have been proposed in the literature, including the finite difference method (10), the finite element method (11), and the boundary element method (12,13). Among them, the finite difference method is the most widely used in the field due to its simplicity in implementation. Representative solvers in this category are Amber PBSA (14), DelPhi (15), APBS (16), MIBPB (17), CHARMM PBEQ (10), etc.
The PB model is a nonlinear elliptic interface problem with discontinuous coefficients, singular source terms, and a solute-solvent interface with geometric complexity. Finding numerical solutions to the PB equation for biomolecules is challenging. Utilizing the matched interface and boundary (MIB) method, which treats interface conditions on complex surfaces accurately (18), a second-order accurate PB solver, the MIBPB, was constructed to address the aforementioned numerical difficulties (17). The recent development of the Eulerian solvent-excluded surface (19), which provides analytical biomolecular surface representation in the Cartesian domain, improves the stability and robustness of the MIBPB solver.
Nonetheless, the generation of highly accurate electrostatic potentials for large biomolecules can be extremely expensive. For example, it takes days to solve the PB model on a protein with about 50,000 atoms at the mesh of 0.2 Å on a single CPU. Additionally, the information generated for the electrostatic analysis of a given biomolecule is not transferable to other proteins. Therefore, one has to carry out separate electrostatic analyses for different proteins or for the same protein with different protonation states or conformations. These issues call for innovative approaches, such as machine learning (ML) and dynamic programming, to biomolecular electrostatic analysis.
Recently, we have witnessed the explosion of ML studies in science and engineering. In particular, deep neural networks (DNNs), which discover intricate structures in large data sets, have fueled the rapid growth in applications such as computer vision, natural language processing, speech recognition, handwriting recognition (20,21,22), etc. ML has become an indispensable tool in the analysis and prediction of large and diverse molecular and biomolecular data sets, including bioactivity of small molecular drugs (23) and genomics (24). Studies in computational biology and biophysics, such as the predictions of solvation free energies, protein-ligand binding affinities, mutation impacts, toxicity, partition coefficients, B factors, etc., adopt ML approaches (25,26,27,28). These developments open the door for ML-based electrostatic analysis.
The objective of the present work is to develop an ML solution of the PB equation for the electrostatic analysis of biomolecules. To this end, we first construct an accurate and efficient mathematical representation of electrostatic potential to effectively characterize its probability distribution in the space of protein structures and electric charges. Theoretically, the exact form of this distribution is not available even if one solves the PB equation for all possible biomolecular structures. However, in practice, this probability distribution can be sampled by using a PB solver, which provides ML training labels. Our approach is based on a representability hypothesis and a learning hypothesis. The representability hypothesis states that the electrostatic potential of a biomolecule can be described by a set of partial charges and their geometric relations to the solvent. This hypothesis guides the construction of the feature vector for the characterization of the probability distribution of biomolecular electrostatics. The learning hypothesis states that biomolecular electrostatics can be effectively represented by a feature vector as described by the representability hypothesis. When the probability distribution of biomolecular electrostatics is sufficiently sampled from a training set, an ML model can be established based on training labels and associated feature vectors to accurately predict the electrostatic potential of an unseen data set, which shares the same probability distribution with the training set.
The protocol described above calls for an accurate PB solver, which calculates ML labels and thus the probability distribution of molecular and biomolecular electrostatics. To this end, we apply the accurate MIBPB solver (29) at a refined mesh size of 0.2 Å to generate solvation energy labels to minimize the numerical errors.
The representability hypothesis does not specify how to construct an accurate and efficient representation. An average biomolecule in the human body consists of about 6000 atoms that lie in an 18,000-dimensional Euclidean space ($\mathbb{R}^{18000}$). Such a high dimensionality makes the first-principle calculations intractable. Additionally, using macromolecular structures in deep convolutional neural networks is extremely expensive. For example, the 3D coordinate representation of a biomolecule with about 50 Å side length at a low resolution of 0.5 Å requires a feature dimension of $100^3 n$, where n is the number of element types. The variable sizes of biomolecules also hinder the application of ML algorithms. These challenges motivate the development of scalable and intrinsically low-dimensional representations of biomolecular structures. Our hypothesis is that intrinsic physics lies in low-dimensional manifolds or spaces embedded in a high-dimensional data space (26). Recently, a few low-dimensional representations for biomolecules have been developed in terms of algebraic topology (26,30), differential geometry (31), and graph theory (32). All of these approaches can be used to represent biomolecular electrostatics. In this work, we adopt the graph theory representation due to its simplicity, in conjunction with a collection of features from the fast GB models to predict electrostatics from the PB model.
The rest of this paper is organized as follows. After this introduction section, the materials and methods section describes models and algorithms used in the present work. We give a brief review of the PB model, the GB model, the graph theory, and ML algorithms used in developing the proposed PB-based ML (PBML) model for biomolecular electrostatic analysis. Simulation results and related discussion are presented in the results and discussion section. Various convergence tests have been carried out to search for the most accurate PB solver to calculate ML labels. Our feature vectors are optimized using both simple ML algorithms, such as linear regression (LR), random forest (RF), and gradient-boosted decision trees, and more complicated ML algorithms such as DNNs. We demonstrate that the proposed PBML model is more accurate and reliable than commonly used PB solvers. This paper ends with a conclusion section.
Materials and methods
In this section, we briefly review essential concepts and methods underpinning the proposed PBML model.
The PB model
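As shown in Fig. 1 a, the PB model governs electrostatics with the interior solute domain $\Omega_1$ with fixed charges $q_k$ located at atomic centers $\mathbf{r}_k$ for $k = 1, \ldots, N_c$ and the exterior solvent domain $\Omega_2$ with dissolved ions approximated by the Boltzmann distribution. These two domains are separated by the dielectric interface $\Gamma$. Among a variety of surface models, the most commonly used one is the solvent-excluded surface or molecular surface (33). For simplicity, a linearized PB model is considered in the present work as
$$-\nabla \cdot \left(\epsilon(\mathbf{r}) \nabla \phi(\mathbf{r})\right) + \bar{\kappa}^2(\mathbf{r})\, \phi(\mathbf{r}) = \sum_{k=1}^{N_c} q_k\, \delta(\mathbf{r} - \mathbf{r}_k), \tag{1}$$
where $\phi$ is the electrostatic potential and $\epsilon$ are the dielectric constants given by
$$\epsilon(\mathbf{r}) = \begin{cases} \epsilon_1, & \mathbf{r} \in \Omega_1, \\ \epsilon_2, & \mathbf{r} \in \Omega_2, \end{cases} \tag{2}$$
and $\bar{\kappa}$ is the screening parameter with the relation $\bar{\kappa}^2 = \epsilon_2 \kappa^2$, where κ is the inverse Debye length measuring the ionic effective length. The PB model has the interface conditions on the molecular surface defined as
$$\phi_1(\mathbf{r}) = \phi_2(\mathbf{r}), \qquad \epsilon_1 \frac{\partial \phi_1(\mathbf{r})}{\partial \mathbf{n}} = \epsilon_2 \frac{\partial \phi_2(\mathbf{r})}{\partial \mathbf{n}}, \qquad \mathbf{r} \in \Gamma, \tag{3}$$
where $\phi_1$ and $\phi_2$ are the limit values when approaching the interface from inside or outside the solute domain and $\mathbf{n}$ is the outward unit normal vector on $\Gamma$. The lack of appropriate or rigorous treatments of these interface conditions is the major error source for many existing PB solvers. The far-field boundary condition for the PB model is $\lim_{|\mathbf{r}| \to \infty} \phi(\mathbf{r}) = 0$, which is approximated using the screened Coulombic potential.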
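The screened Coulombic approximation of the far-field condition can be illustrated with a minimal Python sketch. The function name, the unit-conversion constant (kcal·Å/(mol·e²)), and the default solvent dielectric constant are assumptions made for illustration only; they are not taken from any of the PB solvers discussed here.

```python
import numpy as np

# Assumed Coulomb constant in kcal*Angstrom/(mol*e^2); illustrative only.
COULOMB_KCAL = 332.0636

def screened_coulomb_potential(points, centers, charges, eps_solvent=80.0, kappa=0.0):
    """Screened Coulomb (Debye-Hueckel) potential at boundary points.

    points  : (P, 3) positions where the boundary value is needed
    centers : (N, 3) atomic centers carrying the fixed charges
    charges : (N,) partial charges in units of e
    kappa   : inverse Debye length in 1/Angstrom (0 means no ionic screening)
    """
    points = np.atleast_2d(np.asarray(points, dtype=float))
    centers = np.atleast_2d(np.asarray(centers, dtype=float))
    charges = np.asarray(charges, dtype=float)
    # Pairwise distances between every boundary point and every charge.
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    terms = charges * np.exp(-kappa * dist) / (eps_solvent * dist)
    return COULOMB_KCAL * terms.sum(axis=1)
```

Evaluating this function on the faces of the computational box gives Dirichlet boundary values that decay to zero with distance, which is how a finite box can stand in for the condition at infinity.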
Figure 1.
An illustration of the PB model and the GB model. (a) The PB model with solute region $\Omega_1$ and solvent region $\Omega_2$, separated by molecular surface $\Gamma$. (b) The GB model represented by spherical cavities with Born radii $R_i$ and centered charges $q_i$ (one is shown here). To see this figure in color, go online.
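The PB electrostatic solvation free energy is obtained by
$$\Delta G_{\mathrm{PB}} = \frac{1}{2} \sum_{k=1}^{N_c} q_k \left(\phi(\mathbf{r}_k) - \phi_0(\mathbf{r}_k)\right), \tag{4}$$
where $\phi_0$ is the solution of the PB equation as if there were no solvent-solute interface. Note that near the interface $\Gamma$, the interpolation of $\phi(\mathbf{r}_k)$ in Eq. 4 using φ at the grid points can be another major error source. An interface-based scheme like MIB is required to interpolate $\phi(\mathbf{r}_k)$ (34).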
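As a worked check of the energy formula, the classical Born ion (a single charge at the center of a spherical cavity, with no ionic screening) has a closed-form reaction potential, so the half-sum over charges can be verified by hand. The sketch below is illustrative only: the function names, the unit constant, and the default dielectric constants are assumptions, not values taken from this work.

```python
import numpy as np

# Assumed Coulomb constant in kcal*Angstrom/(mol*e^2); illustrative only.
COULOMB_KCAL = 332.0636

def solvation_energy(charges, reaction_potential):
    """Half the charge-weighted sum of the reaction potential (phi - phi_0) at the charges."""
    charges = np.asarray(charges, dtype=float)
    reaction_potential = np.asarray(reaction_potential, dtype=float)
    return 0.5 * float(np.dot(charges, reaction_potential))

def born_reaction_potential(q, radius, eps_in=1.0, eps_out=80.0):
    """Analytic reaction potential at the center of a Born ion (zero ionic strength)."""
    return COULOMB_KCAL * q * (1.0 / eps_out - 1.0 / eps_in) / radius

# Example: a unit charge in a 2 Angstrom cavity.
dg_born = solvation_energy([1.0], [born_reaction_potential(1.0, 2.0)])
```

The result is negative, as expected when a charge is transferred from a low-dielectric to a high-dielectric environment. For a real biomolecule the reaction potential at each charge would come from the numerical PB solution rather than from a closed form.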
The GB model
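With the same mathematical setting as that for the PB model, the GB model is devised to approximate the PB model. Compared to the PB model, the GB model offers a relatively simple and computationally more efficient approach to compute the long-range electrostatic interactions in biomolecules, which is the bottleneck in classical all-atom simulations. As illustrated in Fig. 1 b, the GB approximation of electrostatic solvation free energy can be expressed as the superposition of spherical cavities with effective Born radii $R_i$ and centered charges $q_i$ (only one is shown in the figure) (35):
$$\Delta G_{\mathrm{GB}} \approx -\frac{1}{2} \left(\frac{1}{\epsilon_1} - \frac{1}{\epsilon_2}\right) \frac{1}{1 + \alpha\beta} \sum_{i,j} q_i q_j \left(\frac{1}{f_{ij}} + \frac{\alpha\beta}{A}\right), \qquad f_{ij} = \sqrt{r_{ij}^2 + R_i R_j \exp\left(-\frac{r_{ij}^2}{4 R_i R_j}\right)}, \tag{5}$$
where $r_{ij}$ is the distance between atoms i and j, $\beta = \epsilon_1/\epsilon_2$, $\alpha = 0.571412$, and A is the electrostatic size of the molecule, using the reciprocal of the Born radius $R_i^{-1}$,
$$R_i^{-1} = \left(\frac{1}{4\pi} \oint_{\Gamma} \frac{(\mathbf{r} - \mathbf{r}_i) \cdot \mathbf{n}}{|\mathbf{r} - \mathbf{r}_i|^6}\, dS\right)^{1/3}. \tag{6}$$
To carry out the boundary integral in evaluating $R_i^{-1}$, the MSMS package (36) is used for the triangulation of $\Gamma$. Note that this integration is the most time-consuming step in the GB calculation. The Eulerian solvent-excluded surface software (19) can be used to improve the current GB model if a higher level of accuracy is desired.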
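Once the effective Born radii are available, the pairwise GB sum is inexpensive to evaluate. The following Python sketch assumes the widely used Still-type pairwise function and, for brevity, omits the correction term involving the electrostatic size A. The function name, the unit constant, and the default dielectric constants are illustrative assumptions rather than the implementation used in this work.

```python
import numpy as np

# Assumed Coulomb constant in kcal*Angstrom/(mol*e^2); illustrative only.
COULOMB_KCAL = 332.0636

def gb_energy(coords, charges, born_radii, eps_in=1.0, eps_out=80.0):
    """Canonical GB solvation energy with a Still-type pairwise function.

    Assumption: no electrostatic-size (A-dependent) correction is applied.
    """
    coords = np.asarray(coords, dtype=float)
    q = np.asarray(charges, dtype=float)
    radii = np.asarray(born_radii, dtype=float)
    # Squared pairwise distances, including the i == j self terms (r_ii = 0).
    r2 = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    rr = radii[:, None] * radii[None, :]
    f = np.sqrt(r2 + rr * np.exp(-r2 / (4.0 * rr)))
    prefactor = -0.5 * COULOMB_KCAL * (1.0 / eps_in - 1.0 / eps_out)
    return float(prefactor * np.sum(q[:, None] * q[None, :] / f))
```

For a single atom the double sum reduces to the Born formula, which gives a quick sanity check on units and sign.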
The graph theory representation
Graph theory is a prime subject of discrete mathematics and concerns graphs as mathematical structures for modeling pairwise relations between vertices, nodes, or points. Such pairwise relations define graph edges. Algebraic graph theory, particularly spectral graph theory, studies algebraic connectivity, characteristic polynomial, and eigenvalues and eigenvectors of matrices associated with the graph, such as the adjacency matrix or the Laplacian matrix. Graphs have been widely used in chemistry and biomolecular modeling (37). However, the diagonalization of the interaction Laplacian matrix has a computational complexity of $O(N^3)$ for an $N \times N$ matrix. Alternatively, geometric graph theory bypasses the time-consuming matrix diagonalization and can be made considerably cheaper in computational complexity (32).
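In conjunction with ML algorithms, the multiscale weighted colored subgraph (MWCS) was found to outperform many other methods in representing complex biomolecular structures (38,39). We first consider the WCS to describe electrostatic interactions in a protein of N atoms. It incorporates kernels to characterize pairwise distance-weighted atomic correlations. All interactions are classified according to element types, leading to colored subgraphs. To use WCS for analyzing protein electrostatic interactions, we formulate all the atoms and their pairwise interactions into a weighted graph $G(V, E)$ with vertices V and edges E. As such, the ith atom is labeled by both its position $\mathbf{r}_i$ and element type $\alpha_i$. Therefore, we express vertices V as
$$V = \{(\mathbf{r}_i, \alpha_i) \mid \mathbf{r}_i \in \mathbb{R}^3;\ \alpha_i \in \mathcal{C};\ i = 1, 2, \ldots, N\}, \tag{7}$$
where $\mathcal{C} = \{\mathrm{C}, \mathrm{N}, \mathrm{O}, \mathrm{S}, \mathrm{H}\}$ contains all the commonly occurring element types in a protein. For different biomolecular systems, $\mathcal{C}$ needs to be modified accordingly. To describe pairwise interactions between atoms in a protein, we define a colored set $\mathcal{P} = \{\alpha\beta\}$ with $\alpha, \beta \in \mathcal{C}$. For each subset of element pairs $\mathcal{P}_k$, $k = 1, 2, \ldots, 15$, a set of involved vertices $V_{\mathcal{P}_k}$ is a subset of V containing all atoms that belong to the pair in $\mathcal{P}_k$. For example, a partition $\mathcal{P}_k = \{\mathrm{CN}\}$ contains all pairs of atoms in the protein with one atom being a carbon and another atom being a nitrogen. Based on this setting, all the edges in such a WCS describing pairwise atomic interactions are defined by
$$E_{\mathcal{P}_k}^{\sigma,\tau,\zeta} = \left\{\Phi_{\tau,\zeta}^{\sigma}(\|\mathbf{r}_i - \mathbf{r}_j\|) \mid \alpha_i \alpha_j \in \mathcal{P}_k;\ i, j = 1, 2, \ldots, N\right\}, \tag{8}$$
where $\|\mathbf{r}_i - \mathbf{r}_j\|$ defines a Euclidean distance between the ith and jth atoms, σ indicates the type of radial basis functions (e.g., $\sigma = L$ for Lorentz kernel, $\sigma = E$ for exponential kernel), τ is a scale distance factor between two atoms, and ζ is a parameter of power in the kernel (i.e., $\zeta = \kappa$ when $\sigma = E$, $\zeta = \nu$ when $\sigma = L$). The kernel $\Phi_{\tau,\zeta}^{\sigma}$ characterizes a pairwise correlation satisfying the following conditions
$$\Phi_{\tau,\zeta}^{\sigma}(\|\mathbf{r}_i - \mathbf{r}_j\|) \to 1 \ \text{as}\ \|\mathbf{r}_i - \mathbf{r}_j\| \to 0, \qquad \Phi_{\tau,\zeta}^{\sigma}(\|\mathbf{r}_i - \mathbf{r}_j\|) \to 0 \ \text{as}\ \|\mathbf{r}_i - \mathbf{r}_j\| \to \infty. \tag{9}$$
Commonly used radial basis functions include generalized exponential functions
$$\Phi_{\tau,\kappa}^{E}(\|\mathbf{r}_i - \mathbf{r}_j\|) = \exp\left(-\left(\frac{\|\mathbf{r}_i - \mathbf{r}_j\|}{\tau (r_i + r_j)}\right)^{\kappa}\right), \quad \kappa > 0, \tag{10}$$
and generalized Lorentz functions
$$\Phi_{\tau,\nu}^{L}(\|\mathbf{r}_i - \mathbf{r}_j\|) = \frac{1}{1 + \left(\dfrac{\|\mathbf{r}_i - \mathbf{r}_j\|}{\tau (r_i + r_j)}\right)^{\nu}}, \quad \nu > 0, \tag{11}$$
where $r_i$ and $r_j$ are, respectively, the van der Waals radii of the ith and jth atoms.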
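The two kernel families are short enough to express directly. The sketch below is a minimal Python illustration with hypothetical function names; the argument `eta` stands for the characteristic length, i.e., the scale factor τ multiplied by the sum of the two van der Waals radii.

```python
import numpy as np

def exponential_kernel(distance, eta, kappa):
    """Generalized exponential kernel: exp(-(d / eta) ** kappa), with kappa > 0."""
    distance = np.asarray(distance, dtype=float)
    return np.exp(-((distance / eta) ** kappa))

def lorentz_kernel(distance, eta, nu):
    """Generalized Lorentz kernel: 1 / (1 + (d / eta) ** nu), with nu > 0."""
    distance = np.asarray(distance, dtype=float)
    return 1.0 / (1.0 + (distance / eta) ** nu)
```

Both functions equal 1 at zero distance and decay monotonically toward 0 at large distance, which are the admissibility conditions required of the correlation kernel. The exponential kernel decays much faster than the Lorentz kernel, so mixing the two families (or several values of τ) probes different interaction ranges.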
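Centrality is widely used in graph theory or network analysis to describe node importance (40). Specifically, closeness and harmonic centralities are defined as $1 / \sum_j \|\mathbf{r}_i - \mathbf{r}_j\|$ and $\sum_j 1 / \|\mathbf{r}_i - \mathbf{r}_j\|$, respectively. Degree centrality simply counts the number of edges incident on a node. Our atomic centrality for the ith atom can be regarded as an extension of the harmonic formulation
$$\mu_i^{k,\sigma,\tau,\zeta,w} = \sum_{j=1}^{|V_{\mathcal{P}_k}|} w_{ij}\, \Phi_{\tau,\zeta}^{\sigma}(\|\mathbf{r}_i - \mathbf{r}_j\|), \quad \alpha_i \alpha_j \in \mathcal{P}_k, \quad i = 1, 2, \ldots, |V_{\mathcal{P}_k}|, \tag{12}$$
where $w_{ij}$ is a weight function assigned to each atomic pair, with $w_{ij} = 1$ for atomic rigidity or $w_{ij} = q_j$ for atomic charge.
In order to describe a centrality for the whole MWCS $G(V_{\mathcal{P}_k}, E_{\mathcal{P}_k}^{\sigma,\tau,\zeta})$, we take into account a summation of the atomic centralities
$$\mu^{k,\sigma,\tau,\zeta,w} = \sum_{i=1}^{|V_{\mathcal{P}_k}|} \mu_i^{k,\sigma,\tau,\zeta,w}. \tag{13}$$
It is this subgraph centrality that makes partition $\{\mathrm{CN}\}$ equivalent to partition $\{\mathrm{NC}\}$.
Since we have 15 choices of the set of weighted colored edges $\mathcal{P}_k$, we can obtain the corresponding 15 subgraph centralities $\mu^{k,\sigma,\tau,\zeta,w}$. By varying kernel parameters $(\sigma, \tau, \zeta)$, one can achieve multiscale centralities for MWCS (38). For a two-scale WCS, we obtain a total of 60 descriptors for a protein.
Together with vertices V, the collection of all edges $E = \bigcup_k E_{\mathcal{P}_k}^{\sigma,\tau,\zeta}$ defines the weighted graph $G(V, E)$. However, here, $G(V, E)$ has a limited descriptive power in ML prediction. MWCSs and their centralities are used in the present work to describe protein electrostatics.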
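The subgraph centrality for one element pair can be sketched in a few lines of Python. Everything named below is hypothetical and for illustration only: the five-element set is an assumption consistent with the 15 unordered pairs mentioned in the text, the per-atom weight is applied to the second atom of each ordered pair, and the function is not the feature code released with this work.

```python
from itertools import combinations_with_replacement

import numpy as np

# Assumed element types; 5 types give 15 unordered element pairs.
ELEMENTS = ("C", "N", "O", "S", "H")
ELEMENT_PAIRS = list(combinations_with_replacement(ELEMENTS, 2))

def subgraph_centrality(coords, elements, vdw_radii, pair,
                        tau=1.0, power=2.0, kernel="E", weights=None):
    """Sum of kernel-weighted atomic centralities over one colored subgraph.

    kernel  : "E" for the generalized exponential, "L" for the generalized Lorentz
    weights : per-atom weight (None means 1 for rigidity; pass charges for charge)
    """
    coords = np.asarray(coords, dtype=float)
    elements = np.asarray(elements)
    radii = np.asarray(vdw_radii, dtype=float)
    w = np.ones(len(elements)) if weights is None else np.asarray(weights, dtype=float)

    first, second = pair
    is_first, is_second = elements == first, elements == second
    # Ordered atom pairs (i, j) whose element types match the unordered pair.
    mask = (is_first[:, None] & is_second[None, :]) | (is_second[:, None] & is_first[None, :])
    np.fill_diagonal(mask, False)  # an atom does not interact with itself

    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    scaled = (dist / (tau * (radii[:, None] + radii[None, :]))) ** power
    phi = np.exp(-scaled) if kernel == "E" else 1.0 / (1.0 + scaled)
    return float(np.sum(mask * w[None, :] * phi))
```

Looping this function over `ELEMENT_PAIRS`, over the two kernel types or scales, and over the rigidity and charge weights yields one fixed-length descriptor vector per protein regardless of its atom count, which is what makes the representation scalable.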
ML algorithms
General description
In the present work, the prediction of PB electrostatic solvation free energy is formulated as a standard supervised learning problem. The training data set can be expressed as
$$\mathcal{D} = \left\{(\mathbf{x}_i, y_i) \mid \mathbf{x}_i \in \mathbb{R}^n,\ y_i \in \mathbb{R},\ i = 1, 2, \ldots, M\right\},$$
where $\mathbf{x}_i$ is the feature vector for the ith sample in the training set, $y_i$ as a label is the electrostatic solvation free energy of the ith sample, and n and M are the sizes of the feature vector and the training set, respectively. The label $y_i$ will be given by the accurate MIBPB solver, which is justified by a convergence analysis in the results and discussion section. The feature vector $\mathbf{x}_i$ will be generated from the graph theory and the GB model.
A variety of ML algorithms, including LR, RF, gradient boosting decision tree (GBDT), and DNN, can be applied to predict the electrostatic free energy of the PB model. LR is a simple approach designed for the linear approximation of the mapping. RF and GBDT are both decision tree-based ensemble methods. RF builds a large number of uncorrelated trees and utilizes bootstrap aggregating (i.e., bagging). GBDT makes use of gradient descent in conjunction with the boosting procedure, which successively introduces weak learners to compensate for the errors of existing learners. DNN methods become powerful when errors are back-propagated to correct neural weights. However, DNN methods typically involve a large number of weights and thus are subject to overfitting. DNN methods might not offer better predictions unless the size of the training data is sufficiently large.
Feature descriptions
Our ML model currently uses 367 features considering protein structures, force field, graph theory representation, etc. Among these features, 240 are GB-model-related features based on effective Born radii and element-specific relationships, 51 are protein features based on protein structure and charge distribution, 31 are environment features based on properties of residue groups, and finally, 45 are features from graph theory representation. The details of these features can be found in the supporting material. Coding details can be found in the protein.py, feature.py, and training.py files shared on GitHub.
GB-method-based GBDT
The main idea of the GBDT model is to first use the features and the labels to build a decision tree model, which is able to give predicted labels. Then, the residual between the original labels and the predicted labels will be used as a new label, together with the original features, to build another decision tree model, which gives another predicted label. Using the difference between the initial labels and the accumulated predictions as the new label, this procedure can be done recursively, and the final predicted labels will be the summation of all predicted labels. The model is optimized by minimizing the cost function between the initial labels and their predicted values using the gradient descent method.
In our framework of the GB-based GBDT model, our first decision tree is the GB model. The solvation energy obtained from the GB model is treated as the initial prediction of the labels, which are the PB-model-based solvation energies. The remaining trees are built from our MWCS features. The loss function depends on the number of trees, the structure of trees, and MWCS features. In this work, the loss function L will also be optimized with respect to MWCS parameters. The details of this model can be found in the supporting material.
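The residual-fitting loop described above can be sketched with depth-one regression trees (stumps) in plain NumPy. This is a simplified illustration under stated assumptions, not the model used in this work: the function names are hypothetical, the loss is squared error, the weak learners are stumps rather than deeper trees, and the `initial` array plays the role of the GB estimate that seeds the ensemble.

```python
import numpy as np

def fit_stump(x, residual):
    """Least-squares regression stump: returns (feature index, threshold, left value, right value)."""
    mean = residual.mean()
    best = (np.sum((residual - mean) ** 2), 0, np.inf, mean, mean)
    for j in range(x.shape[1]):
        for t in np.unique(x[:, j])[:-1]:  # keep at least one sample on the right
            left = x[:, j] <= t
            lv, rv = residual[left].mean(), residual[~left].mean()
            err = np.sum((residual - np.where(left, lv, rv)) ** 2)
            if err < best[0]:
                best = (err, j, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    j, t, lv, rv = stump
    return np.where(x[:, j] <= t, lv, rv)

def boost_from_initial(x, y, initial, n_trees=50, learning_rate=0.1):
    """Gradient boosting for squared loss, seeded with an initial estimate."""
    pred = np.array(initial, dtype=float)
    stumps = []
    for _ in range(n_trees):
        residual = y - pred  # negative gradient of the squared loss
        stump = fit_stump(x, residual)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return stumps, pred

def predict_boosted(stumps, x, initial, learning_rate=0.1):
    """Prediction = initial estimate + shrunken sum of all stump outputs."""
    pred = np.array(initial, dtype=float)
    for stump in stumps:
        pred = pred + learning_rate * predict_stump(stump, x)
    return pred
```

Because each stump is a least-squares fit to the current residual and the learning rate is at most 1, the training error never increases from one boosting round to the next. Starting from a physically motivated estimate means the trees only need to learn the correction, not the full energy.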
GB-method-based DNN
For the DNN, we use the 367 features for each protein as the inputs to the network for training, test, and prediction purposes. In our model, the label is defined as $y_i = \Delta G_{\mathrm{PB}}^{(i)} - \Delta G_{\mathrm{GB}}^{(i)}$ for $i = 1, 2, \ldots, M$, where $\Delta G_{\mathrm{GB}}$ is calculated as the core feature used outside the network. This core feature gives the global estimate, while the other features fed into the network provide local details. The quantity $\Delta G_{\mathrm{PB}}$ is obtained as training/test data by solving the PB equation with MIBPB at a refined mesh (e.g., h = 0.2 Å). In prediction, we have $\Delta G_{\mathrm{PB}} \approx \Delta G_{\mathrm{GB}} + \Delta G_{\mathrm{DNN}}$, where $\Delta G_{\mathrm{DNN}}$ is the predicted value from the DNN. The DNN has multiple layers, and the weights of the network are obtained by backpropagation. We tune the parameters of the network by sampling the parameter space to obtain an optimized combination of the parameters for best prediction accuracy.
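The use of the GB energy as a core feature outside the network amounts to learning a correction to a fast baseline. The following Python sketch shows only this label construction and recombination. A least-squares linear model stands in for the DNN, which is a deliberate simplification, and all function and variable names are hypothetical.

```python
import numpy as np

def fit_residual_model(features, dg_gb, dg_pb):
    """Fit a stand-in regressor to the residual label (PB energy minus GB energy)."""
    labels = np.asarray(dg_pb, dtype=float) - np.asarray(dg_gb, dtype=float)
    design = np.column_stack([features, np.ones(len(features))])  # add a bias column
    coef, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return coef

def predict_pb(features, dg_gb, coef):
    """Predicted PB energy = GB core feature + learned correction."""
    design = np.column_stack([features, np.ones(len(features))])
    return np.asarray(dg_gb, dtype=float) + design @ coef
```

The design choice is that the baseline carries the bulk of the energy, so the learner only has to model a comparatively small and smoother quantity. Any regressor with a fit and predict step, including a neural network, can replace the linear stand-in.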
Results and discussion
This section reports results from the proposed PBML model. Evaluation metrics, data selection, ML label calculation, and feature selection are discussed before results are shown. We first justify the choice of the MIBPB solver (17) to generate labels by comparing it with popular PB solvers such as Amber (14) and DelPhi (15). These solvers solve the PB model with finite difference discretization, resulting in an $n^3 \times n^3$ linear algebraic system, where n is the number of grid points in each of the x, y, and z directions for a cube-like biomolecule. Owing to the sparsity, diagonal dominance, and banded structure of the system, a Krylov iterative method will bring the computational cost to ideally $O(n^3)$, which is, however, still prohibitively expensive for large systems. We demonstrate the accuracy and efficiency of the proposed PBML model in electrostatic solvation free energy predictions. The numerical results of MIBPB, Amber, and DelPhi are generated with an Intel Xeon E5-2670v2 from HPCC of Michigan State University, and the ML results are produced on a desktop with an Intel Core i5-7500 and 16 GB of memory using the scikit-learn Python package. The electrostatic solvation free energies are generated at room temperature with fixed solute and solvent dielectric constants $\epsilon_1$ and $\epsilon_2$.
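To make the cost argument concrete, the sketch below solves a constant-coefficient model problem (negative Laplacian plus a screening term, zero Dirichlet boundaries) on an n × n × n grid with a matrix-free conjugate gradient iteration. It is a simplified illustration under stated assumptions: there is no dielectric jump, no interface treatment, and no preconditioner, and the function names are hypothetical. It is not how MIBPB, Amber, or DelPhi are implemented.

```python
import numpy as np

def apply_operator(u, h, kappa2):
    """Apply (-Laplacian + kappa2) with a 7-point stencil and zero Dirichlet boundaries."""
    p = np.pad(u, 1)  # zero padding supplies the boundary values
    lap = (p[2:, 1:-1, 1:-1] + p[:-2, 1:-1, 1:-1]
           + p[1:-1, 2:, 1:-1] + p[1:-1, :-2, 1:-1]
           + p[1:-1, 1:-1, 2:] + p[1:-1, 1:-1, :-2] - 6.0 * u) / h ** 2
    return -lap + kappa2 * u

def conjugate_gradient(b, h, kappa2, tol=1e-8, max_iter=1000):
    """Matrix-free CG; each iteration costs O(n^3) work for n^3 unknowns."""
    x = np.zeros_like(b, dtype=float)
    r = np.array(b, dtype=float)  # residual for the zero initial guess
    p = r.copy()
    rs_old = np.vdot(r, r)
    b_norm = np.sqrt(np.vdot(b, b))
    iterations = 0
    while iterations < max_iter and np.sqrt(rs_old) > tol * b_norm:
        ap = apply_operator(p, h, kappa2)
        alpha = rs_old / np.vdot(p, ap)
        x += alpha * p
        r -= alpha * ap
        rs_new = np.vdot(r, r)
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
        iterations += 1
    return x, iterations
```

The operator is never stored as an $n^3 \times n^3$ matrix; only grid-sized arrays are kept. The total cost is the per-iteration cost times the iteration count, so the ideal linear scaling is reached only when the iteration count stays bounded as the mesh is refined, which in practice requires a good preconditioner.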
Evaluation metrics
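Throughout this paper, we use the mean absolute percentage error (MAPE) and absolute relative error (ARE) for the analysis of prediction accuracy, which are defined as
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left|\frac{y_i - \hat{y}_i}{y_i}\right|, \qquad \mathrm{ARE}_i = \left|\frac{y_i - \hat{y}_i}{y_i}\right|,$$
where $y_i$ is the ith label, i.e., the PB electrostatic solvation free energy of the ith molecule, $\hat{y}_i$ is the predicted value, and N is the number of molecules in the data set being evaluated.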
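Both metrics are one-liners in NumPy. The sketch below is a minimal illustration with hypothetical function names; it assumes the MAPE is reported in percent and the ARE as a fraction, and that no label is exactly zero.

```python
import numpy as np

def absolute_relative_error(y_true, y_pred):
    """Per-sample |y - y_hat| / |y| as a fraction (labels must be nonzero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_pred - y_true) / np.abs(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * float(np.mean(absolute_relative_error(y_true, y_pred)))
```

For example, labels of -100 and -200 kcal/mol predicted as -101 and -198 kcal/mol each have an ARE of 0.01, giving a MAPE of 1%.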
Data preparation
In the present work, the selected 4294 protein structures are obtained from the PDBbind v.2015 refined set and the PDBbind v.2018 refined set as the training set (41). The PDBbind v.2015 core set of 195 proteins as listed in supporting material is adopted as the test set. The training set has proteins sized from 997 to 27,713 atoms, while the test set proteins range from 1702 to 26,236 atoms.
Data preprocessing is required before a PB solver can be called. The protein structures in the original data set are protein-ligand complexes. Missing atoms and side chains are filled using the protein preparation wizard utility of the Schrödinger 2015-2 Suite with default parameter settings. The Amber ff14SB general force field is applied for the atomic van der Waals radii and partial charges.
Simulation results
Convergence comparison of the PB solvers
We first carry out the convergence analysis of three PB solvers for the test set of 195 proteins to justify the use of the MIBPB solver to produce accurate electrostatic solvation energies as the labels. Note that this comparison is under the assumption of a linear PB model with an infinitely sharp dielectric boundary and point charges. For each protein, we compute its electrostatic solvation free energy at 10 different mesh sizes, ranging from 0.2 to 1.1 Å. For each PB solver, its results at the finest mesh size of 0.2 Å are used as the references to evaluate the relative errors for other meshes. As shown in Fig. 2 a, the MAPEs using Amber and DelPhi are less than 1.5%, but that from MIBPB is less than 0.5% at all mesh sizes. We next examine the electrostatic solvation free energies computed by the three PB solvers on two sampled proteins. As shown in Figs. 2 b and 2 c, using the test proteins PDB: 3gnw and 3owj, the energies obtained by MIBPB do not change much over the mesh refinement, while those computed by Amber and DelPhi vary more significantly. We also observed that energies obtained by Amber and DelPhi converge toward those of MIBPB as the mesh is refined. These tests justify using MIBPB, the most accurate method among these three PB solvers, to compute labels for the ML models. Further convergence tests and comparisons between PB solvers can be found in the supporting material.
Figure 2.
Convergence comparison among Amber, DelPhi, and MIBPB; (a) MAPEs at 10 grid sizes for Amber, DelPhi, and MIBPB in computing the electrostatic solvation free energies of 195 test proteins. For a protein in each method, the reference value is computed at the mesh size of 0.2 Å. (b and c) Illustration of the electrostatic solvation free energies obtained by Amber, DelPhi, and MIBPB at 10 different mesh sizes from 0.2 to 1.1 Å for proteins 3gnw and 3owj. To see this figure in color, go online.
Comparison between different ML models
After justifying the use of the MIBPB solver to generate the labels, we next apply LR, RF, GBDT, and DNN to produce corresponding learned models using the training data set. We then use these learned models to predict the solvation energy for the 195 proteins in the test set. The MAPE for each learned model is shown in Table 1. The result shows that the DNN has a better performance than the other three methods; thus, we use the DNN as our ML algorithm for a further comprehensive training and testing of the PBML model.
Table 1.
The MAPEs of LR, RF, GBDT, and DNN for the test set of 195 proteins
| | LR | RF | GBDT | DNN |
|---|---|---|---|---|
| MAPE (training) | 3.0549 | 0.4929 | 0.1553 | 0.1491 |
| MAPE (test) | 1.7652 | 0.7040 | 0.4342 | 0.4300 |
For LR and RF, we use the default parameters. For GBDT, we set the learning rate 0.05, the number of estimators 1500, and the maximum depth 5. The DNN is trained with about 500 different combinations of parameters, and the final optimized choice uses a batch size of 400, an adjustable learning rate beginning at 0.01, and a training duration of 3300 epochs on an architecture with 127 neurons/features (without the 240 GB features) in the input layer, (200, 500, 500, 500) neurons in the four hidden layers, respectively, and one neuron in the output layer.
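The tree-based and linear baselines can be set up in scikit-learn along the following lines. This is a sketch under stated assumptions rather than the released training script: the helper names and the random seed are hypothetical, LR and RF use library defaults, and the GBDT settings follow the learning rate, number of estimators, and maximum depth quoted above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

def build_models(n_estimators=1500, seed=0):
    """LR and RF with default parameters; GBDT with the settings quoted in the text."""
    return {
        "LR": LinearRegression(),
        "RF": RandomForestRegressor(random_state=seed),
        "GBDT": GradientBoostingRegressor(
            learning_rate=0.05,
            n_estimators=n_estimators,
            max_depth=5,
            random_state=seed,
        ),
    }

def evaluate_mape(model, x_train, y_train, x_test, y_test):
    """Fit on the training set and return the test MAPE in percent."""
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    return 100.0 * float(np.mean(np.abs(pred - y_test) / np.abs(y_test)))
```

Calling `evaluate_mape` for each entry of `build_models()` on the same train/test split gives a like-for-like comparison of the kind summarized in Table 1. The `n_estimators` argument is exposed so that quick experiments can use fewer trees.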
Performance of the PBML model
Our final PBML model is essentially the GB-based DNN model, which uses the GB core feature with an additional 367 features as described before. To understand the advantages of the model and its prediction, we plotted a comparison of its MAPE with those of Amber and DelPhi at 10 mesh sizes in Fig. 3. Note that although Amber and DelPhi MAPEs reduce significantly as the mesh is refined, even at the finest mesh size of 0.2 Å, these two methods have not reached the accuracy of PBML, which does not depend on grid size once the model is trained/learned.
Figure 3.
Comparison of the MAPEs of Amber, DelPhi, and PBML (using the DNN result in Table 1) of the electrostatic solvation free energies of the test set at 10 mesh sizes. The reference values are the results of MIBPB at a grid size of 0.2 Å. The DNN is trained with 448 different combinations of parameters, and the final optimized choice uses a batch size of 400, a learning rate of 0.005, and a training duration of 900 epochs on an architecture with 367 neurons in the input layer, (500, 500, 500) neurons in the three hidden layers, respectively, and one neuron in the output layer. To see this figure in color, go online.
To further check the accuracy and efficiency of our PBML model, we compute the solvation energies of the 195 test proteins using both MIBPB at a coarser mesh and the same PBML model. Note that the PBML model is trained with the 4000+ protein training set as described before, labeled by solvation energies computed using MIBPB at $h = 0.2$ Å. For this test, we use results from MIBPB at $h = 0.2$ Å as benchmark values while using results from MIBPB at the coarser mesh for the comparison with results from the PBML model for the 195 proteins in the test set. Fig. 4 a shows the relative error in solvation energy, and from individual samples or averages, we see that the PBML model is clearly more accurate than MIBPB at the coarser mesh. Fig. 4 b shows the elapsed time, and from individual samples or averages, we see that the PBML model is significantly more efficient than MIBPB at the coarser mesh. All figures are plotted using a log scale in error and time since the results vary widely across proteins.
Figure 4.
Accuracy and efficiency comparison on computing solvation energy on 195 proteins, whose indices are labeled along the horizontal axis, using MIBPB at a coarser mesh and the DNN-based PBML model. (a) Relative error in solvation energy. (b) Time. The average relative errors for PBML and MIBPB are 0.005327 and 0.02786, respectively. The average times for PBML and MIBPB are 236.5 and 1417.4 s, respectively. Note that the time for the PBML includes the time to generate features but not the training time. To see this figure in color, go online.
Software dissemination
The PBML model can be found on GitHub at https://github.com/yangxinsharon/PB-ML, maintained by X.Y. The coefficients of the DNN are stored in a standard file, and a Python script prepares the features, assembles the DNN, and returns the electrostatic solvation energy. The user also needs to install the corresponding packages for generating geometric features and GB features, as described in the README file. The entire training data set can also be shared upon request.
Conclusion
This work introduces the PBML model for the prediction of electrostatic solvation free energies of biomolecules. Our goal is to offer an efficient ML-based electrostatic analysis of new molecules or new molecular dynamics conformations that either takes a small fraction of the time needed to solve the PB equation at a similar level of accuracy or delivers a much higher accuracy than a commonly used PB solver at a similar computational time. To this end, we first search for the most accurate PB solver for generating ML labels. The second-order accurate MIBPB solver turns out to converge faster than two other eminent PB solvers, namely the DelPhi and Amber PB solvers. Additionally, we adopt MWCS for ML feature generation, which produces excellent low-dimensional intrinsic representations of biomolecules. One global core feature is computed from the GB model. To maintain the efficiency, we employ a few ML algorithms, including LR, RF, GBDT, and DNN. It is found that the present PBML model using the DNN can produce electrostatics more efficiently and accurately than traditional grid-based PB solvers.
Author contributions
J.C., Z.C., W.G., and G.-W.W. designed the research. J.C., Y.X., and X.Y. carried out all simulations and analyzed the data. J.C., W.G., and G.-W.W. wrote the article.
Acknowledgments
The work of W.G. is supported in part by NSF grant DMS-2110922. The work of G.-W.W. is supported in part by NSF grants DMS-2052983 and IIS-1900473 and NIH grants R01AI164266 and R35GM148196.
Declaration of interests
The authors declare no competing interests.
Editor: Alberto Perez.
Footnotes
Supporting material can be found online at https://doi.org/10.1016/j.bpj.2024.02.008.
Contributor Information
Weihua Geng, Email: wgeng@smu.edu.
Guo-Wei Wei, Email: wei@math.msu.edu.
References
- 1. Honig B., Nicholls A. Classical electrostatics in biology and chemistry. Science. 1995;268:1144–1149. doi: 10.1126/science.7761829.
- 2. Onufriev A., Bashford D., Case D.A. Modification of the generalized Born model suitable for macromolecules. J. Phys. Chem. B. 2000;104:3712–3720.
- 3. Tomasi J., Mennucci B., Cammi R. Quantum mechanical continuum solvation models. Chem. Rev. 2005;105:2999–3093. doi: 10.1021/cr9904009.
- 4. Fogolari F., Brigo A., Molinari H. The Poisson-Boltzmann equation for biomolecular electrostatics: a tool for structural biology. J. Mol. Recogn. 2002;15:377–392. doi: 10.1002/jmr.577.
- 5. Bashford D., Karplus M. pKa’s of ionizable groups in proteins: atomic detail from a continuum electrostatic model. Biochemistry. 1990;29:10219–10225. doi: 10.1021/bi00496a010.
- 6. Onufriev A.V., Alexov E. Protonation and pK changes in protein-ligand binding. Q. Rev. Biophys. 2013;46:181–209. doi: 10.1017/S0033583513000024.
- 7. Tang C.L., Alexov E., et al. Honig B. Calculation of pKas in RNA: On the Structural Origins and Functional Roles of Protonated Nucleotides. J. Mol. Biol. 2007;366:1475–1496. doi: 10.1016/j.jmb.2006.12.001.
- 8. Zhang Q., Beard D.A., Schlick T. Constructing irregular surfaces to enclose macromolecular complexes for mesoscale modeling using the discrete surface charge optimization (DISCO) algorithm. J. Comput. Chem. 2003;24:2063–2074. doi: 10.1002/jcc.10337.
- 9. Madura J.D., Briggs J.M., et al. McCammon J. Electrostatics and diffusion of molecules in solution - simulations with the University of Houston Brownian Dynamics program. Comput. Phys. Commun. 1995;91:57–95.
- 10. Jo S., Vargyas M., et al. Im W. PBEQ-Solver for online visualization of electrostatic potential of biomolecules. Nucleic Acids Res. 2008;36:W270–W275. doi: 10.1093/nar/gkn314.
- 11. Baker N.A., Sept D., et al. McCammon J.A. The adaptive multilevel finite element solution of the Poisson-Boltzmann equation on massively parallel computers. IBM J. Res. Dev. 2001;45:427–438.
- 12. Geng W.H., Krasny R. A treecode-accelerated boundary integral Poisson-Boltzmann solver for continuum electrostatics of solvated biomolecules. J. Comput. Phys. 2013;247:62–87.
- 13. Lu B., Cheng X., et al. McCammon J.A. AFMPB: An Adaptive Fast Multipole Poisson-Boltzmann Solver for Calculating Electrostatics in Biomolecular Systems. Comput. Phys. Commun. 2013;184:2618–2619. doi: 10.1016/j.cpc.2010.02.015.
- 14. Cai Q., Hsieh M.J., et al. Luo R. Performance of Nonlinear Finite-Difference Poisson-Boltzmann Solvers. J. Chem. Theor. Comput. 2010;6:203–211. doi: 10.1021/ct900381r.
- 15. Li L., Li C., et al. Alexov E. DelPhi: a comprehensive suite for DelPhi software and associated resources. BMC Biophys. 2012;5:9. doi: 10.1186/2046-1682-5-9.
- 16. Baker N.A., Sept D., et al. McCammon J.A. Electrostatics of nanosystems: Application to microtubules and the ribosome. Proc. Natl. Acad. Sci. USA. 2001;98:10037–10041. doi: 10.1073/pnas.181342398.
- 17. Chen D., Chen Z., et al. Wei G.W. MIBPB: A software package for electrostatic analysis. J. Comput. Chem. 2011;32:756–770. doi: 10.1002/jcc.21646.
- 18. Yu S., Zhou Y., Wei G.W. Matched interface and boundary (MIB) method for elliptic problems with sharp-edged interfaces. J. Comput. Phys. 2007;224:729–756.
- 19. Liu B., Wang B., et al. Wei G.W. ESES: software for Eulerian solvent excluded surface. J. Comput. Chem. 2017;38:446–466. doi: 10.1002/jcc.24682.
- 20. LeCun Y., Bengio Y., Hinton G. Deep learning. Nature. 2015;521:436–444. doi: 10.1038/nature14539.
- 21. Korotcov A., Tkachenko V., et al. Ekins S. Comparison of deep learning with multiple machine learning methods and metrics using diverse drug discovery data sets. Mol. Pharm. 2017;14:4462–4475. doi: 10.1021/acs.molpharmaceut.7b00578.
- 22. Jiménez J., Škalič M., et al. De Fabritiis G. KDEEP: Protein–Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks. J. Chem. Inf. Model. 2018;58:287–296. doi: 10.1021/acs.jcim.7b00650.
- 23. Hughes T.B., Miller G.P., Swamidass S.J. Modeling epoxidation of drug-like molecules with a deep machine learning network. ACS Cent. Sci. 2015;1:168–180. doi: 10.1021/acscentsci.5b00131.
- 24. Lam P.C.-H., Abagyan R., Totrov M. Hybrid receptor structure/ligand-based docking and activity prediction in ICM: development and evaluation in D3R Grand Challenge 3. J. Comput. Aided Mol. Des. 2018;33:35–46. doi: 10.1007/s10822-018-0139-5.
- 25. Sunseri J., Ragoza M., et al. Koes D.R. A D3R prospective evaluation of machine learning for protein-ligand scoring. J. Comput. Aided Mol. Des. 2016;30:761–771. doi: 10.1007/s10822-016-9960-x.
- 26. Cang Z., Mu L., Wei G.W. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Comput. Biol. 2018;14 doi: 10.1371/journal.pcbi.1005929.
- 27. Wu K., Wei G.W. Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks. J. Chem. Inf. Model. 2018;58:520–531. doi: 10.1021/acs.jcim.7b00558.
- 28. Wu K., Zhao Z., et al. Wei G.W. TopP-S: Persistent Homology-Based Multi-Task Deep Neural Networks for Simultaneous Predictions of Partition Coefficient and Aqueous Solubility. J. Comput. Chem. 2018;39:1444–1454. doi: 10.1002/jcc.25213.
- 29. Geng W., Yu S., Wei G. Treatment of charge singularities in implicit solvent models. J. Chem. Phys. 2007;127 doi: 10.1063/1.2768064.
- 30. Xia K., Wei G.W. Persistent homology analysis of protein structure, flexibility and folding. Int. J. Numer. Method. Biomed. Eng. 2014;30:814–844. doi: 10.1002/cnm.2655.
- 31. Nguyen D.D., Wei G.W. The impact of surface area, volume, curvature and Lennard-Jones potential to solvation modeling. J. Comput. Chem. 2017;38:24–36. doi: 10.1002/jcc.24512.
- 32. Xia K., Opron K., Wei G.W. Multiscale multiphysics and multidomain models — Flexibility and Rigidity. J. Chem. Phys. 2013;139 doi: 10.1063/1.4830404.
- 33. Connolly M.L. Solvent-Accessible Surfaces of Proteins and Nucleic Acids. Science. 1983;221:709–713. doi: 10.1126/science.6879170.
- 34. Nguyen D.D., Wang B., Wei G.W. Accurate, robust and reliable calculations of Poisson-Boltzmann binding energies. J. Comput. Chem. 2017;38:941–948. doi: 10.1002/jcc.24757.
- 35. Forouzesh N., Izadi S., Onufriev A.V. Grid-based surface generalized Born model for calculation of electrostatic binding free energies. J. Chem. Inf. Model. 2017;57:2505–2513. doi: 10.1021/acs.jcim.7b00192.
- 36. Sanner M.F., Olson A.J., Spehner J.C. Reduced surface: An efficient way to compute molecular surfaces. Biopolymers. 1996;38:305–320. doi: 10.1002/(SICI)1097-0282(199603)38:3<305::AID-BIP4>3.0.CO;2-Y.
- 37. Bahar I., Atilgan A.R., Erman B. Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Folding Des. 1997;2:173–181. doi: 10.1016/S1359-0278(97)00024-2.
- 38. Bramer D., Wei G.-W. Multiscale weighted colored graphs for protein flexibility and rigidity analysis. J. Chem. Phys. 2018;148 doi: 10.1063/1.5016562.
- 39. Nguyen D.D., Cang Z., et al. Wei G.-W. Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. J. Comput. Aided Mol. Des. 2019;33:71–82. doi: 10.1007/s10822-018-0146-6.
- 40. Borgatti S.P. Centrality and network flow. Soc. Network. 2005;27:55–71.
- 41. Liu Z., Li Y., et al. Wang R. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics. 2015;31:405–412. doi: 10.1093/bioinformatics/btu626.