Abstract
The behavior of proteins is closely related to the protonation states of the residues. Therefore, prediction and measurement of pKa are essential to understand the basic functions of proteins. In this work, we develop a new empirical scheme for protein pKa prediction that is based on deep representation learning. It combines machine learning with atomic environment vector (AEV) and learned quantum mechanical representation from ANI-2x neural network potential (J. Chem. Theory Comput. 2020, 16, 4192). The scheme requires only the coordinate information of a protein as the input and separately estimates the pKa for all five titratable amino acid types. The accuracy of the approach was analyzed with both cross-validation and an external test set of proteins. Obtained results were compared with the widely used empirical approach PROPKA. The new empirical model provides accuracy with MAEs below 0.5 for all amino acid types. It surpasses the accuracy of PROPKA and performs significantly better than the null model. Our model is also sensitive to the local conformational changes and molecular interactions.
We developed a new empirical ML model for protein pKa prediction with MAEs below 0.5 for all amino acid types.
Introduction
Basic features and the behavior of proteins, such as folding or ligand binding, heavily depend on the environmental conditions like the local protein environment. Titratable amino acids like aspartic acid (Asp) or histidine (His) are essential in many biological processes1–5 and can be either protonated or deprotonated depending on the local environment. Thus, determination of the ionization states via pKa predictions is a prerequisite to understand the protein function. Determination of pKa values via experimental procedures is challenging and the most reliable results for proteins can be obtained only with NMR titrations.6 This difficulty motivates the prediction of protein pKa values by theoretical means.7 There is a tremendous amount of work on theoretical pKa calculations in the literature. These approaches can be classified into three categories: (i) microscopic methods,8,9 (ii) macroscopic methods based on continuum electrostatics,10 and (iii) knowledge-based methods that rely on empirical parameters.11,12
Among the three classes of theoretical pKa calculations, microscopic methods such as quantum mechanical (QM) or quantum mechanics/molecular mechanics (QM/MM) approaches are considered the most reliable ones to compute pKa values of small molecules.8,13 The most traditional approach with QM methods is to employ thermodynamic cycles by computing protonation/deprotonation free energies in the gas phase and in solution.14–23 However, these calculations do not always provide reliable pKa values due to reasons such as the instability of the species in the gas phase or large conformational differences between the gas phase and solution.17,24 In the case of proteins, QM approaches are impractical simply due to the system size and can only be applied to model systems consisting of the local protein environment of the residue of interest. Nevertheless, the size of the model and the choice of the local environment can alter the theoretical pKa values.25 A more practical microscopic method to compute pKa values is the hybrid QM/MM approach, in which the titratable residue is modeled at a quantum level while the remaining medium is treated with molecular mechanics.26–28 Molecular dynamics (MD) based methods such as free energy perturbation29,30 and constant pH molecular dynamics (CPHMD) simulations31–41 can provide reliable pKa values for protein residues. Combining enhanced sampling techniques with CPHMD simulations can also improve the accuracy of pKa predictions.34,42–47 Nevertheless, the need for fast and reliable approaches to predict pKa values of protein residues can render the microscopic methods impractical due to the exhaustive computation time.
Macroscopic methods rely on either the numerical Poisson–Boltzmann equation (PBE)10,48–51 or the Generalized Born (GB) technique with analytical approximations to electrostatic energies.52,53 These methods model the proteins as a homogeneous medium with a low dielectric constant while the environment (solvent) is modeled with a high dielectric constant. The PBE based methods and their variations54–60 can allow modeling the accessibility of the solvent to the titratable residues61,62 and multiple ionizable residues within the proximity.63,64 Even though there are different suggestions for the dielectric constant of proteins that varies from 4 to 80,65–73 the appropriate value depends on the polarity of the surrounding residues and the flexibility of the protein.74,75 This issue can be addressed by taking the flexibility of the protein into account via techniques that involve ensembles of conformers.54,76–82 An example of such an approach is the Multi-Conformation Continuum Electrostatic (MCCE) method which has been shown to successfully predict pKa values of several protein residues with different force fields.70,83–85
Empirical methods are based on statistical fitting of environmental descriptors and parameters to the three-dimensional structures of proteins. Their sufficiently accurate predictions for most cases combined with their low computational cost make them widespread and favorable. There are a variety of empirical tools with comparable accuracies,86–88 but PROPKA11,12 is the most widely used for protein pKa predictions. Conceptually, PROPKA computes the change of the amino acid pKa value from water to a protein environment. In this tool, the environmental perturbation is expressed as the sum of perturbation contributions from a protein environment.
Recent studies with machine learning (ML) algorithms for pKa estimations of transition metal complexes have provided new empirical schemes.89,90 These approaches combine the pattern recognition capabilities of ML algorithms with the atomistic and molecular features that are obtained with a QM tool. However, this scheme can only be practical for proteins if molecular descriptors are obtained with low computational cost, such as neural network potentials (NNPs). Over the last decade, NNPs have been shown to provide accuracy approaching that of QM calculations and comparable computational cost with all-atom force fields. These potentials, such as ANI91–98 and AIMNet,99 can learn the electronic environment of an atom in conjunction with the many-body symmetry functions that arise from the coordinates.100,101 Using this learned information and combining it with the structural fingerprints that depend on the coordinates, NNPs can predict target molecular properties such as energy and forces. Thus, NNPs can be utilized to obtain information that stems from the atomic environment, and this information can be used to train ML models for protein pKa estimations.
In this context, we developed an empirical scheme for protein pKa predictions that employs ML algorithms for five amino acid types (ASP, GLU, HIS, LYS, and TYR). We rely on representation learning, i.e., learning representation of the data by automatically extracting useful information when the ML model is trained. We used ANI atomistic neural network architecture that learns molecular representation end-to-end, i.e., directly from atomic coordinates. This molecular representation reduces the dimensionality of a molecular structure into a compact vector format that encodes important quantum mechanical information.
Methods
Our model provides predictions via the atomic environment and the learned electronic information that is obtained with a widely used NNP, ANI-2x.96 The workflow for protein pKa prediction is depicted in Fig. 1. In the present work, each amino acid type is treated separately to improve the accuracy by ensuring different molecular features for different amino acid types. Models are trained and tested over hundreds of experimental pKa values, and the accuracy is also compared with the widely used PROPKA12 tool. The presented approach performs significantly better than null models and improves the current empirical methods for pKa estimations.
Reference data for training
The pKa model is trained and tested with two datasets. The first dataset is obtained from the PKAD database.102 This dataset consists of over 1500 experimentally measured pKa values of residues on both wild type (WT) and mutant proteins. The second dataset consists of 337 entries that were extracted from the primary literature.103–127 Mutation of a residue on a protein can cause significant conformational changes that alter the amino acids' electronic environment in proximity to the mutation site. However, not all mutant proteins have crystallographic structures deposited to the databanks. Extensive conformational sampling must be performed to account for the conformational alteration due to the mutations. Since conformational sampling is out of the scope of this study, all mutant protein entries were excluded from datasets. Our model is trained only for WT proteins. This selection results in training and test datasets containing entries from 186 WT PDB structures. The distribution of the pKa values in training and test datasets can be found in the ESI (Fig. S.1).† For this initial proof of principle model, only five titratable residues (GLU, ASP, LYS, HIS and TYR) are selected as targets for pKa predictions.
Data curation
Crystallographic structures of 187 WT proteins are obtained from the PDB. A flowchart for data preparation prior to the training can be found in the ESI (Fig. S.2).† In conventional PDB files, the crystallographic structures can involve entries other than proteins and nucleotides, such as ligands, mobile counterions, metal ions, or water molecules. It is important to state that the presence of a co-factor or a ligand can alter the pKa of residues within a protein. However, any entry other than proteins and nucleotides is removed from PDB structures due to two reasons. First, the number of atomic species that are defined in a neural network potential (NNP) is currently limited to nonmetals. This limitation prevents inclusion of HETATM entries that can have atomic species that NNP does not define. Second, the conditions in the experimental procedures for pKa determination and the crystallographic data preparation can be different. PDB entries correspond to constrained structures obtained using either X-ray or neutron diffractions, requiring specific strategies to achieve crystallographic packing. For example, many PDB entries tend to contain mobile counterions due to the packing procedures and these ions mainly do not exist in experimental pKa determinations.
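The clean-up step described above can be illustrated with a minimal Python sketch. It assumes standard fixed-column PDB records and simply drops every HETATM line; the function name `strip_non_polymer` and the example coordinates are hypothetical, and the authors' actual preparation script (ESI Fig. S.2) may differ, for example in how modified residues stored as HETATM are handled.

```python
def strip_non_polymer(pdb_text: str) -> str:
    """Keep polymer ATOM records and basic bookkeeping records; drop HETATM entries.

    Assumption: ligands, ions, and waters are all stored as HETATM records,
    which is the usual PDB convention but not a guarantee for every file.
    """
    keep = {"ATOM", "TER", "MODEL", "ENDMDL", "END"}
    kept_lines = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # record name occupies columns 1-6
        if record in keep:
            kept_lines.append(line)
    return "\n".join(kept_lines) + "\n"


# Hypothetical four-atom example: two protein atoms, one water, one sodium ion.
example = (
    "ATOM      1  N   GLU A   7      11.104  13.207   2.100  1.00 20.00           N\n"
    "ATOM      2  CA  GLU A   7      12.560  13.329   2.250  1.00 20.00           C\n"
    "HETATM    3  O   HOH A 201       5.000   5.000   5.000  1.00 30.00           O\n"
    "HETATM    4 NA    NA A 301       8.000   8.000   8.000  1.00 30.00          NA\n"
    "END\n"
)
cleaned = strip_non_polymer(example)
```

Filtering on the record name rather than on residue names keeps the sketch independent of any particular ligand or ion list.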
After the clean-up of PDB entries, missing heavy atoms and H atoms are added with the tleap module of AmberTools21 128 using the ff14SB force field for proteins129 and BSC1 force field for DNA.130 For titratable protein residues, standard protonation states are assumed. To prevent any possible steric clashes after the addition of missing atoms, very short gas-phase minimizations (250 steps of steepest descent followed by a conjugate gradient up to 500 steps in total) are performed using the sander module of AmberTools21.128
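As a rough illustration of the minimization settings quoted above, the following Python sketch writes a sander control namelist with 500 total cycles, the first 250 of which use steepest descent. The authors' actual input file is not given in the text, so the choice of `ntb=0`, `igb=0`, and a large `cut` to represent a gas-phase, non-periodic calculation is an assumption, as is the helper name `minimization_input`.

```python
def minimization_input(total_steps: int = 500, sd_steps: int = 250) -> str:
    """Build a sander &cntrl namelist for a short gas-phase minimization.

    imin=1   : run a minimization instead of dynamics
    maxcyc   : total number of minimization cycles
    ncyc     : cycles of steepest descent before switching to conjugate gradient
    ntb=0    : no periodic boundary (assumed for the gas phase)
    igb=0    : no implicit solvent (assumed for the gas phase)
    cut      : large cutoff so that no interactions are truncated (assumed)
    """
    if sd_steps > total_steps:
        raise ValueError("sd_steps cannot exceed total_steps")
    return (
        "short gas-phase minimization\n"
        " &cntrl\n"
        "  imin=1,\n"
        f"  maxcyc={total_steps},\n"
        f"  ncyc={sd_steps},\n"
        "  ntb=0,\n"
        "  igb=0,\n"
        "  cut=999.0,\n"
        " /\n"
    )


mdin = minimization_input()
```

The returned string can be written to a file and passed to sander with the `-i` option.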
Descriptor calculations
Minimized structures are used as inputs for NNP to compute all descriptors. A detailed description of ANI neural network potential and corresponding descriptors can be found elsewhere.96,100 Briefly, in ANI-type NNPs, the environment of the atomic species in the given coordinate system is transformed into atomic environment vectors (AEVs) that contain radial and angular contributions (see Fig. 1). Since the pKa of amino acids in proteins are sensitive to the neighborhood environment, naturally, AEVs were chosen as candidates for pKa descriptors. This representation includes structural information on both bonded and non-bonded interactions of any given atom within the default ANI cutoff distance (rcut = 5.2 Å). In addition to AEVs, neural network embeddings were chosen as learned representations. Therefore, 2nd and 3rd layers of atomic neural network embeddings are selected as additional descriptor candidates.
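To make the radial part of an AEV concrete, the sketch below evaluates the radial symmetry function in the form given in the original ANI publications, a sum of Gaussians in the interatomic distance damped by a cosine cutoff. It is a simplified illustration only: real ANI AEVs are additionally resolved by neighbor element and include angular terms, and the values of `eta` and the shift list used here are illustrative assumptions rather than the ANI-2x parameters.

```python
import math


def cosine_cutoff(r: float, r_cut: float) -> float:
    """Smoothly damp contributions to zero at the cutoff radius."""
    if r >= r_cut:
        return 0.0
    return 0.5 * math.cos(math.pi * r / r_cut) + 0.5


def radial_aev(center, neighbors, shifts, eta=16.0, r_cut=5.2):
    """Radial terms G(R_s) = sum_j exp(-eta * (R_ij - R_s)^2) * f_c(R_ij).

    center    : (x, y, z) of the atom being described, in angstrom
    neighbors : iterable of (x, y, z) for surrounding atoms
    shifts    : radial probe positions R_s (illustrative values)
    """
    distances = [math.dist(center, n) for n in neighbors]
    return [
        sum(math.exp(-eta * (r - r_s) ** 2) * cosine_cutoff(r, r_cut)
            for r in distances)
        for r_s in shifts
    ]


# One neighbor at 1.0 angstrom and one beyond the 5.2 angstrom cutoff.
vals = radial_aev((0.0, 0.0, 0.0),
                  [(1.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
                  shifts=[1.0, 2.0])
```

The neighbor at 6.0 Å contributes nothing, which is how the cutoff restricts the descriptor to the local environment of the titratable residue.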
Feature importance and training
We observed that many features in the overall descriptor were redundant or highly correlated. To eliminate the redundant features, a three-step filtering procedure is adopted. First, noninformative features (values of 0.0 for all reference data) are removed. Second, the correlation between features is computed, and highly correlated features (correlation coefficient > 0.95) are eliminated. Third, a recursive feature elimination (RFE)131 process is performed using a random forest regressor (RF)132 algorithm as implemented in the scikit-learn package.133 RFE is a technique that identifies the least important features using an importance ranking, and it has been shown that ML models benefit from it.134 The pseudo-code for RFE is depicted in Fig. 2. In each recursive step of the procedure, the feature importance is measured, and a desired number of features (F†) is kept by removing the less important ones. The new feature list is used to perform training with RF using 1000 decision trees. A final set of features (F‡) is defined by the local model that has the best coefficient of determination for predictions over out-of-bag samples. After obtaining the final set of features, a 10-fold cross-validation (CV) is performed with RF using the same settings as for training in the feature elimination process.
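The three filtering steps can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: the importance function used here is the absolute Pearson correlation with the target, a stand-in for the random forest importances described above, and the selection of F‡ by out-of-bag score is not reproduced. All feature names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / math.sqrt(sxx * syy)


def drop_all_zero(columns):
    """Step 1: remove features that are 0.0 for every sample."""
    return {k: v for k, v in columns.items() if any(x != 0.0 for x in v)}


def drop_correlated(columns, threshold=0.95):
    """Step 2: keep a feature only if |r| <= threshold with all features kept so far."""
    kept = {}
    for name, col in columns.items():
        if all(abs(pearson(col, other)) <= threshold for other in kept.values()):
            kept[name] = col
    return kept


def abs_correlation_importance(columns, target):
    """Stand-in importance (NOT random forest importance): |r| with the target."""
    return {k: abs(pearson(v, target)) for k, v in columns.items()}


def recursive_elimination(columns, target, importance, n_keep, drop_per_round=1):
    """Step 3: re-rank the surviving features each round and drop the weakest."""
    names = list(columns)
    while len(names) > n_keep:
        scores = importance({k: columns[k] for k in names}, target)
        names.sort(key=lambda k: scores[k], reverse=True)
        names = names[:max(n_keep, len(names) - drop_per_round)]
    return names


# Hypothetical descriptor columns for five residues and their pKa values.
features = {
    "f_zero": [0.0, 0.0, 0.0, 0.0, 0.0],
    "f_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f_a_copy": [2.0, 4.0, 6.0, 8.0, 10.0],
    "f_b": [1.0, -1.0, 2.0, -2.0, 0.5],
    "f_noise": [0.3, 0.1, 0.4, 0.1, 0.5],
}
pka = [3.1, 3.9, 5.2, 5.8, 7.1]

filtered = drop_correlated(drop_all_zero(features))
selected = recursive_elimination(filtered, pka, abs_correlation_importance, n_keep=2)
```

Passing the importance function as an argument means a random forest based ranking, such as the one used in this work, could be substituted without changing the elimination loop.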
Molecular dynamics simulations and clustering
Two different ionization states of ASP26 (neutral: ASH, and negatively charged: ASP) on human thioredoxin conformer (PDB ID: 3TRX) are considered. Topology and coordinate files are built with the default ionization states for residues in the ff14SB force field for proteins129 using the tleap module of AmberTools21.128 The samples are neutralized using Na+ counter ions: 4 Na+ for the sample containing neutral ASP, and 5 Na+ for the sample containing negatively charged ASP. To provide a salt concentration, 5 Na+ and 5 Cl− ions are added to each sample. Waters in the original crystal structure are deleted, and the samples are solvated using TIP3P water molecules135 with a distance of 12 Å between the solute and the edge of the box, which results in an average box dimension of 66.8 Å × 69.7 Å × 62.3 Å.
Simulations are performed using the CUDA version of AMBER20's pmemd module.128,136,137 A time step of 1.0 fs is used along with Berendsen temperature coupling138 and the SHAKE algorithm139 for the bonds involving hydrogen atoms. The particle mesh Ewald summation (PME) technique140 is employed using a cutoff distance of 8 Å. We carried out an 11-step equilibration procedure141 that consists of harmonic restraints on protein residues and their reduction in each step at 10 K, which is followed by the gradual heating of samples to 300 K with a gradual harmonic restraint reduction at 300 K. A 50 ns long production simulation is performed for each of the two equilibrated samples. Production trajectories are used to cluster the frames using a hierarchical agglomerative (bottom-up) approach as implemented in the cpptraj module of AmberTools21.128 Clustering is performed using the root mean square deviation as the distance metric for the carboxyl group of the ASP26 side chain (ASH26 in the case of neutral ASP). It is finalized when the minimum distance between the clusters is larger than 1.5 Å. The best cluster representatives are selected using the lowest cumulative distance to all the other frames in the same cluster.
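The clustering and representative selection described above can be sketched with a small pure-Python example operating on a precomputed frame-to-frame distance matrix. It is not the cpptraj implementation: average linkage is assumed because the text does not state the linkage criterion, and the one-dimensional "frames" below are hypothetical stand-ins for pairwise RMSD values.

```python
def agglomerate(dist, epsilon=1.5):
    """Bottom-up clustering that stops once the closest clusters are > epsilon apart.

    dist    : symmetric matrix of pairwise frame distances (angstrom)
    epsilon : stopping threshold on the minimum inter-cluster distance
    Average linkage is an assumption of this sketch.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d_min, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if d_min > epsilon:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def representative(cluster, dist):
    """Frame with the lowest cumulative distance to all frames in its cluster."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster))


# Hypothetical frames placed on a line; distances mimic pairwise RMSD values.
positions = [0.0, 0.4, 0.9, 5.0, 5.3]
dist = [[abs(a - b) for b in positions] for a in positions]

clusters = agglomerate(dist, epsilon=1.5)
largest = max(clusters, key=len)
rep = representative(largest, dist)
```

In this toy case the first three frames merge into the most populated cluster, and the middle frame is chosen as its representative because it is closest to the other two.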
Results and discussion
There has been a surge of approaches looking to learn a representation that directly encodes information about molecules.142,143 The idea behind representation learning is to learn a mapping that embeds molecular structures as points in a low-dimensional vector space.144 The goal is to optimize this mapping so that relationships in the embedding space reflect the similarities between objects. After optimizing the embedding space, the learned embeddings can be used as feature inputs for downstream machine learning tasks. The key distinction between representation learning and traditional descriptor calculations is how they treat the molecular structure problem. Descriptors treat this problem as a pre-processing step, using domain knowledge and hand-crafted rules to extract molecular information. In contrast, representation learning treats this problem as a machine learning task, using a purely data-driven approach to learn embeddings that encode a molecular structure.
The pKa of an amino acid on a protein can be affected by different environmental features such as amino acids in proximity or solvent access. The surrounding amino acids can be encoded through so-called atomic environment vectors (AEVs) which can be obtained with popular atomistic neural network potentials like ANI.96 Even though the presence of the solvent cannot be modeled with the current ANI-2x implementation, the gas-phase electronic-structure contributions can be addressed with neural network embeddings. These embeddings would provide information regarding the electronic environment of the titratable residue.
To show the utility of the representation learning, we first performed a simple exercise. We extracted 3D structures for 171 natural and non-natural amino acids from the SwissSidechain database.145 Fig. 3 shows a 2D t-distributed stochastic neighbor embedding (t-SNE)146 projection of atomic embeddings for oxygen and nitrogen atoms based on the 3rd (top) layer of the neural network. Naturally, oxygen and nitrogen atoms show two distinctly different clusters corresponding to each element.
Inside the oxygen cluster, titratable groups like sidechain carboxyls, aliphatic and aromatic alcohols are spread out. This is possible due to the very different environments modulated by non-natural amino acids. We hypothesized that the difference in embedding vectors should reflect the acid–base properties of these groups too. Therefore, these embedding vectors could be used as descriptors for empirical pKa prediction. For the sake of completeness, we will consider all possible descriptors, i.e., AEV, and 2nd and 3rd layer neural network embeddings obtained with the ANI-2x model as an initial set of descriptors.
To assess the performance of ML models with ANI-2x descriptors, the available pKa data are divided into training and test subsets. Different ML algorithms were tested, and the accuracies were analyzed. Results obtained with different procedures are depicted in the ESI (see Fig. S.3).† We observed that linear regression (LR) and support vector machines (SVMs) with a linear kernel yielded similar results. Training with the RF provided more accurate results with MAEs of about 0.5, while the inclusion of recursive feature elimination (RFE) improved the accuracy even further. RFE resulted in a feature space of about 10 to 100 descriptors for amino acids. We observed that both features that belong to the side chains and features that belong to the backbone atoms are selected as important descriptors. This can be related to the learned inductive (through-bond) effects. Feature elimination revealed that even though most of the descriptors from the initial feature list are eliminated, all the feature classes are preserved in the final feature list. These results indicate that pKa predictions require both the information regarding the atomic environment of titratable residues and the electronic information encoded by the neural network embeddings of the NNP.
First, the model accuracy was assessed with k-fold cross-validation. To compare the accuracy of our model, pKa values for the whole training dataset are also predicted with PROPKA 3.1.12 The results obtained with the ML model, PROPKA, and the null model for GLU, ASP, and HIS are depicted in Fig. 4 (see ESI Fig. S.4† for LYS and TYR). It was found that the coefficients of determination (r2) for all amino acid types are above 0.6 with the ML model (except for LYS, r2 = 0.31) while mean absolute errors (MAEs) for all amino acid types are below 0.5 pKa units. In the case of PROPKA, predictions have r2 < 0.3 and MAE > 0.6 with GLU and ASP being the most reliable predictions. Interestingly, PROPKA yields similar or less reliable results relative to the null model, especially for HIS, LYS and TYR. These results might be due to the PROPKA computation scheme, which considers the shift of the pKa value for the amino acid from water to protein (ΔpKa(water→protein)),11 while the ML model is trained directly for pKa values in the protein environment using a relatively larger training set. The number of prediction errors above 1.0 pKa unit (Nerror > 1.0) is computed for all amino acid types, considering only experimental pKa values (pKa(exp)) that are at least 1.0 unit below or above the pKa value of the corresponding amino acid in water (pKa(water)). The results are depicted in Table 1. We see that Nerror > 1.0 is lower with the ML model than with PROPKA for all amino acid types, and roughly half or less for GLU, HIS, and TYR. These results indicate that ML model predictions are more reliable for all amino acid types that have a water to protein pKa shift which is at least 1.0 unit (|ΔpKa(water→protein)| ≥ 1.0).
Table 1. Number of experimental pKa values that are 1.0 pKa unit lower or higher than the pKa in water (Nexp) and the number of prediction errors that are above 1.0 pKa unit (Nerror > 1.0).
Amino acid | pKa range | Nexp | Nerror > 1.0 (ML model) | Nerror > 1.0 (PROPKA) |
---|---|---|---|---|
GLU | pKa < 3.5 or pKa > 5.5 | 68 | 12 | 21 |
ASP | pKa < 2.8 or pKa > 4.8 | 93 | 27 | 35 |
HIS | pKa < 5.5 or pKa > 7.5 | 85 | 20 | 55 |
LYS | pKa < 9.5 or pKa > 11.5 | 16 | 7 | 8 |
TYR | pKa < 9.0 or pKa > 11.0 | 28 | 0 | 8 |
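The quantities reported above (MAE, RMSE, coefficient of determination, Nexp, and Nerror > 1.0) can be computed as in the following minimal Python sketch. The predicted and experimental values are hypothetical, the null model is assumed to be a constant prediction equal to the pKa in water, and r2 is computed as 1 − SS_res/SS_tot; the text does not define either of the latter two explicitly. The water pKa of 4.5 for GLU is inferred from the range listed in Table 1.

```python
import math


def mae(pred, exp):
    """Mean absolute error."""
    return sum(abs(p - e) for p, e in zip(pred, exp)) / len(exp)


def rmse(pred, exp):
    """Root mean square error."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, exp)) / len(exp))


def r_squared(pred, exp):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed definition)."""
    mean_exp = sum(exp) / len(exp)
    ss_res = sum((e - p) ** 2 for p, e in zip(pred, exp))
    ss_tot = sum((e - mean_exp) ** 2 for e in exp)
    return 1.0 - ss_res / ss_tot


def count_large_errors(pred, exp, pka_water, shift=1.0, tol=1.0):
    """Return (N_exp, N_error): shifted residues, and those with |error| > tol.

    A residue counts as shifted if |pKa_exp - pKa_water| > shift, following
    the strict ranges listed in Table 1.
    """
    shifted = [(p, e) for p, e in zip(pred, exp) if abs(e - pka_water) > shift]
    n_error = sum(1 for p, e in shifted if abs(p - e) > tol)
    return len(shifted), n_error


# Hypothetical GLU data (pKa in water taken as 4.5, consistent with Table 1).
exp_pka = [4.5, 3.2, 6.0, 4.1, 2.9]
ml_pred = [4.3, 4.4, 5.6, 4.0, 3.3]
null_pred = [4.5] * len(exp_pka)  # assumed null model: constant water pKa

ml_mae = mae(ml_pred, exp_pka)
null_mae = mae(null_pred, exp_pka)
n_exp, n_error = count_large_errors(ml_pred, exp_pka, pka_water=4.5)
```

Restricting the error count to shifted residues, as in Table 1, focuses the comparison on the cases where a constant prediction is least informative.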
The ML models were also evaluated with the external test dataset of pKa values from 33 different proteins that do not appear in the training data. Results for GLU, ASP and HIS amino acids are depicted in Fig. 5 (LYS and TYR test results can be found in ESI Fig. S.5†). We found that ML models for all amino acid types provide predictions with MAE < 1.0, where GLU and LYS yield better predictions (MAE < 0.5) relative to the other amino acids. The higher MAE values, especially in the case of ASP, are related to outliers that have very high/low experimental pKa values for the corresponding amino acid (high |ΔpKa(water→protein)|).
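An external test set whose proteins never appear in the training data can be built by splitting at the protein level rather than the residue level. The sketch below shows one way to do this in Python; the text does not describe how the 33 test proteins were chosen, so the random assignment, the `test_fraction` parameter, and the function name are assumptions. The example entries reuse ASP residues and pKa values quoted elsewhere in this article.

```python
import random


def split_by_protein(entries, test_fraction=0.2, seed=0):
    """Assign whole proteins to either the training or the test set.

    entries : list of (pdb_id, residue_label, pKa) tuples
    Splitting by PDB ID prevents residues of one protein from appearing
    on both sides of the split.
    """
    pdb_ids = sorted({pdb_id for pdb_id, _, _ in entries})
    rng = random.Random(seed)
    rng.shuffle(pdb_ids)
    n_test = max(1, round(test_fraction * len(pdb_ids)))
    test_ids = set(pdb_ids[:n_test])
    train = [e for e in entries if e[0] not in test_ids]
    test = [e for e in entries if e[0] in test_ids]
    return train, test


entries = [
    ("3TRX", "ASP26", 9.9),
    ("1FNA", "ASP67", 4.2),
    ("1BEG", "ASP77", 2.61),
    ("1HNG", "ASP28", 3.57),
    ("3RN3", "ASP14", 2.0),
]
train, test = split_by_protein(entries, test_fraction=0.2, seed=0)
```

Note that different PDB entries of the same protein (for example 3TRX and 4TRX) would still need to be grouped manually if they are to be kept on the same side of the split.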
A similar evaluation was performed with DelPhiPKa147 using the external test set. Only 23 proteins were completed because the run time extended beyond one week. Calculations are performed using the default runtime parameters provided by the DelPhiPKa program. The RMSE/MAE values for predictions of 281 pKa values with DelPhiPKa (present work in parentheses) are 1.03 (0.76)/0.74 (0.56), 1.17 (0.60)/0.90 (0.45), 1.38 (0.88)/0.96 (0.67), 1.33 (0.49)/1.06 (0.40), and 0.98 (0.87)/0.82 (0.76) for ASP, GLU, HIS, LYS and TYR, respectively. It should be noted that all calculations are performed sequentially on a Linux computer with a runtime of ∼127 s per residue for DelPhiPKa and ∼0.2 s per residue for the ML model presented in this work. These results indicate that the ML model not only provides more reliable results but also runs more than 500 times faster.
Two test set cases are selected to investigate the underlying reason for the errors in certain predictions: GLU7 predictions for hen egg white lysozyme conformers and ASP26 predictions for recombinant human thioredoxin conformer (Fig. 6). The hen egg white lysozyme (HEWL) test set comprises seven different crystallographic structures with multiple conformer configurations for the GLU7 residue (Fig. 6a). In all HEWL conformers, there is at least one positively charged residue within 5 Å of GLU7: ARG5 in all conformers, LYS1 in every conformer except 1E8L, and ARG14 in all conformers except 1E8L, 1LSA, and 4LYT. It is observed that GLU7 in three conformers (1AKI, 1LSA, and 4LYT) is in close proximity to LYS1, promoting an H-bond interaction (GLU7–LYS1 side-chain distance < 3.0 Å). In the other four HEWL conformers, there is no H-bond interaction between these residues since GLU7 is rotated away from the LYS1 residue. Interestingly, the prediction errors for the conformers with the GLU7–LYS1 side chain interaction are lower than 1.0 while the prediction errors for the conformers that do not contain this interaction are higher than 1.0 pKa unit. The prediction errors for the same residue with CPHMD simulations were reported to be approximately 0.8 and 1.3 with the explicit solvent and implicit solvent, respectively.45 These results indicate that the model is highly sensitive to the conformational states of the residues and provides results similar to those of CPHMD simulations.
Another test case is the ASP26 on recombinant human thioredoxin (PDB IDs: 3TRX and 4TRX). Here we see prediction errors of more than 4.0 pKa units for both conformers. The pKa of this residue is reported as 9.9, which indicates that this residue is in the neutral form. Thus, the effect of different ASP26 states (charged and neutral) on thioredoxin is investigated with conformers obtained from molecular dynamics (MD) simulations. Since there is no distinctive conformational difference between two thioredoxin crystallographic structures, simulations were performed only with 3TRX. After 50 ns long MD simulations, the trajectories are clustered to find the most populated cluster and its representative (Fig. 6b). These representatives (negatively charged ASP: MD-ASP26, neutral ASP: MD-ASH26) are then used to predict the pKa values of ASP26. In the case of the neutral ASP residue in the MD-ASH26 conformer, the proton on the side chain is removed before the pKa prediction since the model is trained with negatively charged ASP. It is observed that the ASP26 conformation does not alter drastically, but the conformations of three surrounding residues (SER28, LYS39, GLU56) are affected with different ionization states of ASP. In both test set and MD-ASP26 conformers, LYS39 and GLU56 share a hydrogen bond, while this interaction does not exist in the MD-ASH26 conformer.
Additionally, the hydrogen bond interactions between ASP26 and SER28 in both test set and MD-ASP26 conformers are not observed in MD-ASH26. Instead, SER28 in MD-ASH26 forms a hydrogen bond interaction with GLU56. Predictions with the ML model reveal that the error increases with the MD-ASP26 conformer (error = 6.18) and reduces more than 1.5 units with the MD-ASH26 conformer (error = 2.53) relative to the test set conformer. These results point out the conformer sensitivity of the ML model and possible discrepancies between the crystallographic and the experimental conformers that cause the prediction error.
Final ML models are trained using both the training and the test datasets following the same procedure for feature elimination and tests with 10-fold cross-validation. The accuracy of the predictions is compared with PROPKA and the null model. All results are depicted in Fig. 7. The RMSE values for all amino acid types are below 1.0 with ML models, while PROPKA predictions, except for ASP, yield higher RMSE values than the null models. Our model is found to be more accurate for GLU, ASP, HIS, and LYS residues relative to DelPhiPKa benchmarks without salt concentration. When the salt concentration is included in DelPhiPKa benchmarks, accuracies for LYS and ASP are comparable. Both benchmarks use a different dataset consisting of 752 residues on 82 proteins.147 A similar pattern is observed for MAE values. Final ML models predict experimental pKa values with MAEs below 0.5, while MAEs obtained with PROPKA predictions are substantially higher.
Interestingly, PROPKA has MAEs similar to or even worse than the null models. To our knowledge, the model presented in this work is the first empirical model that performs statistically significantly better than the null model for all titratable residues. Finally, the coefficient of determination for pKa predictions with ML models is at least twice as large as that of PROPKA for all amino acid types.
Exploring the high dimensional pKa training and test data in terms of similarity is impossible without dimensionality reduction. Thus, t-SNE146 is used to reduce the high dimensional data by transforming it into two-dimensional similarity maps. Such visualization allowed us to align similar residues and cross-reference them with the corresponding pKa values. 2D t-SNE maps for GLU and HIS amino acids are given in Fig. 8 (see Fig. S.6† for LYS and TYR amino acids). Generally, residues with high or low experimental pKa values are separated except for some outliers, and residues on the same class of proteins form small clusters together. For instance, GLU7 from hen egg-white lysozyme (HEWL) and turkey egg-white lysozyme (TEWL) form clusters a_i (Fig. 8a). Among these clusters, a_5 involves entries from both species (TEWL PDB IDs: 1LZ3, 135L and HEWL PDB IDs: 1LSA, 1LSE, 1LYS). The clusters labeled b_i in Fig. 8a correspond to the GLU35 residues on HEWL and TEWL proteins. Other examples of such clusters correspond to GLU2 residues on bovine Ribonuclease A (cluster c, Fig. 8a), and the GLU73 residue on Barnase (clusters d_i, Fig. 8a). A similar pattern is observed with the HIS amino acid (Fig. 8b). Residues in the same class of proteins form small clusters such as cluster a for GLU162 on Bacillus agaradhaerens family 11 xylanase, cluster b for HIS36 on myoglobin from sperm whale and horse, and clusters c_i for HIS72 on bovine tyrosine phosphatase.
As mentioned before, pKa models are sensitive to conformers, and t-SNE maps show some outliers. An example of such cases can be seen in Fig. 9, which depicts the t-SNE map for ASP amino acids. For instance, ASP26 in the recombinant human thioredoxin conformer in the test set (PDB ID: 3TRX) is an outlier (arrow a in Fig. 9) on the t-SNE map. This point is in proximity to ASP67 on the tenth type III cell adhesion module of human fibronectin (PDB ID: 1FNA, pKa = 4.2), ASP77 on fungal elicitor (PDB ID: 1BEG, pKa = 2.61), and ASP28 on black rat cell adhesion molecule CD2 (PDB ID: 1HNG, pKa = 3.57). The experimental pKa of ASP26 on human thioredoxin is 9.9 while its neighbors all have pKa values below 5.0, which results in a high prediction error. The positions of residues from MD simulations (MD-ASH26: neutral ASP and MD-ASP26: negatively charged ASP) are shown with arrows b and c in Fig. 9. The t-SNE map shows that the MD-ASH26 conformer (arrow b) neighbors thioredoxin from E. coli (PDB ID: 2TRX, pKa = 7.5). In contrast, the MD-ASP26 conformer (arrow c) is a neighbor to bovine ribonuclease A ASP14 (PDB ID: 3RN3, pKa = 2.0). The error of pKa prediction increases with the MD-ASP26 conformer and decreases with the MD-ASH26 conformer. These observations point out that the descriptors obtained from ANI-2x NNP can effectively predict the pKa of an amino acid by describing its environment. The prediction errors are closely related to the differences between the crystal and the experimental conformers.
Conclusion
The presented work demonstrates the capabilities of neural network potentials to provide pKa descriptors for knowledge-based methods. The learned representation can be used to describe the chemical environment of amino acids in proteins. As the neural network potentials emerge as an alternative to the all-atom potentials, reliable pKa descriptors can be obtained faster with their employment. The ML model presented in this work is the first empirical model that performs significantly better than the null model for all titratable residues with a runtime of ∼0.2 s per residue. The code and models are available at https://github.com/isayevlab/pKa-ANI.
A new empirical scheme for pKa prediction of amino acids in proteins uses an ML model with descriptors calculated on ANI-2x NNP. The quantum mechanical information, which depends on the local chemical environment, is obtained from the top layers of neural network embeddings. These descriptors are used for training with the RF model to predict pKa values. It is found that the adoption of RFE slightly improves the accuracy and yields the number of features ranging from 25 to 100 in the final model.
The accuracy of the pKa estimations is assessed via 10-fold CV, and the results are compared with the null models and PROPKA predictions. The model presented in this work performs better than both the null model and PROPKA. The RMSE of the pKa predictions is below 0.7 except for HIS (0.72) with both the initial and the final models. The MAEs for all amino acid types are below 0.5, again for both the initial and the final models. In the case of PROPKA, the calculated RMSEs are over 1.0 except for GLU and LYS residues, for which they are still over 0.7. The computed MAEs for PROPKA predictions (all above 0.6) show that PROPKA performs roughly on par with, if not worse than, the null model.
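A comparison of this kind can be sketched with out-of-fold predictions from 10-fold CV. The example below uses synthetic data and assumes, for illustration only, that the null model is a constant predictor returning the mean of each training fold (scikit-learn's `DummyRegressor`); the null model used in this work may be defined differently. The helper `cv_errors` is a hypothetical name.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic descriptors and target values (not real pKa data).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = 4.0 + 1.2 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.2, size=300)


def cv_errors(model, X, y, n_splits=10, seed=0):
    """Return (MAE, RMSE) computed from out-of-fold predictions."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    pred = cross_val_predict(model, X, y, cv=folds)
    residual = y - pred
    mae = float(np.mean(np.abs(residual)))
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    return mae, rmse


# Assumed null model: always predict the mean of the training fold.
null_mae, null_rmse = cv_errors(DummyRegressor(strategy="mean"), X, y)
rf_mae, rf_rmse = cv_errors(
    RandomForestRegressor(n_estimators=100, random_state=0), X, y
)

print(f"null model: MAE={null_mae:.2f}, RMSE={null_rmse:.2f}")
print(f"RF model:   MAE={rf_mae:.2f}, RMSE={rf_rmse:.2f}")
```

Reporting both metrics is informative because the RMSE weights large errors more heavily than the MAE, so a few badly predicted residues widen the gap between the two.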
Further evaluations with an external test set not included in the training data show a slight increase in RMSEs and MAEs. From the external test set, two cases are selected to explore the principal source of errors. In the first case, the conformational differences of GLU7 in HEWL structures and their respective prediction errors indicate that the ML model is sensitive to conformational differences. The second case involves representative structures for ASP26 of recombinant human thioredoxin obtained from MD simulations (with both neutral and ionized ASP26 side chains). The pKa predictions with these representatives confirm the conformational sensitivity of the ML model. Conceptually, a protein pKa predictor should be sensitive to conformational alterations, and the two test cases demonstrate that the ML model can distinguish different conformational states. Therefore, the errors obtained with the presented models are closely related to the conformational discrepancies between the crystal (fixed) and experimental (flexible) structures.
As with any model, the present approach has limitations. Some of them, such as the absence of Cys and Ser, can be overcome by adding more training data and mining pKa values from the primary literature. Future work will aim to extend the present model to coenzyme and cofactor effects. The current ANI descriptor covers only biogenic elements and has not been parametrized for metals; therefore, all HETATM entries in PDB files are ignored. Other limitations would require the development of a new approach, for instance the inclusion of ionic strength or different solvents in the NN descriptors.
Data availability
The code and ML models are available at https://github.com/isayevlab/pKa-ANI.
Author contributions
O. I. conceived the idea. H. G. carried out method implementation and performed all calculations. All authors critically contributed to the design of the project, analysis of results, and writing of the manuscript. O. I. supervised and acquired funding for the project.
Conflicts of interest
There are no conflicts to declare.
Acknowledgments
The authors acknowledge Dr Adrian Roitberg for his invaluable insights and discussions. We acknowledge support from NSF CHE-2041108. This publication resulted in part from research supported by the Office of Naval Research (ONR) through the Energetic Materials Program (MURI grant no. N00014-21-1-2476). We also acknowledge the Extreme Science and Engineering Discovery Environment (XSEDE) award CHE200122, which is supported by NSF grant number ACI-1053575. This research is part of the Frontera computing project at the Texas Advanced Computing Center. Frontera is made possible by the National Science Foundation award OAC-1818253. We gratefully acknowledge the support and hardware donation from NVIDIA Corporation and express our special gratitude to Jonathan Lefman.
Electronic supplementary information (ESI) available. See DOI: 10.1039/d1sc05610g.