Author manuscript; available in PMC: 2022 Sep 16.
Published in final edited form as: J Comput Aided Mol Des. 2021 Aug 7;35(9):953–961. doi: 10.1007/s10822-021-00411-8

Stacking Gaussian Processes to Improve pKa Predictions in the SAMPL7 Challenge

Robert M Raddi 1, Vincent A Voelz 1
PMCID: PMC9478567  NIHMSID: NIHMS1834228  PMID: 34363562

Abstract

Accurate predictions of acid dissociation constants are essential to rational molecular design in the pharmaceutical industry and elsewhere. There has been much interest in developing new machine learning methods that can produce fast and accurate pKa predictions for arbitrary species, as well as estimates of prediction uncertainty. Previously, as part of the SAMPL6 community-wide blind challenge, Bannan et al. approached the problem of predicting pKas by using Gaussian process regression to predict microscopic pKas, from which macroscopic pKa values can be analytically computed.1 While this method can make reasonably quick and accurate predictions using a small training set, its accuracy was limited by the narrow range of chemical space covered by the training set (e.g., polyprotic acids were not included). Here, to address this issue, we construct a deep Gaussian process (GP) model that can include more features without invoking the curse of dimensionality. We trained both a standard GP and a deep GP model using a database of approximately 3500 small molecules curated from public sources, filtered by similarity to targets. We tested the models on both the SAMPL6 and the more recent SAMPL7 challenge, whose test sets again contained ionizable sites and/or environments poorly represented in the training set. The results show that while the deep GP model made only minor improvements over the standard GP model for SAMPL6 predictions, it made significant improvements over the standard GP model in SAMPL7 macroscopic predictions, achieving an MAE of 1.5 pKa units.

Introduction

The negative logarithm of the acid dissociation constant, pKa = −log10 Ka, is fundamentally important in drug design. Absorption, metabolism and distribution of a drug are all greatly affected by the protonation state of the compound under various pH conditions.2,3 pKa values may be sought for molecules that have yet to be synthesized, or to further understand fundamental reactions. As a consequence, accurate predictions of acid dissociation constants are essential for pharmaceutical companies as well as many other industries.

The problem of pKa prediction continues to be studied due to its significance, and the difficulty of making accurate predictions. The SAMPL challenge provides a unique opportunity to blindly evaluate model performance for pKa prediction.4 In the SAMPL6 challenge, predictions for 24 small molecules were made.5 A survey of the prediction methods employed by SAMPL6 participants shows that most predictions used QM (quantum mechanical)/linear regression methods, while only a handful of participants used QSPR/ML (quantitative structure property relationship/machine learning) methods. In the former, rigorous quantum mechanical calculations are performed to get standard free energies, which are then fitted to linear regression models of experimental data to extract parameters for LFER (linear free energy relationship). QM methods can achieve very good agreement with experiment, but are typically computationally demanding. Alternatively, QSPR/ML methods have less computational expense, and because of this have recently gained much attention, especially in the pharmaceutical industry.6–8 QSPR/ML methods can make quick and accurate predictions using a curated database of experimental pKa data combined with physical, chemical and structural descriptors to be used as a training set for a machine learning model.

One particular QSPR/ML approach used in SAMPL6 was a Gaussian process (GP) model from Bannan et al.1 This model was trained on physical and chemical descriptors that relate to the deprotonation energy. Ten feature calculations were made for each of the 2,700 (2443 monoprotic) small molecules in a private dataset.9 Bannan et al. showed that GP models have the generality to produce reasonably accurate predictions and uncertainties in those predictions for any type of ionizable group, performing at about the median of all SAMPL6 participants.

Encouraged by this finding, we set about attempting to reproduce these results using a training database of small molecules curated from public sources. However, this endeavor revealed two key flaws with the GP approach: (1) the chemical space in the training set was narrowly limited to only monoprotic acids, and (2) the method performs poorly when there is a lack of suitably similar ionizable sites and/or environments between the training set and the test set.1

Here, we attempt to remedy these issues through the use of deep Gaussian process (GP) models, which can increase model robustness when there is low structural similarity between the training and test sets, by enabling a larger number of features to be used. Using a deep GP model, the chemical space in the training set can be expanded to include a greater number of polyprotic molecules, although a large number of monoprotic molecules are still required. We train both a standard GP model and a deep GP model on a set of physicochemical descriptors of molecules from a hand-curated database derived from public sources, and then test both models using molecules from the SAMPL6 and SAMPL7 challenges.

Below, we describe our methodology for constructing a training database of ionizable molecules with experimental pKa measurements, and extracting molecular features to train standard and deep GP models for pKa prediction. We discuss in detail the molecules included in the SAMPL6 and SAMPL7 challenges, and compare the results of standard and deep GP models on these molecular targets.

Computational Methods

Overview of the method.

The Gaussian process models used in this work are trained to predict microscopic pKa values corresponding to the free energy of dissociating a proton from a specific microscopic species, AH → A−. The pKa value measured in experiment corresponds to the macroscopic pKa, which can be calculated from the complete set of microscopic pKa values (i.e., the network of all possible single-proton dissociations; see SI).10

To predict the experimental pKa of a molecule, the following steps are performed: First, the input molecule is provided as a SMILES string, along with a network of possible microstate transitions. Next, a set of quantitative physical features are calculated to describe each microscopic transition. These features are used by the Gaussian process model to predict microscopic pKa values and their uncertainties. Finally, the set of microscopic pKa values are used to calculate a macroscopic pKa prediction and its uncertainty.
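
To make the micro-to-macro conversion concrete, the sketch below (Python/NumPy) handles the simplest case, in which a single protonated species can lose its proton along several parallel micro-transitions; the macroscopic Ka is then the sum of the microscopic Ka values. The general network case described in the SI requires summing over all microstate populations; the function name and example values here are illustrative, not part of our code base.

    import numpy as np

    def macro_pka_parallel(micro_pkas):
        # Macroscopic pKa when one protonated species AH can deprotonate to
        # several microstates in parallel: Ka(macro) = sum_i Ka_i(micro).
        micro_pkas = np.asarray(micro_pkas, dtype=float)
        return -np.log10(np.sum(10.0 ** (-micro_pkas)))

    # Two parallel micro-deprotonations with pKa 6.5 and 7.2 give a macroscopic
    # pKa of about 6.4, lower than either microscopic value.
    print(macro_pka_parallel([6.5, 7.2]))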

Featurization of input molecules.

The flexibility of the standard Gaussian process model comes from the ability to relate molecular features to the variation in deprotonation energy.1 For each of the molecules in our training database (Figure 2) (and later, for each molecular pKa prediction), we first perform a 3-D structure minimization using MMFF94s11 for 100 conformers with a maximum of 5000 iterations, and then calculate ten different molecular descriptors as training features. Six of the features are AM1-BCC partial charges12 computed using Open Force Field13 and RDKit.14 These are the partial charges of the atom of interest (AOI, the atom from which the proton dissociates), of atoms 1 bond away from the AOI, and of atoms 2 bonds away from the AOI, in both the A− and AH forms.
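
A minimal RDKit sketch of this conformer pre-processing step is given below; it assumes the MMFF94s variant and default embedding parameters, and the example SMILES string is purely illustrative.

    from rdkit import Chem
    from rdkit.Chem import AllChem

    def minimize_conformers(smiles, n_confs=100, max_iters=5000):
        # Embed multiple conformers and minimize each with the MMFF94s force
        # field, returning the molecule and the index of its lowest-energy conformer.
        mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
        AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=42)
        results = AllChem.MMFFOptimizeMoleculeConfs(
            mol, mmffVariant="MMFF94s", maxIters=max_iters)  # list of (flag, energy)
        energies = [energy for _, energy in results]
        best = min(range(len(energies)), key=energies.__getitem__)
        return mol, best

    mol, best_conf = minimize_conformers("c1ccc(cc1)S(=O)(=O)N")  # e.g. benzenesulfonamide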

Figure 2:

Distribution of (a) molecular weight (avg = 180 Da) and (b) pKa (avg = 5.94) for the molecules in the database as described in the main text. Solid red lines denote the mean.

The seventh and eighth features are the change in solvation free energy and the change in enthalpy along AH → A−, both computed using the OpenEye toolkits.15 The solvation free energy is the free energy change of moving a species from the gas phase into dilute aqueous solution. It is calculated using the AM1-BCC partial charges as input to a continuum solvent model and a Poisson–Boltzmann surface area solver. This calculation is performed on the lowest-energy gas-phase conformation from the ensemble generated with the MMFF94s force field.

The ninth and tenth molecular features are the solvent-accessible surface area of the deprotonated AOI, calculated via the Shrake–Rupley algorithm,16 and its partial bond order, obtained from the Mulliken overlap populations computed with the extended Hückel molecular orbital method; both are calculated using RDKit.

In addition to physicochemical features, structural descriptors are used as features for the deep GP model (see below). There are many methods that successfully utilize topological fingerprints as features,6,8,17 which are useful for selecting training molecules most similar to the test set of input molecules. Here, we use Morgan fingerprints18 as features for our deep GP model. In short, Morgan fingerprints are topological descriptors compacted into long bit-vectors (black and white boxes) describing fragments (highlighted red) within a given molecule as shown in Figure 1.
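
For illustration, a Morgan fingerprint bit-vector can be generated with RDKit as follows; the radius of 2 and 2048-bit length are common defaults, not values specified in the text, and the example molecule is arbitrary.

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("c1ccc(cc1)S(=O)(=O)N")            # example molecule
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    on_bits = list(fp.GetOnBits())                               # indices of fragments present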

Figure 1:

Structural fingerprints are converted into a bit-vector to store information regarding occurrences of specific molecular fragments.

Standard Gaussian process model.

Gaussian process (GP) models treat the prediction of microscopic pKa values (i.e., single protonation/deprotonation equilibria among a network of many possible micro-transitions) as a regression problem. The method seeks to estimate (with uncertainty) an unknown function f(x) that maps input features x to microscopic pKa values, which can then be converted into relative free energies. To do this, a GP models a distribution over functions consistent with a set of known data points (the training set) and uses the mean of that distribution (the posterior mean) to make predictions given new input data.19,20

Gaussian process (GP) regression relies on a mean function μ(x) and a covariance function k(x, x′). Here, we use a zero-mean Gaussian process; that is, f(x) ~ GP(μ(x) = 0, k(x, x′)), defined by an expected location in variable space and a relationship by which different variables are correlated with one another. Two well-known software packages that offer Gaussian process regression are Scikit-learn21 and GPy.19 We have used both packages and verified that they give the same results.

The covariance function is often called a kernel; it determines the similarity between two input feature vectors x and x′. A collection of these functions represents the joint variability of the model and can be finely tuned, i.e., learned, to best represent the data. Here, we use the Matérn 3/2 kernel for the standard GP model and the RBF kernel for the deep GP models (see below). The hyperparameters include the length scales l1:d, which describe how quickly the unknown function changes as x is varied.22 The optimal choice of these parameters depends on the training data.
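
A minimal GPy sketch of such a regression model is shown below; it assumes a feature matrix X of shape (N, 10) and microscopic pKa targets y of shape (N, 1), and the data shown are synthetic placeholders rather than our actual training set.

    import numpy as np
    import GPy

    X = np.random.rand(200, 10)                  # placeholder feature matrix
    y = 10.0 * np.random.rand(200, 1)            # placeholder microscopic pKa values

    kernel = GPy.kern.Matern32(input_dim=X.shape[1], ARD=True)   # one length scale per feature
    model = GPy.models.GPRegression(X, y, kernel)
    model.optimize(messages=False)               # maximize the marginal log likelihood

    mean, var = model.predict(np.random.rand(5, 10))   # posterior mean and variance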

Deep Gaussian process model.

Deep neural networks can be used to significantly increase the number of features in a model to obtain better predictions, but such methods usually require large training sets. One of the major benefits of Gaussian process models is that small training sets can be used. Here, we propose a deep Gaussian process approach that allows many more features to be used with a relatively small training set. To do this, we stack GPs such that each layer receives the posterior mean of the previous layer together with the original inputs. This can be viewed as a composite multivariate function g(x) = fl(fl−1(…f1(X))), where each fi is given by a Gaussian process. For this deep model, we use a Python package called DeepGPy.20
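
The sketch below illustrates the stacking idea only, using scikit-learn GPs in which each layer is trained on the original features augmented with the posterior mean of the previous layer; it is a simplified stand-in, not the variational deep GP (DeepGPy) used in this work, and the function names are illustrative.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def fit_stacked_gp(X, y, n_layers=3):
        # Each layer sees the original inputs plus the previous layer's posterior mean.
        models, Z = [], X
        for _ in range(n_layers):
            gp = GaussianProcessRegressor(kernel=RBF(length_scale=np.ones(Z.shape[1])),
                                          normalize_y=True)
            gp.fit(Z, y)
            models.append(gp)
            Z = np.hstack([X, gp.predict(Z).reshape(-1, 1)])
        return models

    def predict_stacked_gp(models, X_new):
        Z = X_new
        for gp in models:
            mean, std = gp.predict(Z, return_std=True)
            Z = np.hstack([X_new, mean.reshape(-1, 1)])
        return mean, std                # prediction and uncertainty from the final layer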

Random Forest model.

The Random Forest (RF) regression model is an ensemble learning method that averages over decision trees constructed from sub-samples of the training data to make predictions. In recent work, Yang et al. screened many machine learning (ML) methods to compare their performance in predicting pKas in diverse solvents.23 Yang and colleagues showed that the prediction errors of RF and standard GP models are very similar. Here, we use the out-of-the-box RF regressor from Scikit-learn.21
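
For reference, the out-of-the-box regressor amounts to the following; the data shapes are placeholders and the hyperparameters are scikit-learn defaults.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    X_train, y_train = np.random.rand(200, 10), 10.0 * np.random.rand(200)  # placeholders
    rf = RandomForestRegressor(random_state=0)      # default hyperparameters
    rf.fit(X_train, y_train)
    pka_pred = rf.predict(np.random.rand(5, 10))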

A curated database from public sources.

As a first step in constructing a ML model for pKa prediction, we curated a database24 of small molecules with experimentally measured pKa values. Regardless of the source, the ionized proton must be known for all of the microstates in our curated database; fortunately, OCHEM25 provides the SMILES string of the ionizable center. Since the training data used by Bannan et al.1 was proprietary (OpenEye Scientific Software), we opted to hand-curate a custom database from public sources,25–28 amounting to approximately 3500 small molecules with molecular weights not exceeding 500 Da (Figure 2a). The histogram of pKa values for the molecules in the database reveals a bimodal distribution (Figure 2b) with the highest frequency around a pKa of 4. The complete database is freely available at https://github.com/robraddi/GP-SAMPL7.

Model fitness and selection.

To optimize the model, we measure model fitness using a 3-fold cross-validation technique. By this approach, we can determine the inherent performance of the model by splitting the data into separate training and testing sets, to see whether a model parameterized by the training data can predict the testing data. The kernel hyperparameters {σ, l}1:d for a given model are optimized by selecting, from the batch of cross-validation experiments, the model with the lowest error, highest R2 value and maximum log likelihood. Measuring model fitness in this way also helps select model hyperparameters such as the type of kernel, combinations of kernels, the number of layers, the number of inducing variables, the number of molecules (N) to use for training, and even the types of molecules to use (monoprotic or polyprotic).
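
A minimal sketch of this 3-fold cross-validation scoring is given below; the builder function, metrics and variable names are illustrative, and any candidate specification (kernel choice, number of layers, training-set size, etc.) would be scored in the same way.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.metrics import mean_absolute_error, r2_score

    def cross_validate(build_model, X, y, n_splits=3):
        # Score one candidate model specification with k-fold cross-validation.
        maes, r2s = [], []
        for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                         random_state=0).split(X):
            model = build_model()                   # fresh, unfitted regressor
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            maes.append(mean_absolute_error(y[test_idx], pred))
            r2s.append(r2_score(y[test_idx], pred))
        # Candidates with the lowest mean MAE and highest mean R2 are retained.
        return np.mean(maes), np.mean(r2s)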

Damianou et al. incorporate inducing variables inside the DeepGPy module to reduce the computational complexity of the model.20 Inducing variables enable a significant reduction in the number of model parameters for each layer by replacing the true Bayesian posterior with a variational approximation.20,29

After many parallel cross-validation experiments, we found that the most robustly predictive models typically contain more monoprotic acids and have the greatest similarity percentage. The training-set size (number of molecules) typically falls within 1000 ≤ N ≤ 1500; too few or too many compounds in the training set produce weak models (low R2 and larger errors). The similarity percentage metric represents the average structural similarity of each molecule in the training set with the molecule of interest. For example, Table S1 shows the average similarity percentage between the training set and the SAMPL7 molecules.

After exploring many different models for the stacked GP, our optimized model uses an RBF kernel, six stacked layers with 200 inducing variables per layer, and 1474 small molecules in the training set. When polyprotic acids are included in the training set, we limit molecules to those with fewer than four ionizable groups and pKa values in the range −5 ≤ pKa ≤ 15.

Similarity filtering to select subsets of relevant training data.

Based on our initial cross-validation results, we found that including too many training molecules results in less accurate models. It is therefore desirable to limit the number of molecules in the dataset, selecting only those with structures similar to a given prediction target (i.e., the SAMPL molecules), the idea being that closely-spaced input features should have similar (and more predictable) pKas. We hypothesized that specifying a similarity threshold to filter the database, leaving only molecules that match the similarity criteria, would result in improved models. The outcomes of these filtering efforts are discussed in Results.

To perform this filtering, Tversky similarity is used with Morgan structural fingerprints.18 We chose the Tversky similarity metric as it is a generalization of the Tanimoto similarity metric. We calculated the average similarity between the SAMPL molecules and the molecules in the training set (~18% similarity). We then include molecules from the database in our training set if their similarity to any of the SAMPL molecules is greater than or equal to this average.
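
A minimal RDKit sketch of this filtering step follows; the Tversky weights of 0.5/0.5 (which reduce the metric to the Dice coefficient) and the fingerprint parameters are illustrative choices, not values stated in the text.

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def morgan_fp(smiles, radius=2, n_bits=2048):
        return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles),
                                                     radius, nBits=n_bits)

    def similarity_filter(database_smiles, target_smiles, alpha=0.5, beta=0.5):
        # Keep database molecules whose best Tversky similarity to any target
        # molecule is at least the average database-target similarity.
        db_fps = [morgan_fp(s) for s in database_smiles]
        tgt_fps = [morgan_fp(s) for s in target_smiles]
        sims = [[DataStructs.TverskySimilarity(d, t, alpha, beta) for t in tgt_fps]
                for d in db_fps]
        avg = sum(sum(row) for row in sims) / (len(db_fps) * len(tgt_fps))
        return [s for s, row in zip(database_smiles, sims) if max(row) >= avg]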

The SAMPL7 challenge.

In the SAMPL7 physical property challenge, pKa prediction submissions consist of standard-state relative free energies of micro-transitions for 22 small molecules, as described in the recent work of Gunner et al.10 Details for SAMPL7 are available online (http://github.com/samplchallenges/SAMPL7). These details include the experimental measurements for the 22 sulfonamide derivatives,30 submitted predictions from all participants, and a general analysis of the competition. Our SAMPL7 submission consists of relative free energies computed for dominant microstate transitions using the standard GP model. The macroscopic pKa predictions below are computed using all microstate transitions.

Results

Correlation of input features

We first examine the distribution of training molecules in the ten-dimensional feature space used in our GP models. A visualization of the joint probability distribution of any two input features, along with the experimental pKa, shows that some features are highly correlated, while others are not (Figure 3). The correlation between input features is found by scaling the covariance by the product of the standard deviations of each variable: the greater the linearity, the more closely correlated the two variables are. The correlation contours in Figure 3 reflect the optimized kernel parameters.
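
In practice this scaling amounts to a Pearson correlation matrix, for example (placeholder shapes and values):

    import numpy as np

    X = np.random.rand(500, 10)                  # ten physicochemical features
    pka = 10.0 * np.random.rand(500, 1)          # experimental pKa values
    data = np.hstack([X, pka])

    # Covariance scaled by the product of standard deviations = Pearson correlation.
    corr = np.corrcoef(data, rowvar=False)       # (11, 11) matrix; last row/column vs. pKa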

Figure 3:

Correlations between input features and pKa values from the dataset. Input features: AM1-BCC partial charge of the atom of interest (AOI), protonated (PC (AH)) and deprotonated (PC (A)); avg. AM1-BCC partial charge of atoms 1 bond away from the AOI, protonated (PC (AH 1ba)) and deprotonated (PC (A 1ba)); avg. AM1-BCC partial charge of atoms 2 bonds away from the AOI, protonated (PC (AH 2ba)) and deprotonated (PC (A 2ba)); difference in solvation free energy (ΔGsolv); SASA (Shrake–Rupley) of the deprotonated AOI; bond order (Mulliken overlap populations) (B.O.); difference in enthalpy (ΔH). Histograms are shown along the diagonal, with a red line denoting the mean. Correlation curves overlay the raw feature data (gray dots) for 1σ (red), 2σ (black) and 3σ (blue).

Kernel parameters were obtained using the three-fold cross-validation technique, selecting the model with the lowest mean absolute error (MAE), highest R2 value and highest log likelihood; highest priority was given to the model with the lowest MAE, then the highest R2. With the optimized kernel parameters for each model, predictions were made for all micro-transitions of the SAMPL6 and SAMPL7 molecules. All SAMPL molecules were left out of the training sets for the results shown below. Six of the SAMPL6 molecules had a few transitions that trigger known software issues in the feature calculations (e.g., errors in calculating the free energy of solvation, likely an issue with the optimization of conformers). In these cases, the micro-transition is omitted and no micro-pKa prediction is made for it.

Performance of standard and deep GP models

Macroscopic pKa predictions and their uncertainties are shown in Figure 4 for SAMPL6 (a) and SAMPL7 (b). In both sets of targets, the prediction statistics suggest that the deep GP (blue) has lower mean absolute error and higher R2 than the standard GP (black).

Figure 4:

Macroscopic pKa predictions for (a) SAMPL6 and (b) SAMPL7 compounds. Model statistics for the standard GP are in black (bottom right) and for the stacked GP in blue (top left): coefficient of determination (R2), mean absolute error (MAE), mean squared error (MSE) and root mean squared error (RMSE). The shaded inner region denotes agreement within 1 pKa unit, and the outer region within 2 pKa units.

A technical issue in comparing our predicted macroscopic pKa with the experimental values is worth mentioning here. Experiments may report a single pKa value, while our model sometimes predicts multiple macroscopic pKa (for different deprotonation events). To decide which predicted macroscopic pKa to compare to experiment, we use a minimal distance criterion, i.e. the smallest difference between the predicted and observed pKa values. Other studies have used different criteria, such as the Hungarian algorithm.1
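
A sketch of the minimal-distance criterion, with illustrative values:

    import numpy as np

    def closest_prediction(predicted_pkas, experimental_pka):
        # Compare experiment against the predicted macroscopic pKa closest to it.
        predicted_pkas = np.asarray(predicted_pkas, dtype=float)
        return predicted_pkas[np.argmin(np.abs(predicted_pkas - experimental_pka))]

    print(closest_prediction([4.8, 9.6], 5.1))   # -> 4.8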

Performance on SAMPL6 targets.

Our standard GP model performed reasonably well at predicting the macroscopic pKa of SAMPL6 targets, achieving a coefficient of determination R2 of 0.59, mean absolute error (MAE) of 1.38 and root mean squared error (RMSE) of 1.61 (Figure 4). These results are similar to those of Bannan et al.:1 R2 = 0.48, MAE of 1.39 and RMSE of 2.16, despite the fact that we use an entirely different database of molecules to train the model.

The deep GP model performed slightly better at predicting the macroscopic pKa of SAMPL6 targets, achieving an R2 of 0.61, MAE of 1.36 and RMSE of 1.62. The RF model outperformed both GP models in terms of MAE, yielding an MAE of 1.08, with an R2 of 0.50 and RMSE of 1.80.

Performance on SAMPL7 targets.

We anticipated that applying our GP models to SAMPL7 targets would result in poor predictions, due to the lack of similarity between molecules in the training set and the molecules of interest (MOIs) (see Table S1 for similarity percentages). The compounds in the SAMPL7 challenge contain sulfonamide groups, which are difficult to find in open-source databases; our database contains very few (~40) small molecules with sulfonamide groups.27 All of these were included in the training set regardless of the number of ionizable sites.

To overcome this issue, we applied a similarity filter to remove from the training set molecules that lacked significant similarity to our targets, as described in Methods. This filter was used to select a more focused training set for both the standard and deep GP models.

Without a similarity filter on the molecules included in the training set, the deep GP model predictions show moderately greater error, with an MAE of 1.7, RMSE of 2.0 and R2 of 0.64 (Figure S3). Likewise, RF model predictions without similarity filtering show slightly greater error, with an RMSE of 2.34 versus 2.26 when a similarity filter is applied (Figure S5). These results reflect the benefits of similarity filtering for the SAMPL7 molecules, as well as the advantage of the deep GP model over the standard GP.

With a similarity filter in place, we expected better results, especially for the deep GP model, which is able to utilize a greater number of effective features. Using the filtered training set, we found that the standard GP model yielded predictions with an R2 of 0.16, MAE of 3.05 and RMSE of 3.69. In contrast, the deep GP model gave an R2 of 0.49, MAE of 1.47 and RMSE of 1.89. These results demonstrate that a deep GP model is more robust than a standard GP model when there is a paucity of training data highly correlated with the target molecules.

Uncertainty estimates from GP models

Prediction of uncertainties is important for understanding and improving models, regardless of prediction accuracy.31 In our case, the predicted uncertainties of both the standard and deep GP models are helpful in understanding the information each type of model uses, and their potential to be successful predictors if presented with more data.

Uncertainties predicted by a standard GP model reflect insufficient training data.

One of the goals of this study was to develop a model that is robust when facing uncommon moieties. Arguably, the standard GP model fails this test for the SAMPL7 targets, although its predictions give insight into how GP models behave when faced with insufficient training data. Regardless of the target molecule, the standard GP model predicts a pKa near 6, with large prediction uncertainties. In this case, the GP model deals with unfamiliar input molecules by using the mean pKa value of the training set (~6), with large error bars, as the prediction (Figure 4b). The predictions revert to the mean because the input features of the molecule of interest lie far from those of the training set in feature space; this lack of correlation between the two sets of input features gives rise to the large uncertainty.

Uncertainties predicted by the deep GP model.

Unlike the standard GP model, the deep GP model was able to make relatively accurate predictions with reasonable uncertainties for most molecules. There are only a few instances of poor predictions with misguided uncertainties in the deep GP results: for example, SM22 from SAMPL6, a nitrogen heterocycle with iodide substituents (predicted pKa of 3.9 ± 1.1, experimental pKa of 7.43 ± 0.1), and SM28 in SAMPL7 (the only non-sulfonamide), an amide derivative with a nearby sulfone group (predicted pKa of 7.9 ± 1.5, experimental pKa of 12).

Upon stacking GPs, predictions are accompanied by relatively smaller uncertainties than those predicted by the standard GP model. For some predictions, however, the deep GP model appears slightly over-confident compared to the standard GP model. The large uncertainties predicted by both models suggest reasonable reporting of uncertainty for SAMPL6, and perhaps even more so for the SAMPL7 predictions. Our training set was not strongly conditioned to predict sulfonamide derivatives, and therefore lacked the ability to make accurate predictions.

Why is the deep GP model more predictive than the standard GP model? The deep GP takes into account a larger number of effective features, both by stacking GPs and by filtering the training set by additional structural features.

When trying to predict pKas for molecules that are not found in the training set, our results show the advantages of making the training set smaller and more selective. In doing so, the inputs have closer correlations with the training data. Since GP models work well with small training sets without affecting the validity of the model, a similarity filtering approach was implemented to filter out irrelevant compounds.

Discussion

It has been previously shown that Gaussian process regression models are general enough to make predictions with uncertainties for any ionizable group.1 However, QSAR/ML methods are limited by the prior information contained in the training data. Here, we have shown that GP models can be improved even with limited structural diversity in our training sets. The improvement we observe for deep GP models in this study raises the question of how predictive this approach might be if trained on high-quality, commercially available databases; we would expect a significant increase in the accuracy of the GP model predictions in that case.

Previous work has shown the benefits of curating a diverse training set. For example, Simulations Plus (developer of ADMET Predictor, Fraczkiewicz et al.6) recently partnered with Bayer Pharmaceuticals. Before that, Simulations Plus strictly used public data. With the inclusion of experimental data from Bayer Pharma, however, prediction statistics significantly improved, with R2 values increasing from 0.87 to 0.93 and MAE decreasing from 0.72 to 0.50.6

Here, even in the case of limited publicly available data, we have shown that deep GP models, trained on molecules similar to the targets, can lead to significant improvements over standard GP models. Our calculations show that a deep GP model yields more accurate results and increases the robustness of the model without a large or diverse training set. By extending the standard model to a deep GP model we are able to include more features and also allow a slight increase in the number of polyprotic molecules. While including polyprotic molecules increases the accuracy of macroscopic pKa predictions, it comes at a cost to the predicted relative free energies. In theory the free energy ΔGcycle of any protonation/deprotonation thermodynamic cycle should be zero, but this is not enforced by GP models, leading to increased errors for polyprotic molecules. This deviation (ΔGcycle ≠ 0) is indicative of inadequate representation, in the training data set, of microstate transitions similar to the microstate transitions of interest.
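
As a concrete check, the cycle-closure deviation for, say, a diprotic molecule can be computed from the microscopic pKas along the two paths connecting the same start and end microstates; the values below are hypothetical.

    import numpy as np

    RT_LN10 = 8.314e-3 * 298.15 * np.log(10.0)   # kJ/mol per pKa unit at 298 K

    def cycle_deviation(pkas_path_a, pkas_path_b):
        # Free-energy deviation around a protonation/deprotonation cycle;
        # a thermodynamically consistent model gives exactly zero.
        return RT_LN10 * (sum(pkas_path_a) - sum(pkas_path_b))

    print(cycle_deviation([4.2, 8.9], [5.0, 8.3]))   # about -1.1 kJ/mol, ideally 0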

As stated in the Methods section (see Model fitness and selection), we were forced to deviate from typical machine learning best practices due to the lack of similar microstate transitions in the database. A model should become more predictive with increasing dataset size, but here our standard model showed signs of worsening. Limiting the training-set size by selecting molecules carries a risk of overfitting. Fortunately, deep GPs are known to do well when data is scarce and are less prone to overfitting than typical neural networks.20 In situations where the training set is limited, the molecules that most strongly correlate with the MOIs should provide more accurate predictions. Although similarity filtering does not play a large role in the SAMPL6 results, it serves its purpose in SAMPL7, which is to enhance the accuracy of predictions when there is very little training data matching the MOIs.

A potential future direction for this work would be to study additional features as input to the deep GP model. Since the deep GP model permits higher dimensionality, descriptors beyond the ones described above could be explored in future challenges. These include: polar surface area of the protonated atom of interest (AOI), different types of structural fingerprints, HOMO and LUMO energies surrounding the AOI, polarizability, and more. As for the level of theory involved in the feature calculations, improvements in accuracy would be most beneficial for estimating the free energy of solvation. One concern, however, is whether the computational expense would outweigh the performance gain.

Overall, participating in the SAMPL7 challenge was a great way to demonstrate how the standard GP model suffers with a mediocre training set, and how deep GP models that stack GPs and filter out irrelevant molecules can overcome these limitations to an extent. The SAMPL physical property challenges provide excellent target molecules for testing our QSAR/ML models. We look forward to future SAMPL challenges, where we can apply more diverse training sets and incorporate many of the lessons learned here.

The database, code and results are publicly available at https://github.com/robraddi/GP-SAMPL7.

Supplementary Material


Acknowledgements

RMR and VAV are supported by National Institutes of Health grant R01GM123296. We thank the National Institutes of Health for its support of the SAMPL project via R01GM124270 to David L. Mobley (UC Irvine).

References

  • (1).Bannan CC; Mobley DL; Skillman AG SAMPL6 challenge results from pKa predictions based on a general Gaussian process model. Journal of Computer-Aided Molecular Design 2018, 32, 1165–1177.
  • (2).Gleeson MP Generation of a set of simple, interpretable ADMET rules of thumb. Journal of Medicinal Chemistry 2008, 51, 817–834.
  • (3).Manallack DT; Prankerd RJ; Yuriev E; Oprea TI; Chalmers DK The significance of acid/base properties in drug discovery. Chemical Society Reviews 2013, 42, 485–496.
  • (4).SAMPL Challenge. https://www.samplchallenges.org.
  • (5).Işık M; Bergazin TD; Fox T; Rizzi A; Chodera JD; Mobley DL Assessing the accuracy of octanol–water partition coefficient predictions in the SAMPL6 Part II log P Challenge. Journal of Computer-Aided Molecular Design 2020, 1–36.
  • (6).Fraczkiewicz R; Lobell M; Goller AH; Krenz U; Schoenneis R; Clark RD; Hillisch A Best of both worlds: Combining pharma data and state of the art modeling technology to improve in silico pKa prediction. Journal of Chemical Information and Modeling 2015, 55, 389–397.
  • (7).Shields GC; Seybold PG Computational Approaches for the Prediction of pKa Values; CRC Press, 2013.
  • (8).Fraczkiewicz R In silico prediction of ionization. 2013.
  • (9).pKa Prospector, OpenEye Scientific Software.
  • (10).Gunner MR; Murakami T; Rustenburg AS; Işık M; Chodera JD Standard state free energies, not pKas, are ideal for describing small molecule protonation and tautomeric states. Journal of Computer-Aided Molecular Design 2020, 1–13.
  • (11).Halgren TA Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry 1996, 17, 490–519.
  • (12).Jakalian A; Jack DB; Bayly CI Fast, efficient generation of high-quality atomic charges. AM1-BCC model: II. Parameterization and validation. Journal of Computational Chemistry 2002, 23, 1623–1641.
  • (13).Wagner J et al. openforcefield/openforcefield: 0.8.0 Virtual Sites and Bond Interpolation. 2020; 10.5281/zenodo.4121930.
  • (14).Landrum G et al. RDKit: Open-source cheminformatics. 2006.
  • (15).OpenEye Scientific Software. Cheminformatics Software; Molecular Modeling Software. http://www.eyesopen.com/.
  • (16).Shrake A; Rupley JA Environment and exposure to solvent of protein atoms. Lysozyme and insulin. Journal of Molecular Biology 1973, 79, 351–371.
  • (17).Xing L; Glen RC; Clark RD Predicting pKa by molecular tree structured fingerprints and PLS. Journal of Chemical Information and Computer Sciences 2003, 43, 870–879.
  • (18).Rogers D; Hahn M Extended-connectivity fingerprints. Journal of Chemical Information and Modeling 2010, 50, 742–754.
  • (19).GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, since 2012.
  • (20).Damianou A; Lawrence N Deep Gaussian processes. Artificial Intelligence and Statistics. 2013; pp 207–215.
  • (21).Pedregosa F; Varoquaux G; Gramfort A; Michel V; Thirion B; Grisel O; Blondel M; Prettenhofer P; Weiss R; Dubourg V et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 2011, 12, 2825–2830.
  • (22).Duvenaud D The Kernel Cookbook: Advice on covariance functions. https://www.cs.toronto.edu/~duvenaud/cookbook, 2014.
  • (23).Yang Q; Li Y; Yang J-D; Liu Y; Zhang L; Luo S; Cheng J-P Holistic prediction of pKa in diverse solvents based on a machine learning approach. 2020.
  • (24).Raddi R; Voelz V pKa database for Stacking Gaussian Processes to Improve pKa Predictions in the SAMPL7 Challenge. 2021; 10.5281/zenodo.5027418.
  • (25).Sushko I; Novotarskyi S; Körner R; Pandey AK; Rupp M; Teetz W; Brandmaier S; Abdelaziz A; Prokopenko VV; Tanchuk VY et al. Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. Journal of Computer-Aided Molecular Design 2011, 25, 533–554.
  • (26).Wishart DS; Feunang YD; Guo AC; Lo EJ; Marcu A; Grant JR; Sajed T; Johnson D; Li C; Sayeeda Z et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Research 2018, 46, D1074–D1082.
  • (27).Caine BA; Bronzato M; Popelier PL Experiment stands corrected: accurate prediction of the aqueous pKa values of sulfonamide drugs using equilibrium bond lengths. Chemical Science 2019, 10, 6368–6381.
  • (28).Settimo L; Bellman K; Knegtel RM Comparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharmaceutical Research 2014, 31, 1082–1095.
  • (29).Titsias M Variational learning of inducing variables in sparse Gaussian processes. Artificial Intelligence and Statistics. 2009; pp 567–574.
  • (30).Francisco KR; Varricchio C; Paniak TJ; Kozlowski MC; Brancale A; Ballatore C Structure property relationships of N-acylsulfonamides and related bioisosteres. European Journal of Medicinal Chemistry 2021, 113399.
  • (31).Nigam A; Pollice R; Hurley MFD; Hickman RJ; Aldeghi M; Yoshikawa N; Chithrananda S; Voelz VA; Aspuru-Guzik A Assigning confidence to molecular property prediction. 2021.
  • (32).Bochevarov AD; Watson MA; Greenwood JR; Philipp DM Multiconformation, density functional theory-based pKa prediction in application to large, flexible organic molecules with diverse functional groups. Journal of Chemical Theory and Computation 2016, 12, 6001–6019.
