bioRxiv
[Preprint]. 2024 Jul 4:2024.07.02.601740. [Version 1] doi: 10.1101/2024.07.02.601740

Step Forward Cross Validation for Bioactivity Prediction: Out of Distribution Validation in Drug Discovery

Udit Surya Saha 1, Michele Vendruscolo 1, Anne E Carpenter 2, Shantanu Singh 2, Andreas Bender 1,3, Srijit Seal 1,2,‡,*
PMCID: PMC11245006  PMID: 39005404

Abstract

Recent advances in machine learning methods for materials science have significantly enhanced accurate predictions of the properties of novel materials. Here, we explore whether these advances can be adapted to drug discovery by addressing the problem of prospective validation - the assessment of the performance of a method on out-of-distribution data. First, we tested whether k-fold n-step forward cross-validation could improve the accuracy of out-of-distribution small molecule bioactivity predictions. We found that it is more helpful than conventional random split cross-validation in describing the accuracy of a model in real-world drug discovery settings. We also analyzed discovery yield and novelty error, finding that these two metrics provide an understanding of the applicability domain of models and an assessment of their ability to predict molecules with desirable bioactivity compared to other small molecules. Based on these results, we recommend incorporating a k-fold n-step forward cross-validation and these metrics when building state-of-the-art models for bioactivity prediction in drug discovery.

1. Introduction

Recently, many advancements have been made in developing computational methods for predicting properties in materials science. Suitable validation methods have also been introduced to estimate the performance of these predictive models.13 Here, we investigated whether these validation methods can be translated into the field of drug discovery, addressing in particular the problem of prospective validation. Predictive models are trained and validated on the experimentally measured activity of libraries of compounds, but real-world use in drug discovery requires strong performance on out-of-distribution data.4 This is because the goal is often to accurately predict the properties of compounds that have not been synthesized yet. Inadequate prospective validation is a common issue in the drug discovery literature, often creating a mismatch between published studies and real-world use.5,6 This problem is less severe in domains such as materials science, where the underlying physical principles are often known3,7, and protein folding, where evolution led to a lower-dimensional underlying space of possible solutions8. However, this problem is significant in drug discovery because the chemical space is vast (more than 10^60 small molecules) and only explored to a limited extent, making it challenging to extrapolate to novel chemical series.5

Benchmarking state-of-the-art models is more reliable for real-world decision-making when it involves predicting compounds different from those in the training data space. However, most studies use cross-validation (CV) to evaluate models by randomly splitting the datasets into training and test sets.9 This approach typically suffers from a limited applicability domain because test compounds are often similar to compounds in the training set. To mitigate this problem, splitting datasets by chemical scaffold or by time has been proposed.10,11 Even though these splits can be repeated for multiple external test sets (for example, in repeated nested cross-validation), studies usually lack a detailed analysis of how variations in the drug discovery landscape and chemical space influence outcomes, for example by differentiating between compounds that are unlikely to be drug-like and those with desirable bioactivity and physicochemical properties.

To overcome these problems, one can take inspiration from machine learning (ML) studies for materials discovery, where validation and evaluation strategies have been developed for effective prospective discovery, i.e., identifying materials whose properties lie outside the range of the training data.3,12 This makes sense because, in materials discovery, the goal is often to discover materials with a higher or lower value of a property of interest (e.g., conductivity or band gap) than already known materials.2 In one respect, drug discovery is similar, as models are trained on data from previously known small molecules and then used to predict the bioactivity of compounds optimized to have desirable properties.

Learning from these developments, we propose adapting a validation method and two metrics commonly used in prospective validation in materials science to the search for small molecules in drug discovery: (a) k-fold n-step forward cross-validation12, (b) novelty error, and (c) discovery yield2.

During drug discovery, several properties of a compound are optimized simultaneously. One of the goals is to decrease logP, the logarithm of the partition coefficient (P) of a compound between n-octanol and water, a standard measure of hydrophobicity.13,14 Moderate logP values (typically between 1 and 3) are preferred in drug candidates to balance lipophilicity and hydrophilicity, enhancing oral bioavailability through good lipid membrane permeability and adequate aqueous solubility. A moderate logP value also ensures proper drug distribution, avoiding excessive accumulation in fatty tissues or insufficient penetration through cell membranes.15 Therefore, we implemented a sorted k-fold n-step forward cross-validation (SFCV) to validate models, where the training and test datasets are selected based on continuous blocks of decreasing logP. When implementing SFCV, it is essential to ensure that the folds in the later iterations represent the desired logP values, which should be moderate (between 1 and 3). One could then assess whether a model fails to accurately predict compounds with desired bioactivity compared to other small molecules using discovery yield. Novelty error shows whether models can generalize on new, unseen data that differ significantly from the data on which the model was trained.

This is similar in spirit to using the applicability domain4 and distance-to-model measures16. Overall, we adapt these validation and evaluation metrics to the specific needs of toxicity and protein target prediction for small molecules.17

2. Methods

2.1. Dataset

Models for predicting compound bioactivity require training datasets of activity readouts for many compounds. An activity readout is often expressed as an IC50 value, the concentration at which a particular biological response is reduced to half (50%) of the original signal. While several datasets have binary readouts (active/inactive) for compounds towards given protein targets, these datasets are often noisy or employ arbitrary thresholds for binarising activity. Recently, it was demonstrated that combining data from different assay measurements is a significant noise source for such datasets.18 Therefore, we restricted this study to clean, single-measurement-type data, i.e., IC50 values. Although the actual safety and potency of a compound depend on the dose and the Cmax value (i.e., the maximum concentration in plasma in the organism) rather than being inherent to the IC50 of protein binding in a cell system, this study does not consider Cmax due to insufficient data in the public domain.19,20 Following previous studies, we selected three relevant protein targets from Landrum et al.18: hERG (1467 compounds), MAPK14 (1513 compounds), and VEGFR2 (1751 compounds). hERG inhibition is well known to cause cardiotoxicity21,22 and should be avoided in drug discovery. The other two targets represent positive attributes: MAPK14 inhibitors could potentially treat multiple diseases, such as neurodegenerative diseases, cardiovascular diseases, and cancer, and VEGFR2 is a promising therapeutic target for treating angiogenesis-related tumors. From the literature, we found that approved drugs typically have an IC50 > 6.3 μM for hERG (i.e., they are less likely to inhibit hERG and cause toxicity)23. On the other hand, drugs that target the other two proteins would ideally have IC50 < 100 nM for VEGFR224 or MAPK14 inhibition25.

We converted all IC50 values into pIC50 values, the negative logarithm in base 10 of the IC50 value expressed in molar concentration (M). The pIC50 value provides a more intuitive understanding of compound potency: higher pIC50 values indicate greater potency, as they correspond to lower IC50 concentrations.26
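As a quick illustration of this conversion (not code from the original study), the following sketch assumes IC50 values are supplied in nanomolar; the function and column names are ours.

```python
import numpy as np
import pandas as pd

def ic50_nm_to_pic50(ic50_nM):
    """Convert IC50 values in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return -np.log10(np.asarray(ic50_nM, dtype=float) * 1e-9)

# Example: 100 nM -> pIC50 = 7.0; 6.3 uM (the hERG threshold cited above) -> ~5.2
df = pd.DataFrame({"IC50_nM": [6300.0, 100.0, 10.0]})
df["pIC50"] = ic50_nm_to_pic50(df["IC50_nM"])
print(df)
```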

2.2. Compound Standardization

We used the RDKit library (version 2023.9.427) for all the analyses in this paper to standardize the SMILES (Simplified Molecular Input Line Entry System) representations of chemical structures. The standardization steps included RDKit cleanup functions to desalt and reionize the molecules and neutralize charges. Further standardization involved normalizing isotopes, stereochemistry, and tautomers through canonicalization methods implemented in the RDKit MolStandardize module.27 Each molecule was iteratively processed up to five times until the output SMILES stabilized; if the output did not converge, the most frequently occurring standardized SMILES was selected as the final output. For each unique SMILES, the median pIC50 value of replicates was calculated to summarize the central tendency and minimize the impact of outliers in the data. This resulted in 1262 unique compounds for the hERG dataset, 1445 for the MAPK14 dataset, and 1641 for the VEGFR2 dataset.
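The exact order of cleanup operations is not spelled out in full; the sketch below is a minimal reading of the described pipeline using RDKit's MolStandardize module, with the function name and the fallback logic written by us.

```python
from collections import Counter
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize_smiles(smiles, max_iter=5):
    """Sketch of the described loop: cleanup, desalt, uncharge, canonicalize
    tautomers, repeated up to five times or until the output SMILES stabilizes."""
    uncharger = rdMolStandardize.Uncharger()
    tautomerizer = rdMolStandardize.TautomerEnumerator()
    outputs, current = [], smiles
    for _ in range(max_iter):
        mol = Chem.MolFromSmiles(current)
        if mol is None:
            break
        mol = rdMolStandardize.Cleanup(mol)          # sanitize, normalize, reionize
        mol = rdMolStandardize.FragmentParent(mol)   # keep the largest fragment (desalt)
        mol = uncharger.uncharge(mol)                # neutralize charges
        mol = tautomerizer.Canonicalize(mol)         # canonical tautomer
        out = Chem.MolToSmiles(mol)                  # canonical SMILES
        if out == current:                           # converged
            return out
        outputs.append(out)
        current = out
    # If the SMILES never stabilized, fall back to the most frequent output.
    return Counter(outputs).most_common(1)[0][0] if outputs else smiles
```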

2.3. Compound Featurization

To represent compounds for ML models, we used 2048-bit ECFP4 fingerprints (also known as Morgan fingerprints, as implemented in RDKit), which encode chemical structural features in a binary vector format; logP values were also calculated using RDKit.
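For reference, a minimal featurization sketch along these lines (helper name ours; ECFP4 corresponds to a Morgan radius of 2):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

def featurize(smiles):
    """Return a 2048-bit ECFP4 (Morgan, radius 2) fingerprint and RDKit-calculated logP."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    bits = np.zeros((2048,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, bits)   # explicit bit vector -> numpy array
    return bits, Descriptors.MolLogP(mol)       # Crippen logP as implemented in RDKit

x, logp = featurize("CC(=O)Oc1ccccc1C(=O)O")    # aspirin, as a toy example
```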

2.4. Model Algorithm

We used three models: (a) Random Forest (RF) Regressor, (b) Gradient Boosting, and (c) Multi-Layer Perceptron (MLP), as implemented in scikit-learn version 1.3.028 for all the analyses in this paper, which take molecular fingerprints encoded in a feature matrix as input to predict the chemical property of interest, namely the pIC50 value. The number of trees in the Random Forest Regressor was set dynamically based on the training data size, using the square root of the number of samples but not exceeding 25 trees. A similar approach was used for the other models, with the number of estimators in Gradient Boosting and the number of hidden-layer nodes in the MLP also limited to 25. This approach balances model complexity, helping to prevent overfitting for these tasks where data are limited.
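A minimal sketch of this sizing rule (the factory function and random seed are ours; other hyperparameters are not stated in the Methods, so scikit-learn defaults are assumed):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

def build_models(n_train, seed=42):
    """Size each model from the training-set size: sqrt(n) units, capped at 25."""
    n_units = min(25, max(1, int(np.sqrt(n_train))))
    return {
        "RF": RandomForestRegressor(n_estimators=n_units, random_state=seed),
        "GB": GradientBoostingRegressor(n_estimators=n_units, random_state=seed),
        "MLP": MLPRegressor(hidden_layer_sizes=(n_units,), max_iter=1000, random_state=seed),
    }

models = build_models(n_train=126)  # e.g., one tenth of the hERG dataset (~126 compounds)
```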

2.5. Model Training

To validate models, we adopted a step-forward cross-validation (SFCV) method12 (as shown in Figure 1) and compared it with conventional k-fold cross-validation (CV). To implement the SFCV method, we sorted the dataset from high to low logP values and divided it into ten bins. For the first iteration, we used the first bin for training and the second for testing. In each successive iteration, we expanded the training set by adding the next bin while using the subsequent bin (with lower-logP compounds) for testing. This method mimics a real-world scenario where chemical structures undergo optimization to become more drug-like.29 To explore the impact of logP sorting on model performance, we also implemented an SFCV without sorting, effectively creating a random step-forward cross-validation. This setup allowed us to assess how sorting the training data influences predictive accuracy and the discovery of more drug-like molecules. SFCV, however, provides less training data in each iteration. To explore the impact of dataset size, we implemented a conventional 10-fold CV with random splits for a direct comparison, with one of the ten folds used as the test set in each iteration and the remaining folds forming the training set. To explore the impact of chemical scaffolds, we further implemented a 10-fold CV with scaffold-based splitting (as implemented in ScaffoldSplitter in DeepChem version 2.7.130,31), where molecules are grouped based on their Bemis-Murcko scaffold representation.
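The splitting logic can be sketched as follows (a minimal illustration; the generator name, the DataFrame column, and the shuffling used for the unsorted baseline are our assumptions):

```python
import numpy as np
import pandas as pd

def sfcv_splits(df, sort_col="logP", n_bins=10, sort=True):
    """Yield (train_index, test_index) pairs for step-forward cross-validation.

    When sort=True, the data are ordered from high to low `sort_col` and cut into
    `n_bins` contiguous blocks; iteration i trains on blocks 0..i and tests on
    block i+1.  sort=False shuffles the data first (the unsorted SFCV baseline).
    """
    order = (df.sort_values(sort_col, ascending=False).index if sort
             else df.sample(frac=1.0, random_state=0).index)
    blocks = np.array_split(np.asarray(order), n_bins)
    for i in range(n_bins - 1):
        yield np.concatenate(blocks[: i + 1]), blocks[i + 1]

# Usage sketch: fit and score a model on each step-forward split
# for train_idx, test_idx in sfcv_splits(data, sort_col="logP"):
#     model.fit(X.loc[train_idx], y.loc[train_idx])
#     preds = model.predict(X.loc[test_idx])
```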

Figure 1.

Workflow for 10-fold step-forward cross-validation (SFCV). The dataset blocks Di can be sorted by a calculated or experimental molecular property; here, we used logP.

2.6. Model Evaluation

The predictive performance for each iteration was evaluated by calculating the Root Mean Square Error (RMSE, a point-based measure of model accuracy) and R2 (an overall-fit measure of model accuracy). All code implementations used scikit-learn.
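A per-iteration scoring sketch with scikit-learn (helper name ours):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def score_fold(y_true, y_pred):
    """Return (RMSE, R2) for one SFCV/CV iteration."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = float(r2_score(y_true, y_pred))
    return rmse, r2
```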

2.7. Discovery Yield and Novelty Error

We adopted two metrics from machine learning in materials science to assess models in a real-world context: discovery yield and novelty error.2 To define discovery yield, we first identified compounds with desirable activity (low hERG inhibition, or high potency against MAPK14 and VEGFR2), called discovery compounds. For hERG, compounds with a pIC50 value lower than 5.2 are defined as discovery compounds.23 Similarly, for VEGFR2 and MAPK14, compounds with a pIC50 value exceeding 7.0 are defined as discovery compounds.24,25 We defined discovery yield as the fraction of discovery compounds for which the pIC50 value is correctly predicted within an error range of 0.5 log units. Furthermore, for each test set, we analyzed the performance specifically for compounds structurally dissimilar to those in the training dataset (<0.55 Tanimoto similarity to the nearest neighbor32) and defined the mean absolute error on these compounds as the novelty error.
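A sketch of how these two metrics could be computed (function names, argument layout, and the use of BulkTanimotoSimilarity for the nearest-neighbour search are our choices; thresholds follow the definitions above):

```python
import numpy as np
from rdkit import DataStructs

def discovery_yield(y_true, y_pred, is_discovery, tol=0.5):
    """Fraction of discovery compounds predicted within `tol` log units.
    For hERG, is_discovery = (y_true < 5.2); for MAPK14/VEGFR2, (y_true > 7.0)."""
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    mask = np.asarray(is_discovery, dtype=bool)
    return float(np.mean(err[mask] <= tol)) if mask.any() else np.nan

def novelty_error(y_true, y_pred, test_fps, train_fps, cutoff=0.55):
    """Mean absolute error over test compounds whose nearest-neighbour Tanimoto
    similarity to the training set is below `cutoff`."""
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    nn_sim = np.array([max(DataStructs.BulkTanimotoSimilarity(fp, list(train_fps)))
                       for fp in test_fps])
    novel = nn_sim < cutoff
    return float(err[novel].mean()) if novel.any() else np.nan
```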

3. Results and Discussion

In this work, we evaluated the effectiveness of sorted step-forward cross-validation (SFCV) in predicting compounds’ bioactivity and toxicity. We sorted compounds based on logP values and used an SFCV to assess whether the models could accurately predict desired bioactivity and generalize to new, unseen data. We selected three protein targets relevant to drug discovery: hERG, MAPK14, and VEGFR2, with datasets comprising 1,262, 1,445, and 1,641 unique compounds, respectively. Available data for the pIC50 of compounds against these protein targets (a measure of potency) show a wide distribution of biological activity values (Figure S1). The distribution of pIC50 values for training and test datasets remained relatively similar for sorted and unsorted SFCV, indicating that sorting by logP does not significantly affect the activity distribution within the datasets (Figure S1).

3.1. Sorted SFCV selected more novel compounds than CV methods

Next, we analyzed the chemical space selected for training/testing splits using four validation strategies (Figure S2). Sorted SFCV selected test compounds with progressively lower logP values, as expected. In contrast, other validation methods did not show this trend—cross-validation (CV) with scaffold-based splits selected compounds in different molecular weight spaces in earlier batches. However, in later batches, the distribution of test compounds resembled that of random splitting. This can be attributed to the nature of scaffold splits: larger groups of molecules remaining in later folds are often very dissimilar to each other and thus not representative of any specific scaffold type (that is, chemistry datasets usually do not form exact clusters of molecules, making it challenging to split them neatly into scaffold clusters). As a result, scaffold-based CV effectively identifies some groups of scaffolds in the early batches but also groups some relatively dissimilar compounds in later batches.

To determine which validation method is most likely to test compounds dissimilar to the training data, we analyzed each method using the hERG prediction task as an example (Figure 2a). Sorted and unsorted SFCVs showed more dissimilar compounds in early testing batches, and this number gradually decreased. CV with random splitting maintained a much lower number of dissimilar test compounds across all batches. Both CV approaches selected fewer dissimilar test compounds than sorted SFCV (170 with sorted SFCV compared to 127 with scaffold splits and 24 with random splits; Figure S3a). Thus, sorted SFCV is better suited for evaluating compounds with novel structures.

Figure 2.

(a) Number of compounds dissimilar to training data (Tc<0.55); (b-d) number of discovery compounds (pIC50<5.2) in the test set predicted within an error range of 0.5 log units (b), discovery compounds dissimilar to training data (Tc<0.55) (c), and discovery compounds dissimilar to training data (Tc<0.55) correctly predicted within an error range of 0.5 log units (d), shown for the hERG target prediction task across four validation methods: sorted SFCV, unsorted SFCV, cross-validation with random splits, and cross-validation with scaffold splits (for each of the first nine test folds). Tc: Tanimoto similarity; SFCV: Step-Forward Cross-Validation.

3.3. Sorted SFCV selected more discovery compounds than CV methods

We compared the validation methods based on how they split the data between training and testing in the biological space and how models could be evaluated. All validation methods selected discovery compounds (compounds with pIC50 < 5.2 in the case of hERG) in the test folds (Figure S3b). With unsorted SFCV and CV (random splits), models correctly predicted a higher number of discovery compounds within an error range of 0.5 log units in early batches, but this number fluctuated in later batches (Figure 2b). Models evaluated using sorted SFCV showed an increasing trend in accurately predicted discovery compounds towards later batches, where compounds have a desirable lower logP.

Among these discovery compounds, there were also novel compounds structurally dissimilar to those in the training data (Tc < 0.55). Unsorted and sorted SFCVs effectively selected these dissimilar discovery compounds as test compounds (Figure 2c). CV with scaffold split had fewer such compounds than SFCV methods (29 in CV with scaffold splits vs. 59 in sorted SFCV and 66 in unsorted SFCV in the hERG task). CV with random splits identified very few dissimilar discovery compounds (only 2 in the hERG task), shown in Figure S3c.

We checked how many predictions for these dissimilar discovery compounds were within an error range of 0.5 log units. The pattern was similar to the number of compounds detected: most methods could identify these compounds (Figure 2d). However, as SFCV had more such compounds to predict, it was advantageous to evaluate models using this validation strategy.

Overall, sorted SFCV balances the identification of diversely structured compounds with consistent and improving predictions over testing batches. CV (random split) maintains an optimistic approach with stable performance, while unsorted SFCV and CV (scaffold split) offer higher initial diversity but may lack consistency in later batches.

3.4. Extrapolation error: Sorted SFCV shows that predicting bioactivity for compounds with low lipophilicity is challenging

Conventional CV with random splitting exhibited better performance metrics (mean RMSE=0.58±0.04 for RF) compared to sorted SFCV (mean RMSE=0.77±0.03 for RF) across all protein targets (Table 1). However, all the models evaluated, including RF, gradient boosting, and MLP, demonstrated limited ability to extrapolate when predicting compounds with lower or higher pIC50 values than the remaining compounds (Figure S4 for VEGFR2). These extrapolation challenges are less noticeable under unsorted SFCV or conventional CV conditions than under sorted SFCV. Sorted SFCV (Figure S4a) results in significantly higher absolute errors for compounds with the highest pIC50 values (up to 100 times the experimental IC50 value across multiple test compounds) compared to errors obtained using alternative validation methods. These results indicate that predictions are less accurate for compounds with lower logP than the training data. It will thus be essential to develop models capable of adapting to desirable regions of chemical space. For example, in the first 5 out of 10 iterations of the hERG dataset, the models performed poorly when validated using sorted SFCV (mean R2=0.22) compared to CV with random splits (mean R2=0.74), where more data were available to train individual models and the chemical space of the test data was more similar to the training datasets (Figure S5; see Table S1 for more details). Overall, sorted SFCV better evaluates models by accounting for the inherent challenges of model extrapolation to less-represented regions of chemical space.

Table 1.

Mean performance metrics in predicting pIC50 over the three protein targets when using RF models for each validation technique used in this study. RF: Random Forest; SFCV: Step-Forward Cross-Validation.

Validation R2 (mean ± std) RMSE (mean ± std) Novelty Error (mean ± std) Discovery Yield (mean ± std)
 CV (Random)  0.71 ± 0.05  0.58 ± 0.04  0.80 ± 0.09  0.72 ± 0.04
 CV (Scaffold)  0.49 ± 0.11  0.71 ± 0.05  0.80 ± 0.09  0.59 ± 0.08
 SFCV (Sorted)  0.43 ± 0.11  0.77 ± 0.03  0.88 ± 0.04  0.55 ± 0.08
 SFCV (Unsorted)  0.59 ± 0.08  0.68 ± 0.02  0.82 ± 0.15  0.63 ± 0.06

3.5. Discovery Yield: Sorted SFCV performs similarly to CV with scaffold split but has a more systematic approach to selecting compounds.

We next evaluated the models based on their discovery yield. Unlike accuracy, which measures the overall correctness of predictions across all compounds, discovery yield focuses on the ability of the model to predict activity for compounds with desired properties—such as low toxicity for hERG and high potency for MAPK14 and VEGFR2, as compounds selected in this way offer the most potential for further development. The discovery yield remained high (0.60 to 0.82) across all iterations of conventional CV with random splits (Figure 3). However, this metric showed notable fluctuations under unsorted SFCV (0.40 to 0.75), CV with scaffold split (0.25 to 0.78), and sorted SFCV (0.36 to 0.73). These fluctuations within each method can be attributed to the inherent differences in the composition of the test datasets. In sorted SFCV, the test dataset tends to be dissimilar to the training data. This difference creates a more challenging test environment that mirrors real-world scenarios in drug discovery.

Figure 3:

Discovery yield for models validated using (a) sorted SFCV, (b) unsorted SFCV, (c) cross-validation with random splits, and (d) cross-validation with scaffold splits for all three protein targets when using a Random Forest model. SFCV: Step-Forward Cross-Validation.

Conversely, in conventional CV with random splits, the test sets mirror the training data distribution more closely, making predictions for compounds relatively easier. In many drug discovery projects, properties such as logP often increase as optimization progresses through Design-Make-Test-Analyze (DMTA) cycles, which is undesirable.33,34 A conventional CV with random splits may thus appear to provide stable performance metrics, but it gives a misleading representation of a model’s efficacy in practical settings.

On the other hand, CV with scaffold splits represents a stricter task. Although this simulates the challenges encountered in real-world applications, it lacks a systematic approach to introducing novel scaffolds and can sometimes be too random, introducing wide variability and unpredictability in the performance metrics across folds. Overall, our results indicate that sorted SFCV is a better validation approach that simulates the progression of molecule optimization.

3.6. Novelty Error is more consistent for sorted SFCV

A key question about models aimed at predicting the activity of small molecules is their applicability domain - that is, in what regions of chemical space does a model offer accurate predictions?4,35 We explored this question by assessing the novelty error, defined as the mean absolute error for compounds in the test dataset that are structurally distinct from the training set, across each iteration. We observed that the novelty error remains stable throughout the iterations of sorted SFCV (0.88±0.21), whereas it fluctuates more for unsorted SFCV (0.81±0.31) and conventional cross-validation with random splits (0.80±0.41) or with scaffold splits (0.80±0.48) (Figure 4). One might expect that a sorted SFCV would increase the difficulty of accurate prediction, thereby raising the error rates. However, the consistency of the errors during sorted SFCV can be explained by the nature of the novelty in this validation scheme. In sorted SFCV, the novelty of compounds is defined more narrowly, focusing on compounds that have not been seen but are systematically selected based on a specific molecular property (here, logP). This systematic approach ensures that while the compounds are novel, they are not randomly diverse36, which leads to more predictable error patterns. In contrast, in unsorted SFCVs and CVs, the novelty is broader; random splits introduce a wider variety of compounds into the training set, and novel compounds in the test set tend to be further out-of-distribution and highly varied, making predictions for these compounds more challenging and error-prone. We demonstrate that a sorted SFCV can minimize overfitting to random regions of chemical space and minimize novelty error.

Figure 4:

Novelty error for models validated using (a) sorted SFCV, (b) unsorted SFCV, (c) cross-validation with random splits, and (d) cross-validation with scaffold splits for all three protein targets when using a Random Forest model. SFCV: Step-Forward Cross-Validation.

4. Limitations and Future Work

In the future, SFCV could be implemented for other activity measures, such as inhibition constants (Ki) and potency measures from ChEMBL, offering additional insights. Various activity measures are crucial for understanding compound-protein interactions in drug discovery, and findings from other activity measures could help elucidate how they relate to IC50, the property explored in the current work.

5. Conclusions

We have investigated the potential of applying machine learning validation techniques developed in materials science to molecular property prediction in drug discovery. We made a conceptual argument that k-fold n-step forward cross-validation (SFCV)12 better matches real-world use and is more stringent than conventional k-fold cross-validation (CV). We show that sorted SFCV is better at selecting novel structures whose biological activity is of interest than the second-best validation approach, CV with scaffold-based splitting.

We have also translated discovery yield2 and novelty error2 into the drug discovery field as metrics to evaluate the ability of models to accurately predict the properties of compounds that diverge from the training data in terms of biological properties and chemical structure.

Using a sorted SFCV, we have demonstrated that it is challenging for any model to perform consistently in a varying chemical space that systematically changes towards a desired property, such as logP. The differing discovery yields observed across validation methods highlight the influence of dataset composition and model training strategies on predictive accuracy. Models trained using a sorted SFCV are less prone to overfitting and yield lower novelty errors. In the future, models optimized using sorted SFCV, implemented in DMTA cycles34, could potentially generalize better for diverse compounds.

Overall, we recommend the techniques presented in this study to align model testing with the directional nature of drug discovery, where compounds are gradually optimized to become more drug-like.5 Our results suggest that integrating machine learning strategies developed in materials science into drug discovery pipelines offers additional opportunities for a more accurate assessment of model performance.

Supplementary Material

Supplement 1

Acknowledgments

The authors thank Jessica D. Ewald (Broad Institute of MIT and Harvard) for her comments on an earlier version of this manuscript.

Funding

The authors acknowledge funding from the National Institutes of Health R35 GM122547 (to AEC).

Footnotes

Code Release

An example notebook implementation for sorted SFCV is available at https://github.com/srijitseal/ValidationDiscovery-MolProp/tree/main/local_implementation_notebook. All code for this work is publicly released at https://github.com/srijitseal/ValidationDiscovery-MolProp.

Data Release

All datasets used in this paper have been publicly released at https://github.com/srijitseal/ValidationDiscovery-MolProp/tree/main/data.

References

1. Mathai N., Chen Y. & Kirchmair J. Validation strategies for target prediction methods. Brief. Bioinform. 21, 791–802 (2020).
2. Borg C. K. H. et al. Quantifying the performance of machine learning models in materials discovery. Digital Discovery 2, 327–338 (2023).
3. Del Rosario Z., Rupp M., Kim Y., Antono E. & Ling J. Assessing the frontier: Active learning, model accuracy, and multi-objective candidate discovery and optimization. J. Chem. Phys. 153, 024112 (2020).
4. Klingspohn W., Mathea M., Ter Laak A., Heinrich N. & Baumann K. Efficiency of different measures for defining the applicability domain of classification models. J. Cheminform. 9, 44 (2017).
5. Bender A. & Cortés-Ciriano I. Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: Ways to make an impact, and why we are not there yet. Drug Discov. Today 26, 511–524 (2021).
6. Bender A. & Cortes-Ciriano I. Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data. Drug Discov. Today 26, 1040–1052 (2021).
7. Zhong X. et al. Explainable machine learning in materials science. npj Computational Materials 8, 1–19 (2022).
8. Finkelstein A. V., Bogatyreva N. S., Ivankov D. N. & Garbuzynskiy S. O. Protein folding problem: enigma, paradox, solution. Biophys. Rev. 14, 1255–1272 (2022).
9. Deng J. et al. A systematic study of key elements underlying molecular property prediction. Nat. Commun. 14, 6395 (2023).
10. Landrum G. A. et al. SIMPD: an algorithm for generating simulated time splits for validating machine learning approaches. J. Cheminform. 15, 119 (2023).
11. Yu M., Zhou Y.-N., Wang Q. & Yan F. Extrapolation validation (EV): a universal validation method for mitigating machine learning extrapolation risk. Digital Discovery 3, 1058–1067 (2024).
12. Xiong Z. et al. Evaluating explorative prediction power of machine learning algorithms for materials discovery using k-fold forward cross-validation. Comput. Mater. Sci. 171, 109203 (2020).
13. Keserü G. M. & Makara G. M. The influence of lead discovery strategies on the properties of drug candidates. Nat. Rev. Drug Discov. 8, 203–212 (2009).
14. Meanwell N. A. Improving drug candidates by design: a focus on physicochemical properties as a means of improving compound disposition and safety. Chem. Res. Toxicol. 24, 1420–1456 (2011).
15. Lu D. et al. Lipophilicity screening of novel drug-like compounds and comparison to clog P. J. Chromatogr. A 1258, 161–167 (2012).
16. Sushko I. et al. Applicability domains for classification problems: Benchmarking of distance to models for Ames mutagenicity set. J. Chem. Inf. Model. 50, 2094–2111 (2010).
17. Liu A., Seal S., Yang H. & Bender A. Using chemical and biological data to predict drug toxicity. SLAS Discov. 28, 53–64 (2023).
18. Landrum G. A. & Riniker S. Combining IC50 or Ki Values from Different Sources Is a Source of Significant Noise. J. Chem. Inf. Model. 64, 1560–1567 (2024).
19. Reichel A. & Lienau P. Pharmacokinetics in Drug Discovery: An Exposure-Centred Approach to Optimising and Predicting Drug Efficacy and Safety. In New Approaches to Drug Discovery (eds. Nielsch U., Fuhrmann U. & Jaroch S.) 235–260 (Springer International Publishing, Cham, 2016).
20. Seal S. et al. PKSmart: An Open-Source Computational Model to Predict in vivo Pharmacokinetics of Small Molecules. bioRxiv 2024.02.02.578658 (2024). doi: 10.1101/2024.02.02.578658.
21. Seal S. et al. Insights into Drug Cardiotoxicity from Biological and Chemical Data: The First Public Classifiers for FDA Drug-Induced Cardiotoxicity Rank. J. Chem. Inf. Model. (2024). doi: 10.1021/acs.jcim.3c01834.
22. Garrido A., Lepailleur A., Mignani S. M., Dallemagne P. & Rochais C. hERG toxicity assessment: Useful guidelines for drug design. Eur. J. Med. Chem. 195, 112290 (2020).
23. Stergiopoulos C., Tsopelas F. & Valko K. Prediction of hERG inhibition of drug discovery compounds using biomimetic HPLC measurements. ADMET DMPK 9, 191–207 (2021).
24. Zhang Q., Zheng P. & Zhu W. Research Progress of Small Molecule VEGFR/c-Met Inhibitors as Anticancer Agents (2016-Present). Molecules 25 (2020).
25. Namboodiri H. V. et al. Analysis of imatinib and sorafenib binding to p38alpha compared with c-Abl and b-Raf provides structural insights for understanding the selectivity of inhibitors targeting the DFG-out form of protein kinases. Biochemistry 49, 3611–3618 (2010).
26. Spektor A. C. Why using pIC50 instead of IC50 will change your life. https://www.collaborativedrug.com/cdd-blog/why-using-pic50-instead-of-ic50-will-change-your-life (2019).
27. Landrum G. & Others. RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum 8, 5281 (2013).
28. Pedregosa F. et al. Scikit-learn: Machine Learning in Python. arXiv [cs.LG] 2825–2830 (2012).
29. Shultz M. D. Two Decades under the Influence of the Rule of Five and the Changing Properties of Approved Oral Drugs. J. Med. Chem. 62, 1701–1714 (2019).
30. Molecular Machine Learning with DeepChem. (2018).
31. Splitters — deepchem 2.8.1.dev documentation. https://deepchem.readthedocs.io/en/latest/api_reference/splitters.html.
32. Huang T. et al. MOST: most-similar ligand based approach to target prediction. BMC Bioinformatics 18, 165 (2017).
33. Lobo S. Is there enough focus on lipophilicity in drug discovery? Expert Opin. Drug Discov. 15, 261–263 (2020).
34. Plowright A. T. et al. Hypothesis driven drug design: improving quality and effectiveness of the design-make-test-analyse cycle. Drug Discov. Today 17, 56–62 (2012).
35. Seal S. et al. Merging bioactivity predictions from cell morphology and chemical fingerprint models using similarity to training data. J. Cheminform. 15, 56 (2023).
36. Paricharak S. et al. Data-driven approaches used for compound library design, hit triage and bioactivity modeling in high-throughput screening. Brief. Bioinform. 19, 277–285 (2018).
