Biology Methods & Protocols
. 2024 Sep 3;9(1):bpae065. doi: 10.1093/biomethods/bpae065

Graph neural networks are promising for phenotypic virtual screening on cancer cell lines

Sachin Vishwakarma 1,2,b, Saiveth Hernandez-Hernandez 3,b, Pedro J Ballester 4
PMCID: PMC11537795  PMID: 39502795

Abstract

Artificial intelligence is increasingly driving early drug design, offering novel approaches to virtual screening. Phenotypic virtual screening (PVS) aims to predict how cancer cell lines respond to different compounds by focusing on observable characteristics rather than specific molecular targets. Some studies have suggested that deep learning may not be the best approach for PVS. However, these studies are limited by the small number of tested molecules, by unsuitable performance metrics, and by not employing dissimilar-molecules splits, which better mimic the challenging chemical diversity of real-world screening libraries. Here we prepared 60 datasets, each containing approximately 30 000–50 000 molecules tested for their growth inhibitory activities on one of the NCI-60 cancer cell lines. We conducted multiple performance evaluations of each of the five machine learning algorithms for PVS on these 60 problem instances. To provide an even more comprehensive evaluation, we used two model validation types: the random split and the dissimilar-molecules split. Overall, about 14 440 training runs across datasets were carried out per algorithm. The models were primarily evaluated using hit rate, a more suitable metric in VS contexts. The results show that all models are more challenged by test molecules that are substantially different from those in the training data. In both validation types, the D-MPNN algorithm, a graph-based deep neural network, was found to be the most suitable for building predictive models for this PVS problem.

Keywords: phenotypic virtual screening, graph neural networks, cancer cell lines

Introduction

Recent years have witnessed strong advancements in computational drug discovery methodologies. Targeted Drug Discovery (TDD), which aims at identifying molecules that interact with a specific molecular target associated with the considered disease [1–3], has been the predominant approach. Gradually, however, Phenotypic Drug Discovery (PDD) approaches have gained traction [4–6]. Distinct from TDD, PDD does not depend solely on the molecular understanding of diseases; rather, it focuses on observable phenotypic changes, offering a more expansive approach that is not confined to known targets [7, 8]. This has the advantage of enabling the exploration of therapeutic agents through mechanisms of action that are not yet known, potentially leading to innovative treatments. Moreover, PDD acknowledges that molecules effective at targeting specific processes in isolation may also perform effectively within a broader cellular environment [9, 10]. This is important as many molecules with potent activity for a target are later found to lack whole-cell activity for a range of reasons, for example, the target not being co-located with the molecule in the cell [9].

High-Throughput Screening (HTS) techniques have been fundamental to both TDD and PDD, enabling the assessment of libraries with up to a few million compounds [11, 12]. However, as the scale of chemical libraries grows to giga-scale proportions, HTS is no longer an option to screen them. This challenge is compounded by the increasing complexity of cancer as a multi-genic disease, which requires the exploration of vast, uncharted chemical spaces to identify effective therapeutics.

In this context, Artificial intelligence (AI) presents a tremendous opportunity to transform the landscape of early drug discovery [13, 14]. AI’s capacity to leverage existing data from HTS to predictively model and explore vast chemical libraries is transforming drug screening [15]. Graph neural networks (GNNs), in particular, have shown promise owing to their ability to model complex, irregular data structures inherent in molecular chemistry [16]. Yet, despite AI’s advancements, there remain significant limitations in its performance, especially when tasked with predicting responses from chemically dissimilar molecules [17–19]. These limitations highlight a critical gap in current AI methodologies, which often fail to generalize beyond the chemical entities included in their training sets (unseen molecules). This is particularly the case when predicting the properties of novel active compounds (i.e. not only unseen but also dissimilar to the known actives of the considered target).

The efficacy of AI models in predicting the activity of molecule-cell line pairs has been evaluated in various contexts using comprehensive databases such as GDSC, CCLE, and NCI-60 [17–27]. These evaluations mostly focus on predicting the responses of unseen cell lines to a given drug, e.g. [28–32], for precision oncology purposes. A few others evaluate such phenotypic drug response models on unseen drugs [18–20, 31, 33] and leave-drug-out scenarios [20, 28, 31, 33, 34] for a given cell line. However, the evaluation of AI models on test sets with chemically dissimilar molecules remains significantly underexplored [18]. Another shortcoming is that practically none of these studies evaluates the developed models for virtual screening (here, discriminating between active and inactive molecules on the considered cell line). This is partly due to the most popular resources only having a few hundred drugs tested on each cell line (e.g. GDSC, CCLE).

As a consequence of these shortcomings, we still do not know which are the best AI models to guide virtual screening of gigascale libraries against cancer cell lines. Traditional Machine Learning (ML) methods like Random Forests (RF) and XGBoost (XGB) rely on precomputed molecular descriptors, which are structured numerical representations of chemical properties in tabular form (instances × features). These methods tend to perform best with smaller datasets due to their efficiency in handling such data [18, 26, 35]. In contrast, GNNs extract descriptors directly from SMILES (Simplified Molecular Input Line Entry System) strings, capturing complex molecular relationships. This underscores the need for more comprehensive studies on how different AI models, particularly GNNs, perform with realistic chemical diversity [34, 36, 37].

To answer these research questions, we will assemble 60 datasets, each with diverse molecules tested on one of the cell lines in the NCI-60 panel [38]. Then, we will investigate how each algorithm performs across these 60 instances of the same problem, highlighting differences in algorithmic efficiency and predictive accuracy in a controlled yet varied set of conditions. The NCI-60 tumour cell line panel, with its extensive profiling of over 130 000 compounds, underscores the critical role of advanced data models in refining drug sensitivity predictions [31, 39, 40]. By utilizing such comprehensive datasets, this study aims to rigorously assess the performance of supervised learning algorithms in predicting the biological activity of chemically dissimilar molecules against cancer cell lines, an area yet to be fully explored by current AI-driven methodologies [28]. This will be carried out with appropriate evaluation metrics, such as the Hit Rate [15, 18, 41], which are more suited for virtual screening than traditional ROC-AUC measures [21, 27, 28, 34].

Materials and methods

NCI-60 dataset

The dataset used in this research was obtained from the NCI-60 database, which is publicly available at https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-60+Growth+Inhibition+Data. This comprehensive dataset includes growth inhibition information for a wide array of 159 cell lines tested against 53 215 molecules (distinct NSC IDs). In total, the dataset encompasses 3 054 004 measurements of pGI50, which quantifies the logarithmic concentration of a compound necessary to achieve a 50% reduction in tumour growth. To enhance the reliability of our analysis, we excluded 41 113 pGI50 values below the threshold of 4, considering them less reliable due to extrapolation from higher concentration ranges. In cases of multiple pGI50 values for the same NSC-Cell line pair, we computed the average, recognizing that potent molecules often undergo retesting across various concentration ranges, leading to multiple pGI50 recordings.
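The filtering and averaging steps above can be sketched with pandas; the column names (NSC, CELL_NAME, pGI50) and the toy values are illustrative assumptions, not the actual NCI-60 schema:

```python
import pandas as pd

# Toy stand-in for the NCI-60 growth-inhibition table; column names are
# illustrative, not the real file's schema.
df = pd.DataFrame({
    "NSC":       [1, 1, 1, 2, 2, 3],
    "CELL_NAME": ["A", "A", "B", "A", "B", "A"],
    "pGI50":     [5.2, 5.4, 6.0, 3.5, 4.8, 7.1],
})

# Drop measurements below the pGI50 = 4 reliability threshold.
df = df[df["pGI50"] >= 4]

# Average replicate measurements for the same NSC-cell line pair.
agg = df.groupby(["NSC", "CELL_NAME"], as_index=False)["pGI50"].mean()
print(agg)
```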

The processed dataset comprises 60 cell lines; the complete workflow is shown in Fig. 1. Molecular representations were converted from SDF to SMILES format utilizing the Open Babel library [42]. The chemical structures were then standardized using the Molvs package (https://molvs.readthedocs.io/en/latest/). This standardization procedure involved sanitization, hydrogen removal, metal disconnection, application of normalization rules, acid re-ionization, and stereochemistry recalculations. Following this, 1137 NSCs were omitted from the dataset due to the unavailability of chemical descriptors.

Figure 1.


Overview of the data partition, featurization, and analysis workflow. For each cell line individually, the dataset is first split into training (80%), validation (10%), and test (10%) sets. Hyperparameter tuning is conducted using the training and validation sets. Two types of featurization are applied: Morgan fingerprints and graph-based features. Various ML models, including traditional models (e.g. LR, RF, XGB, and DNN) and graph neural networks (e.g. D-MPNN), are trained using the training set, with hyperparameters tuned based on performance on the validation set. After tuning, the training and validation sets are merged together for the final training of the models. The final models are then evaluated using two distinct test sets: the test set from the random split (random test set) and a dissimilar test set, which is generated by excluding molecules similar to those in the training set (dissimilar split). The models' predictive performance is assessed using several evaluation metrics, including Pearson correlation coefficient (Rp), Root Mean Square Error (RMSE), Matthews Correlation Coefficient (MCC), and Hit Rate (HR), providing a comprehensive understanding of their generalization capabilities across both unseen and dissimilar molecules

Our analysis primarily focused on 60 extensively profiled cell lines, as documented in previous cancer research studies [18, 19, 26, 31, 43]. The NCI-60 has profiled data spanning nine cancer types, including leukaemia, melanoma, non-small-cell lung, colon, central nervous system, ovarian, renal, prostate, and breast. The refined dataset encompasses 2 707 434 unique NSC-Cell line pairs, incorporating data for 60 cell lines and 50 555 molecules (totalling 50 846 unique NSC IDs). A discrepancy in the count of unique canonical SMILES (50 156) relative to unique NSC IDs highlights instances where different NSC IDs were assigned identical SMILES representations, underscoring the importance of canonical SMILES as a unique molecular identifier.

In our study, we opted to use Morgan circular fingerprints [44] to represent the molecular features of compounds, leveraging the RDKit library (https://www.rdkit.org/) for their generation in bit vector format. The choice of Morgan fingerprints was informed by their demonstrated efficacy in enhancing model performance across similar research endeavours [19, 31, 45]. These fingerprints effectively capture the presence or absence of specific substructures within a molecule, offering a comprehensive molecular representation. The configuration of the fingerprint's radius and bit size is pivotal in generating meaningful Morgan circular fingerprints. In our analysis, we utilized a bit size of 256 with a radius of 2, aligning with configurations that have previously yielded optimal results in the literature [19, 31].

Beyond structural features, we also quantified several physicochemical properties of the compounds to enrich our feature set. These properties encompassed: total polar surface area [46], molecular weight [47], LogP (Partition coefficient) [48], number of aliphatic rings, number of aromatic rings [49], number of hydrogen bond acceptors, and number of hydrogen bond donors [50]. We complemented the Morgan fingerprints with additional physicochemical properties to provide a multidimensional representation of each compound.
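A minimal featurization sketch with RDKit, assuming the configuration described above (256-bit, radius-2 Morgan fingerprints concatenated with the seven listed physicochemical properties); the function name and the property ordering are our own choices, not the study's code:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Crippen, Descriptors

def featurize(smiles: str) -> np.ndarray:
    """256-bit radius-2 Morgan fingerprint plus the seven
    physicochemical properties listed above (order is our choice)."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=256)
    props = [
        Descriptors.TPSA(mol),               # total polar surface area
        Descriptors.MolWt(mol),              # molecular weight
        Crippen.MolLogP(mol),                # LogP
        Descriptors.NumAliphaticRings(mol),  # aliphatic rings
        Descriptors.NumAromaticRings(mol),   # aromatic rings
        Descriptors.NumHAcceptors(mol),      # H-bond acceptors
        Descriptors.NumHDonors(mol),         # H-bond donors
    ]
    return np.concatenate([np.array(fp, dtype=float), np.array(props)])

x = featurize("c1ccccc1O")  # phenol
print(x.shape)              # (263,)
```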

Machine learning

Linear regression

We used linear regression (LR) to establish a baseline for comparing model performance. As implemented in the scikit-learn library [51], the LR model predicts an output (y) by applying a linear combination of input variables (X). This relationship is mathematically represented as:

y = \beta_0 + \sum_{i=1}^{n} \beta_i X_i,

where β0 is the intercept (or bias coefficient) and βi is the weight assigned to each input feature Xi. For scenarios encompassing multiple input variables, the model extends beyond a simple line to form a hyperplane. This multidimensional approach allows for a more nuanced representation of the data, accommodating complex relationships between multiple inputs and the predicted output. The configuration of the equation, particularly the values and interactions of its coefficients (β0, βi, etc.), conveys the model's interpretation of the underlying problem.
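A small scikit-learn illustration of this baseline on synthetic data, where noise-free targets let the model recover the intercept and weights exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for (feature, pGI50) data: 100 molecules, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = 4.0 + X @ true_beta          # y = beta0 + sum_i beta_i * X_i, noise-free

lr = LinearRegression().fit(X, y)
print(lr.intercept_)             # recovers beta0 = 4.0
print(lr.coef_)                  # recovers the beta_i weights
```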

Random Forest

We used the RF algorithm as implemented in the scikit-learn library [51], which improves prediction accuracy and controls overfitting by creating a forest of decision trees [52, 53]. RF employs the bagging (Bootstrap aggregating) technique [54] with a distinct approach of using only a subset of features for splitting each node in the decision trees. This method ensures that the trees within the forest are uncorrelated, enhancing the ensemble's prediction quality by reducing variance without significantly increasing bias. RF aggregates predictions from each tree to minimize prediction error and mitigate the impact of outliers and noise.

In our study, we optimized RF through the tuning of two hyperparameters: n_estimators, the number of trees in the forest, and max_features, the number of features considered for splitting at each node. We adopted the bootstrap sampling technique and employed the Mean Squared Error (MSE) as the scoring function to assess model performance on the test set. We initially configured the n_estimators parameter to 500 and set max_features to 0.33 based on recommendations [52]. The max_depth parameter was left unrestricted. This setup served as a foundation for further hyperparameter tuning to identify the optimal hyperparameter range that enhances model performance. Table 1 shows the examined range of hyperparameters for the RF algorithm.
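A sketch of this tuning setup with scikit-learn's GridSearchCV; the dataset is synthetic, and the tree count and grid are reduced from Table 1 for speed:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 2 - X[:, 1] + rng.normal(scale=0.1, size=200)

# Starting point recommended in the text: bootstrap sampling,
# max_features = 0.33, unrestricted depth. n_estimators and the grid
# are reduced here for speed; Table 1 lists the full ranges.
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=50, max_features=0.33, random_state=0),
    param_grid={"max_features": [0.2, 0.3, 0.4]},
    scoring="neg_mean_squared_error",   # MSE as the scoring function
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```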

Table 1.

Values tested/searched for each hyperparameter of the ML algorithms.

| Algorithm | Hyperparameter | Values tested | Tuning method |
| --- | --- | --- | --- |
| RF | n_estimators | 250, 500, 750, 1000 | Grid search |
| RF | max_features | 0.2, 0.3, 0.4, 0.6, 0.8, 0.9 | Grid search |
| XGB | n_estimators | 250, 500, 750, 1000 | Grid search |
| XGB | max_depth | 5, 6, 7, 8, 9, 10 | Grid search |
| XGB | colsample_bytree | 0.3, 0.4, 0.6, 0.8, 0.9 | Grid search |
| DNN | learning rate | 0.0001, 0.001, 0.01, 0.1 | Grid search |
| DNN | weight decay | 0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2 | Grid search |
| DNN | dropout | 0, 0.2, 0.4, 0.6, 0.8 | Grid search |
| D-MPNN | hidden size | low = 300, high = 2400 | Bayesian optimization |
| D-MPNN | depth | low = 2, high = 6 | Bayesian optimization |
| D-MPNN | dropout | low = 0.0, high = 0.4 | Bayesian optimization |
| D-MPNN | ffn_num_layers | low = 1, high = 3 | Bayesian optimization |

Extreme gradient boosting

XGB stands out as a sophisticated ensemble learning algorithm, renowned for its efficiency and effectiveness [55]. XGB operates on the principle of boosting, which seeks to iteratively minimize prediction errors using a gradient descent optimization algorithm. Unlike traditional models that train in isolation, XGB enhances model accuracy sequentially, with each new tree aiming to correct the residual errors left by its predecessors. This iterative correction continues until no significant improvement is observed or a pre-determined number of iterations is reached. Distinguishing itself from RF, which averages predictions from independently trained models, XGB focuses on sequential improvement, making each tree dependent on the corrections from the trees before it.

XGB’s performance relies on key hyperparameters, which include max_depth, the maximum depth of any decision tree in the ensemble; n_estimators, the total number of trees in the ensemble; colsample_bytree, the fraction of features used to train each tree; and eta, the learning rate that controls each tree’s contribution to the final outcome. Previous studies have determined initial values for each of these hyperparameters, particularly setting eta to 0.5 [56]. These default values are utilized in our current study to identify the optimal combination for our specific task. Details of the hyperparameters explored can be found in Table 1.
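To illustrate the boosting principle without depending on the xgboost package itself, the sketch below uses scikit-learn's GradientBoostingRegressor as a stand-in; the hyperparameter names map onto their XGB counterparts as noted in the comments, and the data are synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# Sequential boosting: each new tree fits the residual errors of the
# ensemble built so far, scaled by the learning rate (XGB's eta).
gbr = GradientBoostingRegressor(
    n_estimators=250,    # total number of trees (as in Table 1)
    max_depth=5,         # maximum depth of each tree
    learning_rate=0.5,   # eta, matching the initial XGB setting above
    max_features=0.8,    # per-split feature subsampling (closest
                         # scikit-learn analogue of colsample_bytree)
    random_state=0,
).fit(X, y)

residual_mse = np.mean((gbr.predict(X) - y) ** 2)
print(residual_mse)      # training error shrinks as trees accumulate
```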

Deep neural network

Deep neural networks (DNNs), more precisely deep multi-layer perceptrons, are celebrated for their ability to effectively capture complex relationships in high-dimensional data [57]. They significantly enhance our comprehension and interpretation of data through sophisticated, layered computational structures. In this study, we employed the Keras framework with TensorFlow backend, as outlined at (https://keras.io/), to develop a DNN tailored for regression analysis. Our model architecture is fully connected and comprises an input layer, multiple hidden layers, and an output layer, each integral to the network's predictive capability.

The architecture of the DNN is the following:

  • Input Layer. This layer is dimensioned to match the number of input features, ensuring that each feature is adequately represented.

  • Output Layer. Consisting of a single neuron, this layer is designed for regression, producing a continuous output value.

  • Hidden Layers. The network includes several hidden layers, which are instrumental in learning the data's complex patterns. The 'width' of a layer refers to its number of neurons, and the network 'depth' to its number of layers. Optimization of neuronal weights across these layers is accomplished through the Adam optimizer [58], a decision driven by its effectiveness in gradient descent optimization.

The DNN settings considered are:

  • Weight Initialization [59]. Implemented to prevent the vanishing or exploding gradient problem, ensuring efficient forward and backward propagation.

  • Optimization Algorithm. The core of the training process involves iterative weight adjustments using the Adam optimizer [58], aimed at minimizing the loss function.

  • Activation Functions. Non-linear activation functions, specifically ReLU [60, 61], for the hidden layers, are used to introduce non-linearity, enabling the model to capture complex relationships within the data.

Strategies to counteract overfitting include:

  • Dropout [62]. Randomly omitting neurons during training to encourage more generalized learning.

  • Regularization [63]. Applying weight decay through L1 regularization to reduce the model’s complexity by penalizing large weights, thereby promoting simpler models.

Our approach to hyperparameter tuning involved a grid search strategy to systematically explore the effects of various parameters, including learning rate, weight decay, dropout rates, and activation functions, on model performance. This process was informed by recent literature [64, 65], guiding the selection of ReLU activation for hidden layers and a linear activation for the output, alongside the implementation of batch normalization [66] and the Adam optimizer to enhance the learning process.

The final model configuration, determined through empirical hyperparameter tuning, consists of three hidden layers with respective neuron counts of 512, 256, and 64. Training was conducted over 100 epochs with a batch size of 100, using the MSE as the loss function. This configuration was selected to optimize the model’s ability to learn from the training data and accurately predict outcomes on unseen data. Table 1 details the range of hyperparameters evaluated.

The optimized hyperparameter set, comprising a learning rate of 0.01, a dropout rate of 0.6, and no weight decay, was chosen based on its performance in enhancing the model’s prediction accuracy and generalization capability. The neuron configuration for the hidden layers was adjusted to 2048, 1024, and 512 to maximize the model’s effectiveness in complex pattern recognition and prediction.
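As an illustrative, runnable stand-in for the Keras model, scikit-learn's MLPRegressor below reproduces the tuned layer widths, ReLU activation, Adam optimizer, learning rate, and absence of weight decay; it does not support dropout or batch normalization, so those parts of the configuration are omitted, and the data are synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

# Tuned architecture from the text: hidden layers of 2048, 1024, and 512
# ReLU neurons, Adam optimizer, learning rate 0.01, no weight decay.
# (The 0.6 dropout rate cannot be reproduced with MLPRegressor.)
dnn = MLPRegressor(
    hidden_layer_sizes=(2048, 1024, 512),
    activation="relu",
    solver="adam",
    learning_rate_init=0.01,
    alpha=0.0,           # no weight decay
    max_iter=20,         # kept small for this toy example
    random_state=0,
)
dnn.fit(X, y)
print(len(dnn.coefs_))   # 4 weight matrices: input->2048->1024->512->output
```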

Directed-Message Passing Neural Network

The Directed-Message Passing Neural Network (D-MPNN) [67, 68] is a graph-based Neural Network (NN) architecture adept at learning molecular representations from molecular graphs derived from SMILES inputs [16]. By translating the SMILES into graphical data, D-MPNN incorporates detailed atom and bond information—such as atom type, bond quantity, charge, chirality, and more—that is largely one-hot encoded (Table 2), except for atomic mass, which is treated as a scaled real number for consistency across all molecules.

Table 2.

Atom and Bond features included in the D-MPNN algorithm.

| Category | Feature | Description | Number of features |
| --- | --- | --- | --- |
| Atom features | Atom type | Type of atom (e.g. C, O, N), by atomic number | 100 |
| Atom features | No. of bonds | Number of bonds | 6 |
| Atom features | Formal charge | Electronic charge assigned to an atom | 5 |
| Atom features | Chirality | Unspecified, tetrahedral CW/CCW, or other | 4 |
| Atom features | No. of Hs | Number of bonded hydrogens | 5 |
| Atom features | Hybridization | sp, sp2, sp3, sp3d, or sp3d2 | 5 |
| Atom features | Aromaticity | Part of an aromatic system or not | 1 |
| Atom features | Atomic mass | Mass of the atom, divided by 100 | 1 |
| Bond features | Bond type | Single, double, triple, or aromatic | 4 |
| Bond features | Conjugated | Whether the bond is conjugated | 1 |
| Bond features | In-ring | Whether the bond is part of a ring | 1 |
| Bond features | Stereo | None, any, E/Z, or cis/trans | 6 |

The process, depicted in Fig. 2, begins by encoding node descriptors within an adjacency matrix to represent atom connections, where a '1' signifies a direct bond. The D-MPNN operates in two primary phases: message passing and readout. During the message-passing phase, atom and bond features are aggregated to create a detailed molecular representation. This involves collecting information from neighbouring atoms and bonds, with the depth of message passing (up to three bond lengths in this study) being a tuneable parameter. The readout phase constructs a feature vector for the entire molecular graph by applying a readout function that aggregates the updated atom features, yielding the final molecular embedding used for predictive modelling.

Figure 2.


Sequence of operations executed by the D-MPNN algorithm to extract features from chemical compounds. Beginning with a SMILES string, the algorithm employs the RDKit library to construct a 3D molecular structure. This structure enables the algorithm's message passing and update function to assimilate the attributes of adjacent atoms and bonds. The culmination of this process is the conversion of the assimilated features into a user-defined fixed-length vector through a linear transformation, thereby yielding a complete molecular embedding

A linear transformation is subsequently applied to the embedding calculated for the molecule, which is formalized as:

y = xA^{T} + b,

where x is the input feature set, A is the weight matrix, b is the bias, and y is the feature vector whose fixed length is set by the user. This transformation standardizes the output length, ensuring uniformity irrespective of the molecule’s size.
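A toy numpy sketch of the overall idea (neighbour aggregation over an adjacency matrix, a sum-pool readout, then the final linear projection); note that the actual D-MPNN passes messages along directed bonds rather than atoms, so this is a deliberate simplification:

```python
import numpy as np

# Toy molecular graph: a chain of 4 heavy atoms. In the adjacency
# matrix A, a '1' marks a direct bond; H holds per-atom features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(4)
H = rng.normal(size=(4, 8))        # 4 atoms, 8 features each
W = rng.normal(size=(8, 8))        # shared message-passing weights

# Message passing: each atom aggregates its neighbours' features; three
# rounds correspond to the three-bond receptive field used in the study.
for _ in range(3):
    H = np.maximum(A @ H @ W, 0.0)  # aggregate, transform, ReLU

# Readout: sum-pool atom features into one graph-level vector.
h_mol = H.sum(axis=0)

# Final linear transformation y = x A^T + b to a user-defined length (4).
A_proj = rng.normal(size=(4, 8))
b = np.zeros(4)
y = h_mol @ A_proj.T + b
print(y.shape)                      # (4,)
```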

The D-MPNN model is trained using the chemprop package (https://github.com/chemprop/chemprop), where parameters are optimized iteratively through forward passes, loss computation, backpropagation, and parameter updates across both the MPNN and a connected feed-forward network (FFN). Our model configuration is guided by the insights of [68], further enhanced through Bayesian optimization to refine model performance, as detailed in Table 1. Through these settings, our model is able to effectively aggregate and interpret messages from neighbouring atoms, ensuring robust learning across varying chemical structures.

The D-MPNN’s capacity for feature generation and the subsequent training process enable the construction of a predictive model uniquely tailored to the complexity of chemical compounds. By preserving the integrity of molecular graphs and applying nuanced transformations, D-MPNN stands as a powerful tool for advancing our understanding and predictions within chemical informatics.

Performance metrics

To assess the predictive accuracy of our regression model on the test set, we employed the sklearn library to calculate several crucial performance metrics, comparing observed (yobs) and predicted (ypred) pGI50 values. These metrics are instrumental in quantifying the model’s ability to accurately predict unseen data and provide a comprehensive overview of its overall performance.

Pearson’s correlation coefficient (Rp) quantifies the degree of linear relationship between two variables, providing insight into both the strength and direction of their linear association. It is computed as:

R_p = \frac{\sum_i (y_{i,\mathrm{obs}} - \bar{y}_{\mathrm{obs}})(y_{i,\mathrm{pred}} - \bar{y}_{\mathrm{pred}})}{\sqrt{\sum_i (y_{i,\mathrm{obs}} - \bar{y}_{\mathrm{obs}})^2}\,\sqrt{\sum_i (y_{i,\mathrm{pred}} - \bar{y}_{\mathrm{pred}})^2}}

The Rp value varies between −1 and 1. A value approaching 1 indicates a robust positive linear correlation, -1 signifies a strong negative linear relationship, and 0 suggests no linear correlation.

Root Mean Squared Error (RMSE) quantifies the average magnitude of the prediction error, representing how closely a model’s predictions match the actual observed outcomes. It is calculated as the square root of the average squared differences between the predicted and observed values, making it sensitive to large errors. The formula is as follows, where N is the number of observations:

\mathrm{RMSE} = \sqrt{\frac{\sum_i (y_{i,\mathrm{obs}} - y_{i,\mathrm{pred}})^2}{N}}

The range of RMSE values is from 0 to infinity. An RMSE value of 0 represents a perfect fit, indicating that predicted values perfectly match the actual values.

Matthews correlation coefficient (MCC) offers a balanced measure for the quality of binary classifications, effectively handling imbalanced datasets. It considers all quadrants of the confusion matrix:

\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

The values of MCC range between −1 and 1, with a score of 1 signifying perfect classification, 0 indicating random classification, and −1 representing complete disagreement between prediction and observation.

Hit Rate (HR), or precision, is the proportion of positive identifications that were actually correct. The HR is calculated as follows and generally expressed in percentages.

\mathrm{HR} = \frac{TP}{TP + FP}
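The four metrics can be computed with numpy and scikit-learn as follows; the pGI50 values and the activity threshold used for binarization are illustrative:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, mean_squared_error, precision_score

y_obs  = np.array([5.1, 6.3, 4.2, 7.0, 5.8, 4.9])
y_pred = np.array([5.0, 6.0, 4.5, 6.8, 5.5, 5.2])

rp = np.corrcoef(y_obs, y_pred)[0, 1]               # Pearson's Rp
rmse = np.sqrt(mean_squared_error(y_obs, y_pred))   # RMSE

# Binarize at an activity threshold (here, pGI50 >= 6 counts as a hit)
# to compute the classification metrics; the threshold is illustrative.
thr = 6.0
obs_hit, pred_hit = y_obs >= thr, y_pred >= thr
mcc = matthews_corrcoef(obs_hit, pred_hit)
hr = precision_score(obs_hit, pred_hit)             # Hit Rate = TP/(TP+FP)
print(rp, rmse, mcc, hr)
```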

Model validation strategies

Evaluating model performance is paramount in ML, as it ensures a model’s reliability in predicting outcomes on unseen data. Our study utilized two model validation strategies, specifically designed to align with our dataset’s unique characteristics and our research objectives. These strategies aimed to assess the model’s predictive accuracy in diverse scenarios, highlighting its potential for broad application.

The first strategy was a random split, where data for each cell line was divided randomly. The second strategy was a dissimilar-molecules split, corresponding to a 70% similarity threshold, where structurally similar molecules were excluded from the test set to ensure the evaluation is carried out on only significantly distinct compounds (Fig. 1).

Random split

Our initial approach involved a random-split strategy for each of the 60 cell lines within our dataset. In this method, the data for each cell line was divided into three subsets: a training set comprising 80% of the data, a validation set comprising 10%, and a test set comprising the remaining 10%. This partitioning approach served as a foundational step towards implementing a 10-fold cross-validation, facilitating early-stage exploratory analysis and model calibration.

All algorithms utilized this identical partitioning scheme to facilitate direct comparison. Each algorithm configuration, determined by a unique combination of hyperparameters, was trained on the training set and subsequently evaluated on the validation set. This procedure allowed for fine-tuning and performance monitoring during training, which helped to mitigate overfitting. From these validation results, the optimal configuration for each algorithm was identified for each of the 60 datasets (corresponding to the 60 cell lines) and used to construct the final model by merging the training and validation sets. This final model was then evaluated on the test set. The random selection of the test set was designed to assess the model’s performance across a diverse sample, providing a comprehensive understanding of its predictive accuracy.
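The 80/10/10 partition can be reproduced with two successive calls to scikit-learn's train_test_split; the data here are a synthetic stand-in for one cell line:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # stand-in for one cell line's molecules
y = np.random.default_rng(7).normal(size=1000)

# 80/10/10 split: first carve off 20%, then halve it into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# After hyperparameter tuning, the training and validation sets are merged
# for the final fit, and the held-out test set is used for evaluation.
X_final = np.vstack([X_train, X_val])
print(len(X_train), len(X_val), len(X_test), len(X_final))
```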

Dissimilar-molecules split

To enhance the rigor of our model evaluation, we implemented a strategy focusing on the inclusion of dissimilar molecules during the dataset partitioning stage for each of the 60 cell lines [18, 41]. After performing an initial random division of the dataset, we analysed the structural similarity of the molecules using SMILES notation to differentiate those in the training and test sets. We then applied a similarity threshold: molecules within the test set showing greater than or equal to 70% similarity to any molecule in the training set were removed. This approach guaranteed that the test set was composed solely of molecules significantly distinct from those the model was trained on. By adopting this methodology, we sought to rigorously assess the model's ability to generalize its predictions to new, chemically unrelated compounds, thereby validating its effectiveness in identifying and assessing novel compounds. From this perspective, the random split can be seen as a dissimilar-molecules split with a 100% similarity threshold, where only test molecules identical to a training molecule are removed from the test set.
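A minimal sketch of the dissimilarity filter, assuming fingerprints have already been computed and are represented as sets of on-bits; the Tanimoto coefficient is the standard similarity measure for such fingerprints, and the bit sets below are illustrative:

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprint on-bit sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical precomputed on-bit sets (e.g. from 256-bit Morgan
# fingerprints); the actual bit indices are illustrative only.
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
test_fps  = {"mol_A": {1, 2, 3, 5},      # 3/5 = 0.6 to the first train molecule
             "mol_B": {1, 2, 3, 4, 5},   # 4/5 = 0.8 -> removed
             "mol_C": {20, 21}}          # dissimilar to everything

# Keep only test molecules with < 70% similarity to every training molecule.
dissimilar = {name for name, fp in test_fps.items()
              if all(tanimoto(fp, tr) < 0.70 for tr in train_fps)}
print(sorted(dissimilar))   # ['mol_A', 'mol_C']
```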

Results

Comparative performance of ML algorithms on unseen randomly split NCI-60 dataset

We start by evaluating random-split models to predict the pGI50 responses across various cancer cell lines for a comprehensive set of compounds, utilizing the extensive molecular screening data available from the NCI-60 dataset. While previous studies [22, 23, 34, 40, 43] have focused on smaller datasets, our research utilizes a much larger dataset, which includes data across 60 cell lines corresponding to nine cancer types. This extensive data volume provides a robust foundation for our ML models, which is pivotal for achieving superior predictive performance in phenotypic virtual screening (PVS). Employing the LR, RF, XGB, DNN, and D-MPNN algorithms, we trained models for each cell line to predict pGI50 based on compound features.

The ML models underwent training using the recommended values for their hyperparameters shown in Section 2.1, followed by a hyperparameter-tuning phase to evaluate potential performance enhancements (see Table 1 for reference). The LR algorithm was trained exclusively with the recommended hyperparameters. In Fig. 3, we compare the results of both scenarios, where each algorithm’s performance is depicted in a boxplot that includes 60 scores (one per cell line) of either RMSE or Rp.

Figure 3.


Hyperparameter tuning generally resulted in better performance. The performance of each algorithm across NCI-60 cell lines was evaluated using a random split. Results were obtained using (A) Rp scores (upper panel) and (B) RMSE scores (lower panel). Each subpanel shows the results for the following ML algorithms from left to right: LR, RF, XGB, DNN, and D-MPNN. For each algorithm, the boxplot on the left (orange) shows the performance with default hyperparameters, and the boxplot on the right (blue) shows the performance with tuned hyperparameters. Each boxplot is drawn from 60 out-of-sample performance estimates, one per cell line. The central line of each boxplot represents the median, and the dot denotes the mean. D-MPNN benefitted the most from hyperparameter tuning

With the recommended values for their hyperparameters, the XGB models outperformed the other four model types on both the RMSE and Rp metrics. Hyperparameter tuning notably enhanced the performance of the XGB models, highlighting the significance of hyperparameter interactions in gradient boosting methods. After tuning, however, the D-MPNN models outperformed the XGB models, which came second. Hyperparameter tuning had minimal impact on the performance of RF models, consistent with prior literature suggesting their insensitivity to it [52]. With the recommended hyperparameter values, the baseline LR models again exhibited much lower accuracy than their nonlinear counterparts.

The DNN models also improved with hyperparameter tuning. The analysis revealed that a constant learning rate and an increased number of neurons per hidden layer contributed to this enhancement. These findings challenge previous recommendations [65] and highlight the importance of dataset-specific hyperparameter tuning.

After hyperparameter tuning, the D-MPNN models achieved the most substantial improvement, reaching the lowest average RMSE of 0.572 and the highest average Rp of 0.756, making D-MPNN the best-performing algorithm on these datasets. Because of its extensive hyperparameter space, D-MPNN was tuned with Bayesian optimization [69], which yielded significant improvements. For the other algorithms, Bayesian optimization offered no advantage over grid search (data not shown).
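A minimal sketch of the Bayesian-optimization idea is shown below, reduced to a single hypothetical hyperparameter with a stand-in objective in place of D-MPNN validation error; the paper tuned D-MPNN over its full hyperparameter space. A Gaussian-process surrogate is fitted to the observations and a lower-confidence-bound acquisition picks the next candidate.

```python
# Sketch of Bayesian optimisation over one hyperparameter (stand-in objective).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Hypothetical stand-in for "validation RMSE at hyperparameter value x".
    return (x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

candidates = np.linspace(0, 1, 200).reshape(-1, 1)
X_obs = np.array([[0.0], [0.5], [1.0]])       # initial evaluations
y_obs = objective(X_obs).ravel()

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True).fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    lcb = mu - 1.5 * sigma                    # favour low mean or high uncertainty
    x_next = candidates[np.argmin(lcb)].reshape(1, -1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

best = X_obs[np.argmin(y_obs)][0]
print(f"best hyperparameter value found: {best:.2f}")
```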

In conclusion, our comprehensive assessment indicates that while XGB models are efficient and perform well, D-MPNN models, when optimally tuned, are the most proficient in capturing the complex relationships in the data, as demonstrated by their superior RMSE and Rp scores.

Model performance on dissimilar molecules

We extended our investigation of model robustness by evaluating performance on a dissimilar test set, defined by a 70% similarity threshold (each test molecule is less than 70% similar to every training molecule). This benchmark was designed to emulate the challenge of predicting the efficacy of novel compounds and to validate the generalizability of our models to chemical entities substantially different from those seen during training. The dissimilarity criterion ensures that our models are not merely reflecting the chemical space of the training set. We evaluated the predictive accuracy and error of each ML algorithm on this challenging set; following the initial results, the models in this phase were trained with their optimized hyperparameters.
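The dissimilar-molecules filtering can be sketched as below, assuming Tanimoto similarity on binary fingerprints (the fingerprints here are random stand-ins, not real molecules): a held-out molecule enters the test set only if its maximum similarity to every training molecule is below 0.7.

```python
# Sketch of a dissimilar-molecules split on binary fingerprints.
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprint vectors."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

rng = np.random.default_rng(2)
fps = rng.integers(0, 2, size=(300, 256)).astype(bool)  # random stand-in fingerprints

train_idx = np.arange(200)   # first 200 molecules form the training set
threshold = 0.7

test_idx = []
for i in range(200, 300):
    max_sim = max(tanimoto(fps[i], fps[j]) for j in train_idx)
    if max_sim < threshold:  # keep only molecules dissimilar to all training molecules
        test_idx.append(i)

print(f"{len(test_idx)} of 100 held-out molecules pass the dissimilarity filter")
```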

The findings in Fig. 4 align with those in Fig. 3, indicating that the LR model consistently underperforms compared to its nonlinear counterparts. Notably, the D-MPNN model demonstrates the best results, with the highest mean Rp of 0.632 and the lowest mean RMSE of 0.584, highlighting its strong generalization capabilities.

Figure 4.

D-MPNN also performs best on the dissimilar-molecules test sets. The performance of each algorithm, using their optimized hyperparameters, across NCI-60 cell lines was evaluated on the dissimilar test set. Each subpanel shows the results for the following ML algorithms from left to right: LR, RF, XGB, DNN, and D-MPNN. The boxplot for each algorithm contains 60 scores, either (A) Rp or (B) RMSE. The central line of each boxplot represents the median, and the dot denotes the mean

To emphasize the importance of a robust model validation strategy, especially in the early stages of drug discovery, we summarized each model's performance across the 60 cell lines by averaging its RMSE and Rp scores. These values correspond to the dot markers shown in Figs 3 and 4. The comparison, presented in Table 3, shows that models perform better under random data partitioning than under partitioning based on dissimilar molecules. Under random partitioning, the test set may contain molecules similar to those in the training set, which the models can recognize. Conversely, the dissimilar-molecules partition ensures that the test set contains only molecules different from those in the training set, a more challenging scenario that reduces the performance of the ML models.

Table 3.

Comparison of the two model validation strategies.

                 Random split (100%)     Dissimilar-molecules split (70%)
ML algorithm     Rp       RMSE           Rp       RMSE
LR               0.4480   0.7742         0.3499   0.7052
RF               0.7150   0.6131         0.5344   0.6356
XGB              0.7385   0.5838         0.5727   0.6093
DNN              0.7331   0.5963         0.5613   0.6242
D-MPNN           0.7563   0.5724         0.6326   0.5850

For both the random split and the dissimilar-molecules split, the performance of each ML model is summarized as the average of its RMSE or Rp scores across the 60 cell lines. Results correspond to models trained with optimized hyperparameters, except for the LR model. The random split is effectively a dissimilar-molecules split with a 100% similarity threshold.

D-MPNN achieves the best value in every column.

Despite its reduced accuracy on the dissimilar-molecules split, the D-MPNN model outperforms all other ML models. Indeed, D-MPNN's pGI50 predictions have lower error and higher correlation with the corresponding measurements across both validation strategies, highlighting its potential to drive early drug design approaches.

These results underscore the importance of robust, stringent testing environments for validating ML models in drug discovery. The standout predictive performance of the D-MPNN model on the dissimilar test set demonstrates its ability to generalize to novel chemical entities, a key attribute for models applied to the prediction of compound efficacy. Its superior performance in such complex predictive scenarios confirms its status as a potent tool in the arsenal of computational drug discovery.

D-MPNN model performance in PVS

Figure 5 provides a compelling demonstration of the D-MPNN model's strength in the context of PVS, particularly its ability to deliver a high hit rate across both explored (random split) and unexplored (dissimilar-molecules split) chemical spaces. The figure illustrates predictive performance on the best- and worst-performing cell lines, using a pGI50 cut-off to differentiate inactive (pGI50 < 6) from active (pGI50 ≥ 6) compounds.
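Given predicted and measured pGI50 values, the classification metrics follow directly from the cut-off. The sketch below uses synthetic values, with hit rate computed as the fraction of predicted actives that are measured active (i.e. precision), which is one common reading of this metric.

```python
# Sketch of the screening metrics under a pGI50 cut-off of 6 (synthetic data).
import numpy as np
from sklearn.metrics import matthews_corrcoef, precision_score

rng = np.random.default_rng(3)
measured = rng.normal(5.5, 1.0, 1000)            # stand-in measured pGI50 values
predicted = measured + rng.normal(0, 0.6, 1000)  # stand-in model predictions

cutoff = 6.0
y_true = measured >= cutoff                      # active if measured pGI50 >= 6
y_pred = predicted >= cutoff                     # predicted active

hit_rate = precision_score(y_true, y_pred)       # true hits among predicted actives
mcc = matthews_corrcoef(y_true, y_pred)          # summary of the confusion matrix
print(f"hit rate = {hit_rate:.1%}, MCC = {mcc:.2f}")
```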

Figure 5.

D-MPNN model performance in phenotypic virtual screening using either the random or the dissimilar-molecules split. For each cell line, the same training set is employed regardless of the split, whereas the test set contains either only the molecules dissimilar to those in the training set or all the remaining molecules (both dissimilar and similar). Regression and classification performance are evaluated simultaneously, using a pGI50 cut-off of 6 for the latter. The scatter plots compare predicted pGI50 values against their measured values for the RMSE-best and RMSE-worst cell lines (left and right, respectively) under the random split (top panels) and the dissimilar-molecules split (bottom panels). Notably, the model maintains high hit rate, Rp and MCC values with the dissimilar-molecules split. These results highlight D-MPNN's robustness and its capacity to identify molecules with potent whole-cell inhibition among a much larger proportion of inactive molecules, thus demonstrating its value for virtual screening against cancer cell lines

In the explored setting of the random-split test, the D-MPNN showcases its high predictive fidelity on the best cell line, HCT-116, achieving an Rp score of 0.802, indicative of a strong linear correlation between predicted and measured values. The high hit rate of 75.6% and an MCC of 0.64 in this scenario underscore the model's ability to identify potent compounds, a critical requirement for effective virtual screening.

The robust nature of the D-MPNN model is further evident on the dissimilar test set, where even the best cell line, HCT-116, presents a new challenge due to the introduction of dissimilar molecules. Despite this, the model attains an Rp score of 0.675, a substantial hit rate of 60.4%, and an MCC of 0.44. These metrics are notable given the difficulty imposed by the dissimilarity threshold, and they highlight the D-MPNN's adaptability and its capacity to recognize active agents in a less explored chemical space.

The worst-performing cell line in the dissimilar test, A498, while exhibiting lower metrics (an Rp of 0.564 and a hit rate of 45.8%), still demonstrates the D-MPNN model's relative resilience. An MCC of 0.32 in this context shows that the model maintains a fair level of true-positive identification amidst the increased chemical diversity often encountered in real-world screening environments.

We investigated whether the predictive performance of our models varied systematically across different cancer cell lines. Our analysis indicated that there are no consistent, systematic differences in predictive performance between different cancer types. However, some cell lines did show slightly better predictive performance compared to others. This can be observed in Fig. 5, where the results for the best and worst-performing cell lines are presented.

Figure 5 illustrates the predicted versus measured pGI50 values for the best and worst-performing cell lines using the D-MPNN model. For example, the HCT-116 cell line consistently shows better predictive performance with higher Rp and lower RMSE values. In contrast, the A498 cell line exhibits lower predictive performance.

It is important to note that the size of the dataset for each cell line does not appear to be a significant factor in discriminating performance. The smallest training sets, which typically pose challenges for model performance, also have the smallest test sets. The two effects tend to offset each other: smaller test sets are less diverse and therefore easier to predict, while larger, more diverse test sets are generally harder to predict accurately.

Overall, these findings suggest that while there is some variation in predictive performance among different cell lines, there are no systematic biases favouring certain cancer types. This supports the robustness and general applicability of our models across different types of cancer cell lines.

Discussion

Our study aimed to identify the most effective AI-based model for predicting the inhibitory activity of molecules, especially molecules that differ from those encountered during training. The models were trained using the NCI-60 dataset, in which each cell line is tested with at least 30 000 molecules (see Fig. 1), allowing a robust assessment of model performance across different cancer cell lines. This approach differs from many studies that use much smaller and less diverse datasets, which may limit their applicability and generalizability [35, 70]. Our study suggests that larger, more heterogeneous datasets can significantly enhance a model's ability to generalize, supporting previous findings that dataset diversity is crucial for training effective AI models in drug discovery applications [11, 12, 15].

When comparing the two data partitioning strategies for model training and testing, we find that models tend to perform better when we use random data partitioning instead of partitioning based on dissimilar molecules (see Table 3). This is because when we use random data partitioning, the test set also includes molecules similar to those in the training set, allowing the models to predict those more accurately. On the other hand, only using dissimilar molecules in the test set makes the task more challenging for the models, leading to decreased performance in ML models.

Regardless of the strategy used, our results showed that hyperparameter tuning improves the performance of all the models evaluated in this study (Figs 3 and 4). Moreover, this study has demonstrated that D-MPNNs offer superior predictive capabilities in virtual screening for anticancer therapeutics against diverse cancer cell lines, even with chemically dissimilar molecules. Unlike fixed-representation ML models that rely on precomputed molecular descriptors, GNNs generate their own descriptors by operating directly on the molecular graph. This capability enables GNNs to capture the structural information and spatial relationships inherent in the molecular structure [37].

Moreover, GNNs excel in generalizing from training data to both unseen and dissimilar molecules. They can learn transferable patterns and features, allowing them to recognize and predict the properties of structurally different molecules based on learned structural motifs and atomic interactions. This adaptability is particularly valuable in drug discovery, where the chemical space is vast and diverse, and the ability to predict the activity of novel compounds is crucial.

Additionally, the use of message passing mechanisms in D-MPNN allows for the effective aggregation of local neighbourhood information, enhancing the model’s ability to understand the context of each atom within the entire molecular graph. This results in more accurate predictions of molecular properties and interactions, even for compounds that were not part of the training set, and particularly for those that are structurally dissimilar to the training molecules.
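The edge-centred message passing described above can be illustrated on a toy graph with random weights; a real D-MPNN learns these weights and also uses bond features, which are omitted here for brevity.

```python
# Toy sketch of D-MPNN-style (directed-edge) message passing with random weights.
import numpy as np

rng = np.random.default_rng(4)

# Propane-like graph: 3 atoms, bonds 0-1 and 1-2, as directed edges both ways.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
n_atoms, dim = 3, 8
atom_feats = rng.normal(size=(n_atoms, dim))
W = rng.normal(scale=0.1, size=(dim, dim))  # shared message weight matrix

# One hidden state per *directed edge* (the defining trick of D-MPNN).
h = {e: atom_feats[e[0]] @ W for e in edges}

for _ in range(3):  # message-passing iterations
    new_h = {}
    for (u, v) in edges:
        # Aggregate incoming edge states at u, excluding the reverse edge v->u.
        incoming = sum((h[(w, u2)] for (w, u2) in edges if u2 == u and w != v),
                       np.zeros(dim))
        new_h[(u, v)] = np.tanh(atom_feats[u] @ W + incoming @ W)
    h = new_h

# Readout: sum edge states into atom states, then pool atoms into a molecule vector.
atom_h = np.zeros((n_atoms, dim))
for (u, v), state in h.items():
    atom_h[v] += state
mol_embedding = atom_h.sum(axis=0)
print(mol_embedding.shape)
```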

Our findings align with recent advances in the field suggesting GNNs' superior ability to model complex molecular interactions, owing to an inherent structure that mirrors molecular graphs [34, 71]. This contrasts with other ML models, such as RF and XGB, which do not capture the topological variability of molecular structures as effectively. Taken together, these results support the usefulness of D-MPNN for PVS applications. The high hit rates and MCC scores show D-MPNN's inherent strength in uncovering compounds with strong inhibitory activity against cancer cell lines, supporting its use in the crucial early stages of drug discovery, where the prediction of compound efficacy is paramount.

Conclusions

The results of this study demonstrate the significant role of GNNs as a powerful tool in virtual screening technologies aimed at improving the discovery of anticancer drugs. GNNs can accurately predict the biological activity of chemically diverse molecules against different cancer cell lines, demonstrating a capability to navigate complex chemical spaces that other types of AI models struggle to explore fully.

We conducted evaluations using the NCI-60 dataset, one of the largest and most diverse collections in such studies. Our findings suggest that GNNs are robust and highlight the crucial role of using comprehensive datasets to enhance the accuracy and generalizability of predictive models in oncology. This research contributes to our understanding of how various AI models can be optimized and tailored to address the specific requirements of early drug design, thus expanding the possibilities of what can be achieved with current technologies.

Future research should concentrate on refining specialized models, utilizing transfer learning to capitalize on knowledge derived from various cancer types. This approach can potentially improve predictive accuracy by allowing models to generalize across diverse cancer datasets. Additionally, the integration of multi-omics data with chemical data via multi-task learning could substantially improve virtual screening performance against cancer cell lines. Evaluation of other GNNs on the 60 dissimilar-molecules splits across NCI-60 cell lines is also promising.

Code for reproducing the results of the best models is available at https://github.com/sachin-vish91/GNN-VS.

Acknowledgments

Part of the computational experiments was performed on the Core Cluster of the French Institute of Bioinformatics (IFB) (ANF-11-INBS-0013), which is gratefully acknowledged.

Contributor Information

Sachin Vishwakarma, Evotec SAS (France), Toulouse, France; Centre de Recherche en Cancérologie de Marseille, Marseille 13009, France.

Saiveth Hernandez-Hernandez, Centre de Recherche en Cancérologie de Marseille, Marseille 13009, France.

Pedro J Ballester, Department of Bioengineering, Imperial College London, London SW7 2AZ, United Kingdom.

Author contributions

Sachin Vishwakarma (Data curation [equal], Formal analysis [supporting], Investigation [supporting], Methodology [equal], Software [equal], Validation [equal], Visualization [equal], Writing—original draft [supporting], Writing—review & editing [supporting]), Saiveth Hernandez-Hernandez (Data curation [equal], Formal analysis [equal], Investigation [equal], Software [supporting], Validation [supporting], Writing—original draft [supporting], Writing—review & editing [supporting]), and Pedro Ballester (Conceptualization [lead], Data curation [supporting], Formal analysis [lead], Funding acquisition [lead], Investigation [lead], Project administration [lead], Resources [lead], Supervision [lead], Validation [lead], Writing—original draft [lead], Writing—review & editing [lead])

Conflict of interest statement. None declared.

Funding

S.V. thanks the Institute Paoli Calmettes Marseille, France for his PhD funding. S.H-H. thanks the National Council of Sciences and Technology of Mexico (CONAHCYT). P.J.B. thanks the Wolfson Foundation and the Royal Society for a Royal Society Wolfson Fellowship.

References

  • 1. Ledford H. Many cancer drugs aim at the wrong molecular targets. Nature 2019. 10.1038/D41586-019-02701-6
  • 2. Lin A, Giuliano CJ, Palladino A et al. Off-target toxicity is a common mechanism of action of cancer drugs undergoing clinical trials. Sci Transl Med 2019;11. 10.1126/scitranslmed.aaw8412
  • 3. Swinney DC, Anthony J. How were new medicines discovered? Nat Rev Drug Discov 2011;10:507–19. 10.1038/nrd3480
  • 4. Vincent F, Nueda A, Lee J et al. Phenotypic drug discovery: recent successes, lessons learned and new directions. 2022. 10.1038/s41573-022-00472-w
  • 5. Childers WE, Elokely KM, Abou-Gharbia M. The resurrection of phenotypic drug discovery. 2020. 10.1021/acsmedchemlett.0c00006
  • 6. Moffat JG, Vincent F, Lee JA et al. Opportunities and challenges in phenotypic drug discovery: an industry perspective. 2017. 10.1038/nrd.2017.111
  • 7. Makhoba XH, Viegas C, Mosa RA et al. Potential impact of the multi-target drug approach in the treatment of some complex diseases. Drug Des Devel Ther 2020;14:3235–49. 10.2147/DDDT.S257494
  • 8. Peón A, Naulaerts S, Ballester PJ. Predicting the reliability of drug-target interaction predictions with maximum coverage of target space. Sci Rep 2017;7:3820. 10.1038/s41598-017-04264-w
  • 9. Hoeger B, Diether M, Ballester PJ, Köhn M. Biochemical evaluation of virtual screening methods reveals a cell-active inhibitor of the cancer-promoting phosphatases of regenerating liver. Eur J Med Chem 2014;88:89–100. 10.1016/j.ejmech.2014.08.060
  • 10. Menichetti R, Kanekal KH, Bereau T. Drug-membrane permeability across chemical space. ACS Cent Sci 2019;5:290–8. 10.1021/acscentsci.8b00718
  • 11. Fresnais L, Ballester PJ. The impact of compound library size on the performance of scoring functions for structure-based virtual screening. 2021. 10.1093/bib/bbaa095
  • 12. Gloriam DE. Bigger is better in virtual drug screens. 2019. 10.1038/d41586-019-00145-6
  • 13. Ballester PJ. The AI revolution in chemistry is not that far away. 2023. 10.1038/d41586-023-03948-w
  • 14. Ren F, Aliper A, Chen J et al. A small-molecule TNIK inhibitor targets fibrosis in preclinical and clinical models. Nat Biotechnol 2024. 10.1038/s41587-024-02143-0
  • 15. Wallach I, Bernard D, Nguyen K; The Atomwise AIMS Program et al. AI is a viable alternative to high throughput screening: a 318-target study. Sci Rep 2024;14:1–16. 10.1038/s41598-024-54655-z
  • 16. Yang K, Swanson K, Jin W et al. Analyzing learned molecular representations for property prediction. J Chem Inf Model 2019;59:3370–88. 10.1021/acs.jcim.9b00237
  • 17. Xia F, Allen J, Balaprakash P et al. A cross-study analysis of drug response prediction in cancer cell lines. Brief Bioinform 2022;23. 10.1093/bib/bbab356
  • 18. Hernandez-Hernandez S, Guo Q, Ballester PJ. Conformal prediction of molecule-induced cancer cell growth inhibition challenged by strong distribution shifts. bioRxiv 2024;1–16. 10.48550/arXiv.2406.00873
  • 19. Guo Q, Hernández-Hernández S, Ballester PJ. Scaffold splits overestimate virtual screening performance. ArXiv, pp. 1–14, 202. 10.48550/arXiv.2406.00873
  • 20. Li M, Wang Y, Zheng R et al. DeepDSC: a deep learning method to predict drug sensitivity of cancer cell lines. IEEE/ACM Trans Comput Biol Bioinform 2021;18:575–82. 10.1109/TCBB.2019.2919581
  • 21. Yuan H, Paskov I, Paskov H et al. Multitask learning improves prediction of cancer drug sensitivity. Sci Rep 2016;6:31619. 10.1038/srep31619
  • 22. Stetson LC, Pearl T, Chen Y, Barnholtz-Sloan JS. Computational identification of multi-omic correlates of anticancer therapeutic response. BMC Genomics 2014;15:S2. 10.1186/1471-2164-15-S7-S2
  • 23. Bazgir O, Zhang R, Dhruba SR et al. Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks. Nat Commun 2020;11:4391. 10.1038/s41467-020-18197-y
  • 24. Joo M, Park A, Kim K et al. A deep learning model for cell growth inhibition IC50 prediction and its application for gastric cancer patients. Int J Mol Sci 2019;20. 10.3390/ijms20246276
  • 25. Chang Y, Park H, Yang H-J et al. Cancer Drug Response Profile scan (CDRscan): a deep learning model that predicts drug effectiveness from cancer genomic signature. Sci Rep 2018;8:8857. 10.1038/s41598-018-27214-6
  • 26. Hernández-Hernández S, Vishwakarma S, Ballester PJ. Conformal prediction of small-molecule drug resistance in cancer cell lines. Proc Mach Learn Res 2022;179:92–108.
  • 27. Wei D, Liu C, Zheng X, Li Y. Comprehensive anticancer drug response prediction based on a simple cell line-drug complex network model. BMC Bioinformatics 2019;20:44. 10.1186/s12859-019-2608-9
  • 28. Choi J, Park S, Ahn J. RefDNN: a reference drug based neural network for more accurate prediction of anticancer drug resistance. Sci Rep 2020;10:1861. 10.1038/s41598-020-58821-x
  • 29. Naulaerts S, Menden MP, Ballester PJ. Concise polygenic models for cancer-specific identification of drug-sensitive tumors from their multi-omics profiles. Biomolecules 2020;10. 10.3390/BIOM10060963
  • 30. Cadow J, Born J, Manica M et al. PaccMann: a web service for interpretable anticancer compound sensitivity prediction. Nucleic Acids Res 2021;48:W502–W508. 10.1093/NAR/GKAA327
  • 31. Cortés-Ciriano I, van Westen GJP, Bouvier G et al. Improved large-scale prediction of growth inhibition patterns using the NCI60 cancer cell line panel. Bioinformatics 2016;32:85–95. 10.1093/bioinformatics/btv529
  • 32. Menden MP, Iorio F, Garnett M et al. Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS One 2013;8:e61318. 10.1371/journal.pone.0061318
  • 33. Ammad-Ud-Din M, Georgii E, Gönen M et al. Integrative and personalized QSAR analysis in cancer by kernelized Bayesian matrix factorization. J Chem Inf Model 2014;54:2347–59. 10.1021/ci500152b
  • 34. Al-Jarf R, de Sá AGC, Pires DEV, Ascher DB. pdCSM-cancer: using graph-based signatures to identify small molecules with anticancer properties. J Chem Inf Model 2021;61:3314–22. 10.1021/acs.jcim.1c00168
  • 35. He S, Zhao D, Ling Y et al. Machine learning enables accurate and rapid prediction of active molecules against breast cancer cells. Front Pharmacol 2021;12:796534. 10.3389/fphar.2021.796534
  • 36. Wang S, Sun Q, Xu Y et al. A transferable deep learning approach to fast screen potential antiviral drugs against SARS-CoV-2. Brief Bioinform 2021;22. 10.1093/bib/bbab211
  • 37. Tong X et al. Deep representation learning of chemical-induced transcriptional profile for phenotype-based drug discovery. 10.1038/s41467-024-49620-3
  • 38. Shoemaker RH. The NCI60 human tumour cell line anticancer drug screen. Nat Rev Cancer 2006. 10.1038/nrc1951
  • 39. Piyawajanusorn C, Nguyen LC, Ghislat G, Ballester PJ. A gentle introduction to understanding preclinical data for cancer pharmaco-omic modeling. 2021. 10.1093/bib/bbab312
  • 40. Martorana A, La Monica G, Bono A et al. Antiproliferative activity predictor: a new reliable in silico tool for drug response prediction against NCI60 panel. Int J Mol Sci 2022;23. 10.3390/ijms232214374
  • 41. Tran-Nguyen VK, Junaid M, Simeon S, Ballester PJ. A practical guide to machine-learning scoring for structure-based virtual screening. Nat Protoc 2023;18:3460–511. 10.1038/s41596-023-00885-w
  • 42. O’Boyle NM, Banck M, James CA et al. Open Babel: an open chemical toolbox. J Cheminform 2011;3:33. 10.1186/1758-2946-3-33
  • 43. Riddick G, Song H, Ahn S et al. Predicting in vitro drug sensitivity using random forests. Bioinformatics 2011;27:220–4. 10.1093/bioinformatics/btq628
  • 44. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model 2010;50:742–54. 10.1021/ci100050t
  • 45. Preuer K, Lewis RPI, Hochreiter S et al. DeepSynergy: predicting anti-cancer drug synergy with deep learning. Bioinformatics 2018;34:1538–46. 10.1093/bioinformatics/btx806
  • 46. Caron G, Ermondi G. Molecular descriptors for polarity: the need for going beyond polar surface area. 2016. 10.4155/fmc-2016-0165
  • 47. DrugBank. Glossary. DrugBank Help Center. https://dev.drugbank.com/guides/terms (1 June 2024, date last accessed).
  • 48. Schneider G. Prediction of drug-like properties. In: Adaptive Systems in Drug Design, 2020. 10.1201/9781498713702-10
  • 49. Ritchie TJ, Macdonald SJF. The impact of aromatic ring count on compound developability—are too many aromatic rings a liability in drug design? 2009. 10.1016/j.drudis.2009.07.014
  • 50. Vennelakanti V, Qi HW, Mehmood R, Kulik HJ. When are two hydrogen bonds better than one? Accurate first-principles models explain the balance of hydrogen bond donors and acceptors found in proteins. Chem Sci 2021;12:1147–62. 10.1039/d0sc05084a
  • 51. Pedregosa F et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12.
  • 52. Svetnik V, Liaw A, Tong C et al. Random Forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 2003;43:1947–58. 10.1021/ci034160g
  • 53. Breiman L. Random Forests. Mach Learn 2001;45:5–32.
  • 54. Breiman L. Bagging predictors. Mach Learn 1996;24:123–40. 10.1007/bf00058655
  • 55. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. 10.1145/2939672.2939785
  • 56. Sheridan RP, Wang WM, Liaw A et al. Extreme gradient boosting as a method for quantitative structure-activity relationships. J Chem Inf Model 2016;56:2353–60. 10.1021/acs.jcim.6b00591
  • 57. Lecun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44.
  • 58. Kingma DP, Ba JL. Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015.
  • 59. Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. J Mach Learn Res 2010.
  • 60. Hara K, Saito D, Shouno H. Analysis of function of rectified linear unit used in deep learning. In: Proceedings of the International Joint Conference on Neural Networks, 2015. 10.1109/IJCNN.2015.7280578
  • 61. Takekawa A, Kajiura M, Fukuda H. Role of layers and neurons in deep learning with the rectified linear unit. Cureus 2021;13:e18866. 10.7759/cureus.18866
  • 62. Srivastava N, Hinton G, Krizhevsky A et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15.
  • 63. Loshchilov I, Hutter F. Decoupled weight decay regularization. In: 7th International Conference on Learning Representations, ICLR 2019, 2019.
  • 64. Ma J, Sheridan RP, Liaw A et al. Deep neural nets as a method for quantitative structure-activity relationships. J Chem Inf Model 2015;55:263–74. 10.1021/ci500747n
  • 65. Zhou Y, Cahya S, Combs SA et al. Exploring tunable hyperparameters for deep neural networks with industrial ADME data sets. J Chem Inf Model 2019;59:1005–16. 10.1021/acs.jcim.8b00671
  • 66. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: 32nd International Conference on Machine Learning, ICML 2015, 2015.
  • 67. Han X, Jia M, Chang Y et al. Directed message passing neural network (D-MPNN) with graph edge attention (GEA) for property prediction of biofuel-relevant species. Energy AI 2022;10:100201. 10.1016/j.egyai.2022.100201
  • 68. Heid E, Greenman KP, Chung Y et al. Chemprop: a machine learning package for chemical property prediction. J Chem Inf Model 2024;64:9–17. 10.1021/acs.jcim.3c01250
  • 69. Parsa M, Mitchell JP, Schuman CD et al. Bayesian multi-objective hyperparameter optimization for accurate, fast, and efficient neural network accelerator design. Front Neurosci 2020;14:667. 10.3389/fnins.2020.00667
  • 70. Jiang D, Wu Z, Hsieh C-Y et al. Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J Cheminform 2021;13:12. 10.1186/s13321-020-00479-8
  • 71. Stokes JM, Yang K, Swanson K et al. A deep learning approach to antibiotic discovery. Cell 2020;180:688–702.e13. 10.1016/j.cell.2020.01.021

Articles from Biology Methods & Protocols are provided here courtesy of Oxford University Press
