Abstract
Predicting the geometry and strength governing small molecule–protein interactions remains a paramount challenge in drug discovery due to their complex and dynamic nature. Several machine learning (ML) methods have been proposed to complement and improve on physics-based tools such as molecular docking, usually by mapping three-dimensional features of poses to their closeness to experimental structures and/or to binding affinities. Here, we introduce DockBox2 (DBX2), a novel approach that encodes ensembles of computational poses within a graph neural network framework via energy-based features derived from molecular docking. The model was jointly trained to predict binding pose likelihood as a node-level task and binding affinity as a graph-level task using the PDBbind dataset, and demonstrated significantly improved performance in comprehensive, retrospective docking and virtual screening experiments compared with state-of-the-art physics- and ML-based tools. Our results encourage further exploration of ML models that learn from conformational ensembles to accurately model small molecule–protein interactions and thermodynamics. The DBX2 code is available at https://github.com/jp43/DockBox2.
Introduction
Drugs exert their therapeutic effects by binding to specific biomolecular targets, typically proteins or nucleic acids, and modulating their function, thereby inhibiting or restoring processes related to various diseases. The initial step in the drug discovery pipeline involves identifying molecules that bind the target of interest with high affinity and specificity,1 making the accurate prediction of both crucial for therapeutic development.2 Binding affinity, which reflects the strength of the interaction between a drug and its protein target, is commonly expressed in terms of the dissociation constant (Kd), measurable via a plethora of experimental techniques.3 However, these techniques are usually time-consuming and resource-intensive,4,5 especially at the high throughput rates required to explore vast chemical spaces.6 Consequently, in silico screening methods have gained significant momentum, especially in recent years.7
Although the estimation of ligand–protein affinities and interactions is essential, significant challenges arise due to the dynamic nature of these complexes. Molecular dynamics (MD) simulations can provide valuable insights into the nature of these interactions, e.g., by considering an ensemble of bound conformations to compute thermodynamically accurate energies.8 This is usually done by simulating the complexes at thermodynamic equilibrium and accounting for the time spent in the various microstates. MD therefore has the potential to connect the chemical world to physical observables, aiding in the determination of state variables (free energy, enthalpy, entropy, …), kinetics, and the exploration of biomolecular mechanisms driven by rare events.9 For instance, the ligand Gaussian accelerated MD (LiGaMD) method, an enhanced sampling technique pioneered by Miao et al.,10 was employed to forecast the binding affinity of nirmatrelvir with the coronavirus 3C-like protease, yielding predictions consistent with experimental observations.11,12 Likewise, Wolf et al.13 harnessed Langevin simulations, an extended MD approach that delves into the intricate low-frequency motions governing large conformational shifts,14 to estimate the binding affinity of the benzamidine–trypsin complex. However, both standard and biased MD methods require significant computational power, making these techniques unsuited for high-throughput screening purposes. Consequently, faster, albeit less accurate, methods such as molecular docking and machine learning (ML) approaches have been proposed as alternatives.
Molecular docking methods generate bound conformations of a ligand within a rigid binding pocket and then rank the poses using a scoring function, both to identify the most probable pose and to estimate the binding affinity.15 Despite its simplicity, docking has shown great potential for the identification of active molecules from vast backgrounds of inactive compounds,16,17 with its impact extending across numerous therapeutic areas. Manglik et al., for example, docked over 3 million molecules against the μ-opioid receptor (μOR), leading to the discovery of PZM21, a G protein-biased μOR agonist.18 Zernov et al. discovered a compound targeting the transient receptor potential cation channel 6 as a potential starting point to develop anti-Alzheimer's therapies, with in vitro studies confirming its efficacy, stability, and target specificity without adverse effects.19 Stein et al. employed docking to screen over 150 million molecules targeting melatonin receptor 1 (MT1) in the search for therapeutics addressing sleep disorders and depression, reporting a novel chemotype with experimentally validated, selective MT1 agonist activity.20 Fink et al. utilized large-scale docking to identify novel α2A-adrenergic receptor (α2AAR) agonists with fewer adverse effects compared to earlier treatments, as new starting points to develop nonopioid analgesics.21 These and many other studies underscore the important role of docking in advancing drug discovery.
However, several limitations remain in docking, mainly due to the approximate nature of scoring functions and the neglect of flexibility.15,22 Thus, ML methods have been introduced in the last decade to tackle molecular docking challenges.15 For example, Graph Neural Networks (GNNs) have been widely explored to characterize ligand–protein interactions.23 Several models have been proposed, such as CurvAGN,24 PIGNet,25 GenScore26 and SS-GNN,27 reporting strong correlations between predicted and experimental affinities.23,28,29 Additionally, GNNs have been applied in generative settings to replace physics-based sampling in generating and scoring ligand–protein poses, such as in DiffDock30 and MedusaGraph.31 Although these architectures have shown promising results, an increasing number of studies suggest that GNNs tend to memorize ligand and protein patterns instead of learning the physical chemistry of the interactions.23,29 Moreover, these methods generally map single-pose graphs to binding affinities, thus neglecting the full thermodynamic profile and dynamics of ligand–protein interactions, which depend on multiple conformations.23 Recent efforts have been made to consider multiple conformations in training GNNs for binding affinity prediction, such as Dynaformer, a method that encodes each MD-derived binding conformation into a graph to provide better affinity estimates.32 Notably, however, Dynaformer still relies on mapping each conformation to a single affinity value and requires costly simulations, limiting its scalability.
In this work, we introduce DockBox2 (DBX2), a GNN framework that encodes multiple docking-derived ligand–protein conformations within individual graphs, leveraging ensemble representations to jointly predict pose likelihood at the node level and binding affinity at the graph level. In a series of retrospective experiments, DBX2 demonstrated significantly improved performance in both docking and virtual screening (VS) tasks compared with physics-based and ML methods, warranting further investigation of ensemble-based ML models in computer-aided drug discovery.
Material and methods
Datasets
The DBX2 model was trained and evaluated using the PDBbind database.33 The refined set of PDBbind v2016 (4057 complexes)34 was used to train the model. PDBbind is a comprehensive and widely adopted benchmark for protein–ligand binding, and several widely used benchmark datasets, such as CASF-2016,35 are derived from this refined set. The PDBbind v2019-based hold-out test set built by Volkov et al.29 and the Runs N' Poses database from Škrinjar et al.,36 consisting of 3393 and 2600 complexes respectively, were used as external test sets. Volkov's dataset is curated to mitigate latent biases, such as structural patterns in ligands or proteins, which can favor GNN memorization rather than protein–ligand interaction learning. As highlighted in previous studies,23,29 this memorization often arises from significant redundancies between training and test sets, resulting in data leakage. The Runs N' Poses dataset is a recently developed dataset containing high-resolution protein–ligand systems released after the publication of PDBbind v2020 and the training date cutoff of several protein–ligand co-folding models (e.g., AlphaFold3,37 Chai-1,38 Protenix,39 and Boltz-1 (ref. 40)). A subset of the LIT-PCBA database41 was used to perform retrospective VS experiments.
Protein and ligand preparation
Complexes from PDBbind were prepared following the same procedure as in our previous work.42 For retrospective VS, dominant protonation and tautomerization states of small molecules were computed from the SMILES using Openeye's QUACPAC43 and converted into low-energy 3D conformations (mol2 format) using Openeye's OMEGA tool.43 The target proteins were prepared by removing redundant protein chains, along with non-essential ions, waters, and heteroatoms. The resulting protein structures were prepared using the Molecular Operating Environment (MOE) QuickPrep tool,44 to automatically add missing loops and assign reasonable conformations to residues with alternate orientations. Subsequently, protonation states were generated using the Protonate 3D tool from MOE (at pH 7.4). Finally, the structures were energy-minimized using the AMBER10:EHT forcefield implemented in MOE, and saved in pdb format.
Molecular docking and rescoring
The first DockBox package (DBX)42 was utilized to generate binding poses with AutoDock,45 Vina46 and DOCK 6 (DOCK),47 and to rescore them with their scoring functions in addition to Gnina48 and DSX.49 The DBX configuration file used for this purpose on PDBbind v2016 and the test sets is illustrated in Fig. S1; a maximum of 140 binding poses were generated for each system, 60 from AutoDock, 20 from Vina, and 60 from DOCK. For AutoDock, grid spacing was set to 0.3 Å, and the Lamarckian genetic algorithm50 was employed to generate poses. For Vina, the energy range for final poses was set to 3 kcal mol−1. In DOCK, a grid-based scoring method was applied with a spacing of 0.3 Å. All other parameters were left as default. Docking with any of the above programs was followed by energy minimization, starting with 500 steps of the steepest descent method followed by 1000 steps combining steepest descent and conjugate gradient methods. Energy minimization was performed using AmberTools 17 (ref. 51) to prevent structural clashes and ensure appropriate rescoring with different programs. Rescoring was then conducted with the AutoDock, Vina, DOCK, Gnina CNNScore, and DSX scoring functions.
Dockbox2 architecture
The DBX2 architecture is based on the GraphSAGE model,52 as shown in Fig. 1. The ensemble of poses generated by docking a given ligand–protein pair is used to construct a graph (Fig. 1A), with each node encoding an individual binding pose represented by the categorical and energetic features listed in Table 1.
Fig. 1. Architecture of DBX2. (A) Binding poses are represented as nodes. Two pose nodes are connected by an edge based on the root mean square deviation (RMSD) between them. Docking-derived energies and categorical features of each binding pose, here referred to as s1, s2, s3…, are used as node features. (B) Schematic of the DBX2 architecture; pose correctness and pKd are jointly learned as node- and graph-level tasks, respectively.
Table 1. DBX2 node features.
Features | Description |
---|---|
Instance | Docking software utilized to generate the binding pose |
Score | Docking score from original docking program |
Rescoring score (AutoDock, Vina, Dock, DSX, Gnina) | Docking score obtained by rescoring the pose with another scoring function |
Gaussian terms (gauss1_inter, gauss2_inter, gauss1_intra, gauss2_intra) | Gaussian terms of the binding pose, as provided by Vina46 |
Hydrophobic interactions (hydrophobic_inter, hydrophobic_intra) | Hydrophobic terms evaluated by Vina46 |
Hydrogen bonding (hydrogenbonding_inter, hydrogenbonding_intra) | Hydrogen bond terms evaluated by Vina46 |
Repulsion (repulsion_inter, repulsion_intra) | Repulsive Lennard-Jones energies from Vina46 |
All available scoring terms provided by Vina46 were included as node features, with the exception of the entropy term, which is determined solely by the ligand structure and therefore remains constant across different poses of the same ligand. In the constructed graph, pairwise root mean square deviation (RMSD) values are calculated between all poses. Two nodes are connected by an edge if the RMSD between the two poses is below a predefined threshold, while the RMSD value is kept as an edge feature. Graphs were generated using the create_graphs script available in the DBX2 package. In the shared layers, the DBX2 model uses the message passing (MP) framework,53 i.e., for each node i, information from its neighbors is gathered and aggregated using the symmetric mean (symmean) aggregation:
$$ m_i^{(k)} = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \left[\, s_j^{(k-1)} \,\big\Vert\, \mathrm{RMSD}_{ij} \right] \quad (1) $$
where m_i^{(k)} is the aggregated message for node i from its neighbors, s_j^{(k−1)} is the feature vector of neighbor node j at the previous layer, and RMSD_ij is the RMSD between poses i and j. Each neighbor feature vector is concatenated with the corresponding RMSD value, and the aggregation function combines these concatenated vectors to produce a single aggregated message vector. The node feature vector is then updated:
$$ s_i^{(k)} = \sigma\!\left( W_{\mathrm{self}}^{(k)} \, s_i^{(k-1)} + W_{\mathrm{neigh}}^{(k)} \, m_i^{(k)} \right) \quad (2) $$
where s_i^{(k)} is the feature vector of node i at layer k, s_i^{(k−1)} is the feature vector of node i from the previous layer k − 1, W_self^{(k)} and W_neigh^{(k)} are learnable weight matrices applied to the feature vector of the current node and to the aggregated message vector from neighbor nodes, respectively, and m_i^{(k)} is the aggregated message from the neighbors of node i. The MP layers are followed by multilayer perceptron (MLP) layers to predict pose correctness (node-level task) and the pKd/pKi (graph-level task), as illustrated in Fig. 1B. For node-level predictions, the aggregated information from the MP layers is passed to an MLP with Rectified Linear Unit (ReLU) activations in the hidden layers and a sigmoid activation in the final layer. For graph-level predictions, the aggregated information is passed to a readout layer corresponding to MeanMax pooling and then to a two-layer MLP, with a ReLU activation in the hidden layer and a linear activation in the output layer.
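To make the graph construction and the symmean message passing concrete, the following is a minimal NumPy sketch of our reading of this scheme (an illustration only, not the DBX2 implementation; all function and variable names are ours):

```python
import numpy as np

def build_edges(rmsd, cutoff):
    """Adjacency from pairwise pose RMSDs: connect two poses if their
    RMSD is at or below `cutoff`; self-loops are excluded."""
    n = rmsd.shape[0]
    return (rmsd <= cutoff) & ~np.eye(n, dtype=bool)

def symmean_update(s, rmsd, adj, W_self, W_neigh):
    """One message-passing step: each neighbor feature vector is
    concatenated with the pairwise RMSD (the edge feature), messages are
    mean-aggregated, and the node state is updated through two learnable
    weight matrices followed by a ReLU nonlinearity."""
    n, d = s.shape
    m = np.zeros((n, d + 1))  # aggregated messages, [s_j || RMSD_ij]
    for i in range(n):
        nbrs = np.where(adj[i])[0]
        if nbrs.size:
            msgs = np.hstack([s[nbrs], rmsd[i, nbrs, None]])
            m[i] = msgs.mean(axis=0)
    return np.maximum(0.0, s @ W_self.T + m @ W_neigh.T)
```

The explicit Python loop is kept for readability; a real implementation would batch the aggregation (e.g., with scatter operations over an edge list).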
Model training and evaluation
The total loss function of DBX2 consists of three components, Loss_n, Loss_g, and Loss_reg:

$$ \text{Total loss} = \mathrm{Loss}_n + w_1\,\mathrm{Loss}_g + \mathrm{Loss_{reg}} \quad (3) $$
Loss_n is the loss function for the node-level task, where the binary focal cross entropy54 is applied to each node in the batch and averaged:
$$ \mathrm{Loss}_n = -\frac{1}{N} \sum_{i=1}^{N} \alpha_t^{(i)} \left(1 - p_t^{(i)}\right)^{\gamma} \log\left(p_t^{(i)}\right) \quad (4) $$
where N is the number of nodes in the batch, γ is the focusing parameter (set to 1.0 in this study), and α_t^{(i)} is the weighting factor for each i-th sample:
$$ \alpha_t^{(i)} = \begin{cases} \alpha & \text{if pose } i \text{ is correct} \\ 1 - \alpha & \text{otherwise} \end{cases} \quad (5) $$
where α is computed as:
$$ \alpha = \frac{1}{G_t} \sum_{i=1}^{G_t} \frac{I_i}{C_i + I_i} \quad (6) $$
where G_t is the number of graphs in the training set, and C_i and I_i are the numbers of correct and incorrect poses in the i-th graph, respectively. A pose was considered correct if its RMSD from the experimental pose was 2 Å or less. p_t^{(i)} is the predicted probability output by the model for the correct class label of each i-th node:
$$ p_t^{(i)} = \begin{cases} p^{(i)} & \text{if pose } i \text{ is correct} \\ 1 - p^{(i)} & \text{otherwise} \end{cases} \quad (7) $$
where p^{(i)} is the model output for each pose.
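As a compact illustration of the node-level loss (our own sketch under the definitions above, not the DBX2 code; the per-class weighting convention for α_t is our assumption), the focal cross entropy of eqs. (4), (5) and (7) can be written as:

```python
import numpy as np

def focal_node_loss(p, y, alpha, gamma=1.0):
    """Binary focal cross entropy averaged over the N poses in a batch.
    p: predicted probability that each pose is correct (model output p^(i));
    y: ground-truth labels (1 = correct pose, 0 = incorrect pose);
    alpha: class-balance weight; gamma: focusing parameter."""
    p = np.clip(p, 1e-7, 1 - 1e-7)                 # numerical safety
    p_t = np.where(y == 1, p, 1 - p)               # eq. (7)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)   # eq. (5), assumed convention
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

Confident predictions on the true class (p_t close to 1) are down-weighted by the (1 − p_t)^γ factor, so training focuses on hard-to-classify poses.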
Loss_g and w_1 are the loss function for the graph-level task and its weight, respectively. The optimal value of w_1 was determined through hyperparameter optimization (Table S1). Loss_g corresponds to the root mean square error (RMSE):55
$$ \mathrm{Loss}_g = \sqrt{\frac{1}{G} \sum_{i=1}^{G} \left(y_i - \hat{y}_i\right)^2} \quad (8) $$
where G denotes the number of ligand–protein complexes in the batch, y_i is the experimental binding affinity of each complex, and ŷ_i is the predicted binding affinity. Minimizing Loss_g drives the model toward correctly predicting the ligand–protein affinity: all poses within a graph are processed through message passing and readout, and the result is used to predict the binding affinity. Loss_reg is the regularization loss; L2 regularization56 was used here to prevent overfitting of the model:
$$ \mathrm{Loss_{reg}} = \sum_{i=1}^{n} \theta_i^2 \quad (9) $$
where θ_i is the i-th model parameter and n is the number of model parameters. The model was trained using the traindbx2 routine (an example configuration file for traindbx2 in the INI format is provided in Fig. S2). Training was performed with a maximum of 200 epochs, and early stopping was applied by monitoring the total loss on the validation sets for 3 consecutive epochs. The model was trained with mini-batch gradient descent (batch size of 100) and the adaptive moment estimation (ADAM) optimizer with a learning rate of 5 × 10−4 and a decay rate of 0.99.
Hyperparameter optimization was performed using a grid search, considering the following hyperparameters: RMSD cutoff value to define an edge (RMSD cutoff), number of adjacent nodes to randomly sample for aggregation (nrof-neigh), and graph loss weight (w1), for a total of 30 combinations (Table S1). Training and validation sets were prepared using the split_train_val_dbx2 routine of the DBX2 package. The generated graphs were split for stratified 5-fold cross-validation, keeping a consistent distribution of protein families across all folds. Node and edge features for each graph were standardized using scikit-learn's StandardScaler.57 For node-level predictions, success rate, accuracy, and area under the curve (AUC) were used as evaluation metrics. For graph-level predictions, RMSE was used.
Model testing
Models were compared for docking and scoring tasks with other methods on the hold-out and Runs N' Poses test sets. To evaluate docking power, the success rate was computed as the ratio of top-ranked poses with an RMSD equal to or lower than a predefined threshold with respect to the experimental pose. Five different thresholds were tested: 1, 1.5, 2, 2.5 and 3 Å. For DBX2, the success rate was evaluated by considering the top-ranked poses from node-level predictions.
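The success rate described above reduces to a simple fraction; a minimal sketch (illustrative naming, not the evaluation code used in the paper):

```python
import numpy as np

def success_rate(rmsd_top_pose, threshold=2.0):
    """Fraction of systems whose top-ranked pose lies within `threshold`
    angstroms (RMSD) of the experimental structure.
    rmsd_top_pose: one RMSD value per system, for its top-ranked pose."""
    rmsd_top_pose = np.asarray(rmsd_top_pose, dtype=float)
    return float(np.mean(rmsd_top_pose <= threshold))
```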
Next, the scoring power was assessed to evaluate the model's ability to predict experimental binding affinities using linear and multiple linear regression. The correlation between experimental binding affinities and scores of the best poses from different scoring functions was analyzed through linear regression, and the R2 values were calculated. For DBX2, graph-level predictions were utilized to evaluate the correlation with experimental binding affinities. Additionally, multiple linear regression was conducted to correlate experimental binding affinities with predicted values derived from various combinations of scoring functions, as described in our previous study.42
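The linear and multiple linear regression analysis amounts to an ordinary least-squares fit of experimental affinities against one or more score columns, followed by computing R². A self-contained sketch with NumPy (illustrative names; the paper's analysis follows ref. 42):

```python
import numpy as np

def multi_linear_r2(scores, affinity):
    """Fit affinity ~ linear combination of scoring functions (ordinary
    least squares with intercept) and return the coefficient of
    determination R^2 of the fit.
    scores: (n_samples, n_functions) matrix of scores;
    affinity: (n_samples,) experimental binding affinities."""
    X = np.column_stack([scores, np.ones(len(affinity))])  # add intercept
    coef, *_ = np.linalg.lstsq(X, affinity, rcond=None)
    pred = X @ coef
    ss_res = np.sum((affinity - pred) ** 2)
    ss_tot = np.sum((affinity - np.mean(affinity)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

With a single score column this reduces to simple linear regression; with several columns it reproduces the multiple-regression combinations reported in Table 2.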
Scoring power was also evaluated using Pearson correlation coefficient and the predictive index (PI) as before.42 Proposed by Pearlman et al.,58 PI measures the reliability of a scoring function in identifying the most potent binder between two compounds. It is calculated as follows:
$$ \mathrm{PI} = \frac{\sum_{j>i} w_{ij}\, C_{ij}}{\sum_{j>i} w_{ij}} \quad (10) $$
with
$$ w_{ij} = |E_j - E_i| \quad (11) $$
$$ C_{ij} = \begin{cases} 1 & \text{if } (E_j - E_i)/(S_j - S_i) > 0 \\ -1 & \text{if } (E_j - E_i)/(S_j - S_i) < 0 \\ 0 & \text{if } S_j = S_i \end{cases} \quad (12) $$
where E_i is the experimental binding affinity of compound i, and S_i is the score of compound i. The predictive index ranges from −1 (completely wrong ranking) to 1 (perfect ranking), with 0 corresponding to random prediction. w_ij is a weighting term that emphasizes the accurate ranking of compound pairs exhibiting substantial disparities in experimental binding affinity.
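The pairwise definition of PI translates directly into code; a minimal sketch (our own, assuming affinities and scores share the same "higher is stronger" orientation):

```python
import numpy as np

def predictive_index(E, S):
    """Pearlman's predictive index over all compound pairs.
    E: experimental binding affinities; S: predicted scores, assumed to be
    on the same orientation as E (higher value = stronger binder)."""
    E, S = np.asarray(E, float), np.asarray(S, float)
    num = den = 0.0
    n = len(E)
    for i in range(n):
        for j in range(i + 1, n):
            w = abs(E[j] - E[i])          # weight: affinity difference
            dS = S[j] - S[i]
            if dS == 0.0:
                c = 0.0                   # tied scores contribute nothing
            else:
                # +1 if the pair is ranked concordantly, -1 otherwise
                c = 1.0 if (E[j] - E[i]) / dS > 0 else -1.0
            num += w * c
            den += w
    return num / den if den else 0.0
```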
Retrospective virtual screening
VS experiments were conducted on the three target proteins from the LIT-PCBA database41 that were not present in the DBX2 training set: Flap structure-specific Endonuclease 1 (FEN1, PDB id: 5FV7),59 Glucocerebrosidase (GBA, PDB id: 2XWE),60 and Mammalian Target of Rapamycin Complex 1 (MTORC1, PDB id: 5GPG).61 Initially, Vina was used to screen active-inactive sets derived from LIT-PCBA against each corresponding structure. The top 20 000 compounds based on the Vina ranking were then also docked with AutoDock against their respective targets. 80 binding poses (60 from AutoDock and 20 from Vina) were generated for each ligand–protein complex (Fig. S3). Rescoring was performed with AutoDock, Vina, DOCK, and Gnina (considering the CNNAffinity of the pose with the highest CNNScore).48 VS performances were evaluated by computing the logarithmic area under the curve (logAUC),62 enrichment factors (EF) and Boltzmann-enhanced discrimination of receiver operating characteristic (BEDROC) with adjustment parameter (α) values of 20 and 80.5, using the CROC Python package.63–65
The logAUC quantifies the performance of a VS method by assessing its ability to distinguish active compounds from decoys across the ranked list. By applying a logarithmic scale to the false positive rate axis, it places greater emphasis on the early retrieval of active compounds, which is critical in VS.
EF measures how effectively a VS method identifies active compounds within a specific fraction of the ranked list.66 EF at a given cutoff (x) is calculated from the ratio of true active compounds in the top x ranked compounds in relation to the ratio of true active compounds in the entire dataset:
$$ \mathrm{EF}(x) = \frac{n_s / N_s}{n / N} \quad (13) $$
where N is the total number of compounds in the entire dataset, N_s is the total number of compounds in the selection set (top x), n is the total number of true active compounds in the entire dataset, and n_s is the number of true active compounds in the selection set (in classification terms, n_s corresponds to the true positives, TP, within the selection, and N_s − n_s to the false positives, FP). EF was computed by considering the top 2% of the ranked compounds for each scoring function and for both graph-level and node-level predictions in DBX2 (EF2).
Normalized enrichment factor (NEF) rescales EF values into a range from 0 (bad prediction) to 1 (perfect prediction),67 with the goal of standardizing comparison across different datasets. NEF is calculated as follows:
$$ \mathrm{NEF}(x) = \frac{\mathrm{EF}(x)}{\mathrm{EF}(x)_{\max}} \quad (14) $$
with
$$ \mathrm{EF}(x)_{\max} = \frac{\min(N_s,\, n)}{N_s} \cdot \frac{N}{n} \quad (15) $$
where EF(x)max denotes the maximum enrichment factor achievable within a selection set (x). ns is the number of true active compounds in the selection set (x), N is the number of compounds in the entire dataset.
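The EF and NEF definitions above can be sketched as follows (our own illustration, not the CROC implementation used in the paper; names are ours):

```python
import numpy as np

def enrichment_factor(is_active_ranked, top_frac=0.02):
    """EF(x): active rate in the top fraction of the ranked list divided
    by the active rate in the whole dataset.
    is_active_ranked: 0/1 activity labels, sorted best score first."""
    y = np.asarray(is_active_ranked, int)
    N = len(y)
    Ns = max(1, int(round(top_frac * N)))    # size of the selection set
    ns, n = y[:Ns].sum(), y.sum()
    return (ns / Ns) / (n / N)

def normalized_ef(is_active_ranked, top_frac=0.02):
    """NEF: EF rescaled by the best EF achievable at this cutoff, i.e. the
    EF obtained when every selected compound is active."""
    y = np.asarray(is_active_ranked, int)
    N = len(y)
    Ns = max(1, int(round(top_frac * N)))
    n = y.sum()
    ef_max = (min(Ns, n) / Ns) / (n / N)     # all-actives-first upper bound
    return enrichment_factor(y, top_frac) / ef_max
```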
The BEDROC metric emphasizes the concentration of active compounds in the early part of the ranked dataset64,67 through a scaling parameter (α). This metric is defined as:
$$ \mathrm{BEDROC} = \mathrm{RIE} \times \frac{R_\alpha \sinh(\alpha/2)}{\cosh(\alpha/2) - \cosh(\alpha/2 - \alpha R_\alpha)} + \frac{1}{1 - e^{\alpha(1 - R_\alpha)}} \quad (16) $$

with

$$ \mathrm{RIE} = \frac{\dfrac{1}{n}\sum_{i=1}^{n} e^{-\alpha x_i}}{\dfrac{1}{N}\left(\dfrac{1 - e^{-\alpha}}{e^{\alpha/N} - 1}\right)} \quad (17) $$

$$ x_i = \frac{r_i}{N} \quad (18) $$

$$ R_\alpha = \frac{n}{N} \quad (19) $$
where RIE is the robust initial enhancement proposed by Sheridan et al.,68 x_i is the relative rank of active compound i, R_α is the fraction of active compounds in the dataset, and α is the scaling parameter.
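For reference, a self-contained sketch of BEDROC following the standard Truchon and Bailey formulation, which we assume the CROC package implements (this is our own illustration; the actual evaluation in this work used CROC):

```python
import numpy as np

def bedroc(ranks_of_actives, n_total, alpha=20.0):
    """BEDROC score from the 1-based ranks of the active compounds in the
    sorted list. Combines the robust initial enhancement (RIE) with an
    alpha-dependent rescaling so the score falls between ~0 and 1."""
    r = np.asarray(ranks_of_actives, float)
    n, N = len(r), float(n_total)
    Ra = n / N                                # fraction of actives
    # RIE: exponentially weighted sum over active ranks, normalized by
    # the expectation for a uniformly random ranking
    rie = (np.exp(-alpha * r / N).sum() / n) / (
        (1.0 / N) * (1.0 - np.exp(-alpha)) / (np.exp(alpha / N) - 1.0)
    )
    return rie * Ra * np.sinh(alpha / 2.0) / (
        np.cosh(alpha / 2.0) - np.cosh(alpha / 2.0 - alpha * Ra)
    ) + 1.0 / (1.0 - np.exp(alpha * (1.0 - Ra)))
```

Larger α concentrates the weight on the very top of the ranked list, which is why the stricter α = 80.5 setting rewards only very early retrieval of actives.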
We also investigated the potential of DBX2 to improve the VS performance of individual docking programs (rather than operating on pose pools derived from different software) and when using different docking setups to generate poses. The top 20 000 LIT-PCBA compounds docked and scored with Vina against FEN1 were redocked with AutoDock and Vina using several combinations of docking parameters for each program (Table S2). The resulting poses were subsequently subjected to DBX2, using the same settings as in the retrospective VS experiments. The same metrics were calculated to assess the effectiveness of DBX2 in this specific scenario.
Baseline models
We compared the DBX2 model with other methods, including physics- and ML-based docking and rescoring tools, using the following protocol:
• AutoDock, Vina, DOCK, Gnina, KarmaDock,69 RTMscore70 and DBX2 were compared both in terms of docking and scoring power, as well as for retrospective VS (CarsiDock was excluded from the VS experiments due to the computational cost).
• CarsiDock,71 DSX and DBX2 were compared for docking and scoring power.
Default settings were used for all programs. To evaluate the docking, scoring and VS capabilities of RTMscore and Gnina on the hold-out and Runs N' Poses sets, the binding poses used in DBX2 were also utilized for rescoring with these tools. For Gnina, the success rate was evaluated using CNNScore, and the scoring power was evaluated using the CNNAffinity and Minimized Affinity scores of the pose with the highest CNNScore for each system. KarmaDock and CarsiDock, both generative models, automatically generated their own protein–ligand poses and associated scores.
Results and discussion
Hyperparameter optimization
The results of hyperparameter optimization for the DBX2 model are summarized in Table S3. The best performing set of hyperparameters included an RMSD cutoff of 10 Å to define edges, a nrof-neigh of 30, and a graph-level loss weight (w1) of 0.02, yielding an average success rate of 60% in 5-fold cross validation. The model with the highest performance was retained and used in subsequent testing.
Docking and scoring power
We compared the success rate of DBX2 and other physics-based methods for the docking and rescoring tasks on the hold-out test set, as described in the Material and methods section (Fig. 2A). As expected, rescoring ensembles of docking poses with different scoring functions led to significantly improved performance due to enhanced pose sampling, as observed in previous studies.42 Noticeably, the node-level pose classification method implemented in DBX2 significantly outperformed all docking and rescoring schemes at all the tested RMSD thresholds. These findings suggest that by leveraging neighbor information via the GNN framework, DBX2 offers a significant advantage in accurately identifying native or near-native ligand binding poses compared with docking methods that score each pose independently. Fig. 2B illustrates an example of successful application of DBX2 for identifying the native pose of the potent TER-117 inhibitor bound to its target, the human glutathione S-transferase P1-1 (PDB id: 10gs).72 Additionally, we compared DBX2 against four ML-based docking methods, Gnina, KarmaDock, CarsiDock, and RTMscore, using a 2 Å cutoff on the hold-out dataset (Fig. S4A) and the Runs N' Poses dataset (Fig. 2C). Unsurprisingly, KarmaDock, CarsiDock, and RTMscore outperformed both DBX2 and Gnina on the PDBbind v2019-based hold-out test set, which was part of the PDBbind v2020 general set used to train these models.69–71 Nevertheless, DBX2 displayed encouraging performance despite the limited size of the training set (4057 complexes) compared with the other methods. Next, we performed the same comparison on the Runs N' Poses dataset, which was completely unseen by all five investigated methods during training. Moreover, we investigated the performance of docking before and after removing the Runs N' Poses protein families that overlapped with v2016 and v2020.
Notably, DBX2 demonstrated superior performance compared to all other models on the Runs N' Poses dataset, followed by Gnina, both before and after the removal of overlapping protein families (Fig. 2C). Interestingly, upon overlap removal, the success rates for RTMscore, DBX2, and Gnina experienced a slight increase. In contrast, the success rates for KarmaDock and CarsiDock slightly declined. Moreover, the impact of node count per graph on DBX2 prediction performance was further examined by generating additional graphs from PDBbind v2016 and the hold-out set with reduced node counts: 70 nodes (30 poses from AutoDock and DOCK, 10 from Vina) and 35 nodes (15 poses from AutoDock and DOCK, 5 from Vina). For each setting, the model was retrained and re-evaluated on the hold-out test set. The success rate was then compared to the default 140-node configuration. While DBX2 achieved its highest performance with the default setting, the prediction accuracy did not decline dramatically with fewer nodes (Fig. S4B). These results suggest that when generating or training with a large number of poses is challenging, DBX2 can still achieve reasonable performance using ensembles of limited size.
Fig. 2. (A) Comparison of success rates of correct pose identification on the hold-out test set between AutoDock, DOCK, Vina, DSX, and DBX2, comparing docking and rescoring strategies. Rescoring improved the performance of each docking program compared to standard docking, emphasizing the advantage of refining initial pose predictions by evaluating them with additional scoring functions. DBX2 node-level classification outperformed all the other tested methods. (B) Crystal structure of human glutathione S-transferase (PDB id: 10gs) with bound TER-117 inhibitor (cyan). The binding pose predicted by DBX2 (orange) aligns closely with the crystallographic structure, in contrast to the poses predicted as native by other docking software (grey). (C) Success rate of correct pose identification on the Runs N' Poses dataset before (light) and after (dark) removing protein families overlapping with PDBbind v2020, for DBX2, Gnina, KarmaDock, CarsiDock, and RTMscore.
Next, we evaluated the ability of the scoring functions to reproduce experimentally determined binding affinities in the hold-out test set (Table 2). Notably, DBX2 directly computes the binding affinity from an ensemble of poses, so it does not require selecting a specific docking pose as input, unlike other scoring functions. Since DOCK showed the best success rate among classical docking programs, we focused only on poses with the best DOCK scores (after rescoring) to compute binding affinities, similarly to our previous work.42 Accordingly, linear regression was performed to compare binding affinities from the hold-out dataset with the scores of the best DOCK poses using different scoring functions and their linear combinations.42 For DBX2, the affinity values for each protein–ligand complex in the hold-out dataset were predicted as graph-level tasks, i.e., as readouts of the docking pose ensembles rather than from a single pose.
Table 2. R2, Pearson correlation coefficients and predictive index values between experimental binding affinities and the scores provided by the tested scoring functions. Best values are indicated in bold.
Number of functions | Scoring function/combination | R2 | Pearson coefficient | Predictive index |
---|---|---|---|---|
1 | DBX2 | **0.38** | **0.61** | **0.79** |
1 | AutoDock | 0.20 | 0.45 | 0.45 |
1 | DOCK | 0.16 | 0.41 | 0.42 |
1 | Vina | 0.25 | 0.52 | 0.48 |
1 | DSX | 0.22 | 0.47 | 0.46 |
1 | KarmaDock | 0.03 | 0.18 | −0.79 |
1 | CarsiDock | 0.03 | 0.17 | −0.68 |
1 | RTMscore | 0.22 | 0.46 | −0.36 |
1 | Gnina CNNAffinity | 0.36 | **0.61** | 0.55 |
1 | Gnina MinimizedAffinity | 0.25 | 0.44 | 0.18 |
2 | AutoDock, Vina | 0.25 | 0.50 | 0.49 |
3 | AutoDock, Vina, DOCK | 0.18 | 0.44 | 0.43 |
3 | AutoDock, Vina, DSX | 0.23 | 0.49 | 0.48 |
4 | AutoDock, Vina, DSX, DOCK | 0.22 | 0.47 | 0.47 |
Interestingly, DBX2 exhibited the highest correlation with experimental binding affinities on the hold-out dataset, outperforming the other tested scoring functions. In contrast, DOCK, despite showing the best prediction of binding poses, had the lowest correlation (R2 = 0.16). The DBX2 scoring function also displayed a significantly higher predictive index (0.79) than other methods, indicating its potential suitability for ranking active molecules based on their binding affinities to a target of interest. Likewise, the Pearson coefficient of DBX2 (0.61) indicated good predictive power based on pharmaceutical industry standards.73 Nevertheless, the R2 value, while indicating positive correlation as well as an improvement compared with other methods, remained low (0.38), underscoring the remaining challenges in accurate thermodynamics predictions via docking-based sampling. Indeed, while our results suggest that docking pose ensembles appear to be more suitable than single poses for binding affinity predictions, they likely fail to provide a comprehensive thermodynamic picture of binding processes, due to the approximations necessary to ensure the high throughput required in docking. DBX2 also outperformed the other ML models (KarmaDock, CarsiDock, and RTMscore) in this task, despite being trained on fewer protein–ligand complexes, highlighting the challenges that these methods may face in VS due to the neglect of experimental affinities in their training.69,71 Correlation plots between experimental and computational affinities are shown in Fig. S5.
Moreover, the DBX2 scoring power on the hold-out set was compared with established methods that were trained and tested on the same splits or supersets of them. Specifically, DBX2 was compared with the graph neural network message-passing (GNN-MPNN) models from Volkov et al.29 and the Pafnucy model from Stepniewska-Dziubinska et al.74 The first class of models are GNNs mapping protein- (P), ligand- (L) and protein–ligand interaction (I) graph representations to ligand–protein affinities. The Pafnucy model is a convolutional neural network utilizing 3D convolutions over protein and ligand atoms to predict ligand–protein affinity. Notably, these models were already trained and tested on the same datasets used for DBX2 (the PDBbind v2016 dataset and the hold-out test set, respectively), as previously reported.29 The comparison of Pearson coefficients and RMSE is summarized in Table S4. Even in this case, DBX2 exhibited significantly improved performance in predicting binding affinity on the hold-out set with respect to the GNN-MPNN pure interaction (I) models from Volkov et al.29 and the Pafnucy model,74 as evident from the Pearson coefficient and RMSE values. It also achieved performance comparable to the GNN models that explicitly included protein and ligand structural information, while being based entirely on energetic representations without taking any structural information into account. This observation suggests that DBX2 could (at least partially) overcome the hidden biases causing the memorization of 2D molecular patterns that these models display, as described in the study by Volkov et al.,29 while significantly outperforming the success rate of pure interaction models.
Retrospective virtual screening
To test the VS power of DBX2 in realistic scenarios, we focused on the three LIT-PCBA targets that were not present in our training set: FEN1, GBA, and MTORC1. LIT-PCBA is a small molecule bioactivity dataset designed to mitigate biases and avoid overestimating VS performance. Derived from bioassays, it mimics the distributions of actives and potencies found in experimental screening libraries, spans diverse protein targets, and has been validated across multiple screening methods, making it suitable for both structure- and ligand-based retrospective VS experiments.41 The numbers of active and inactive compounds for each LIT-PCBA protein target at the beginning of the retrospective VS experiment and after the first round of Vina docking (with the top 20 000 molecules brought forward) are reported in Table S5.
After generating additional poses with AutoDock for the molecules retained after the Vina docking step, rescoring with different scoring functions (including DBX2) was performed, and the results were evaluated by computing the top-100 hit rate, EF2, and NEF (Fig. 3A–C). DBX2 demonstrated superior performance across all metrics on the three target proteins when compared to the other scoring functions. Surprisingly, DBX2's node-level predictions, which assess the likelihood of each binding pose being the correct one within a specific graph, consistently matched the screening power of the graph-level predictions of binding affinities. Gnina, a ML-based tool that recently demonstrated state-of-the-art performance in prospective drug discovery challenges,75,76 and the other ML-based tools (KarmaDock and RTMscore) also performed well, further validating the potential of data-driven models in VS tasks. Additionally, logAUC (Fig. 3D–F) and BEDROC (Table S6) were calculated to further assess each scoring function's ability to distinguish between active and inactive compounds. DBX2 demonstrated superior performance across both these metrics as well, suggesting robust efficacy in prioritizing active compounds throughout both the top and broader ranks of compounds. Node-level predictions showed the highest performance, followed by graph-level predictions, KarmaDock, CarsiDock, and Gnina's CNNAffinity scoring function.
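The early-recognition metrics used here follow their standard definitions: the top-N hit rate is the fraction of actives among the N best-scored compounds, EFx% compares the active rate in the top x% of the ranked list to the active rate in the whole library, and NEF normalizes EF by the best value achievable for that library. A hedged sketch (standard formulas, not the authors' evaluation code) is:

```python
def enrichment_metrics(labels_ranked, top_pct=2.0, top_n=100):
    """Early-recognition metrics from a list of binary activity labels
    (1 = active) sorted by decreasing score.
    Returns (top-N hit rate, EF at top_pct%, NEF at top_pct%)."""
    n = len(labels_ranked)
    n_actives = sum(labels_ranked)
    k = max(1, int(round(n * top_pct / 100)))     # size of the top x% slice
    hits_k = sum(labels_ranked[:k])               # actives recovered in top x%
    hit_rate = sum(labels_ranked[:top_n]) / min(top_n, n)
    ef = (hits_k / k) / (n_actives / n)           # enrichment over random
    ef_max = (min(k, n_actives) / k) / (n_actives / n)  # best achievable EF
    nef = ef / ef_max                             # normalized EF in (0, 1]
    return hit_rate, ef, nef
```

For example, a 100-compound library with 2 actives both ranked first gives EF2 = 50 and NEF = 1, the ideal outcome for that composition.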
Fig. 3. Retrospective VS results of different scoring functions on three proteins from the LIT-PCBA database, (A) top-100 hit rate (B) EF2 (C) NEF, illustrating the significant performances of DBX2 node- and graph-level scores across different targets. LogAUC plots computed for (D) Flap structure-specific Endonuclease 1 (FEN1), (E) Glucocerebrosidase (GBA), and (F) Mechanistic Target of Rapamycin (MTORC1) confirmed the promising performance of the two DBX2 scores.
Lastly, since the use of multiple programs may be computationally expensive in large-scale screens, we investigated the ability of DBX2 to enhance the VS performance of single docking programs. Focusing on FEN1 as the target, we used DBX2 to rescore the top 20 000 Vina-scored molecules from LIT-PCBA, computing the top-100 hit rate, EF2, and NEF metrics as well as logAUC before and after the application of DBX2 (Fig. S6 and S7). The results clearly indicated that, also in this case, DBX2 significantly improved upon both AutoDock and Vina outcomes across different sets of docking parameters.
Conclusions
We introduced DBX2, a novel GNN framework that represents computational ensembles of small molecule-protein conformations as single graphs to jointly predict binding modes and affinities. The model relies solely on simple energetic features derived directly from docking, thus requiring no additional costly sampling steps. We comprehensively evaluated DBX2 across various metrics for docking and VS tasks, underscoring its effectiveness as a robust tool with superior performance compared to conventional scoring functions and ML models relying on single poses. At the same time, some caveats associated with the newly proposed ensemble-based method emerged, especially the relatively poor correlation between graph-level predictions and experimental binding affinities. We reasoned that these constraints can be ascribed to the limitations of the data generating process, i.e., docking, both in sampling the free energy landscape of binding and in estimating the binding energy contributions used as features. Nevertheless, the performance observed for DBX2 not only advocates for its adoption in prospective high-throughput VS campaigns but also encourages further exploration of ML models learning from computationally generated ensembles, which can represent the thermodynamics of binding better than single poses. In this context, an exciting avenue for further investigation could be the adaptation of the DBX2 architecture to MD-derived conformational ensembles of small molecule-protein complexes, to take into consideration protein flexibility, induced fit effects, solvation, and overall equilibrium ensembles.
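To make the ensemble-as-graph idea concrete, the construction can be pictured as follows: each docking pose becomes a node carrying its energy-derived feature vector, and edges connect geometrically similar poses. The feature choice and the RMSD cutoff below are purely illustrative assumptions, not DBX2's exact recipe:

```python
def build_pose_graph(energy_features, rmsd, cutoff=2.0):
    """Illustrative pose-ensemble graph: node i is a docking pose with
    feature vector energy_features[i]; an undirected edge links poses
    i and j whose pairwise RMSD (rmsd[i][j], in angstroms) falls below
    `cutoff`. Both the features and the 2 A cutoff are hypothetical."""
    n = len(energy_features)
    edges = [(i, j)
             for i in range(n)
             for j in range(i + 1, n)
             if rmsd[i][j] < cutoff]
    return {"node_features": energy_features, "edges": edges}
```

In a framework of this kind, node-level outputs (pose likelihoods) and a graph-level readout (binding affinity) can then be trained jointly over the same message-passing backbone.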
Author contributions
TT: methodology, data curation, investigation, formal analysis, writing – original draft, writing – review & editing. JP: conceptualization, data curation, software, writing – original draft. FG: conceptualization, funding acquisition, supervision, writing – original draft, writing – review & editing.
Conflicts of interest
The authors declare no conflict of interest.
Acknowledgments
This work was supported by a Natural Sciences and Engineering Research Council of Canada Discovery Grant (RGPIN-2023-04129) and a uOttawa start-up grant awarded to FG. Computations were performed on the resources of the Digital Research Alliance of Canada (RRG ID 4879 awarded to FG) and the University of Ottawa's Wooki supercomputing cluster. We thank Cadence Molecular Sciences for providing an academic license for the OpenEye suite.
Data availability
The DBX2 code is available at https://github.com/jp43/DockBox2. Trained models and training data are available at https://doi.org/10.5281/zenodo.14181651.
Supplementary information is available. See DOI: https://doi.org/10.1039/d4sc07875f.
References
- Zhou S.-F. Zhong W.-Z. Drug Design and Discovery: Principles and Applications. Molecules. 2017;22(2):279. doi: 10.3390/molecules22020279. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zeng X. Li S.-J. Lv S.-Q. Wen M.-L. Li Y. A comprehensive review of the recent advances on predicting drug-target affinity based on deep learning. Front. Pharmacol. 2024;15 doi: 10.3389/fphar.2024.1375522. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Du X. et al., Insights into Protein–Ligand Interactions: Mechanisms, Models, and Methods. Int. J. Mol. Sci. 2016;17(2):144. doi: 10.3390/ijms17020144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Newman D. J. Cragg G. M. Natural Products as Sources of New Drugs over the Nearly Four Decades from 01/1981 to 09/2019. J. Nat. Prod. 2020;83(3):770–803. doi: 10.1021/acs.jnatprod.9b01285. [DOI] [PubMed] [Google Scholar]
- Takebe T. Imai R. Ono S. The Current Status of Drug Discovery and Development as Originated in United States Academia: The Influence of Industrial and Academic Collaboration on Drug Discovery and Development. Clin. Transl. Sci. 2018;11(6):597–606. doi: 10.1111/cts.12577. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kuan J. Radaeva M. Avenido A. Cherkasov A. Gentile F. Keeping pace with the explosive growth of chemical libraries with structure-based virtual screening. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2023;13(6):e1678. doi: 10.1002/wcms.1678. [DOI] [Google Scholar]
- Shaker B. Ahmad S. Lee J. Jung C. Na D. In silico methods and tools for drug discovery. Comput. Biol. Med. 2021;137:104851. doi: 10.1016/j.compbiomed.2021.104851. [DOI] [PubMed] [Google Scholar]
- De Vivo M. Masetti M. Bottegoni G. Cavalli A. Role of Molecular Dynamics and Related Methods in Drug Discovery. J. Med. Chem. 2016;59(9):4035–4061. doi: 10.1021/acs.jmedchem.5b01684. [DOI] [PubMed] [Google Scholar]
- Decherchi S. Cavalli A. Thermodynamics and Kinetics of Drug-Target Binding by Molecular Simulation. Chem. Rev. 2020;120(23):12788–12833. doi: 10.1021/acs.chemrev.0c00534. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Miao Y. Bhattarai A. Wang J. Ligand Gaussian Accelerated Molecular Dynamics (LiGaMD): Characterization of Ligand Binding Thermodynamics and Kinetics. J. Chem. Theory Comput. 2020;16(9):5526–5547. doi: 10.1021/acs.jctc.0c00395. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang Y.-T. et al., Structural insights into Nirmatrelvir (PF-07321332)-3C-like SARS-CoV-2 protease complexation: a ligand Gaussian accelerated molecular dynamics study. Phys. Chem. Chem. Phys. 2022;24(37):22898–22904. doi: 10.1039/D2CP02882D. [DOI] [PubMed] [Google Scholar]
- Kneller D. W. et al., Covalent narlaprevir- and boceprevir-derived hybrid inhibitors of SARS-CoV-2 main protease. Nat. Commun. 2022;13(1):2268. doi: 10.1038/s41467-022-29915-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wolf S. Lickert B. Bray S. Stock G. Multisecond ligand dissociation dynamics from atomistic simulations. Nat. Commun. 2020;11(1):2918. doi: 10.1038/s41467-020-16655-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Paquet E. Viktor H. L. Molecular Dynamics, Monte Carlo Simulations, and Langevin Dynamics: A Computational Review. BioMed Res. Int. 2015;2015:1–18. doi: 10.1155/2015/183918. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Crampon K. Giorkallos A. Deldossi M. Baud S. Steffenel L. A. Machine-learning methods for ligand–protein molecular docking. Drug Discovery Today. 2022;27(1):151–164. doi: 10.1016/j.drudis.2021.09.007. [DOI] [PubMed] [Google Scholar]
- Liu F. et al., Large library docking identifies positive allosteric modulators of the calcium-sensing receptor. Science. 2024;385(6715):eado1868. doi: 10.1126/science.ado1868. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lyu J. et al., Ultra-large library docking for discovering new chemotypes. Nature. 2019;566(7743):224–229. doi: 10.1038/s41586-019-0917-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Manglik A. et al., Structure-based discovery of opioid analgesics with reduced side effects. Nature. 2016;537(7619):185–190. doi: 10.1038/nature19112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zernov N. Ghamaryan V. Melenteva D. Makichyan A. Hunanyan L. Popugaeva E. Discovery of a novel piperazine derivative, cmp2: a selective TRPC6 activator suitable for treatment of synaptic deficiency in Alzheimer's disease hippocampal neurons. Sci. Rep. 2024;14(1):23512. doi: 10.1038/s41598-024-73849-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stein R. M. et al., Virtual discovery of melatonin receptor ligands to modulate circadian rhythms. Nature. 2020;579(7800):609–614. doi: 10.1038/s41586-020-2027-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fink E. A. et al., Structure-based discovery of nonopioid analgesics acting through the α2A-adrenergic receptor. Science. 2022;377(6614):eabn7065. doi: 10.1126/science.abn7065. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Elokely K. M. Doerksen R. J. Docking Challenge: Protein Sampling and Molecular Docking Performance. J. Chem. Inf. Model. 2013;53(8):1934–1945. doi: 10.1021/ci400040d. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mastropietro A. Pasculli G. Bajorath J. Learning characteristics of graph neural networks predicting protein–ligand affinities. Nature Machine Intelligence. 2023;5(12):1427–1436. doi: 10.1038/s42256-023-00756-9. [DOI] [Google Scholar]
- Wu J. Chen H. Cheng M. Xiong H. CurvAGN: Curvature-based Adaptive Graph Neural Networks for Predicting Protein-Ligand Binding Affinity. BMC Bioinf. 2023;24(1):378. doi: 10.1186/s12859-023-05503-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Moon S. Zhung W. Yang S. Lim J. Kim W. Y. PIGNet: a physics-informed deep learning model toward generalized drug–target interaction predictions. Chem. Sci. 2022;13(13):3661–3673. doi: 10.1039/D1SC06946B. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shen C. et al., A generalized protein–ligand scoring framework with balanced scoring, docking, ranking and screening powers. Chem. Sci. 2023;14(30):8129–8146. doi: 10.1039/D3SC02044D. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang S. et al., SS-GNN: A Simple-Structured Graph Neural Network for Affinity Prediction. ACS Omega. 2023;8(25):22496–22507. doi: 10.1021/acsomega.3c00085. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shen H. Zhang Y. Zheng C. Wang B. Chen P. A Cascade Graph Convolutional Network for Predicting Protein–Ligand Binding Affinity. Int. J. Mol. Sci. 2021;22(8):4023. doi: 10.3390/ijms22084023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Volkov M. et al., On the Frustration to Predict Binding Affinities from Protein–Ligand Structures with Deep Neural Networks. J. Med. Chem. 2022;65(11):7946–7958. doi: 10.1021/acs.jmedchem.2c00487. [DOI] [PubMed] [Google Scholar]
- Corso G., Stärk H., Jing B., Barzilay R. and Jaakkola T., DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking, arXiv, 2022, preprint, arXiv:2210.01776, 10.48550/ARXIV.2210.01776 [DOI]
- Jiang H. et al., Predicting Protein–Ligand Docking Structure with Graph Neural Network. J. Chem. Inf. Model. 2022;62(12):2923–2932. doi: 10.1021/acs.jcim.2c00127. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Min Y., et al., From Static to Dynamic Structures: Improving Binding Affinity Prediction with Graph-Based Deep Learning, arXiv, 2022, preprint, arXiv:2208.10230, 10.48550/ARXIV.2208.10230 [DOI] [PMC free article] [PubMed]
- Wang R. Fang X. Lu Y. Wang S. The PDBbind Database: Collection of Binding Affinities for Protein–Ligand Complexes with Known Three-Dimensional Structures. J. Med. Chem. 2004;47(12):2977–2980. doi: 10.1021/jm030580l. [DOI] [PubMed] [Google Scholar]
- Liu Z. et al., Forging the Basis for Developing Protein–Ligand Interaction Scoring Functions. Acc. Chem. Res. 2017;50(2):302–309. doi: 10.1021/acs.accounts.6b00491. [DOI] [PubMed] [Google Scholar]
- Su M. et al., Comparative Assessment of Scoring Functions: The CASF-2016 Update. J. Chem. Inf. Model. 2019;59(2):895–913. doi: 10.1021/acs.jcim.8b00545. [DOI] [PubMed] [Google Scholar]
- Škrinjar P., Eberhardt J., Durairaj J. and Schwede T., Have protein-ligand co-folding methods moved beyond memorisation?, bioRxiv, 2025, preprint, 10.1101/2025.02.03.636309 [DOI]
- Abramson J. et al., Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. doi: 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Boitreaud J., et al., Chai-1: decoding the molecular interactions of life, bioRxiv, 2024, preprint, 10.1101/2024.10.10.615955 [DOI]
- Chen X., et al., Protenix - Advancing Structure Prediction through a Comprehensive AlphaFold3 Reproduction, bioRxiv, 2025, preprint, 10.1101/2025.01.08.631967 [DOI]
- Wohlwend J., et al., Boltz-1: Democratizing Biomolecular Interaction Modeling, bioRxiv, 2024, preprint, 10.1101/2024.11.19.624167 [DOI]
- Tran-Nguyen V.-K. Jacquemard C. Rognan D. LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening. J. Chem. Inf. Model. 2020;60(9):4263–4273. doi: 10.1021/acs.jcim.0c00155. [DOI] [PubMed] [Google Scholar]
- Preto J. Gentile F. Assessing and improving the performance of consensus docking strategies using the DockBox package. J. Comput.-Aided Mol. Des. 2019;33(9):817–829. doi: 10.1007/s10822-019-00227-7. [DOI] [PubMed] [Google Scholar]
- OpenEye, OpenEye Toolkits, Cadence Molecular Sciences, Santa Fe, NM, available: http://www.eyesopen.com [Google Scholar]
- Molecular Operating Environment (MOE), Chemical Computing Group ULC, Montreal, QC [Google Scholar]
- Morris G. M. et al., AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 2009;30(16):2785–2791. doi: 10.1002/jcc.21256. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Trott O. Olson A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 2010;31(2):455–461. doi: 10.1002/jcc.21334. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Balius T. E. Mukherjee S. Rizzo R. C. Implementation and evaluation of a docking-rescoring method using molecular footprint comparisons. J. Comput. Chem. 2011;32(10):2273–2289. doi: 10.1002/jcc.21814. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McNutt A. T. et al., GNINA 1.0: molecular docking with deep learning. J. Cheminf. 2021;13(1):43. doi: 10.1186/s13321-021-00522-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Neudert G. Klebe G. DSX: A Knowledge-Based Scoring Function for the Assessment of Protein–Ligand Complexes. J. Chem. Inf. Model. 2011;51(10):2731–2745. doi: 10.1021/ci200274q. [DOI] [PubMed] [Google Scholar]
- Morris G. M. et al., Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J. Comput. Chem. 1998;19(14):1639–1662. doi: 10.1002/(SICI)1096-987X(19981115)19:14<1639::AID-JCC10>3.0.CO;2-B. [DOI] [Google Scholar]
- Salomon-Ferrer R. Case D. A. Walker R. C. An overview of the Amber biomolecular simulation package. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2013;3(2):198–210. doi: 10.1002/wcms.1121. [DOI] [Google Scholar]
- Hamilton W. L., Ying R. and Leskovec J., Inductive Representation Learning on Large Graphs, arXiv, 2018, preprint, arXiv:1706.02216, 10.48550/arXiv.1706.02216, accessed: October 01, 2024 [DOI]
- Duvenaud D., et al., Convolutional Networks on Graphs for Learning Molecular Fingerprints, arXiv, 2015, preprint, arXiv:1509.09292, 10.48550/arXiv.1509.09292, accessed: October 01, 2024 [DOI]
- Lin T.-Y., Goyal P., Girshick R., He K. and Dollár P., Focal Loss for Dense Object Detection, arXiv, 2018, preprint, arXiv:1708.02002, 10.48550/arXiv.1708.02002, accessed: October 19, 2024 [DOI] [PubMed]
- Zhang X. Li Y. Wang J. Xu G. Gu Y. A Multi-perspective Model for Protein–Ligand-Binding Affinity Prediction. Interdiscip. Sci.: Comput. Life Sci. 2023;15(4):696–709. doi: 10.1007/s12539-023-00582-y. [DOI] [PubMed] [Google Scholar]
- Cortes C., Mohri M. and Rostamizadeh A., L2 Regularization for Learning Kernels, arXiv, 2012, preprint, arXiv:1205.2653, 10.48550/arXiv.1205.2653, accessed: October 20, 2024 [DOI]
- Pedregosa F. et al., Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830. [Google Scholar]
- Pearlman D. A. Charifson P. S. Are Free Energy Calculations Useful in Practice? A Comparison with Rapid Scoring Functions for the p38 MAP Kinase Protein System. J. Med. Chem. 2001;44(21):3417–3423. doi: 10.1021/jm0100279. [DOI] [PubMed] [Google Scholar]
- Exell J. C. et al., Cellularly active N-hydroxyurea FEN1 inhibitors block substrate entry to the active site. Nat. Chem. Biol. 2016;12(10):815–821. doi: 10.1038/nchembio.2148. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Brumshtein B. et al., Cyclodextrin-mediated crystallization of acid β-glucosidase in complex with amphiphilic bicyclic nojirimycin analogues. Org. Biomol. Chem. 2011;9(11):4160. doi: 10.1039/c1ob05200d. [DOI] [PubMed] [Google Scholar]
- Lee S.-Y. et al., Proximity-Directed Labeling Reveals a New Rapamycin-Induced Heterodimer of FKBP25 and FRB in Live Cells. ACS Cent. Sci. 2016;2(8):506–516. doi: 10.1021/acscentsci.6b00137. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Palacio-Rodríguez K. Lans I. Cavasotto C. N. Cossio P. Exponential consensus ranking improves the outcome in docking and receptor ensemble docking. Sci. Rep. 2019;9(1):5142. doi: 10.1038/s41598-019-41594-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Truchon J.-F. Bayly C. I. Evaluating Virtual Screening Methods: Good and Bad Metrics for the ‘Early Recognition’ Problem. J. Chem. Inf. Model. 2007;47(2):488–508. doi: 10.1021/ci600426e. [DOI] [PubMed] [Google Scholar]
- Perez-Castillo Y. et al., Fusing Docking Scoring Functions Improves the Virtual Screening Performance for Discovering Parkinson’s Disease Dual Target Ligands. Curr. Neuropharmacol. 2017;15(8):1107–1116. doi: 10.2174/1570159X15666170109143757. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xiong G.-L. Ye W.-L. Shen C. Lu A.-P. Hou T.-J. Cao D.-S. Improving structure-based virtual screening performance via learning from scoring function components. Briefings Bioinf. 2021;22(3):bbaa094. doi: 10.1093/bib/bbaa094. [DOI] [PubMed] [Google Scholar]
- Lopes J. C. D. Dos Santos F. M. Martins-José A. Augustyns K. De Winter H. The power metric: a new statistically robust enrichment-type metric for virtual screening applications with early recovery capability. J. Cheminf. 2017;9(1):7. doi: 10.1186/s13321-016-0189-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu S. et al., Practical Model Selection for Prospective Virtual Screening. J. Chem. Inf. Model. 2019;59(1):282–293. doi: 10.1021/acs.jcim.8b00363. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sheridan R. P. Singh S. B. Fluder E. M. Kearsley S. K. Protocols for Bridging the Peptide to Nonpeptide Gap in Topological Similarity Searches. J. Chem. Inf. Comput. Sci. 2001;41(5):1395–1406. doi: 10.1021/ci0100144. [DOI] [PubMed] [Google Scholar]
- Zhang X. et al., Efficient and accurate large library ligand docking with KarmaDock. Nat. Comput. Sci. 2023;3(9):789–804. doi: 10.1038/s43588-023-00511-5. [DOI] [PubMed] [Google Scholar]
- Shen C. et al., Boosting Protein–Ligand Binding Pose Prediction and Virtual Screening Based on Residue–Atom Distance Likelihood Potential and Graph Transformer. J. Med. Chem. 2022;65(15):10691–10706. doi: 10.1021/acs.jmedchem.2c00991. [DOI] [PubMed] [Google Scholar]
- Cai H. et al., CarsiDock: a deep learning paradigm for accurate protein–ligand docking and screening based on large-scale pre-training. Chem. Sci. 2024;15(4):1449–1471. doi: 10.1039/D3SC05552C. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Oakley A. J. et al., The structures of human glutathione transferase P1-1 in complex with glutathione and various inhibitors at high resolution. J. Mol. Biol. 1997;274(1):84–100. doi: 10.1006/jmbi.1997.1364. [DOI] [PubMed] [Google Scholar]
- Martin E. J. Polyakov V. R. Zhu X.-W. Tian L. Mukherjee P. Liu X. All-Assay-Max2 pQSAR: Activity Predictions as Accurate as Four-Concentration IC50s for 8558 Novartis Assays. J. Chem. Inf. Model. 2019;59(10):4450–4459. doi: 10.1021/acs.jcim.9b00375. [DOI] [PubMed] [Google Scholar]
- Stepniewska-Dziubinska M. M. Zielenkiewicz P. Siedlecki P. Development and evaluation of a deep learning model for protein–ligand binding affinity prediction. Bioinformatics. 2018;34(21):3666–3674. doi: 10.1093/bioinformatics/bty374. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li F. et al., CACHE Challenge #1: targeting the WDR domain of LRRK2, a Parkinson’s disease associated protein. J. Chem. Inf. Model. 2024;64:8521–8536. doi: 10.1021/acs.jcim.4c01267. [DOI] [PubMed] [Google Scholar]
- Dunn I. Pirhadi S. Wang Y. Ravindran S. Concepcion C. Koes D. R. CACHE Challenge #1: Docking with GNINA Is All You Need. J. Chem. Inf. Model. 2024;64(24):9388–9396. doi: 10.1021/acs.jcim.4c01429. [DOI] [PMC free article] [PubMed] [Google Scholar]